* Re: [RFC PATCH] vfio: VFIO Driver core framework
From: Aaron Fabbri @ 2011-11-09 4:17 UTC (permalink / raw)
To: Alex Williamson, chrisw, aik, pmac, dwg, joerg.roedel, agraf,
benve, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
linux-pci
I'm going to send out chunks of comments as I go over this stuff. Below
I've covered the documentation file and vfio_iommu.c. More comments coming
soon...
On 11/3/11 1:12 PM, "Alex Williamson" <alex.williamson@redhat.com> wrote:
> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects. See Documentation/vfio.txt included in
> this patch for user and kernel API description.
>
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the driver version 0.2. It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
<snip>
> +
> +Groups, Devices, IOMMUs, oh my
> +-----------------------------------------------------------------------------
> --
> +
> +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> +can't always distinguish transactions from each individual device in
> +the system. Sometimes this is because of the IOMMU design, such as with
> +PEs, other times it's caused by the I/O topology, for instance a
Can you define this acronym the first time you use it, i.e.
+ PEs (partitionable endpoints), ...
> +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> +devices created by these restictions IOMMU groups (or just "groups" for
restrictions
> +this document).
> +
> +The IOMMU cannot distiguish transactions between the individual devices
distinguish
> +within the group, therefore the group is the basic unit of ownership for
> +a userspace process. Because of this, groups are also the primary
> +interface to both devices and IOMMU domains in VFIO.
> +
<snip>
> +file descriptor referencing the same internal IOMMU object from either
> +X or Y). Merged groups can be dissolved either explictly with UNMERGE
explicitly
<snip>
> +
> +Device tree devices also invlude ioctls for further defining the
include
<snip>
> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 0000000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c
<snip>
> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> + dma_addr_t start, size_t size)
> +{
> + struct list_head *pos;
> + struct dma_map_page *mlp;
> +
> + list_for_each(pos, &iommu->dm_list) {
> + mlp = list_entry(pos, struct dma_map_page, list);
> + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> + start, size))
> + return mlp;
> + }
> + return NULL;
> +}
> +
This function below should be static.
> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> + size_t size, struct dma_map_page *mlp)
> +{
> + struct dma_map_page *split;
> + int npage_lo, npage_hi;
> +
> + /* Existing dma region is completely covered, unmap all */
> + if (start <= mlp->daddr &&
> + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> + list_del(&mlp->list);
> + npage_lo = mlp->npage;
> + kfree(mlp);
> + return npage_lo;
> + }
> +
> + /* Overlap low address of existing range */
> + if (start <= mlp->daddr) {
> + size_t overlap;
> +
> + overlap = start + size - mlp->daddr;
> + npage_lo = overlap >> PAGE_SHIFT;
> + npage_hi = mlp->npage - npage_lo;
npage_hi is not used. Delete this line ^
> +
> + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> + mlp->daddr += overlap;
> + mlp->vaddr += overlap;
> + mlp->npage -= npage_lo;
> + return npage_lo;
> + }
> +
> + /* Overlap high address of existing range */
> + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> + size_t overlap;
> +
> + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> + npage_hi = overlap >> PAGE_SHIFT;
> + npage_lo = mlp->npage - npage_hi;
> +
> + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> + mlp->npage -= npage_hi;
> + return npage_hi;
> + }
> +
> + /* Split existing */
> + npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> + npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> +
> + split = kzalloc(sizeof *split, GFP_KERNEL);
> + if (!split)
> + return -ENOMEM;
> +
> + vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> +
> + mlp->npage = npage_lo;
> +
> + split->npage = npage_hi;
> + split->daddr = start + size;
> + split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> + split->rdwr = mlp->rdwr;
> + list_add(&split->list, &iommu->dm_list);
> + return size >> PAGE_SHIFT;
> +}
> +
Function should be static.
> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> + int ret = 0;
> + size_t npage = dmp->size >> PAGE_SHIFT;
> + struct list_head *pos, *n;
> +
> + if (dmp->dmaaddr & ~PAGE_MASK)
> + return -EINVAL;
> + if (dmp->size & ~PAGE_MASK)
> + return -EINVAL;
> +
> + mutex_lock(&iommu->dgate);
> +
> + list_for_each_safe(pos, n, &iommu->dm_list) {
> + struct dma_map_page *mlp;
> +
> + mlp = list_entry(pos, struct dma_map_page, list);
> + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> + dmp->dmaaddr, dmp->size)) {
> + ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> + dmp->size, mlp);
> + if (ret > 0)
> + npage -= NPAGE_TO_SIZE(ret);
Why NPAGE_TO_SIZE here?
> + if (ret < 0 || npage == 0)
> + break;
> + }
> + }
> + mutex_unlock(&iommu->dgate);
> + return ret > 0 ? 0 : ret;
> +}
> +
Function should be static.
> +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> + int npage;
> + struct dma_map_page *mlp, *mmlp = NULL;
> + dma_addr_t daddr = dmp->dmaaddr;
* Re: [RFC PATCH] vfio: VFIO Driver core framework
From: Alex Williamson @ 2011-11-09 4:41 UTC (permalink / raw)
To: Aaron Fabbri
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Tue, 2011-11-08 at 20:17 -0800, Aaron Fabbri wrote:
> I'm going to send out chunks of comments as I go over this stuff. Below
> I've covered the documentation file and vfio_iommu.c. More comments coming
> soon...
>
> On 11/3/11 1:12 PM, "Alex Williamson" <alex.williamson@redhat.com> wrote:
>
> > VFIO provides a secure, IOMMU based interface for user space
> > drivers, including device assignment to virtual machines.
> > This provides the base management of IOMMU groups, devices,
> > and IOMMU objects. See Documentation/vfio.txt included in
> > this patch for user and kernel API description.
> >
> > Note, this implements the new API discussed at KVM Forum
> > 2011, as represented by the driver version 0.2. It's hoped
> > that this provides a modular enough interface to support PCI
> > and non-PCI userspace drivers across various architectures
> > and IOMMU implementations.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> <snip>
> > +
> > +Groups, Devices, IOMMUs, oh my
> > +-----------------------------------------------------------------------------
> > --
> > +
> > +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> > +can't always distinguish transactions from each individual device in
> > +the system. Sometimes this is because of the IOMMU design, such as with
> > +PEs, other times it's caused by the I/O topology, for instance a
>
> Can you define this acronym the first time you use it, i.e.
>
> + PEs (partitionable endpoints), ...
It was actually up in the <snip>:
... POWER systems with Partitionable Endpoints (PEs) ...
I tried to make sure I defined them, but let me know if anything else is
missing/non-obvious.
> > +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> > +devices created by these restictions IOMMU groups (or just "groups" for
>
> restrictions
Ugh, lost w/o a spell checker. Fixed all these.
> > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > new file mode 100644
> > index 0000000..029dae3
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu.c
> <snip>
> > +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> > + dma_addr_t start, size_t size)
> > +{
> > + struct list_head *pos;
> > + struct dma_map_page *mlp;
> > +
> > + list_for_each(pos, &iommu->dm_list) {
> > + mlp = list_entry(pos, struct dma_map_page, list);
> > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > + start, size))
> > + return mlp;
> > + }
> > + return NULL;
> > +}
> > +
>
> This function below should be static.
Fixed
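i.e. simply (same signature, just marked static):

        static int vfio_remove_dma_overlap(struct vfio_iommu *iommu,
                                           dma_addr_t start, size_t size,
                                           struct dma_map_page *mlp)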
> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> > + size_t size, struct dma_map_page *mlp)
> > +{
> > + struct dma_map_page *split;
> > + int npage_lo, npage_hi;
> > +
> > + /* Existing dma region is completely covered, unmap all */
> > + if (start <= mlp->daddr &&
> > + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > + list_del(&mlp->list);
> > + npage_lo = mlp->npage;
> > + kfree(mlp);
> > + return npage_lo;
> > + }
> > +
> > + /* Overlap low address of existing range */
> > + if (start <= mlp->daddr) {
> > + size_t overlap;
> > +
> > + overlap = start + size - mlp->daddr;
> > + npage_lo = overlap >> PAGE_SHIFT;
> > + npage_hi = mlp->npage - npage_lo;
>
> npage_hi is not used. Delete this line ^
Yep, and npage_lo in the next block. I was setting them just for
symmetry, but they can be removed now.
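Roughly like this for the low-overlap branch, with the unused npage_hi
assignment simply dropped (untested sketch):

        /* Overlap low address of existing range */
        if (start <= mlp->daddr) {
                size_t overlap;

                overlap = start + size - mlp->daddr;
                npage_lo = overlap >> PAGE_SHIFT;

                vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
                mlp->daddr += overlap;
                mlp->vaddr += overlap;
                mlp->npage -= npage_lo;
                return npage_lo;
        }

and likewise the unused npage_lo assignment goes away in the
high-overlap branch.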
> > +
> > + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > + mlp->daddr += overlap;
> > + mlp->vaddr += overlap;
> > + mlp->npage -= npage_lo;
> > + return npage_lo;
> > + }
> > +
> > + /* Overlap high address of existing range */
> > + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + size_t overlap;
> > +
> > + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > + npage_hi = overlap >> PAGE_SHIFT;
> > + npage_lo = mlp->npage - npage_hi;
> > +
> > + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > + mlp->npage -= npage_hi;
> > + return npage_hi;
> > + }
> > +
> > + /* Split existing */
> > + npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> > + npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> > +
> > + split = kzalloc(sizeof *split, GFP_KERNEL);
> > + if (!split)
> > + return -ENOMEM;
> > +
> > + vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> > +
> > + mlp->npage = npage_lo;
> > +
> > + split->npage = npage_hi;
> > + split->daddr = start + size;
> > + split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> > + split->rdwr = mlp->rdwr;
> > + list_add(&split->list, &iommu->dm_list);
> > + return size >> PAGE_SHIFT;
> > +}
> > +
>
> Function should be static.
Fixed
> > +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > + int ret = 0;
> > + size_t npage = dmp->size >> PAGE_SHIFT;
> > + struct list_head *pos, *n;
> > +
> > + if (dmp->dmaaddr & ~PAGE_MASK)
> > + return -EINVAL;
> > + if (dmp->size & ~PAGE_MASK)
> > + return -EINVAL;
> > +
> > + mutex_lock(&iommu->dgate);
> > +
> > + list_for_each_safe(pos, n, &iommu->dm_list) {
> > + struct dma_map_page *mlp;
> > +
> > + mlp = list_entry(pos, struct dma_map_page, list);
> > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > + dmp->dmaaddr, dmp->size)) {
> > + ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> > + dmp->size, mlp);
> > + if (ret > 0)
> > + npage -= NPAGE_TO_SIZE(ret);
>
> Why NPAGE_TO_SIZE here?
Looks like a bug, I'll change and test.
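vfio_remove_dma_overlap() already returns a page count, so presumably
the accounting should just be (untested):

                        if (ret > 0)
                                npage -= ret;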
> > + if (ret < 0 || npage == 0)
> > + break;
> > + }
> > + }
> > + mutex_unlock(&iommu->dgate);
> > + return ret > 0 ? 0 : ret;
> > +}
> > +
>
> Function should be static.
Fixed.
> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > + int npage;
> > + struct dma_map_page *mlp, *mmlp = NULL;
> > + dma_addr_t daddr = dmp->dmaaddr;
>
Thanks!
Alex
* RE: [RFC PATCH] vfio: VFIO Driver core framework
From: Christian Benvenuti (benve) @ 2011-11-09 8:11 UTC (permalink / raw)
To: Alex Williamson, chrisw, aik, pmac, dwg, joerg.roedel, agraf,
Aaron Fabbri (aafabbri), B08248, B07421, avi, konrad.wilk, kvm,
qemu-devel, iommu, linux-pci
SSBoYXZlIG5vdCBnb25lIHRocm91Z2ggdGhlIGFsbCBwYXRjaCB5ZXQsIGJ1dCBoZXJlIGFyZQ0K
bXkgZmlyc3QgY29tbWVudHMvcXVlc3Rpb25zIGFib3V0IHRoZSBjb2RlIGluIHZmaW9fbWFpbi5j
DQooYW5kIHBjaS92ZmlvX3BjaS5jKS4NCg0KPiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0K
PiBGcm9tOiBBbGV4IFdpbGxpYW1zb24gW21haWx0bzphbGV4LndpbGxpYW1zb25AcmVkaGF0LmNv
bV0NCj4gU2VudDogVGh1cnNkYXksIE5vdmVtYmVyIDAzLCAyMDExIDE6MTIgUE0NCj4gVG86IGNo
cmlzd0Bzb3VzLXNvbC5vcmc7IGFpa0BhdTEuaWJtLmNvbTsgcG1hY0BhdTEuaWJtLmNvbTsNCj4g
ZHdnQGF1MS5pYm0uY29tOyBqb2VyZy5yb2VkZWxAYW1kLmNvbTsgYWdyYWZAc3VzZS5kZTsgQ2hy
aXN0aWFuDQo+IEJlbnZlbnV0aSAoYmVudmUpOyBBYXJvbiBGYWJicmkgKGFhZmFiYnJpKTsgQjA4
MjQ4QGZyZWVzY2FsZS5jb207DQo+IEIwNzQyMUBmcmVlc2NhbGUuY29tOyBhdmlAcmVkaGF0LmNv
bTsga29ucmFkLndpbGtAb3JhY2xlLmNvbTsNCj4ga3ZtQHZnZXIua2VybmVsLm9yZzsgcWVtdS1k
ZXZlbEBub25nbnUub3JnOyBpb21tdUBsaXN0cy5saW51eC0NCj4gZm91bmRhdGlvbi5vcmc7IGxp
bnV4LXBjaUB2Z2VyLmtlcm5lbC5vcmcNCj4gU3ViamVjdDogW1JGQyBQQVRDSF0gdmZpbzogVkZJ
TyBEcml2ZXIgY29yZSBmcmFtZXdvcmsNCg0KPHNuaXA+DQoNCj4gZGlmZiAtLWdpdCBhL2RyaXZl
cnMvdmZpby92ZmlvX21haW4uYyBiL2RyaXZlcnMvdmZpby92ZmlvX21haW4uYw0KPiBuZXcgZmls
ZSBtb2RlIDEwMDY0NA0KPiBpbmRleCAwMDAwMDAwLi42MTY5MzU2DQo+IC0tLSAvZGV2L251bGwN
Cj4gKysrIGIvZHJpdmVycy92ZmlvL3ZmaW9fbWFpbi5jDQo+IEBAIC0wLDAgKzEsMTE1MSBAQA0K
PiArLyoNCj4gKyAqIFZGSU8gZnJhbWV3b3JrDQo+ICsgKg0KPiArICogQ29weXJpZ2h0IChDKSAy
MDExIFJlZCBIYXQsIEluYy4gIEFsbCByaWdodHMgcmVzZXJ2ZWQuDQo+ICsgKiAgICAgQXV0aG9y
OiBBbGV4IFdpbGxpYW1zb24gPGFsZXgud2lsbGlhbXNvbkByZWRoYXQuY29tPg0KPiArICoNCj4g
KyAqIFRoaXMgcHJvZ3JhbSBpcyBmcmVlIHNvZnR3YXJlOyB5b3UgY2FuIHJlZGlzdHJpYnV0ZSBp
dCBhbmQvb3INCj4gbW9kaWZ5DQo+ICsgKiBpdCB1bmRlciB0aGUgdGVybXMgb2YgdGhlIEdOVSBH
ZW5lcmFsIFB1YmxpYyBMaWNlbnNlIHZlcnNpb24gMiBhcw0KPiArICogcHVibGlzaGVkIGJ5IHRo
ZSBGcmVlIFNvZnR3YXJlIEZvdW5kYXRpb24uDQo+ICsgKg0KPiArICogRGVyaXZlZCBmcm9tIG9y
aWdpbmFsIHZmaW86DQo+ICsgKiBDb3B5cmlnaHQgMjAxMCBDaXNjbyBTeXN0ZW1zLCBJbmMuICBB
bGwgcmlnaHRzIHJlc2VydmVkLg0KPiArICogQXV0aG9yOiBUb20gTHlvbiwgcHVnc0BjaXNjby5j
b20NCj4gKyAqLw0KPiArDQo+ICsjaW5jbHVkZSA8bGludXgvY2Rldi5oPg0KPiArI2luY2x1ZGUg
PGxpbnV4L2NvbXBhdC5oPg0KPiArI2luY2x1ZGUgPGxpbnV4L2RldmljZS5oPg0KPiArI2luY2x1
ZGUgPGxpbnV4L2ZpbGUuaD4NCj4gKyNpbmNsdWRlIDxsaW51eC9hbm9uX2lub2Rlcy5oPg0KPiAr
I2luY2x1ZGUgPGxpbnV4L2ZzLmg+DQo+ICsjaW5jbHVkZSA8bGludXgvaWRyLmg+DQo+ICsjaW5j
bHVkZSA8bGludXgvaW9tbXUuaD4NCj4gKyNpbmNsdWRlIDxsaW51eC9tbS5oPg0KPiArI2luY2x1
ZGUgPGxpbnV4L21vZHVsZS5oPg0KPiArI2luY2x1ZGUgPGxpbnV4L3NsYWIuaD4NCj4gKyNpbmNs
dWRlIDxsaW51eC9zdHJpbmcuaD4NCj4gKyNpbmNsdWRlIDxsaW51eC91YWNjZXNzLmg+DQo+ICsj
aW5jbHVkZSA8bGludXgvdmZpby5oPg0KPiArI2luY2x1ZGUgPGxpbnV4L3dhaXQuaD4NCj4gKw0K
PiArI2luY2x1ZGUgInZmaW9fcHJpdmF0ZS5oIg0KPiArDQo+ICsjZGVmaW5lIERSSVZFUl9WRVJT
SU9OCSIwLjIiDQo+ICsjZGVmaW5lIERSSVZFUl9BVVRIT1IJIkFsZXggV2lsbGlhbXNvbiA8YWxl
eC53aWxsaWFtc29uQHJlZGhhdC5jb20+Ig0KPiArI2RlZmluZSBEUklWRVJfREVTQwkiVkZJTyAt
IFVzZXIgTGV2ZWwgbWV0YS1kcml2ZXIiDQo+ICsNCj4gK3N0YXRpYyBpbnQgYWxsb3dfdW5zYWZl
X2ludHJzOw0KPiArbW9kdWxlX3BhcmFtKGFsbG93X3Vuc2FmZV9pbnRycywgaW50LCAwKTsNCj4g
K01PRFVMRV9QQVJNX0RFU0MoYWxsb3dfdW5zYWZlX2ludHJzLA0KPiArICAgICAgICAiQWxsb3cg
dXNlIG9mIElPTU1VcyB3aGljaCBkbyBub3Qgc3VwcG9ydCBpbnRlcnJ1cHQNCj4gcmVtYXBwaW5n
Iik7DQo+ICsNCj4gK3N0YXRpYyBzdHJ1Y3QgdmZpbyB7DQo+ICsJZGV2X3QJCQlkZXZ0Ow0KPiAr
CXN0cnVjdCBjZGV2CQljZGV2Ow0KPiArCXN0cnVjdCBsaXN0X2hlYWQJZ3JvdXBfbGlzdDsNCj4g
KwlzdHJ1Y3QgbXV0ZXgJCWxvY2s7DQo+ICsJc3RydWN0IGtyZWYJCWtyZWY7DQo+ICsJc3RydWN0
IGNsYXNzCQkqY2xhc3M7DQo+ICsJc3RydWN0IGlkcgkJaWRyOw0KPiArCXdhaXRfcXVldWVfaGVh
ZF90CXJlbGVhc2VfcTsNCj4gK30gdmZpbzsNCj4gKw0KPiArc3RhdGljIGNvbnN0IHN0cnVjdCBm
aWxlX29wZXJhdGlvbnMgdmZpb19ncm91cF9mb3BzOw0KPiArZXh0ZXJuIGNvbnN0IHN0cnVjdCBm
aWxlX29wZXJhdGlvbnMgdmZpb19pb21tdV9mb3BzOw0KPiArDQo+ICtzdHJ1Y3QgdmZpb19ncm91
cCB7DQo+ICsJZGV2X3QJCQlkZXZ0Ow0KPiArCXVuc2lnbmVkIGludAkJZ3JvdXBpZDsNCg0KVGhp
cyBncm91cGlkIGlzIHJldHVybmVkIGJ5IHRoZSBkZXZpY2VfZ3JvdXAgY2FsbGJhY2sgeW91IHJl
Y2VudGx5IGFkZGVkDQp3aXRoIGEgc2VwYXJhdGUgKG5vdCB5ZXQgaW4gdHJlZSkgSU9NTVUgcGF0
Y2guDQpJcyBpdCBjb3JyZWN0IHRvIHNheSB0aGF0IHRoZSBzY29wZSBvZiB0aGlzIElEIGlzIHRo
ZSBidXMgdGhlIGlvbW11DQpiZWxvbmdzIHRvbyAoYnV0IHlvdSB1c2UgaXQgYXMgaWYgaXQgd2Fz
IGdsb2JhbCk/DQpJIGJlbGlldmUgdGhlcmUgaXMgbm90aGluZyByaWdodCBub3cgdG8gZW5zdXJl
IHRoZSB1bmlxdWVuZXNzIG9mIHN1Y2gNCklEIGFjcm9zcyBidXMgdHlwZXMgKGFzc3VtaW5nIHRo
ZXJlIHdpbGwgYmUgb3RoZXIgYnVzIGRyaXZlcnMgaW4gdGhlDQpmdXR1cmUgYmVzaWRlcyB2Zmlv
LXBjaSkuDQpJZiB0aGF0J3MgdGhlIGNhc2UsIHRoZSB2ZmlvLmdyb3VwX2xpc3QgZ2xvYmFsIGxp
c3QgYW5kIHRoZSBfX3ZmaW9fbG9va3VwX2Rldg0Kcm91dGluZSBzaG91bGQgYmUgY2hhbmdlZCB0
byBhY2NvdW50IGZvciB0aGUgYnVzIHRvbz8NCk9wcywgSSBqdXN0IHNhdyB0aGUgZXJyb3IgbXNn
IGluIHZmaW9fZ3JvdXBfYWRkX2RldiBhYm91dCB0aGUgZ3JvdXAgaWQgY29uZmxpY3QuDQpJcyB0
aGF0IHdhcm5pbmcgcmVsYXRlZCB0byB3aGF0IEkgbWVudGlvbmVkIGFib3ZlPw0KDQo+ICsJc3Ry
dWN0IGJ1c190eXBlCQkqYnVzOw0KPiArCXN0cnVjdCB2ZmlvX2lvbW11CSppb21tdTsNCj4gKwlz
dHJ1Y3QgbGlzdF9oZWFkCWRldmljZV9saXN0Ow0KPiArCXN0cnVjdCBsaXN0X2hlYWQJaW9tbXVf
bmV4dDsNCj4gKwlzdHJ1Y3QgbGlzdF9oZWFkCWdyb3VwX25leHQ7DQo+ICsJaW50CQkJcmVmY250
Ow0KPiArfTsNCj4gKw0KPiArc3RydWN0IHZmaW9fZGV2aWNlIHsNCj4gKwlzdHJ1Y3QgZGV2aWNl
CQkJKmRldjsNCj4gKwljb25zdCBzdHJ1Y3QgdmZpb19kZXZpY2Vfb3BzCSpvcHM7DQo+ICsJc3Ry
dWN0IHZmaW9faW9tbXUJCSppb21tdTsNCg0KSSB3b25kZXIgaWYgeW91IG5lZWQgdG8gaGF2ZSB0
aGUgJ2lvbW11JyBmaWVsZCBoZXJlLg0KdmZpb19kZXZpY2UuaW9tbXUgaXMgYWx3YXlzIHNldCBh
bmQgcmVzZXQgdG9nZXRoZXIgd2l0aA0KdmZpb19ncm91cC5pb21tdS4NCkdpdmVuIHRoYXQgYSB2
ZmlvX2RldmljZSBpbnN0YW5jZSBpcyBhbHdheXMgbGlua2VkIHRvIGEgdmZpb19ncm91cA0KaW5z
dGFuY2UsIGRvIHdlIG5lZWQgdGhpcyBkdXBsaWNhdGlvbj8gSXMgdGhpcyBkdXBsaWNhdGlvbiB0
aGVyZQ0KYmVjYXVzZSB5b3UgZG8gbm90IHdhbnQgdGhlIGRvdWJsZSBkZXJlZmVyZW5jZSBkZXZp
Y2UtPmdyb3VwLT5pb21tdT8NCg0KPiArCXN0cnVjdCB2ZmlvX2dyb3VwCQkqZ3JvdXA7DQo+ICsJ
c3RydWN0IGxpc3RfaGVhZAkJZGV2aWNlX25leHQ7DQo+ICsJYm9vbAkJCQlhdHRhY2hlZDsNCj4g
KwlpbnQJCQkJcmVmY250Ow0KPiArCXZvaWQJCQkJKmRldmljZV9kYXRhOw0KPiArfTsNCj4gKw0K
PiArLyoNCj4gKyAqIEhlbHBlciBmdW5jdGlvbnMgY2FsbGVkIHVuZGVyIHZmaW8ubG9jaw0KPiAr
ICovDQo+ICsNCj4gKy8qIFJldHVybiB0cnVlIGlmIGFueSBkZXZpY2VzIHdpdGhpbiBhIGdyb3Vw
IGFyZSBvcGVuZWQgKi8NCj4gK3N0YXRpYyBib29sIF9fdmZpb19ncm91cF9kZXZzX2ludXNlKHN0
cnVjdCB2ZmlvX2dyb3VwICpncm91cCkNCj4gK3sNCj4gKwlzdHJ1Y3QgbGlzdF9oZWFkICpwb3M7
DQo+ICsNCj4gKwlsaXN0X2Zvcl9lYWNoKHBvcywgJmdyb3VwLT5kZXZpY2VfbGlzdCkgew0KPiAr
CQlzdHJ1Y3QgdmZpb19kZXZpY2UgKmRldmljZTsNCj4gKw0KPiArCQlkZXZpY2UgPSBsaXN0X2Vu
dHJ5KHBvcywgc3RydWN0IHZmaW9fZGV2aWNlLCBkZXZpY2VfbmV4dCk7DQo+ICsJCWlmIChkZXZp
Y2UtPnJlZmNudCkNCj4gKwkJCXJldHVybiB0cnVlOw0KPiArCX0NCj4gKwlyZXR1cm4gZmFsc2U7
DQo+ICt9DQo+ICsNCj4gKy8qIFJldHVybiB0cnVlIGlmIGFueSBvZiB0aGUgZ3JvdXBzIGF0dGFj
aGVkIHRvIGFuIGlvbW11IGFyZSBvcGVuZWQuDQo+ICsgKiBXZSBjYW4gb25seSB0ZWFyIGFwYXJ0
IG1lcmdlZCBncm91cHMgd2hlbiBub3RoaW5nIGlzIGxlZnQgb3Blbi4gKi8NCj4gK3N0YXRpYyBi
b29sIF9fdmZpb19pb21tdV9ncm91cHNfaW51c2Uoc3RydWN0IHZmaW9faW9tbXUgKmlvbW11KQ0K
PiArew0KPiArCXN0cnVjdCBsaXN0X2hlYWQgKnBvczsNCj4gKw0KPiArCWxpc3RfZm9yX2VhY2go
cG9zLCAmaW9tbXUtPmdyb3VwX2xpc3QpIHsNCj4gKwkJc3RydWN0IHZmaW9fZ3JvdXAgKmdyb3Vw
Ow0KPiArDQo+ICsJCWdyb3VwID0gbGlzdF9lbnRyeShwb3MsIHN0cnVjdCB2ZmlvX2dyb3VwLCBp
b21tdV9uZXh0KTsNCj4gKwkJaWYgKGdyb3VwLT5yZWZjbnQpDQo+ICsJCQlyZXR1cm4gdHJ1ZTsN
Cj4gKwl9DQo+ICsJcmV0dXJuIGZhbHNlOw0KPiArfQ0KPiArDQo+ICsvKiBBbiBpb21tdSBpcyAi
aW4gdXNlIiBpZiBpdCBoYXMgYSBmaWxlIGRlc2NyaXB0b3Igb3BlbiBvciBpZiBhbnkgb2YNCj4g
KyAqIHRoZSBncm91cHMgYXNzaWduZWQgdG8gdGhlIGlvbW11IGhhdmUgZGV2aWNlcyBvcGVuLiAq
Lw0KPiArc3RhdGljIGJvb2wgX192ZmlvX2lvbW11X2ludXNlKHN0cnVjdCB2ZmlvX2lvbW11ICpp
b21tdSkNCj4gK3sNCj4gKwlzdHJ1Y3QgbGlzdF9oZWFkICpwb3M7DQo+ICsNCj4gKwlpZiAoaW9t
bXUtPnJlZmNudCkNCj4gKwkJcmV0dXJuIHRydWU7DQo+ICsNCj4gKwlsaXN0X2Zvcl9lYWNoKHBv
cywgJmlvbW11LT5ncm91cF9saXN0KSB7DQo+ICsJCXN0cnVjdCB2ZmlvX2dyb3VwICpncm91cDsN
Cj4gKw0KPiArCQlncm91cCA9IGxpc3RfZW50cnkocG9zLCBzdHJ1Y3QgdmZpb19ncm91cCwgaW9t
bXVfbmV4dCk7DQo+ICsNCj4gKwkJaWYgKF9fdmZpb19ncm91cF9kZXZzX2ludXNlKGdyb3VwKSkN
Cj4gKwkJCXJldHVybiB0cnVlOw0KPiArCX0NCj4gKwlyZXR1cm4gZmFsc2U7DQo+ICt9DQoNCkkg
bG9va2VkIGF0IGhvdyB5b3UgdGFrZSBjYXJlIG9mIHJlZiBjb3VudHMgLi4uDQoNClRoaXMgaXMg
aG93IHRoZSB0cmVlIG9mIHZmaW9faW9tbXUvdmZpb19ncm91cC92ZmlvX2RldmljZSBkYXRhDQpT
dHJ1Y3R1cmVzIGlzIG9yZ2FuaXplZCAoSSdsbCB1c2UganVzdCBpb21tdS9ncm91cC9kZXYgdG8g
bWFrZQ0KdGhlIGdyYXBoIHNtYWxsZXIpOg0KDQogICAgICAgICAgICBpb21tdQ0KICAgICAgICAg
ICAvICAgICBcDQogICAgICAgICAgLyAgICAgICBcIA0KICAgIGdyb3VwICAgLi4uICAgICBncm91
cA0KICAgIC8gIFwgICAgICAgICAgIC8gIFwgICANCiAgIC8gICAgXCAgICAgICAgIC8gICAgXA0K
ZGV2ICAuLiAgZGV2ICAgZGV2ICAuLiAgZGV2DQoNClRoaXMgaXMgaG93IHlvdSBnZXQgYSBmaWxl
IGRlc2NyaXB0b3IgZm9yIHRoZSB0aHJlZSBraW5kIG9mIG9iamVjdHM6DQoNCi0gZ3JvdXAgOiBv
cGVuIC9kZXYvdmZpby94eHggZm9yIGdyb3VwIHh4eA0KLSBpb21tdSA6IGdyb3VwIGlvY3RsIFZG
SU9fR1JPVVBfR0VUX0lPTU1VX0ZEDQotIGRldmljZTogZ3JvdXAgaW9jdGwgVkZJT19HUk9VUF9H
RVRfREVWSUNFX0ZEDQoNCkdpdmVuIHRoZSBhYm92ZSB0b3BvbG9neSwgSSB3b3VsZCBhc3N1bWUg
dGhhdDoNCg0KKDEpIGFuIGlvbW11IGlzICdpbnVzZScgaWYgOiBhKSBpb21tdSByZWZjbnQgPiAw
LCBvcg0KICAgICAgICAgICAgICAgICAgICAgICAgICAgICBiKSBhbnkgb2YgaXRzIGdyb3VwcyBp
cyAnaW51c2UnDQoNCigyKSBhICBncm91cCBpcyAnaW51c2UnIGlmIDogYSkgZ3JvdXAgcmVmY250
ID4gMCwgb3INCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgYikgYW55IG9mIGl0cyBkZXZp
Y2VzIGlzICdpbnVzZScNCg0KKDMpIGEgZGV2aWNlIGlzICdpbnVzZScgaWYgOiBhKSBkZXZpY2Ug
cmVmY250ID4gMA0KDQpZb3UgaGF2ZSBjb2RlZCB0aGUgJ2ludXNlJyBsb2dpYyB3aXRoIHRoZXNl
IHRocmVlIHJvdXRpbmVzOg0KDQogICAgX192ZmlvX2lvbW11X2ludXNlLCB3aGljaCBpbXBsZW1l
bnRzICgxKSBhYm92ZQ0KDQphbmQNCiAgICBfX3ZmaW9faW9tbXVfZ3JvdXBzX2ludXNlDQogICAg
X192ZmlvX2dyb3VwX2RldnNfaW51c2UNCg0Kd2hpY2ggYXJlIHVzZWQgYnkgX192ZmlvX2lvbW11
X2ludXNlLg0KV2h5IGRvbid0IHlvdSBjaGVjayB0aGUgZ3JvdXAgcmVmY250IGluIF9fdmZpb19p
b21tdV9ncm91cHNfaW51c2U/DQoNCldvdWxkIGl0IG1ha2Ugc2Vuc2UgKGFuZCB0aGUgY29kZSBt
b3JlIHJlYWRhYmxlKSB0byBzdHJ1Y3R1cmUgdGhlDQpuZXN0ZWQgcmVmY250L2ludXNlIGNoZWNr
IGxpa2UgdGhpcz8NCihUaGUgbnVtYmVycyAoMSkoMikoMykgcmVmZXIgdG8gdGhlIHRocmVlICdp
bnVzZScgY29uZGl0aW9ucyBhYm92ZSkNCg0KICAgKDEpX192ZmlvX2lvbW11X2ludXNlDQogICB8
DQogICArLT4gY2hlY2sgaW9tbXUgcmVmY250DQogICArLT4gX192ZmlvX2lvbW11X2dyb3Vwc19p
bnVzZQ0KICAgICAgIHwNCiAgICAgICArLT5MT09QOiAoMilfX3ZmaW9faW9tbXVfZ3JvdXBfaW51
c2U8LS1NSVNTSU5HDQogICAgICAgICAgICAgICAgfA0KICAgICAgICAgICAgICAgICstPiBjaGVj
ayBncm91cCByZWZjbnQ8LS1NSVNTSU5HDQogICAgICAgICAgICAgICAgKy0+IF9fdmZpb19ncm91
cF9kZXZzX2ludXNlKCkNCiAgICAgICAgICAgICAgICAgICAgfA0KICAgICAgICAgICAgICAgICAg
ICArLT4gTE9PUDogKDMpX192ZmlvX2dyb3VwX2Rldl9pbnVzZTwtLU1JU1NJTkcNCiAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgIHwNCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICst
PiBjaGVjayBkZXZpY2UgcmVmY250DQoNCj4gK3N0YXRpYyB2b2lkIF9fdmZpb19ncm91cF9zZXRf
aW9tbXUoc3RydWN0IHZmaW9fZ3JvdXAgKmdyb3VwLA0KPiArCQkJCSAgIHN0cnVjdCB2ZmlvX2lv
bW11ICppb21tdSkNCj4gK3sNCj4gKwlzdHJ1Y3QgbGlzdF9oZWFkICpwb3M7DQo+ICsNCj4gKwlp
ZiAoZ3JvdXAtPmlvbW11KQ0KPiArCQlsaXN0X2RlbCgmZ3JvdXAtPmlvbW11X25leHQpOw0KPiAr
CWlmIChpb21tdSkNCj4gKwkJbGlzdF9hZGQoJmdyb3VwLT5pb21tdV9uZXh0LCAmaW9tbXUtPmdy
b3VwX2xpc3QpOw0KPiArDQo+ICsJZ3JvdXAtPmlvbW11ID0gaW9tbXU7DQoNCklmIHlvdSByZW1v
dmUgdGhlIHZmaW9fZGV2aWNlLmlvbW11IGZpZWxkIChhcyBzdWdnZXN0ZWQgYWJvdmUgaW4gYSBw
cmV2aW91cw0KQ29tbWVudCksIHRoZSBibG9jayBiZWxvdyB3b3VsZCBub3QgYmUgbmVlZGVkIGFu
eW1vcmUuDQoNCj4gKwlsaXN0X2Zvcl9lYWNoKHBvcywgJmdyb3VwLT5kZXZpY2VfbGlzdCkgew0K
PiArCQlzdHJ1Y3QgdmZpb19kZXZpY2UgKmRldmljZTsNCj4gKw0KPiArCQlkZXZpY2UgPSBsaXN0
X2VudHJ5KHBvcywgc3RydWN0IHZmaW9fZGV2aWNlLCBkZXZpY2VfbmV4dCk7DQo+ICsJCWRldmlj
ZS0+aW9tbXUgPSBpb21tdTsNCj4gKwl9DQo+ICt9DQo+ICsNCj4gK3N0YXRpYyB2b2lkIF9fdmZp
b19pb21tdV9kZXRhY2hfZGV2KHN0cnVjdCB2ZmlvX2lvbW11ICppb21tdSwNCj4gKwkJCQkgICAg
c3RydWN0IHZmaW9fZGV2aWNlICpkZXZpY2UpDQo+ICt7DQo+ICsJQlVHX09OKCFpb21tdS0+ZG9t
YWluICYmIGRldmljZS0+YXR0YWNoZWQpOw0KPiArDQo+ICsJaWYgKCFpb21tdS0+ZG9tYWluIHx8
ICFkZXZpY2UtPmF0dGFjaGVkKQ0KPiArCQlyZXR1cm47DQo+ICsNCj4gKwlpb21tdV9kZXRhY2hf
ZGV2aWNlKGlvbW11LT5kb21haW4sIGRldmljZS0+ZGV2KTsNCj4gKwlkZXZpY2UtPmF0dGFjaGVk
ID0gZmFsc2U7DQo+ICt9DQo+ICsNCj4gK3N0YXRpYyB2b2lkIF9fdmZpb19pb21tdV9kZXRhY2hf
Z3JvdXAoc3RydWN0IHZmaW9faW9tbXUgKmlvbW11LA0KPiArCQkJCSAgICAgIHN0cnVjdCB2Zmlv
X2dyb3VwICpncm91cCkNCj4gK3sNCj4gKwlzdHJ1Y3QgbGlzdF9oZWFkICpwb3M7DQo+ICsNCj4g
KwlsaXN0X2Zvcl9lYWNoKHBvcywgJmdyb3VwLT5kZXZpY2VfbGlzdCkgew0KPiArCQlzdHJ1Y3Qg
dmZpb19kZXZpY2UgKmRldmljZTsNCj4gKw0KPiArCQlkZXZpY2UgPSBsaXN0X2VudHJ5KHBvcywg
c3RydWN0IHZmaW9fZGV2aWNlLCBkZXZpY2VfbmV4dCk7DQo+ICsJCV9fdmZpb19pb21tdV9kZXRh
Y2hfZGV2KGlvbW11LCBkZXZpY2UpOw0KPiArCX0NCj4gK30NCj4gKw0KPiArc3RhdGljIGludCBf
X3ZmaW9faW9tbXVfYXR0YWNoX2RldihzdHJ1Y3QgdmZpb19pb21tdSAqaW9tbXUsDQo+ICsJCQkJ
ICAgc3RydWN0IHZmaW9fZGV2aWNlICpkZXZpY2UpDQo+ICt7DQo+ICsJaW50IHJldDsNCj4gKw0K
PiArCUJVR19PTihkZXZpY2UtPmF0dGFjaGVkKTsNCj4gKw0KPiArCWlmICghaW9tbXUgfHwgIWlv
bW11LT5kb21haW4pDQo+ICsJCXJldHVybiAtRUlOVkFMOw0KPiArDQo+ICsJcmV0ID0gaW9tbXVf
YXR0YWNoX2RldmljZShpb21tdS0+ZG9tYWluLCBkZXZpY2UtPmRldik7DQo+ICsJaWYgKCFyZXQp
DQo+ICsJCWRldmljZS0+YXR0YWNoZWQgPSB0cnVlOw0KPiArDQo+ICsJcmV0dXJuIHJldDsNCj4g
K30NCj4gKw0KPiArc3RhdGljIGludCBfX3ZmaW9faW9tbXVfYXR0YWNoX2dyb3VwKHN0cnVjdCB2
ZmlvX2lvbW11ICppb21tdSwNCj4gKwkJCQkgICAgIHN0cnVjdCB2ZmlvX2dyb3VwICpncm91cCkN
Cj4gK3sNCj4gKwlzdHJ1Y3QgbGlzdF9oZWFkICpwb3M7DQo+ICsNCj4gKwlsaXN0X2Zvcl9lYWNo
KHBvcywgJmdyb3VwLT5kZXZpY2VfbGlzdCkgew0KPiArCQlzdHJ1Y3QgdmZpb19kZXZpY2UgKmRl
dmljZTsNCj4gKwkJaW50IHJldDsNCj4gKw0KPiArCQlkZXZpY2UgPSBsaXN0X2VudHJ5KHBvcywg
c3RydWN0IHZmaW9fZGV2aWNlLCBkZXZpY2VfbmV4dCk7DQo+ICsJCXJldCA9IF9fdmZpb19pb21t
dV9hdHRhY2hfZGV2KGlvbW11LCBkZXZpY2UpOw0KPiArCQlpZiAocmV0KSB7DQo+ICsJCQlfX3Zm
aW9faW9tbXVfZGV0YWNoX2dyb3VwKGlvbW11LCBncm91cCk7DQo+ICsJCQlyZXR1cm4gcmV0Ow0K
PiArCQl9DQo+ICsJfQ0KPiArCXJldHVybiAwOw0KPiArfQ0KPiArDQo+ICsvKiBUaGUgaW9tbXUg
aXMgdmlhYmxlLCBpZS4gcmVhZHkgdG8gYmUgY29uZmlndXJlZCwgd2hlbiBhbGwgdGhlDQo+IGRl
dmljZXMNCj4gKyAqIGZvciBhbGwgdGhlIGdyb3VwcyBhdHRhY2hlZCB0byB0aGUgaW9tbXUgYXJl
IGJvdW5kIHRvIHRoZWlyIHZmaW8NCj4gZGV2aWNlDQo+ICsgKiBkcml2ZXJzIChleC4gdmZpby1w
Y2kpLiAgVGhpcyBzZXRzIHRoZSBkZXZpY2VfZGF0YSBwcml2YXRlIGRhdGENCj4gcG9pbnRlci4g
Ki8NCj4gK3N0YXRpYyBib29sIF9fdmZpb19pb21tdV92aWFibGUoc3RydWN0IHZmaW9faW9tbXUg
KmlvbW11KQ0KPiArew0KPiArCXN0cnVjdCBsaXN0X2hlYWQgKmdwb3MsICpkcG9zOw0KPiArDQo+
ICsJbGlzdF9mb3JfZWFjaChncG9zLCAmaW9tbXUtPmdyb3VwX2xpc3QpIHsNCj4gKwkJc3RydWN0
IHZmaW9fZ3JvdXAgKmdyb3VwOw0KPiArCQlncm91cCA9IGxpc3RfZW50cnkoZ3Bvcywgc3RydWN0
IHZmaW9fZ3JvdXAsIGlvbW11X25leHQpOw0KPiArDQo+ICsJCWxpc3RfZm9yX2VhY2goZHBvcywg
Jmdyb3VwLT5kZXZpY2VfbGlzdCkgew0KPiArCQkJc3RydWN0IHZmaW9fZGV2aWNlICpkZXZpY2U7
DQo+ICsJCQlkZXZpY2UgPSBsaXN0X2VudHJ5KGRwb3MsDQo+ICsJCQkJCSAgICBzdHJ1Y3QgdmZp
b19kZXZpY2UsIGRldmljZV9uZXh0KTsNCj4gKw0KPiArCQkJaWYgKCFkZXZpY2UtPmRldmljZV9k
YXRhKQ0KPiArCQkJCXJldHVybiBmYWxzZTsNCj4gKwkJfQ0KPiArCX0NCj4gKwlyZXR1cm4gdHJ1
ZTsNCj4gK30NCj4gKw0KPiArc3RhdGljIHZvaWQgX192ZmlvX2Nsb3NlX2lvbW11KHN0cnVjdCB2
ZmlvX2lvbW11ICppb21tdSkNCj4gK3sNCj4gKwlzdHJ1Y3QgbGlzdF9oZWFkICpwb3M7DQo+ICsN
Cj4gKwlpZiAoIWlvbW11LT5kb21haW4pDQo+ICsJCXJldHVybjsNCj4gKw0KPiArCWxpc3RfZm9y
X2VhY2gocG9zLCAmaW9tbXUtPmdyb3VwX2xpc3QpIHsNCj4gKwkJc3RydWN0IHZmaW9fZ3JvdXAg
Kmdyb3VwOw0KPiArCQlncm91cCA9IGxpc3RfZW50cnkocG9zLCBzdHJ1Y3QgdmZpb19ncm91cCwg
aW9tbXVfbmV4dCk7DQo+ICsNCj4gKwkJX192ZmlvX2lvbW11X2RldGFjaF9ncm91cChpb21tdSwg
Z3JvdXApOw0KPiArCX0NCj4gKw0KPiArCXZmaW9faW9tbXVfdW5tYXBhbGwoaW9tbXUpOw0KPiAr
DQo+ICsJaW9tbXVfZG9tYWluX2ZyZWUoaW9tbXUtPmRvbWFpbik7DQo+ICsJaW9tbXUtPmRvbWFp
biA9IE5VTEw7DQo+ICsJaW9tbXUtPm1tID0gTlVMTDsNCj4gK30NCj4gKw0KPiArLyogT3BlbiB0
aGUgSU9NTVUuICBUaGlzIGdhdGVzIGFsbCBhY2Nlc3MgdG8gdGhlIGlvbW11IG9yIGRldmljZSBm
aWxlDQo+ICsgKiBkZXNjcmlwdG9ycyBhbmQgc2V0cyBjdXJyZW50LT5tbSBhcyB0aGUgZXhjbHVz
aXZlIHVzZXIuICovDQoNCkdpdmVuIHRoZSBmbiAgdmZpb19ncm91cF9vcGVuIChpZSwgMXN0IG9i
amVjdCwgMm5kIG9wZXJhdGlvbiksIEkgd291bGQgaGF2ZQ0KY2FsbGVkIHRoaXMgb25lIF9fdmZp
b19pb21tdV9vcGVuIChpbnN0ZWFkIG9mIF9fdmZpb19vcGVuX2lvbW11KS4NCklzIGl0IG5hbWVk
IF9fdmZpb19vcGVuX2lvbW11IHRvIGF2b2lkIGEgY29uZmxpY3Qgd2l0aCB0aGUgbmFtZXNwYWNl
IGluIHZmaW9faW9tbXUuYz8gICAgICANCg0KPiArc3RhdGljIGludCBfX3ZmaW9fb3Blbl9pb21t
dShzdHJ1Y3QgdmZpb19pb21tdSAqaW9tbXUpDQo+ICt7DQo+ICsJc3RydWN0IGxpc3RfaGVhZCAq
cG9zOw0KPiArCWludCByZXQ7DQo+ICsNCj4gKwlpZiAoIV9fdmZpb19pb21tdV92aWFibGUoaW9t
bXUpKQ0KPiArCQlyZXR1cm4gLUVCVVNZOw0KPiArDQo+ICsJaWYgKGlvbW11LT5kb21haW4pDQo+
ICsJCXJldHVybiAtRUlOVkFMOw0KPiArDQo+ICsJaW9tbXUtPmRvbWFpbiA9IGlvbW11X2RvbWFp
bl9hbGxvYyhpb21tdS0+YnVzKTsNCj4gKwlpZiAoIWlvbW11LT5kb21haW4pDQo+ICsJCXJldHVy
biAtRUZBVUxUOw0KPiArDQo+ICsJbGlzdF9mb3JfZWFjaChwb3MsICZpb21tdS0+Z3JvdXBfbGlz
dCkgew0KPiArCQlzdHJ1Y3QgdmZpb19ncm91cCAqZ3JvdXA7DQo+ICsJCWdyb3VwID0gbGlzdF9l
bnRyeShwb3MsIHN0cnVjdCB2ZmlvX2dyb3VwLCBpb21tdV9uZXh0KTsNCj4gKw0KPiArCQlyZXQg
PSBfX3ZmaW9faW9tbXVfYXR0YWNoX2dyb3VwKGlvbW11LCBncm91cCk7DQo+ICsJCWlmIChyZXQp
IHsNCj4gKwkJCV9fdmZpb19jbG9zZV9pb21tdShpb21tdSk7DQo+ICsJCQlyZXR1cm4gcmV0Ow0K
PiArCQl9DQo+ICsJfQ0KPiArDQo+ICsJaWYgKCFhbGxvd191bnNhZmVfaW50cnMgJiYNCj4gKwkg
ICAgIWlvbW11X2RvbWFpbl9oYXNfY2FwKGlvbW11LT5kb21haW4sIElPTU1VX0NBUF9JTlRSX1JF
TUFQKSkgew0KPiArCQlfX3ZmaW9fY2xvc2VfaW9tbXUoaW9tbXUpOw0KPiArCQlyZXR1cm4gLUVG
QVVMVDsNCj4gKwl9DQo+ICsNCj4gKwlpb21tdS0+Y2FjaGUgPSAoaW9tbXVfZG9tYWluX2hhc19j
YXAoaW9tbXUtPmRvbWFpbiwNCj4gKwkJCQkJICAgICBJT01NVV9DQVBfQ0FDSEVfQ09IRVJFTkNZ
KSAhPSAwKTsNCj4gKwlpb21tdS0+bW0gPSBjdXJyZW50LT5tbTsNCj4gKw0KPiArCXJldHVybiAw
Ow0KPiArfQ0KPiArDQo+ICsvKiBBY3RpdmVseSB0cnkgdG8gdGVhciBkb3duIHRoZSBpb21tdSBh
bmQgbWVyZ2VkIGdyb3Vwcy4gIElmIHRoZXJlDQo+IGFyZSBubw0KPiArICogb3BlbiBpb21tdSBv
ciBkZXZpY2UgZmRzLCB3ZSBjbG9zZSB0aGUgaW9tbXUuICBJZiB3ZSBjbG9zZSB0aGUNCj4gaW9t
bXUgYW5kDQo+ICsgKiB0aGVyZSBhcmUgYWxzbyBubyBvcGVuIGdyb3VwIGZkcywgd2UgY2FuIGZ1
dGhlciBkaXNzb2x2ZSB0aGUgZ3JvdXANCj4gdG8NCj4gKyAqIGlvbW11IGFzc29jaWF0aW9uIGFu
ZCBmcmVlIHRoZSBpb21tdSBkYXRhIHN0cnVjdHVyZS4gKi8NCj4gK3N0YXRpYyBpbnQgX192Zmlv
X3RyeV9kaXNzb2x2ZV9pb21tdShzdHJ1Y3QgdmZpb19pb21tdSAqaW9tbXUpDQo+ICt7DQo+ICsN
Cj4gKwlpZiAoX192ZmlvX2lvbW11X2ludXNlKGlvbW11KSkNCj4gKwkJcmV0dXJuIC1FQlVTWTsN
Cj4gKw0KPiArCV9fdmZpb19jbG9zZV9pb21tdShpb21tdSk7DQo+ICsNCj4gKwlpZiAoIV9fdmZp
b19pb21tdV9ncm91cHNfaW51c2UoaW9tbXUpKSB7DQo+ICsJCXN0cnVjdCBsaXN0X2hlYWQgKnBv
cywgKnBwb3M7DQo+ICsNCj4gKwkJbGlzdF9mb3JfZWFjaF9zYWZlKHBvcywgcHBvcywgJmlvbW11
LT5ncm91cF9saXN0KSB7DQo+ICsJCQlzdHJ1Y3QgdmZpb19ncm91cCAqZ3JvdXA7DQo+ICsNCj4g
KwkJCWdyb3VwID0gbGlzdF9lbnRyeShwb3MsIHN0cnVjdCB2ZmlvX2dyb3VwLA0KPiBpb21tdV9u
ZXh0KTsNCj4gKwkJCV9fdmZpb19ncm91cF9zZXRfaW9tbXUoZ3JvdXAsIE5VTEwpOw0KPiArCQl9
DQo+ICsNCj4gKw0KPiArCQlrZnJlZShpb21tdSk7DQo+ICsJfQ0KPiArDQo+ICsJcmV0dXJuIDA7
DQo+ICt9DQo+ICsNCj4gK3N0YXRpYyBzdHJ1Y3QgdmZpb19kZXZpY2UgKl9fdmZpb19sb29rdXBf
ZGV2KHN0cnVjdCBkZXZpY2UgKmRldikNCj4gK3sNCj4gKwlzdHJ1Y3QgbGlzdF9oZWFkICpncG9z
Ow0KPiArCXVuc2lnbmVkIGludCBncm91cGlkOw0KPiArDQo+ICsJaWYgKGlvbW11X2RldmljZV9n
cm91cChkZXYsICZncm91cGlkKSkNCj4gKwkJcmV0dXJuIE5VTEw7DQo+ICsNCj4gKwlsaXN0X2Zv
cl9lYWNoKGdwb3MsICZ2ZmlvLmdyb3VwX2xpc3QpIHsNCj4gKwkJc3RydWN0IHZmaW9fZ3JvdXAg
Kmdyb3VwOw0KPiArCQlzdHJ1Y3QgbGlzdF9oZWFkICpkcG9zOw0KPiArDQo+ICsJCWdyb3VwID0g
bGlzdF9lbnRyeShncG9zLCBzdHJ1Y3QgdmZpb19ncm91cCwgZ3JvdXBfbmV4dCk7DQo+ICsNCj4g
KwkJaWYgKGdyb3VwLT5ncm91cGlkICE9IGdyb3VwaWQpDQo+ICsJCQljb250aW51ZTsNCj4gKw0K
PiArCQlsaXN0X2Zvcl9lYWNoKGRwb3MsICZncm91cC0+ZGV2aWNlX2xpc3QpIHsNCj4gKwkJCXN0
cnVjdCB2ZmlvX2RldmljZSAqZGV2aWNlOw0KPiArDQo+ICsJCQlkZXZpY2UgPSBsaXN0X2VudHJ5
KGRwb3MsDQo+ICsJCQkJCSAgICBzdHJ1Y3QgdmZpb19kZXZpY2UsIGRldmljZV9uZXh0KTsNCj4g
Kw0KPiArCQkJaWYgKGRldmljZS0+ZGV2ID09IGRldikNCj4gKwkJCQlyZXR1cm4gZGV2aWNlOw0K
PiArCQl9DQo+ICsJfQ0KPiArCXJldHVybiBOVUxMOw0KPiArfQ0KPiArDQo+ICsvKiBBbGwgcmVs
ZWFzZSBwYXRocyBzaW1wbHkgZGVjcmVtZW50IHRoZSByZWZjbnQsIGF0dGVtcHQgdG8gdGVhcmRv
d24NCj4gKyAqIHRoZSBpb21tdSBhbmQgbWVyZ2VkIGdyb3VwcywgYW5kIHdha2V1cCBhbnl0aGlu
ZyB0aGF0IG1pZ2h0IGJlDQo+ICsgKiB3YWl0aW5nIGlmIHdlIHN1Y2Nlc3NmdWxseSBkaXNzb2x2
ZSBhbnl0aGluZy4gKi8NCj4gK3N0YXRpYyBpbnQgdmZpb19kb19yZWxlYXNlKGludCAqcmVmY250
LCBzdHJ1Y3QgdmZpb19pb21tdSAqaW9tbXUpDQo+ICt7DQo+ICsJYm9vbCB3YWtlOw0KPiArDQo+
ICsJbXV0ZXhfbG9jaygmdmZpby5sb2NrKTsNCj4gKw0KPiArCSgqcmVmY250KS0tOw0KPiArCXdh
a2UgPSAoX192ZmlvX3RyeV9kaXNzb2x2ZV9pb21tdShpb21tdSkgPT0gMCk7DQo+ICsNCj4gKwlt
dXRleF91bmxvY2soJnZmaW8ubG9jayk7DQo+ICsNCj4gKwlpZiAod2FrZSkNCj4gKwkJd2FrZV91
cCgmdmZpby5yZWxlYXNlX3EpOw0KPiArDQo+ICsJcmV0dXJuIDA7DQo+ICt9DQo+ICsNCj4gKy8q
DQo+ICsgKiBEZXZpY2UgZm9wcyAtIHBhc3N0aHJvdWdoIHRvIHZmaW8gZGV2aWNlIGRyaXZlciB3
LyBkZXZpY2VfZGF0YQ0KPiArICovDQo+ICtzdGF0aWMgaW50IHZmaW9fZGV2aWNlX3JlbGVhc2Uo
c3RydWN0IGlub2RlICppbm9kZSwgc3RydWN0IGZpbGUNCj4gKmZpbGVwKQ0KPiArew0KPiArCXN0
cnVjdCB2ZmlvX2RldmljZSAqZGV2aWNlID0gZmlsZXAtPnByaXZhdGVfZGF0YTsNCj4gKw0KPiAr
CXZmaW9fZG9fcmVsZWFzZSgmZGV2aWNlLT5yZWZjbnQsIGRldmljZS0+aW9tbXUpOw0KPiArDQo+
ICsJZGV2aWNlLT5vcHMtPnB1dChkZXZpY2UtPmRldmljZV9kYXRhKTsNCj4gKw0KPiArCXJldHVy
biAwOw0KPiArfQ0KPiArDQo+ICtzdGF0aWMgbG9uZyB2ZmlvX2RldmljZV91bmxfaW9jdGwoc3Ry
dWN0IGZpbGUgKmZpbGVwLA0KPiArCQkJCSAgdW5zaWduZWQgaW50IGNtZCwgdW5zaWduZWQgbG9u
ZyBhcmcpDQo+ICt7DQo+ICsJc3RydWN0IHZmaW9fZGV2aWNlICpkZXZpY2UgPSBmaWxlcC0+cHJp
dmF0ZV9kYXRhOw0KPiArDQo+ICsJcmV0dXJuIGRldmljZS0+b3BzLT5pb2N0bChkZXZpY2UtPmRl
dmljZV9kYXRhLCBjbWQsIGFyZyk7DQo+ICt9DQo+ICsNCj4gK3N0YXRpYyBzc2l6ZV90IHZmaW9f
ZGV2aWNlX3JlYWQoc3RydWN0IGZpbGUgKmZpbGVwLCBjaGFyIF9fdXNlciAqYnVmLA0KPiArCQkJ
CXNpemVfdCBjb3VudCwgbG9mZl90ICpwcG9zKQ0KPiArew0KPiArCXN0cnVjdCB2ZmlvX2Rldmlj
ZSAqZGV2aWNlID0gZmlsZXAtPnByaXZhdGVfZGF0YTsNCj4gKw0KPiArCXJldHVybiBkZXZpY2Ut
Pm9wcy0+cmVhZChkZXZpY2UtPmRldmljZV9kYXRhLCBidWYsIGNvdW50LCBwcG9zKTsNCj4gK30N
Cj4gKw0KPiArc3RhdGljIHNzaXplX3QgdmZpb19kZXZpY2Vfd3JpdGUoc3RydWN0IGZpbGUgKmZp
bGVwLCBjb25zdCBjaGFyIF9fdXNlcg0KPiAqYnVmLA0KPiArCQkJCSBzaXplX3QgY291bnQsIGxv
ZmZfdCAqcHBvcykNCj4gK3sNCj4gKwlzdHJ1Y3QgdmZpb19kZXZpY2UgKmRldmljZSA9IGZpbGVw
LT5wcml2YXRlX2RhdGE7DQo+ICsNCj4gKwlyZXR1cm4gZGV2aWNlLT5vcHMtPndyaXRlKGRldmlj
ZS0+ZGV2aWNlX2RhdGEsIGJ1ZiwgY291bnQsIHBwb3MpOw0KPiArfQ0KPiArDQo+ICtzdGF0aWMg
aW50IHZmaW9fZGV2aWNlX21tYXAoc3RydWN0IGZpbGUgKmZpbGVwLCBzdHJ1Y3Qgdm1fYXJlYV9z
dHJ1Y3QNCj4gKnZtYSkNCj4gK3sNCj4gKwlzdHJ1Y3QgdmZpb19kZXZpY2UgKmRldmljZSA9IGZp
bGVwLT5wcml2YXRlX2RhdGE7DQo+ICsNCj4gKwlyZXR1cm4gZGV2aWNlLT5vcHMtPm1tYXAoZGV2
aWNlLT5kZXZpY2VfZGF0YSwgdm1hKTsNCj4gK30NCj4gKw0KPiArI2lmZGVmIENPTkZJR19DT01Q
QVQNCj4gK3N0YXRpYyBsb25nIHZmaW9fZGV2aWNlX2NvbXBhdF9pb2N0bChzdHJ1Y3QgZmlsZSAq
ZmlsZXAsDQo+ICsJCQkJICAgICB1bnNpZ25lZCBpbnQgY21kLCB1bnNpZ25lZCBsb25nIGFyZykN
Cj4gK3sNCj4gKwlhcmcgPSAodW5zaWduZWQgbG9uZyljb21wYXRfcHRyKGFyZyk7DQo+ICsJcmV0
dXJuIHZmaW9fZGV2aWNlX3VubF9pb2N0bChmaWxlcCwgY21kLCBhcmcpOw0KPiArfQ0KPiArI2Vu
ZGlmCS8qIENPTkZJR19DT01QQVQgKi8NCj4gKw0KPiArY29uc3Qgc3RydWN0IGZpbGVfb3BlcmF0
aW9ucyB2ZmlvX2RldmljZV9mb3BzID0gew0KPiArCS5vd25lcgkJPSBUSElTX01PRFVMRSwNCj4g
KwkucmVsZWFzZQk9IHZmaW9fZGV2aWNlX3JlbGVhc2UsDQo+ICsJLnJlYWQJCT0gdmZpb19kZXZp
Y2VfcmVhZCwNCj4gKwkud3JpdGUJCT0gdmZpb19kZXZpY2Vfd3JpdGUsDQo+ICsJLnVubG9ja2Vk
X2lvY3RsCT0gdmZpb19kZXZpY2VfdW5sX2lvY3RsLA0KPiArI2lmZGVmIENPTkZJR19DT01QQVQN
Cj4gKwkuY29tcGF0X2lvY3RsCT0gdmZpb19kZXZpY2VfY29tcGF0X2lvY3RsLA0KPiArI2VuZGlm
DQo+ICsJLm1tYXAJCT0gdmZpb19kZXZpY2VfbW1hcCwNCj4gK307DQo+ICsNCj4gKy8qDQo+ICsg
KiBHcm91cCBmb3BzDQo+ICsgKi8NCj4gK3N0YXRpYyBpbnQgdmZpb19ncm91cF9vcGVuKHN0cnVj
dCBpbm9kZSAqaW5vZGUsIHN0cnVjdCBmaWxlICpmaWxlcCkNCj4gK3sNCj4gKwlzdHJ1Y3QgdmZp
b19ncm91cCAqZ3JvdXA7DQo+ICsJaW50IHJldCA9IDA7DQo+ICsNCj4gKwltdXRleF9sb2NrKCZ2
ZmlvLmxvY2spOw0KPiArDQo+ICsJZ3JvdXAgPSBpZHJfZmluZCgmdmZpby5pZHIsIGltaW5vcihp
bm9kZSkpOw0KPiArDQo+ICsJaWYgKCFncm91cCkgew0KPiArCQlyZXQgPSAtRU5PREVWOw0KPiAr
CQlnb3RvIG91dDsNCj4gKwl9DQo+ICsNCj4gKwlmaWxlcC0+cHJpdmF0ZV9kYXRhID0gZ3JvdXA7
DQo+ICsNCj4gKwlpZiAoIWdyb3VwLT5pb21tdSkgew0KPiArCQlzdHJ1Y3QgdmZpb19pb21tdSAq
aW9tbXU7DQo+ICsNCj4gKwkJaW9tbXUgPSBremFsbG9jKHNpemVvZigqaW9tbXUpLCBHRlBfS0VS
TkVMKTsNCj4gKwkJaWYgKCFpb21tdSkgew0KPiArCQkJcmV0ID0gLUVOT01FTTsNCj4gKwkJCWdv
dG8gb3V0Ow0KPiArCQl9DQo+ICsJCUlOSVRfTElTVF9IRUFEKCZpb21tdS0+Z3JvdXBfbGlzdCk7
DQo+ICsJCUlOSVRfTElTVF9IRUFEKCZpb21tdS0+ZG1fbGlzdCk7DQo+ICsJCW11dGV4X2luaXQo
JmlvbW11LT5kZ2F0ZSk7DQo+ICsJCWlvbW11LT5idXMgPSBncm91cC0+YnVzOw0KPiArCQlfX3Zm
aW9fZ3JvdXBfc2V0X2lvbW11KGdyb3VwLCBpb21tdSk7DQo+ICsJfQ0KPiArCWdyb3VwLT5yZWZj
bnQrKzsNCj4gKw0KPiArb3V0Og0KPiArCW11dGV4X3VubG9jaygmdmZpby5sb2NrKTsNCj4gKw0K
PiArCXJldHVybiByZXQ7DQo+ICt9DQo+ICsNCj4gK3N0YXRpYyBpbnQgdmZpb19ncm91cF9yZWxl
YXNlKHN0cnVjdCBpbm9kZSAqaW5vZGUsIHN0cnVjdCBmaWxlICpmaWxlcCkNCj4gK3sNCj4gKwlz
dHJ1Y3QgdmZpb19ncm91cCAqZ3JvdXAgPSBmaWxlcC0+cHJpdmF0ZV9kYXRhOw0KPiArDQo+ICsJ
cmV0dXJuIHZmaW9fZG9fcmVsZWFzZSgmZ3JvdXAtPnJlZmNudCwgZ3JvdXAtPmlvbW11KTsNCj4g
K30NCj4gKw0KPiArLyogQXR0ZW1wdCB0byBtZXJnZSB0aGUgZ3JvdXAgcG9pbnRlZCB0byBieSBm
ZCBpbnRvIGdyb3VwLiAgVGhlIG1lcmdlLQ0KPiBlZQ0KPiArICogZ3JvdXAgbXVzdCBub3QgaGF2
ZSBhbiBpb21tdSBvciBhbnkgZGV2aWNlcyBvcGVuIGJlY2F1c2Ugd2UgY2Fubm90DQo+ICsgKiBt
YWludGFpbiB0aGF0IGNvbnRleHQgYWNyb3NzIHRoZSBtZXJnZS4gIFRoZSBtZXJnZS1lciBncm91
cCBjYW4gYmUNCj4gKyAqIGluIHVzZS4gKi8NCj4gK3N0YXRpYyBpbnQgdmZpb19ncm91cF9tZXJn
ZShzdHJ1Y3QgdmZpb19ncm91cCAqZ3JvdXAsIGludCBmZCkNCg0KVGhlIGRvY3VtZW50YXRpb24g
aW4gdmZpby50eHQgZXhwbGFpbnMgY2xlYXJseSB0aGUgbG9naWMgaW1wbGVtZW50ZWQgYnkNCnRo
ZSBtZXJnZS91bm1lcmdlIGdyb3VwIGlvY3Rscy4NCkhvd2V2ZXIsIHdoYXQgeW91IGFyZSBkb2lu
ZyBpcyBub3QgbWVyZ2luZyBncm91cHMsIGJ1dCByYXRoZXIgYWRkaW5nL3JlbW92aW5nDQpncm91
cHMgdG8vZnJvbSBpb21tdXMgKGFuZCBjcmVhdGluZyBmbGF0IGxpc3RzIG9mIGdyb3VwcykuDQpG
b3IgZXhhbXBsZSwgd2hlbiB5b3UgZG8NCg0KICBtZXJnZShBLEIpDQoNCnlvdSBhY3R1YWxseSBt
ZWFuIHRvIHNheSAibWVyZ2UgQiB0byB0aGUgbGlzdCBvZiBncm91cHMgYXNzaWduZWQgdG8gdGhl
DQpzYW1lIGlvbW11IGFzIGdyb3VwIEEiLg0KRm9yIHRoZSBzYW1lIHJlYXNvbiwgeW91IGRvIG5v
dCByZWFsbHkgbmVlZCB0byBwcm92aWRlIHRoZSBncm91cCB5b3Ugd2FudA0KdG8gdW5tZXJnZSBm
cm9tLCB3aGljaCBtZWFucyB0aGF0IGluc3RlYWQgb2YNCg0KICB1bm1lcmdlKEEsQikgDQoNCnlv
dSB3b3VsZCBqdXN0IG5lZWQNCg0KICB1bm1lcmdlKEIpDQoNCkkgdW5kZXJzdGFuZCB0aGUgcmVh
c29uIHdoeSBpdCBpcyBub3QgYSByZWFsIG1lcmdlL3VubWVyZ2UgKGllLCB0byBrZWVwIHRoZQ0K
b3JpZ2luYWwgZ3JvdXBzIHNvIHRoYXQgeW91IGNhbiB1bm1lcmdlIGxhdGVyKSAuLi4gaG93ZXZl
ciBJIGp1c3Qgd29uZGVyIGlmDQppdCB3b3VsZG4ndCBiZSBtb3JlIG5hdHVyYWwgdG8gaW1wbGVt
ZW50IHRoZSBWRklPX0lPTU1VX0FERF9HUk9VUC9ERUxfR1JPVVANCmlvbW11IGlvY3RscyBpbnN0
ZWFkPyAodGhlIHJlbGF0aW9uc2hpcHMgYmV0d2VlbiB0aGUgZGF0YSBzdHJ1Y3R1cmUgd291bGQN
CnJlbWFpbiB0aGUgc2FtZSkNCkkgZ3Vlc3MgeW91IGFscmVhZHkgZGlzY2FyZGVkIHRoaXMgb3B0
aW9uIGZvciBzb21lIHJlYXNvbnMsIHJpZ2h0PyBXaGF0IHdhcw0KdGhlIHJlYXNvbj8NCg0KPiAr
ew0KPiArCXN0cnVjdCB2ZmlvX2dyb3VwICpuZXc7DQo+ICsJc3RydWN0IHZmaW9faW9tbXUgKm9s
ZF9pb21tdTsNCj4gKwlzdHJ1Y3QgZmlsZSAqZmlsZTsNCj4gKwlpbnQgcmV0ID0gMDsNCj4gKwli
b29sIG9wZW5lZCA9IGZhbHNlOw0KPiArDQo+ICsJbXV0ZXhfbG9jaygmdmZpby5sb2NrKTsNCj4g
Kw0KPiArCWZpbGUgPSBmZ2V0KGZkKTsNCj4gKwlpZiAoIWZpbGUpIHsNCj4gKwkJcmV0ID0gLUVC
QURGOw0KPiArCQlnb3RvIG91dF9ub3B1dDsNCj4gKwl9DQo+ICsNCj4gKwkvKiBTYW5pdHkgY2hl
Y2ssIGlzIHRoaXMgcmVhbGx5IG91ciBmZD8gKi8NCj4gKwlpZiAoZmlsZS0+Zl9vcCAhPSAmdmZp
b19ncm91cF9mb3BzKSB7DQo+ICsJCXJldCA9IC1FSU5WQUw7DQo+ICsJCWdvdG8gb3V0Ow0KPiAr
CX0NCj4gKw0KPiArCW5ldyA9IGZpbGUtPnByaXZhdGVfZGF0YTsNCj4gKw0KPiArCWlmICghbmV3
IHx8IG5ldyA9PSBncm91cCB8fCAhbmV3LT5pb21tdSB8fA0KPiArCSAgICBuZXctPmlvbW11LT5k
b21haW4gfHwgbmV3LT5idXMgIT0gZ3JvdXAtPmJ1cykgew0KPiArCQlyZXQgPSAtRUlOVkFMOw0K
PiArCQlnb3RvIG91dDsNCj4gKwl9DQo+ICsNCj4gKwkvKiBXZSBuZWVkIHRvIGF0dGFjaCBhbGwg
dGhlIGRldmljZXMgdG8gZWFjaCBkb21haW4gc2VwYXJhdGVseQ0KPiArCSAqIGluIG9yZGVyIHRv
IHZhbGlkYXRlIHRoYXQgdGhlIGNhcGFiaWxpdGllcyBtYXRjaCBmb3IgYm90aC4gICovDQo+ICsJ
cmV0ID0gX192ZmlvX29wZW5faW9tbXUobmV3LT5pb21tdSk7DQo+ICsJaWYgKHJldCkNCj4gKwkJ
Z290byBvdXQ7DQo+ICsNCj4gKwlpZiAoIWdyb3VwLT5pb21tdS0+ZG9tYWluKSB7DQo+ICsJCXJl
dCA9IF9fdmZpb19vcGVuX2lvbW11KGdyb3VwLT5pb21tdSk7DQo+ICsJCWlmIChyZXQpDQo+ICsJ
CQlnb3RvIG91dDsNCj4gKwkJb3BlbmVkID0gdHJ1ZTsNCj4gKwl9DQo+ICsNCj4gKwkvKiBJZiBj
YWNoZSBjb2hlcmVuY3kgZG9lc24ndCBtYXRjaCB3ZSdkIHBvdGVudGlhbHkgbmVlZCB0bw0KPiAr
CSAqIHJlbWFwIGV4aXN0aW5nIGlvbW11IG1hcHBpbmdzIGluIHRoZSBtZXJnZS1lciBkb21haW4u
DQo+ICsJICogUG9vciByZXR1cm4gdG8gYm90aGVyIHRyeWluZyB0byBhbGxvdyB0aGlzIGN1cnJl
bnRseS4gKi8NCj4gKwlpZiAoaW9tbXVfZG9tYWluX2hhc19jYXAoZ3JvdXAtPmlvbW11LT5kb21h
aW4sDQo+ICsJCQkJIElPTU1VX0NBUF9DQUNIRV9DT0hFUkVOQ1kpICE9DQo+ICsJICAgIGlvbW11
X2RvbWFpbl9oYXNfY2FwKG5ldy0+aW9tbXUtPmRvbWFpbiwNCj4gKwkJCQkgSU9NTVVfQ0FQX0NB
Q0hFX0NPSEVSRU5DWSkpIHsNCj4gKwkJX192ZmlvX2Nsb3NlX2lvbW11KG5ldy0+aW9tbXUpOw0K
PiArCQlpZiAob3BlbmVkKQ0KPiArCQkJX192ZmlvX2Nsb3NlX2lvbW11KGdyb3VwLT5pb21tdSk7
DQo+ICsJCXJldCA9IC1FSU5WQUw7DQo+ICsJCWdvdG8gb3V0Ow0KPiArCX0NCj4gKw0KPiArCS8q
IENsb3NlIHRoZSBpb21tdSBmb3IgdGhlIG1lcmdlLWVlIGFuZCBhdHRhY2ggYWxsIGl0cyBkZXZp
Y2VzDQo+ICsJICogdG8gdGhlIG1lcmdlLWVyIGlvbW11LiAqLw0KPiArCV9fdmZpb19jbG9zZV9p
b21tdShuZXctPmlvbW11KTsNCj4gKw0KPiArCXJldCA9IF9fdmZpb19pb21tdV9hdHRhY2hfZ3Jv
dXAoZ3JvdXAtPmlvbW11LCBuZXcpOw0KPiArCWlmIChyZXQpDQo+ICsJCWdvdG8gb3V0Ow0KPiAr
DQo+ICsJLyogc2V0X2lvbW11IHVubGlua3MgbmV3IGZyb20gdGhlIGlvbW11LCBzbyBzYXZlIGEg
cG9pbnRlciB0byBpdA0KPiAqLw0KPiArCW9sZF9pb21tdSA9IG5ldy0+aW9tbXU7DQo+ICsJX192
ZmlvX2dyb3VwX3NldF9pb21tdShuZXcsIGdyb3VwLT5pb21tdSk7DQo+ICsJa2ZyZWUob2xkX2lv
bW11KTsNCj4gKw0KPiArb3V0Og0KPiArCWZwdXQoZmlsZSk7DQo+ICtvdXRfbm9wdXQ6DQo+ICsJ
bXV0ZXhfdW5sb2NrKCZ2ZmlvLmxvY2spOw0KPiArCXJldHVybiByZXQ7DQo+ICt9DQo+ICsNCj4g
Ky8qIFVubWVyZ2UgdGhlIGdyb3VwIHBvaW50ZWQgdG8gYnkgZmQgZnJvbSBncm91cC4gKi8NCj4g
K3N0YXRpYyBpbnQgdmZpb19ncm91cF91bm1lcmdlKHN0cnVjdCB2ZmlvX2dyb3VwICpncm91cCwg
aW50IGZkKQ0KPiArew0KPiArCXN0cnVjdCB2ZmlvX2dyb3VwICpuZXc7DQo+ICsJc3RydWN0IHZm
aW9faW9tbXUgKm5ld19pb21tdTsNCj4gKwlzdHJ1Y3QgZmlsZSAqZmlsZTsNCj4gKwlpbnQgcmV0
ID0gMDsNCj4gKw0KPiArCS8qIFNpbmNlIHRoZSBtZXJnZS1vdXQgZ3JvdXAgaXMgYWxyZWFkeSBv
cGVuZWQsIGl0IG5lZWRzIHRvDQo+ICsJICogaGF2ZSBhbiBpb21tdSBzdHJ1Y3QgYXNzb2NpYXRl
ZCB3aXRoIGl0LiAqLw0KPiArCW5ld19pb21tdSA9IGt6YWxsb2Moc2l6ZW9mKCpuZXdfaW9tbXUp
LCBHRlBfS0VSTkVMKTsNCj4gKwlpZiAoIW5ld19pb21tdSkNCj4gKwkJcmV0dXJuIC1FTk9NRU07
DQo+ICsNCj4gKwlJTklUX0xJU1RfSEVBRCgmbmV3X2lvbW11LT5ncm91cF9saXN0KTsNCj4gKwlJ
TklUX0xJU1RfSEVBRCgmbmV3X2lvbW11LT5kbV9saXN0KTsNCj4gKwltdXRleF9pbml0KCZuZXdf
aW9tbXUtPmRnYXRlKTsNCj4gKwluZXdfaW9tbXUtPmJ1cyA9IGdyb3VwLT5idXM7DQo+ICsNCj4g
KwltdXRleF9sb2NrKCZ2ZmlvLmxvY2spOw0KPiArDQo+ICsJZmlsZSA9IGZnZXQoZmQpOw0KPiAr
CWlmICghZmlsZSkgew0KPiArCQlyZXQgPSAtRUJBREY7DQo+ICsJCWdvdG8gb3V0X25vcHV0Ow0K
PiArCX0NCj4gKw0KPiArCS8qIFNhbml0eSBjaGVjaywgaXMgdGhpcyByZWFsbHkgb3VyIGZkPyAq
Lw0KPiArCWlmIChmaWxlLT5mX29wICE9ICZ2ZmlvX2dyb3VwX2ZvcHMpIHsNCj4gKwkJcmV0ID0g
LUVJTlZBTDsNCj4gKwkJZ290byBvdXQ7DQo+ICsJfQ0KPiArDQo+ICsJbmV3ID0gZmlsZS0+cHJp
dmF0ZV9kYXRhOw0KPiArCWlmICghbmV3IHx8IG5ldyA9PSBncm91cCB8fCBuZXctPmlvbW11ICE9
IGdyb3VwLT5pb21tdSkgew0KPiArCQlyZXQgPSAtRUlOVkFMOw0KPiArCQlnb3RvIG91dDsNCj4g
Kwl9DQo+ICsNCj4gKwkvKiBXZSBjYW4ndCBtZXJnZS1vdXQgYSBncm91cCB3aXRoIGRldmljZXMg
c3RpbGwgaW4gdXNlLiAqLw0KPiArCWlmIChfX3ZmaW9fZ3JvdXBfZGV2c19pbnVzZShuZXcpKSB7
DQo+ICsJCXJldCA9IC1FQlVTWTsNCj4gKwkJZ290byBvdXQ7DQo+ICsJfQ0KPiArDQo+ICsJX192
ZmlvX2lvbW11X2RldGFjaF9ncm91cChncm91cC0+aW9tbXUsIG5ldyk7DQo+ICsJX192ZmlvX2dy
b3VwX3NldF9pb21tdShuZXcsIG5ld19pb21tdSk7DQo+ICsNCj4gK291dDoNCj4gKwlmcHV0KGZp
bGUpOw0KPiArb3V0X25vcHV0Og0KPiArCWlmIChyZXQpDQo+ICsJCWtmcmVlKG5ld19pb21tdSk7
DQo+ICsJbXV0ZXhfdW5sb2NrKCZ2ZmlvLmxvY2spOw0KPiArCXJldHVybiByZXQ7DQo+ICt9DQo+
ICsNCj4gKy8qIEdldCBhIG5ldyBpb21tdSBmaWxlIGRlc2NyaXB0b3IuICBUaGlzIHdpbGwgb3Bl
biB0aGUgaW9tbXUsIHNldHRpbmcNCj4gKyAqIHRoZSBjdXJyZW50LT5tbSBvd25lcnNoaXAgaWYg
aXQncyBub3QgYWxyZWFkeSBzZXQuICovDQo+ICtzdGF0aWMgaW50IHZmaW9fZ3JvdXBfZ2V0X2lv
bW11X2ZkKHN0cnVjdCB2ZmlvX2dyb3VwICpncm91cCkNCj4gK3sNCj4gKwlpbnQgcmV0ID0gMDsN
Cj4gKw0KPiArCW11dGV4X2xvY2soJnZmaW8ubG9jayk7DQo+ICsNCj4gKwlpZiAoIWdyb3VwLT5p
b21tdS0+ZG9tYWluKSB7DQo+ICsJCXJldCA9IF9fdmZpb19vcGVuX2lvbW11KGdyb3VwLT5pb21t
dSk7DQo+ICsJCWlmIChyZXQpDQo+ICsJCQlnb3RvIG91dDsNCj4gKwl9DQo+ICsNCj4gKwlyZXQg
PSBhbm9uX2lub2RlX2dldGZkKCJbdmZpby1pb21tdV0iLCAmdmZpb19pb21tdV9mb3BzLA0KPiAr
CQkJICAgICAgIGdyb3VwLT5pb21tdSwgT19SRFdSKTsNCj4gKwlpZiAocmV0IDwgMCkNCj4gKwkJ
Z290byBvdXQ7DQo+ICsNCj4gKwlncm91cC0+aW9tbXUtPnJlZmNudCsrOw0KPiArb3V0Og0KPiAr
CW11dGV4X3VubG9jaygmdmZpby5sb2NrKTsNCj4gKwlyZXR1cm4gcmV0Ow0KPiArfQ0KPiArDQo+
ICsvKiBHZXQgYSBuZXcgZGV2aWNlIGZpbGUgZGVzY3JpcHRvci4gIFRoaXMgd2lsbCBvcGVuIHRo
ZSBpb21tdSwNCj4gc2V0dGluZw0KPiArICogdGhlIGN1cnJlbnQtPm1tIG93bmVyc2hpcCBpZiBp
dCdzIG5vdCBhbHJlYWR5IHNldC4gIEl0J3MgZGlmZmljdWx0DQo+IHRvDQo+ICsgKiBzcGVjaWZ5
IHRoZSByZXF1aXJlbWVudHMgZm9yIG1hdGNoaW5nIGEgdXNlciBzdXBwbGllZCBidWZmZXIgdG8g
YQ0KPiArICogZGV2aWNlLCBzbyB3ZSB1c2UgYSB2ZmlvIGRyaXZlciBjYWxsYmFjayB0byB0ZXN0
IGZvciBhIG1hdGNoLiAgRm9yDQo+ICsgKiBQQ0ksIGRldl9uYW1lKGRldikgaXMgdW5pcXVlLCBi
dXQgb3RoZXIgZHJpdmVycyBtYXkgcmVxdWlyZQ0KPiBpbmNsdWRpbmcNCj4gKyAqIGEgcGFyZW50
IGRldmljZSBzdHJpbmcuICovDQo+ICtzdGF0aWMgaW50IHZmaW9fZ3JvdXBfZ2V0X2RldmljZV9m
ZChzdHJ1Y3QgdmZpb19ncm91cCAqZ3JvdXAsIGNoYXINCj4gKmJ1ZikNCj4gK3sNCj4gKwlzdHJ1
Y3QgdmZpb19pb21tdSAqaW9tbXUgPSBncm91cC0+aW9tbXU7DQo+ICsJc3RydWN0IGxpc3RfaGVh
ZCAqZ3BvczsNCj4gKwlpbnQgcmV0ID0gLUVOT0RFVjsNCj4gKw0KPiArCW11dGV4X2xvY2soJnZm
aW8ubG9jayk7DQo+ICsNCj4gKwlpZiAoIWlvbW11LT5kb21haW4pIHsNCj4gKwkJcmV0ID0gX192
ZmlvX29wZW5faW9tbXUoaW9tbXUpOw0KPiArCQlpZiAocmV0KQ0KPiArCQkJZ290byBvdXQ7DQo+
ICsJfQ0KPiArDQo+ICsJbGlzdF9mb3JfZWFjaChncG9zLCAmaW9tbXUtPmdyb3VwX2xpc3QpIHsN
Cj4gKwkJc3RydWN0IGxpc3RfaGVhZCAqZHBvczsNCj4gKw0KPiArCQlncm91cCA9IGxpc3RfZW50
cnkoZ3Bvcywgc3RydWN0IHZmaW9fZ3JvdXAsIGlvbW11X25leHQpOw0KPiArDQo+ICsJCWxpc3Rf
Zm9yX2VhY2goZHBvcywgJmdyb3VwLT5kZXZpY2VfbGlzdCkgew0KPiArCQkJc3RydWN0IHZmaW9f
ZGV2aWNlICpkZXZpY2U7DQo+ICsNCj4gKwkJCWRldmljZSA9IGxpc3RfZW50cnkoZHBvcywNCj4g
KwkJCQkJICAgIHN0cnVjdCB2ZmlvX2RldmljZSwgZGV2aWNlX25leHQpOw0KPiArDQo+ICsJCQlp
ZiAoZGV2aWNlLT5vcHMtPm1hdGNoKGRldmljZS0+ZGV2LCBidWYpKSB7DQo+ICsJCQkJc3RydWN0
IGZpbGUgKmZpbGU7DQo+ICsNCj4gKwkJCQlpZiAoZGV2aWNlLT5vcHMtPmdldChkZXZpY2UtPmRl
dmljZV9kYXRhKSkgew0KPiArCQkJCQlyZXQgPSAtRUZBVUxUOw0KPiArCQkJCQlnb3RvIG91dDsN
Cj4gKwkJCQl9DQo+ICsNCj4gKwkJCQkvKiBXZSBjYW4ndCB1c2UgYW5vbl9pbm9kZV9nZXRmZCgp
LCBsaWtlIGFib3ZlDQo+ICsJCQkJICogYmVjYXVzZSB3ZSBuZWVkIHRvIG1vZGlmeSB0aGUgZl9t
b2RlIGZsYWdzDQo+ICsJCQkJICogZGlyZWN0bHkgdG8gYWxsb3cgbW9yZSB0aGFuIGp1c3QgaW9j
dGxzICovDQo+ICsJCQkJcmV0ID0gZ2V0X3VudXNlZF9mZCgpOw0KPiArCQkJCWlmIChyZXQgPCAw
KSB7DQo+ICsJCQkJCWRldmljZS0+b3BzLT5wdXQoZGV2aWNlLT5kZXZpY2VfZGF0YSk7DQo+ICsJ
CQkJCWdvdG8gb3V0Ow0KPiArCQkJCX0NCj4gKw0KPiArCQkJCWZpbGUgPSBhbm9uX2lub2RlX2dl
dGZpbGUoIlt2ZmlvLWRldmljZV0iLA0KPiArCQkJCQkJCSAgJnZmaW9fZGV2aWNlX2ZvcHMsDQo+
ICsJCQkJCQkJICBkZXZpY2UsIE9fUkRXUik7DQo+ICsJCQkJaWYgKElTX0VSUihmaWxlKSkgew0K
PiArCQkJCQlwdXRfdW51c2VkX2ZkKHJldCk7DQo+ICsJCQkJCXJldCA9IFBUUl9FUlIoZmlsZSk7
DQo+ICsJCQkJCWRldmljZS0+b3BzLT5wdXQoZGV2aWNlLT5kZXZpY2VfZGF0YSk7DQo+ICsJCQkJ
CWdvdG8gb3V0Ow0KPiArCQkJCX0NCj4gKw0KPiArCQkJCS8qIFRvZG86IGFkZCBhbiBhbm9uX2lu
b2RlIGludGVyZmFjZSB0byBkbw0KPiArCQkJCSAqIHRoaXMuICBBcHBlYXJzIHRvIGJlIG1pc3Np
bmcgYnkgbGFjayBvZg0KPiArCQkJCSAqIG5lZWQgcmF0aGVyIHRoYW4gZXhwbGljaXRseSBwcmV2
ZW50ZWQuDQo+ICsJCQkJICogTm93IHRoZXJlJ3MgbmVlZC4gKi8NCj4gKwkJCQlmaWxlLT5mX21v
ZGUgfD0gKEZNT0RFX0xTRUVLIHwNCj4gKwkJCQkJCSBGTU9ERV9QUkVBRCB8DQo+ICsJCQkJCQkg
Rk1PREVfUFdSSVRFKTsNCj4gKw0KPiArCQkJCWZkX2luc3RhbGwocmV0LCBmaWxlKTsNCj4gKw0K
PiArCQkJCWRldmljZS0+cmVmY250Kys7DQo+ICsJCQkJZ290byBvdXQ7DQo+ICsJCQl9DQo+ICsJ
CX0NCj4gKwl9DQo+ICtvdXQ6DQo+ICsJbXV0ZXhfdW5sb2NrKCZ2ZmlvLmxvY2spOw0KPiArCXJl
dHVybiByZXQ7DQo+ICt9DQo+ICsNCj4gK3N0YXRpYyBsb25nIHZmaW9fZ3JvdXBfdW5sX2lvY3Rs
KHN0cnVjdCBmaWxlICpmaWxlcCwNCj4gKwkJCQkgdW5zaWduZWQgaW50IGNtZCwgdW5zaWduZWQg
bG9uZyBhcmcpDQo+ICt7DQo+ICsJc3RydWN0IHZmaW9fZ3JvdXAgKmdyb3VwID0gZmlsZXAtPnBy
aXZhdGVfZGF0YTsNCj4gKw0KPiArCWlmIChjbWQgPT0gVkZJT19HUk9VUF9HRVRfRkxBR1MpIHsN
Cj4gKwkJdTY0IGZsYWdzID0gMDsNCj4gKw0KPiArCQltdXRleF9sb2NrKCZ2ZmlvLmxvY2spOw0K
PiArCQlpZiAoX192ZmlvX2lvbW11X3ZpYWJsZShncm91cC0+aW9tbXUpKQ0KPiArCQkJZmxhZ3Mg
fD0gVkZJT19HUk9VUF9GTEFHU19WSUFCTEU7DQo+ICsJCW11dGV4X3VubG9jaygmdmZpby5sb2Nr
KTsNCj4gKw0KPiArCQlpZiAoZ3JvdXAtPmlvbW11LT5tbSkNCj4gKwkJCWZsYWdzIHw9IFZGSU9f
R1JPVVBfRkxBR1NfTU1fTE9DS0VEOw0KPiArDQo+ICsJCXJldHVybiBwdXRfdXNlcihmbGFncywg
KHU2NCBfX3VzZXIgKilhcmcpOw0KPiArCX0NCj4gKw0KPiArCS8qIEJlbG93IGNvbW1hbmRzIGFy
ZSByZXN0cmljdGVkIG9uY2UgdGhlIG1tIGlzIHNldCAqLw0KPiArCWlmIChncm91cC0+aW9tbXUt
Pm1tICYmIGdyb3VwLT5pb21tdS0+bW0gIT0gY3VycmVudC0+bW0pDQo+ICsJCXJldHVybiAtRVBF
Uk07DQo+ICsJaWYgKGNtZCA9PSBWRklPX0dST1VQX01FUkdFIHx8IGNtZCA9PSBWRklPX0dST1VQ
X1VOTUVSR0UpIHsNCj4gKwkJaW50IGZkOw0KPiArDQo+ICsJCWlmIChnZXRfdXNlcihmZCwgKGlu
dCBfX3VzZXIgKilhcmcpKQ0KPiArCQkJcmV0dXJuIC1FRkFVTFQ7DQo+ICsJCWlmIChmZCA8IDAp
DQo+ICsJCQlyZXR1cm4gLUVJTlZBTDsNCj4gKw0KPiArCQlpZiAoY21kID09IFZGSU9fR1JPVVBf
TUVSR0UpDQo+ICsJCQlyZXR1cm4gdmZpb19ncm91cF9tZXJnZShncm91cCwgZmQpOw0KPiArCQll
bHNlDQo+ICsJCQlyZXR1cm4gdmZpb19ncm91cF91bm1lcmdlKGdyb3VwLCBmZCk7DQo+ICsJfSBl
bHNlIGlmIChjbWQgPT0gVkZJT19HUk9VUF9HRVRfSU9NTVVfRkQpIHsNCj4gKwkJcmV0dXJuIHZm
aW9fZ3JvdXBfZ2V0X2lvbW11X2ZkKGdyb3VwKTsNCj4gKwl9IGVsc2UgaWYgKGNtZCA9PSBWRklP
X0dST1VQX0dFVF9ERVZJQ0VfRkQpIHsNCj4gKwkJY2hhciAqYnVmOw0KPiArCQlpbnQgcmV0Ow0K
PiArDQo+ICsJCWJ1ZiA9IHN0cm5kdXBfdXNlcigoY29uc3QgY2hhciBfX3VzZXIgKilhcmcsIFBB
R0VfU0laRSk7DQo+ICsJCWlmIChJU19FUlIoYnVmKSkNCj4gKwkJCXJldHVybiBQVFJfRVJSKGJ1
Zik7DQo+ICsNCj4gKwkJcmV0ID0gdmZpb19ncm91cF9nZXRfZGV2aWNlX2ZkKGdyb3VwLCBidWYp
Ow0KPiArCQlrZnJlZShidWYpOw0KPiArCQlyZXR1cm4gcmV0Ow0KPiArCX0NCj4gKw0KPiArCXJl
dHVybiAtRU5PU1lTOw0KPiArfQ0KPiArDQo+ICsjaWZkZWYgQ09ORklHX0NPTVBBVA0KPiArc3Rh
dGljIGxvbmcgdmZpb19ncm91cF9jb21wYXRfaW9jdGwoc3RydWN0IGZpbGUgKmZpbGVwLA0KPiAr
CQkJCSAgICB1bnNpZ25lZCBpbnQgY21kLCB1bnNpZ25lZCBsb25nIGFyZykNCj4gK3sNCj4gKwlh
cmcgPSAodW5zaWduZWQgbG9uZyljb21wYXRfcHRyKGFyZyk7DQo+ICsJcmV0dXJuIHZmaW9fZ3Jv
dXBfdW5sX2lvY3RsKGZpbGVwLCBjbWQsIGFyZyk7DQo+ICt9DQo+ICsjZW5kaWYJLyogQ09ORklH
X0NPTVBBVCAqLw0KPiArDQo+ICtzdGF0aWMgY29uc3Qgc3RydWN0IGZpbGVfb3BlcmF0aW9ucyB2
ZmlvX2dyb3VwX2ZvcHMgPSB7DQo+ICsJLm93bmVyCQk9IFRISVNfTU9EVUxFLA0KPiArCS5vcGVu
CQk9IHZmaW9fZ3JvdXBfb3BlbiwNCj4gKwkucmVsZWFzZQk9IHZmaW9fZ3JvdXBfcmVsZWFzZSwN
Cj4gKwkudW5sb2NrZWRfaW9jdGwJPSB2ZmlvX2dyb3VwX3VubF9pb2N0bCwNCj4gKyNpZmRlZiBD
T05GSUdfQ09NUEFUDQo+ICsJLmNvbXBhdF9pb2N0bAk9IHZmaW9fZ3JvdXBfY29tcGF0X2lvY3Rs
LA0KPiArI2VuZGlmDQo+ICt9Ow0KPiArDQo+ICsvKiBpb21tdSBmZCByZWxlYXNlIGhvb2sgKi8N
Cg0KR2l2ZW4gdmZpb19kZXZpY2VfcmVsZWFzZSBhbmQNCiAgICAgIHZmaW9fZ3JvdXBfcmVsZWFz
ZSAoaWUsIDFzdCBvYmplY3QsIDJuZCBvcGVyYXRpb24pLCBJIHdhcw0KZ29pbmcgdG8gc3VnZ2Vz
dCByZW5hbWluZyB0aGUgZm4gYmVsb3cgdG8gdmZpb19pb21tdV9yZWxlYXNlLCBidXQNCnRoZW4g
SSBzYXcgdGhlIGxhdHRlciBuYW1lIGJlaW5nIGFscmVhZHkgdXNlZCBpbiB2ZmlvX2lvbW11LmMg
Li4uDQphIGJpdCBjb25mdXNpbmcgYnV0IEkgZ3Vlc3MgaXQncyBvayB0aGVuLg0KDQo+ICtpbnQg
dmZpb19yZWxlYXNlX2lvbW11KHN0cnVjdCB2ZmlvX2lvbW11ICppb21tdSkNCj4gK3sNCj4gKwly
ZXR1cm4gdmZpb19kb19yZWxlYXNlKCZpb21tdS0+cmVmY250LCBpb21tdSk7DQo+ICt9DQo+ICsN
Cj4gKy8qDQo+ICsgKiBWRklPIGRyaXZlciBBUEkNCj4gKyAqLw0KPiArDQo+ICsvKiBBZGQgYSBu
ZXcgZGV2aWNlIHRvIHRoZSB2ZmlvIGZyYW1ld29yayB3aXRoIGFzc29jaWF0ZWQgdmZpbyBkcml2
ZXINCj4gKyAqIGNhbGxiYWNrcy4gIFRoaXMgaXMgdGhlIGVudHJ5IHBvaW50IGZvciB2ZmlvIGRy
aXZlcnMgdG8gcmVnaXN0ZXINCj4gZGV2aWNlcy4gKi8NCj4gK2ludCB2ZmlvX2dyb3VwX2FkZF9k
ZXYoc3RydWN0IGRldmljZSAqZGV2LCBjb25zdCBzdHJ1Y3QNCj4gdmZpb19kZXZpY2Vfb3BzICpv
cHMpDQo+ICt7DQo+ICsJc3RydWN0IGxpc3RfaGVhZCAqcG9zOw0KPiArCXN0cnVjdCB2ZmlvX2dy
b3VwICpncm91cCA9IE5VTEw7DQo+ICsJc3RydWN0IHZmaW9fZGV2aWNlICpkZXZpY2UgPSBOVUxM
Ow0KPiArCXVuc2lnbmVkIGludCBncm91cGlkOw0KPiArCWludCByZXQgPSAwOw0KPiArCWJvb2wg
bmV3X2dyb3VwID0gZmFsc2U7DQo+ICsNCj4gKwlpZiAoIW9wcykNCj4gKwkJcmV0dXJuIC1FSU5W
QUw7DQo+ICsNCj4gKwlpZiAoaW9tbXVfZGV2aWNlX2dyb3VwKGRldiwgJmdyb3VwaWQpKQ0KPiAr
CQlyZXR1cm4gLUVOT0RFVjsNCj4gKw0KPiArCW11dGV4X2xvY2soJnZmaW8ubG9jayk7DQo+ICsN
Cj4gKwlsaXN0X2Zvcl9lYWNoKHBvcywgJnZmaW8uZ3JvdXBfbGlzdCkgew0KPiArCQlncm91cCA9
IGxpc3RfZW50cnkocG9zLCBzdHJ1Y3QgdmZpb19ncm91cCwgZ3JvdXBfbmV4dCk7DQo+ICsJCWlm
IChncm91cC0+Z3JvdXBpZCA9PSBncm91cGlkKQ0KPiArCQkJYnJlYWs7DQo+ICsJCWdyb3VwID0g
TlVMTDsNCj4gKwl9DQo+ICsNCj4gKwlpZiAoIWdyb3VwKSB7DQo+ICsJCWludCBtaW5vcjsNCj4g
Kw0KPiArCQlpZiAodW5saWtlbHkoaWRyX3ByZV9nZXQoJnZmaW8uaWRyLCBHRlBfS0VSTkVMKSA9
PSAwKSkgew0KPiArCQkJcmV0ID0gLUVOT01FTTsNCj4gKwkJCWdvdG8gb3V0Ow0KPiArCQl9DQo+
ICsNCj4gKwkJZ3JvdXAgPSBremFsbG9jKHNpemVvZigqZ3JvdXApLCBHRlBfS0VSTkVMKTsNCj4g
KwkJaWYgKCFncm91cCkgew0KPiArCQkJcmV0ID0gLUVOT01FTTsNCj4gKwkJCWdvdG8gb3V0Ow0K
PiArCQl9DQo+ICsNCj4gKwkJZ3JvdXAtPmdyb3VwaWQgPSBncm91cGlkOw0KPiArCQlJTklUX0xJ
U1RfSEVBRCgmZ3JvdXAtPmRldmljZV9saXN0KTsNCj4gKw0KPiArCQlyZXQgPSBpZHJfZ2V0X25l
dygmdmZpby5pZHIsIGdyb3VwLCAmbWlub3IpOw0KPiArCQlpZiAocmV0ID09IDAgJiYgbWlub3Ig
PiBNSU5PUk1BU0spIHsNCj4gKwkJCWlkcl9yZW1vdmUoJnZmaW8uaWRyLCBtaW5vcik7DQo+ICsJ
CQlrZnJlZShncm91cCk7DQo+ICsJCQlyZXQgPSAtRU5PU1BDOw0KPiArCQkJZ290byBvdXQ7DQo+
ICsJCX0NCj4gKw0KPiArCQlncm91cC0+ZGV2dCA9IE1LREVWKE1BSk9SKHZmaW8uZGV2dCksIG1p
bm9yKTsNCj4gKwkJZGV2aWNlX2NyZWF0ZSh2ZmlvLmNsYXNzLCBOVUxMLCBncm91cC0+ZGV2dCwN
Cj4gKwkJCSAgICAgIGdyb3VwLCAiJXUiLCBncm91cGlkKTsNCj4gKw0KPiArCQlncm91cC0+YnVz
ID0gZGV2LT5idXM7DQo+ICsJCWxpc3RfYWRkKCZncm91cC0+Z3JvdXBfbmV4dCwgJnZmaW8uZ3Jv
dXBfbGlzdCk7DQo+ICsJCW5ld19ncm91cCA9IHRydWU7DQo+ICsJfSBlbHNlIHsNCj4gKwkJaWYg
KGdyb3VwLT5idXMgIT0gZGV2LT5idXMpIHsNCj4gKwkJCXByaW50ayhLRVJOX1dBUk5JTkcNCj4g
KwkJCSAgICAgICAiRXJyb3I6IElPTU1VIGdyb3VwIElEIGNvbmZsaWN0LiAgR3JvdXAgSUQgJXUN
Cj4gIg0KPiArCQkJCSJvbiBib3RoIGJ1cyAlcyBhbmQgJXNcbiIsIGdyb3VwaWQsDQo+ICsJCQkJ
Z3JvdXAtPmJ1cy0+bmFtZSwgZGV2LT5idXMtPm5hbWUpOw0KPiArCQkJcmV0ID0gLUVGQVVMVDsN
Cj4gKwkJCWdvdG8gb3V0Ow0KPiArCQl9DQo+ICsNCj4gKwkJbGlzdF9mb3JfZWFjaChwb3MsICZn
cm91cC0+ZGV2aWNlX2xpc3QpIHsNCj4gKwkJCWRldmljZSA9IGxpc3RfZW50cnkocG9zLA0KPiAr
CQkJCQkgICAgc3RydWN0IHZmaW9fZGV2aWNlLCBkZXZpY2VfbmV4dCk7DQo+ICsJCQlpZiAoZGV2
aWNlLT5kZXYgPT0gZGV2KQ0KPiArCQkJCWJyZWFrOw0KPiArCQkJZGV2aWNlID0gTlVMTDsNCj4g
* RE: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-09 8:11 ` Christian Benvenuti (benve)
@ 2011-11-09 18:02 ` Alex Williamson
2011-11-09 21:08 ` Christian Benvenuti (benve)
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-09 18:02 UTC (permalink / raw)
To: Christian Benvenuti (benve)
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf,
Aaron Fabbri (aafabbri), B08248, B07421, avi, konrad.wilk, kvm,
qemu-devel, iommu, linux-pci
On Wed, 2011-11-09 at 02:11 -0600, Christian Benvenuti (benve) wrote:
> I have not gone through the whole patch yet, but here are
> my first comments/questions about the code in vfio_main.c
> (and pci/vfio_pci.c).
Thanks! Comments inline...
> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, November 03, 2011 1:12 PM
> > To: chrisw@sous-sol.org; aik@au1.ibm.com; pmac@au1.ibm.com;
> > dwg@au1.ibm.com; joerg.roedel@amd.com; agraf@suse.de; Christian
> > Benvenuti (benve); Aaron Fabbri (aafabbri); B08248@freescale.com;
> > B07421@freescale.com; avi@redhat.com; konrad.wilk@oracle.com;
> > kvm@vger.kernel.org; qemu-devel@nongnu.org; iommu@lists.linux-
> > foundation.org; linux-pci@vger.kernel.org
> > Subject: [RFC PATCH] vfio: VFIO Driver core framework
>
> <snip>
>
> > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > new file mode 100644
> > index 0000000..6169356
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_main.c
> > @@ -0,0 +1,1151 @@
> > +/*
> > + * VFIO framework
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> > + * Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#include <linux/cdev.h>
> > +#include <linux/compat.h>
> > +#include <linux/device.h>
> > +#include <linux/file.h>
> > +#include <linux/anon_inodes.h>
> > +#include <linux/fs.h>
> > +#include <linux/idr.h>
> > +#include <linux/iommu.h>
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/string.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/wait.h>
> > +
> > +#include "vfio_private.h"
> > +
> > +#define DRIVER_VERSION "0.2"
> > +#define DRIVER_AUTHOR "Alex Williamson <alex.williamson@redhat.com>"
> > +#define DRIVER_DESC "VFIO - User Level meta-driver"
> > +
> > +static int allow_unsafe_intrs;
> > +module_param(allow_unsafe_intrs, int, 0);
> > +MODULE_PARM_DESC(allow_unsafe_intrs,
> > + "Allow use of IOMMUs which do not support interrupt
> > remapping");
> > +
> > +static struct vfio {
> > + dev_t devt;
> > + struct cdev cdev;
> > + struct list_head group_list;
> > + struct mutex lock;
> > + struct kref kref;
> > + struct class *class;
> > + struct idr idr;
> > + wait_queue_head_t release_q;
> > +} vfio;
> > +
> > +static const struct file_operations vfio_group_fops;
> > +extern const struct file_operations vfio_iommu_fops;
> > +
> > +struct vfio_group {
> > + dev_t devt;
> > + unsigned int groupid;
>
> This groupid is returned by the device_group callback you recently added
> with a separate (not yet in tree) IOMMU patch.
> Is it correct to say that the scope of this ID is the bus the iommu
> belongs to (but you use it as if it was global)?
> I believe there is nothing right now to ensure the uniqueness of such
> ID across bus types (assuming there will be other bus drivers in the
> future besides vfio-pci).
> If that's the case, the vfio.group_list global list and the __vfio_lookup_dev
> routine should be changed to account for the bus too?
> Oops, I just saw the error msg in vfio_group_add_dev about the group id conflict.
> Is that warning related to what I mentioned above?
Yeah, this is a concern, but I can't think of a system where we would
manifest a collision. The IOMMU driver is expected to provide unique
groupids for all devices below them, but we could imagine a system that
implements two different bus_types, each with a different IOMMU driver
and we have no coordination between them. Perhaps since we have
iommu_ops per bus, we should also expose the bus in the vfio group path,
ie. /dev/vfio/%s/%u, dev->bus->name, iommu_device_group(dev,..). This
means userspace would need to do a readlink of the subsystem entry where
it finds the iommu_group to find the vfio group. Reasonable?
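Roughly, I'd picture userspace doing something like the sketch below; the
"iommu_group" attribute name and the exact sysfs layout are assumptions on
my part, just to illustrate the lookup:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <libgen.h>

/* Open the vfio group chardev for a device, given its sysfs directory */
int open_vfio_group(const char *sysfs_dev)
{
        char path[PATH_MAX], link[PATH_MAX], node[PATH_MAX];
        unsigned int group;
        ssize_t len;
        FILE *f;

        /* bus name comes from the "subsystem" symlink */
        snprintf(path, sizeof(path), "%s/subsystem", sysfs_dev);
        len = readlink(path, link, sizeof(link) - 1);
        if (len < 0)
                return -1;
        link[len] = 0;

        /* group number from a hypothetical "iommu_group" attribute */
        snprintf(path, sizeof(path), "%s/iommu_group", sysfs_dev);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%u", &group) != 1) {
                fclose(f);
                return -1;
        }
        fclose(f);

        snprintf(node, sizeof(node), "/dev/vfio/%s/%u", basename(link), group);
        return open(node, O_RDWR);
}

e.g. open_vfio_group("/sys/bus/pci/devices/0000:06:0d.0") for the PCI case.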
> > + struct bus_type *bus;
> > + struct vfio_iommu *iommu;
> > + struct list_head device_list;
> > + struct list_head iommu_next;
> > + struct list_head group_next;
> > + int refcnt;
> > +};
> > +
> > +struct vfio_device {
> > + struct device *dev;
> > + const struct vfio_device_ops *ops;
> > + struct vfio_iommu *iommu;
>
> I wonder if you need to have the 'iommu' field here.
> vfio_device.iommu is always set and reset together with
> vfio_group.iommu.
> Given that a vfio_device instance is always linked to a vfio_group
> instance, do we need this duplication? Is this duplication there
> because you do not want the double dereference device->group->iommu?
I think that was my initial goal in duplicating the pointer on the
device. I believe I was also at one point passing a vfio_device around
and needed the pointer. We seem to be getting along fine w/o that and I
don't see any performance-sensitive paths for getting from the device
to the iommu, so I'll see about removing it.
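If it does go away, a trivial helper keeps the call sites tidy; just a
sketch:

static inline struct vfio_iommu *vfio_device_iommu(struct vfio_device *device)
{
        return device->group ? device->group->iommu : NULL;
}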
> > + struct vfio_group *group;
> > + struct list_head device_next;
> > + bool attached;
> > + int refcnt;
> > + void *device_data;
> > +};
> > +
> > +/*
> > + * Helper functions called under vfio.lock
> > + */
> > +
> > +/* Return true if any devices within a group are opened */
> > +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> > +{
> > + struct list_head *pos;
> > +
> > + list_for_each(pos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + if (device->refcnt)
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/* Return true if any of the groups attached to an iommu are opened.
> > + * We can only tear apart merged groups when nothing is left open. */
> > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > +
> > + list_for_each(pos, &iommu->group_list) {
> > + struct vfio_group *group;
> > +
> > + group = list_entry(pos, struct vfio_group, iommu_next);
> > + if (group->refcnt)
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/* An iommu is "in use" if it has a file descriptor open or if any of
> > + * the groups assigned to the iommu have devices open. */
> > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > +
> > + if (iommu->refcnt)
> > + return true;
> > +
> > + list_for_each(pos, &iommu->group_list) {
> > + struct vfio_group *group;
> > +
> > + group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > + if (__vfio_group_devs_inuse(group))
> > + return true;
> > + }
> > + return false;
> > +}
>
> I looked at how you take care of ref counts ...
>
> This is how the tree of vfio_iommu/vfio_group/vfio_device data
> structures is organized (I'll use just iommu/group/dev to make
> the graph smaller):
>
> iommu
> / \
> / \
> group ... group
> / \ / \
> / \ / \
> dev .. dev dev .. dev
>
> This is how you get a file descriptor for the three kind of objects:
>
> - group : open /dev/vfio/xxx for group xxx
> - iommu : group ioctl VFIO_GROUP_GET_IOMMU_FD
> - device: group ioctl VFIO_GROUP_GET_DEVICE_FD
>
> Given the above topology, I would assume that:
>
> (1) an iommu is 'inuse' if : a) iommu refcnt > 0, or
> b) any of its groups is 'inuse'
>
> (2) a group is 'inuse' if : a) group refcnt > 0, or
> b) any of its devices is 'inuse'
>
> (3) a device is 'inuse' if : a) device refcnt > 0
(2) is a bit debatable. I've wrestled with this one for a while. The
vfio_iommu serves two purposes. First, it is the object we use for
managing iommu domains, which includes allocating domains and attaching
devices to domains. Group objects aren't involved here; they just
manage the set of devices. The second role is to manage merged groups,
because whether or not groups can be merged is a function of iommu
domain compatibility.
So if we look at "is the iommu in use?" ie. can I destroy the mapping
context, detach devices and free the domain, the reference count on the
group is irrelevant. The user has to have a device or iommu file
descriptor opened somewhere, across the group or merged group, for that
context to be maintained. A reasonable requirement, I think.
However, if we ask "is the group in use?" ie. can I not only destroy the
mappings above, but also automatically tear apart merged groups, then I
think we need to look at the group refcnt.
There's also a symmetry factor: the group is a benign entry point to
device access. It's only when device or iommu access is granted that
the group gains any real power. Therefore, shouldn't that power also be
removed when those access points are closed?
> You have coded the 'inuse' logic with these three routines:
>
> __vfio_iommu_inuse, which implements (1) above
>
> and
> __vfio_iommu_groups_inuse
Implements (2.a)
> __vfio_group_devs_inuse
Implements (2.b)
> which are used by __vfio_iommu_inuse.
> Why don't you check the group refcnt in __vfio_iommu_groups_inuse?
Hopefully explained above, but open for discussion.
> Would it make sense (and make the code more readable) to structure the
> nested refcnt/inuse check like this?
> (The numbers (1)(2)(3) refer to the three 'inuse' conditions above)
>
> (1)__vfio_iommu_inuse
> |
> +-> check iommu refcnt
> +-> __vfio_iommu_groups_inuse
> |
> +->LOOP: (2)__vfio_iommu_group_inuse<--MISSING
> |
> +-> check group refcnt<--MISSING
> +-> __vfio_group_devs_inuse()
> |
> +-> LOOP: (3)__vfio_group_dev_inuse<--MISSING
> |
> +-> check device refcnt
We currently do:
(1)__vfio_iommu_inuse
|
+-> check iommu refcnt
+-> __vfio_group_devs_inuse
|
+->LOOP: (2.b)__vfio_group_devs_inuse
|
+-> LOOP: (3) check device refcnt
If that passes, the iommu context can be dissolved and we follow up
with:
__vfio_iommu_groups_inuse
|
+-> LOOP: (2.a)__vfio_iommu_groups_inuse
|
+-> check group refcnt
If that passes, groups can also be umerged.
Is this right?
> > +static void __vfio_group_set_iommu(struct vfio_group *group,
> > + struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > +
> > + if (group->iommu)
> > + list_del(&group->iommu_next);
> > + if (iommu)
> > + list_add(&group->iommu_next, &iommu->group_list);
> > +
> > + group->iommu = iommu;
>
> If you remove the vfio_device.iommu field (as suggested above in a previous
> comment), the block below would not be needed anymore.
Yep, I'll try removing that and see how it plays out.
> > + list_for_each(pos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + device->iommu = iommu;
> > + }
> > +}
> > +
> > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> > + struct vfio_device *device)
> > +{
> > + BUG_ON(!iommu->domain && device->attached);
> > +
> > + if (!iommu->domain || !device->attached)
> > + return;
> > +
> > + iommu_detach_device(iommu->domain, device->dev);
> > + device->attached = false;
> > +}
> > +
> > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> > + struct vfio_group *group)
> > +{
> > + struct list_head *pos;
> > +
> > + list_for_each(pos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + __vfio_iommu_detach_dev(iommu, device);
> > + }
> > +}
> > +
> > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> > + struct vfio_device *device)
> > +{
> > + int ret;
> > +
> > + BUG_ON(device->attached);
> > +
> > + if (!iommu || !iommu->domain)
> > + return -EINVAL;
> > +
> > + ret = iommu_attach_device(iommu->domain, device->dev);
> > + if (!ret)
> > + device->attached = true;
> > +
> > + return ret;
> > +}
> > +
> > +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
> > + struct vfio_group *group)
> > +{
> > + struct list_head *pos;
> > +
> > + list_for_each(pos, &group->device_list) {
> > + struct vfio_device *device;
> > + int ret;
> > +
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + ret = __vfio_iommu_attach_dev(iommu, device);
> > + if (ret) {
> > + __vfio_iommu_detach_group(iommu, group);
> > + return ret;
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > +/* The iommu is viable, ie. ready to be configured, when all the
> > devices
> > + * for all the groups attached to the iommu are bound to their vfio
> > device
> > + * drivers (ex. vfio-pci). This sets the device_data private data
> > pointer. */
> > +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *gpos, *dpos;
> > +
> > + list_for_each(gpos, &iommu->group_list) {
> > + struct vfio_group *group;
> > + group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > + list_for_each(dpos, &group->device_list) {
> > + struct vfio_device *device;
> > + device = list_entry(dpos,
> > + struct vfio_device, device_next);
> > +
> > + if (!device->device_data)
> > + return false;
> > + }
> > + }
> > + return true;
> > +}
> > +
> > +static void __vfio_close_iommu(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > +
> > + if (!iommu->domain)
> > + return;
> > +
> > + list_for_each(pos, &iommu->group_list) {
> > + struct vfio_group *group;
> > + group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > + __vfio_iommu_detach_group(iommu, group);
> > + }
> > +
> > + vfio_iommu_unmapall(iommu);
> > +
> > + iommu_domain_free(iommu->domain);
> > + iommu->domain = NULL;
> > + iommu->mm = NULL;
> > +}
> > +
> > +/* Open the IOMMU. This gates all access to the iommu or device file
> > + * descriptors and sets current->mm as the exclusive user. */
>
> Given the fn vfio_group_open (ie, 1st object, 2nd operation), I would have
> called this one __vfio_iommu_open (instead of __vfio_open_iommu).
> Is it named __vfio_open_iommu to avoid a conflict with the namespace in vfio_iommu.c?
I would have expected that too, I'll look at renaming these.
> > +static int __vfio_open_iommu(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > + int ret;
> > +
> > + if (!__vfio_iommu_viable(iommu))
> > + return -EBUSY;
> > +
> > + if (iommu->domain)
> > + return -EINVAL;
> > +
> > + iommu->domain = iommu_domain_alloc(iommu->bus);
> > + if (!iommu->domain)
> > + return -EFAULT;
> > +
> > + list_for_each(pos, &iommu->group_list) {
> > + struct vfio_group *group;
> > + group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > + ret = __vfio_iommu_attach_group(iommu, group);
> > + if (ret) {
> > + __vfio_close_iommu(iommu);
> > + return ret;
> > + }
> > + }
> > +
> > + if (!allow_unsafe_intrs &&
> > + !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> > + __vfio_close_iommu(iommu);
> > + return -EFAULT;
> > + }
> > +
> > + iommu->cache = (iommu_domain_has_cap(iommu->domain,
> > + IOMMU_CAP_CACHE_COHERENCY) != 0);
> > + iommu->mm = current->mm;
> > +
> > + return 0;
> > +}
> > +
> > +/* Actively try to tear down the iommu and merged groups. If there
> > are no
> > + * open iommu or device fds, we close the iommu. If we close the
> > iommu and
> > + * there are also no open group fds, we can further dissolve the group
> > to
> > + * iommu association and free the iommu data structure. */
> > +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
> > +{
> > +
> > + if (__vfio_iommu_inuse(iommu))
> > + return -EBUSY;
> > +
> > + __vfio_close_iommu(iommu);
> > +
> > + if (!__vfio_iommu_groups_inuse(iommu)) {
> > + struct list_head *pos, *ppos;
> > +
> > + list_for_each_safe(pos, ppos, &iommu->group_list) {
> > + struct vfio_group *group;
> > +
> > + group = list_entry(pos, struct vfio_group,
> > iommu_next);
> > + __vfio_group_set_iommu(group, NULL);
> > + }
> > +
> > +
> > + kfree(iommu);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static struct vfio_device *__vfio_lookup_dev(struct device *dev)
> > +{
> > + struct list_head *gpos;
> > + unsigned int groupid;
> > +
> > + if (iommu_device_group(dev, &groupid))
> > + return NULL;
> > +
> > + list_for_each(gpos, &vfio.group_list) {
> > + struct vfio_group *group;
> > + struct list_head *dpos;
> > +
> > + group = list_entry(gpos, struct vfio_group, group_next);
> > +
> > + if (group->groupid != groupid)
> > + continue;
> > +
> > + list_for_each(dpos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(dpos,
> > + struct vfio_device, device_next);
> > +
> > + if (device->dev == dev)
> > + return device;
> > + }
> > + }
> > + return NULL;
> > +}
> > +
> > +/* All release paths simply decrement the refcnt, attempt to teardown
> > + * the iommu and merged groups, and wakeup anything that might be
> > + * waiting if we successfully dissolve anything. */
> > +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
> > +{
> > + bool wake;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + (*refcnt)--;
> > + wake = (__vfio_try_dissolve_iommu(iommu) == 0);
> > +
> > + mutex_unlock(&vfio.lock);
> > +
> > + if (wake)
> > + wake_up(&vfio.release_q);
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Device fops - passthrough to vfio device driver w/ device_data
> > + */
> > +static int vfio_device_release(struct inode *inode, struct file
> > *filep)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + vfio_do_release(&device->refcnt, device->iommu);
> > +
> > + device->ops->put(device->device_data);
> > +
> > + return 0;
> > +}
> > +
> > +static long vfio_device_unl_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + return device->ops->ioctl(device->device_data, cmd, arg);
> > +}
> > +
> > +static ssize_t vfio_device_read(struct file *filep, char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + return device->ops->read(device->device_data, buf, count, ppos);
> > +}
> > +
> > +static ssize_t vfio_device_write(struct file *filep, const char __user
> > *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + return device->ops->write(device->device_data, buf, count, ppos);
> > +}
> > +
> > +static int vfio_device_mmap(struct file *filep, struct vm_area_struct
> > *vma)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + return device->ops->mmap(device->device_data, vma);
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_device_compat_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + arg = (unsigned long)compat_ptr(arg);
> > + return vfio_device_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif /* CONFIG_COMPAT */
> > +
> > +const struct file_operations vfio_device_fops = {
> > + .owner = THIS_MODULE,
> > + .release = vfio_device_release,
> > + .read = vfio_device_read,
> > + .write = vfio_device_write,
> > + .unlocked_ioctl = vfio_device_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > + .compat_ioctl = vfio_device_compat_ioctl,
> > +#endif
> > + .mmap = vfio_device_mmap,
> > +};
> > +
> > +/*
> > + * Group fops
> > + */
> > +static int vfio_group_open(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_group *group;
> > + int ret = 0;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + group = idr_find(&vfio.idr, iminor(inode));
> > +
> > + if (!group) {
> > + ret = -ENODEV;
> > + goto out;
> > + }
> > +
> > + filep->private_data = group;
> > +
> > + if (!group->iommu) {
> > + struct vfio_iommu *iommu;
> > +
> > + iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > + if (!iommu) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + INIT_LIST_HEAD(&iommu->group_list);
> > + INIT_LIST_HEAD(&iommu->dm_list);
> > + mutex_init(&iommu->dgate);
> > + iommu->bus = group->bus;
> > + __vfio_group_set_iommu(group, iommu);
> > + }
> > + group->refcnt++;
> > +
> > +out:
> > + mutex_unlock(&vfio.lock);
> > +
> > + return ret;
> > +}
> > +
> > +static int vfio_group_release(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_group *group = filep->private_data;
> > +
> > + return vfio_do_release(&group->refcnt, group->iommu);
> > +}
> > +
> > +/* Attempt to merge the group pointed to by fd into group. The merge-
> > ee
> > + * group must not have an iommu or any devices open because we cannot
> > + * maintain that context across the merge. The merge-er group can be
> > + * in use. */
> > +static int vfio_group_merge(struct vfio_group *group, int fd)
>
> The documentation in vfio.txt explains clearly the logic implemented by
> the merge/unmerge group ioctls.
> However, what you are doing is not merging groups, but rather adding/removing
> groups to/from iommus (and creating flat lists of groups).
> For example, when you do
>
> merge(A,B)
>
> you actually mean to say "merge B to the list of groups assigned to the
> same iommu as group A".
It's actually a little more than that. After you've merged B into A,
you can close the file descriptor for B and access all of the devices
for the merged group from A.
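In userspace terms the flow is roughly the sketch below (the group numbers
and device name are made up for the example):

        int a = open("/dev/vfio/26", O_RDWR);   /* merge-er group */
        int b = open("/dev/vfio/37", O_RDWR);   /* merge-ee group */

        if (ioctl(a, VFIO_GROUP_MERGE, &b) == 0) {
                close(b);       /* the merge-ee fd is no longer needed */
                /* a device originally in group 37 is now reachable
                 * through group 26's file descriptor */
                int dev = ioctl(a, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
                /* dev can then be read/written/mmap'd/ioctl'd directly */
        }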
> For the same reason, you do not really need to provide the group you want
> to unmerge from, which means that instead of
>
> unmerge(A,B)
>
> you would just need
>
> unmerge(B)
Good point, we can avoid the awkward reference via file descriptor for
the unmerge.
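i.e. something like this in the group ioctl handler (sketch only, ioctl
number kept just for illustration):

#define VFIO_GROUP_UNMERGE              _IO(';', 102)   /* no fd argument */

        } else if (cmd == VFIO_GROUP_UNMERGE) {
                /* split this group back out of whatever it was merged into */
                return vfio_group_unmerge(group);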
> I understand the reason why it is not a real merge/unmerge (ie, to keep the
> original groups so that you can unmerge later)
Right, we still need to have visibility of the groups comprising the
merged group, but the abstraction provided to the user seems to be
deeper than you're thinking.
> ... however I just wonder if
> it wouldn't be more natural to implement the VFIO_IOMMU_ADD_GROUP/DEL_GROUP
> iommu ioctls instead? (the relationships between the data structure would
> remain the same)
> I guess you already discarded this option for some reason, right? What was
> the reason?
It's a possibility, I'm not sure it was discussed or really what
advantage it provides. It seems like we'd logically lose the ability to
access devices from other groups, whether that's good or bad, I don't
know. I think the notion of "merge" promotes the idea that the groups
are peers and an iommu_add/del feels a bit more hierarchical.
> > +{
> > + struct vfio_group *new;
> > + struct vfio_iommu *old_iommu;
> > + struct file *file;
> > + int ret = 0;
> > + bool opened = false;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + file = fget(fd);
> > + if (!file) {
> > + ret = -EBADF;
> > + goto out_noput;
> > + }
> > +
> > + /* Sanity check, is this really our fd? */
> > + if (file->f_op != &vfio_group_fops) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + new = file->private_data;
> > +
> > + if (!new || new == group || !new->iommu ||
> > + new->iommu->domain || new->bus != group->bus) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* We need to attach all the devices to each domain separately
> > + * in order to validate that the capabilities match for both. */
> > + ret = __vfio_open_iommu(new->iommu);
> > + if (ret)
> > + goto out;
> > +
> > + if (!group->iommu->domain) {
> > + ret = __vfio_open_iommu(group->iommu);
> > + if (ret)
> > + goto out;
> > + opened = true;
> > + }
> > +
> > + /* If cache coherency doesn't match we'd potentialy need to
> > + * remap existing iommu mappings in the merge-er domain.
> > + * Poor return to bother trying to allow this currently. */
> > + if (iommu_domain_has_cap(group->iommu->domain,
> > + IOMMU_CAP_CACHE_COHERENCY) !=
> > + iommu_domain_has_cap(new->iommu->domain,
> > + IOMMU_CAP_CACHE_COHERENCY)) {
> > + __vfio_close_iommu(new->iommu);
> > + if (opened)
> > + __vfio_close_iommu(group->iommu);
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* Close the iommu for the merge-ee and attach all its devices
> > + * to the merge-er iommu. */
> > + __vfio_close_iommu(new->iommu);
> > +
> > + ret = __vfio_iommu_attach_group(group->iommu, new);
> > + if (ret)
> > + goto out;
> > +
> > + /* set_iommu unlinks new from the iommu, so save a pointer to it
> > */
> > + old_iommu = new->iommu;
> > + __vfio_group_set_iommu(new, group->iommu);
> > + kfree(old_iommu);
> > +
> > +out:
> > + fput(file);
> > +out_noput:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Unmerge the group pointed to by fd from group. */
> > +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> > +{
> > + struct vfio_group *new;
> > + struct vfio_iommu *new_iommu;
> > + struct file *file;
> > + int ret = 0;
> > +
> > + /* Since the merge-out group is already opened, it needs to
> > + * have an iommu struct associated with it. */
> > + new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> > + if (!new_iommu)
> > + return -ENOMEM;
> > +
> > + INIT_LIST_HEAD(&new_iommu->group_list);
> > + INIT_LIST_HEAD(&new_iommu->dm_list);
> > + mutex_init(&new_iommu->dgate);
> > + new_iommu->bus = group->bus;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + file = fget(fd);
> > + if (!file) {
> > + ret = -EBADF;
> > + goto out_noput;
> > + }
> > +
> > + /* Sanity check, is this really our fd? */
> > + if (file->f_op != &vfio_group_fops) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + new = file->private_data;
> > + if (!new || new == group || new->iommu != group->iommu) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* We can't merge-out a group with devices still in use. */
> > + if (__vfio_group_devs_inuse(new)) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > +
> > + __vfio_iommu_detach_group(group->iommu, new);
> > + __vfio_group_set_iommu(new, new_iommu);
> > +
> > +out:
> > + fput(file);
> > +out_noput:
> > + if (ret)
> > + kfree(new_iommu);
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Get a new iommu file descriptor. This will open the iommu, setting
> > + * the current->mm ownership if it's not already set. */
> > +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> > +{
> > + int ret = 0;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + if (!group->iommu->domain) {
> > + ret = __vfio_open_iommu(group->iommu);
> > + if (ret)
> > + goto out;
> > + }
> > +
> > + ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> > + group->iommu, O_RDWR);
> > + if (ret < 0)
> > + goto out;
> > +
> > + group->iommu->refcnt++;
> > +out:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Get a new device file descriptor. This will open the iommu,
> > setting
> > + * the current->mm ownership if it's not already set. It's difficult
> > to
> > + * specify the requirements for matching a user supplied buffer to a
> > + * device, so we use a vfio driver callback to test for a match. For
> > + * PCI, dev_name(dev) is unique, but other drivers may require
> > including
> > + * a parent device string. */
> > +static int vfio_group_get_device_fd(struct vfio_group *group, char
> > *buf)
> > +{
> > + struct vfio_iommu *iommu = group->iommu;
> > + struct list_head *gpos;
> > + int ret = -ENODEV;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + if (!iommu->domain) {
> > + ret = __vfio_open_iommu(iommu);
> > + if (ret)
> > + goto out;
> > + }
> > +
> > + list_for_each(gpos, &iommu->group_list) {
> > + struct list_head *dpos;
> > +
> > + group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > + list_for_each(dpos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(dpos,
> > + struct vfio_device, device_next);
> > +
> > + if (device->ops->match(device->dev, buf)) {
> > + struct file *file;
> > +
> > + if (device->ops->get(device->device_data)) {
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > +
> > + /* We can't use anon_inode_getfd(), like above
> > + * because we need to modify the f_mode flags
> > + * directly to allow more than just ioctls */
> > + ret = get_unused_fd();
> > + if (ret < 0) {
> > + device->ops->put(device->device_data);
> > + goto out;
> > + }
> > +
> > + file = anon_inode_getfile("[vfio-device]",
> > + &vfio_device_fops,
> > + device, O_RDWR);
> > + if (IS_ERR(file)) {
> > + put_unused_fd(ret);
> > + ret = PTR_ERR(file);
> > + device->ops->put(device->device_data);
> > + goto out;
> > + }
> > +
> > + /* Todo: add an anon_inode interface to do
> > + * this. Appears to be missing by lack of
> > + * need rather than explicitly prevented.
> > + * Now there's need. */
> > + file->f_mode |= (FMODE_LSEEK |
> > + FMODE_PREAD |
> > + FMODE_PWRITE);
> > +
> > + fd_install(ret, file);
> > +
> > + device->refcnt++;
> > + goto out;
> > + }
> > + }
> > + }
> > +out:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +static long vfio_group_unl_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + struct vfio_group *group = filep->private_data;
> > +
> > + if (cmd == VFIO_GROUP_GET_FLAGS) {
> > + u64 flags = 0;
> > +
> > + mutex_lock(&vfio.lock);
> > + if (__vfio_iommu_viable(group->iommu))
> > + flags |= VFIO_GROUP_FLAGS_VIABLE;
> > + mutex_unlock(&vfio.lock);
> > +
> > + if (group->iommu->mm)
> > + flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
> > +
> > + return put_user(flags, (u64 __user *)arg);
> > + }
> > +
> > + /* Below commands are restricted once the mm is set */
> > + if (group->iommu->mm && group->iommu->mm != current->mm)
> > + return -EPERM;
> > + if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {
> > + int fd;
> > +
> > + if (get_user(fd, (int __user *)arg))
> > + return -EFAULT;
> > + if (fd < 0)
> > + return -EINVAL;
> > +
> > + if (cmd == VFIO_GROUP_MERGE)
> > + return vfio_group_merge(group, fd);
> > + else
> > + return vfio_group_unmerge(group, fd);
> > + } else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
> > + return vfio_group_get_iommu_fd(group);
> > + } else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
> > + char *buf;
> > + int ret;
> > +
> > + buf = strndup_user((const char __user *)arg, PAGE_SIZE);
> > + if (IS_ERR(buf))
> > + return PTR_ERR(buf);
> > +
> > + ret = vfio_group_get_device_fd(group, buf);
> > + kfree(buf);
> > + return ret;
> > + }
> > +
> > + return -ENOSYS;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_group_compat_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + arg = (unsigned long)compat_ptr(arg);
> > + return vfio_group_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif /* CONFIG_COMPAT */
> > +
> > +static const struct file_operations vfio_group_fops = {
> > + .owner = THIS_MODULE,
> > + .open = vfio_group_open,
> > + .release = vfio_group_release,
> > + .unlocked_ioctl = vfio_group_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > + .compat_ioctl = vfio_group_compat_ioctl,
> > +#endif
> > +};
> > +
> > +/* iommu fd release hook */
>
> Given vfio_device_release and
> vfio_group_release (ie, 1st object, 2nd operation), I was
> going to suggest renaming the fn below to vfio_iommu_release, but
> then I saw the latter name being already used in vfio_iommu.c ...
> a bit confusing but I guess it's ok then.
Right, this one was definitely because of naming collision.
> > +int vfio_release_iommu(struct vfio_iommu *iommu)
> > +{
> > + return vfio_do_release(&iommu->refcnt, iommu);
> > +}
> > +
> > +/*
> > + * VFIO driver API
> > + */
> > +
> > +/* Add a new device to the vfio framework with associated vfio driver
> > + * callbacks. This is the entry point for vfio drivers to register
> > devices. */
> > +int vfio_group_add_dev(struct device *dev, const struct
> > vfio_device_ops *ops)
> > +{
> > + struct list_head *pos;
> > + struct vfio_group *group = NULL;
> > + struct vfio_device *device = NULL;
> > + unsigned int groupid;
> > + int ret = 0;
> > + bool new_group = false;
> > +
> > + if (!ops)
> > + return -EINVAL;
> > +
> > + if (iommu_device_group(dev, &groupid))
> > + return -ENODEV;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + list_for_each(pos, &vfio.group_list) {
> > + group = list_entry(pos, struct vfio_group, group_next);
> > + if (group->groupid == groupid)
> > + break;
> > + group = NULL;
> > + }
> > +
> > + if (!group) {
> > + int minor;
> > +
> > + if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + group = kzalloc(sizeof(*group), GFP_KERNEL);
> > + if (!group) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + group->groupid = groupid;
> > + INIT_LIST_HEAD(&group->device_list);
> > +
> > + ret = idr_get_new(&vfio.idr, group, &minor);
> > + if (ret == 0 && minor > MINORMASK) {
> > + idr_remove(&vfio.idr, minor);
> > + kfree(group);
> > + ret = -ENOSPC;
> > + goto out;
> > + }
> > +
> > + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > + device_create(vfio.class, NULL, group->devt,
> > + group, "%u", groupid);
> > +
> > + group->bus = dev->bus;
> > + list_add(&group->group_next, &vfio.group_list);
> > + new_group = true;
> > + } else {
> > + if (group->bus != dev->bus) {
> > + printk(KERN_WARNING
> > + "Error: IOMMU group ID conflict. Group ID %u
> > "
> > + "on both bus %s and %s\n", groupid,
> > + group->bus->name, dev->bus->name);
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > +
> > + list_for_each(pos, &group->device_list) {
> > + device = list_entry(pos,
> > + struct vfio_device, device_next);
> > + if (device->dev == dev)
> > + break;
> > + device = NULL;
> > + }
> > + }
> > +
> > + if (!device) {
> > + if (__vfio_group_devs_inuse(group) ||
> > + (group->iommu && group->iommu->refcnt)) {
> > + printk(KERN_WARNING
> > + "Adding device %s to group %u while group is
> > already in use!!\n",
> > + dev_name(dev), group->groupid);
> > + /* XXX How to prevent other drivers from claiming? */
>
> Here we are adding a device (not yet assigned to a vfio bus) to a group
> that is already in use.
> Given that it would not be acceptable for this device to get assigned
> to a non vfio driver, why not forcing such assignment here then?
Exactly, I just don't know the mechanics of how to make that happen and
was hoping for suggestions...
> I am not sure though what the best way to do it would be.
> What about something like this:
>
> - when the bus vfio-pci processes the BUS_NOTIFY_ADD_DEVICE
> notification it assigns to the device a PCI ID that will make sure
> the vfio-pci's probe routine will be invoked (and no other driver can
> therefore claim the device). That PCI ID would have to be added
> to the vfio_pci_driver's id_table (it would be the exception to the
> "only dynamic IDs" rule). Too hackish?
Presumably some other driver also has the ID in its id_table; how do we
make sure we win?
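For reference, the notification path we're talking about is just a bus
notifier, along the lines of the sketch below (vfio_pci_device_ops and
vfio_pci_nb are stand-in names):

extern const struct vfio_device_ops vfio_pci_device_ops;       /* stand-in */

static int vfio_pci_device_notifier(struct notifier_block *nb,
                                    unsigned long action, void *data)
{
        struct device *dev = data;

        if (action == BUS_NOTIFY_ADD_DEVICE)
                vfio_group_add_dev(dev, &vfio_pci_device_ops);
        else if (action == BUS_NOTIFY_DEL_DEVICE)
                vfio_group_del_dev(dev);

        return NOTIFY_OK;
}

static struct notifier_block vfio_pci_nb = {
        .notifier_call = vfio_pci_device_notifier,
};

        /* at module init */
        bus_register_notifier(&pci_bus_type, &vfio_pci_nb);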
> > + }
> > +
> > + device = kzalloc(sizeof(*device), GFP_KERNEL);
> > + if (!device) {
> > + /* If we just created this group, tear it down */
> > + if (new_group) {
> > + list_del(&group->group_next);
> > + device_destroy(vfio.class, group->devt);
> > + idr_remove(&vfio.idr, MINOR(group->devt));
> > + kfree(group);
> > + }
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + list_add(&device->device_next, &group->device_list);
> > + device->dev = dev;
> > + device->ops = ops;
> > + device->iommu = group->iommu; /* NULL if new */
>
> Shouldn't you check the return code of __vfio_iommu_attach_dev?
Yep, looks like I did this because the expected use case has a NULL
iommu here, so I need to distinguish that error from an actual
iommu_attach_device() error.
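Something along these lines, i.e. only attach when there actually is an
iommu and treat a real attach failure as fatal (sketch; the exact unwind
is illustrative):

                if (group->iommu) {
                        ret = __vfio_iommu_attach_dev(group->iommu, device);
                        if (ret) {
                                list_del(&device->device_next);
                                kfree(device);
                                goto out;
                        }
                }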
> > + __vfio_iommu_attach_dev(group->iommu, device);
> > + }
> > +out:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_group_add_dev);
> > +
> > +/* Remove a device from the vfio framework */
>
> This fn below does not return any error code. Ok ...
> However, there are a number of errors case that you test, for example
> - device that does not belong to any group (according to iommu API)
> - device that belongs to a group but that does not appear in the list
> of devices of the vfio_group structure.
> Are the above two error checks just paranoia or are those errors actually possible?
> If they were possible, shouldn't we generate a warning (most probably
> it would be a bug in the code)?
They're all vfio-bus driver bugs of some sort, so it's just a matter of
how much we want to scream about them. I'll comment on each below.
> > +void vfio_group_del_dev(struct device *dev)
> > +{
> > + struct list_head *pos;
> > + struct vfio_group *group = NULL;
> > + struct vfio_device *device = NULL;
> > + unsigned int groupid;
> > +
> > + if (iommu_device_group(dev, &groupid))
> > + return;
Here the bus driver is probably just sitting on a notifier list for
its bus_type and a device is getting removed. Unless we want to
require the bus driver to track everything it's attempted to add and
whether it worked, we can just ignore this.
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + list_for_each(pos, &vfio.group_list) {
> > + group = list_entry(pos, struct vfio_group, group_next);
> > + if (group->groupid == groupid)
> > + break;
> > + group = NULL;
> > + }
> > +
> > + if (!group)
> > + goto out;
We don't even have a group for the device, we could BUG_ON here. The
bus driver failed to tell us about something that was then removed.
> > +
> > + list_for_each(pos, &group->device_list) {
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + if (device->dev == dev)
> > + break;
> > + device = NULL;
> > + }
> > +
> > + if (!device)
> > + goto out;
Same here.
> > +
> > + BUG_ON(device->refcnt);
> > +
> > + if (device->attached)
> > + __vfio_iommu_detach_dev(group->iommu, device);
> > +
> > + list_del(&device->device_next);
> > + kfree(device);
> > +
> > + /* If this was the only device in the group, remove the group.
> > + * Note that we intentionally unmerge empty groups here if the
> > + * group fd isn't opened. */
> > + if (list_empty(&group->device_list) && group->refcnt == 0) {
> > + struct vfio_iommu *iommu = group->iommu;
> > +
> > + if (iommu) {
> > + __vfio_group_set_iommu(group, NULL);
> > + __vfio_try_dissolve_iommu(iommu);
> > + }
> > +
> > + device_destroy(vfio.class, group->devt);
> > + idr_remove(&vfio.idr, MINOR(group->devt));
> > + list_del(&group->group_next);
> > + kfree(group);
> > + }
> > +out:
> > + mutex_unlock(&vfio.lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_group_del_dev);
> > +
> > +/* When a device is bound to a vfio device driver (ex. vfio-pci), this
> > + * entry point is used to mark the device usable (viable). The vfio
> > + * device driver associates a private device_data struct with the
> > device
> > + * here, which will later be return for vfio_device_fops callbacks. */
> > +int vfio_bind_dev(struct device *dev, void *device_data)
> > +{
> > + struct vfio_device *device;
> > + int ret = -EINVAL;
> > +
> > + BUG_ON(!device_data);
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + device = __vfio_lookup_dev(dev);
> > +
> > + BUG_ON(!device);
> > +
> > + ret = dev_set_drvdata(dev, device);
> > + if (!ret)
> > + device->device_data = device_data;
> > +
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_bind_dev);
> > +
> > +/* A device is only removeable if the iommu for the group is not in
> > use. */
> > +static bool vfio_device_removeable(struct vfio_device *device)
> > +{
> > + bool ret = true;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + if (device->iommu && __vfio_iommu_inuse(device->iommu))
> > + ret = false;
> > +
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Notify vfio that a device is being unbound from the vfio device
> > driver
> > + * and return the device private device_data pointer. If the group is
> > + * in use, we need to block or take other measures to make it safe for
> > + * the device to be removed from the iommu. */
> > +void *vfio_unbind_dev(struct device *dev)
> > +{
> > + struct vfio_device *device = dev_get_drvdata(dev);
> > + void *device_data;
> > +
> > + BUG_ON(!device);
> > +
> > +again:
> > + if (!vfio_device_removeable(device)) {
> > + /* XXX signal for all devices in group to be removed or
> > + * resort to killing the process holding the device fds.
> > + * For now just block waiting for releases to wake us. */
> > + wait_event(vfio.release_q, vfio_device_removeable(device));
>
> Any new idea/proposal on how to handle this situation?
> The last one I remember was to leave the soft/hard/etc timeout handling in
> userspace and implement it as a sort of policy. Is that one still the most
> likely candidate solution to handle this situation?
I haven't heard any new proposals.  I think we need the hard timeout
handling in the kernel.  We can't leave it to userspace to decide
whether it gets to keep the device.  We could make this tunable via an
ioctl, but I don't see how we wouldn't require CAP_SYS_ADMIN (or
similar) to tweak it.  I was intending to re-implement the netlink
interface to signal the removal, but I expect that to get allergic
reactions.
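Just to illustrate the shape of it, an in-kernel hard timeout around
the existing wait could look something like this; vfio_hard_timeout_ms
and vfio_device_force_release() are made-up names, not existing code:

	if (!vfio_device_removeable(device)) {
		long left;

		left = wait_event_timeout(vfio.release_q,
					  vfio_device_removeable(device),
					  msecs_to_jiffies(vfio_hard_timeout_ms));
		if (!left) {
			/* The user never released the device/iommu fds;
			 * revoke access forcibly rather than blocking
			 * the unbind forever. */
			vfio_device_force_release(device);
		}
	}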
Thanks for the comments!
Alex
* RE: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-09 18:02 ` Alex Williamson
@ 2011-11-09 21:08 ` Christian Benvenuti (benve)
2011-11-09 23:40 ` Alex Williamson
0 siblings, 1 reply; 62+ messages in thread
From: Christian Benvenuti (benve) @ 2011-11-09 21:08 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf,
Aaron Fabbri (aafabbri), B08248, B07421, avi, konrad.wilk, kvm,
qemu-devel, iommu, linux-pci
[The body of this reply is stored base64-encoded in the archive and is
not decoded here; it opens "Comments inline..." and carries Christian
Benvenuti's point-by-point follow-up to Alex Williamson's replies on
vfio_main.c and pci/vfio_pci.c.]
aWNlcyAod2VsbCwgeW91IHdvdWxkIG5lZWQgdG8gY2hlY2sganVzdCBvbmUgb2YgdGhlIGRldmlj
ZXMgaW4NCiAgdGhlIGdyb3VwKSBhcmUgYWxyZWFkeSBhc3NpZ25lZCB0byB2ZmlvLXBjaSwgYW5k
IGluIHN1Y2ggYSBjYXNlIGl0DQogIHByZS1pbml0aWFsaXplIHRoZSBkcml2ZXIgdG8gdmZpby1w
Y2kuDQoNCk5PVEU6IEJ5ICJwcmVpbml0IiBJIG1lYW4gInNhdmUgaW50byB0aGUgZGV2aWNlIGEg
cmVmZXJlbmNlIHRvIGEgZHJpdmVyIGJlZm9yZQ0KICAgICAgdGhlICdtYXRjaCcgY2FsbGJhY2tz
Ii4NCg0KVGhpcyB3b3VsZCBiZSB0aGUgdGltZWxpbmU6DQoNCnwNCistPiBuZXcgZGV2aWNlIGdl
dHMgYWRkZWQgdG8gKFBDSSkgYnVzDQp8DQorLT4gUENJOiBzZW5kIEJVU19OT1RJRklFUl9BRERf
REVWSUNFIG5vdGlmaWNhdGlvbg0KfA0KKy0+IFZGSU86dmZpb19wY2lfZGV2aWNlX25vdGlmaWVy
DQp8ICAgICAgICB8DQp8ICAgICAgICArLT4gQlVTX05PVElGSUVSX0FERF9ERVZJQ0U6IHZmaW9f
Z3JvdXBfYWRkX2Rldg0KfCAgICAgICAgICAgIHwNCnwgICAgICAgICAgICArLT5pb21tdV9kZXZp
Y2VfZ3JvdXAoZGV2LCZncm91cGlkKQ0KfCAgICAgICAgICAgICstPmdyb3VwID0gPHNlYXJjaCBn
cm91cGlkIGluIHZmaW8uZ3JvdXBfbGlzdD4NCnwgICAgICAgICAgICArLT5pZiAoZ3JvdXAgJiYg
Z3JvdXBfaXNfdmZpbyhncm91cCkpDQp8ICAgICAgICAgICAgfCAgICAgICAgPHByZWluaXQgZGV2
aWNlIGRyaXZlciB0byB2ZmlvLXBjaT4NCnwgICAgICAgICAgICAuLi4NCnwNCistPiBQQ0k6IHh4
eA0KfCAgICAgICAgfA0KfCAgICAgICAgKy0+IGlmICghZGV2aWNlX2RyaXZlcl9pc19wcmVpbml0
KGRldikpDQp8ICAgICAgICB8ICAgICAgIHByb2JlPTxzZWFyY2ggZHJpdmVyJ3MgcHJvYmUgY2Fs
bGJhY2sgdXNpbmcgJ21hdGNoJz4NCnwgICAgICAgIHwgICBlbHNlIA0KfCAgICAgICAgfCAgICAg
ICBwcm9iZT08Z2V0IGl0IGZyb20gcHJlaW50IGRyaXZlciBjb25maWc+DQp8ICAgICAgICB8ICAg
ICAgICgrZmFsbGJhY2sgdG8gJ21hdGNoJyBpZiBwcmVpbml0IGRyaXZlciBkaXNhcHBlYXJlZD8p
DQp8ICAgICAgICB8ICAgDQp8ICAgICAgICArLT4gcmMgPSBwcm9iZSguLi4pDQp8ICAgICAgICB8
DQp8ICAgICAgICAuLi4NCnYNCi4uLg0KDQpPZiBjb3Vyc2UsIHdoYXQgaWYgbXVsdGlwbGUgZHJp
dmVycyBkZWNpZGUgdG8gcHJlaW5pdCB0aGUgZGV2aWNlID8NCg0KT25lIHdheSB0byBtYWtlIGl0
IGNsZWFuZXIgd291bGQgYmUgdG86DQotIGhhdmUgdGhlIFBDSSBsYXllciBleHBvcnQgYW4gQVBJ
IHRoYXQgYWxsb3dzIChmb3IgZXhhbXBsZSkgdGhlIGJ1cw0KICBub3RpZmljYXRpb24gY2FsbGJh
Y2tzIChsaWtlIHZmaW9fcGNpX2RldmljZV9ub3RpZmllcikgdG8gcHJlaW5pdCBhIGRyaXZlcg0K
LSBtYWtlIHN1Y2ggQVBJIHJlamVjdCBjYWxscyBvbiBkZXZpY2VzIHRoYXQgYWxyZWFkeSBoYXZl
IGEgcHJlaW5pdA0KICBkcml2ZXIuDQotIG1ha2UgVkZJTyBkZXRlY3QgdGhlIGNhc2Ugd2hlcmUg
dmZpb19wY2lfZGV2aWNlX25vdGlmaWVyIGNhbiBub3QNCiAgcHJlaW5pdCB0aGUgZHJpdmVyICh0
byB2ZmlvLXBjaSkgZm9yIHRoZSBuZXcgZGV2aWNlIChiZWNhdXNlIGFscmVhZHkNCiAgcHJlaW5p
dGVkKSBhbmQgcmFpc2UgYW4gZXJyb3Ivd2FybmluZy4NCg0KV291bGQgdGhpcyBsb29rIGEgYml0
IGNsZWFuZXI/DQoNCj4gPiA+ICsJCX0NCj4gPiA+ICsNCj4gPiA+ICsJCWRldmljZSA9IGt6YWxs
b2Moc2l6ZW9mKCpkZXZpY2UpLCBHRlBfS0VSTkVMKTsNCj4gPiA+ICsJCWlmICghZGV2aWNlKSB7
DQo+ID4gPiArCQkJLyogSWYgd2UganVzdCBjcmVhdGVkIHRoaXMgZ3JvdXAsIHRlYXIgaXQgZG93
biAqLw0KPiA+ID4gKwkJCWlmIChuZXdfZ3JvdXApIHsNCj4gPiA+ICsJCQkJbGlzdF9kZWwoJmdy
b3VwLT5ncm91cF9uZXh0KTsNCj4gPiA+ICsJCQkJZGV2aWNlX2Rlc3Ryb3kodmZpby5jbGFzcywg
Z3JvdXAtPmRldnQpOw0KPiA+ID4gKwkJCQlpZHJfcmVtb3ZlKCZ2ZmlvLmlkciwgTUlOT1IoZ3Jv
dXAtPmRldnQpKTsNCj4gPiA+ICsJCQkJa2ZyZWUoZ3JvdXApOw0KPiA+ID4gKwkJCX0NCj4gPiA+
ICsJCQlyZXQgPSAtRU5PTUVNOw0KPiA+ID4gKwkJCWdvdG8gb3V0Ow0KPiA+ID4gKwkJfQ0KPiA+
ID4gKw0KPiA+ID4gKwkJbGlzdF9hZGQoJmRldmljZS0+ZGV2aWNlX25leHQsICZncm91cC0+ZGV2
aWNlX2xpc3QpOw0KPiA+ID4gKwkJZGV2aWNlLT5kZXYgPSBkZXY7DQo+ID4gPiArCQlkZXZpY2Ut
Pm9wcyA9IG9wczsNCj4gPiA+ICsJCWRldmljZS0+aW9tbXUgPSBncm91cC0+aW9tbXU7IC8qIE5V
TEwgaWYgbmV3ICovDQo+ID4NCj4gPiBTaG91bGRuJ3QgeW91IGNoZWNrIHRoZSByZXR1cm4gY29k
ZSBvZiBfX3ZmaW9faW9tbXVfYXR0YWNoX2Rldj8NCj4gDQo+IFllcCwgbG9va3MgbGlrZSBJIGRp
ZCB0aGlzIGJlY2F1c2UgdGhlIGV4cGVjdGVkIHVzZSBjYXNlIGhhcyBhIE5VTEwNCj4gaW9tbXUg
aGVyZSwgc28gSSBuZWVkIHRvIGRpc3RpZ3Vpc2ggdGhhdCBlcnJvciBmcm9tIGFuIGFjdHVhbA0K
PiBpb21tdV9hdHRhY2hfZGV2aWNlKCkgZXJyb3IuDQo+IA0KPiA+ID4gKwkJX192ZmlvX2lvbW11
X2F0dGFjaF9kZXYoZ3JvdXAtPmlvbW11LCBkZXZpY2UpOw0KPiA+ID4gKwl9DQo+ID4gPiArb3V0
Og0KPiA+ID4gKwltdXRleF91bmxvY2soJnZmaW8ubG9jayk7DQo+ID4gPiArCXJldHVybiByZXQ7
DQo+ID4gPiArfQ0KPiA+ID4gK0VYUE9SVF9TWU1CT0xfR1BMKHZmaW9fZ3JvdXBfYWRkX2Rldik7
DQo+ID4gPiArDQo+ID4gPiArLyogUmVtb3ZlIGEgZGV2aWNlIGZyb20gdGhlIHZmaW8gZnJhbWV3
b3JrICovDQo+ID4NCj4gPiBUaGlzIGZuIGJlbG93IGRvZXMgbm90IHJldHVybiBhbnkgZXJyb3Ig
Y29kZS4gT2sgLi4uDQo+ID4gSG93ZXZlciwgdGhlcmUgYXJlIGEgbnVtYmVyIG9mIGVycm9ycyBj
YXNlIHRoYXQgeW91IHRlc3QsIGZvciBleGFtcGxlDQo+ID4gLSBkZXZpY2UgdGhhdCBkb2VzIG5v
dCBiZWxvbmcgdG8gYW55IGdyb3VwIChhY2NvcmRpbmcgdG8gaW9tbXUgQVBJKQ0KPiA+IC0gZGV2
aWNlIHRoYXQgYmVsb25ncyB0byBhIGdyb3VwIGJ1dCB0aGF0IGRvZXMgbm90IGFwcGVhciBpbiB0
aGUgbGlzdA0KPiA+ICAgb2YgZGV2aWNlcyBvZiB0aGUgdmZpb19ncm91cCBzdHJ1Y3R1cmUuDQo+
ID4gQXJlIHRoZSBhYm92ZSB0d28gZXJyb3JzIGNoZWNrcyBqdXN0IHBhcmFub2lhIG9yIGFyZSB0
aG9zZSBlcnJvcnMNCj4gYWN0dWFsbHkgcG9zc2libGU/DQo+ID4gSWYgdGhleSB3ZXJlIHBvc3Np
YmxlLCBzaG91bGRuJ3Qgd2UgZ2VuZXJhdGUgYSB3YXJuaW5nIChtb3N0IHByb2JhYmx5DQo+ID4g
aXQgd291bGQgYmUgYSBidWcgaW4gdGhlIGNvZGUpPw0KPiANCj4gVGhleSdyZSBhbGwgdmZpby1i
dXMgZHJpdmVyIGJ1Z3Mgb2Ygc29tZSBzb3J0LCBzbyBpdCdzIGp1c3QgYSBtYXR0ZXIgb2YNCj4g
aG93IG11Y2ggd2Ugd2FudCB0byBzY3JlYW0gYWJvdXQgdGhlbS4gIEknbGwgY29tbWVudHMgb24g
ZWFjaCBiZWxvdy4NCj4gDQo+ID4gPiArdm9pZCB2ZmlvX2dyb3VwX2RlbF9kZXYoc3RydWN0IGRl
dmljZSAqZGV2KQ0KPiA+ID4gK3sNCj4gPiA+ICsJc3RydWN0IGxpc3RfaGVhZCAqcG9zOw0KPiA+
ID4gKwlzdHJ1Y3QgdmZpb19ncm91cCAqZ3JvdXAgPSBOVUxMOw0KPiA+ID4gKwlzdHJ1Y3QgdmZp
b19kZXZpY2UgKmRldmljZSA9IE5VTEw7DQo+ID4gPiArCXVuc2lnbmVkIGludCBncm91cGlkOw0K
PiA+ID4gKw0KPiA+ID4gKwlpZiAoaW9tbXVfZGV2aWNlX2dyb3VwKGRldiwgJmdyb3VwaWQpKQ0K
PiA+ID4gKwkJcmV0dXJuOw0KPiANCj4gSGVyZSB0aGUgYnVzIGRyaXZlciBpcyBwcm9iYWJseSBq
dXN0IHNpdHRpbmcgb24gYSBub3RpZmllciBsaXN0IGZvcg0KPiB0aGVpciBidXNfdHlwZSBhbmQg
YSBkZXZpY2UgaXMgZ2V0dGluZyByZW1vdmVkLiAgVW5sZXNzIHdlIHdhbnQgdG8NCj4gcmVxdWly
ZSB0aGUgYnVzIGRyaXZlciB0byB0cmFjayBldmVyeXRoaW5nIGl0J3MgYXR0ZW1wdGVkIHRvIGFk
ZCBhbmQNCj4gd2hldGhlciBpdCB3b3JrZWQsIHdlIGNhbiBqdXN0IGlnbm9yZSB0aGlzLg0KDQpP
SywgSSBzZWUgd2hhdCB5b3UgbWVhbi4gSWYgdmZpb19ncm91cF9hZGRfZGV2IGZhaWxzIGZvciBz
b21lIHJlYXNvbnMgd2UNCmRvIG5vdCBrZWVwIHRyYWNrIG9mIGl0LiBSaWdodD8NCldvdWxkIGl0
IG1ha2Ugc2Vuc2UgdG8gYWRkIG9uZSBzcGVjaWFsIGdyb3VwIHRvIHZmaW8uZ3JvdXBfbGlzdCAo
b3IgYmV0dGVyDQpPbiBhIHNlcGFyYXRlIGZpZWxkIG9mIHRoZSB2ZmlvIHN0cnVjdHVyZSkgd2hv
c2UgZ29hbA0Kd291bGQgYmUganVzdCB0aGF0OiBrZWVwIHRyYWNrIG9mIHRob3NlIGRldmljZXMg
dGhhdCBmYWlsZWQgdG8gYmUgYWRkZWQNCnRvIHRoZSBWRklPIGZyYW1ld29yayAoY2FuIGl0IGhl
bHAgZm9yIGRlYnVnZ2luZyB0b28/KT8NCg0KPiA+ID4gKw0KPiA+ID4gKwltdXRleF9sb2NrKCZ2
ZmlvLmxvY2spOw0KPiA+ID4gKw0KPiA+ID4gKwlsaXN0X2Zvcl9lYWNoKHBvcywgJnZmaW8uZ3Jv
dXBfbGlzdCkgew0KPiA+ID4gKwkJZ3JvdXAgPSBsaXN0X2VudHJ5KHBvcywgc3RydWN0IHZmaW9f
Z3JvdXAsIGdyb3VwX25leHQpOw0KPiA+ID4gKwkJaWYgKGdyb3VwLT5ncm91cGlkID09IGdyb3Vw
aWQpDQo+ID4gPiArCQkJYnJlYWs7DQo+ID4gPiArCQlncm91cCA9IE5VTEw7DQo+ID4gPiArCX0N
Cj4gPiA+ICsNCj4gPiA+ICsJaWYgKCFncm91cCkNCj4gPiA+ICsJCWdvdG8gb3V0Ow0KPiANCj4g
V2UgZG9uJ3QgZXZlbiBoYXZlIGEgZ3JvdXAgZm9yIHRoZSBkZXZpY2UsIHdlIGNvdWxkIEJVR19P
TiBoZXJlLiAgVGhlDQo+IGJ1cyBkcml2ZXIgZmFpbGVkIHRvIHRlbGwgdXMgYWJvdXQgc29tZXRo
aW5nIHRoYXQgd2FzIHRoZW4gcmVtb3ZlZC4NCj4gDQo+ID4gPiArDQo+ID4gPiArCWxpc3RfZm9y
X2VhY2gocG9zLCAmZ3JvdXAtPmRldmljZV9saXN0KSB7DQo+ID4gPiArCQlkZXZpY2UgPSBsaXN0
X2VudHJ5KHBvcywgc3RydWN0IHZmaW9fZGV2aWNlLCBkZXZpY2VfbmV4dCk7DQo+ID4gPiArCQlp
ZiAoZGV2aWNlLT5kZXYgPT0gZGV2KQ0KPiA+ID4gKwkJCWJyZWFrOw0KPiA+ID4gKwkJZGV2aWNl
ID0gTlVMTDsNCj4gPiA+ICsJfQ0KPiA+ID4gKw0KPiA+ID4gKwlpZiAoIWRldmljZSkNCj4gPiA+
ICsJCWdvdG8gb3V0Ow0KPiANCj4gU2FtZSBoZXJlLg0KPiANCj4gPiA+ICsNCj4gPiA+ICsJQlVH
X09OKGRldmljZS0+cmVmY250KTsNCj4gPiA+ICsNCj4gPiA+ICsJaWYgKGRldmljZS0+YXR0YWNo
ZWQpDQo+ID4gPiArCQlfX3ZmaW9faW9tbXVfZGV0YWNoX2Rldihncm91cC0+aW9tbXUsIGRldmlj
ZSk7DQo+ID4gPiArDQo+ID4gPiArCWxpc3RfZGVsKCZkZXZpY2UtPmRldmljZV9uZXh0KTsNCj4g
PiA+ICsJa2ZyZWUoZGV2aWNlKTsNCj4gPiA+ICsNCj4gPiA+ICsJLyogSWYgdGhpcyB3YXMgdGhl
IG9ubHkgZGV2aWNlIGluIHRoZSBncm91cCwgcmVtb3ZlIHRoZSBncm91cC4NCj4gPiA+ICsJICog
Tm90ZSB0aGF0IHdlIGludGVudGlvbmFsbHkgdW5tZXJnZSBlbXB0eSBncm91cHMgaGVyZSBpZiB0
aGUNCj4gPiA+ICsJICogZ3JvdXAgZmQgaXNuJ3Qgb3BlbmVkLiAqLw0KPiA+ID4gKwlpZiAobGlz
dF9lbXB0eSgmZ3JvdXAtPmRldmljZV9saXN0KSAmJiBncm91cC0+cmVmY250ID09IDApIHsNCj4g
PiA+ICsJCXN0cnVjdCB2ZmlvX2lvbW11ICppb21tdSA9IGdyb3VwLT5pb21tdTsNCj4gPiA+ICsN
Cj4gPiA+ICsJCWlmIChpb21tdSkgew0KPiA+ID4gKwkJCV9fdmZpb19ncm91cF9zZXRfaW9tbXUo
Z3JvdXAsIE5VTEwpOw0KPiA+ID4gKwkJCV9fdmZpb190cnlfZGlzc29sdmVfaW9tbXUoaW9tbXUp
Ow0KPiA+ID4gKwkJfQ0KPiA+ID4gKw0KPiA+ID4gKwkJZGV2aWNlX2Rlc3Ryb3kodmZpby5jbGFz
cywgZ3JvdXAtPmRldnQpOw0KPiA+ID4gKwkJaWRyX3JlbW92ZSgmdmZpby5pZHIsIE1JTk9SKGdy
b3VwLT5kZXZ0KSk7DQo+ID4gPiArCQlsaXN0X2RlbCgmZ3JvdXAtPmdyb3VwX25leHQpOw0KPiA+
ID4gKwkJa2ZyZWUoZ3JvdXApOw0KPiA+ID4gKwl9DQo+ID4gPiArb3V0Og0KPiA+ID4gKwltdXRl
eF91bmxvY2soJnZmaW8ubG9jayk7DQo+ID4gPiArfQ0KPiA+ID4gK0VYUE9SVF9TWU1CT0xfR1BM
KHZmaW9fZ3JvdXBfZGVsX2Rldik7DQo+ID4gPiArDQo+ID4gPiArLyogV2hlbiBhIGRldmljZSBp
cyBib3VuZCB0byBhIHZmaW8gZGV2aWNlIGRyaXZlciAoZXguIHZmaW8tcGNpKSwNCj4gdGhpcw0K
PiA+ID4gKyAqIGVudHJ5IHBvaW50IGlzIHVzZWQgdG8gbWFyayB0aGUgZGV2aWNlIHVzYWJsZSAo
dmlhYmxlKS4gIFRoZQ0KPiB2ZmlvDQo+ID4gPiArICogZGV2aWNlIGRyaXZlciBhc3NvY2lhdGVz
IGEgcHJpdmF0ZSBkZXZpY2VfZGF0YSBzdHJ1Y3Qgd2l0aCB0aGUNCj4gPiA+IGRldmljZQ0KPiA+
ID4gKyAqIGhlcmUsIHdoaWNoIHdpbGwgbGF0ZXIgYmUgcmV0dXJuIGZvciB2ZmlvX2RldmljZV9m
b3BzDQo+IGNhbGxiYWNrcy4gKi8NCj4gPiA+ICtpbnQgdmZpb19iaW5kX2RldihzdHJ1Y3QgZGV2
aWNlICpkZXYsIHZvaWQgKmRldmljZV9kYXRhKQ0KPiA+ID4gK3sNCj4gPiA+ICsJc3RydWN0IHZm
aW9fZGV2aWNlICpkZXZpY2U7DQo+ID4gPiArCWludCByZXQgPSAtRUlOVkFMOw0KPiA+ID4gKw0K
PiA+ID4gKwlCVUdfT04oIWRldmljZV9kYXRhKTsNCj4gPiA+ICsNCj4gPiA+ICsJbXV0ZXhfbG9j
aygmdmZpby5sb2NrKTsNCj4gPiA+ICsNCj4gPiA+ICsJZGV2aWNlID0gX192ZmlvX2xvb2t1cF9k
ZXYoZGV2KTsNCj4gPiA+ICsNCj4gPiA+ICsJQlVHX09OKCFkZXZpY2UpOw0KPiA+ID4gKw0KPiA+
ID4gKwlyZXQgPSBkZXZfc2V0X2RydmRhdGEoZGV2LCBkZXZpY2UpOw0KPiA+ID4gKwlpZiAoIXJl
dCkNCj4gPiA+ICsJCWRldmljZS0+ZGV2aWNlX2RhdGEgPSBkZXZpY2VfZGF0YTsNCj4gPiA+ICsN
Cj4gPiA+ICsJbXV0ZXhfdW5sb2NrKCZ2ZmlvLmxvY2spOw0KPiA+ID4gKwlyZXR1cm4gcmV0Ow0K
PiA+ID4gK30NCj4gPiA+ICtFWFBPUlRfU1lNQk9MX0dQTCh2ZmlvX2JpbmRfZGV2KTsNCj4gPiA+
ICsNCj4gPiA+ICsvKiBBIGRldmljZSBpcyBvbmx5IHJlbW92ZWFibGUgaWYgdGhlIGlvbW11IGZv
ciB0aGUgZ3JvdXAgaXMgbm90DQo+IGluDQo+ID4gPiB1c2UuICovDQo+ID4gPiArc3RhdGljIGJv
b2wgdmZpb19kZXZpY2VfcmVtb3ZlYWJsZShzdHJ1Y3QgdmZpb19kZXZpY2UgKmRldmljZSkNCj4g
PiA+ICt7DQo+ID4gPiArCWJvb2wgcmV0ID0gdHJ1ZTsNCj4gPiA+ICsNCj4gPiA+ICsJbXV0ZXhf
bG9jaygmdmZpby5sb2NrKTsNCj4gPiA+ICsNCj4gPiA+ICsJaWYgKGRldmljZS0+aW9tbXUgJiYg
X192ZmlvX2lvbW11X2ludXNlKGRldmljZS0+aW9tbXUpKQ0KPiA+ID4gKwkJcmV0ID0gZmFsc2U7
DQo+ID4gPiArDQo+ID4gPiArCW11dGV4X3VubG9jaygmdmZpby5sb2NrKTsNCj4gPiA+ICsJcmV0
dXJuIHJldDsNCj4gPiA+ICt9DQo+ID4gPiArDQo+ID4gPiArLyogTm90aWZ5IHZmaW8gdGhhdCBh
IGRldmljZSBpcyBiZWluZyB1bmJvdW5kIGZyb20gdGhlIHZmaW8gZGV2aWNlDQo+ID4gPiBkcml2
ZXINCj4gPiA+ICsgKiBhbmQgcmV0dXJuIHRoZSBkZXZpY2UgcHJpdmF0ZSBkZXZpY2VfZGF0YSBw
b2ludGVyLiAgSWYgdGhlDQo+IGdyb3VwIGlzDQo+ID4gPiArICogaW4gdXNlLCB3ZSBuZWVkIHRv
IGJsb2NrIG9yIHRha2Ugb3RoZXIgbWVhc3VyZXMgdG8gbWFrZSBpdCBzYWZlDQo+IGZvcg0KPiA+
ID4gKyAqIHRoZSBkZXZpY2UgdG8gYmUgcmVtb3ZlZCBmcm9tIHRoZSBpb21tdS4gKi8NCj4gPiA+
ICt2b2lkICp2ZmlvX3VuYmluZF9kZXYoc3RydWN0IGRldmljZSAqZGV2KQ0KPiA+ID4gK3sNCj4g
PiA+ICsJc3RydWN0IHZmaW9fZGV2aWNlICpkZXZpY2UgPSBkZXZfZ2V0X2RydmRhdGEoZGV2KTsN
Cj4gPiA+ICsJdm9pZCAqZGV2aWNlX2RhdGE7DQo+ID4gPiArDQo+ID4gPiArCUJVR19PTighZGV2
aWNlKTsNCj4gPiA+ICsNCj4gPiA+ICthZ2FpbjoNCj4gPiA+ICsJaWYgKCF2ZmlvX2RldmljZV9y
ZW1vdmVhYmxlKGRldmljZSkpIHsNCj4gPiA+ICsJCS8qIFhYWCBzaWduYWwgZm9yIGFsbCBkZXZp
Y2VzIGluIGdyb3VwIHRvIGJlIHJlbW92ZWQgb3INCj4gPiA+ICsJCSAqIHJlc29ydCB0byBraWxs
aW5nIHRoZSBwcm9jZXNzIGhvbGRpbmcgdGhlIGRldmljZSBmZHMuDQo+ID4gPiArCQkgKiBGb3Ig
bm93IGp1c3QgYmxvY2sgd2FpdGluZyBmb3IgcmVsZWFzZXMgdG8gd2FrZSB1cy4gKi8NCj4gPiA+
ICsJCXdhaXRfZXZlbnQodmZpby5yZWxlYXNlX3EsIHZmaW9fZGV2aWNlX3JlbW92ZWFibGUoZGV2
aWNlKSk7DQo+ID4NCj4gPiBBbnkgbmV3IGlkZWEvcHJvcG9zYWwgb24gaG93IHRvIGhhbmRsZSB0
aGlzIHNpdHVhdGlvbj8NCj4gPiBUaGUgbGFzdCBvbmUgSSByZW1lbWJlciB3YXMgdG8gbGVhdmUg
dGhlIHNvZnQvaGFyZC9ldGMgdGltZW91dA0KPiBoYW5kbGluZyBpbg0KPiA+IHVzZXJzcGFjZSBh
bmQgaW1wbGVtZW50IGl0IGFzIGEgc29ydCBvZiBwb2xpY3kuIElzIHRoYXQgb25lIHN0aWxsIHRo
ZQ0KPiBtb3N0DQo+ID4gbGlrZWx5IGNhbmRpZGF0ZSBzb2x1dGlvbiB0byBoYW5kbGUgdGhpcyBz
aXR1YXRpb24/DQo+IA0KPiBJIGhhdmVuJ3QgaGVhcmQgYW55IG5ldyBwcm9wb3NhbHMuICBJIHRo
aW5rIHdlIG5lZWQgdGhlIGhhcmQgdGltZW91dA0KPiBoYW5kbGluZyBpbiB0aGUga2VybmVsLiAg
V2UgY2FuJ3QgbGVhdmUgaXQgdG8gdXNlcnNwYWNlIHRvIGRlY2lkZSB0aGV5DQo+IGdldCB0byBr
ZWVwIHRoZSBkZXZpY2UuICBXZSBjb3VsZCBoYXZlIHRoaXMgdHVuYWJsZSB2aWEgYW4gaW9jdGws
IGJ1dCBJDQo+IGRvbid0IHNlZSBob3cgd2Ugd291bGRuJ3QgcmVxdWlyZSBDQVBfU1lTX0FETUlO
IChvciBzaW1pbGFyKSB0byB0d2Vhaw0KPiBpdC4gIEkgd2FzIGludGVuZGluZyB0byByZS1pbXBs
ZW1lbnQgdGhlIG5ldGxpbmsgaW50ZXJmYWNlIHRvIHNpZ25hbA0KPiB0aGUNCj4gcmVtb3ZhbCwg
YnV0IGV4cGVjdCB0byBnZXQgYWxsZXJnaWMgcmVhY3Rpb25zIHRvIHRoYXQuDQoNCihJIHBlcnNv
bmFsbHkgbGlrZSB0aGUgYXN5bmMgbmV0bGluayBzaWduYWxpbmcsIGJ1dCBJIGFtIE9LIHdpdGgg
YW4gaW9jdGwgYmFzZWQNCm1lY2hhbmlzbSBpZiBpdCBwcm92aWRlcyB0aGUgc2FtZSBmbGV4aWJp
bGl0eSkNCg0KV2hhdCB3b3VsZCBiZSBhIHJlYXNvbmFibGUgaGFyZCB0aW1lb3V0Pw0KDQovQ2hy
aXMNCg0KDQo=
^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-09 21:08 ` Christian Benvenuti (benve)
@ 2011-11-09 23:40 ` Alex Williamson
0 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-09 23:40 UTC (permalink / raw)
To: Christian Benvenuti (benve)
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf,
Aaron Fabbri (aafabbri), B08248, B07421, avi, konrad.wilk, kvm,
qemu-devel, iommu, linux-pci
On Wed, 2011-11-09 at 15:08 -0600, Christian Benvenuti (benve) wrote:
<snip>
> > > > +
> > > > +struct vfio_group {
> > > > +        dev_t                   devt;
> > > > +        unsigned int            groupid;
> > >
> > > This groupid is returned by the device_group callback you recently
> > added
> > > with a separate (not yet in tree) IOMMU patch.
> > > Is it correct to say that the scope of this ID is the bus the iommu
> > > belongs to (but you use it as if it were global)?
> > > I believe there is nothing right now to ensure the uniqueness of such
> > > ID across bus types (assuming there will be other bus drivers in the
> > > future besides vfio-pci).
> > > If that's the case, the vfio.group_list global list and the
> > __vfio_lookup_dev
> > > routine should be changed to account for the bus too?
> > > Oops, I just saw the error msg in vfio_group_add_dev about the group
> > id conflict.
> > > Is that warning related to what I mentioned above?
> >
> > Yeah, this is a concern, but I can't think of a system where we would
> > manifest a collision. The IOMMU driver is expected to provide unique
> > groupids for all devices below them, but we could imagine a system that
> > implements two different bus_types, each with a different IOMMU driver
> > and we have no coordination between them. Perhaps since we have
> > iommu_ops per bus, we should also expose the bus in the vfio group
> > path,
> > ie. /dev/vfio/%s/%u, dev->bus->name, iommu_device_group(dev,..). This
> > means userspace would need to do a readlink of the subsystem entry
> > where
> > it finds the iommu_group to find the vfio group. Reasonable?
>
> Most probably we won't see use cases with multiple buses anytime soon, but
> this scheme you proposed (with the per-bus subdir) looks good to me.
Ok, I think that's easier than any scheme of trying to organize globally
unique groupids instead of just bus_type unique. That makes group
objects internally matched by the {groupid, bus} pair.
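For illustration only, a minimal sketch of what that lookup could look
like, assuming the group keeps a pointer to its bus (group->bus), as the
RFC's vfio_group_add_dev() already does for its bus-conflict warning:

        list_for_each(pos, &vfio.group_list) {
                group = list_entry(pos, struct vfio_group, group_next);
                if (group->groupid == groupid && group->bus == dev->bus)
                        break;
                group = NULL;
        }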
<snip>
> > >
> > > I looked at how you take care of ref counts ...
> > >
> > > This is how the tree of vfio_iommu/vfio_group/vfio_device data
> > > structures is organized (I'll use just iommu/group/dev to make
> > > the graph smaller):
> > >
> > >            iommu
> > >            /   \
> > >           /     \
> > >       group ... group
> > >       /  \       /  \
> > >      /    \     /    \
> > >    dev .. dev dev .. dev
> > >
> > > This is how you get a file descriptor for the three kind of objects:
> > >
> > > - group : open /dev/vfio/xxx for group xxx
> > > - iommu : group ioctl VFIO_GROUP_GET_IOMMU_FD
> > > - device: group ioctl VFIO_GROUP_GET_DEVICE_FD
> > >
> > > Given the above topology, I would assume that:
> > >
> > > (1) an iommu is 'inuse' if : a) iommu refcnt > 0, or
> > > b) any of its groups is 'inuse'
> > >
> > > (2) a group is 'inuse' if : a) group refcnt > 0, or
> > > b) any of its devices is 'inuse'
> > >
> > > (3) a device is 'inuse' if : a) device refcnt > 0
> >
> > (2) is a bit debatable. I've wrestled with this one for a while. The
> > vfio_iommu serves two purposes. First, it is the object we use for
> > managing iommu domains, which includes allocating domains and attaching
> > devices to domains. Groups objects aren't involved here, they just
> > manage the set of devices. The second role is to manage merged groups,
> > because whether or not groups can be merged is a function of iommu
> > domain compatibility.
> >
> > So if we look at "is the iommu in use?" ie. can I destroy the mapping
> > context, detach devices and free the domain, the reference count on the
> > group is irrelevant. The user has to have a device or iommu file
> > descriptor opened somewhere, across the group or merged group, for that
> > context to be maintained. A reasonable requirement, I think.
>
> OK, then if you close all devices and the iommu, keeping the group open
> would not protect the iommu domain mapping. This means that if you (or
> a management application) need to close all devices+iommu and reopen
> the same devices+iommu right away, you may get a failure on the
> iommu domain creation (supposing the system runs out of resources).
> Is this just a very unlikely scenario?
Can you think of a use case that would require such? I can't.
> I guess in this case you would simply have to avoid releasing the iommu
> fd, right?
Right. We could also debate whether we should drop all iommu mappings
when the iommu refcnt goes to zero. We don't currently do that, but it
might make sense.
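A sketch of that idea, assuming the RFC's existing unmap-all helper and
reference counting (not something the current patch does):

        /* hypothetical: in the iommu release path, once the last ref drops */
        if (iommu->refcnt == 0)
                vfio_iommu_unmapall(iommu);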
>
> > However, if we ask "is the group in use?" ie. can I not only destroy
> > the
> > mappings above, but also automatically tear apart merged groups, then I
> > think we need to look at the group refcnt.
>
> Correct.
>
> > There's also a symmetry factor, the group is a benign entry point to
> > device access. It's only when device or iommu access is granted that
> > the group gains any real power. Therefore, shouldn't that power also
> > be
> > removed when those access points are closed?
> >
> > > You have coded the 'inuse' logic with these three routines:
> > >
> > > __vfio_iommu_inuse, which implements (1) above
> > >
> > > and
> > > __vfio_iommu_groups_inuse
> >
> > Implements (2.a)
>
> Yes, but for all groups at once.
Right
> > > __vfio_group_devs_inuse
> >
> > Implements (2.b)
>
> Yes
>
> > > which are used by __vfio_iommu_inuse.
> > > Why don't you check the group refcnt in __vfio_iommu_groups_inuse?
> >
> > Hopefully explained above, but open for discussion.
> >
> > > Would it make sense (and the code more readable) to structure the
> > > nested refcnt/inuse check like this?
> > > (The numbers (1)(2)(3) refer to the three 'inuse' conditions above)
> > >
> > > (1)__vfio_iommu_inuse
> > >     |
> > >     +-> check iommu refcnt
> > >     +-> __vfio_iommu_groups_inuse
> > >          |
> > >          +->LOOP: (2)__vfio_iommu_group_inuse<--MISSING
> > >                       |
> > >                       +-> check group refcnt<--MISSING
> > >                       +-> __vfio_group_devs_inuse()
> > >                            |
> > >                            +-> LOOP: (3)__vfio_group_dev_inuse<--MISSING
> > >                                          |
> > >                                          +-> check device refcnt
> >
> > We currently do:
> >
> > (1)__vfio_iommu_inuse
> >     |
> >     +-> check iommu refcnt
> >     +-> __vfio_group_devs_inuse
> >          |
> >          +->LOOP: (2.b)__vfio_group_devs_inuse
> >                       |
> >                       +-> LOOP: (3) check device refcnt
> >
> > If that passes, the iommu context can be dissolved and we follow up
> > with:
> >
> > __vfio_iommu_groups_inuse
> >     |
> >     +-> LOOP: (2.a)__vfio_iommu_groups_inuse
> >                  |
> >                  +-> check group refcnt
> >
> > If that passes, groups can also be umerged.
> >
> > Is this right?
>
> Yes, assuming we stick to the "benign" role of groups you
> described above.
Ok, no change then. Thanks for looking at that so closely.
<snip>
> > > > +static int vfio_group_merge(struct vfio_group *group, int fd)
> > >
> > > The documentation in vfio.txt explains clearly the logic implemented
> > by
> > > the merge/unmerge group ioctls.
> > > However, what you are doing is not merging groups, but rather
> > adding/removing
> > > groups to/from iommus (and creating flat lists of groups).
> > > For example, when you do
> > >
> > > merge(A,B)
> > >
> > > you actually mean to say "merge B to the list of groups assigned to
> > the
> > > same iommu as group A".
> >
> > It's actually a little more than that. After you've merged B into A,
> > you can close the file descriptor for B and access all of the devices
> > for the merged group from A.
>
> It is actually more...
>
> Scenario 1:
>
> create_grp(A)
> create_grp(B)
> ...
> merge_grp(A,B)
> create_grp(C)
> merge_grp(C,B) ... this works, right?
No, but merge_grp(B,C) does. I currently require that the incoming
group has no open device or iommu file descriptors and is a singular
group. The device/iommu is a hard requirement since we'll be changing
the iommu context and can't leave an attack window. The singular group
is an implementation detail. Given the iommu/device requirement, it's
just as easy for userspace to tear apart the group and pass each
individually.
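As a userspace-side sketch of that ordering rule (fd names, headers and
the exact ioctl argument form are assumptions, not taken from the patch):

        int a = open("/dev/vfio/1", O_RDWR);
        int b = open("/dev/vfio/2", O_RDWR);
        int c = open("/dev/vfio/3", O_RDWR);

        ioctl(a, VFIO_GROUP_MERGE, &b); /* ok: b is singular, no open fds */
        ioctl(b, VFIO_GROUP_MERGE, &c); /* ok: c is the singular incoming group */
        /* ioctl(c, VFIO_GROUP_MERGE, &b) would fail: b is already part of a
         * merged group, so it can't be the incoming side */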
> Scenario 2:
>
> create_grp(A)
> create_grp(B)
> fd_x = get_dev_fd(B,x)
> ...
> merge_grp(A,B)
NAK, this fails the no-open-device test. Again, merge_grp(B,A) is supported.
> create_grp(C)
> merge_grp(A,C)
Yep, this works.
> fd_x = get_dev_fd(C,x)
Yep, and if x is the same in both cases, you'll get 2 different file
descriptors backed by the same device.
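Roughly (device name hypothetical, argument form assumed):

        int fd1 = ioctl(grp_fd1, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
        int fd2 = ioctl(grp_fd2, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
        /* fd1 != fd2, but both are backed by the same vfio_device */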
> Those two examples seem to suggest to me more of a list abstraction than a merge abstraction.
> However, if it fits into the agreed syntax/logic it is ok, as long as we document it
> properly.
Can you suggest documentation changes that would make this more clear?
> > > For the same reason, you do not really need to provide the group you
> > want
> > > to unmerge from, which means that instead of
> > >
> > > unmerge(A,B)
> > >
> > > you would just need
> > >
> > > unmerge(B)
> >
> > Good point, we can avoid the awkward reference via file descriptor for
> > the unmerge.
> >
> > > I understand the reason why it is not a real merge/unmerge (ie, to
> > keep the
> > > original groups so that you can unmerge later)
> >
> > Right, we still need to have visibility of the groups comprising the
> > merged group, but the abstraction provided to the user seems to be
> > deeper than you're thinking.
> >
> > > ... however I just wonder if
> > > it wouldn't be more natural to implement the
> > VFIO_IOMMU_ADD_GROUP/DEL_GROUP
> > > iommu ioctls instead? (the relationships between the data structure
> > would
> > > remain the same)
> > > I guess you already discarded this option for some reasons, right?
> > What was
> > > the reason?
> >
> > It's a possibility, I'm not sure it was discussed or really what
> > advantage it provides. It seems like we'd logically lose the ability
> > to
> > access devices from other groups,
>
> What is the real (immediate) benefit of this capability?
Mostly convenience, but it also promotes the peer idea where merged groups
simply create a "super" group that can access the iommu and all the
devices of the member groups. On x86 we expect that merging groups will
always succeed and groups will typically have a single device, so a
driver could merge them all together, throw away all the extra group
file descriptors and manage the whole super group via a single group fd.
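A rough sketch of that usage from userspace (paths, argument form and
error handling are all assumed/omitted):

        int i, super = open("/dev/vfio/0", O_RDWR);

        for (i = 1; i < ngroups; i++) {
                int g = open(group_path[i], O_RDWR);
                ioctl(super, VFIO_GROUP_MERGE, &g);
                close(g);       /* devices stay reachable through 'super' */
        }
        iommu_fd = ioctl(super, VFIO_GROUP_GET_IOMMU_FD);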
> > whether that's good or bad, I don't know. I think the notion of "merge"
> > promotes the idea that the groups
> > are peers and an iommu_add/del feels a bit more hierarchical.
>
> I agree.
<snip>
> > > > +        if (!device) {
> > > > +                if (__vfio_group_devs_inuse(group) ||
> > > > +                    (group->iommu && group->iommu->refcnt)) {
> > > > +                        printk(KERN_WARNING
> > > > +                               "Adding device %s to group %u while group is
> > > > already in use!!\n",
> > > > +                               dev_name(dev), group->groupid);
> > > > +                        /* XXX How to prevent other drivers from claiming? */
> > >
> > > Here we are adding a device (not yet assigned to a vfio bus) to a
> > group
> > > that is already in use.
> > > Given that it would not be acceptable for this device to get assigned
> > > to a non vfio driver, why not forcing such assignment here then?
> >
> > Exactly, I just don't know the mechanics of how to make that happen and
> > was hoping for suggestions...
> >
> > > I am not sure though what the best way to do it would be.
> > > What about something like this:
> > >
> > > - when the bus vfio-pci processes the BUS_NOTIFY_ADD_DEVICE
> > > notification it assigns to the device a PCI ID that will make sure
> > > the vfio-pci's probe routine will be invoked (and no other driver
> > can
> > > therefore claim the device). That PCI ID would have to be added
> > > to the vfio_pci_driver's id_table (it would be the exception to the
> > > "only dynamic IDs" rule). Too hackish?
> >
> > Presumably some other driver also has the ID in it's id_table, how do
> > we make sure we win?
>
> By mangling such ID (when processing the BUS_NOTIFY_ADD_DEVICE notification) to
> match against a 'fake' ID registered in the vfio-pci table (it would be like a
> sort of driver redirect/divert). The vfio-pci's probe routine would restore
> the original ID (we do not want to confuse userspace). This is hackish, I agree.
>
> What about this:
> - When vfio-pci processes the BUS_NOTIFY_ADD_DEVICE notification it can
> pre-initialize the driver pointer (via an API). We would then need to change
> the match/probe PCI mechanism too: for example, the PCI core will have to check
> and honor such pre-driver-initialization when present (and give it higher
> priority over the match callbacks).
> How to do this? For example, when vfio_group_add_dev is invoked, it checks
> whether the device is getting added to an already existent group where
> the other devices (well, you would need to check just one of the devices in
> the group) are already assigned to vfio-pci, and in such a case it
> pre-initialize the driver to vfio-pci.
It's ok to make a group "non-viable"; we only want to intervene if the
iommu is in use (iommu or device refcnt > 0).
>
> NOTE: By "preinit" I mean "save into the device a reference to a driver before
> the 'match' callbacks".
>
> This would be the timeline:
>
> |
> +-> new device gets added to (PCI) bus
> |
> +-> PCI: send BUS_NOTIFIER_ADD_DEVICE notification
> |
> +-> VFIO:vfio_pci_device_notifier
> |        |
> |        +-> BUS_NOTIFIER_ADD_DEVICE: vfio_group_add_dev
> |            |
> |            +->iommu_device_group(dev,&groupid)
> |            +->group = <search groupid in vfio.group_list>
> |            +->if (group && group_is_vfio(group))
> |            |        <preinit device driver to vfio-pci>
> |            ...
> |
> +-> PCI: xxx
> |        |
> |        +-> if (!device_driver_is_preinit(dev))
> |        |       probe=<search driver's probe callback using 'match'>
> |        |   else
> |        |       probe=<get it from preinit driver config>
> |        |       (+fallback to 'match' if preinit driver disappeared?)
> |        |
> |        +-> rc = probe(...)
> |        |
> |        ...
> v
> ...
>
> Of course, what if multiple drivers decide to preinit the device ?
Yep, we'd have to have a policy to BUG_ON if the preinit driver is
already set.
> One way to make it cleaner would be to:
> - have the PCI layer export an API that allows (for example) the bus
> notification callbacks (like vfio_pci_device_notifier) to preinit a driver
> - make such API reject calls on devices that already have a preinit
> driver.
> - make VFIO detect the case where vfio_pci_device_notifier can not
> preinit the driver (to vfio-pci) for the new device (because already
> preinited) and raise an error/warning.
>
> Would this look a bit cleaner?
It looks like there might already be infrastructure we can use to set
dev->driver and call the driver's probe() function, so maybe we're only in
trouble if dev->driver is already set when we get the bus add
notification. I just wasn't sure if that was entirely kosher. I'll
have to try that and figure out how to test it; fake hotplug maybe.
<snip>
> > > This fn below does not return any error code. Ok ...
> > > However, there are a number of errors case that you test, for example
> > > - device that does not belong to any group (according to iommu API)
> > > - device that belongs to a group but that does not appear in the list
> > > of devices of the vfio_group structure.
> > > Are the above two errors checks just paranoia or are those errors
> > actually possible?
> > > If they were possible, shouldn't we generate a warning (most probably
> > > it would be a bug in the code)?
> >
> > They're all vfio-bus driver bugs of some sort, so it's just a matter of
> > how much we want to scream about them.  I'll comment on each below.
> >
> > > > +void vfio_group_del_dev(struct device *dev)
> > > > +{
> > > > +        struct list_head *pos;
> > > > +        struct vfio_group *group = NULL;
> > > > +        struct vfio_device *device = NULL;
> > > > +        unsigned int groupid;
> > > > +
> > > > +        if (iommu_device_group(dev, &groupid))
> > > > +                return;
> >
> > Here the bus driver is probably just sitting on a notifier list for
> > their bus_type and a device is getting removed. Unless we want to
> > require the bus driver to track everything it's attempted to add and
> > whether it worked, we can just ignore this.
>
> OK, I see what you mean. If vfio_group_add_dev fails for some reason we
> do not keep track of it. Right?
The primary thing I'm thinking of here is not vfio_group_add_dev()
failing for "some reason", but specifically failing because the device
doesn't have a groupid, ie. it's not behind an iommu. In that case it's
just a random device that can't be used by vfio.
> Would it make sense to add one special group to vfio.group_list (or better,
> in a separate field of the vfio structure) whose goal
> would be just that: keep track of those devices that failed to be added
> to the VFIO framework (can it help for debugging too?)?
For the above case, no, we shouldn't need to track those. But it does
seem like there's a gap for devices that fail vfio_group_add_dev() for
other reasons. I don't think we want a special group for them, because
that isolates them from other devices that are potentially in the same
group. I think instead what we want to do is set a taint flag on the
group. We can BUG_ON if we're unable to allocate a group, then
WARN_ON if we fail elsewhere and mark the group tainted so it's
effectively never viable.
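Purely illustrative, the flag name and error handling are not from the
patch:

        /* in struct vfio_group */
        bool                    tainted;

        /* in vfio_group_add_dev(), on a failure once the group exists */
        WARN(1, "vfio: failed to add %s to group %u, tainting group\n",
             dev_name(dev), group->groupid);
        group->tainted = true;

        /* and any "is this group viable?" test would then also check */
        if (group->tainted)
                return false;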
<snip>
> > > > +        if (!vfio_device_removeable(device)) {
> > > > +                /* XXX signal for all devices in group to be removed or
> > > > +                 * resort to killing the process holding the device fds.
> > > > +                 * For now just block waiting for releases to wake us. */
> > > > +                wait_event(vfio.release_q, vfio_device_removeable(device));
> > >
> > > Any new idea/proposal on how to handle this situation?
> > > The last one I remember was to leave the soft/hard/etc timeout
> > handling in
> > > userspace and implement it as a sort of policy. Is that one still the
> > most
> > > likely candidate solution to handle this situation?
> >
> > I haven't heard any new proposals. I think we need the hard timeout
> > handling in the kernel. We can't leave it to userspace to decide they
> > get to keep the device. We could have this tunable via an ioctl, but I
> > don't see how we wouldn't require CAP_SYS_ADMIN (or similar) to tweak
> > it. I was intending to re-implement the netlink interface to signal
> > the
> > removal, but expect to get allergic reactions to that.
>
> (I personally like the async netlink signaling, but I am OK with an ioctl based
> mechanism if it provides the same flexibility)
>
> What would be a reasonable hard timeout?
I think we were looking at 10s of seconds in the old vfio code. Tough
call though. Could potentially provide a module_param override so an
admin that trusts their users could set long/infinite timeout. Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [RFC PATCH] vfio: VFIO Driver core framework
[not found] <20111103195452.21259.93021.stgit@bling.home>
2011-11-09 4:17 ` [RFC PATCH] vfio: VFIO Driver core framework Aaron Fabbri
2011-11-09 8:11 ` Christian Benvenuti (benve)
@ 2011-11-10 0:57 ` Christian Benvenuti (benve)
2011-11-11 18:04 ` Alex Williamson
2011-11-11 17:51 ` Konrad Rzeszutek Wilk
` (3 subsequent siblings)
6 siblings, 1 reply; 62+ messages in thread
From: Christian Benvenuti (benve) @ 2011-11-10 0:57 UTC (permalink / raw)
To: Alex Williamson, chrisw, aik, pmac, dwg, joerg.roedel, agraf,
Aaron Fabbri (aafabbri), B08248, B07421, avi, konrad.wilk, kvm,
qemu-devel, iommu, linux-pci
[This message body was base64-encoded in the archive and is omitted here;
it contains Christian Benvenuti's review comments on vfio_iommu.c ("Here
are few minor comments on vfio_iommu.c ..."), which are quoted in Alex
Williamson's reply below.]
^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-10 0:57 ` Christian Benvenuti (benve)
@ 2011-11-11 18:04 ` Alex Williamson
2011-11-11 22:22 ` Christian Benvenuti (benve)
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-11 18:04 UTC (permalink / raw)
To: Christian Benvenuti (benve)
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf,
Aaron Fabbri (aafabbri), B08248, B07421, avi, konrad.wilk, kvm,
qemu-devel, iommu, linux-pci
On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
> Here are a few minor comments on vfio_iommu.c ...
Sorry, I've been poking sticks at trying to figure out a clean way to
solve the force vfio driver attach problem.
> > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > new file mode 100644
> > index 0000000..029dae3
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu.c
<snip>
> > +
> > +#include "vfio_private.h"
>
> Doesn't the 'dma_' prefix belong to the generic DMA code?
Sure, we could make these more vfio-centric.
> > +struct dma_map_page {
> > + struct list_head list;
> > + dma_addr_t daddr;
> > + unsigned long vaddr;
> > + int npage;
> > + int rdwr;
> > +};
> > +
> > +/*
> > + * This code handles mapping and unmapping of user data buffers
> > + * into DMA'ble space using the IOMMU
> > + */
> > +
> > +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> > +
> > +struct vwork {
> > + struct mm_struct *mm;
> > + int npage;
> > + struct work_struct work;
> > +};
> > +
> > +/* delayed decrement for locked_vm */
> > +static void vfio_lock_acct_bg(struct work_struct *work)
> > +{
> > + struct vwork *vwork = container_of(work, struct vwork, work);
> > + struct mm_struct *mm;
> > +
> > + mm = vwork->mm;
> > + down_write(&mm->mmap_sem);
> > + mm->locked_vm += vwork->npage;
> > + up_write(&mm->mmap_sem);
> > + mmput(mm); /* unref mm */
> > + kfree(vwork);
> > +}
> > +
> > +static void vfio_lock_acct(int npage)
> > +{
> > + struct vwork *vwork;
> > + struct mm_struct *mm;
> > +
> > + if (!current->mm) {
> > + /* process exited */
> > + return;
> > + }
> > + if (down_write_trylock(¤t->mm->mmap_sem)) {
> > + current->mm->locked_vm += npage;
> > + up_write(¤t->mm->mmap_sem);
> > + return;
> > + }
> > + /*
> > + * Couldn't get mmap_sem lock, so must setup to decrement
> ^^^^^^^^^
>
> Increment?
Yep
<snip>
> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > start,
> > + size_t size, struct dma_map_page *mlp)
> > +{
> > + struct dma_map_page *split;
> > + int npage_lo, npage_hi;
> > +
> > + /* Existing dma region is completely covered, unmap all */
>
> This works. However, given how vfio_dma_map_dm implements the merging
> logic, I think it is impossible to have
>
> (start < mlp->daddr &&
> start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))
It's quite possible. This allows userspace to create a sparse mapping,
then blow it all away with a single unmap from 0 to ~0.
> > + if (start <= mlp->daddr &&
> > + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > + list_del(&mlp->list);
> > + npage_lo = mlp->npage;
> > + kfree(mlp);
> > + return npage_lo;
> > + }
> > +
> > + /* Overlap low address of existing range */
>
> Same as above (ie, '<' is impossible)
existing: |<--- A --->| |<--- B --->|
unmap: |<--- C --->|
Maybe not good practice from userspace, but we shouldn't count on
userspace to be well behaved.
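To make that case concrete, here is a rough userspace sketch of the usage
being discussed, written against the vfio_dma_map structure and the
MAP/UNMAP ioctls proposed in this RFC (the IOVAs, buffer, and helper names
are made up for the example, and the final API may well differ):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/vfio.h>		/* the RFC's proposed header */

/* Map one page-aligned chunk of user memory at the given IOVA. */
static int map_chunk(int iommu_fd, void *buf, uint64_t iova, size_t size)
{
	struct vfio_dma_map dm;

	memset(&dm, 0, sizeof(dm));
	dm.len = sizeof(dm);
	dm.vaddr = (uintptr_t)buf;
	dm.dmaaddr = iova;
	dm.size = size;
	dm.flags = VFIO_DMA_MAP_FLAG_WRITE;
	return ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &dm);
}

/* Build a sparse mapping, then drop all of it with a single unmap. */
static int sparse_then_blanket_unmap(int iommu_fd)
{
	size_t pg = sysconf(_SC_PAGESIZE);
	void *buf = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct vfio_dma_map dm;

	if (buf == MAP_FAILED)
		return -1;

	map_chunk(iommu_fd, buf, 0x100000, pg);			/* region A */
	map_chunk(iommu_fd, (char *)buf + pg, 0x900000, pg);	/* region B */

	/* One inexact unmap covering "0 to ~0", rounded to a page boundary */
	memset(&dm, 0, sizeof(dm));
	dm.len = sizeof(dm);
	dm.dmaaddr = 0;
	dm.size = ~(uint64_t)0 & ~((uint64_t)pg - 1);
	return ioctl(iommu_fd, VFIO_IOMMU_UNMAP_DMA, &dm);
}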
> > + if (start <= mlp->daddr) {
> > + size_t overlap;
> > +
> > + overlap = start + size - mlp->daddr;
> > + npage_lo = overlap >> PAGE_SHIFT;
> > + npage_hi = mlp->npage - npage_lo;
> > +
> > + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > + mlp->daddr += overlap;
> > + mlp->vaddr += overlap;
> > + mlp->npage -= npage_lo;
> > + return npage_lo;
> > + }
>
> Same as above (ie, '>' is impossible).
Same example as above.
> > + /* Overlap high address of existing range */
> > + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + size_t overlap;
> > +
> > + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > + npage_hi = overlap >> PAGE_SHIFT;
> > + npage_lo = mlp->npage - npage_hi;
> > +
> > + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > + mlp->npage -= npage_hi;
> > + return npage_hi;
> > + }
<snip>
> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map
> > *dmp)
> > +{
> > + int npage;
> > + struct dma_map_page *mlp, *mmlp = NULL;
> > + dma_addr_t daddr = dmp->dmaaddr;
> > + unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > + size_t size = dmp->size;
> > + int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > +
> > + if (vaddr & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (daddr & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (size & (PAGE_SIZE-1))
> > + return -EINVAL;
> > +
> > + npage = size >> PAGE_SHIFT;
> > + if (!npage)
> > + return -EINVAL;
> > +
> > + if (!iommu)
> > + return -EINVAL;
> > +
> > + mutex_lock(&iommu->dgate);
> > +
> > + if (vfio_find_dma(iommu, daddr, size)) {
> > + ret = -EBUSY;
> > + goto out_lock;
> > + }
> > +
> > + /* account for locked pages */
> > + locked = current->mm->locked_vm + npage;
> > + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > + __func__, rlimit(RLIMIT_MEMLOCK));
> > + ret = -ENOMEM;
> > + goto out_lock;
> > + }
> > +
> > + ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> > + if (ret)
> > + goto out_lock;
> > +
> > + /* Check if we abut a region below */
>
> Is !daddr possible?
Sure, an IOVA of 0x0. There's no region below if we start at zero.
> > + if (daddr) {
> > + mlp = vfio_find_dma(iommu, daddr - 1, 1);
> > + if (mlp && mlp->rdwr == rdwr &&
> > + mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> > +
> > + mlp->npage += npage;
> > + daddr = mlp->daddr;
> > + vaddr = mlp->vaddr;
> > + npage = mlp->npage;
> > + size = NPAGE_TO_SIZE(npage);
> > +
> > + mmlp = mlp;
> > + }
> > + }
>
> Is !(daddr + size) possible?
Same, there's no region above if this region goes to the top of the
address space, ie. 0xffffffff_fffff000 + 0x1000
Hmm, wonder if I'm missing a check for wrapping.
> > + if (daddr + size) {
> > + mlp = vfio_find_dma(iommu, daddr + size, 1);
> > + if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size)
> > {
> > +
> > + mlp->npage += npage;
> > + mlp->daddr = daddr;
> > + mlp->vaddr = vaddr;
> > +
> > + /* If merged above and below, remove previously
> > + * merged entry. New entry covers it. */
> > + if (mmlp) {
> > + list_del(&mmlp->list);
> > + kfree(mmlp);
> > + }
> > + mmlp = mlp;
> > + }
> > + }
> > +
> > + if (!mmlp) {
> > + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> > + if (!mlp) {
> > + ret = -ENOMEM;
> > + vfio_dma_unmap(iommu, daddr, npage, rdwr);
> > + goto out_lock;
> > + }
> > +
> > + mlp->npage = npage;
> > + mlp->daddr = daddr;
> > + mlp->vaddr = vaddr;
> > + mlp->rdwr = rdwr;
> > + list_add(&mlp->list, &iommu->dm_list);
> > + }
> > +
> > +out_lock:
> > + mutex_unlock(&iommu->dgate);
> > + return ret;
> > +}
> > +
> > +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_iommu *iommu = filep->private_data;
> > +
> > + vfio_release_iommu(iommu);
> > + return 0;
> > +}
> > +
> > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + struct vfio_iommu *iommu = filep->private_data;
> > + int ret = -ENOSYS;
>
> Any reason for not using "switch" ?
It got ugly in vfio_main, so I decided to be consistent w/ it in the
driver and use if/else here too. I don't like the aesthetics of extra
{}s to declare variables within a switch, nor do I like declaring all
the variables for each case for the whole function. Personal quirk.
> > + if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > +
> > + ret = put_user(flags, (u64 __user *)arg);
> > +
> > + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > + struct vfio_dma_map dm;
> > +
> > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > + return -EFAULT;
>
> What does the "_dm" suffix stand for?
Inherited from Tom, but I figure _dma_map_dm = action(dma map),
object(dm), which is a vfio_Dma_Map.
Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-11 18:04 ` Alex Williamson
@ 2011-11-11 22:22 ` Christian Benvenuti (benve)
2011-11-14 22:59 ` Alex Williamson
0 siblings, 1 reply; 62+ messages in thread
From: Christian Benvenuti (benve) @ 2011-11-11 22:22 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf,
Aaron Fabbri (aafabbri), B08248, B07421, avi, konrad.wilk, kvm,
qemu-devel, iommu, linux-pci
^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-11 22:22 ` Christian Benvenuti (benve)
@ 2011-11-14 22:59 ` Alex Williamson
2011-11-15 0:05 ` David Gibson
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-14 22:59 UTC (permalink / raw)
To: Christian Benvenuti (benve)
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf,
Aaron Fabbri (aafabbri), B08248, B07421, avi, konrad.wilk, kvm,
qemu-devel, iommu, linux-pci
On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote:
> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, November 11, 2011 10:04 AM
> > To: Christian Benvenuti (benve)
> > Cc: chrisw@sous-sol.org; aik@au1.ibm.com; pmac@au1.ibm.com;
> > dwg@au1.ibm.com; joerg.roedel@amd.com; agraf@suse.de; Aaron Fabbri
> > (aafabbri); B08248@freescale.com; B07421@freescale.com; avi@redhat.com;
> > konrad.wilk@oracle.com; kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > iommu@lists.linux-foundation.org; linux-pci@vger.kernel.org
> > Subject: RE: [RFC PATCH] vfio: VFIO Driver core framework
> >
> > On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
> > > Here are a few minor comments on vfio_iommu.c ...
> >
> > Sorry, I've been poking sticks at trying to figure out a clean way to
> > solve the force vfio driver attach problem.
>
> Attach or detach?
Attach. For the case when a new device appears that belongs to a group
that is already in use. I'll probably add a claim() operation to the
vfio_device_ops that tells the driver to grab it. I was hoping for pci
this would just add it to the dynamic ids, but that hits device lock
problems.
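For what it's worth, a purely hypothetical sketch of that idea (none of
this is in the posted patch; the callback, its arguments, and the
group_in_use() test are all invented for illustration) might look like:

struct vfio_device_ops {
	bool	(*match)(struct device *, char *);
	int	(*claim)(struct device *);	/* hypothetical: take over the device */
	/* ... get/put/read/write/ioctl/mmap as in the RFC ... */
};

/* hypothetical core-side use when a device hot-plugs into an owned group */
static int vfio_group_device_added(struct vfio_group *group, struct device *dev)
{
	if (group_in_use(group) && group->ops->claim)
		return group->ops->claim(dev);	/* e.g. vfio-pci grabs it */
	return 0;
}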
> > > > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > > > new file mode 100644
> > > > index 0000000..029dae3
> > > > --- /dev/null
> > > > +++ b/drivers/vfio/vfio_iommu.c
> > <snip>
> > > > +
> > > > +#include "vfio_private.h"
> > >
> > > Doesn't the 'dma_' prefix belong to the generic DMA code?
> >
> > Sure, we could make these more vfio-centric.
>
> Like vfio_dma_map_page?
Something like that, though _page doesn't seem appropriate as it tracks
a region.
> >
> > > > +struct dma_map_page {
> > > > + struct list_head list;
> > > > + dma_addr_t daddr;
> > > > + unsigned long vaddr;
> > > > + int npage;
> > > > + int rdwr;
> > > > +};
> > > > +
> > > > +/*
> > > > + * This code handles mapping and unmapping of user data buffers
> > > > + * into DMA'ble space using the IOMMU
> > > > + */
> > > > +
> > > > +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> > > > +
> > > > +struct vwork {
> > > > + struct mm_struct *mm;
> > > > + int npage;
> > > > + struct work_struct work;
> > > > +};
> > > > +
> > > > +/* delayed decrement for locked_vm */
> > > > +static void vfio_lock_acct_bg(struct work_struct *work)
> > > > +{
> > > > + struct vwork *vwork = container_of(work, struct vwork, work);
> > > > + struct mm_struct *mm;
> > > > +
> > > > + mm = vwork->mm;
> > > > + down_write(&mm->mmap_sem);
> > > > + mm->locked_vm += vwork->npage;
> > > > + up_write(&mm->mmap_sem);
> > > > + mmput(mm); /* unref mm */
> > > > + kfree(vwork);
> > > > +}
> > > > +
> > > > +static void vfio_lock_acct(int npage)
> > > > +{
> > > > + struct vwork *vwork;
> > > > + struct mm_struct *mm;
> > > > +
> > > > + if (!current->mm) {
> > > > + /* process exited */
> > > > + return;
> > > > + }
> > > > + if (down_write_trylock(¤t->mm->mmap_sem)) {
> > > > + current->mm->locked_vm += npage;
> > > > + up_write(¤t->mm->mmap_sem);
> > > > + return;
> > > > + }
> > > > + /*
> > > > + * Couldn't get mmap_sem lock, so must setup to decrement
> > > ^^^^^^^^^
> > >
> > > Increment?
> >
> > Yep
Actually, side note, this is increment/decrement depending on the sign
of the parameter. So "update" may be more appropriate. I think Tom
originally used increment in one place and decrement in another to show
its dual use.
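In other words, something like this at the call sites (a sketch, not
quoted from the patch):

/* map path: pages were just pinned */
vfio_lock_acct(npage);		/* current->mm->locked_vm += npage */

/* unmap path: pages were just released */
vfio_lock_acct(-unlocked);	/* current->mm->locked_vm -= unlocked */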
> > <snip>
> > > > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > > > start,
> > > > + size_t size, struct dma_map_page *mlp)
> > > > +{
> > > > + struct dma_map_page *split;
> > > > + int npage_lo, npage_hi;
> > > > +
> > > > + /* Existing dma region is completely covered, unmap all */
> > >
> > > This works. However, given how vfio_dma_map_dm implements the merging
> > > logic, I think it is impossible to have
> > >
> > > (start < mlp->daddr &&
> > > start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))
> >
> > It's quite possible. This allows userspace to create a sparse mapping,
> > then blow it all away with a single unmap from 0 to ~0.
>
> I would prefer the user to use exact ranges in the unmap operations
> because it would make it easier to detect bugs/leaks in the map/unmap
> logic used by the callers.
> My assumptions are that:
>
> - the user always keeps track of the mappings
My qemu code plays a little on the loose side here, acting as a
passthrough for the internal memory client. But even there, worst case
would probably be trying to unmap a non-existent entry, not unmapping a
sparse range.
> - the user either unmaps one specific mapping or 'all of them'.
> The 'all of them' case would also take care of those cases where
> the user does _not_ keep track of mappings and simply uses
> the "unmap from 0 to ~0" each time.
>
> Because of this you could still provide an exact map/unmap logic
> and allow such "unmap from 0 to ~0" by making the latter a special
> case.
> However, if we want to allow any arbitrary/inexact unmap request, then OK.
I can't think of any good reasons we shouldn't be more strict. I think
it was primarily just convenient to hit all the corner cases since we
merge all the requests together for tracking and need to be able to
split them back apart. It does feel a little awkward to have a 0/~0
special case though, but I don't think it's worth adding another ioctl
to handle it.
<snip>
> > > > + if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > > > + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > > > +
> > > > + ret = put_user(flags, (u64 __user *)arg);
> > > > +
> > > > + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > > > + struct vfio_dma_map dm;
> > > > +
> > > > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > > > + return -EFAULT;
> > >
> > > What does the "_dm" suffix stand for?
> >
> > Inherited from Tom, but I figure _dma_map_dm = action(dma map),
> > object(dm), which is a vfio_Dma_Map.
>
> OK. The reason why I asked is that '_dm' does not add anything to 'vfio_dma_map'.
Yep. Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-14 22:59 ` Alex Williamson
@ 2011-11-15 0:05 ` David Gibson
2011-11-15 0:49 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 62+ messages in thread
From: David Gibson @ 2011-11-15 0:05 UTC (permalink / raw)
To: Alex Williamson
Cc: Christian Benvenuti (benve), chrisw, aik, pmac, joerg.roedel,
agraf, Aaron Fabbri (aafabbri), B08248, B07421, avi, konrad.wilk,
kvm, qemu-devel, iommu, linux-pci
On Mon, Nov 14, 2011 at 03:59:00PM -0700, Alex Williamson wrote:
> On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote:
[snip]
> > - the user either unmaps one specific mapping or 'all of them'.
> > The 'all of them' case would also take care of those cases where
> > the user does _not_ keep track of mappings and simply uses
> > the "unmap from 0 to ~0" each time.
> >
> > Because of this you could still provide an exact map/unmap logic
> > and allow such "unmap from 0 to ~0" by making the latter a special
> > case.
> > However, if we want to allow any arbitrary/inexact unmap request, then OK.
>
> I can't think of any good reasons we shouldn't be more strict. I think
> it was primarily just convenient to hit all the corner cases since we
> merge all the requests together for tracking and need to be able to
> split them back apart. It does feel a little awkward to have a 0/~0
> special case though, but I don't think it's worth adding another ioctl
> to handle it.
Being strict, or at least enforcing strictness, requires that the
infrastructure track all the maps, so that the unmaps can be
matched. This is not a natural thing with the data structures you
want for all IOMMUs. For example on POWER, the IOMMU (aka TCE table)
is a simple 1-level pagetable. One pointer with a couple of
permission bits per IOMMU page. Handling oddly overlapping operations
on that data structure is natural, enforcing strict matching of maps
and unmaps is not and would require extra information to be stored by
vfio. On POWER, the IOMMU operations often *are* a hot path, so
manipulating those structures would have a real cost, too.
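A conceptual sketch of the structure David describes (not the real POWER
code; the entry layout and names are invented for illustration):

#define TCE_READ	0x1UL
#define TCE_WRITE	0x2UL

struct tce_table {
	unsigned long *entries;		/* one entry per IOMMU page */
	unsigned long nr_pages;
};

/*
 * Mapping or unmapping an IOMMU page is a single store into the flat
 * table.  Nothing here remembers how the caller originally grouped its
 * map requests, so enforcing exact unmap matching would force vfio to
 * keep that bookkeeping on the side, on what is already a hot path.
 */
static void tce_set(struct tce_table *tbl, unsigned long iommu_page,
		    unsigned long real_addr, unsigned long perms)
{
	tbl->entries[iommu_page] = (real_addr & PAGE_MASK) | perms;
}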
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-15 0:05 ` David Gibson
@ 2011-11-15 0:49 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2011-11-15 0:49 UTC (permalink / raw)
To: David Gibson
Cc: Alex Williamson, Christian Benvenuti (benve), chrisw, aik, pmac,
joerg.roedel, agraf, Aaron Fabbri (aafabbri), B08248, B07421, avi,
konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Tue, 2011-11-15 at 11:05 +1100, David Gibson wrote:
> Being strict, or at least enforcing strictness, requires that the
> infrastructure track all the maps, so that the unmaps can be
> matching. This is not a natural thing with the data structures you
> want for all IOMMUs. For example on POWER, the IOMMU (aka TCE table)
> is a simple 1-level pagetable. One pointer with a couple of
> permission bits per IOMMU page. Handling oddly overlapping operations
> on that data structure is natural, enforcing strict matching of maps
> and unmaps is not and would require extra information to be stored by
> vfio. On POWER, the IOMMU operations often *are* a hot path, so
> manipulating those structures would have a real cost, too.
In fact they are a very hot path even. There's no way we can afford the
cost of tracking per page mapping/unmapping (other than bumping the page
count on a page that's currently mapped or via some debug-only feature).
Cheers,
Ben.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
[not found] <20111103195452.21259.93021.stgit@bling.home>
` (2 preceding siblings ...)
2011-11-10 0:57 ` Christian Benvenuti (benve)
@ 2011-11-11 17:51 ` Konrad Rzeszutek Wilk
2011-11-11 22:10 ` Alex Williamson
2011-11-12 0:14 ` Scott Wood
` (2 subsequent siblings)
6 siblings, 1 reply; 62+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-11 17:51 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, aafabbri,
B08248, B07421, avi, kvm, qemu-devel, iommu, linux-pci
On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects. See Documentation/vfio.txt included in
> this patch for user and kernel API description.
>
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the drvier version 0.2. It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
> Fingers crossed, this is the last RFC for VFIO, but we need
> the iommu group support before this can go upstream
> (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> hoping this helps push that along.
>
> Since the last posting, this version completely modularizes
> the device backends and better defines the APIs between the
> core VFIO code and the device backends. I expect that we
> might also adopt a modular IOMMU interface as iommu_ops learns
> about different types of hardware. Also many, many cleanups.
> Check the complete git history for details:
>
> git://github.com/awilliam/linux-vfio.git vfio-ng
>
> (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
>
> This version, along with the supporting VFIO PCI backend can
> be found here:
>
> git://github.com/awilliam/linux-vfio.git vfio-next-20111103
>
> I've held off on implementing a kernel->user signaling
> mechanism for now since the previous netlink version produced
> too many gag reflexes. It's easy enough to set a bit in the
> group flags to indicate such support in the future, so I
> think we can move ahead without it.
>
> Appreciate any feedback or suggestions. Thanks,
>
> Alex
>
> Documentation/ioctl/ioctl-number.txt | 1
> Documentation/vfio.txt | 304 +++++++++
> MAINTAINERS | 8
> drivers/Kconfig | 2
> drivers/Makefile | 1
> drivers/vfio/Kconfig | 8
> drivers/vfio/Makefile | 3
> drivers/vfio/vfio_iommu.c | 530 ++++++++++++++++
> drivers/vfio/vfio_main.c | 1151 ++++++++++++++++++++++++++++++++++
> drivers/vfio/vfio_private.h | 34 +
> include/linux/vfio.h | 155 +++++
> 11 files changed, 2197 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/vfio.txt
> create mode 100644 drivers/vfio/Kconfig
> create mode 100644 drivers/vfio/Makefile
> create mode 100644 drivers/vfio/vfio_iommu.c
> create mode 100644 drivers/vfio/vfio_main.c
> create mode 100644 drivers/vfio/vfio_private.h
> create mode 100644 include/linux/vfio.h
>
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 54078ed..59d01e4 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -88,6 +88,7 @@ Code Seq#(hex) Include File Comments
> and kernel/power/user.c
> '8' all SNP8023 advanced NIC card
> <mailto:mcr@solidum.com>
> +';' 64-76 linux/vfio.h
> '@' 00-0F linux/radeonfb.h conflict!
> '@' 00-0F drivers/video/aty/aty128fb.c conflict!
> 'A' 00-1F linux/apm_bios.h conflict!
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> new file mode 100644
> index 0000000..5866896
> --- /dev/null
> +++ b/Documentation/vfio.txt
> @@ -0,0 +1,304 @@
> +VFIO - "Virtual Function I/O"[1]
> +-------------------------------------------------------------------------------
> +Many modern system now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as
> +well as POWER systems with Partitionable Endpoints (PEs) and even
> +embedded powerpc systems (technology name unknown). The VFIO driver
> +is an IOMMU/device agnostic framework for exposing direct device
> +access to userspace, in a secure, IOMMU protected environment. In
> +other words, this allows safe, non-privileged, userspace drivers.
> +
> +Why do we want that? Virtual machines often make use of direct device
> +access ("device assignment") when configured for the highest possible
> +I/O performance. From a device and host perspective, this simply turns
> +the VM into a userspace driver, with the benefits of significantly
> +reduced latency, higher bandwidth, and direct use of bare-metal device
> +drivers[2].
Are there any constraints on running a 32-bit userspace with
a 64-bit kernel and with 32-bit user space drivers?
> +
> +Some applications, particularly in the high performance computing
> +field, also benefit from low-overhead, direct device access from
> +userspace. Examples include network adapters (often non-TCP/IP based)
> +and compute accelerators. Previous to VFIO, these drivers needed to
> +go through the full development cycle to become proper upstream driver,
> +be maintained out of tree, or make use of the UIO framework, which
> +has no notion of IOMMU protection, limited interrupt support, and
> +requires root privileges to access things like PCI configuration space.
> +
> +The VFIO driver framework intends to unify these, replacing both the
> +KVM PCI specific device assignment currently used as well as provide
> +a more secure, more featureful userspace driver environment than UIO.
> +
> +Groups, Devices, IOMMUs, oh my
<chuckles> oh my, eh?
> +-------------------------------------------------------------------------------
> +
> +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> +can't always distinguish transactions from each individual device in
> +the system. Sometimes this is because of the IOMMU design, such as with
> +PEs, other times it's caused by the I/O topology, for instance a
> +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> +devices created by these restictions IOMMU groups (or just "groups" for
> +this document).
> +
> +The IOMMU cannot distiguish transactions between the individual devices
> +within the group, therefore the group is the basic unit of ownership for
> +a userspace process. Because of this, groups are also the primary
> +interface to both devices and IOMMU domains in VFIO.
> +
> +The VFIO representation of groups is created as devices are added into
> +the framework by a VFIO bus driver. The vfio-pci module is an example
> +of a bus driver. This module registers devices along with a set of bus
> +specific callbacks with the VFIO core. These callbacks provide the
> +interfaces later used for device access. As each new group is created,
> +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> +character device.
> +
> +In addition to the device enumeration and callbacks, the VFIO bus driver
> +also provides a traditional device driver and is able to bind to devices
> +on it's bus. When a device is bound to the bus driver it's available to
> +VFIO. When all the devices within a group are bound to their bus drivers,
> +the group becomes "viable" and a user with sufficient access to the VFIO
> +group chardev can obtain exclusive access to the set of group devices.
> +
> +As documented in linux/vfio.h, several ioctls are provided on the
> +group chardev:
> +
> +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> +
> +The last two ioctls return new file descriptors for accessing
> +individual devices within the group and programming the IOMMU. Each of
> +these new file descriptors provide their own set of file interfaces.
> +These ioctls will fail if any of the devices within the group are not
> +bound to their VFIO bus driver. Additionally, when either of these
> +interfaces are used, the group is then bound to the struct_mm of the
> +caller. The GET_FLAGS ioctl can be used to view the state of the group.
> +
> +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> +new IOMMU domain is created and all of the devices in the group are
> +attached to it. This is the only way to ensure full IOMMU isolation
> +of the group, but potentially wastes resources and cycles if the user
> +intends to manage multiple groups with the same set of IOMMU mappings.
> +VFIO therefore provides a group MERGE and UNMERGE interface, which
> +allows multiple groups to share an IOMMU domain. Not all IOMMUs allow
> +arbitrary groups to be merged, so the user should assume merging is
> +opportunistic. A new group, with no open device or IOMMU file
> +descriptors, can be merged into an existing, in-use, group using the
> +MERGE ioctl. A merged group can be unmerged using the UNMERGE ioctl
> +once all of the device file descriptors for the group being merged
> +"out" are closed.
> +
> +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> +essentially fungible between group file descriptors (ie. if device A
> +is in group X, and X is merged with Y, a file descriptor for A can be
> +retrieved using GET_DEVICE_FD on Y. Likewise, GET_IOMMU_FD returns a
> +file descriptor referencing the same internal IOMMU object from either
> +X or Y). Merged groups can be dissolved either explictly with UNMERGE
> +or automatically when ALL file descriptors for the merged group are
> +closed (all IOMMUs, all devices, all groups).
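As an illustration of the flow described above, a userspace sketch might
look like the following (written against the RFC's proposed header; the
group numbers and device name are examples, and whether the MERGE argument
is passed by pointer, and on which group fd the ioctl is issued, are
assumptions here):

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int setup_groups(void)
{
	uint64_t flags;
	int grp_a, grp_b, iommu_fd, device_fd;

	grp_a = open("/dev/vfio/26", O_RDWR);	/* example group numbers */
	grp_b = open("/dev/vfio/42", O_RDWR);
	if (grp_a < 0 || grp_b < 0)
		return -1;

	if (ioctl(grp_a, VFIO_GROUP_GET_FLAGS, &flags) < 0 ||
	    !(flags & VFIO_GROUP_FLAGS_VIABLE))
		return -1;	/* some device in the group lacks its vfio driver */

	/* Merging is opportunistic; carry on with separate domains on failure */
	ioctl(grp_a, VFIO_GROUP_MERGE, &grp_b);

	iommu_fd = ioctl(grp_a, VFIO_GROUP_GET_IOMMU_FD);
	device_fd = ioctl(grp_a, VFIO_GROUP_GET_DEVICE_FD, "0000:06:00.0");
	return (iommu_fd < 0 || device_fd < 0) ? -1 : 0;
}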
> +
> +The IOMMU file descriptor provides this set of ioctls:
> +
> +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
Coherency support is not going to be addressed, right? What about sync?
Say you need to sync from the CPU to the device address?
> +
> +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> +We currently only support IOMMU domains that are able to map any
> +virtual address to any IOVA. This is indicated by the MAP_ANY flag.
> +
> +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> +and unmapping IOVAs to process virtual addresses:
> +
> +struct vfio_dma_map {
> + __u64 len; /* length of structure */
What is the purpose of the 'len' field? Is it to guard against future
version changes?
> + __u64 vaddr; /* process virtual addr */
> + __u64 dmaaddr; /* desired and/or returned dma address */
> + __u64 size; /* size in bytes */
> + __u64 flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> +};
> +
> +Current users of VFIO use relatively static DMA mappings, not requiring
> +high frequency turnover. As new users are added, it's expected that the
Is there a limit to how many DMA mappings can be created?
> +IOMMU file descriptor will evolve to support new mapping interfaces, this
> +will be reflected in the flags and may present new ioctls and file
> +interfaces.
> +
> +The device GET_FLAGS ioctl is intended to return basic device type and
> +indicate support for optional capabilities. Flags currently include whether
> +the device is PCI or described by Device Tree, and whether the RESET ioctl
> +is supported:
And reset in terms of PCIe spec is the FLR?
> +
> +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
> + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> +
> +The MMIO and IOP resources used by a device are described by regions.
IOP?
> +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> +
> +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
Don't want __u32?
> +
> +Regions are described by a struct vfio_region_info, which is retrieved by
> +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> +the desired region (0 based index). Note that devices may implement zero
> +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> +mapping).
Huh?
> +
> +struct vfio_region_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* region number */
> + __u64 size; /* size in bytes of region */
> + __u64 offset; /* start offset of region */
> + __u64 flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
What is FLAG_MMAP? Does it mean: 1) it can be mmaped, or 2) it is mmaped?
FLAG_RO is pretty obvious - presumably this is for firmware regions and such.
And PHYS_VALID is if the region is disabled for some reason? If so,
would the name FLAG_DISABLED be better?
> + __u64 phys; /* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> +
> +The offset indicates the offset into the device file descriptor which
> +accesses the given range (for read/write/mmap/seek). Flags indicate the
> +available access types and validity of optional fields. For instance
> +the phys field may only be valid for certain devices types.
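For illustration, this is roughly how a userspace driver might use the
region info to access a BAR (a sketch against the RFC's proposed header,
with error handling trimmed; nothing here is from the patch itself):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

void *map_region(int device_fd, int index, struct vfio_region_info *info)
{
	memset(info, 0, sizeof(*info));
	info->len = sizeof(*info);
	info->index = index;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info) < 0)
		return NULL;
	if (!info->size)
		return NULL;	/* zero-sized region, e.g. an unimplemented BAR */

	if (info->flags & VFIO_REGION_INFO_FLAG_MMAP)
		return mmap(NULL, info->size, PROT_READ | PROT_WRITE,
			    MAP_SHARED, device_fd, info->offset);

	/* no mmap: fall back to pread()/pwrite() at info->offset instead */
	return NULL;
}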
> +
> +Interrupts are described using a similar interface. GET_NUM_IRQS
> +reports the number or IRQ indexes for the device.
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
__u32?
> +
> +struct vfio_irq_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* IRQ number */
> + __u32 count; /* number of individual IRQs */
> + __u64 flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> +};
> +
> +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> +type to index mapping).
I am not really sure what that means.
> +
> +Information about each index can be retrieved using the GET_IRQ_INFO
> +ioctl, used much like GET_REGION_INFO.
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> +
> +Individual indexes can describe single or sets of IRQs. This provides the
> +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> +
> +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> +as shown below, are used to pass the IRQ info index, the number of eventfds,
> +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
Are eventfds u64 or u32?
Why not just define a structure?
struct vfio_irq_eventfds {
	__u32 index;
	__u32 count;
	__u64 eventfds[0];
};
How do you get an eventfd to feed in here?
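For what it's worth, a sketch of how a caller might wire this up, assuming
the int-array convention quoted above (an eventfd is just a file
descriptor obtained from eventfd(2); function name and loop are invented
for the example):

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

int run_intx_loop(int device_fd, int index)
{
	int args[3];
	uint64_t hits;

	args[0] = index;		/* IRQ info index */
	args[1] = 1;			/* level triggered INTx: one interrupt */
	args[2] = eventfd(0, 0);
	if (args[2] < 0)
		return -1;

	if (ioctl(device_fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, args) < 0)
		return -1;

	for (;;) {
		if (read(args[2], &hits, sizeof(hits)) != sizeof(hits))
			break;
		/* ... service the device ... */
		ioctl(device_fd, VFIO_DEVICE_UNMASK_IRQ, &index);	/* re-arm */
	}
	return 0;
}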
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
u32?
> +
> +When a level triggered interrupt is signaled, the interrupt is masked
> +on the host. This prevents an unresponsive userspace driver from
> +continuing to interrupt the host system. After servicing the interrupt,
> +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> +triggered interrupts implicitly have a count of 1 per index.
So they are enabled automatically? Meaning you don't even have to do
SET_IRQ_EVENTFDS b/c the count is set to 1?
> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
So this is for MSI as well? So if I have an index = 1, with count = 4,
will doing an unmask IRQ enable all the MSI events at once?
I guess there is not much point in selectively enabling/disabling MSI
IRQs.
> +
> +Level triggered interrupts can also be unmasked using an irqfd. Use
irqfd or eventfd?
> +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
So only level triggered? Hmm, how do I know whether the device is
level or edge? Or is it that edge (MSI) can also be unmasked using the
eventfd?
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> +
> +When supported, as indicated by the device flags, reset the device.
> +
> +#define VFIO_DEVICE_RESET _IO(';', 116)
Does it disable the 'count'? Err, does it disable the IRQ on the
device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
to set new eventfds? Or does it re-use the eventfds and the device
is enabled after this?
> +
> +Device tree devices also invlude ioctls for further defining the
include
> +device tree properties of the device:
> +
> +struct vfio_dtpath {
> + __u32 len; /* length of structure */
> + __u32 index;
0 based I presume?
> + __u64 flags;
> +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
What is a region in this context? Or would this make much more sense
if I knew what a Device Tree actually is.
> +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> + char *path;
Ah, now I see why you want 'len' here. But I am still at a loss as to
why you want that with the other structures.
> +};
> +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u32 prop_type;
Is that an enum type? Is this defined somewhere?
> + __u32 prop_index;
What is the purpose of this field?
> + __u64 flags;
> +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> +
> +
> +VFIO bus driver API
> +-------------------------------------------------------------------------------
> +
> +Bus drivers, such as PCI, have three jobs:
> + 1) Add/remove devices from vfio
> + 2) Provide vfio_device_ops for device access
> + 3) Device binding and unbinding
suspend/resume?
> +
> +When initialized, the bus driver should enumerate the devices on it's
> +bus and call vfio_group_add_dev() for each device. If the bus supports
> +hotplug, notifiers should be enabled to track devices being added and
> +removed. vfio_group_del_dev() removes a previously added device from
> +vfio.
> +
> +Adding a device registers a vfio_device_ops function pointer structure
> +for the device:
Huh? So this gets created for _every_ 'struct device' that is added
to the VFIO bus? Is this structure exposed? Or is this an internal one?
> +
> +struct vfio_device_ops {
> + bool (*match)(struct device *, char *);
> + int (*get)(void *);
> + void (*put)(void *);
> + ssize_t (*read)(void *, char __user *,
> + size_t, loff_t *);
> + ssize_t (*write)(void *, const char __user *,
> + size_t, loff_t *);
> + long (*ioctl)(void *, unsigned int, unsigned long);
> + int (*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +When a device is bound to the bus driver, the bus driver indicates this
> +to vfio using the vfio_bind_dev() interface. The device_data parameter
Might want to paste the function declaration for it, b/c I am not sure
where the 'device_data' parameter is in the argument list.
> +is a pointer to an opaque data structure for use only by the bus driver.
> +The get, put, read, write, ioctl, and mmap vfio_device_ops all pass
> +this data structure back to the bus driver. When a device is unbound
Oh, so it is on the 'void *'.
> +from the bus driver, the vfio_unbind_dev() interface signals this to
> +vfio. This function returns the pointer to the device_data structure
That function
> +registered for the device.
I am not really sure what this section's purpose is. Could this be part
of the header file or the code? It does not look to be part of the
ioctl API?
> +
> +As noted previously, a group contains one or more devices, so
> +GROUP_GET_DEVICE_FD needs to identify the specific device being requested.
> +The vfio_device_ops.match callback is used to allow bus drivers to determine
> +the match. For drivers like vfio-pci, it's a simple match to dev_name(),
> +which is unique in the system due to the PCI bus topology, other bus drivers
> +may need to include parent devices to create a unique match, so this is
> +left as a bus driver interface.
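Pieced together from the description above, the skeleton of a bus driver
might look something like this (the vfio_group_add_dev()/vfio_bind_dev()
argument lists are not shown in this excerpt, so they are guesses here,
as is all of the my_* naming):

#include <linux/device.h>
#include <linux/string.h>
#include <linux/vfio.h>

static bool my_vfio_match(struct device *dev, char *buf)
{
	return !strcmp(dev_name(dev), buf);	/* dev_name() is unique on PCI */
}

static const struct vfio_device_ops my_vfio_ops = {
	.match	= my_vfio_match,
	/* .get/.put/.read/.write/.ioctl/.mmap omitted for brevity */
};

/* 1) enumeration: tell vfio about every device (and hence its group) */
static int my_bus_device_added(struct device *dev)
{
	return vfio_group_add_dev(dev, &my_vfio_ops);	/* assumed signature */
}

/* 3) binding: the bus driver's probe hands vfio an opaque device_data */
static int my_vfio_probe(struct device *dev)
{
	struct my_vfio_device *vdev = my_vfio_alloc(dev);	/* hypothetical */

	return vfio_bind_dev(dev, vdev);		/* assumed signature */
}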
> +
> +-------------------------------------------------------------------------------
> +
> +[1] VFIO was originally an acronym for "Virtual Function I/O" in it's
> +initial implementation by Tom Lyon while as Cisco. We've since outgrown
> +the acronym, but it's catchy.
> +
> +[2] As always there are trade-offs to virtual machine device
> +assignment that are beyond the scope of VFIO. It's expected that
> +future IOMMU technologies will reduce some, but maybe not all, of
> +these trade-offs.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f05f5f6..4bd5aa0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7106,6 +7106,14 @@ S: Maintained
> F: Documentation/filesystems/vfat.txt
> F: fs/fat/
>
> +VFIO DRIVER
> +M: Alex Williamson <alex.williamson@redhat.com>
> +L: kvm@vger.kernel.org
No vfio mailing list? Or a vfio-mailing list?
> +S: Maintained
> +F: Documentation/vfio.txt
> +F: drivers/vfio/
> +F: include/linux/vfio.h
> +
> VIDEOBUF2 FRAMEWORK
> M: Pawel Osciak <pawel@osciak.com>
> M: Marek Szyprowski <m.szyprowski@samsung.com>
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index b5e6f24..e15578b 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
>
> source "drivers/uio/Kconfig"
>
> +source "drivers/vfio/Kconfig"
> +
> source "drivers/vlynq/Kconfig"
>
> source "drivers/virtio/Kconfig"
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 1b31421..5f138b5 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -58,6 +58,7 @@ obj-$(CONFIG_ATM) += atm/
> obj-$(CONFIG_FUSION) += message/
> obj-y += firewire/
> obj-$(CONFIG_UIO) += uio/
> +obj-$(CONFIG_VFIO) += vfio/
> obj-y += cdrom/
> obj-y += auxdisplay/
> obj-$(CONFIG_PCCARD) += pcmcia/
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> new file mode 100644
> index 0000000..9acb1e7
> --- /dev/null
> +++ b/drivers/vfio/Kconfig
> @@ -0,0 +1,8 @@
> +menuconfig VFIO
> + tristate "VFIO Non-Privileged userspace driver framework"
> + depends on IOMMU_API
> + help
> + VFIO provides a framework for secure userspace device drivers.
> + See Documentation/vfio.txt for more details.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> new file mode 100644
> index 0000000..088faf1
> --- /dev/null
> +++ b/drivers/vfio/Makefile
> @@ -0,0 +1,3 @@
> +vfio-y := vfio_main.o vfio_iommu.o
> +
> +obj-$(CONFIG_VFIO) := vfio.o
> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 0000000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c
> @@ -0,0 +1,530 @@
> +/*
> + * VFIO: IOMMU DMA mapping support
> + *
> + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/workqueue.h>
> +
> +#include "vfio_private.h"
> +
> +struct dma_map_page {
> + struct list_head list;
> + dma_addr_t daddr;
> + unsigned long vaddr;
> + int npage;
> + int rdwr;
rdwr? Is this a flag thing? Could it be made in an enum?
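E.g., just a sketch (names made up):

	enum vfio_dma_prot {
		VFIO_DMA_MAP_RO,
		VFIO_DMA_MAP_RW,
	};

and then 'enum vfio_dma_prot prot;' here instead of the bare int.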
> +};
> +
> +/*
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> +
> +struct vwork {
> + struct mm_struct *mm;
> + int npage;
> + struct work_struct work;
> +};
> +
> +/* delayed decrement for locked_vm */
> +static void vfio_lock_acct_bg(struct work_struct *work)
> +{
> + struct vwork *vwork = container_of(work, struct vwork, work);
> + struct mm_struct *mm;
> +
> + mm = vwork->mm;
> + down_write(&mm->mmap_sem);
> + mm->locked_vm += vwork->npage;
> + up_write(&mm->mmap_sem);
> + mmput(mm); /* unref mm */
> + kfree(vwork);
> +}
> +
> +static void vfio_lock_acct(int npage)
> +{
> + struct vwork *vwork;
> + struct mm_struct *mm;
> +
> + if (!current->mm) {
> + /* process exited */
> + return;
> + }
> + if (down_write_trylock(¤t->mm->mmap_sem)) {
> + current->mm->locked_vm += npage;
> + up_write(¤t->mm->mmap_sem);
> + return;
> + }
> + /*
> + * Couldn't get mmap_sem lock, so must setup to decrement
> + * mm->locked_vm later. If locked_vm were atomic, we wouldn't
> + * need this silliness
> + */
> + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> + if (!vwork)
> + return;
> + mm = get_task_mm(current); /* take ref mm */
> + if (!mm) {
> + kfree(vwork);
> + return;
> + }
> + INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> + vwork->mm = mm;
> + vwork->npage = npage;
> + schedule_work(&vwork->work);
> +}
> +
> +/* Some mappings aren't backed by a struct page, for example an mmap'd
> + * MMIO range for our own or another device. These use a different
> + * pfn conversion and shouldn't be tracked as locked pages. */
> +static int is_invalid_reserved_pfn(unsigned long pfn)
static bool
> +{
> + if (pfn_valid(pfn)) {
> + int reserved;
> + struct page *tail = pfn_to_page(pfn);
> + struct page *head = compound_trans_head(tail);
> + reserved = PageReserved(head);
bool reserved = PageReserved(head);
> + if (head != tail) {
> + /* "head" is not a dangling pointer
> + * (compound_trans_head takes care of that)
> + * but the hugepage may have been split
> + * from under us (and we may not hold a
> + * reference count on the head page so it can
> + * be reused before we run PageReferenced), so
> + * we've to check PageTail before returning
> + * what we just read.
> + */
> + smp_rmb();
> + if (PageTail(tail))
> + return reserved;
> + }
> + return PageReserved(tail);
> + }
> +
> + return true;
> +}
> +
> +static int put_pfn(unsigned long pfn, int rdwr)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + struct page *page = pfn_to_page(pfn);
> + if (rdwr)
> + SetPageDirty(page);
> + put_page(page);
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* Unmap DMA region */
> +/* dgate must be held */
dgate?
> +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> + int npage, int rdwr)
> +{
> + int i, unlocked = 0;
> +
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> + unsigned long pfn;
> +
> + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> + if (pfn) {
> + iommu_unmap(iommu->domain, iova, 0);
What is the '0' for? Perhaps a comment: /* We only do zero order */
> + unlocked += put_pfn(pfn, rdwr);
> + }
> + }
> + return unlocked;
> +}
> +
> +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> + unsigned long npage, int rdwr)
> +{
> + int unlocked;
> +
> + unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> + vfio_lock_acct(-unlocked);
> +}
> +
> +/* Unmap ALL DMA regions */
> +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos, *pos2;
pos2 should probably be just called 'tmp'
> + struct dma_map_page *mlp;
What does 'mlp' stand for?
mlp -> dma_page ?
> +
> + mutex_lock(&iommu->dgate);
> + list_for_each_safe(pos, pos2, &iommu->dm_list) {
> + mlp = list_entry(pos, struct dma_map_page, list);
> + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
Uh, so if it did not get put_page() we would still try to delete it?
Couldn't that lead to corruption, as the 'mlp' is returned to the pool?
Ah wait, the put_page() is on the DMA page, so it is OK to delete the
tracking structure. It will just be a leaked page.
> + list_del(&mlp->list);
> + kfree(mlp);
> + }
> + mutex_unlock(&iommu->dgate);
> +}
> +
> +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> +{
> + struct page *page[1];
> + struct vm_area_struct *vma;
> + int ret = -EFAULT;
> +
> + if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> + *pfn = page_to_pfn(page[0]);
> + return 0;
> + }
> +
> + down_read(¤t->mm->mmap_sem);
> +
> + vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +
> + if (vma && vma->vm_flags & VM_PFNMAP) {
> + *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> + if (is_invalid_reserved_pfn(*pfn))
> + ret = 0;
Did you mean to break here?
> + }
> +
> + up_read(¤t->mm->mmap_sem);
> +
> + return ret;
> +}
> +
> +/* Map DMA region */
> +/* dgate must be held */
> +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> + unsigned long vaddr, int npage, int rdwr)
> +{
> + unsigned long start = iova;
> + int i, ret, locked = 0, prot = IOMMU_READ;
> +
> + /* Verify pages are not already mapped */
I think a 'that' is missing above.
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> + if (iommu_iova_to_phys(iommu->domain, iova))
> + return -EBUSY;
> +
> + iova = start;
> +
> + if (rdwr)
> + prot |= IOMMU_WRITE;
> + if (iommu->cache)
> + prot |= IOMMU_CACHE;
> +
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> + unsigned long pfn = 0;
> +
> + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> + if (ret) {
> + __vfio_dma_unmap(iommu, start, i, rdwr);
> + return ret;
> + }
> +
> + /* Only add actual locked pages to accounting */
> + if (!is_invalid_reserved_pfn(pfn))
> + locked++;
> +
> + ret = iommu_map(iommu->domain, iova,
> + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
Put a comment by the 0 saying /* order 0 pages only! */
> + if (ret) {
> + /* Back out mappings on error */
> + put_pfn(pfn, rdwr);
> + __vfio_dma_unmap(iommu, start, i, rdwr);
> + return ret;
> + }
> + }
> + vfio_lock_acct(locked);
> + return 0;
> +}
> +
> +static inline int ranges_overlap(unsigned long start1, size_t size1,
Perhaps a bool? (See the sketch right after the function.)
> + unsigned long start2, size_t size2)
> +{
> + return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> +}
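I.e., something along these lines (untested):

	static inline bool ranges_overlap(unsigned long start1, size_t size1,
					  unsigned long start2, size_t size2)
	{
		return !(start1 + size1 <= start2 || start2 + size2 <= start1);
	}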
> +
> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> + dma_addr_t start, size_t size)
> +{
> + struct list_head *pos;
> + struct dma_map_page *mlp;
> +
> + list_for_each(pos, &iommu->dm_list) {
> + mlp = list_entry(pos, struct dma_map_page, list);
> + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> + start, size))
> + return mlp;
> + }
> + return NULL;
> +}
> +
> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> + size_t size, struct dma_map_page *mlp)
> +{
> + struct dma_map_page *split;
> + int npage_lo, npage_hi;
> +
> + /* Existing dma region is completely covered, unmap all */
> + if (start <= mlp->daddr &&
> + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> + list_del(&mlp->list);
> + npage_lo = mlp->npage;
> + kfree(mlp);
> + return npage_lo;
> + }
> +
> + /* Overlap low address of existing range */
> + if (start <= mlp->daddr) {
> + size_t overlap;
> +
> + overlap = start + size - mlp->daddr;
> + npage_lo = overlap >> PAGE_SHIFT;
> + npage_hi = mlp->npage - npage_lo;
> +
> + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> + mlp->daddr += overlap;
> + mlp->vaddr += overlap;
> + mlp->npage -= npage_lo;
> + return npage_lo;
> + }
> +
> + /* Overlap high address of existing range */
> + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> + size_t overlap;
> +
> + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> + npage_hi = overlap >> PAGE_SHIFT;
> + npage_lo = mlp->npage - npage_hi;
> +
> + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> + mlp->npage -= npage_hi;
> + return npage_hi;
> + }
> +
> + /* Split existing */
> + npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> + npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> +
> + split = kzalloc(sizeof *split, GFP_KERNEL);
> + if (!split)
> + return -ENOMEM;
> +
> + vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> +
> + mlp->npage = npage_lo;
> +
> + split->npage = npage_hi;
> + split->daddr = start + size;
> + split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> + split->rdwr = mlp->rdwr;
> + list_add(&split->list, &iommu->dm_list);
> + return size >> PAGE_SHIFT;
> +}
> +
> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> + int ret = 0;
> + size_t npage = dmp->size >> PAGE_SHIFT;
> + struct list_head *pos, *n;
> +
> + if (dmp->dmaaddr & ~PAGE_MASK)
> + return -EINVAL;
> + if (dmp->size & ~PAGE_MASK)
> + return -EINVAL;
> +
> + mutex_lock(&iommu->dgate);
> +
> + list_for_each_safe(pos, n, &iommu->dm_list) {
> + struct dma_map_page *mlp;
> +
> + mlp = list_entry(pos, struct dma_map_page, list);
> + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> + dmp->dmaaddr, dmp->size)) {
> + ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> + dmp->size, mlp);
> + if (ret > 0)
> + npage -= NPAGE_TO_SIZE(ret);
> + if (ret < 0 || npage == 0)
> + break;
> + }
> + }
> + mutex_unlock(&iommu->dgate);
> + return ret > 0 ? 0 : ret;
> +}
> +
> +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> + int npage;
> + struct dma_map_page *mlp, *mmlp = NULL;
> + dma_addr_t daddr = dmp->dmaaddr;
> + unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> + size_t size = dmp->size;
> + int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> +
> + if (vaddr & (PAGE_SIZE-1))
> + return -EINVAL;
> + if (daddr & (PAGE_SIZE-1))
> + return -EINVAL;
> + if (size & (PAGE_SIZE-1))
> + return -EINVAL;
> +
> + npage = size >> PAGE_SHIFT;
> + if (!npage)
> + return -EINVAL;
> +
> + if (!iommu)
> + return -EINVAL;
> +
> + mutex_lock(&iommu->dgate);
> +
> + if (vfio_find_dma(iommu, daddr, size)) {
> + ret = -EBUSY;
> + goto out_lock;
> + }
> +
> + /* account for locked pages */
> + locked = current->mm->locked_vm + npage;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> + __func__, rlimit(RLIMIT_MEMLOCK));
> + ret = -ENOMEM;
> + goto out_lock;
> + }
> +
> + ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> + if (ret)
> + goto out_lock;
> +
> + /* Check if we abut a region below */
> + if (daddr) {
> + mlp = vfio_find_dma(iommu, daddr - 1, 1);
> + if (mlp && mlp->rdwr == rdwr &&
> + mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> +
> + mlp->npage += npage;
> + daddr = mlp->daddr;
> + vaddr = mlp->vaddr;
> + npage = mlp->npage;
> + size = NPAGE_TO_SIZE(npage);
> +
> + mmlp = mlp;
> + }
> + }
> +
> + if (daddr + size) {
> + mlp = vfio_find_dma(iommu, daddr + size, 1);
> + if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> +
> + mlp->npage += npage;
> + mlp->daddr = daddr;
> + mlp->vaddr = vaddr;
> +
> + /* If merged above and below, remove previously
> + * merged entry. New entry covers it. */
> + if (mmlp) {
> + list_del(&mmlp->list);
> + kfree(mmlp);
> + }
> + mmlp = mlp;
> + }
> + }
> +
> + if (!mmlp) {
> + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> + if (!mlp) {
> + ret = -ENOMEM;
> + vfio_dma_unmap(iommu, daddr, npage, rdwr);
> + goto out_lock;
> + }
> +
> + mlp->npage = npage;
> + mlp->daddr = daddr;
> + mlp->vaddr = vaddr;
> + mlp->rdwr = rdwr;
> + list_add(&mlp->list, &iommu->dm_list);
> + }
> +
> +out_lock:
> + mutex_unlock(&iommu->dgate);
> + return ret;
> +}
> +
> +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> +{
> + struct vfio_iommu *iommu = filep->private_data;
> +
> + vfio_release_iommu(iommu);
> + return 0;
> +}
> +
> +static long vfio_iommu_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_iommu *iommu = filep->private_data;
> + int ret = -ENOSYS;
> +
> + if (cmd == VFIO_IOMMU_GET_FLAGS) {
Something is weird with the tabbing here.
> + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> +
> + ret = put_user(flags, (u64 __user *)arg);
> +
> + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> + struct vfio_dma_map dm;
> +
> + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> + return -EFAULT;
> +
> + ret = vfio_dma_map_dm(iommu, &dm);
> +
> + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> + ret = -EFAULT;
> +
> + } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> + struct vfio_dma_map dm;
> +
> + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> + return -EFAULT;
> +
> + ret = vfio_dma_unmap_dm(iommu, &dm);
> +
> + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> + ret = -EFAULT;
> + }
> + return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_iommu_compat_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + arg = (unsigned long)compat_ptr(arg);
> + return vfio_iommu_unl_ioctl(filep, cmd, arg);
> +}
> +#endif /* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_iommu_fops = {
> + .owner = THIS_MODULE,
> + .release = vfio_iommu_release,
> + .unlocked_ioctl = vfio_iommu_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = vfio_iommu_compat_ioctl,
> +#endif
> +};
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> new file mode 100644
> index 0000000..6169356
> --- /dev/null
> +++ b/drivers/vfio/vfio_main.c
> @@ -0,0 +1,1151 @@
> +/*
> + * VFIO framework
> + *
> + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/cdev.h>
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/fs.h>
> +#include <linux/idr.h>
> +#include <linux/iommu.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/wait.h>
> +
> +#include "vfio_private.h"
> +
> +#define DRIVER_VERSION "0.2"
> +#define DRIVER_AUTHOR "Alex Williamson <alex.williamson@redhat.com>"
> +#define DRIVER_DESC "VFIO - User Level meta-driver"
> +
> +static int allow_unsafe_intrs;
__read_mostly
> +module_param(allow_unsafe_intrs, int, 0);
S_IRUGO ?
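I.e., something like (untested):

	static int allow_unsafe_intrs __read_mostly;
	module_param(allow_unsafe_intrs, int, S_IRUGO);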
> +MODULE_PARM_DESC(allow_unsafe_intrs,
> + "Allow use of IOMMUs which do not support interrupt remapping");
> +
> +static struct vfio {
> + dev_t devt;
> + struct cdev cdev;
> + struct list_head group_list;
> + struct mutex lock;
> + struct kref kref;
> + struct class *class;
> + struct idr idr;
> + wait_queue_head_t release_q;
> +} vfio;
You probably want to move this below 'vfio_group', since vfio
holds the list of vfio_group structures.
> +
> +static const struct file_operations vfio_group_fops;
> +extern const struct file_operations vfio_iommu_fops;
> +
> +struct vfio_group {
> + dev_t devt;
> + unsigned int groupid;
> + struct bus_type *bus;
> + struct vfio_iommu *iommu;
> + struct list_head device_list;
> + struct list_head iommu_next;
> + struct list_head group_next;
> + int refcnt;
> +};
> +
> +struct vfio_device {
> + struct device *dev;
> + const struct vfio_device_ops *ops;
> + struct vfio_iommu *iommu;
> + struct vfio_group *group;
> + struct list_head device_next;
> + bool attached;
> + int refcnt;
> + void *device_data;
> +};
And perhaps move this above vfio_group, as vfio_group contains
a list of these structures?
> +
> +/*
> + * Helper functions called under vfio.lock
> + */
> +
> +/* Return true if any devices within a group are opened */
> +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + if (device->refcnt)
> + return true;
> + }
> + return false;
> +}
> +
> +/* Return true if any of the groups attached to an iommu are opened.
> + * We can only tear apart merged groups when nothing is left open. */
> +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> +
> + group = list_entry(pos, struct vfio_group, iommu_next);
> + if (group->refcnt)
> + return true;
> + }
> + return false;
> +}
> +
> +/* An iommu is "in use" if it has a file descriptor open or if any of
> + * the groups assigned to the iommu have devices open. */
> +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + if (iommu->refcnt)
> + return true;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> +
> + group = list_entry(pos, struct vfio_group, iommu_next);
> +
> + if (__vfio_group_devs_inuse(group))
> + return true;
> + }
> + return false;
> +}
> +
> +static void __vfio_group_set_iommu(struct vfio_group *group,
> + struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + if (group->iommu)
> + list_del(&group->iommu_next);
> + if (iommu)
> + list_add(&group->iommu_next, &iommu->group_list);
> +
> + group->iommu = iommu;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + device->iommu = iommu;
> + }
> +}
> +
> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> + struct vfio_device *device)
> +{
> + BUG_ON(!iommu->domain && device->attached);
Whoa, heavy hammer there.
Perhaps WARN_ON instead, as you check for it just below anyway.
> +
> + if (!iommu->domain || !device->attached)
> + return;
> +
> + iommu_detach_device(iommu->domain, device->dev);
> + device->attached = false;
> +}
> +
> +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> + struct vfio_group *group)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + __vfio_iommu_detach_dev(iommu, device);
> + }
> +}
> +
> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> + struct vfio_device *device)
> +{
> + int ret;
> +
> + BUG_ON(device->attached);
How about (WARN_ON() doesn't take a message, but WARN() does):
WARN(device->attached, "The engineer who wrote the user-space device driver is trying to register
the device again! Tell him/her to stop, please.\n");
> +
> + if (!iommu || !iommu->domain)
> + return -EINVAL;
> +
> + ret = iommu_attach_device(iommu->domain, device->dev);
> + if (!ret)
> + device->attached = true;
> +
> + return ret;
> +}
> +
> +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
> + struct vfio_group *group)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> + int ret;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + ret = __vfio_iommu_attach_dev(iommu, device);
> + if (ret) {
> + __vfio_iommu_detach_group(iommu, group);
> + return ret;
> + }
> + }
> + return 0;
> +}
> +
> +/* The iommu is viable, ie. ready to be configured, when all the devices
> + * for all the groups attached to the iommu are bound to their vfio device
> + * drivers (ex. vfio-pci). This sets the device_data private data pointer. */
> +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
> +{
> + struct list_head *gpos, *dpos;
> +
> + list_for_each(gpos, &iommu->group_list) {
> + struct vfio_group *group;
> + group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (!device->device_data)
> + return false;
> + }
> + }
> + return true;
> +}
> +
> +static void __vfio_close_iommu(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + if (!iommu->domain)
> + return;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> + group = list_entry(pos, struct vfio_group, iommu_next);
> +
> + __vfio_iommu_detach_group(iommu, group);
> + }
> +
> + vfio_iommu_unmapall(iommu);
> +
> + iommu_domain_free(iommu->domain);
> + iommu->domain = NULL;
> + iommu->mm = NULL;
> +}
> +
> +/* Open the IOMMU. This gates all access to the iommu or device file
> + * descriptors and sets current->mm as the exclusive user. */
> +static int __vfio_open_iommu(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> + int ret;
> +
> + if (!__vfio_iommu_viable(iommu))
> + return -EBUSY;
> +
> + if (iommu->domain)
> + return -EINVAL;
> +
> + iommu->domain = iommu_domain_alloc(iommu->bus);
> + if (!iommu->domain)
> + return -EFAULT;
ENOMEM?
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> + group = list_entry(pos, struct vfio_group, iommu_next);
> +
> + ret = __vfio_iommu_attach_group(iommu, group);
> + if (ret) {
> + __vfio_close_iommu(iommu);
> + return ret;
> + }
> + }
> +
> + if (!allow_unsafe_intrs &&
> + !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> + __vfio_close_iommu(iommu);
> + return -EFAULT;
> + }
> +
> + iommu->cache = (iommu_domain_has_cap(iommu->domain,
> + IOMMU_CAP_CACHE_COHERENCY) != 0);
> + iommu->mm = current->mm;
> +
> + return 0;
> +}
> +
> +/* Actively try to tear down the iommu and merged groups. If there are no
> + * open iommu or device fds, we close the iommu. If we close the iommu and
> + * there are also no open group fds, we can futher dissolve the group to
> + * iommu association and free the iommu data structure. */
> +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
> +{
> +
> + if (__vfio_iommu_inuse(iommu))
> + return -EBUSY;
> +
> + __vfio_close_iommu(iommu);
> +
> + if (!__vfio_iommu_groups_inuse(iommu)) {
> + struct list_head *pos, *ppos;
> +
> + list_for_each_safe(pos, ppos, &iommu->group_list) {
> + struct vfio_group *group;
> +
> + group = list_entry(pos, struct vfio_group, iommu_next);
> + __vfio_group_set_iommu(group, NULL);
> + }
> +
> +
> + kfree(iommu);
> + }
> +
> + return 0;
> +}
> +
> +static struct vfio_device *__vfio_lookup_dev(struct device *dev)
> +{
> + struct list_head *gpos;
> + unsigned int groupid;
> +
> + if (iommu_device_group(dev, &groupid))
Hmm, where is this defined? v3.2-rc1 does not seem to have it?
> + return NULL;
> +
> + list_for_each(gpos, &vfio.group_list) {
> + struct vfio_group *group;
> + struct list_head *dpos;
> +
> + group = list_entry(gpos, struct vfio_group, group_next);
> +
> + if (group->groupid != groupid)
> + continue;
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (device->dev == dev)
> + return device;
> + }
> + }
> + return NULL;
> +}
> +
> +/* All release paths simply decrement the refcnt, attempt to teardown
> + * the iommu and merged groups, and wakeup anything that might be
> + * waiting if we successfully dissolve anything. */
> +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
> +{
> + bool wake;
> +
> + mutex_lock(&vfio.lock);
> +
> + (*refcnt)--;
> + wake = (__vfio_try_dissolve_iommu(iommu) == 0);
> +
> + mutex_unlock(&vfio.lock);
> +
> + if (wake)
> + wake_up(&vfio.release_q);
> +
> + return 0;
> +}
> +
> +/*
> + * Device fops - passthrough to vfio device driver w/ device_data
> + */
> +static int vfio_device_release(struct inode *inode, struct file *filep)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + vfio_do_release(&device->refcnt, device->iommu);
> +
> + device->ops->put(device->device_data);
> +
> + return 0;
> +}
> +
> +static long vfio_device_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->ioctl(device->device_data, cmd, arg);
> +}
> +
> +static ssize_t vfio_device_read(struct file *filep, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->read(device->device_data, buf, count, ppos);
> +}
> +
> +static ssize_t vfio_device_write(struct file *filep, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->write(device->device_data, buf, count, ppos);
> +}
> +
> +static int vfio_device_mmap(struct file *filep, struct vm_area_struct *vma)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->mmap(device->device_data, vma);
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_device_compat_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + arg = (unsigned long)compat_ptr(arg);
> + return vfio_device_unl_ioctl(filep, cmd, arg);
> +}
> +#endif /* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_device_fops = {
> + .owner = THIS_MODULE,
> + .release = vfio_device_release,
> + .read = vfio_device_read,
> + .write = vfio_device_write,
> + .unlocked_ioctl = vfio_device_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = vfio_device_compat_ioctl,
> +#endif
> + .mmap = vfio_device_mmap,
> +};
> +
> +/*
> + * Group fops
> + */
> +static int vfio_group_open(struct inode *inode, struct file *filep)
> +{
> + struct vfio_group *group;
> + int ret = 0;
> +
> + mutex_lock(&vfio.lock);
> +
> + group = idr_find(&vfio.idr, iminor(inode));
> +
> + if (!group) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> + filep->private_data = group;
> +
> + if (!group->iommu) {
> + struct vfio_iommu *iommu;
> +
> + iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> + if (!iommu) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + INIT_LIST_HEAD(&iommu->group_list);
> + INIT_LIST_HEAD(&iommu->dm_list);
> + mutex_init(&iommu->dgate);
> + iommu->bus = group->bus;
> + __vfio_group_set_iommu(group, iommu);
> + }
> + group->refcnt++;
> +
> +out:
> + mutex_unlock(&vfio.lock);
> +
> + return ret;
> +}
> +
> +static int vfio_group_release(struct inode *inode, struct file *filep)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + return vfio_do_release(&group->refcnt, group->iommu);
> +}
> +
> +/* Attempt to merge the group pointed to by fd into group. The merge-ee
> + * group must not have an iommu or any devices open because we cannot
> + * maintain that context across the merge. The merge-er group can be
> + * in use. */
> +static int vfio_group_merge(struct vfio_group *group, int fd)
> +{
> + struct vfio_group *new;
> + struct vfio_iommu *old_iommu;
> + struct file *file;
> + int ret = 0;
> + bool opened = false;
> +
> + mutex_lock(&vfio.lock);
> +
> + file = fget(fd);
> + if (!file) {
> + ret = -EBADF;
> + goto out_noput;
> + }
> +
> + /* Sanity check, is this really our fd? */
> + if (file->f_op != &vfio_group_fops) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + new = file->private_data;
> +
> + if (!new || new == group || !new->iommu ||
> + new->iommu->domain || new->bus != group->bus) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* We need to attach all the devices to each domain separately
> + * in order to validate that the capabilities match for both. */
> + ret = __vfio_open_iommu(new->iommu);
> + if (ret)
> + goto out;
> +
> + if (!group->iommu->domain) {
> + ret = __vfio_open_iommu(group->iommu);
> + if (ret)
> + goto out;
> + opened = true;
> + }
> +
> + /* If cache coherency doesn't match we'd potentialy need to
> + * remap existing iommu mappings in the merge-er domain.
> + * Poor return to bother trying to allow this currently. */
> + if (iommu_domain_has_cap(group->iommu->domain,
> + IOMMU_CAP_CACHE_COHERENCY) !=
> + iommu_domain_has_cap(new->iommu->domain,
> + IOMMU_CAP_CACHE_COHERENCY)) {
> + __vfio_close_iommu(new->iommu);
> + if (opened)
> + __vfio_close_iommu(group->iommu);
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* Close the iommu for the merge-ee and attach all its devices
> + * to the merge-er iommu. */
> + __vfio_close_iommu(new->iommu);
> +
> + ret = __vfio_iommu_attach_group(group->iommu, new);
> + if (ret)
> + goto out;
> +
> + /* set_iommu unlinks new from the iommu, so save a pointer to it */
> + old_iommu = new->iommu;
> + __vfio_group_set_iommu(new, group->iommu);
> + kfree(old_iommu);
> +
> +out:
> + fput(file);
> +out_noput:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Unmerge the group pointed to by fd from group. */
> +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> +{
> + struct vfio_group *new;
> + struct vfio_iommu *new_iommu;
> + struct file *file;
> + int ret = 0;
> +
> + /* Since the merge-out group is already opened, it needs to
> + * have an iommu struct associated with it. */
> + new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> + if (!new_iommu)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&new_iommu->group_list);
> + INIT_LIST_HEAD(&new_iommu->dm_list);
> + mutex_init(&new_iommu->dgate);
> + new_iommu->bus = group->bus;
> +
> + mutex_lock(&vfio.lock);
> +
> + file = fget(fd);
> + if (!file) {
> + ret = -EBADF;
> + goto out_noput;
> + }
> +
> + /* Sanity check, is this really our fd? */
> + if (file->f_op != &vfio_group_fops) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + new = file->private_data;
> + if (!new || new == group || new->iommu != group->iommu) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* We can't merge-out a group with devices still in use. */
> + if (__vfio_group_devs_inuse(new)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + __vfio_iommu_detach_group(group->iommu, new);
> + __vfio_group_set_iommu(new, new_iommu);
> +
> +out:
> + fput(file);
> +out_noput:
> + if (ret)
> + kfree(new_iommu);
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Get a new iommu file descriptor. This will open the iommu, setting
> + * the current->mm ownership if it's not already set. */
> +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> +{
> + int ret = 0;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (!group->iommu->domain) {
> + ret = __vfio_open_iommu(group->iommu);
> + if (ret)
> + goto out;
> + }
> +
> + ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> + group->iommu, O_RDWR);
> + if (ret < 0)
> + goto out;
> +
> + group->iommu->refcnt++;
> +out:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Get a new device file descriptor. This will open the iommu, setting
> + * the current->mm ownership if it's not already set. It's difficult to
> + * specify the requirements for matching a user supplied buffer to a
> + * device, so we use a vfio driver callback to test for a match. For
> + * PCI, dev_name(dev) is unique, but other drivers may require including
> + * a parent device string. */
> +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> +{
> + struct vfio_iommu *iommu = group->iommu;
> + struct list_head *gpos;
> + int ret = -ENODEV;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (!iommu->domain) {
> + ret = __vfio_open_iommu(iommu);
> + if (ret)
> + goto out;
> + }
> +
> + list_for_each(gpos, &iommu->group_list) {
> + struct list_head *dpos;
> +
> + group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (device->ops->match(device->dev, buf)) {
> + struct file *file;
> +
> + if (device->ops->get(device->device_data)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + /* We can't use anon_inode_getfd(), like above
> + * because we need to modify the f_mode flags
> + * directly to allow more than just ioctls */
> + ret = get_unused_fd();
> + if (ret < 0) {
> + device->ops->put(device->device_data);
> + goto out;
> + }
> +
> + file = anon_inode_getfile("[vfio-device]",
> + &vfio_device_fops,
> + device, O_RDWR);
> + if (IS_ERR(file)) {
> + put_unused_fd(ret);
> + ret = PTR_ERR(file);
> + device->ops->put(device->device_data);
> + goto out;
> + }
> +
> + /* Todo: add an anon_inode interface to do
> + * this. Appears to be missing by lack of
> + * need rather than explicitly prevented.
> + * Now there's need. */
> + file->f_mode |= (FMODE_LSEEK |
> + FMODE_PREAD |
> + FMODE_PWRITE);
> +
> + fd_install(ret, file);
> +
> + device->refcnt++;
> + goto out;
> + }
> + }
> + }
> +out:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +static long vfio_group_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + if (cmd == VFIO_GROUP_GET_FLAGS) {
> + u64 flags = 0;
> +
> + mutex_lock(&vfio.lock);
> + if (__vfio_iommu_viable(group->iommu))
> + flags |= VFIO_GROUP_FLAGS_VIABLE;
> + mutex_unlock(&vfio.lock);
> +
> + if (group->iommu->mm)
> + flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
> +
> + return put_user(flags, (u64 __user *)arg);
> + }
> +
> + /* Below commands are restricted once the mm is set */
> + if (group->iommu->mm && group->iommu->mm != current->mm)
> + return -EPERM;
> +
> + if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {
> + int fd;
> +
> + if (get_user(fd, (int __user *)arg))
> + return -EFAULT;
> + if (fd < 0)
> + return -EINVAL;
> +
> + if (cmd == VFIO_GROUP_MERGE)
> + return vfio_group_merge(group, fd);
> + else
> + return vfio_group_unmerge(group, fd);
> + } else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
> + return vfio_group_get_iommu_fd(group);
> + } else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
> + char *buf;
> + int ret;
> +
> + buf = strndup_user((const char __user *)arg, PAGE_SIZE);
> + if (IS_ERR(buf))
> + return PTR_ERR(buf);
> +
> + ret = vfio_group_get_device_fd(group, buf);
> + kfree(buf);
> + return ret;
> + }
> +
> + return -ENOSYS;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_group_compat_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + arg = (unsigned long)compat_ptr(arg);
> + return vfio_group_unl_ioctl(filep, cmd, arg);
> +}
> +#endif /* CONFIG_COMPAT */
> +
> +static const struct file_operations vfio_group_fops = {
> + .owner = THIS_MODULE,
> + .open = vfio_group_open,
> + .release = vfio_group_release,
> + .unlocked_ioctl = vfio_group_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = vfio_group_compat_ioctl,
> +#endif
> +};
> +
> +/* iommu fd release hook */
> +int vfio_release_iommu(struct vfio_iommu *iommu)
> +{
> + return vfio_do_release(&iommu->refcnt, iommu);
> +}
> +
> +/*
> + * VFIO driver API
> + */
> +
> +/* Add a new device to the vfio framework with associated vfio driver
> + * callbacks. This is the entry point for vfio drivers to register devices. */
> +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> +{
> + struct list_head *pos;
> + struct vfio_group *group = NULL;
> + struct vfio_device *device = NULL;
> + unsigned int groupid;
> + int ret = 0;
> + bool new_group = false;
> +
> + if (!ops)
> + return -EINVAL;
> +
> + if (iommu_device_group(dev, &groupid))
> + return -ENODEV;
> +
> + mutex_lock(&vfio.lock);
> +
> + list_for_each(pos, &vfio.group_list) {
> + group = list_entry(pos, struct vfio_group, group_next);
> + if (group->groupid == groupid)
> + break;
> + group = NULL;
> + }
> +
> + if (!group) {
> + int minor;
> +
> + if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + group = kzalloc(sizeof(*group), GFP_KERNEL);
> + if (!group) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + group->groupid = groupid;
> + INIT_LIST_HEAD(&group->device_list);
> +
> + ret = idr_get_new(&vfio.idr, group, &minor);
> + if (ret == 0 && minor > MINORMASK) {
> + idr_remove(&vfio.idr, minor);
> + kfree(group);
> + ret = -ENOSPC;
> + goto out;
> + }
> +
> + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> + device_create(vfio.class, NULL, group->devt,
> + group, "%u", groupid);
> +
> + group->bus = dev->bus;
Oh, so that is how the IOMMU iommu_ops get picked up! You might
want to mention that - I was not sure where the 'handoff' was done
to insert a device so that it can do iommu_ops properly.
Ok, so we only find out whether a device can actually use the IOMMU
when we try to open it - that is when iommu_domain_alloc() is called,
which can return NULL if iommu_ops is not set for the bus.
So what about devices that don't have iommu_ops? Say they are using
SWIOTLB (like AMD-Vi sometimes falls back to if the device is not on
its list)?
Can we use iommu_present()?
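I.e., could vfio_group_add_dev() bail out early with something like
this (just a sketch, assuming iommu_present() is available in the tree
this is based on):

	if (!iommu_present(dev->bus))
		return -ENODEV;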
> + list_add(&group->group_next, &vfio.group_list);
> + new_group = true;
> + } else {
> + if (group->bus != dev->bus) {
> + printk(KERN_WARNING
> + "Error: IOMMU group ID conflict. Group ID %u "
> + "on both bus %s and %s\n", groupid,
> + group->bus->name, dev->bus->name);
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + list_for_each(pos, &group->device_list) {
> + device = list_entry(pos,
> + struct vfio_device, device_next);
> + if (device->dev == dev)
> + break;
> + device = NULL;
> + }
> + }
> +
> + if (!device) {
> + if (__vfio_group_devs_inuse(group) ||
> + (group->iommu && group->iommu->refcnt)) {
> + printk(KERN_WARNING
> + "Adding device %s to group %u while group is already in use!!\n",
> + dev_name(dev), group->groupid);
> + /* XXX How to prevent other drivers from claiming? */
> + }
> +
> + device = kzalloc(sizeof(*device), GFP_KERNEL);
> + if (!device) {
> + /* If we just created this group, tear it down */
> + if (new_group) {
> + list_del(&group->group_next);
> + device_destroy(vfio.class, group->devt);
> + idr_remove(&vfio.idr, MINOR(group->devt));
> + kfree(group);
> + }
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + list_add(&device->device_next, &group->device_list);
> + device->dev = dev;
> + device->ops = ops;
> + device->iommu = group->iommu; /* NULL if new */
> + __vfio_iommu_attach_dev(group->iommu, device);
> + }
> +out:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_add_dev);
> +
> +/* Remove a device from the vfio framework */
> +void vfio_group_del_dev(struct device *dev)
> +{
> + struct list_head *pos;
> + struct vfio_group *group = NULL;
> + struct vfio_device *device = NULL;
> + unsigned int groupid;
> +
> + if (iommu_device_group(dev, &groupid))
> + return;
> +
> + mutex_lock(&vfio.lock);
> +
> + list_for_each(pos, &vfio.group_list) {
> + group = list_entry(pos, struct vfio_group, group_next);
> + if (group->groupid == groupid)
> + break;
> + group = NULL;
> + }
> +
> + if (!group)
> + goto out;
> +
> + list_for_each(pos, &group->device_list) {
> + device = list_entry(pos, struct vfio_device, device_next);
> + if (device->dev == dev)
> + break;
> + device = NULL;
> + }
> +
> + if (!device)
> + goto out;
> +
> + BUG_ON(device->refcnt);
> +
> + if (device->attached)
> + __vfio_iommu_detach_dev(group->iommu, device);
> +
> + list_del(&device->device_next);
> + kfree(device);
> +
> + /* If this was the only device in the group, remove the group.
> + * Note that we intentionally unmerge empty groups here if the
> + * group fd isn't opened. */
> + if (list_empty(&group->device_list) && group->refcnt == 0) {
> + struct vfio_iommu *iommu = group->iommu;
> +
> + if (iommu) {
> + __vfio_group_set_iommu(group, NULL);
> + __vfio_try_dissolve_iommu(iommu);
> + }
> +
> + device_destroy(vfio.class, group->devt);
> + idr_remove(&vfio.idr, MINOR(group->devt));
> + list_del(&group->group_next);
> + kfree(group);
> + }
> +out:
> + mutex_unlock(&vfio.lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_del_dev);
> +
> +/* When a device is bound to a vfio device driver (ex. vfio-pci), this
> + * entry point is used to mark the device usable (viable). The vfio
> + * device driver associates a private device_data struct with the device
> + * here, which will later be return for vfio_device_fops callbacks. */
> +int vfio_bind_dev(struct device *dev, void *device_data)
> +{
> + struct vfio_device *device;
> + int ret = -EINVAL;
> +
> + BUG_ON(!device_data);
> +
> + mutex_lock(&vfio.lock);
> +
> + device = __vfio_lookup_dev(dev);
> +
> + BUG_ON(!device);
> +
> + ret = dev_set_drvdata(dev, device);
> + if (!ret)
> + device->device_data = device_data;
> +
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_bind_dev);
> +
> +/* A device is only removeable if the iommu for the group is not in use. */
> +static bool vfio_device_removeable(struct vfio_device *device)
> +{
> + bool ret = true;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (device->iommu && __vfio_iommu_inuse(device->iommu))
> + ret = false;
> +
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Notify vfio that a device is being unbound from the vfio device driver
> + * and return the device private device_data pointer. If the group is
> + * in use, we need to block or take other measures to make it safe for
> + * the device to be removed from the iommu. */
> +void *vfio_unbind_dev(struct device *dev)
> +{
> + struct vfio_device *device = dev_get_drvdata(dev);
> + void *device_data;
> +
> + BUG_ON(!device);
> +
> +again:
> + if (!vfio_device_removeable(device)) {
> + /* XXX signal for all devices in group to be removed or
> + * resort to killing the process holding the device fds.
> + * For now just block waiting for releases to wake us. */
> + wait_event(vfio.release_q, vfio_device_removeable(device));
> + }
> +
> + mutex_lock(&vfio.lock);
> +
> + /* Need to re-check that the device is still removeable under lock. */
> + if (device->iommu && __vfio_iommu_inuse(device->iommu)) {
> + mutex_unlock(&vfio.lock);
> + goto again;
> + }
> +
> + device_data = device->device_data;
> +
> + device->device_data = NULL;
> + dev_set_drvdata(dev, NULL);
> +
> + mutex_unlock(&vfio.lock);
> + return device_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_unbind_dev);
> +
> +/*
> + * Module/class support
> + */
> +static void vfio_class_release(struct kref *kref)
> +{
> + class_destroy(vfio.class);
> + vfio.class = NULL;
> +}
> +
> +static char *vfio_devnode(struct device *dev, mode_t *mode)
> +{
> + return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
> +}
> +
> +static int __init vfio_init(void)
> +{
> + int ret;
> +
> + idr_init(&vfio.idr);
> + mutex_init(&vfio.lock);
> + INIT_LIST_HEAD(&vfio.group_list);
> + init_waitqueue_head(&vfio.release_q);
> +
> + kref_init(&vfio.kref);
> + vfio.class = class_create(THIS_MODULE, "vfio");
> + if (IS_ERR(vfio.class)) {
> + ret = PTR_ERR(vfio.class);
> + goto err_class;
> + }
> +
> + vfio.class->devnode = vfio_devnode;
> +
> + /* FIXME - how many minors to allocate... all of them! */
> + ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
> + if (ret)
> + goto err_chrdev;
> +
> + cdev_init(&vfio.cdev, &vfio_group_fops);
> + ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK);
> + if (ret)
> + goto err_cdev;
> +
> + pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> +
> + return 0;
> +
> +err_cdev:
> + unregister_chrdev_region(vfio.devt, MINORMASK);
> +err_chrdev:
> + kref_put(&vfio.kref, vfio_class_release);
> +err_class:
> + return ret;
> +}
> +
> +static void __exit vfio_cleanup(void)
> +{
> + struct list_head *gpos, *gppos;
> +
> + list_for_each_safe(gpos, gppos, &vfio.group_list) {
> + struct vfio_group *group;
> + struct list_head *dpos, *dppos;
> +
> + group = list_entry(gpos, struct vfio_group, group_next);
> +
> + list_for_each_safe(dpos, dppos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> + vfio_group_del_dev(device->dev);
> + }
> + }
> +
> + idr_destroy(&vfio.idr);
> + cdev_del(&vfio.cdev);
> + unregister_chrdev_region(vfio.devt, MINORMASK);
> + kref_put(&vfio.kref, vfio_class_release);
> +}
> +
> +module_init(vfio_init);
> +module_exit(vfio_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
> new file mode 100644
> index 0000000..350ad67
> --- /dev/null
> +++ b/drivers/vfio/vfio_private.h
> @@ -0,0 +1,34 @@
> +/*
> + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +
> +#ifndef VFIO_PRIVATE_H
> +#define VFIO_PRIVATE_H
> +
> +struct vfio_iommu {
> + struct iommu_domain *domain;
> + struct bus_type *bus;
> + struct mutex dgate;
> + struct list_head dm_list;
> + struct mm_struct *mm;
> + struct list_head group_list;
> + int refcnt;
> + bool cache;
> +};
> +
> +extern int vfio_release_iommu(struct vfio_iommu *iommu);
> +extern void vfio_iommu_unmapall(struct vfio_iommu *iommu);
> +
> +#endif /* VFIO_PRIVATE_H */
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> new file mode 100644
> index 0000000..4269b08
> --- /dev/null
> +++ b/include/linux/vfio.h
> @@ -0,0 +1,155 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +#include <linux/types.h>
> +
> +#ifndef VFIO_H
> +#define VFIO_H
> +
> +#ifdef __KERNEL__
> +
> +struct vfio_device_ops {
> + bool (*match)(struct device *, char *);
> + int (*get)(void *);
> + void (*put)(void *);
> + ssize_t (*read)(void *, char __user *,
> + size_t, loff_t *);
> + ssize_t (*write)(void *, const char __user *,
> + size_t, loff_t *);
> + long (*ioctl)(void *, unsigned int, unsigned long);
> + int (*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +extern int vfio_group_add_dev(struct device *device,
> + const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *device);
> +extern int vfio_bind_dev(struct device *device, void *device_data);
> +extern void *vfio_unbind_dev(struct device *device);
> +
> +#endif /* __KERNEL__ */
> +
> +/*
> + * VFIO driver - allow mapping and use of certain devices
> + * in unprivileged user processes. (If IOMMU is present)
> + * Especially useful for Virtual Function parts of SR-IOV devices
> + */
> +
> +
> +/* Kernel & User level defines for ioctls */
> +
> +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + */
> +struct vfio_dma_map {
> + __u64 len; /* length of structure */
> + __u64 vaddr; /* process virtual addr */
> + __u64 dmaaddr; /* desired and/or returned dma address */
> + __u64 size; /* size in bytes */
> + __u64 flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> +};
> +
> +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> + /* Does the IOMMU support mapping any IOVA to any virtual address? */
> + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
> +
> +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
> + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> +
> +struct vfio_region_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* region number */
> + __u64 size; /* size in bytes of region */
> + __u64 offset; /* start offset of region */
> + __u64 flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> + __u64 phys; /* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* IRQ number */
> + __u32 count; /* number of individual IRQs */
> + __u32 flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> +};
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> +
> +#define VFIO_DEVICE_RESET _IO(';', 116)
> +
> +struct vfio_dtpath {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u64 flags;
> +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> + char *path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u32 prop_type;
> + __u32 prop_index;
> + __u64 flags;
> +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> +
> +#endif /* VFIO_H */
So where is the vfio-pci? Is that a separate posting?
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-11 17:51 ` Konrad Rzeszutek Wilk
@ 2011-11-11 22:10 ` Alex Williamson
2011-11-15 0:00 ` David Gibson
` (2 more replies)
0 siblings, 3 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-11 22:10 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, aafabbri,
B08248, B07421, avi, kvm, qemu-devel, iommu, linux-pci
Thanks Konrad! Comments inline.
On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> > VFIO provides a secure, IOMMU based interface for user space
> > drivers, including device assignment to virtual machines.
> > This provides the base management of IOMMU groups, devices,
> > and IOMMU objects. See Documentation/vfio.txt included in
> > this patch for user and kernel API description.
> >
> > Note, this implements the new API discussed at KVM Forum
> > 2011, as represented by the drvier version 0.2. It's hoped
> > that this provides a modular enough interface to support PCI
> > and non-PCI userspace drivers across various architectures
> > and IOMMU implementations.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >
> > Fingers crossed, this is the last RFC for VFIO, but we need
> > the iommu group support before this can go upstream
> > (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> > hoping this helps push that along.
> >
> > Since the last posting, this version completely modularizes
> > the device backends and better defines the APIs between the
> > core VFIO code and the device backends. I expect that we
> > might also adopt a modular IOMMU interface as iommu_ops learns
> > about different types of hardware. Also many, many cleanups.
> > Check the complete git history for details:
> >
> > git://github.com/awilliam/linux-vfio.git vfio-ng
> >
> > (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
> >
> > This version, along with the supporting VFIO PCI backend can
> > be found here:
> >
> > git://github.com/awilliam/linux-vfio.git vfio-next-20111103
> >
> > I've held off on implementing a kernel->user signaling
> > mechanism for now since the previous netlink version produced
> > too many gag reflexes. It's easy enough to set a bit in the
> > group flags too indicate such support in the future, so I
> > think we can move ahead without it.
> >
> > Appreciate any feedback or suggestions. Thanks,
> >
> > Alex
> >
> > Documentation/ioctl/ioctl-number.txt | 1
> > Documentation/vfio.txt | 304 +++++++++
> > MAINTAINERS | 8
> > drivers/Kconfig | 2
> > drivers/Makefile | 1
> > drivers/vfio/Kconfig | 8
> > drivers/vfio/Makefile | 3
> > drivers/vfio/vfio_iommu.c | 530 ++++++++++++++++
> > drivers/vfio/vfio_main.c | 1151 ++++++++++++++++++++++++++++++++++
> > drivers/vfio/vfio_private.h | 34 +
> > include/linux/vfio.h | 155 +++++
> > 11 files changed, 2197 insertions(+), 0 deletions(-)
> > create mode 100644 Documentation/vfio.txt
> > create mode 100644 drivers/vfio/Kconfig
> > create mode 100644 drivers/vfio/Makefile
> > create mode 100644 drivers/vfio/vfio_iommu.c
> > create mode 100644 drivers/vfio/vfio_main.c
> > create mode 100644 drivers/vfio/vfio_private.h
> > create mode 100644 include/linux/vfio.h
> >
> > diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> > index 54078ed..59d01e4 100644
> > --- a/Documentation/ioctl/ioctl-number.txt
> > +++ b/Documentation/ioctl/ioctl-number.txt
> > @@ -88,6 +88,7 @@ Code Seq#(hex) Include File Comments
> > and kernel/power/user.c
> > '8' all SNP8023 advanced NIC card
> > <mailto:mcr@solidum.com>
> > +';' 64-76 linux/vfio.h
> > '@' 00-0F linux/radeonfb.h conflict!
> > '@' 00-0F drivers/video/aty/aty128fb.c conflict!
> > 'A' 00-1F linux/apm_bios.h conflict!
> > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > new file mode 100644
> > index 0000000..5866896
> > --- /dev/null
> > +++ b/Documentation/vfio.txt
> > @@ -0,0 +1,304 @@
> > +VFIO - "Virtual Function I/O"[1]
> > +-------------------------------------------------------------------------------
> > +Many modern system now provide DMA and interrupt remapping facilities
> > +to help ensure I/O devices behave within the boundaries they've been
> > +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as
> > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > +embedded powerpc systems (technology name unknown). The VFIO driver
> > +is an IOMMU/device agnostic framework for exposing direct device
> > +access to userspace, in a secure, IOMMU protected environment. In
> > +other words, this allows safe, non-privileged, userspace drivers.
> > +
> > +Why do we want that? Virtual machines often make use of direct device
> > +access ("device assignment") when configured for the highest possible
> > +I/O performance. From a device and host perspective, this simply turns
> > +the VM into a userspace driver, with the benefits of significantly
> > +reduced latency, higher bandwidth, and direct use of bare-metal device
> > +drivers[2].
>
> Are there any constraints of running a 32-bit userspace with
> a 64-bit kernel and with 32-bit user space drivers?
Shouldn't be. I'll need to do some testing on that, but it was working
on the previous generation of vfio.
> > +
> > +Some applications, particularly in the high performance computing
> > +field, also benefit from low-overhead, direct device access from
> > +userspace. Examples include network adapters (often non-TCP/IP based)
> > +and compute accelerators. Previous to VFIO, these drivers needed to
> > +go through the full development cycle to become a proper upstream driver,
> > +be maintained out of tree, or make use of the UIO framework, which
> > +has no notion of IOMMU protection, limited interrupt support, and
> > +requires root privileges to access things like PCI configuration space.
> > +
> > +The VFIO driver framework intends to unify these, replacing both the
> > +KVM PCI specific device assignment currently used as well as providing
> > +a more secure, more featureful userspace driver environment than UIO.
> > +
> > +Groups, Devices, IOMMUs, oh my
>
> <chuckles> oh my, eh?
Anything for a corny chuckle :)
> > +-------------------------------------------------------------------------------
> > +
> > +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> > +can't always distinguish transactions from each individual device in
> > +the system. Sometimes this is because of the IOMMU design, such as with
> > +PEs, other times it's caused by the I/O topology, for instance a
> > +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> > +devices created by these restrictions IOMMU groups (or just "groups" for
> > +this document).
> > +
> > +The IOMMU cannot distinguish transactions between the individual devices
> > +within the group, therefore the group is the basic unit of ownership for
> > +a userspace process. Because of this, groups are also the primary
> > +interface to both devices and IOMMU domains in VFIO.
> > +
> > +The VFIO representation of groups is created as devices are added into
> > +the framework by a VFIO bus driver. The vfio-pci module is an example
> > +of a bus driver. This module registers devices along with a set of bus
> > +specific callbacks with the VFIO core. These callbacks provide the
> > +interfaces later used for device access. As each new group is created,
> > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > +character device.
> > +
> > +In addition to the device enumeration and callbacks, the VFIO bus driver
> > +also provides a traditional device driver and is able to bind to devices
> > +on its bus. When a device is bound to the bus driver it's available to
> > +VFIO. When all the devices within a group are bound to their bus drivers,
> > +the group becomes "viable" and a user with sufficient access to the VFIO
> > +group chardev can obtain exclusive access to the set of group devices.
> > +
> > +As documented in linux/vfio.h, several ioctls are provided on the
> > +group chardev:
> > +
> > +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> > + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> > +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> > +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> > +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> > +
> > +The last two ioctls return new file descriptors for accessing
> > +individual devices within the group and programming the IOMMU. Each of
> > +these new file descriptors provide their own set of file interfaces.
> > +These ioctls will fail if any of the devices within the group are not
> > +bound to their VFIO bus driver. Additionally, when either of these
> > +interfaces are used, the group is then bound to the struct_mm of the
> > +caller. The GET_FLAGS ioctl can be used to view the state of the group.
> > +
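To make the group handshake above concrete, here is a rough, untested userspace sketch against the ioctls as described in this document (later snippets in this reply build on the same variables; the group number and device name are made-up examples and error handling is omitted):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	int group_fd, iommu_fd, dev_fd;
	__u64 flags;

	/* hypothetical group 26 containing PCI device 0000:06:0d.0 */
	group_fd = open("/dev/vfio/26", O_RDWR);

	/* the group is only usable once all its devices are bound to vfio */
	ioctl(group_fd, VFIO_GROUP_GET_FLAGS, &flags);
	if (!(flags & VFIO_GROUP_FLAGS_VIABLE))
		return -1;	/* some device still bound to another driver */

	/* either call below also binds the group to the caller's mm */
	iommu_fd = ioctl(group_fd, VFIO_GROUP_GET_IOMMU_FD);
	dev_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");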
> > +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> > +new IOMMU domain is created and all of the devices in the group are
> > +attached to it. This is the only way to ensure full IOMMU isolation
> > +of the group, but potentially wastes resources and cycles if the user
> > +intends to manage multiple groups with the same set of IOMMU mappings.
> > +VFIO therefore provides a group MERGE and UNMERGE interface, which
> > +allows multiple groups to share an IOMMU domain. Not all IOMMUs allow
> > +arbitrary groups to be merged, so the user should assume merging is
> > +opportunistic. A new group, with no open device or IOMMU file
> > +descriptors, can be merged into an existing, in-use, group using the
> > +MERGE ioctl. A merged group can be unmerged using the UNMERGE ioctl
> > +once all of the device file descriptors for the group being merged
> > +"out" are closed.
> > +
> > +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> > +essentially fungible between group file descriptors (ie. if device A
> > +is in group X, and X is merged with Y, a file descriptor for A can be
> > +retrieved using GET_DEVICE_FD on Y. Likewise, GET_IOMMU_FD returns a
> > +file descriptor referencing the same internal IOMMU object from either
> > +X or Y). Merged groups can be dissolved either explicitly with UNMERGE
> > +or automatically when ALL file descriptors for the merged group are
> > +closed (all IOMMUs, all devices, all groups).
> > +
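Continuing that sketch, merging a second group (one with no open device or iommu file descriptors) into the first might look roughly like this; the calling convention for the fd argument (by pointer vs. by value) isn't spelled out here, so the pointer form below is only an assumption:

	/* hypothetical second group on the same bus */
	int other_fd = open("/dev/vfio/27", O_RDWR);

	if (ioctl(group_fd, VFIO_GROUP_MERGE, &other_fd) < 0)
		perror("VFIO_GROUP_MERGE");	/* incompatible iommu, in-use group, ... */

	/* later, once any fds obtained through the merged-in group are closed */
	ioctl(group_fd, VFIO_GROUP_UNMERGE, &other_fd);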
> > +The IOMMU file descriptor provides this set of ioctls:
> > +
> > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
>
> Coherency support is not going to be addressed right? What about sync?
> Say you need to sync CPU to Device address?
Do we need to expose that to userspace or should the underlying
iommu_ops take care of it?
> > +
> > +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> > +We currently only support IOMMU domains that are able to map any
> > +virtual address to any IOVA. This is indicated by the MAP_ANY flag.
> > +
> > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > +and unmapping IOVAs to process virtual addresses:
> > +
> > +struct vfio_dma_map {
> > + __u64 len; /* length of structure */
>
> What is the purpose of the 'len' field? Is it to guard against future
> version changes?
Yes, David Gibson suggested we include flags & len for all data
structures to help future proof them.
> > + __u64 vaddr; /* process virtual addr */
> > + __u64 dmaaddr; /* desired and/or returned dma address */
> > + __u64 size; /* size in bytes */
> > + __u64 flags;
> > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> > +};
> > +
> > +Current users of VFIO use relatively static DMA mappings, not requiring
> > +high frequency turnover. As new users are added, it's expected that the
>
> Is there a limit to how many DMA mappings can be created?
Not that I'm aware of for the current AMD-Vi/VT-d implementations. I
suppose iommu_ops would return -ENOSPC if it hit a limit. I added the
VFIO_IOMMU_FLAGS_MAP_ANY flag above to try to identify that kind of
restriction.
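As a sketch of how the struct above is used (untested; buffer, IOVA and size values are arbitrary, and <stdlib.h>/<stdio.h> are assumed for posix_memalign()/perror()):

	void *buf;
	posix_memalign(&buf, 4096, 0x10000);	/* page aligned user buffer */

	struct vfio_dma_map dm = {
		.len     = sizeof(dm),		/* struct size, future proofing */
		.vaddr   = (__u64)(unsigned long)buf,
		.dmaaddr = 0x100000,		/* requested IOVA, page aligned */
		.size    = 0x10000,		/* bytes, page aligned */
		.flags   = VFIO_DMA_MAP_FLAG_WRITE, /* device may write to it */
	};

	if (ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &dm))
		perror("VFIO_IOMMU_MAP_DMA");	/* e.g. -EBUSY on IOVA overlap */

	/* unmap takes the same struct and may split/shrink existing ranges */
	ioctl(iommu_fd, VFIO_IOMMU_UNMAP_DMA, &dm);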
> > +IOMMU file descriptor will evolve to support new mapping interfaces, this
> > +will be reflected in the flags and may present new ioctls and file
> > +interfaces.
> > +
> > +The device GET_FLAGS ioctl is intended to return basic device type and
> > +indicate support for optional capabilities. Flags currently include whether
> > +the device is PCI or described by Device Tree, and whether the RESET ioctl
> > +is supported:
>
> And reset in terms of PCIe spec is the FLR?
Yes, just a pass through to pci_reset_function() for the pci vfio bus
driver.
> > +
> > +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> > + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> > + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
> > + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> > +
> > +The MMIO and IOP resources used by a device are described by regions.
>
> IOP?
I/O port, I'll spell it out.
> > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > +
> > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
>
> Don't want __u32?
It could be; I'm not sure it buys us anything, and it might even restrict us.
We likely don't need 2^32 regions (famous last words?), so we could
later define <0 to mean something?
> > +
> > +Regions are described by a struct vfio_region_info, which is retrieved by
> > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > +the desired region (0 based index). Note that devices may implement zero
> > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > +mapping).
>
> Huh?
PCI has the following static mapping:
enum {
VFIO_PCI_BAR0_REGION_INDEX,
VFIO_PCI_BAR1_REGION_INDEX,
VFIO_PCI_BAR2_REGION_INDEX,
VFIO_PCI_BAR3_REGION_INDEX,
VFIO_PCI_BAR4_REGION_INDEX,
VFIO_PCI_BAR5_REGION_INDEX,
VFIO_PCI_ROM_REGION_INDEX,
VFIO_PCI_CONFIG_REGION_INDEX,
VFIO_PCI_NUM_REGIONS
};
So 8 regions are always reported regardless of whether the device
implements all the BARs and the ROM. Then we have a fixed bar:index
mapping so we don't have to create a region_info field to describe the
bar number for the index.
> > +
> > +struct vfio_region_info {
> > + __u32 len; /* length of structure */
> > + __u32 index; /* region number */
> > + __u64 size; /* size in bytes of region */
> > + __u64 offset; /* start offset of region */
> > + __u64 flags;
> > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
>
> What is FLAG_MMAP? Does it mean: 1) it can be mmaped, or 2) it is mmaped?
Supports mmap
> FLAG_RO is pretty obvious - presumably this is for firmware regions and such.
> And PHYS_VALID is if the region is disabled for some reasons? If so
> would the name FLAG_DISABLED be better?
No, POWER guys have some need to report the host physical address of the
region, so the flag indicates whether the below field is present and
valid. I'll clarify these in the docs.
>
> > + __u64 phys; /* physical address of region */
> > +};
> > +
> > +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> > +
> > +The offset indicates the offset into the device file descriptor which
> > +accesses the given range (for read/write/mmap/seek). Flags indicate the
> > +available access types and validity of optional fields. For instance
> > +the phys field may only be valid for certain device types.
> > +
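A sketch of walking the regions, using the fixed bar:index mapping explained earlier; this assumes GET_NUM_REGIONS returns the count through the int pointer, as the _IOR encoding suggests, and <sys/mman.h>/<unistd.h> for mmap()/pread():

	int i, nregions = 0;

	ioctl(dev_fd, VFIO_DEVICE_GET_NUM_REGIONS, &nregions);

	for (i = 0; i < nregions; i++) {
		struct vfio_region_info info = { .len = sizeof(info), .index = i };

		ioctl(dev_fd, VFIO_DEVICE_GET_REGION_INFO, &info);
		if (!info.size)
			continue;	/* zero sized, e.g. unimplemented BAR */

		if (info.flags & VFIO_REGION_INFO_FLAG_MMAP) {
			void *map = mmap(NULL, info.size, PROT_READ | PROT_WRITE,
					 MAP_SHARED, dev_fd, info.offset);
			/* direct MMIO access through 'map' */
		} else {
			__u32 val;
			/* slow path: read/write at the region's file offset */
			pread(dev_fd, &val, sizeof(val), info.offset);
		}
	}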
> > +Interrupts are described using a similar interface. GET_NUM_IRQS
> > +reports the number of IRQ indexes for the device.
> > +
> > +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
>
> _u32?
Same as above, but I don't have a strong preference.
> > +
> > +struct vfio_irq_info {
> > + __u32 len; /* length of structure */
> > + __u32 index; /* IRQ number */
> > + __u32 count; /* number of individual IRQs */
> > + __u64 flags;
> > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> > +};
> > +
> > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > +type to index mapping).
>
> I am not really sure what that means.
This is so PCI can expose:
enum {
VFIO_PCI_INTX_IRQ_INDEX,
VFIO_PCI_MSI_IRQ_INDEX,
VFIO_PCI_MSIX_IRQ_INDEX,
VFIO_PCI_NUM_IRQS
};
So like regions it always exposes 3 IRQ indexes where count=0 if the
device doesn't actually support that type of interrupt. I just want to
spell out that bus drivers have this kind of flexibility.
> > +
> > +Information about each index can be retrieved using the GET_IRQ_INFO
> > +ioctl, used much like GET_REGION_INFO.
> > +
> > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> > +
> > +Individual indexes can describe single or sets of IRQs. This provides the
> > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> > +
> > +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> > +as shown below, are used to pass the IRQ info index, the number of eventfds,
> > +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> > +
> > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
>
> Are eventfds u64 or u32?
int, they're just file descriptors
> Why not just define a structure?
> struct vfio_irq_eventfds {
> __u32 index;
> __u32 count;
> __u64 eventfds[0]
> };
We could do that if preferred. Hmm, are we then going to need
size/flags?
> How do you get an eventfd to feed in here?
eventfd(2), in qemu event_notifier_init() -> event_notifier_get_fd()
> > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
>
> u32?
Not here, it's an fd, so should be an int.
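Pulling the pieces above together, enabling an interrupt index might look like the following sketch, with the flat int array laid out per the comment above and eventfd(2) supplying the fds (<sys/eventfd.h>):

	struct vfio_irq_info irq = { .len = sizeof(irq), .index = 0 };
	int set[3];

	ioctl(dev_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);

	if (irq.count) {		/* count == 0 means unsupported index */
		set[0] = irq.index;	/* arg[0] = index */
		set[1] = 1;		/* arg[1] = count, one eventfd */
		set[2] = eventfd(0, 0);	/* arg[2] = fd vfio signals on interrupt */
		ioctl(dev_fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, set);
	}
	/* a later call with count = 0 disables the index again */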
> > +
> > +When a level triggered interrupt is signaled, the interrupt is masked
> > +on the host. This prevents an unresponsive userspace driver from
> > +continuing to interrupt the host system. After servicing the interrupt,
> > +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> > +triggered interrupts implicitly have a count of 1 per index.
>
> So they are enabled automatically? Meaning you don't even have to do
> SET_IRQ_EVENTFDS b/c the count is set to 1?
I suppose that should be "no more than 1 per index" (ie. PCI would
report a count of 0 for VFIO_PCI_INTX_IRQ_INDEX if the device doesn't
support INTx). I think you might be confusing VFIO_DEVICE_GET_IRQ_INFO
which tells how many are available with VFIO_DEVICE_SET_IRQ_EVENTFDS
which does the enabling/disabling. All interrupts are disabled by
default because userspace needs to give us a way to signal them via
eventfds. It will be device dependent whether multiple indexes can be
enabled simultaneously. Hmm, is that another flag on the irq_info
struct or do we expect drivers to implicitly have that kind of
knowledge?
> > +
> > +/* Unmask IRQ index, arg[0] = index */
> > +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
>
> So this is for MSI as well? So if I've an index = 1, with count = 4,
> and doing an unmask IRQ will enable all the MSI events at once?
No, this is only for re-enabling level triggered interrupts as discussed
above. Edge triggered interrupts like MSI don't need an unmask... we
may want to do something to accelerate the MSI-X table access for
masking specific interrupts, but I figured that would need to be PCI
aware since those are PCI features, and would therefore be some future
extension of the PCI bus driver and exposed via VFIO_DEVICE_GET_FLAGS.
> I guess there is not much point in enabling/disabling selective MSI
> IRQs..
Some older OSes are said to make extensive use of masking for MSI, so we
probably want this at some point. I'm assuming future PCI extension for
now.
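For a level triggered index the resulting userspace service loop would then be roughly (continuing the same sketch):

	__u64 val;

	for (;;) {
		/* eventfd fires; the host has already masked the line */
		read(set[2], &val, sizeof(val));

		/* ... service the device via its regions ... */

		/* allow the interrupt to retrigger; arg[0] = index */
		ioctl(dev_fd, VFIO_DEVICE_UNMASK_IRQ, &set[0]);
	}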
> > +
> > +Level triggered interrupts can also be unmasked using an irqfd. Use
>
> irqfd or eventfd?
irqfd is an eventfd in reverse. eventfd = kernel signals userspace via
an fd, irqfd = userspace signals kernel via an fd.
> > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
>
> So only level triggered? Hmm, how do I know whether the device is
> level or edge? Or is it that edge (MSI) can also be unmasked using the
> eventfd?
Yes, only for level. Isn't a device going to know what type of
interrupt it uses? MSI masking is PCI specific, not handled by this.
> > +
> > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> > +
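A sketch of the irqfd-style unmask path above, again continuing the earlier example; in qemu/kvm the fd would normally be driven by the irqfd machinery rather than a plain write():

	int unmask[2];
	__u64 one = 1;

	unmask[0] = irq.index;		/* arg[0] = index */
	unmask[1] = eventfd(0, 0);	/* arg[1] = eventfd that triggers unmask */
	ioctl(dev_fd, VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD, unmask);

	/* signalling the fd now unmasks the line without another ioctl */
	write(unmask[1], &one, sizeof(one));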
> > +When supported, as indicated by the device flags, reset the device.
> > +
> > +#define VFIO_DEVICE_RESET _IO(';', 116)
>
> Does it disable the 'count'? Err, does it disable the IRQ on the
> device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
> to set new eventfds? Or does it re-use the eventfds and the device
> is enabled after this?
It doesn't affect the interrupt programming. Should it?
> > +
> > +Device tree devices also invlude ioctls for further defining the
>
> include
>
> > +device tree properties of the device:
> > +
> > +struct vfio_dtpath {
> > + __u32 len; /* length of structure */
> > + __u32 index;
>
> 0 based I presume?
Everything else is, I would assume so.
> > + __u64 flags;
> > +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
>
> What is region in this context?? Or would this make much more sense
> if I knew what Device Tree actually is.
Powerpc guys, any comments? This was their suggestion. These are
effectively the first device specific extension, available when
VFIO_DEVICE_FLAGS_DT is set.
> > +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> > + char *path;
>
> Ah, now I see why you want 'len' here.. But I am still at a loss
> why you want that with the other structures.
Attempt to future proof and validate input.
> > +};
> > +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> > +
> > +struct vfio_dtindex {
> > + __u32 len; /* length of structure */
> > + __u32 index;
> > + __u32 prop_type;
>
> Is that an enum type? Is this defined somewhere?
> > + __u32 prop_index;
>
> What is the purpose of this field?
Need input from powerpc folks here
> > + __u64 flags;
> > +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> > +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> > +};
> > +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> > +
> > +
> > +VFIO bus driver API
> > +-------------------------------------------------------------------------------
> > +
> > +Bus drivers, such as PCI, have three jobs:
> > + 1) Add/remove devices from vfio
> > + 2) Provide vfio_device_ops for device access
> > + 3) Device binding and unbinding
>
> suspend/resume?
In the previous version of vfio, the vfio core signaled suspend/resume
to userspace via netlink, effectively putting userspace on the pm
notifier chain. I was intending to do the same here.
> > +
> > +When initialized, the bus driver should enumerate the devices on its
> > +bus and call vfio_group_add_dev() for each device. If the bus supports
> > +hotplug, notifiers should be enabled to track devices being added and
> > +removed. vfio_group_del_dev() removes a previously added device from
> > +vfio.
> > +
> > +Adding a device registers a vfio_device_ops function pointer structure
> > +for the device:
>
> Huh? So this gets created for _every_ 'struct device' that is added to
> the VFIO bus? Is this structure exposed? Or is this an internal one?
Every device added creates a struct vfio_device and if necessary a
struct vfio_group. These are internal, just for managing groups and
devices.
> > +
> > +struct vfio_device_ops {
> > + bool (*match)(struct device *, char *);
> > + int (*get)(void *);
> > + void (*put)(void *);
> > + ssize_t (*read)(void *, char __user *,
> > + size_t, loff_t *);
> > + ssize_t (*write)(void *, const char __user *,
> > + size_t, loff_t *);
> > + long (*ioctl)(void *, unsigned int, unsigned long);
> > + int (*mmap)(void *, struct vm_area_struct *);
> > +};
> > +
> > +When a device is bound to the bus driver, the bus driver indicates this
> > +to vfio using the vfio_bind_dev() interface. The device_data parameter
>
> Might want to paste the function declaration for it.. b/c I am not sure
> where the 'device_data' parameter is on the argument list.
Ok
> > +is a pointer to an opaque data structure for use only by the bus driver.
> > +The get, put, read, write, ioctl, and mmap vfio_device_ops all pass
> > +this data structure back to the bus driver. When a device is unbound
>
> Oh, so it is on the 'void *'.
Right
> > +from the bus driver, the vfio_unbind_dev() interface signals this to
> > +vfio. This function returns the pointer to the device_data structure
>
> That function
> > +registered for the device.
>
> I am not really sure what this section's purpose is? Could this be part
> of the header file or the code? It does not look to be part of the
> ioctl API?
We've passed into the "VFIO bus driver API" section of the document, to
explain the interaction between vfio-core and vfio bus drivers.
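To illustrate the bus driver side, a skeletal vfio_device_ops for a PCI-like bus driver might look as follows; only the ops layout and the dev_name() match come from this document, the my_* names and bodies are invented for illustration:

	/* hypothetical bus driver callbacks */
	static bool my_match(struct device *dev, char *buf)
	{
		/* PCI: dev_name() ("0000:06:0d.0") is unique system-wide */
		return !strcmp(dev_name(dev), buf);
	}

	static int my_get(void *device_data)
	{
		/* first open of a device fd: enable the device, take a ref */
		return 0;
	}

	static void my_put(void *device_data)
	{
		/* last close: undo my_get() */
	}

	static const struct vfio_device_ops my_vfio_ops = {
		.match	= my_match,
		.get	= my_get,
		.put	= my_put,
		/* .read/.write/.ioctl/.mmap provide region and interrupt access */
	};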
> > +
> > +As noted previously, a group contains one or more devices, so
> > +GROUP_GET_DEVICE_FD needs to identify the specific device being requested.
> > +The vfio_device_ops.match callback is used to allow bus drivers to determine
> > +the match. For drivers like vfio-pci, it's a simple match to dev_name(),
> > +which is unique in the system due to the PCI bus topology, other bus drivers
> > +may need to include parent devices to create a unique match, so this is
> > +left as a bus driver interface.
> > +
> > +-------------------------------------------------------------------------------
> > +
> > +[1] VFIO was originally an acronym for "Virtual Function I/O" in its
> > +initial implementation by Tom Lyon while at Cisco. We've since outgrown
> > +the acronym, but it's catchy.
> > +
> > +[2] As always there are trade-offs to virtual machine device
> > +assignment that are beyond the scope of VFIO. It's expected that
> > +future IOMMU technologies will reduce some, but maybe not all, of
> > +these trade-offs.
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index f05f5f6..4bd5aa0 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -7106,6 +7106,14 @@ S: Maintained
> > F: Documentation/filesystems/vfat.txt
> > F: fs/fat/
> >
> > +VFIO DRIVER
> > +M: Alex Williamson <alex.williamson@redhat.com>
> > +L: kvm@vger.kernel.org
>
> No vfio mailing list? Or a vfio-mailing list?
IIRC, Avi had agreed that we could use kvm for now. I don't know that
vfio will warrant its own list. If it picks up, sure, we can move it.
> > +S: Maintained
> > +F: Documentation/vfio.txt
> > +F: drivers/vfio/
> > +F: include/linux/vfio.h
> > +
> > VIDEOBUF2 FRAMEWORK
> > M: Pawel Osciak <pawel@osciak.com>
> > M: Marek Szyprowski <m.szyprowski@samsung.com>
> > diff --git a/drivers/Kconfig b/drivers/Kconfig
> > index b5e6f24..e15578b 100644
> > --- a/drivers/Kconfig
> > +++ b/drivers/Kconfig
> > @@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
> >
> > source "drivers/uio/Kconfig"
> >
> > +source "drivers/vfio/Kconfig"
> > +
> > source "drivers/vlynq/Kconfig"
> >
> > source "drivers/virtio/Kconfig"
> > diff --git a/drivers/Makefile b/drivers/Makefile
> > index 1b31421..5f138b5 100644
> > --- a/drivers/Makefile
> > +++ b/drivers/Makefile
> > @@ -58,6 +58,7 @@ obj-$(CONFIG_ATM) += atm/
> > obj-$(CONFIG_FUSION) += message/
> > obj-y += firewire/
> > obj-$(CONFIG_UIO) += uio/
> > +obj-$(CONFIG_VFIO) += vfio/
> > obj-y += cdrom/
> > obj-y += auxdisplay/
> > obj-$(CONFIG_PCCARD) += pcmcia/
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> > new file mode 100644
> > index 0000000..9acb1e7
> > --- /dev/null
> > +++ b/drivers/vfio/Kconfig
> > @@ -0,0 +1,8 @@
> > +menuconfig VFIO
> > + tristate "VFIO Non-Privileged userspace driver framework"
> > + depends on IOMMU_API
> > + help
> > + VFIO provides a framework for secure userspace device drivers.
> > + See Documentation/vfio.txt for more details.
> > +
> > + If you don't know what to do here, say N.
> > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> > new file mode 100644
> > index 0000000..088faf1
> > --- /dev/null
> > +++ b/drivers/vfio/Makefile
> > @@ -0,0 +1,3 @@
> > +vfio-y := vfio_main.o vfio_iommu.o
> > +
> > +obj-$(CONFIG_VFIO) := vfio.o
> > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > new file mode 100644
> > index 0000000..029dae3
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu.c
> > @@ -0,0 +1,530 @@
> > +/*
> > + * VFIO: IOMMU DMA mapping support
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> > + * Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#include <linux/compat.h>
> > +#include <linux/device.h>
> > +#include <linux/fs.h>
> > +#include <linux/iommu.h>
> > +#include <linux/module.h>
> > +#include <linux/mm.h>
> > +#include <linux/sched.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/workqueue.h>
> > +
> > +#include "vfio_private.h"
> > +
> > +struct dma_map_page {
> > + struct list_head list;
> > + dma_addr_t daddr;
> > + unsigned long vaddr;
> > + int npage;
> > + int rdwr;
>
> rdwr? Is this a flag thing? Could it be made in an enum?
Or maybe better would just be a bool.
> > +};
> > +
> > +/*
> > + * This code handles mapping and unmapping of user data buffers
> > + * into DMA'ble space using the IOMMU
> > + */
> > +
> > +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> > +
> > +struct vwork {
> > + struct mm_struct *mm;
> > + int npage;
> > + struct work_struct work;
> > +};
> > +
> > +/* delayed decrement for locked_vm */
> > +static void vfio_lock_acct_bg(struct work_struct *work)
> > +{
> > + struct vwork *vwork = container_of(work, struct vwork, work);
> > + struct mm_struct *mm;
> > +
> > + mm = vwork->mm;
> > + down_write(&mm->mmap_sem);
> > + mm->locked_vm += vwork->npage;
> > + up_write(&mm->mmap_sem);
> > + mmput(mm); /* unref mm */
> > + kfree(vwork);
> > +}
> > +
> > +static void vfio_lock_acct(int npage)
> > +{
> > + struct vwork *vwork;
> > + struct mm_struct *mm;
> > +
> > + if (!current->mm) {
> > + /* process exited */
> > + return;
> > + }
> > + if (down_write_trylock(&current->mm->mmap_sem)) {
> > + current->mm->locked_vm += npage;
> > + up_write(&current->mm->mmap_sem);
> > + return;
> > + }
> > + /*
> > + * Couldn't get mmap_sem lock, so must setup to decrement
> > + * mm->locked_vm later. If locked_vm were atomic, we wouldn't
> > + * need this silliness
> > + */
> > + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> > + if (!vwork)
> > + return;
> > + mm = get_task_mm(current); /* take ref mm */
> > + if (!mm) {
> > + kfree(vwork);
> > + return;
> > + }
> > + INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> > + vwork->mm = mm;
> > + vwork->npage = npage;
> > + schedule_work(&vwork->work);
> > +}
> > +
> > +/* Some mappings aren't backed by a struct page, for example an mmap'd
> > + * MMIO range for our own or another device. These use a different
> > + * pfn conversion and shouldn't be tracked as locked pages. */
> > +static int is_invalid_reserved_pfn(unsigned long pfn)
>
> static bool
>
> > +{
> > + if (pfn_valid(pfn)) {
> > + int reserved;
> > + struct page *tail = pfn_to_page(pfn);
> > + struct page *head = compound_trans_head(tail);
> > + reserved = PageReserved(head);
>
> bool reserved = PageReserved(head);
Agree on both
> > + if (head != tail) {
> > + /* "head" is not a dangling pointer
> > + * (compound_trans_head takes care of that)
> > + * but the hugepage may have been split
> > + * from under us (and we may not hold a
> > + * reference count on the head page so it can
> > + * be reused before we run PageReferenced), so
> > + * we've to check PageTail before returning
> > + * what we just read.
> > + */
> > + smp_rmb();
> > + if (PageTail(tail))
> > + return reserved;
> > + }
> > + return PageReserved(tail);
> > + }
> > +
> > + return true;
> > +}
> > +
> > +static int put_pfn(unsigned long pfn, int rdwr)
> > +{
> > + if (!is_invalid_reserved_pfn(pfn)) {
> > + struct page *page = pfn_to_page(pfn);
> > + if (rdwr)
> > + SetPageDirty(page);
> > + put_page(page);
> > + return 1;
> > + }
> > + return 0;
> > +}
> > +
> > +/* Unmap DMA region */
> > +/* dgate must be held */
>
> dgate?
DMA gate, the mutex for iommu operations. This is a carry over from old
vfio. As there's only one mutex on the struct vfio_iommu, I can just
rename that to "lock".
> > +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > + int npage, int rdwr)
> > +{
> > + int i, unlocked = 0;
> > +
> > + for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > + unsigned long pfn;
> > +
> > + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> > + if (pfn) {
> > + iommu_unmap(iommu->domain, iova, 0);
>
> What is the '0' for? Perhaps a comment: /* We only do zero order */
Yep. We'll need to improve this at some point to take advantage of
large iommu pages, but it shouldn't affect the API. I'll add a comment.
> > + unlocked += put_pfn(pfn, rdwr);
> > + }
> > + }
> > + return unlocked;
> > +}
> > +
> > +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > + unsigned long npage, int rdwr)
> > +{
> > + int unlocked;
> > +
> > + unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> > + vfio_lock_acct(-unlocked);
> > +}
> > +
> > +/* Unmap ALL DMA regions */
> > +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos, *pos2;
>
> pos2 should probably be just called 'tmp'
ok
> > + struct dma_map_page *mlp;
>
> What does 'mlp' stand for?
>
> mlp -> dma_page ?
Carry over from original code, I can guess, but not sure what Tom was
originally thinking. I think everyone has asked so far, so I'll make a
pass at coming up with names that I can explain.
> > +
> > + mutex_lock(&iommu->dgate);
> > + list_for_each_safe(pos, pos2, &iommu->dm_list) {
> > + mlp = list_entry(pos, struct dma_map_page, list);
> > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
>
> Uh, so if it did not get put_page() we would try to still delete it?
> Couldn't that lead to corruption as the 'mlp' is returned to the pool?
>
> Ah wait, the put_page is on the DMA page, so it is OK to
> delete the tracking structure. It will be just a leaked page.
Assume you're referencing this chunk:
vfio_dma_unmap
__vfio_dma_unmap
...
pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
if (pfn) {
iommu_unmap(iommu->domain, iova, 0);
unlocked += put_pfn(pfn, rdwr);
}
So we skip things that aren't mapped in the iommu, but anything not
mapped should have already been put (failed vfio_dma_map). If it is
mapped, we put it if we originally got it via get_user_pages_fast.
unlocked would only not get incremented here if it was an mmap'd page
(such as the mmap of an mmio space of another vfio device), via the code
in vaddr_get_pfn (stolen from KVM).
> > + list_del(&mlp->list);
> > + kfree(mlp);
> > + }
> > + mutex_unlock(&iommu->dgate);
> > +}
> > +
> > +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> > +{
> > + struct page *page[1];
> > + struct vm_area_struct *vma;
> > + int ret = -EFAULT;
> > +
> > + if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> > + *pfn = page_to_pfn(page[0]);
> > + return 0;
> > + }
> > +
> > + down_read(&current->mm->mmap_sem);
> > +
> > + vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> > +
> > + if (vma && vma->vm_flags & VM_PFNMAP) {
> > + *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > + if (is_invalid_reserved_pfn(*pfn))
> > + ret = 0;
>
> Did you mean to break here?
We're in an if block, not a loop.
> > + }
> > +
> > + up_read(&current->mm->mmap_sem);
> > +
> > + return ret;
> > +}
> > +
> > +/* Map DMA region */
> > +/* dgate must be held */
> > +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> > + unsigned long vaddr, int npage, int rdwr)
> > +{
> > + unsigned long start = iova;
> > + int i, ret, locked = 0, prot = IOMMU_READ;
> > +
> > + /* Verify pages are not already mapped */
>
> I think a 'that' is missing above.
Ok.
> > + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> > + if (iommu_iova_to_phys(iommu->domain, iova))
> > + return -EBUSY;
> > +
> > + iova = start;
> > +
> > + if (rdwr)
> > + prot |= IOMMU_WRITE;
> > + if (iommu->cache)
> > + prot |= IOMMU_CACHE;
> > +
> > + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > + unsigned long pfn = 0;
> > +
> > + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > + if (ret) {
> > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > + return ret;
> > + }
> > +
> > + /* Only add actual locked pages to accounting */
> > + if (!is_invalid_reserved_pfn(pfn))
> > + locked++;
> > +
> > + ret = iommu_map(iommu->domain, iova,
> > + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
>
> Put a comment by the 0 saying /* order 0 pages only! */
Yep
> > + if (ret) {
> > + /* Back out mappings on error */
> > + put_pfn(pfn, rdwr);
> > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > + return ret;
> > + }
> > + }
> > + vfio_lock_acct(locked);
> > + return 0;
> > +}
> > +
> > +static inline int ranges_overlap(unsigned long start1, size_t size1,
>
> Perhaps a bool?
Sure
> > + unsigned long start2, size_t size2)
> > +{
> > + return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> > +}
> > +
> > +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> > + dma_addr_t start, size_t size)
> > +{
> > + struct list_head *pos;
> > + struct dma_map_page *mlp;
> > +
> > + list_for_each(pos, &iommu->dm_list) {
> > + mlp = list_entry(pos, struct dma_map_page, list);
> > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > + start, size))
> > + return mlp;
> > + }
> > + return NULL;
> > +}
> > +
> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> > + size_t size, struct dma_map_page *mlp)
> > +{
> > + struct dma_map_page *split;
> > + int npage_lo, npage_hi;
> > +
> > + /* Existing dma region is completely covered, unmap all */
> > + if (start <= mlp->daddr &&
> > + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > + list_del(&mlp->list);
> > + npage_lo = mlp->npage;
> > + kfree(mlp);
> > + return npage_lo;
> > + }
> > +
> > + /* Overlap low address of existing range */
> > + if (start <= mlp->daddr) {
> > + size_t overlap;
> > +
> > + overlap = start + size - mlp->daddr;
> > + npage_lo = overlap >> PAGE_SHIFT;
> > + npage_hi = mlp->npage - npage_lo;
> > +
> > + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > + mlp->daddr += overlap;
> > + mlp->vaddr += overlap;
> > + mlp->npage -= npage_lo;
> > + return npage_lo;
> > + }
> > +
> > + /* Overlap high address of existing range */
> > + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + size_t overlap;
> > +
> > + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > + npage_hi = overlap >> PAGE_SHIFT;
> > + npage_lo = mlp->npage - npage_hi;
> > +
> > + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > + mlp->npage -= npage_hi;
> > + return npage_hi;
> > + }
> > +
> > + /* Split existing */
> > + npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> > + npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> > +
> > + split = kzalloc(sizeof *split, GFP_KERNEL);
> > + if (!split)
> > + return -ENOMEM;
> > +
> > + vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> > +
> > + mlp->npage = npage_lo;
> > +
> > + split->npage = npage_hi;
> > + split->daddr = start + size;
> > + split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> > + split->rdwr = mlp->rdwr;
> > + list_add(&split->list, &iommu->dm_list);
> > + return size >> PAGE_SHIFT;
> > +}
> > +
> > +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > + int ret = 0;
> > + size_t npage = dmp->size >> PAGE_SHIFT;
> > + struct list_head *pos, *n;
> > +
> > + if (dmp->dmaaddr & ~PAGE_MASK)
> > + return -EINVAL;
> > + if (dmp->size & ~PAGE_MASK)
> > + return -EINVAL;
> > +
> > + mutex_lock(&iommu->dgate);
> > +
> > + list_for_each_safe(pos, n, &iommu->dm_list) {
> > + struct dma_map_page *mlp;
> > +
> > + mlp = list_entry(pos, struct dma_map_page, list);
> > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > + dmp->dmaaddr, dmp->size)) {
> > + ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> > + dmp->size, mlp);
> > + if (ret > 0)
> > + npage -= NPAGE_TO_SIZE(ret);
> > + if (ret < 0 || npage == 0)
> > + break;
> > + }
> > + }
> > + mutex_unlock(&iommu->dgate);
> > + return ret > 0 ? 0 : ret;
> > +}
> > +
> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > + int npage;
> > + struct dma_map_page *mlp, *mmlp = NULL;
> > + dma_addr_t daddr = dmp->dmaaddr;
> > + unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > + size_t size = dmp->size;
> > + int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > +
> > + if (vaddr & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (daddr & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (size & (PAGE_SIZE-1))
> > + return -EINVAL;
> > +
> > + npage = size >> PAGE_SHIFT;
> > + if (!npage)
> > + return -EINVAL;
> > +
> > + if (!iommu)
> > + return -EINVAL;
> > +
> > + mutex_lock(&iommu->dgate);
> > +
> > + if (vfio_find_dma(iommu, daddr, size)) {
> > + ret = -EBUSY;
> > + goto out_lock;
> > + }
> > +
> > + /* account for locked pages */
> > + locked = current->mm->locked_vm + npage;
> > + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > + __func__, rlimit(RLIMIT_MEMLOCK));
> > + ret = -ENOMEM;
> > + goto out_lock;
> > + }
> > +
> > + ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> > + if (ret)
> > + goto out_lock;
> > +
> > + /* Check if we abut a region below */
> > + if (daddr) {
> > + mlp = vfio_find_dma(iommu, daddr - 1, 1);
> > + if (mlp && mlp->rdwr == rdwr &&
> > + mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> > +
> > + mlp->npage += npage;
> > + daddr = mlp->daddr;
> > + vaddr = mlp->vaddr;
> > + npage = mlp->npage;
> > + size = NPAGE_TO_SIZE(npage);
> > +
> > + mmlp = mlp;
> > + }
> > + }
> > +
> > + if (daddr + size) {
> > + mlp = vfio_find_dma(iommu, daddr + size, 1);
> > + if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> > +
> > + mlp->npage += npage;
> > + mlp->daddr = daddr;
> > + mlp->vaddr = vaddr;
> > +
> > + /* If merged above and below, remove previously
> > + * merged entry. New entry covers it. */
> > + if (mmlp) {
> > + list_del(&mmlp->list);
> > + kfree(mmlp);
> > + }
> > + mmlp = mlp;
> > + }
> > + }
> > +
> > + if (!mmlp) {
> > + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> > + if (!mlp) {
> > + ret = -ENOMEM;
> > + vfio_dma_unmap(iommu, daddr, npage, rdwr);
> > + goto out_lock;
> > + }
> > +
> > + mlp->npage = npage;
> > + mlp->daddr = daddr;
> > + mlp->vaddr = vaddr;
> > + mlp->rdwr = rdwr;
> > + list_add(&mlp->list, &iommu->dm_list);
> > + }
> > +
> > +out_lock:
> > + mutex_unlock(&iommu->dgate);
> > + return ret;
> > +}
> > +
> > +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_iommu *iommu = filep->private_data;
> > +
> > + vfio_release_iommu(iommu);
> > + return 0;
> > +}
> > +
> > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + struct vfio_iommu *iommu = filep->private_data;
> > + int ret = -ENOSYS;
> > +
> > + if (cmd == VFIO_IOMMU_GET_FLAGS) {
>
> Something is weird with the tabbing here..
Indeed, the joys of switching between kernel and qemu ;) Fixed.
> > + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > +
> > + ret = put_user(flags, (u64 __user *)arg);
> > +
> > + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > + struct vfio_dma_map dm;
> > +
> > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > + return -EFAULT;
> > +
> > + ret = vfio_dma_map_dm(iommu, &dm);
> > +
> > + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> > + ret = -EFAULT;
> > +
> > + } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> > + struct vfio_dma_map dm;
> > +
> > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > + return -EFAULT;
> > +
> > + ret = vfio_dma_unmap_dm(iommu, &dm);
> > +
> > + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> > + ret = -EFAULT;
> > + }
> > + return ret;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_iommu_compat_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + arg = (unsigned long)compat_ptr(arg);
> > + return vfio_iommu_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif /* CONFIG_COMPAT */
> > +
> > +const struct file_operations vfio_iommu_fops = {
> > + .owner = THIS_MODULE,
> > + .release = vfio_iommu_release,
> > + .unlocked_ioctl = vfio_iommu_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > + .compat_ioctl = vfio_iommu_compat_ioctl,
> > +#endif
> > +};
> > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > new file mode 100644
> > index 0000000..6169356
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_main.c
> > @@ -0,0 +1,1151 @@
> > +/*
> > + * VFIO framework
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> > + * Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#include <linux/cdev.h>
> > +#include <linux/compat.h>
> > +#include <linux/device.h>
> > +#include <linux/file.h>
> > +#include <linux/anon_inodes.h>
> > +#include <linux/fs.h>
> > +#include <linux/idr.h>
> > +#include <linux/iommu.h>
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/string.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/wait.h>
> > +
> > +#include "vfio_private.h"
> > +
> > +#define DRIVER_VERSION "0.2"
> > +#define DRIVER_AUTHOR "Alex Williamson <alex.williamson@redhat.com>"
> > +#define DRIVER_DESC "VFIO - User Level meta-driver"
> > +
> > +static int allow_unsafe_intrs;
>
> __read_mostly
Ok
> > +module_param(allow_unsafe_intrs, int, 0);
>
> S_IRUGO ?
I actually intended that to be S_IRUGO | S_IWUSR just like the kvm
parameter so it can be toggled runtime.
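For reference, the writable variant mentioned here, together with the __read_mostly annotation suggested above, would just be:

	static int allow_unsafe_intrs __read_mostly;
	module_param(allow_unsafe_intrs, int, S_IRUGO | S_IWUSR);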
> > +MODULE_PARM_DESC(allow_unsafe_intrs,
> > + "Allow use of IOMMUs which do not support interrupt remapping");
> > +
> > +static struct vfio {
> > + dev_t devt;
> > + struct cdev cdev;
> > + struct list_head group_list;
> > + struct mutex lock;
> > + struct kref kref;
> > + struct class *class;
> > + struct idr idr;
> > + wait_queue_head_t release_q;
> > +} vfio;
>
> You probably want to move this below the 'vfio_group'
> as vfio contains the vfio_group.
Only via the group_list. Are you suggesting that for readability or to avoid
forward declarations (which we don't need between these two with the current
ordering)?
> > +
> > +static const struct file_operations vfio_group_fops;
> > +extern const struct file_operations vfio_iommu_fops;
> > +
> > +struct vfio_group {
> > + dev_t devt;
> > + unsigned int groupid;
> > + struct bus_type *bus;
> > + struct vfio_iommu *iommu;
> > + struct list_head device_list;
> > + struct list_head iommu_next;
> > + struct list_head group_next;
> > + int refcnt;
> > +};
> > +
> > +struct vfio_device {
> > + struct device *dev;
> > + const struct vfio_device_ops *ops;
> > + struct vfio_iommu *iommu;
> > + struct vfio_group *group;
> > + struct list_head device_next;
> > + bool attached;
> > + int refcnt;
> > + void *device_data;
> > +};
>
> And perhaps move this above vfio_group. As vfio_group
> contains a list of these structures?
These are inter-linked, so chicken and egg. The current ordering is
more function based than definition based. struct vfio is the highest
level object, groups are next, iommus and devices are next, but we need
to share iommus with the other file, so that lands in the header.
> > +
> > +/*
> > + * Helper functions called under vfio.lock
> > + */
> > +
> > +/* Return true if any devices within a group are opened */
> > +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> > +{
> > + struct list_head *pos;
> > +
> > + list_for_each(pos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + if (device->refcnt)
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/* Return true if any of the groups attached to an iommu are opened.
> > + * We can only tear apart merged groups when nothing is left open. */
> > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > +
> > + list_for_each(pos, &iommu->group_list) {
> > + struct vfio_group *group;
> > +
> > + group = list_entry(pos, struct vfio_group, iommu_next);
> > + if (group->refcnt)
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/* An iommu is "in use" if it has a file descriptor open or if any of
> > + * the groups assigned to the iommu have devices open. */
> > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > +
> > + if (iommu->refcnt)
> > + return true;
> > +
> > + list_for_each(pos, &iommu->group_list) {
> > + struct vfio_group *group;
> > +
> > + group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > + if (__vfio_group_devs_inuse(group))
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +static void __vfio_group_set_iommu(struct vfio_group *group,
> > + struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > +
> > + if (group->iommu)
> > + list_del(&group->iommu_next);
> > + if (iommu)
> > + list_add(&group->iommu_next, &iommu->group_list);
> > +
> > + group->iommu = iommu;
> > +
> > + list_for_each(pos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + device->iommu = iommu;
> > + }
> > +}
> > +
> > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> > + struct vfio_device *device)
> > +{
> > + BUG_ON(!iommu->domain && device->attached);
>
> Whoa. Heavy hammer there.
>
> Perhaps WARN_ON as you do check it later on.
I think it's warranted, internal consistency is broken if we have a
device that thinks it's attached to an iommu domain that doesn't exist.
It should, of course, never happen and this isn't a performance path.
> > +
> > + if (!iommu->domain || !device->attached)
> > + return;
> > +
> > + iommu_detach_device(iommu->domain, device->dev);
> > + device->attached = false;
> > +}
> > +
> > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> > + struct vfio_group *group)
> > +{
> > + struct list_head *pos;
> > +
> > + list_for_each(pos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + __vfio_iommu_detach_dev(iommu, device);
> > + }
> > +}
> > +
> > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> > + struct vfio_device *device)
> > +{
> > + int ret;
> > +
> > + BUG_ON(device->attached);
>
> How about:
>
> WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register
> the device again! Tell him/her to stop please.\n");
I would almost demote this one to a WARN_ON, but userspace isn't in
control of attaching and detaching devices from the iommu. That's a
side effect of getting the iommu or device file descriptor. So again,
this is an internal consistency check and it should never happen,
regardless of userspace.
> > +
> > + if (!iommu || !iommu->domain)
> > + return -EINVAL;
> > +
> > + ret = iommu_attach_device(iommu->domain, device->dev);
> > + if (!ret)
> > + device->attached = true;
> > +
> > + return ret;
> > +}
> > +
> > +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
> > + struct vfio_group *group)
> > +{
> > + struct list_head *pos;
> > +
> > + list_for_each(pos, &group->device_list) {
> > + struct vfio_device *device;
> > + int ret;
> > +
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + ret = __vfio_iommu_attach_dev(iommu, device);
> > + if (ret) {
> > + __vfio_iommu_detach_group(iommu, group);
> > + return ret;
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > +/* The iommu is viable, ie. ready to be configured, when all the devices
> > + * for all the groups attached to the iommu are bound to their vfio device
> > + * drivers (ex. vfio-pci). This sets the device_data private data pointer. */
> > +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *gpos, *dpos;
> > +
> > + list_for_each(gpos, &iommu->group_list) {
> > + struct vfio_group *group;
> > + group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > + list_for_each(dpos, &group->device_list) {
> > + struct vfio_device *device;
> > + device = list_entry(dpos,
> > + struct vfio_device, device_next);
> > +
> > + if (!device->device_data)
> > + return false;
> > + }
> > + }
> > + return true;
> > +}
> > +
> > +static void __vfio_close_iommu(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > +
> > + if (!iommu->domain)
> > + return;
> > +
> > + list_for_each(pos, &iommu->group_list) {
> > + struct vfio_group *group;
> > + group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > + __vfio_iommu_detach_group(iommu, group);
> > + }
> > +
> > + vfio_iommu_unmapall(iommu);
> > +
> > + iommu_domain_free(iommu->domain);
> > + iommu->domain = NULL;
> > + iommu->mm = NULL;
> > +}
> > +
> > +/* Open the IOMMU. This gates all access to the iommu or device file
> > + * descriptors and sets current->mm as the exclusive user. */
> > +static int __vfio_open_iommu(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos;
> > + int ret;
> > +
> > + if (!__vfio_iommu_viable(iommu))
> > + return -EBUSY;
> > +
> > + if (iommu->domain)
> > + return -EINVAL;
> > +
> > + iommu->domain = iommu_domain_alloc(iommu->bus);
> > + if (!iommu->domain)
> > + return -EFAULT;
>
> ENOMEM?
Yeah, probably more appropriate.
> > +
> > + list_for_each(pos, &iommu->group_list) {
> > + struct vfio_group *group;
> > + group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > + ret = __vfio_iommu_attach_group(iommu, group);
> > + if (ret) {
> > + __vfio_close_iommu(iommu);
> > + return ret;
> > + }
> > + }
> > +
> > + if (!allow_unsafe_intrs &&
> > + !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> > + __vfio_close_iommu(iommu);
> > + return -EFAULT;
> > + }
> > +
> > + iommu->cache = (iommu_domain_has_cap(iommu->domain,
> > + IOMMU_CAP_CACHE_COHERENCY) != 0);
> > + iommu->mm = current->mm;
> > +
> > + return 0;
> > +}
> > +
> > +/* Actively try to tear down the iommu and merged groups. If there are no
> > + * open iommu or device fds, we close the iommu. If we close the iommu and
> > + * there are also no open group fds, we can further dissolve the group to
> > + * iommu association and free the iommu data structure. */
> > +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
> > +{
> > +
> > + if (__vfio_iommu_inuse(iommu))
> > + return -EBUSY;
> > +
> > + __vfio_close_iommu(iommu);
> > +
> > + if (!__vfio_iommu_groups_inuse(iommu)) {
> > + struct list_head *pos, *ppos;
> > +
> > + list_for_each_safe(pos, ppos, &iommu->group_list) {
> > + struct vfio_group *group;
> > +
> > + group = list_entry(pos, struct vfio_group, iommu_next);
> > + __vfio_group_set_iommu(group, NULL);
> > + }
> > +
> > +
> > + kfree(iommu);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static struct vfio_device *__vfio_lookup_dev(struct device *dev)
> > +{
> > + struct list_head *gpos;
> > + unsigned int groupid;
> > +
> > + if (iommu_device_group(dev, &groupid))
>
> Hmm, where is this defined? v3.2-rc1 does not seem to have it?
From the patch header:
Fingers crossed, this is the last RFC for VFIO, but we need
the iommu group support before this can go upstream
(http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
hoping this helps push that along.
That's the one bit keeping me from doing a non-RFC of the core, besides
fixing all these comments ;)
> > + return NULL;
> > +
> > + list_for_each(gpos, &vfio.group_list) {
> > + struct vfio_group *group;
> > + struct list_head *dpos;
> > +
> > + group = list_entry(gpos, struct vfio_group, group_next);
> > +
> > + if (group->groupid != groupid)
> > + continue;
> > +
> > + list_for_each(dpos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(dpos,
> > + struct vfio_device, device_next);
> > +
> > + if (device->dev == dev)
> > + return device;
> > + }
> > + }
> > + return NULL;
> > +}
> > +
> > +/* All release paths simply decrement the refcnt, attempt to teardown
> > + * the iommu and merged groups, and wakeup anything that might be
> > + * waiting if we successfully dissolve anything. */
> > +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
> > +{
> > + bool wake;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + (*refcnt)--;
> > + wake = (__vfio_try_dissolve_iommu(iommu) == 0);
> > +
> > + mutex_unlock(&vfio.lock);
> > +
> > + if (wake)
> > + wake_up(&vfio.release_q);
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Device fops - passthrough to vfio device driver w/ device_data
> > + */
> > +static int vfio_device_release(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + vfio_do_release(&device->refcnt, device->iommu);
> > +
> > + device->ops->put(device->device_data);
> > +
> > + return 0;
> > +}
> > +
> > +static long vfio_device_unl_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + return device->ops->ioctl(device->device_data, cmd, arg);
> > +}
> > +
> > +static ssize_t vfio_device_read(struct file *filep, char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + return device->ops->read(device->device_data, buf, count, ppos);
> > +}
> > +
> > +static ssize_t vfio_device_write(struct file *filep, const char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + return device->ops->write(device->device_data, buf, count, ppos);
> > +}
> > +
> > +static int vfio_device_mmap(struct file *filep, struct vm_area_struct *vma)
> > +{
> > + struct vfio_device *device = filep->private_data;
> > +
> > + return device->ops->mmap(device->device_data, vma);
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_device_compat_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + arg = (unsigned long)compat_ptr(arg);
> > + return vfio_device_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif /* CONFIG_COMPAT */
> > +
> > +const struct file_operations vfio_device_fops = {
> > + .owner = THIS_MODULE,
> > + .release = vfio_device_release,
> > + .read = vfio_device_read,
> > + .write = vfio_device_write,
> > + .unlocked_ioctl = vfio_device_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > + .compat_ioctl = vfio_device_compat_ioctl,
> > +#endif
> > + .mmap = vfio_device_mmap,
> > +};
> > +
> > +/*
> > + * Group fops
> > + */
> > +static int vfio_group_open(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_group *group;
> > + int ret = 0;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + group = idr_find(&vfio.idr, iminor(inode));
> > +
> > + if (!group) {
> > + ret = -ENODEV;
> > + goto out;
> > + }
> > +
> > + filep->private_data = group;
> > +
> > + if (!group->iommu) {
> > + struct vfio_iommu *iommu;
> > +
> > + iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > + if (!iommu) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + INIT_LIST_HEAD(&iommu->group_list);
> > + INIT_LIST_HEAD(&iommu->dm_list);
> > + mutex_init(&iommu->dgate);
> > + iommu->bus = group->bus;
> > + __vfio_group_set_iommu(group, iommu);
> > + }
> > + group->refcnt++;
> > +
> > +out:
> > + mutex_unlock(&vfio.lock);
> > +
> > + return ret;
> > +}
> > +
> > +static int vfio_group_release(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_group *group = filep->private_data;
> > +
> > + return vfio_do_release(&group->refcnt, group->iommu);
> > +}
> > +
> > +/* Attempt to merge the group pointed to by fd into group. The merge-ee
> > + * group must not have an iommu or any devices open because we cannot
> > + * maintain that context across the merge. The merge-er group can be
> > + * in use. */
> > +static int vfio_group_merge(struct vfio_group *group, int fd)
> > +{
> > + struct vfio_group *new;
> > + struct vfio_iommu *old_iommu;
> > + struct file *file;
> > + int ret = 0;
> > + bool opened = false;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + file = fget(fd);
> > + if (!file) {
> > + ret = -EBADF;
> > + goto out_noput;
> > + }
> > +
> > + /* Sanity check, is this really our fd? */
> > + if (file->f_op != &vfio_group_fops) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + new = file->private_data;
> > +
> > + if (!new || new == group || !new->iommu ||
> > + new->iommu->domain || new->bus != group->bus) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* We need to attach all the devices to each domain separately
> > + * in order to validate that the capabilities match for both. */
> > + ret = __vfio_open_iommu(new->iommu);
> > + if (ret)
> > + goto out;
> > +
> > + if (!group->iommu->domain) {
> > + ret = __vfio_open_iommu(group->iommu);
> > + if (ret)
> > + goto out;
> > + opened = true;
> > + }
> > +
> > + /* If cache coherency doesn't match we'd potentialy need to
> > + * remap existing iommu mappings in the merge-er domain.
> > + * Poor return to bother trying to allow this currently. */
> > + if (iommu_domain_has_cap(group->iommu->domain,
> > + IOMMU_CAP_CACHE_COHERENCY) !=
> > + iommu_domain_has_cap(new->iommu->domain,
> > + IOMMU_CAP_CACHE_COHERENCY)) {
> > + __vfio_close_iommu(new->iommu);
> > + if (opened)
> > + __vfio_close_iommu(group->iommu);
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* Close the iommu for the merge-ee and attach all its devices
> > + * to the merge-er iommu. */
> > + __vfio_close_iommu(new->iommu);
> > +
> > + ret = __vfio_iommu_attach_group(group->iommu, new);
> > + if (ret)
> > + goto out;
> > +
> > + /* set_iommu unlinks new from the iommu, so save a pointer to it */
> > + old_iommu = new->iommu;
> > + __vfio_group_set_iommu(new, group->iommu);
> > + kfree(old_iommu);
> > +
> > +out:
> > + fput(file);
> > +out_noput:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Unmerge the group pointed to by fd from group. */
> > +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> > +{
> > + struct vfio_group *new;
> > + struct vfio_iommu *new_iommu;
> > + struct file *file;
> > + int ret = 0;
> > +
> > + /* Since the merge-out group is already opened, it needs to
> > + * have an iommu struct associated with it. */
> > + new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> > + if (!new_iommu)
> > + return -ENOMEM;
> > +
> > + INIT_LIST_HEAD(&new_iommu->group_list);
> > + INIT_LIST_HEAD(&new_iommu->dm_list);
> > + mutex_init(&new_iommu->dgate);
> > + new_iommu->bus = group->bus;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + file = fget(fd);
> > + if (!file) {
> > + ret = -EBADF;
> > + goto out_noput;
> > + }
> > +
> > + /* Sanity check, is this really our fd? */
> > + if (file->f_op != &vfio_group_fops) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + new = file->private_data;
> > + if (!new || new == group || new->iommu != group->iommu) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* We can't merge-out a group with devices still in use. */
> > + if (__vfio_group_devs_inuse(new)) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > +
> > + __vfio_iommu_detach_group(group->iommu, new);
> > + __vfio_group_set_iommu(new, new_iommu);
> > +
> > +out:
> > + fput(file);
> > +out_noput:
> > + if (ret)
> > + kfree(new_iommu);
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Get a new iommu file descriptor. This will open the iommu, setting
> > + * the current->mm ownership if it's not already set. */
> > +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> > +{
> > + int ret = 0;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + if (!group->iommu->domain) {
> > + ret = __vfio_open_iommu(group->iommu);
> > + if (ret)
> > + goto out;
> > + }
> > +
> > + ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> > + group->iommu, O_RDWR);
> > + if (ret < 0)
> > + goto out;
> > +
> > + group->iommu->refcnt++;
> > +out:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Get a new device file descriptor. This will open the iommu, setting
> > + * the current->mm ownership if it's not already set. It's difficult to
> > + * specify the requirements for matching a user supplied buffer to a
> > + * device, so we use a vfio driver callback to test for a match. For
> > + * PCI, dev_name(dev) is unique, but other drivers may require including
> > + * a parent device string. */
> > +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> > +{
> > + struct vfio_iommu *iommu = group->iommu;
> > + struct list_head *gpos;
> > + int ret = -ENODEV;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + if (!iommu->domain) {
> > + ret = __vfio_open_iommu(iommu);
> > + if (ret)
> > + goto out;
> > + }
> > +
> > + list_for_each(gpos, &iommu->group_list) {
> > + struct list_head *dpos;
> > +
> > + group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > + list_for_each(dpos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(dpos,
> > + struct vfio_device, device_next);
> > +
> > + if (device->ops->match(device->dev, buf)) {
> > + struct file *file;
> > +
> > + if (device->ops->get(device->device_data)) {
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > +
> > + /* We can't use anon_inode_getfd(), like above
> > + * because we need to modify the f_mode flags
> > + * directly to allow more than just ioctls */
> > + ret = get_unused_fd();
> > + if (ret < 0) {
> > + device->ops->put(device->device_data);
> > + goto out;
> > + }
> > +
> > + file = anon_inode_getfile("[vfio-device]",
> > + &vfio_device_fops,
> > + device, O_RDWR);
> > + if (IS_ERR(file)) {
> > + put_unused_fd(ret);
> > + ret = PTR_ERR(file);
> > + device->ops->put(device->device_data);
> > + goto out;
> > + }
> > +
> > + /* Todo: add an anon_inode interface to do
> > + * this. Appears to be missing by lack of
> > + * need rather than explicitly prevented.
> > + * Now there's need. */
> > + file->f_mode |= (FMODE_LSEEK |
> > + FMODE_PREAD |
> > + FMODE_PWRITE);
> > +
> > + fd_install(ret, file);
> > +
> > + device->refcnt++;
> > + goto out;
> > + }
> > + }
> > + }
> > +out:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +static long vfio_group_unl_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + struct vfio_group *group = filep->private_data;
> > +
> > + if (cmd == VFIO_GROUP_GET_FLAGS) {
> > + u64 flags = 0;
> > +
> > + mutex_lock(&vfio.lock);
> > + if (__vfio_iommu_viable(group->iommu))
> > + flags |= VFIO_GROUP_FLAGS_VIABLE;
> > + mutex_unlock(&vfio.lock);
> > +
> > + if (group->iommu->mm)
> > + flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
> > +
> > + return put_user(flags, (u64 __user *)arg);
> > + }
> > +
> > + /* Below commands are restricted once the mm is set */
> > + if (group->iommu->mm && group->iommu->mm != current->mm)
> > + return -EPERM;
> > +
> > + if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {
> > + int fd;
> > +
> > + if (get_user(fd, (int __user *)arg))
> > + return -EFAULT;
> > + if (fd < 0)
> > + return -EINVAL;
> > +
> > + if (cmd == VFIO_GROUP_MERGE)
> > + return vfio_group_merge(group, fd);
> > + else
> > + return vfio_group_unmerge(group, fd);
> > + } else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
> > + return vfio_group_get_iommu_fd(group);
> > + } else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
> > + char *buf;
> > + int ret;
> > +
> > + buf = strndup_user((const char __user *)arg, PAGE_SIZE);
> > + if (IS_ERR(buf))
> > + return PTR_ERR(buf);
> > +
> > + ret = vfio_group_get_device_fd(group, buf);
> > + kfree(buf);
> > + return ret;
> > + }
> > +
> > + return -ENOSYS;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_group_compat_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + arg = (unsigned long)compat_ptr(arg);
> > + return vfio_group_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif /* CONFIG_COMPAT */
> > +
> > +static const struct file_operations vfio_group_fops = {
> > + .owner = THIS_MODULE,
> > + .open = vfio_group_open,
> > + .release = vfio_group_release,
> > + .unlocked_ioctl = vfio_group_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > + .compat_ioctl = vfio_group_compat_ioctl,
> > +#endif
> > +};
> > +
> > +/* iommu fd release hook */
> > +int vfio_release_iommu(struct vfio_iommu *iommu)
> > +{
> > + return vfio_do_release(&iommu->refcnt, iommu);
> > +}
> > +
> > +/*
> > + * VFIO driver API
> > + */
> > +
> > +/* Add a new device to the vfio framework with associated vfio driver
> > + * callbacks. This is the entry point for vfio drivers to register devices. */
> > +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> > +{
> > + struct list_head *pos;
> > + struct vfio_group *group = NULL;
> > + struct vfio_device *device = NULL;
> > + unsigned int groupid;
> > + int ret = 0;
> > + bool new_group = false;
> > +
> > + if (!ops)
> > + return -EINVAL;
> > +
> > + if (iommu_device_group(dev, &groupid))
> > + return -ENODEV;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + list_for_each(pos, &vfio.group_list) {
> > + group = list_entry(pos, struct vfio_group, group_next);
> > + if (group->groupid == groupid)
> > + break;
> > + group = NULL;
> > + }
> > +
> > + if (!group) {
> > + int minor;
> > +
> > + if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + group = kzalloc(sizeof(*group), GFP_KERNEL);
> > + if (!group) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + group->groupid = groupid;
> > + INIT_LIST_HEAD(&group->device_list);
> > +
> > + ret = idr_get_new(&vfio.idr, group, &minor);
> > + if (ret == 0 && minor > MINORMASK) {
> > + idr_remove(&vfio.idr, minor);
> > + kfree(group);
> > + ret = -ENOSPC;
> > + goto out;
> > + }
> > +
> > + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > + device_create(vfio.class, NULL, group->devt,
> > + group, "%u", groupid);
> > +
> > + group->bus = dev->bus;
>
>
> Oh, so that is how the IOMMU iommu_ops get copied! You might
> want to mention that - I was not sure where the 'handoff' is
> was done to insert a device so that it can do iommu_ops properly.
>
> Ok, so the time when a device is detected whether it can do
> IOMMU is when we try to open it - as that is when iommu_domain_alloc
> is called which can return NULL if the iommu_ops is not set.
>
> So what about devices that don't have an iommu_ops? Say they
> are using SWIOTLB? (like the AMD-Vi sometimes does if the
> device is not on its list).
>
> Can we use iommu_present?
I'm not sure I'm following your revelation ;) Take a look at the
pointer to iommu_device_group I pasted above, or these:
https://github.com/awilliam/linux-vfio/commit/37dd08c90d149caaed7779d4f38850a8f7ed0fa5
https://github.com/awilliam/linux-vfio/commit/63ca8543533d8130db23d7949133e548c3891c97
https://github.com/awilliam/linux-vfio/commit/8d7d70eb8e714fbf8710848a06f8cab0c741631e
That call includes an iommu_present() check, so if there's no iommu or
the iommu can't provide a groupid, the device is skipped over by vfio
(can't be used).
So the ordering is:
- bus driver registers device
- if it has an iommu group, add it to the vfio device/group tracking
- group gets opened
- user gets iommu or device fd results in iommu_domain_alloc
Devices without iommu_ops don't get to play in the vfio world.
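Roughly, the helper in those commits looks like this (paraphrasing from the
linked tree rather than copying; 3.2-era IOMMU API):

int iommu_device_group(struct device *dev, unsigned int *groupid)
{
        /* No iommu_ops registered for this bus (bus_set_iommu was never
         * called) or no grouping support -> vfio never sees the device. */
        if (iommu_present(dev->bus) && dev->bus->iommu_ops->device_group)
                return dev->bus->iommu_ops->device_group(dev, groupid);

        return -ENODEV;
}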
> > + list_add(&group->group_next, &vfio.group_list);
> > + new_group = true;
> > + } else {
> > + if (group->bus != dev->bus) {
> > + printk(KERN_WARNING
> > + "Error: IOMMU group ID conflict. Group ID %u "
> > + "on both bus %s and %s\n", groupid,
> > + group->bus->name, dev->bus->name);
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > +
> > + list_for_each(pos, &group->device_list) {
> > + device = list_entry(pos,
> > + struct vfio_device, device_next);
> > + if (device->dev == dev)
> > + break;
> > + device = NULL;
> > + }
> > + }
> > +
> > + if (!device) {
> > + if (__vfio_group_devs_inuse(group) ||
> > + (group->iommu && group->iommu->refcnt)) {
> > + printk(KERN_WARNING
> > + "Adding device %s to group %u while group is already in use!!\n",
> > + dev_name(dev), group->groupid);
> > + /* XXX How to prevent other drivers from claiming? */
> > + }
> > +
> > + device = kzalloc(sizeof(*device), GFP_KERNEL);
> > + if (!device) {
> > + /* If we just created this group, tear it down */
> > + if (new_group) {
> > + list_del(&group->group_next);
> > + device_destroy(vfio.class, group->devt);
> > + idr_remove(&vfio.idr, MINOR(group->devt));
> > + kfree(group);
> > + }
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + list_add(&device->device_next, &group->device_list);
> > + device->dev = dev;
> > + device->ops = ops;
> > + device->iommu = group->iommu; /* NULL if new */
> > + __vfio_iommu_attach_dev(group->iommu, device);
> > + }
> > +out:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_group_add_dev);
> > +
> > +/* Remove a device from the vfio framework */
> > +void vfio_group_del_dev(struct device *dev)
> > +{
> > + struct list_head *pos;
> > + struct vfio_group *group = NULL;
> > + struct vfio_device *device = NULL;
> > + unsigned int groupid;
> > +
> > + if (iommu_device_group(dev, &groupid))
> > + return;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + list_for_each(pos, &vfio.group_list) {
> > + group = list_entry(pos, struct vfio_group, group_next);
> > + if (group->groupid == groupid)
> > + break;
> > + group = NULL;
> > + }
> > +
> > + if (!group)
> > + goto out;
> > +
> > + list_for_each(pos, &group->device_list) {
> > + device = list_entry(pos, struct vfio_device, device_next);
> > + if (device->dev == dev)
> > + break;
> > + device = NULL;
> > + }
> > +
> > + if (!device)
> > + goto out;
> > +
> > + BUG_ON(device->refcnt);
> > +
> > + if (device->attached)
> > + __vfio_iommu_detach_dev(group->iommu, device);
> > +
> > + list_del(&device->device_next);
> > + kfree(device);
> > +
> > + /* If this was the only device in the group, remove the group.
> > + * Note that we intentionally unmerge empty groups here if the
> > + * group fd isn't opened. */
> > + if (list_empty(&group->device_list) && group->refcnt == 0) {
> > + struct vfio_iommu *iommu = group->iommu;
> > +
> > + if (iommu) {
> > + __vfio_group_set_iommu(group, NULL);
> > + __vfio_try_dissolve_iommu(iommu);
> > + }
> > +
> > + device_destroy(vfio.class, group->devt);
> > + idr_remove(&vfio.idr, MINOR(group->devt));
> > + list_del(&group->group_next);
> > + kfree(group);
> > + }
> > +out:
> > + mutex_unlock(&vfio.lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_group_del_dev);
> > +
> > +/* When a device is bound to a vfio device driver (ex. vfio-pci), this
> > + * entry point is used to mark the device usable (viable). The vfio
> > + * device driver associates a private device_data struct with the device
> > + * here, which will later be return for vfio_device_fops callbacks. */
> > +int vfio_bind_dev(struct device *dev, void *device_data)
> > +{
> > + struct vfio_device *device;
> > + int ret = -EINVAL;
> > +
> > + BUG_ON(!device_data);
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + device = __vfio_lookup_dev(dev);
> > +
> > + BUG_ON(!device);
> > +
> > + ret = dev_set_drvdata(dev, device);
> > + if (!ret)
> > + device->device_data = device_data;
> > +
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_bind_dev);
> > +
> > +/* A device is only removeable if the iommu for the group is not in use. */
> > +static bool vfio_device_removeable(struct vfio_device *device)
> > +{
> > + bool ret = true;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + if (device->iommu && __vfio_iommu_inuse(device->iommu))
> > + ret = false;
> > +
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Notify vfio that a device is being unbound from the vfio device driver
> > + * and return the device private device_data pointer. If the group is
> > + * in use, we need to block or take other measures to make it safe for
> > + * the device to be removed from the iommu. */
> > +void *vfio_unbind_dev(struct device *dev)
> > +{
> > + struct vfio_device *device = dev_get_drvdata(dev);
> > + void *device_data;
> > +
> > + BUG_ON(!device);
> > +
> > +again:
> > + if (!vfio_device_removeable(device)) {
> > + /* XXX signal for all devices in group to be removed or
> > + * resort to killing the process holding the device fds.
> > + * For now just block waiting for releases to wake us. */
> > + wait_event(vfio.release_q, vfio_device_removeable(device));
> > + }
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + /* Need to re-check that the device is still removeable under lock. */
> > + if (device->iommu && __vfio_iommu_inuse(device->iommu)) {
> > + mutex_unlock(&vfio.lock);
> > + goto again;
> > + }
> > +
> > + device_data = device->device_data;
> > +
> > + device->device_data = NULL;
> > + dev_set_drvdata(dev, NULL);
> > +
> > + mutex_unlock(&vfio.lock);
> > + return device_data;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_unbind_dev);
> > +
> > +/*
> > + * Module/class support
> > + */
> > +static void vfio_class_release(struct kref *kref)
> > +{
> > + class_destroy(vfio.class);
> > + vfio.class = NULL;
> > +}
> > +
> > +static char *vfio_devnode(struct device *dev, mode_t *mode)
> > +{
> > + return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
> > +}
> > +
> > +static int __init vfio_init(void)
> > +{
> > + int ret;
> > +
> > + idr_init(&vfio.idr);
> > + mutex_init(&vfio.lock);
> > + INIT_LIST_HEAD(&vfio.group_list);
> > + init_waitqueue_head(&vfio.release_q);
> > +
> > + kref_init(&vfio.kref);
> > + vfio.class = class_create(THIS_MODULE, "vfio");
> > + if (IS_ERR(vfio.class)) {
> > + ret = PTR_ERR(vfio.class);
> > + goto err_class;
> > + }
> > +
> > + vfio.class->devnode = vfio_devnode;
> > +
> > + /* FIXME - how many minors to allocate... all of them! */
> > + ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
> > + if (ret)
> > + goto err_chrdev;
> > +
> > + cdev_init(&vfio.cdev, &vfio_group_fops);
> > + ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK);
> > + if (ret)
> > + goto err_cdev;
> > +
> > + pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> > +
> > + return 0;
> > +
> > +err_cdev:
> > + unregister_chrdev_region(vfio.devt, MINORMASK);
> > +err_chrdev:
> > + kref_put(&vfio.kref, vfio_class_release);
> > +err_class:
> > + return ret;
> > +}
> > +
> > +static void __exit vfio_cleanup(void)
> > +{
> > + struct list_head *gpos, *gppos;
> > +
> > + list_for_each_safe(gpos, gppos, &vfio.group_list) {
> > + struct vfio_group *group;
> > + struct list_head *dpos, *dppos;
> > +
> > + group = list_entry(gpos, struct vfio_group, group_next);
> > +
> > + list_for_each_safe(dpos, dppos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(dpos,
> > + struct vfio_device, device_next);
> > + vfio_group_del_dev(device->dev);
> > + }
> > + }
> > +
> > + idr_destroy(&vfio.idr);
> > + cdev_del(&vfio.cdev);
> > + unregister_chrdev_region(vfio.devt, MINORMASK);
> > + kref_put(&vfio.kref, vfio_class_release);
> > +}
> > +
> > +module_init(vfio_init);
> > +module_exit(vfio_cleanup);
> > +
> > +MODULE_VERSION(DRIVER_VERSION);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR(DRIVER_AUTHOR);
> > +MODULE_DESCRIPTION(DRIVER_DESC);
> > diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
> > new file mode 100644
> > index 0000000..350ad67
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_private.h
> > @@ -0,0 +1,34 @@
> > +/*
> > + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> > + * Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#include <linux/list.h>
> > +#include <linux/mutex.h>
> > +
> > +#ifndef VFIO_PRIVATE_H
> > +#define VFIO_PRIVATE_H
> > +
> > +struct vfio_iommu {
> > + struct iommu_domain *domain;
> > + struct bus_type *bus;
> > + struct mutex dgate;
> > + struct list_head dm_list;
> > + struct mm_struct *mm;
> > + struct list_head group_list;
> > + int refcnt;
> > + bool cache;
> > +};
> > +
> > +extern int vfio_release_iommu(struct vfio_iommu *iommu);
> > +extern void vfio_iommu_unmapall(struct vfio_iommu *iommu);
> > +
> > +#endif /* VFIO_PRIVATE_H */
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > new file mode 100644
> > index 0000000..4269b08
> > --- /dev/null
> > +++ b/include/linux/vfio.h
> > @@ -0,0 +1,155 @@
> > +/*
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + *
> > + * This program is free software; you may redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; version 2 of the License.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + * Portions derived from drivers/uio/uio.c:
> > + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> > + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> > + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> > + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> > + *
> > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > + * Copyright (C) 2009 Red Hat, Inc.
> > + * Author: Michael S. Tsirkin <mst@redhat.com>
> > + */
> > +#include <linux/types.h>
> > +
> > +#ifndef VFIO_H
> > +#define VFIO_H
> > +
> > +#ifdef __KERNEL__
> > +
> > +struct vfio_device_ops {
> > + bool (*match)(struct device *, char *);
> > + int (*get)(void *);
> > + void (*put)(void *);
> > + ssize_t (*read)(void *, char __user *,
> > + size_t, loff_t *);
> > + ssize_t (*write)(void *, const char __user *,
> > + size_t, loff_t *);
> > + long (*ioctl)(void *, unsigned int, unsigned long);
> > + int (*mmap)(void *, struct vm_area_struct *);
> > +};
> > +
> > +extern int vfio_group_add_dev(struct device *device,
> > + const struct vfio_device_ops *ops);
> > +extern void vfio_group_del_dev(struct device *device);
> > +extern int vfio_bind_dev(struct device *device, void *device_data);
> > +extern void *vfio_unbind_dev(struct device *device);
> > +
> > +#endif /* __KERNEL__ */
> > +
> > +/*
> > + * VFIO driver - allow mapping and use of certain devices
> > + * in unprivileged user processes. (If IOMMU is present)
> > + * Especially useful for Virtual Function parts of SR-IOV devices
> > + */
> > +
> > +
> > +/* Kernel & User level defines for ioctls */
> > +
> > +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
>
> > + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> > +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> > +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> > +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> > +
> > +/*
> > + * Structure for DMA mapping of user buffers
> > + * vaddr, dmaaddr, and size must all be page aligned
> > + */
> > +struct vfio_dma_map {
> > + __u64 len; /* length of structure */
> > + __u64 vaddr; /* process virtual addr */
> > + __u64 dmaaddr; /* desired and/or returned dma address */
> > + __u64 size; /* size in bytes */
> > + __u64 flags;
> > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> > +};
> > +
> > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> > + /* Does the IOMMU support mapping any IOVA to any virtual address? */
> > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
> > +
> > +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> > + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> > + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
> > + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> > +
> > +struct vfio_region_info {
> > + __u32 len; /* length of structure */
> > + __u32 index; /* region number */
> > + __u64 size; /* size in bytes of region */
> > + __u64 offset; /* start offset of region */
> > + __u64 flags;
> > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> > + __u64 phys; /* physical address of region */
> > +};
> > +
> > +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> > +
> > +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> > +
> > +struct vfio_irq_info {
> > + __u32 len; /* length of structure */
> > + __u32 index; /* IRQ number */
> > + __u32 count; /* number of individual IRQs */
> > + __u32 flags;
> > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> > +};
> > +
> > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> > +
> > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> > +
> > +/* Unmask IRQ index, arg[0] = index */
> > +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
> > +
> > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> > +
> > +#define VFIO_DEVICE_RESET _IO(';', 116)
> > +
> > +struct vfio_dtpath {
> > + __u32 len; /* length of structure */
> > + __u32 index;
> > + __u64 flags;
> > +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> > +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> > + char *path;
> > +};
> > +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> > +
> > +struct vfio_dtindex {
> > + __u32 len; /* length of structure */
> > + __u32 index;
> > + __u32 prop_type;
> > + __u32 prop_index;
> > + __u64 flags;
> > +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> > +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> > +};
> > +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> > +
> > +#endif /* VFIO_H */
>
>
> So where is the vfio-pci? Is that a seperate posting?
You can find it in the tree pointed to in the patch description:
https://github.com/awilliam/linux-vfio/commit/534725d327e2b7791a229ce72d2ae8a62ee0a4e5
I was hoping to get some consensus around the new core before spending
too much time polishing up the bus driver. Thanks for the review, it's
very much appreciated!
Alex
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-11 22:10 ` Alex Williamson
@ 2011-11-15 0:00 ` David Gibson
2011-11-16 16:52 ` Konrad Rzeszutek Wilk
2011-11-16 17:47 ` Scott Wood
2 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2011-11-15 0:00 UTC (permalink / raw)
To: Alex Williamson
Cc: Konrad Rzeszutek Wilk, chrisw, aik, pmac, joerg.roedel, agraf,
benve, aafabbri, B08248, B07421, avi, kvm, qemu-devel, iommu,
linux-pci
On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
> Thanks Konrad! Comments inline.
> On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
[snip]
> > > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > > +
> > > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> >
> > Don't want __u32?
>
> It could be, not sure if it buys us anything maybe even restricts us.
> We likely don't need 2^32 regions (famous last words?), so we could
> later define <0 to something?
As a rule, it's best to use explicit fixed width types for all ioctl()
arguments, to avoid compat hell for 32-bit userland on 64-bit kernel
setups.
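Purely illustrative, not a proposed layout: with explicit __u32/__u64 fields
and explicit padding, the argument has the same size and alignment for 32-bit
and 64-bit userland, so no compat translation is needed for it.

struct vfio_num_regions {       /* hypothetical example, not in the patch */
        __u32   len;            /* length of structure */
        __u32   flags;
        __u32   num_regions;
        __u32   pad;            /* explicit padding, no implicit holes */
};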
[snip]
> > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > +type to index mapping).
> >
> > I am not really sure what that means.
>
> This is so PCI can expose:
>
> enum {
> VFIO_PCI_INTX_IRQ_INDEX,
> VFIO_PCI_MSI_IRQ_INDEX,
> VFIO_PCI_MSIX_IRQ_INDEX,
> VFIO_PCI_NUM_IRQS
> };
>
> So like regions it always exposes 3 IRQ indexes where count=0 if the
> device doesn't actually support that type of interrupt. I just want to
> spell out that bus drivers have this kind of flexibility.
I knew what you were aiming for, so I could see what you meant here,
but I don't think the doco is very clearly expressed at all.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-11 22:10 ` Alex Williamson
2011-11-15 0:00 ` David Gibson
@ 2011-11-16 16:52 ` Konrad Rzeszutek Wilk
2011-11-17 20:22 ` Alex Williamson
2011-11-16 17:47 ` Scott Wood
2 siblings, 1 reply; 62+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-16 16:52 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, aafabbri,
B08248, B07421, avi, kvm, qemu-devel, iommu, linux-pci
On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
>
> Thanks Konrad! Comments inline.
>
> On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> > > VFIO provides a secure, IOMMU based interface for user space
> > > drivers, including device assignment to virtual machines.
> > > This provides the base management of IOMMU groups, devices,
> > > and IOMMU objects. See Documentation/vfio.txt included in
> > > this patch for user and kernel API description.
> > >
> > > Note, this implements the new API discussed at KVM Forum
> > > 2011, as represented by the drvier version 0.2. It's hoped
> > > that this provides a modular enough interface to support PCI
> > > and non-PCI userspace drivers across various architectures
> > > and IOMMU implementations.
> > >
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >
> > > Fingers crossed, this is the last RFC for VFIO, but we need
> > > the iommu group support before this can go upstream
> > > (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> > > hoping this helps push that along.
> > >
> > > Since the last posting, this version completely modularizes
> > > the device backends and better defines the APIs between the
> > > core VFIO code and the device backends. I expect that we
> > > might also adopt a modular IOMMU interface as iommu_ops learns
> > > about different types of hardware. Also many, many cleanups.
> > > Check the complete git history for details:
> > >
> > > git://github.com/awilliam/linux-vfio.git vfio-ng
> > >
> > > (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
> > >
> > > This version, along with the supporting VFIO PCI backend can
> > > be found here:
> > >
> > > git://github.com/awilliam/linux-vfio.git vfio-next-20111103
> > >
> > > I've held off on implementing a kernel->user signaling
> > > mechanism for now since the previous netlink version produced
> > > too many gag reflexes. It's easy enough to set a bit in the
> > > group flags too indicate such support in the future, so I
> > > think we can move ahead without it.
> > >
> > > Appreciate any feedback or suggestions. Thanks,
> > >
> > > Alex
> > >
> > > Documentation/ioctl/ioctl-number.txt | 1
> > > Documentation/vfio.txt | 304 +++++++++
> > > MAINTAINERS | 8
> > > drivers/Kconfig | 2
> > > drivers/Makefile | 1
> > > drivers/vfio/Kconfig | 8
> > > drivers/vfio/Makefile | 3
> > > drivers/vfio/vfio_iommu.c | 530 ++++++++++++++++
> > > drivers/vfio/vfio_main.c | 1151 ++++++++++++++++++++++++++++++++++
> > > drivers/vfio/vfio_private.h | 34 +
> > > include/linux/vfio.h | 155 +++++
> > > 11 files changed, 2197 insertions(+), 0 deletions(-)
> > > create mode 100644 Documentation/vfio.txt
> > > create mode 100644 drivers/vfio/Kconfig
> > > create mode 100644 drivers/vfio/Makefile
> > > create mode 100644 drivers/vfio/vfio_iommu.c
> > > create mode 100644 drivers/vfio/vfio_main.c
> > > create mode 100644 drivers/vfio/vfio_private.h
> > > create mode 100644 include/linux/vfio.h
> > >
> > > diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> > > index 54078ed..59d01e4 100644
> > > --- a/Documentation/ioctl/ioctl-number.txt
> > > +++ b/Documentation/ioctl/ioctl-number.txt
> > > @@ -88,6 +88,7 @@ Code Seq#(hex) Include File Comments
> > > and kernel/power/user.c
> > > '8' all SNP8023 advanced NIC card
> > > <mailto:mcr@solidum.com>
> > > +';' 64-76 linux/vfio.h
> > > '@' 00-0F linux/radeonfb.h conflict!
> > > '@' 00-0F drivers/video/aty/aty128fb.c conflict!
> > > 'A' 00-1F linux/apm_bios.h conflict!
> > > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > > new file mode 100644
> > > index 0000000..5866896
> > > --- /dev/null
> > > +++ b/Documentation/vfio.txt
> > > @@ -0,0 +1,304 @@
> > > +VFIO - "Virtual Function I/O"[1]
> > > +-------------------------------------------------------------------------------
> > > +Many modern system now provide DMA and interrupt remapping facilities
> > > +to help ensure I/O devices behave within the boundaries they've been
> > > +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as
> > > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > > +embedded powerpc systems (technology name unknown). The VFIO driver
> > > +is an IOMMU/device agnostic framework for exposing direct device
> > > +access to userspace, in a secure, IOMMU protected environment. In
> > > +other words, this allows safe, non-privileged, userspace drivers.
> > > +
> > > +Why do we want that? Virtual machines often make use of direct device
> > > +access ("device assignment") when configured for the highest possible
> > > +I/O performance. From a device and host perspective, this simply turns
> > > +the VM into a userspace driver, with the benefits of significantly
> > > +reduced latency, higher bandwidth, and direct use of bare-metal device
> > > +drivers[2].
> >
> > Are there any constraints of running a 32-bit userspace with
> > a 64-bit kernel and with 32-bit user space drivers?
>
> Shouldn't be. I'll need to do some testing on that, but it was working
> on the previous generation of vfio.
<nods> ok
.. snip..
> > > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> > > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> > > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> > > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
> >
> > Coherency support is not going to be addressed right? What about sync?
> > Say you need to sync CPU to Device address?
>
> Do we need to expose that to userspace or should the underlying
> iommu_ops take care of it?
That I am not sure of. I know that the kernel drivers (especially network ones)
are riddled with:
pci_dma_sync_single_for_cpu(tp->pdev, dma_addr, len, PCI_DMA_FROMDEVICE);
skb_copy_from_linear_data(skb, copy_skb->data, len);
pci_dma_sync_single_for_device(tp->pdev, dma_addr, len, PCI_DMA_FROMDEVICE);
But I think that has come from the fact that the devices are 32-bit,
so they could not do DMA above 4GB. Hence the bounce buffer usage and
the proliferation of pci_dma_sync.. calls to copy the contents to a
bounce buffer if necessary.
But IOMMUs seem to deal with devices that can map the full gamut of memory,
so they are not constrained to 32-bit or 36-bit; rather,
they do the mapping in hardware if necessary.
So I think I just answered the question - which is: No.
.. snip..
> > > + __u64 vaddr; /* process virtual addr */
> > > + __u64 dmaaddr; /* desired and/or returned dma address */
> > > + __u64 size; /* size in bytes */
> > > + __u64 flags;
> > > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> > > +};
> > > +
> > > +Current users of VFIO use relatively static DMA mappings, not requiring
> > > +high frequency turnover. As new users are added, it's expected that the
> >
> > Is there a limit to how many DMA mappings can be created?
>
> Not that I'm aware of for the current AMD-Vi/VT-d implementations. I
> suppose iommu_ops would return -ENOSPC if it hit a limit. I added the
Not -ENOMEM? Either way, might want to mention that in this nice
document.
> VFIO_IOMMU_FLAGS_MAP_ANY flag above to try to identify that kind of
> restriction.
.. snip..
> > > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > > +
> > > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> >
> > Don't want __u32?
>
> It could be, not sure if it buys us anything maybe even restricts us.
> We likely don't need 2^32 regions (famous last words?), so we could
> later define <0 to something?
OK.
>
> > > +
> > > +Regions are described by a struct vfio_region_info, which is retrieved by
> > > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > > +the desired region (0 based index). Note that devices may implement zero
> > >
> > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > > +mapping).
> >
> > Huh?
>
> PCI has the following static mapping:
>
> enum {
> VFIO_PCI_BAR0_REGION_INDEX,
> VFIO_PCI_BAR1_REGION_INDEX,
> VFIO_PCI_BAR2_REGION_INDEX,
> VFIO_PCI_BAR3_REGION_INDEX,
> VFIO_PCI_BAR4_REGION_INDEX,
> VFIO_PCI_BAR5_REGION_INDEX,
> VFIO_PCI_ROM_REGION_INDEX,
> VFIO_PCI_CONFIG_REGION_INDEX,
> VFIO_PCI_NUM_REGIONS
> };
>
> So 8 regions are always reported regardless of whether the device
> implements all the BARs and the ROM. Then we have a fixed bar:index
> mapping so we don't have to create a region_info field to describe the
> bar number for the index.
OK. Is that a problem if the real device actually has a zero-sized BAR?
Or is a zero-sized BAR in the PCI spec equal to "disabled, not in use"? Just
wondering whether (-1ULL) should be used instead (which seems to be the case
in the QEMU code).
>
> > > +
> > > +struct vfio_region_info {
> > > + __u32 len; /* length of structure */
> > > + __u32 index; /* region number */
> > > + __u64 size; /* size in bytes of region */
> > > + __u64 offset; /* start offset of region */
> > > + __u64 flags;
> > > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> > > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> > > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> >
> > What is FLAG_MMAP? Does it mean: 1) it can be mmaped, or 2) it is mmaped?
>
> Supports mmap
>
> > FLAG_RO is pretty obvious - presumarily this is for firmware regions and such.
> > And PHYS_VALID is if the region is disabled for some reasons? If so
> > would the name FLAG_DISABLED be better?
>
> No, POWER guys have some need to report the host physical address of the
> region, so the flag indicates whether the below field is present and
> valid. I'll clarify these in the docs.
Thanks.
.. snip..
> > > +struct vfio_irq_info {
> > > + __u32 len; /* length of structure */
> > > + __u32 index; /* IRQ number */
> > > + __u32 count; /* number of individual IRQs */
> > > + __u64 flags;
> > > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> > > +};
> > > +
> > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > +type to index mapping).
> >
> > I am not really sure what that means.
>
> This is so PCI can expose:
>
> enum {
> VFIO_PCI_INTX_IRQ_INDEX,
> VFIO_PCI_MSI_IRQ_INDEX,
> VFIO_PCI_MSIX_IRQ_INDEX,
> VFIO_PCI_NUM_IRQS
> };
>
> So like regions it always exposes 3 IRQ indexes where count=0 if the
> device doesn't actually support that type of interrupt. I just want to
> spell out that bus drivers have this kind of flexibility.
I think you should change the comment that says 'IRQ number', as the
first thing that comes to mind is a 'GSI' or an MSI/MSI-X vector.
Perhaps '/* index to be used with the return value from the GET_NUM_IRQS
ioctl. Order of structures can be unsorted. */'
>
> > > +
> > > +Information about each index can be retrieved using the GET_IRQ_INFO
> > > +ioctl, used much like GET_REGION_INFO.
> > > +
> > > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> > > +
> > > +Individual indexes can describe single or sets of IRQs. This provides the
> > > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> > > +
> > > +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> > > +as shown below, are used to pass the IRQ info index, the number of eventfds,
> > > +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> > > +
> > > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> >
> > Are eventfds u64 or u32?
>
> int, they're just file descriptors
>
> > Why not just define a structure?
> > struct vfio_irq_eventfds {
> > __u32 index;
> > __u32 count;
> > __u64 eventfds[0]
> > };
>
> We could do that if preferred. Hmm, are we then going to need
> size/flags?
Sure.
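Something along these lines, maybe (just a sketch, field names not important):

struct vfio_irq_eventfds {
        __u32   len;            /* length of structure, incl. eventfds[] */
        __u32   flags;
        __u32   index;
        __u32   count;
        __s32   eventfds[0];    /* eventfds are plain file descriptors */
};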
>
> > How do you get an eventfd to feed in here?
>
> eventfd(2), in qemu event_notifier_init() -> event_notifier_get_fd()
>
> > > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> >
> > u32?
>
> Not here, it's an fd, so should be an int.
>
> > > +
> > > +When a level triggered interrupt is signaled, the interrupt is masked
> > > +on the host. This prevents an unresponsive userspace driver from
> > > +continuing to interrupt the host system. After servicing the interrupt,
> > > +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> > > +triggered interrupts implicitly have a count of 1 per index.
> >
> > So they are enabled automatically? Meaning you don't even hav to do
> > SET_IRQ_EVENTFDS b/c the count is set to 1?
>
> I suppose that should be "no more than 1 per index" (ie. PCI would
> report a count of 0 for VFIO_PCI_INTX_IRQ_INDEX if the device doesn't
> support INTx). I think you might be confusing VFIO_DEVICE_GET_IRQ_INFO
> which tells how many are available with VFIO_DEVICE_SET_IRQ_EVENTFDS
> which does the enabling/disabling. All interrupts are disabled by
> default because userspace needs to give us a way to signal them via
> eventfds. It will be device dependent whether multiple index can be
> enabled simultaneously. Hmm, is that another flag on the irq_info
> struct or do we expect drivers to implicitly have that kind of
> knowledge?
Right, that was what I was wondering. Not sure how the PowerPC
world works with this.
>
> > > +
> > > +/* Unmask IRQ index, arg[0] = index */
> > > +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
> >
> > So this is for MSI as well? So if I've an index = 1, with count = 4,
> > and doing unmaks IRQ will chip enable all the MSI event at once?
>
> No, this is only for re-enabling level triggered interrupts as discussed
> above. Edge triggered interrupts like MSI don't need an unmask... we
> may want to do something to accelerate the MSI-X table access for
> masking specific interrupts, but I figured that would need to be PCI
> aware since those are PCI features, and would therefore be some future
> extension of the PCI bus driver and exposed via VFIO_DEVICE_GET_FLAGS.
OK.
>
> > I guess there is not much point in enabling/disabling selective MSI
> > IRQs..
>
> Some older OSes are said to make extensive use of masking for MSI, so we
> probably want this at some point. I'm assuming future PCI extension for
> now.
>
> > > +
> > > +Level triggered interrupts can also be unmasked using an irqfd. Use
> >
> > irqfd or eventfd?
>
> irqfd is an eventfd in reverse. eventfd = kernel signals userspace via
> an fd, irqfd = userspace signals kernel via an fd.
Ah neat.
>
> > > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> >
> > So only level triggered? Hmm, how do I know whether the device is
> > level or edge? Or is that edge (MSI) can also be unmaked using the
> > eventfs
>
> Yes, only for level. Isn't a device going to know what type of
> interrupt it uses? MSI masking is PCI specific, not handled by this.
I certainly hope it knows, but you know buggy drivers do exist.
What would be the return value if somebody tried to unmask an edge one?
Should that be documented here? -ENOSPEC?
>
> > > +
> > > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> > > +
> > > +When supported, as indicated by the device flags, reset the device.
> > > +
> > > +#define VFIO_DEVICE_RESET _IO(';', 116)
> >
> > Does it disable the 'count'? Err, does it disable the IRQ on the
> > device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
> > to set new eventfds? Or does it re-use the eventfds and the device
> > is enabled after this?
>
> It doesn't affect the interrupt programming. Should it?
I would hope not, but I am trying to think of ways one could screw this up.
Perhaps just say something like: "No need to call VFIO_DEVICE_SET_IRQ_EVENTFDS
as the kernel (and the device) will retain the interrupt."
.. snip..
> > I am not really sure what this section purpose is? Could this be part
> > of the header file or the code? It does not look to be part of the
> > ioctl API?
>
> We've passed into the "VFIO bus driver API" section of the document, to
> explain the interaction between vfio-core and vfio bus drivers.
Perhaps a different file?
.. large snip ..
> > > +
> > > + mutex_lock(&iommu->dgate);
> > > + list_for_each_safe(pos, pos2, &iommu->dm_list) {
> > > + mlp = list_entry(pos, struct dma_map_page, list);
> > > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> >
> > Uh, so if it did not get put_page() we would try to still delete it?
> > Couldn't that lead to corruption as the 'mlp' is returned to the poll?
> >
> > Ah wait, the put_page is on the DMA page, so it is OK to
> > delete the tracking structure. It will be just a leaked page.
>
> Assume you're referencing this chunk:
>
> vfio_dma_unmap
> __vfio_dma_unmap
> ...
> pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> if (pfn) {
> iommu_unmap(iommu->domain, iova, 0);
> unlocked += put_pfn(pfn, rdwr);
> }
>
> So we skip things that aren't mapped in the iommu, but anything not
> mapped should have already been put (failed vfio_dma_map). If it is
> mapped, we put it if we originally got it via get_user_pages_fast.
> unlocked would only not get incremented here if it was an mmap'd page
> (such as the mmap of an mmio space of another vfio device), via the code
> in vaddr_get_pfn (stolen from KVM).
Yup. Sounds right.
.. snip..
> > > +module_param(allow_unsafe_intrs, int, 0);
> >
> > S_IRUGO ?
>
> I actually intended that to be S_IRUGO | S_IWUSR just like the kvm
> parameter so it can be toggled runtime.
OK.
>
> > > +MODULE_PARM_DESC(allow_unsafe_intrs,
> > > + "Allow use of IOMMUs which do not support interrupt remapping");
> > > +
> > > +static struct vfio {
> > > + dev_t devt;
> > > + struct cdev cdev;
> > > + struct list_head group_list;
> > > + struct mutex lock;
> > > + struct kref kref;
> > > + struct class *class;
> > > + struct idr idr;
> > > + wait_queue_head_t release_q;
> > > +} vfio;
> >
> > You probably want to move this below the 'vfio_group'
> > as vfio contains the vfio_group.
>
> Only via the group_list. Are you suggesting for readability or to avoid
> forward declarations (which we don't need between these two with current
> ordering).
Just for readability.
>
> > > +
> > > +static const struct file_operations vfio_group_fops;
> > > +extern const struct file_operations vfio_iommu_fops;
> > > +
> > > +struct vfio_group {
> > > + dev_t devt;
> > > + unsigned int groupid;
> > > + struct bus_type *bus;
> > > + struct vfio_iommu *iommu;
> > > + struct list_head device_list;
> > > + struct list_head iommu_next;
> > > + struct list_head group_next;
> > > + int refcnt;
> > > +};
> > > +
> > > +struct vfio_device {
> > > + struct device *dev;
> > > + const struct vfio_device_ops *ops;
> > > + struct vfio_iommu *iommu;
> > > + struct vfio_group *group;
> > > + struct list_head device_next;
> > > + bool attached;
> > > + int refcnt;
> > > + void *device_data;
> > > +};
> >
> > And perhaps move this above vfio_group. As vfio_group
> > contains a list of these structures?
>
> These are inter-linked, so chicken and egg. The current ordering is
> more function based than definition based. struct vfio is the highest
> level object, groups are next, iommus and devices are next, but we need
> to share iommus with the other file, so that lands in the header.
Ah, OK.
>
> > > +
> > > +/*
> > > + * Helper functions called under vfio.lock
> > > + */
> > > +
> > > +/* Return true if any devices within a group are opened */
> > > +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> > > +{
> > > + struct list_head *pos;
> > > +
> > > + list_for_each(pos, &group->device_list) {
> > > + struct vfio_device *device;
> > > +
> > > + device = list_entry(pos, struct vfio_device, device_next);
> > > + if (device->refcnt)
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +/* Return true if any of the groups attached to an iommu are opened.
> > > + * We can only tear apart merged groups when nothing is left open. */
> > > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> > > +{
> > > + struct list_head *pos;
> > > +
> > > + list_for_each(pos, &iommu->group_list) {
> > > + struct vfio_group *group;
> > > +
> > > + group = list_entry(pos, struct vfio_group, iommu_next);
> > > + if (group->refcnt)
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +/* An iommu is "in use" if it has a file descriptor open or if any of
> > > + * the groups assigned to the iommu have devices open. */
> > > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> > > +{
> > > + struct list_head *pos;
> > > +
> > > + if (iommu->refcnt)
> > > + return true;
> > > +
> > > + list_for_each(pos, &iommu->group_list) {
> > > + struct vfio_group *group;
> > > +
> > > + group = list_entry(pos, struct vfio_group, iommu_next);
> > > +
> > > + if (__vfio_group_devs_inuse(group))
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +static void __vfio_group_set_iommu(struct vfio_group *group,
> > > + struct vfio_iommu *iommu)
> > > +{
> > > + struct list_head *pos;
> > > +
> > > + if (group->iommu)
> > > + list_del(&group->iommu_next);
> > > + if (iommu)
> > > + list_add(&group->iommu_next, &iommu->group_list);
> > > +
> > > + group->iommu = iommu;
> > > +
> > > + list_for_each(pos, &group->device_list) {
> > > + struct vfio_device *device;
> > > +
> > > + device = list_entry(pos, struct vfio_device, device_next);
> > > + device->iommu = iommu;
> > > + }
> > > +}
> > > +
> > > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> > > + struct vfio_device *device)
> > > +{
> > > + BUG_ON(!iommu->domain && device->attached);
> >
> > Whoa. Heavy hammer there.
> >
> > Perhaps WARN_ON as you do check it later on.
>
> I think it's warranted, internal consistency is broken if we have a
> device that thinks it's attached to an iommu domain that doesn't exist.
> It should, of course, never happen and this isn't a performance path.
>
> > > +
> > > + if (!iommu->domain || !device->attached)
> > > + return;
Well, the deal is that you BUG_ON earlier, but you check for it here.
The BUG_ON will stop execution, so the 'if ...' check is actually
not needed.
> > > +
> > > + iommu_detach_device(iommu->domain, device->dev);
> > > + device->attached = false;
> > > +}
> > > +
> > > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> > > + struct vfio_group *group)
> > > +{
> > > + struct list_head *pos;
> > > +
> > > + list_for_each(pos, &group->device_list) {
> > > + struct vfio_device *device;
> > > +
> > > + device = list_entry(pos, struct vfio_device, device_next);
> > > + __vfio_iommu_detach_dev(iommu, device);
> > > + }
> > > +}
> > > +
> > > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> > > + struct vfio_device *device)
> > > +{
> > > + int ret;
> > > +
> > > + BUG_ON(device->attached);
> >
> > How about:
> >
> > WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register
> > the device again! Tell him/her to stop please.\n");
>
> I would almost demote this one to a WARN_ON, but userspace isn't in
> control of attaching and detaching devices from the iommu. That's a
> side effect of getting the iommu or device file descriptor. So again,
> this is an internal consistency check and it should never happen,
> regardless of userspace.
>
Ok, then you might want to expand it to
BUG_ON(!device || device->attached);
in case something has gone horribly wrong.
.. snip..
> > > + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > > + device_create(vfio.class, NULL, group->devt,
> > > + group, "%u", groupid);
> > > +
> > > + group->bus = dev->bus;
> >
> >
> > Oh, so that is how the IOMMU iommu_ops get copied! You might
> > want to mention that - I was not sure where the 'handoff' is
> > was done to insert a device so that it can do iommu_ops properly.
> >
> > Ok, so the time when a device is detected whether it can do
> > IOMMU is when we try to open it - as that is when iommu_domain_alloc
> > is called which can return NULL if the iommu_ops is not set.
> >
> > So what about devices that don't have an iommu_ops? Say they
> > are using SWIOTLB? (like the AMD-Vi sometimes does if the
> > device is not on its list).
> >
> > Can we use iommu_present?
>
> I'm not sure I'm following your revelation ;) Take a look at the
I am trying to figure out who sets the iommu_ops used for these devices.
> pointer to iommu_device_group I pasted above, or these:
>
> https://github.com/awilliam/linux-vfio/commit/37dd08c90d149caaed7779d4f38850a8f7ed0fa5
> https://github.com/awilliam/linux-vfio/commit/63ca8543533d8130db23d7949133e548c3891c97
> https://github.com/awilliam/linux-vfio/commit/8d7d70eb8e714fbf8710848a06f8cab0c741631e
>
> That call includes an iommu_present() check, so if there's no iommu or
> the iommu can't provide a groupid, the device is skipped over by vfio
> (can't be used).
>
> So the ordering is:
>
> - bus driver registers device
> - if it has an iommu group, add it to the vfio device/group tracking
>
> - group gets opened
> - user gets iommu or device fd results in iommu_domain_alloc
>
> Devices without iommu_ops don't get to play in the vfio world.
Right, and I think the answer to which devices get iommu_ops is that it
is done via bus_set_iommu.
(Thinking longer term about what would be required to make this work
with Xen, it sounds like I will need to implement a Xen IOMMU driver.)
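Roughly, I would expect such an IOMMU driver to do something like this at
init time (a sketch only; the callbacks are elided):

#include <linux/iommu.h>
#include <linux/pci.h>

/* A real driver fills in attach_dev, detach_dev, map, unmap,
 * device_group, etc. before registering. */
static struct iommu_ops xen_iommu_ops;

static int __init xen_iommu_init(void)
{
        bus_set_iommu(&pci_bus_type, &xen_iommu_ops);
        return 0;
}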
.. snip..
> >
> > So where is the vfio-pci? Is that a seperate posting?
>
> You can find it in the tree pointed to in the patch description:
>
> https://github.com/awilliam/linux-vfio/commit/534725d327e2b7791a229ce72d2ae8a62ee0a4e5
Thanks.
>
> I was hoping to get some consensus around the new core before spending
> too much time polishing up the bus driver. Thanks for the review, it's
> very much appreciated!
Sure thing.
>
> Alex
>
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-16 16:52 ` Konrad Rzeszutek Wilk
@ 2011-11-17 20:22 ` Alex Williamson
2011-11-17 20:56 ` Scott Wood
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-17 20:22 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, aafabbri,
B08248, B07421, avi, kvm, qemu-devel, iommu, linux-pci
On Wed, 2011-11-16 at 11:52 -0500, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
<snip>
> > > > +
> > > > +Regions are described by a struct vfio_region_info, which is retrieved by
> > > > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > > > +the desired region (0 based index). Note that devices may implement zero
> > > >
> > > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > > > +mapping).
> > >
> > > Huh?
> >
> > PCI has the following static mapping:
> >
> > enum {
> > VFIO_PCI_BAR0_REGION_INDEX,
> > VFIO_PCI_BAR1_REGION_INDEX,
> > VFIO_PCI_BAR2_REGION_INDEX,
> > VFIO_PCI_BAR3_REGION_INDEX,
> > VFIO_PCI_BAR4_REGION_INDEX,
> > VFIO_PCI_BAR5_REGION_INDEX,
> > VFIO_PCI_ROM_REGION_INDEX,
> > VFIO_PCI_CONFIG_REGION_INDEX,
> > VFIO_PCI_NUM_REGIONS
> > };
> >
> > So 8 regions are always reported regardless of whether the device
> > implements all the BARs and the ROM. Then we have a fixed bar:index
> > mapping so we don't have to create a region_info field to describe the
> > bar number for the index.
>
> OK. Is that a problem if the real device actually has a zero-sized BAR?
> Or is a zero-sized BAR in the PCI spec equal to "disabled, not in use"? Just
> wondering whether (-1ULL) should be used instead (which seems to be the case
> in the QEMU code).
Yes, PCI spec defines that unimplemented BARs are hardwired to zero, so
the sizing operation returns zero for the size.
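For illustration, a minimal userspace walk of that fixed mapping might look
like the following sketch (device_fd is assumed to be an open vfio device fd;
it uses the vfio_region_info layout and ioctl names quoted elsewhere in this
thread):

	int i;

	for (i = 0; i < VFIO_PCI_NUM_REGIONS; i++) {
		struct vfio_region_info info = {
			.len = sizeof(info),
			.index = i,
		};

		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
			continue;

		if (!info.size)
			continue;	/* unimplemented BAR/ROM, index still reserved */

		/* info.offset is where region i lives in the device fd */
	}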
<snip>
> > > > +struct vfio_irq_info {
> > > > + __u32 len; /* length of structure */
> > > > + __u32 index; /* IRQ number */
> > > > + __u32 count; /* number of individual IRQs */
> > > > + __u64 flags;
> > > > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> > > > +};
> > > > +
> > > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > > +type to index mapping).
> > >
> > > I am not really sure what that means.
> >
> > This is so PCI can expose:
> >
> > enum {
> > VFIO_PCI_INTX_IRQ_INDEX,
> > VFIO_PCI_MSI_IRQ_INDEX,
> > VFIO_PCI_MSIX_IRQ_INDEX,
> > VFIO_PCI_NUM_IRQS
> > };
> >
> > So like regions it always exposes 3 IRQ indexes where count=0 if the
> > device doesn't actually support that type of interrupt. I just want to
> > spell out that bus drivers have this kind of flexibility.
>
> I think you should change the comment that says 'IRQ number', as the
> first thing that comes to mind is 'GSI' or MSI/MSI-x vector.
> Perhaps '/* index to be used with return value from GET_NUM_IRQS ioctl.
> Order of structures can be unsorted. */
Ah, yes. I see the confusion. They can't really be unsorted though,
the user needs some point of reference. For PCI they will be strictly
ordered. For Device Tree, I assume there will be a path referencing the
index. I'll update the doc to clarify.
<snip>
> > > > +
> > > > +When a level triggered interrupt is signaled, the interrupt is masked
> > > > +on the host. This prevents an unresponsive userspace driver from
> > > > +continuing to interrupt the host system. After servicing the interrupt,
> > > > +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> > > > +triggered interrupts implicitly have a count of 1 per index.
> > >
> > > So they are enabled automatically? Meaning you don't even have to do
> > > SET_IRQ_EVENTFDS b/c the count is set to 1?
> >
> > I suppose that should be "no more than 1 per index" (ie. PCI would
> > report a count of 0 for VFIO_PCI_INTX_IRQ_INDEX if the device doesn't
> > support INTx). I think you might be confusing VFIO_DEVICE_GET_IRQ_INFO
> > which tells how many are available with VFIO_DEVICE_SET_IRQ_EVENTFDS
> > which does the enabling/disabling. All interrupts are disabled by
> > default because userspace needs to give us a way to signal them via
> > eventfds. It will be device dependent whether multiple index can be
> > enabled simultaneously. Hmm, is that another flag on the irq_info
> > struct or do we expect drivers to implicitly have that kind of
> > knowledge?
>
> Right, that was what I was wondering. Not sure how the PowerPC
> world works with this.
On second thought, I think an exclusive flag isn't appropriate. VFIO is
not meant to abstract the device to the level that a user could write a
generic "vfio driver". The user will always need to understand the type
of device, VFIO just provides the conduit to make use of it. There's
too much left undefined with a simplistic exclusive flag.
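For illustration, enabling INTx from userspace follows the arg[0] = index,
arg[1] = count, arg[2..n] = eventfds layout quoted later in the thread; a
sketch only (device_fd assumed, headers and error handling trimmed):

	int args[3];

	args[0] = VFIO_PCI_INTX_IRQ_INDEX;
	args[1] = 1;				/* a count of 0 would disable it */
	args[2] = eventfd(0, EFD_CLOEXEC);

	if (args[2] < 0 ||
	    ioctl(device_fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, args))
		perror("enable INTx");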
<snip>
> > > > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> > >
> > > So only level triggered? Hmm, how do I know whether the device is
> > > level or edge? Or is it that edge (MSI) can also be unmasked using the
> > > eventfd?
> >
> > Yes, only for level. Isn't a device going to know what type of
> > interrupt it uses? MSI masking is PCI specific, not handled by this.
>
> I certainly hope it knows, but you know buggy drivers do exist.
>
> What would be the return value if somebody tried to unmask an edge one?
> Should that be documented here? -ENOSPEC?
I would assume EINVAL or EFAULT since the user is providing an invalid
argument/bad address.
> > > > +
> > > > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > > > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> > > > +
> > > > +When supported, as indicated by the device flags, reset the device.
> > > > +
> > > > +#define VFIO_DEVICE_RESET _IO(';', 116)
> > >
> > > Does it disable the 'count'? Err, does it disable the IRQ on the
> > > device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
> > > to set new eventfds? Or does it re-use the eventfds and the device
> > > is enabled after this?
> >
> > It doesn't affect the interrupt programming. Should it?
>
> I would hope not, but I am trying to think of ways one could screw this up.
> Perhaps just saying that - "No need to call VFIO_DEVICE_SET_IRQ_EVENTFDS
> as the kernel (and the device) will retain the interrupt.".
Ok, I added some words around this in the doc.
> .. snip..
> > > I am not really sure what this section purpose is? Could this be part
> > > of the header file or the code? It does not look to be part of the
> > > ioctl API?
> >
> > We've passed into the "VFIO bus driver API" section of the document, to
> > explain the interaction between vfio-core and vfio bus drivers.
>
> Perhaps a different file?
The entire file is ~300 lines. Seems excessive to split.
<snip>
> > > > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> > > > + struct vfio_device *device)
> > > > +{
> > > > + BUG_ON(!iommu->domain && device->attached);
> > >
> > > Whoa. Heavy hammer there.
> > >
> > > Perhaps WARN_ON as you do check it later on.
> >
> > I think it's warranted, internal consistency is broken if we have a
> > device that thinks it's attached to an iommu domain that doesn't exist.
> > It should, of course, never happen and this isn't a performance path.
> >
> > > > +
> > > > + if (!iommu->domain || !device->attached)
> > > > + return;
>
> Well, the deal is that you BUG_ON earlier, but you check for it here.
> But the BUG_ON will stop execution, so the check 'if ..' is actually
> not needed.
The BUG_ON is a subtly different check:
domain | attached
-------+---------
0 | 0 Nothing to do
0 | 1 <--- BUG_ON, we're broken
1 | 0 Nothing to do
1 | 1 Do stuff
Writing out the truth table, I see now I could just make this:
if (!attached) {return;}
since the BUG_ON takes care of the other case.
The reason for the laziness of allowing this to simply return is that if
we hit an error attaching an individual device within a group, we just
push the whole group back through __vfio_iommu_detach_group(), so some
devices may have never been attached.
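For reference, a sketch of the detach helper with that simplified check folded
in (same structure as the code quoted below, not the actual repost):

static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
				    struct vfio_device *device)
{
	/* attached without a domain means internal state is corrupt */
	BUG_ON(!iommu->domain && device->attached);

	/* group rollback may pass in devices that were never attached */
	if (!device->attached)
		return;

	iommu_detach_device(iommu->domain, device->dev);
	device->attached = false;
}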
> > > > +
> > > > + iommu_detach_device(iommu->domain, device->dev);
> > > > + device->attached = false;
> > > > +}
> > > > +
> > > > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> > > > + struct vfio_group *group)
> > > > +{
> > > > + struct list_head *pos;
> > > > +
> > > > + list_for_each(pos, &group->device_list) {
> > > > + struct vfio_device *device;
> > > > +
> > > > + device = list_entry(pos, struct vfio_device, device_next);
> > > > + __vfio_iommu_detach_dev(iommu, device);
> > > > + }
> > > > +}
> > > > +
> > > > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> > > > + struct vfio_device *device)
> > > > +{
> > > > + int ret;
> > > > +
> > > > + BUG_ON(device->attached);
> > >
> > > How about:
> > >
> > > WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register
> > > the device again! Tell him/her to stop please.\n");
> >
> > I would almost demote this one to a WARN_ON, but userspace isn't in
> > control of attaching and detaching devices from the iommu. That's a
> > side effect of getting the iommu or device file descriptor. So again,
> > this is an internal consistency check and it should never happen,
> > regardless of userspace.
> >
>
> Ok, then you might want to expand it to
>
> BUG_ON(!device || device->attached);
>
> In case something has gone horribly wrong.
Impressive, that exceeds even my paranoia ;) For that we would have had
to walk the group->device_list and come up with a NULL device pointer.
I think we can assume that won't happen. I've also got this though:
if (!iommu || !iommu->domain)
return -EINVAL;
Which is effectively just being lazy without a good excuse like above.
That could probably be folded into the BUG_ON.
>
> .. snip..
> > > > + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > > > + device_create(vfio.class, NULL, group->devt,
> > > > + group, "%u", groupid);
> > > > +
> > > > + group->bus = dev->bus;
> > >
> > >
> > > Oh, so that is how the IOMMU iommu_ops get copied! You might
> > > want to mention that - I was not sure where the 'handoff' is
> > > was done to insert a device so that it can do iommu_ops properly.
> > >
> > > Ok, so the time when a device is detected whether it can do
> > > IOMMU is when we try to open it - as that is when iommu_domain_alloc
> > > is called which can return NULL if the iommu_ops is not set.
> > >
> > > So what about devices that don't have an iommu_ops? Say they
> > > are using SWIOTLB? (like the AMD-Vi sometimes does if the
> > > device is not on its list).
> > >
> > > Can we use iommu_present?
> >
> > I'm not sure I'm following your revelation ;) Take a look at the
>
> I am trying to figure out who sets the iommu_ops call on the devices.
The iommu driver registers ops with bus_set_iommu, so then we just need
to pass the bus pointer and iommu_ops figures out the rest. If there's
no iommu_ops for a device or the iommu_ops doesn't implement the
device_group callback, it gets skipped by vfio and therefore won't be
usable by this interface.
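In code terms, the skip boils down to this check at device registration time
(a sketch; iommu_device_group() is the helper from the commits linked just
below, and it returns non-zero when the bus has no iommu_ops or no
device_group callback):

	unsigned int groupid;

	if (iommu_device_group(dev, &groupid))
		return -ENODEV;		/* no iommu backing, not usable via vfio */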
> > pointer to iommu_device_group I pasted above, or these:
> >
> > https://github.com/awilliam/linux-vfio/commit/37dd08c90d149caaed7779d4f38850a8f7ed0fa5
> > https://github.com/awilliam/linux-vfio/commit/63ca8543533d8130db23d7949133e548c3891c97
> > https://github.com/awilliam/linux-vfio/commit/8d7d70eb8e714fbf8710848a06f8cab0c741631e
> >
> > That call includes an iommu_present() check, so if there's no iommu or
> > the iommu can't provide a groupid, the device is skipped over from vfio
> > (can't be used).
> >
> > So the ordering is:
> >
> > - bus driver registers device
> > - if it has an iommu group, add it to the vfio device/group tracking
> >
> > - group gets opened
> > - user gets iommu or device fd results in iommu_domain_alloc
> >
> > Devices without iommu_ops don't get to play in the vfio world.
>
> Right, and I think the answer of which devices get iommu_ops is done via
> bus_set_iommu.
Exactly.
> (Thinking in long-term of what would be required to make this work
> with Xen and it sounds like I will need to implement a Xen IOMMU driver)
Yeah, that would make sense. Thanks!
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-17 20:22 ` Alex Williamson
@ 2011-11-17 20:56 ` Scott Wood
0 siblings, 0 replies; 62+ messages in thread
From: Scott Wood @ 2011-11-17 20:56 UTC (permalink / raw)
To: Alex Williamson
Cc: Konrad Rzeszutek Wilk, chrisw, aik, pmac, dwg, joerg.roedel,
agraf, benve, aafabbri, B08248, B07421, avi, kvm, qemu-devel,
iommu, linux-pci
On Thu, Nov 17, 2011 at 01:22:17PM -0700, Alex Williamson wrote:
> On Wed, 2011-11-16 at 11:52 -0500, Konrad Rzeszutek Wilk wrote:
> > On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
> > What would be the return value if somebody tried to unmask an edge one?
> > Should that be documented here? -ENOSPEC?
>
> I would assume EINVAL or EFAULT since the user is providing an invalid
> argument/bad address.
EINVAL. EFAULT is normally only used when the user passes a bad
virtual memory address to the kernel. This isn't an address at all, it's
an index that points to an object for which this operation does not make
sense.
-Scott
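For illustration, the convention described above might look like this inside
a device ioctl handler (a sketch only; vdev, num_irqs and irq_flags are
hypothetical names, not fields from the patch):

	/* the index exists but the operation doesn't apply: -EINVAL.
	 * -EFAULT stays reserved for copy_from_user()/copy_to_user()
	 * failures on the user-supplied pointer. */
	if (index >= vdev->num_irqs)
		return -EINVAL;

	if (!(vdev->irq_flags[index] & VFIO_IRQ_INFO_FLAG_LEVEL))
		return -EINVAL;		/* edge triggered, nothing to unmask */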
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-11 22:10 ` Alex Williamson
2011-11-15 0:00 ` David Gibson
2011-11-16 16:52 ` Konrad Rzeszutek Wilk
@ 2011-11-16 17:47 ` Scott Wood
2011-11-17 20:52 ` Alex Williamson
2 siblings, 1 reply; 62+ messages in thread
From: Scott Wood @ 2011-11-16 17:47 UTC (permalink / raw)
To: Alex Williamson
Cc: Konrad Rzeszutek Wilk, chrisw, aik, pmac, dwg, joerg.roedel,
agraf, benve, aafabbri, B08248, B07421, avi, kvm, qemu-devel,
iommu, linux-pci
On 11/11/2011 04:10 PM, Alex Williamson wrote:
>
> Thanks Konrad! Comments inline.
>
> On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
>> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
>>> +When supported, as indicated by the device flags, reset the device.
>>> +
>>> +#define VFIO_DEVICE_RESET _IO(';', 116)
>>
>> Does it disable the 'count'? Err, does it disable the IRQ on the
>> device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
>> to set new eventfds? Or does it re-use the eventfds and the device
>> is enabled after this?
>
> It doesn't affect the interrupt programming. Should it?
It should probably clear any currently pending interrupts, as if the
unmask IOCTL were called.
>>> +device tree properties of the device:
>>> +
>>> +struct vfio_dtpath {
>>> + __u32 len; /* length of structure */
>>> + __u32 index;
>>
>> 0 based I presume?
>
> Everything else is, I would assume so.
Yes, it should be zero-based -- this matches how such indices are done
in the kernel device tree APIs.
>>> + __u64 flags;
>>> +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
>>
>> What is region in this context?? Or would this make much more sense
>> if I knew what Device Tree actually is.
>
> Powerpc guys, any comments? This was their suggestion. These are
> effectively the first device specific extension, available when
> VFIO_DEVICE_FLAGS_DT is set.
An assigned device may consist of an entire subtree of the device tree,
and both register banks and interrupts can come from any node in the
tree. Region versus IRQ here indicates the context in which to
interpret index, in order to retrieve the path of the node that supplied
this particular region or IRQ.
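For illustration, the intended use from userspace would be roughly the
following (a hypothetical sketch against the vfio_dtpath layout quoted in
this thread; as noted elsewhere, the API still lacks a buffer-length field
and is likely to be deferred to the device tree bus driver):

	char pathbuf[256];
	struct vfio_dtpath dt = {
		.len   = sizeof(dt),
		.index = 0,				/* region index 0 */
		.flags = VFIO_DTPATH_FLAGS_REGION,	/* index names a region, not an IRQ */
		.path  = pathbuf,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_DTPATH, &dt) == 0)
		printf("region 0 supplied by node %s\n", dt.path);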
>>> +};
>>> +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
>>> +
>>> +struct vfio_dtindex {
>>> + __u32 len; /* length of structure */
>>> + __u32 index;
>>> + __u32 prop_type;
>>
>> Is that an enum type? Is this definied somewhere?
>>> + __u32 prop_index;
>>
>> What is the purpose of this field?
>
> Need input from powerpc folks here
To identify which resource (register bank or IRQ) this is, we need
both the path to the node and the index into the reg or interrupts
property within the node.
We also need to distinguish reg from ranges, and interrupts from
interrupt-map. As you suggested elsewhere in the thread, the device
tree API should probably be left out for now, and added later along with
the device tree "bus" driver.
>>> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
>>> + struct vfio_device *device)
>>> +{
>>> + BUG_ON(!iommu->domain && device->attached);
>>
>> Whoa. Heavy hammer there.
>>
>> Perhaps WARN_ON as you do check it later on.
>
> I think it's warranted, internal consistency is broken if we have a
> device that thinks it's attached to an iommu domain that doesn't exist.
> It should, of course, never happen and this isn't a performance path.
>
[snip]
>>> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
>>> + struct vfio_device *device)
>>> +{
>>> + int ret;
>>> +
>>> + BUG_ON(device->attached);
>>
>> How about:
>>
>> WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register
>> the device again! Tell him/her to stop please.\n");
>
> I would almost demote this one to a WARN_ON, but userspace isn't in
> control of attaching and detaching devices from the iommu. That's a
> side effect of getting the iommu or device file descriptor. So again,
> this is an internal consistency check and it should never happen,
> regardless of userspace.
The rule isn't to use BUG for internal consistency checks and WARN for
stuff userspace can trigger, but rather to use BUG if you cannot
reasonably continue, WARN for "significant issues that need prompt
attention" that are reasonably recoverable. Most instances of WARN are
internal consistency checks.
From include/asm-generic/bug.h:
> If you're tempted to BUG(), think again: is completely giving up
> really the *only* solution? There are usually better options, where
> users don't need to reboot ASAP and can mostly shut down cleanly.
-Scott
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-16 17:47 ` Scott Wood
@ 2011-11-17 20:52 ` Alex Williamson
0 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-17 20:52 UTC (permalink / raw)
To: Scott Wood
Cc: Konrad Rzeszutek Wilk, chrisw, aik, pmac, dwg, joerg.roedel,
agraf, benve, aafabbri, B08248, B07421, avi, kvm, qemu-devel,
iommu, linux-pci
On Wed, 2011-11-16 at 11:47 -0600, Scott Wood wrote:
> On 11/11/2011 04:10 PM, Alex Williamson wrote:
> >
> > Thanks Konrad! Comments inline.
> >
> > On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> >> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> >>> +When supported, as indicated by the device flags, reset the device.
> >>> +
> >>> +#define VFIO_DEVICE_RESET _IO(';', 116)
> >>
> >> Does it disable the 'count'? Err, does it disable the IRQ on the
> >> device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
> >> to set new eventfds? Or does it re-use the eventfds and the device
> >> is enabled after this?
> >
> > It doesn't affect the interrupt programming. Should it?
>
> It should probably clear any currently pending interrupts, as if the
> unmask IOCTL were called.
Sounds reasonable.
> >>> +device tree properties of the device:
> >>> +
> >>> +struct vfio_dtpath {
> >>> + __u32 len; /* length of structure */
> >>> + __u32 index;
> >>
> >> 0 based I presume?
> >
> > Everything else is, I would assume so.
>
> Yes, it should be zero-based -- this matches how such indices are done
> in the kernel device tree APIs.
>
> >>> + __u64 flags;
> >>> +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> >>
> >> What is region in this context?? Or would this make much more sense
> >> if I knew what Device Tree actually is.
> >
> > Powerpc guys, any comments? This was their suggestion. These are
> > effectively the first device specific extension, available when
> > VFIO_DEVICE_FLAGS_DT is set.
>
> An assigned device may consist of an entire subtree of the device tree,
> and both register banks and interrupts can come from any node in the
> tree. Region versus IRQ here indicates the context in which to
> interpret index, in order to retrieve the path of the node that supplied
> this particular region or IRQ.
Ok. Thanks for the clarification. We'll wait for the vfio-dt bus
driver before actually including this.
> >>> +};
> >>> +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> >>> +
> >>> +struct vfio_dtindex {
> >>> + __u32 len; /* length of structure */
> >>> + __u32 index;
> >>> + __u32 prop_type;
> >>
> >> Is that an enum type? Is this definied somewhere?
> >>> + __u32 prop_index;
> >>
> >> What is the purpose of this field?
> >
> > Need input from powerpc folks here
>
> To identify which resource (register bank or IRQ) this is, we need
> both the path to the node and the index into the reg or interrupts
> property within the node.
>
> We also need to distinguish reg from ranges, and interrupts from
> interrupt-map. As you suggested elsewhere in the thread, the device
> tree API should probably be left out for now, and added later along with
> the device tree "bus" driver.
Yep, I'll do that.
> >>> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> >>> + struct vfio_device *device)
> >>> +{
> >>> + BUG_ON(!iommu->domain && device->attached);
> >>
> >> Whoa. Heavy hammer there.
> >>
> >> Perhaps WARN_ON as you do check it later on.
> >
> > I think it's warranted, internal consistency is broken if we have a
> > device that thinks it's attached to an iommu domain that doesn't exist.
> > It should, of course, never happen and this isn't a performance path.
> >
> [snip]
> >>> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> >>> + struct vfio_device *device)
> >>> +{
> >>> + int ret;
> >>> +
> >>> + BUG_ON(device->attached);
> >>
> >> How about:
> >>
> >> WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register
> >> the device again! Tell him/her to stop please.\n");
> >
> > I would almost demote this one to a WARN_ON, but userspace isn't in
> > control of attaching and detaching devices from the iommu. That's a
> > side effect of getting the iommu or device file descriptor. So again,
> > this is an internal consistency check and it should never happen,
> > regardless of userspace.
>
> The rule isn't to use BUG for internal consistency checks and WARN for
> stuff userspace can trigger, but rather to use BUG if you cannot
> reasonably continue, WARN for "significant issues that need prompt
> attention" that are reasonably recoverable. Most instances of WARN are
> internal consistency checks.
That makes sense.
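For reference, such a demotion would look roughly like this (a sketch, not
the actual cleanup pass):

	/* before */
	BUG_ON(!iommu->domain && device->attached);

	/* after: flag the broken invariant but recover instead of panicking */
	if (WARN_ON(!iommu->domain && device->attached))
		return;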
> From include/asm-generic/bug.h:
> > If you're tempted to BUG(), think again: is completely giving up
> > really the *only* solution? There are usually better options, where
> > users don't need to reboot ASAP and can mostly shut down cleanly.
Ok, I'll make a cleanup pass of demoting BUG_ONs to WARN_ONs. Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
[not found] <20111103195452.21259.93021.stgit@bling.home>
` (3 preceding siblings ...)
2011-11-11 17:51 ` Konrad Rzeszutek Wilk
@ 2011-11-12 0:14 ` Scott Wood
2011-11-14 20:54 ` Alex Williamson
2011-11-15 6:34 ` David Gibson
2011-11-29 1:52 ` Alexey Kardashevskiy
6 siblings, 1 reply; 62+ messages in thread
From: Scott Wood @ 2011-11-12 0:14 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, aafabbri,
B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
linux-pci
On 11/03/2011 03:12 PM, Alex Williamson wrote:
> +Many modern systems now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as
> +well as POWER systems with Partitionable Endpoints (PEs) and even
> +embedded powerpc systems (technology name unknown).
Maybe replace "(technology name unknown)" with "(such as Freescale chips
with PAMU)" or similar?
Or just leave out the parenthetical.
> +As documented in linux/vfio.h, several ioctls are provided on the
> +group chardev:
> +
> +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
This suggests the argument to VFIO_GROUP_GET_DEVICE_FD is a pointer to a
pointer to char rather than a pointer to an array of char (just as e.g.
VFIO_GROUP_MERGE takes a pointer to an int, not just an int).
> +The IOMMU file descriptor provides this set of ioctls:
> +
> +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
What is the implication if VFIO_IOMMU_FLAGS_MAP_ANY is clear? Is such
an implementation supposed to add a new flag that describes its
restrictions?
Can we get a way to turn DMA access off and on, short of unmapping
everything, and then mapping it again?
> +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> +We currently only support IOMMU domains that are able to map any
> +virtual address to any IOVA. This is indicated by the MAP_ANY flag.
> +
> +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> +and unmapping IOVAs to process virtual addresses:
> +
> +struct vfio_dma_map {
> + __u64 len; /* length of structure */
> + __u64 vaddr; /* process virtual addr */
> + __u64 dmaaddr; /* desired and/or returned dma address */
> + __u64 size; /* size in bytes */
> + __u64 flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> +};
What are the semantics of "desired and/or returned dma address"?
Are we always supposed to provide a desired address, but it may be
different on return? Or are there cases where we want to say "give me
whatever you want" or "give me this or fail"?
How much of this needs to be filled out for unmap?
Note that the "length of structure" approach means that ioctl numbers
will change whenever this grows -- perhaps we should avoid encoding the
struct size into these ioctls?
> +struct vfio_region_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* region number */
> + __u64 size; /* size in bytes of region */
> + __u64 offset; /* start offset of region */
> + __u64 flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> + __u64 phys; /* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> +
> +The offset indicates the offset into the device file descriptor which
> +accesses the given range (for read/write/mmap/seek). Flags indicate the
> +available access types and validity of optional fields. For instance
> +the phys field may only be valid for certain devices types.
> +
> +Interrupts are described using a similar interface. GET_NUM_IRQS
> +reports the number of IRQ indexes for the device.
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* IRQ number */
> + __u32 count; /* number of individual IRQs */
> + __u64 flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
Make sure flags is 64-bit aligned -- some 32-bit ABIs, such as x86, will
not do this, causing problems if the kernel is 64-bit and thus assumes a
different layout.
> +Information about each index can be retrieved using the GET_IRQ_INFO
> +ioctl, used much like GET_REGION_INFO.
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> +
> +Individual indexes can describe single or sets of IRQs. This provides the
> +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> +
> +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> +as shown below, are used to pass the IRQ info index, the number of eventfds,
> +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> +
> +When a level triggered interrupt is signaled, the interrupt is masked
> +on the host. This prevents an unresponsive userspace driver from
> +continuing to interrupt the host system.
It's usually necessary even in the case of responsive userspace, just to
get to the point where userspace can execute (ignoring cases where
userspace runs on one core while the interrupt storms another).
> For edge interrupts, will we mask if an interrupt comes in and the
previous interrupt hasn't been read out yet (and then unmask when the
last interrupt gets read out), to isolate us from a rapidly firing
interrupt source that userspace can't keep up with?
> +Device tree devices also include ioctls for further defining the
> +device tree properties of the device:
> +
> +struct vfio_dtpath {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u64 flags;
> +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> + char *path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
Where is length of buffer (and description of associated semantics)?
> +struct vfio_device_ops {
> + bool (*match)(struct device *, char *);
const char *?
> + int (*get)(void *);
> + void (*put)(void *);
> + ssize_t (*read)(void *, char __user *,
> + size_t, loff_t *);
> + ssize_t (*write)(void *, const char __user *,
> + size_t, loff_t *);
> + long (*ioctl)(void *, unsigned int, unsigned long);
> + int (*mmap)(void *, struct vm_area_struct *);
> +};
When defining an API, please do not omit parameter names.
Should specify what the driver is supposed to do with get/put -- I guess
not try to unbind when the count is nonzero? Races could still lead the
unbinder to be blocked, but I guess it lets the driver know when it's
likely to succeed.
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> new file mode 100644
> index 0000000..9acb1e7
> --- /dev/null
> +++ b/drivers/vfio/Kconfig
> @@ -0,0 +1,8 @@
> +menuconfig VFIO
> + tristate "VFIO Non-Privileged userspace driver framework"
> + depends on IOMMU_API
> + help
> + VFIO provides a framework for secure userspace device drivers.
> + See Documentation/vfio.txt for more details.
> +
> + If you don't know what to do here, say N.
Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO? It
would still be useful for devices which don't do DMA, or where we accept
the lack of protection/translation (e.g. we have a customer that wants
to do KVM device assignment on one of our lower-end chips that lacks an
IOMMU).
> +struct dma_map_page {
> + struct list_head list;
> + dma_addr_t daddr;
> + unsigned long vaddr;
> + int npage;
> + int rdwr;
> +};
npage should be long.
What is "rdwr"? non-zero for write? non-zero for read? :-)
is_write would be a better name.
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> + unsigned long pfn = 0;
> +
> + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> + if (ret) {
> + __vfio_dma_unmap(iommu, start, i, rdwr);
> + return ret;
> + }
> +
> + /* Only add actual locked pages to accounting */
> + if (!is_invalid_reserved_pfn(pfn))
> + locked++;
> +
> + ret = iommu_map(iommu->domain, iova,
> + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> + if (ret) {
> + /* Back out mappings on error */
> + put_pfn(pfn, rdwr);
> + __vfio_dma_unmap(iommu, start, i, rdwr);
> + return ret;
> + }
> + }
There's no way to hand this stuff to the IOMMU driver in chunks larger
than a page? That's going to be a problem for our IOMMU, which wants to
deal with large windows.
> + vfio_lock_acct(locked);
> + return 0;
> +}
> +
> +static inline int ranges_overlap(unsigned long start1, size_t size1,
> + unsigned long start2, size_t size2)
> +{
> + return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> +}
You pass DMA addresses to this, so use dma_addr_t. unsigned long is not
always large enough.
What if one of the ranges wraps around (including the legitimate
possibility of start + size == 0)?
> +static long vfio_iommu_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_iommu *iommu = filep->private_data;
> + int ret = -ENOSYS;
-ENOIOCTLCMD or -ENOTTY?
> +
> + if (cmd == VFIO_IOMMU_GET_FLAGS) {
> + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> +
> + ret = put_user(flags, (u64 __user *)arg);
> +
> + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> + struct vfio_dma_map dm;
Whitespace.
Any reason not to use a switch?
> +/* Return true if any devices within a group are opened */
> +static bool __vfio_group_devs_inuse(struct vfio_group *group)
[snip]
> +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
[snip]
> +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
[snip]
> +static void __vfio_group_set_iommu(struct vfio_group *group,
> + struct vfio_iommu *iommu)
...and so on.
Why all the leading underscores? Doesn't look like you're trying to
distinguish between this and a more public version with the same name.
> +/* Get a new device file descriptor. This will open the iommu, setting
> + * the current->mm ownership if it's not already set. It's difficult to
> + * specify the requirements for matching a user supplied buffer to a
> + * device, so we use a vfio driver callback to test for a match. For
> + * PCI, dev_name(dev) is unique, but other drivers may require including
> + * a parent device string. */
> +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> +{
> + struct vfio_iommu *iommu = group->iommu;
> + struct list_head *gpos;
> + int ret = -ENODEV;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (!iommu->domain) {
> + ret = __vfio_open_iommu(iommu);
> + if (ret)
> + goto out;
> + }
> +
> + list_for_each(gpos, &iommu->group_list) {
> + struct list_head *dpos;
> +
> + group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (device->ops->match(device->dev, buf)) {
If there's a match, we're done with the loop -- might as well break out
now rather than indent everything else.
> + struct file *file;
> +
> + if (device->ops->get(device->device_data)) {
> + ret = -EFAULT;
> + goto out;
> + }
Why does a failure of get() result in -EFAULT? -EFAULT is for bad user
addresses.
> +
> + /* We can't use anon_inode_getfd(), like above
> + * because we need to modify the f_mode flags
> + * directly to allow more than just ioctls */
> + ret = get_unused_fd();
> + if (ret < 0) {
> + device->ops->put(device->device_data);
> + goto out;
> + }
> +
> + file = anon_inode_getfile("[vfio-device]",
> + &vfio_device_fops,
> + device, O_RDWR);
> + if (IS_ERR(file)) {
> + put_unused_fd(ret);
> + ret = PTR_ERR(file);
> + device->ops->put(device->device_data);
> + goto out;
> + }
Maybe cleaner with goto-based error management?
> +/* Add a new device to the vfio framework with associated vfio driver
> + * callbacks. This is the entry point for vfio drivers to register devices. */
> +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> +{
> + struct list_head *pos;
> + struct vfio_group *group = NULL;
> + struct vfio_device *device = NULL;
> + unsigned int groupid;
> + int ret = 0;
> + bool new_group = false;
> +
> + if (!ops)
> + return -EINVAL;
> +
> + if (iommu_device_group(dev, &groupid))
> + return -ENODEV;
> +
> + mutex_lock(&vfio.lock);
> +
> + list_for_each(pos, &vfio.group_list) {
> + group = list_entry(pos, struct vfio_group, group_next);
> + if (group->groupid == groupid)
> + break;
> + group = NULL;
> + }
Factor this into vfio_dev_to_group() (and likewise for other such lookups)?
> + if (!group) {
> + int minor;
> +
> + if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + group = kzalloc(sizeof(*group), GFP_KERNEL);
> + if (!group) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + group->groupid = groupid;
> + INIT_LIST_HEAD(&group->device_list);
> +
> + ret = idr_get_new(&vfio.idr, group, &minor);
> + if (ret == 0 && minor > MINORMASK) {
> + idr_remove(&vfio.idr, minor);
> + kfree(group);
> + ret = -ENOSPC;
> + goto out;
> + }
> +
> + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> + device_create(vfio.class, NULL, group->devt,
> + group, "%u", groupid);
> +
> + group->bus = dev->bus;
> + list_add(&group->group_next, &vfio.group_list);
Factor out into vfio_create_group()?
> + new_group = true;
> + } else {
> + if (group->bus != dev->bus) {
> + printk(KERN_WARNING
> + "Error: IOMMU group ID conflict. Group ID %u "
> + "on both bus %s and %s\n", groupid,
> + group->bus->name, dev->bus->name);
> + ret = -EFAULT;
> + goto out;
> + }
It took me a little while to figure out that this was comparing bus
types, not actual bus instances (which would be an inappropriate
restriction). :-P
Still, isn't it what we really care about that it's the same IOMMU
domain? Couldn't different bus types share an iommu_ops?
And again, -EFAULT isn't the right error.
-Scott
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-12 0:14 ` Scott Wood
@ 2011-11-14 20:54 ` Alex Williamson
2011-11-14 21:46 ` Alex Williamson
` (2 more replies)
0 siblings, 3 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-14 20:54 UTC (permalink / raw)
To: Scott Wood
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, aafabbri,
B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
linux-pci
On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
> On 11/03/2011 03:12 PM, Alex Williamson wrote:
> > +Many modern systems now provide DMA and interrupt remapping facilities
> > +to help ensure I/O devices behave within the boundaries they've been
> > +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as
> > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > +embedded powerpc systems (technology name unknown).
>
> Maybe replace "(technology name unknown)" with "(such as Freescale chips
> with PAMU)" or similar?
>
> Or just leave out the parenthetical.
I was hoping that comment would lead to an answer. Thanks for the
info ;)
> > +As documented in linux/vfio.h, several ioctls are provided on the
> > +group chardev:
> > +
> > +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> > + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> > +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> > +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> > +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
>
> This suggests the argument to VFIO_GROUP_GET_DEVICE_FD is a pointer to a
> pointer to char rather than a pointer to an array of char (just as e.g.
> VFIO_GROUP_MERGE takes a pointer to an int, not just an int).
I believe I was following the UI_SET_PHYS ioctl as an example, which is
defined as a char *. I'll change to char and verify.
> > +The IOMMU file descriptor provides this set of ioctls:
> > +
> > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
>
> What is the implication if VFIO_IOMMU_FLAGS_MAP_ANY is clear? Is such
> an implementation supposed to add a new flag that describes its
> restrictions?
If MAP_ANY is clear then I would expect a new flag is set defining a new
mapping paradigm, probably with an ioctl to describe the
restrictions/parameters. MAP_ANY effectively means there are no
restrictions.
> Can we get a way to turn DMA access off and on, short of unmapping
> everything, and then mapping it again?
iommu_ops doesn't support such an interface, so no, not currently.
> > +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> > +We currently only support IOMMU domains that are able to map any
> > +virtual address to any IOVA. This is indicated by the MAP_ANY flag.
> > +
> > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > +and unmapping IOVAs to process virtual addresses:
> > +
> > +struct vfio_dma_map {
> > + __u64 len; /* length of structure */
> > + __u64 vaddr; /* process virtual addr */
> > + __u64 dmaaddr; /* desired and/or returned dma address */
> > + __u64 size; /* size in bytes */
> > + __u64 flags;
> > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> > +};
>
> What are the semantics of "desired and/or returned dma address"?
I believe the original intention was that a user could leave dmaaddr
clear and let the iommu layer provide an iova address. The iommu api
has since evolved and that mapping scheme really isn't present anymore.
We'll currently fail if we can't map the requested address. I'll update
the docs to make that the definition.
> Are we always supposed to provide a desired address, but it may be
> different on return? Or are there cases where we want to say "give me
> whatever you want" or "give me this or fail"?
Exactly, that's what it used to be, but we don't really implement that
any more.
> How much of this needs to be filled out for unmap?
dmaaddr & size, will update docs.
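For illustration, with the "map exactly this IOVA or fail" semantics settled
on above, a userspace call might look like the following sketch (buf and
iommu_fd are assumed; as just noted, an unmap only needs dmaaddr and size
filled in):

	struct vfio_dma_map map = {
		.len     = sizeof(map),
		.vaddr   = (__u64)(uintptr_t)buf,	/* process virtual address */
		.dmaaddr = 1ULL << 30,			/* requested IOVA, not negotiable */
		.size    = 1ULL << 20,
		.flags   = VFIO_DMA_MAP_FLAG_WRITE,
	};

	if (ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &map))
		perror("VFIO_IOMMU_MAP_DMA");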
> Note that the "length of structure" approach means that ioctl numbers
> will change whenever this grows -- perhaps we should avoid encoding the
> struct size into these ioctls?
How so? What's described here is effectively the base size. If we
later add feature foo requiring additional fields, we set a flag, change
the size, and tack those fields onto the end. The kernel side should
balk if the size doesn't match what it expects from the flags it
understands (which I think I probably need to be more strict about).
> > +struct vfio_region_info {
> > + __u32 len; /* length of structure */
> > + __u32 index; /* region number */
> > + __u64 size; /* size in bytes of region */
> > + __u64 offset; /* start offset of region */
> > + __u64 flags;
> > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> > + __u64 phys; /* physical address of region */
> > +};
In light of the above, this struct should not include phys. In fact, I
should probably remove the PHYS_VALID flag as well until we have a bus
driver implementation that actually makes use of it.
> > +
> > +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> > +
> > +The offset indicates the offset into the device file descriptor which
> > +accesses the given range (for read/write/mmap/seek). Flags indicate the
> > +available access types and validity of optional fields. For instance
> > +the phys field may only be valid for certain devices types.
> > +
> > +Interrupts are described using a similar interface. GET_NUM_IRQS
> > +reports the number of IRQ indexes for the device.
> > +
> > +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> > +
> > +struct vfio_irq_info {
> > + __u32 len; /* length of structure */
> > + __u32 index; /* IRQ number */
> > + __u32 count; /* number of individual IRQs */
> > + __u64 flags;
> > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
>
> Make sure flags is 64-bit aligned -- some 32-bit ABIs, such as x86, will
> not do this, causing problems if the kernel is 64-bit and thus assumes a
> different layout.
Shoot, I'll push flags up above count to get it aligned.
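That reordering would look roughly like this (a sketch; the explicit trailing
pad is an extra suggestion, not something stated here, so that sizeof also
matches across 32-bit and 64-bit ABIs):

struct vfio_irq_info {
	__u32 len;	/* length of structure */
	__u32 index;	/* IRQ index */
	__u64 flags;	/* now at offset 8 on both 32-bit and 64-bit ABIs */
#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
	__u32 count;	/* number of individual IRQs */
	__u32 pad;
};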
> > +Information about each index can be retrieved using the GET_IRQ_INFO
> > +ioctl, used much like GET_REGION_INFO.
> > +
> > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> > +
> > +Individual indexes can describe single or sets of IRQs. This provides the
> > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> > +
> > +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> > +as shown below, are used to pass the IRQ info index, the number of eventfds,
> > +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> > +
> > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> > +
> > +When a level triggered interrupt is signaled, the interrupt is masked
> > +on the host. This prevents an unresponsive userspace driver from
> > +continuing to interrupt the host system.
>
> It's usually necessary even in the case of responsive userspace, just to
> get to the point where userspace can execute (ignoring cases where
> userspace runs on one core while the interrupt storms another).
Right, I'll try to clarify.
> For edge interrupts, will we mask if an interrupt comes in and the
> previous interrupt hasn't been read out yet (and then unmask when the
> last interrupt gets read out), to isolate us from a rapidly firing
> interrupt source that userspace can't keep up with?
We don't do that currently and I haven't seen a need to. Seems like
there'd be no API change in doing that if we want at some point.
> > +Device tree devices also include ioctls for further defining the
> > +device tree properties of the device:
> > +
> > +struct vfio_dtpath {
> > + __u32 len; /* length of structure */
> > + __u32 index;
> > + __u64 flags;
> > +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> > +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> > + char *path;
> > +};
> > +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
>
> Where is length of buffer (and description of associated semantics)?
I think I should probably take the same approach as the phys field
above, leave it to the dt bus driver to add these ioctls and fields as
I'm almost certain to get it wrong trying to predict what it's going to
need. Likewise, VFIO_DEVICE_FLAGS_PCI should be defined as part of the
pci bus driver patch, even though it doesn't need any extra
ioctls/fields.
> > +struct vfio_device_ops {
> > + bool (*match)(struct device *, char *);
>
> const char *?
will fix
> > + int (*get)(void *);
> > + void (*put)(void *);
> > + ssize_t (*read)(void *, char __user *,
> > + size_t, loff_t *);
> > + ssize_t (*write)(void *, const char __user *,
> > + size_t, loff_t *);
> > + long (*ioctl)(void *, unsigned int, unsigned long);
> > + int (*mmap)(void *, struct vm_area_struct *);
> > +};
>
> When defining an API, please do not omit parameter names.
ok
> Should specify what the driver is supposed to do with get/put -- I guess
> not try to unbind when the count is nonzero? Races could still lead the
> unbinder to be blocked, but I guess it lets the driver know when it's
> likely to succeed.
Right, for the pci bus driver, it's mainly for reference counting,
including the module_get to prevent vfio-pci from being unloaded. On
the first get for a device we also do a pci_enable(), with the matching
pci_disable() on the last put. I'll try to clarify in the docs.
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> > new file mode 100644
> > index 0000000..9acb1e7
> > --- /dev/null
> > +++ b/drivers/vfio/Kconfig
> > @@ -0,0 +1,8 @@
> > +menuconfig VFIO
> > + tristate "VFIO Non-Privileged userspace driver framework"
> > + depends on IOMMU_API
> > + help
> > + VFIO provides a framework for secure userspace device drivers.
> > + See Documentation/vfio.txt for more details.
> > +
> > + If you don't know what to do here, say N.
>
> Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO? It
> would still be useful for devices which don't do DMA, or where we accept
> the lack of protection/translation (e.g. we have a customer that wants
> to do KVM device assignment on one of our lower-end chips that lacks an
> IOMMU).
Ugh. I'm not really onboard with it given that we're trying to sell
vfio as a secure user space driver interface with iommu-based
protection. That said, vfio_iommu.c is already its own file, with the
thought that other platforms might need to manage the iommu differently.
Theoretically the IOMMU_API requirement could be tied specifically to
vfio_iommu and another iommu backend added.
> > +struct dma_map_page {
> > + struct list_head list;
> > + dma_addr_t daddr;
> > + unsigned long vaddr;
> > + int npage;
> > + int rdwr;
> > +};
>
> npage should be long.
Seems like I went back and forth on that a couple times, I'll see if I
can remember why I landed on int or change it. Practically, int is "big
enough", but that's not a good answer.
> What is "rdwr"? non-zero for write? non-zero for read? :-)
> is_write would be a better name.
Others commented on this too, I'll switch to a bool and rename it so it's
obvious that it means write access enabled.
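With both changes applied, the struct would become something like (a sketch):

struct dma_map_page {
	struct list_head	list;
	dma_addr_t		daddr;
	unsigned long		vaddr;
	long			npage;		/* was int */
	bool			is_write;	/* was int rdwr */
};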
>
> > + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > + unsigned long pfn = 0;
> > +
> > + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > + if (ret) {
> > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > + return ret;
> > + }
> > +
> > + /* Only add actual locked pages to accounting */
> > + if (!is_invalid_reserved_pfn(pfn))
> > + locked++;
> > +
> > + ret = iommu_map(iommu->domain, iova,
> > + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> > + if (ret) {
> > + /* Back out mappings on error */
> > + put_pfn(pfn, rdwr);
> > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > + return ret;
> > + }
> > + }
>
> There's no way to hand this stuff to the IOMMU driver in chunks larger
> than a page? That's going to be a problem for our IOMMU, which wants to
> deal with large windows.
There is, this is just a simple implementation that maps individual
pages. We "just" need to determine physically contiguous chunks and
mlock them instead of using get_user_pages. The current implementation
is much like how KVM maps iommu pages, but there shouldn't be a user API
change to try to use larger chunks. We want this for IOMMU large page
support too.
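The kind of helper this implies might look like the following rough sketch;
it only finds the length of a physically contiguous run, assumes the pinned
pfns have been collected into an array (which the current code does not do),
and leaves the alignment/order handling a real large iommu_map() call needs
to the eventual implementation:

/* how many of the npage pinned pfns, starting at pfns[0], are contiguous */
static long contiguous_run(const unsigned long *pfns, long npage)
{
	long i;

	for (i = 1; i < npage; i++)
		if (pfns[i] != pfns[0] + i)
			break;

	return i;
}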
> > + vfio_lock_acct(locked);
> > + return 0;
> > +}
> > +
> > +static inline int ranges_overlap(unsigned long start1, size_t size1,
> > + unsigned long start2, size_t size2)
> > +{
> > + return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> > +}
>
> You pass DMA addresses to this, so use dma_addr_t. unsigned long is not
> always large enough.
ok
> What if one of the ranges wraps around (including the legitimate
> possibility of start + size == 0)?
Looks like a bug.
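One possible wraparound-safe form, for illustration only (not the fix that
was eventually applied; it also switches to dma_addr_t per the comment above,
and assumes callers reject ranges that wrap anywhere other than ending
exactly at the top of the address space):

static inline bool ranges_overlap(dma_addr_t start1, size_t size1,
				  dma_addr_t start2, size_t size2)
{
	if (!size1 || !size2)
		return false;

	/* compare inclusive ends; start + size - 1 stays representable
	 * even when start + size wraps to 0 */
	return start1 <= start2 + (size2 - 1) &&
	       start2 <= start1 + (size1 - 1);
}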
> > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + struct vfio_iommu *iommu = filep->private_data;
> > + int ret = -ENOSYS;
>
> -ENOIOCTLCMD or -ENOTTY?
ok
> > +
> > + if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > +
> > + ret = put_user(flags, (u64 __user *)arg);
> > +
> > + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > + struct vfio_dma_map dm;
>
> Whitespace.
yep, will fix
> Any reason not to use a switch?
Personal preference. It got ugly using a switch in vfio_main, trying to
keep variable scope to the case, followed suit here for consistency.
> > +/* Return true if any devices within a group are opened */
> > +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> [snip]
> > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> [snip]
> > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> [snip]
> > +static void __vfio_group_set_iommu(struct vfio_group *group,
> > + struct vfio_iommu *iommu)
>
> ...and so on.
>
> Why all the leading underscores? Doesn't look like you're trying to
> distinguish between this and a more public version with the same name.
__ implies it should be called under vfio.lock.
> > +/* Get a new device file descriptor. This will open the iommu, setting
> > + * the current->mm ownership if it's not already set. It's difficult to
> > + * specify the requirements for matching a user supplied buffer to a
> > + * device, so we use a vfio driver callback to test for a match. For
> > + * PCI, dev_name(dev) is unique, but other drivers may require including
> > + * a parent device string. */
> > +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> > +{
> > + struct vfio_iommu *iommu = group->iommu;
> > + struct list_head *gpos;
> > + int ret = -ENODEV;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + if (!iommu->domain) {
> > + ret = __vfio_open_iommu(iommu);
> > + if (ret)
> > + goto out;
> > + }
> > +
> > + list_for_each(gpos, &iommu->group_list) {
> > + struct list_head *dpos;
> > +
> > + group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > + list_for_each(dpos, &group->device_list) {
> > + struct vfio_device *device;
> > +
> > + device = list_entry(dpos,
> > + struct vfio_device, device_next);
> > +
> > + if (device->ops->match(device->dev, buf)) {
>
> If there's a match, we're done with the loop -- might as well break out
> now rather than indent everything else.
Sure, even just changing the polarity and making this a continue would
help the formatting below.
> > + struct file *file;
> > +
> > + if (device->ops->get(device->device_data)) {
> > + ret = -EFAULT;
> > + goto out;
> > + }
>
> Why does a failure of get() result in -EFAULT? -EFAULT is for bad user
> addresses.
I'll just return what get() returns.
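Combining those two changes, the inner loop might read (a sketch):

	list_for_each(dpos, &group->device_list) {
		struct vfio_device *device;

		device = list_entry(dpos, struct vfio_device, device_next);

		if (!device->ops->match(device->dev, buf))
			continue;	/* polarity flipped, skip non-matches */

		ret = device->ops->get(device->device_data);
		if (ret)		/* propagate get()'s own error */
			goto out;

		/* ... get_unused_fd()/anon_inode_getfile() setup as before ... */
		break;
	}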
> > +
> > + /* We can't use anon_inode_getfd(), like above
> > + * because we need to modify the f_mode flags
> > + * directly to allow more than just ioctls */
> > + ret = get_unused_fd();
> > + if (ret < 0) {
> > + device->ops->put(device->device_data);
> > + goto out;
> > + }
> > +
> > + file = anon_inode_getfile("[vfio-device]",
> > + &vfio_device_fops,
> > + device, O_RDWR);
> > + if (IS_ERR(file)) {
> > + put_unused_fd(ret);
> > + ret = PTR_ERR(file);
> > + device->ops->put(device->device_data);
> > + goto out;
> > + }
>
> Maybe cleaner with goto-based error management?
I didn't see enough duplication creeping in to try that here.
> > +/* Add a new device to the vfio framework with associated vfio driver
> > + * callbacks. This is the entry point for vfio drivers to register devices. */
> > +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> > +{
> > + struct list_head *pos;
> > + struct vfio_group *group = NULL;
> > + struct vfio_device *device = NULL;
> > + unsigned int groupid;
> > + int ret = 0;
> > + bool new_group = false;
> > +
> > + if (!ops)
> > + return -EINVAL;
> > +
> > + if (iommu_device_group(dev, &groupid))
> > + return -ENODEV;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + list_for_each(pos, &vfio.group_list) {
> > + group = list_entry(pos, struct vfio_group, group_next);
> > + if (group->groupid == groupid)
> > + break;
> > + group = NULL;
> > + }
>
> Factor this into vfio_dev_to_group() (and likewise for other such lookups)?
Yeah, this ends up getting duplicated a few places.
> > + if (!group) {
> > + int minor;
> > +
> > + if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + group = kzalloc(sizeof(*group), GFP_KERNEL);
> > + if (!group) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + group->groupid = groupid;
> > + INIT_LIST_HEAD(&group->device_list);
> > +
> > + ret = idr_get_new(&vfio.idr, group, &minor);
> > + if (ret == 0 && minor > MINORMASK) {
> > + idr_remove(&vfio.idr, minor);
> > + kfree(group);
> > + ret = -ENOSPC;
> > + goto out;
> > + }
> > +
> > + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > + device_create(vfio.class, NULL, group->devt,
> > + group, "%u", groupid);
> > +
> > + group->bus = dev->bus;
> > + list_add(&group->group_next, &vfio.group_list);
>
> Factor out into vfio_create_group()?
sounds good
> > + new_group = true;
> > + } else {
> > + if (group->bus != dev->bus) {
> > + printk(KERN_WARNING
> > + "Error: IOMMU group ID conflict. Group ID %u "
> > + "on both bus %s and %s\n", groupid,
> > + group->bus->name, dev->bus->name);
> > + ret = -EFAULT;
> > + goto out;
> > + }
>
> It took me a little while to figure out that this was comparing bus
> types, not actual bus instances (which would be an inappropriate
> restriction). :-P
>
> Still, isn't it what we really care about that it's the same IOMMU
> domain? Couldn't different bus types share an iommu_ops?
Nope, iommu_ops registration is now per bus_type. Also, Christian
pointed out that groupid is really only guaranteed to be unique per
bus_type so I've been updating groupid comparisons to compare the
groupid, bus_type pair.
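Folding that into the lookup helper suggested earlier might look like this
sketch (the function name is illustrative):

static struct vfio_group *vfio_group_find(unsigned int groupid,
					  struct bus_type *bus)
{
	struct list_head *pos;

	list_for_each(pos, &vfio.group_list) {
		struct vfio_group *group;

		group = list_entry(pos, struct vfio_group, group_next);
		/* groupids are only guaranteed unique per bus_type */
		if (group->groupid == groupid && group->bus == bus)
			return group;
	}

	return NULL;
}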
> And again, -EFAULT isn't the right error.
Ok.
Thank you very much for the comments,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-14 20:54 ` Alex Williamson
@ 2011-11-14 21:46 ` Alex Williamson
2011-11-14 22:26 ` Scott Wood
2011-11-15 2:29 ` Alex Williamson
2 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-14 21:46 UTC (permalink / raw)
To: Scott Wood
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, aafabbri,
B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
linux-pci
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote:
> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
> > On 11/03/2011 03:12 PM, Alex Williamson wrote:
> > > + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > > + unsigned long pfn = 0;
> > > +
> > > + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > > + if (ret) {
> > > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > > + return ret;
> > > + }
> > > +
> > > + /* Only add actual locked pages to accounting */
> > > + if (!is_invalid_reserved_pfn(pfn))
> > > + locked++;
> > > +
> > > + ret = iommu_map(iommu->domain, iova,
> > > + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> > > + if (ret) {
> > > + /* Back out mappings on error */
> > > + put_pfn(pfn, rdwr);
> > > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > > + return ret;
> > > + }
> > > + }
> >
> > There's no way to hand this stuff to the IOMMU driver in chunks larger
> > than a page? That's going to be a problem for our IOMMU, which wants to
> > deal with large windows.
>
> There is, this is just a simple implementation that maps individual
> pages. We "just" need to determine physically contiguous chunks and
> mlock them instead of using get_user_pages. The current implementation
> is much like how KVM maps iommu pages, but there shouldn't be a user API
> change to try to use larger chunks.  We want this for IOMMU large page
> support too.
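Very roughly, the map loop would turn into something like the below (sketch
only; locked page accounting, the error backout, and iova/pfn alignment to
the chunk order are all glossed over, and vfio_count_contiguous() is a
made-up helper that pins pages and reports how many have consecutive pfns):

static int vfio_dma_map_chunked(struct vfio_iommu *iommu, unsigned long iova,
				unsigned long vaddr, long npage, int prot)
{
	long i = 0;

	while (i < npage) {
		unsigned long pfn;
		long chunk;
		int ret;

		chunk = vfio_count_contiguous(vaddr + NPAGE_TO_SIZE(i),
					      npage - i, &pfn);
		if (chunk <= 0)
			return chunk ? (int)chunk : -EFAULT;

		/* iommu_map() wants a power-of-two page order */
		chunk = 1L << ilog2(chunk);

		ret = iommu_map(iommu->domain, iova + NPAGE_TO_SIZE(i),
				(phys_addr_t)pfn << PAGE_SHIFT,
				ilog2(chunk), prot);
		if (ret)
			return ret;

		i += chunk;
	}
	return 0;
}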
Also, at one point intel-iommu didn't allow sub-ranges to be unmapped;
an unmap of a single page would unmap the entire original mapping that
contained that page. That made it easier to map each page individually
for the flexibility it provided on unmap. I need to see if we still
have that restriction. Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-14 20:54 ` Alex Williamson
2011-11-14 21:46 ` Alex Williamson
@ 2011-11-14 22:26 ` Scott Wood
2011-11-14 22:48 ` Alexander Graf
2011-11-15 2:29 ` Alex Williamson
2 siblings, 1 reply; 62+ messages in thread
From: Scott Wood @ 2011-11-14 22:26 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, aafabbri,
B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
linux-pci
On 11/14/2011 02:54 PM, Alex Williamson wrote:
> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
>> What are the semantics of "desired and/or returned dma address"?
>
> I believe the original intention was that a user could leave dmaaddr
> clear and let the iommu layer provide an iova address. The iommu api
> has since evolved and that mapping scheme really isn't present anymore.
> We'll currently fail if we can't map the requested address.  I'll update
> the docs to make that be the definition.
OK... if there is any desire in the future to have the kernel pick an
address (which could be useful for IOMMUs that don't set
VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this,
since zero could be a valid address to request (doesn't mean "clear").
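E.g. something like this (flag name and bit made up here) would make the
intent explicit:

/* let the kernel choose the IOVA and return it in dmaaddr */
#define VFIO_DMA_MAP_FLAG_IOVA_ANY	(1 << 1)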
>> Note that the "length of structure" approach means that ioctl numbers
>> will change whenever this grows -- perhaps we should avoid encoding the
>> struct size into these ioctls?
>
> How so? What's described here is effectively the base size. If we
> later add feature foo requiring additional fields, we set a flag, change
> the size, and tack those fields onto the end. The kernel side should
> balk if the size doesn't match what it expects from the flags it
> understands (which I think I probably need to be more strict about).
The size of the struct is encoded into the ioctl number via the _IOWR()
macro. If we want the struct to be growable in the future, we should
leave that out and just use _IO(). Otherwise if the size of the struct
changes, the ioctl number changes. This is annoying for old userspace
plus new kernel (have to add compat entries to the switch), and broken
for old kernel plus new userspace.
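For reference, the struct size gets folded into the number by the _IOC()
encoding, so compare:

/* sizeof(struct vfio_dma_map) is encoded -- the number changes whenever
 * the struct grows: */
#define VFIO_IOMMU_MAP_DMA	_IOWR(';', 106, struct vfio_dma_map)

/* versus a fixed number, with the size carried only in the len field: */
#define VFIO_IOMMU_MAP_DMA	_IO(';', 106)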
>> Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO? It
>> would still be useful for devices which don't do DMA, or where we accept
>> the lack of protection/translation (e.g. we have a customer that wants
>> to do KVM device assignment on one of our lower-end chips that lacks an
>> IOMMU).
>
> Ugh. I'm not really onboard with it given that we're trying to sell
> vfio as a secure user space driver interface with iommu-based
> protection.
That's its main use case, but it doesn't make much sense to duplicate
the non-iommu-related bits for other use cases.
This applies at runtime too, some devices don't do DMA at all (and thus
may not be part of an IOMMU group, even if there is an IOMMU present for
other devices -- could be considered a standalone group of one device,
with a null IOMMU backend). Support for such devices can wait, but it's
good to keep the possibility in mind.
-Scott
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-14 22:26 ` Scott Wood
@ 2011-11-14 22:48 ` Alexander Graf
0 siblings, 0 replies; 62+ messages in thread
From: Alexander Graf @ 2011-11-14 22:48 UTC (permalink / raw)
To: Scott Wood
Cc: Alex Williamson, <chrisw@sous-sol.org>,
<aik@au1.ibm.com>, <pmac@au1.ibm.com>,
<dwg@au1.ibm.com>, <joerg.roedel@amd.com>,
<benve@cisco.com>, <aafabbri@cisco.com>,
<B08248@freescale.com>, <B07421@freescale.com>,
<avi@redhat.com>, <konrad.wilk@oracle.com>,
<kvm@vger.kernel.org>, <qemu-devel@nongnu.org>,
<iommu@lists.linux-foundation.org>,
<linux-pci@vger.kernel.org>
On 14.11.2011 at 23:26, Scott Wood <scottwood@freescale.com> wrote:
> On 11/14/2011 02:54 PM, Alex Williamson wrote:
>> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
>>> What are the semantics of "desired and/or returned dma address"?
>>
>> I believe the original intention was that a user could leave dmaaddr
>> clear and let the iommu layer provide an iova address. The iommu api
>> has since evolved and that mapping scheme really isn't present anymore.
>> We'll currently fail if we can't map the requested address.  I'll update
>> the docs to make that be the definition.
>
> OK... if there is any desire in the future to have the kernel pick an
> address (which could be useful for IOMMUs that don't set
> VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this,
> since zero could be a valid address to request (doesn't mean "clear").
>
>>> Note that the "length of structure" approach means that ioctl numbers
>>> will change whenever this grows -- perhaps we should avoid encoding the
>>> struct size into these ioctls?
>>
>> How so? What's described here is effectively the base size. If we
>> later add feature foo requiring additional fields, we set a flag, change
>> the size, and tack those fields onto the end. The kernel side should
>> balk if the size doesn't match what it expects from the flags it
>> understands (which I think I probably need to be more strict about).
>
> The size of the struct is encoded into the ioctl number via the _IOWR()
> macro. If we want the struct to be growable in the future, we should
> leave that out and just use _IO(). Otherwise if the size of the struct
> changes, the ioctl number changes. This is annoying for old userspace
> plus new kernel (have to add compat entries to the switch), and broken
> for old kernel plus new userspace.
Avi wanted to write up a patch allowing ioctls with arbitrary size, for exactly this purpose.
>
>>> Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO? It
>>> would still be useful for devices which don't do DMA, or where we accept
>>> the lack of protection/translation (e.g. we have a customer that wants
>>> to do KVM device assignment on one of our lower-end chips that lacks an
>>> IOMMU).
>>
>> Ugh. I'm not really onboard with it given that we're trying to sell
>> vfio as a secure user space driver interface with iommu-based
>> protection.
>
> That's its main use case, but it doesn't make much sense to duplicate
> the non-iommu-related bits for other use cases.
>
> This applies at runtime too, some devices don't do DMA at all (and thus
> may not be part of an IOMMU group, even if there is an IOMMU present for
> other devices -- could be considered a standalone group of one device,
> with a null IOMMU backend). Support for such devices can wait, but it's
> good to keep the possibility in mind.
I agree. Potentially backing a device with a nop iommu also makes testing easier.
Alex
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-14 20:54 ` Alex Williamson
2011-11-14 21:46 ` Alex Williamson
2011-11-14 22:26 ` Scott Wood
@ 2011-11-15 2:29 ` Alex Williamson
2 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-15 2:29 UTC (permalink / raw)
To: Scott Wood
Cc: chrisw, aik, pmac, dwg, joerg.roedel, agraf, benve, aafabbri,
B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
linux-pci
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote:
> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
> > On 11/03/2011 03:12 PM, Alex Williamson wrote:
> > > + int (*get)(void *);
> > > + void (*put)(void *);
> > > + ssize_t (*read)(void *, char __user *,
> > > + size_t, loff_t *);
> > > + ssize_t (*write)(void *, const char __user *,
> > > + size_t, loff_t *);
> > > + long (*ioctl)(void *, unsigned int, unsigned long);
> > > + int (*mmap)(void *, struct vm_area_struct *);
> > > +};
> >
> > When defining an API, please do not omit parameter names.
>
> ok
>
> > Should specify what the driver is supposed to do with get/put -- I guess
> > not try to unbind when the count is nonzero? Races could still lead the
> > unbinder to be blocked, but I guess it lets the driver know when it's
> > likely to succeed.
>
> Right, for the pci bus driver, it's mainly for reference counting,
> including the module_get to prevent vfio-pci from being unloaded. On
> the first get for a device, we also do a pci_enable() and pci_disable()
> on last put. I'll try to clarify in the docs.
Looking at these again, I should just rename them to open/release. That
matches the points when they're called. I suspect I started with just
reference counting and it grew to more of a full blown open/release.
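I.e. something like this (with the parameter names you asked for above;
still subject to change):

struct vfio_device_ops {
	bool	(*match)(struct device *dev, char *buf);
	int	(*open)(void *device_data);
	void	(*release)(void *device_data);
	ssize_t	(*read)(void *device_data, char __user *buf,
			size_t count, loff_t *ppos);
	ssize_t	(*write)(void *device_data, const char __user *buf,
			 size_t count, loff_t *ppos);
	long	(*ioctl)(void *device_data, unsigned int cmd,
			 unsigned long arg);
	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
};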
Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
[not found] <20111103195452.21259.93021.stgit@bling.home>
` (4 preceding siblings ...)
2011-11-12 0:14 ` Scott Wood
@ 2011-11-15 6:34 ` David Gibson
2011-11-15 18:01 ` Alex Williamson
2011-11-15 20:10 ` Scott Wood
2011-11-29 1:52 ` Alexey Kardashevskiy
6 siblings, 2 replies; 62+ messages in thread
From: David Gibson @ 2011-11-15 6:34 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, aik, pmac, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects. See Documentation/vfio.txt included in
> this patch for user and kernel API description.
>
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the driver version 0.2.  It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
> Fingers crossed, this is the last RFC for VFIO, but we need
> the iommu group support before this can go upstream
> (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> hoping this helps push that along.
>
> Since the last posting, this version completely modularizes
> the device backends and better defines the APIs between the
> core VFIO code and the device backends. I expect that we
> might also adopt a modular IOMMU interface as iommu_ops learns
> about different types of hardware. Also many, many cleanups.
> Check the complete git history for details:
>
> git://github.com/awilliam/linux-vfio.git vfio-ng
>
> (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
>
> This version, along with the supporting VFIO PCI backend can
> be found here:
>
> git://github.com/awilliam/linux-vfio.git vfio-next-20111103
>
> I've held off on implementing a kernel->user signaling
> mechanism for now since the previous netlink version produced
> too many gag reflexes. It's easy enough to set a bit in the
> group flags to indicate such support in the future, so I
> think we can move ahead without it.
>
> Appreciate any feedback or suggestions. Thanks,
>
> Alex
>
> Documentation/ioctl/ioctl-number.txt | 1
> Documentation/vfio.txt | 304 +++++++++
> MAINTAINERS | 8
> drivers/Kconfig | 2
> drivers/Makefile | 1
> drivers/vfio/Kconfig | 8
> drivers/vfio/Makefile | 3
> drivers/vfio/vfio_iommu.c | 530 ++++++++++++++++
> drivers/vfio/vfio_main.c | 1151 ++++++++++++++++++++++++++++++++++
> drivers/vfio/vfio_private.h | 34 +
> include/linux/vfio.h | 155 +++++
> 11 files changed, 2197 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/vfio.txt
> create mode 100644 drivers/vfio/Kconfig
> create mode 100644 drivers/vfio/Makefile
> create mode 100644 drivers/vfio/vfio_iommu.c
> create mode 100644 drivers/vfio/vfio_main.c
> create mode 100644 drivers/vfio/vfio_private.h
> create mode 100644 include/linux/vfio.h
>
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 54078ed..59d01e4 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -88,6 +88,7 @@ Code Seq#(hex) Include File Comments
> and kernel/power/user.c
> '8' all SNP8023 advanced NIC card
> <mailto:mcr@solidum.com>
> +';' 64-76 linux/vfio.h
> '@' 00-0F linux/radeonfb.h conflict!
> '@' 00-0F drivers/video/aty/aty128fb.c conflict!
> 'A' 00-1F linux/apm_bios.h conflict!
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> new file mode 100644
> index 0000000..5866896
> --- /dev/null
> +++ b/Documentation/vfio.txt
> @@ -0,0 +1,304 @@
> +VFIO - "Virtual Function I/O"[1]
> +-------------------------------------------------------------------------------
> +Many modern system now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as
> +well as POWER systems with Partitionable Endpoints (PEs) and even
> +embedded powerpc systems (technology name unknown). The VFIO driver
> +is an IOMMU/device agnostic framework for exposing direct device
> +access to userspace, in a secure, IOMMU protected environment. In
> +other words, this allows safe, non-privileged, userspace drivers.
It's perhaps worth emphasising that "safe" depends on the hardware
being sufficiently well behaved. BenH, I know, thinks there are a
*lot* of cards that, e.g. have debug registers that allow a backdoor
to their own config space via MMIO, which would bypass vfio's
filtering of config space access. And that's before we even get into
the varying degrees of completeness in the isolation provided by
different IOMMUs.
> +Why do we want that? Virtual machines often make use of direct device
> +access ("device assignment") when configured for the highest possible
> +I/O performance. From a device and host perspective, this simply turns
> +the VM into a userspace driver, with the benefits of significantly
> +reduced latency, higher bandwidth, and direct use of bare-metal device
> +drivers[2].
> +
> +Some applications, particularly in the high performance computing
> +field, also benefit from low-overhead, direct device access from
> +userspace. Examples include network adapters (often non-TCP/IP based)
> +and compute accelerators. Previous to VFIO, these drivers needed to
s/Previous/Prior/ although that may be a .us vs .au usage thing.
> +go through the full development cycle to become proper upstream driver,
> +be maintained out of tree, or make use of the UIO framework, which
> +has no notion of IOMMU protection, limited interrupt support, and
> +requires root privileges to access things like PCI configuration space.
> +
> +The VFIO driver framework intends to unify these, replacing both the
> +KVM PCI specific device assignment currently used as well as provide
> +a more secure, more featureful userspace driver environment than UIO.
> +
> +Groups, Devices, IOMMUs, oh my
> +-------------------------------------------------------------------------------
> +
> +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> +can't always distinguish transactions from each individual device in
> +the system. Sometimes this is because of the IOMMU design, such as with
> +PEs, other times it's caused by the I/O topology, for instance a
> +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> +devices created by these restrictions IOMMU groups (or just "groups" for
> +this document).
> +
> +The IOMMU cannot distinguish transactions between the individual devices
> +within the group, therefore the group is the basic unit of ownership for
> +a userspace process. Because of this, groups are also the primary
> +interface to both devices and IOMMU domains in VFIO.
> +
> +The VFIO representation of groups is created as devices are added into
> +the framework by a VFIO bus driver. The vfio-pci module is an example
> +of a bus driver. This module registers devices along with a set of bus
> +specific callbacks with the VFIO core. These callbacks provide the
> +interfaces later used for device access. As each new group is created,
> +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> +character device.
Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
bus driver is per bus type, not per bus instance. But grouping
constraints could be per bus instance, if you have a couple of
different models of PCI host bridge with IOMMUs of different
capabilities built in, for example.
> +In addition to the device enumeration and callbacks, the VFIO bus driver
> +also provides a traditional device driver and is able to bind to devices
> +on it's bus. When a device is bound to the bus driver it's available to
> +VFIO. When all the devices within a group are bound to their bus drivers,
> +the group becomes "viable" and a user with sufficient access to the VFIO
> +group chardev can obtain exclusive access to the set of group devices.
> +
> +As documented in linux/vfio.h, several ioctls are provided on the
> +group chardev:
> +
> +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> +
> +The last two ioctls return new file descriptors for accessing
> +individual devices within the group and programming the IOMMU. Each of
> +these new file descriptors provide their own set of file interfaces.
> +These ioctls will fail if any of the devices within the group are not
> +bound to their VFIO bus driver. Additionally, when either of these
> +interfaces are used, the group is then bound to the struct_mm of the
> +caller. The GET_FLAGS ioctl can be used to view the state of the group.
> +
> +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> +new IOMMU domain is created and all of the devices in the group are
> +attached to it. This is the only way to ensure full IOMMU isolation
> +of the group, but potentially wastes resources and cycles if the user
> +intends to manage multiple groups with the same set of IOMMU mappings.
> +VFIO therefore provides a group MERGE and UNMERGE interface, which
> +allows multiple groups to share an IOMMU domain. Not all IOMMUs allow
> +arbitrary groups to be merged, so the user should assume merging is
> +opportunistic.
I do not think "opportunistic" means what you think it means..
> A new group, with no open device or IOMMU file
> +descriptors, can be merged into an existing, in-use, group using the
> +MERGE ioctl. A merged group can be unmerged using the UNMERGE ioctl
> +once all of the device file descriptors for the group being merged
> +"out" are closed.
> +
> +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> +essentially fungible between group file descriptors (ie. if device A
I do not think "fungible" means what you think it means, either.
> +is in group X, and X is merged with Y, a file descriptor for A can be
> +retrieved using GET_DEVICE_FD on Y. Likewise, GET_IOMMU_FD returns a
> +file descriptor referencing the same internal IOMMU object from either
> +X or Y). Merged groups can be dissolved either explicitly with UNMERGE
> +or automatically when ALL file descriptors for the merged group are
> +closed (all IOMMUs, all devices, all groups).
Blech. I'm really not liking this merge/unmerge API as it stands,
it's horribly confusing. At the very least, we need some better
terminology. We need some term for the metagroups; supergroups; iommu
domains or-at-least-they-will-be-once-we-open-the-iommu or
whathaveyous.
The first confusing thing about this interface is that each open group
handle actually refers to two different things; the original group you
opened and the metagroup it's a part of. For the GET_IOMMU_FD and
GET_DEVICE_FD operations, you're using the metagroup and two "merged"
group handles are interchangeable. For other MERGE and especially
UNMERGE operations, it matters which is the original group.
The semantics of "merge" and "unmerge" under those names are really
non-obvious. Merge kind of has to merge two whole metagroups, but
it's unclear if unmerge reverses one merge, or just takes out one
(atom) group. These operations need better names, at least.
Then it's unclear what order you can do various operations, and which
order you can open and close various things. You can kind of figure
it out but it takes far more thinking than it should.
So at the _very_ least, we need to invent new terminology and find a
much better way of describing this API's semantics. I still think an
entirely different interface, where metagroups are created from
outside with a lifetime that's not tied to an fd would be a better
idea.
Now, you specify that you can't use a group as the second argument of
a merge if it already has an open iommu, but it's not clear from the
doc if you can merge things into a group with an open iommu. Banning
this would make life simpler, because the IOMMU's effective
capabilities may change if you add more devices to the domain. That's
yet another non-obvious constraint in the interface ordering, though.
> +The IOMMU file descriptor provides this set of ioctls:
> +
> +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
> +
> +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> +We currently only support IOMMU domains that are able to map any
> +virtual address to any IOVA. This is indicated by the MAP_ANY flag.
So. I tend to think of an IOMMU mapping IOVAs to memory pages, rather
than memory pages to IOVAs. The IOMMU itself, of course maps to
physical addresses, and the meaning of "virtual address" in this
context is not really clear. I think you would be better off saying
the IOMMU can map any IOVA to any memory page. From a hardware POV
that means any physical address, but of course for a VFIO user a page
is specified by its process virtual address.
I think we need to pin exactly what "MAP_ANY" means down better. Now,
VFIO is pretty much a lost cause if you can't map any normal process
memory page into the IOMMU, so I think the only thing that is really
covered is IOVAs. But saying "can map any IOVA" is not clear, because
if you can't map it, it's not a (valid) IOVA. Better to say that
IOVAs can be any 64-bit value, which I think is what you really mean
here.
Of course, since POWER is a platform where this is *not* true, I'd
prefer to have something giving the range of valid IOVAs in the core
to start with.
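For example (entirely made-up names, just to show the shape), the IOMMU fd
could report the window through a proper info struct rather than bare flags:

struct vfio_iommu_info {
	__u64	len;		/* length of structure */
	__u64	flags;
	__u64	iova_start;	/* lowest valid IOVA */
	__u64	iova_size;	/* size of the valid IOVA window */
};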
> +
> +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> +and unmapping IOVAs to process virtual addresses:
> +
> +struct vfio_dma_map {
> + __u64 len; /* length of structure */
Thanks for adding these structure length fields. But I think they
should be called something other than 'len', which is likely to be
confused with size (or some other length that's actually related to
the operation's parameters). Better to call it 'structlen' or
'argslen' or something.
> + __u64 vaddr; /* process virtual addr */
> + __u64 dmaaddr; /* desired and/or returned dma address */
> + __u64 size; /* size in bytes */
> + __u64 flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
Make it independent READ and WRITE flags from the start. Not all
combinations will be be valid on all hardware, but that way we have
the possibilities covered without having to use strange encodings
later.
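i.e.:

#define VFIO_DMA_MAP_FLAG_READ	(1 << 0)	/* device may read from the mapping */
#define VFIO_DMA_MAP_FLAG_WRITE	(1 << 1)	/* device may write to the mapping */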
> +};
> +
> +Current users of VFIO use relatively static DMA mappings, not requiring
> +high frequency turnover. As new users are added, it's expected that the
> +IOMMU file descriptor will evolve to support new mapping interfaces, this
> +will be reflected in the flags and may present new ioctls and file
> +interfaces.
> +
> +The device GET_FLAGS ioctl is intended to return basic device type and
> +indicate support for optional capabilities. Flags currently include whether
> +the device is PCI or described by Device Tree, and whether the RESET ioctl
> +is supported:
> +
> +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
TBH, I don't think the VFIO for DT stuff is mature enough yet to be in
an initial infrastructure patch, though we should certainly be
discussing it as an add-on patch.
> + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> +
> +The MMIO and IOP resources used by a device are described by regions.
> +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> +
> +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> +
> +Regions are described by a struct vfio_region_info, which is retrieved by
> +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> +the desired region (0 based index). Note that devices may implement zero
> +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> +mapping).
So, I think you're saying that a zero-sized region is used to encode a
NOP region, that is, to basically put a "no region here" in between
valid region indices. You should spell that out.
[Incidentally, any chance you could borrow one of RH's tech writers
for this? I'm afraid you seem to lack the knack for clear and easily
read documentation]
> +struct vfio_region_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* region number */
> + __u64 size; /* size in bytes of region */
> + __u64 offset; /* start offset of region */
> + __u64 flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
Again having separate read and write bits from the start will save
strange encodings later.
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> + __u64 phys; /* physical address of region */
> +};
I notice there is no field for "type" e.g. MMIO vs. PIO vs. config
space for PCI. If you added that having a NONE type might be a
clearer way of encoding a non-region than just having size==0.
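For example (field and values invented here):

	__u32	type;
#define VFIO_REGION_TYPE_NONE		0	/* hole in the index space */
#define VFIO_REGION_TYPE_MMIO		1
#define VFIO_REGION_TYPE_PIO		2
#define VFIO_REGION_TYPE_PCI_CONFIG	3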
> +
> +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> +
> +The offset indicates the offset into the device file descriptor which
> +accesses the given range (for read/write/mmap/seek). Flags indicate the
> +available access types and validity of optional fields. For instance
> +the phys field may only be valid for certain devices types.
> +
> +Interrupts are described using a similar interface. GET_NUM_IRQS
> +reports the number of IRQ indexes for the device.
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* IRQ number */
> + __u32 count; /* number of individual IRQs */
Is there a reason for allowing irqs in batches like this, rather than
having each MSI be reflected by a separate irq_info?
> + __u64 flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> +};
> +
> +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> +type to index mapping).
I know what you mean, but you need a clearer way to express it.
> +Information about each index can be retrieved using the GET_IRQ_INFO
> +ioctl, used much like GET_REGION_INFO.
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> +
> +Individual indexes can describe single or sets of IRQs. This provides the
> +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> +
> +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> +as shown below, are used to pass the IRQ info index, the number of eventfds,
> +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> +
> +When a level triggered interrupt is signaled, the interrupt is masked
> +on the host. This prevents an unresponsive userspace driver from
> +continuing to interrupt the host system. After servicing the interrupt,
> +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> +triggered interrupts implicitly have a count of 1 per index.
This is a silly restriction. Even PCI devices can have up to 4 LSIs
on a function in theory, though no-one ever does. Embedded devices
can and do have multiple level interrupts.
> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
> +
> +Level triggered interrupts can also be unmasked using an irqfd. Use
> +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> +
> +When supported, as indicated by the device flags, reset the device.
> +
> +#define VFIO_DEVICE_RESET _IO(';', 116)
> +
> +Device tree devices also include ioctls for further defining the
> +device tree properties of the device:
> +
> +struct vfio_dtpath {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u64 flags;
> +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> + char *path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u32 prop_type;
> + __u32 prop_index;
> + __u64 flags;
> +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> +
> +
> +VFIO bus driver API
> +-------------------------------------------------------------------------------
> +
> +Bus drivers, such as PCI, have three jobs:
> + 1) Add/remove devices from vfio
> + 2) Provide vfio_device_ops for device access
> + 3) Device binding and unbinding
> +
> +When initialized, the bus driver should enumerate the devices on it's
s/it's/its/
> +bus and call vfio_group_add_dev() for each device. If the bus supports
> +hotplug, notifiers should be enabled to track devices being added and
> +removed. vfio_group_del_dev() removes a previously added device from
> +vfio.
> +
> +Adding a device registers a vfio_device_ops function pointer structure
> +for the device:
> +
> +struct vfio_device_ops {
> + bool (*match)(struct device *, char *);
> + int (*get)(void *);
> + void (*put)(void *);
> + ssize_t (*read)(void *, char __user *,
> + size_t, loff_t *);
> + ssize_t (*write)(void *, const char __user *,
> + size_t, loff_t *);
> + long (*ioctl)(void *, unsigned int, unsigned long);
> + int (*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +When a device is bound to the bus driver, the bus driver indicates this
> +to vfio using the vfio_bind_dev() interface. The device_data parameter
> +is a pointer to an opaque data structure for use only by the bus driver.
> +The get, put, read, write, ioctl, and mmap vfio_device_ops all pass
> +this data structure back to the bus driver. When a device is unbound
> +from the bus driver, the vfio_unbind_dev() interface signals this to
> +vfio. This function returns the pointer to the device_data structure
> +registered for the device.
> +
> +As noted previously, a group contains one or more devices, so
> +GROUP_GET_DEVICE_FD needs to identify the specific device being requested.
> +The vfio_device_ops.match callback is used to allow bus drivers to determine
> +the match. For drivers like vfio-pci, it's a simple match to dev_name(),
> +which is unique in the system due to the PCI bus topology, other bus drivers
> +may need to include parent devices to create a unique match, so this is
> +left as a bus driver interface.
> +
> +-------------------------------------------------------------------------------
> +
> +[1] VFIO was originally an acronym for "Virtual Function I/O" in its
> +initial implementation by Tom Lyon while at Cisco. We've since outgrown
> +the acronym, but it's catchy.
> +
> +[2] As always there are trade-offs to virtual machine device
> +assignment that are beyond the scope of VFIO. It's expected that
> +future IOMMU technologies will reduce some, but maybe not all, of
> +these trade-offs.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f05f5f6..4bd5aa0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7106,6 +7106,14 @@ S: Maintained
> F: Documentation/filesystems/vfat.txt
> F: fs/fat/
>
> +VFIO DRIVER
> +M: Alex Williamson <alex.williamson@redhat.com>
> +L: kvm@vger.kernel.org
> +S: Maintained
> +F: Documentation/vfio.txt
> +F: drivers/vfio/
> +F: include/linux/vfio.h
> +
> VIDEOBUF2 FRAMEWORK
> M: Pawel Osciak <pawel@osciak.com>
> M: Marek Szyprowski <m.szyprowski@samsung.com>
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index b5e6f24..e15578b 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
>
> source "drivers/uio/Kconfig"
>
> +source "drivers/vfio/Kconfig"
> +
> source "drivers/vlynq/Kconfig"
>
> source "drivers/virtio/Kconfig"
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 1b31421..5f138b5 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -58,6 +58,7 @@ obj-$(CONFIG_ATM) += atm/
> obj-$(CONFIG_FUSION) += message/
> obj-y += firewire/
> obj-$(CONFIG_UIO) += uio/
> +obj-$(CONFIG_VFIO) += vfio/
> obj-y += cdrom/
> obj-y += auxdisplay/
> obj-$(CONFIG_PCCARD) += pcmcia/
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> new file mode 100644
> index 0000000..9acb1e7
> --- /dev/null
> +++ b/drivers/vfio/Kconfig
> @@ -0,0 +1,8 @@
> +menuconfig VFIO
> + tristate "VFIO Non-Privileged userspace driver framework"
> + depends on IOMMU_API
> + help
> + VFIO provides a framework for secure userspace device drivers.
> + See Documentation/vfio.txt for more details.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> new file mode 100644
> index 0000000..088faf1
> --- /dev/null
> +++ b/drivers/vfio/Makefile
> @@ -0,0 +1,3 @@
> +vfio-y := vfio_main.o vfio_iommu.o
> +
> +obj-$(CONFIG_VFIO) := vfio.o
> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 0000000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c
> @@ -0,0 +1,530 @@
> +/*
> + * VFIO: IOMMU DMA mapping support
> + *
> + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/workqueue.h>
> +
> +#include "vfio_private.h"
> +
> +struct dma_map_page {
> + struct list_head list;
> + dma_addr_t daddr;
> + unsigned long vaddr;
> + int npage;
> + int rdwr;
> +};
> +
> +/*
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> +
> +struct vwork {
> + struct mm_struct *mm;
> + int npage;
> + struct work_struct work;
> +};
> +
> +/* delayed decrement for locked_vm */
> +static void vfio_lock_acct_bg(struct work_struct *work)
> +{
> + struct vwork *vwork = container_of(work, struct vwork, work);
> + struct mm_struct *mm;
> +
> + mm = vwork->mm;
> + down_write(&mm->mmap_sem);
> + mm->locked_vm += vwork->npage;
> + up_write(&mm->mmap_sem);
> + mmput(mm); /* unref mm */
> + kfree(vwork);
> +}
> +
> +static void vfio_lock_acct(int npage)
> +{
> + struct vwork *vwork;
> + struct mm_struct *mm;
> +
> + if (!current->mm) {
> + /* process exited */
> + return;
> + }
> + if (down_write_trylock(¤t->mm->mmap_sem)) {
> + current->mm->locked_vm += npage;
> + up_write(¤t->mm->mmap_sem);
> + return;
> + }
> + /*
> + * Couldn't get mmap_sem lock, so must setup to decrement
> + * mm->locked_vm later. If locked_vm were atomic, we wouldn't
> + * need this silliness
> + */
> + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> + if (!vwork)
> + return;
> + mm = get_task_mm(current); /* take ref mm */
> + if (!mm) {
> + kfree(vwork);
> + return;
> + }
> + INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> + vwork->mm = mm;
> + vwork->npage = npage;
> + schedule_work(&vwork->work);
> +}
> +
> +/* Some mappings aren't backed by a struct page, for example an mmap'd
> + * MMIO range for our own or another device. These use a different
> + * pfn conversion and shouldn't be tracked as locked pages. */
> +static int is_invalid_reserved_pfn(unsigned long pfn)
> +{
> + if (pfn_valid(pfn)) {
> + int reserved;
> + struct page *tail = pfn_to_page(pfn);
> + struct page *head = compound_trans_head(tail);
> + reserved = PageReserved(head);
> + if (head != tail) {
> + /* "head" is not a dangling pointer
> + * (compound_trans_head takes care of that)
> + * but the hugepage may have been split
> + * from under us (and we may not hold a
> + * reference count on the head page so it can
> + * be reused before we run PageReferenced), so
> + * we've to check PageTail before returning
> + * what we just read.
> + */
> + smp_rmb();
> + if (PageTail(tail))
> + return reserved;
> + }
> + return PageReserved(tail);
> + }
> +
> + return true;
> +}
> +
> +static int put_pfn(unsigned long pfn, int rdwr)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + struct page *page = pfn_to_page(pfn);
> + if (rdwr)
> + SetPageDirty(page);
> + put_page(page);
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* Unmap DMA region */
> +/* dgate must be held */
> +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> + int npage, int rdwr)
Use of "read" and "write" in DMA can often be confusing, since it's
not always clear if you're talking from the perspective of the CPU or
the device (_writing_ data to a device will usually involve it doing
DMA _reads_ from memory). It's often best to express things as DMA
direction, 'to device', and 'from device' instead.
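i.e. something modelled on the DMA API's enum dma_data_direction, for
example:

enum vfio_dma_dir {
	VFIO_DMA_TO_DEVICE,		/* device reads from memory */
	VFIO_DMA_FROM_DEVICE,		/* device writes to memory */
	VFIO_DMA_BIDIRECTIONAL,
};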
> +{
> + int i, unlocked = 0;
> +
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> + unsigned long pfn;
> +
> + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> + if (pfn) {
> + iommu_unmap(iommu->domain, iova, 0);
> + unlocked += put_pfn(pfn, rdwr);
> + }
> + }
> + return unlocked;
> +}
> +
> +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> + unsigned long npage, int rdwr)
> +{
> + int unlocked;
> +
> + unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> + vfio_lock_acct(-unlocked);
Have you checked that your accounting will work out if the user maps
the same memory page to multiple IOVAs?
> +}
> +
> +/* Unmap ALL DMA regions */
> +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos, *pos2;
> + struct dma_map_page *mlp;
> +
> + mutex_lock(&iommu->dgate);
> + list_for_each_safe(pos, pos2, &iommu->dm_list) {
> + mlp = list_entry(pos, struct dma_map_page, list);
> + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> + list_del(&mlp->list);
> + kfree(mlp);
> + }
> + mutex_unlock(&iommu->dgate);
Ouch, no good at all. Keeping track of every DMA map is no good on
POWER or other systems where IOMMU operations are a hot path. I think
you'll need an iommu specific hook for this instead, which uses
whatever data structures are natural for the IOMMU. For example a
1-level pagetable, like we use on POWER will just zero every entry.
> +}
> +
> +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> +{
> + struct page *page[1];
> + struct vm_area_struct *vma;
> + int ret = -EFAULT;
> +
> + if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> + *pfn = page_to_pfn(page[0]);
> + return 0;
> + }
> +
> + down_read(¤t->mm->mmap_sem);
> +
> + vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +
> + if (vma && vma->vm_flags & VM_PFNMAP) {
> + *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> + if (is_invalid_reserved_pfn(*pfn))
> + ret = 0;
> + }
It's kind of nasty that you take gup_fast(), already designed to grab
pointers for multiple user pages, then just use it one page at a time,
even for a big map.
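E.g. (error handling and the VM_PFNMAP fallback omitted):

	struct page *pages[64];	/* arbitrary batch size for illustration */
	long pinned;

	pinned = get_user_pages_fast(vaddr, min_t(long, npage, 64),
				     rdwr, pages);
	/* ... then walk 'pinned' entries of pages[] rather than calling
	 * get_user_pages_fast() once per page ... */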
> + up_read(¤t->mm->mmap_sem);
> +
> + return ret;
> +}
> +
> +/* Map DMA region */
> +/* dgate must be held */
> +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> + unsigned long vaddr, int npage, int rdwr)
iova should be a dma_addr_t. Bus address size need not match virtual
address size, and may not fit in an unsigned long.
> +{
> + unsigned long start = iova;
> + int i, ret, locked = 0, prot = IOMMU_READ;
> +
> + /* Verify pages are not already mapped */
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> + if (iommu_iova_to_phys(iommu->domain, iova))
> + return -EBUSY;
> +
> + iova = start;
> +
> + if (rdwr)
> + prot |= IOMMU_WRITE;
> + if (iommu->cache)
> + prot |= IOMMU_CACHE;
> +
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> + unsigned long pfn = 0;
> +
> + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> + if (ret) {
> + __vfio_dma_unmap(iommu, start, i, rdwr);
> + return ret;
> + }
> +
> + /* Only add actual locked pages to accounting */
> + if (!is_invalid_reserved_pfn(pfn))
> + locked++;
> +
> + ret = iommu_map(iommu->domain, iova,
> + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> + if (ret) {
> + /* Back out mappings on error */
> + put_pfn(pfn, rdwr);
> + __vfio_dma_unmap(iommu, start, i, rdwr);
> + return ret;
> + }
> + }
> + vfio_lock_acct(locked);
> + return 0;
> +}
> +
> +static inline int ranges_overlap(unsigned long start1, size_t size1,
> + unsigned long start2, size_t size2)
> +{
> + return !(start1 + size1 <= start2 || start2 + size2 <= start1);
Needs overflow safety.
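Something like this (untested, and zero-length ranges would want a think)
avoids summing the endpoints:

static inline int ranges_overlap(unsigned long start1, size_t size1,
				 unsigned long start2, size_t size2)
{
	/* comparisons on differences can't wrap */
	if (start1 < start2)
		return size1 > start2 - start1;
	return size2 > start1 - start2;
}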
> +}
> +
> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> + dma_addr_t start, size_t size)
> +{
> + struct list_head *pos;
> + struct dma_map_page *mlp;
> +
> + list_for_each(pos, &iommu->dm_list) {
> + mlp = list_entry(pos, struct dma_map_page, list);
> + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> + start, size))
> + return mlp;
> + }
> + return NULL;
> +}
Again, keeping track of each dma map operation is no good for
performance.
> +
> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> + size_t size, struct dma_map_page *mlp)
> +{
> + struct dma_map_page *split;
> + int npage_lo, npage_hi;
> +
> + /* Existing dma region is completely covered, unmap all */
> + if (start <= mlp->daddr &&
> + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> + list_del(&mlp->list);
> + npage_lo = mlp->npage;
> + kfree(mlp);
> + return npage_lo;
> + }
> +
> + /* Overlap low address of existing range */
> + if (start <= mlp->daddr) {
> + size_t overlap;
> +
> + overlap = start + size - mlp->daddr;
> + npage_lo = overlap >> PAGE_SHIFT;
> + npage_hi = mlp->npage - npage_lo;
> +
> + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> + mlp->daddr += overlap;
> + mlp->vaddr += overlap;
> + mlp->npage -= npage_lo;
> + return npage_lo;
> + }
> +
> + /* Overlap high address of existing range */
> + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> + size_t overlap;
> +
> + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> + npage_hi = overlap >> PAGE_SHIFT;
> + npage_lo = mlp->npage - npage_hi;
> +
> + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> + mlp->npage -= npage_hi;
> + return npage_hi;
> + }
> +
> + /* Split existing */
> + npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> + npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> +
> + split = kzalloc(sizeof *split, GFP_KERNEL);
> + if (!split)
> + return -ENOMEM;
> +
> + vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> +
> + mlp->npage = npage_lo;
> +
> + split->npage = npage_hi;
> + split->daddr = start + size;
> + split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> + split->rdwr = mlp->rdwr;
> + list_add(&split->list, &iommu->dm_list);
> + return size >> PAGE_SHIFT;
> +}
> +
> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> + int ret = 0;
> + size_t npage = dmp->size >> PAGE_SHIFT;
> + struct list_head *pos, *n;
> +
> + if (dmp->dmaaddr & ~PAGE_MASK)
> + return -EINVAL;
> + if (dmp->size & ~PAGE_MASK)
> + return -EINVAL;
> +
> + mutex_lock(&iommu->dgate);
> +
> + list_for_each_safe(pos, n, &iommu->dm_list) {
> + struct dma_map_page *mlp;
> +
> + mlp = list_entry(pos, struct dma_map_page, list);
> + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> + dmp->dmaaddr, dmp->size)) {
> + ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> + dmp->size, mlp);
> + if (ret > 0)
> + npage -= NPAGE_TO_SIZE(ret);
> + if (ret < 0 || npage == 0)
> + break;
> + }
> + }
> + mutex_unlock(&iommu->dgate);
> + return ret > 0 ? 0 : ret;
> +}
> +
> +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> + int npage;
> + struct dma_map_page *mlp, *mmlp = NULL;
> + dma_addr_t daddr = dmp->dmaaddr;
> + unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> + size_t size = dmp->size;
> + int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> +
> + if (vaddr & (PAGE_SIZE-1))
> + return -EINVAL;
> + if (daddr & (PAGE_SIZE-1))
> + return -EINVAL;
> + if (size & (PAGE_SIZE-1))
> + return -EINVAL;
> +
> + npage = size >> PAGE_SHIFT;
> + if (!npage)
> + return -EINVAL;
> +
> + if (!iommu)
> + return -EINVAL;
> +
> + mutex_lock(&iommu->dgate);
> +
> + if (vfio_find_dma(iommu, daddr, size)) {
> + ret = -EBUSY;
> + goto out_lock;
> + }
> +
> + /* account for locked pages */
> + locked = current->mm->locked_vm + npage;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> + __func__, rlimit(RLIMIT_MEMLOCK));
> + ret = -ENOMEM;
> + goto out_lock;
> + }
> +
> + ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> + if (ret)
> + goto out_lock;
> +
> + /* Check if we abut a region below */
> + if (daddr) {
> + mlp = vfio_find_dma(iommu, daddr - 1, 1);
> + if (mlp && mlp->rdwr == rdwr &&
> + mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> +
> + mlp->npage += npage;
> + daddr = mlp->daddr;
> + vaddr = mlp->vaddr;
> + npage = mlp->npage;
> + size = NPAGE_TO_SIZE(npage);
> +
> + mmlp = mlp;
> + }
> + }
> +
> + if (daddr + size) {
> + mlp = vfio_find_dma(iommu, daddr + size, 1);
> + if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> +
> + mlp->npage += npage;
> + mlp->daddr = daddr;
> + mlp->vaddr = vaddr;
> +
> + /* If merged above and below, remove previously
> + * merged entry. New entry covers it. */
> + if (mmlp) {
> + list_del(&mmlp->list);
> + kfree(mmlp);
> + }
> + mmlp = mlp;
> + }
> + }
> +
> + if (!mmlp) {
> + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> + if (!mlp) {
> + ret = -ENOMEM;
> + vfio_dma_unmap(iommu, daddr, npage, rdwr);
> + goto out_lock;
> + }
> +
> + mlp->npage = npage;
> + mlp->daddr = daddr;
> + mlp->vaddr = vaddr;
> + mlp->rdwr = rdwr;
> + list_add(&mlp->list, &iommu->dm_list);
> + }
> +
> +out_lock:
> + mutex_unlock(&iommu->dgate);
> + return ret;
> +}
This whole tracking infrastructure is way too complex to impose on
every IOMMU. We absolutely don't want to do all this when just
updating a 1-level pagetable.
> +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> +{
> + struct vfio_iommu *iommu = filep->private_data;
> +
> + vfio_release_iommu(iommu);
> + return 0;
> +}
> +
> +static long vfio_iommu_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_iommu *iommu = filep->private_data;
> + int ret = -ENOSYS;
> +
> + if (cmd == VFIO_IOMMU_GET_FLAGS) {
> + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> +
> + ret = put_user(flags, (u64 __user *)arg);
Um.. flags surely have to come from the IOMMU driver.
> + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> + struct vfio_dma_map dm;
> +
> + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> + return -EFAULT;
> +
> + ret = vfio_dma_map_dm(iommu, &dm);
> +
> + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> + ret = -EFAULT;
> +
> + } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> + struct vfio_dma_map dm;
> +
> + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> + return -EFAULT;
> +
> + ret = vfio_dma_unmap_dm(iommu, &dm);
> +
> + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> + ret = -EFAULT;
> + }
> + return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_iommu_compat_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + arg = (unsigned long)compat_ptr(arg);
> + return vfio_iommu_unl_ioctl(filep, cmd, arg);
Um, this only works if the structures are exactly compatible between
32-bit and 64-bit ABIs. I don't think that is always true.
> +}
> +#endif /* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_iommu_fops = {
> + .owner = THIS_MODULE,
> + .release = vfio_iommu_release,
> + .unlocked_ioctl = vfio_iommu_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = vfio_iommu_compat_ioctl,
> +#endif
> +};
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> new file mode 100644
> index 0000000..6169356
> --- /dev/null
> +++ b/drivers/vfio/vfio_main.c
> @@ -0,0 +1,1151 @@
> +/*
> + * VFIO framework
> + *
> + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/cdev.h>
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/fs.h>
> +#include <linux/idr.h>
> +#include <linux/iommu.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/wait.h>
> +
> +#include "vfio_private.h"
> +
> +#define DRIVER_VERSION "0.2"
> +#define DRIVER_AUTHOR "Alex Williamson <alex.williamson@redhat.com>"
> +#define DRIVER_DESC "VFIO - User Level meta-driver"
> +
> +static int allow_unsafe_intrs;
> +module_param(allow_unsafe_intrs, int, 0);
> +MODULE_PARM_DESC(allow_unsafe_intrs,
> + "Allow use of IOMMUs which do not support interrupt remapping");
This should not be a global option, but part of the AMD/Intel IOMMU
specific code. In general it's a question of how strict the IOMMU
driver is about isolation when it determines what the groups are, and
only the IOMMU driver can know what the possibilities are for its
class of hardware.
> +
> +static struct vfio {
> + dev_t devt;
> + struct cdev cdev;
> + struct list_head group_list;
> + struct mutex lock;
> + struct kref kref;
> + struct class *class;
> + struct idr idr;
> + wait_queue_head_t release_q;
> +} vfio;
> +
> +static const struct file_operations vfio_group_fops;
> +extern const struct file_operations vfio_iommu_fops;
> +
> +struct vfio_group {
> + dev_t devt;
> + unsigned int groupid;
> + struct bus_type *bus;
> + struct vfio_iommu *iommu;
> + struct list_head device_list;
> + struct list_head iommu_next;
> + struct list_head group_next;
> + int refcnt;
> +};
> +
> +struct vfio_device {
> + struct device *dev;
> + const struct vfio_device_ops *ops;
> + struct vfio_iommu *iommu;
> + struct vfio_group *group;
> + struct list_head device_next;
> + bool attached;
> + int refcnt;
> + void *device_data;
> +};
> +
> +/*
> + * Helper functions called under vfio.lock
> + */
> +
> +/* Return true if any devices within a group are opened */
> +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + if (device->refcnt)
> + return true;
> + }
> + return false;
> +}
> +
> +/* Return true if any of the groups attached to an iommu are opened.
> + * We can only tear apart merged groups when nothing is left open. */
> +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> +
> + group = list_entry(pos, struct vfio_group, iommu_next);
> + if (group->refcnt)
> + return true;
> + }
> + return false;
> +}
> +
> +/* An iommu is "in use" if it has a file descriptor open or if any of
> + * the groups assigned to the iommu have devices open. */
> +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + if (iommu->refcnt)
> + return true;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> +
> + group = list_entry(pos, struct vfio_group, iommu_next);
> +
> + if (__vfio_group_devs_inuse(group))
> + return true;
> + }
> + return false;
> +}
> +
> +static void __vfio_group_set_iommu(struct vfio_group *group,
> + struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + if (group->iommu)
> + list_del(&group->iommu_next);
> + if (iommu)
> + list_add(&group->iommu_next, &iommu->group_list);
> +
> + group->iommu = iommu;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + device->iommu = iommu;
> + }
> +}
> +
> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> + struct vfio_device *device)
> +{
> + BUG_ON(!iommu->domain && device->attached);
> +
> + if (!iommu->domain || !device->attached)
> + return;
> +
> + iommu_detach_device(iommu->domain, device->dev);
> + device->attached = false;
> +}
> +
> +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> + struct vfio_group *group)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + __vfio_iommu_detach_dev(iommu, device);
> + }
> +}
> +
> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> + struct vfio_device *device)
> +{
> + int ret;
> +
> + BUG_ON(device->attached);
> +
> + if (!iommu || !iommu->domain)
> + return -EINVAL;
> +
> + ret = iommu_attach_device(iommu->domain, device->dev);
> + if (!ret)
> + device->attached = true;
> +
> + return ret;
> +}
> +
> +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
> + struct vfio_group *group)
> +{
> + struct list_head *pos;
> +
> + list_for_each(pos, &group->device_list) {
> + struct vfio_device *device;
> + int ret;
> +
> + device = list_entry(pos, struct vfio_device, device_next);
> + ret = __vfio_iommu_attach_dev(iommu, device);
> + if (ret) {
> + __vfio_iommu_detach_group(iommu, group);
> + return ret;
> + }
> + }
> + return 0;
> +}
> +
> +/* The iommu is viable, ie. ready to be configured, when all the devices
> + * for all the groups attached to the iommu are bound to their vfio device
> + * drivers (ex. vfio-pci). This sets the device_data private data pointer. */
> +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
> +{
> + struct list_head *gpos, *dpos;
> +
> + list_for_each(gpos, &iommu->group_list) {
> + struct vfio_group *group;
> + group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (!device->device_data)
> + return false;
> + }
> + }
> + return true;
> +}
> +
> +static void __vfio_close_iommu(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> +
> + if (!iommu->domain)
> + return;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> + group = list_entry(pos, struct vfio_group, iommu_next);
> +
> + __vfio_iommu_detach_group(iommu, group);
> + }
> +
> + vfio_iommu_unmapall(iommu);
> +
> + iommu_domain_free(iommu->domain);
> + iommu->domain = NULL;
> + iommu->mm = NULL;
> +}
> +
> +/* Open the IOMMU. This gates all access to the iommu or device file
> + * descriptors and sets current->mm as the exclusive user. */
> +static int __vfio_open_iommu(struct vfio_iommu *iommu)
> +{
> + struct list_head *pos;
> + int ret;
> +
> + if (!__vfio_iommu_viable(iommu))
> + return -EBUSY;
> +
> + if (iommu->domain)
> + return -EINVAL;
> +
> + iommu->domain = iommu_domain_alloc(iommu->bus);
> + if (!iommu->domain)
> + return -EFAULT;
> +
> + list_for_each(pos, &iommu->group_list) {
> + struct vfio_group *group;
> + group = list_entry(pos, struct vfio_group, iommu_next);
> +
> + ret = __vfio_iommu_attach_group(iommu, group);
> + if (ret) {
> + __vfio_close_iommu(iommu);
> + return ret;
> + }
> + }
> +
> + if (!allow_unsafe_intrs &&
> + !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> + __vfio_close_iommu(iommu);
> + return -EFAULT;
> + }
> +
> + iommu->cache = (iommu_domain_has_cap(iommu->domain,
> + IOMMU_CAP_CACHE_COHERENCY) != 0);
> + iommu->mm = current->mm;
> +
> + return 0;
> +}
> +
> +/* Actively try to tear down the iommu and merged groups. If there are no
> + * open iommu or device fds, we close the iommu. If we close the iommu and
> + * there are also no open group fds, we can further dissolve the group to
> + * iommu association and free the iommu data structure. */
> +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
> +{
> +
> + if (__vfio_iommu_inuse(iommu))
> + return -EBUSY;
> +
> + __vfio_close_iommu(iommu);
> +
> + if (!__vfio_iommu_groups_inuse(iommu)) {
> + struct list_head *pos, *ppos;
> +
> + list_for_each_safe(pos, ppos, &iommu->group_list) {
> + struct vfio_group *group;
> +
> + group = list_entry(pos, struct vfio_group, iommu_next);
> + __vfio_group_set_iommu(group, NULL);
> + }
> +
> +
> + kfree(iommu);
> + }
> +
> + return 0;
> +}
> +
> +static struct vfio_device *__vfio_lookup_dev(struct device *dev)
> +{
> + struct list_head *gpos;
> + unsigned int groupid;
> +
> + if (iommu_device_group(dev, &groupid))
> + return NULL;
> +
> + list_for_each(gpos, &vfio.group_list) {
> + struct vfio_group *group;
> + struct list_head *dpos;
> +
> + group = list_entry(gpos, struct vfio_group, group_next);
> +
> + if (group->groupid != groupid)
> + continue;
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (device->dev == dev)
> + return device;
> + }
> + }
> + return NULL;
> +}
> +
> +/* All release paths simply decrement the refcnt, attempt to teardown
> + * the iommu and merged groups, and wakeup anything that might be
> + * waiting if we successfully dissolve anything. */
> +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
> +{
> + bool wake;
> +
> + mutex_lock(&vfio.lock);
> +
> + (*refcnt)--;
> + wake = (__vfio_try_dissolve_iommu(iommu) == 0);
> +
> + mutex_unlock(&vfio.lock);
> +
> + if (wake)
> + wake_up(&vfio.release_q);
> +
> + return 0;
> +}
> +
> +/*
> + * Device fops - passthrough to vfio device driver w/ device_data
> + */
> +static int vfio_device_release(struct inode *inode, struct file *filep)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + vfio_do_release(&device->refcnt, device->iommu);
> +
> + device->ops->put(device->device_data);
> +
> + return 0;
> +}
> +
> +static long vfio_device_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->ioctl(device->device_data, cmd, arg);
> +}
> +
> +static ssize_t vfio_device_read(struct file *filep, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->read(device->device_data, buf, count, ppos);
> +}
> +
> +static ssize_t vfio_device_write(struct file *filep, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->write(device->device_data, buf, count, ppos);
> +}
> +
> +static int vfio_device_mmap(struct file *filep, struct vm_area_struct *vma)
> +{
> + struct vfio_device *device = filep->private_data;
> +
> + return device->ops->mmap(device->device_data, vma);
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_device_compat_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + arg = (unsigned long)compat_ptr(arg);
> + return vfio_device_unl_ioctl(filep, cmd, arg);
> +}
> +#endif /* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_device_fops = {
> + .owner = THIS_MODULE,
> + .release = vfio_device_release,
> + .read = vfio_device_read,
> + .write = vfio_device_write,
> + .unlocked_ioctl = vfio_device_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = vfio_device_compat_ioctl,
> +#endif
> + .mmap = vfio_device_mmap,
> +};
> +
> +/*
> + * Group fops
> + */
> +static int vfio_group_open(struct inode *inode, struct file *filep)
> +{
> + struct vfio_group *group;
> + int ret = 0;
> +
> + mutex_lock(&vfio.lock);
> +
> + group = idr_find(&vfio.idr, iminor(inode));
> +
> + if (!group) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> + filep->private_data = group;
> +
> + if (!group->iommu) {
> + struct vfio_iommu *iommu;
> +
> + iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> + if (!iommu) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + INIT_LIST_HEAD(&iommu->group_list);
> + INIT_LIST_HEAD(&iommu->dm_list);
> + mutex_init(&iommu->dgate);
> + iommu->bus = group->bus;
> + __vfio_group_set_iommu(group, iommu);
> + }
> + group->refcnt++;
> +
> +out:
> + mutex_unlock(&vfio.lock);
> +
> + return ret;
> +}
> +
> +static int vfio_group_release(struct inode *inode, struct file *filep)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + return vfio_do_release(&group->refcnt, group->iommu);
> +}
> +
> +/* Attempt to merge the group pointed to by fd into group. The merge-ee
> + * group must not have an iommu or any devices open because we cannot
> + * maintain that context across the merge. The merge-er group can be
> + * in use. */
Yeah, allowing the merge-er group to remain in use still has its problems,
because pulling more devices into it could change what the merged IOMMU
domain is capable of.
> +static int vfio_group_merge(struct vfio_group *group, int fd)
> +{
> + struct vfio_group *new;
> + struct vfio_iommu *old_iommu;
> + struct file *file;
> + int ret = 0;
> + bool opened = false;
> +
> + mutex_lock(&vfio.lock);
> +
> + file = fget(fd);
> + if (!file) {
> + ret = -EBADF;
> + goto out_noput;
> + }
> +
> + /* Sanity check, is this really our fd? */
> + if (file->f_op != &vfio_group_fops) {
This should be a WARN_ON or BUG_ON rather than just an error return, surely.
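Something like this, perhaps (just a sketch):

    /* shout if the fd isn't one of ours, but still fail cleanly */
    if (WARN_ON(file->f_op != &vfio_group_fops)) {
        ret = -EINVAL;
        goto out;
    }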
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + new = file->private_data;
> +
> + if (!new || new == group || !new->iommu ||
> + new->iommu->domain || new->bus != group->bus) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* We need to attach all the devices to each domain separately
> + * in order to validate that the capabilities match for both. */
> + ret = __vfio_open_iommu(new->iommu);
> + if (ret)
> + goto out;
> +
> + if (!group->iommu->domain) {
> + ret = __vfio_open_iommu(group->iommu);
> + if (ret)
> + goto out;
> + opened = true;
> + }
> +
> + /* If cache coherency doesn't match we'd potentially need to
> + * remap existing iommu mappings in the merge-er domain.
> + * Poor return to bother trying to allow this currently. */
> + if (iommu_domain_has_cap(group->iommu->domain,
> + IOMMU_CAP_CACHE_COHERENCY) !=
> + iommu_domain_has_cap(new->iommu->domain,
> + IOMMU_CAP_CACHE_COHERENCY)) {
> + __vfio_close_iommu(new->iommu);
> + if (opened)
> + __vfio_close_iommu(group->iommu);
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* Close the iommu for the merge-ee and attach all its devices
> + * to the merge-er iommu. */
> + __vfio_close_iommu(new->iommu);
> +
> + ret = __vfio_iommu_attach_group(group->iommu, new);
> + if (ret)
> + goto out;
> +
> + /* set_iommu unlinks new from the iommu, so save a pointer to it */
> + old_iommu = new->iommu;
> + __vfio_group_set_iommu(new, group->iommu);
> + kfree(old_iommu);
> +
> +out:
> + fput(file);
> +out_noput:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Unmerge the group pointed to by fd from group. */
> +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> +{
> + struct vfio_group *new;
> + struct vfio_iommu *new_iommu;
> + struct file *file;
> + int ret = 0;
> +
> + /* Since the merge-out group is already opened, it needs to
> + * have an iommu struct associated with it. */
> + new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> + if (!new_iommu)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&new_iommu->group_list);
> + INIT_LIST_HEAD(&new_iommu->dm_list);
> + mutex_init(&new_iommu->dgate);
> + new_iommu->bus = group->bus;
> +
> + mutex_lock(&vfio.lock);
> +
> + file = fget(fd);
> + if (!file) {
> + ret = -EBADF;
> + goto out_noput;
> + }
> +
> + /* Sanity check, is this really our fd? */
> + if (file->f_op != &vfio_group_fops) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + new = file->private_data;
> + if (!new || new == group || new->iommu != group->iommu) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* We can't merge-out a group with devices still in use. */
> + if (__vfio_group_devs_inuse(new)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + __vfio_iommu_detach_group(group->iommu, new);
> + __vfio_group_set_iommu(new, new_iommu);
> +
> +out:
> + fput(file);
> +out_noput:
> + if (ret)
> + kfree(new_iommu);
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Get a new iommu file descriptor. This will open the iommu, setting
> + * the current->mm ownership if it's not already set. */
I know I've had this explained to me several times before, but I've
forgotten again. Why do we need to wire the iommu to an mm?
> +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> +{
> + int ret = 0;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (!group->iommu->domain) {
> + ret = __vfio_open_iommu(group->iommu);
> + if (ret)
> + goto out;
> + }
> +
> + ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> + group->iommu, O_RDWR);
> + if (ret < 0)
> + goto out;
> +
> + group->iommu->refcnt++;
> +out:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Get a new device file descriptor. This will open the iommu, setting
> + * the current->mm ownership if it's not already set. It's difficult to
> + * specify the requirements for matching a user supplied buffer to a
> + * device, so we use a vfio driver callback to test for a match. For
> + * PCI, dev_name(dev) is unique, but other drivers may require including
> + * a parent device string. */
At some point we probably want an interface to enumerate the devices
too, but that can probably wait.
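Purely as a hypothetical sketch -- names, numbers and layout all invented
here -- it could be a pair of ioctls on the group fd:

    #define VFIO_GROUP_GET_NUM_DEVICES  _IO(';', 125)

    struct vfio_device_name {
        __u32 len;      /* length of structure */
        __u32 index;    /* 0-based device index within the group */
        char  name[64]; /* e.g. dev_name(), "0000:06:0d.0" for PCI */
    };
    #define VFIO_GROUP_GET_DEVICE_NAME  _IOWR(';', 126, struct vfio_device_name)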
> +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> +{
> + struct vfio_iommu *iommu = group->iommu;
> + struct list_head *gpos;
> + int ret = -ENODEV;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (!iommu->domain) {
> + ret = __vfio_open_iommu(iommu);
> + if (ret)
> + goto out;
> + }
> +
> + list_for_each(gpos, &iommu->group_list) {
> + struct list_head *dpos;
> +
> + group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> + list_for_each(dpos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> +
> + if (device->ops->match(device->dev, buf)) {
> + struct file *file;
> +
> + if (device->ops->get(device->device_data)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + /* We can't use anon_inode_getfd(), like above
> + * because we need to modify the f_mode flags
> + * directly to allow more than just ioctls */
> + ret = get_unused_fd();
> + if (ret < 0) {
> + device->ops->put(device->device_data);
> + goto out;
> + }
> +
> + file = anon_inode_getfile("[vfio-device]",
> + &vfio_device_fops,
> + device, O_RDWR);
> + if (IS_ERR(file)) {
> + put_unused_fd(ret);
> + ret = PTR_ERR(file);
> + device->ops->put(device->device_data);
> + goto out;
> + }
> +
> + /* Todo: add an anon_inode interface to do
> + * this. Appears to be missing by lack of
> + * need rather than explicitly prevented.
> + * Now there's need. */
> + file->f_mode |= (FMODE_LSEEK |
> + FMODE_PREAD |
> + FMODE_PWRITE);
> +
> + fd_install(ret, file);
> +
> + device->refcnt++;
> + goto out;
> + }
> + }
> + }
> +out:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +static long vfio_group_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + if (cmd == VFIO_GROUP_GET_FLAGS) {
> + u64 flags = 0;
> +
> + mutex_lock(&vfio.lock);
> + if (__vfio_iommu_viable(group->iommu))
> + flags |= VFIO_GROUP_FLAGS_VIABLE;
> + mutex_unlock(&vfio.lock);
> +
> + if (group->iommu->mm)
> + flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
> +
> + return put_user(flags, (u64 __user *)arg);
> + }
> +
> + /* Below commands are restricted once the mm is set */
> + if (group->iommu->mm && group->iommu->mm != current->mm)
> + return -EPERM;
> +
> + if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {
> + int fd;
> +
> + if (get_user(fd, (int __user *)arg))
> + return -EFAULT;
> + if (fd < 0)
> + return -EINVAL;
> +
> + if (cmd == VFIO_GROUP_MERGE)
> + return vfio_group_merge(group, fd);
> + else
> + return vfio_group_unmerge(group, fd);
> + } else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
> + return vfio_group_get_iommu_fd(group);
> + } else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
> + char *buf;
> + int ret;
> +
> + buf = strndup_user((const char __user *)arg, PAGE_SIZE);
> + if (IS_ERR(buf))
> + return PTR_ERR(buf);
> +
> + ret = vfio_group_get_device_fd(group, buf);
> + kfree(buf);
> + return ret;
> + }
> +
> + return -ENOSYS;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_group_compat_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + arg = (unsigned long)compat_ptr(arg);
> + return vfio_group_unl_ioctl(filep, cmd, arg);
> +}
> +#endif /* CONFIG_COMPAT */
> +
> +static const struct file_operations vfio_group_fops = {
> + .owner = THIS_MODULE,
> + .open = vfio_group_open,
> + .release = vfio_group_release,
> + .unlocked_ioctl = vfio_group_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = vfio_group_compat_ioctl,
> +#endif
> +};
> +
> +/* iommu fd release hook */
> +int vfio_release_iommu(struct vfio_iommu *iommu)
> +{
> + return vfio_do_release(&iommu->refcnt, iommu);
> +}
> +
> +/*
> + * VFIO driver API
> + */
> +
> +/* Add a new device to the vfio framework with associated vfio driver
> + * callbacks. This is the entry point for vfio drivers to register devices. */
> +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> +{
> + struct list_head *pos;
> + struct vfio_group *group = NULL;
> + struct vfio_device *device = NULL;
> + unsigned int groupid;
> + int ret = 0;
> + bool new_group = false;
> +
> + if (!ops)
> + return -EINVAL;
> +
> + if (iommu_device_group(dev, &groupid))
> + return -ENODEV;
> +
> + mutex_lock(&vfio.lock);
> +
> + list_for_each(pos, &vfio.group_list) {
> + group = list_entry(pos, struct vfio_group, group_next);
> + if (group->groupid == groupid)
> + break;
> + group = NULL;
> + }
> +
> + if (!group) {
> + int minor;
> +
> + if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + group = kzalloc(sizeof(*group), GFP_KERNEL);
> + if (!group) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + group->groupid = groupid;
> + INIT_LIST_HEAD(&group->device_list);
> +
> + ret = idr_get_new(&vfio.idr, group, &minor);
> + if (ret == 0 && minor > MINORMASK) {
> + idr_remove(&vfio.idr, minor);
> + kfree(group);
> + ret = -ENOSPC;
> + goto out;
> + }
> +
> + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> + device_create(vfio.class, NULL, group->devt,
> + group, "%u", groupid);
> +
> + group->bus = dev->bus;
> + list_add(&group->group_next, &vfio.group_list);
> + new_group = true;
> + } else {
> + if (group->bus != dev->bus) {
> + printk(KERN_WARNING
> + "Error: IOMMU group ID conflict. Group ID %u "
> + "on both bus %s and %s\n", groupid,
> + group->bus->name, dev->bus->name);
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + list_for_each(pos, &group->device_list) {
> + device = list_entry(pos,
> + struct vfio_device, device_next);
> + if (device->dev == dev)
> + break;
> + device = NULL;
> + }
> + }
> +
> + if (!device) {
> + if (__vfio_group_devs_inuse(group) ||
> + (group->iommu && group->iommu->refcnt)) {
> + printk(KERN_WARNING
> + "Adding device %s to group %u while group is already in use!!\n",
> + dev_name(dev), group->groupid);
> + /* XXX How to prevent other drivers from claiming? */
> + }
> +
> + device = kzalloc(sizeof(*device), GFP_KERNEL);
> + if (!device) {
> + /* If we just created this group, tear it down */
> + if (new_group) {
> + list_del(&group->group_next);
> + device_destroy(vfio.class, group->devt);
> + idr_remove(&vfio.idr, MINOR(group->devt));
> + kfree(group);
> + }
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + list_add(&device->device_next, &group->device_list);
> + device->dev = dev;
> + device->ops = ops;
> + device->iommu = group->iommu; /* NULL if new */
> + __vfio_iommu_attach_dev(group->iommu, device);
> + }
> +out:
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_add_dev);
> +
> +/* Remove a device from the vfio framework */
> +void vfio_group_del_dev(struct device *dev)
> +{
> + struct list_head *pos;
> + struct vfio_group *group = NULL;
> + struct vfio_device *device = NULL;
> + unsigned int groupid;
> +
> + if (iommu_device_group(dev, &groupid))
> + return;
> +
> + mutex_lock(&vfio.lock);
> +
> + list_for_each(pos, &vfio.group_list) {
> + group = list_entry(pos, struct vfio_group, group_next);
> + if (group->groupid == groupid)
> + break;
> + group = NULL;
> + }
> +
> + if (!group)
> + goto out;
> +
> + list_for_each(pos, &group->device_list) {
> + device = list_entry(pos, struct vfio_device, device_next);
> + if (device->dev == dev)
> + break;
> + device = NULL;
> + }
> +
> + if (!device)
> + goto out;
> +
> + BUG_ON(device->refcnt);
> +
> + if (device->attached)
> + __vfio_iommu_detach_dev(group->iommu, device);
> +
> + list_del(&device->device_next);
> + kfree(device);
> +
> + /* If this was the only device in the group, remove the group.
> + * Note that we intentionally unmerge empty groups here if the
> + * group fd isn't opened. */
> + if (list_empty(&group->device_list) && group->refcnt == 0) {
> + struct vfio_iommu *iommu = group->iommu;
> +
> + if (iommu) {
> + __vfio_group_set_iommu(group, NULL);
> + __vfio_try_dissolve_iommu(iommu);
> + }
> +
> + device_destroy(vfio.class, group->devt);
> + idr_remove(&vfio.idr, MINOR(group->devt));
> + list_del(&group->group_next);
> + kfree(group);
> + }
> +out:
> + mutex_unlock(&vfio.lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_del_dev);
> +
> +/* When a device is bound to a vfio device driver (ex. vfio-pci), this
> + * entry point is used to mark the device usable (viable). The vfio
> + * device driver associates a private device_data struct with the device
> + * here, which will later be returned to the vfio_device_fops callbacks. */
> +int vfio_bind_dev(struct device *dev, void *device_data)
> +{
> + struct vfio_device *device;
> + int ret = -EINVAL;
> +
> + BUG_ON(!device_data);
> +
> + mutex_lock(&vfio.lock);
> +
> + device = __vfio_lookup_dev(dev);
> +
> + BUG_ON(!device);
> +
> + ret = dev_set_drvdata(dev, device);
> + if (!ret)
> + device->device_data = device_data;
> +
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_bind_dev);
> +
> +/* A device is only removeable if the iommu for the group is not in use. */
> +static bool vfio_device_removeable(struct vfio_device *device)
> +{
> + bool ret = true;
> +
> + mutex_lock(&vfio.lock);
> +
> + if (device->iommu && __vfio_iommu_inuse(device->iommu))
> + ret = false;
> +
> + mutex_unlock(&vfio.lock);
> + return ret;
> +}
> +
> +/* Notify vfio that a device is being unbound from the vfio device driver
> + * and return the device private device_data pointer. If the group is
> + * in use, we need to block or take other measures to make it safe for
> + * the device to be removed from the iommu. */
> +void *vfio_unbind_dev(struct device *dev)
> +{
> + struct vfio_device *device = dev_get_drvdata(dev);
> + void *device_data;
> +
> + BUG_ON(!device);
> +
> +again:
> + if (!vfio_device_removeable(device)) {
> + /* XXX signal for all devices in group to be removed or
> + * resort to killing the process holding the device fds.
> + * For now just block waiting for releases to wake us. */
> + wait_event(vfio.release_q, vfio_device_removeable(device));
> + }
> +
> + mutex_lock(&vfio.lock);
> +
> + /* Need to re-check that the device is still removeable under lock. */
> + if (device->iommu && __vfio_iommu_inuse(device->iommu)) {
> + mutex_unlock(&vfio.lock);
> + goto again;
> + }
> +
> + device_data = device->device_data;
> +
> + device->device_data = NULL;
> + dev_set_drvdata(dev, NULL);
> +
> + mutex_unlock(&vfio.lock);
> + return device_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_unbind_dev);
> +
> +/*
> + * Module/class support
> + */
> +static void vfio_class_release(struct kref *kref)
> +{
> + class_destroy(vfio.class);
> + vfio.class = NULL;
> +}
> +
> +static char *vfio_devnode(struct device *dev, mode_t *mode)
> +{
> + return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
> +}
> +
> +static int __init vfio_init(void)
> +{
> + int ret;
> +
> + idr_init(&vfio.idr);
> + mutex_init(&vfio.lock);
> + INIT_LIST_HEAD(&vfio.group_list);
> + init_waitqueue_head(&vfio.release_q);
> +
> + kref_init(&vfio.kref);
> + vfio.class = class_create(THIS_MODULE, "vfio");
> + if (IS_ERR(vfio.class)) {
> + ret = PTR_ERR(vfio.class);
> + goto err_class;
> + }
> +
> + vfio.class->devnode = vfio_devnode;
> +
> + /* FIXME - how many minors to allocate... all of them! */
> + ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
> + if (ret)
> + goto err_chrdev;
> +
> + cdev_init(&vfio.cdev, &vfio_group_fops);
> + ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK);
> + if (ret)
> + goto err_cdev;
> +
> + pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> +
> + return 0;
> +
> +err_cdev:
> + unregister_chrdev_region(vfio.devt, MINORMASK);
> +err_chrdev:
> + kref_put(&vfio.kref, vfio_class_release);
> +err_class:
> + return ret;
> +}
> +
> +static void __exit vfio_cleanup(void)
> +{
> + struct list_head *gpos, *gppos;
> +
> + list_for_each_safe(gpos, gppos, &vfio.group_list) {
> + struct vfio_group *group;
> + struct list_head *dpos, *dppos;
> +
> + group = list_entry(gpos, struct vfio_group, group_next);
> +
> + list_for_each_safe(dpos, dppos, &group->device_list) {
> + struct vfio_device *device;
> +
> + device = list_entry(dpos,
> + struct vfio_device, device_next);
> + vfio_group_del_dev(device->dev);
> + }
> + }
> +
> + idr_destroy(&vfio.idr);
> + cdev_del(&vfio.cdev);
> + unregister_chrdev_region(vfio.devt, MINORMASK);
> + kref_put(&vfio.kref, vfio_class_release);
> +}
> +
> +module_init(vfio_init);
> +module_exit(vfio_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
> new file mode 100644
> index 0000000..350ad67
> --- /dev/null
> +++ b/drivers/vfio/vfio_private.h
> @@ -0,0 +1,34 @@
> +/*
> + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +
> +#ifndef VFIO_PRIVATE_H
> +#define VFIO_PRIVATE_H
> +
> +struct vfio_iommu {
> + struct iommu_domain *domain;
> + struct bus_type *bus;
> + struct mutex dgate;
> + struct list_head dm_list;
> + struct mm_struct *mm;
> + struct list_head group_list;
> + int refcnt;
> + bool cache;
> +};
> +
> +extern int vfio_release_iommu(struct vfio_iommu *iommu);
> +extern void vfio_iommu_unmapall(struct vfio_iommu *iommu);
> +
> +#endif /* VFIO_PRIVATE_H */
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> new file mode 100644
> index 0000000..4269b08
> --- /dev/null
> +++ b/include/linux/vfio.h
> @@ -0,0 +1,155 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +#include <linux/types.h>
> +
> +#ifndef VFIO_H
> +#define VFIO_H
> +
> +#ifdef __KERNEL__
> +
> +struct vfio_device_ops {
> + bool (*match)(struct device *, char *);
> + int (*get)(void *);
> + void (*put)(void *);
> + ssize_t (*read)(void *, char __user *,
> + size_t, loff_t *);
> + ssize_t (*write)(void *, const char __user *,
> + size_t, loff_t *);
> + long (*ioctl)(void *, unsigned int, unsigned long);
> + int (*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +extern int vfio_group_add_dev(struct device *device,
> + const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *device);
> +extern int vfio_bind_dev(struct device *device, void *device_data);
> +extern void *vfio_unbind_dev(struct device *device);
> +
> +#endif /* __KERNEL__ */
> +
> +/*
> + * VFIO driver - allow mapping and use of certain devices
> + * in unprivileged user processes. (If IOMMU is present)
> + * Especially useful for Virtual Function parts of SR-IOV devices
> + */
> +
> +
> +/* Kernel & User level defines for ioctls */
> +
> +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + */
> +struct vfio_dma_map {
> + __u64 len; /* length of structure */
> + __u64 vaddr; /* process virtual addr */
> + __u64 dmaaddr; /* desired and/or returned dma address */
> + __u64 size; /* size in bytes */
> + __u64 flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> +};
> +
> +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> + /* Does the IOMMU support mapping any IOVA to any virtual address? */
> + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
> +
> +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
> + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> +
> +struct vfio_region_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* region number */
> + __u64 size; /* size in bytes of region */
> + __u64 offset; /* start offset of region */
> + __u64 flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> + __u64 phys; /* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> + __u32 len; /* length of structure */
> + __u32 index; /* IRQ number */
> + __u32 count; /* number of individual IRQs */
> + __u32 flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> +};
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> +
> +#define VFIO_DEVICE_RESET _IO(';', 116)
> +
> +struct vfio_dtpath {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u64 flags;
> +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> + char *path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> + __u32 len; /* length of structure */
> + __u32 index;
> + __u32 prop_type;
> + __u32 prop_index;
> + __u64 flags;
> +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> +
> +#endif /* VFIO_H */
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-15 6:34 ` David Gibson
@ 2011-11-15 18:01 ` Alex Williamson
2011-11-17 0:02 ` David Gibson
2011-11-15 20:10 ` Scott Wood
1 sibling, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-15 18:01 UTC (permalink / raw)
To: David Gibson
Cc: chrisw, aik, pmac, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > new file mode 100644
> > index 0000000..5866896
> > --- /dev/null
> > +++ b/Documentation/vfio.txt
> > @@ -0,0 +1,304 @@
> > +VFIO - "Virtual Function I/O"[1]
> > +-------------------------------------------------------------------------------
> > +Many modern system now provide DMA and interrupt remapping facilities
> > +to help ensure I/O devices behave within the boundaries they've been
> > +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as
> > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > +embedded powerpc systems (technology name unknown). The VFIO driver
> > +is an IOMMU/device agnostic framework for exposing direct device
> > +access to userspace, in a secure, IOMMU protected environment. In
> > +other words, this allows safe, non-privileged, userspace drivers.
>
> It's perhaps worth emphasisng that "safe" depends on the hardware
> being sufficiently well behaved. BenH, I know, thinks there are a
> *lot* of cards that, e.g. have debug registers that allow a backdoor
> to their own config space via MMIO, which would bypass vfio's
> filtering of config space access. And that's before we even get into
> the varying degrees of completeness in the isolation provided by
> different IOMMUs.
Fair enough. I know Tom had emphasized "well behaved" in the original
doc. Virtual functions are probably the best indicator of well behaved.
> > +Why do we want that? Virtual machines often make use of direct device
> > +access ("device assignment") when configured for the highest possible
> > +I/O performance. From a device and host perspective, this simply turns
> > +the VM into a userspace driver, with the benefits of significantly
> > +reduced latency, higher bandwidth, and direct use of bare-metal device
> > +drivers[2].
> > +
> > +Some applications, particularly in the high performance computing
> > +field, also benefit from low-overhead, direct device access from
> > +userspace. Examples include network adapters (often non-TCP/IP based)
> > +and compute accelerators. Previous to VFIO, these drivers needed to
>
> s/Previous/Prior/ although that may be a .us vs .au usage thing.
Same difference, AFAICT.
> > +go through the full development cycle to become proper upstream driver,
> > +be maintained out of tree, or make use of the UIO framework, which
> > +has no notion of IOMMU protection, limited interrupt support, and
> > +requires root privileges to access things like PCI configuration space.
> > +
> > +The VFIO driver framework intends to unify these, replacing both the
> > +KVM PCI specific device assignment currently used as well as provide
> > +a more secure, more featureful userspace driver environment than UIO.
> > +
> > +Groups, Devices, IOMMUs, oh my
> > +-------------------------------------------------------------------------------
> > +
> > +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> > +can't always distinguish transactions from each individual device in
> > +the system. Sometimes this is because of the IOMMU design, such as with
> > +PEs, other times it's caused by the I/O topology, for instance a
> > +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> > +devices created by these restrictions IOMMU groups (or just "groups" for
> > +this document).
> > +
> > +The IOMMU cannot distinguish transactions between the individual devices
> > +within the group, therefore the group is the basic unit of ownership for
> > +a userspace process. Because of this, groups are also the primary
> > +interface to both devices and IOMMU domains in VFIO.
> > +
> > +The VFIO representation of groups is created as devices are added into
> > +the framework by a VFIO bus driver. The vfio-pci module is an example
> > +of a bus driver. This module registers devices along with a set of bus
> > +specific callbacks with the VFIO core. These callbacks provide the
> > +interfaces later used for device access. As each new group is created,
> > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > +character device.
>
> Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
> bus driver is per bus type, not per bus instance. But grouping
> constraints could be per bus instance, if you have a couple of
> different models of PCI host bridge with IOMMUs of different
> capabilities built in, for example.
Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
instance. IOMMUs also register drivers per bus type, not per bus
instance. The IOMMU driver is free to impose any constraints it wants.
> > +In addition to the device enumeration and callbacks, the VFIO bus driver
> > +also provides a traditional device driver and is able to bind to devices
> > +on its bus. When a device is bound to the bus driver it's available to
> > +VFIO. When all the devices within a group are bound to their bus drivers,
> > +the group becomes "viable" and a user with sufficient access to the VFIO
> > +group chardev can obtain exclusive access to the set of group devices.
> > +
> > +As documented in linux/vfio.h, several ioctls are provided on the
> > +group chardev:
> > +
> > +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> > + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> > +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> > +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> > +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> > +
> > +The last two ioctls return new file descriptors for accessing
> > +individual devices within the group and programming the IOMMU. Each of
> > +these new file descriptors provide their own set of file interfaces.
> > +These ioctls will fail if any of the devices within the group are not
> > +bound to their VFIO bus driver. Additionally, when either of these
> > +interfaces are used, the group is then bound to the struct_mm of the
> > +caller. The GET_FLAGS ioctl can be used to view the state of the group.
> > +
> > +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> > +new IOMMU domain is created and all of the devices in the group are
> > +attached to it. This is the only way to ensure full IOMMU isolation
> > +of the group, but potentially wastes resources and cycles if the user
> > +intends to manage multiple groups with the same set of IOMMU mappings.
> > +VFIO therefore provides a group MERGE and UNMERGE interface, which
> > +allows multiple groups to share an IOMMU domain. Not all IOMMUs allow
> > +arbitrary groups to be merged, so the user should assume merging is
> > +opportunistic.
>
> I do not think "opportunistic" means what you think it means..
>
> > A new group, with no open device or IOMMU file
> > +descriptors, can be merged into an existing, in-use, group using the
> > +MERGE ioctl. A merged group can be unmerged using the UNMERGE ioctl
> > +once all of the device file descriptors for the group being merged
> > +"out" are closed.
> > +
> > +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> > +essentially fungible between group file descriptors (ie. if device
> > A
>
> IDNT "fungible" MWYTIM, either.
Hmm, feel free to suggest. Maybe we're hitting .us vs .au connotation.
> > +is in group X, and X is merged with Y, a file descriptor for A can be
> > +retrieved using GET_DEVICE_FD on Y. Likewise, GET_IOMMU_FD returns a
> > +file descriptor referencing the same internal IOMMU object from either
> > +X or Y). Merged groups can be dissolved either explicitly with UNMERGE
> > +or automatically when ALL file descriptors for the merged group are
> > +closed (all IOMMUs, all devices, all groups).
>
> Blech. I'm really not liking this merge/unmerge API as it stands,
> it's horribly confusing. At the very least, we need some better
> terminology. We need some term for the metagroups; supergroups; iommu
> domains or-at-least-they-will-be-once-we-open-the-iommu or
> whathaveyous.
>
> The first confusing thing about this interface is that each open group
> handle actually refers to two different things; the original group you
> opened and the metagroup it's a part of. For the GET_IOMMU_FD and
> GET_DEVICE_FD operations, you're using the metagroup and two "merged"
> group handles are interchangeable.
Fungible, even ;)
> For other MERGE and especially
> UNMERGE operations, it matters which is the original group.
If I stick two LEGO blocks together, I need to identify the individual
block I want to remove to pull them back apart...
> The semantics of "merge" and "unmerge" under those names are really
> non-obvious. Merge kind of has to merge two whole metagroups, but
> it's unclear if unmerge reverses one merge, or just takes out one
> (atom) group. These operations need better names, at least.
Christian suggested a change to UNMERGE so that we don't need to specify a
group to unmerge "from". That makes it more like a list implementation,
except there's no defined list_head: any member of the list can pull in a
new entry, and calling UNMERGE on any member extracts that member.
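In userspace terms the difference is roughly (sketch, reusing the ioctl
numbers from this patch and assuming the fd argument simply disappears in
the second form):

    /* as posted: pull group_b out of the merge via another member */
    ioctl(group_a_fd, VFIO_GROUP_UNMERGE, &group_b_fd);

    /* Christian's variant: any member extracts itself */
    ioctl(group_b_fd, VFIO_GROUP_UNMERGE);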
> Then it's unclear what order you can do various operations, and which
> order you can open and close various things. You can kind of figure
> it out but it takes far more thinking than it should.
>
>
> So at the _very_ least, we need to invent new terminology and find a
> much better way of describing this API's semantics. I still think an
> entirely different interface, where metagroups are created from
> outside with a lifetime that's not tied to an fd would be a better
> idea.
As we've discussed previously, configfs provides part of this, but has
no ioctl support. It doesn't make sense to me to go play with groups in
configfs, but then still interact with them via a char dev. It also
splits the ownership model and makes it harder to enforce who gets to
interact with the devices vs who gets to manipulate groups. The current
model really isn't that complicated, imho. As always, feel free to
suggest specific models. If you have a specific terminology other than
MERGE, please suggest.
> Now, you specify that you can't use a group as the second argument of
> a merge if it already has an open iommu, but it's not clear from the
> doc if you can merge things into a group with an open iommu.
From above:
A new group, with no open device or IOMMU file descriptors, can
be merged into an existing, in-use, group using the MERGE ioctl.
^^^^^^
> Banning
> this would make life simpler, because the IOMMU's effective
> capabilities may change if you add more devices to the domain. That's
> yet another non-obvious constraint in the interface ordering, though.
Banning this would prevent using merged groups with hotplug, which I
consider to be a primary use case.
> > +The IOMMU file descriptor provides this set of ioctls:
> > +
> > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
> > +
> > +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> > +We currently only support IOMMU domains that are able to map any
> > +virtual address to any IOVA. This is indicated by the MAP_ANY
> > flag.
>
> So. I tend to think of an IOMMU mapping IOVAs to memory pages, rather
> than memory pages to IOVAs.
I do too, not sure why I wrote it that way, will fix.
> The IOMMU itself, of course maps to
> physical addresses, and the meaning of "virtual address" in this
> context is not really clear. I think you would be better off saying
> the IOMMU can map any IOVA to any memory page. From a hardware POV
> that means any physical address, but of course for a VFIO user a page
> is specified by its process virtual address.
Will fix.
> I think we need to pin exactly what "MAP_ANY" means down better. Now,
> VFIO is pretty much a lost cause if you can't map any normal process
> memory page into the IOMMU, so I think the only thing that is really
> covered is IOVAs. But saying "can map any IOVA" is not clear, because
> if you can't map it, it's not a (valid) IOVA. Better to say that
> IOVAs can be any 64-bit value, which I think is what you really mean
> here.
ok
> Of course, since POWER is a platform where this is *not* true, I'd
> prefer to have something giving the range of valid IOVAs in the core
> to start with.
Since iommu_ops does not yet have any concept of this (nudge, nudge), I
figured this would be added later. A possible implementation would be
that such an iommu would not set MAP_ANY, would add a new flag for
MAP_RANGE, and provide a new VFIO_IOMMU_GET_RANGE_INFO ioctl to describe
it. I'm guaranteed to get it wrong if I try to predict all your needs.
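For the sake of discussion it could look something like the below (struct
layout and ioctl number invented here, nothing is implemented):

    #define VFIO_IOMMU_FLAGS_MAP_RANGE  (1 << 1)

    struct vfio_iommu_range_info {
        __u64 len;          /* length of structure */
        __u64 iova_start;   /* first IOVA the domain can map */
        __u64 iova_size;    /* size of the valid IOVA window */
        __u64 flags;
    };
    #define VFIO_IOMMU_GET_RANGE_INFO _IOWR(';', 119, struct vfio_iommu_range_info)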
> > +
> > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > +and unmapping IOVAs to process virtual addresses:
> > +
> > +struct vfio_dma_map {
> > + __u64 len; /* length of structure */
>
> Thanks for adding these structure length fields. But I think they
> should be called something other than 'len', which is likely to be
> confused with size (or some other length that's actually related to
> the operation's parameters). Better to call it 'structlen' or
> 'argslen' or something.
Ok. As Scott noted, I've failed to implement these in a way that
actually allows extension, but I'll work on it.
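One way it could work (sketch only, not what the patch does today): treat
'len' as the size userspace filled in, copy in only as much as this kernel
understands, and ignore any tail it doesn't know about:

    struct vfio_dma_map map;

    /* peek at the size userspace claims to have filled in */
    if (copy_from_user(&map, (void __user *)arg, sizeof(map.len)))
        return -EFAULT;

    if (map.len < sizeof(map))
        return -EINVAL; /* or accept known older/shorter layouts */

    /* copy only the fields this kernel knows; any newer tail is ignored */
    if (copy_from_user(&map, (void __user *)arg, sizeof(map)))
        return -EFAULT;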
> > + __u64 vaddr; /* process virtual addr */
> > + __u64 dmaaddr; /* desired and/or returned dma address */
> > + __u64 size; /* size in bytes */
> > + __u64 flags;
> > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
>
> Make it independent READ and WRITE flags from the start. Not all
> combinations will be be valid on all hardware, but that way we have
> the possibilities covered without having to use strange encodings
> later.
Ok.
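i.e. something along the lines of (sketch):

    #define VFIO_DMA_MAP_FLAG_READ  (1 << 0) /* device may read from memory */
    #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* device may write to memory */

with the IOMMU backend free to reject combinations it can't support.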
> > +};
> > +
> > +Current users of VFIO use relatively static DMA mappings, not requiring
> > +high frequency turnover. As new users are added, it's expected that the
> > +IOMMU file descriptor will evolve to support new mapping interfaces, this
> > +will be reflected in the flags and may present new ioctls and file
> > +interfaces.
> > +
> > +The device GET_FLAGS ioctl is intended to return basic device type and
> > +indicate support for optional capabilities. Flags currently include whether
> > +the device is PCI or described by Device Tree, and whether the RESET ioctl
> > +is supported:
> > +
> > +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> > + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> > + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
>
> TBH, I don't think the VFIO for DT stuff is mature enough yet to be in
> an initial infrastructure patch, though we should certainly be
> discussing it as an add-on patch.
I agree for DT, and PCI should be added with vfio-pci, not the initial
core.
> > + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> > +
> > +The MMIO and IOP resources used by a device are described by regions.
> > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > +
> > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> > +
> > +Regions are described by a struct vfio_region_info, which is retrieved by
> > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > +the desired region (0 based index). Note that devices may implement zero
> > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > +mapping).
>
> So, I think you're saying that a zero-sized region is used to encode a
> NOP region, that is, to basically put a "no region here" in between
> valid region indices. You should spell that out.
Ok.
> [Incidentally, any chance you could borrow one of RH's tech writers
> for this? I'm afraid you seem to lack the knack for clear and easily
> read documentation]
Thanks for the encouragement :-\ It's no wonder there isn't more
content in Documentation.
> > +struct vfio_region_info {
> > + __u32 len; /* length of structure */
> > + __u32 index; /* region number */
> > + __u64 size; /* size in bytes of region */
> > + __u64 offset; /* start offset of region */
> > + __u64 flags;
> > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
>
> Again having separate read and write bits from the start will save
> strange encodings later.
Seems highly unlikely, but we have bits to waste...
> > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> > + __u64 phys; /* physical address of region */
> > +};
>
> I notice there is no field for "type" e.g. MMIO vs. PIO vs. config
> space for PCI. If you added that having a NONE type might be a
> clearer way of encoding a non-region than just having size==0.
I thought there was some resistance to including MMIO and PIO bits in
the flags. If that's passed, I can add it, but PCI can determine this
through config space (and vfio-pci exposes config space at a fixed
index). Having a region w/ size == 0, MMIO and PIO flags unset seems a
little redundant if that's the only reason for having them. A NONE flag
doesn't make sense to me. Config space isn't NONE, but neither is it
MMIO nor PIO; and someone would probably be offended about even
mentioning PIO in the specification.
> > +
> > +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> > +
> > +The offset indicates the offset into the device file descriptor which
> > +accesses the given range (for read/write/mmap/seek). Flags indicate the
> > +available access types and validity of optional fields. For instance
> > +the phys field may only be valid for certain devices types.
> > +
> > +Interrupts are described using a similar interface. GET_NUM_IRQS
> > +reports the number or IRQ indexes for the device.
> > +
> > +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> > +
> > +struct vfio_irq_info {
> > + __u32 len; /* length of structure */
> > + __u32 index; /* IRQ number */
> > + __u32 count; /* number of individual IRQs */
>
> Is there a reason for allowing irqs in batches like this, rather than
> having each MSI be reflected by a separate irq_info?
Yes, bus drivers like vfio-pci can define index 1 as the MSI info
structure and index 2 as MSI-X. There's really no need to expose 57
individual MSI interrupts and try to map them to the correct device
specific MSI type if they can only logically be enabled in two distinct
groups. Bus drivers with individually controllable MSI vectors are free
to expose them separately. I assume device tree paths would help
associate an index to a specific interrupt.
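So for vfio-pci the fixed index-to-type mapping could be as simple as
(sketch; INTx at index 0 is my assumption, only the MSI/MSI-X indexes are
spelled out above):

    enum {
        VFIO_PCI_INTX_IRQ_INDEX,  /* count = 0 or 1 */
        VFIO_PCI_MSI_IRQ_INDEX,   /* count = supported MSI vectors */
        VFIO_PCI_MSIX_IRQ_INDEX,  /* count = supported MSI-X vectors */
        VFIO_PCI_NUM_IRQS,
    };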
> > + __u64 flags;
> > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> > +};
> > +
> > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > +type to index mapping).
>
> I know what you mean, but you need a clearer way to express it.
I'll work on it.
> > +Information about each index can be retrieved using the GET_IRQ_INFO
> > +ioctl, used much like GET_REGION_INFO.
> > +
> > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> > +
> > +Individual indexes can describe single or sets of IRQs. This provides the
> > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> > +
> > +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> > +as shown below, are used to pass the IRQ info index, the number of eventfds,
> > +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> > +
> > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> > +
> > +When a level triggered interrupt is signaled, the interrupt is masked
> > +on the host. This prevents an unresponsive userspace driver from
> > +continuing to interrupt the host system. After servicing the interrupt,
> > +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> > +triggered interrupts implicitly have a count of 1 per index.
>
> This is a silly restriction. Even PCI devices can have up to 4 LSIs
> on a function in theory, though no-one ever does. Embedded devices
> can and do have multiple level interrupts.
Per the PCI spec, an individual PCI function can only ever have, at
most, a single INTx line. A multi-function *device* can have up to 4
INTx lines, but what we're exposing here is a struct device, ie. a PCI
function.
Other devices could certainly have multiple level interrupts, and if
grouping them as we do with MSI on PCI makes sense, please let me know.
I just didn't see the value in making the unmask operations handle
sub-indexes if it's not needed.
> > +
> > +/* Unmask IRQ index, arg[0] = index */
> > +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
> > +
> > +Level triggered interrupts can also be unmasked using an irqfd. Use
> > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> > +
> > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> > +
> > +When supported, as indicated by the device flags, reset the device.
> > +
> > +#define VFIO_DEVICE_RESET _IO(';', 116)
> > +
> > +Device tree devices also include ioctls for further defining the
> > +device tree properties of the device:
> > +
> > +struct vfio_dtpath {
> > + __u32 len; /* length of structure */
> > + __u32 index;
> > + __u64 flags;
> > +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> > +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> > + char *path;
> > +};
> > +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> > +
> > +struct vfio_dtindex {
> > + __u32 len; /* length of structure */
> > + __u32 index;
> > + __u32 prop_type;
> > + __u32 prop_index;
> > + __u64 flags;
> > +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> > +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> > +};
> > +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> > +
> > +
> > +VFIO bus driver API
> > +-------------------------------------------------------------------------------
> > +
> > +Bus drivers, such as PCI, have three jobs:
> > + 1) Add/remove devices from vfio
> > + 2) Provide vfio_device_ops for device access
> > + 3) Device binding and unbinding
> > +
> > +When initialized, the bus driver should enumerate the devices on it's
>
> s/it's/its/
Noted.
<snip>
> > +/* Unmap DMA region */
> > +/* dgate must be held */
> > +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > + int npage, int rdwr)
>
> Use of "read" and "write" in DMA can often be confusing, since it's
> not always clear if you're talking from the perspective of the CPU or
> the device (_writing_ data to a device will usually involve it doing
> DMA _reads_ from memory). It's often best to express things as DMA
> direction, 'to device', and 'from device' instead.
Good point.
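Internally the rdwr int could become an explicit direction, mirroring
enum dma_data_direction (sketch, names invented):

    enum vfio_dma_dir {
        VFIO_DMA_TO_DEVICE,     /* device reads from memory */
        VFIO_DMA_FROM_DEVICE,   /* device writes to memory */
        VFIO_DMA_BIDIRECTIONAL,
    };

    static int __vfio_dma_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
                                int npage, enum vfio_dma_dir dir);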
> > +{
> > + int i, unlocked = 0;
> > +
> > + for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > + unsigned long pfn;
> > +
> > + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> > + if (pfn) {
> > + iommu_unmap(iommu->domain, iova, 0);
> > + unlocked += put_pfn(pfn, rdwr);
> > + }
> > + }
> > + return unlocked;
> > +}
> > +
> > +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > + unsigned long npage, int rdwr)
> > +{
> > + int unlocked;
> > +
> > + unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> > + vfio_lock_acct(-unlocked);
>
> Have you checked that your accounting will work out if the user maps
> the same memory page to multiple IOVAs?
Hmm, it probably doesn't. We potentially over-penalize the user process
here.
> > +}
> > +
> > +/* Unmap ALL DMA regions */
> > +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> > +{
> > + struct list_head *pos, *pos2;
> > + struct dma_map_page *mlp;
> > +
> > + mutex_lock(&iommu->dgate);
> > + list_for_each_safe(pos, pos2, &iommu->dm_list) {
> > + mlp = list_entry(pos, struct dma_map_page, list);
> > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > + list_del(&mlp->list);
> > + kfree(mlp);
> > + }
> > + mutex_unlock(&iommu->dgate);
>
> Ouch, no good at all. Keeping track of every DMA map is no good on
> POWER or other systems where IOMMU operations are a hot path. I think
> you'll need an iommu specific hook for this instead, which uses
> whatever data structures are natural for the IOMMU. For example a
> 1-level pagetable, like we use on POWER will just zero every entry.
It's already been noted in the docs that current users have relatively
static mappings and a performance interface is TBD for dynamically
backing streaming DMA. The current vfio_iommu exposes iommu_ops, POWER
will need to come up with something to expose instead.
> > +}
> > +
> > +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> > +{
> > + struct page *page[1];
> > + struct vm_area_struct *vma;
> > + int ret = -EFAULT;
> > +
> > + if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> > + *pfn = page_to_pfn(page[0]);
> > + return 0;
> > + }
> > +
> > + down_read(&current->mm->mmap_sem);
> > +
> > + vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> > +
> > + if (vma && vma->vm_flags & VM_PFNMAP) {
> > + *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > + if (is_invalid_reserved_pfn(*pfn))
> > + ret = 0;
> > + }
>
> It's kind of nasty that you take gup_fast(), already designed to grab
> pointers for multiple user pages, then just use it one page at a time,
> even for a big map.
Yep, this needs work, but shouldn't really change the API.
> > + up_read(&current->mm->mmap_sem);
> > +
> > + return ret;
> > +}
> > +
> > +/* Map DMA region */
> > +/* dgate must be held */
> > +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> > + unsigned long vaddr, int npage, int rdwr)
>
> iova should be a dma_addr_t. Bus address size need not match virtual
> address size, and may not fit in an unsigned long.
ok.
> > +{
> > + unsigned long start = iova;
> > + int i, ret, locked = 0, prot = IOMMU_READ;
> > +
> > + /* Verify pages are not already mapped */
> > + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> > + if (iommu_iova_to_phys(iommu->domain, iova))
> > + return -EBUSY;
> > +
> > + iova = start;
> > +
> > + if (rdwr)
> > + prot |= IOMMU_WRITE;
> > + if (iommu->cache)
> > + prot |= IOMMU_CACHE;
> > +
> > + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > + unsigned long pfn = 0;
> > +
> > + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > + if (ret) {
> > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > + return ret;
> > + }
> > +
> > + /* Only add actual locked pages to accounting */
> > + if (!is_invalid_reserved_pfn(pfn))
> > + locked++;
> > +
> > + ret = iommu_map(iommu->domain, iova,
> > + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> > + if (ret) {
> > + /* Back out mappings on error */
> > + put_pfn(pfn, rdwr);
> > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > + return ret;
> > + }
> > + }
> > + vfio_lock_acct(locked);
> > + return 0;
> > +}
> > +
> > +static inline int ranges_overlap(unsigned long start1, size_t size1,
> > + unsigned long start2, size_t size2)
> > +{
> > + return !(start1 + size1 <= start2 || start2 + size2 <= start1);
>
> Needs overflow safety.
Yep.
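Something like this, perhaps (untested; also switching to dma_addr_t per
your earlier comment):

    static inline int ranges_overlap(dma_addr_t start1, size_t size1,
                                     dma_addr_t start2, size_t size2)
    {
            dma_addr_t end1, end2;

            if (!size1 || !size2)
                    return 0;

            end1 = start1 + (size1 - 1);
            end2 = start2 + (size2 - 1);

            /* Ranges wrapping past the top of the address space are
             * bogus input; claiming an overlap at least makes the
             * caller fail the map. */
            if (end1 < start1 || end2 < start2)
                    return 1;

            return start1 <= end2 && start2 <= end1;
    }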
> > +}
> > +
> > +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> > + dma_addr_t start, size_t size)
> > +{
> > + struct list_head *pos;
> > + struct dma_map_page *mlp;
> > +
> > + list_for_each(pos, &iommu->dm_list) {
> > + mlp = list_entry(pos, struct dma_map_page, list);
> > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > + start, size))
> > + return mlp;
> > + }
> > + return NULL;
> > +}
>
> Again, keeping track of each dma map operation is no good for
> performance.
This is not the performance interface you're looking for.
> > +
> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> > + size_t size, struct dma_map_page *mlp)
> > +{
> > + struct dma_map_page *split;
> > + int npage_lo, npage_hi;
> > +
> > + /* Existing dma region is completely covered, unmap all */
> > + if (start <= mlp->daddr &&
> > + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > + list_del(&mlp->list);
> > + npage_lo = mlp->npage;
> > + kfree(mlp);
> > + return npage_lo;
> > + }
> > +
> > + /* Overlap low address of existing range */
> > + if (start <= mlp->daddr) {
> > + size_t overlap;
> > +
> > + overlap = start + size - mlp->daddr;
> > + npage_lo = overlap >> PAGE_SHIFT;
> > + npage_hi = mlp->npage - npage_lo;
> > +
> > + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > + mlp->daddr += overlap;
> > + mlp->vaddr += overlap;
> > + mlp->npage -= npage_lo;
> > + return npage_lo;
> > + }
> > +
> > + /* Overlap high address of existing range */
> > + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + size_t overlap;
> > +
> > + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > + npage_hi = overlap >> PAGE_SHIFT;
> > + npage_lo = mlp->npage - npage_hi;
> > +
> > + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > + mlp->npage -= npage_hi;
> > + return npage_hi;
> > + }
> > +
> > + /* Split existing */
> > + npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> > + npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> > +
> > + split = kzalloc(sizeof *split, GFP_KERNEL);
> > + if (!split)
> > + return -ENOMEM;
> > +
> > + vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> > +
> > + mlp->npage = npage_lo;
> > +
> > + split->npage = npage_hi;
> > + split->daddr = start + size;
> > + split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> > + split->rdwr = mlp->rdwr;
> > + list_add(&split->list, &iommu->dm_list);
> > + return size >> PAGE_SHIFT;
> > +}
> > +
> > +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > + int ret = 0;
> > + size_t npage = dmp->size >> PAGE_SHIFT;
> > + struct list_head *pos, *n;
> > +
> > + if (dmp->dmaaddr & ~PAGE_MASK)
> > + return -EINVAL;
> > + if (dmp->size & ~PAGE_MASK)
> > + return -EINVAL;
> > +
> > + mutex_lock(&iommu->dgate);
> > +
> > + list_for_each_safe(pos, n, &iommu->dm_list) {
> > + struct dma_map_page *mlp;
> > +
> > + mlp = list_entry(pos, struct dma_map_page, list);
> > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > + dmp->dmaaddr, dmp->size)) {
> > + ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> > + dmp->size, mlp);
> > + if (ret > 0)
> > + npage -= NPAGE_TO_SIZE(ret);
> > + if (ret < 0 || npage == 0)
> > + break;
> > + }
> > + }
> > + mutex_unlock(&iommu->dgate);
> > + return ret > 0 ? 0 : ret;
> > +}
> > +
> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > + int npage;
> > + struct dma_map_page *mlp, *mmlp = NULL;
> > + dma_addr_t daddr = dmp->dmaaddr;
> > + unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > + size_t size = dmp->size;
> > + int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > +
> > + if (vaddr & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (daddr & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (size & (PAGE_SIZE-1))
> > + return -EINVAL;
> > +
> > + npage = size >> PAGE_SHIFT;
> > + if (!npage)
> > + return -EINVAL;
> > +
> > + if (!iommu)
> > + return -EINVAL;
> > +
> > + mutex_lock(&iommu->dgate);
> > +
> > + if (vfio_find_dma(iommu, daddr, size)) {
> > + ret = -EBUSY;
> > + goto out_lock;
> > + }
> > +
> > + /* account for locked pages */
> > + locked = current->mm->locked_vm + npage;
> > + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > + __func__, rlimit(RLIMIT_MEMLOCK));
> > + ret = -ENOMEM;
> > + goto out_lock;
> > + }
> > +
> > + ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> > + if (ret)
> > + goto out_lock;
> > +
> > + /* Check if we abut a region below */
> > + if (daddr) {
> > + mlp = vfio_find_dma(iommu, daddr - 1, 1);
> > + if (mlp && mlp->rdwr == rdwr &&
> > + mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> > +
> > + mlp->npage += npage;
> > + daddr = mlp->daddr;
> > + vaddr = mlp->vaddr;
> > + npage = mlp->npage;
> > + size = NPAGE_TO_SIZE(npage);
> > +
> > + mmlp = mlp;
> > + }
> > + }
> > +
> > + if (daddr + size) {
> > + mlp = vfio_find_dma(iommu, daddr + size, 1);
> > + if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> > +
> > + mlp->npage += npage;
> > + mlp->daddr = daddr;
> > + mlp->vaddr = vaddr;
> > +
> > + /* If merged above and below, remove previously
> > + * merged entry. New entry covers it. */
> > + if (mmlp) {
> > + list_del(&mmlp->list);
> > + kfree(mmlp);
> > + }
> > + mmlp = mlp;
> > + }
> > + }
> > +
> > + if (!mmlp) {
> > + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> > + if (!mlp) {
> > + ret = -ENOMEM;
> > + vfio_dma_unmap(iommu, daddr, npage, rdwr);
> > + goto out_lock;
> > + }
> > +
> > + mlp->npage = npage;
> > + mlp->daddr = daddr;
> > + mlp->vaddr = vaddr;
> > + mlp->rdwr = rdwr;
> > + list_add(&mlp->list, &iommu->dm_list);
> > + }
> > +
> > +out_lock:
> > + mutex_unlock(&iommu->dgate);
> > + return ret;
> > +}
>
> This whole tracking infrastructure is way too complex to impose on
> every IOMMU. We absolutely don't want to do all this when just
> updating a 1-level pagetable.
If only POWER implemented an iommu_ops so we had something on which we
could base an alternate iommu model and pluggable iommu registration...
> > +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_iommu *iommu = filep->private_data;
> > +
> > + vfio_release_iommu(iommu);
> > + return 0;
> > +}
> > +
> > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + struct vfio_iommu *iommu = filep->private_data;
> > + int ret = -ENOSYS;
> > +
> > + if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > +
> > + ret = put_user(flags, (u64 __user *)arg);
>
> Um.. flags surely have to come from the IOMMU driver.
This vfio_iommu object is backed by iommu_ops, which supports this
mapping.
> > + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > + struct vfio_dma_map dm;
> > +
> > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > + return -EFAULT;
> > +
> > + ret = vfio_dma_map_dm(iommu, &dm);
> > +
> > + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> > + ret = -EFAULT;
> > +
> > + } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> > + struct vfio_dma_map dm;
> > +
> > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > + return -EFAULT;
> > +
> > + ret = vfio_dma_unmap_dm(iommu, &dm);
> > +
> > + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> > + ret = -EFAULT;
> > + }
> > + return ret;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_iommu_compat_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + arg = (unsigned long)compat_ptr(arg);
> > + return vfio_iommu_unl_ioctl(filep, cmd, arg);
>
> Um, this only works if the structures are exactly compatible between
> 32-bit and 64-bit ABIs. I don't think that is always true.
I think all our structure sizes are independent of host width. If I'm
missing something, let me know.
> > +}
> > +#endif /* CONFIG_COMPAT */
> > +
> > +const struct file_operations vfio_iommu_fops = {
> > + .owner = THIS_MODULE,
> > + .release = vfio_iommu_release,
> > + .unlocked_ioctl = vfio_iommu_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > + .compat_ioctl = vfio_iommu_compat_ioctl,
> > +#endif
> > +};
> > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > new file mode 100644
> > index 0000000..6169356
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_main.c
> > @@ -0,0 +1,1151 @@
> > +/*
> > + * VFIO framework
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> > + * Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#include <linux/cdev.h>
> > +#include <linux/compat.h>
> > +#include <linux/device.h>
> > +#include <linux/file.h>
> > +#include <linux/anon_inodes.h>
> > +#include <linux/fs.h>
> > +#include <linux/idr.h>
> > +#include <linux/iommu.h>
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/string.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/wait.h>
> > +
> > +#include "vfio_private.h"
> > +
> > +#define DRIVER_VERSION "0.2"
> > +#define DRIVER_AUTHOR "Alex Williamson <alex.williamson@redhat.com>"
> > +#define DRIVER_DESC "VFIO - User Level meta-driver"
> > +
> > +static int allow_unsafe_intrs;
> > +module_param(allow_unsafe_intrs, int, 0);
> > +MODULE_PARM_DESC(allow_unsafe_intrs,
> > + "Allow use of IOMMUs which do not support interrupt remapping");
>
> This should not be a global option, but part of the AMD/Intel IOMMU
> specific code. In general it's a question of how strict the IOMMU
> driver is about isolation when it determines what the groups are, and
> only the IOMMU driver can know what the possibilities are for its
> class of hardware.
I agree this should probably be tied more closely to the iommu driver,
but again, we only have iommu_ops right now.
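For illustration, if the iommu API grew a capability for this (hypothetical
name, nothing like it exists in iommu_ops today), the module option could
become a per-domain check in __vfio_open_iommu(), roughly:

    /* hypothetical: enum iommu_cap gains IOMMU_CAP_INTR_REMAP */
    if (!allow_unsafe_intrs &&
        !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP))
            return -EPERM;

and the Intel/AMD drivers would report the cap based on whether interrupt
remapping is actually enabled.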
<snip>
> > +
> > +/* Attempt to merge the group pointed to by fd into group. The merge-ee
> > + * group must not have an iommu or any devices open because we cannot
> > + * maintain that context across the merge. The merge-er group can be
> > + * in use. */
>
> Yeah, so merge-er group in use still has its problems, because it
> could affect what the IOMMU is capable of.
As seen below, we deny merging if the iommu domains are not exactly
compatible. Our notion of what compatible means depends on what
iommu_ops exposes though.
> > +static int vfio_group_merge(struct vfio_group *group, int fd)
> > +{
> > + struct vfio_group *new;
> > + struct vfio_iommu *old_iommu;
> > + struct file *file;
> > + int ret = 0;
> > + bool opened = false;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + file = fget(fd);
> > + if (!file) {
> > + ret = -EBADF;
> > + goto out_noput;
> > + }
> > +
> > + /* Sanity check, is this really our fd? */
> > + if (file->f_op != &vfio_group_fops) {
>
> This should be a WARN_ON or BUG_ON rather than just an error return, surely.
No, I don't think so. We're passed a file descriptor that could be for
anything. If the user passed a file descriptor for something that's not
a vfio group, that's a user error, not an internal consistency error of
vfio.
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + new = file->private_data;
> > +
> > + if (!new || new == group || !new->iommu ||
> > + new->iommu->domain || new->bus != group->bus) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* We need to attach all the devices to each domain separately
> > + * in order to validate that the capabilities match for both. */
> > + ret = __vfio_open_iommu(new->iommu);
> > + if (ret)
> > + goto out;
> > +
> > + if (!group->iommu->domain) {
> > + ret = __vfio_open_iommu(group->iommu);
> > + if (ret)
> > + goto out;
> > + opened = true;
> > + }
> > +
> > + /* If cache coherency doesn't match we'd potentially need to
> > + * remap existing iommu mappings in the merge-er domain.
> > + * Poor return to bother trying to allow this currently. */
> > + if (iommu_domain_has_cap(group->iommu->domain,
> > + IOMMU_CAP_CACHE_COHERENCY) !=
> > + iommu_domain_has_cap(new->iommu->domain,
> > + IOMMU_CAP_CACHE_COHERENCY)) {
> > + __vfio_close_iommu(new->iommu);
> > + if (opened)
> > + __vfio_close_iommu(group->iommu);
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* Close the iommu for the merge-ee and attach all its devices
> > + * to the merge-er iommu. */
> > + __vfio_close_iommu(new->iommu);
> > +
> > + ret = __vfio_iommu_attach_group(group->iommu, new);
> > + if (ret)
> > + goto out;
> > +
> > + /* set_iommu unlinks new from the iommu, so save a pointer to it */
> > + old_iommu = new->iommu;
> > + __vfio_group_set_iommu(new, group->iommu);
> > + kfree(old_iommu);
> > +
> > +out:
> > + fput(file);
> > +out_noput:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Unmerge the group pointed to by fd from group. */
> > +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> > +{
> > + struct vfio_group *new;
> > + struct vfio_iommu *new_iommu;
> > + struct file *file;
> > + int ret = 0;
> > +
> > + /* Since the merge-out group is already opened, it needs to
> > + * have an iommu struct associated with it. */
> > + new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> > + if (!new_iommu)
> > + return -ENOMEM;
> > +
> > + INIT_LIST_HEAD(&new_iommu->group_list);
> > + INIT_LIST_HEAD(&new_iommu->dm_list);
> > + mutex_init(&new_iommu->dgate);
> > + new_iommu->bus = group->bus;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + file = fget(fd);
> > + if (!file) {
> > + ret = -EBADF;
> > + goto out_noput;
> > + }
> > +
> > + /* Sanity check, is this really our fd? */
> > + if (file->f_op != &vfio_group_fops) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + new = file->private_data;
> > + if (!new || new == group || new->iommu != group->iommu) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* We can't merge-out a group with devices still in use. */
> > + if (__vfio_group_devs_inuse(new)) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > +
> > + __vfio_iommu_detach_group(group->iommu, new);
> > + __vfio_group_set_iommu(new, new_iommu);
> > +
> > +out:
> > + fput(file);
> > +out_noput:
> > + if (ret)
> > + kfree(new_iommu);
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Get a new iommu file descriptor. This will open the iommu, setting
> > + * the current->mm ownership if it's not already set. */
>
> I know I've had this explained to me several times before, but I've
> forgotten again. Why do we need to wire the iommu to an mm?
We're mapping process virtual addresses into the IOMMU, so it makes
sense to restrict ourselves to a single virtual address space. It also
enforces the ownership, that only a single mm is in control of the
group.
> > +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> > +{
> > + int ret = 0;
> > +
> > + mutex_lock(&vfio.lock);
> > +
> > + if (!group->iommu->domain) {
> > + ret = __vfio_open_iommu(group->iommu);
> > + if (ret)
> > + goto out;
> > + }
> > +
> > + ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> > + group->iommu, O_RDWR);
> > + if (ret < 0)
> > + goto out;
> > +
> > + group->iommu->refcnt++;
> > +out:
> > + mutex_unlock(&vfio.lock);
> > + return ret;
> > +}
> > +
> > +/* Get a new device file descriptor. This will open the iommu, setting
> > + * the current->mm ownership if it's not already set. It's difficult to
> > + * specify the requirements for matching a user supplied buffer to a
> > + * device, so we use a vfio driver callback to test for a match. For
> > + * PCI, dev_name(dev) is unique, but other drivers may require including
> > + * a parent device string. */
>
> At some point we probably want an interface to enumerate the devices
> too, but that can probably wait.
That's what I decided as well. I also haven't been able to come up with
an interface for it that doesn't make me want to vomit.
> > +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> > +{
Thanks,
Alex
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-15 18:01 ` Alex Williamson
@ 2011-11-17 0:02 ` David Gibson
2011-11-18 20:32 ` Alex Williamson
0 siblings, 1 reply; 62+ messages in thread
From: David Gibson @ 2011-11-17 0:02 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, aik, pmac, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
> On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> > > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > > new file mode 100644
> > > index 0000000..5866896
> > > --- /dev/null
> > > +++ b/Documentation/vfio.txt
> > > @@ -0,0 +1,304 @@
> > > +VFIO - "Virtual Function I/O"[1]
> > > +-------------------------------------------------------------------------------
> > > +Many modern systems now provide DMA and interrupt remapping facilities
> > > +to help ensure I/O devices behave within the boundaries they've been
> > > +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as
> > > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > > +embedded powerpc systems (technology name unknown). The VFIO driver
> > > +is an IOMMU/device agnostic framework for exposing direct device
> > > +access to userspace, in a secure, IOMMU protected environment. In
> > > +other words, this allows safe, non-privileged, userspace drivers.
> >
> > It's perhaps worth emphasising that "safe" depends on the hardware
> > being sufficiently well behaved. BenH, I know, thinks there are a
> > *lot* of cards that, e.g. have debug registers that allow a backdoor
> > to their own config space via MMIO, which would bypass vfio's
> > filtering of config space access. And that's before we even get into
> > the varying degrees of completeness in the isolation provided by
> > different IOMMUs.
>
> Fair enough. I know Tom had emphasized "well behaved" in the original
> doc. Virtual functions are probably the best indicator of well behaved.
>
> > > +Why do we want that? Virtual machines often make use of direct device
> > > +access ("device assignment") when configured for the highest possible
> > > +I/O performance. From a device and host perspective, this simply turns
> > > +the VM into a userspace driver, with the benefits of significantly
> > > +reduced latency, higher bandwidth, and direct use of bare-metal device
> > > +drivers[2].
> > > +
> > > +Some applications, particularly in the high performance computing
> > > +field, also benefit from low-overhead, direct device access from
> > > +userspace. Examples include network adapters (often non-TCP/IP based)
> > > +and compute accelerators. Previous to VFIO, these drivers needed to
> >
> > s/Previous/Prior/ although that may be a .us vs .au usage thing.
>
> Same difference, AFAICT.
>
> > > +go through the full development cycle to become a proper upstream driver,
> > > +be maintained out of tree, or make use of the UIO framework, which
> > > +has no notion of IOMMU protection, limited interrupt support, and
> > > +requires root privileges to access things like PCI configuration space.
> > > +
> > > +The VFIO driver framework intends to unify these, replacing the KVM
> > > +PCI specific device assignment currently used as well as providing
> > > +a more secure, more featureful userspace driver environment than UIO.
> > > +
> > > +Groups, Devices, IOMMUs, oh my
> > > +-------------------------------------------------------------------------------
> > > +
> > > +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> > > +can't always distinguish transactions from each individual device in
> > > +the system. Sometimes this is because of the IOMMU design, such as with
> > > +PEs, other times it's caused by the I/O topology, for instance a
> > > +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> > > +devices created by these restrictions IOMMU groups (or just "groups" for
> > > +this document).
> > > +
> > > +The IOMMU cannot distinguish transactions between the individual devices
> > > +within the group, therefore the group is the basic unit of ownership for
> > > +a userspace process. Because of this, groups are also the primary
> > > +interface to both devices and IOMMU domains in VFIO.
> > > +
> > > +The VFIO representation of groups is created as devices are added into
> > > +the framework by a VFIO bus driver. The vfio-pci module is an example
> > > +of a bus driver. This module registers devices along with a set of bus
> > > +specific callbacks with the VFIO core. These callbacks provide the
> > > +interfaces later used for device access. As each new group is created,
> > > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > > +character device.
> >
> > Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
> > bus driver is per bus type, not per bus instance. But grouping
> > constraints could be per bus instance, if you have a couple of
> > different models of PCI host bridge with IOMMUs of different
> > capabilities built in, for example.
>
> Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
> instance.
Ok, how can that work? vfio-pci is responsible for generating the
groupings, yes? For which it needs to know the iommu/host bridge's
isolation capabilities, which vary depending on the type of host
bridge.
> IOMMUs also register drivers per bus type, not per bus
> instance. The IOMMU driver is free to impose any constraints it wants.
>
> > > +In addition to the device enumeration and callbacks, the VFIO bus driver
> > > +also provides a traditional device driver and is able to bind to devices
> > > +on its bus. When a device is bound to the bus driver it's available to
> > > +VFIO. When all the devices within a group are bound to their bus drivers,
> > > +the group becomes "viable" and a user with sufficient access to the VFIO
> > > +group chardev can obtain exclusive access to the set of group devices.
> > > +
> > > +As documented in linux/vfio.h, several ioctls are provided on the
> > > +group chardev:
> > > +
> > > +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> > > + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> > > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> > > +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> > > +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> > > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> > > +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> > > +
> > > +The last two ioctls return new file descriptors for accessing
> > > +individual devices within the group and programming the IOMMU. Each of
> > > +these new file descriptors provide their own set of file interfaces.
> > > +These ioctls will fail if any of the devices within the group are not
> > > +bound to their VFIO bus driver. Additionally, when either of these
> > > +interfaces are used, the group is then bound to the struct_mm of the
> > > +caller. The GET_FLAGS ioctl can be used to view the state of the group.
> > > +
> > > +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> > > +new IOMMU domain is created and all of the devices in the group are
> > > +attached to it. This is the only way to ensure full IOMMU isolation
> > > +of the group, but potentially wastes resources and cycles if the user
> > > +intends to manage multiple groups with the same set of IOMMU mappings.
> > > +VFIO therefore provides a group MERGE and UNMERGE interface, which
> > > +allows multiple groups to share an IOMMU domain. Not all IOMMUs allow
> > > +arbitrary groups to be merged, so the user should assume merging is
> > > +opportunistic.
> >
> > I do not think "opportunistic" means what you think it means..
> >
> > > A new group, with no open device or IOMMU file
> > > +descriptors, can be merged into an existing, in-use, group using the
> > > +MERGE ioctl. A merged group can be unmerged using the UNMERGE ioctl
> > > +once all of the device file descriptors for the group being merged
> > > +"out" are closed.
> > > +
> > > +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> > > +essentially fungible between group file descriptors (ie. if device
> > > A
> >
> > IDNT "fungible" MWYTIM, either.
>
> Hmm, feel free to suggest. Maybe we're hitting .us vs .au connotation.
In any case, I don't think it's a word whose meaning is unambiguous
enough to use here.
> > > +is in group X, and X is merged with Y, a file descriptor for A can be
> > > +retrieved using GET_DEVICE_FD on Y. Likewise, GET_IOMMU_FD returns a
> > > +file descriptor referencing the same internal IOMMU object from either
> > > +X or Y). Merged groups can be dissolved either explicitly with UNMERGE
> > > +or automatically when ALL file descriptors for the merged group are
> > > +closed (all IOMMUs, all devices, all groups).
> >
> > Blech. I'm really not liking this merge/unmerge API as it stands,
> > it's horribly confusing. At the very least, we need some better
> > terminology. We need some term for the metagroups; supergroups; iommu
> > domains or-at-least-they-will-be-once-we-open-the-iommu or
> > whathaveyous.
> >
> > The first confusing thing about this interface is that each open group
> > handle actually refers to two different things; the original group you
> > opened and the metagroup it's a part of. For the GET_IOMMU_FD and
> > GET_DEVICE_FD operations, you're using the metagroup and two "merged"
> > group handles are interchangeable.
>
> Fungible, even ;)
>
> > For other MERGE and especially
> > UNMERGE operations, it matters which is the original group.
>
> If I stick two LEGO blocks together, I need to identify the individual
> block I want to remove to pull them back apart...
Yeah, I'm starting to get my head around the model, but the current
description of it doesn't help very much. In particular the terms
"merge" and "unmerge" lead one to the wrong mental model, I think.
> > The semantics of "merge" and "unmerge" under those names are really
> > non-obvious. Merge kind of has to merge two whole metagroups, but
> > it's unclear if unmerge reverses one merge, or just takes out one
> > (atom) group. These operations need better names, at least.
>
> Christian suggested a change to UNMERGE that we do not need to
> specify a group to unmerge "from". This makes it more like a list
> implementation except there's no defined list_head. Any member of the
> list can pull in a new entry. Calling UNMERGE on any member extracts
> that member.
I think that's a good idea, but "unmerge" is not a good word for it.
> > Then it's unclear what order you can do various operations, and which
> > order you can open and close various things. You can kind of figure
> > it out but it takes far more thinking than it should.
> >
> >
> > So at the _very_ least, we need to invent new terminology and find a
> > much better way of describing this API's semantics. I still think an
> > entirely different interface, where metagroups are created from
> > outside with a lifetime that's not tied to an fd would be a better
> > idea.
>
> As we've discussed previously, configfs provides part of this, but has
> no ioctl support. It doesn't make sense to me to go play with groups in
> configfs, but then still interact with them via a char dev.
Why not? You configure, say, loopback devices with losetup, then use
them as a block device. Similar with nbd. You can configure serial
devices with setserial, then use them as a char dev.
> It also
> splits the ownership model
I'm not even sure what that means.
> and makes it harder to enforce who gets to
> interact with the devices vs who gets to manipulate groups.
How so.
> The current
> model really isn't that complicated, imho. As always, feel free to
> suggest specific models. If you have a specific terminology other than
> MERGE, please suggest.
>
> > Now, you specify that you can't use a group as the second argument of
> > a merge if it already has an open iommu, but it's not clear from the
> > doc if you can merge things into a group with an open iommu.
>
> From above:
>
> A new group, with no open device or IOMMU file descriptors, can
> be merged into an existing, in-use, group using the MERGE ioctl.
> ^^^^^^
>
> > Banning
> > this would make life simpler, because the IOMMU's effective
> > capabilities may change if you add more devices to the domain. That's
> > yet another non-obvious constraint in the interface ordering, though.
>
> Banning this would prevent using merged groups with hotplug, which I
> consider to be a primary use case.
Yeah, fair enough, based on your later comments w.r.t. only combining
feature compatible groups.
> > > +The IOMMU file descriptor provides this set of ioctls:
> > > +
> > > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> > > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> > > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> > > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
> > > +
> > > +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> > > +We currently only support IOMMU domains that are able to map any
> > > +virtual address to any IOVA. This is indicated by the MAP_ANY
> > > flag.
> >
> > So. I tend to think of an IOMMU mapping IOVAs to memory pages, rather
> > than memory pages to IOVAs.
>
> I do too, not sure why I wrote it that way, will fix.
>
> > The IOMMU itself, of course maps to
> > physical addresses, and the meaning of "virtual address" in this
> > context is not really clear. I think you would be better off saying
> > the IOMMU can map any IOVA to any memory page. From a hardware POV
> > that means any physical address, but of course for a VFIO user a page
> > is specified by its process virtual address.
>
> Will fix.
>
> > I think we need to pin exactly what "MAP_ANY" means down better. Now,
> > VFIO is pretty much a lost cause if you can't map any normal process
> > memory page into the IOMMU, so I think the only thing that is really
> > covered is IOVAs. But saying "can map any IOVA" is not clear, because
> > if you can't map it, it's not a (valid) IOVA. Better to say that
> > IOVAs can be any 64-bit value, which I think is what you really mean
> > here.
>
> ok
>
> > Of course, since POWER is a platform where this is *not* true, I'd
> > prefer to have something giving the range of valid IOVAs in the core
> > to start with.
>
> Since iommu_ops does not yet have any concept of this (nudge, nudge), I
> figured this would be added later. A possible implementation would be
> that such an iommu would not set MAP_ANY, would add a new flag for
> MAP_RANGE, and provide a new VFIO_IOMMU_GET_RANGE_INFO ioctl to describe
> it. I'm guaranteed to get it wrong if I try to predict all your needs.
Hrm. "ANY" just really bothers me because "any iova" is not as clear
a concept as it first appears. For starters it's actually "any page
aligned" at the very least. But then it's only any 64-bit address for
busses which have full 64-bit addressing (and I do wonder if there are
any north bridges out there that forgot to implement some of the upper
PCI address bits properly, given that 64-bit CPUs rarely actually
implement more than 40-something physical address bits in practice).
I'd prefer to see at least something to advertise min and max IOVA and
IOVA alignment. That's enough to cover x86 and POWER, including
possible variants with an IOMMU page size different to the system page
size (note that POWER kernels can have 64k pages as a config option,
which means a TCE page size different to the system page size is quite
common).
Obviously there could be more complex constraints that we would need
to advertise with option bits.
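To make that concrete, the MAP_RANGE idea you mention could look something
like this -- field names and the ioctl number are made up on the spot, just
to show the shape of it:

    #define VFIO_IOMMU_FLAGS_MAP_RANGE (1 << 1)   /* when MAP_ANY is unset */

    struct vfio_iommu_range_info {
            __u32 len;          /* length of structure */
            __u32 flags;
            __u64 iova_min;     /* lowest usable IOVA */
            __u64 iova_max;     /* highest usable IOVA, inclusive */
            __u64 iova_align;   /* required IOVA/size alignment, e.g. the
                                   IOMMU (TCE) page size */
    };
    #define VFIO_IOMMU_GET_RANGE_INFO _IOWR(';', 119, struct vfio_iommu_range_info)

That would be enough to cover the POWER window and 64k TCE page cases
mentioned above, without trying to enumerate every possible constraint up
front.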
> > > +
> > > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > > +and unmapping IOVAs to process virtual addresses:
> > > +
> > > +struct vfio_dma_map {
> > > + __u64 len; /* length of structure */
> >
> > Thanks for adding these structure length fields. But I think they
> > should be called something other than 'len', which is likely to be
> > confused with size (or some other length that's actually related to
> > the operation's parameters). Better to call it 'structlen' or
> > 'argslen' or something.
>
> Ok. As Scott noted, I've failed to implement these in a way that
> actually allows extension, but I'll work on it.
Right. I had failed to realise quite how the encoding of structure
size into the ioctl worked. With that in place, arguably we don't
really need the size in the structure itself, because we can still
have multiple sized versions of the ioctl. Still, whichever.
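(For anyone else following the thread: _IOWR() folds sizeof() of the
argument type into the command number, so a grown structure automatically
gets a distinct ioctl value. Illustrative only:

    #define VFIO_IOMMU_MAP_DMA    _IOWR(';', 106, struct vfio_dma_map)
    /* a hypothetical larger v2 struct yields a different command value */
    #define VFIO_IOMMU_MAP_DMA_V2 _IOWR(';', 106, struct vfio_dma_map_v2)

which is why the in-structure length is arguably redundant.)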
>
> > > + __u64 vaddr; /* process virtual addr */
> > > + __u64 dmaaddr; /* desired and/or returned dma address */
> > > + __u64 size; /* size in bytes */
> > > + __u64 flags;
> > > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> >
> > Make it independent READ and WRITE flags from the start. Not all
> > combinations will be be valid on all hardware, but that way we have
> > the possibilities covered without having to use strange encodings
> > later.
>
> Ok.
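To be concrete, I mean something like (bit assignments just illustrative):

    #define VFIO_DMA_MAP_FLAG_READ  (1 << 0)  /* device may read the mapping */
    #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)  /* device may write the mapping */

Not every IOMMU will accept every combination, but the encoding is there
if one does.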
>
> > > +};
> > > +
> > > +Current users of VFIO use relatively static DMA mappings, not requiring
> > > +high frequency turnover. As new users are added, it's expected that the
> > > +IOMMU file descriptor will evolve to support new mapping interfaces, this
> > > +will be reflected in the flags and may present new ioctls and file
> > > +interfaces.
> > > +
> > > +The device GET_FLAGS ioctl is intended to return basic device type and
> > > +indicate support for optional capabilities. Flags currently include whether
> > > +the device is PCI or described by Device Tree, and whether the RESET ioctl
> > > +is supported:
> > > +
> > > +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> > > + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> > > + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
> >
> > TBH, I don't think the VFIO for DT stuff is mature enough yet to be in
> > an initial infrastructure patch, though we should certainly be
> > discussing it as an add-on patch.
>
> I agree for DT, and PCI should be added with vfio-pci, not the initial
> core.
>
> > > + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> > > +
> > > +The MMIO and IOP resources used by a device are described by regions.
> > > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > > +
> > > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> > > +
> > > +Regions are described by a struct vfio_region_info, which is retrieved by
> > > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > > +the desired region (0 based index). Note that devices may implement zero
> > > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > > +mapping).
> >
> > So, I think you're saying that a zero-sized region is used to encode a
> > NOP region, that is, to basically put a "no region here" in between
> > valid region indices. You should spell that out.
>
> Ok.
>
> > [Incidentally, any chance you could borrow one of RH's tech writers
> > for this? I'm afraid you seem to lack the knack for clear and easily
> > read documentation]
>
> Thanks for the encouragement :-\ It's no wonder there isn't more
> content in Documentation.
Sigh. Alas, yes.
> > > +struct vfio_region_info {
> > > + __u32 len; /* length of structure */
> > > + __u32 index; /* region number */
> > > + __u64 size; /* size in bytes of region */
> > > + __u64 offset; /* start offset of region */
> > > + __u64 flags;
> > > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> > > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> >
> > Again having separate read and write bits from the start will save
> > strange encodings later.
>
> Seems highly unlikely, but we have bits to waste...
>
> > > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> > > + __u64 phys; /* physical address of region */
> > > +};
> >
> > I notice there is no field for "type" e.g. MMIO vs. PIO vs. config
> > space for PCI. If you added that having a NONE type might be a
> > clearer way of encoding a non-region than just having size==0.
>
> I thought there was some resistance to including MMIO and PIO bits in
> the flags. If that's passed, I can add it, but PCI can determine this
> through config space (and vfio-pci exposes config space at a fixed
> index). Having a region w/ size == 0, MMIO and PIO flags unset seems a
> little redundant if that's the only reason for having them. A NONE flag
> doesn't make sense to me. Config space isn't NONE, but neither is it
> MMIO nor PIO; and someone would probably be offended about even
> mentioning PIO in the specification.
No, my concept was that NONE would be used for the indexes where there
is no valid BAR. I'll buy your argument on why not to include the PCI
(or whatever) address space type here.
What I'm just a bit concerned by is whether we could have a case (not
for PCI) of a real resource that still has size 0 - e.g. maybe some
sort of doorbell that can't be read or written, but can be triggered
some other way. I guess that's probably unlikely though.
>
> > > +
> > > +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> > > +
> > > +The offset indicates the offset into the device file descriptor which
> > > +accesses the given range (for read/write/mmap/seek). Flags indicate the
> > > +available access types and validity of optional fields. For instance
> > > +the phys field may only be valid for certain devices types.
> > > +
> > > +Interrupts are described using a similar interface. GET_NUM_IRQS
> > > +reports the number of IRQ indexes for the device.
> > > +
> > > +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> > > +
> > > +struct vfio_irq_info {
> > > + __u32 len; /* length of structure */
> > > + __u32 index; /* IRQ number */
> > > + __u32 count; /* number of individual IRQs */
> >
> > Is there a reason for allowing irqs in batches like this, rather than
> > having each MSI be reflected by a separate irq_info?
>
> Yes, bus drivers like vfio-pci can define index 1 as the MSI info
> structure and index 2 as MSI-X. There's really no need to expose 57
> individual MSI interrupts and try to map them to the correct device
> specific MSI type if they can only logically be enabled in two distinct
> groups. Bus drivers with individually controllable MSI vectors are free
> to expose them separately. I assume device tree paths would help
> associate an index to a specific interrupt.
Ok, fair enough.
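And for what it's worth, the userspace side of enabling, say, an MSI-X
block via the array interface described further down would look roughly
like this (sketch only; device_fd and efds[] are assumed to exist, and
index 2 == MSI-X per vfio-pci's fixed layout):

    /* needs <sys/ioctl.h> and the vfio ioctl definitions */
    static int enable_msix(int device_fd, int *efds, int nvec)
    {
            int args[2 + nvec];     /* [index, count, eventfds...] */
            int i;

            args[0] = 2;            /* IRQ info index */
            args[1] = nvec;         /* number of eventfds */
            for (i = 0; i < nvec; i++)
                    args[2 + i] = efds[i];

            return ioctl(device_fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, args);
    }

Calling it again with a count of 0 would disable them, per the doc.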
> > > + __u64 flags;
> > > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> > > +};
> > > +
> > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > +type to index mapping).
> >
> > I know what you mean, but you need a clearer way to express it.
>
> I'll work on it.
>
> > > +Information about each index can be retrieved using the GET_IRQ_INFO
> > > +ioctl, used much like GET_REGION_INFO.
> > > +
> > > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> > > +
> > > +Individual indexes can describe single or sets of IRQs. This provides the
> > > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> > > +
> > > +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> > > +as shown below, are used to pass the IRQ info index, the number of eventfds,
> > > +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> > > +
> > > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> > > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> > > +
> > > +When a level triggered interrupt is signaled, the interrupt is masked
> > > +on the host. This prevents an unresponsive userspace driver from
> > > +continuing to interrupt the host system. After servicing the interrupt,
> > > +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> > > +triggered interrupts implicitly have a count of 1 per index.
> >
> > This is a silly restriction. Even PCI devices can have up to 4 LSIs
> > on a function in theory, though no-one ever does. Embedded devices
> > can and do have multiple level interrupts.
>
> Per the PCI spec, an individual PCI function can only ever have, at
> most, a single INTx line. A multi-function *device* can have up to 4
> INTx lines, but what we're exposing here is a struct device, ie. a PCI
> function.
Ah, my mistake.
> Other devices could certainly have multiple level interrupts, and if
> grouping them as we do with MSI on PCI makes sense, please let me know.
> I just didn't see the value in making the unmask operations handle
> sub-indexes if it's not needed.
I don't know of anything off hand. But I can't see any consideration
that would make it unlikely either. I generally don't trust anything
*not* to exist in embedded space.
> > > +
> > > +/* Unmask IRQ index, arg[0] = index */
> > > +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
> > > +
> > > +Level triggered interrupts can also be unmasked using an irqfd. Use
> > > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> > > +
> > > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> > > +
> > > +When supported, as indicated by the device flags, reset the device.
> > > +
> > > +#define VFIO_DEVICE_RESET _IO(';', 116)
> > > +
> > > +Device tree devices also include ioctls for further defining the
> > > +device tree properties of the device:
> > > +
> > > +struct vfio_dtpath {
> > > + __u32 len; /* length of structure */
> > > + __u32 index;
> > > + __u64 flags;
> > > +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> > > +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> > > + char *path;
> > > +};
> > > +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> > > +
> > > +struct vfio_dtindex {
> > > + __u32 len; /* length of structure */
> > > + __u32 index;
> > > + __u32 prop_type;
> > > + __u32 prop_index;
> > > + __u64 flags;
> > > +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> > > +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> > > +};
> > > +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> > > +
> > > +
> > > +VFIO bus driver API
> > > +-------------------------------------------------------------------------------
> > > +
> > > +Bus drivers, such as PCI, have three jobs:
> > > + 1) Add/remove devices from vfio
> > > + 2) Provide vfio_device_ops for device access
> > > + 3) Device binding and unbinding
> > > +
> > > +When initialized, the bus driver should enumerate the devices on it's
> >
> > s/it's/its/
>
> Noted.
>
> <snip>
> > > +/* Unmap DMA region */
> > > +/* dgate must be held */
> > > +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > > + int npage, int rdwr)
> >
> > Use of "read" and "write" in DMA can often be confusing, since it's
> > not always clear if you're talking from the perspective of the CPU or
> > the device (_writing_ data to a device will usually involve it doing
> > DMA _reads_ from memory). It's often best to express things as DMA
> > direction, 'to device', and 'from device' instead.
>
> Good point.
This, of course, potentially affects many areas of the code and doco.
> > > +{
> > > + int i, unlocked = 0;
> > > +
> > > + for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > > + unsigned long pfn;
> > > +
> > > + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> > > + if (pfn) {
> > > + iommu_unmap(iommu->domain, iova, 0);
> > > + unlocked += put_pfn(pfn, rdwr);
> > > + }
> > > + }
> > > + return unlocked;
> > > +}
> > > +
> > > +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > > + unsigned long npage, int rdwr)
> > > +{
> > > + int unlocked;
> > > +
> > > + unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> > > + vfio_lock_acct(-unlocked);
> >
> > Have you checked that your accounting will work out if the user maps
> > the same memory page to multiple IOVAs?
>
> Hmm, it probably doesn't. We potentially over-penalize the user process
> here.
Ok.
> > > +}
> > > +
> > > +/* Unmap ALL DMA regions */
> > > +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> > > +{
> > > + struct list_head *pos, *pos2;
> > > + struct dma_map_page *mlp;
> > > +
> > > + mutex_lock(&iommu->dgate);
> > > + list_for_each_safe(pos, pos2, &iommu->dm_list) {
> > > + mlp = list_entry(pos, struct dma_map_page, list);
> > > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > > + list_del(&mlp->list);
> > > + kfree(mlp);
> > > + }
> > > + mutex_unlock(&iommu->dgate);
> >
> > Ouch, no good at all. Keeping track of every DMA map is no good on
> > POWER or other systems where IOMMU operations are a hot path. I think
> > you'll need an iommu specific hook for this instead, which uses
> > whatever data structures are natural for the IOMMU. For example a
> > 1-level pagetable, like we use on POWER will just zero every entry.
>
> It's already been noted in the docs that current users have relatively
> static mappings and a performance interface is TBD for dynamically
> backing streaming DMA. The current vfio_iommu exposes iommu_ops, POWER
> will need to come up with something to expose instead.
Right, but I'm not just talking about the current map/unmap calls
themselves. This infrastructure for tracking it looks like it's
intended to be generic for all mapping methods. If not, I can't see
the reason for it, because I don't think the current interface
requires such tracking inherently.
> > > +}
> > > +
> > > +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> > > +{
> > > + struct page *page[1];
> > > + struct vm_area_struct *vma;
> > > + int ret = -EFAULT;
> > > +
> > > + if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> > > + *pfn = page_to_pfn(page[0]);
> > > + return 0;
> > > + }
> > > +
> > > + down_read(&current->mm->mmap_sem);
> > > +
> > > + vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> > > +
> > > + if (vma && vma->vm_flags & VM_PFNMAP) {
> > > + *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > > + if (is_invalid_reserved_pfn(*pfn))
> > > + ret = 0;
> > > + }
> >
> > It's kind of nasty that you take gup_fast(), already designed to grab
> > pointers for multiple user pages, then just use it one page at a time,
> > even for a big map.
>
> Yep, this needs work, but shouldn't really change the API.
Yes, this could be a later optimization.
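Something along these lines, presumably (untested; the batch size and the
pfn-array interface are just placeholders, and it still needs the
VM_PFNMAP fallback for whatever gup refuses):

    #define VFIO_GUP_BATCH 64

    static long vaddr_get_pfns(unsigned long vaddr, long npage, int write,
                               unsigned long *pfns)
    {
            struct page *pages[VFIO_GUP_BATCH];
            long done = 0;

            while (done < npage) {
                    int want = min_t(long, npage - done, VFIO_GUP_BATCH);
                    int got = get_user_pages_fast(vaddr, want, write, pages);
                    int i;

                    if (got <= 0)
                            break;
                    for (i = 0; i < got; i++)
                            pfns[done + i] = page_to_pfn(pages[i]);
                    done += got;
                    vaddr += (unsigned long)got << PAGE_SHIFT;
                    if (got < want)
                            break;  /* hole or pfnmap; caller falls back */
            }
            return done;
    }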
> > > + up_read(&current->mm->mmap_sem);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +/* Map DMA region */
> > > +/* dgate must be held */
> > > +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> > > + unsigned long vaddr, int npage, int rdwr)
> >
> > iova should be a dma_addr_t. Bus address size need not match virtual
> > address size, and may not fit in an unsigned long.
>
> ok.
Again, the same consideration applies in many places, of course.
> > > +{
> > > + unsigned long start = iova;
> > > + int i, ret, locked = 0, prot = IOMMU_READ;
> > > +
> > > + /* Verify pages are not already mapped */
> > > + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> > > + if (iommu_iova_to_phys(iommu->domain, iova))
> > > + return -EBUSY;
> > > +
> > > + iova = start;
> > > +
> > > + if (rdwr)
> > > + prot |= IOMMU_WRITE;
> > > + if (iommu->cache)
> > > + prot |= IOMMU_CACHE;
> > > +
> > > + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > > + unsigned long pfn = 0;
> > > +
> > > + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > > + if (ret) {
> > > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > > + return ret;
> > > + }
> > > +
> > > + /* Only add actual locked pages to accounting */
> > > + if (!is_invalid_reserved_pfn(pfn))
> > > + locked++;
> > > +
> > > + ret = iommu_map(iommu->domain, iova,
> > > + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> > > + if (ret) {
> > > + /* Back out mappings on error */
> > > + put_pfn(pfn, rdwr);
> > > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > > + return ret;
> > > + }
> > > + }
> > > + vfio_lock_acct(locked);
> > > + return 0;
> > > +}
> > > +
> > > +static inline int ranges_overlap(unsigned long start1, size_t size1,
> > > + unsigned long start2, size_t size2)
> > > +{
> > > + return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> >
> > Needs overflow safety.
>
> Yep.
>
> > > +}
> > > +
> > > +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> > > + dma_addr_t start, size_t size)
> > > +{
> > > + struct list_head *pos;
> > > + struct dma_map_page *mlp;
> > > +
> > > + list_for_each(pos, &iommu->dm_list) {
> > > + mlp = list_entry(pos, struct dma_map_page, list);
> > > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > > + start, size))
> > > + return mlp;
> > > + }
> > > + return NULL;
> > > +}
> >
> > Again, keeping track of each dma map operation is no good for
> > performance.
>
> This is not the performance interface you're looking for.
>
> > > +
> > > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> > > + size_t size, struct dma_map_page *mlp)
> > > +{
> > > + struct dma_map_page *split;
> > > + int npage_lo, npage_hi;
> > > +
> > > + /* Existing dma region is completely covered, unmap all */
> > > + if (start <= mlp->daddr &&
> > > + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > > + list_del(&mlp->list);
> > > + npage_lo = mlp->npage;
> > > + kfree(mlp);
> > > + return npage_lo;
> > > + }
> > > +
> > > + /* Overlap low address of existing range */
> > > + if (start <= mlp->daddr) {
> > > + size_t overlap;
> > > +
> > > + overlap = start + size - mlp->daddr;
> > > + npage_lo = overlap >> PAGE_SHIFT;
> > > + npage_hi = mlp->npage - npage_lo;
> > > +
> > > + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > > + mlp->daddr += overlap;
> > > + mlp->vaddr += overlap;
> > > + mlp->npage -= npage_lo;
> > > + return npage_lo;
> > > + }
> > > +
> > > + /* Overlap high address of existing range */
> > > + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > > + size_t overlap;
> > > +
> > > + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > > + npage_hi = overlap >> PAGE_SHIFT;
> > > + npage_lo = mlp->npage - npage_hi;
> > > +
> > > + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > > + mlp->npage -= npage_hi;
> > > + return npage_hi;
> > > + }
> > > +
> > > + /* Split existing */
> > > + npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> > > + npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> > > +
> > > + split = kzalloc(sizeof *split, GFP_KERNEL);
> > > + if (!split)
> > > + return -ENOMEM;
> > > +
> > > + vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> > > +
> > > + mlp->npage = npage_lo;
> > > +
> > > + split->npage = npage_hi;
> > > + split->daddr = start + size;
> > > + split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> > > + split->rdwr = mlp->rdwr;
> > > + list_add(&split->list, &iommu->dm_list);
> > > + return size >> PAGE_SHIFT;
> > > +}
> > > +
> > > +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > > +{
> > > + int ret = 0;
> > > + size_t npage = dmp->size >> PAGE_SHIFT;
> > > + struct list_head *pos, *n;
> > > +
> > > + if (dmp->dmaaddr & ~PAGE_MASK)
> > > + return -EINVAL;
> > > + if (dmp->size & ~PAGE_MASK)
> > > + return -EINVAL;
> > > +
> > > + mutex_lock(&iommu->dgate);
> > > +
> > > + list_for_each_safe(pos, n, &iommu->dm_list) {
> > > + struct dma_map_page *mlp;
> > > +
> > > + mlp = list_entry(pos, struct dma_map_page, list);
> > > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > > + dmp->dmaaddr, dmp->size)) {
> > > + ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> > > + dmp->size, mlp);
> > > + if (ret > 0)
> > > + npage -= NPAGE_TO_SIZE(ret);
> > > + if (ret < 0 || npage == 0)
> > > + break;
> > > + }
> > > + }
> > > + mutex_unlock(&iommu->dgate);
> > > + return ret > 0 ? 0 : ret;
> > > +}
> > > +
> > > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > > +{
> > > + int npage;
> > > + struct dma_map_page *mlp, *mmlp = NULL;
> > > + dma_addr_t daddr = dmp->dmaaddr;
> > > + unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > > + size_t size = dmp->size;
> > > + int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > > +
> > > + if (vaddr & (PAGE_SIZE-1))
> > > + return -EINVAL;
> > > + if (daddr & (PAGE_SIZE-1))
> > > + return -EINVAL;
> > > + if (size & (PAGE_SIZE-1))
> > > + return -EINVAL;
> > > +
> > > + npage = size >> PAGE_SHIFT;
> > > + if (!npage)
> > > + return -EINVAL;
> > > +
> > > + if (!iommu)
> > > + return -EINVAL;
> > > +
> > > + mutex_lock(&iommu->dgate);
> > > +
> > > + if (vfio_find_dma(iommu, daddr, size)) {
> > > + ret = -EBUSY;
> > > + goto out_lock;
> > > + }
> > > +
> > > + /* account for locked pages */
> > > + locked = current->mm->locked_vm + npage;
> > > + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > > + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > > + __func__, rlimit(RLIMIT_MEMLOCK));
> > > + ret = -ENOMEM;
> > > + goto out_lock;
> > > + }
> > > +
> > > + ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> > > + if (ret)
> > > + goto out_lock;
> > > +
> > > + /* Check if we abut a region below */
> > > + if (daddr) {
> > > + mlp = vfio_find_dma(iommu, daddr - 1, 1);
> > > + if (mlp && mlp->rdwr == rdwr &&
> > > + mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> > > +
> > > + mlp->npage += npage;
> > > + daddr = mlp->daddr;
> > > + vaddr = mlp->vaddr;
> > > + npage = mlp->npage;
> > > + size = NPAGE_TO_SIZE(npage);
> > > +
> > > + mmlp = mlp;
> > > + }
> > > + }
> > > +
> > > + if (daddr + size) {
> > > + mlp = vfio_find_dma(iommu, daddr + size, 1);
> > > + if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> > > +
> > > + mlp->npage += npage;
> > > + mlp->daddr = daddr;
> > > + mlp->vaddr = vaddr;
> > > +
> > > + /* If merged above and below, remove previously
> > > + * merged entry. New entry covers it. */
> > > + if (mmlp) {
> > > + list_del(&mmlp->list);
> > > + kfree(mmlp);
> > > + }
> > > + mmlp = mlp;
> > > + }
> > > + }
> > > +
> > > + if (!mmlp) {
> > > + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> > > + if (!mlp) {
> > > + ret = -ENOMEM;
> > > + vfio_dma_unmap(iommu, daddr, npage, rdwr);
> > > + goto out_lock;
> > > + }
> > > +
> > > + mlp->npage = npage;
> > > + mlp->daddr = daddr;
> > > + mlp->vaddr = vaddr;
> > > + mlp->rdwr = rdwr;
> > > + list_add(&mlp->list, &iommu->dm_list);
> > > + }
> > > +
> > > +out_lock:
> > > + mutex_unlock(&iommu->dgate);
> > > + return ret;
> > > +}
> >
> > This whole tracking infrastructure is way too complex to impose on
> > every IOMMU. We absolutely don't want to do all this when just
> > updating a 1-level pagetable.
>
> If only POWER implemented an iommu_ops so we had something on which we
> could base an alternate iommu model and pluggable iommu registration...
Yeah, yeah. I'm having to find gaps of time between fighting various
fires to work on vfio-ish infrastructure stuff.
> > > +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> > > +{
> > > + struct vfio_iommu *iommu = filep->private_data;
> > > +
> > > + vfio_release_iommu(iommu);
> > > + return 0;
> > > +}
> > > +
> > > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > > + unsigned int cmd, unsigned long arg)
> > > +{
> > > + struct vfio_iommu *iommu = filep->private_data;
> > > + int ret = -ENOSYS;
> > > +
> > > + if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > > + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > > +
> > > + ret = put_user(flags, (u64 __user *)arg);
> >
> > Um.. flags surely have to come from the IOMMU driver.
>
> This vfio_iommu object is backed by iommu_ops, which supports this
> mapping.
>
> > > + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > > + struct vfio_dma_map dm;
> > > +
> > > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > > + return -EFAULT;
> > > +
> > > + ret = vfio_dma_map_dm(iommu, &dm);
> > > +
> > > + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> > > + ret = -EFAULT;
> > > +
> > > + } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> > > + struct vfio_dma_map dm;
> > > +
> > > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > > + return -EFAULT;
> > > +
> > > + ret = vfio_dma_unmap_dm(iommu, &dm);
> > > +
> > > + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> > > + ret = -EFAULT;
> > > + }
> > > + return ret;
> > > +}
> > > +
> > > +#ifdef CONFIG_COMPAT
> > > +static long vfio_iommu_compat_ioctl(struct file *filep,
> > > + unsigned int cmd, unsigned long arg)
> > > +{
> > > + arg = (unsigned long)compat_ptr(arg);
> > > + return vfio_iommu_unl_ioctl(filep, cmd, arg);
> >
> > Um, this only works if the structures are exactly compatible between
> > 32-bit and 64-bit ABIs. I don't think that is always true.
>
> I think all our structure sizes are independent of host width. If I'm
> missing something, let me know.
Ah, for structures, that might be true. I was seeing the bunch of
ioctl()s that take ints.
> > > +}
> > > +#endif /* CONFIG_COMPAT */
> > > +
> > > +const struct file_operations vfio_iommu_fops = {
> > > + .owner = THIS_MODULE,
> > > + .release = vfio_iommu_release,
> > > + .unlocked_ioctl = vfio_iommu_unl_ioctl,
> > > +#ifdef CONFIG_COMPAT
> > > + .compat_ioctl = vfio_iommu_compat_ioctl,
> > > +#endif
> > > +};
> > > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > > new file mode 100644
> > > index 0000000..6169356
> > > --- /dev/null
> > > +++ b/drivers/vfio/vfio_main.c
> > > @@ -0,0 +1,1151 @@
> > > +/*
> > > + * VFIO framework
> > > + *
> > > + * Copyright (C) 2011 Red Hat, Inc. All rights reserved.
> > > + * Author: Alex Williamson <alex.williamson@redhat.com>
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + * Derived from original vfio:
> > > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > > + * Author: Tom Lyon, pugs@cisco.com
> > > + */
> > > +
> > > +#include <linux/cdev.h>
> > > +#include <linux/compat.h>
> > > +#include <linux/device.h>
> > > +#include <linux/file.h>
> > > +#include <linux/anon_inodes.h>
> > > +#include <linux/fs.h>
> > > +#include <linux/idr.h>
> > > +#include <linux/iommu.h>
> > > +#include <linux/mm.h>
> > > +#include <linux/module.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/string.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/vfio.h>
> > > +#include <linux/wait.h>
> > > +
> > > +#include "vfio_private.h"
> > > +
> > > +#define DRIVER_VERSION "0.2"
> > > +#define DRIVER_AUTHOR "Alex Williamson <alex.williamson@redhat.com>"
> > > +#define DRIVER_DESC "VFIO - User Level meta-driver"
> > > +
> > > +static int allow_unsafe_intrs;
> > > +module_param(allow_unsafe_intrs, int, 0);
> > > +MODULE_PARM_DESC(allow_unsafe_intrs,
> > > + "Allow use of IOMMUs which do not support interrupt remapping");
> >
> > This should not be a global option, but part of the AMD/Intel IOMMU
> > specific code. In general it's a question of how strict the IOMMU
> > driver is about isolation when it determines what the groups are, and
> > only the IOMMU driver can know what the possibilities are for its
> > class of hardware.
>
> I agree this should probably be tied more closely to the iommu driver,
> but again, we only have iommu_ops right now.
>
> <snip>
> > > +
> > > +/* Attempt to merge the group pointed to by fd into group. The merge-ee
> > > + * group must not have an iommu or any devices open because we cannot
> > > + * maintain that context across the merge. The merge-er group can be
> > > + * in use. */
> >
> > Yeah, so merge-er group in use still has its problems, because it
> > could affect what the IOMMU is capable of.
>
> As seen below, we deny merging if the iommu domains are not exactly
> compatible. Our notion of what compatible means depends on what
> iommu_ops exposes though.
Ok.
> > > +static int vfio_group_merge(struct vfio_group *group, int fd)
> > > +{
> > > + struct vfio_group *new;
> > > + struct vfio_iommu *old_iommu;
> > > + struct file *file;
> > > + int ret = 0;
> > > + bool opened = false;
> > > +
> > > + mutex_lock(&vfio.lock);
> > > +
> > > + file = fget(fd);
> > > + if (!file) {
> > > + ret = -EBADF;
> > > + goto out_noput;
> > > + }
> > > +
> > > + /* Sanity check, is this really our fd? */
> > > + if (file->f_op != &vfio_group_fops) {
> >
> > This should be a WARN_ON or BUG_ON rather than just an error return, surely.
>
> No, I don't think so. We're passed a file descriptor that could be for
> anything. If the user passed a file descriptor for something that's not
> a vfio group, that's a user error, not an internal consistency error of
> vfio.
Sorry, I was mixing up which of the fd arguments was which.
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + new = file->private_data;
> > > +
> > > + if (!new || new == group || !new->iommu ||
> > > + new->iommu->domain || new->bus != group->bus) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + /* We need to attach all the devices to each domain separately
> > > + * in order to validate that the capabilities match for both. */
> > > + ret = __vfio_open_iommu(new->iommu);
> > > + if (ret)
> > > + goto out;
> > > +
> > > + if (!group->iommu->domain) {
> > > + ret = __vfio_open_iommu(group->iommu);
> > > + if (ret)
> > > + goto out;
> > > + opened = true;
> > > + }
> > > +
> > > + /* If cache coherency doesn't match we'd potentialy need to
> > > + * remap existing iommu mappings in the merge-er domain.
> > > + * Poor return to bother trying to allow this currently. */
> > > + if (iommu_domain_has_cap(group->iommu->domain,
> > > + IOMMU_CAP_CACHE_COHERENCY) !=
> > > + iommu_domain_has_cap(new->iommu->domain,
> > > + IOMMU_CAP_CACHE_COHERENCY)) {
> > > + __vfio_close_iommu(new->iommu);
> > > + if (opened)
> > > + __vfio_close_iommu(group->iommu);
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + /* Close the iommu for the merge-ee and attach all its devices
> > > + * to the merge-er iommu. */
> > > + __vfio_close_iommu(new->iommu);
> > > +
> > > + ret = __vfio_iommu_attach_group(group->iommu, new);
> > > + if (ret)
> > > + goto out;
> > > +
> > > + /* set_iommu unlinks new from the iommu, so save a pointer to it */
> > > + old_iommu = new->iommu;
> > > + __vfio_group_set_iommu(new, group->iommu);
> > > + kfree(old_iommu);
> > > +
> > > +out:
> > > + fput(file);
> > > +out_noput:
> > > + mutex_unlock(&vfio.lock);
> > > + return ret;
> > > +}
> > > +
> > > +/* Unmerge the group pointed to by fd from group. */
> > > +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> > > +{
> > > + struct vfio_group *new;
> > > + struct vfio_iommu *new_iommu;
> > > + struct file *file;
> > > + int ret = 0;
> > > +
> > > + /* Since the merge-out group is already opened, it needs to
> > > + * have an iommu struct associated with it. */
> > > + new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> > > + if (!new_iommu)
> > > + return -ENOMEM;
> > > +
> > > + INIT_LIST_HEAD(&new_iommu->group_list);
> > > + INIT_LIST_HEAD(&new_iommu->dm_list);
> > > + mutex_init(&new_iommu->dgate);
> > > + new_iommu->bus = group->bus;
> > > +
> > > + mutex_lock(&vfio.lock);
> > > +
> > > + file = fget(fd);
> > > + if (!file) {
> > > + ret = -EBADF;
> > > + goto out_noput;
> > > + }
> > > +
> > > + /* Sanity check, is this really our fd? */
> > > + if (file->f_op != &vfio_group_fops) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + new = file->private_data;
> > > + if (!new || new == group || new->iommu != group->iommu) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + /* We can't merge-out a group with devices still in use. */
> > > + if (__vfio_group_devs_inuse(new)) {
> > > + ret = -EBUSY;
> > > + goto out;
> > > + }
> > > +
> > > + __vfio_iommu_detach_group(group->iommu, new);
> > > + __vfio_group_set_iommu(new, new_iommu);
> > > +
> > > +out:
> > > + fput(file);
> > > +out_noput:
> > > + if (ret)
> > > + kfree(new_iommu);
> > > + mutex_unlock(&vfio.lock);
> > > + return ret;
> > > +}
> > > +
> > > +/* Get a new iommu file descriptor. This will open the iommu, setting
> > > + * the current->mm ownership if it's not already set. */
> >
> > I know I've had this explained to me several times before, but I've
> > forgotten again. Why do we need to wire the iommu to an mm?
>
> We're mapping process virtual addresses into the IOMMU, so it makes
> sense to restrict ourselves to a single virtual address space. It also
> enforces the ownership, that only a single mm is in control of the
> group.
Neither of those seems conclusive to me, but I remember that I saw a
strong reason earlier, even if I can't remember it now.
> > > +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> > > +{
> > > + int ret = 0;
> > > +
> > > + mutex_lock(&vfio.lock);
> > > +
> > > + if (!group->iommu->domain) {
> > > + ret = __vfio_open_iommu(group->iommu);
> > > + if (ret)
> > > + goto out;
> > > + }
> > > +
> > > + ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> > > + group->iommu, O_RDWR);
> > > + if (ret < 0)
> > > + goto out;
> > > +
> > > + group->iommu->refcnt++;
> > > +out:
> > > + mutex_unlock(&vfio.lock);
> > > + return ret;
> > > +}
> > > +
> > > +/* Get a new device file descriptor. This will open the iommu, setting
> > > + * the current->mm ownership if it's not already set. It's difficult to
> > > + * specify the requirements for matching a user supplied buffer to a
> > > + * device, so we use a vfio driver callback to test for a match. For
> > > + * PCI, dev_name(dev) is unique, but other drivers may require including
> > > + * a parent device string. */
> >
> > At some point we probably want an interface to enumerate the devices
> > too, but that can probably wait.
>
> That's what I decided as well. I also haven't been able to come up with
> an interface for it that doesn't make me want to vomit.
Ok.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-17 0:02 ` David Gibson
@ 2011-11-18 20:32 ` Alex Williamson
2011-11-18 21:09 ` Scott Wood
2011-11-21 2:47 ` David Gibson
0 siblings, 2 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-18 20:32 UTC (permalink / raw)
To: David Gibson
Cc: chrisw, aik, pmac, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Thu, 2011-11-17 at 11:02 +1100, David Gibson wrote:
> On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
> > On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> > > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
<snip>
> > > > +Groups, Devices, IOMMUs, oh my
> > > > +-------------------------------------------------------------------------------
> > > > +
> > > > +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> > > > +can't always distinguish transactions from each individual device in
> > > > +the system. Sometimes this is because of the IOMMU design, such as with
> > > > +PEs, other times it's caused by the I/O topology, for instance a
> > > > +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> > > > +devices created by these restictions IOMMU groups (or just "groups" for
> > > > +this document).
> > > > +
> > > > +The IOMMU cannot distiguish transactions between the individual devices
> > > > +within the group, therefore the group is the basic unit of ownership for
> > > > +a userspace process. Because of this, groups are also the primary
> > > > +interface to both devices and IOMMU domains in VFIO.
> > > > +
> > > > +The VFIO representation of groups is created as devices are added into
> > > > +the framework by a VFIO bus driver. The vfio-pci module is an example
> > > > +of a bus driver. This module registers devices along with a set of bus
> > > > +specific callbacks with the VFIO core. These callbacks provide the
> > > > +interfaces later used for device access. As each new group is created,
> > > > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > > > +character device.
> > >
> > > Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
> > > bus driver is per bus type, not per bus instance. But grouping
> > > constraints could be per bus instance, if you have a couple of
> > > different models of PCI host bridge with IOMMUs of different
> > > capabilities built in, for example.
> >
> > Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
> > instance.
>
> Ok, how can that work. vfio-pci is responsible for generating the
> groupings, yes? For which it needs to know the iommu/host bridge's
> isolation capabilities, which vary depending on the type of host
> bridge.
No, grouping is done at the iommu driver level. vfio gets groupings via
iommu_device_group(), which uses the iommu_ops for the bus_type of the
requested device. I'll attempt to clarify where groups come from in the
documentation.
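As a rough sketch of what that looks like from the vfio side, assuming the
iommu_device_group() signature currently proposed for iommu_ops and a
made-up vfio_group_add_dev() helper:

	unsigned int groupid;

	/* Ask the iommu driver which group this device falls into; a
	 * device without iommu backing simply isn't exposed via vfio. */
	if (iommu_device_group(dev, &groupid))
		return 0;

	/* Find or create the vfio group for groupid and add the device */
	return vfio_group_add_dev(dev, groupid);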
> > IOMMUs also register drivers per bus type, not per bus
> > instance. The IOMMU driver is free to impose any constraints it wants.
> >
> > > > +In addition to the device enumeration and callbacks, the VFIO bus driver
> > > > +also provides a traditional device driver and is able to bind to devices
> > > > +on it's bus. When a device is bound to the bus driver it's available to
> > > > +VFIO. When all the devices within a group are bound to their bus drivers,
> > > > +the group becomes "viable" and a user with sufficient access to the VFIO
> > > > +group chardev can obtain exclusive access to the set of group devices.
> > > > +
> > > > +As documented in linux/vfio.h, several ioctls are provided on the
> > > > +group chardev:
> > > > +
> > > > +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> > > > + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> > > > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> > > > +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> > > > +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int)
> > > > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> > > > +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> > > > +
> > > > +The last two ioctls return new file descriptors for accessing
> > > > +individual devices within the group and programming the IOMMU. Each of
> > > > +these new file descriptors provide their own set of file interfaces.
> > > > +These ioctls will fail if any of the devices within the group are not
> > > > +bound to their VFIO bus driver. Additionally, when either of these
> > > > +interfaces are used, the group is then bound to the struct_mm of the
> > > > +caller. The GET_FLAGS ioctl can be used to view the state of the group.
> > > > +
> > > > +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> > > > +new IOMMU domain is created and all of the devices in the group are
> > > > +attached to it. This is the only way to ensure full IOMMU isolation
> > > > +of the group, but potentially wastes resources and cycles if the user
> > > > +intends to manage multiple groups with the same set of IOMMU mappings.
> > > > +VFIO therefore provides a group MERGE and UNMERGE interface, which
> > > > +allows multiple groups to share an IOMMU domain. Not all IOMMUs allow
> > > > +arbitrary groups to be merged, so the user should assume merging is
> > > > +opportunistic.
> > >
> > > I do not think "opportunistic" means what you think it means..
> > >
> > > > A new group, with no open device or IOMMU file
> > > > +descriptors, can be merged into an existing, in-use, group using the
> > > > +MERGE ioctl. A merged group can be unmerged using the UNMERGE ioctl
> > > > +once all of the device file descriptors for the group being merged
> > > > +"out" are closed.
> > > > +
> > > > +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> > > > +essentially fungible between group file descriptors (ie. if device
> > > > A
> > >
> > > IDNT "fungible" MWYTIM, either.
> >
> > Hmm, feel free to suggest. Maybe we're hitting .us vs .au connotation.
>
> In any case, I don't think it's a word whose meaning is unambiguous
> enough to use here.
>
> > > > +is in group X, and X is merged with Y, a file descriptor for A can be
> > > > +retrieved using GET_DEVICE_FD on Y. Likewise, GET_IOMMU_FD returns a
> > > > +file descriptor referencing the same internal IOMMU object from either
> > > > +X or Y). Merged groups can be dissolved either explictly with UNMERGE
> > > > +or automatically when ALL file descriptors for the merged group are
> > > > +closed (all IOMMUs, all devices, all groups).
> > >
> > > Blech. I'm really not liking this merge/unmerge API as it stands,
> > > it's horribly confusing. At the very least, we need some better
> > > terminology. We need some term for the metagroups; supergroups; iommu
> > > domains or-at-least-they-will-be-once-we-open-the-iommu or
> > > whathaveyous.
> > >
> > > The first confusing thing about this interface is that each open group
> > > handle actually refers to two different things; the original group you
> > > opened and the metagroup it's a part of. For the GET_IOMMU_FD and
> > > GET_DEVICE_FD operations, you're using the metagroup and two "merged"
> > > group handles are interchangeable.
> >
> > Fungible, even ;)
> >
> > > For other MERGE and especially
> > > UNMERGE operations, it matters which is the original group.
> >
> > If I stick two LEGO blocks together, I need to identify the individual
> > block I want to remove to pull them back apart...
>
> Yeah, I'm starting to get my head around the model, but the current
> description of it doesn't help very much. In particular the terms
> "merge" and "unmerge" lead one to the wrong mental model, I think.
>
> > > The semantics of "merge" and "unmerge" under those names are really
> > > non-obvious. Merge kind of has to merge two whole metagroups, but
> > > it's unclear if unmerge reverses one merge, or just takes out one
> > > (atom) group. These operations need better names, at least.
> >
> > Christian suggested a change to UNMERGE that we do not need to
> > specify a group to unmerge "from". This makes it more like a list
> > implementation except there's no defined list_head. Any member of the
> > list can pull in a new entry. Calling UNMERGE on any member extracts
> > that member.
>
> I think that's a good idea, but "unmerge" is not a good word for it.
I can't think of anything better; if you can, please suggest.
> > > Then it's unclear what order you can do various operations, and which
> > > order you can open and close various things. You can kind of figure
> > > it out but it takes far more thinking than it should.
> > >
> > >
> > > So at the _very_ least, we need to invent new terminology and find a
> > > much better way of describing this API's semantics. I still think an
> > > entirely different interface, where metagroups are created from
> > > outside with a lifetime that's not tied to an fd would be a better
> > > idea.
> >
> > As we've discussed previously, configfs provides part of this, but has
> > no ioctl support. It doesn't make sense to me to go play with groups in
> > configfs, but then still interact with them via a char dev.
>
> Why not? You configure, say, loopback devices with losetup, then use
> them as a block device. Similar with nbd. You can configure serial
> devices with setserial, then use them as a char dev.
>
> > It also
> > splits the ownership model
>
> I'm not even sure what that means.
>
> > and makes it harder to enforce who gets to
> > interact with the devices vs who gets to manipulate groups.
>
> How so.
Let's map out what a configfs interface would look like, maybe I'll
convince myself it's on the table. We'd probably start with
/config/vfio/$bus_type.name/
That would probably be pre-populated with a bunch of $groupid files,
matching /dev/vfio/$bus_type.name/$groupid char dev files (assuming
configfs can pre-populate files). To make a user defined group, we
might then do:
mkdir /config/vfio/$bus_type.name/my_group
That would generate a /dev/vfio/$bus_type.name/my_group char dev. To
add groups to the new my_group "super group", we'd need to do something
like:
ln -s /config/vfio/$bus_type.name/$groupidA /config/vfio/$bus_type.name/my_group/nic_group
I might then add a second group as:
ln -s /config/vfio/$bus_type.name/$groupidB /config/vfio/$bus_type.name/my_group/hba_group
Either link could fail if the target group is not viable, the group is
already in use, or the second link could fail if the iommu domains were
incompatible.
Do these links cause /dev/vfio/$bus_type.name/{$groupidA,$groupidB} to
disappear? If not, do we allow them to be opened? Linking would also
have to fail if we later tried to link one of these groupids to a
different super group.
Now we want to give my_group to a user, so we have to go back to /dev
and
chown $user /dev/vfio/$bus_type.name/my_group
At this point my_group would have the existing set of group ioctls sans
{UN}MERGE, of course.
So $user can use the super group, but not manipulate its members. Do
we then allow:
chown $user /config/vfio/$bus_type.name/my_group
If so, what does it imply about the user then doing:
ln -s /config/vfio/$bus_type.name/$groupidC /config/vfio/$bus_type.name/my_group/stolen_group
Would we instead need to chown the configfs groups as well as the super
group?
chown $user /config/vfio/$bus_type.name/my_group
chown $user /config/vfio/$bus_type.name/$groupidA
chown $user /config/vfio/$bus_type.name/$groupidB
ie:
# chown $user:$user /config/vfio/$bus_type.name/$groupidC
$ ln -s /config/vfio/$bus_type.name/$groupidC /config/vfio/$bus_type.name/my_group/given_group
(linking has to look at the permissions of the target as well as the
link name)
Now we've introduced that we have ownership of configfs entries, what
does that imply about the char dev entries? For instance, can $userA
own /dev/vfio/$bus_type.name/$groupidA, but $userB own the configfs
file? We also have another security consideration that an exploit on
the host might allow a 3rd party to insert a device into a group.
This is where I start to get lost in the complexity versus simply giving
the user permissions for the char dev and allowing them to stick groups
together so long as they have permissions for the group.
We also add an entire filesystem to the interface that already spans
sysfs, dev, eventfds and potentially netlink.
If terminology is the complaint against the {UN}MERGE ioctl interface,
I'm still not sold that configfs is the answer. /me goes to the
thesaurus... amalgamate? blend? combine? cement? unite? join?
> > The current
> > model really isn't that complicated, imho. As always, feel free to
> > suggest specific models. If you have a specific terminology other than
> > MERGE, please suggest.
> >
> > > Now, you specify that you can't use a group as the second argument of
> > > a merge if it already has an open iommu, but it's not clear from the
> > > doc if you can merge things into a group with an open iommu.
> >
> > >From above:
> >
> > A new group, with no open device or IOMMU file descriptors, can
> > be merged into an existing, in-use, group using the MERGE ioctl.
> > ^^^^^^
> >
> > > Banning
> > > this would make life simpler, because the IOMMU's effective
> > > capabilities may change if you add more devices to the domain. That's
> > > yet another non-obvious constraint in the interface ordering, though.
> >
> > Banning this would prevent using merged groups with hotplug, which I
> > consider to be a primary use case.
>
> Yeah, fair enough, based on your later comments w.r.t. only combining
> feature compatible groups.
>
> > > > +The IOMMU file descriptor provides this set of ioctls:
> > > > +
> > > > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> > > > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0)
> > > > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> > > > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
> > > > +
> > > > +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> > > > +We currently only support IOMMU domains that are able to map any
> > > > +virtual address to any IOVA. This is indicated by the MAP_ANY
> > > > flag.
> > >
> > > So. I tend to think of an IOMMU mapping IOVAs to memory pages, rather
> > > than memory pages to IOVAs.
> >
> > I do too, not sure why I wrote it that way, will fix.
> >
> > > The IOMMU itself, of course maps to
> > > physical addresses, and the meaning of "virtual address" in this
> > > context is not really clear. I think you would be better off saying
> > > the IOMMU can map any IOVA to any memory page. From a hardware POV
> > > that means any physical address, but of course for a VFIO user a page
> > > is specified by its process virtual address.
> >
> > Will fix.
> >
> > > I think we need to pin exactly what "MAP_ANY" means down better. Now,
> > > VFIO is pretty much a lost cause if you can't map any normal process
> > > memory page into the IOMMU, so I think the only thing that is really
> > > covered is IOVAs. But saying "can map any IOVA" is not clear, because
> > > if you can't map it, it's not a (valid) IOVA. Better to say that
> > > IOVAs can be any 64-bit value, which I think is what you really mean
> > > here.
> >
> > ok
> >
> > > Of course, since POWER is a platform where this is *not* true, I'd
> > > prefer to have something giving the range of valid IOVAs in the core
> > > to start with.
> >
> > Since iommu_ops does not yet have any concept of this (nudge, nudge), I
> > figured this would be added later. A possible implementation would be
> > that such an iommu would not set MAP_ANY, would add a new flag for
> > MAP_RANGE, and provide a new VFIO_IOMMU_GET_RANGE_INFO ioctl to describe
> > it. I'm guaranteed to get it wrong if I try to predict all your needs.
>
> Hrm. "ANY" just really bothers me because "any iova" is not as clear
> a concept as it first appears. For starters it's actually "any page
> aligned" at the very least. But then it's only any 64-bit address for
> busses which have full 64-bit addressing (and I do wonder if there are
> any north bridges out there that forgot to implement some of the upper
> PCI address bits properly, given that 64-bit CPUs rarely actually
> implement more than 40-something physical address bits in practice).
>
> I'd prefer to see at least something to advertise min and max IOVA and
> IOVA alignment. That's enough to cover x86 and POWER, including
> possible variants with an IOMMU page size different to the system page
> size (note that POWER kernels can have 64k pages as a config option,
> which means a TCE page size different to the system page size is quite
> common).
>
> Obviously there could be more complex constraints that we would need
> to advertise with option bits.
x86 has limitations as well. I don't think most x86 IOMMUs support a
full 64bit IOVA space, so point taken.
struct vfio_iommu_info {
__u64 len; /* or structlen/arglen */
__u64 flags; /* replaces VFIO_IOMMU_GET_FLAGS, none defined yet */
__u64 iova_max;
__u64 iova_min;
__u64 granularity;
};
#define VFIO_IOMMU_GET_INFO _IOR(';', xxx, struct vfio_iommu_info)
Is granularity the minimum granularity, typically PAGE_SIZE barring
special configurations noted above, or is it a bitmap of supported
granularities? Ex. If we support 4k normal pages and 2M large pages, we
might set bits 12 and 21.
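If it ends up being a bitmap, a userspace consumer of the sketched GET_INFO
ioctl could test it like this (illustrative only; the struct and ioctl are
the proposal above, the booleans are hypothetical caller state):

	struct vfio_iommu_info info = { .len = sizeof(info) };

	if (ioctl(iommu_fd, VFIO_IOMMU_GET_INFO, &info))
		return -errno;

	/* bit N set would mean mappings of size (1 << N) are supported */
	if (info.granularity & (1ULL << 21))
		use_2m_mappings = true;
	if (info.granularity & (1ULL << 12))
		use_4k_mappings = true;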
> > > > +
> > > > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > > > +and unmapping IOVAs to process virtual addresses:
> > > > +
> > > > +struct vfio_dma_map {
> > > > + __u64 len; /* length of structure */
> > >
> > > Thanks for adding these structure length fields. But I think they
> > > should be called something other than 'len', which is likely to be
> > > confused with size (or some other length that's actually related to
> > > the operation's parameters). Better to call it 'structlen' or
> > > 'argslen' or something.
> >
> > Ok. As Scott noted, I've failed to implement these in a way that
> > actually allows extension, but I'll work on it.
>
> Right. I had failed to realise quite how the encoding of structure
> size into the ioctl worked. With that in place, arguably we don't
> really need the size in the structure itself, because we can still
> have multiple sized versions of the ioctl. Still, whichever.
Hmm, that might be cleaner than eliminating the size with just using
_IO(). So we might have something like:
#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
#define VFIO_IOMMU_MAP_DMA_V2 _IOWR(';', 106, struct vfio_dma_map_v2)
For which the driver might do:
case VFIO_IOMMU_MAP_DMA:
case VFIO_IOMMU_MAP_DMA_V2:
{
struct vfio_dma_map map;
/* We don't care about the extra v2 bits */
if (copy_from_user(&map, (void __user *)arg, sizeof map))
return -EFAULT;
...
That presumes v2 is compatible other than extra fields. Any objections
(this gets rid of length from all ioctl passed structs)?
> > > > + __u64 vaddr; /* process virtual addr */
> > > > + __u64 dmaaddr; /* desired and/or returned dma address */
> > > > + __u64 size; /* size in bytes */
> > > > + __u64 flags;
> > > > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> > >
> > > Make it independent READ and WRITE flags from the start. Not all
> > > combinations will be be valid on all hardware, but that way we have
> > > the possibilities covered without having to use strange encodings
> > > later.
> >
> > Ok.
> >
> > > > +};
> > > > +
> > > > +Current users of VFIO use relatively static DMA mappings, not requiring
> > > > +high frequency turnover. As new users are added, it's expected that the
> > > > +IOMMU file descriptor will evolve to support new mapping interfaces, this
> > > > +will be reflected in the flags and may present new ioctls and file
> > > > +interfaces.
> > > > +
> > > > +The device GET_FLAGS ioctl is intended to return basic device type and
> > > > +indicate support for optional capabilities. Flags currently include whether
> > > > +the device is PCI or described by Device Tree, and whether the RESET ioctl
> > > > +is supported:
> > > > +
> > > > +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64)
> > > > + #define VFIO_DEVICE_FLAGS_PCI (1 << 0)
> > > > + #define VFIO_DEVICE_FLAGS_DT (1 << 1)
> > >
> > > TBH, I don't think the VFIO for DT stuff is mature enough yet to be in
> > > an initial infrastructure patch, though we should certainly be
> > > discussing it as an add-on patch.
> >
> > I agree for DT, and PCI should be added with vfio-pci, not the initial
> > core.
> >
> > > > + #define VFIO_DEVICE_FLAGS_RESET (1 << 2)
> > > > +
> > > > +The MMIO and IOP resources used by a device are described by regions.
> > > > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > > > +
> > > > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> > > > +
> > > > +Regions are described by a struct vfio_region_info, which is retrieved by
> > > > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > > > +the desired region (0 based index). Note that devices may implement zero
> > > > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > > > +mapping).
> > >
> > > So, I think you're saying that a zero-sized region is used to encode a
> > > NOP region, that is, to basically put a "no region here" in between
> > > valid region indices. You should spell that out.
> >
> > Ok.
> >
> > > [Incidentally, any chance you could borrow one of RH's tech writers
> > > for this? I'm afraid you seem to lack the knack for clear and easily
> > > read documentation]
> >
> > Thanks for the encouragement :-\ It's no wonder there isn't more
> > content in Documentation.
>
> Sigh. Alas, yes.
>
> > > > +struct vfio_region_info {
> > > > + __u32 len; /* length of structure */
> > > > + __u32 index; /* region number */
> > > > + __u64 size; /* size in bytes of region */
> > > > + __u64 offset; /* start offset of region */
> > > > + __u64 flags;
> > > > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0)
> > > > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1)
> > >
> > > Again having separate read and write bits from the start will save
> > > strange encodings later.
> >
> > Seems highly unlikely, but we have bits to waste...
> >
> > > > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> > > > + __u64 phys; /* physical address of region */
> > > > +};
> > >
> > > I notice there is no field for "type" e.g. MMIO vs. PIO vs. config
> > > space for PCI. If you added that having a NONE type might be a
> > > clearer way of encoding a non-region than just having size==0.
> >
> > I thought there was some resistance to including MMIO and PIO bits in
> > the flags. If that's passed, I can add it, but PCI can determine this
> > through config space (and vfio-pci exposes config space at a fixed
> > index). Having a regions w/ size == 0, MMIO and PIO flags unset seems a
> > little redundant if that's the only reason for having them. A NONE flag
> > doesn't make sense to me. Config space isn't NONE, but neither is it
> > MMIO nor PIO; and someone would probably be offended about even
> > mentioning PIO in the specification.
>
> No, my concept was that NONE would be used for the indexes where there
> is no valid BAR. I'll buy your argument on why not to include the PCI
> (or whatever) address space type here.
>
> What I'm just a bit concerned by is whether we could have a case (not
> for PCI) of a real resource that still has size 0 - e.g. maybe some
> sort of doorbell that can't be read or written, but can be triggered
> some other way. I guess that's probably unlikely though.
Right, and if somehow you had such a region where the size is zero, but
allowed some kind of operation on it, we could define a flag for it.
> > > > +
> > > > +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> > > > +
> > > > +The offset indicates the offset into the device file descriptor which
> > > > +accesses the given range (for read/write/mmap/seek). Flags indicate the
> > > > +available access types and validity of optional fields. For instance
> > > > +the phys field may only be valid for certain devices types.
> > > > +
> > > > +Interrupts are described using a similar interface. GET_NUM_IRQS
> > > > +reports the number or IRQ indexes for the device.
> > > > +
> > > > +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> > > > +
> > > > +struct vfio_irq_info {
> > > > + __u32 len; /* length of structure */
> > > > + __u32 index; /* IRQ number */
> > > > + __u32 count; /* number of individual IRQs */
> > >
> > > Is there a reason for allowing irqs in batches like this, rather than
> > > having each MSI be reflected by a separate irq_info?
> >
> > Yes, bus drivers like vfio-pci can define index 1 as the MSI info
> > structure and index 2 as MSI-X. There's really no need to expose 57
> > individual MSI interrupts and try to map them to the correct device
> > specific MSI type if they can only logically be enabled in two distinct
> > groups. Bus drivers with individually controllable MSI vectors are free
> > to expose them separately. I assume device tree paths would help
> > associate an index to a specific interrupt.
>
> Ok, fair enough.
>
> > > > + __u64 flags;
> > > > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> > > > +};
> > > > +
> > > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > > +type to index mapping).
> > >
> > > I know what you mean, but you need a clearer way to express it.
> >
> > I'll work on it.
> >
> > > > +Information about each index can be retrieved using the GET_IRQ_INFO
> > > > +ioctl, used much like GET_REGION_INFO.
> > > > +
> > > > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> > > > +
> > > > +Individual indexes can describe single or sets of IRQs. This provides the
> > > > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> > > > +
> > > > +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays,
> > > > +as shown below, are used to pass the IRQ info index, the number of eventfds,
> > > > +and each eventfd to be signaled. Using a count of 0 disables the interrupt.
> > > > +
> > > > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> > > > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int)
> > > > +
> > > > +When a level triggered interrupt is signaled, the interrupt is masked
> > > > +on the host. This prevents an unresponsive userspace driver from
> > > > +continuing to interrupt the host system. After servicing the interrupt,
> > > > +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> > > > +triggered interrupts implicitly have a count of 1 per index.
> > >
> > > This is a silly restriction. Even PCI devices can have up to 4 LSIs
> > > on a function in theory, though no-one ever does. Embedded devices
> > > can and do have multiple level interrupts.
> >
> > Per the PCI spec, an individual PCI function can only ever have, at
> > most, a single INTx line. A multi-function *device* can have up to 4
> > INTx lines, but what we're exposing here is a struct device, ie. a PCI
> > function.
>
> Ah, my mistake.
>
> > Other devices could certainly have multiple level interrupts, and if
> > grouping them as we do with MSI on PCI makes sense, please let me know.
> > I just didn't see the value in making the unmask operations handle
> > sub-indexes if it's not needed.
>
> I don't know of anything off hand. But I can't see any consideration
> that would make it unlikely either. I generally don't trust anything
> *not* to exist in embedded space.
Fair enough. Level IRQs are still triggered individually, so unmasking
is too, which means UNMASK_IRQ takes something like { int index; int
subindex }.
SET_UNMASK_IRQ_EVENTFDS should follow SET_IRQ_EVENTFDS and take { int
index; int count; int fds[] }.
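Roughly, and purely as a sketch, the argument structures could be (names and
layout below are placeholders, not a proposed interface):

struct vfio_irq_unmask {
	__u32	index;		/* IRQ info index */
	__u32	subindex;	/* individual IRQ within that index */
};

struct vfio_irq_unmask_eventfds {
	__u32	index;
	__u32	count;		/* number of eventfds that follow */
	__s32	fds[];		/* one unmask eventfd per sub-IRQ */
};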
> > > > +
> > > > +/* Unmask IRQ index, arg[0] = index */
> > > > +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int)
> > > > +
> > > > +Level triggered interrupts can also be unmasked using an irqfd. Use
> > > > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> > > > +
> > > > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > > > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> > > > +
> > > > +When supported, as indicated by the device flags, reset the device.
> > > > +
> > > > +#define VFIO_DEVICE_RESET _IO(';', 116)
> > > > +
> > > > +Device tree devices also invlude ioctls for further defining the
> > > > +device tree properties of the device:
> > > > +
> > > > +struct vfio_dtpath {
> > > > + __u32 len; /* length of structure */
> > > > + __u32 index;
> > > > + __u64 flags;
> > > > +#define VFIO_DTPATH_FLAGS_REGION (1 << 0)
> > > > +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1)
> > > > + char *path;
> > > > +};
> > > > +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath)
> > > > +
> > > > +struct vfio_dtindex {
> > > > + __u32 len; /* length of structure */
> > > > + __u32 index;
> > > > + __u32 prop_type;
> > > > + __u32 prop_index;
> > > > + __u64 flags;
> > > > +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0)
> > > > +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1)
> > > > +};
> > > > +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex)
> > > > +
> > > > +
> > > > +VFIO bus driver API
> > > > +-------------------------------------------------------------------------------
> > > > +
> > > > +Bus drivers, such as PCI, have three jobs:
> > > > + 1) Add/remove devices from vfio
> > > > + 2) Provide vfio_device_ops for device access
> > > > + 3) Device binding and unbinding
> > > > +
> > > > +When initialized, the bus driver should enumerate the devices on it's
> > >
> > > s/it's/its/
> >
> > Noted.
> >
> > <snip>
> > > > +/* Unmap DMA region */
> > > > +/* dgate must be held */
> > > > +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > > > + int npage, int rdwr)
> > >
> > > Use of "read" and "write" in DMA can often be confusing, since it's
> > > not always clear if you're talking from the perspective of the CPU or
> > > the device (_writing_ data to a device will usually involve it doing
> > > DMA _reads_ from memory). It's often best to express things as DMA
> > > direction, 'to device', and 'from device' instead.
> >
> > Good point.
>
> This, of course, potentially affects many areas of the code and doco.
I've changed vfio_iommu to use <linux/dma-direction.h> definitions
internally. For the ioctl I've so far simply included WRITE and READ
flags, which I can clarify are from the device perspective. Flags like
VFIO_DMA_MAP_FLAG_TO_DEVICE/FROM_DEVICE are actually more confusing to
me at this interface level. We also have IOMMU_READ/IOMMU_WRITE, which
makes me question using dma-direction.h and whether we shouldn't just define
everything from the device perspective.
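For the sake of discussion, the independent flags might end up looking like
this, defined from the device's view of the transaction (names and bit
positions are illustrative only):

/* device may read from the mapping (DMA from memory to the device) */
#define VFIO_DMA_MAP_FLAG_READ	(1 << 0)
/* device may write to the mapping (DMA from the device to memory) */
#define VFIO_DMA_MAP_FLAG_WRITE	(1 << 1)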
> > > > +{
> > > > + int i, unlocked = 0;
> > > > +
> > > > + for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > > > + unsigned long pfn;
> > > > +
> > > > + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> > > > + if (pfn) {
> > > > + iommu_unmap(iommu->domain, iova, 0);
> > > > + unlocked += put_pfn(pfn, rdwr);
> > > > + }
> > > > + }
> > > > + return unlocked;
> > > > +}
> > > > +
> > > > +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > > > + unsigned long npage, int rdwr)
> > > > +{
> > > > + int unlocked;
> > > > +
> > > > + unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> > > > + vfio_lock_acct(-unlocked);
> > >
> > > Have you checked that your accounting will work out if the user maps
> > > the same memory page to multiple IOVAs?
> >
> > Hmm, it probably doesn't. We potentially over-penalize the user process
> > here.
>
> Ok.
FWIW, I don't intend to fix this right now, but I have added a comment
in the code noting it. We'll have to see if there's an efficient way to
make the tracking better.
> > > > +}
> > > > +
> > > > +/* Unmap ALL DMA regions */
> > > > +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> > > > +{
> > > > + struct list_head *pos, *pos2;
> > > > + struct dma_map_page *mlp;
> > > > +
> > > > + mutex_lock(&iommu->dgate);
> > > > + list_for_each_safe(pos, pos2, &iommu->dm_list) {
> > > > + mlp = list_entry(pos, struct dma_map_page, list);
> > > > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > > > + list_del(&mlp->list);
> > > > + kfree(mlp);
> > > > + }
> > > > + mutex_unlock(&iommu->dgate);
> > >
> > > Ouch, no good at all. Keeping track of every DMA map is no good on
> > > POWER or other systems where IOMMU operations are a hot path. I think
> > > you'll need an iommu specific hook for this instead, which uses
> > > whatever data structures are natural for the IOMMU. For example a
> > > 1-level pagetable, like we use on POWER will just zero every entry.
> >
> > It's already been noted in the docs that current users have relatively
> > static mappings and a performance interface is TBD for dynamically
> > backing streaming DMA. The current vfio_iommu exposes iommu_ops, POWER
> > will need to come up with something to expose instead.
>
> Right, but I'm not just talking about the current map/unmap calls
> themselves. This infrastructure for tracking it looks like it's
> intended to be generic for all mapping methods. If not, I can't see
> the reason for it, because I don't think the current interface
> requires such tracking inherently.
It does seem that way, but there is a purpose. We need to unmap
everything on release. It's easy to assume that iommu_domain_free()
will unmap everything from the IOMMU, which it does, but we've also done
a get_user_pages on each of those in vfio, which we need to cleanup. We
can't rely on userspace to do this since they might have been SIGKILL'd.
Making it generic with coalescing of adjacent regions and such is
primarily for space efficiency.
<snip>
> > > > +#ifdef CONFIG_COMPAT
> > > > +static long vfio_iommu_compat_ioctl(struct file *filep,
> > > > + unsigned int cmd, unsigned long arg)
> > > > +{
> > > > + arg = (unsigned long)compat_ptr(arg);
> > > > + return vfio_iommu_unl_ioctl(filep, cmd, arg);
> > >
> > > Um, this only works if the structures are exactly compatible between
> > > 32-bit and 64-bit ABIs. I don't think that is always true.
> >
> > I think all our structure sizes are independent of host width. If I'm
> > missing something, let me know.
>
> Ah, for structures, that might be true. I was seeing the bunch of
> ioctl()s that take ints.
Ugh, I suppose you're thinking of an ILP64 platform with ILP32 compat
mode. Darn it, guess we need to make everything 64bit, including file
descriptors.
<snip>
> > > > +
> > > > +/* Get a new iommu file descriptor. This will open the iommu, setting
> > > > + * the current->mm ownership if it's not already set. */
> > >
> > > I know I've had this explained to me several times before, but I've
> > > forgotten again. Why do we need to wire the iommu to an mm?
> >
> > We're mapping process virtual addresses into the IOMMU, so it makes
> > sense to restrict ourselves to a single virtual address space. It also
> > enforces the ownership, that only a single mm is in control of the
> > group.
>
> Neither of those seems conclusive to me, but I remember that I saw a
> strong reason earlier, even if I can't remember it now.
The point of the group is to provide a unit of ownership. We can't let
$userA open $groupid and fetch a device, then have $userB do the same,
grabbing a different device. The mappings will step on each other and
the devices have no isolation. We can't restrict that purely by file
permissions or we'll have the same problem with sudo. At one point we
discussed a single open instance, but that unnecessarily limits the
user, so we settled on the mm. Thanks,
Alex
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-18 20:32 ` Alex Williamson
@ 2011-11-18 21:09 ` Scott Wood
2011-11-22 19:16 ` [Qemu-devel] " Alex Williamson
2011-11-21 2:47 ` David Gibson
1 sibling, 1 reply; 62+ messages in thread
From: Scott Wood @ 2011-11-18 21:09 UTC (permalink / raw)
To: Alex Williamson
Cc: David Gibson, chrisw, aik, pmac, joerg.roedel, agraf, benve,
aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
iommu, linux-pci
On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
> Hmm, that might be cleaner than eliminating the size with just using
> _IO(). So we might have something like:
>
> #define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> #define VFIO_IOMMU_MAP_DMA_V2 _IOWR(';', 106, struct vfio_dma_map_v2)
>
> For which the driver might do:
>
> case VFIO_IOMMU_MAP_DMA:
> case VFIO_IOMMU_MAP_DMA_V2:
> {
> struct vfio_dma_map map;
>
> /* We don't care about the extra v2 bits */
> if (copy_from_user(&map, (void __user *)arg, sizeof map))
> return -EFAULT;
That won't work if you have an old kernel that doesn't know about v2, and
a new user that uses v2. To make this work you'd have to strip out the
size from the ioctl number before switching (but still use it when
considering whether to access the v2 fields). Simpler to just leave it
out of the ioctl number and put it in the struct field as currently
planned.
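For reference, the variant described above (and argued against) might look
roughly like this in a kernel that knows about v2, assuming v2 only appends
fields to the v1 layout:

	if (_IOC_TYPE(cmd) == ';' &&
	    _IOC_NR(cmd) == _IOC_NR(VFIO_IOMMU_MAP_DMA)) {
		struct vfio_dma_map_v2 map = {};
		size_t size = min_t(size_t, _IOC_SIZE(cmd), sizeof(map));

		if (size < sizeof(struct vfio_dma_map))
			return -EINVAL;

		/* an old userspace passes the v1 size; the v2-only
		 * fields stay zeroed and are ignored */
		if (copy_from_user(&map, (void __user *)arg, size))
			return -EFAULT;
		...
	}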
> > > I think all our structure sizes are independent of host width. If I'm
> > > missing something, let me know.
> >
> > Ah, for structures, that might be true. I was seeing the bunch of
> > ioctl()s that take ints.
>
> Ugh, I suppose you're thinking of an ILP64 platform with ILP32 compat
> mode.
Does Linux support ILP64? There are "int" ioctls all over the place, and
I don't think we do compat wrappers for them. In fact, some of the
ioctls in linux/fs.h use "int" for the compatible version of ioctls
originally defined as "long".
It's cleaner to always use the fixed types, though.
> Darn it, guess we need to make everything 64bit, including file
> descriptors.
What's wrong with __u32/__s32 (or uint32_t/int32_t)?
I really do not see Linux supporting an ABI that has no 32-bit type at
all, especially in a situation where userspace compatibility is needed.
If that does happen, the ABI breakage will go well beyond VFIO.
> The point of the group is to provide a unit of ownership. We can't let
> $userA open $groupid and fetch a device, then have $userB do the same,
> grabbing a different device. The mappings will step on each other and
> the devices have no isolation. We can't restrict that purely by file
> permissions or we'll have the same problem with sudo.
What is the problem with sudo? If you're running processes as the same
user, regardless of how, they're going to be able to mess with each
other.
Is it possible to expose access to only specific groups via an
individually-permissionable /dev/device, so only the entity handing out
access to devices needs access to everything?
> At one point we discussed a single open instance, but that
> unnecessarily limits the user, so we settled on the mm. Thanks,
It would be nice if this limitation weren't excessively integrated into
the design -- in the embedded space we've got unusual partitioning
setups, including failover arrangements where partitions share devices.
The device may be configured with the IOMMU pointing only at regions that
are shared by both mms, or the non-shared regions may be reconfigured as
active ownership of the device gets handed around.
It would be up to userspace code to make sure that the mappings don't
"step on each other". The mapping could be done with whichever mm issued
the map call for a given region.
For this use case, there is unlikely to be an issue with ownership
because there will not be separate privilege domains creating partitions
-- other use cases could refrain from enabling multiple-mm support unless
ownership issues are resolved.
This doesn't need to be supported initially, but we should try to avoid
letting the assumption permeate the code.
-Scott
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-11-18 21:09 ` Scott Wood
@ 2011-11-22 19:16 ` Alex Williamson
2011-11-22 20:00 ` Scott Wood
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-22 19:16 UTC (permalink / raw)
To: Scott Wood
Cc: aafabbri, aik, kvm, pmac, qemu-devel, joerg.roedel, konrad.wilk,
agraf, David Gibson, chrisw, B08248, iommu, avi, linux-pci,
B07421, benve
On Fri, Nov 18, 2011 at 2:09 PM, Scott Wood <scottwood@freescale.com> wrote:
> On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
>> Hmm, that might be cleaner than eliminating the size with just using
>> _IO(). So we might have something like:
>>
>> #define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
>> #define VFIO_IOMMU_MAP_DMA_V2 _IOWR(';', 106, struct vfio_dma_map_v2)
>>
>> For which the driver might do:
>>
>> case VFIO_IOMMU_MAP_DMA:
>> case VFIO_IOMMU_MAP_DMA_V2:
>> {
>> struct vfio_dma_map map;
>>
>> /* We don't care about the extra v2 bits */
>> if (copy_from_user(&map, (void __user *)arg, sizeof map))
>> return -EFAULT;
>
> That won't work if you have an old kernel that doesn't know about v2, and
> a new user that uses v2. To make this work you'd have to strip out the
> size from the ioctl number before switching (but still use it when
> considering whether to access the v2 fields). Simpler to just leave it
> out of the ioctl number and put it in the struct field as currently
> planned.
Ok, _IO for all ioctls passing structs then.
>> > > I think all our structure sizes are independent of host width. If I'm
>> > > missing something, let me know.
>> >
>> > Ah, for structures, that might be true. I was seeing the bunch of
>> > ioctl()s that take ints.
>>
>> Ugh, I suppose you're thinking of an ILP64 platform with ILP32 compat
>> mode.
>
> Does Linux support ILP64? There are "int" ioctls all over the place, and
> I don't think we do compat wrappers for them. In fact, some of the
> ioctls in linux/fs.h use "int" for the compatible version of ioctls
> originally defined as "long".
>
> It's cleaner to always use the fixed types, though.
I've updated anything that passes data to use a structure and will
make use of __s32 in place of ints. If there ever exists an ILP64
system, we can use a flag bit of the structure to indicate 64bit file
descriptor support.
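As an illustration only (field names invented here, not from the patch), an
eventfd-passing argument might then look like:

struct vfio_irq_eventfds {
	__u32	len;	/* length of structure */
	__u32	flags;
#define VFIO_IRQ_EVENTFDS_FD64	(1 << 0)	/* reserved for a 64bit fd variant */
	__u32	index;
	__u32	count;	/* number of descriptors that follow */
	__s32	fds[];	/* always 32bit descriptors for now */
};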
>> Darn it, guess we need to make everything 64bit, including file
>> descriptors.
>
> What's wrong with __u32/__s32 (or uint32_t/int32_t)?
>
> I really do not see Linux supporting an ABI that has no 32-bit type at
> all, especially in a situation where userspace compatibility is needed.
> If that does happen, the ABI breakage will go well beyond VFIO.
Yep, I think the structs fix this and still leave room for the impossible.
>> The point of the group is to provide a unit of ownership. We can't let
>> $userA open $groupid and fetch a device, then have $userB do the same,
>> grabbing a different device. The mappings will step on each other and
>> the devices have no isolation. We can't restrict that purely by file
>> permissions or we'll have the same problem with sudo.
>
> What is the problem with sudo? If you're running processes as the same
> user, regardless of how, they're going to be able to mess with each
> other.
Just trying to indicate that file permissions are easy to bypass and
privileged users can inadvertently do stupid stuff. Kind of like
request_region() in the kernel. Kernel drivers are privileged, but
we still want to enforce an owner of that region. VFIO extends the
ownership of a device to a single entity in userspace. How do we
identify that entity and keep others out?
> Is it possible to expose access to only specific groups via an
> individually-permissionable /dev/device, so only the entity handing out
> access to devices needs access to everything?
Yes, that's fundamental to vfio. vfio-bus drivers enumerate devices
to the vfio-core. Privileged users bind devices to the vfio-bus
driver creating viable groups. Groups are represented as chardevs
under /dev/vfio. If a user has permission to access the chardev, they
have the ability to use the devices. Once they get a device or iommu
descriptor the group is tied to them via the struct mm and only they
are permitted to access the other devices in the group.
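In code terms, the ownership check being described boils down to something
like this at iommu/device open time (a sketch; the field name and error code
are assumptions):

	if (!iommu->mm)
		iommu->mm = current->mm;	/* first opener claims the group */
	else if (iommu->mm != current->mm)
		return -EPERM;			/* owned by a different mm */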
>> At one point we discussed a single open instance, but that
>> unnecessarily limits the user, so we settled on the mm. Thanks,
>
> It would be nice if this limitation weren't excessively integrated into
> the design -- in the embedded space we've got unusual partitioning
> setups, including failover arrangements where partitions share devices.
> The device may be configured with the IOMMU pointing only at regions that
> are shared by both mms, or the non-shared regions may be reconfigured as
> active ownership of the device gets handed around.
>
> It would be up to userspace code to make sure that the mappings don't
> "step on each other". The mapping could be done with whichever mm issued
> the map call for a given region.
>
> For this use case, there is unlikely to be an issue with ownership
> because there will not be separate privilege domains creating partitions
> -- other use cases could refrain from enabling multiple-mm support unless
> ownership issues are resolved.
>
> This doesn't need to be supported initially, but we should try to avoid
> letting the assumption permeate the code.
So I'm hearing "we want to use this driver you're developing that's
centered around using the iommu to securely provide access to a device
from userspace, but can we do it without the iommu and can we loosen
up the security a bit?" Is that about right? ;) Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-11-22 19:16 ` [Qemu-devel] " Alex Williamson
@ 2011-11-22 20:00 ` Scott Wood
2011-11-22 21:28 ` Alex Williamson
0 siblings, 1 reply; 62+ messages in thread
From: Scott Wood @ 2011-11-22 20:00 UTC (permalink / raw)
To: Alex Williamson
Cc: aafabbri, aik, kvm, pmac, qemu-devel, joerg.roedel, konrad.wilk,
agraf, David Gibson, chrisw, B08248, iommu, avi, linux-pci,
B07421, benve
On 11/22/2011 01:16 PM, Alex Williamson wrote:
> On Fri, Nov 18, 2011 at 2:09 PM, Scott Wood <scottwood@freescale.com> wrote:
>> On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
>>> Ugh, I suppose you're thinking of an ILP64 platform with ILP32 compat
>>> mode.
>>
>> Does Linux support ILP64? There are "int" ioctls all over the place, and
>> I don't think we do compat wrappers for them. In fact, some of the
>> ioctls in linux/fs.h use "int" for the compatible version of ioctls
>> originally defined as "long".
>>
>> It's cleaner to always use the fixed types, though.
>
> I've updated anything that passes data to use a structure
That's a bit extreme...
> and will make use of __s32 in place of ints. If there ever exists an ILP64
> system, we can use a flag bit of the structure to indicate 64bit file
> descriptor support.
If we end up supporting an ABI where compatibility between user and
kernel is broken even when we use fixed-size types and are careful about
alignment, we'll need a compat wrapper, and we'll know what ABI
userspace is supposed to be using. I'm not sure how a flag would help.
>>> The point of the group is to provide a unit of ownership. We can't let
>>> $userA open $groupid and fetch a device, then have $userB do the same,
>>> grabbing a different device. The mappings will step on each other and
>>> the devices have no isolation. We can't restrict that purely by file
>>> permissions or we'll have the same problem with sudo.
>>
>> What is the problem with sudo? If you're running processes as the same
>> user, regardless of how, they're going to be able to mess with each
>> other.
>
> Just trying to indicate that file permissions are easy to bypass and
> privileged users can inadvertently do stupid stuff.
Preventing stupid stuff can also prevent useful stuff. Security and
accident-avoidance are different things. "We can't let" is the domain
of the former.
> Kind of like request_region() in the kernel. Kernel drivers are privileged, but
> we still want to enforce an owner of that region. VFIO extends the
> ownership of a device to a single entity in userspace. How do we
> identify that entity and keep others out?
That's fine as long as it's an optional safeguard that can be turned off
if needed. Maybe require userspace to set a flag via some mechanism to
indicate it's opening the device in shared mode.
>> It would be nice if this limitation weren't excessively integrated into
>> the design -- in the embedded space we've got unusual partitioning
>> setups, including failover arrangements where partitions share devices.
>> The device may be configured with the IOMMU pointing only at regions that
>> are shared by both mms, or the non-shared regions may be reconfigured as
>> active ownership of the device gets handed around.
>>
>> It would be up to userspace code to make sure that the mappings don't
>> "step on each other". The mapping could be done with whichever mm issued
>> the map call for a given region.
>>
>> For this use case, there is unlikely to be an issue with ownership
>> because there will not be separate privilege domains creating partitions
>> -- other use cases could refrain from enabling multiple-mm support unless
>> ownership issues are resolved.
>>
>> This doesn't need to be supported initially, but we should try to avoid
>> letting the assumption permeate the code.
>
> So I'm hearing "we want to use this driver you're developing that's
> centered around using the iommu to securely provide access to a device
> from userspace, but can we do it without the iommu and can we loosen
> up the security a bit?" Is that about right? ;) Thanks,
We have a variety of use cases for userspace and KVM-guest access to
devices. Some of those involve an iommu, some don't. Some involve
shared ownership (which isn't necessarily a loosening of security --
there's still an iommu, and access control on the vfio group), some
don't. Some don't involve DMA at all. I see no reason to have entirely
separate kernel mechanisms for these use cases.
I'm not asking you to implement any of this, just hoping you'll keep
such flexibility in mind when deciding on fundamental assumptions that
the code and API are to make.
-Scott
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-11-22 20:00 ` Scott Wood
@ 2011-11-22 21:28 ` Alex Williamson
0 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-22 21:28 UTC (permalink / raw)
To: Scott Wood
Cc: aafabbri, aik, kvm, pmac, qemu-devel, joerg.roedel, konrad.wilk,
agraf, David Gibson, chrisw, B08248, iommu, avi, linux-pci,
B07421, benve
On Tue, 2011-11-22 at 14:00 -0600, Scott Wood wrote:
> On 11/22/2011 01:16 PM, Alex Williamson wrote:
> > On Fri, Nov 18, 2011 at 2:09 PM, Scott Wood <scottwood@freescale.com> wrote:
> >> On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
> >>> Ugh, I suppose you're thinking of an ILP64 platform with ILP32 compat
> >>> mode.
> >>
> >> Does Linux support ILP64? There are "int" ioctls all over the place, and
> >> I don't think we do compat wrappers for them. In fact, some of the
> >> ioctls in linux/fs.h use "int" for the compatible version of ioctls
> >> originally defined as "long".
> >>
> >> It's cleaner to always use the fixed types, though.
> >
> > I've updated anything that passes data to use a structure
>
> That's a bit extreme...
Ok, I lied, it's not everything. I have consolidated some GET_FLAGS and
GET_NUM_* calls into generic GET_INFO ioctls so we have more
flexibility. I think the structures make sense there. I'm not as
convinced on the eventfd and irq unmask structures, but who knows, they
might save us some day.
Here's where I stand on the API definitions, maybe we can get some
agreement on this before diving into semantics of the documentation or
implementation, though it still includes the merge interface.
Thanks,
Alex
/*
* VFIO API definition
*
* Copyright (C) 2011 Red Hat, Inc. All rights reserved.
* Author: Alex Williamson <alex.williamson@redhat.com>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*/
#ifndef VFIO_H
#define VFIO_H
#include <linux/types.h>
#ifdef __KERNEL__ /* Internal VFIO-core/bus driver API */
/**
* struct vfio_device_ops - VFIO bus driver device callbacks
*
* @match: Return true if buf describes the device
* @open: Called when userspace receives file descriptor for device
* @release: Called when userspace releases file descriptor for device
* @read: Perform read(2) on device file descriptor
* @write: Perform write(2) on device file descriptor
* @ioctl: Perform ioctl(2) on device file descriptor, supporting VFIO_DEVICE_*
* operations documented below
* @mmap: Perform mmap(2) on a region of the device file descriptor
*/
struct vfio_device_ops {
bool (*match)(struct device *dev, const char *buf);
int (*open)(void *device_data);
void (*release)(void *device_data);
ssize_t (*read)(void *device_data, char __user *buf,
size_t count, loff_t *ppos);
ssize_t (*write)(void *device_data, const char __user *buf,
size_t count, loff_t *ppos);
long (*ioctl)(void *device_data, unsigned int cmd,
unsigned long arg);
int (*mmap)(void *device_data, struct vm_area_struct *vma);
};
/**
* vfio_group_add_dev() - Add a device to the vfio-core
*
* @dev: Device to add
* @ops: VFIO bus driver callbacks for device
*
* This registration makes the VFIO core aware of the device, creates
* group objects as required and exposes chardevs under /dev/vfio.
*
* Return 0 on success, errno on failure.
*/
extern int vfio_group_add_dev(struct device *dev,
const struct vfio_device_ops *ops);
/**
* vfio_group_del_dev() - Remove a device from the vfio-core
*
* @dev: Device to remove
*
* Remove a device previously added to the VFIO core, removing groups
* and chardevs as necessary.
*/
extern void vfio_group_del_dev(struct device *dev);
/**
* vfio_bind_dev() - Indicate device is bound to the VFIO bus driver and
* register private data structure for ops callbacks.
*
* @dev: Device being bound
* @device_data: VFIO bus driver private data
*
* This registration indicates that a device previously registered with
* vfio_group_add_dev() is now available for use by the VFIO core. When
* all devices within a group are available, the group is viable and may
* be used by userspace drivers. Typically called from VFIO bus driver
* probe function.
*
* Return 0 on success, errno on failure
*/
extern int vfio_bind_dev(struct device *dev, void *device_data);
/**
* vfio_unbind_dev() - Indicate device is unbinding from VFIO bus driver
*
* @dev: Device being unbound
*
* De-registration of the device previously registered with vfio_bind_dev()
* from VFIO. Upon completion, the device is no longer available for use by
* the VFIO core. Typically called from the VFIO bus driver remove function.
* The VFIO core will attempt to release the device from users and may take
* measures to free the device and/or block as necessary.
*
* Returns pointer to private device_data structure registered with
* vfio_bind_dev().
*/
extern void *vfio_unbind_dev(struct device *dev);
#endif /* __KERNEL__ */
/* Kernel & User level defines for VFIO IOCTLs. */
/*
* The IOCTL interface is designed for extensibility by embedding the
* structure length (argsz) and flags into structures passed between
* kernel and userspace. We therefore use the _IO() macro for these
* defines to avoid implicitly embedding a size into the ioctl request.
* As structure fields are added, argsz will increase to match and flag
* bits will be defined to indicate additional fields with valid data.
* It's *always* the caller's responsibility to indicate the size of
* the structure passed by setting argsz appropriately.
*/
#define VFIO_TYPE ';'
#define VFIO_BASE 100
/* --------------- IOCTLs for GROUP file descriptors --------------- */
/**
* VFIO_GROUP_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 0, struct vfio_group_info)
*
* Retrieve information about the group. Fills in provided
* struct vfio_group_info. Caller sets argsz.
*/
struct vfio_group_info {
__u32 argsz;
__u32 flags;
#define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
#define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
};
#define VFIO_GROUP_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 0)
/**
* VFIO_GROUP_MERGE - _IOW(VFIO_TYPE, VFIO_BASE + 1, __s32)
*
* Merge group indicated by passed file descriptor into current group.
* Current group may be in use, group indicated by file descriptor
* cannot be in use (no open iommu or devices).
*/
#define VFIO_GROUP_MERGE _IOW(VFIO_TYPE, VFIO_BASE + 1, __s32)
/**
* VFIO_GROUP_UNMERGE - _IO(VFIO_TYPE, VFIO_BASE + 2)
*
* Remove the current group from a merged set. The current group cannot
* have any open devices.
*/
#define VFIO_GROUP_UNMERGE _IO(VFIO_TYPE, VFIO_BASE + 2)
/**
* VFIO_GROUP_GET_IOMMU_FD - _IO(VFIO_TYPE, VFIO_BASE + 3)
*
* Return a new file descriptor for the IOMMU object. The IOMMU object
* is shared among members of a merged group.
*/
#define VFIO_GROUP_GET_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 3)
/**
* VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 4, char)
*
* Return a new file descriptor for the device object described by
* the provided char array.
*/
#define VFIO_GROUP_GET_DEVICE_FD _IOW(VFIO_TYPE, VFIO_BASE + 4, char)
/* --------------- IOCTLs for IOMMU file descriptors --------------- */
/**
* VFIO_IOMMU_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 5, struct vfio_iommu_info)
*
* Retrieve information about the IOMMU object. Fills in provided
* struct vfio_iommu_info. Caller sets argsz.
*/
struct vfio_iommu_info {
__u32 argsz;
__u32 flags;
__u64 iova_max; /* Maximum IOVA address */
__u64 iova_min; /* Minimum IOVA address */
__u64 alignment; /* Required alignment, often PAGE_SIZE */
};
#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 5)
/**
* VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 6, struct vfio_dma_map)
*
* Map or unmap process virtual addresses to IO virtual addresses using
* the provided struct vfio_dma_map. Caller sets argsz.
*/
struct vfio_dma_map {
__u32 argsz;
__u32 flags;
#define VFIO_DMA_MAP_FLAG_MAP (1 << 0) /* Map (1) vs Unmap (0) */
#define VFIO_DMA_MAP_FLAG_READ (1 << 1) /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 2) /* writable from device */
__u64 vaddr; /* Process virtual address */
__u64 iova; /* IO virtual address */
__u64 size; /* Size of mapping (bytes) */
};
#define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 6)
/* --------------- IOCTLs for DEVICE file descriptors --------------- */
/**
* VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
* struct vfio_device_info)
*
* Retrieve information about the device. Fills in provided
* struct vfio_device_info. Caller sets argsz.
*/
struct vfio_device_info {
__u32 argsz;
__u32 flags;
#define VFIO_DEVICE_FLAGS_RESET (1 << 0) /* Device supports reset */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
};
#define VFIO_DEVICE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 7)
/**
* VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
* struct vfio_region_info)
*
* Retrieve information about a device region. Caller provides
* struct vfio_region_info with index value set. Caller sets argsz.
*/
struct vfio_region_info {
__u32 argsz;
__u32 flags;
#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0) /* Region supports mmap */
#define VFIO_REGION_INFO_FLAG_RO (1 << 1) /* Region is read-only */
__u32 index; /* Region index */
__u32 resv; /* Reserved for alignment */
__u64 size; /* Region size (bytes) */
__u64 offset; /* Region offset from start of device fd */
};
#define VFIO_DEVICE_GET_REGION_INFO _IO(VFIO_TYPE, VFIO_BASE + 8)
/**
* VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
* struct vfio_irq_info)
*
* Retrieve information about a device IRQ. Caller provides
* struct vfio_irq_info with index value set. Caller sets argsz.
*/
struct vfio_irq_info {
__u32 argsz;
__u32 flags;
#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0) /* Level (1) vs Edge (0) */
__u32 index; /* IRQ index */
__u32 count; /* Number of IRQs within this index */
};
#define VFIO_DEVICE_GET_IRQ_INFO _IO(VFIO_TYPE, VFIO_BASE + 9)
/**
* VFIO_DEVICE_SET_IRQ_EVENTFDS - _IOW(VFIO_TYPE, VFIO_BASE + 10,
* struct vfio_irq_eventfds)
*
* Set eventfds for IRQs using the struct vfio_irq_eventfds provided.
* Setting the eventfds also enables the interrupt. Caller sets argsz.
*/
struct vfio_irq_eventfds {
__u32 argsz;
__u32 flags;
__u32 index; /* IRQ index */
__u32 count; /* Number of eventfds */
__s32 eventfds[]; /* eventfd for sub-index, -1 to unset */
};
#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IO(VFIO_TYPE, VFIO_BASE + 10)
/**
* VFIO_DEVICE_UNMASK_IRQ - _IOW(VFIO_TYPE, VFIO_BASE + 11,
* struct vfio_unmask_irq)
*
* Unmask the IRQ described by the provided struct vfio_unmask_irq.
* Level triggered IRQs are masked when posted to userspace and must
* be unmasked to re-trigger. Caller sets argsz.
*/
struct vfio_unmask_irq {
__u32 argsz;
__u32 flags;
__u32 index; /* IRQ index */
__u32 subindex; /* Sub-index to unmask */
};
#define VFIO_DEVICE_UNMASK_IRQ _IO(VFIO_TYPE, VFIO_BASE + 11)
/**
* VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD - _IOW(VFIO_TYPE, VFIO_BASE + 12,
* struct vfio_irq_eventfds)
*
* Set eventfds to be used for unmasking IRQs using the provided
* struct vfio_irq_eventfds.
*/
#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IO(VFIO_TYPE, VFIO_BASE + 12)
/**
* VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 13)
*
* Reset a device.
*/
#define VFIO_DEVICE_RESET _IO(VFIO_TYPE, VFIO_BASE + 13)
#endif /* VFIO_H */
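As a brief usage note on the argsz convention above, the caller fills in the
structure size and flags and passes a pointer through the _IO()-numbered
ioctl; for example (iommu_fd and buffer are assumed to come from earlier
setup, and the addresses are arbitrary):

struct vfio_dma_map map = {
	.argsz = sizeof(map),
	.flags = VFIO_DMA_MAP_FLAG_MAP |
		 VFIO_DMA_MAP_FLAG_READ |
		 VFIO_DMA_MAP_FLAG_WRITE,
	.vaddr = (__u64)(unsigned long)buffer,	/* process virtual address */
	.iova  = 0,				/* IO virtual address */
	.size  = 4096,
};

if (ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &map))
	perror("VFIO_IOMMU_MAP_DMA");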
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-18 20:32 ` Alex Williamson
2011-11-18 21:09 ` Scott Wood
@ 2011-11-21 2:47 ` David Gibson
2011-11-22 18:22 ` Alex Williamson
1 sibling, 1 reply; 62+ messages in thread
From: David Gibson @ 2011-11-21 2:47 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, aik, pmac, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
> On Thu, 2011-11-17 at 11:02 +1100, David Gibson wrote:
> > On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
> > > On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> > > > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> <snip>
> > > > > +Groups, Devices, IOMMUs, oh my
> > > > > +-------------------------------------------------------------------------------
> > > > > +
> > > > > +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs
> > > > > +can't always distinguish transactions from each individual device in
> > > > > +the system. Sometimes this is because of the IOMMU design, such as with
> > > > > +PEs, other times it's caused by the I/O topology, for instance a
> > > > > +PCIe-to-PCI bridge masking all devices behind it. We call the sets of
> > > > > +devices created by these restictions IOMMU groups (or just "groups" for
> > > > > +this document).
> > > > > +
> > > > > +The IOMMU cannot distiguish transactions between the individual devices
> > > > > +within the group, therefore the group is the basic unit of ownership for
> > > > > +a userspace process. Because of this, groups are also the primary
> > > > > +interface to both devices and IOMMU domains in VFIO.
> > > > > +
> > > > > +The VFIO representation of groups is created as devices are added into
> > > > > +the framework by a VFIO bus driver. The vfio-pci module is an example
> > > > > +of a bus driver. This module registers devices along with a set of bus
> > > > > +specific callbacks with the VFIO core. These callbacks provide the
> > > > > +interfaces later used for device access. As each new group is created,
> > > > > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > > > > +character device.
> > > >
> > > > Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
> > > > bus driver is per bus type, not per bus instance. But grouping
> > > > constraints could be per bus instance, if you have a couple of
> > > > different models of PCI host bridge with IOMMUs of different
> > > > capabilities built in, for example.
> > >
> > > Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
> > > instance.
> >
> > Ok, how can that work. vfio-pci is responsible for generating the
> > groupings, yes? For which it needs to know the iommu/host bridge's
> > isolation capabilities, which vary depending on the type of host
> > bridge.
>
> No, grouping is done at the iommu driver level. vfio gets groupings via
> iommu_device_group(), which uses the iommu_ops for the bus_type of the
> requested device. I'll attempt to clarify where groups come from in the
> documentation.
Hrm, but still per bus type, not bus instance. Hrm. Yeah, I need to
look at the earlier iommu patches in more detail.
[snip]
> > Yeah, I'm starting to get my head around the model, but the current
> > description of it doesn't help very much. In particular the terms
> > "merge" and "unmerge" lead one to the wrong mental model, I think.
> >
> > > > The semantics of "merge" and "unmerge" under those names are really
> > > > non-obvious. Merge kind of has to merge two whole metagroups, but
> > > > it's unclear if unmerge reverses one merge, or just takes out one
> > > > (atom) group. These operations need better names, at least.
> > >
> > > Christian suggested a change to UNMERGE that we do not need to
> > > specify a group to unmerge "from". This makes it more like a list
> > > implementation except there's no defined list_head. Any member of the
> > > list can pull in a new entry. Calling UNMERGE on any member extracts
> > > that member.
> >
> > I think that's a good idea, but "unmerge" is not a good word for it.
>
> I can't think of better, if you can, please suggest.
Well, I think addgroup and removegroup would be better than merge and
unmerge, although they have their own problems.
> > > > Then it's unclear what order you can do various operations, and which
> > > > order you can open and close various things. You can kind of figure
> > > > it out but it takes far more thinking than it should.
> > > >
> > > >
> > > > So at the _very_ least, we need to invent new terminology and find a
> > > > much better way of describing this API's semantics. I still think an
> > > > entirely different interface, where metagroups are created from
> > > > outside with a lifetime that's not tied to an fd would be a better
> > > > idea.
> > >
> > > As we've discussed previously, configfs provides part of this, but has
> > > no ioctl support. It doesn't make sense to me to go play with groups in
> > > configfs, but then still interact with them via a char dev.
> >
> > Why not? You configure, say, loopback devices with losetup, then use
> > them as a block device. Similar with nbd. You can configure serial
> > devices with setserial, then use them as a char dev.
> >
> > > It also
> > > splits the ownership model
> >
> > I'm not even sure what that means.
> >
> > > and makes it harder to enforce who gets to
> > > interact with the devices vs who gets to manipulate groups.
> >
> > How so.
>
> Let's map out what a configfs interface would look like, maybe I'll
> convince myself it's on the table. We'd probably start with
Hrm, assuming we used configfs, which is not the only option.
> /config/vfio/$bus_type.name/
>
> That would probably be pre-populated with a bunch of $groupid files,
> matching /dev/vfio/$bus_type.name/$groupid char dev files (assuming
> configfs can pre-populate files). To make a user defined group, we
> might then do:
>
> mkdir /config/vfio/$bus_type.name/my_group
>
> That would generate a /dev/vfio/$bus_type.name/my_group char dev. To
> add groups to the new my_group "super group", we'd need to do something
> like:
>
> ln -s /config/vfio/$bus_type.name/$groupidA /config/vfio/$bus_type.name/my_group/nic_group
>
> I might then add a second group as:
>
> ln -s /config/vfio/$bus_type.name/$groupidB /config/vfio/$bus_type.name/my_group/hba_group
>
> Either link could fail if the target group is not viable,
The link op shouldn't fail because the subgroup isn't viable.
Instead, the supergroup just won't be viable until all devices in all
subgroups are bound to vfio.
> the group is
> already in use, or the second link could fail if the iommu domains were
> incompatible.
>
> Do these links cause /dev/vfio/$bus_type.name/{$groupidA,$groupidB} to
> disappear? If not, do we allow them to be opened? Linking would also
> have to fail if we later tried to link one of these groupids to a
> different super group.
Again, I think some confusion is coming in here from calling both the
hardware determined thing and the admin determined thing a "group".
So for now I'm going to call the first a "group" and the second a
"predomain" (because once it's viable and the right conditions are set
up it will become an iommu domain).
So another option is that "groups" *only* participate in the merging
interface; getting iommu and device handles occurs only on a
predomain. Therefore there would be no /dev/vfio/$group, you would
have to configure a predomain with at least one group before you had a
device file.
> Now we want to give my_group to a user, so we have to go back to /dev
> and
>
> chown $user /dev/vfio/$bus_type.name/my_group
>
> At this point my_group would have the existing set of group ioctls sans
> {UN}MERGE, of course.
>
> So $user can use the super group, but not manipulate its members. Do
> we then allow:
>
> chown $user /config/vfio/$bus_type.name/my_group
>
> If so, what does it imply about the user then doing:
>
> ln -s /config/vfio/$bus_type.name/$groupidC /config/vfio/$bus_type.name/my_group/stolen_group
>
> Would we instead need to chown the configfs groups as well as the super
> group?
>
> chown $user /config/vfio/$bus_type.name/my_group
> chown $user /config/vfio/$bus_type.name/$groupidA
> chown $user /config/vfio/$bus_type.name/$groupidB
>
> ie:
>
> # chown $user:$user /config/vfio/$bus_type.name/$groupC
> $ ln -s /config/vfio/$bus_type.name/$groupidC /config/vfio/$bus_type.name/my_group/given_group
This is not the only option. We could also do:
cd /config/vfio
mkdir new_predomain
echo $groupid > new_predomain/addgroup
chown $user /dev/vfio/new_predomain
This is assuming that configuration of predomains is a root only
operation, which seems reasonable to me.
> (linking has to look at the permissions of the target as well as the
> link name)
Which would be unexpected and therefore a bad idea.
> Now we've introduced that we have ownership of configfs entries, what
> does that imply about the char dev entries? For instance, can $userA
> own /dev/vfio/$bus_type.name/$groupidA, but $userB own the configfs
> file? We also have another security consideration that an exploit on
> the host might allow a 3rd party to insert a device into a group.
>
> This is where I start to get lost in the complexity versus simply giving
> the user permissions for the char dev and allowing them to stick groups
> together so long as they have permissions for the group.
>
> We also add an entire filesystem to the interface that already spans
> sysfs, dev, eventfds and potentially netlink.
>
> If terminology is the complaint against the {UN}MERGE ioctl interface,
> I'm still not sold that configfs is the answer. /me goes to the
> thesaurus... amalgamate? blend? combine? cement? unite? join?
A thesaurus won't help, my point is you want something with a
*different* meaning to merge, which implies a symmetry not present in
this operation.
[snip]
> > Hrm. "ANY" just really bothers me because "any iova" is not as clear
> > a concept as it first appears. For starters it's actually "any page
> > aligned" at the very least. But then it's only any 64-bit address for
> > busses which have full 64-bit addressing (and I do wonder if there are
> > any north bridges out there that forgot to implement some of the upper
> > PCI address bits properly, given that 64-bit CPUs rarely actually
> > implement more than 40-something physical address bits in practice).
> >
> > I'd prefer to see at least something to advertise min and max IOVA and
> > IOVA alignment. That's enough to cover x86 and POWER, including
> > possible variants with an IOMMU page size different to the system page
> > size (note that POWER kernels can have 64k pages as a config option,
> > which means a TCE page size different to the system page size is quite
> > common).
> >
> > Obviously there could be more complex constraints that we would need
> > to advertise with option bits.
>
> x86 has limitations as well. I don't think most x86 IOMMUs support a
> full 64bit IOVA space, so point taken.
>
> struct vfio_iommu_info {
> __u64 len; /* or structlen/arglen */
> __u64 flags; /* replaces VFIO_IOMMU_GET_FLAGS, none defined yet */
> __u64 iova_max;
> __u64 iova_min;
> __u64 granularity;
> };
>
> #define VFIO_IOMMU_GET_INFO _IOR(';', xxx, struct vfio_iommu_info)
Yeah, this looks like what I was after.
> Is granularity the minimum granularity, typically PAGE_SIZE barring
> special configurations noted above, or is it a bitmap of supported
> granularities? Ex. If we support 4k normal pages and 2M large pages, we
> might set bits 12 and 21.
Just minimum, I think. I'd prefer 'alignment' to 'granularity' I
think, but I don't care that much.
> > > > > +
> > > > > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > > > > +and unmapping IOVAs to process virtual addresses:
> > > > > +
> > > > > +struct vfio_dma_map {
> > > > > + __u64 len; /* length of structure */
> > > >
> > > > Thanks for adding these structure length fields. But I think they
> > > > should be called something other than 'len', which is likely to be
> > > > confused with size (or some other length that's actually related to
> > > > the operation's parameters). Better to call it 'structlen' or
> > > > 'argslen' or something.
> > >
> > > Ok. As Scott noted, I've failed to implement these in a way that
> > > actually allows extension, but I'll work on it.
> >
> > Right. I had failed to realise quite how the encoding of structure
> > size into the ioctl worked. With that in place, arguably we don't
> > really need the size in the structure itself, because we can still
> > have multiple sized versions of the ioctl. Still, whichever.
>
> Hmm, that might be cleaner than eliminating the size with just using
> _IO(). So we might have something like:
>
> #define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map)
> #define VFIO_IOMMU_MAP_DMA_V2 _IOWR(';', 106, struct vfio_dma_map_v2)
>
> For which the driver might do:
>
> case VFIO_IOMMU_MAP_DMA:
> case VFIO_IOMMU_MAP_DMA_V2:
> {
> struct vfio_dma_map map;
>
> /* We don't care about the extra v2 bits */
> if (copy_from_user(&map, (void __user *)arg, sizeof map))
> return -EFAULT;
> ...
>
> That presumes v2 is compatible other than extra fields.
Right, as does having the length in the structure itself.
> Any objections
> (this gets rid of length from all ioctl passed structs)?
Not from here.
[snip]
> > No, my concept was that NONE would be used for the indexes where there
> > is no valid BAR. I'll buy your argument on why not to include the PCI
> > (or whatever) address space type here.
> >
> > What I'm just a bit concerned by is whether we could have a case (not
> > for PCI) of a real resource that still has size 0 - e.g. maybe some
> > sort of doorbell that can't be read or written, but can be triggered
> > some other way. I guess that's probably unlikely though.
>
> Right, and if somehow you had such a region where the size is zero, but
> allowed some kind of operation on it, we could define a flag for it.
Hrm, I guess.
[snip]
> > > Other devices could certainly have multiple level interrupts, and if
> > > grouping them as we do with MSI on PCI makes sense, please let me know.
> > > I just didn't see the value in making the unmask operations handle
> > > sub-indexes if it's not needed.
> >
> > I don't know of anything off hand. But I can't see any consideration
> > that would make it unlikely either. I generally don't trust anything
> > *not* to exist in embedded space.
>
> Fair enough. Level IRQs are still triggered individually, so unmasking
> is too, which means UNMASK_IRQ takes something like { int index; int
> subindex }.
>
> SET_UNMASK_IRQ_EVENTFDS should follow SET_IRQ_EVENTFDS and take { int
> index; int count; int fds[] }.
Ok.
[snip]
> > > > Use of "read" and "write" in DMA can often be confusing, since it's
> > > > not always clear if you're talking from the perspective of the CPU or
> > > > the device (_writing_ data to a device will usually involve it doing
> > > > DMA _reads_ from memory). It's often best to express things as DMA
> > > > direction, 'to device', and 'from device' instead.
> > >
> > > Good point.
> >
> > This, of course, potentially affects many areas of the code and doco.
>
> I've changed vfio_iommu to use <linux/dma-direction.h> definitions
> internally. For the ioctl I've so far simply included WRITE and READ
> flags, which I can clarify are from the device perspective. Flags like
> VFIO_DMA_MAP_FLAG_TO_DEVICE/FROM_DEVICE are actually more confusing to
> me at this interface level. We also have IOMMU_READ/IOMMU_WRITE which
> makes me question using dma-direction.h and if we shouldn't just define
> everything as from the device perspective.
Ok, sounds like a good start. In some contexts read/write are clear,
in others they're not. Just something to keep in mind.
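As a small illustration of the device-perspective convention (this helper is
hypothetical, not part of the posted patch), the ioctl flags would translate
into IOMMU API protection bits roughly like this:

/* READ/WRITE are from the device's point of view: IOMMU_READ lets the
 * device read memory (DMA from memory), IOMMU_WRITE lets it write. */
static int vfio_dma_flags_to_prot(u32 flags)
{
	int prot = 0;

	if (flags & VFIO_DMA_MAP_FLAG_READ)
		prot |= IOMMU_READ;
	if (flags & VFIO_DMA_MAP_FLAG_WRITE)
		prot |= IOMMU_WRITE;

	return prot;
}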
[snip]
> > Right, but I'm not just talking about the current map/unmap calls
> > themselves. This infrastructure for tracking it looks like it's
> > intended to be generic for all mapping methods. If not, I can't see
> > the reason for it, because I don't think the current interface
> > requires such tracking inherently.
>
> It does seem that way, but there is a purpose. We need to unmap
> everything on release. It's easy to assume that iommu_domain_free()
> will unmap everything from the IOMMU, which it does, but we've also done
> a get_user_pages on each of those in vfio, which we need to cleanup. We
> can't rely on userspace to do this since they might have been SIGKILL'd.
> Making it generic with coalescing of adjacent regions and such is
> primarily for space efficiency.
Ah, I see. Much as generic infrastructure is nice when we can do it,
I think this consideration will have to be pushed down to the iommu
driver layer. For e.g. on power, we have all the information we need
to do the page tracking; any write to a TCE put()s the page that was
previously in that entry (if any) as well as get()ing the one that's
going in (if any). It's just that we don't want to keep track in this
generic data structure _as well as_ the one that's natural for the
hardware.
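For context, a rough sketch of the cleanup that tracking enables (the names
are placeholders, not the code from the patch): on release the core walks its
list of mappings, tears down each IOVA range and drops the page references it
took at map time, since userspace may have been SIGKILL'd and cannot.

static void vfio_release_all_mappings(struct vfio_iommu *iommu)
{
	struct vfio_mapping *m, *tmp;	/* placeholder tracking entry */

	list_for_each_entry_safe(m, tmp, &iommu->mappings, list) {
		unsigned long i;

		vfio_unmap_range(iommu, m->iova, m->npages);	/* placeholder */
		/* drop the references get_user_pages() took at map time */
		for (i = 0; i < m->npages; i++)
			put_page(m->pages[i]);
		list_del(&m->list);
		kfree(m);
	}
}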
> <snip>
> > > > > +#ifdef CONFIG_COMPAT
> > > > > +static long vfio_iommu_compat_ioctl(struct file *filep,
> > > > > + unsigned int cmd, unsigned long arg)
> > > > > +{
> > > > > + arg = (unsigned long)compat_ptr(arg);
> > > > > + return vfio_iommu_unl_ioctl(filep, cmd, arg);
> > > >
> > > > Um, this only works if the structures are exactly compatible between
> > > > 32-bit and 64-bit ABIs. I don't think that is always true.
> > >
> > > I think all our structure sizes are independent of host width. If I'm
> > > missing something, let me know.
> >
> > Ah, for structures, that might be true. I was seeing the bunch of
> > ioctl()s that take ints.
>
> Ugh, I suppose you're thinking of an ILP64 platform with ILP32 compat
> mode. Darn it, guess we need to make everything 64bit, including file
> descriptors.
Well, we don't _have_ to, but if we don't then we have to implement
compat wrappers for every non explicit width thing we pass through.
> <snip>
> > > > > +
> > > > > +/* Get a new iommu file descriptor. This will open the iommu, setting
> > > > > + * the current->mm ownership if it's not already set. */
> > > >
> > > > I know I've had this explained to me several times before, but I've
> > > > forgotten again. Why do we need to wire the iommu to an mm?
> > >
> > > We're mapping process virtual addresses into the IOMMU, so it makes
> > > sense to restrict ourselves to a single virtual address space. It also
> > > enforces the ownership, that only a single mm is in control of the
> > > group.
> >
> > Neither of those seems conclusive to me, but I remember that I saw a
> > strong reason earlier, even if I can't remember it now.
>
> The point of the group is to provide a unit of ownership. We can't let
> $userA open $groupid and fetch a device, then have $userB do the same,
> grabbing a different device. The mappings will step on each other and
> the devices have no isolation. We can't restrict that purely by file
> permissions or we'll have the same problem with sudo. At one point we
> discussed a single open instance, but that unnecessarily limits the
> user, so we settled on the mm. Thanks,
Hm, ok.
Fyi, I'll be kind of slow in responses for the next while. I broke a
bone in my hand on Friday :(.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-21 2:47 ` David Gibson
@ 2011-11-22 18:22 ` Alex Williamson
0 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-22 18:22 UTC (permalink / raw)
To: David Gibson
Cc: chrisw, aik, pmac, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Mon, 2011-11-21 at 13:47 +1100, David Gibson wrote:
> On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
> > On Thu, 2011-11-17 at 11:02 +1100, David Gibson wrote:
> > > On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
> > > > On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> > > > > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
<snip>
> > > > As we've discussed previously, configfs provides part of this, but has
> > > > no ioctl support. It doesn't make sense to me to go play with groups in
> > > > configfs, but then still interact with them via a char dev.
> > >
> > > Why not? You configure, say, loopback devices with losetup, then use
> > > them as a block device. Similar with nbd. You can configure serial
> > > devices with setserial, then use them as a char dev.
> > >
> > > > It also
> > > > splits the ownership model
> > >
> > > I'm not even sure what that means.
> > >
> > > > and makes it harder to enforce who gets to
> > > > interact with the devices vs who gets to manipulate groups.
> > >
> > > How so.
> >
> > Let's map out what a configfs interface would look like, maybe I'll
> > convince myself it's on the table. We'd probably start with
>
> Hrm, assuming we used configfs, which is not the only option.
I'm not writing vfiofs, configfs seems most like what we'd need. If
there are others we should consider, please note them.
> > /config/vfio/$bus_type.name/
> >
> > That would probably be pre-populated with a bunch of $groupid files,
> > matching /dev/vfio/$bus_type.name/$groupid char dev files (assuming
> > configfs can pre-populate files). To make a user defined group, we
> > might then do:
> >
> > mkdir /config/vfio/$bus_type.name/my_group
> >
> > That would generate a /dev/vfio/$bus_type.name/my_group char dev. To
> > add groups to the new my_group "super group", we'd need to do something
> > like:
> >
> > ln -s /config/vfio/$bus_type.name/$groupidA /config/vfio/$bus_type.name/my_group/nic_group
> >
> > I might then add a second group as:
> >
> > ln -s /config/vfio/$bus_type.name/$groupidB /config/vfio/$bus_type.name/my_group/hba_group
> >
> > Either link could fail if the target group is not viable,
>
> The link op shouldn't fail because the subgroup isn't viable.
> Instead, the supergroup just won't be viable until all devices in all
> subgroups are bound to vfio.
The supergroup may already be in use if it's a hotplug. What does it
mean to have an incompatible group linked into the supergroup? When
does the subgroup actually become part of the supergroup? Does the
userspace driver using the supergroup get notified somehow? Does the
vfio driver get notified independently? This example continues to show
what an administration nightmare it becomes when we split management
from usage.
> > the group is
> > already in use, or the second link could fail if the iommu domains were
> > incompatible.
> >
> > Do these links cause /dev/vfio/$bus_type.name/{$groupidA,$groupidB} to
> > disappear? If not, do we allow them to be opened? Linking would also
> > have to fail if we later tried to link one of these groupids to a
> > different super group.
>
> Again, I think some confusion is coming in here from calling both the
> hardware determined thing and the admin determined thing a "group".
> So for now I'm going to call the first a "group" and the second a
> "predomain" (because once it's viable and the right conditions are set
> up it will become an iommu domain).
>
> So another option is that "groups" *only* participate in the merging
> interface; getting iommu and device handles occurs only on a
> predomain. Therefore there would be no /dev/vfio/$group, you would
> have to configure a predomain with at least one group before you had a
> device file.
I think this actually leads to a more complicated, more difficult to use
interface that interposes an unnecessary administration layer into a
driver's decisions about how to manage the iommu.
> > Now we want to give my_group to a user, so we have to go back to /dev
> > and
> >
> > chown $user /dev/vfio/$bus_type.name/my_group
> >
> > At this point my_group would have the existing set of group ioctls sans
> > {UN}MERGE, of course.
> >
> > So $user can use the super group, but not manipulate its members. Do
> > we then allow:
> >
> > chown $user /config/vfio/$bus_type.name/my_group
> >
> > If so, what does it imply about the user then doing:
> >
> > ln -s /config/vfio/$bus_type.name/$groupidC /config/vfio/$bus_type.name/my_group/stolen_group
> >
> > Would we instead need to chown the configfs groups as well as the super
> > group?
> >
> > chown $user /config/vfio/$bus_type.name/my_group
> > chown $user /config/vfio/$bus_type.name/$groupidA
> > chown $user /config/vfio/$bus_type.name/$groupidB
> >
> > ie:
> >
> > # chown $user:$user /config/vfio/$bus_type.name/$groupC
> > $ ln -s /config/vfio/$bus_type.name/$groupidC /config/vfio/$bus_type.name/my_group/given_group
>
> This is not the only option. We could also do:
>
> cd /config/vfio
> mkdir new_predomain
> echo $groupid > new_predomain/addgroup
> chown $user /dev/vfio/new_predomain
echo $groupid > new_predomain/delgroup
SEGV... Now we've included yet another admin path in the hotplug case as
the userspace driver needs to coordinate removal of groups with some
other entity.
> This is assuming that configuration of predomains is a root only
> operation, which seems reasonable to me.
I think it should be a driver decision. Let's go back to the purpose of
this interface. We want to give *devices* to userspace drivers. Groups
are an unfortunate side-effect of hardware topology, so instead of
giving the user a device, we give it a group that contains the device.
It's a driver optimization that they can say "oh, I wonder if I can use
the same iommu descriptor to drive both of these, let me try to merge
them...". That results in "worked, yay" skip initializing a new iommu
object OR "nope, oh well". Adding an admin layer that presupposes that
they should be merged and does it adds nothing for the better.
> > (linking has to look at the permissions of the target as well as the
> > link name)
>
> Which would be unexpected and therefore a bad idea.
Another indication that this is the wrong interface.
> > Now we've introduced that we have ownership of configfs entries, what
> > does that imply about the char dev entries? For instance, can $userA
> > own /dev/vfio/$bus_type.name/$groupidA, but $userB own the configfs
> > file? We also have another security consideration that an exploit on
> > the host might allow a 3rd party to insert a device into a group.
> >
> > This is where I start to get lost in the complexity versus simply giving
> > the user permissions for the char dev and allowing them to stick groups
> > together so long as they have permissions for the group.
> >
> > We also add an entire filesystem to the interface that already spans
> > sysfs, dev, eventfds and potentially netlink.
> >
> > If terminology is the complaint against the {UN}MERGE ioctl interface,
> > I'm still not sold that configfs is the answer. /me goes to the
> > thesaurus... amalgamate? blend? combine? cement? unite? join?
>
> A thesaurus won't help, my point is you want something with a
> *different* meaning to merge, which implies a symmetry not present in
> this operation.
But there is symmetry in a merged group, let's look at the operations on
a group (note I've updated some of the ioctls since last posting):
VFIO_GROUP_GET_INFO
This returns a structure containing flags for the group, when
merged it represents the merged group.
VFIO_GROUP_GET_DEVICE_FD
This returns a file descriptor for the device described by the
given char*, when merged it operates across all groups within
the merged set.
VFIO_GROUP_GET_IOMMU_FD
Return a file descriptor for the iommu, when merged there's a
single iommu across the merged group.
VFIO_GROUP_MERGE
Pull a singleton group into a merge. This can be called on any
member of a merged group to pull a singleton group into the
merged set.
VFIO_GROUP_UNMERGE
Extract the group from the merged set.
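For concreteness, a short userspace sketch of those operations (the group
numbers are invented and error handling is omitted):

int grp_a = open("/dev/vfio/26", O_RDWR);
int grp_b = open("/dev/vfio/42", O_RDWR);

/* pull B's singleton group into A's merged set; fails if the iommu
 * domains are incompatible or B already has an open iommu/device */
ioctl(grp_a, VFIO_GROUP_MERGE, &grp_b);

/* one iommu object now covers every group in the merged set */
int iommu = ioctl(grp_a, VFIO_GROUP_GET_IOMMU_FD);

/* later, B can be extracted again through its own fd */
ioctl(grp_b, VFIO_GROUP_UNMERGE);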
Where is the discontinuity with calling this symmetric? Is it simply
that we have an entry point to the supergroup at each subgroup? Forming
a new node when groups are merged is a limitation, not a feature, and
imposes a number of administration issues (ownership, creation,
deletion, addition, subtraction, notifications, etc). Is it that we can
only merge singletons? This is an implementation restriction, not an
API restriction. If you want to go to the trouble of determining that
the existing IOMMU mappings are compatible and can atomically merge
them, the singleton could instead be a member of another supergroup. We
currently can't do this atomically, and as merging is an optimization, I
leave the burden on userspace to split supergroups if they want to merge
with another group.
I'm not sure why this is such a thorn since, as I understand the power iommu
topology, you're going to have IOVA windows per group that really can't
make use of the merge interface. This is mostly useful for "MAP_ANY"
style IOMMUs. Do you really want to impose the administrative overhead
of predomains for a feature you're not likely to use?
<snip>
> [snip]
> > > Right, but I'm not just talking about the current map/unmap calls
> > > themselves. This infrastructure for tracking it looks like it's
> > > intended to be generic for all mapping methods. If not, I can't see
> > > the reason for it, because I don't think the current interface
> > > requires such tracking inherently.
> >
> > It does seem that way, but there is a purpose. We need to unmap
> > everything on release. It's easy to assume that iommu_domain_free()
> > will unmap everything from the IOMMU, which it does, but we've also done
> > a get_user_pages on each of those in vfio, which we need to cleanup. We
> > can't rely on userspace to do this since they might have been SIGKILL'd.
> > Making it generic with coalescing of adjacent regions and such is
> > primarily for space efficiency.
>
>
> Ah, I see. Much as generic infrastructure is nice when we can do it,
> I think this consideration will have to be pushed down to the iommu
> driver layer. For e.g. on power, we have all the information we need
> to do the page tracking; any write to a TCE put()s the page that was
> previously in that entry (if any) as well as get()ing the one that's
> going in (if any). It's just that we don't want to keep track in this
> generic data structure _as well as_ the one that's natural for the
> hardware.
There are few users of the IOMMU API, maybe we can negotiate this. I
also expect as power gets added, we'll need to make the vfio_iommu layer
more modular. It's possible you won't make use of this iommu object and
can leave page tracking to the iommu. I think that can be done within
the existing API though. Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-15 6:34 ` David Gibson
2011-11-15 18:01 ` Alex Williamson
@ 2011-11-15 20:10 ` Scott Wood
2011-11-15 21:40 ` Aaron Fabbri
1 sibling, 1 reply; 62+ messages in thread
From: Scott Wood @ 2011-11-15 20:10 UTC (permalink / raw)
To: Alex Williamson, chrisw, aik, pmac, joerg.roedel, agraf, benve,
aafabbri, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
iommu, linux-pci
On 11/15/2011 12:34 AM, David Gibson wrote:
> I think we need to pin exactly what "MAP_ANY" means down better. Now,
> VFIO is pretty much a lost cause if you can't map any normal process
> memory page into the IOMMU, so I think the only thing that is really
> covered is IOVAs. But saying "can map any IOVA" is not clear, because
> if you can't map it, it's not a (valid) IOVA. Better to say that
> IOVAs can be any 64-bit value, which I think is what you really mean
> here.
It also means that there are no restrictions on what the IOVA can be
within that range (other than page alignment), which isn't true on our
IOMMU.
We'll also need a way to communicate the desired geometry of the overall
IOMMU table (for this group) to the kernel, which determines what the
restrictions will be (we can't determine it automatically until we know
what all the translation requests will be, and even then it's awkward).
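Purely as an illustration of the kind of information we'd need to pass (this
is not part of the posted API, just a hypothetical sketch), the iommu fd
could take something along the lines of:

struct vfio_iommu_set_window {		/* hypothetical */
	__u32 argsz;
	__u32 flags;
	__u64 iova_start;	/* requested base of the DMA window */
	__u64 iova_size;	/* requested length of the DMA window */
};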
> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
>> +When a level triggered interrupt is signaled, the interrupt is masked
>> +on the host. This prevents an unresponsive userspace driver from
>> +continuing to interrupt the host system. After servicing the interrupt,
>> +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
>> +triggered interrupts implicitly have a count of 1 per index.
>
> This is a silly restriction. Even PCI devices can have up to 4 LSIs
> on a function in theory, though no-one ever does. Embedded devices
> can and do have multiple level interrupts.
Those interrupts would each have their own index. This is necessary for
level-triggered interrupts since they'll need to be individually
identifiable to VFIO_DEVICE_UNMASK_IRQ -- doesn't seem worth adding
another parameter to UNMASK.
>> +#ifdef CONFIG_COMPAT
>> +static long vfio_iommu_compat_ioctl(struct file *filep,
>> + unsigned int cmd, unsigned long arg)
>> +{
>> + arg = (unsigned long)compat_ptr(arg);
>> + return vfio_iommu_unl_ioctl(filep, cmd, arg);
>
> Um, this only works if the structures are exactly compatible between
> 32-bit and 64-bit ABIs. I don't think that is always true.
These are new structs, we can make it true.
>> +static int allow_unsafe_intrs;
>> +module_param(allow_unsafe_intrs, int, 0);
>> +MODULE_PARM_DESC(allow_unsafe_intrs,
>> + "Allow use of IOMMUs which do not support interrupt remapping");
>
> This should not be a global option, but part of the AMD/Intel IOMMU
> specific code. In general it's a question of how strict the IOMMU
> driver is about isolation when it determines what the groups are, and
> only the IOMMU driver can know what the possibilities are for its
> class of hardware.
It's also a concern that is specific to MSIs. In any case, I'm not sure
that the ability to cause a spurious IRQ is bad enough to warrant
disabling the entire subsystem by default on certain hardware.
Probably best to just print a warning on module init if there are any
known isolation holes, and let the admin decide whom (if anyone) to let
use this. If the hole is bad enough that it must be confirmed, it
should require at most a sysfs poke.
-Scott
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-15 20:10 ` Scott Wood
@ 2011-11-15 21:40 ` Aaron Fabbri
2011-11-15 22:29 ` Scott Wood
0 siblings, 1 reply; 62+ messages in thread
From: Aaron Fabbri @ 2011-11-15 21:40 UTC (permalink / raw)
To: Scott Wood, Alex Williamson, chrisw, aik, pmac, joerg.roedel,
agraf, benve, B08248, B07421, avi, konrad.wilk, kvm, qemu-devel,
iommu, linux-pci
On 11/15/11 12:10 PM, "Scott Wood" <scottwood@freescale.com> wrote:
> On 11/15/2011 12:34 AM, David Gibson wrote:
<snip>
>>> +static int allow_unsafe_intrs;
>>> +module_param(allow_unsafe_intrs, int, 0);
>>> +MODULE_PARM_DESC(allow_unsafe_intrs,
>>> + "Allow use of IOMMUs which do not support interrupt remapping");
>>
>> This should not be a global option, but part of the AMD/Intel IOMMU
>> specific code. In general it's a question of how strict the IOMMU
>> driver is about isolation when it determines what the groups are, and
>> only the IOMMU driver can know what the possibilities are for its
>> class of hardware.
>
> It's also a concern that is specific to MSIs. In any case, I'm not sure
> that the ability to cause a spurious IRQ is bad enough to warrant
> disabling the entire subsystem by default on certain hardware.
I think the issue is more that the ability to create fake MSI interrupts can
lead to bigger exploits.
Originally we didn't have this parameter. It was added to reflect the
fact that MSIs triggered by guests are dangerous without the isolation that
interrupt remapping provides.
That is, it *should* be inconvenient to run without interrupt remapping HW
support.
-Aaron
> Probably best to just print a warning on module init if there are any
> known isolation holes, and let the admin decide whom (if anyone) to let
> use this. If the hole is bad enough that it must be confirmed, it
> should require at most a sysfs poke.
>
> -Scott
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-15 21:40 ` Aaron Fabbri
@ 2011-11-15 22:29 ` Scott Wood
2011-11-16 23:34 ` Alex Williamson
0 siblings, 1 reply; 62+ messages in thread
From: Scott Wood @ 2011-11-15 22:29 UTC (permalink / raw)
To: Aaron Fabbri
Cc: Alex Williamson, chrisw, aik, pmac, joerg.roedel, agraf, benve,
B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
linux-pci
On 11/15/2011 03:40 PM, Aaron Fabbri wrote:
>
>
>
> On 11/15/11 12:10 PM, "Scott Wood" <scottwood@freescale.com> wrote:
>
>> On 11/15/2011 12:34 AM, David Gibson wrote:
> <snip>
>>>> +static int allow_unsafe_intrs;
>>>> +module_param(allow_unsafe_intrs, int, 0);
>>>> +MODULE_PARM_DESC(allow_unsafe_intrs,
>>>> + "Allow use of IOMMUs which do not support interrupt remapping");
>>>
>>> This should not be a global option, but part of the AMD/Intel IOMMU
>>> specific code. In general it's a question of how strict the IOMMU
>>> driver is about isolation when it determines what the groups are, and
>>> only the IOMMU driver can know what the possibilities are for its
>>> class of hardware.
>>
>> It's also a concern that is specific to MSIs. In any case, I'm not sure
>> that the ability to cause a spurious IRQ is bad enough to warrant
>> disabling the entire subsystem by default on certain hardware.
>
> I think the issue is more that the ability to create fake MSI interrupts can
> lead to bigger exploits.
>
> Originally we didn't have this parameter. It was added to reflect the
> fact that MSIs triggered by guests are dangerous without the isolation that
> interrupt remapping provides.
>
> That is, it *should* be inconvenient to run without interrupt remapping HW
> support.
A sysfs knob is sufficient inconvenience. It should only affect whether
you can use MSIs, and the relevant issue shouldn't be "has interrupt
remapping" but "is there a hole".
Some systems might address the issue in ways other than IOMMU-level MSI
translation. Our interrupt controller provides enough separate 4K pages
for MSI interrupt delivery for each PCIe IOMMU group to get its own (we
currently only have 3, one per root complex) -- no special IOMMU feature
required.
It doesn't help that the semantics of IOMMU_CAP_INTR_REMAP are
undefined. I shouldn't have to know how x86 IOMMUs work when
implementing a driver for different hardware, just to know what the
generic code is expecting.
As David suggests, if you want to do this it should be the x86 IOMMU
driver that has a knob that controls how it forms groups in the absence
of this support.
-Scott
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-15 22:29 ` Scott Wood
@ 2011-11-16 23:34 ` Alex Williamson
0 siblings, 0 replies; 62+ messages in thread
From: Alex Williamson @ 2011-11-16 23:34 UTC (permalink / raw)
To: Scott Wood
Cc: Aaron Fabbri, chrisw, aik, pmac, joerg.roedel, agraf, benve,
B08248, B07421, avi, konrad.wilk, kvm, qemu-devel, iommu,
linux-pci
On Tue, 2011-11-15 at 16:29 -0600, Scott Wood wrote:
> On 11/15/2011 03:40 PM, Aaron Fabbri wrote:
> >
> >
> >
> > On 11/15/11 12:10 PM, "Scott Wood" <scottwood@freescale.com> wrote:
> >
> >> On 11/15/2011 12:34 AM, David Gibson wrote:
> > <snip>
> >>>> +static int allow_unsafe_intrs;
> >>>> +module_param(allow_unsafe_intrs, int, 0);
> >>>> +MODULE_PARM_DESC(allow_unsafe_intrs,
> >>>> + "Allow use of IOMMUs which do not support interrupt remapping");
> >>>
> >>> This should not be a global option, but part of the AMD/Intel IOMMU
> >>> specific code. In general it's a question of how strict the IOMMU
> >>> driver is about isolation when it determines what the groups are, and
> >>> only the IOMMU driver can know what the possibilities are for its
> >>> class of hardware.
> >>
> >> It's also a concern that is specific to MSIs. In any case, I'm not sure
> >> that the ability to cause a spurious IRQ is bad enough to warrant
> >> disabling the entire subsystem by default on certain hardware.
> >
> > I think the issue is more that the ability to create fake MSI interrupts can
> > lead to bigger exploits.
> >
> > Originally we didn't have this parameter. It was added it to reflect the
> > fact that MSI's triggered by guests are dangerous without the isolation that
> > interrupt remapping provides.
> >
> > That is, it *should* be inconvenient to run without interrupt remapping HW
> > support.
>
> A sysfs knob is sufficient inconvenience. It should only affect whether
> you can use MSIs, and the relevant issue shouldn't be "has interrupt
> remapping" but "is there a hole".
>
> Some systems might address the issue in ways other than IOMMU-level MSI
> translation. Our interrupt controller provides enough separate 4K pages
> for MSI interrupt delivery for each PCIe IOMMU group to get its own (we
> currently only have 3, one per root complex) -- no special IOMMU feature
> required.
>
> It doesn't help that the semantics of IOMMU_CAP_INTR_REMAP are
> undefined. I shouldn't have to know how x86 IOMMUs work when
> implementing a driver for different hardware, just to know what the
> generic code is expecting.
>
> As David suggests, if you want to do this it should be the x86 IOMMU
> driver that has a knob that controls how it forms groups in the absence
> of this support.
That is a possibility; we could push it down to the iommu driver, which
could simply lump everything into a single groupid when interrupt
remapping is not supported, or, more directly, when there is an exposure
where devices can trigger random MSIs in the host. Then we wouldn't need
an option to override this in vfio, you'd just be stuck not being able
to use any devices if you can't bind everything to vfio. That also
eliminates the possibility of flipping it on dynamically since we can't
handle groupids changing. Then we'd need an iommu=group_unsafe_msi flag
to enable it. Ok? Thanks,
Alex
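A minimal sketch of that idea (the example_* names are hypothetical, not from the posted code): an IOMMU driver's device_group() callback could collapse everything into one group whenever the hardware cannot isolate MSIs, so userspace would have to own every device (or none) to use VFIO.
	static int example_device_group(struct device *dev, unsigned int *groupid)
	{
		/* Assumed helper state: true when the hardware can isolate MSIs. */
		if (!example_intr_remapping_enabled) {
			*groupid = 0;		/* one group for the whole system */
			return 0;
		}
		/* Otherwise fall back to the normal per-requester-ID grouping. */
		return example_pci_device_group(dev, groupid);
	}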
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
[not found] <20111103195452.21259.93021.stgit@bling.home>
` (5 preceding siblings ...)
2011-11-15 6:34 ` David Gibson
@ 2011-11-29 1:52 ` Alexey Kardashevskiy
2011-11-29 2:01 ` Alexey Kardashevskiy
2011-11-29 3:46 ` Alex Williamson
6 siblings, 2 replies; 62+ messages in thread
From: Alexey Kardashevskiy @ 2011-11-29 1:52 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, pmac, dwg, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
Hi!
I tried (successfully) to run it on POWER and while doing that I found some issues. I'll try to
explain them in separate mails.
IOMMU domain setup. On POWER, the Linux drivers capable of DMA transfer want to know
a DMA window, i.e. its start and length in the PHB address space. This comes from
hardware. On x86 (correct me if I am wrong), every device driver in the guest allocates
memory from the same pool. On POWER, device drivers get a DMA window and allocate pages
for DMA within this window. In the case of VFIO, that means that QEMU has to
preallocate this DMA window before running a guest, pass it to the guest (via the
device tree), and then the guest tells the host which pages are taken/released by
calling the map/unmap callbacks of iommu_ops. Deallocation is done in the device detach
callback as I did not want to add more ioctls.
So, there are two patches:
- a new VFIO_IOMMU_SETUP ioctl is introduced which allocates a DMA window via the IOMMU API on
POWER.
BTW, do we need an additional capability bit for it?
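A minimal sketch of the intended userspace flow (iommu_fd and program_guest_dma_window are made-up names; the vfio_setup layout is the one defined in the kernel patch below):
	struct vfio_setup setup = { .requested_size = 64 << 20 };	/* ask for a 64MB window */

	if (ioctl(iommu_fd, VFIO_IOMMU_SETUP, &setup))
		return -1;		/* window allocation failed or is unsupported */

	/* Only DMA addresses inside the granted window are usable, so pass
	 * the window to the guest via the device tree. */
	program_guest_dma_window(setup.start_address, setup.allocated_size);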
KERNEL PATCH:
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 10615ad..a882e08 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned int *groupid)
return -ENODEV;
}
EXPORT_SYMBOL_GPL(iommu_device_group);
+
+int iommu_setup(struct iommu_domain *domain,
+ size_t requested_size, size_t *allocated_size,
+ phys_addr_t *start_address)
+{
+ return domain->ops->setup(domain, requested_size, allocated_size,
+ start_address);
+}
+EXPORT_SYMBOL_GPL(iommu_setup);
diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
index 029dae3..57fb70d 100644
--- a/drivers/vfio/vfio_iommu.c
+++ b/drivers/vfio/vfio_iommu.c
@@ -507,6 +507,23 @@ static long vfio_iommu_unl_ioctl(struct file *filep,
if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
ret = -EFAULT;
+
+ } else if (cmd == VFIO_IOMMU_SETUP) {
+ struct vfio_setup setup;
+ size_t allocated_size = 0;
+ phys_addr_t start_address = 0;
+
+ if (copy_from_user(&setup, (void __user *)arg, sizeof setup))
+ return -EFAULT;
+
+ printk("udomain %p, priv=%p\n", iommu->domain, iommu->domain->priv);
+ ret = iommu_setup(iommu->domain, setup.requested_size,
+ &allocated_size, &start_address);
+ setup.allocated_size = allocated_size;
+ setup.start_address = start_address;
+
+ if (!ret && copy_to_user((void __user *)arg, &setup, sizeof setup))
+ ret = -EFAULT;
}
return ret;
}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 93617e7..355cf8b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -45,6 +45,7 @@ struct iommu_domain {
#define IOMMU_CAP_CACHE_COHERENCY 0x1
#define IOMMU_CAP_INTR_REMAP 0x2 /* isolates device intrs */
+#define IOMMU_CAP_SETUP_REQUIRED 0x3 /* requires setup to be called */
#ifdef CONFIG_IOMMU_API
@@ -62,6 +63,9 @@ struct iommu_ops {
int (*domain_has_cap)(struct iommu_domain *domain,
unsigned long cap);
int (*device_group)(struct device *dev, unsigned int *groupid);
+ int (*setup)(struct iommu_domain *domain,
+ size_t requested_size, size_t *allocated_size,
+ phys_addr_t *start_address);
};
extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
@@ -80,6 +84,9 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
unsigned long iova);
extern int iommu_domain_has_cap(struct iommu_domain *domain,
unsigned long cap);
+extern int iommu_setup(struct iommu_domain *domain,
+ size_t requested_size, size_t *allocated_size,
+ phys_addr_t *start_address);
extern void iommu_set_fault_handler(struct iommu_domain *domain,
iommu_fault_handler_t handler);
extern int iommu_device_group(struct device *dev, unsigned int *groupid);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 971e3b1..5e0ee75 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -26,6 +26,7 @@
* Author: Michael S. Tsirkin <mst@redhat.com>
*/
#include <linux/types.h>
+#include <linux/ioctl.h>
#ifndef VFIO_H
#define VFIO_H
@@ -172,4 +173,13 @@ enum {
VFIO_PCI_NUM_IRQS
};
+/* Setup domain */
+#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
+
+struct vfio_setup {
+ __u64 requested_size;
+ __u64 allocated_size;
+ __u64 start_address;
+};
+
#endif /* VFIO_H */
=== end ===
QEMU PATCH:
diff --git a/hw/linux-vfio.h b/hw/linux-vfio.h
index ac48d85..a2c719f 100644
--- a/hw/linux-vfio.h
+++ b/hw/linux-vfio.h
@@ -172,4 +172,13 @@ enum {
VFIO_PCI_NUM_IRQS
};
+/* Setup domain */
+#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
+
+struct vfio_setup {
+ __u64 requested_size;
+ __u64 allocated_size;
+ __u64 start_address;
+};
+
#endif /* VFIO_H */
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 1c97c35..b438bbe 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -1501,6 +1503,17 @@ static int vfio_initfn(struct PCIDevice *pdev)
if (vfio_map_resources(vdev))
goto out_disable_msi;
+ struct vfio_setup setup = { 1 << 26, 0, 0 };
+ if ((ret = ioctl(vdev->group->iommu->fd, VFIO_IOMMU_SETUP, &setup))) {
+ return ret;
+ }
+ printf("SETUP: requested %lluMB, allocated %lluMB at %llx\n",
+ (unsigned long long)setup.requested_size,
+ (unsigned long long)setup.allocated_size,
+ (unsigned long long)setup.start_address);
+ vdev->start_address = setup.start_address;
+ vdev->window_size = setup.allocated_size;
+
if (vfio_enable_intx(vdev))
goto out_unmap_resources;
diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
index 96b09bb..6b7ab6f 100644
--- a/hw/vfio_pci.h
+++ b/hw/vfio_pci.h
@@ -79,6 +79,10 @@ typedef struct VFIODevice {
bool msix;
uint8_t msix_bar;
uint16_t msix_entries;
+#ifdef TARGET_PPC
+ uint64_t start_address;
+ uint32_t window_size;
+#endif
} VFIODevice;
typedef struct VFIOGroup {
=== end ===
- changed the __vfio_close_iommu function to unmap everything first and detach devices afterwards,
as the actual deallocation happens in the device detach callback of the IOMMU ops.
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 6169356..f78f411 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -28,6 +28,7 @@
#include <linux/uaccess.h>
#include <linux/vfio.h>
#include <linux/wait.h>
+#include <linux/pci.h>
#include "vfio_private.h"
@@ -242,6 +243,13 @@ static void __vfio_close_iommu(struct vfio_iommu *iommu)
if (!iommu->domain)
return;
+ /*
+ * On POWER, device detaching (which is done by __vfio_iommu_detach_group)
+ * should happen after all pages are unmapped, because
+ * the only way to do the actual iommu_unmap_page is a device detach callback
+ */
+ vfio_iommu_unmapall(iommu);
+
list_for_each(pos, &iommu->group_list) {
struct vfio_group *group;
group = list_entry(pos, struct vfio_group, iommu_next);
@@ -249,7 +257,7 @@ static void __vfio_close_iommu(struct vfio_iommu *iommu)
__vfio_iommu_detach_group(iommu, group);
}
- vfio_iommu_unmapall(iommu);
+ /* vfio_iommu_unmapall(iommu); */
iommu_domain_free(iommu->domain);
iommu->domain = NULL;
On 04/11/11 07:12, Alex Williamson wrote:
> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects. See Documentation/vfio.txt included in
> this patch for user and kernel API description.
>
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the drvier version 0.2. It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
> Fingers crossed, this is the last RFC for VFIO, but we need
> the iommu group support before this can go upstream
> (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> hoping this helps push that along.
>
> Since the last posting, this version completely modularizes
> the device backends and better defines the APIs between the
> core VFIO code and the device backends. I expect that we
> might also adopt a modular IOMMU interface as iommu_ops learns
> about different types of hardware. Also many, many cleanups.
> Check the complete git history for details:
>
> git://github.com/awilliam/linux-vfio.git vfio-ng
>
> (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
>
> This version, along with the supporting VFIO PCI backend can
> be found here:
>
> git://github.com/awilliam/linux-vfio.git vfio-next-20111103
>
> I've held off on implementing a kernel->user signaling
> mechanism for now since the previous netlink version produced
> too many gag reflexes. It's easy enough to set a bit in the
> group flags too indicate such support in the future, so I
> think we can move ahead without it.
>
> Appreciate any feedback or suggestions. Thanks,
>
> Alex
>
--
Alexey Kardashevskiy
IBM OzLabs, LTC Team
e-mail: aik@au1.ibm.com
notes: Alexey Kardashevskiy/Australia/IBM
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 1:52 ` Alexey Kardashevskiy
@ 2011-11-29 2:01 ` Alexey Kardashevskiy
2011-11-29 2:11 ` Alexey Kardashevskiy
2011-11-29 3:54 ` Alex Williamson
2011-11-29 3:46 ` Alex Williamson
1 sibling, 2 replies; 62+ messages in thread
From: Alexey Kardashevskiy @ 2011-11-29 2:01 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, pmac, dwg, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
Hi all,
Another problem I hit on POWER - MSI interrupt allocation. The existing VFIO does not expect a PHB
to support fewer interrupts than a device might request. In my case, the PHB's limit is 8 interrupts
while my test card (10Gb ethernet CXGB3) wants 9. Below are the patches to demonstrate the idea.
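The trick the patch relies on (a sketch, assuming the pci_enable_msix() semantics of this era; pdev and entries stand for the usual struct pci_dev and msix_entry array) is the return convention: <0 on error, 0 on success, and >0 meaning "only this many vectors are available", so the caller can retry with the smaller count:
	int nvec = 9;				/* what the device asked for */
	int ret = pci_enable_msix(pdev, entries, nvec);
	if (ret > 0)				/* e.g. the PHB can only do 8 */
		ret = pci_enable_msix(pdev, entries, ret);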
KERNEL patch:
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 7d45c6b..d44b9bf 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -458,17 +458,32 @@ int vfio_pci_setup_msix(struct vfio_pci_device *vdev, int nvec, int __user *inta
vdev->msix[i].entry = i;
vdev->ev_msix[i] = ctx;
}
- if (!ret)
+ if (!ret) {
ret = pci_enable_msix(pdev, vdev->msix, nvec);
+ /*
+ The kernel is unable to allocate requested number of IRQs
+ and returned the available number.
+ */
+ if (0 < ret) {
+ ret = pci_enable_msix(pdev, vdev->msix, ret);
+ }
+ }
vdev->msix_nvec = 0;
- for (i = 0; i < nvec && !ret; i++) {
- ret = request_irq(vdev->msix[i].vector, msihandler, 0,
- "vfio", vdev->ev_msix[i]);
- if (ret)
- break;
- vdev->msix_nvec = i+1;
+ if (0 == ret) {
+ vdev->msix_nvec = 0;
+ ret = 0;
+ for (i = 0; i < nvec && !ret; i++) {
+ ret = request_irq(vdev->msix[i].vector, msihandler, 0,
+ "vfio", vdev->ev_msix[i]);
+ if (ret)
+ break;
+ vdev->msix_nvec = i+1;
+ }
+ if ((0 == vdev->msix_nvec) && (0 != ret))
+ vfio_pci_drop_msix(vdev);
+ else
+ ret = vdev->msix_nvec;
}
- if (ret)
- vfio_pci_drop_msix(vdev);
+
return ret;
}
=== end ===
QEMU patch:
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 020961a..980eec7 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -341,7 +341,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
}
}
- if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds)) {
+ ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds);
+ if (0 > ret) {
fprintf(stderr, "vfio: Error: Failed to setup MSI/X fds %s\n",
strerror(errno));
for (i = 0; i < vdev->nr_vectors; i++) {
@@ -355,6 +356,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
qemu_free(vdev->msi_vectors);
vdev->nr_vectors = 0;
return;
+ } else if (0 < ret) {
+ vdev->nr_vectors = ret;
}
vdev->interrupt = msix ? INT_MSIX : INT_MSI;
=== end ===
On 29/11/11 12:52, Alexey Kardashevskiy wrote:
> Hi!
>
> I tried (successfully) to run it on POWER and while doing that I found some issues. I'll try to
> explain them in separate mails.
>
>
>
> On 04/11/11 07:12, Alex Williamson wrote:
>> VFIO provides a secure, IOMMU based interface for user space
>> drivers, including device assignment to virtual machines.
>> This provides the base management of IOMMU groups, devices,
>> and IOMMU objects. See Documentation/vfio.txt included in
>> this patch for user and kernel API description.
>>
>> Note, this implements the new API discussed at KVM Forum
>> 2011, as represented by the drvier version 0.2. It's hoped
>> that this provides a modular enough interface to support PCI
>> and non-PCI userspace drivers across various architectures
>> and IOMMU implementations.
>>
>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> ---
>>
>> Fingers crossed, this is the last RFC for VFIO, but we need
>> the iommu group support before this can go upstream
>> (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
>> hoping this helps push that along.
>>
>> Since the last posting, this version completely modularizes
>> the device backends and better defines the APIs between the
>> core VFIO code and the device backends. I expect that we
>> might also adopt a modular IOMMU interface as iommu_ops learns
>> about different types of hardware. Also many, many cleanups.
>> Check the complete git history for details:
>>
>> git://github.com/awilliam/linux-vfio.git vfio-ng
>>
>> (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
>>
>> This version, along with the supporting VFIO PCI backend can
>> be found here:
>>
>> git://github.com/awilliam/linux-vfio.git vfio-next-20111103
>>
>> I've held off on implementing a kernel->user signaling
>> mechanism for now since the previous netlink version produced
>> too many gag reflexes. It's easy enough to set a bit in the
>> group flags too indicate such support in the future, so I
>> think we can move ahead without it.
>>
>> Appreciate any feedback or suggestions. Thanks,
>>
>> Alex
>>
>
>
--
Alexey Kardashevskiy
IBM OzLabs, LTC Team
e-mail: aik@au1.ibm.com
notes: Alexey Kardashevskiy/Australia/IBM
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 2:01 ` Alexey Kardashevskiy
@ 2011-11-29 2:11 ` Alexey Kardashevskiy
2011-11-29 3:54 ` Alex Williamson
1 sibling, 0 replies; 62+ messages in thread
From: Alexey Kardashevskiy @ 2011-11-29 2:11 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, pmac, dwg, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
Hi all again,
It was actually the very first problem - endianness :-)
I am still not sure what format is better for the cached config space or whether we should cache it all.
Also, as Benh already mentioned, vfio_virt_init reads the whole config space into a cache using pci_read_config_dword,
which some devices may not like as they can distinguish the length of PCI
transactions.
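In other words (a sketch of the idea, roughly what the kernel patch below achieves with its fix-up loop): pci_read_config_dword() returns host-endian data, while the cached copy is later handed out as if it were raw little-endian PCI config space, so on big-endian POWER the cache needs one explicit byte swap when it is filled:
	u32 val;
	pci_read_config_dword(pdev, pos, &val);			/* val is CPU-endian */
	*(__le32 *)&vdev->vconfig[pos] = cpu_to_le32(val);	/* store as PCI little-endian */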
KERNEL patch:
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index b3bab99..9d563b4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -757,6 +757,16 @@ static int vfio_virt_init(struct vfio_pci_device *vdev)
vdev->rbar[5] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_5];
vdev->rbar[6] = *(u32 *)&vdev->vconfig[PCI_ROM_ADDRESS];
+ /*
+ * As pci_read_config_XXXX returns data in native format,
+ * and the cached copy is used in assumption that it is
+ * native PCI format, fix endianness in the cached copy.
+ */
+ lp = (u32 *)vdev->vconfig;
+ for (i = 0; i < pdev->cfg_size/sizeof(u32); i++, lp++) {
+ *lp = cpu_to_le32(*lp);
+ }
+
/* for sr-iov devices */
vdev->vconfig[PCI_VENDOR_ID] = pdev->vendor & 0xFF;
vdev->vconfig[PCI_VENDOR_ID+1] = pdev->vendor >> 8;
@@ -807,18 +817,18 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
else
mask = 0;
lp = (u32 *)(vdev->vconfig + PCI_BASE_ADDRESS_0 + 4*bar);
- *lp &= (u32)mask;
+ *lp &= cpu_to_le32((u32)mask);
if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
- *lp |= PCI_BASE_ADDRESS_SPACE_IO;
+ *lp |= cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
- *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
+ *lp |= cpu_to_le32(PCI_BASE_ADDRESS_SPACE_MEMORY);
if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
- *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+ *lp |= cpu_to_le32(PCI_BASE_ADDRESS_MEM_PREFETCH);
if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
- *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+ *lp |= cpu_to_le32(PCI_BASE_ADDRESS_MEM_TYPE_64);
lp++;
- *lp &= (u32)(mask >> 32);
+ *lp &= cpu_to_le32((u32)(mask >> 32));
bar++;
}
}
@@ -830,7 +840,7 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
} else
mask = 0;
lp = (u32 *)(vdev->vconfig + PCI_ROM_ADDRESS);
- *lp &= (u32)mask;
+ *lp &= cpu_to_le32((u32)mask);
vdev->bardirty = 0;
}
=== end ===
QEMU patch:
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 980eec7..1c97c35 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -405,6 +405,8 @@ static void vfio_resource_write(void *opaque, target_phys_addr_t addr,
{
PCIResource *res = opaque;
+ fprintf(stderr, "change endianness????\n");
+
if (pwrite(res->fd, &data, size, res->offset + addr) != size) {
fprintf(stderr, "%s(,0x%"PRIx64", 0x%"PRIx64", %d) failed: %s\n",
__FUNCTION__, addr, data, size, strerror(errno));
@@ -429,6 +431,9 @@ static uint64_t vfio_resource_read(void *opaque,
DPRINTF("%s(BAR%d+0x%"PRIx64", %d) = 0x%"PRIx64"\n",
__FUNCTION__, res->bar, addr, size, data);
+ data = le32_to_cpu(data);
+ DPRINTF("%s(BAR%d+0x%"PRIx64", %d) = 0x%"PRIx64" --- CPU\n",
+ __FUNCTION__, res->bar, addr, size, data);
return data;
}
@@ -454,13 +459,25 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
val = pci_default_read_config(pdev, addr, len);
} else {
- if (pread(vdev->fd, &val, len, vdev->config_offset + addr) != len) {
+ u8 buf[4] = {0};
+ if (pread(vdev->fd, buf, len, vdev->config_offset + addr) != len) {
fprintf(stderr, "%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %s\n",
__FUNCTION__, vdev->host.seg, vdev->host.bus,
vdev->host.dev, vdev->host.func, addr, len,
strerror(errno));
return -1;
}
+ switch (len) {
+ case 1: val = buf[0]; break;
+ case 2: val = le16_to_cpupu((uint16_t*)buf); break;
+ case 4: val = le32_to_cpupu((uint32_t*)buf); break;
+ default:
+ fprintf(stderr, "%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %s\n",
+ __FUNCTION__, vdev->host.seg, vdev->host.bus,
+ vdev->host.dev, vdev->host.func, addr, len,
+ strerror(errno));
+ break;
+ }
}
DPRINTF("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) %x\n", __FUNCTION__,
vdev->host.seg, vdev->host.bus, vdev->host.dev,
@@ -477,8 +494,20 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
vdev->host.seg, vdev->host.bus, vdev->host.dev,
vdev->host.func, addr, val, len);
+ u8 buf[4] = {0};
+ switch (len) {
+ case 1: buf[0] = val & 0xFF; break;
+ case 2: cpu_to_le16wu((uint16_t*)buf, val); break;
+ case 4: cpu_to_le32wu((uint32_t*)buf, val); break;
+ default:
+ fprintf(stderr, "%s(%04x:%02x:%02x.%x, 0x%x, 0x%x, 0x%x) failed: %s\n",
+ __FUNCTION__, vdev->host.seg, vdev->host.bus, vdev->host.dev,
+ vdev->host.func, addr, val, len, strerror(errno));
+ return;
+ }
+
/* Write everything to VFIO, let it filter out what we can't write */
- if (pwrite(vdev->fd, &val, len, vdev->config_offset + addr) != len) {
+ if (pwrite(vdev->fd, buf, len, vdev->config_offset + addr) != len) {
fprintf(stderr, "%s(%04x:%02x:%02x.%x, 0x%x, 0x%x, 0x%x) failed: %s\n",
__FUNCTION__, vdev->host.seg, vdev->host.bus, vdev->host.dev,
vdev->host.func, addr, val, len, strerror(errno));
@@ -675,6 +704,7 @@ static int vfio_setup_msi(VFIODevice *vdev)
vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
return -1;
}
+ ctrl = le16_to_cpu(ctrl);
msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
=== end ===
On 29/11/11 13:01, Alexey Kardashevskiy wrote:
> Hi all,
>
> Another problem I hit on POWER - MSI interrupt allocation. The existing VFIO does not expect a PHB
> to support fewer interrupts than a device might request. In my case, the PHB's limit is 8 interrupts
> while my test card (10Gb ethernet CXGB3) wants 9. Below are the patches to demonstrate the idea.
>
>
>
>
>
> On 29/11/11 12:52, Alexey Kardashevskiy wrote:
>> Hi!
>>
>> I tried (successfully) to run it on POWER and while doing that I found some issues. I'll try to
>> explain them in separate mails.
>>
>>
>>
>> On 04/11/11 07:12, Alex Williamson wrote:
>>> VFIO provides a secure, IOMMU based interface for user space
>>> drivers, including device assignment to virtual machines.
>>> This provides the base management of IOMMU groups, devices,
>>> and IOMMU objects. See Documentation/vfio.txt included in
>>> this patch for user and kernel API description.
>>>
>>> Note, this implements the new API discussed at KVM Forum
>>> 2011, as represented by the drvier version 0.2. It's hoped
>>> that this provides a modular enough interface to support PCI
>>> and non-PCI userspace drivers across various architectures
>>> and IOMMU implementations.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>
>>> Fingers crossed, this is the last RFC for VFIO, but we need
>>> the iommu group support before this can go upstream
>>> (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
>>> hoping this helps push that along.
>>>
>>> Since the last posting, this version completely modularizes
>>> the device backends and better defines the APIs between the
>>> core VFIO code and the device backends. I expect that we
>>> might also adopt a modular IOMMU interface as iommu_ops learns
>>> about different types of hardware. Also many, many cleanups.
>>> Check the complete git history for details:
>>>
>>> git://github.com/awilliam/linux-vfio.git vfio-ng
>>>
>>> (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
>>>
>>> This version, along with the supporting VFIO PCI backend can
>>> be found here:
>>>
>>> git://github.com/awilliam/linux-vfio.git vfio-next-20111103
>>>
>>> I've held off on implementing a kernel->user signaling
>>> mechanism for now since the previous netlink version produced
>>> too many gag reflexes. It's easy enough to set a bit in the
>>> group flags too indicate such support in the future, so I
>>> think we can move ahead without it.
>>>
>>> Appreciate any feedback or suggestions. Thanks,
>>>
>>> Alex
>>>
>>
>>
>
>
--
Alexey Kardashevskiy
IBM OzLabs, LTC Team
e-mail: aik@au1.ibm.com
notes: Alexey Kardashevskiy/Australia/IBM
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 2:01 ` Alexey Kardashevskiy
2011-11-29 2:11 ` Alexey Kardashevskiy
@ 2011-11-29 3:54 ` Alex Williamson
2011-11-29 19:26 ` Alex Williamson
1 sibling, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-29 3:54 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: chrisw, pmac, dwg, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Tue, 2011-11-29 at 13:01 +1100, Alexey Kardashevskiy wrote:
> Hi all,
>
> Another problem I hit on POWER - MSI interrupt allocation. The existing VFIO does not expect a PHB
> to support fewer interrupts than a device might request. In my case, the PHB's limit is 8 interrupts
> while my test card (10Gb ethernet CXGB3) wants 9. Below are the patches to demonstrate the idea.
Seems reasonable. I assume we'd need similar for vfio_pci_setup_msi,
though I haven't seen anything use more than a single MSI interrupt.
Thanks,
Alex
> KERNEL patch:
>
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 7d45c6b..d44b9bf 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -458,17 +458,32 @@ int vfio_pci_setup_msix(struct vfio_pci_device *vdev, int nvec, int __user *inta
> vdev->msix[i].entry = i;
> vdev->ev_msix[i] = ctx;
> }
> - if (!ret)
> + if (!ret) {
> ret = pci_enable_msix(pdev, vdev->msix, nvec);
> + /*
> + The kernel is unable to allocate requested number of IRQs
> + and returned the available number.
> + */
> + if (0 < ret) {
> + ret = pci_enable_msix(pdev, vdev->msix, ret);
> + }
> + }
> vdev->msix_nvec = 0;
> - for (i = 0; i < nvec && !ret; i++) {
> - ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> - "vfio", vdev->ev_msix[i]);
> - if (ret)
> - break;
> - vdev->msix_nvec = i+1;
> + if (0 == ret) {
> + vdev->msix_nvec = 0;
> + ret = 0;
> + for (i = 0; i < nvec && !ret; i++) {
> + ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> + "vfio", vdev->ev_msix[i]);
> + if (ret)
> + break;
> + vdev->msix_nvec = i+1;
> + }
> + if ((0 == vdev->msix_nvec) && (0 != ret))
> + vfio_pci_drop_msix(vdev);
> + else
> + ret = vdev->msix_nvec;
> }
> - if (ret)
> - vfio_pci_drop_msix(vdev);
> +
> return ret;
> }
>
> === end ===
>
>
> QEMU patch:
>
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index 020961a..980eec7 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -341,7 +341,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
> }
> }
>
> - if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds)) {
> + ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds);
> + if (0 > ret) {
> fprintf(stderr, "vfio: Error: Failed to setup MSI/X fds %s\n",
> strerror(errno));
> for (i = 0; i < vdev->nr_vectors; i++) {
> @@ -355,6 +356,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
> qemu_free(vdev->msi_vectors);
> vdev->nr_vectors = 0;
> return;
> + } else if (0 < ret) {
> + vdev->nr_vectors = ret;
> }
>
> vdev->interrupt = msix ? INT_MSIX : INT_MSI;
>
>
> === end ===
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 3:54 ` Alex Williamson
@ 2011-11-29 19:26 ` Alex Williamson
2011-11-29 23:20 ` [Qemu-devel] " Stuart Yoder
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-29 19:26 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: chrisw, pmac, dwg, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Mon, 2011-11-28 at 20:54 -0700, Alex Williamson wrote:
> On Tue, 2011-11-29 at 13:01 +1100, Alexey Kardashevskiy wrote:
> > Hi all,
> >
> > Another problem I hit on POWER - MSI interrupt allocation. The existing VFIO does not expect a PHB
> > to support fewer interrupts than a device might request. In my case, the PHB's limit is 8 interrupts
> > while my test card (10Gb ethernet CXGB3) wants 9. Below are the patches to demonstrate the idea.
>
> Seems reasonable. I assume we'd need similar for vfio_pci_setup_msi,
> though I haven't seen anything use more than a single MSI interrupt.
> Thanks,
Hmm, I'm thinking maybe we should reflect the pci_enable_msix() behavior
directly and let the caller decide if they want to make do with fewer
MSI vectors. So the ioctl will return <0: error, 0:success, >0: number
of MSIs we think we can setup, without actually setting them. Sound
good?
BTW, github now has updated trees:
git://github.com/awilliam/linux-vfio.git vfio-next-20111129
git://github.com/awilliam/qemu-vfio.git vfio-ng
Thanks,
Alex
> > KERNEL patch:
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> > index 7d45c6b..d44b9bf 100644
> > --- a/drivers/vfio/pci/vfio_pci_intrs.c
> > +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> > @@ -458,17 +458,32 @@ int vfio_pci_setup_msix(struct vfio_pci_device *vdev, int nvec, int __user *inta
> > vdev->msix[i].entry = i;
> > vdev->ev_msix[i] = ctx;
> > }
> > - if (!ret)
> > + if (!ret) {
> > ret = pci_enable_msix(pdev, vdev->msix, nvec);
> > + /*
> > + The kernel is unable to allocate requested number of IRQs
> > + and returned the available number.
> > + */
> > + if (0 < ret) {
> > + ret = pci_enable_msix(pdev, vdev->msix, ret);
> > + }
> > + }
> > vdev->msix_nvec = 0;
> > - for (i = 0; i < nvec && !ret; i++) {
> > - ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> > - "vfio", vdev->ev_msix[i]);
> > - if (ret)
> > - break;
> > - vdev->msix_nvec = i+1;
> > + if (0 == ret) {
> > + vdev->msix_nvec = 0;
> > + ret = 0;
> > + for (i = 0; i < nvec && !ret; i++) {
> > + ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> > + "vfio", vdev->ev_msix[i]);
> > + if (ret)
> > + break;
> > + vdev->msix_nvec = i+1;
> > + }
> > + if ((0 == vdev->msix_nvec) && (0 != ret))
> > + vfio_pci_drop_msix(vdev);
> > + else
> > + ret = vdev->msix_nvec;
> > }
> > - if (ret)
> > - vfio_pci_drop_msix(vdev);
> > +
> > return ret;
> > }
> >
> > === end ===
> >
> >
> > QEMU patch:
> >
> > diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> > index 020961a..980eec7 100644
> > --- a/hw/vfio_pci.c
> > +++ b/hw/vfio_pci.c
> > @@ -341,7 +341,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
> > }
> > }
> >
> > - if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds)) {
> > + ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds);
> > + if (0 > ret) {
> > fprintf(stderr, "vfio: Error: Failed to setup MSI/X fds %s\n",
> > strerror(errno));
> > for (i = 0; i < vdev->nr_vectors; i++) {
> > @@ -355,6 +356,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
> > qemu_free(vdev->msi_vectors);
> > vdev->nr_vectors = 0;
> > return;
> > + } else if (0 < ret) {
> > + vdev->nr_vectors = ret;
> > }
> >
> > vdev->interrupt = msix ? INT_MSIX : INT_MSI;
> >
> >
> > === end ===
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 19:26 ` Alex Williamson
@ 2011-11-29 23:20 ` Stuart Yoder
2011-11-29 23:44 ` Alex Williamson
0 siblings, 1 reply; 62+ messages in thread
From: Stuart Yoder @ 2011-11-29 23:20 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexey Kardashevskiy, aafabbri, kvm, pmac, qemu-devel,
joerg.roedel, konrad.wilk, agraf, dwg, chrisw, B08248, iommu, avi,
linux-pci, B07421, benve
>
> BTW, github now has updated trees:
>
> git://github.com/awilliam/linux-vfio.git vfio-next-20111129
> git://github.com/awilliam/qemu-vfio.git vfio-ng
Hi Alex,
Have been looking at vfio a bit. A few observations and things
we'll need to figure out as it relates to the Freescale iommu.
__vfio_dma_map() assumes that mappings are broken into
4KB pages. That will not be true for us. We normally will be mapping
much larger physically contiguous chunks for our guests. Guests will
get hugetlbfs backed memory with very large pages (e.g. 16MB,
64MB) or very large chunks allocated by some proprietary
means.
Also, mappings will have additional Freescale-specific attributes
that need to get passed through to dma_map somehow. For
example, the iommu can stash directly into a CPU's cache
and we have iommu mapping properties like the cache stash
target id and an operation mapping attribute.
How do you envision handling proprietary attributes
in struct vfio_dma_map?
Stuart
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 23:20 ` [Qemu-devel] " Stuart Yoder
@ 2011-11-29 23:44 ` Alex Williamson
2011-11-30 15:41 ` Stuart Yoder
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-29 23:44 UTC (permalink / raw)
To: Stuart Yoder
Cc: Alexey Kardashevskiy, aafabbri, kvm, pmac, qemu-devel,
joerg.roedel, konrad.wilk, agraf, dwg, chrisw, B08248, iommu, avi,
linux-pci, B07421, benve
On Tue, 2011-11-29 at 17:20 -0600, Stuart Yoder wrote:
> >
> > BTW, github now has updated trees:
> >
> > git://github.com/awilliam/linux-vfio.git vfio-next-20111129
> > git://github.com/awilliam/qemu-vfio.git vfio-ng
>
> Hi Alex,
>
> Have been looking at vfio a bit. A few observations and things
> we'll need to figure out as it relates to the Freescale iommu.
>
> __vfio_dma_map() assumes that mappings are broken into
> 4KB pages. That will not be true for us. We normally will be mapping
> much larger physically contiguous chunks for our guests. Guests will
> get hugetlbfs backed memory with very large pages (e.g. 16MB,
> 64MB) or very large chunks allocated by some proprietary
> means.
Hi Stuart,
I think practically everyone has commented on the 4k mappings ;) There
are a few problems around this. The first is that iommu drivers don't
necessarily support sub-region unmapping, so if we map 1GB and later
want to unmap 4k, we can't do it atomically. 4k gives us the most
flexibility for supporting fine granularities. Another problem is that
we're using get_user_pages to pin memory. It's been suggested that we
should use mlock for this, but I can't find anything that prevents a
user from later munlock'ing the memory and then getting access to memory
they shouldn't have. Those kind of limit us, but I don't see it being
an API problem for VFIO, just implementation.
> Also, mappings will have additional Freescale-specific attributes
> that need to get passed through to dma_map somehow. For
> example, the iommu can stash directly into a CPU's cache
> and we have iommu mapping properties like the cache stash
> target id and an operation mapping attribute.
>
> How do you envision handling proprietary attributes
> in struct vfio_dma_map?
Let me turn the question around: how do you plan to support proprietary
attributes in the IOMMU API? Is the user level the appropriate place to
specify them, or are they an intrinsic feature of the domain? We've
designed struct vfio_dma_map for extension, so depending on how many
bits you need, we can make a conduit using the flags directly or setting
a new flag to indicate presence of an arch specific attributes field.
Thanks,
Alex
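One possible shape of that conduit (illustrative only, not a defined ABI; the flag and the trailing field are invented here):
	/* A new flag announces an arch-specific attribute word appended to
	 * the existing struct vfio_dma_map. */
	#define VFIO_DMA_MAP_FLAG_ARCH_ATTR	(1 << 1)	/* hypothetical */

	struct vfio_dma_map_arch {
		struct vfio_dma_map	map;		/* the existing fields */
		__u64			arch_attr;	/* e.g. cache stash target id,
							   operation mapping attribute */
	};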
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 23:44 ` Alex Williamson
@ 2011-11-30 15:41 ` Stuart Yoder
2011-11-30 16:58 ` Alex Williamson
0 siblings, 1 reply; 62+ messages in thread
From: Stuart Yoder @ 2011-11-30 15:41 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexey Kardashevskiy, aafabbri, kvm, pmac, qemu-devel,
joerg.roedel, konrad.wilk, agraf, dwg, chrisw, B08248, iommu, avi,
linux-pci, B07421, benve
On Tue, Nov 29, 2011 at 5:44 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Tue, 2011-11-29 at 17:20 -0600, Stuart Yoder wrote:
>> >
>> > BTW, github now has updated trees:
>> >
>> > git://github.com/awilliam/linux-vfio.git vfio-next-20111129
>> > git://github.com/awilliam/qemu-vfio.git vfio-ng
>>
>> Hi Alex,
>>
>> Have been looking at vfio a bit. A few observations and things
>> we'll need to figure out as it relates to the Freescale iommu.
>>
>> __vfio_dma_map() assumes that mappings are broken into
>> 4KB pages. That will not be true for us. We normally will be mapping
>> much larger physically contiguous chunks for our guests. Guests will
>> get hugetlbfs backed memory with very large pages (e.g. 16MB,
>> 64MB) or very large chunks allocated by some proprietary
>> means.
>
> Hi Stuart,
>
> I think practically everyone has commented on the 4k mappings ;) There
> are a few problems around this. The first is that iommu drivers don't
> necessarily support sub-region unmapping, so if we map 1GB and later
> want to unmap 4k, we can't do it atomically. 4k gives us the most
> flexibility for supporting fine granularities. Another problem is that
> we're using get_user_pages to pin memory. It's been suggested that we
> should use mlock for this, but I can't find anything that prevents a
> user from later munlock'ing the memory and then getting access to memory
> they shouldn't have. Those kind of limit us, but I don't see it being
> an API problem for VFIO, just implementation.
Ok.
>> Also, mappings will have additional Freescale-specific attributes
>> that need to get passed through to dma_map somehow. For
>> example, the iommu can stash directly into a CPU's cache
>> and we have iommu mapping properties like the cache stash
>> target id and an operation mapping attribute.
>>
>> How do you envision handling proprietary attributes
>> in struct vfio_dma_map?
>
> Let me turn the question around, how do you plan to support proprietary
> attributes in the IOMMU API? Is the user level the appropriate place to
> specify them, or are they an intrinsic feature of the domain? We've
> designed struct vfio_dma_map for extension, so depending on how many
> bits you need, we can make a conduit using the flags directly or setting
> a new flag to indicate presence of an arch specific attributes field.
The attributes are not intrinsic features of the domain. User space will
need to set them. But in thinking about it a bit more I think the attributes
are more properties of the domain rather than a per map() operation
characteristic. I think a separate API might be appropriate. Define a
new set_domain_attrs() op in the iommu_ops. In user space, perhaps
a new vfio group API-- VFIO_GROUP_SET_ATTRS,
VFIO_GROUP_GET_ATTRS.
Stuart
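A rough sketch of what that could look like (names, layout and ioctl numbers invented here, purely illustrative):
	/* Kernel side: a new op next to map/unmap in struct iommu_ops. */
	struct iommu_domain_attrs {
		__u32	stash_target;	/* which CPU cache to stash into */
		__u32	op_mapping;	/* operation mapping attribute */
	};

	int (*set_domain_attrs)(struct iommu_domain *domain,
				struct iommu_domain_attrs *attrs);

	/* Userspace side: group-level ioctls carrying the same structure. */
	#define VFIO_GROUP_SET_ATTRS	_IOW(';', 151, struct iommu_domain_attrs)
	#define VFIO_GROUP_GET_ATTRS	_IOR(';', 152, struct iommu_domain_attrs)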
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-11-30 15:41 ` Stuart Yoder
@ 2011-11-30 16:58 ` Alex Williamson
2011-12-01 20:58 ` Stuart Yoder
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-30 16:58 UTC (permalink / raw)
To: Stuart Yoder
Cc: Alexey Kardashevskiy, aafabbri, kvm, pmac, qemu-devel,
joerg.roedel, konrad.wilk, agraf, dwg, chrisw, B08248, iommu, avi,
linux-pci, B07421, benve
On Wed, 2011-11-30 at 09:41 -0600, Stuart Yoder wrote:
> On Tue, Nov 29, 2011 at 5:44 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > On Tue, 2011-11-29 at 17:20 -0600, Stuart Yoder wrote:
> >> >
> >> > BTW, github now has updated trees:
> >> >
> >> > git://github.com/awilliam/linux-vfio.git vfio-next-20111129
> >> > git://github.com/awilliam/qemu-vfio.git vfio-ng
> >>
> >> Hi Alex,
> >>
> >> Have been looking at vfio a bit. A few observations and things
> >> we'll need to figure out as it relates to the Freescale iommu.
> >>
> >> __vfio_dma_map() assumes that mappings are broken into
> >> 4KB pages. That will not be true for us. We normally will be mapping
> >> much larger physically contiguous chunks for our guests. Guests will
> >> get hugetlbfs backed memory with very large pages (e.g. 16MB,
> >> 64MB) or very large chunks allocated by some proprietary
> >> means.
> >
> > Hi Stuart,
> >
> > I think practically everyone has commented on the 4k mappings ;) There
> > are a few problems around this. The first is that iommu drivers don't
> > necessarily support sub-region unmapping, so if we map 1GB and later
> > want to unmap 4k, we can't do it atomically. 4k gives us the most
> > flexibility for supporting fine granularities. Another problem is that
> > we're using get_user_pages to pin memory. It's been suggested that we
> > should use mlock for this, but I can't find anything that prevents a
> > user from later munlock'ing the memory and then getting access to memory
> > they shouldn't have. Those kind of limit us, but I don't see it being
> > an API problem for VFIO, just implementation.
>
> Ok.
>
> >> Also, mappings will have additional Freescale-specific attributes
> >> that need to get passed through to dma_map somehow. For
> >> example, the iommu can stash directly into a CPU's cache
> >> and we have iommu mapping properties like the cache stash
> >> target id and an operation mapping attribute.
> >>
> >> How do you envision handling proprietary attributes
> >> in struct vfio_dma_map?
> >
> > Let me turn the question around, how do you plan to support proprietary
> > attributes in the IOMMU API? Is the user level the appropriate place to
> > specify them, or are they an intrinsic feature of the domain? We've
> > designed struct vfio_dma_map for extension, so depending on how many
> > bits you need, we can make a conduit using the flags directly or setting
> > a new flag to indicate presence of an arch specific attributes field.
>
> The attributes are not intrinsic features of the domain. User space will
> need to set them. But in thinking about it a bit more I think the attributes
> are more properties of the domain rather than a per map() operation
> characteristic. I think a separate API might be appropriate. Define a
> new set_domain_attrs() op in the iommu_ops. In user space, perhaps
> a new vfio group API-- VFIO_GROUP_SET_ATTRS,
> VFIO_GROUP_GET_ATTRS.
In that case, you should definitely be following what Alexey is thinking
about with an iommu_setup IOMMU API callback. I think it's shaping up
to do:
x86:
- Report any IOVA range restrictions imposed by hw implementation
POWER:
- Request IOVA window size, report size and base
powerpc:
- Set domain attributes, probably report range as well.
Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-11-30 16:58 ` Alex Williamson
@ 2011-12-01 20:58 ` Stuart Yoder
2011-12-01 21:25 ` Alex Williamson
0 siblings, 1 reply; 62+ messages in thread
From: Stuart Yoder @ 2011-12-01 20:58 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexey Kardashevskiy, aafabbri, kvm, pmac, qemu-devel,
joerg.roedel, konrad.wilk, agraf, dwg, chrisw, B08248, iommu, avi,
linux-pci, B07421, benve
>> The attributes are not intrinsic features of the domain. User space will
>> need to set them. But in thinking about it a bit more I think the attributes
>> are more properties of the domain rather than a per map() operation
>> characteristic. I think a separate API might be appropriate. Define a
>> new set_domain_attrs() op in the iommu_ops. In user space, perhaps
>> a new vfio group API-- VFIO_GROUP_SET_ATTRS,
>> VFIO_GROUP_GET_ATTRS.
>
> In that case, you should definitely be following what Alexey is thinking
> about with an iommu_setup IOMMU API callback. I think it's shaping up
> to do:
>
> x86:
> - Report any IOVA range restrictions imposed by hw implementation
> POWER:
> - Request IOVA window size, report size and base
> powerpc:
> - Set domain attributes, probably report range as well.
One other mechanism we need as well is the ability to
enable/disable a domain.
For example-- suppose a device is assigned to a VM, the
device is in use when the VM is abruptly terminated. The
VM terminate would shut off DMA at the IOMMU, but now
the device is in an indeterminate state. Some devices
have no simple reset bit and getting the device back into
a sane state could be complicated-- something the hypervisor
doesn't want to do.
So now KVM restarts the VM, vfio init happens for the device
and the IOMMU for that device is re-configured,
etc, but we really can't re-enable DMA until the guest OS tells us
(via an hcall) that it is ready. The guest needs to get the
assigned device in a sane state before DMA is enabled.
Does this warrant a new domain enable/disable API, or should
we make this part of the setup API we are discussing
here?
Stuart
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-12-01 20:58 ` Stuart Yoder
@ 2011-12-01 21:25 ` Alex Williamson
2011-12-02 14:40 ` Stuart Yoder
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-12-01 21:25 UTC (permalink / raw)
To: Stuart Yoder
Cc: Alexey Kardashevskiy, aafabbri, kvm, pmac, qemu-devel,
joerg.roedel, konrad.wilk, agraf, dwg, chrisw, B08248, iommu, avi,
linux-pci, B07421, benve
On Thu, 2011-12-01 at 14:58 -0600, Stuart Yoder wrote:
> >> The attributes are not intrinsic features of the domain. User space will
> >> need to set them. But in thinking about it a bit more I think the attributes
> >> are more properties of the domain rather than a per map() operation
> >> characteristic. I think a separate API might be appropriate. Define a
> >> new set_domain_attrs() op in the iommu_ops. In user space, perhaps
> >> a new vfio group API-- VFIO_GROUP_SET_ATTRS,
> >> VFIO_GROUP_GET_ATTRS.
> >
> > In that case, you should definitely be following what Alexey is thinking
> > about with an iommu_setup IOMMU API callback. I think it's shaping up
> > to do:
> >
> > x86:
> > - Report any IOVA range restrictions imposed by hw implementation
> > POWER:
> > - Request IOVA window size, report size and base
> > powerpc:
> > - Set domain attributes, probably report range as well.
>
> One other mechanism we need as well is the ability to
> enable/disable a domain.
>
> For example-- suppose a device is assigned to a VM, the
> device is in use when the VM is abruptly terminated. The
> VM terminate would shut off DMA at the IOMMU, but now
> the device is in an indeterminate state. Some devices
> have no simple reset bit and getting the device back into
> a sane state could be complicated-- something the hypervisor
> doesn't want to do.
>
> So now KVM restarts the VM, vfio init happens for the device
> and the IOMMU for that device is re-configured,
> etc, but we really can't re-enable DMA until the guest OS tells us
> (via an hcall) that it is ready. The guest needs to get the
> assigned device in a sane state before DMA is enabled.
Giant red flag. We need to paravirtualize the guest? Not on x86. Some
devices are better for assignment than others. PCI devices are moving
towards supporting standard reset mechanisms.
> Does this warrant a new domain enable/disable API, or should
> we make this part of the setup API we are discussing
> here?
What's wrong with simply not adding any DMA mapping entries until you
think your guest is ready? Isn't that effectively the same thing?
Unmap ~= disable. If the IOMMU API had a mechanism to toggle the iommu
domain on and off, I wouldn't be opposed to adding an ioctl to do it,
but it really seems like just a shortcut vs map/unmap. Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-12-01 21:25 ` Alex Williamson
@ 2011-12-02 14:40 ` Stuart Yoder
2011-12-02 18:11 ` Bhushan Bharat-R65777
2011-12-02 18:21 ` Scott Wood
0 siblings, 2 replies; 62+ messages in thread
From: Stuart Yoder @ 2011-12-02 14:40 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexey Kardashevskiy, aafabbri, kvm, pmac, qemu-devel,
joerg.roedel, konrad.wilk, agraf, dwg, chrisw, B08248, iommu, avi,
linux-pci, B07421, benve
On Thu, Dec 1, 2011 at 3:25 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Thu, 2011-12-01 at 14:58 -0600, Stuart Yoder wrote:
>> One other mechanism we need as well is the ability to
>> enable/disable a domain.
>>
>> For example-- suppose a device is assigned to a VM, the
>> device is in use when the VM is abruptly terminated. The
>> VM terminate would shut off DMA at the IOMMU, but now
>> the device is in an indeterminate state. Some devices
>> have no simple reset bit and getting the device back into
>> a sane state could be complicated-- something the hypervisor
>> doesn't want to do.
>>
>> So now KVM restarts the VM, vfio init happens for the device
>> and the IOMMU for that device is re-configured,
>> etc, but we really can't re-enable DMA until the guest OS tells us
>> (via an hcall) that it is ready. The guest needs to get the
>> assigned device in a sane state before DMA is enabled.
>
> Giant red flag. We need to paravirtualize the guest? Not on x86.
It's the reality we have to deal with, but doing this would obviously
only apply to platforms that need it.
> Some
> devices are better for assignment than others. PCI devices are moving
> towards supporting standard reset mechanisms.
>
>> Does this warrant a new domain enable/disable API, or should
>> we make this part of the setup API we are discussing
>> here?
>
> What's wrong with simply not adding any DMA mapping entries until you
> think your guest is ready? Isn't that effectively the same thing?
> Unmap ~= disable. If the IOMMU API had a mechanism to toggle the iommu
> domain on and off, I wouldn't be opposed to adding an ioctl to do it,
> but it really seems like just a shortcut vs map/unmap. Thanks,
Yes, we could do something like that I guess.
Stuart
^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-12-02 14:40 ` Stuart Yoder
@ 2011-12-02 18:11 ` Bhushan Bharat-R65777
2011-12-02 18:27 ` Scott Wood
2011-12-02 18:21 ` Scott Wood
1 sibling, 1 reply; 62+ messages in thread
From: Bhushan Bharat-R65777 @ 2011-12-02 18:11 UTC (permalink / raw)
To: Stuart Yoder, Alex Williamson
Cc: Alexey Kardashevskiy, aafabbri@cisco.com, kvm@vger.kernel.org,
pmac@au1.ibm.com, qemu-devel@nongnu.org, joerg.roedel@amd.com,
konrad.wilk@oracle.com, agraf@suse.de, dwg@au1.ibm.com,
chrisw@sous-sol.org, Yoder Stuart-B08248,
iommu@lists.linux-foundation.org, avi@redhat.com,
linux-pci@vger.kernel.org, Wood Scott-B07421, benve@cisco.com
> -----Original Message-----
> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On
> Behalf Of Stuart Yoder
> Sent: Friday, December 02, 2011 8:11 PM
> To: Alex Williamson
> Cc: Alexey Kardashevskiy; aafabbri@cisco.com; kvm@vger.kernel.org;
> pmac@au1.ibm.com; qemu-devel@nongnu.org; joerg.roedel@amd.com;
> konrad.wilk@oracle.com; agraf@suse.de; dwg@au1.ibm.com; chrisw@sous-
> sol.org; Yoder Stuart-B08248; iommu@lists.linux-foundation.org;
> avi@redhat.com; linux-pci@vger.kernel.org; Wood Scott-B07421;
> benve@cisco.com
> Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
>
> On Thu, Dec 1, 2011 at 3:25 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > On Thu, 2011-12-01 at 14:58 -0600, Stuart Yoder wrote:
> >> One other mechanism we need as well is the ability to enable/disable
> >> a domain.
> >>
> >> For example-- suppose a device is assigned to a VM, the device is in
> >> use when the VM is abruptly terminated. The VM terminate would shut
> >> off DMA at the IOMMU, but now the device is in an indeterminate
> >> state. Some devices have no simple reset bit and getting the device
> >> back into a sane state could be complicated-- something the
> >> hypervisor doesn't want to do.
> >>
> >> So now KVM restarts the VM, vfio init happens for the device and the
> >> IOMMU for that device is re-configured, etc, but we really can't
> >> re-enable DMA until the guest OS tells us (via an hcall) that it is
> >> ready. The guest needs to get the assigned device in a sane state
> >> before DMA is enabled.
> >
> > Giant red flag. We need to paravirtualize the guest? Not on x86.
>
> It's the reality we have to deal with, but doing this would obviously
> only apply to platforms that need it.
>
> > Some
> > devices are better for assignment than others. PCI devices are moving
> > towards supporting standard reset mechanisms.
> >
> >> Does this warrant a new domain enable/disable API, or should we make
> >> this part of the setup API we are discussing here?
> >
> > What's wrong with simply not adding any DMA mapping entries until you
> > think your guest is ready? Isn't that effectively the same thing?
> > Unmap ~= disable. If the IOMMU API had a mechanism to toggle the
> > iommu domain on and off, I wouldn't be opposed to adding an ioctl to
> > do it, but it really seems like just a shortcut vs map/unmap. Thanks,
>
> Yes, we could do something like that I guess.
How do we determine whether the guest is ready or not? There can be multiple devices that get ready at different times.
Further, if the guest has given the device to its user-level process or to its own guest, shouldn't there be some mechanism for a guest to enable/disable on a per-device or per-group basis?
Thanks
-Bharat
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-12-02 18:11 ` Bhushan Bharat-R65777
@ 2011-12-02 18:27 ` Scott Wood
2011-12-02 18:35 ` Bhushan Bharat-R65777
2011-12-02 18:45 ` Bhushan Bharat-R65777
0 siblings, 2 replies; 62+ messages in thread
From: Scott Wood @ 2011-12-02 18:27 UTC (permalink / raw)
To: Bhushan Bharat-R65777
Cc: Stuart Yoder, Alex Williamson, Alexey Kardashevskiy,
aafabbri@cisco.com, kvm@vger.kernel.org, pmac@au1.ibm.com,
qemu-devel@nongnu.org, joerg.roedel@amd.com,
konrad.wilk@oracle.com, agraf@suse.de, dwg@au1.ibm.com,
chrisw@sous-sol.org, Yoder Stuart-B08248,
iommu@lists.linux-foundation.org, avi@redhat.com,
linux-pci@vger.kernel.org, Wood Scott-B07421, benve@cisco.com
On 12/02/2011 12:11 PM, Bhushan Bharat-R65777 wrote:
> How do we determine whether guest is ready or not? There can be multiple device get ready at different time.
The guest makes a hypercall with a device handle -- at least that's how
we do it in Topaz.
> Further if guest have given the device to it user level process or its guest. Should not there be some mechanism for a guest to enable/disable on per device or group?
Yes, the same mechanism can be used for that -- though in that case
we'll also want the ability for the guest to be able to control another
layer of mapping (which will get quite tricky with PAMU's limitations).
This isn't really VFIO's concern, though (unless you're talking about
the VFIO implementation in the guest).
-Scott
^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-12-02 18:27 ` Scott Wood
@ 2011-12-02 18:35 ` Bhushan Bharat-R65777
2011-12-02 18:45 ` Bhushan Bharat-R65777
1 sibling, 0 replies; 62+ messages in thread
From: Bhushan Bharat-R65777 @ 2011-12-02 18:35 UTC (permalink / raw)
To: Wood Scott-B07421
Cc: Stuart Yoder, Alex Williamson, Alexey Kardashevskiy,
aafabbri@cisco.com, kvm@vger.kernel.org, pmac@au1.ibm.com,
qemu-devel@nongnu.org, joerg.roedel@amd.com,
konrad.wilk@oracle.com, agraf@suse.de, dwg@au1.ibm.com,
chrisw@sous-sol.org, Yoder Stuart-B08248,
iommu@lists.linux-foundation.org, avi@redhat.com,
linux-pci@vger.kernel.org, benve@cisco.com
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Friday, December 02, 2011 11:57 PM
> To: Bhushan Bharat-R65777
> Cc: Stuart Yoder; Alex Williamson; Alexey Kardashevskiy;
> aafabbri@cisco.com; kvm@vger.kernel.org; pmac@au1.ibm.com; qemu-
> devel@nongnu.org; joerg.roedel@amd.com; konrad.wilk@oracle.com;
> agraf@suse.de; dwg@au1.ibm.com; chrisw@sous-sol.org; Yoder Stuart-B08248;
> iommu@lists.linux-foundation.org; avi@redhat.com; linux-
> pci@vger.kernel.org; Wood Scott-B07421; benve@cisco.com
> Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
>
> On 12/02/2011 12:11 PM, Bhushan Bharat-R65777 wrote:
> > How do we determine whether the guest is ready or not? There can be
> multiple devices that become ready at different times.
>
> The guest makes a hypercall with a device handle -- at least that's how
> we do it in Topaz.
Yes, that is fine from the hcall-with-device-handle perspective.
But I could not understand how this can be handled cleanly with the idea of enabling the IOMMU only when the guest is ready.
Thanks
-Bharat
>
> > Further, if the guest has given the device to its own user-level process or
> to a nested guest, should there not be some mechanism for the guest to
> enable/disable on a per-device or per-group basis?
>
> Yes, the same mechanism can be used for that -- though in that case we'll
> also want the ability for the guest to be able to control another layer
> of mapping (which will get quite tricky with PAMU's limitations).
> This isn't really VFIO's concern, though (unless you're talking about
> the VFIO implementation in the guest).
Maybe I am thinking too far ahead, but won't something like this be needed for nested virtualization?
Thanks
-Bharat
^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-12-02 18:27 ` Scott Wood
2011-12-02 18:35 ` Bhushan Bharat-R65777
@ 2011-12-02 18:45 ` Bhushan Bharat-R65777
2011-12-02 18:52 ` Scott Wood
1 sibling, 1 reply; 62+ messages in thread
From: Bhushan Bharat-R65777 @ 2011-12-02 18:45 UTC (permalink / raw)
To: Wood Scott-B07421
Cc: Stuart Yoder, Alex Williamson, Alexey Kardashevskiy,
aafabbri@cisco.com, kvm@vger.kernel.org, pmac@au1.ibm.com,
qemu-devel@nongnu.org, joerg.roedel@amd.com,
konrad.wilk@oracle.com, agraf@suse.de, dwg@au1.ibm.com,
chrisw@sous-sol.org, Yoder Stuart-B08248,
iommu@lists.linux-foundation.org, avi@redhat.com,
linux-pci@vger.kernel.org, benve@cisco.com
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Friday, December 02, 2011 11:57 PM
> To: Bhushan Bharat-R65777
> Cc: Stuart Yoder; Alex Williamson; Alexey Kardashevskiy;
> aafabbri@cisco.com; kvm@vger.kernel.org; pmac@au1.ibm.com; qemu-
> devel@nongnu.org; joerg.roedel@amd.com; konrad.wilk@oracle.com;
> agraf@suse.de; dwg@au1.ibm.com; chrisw@sous-sol.org; Yoder Stuart-B08248;
> iommu@lists.linux-foundation.org; avi@redhat.com; linux-
> pci@vger.kernel.org; Wood Scott-B07421; benve@cisco.com
> Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
>
> On 12/02/2011 12:11 PM, Bhushan Bharat-R65777 wrote:
> > How do we determine whether the guest is ready or not? There can be
> multiple devices that become ready at different times.
>
> The guest makes a hypercall with a device handle -- at least that's how
> we do it in Topaz.
>
> > Further, if the guest has given the device to its own user-level process or
> to a nested guest, should there not be some mechanism for the guest to
> enable/disable on a per-device or per-group basis?
>
> Yes, the same mechanism can be used for that -- though in that case we'll
> also want the ability for the guest to be able to control another layer
> of mapping (which will get quite tricky with PAMU's limitations).
> This isn't really VFIO's concern, though (unless you're talking about
> the VFIO implementation in the guest).
Scott, I am not sure there is any real use case where a device needs to be assigned beyond two levels (host + immediate guest) in nested virtualization.
But if one exists, would it not be better to virtualize the IOMMU (PAMU for Freescale)?
Thanks
-Bharat
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-12-02 18:45 ` Bhushan Bharat-R65777
@ 2011-12-02 18:52 ` Scott Wood
0 siblings, 0 replies; 62+ messages in thread
From: Scott Wood @ 2011-12-02 18:52 UTC (permalink / raw)
To: Bhushan Bharat-R65777
Cc: Wood Scott-B07421, Stuart Yoder, Alex Williamson,
Alexey Kardashevskiy, aafabbri@cisco.com, kvm@vger.kernel.org,
pmac@au1.ibm.com, qemu-devel@nongnu.org, joerg.roedel@amd.com,
konrad.wilk@oracle.com, agraf@suse.de, dwg@au1.ibm.com,
chrisw@sous-sol.org, Yoder Stuart-B08248,
iommu@lists.linux-foundation.org, avi@redhat.com,
linux-pci@vger.kernel.org, benve@cisco.com
On 12/02/2011 12:45 PM, Bhushan Bharat-R65777 wrote:
> Scott, I am not sure there is any real use case where a device needs to be assigned beyond two levels (host + immediate guest) in nested virtualization.
Userspace drivers in the guest is a more likely scenario than nested
virtualization, at least for us. Our hardware doesn't support nested
virtualization, so it would have to be some slow emulation-based
approach (worse than e500v2, since we don't have multiple PID registers).
> But if one exists, would it not be better to virtualize the IOMMU (PAMU for Freescale)?
We can't virtualize the PAMU in any sort of transparent manner. It's
not flexible enough to handle arbitrary mappings. The guest will need
to cooperate with the host to figure out what mappings it can do.
-Scott
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
2011-12-02 14:40 ` Stuart Yoder
2011-12-02 18:11 ` Bhushan Bharat-R65777
@ 2011-12-02 18:21 ` Scott Wood
1 sibling, 0 replies; 62+ messages in thread
From: Scott Wood @ 2011-12-02 18:21 UTC (permalink / raw)
To: Stuart Yoder
Cc: Alex Williamson, Alexey Kardashevskiy, aafabbri, kvm, pmac,
qemu-devel, joerg.roedel, konrad.wilk, agraf, dwg, chrisw, B08248,
iommu, avi, linux-pci, B07421, benve
On 12/02/2011 08:40 AM, Stuart Yoder wrote:
> On Thu, Dec 1, 2011 at 3:25 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
>> On Thu, 2011-12-01 at 14:58 -0600, Stuart Yoder wrote:
>>> One other mechanism we need as well is the ability to
>>> enable/disable a domain.
>>>
>>> For example-- suppose a device is assigned to a VM, the
>>> device is in use when the VM is abruptly terminated. The
>>> VM terminate would shut off DMA at the IOMMU, but now
>>> the device is in an indeterminate state. Some devices
>>> have no simple reset bit and getting the device back into
>>> a sane state could be complicated-- something the hypervisor
>>> doesn't want to do.
>>>
>>> So now KVM restarts the VM, vfio init happens for the device
>>> and the IOMMU for that device is re-configured,
>>> etc, but we really can't re-enable DMA until the guest OS tells us
>>> (via an hcall) that it is ready. The guest needs to get the
>>> assigned device in a sane state before DMA is enabled.
>>
>> Giant red flag. We need to paravirtualize the guest? Not on x86.
>
> It's the reality we have to deal with, but doing this would obviously
> only apply to platforms that need it.
By "x86" I assume you mean "PCI" and thus a bus-master enable flag that
you rely on the guest not setting until the device has been reset or
otherwise quiesced from any previous activity, in the absence of
function-level reset.
We don't have such a thing on our non-PCI devices.
>> Some
>> devices are better for assignment than others. PCI devices are moving
>> towards supporting standard reset mechanisms.
>>
>>> Does this warrant a new domain enable/disable API, or should
>>> we make this part of the setup API we are discussing
>>> here?
>>
>> What's wrong with simply not adding any DMA mapping entries until you
>> think your guest is ready? Isn't that effectively the same thing?
>> Unmap ~= disable. If the IOMMU API had a mechanism to toggle the iommu
>> domain on and off, I wouldn't be opposed to adding an ioctl to do it,
>> but it really seems like just a shortcut vs map/unmap. Thanks,
>
> Yes, we could do something like that I guess.
It would mean that we don't see any errors relating to impossible map
requests until after the guest is running and decides to enable DMA.
Depending on how PAMU table allocation is handled, it could introduce a
risk of failing even later when a guest reboots and we need to
temporarily disable DMA (e.g. if another vfio user consumes the same
table space for another group in the meantime).
It would add latency to failovers -- some customers have somewhat tight
requirements there.
-Scott
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 1:52 ` Alexey Kardashevskiy
2011-11-29 2:01 ` Alexey Kardashevskiy
@ 2011-11-29 3:46 ` Alex Williamson
2011-11-29 4:34 ` Alexey Kardashevskiy
1 sibling, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-29 3:46 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: chrisw, pmac, dwg, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Tue, 2011-11-29 at 12:52 +1100, Alexey Kardashevskiy wrote:
> Hi!
>
> I tried (successfully) to run it on POWER and while doing that I found some issues. I'll try to
> explain them in separate mails.
Great!
> IOMMU domain setup. On POWER, the linux drivers capable of DMA transfer want to know
> a DMA window, i.e. its start and length in the PHB address space. This comes from
> hardware. On X86 (correct if I am wrong), every device driver in the guest allocates
> memory from the same pool.
Yes, current VT-d/AMD-Vi provide independent IOVA spaces for each
device.
> On POWER, device drivers get DMA window and allocate pages
> for DMA within this window. In the case of VFIO, that means that QEMU has to
> preallocate this DMA window before running a guest, pass it to a guest (via
> device tree) and then a guest tells the host what pages are taken/released by
> calling map/unmap callbacks of iommu_ops. Deallocation is made in a device detach
> callback as I did not want to add more ioctls.
> So, there are 2 patches:
>
> - new VFIO_IOMMU_SETUP ioctl introduced which allocates a DMA window via IOMMU API on
> POWER.
> btw do we need an additional capability bit for it?
>
> KERNEL PATCH:
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 10615ad..a882e08 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned int *groupid)
> return -ENODEV;
> }
> EXPORT_SYMBOL_GPL(iommu_device_group);
> +
> +int iommu_setup(struct iommu_domain *domain,
> + size_t requested_size, size_t *allocated_size,
> + phys_addr_t *start_address)
> +{
> + return domain->ops->setup(domain, requested_size, allocated_size,
> + start_address);
> +}
> +EXPORT_SYMBOL_GPL(iommu_setup);
requested_size seems redundant both here and in struct vfio_setup. We
can just pre-load size/start with desired values. I assume x86 IOMMUs
would ignore requested values and return start = 0 and size = hardware
decoder address bits. The IOMMU API currently allows:
iommu_domain_alloc
[iommu_attach_device]
[iommu_map]
[iommu_unmap]
[iommu_detach_device]
iommu_domain_free
where everything between alloc and free can be called in any order. How
does setup fit into that model? For this it seems like we'd almost want
to combine alloc, setup, and the first attach into a single call (ie.
create a domain with this initial device and these parameters), then
subsequent attaches would only allow compatible devices.
I'm a little confused though, is the window determined by hardware or is
it configurable via requested_size? David had suggested that we could
implement a VFIO_IOMMU_GET_INFO ioctl that returns something like:
struct vfio_iommu_info {
__u32 argsz;
__u32 flags;
__u64 iova_max; /* Maximum IOVA address */
__u64 iova_min; /* Minimum IOVA address */
__u64 pgsize_bitmap; /* Bitmap of supported page sizes */
};
The thought being a TBD IOMMU API interface reports the hardware
determined IOVA range and we could fudge it on x86 for now reporting
0/~0. Maybe we should replace iova_max/iova_min with
iova_base/iova_size and allow the caller to request a size by setting
iova_size and matching bit in the flags.
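
To make that concrete, a rough sketch of how userspace might probe such an interface. VFIO_IOMMU_GET_INFO is only a suggested name at this point, so the ioctl below is a placeholder rather than anything the RFC defines; the structure is the one proposed above.

#include <stdio.h>
#include <sys/ioctl.h>

/* Sketch only: query the proposed info ioctl on an open iommu fd.
 * Assumes struct vfio_iommu_info as sketched above is available. */
static void probe_iommu_info(int iommu_fd)
{
	struct vfio_iommu_info info = { .argsz = sizeof(info) };

	if (ioctl(iommu_fd, VFIO_IOMMU_GET_INFO, &info) == 0)
		printf("IOVA range 0x%llx..0x%llx, page sizes 0x%llx\n",
		       (unsigned long long)info.iova_min,
		       (unsigned long long)info.iova_max,
		       (unsigned long long)info.pgsize_bitmap);
}
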
> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> index 029dae3..57fb70d 100644
> --- a/drivers/vfio/vfio_iommu.c
> +++ b/drivers/vfio/vfio_iommu.c
> @@ -507,6 +507,23 @@ static long vfio_iommu_unl_ioctl(struct file *filep,
>
> if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> ret = -EFAULT;
> +
> + } else if (cmd == VFIO_IOMMU_SETUP) {
> + struct vfio_setup setup;
> + size_t allocated_size = 0;
> + phys_addr_t start_address = 0;
> +
> + if (copy_from_user(&setup, (void __user *)arg, sizeof setup))
> + return -EFAULT;
> +
> + printk("udomain %p, priv=%p\n", iommu->domain, iommu->domain->priv);
> + ret = iommu_setup(iommu->domain, setup.requested_size,
> + &allocated_size, &start_address);
> + setup.allocated_size = allocated_size;
> + setup.start_address = start_address;
> +
> + if (!ret && copy_to_user((void __user *)arg, &setup, sizeof setup))
> + ret = -EFAULT;
> }
> return ret;
> }
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 93617e7..355cf8b 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -45,6 +45,7 @@ struct iommu_domain {
>
> #define IOMMU_CAP_CACHE_COHERENCY 0x1
> #define IOMMU_CAP_INTR_REMAP 0x2 /* isolates device intrs */
> +#define IOMMU_CAP_SETUP_REQUIRED 0x3 /* requires setup to be called */
>
> #ifdef CONFIG_IOMMU_API
>
> @@ -62,6 +63,9 @@ struct iommu_ops {
> int (*domain_has_cap)(struct iommu_domain *domain,
> unsigned long cap);
> int (*device_group)(struct device *dev, unsigned int *groupid);
> + int (*setup)(struct iommu_domain *domain,
> + size_t requested_size, size_t *allocated_size,
> + phys_addr_t *start_address);
> };
>
> extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
> @@ -80,6 +84,9 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
> unsigned long iova);
> extern int iommu_domain_has_cap(struct iommu_domain *domain,
> unsigned long cap);
> +extern int iommu_setup(struct iommu_domain *domain,
> + size_t requested_size, size_t *allocated_size,
> + phys_addr_t *start_address);
> extern void iommu_set_fault_handler(struct iommu_domain *domain,
> iommu_fault_handler_t handler);
> extern int iommu_device_group(struct device *dev, unsigned int *groupid);
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 971e3b1..5e0ee75 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -26,6 +26,7 @@
> * Author: Michael S. Tsirkin <mst@redhat.com>
> */
> #include <linux/types.h>
> +#include <linux/ioctl.h>
>
> #ifndef VFIO_H
> #define VFIO_H
> @@ -172,4 +173,13 @@ enum {
> VFIO_PCI_NUM_IRQS
> };
>
> +/* Setup domain */
> +#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
> +
> +struct vfio_setup {
> + __u64 requested_size;
> + __u64 allocated_size;
> + __u64 start_address;
> +};
> +
> #endif /* VFIO_H */
>
> === end ===
>
>
> QEMU PATCH:
>
> diff --git a/hw/linux-vfio.h b/hw/linux-vfio.h
> index ac48d85..a2c719f 100644
> --- a/hw/linux-vfio.h
> +++ b/hw/linux-vfio.h
> @@ -172,4 +172,13 @@ enum {
> VFIO_PCI_NUM_IRQS
> };
>
> +/* Setup domain */
> +#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
> +
> +struct vfio_setup {
> + __u64 requested_size;
> + __u64 allocated_size;
> + __u64 start_address;
> +};
> +
> #endif /* VFIO_H */
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index 1c97c35..b438bbe 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -1501,6 +1503,17 @@ static int vfio_initfn(struct PCIDevice *pdev)
> if (vfio_map_resources(vdev))
> goto out_disable_msi;
>
> + struct vfio_setup setup = { 1 << 26, 0, 0 };
How will qemu decide how much to ask for?
> + if ((ret = ioctl(vdev->group->iommu->fd, VFIO_IOMMU_SETUP, &setup))) {
> + return ret;
> + }
> + printf("SETUP: requested %lluMB, allocated %lluMB at %llx\n",
> + (unsigned long long)setup.requested_size,
> + (unsigned long long)setup.allocated_size,
> + (unsigned long long)setup.start_address);
> + vdev->start_address = setup.start_address;
> + vdev->window_size = setup.allocated_size;
> +
> if (vfio_enable_intx(vdev))
> goto out_unmap_resources;
>
> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> index 96b09bb..6b7ab6f 100644
> --- a/hw/vfio_pci.h
> +++ b/hw/vfio_pci.h
> @@ -79,6 +79,10 @@ typedef struct VFIODevice {
> bool msix;
> uint8_t msix_bar;
> uint16_t msix_entries;
> +#ifdef TARGET_PPC
> + uint64_t start_address;
> + uint32_t window_size;
> +#endif
> } VFIODevice;
>
> typedef struct VFIOGroup {
>
> === end ===
>
>
>
> - changed __vfio_close_iommu function to do unmapall first and detach devices then
> as actual deallocation happens on device detach callback of IOMMU ops.
>
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 6169356..f78f411 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -28,6 +28,7 @@
> #include <linux/uaccess.h>
> #include <linux/vfio.h>
> #include <linux/wait.h>
> +#include <linux/pci.h>
>
> #include "vfio_private.h"
>
> @@ -242,6 +243,13 @@ static void __vfio_close_iommu(struct vfio_iommu *iommu)
> if (!iommu->domain)
> return;
>
> + /*
> + * On POWER, device detaching (which is done by __vfio_iommu_detach_group)
> + * should happen after all pages unmapped because
> + * the only way to do actual iommu_unmap_page a device detach callback
> + */
> + vfio_iommu_unmapall(iommu);
> +
The unmapall/detach vs detach/unmapall shouldn't matter for x86. Though
I wonder if we should be proactively resetting devices before either to
avoid spurious IOVA faults.
> list_for_each(pos, &iommu->group_list) {
> struct vfio_group *group;
> group = list_entry(pos, struct vfio_group, iommu_next);
> @@ -249,7 +257,7 @@ static void __vfio_close_iommu(struct vfio_iommu *iommu)
> __vfio_iommu_detach_group(iommu, group);
> }
>
> - vfio_iommu_unmapall(iommu);
> + /* vfio_iommu_unmapall(iommu); */
>
> iommu_domain_free(iommu->domain);
> iommu->domain = NULL;
Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 3:46 ` Alex Williamson
@ 2011-11-29 4:34 ` Alexey Kardashevskiy
2011-11-29 5:48 ` Alex Williamson
0 siblings, 1 reply; 62+ messages in thread
From: Alexey Kardashevskiy @ 2011-11-29 4:34 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, pmac, dwg, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
Hi!
On 29/11/11 14:46, Alex Williamson wrote:
> On Tue, 2011-11-29 at 12:52 +1100, Alexey Kardashevskiy wrote:
>> Hi!
>>
>> I tried (successfully) to run it on POWER and while doing that I found some issues. I'll try to
>> explain them in separate mails.
>
> Great!
>
>> IOMMU domain setup. On POWER, the linux drivers capable of DMA transfer want to know
>> a DMA window, i.e. its start and length in the PHB address space. This comes from
>> hardware. On X86 (correct if I am wrong), every device driver in the guest allocates
>> memory from the same pool.
>
> Yes, current VT-d/AMD-Vi provide independent IOVA spaces for each
> device.
>
>> On POWER, device drivers get DMA window and allocate pages
>> for DMA within this window. In the case of VFIO, that means that QEMU has to
>> preallocate this DMA window before running a guest, pass it to a guest (via
>> device tree) and then a guest tells the host what pages are taken/released by
>> calling map/unmap callbacks of iommu_ops. Deallocation is made in a device detach
>> callback as I did not want to add more ioctls.
>> So, there are 2 patches:
>>
>> - new VFIO_IOMMU_SETUP ioctl introduced which allocates a DMA window via IOMMU API on
>> POWER.
>> btw do we need an additional capability bit for it?
>>
>> KERNEL PATCH:
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 10615ad..a882e08 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned int *groupid)
>> return -ENODEV;
>> }
>> EXPORT_SYMBOL_GPL(iommu_device_group);
>> +
>> +int iommu_setup(struct iommu_domain *domain,
>> + size_t requested_size, size_t *allocated_size,
>> + phys_addr_t *start_address)
>> +{
>> + return domain->ops->setup(domain, requested_size, allocated_size,
>> + start_address);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_setup);
>
> requested_size seems redundant both here and in struct vfio_setup. We
> can just pre-load size/start with desired values. I assume x86 IOMMUs
> would ignore requested values and return start = 0 and size = hardware
> decoder address bits. The IOMMU API currently allows:
>
> iommu_domain_alloc
> [iommu_attach_device]
> [iommu_map]
> [iommu_unmap]
> [iommu_detach_device]
> iommu_domain_free
>
> where everything between alloc and free can be called in any order. How
> does setup fit into that model?
This is why I posted a QEMU patch :)
> For this it seems like we'd almost want
> to combine alloc, setup, and the first attach into a single call (ie.
> create a domain with this initial device and these parameters), then
> subsequent attaches would only allow compatible devices.
Not exactly. This setup is more likely to get combined with domain alloc only.
On POWER, we have iommu_table per DMA window which can be or can be not shared
between devices. At the moment there is one window per PCIe _device_ (so multiple
functions of multiport network adapter share one DMA window) and one window for
all the devices behind PCIe-to-PCI bridge. It is more or less so.
> I'm a little confused though, is the window determined by hardware or is
> it configurable via requested_size?
The window parameters are calculated by software and then written to hardware so
hardware does filtering and prevents bad devices from memory corruption.
> David had suggested that we could
> implement a VFIO_IOMMU_GET_INFO ioctl that returns something like:
>
> struct vfio_iommu_info {
> __u32 argsz;
> __u32 flags;
> __u64 iova_max; /* Maximum IOVA address */
> __u64 iova_min; /* Minimum IOVA address */
> __u64 pgsize_bitmap; /* Bitmap of supported page sizes */
> };
>
> The thought being a TBD IOMMU API interface reports the hardware
> determined IOVA range and we could fudge it on x86 for now reporting
> 0/~0. Maybe we should replace iova_max/iova_min with
> iova_base/iova_size and allow the caller to request a size by setting
> iova_size and matching bit in the flags.
No, we need some sort of SET_INFO, not GET as we want QEMU to decide on a DMA
window size.
Or simply add these parameters to domain allocation callback.
>> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
>> index 029dae3..57fb70d 100644
>> --- a/drivers/vfio/vfio_iommu.c
>> +++ b/drivers/vfio/vfio_iommu.c
>> @@ -507,6 +507,23 @@ static long vfio_iommu_unl_ioctl(struct file *filep,
>>
>> if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
>> ret = -EFAULT;
>> +
>> + } else if (cmd == VFIO_IOMMU_SETUP) {
>> + struct vfio_setup setup;
>> + size_t allocated_size = 0;
>> + phys_addr_t start_address = 0;
>> +
>> + if (copy_from_user(&setup, (void __user *)arg, sizeof setup))
>> + return -EFAULT;
>> +
>> + printk("udomain %p, priv=%p\n", iommu->domain, iommu->domain->priv);
>> + ret = iommu_setup(iommu->domain, setup.requested_size,
>> + &allocated_size, &start_address);
>> + setup.allocated_size = allocated_size;
>> + setup.start_address = start_address;
>> +
>> + if (!ret && copy_to_user((void __user *)arg, &setup, sizeof setup))
>> + ret = -EFAULT;
>> }
>> return ret;
>> }
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 93617e7..355cf8b 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -45,6 +45,7 @@ struct iommu_domain {
>>
>> #define IOMMU_CAP_CACHE_COHERENCY 0x1
>> #define IOMMU_CAP_INTR_REMAP 0x2 /* isolates device intrs */
>> +#define IOMMU_CAP_SETUP_REQUIRED 0x3 /* requires setup to be called */
>>
>> #ifdef CONFIG_IOMMU_API
>>
>> @@ -62,6 +63,9 @@ struct iommu_ops {
>> int (*domain_has_cap)(struct iommu_domain *domain,
>> unsigned long cap);
>> int (*device_group)(struct device *dev, unsigned int *groupid);
>> + int (*setup)(struct iommu_domain *domain,
>> + size_t requested_size, size_t *allocated_size,
>> + phys_addr_t *start_address);
>> };
>>
>> extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
>> @@ -80,6 +84,9 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
>> unsigned long iova);
>> extern int iommu_domain_has_cap(struct iommu_domain *domain,
>> unsigned long cap);
>> +extern int iommu_setup(struct iommu_domain *domain,
>> + size_t requested_size, size_t *allocated_size,
>> + phys_addr_t *start_address);
>> extern void iommu_set_fault_handler(struct iommu_domain *domain,
>> iommu_fault_handler_t handler);
>> extern int iommu_device_group(struct device *dev, unsigned int *groupid);
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 971e3b1..5e0ee75 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -26,6 +26,7 @@
>> * Author: Michael S. Tsirkin <mst@redhat.com>
>> */
>> #include <linux/types.h>
>> +#include <linux/ioctl.h>
>>
>> #ifndef VFIO_H
>> #define VFIO_H
>> @@ -172,4 +173,13 @@ enum {
>> VFIO_PCI_NUM_IRQS
>> };
>>
>> +/* Setup domain */
>> +#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
>> +
>> +struct vfio_setup {
>> + __u64 requested_size;
>> + __u64 allocated_size;
>> + __u64 start_address;
>> +};
>> +
>> #endif /* VFIO_H */
>>
>> === end ===
>>
>>
>> QEMU PATCH:
>>
>> diff --git a/hw/linux-vfio.h b/hw/linux-vfio.h
>> index ac48d85..a2c719f 100644
>> --- a/hw/linux-vfio.h
>> +++ b/hw/linux-vfio.h
>> @@ -172,4 +172,13 @@ enum {
>> VFIO_PCI_NUM_IRQS
>> };
>>
>> +/* Setup domain */
>> +#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
>> +
>> +struct vfio_setup {
>> + __u64 requested_size;
>> + __u64 allocated_size;
>> + __u64 start_address;
>> +};
>> +
>> #endif /* VFIO_H */
>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>> index 1c97c35..b438bbe 100644
>> --- a/hw/vfio_pci.c
>> +++ b/hw/vfio_pci.c
>> @@ -1501,6 +1503,17 @@ static int vfio_initfn(struct PCIDevice *pdev)
>> if (vfio_map_resources(vdev))
>> goto out_disable_msi;
>>
>> + struct vfio_setup setup = { 1 << 26, 0, 0 };
>
> How will qemu decide how much to ask for?
It is done by some heuristic. Like "usb controller needs 16mb" and "10Gb card
needs more than 100mbit". I'd think that POWER-specific code in QEMU would decide.
As POWER supports multiple PCI domains, it can afford spending addresses :)
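
Purely as an illustration of what such a heuristic could look like (the class codes and window sizes below are invented for this sketch, not taken from any QEMU code):

#include <stdint.h>

/* Pick a DMA window size for a device based on its PCI base class.
 * Hypothetical policy only; real code would live in POWER-specific QEMU
 * and probably also look at BAR sizes, device quirks, etc. */
static uint64_t dma_window_size(uint8_t base_class)
{
	switch (base_class) {
	case 0x0c:			/* serial bus controllers (USB, etc.) */
		return 16ULL << 20;	/* 16MB */
	case 0x02:			/* network controllers, incl. 10Gb NICs */
		return 256ULL << 20;	/* be generous */
	default:
		return 64ULL << 20;	/* generic default */
	}
}
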
>> + if ((ret = ioctl(vdev->group->iommu->fd, VFIO_IOMMU_SETUP, &setup))) {
>> + return ret;
>> + }
>> + printf("SETUP: requested %lluMB, allocated %lluMB at %llx\n",
>> + (unsigned long long)setup.requested_size,
>> + (unsigned long long)setup.allocated_size,
>> + (unsigned long long)setup.start_address);
>> + vdev->start_address = setup.start_address;
>> + vdev->window_size = setup.allocated_size;
>> +
>> if (vfio_enable_intx(vdev))
>> goto out_unmap_resources;
>>
>> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
>> index 96b09bb..6b7ab6f 100644
>> --- a/hw/vfio_pci.h
>> +++ b/hw/vfio_pci.h
>> @@ -79,6 +79,10 @@ typedef struct VFIODevice {
>> bool msix;
>> uint8_t msix_bar;
>> uint16_t msix_entries;
>> +#ifdef TARGET_PPC
>> + uint64_t start_address;
>> + uint32_t window_size;
>> +#endif
>> } VFIODevice;
>>
>> typedef struct VFIOGroup {
>>
>> === end ===
>>
>>
>>
>> - changed __vfio_close_iommu function to do unmapall first and detach devices then
>> as actual deallocation happens on device detach callback of IOMMU ops.
>>
>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>> index 6169356..f78f411 100644
>> --- a/drivers/vfio/vfio_main.c
>> +++ b/drivers/vfio/vfio_main.c
>> @@ -28,6 +28,7 @@
>> #include <linux/uaccess.h>
>> #include <linux/vfio.h>
>> #include <linux/wait.h>
>> +#include <linux/pci.h>
>>
>> #include "vfio_private.h"
>>
>> @@ -242,6 +243,13 @@ static void __vfio_close_iommu(struct vfio_iommu *iommu)
>> if (!iommu->domain)
>> return;
>>
>> + /*
>> + * On POWER, device detaching (which is done by __vfio_iommu_detach_group)
>> + * should happen after all pages unmapped because
>> + * the only way to do actual iommu_unmap_page a device detach callback
>> + */
>> + vfio_iommu_unmapall(iommu);
>> +
>
> The unmapall/detach vs detach/unmapall shouldn't matter for x86. Though
> I wonder if we should be proactively resetting devices before either to
> avoid spurious IOVA faults.
Then we need some way to "shut down" a device.
I am not sure about x86, but on POWER a host allocates DMA window (SETUP does iommu_alloc
so the _whole_ DMA window gets allocated), and then a guest allocates pages within this
window itself but it only updates the host's IOMMU table with pairs of addresses, a host
does not do any actual map/unmap while the guest is running.
Oooor, we could release the whole window in the domain close callback of iommu_ops...
>> list_for_each(pos, &iommu->group_list) {
>> struct vfio_group *group;
>> group = list_entry(pos, struct vfio_group, iommu_next);
>> @@ -249,7 +257,7 @@ static void __vfio_close_iommu(struct vfio_iommu *iommu)
>> __vfio_iommu_detach_group(iommu, group);
>> }
>>
>> - vfio_iommu_unmapall(iommu);
>> + /* vfio_iommu_unmapall(iommu); */
>>
>> iommu_domain_free(iommu->domain);
>> iommu->domain = NULL;
>
> Thanks,
>
> Alex
--
Alexey Kardashevskiy
IBM OzLabs, LTC Team
e-mail: aik@au1.ibm.com
notes: Alexey Kardashevskiy/Australia/IBM
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 4:34 ` Alexey Kardashevskiy
@ 2011-11-29 5:48 ` Alex Williamson
2011-12-02 5:06 ` Alexey Kardashevskiy
0 siblings, 1 reply; 62+ messages in thread
From: Alex Williamson @ 2011-11-29 5:48 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: chrisw, pmac, dwg, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On Tue, 2011-11-29 at 15:34 +1100, Alexey Kardashevskiy wrote:
> Hi!
>
> On 29/11/11 14:46, Alex Williamson wrote:
> > On Tue, 2011-11-29 at 12:52 +1100, Alexey Kardashevskiy wrote:
> >> Hi!
> >>
> >> I tried (successfully) to run it on POWER and while doing that I found some issues. I'll try to
> >> explain them in separate mails.
> >
> > Great!
> >
> >> IOMMU domain setup. On POWER, the linux drivers capable of DMA transfer want to know
> >> a DMA window, i.e. its start and length in the PHB address space. This comes from
> >> hardware. On X86 (correct if I am wrong), every device driver in the guest allocates
> >> memory from the same pool.
> >
> > Yes, current VT-d/AMD-Vi provide independent IOVA spaces for each
> > device.
> >
> >> On POWER, device drivers get DMA window and allocate pages
> >> for DMA within this window. In the case of VFIO, that means that QEMU has to
> >> preallocate this DMA window before running a guest, pass it to a guest (via
> >> device tree) and then a guest tells the host what pages are taken/released by
> >> calling map/unmap callbacks of iommu_ops. Deallocation is made in a device detach
> >> callback as I did not want to add more ioctls.
> >> So, there are 2 patches:
> >>
> >> - new VFIO_IOMMU_SETUP ioctl introduced which allocates a DMA window via IOMMU API on
> >> POWER.
> >> btw do we need an additional capability bit for it?
> >>
> >> KERNEL PATCH:
> >>
> >> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >> index 10615ad..a882e08 100644
> >> --- a/drivers/iommu/iommu.c
> >> +++ b/drivers/iommu/iommu.c
> >> @@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned int *groupid)
> >> return -ENODEV;
> >> }
> >> EXPORT_SYMBOL_GPL(iommu_device_group);
> >> +
> >> +int iommu_setup(struct iommu_domain *domain,
> >> + size_t requested_size, size_t *allocated_size,
> >> + phys_addr_t *start_address)
> >> +{
> >> + return domain->ops->setup(domain, requested_size, allocated_size,
> >> + start_address);
> >> +}
> >> +EXPORT_SYMBOL_GPL(iommu_setup);
> >
> > requested_size seems redundant both here and in struct vfio_setup. We
> > can just pre-load size/start with desired values. I assume x86 IOMMUs
> > would ignore requested values and return start = 0 and size = hardware
> > decoder address bits. The IOMMU API currently allows:
> >
> > iommu_domain_alloc
> > [iommu_attach_device]
> > [iommu_map]
> > [iommu_unmap]
> > [iommu_detach_device]
> > iommu_domain_free
> >
> > where everything between alloc and free can be called in any order. How
> > does setup fit into that model?
>
> This is why I posted a QEMU patch :)
Right, but qemu/vfio is by no means the de facto standard of how one must
use the IOMMU API. KVM currently orders the map vs attach differently.
When is it valid to call setup when factoring in hot attached/detached
devices?
> > For this it seems like we'd almost want
> > to combine alloc, setup, and the first attach into a single call (ie.
> > create a domain with this initial device and these parameters), then
> > subsequent attaches would only allow compatible devices.
>
>
> Not exactly. This setup is more likely to get combined with domain alloc only.
At domain_alloc we don't have any association to actual hardware other
than a bus_type, how would you know which iommu is being setup?
> On POWER, we have iommu_table per DMA window which can be or can be not shared
> between devices. At the moment there is one window per PCIe _device_ (so multiple
> functions of multiport network adapter share one DMA window) and one window for
> all the devices behind PCIe-to-PCI bridge. It is more or less so.
>
>
> > I'm a little confused though, is the window determined by hardware or is
> > it configurable via requested_size?
>
>
> The window parameters are calculated by software and then written to hardware so
> hardware does filtering and prevents bad devices from memory corruption.
>
>
> > David had suggested that we could
> > implement a VFIO_IOMMU_GET_INFO ioctl that returns something like:
> >
> > struct vfio_iommu_info {
> > __u32 argsz;
> > __u32 flags;
> > __u64 iova_max; /* Maximum IOVA address */
> > __u64 iova_min; /* Minimum IOVA address */
> > __u64 pgsize_bitmap; /* Bitmap of supported page sizes */
> > };
> >
> > The thought being a TBD IOMMU API interface reports the hardware
> > determined IOVA range and we could fudge it on x86 for now reporting
> > 0/~0. Maybe we should replace iova_max/iova_min with
> > iova_base/iova_size and allow the caller to request a size by setting
> > iova_size and matching bit in the flags.
>
>
> No, we need some sort of SET_INFO, not GET as we want QEMU to decide on a DMA
> window size.
Right, GET_INFO is no longer the right name if we're really passing in
requests. Maybe it becomes VFIO_IOMMU_SETUP as you suggest and it
really is a bidirectional ioctl.
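
Using the struct vfio_setup from Alexey's patch quoted above, the bidirectional usage would then look roughly like this (a sketch; the 64MB request is an arbitrary example value):

#include <stdint.h>
#include <sys/ioctl.h>
#include "linux-vfio.h"		/* struct vfio_setup / VFIO_IOMMU_SETUP from the patch above */

/* Sketch: userspace pre-loads the request, the kernel fills in the result. */
static int setup_window(int iommu_fd, uint64_t *start, uint64_t *size)
{
	struct vfio_setup setup = {
		.requested_size = 64 << 20,	/* what userspace would like to get */
	};
	int ret = ioctl(iommu_fd, VFIO_IOMMU_SETUP, &setup);

	if (!ret) {
		/* the kernel reports what it actually granted */
		*start = setup.start_address;
		*size  = setup.allocated_size;
	}
	return ret;
}
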
> Or simply add these parameters to domain allocation callback.
Except alloc only specifies a bus_type, not a specific iommu, which is
why we might need to think about combining {alloc, attach, setup}...
struct iommu_domain *iommu_create_domain(int nr_devs,
struct device **devs,
dma_addr_t *iova_start,
size_t *iova_size)
But then we have trouble when vfio needs to prevent device access until
the iommu domain is setup which means that our only opportunity to
specify iommu parameters might be to add a struct to GROUP_GET_IOMMU_FD,
but without device access, how does qemu know how large of a window to
request?
BTW, domain_has_cap() might be a way to advertise if the domain
supports/requires a setup callback.
> >> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> >> index 029dae3..57fb70d 100644
> >> --- a/drivers/vfio/vfio_iommu.c
> >> +++ b/drivers/vfio/vfio_iommu.c
> >> @@ -507,6 +507,23 @@ static long vfio_iommu_unl_ioctl(struct file *filep,
> >>
> >> if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> >> ret = -EFAULT;
> >> +
> >> + } else if (cmd == VFIO_IOMMU_SETUP) {
> >> + struct vfio_setup setup;
> >> + size_t allocated_size = 0;
> >> + phys_addr_t start_address = 0;
> >> +
> >> + if (copy_from_user(&setup, (void __user *)arg, sizeof setup))
> >> + return -EFAULT;
> >> +
> >> + printk("udomain %p, priv=%p\n", iommu->domain, iommu->domain->priv);
> >> + ret = iommu_setup(iommu->domain, setup.requested_size,
> >> + &allocated_size, &start_address);
> >> + setup.allocated_size = allocated_size;
> >> + setup.start_address = start_address;
> >> +
> >> + if (!ret && copy_to_user((void __user *)arg, &setup, sizeof setup))
> >> + ret = -EFAULT;
> >> }
> >> return ret;
> >> }
> >> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> >> index 93617e7..355cf8b 100644
> >> --- a/include/linux/iommu.h
> >> +++ b/include/linux/iommu.h
> >> @@ -45,6 +45,7 @@ struct iommu_domain {
> >>
> >> #define IOMMU_CAP_CACHE_COHERENCY 0x1
> >> #define IOMMU_CAP_INTR_REMAP 0x2 /* isolates device intrs */
> >> +#define IOMMU_CAP_SETUP_REQUIRED 0x3 /* requires setup to be called */
> >>
> >> #ifdef CONFIG_IOMMU_API
> >>
> >> @@ -62,6 +63,9 @@ struct iommu_ops {
> >> int (*domain_has_cap)(struct iommu_domain *domain,
> >> unsigned long cap);
> >> int (*device_group)(struct device *dev, unsigned int *groupid);
> >> + int (*setup)(struct iommu_domain *domain,
> >> + size_t requested_size, size_t *allocated_size,
> >> + phys_addr_t *start_address);
> >> };
> >>
> >> extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
> >> @@ -80,6 +84,9 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
> >> unsigned long iova);
> >> extern int iommu_domain_has_cap(struct iommu_domain *domain,
> >> unsigned long cap);
> >> +extern int iommu_setup(struct iommu_domain *domain,
> >> + size_t requested_size, size_t *allocated_size,
> >> + phys_addr_t *start_address);
> >> extern void iommu_set_fault_handler(struct iommu_domain *domain,
> >> iommu_fault_handler_t handler);
> >> extern int iommu_device_group(struct device *dev, unsigned int *groupid);
> >> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >> index 971e3b1..5e0ee75 100644
> >> --- a/include/linux/vfio.h
> >> +++ b/include/linux/vfio.h
> >> @@ -26,6 +26,7 @@
> >> * Author: Michael S. Tsirkin <mst@redhat.com>
> >> */
> >> #include <linux/types.h>
> >> +#include <linux/ioctl.h>
> >>
> >> #ifndef VFIO_H
> >> #define VFIO_H
> >> @@ -172,4 +173,13 @@ enum {
> >> VFIO_PCI_NUM_IRQS
> >> };
> >>
> >> +/* Setup domain */
> >> +#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
> >> +
> >> +struct vfio_setup {
> >> + __u64 requested_size;
> >> + __u64 allocated_size;
> >> + __u64 start_address;
> >> +};
> >> +
> >> #endif /* VFIO_H */
> >>
> >> === end ===
> >>
> >>
> >> QEMU PATCH:
> >>
> >> diff --git a/hw/linux-vfio.h b/hw/linux-vfio.h
> >> index ac48d85..a2c719f 100644
> >> --- a/hw/linux-vfio.h
> >> +++ b/hw/linux-vfio.h
> >> @@ -172,4 +172,13 @@ enum {
> >> VFIO_PCI_NUM_IRQS
> >> };
> >>
> >> +/* Setup domain */
> >> +#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
> >> +
> >> +struct vfio_setup {
> >> + __u64 requested_size;
> >> + __u64 allocated_size;
> >> + __u64 start_address;
> >> +};
> >> +
> >> #endif /* VFIO_H */
> >> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> >> index 1c97c35..b438bbe 100644
> >> --- a/hw/vfio_pci.c
> >> +++ b/hw/vfio_pci.c
> >> @@ -1501,6 +1503,17 @@ static int vfio_initfn(struct PCIDevice *pdev)
> >> if (vfio_map_resources(vdev))
> >> goto out_disable_msi;
> >>
> >> + struct vfio_setup setup = { 1 << 26, 0, 0 };
> >
> > How will qemu decide how much to ask for?
>
>
> It is done by some heuristic. Like "usb controller needs 16mb" and "10Gb card
> needs more than 100mbit". I'd think that POWER-specific code in QEMU would decide.
> As POWER supports multiple PCI domains, it can afford spending addresses :)
>
>
>
> >> + if ((ret = ioctl(vdev->group->iommu->fd, VFIO_IOMMU_SETUP, &setup))) {
> >> + return ret;
> >> + }
> >> + printf("SETUP: requested %lluMB, allocated %lluMB at %llx\n",
> >> + (unsigned long long)setup.requested_size,
> >> + (unsigned long long)setup.allocated_size,
> >> + (unsigned long long)setup.start_address);
> >> + vdev->start_address = setup.start_address;
> >> + vdev->window_size = setup.allocated_size;
> >> +
> >> if (vfio_enable_intx(vdev))
> >> goto out_unmap_resources;
> >>
> >> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> >> index 96b09bb..6b7ab6f 100644
> >> --- a/hw/vfio_pci.h
> >> +++ b/hw/vfio_pci.h
> >> @@ -79,6 +79,10 @@ typedef struct VFIODevice {
> >> bool msix;
> >> uint8_t msix_bar;
> >> uint16_t msix_entries;
> >> +#ifdef TARGET_PPC
> >> + uint64_t start_address;
> >> + uint32_t window_size;
> >> +#endif
> >> } VFIODevice;
> >>
> >> typedef struct VFIOGroup {
> >>
> >> === end ===
> >>
> >>
> >>
> >> - changed __vfio_close_iommu function to do unmapall first and detach devices then
> >> as actual deallocation happens on device detach callback of IOMMU ops.
> >>
> >> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> >> index 6169356..f78f411 100644
> >> --- a/drivers/vfio/vfio_main.c
> >> +++ b/drivers/vfio/vfio_main.c
> >> @@ -28,6 +28,7 @@
> >> #include <linux/uaccess.h>
> >> #include <linux/vfio.h>
> >> #include <linux/wait.h>
> >> +#include <linux/pci.h>
> >>
> >> #include "vfio_private.h"
> >>
> >> @@ -242,6 +243,13 @@ static void __vfio_close_iommu(struct vfio_iommu *iommu)
> >> if (!iommu->domain)
> >> return;
> >>
> >> + /*
> >> + * On POWER, device detaching (which is done by __vfio_iommu_detach_group)
> >> + * should happen after all pages unmapped because
> >> + * the only way to do actual iommu_unmap_page a device detach callback
> >> + */
> >> + vfio_iommu_unmapall(iommu);
> >> +
> >
> > The unmapall/detach vs detach/unmapall shouldn't matter for x86. Though
> > I wonder if we should be proactively resetting devices before either to
> > avoid spurious IOVA faults.
>
>
> Then we need some way to "shut down" a device.
VFIO Devices expose the DEVICE_RESET ioctl, which could be exposed to
the group via vfio_device_ops. pci_reset_function() does a pretty good
job of quiescing devices when it works. I've also been wondering if we
need a VFIO_GROUP_RESET which can call reset on each device, but also
allow things like PCI secondary bus resets when all the devices behind a
bridge are in a group. For now I've deferred it as a possible future
extension.
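
For what it's worth, userspace can already approximate a group reset by walking the device fds and issuing the per-device reset; a sketch, assuming the ioctl is spelled VFIO_DEVICE_RESET and takes no argument:

#include <stdio.h>
#include <sys/ioctl.h>

/* Reset every device in a group one by one.  A real VFIO_GROUP_RESET could
 * additionally fall back to a secondary bus reset when all the devices sit
 * behind one bridge, which userspace cannot do on its own. */
static void reset_group_devices(const int *device_fds, int ndevs)
{
	int i;

	for (i = 0; i < ndevs; i++)
		if (ioctl(device_fds[i], VFIO_DEVICE_RESET) < 0)
			perror("VFIO_DEVICE_RESET");
}
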
> I am not sure about x86, but on POWER a host allocates DMA window (SETUP does iommu_alloc
> so the _whole_ DMA window gets allocated), and then a guest allocates pages within this
> window itself but it only updates the host's IOMMU table with pairs of addresses, a host
> does not do any actual map/unmap while the guest is running.
What's the difference between "updates the host's IOMMU table with pairs
of addresses" and map/unmap? Your DMA window is static, but what each
IOVA within the window points to is not, correct? I would have assumed
you do unmap/map to update each of those (at least David seemed to care
about map/unmap latency).
x86 maps the entire guest, so the only runtime map/unmaps would be when
the memory map changes. Typically this only happens for MMIO regions
and odd chipset specific regions that switch between being memory backed
or ROM backed (which we really don't care about for DMA), and
theoretically for future memory hotplug. At some point we're supposed
to have devices and IOMMUs that support IO page faults, so a device can
request an IOVA and we'd probably register a page fault handler for the
domain to dynamically pin and map pages.
> Oooor, we could release the whole window in the domain close callback of iommu_ops...
AIUI, the iommu driver will destroy all the mappings in a domain when we
call iommu_domain_free, but the real need for unmapall is to unpin all
the memory. David was suggesting maybe the pinning should happen in the
iommu driver, which could then handle unpinning on release. I kinda
doubt iommu drivers want to get into the business of pinning memory
though. I'd actually like VFIO to get out of this business as well and
was thinking about requiring mapped pages to be mlocked by the user, but
it appears we have no way to later prevent or detect that the user
munlocked the pages and might then have access to random host memory.
Thanks,
Alex
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC PATCH] vfio: VFIO Driver core framework
2011-11-29 5:48 ` Alex Williamson
@ 2011-12-02 5:06 ` Alexey Kardashevskiy
0 siblings, 0 replies; 62+ messages in thread
From: Alexey Kardashevskiy @ 2011-12-02 5:06 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, pmac, dwg, joerg.roedel, agraf, benve, aafabbri, B08248,
B07421, avi, konrad.wilk, kvm, qemu-devel, iommu, linux-pci
On 29/11/11 16:48, Alex Williamson wrote:
> On Tue, 2011-11-29 at 15:34 +1100, Alexey Kardashevskiy wrote:
>> Hi!
>>
>> On 29/11/11 14:46, Alex Williamson wrote:
>>> On Tue, 2011-11-29 at 12:52 +1100, Alexey Kardashevskiy wrote:
>>>> Hi!
>>>>
>>>> I tried (successfully) to run it on POWER and while doing that I found some issues. I'll try to
>>>> explain them in separate mails.
>>>
>>> Great!
>>>
>>>> IOMMU domain setup. On POWER, the linux drivers capable of DMA transfer want to know
>>>> a DMA window, i.e. its start and length in the PHB address space. This comes from
>>>> hardware. On X86 (correct if I am wrong), every device driver in the guest allocates
>>>> memory from the same pool.
>>>
>>> Yes, current VT-d/AMD-Vi provide independent IOVA spaces for each
>>> device.
>>>
>>>> On POWER, device drivers get DMA window and allocate pages
>>>> for DMA within this window. In the case of VFIO, that means that QEMU has to
>>>> preallocate this DMA window before running a guest, pass it to a guest (via
>>>> device tree) and then a guest tells the host what pages are taken/released by
>>>> calling map/unmap callbacks of iommu_ops. Deallocation is made in a device detach
>>>> callback as I did not want to add more ioctls.
>>>> So, there are 2 patches:
>>>>
>>>> - new VFIO_IOMMU_SETUP ioctl introduced which allocates a DMA window via IOMMU API on
>>>> POWER.
>>>> btw do we need an additional capability bit for it?
>>>>
>>>> KERNEL PATCH:
>>>>
>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>> index 10615ad..a882e08 100644
>>>> --- a/drivers/iommu/iommu.c
>>>> +++ b/drivers/iommu/iommu.c
>>>> @@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned int *groupid)
>>>> return -ENODEV;
>>>> }
>>>> EXPORT_SYMBOL_GPL(iommu_device_group);
>>>> +
>>>> +int iommu_setup(struct iommu_domain *domain,
>>>> + size_t requested_size, size_t *allocated_size,
>>>> + phys_addr_t *start_address)
>>>> +{
>>>> + return domain->ops->setup(domain, requested_size, allocated_size,
>>>> + start_address);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(iommu_setup);
>>>
>>> requested_size seems redundant both here and in struct vfio_setup. We
>>> can just pre-load size/start with desired values. I assume x86 IOMMUs
>>> would ignore requested values and return start = 0 and size = hardware
>>> decoder address bits. The IOMMU API currently allows:
>>>
>>> iommu_domain_alloc
>>> [iommu_attach_device]
>>> [iommu_map]
>>> [iommu_unmap]
>>> [iommu_detach_device]
>>> iommu_domain_free
>>>
>>> where everything between alloc and free can be called in any order. How
>>> does setup fit into that model?
>>
>> This is why I posted a QEMU patch :)
>
> Right, but qemu/vfio is by no means the de facto standard of how one must
> use the IOMMU API. KVM currently orders the map vs attach differently.
> When is it valid to call setup when factoring in hot attached/detached
> devices?
>
>>> For this it seems like we'd almost want
>>> to combine alloc, setup, and the first attach into a single call (ie.
>>> create a domain with this initial device and these parameters), then
>>> subsequent attaches would only allow compatible devices.
>>
>>
>> Not exactly. This setup is more likely to get combined with domain alloc only.
>
> At domain_alloc we don't have any association to actual hardware other
> than a bus_type, how would you know which iommu is being setup?
Yes. This is exactly the problem. We do have preallocated PEs (aka groups) on POWER but cannot
use this information during setup until the first device is attached.
Generally speaking, we should not be adding devices to an IOMMU domain; we should be adding groups
instead, as the group is the smallest entity the IOMMU can handle. At least in the API, it is better
to keep things as simple as the idea being implemented.
For example, we could implement a tool to put devices into a group (at the moment on POWER it would
check that a domain has no more than just a single group in it). We could pass this group ID to QEMU
instead of passing "-device vfio-pci ..." (as we still must pass _all_ devices of a group to QEMU).
Sure, we'll have to change VFIO to tell QEMU what devices are included in what group; quite easy to do.
It is a tree: domain -> group -> device. Let's reflect that in the API.
>> On POWER, we have iommu_table per DMA window which can be or can be not shared
>> between devices. At the moment there is one window per PCIe _device_ (so multiple
>> functions of multiport network adapter share one DMA window) and one window for
>> all the devices behind PCIe-to-PCI bridge. It is more or less so.
>>
>>
>>> I'm a little confused though, is the window determined by hardware or is
>>> it configurable via requested_size?
>>
>>
>> The window parameters are calculated by software and then written to hardware so
>> hardware does filtering and prevents bad devices from memory corruption.
>>
>>
>>> David had suggested that we could
>>> implement a VFIO_IOMMU_GET_INFO ioctl that returns something like:
>>>
>>> struct vfio_iommu_info {
>>> __u32 argsz;
>>> __u32 flags;
>>> __u64 iova_max; /* Maximum IOVA address */
>>> __u64 iova_min; /* Minimum IOVA address */
>>> __u64 pgsize_bitmap; /* Bitmap of supported page sizes */
>>> };
>>>
>>> The thought being a TBD IOMMU API interface reports the hardware
>>> determined IOVA range and we could fudge it on x86 for now reporting
>>> 0/~0. Maybe we should replace iova_max/iova_min with
>>> iova_base/iova_size and allow the caller to request a size by setting
>>> iova_size and matching bit in the flags.
>>
>>
>> No, we need some sort of SET_INFO, not GET as we want QEMU to decide on a DMA
>> window size.
>
> Right, GET_INFO is no longer the right name if we're really passing in
> requests. Maybe it becomes VFIO_IOMMU_SETUP as you suggest and it
> really is a bidirectional ioctl.
Actually these are group properties, not device or IOMMU properties; at least that is how it is on POWER right now. And
they are predefined for every group. We would need "GET_INFO" for them if we had a group-based API
but we do not have it.
>> Or simply add these parameters to domain allocation callback.
>
> Except alloc only specifies a bus_type, not a specific iommu, which is
> why we might need to think about combining {alloc, attach, setup}...
>
> struct iommu_domain *iommu_create_domain(int nr_devs,
> struct device **devs,
> dma_addr_t *iova_start,
> size_t *iova_size)
>
> But then we have trouble when vfio needs to prevent device access until
> the iommu domain is setup which means that our only opportunity to
> specify iommu parameters might be to add a struct to GROUP_GET_IOMMU_FD,
> but without device access, how does qemu know how large of a window to
> request?
Somehow. In the simplest case everyone gets 64MB; otherwise something more elaborate.
There is a difference. On x86 you really create a new domain. On POWER (again, currently) we create
a domain which corresponds to an existing group: one group per domain, and domains cannot consist of
two or more groups (this will be fixed in hardware later though).
> BTW, domain_has_cap() might be a way to advertise if the domain
> supports/requires a setup callback.
>>>> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
>>>> index 029dae3..57fb70d 100644
>>>> --- a/drivers/vfio/vfio_iommu.c
>>>> +++ b/drivers/vfio/vfio_iommu.c
>>>> @@ -507,6 +507,23 @@ static long vfio_iommu_unl_ioctl(struct file *filep,
>>>>
>>>> if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
>>>> ret = -EFAULT;
>>>> +
>>>> + } else if (cmd == VFIO_IOMMU_SETUP) {
>>>> + struct vfio_setup setup;
>>>> + size_t allocated_size = 0;
>>>> + phys_addr_t start_address = 0;
>>>> +
>>>> + if (copy_from_user(&setup, (void __user *)arg, sizeof setup))
>>>> + return -EFAULT;
>>>> +
>>>> + printk("udomain %p, priv=%p\n", iommu->domain, iommu->domain->priv);
>>>> + ret = iommu_setup(iommu->domain, setup.requested_size,
>>>> + &allocated_size, &start_address);
>>>> + setup.allocated_size = allocated_size;
>>>> + setup.start_address = start_address;
>>>> +
>>>> + if (!ret && copy_to_user((void __user *)arg, &setup, sizeof setup))
>>>> + ret = -EFAULT;
>>>> }
>>>> return ret;
>>>> }
>>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>>> index 93617e7..355cf8b 100644
>>>> --- a/include/linux/iommu.h
>>>> +++ b/include/linux/iommu.h
>>>> @@ -45,6 +45,7 @@ struct iommu_domain {
>>>>
>>>> #define IOMMU_CAP_CACHE_COHERENCY 0x1
>>>> #define IOMMU_CAP_INTR_REMAP 0x2 /* isolates device intrs */
>>>> +#define IOMMU_CAP_SETUP_REQUIRED 0x3 /* requires setup to be called */
>>>>
>>>> #ifdef CONFIG_IOMMU_API
>>>>
>>>> @@ -62,6 +63,9 @@ struct iommu_ops {
>>>> int (*domain_has_cap)(struct iommu_domain *domain,
>>>> unsigned long cap);
>>>> int (*device_group)(struct device *dev, unsigned int *groupid);
>>>> + int (*setup)(struct iommu_domain *domain,
>>>> + size_t requested_size, size_t *allocated_size,
>>>> + phys_addr_t *start_address);
>>>> };
>>>>
>>>> extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
>>>> @@ -80,6 +84,9 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
>>>> unsigned long iova);
>>>> extern int iommu_domain_has_cap(struct iommu_domain *domain,
>>>> unsigned long cap);
>>>> +extern int iommu_setup(struct iommu_domain *domain,
>>>> + size_t requested_size, size_t *allocated_size,
>>>> + phys_addr_t *start_address);
>>>> extern void iommu_set_fault_handler(struct iommu_domain *domain,
>>>> iommu_fault_handler_t handler);
>>>> extern int iommu_device_group(struct device *dev, unsigned int *groupid);
>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>> index 971e3b1..5e0ee75 100644
>>>> --- a/include/linux/vfio.h
>>>> +++ b/include/linux/vfio.h
>>>> @@ -26,6 +26,7 @@
>>>> * Author: Michael S. Tsirkin <mst@redhat.com>
>>>> */
>>>> #include <linux/types.h>
>>>> +#include <linux/ioctl.h>
>>>>
>>>> #ifndef VFIO_H
>>>> #define VFIO_H
>>>> @@ -172,4 +173,13 @@ enum {
>>>> VFIO_PCI_NUM_IRQS
>>>> };
>>>>
>>>> +/* Setup domain */
>>>> +#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
>>>> +
>>>> +struct vfio_setup {
>>>> + __u64 requested_size;
>>>> + __u64 allocated_size;
>>>> + __u64 start_address;
>>>> +};
>>>> +
>>>> #endif /* VFIO_H */
>>>>
>>>> === end ===
>>>>
>>>>
>>>> QEMU PATCH:
>>>>
>>>> diff --git a/hw/linux-vfio.h b/hw/linux-vfio.h
>>>> index ac48d85..a2c719f 100644
>>>> --- a/hw/linux-vfio.h
>>>> +++ b/hw/linux-vfio.h
>>>> @@ -172,4 +172,13 @@ enum {
>>>> VFIO_PCI_NUM_IRQS
>>>> };
>>>>
>>>> +/* Setup domain */
>>>> +#define VFIO_IOMMU_SETUP _IOWR(';', 150, struct vfio_setup)
>>>> +
>>>> +struct vfio_setup {
>>>> + __u64 requested_size;
>>>> + __u64 allocated_size;
>>>> + __u64 start_address;
>>>> +};
>>>> +
>>>> #endif /* VFIO_H */
>>>> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
>>>> index 1c97c35..b438bbe 100644
>>>> --- a/hw/vfio_pci.c
>>>> +++ b/hw/vfio_pci.c
>>>> @@ -1501,6 +1503,17 @@ static int vfio_initfn(struct PCIDevice *pdev)
>>>> if (vfio_map_resources(vdev))
>>>> goto out_disable_msi;
>>>>
>>>> + struct vfio_setup setup = { 1 << 26, 0, 0 };
>>>
>>> How will qemu decide how much to ask for?
>>
>>
>> It is done by some heuristic. Like "usb controller needs 16mb" and "10Gb card
>> needs more than 100mbit". I'd think that POWER-specific code in QEMU would decide.
>> As POWER supports multiple PCI domains, it can afford spending addresses :)
>>
>>
>>
>>>> + if ((ret = ioctl(vdev->group->iommu->fd, VFIO_IOMMU_SETUP, &setup))) {
>>>> + return ret;
>>>> + }
>>>> + printf("SETUP: requested %lluMB, allocated %lluMB at %llx\n",
>>>> + (unsigned long long)setup.requested_size,
>>>> + (unsigned long long)setup.allocated_size,
>>>> + (unsigned long long)setup.start_address);
>>>> + vdev->start_address = setup.start_address;
>>>> + vdev->window_size = setup.allocated_size;
>>>> +
>>>> if (vfio_enable_intx(vdev))
>>>> goto out_unmap_resources;
>>>>
>>>> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
>>>> index 96b09bb..6b7ab6f 100644
>>>> --- a/hw/vfio_pci.h
>>>> +++ b/hw/vfio_pci.h
>>>> @@ -79,6 +79,10 @@ typedef struct VFIODevice {
>>>> bool msix;
>>>> uint8_t msix_bar;
>>>> uint16_t msix_entries;
>>>> +#ifdef TARGET_PPC
>>>> + uint64_t start_address;
>>>> + uint32_t window_size;
>>>> +#endif
>>>> } VFIODevice;
>>>>
>>>> typedef struct VFIOGroup {
>>>>
>>>> === end ===
>>>>
>>>>
>>>>
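(For reference, a minimal userspace sketch of driving the proposed VFIO_IOMMU_SETUP
ioctl quoted above; the helper name and error handling are illustrative, not taken
from the patch, and an already-open iommu fd from the group API is assumed.)

#include <stdint.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch only, not part of the posted patch: request a DMA window of
 * "want" bytes through the proposed VFIO_IOMMU_SETUP ioctl.  "iommu_fd"
 * is assumed to be an already-open VFIO iommu file descriptor. */
static int request_dma_window(int iommu_fd, uint64_t want,
			      uint64_t *base, uint64_t *size)
{
	struct vfio_setup setup = { .requested_size = want };

	if (ioctl(iommu_fd, VFIO_IOMMU_SETUP, &setup))
		return -errno;

	/* The host may allocate less than requested; all later DMA mappings
	 * must fall inside [start_address, start_address + allocated_size). */
	*base = setup.start_address;
	*size = setup.allocated_size;
	return 0;
}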
>>>> - changed __vfio_close_iommu function to do unmapall first and detach devices then
>>>> as actual deallocation happens on device detach callback of IOMMU ops.
>>>>
>>>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>>>> index 6169356..f78f411 100644
>>>> --- a/drivers/vfio/vfio_main.c
>>>> +++ b/drivers/vfio/vfio_main.c
>>>> @@ -28,6 +28,7 @@
>>>> #include <linux/uaccess.h>
>>>> #include <linux/vfio.h>
>>>> #include <linux/wait.h>
>>>> +#include <linux/pci.h>
>>>>
>>>> #include "vfio_private.h"
>>>>
>>>> @@ -242,6 +243,13 @@ static void __vfio_close_iommu(struct vfio_iommu *iommu)
>>>> if (!iommu->domain)
>>>> return;
>>>>
>>>> + /*
>>>> + * On POWER, device detaching (which is done by __vfio_iommu_detach_group)
>>>> + * should happen after all pages are unmapped, because the only way to do
>>>> + * an actual iommu_unmap_page is from the device detach callback.
>>>> + */
>>>> + vfio_iommu_unmapall(iommu);
>>>> +
>>>
>>> The unmapall/detach vs detach/unmapall shouldn't matter for x86. Though
>>> I wonder if we should be proactively resetting devices before either to
>>> avoid spurious IOVA faults.
>>
>>
>> Then we need some way to "shut down" a device.
>
> VFIO Devices expose the DEVICE_RESET ioctl, which could be exposed to
> the group via vfio_device_ops. pci_reset_function() does a pretty good
> job of quiescing devices when it works. I've also been wondering if we
> need a VFIO_GROUP_RESET which can call reset on each device, but also
> allow things like PCI secondary bus resets when all the devices behind a
> bridge are in a group. For now I've deferred it as a possible future
> extension.
Right, we do not need it right now.
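(If it ever becomes necessary, the per-device path would presumably be a simple loop
from userspace, roughly as sketched below; the VFIO_DEVICE_RESET name and the way the
device fds were obtained are assumptions, not quoted from this patch.)

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch only: quiesce every device in a group by hand, pending a possible
 * VFIO_GROUP_RESET.  The VFIO_DEVICE_RESET ioctl name and how the device
 * fds were obtained are assumptions, not taken from the patch. */
static void reset_group_devices(const int *device_fds, int ndevices)
{
	int i;

	for (i = 0; i < ndevices; i++)
		if (ioctl(device_fds[i], VFIO_DEVICE_RESET))
			perror("VFIO_DEVICE_RESET");
}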
>> I am not sure about x86, but on POWER the host allocates the DMA window (SETUP does iommu_alloc
>> so the _whole_ DMA window gets allocated), and then the guest allocates pages within this
>> window itself; it only updates the host's IOMMU table with pairs of addresses, and the host
>> does not do any actual map/unmap while the guest is running.
>
> What's the difference between "updates the host's IOMMU table with pairs
> of addresses" and map/unmap? Your DMA window is static, but what each
> IOVA within the window points to is not, correct?
The DMA window is configured by system firmware and is static in the host kernel - the firmware
defines all DMA windows (at least the 32-bit ones) for all PEs.
64-bit windows are dynamic (a guest may ask the kernel to allocate some) and also need to be
taken care of (arch-specific ioctls?), but later.
> I would have assumed
> you do unmap/map to update each of those (at least David seemed to care
> about map/unmap latency).
> x86 maps the entire guest, so the only runtime map/unmaps would be when
> the memory map changes. Typically this only happens for MMIO regions
> and odd chipset specific regions that switch between being memory backed
> or ROM backed (which we really don't care about for DMA), and
> theoretically for future memory hotplug. At some point we're supposed
> to have devices and IOMMUs that support IO page faults, so a device can
> request an IOVA and we'd probably register a page fault handler for the
> domain to dynamically pin and map pages.
>
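(To make the "map the entire guest" model above concrete, a hedged sketch of mapping one
guest RAM region at startup; the vfio_dma_map layout and the VFIO_IOMMU_MAP_DMA /
VFIO_DMA_MAP_FLAG_WRITE names are assumptions based on the rest of the series, not
quoted from it.)

#include <stdint.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch only: identity-map one chunk of guest memory so the device can DMA
 * to it, i.e. the "x86 maps the entire guest" model described above.
 * Struct layout and macro names here are assumed, not from this patch. */
static int map_guest_region(int iommu_fd, void *hva, uint64_t gpa, uint64_t len)
{
	struct vfio_dma_map map = {
		.vaddr   = (uintptr_t)hva,	/* qemu virtual address backing the RAM */
		.dmaaddr = gpa,			/* IOVA == guest-physical address */
		.len     = len,
		.flags   = VFIO_DMA_MAP_FLAG_WRITE,
	};

	return ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &map) ? -errno : 0;
}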
>> Oooor, we could release the whole window in the domain close callback of iommu_ops...
>
> AIUI, the iommu driver will destroy all the mappings in a domain when we
> call iommu_domain_free, but the real need for unmapall is to unpin all
> the memory. David was suggesting maybe the pinning should happen in the
> iommu driver, which could then handle unpinning on release. I kinda
> doubt iommu drivers want to get into the business of pinning memory
> though. I'd actually like VFIO to get out of this business as well and
> was thinking about requiring mapped pages to be mlocked by the user, but
> it appears we have no way to later prevent or detect that the user
> munlocked the pages and might then have access to random host memory.
Aah, I had some discussion here; my implementation was a hack, so this problem is gone for now.
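(For context on the pinning discussion above, the general shape of what "pinning" means in
the kernel today is roughly the sketch below; this is illustrative only, not the code from
vfio_iommu.c.)

#include <linux/mm.h>
#include <linux/errno.h>

/* Sketch only: pin a single user page for DMA, roughly what the vfio iommu
 * code has to do per mapped page.  Not the actual code from this patch. */
static int pin_one_page(unsigned long vaddr, int write, struct page **page)
{
	int ret;

	ret = get_user_pages_fast(vaddr, 1, write, page);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	/* The page stays pinned until put_page() at unmap time, which is why
	 * an unmapall on close is needed to release the memory if the user
	 * did not unmap everything itself. */
	return 0;
}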
--
Alexey Kardashevskiy
IBM OzLabs, LTC Team
e-mail: aik@au1.ibm.com
notes: Alexey Kardashevskiy/Australia/IBM
^ permalink raw reply [flat|nested] 62+ messages in thread