From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:49914)
    by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1aO6yo-0002nY-Eb for qemu-devel@nongnu.org;
    Tue, 26 Jan 2016 12:00:36 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from ) id 1aO6ym-0000VF-9U for qemu-devel@nongnu.org;
    Tue, 26 Jan 2016 12:00:34 -0500
Received: from mx1.redhat.com ([209.132.183.28]:45391)
    by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1aO6yl-0000Uz-Uy for qemu-devel@nongnu.org;
    Tue, 26 Jan 2016 12:00:32 -0500
Message-ID: <1453827630.26652.71.camel@redhat.com>
From: Alex Williamson
Date: Tue, 26 Jan 2016 10:00:30 -0700
In-Reply-To: <56A787DC.6060905@linux.vnet.ibm.com>
References: <1452611505-25478-1-git-send-email-pmorel@linux.vnet.ibm.com>
    <1452622595.9674.19.camel@redhat.com>
    <569FA454.6050409@linux.vnet.ibm.com>
    <1453304819.32741.277.camel@redhat.com>
    <56A0D9F4.1060708@linux.vnet.ibm.com>
    <1453500876.32741.465.camel@redhat.com>
    <1453501156.32741.468.camel@redhat.com>
    <56A787DC.6060905@linux.vnet.ibm.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH v3] vfio/common: Check iova with limit not with size
To: Pierre Morel
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, peter.maydell@linaro.org

On Tue, 2016-01-26 at 15:51 +0100, Pierre Morel wrote:
>
> On 01/22/2016 11:19 PM, Alex Williamson wrote:
> > On Fri, 2016-01-22 at 15:14 -0700, Alex Williamson wrote:
> > > On Thu, 2016-01-21 at 14:15 +0100, Pierre Morel wrote:
> > > > On 01/20/2016 04:46 PM, Alex Williamson wrote:
> > > > > On Wed, 2016-01-20 at 16:14 +0100, Pierre Morel wrote:
> > > > > > On 01/12/2016 07:16 PM, Alex Williamson wrote:
> > > > > > > On Tue, 2016-01-12 at 16:11 +0100, Pierre Morel wrote:
> > > > > > > > In vfio_listener_region_add(), we try to validate that the region
> > > > > > > > is not zero sized and hasn't overflowed the address space.
> > > > > > > >
> > > > > > > > But the calculation uses the size of the region instead of
> > > > > > > > using the region's limit (size - 1).
> > > > > > > >
> > > > > > > > This leads to Int128 overflow when the region has
> > > > > > > > been initialized to UINT64_MAX because in this case
> > > > > > > > memory_region_init() transforms the size from UINT64_MAX
> > > > > > > > to int128_2_64().
> > > > > > > >
> > > > > > > > Let's really use the limit by subtracting one from the size
> > > > > > > > and take care to use the limit for functions using limit
> > > > > > > > and size to call functions which need size.
> > > > > > > >
> > > > > > > > Signed-off-by: Pierre Morel
> > > > > > > > ---
> > > > > > > >
> > > > > > > > Changes from v2:
> > > > > > > >        - all, just ignore v2, sorry about this,
> > > > > > > >          this is built on v1
> > > > > > > >
> > > > > > > > Changes from v1:
> > > > > > > >        - adjust the tests, knowing we already subtracted one from
> > > > > > > >          end.
> > > > > > > >
> > > > > > > >  hw/vfio/common.c |   14 +++++++-------
> > > > > > > >  1 files changed, 7 insertions(+), 7 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > > > > > index 6797208..a5f6643 100644
> > > > > > > > --- a/hw/vfio/common.c
> > > > > > > > +++ b/hw/vfio/common.c
> > > > > > > > @@ -348,12 +348,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > > >      if (int128_ge(int128_make64(iova), llend)) {
> > > > > > > >          return;
> > > > > > > >      }
> > > > > > > > -    end = int128_get64(llend);
> > > > > > > > +    end = int128_get64(int128_sub(llend, int128_one()));
> > > > > > > >  
> > > > > > > > -    if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
> > > > > > > > +    if ((iova < container->min_iova) || (end  > container->max_iova)) {
> > > > > > > >          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> > > > > > > >                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> > > > > > > > -                     container, iova, end - 1);
> > > > > > > > +                     container, iova, end);
> > > > > > > >          ret = -EFAULT;
> > > > > > > >          goto fail;
> > > > > > > >      }
> > > > > > > > @@ -363,7 +363,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > > >      if (memory_region_is_iommu(section->mr)) {
> > > > > > > >          VFIOGuestIOMMU *giommu;
> > > > > > > >  
> > > > > > > > -        trace_vfio_listener_region_add_iommu(iova, end - 1);
> > > > > > > > +        trace_vfio_listener_region_add_iommu(iova, end);
> > > > > > > >          /*
> > > > > > > >           * FIXME: We should do some checking to see if the
> > > > > > > >           * capabilities of the host VFIO IOMMU are adequate to model
> > > > > > > > @@ -394,13 +394,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > > >              section->offset_within_region +
> > > > > > > >              (iova - section->offset_within_address_space);
> > > > > > > >  
> > > > > > > > -    trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
> > > > > > > > +    trace_vfio_listener_region_add_ram(iova, end, vaddr);
> > > > > > > >  
> > > > > > > > -    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> > > > > > > > +    ret = vfio_dma_map(container, iova, end - iova + 1, vaddr, section->readonly);
> > > > > > > >      if (ret) {
> > > > > > > >          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> > > > > > > >                       "0x%"HWADDR_PRIx", %p) = %d (%m)",
> > > > > > > > -                     container, iova, end - iova, vaddr, ret);
> > > > > > > > +                     container, iova, end - iova + 1, vaddr, ret);
> > > > > > > >          goto fail;
> > > > > > > >      }
> > > > > > > >  
> > > > > > > Hmm, did we just push the overflow from one place to another?  If
> > > > > > > we're mapping a full region of size int128_2_64() starting at iova
> > > > > > > zero, then this becomes (0xffff_ffff_ffff_ffff - 0 + 1) = 0.  So I
> > > > > > > think we need to calculate size with 128bit arithmetic too and let
> > > > > > > it assert if we overflow, ie:
> > > > > > >
> > > > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > > > > index a5f6643..13ad90b 100644
> > > > > > > --- a/hw/vfio/common.c
> > > > > > > +++ b/hw/vfio/common.c
> > > > > > > @@ -321,7 +321,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > >                                       MemoryRegionSection *section)
> > > > > > >  {
> > > > > > >      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> > > > > > > -    hwaddr iova, end;
> > > > > > > +    hwaddr iova, end, size;
> > > > > > >      Int128 llend;
> > > > > > >      void *vaddr;
> > > > > > >      int ret;
> > > > > > > @@ -348,7 +348,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > >      if (int128_ge(int128_make64(iova), llend)) {
> > > > > > >          return;
> > > > > > >      }
> > > > > > > +
> > > > > > >      end = int128_get64(int128_sub(llend, int128_one()));
> > > > > > > +    size = int128_get64(int128_sub(llend, int128_make64(iova)));
> > > > > > here again, if iova is null, since llend is section->size (2^64) ...
> > > > > >
> > > > > > >  
> > > > > > >      if ((iova < container->min_iova) || (end  > container->max_iova)) {
> > > > > > >          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> > > > > > > @@ -396,11 +398,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > >  
> > > > > > >      trace_vfio_listener_region_add_ram(iova, end, vaddr);
> > > > > > >  
> > > > > > > -    ret = vfio_dma_map(container, iova, end - iova + 1, vaddr, section->readonly);
> > > > > > > +    ret = vfio_dma_map(container, iova, size, vaddr, section->readonly);
> > > > > > >      if (ret) {
> > > > > > >          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> > > > > > >                       "0x%"HWADDR_PRIx", %p) = %d (%m)",
> > > > > > > -                     container, iova, end - iova + 1, vaddr, ret);
> > > > > > > +                     container, iova, size, vaddr, ret);
> > > > > > >          goto fail;
> > > > > > >      }
> > > > > > >  
> > > > > > >
> > > > > > > Does that still solve your scenario?  Perhaps vfio-iommu-type1
> > > > > > > should have used first/last rather than start/size for mapping
> > > > > > > since we seem to have an off-by-one for mapping a full 64bit
> > > > > > > space.  Seems like we could do it with two calls to vfio_dma_map
> > > > > > > if we really wanted to.  Thanks,
> > > > > > >
> > > > > > > Alex
> > > > > > >
> > > > > > You are right, every attempt to solve this will push the overflow
> > > > > > somewhere else.
> > > > > >
> > > > > > There is just no way to express 2^64 with 64 bits. We have the
> > > > > > int128() solution, but if we solve it here, we hit the same limit
> > > > > > in the linux ioctl call anyway.
> > > > > >
> > > > > > Intuitively, making two calls does not seem right to me.
> > > > > >
> > > > > > But, what do you think of something like:
> > > > > >
> > > > > > - creating a new VFIO extension
> > > > > >
> > > > > > - and in ioctl(), since we have a flag entry in the
> > > > > >   vfio_iommu_type1_dma_map, maybe adding a new flag meaning
> > > > > >   "map all virtual memory"? or meaning "use first/last"?
> > > > > >   I think this would break existing code unless we add a new VFIO
> > > > > >   extension.
> > > > > Back up, is there ever a case where we actually need to map the
> > > > > entire 64bit address space?  This is fairly well impossible on
> > > > > x86.  I'm pointing out an issue, but I don't know that we need to
> > > > > solve it with more than an assert since it's never likely to
> > > > > happen.  Thanks,
> > > > >
> > > > > Alex
> > > > >
> > > > If I understood right, IOVA is the IO virtual address,
> > > > so it is possible to map the virtual address page
> > > > 0xffff_ffff_ffff_f000 to something reasonable inside the real memory.
> > > It is.
> > >
> > > > Possibly we do not need to map the last virtual page, but
> > > > I think that in the general case all of the virtual memory, as
> > > > viewed by the device through the IOMMU, should be mapped to avoid
> > > > any uninitialized virtual memory access.
> > > When using vfio, a device only has access to the IOVA space which has
> > > been explicitly mapped.  This would be a security issue otherwise since
> > > kernel vfio can't rely on userspace to wipe the device IOVA space.
> yes.
> > > > It is the same reason that makes us map all of the virtual memory
> > > > for the CPU MMU.
> > > We don't really do that either, CPU mapping works based on page tables
> > > and non-existent entries simply don't exist.  We don't fully populate
> > > the page tables in advance, this would be a horrible waste of memory.
>
>
> Alex,
>
> I am not sure of that. When preparing DMA from the device, the
> guest will provide the destination address, and these destination
> addresses will be translated by the IOMMU when the device starts the DMA.
>
> The guest can make any decision when preparing the DMA and, if I have
> understood correctly, this is transparent to QEMU.
> What is not transparent is the IOMMU translation.
>
> Then, when the device starts the DMA, the destination address can be
> anything inside the virtual memory and the IOMMU will translate this.
> To be able to translate, a table entry for this virtual address must
> exist in the IOMMU page table.
>
> If you have several levels of page table you may only fill the first
> level for all entries, and maybe have only one first level entry
> initialized and the second level entries belonging to it filled.
> Which greatly reduces the size of the tables.
>
> But if you do not fill one of the first level entries, the behavior of
> the IOMMU, and therefore of the DMA, depends on whatever has been left
> in that entry.

It seems like you're arguing that the guest is going to have a full
64-bit address space for DMA targets.  Sure, the guest driver can
program the device to DMA anywhere in that address space, but what
should the IOMMU actually consider a valid DMA?  It has to be things
that have been mapped, like guest physical RAM or peer-to-peer DMA
targets.  How would the IOMMU handle a stray DMA for anything else?  By
default the DMA target would not be mapped and the DMA would generate a
fault.  How that fault is handled is specific to the architecture and
platform.  If you want to provide the device with a full 64-bit address
space where nothing will fault, you're going to need to do it via a lot
of mappings pointing to the same host physical page.  I'm really not
sure where we're going with this though.

> > >
> > > > Maybe I missed something, or maybe I worry too much,
> > > > but I see this as a restriction on the supported hardware
> > > > if we compare host and guest hardware support compatibility.
> > > I don't see the issue, there's arguably a bug in the API that doesn't
> > > allow us to map the full 64bit IOVA space of a device in a single
> > > mapping, but we can do it in two.  Besides, there's really no case
> > > where a device needs a fully populated IOTLB unless you're actually
> > > giving the device access to 16 EMB of memory.
> > s/EMB/EB/  Or I suppose technically EiB
>
> yes, I agree with this, we do not need to access so much memory.
>
> >
> > > > We can live with it, because in fact you are right and today
> > > > I am not aware of any hardware wanting to access this page, but a
> > > > hardware designer who knows there is an IOMMU may want to access
> > > > exactly this kind of strange virtual page for special features,
> > > > and this would work on the host but not inside the guest.
> > > The API issue is not that we can't map 0xffff_ffff_ffff_f000, it's that
> > > we can't map 0x0 through 0xffff_ffff_ffff_ffff in a single mapping
> > > because we pass the size instead of the end address (where size here
> > > would be 2^64).  We can map 0x0 through 0xffff_ffff_ffff_efff, followed
> > > by 0xffff_ffff_ffff_f000 through 0xffff_ffff_ffff_ffff, but again, why
> > > would you ever need to do this?  Thanks,
> > >
> > > Alex
>
> The thing is that it could be useful to say we map all the virtual
> memory.

Why?  And again, you can do it, just not in a single mapping, which
really seems like a theoretical problem since you're mapping the IOVA
space of a device, which lives within and consumes a small portion of
this address space, so you're pretty much always looking at doing
multiple mappings.

> Having a size of 2^64 was a possibility.
> On the other hand, with the current implementation the
> "memory_region_iommu_replay" would take a long, long, long time.
>
> In fact, depending on the IOMMU capabilities and usage we do not need to
> call the "memory_region_iommu_replay" at that time, or even at all.

The mapping itself would take a long time, even with GiB pages we're
talking about populating 2^34 page table entries.  Of course many of
those entries would need to point to the same physical page to cover
empty space since otherwise you'd need to run this on a processor that
supports more than 2^64 bytes of address space, but you probably don't
want to waste lots of memory covering empty space, so that means a
smaller page size, which means orders of magnitude more mappings and
more space wasted in the page tables...  All of this should have some
useful value, which is escaping me.  Thanks,

Alex
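
P.S. To make the arithmetic in this thread concrete, here is a minimal
standalone sketch.  It is an illustration only, not the QEMU or kernel
code: struct map_req, span_size64(), split_range() and entries_log2()
are hypothetical names invented for this example, and a 4 KiB page size
is assumed in the usage note.  It shows that a start/size request cannot
describe the full 2^64-byte range, how the range can be split into two
requests, and where the 2^34 entry count for GiB pages comes from.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical start/size mapping request; not a QEMU or kernel type. */
struct map_req {
    uint64_t iova;
    uint64_t size;
};

/*
 * Size of the inclusive range [first, last] computed in 64 bits.
 * For first = 0 and last = UINT64_MAX the true size is 2^64, which
 * does not fit, so the result wraps to 0.
 */
static uint64_t span_size64(uint64_t first, uint64_t last)
{
    return last - first + 1;
}

/*
 * Describe the inclusive range [first, last] with start/size requests
 * whose sizes all fit in 64 bits.  Only the full 2^64-byte range needs
 * two requests; in that case the last page is split off on its own.
 * page_size is assumed to be a non-zero power of two.
 * Returns the number of requests written to out (1 or 2).
 */
static int split_range(uint64_t first, uint64_t last, uint64_t page_size,
                       struct map_req out[2])
{
    uint64_t size;

    assert(first <= last);
    assert(page_size != 0);

    size = span_size64(first, last);
    if (size != 0) {
        /* The size is representable: a single request is enough. */
        out[0] = (struct map_req){ first, size };
        return 1;
    }

    /* The size wrapped, so the range is 0x0 .. 0xffff_ffff_ffff_ffff. */
    out[0] = (struct map_req){ 0, (uint64_t)0 - page_size };
    out[1] = (struct map_req){ (uint64_t)0 - page_size, page_size };
    return 2;
}

/*
 * log2 of the number of page table entries needed to cover an address
 * space of addr_bits bits with pages of (1 << page_shift) bytes.
 */
static unsigned entries_log2(unsigned addr_bits, unsigned page_shift)
{
    assert(addr_bits >= page_shift);
    return addr_bits - page_shift;
}
```

With a 4 KiB page, split_range(0, UINT64_MAX, 0x1000, reqs) produces
{0x0, 0xffff_ffff_ffff_f000} followed by {0xffff_ffff_ffff_f000, 0x1000},
which are the two ranges mentioned above.  entries_log2(64, 30) gives 34
for GiB pages, and the same calculation for 4 KiB pages gives 52.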