Message-ID: <1453916634.6261.7.camel@redhat.com>
From: Alex Williamson
Date: Wed, 27 Jan 2016 10:43:54 -0700
In-Reply-To: <56A88DD9.3040004@linux.vnet.ibm.com>
References: <1452611505-25478-1-git-send-email-pmorel@linux.vnet.ibm.com>
 <1452622595.9674.19.camel@redhat.com>
 <569FA454.6050409@linux.vnet.ibm.com>
 <1453304819.32741.277.camel@redhat.com>
 <56A0D9F4.1060708@linux.vnet.ibm.com>
 <1453500876.32741.465.camel@redhat.com>
 <1453501156.32741.468.camel@redhat.com>
 <56A787DC.6060905@linux.vnet.ibm.com>
 <1453827630.26652.71.camel@redhat.com>
 <56A88DD9.3040004@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] [PATCH v3] vfio/common: Check iova with limit not with size
To: Pierre Morel
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, peter.maydell@linaro.org

On Wed, 2016-01-27 at 10:28 +0100, Pierre Morel wrote:
> 
> On 01/26/2016 06:00 PM, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 15:51 +0100, Pierre Morel wrote:
> > > On 01/22/2016 11:19 PM, Alex Williamson wrote:
> > > > On Fri, 2016-01-22 at 15:14 -0700, Alex Williamson wrote:
> > > > > On Thu, 2016-01-21 at 14:15 +0100, Pierre Morel wrote:
> > > > > > On 01/20/2016 04:46 PM, Alex Williamson wrote:
> > > > > > > On Wed, 2016-01-20 at 16:14 +0100, Pierre Morel wrote:
> > > > > > > > On 01/12/2016 07:16 PM, Alex Williamson wrote:
> > > > > > > > > On Tue, 2016-01-12 at 16:11 +0100, Pierre Morel wrote:
> > > > > > > > > > In vfio_listener_region_add(), we try to validate that the region
> > > > > > > > > > is not zero sized and hasn't overflowed the address space.
> > > > > > > > > > 
> > > > > > > > > > But the calculation uses the size of the region instead of
> > > > > > > > > > the region's limit (size - 1).
> > > > > > > > > > 
> > > > > > > > > > This leads to an Int128 overflow when the region has
> > > > > > > > > > been initialized to UINT64_MAX, because in this case
> > > > > > > > > > memory_region_init() transforms the size from UINT64_MAX
> > > > > > > > > > to int128_2_64().
> > > > > > > > > > 
> > > > > > > > > > Let's really use the limit by subtracting one from the size,
> > > > > > > > > > and take care to use the limit for functions which expect a
> > > > > > > > > > limit and the size for functions which expect a size.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Pierre Morel
> > > > > > > > > > ---
> > > > > > > > > > 
> > > > > > > > > > Changes from v2:
> > > > > > > > > >         - all, just ignore v2, sorry about this,
> > > > > > > > > >           this is built on top of v1
> > > > > > > > > > 
> > > > > > > > > > Changes from v1:
> > > > > > > > > >         - adjust the tests, knowing we already subtracted one to get end.
> > > > > > > > > > 
> > > > > > > > > >  hw/vfio/common.c |   14 +++++++-------
> > > > > > > > > >  1 files changed, 7 insertions(+), 7 deletions(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > > > > > > > index 6797208..a5f6643 100644
> > > > > > > > > > --- a/hw/vfio/common.c
> > > > > > > > > > +++ b/hw/vfio/common.c
> > > > > > > > > > @@ -348,12 +348,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > > > > >      if (int128_ge(int128_make64(iova), llend)) {
> > > > > > > > > >          return;
> > > > > > > > > >      }
> > > > > > > > > > -    end = int128_get64(llend);
> > > > > > > > > > +    end = int128_get64(int128_sub(llend, int128_one()));
> > > > > > > > > > 
> > > > > > > > > > -    if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
> > > > > > > > > > +    if ((iova < container->min_iova) || (end > container->max_iova)) {
> > > > > > > > > >          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> > > > > > > > > >                       " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> > > > > > > > > > -                     container, iova, end - 1);
> > > > > > > > > > +                     container, iova, end);
> > > > > > > > > >          ret = -EFAULT;
> > > > > > > > > >          goto fail;
> > > > > > > > > >      }
> > > > > > > > > > @@ -363,7 +363,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > > > > >      if (memory_region_is_iommu(section->mr)) {
> > > > > > > > > >          VFIOGuestIOMMU *giommu;
> > > > > > > > > > 
> > > > > > > > > > -        trace_vfio_listener_region_add_iommu(iova, end - 1);
> > > > > > > > > > +        trace_vfio_listener_region_add_iommu(iova, end);
> > > > > > > > > >          /*
> > > > > > > > > >           * FIXME: We should do some checking to see if the
> > > > > > > > > >           * capabilities of the host VFIO IOMMU are adequate to model
> > > > > > > > > > @@ -394,13 +394,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > > > > >              section->offset_within_region +
> > > > > > > > > >              (iova - section->offset_within_address_space);
> > > > > > > > > > 
> > > > > > > > > > -    trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
> > > > > > > > > > +    trace_vfio_listener_region_add_ram(iova, end, vaddr);
> > > > > > > > > > 
> > > > > > > > > > -    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> > > > > > > > > > +    ret = vfio_dma_map(container, iova, end - iova + 1, vaddr, section->readonly);
> > > > > > > > > >      if (ret) {
> > > > > > > > > >          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> > > > > > > > > >                       "0x%"HWADDR_PRIx", %p) = %d (%m)",
> > > > > > > > > > -                     container, iova, end - iova, vaddr, ret);
> > > > > > > > > > +                     container, iova, end - iova + 1, vaddr, ret);
> > > > > > > > > >          goto fail;
> > > > > > > > > >      }
> > > > > > > > > > 
> > > > > > > > > Hmm, did we just push the overflow from one place to another?  If we're
> > > > > > > > > mapping a full region of size int128_2_64() starting at iova zero, then
> > > > > > > > > this becomes (0xffff_ffff_ffff_ffff - 0 + 1) = 0.  So I think we need
> > > > > > > > > to calculate size with 128bit arithmetic too and let it assert if we
> > > > > > > > > overflow, ie:
> > > > > > > > > 
> > > > > > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > > > > > > index a5f6643..13ad90b 100644
> > > > > > > > > --- a/hw/vfio/common.c
> > > > > > > > > +++ b/hw/vfio/common.c
> > > > > > > > > @@ -321,7 +321,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > > > >                                           MemoryRegionSection *section)
> > > > > > > > >  {
> > > > > > > > >      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> > > > > > > > > -    hwaddr iova, end;
> > > > > > > > > +    hwaddr iova, end, size;
> > > > > > > > >      Int128 llend;
> > > > > > > > >      void *vaddr;
> > > > > > > > >      int ret;
> > > > > > > > > @@ -348,7 +348,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > > > >      if (int128_ge(int128_make64(iova), llend)) {
> > > > > > > > >          return;
> > > > > > > > >      }
> > > > > > > > > +
> > > > > > > > >      end = int128_get64(int128_sub(llend, int128_one()));
> > > > > > > > > +    size = int128_get64(int128_sub(llend, int128_make64(iova)));
> > > > > > > > here again, if iova is zero, since llend is section->size (2^64) ...
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > >      if ((iova < container->min_iova) || (end > container->max_iova)) {
> > > > > > > > >          error_report("vfio: IOMMU container %p can't map guest IOVA region"
> > > > > > > > > @@ -396,11 +398,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > > > > > > > > 
> > > > > > > > >      trace_vfio_listener_region_add_ram(iova, end, vaddr);
> > > > > > > > > 
> > > > > > > > > -    ret = vfio_dma_map(container, iova, end - iova + 1, vaddr, section->readonly);
> > > > > > > > > +    ret = vfio_dma_map(container, iova, size, vaddr, section->readonly);
> > > > > > > > >      if (ret) {
> > > > > > > > >          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> > > > > > > > >                       "0x%"HWADDR_PRIx", %p) = %d (%m)",
> > > > > > > > > -                     container, iova, end - iova + 1, vaddr, ret);
> > > > > > > > > +                     container, iova, size, vaddr, ret);
> > > > > > > > >          goto fail;
> > > > > > > > >      }
> > > > > > > > > 
> > > > > > > > > Does that still solve your scenario?  Perhaps vfio-iommu-type1 should
> > > > > > > > > have used first/last rather than start/size for mapping since we seem
> > > > > > > > > to have an off-by-one for mapping a full 64bit space.  Seems like we
> > > > > > > > > could do it with two calls to vfio_dma_map if we really wanted to.
> > > > > > > > > Thanks,
> > > > > > > > > 
> > > > > > > > > Alex
> > > > > > > > > 
> > > > > > > > You are right, every attempt to solve this pushes the overflow
> > > > > > > > somewhere else.
> > > > > > > > 
> > > > > > > > There is just no way to express 2^64 with 64 bits; we have the int128()
> > > > > > > > solution, but even if we solve it here, we run into the Linux ioctl
> > > > > > > > call anyway.
> > > > > > > > 
> > > > > > > > Intuitively, making two calls does not seem right to me.
> > > > > > > > 
> > > > > > > > But what do you think of something like:
> > > > > > > > 
> > > > > > > > - creating a new VFIO extension
> > > > > > > > 
> > > > > > > > - and in ioctl(), since we have a flags field in
> > > > > > > >   vfio_iommu_type1_dma_map, maybe adding a new flag meaning
> > > > > > > >   "map all virtual memory"? or meaning "use first/last"?
> > > > > > > > I think this would break existing code unless we add a new VFIO
> > > > > > > > extension.
> > > > > > > Back up, is there ever a case where we actually need to map the entire
> > > > > > > 64bit address space?  This is fairly well impossible on x86.  I'm
> > > > > > > pointing out an issue, but I don't know that we need to solve it with
> > > > > > > more than an assert since it's never likely to happen.  Thanks,
> > > > > > > 
> > > > > > > Alex
> > > > > > > 
> > > > > > If I understood right, IOVA is the IO virtual address; it is then
> > > > > > possible to map the virtual address page 0xffff_ffff_ffff_f000
> > > > > > to something reasonable inside the real memory.
> > > > > It is.
> > > > > 
> > > > > > Eventually we do not need to map the last virtual page, but
> > > > > > I think that in the general case all virtual memory, as viewed by the
> > > > > > device through the IOMMU, should be mapped to avoid any uninitialized
> > > > > > virtual memory access.
> > > > > When using vfio, a device only has access to the IOVA space which has
> > > > > been explicitly mapped.  This would be a security issue otherwise since
> > > > > kernel vfio can't rely on userspace to wipe the device IOVA space.
> > > yes.
> > > > > > It is the same reason that makes us map all virtual memory for the
> > > > > > CPU MMU.
> > > > > We don't really do that either, CPU mapping works based on page tables
> > > > > and non-existent entries simply don't exist.  We don't fully populate
> > > > > the page tables in advance, this would be a horrible waste of memory.
> > > 
> > > Alex,
> > > 
> > > I am not sure of that: when preparing DMA from the device, the guest
> > > will provide the destination addresses, and these destination addresses
> > > will be translated by the IOMMU when the device starts the DMA.
> > > 
> > > The guest can make any decision when preparing the DMA and, if I have
> > > understood correctly, this is transparent to QEMU.
> > > What is not transparent is the IOMMU translation.
> > > 
> > > Then, when the device starts the DMA, the destination address can be
> > > anything inside the virtual memory, and the IOMMU will translate it.
> > > To be able to translate, a table entry for this virtual address must
> > > exist in the IOMMU page table.
> > > 
> > > If you have several levels of page tables you may fill only the first
> > > level for all entries, and perhaps have only one first-level entry
> > > initialized with its second-level entries filled.
> > > Which greatly reduces the size of the tables.
> > > 
> > > But if you do not fill one of the first-level entries, the behavior of
> > > the IOMMU, and hence of the DMA, depends on whatever has been left in
> > > this entry.
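
To make the point concrete, here is a minimal, self-contained sketch of the
two-level layout described above (purely illustrative C, not QEMU code and
not any real IOMMU's table format; all names are made up): only first-level
entries that were explicitly made valid point to a second-level table, so a
stray DMA to any untouched address hits an invalid entry and faults instead
of following leftover garbage.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical two-level table covering a 2^32 IOVA space:
 * 4096 first-level entries, each pointing to a second-level table of
 * 256 entries mapping 4 KiB pages (4096 * 256 * 4096 = 2^32). */
#define L1_ENTRIES 4096
#define L2_ENTRIES 256
#define PAGE_SIZE  4096

typedef struct {
    bool valid;
    uint64_t host_page;        /* host address of the mapped page */
} L2Entry;

typedef struct {
    bool valid;
    L2Entry *table;            /* allocated only when first used */
} L1Entry;

static L1Entry l1[L1_ENTRIES]; /* zero-initialized: everything invalid */

/* Translate an IOVA; returns false (a fault) for any address that was
 * never explicitly mapped, rather than following leftover garbage. */
static bool translate(uint32_t iova, uint64_t *host)
{
    uint32_t l1_idx = iova / (L2_ENTRIES * PAGE_SIZE);
    uint32_t l2_idx = (iova / PAGE_SIZE) % L2_ENTRIES;

    if (!l1[l1_idx].valid || !l1[l1_idx].table[l2_idx].valid) {
        return false;          /* invalid entry -> IOMMU fault */
    }
    *host = l1[l1_idx].table[l2_idx].host_page + (iova % PAGE_SIZE);
    return true;
}

/* Map one page: only the touched second-level table is allocated,
 * every other first-level entry simply stays invalid. */
static void map_page(uint32_t iova, uint64_t host_page)
{
    uint32_t l1_idx = iova / (L2_ENTRIES * PAGE_SIZE);

    if (!l1[l1_idx].valid) {
        l1[l1_idx].table = calloc(L2_ENTRIES, sizeof(L2Entry));
        l1[l1_idx].valid = true;
    }
    l1[l1_idx].table[(iova / PAGE_SIZE) % L2_ENTRIES] =
        (L2Entry){ .valid = true, .host_page = host_page };
}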
> > It seems like you're arguing that the guest is going to have a 2^64 bit
> > address space for DMA targets.  Sure, the guest driver can program the
> > device to DMA anywhere in that address space, but what should the IOMMU
> > actually consider a valid DMA?  It has to be things that have been
> > mapped, like guest physical RAM or peer-to-peer DMA targets.  How would
> > the IOMMU handle a stray DMA for anything else?  By default the DMA
> > target would not be mapped and the DMA would generate a fault.  How
> > that fault is handled is specific to the architecture and platform.  If
> > you want to provide the device with a full 2^64 bit address space where
> > nothing will fault, you're going to need to do it via a lot of mappings
> > pointing to the same host physical page.  I'm really not sure where
> > we're going with this though.
> 
> Alex,
> 
> I think we misunderstood each other.
> As I do not know whether I misunderstand you or you misunderstand me,
> let me try to explain myself better.
> You may find the description obvious; if so, all is fine.
> 
> In the architectures I know, DMA and IOMMU are two separate things.
> The IOMMU must initialize table entries for the whole space DMA is able
> to access.  It must do it, no exception allowed.
> But some, and maybe nearly all, of the entries in the table are invalid
> entries: existing entries with the invalid bit set (or no valid bit set).
> 
> Most, if not all, IOMMUs have several levels of tables.
> Suppose a two-level table mapping a 2^32 memory space.
> You have a first level with 4k entries, each with an indirection to a
> second-level page table.
> Each second-level table maps 256 entries of 4k pages.
> 
> In this case, to have a valid page table accessing 1 byte somewhere in
> memory you have:
> 4k - 1 first-level entries with the invalid bit set,
> 1 first-level entry which is valid and points to a second-level table;
> 
> in the second-level table you have 255 entries with the invalid bit set,
> and one valid entry pointing to the page where the byte is.
> 
> This is just to illustrate what I mean when I say all entries are
> initialized.
> If one leaves garbage inside an entry, random things happen... we do
> not want that.
> 
> Are we OK with this?

Yes, basic page tables.

> > > > > > Maybe I missed something, or maybe I worry too much,
> > > > > > but I see this as a restriction on the supported hardware
> > > > > > if we compare host and guest hardware support compatibility.
> > > > > I don't see the issue, there's arguably a bug in the API that doesn't
> > > > > allow us to map the full 64bit IOVA space of a device in a single
> > > > > mapping, but we can do it in two.  Besides, there's really no case
> > > > > where a device needs a fully populated IOTLB unless you're actually
> > > > > giving the device access to 16 EMB of memory.
> > > > s/EMB/EB/  Or I suppose technically EiB
> > > yes, I agree with this, we do not need to access so much memory.
> > > 
> > > > > > We can live with it, because in fact you are right: today I am not
> > > > > > aware of hardware wanting to access this page, but hardware
> > > > > > designers knowing they have an IOMMU may want to access exactly
> > > > > > this kind of strange virtual page for special features, and this
> > > > > > would work on the host but not inside the guest.
> > > > > The API issue is not that we can't map 0xffff_ffff_ffff_f000, it's that
> > > > > we can't map 0x0 through 0xffff_ffff_ffff_ffff in a single mapping
> > > > > because we pass the size instead of the end address (where size here
> > > > > would be 2^64).  We can map 0x0 through 0xffff_ffff_ffff_efff, followed
> > > > > by 0xffff_ffff_ffff_f000 through 0xffff_ffff_ffff_ffff, but again, why
> > > > > would you ever need to do this?  Thanks,
> > > > > 
> > > > > Alex
> > > The thing is that it could be useful to say we map all the virtual memory.
> > Why?  And again, you can do it, just not in a single mapping, which
> > really seems like a theoretical problem since you're mapping the IOVA
> > space of a device, which lives within and consumes a small portion of
> > this address space, so you're pretty much always looking at doing
> > multiple mappings.
> 
> My fault, I used the word "map" indifferently to refer to a valid mapping
> or an invalid mapping.
> Maybe that is the source of the misunderstanding.
> 
> > 
> > > Having a size of 2^64 was a possibility.
> > > On the other hand, with the current implementation
> > > "memory_region_iommu_replay" would take a very, very long time.
> > > 
> > > In fact, depending on the IOMMU capabilities and usage we do not need
> > > to call "memory_region_iommu_replay" at that time, or even at all.
> > The mapping itself would take a long time, even with GiB pages we're
> > talking about populating 2^34 page table entries.
> 
> As seen earlier, it depends on the IOMMU architecture; some tables
> can be ignored if the entry of the parent table is invalid.
> And anyway, the host must do it.
> 
> >   Of course many of
> > those entries would need to point to the same physical page to cover
> > empty space since otherwise you'd need to run this on a processor that
> > supports more than 2^64 bits of address space, but you probably don't
> > want to waste lots of memory covering empty space, so that means a
> > smaller page size, which means orders of magnitude more mappings and
> > more space wasted in the page tables...  All of this should have some
> > useful value, which is escaping me.  Thanks,
> > 
> > Alex
> > 
> 
> Another source of misunderstanding can be that I may not understand
> the QEMU code as well as I think.
> 
> And maybe the last (I hope so) source of misunderstanding is that I did
> not explain well the goal of this, which is an optimization of the IOMMU
> for some architectures:
> 
> Depending on the IOMMU architecture, you may not be forced to map the
> memory with valid entries from the beginning.
> In fact you have the following possibility:
> - map all memory with invalid entries; maybe the host does this by
>   default (it had better)
> - only update the entries that the guest updated.
> 
> To do this you must intercept the guest updates.
> There are several ways to do the interception, depending on the host
> architecture, the IOMMU and the IOMMU-TLB architecture.
> 
> In this case, you would map all the memory the DMA may theoretically
> access with invalid entries, you would have no need to replay the IOMMU
> entries on start, and you would only update entries on each
> DMA_MAP/UNMAP call.
> 
> The migration case would need some kind of IOMMU replay, but that is
> another problem.
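
For context, a simplified sketch of the update-on-map/unmap flow being
discussed, loosely modeled on QEMU's vfio_iommu_map_notify() in
hw/vfio/common.c (error handling and several details are omitted, so treat
it as an illustration rather than the exact code): each guest IOMMU update
arrives as an IOMMUTLBEntry notification, and only that range is mapped
into, or removed from, the host IOMMU; everything else simply stays
unmapped.

/* Sketch only: forward just the entries the guest actually programs into
 * its IOMMU to the host, instead of populating the whole IOVA space. */
static void vfio_iommu_update_sketch(Notifier *n, void *data)
{
    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
    VFIOContainer *container = giommu->container;
    IOMMUTLBEntry *iotlb = data;
    hwaddr len = iotlb->addr_mask + 1;
    hwaddr xlat;
    MemoryRegion *mr;
    void *vaddr;

    /* Resolve the guest-physical target of the entry to host RAM. */
    mr = address_space_translate(&address_space_memory,
                                 iotlb->translated_addr,
                                 &xlat, &len, iotlb->perm & IOMMU_WO);
    if (!memory_region_is_ram(mr)) {
        return;                    /* not RAM, nothing to map */
    }

    if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
        /* Guest made the entry valid: map just this range on the host. */
        vaddr = memory_region_get_ram_ptr(mr) + xlat;
        vfio_dma_map(container, iotlb->iova, iotlb->addr_mask + 1,
                     vaddr, !(iotlb->perm & IOMMU_WO));
    } else {
        /* Guest invalidated the entry: remove only this host mapping. */
        vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
    }
}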
You've still lost me on the actual problem you're trying to solve.  You
want to invalidate all of the IOTLB entries for a device, but that's
exactly the starting state of an IOMMU domain and you already have the
ability to get back to that state by unmapping the entire address
space, with the caveat that it takes two calls to flush the full 2^64
bit address space.  So don't you just need to fix the overflow issue
that you've identified and add a workaround for the API issue?  I don't
understand why you'd ever need to _map_, ie. create valid IOTLB
entries, for the entire address space, but if you don't even want to
track the highest DMA address in use to make unmapping more efficient,
the tools are still there for you.  Thanks,

Alex
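
For reference, a minimal sketch of the two-call workaround mentioned above,
built on QEMU's existing vfio_dma_map(container, iova, size, vaddr,
readonly) helper from hw/vfio/common.c; the wrapper name and the
single-vaddr assumption are made up purely for illustration, not a proposed
patch.

/* Sketch only: a 64-bit size field cannot express 2^64, so cover the full
 * IOVA space by mapping everything below the last 4 KiB page, then the
 * last page itself. */
static int vfio_dma_map_all(VFIOContainer *container, void *vaddr,
                            bool readonly)
{
    const hwaddr last_page = 0xfffffffffffff000ULL;
    int ret;

    /* 0x0 .. 0xffff_ffff_ffff_efff */
    ret = vfio_dma_map(container, 0, last_page, vaddr, readonly);
    if (ret) {
        return ret;
    }

    /* 0xffff_ffff_ffff_f000 .. 0xffff_ffff_ffff_ffff */
    return vfio_dma_map(container, last_page, 0x1000,
                        (char *)vaddr + last_page, readonly);
}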