From: Jason Wang <jasowang@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Liu, Yi L" <yi.l.liu@intel.com>,
	yi.y.sun@linux.intel.com, qemu-devel <qemu-devel@nongnu.org>,
	mst <mst@redhat.com>
Subject: Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode
Date: Wed, 24 Nov 2021 17:01:42 +0800	[thread overview]
Message-ID: <CACGkMEsohbTvbFhMaZ_aAHpyJdbB4xcp6zRzaVYZXAZzCN7Vyw@mail.gmail.com> (raw)
In-Reply-To: <YZ39USAfW7i1oAOO@xz-m1.local>

On Wed, Nov 24, 2021 at 4:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 24, 2021 at 04:28:52PM +0800, Jason Wang wrote:
> > On Wed, Nov 24, 2021 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > When booting with scalable mode, I hit this error:
> > > >
> > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=01:00:00, iova=0xfffff002)
> > > > qemu-system-x86_64: New fault is not recorded due to compression of faults
> > > >
> > > > This is because the SNP bit is set since Linux kernel commit
> > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > > FL for IOVA"), where the SNP bit is set if scalable mode is on, though
> > > > this seems to be a violation of the spec, which says the SNP bit is
> > > > considered reserved if SC is not supported.
> > >
> > > When I was reading that commit, I was actually confused by this change:
> > >
> > > ---8<---
> > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > --- a/drivers/iommu/intel/iommu.c
> > > +++ b/drivers/iommu/intel/iommu.c
> > > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct intel_iommu *skip)
> > >         rcu_read_lock();
> > >         for_each_active_iommu(iommu, drhd) {
> > >                 if (iommu != skip) {
> > > -                       if (!ecap_sc_support(iommu->ecap)) {
> > > +                       /*
> > > +                        * If the hardware is operating in the scalable mode,
> > > +                        * the snooping control is always supported since we
> > > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > > +                        * is managed outside (UNMANAGED).
> > > +                        */
> > > +                       if (!sm_supported(iommu) &&
> > > +                           !ecap_sc_support(iommu->ecap)) {
> > >                                 ret = 0;
> > >                                 break;
> > >                         }
> > > ---8<---
> > >
> > > Does it mean that some hardware with sm_supported()==true will have the
> > > SC bit cleared in the ecap register?
> >
> > I guess not, so it's probably only the problem of vIOMMU.
>
> But then what does the code mean above?
>
> If SC is required for scalable mode,

I guess the code was tested on hardware with both SC and SM.

> ecap_sc_support()==false already implies
> sm_supported()==false too.  Then that check seems redundant.

To tell the truth, I'm not sure. And according to the spec, SNP is
reserved if SC==false.

>
> >
> > > That sounds odd, and not sure why.  Maybe Yi
> > > Liu or Yi Sun may know?
> >
> > Another interesting point is that, it looks to me after that commit
> > SNP is used for the domain that is not UNMANAGED even if PGSNP is not
> > set.
> >
> >
> > >
> > > >
> > > > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > > > the future we may consider adding SC support.
> > >
> > > Oh yes, I remembered the last time we discussed this.  Could you remind me
> > > what's missing for us to support SC?
> >
> > Exactly what you described below.
> >
> > >
> > > IIUC, for common device emulations we can just declare SC==1, right?
> >
> > Strictly speaking, it's only safe for datapaths running inside QEMU. For
> > things like vhost-user, I'm not sure the backend can check CC when
> > using VFIO.
>
> Hmm yeah.. though I'd just say those vhost-user backends fall into the
> "device assignment" category below too.  Great to know we're on the same page here.

I see.

>
> >
> > >  As all
> > > the DMAs (including kernel accels like vhost) will be from host processors so
> > > there're no coherent issues with guest vcpu threads.
> > >
> > > If that's correct, the only challenge is device assignment in any form (I am
> > > not familiar with vdpa; so perhaps that includes vfio, vpda and any other kind
> > > of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> > > capability from the host iommu groups that covers the assigned devices, and we
> > > only set SC==1 if we have cache coherency on all the devices?
> >
> > For VFIO yes, and we should prevent VFIO devices without CC from being
> > plugged if SC is advertised.
> >
> > For vDPA, we don't need to worry about it at all, kernel vDPA forces
> > IOMMU_CACHE now.
> >
> > vhost_vdpa_alloc_domain():
> >
> >         if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> >                 return -ENOTSUPP;
> >
> > (For devices with an on-chip IOMMU, it's the parent and the device that
> > guarantee CC.)
>
> Ah right, yes you mentioned it and I forgot..  Though I'm not sure we'd simply
> double-check again here (if we'll support vfio anyway, then we'll need to be
> able to read those out from IOMMU groups), because we shouldn't rely on that
> fact which is an implementation detail of vdpa, imho (say, when vdpa starts to
> support !SC someday).

Good point, but at that point we would need to expose !SC via uAPI,
which is currently not supported.

>
> PS: I have other comments below in previous reply - please have a look too! :-D

Sorry for missing that part :(

>
> >
> > Thanks
> >
> >
> > >
> > > >
> > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > ---
> > > >  hw/i386/intel_iommu.c          | 18 ++++++++++++------
> > > >  hw/i386/intel_iommu_internal.h |  2 ++
> > > >  2 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index 294499ee20..3bcac56c3e 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -969,7 +969,8 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> > > >  static uint64_t vtd_spte_rsvd[5];
> > > >  static uint64_t vtd_spte_rsvd_large[5];
> > > >
> > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > +                                   uint64_t slpte, uint32_t level)
> > > >  {
> > > >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > >
> > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > >          rsvd_mask = vtd_spte_rsvd_large[level];
> > > >      }
> > > >
> > > > +    if (s->scalable_mode) {
> > > > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > > > +    }
> > >
> > > IMHO what we want to do is only to skip the leaves of pgtables on SNP, so maybe
> > > we still want to keep checking the bit 11 reserved for e.g. common pgtable dir
> > > entries?

Maybe, but it's probably a question that can only be answered by
Intel. I can change it in the next version if you insist.

> > >
> > > To do so, how about directly modifying the vtd_spte_rsvd* fields in vtd_init()?
> > > I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
> > >
> > >   - vtd_spte_rsvd[1] (4K)
> > >   - vtd_spte_rsvd_large[2] (2M)
> > >   - vtd_spte_rsvd_large[3] (1G)
> > >
> > > What do you think?  Then we avoid passing IntelIOMMUState* all over too.

I started with a version like that :). It should work; I will change to
that approach if everyone agrees.

The reason I changed to passing IntelIOMMUState is that it results in a
smaller changeset: I tend to introduce new rsvd bits for SM mode, since
after checking VT-d spec 3.3 it looks like it has different reserved-bit
requirements than before (at least 1.2).
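For reference, a self-contained sketch of the init-time alternative (the mask tables mirror the shape of QEMU's vtd_spte_rsvd / vtd_spte_rsvd_large, but the values are placeholders, not the real reserved masks; which levels count as leaves follows the 4K/2M/1G list above):

```c
#include <stdbool.h>
#include <stdint.h>

#define VTD_SPTE_SNP (1ULL << 11)

/* Placeholder tables indexed by page-table level, as in intel_iommu.c. */
static uint64_t spte_rsvd[5];
static uint64_t spte_rsvd_large[5];

/* Clear bit 11 (SNP) from the reserved masks of leaf entries only --
 * 4K (spte_rsvd[1]), 2M (spte_rsvd_large[2]), 1G (spte_rsvd_large[3]) --
 * so directory entries still have bit 11 checked as reserved.  Done once
 * at init time, so vtd_slpte_nonzero_rsvd() needs no IntelIOMMUState. */
static void init_rsvd_masks(bool scalable_mode)
{
    for (int i = 0; i < 5; i++) {
        /* Placeholder mask: SNP plus an arbitrary high reserved bit. */
        spte_rsvd[i] = spte_rsvd_large[i] = VTD_SPTE_SNP | (1ULL << 62);
    }
    if (scalable_mode) {
        spte_rsvd[1]       &= ~VTD_SPTE_SNP;  /* 4K leaf */
        spte_rsvd_large[2] &= ~VTD_SPTE_SNP;  /* 2M leaf */
        spte_rsvd_large[3] &= ~VTD_SPTE_SNP;  /* 1G leaf */
    }
}
```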

Thanks

>
> [Here]
>
> > >
> > > > +
> > > >      return slpte & rsvd_mask;
> > > >  }
> > > >
> > > > @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
> > > >                                iova, level, slpte, is_write);
> > > >              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> > > >          }
> > > > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > > >              error_report_once("%s: detected splte reserve non-zero "
> > > >                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
> > > >                                "slpte=0x%" PRIx64 ")", __func__, iova,
> > > > @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent *event, vtd_page_walk_info *info)
> > > >   * @write: whether parent level has write permission
> > > >   * @info: constant information for the page walk
> > > >   */
> > > > -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > > > +static int vtd_page_walk_level(IntelIOMMUState *s,
> > > > +                               dma_addr_t addr, uint64_t start,
> > > >                                 uint64_t end, uint32_t level, bool read,
> > > >                                 bool write, vtd_page_walk_info *info)
> > > >  {
> > > > @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > > >              goto next;
> > > >          }
> > > >
> > > > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > > >              trace_vtd_page_walk_skip_reserve(iova, iova_next);
> > > >              goto next;
> > > >          }
> > > > @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > > >               * This is a valid PDE (or even bigger than PDE).  We need
> > > >               * to walk one further level.
> > > >               */
> > > > -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> > > > +            ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, info->aw),
> > > >                                        iova, MIN(iova_next, end), level - 1,
> > > >                                        read_cur, write_cur, info);
> > > >          } else {
> > > > @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
> > > >          end = vtd_iova_limit(s, ce, info->aw);
> > > >      }
> > > >
> > > > -    return vtd_page_walk_level(addr, start, end, level, true, true, info);
> > > > +    return vtd_page_walk_level(s, addr, start, end, level, true, true, info);
> > > >  }
> > > >
> > > >  static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > index 3d5487fe2c..a6c788049b 100644
> > > > --- a/hw/i386/intel_iommu_internal.h
> > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > >  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
> > > >
> > > >  /* Rsvd field masks for spte */
> > > > +#define VTD_SPTE_SNP 0x800ULL
> > > > +
> > > >  #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
> > > >          dt_supported ? \
> > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > --
> > > Peter Xu
> > >
> >
>
> --
> Peter Xu
>



Thread overview: 21+ messages
2021-11-24  6:03 [PATCH] intel-iommu: ignore SNP bit in scalable mode Jason Wang
2021-11-24  7:57 ` Peter Xu
2021-11-24  8:28   ` Jason Wang
2021-11-24  8:52     ` Peter Xu
2021-11-24  9:01       ` Jason Wang [this message]
2021-11-24  9:23         ` Peter Xu
2021-11-24  9:35           ` Jason Wang
2021-11-24 10:10             ` Peter Xu
2021-11-25  2:34               ` Jason Wang
2021-11-25  5:59             ` Liu, Yi L
2021-11-25  6:04     ` Liu, Yi L
2021-11-25  4:03   ` Liu, Yi L
2021-11-25  4:30     ` Peter Xu
2021-11-25  5:49       ` Liu, Yi L
2021-11-25  6:13         ` Peter Xu
2021-11-28  7:06           ` Liu, Yi L
2021-11-29  1:19             ` Peter Xu
2021-11-29  2:28               ` Jason Wang
2021-11-29  3:13                 ` Peter Xu
2021-11-29  3:19                   ` Jason Wang
2021-11-29  5:37                   ` Liu, Yi L
