public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@ziepe.ca>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Keith Busch <kbusch@meta.com>,
	kvm@vger.kernel.org, Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH rfc] vfio-pci: Allow write combining
Date: Fri, 2 Aug 2024 08:53:08 -0300	[thread overview]
Message-ID: <20240802115308.GA676757@ziepe.ca> (raw)
In-Reply-To: <20240801121657.20f0fdb4.alex.williamson@redhat.com>

On Thu, Aug 01, 2024 at 12:16:57PM -0600, Alex Williamson wrote:
> On Thu, 1 Aug 2024 14:53:39 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Thu, Aug 01, 2024 at 11:33:44AM -0600, Alex Williamson wrote:
> > > On Thu, 1 Aug 2024 14:13:55 -0300
> > > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >   
> > > > On Thu, Aug 01, 2024 at 10:52:18AM -0600, Alex Williamson wrote:  
> > > > > > > vfio_region_info.flags in not currently tested for input therefore this
> > > > > > > proposal could lead to unexpected behavior for a caller that doesn't
> > > > > > > currently zero this field.  It's intended as an output-only field.      
> > > > > > 
> > > > > > Perhaps a REGION_INFO2 then?
> > > > > > 
> > > > > > I still think per-request is better than a global flag    
> > > > > 
> > > > > I don't understand why we'd need a REGION_INFO2, we already have
> > > > > support for defining new regions.    
> > > > 
> > > > It is not a new region, it is a modified mmap behavior for an existing
> > > > region.  
> > > 
> > > If we're returning a different offset into the vfio device file from
> > > which to get a WC mapping, what's the difference?   
> > 
> > I think it is a pretty big difference.. The offset is just a "mmap
> > cookie", it doesn't have to be 1:1 with the idea of a region.
> > 
> > > A vfio "region" is
> > > describing a region or range of the vfio device file descriptor.  
> > 
> > I'm thinking a region is describing an area of memory that is
> > available in the VFIO device. The offset output is just a "mmap
> > cookie" to tell userspace how to mmap it. Having N mmap cookies for 1
> > region is OK.
> 
> Is an "mmap cookie" an offset into the vfio device file where mmap'ing
> that offset results in a WC mapping to a specific device resource?

Yes

> Isn't that just a region that doesn't have an index or supporting
> infrastructure?

No? It is a "mmap cookie" that has the requested mmap-time properties.

Today the kernel side binds the mmap offset to the index in a computed
way, but from a API perspective userspace does REGION_INFO and gets
back an opaque "mmap cookie" that it uses to pass to mmap to get back
the thing describe by REGION_INFO. Userspace has no idea about any
structure to the cookie numbers.
  
> > Which controls what NC/WC the mmap creates when called:
> > 
> > +	if (vdev->bar_write_combine[index])
> > +		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> > +	else
> > +		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > 
> > You get the same output from REGION_INFO, same number of regions.
> 
> It was your proposal that introduced REQ_WC, this is Keith's original
> proposal.  I'm equating a REQ_WC request inventing an "mmap cookie" as
> effectively the same as bringing a lightweight region into existence
> because it defines a section of the vfio device file to have specific
> mmap semantics.

Well, again, it is not a region, it is just a record that this mmap
cookie uses X region with Y mapping flags. The number of regions don't
change. Critically from a driver perspective the number of regions and
region indexes wouldn't change.

I am not thing of making a new region, I am thinking of adjusting how
mmap works. Like today we do this:

	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

So maybe instead it is something like:

      vfio_decode_mmap_cookie(vma, &index, &flags);
      if (flags & VFIO_MMAP_WC)
          prot = pgprot_writecombine(..)

From a driver perspective the region indexes don't change at all. This
is what I think is important here.

> > It was the other proposal from long ago that created more regions.
> > 
> > This is what I like and would prefer to stick with. REGION_INFO
> > doesn't really change, we don't have two regions refering to the same
> > physical memory, and we find some way to request NC/WC of a region at
> > mmap time.
> 
> "At mmap time" means that something in the vma needs to describe to us
> to use the WC semantics, where I think you're proposing that the "mmap
> cookie" provides a specific vm_pgoff which we already use to determine
> the region index.  

Yes

> So whether or not we want to call this a region,
> it's effectively in the same address space as regions.  Therefore "mmap
> cookie" ~= "region offset".

Well, that is just the current implementation. What we did in RDMA
when we switched from hard coded mmap cookies to dynamic ones is
use an xarray (today this should be a maple tree) to dynamically
allocate mmap cookies whenever the driver returns something to
userspace. During the mmap fop the pgoff is fed back through the maple
tree to get the description of what the cookie represents.

So the encoding of cookies is completely disjoint from whatever the
underlying thing is. If you want the same region to be mapped with two
or three different prot flags you just ask for two or three cookies
and at mmap time you can recover the region pointer and the mmap
flags.

So VFIO could do several different things here to convay the mmap
flags through the cookie, including somehow encoding it in a pgoff
bit, or using a dynamic maple tree scheme.

My point is to not confuse the pgoff encoding with the driver concept
of a region. The region is a single peice of memory, the "mmap cookie"s
are just handles to it. Adding more data to the handle is not the same
as adding more regions.

> > > At the limit they're the same.  We could use a
> > > DEVICE_FEATURE to ask vfio to selectively populate WC regions after
> > > which the user could re-enumerate additional regions, or in fact to
> > > switch on WC for a given region if we want to go that route.  Thanks,  
> > 
> > This is still adding more regions and reporting more stuff from
> > REGION_INFO, that is what I would like to avoid.
> 
> Why?  This reminds me of hidden registers outside of capability chains
> in PCI config space.  Thanks,

The mmap offsets are not (supposed to be) ABI in the VFIO ioctls. The
encoding is entirely opaque inside the kernel already. Apps are
supposed to use REGION_INFO to learn the value to pass to mmap. ie
things like VFIO_PCI_OFFSET_SHIFT are not in the uAPI header.

Jason

  reply	other threads:[~2024-08-02 11:53 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-31 15:53 [PATCH rfc] vfio-pci: Allow write combining Keith Busch
2024-08-01 14:19 ` Jason Gunthorpe
2024-08-01 15:41   ` Alex Williamson
2024-08-01 16:11     ` Jason Gunthorpe
2024-08-01 16:52       ` Alex Williamson
2024-08-01 17:13         ` Jason Gunthorpe
2024-08-01 17:33           ` Alex Williamson
2024-08-01 17:53             ` Jason Gunthorpe
2024-08-01 18:16               ` Alex Williamson
2024-08-02 11:53                 ` Jason Gunthorpe [this message]
2024-08-02 17:05                   ` Alex Williamson
2024-08-06 16:53                     ` Jason Gunthorpe
2024-08-06 18:43                       ` Alex Williamson
2024-08-07 14:19                         ` Jason Gunthorpe
2024-08-07 17:46                           ` Alex Williamson
2024-08-13 18:02                             ` Jason Gunthorpe
2024-08-02 14:24             ` Keith Busch
2024-08-02 14:33               ` Jason Gunthorpe
2024-08-06  7:19                 ` Tian, Kevin
2024-08-06 16:47                   ` Jason Gunthorpe
2024-08-15  5:05               ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240802115308.GA676757@ziepe.ca \
    --to=jgg@ziepe.ca \
    --cc=alex.williamson@redhat.com \
    --cc=kbusch@kernel.org \
    --cc=kbusch@meta.com \
    --cc=kvm@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox