From: Jerome Glisse <jglisse@redhat.com>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [LSF/MM TOPIC] Direct block mapping through fs for device
Date: Thu, 25 Apr 2019 21:38:14 -0400
Message-ID: <20190426013814.GB3350@redhat.com>
I see that there are still empty spots in the LSF/MM schedule, so I would like to
have a discussion on allowing direct block mapping of files for devices (NIC,
GPU, FPGA, ...). This is an mm, fs and block discussion, though the mm side
is pretty light, i.e. only adding 2 callbacks to vm_operations_struct:
    int (*device_map)(struct vm_area_struct *vma,
                      struct device *importer,
                      struct dma_buf **bufp,
                      unsigned long start,
                      unsigned long end,
                      unsigned flags,
                      dma_addr_t *pa);
    // Some flags I can think of:
    DEVICE_MAP_FLAG_PIN               // i.e. return a dma_buf object
    DEVICE_MAP_FLAG_WRITE             // importer wants to be able to write
    DEVICE_MAP_FLAG_SUPPORT_ATOMIC_OP // importer wants to do atomic operations
                                      // on the mapping
    void (*device_unmap)(struct vm_area_struct *vma,
                         struct device *importer,
                         unsigned long start,
                         unsigned long end,
                         dma_addr_t *pa);
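Purely as illustration, a filesystem wiring the proposed callbacks into its
vm_operations_struct might look like the sketch below; the myfs_* names are
made up, the .fault/.map_pages helpers are the usual page cache ones:

    static const struct vm_operations_struct myfs_file_vm_ops = {
            .fault          = filemap_fault,
            .map_pages      = filemap_map_pages,
            .page_mkwrite   = myfs_page_mkwrite,    /* hypothetical fs helper */
            .device_map     = myfs_device_map,      /* proposed callback */
            .device_unmap   = myfs_device_unmap,    /* proposed callback */
    };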
Each filesystem could add these callbacks and decide whether or not to allow
the importer to directly map blocks. The filesystem can use whatever logic it
wants to make that decision. For instance, if there are pages in the page cache
for the range, then it can say no and the device would fall back to main
memory. The filesystem can also update its internal data structures to keep
track of direct block mappings.
If the filesystem decides to allow the direct block mapping, then it forwards the
request to the block device, which itself can decide to forbid the direct
mapping for any reason: for instance, running out of BAR space, peer to peer
between the block device and the importer device not being supported, or the
block device not wanting to allow writeable peer mappings ...
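A rough sketch of the filesystem side, just to make that decision flow
concrete: filemap_range_has_page() is the existing page cache check, while
blkdev_device_map() is a hypothetical block layer entry point that does not
exist today, and real code would also have to resolve the file range to
extents first.

    static int myfs_device_map(struct vm_area_struct *vma,
                               struct device *importer,
                               struct dma_buf **bufp,
                               unsigned long start,
                               unsigned long end,
                               unsigned flags,
                               dma_addr_t *pa)
    {
            struct inode *inode = file_inode(vma->vm_file);
            loff_t first = ((loff_t)vma->vm_pgoff << PAGE_SHIFT) +
                           (start - vma->vm_start);
            loff_t last = first + (end - start) - 1;

            /* Say no if any page of the range lives in the page cache. */
            if (filemap_range_has_page(inode->i_mapping, first, last))
                    return -EBUSY;

            /* Hypothetical: ask the block device for a peer mapping of the
             * blocks backing [first, last]; it may refuse too (no BAR space,
             * no p2p path to the importer, read-only policy, ...). */
            return blkdev_device_map(inode->i_sb->s_bdev, importer, bufp,
                                     first, last, flags, pa);
    }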
So the event flow is:
    1  program mmaps a file (and never intends to access it with the CPU)
    2  program tries to access the mmap from a device A
    3  device A driver sees the device_map callback on the vma and calls it
    4a on success, device A driver programs the device with the mapped dma address
    4b on failure, device A driver falls back to faulting so that it can use
       pages from the page cache
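A minimal importer side sketch of that flow, assuming hypothetical driver
helpers dev_program_dma() and dev_fault_in_pages() for steps 4a and 4b:

    static int importer_map_range(struct device *dev,
                                  struct vm_area_struct *vma,
                                  unsigned long start, unsigned long end)
    {
            dma_addr_t pa;
            int ret = -ENXIO;

            /* Step 3: use the callback if the fs (or exporter) provides it. */
            if (vma->vm_ops && vma->vm_ops->device_map)
                    ret = vma->vm_ops->device_map(vma, dev, NULL, start, end,
                                                  DEVICE_MAP_FLAG_WRITE, &pa);
            if (!ret)
                    return dev_program_dma(dev, start, end, pa);    /* 4a */

            /* Step 4b: fall back to faulting pages from the page cache. */
            return dev_fault_in_pages(dev, vma, start, end);
    }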
This API assumes that the importer does support mmu notifiers and thus that
the fs can invalidate the device mapping at _any_ time by sending mmu notifier
invalidations to all mappings of the file (for a given range in the file or for
the whole file). Obviously you want to minimize disruption and thus only
invalidate when necessary.
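For reference, zapping the CPU mappings of the range is enough to fire those
mmu notifier invalidations on every mm that maps the file, so the fs side of
an invalidation could be as simple as the sketch below (the byte range being
whatever the filesystem needs to pull back):

    static void myfs_invalidate_device_mappings(struct inode *inode,
                                                loff_t start, loff_t len)
    {
            /* even_cows = 1: also drop private COW copies of the range. */
            unmap_mapping_range(inode->i_mapping, start, len, 1);
    }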
The dma_buf parameter can be used to add pinning support for filesystems that
wish to support that case too. Here the mapping lifetime gets disconnected
from the vma and is transferred to the dma_buf allocated by the filesystem. Again,
the filesystem can decide to say no, as pinning blocks has drastic consequences
for the filesystem and the block device.
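For the pinned case, a filesystem that agrees would export a dma_buf through
the standard dma_buf_export() API and hand it back via bufp, roughly like the
sketch below; myfs_dmabuf_ops and the priv cookie are hypothetical:

    static int myfs_pin_range(struct dma_buf **bufp, void *priv, size_t size)
    {
            DEFINE_DMA_BUF_EXPORT_INFO(exp_info);   /* <linux/dma-buf.h> */
            struct dma_buf *buf;

            exp_info.ops = &myfs_dmabuf_ops;        /* assumed fs dma_buf ops */
            exp_info.size = size;
            exp_info.flags = O_RDWR;
            exp_info.priv = priv;

            buf = dma_buf_export(&exp_info);
            if (IS_ERR(buf))
                    return PTR_ERR(buf);

            *bufp = buf;
            return 0;
    }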
This has some similarities to the hmmap and caching topic (which is mapping
blocks directly to the CPU, AFAIU), but device mapping can cut some corners: for
instance, some devices can forgo atomic operations on such mappings and thus
can work over PCIe, while the CPU cannot do atomics to a PCIe BAR.
Also, this API can be used to allow peer to peer access between devices
when the vma is an mmap of a device file, and thus the vm_operations_struct comes
from some exporter device driver. So the same 2 vm_operations_struct callbacks
can be used in more cases than what I just described here.
So I would like to gather people's feedback on the general approach and a few
things like:
 - Do block devices need to be able to invalidate such mappings too?
   It is easy for the fs to invalidate, as it can walk the file's
   mappings, but the block device does not know about files.
 - Do we want to provide some generic implementation to share across
   filesystems?
 - Maybe some shared helpers for block devices that could track the
   files corresponding to peer mappings?
Cheers,
Jérôme