From: Jan Kara <jack@suse.cz>
To: Matthew Wilcox <willy@linux.intel.com>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: direct_access, pinning and truncation
Date: Fri, 10 Oct 2014 15:08:05 +0200
Message-ID: <20141010130805.GC25693@quack.suse.cz>
In-Reply-To: <20141008190523.GM5098@wil.cx>

On Wed 08-10-14 15:05:23, Matthew Wilcox wrote:
> 
> One of the things on my todo list is making O_DIRECT work to a
> memory-mapped direct_access file.  Right now, it simply doesn't work
> because there's no struct page for the memory, so get_user_pages() fails.
> Boaz has posted a patch to create struct pages for direct_access files,
> which is certainly one way of solving the immediate problem, but it
> ignores the deeper problem.
  Maybe we can settle on some terminology: direct IO has two 'endpoints'
(I don't want to talk about source / target because those swap between
reads and writes). One endpoint is a 'buffer' and the other is 'storage'.
Now the 'buffer' may be a memory-mapped file on some filesystem. In your
case, what isn't working is when the 'buffer' is an mmapped file on a DAX
filesystem.
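
To make the failing case concrete, here is a hypothetical userspace
snippet (paths and sizes are invented for illustration, and it assumes
the file on the DAX mount is at least 1 MiB long). The 'buffer' is an
mmap of a file on a DAX filesystem, the 'storage' is an ordinary file
opened with O_DIRECT:

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1 << 20;

	/* 'buffer' endpoint: a file living on a DAX-mounted filesystem */
	int dax_fd = open("/mnt/pmem/buffer", O_RDWR);
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 dax_fd, 0);
	if (dax_fd < 0 || buf == MAP_FAILED) {
		perror("dax buffer");
		return EXIT_FAILURE;
	}

	/* 'storage' endpoint: a regular file opened with O_DIRECT */
	int disk_fd = open("/mnt/disk/data",
			   O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (disk_fd < 0) {
		perror("open storage");
		return EXIT_FAILURE;
	}

	/*
	 * The kernel has to pin the buffer pages for the duration of
	 * the I/O; with no struct page behind the DAX mapping,
	 * get_user_pages() fails and so does the write (with EFAULT,
	 * as far as I can tell).
	 */
	if (write(disk_fd, buf, len) < 0)
		perror("O_DIRECT write");

	close(disk_fd);
	munmap(buf, len);
	close(dax_fd);
	return 0;
}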

> For normal files, get_user_pages() elevates the reference count on
> the pages.  If those pages are subsequently truncated from the file,
> the underlying file blocks are released to the filesystem's free pool.
> The pages are removed from the page cache and the process's address space,
> but hang around until the caller of get_user_pages() calls put_page() on
> them again at which point they are released into the pool of free pages.
> 
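
  As an aside, the lifecycle you describe boils down to roughly the
following pattern - a condensed sketch using the get_user_pages_fast()
variant, with error handling trimmed. The elevated reference count is
the only thing keeping the pages alive once a truncate has removed them
from the page cache:

#include <linux/mm.h>

static int pin_user_buffer(unsigned long uaddr, int nr_pages,
			   struct page **pages)
{
	/* takes a reference on each page backing the user range */
	return get_user_pages_fast(uaddr, nr_pages, 1 /* write */, pages);
}

static void unpin_user_buffer(struct page **pages, int nr_pages)
{
	int i;

	/*
	 * Dropping the last reference returns each page to the free
	 * pool; until then a truncate of the backing file releases
	 * the disk blocks but cannot reclaim the DRAM pages.
	 */
	for (i = 0; i < nr_pages; i++)
		put_page(pages[i]);
}
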
> Once we have a struct page for (or some other way to handle pinning of)
> persistent memory blocks, truncating a file that has pinned pages will
> still cause the disk blocks to be released to the free pool.  But there
> weren't any pages of DRAM between the filesystem and the application!
> So those blocks are "freed" while still referenced.  And that reference
> might well be programmed into a piece of hardware that's doing DMA;
> it can't be stopped.
> 
> I see three solutions here:
> 
> 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> the caller with the struct pages of the DRAM.  Modify DAX to handle some
> file pages being in the page cache, and make sure that we know whether
> the PMEM or DRAM is up to date.  This has the obvious downside that
> get_user_pages() becomes slow.
> 
> 2. Modify filesystems that support DAX to handle pinning blocks.
> Some filesystems (that support COW and snapshots) already support
> reference-counting individual blocks.  We may be able to do better by
> using a tree of pinned extents or something.  This makes it much harder
> to modify a filesystem to support DAX, and I don't see patches adding
> this capability to ext2 being warmly welcomed.
> 
> 3. Make truncate() block if it hits a pinned page.  There's really no
> good reason to truncate a file that has pinned pages; it's either a bug
> or you're trying to be nasty to someone.  We actually already have code
> for this: inode_dio_wait() / inode_dio_done().  But page pinning isn't
> just for O_DIRECT I/Os and other transient users like crypto, it's also
> for long-lived things like RDMA, where we could potentially block for
> an indefinite time.
  What option 3 seems to implicitly assume is that there are 'struct
pages' to pin. So do you expect to add a struct page to PFNs which are
the target of get_user_pages()? And then check whether a PFN is pinned
(has a corresponding struct page) in the truncate code?
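
Something like the following is what I'd imagine - purely a sketch, with
the helper name and the refcount heuristic invented for illustration
(the 'expected' count below is the usual guess at the page cache's and
the mappings' own references, not a proven invariant):

#include <linux/mm.h>

/*
 * Hypothetical: does anyone beyond the page cache and the page
 * tables hold a reference, i.e. a get_user_pages() pin?
 */
static bool dax_page_is_pinned(struct page *page)
{
	/* one ref for the page cache, plus one per page-table mapping */
	int expected = 1 + page_mapcount(page);

	return page_count(page) > expected;
}

Truncate could then wait for (or refuse to free) pinned pages instead of
releasing the backing blocks out from under the pin.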

Note that inode_dio_wait() isn't really what you're looking for. That waits for
DIO pending against 'storage'. Currently we don't track in any way (except
for elevated page reference counts) that 'buffer' is an endpoint of direct
IO.

Thinking the options over again, I believe trying something like 2) might
be good. I'd still attach a struct page to pinned PFNs to avoid some
trouble, but you could delay the freeing of fs blocks while they are
pinned by get_user_pages(). You could just hook into the path where the
filesystem frees blocks - e.g. ext4 effectively does this already in
ext4_mb_free_metadata(), since we free blocks in the in-memory bitmaps
only after the current transaction is committed (changes to the in-memory
bitmaps happen from ext4_journal_commit_callback(), which calls
ext4_free_data_callback()). So ext4 already handles the situation where
the in-memory bitmaps differ from the on-disk ones, and what you need is
no different.
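
Roughly like this - a hand-wavy sketch, every name invented, just to
show the deferral (a per-extent pin count stands in for whatever
tracking the attached struct pages would give us):

#include <linux/atomic.h>
#include <linux/list.h>

/* invented, minimal types for the sketch */
struct fs_extent {
	struct list_head deferred;
	atomic_t pin_count;	/* raised/dropped with get_user_pages() pins */
	unsigned long start, len;
};

static LIST_HEAD(pinned_extents);

static void fs_free_extent_blocks(struct fs_extent *ex)
{
	/* ...mark [start, start + len) free in the in-memory bitmaps... */
}

static void fs_free_extent(struct fs_extent *ex)
{
	/*
	 * Analogous to ext4 deferring the in-memory bitmap update to
	 * the commit callback: if someone still holds a pin, park the
	 * extent and only free it once the last pin is dropped.
	 */
	if (atomic_read(&ex->pin_count) > 0) {
		list_add_tail(&ex->deferred, &pinned_extents);
		return;
	}

	fs_free_extent_blocks(ex);
}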

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

