linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Matthew Wilcox <willy@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O
Date: Wed, 9 Apr 2014 11:19:08 -0400	[thread overview]
Message-ID: <20140409151908.GD5727@linux.intel.com> (raw)
In-Reply-To: <20140409091450.GA32103@quack.suse.cz>

On Wed, Apr 09, 2014 at 11:14:50AM +0200, Jan Kara wrote:
> On Tue 08-04-14 16:21:02, Matthew Wilcox wrote:
> > On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote:
> > > > +static void dax_new_buf(void *addr, unsigned size, unsigned first,
> > > > +					loff_t offset, loff_t end, int rw)
> > > > +{
> > > > +	loff_t final = end - offset + first; /* The final byte of the buffer */
> > > > +	if (rw != WRITE) {
> > > > +		memset(addr, 0, size);
> > > > +		return;
> > > > +	}
> > >   It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do
> > > this for unwritten blocks) when reading from them. Presumably it could also
> > > have undesired effects on endurance of persistent memory. Instead I'd expect
> > > that you simply zero out user provided buffer the same way as you do it for
> > > holes.
> > 
> > I think we have to zero it here, because the second time we call
> > get_block() for a given block, it won't be BH_New any more, so we won't
> > know that it's supposed to be zeroed.
>   But how can you have BH_New buffer when you didn't ask get_blocks() to
> create any block? That would be a bug in the get_blocks() implementation...
> Or am I missing something?

Oh ... right.  So just to be clear, we're looking at the case where
we're doing a read of a filesystem block which is BH_Unwritten, but
isn't a hole ... so it's been allocated on storage and not yet written.
That's already treated as a hole:

                        if (rw == WRITE) {
...
                        } else {
                                hole = !buffer_written(bh);
                        }

and dax_new_buf is only called in the !hole case.

>   OK, but there are filesystems which do the same thing as ext4 (e.g.
> btrfs) and historically noone really cared. E.g. direct IO code advances
> only by a single block regardless of what filesystem returns when the
> buffer is unmapped. As you correctly mention, get_blocks() API isn't really
> documented so noone has really defined what should happen when you ask
> filesystem to map some blocks and there's a hole. I agree what XFS does
> looks sensible and ext4 can do the same. Hopefully this gets cleaned up
> when Dave finishes his new block mapping interface.

I hope so too!  The get_block() API has been the bane of my existance
since Christmas :-)

>   This wouldn't quite work because even ext4_map_blocks() doesn't bother to
> fill in 'map' when it finds a hole. But it won't be complicated to
> propagate the information.

Good point.

> > It'll be kind of tricky to move it because 'len' is not necessarily
> > a multiple of i_blkbits, so we can't necessarily maintain b_blocknr
> > accurately.
>   Yeah, after I understood the code I also understood why you do it the way
> you did. But we could do something like:
> ...
> +               if (!len)
> +                       break;
> + 
> 		blocks = ((offset + len) >> inode->i_blkbits) - 
> 				(offset >> inode->i_blkbits);
> 		bh->b_blocknr += blocks;
> 		bh->b_size -= blocks << inode->i_blkbits;
> +               offset += len;
> +               copied += len;
> +               addr += len;
> ...

We could ... I'm not sure it's simpler though.

> BTW: it might be good to store inode->i_blkbits in a local variable. It
> makes some expressions shorter.

Yes, good idea.  Done.

> BTW2: although direct IO uses 'offset' for position in file, the rest of
> VFS uses 'pos' for that and that seems to be less overloaded term so for me
> it would be easier if you used 'pos' instead of 'offset'. Just a
> suggestion.

Sure.  Done.

> > > > +			if (rw == WRITE) {
> > > > +				if (!buffer_mapped(bh)) {
> > > > +					retval = -EIO;
> > > > +					break;
> > >   -EIO looks like a wrong error here. Or maybe it is the right one and it
> > > only needs some explanation? The thing is that for direct IO some
> > > filesystems choose not to fill holes for direct IO and fall back to
> > > buffered IO instead (to avoid exposure of uninitialized blocks if the
> > > system crashes after blocks have been added to a file but before they were
> > > written out). For DAX you are pretty much free to define what you ask from
> > > the get_blocks() (and this fallback behavior is somewhat disputed behavior
> > > in direct IO case so you might want to differ here) but you should document
> > > it somewhere.
> > 
> > Hmm ... I thought that calling get_block() with the create argument would
> > force the return of a bh with the Mapped bit set.  Did I misunderstand that
> > aspect of the undocumented get_block() API too?
>   As you mention the API is undocumented and not really designed. So
> filesystems do whatever causes the generic code to do what they want (it's
> a mess I know). In this case, I'm warning you there are filesystems which
> refuse to fill in holes from the get_blocks() function passed to
> blockdev_direct_IO() (even ext4 does this for inodes with old
> indirect-block based on disk format). You can just define DAX fails
> horribly in these case and I'm fine with that at least in this stage. If
> someone bothers later, fallback to buffered IO can be implemented. But we
> should document this somewhere. 

Urgh.  Yeah, we should probably fall back to buffered I/O for that case.
I'll stick a comment in dax.c for now, and we can fix it later.

> > > > +	if ((flags & DIO_LOCKING) && (rw == READ)) {
> > > > +		struct address_space *mapping = inode->i_mapping;
> > > > +		mutex_lock(&inode->i_mutex);
> > > > +		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
> > > > +		if (retval) {
> > > > +			mutex_unlock(&inode->i_mutex);
> > > > +			goto out;
> > > > +		}
> > >   Is there a reason for this? I'd assume DAX has no pages in pagecache...
> > 
> > There will be pages in the page cache for holes that we page faulted on.
> > They must go!  :-)
>   Well, but this will only writeback dirty pages and if I read the code
> correctly those pages will never be dirty since dax_mkwrite() will replace
> them. Or am I missing something?

In addition to writing back dirty pages, filemap_write_and_wait_range()
will evict clean pages.  Unintuitive, I know, but it matches what the
direct I/O path does.  Plus, if we fall back to buffered I/O for holes
(see above), then this will do the right thing at that time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-04-09 15:19 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-23 19:08 [PATCH v7 00/22] Support ext4 on NV-DIMMs Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 01/22] Fix XIP fault vs truncate race Matthew Wilcox
2014-03-29 15:57   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 02/22] Allow page fault handlers to perform the COW Matthew Wilcox
2014-04-08 16:34   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 03/22] axonram: Fix bug in direct_access Matthew Wilcox
2014-03-29 16:22   ` Jan Kara
2014-04-02 19:24     ` Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 04/22] Change direct_access calling convention Matthew Wilcox
2014-03-29 16:30   ` Jan Kara
2014-04-02 19:27     ` Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 05/22] Introduce IS_DAX(inode) Matthew Wilcox
2014-04-08 15:32   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 06/22] Replace XIP read and write with DAX I/O Matthew Wilcox
2014-04-08 17:56   ` Jan Kara
2014-04-08 20:21     ` Matthew Wilcox
2014-04-09  9:14       ` Jan Kara
2014-04-09 15:19         ` Matthew Wilcox [this message]
2014-04-09 20:55           ` Jan Kara
2014-04-13 18:05             ` Matthew Wilcox
2014-04-09 12:04   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
2014-04-08 22:05   ` Jan Kara
2014-04-09 20:48     ` Matthew Wilcox
2014-04-09 21:12       ` Jan Kara
2014-04-13 11:21         ` Matthew Wilcox
2014-04-14 16:04           ` Jan Kara
2014-04-09 10:27   ` Jan Kara
2014-04-09 20:51     ` Matthew Wilcox
2014-04-09 21:43       ` Jan Kara
2014-04-13 18:03         ` Matthew Wilcox
2014-07-29 12:12         ` Matthew Wilcox
2014-07-29 21:04           ` Jan Kara
2014-07-29 21:23             ` Matthew Wilcox
2014-07-30  9:52               ` Jan Kara
2014-07-30 21:02                 ` Matthew Wilcox
2014-08-09 11:00                 ` Matthew Wilcox
2014-08-11  8:51                   ` Jan Kara
2014-08-11 14:13                     ` Matthew Wilcox
2014-08-11 14:35                       ` Jan Kara
2014-08-11 15:02                         ` Matthew Wilcox
2014-08-11 15:25                           ` Jan Kara
2014-05-21 20:35   ` Toshi Kani
2014-06-05 22:38     ` Toshi Kani
2014-03-23 19:08 ` [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
2014-04-08 22:17   ` Jan Kara
2014-04-09  9:26     ` Jan Kara
2014-04-13 19:07       ` Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 09/22] Remove mm/filemap_xip.c Matthew Wilcox
2014-04-08 18:21   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 10/22] Remove get_xip_mem Matthew Wilcox
2014-04-08 18:20   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
2014-04-09  9:46   ` Jan Kara
2014-04-10 14:16     ` Matthew Wilcox
2014-04-10 18:31       ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
2014-04-09  9:52   ` Jan Kara
2014-04-10 14:22     ` Matthew Wilcox
2014-04-10 18:35       ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 13/22] ext2: Remove ext2_use_xip Matthew Wilcox
2014-04-09  9:55   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 14/22] ext2: Remove xip.c and xip.h Matthew Wilcox
2014-04-09  9:59   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Matthew Wilcox
2014-04-09  9:59   ` Jan Kara
2014-04-10 14:23     ` Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 16/22] ext2: Remove ext2_aops_xip Matthew Wilcox
2014-04-09 10:02   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 17/22] Get rid of most mentions of XIP in ext2 Matthew Wilcox
2014-04-09 10:04   ` Jan Kara
2014-04-10 14:26     ` Matthew Wilcox
2014-04-10 18:40       ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 18/22] xip: Add xip_zero_page_range Matthew Wilcox
2014-04-09 10:15   ` Jan Kara
2014-04-10 14:27     ` Matthew Wilcox
2014-04-10 18:43       ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 19/22] ext4: Make ext4_block_zero_page_range static Matthew Wilcox
2014-03-24 19:11   ` tytso
2014-03-23 19:08 ` [PATCH v7 20/22] ext4: Add DAX functionality Matthew Wilcox
2014-04-09 12:17   ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 21/22] ext4: Fix typos Matthew Wilcox
2014-03-24 19:16   ` tytso
2014-03-23 19:08 ` [PATCH v7 22/22] brd: Rename XIP to DAX Matthew Wilcox
2014-04-09 10:07   ` Jan Kara
2014-05-18 14:58 ` [PATCH v7 00/22] Support ext4 on NV-DIMMs Boaz Harrosh
2014-05-18 23:24   ` Matthew Wilcox
2014-06-17 18:11 ` Boaz Harrosh
2014-06-17 18:19   ` Matthew Wilcox
2014-06-17 18:39     ` Boaz Harrosh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140409151908.GD5727@linux.intel.com \
    --to=willy@linux.intel.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=matthew.r.wilcox@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).