From: Dave Chinner <david@fromorbit.com>
To: Steven Whitehouse <swhiteho@redhat.com>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>, Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org,
Christoph Hellwig <hch@infradead.org>,
Dmitry Monakhov <dmonakhov@openvz.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Zheng Liu <wenqing.lz@taobao.com>,
Lukas Czerner <lczerner@redhat.com>
Subject: Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
Date: Mon, 23 Dec 2013 14:00:06 +1100 [thread overview]
Message-ID: <20131223030006.GD3220@dastard> (raw)
In-Reply-To: <1387531724.2739.13.camel@menhir>
On Fri, Dec 20, 2013 at 09:28:44AM +0000, Steven Whitehouse wrote:
> On Fri, 2013-12-20 at 09:44 +1100, Dave Chinner wrote:
[snip]
> > Ok, buffered writes explain all those symptoms, and that means what
> > you are seeing is a buffered write/direct IO read race condition.
> > If the first page is missing data from the direct IO read, that
> > implies that the inode size has been updated by the buffered write
> > path, but a direct Io read of zeroes means the data in the page
> > cache has not been flushed to disk. Given that the direct IO read
> > path does a filemap_write_and_wait_range() call before issuing the
> > direct IO, that implies the first page was not flushed to disk by
> > the cache flush.
> >
> > I'm not sure why that may be the case, but that's where I'd
> > start looking.....
> >
> Indeed - I've just spent the last couple of days looking at exactly
> this :-) I'm shortly going to lose access to the machine I've been doing
> tests on until the New Year, so progress may slow down for a little
> while on my side.
>
> However there is nothing obvious in the trace. It looks like the write
> I/O gets pushed out in two chunks, the earlier one containing the "first
> page" missing block along with other blocks, and the second one is
> pushed out by the filemap_write_and_wait_range added in the patch I
> posted yesterday. Both have completed before we send out the read
> (O_DIRECT) which results in a single I/O to disk - again exactly what
> I'd expect to see. There are two calls to bmap for the O_DIRECT read,
> the first one returns short (correctly since we hit EOF) and the second
> one returns unmapped since it is a request to map the remainder of the
> file, which is beyond EOF.
Oh, interesting that there are two bmap calls for the read. Have a
look at do_direct_IO(), specifically the case where we hit the
buffer_unmapped/do_holes code:
do_holes:
/* Handle holes */
if (!buffer_mapped(map_bh)) {
loff_t i_size_aligned;
/* AKPM: eargh, -ENOTBLK is a hack */
if (dio->rw & WRITE) {
page_cache_release(page);
return -ENOTBLK;
}
/*
* Be sure to account for a partial block as the
* last block in the file
*/
i_size_aligned = ALIGN(i_size_read(dio->inode),
1 << blkbits);
if (sdio->block_in_file >=
i_size_aligned >> blkbits) {
/* We hit eof */
page_cache_release(page);
goto out;
}
zero_user(page, block_in_page << blkbits,
1 << blkbits);
sdio->block_in_file++;
block_in_page++;
goto next_block;
}
So, if we hit a hole, we read the inode size to determine if we are
beyond EOF. If we are beyond EOF, we're done. If not, the block gets
zeroed, and we move on to the nect block.
Now, GFS2 does not set DIO_LOCKING, so the direct IO code is not
holding the i_mutex across the read, which means it is probably
racing with the buffered write (I'm not sure what locking gfs2 does
here). Hence if the buffered write completes a page and updates the
inode size between the point in time that the direct IO read mapped
a hole and then checked the EOF, it will zero the page rather than
returning a short read, and then move on to reading the next
block. If the next block is still in the hole but is beyond EOF it
will then abort, returning a short read of a zeroed page....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2013-12-23 3:00 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-07 10:55 [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Zheng Liu
2013-12-16 9:37 ` Steven Whitehouse
2013-12-16 10:01 ` Jan Kara
2013-12-17 9:43 ` Steven Whitehouse
2013-12-17 11:16 ` Zheng Liu
2013-12-17 12:17 ` Steven Whitehouse
2013-12-17 16:41 ` Zheng Liu
2013-12-19 12:27 ` Steven Whitehouse
2013-12-19 22:44 ` Dave Chinner
2013-12-20 9:28 ` Steven Whitehouse
2013-12-23 3:00 ` Dave Chinner [this message]
2014-01-14 15:22 ` Steven Whitehouse
2014-01-14 19:19 ` Jan Kara
2014-01-15 7:19 ` Zheng Liu
2014-01-16 15:35 ` [RFC] [PATCH] Fix race when checking i_size on direct i/o read Steven Whitehouse
2014-01-17 10:20 ` Miklos Szeredi
2014-01-24 14:42 ` Steven Whitehouse
2014-01-27 10:13 ` Jan Kara
2013-12-17 14:01 ` [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Jan Kara
2013-12-17 15:32 ` Steven Whitehouse
2013-12-17 16:39 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131223030006.GD3220@dastard \
--to=david@fromorbit.com \
--cc=dmonakhov@openvz.org \
--cc=gnehzuil.liu@gmail.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=lczerner@redhat.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=swhiteho@redhat.com \
--cc=viro@zeniv.linux.org.uk \
--cc=wenqing.lz@taobao.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).