From: Eryu Guan <eguan@redhat.com>
To: Brian Foster <bfoster@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Eric Sandeen <sandeen@redhat.com>,
fstests <fstests@vger.kernel.org>,
Zheng Liu <wenqing.lz@taobao.com>,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH] test race when checking i_size on direct i/o read
Date: Thu, 21 Sep 2017 18:09:44 +0800 [thread overview]
Message-ID: <20170921100944.GX8034@eguan.usersys.redhat.com> (raw)
In-Reply-To: <20170920125529.GA16481@bfoster.bfoster>
On Wed, Sep 20, 2017 at 08:55:30AM -0400, Brian Foster wrote:
> On Wed, Sep 20, 2017 at 07:05:04PM +0800, Eryu Guan wrote:
> > On Tue, Sep 19, 2017 at 07:34:06AM -0700, Christoph Hellwig wrote:
> > > On Tue, Sep 19, 2017 at 10:13:52AM -0400, Brian Foster wrote:
> > > > Can we pass a boolean or flag to xfs_iomap_write_unwritten() to have it
> > > > update the incore i_size after unwritten extent conversion? Then move
> > > > (or remove) the associated update from xfs_dio_write_end_io().
> > >
> > > I don't think we even need a flag - all three callers of
> > > xfs_iomap_write_unwritten want to update the file size.
> >
> > I tried this approach, but seems there's some problem in the buffered
> > aio path, generic/112 (aio fsx) failed quickly. But I haven't digged
> > into the reason (maybe I screwed it up, not the method is wrong..).
> >
>
> From the generic/112 results:
>
> Size error: expected 0x3785b stat 0x38000 seek 0x38000
>
> I suspect the problem is that the offset+size from buffered I/O
> completion is not based on the inode size. Rather, it is buffer head
> granularity size of the ioend. Given that, it probably does make sense
> to skip the update from this path.
Yeah, you're right.
xfs_io -fc "falloc -k 0 4k" -c "pwrite 0 2k" -c fsync /mnt/xfs/testfile
This results in 4k file size (block size is also 4k).
>
> > Then I tried Brian's suggestion, pass a boolean to
> > xfs_iomap_write_unwritten() to tell if we want it to update in-core
> > isize after unwritten extent conversion, and skip the in-core isize
> > update in xfs_dio_write_end_io() accordingly. This approach seems to
> > work, it passed the test Eric posted here, and fstests 'aio' group
> > tests, a run of 'quick' group didn't find any new failure as well.
> >
> > I attached the WIP patch (without proper comments) I was testing, if
> > this looks fine I can format a formal patch and do more testings.
> >
>
> Thanks. This mostly looks reasonable to me...
>
> > Thanks,
> > Eryu
> >
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 29172609f2a3..288da47e9ac5 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -343,7 +343,7 @@ xfs_end_io(
> > error = xfs_reflink_end_cow(ip, offset, size);
> > break;
> > case XFS_IO_UNWRITTEN:
> > - error = xfs_iomap_write_unwritten(ip, offset, size);
> > + error = xfs_iomap_write_unwritten(ip, offset, size, false);
>
> Maybe add a single line comment here wrt to the above (why we don't
> update isize).
Will do.
>
> > break;
> > default:
> > ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_append_trans);
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 350b6d43ba23..f3ad024573e7 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -435,6 +435,7 @@ xfs_dio_write_end_io(
> > struct xfs_inode *ip = XFS_I(inode);
> > loff_t offset = iocb->ki_pos;
> > bool update_size = false;
> > + bool write_unwritten = (flags & IOMAP_DIO_UNWRITTEN);
> > int error = 0;
> >
> > trace_xfs_end_io_direct_write(ip, offset, size);
> > @@ -458,7 +459,8 @@ xfs_dio_write_end_io(
> > */
> > spin_lock(&ip->i_flags_lock);
> > if (offset + size > i_size_read(inode)) {
> > - i_size_write(inode, offset + size);
> > + if (!write_unwritten)
> > + i_size_write(inode, offset + size);
> > update_size = true;
>
> I find the logic a little confusing here. For !write_unwritten,
> update_size means to update the on-disk size. Otherwise, it instructs
> iomap_write_unwritten() to also update the in-core size. The latter also
> checks for appends, however, so it might as well always be true from
> here. It seems that for such a small function, we should be able to make
> this a bit easier to follow. ;P
Agreed, it looks confusing when update_size serves as indicators of both
in-core and on-disk size update. And we can pass 'true' unconditionally
to xfs_iomap_write_unwritten() in this case.
>
> Should we ever see an isize update at all on a IOMAP_DIO_COW completion
> (wouldn't reflinked blocks have to be within eof)? If not, then we can
> presumably rule out isize updates in that case. I think that just leaves
> the case where a dio write occurs on a pre-existing block. Hmm, could we
> just move this whole hunk down to after the iomap_write_unwritten() call
> and eliminate the need for the flag entirely? E.g., something like:
>
> ...
> if (IOMAP_DIO_COW) {
> ...
> }
>
> /* unwritten conversion updates isize */
> if (IOMAP_DIO_UNWRITTEN)
> return xfs_iomap_write_unwritten(ip, offset, size, true);
>
> if (offset + size > i_size_read(inode)) {
> i_size_write(inode, offset + size);
> error = xfs_setfilesize(...);
> }
>
> return error;
This seems fine, tests in 'clone' group all passed without new failures,
I'll update patch as you suggested. I thought about something like this
too, but I'm not familiar with the CoW path, so I decided to keep the
original logic as much as possible.
Thanks for your review and suggestions!
Eryu
>
> > }
> > spin_unlock(&ip->i_flags_lock);
> > @@ -469,8 +471,8 @@ xfs_dio_write_end_io(
> > return error;
> > }
> >
> > - if (flags & IOMAP_DIO_UNWRITTEN)
> > - error = xfs_iomap_write_unwritten(ip, offset, size);
> > + if (write_unwritten)
> > + error = xfs_iomap_write_unwritten(ip, offset, size, update_size);
> > else if (update_size)
> > error = xfs_setfilesize(ip, offset, size);
> >
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index a1909bc064e9..0a088586371e 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -829,7 +829,8 @@ int
> > xfs_iomap_write_unwritten(
> > xfs_inode_t *ip,
> > xfs_off_t offset,
> > - xfs_off_t count)
> > + xfs_off_t count,
> > + bool update_size)
> > {
> > xfs_mount_t *mp = ip->i_mount;
> > xfs_fileoff_t offset_fsb;
> > @@ -840,6 +841,7 @@ xfs_iomap_write_unwritten(
> > xfs_trans_t *tp;
> > xfs_bmbt_irec_t imap;
> > struct xfs_defer_ops dfops;
> > + struct inode *inode = VFS_I(ip);
> > xfs_fsize_t i_size;
> > uint resblks;
> > int error;
> > @@ -900,6 +902,13 @@ xfs_iomap_write_unwritten(
> > if (i_size > offset + count)
> > i_size = offset + count;
> >
> > + if (update_size) {
> > + spin_lock(&ip->i_flags_lock);
> > + if (i_size > i_size_read(inode))
> > + i_size_write(inode, i_size);
> > + spin_unlock(&ip->i_flags_lock);
>
> We have XFS_ILOCK_EXCL here so I don't think the spinlocks are
> necessary any longer. That means this could probably be condensed to
> something like
>
> if (update_size && i_size > i_size_read(inode))
> i_size_write(inode, i_size)
>
> > + }
> > +
> > i_size = xfs_new_eof(ip, i_size);
> > if (i_size) {
> > ip->i_d.di_size = i_size;
> > diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> > index 00db3ecea084..ee535065c5d0 100644
> > --- a/fs/xfs/xfs_iomap.h
> > +++ b/fs/xfs/xfs_iomap.h
> > @@ -27,7 +27,7 @@ int xfs_iomap_write_direct(struct xfs_inode *, xfs_off_t, size_t,
> > struct xfs_bmbt_irec *, int);
> > int xfs_iomap_write_allocate(struct xfs_inode *, int, xfs_off_t,
> > struct xfs_bmbt_irec *);
> > -int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
> > +int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t, bool);
> >
> > void xfs_bmbt_to_iomap(struct xfs_inode *, struct iomap *,
> > struct xfs_bmbt_irec *);
> > diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
> > index 2f2dc3c09ad0..4246876df7b7 100644
> > --- a/fs/xfs/xfs_pnfs.c
> > +++ b/fs/xfs/xfs_pnfs.c
> > @@ -274,7 +274,7 @@ xfs_fs_commit_blocks(
> > (end - 1) >> PAGE_SHIFT);
> > WARN_ON_ONCE(error);
> >
> > - error = xfs_iomap_write_unwritten(ip, start, length);
> > + error = xfs_iomap_write_unwritten(ip, start, length, false);
>
> Note that this path does update isize. It runs another transaction to
> get around the fact that write_unwritten() wouldn't log a new on disk
> size. There is some validation thing here though, so we might need to
> check whether it is Ok to run that earlier and whether it should
> continue to return an error on failure. Christoph?
>
> That said, maybe a follow on patch would be better since this one might
> be stable fodder.
>
> Brian
>
> > if (error)
> > goto out_drop_iolock;
> > }
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
prev parent reply other threads:[~2017-09-21 10:09 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <799f63d9-de03-eab3-1e6e-cd747aa04c36@redhat.com>
2017-09-19 7:36 ` [PATCH] test race when checking i_size on direct i/o read Eryu Guan
2017-09-19 14:13 ` Brian Foster
2017-09-19 14:34 ` Christoph Hellwig
2017-09-19 14:58 ` Brian Foster
2017-09-20 11:05 ` Eryu Guan
2017-09-20 12:55 ` Brian Foster
2017-09-21 10:09 ` Eryu Guan [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170921100944.GX8034@eguan.usersys.redhat.com \
--to=eguan@redhat.com \
--cc=bfoster@redhat.com \
--cc=fstests@vger.kernel.org \
--cc=hch@infradead.org \
--cc=linux-xfs@vger.kernel.org \
--cc=sandeen@redhat.com \
--cc=wenqing.lz@taobao.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).