All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Curt Wohlgemuth <curtw@google.com>
Cc: Mingming <cmm@us.ibm.com>, Theodore Tso <tytso@mit.edu>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	Eric Sandeen <sandeen@redhat.com>,
	Andreas Dilger <adilger@sun.com>
Subject: Re: Fallocate and DirectIO
Date: Fri, 24 Jul 2009 23:32:25 +0530	[thread overview]
Message-ID: <20090724180225.GA29851@skywalker> (raw)
In-Reply-To: <6601abe90907240930t5232329byb1c2c6930abcb473@mail.gmail.com>

On Fri, Jul 24, 2009 at 09:30:08AM -0700, Curt Wohlgemuth wrote:
> On Thu, Jul 23, 2009 at 5:56 PM, Mingming<cmm@us.ibm.com> wrote:
> > On Thu, 2009-07-23 at 08:56 -0700, Curt Wohlgemuth wrote:
> >> On Tue, Jul 21, 2009 at 6:27 PM, Curt Wohlgemuth<curtw@google.com> wrote:
> >> > I spent a bit of time looking at this today.
> >> >
> >> > On Fri, Jun 12, 2009 at 10:33 AM, Theodore Tso<tytso@mit.edu> wrote:
> >> >> On Fri, Jun 12, 2009 at 06:01:12PM +0530, Aneesh Kumar K.V wrote:
> >> >>> Hi,
> >> >>>
> >> >>> I noticed yesterday that a write to fallocate
> >> >>> space via directIO results in fallback to buffer_IO. ie the userspace
> >> >>> pages get copied to the page cache and then call a sync.
> >> >>>
> >> >>> I guess this defeat the purpose of using directIO. May be we should
> >> >>> consider this a high priority bug.
> >> >
> >> > My simple experiment -- without a journal -- shows that you're
> >> > observation is correct.  *Except* if FALLOC_FL_KEEP_SIZE is used in
> >> > the fallocate() call, in which case the page cache is *not* used.
> >> >
> >> > Pseudo-code example:
> >> >
> >> >  open(O_DIRECT)
> >> >  fallocate(mode, 512MB)
> >> >  while (! written 100MB)
> >> >     write(64K)
> >> >  close()
> >> >
> >> > If mode == FALLOC_FL_KEEP_SIZE, then no page cache is used.
> >> > Otherwise, we *do* go through the page cache.
> >> >
> >> > It comes down to the fact that, since the i_size is not updated with
> >> > KEEP_SIZE, then ext4_get_block() is called with create = 1, since the
> >> > block that's needed is "beyond" the file end.
> >>
> > I think so.
> > In the case of KEEP_SIZE, get_block() is called with create=1 before dio
> > submit the real data IO, thus dio get a chance to convert the
> > uninitalized extents to initialized before returns back to the caller.
> 
> Ah, I see this now in ext4_direct_IO().  Thanks.
> 
> > But in the case of non KEEP_SIZE, i.e. updating i_size after fallocate()
> > case, we now have to fall back to buffered IO to ensure the extents
> > conversion is happened in an ordering. Because if we convert the extents
> > before submit the IO, and this conversion reached to disk, if system
> > crash before the real data IO finished, then it could expose the stale
> > data out, as the extent has already marked "initialized".
> 
> Yes, that makes sense -- since i_size already covers the formerly
> uninitialized, now initialized, extents.
> 
> >> Ted, given your concerns over the performance impact of updating the
> >> extents during direct I/O writes, it would seem that the fact that
> >> when KEEP_SIZE is specified we do the DMA (and don't go through the
> >> page cache) would be a problem/bug.  At least, it seems that the
> >> performance issue is the same regardless of whether KEEP_SIZE is used
> >> on the fallocate or not: in both we're dealing with an uninitialized
> >> extent.  Do you agree?
> >
> > Here is what I thought...
> >
> > I think updating the extents itself is not a big performance concern, In
> > the non KEEP_SIZE case, if we don't want to fall back to buffered IO,
> > ext4 DIO has to wait for the journal to commit the transaction which
> > converts extents to complete, before DIO could return back apps, this
> > could be a big latency. That seems what xfs does.
> 
> Wouldn't this still be an exposure to stale data?  The only way for
> this to work, if i_size already covers the uninit extents, is to make
> sure the data goes to disk before the extents get converted and
> committed.  Since the extents are converted in the ext4_get_block()
> path, before DIO actually performs the data write, this seems to be
> too late.
> 
> > For KEEP_SIZE case, The conversion actually could happen before the
> > related IO reach to disk, I guess the oraph inode list protects stale
> > data get exposed in this case.
> 
> I'm sorry, I don't follow you here.
> 
> >> I'm exploring (a) what this performance penalty is for the journal
> >> commit; and (b) can we at least avoid the page cache if your
> >> conditions above (no journal commit; no new extent blocks) are met.
> >
> > In fact, in the case of no journal, as long as the extents conversion
> > happens after the data IO reach to disk, it should be safe, am I right?
> > If system crash before the extent conversion finish, we only lost
> > recently updated IO, but won't expose the stale data out, as the extents
> > is still marked as uninitialized.
> 
> But again, the extent conversion (and mark_inode_dirty()) happens at
> get_block time, before the data goes to disk.
> 
> For KEEP_SIZE, this isn't an exposure because i_size prevents the data
> from being read.  But without KEEP_SIZE, this would seem to be a
> problem, right?
> 
> (From a practical perspective, there's also a problem getting real DIO
> to work without KEEP_SIZE in the fallocate():  the decision to send
> "create=0" to ext4_get_block() happens in VFS code, and there's no way
> to tell in the get_block path that "this is a 'no create' for a write,
> vs. a read.)

What we need is to track I/O's untill they hit the disk. This will
help us to do data=guarded and also help in the above case. So
for directIO we should use blockdev_direct_IO_own_locking and the get_block
used should split the uninit extent the needed way but still mark it
uninit. That would make sure a read will see the uninit extent and return
zero as expected. Now on IO completion we should mark split uninit extent
as init.

-aneesh
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2009-07-24 18:02 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-12 12:31 Fallocate and DirectIO Aneesh Kumar K.V
2009-06-12 17:33 ` Theodore Tso
2009-07-22  1:27   ` Curt Wohlgemuth
2009-07-23 15:56     ` Curt Wohlgemuth
2009-07-24  0:56       ` Mingming
2009-07-24 16:30         ` Curt Wohlgemuth
2009-07-24 18:02           ` Aneesh Kumar K.V [this message]
2009-07-24 18:18             ` Curt Wohlgemuth
2009-07-24 18:40               ` Aneesh Kumar K.V

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090724180225.GA29851@skywalker \
    --to=aneesh.kumar@linux.vnet.ibm.com \
    --cc=adilger@sun.com \
    --cc=cmm@us.ibm.com \
    --cc=curtw@google.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=sandeen@redhat.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.