From: Mingming <cmm@us.ibm.com>
To: Theodore Tso <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>, Curt Wohlgemuth <curtw@google.com>,
ext4 development <linux-ext4@vger.kernel.org>
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache
Date: Fri, 31 Jul 2009 10:58:57 -0700
Message-ID: <1249063137.3917.8.camel@mingming-laptop>
In-Reply-To: <20090730203351.GB6833@mit.edu>
On Thu, 2009-07-30 at 16:33 -0400, Theodore Tso wrote:
> On Thu, Jul 30, 2009 at 08:30:53PM +0200, Jan Kara wrote:
> > I have to say I'm a bit worried about modify-in-place tricks - it's
> > not trivial to make sure buffer is not part of any transaction in the
> > journal, since the buffer head could have been evicted from memory, but
> > the transaction still is not fully checkpointed. Hence in memory, you
> > don't have any evidence of the fact that if the machine crashes, your
> > modify-in-place gets overwritten by journal-replay.
>
> Yeah, good point; tracking which blocks might get overwritten on a
> journal replay is tough. What we *could* do that would make this easier
> is to insert a revoke record for all extent tree blocks after the
> blocks have been written to disk (since at that point there's no need
> for that block to be replayed).
>
> Whether or not this optimization is worth it largely depends on the
> ratio between how many blocks get allocated using fallocate() and
> the average number of blocks that get written at a time by the
> application (normally an enterprise database) when it writes into the
> uninitialized area. If the average size is say, 32k, and the amount of
> space they allocate is say, 32 megs, then without doing any special
> DIO optimization, on average we will end up having to do 1024
> synchronous waits on a journal commit. If the database doesn't use
> any fallocates at all, then it will have to do a 32 meg write to
> initialize the area, followed by 32 megs of data writes, written
> randomly 32k at a time.
>
> So being aggressive with pre-zeroing extra data blocks when we convert
> uninit extents to initialized extents means that we still have to do
> some percentage of zeroing data writes combined with the extra
> journal traffic, so it's likely we haven't reduced the total disk
> bandwidth by much, and the latency improvements of not having to do
> the 32 meg zero writes get offset by the data=ordered latency hits
> when we do the journal commit.
>
> So it would seem to me that if we really want to get the full benefit
> of preallocation in the DIO case, we really do need to think about
> seeing if it's possible to bypass the journal.
>
> It may be useful here to write a benchmark that simulates the behavior
> of an enterprise database using fallocate, so we can see what the
> performance hit is of making sure we don't lose data on a crash, and
> then how much of that performance hit we can claw back with various
> optimizations.
>
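As a starting point for the benchmark suggested above, here is a minimal
sketch, not a definitive implementation. It preallocates a region and then
fills it with direct writes at random offsets, using the 32 meg / 32k figures
from the example (which gives the 1024 writes mentioned there). The names
write_offsets, run_once and sync_each are made up for illustration; the
sketch is Linux-only because it relies on os.O_DIRECT and
os.posix_fallocate, and it assumes the target file system supports O_DIRECT.

```python
import mmap
import os
import random
import time

KIB = 1024
MIB = 1024 * KIB


def write_offsets(region_bytes, write_bytes, seed=0):
    """Return every write_bytes-aligned offset in the region, in random order."""
    if write_bytes <= 0 or region_bytes % write_bytes != 0:
        raise ValueError("region must be a positive multiple of the write size")
    offsets = list(range(0, region_bytes, write_bytes))
    random.Random(seed).shuffle(offsets)
    return offsets


def run_once(path, region_bytes=32 * MIB, write_bytes=32 * KIB, sync_each=True):
    """Preallocate a region, then fill it with random direct writes.

    Returns the elapsed seconds for the write phase only.
    sync_each=True models an application that needs every write to be
    durable before it continues (an assumption of this sketch).
    """
    # An anonymous mmap is page-aligned, which O_DIRECT buffers need.
    buf = mmap.mmap(-1, write_bytes)
    buf.write(b"\xab" * write_bytes)
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o600)
    try:
        os.posix_fallocate(fd, 0, region_bytes)
        os.fsync(fd)  # keep the allocation cost out of the timed phase
        start = time.perf_counter()
        for off in write_offsets(region_bytes, write_bytes):
            os.pwrite(fd, buf, off)
            if sync_each:
                os.fdatasync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)
        buf.close()
```

Comparing run_once(path, sync_each=True) against run_once(path,
sync_each=False) on the same file system gives a rough view of what the
per-write durability wait costs. A fuller benchmark would also time the
no-fallocate case (a 32 meg zero-fill followed by the same random writes).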
Eric and I looked at the xfs code together the other day. xfs does not
ensure that the DIO metadata update (the conversion) is synced before
returning to userspace. It does ensure that the workqueue kicks off the
conversion and the journal commit, but it does not seem to wait for them
to complete. This seems to be confirmed by an xfs expert on IRC, who said
that DIO only means bypassing the page cache, and does not necessarily
mean syncing data and metadata unless the file is opened in SYNC mode.
Mingming
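
The distinction between DIO and SYNC mode shows up at the open(2) level:
on Linux, the man page notes that O_DIRECT on its own does not give the
guarantees of O_SYNC, and that O_SYNC must be used in addition to O_DIRECT
to guarantee synchronous I/O. A minimal sketch of how an application might
pick its flags follows; the function name dio_flags and the durable
parameter are made up for illustration.

```python
import os


def dio_flags(durable):
    """Build open(2) flags for direct I/O (Linux-only: uses os.O_DIRECT).

    O_DIRECT asks the kernel to bypass the page cache. O_SYNC is added
    only when each write must have its data and required metadata on
    stable storage before the call returns.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_DIRECT
    if durable:
        flags |= os.O_SYNC
    return flags
```

An application that opens with dio_flags(False) and still needs durability
at specific points would have to call fsync() or fdatasync() itself.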