All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Mason <chris.mason@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Linux Kernel Developers List <linux-kernel@vger.kernel.org>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>
Subject: Re: [GIT PULL] Ext3 latency fixes
Date: Thu, 09 Apr 2009 14:10:03 -0400	[thread overview]
Message-ID: <1239300603.31257.26.camel@think.oraclecorp.com> (raw)
In-Reply-To: <20090409174932.GD26552@atrey.karlin.mff.cuni.cz>

On Thu, 2009-04-09 at 19:49 +0200, Jan Kara wrote:
> > On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> > > 
> > > On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > > > 
> > > > One of these patches fixes a performance regression caused by a64c8610,
> > > > which unplugged the write queue after every page write.  Now that Jens
> > > > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > > > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > > > to further improbve ext3 latency, especially during the "sync" command
> > > > in Linus's write-big-file-and-sync workload.
> > > 
> > > So here's a question and a untested _conceptual_ patch. 
> > > 
> > > The kind of writeback mode I'd personally prefer would be more of a 
> > > mixture of the current "data=writeback" and "data=ordered" modes, with 
> > > something of the best of both worlds. I'd like the data writeback to get 
> > > _started_ when the journal is written to disk, but I'd like it to not 
> > > block journal updates.
> > > 
> > > IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> > > be totally unordered either.
> > > 
> > 
> > I started working on the xfs style i_size updates last night, and here's
> > my current (most definitely broken) proof of concept.  I call it
> > data=guarded.
> > 
> > In guarded mode the on disk i_size is not updated until after the data
> > writes are complete.  I've got a per FS work queue and I'm abusing
> > bh->b_private as a list pointer.  So, what happens is:
> > 
> > * writepage sets up the buffer with the guarded end_io handler
> > 
> > * The end_io handler puts the buffer onto the per-sb list of guarded
> > buffers and then it kicks the work queue
> > 
> > * The workqueue updates the on disk i_size to the min of the end of the
> > buffer or the in-memory i_size, and then it logs the inode.
> > 
> > * Then the regular async bh end_io handler is called to end writeback on
> > the page.
> > 
> > One big gotcha is that we starting a transaction while a page is
> > writeback.  It means that anyone who waits for writeback to finish on
> > the datapage with a transaction running could deadlock against the work
> > queue func trying to start a transaction.
>   For ext3 I don't think anyone waits for PageWriteback with a
> transaction open. We definitely don't do it from ext3 code and generic
> code does usually sequence like:
>   lock_page(page);
>   ...
>   wait_on_page_writeback(page)
> 
>   and because lock ordering is page_lock < transaction start, we
> shouldn't have transaction open at that point.
>   But with ext4 it may be different - there, the lock ordering is
> transaction start > page_lock and so above code could well have
> transaction started.
>   Wouldn't it actually be better to update i_size when the page is
> fully written out after we clear PG_writeback as you write below?

It would, but then we have to take a ref on the inode and risk iput
leading to inode deletion in the handler that is supposed to be doing IO
completion.  It's icky either way ;)

The nice part with doing it before writeback is that we know that when
we wait for page writeback, we don't also have to wait for i_size update
to be finished.

If we go this route and it gets copied to ext4, we can weigh our options
I guess.

>   One thing which does not seem to be handled is that your code can
> happily race with truncate. So IO completion could reset i_size which
> has been just set by truncate. And I'm not sure how to handle this
> effectively. Generally I'm not sure how much this is going to cost...
> 

Yeah, I was worried about that.  What ends up happening is the setattr
call sets the disk i_size and then calls inode_setattr, who calls
vmtruncate who actually waits on the writeback.

Then, we wander into the ext3 truncate who resets disk i_size down
again.  It's a pretty strange window of updates, but I was thinking it
would work to cut down i_size, wait on IO, then cut it down again in
setattr.

Once we wait on all IO past the new in-memory i_size, writepage won't
send any more down.  So updating disk i_size after the wait should be
enough.

-chris



  reply	other threads:[~2009-04-09 18:11 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-08 23:40 [GIT PULL] Ext3 latency fixes Theodore Ts'o
2009-04-09 15:49 ` Linus Torvalds
2009-04-09 16:23   ` Chris Mason
2009-04-09 17:49     ` Jan Kara
2009-04-09 18:10       ` Chris Mason [this message]
2009-04-09 19:04         ` Jan Kara
2009-04-14  2:29           ` [RFC] ext3 data=guarded was " Chris Mason
2009-04-09 17:36   ` Jan Kara
  -- strict thread matches above, loose matches on Subject: below --
2009-04-03  7:01 Theodore Ts'o
2009-04-03 18:24 ` Linus Torvalds
2009-04-03 18:47   ` Jens Axboe
2009-04-03 19:13     ` Theodore Tso
2009-04-03 21:01     ` Chris Mason
2009-04-03 19:02   ` Linus Torvalds
2009-04-03 20:41     ` Linus Torvalds
2009-04-04 13:57       ` Theodore Tso
2009-04-04 15:16         ` Jens Axboe
2009-04-04 15:57           ` Linus Torvalds
2009-04-04 16:06             ` Linus Torvalds
2009-04-04 17:36               ` Jens Axboe
2009-04-04 17:34             ` Jens Axboe
2009-04-04 17:44               ` Linus Torvalds
2009-04-04 18:00                 ` Trenton D. Adams
2009-04-04 18:01                 ` Jens Axboe
2009-04-04 18:10                   ` Linus Torvalds
2009-04-04 23:22                   ` Theodore Tso
2009-04-04 23:33                     ` Arjan van de Ven
2009-04-05  0:10                       ` Theodore Tso
2009-04-05 15:05                         ` Arjan van de Ven
2009-04-05 17:01                         ` Linus Torvalds
2009-04-05 17:15                           ` Mark Lord
2009-04-05 20:57                             ` Jeff Garzik
2009-04-05 23:48                               ` Arjan van de Ven
2009-04-06  2:32                                 ` Mark Lord
2009-04-06  5:47                                 ` Jeff Garzik
2009-04-07 18:18                                   ` Linus Torvalds
2009-04-07 18:22                                     ` Linus Torvalds
2009-04-06  8:13                             ` Jens Axboe
2009-04-05 18:56                           ` Arjan van de Ven
2009-04-05 19:34                             ` Linus Torvalds
2009-04-05 20:06                               ` Arjan van de Ven
2009-04-06  6:25                               ` Jens Axboe
2009-04-06  6:05                           ` Theodore Tso
2009-04-06  6:23                           ` Jens Axboe
2009-04-06  8:16                       ` Jens Axboe
2009-04-06 14:48                         ` Linus Torvalds
2009-04-06 15:09                           ` Jens Axboe
2009-04-06  6:15                     ` Jens Axboe
2009-04-04 20:18               ` Ingo Molnar
2009-04-06 21:50                 ` Lennart Sorensen
2009-04-07 13:31                   ` Mark Lord
2009-04-07 14:48                     ` Lennart Sorensen
2009-04-07 19:21                       ` Mark Lord
2009-04-07 19:57                         ` Lennart Sorensen
2009-04-04 20:56               ` Arjan van de Ven
2009-04-06  7:06                 ` Jens Axboe
2009-04-07 15:39             ` Indan Zupancic
2009-04-04 19:18           ` Theodore Tso
2009-04-06  8:12             ` Jens Axboe
2009-04-04 22:13         ` Linus Torvalds
2009-04-04 22:19           ` Linus Torvalds
2009-04-05  0:20           ` Theodore Tso
2009-04-03 19:54   ` Theodore Tso

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1239300603.31257.26.camel@think.oraclecorp.com \
    --to=chris.mason@oracle.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.