linux-ext4.vger.kernel.org archive mirror
From: Ted Ts'o <tytso@mit.edu>
To: Jonathan Nieder <jrnieder@gmail.com>
Cc: linux-ext4@vger.kernel.org
Subject: Re: Bug#605009: serious performance regression with ext4
Date: Mon, 29 Nov 2010 09:44:36 -0500
Message-ID: <20101129144436.GT2767@thunk.org>
In-Reply-To: <20101129072930.GA7213@burratino>

On Mon, Nov 29, 2010 at 01:29:30AM -0600, Jonathan Nieder wrote:
> 
> >                        sync_file_range() is a Linux specific system
> > call that has been around for a while.  It allows a program to control
> > when writeback happens in a very low-level fashion.  The first set of
> > sync_file_range() system calls causes the system to start writing back
> > each file once it has finished being extracted.  It doesn't actually
> > wait for the write to finish; it just starts the writeback.
> 
> True, using sync_file_range(..., SYNC_FILE_RANGE_WRITE) for each file
> makes later fsync() much faster.  But why?  Is this a matter of allowing
> writeback to overlap with write() or is something else going on?

So what's going on is this.  dpkg is writing a series of files.
fsync() causes the following to happen: 

	* force the file specified to be written to disk; in the case
		of ext4 with delayed allocation, this means blocks
		have to be allocated, so the block bitmap gets
		dirtied, etc.
	* force a journal commit.   This causes the block bitmap,
		inode table block for the inode, etc., to be written
		to the journal, followed by a barrier operation to make
		sure all of the file system metadata, as well as the
		data blocks from the previous step, are written to disk.

If you call fsync() for each file, these two steps get done for each
file.  This means we have to do a journal commit for each and every
file.
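
For illustration, the per-file pattern described above looks roughly
like the sketch below.  This is not dpkg's actual code;
write_file_synced() is a hypothetical helper name, and the error
handling is deliberately minimal.  The point is only that every call
ends in its own fsync(), and therefore its own journal commit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from dpkg): write one file and fsync() it
 * before returning.  Called once per extracted file, this pays for
 * block allocation plus a separate journal commit every time.
 */
static int write_file_synced(const char *path, const void *buf, size_t len)
{
	assert(path != NULL);

	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	const char *p = buf;
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Forces data writeback and a journal commit for this one file. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```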

Calling sync_file_range() first, for all of the files, forces the
delayed allocation to be resolved, so all of the block bitmaps, inode
data structures, etc., are updated.  Then on the first fdatasync(),
the resulting journal commit updates all of the block bitmaps and all
of the inode table blocks, and we're done.  The subsequent
fdatasync() calls become no-ops --- which the ftrace shell script will
show.
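
A minimal sketch of that two-pass sequence, assuming the caller has
already written all of the files and still holds their descriptors
open.  flush_batch() is a hypothetical name, not an existing
interface; sync_file_range() is Linux-specific and needs _GNU_SOURCE
with glibc.  Note that sync_file_range() on its own gives no
durability guarantee; the fdatasync() pass is what provides it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush a batch of already-written files.
 * Returns 0 on success, -1 on the first failing call.
 */
static int flush_batch(const int *fds, size_t n)
{
	assert(fds != NULL || n == 0);

	/*
	 * Pass 1: start writeback on every file without waiting for it.
	 * offset 0 with nbytes 0 means "from the start to end of file".
	 */
	for (size_t i = 0; i < n; i++)
		if (sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE) < 0)
			return -1;

	/*
	 * Pass 2: make each file durable.  As described above, the first
	 * call is expected to carry the journal commit and the rest
	 * should have little or nothing left to do.
	 */
	for (size_t i = 0; i < n; i++)
		if (fdatasync(fds[i]) < 0)
			return -1;

	return 0;
}
```

The two loops are kept separate on purpose: interleaving
sync_file_range() and fdatasync() per file would bring back one
journal commit per file.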

We could imagine a new kernel interface which took an array of file
descriptors, say call it fsync_array(), which would force writeback on
all of the specified file descriptors, as well as forcing the journal
commit that would guarantee the metadata had been written to disk.
But calling sync_file_range() for each file, and then calling
fdatasync() for all of them, is something that exists today with
currently shipping kernels (and sync_file_range() has been around for
over four years; whereas a new system call wouldn't see wide
deployment for at least 2-3 years).

						- Ted


