* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:18 ` Bernd Schubert
@ 2010-11-29 15:37 ` Ted Ts'o
2010-11-29 15:54 ` Eric Sandeen
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Ted Ts'o @ 2010-11-29 15:37 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Jonathan Nieder, linux-ext4
On Mon, Nov 29, 2010 at 04:18:24PM +0100, Bernd Schubert wrote:
>
> Wouldn't it make sense to modify ext4 or even the vfs to do that on
> close() itself? Most applications expect the file to be on disk
> after a close anyway, and I also don't see a good reason why one
> should delay disk write-back after close any longer (well, there
> are exceptions if the application is broken, such as ha-logd used
> by pacemaker, which did an open, seek, write, flush, close sequence
> for each log line..., but at least we have fixed that in -hg now).
I can think of plenty of cases where it wouldn't make sense to do that
on a close(). For example, it would dramatically slow down compiles:
you really don't want to force writeback to start when the compiler
finishes writing an intermediate .o file. And there are often temporary
files which are created and then deleted very shortly afterwards;
forcing writeback just because the file has been closed would be
pointless.
Now, a hint that could be set via an open flag, or via fcntl(), saying
that *this* file is one that should really be written at close() time
--- that would probably be a good idea, if application/library authors
would actually use it.
- Ted
* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:18 ` Bernd Schubert
2010-11-29 15:37 ` Ted Ts'o
@ 2010-11-29 15:54 ` Eric Sandeen
2010-11-29 16:20 ` Bernd Schubert
2010-11-29 16:27 ` Florian Weimer
2010-11-29 20:50 ` Andreas Dilger
3 siblings, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2010-11-29 15:54 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Ted Ts'o, Jonathan Nieder, linux-ext4
On 11/29/10 9:18 AM, Bernd Schubert wrote:
> On Monday, November 29, 2010, Ted Ts'o wrote:
>> By using sync_file_range() first, for all files, this forces the
>> delayed allocation to be resolved, so all of the block bitmaps, inode
>> data structures, etc., are updated. Then on the first fdatasync(),
>> the resulting journal commit updates all of the block bitmaps and all
of the inode table blocks, and we're done. The subsequent
>> fdatasync() calls become no-ops --- which the ftrace shell script will
>> show.
>
> Wouldn't it make sense to modify ext4 or even the vfs to do that on close()
> itself? Most applications expect the file to be on disk after a close anyway
but those applications would be wrong.
http://www.flamingspork.com/talks/
Eat My Data: How Everybody Gets File IO Wrong
-Eric
> and I also don't see a good reason why one should delay disk write-back
> after close any longer (well, there are exceptions if the application is
> broken, such as ha-logd used by pacemaker, which did an open, seek,
> write, flush, close sequence for each log line..., but at least we have
> fixed that in -hg now).
>
>
> Cheers,
> Bernd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:54 ` Eric Sandeen
@ 2010-11-29 16:20 ` Bernd Schubert
0 siblings, 0 replies; 8+ messages in thread
From: Bernd Schubert @ 2010-11-29 16:20 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Ted Ts'o, Jonathan Nieder, linux-ext4
On Monday, November 29, 2010, Eric Sandeen wrote:
> On 11/29/10 9:18 AM, Bernd Schubert wrote:
> > On Monday, November 29, 2010, Ted Ts'o wrote:
> >> By using sync_file_range() first, for all files, this forces the
> >> delayed allocation to be resolved, so all of the block bitmaps, inode
> >> data structures, etc., are updated. Then on the first fdatasync(),
> >> the resulting journal commit updates all of the block bitmaps and all
of the inode table blocks, and we're done. The subsequent
> >> fdatasync() calls become no-ops --- which the ftrace shell script will
> >> show.
> >
> > Wouldn't it make sense to modify ext4 or even the vfs to do that on
> > close() itself? Most applications expect the file to be on disk after a
> > close anyway
>
> but those applications would be wrong.
Of course they are, I don't deny that. But denying that most applications
expect the file to be on disk after a close() also denies reality, in my
experience.
And IMHO, temporary files such as those Ted pointed out should either go
to tmpfs or be specially flagged with something like O_TMP. Unfortunately,
that changes semantics, so indeed the only way left is to do it the other
way around, as Ted suggested.
Cheers,
Bernd
* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:18 ` Bernd Schubert
2010-11-29 15:37 ` Ted Ts'o
2010-11-29 15:54 ` Eric Sandeen
@ 2010-11-29 16:27 ` Florian Weimer
2010-11-29 20:50 ` Andreas Dilger
3 siblings, 0 replies; 8+ messages in thread
From: Florian Weimer @ 2010-11-29 16:27 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Ted Ts'o, Jonathan Nieder, linux-ext4
* Bernd Schubert:
> Wouldn't it make sense to modify ext4 or even the vfs to do that on close()
> itself? Most applications expect the file to be on disk after a close anyway
> and I also don't see a good reason why one should delay disk write-back
> after close any longer (well, there are exceptions if the application is
> broken, such as ha-logd used by pacemaker, which did an open, seek, write,
> flush, close sequence for each log line..., but at least we have fixed
> that in -hg now).
If you use Oracle Berkeley DB in a process-based fashion, it is
crucial for decent performance that the memory-mapped file containing
the cache is not flushed to disk when the database environment is
closed prior to process termination. Perhaps flushing could be
delayed until the last open file handle is gone. In any case, it's a
pretty drastic change, which should probably be tunable with a
(generic) mount option.
--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:18 ` Bernd Schubert
` (2 preceding siblings ...)
2010-11-29 16:27 ` Florian Weimer
@ 2010-11-29 20:50 ` Andreas Dilger
3 siblings, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2010-11-29 20:50 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Ted Ts'o, Jonathan Nieder, linux-ext4
On 2010-11-29, at 08:18, Bernd Schubert wrote:
> Wouldn't it make sense to modify ext4 or even the vfs to do that on
> close() itself? Most applications expect the file to be on disk after
> a close anyway and I also don't see a good reason why one should delay
> disk write-back after close any longer (well, there are exceptions if
> the application is broken, such as ha-logd used by pacemaker, which did
> an open, seek, write, flush, close sequence for each log line..., but
> at least we have fixed that in -hg now).
This would be terrible for applications like tar that create many hundreds or thousands of files. And doesn't NFS internally open/close the file for every write?
There would now be an implicit fsync and disk cache flush for every created file. It would be impossible to create or extract more than about 100 files/second on an HDD due to seek limitations, even if the files are tiny and do not fill memory.
I can imagine that it might make sense to _start_ writeback sooner than the VM currently does, if an application is not repeatedly opening, writing, and closing the same file, since this is otherwise dead time in the IO pipeline that could be better utilized. This kind of background writeout shouldn't trigger a cache flush for each file, so that multiple writes can be aggregated more efficiently.
Lustre has always been more aggressive than the VM in starting writeout when there are good-sized chunks of data to be written, or when there are a lot of small files that are no longer being modified, and this significantly improves performance when IO is bursty, as it is in most real-world cases.
Cheers, Andreas