* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:18 ` Bernd Schubert
@ 2010-11-29 15:37 ` Ted Ts'o
2010-11-29 15:54 ` Eric Sandeen
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Ted Ts'o @ 2010-11-29 15:37 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Jonathan Nieder, linux-ext4
On Mon, Nov 29, 2010 at 04:18:24PM +0100, Bernd Schubert wrote:
>
> Wouldn't it make sense to modify ext4 or even the vfs to do that on
> close() itself? Most applications expect the file to be on disk
> after a close anyway, and I also don't see a good reason why one
> should delay disk write-back after close any longer (well, there
> are exceptions if the application is broken, such as ha-logd used
> by pacemaker, which did an open, seek, write, flush, close sequence
> for each log line..., but at least we have fixed that in -hg now).
I can think of plenty of cases where it wouldn't make sense to do that
on a close(). For example, it would dramatically slow down compiles:
you really don't want to force writeback to start when the compiler
finishes writing an intermediate .o file. And there are often temporary
files which are created and then deleted very shortly afterwards;
forcing writeback just because the file has been closed would be
pointless.
Now, a hint that could be set via an open flag, or via fcntl(), saying
that *this* file is one that should really be written at close() time
--- that would probably be a good idea, if application/library authors
would actually use it.
- Ted
* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:18 ` Bernd Schubert
2010-11-29 15:37 ` Ted Ts'o
@ 2010-11-29 15:54 ` Eric Sandeen
2010-11-29 16:20 ` Bernd Schubert
2010-11-29 16:27 ` Florian Weimer
2010-11-29 20:50 ` Andreas Dilger
3 siblings, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2010-11-29 15:54 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Ted Ts'o, Jonathan Nieder, linux-ext4
On 11/29/10 9:18 AM, Bernd Schubert wrote:
> On Monday, November 29, 2010, Ted Ts'o wrote:
>> By using sync_file_range() first, for all files, this forces the
>> delayed allocation to be resolved, so all of the block bitmaps, inode
>> data structures, etc., are updated. Then on the first fdatasync(),
>> the resulting journal commit updates all of the block bitmaps and all
of the inode table blocks, and we're done. The subsequent
>> fdatasync() calls become no-ops --- which the ftrace shell script will
>> show.
>
> Wouldn't it make sense to modify ext4 or even the vfs to do that on close()
> itself? Most applications expect the file to be on disk after a close anyway
but those applications would be wrong.
http://www.flamingspork.com/talks/
Eat My Data: How Everybody Gets File IO Wrong
-Eric
> and I also don't see a good reason why one should delay disk write-back
> after close any longer (well, there are exceptions if the application is
> broken, such as ha-logd used by pacemaker, which did an open, seek,
> write, flush, close sequence for each log line..., but at least we have
> fixed that in -hg now).
>
>
> Cheers,
> Bernd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:54 ` Eric Sandeen
@ 2010-11-29 16:20 ` Bernd Schubert
0 siblings, 0 replies; 8+ messages in thread
From: Bernd Schubert @ 2010-11-29 16:20 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Ted Ts'o, Jonathan Nieder, linux-ext4
On Monday, November 29, 2010, Eric Sandeen wrote:
> On 11/29/10 9:18 AM, Bernd Schubert wrote:
> > On Monday, November 29, 2010, Ted Ts'o wrote:
> >> By using sync_file_range() first, for all files, this forces the
> >> delayed allocation to be resolved, so all of the block bitmaps, inode
> >> data structures, etc., are updated. Then on the first fdatasync(),
> >> the resulting journal commit updates all of the block bitmaps and all
of the inode table blocks, and we're done. The subsequent
> >> fdatasync() calls become no-ops --- which the ftrace shell script will
> >> show.
> >
> > Wouldn't it make sense to modify ext4 or even the vfs to do that on
> > close() itself? Most applications expect the file to be on disk after a
> > close anyway
>
> but those applications would be wrong.
Of course they are, I don't deny that. But denying that most applications
expect the file to be on disk after a close() also denies reality, in my
experience.
And IMHO, temporary files such as those Ted pointed out should either go
to tmpfs or be specially flagged with something like O_TMP. Unfortunately,
that changes semantics, so indeed the only way left is to do it the other
way around, as Ted suggested.
Cheers,
Bernd
* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:18 ` Bernd Schubert
2010-11-29 15:37 ` Ted Ts'o
2010-11-29 15:54 ` Eric Sandeen
@ 2010-11-29 16:27 ` Florian Weimer
2010-11-29 20:50 ` Andreas Dilger
3 siblings, 0 replies; 8+ messages in thread
From: Florian Weimer @ 2010-11-29 16:27 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Ted Ts'o, Jonathan Nieder, linux-ext4
* Bernd Schubert:
> Wouldn't it make sense to modify ext4 or even the vfs to do that on close()
> itself? Most applications expect the file to be on disk after a close anyway
> and I also don't see a good reason why one should delay disk write-back
> after close any longer (well, there are exceptions if the application is
> broken, such as ha-logd used by pacemaker, which did an open, seek, write,
> flush, close sequence for each log line..., but at least we have fixed
> that in -hg now).
If you use Oracle Berkeley DB in a process-based fashion, it is
crucial for decent performance that the memory-mapped file containing
the cache is not flushed to disk when the database environment is
closed prior to process termination. Perhaps flushing could be
delayed until the last open file handle is gone. In any case, it's a
pretty drastic change, which should probably be tunable with a
(generic) mount option.
--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
* Re: Bug#605009: serious performance regression with ext4
2010-11-29 15:18 ` Bernd Schubert
` (2 preceding siblings ...)
2010-11-29 16:27 ` Florian Weimer
@ 2010-11-29 20:50 ` Andreas Dilger
3 siblings, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2010-11-29 20:50 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Ted Ts'o, Jonathan Nieder, linux-ext4
On 2010-11-29, at 08:18, Bernd Schubert wrote:
> Wouldn't it make sense to modify ext4 or even the vfs to do that on
> close() itself? Most applications expect the file to be on disk after
> a close anyway and I also don't see a good reason why one should delay
> disk write-back after close any longer (well, there are exceptions if
> the application is broken, such as ha-logd used by pacemaker, which did
> an open, seek, write, flush, close sequence for each log line..., but
> at least we have fixed that in -hg now).
This would be terrible for applications like tar that create many hundreds or thousands of files. And doesn't NFS internally open/close the file for every write?
There would now be an implicit fsync and disk cache flush for every created file. It would be impossible to create or extract more than about 100 files/second on an HDD due to seek limitations, even if the files are tiny and do not fill memory.
I can imagine that it might make sense to _start_ writeback sooner than the VM currently does, if an application is not repeatedly opening, writing, and closing the same file, since this is otherwise dead time in the IO pipeline that could be better utilized. This kind of background writeout shouldn't trigger a cache flush for each file, so that multiple writes can be aggregated more efficiently.
Lustre has always been more aggressive than the VM in starting writeout when there are good-sized chunks of data to be written, or when there are a lot of small files that are no longer being modified, and this significantly improves performance when IO is bursty, as it is in most real-world cases.
Cheers, Andreas