From: Andrew Morton <akpm@linux-foundation.org>
To: Andreas Dilger <adilger@clusterfs.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Marat Buharov <marat.buharov@gmail.com>,
Mike Galbraith <efault@gmx.de>,
LKML <linux-kernel@vger.kernel.org>,
Jens Axboe <jens.axboe@oracle.com>,
"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
Alex Tomas <alex@clusterfs.com>
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Date: Fri, 27 Apr 2007 15:18:37 -0700
Message-ID: <20070427151837.f1439639.akpm@linux-foundation.org>
In-Reply-To: <20070427193130.GD5967@schatzie.adilger.int>
On Fri, 27 Apr 2007 13:31:30 -0600
Andreas Dilger <adilger@clusterfs.com> wrote:
> On Apr 27, 2007 08:30 -0700, Linus Torvalds wrote:
> > On a good filesystem, when you do "fsync()" on a file, nothing at all
> > happens to any other files. On ext3, it seems to sync the global journal,
> > which means that just about *everything* that writes even a single byte
> > (well, at least anything journalled, which would be all the normal
> > directory ops etc) to disk will just *stop* dead cold!
> >
> > It's horrid. And it really is ext3, not "fsync()".
> >
> > I used to run reiserfs, and it had its problems, but this was the
> > "feature" of ext3 that I've disliked most. If you run a MUA with local
> > mail, it will do fsync's for most things, and things really hiccup if you
> > are doing some other writes at the same time. In contrast, with reiser, if
> > you did a big untar or some other big write, if somebody fsync'ed a small
> > file, it wasn't even a blip on the radar - the fsync would sync just that
> > small thing.
>
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too. The reason is
> that if a journal commit doesn't flush the data as well then a crash will
> result in garbage (from old deleted files) being visible in the newly
> allocated file. People used to complain about this with reiserfs all the
> time having corrupt data in new files after a crash, which is why I believe
> it was fixed.

People still complain about hey-my-files-are-all-full-of-zeroes on XFS.

> There definitely are some problems with the ext3 journal commit though.
> If the journal is full it will cause the whole journal to checkpoint out
> to the filesystem synchronously even if just space for a small transaction
> is needed. That is doubly bad if you have a very large journal. I believe
> Alex has a patch to have it checkpoint much smaller chunks to the fs.
>

We can make great improvements here, and I've (twice) previously described
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer.

Basically, do ordered-data with a commit-time inode walk, calling
do_sync_mapping_range().

Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.

And guess what? We can then partly fix _this_ problem too. If we're
running a commit on behalf of fsync(inode1) and we come across an inode2
which doesn't have any block allocation metadata in this commit, we don't
need to sync inode2's pages.

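To make the skip rule concrete, here is a rough user-space sketch of what
such a commit-time walk might look like. It is a toy model, not kernel code:
struct toy_inode, toy_sync_mapping() and toy_commit_walk() are all made-up
names, and toy_sync_mapping() merely stands in for whatever per-inode data
writeout (do_sync_mapping_range() in the proposal) the real thing would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these types or functions exist in the kernel. */
struct toy_inode {
	int  ino;              /* inode number, for identification only */
	int  dirty_pages;      /* dirty pagecache pages not yet written */
	bool alloc_in_commit;  /* this commit carries block-allocation
	                          metadata for the inode */
};

/* Stand-in for the per-inode data writeout; it "writes" every dirty
 * page and returns how many pages that was. */
static int toy_sync_mapping(struct toy_inode *inode)
{
	int written = inode->dirty_pages;

	inode->dirty_pages = 0;
	return written;
}

/*
 * Walk the inodes attached to a commit. An inode's data is written only
 * if the commit contains allocation metadata for it (ordered mode must
 * not expose stale blocks) or if it is the inode being fsync()ed.
 * fsync_target may be NULL for a commit not driven by fsync().
 * Returns the total number of pages written.
 */
int toy_commit_walk(struct toy_inode *inodes, size_t n,
		    const struct toy_inode *fsync_target)
{
	int total = 0;

	assert(inodes != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct toy_inode *inode = &inodes[i];

		/* inode2 case: no new allocations, not the fsync target */
		if (!inode->alloc_in_commit && inode != fsync_target)
			continue;

		total += toy_sync_mapping(inode);
	}
	return total;
}
```

With three inodes where only the third has allocation metadata in the
commit, calling toy_commit_walk() on behalf of an fsync of the first writes
the first and third and leaves the second's dirty pages alone, which is the
"big untar doesn't stall a small fsync" behaviour under discussion.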
Weep. It's times like this when I want to escape all this patch-wrangling
nonsense and go do some real stuff.