From: Jan Kara <jack@suse.cz>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>, Dmitry Monakhov <dmonakhov@openvz.org>,
linux-ext4@vger.kernel.org
Subject: Re: Uninitialized extent races
Date: Fri, 21 Dec 2012 17:19:29 +0100 [thread overview]
Message-ID: <20121221161929.GF17357@quack.suse.cz> (raw)
In-Reply-To: <20121221031151.GA5014@thunk.org>
On Thu 20-12-12 22:11:51, Ted Tso wrote:
> On Fri, Dec 21, 2012 at 02:25:26AM +0100, Jan Kara wrote:
> > Am I missing something Dmitry? Also I was wondering about one thing: Does
> > anybody see a problem with disabling merging of uninitialized extents
> > completely? It would simplify the code (end_io conversion doesn't need to
> > potentially split extents) and the case when we really want to merge
> > extents - i.e., when someone calls fallocate() on small chunks - doesn't
> > seem like the case we need to optimize for?
>
> Which case specifically are you talking about here?
>
> Are you talking about the merging of _formerly_ uninitialized extents?
> i.e., what keeps the extent tree from exploding if you fallocate one
> megabyte region, and then write to all 256 blocks of that one megabyte
> region, except in a random order?
>
> Or something else?
No, I'm speaking about merging currently uninitialized extents. I.e.
suppose someone does the following on a filesystem with dioread_nolock so
that writeback happens via unwritten extents:
fd = open("file", O_RDWR);
pwrite(fd, buf, 4096, 0);
flusher thread starts writing
we create uninitialized extent for
range 0-4096
fallocate(fd, 0, 4096, 4096);
- we merge extents and now have just 1 uninitialized extent for range
0-8192
ext4_convert_unwritten_extents() now
has to split the extent to finish
the IO.
Now splitting the extent requires number of credits proportional to the
tree depth, maybe even allocation... And strictly speaking number of
credits is impossible to reliably estimate until you hold i_data_sem (tree
can grow until we hold that semaphore) which is too late - we need to start
a transaction before we take that semaphore.
So if we disabled merging of extents that are currently uninitialized,
above problem couldn't happen. We would know we only convert that one
extent and possibly merge it in the leaf to other extents.
> > Also it would bound the amount of transaction credits we need for
> > conversion to 1 block which would make it easier for me to change
> > ext4 to clear PageWriteback only after extent conversion is done
> > (again code simplification, more uniform handling of page
> > writeback).
>
> So I'm confused. If it's the case that we're thinking about, we only
> need a single transaction credit, because we're not currently merging
> across adjacent interior extent tree blocks.
>
> Can you be a bit more explicit about which case you're thinking about?
> I do agree that the extent tree code is too complicated, but we also
> have the problem that we probably been more merging, not less, since
> we can already end up with a case where you start with a single extent
> tree block after fallocating a gigabyte or two. Then after writing
> randomly into that gigabyte file using AIO, we can end up with a very
> deep, spindly extent tree containing multiple interior extent tree
> blocks, because we're not doing sufficient merging --- and in
> particular, we currently have no way at all of decreasing the depth of
> the extent tree.
Yeah, so I agree that's a problem in some cases but my suggestion
shouldn't really make this any worse. We *will* still be merging normal
extents after they are converted in end_io handler... Just we won't merge
them while they are still uninitialized.
And I regarding more merging, that could be done (obviously), just we might
need to postpone that after writeback is finished (PageWriteback is
cleared) because there extent estimates are not clear. And I need to know
necessary number of extents well in advance to be able to reserve credits
in the journal. OTOH maybe we could use jbd2_journal_extend() to get more
credits if we need them for merging. And when that fails, bad luck but we
can cope... Anyway, this is a different problem.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
next prev parent reply other threads:[~2012-12-21 16:19 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-12-21 1:25 Uninitialized extent races Jan Kara
2012-12-21 3:11 ` Theodore Ts'o
2012-12-21 16:19 ` Jan Kara [this message]
2012-12-21 18:02 ` Theodore Ts'o
2012-12-21 22:49 ` Jan Kara
2012-12-21 23:03 ` Theodore Ts'o
2012-12-24 11:17 ` Zheng Liu
2012-12-31 8:32 ` Jan Kara
2012-12-31 16:31 ` Zheng Liu
2012-12-31 16:44 ` Jan Kara
2013-01-01 4:49 ` Zheng Liu
2012-12-21 12:34 ` Dmitry Monakhov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20121221161929.GF17357@quack.suse.cz \
--to=jack@suse.cz \
--cc=dmonakhov@openvz.org \
--cc=linux-ext4@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).