From: Andreas Dilger <adilger@clusterfs.com>
To: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Erik Mouw <erik@harddisk-recovery.com>,
UZAIR LAKHANI <uzairr_bs1b@yahoo.com>,
linux-fsdevel@vger.kernel.org
Subject: Re: How To Recover Files From ext3 Partition??
Date: Mon, 8 May 2006 11:41:08 -0600
Message-ID: <20060508174108.GY6075@schatzie.adilger.int>
In-Reply-To: <1147092139.5331.9.camel@sisko.sctweedie.blueyonder.co.uk>
On May 08, 2006 13:42 +0100, Stephen C. Tweedie wrote:
> On Mon, 2006-05-08 at 14:34 +0200, Erik Mouw wrote:
>
> > > Trouble is, there's no guarantee that that transaction would actually
> > > fit into the journal. Most of the time it will, but if it doesn't, then
> > > we deadlock or risk data corruption.
> >
> > Is there some way to determine in advance if a transaction fits into
> > the journal?
>
> For truncate/delete, no, not easily. Or rather, it's possible, but only
> for trivially short files. The trouble is that we can't assume that all
> of the file's blocks are on the same block groups, so each block in the
> file is potentially an update to a new group descriptor and a new block
> bitmap (ie. 2 blocks of journal per block of file.)
Actually, the upper limit is the number of block groups in the filesystem:
each affected group costs one bitmap block plus 1/128 of a group-descriptor
block. In many cases this is a tight enough upper bound that it could be
checked without any effort. Given that the default journal size is now 128MB
(32k blocks), and a single transaction can use at most 1/4 of the journal,
this would guarantee full-file truncates in the worst case for up to
(32k blocks / 4) / (1 + 1/128 blocks/group) = 8128 groups = 1016GB
before we even have to consider walking the file. We could also bound the
worst-case truncate by the number of blocks in the file.
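A minimal sketch of that arithmetic (names and constants are illustrative,
not the real ext3/JBD code; it assumes 4KB blocks, the JBD quarter-of-journal
transaction cap, and 32-byte group descriptors):

```c
#include <assert.h>

/* Worst-case bound: each affected block group costs one bitmap block
 * plus 1/128 of a group-descriptor block (128 x 32-byte descriptors
 * fit in one 4KB block).  A 128MB journal is 32768 4KB blocks, and a
 * single transaction may use at most 1/4 of the journal.
 */
#define JOURNAL_BLOCKS  32768UL              /* 128MB / 4KB */
#define MAX_TXN_BLOCKS  (JOURNAL_BLOCKS / 4) /* JBD transaction cap */
#define DESC_PER_BLOCK  128UL                /* 4096 / 32-byte descriptor */
#define GROUP_BYTES     (128ULL << 20)       /* 32768 blocks * 4KB */

/* Largest number of block groups one truncate transaction can touch:
 * solve groups * (1 + 1/128) <= MAX_TXN_BLOCKS for groups. */
static unsigned long max_groups_per_txn(void)
{
        return MAX_TXN_BLOCKS * DESC_PER_BLOCK / (DESC_PER_BLOCK + 1);
}

/* Largest file size guaranteed to truncate in a single transaction. */
static unsigned long long max_guaranteed_bytes(void)
{
        return max_groups_per_txn() * GROUP_BYTES;
}
```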
> That's hugely pessimistic, of course, but it is the genuine worst-case
> scenario and we have to be prepared for it. We only work out that we
> need less once we actually start walking the file's indirect tree, at
> which point the truncate is already under way.
>
> We _could_ walk the tree twice, but that would be unnecessarily
> expensive, especially for large files.
That was actually my thought. I don't think it is expensive, given that the
current implementation already has to read all of these blocks from disk
(in reverse order, I might add) and then _write_ to each of the indirect
blocks (about 1/1024 of the file's blocks) to zero them out.
Instead, we could walk the file tree in forward order, doing async readahead
for the indirect, dindirect, and tindirect blocks, then a second batch of
readahead for all of the indirect blocks from the dindirect, and the dindirect
blocks from the tindirect, rinse, repeat. In the meantime we could walk the
blocks and count the block groups affected (= transaction size) while waiting
for the next batch of blocks to complete IO. Since we need to read these
blocks in any case, doing the readahead efficiently will likely improve the
performance of this step.
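The "count groups, not blocks" part of that walk could look something like
the following sketch (function names, and the 32768 blocks-per-group and
128 descriptors-per-block figures, are illustrative assumptions, not the
real ext3 code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define BLOCKS_PER_GROUP 32768UL  /* 4KB blocks, 128MB groups */
#define DESC_PER_BLOCK   128UL    /* group descriptors per 4KB block */

/* Tally how many distinct block groups a file's blocks fall in, as the
 * readahead batches complete.  Returns 0 if the seen-bitmap allocation
 * fails. */
static unsigned long count_affected_groups(const unsigned long *blocks,
                                           size_t nr, size_t ngroups)
{
        bool *seen = calloc(ngroups, sizeof(*seen));
        unsigned long groups = 0;
        size_t i;

        if (!seen)
                return 0;
        for (i = 0; i < nr; i++) {
                unsigned long g = blocks[i] / BLOCKS_PER_GROUP;
                if (!seen[g]) {
                        seen[g] = true;
                        groups++;
                }
        }
        free(seen);
        return groups;
}

/* Journal credits for the truncate: one bitmap block per affected
 * group, plus the shared group-descriptor blocks, rounded up. */
static unsigned long txn_credits(unsigned long groups)
{
        return groups + (groups + DESC_PER_BLOCK - 1) / DESC_PER_BLOCK;
}
```

So a contiguous 1MB file in one group needs only 2 credits, instead of the
pessimistic 2-per-block estimate.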
Rather than hurting performance, I think this will actually improve truncate
performance, because we don't need to do ANY indirect block writes, removing
one block write per 4MB of file data (one indirect block per 1024 data
blocks), and only doing the (already required) writes to the group descriptor
and bitmap. Given that we try hard to do contiguous allocations for files,
this will usually be a relatively small number of blocks, as few as one block
per 128MB of file size (the group size limit). If we do the file walk,
instead of being completely pessimistic, we can also reduce the pressure on
journal flushes, which I suspect would be more costly than the walk itself.
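A back-of-the-envelope for the claimed savings (assuming 4KB blocks, 1024
pointers per indirect block, and 128MB groups; single indirection only, so
the dindirect/tindirect blocks are ignored as negligible):

```c
#include <assert.h>

#define BLOCK_SIZE     4096ULL         /* assumed 4KB filesystem blocks */
#define PTRS_PER_BLOCK 1024ULL         /* 4096 / 4-byte block pointers */
#define GROUP_BYTES    (128ULL << 20)  /* 128MB block groups */

/* Indirect-block writes the current truncate does: one zeroing write
 * per 1024 data blocks, i.e. per 4MB of file data. */
static unsigned long long indirect_writes(unsigned long long file_bytes)
{
        unsigned long long data_blocks = file_bytes / BLOCK_SIZE;
        return (data_blocks + PTRS_PER_BLOCK - 1) / PTRS_PER_BLOCK;
}

/* Bitmap writes needed either way for a contiguous file: one per
 * 128MB block group touched. */
static unsigned long long bitmap_writes(unsigned long long file_bytes)
{
        return (file_bytes + GROUP_BYTES - 1) / GROUP_BYTES;
}
```

For a contiguous 1GB file that is 256 indirect-block writes avoided, against
the 8 bitmap writes we pay in either scheme.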
The fact that this also fixes undelete was actually a side-effect, IMHO.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Thread overview: 10+ messages
2006-05-04 14:18 How To Recover Files From ext3 Partition?? UZAIR LAKHANI
2006-05-04 14:38 ` Erik Mouw
2006-05-05 5:16 ` UZAIR LAKHANI
2006-05-05 11:18 ` Erik Mouw
2006-05-05 16:41 ` Andreas Dilger
2006-05-08 10:51 ` Stephen C. Tweedie
2006-05-08 12:34 ` Erik Mouw
2006-05-08 12:42 ` Stephen C. Tweedie
2006-05-08 17:41 ` Andreas Dilger [this message]
2006-05-08 13:20 ` Theodore Tso