From: Andreas Dilger <adilger@sun.com>
To: m-ota@ys.jp.nec.com
Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: what should I do when an error occurred after write_begin()
Date: Sun, 20 Jul 2008 23:04:52 -0600
Message-ID: <20080721050452.GB3370@webber.adilger.int>
In-Reply-To: <20080718094315m-ota@mail.jp.nec.com>

On Jul 18, 2008 09:43 +0900, m-ota@ys.jp.nec.com wrote:
> ext4 online defrag exchanges the data block in the following procedures.
>
> 1. Creates a temporary inode and allocates contiguous blocks.
> 2. Read data from original file to memory page by write_begin()
> 3. Swap the blocks between the original inode and the temporary inode.
> Updates the extent tree and registers the block to transaction by
> ext4_journal_dirty_metadata().
> 4. Write data in memory page to new blocks by write_end().
>
> In the current implementation, when the block swap failed,
> data could not move to the new block.
> So the defrag process exits without calling write_end().
> We try to defrag for the same file again, but the defrag process seems to stall.
> After defrag process stalled, all access to the file systems like "ls" command
> also stall.
> Both processes wait for unlock j_wait_transaction_locked.
>
> If the block exchange between write_begin() and write_end() failed,
> what should I do?

It sounds like you are not closing the transaction correctly in the
case of the failed block swap.

One important rule when writing ext3/ext4 code is to try and ensure
all possible failure conditions are handled BEFORE starting the journal
operation.
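
As a rough illustration of that rule and of the failed-swap case, here is
a user-space sketch, not ext4 code. The mock_journal_start() and
mock_journal_stop() helpers, the open_handles counter, and the
swap_request structure are all hypothetical stand-ins invented for this
example; they only mimic the start/stop pairing of the real journal API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the journal API (NOT the real jbd interfaces). */
struct mock_journal { int open_handles; };
struct mock_handle { struct mock_journal *journal; };

static struct mock_handle the_handle;

static struct mock_handle *mock_journal_start(struct mock_journal *j)
{
	j->open_handles++;
	the_handle.journal = j;
	return &the_handle;
}

static void mock_journal_stop(struct mock_handle *h)
{
	assert(h->journal->open_handles > 0);
	h->journal->open_handles--;
}

/* Hypothetical request; swap_should_fail simulates a failed block swap. */
struct swap_request { long nblocks; int swap_should_fail; };

static int swap_blocks(struct mock_journal *j, const struct swap_request *req)
{
	struct mock_handle *h;
	int err = 0;

	/* 1. Check everything that can be checked before opening a handle. */
	if (req->nblocks <= 0)
		return -EINVAL;

	h = mock_journal_start(j);

	/* 2. The journaled work itself may still fail. */
	if (req->swap_should_fail)
		err = -EIO;

	/* 3. Stop the handle on success and on failure alike. */
	mock_journal_stop(h);
	return err;
}
```

In this sketch open_handles is back to zero on every return path. A
handle left open after a failed swap would be consistent with the stall
described above, though that is only a guess without seeing the code.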

It does not seem necessary to do the allocation and writing of the
temporary inode under the same transaction as the block swapping,
as long as it is in the orphan inode list with i_nlink == 0. A first
transaction can be started to allocate the temporary inode, add it to
the orphan list, and then close the transaction. Then, if the system
crashes during the defrag, the temporary inode will be removed and
all allocated blocks freed at e2fsck/remount time, like an
open-unlinked file would be.

Multiple transactions may be needed for doing the file copying, depending
on the size of the blocks being copied. Lustre could always do 1MB writes
in a single transaction without problems, without doing data journaling.
You can try to start a single transaction large enough to allocate, say,
min(file size, 4MB) blocks, and then if journal_start() returns -ENOSPC
reduce the allocation size by 1/2 each time. A separate transaction can
be used to do the copying of the data into the temporary inode (with
journal_dirty_metadata() as you say to avoid the need to always fsync).
Then, once the copy is finished a separate transaction should be started
to do the final swapping of the i_block[] array in the inode and freeing
of the temporary inode. It shouldn't really be possible to fail at that
point.
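
The "halve the allocation on -ENOSPC" retry could look roughly like the
user-space sketch below. Everything in it is an assumption made for
illustration: sim_journal, sim_journal_start(), sim_journal_stop() and
start_chunk_transaction() are hypothetical names, the mock returns an int
rather than a handle, and it treats the requested block count as the
transaction size, which glosses over how journal credits are really
computed.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_CHUNK_BYTES (4L * 1024 * 1024)	/* the 4MB cap suggested above */

/* Hypothetical journal that refuses transactions above a fixed size. */
struct sim_journal { long max_blocks_per_txn; int open_handles; };

/* Mock of journal_start(): returns 0 on success or -ENOSPC. */
static int sim_journal_start(struct sim_journal *j, long nblocks)
{
	if (nblocks > j->max_blocks_per_txn)
		return -ENOSPC;
	j->open_handles++;
	return 0;
}

static void sim_journal_stop(struct sim_journal *j)
{
	j->open_handles--;
}

/*
 * Try to start a transaction for min(file size, 4MB) worth of blocks,
 * halving the request each time the journal reports -ENOSPC.
 * Returns the number of blocks covered by the now-open transaction
 * (the caller must stop it), or a negative errno.
 */
static long start_chunk_transaction(struct sim_journal *j, long file_bytes,
				    long block_size)
{
	long bytes = file_bytes < MAX_CHUNK_BYTES ? file_bytes : MAX_CHUNK_BYTES;
	long nblocks;
	int err;

	if (block_size <= 0 || bytes <= 0)
		return -EINVAL;

	nblocks = (bytes + block_size - 1) / block_size;	/* round up */

	while (nblocks > 0) {
		err = sim_journal_start(j, nblocks);
		if (err == 0)
			return nblocks;
		if (err != -ENOSPC)
			return err;
		nblocks /= 2;	/* reduce the allocation size by 1/2 */
	}
	return -ENOSPC;
}
```

A caller would loop over the file, calling start_chunk_transaction() for
the remaining bytes, copying that many blocks, and calling
sim_journal_stop() before starting the next chunk.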

The other question I had about the defragmenter is that it would be
excellent if it is possible to "defragment" a block-mapped file into
an extent-mapped file. This should be relatively easy so long as
the whole file is "defragmented" and then the i_block[] array is
swapped with the original inode and EXT4_EXTENTS_FL is set on the inode.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.