Corruption from interrupted e2fsck

linux-ext4.vger.kernel.org archive mirror

* Corruption from interrupted e2fsck
@ 2015-11-02  1:16 Andreas Dilger
  2015-11-02 15:23 ` Theodore Ts'o
  0 siblings, 1 reply; 4+ messages in thread
From: Andreas Dilger @ 2015-11-02  1:16 UTC (permalink / raw)
  To: linux-ext4

It looks like there is a bug in how e2fsck handles being interrupted by CTRL-C.
If CTRL-C is pressed to kill e2fsck rather than e.g. kill -9, then the
interrupt handler sets E2F_FLAG_CANCEL in the context but doesn't actually
kill the process.  Instead, e2fsck_pass1() checks this flag before processing
the next inode.
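
A minimal sketch of this cooperative-cancel pattern, using hypothetical
names (cancel_requested, on_cancel, scan_items) rather than e2fsck's real
context and pass 1 code: the handler only records the request, and the scan
loop notices it before the next item.  Here raise(SIGINT) stands in for the
user pressing CTRL-C.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for E2F_FLAG_CANCEL in the e2fsck context. */
static volatile sig_atomic_t cancel_requested = 0;

/* The handler only records the request; it does not exit the process. */
static void on_cancel(int sig)
{
	(void)sig;
	cancel_requested = 1;
}

/* Install the handler for SIGINT and SIGTERM; returns 0 on success. */
static int install_cancel_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_cancel;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGINT, &sa, NULL) < 0)
		return -1;
	return sigaction(SIGTERM, &sa, NULL);
}

/*
 * Toy "pass 1": visit `total` items, checking the flag before each one.
 * `interrupt_at` simulates CTRL-C arriving when that item is reached
 * (pass a value >= total for "never").  Returns the items processed.
 */
static size_t scan_items(size_t total, size_t interrupt_at)
{
	size_t done = 0;

	for (size_t i = 0; i < total; i++) {
		if (i == interrupt_at)
			raise(SIGINT);
		if (cancel_requested)
			break;		/* stop scanning, but keep running */
		done++;
	}
	return done;
}
```

The point of the sketch is that control returns to the caller after the
loop breaks, so whatever the caller does next still runs on a partial scan.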

If e2fsck running in fix mode (e2fsck -fy) is interrupted, and the
quota feature is enabled, then the quota file will still be written to disk
even though the inode scan was not complete and the quota information is
totally inaccurate.  Even worse, if the Pass 1 inode and block scan was not
finished, then the in-memory block bitmaps (which are used for block
allocation during e2fsck) are also invalid, so any blocks allocated to the
quota files may corrupt other files if those blocks were actually used.
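
To illustrate the allocation hazard, here is a toy model, not e2fsck code:
it assumes an 8-block "disk" and a linear block scan (the real pass 1 walks
inodes), and all names are hypothetical.  A bitmap built from an interrupted
scan shows unscanned blocks as free, so a first-fit allocator can hand out a
block that is really in use.

```c
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 8

/* Ground truth: which blocks are really in use on the toy disk. */
static const bool really_used[NBLOCKS] = {
	true, true, true, false, true, true, false, false
};

/*
 * Build the in-memory bitmap by scanning blocks [0, scanned).
 * Blocks past the point of interruption are never marked, so they
 * look free even when they are in use.
 */
static void build_bitmap(bool *bitmap, size_t scanned)
{
	for (size_t b = 0; b < NBLOCKS; b++)
		bitmap[b] = (b < scanned) && really_used[b];
}

/* First-fit allocation from the in-memory bitmap; -1 if none free. */
static int alloc_block(bool *bitmap)
{
	for (size_t b = 0; b < NBLOCKS; b++) {
		if (!bitmap[b]) {
			bitmap[b] = true;
			return (int)b;
		}
	}
	return -1;
}

/*
 * Returns true if an allocation made after scanning only `scanned`
 * blocks hands out a block that is really in use.
 */
static bool allocation_collides(size_t scanned)
{
	bool bitmap[NBLOCKS];
	int b;

	build_bitmap(bitmap, scanned);
	b = alloc_block(bitmap);
	return b >= 0 && really_used[b];
}
```

With a complete scan the allocator picks a genuinely free block; with the
scan cut off after two blocks it picks block 2, which belongs to another
file in this model.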

It also looks like the journal may be recreated after e2fsck is
interrupted, if it was deleted during pass 1 because of corruption.

static void signal_cancel(int sig EXT2FS_ATTR((unused)))
{
	e2fsck_t ctx = e2fsck_global_ctx;

	if (!ctx)
		exit(FSCK_CANCELED);

	/* Only record the request; the process keeps running. */
	ctx->flags |= E2F_FLAG_CANCEL;
}

	sa.sa_handler = signal_cancel;
	sigaction(SIGINT, &sa, 0);
	sigaction(SIGTERM, &sa, 0);
	:
	:
	run_result = e2fsck_run(ctx);
	e2fsck_clear_progbar(ctx);

	/* No cancel check here: an interrupted run falls through. */
	if (!ctx->invalid_bitmaps &&
	    (ctx->flags & E2F_FLAG_JOURNAL_INODE)) {
		if (fix_problem(ctx, PR_6_RECREATE_JOURNAL, &pctx)) {
			:
			:
			retval = ext2fs_add_journal_inode(fs, journal_size, 0);
		}
	}

no_journal:
	/* Quota files are updated even if the scan was incomplete. */
	if (ctx->qctx) {
		for (i = 0; i < MAXQUOTAS; i++) {
			retval = quota_compare_and_update(ctx->qctx, i, &needs_writeout);
		}
	}

	if (run_result & E2F_FLAG_ABORT)
		fatal_error(ctx, _("aborted"));

Is there a reason not to have a cancel check right after the return from
e2fsck_run() rather than trying to recover the journal and quota files?
I can imagine that there is a desire to flush out modified inodes and such
that have been repaired, so that restarting an interrupted e2fsck will make
progress, but the quota file update is plain wrong unless at least pass1
has finished, and the journal recreation is also dangerous if the block
bitmaps have not been fully updated.

The quota problem was hit on a system, but the journal problem is only a
theory at this point.  I'm working on a patch but wanted to solicit input in
case there is something that I'm missing.

Cheers, Andreas

* Re: Corruption from interrupted e2fsck
  2015-11-02  1:16 Corruption from interrupted e2fsck Andreas Dilger
@ 2015-11-02 15:23 ` Theodore Ts'o
  2015-11-02 21:31   ` Andreas Dilger
  0 siblings, 1 reply; 4+ messages in thread
From: Theodore Ts'o @ 2015-11-02 15:23 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-ext4

On Sun, Nov 01, 2015 at 06:16:50PM -0700, Andreas Dilger wrote:
> Is there a reason not to have a cancel check right after the return from
> e2fsck_run() rather than trying to recover the journal and quota files?
> I can imagine that there is a desire to flush out modified inodes and such
> that have been repaired, so that restarting an interrupted e2fsck will make
> progress, but the quota file update is plain wrong unless at least pass1
> has finished, and the journal recreation is also dangerous if the block
> bitmaps have not been fully updated.

You're right.  My suggested fix would be in the case of
E2F_FLAG_CANCEL, we set the ctx->invalid_bitmaps flag, and then avoid
writing out the quota file if invalid_bitmaps is enabled.
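
A rough sketch of that suggestion, with a hypothetical toy_ctx struct and
flag values standing in for the real e2fsck context (this is not the actual
patch): cancellation marks the bitmaps invalid, and the quota write gets the
same invalid_bitmaps guard that journal recreation already has.

```c
#include <stdbool.h>

/* Hypothetical flag values; the real definitions live in e2fsck. */
#define FLAG_CANCEL		0x1
#define FLAG_JOURNAL_INODE	0x2

/* Hypothetical, heavily reduced stand-in for the e2fsck context. */
struct toy_ctx {
	int	flags;
	bool	invalid_bitmaps;
	bool	has_quota;		/* stands in for ctx->qctx != NULL */
	bool	journal_recreated;	/* records what finish_run() did */
	bool	quota_written;
};

/* Models the steps that follow the return from the main run. */
static void finish_run(struct toy_ctx *ctx)
{
	/* A cancelled run means the bitmaps cannot be trusted. */
	if (ctx->flags & FLAG_CANCEL)
		ctx->invalid_bitmaps = true;

	/* Existing guard: only recreate the journal with valid bitmaps. */
	if (!ctx->invalid_bitmaps && (ctx->flags & FLAG_JOURNAL_INODE))
		ctx->journal_recreated = true;

	/* Suggested guard: skip the quota write with invalid bitmaps. */
	if (!ctx->invalid_bitmaps && ctx->has_quota)
		ctx->quota_written = true;
}
```

Routing the cancel case through invalid_bitmaps means one condition covers
both the journal and the quota paths.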

Cheers,

					- Ted

* Re: Corruption from interrupted e2fsck
  2015-11-02 15:23 ` Theodore Ts'o
@ 2015-11-02 21:31   ` Andreas Dilger
  2015-11-03 16:00     ` Theodore Ts'o
  0 siblings, 1 reply; 4+ messages in thread
From: Andreas Dilger @ 2015-11-02 21:31 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

> On Nov 2, 2015, at 8:23 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Sun, Nov 01, 2015 at 06:16:50PM -0700, Andreas Dilger wrote:
>> Is there a reason not to have a cancel check right after the return from
>> e2fsck_run() rather than trying to recover the journal and quota files?
>> I can imagine that there is a desire to flush out modified inodes and such
>> that have been repaired, so that restarting an interrupted e2fsck will make
>> progress, but the quota file update is plain wrong unless at least pass1
>> has finished, and the journal recreation is also dangerous if the block
>> bitmaps have not been fully updated.
> 
> You're right.  My suggested fix would be in the case of
> E2F_FLAG_CANCEL, we set the ctx->invalid_bitmaps flag, and then avoid
> writing out the quota file if invalid_bitmaps is enabled.

I was looking at that too.  In some sense it isn't a bad idea to allow
updating the quota file in this case, but it still bothers me that e2fsck
would continue on to update the quotas if the user wants to kill it, so
my preference would be to not write the quota files at all if e2fsck is
interrupted.

It would probably make more sense to have an option like "-E quota-only"
to allow running a shorter e2fsck (maybe useful for link-farm backups
that take a long time in later passes) but most of the time pass 1 is
the slowest so there is usually minimal benefit from skipping later passes.

Cheers, Andreas

* Re: Corruption from interrupted e2fsck
  2015-11-02 21:31   ` Andreas Dilger
@ 2015-11-03 16:00     ` Theodore Ts'o
  0 siblings, 0 replies; 4+ messages in thread
From: Theodore Ts'o @ 2015-11-03 16:00 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-ext4

On Mon, Nov 02, 2015 at 02:31:01PM -0700, Andreas Dilger wrote:
> > You're right.  My suggested fix would be in the case of
> > E2F_FLAG_CANCEL, we set the ctx->invalid_bitmaps flag, and then avoid
> > writing out the quota file if invalid_bitmaps is enabled.
> 
> I was looking at that too.  In some sense it isn't a bad idea to allow
> updating the quota file in this case, but it still bothers me that e2fsck
> would continue on to update the quotas if the user wants to kill it, so
> my preference would be to not write the quota files at all if e2fsck is
> interrupted.

I agree; that's why I suggested that if E2F_FLAG_CANCEL was set, then
we would skip writing out the quota file.

> It would probably make more sense to have an option like "-E quota-only"
> to allow running a shorter e2fsck (maybe useful for link-farm backups
> that take a long time in later passes) but most of the time pass 1 is
> the slowest so there is usually minimal benefit from skipping later passes.

This is a separate issue.  One of the reasons why I wanted to
integrate the quota checking into e2fsck is that quota tracking is
*always* enabled if the quota inode(s) are present.  So the only
time the quota should be inconsistent (and so when we would need to
update the quota file) is if the file system itself had gotten
inconsistent, since the quota file is now considered *part* of the
file system metadata that should always be consistent in the absence
of a kernel bug, hardware-induced corruption, or someone messing with
the file system out of band using something like debugfs.

Cheers,

					- Ted
