public inbox for linux-ext4@vger.kernel.org
* [RFC PATCH 1/1] add a jbd option to force an unclean journal state
@ 2008-03-04 18:39 Josef Bacik
  2008-03-04 19:01 ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Josef Bacik @ 2008-03-04 18:39 UTC (permalink / raw)
  To: linux-ext4; +Cc: Jan Kara

Hello,

I'm about to go through and muck with some of the transaction handling stuff 
in jbd, and I want a way to verify that I'm not screwing anything up in the 
process; this is what I came up with.  Basically this option would only be 
used in the case where someone mounts an ext3 image or fs, does a specific IO 
operation (creates 100 files, writes data to a few files, etc.), unmounts the 
fs, and remounts so that jbd does its journal recovery, and then checks the 
status of the fs to make sure it's exactly the way it's expected to be.  I'm 
not entirely sure how useful an option like this would be (or if I did it 
right :) ), but I thought I'd throw it out there in case anybody thinks it 
may be useful, and in case there is some case I'm missing, so I can fix it 
and better make sure I don't mess anything up.  Basically this patch keeps us 
from resetting the journal's tail/transaction sequence when we destroy the 
journal, so when we mount the fs again it will look like we didn't unmount 
properly and recovery will occur.  Any comments are much appreciated,

Josef

Index: linux-2.6/fs/jbd/journal.c
===================================================================
--- linux-2.6.orig/fs/jbd/journal.c
+++ linux-2.6/fs/jbd/journal.c
@@ -1146,9 +1146,12 @@ void journal_destroy(journal_t *journal)
 	J_ASSERT(journal->j_checkpoint_transactions == NULL);
 	spin_unlock(&journal->j_list_lock);
 
-	/* We can now mark the journal as empty. */
-	journal->j_tail = 0;
-	journal->j_tail_sequence = ++journal->j_transaction_sequence;
+	if (!(journal->j_flags & JFS_UNCLEAN)) {
+		/* We can now mark the journal as empty. */
+		journal->j_tail = 0;
+		journal->j_tail_sequence = ++journal->j_transaction_sequence;
+	}
+
 	if (journal->j_sb_buffer) {
 		journal_update_superblock(journal, 1);
 		brelse(journal->j_sb_buffer);
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -825,6 +825,7 @@ struct journal_s
 #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
 #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
 #define JFS_BARRIER	0x020	/* Use IDE barriers */
+#define JFS_UNCLEAN	0x040	/* Don't clean up the journal on umount */
 
 /*
  * Function declarations for the journaling transaction and buffer
Index: linux-2.6/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.orig/fs/jbd/checkpoint.c
+++ linux-2.6/fs/jbd/checkpoint.c
@@ -423,10 +423,14 @@ int cleanup_journal_tail(journal_t *jour
 	} else if ((transaction = journal->j_running_transaction) != NULL) {
 		first_tid = transaction->t_tid;
 		blocknr = journal->j_head;
-	} else {
+	} else if (!(journal->j_flags & JFS_UNCLEAN)) {
 		first_tid = journal->j_transaction_sequence;
 		blocknr = journal->j_head;
+	} else {
+		first_tid = journal->j_tail_sequence;
+		blocknr = journal->j_tail;
 	}
+
 	spin_unlock(&journal->j_list_lock);
 	J_ASSERT(blocknr != 0);
 


* Re: [RFC PATCH 1/1] add a jbd option to force an unclean journal state
  2008-03-04 18:39 [RFC PATCH 1/1] add a jbd option to force an unclean journal state Josef Bacik
@ 2008-03-04 19:01 ` Jan Kara
  2008-03-04 23:58   ` Andrew Morton
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2008-03-04 19:01 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-ext4, Andrew Morton

  Hi,

On Tue 04-03-08 13:39:41, Josef Bacik wrote:
> jbd and I want a way to verify that I'm not screwing anything up in the 
> process, and this is what I came up with.  Basically this option would only be 
> used in the case where someone mounts an ext3 image or fs, does a specific IO 
> operation (create 100 files, write data to a few files etc), unmounts the fs 
> and remounts so that jbd does its journal recovery and then check the status of 
> the fs to make sure it's exactly the way it's expected to be.  I'm not entirely 
> sure how useful an option like this would be (or if I did it right :) ), 
> but I thought I'd throw it out there in case anybody thinks it may be useful, 
> and in case there is some case that I'm missing so I can fix it and better make 
> sure I don't mess anything up while doing stuff.  Basically this patch keeps us 
> from resetting the journal's tail/transaction sequence when we destroy the 
> journal so when we mount the fs again it will look like we didn't unmount 
> properly and recovery will occur.  Any comments are much appreciated,
  Actually, there is a different way we've done checking like this (and
I think also a more useful one), at least for ext3.  Basically, you mounted a
filesystem with some timeout, and after the timeout the device was forced
read-only.  Then you checked that the fs was consistent after journal
replay.  I think Andrew had the patches somewhere...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC PATCH 1/1] add a jbd option to force an unclean journal state
  2008-03-04 19:01 ` Jan Kara
@ 2008-03-04 23:58   ` Andrew Morton
  2008-03-05  2:46     ` Eric Sandeen
  2008-03-05  7:34     ` Andreas Dilger
  0 siblings, 2 replies; 5+ messages in thread
From: Andrew Morton @ 2008-03-04 23:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: jbacik, linux-ext4

On Tue, 4 Mar 2008 20:01:09 +0100
Jan Kara <jack@suse.cz> wrote:

>   Hi,
> 
> On Tue 04-03-08 13:39:41, Josef Bacik wrote:
> > jbd and I want a way to verify that I'm not screwing anything up in the 
> > process, and this is what I came up with.  Basically this option would only be 
> > used in the case where someone mounts an ext3 image or fs, does a specific IO 
> > operation (create 100 files, write data to a few files etc), unmounts the fs 
> > and remounts so that jbd does its journal recovery and then check the status of 
> > the fs to make sure it's exactly the way it's expected to be.  I'm not entirely 
> > sure how useful an option like this would be (or if I did it right :) ), 
> > but I thought I'd throw it out there in case anybody thinks it may be useful, 
> > and in case there is some case that I'm missing so I can fix it and better make 
> > sure I don't mess anything up while doing stuff.  Basically this patch keeps us 
> > from resetting the journal's tail/transaction sequence when we destroy the 
> > journal so when we mount the fs again it will look like we didn't unmount 
> > properly and recovery will occur.  Any comments are much appreciated,
>   Actually, there is a different way how we've done checking like this (and
> I think also more useful), at least for ext3. Basically you mounted a
> filesystem with some timeout and after the timeout, device was forced
> read-only. And then you've checked that the fs is consistent after journal
> replay. I think Andrew had the patches somewhere...

About a billion years ago...

But the idea was (I think) good:

- mount the filesystem with `-o ro_after=100'

- the fs arms a timer to go off in 100 seconds

- now you start running some filesystem stress test

- the timer goes off.  At timer-interrupt time, flags are set which cause
  the low-level driver layer to start silently ignoring all writes to the
  device which backs the filesystem.

  This simulates a crash or poweroff.

- Now up in userspace we

  - kill off the stresstest
  - unmount the fs
  - mount the fs (to run recovery)
  - unmount the fs
  - fsck it
  - mount the fs
    - check the data content of the files which the stresstest was writing:
      look for uninitialised blocks, incorrect data, etc.
  - unmount the fs

- start it all again.


So it's 100% scriptable and can be left running overnight, etc.  It found
quite a few problems with ext3/jbd recovery which I doubt could be found by
other means.  This was 6-7 years ago and I'd expect that new recovery bugs
have crept in since then which it can expose.

I think we should implement this in a formal, mergeable fashion, as there
are numerous filesystems which could and should use this sort of testing
infrastructure.



* Re: [RFC PATCH 1/1] add a jbd option to force an unclean journal state
  2008-03-04 23:58   ` Andrew Morton
@ 2008-03-05  2:46     ` Eric Sandeen
  2008-03-05  7:34     ` Andreas Dilger
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Sandeen @ 2008-03-05  2:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, jbacik, linux-ext4

Andrew Morton wrote:

> So it's 100% scriptable and can be left running overnight, etc.  It found
> quite a few problems with ext3/jbd recovery which I doubt could be found by
> other means.  This was 6-7 years ago and I'd expect that new recovery bugs
> have crept in since then which it can expose.
> 
> I think we should implement this in a formal, mergeable fashion, as there
> are numerous filesystems which could and should use this sort of testing
> infrastructure.


FWIW, xfs has something vaguely similar, the XFS_IOC_GOINGDOWN ioctl,
which invokes xfs_fs_goingdown, which takes a few flags:

/*
 * Flags for going down operation
 */
#define XFS_FSOP_GOING_FLAGS_DEFAULT            0x0     /* going down */
#define XFS_FSOP_GOING_FLAGS_LOGFLUSH           0x1     /* flush log but not data */
#define XFS_FSOP_GOING_FLAGS_NOLOGFLUSH         0x2     /* don't flush log nor data */


but ultimately calls xfs_force_shutdown, which is roughly similar
to ext3_abort....

The xfs qa tests make use of this ioctl.
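For reference, xfsprogs' xfs_io exposes this ioctl from the shell via its 
expert-mode "shutdown" command, so a recovery test can be driven without 
writing C.  A sketch (the mountpoint is an argument you supply; this really 
does force the filesystem offline, so scratch mounts only):

```shell
#!/bin/sh
# Sketch: drive XFS_IOC_GOINGDOWN via xfs_io's "shutdown" command
# (requires xfsprogs and root).  DANGER: forces the named filesystem
# offline -- use scratch mounts only.
MNT="$1"
if [ -z "$MNT" ]; then
	echo "usage: $0 <scratch-xfs-mountpoint>"
	exit 0
fi
xfs_io -x -c "shutdown -f" "$MNT"   # -f flushes the log first
umount "$MNT"
mount "$MNT"                        # remount replays the log
```

Run with no arguments it just prints usage, so it is safe to inspect before 
pointing it at a scratch filesystem.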

-Eric



* Re: [RFC PATCH 1/1] add a jbd option to force an unclean journal state
  2008-03-04 23:58   ` Andrew Morton
  2008-03-05  2:46     ` Eric Sandeen
@ 2008-03-05  7:34     ` Andreas Dilger
  1 sibling, 0 replies; 5+ messages in thread
From: Andreas Dilger @ 2008-03-05  7:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, jbacik, linux-ext4

On Mar 04, 2008  15:58 -0800, Andrew Morton wrote:
> - mount the filesystem with `-o ro_after=100'
> 
> - the fs arms a timer to go off in 100 seconds
> 
> - now you start running some filesystem stress test
> 
> - the timer goes off.  At timer-interrupt time, flags are set which cause
>   the low-level driver layer to start silently ignoring all writes to the
>   device which backs the filesystem.
> 
>   This simulates a crash or poweroff.
> 
> - Now up in userspace we
> 
>   - kill off the stresstest
>   - unmount the fs
>   - mount the fs (to run recovery)
>   - unmount the fs
>   - fsck it
>   - mount the fs
>     - check the data content of the files which the stresstest was writing:
>       look for uninitialised blocks, incorrect data, etc.
>   - unmount the fs
> 
> - start it all again.
> 
> 
> So it's 100% scriptable and can be left running overnight, etc.  It found
> quite a few problems with ext3/jbd recovery which I doubt could be found by
> other means.  This was 6-7 years ago and I'd expect that new recovery bugs
> have crept in since then which it can expose.
> 
> I think we should implement this in a formal, mergeable fashion, as there
> are numerous filesystems which could and should use this sort of testing
> infrastructure.

We use a patch which is a distant ancestor of Andrew's original patch to
do very similar testing for Lustre + ext3, allowing us to simulate node
crashes.  This patch is against 2.6.22, but the relevant code doesn't
appear significantly changed in newer kernels.  YMMV.

The major difference between this code and Andrew's original code is
that this allows multiple devices to be turned read-only at one time
(e.g. ext3 filesystem + external journal), while the original code wasn't
very robust in that area.

There is no mechanism to enable this from userspace or the filesystem;
in our setup, Lustre code in the kernel calls dev_set_rdonly() on one or
more devices when we hit a failure trigger.  Adding timer code or a /proc
or /sys entry for this would be easy.

The device is reset only when the last reference to it is removed in
kill_bdev(), so there is no race where writes are re-enabled while dirty
buffers are still outstanding, which would certainly corrupt the
filesystem.  That's why dev_clear_rdonly() is not exported.

Signed-off-by: Andreas Dilger <adilger@clusterfs.com>

Index: linux-2.6.22.5/block/ll_rw_blk.c
===================================================================
--- linux-2.6.22.5.orig/block/ll_rw_blk.c	2007-08-22 17:23:54.000000000 -0600
+++ linux-2.6.22.5/block/ll_rw_blk.c	2008-02-21 01:07:16.000000000 -0700
@@ -3101,6 +3101,8 @@
 
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+int dev_check_rdonly(struct block_device *bdev);
+
 /**
  * generic_make_request: hand a buffer to its device driver for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -3185,6 +3187,12 @@ static inline void __generic_make_request
 
 		if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
 			goto end_io;
+
+		if (unlikely(bio->bi_rw == WRITE &&
+		             dev_check_rdonly(bio->bi_bdev))) {
+			bio_endio(bio, bio->bi_size, 0);
+			break;
+		}
 
 		if (should_fail_request(bio))
 			goto end_io;
@@ -3850,6 +3858,100 @@ void swap_io_context(struct io_context
 	*ioc2 = temp;
 }
 EXPORT_SYMBOL(swap_io_context);
+
+ /*
+ * Debug code for turning block devices "read-only" (will discard writes
+ * silently).  This is for filesystem crash/recovery testing.
+ */
+struct deventry {
+	dev_t dev;
+	struct deventry *next;
+};
+
+static struct deventry *devlist = NULL;
+static spinlock_t devlock = SPIN_LOCK_UNLOCKED; 
+
+int dev_check_rdonly(struct block_device *bdev) 
+{
+	struct deventry *cur;
+
+	if (!bdev)
+		return 0;
+
+	spin_lock(&devlock);
+	cur = devlist;
+	while (cur) {
+		if (bdev->bd_dev == cur->dev) {
+			spin_unlock(&devlock);
+			return 1;
+		}
+		cur = cur->next;
+	}
+	spin_unlock(&devlock);
+
+	return 0;
+}
+
+void dev_set_rdonly(struct block_device *bdev)
+{
+	struct deventry *newdev, *cur;
+
+	if (!bdev) 
+		return;
+
+	newdev = kmalloc(sizeof(struct deventry), GFP_KERNEL);
+	if (!newdev) 
+		return;
+	
+	spin_lock(&devlock);
+	cur = devlist;
+	while (cur) {
+		if (bdev->bd_dev == cur->dev) {
+			spin_unlock(&devlock);
+			kfree(newdev);
+			return;
+		}
+		cur = cur->next;
+	}
+	newdev->dev = bdev->bd_dev;
+	newdev->next = devlist;
+	devlist = newdev;
+	spin_unlock(&devlock);
+
+	printk(KERN_WARNING "Turning device %s (%#x) read-only\n",
+	       bdev->bd_disk ? bdev->bd_disk->disk_name : "", bdev->bd_dev);
+}
+
+void dev_clear_rdonly(struct block_device *bdev) 
+{
+	struct deventry *cur, *last = NULL;
+
+	if (!bdev)
+		return;
+
+	spin_lock(&devlock);
+	cur = devlist;
+	while (cur) {
+		if (bdev->bd_dev == cur->dev) {
+			if (last) 
+				last->next = cur->next;
+			else
+				devlist = cur->next;
+			spin_unlock(&devlock);
+			kfree(cur);
+			printk(KERN_WARNING "Removing read-only on %s (%#x)\n",
+			       bdev->bd_disk ? bdev->bd_disk->disk_name :
+					       "unknown block", bdev->bd_dev);
+			return;
+		}
+		last = cur;
+		cur = cur->next;
+	}
+	spin_unlock(&devlock);
+}
+
+EXPORT_SYMBOL(dev_set_rdonly);
+EXPORT_SYMBOL(dev_check_rdonly);
 
 /*
  * sysfs parts below
Index: linux-2.6.22.5/fs/block_dev.c
===================================================================
--- linux-2.6.22.5.orig/fs/block_dev.c	2007-08-22 17:23:54.000000000 -0600
+++ linux-2.6.22.5/fs/block_dev.c	2008-02-21 01:07:16.000000000 -0700
@@ -63,6 +63,7 @@ static void kill_bdev(struct block_device *bdev)
 		return;
 	invalidate_bh_lrus();
 	truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+	dev_clear_rdonly(bdev);
 }	
 
 int set_blocksize(struct block_device *bdev, int size)
Index: linux-2.6.22.5/include/linux/fs.h
===================================================================
--- linux-2.6.22.5.orig/include/linux/fs.h	2008-02-21 00:58:18.000000000 -0700
+++ linux-2.6.22.5/include/linux/fs.h	2008-02-21 01:07:16.000000000 -0700
@@ -1744,6 +1744,10 @@
 extern void submit_bio(int, struct bio *);
 extern int bdev_read_only(struct block_device *);
 #endif
+#define HAVE_CLEAR_RDONLY_ON_PUT
+extern void dev_set_rdonly(struct block_device *bdev);
+extern int dev_check_rdonly(struct block_device *bdev);
+extern void dev_clear_rdonly(struct block_device *bdev);
 extern int set_blocksize(struct block_device *, int);
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


