public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mingming Cao <cmm@us.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Theodore Ts'o" <tytso@MIT.EDU>,
	linux-kernel@vger.kernel.org, girish@clusterfs.com,
	adilger@clusterfs.com, shaggy@linux.vnet.ibm.com,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Subject: Re: [PATCH 33/49] ext4: Add the journal checksum feature
Date: Thu, 24 Jan 2008 13:24:08 -0800	[thread overview]
Message-ID: <1201209848.4105.14.camel@localhost.localdomain> (raw)
In-Reply-To: <20080123140704.01249f86.akpm@linux-foundation.org>

On Wed, 2008-01-23 at 14:07 -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:12 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> > From: Girish Shilamkar <girish@clusterfs.com>
> > 
> > The journal checksum feature adds two new flags i.e
> > JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.
> > 
> > JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
> > checksum for the blocks described by the descriptor blocks.
> > Due to checksums, writing of the commit record no longer needs to be
> > synchronous. Now commit record can be sent to disk without waiting for
> > descriptor blocks to be written to disk. This behavior is controlled
> > using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
> > able to recover the journal with _ASYNC_COMMIT hence it is made
> > incompat.
> > The commit header has been extended to hold the checksum along with the
> > type of the checksum.
> > 
> > For recovery in pass scan checksums are verified to ensure the sanity
> > and completeness(in case of _ASYNC_COMMIT) of every transaction.
> > 
> >  ...
> >
> > +static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
> 
> unneeded inlining.
> 
> > +{
> > +	struct page *page = bh->b_page;
> > +	char *addr;
> > +	__u32 checksum;
> > +
> > +	addr = kmap_atomic(page, KM_USER0);
> > +	checksum = crc32_be(crc32_sum,
> > +		(void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
> > +	kunmap_atomic(addr, KM_USER0);
> > +
> > +	return checksum;
> > +}
> 
> Can this buffer actually be in highmem?
> 
> >  static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
> >  				   unsigned long long block)
> 
> More unnecessary inlining.
> 
> > +/*
> > + * jbd2_journal_clear_features () - Clear a given journal feature in the
> > + * 				    superblock
> > + * @journal: Journal to act on.
> > + * @compat: bitmask of compatible features
> > + * @ro: bitmask of features that force read-only mount
> > + * @incompat: bitmask of incompatible features
> > + *
> > + * Clear a given journal feature as present on the
> > + * superblock.  Returns true if the requested features could be reset.
> > + */
> > +int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
> > +				unsigned long ro, unsigned long incompat)
> > +{
> > +	journal_superblock_t *sb;
> > +
> > +	jbd_debug(1, "Clear features 0x%lx/0x%lx/0x%lx\n",
> > +		  compat, ro, incompat);
> > +
> > +	sb = journal->j_superblock;
> > +
> > +	sb->s_feature_compat    &= ~cpu_to_be32(compat);
> > +	sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
> > +	sb->s_feature_incompat  &= ~cpu_to_be32(incompat);
> > +
> > +	return 1;
> > +}
> > +EXPORT_SYMBOL(jbd2_journal_clear_features);
> 
> Kernel usually returns 0 on success.  So we can return a useful errno on
> failure.
> 
> > +/*
> > + * calc_chksums calculates the checksums for the blocks described in the
> > + * descriptor block.
> > + */
> > +static int calc_chksums(journal_t *journal, struct buffer_head *bh,
> > +			unsigned long *next_log_block, __u32 *crc32_sum)
> > +{
> > +	int i, num_blks, err;
> > +	unsigned io_block;
> > +	struct buffer_head *obh;
> > +
> > +	num_blks = count_tags(journal, bh);
> > +	/* Calculate checksum of the descriptor block. */
> > +	*crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
> > +
> > +	for (i = 0; i < num_blks; i++) {
> > +		io_block = (*next_log_block)++;
> 
>  unsigned <- unsigned long.
> 
> Are all the types appropriate in here?
> 
> > +		wrap(journal, *next_log_block);
> > +		err = jread(&obh, journal, io_block);
> > +		if (err) {
> > +			printk(KERN_ERR "JBD: IO error %d recovering block "
> > +				"%u in log\n", err, io_block);
> > +			return 1;
> > +		} else {
> > +			*crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
> > +				     obh->b_size);
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> >  static int do_one_pass(journal_t *journal,
> >  			struct recovery_info *info, enum passtype pass)
> >  {
> > @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journal,
> >  	unsigned int		sequence;
> >  	int			blocktype;
> >  	int			tag_bytes = journal_tag_bytes(journal);
> > +	__u32			crc32_sum = ~0; /* Transactional Checksums */
> >  
> >  	/* Precompute the maximum metadata descriptors in a descriptor block */
> >  	int			MAX_BLOCKS_PER_DESC;
> > @@ -419,9 +452,23 @@ static int do_one_pass(journal_t *journal,
> >  		switch(blocktype) {
> >  		case JBD2_DESCRIPTOR_BLOCK:
> >  			/* If it is a valid descriptor block, replay it
> > -			 * in pass REPLAY; otherwise, just skip over the
> > -			 * blocks it describes. */
> > +			 * in pass REPLAY; if journal_checksums enabled, then
> > +			 * calculate checksums in PASS_SCAN, otherwise,
> > +			 * just skip over the blocks it describes. */
> >  			if (pass != PASS_REPLAY) {
> > +				if (pass == PASS_SCAN &&
> > +				    JBD2_HAS_COMPAT_FEATURE(journal,
> > +					    JBD2_FEATURE_COMPAT_CHECKSUM) &&
> > +				    !info->end_transaction) {
> > +					if (calc_chksums(journal, bh,
> > +							&next_log_block,
> > +							&crc32_sum)) {
> 
> put_bh()
> 
> > +						brelse(bh);
> > +						break;
> > +					}
> > +					brelse(bh);
> > +					continue;
> 
> put_bh()
> 
> > +				}
> >  				next_log_block += count_tags(journal, bh);
> >  				wrap(journal, next_log_block);
> >  				brelse(bh);
> > @@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journal,
> >  			continue;
> >  
> > +					brelse(bh);
> 
> etc
> 

Thanks, Updated  patch below:
ext4: Add the journal checksum feature

From: Girish Shilamkar <girish@clusterfs.com>

The journal checksum feature adds two new flags i.e
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.

JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
checksum for the blocks described by the descriptor blocks.
Due to checksums, writing of the commit record no longer needs to be
synchronous. Now commit record can be sent to disk without waiting for
descriptor blocks to be written to disk. This behavior is controlled
using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
able to recover the journal with _ASYNC_COMMIT hence it is made
incompat.
The commit header has been extended to hold the checksum along with the
type of the checksum.

For recovery in pass scan checksums are verified to ensure the sanity
and completeness(in case of _ASYNC_COMMIT) of every transaction.

Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Girish Shilamkar <girish@clusterfs.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---

 Documentation/filesystems/ext4.txt |   10 +
 fs/Kconfig                         |    1 
 fs/ext4/super.c                    |   25 ++++
 fs/jbd2/commit.c                   |  198 +++++++++++++++++++++++++++----------
 fs/jbd2/journal.c                  |   26 ++++
 fs/jbd2/recovery.c                 |  151 ++++++++++++++++++++++++++--
 include/linux/ext4_fs.h            |    3 
 include/linux/jbd2.h               |   36 +++++-
 8 files changed, 388 insertions(+), 62 deletions(-)


Index: linux-2.6.24-rc8/Documentation/filesystems/ext4.txt
===================================================================
--- linux-2.6.24-rc8.orig/Documentation/filesystems/ext4.txt	2008-01-24 11:18:08.000000000 -0800
+++ linux-2.6.24-rc8/Documentation/filesystems/ext4.txt	2008-01-24 13:00:44.000000000 -0800
@@ -89,6 +89,16 @@ When mounting an ext4 filesystem, the fo
 extents			ext4 will use extents to address file data.  The
 			file system will no longer be mountable by ext3.
 
+journal_checksum	Enable checksumming of the journal transactions.
+			This will allow the recovery code in e2fsck and the
+			kernel to detect corruption in the kernel.  It is a
+			compatible change and will be ignored by older kernels.
+
+journal_async_commit	Commit block can be written to disk without waiting
+			for descriptor blocks. If enabled older kernels cannot
+			mount the device. This will enable 'journal_checksum'
+			internally.
+
 journal=update		Update the ext4 file system's journal to the current
 			format.
 
Index: linux-2.6.24-rc8/fs/Kconfig
===================================================================
--- linux-2.6.24-rc8.orig/fs/Kconfig	2008-01-24 11:18:08.000000000 -0800
+++ linux-2.6.24-rc8/fs/Kconfig	2008-01-24 11:18:55.000000000 -0800
@@ -236,6 +236,7 @@ config JBD_DEBUG
 
 config JBD2
 	tristate
+	select CRC32
 	help
 	  This is a generic journaling layer for block devices that support
 	  both 32-bit and 64-bit block numbers.  It is currently used by
Index: linux-2.6.24-rc8/fs/ext4/super.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/ext4/super.c	2008-01-24 11:18:52.000000000 -0800
+++ linux-2.6.24-rc8/fs/ext4/super.c	2008-01-24 13:00:45.000000000 -0800
@@ -869,6 +869,7 @@ enum {
 	Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
 	Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
 	Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
+	Opt_journal_checksum, Opt_journal_async_commit,
 	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
@@ -908,6 +909,8 @@ static match_table_t tokens = {
 	{Opt_journal_update, "journal=update"},
 	{Opt_journal_inum, "journal=%u"},
 	{Opt_journal_dev, "journal_dev=%u"},
+	{Opt_journal_checksum, "journal_checksum"},
+	{Opt_journal_async_commit, "journal_async_commit"},
 	{Opt_abort, "abort"},
 	{Opt_data_journal, "data=journal"},
 	{Opt_data_ordered, "data=ordered"},
@@ -1095,6 +1098,13 @@ static int parse_options (char *options,
 				return 0;
 			*journal_devnum = option;
 			break;
+		case Opt_journal_checksum:
+			set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
+			break;
+		case Opt_journal_async_commit:
+			set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
+			set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
+			break;
 		case Opt_noload:
 			set_opt (sbi->s_mount_opt, NOLOAD);
 			break;
@@ -2114,6 +2124,21 @@ static int ext4_fill_super (struct super
 		goto failed_mount4;
 	}
 
+	if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
+		jbd2_journal_set_features(sbi->s_journal,
+				JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+				JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+	} else if (test_opt(sb, JOURNAL_CHECKSUM)) {
+		jbd2_journal_set_features(sbi->s_journal,
+				JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
+		jbd2_journal_clear_features(sbi->s_journal, 0, 0,
+				JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+	} else {
+		jbd2_journal_clear_features(sbi->s_journal,
+				JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+				JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+	}
+
 	/* We have now updated the journal if required, so we can
 	 * validate the data journaling mode. */
 	switch (test_opt(sb, DATA_FLAGS)) {
Index: linux-2.6.24-rc8/fs/jbd2/commit.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/jbd2/commit.c	2008-01-24 11:18:54.000000000 -0800
+++ linux-2.6.24-rc8/fs/jbd2/commit.c	2008-01-24 13:02:43.000000000 -0800
@@ -21,6 +21,7 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/jiffies.h>
+#include <linux/crc32.h>
 
 /*
  * Default IO end handler for temporary BJ_IO buffer_heads.
@@ -93,19 +94,23 @@ static int inverted_lock(journal_t *jour
 	return 1;
 }
 
-/* Done it all: now write the commit record.  We should have
+/*
+ * Done it all: now submit the commit record.  We should have
  * cleaned up our previous buffers by now, so if we are in abort
  * mode we can now just skip the rest of the journal write
  * entirely.
  *
  * Returns 1 if the journal needs to be aborted or 0 on success
  */
-static int journal_write_commit_record(journal_t *journal,
-					transaction_t *commit_transaction)
+static int journal_submit_commit_record(journal_t *journal,
+					transaction_t *commit_transaction,
+					struct buffer_head **cbh,
+					__u32 crc32_sum)
 {
 	struct journal_head *descriptor;
+	struct commit_header *tmp;
 	struct buffer_head *bh;
-	int i, ret;
+	int ret;
 	int barrier_done = 0;
 
 	if (is_journal_aborted(journal))
@@ -117,21 +122,33 @@ static int journal_write_commit_record(j
 
 	bh = jh2bh(descriptor);
 
-	/* AKPM: buglet - add `i' to tmp! */
-	for (i = 0; i < bh->b_size; i += 512) {
-		journal_header_t *tmp = (journal_header_t*)bh->b_data;
-		tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
-		tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
-		tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+	tmp = (struct commit_header *)bh->b_data;
+	tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
+	tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
+	tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+
+	if (JBD2_HAS_COMPAT_FEATURE(journal,
+				    JBD2_FEATURE_COMPAT_CHECKSUM)) {
+		tmp->h_chksum_type 	= JBD2_CRC32_CHKSUM;
+		tmp->h_chksum_size 	= JBD2_CRC32_CHKSUM_SIZE;
+		tmp->h_chksum[0] 	= cpu_to_be32(crc32_sum);
 	}
 
-	JBUFFER_TRACE(descriptor, "write commit block");
+	JBUFFER_TRACE(descriptor, "submit commit block");
+	lock_buffer(bh);
+
 	set_buffer_dirty(bh);
-	if (journal->j_flags & JBD2_BARRIER) {
+	set_buffer_uptodate(bh);
+	bh->b_end_io = journal_end_buffer_io_sync;
+
+	if (journal->j_flags & JBD2_BARRIER &&
+		!JBD2_HAS_COMPAT_FEATURE(journal,
+					 JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
 		set_buffer_ordered(bh);
 		barrier_done = 1;
 	}
-	ret = sync_dirty_buffer(bh);
+	ret = submit_bh(WRITE, bh);
+
 	/* is it possible for another commit to fail at roughly
 	 * the same time as this one?  If so, we don't want to
 	 * trust the barrier flag in the super, but instead want
@@ -152,14 +169,72 @@ static int journal_write_commit_record(j
 		clear_buffer_ordered(bh);
 		set_buffer_uptodate(bh);
 		set_buffer_dirty(bh);
-		ret = sync_dirty_buffer(bh);
+		ret = submit_bh(WRITE, bh);
 	}
-	put_bh(bh);		/* One for getblk() */
-	jbd2_journal_put_journal_head(descriptor);
+	*cbh = bh;
+	return ret;
+}
+
+/*
+ * This function along with journal_submit_commit_record
+ * allows to write the commit record asynchronously.
+ */
+static int journal_wait_on_commit_record(struct buffer_head *bh)
+{
+	int ret = 0;
+
+	clear_buffer_dirty(bh);
+	wait_on_buffer(bh);
+
+	if (unlikely(!buffer_uptodate(bh)))
+		ret = -EIO;
+	put_bh(bh);            /* One for getblk() */
+	jbd2_journal_put_journal_head(bh2jh(bh));
 
-	return (ret == -EIO);
+	return ret;
 }
 
+/*
+ * Wait for all submitted IO to complete.
+ */
+static int journal_wait_on_locked_list(journal_t *journal,
+				       transaction_t *commit_transaction)
+{
+	int ret = 0;
+	struct journal_head *jh;
+
+	while (commit_transaction->t_locked_list) {
+		struct buffer_head *bh;
+
+		jh = commit_transaction->t_locked_list->b_tprev;
+		bh = jh2bh(jh);
+		get_bh(bh);
+		if (buffer_locked(bh)) {
+			spin_unlock(&journal->j_list_lock);
+			wait_on_buffer(bh);
+			if (unlikely(!buffer_uptodate(bh)))
+				ret = -EIO;
+			spin_lock(&journal->j_list_lock);
+		}
+		if (!inverted_lock(journal, bh)) {
+			put_bh(bh);
+			spin_lock(&journal->j_list_lock);
+			continue;
+		}
+		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+			__jbd2_journal_unfile_buffer(jh);
+			jbd_unlock_bh_state(bh);
+			jbd2_journal_remove_journal_head(bh);
+			put_bh(bh);
+		} else {
+			jbd_unlock_bh_state(bh);
+		}
+		put_bh(bh);
+		cond_resched_lock(&journal->j_list_lock);
+	}
+	return ret;
+  }
+
 static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
 {
 	int i;
@@ -275,7 +350,21 @@ write_out_data:
 	journal_do_submit_data(wbuf, bufs);
 }
 
-static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
+static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
+{
+	struct page *page = bh->b_page;
+	char *addr;
+	__u32 checksum;
+
+	addr = kmap_atomic(page, KM_USER0);
+	checksum = crc32_be(crc32_sum,
+		(void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
+	kunmap_atomic(addr, KM_USER0);
+
+	return checksum;
+}
+
+static void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
 				   unsigned long long block)
 {
 	tag->t_blocknr = cpu_to_be32(block & (u32)~0);
@@ -307,6 +396,8 @@ void jbd2_journal_commit_transaction(jou
 	int tag_flag;
 	int i;
 	int tag_bytes = journal_tag_bytes(journal);
+	struct buffer_head *cbh = NULL; /* For transactional checksums */
+	__u32 crc32_sum = ~0;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -451,38 +542,15 @@ void jbd2_journal_commit_transaction(jou
 	journal_submit_data_buffers(journal, commit_transaction);
 
 	/*
-	 * Wait for all previously submitted IO to complete.
+	 * Wait for all previously submitted IO to complete if commit
+	 * record is to be written synchronously.
 	 */
 	spin_lock(&journal->j_list_lock);
-	while (commit_transaction->t_locked_list) {
-		struct buffer_head *bh;
+	if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+		JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
+		err = journal_wait_on_locked_list(journal,
+						commit_transaction);
 
-		jh = commit_transaction->t_locked_list->b_tprev;
-		bh = jh2bh(jh);
-		get_bh(bh);
-		if (buffer_locked(bh)) {
-			spin_unlock(&journal->j_list_lock);
-			wait_on_buffer(bh);
-			if (unlikely(!buffer_uptodate(bh)))
-				err = -EIO;
-			spin_lock(&journal->j_list_lock);
-		}
-		if (!inverted_lock(journal, bh)) {
-			put_bh(bh);
-			spin_lock(&journal->j_list_lock);
-			continue;
-		}
-		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
-			__jbd2_journal_unfile_buffer(jh);
-			jbd_unlock_bh_state(bh);
-			jbd2_journal_remove_journal_head(bh);
-			put_bh(bh);
-		} else {
-			jbd_unlock_bh_state(bh);
-		}
-		put_bh(bh);
-		cond_resched_lock(&journal->j_list_lock);
-	}
 	spin_unlock(&journal->j_list_lock);
 
 	if (err)
@@ -656,6 +724,15 @@ void jbd2_journal_commit_transaction(jou
 start_journal_io:
 			for (i = 0; i < bufs; i++) {
 				struct buffer_head *bh = wbuf[i];
+				/*
+				 * Compute checksum.
+				 */
+				if (JBD2_HAS_COMPAT_FEATURE(journal,
+					JBD2_FEATURE_COMPAT_CHECKSUM)) {
+					crc32_sum =
+					    jbd2_checksum_data(crc32_sum, bh);
+				}
+
 				lock_buffer(bh);
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
@@ -672,6 +749,23 @@ start_journal_io:
 		}
 	}
 
+	/* Done it all: now write the commit record asynchronously. */
+
+	if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+		JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+		err = journal_submit_commit_record(journal, commit_transaction,
+						 &cbh, crc32_sum);
+		if (err)
+			__jbd2_journal_abort_hard(journal);
+
+		spin_lock(&journal->j_list_lock);
+		err = journal_wait_on_locked_list(journal,
+						commit_transaction);
+		spin_unlock(&journal->j_list_lock);
+		if (err)
+			__jbd2_journal_abort_hard(journal);
+	}
+
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
            complete.  Control buffers being written are on the
@@ -771,8 +865,14 @@ wait_for_iobuf:
 
 	jbd_debug(3, "JBD: commit phase 6\n");
 
-	if (journal_write_commit_record(journal, commit_transaction))
-		err = -EIO;
+	if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+		JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+		err = journal_submit_commit_record(journal, commit_transaction,
+						&cbh, crc32_sum);
+		if (err)
+			__jbd2_journal_abort_hard(journal);
+	}
+	err = journal_wait_on_commit_record(cbh);
 
 	if (err)
 		jbd2_journal_abort(journal, err);
Index: linux-2.6.24-rc8/fs/jbd2/journal.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/jbd2/journal.c	2008-01-24 11:18:54.000000000 -0800
+++ linux-2.6.24-rc8/fs/jbd2/journal.c	2008-01-24 13:06:53.000000000 -0800
@@ -1578,6 +1578,32 @@ int jbd2_journal_set_features (journal_t
 	return 1;
 }
 
+/*
+ * jbd2_journal_clear_features () - Clear a given journal feature in the
+ * 				    superblock
+ * @journal: Journal to act on.
+ * @compat: bitmask of compatible features
+ * @ro: bitmask of features that force read-only mount
+ * @incompat: bitmask of incompatible features
+ *
+ * Clear a given journal feature as present on the
+ * superblock.
+ */
+int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
+				unsigned long ro, unsigned long incompat)
+{
+	journal_superblock_t *sb;
+
+	jbd_debug(1, "Clear features 0x%lx/0x%lx/0x%lx\n",
+		  compat, ro, incompat);
+
+	sb = journal->j_superblock;
+
+	sb->s_feature_compat    &= ~cpu_to_be32(compat);
+	sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
+	sb->s_feature_incompat  &= ~cpu_to_be32(incompat);
+}
+EXPORT_SYMBOL(jbd2_journal_clear_features);
 
 /**
  * int jbd2_journal_update_format () - Update on-disk journal structure.
Index: linux-2.6.24-rc8/fs/jbd2/recovery.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/jbd2/recovery.c	2008-01-24 11:18:08.000000000 -0800
+++ linux-2.6.24-rc8/fs/jbd2/recovery.c	2008-01-24 13:16:07.000000000 -0800
@@ -21,6 +21,7 @@
 #include <linux/jbd2.h>
 #include <linux/errno.h>
 #include <linux/slab.h>
+#include <linux/crc32.h>
 #endif
 
 /*
@@ -316,6 +317,37 @@ static inline unsigned long long read_ta
 	return block;
 }
 
+/*
+ * calc_chksums calculates the checksums for the blocks described in the
+ * descriptor block.
+ */
+static int calc_chksums(journal_t *journal, struct buffer_head *bh,
+			unsigned long *next_log_block, __u32 *crc32_sum)
+{
+	int i, num_blks, err;
+	unsigned long io_block;
+	struct buffer_head *obh;
+
+	num_blks = count_tags(journal, bh);
+	/* Calculate checksum of the descriptor block. */
+	*crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
+
+	for (i = 0; i < num_blks; i++) {
+		io_block = (*next_log_block)++;
+		wrap(journal, *next_log_block);
+		err = jread(&obh, journal, io_block);
+		if (err) {
+			printk(KERN_ERR "JBD: IO error %d recovering block "
+				"%lu in log\n", err, io_block);
+			return 1;
+		} else {
+			*crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
+				     obh->b_size);
+		}
+	}
+	return 0;
+}
+
 static int do_one_pass(journal_t *journal,
 			struct recovery_info *info, enum passtype pass)
 {
@@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journa
 	unsigned int		sequence;
 	int			blocktype;
 	int			tag_bytes = journal_tag_bytes(journal);
+	__u32			crc32_sum = ~0; /* Transactional Checksums */
 
 	/* Precompute the maximum metadata descriptors in a descriptor block */
 	int			MAX_BLOCKS_PER_DESC;
@@ -419,12 +452,26 @@ static int do_one_pass(journal_t *journa
 		switch(blocktype) {
 		case JBD2_DESCRIPTOR_BLOCK:
 			/* If it is a valid descriptor block, replay it
-			 * in pass REPLAY; otherwise, just skip over the
-			 * blocks it describes. */
+			 * in pass REPLAY; if journal_checksums enabled, then
+			 * calculate checksums in PASS_SCAN, otherwise,
+			 * just skip over the blocks it describes. */
 			if (pass != PASS_REPLAY) {
+				if (pass == PASS_SCAN &&
+				    JBD2_HAS_COMPAT_FEATURE(journal,
+					    JBD2_FEATURE_COMPAT_CHECKSUM) &&
+				    !info->end_transaction) {
+					if (calc_chksums(journal, bh,
+							&next_log_block,
+							&crc32_sum)) {
+						put_bh(bh);
+						break;
+					}
+					put_bh(bh);
+					continue;
+				}
 				next_log_block += count_tags(journal, bh);
 				wrap(journal, next_log_block);
-				brelse(bh);
+				put_bh(bh);
 				continue;
 			}
 
@@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journa
 			continue;
 
 		case JBD2_COMMIT_BLOCK:
-			/* Found an expected commit block: not much to
-			 * do other than move on to the next sequence
+			/*     How to differentiate between interrupted commit
+			 *               and journal corruption ?
+			 *
+			 * {nth transaction}
+			 *        Checksum Verification Failed
+			 *			 |
+			 *		 ____________________
+			 *		|		     |
+			 * 	async_commit             sync_commit
+			 *     		|                    |
+			 *		| GO TO NEXT    "Journal Corruption"
+			 *		| TRANSACTION
+			 *		|
+			 * {(n+1)th transanction}
+			 *		|
+			 * 	 _______|______________
+			 * 	|	 	      |
+			 * Commit block found	Commit block not found
+			 *      |		      |
+			 * "Journal Corruption"       |
+			 *		 _____________|_________
+			 *     		|	           	|
+			 *	nth trans corrupt	OR   nth trans
+			 *	and (n+1)th interrupted     interrupted
+			 *	before commit block
+			 *      could reach the disk.
+			 *	(Cannot find the difference in above
+			 *	 mentioned conditions. Hence assume
+			 *	 "Interrupted Commit".)
+			 */
+
+			/* Found an expected commit block: if checksums
+			 * are present verify them in PASS_SCAN; else not
+			 * much to do other than move on to the next sequence
 			 * number. */
+			if (pass == PASS_SCAN &&
+			    JBD2_HAS_COMPAT_FEATURE(journal,
+				    JBD2_FEATURE_COMPAT_CHECKSUM)) {
+				int chksum_err, chksum_seen;
+				struct commit_header *cbh =
+					(struct commit_header *)bh->b_data;
+				unsigned found_chksum =
+					be32_to_cpu(cbh->h_chksum[0]);
+
+				chksum_err = chksum_seen = 0;
+
+				if (info->end_transaction) {
+					printk(KERN_ERR "JBD: Transaction %u "
+						"found to be corrupt.\n",
+						next_commit_ID - 1);
+					brelse(bh);
+					break;
+				}
+
+				if (crc32_sum == found_chksum &&
+				    cbh->h_chksum_type == JBD2_CRC32_CHKSUM &&
+				    cbh->h_chksum_size ==
+						JBD2_CRC32_CHKSUM_SIZE)
+				       chksum_seen = 1;
+				else if (!(cbh->h_chksum_type == 0 &&
+					     cbh->h_chksum_size == 0 &&
+					     found_chksum == 0 &&
+					     !chksum_seen))
+				/*
+				 * If fs is mounted using an old kernel and then
+				 * kernel with journal_chksum is used then we
+				 * get a situation where the journal flag has
+				 * checksum flag set but checksums are not
+				 * present i.e chksum = 0, in the individual
+				 * commit blocks.
+				 * Hence to avoid checksum failures, in this
+				 * situation, this extra check is added.
+				 */
+						chksum_err = 1;
+
+				if (chksum_err) {
+					info->end_transaction = next_commit_ID;
+
+					if (!JBD2_HAS_COMPAT_FEATURE(journal,
+					   JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)){
+						printk(KERN_ERR
+						       "JBD: Transaction %u "
+						       "found to be corrupt.\n",
+						       next_commit_ID);
+						brelse(bh);
+						break;
+					}
+				}
+				crc32_sum = ~0;
+			}
 			brelse(bh);
 			next_commit_ID++;
 			continue;
@@ -554,9 +688,10 @@ static int do_one_pass(journal_t *journa
 	 * transaction marks the end of the valid log.
 	 */
 
-	if (pass == PASS_SCAN)
-		info->end_transaction = next_commit_ID;
-	else {
+	if (pass == PASS_SCAN) {
+		if (!info->end_transaction)
+			info->end_transaction = next_commit_ID;
+	} else {
 		/* It's really bad news if different passes end up at
 		 * different places (but possible due to IO errors). */
 		if (info->end_transaction != next_commit_ID) {
Index: linux-2.6.24-rc8/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.24-rc8.orig/include/linux/ext4_fs.h	2008-01-24 11:18:52.000000000 -0800
+++ linux-2.6.24-rc8/include/linux/ext4_fs.h	2008-01-24 13:00:45.000000000 -0800
@@ -467,7 +467,8 @@ do {									       \
 #define EXT4_MOUNT_USRQUOTA		0x100000 /* "old" user quota */
 #define EXT4_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
 #define EXT4_MOUNT_EXTENTS		0x400000 /* Extents support */
-
+#define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
+#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
 /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
 #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
Index: linux-2.6.24-rc8/include/linux/jbd2.h
===================================================================
--- linux-2.6.24-rc8.orig/include/linux/jbd2.h	2008-01-24 11:18:54.000000000 -0800
+++ linux-2.6.24-rc8/include/linux/jbd2.h	2008-01-24 11:46:16.000000000 -0800
@@ -149,6 +149,28 @@ typedef struct journal_header_s
 	__be32		h_sequence;
 } journal_header_t;
 
+/*
+ * Checksum types.
+ */
+#define JBD2_CRC32_CHKSUM   1
+#define JBD2_MD5_CHKSUM     2
+#define JBD2_SHA1_CHKSUM    3
+
+#define JBD2_CRC32_CHKSUM_SIZE 4
+
+#define JBD2_CHECKSUM_BYTES (32 / sizeof(u32))
+/*
+ * Commit block header for storing transactional checksums:
+ */
+struct commit_header {
+	__be32		h_magic;
+	__be32          h_blocktype;
+	__be32          h_sequence;
+	unsigned char   h_chksum_type;
+	unsigned char   h_chksum_size;
+	unsigned char 	h_padding[2];
+	__be32 		h_chksum[JBD2_CHECKSUM_BYTES];
+};
 
 /*
  * The block tag: used to describe a single buffer in the journal.
@@ -242,14 +264,18 @@ typedef struct journal_superblock_s
 	((j)->j_format_version >= 2 &&					\
 	 ((j)->j_superblock->s_feature_incompat & cpu_to_be32((mask))))
 
-#define JBD2_FEATURE_INCOMPAT_REVOKE	0x00000001
-#define JBD2_FEATURE_INCOMPAT_64BIT	0x00000002
+#define JBD2_FEATURE_COMPAT_CHECKSUM	0x00000001
+
+#define JBD2_FEATURE_INCOMPAT_REVOKE		0x00000001
+#define JBD2_FEATURE_INCOMPAT_64BIT		0x00000002
+#define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT	0x00000004
 
 /* Features known to this kernel version: */
-#define JBD2_KNOWN_COMPAT_FEATURES	0
+#define JBD2_KNOWN_COMPAT_FEATURES	JBD2_FEATURE_COMPAT_CHECKSUM
 #define JBD2_KNOWN_ROCOMPAT_FEATURES	0
 #define JBD2_KNOWN_INCOMPAT_FEATURES	(JBD2_FEATURE_INCOMPAT_REVOKE | \
-					 JBD2_FEATURE_INCOMPAT_64BIT)
+					JBD2_FEATURE_INCOMPAT_64BIT | \
+					JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
 
 #ifdef __KERNEL__
 
@@ -997,6 +1023,8 @@ extern int	   jbd2_journal_check_availab
 		   (journal_t *, unsigned long, unsigned long, unsigned long);
 extern int	   jbd2_journal_set_features
 		   (journal_t *, unsigned long, unsigned long, unsigned long);
+extern int	   jbd2_journal_clear_features
+		   (journal_t *, unsigned long, unsigned long, unsigned long);
 extern int	   jbd2_journal_create     (journal_t *);
 extern int	   jbd2_journal_load       (journal_t *journal);
 extern void	   jbd2_journal_destroy    (journal_t *);


> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


  parent reply	other threads:[~2008-01-24 21:24 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-01-22  3:01 ext4 merge plans for 2.6.25 Theodore Ts'o
2008-01-22  3:01 ` [PATCH 01/49] ext4: Support large blocksize up to PAGESIZE Theodore Ts'o
2008-01-22  3:01   ` [PATCH 02/49] ext4: Avoid rec_len overflow with 64KB block size Theodore Ts'o
2008-01-22  3:01     ` [PATCH 03/49] ext4: Introduce ext4_lblk_t Theodore Ts'o
2008-01-22  3:01       ` [PATCH 04/49] ext4 extents: remove unneeded casts Theodore Ts'o
2008-01-22  3:01         ` [PATCH 05/49] ext4: add ext4_group_t, and change all group variables to this type Theodore Ts'o
2008-01-22  3:01           ` [PATCH 06/49] ext4: fixes block group number being set to a negative value Theodore Ts'o
2008-01-22  3:01             ` [PATCH 07/49] ext4: Introduce ext4_update_*_feature Theodore Ts'o
2008-01-22  3:01               ` [PATCH 08/49] ext4: Fix sparse warnings Theodore Ts'o
2008-01-22  3:01                 ` [PATCH 09/49] ext4: Rename i_file_acl to i_file_acl_lo Theodore Ts'o
2008-01-22  3:01                   ` [PATCH 10/49] ext4: Rename i_dir_acl to i_size_high Theodore Ts'o
2008-01-22  3:01                     ` [PATCH 11/49] ext4: Add support for 48 bit inode i_blocks Theodore Ts'o
2008-01-22  3:01                       ` [PATCH 12/49] ext4: Support large files Theodore Ts'o
2008-01-22  3:01                         ` [PATCH 13/49] ext4: different maxbytes functions for bitmap & extent files Theodore Ts'o
2008-01-22  3:01                           ` [PATCH 14/49] ext4: export iov_shorten from kernel for ext4's use Theodore Ts'o
2008-01-22  3:01                             ` [PATCH 15/49] ext4: store maxbytes for bitmapped files and return EFBIG as appropriate Theodore Ts'o
2008-01-22  3:01                               ` [PATCH 16/49] ext2: Fix the max file size for ext2 file system Theodore Ts'o
2008-01-22  3:01                                 ` [PATCH 17/49] ext3: Fix the max file size for ext3 " Theodore Ts'o
2008-01-22  3:01                                   ` [PATCH 18/49] ext4: sync up block group descriptor with e2fsprogs Theodore Ts'o
2008-01-22  3:01                                     ` [PATCH 19/49] ext4: Return after ext4_error in case of failures Theodore Ts'o
2008-01-22  3:01                                       ` [PATCH 20/49] ext4/super.c: fix #ifdef's (CONFIG_EXT4_* -> CONFIG_EXT4DEV_*) Theodore Ts'o
2008-01-22  3:02                                         ` [PATCH 21/49] ext4: fix oops on corrupted ext4 mount Theodore Ts'o
2008-01-22  3:02                                           ` [PATCH 22/49] ext4: Change the default behaviour on error Theodore Ts'o
2008-01-22  3:02                                             ` [PATCH 23/49] Add buffer head related helper functions Theodore Ts'o
2008-01-22  3:02                                               ` [PATCH 24/49] ext4: add block bitmap validation Theodore Ts'o
2008-01-22  3:02                                                 ` [PATCH 25/49] jbd2: Remove printk from J_ASSERT to preserve registers during BUG Theodore Ts'o
2008-01-22  3:02                                                   ` [PATCH 26/49] jbd2: Fix assertion failure in fs/jbd2/checkpoint.c Theodore Ts'o
2008-01-22  3:02                                                     ` [PATCH 27/49] ext4: Check for the correct error return from Theodore Ts'o
2008-01-22  3:02                                                       ` [PATCH 28/49] ext4: remove unused code from ext4_find_entry() Theodore Ts'o
2008-01-22  3:02                                                         ` [PATCH 29/49] ext4: Make ext4_get_blocks_wrap take the truncate_mutex early Theodore Ts'o
2008-01-22  3:02                                                           ` [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore Theodore Ts'o
2008-01-22  3:02                                                             ` [PATCH 31/49] ext4: Take read lock during overwrite case Theodore Ts'o
2008-01-22  3:02                                                               ` [PATCH 32/49] jbd2: jbd2 stats through procfs Theodore Ts'o
2008-01-22  3:02                                                                 ` [PATCH 33/49] ext4: Add the journal checksum feature Theodore Ts'o
2008-01-22  3:02                                                                   ` [PATCH 34/49] vfs: Add 64 bit i_version support Theodore Ts'o
2008-01-22  3:02                                                                     ` [PATCH 35/49] ext4: Add inode version support in ext4 Theodore Ts'o
2008-01-22  3:02                                                                       ` [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl Theodore Ts'o
2008-01-22  3:02                                                                         ` [PATCH 37/49] ext4: Fix ext4_show_options to show the correct mount options Theodore Ts'o
2008-01-22  3:02                                                                           ` [PATCH 38/49] ext4: fix up EXT4FS_DEBUG builds Theodore Ts'o
2008-01-22  3:02                                                                             ` [PATCH 39/49] ext4: Add ext4_find_next_bit() Theodore Ts'o
2008-01-22  3:02                                                                               ` [PATCH 40/49] ext4: Add new functions for searching extent tree Theodore Ts'o
2008-01-22  3:02                                                                                 ` [PATCH 41/49] ext4: Add multi block allocator for ext4 Theodore Ts'o
2008-01-22  3:02                                                                                   ` [PATCH 42/49] ext4: Enable the multiblock allocator by default Theodore Ts'o
2008-01-22  3:02                                                                                     ` [PATCH 43/49] ext4: Check for return value from sb_set_blocksize Theodore Ts'o
2008-01-22  3:02                                                                                       ` [PATCH 44/49] ext4: fix uniniatilized extent splitting error Theodore Ts'o
2008-01-22  3:02                                                                                         ` [PATCH 45/49] ext4: Use the ext4_ext_actual_len() helper function Theodore Ts'o
2008-01-22  3:02                                                                                           ` [PATCH 46/49] jbd2: add lockdep support Theodore Ts'o
2008-01-22  3:02                                                                                             ` [PATCH 47/49] jbd2: Mark jbd2 slabs as SLAB_TEMPORARY Theodore Ts'o
2008-01-22  3:02                                                                                               ` [PATCH 48/49] jbd2: Use round-jiffies() function for the "5 second" ext4/jbd2 wakeup Theodore Ts'o
2008-01-22  3:02                                                                                                 ` [PATCH 49/49] jbd2: sparse pointer use of zero as null Theodore Ts'o
2008-01-23 22:07                                                                                   ` [PATCH 41/49] ext4: Add multi block allocator for ext4 Andrew Morton
2008-01-23 23:20                                                                                     ` Andreas Dilger
2008-01-24  7:56                                                                                     ` Aneesh Kumar K.V
2008-01-24  9:04                                                                                       ` Aneesh Kumar K.V
2008-01-24 14:53                                                                                       ` Aneesh Kumar K.V
2008-01-28 18:45                                                                                   ` Eric Sandeen
2008-01-23 22:07                                                                         ` [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl Andrew Morton
2008-01-24  5:55                                                                           ` Aneesh Kumar K.V
2008-01-26  4:15                                                                             ` Theodore Tso
2008-01-26  8:42                                                                               ` Aneesh Kumar K.V
2008-01-23 22:07                                                                   ` [PATCH 33/49] ext4: Add the journal checksum feature Andrew Morton
2008-01-23 22:40                                                                     ` Andreas Dilger
2008-01-24 21:24                                                                     ` Mingming Cao [this message]
2008-02-01 20:50                                                                       ` Girish Shilamkar
2008-01-23 22:06                                                             ` [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore Andrew Morton
2008-01-24  5:29                                                               ` Aneesh Kumar K.V
2008-01-24 13:00                                                               ` Andy Whitcroft
2008-01-23 22:06                                                 ` [PATCH 24/49] ext4: add block bitmap validation Andrew Morton
2008-01-26 13:26                                                   ` Theodore Tso
2008-01-23 22:06                                               ` [PATCH 23/49] Add buffer head related helper functions Andrew Morton
2008-01-24  5:22                                                 ` Aneesh Kumar K.V
2008-01-24  8:53                                                   ` Andrew Morton
2008-01-23 12:43 ` ext4 merge plans for 2.6.25 Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1201209848.4105.14.camel@localhost.localdomain \
    --to=cmm@us.ibm.com \
    --cc=adilger@clusterfs.com \
    --cc=akpm@linux-foundation.org \
    --cc=girish@clusterfs.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=shaggy@linux.vnet.ibm.com \
    --cc=tytso@MIT.EDU \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox