* [PATCH 26/49] jbd2: Fix assertion failure in fs/jbd2/checkpoint.c
[not found] ` <1200970948-17903-26-git-send-email-tytso@mit.edu>
@ 2008-01-22 3:02 ` Theodore Ts'o
[not found] ` <1200970948-17903-28-git-send-email-tytso@mit.edu>
0 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2008-01-22 3:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Jan Kara, linux-ext4, Andrew Morton
From: Jan Kara <jack@suse.cz>
Before we start committing a transaction, we call
__journal_clean_checkpoint_list() to clean up the transaction's
written-back buffers.
If this call happens to remove all of them (and there were already some
buffers), __journal_remove_checkpoint() decides to free the transaction
because it isn't (yet) a committing transaction, and soon afterwards we
trip an assertion - the transaction really isn't ready to be freed :).
We change the check in __journal_remove_checkpoint() to free only a
transaction in T_FINISHED state. The locking there is subtle though (as
everywhere in JBD ;(). We use j_list_lock to protect the check and the
subsequent call to __journal_drop_transaction(), and do the same at the
end of journal_commit_transaction(), which is the only place where a
transaction can get to T_FINISHED state.
Probably I'm too paranoid here and such locking is not really necessary -
checkpoint lists are processed only from log_do_checkpoint(), where a
transaction must already be committed to be processed, or from
__journal_clean_checkpoint_list(), where kjournald itself calls it and
thus the transaction cannot change state either. Better to be safe in
case something changes in the future...
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
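For illustration, the protocol this patch establishes looks roughly like
the following (a simplified sketch; both sides appear in the hunks
below):

	/* commit side, end of jbd2_journal_commit_transaction(): */
	spin_lock(&journal->j_state_lock);
	spin_lock(&journal->j_list_lock);
	commit_transaction->t_state = T_FINISHED;
	/* from here on, checkpoint code may drop the transaction */
	spin_unlock(&journal->j_list_lock);
	spin_unlock(&journal->j_state_lock);

	/* checkpoint side, __jbd2_journal_remove_checkpoint(),
	 * running under j_list_lock: */
	if (transaction->t_state != T_FINISHED)
		goto out;	/* running/committing: must not be freed */
	__jbd2_journal_drop_transaction(journal, transaction);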
fs/jbd2/checkpoint.c | 12 ++++++------
fs/jbd2/commit.c | 8 ++++----
include/linux/jbd2.h | 2 ++
3 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 3fccde7..7e958c8 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -602,15 +602,15 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
/*
* There is one special case to worry about: if we have just pulled the
- * buffer off a committing transaction's forget list, then even if the
- * checkpoint list is empty, the transaction obviously cannot be
- * dropped!
+ * buffer off a running or committing transaction's checkpoint list,
+ * then even if the checkpoint list is empty, the transaction obviously
+ * cannot be dropped!
*
- * The locking here around j_committing_transaction is a bit sleazy.
+ * The locking here around t_state is a bit sleazy.
* See the comment at the end of jbd2_journal_commit_transaction().
*/
- if (transaction == journal->j_committing_transaction) {
- JBUFFER_TRACE(jh, "belongs to committing transaction");
+ if (transaction->t_state != T_FINISHED) {
+ JBUFFER_TRACE(jh, "belongs to running/committing transaction");
goto out;
}
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6986f33..39b5cee 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -867,10 +867,10 @@ restart_loop:
}
spin_unlock(&journal->j_list_lock);
/*
- * This is a bit sleazy. We borrow j_list_lock to protect
- * journal->j_committing_transaction in __jbd2_journal_remove_checkpoint.
- * Really, __jbd2_journal_remove_checkpoint should be using j_state_lock but
- * it's a bit hassle to hold that across __jbd2_journal_remove_checkpoint
+ * This is a bit sleazy. We use j_list_lock to protect transition
+ * of a transaction into T_FINISHED state and calling
+ * __jbd2_journal_drop_transaction(). Otherwise we could race with
+ * other checkpointing code processing the transaction...
*/
spin_lock(&journal->j_state_lock);
spin_lock(&journal->j_list_lock);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index d5f7cff..d861ffd 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -442,6 +442,8 @@ struct transaction_s
/*
* Transaction's current state
* [no locking - only kjournald2 alters this]
+ * [j_list_lock] guards transition of a transaction into T_FINISHED
+ * state and subsequent call of __jbd2_journal_drop_transaction()
* FIXME: needs barriers
* KLUDGE: [use j_state_lock]
*/
--
1.5.4.rc3.31.g1271-dirty
* [PATCH 39/49] ext4: Add ext4_find_next_bit()
[not found] ` <1200970948-17903-39-git-send-email-tytso@mit.edu>
@ 2008-01-22 3:02 ` Theodore Ts'o
[not found] ` <1200970948-17903-41-git-send-email-tytso@mit.edu>
0 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2008-01-22 3:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Aneesh Kumar K.V, linux-ext4, Andrew Morton
From: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
This function is used by the ext4 multiblock allocator patches.
Also add generic_find_next_le_bit(), the generic little-endian
find-next-set-bit primitive backing it on most architectures.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
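As a usage sketch (hypothetical values): the new primitive returns the
bit number of the next set bit in a little-endian bitmap, or `size' if
no set bit exists at or after `offset':

	unsigned long bitmap[2];	/* a 128-bit on-disk bitmap */
	unsigned long next;

	memset(bitmap, 0, sizeof(bitmap));
	ext4_set_bit(70, bitmap);	/* little-endian bit set */
	next = ext4_find_next_bit(bitmap, 128, 64);	/* -> 70 */
	next = ext4_find_next_bit(bitmap, 128, 71);	/* -> 128 (none) */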
include/asm-arm/bitops.h | 2 +
include/asm-generic/bitops/ext2-non-atomic.h | 2 +
include/asm-generic/bitops/le.h | 4 ++
include/asm-m68k/bitops.h | 2 +
include/asm-m68knommu/bitops.h | 2 +
include/asm-powerpc/bitops.h | 4 ++
include/asm-s390/bitops.h | 2 +
include/linux/ext4_fs.h | 1 +
lib/find_next_bit.c | 43 ++++++++++++++++++++++++++
9 files changed, 62 insertions(+), 0 deletions(-)
diff --git a/include/asm-arm/bitops.h b/include/asm-arm/bitops.h
index 47a6b08..5c60bfc 100644
--- a/include/asm-arm/bitops.h
+++ b/include/asm-arm/bitops.h
@@ -310,6 +310,8 @@ static inline int constant_fls(int x)
_find_first_zero_bit_le(p,sz)
#define ext2_find_next_zero_bit(p,sz,off) \
_find_next_zero_bit_le(p,sz,off)
+#define ext2_find_next_bit(p, sz, off) \
+ _find_next_bit_le(p, sz, off)
/*
* Minix is defined to use little-endian byte ordering.
diff --git a/include/asm-generic/bitops/ext2-non-atomic.h b/include/asm-generic/bitops/ext2-non-atomic.h
index 1697404..63cf822 100644
--- a/include/asm-generic/bitops/ext2-non-atomic.h
+++ b/include/asm-generic/bitops/ext2-non-atomic.h
@@ -14,5 +14,7 @@
generic_find_first_zero_le_bit((unsigned long *)(addr), (size))
#define ext2_find_next_zero_bit(addr, size, off) \
generic_find_next_zero_le_bit((unsigned long *)(addr), (size), (off))
+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)(addr), (size), (off))
#endif /* _ASM_GENERIC_BITOPS_EXT2_NON_ATOMIC_H_ */
diff --git a/include/asm-generic/bitops/le.h b/include/asm-generic/bitops/le.h
index b9c7e5d..80e3bf1 100644
--- a/include/asm-generic/bitops/le.h
+++ b/include/asm-generic/bitops/le.h
@@ -20,6 +20,8 @@
#define generic___test_and_clear_le_bit(nr, addr) __test_and_clear_bit(nr, addr)
#define generic_find_next_zero_le_bit(addr, size, offset) find_next_zero_bit(addr, size, offset)
+#define generic_find_next_le_bit(addr, size, offset) \
+ find_next_bit(addr, size, offset)
#elif defined(__BIG_ENDIAN)
@@ -42,6 +44,8 @@
extern unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);
+extern unsigned long generic_find_next_le_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);
#else
#error "Please fix <asm/byteorder.h>"
diff --git a/include/asm-m68k/bitops.h b/include/asm-m68k/bitops.h
index 2976b5d..83d1f28 100644
--- a/include/asm-m68k/bitops.h
+++ b/include/asm-m68k/bitops.h
@@ -410,6 +410,8 @@ static inline int ext2_find_next_zero_bit(const void *vaddr, unsigned size,
res = ext2_find_first_zero_bit (p, size - 32 * (p - addr));
return (p - addr) * 32 + res;
}
+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)(addr), (size), (off))
#endif /* __KERNEL__ */
diff --git a/include/asm-m68knommu/bitops.h b/include/asm-m68knommu/bitops.h
index f8dfb7b..f43afe1 100644
--- a/include/asm-m68knommu/bitops.h
+++ b/include/asm-m68knommu/bitops.h
@@ -294,6 +294,8 @@ found_middle:
return result + ffz(__swab32(tmp));
}
+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)(addr), (size), (off))
#include <asm-generic/bitops/minix.h>
#endif /* __KERNEL__ */
diff --git a/include/asm-powerpc/bitops.h b/include/asm-powerpc/bitops.h
index 733b4af..220d9a7 100644
--- a/include/asm-powerpc/bitops.h
+++ b/include/asm-powerpc/bitops.h
@@ -359,6 +359,8 @@ static __inline__ int test_le_bit(unsigned long nr,
unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);
+unsigned long generic_find_next_le_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);
/* Bitmap functions for the ext2 filesystem */
#define ext2_set_bit(nr,addr) \
@@ -378,6 +380,8 @@ unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
#define ext2_find_next_zero_bit(addr, size, off) \
generic_find_next_zero_le_bit((unsigned long*)addr, size, off)
+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)addr, size, off)
/* Bitmap functions for the minix filesystem. */
#define minix_test_and_set_bit(nr,addr) \
diff --git a/include/asm-s390/bitops.h b/include/asm-s390/bitops.h
index 34d9a63..dba6fec 100644
--- a/include/asm-s390/bitops.h
+++ b/include/asm-s390/bitops.h
@@ -772,6 +772,8 @@ static inline int sched_find_first_bit(unsigned long *b)
test_and_clear_bit((nr)^(__BITOPS_WORDSIZE - 8), (unsigned long *)addr)
#define ext2_test_bit(nr, addr) \
test_bit((nr)^(__BITOPS_WORDSIZE - 8), (unsigned long *)addr)
+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)(addr), (size), (off))
#ifndef __s390x__
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 213974f..d0b7ca9 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -493,6 +493,7 @@ do { \
#define ext4_test_bit ext2_test_bit
#define ext4_find_first_zero_bit ext2_find_first_zero_bit
#define ext4_find_next_zero_bit ext2_find_next_zero_bit
+#define ext4_find_next_bit ext2_find_next_bit
/*
* Maximal mount counts between two filesystem checks
diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
index bda0d71..78ccd73 100644
--- a/lib/find_next_bit.c
+++ b/lib/find_next_bit.c
@@ -178,4 +178,47 @@ found_middle_swap:
EXPORT_SYMBOL(generic_find_next_zero_le_bit);
+unsigned long generic_find_next_le_bit(const unsigned long *addr, unsigned
+ long size, unsigned long offset)
+{
+ const unsigned long *p = addr + BITOP_WORD(offset);
+ unsigned long result = offset & ~(BITS_PER_LONG - 1);
+ unsigned long tmp;
+
+ if (offset >= size)
+ return size;
+ size -= result;
+ offset &= (BITS_PER_LONG - 1UL);
+ if (offset) {
+ tmp = ext2_swabp(p++);
+ tmp &= (~0UL << offset);
+ if (size < BITS_PER_LONG)
+ goto found_first;
+ if (tmp)
+ goto found_middle;
+ size -= BITS_PER_LONG;
+ result += BITS_PER_LONG;
+ }
+
+ while (size & ~(BITS_PER_LONG - 1)) {
+ tmp = *(p++);
+ if (tmp)
+ goto found_middle_swap;
+ result += BITS_PER_LONG;
+ size -= BITS_PER_LONG;
+ }
+ if (!size)
+ return result;
+ tmp = ext2_swabp(p);
+found_first:
+ tmp &= (~0UL >> (BITS_PER_LONG - size));
+ if (tmp == 0UL) /* Are any bits set? */
+ return result + size; /* Nope. */
+found_middle:
+ return result + __ffs(tmp);
+
+found_middle_swap:
+ return result + __ffs(ext2_swab(tmp));
+}
+EXPORT_SYMBOL(generic_find_next_le_bit);
#endif /* __BIG_ENDIAN */
--
1.5.4.rc3.31.g1271-dirty
* Re: [PATCH 23/49] Add buffer head related helper functions
[not found] ` <1200970948-17903-24-git-send-email-tytso@mit.edu>
@ 2008-01-23 22:06 ` Andrew Morton
2008-01-24 5:22 ` Aneesh Kumar K.V
[not found] ` <1200970948-17903-25-git-send-email-tytso@mit.edu>
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-01-23 22:06 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-kernel, aneesh.kumar, linux-ext4@vger.kernel.org
> On Mon, 21 Jan 2008 22:02:02 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> +}
> +EXPORT_SYMBOL(bh_uptodate_or_lock);
> +/**
Missing newline.
> + * bh_submit_read: Submit a locked buffer for reading
> + * @bh: struct buffer_head
> + *
> + * Returns a negative error
> + */
> +int bh_submit_read(struct buffer_head *bh)
> +{
> + if (!buffer_locked(bh))
> + lock_buffer(bh);
> +
> + if (buffer_uptodate(bh))
> + return 0;
Here it can lock the buffer then return zero
> + get_bh(bh);
> + bh->b_end_io = end_buffer_read_sync;
> + submit_bh(READ, bh);
> + wait_on_buffer(bh);
> + if (buffer_uptodate(bh))
> + return 0;
Here it will unlock the buffer and return zero.
This function is unusable when passed an unlocked buffer.
The return value should (always) be documented.
> + return -EIO;
> +}
> +EXPORT_SYMBOL(bh_submit_read);
> void __init buffer_init(void)
Missing newline.
* Re: [PATCH 24/49] ext4: add block bitmap validation
[not found] ` <1200970948-17903-25-git-send-email-tytso@mit.edu>
[not found] ` <1200970948-17903-26-git-send-email-tytso@mit.edu>
@ 2008-01-23 22:06 ` Andrew Morton
2008-01-26 13:26 ` Theodore Tso
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-01-23 22:06 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-kernel, aneesh.kumar, linux-ext4@vger.kernel.org
> On Mon, 21 Jan 2008 22:02:03 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> + if (bh_submit_read(bh) < 0) {
> + brelse(bh);
> + ext4_error(sb, __FUNCTION__,
> "Cannot read block bitmap - "
> - "block_group = %lu, block_bitmap = %llu",
> - block_group, bitmap_blk);
> + "block_group = %d, block_bitmap = %llu",
> + (int)block_group, (unsigned long long)bitmap_blk);
> + return NULL;
> + }
> + if (!ext4_valid_block_bitmap(sb, desc, block_group, bh)) {
> + brelse(bh);
> + return NULL;
> + }
brelse() should only be used when the bh might be NULL - put_bh()
can be used here.
Please review all ext4/jbd2 code for this trivial speedup.
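The difference, roughly as defined in include/linux/buffer_head.h (a
simplified sketch, memory barriers omitted):

	static inline void put_bh(struct buffer_head *bh)
	{
		atomic_dec(&bh->b_count);	/* bh must not be NULL */
	}

	static inline void brelse(struct buffer_head *bh)
	{
		if (bh)				/* tolerates NULL */
			__brelse(bh);
	}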
* Re: [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore.
[not found] ` <1200970948-17903-31-git-send-email-tytso@mit.edu>
@ 2008-01-23 22:06 ` Andrew Morton
2008-01-24 5:29 ` Aneesh Kumar K.V
2008-01-24 13:00 ` Andy Whitcroft
[not found] ` <1200970948-17903-32-git-send-email-tytso@mit.edu>
1 sibling, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2008-01-23 22:06 UTC (permalink / raw)
To: Theodore Ts'o
Cc: linux-kernel, aneesh.kumar, Andy Whitcroft,
linux-ext4@vger.kernel.org
> On Mon, 21 Jan 2008 22:02:09 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> +int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> + unsigned long max_blocks, struct buffer_head *bh,
> + int create, int extend_disksize)
> +{
> + int retval;
> + if (create) {
> + down_write((&EXT4_I(inode)->i_data_sem));
> + } else {
> + down_read((&EXT4_I(inode)->i_data_sem));
> + }
> + if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
> + retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
> + bh, create, extend_disksize);
> + } else {
> + retval = ext4_get_blocks_handle(handle, inode, block,
> + max_blocks, bh, create, extend_disksize);
> + }
> + if (create) {
> + up_write((&EXT4_I(inode)->i_data_sem));
> + } else {
> + up_read((&EXT4_I(inode)->i_data_sem));
> + }
This function has many unneeded braces. checkpatch used to detect this
but it seems to have broken.
> + return retval;
> +}
> static int ext4_get_block(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create)
Missing newline.
* Re: [PATCH 33/49] ext4: Add the journal checksum feature
[not found] ` <1200970948-17903-34-git-send-email-tytso@mit.edu>
@ 2008-01-23 22:07 ` Andrew Morton
2008-01-23 22:40 ` Andreas Dilger
2008-01-24 21:24 ` Mingming Cao
[not found] ` <1200970948-17903-35-git-send-email-tytso@mit.edu>
1 sibling, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2008-01-23 22:07 UTC (permalink / raw)
To: Theodore Ts'o
Cc: linux-kernel, girish, adilger, shaggy, cmm,
linux-ext4@vger.kernel.org
> On Mon, 21 Jan 2008 22:02:12 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> From: Girish Shilamkar <girish@clusterfs.com>
>
> The journal checksum feature adds two new flags, i.e.
> JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.
>
> JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
> checksum for the blocks described by the descriptor blocks.
> Due to checksums, writing of the commit record no longer needs to be
> synchronous. Now commit record can be sent to disk without waiting for
> descriptor blocks to be written to disk. This behavior is controlled
> using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
> able to recover the journal with _ASYNC_COMMIT hence it is made
> incompat.
> The commit header has been extended to hold the checksum along with the
> type of the checksum.
>
> For recovery, in the scan pass, checksums are verified to ensure the sanity
> and completeness (in case of _ASYNC_COMMIT) of every transaction.
>
> ...
>
> +static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
unneeded inlining.
> +{
> + struct page *page = bh->b_page;
> + char *addr;
> + __u32 checksum;
> +
> + addr = kmap_atomic(page, KM_USER0);
> + checksum = crc32_be(crc32_sum,
> + (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
> + kunmap_atomic(addr, KM_USER0);
> +
> + return checksum;
> +}
Can this buffer actually be in highmem?
> static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
> unsigned long long block)
More unnecessary inlining.
> +/*
> + * jbd2_journal_clear_features () - Clear a given journal feature in the
> + * superblock
> + * @journal: Journal to act on.
> + * @compat: bitmask of compatible features
> + * @ro: bitmask of features that force read-only mount
> + * @incompat: bitmask of incompatible features
> + *
> + * Clear a given journal feature as present on the
> + * superblock. Returns true if the requested features could be reset.
> + */
> +int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
> + unsigned long ro, unsigned long incompat)
> +{
> + journal_superblock_t *sb;
> +
> + jbd_debug(1, "Clear features 0x%lx/0x%lx/0x%lx\n",
> + compat, ro, incompat);
> +
> + sb = journal->j_superblock;
> +
> + sb->s_feature_compat &= ~cpu_to_be32(compat);
> + sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
> + sb->s_feature_incompat &= ~cpu_to_be32(incompat);
> +
> + return 1;
> +}
> +EXPORT_SYMBOL(jbd2_journal_clear_features);
Kernel usually returns 0 on success. So we can return a useful errno on
failure.
> +/*
> + * calc_chksums calculates the checksums for the blocks described in the
> + * descriptor block.
> + */
> +static int calc_chksums(journal_t *journal, struct buffer_head *bh,
> + unsigned long *next_log_block, __u32 *crc32_sum)
> +{
> + int i, num_blks, err;
> + unsigned io_block;
> + struct buffer_head *obh;
> +
> + num_blks = count_tags(journal, bh);
> + /* Calculate checksum of the descriptor block. */
> + *crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
> +
> + for (i = 0; i < num_blks; i++) {
> + io_block = (*next_log_block)++;
unsigned <- unsigned long.
Are all the types appropriate in here?
> + wrap(journal, *next_log_block);
> + err = jread(&obh, journal, io_block);
> + if (err) {
> + printk(KERN_ERR "JBD: IO error %d recovering block "
> + "%u in log\n", err, io_block);
> + return 1;
> + } else {
> + *crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
> + obh->b_size);
> + }
> + }
> + return 0;
> +}
> +
> static int do_one_pass(journal_t *journal,
> struct recovery_info *info, enum passtype pass)
> {
> @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journal,
> unsigned int sequence;
> int blocktype;
> int tag_bytes = journal_tag_bytes(journal);
> + __u32 crc32_sum = ~0; /* Transactional Checksums */
>
> /* Precompute the maximum metadata descriptors in a descriptor block */
> int MAX_BLOCKS_PER_DESC;
> @@ -419,9 +452,23 @@ static int do_one_pass(journal_t *journal,
> switch(blocktype) {
> case JBD2_DESCRIPTOR_BLOCK:
> /* If it is a valid descriptor block, replay it
> - * in pass REPLAY; otherwise, just skip over the
> - * blocks it describes. */
> + * in pass REPLAY; if journal_checksums enabled, then
> + * calculate checksums in PASS_SCAN, otherwise,
> + * just skip over the blocks it describes. */
> if (pass != PASS_REPLAY) {
> + if (pass == PASS_SCAN &&
> + JBD2_HAS_COMPAT_FEATURE(journal,
> + JBD2_FEATURE_COMPAT_CHECKSUM) &&
> + !info->end_transaction) {
> + if (calc_chksums(journal, bh,
> + &next_log_block,
> + &crc32_sum)) {
put_bh()
> + brelse(bh);
> + break;
> + }
> + brelse(bh);
> + continue;
put_bh()
> + }
> next_log_block += count_tags(journal, bh);
> wrap(journal, next_log_block);
> brelse(bh);
> @@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journal,
> continue;
>
> + brelse(bh);
etc
* Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl
[not found] ` <1200970948-17903-37-git-send-email-tytso@mit.edu>
[not found] ` <1200970948-17903-38-git-send-email-tytso@mit.edu>
@ 2008-01-23 22:07 ` Andrew Morton
2008-01-24 5:55 ` Aneesh Kumar K.V
1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-01-23 22:07 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-kernel, aneesh.kumar, linux-ext4@vger.kernel.org
> On Mon, 21 Jan 2008 22:02:15 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> The below patch adds an ioctl for migrating an ext3 indirect-block-mapped
> inode to an ext4 extent-mapped inode.
This patch adds lots of weird and inexplicable single- and double-newlines
in inappropriate places. However it frequently forgets to add newlines
between end-of-locals and start-of-code, which is usual practice.
+struct list_blocks_struct {
+ ext4_lblk_t first_block, last_block;
+ ext4_fsblk_t first_pblock, last_pblock;
+};
This structure would benefit from some code comments.
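Something along these lines, for instance (field roles inferred from
the migrate code, so treat the wording as a sketch):

	/* Accumulates a run of contiguous blocks while walking the
	 * indirect block tree: the logical file range covered and the
	 * physical blocks backing it. */
	struct list_blocks_struct {
		ext4_lblk_t first_block, last_block;	/* logical range */
		ext4_fsblk_t first_pblock, last_pblock;	/* physical range */
	};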
* Re: [PATCH 41/49] ext4: Add multi block allocator for ext4
[not found] ` <1200970948-17903-42-git-send-email-tytso@mit.edu>
@ 2008-01-23 22:07 ` Andrew Morton
2008-01-23 23:20 ` Andreas Dilger
2008-01-24 7:56 ` Aneesh Kumar K.V
0 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2008-01-23 22:07 UTC (permalink / raw)
To: Theodore Ts'o
Cc: linux-kernel, alex, adilger, aneesh.kumar, sandeen, tytso,
linux-ext4@vger.kernel.org
> On Mon, 21 Jan 2008 22:02:20 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> From: Alex Tomas <alex@clusterfs.com>
>
> Signed-off-by: Alex Tomas <alex@clusterfs.com>
> Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>
> ...
>
> +#if BITS_PER_LONG == 64
> +#define mb_correct_addr_and_bit(bit, addr) \
> +{ \
> + bit += ((unsigned long) addr & 7UL) << 3; \
> + addr = (void *) ((unsigned long) addr & ~7UL); \
> +}
> +#elif BITS_PER_LONG == 32
> +#define mb_correct_addr_and_bit(bit, addr) \
> +{ \
> + bit += ((unsigned long) addr & 3UL) << 3; \
> + addr = (void *) ((unsigned long) addr & ~3UL); \
> +}
> +#else
> +#error "how many bits you are?!"
> +#endif
Why do these exist?
> +static inline int mb_test_bit(int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + return ext4_test_bit(bit, addr);
> +}
ext2_test_bit() already handles bitnum > wordsize.
If mb_correct_addr_and_bit() is actually needed then some suitable comment
would help.
> +static inline void mb_set_bit(int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + ext4_set_bit(bit, addr);
> +}
> +
> +static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + ext4_set_bit_atomic(lock, bit, addr);
> +}
> +
> +static inline void mb_clear_bit(int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + ext4_clear_bit(bit, addr);
> +}
> +
> +static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + ext4_clear_bit_atomic(lock, bit, addr);
> +}
> +
> +static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
uninlining this will save about eighty squigabytes of text.
Please review all of ext4/jbd2 with a view to removing unnecessary and wrong
inlinings.
> +{
> + char *bb;
> +
> + /* FIXME!! is this needed */
> + BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b));
> + BUG_ON(max == NULL);
> +
> + if (order > e4b->bd_blkbits + 1) {
> + *max = 0;
> + return NULL;
> + }
> +
> + /* at order 0 we see each particular block */
> + *max = 1 << (e4b->bd_blkbits + 3);
> + if (order == 0)
> + return EXT4_MB_BITMAP(e4b);
> +
> + bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order];
> + *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order];
> +
> + return bb;
> +}
> +
>
> ...
>
> +#else
> +#define mb_free_blocks_double(a, b, c, d)
> +#define mb_mark_used_double(a, b, c)
> +#define mb_cmp_bitmaps(a, b)
> +#endif
Please use the do{}while(0) thing. Or, better, proper C functions which
have typechecking (unless this will cause undefined-var compile errors,
which happens sometimes)
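With a bare empty macro, any side effects in the arguments silently
vanish in non-debug builds and all typechecking is lost; an empty
inline function keeps both (a generic sketch, signature assumed):

	#ifdef DOUBLE_CHECK
	static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap);
	#else
	static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
	{
		/* compiled out, but arguments are still typechecked */
	}
	#endif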
> +/* find most significant bit */
> +static int fmsb(unsigned short word)
> +{
> + int order;
> +
> + if (word > 255) {
> + order = 7;
> + word >>= 8;
> + } else {
> + order = -1;
> + }
> +
> + do {
> + order++;
> + word >>= 1;
> + } while (word != 0);
> +
> + return order;
> +}
Did we just reinvent fls()?
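For word != 0 (which the caller below guarantees, since len > 0) the
whole helper reduces to the standard bitop:

	#include <linux/bitops.h>

	static int fmsb(unsigned short word)
	{
		return fls(word) - 1;	/* fls() is 1-based */
	}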
> +/* FIXME!! need more doc */
> +static void ext4_mb_mark_free_simple(struct super_block *sb,
> + void *buddy, unsigned first, int len,
> + struct ext4_group_info *grp)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + unsigned short min;
> + unsigned short max;
> + unsigned short chunk;
> + unsigned short border;
> +
> + BUG_ON(len >= EXT4_BLOCKS_PER_GROUP(sb));
> +
> + border = 2 << sb->s_blocksize_bits;
Won't this explode with >= 32k blocksize?
> + while (len > 0) {
> + /* find how many blocks can be covered since this position */
> + max = ffs(first | border) - 1;
> +
> + /* find how many blocks of power 2 we need to mark */
> + min = fmsb(len);
> +
> + if (max < min)
> + min = max;
> + chunk = 1 << min;
> +
> + /* mark multiblock chunks only */
> + grp->bb_counters[min]++;
> + if (min > 0)
> + mb_clear_bit(first >> min,
> + buddy + sbi->s_mb_offsets[min]);
> +
> + len -= chunk;
> + first += chunk;
> + }
> +}
> +
>
> ...
>
> +static int ext4_mb_init_cache(struct page *page, char *incore)
> +{
> + int blocksize;
> + int blocks_per_page;
> + int groups_per_page;
> + int err = 0;
> + int i;
> + ext4_group_t first_group;
> + int first_block;
> + struct super_block *sb;
> + struct buffer_head *bhs;
> + struct buffer_head **bh;
> + struct inode *inode;
> + char *data;
> + char *bitmap;
> +
> + mb_debug("init page %lu\n", page->index);
> +
> + inode = page->mapping->host;
> + sb = inode->i_sb;
> + blocksize = 1 << inode->i_blkbits;
> + blocks_per_page = PAGE_CACHE_SIZE / blocksize;
> +
> + groups_per_page = blocks_per_page >> 1;
> + if (groups_per_page == 0)
> + groups_per_page = 1;
> +
> + /* allocate buffer_heads to read bitmaps */
> + if (groups_per_page > 1) {
> + err = -ENOMEM;
> + i = sizeof(struct buffer_head *) * groups_per_page;
> + bh = kmalloc(i, GFP_NOFS);
> + if (bh == NULL)
> + goto out;
> + memset(bh, 0, i);
kzalloc()
> + } else
> + bh = &bhs;
> +
> + first_group = page->index * blocks_per_page / 2;
> +
> + /* read all groups the page covers into the cache */
> + for (i = 0; i < groups_per_page; i++) {
> + struct ext4_group_desc *desc;
> +
> + if (first_group + i >= EXT4_SB(sb)->s_groups_count)
> + break;
> +
> + err = -EIO;
> + desc = ext4_get_group_desc(sb, first_group + i, NULL);
> + if (desc == NULL)
> + goto out;
> +
> + err = -ENOMEM;
> + bh[i] = sb_getblk(sb, ext4_block_bitmap(sb, desc));
> + if (bh[i] == NULL)
> + goto out;
> +
> + if (buffer_uptodate(bh[i]))
> + continue;
> +
> + lock_buffer(bh[i]);
> + if (buffer_uptodate(bh[i])) {
> + unlock_buffer(bh[i]);
> + continue;
> + }
Didn't we just add a helper in fs/buffer.c to do this?
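Presumably this whole sequence could collapse to the helper added in
patch 23, something like:

	if (bh_uptodate_or_lock(bh[i]))
		continue;	/* already uptodate, nothing to read */
	/* here bh[i] is locked and not uptodate */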
> + if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> + ext4_init_block_bitmap(sb, bh[i],
> + first_group + i, desc);
> + set_buffer_uptodate(bh[i]);
> + unlock_buffer(bh[i]);
> + continue;
> + }
> + get_bh(bh[i]);
> + bh[i]->b_end_io = end_buffer_read_sync;
> + submit_bh(READ, bh[i]);
> + mb_debug("read bitmap for group %lu\n", first_group + i);
> + }
> +
> + /* wait for I/O completion */
> + for (i = 0; i < groups_per_page && bh[i]; i++)
> + wait_on_buffer(bh[i]);
> +
> + err = -EIO;
> + for (i = 0; i < groups_per_page && bh[i]; i++)
> + if (!buffer_uptodate(bh[i]))
> + goto out;
> +
> + first_block = page->index * blocks_per_page;
> + for (i = 0; i < blocks_per_page; i++) {
> + int group;
> + struct ext4_group_info *grinfo;
> +
> + group = (first_block + i) >> 1;
> + if (group >= EXT4_SB(sb)->s_groups_count)
> + break;
> +
> + /*
> + * data carry information regarding this
> + * particular group in the format specified
> + * above
> + *
> + */
> + data = page_address(page) + (i * blocksize);
> + bitmap = bh[group - first_group]->b_data;
> +
> + /*
> + * We place the buddy block and bitmap block
> + * close together
> + */
> + if ((first_block + i) & 1) {
> + /* this is block of buddy */
> + BUG_ON(incore == NULL);
> + mb_debug("put buddy for group %u in page %lu/%x\n",
> + group, page->index, i * blocksize);
> + memset(data, 0xff, blocksize);
> + grinfo = ext4_get_group_info(sb, group);
> + grinfo->bb_fragments = 0;
> + memset(grinfo->bb_counters, 0,
> + sizeof(unsigned short)*(sb->s_blocksize_bits+2));
> + /*
> + * incore got set to the group block bitmap below
> + */
> + ext4_mb_generate_buddy(sb, data, incore, group);
> + incore = NULL;
> + } else {
> + /* this is block of bitmap */
> + BUG_ON(incore != NULL);
> + mb_debug("put bitmap for group %u in page %lu/%x\n",
> + group, page->index, i * blocksize);
> +
> + /* see comments in ext4_mb_put_pa() */
> + ext4_lock_group(sb, group);
> + memcpy(data, bitmap, blocksize);
> +
> + /* mark all preallocated blks used in in-core bitmap */
> + ext4_mb_generate_from_pa(sb, data, group);
> + ext4_unlock_group(sb, group);
> +
> + /* set incore so that the buddy information can be
> + * generated using this
> + */
> + incore = data;
> + }
> + }
> + SetPageUptodate(page);
Is the page locked here?
> +out:
> + if (bh) {
> + for (i = 0; i < groups_per_page && bh[i]; i++)
> + brelse(bh[i]);
put_bh()
> + if (bh != &bhs)
> + kfree(bh);
> + }
> + return err;
> +}
> +
>
> ...
>
> +static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
> +{
> + __u32 *addr;
> +
> + len = cur + len;
> + while (cur < len) {
> + if ((cur & 31) == 0 && (len - cur) >= 32) {
> + /* fast path: clear whole word at once */
s/clear/set/
> + addr = bm + (cur >> 3);
> + *addr = 0xffffffff;
> + cur += 32;
> + continue;
> + }
> + mb_set_bit_atomic(lock, cur, bm);
> + cur++;
> + }
> +}
> +
>
> ...
>
> +static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> + struct ext4_buddy *e4b)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + int ret;
> +
> + BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
> + BUG_ON(ac->ac_status == AC_STATUS_FOUND);
> +
> + ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len);
> + ac->ac_b_ex.fe_logical = ac->ac_g_ex.fe_logical;
> + ret = mb_mark_used(e4b, &ac->ac_b_ex);
> +
> + /* preallocation can change ac_b_ex, thus we store actually
> + * allocated blocks for history */
> + ac->ac_f_ex = ac->ac_b_ex;
> +
> + ac->ac_status = AC_STATUS_FOUND;
> + ac->ac_tail = ret & 0xffff;
> + ac->ac_buddy = ret >> 16;
> +
> + /* XXXXXXX: SUCH A HORRIBLE **CK */
> + /*FIXME!! Why ? */
?
> + ac->ac_bitmap_page = e4b->bd_bitmap_page;
> + get_page(ac->ac_bitmap_page);
> + ac->ac_buddy_page = e4b->bd_buddy_page;
> + get_page(ac->ac_buddy_page);
> +
> + /* store last allocated for subsequent stream allocation */
> + if ((ac->ac_flags & EXT4_MB_HINT_DATA)) {
> + spin_lock(&sbi->s_md_lock);
> + sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> + sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
> + spin_unlock(&sbi->s_md_lock);
> + }
> +}
>
> ...
>
> +static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
> + ext4_group_t group)
> +{
> + struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> + struct ext4_prealloc_space *pa;
> + struct list_head *cur;
> + ext4_group_t groupnr;
> + ext4_grpblk_t start;
> + int preallocated = 0;
> + int count = 0;
> + int len;
> +
> + /* all form of preallocation discards first load group,
> + * so the only competing code is preallocation use.
> + * we don't need any locking here
> + * notice we do NOT ignore preallocations with pa_deleted
> + * otherwise we could leave used blocks available for
> + * allocation in buddy when concurrent ext4_mb_put_pa()
> + * is dropping preallocation
> + */
> + list_for_each_rcu(cur, &grp->bb_prealloc_list) {
> + pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
> + spin_lock(&pa->pa_lock);
> + ext4_get_group_no_and_offset(sb, pa->pa_pstart,
> + &groupnr, &start);
> + len = pa->pa_len;
> + spin_unlock(&pa->pa_lock);
> + if (unlikely(len == 0))
> + continue;
> + BUG_ON(groupnr != group);
> + mb_set_bits(sb_bgl_lock(EXT4_SB(sb), group),
> + bitmap, start, len);
> + preallocated += len;
> + count++;
> + }
Seems to be missing rcu_read_lock()
> + mb_debug("prellocated %u for group %lu\n", preallocated, group);
> +}
> +
> +static void ext4_mb_pa_callback(struct rcu_head *head)
> +{
> + struct ext4_prealloc_space *pa;
> + pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
> + kmem_cache_free(ext4_pspace_cachep, pa);
> +}
> +#define mb_call_rcu(__pa) call_rcu(&(__pa)->u.pa_rcu, ext4_mb_pa_callback)
Is there any reason why this had to be implemented as a macro?
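A plain function would do the same job with typechecking (a sketch):

	static inline void mb_call_rcu(struct ext4_prealloc_space *pa)
	{
		call_rcu(&pa->u.pa_rcu, ext4_mb_pa_callback);
	}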
>
> ...
>
> +static int ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
> +{
> + struct super_block *sb = ac->ac_sb;
> + struct ext4_prealloc_space *pa;
> + struct ext4_group_info *grp;
> + struct ext4_inode_info *ei;
> +
> + /* preallocate only when found space is larger then requested */
> + BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
> + BUG_ON(ac->ac_status != AC_STATUS_FOUND);
> + BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> +
> + pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
Do all the GFP_NOFS's in this code really need to be GFP_NOFS?
> + if (pa == NULL)
> + return -ENOMEM;
> +
> + if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
> + int winl;
> + int wins;
> + int win;
> + int offs;
> +
> + /* we can't allocate as much as normalizer wants.
> + * so, found space must get proper lstart
> + * to cover original request */
> + BUG_ON(ac->ac_g_ex.fe_logical > ac->ac_o_ex.fe_logical);
> + BUG_ON(ac->ac_g_ex.fe_len < ac->ac_o_ex.fe_len);
> +
> + /* we're limited by original request in that
> + * logical block must be covered any way
> + * winl is window we can move our chunk within */
> + winl = ac->ac_o_ex.fe_logical - ac->ac_g_ex.fe_logical;
> +
> + /* also, we should cover whole original request */
> + wins = ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len;
> +
> + /* the smallest one defines real window */
> + win = min(winl, wins);
> +
> + offs = ac->ac_o_ex.fe_logical % ac->ac_b_ex.fe_len;
> + if (offs && offs < win)
> + win = offs;
> +
> + ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - win;
> + BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
> + BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
> + }
> +
> + /* preallocation can change ac_b_ex, thus we store actually
> + * allocated blocks for history */
> + ac->ac_f_ex = ac->ac_b_ex;
> +
> + pa->pa_lstart = ac->ac_b_ex.fe_logical;
> + pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
> + pa->pa_len = ac->ac_b_ex.fe_len;
> + pa->pa_free = pa->pa_len;
> + atomic_set(&pa->pa_count, 1);
> + spin_lock_init(&pa->pa_lock);
> + pa->pa_deleted = 0;
> + pa->pa_linear = 0;
> +
> + mb_debug("new inode pa %p: %llu/%u for %u\n", pa,
> + pa->pa_pstart, pa->pa_len, pa->pa_lstart);
> +
> + ext4_mb_use_inode_pa(ac, pa);
> + atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);
> +
> + ei = EXT4_I(ac->ac_inode);
> + grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);
> +
> + pa->pa_obj_lock = &ei->i_prealloc_lock;
> + pa->pa_inode = ac->ac_inode;
> +
> + ext4_lock_group(sb, ac->ac_b_ex.fe_group);
> + list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
> + ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
> +
> + spin_lock(pa->pa_obj_lock);
> + list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
> + spin_unlock(pa->pa_obj_lock);
hm. Strange to see list_add_rcu() inside spinlock like this.
> + return 0;
> +}
> +
>
> ...
>
> +static int ext4_mb_discard_group_preallocations(struct super_block *sb,
> + ext4_group_t group, int needed)
> +{
> + struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> + struct buffer_head *bitmap_bh = NULL;
> + struct ext4_prealloc_space *pa, *tmp;
> + struct list_head list;
> + struct ext4_buddy e4b;
> + int err;
> + int busy = 0;
> + int free = 0;
> +
> + mb_debug("discard preallocation for group %lu\n", group);
> +
> + if (list_empty(&grp->bb_prealloc_list))
> + return 0;
> +
> + bitmap_bh = read_block_bitmap(sb, group);
> + if (bitmap_bh == NULL) {
> + /* error handling here */
> + ext4_mb_release_desc(&e4b);
> + BUG_ON(bitmap_bh == NULL);
> + }
> +
> + err = ext4_mb_load_buddy(sb, group, &e4b);
> + BUG_ON(err != 0); /* error handling here */
> +
> + if (needed == 0)
> + needed = EXT4_BLOCKS_PER_GROUP(sb) + 1;
> +
> + grp = ext4_get_group_info(sb, group);
> + INIT_LIST_HEAD(&list);
> +
> +repeat:
> + ext4_lock_group(sb, group);
> + list_for_each_entry_safe(pa, tmp,
> + &grp->bb_prealloc_list, pa_group_list) {
> + spin_lock(&pa->pa_lock);
> + if (atomic_read(&pa->pa_count)) {
> + spin_unlock(&pa->pa_lock);
> + busy = 1;
> + continue;
> + }
> + if (pa->pa_deleted) {
> + spin_unlock(&pa->pa_lock);
> + continue;
> + }
> +
> + /* seems this one can be freed ... */
> + pa->pa_deleted = 1;
> +
> + /* we can trust pa_free ... */
> + free += pa->pa_free;
> +
> + spin_unlock(&pa->pa_lock);
> +
> + list_del_rcu(&pa->pa_group_list);
> + list_add(&pa->u.pa_tmp_list, &list);
> + }
Strange to see rcu operations outside rcu_read_lock().
> + /* if we still need more blocks and some PAs were used, try again */
> + if (free < needed && busy) {
> + busy = 0;
> + ext4_unlock_group(sb, group);
> + /*
> + * Yield the CPU here so that we don't get soft lockup
> + * in non preempt case.
> + */
> + yield();
argh, no, yield() is basically unusable. schedule_timeout(1) is preferable.
Please test this code when there are lots of cpu-intensive tasks running.
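Something like this instead (a sketch; the _uninterruptible variant
sets the task state first, so the one-jiffy sleep actually happens):

	if (free < needed && busy) {
		busy = 0;
		ext4_unlock_group(sb, group);
		schedule_timeout_uninterruptible(1);
		goto repeat;
	}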
> + goto repeat;
> + }
> +
> + /* found anything to free? */
> + if (list_empty(&list)) {
> + BUG_ON(free != 0);
> + goto out;
> + }
> +
> + /* now free all selected PAs */
> + list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
> +
> + /* remove from object (inode or locality group) */
> + spin_lock(pa->pa_obj_lock);
> + list_del_rcu(&pa->pa_inode_list);
> + spin_unlock(pa->pa_obj_lock);
> +
> + if (pa->pa_linear)
> + ext4_mb_release_group_pa(&e4b, pa);
> + else
> + ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
> +
> + list_del(&pa->u.pa_tmp_list);
> + mb_call_rcu(pa);
> + }
> +
> +out:
> + ext4_unlock_group(sb, group);
> + ext4_mb_release_desc(&e4b);
> + brelse(bitmap_bh);
put_bh()
> + return free;
> +}
> +
> +/*
> + * releases all non-used preallocated blocks for given inode
> + *
> + * It's important to discard preallocations under i_data_sem
> + * We don't want another block to be served from the prealloc
> + * space when we are discarding the inode prealloc space.
> + *
> + * FIXME!! Make sure it is valid at all the call sites
> + */
> +void ext4_mb_discard_inode_preallocations(struct inode *inode)
> +{
> + struct ext4_inode_info *ei = EXT4_I(inode);
> + struct super_block *sb = inode->i_sb;
> + struct buffer_head *bitmap_bh = NULL;
> + struct ext4_prealloc_space *pa, *tmp;
> + ext4_group_t group = 0;
> + struct list_head list;
> + struct ext4_buddy e4b;
> + int err;
> +
> + if (!test_opt(sb, MBALLOC) || !S_ISREG(inode->i_mode)) {
> + /*BUG_ON(!list_empty(&ei->i_prealloc_list));*/
> + return;
> + }
> +
> + mb_debug("discard preallocation for inode %lu\n", inode->i_ino);
> +
> + INIT_LIST_HEAD(&list);
> +
> +repeat:
> + /* first, collect all pa's in the inode */
> + spin_lock(&ei->i_prealloc_lock);
> + while (!list_empty(&ei->i_prealloc_list)) {
> + pa = list_entry(ei->i_prealloc_list.next,
> + struct ext4_prealloc_space, pa_inode_list);
> + BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
> + spin_lock(&pa->pa_lock);
> + if (atomic_read(&pa->pa_count)) {
> + /* this shouldn't happen often - nobody should
> + * use preallocation while we're discarding it */
> + spin_unlock(&pa->pa_lock);
> + spin_unlock(&ei->i_prealloc_lock);
> + printk(KERN_ERR "uh-oh! used pa while discarding\n");
> + dump_stack();
WARN_ON(1) would be more conventional.
> + current->state = TASK_UNINTERRUPTIBLE;
> + schedule_timeout(HZ);
schedule_timeout_uninterruptible()
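i.e. roughly:

	printk(KERN_ERR "uh-oh! used pa while discarding\n");
	WARN_ON(1);	/* stack trace, replaces dump_stack() */
	schedule_timeout_uninterruptible(HZ);
	goto repeat;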
> + goto repeat;
> +
> + }
> + if (pa->pa_deleted == 0) {
> + pa->pa_deleted = 1;
> + spin_unlock(&pa->pa_lock);
> + list_del_rcu(&pa->pa_inode_list);
> + list_add(&pa->u.pa_tmp_list, &list);
> + continue;
> + }
> +
> + /* someone is deleting pa right now */
> + spin_unlock(&pa->pa_lock);
> + spin_unlock(&ei->i_prealloc_lock);
> +
> + /* we have to wait here because pa_deleted
> + * doesn't mean pa is already unlinked from
> + * the list. as we might be called from
> + * ->clear_inode() the inode will get freed
> + * and concurrent thread which is unlinking
> + * pa from inode's list may access already
> + * freed memory, bad-bad-bad */
> +
> + /* XXX: if this happens too often, we can
> + * add a flag to force wait only in case
> + * of ->clear_inode(), but not in case of
> + * regular truncate */
> + current->state = TASK_UNINTERRUPTIBLE;
> + schedule_timeout(HZ);
ditto
> + goto repeat;
> + }
> + spin_unlock(&ei->i_prealloc_lock);
> +
> + list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
> + BUG_ON(pa->pa_linear != 0);
> + ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, NULL);
> +
> + err = ext4_mb_load_buddy(sb, group, &e4b);
> + BUG_ON(err != 0); /* error handling here */
> +
> + bitmap_bh = read_block_bitmap(sb, group);
> + if (bitmap_bh == NULL) {
> + /* error handling here */
> + ext4_mb_release_desc(&e4b);
> + BUG_ON(bitmap_bh == NULL);
> + }
> +
> + ext4_lock_group(sb, group);
> + list_del_rcu(&pa->pa_group_list);
> + ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
> + ext4_unlock_group(sb, group);
> +
> + ext4_mb_release_desc(&e4b);
> + brelse(bitmap_bh);
> +
> + list_del(&pa->u.pa_tmp_list);
> + mb_call_rcu(pa);
> + }
> +}
Would be nice to ask Paul to review all the rcu usage in here. It looks odd.
>
> ...
>
> +#else
> +#define ext4_mb_show_ac(x)
> +#endif
static inlined C functions are preferred (+1e6 dittoes)
> +/*
> + * We use locality group preallocation for small size file. The size of the
> + * file is determined by the current size or the resulting size after
> + * allocation which ever is larger
> + *
> + * One can tune this size via /proc/fs/ext4/<partition>/stream_req
> + */
> +static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + int bsbits = ac->ac_sb->s_blocksize_bits;
> + loff_t size, isize;
> +
> + if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
> + return;
> +
> + size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
> + isize = i_size_read(ac->ac_inode) >> bsbits;
> + if (size < isize)
> + size = isize;
min()?
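(Strictly, the code keeps the larger of the two values, so the helper
that fits here is max():)

	size = max(size, isize);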
> + /* don't use group allocation for large files */
> + if (size >= sbi->s_mb_stream_request)
> + return;
> +
> + if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
> + return;
> +
> + BUG_ON(ac->ac_lg != NULL);
> + ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
> + put_cpu();
Strange-looking code. I'd be interested in a description of the per-cpu
design here.
> + /* we're going to use group allocation */
> + ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
> +
> + /* serialize all allocations in the group */
> + down(&ac->ac_lg->lg_sem);
This should be a mutex, shouldn't it?
> +}
> +
>
> ...
>
> +static int ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
> + ext4_group_t group, ext4_grpblk_t block, int count)
> +{
> + struct ext4_group_info *db = e4b->bd_info;
> + struct super_block *sb = e4b->bd_sb;
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + struct ext4_free_metadata *md;
> + int i;
> +
> + BUG_ON(e4b->bd_bitmap_page == NULL);
> + BUG_ON(e4b->bd_buddy_page == NULL);
> +
> + ext4_lock_group(sb, group);
> + for (i = 0; i < count; i++) {
> + md = db->bb_md_cur;
> + if (md && db->bb_tid != handle->h_transaction->t_tid) {
> + db->bb_md_cur = NULL;
> + md = NULL;
> + }
> +
> + if (md == NULL) {
> + ext4_unlock_group(sb, group);
> + md = kmalloc(sizeof(*md), GFP_KERNEL);
Why was this one not GFP_NOFS?
> + if (md == NULL)
> + return -ENOMEM;
Did we just leak some memory?
> + md->num = 0;
> + md->group = group;
> +
> + ext4_lock_group(sb, group);
> + if (db->bb_md_cur == NULL) {
> + spin_lock(&sbi->s_md_lock);
> + list_add(&md->list, &sbi->s_active_transaction);
> + spin_unlock(&sbi->s_md_lock);
> + /* protect buddy cache from being freed,
> + * otherwise we'll refresh it from
> + * on-disk bitmap and lose not-yet-available
> + * blocks */
> + page_cache_get(e4b->bd_buddy_page);
> + page_cache_get(e4b->bd_bitmap_page);
> + db->bb_md_cur = md;
> + db->bb_tid = handle->h_transaction->t_tid;
> + mb_debug("new md 0x%p for group %lu\n",
> + md, md->group);
> + } else {
> + kfree(md);
> + md = db->bb_md_cur;
> + }
> + }
> +
> + BUG_ON(md->num >= EXT4_BB_MAX_BLOCKS);
> + md->blocks[md->num] = block + i;
> + md->num++;
> + if (md->num == EXT4_BB_MAX_BLOCKS) {
> + /* no more space, put full container on a sb's list */
> + db->bb_md_cur = NULL;
> + }
> + }
> + ext4_unlock_group(sb, group);
> + return 0;
> +}
> +
>
> ...
>
> + case Opt_mballoc:
> + set_opt(sbi->s_mount_opt, MBALLOC);
> + break;
> + case Opt_nomballoc:
> + clear_opt(sbi->s_mount_opt, MBALLOC);
> + break;
> + case Opt_stripe:
> + if (match_int(&args[0], &option))
> + return 0;
> + if (option < 0)
> + return 0;
> + sbi->s_stripe = option;
> + break;
These appear to be undocumented.
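Entries along these lines in Documentation/filesystems/ext4.txt would
cover them (suggested wording only, not taken from the patch):

	mballoc			Use the multiblock allocator.
	nomballoc		Disable the multiblock allocator.
	stripe=n		Filesystem blocks per stripe, used by
				mballoc for allocation size and alignment;
				must not exceed the blocks-per-group count
				to take effect.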
> default:
> printk (KERN_ERR
> "EXT4-fs: Unrecognized mount option \"%s\" "
> @@ -1742,6 +1762,33 @@ static ext4_fsblk_t descriptor_loc(struct super_block *sb,
> return (has_super + ext4_group_first_block_no(sb, bg));
> }
>
> +/**
> + * ext4_get_stripe_size: Get the stripe size.
> + * @sbi: In memory super block info
> + *
> + * If we have specified it via mount option, then
> + * use the mount option value. If the value specified at mount time is
> + * greater than the blocks per group use the super block value.
> + * If the super block value is greater than blocks per group return 0.
> + * Allocator needs it to be less than blocks per group.
> + *
> + */
> +static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi)
> +{
> + unsigned long stride = le16_to_cpu(sbi->s_es->s_raid_stride);
> + unsigned long stripe_width =
> + le32_to_cpu(sbi->s_es->s_raid_stripe_width);
> +
> + if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group) {
> + return sbi->s_stripe;
> + } else if (stripe_width <= sbi->s_blocks_per_group) {
> + return stripe_width;
> + } else if (stride <= sbi->s_blocks_per_group) {
> + return stride;
> + }
unneeded braces.
> + return 0;
> +}
>
> ...
>
> +static inline
> +struct ext4_group_info *ext4_get_group_info(struct super_block *sb,
> + ext4_group_t group)
> +{
> + struct ext4_group_info ***grp_info;
> + long indexv, indexh;
> + grp_info = EXT4_SB(sb)->s_group_info;
> + indexv = group >> (EXT4_DESC_PER_BLOCK_BITS(sb));
> + indexh = group & ((EXT4_DESC_PER_BLOCK(sb)) - 1);
> + return grp_info[indexv][indexh];
> +}
This should be uninlined.
Gosh what a lot of code. Is it faster?
* Re: [PATCH 33/49] ext4: Add the journal checksum feature
2008-01-23 22:07 ` [PATCH 33/49] ext4: Add the journal checksum feature Andrew Morton
@ 2008-01-23 22:40 ` Andreas Dilger
2008-01-24 21:24 ` Mingming Cao
1 sibling, 0 replies; 23+ messages in thread
From: Andreas Dilger @ 2008-01-23 22:40 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Ts'o, linux-kernel, girish, adilger, shaggy, cmm,
linux-ext4@vger.kernel.org
On Jan 23, 2008 14:07 -0800, Andrew Morton wrote:
> > +{
> > + struct page *page = bh->b_page;
> > + char *addr;
> > + __u32 checksum;
> > +
> > + addr = kmap_atomic(page, KM_USER0);
> > + checksum = crc32_be(crc32_sum,
> > + (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
> > + kunmap_atomic(addr, KM_USER0);
> > +
> > + return checksum;
> > +}
>
> Can this buffer actually be in highmem?
Yes, this was found during system testing. While ext3/4 will only allocate
buffer heads in lowmem, the jbd/jbd2 code can allocate buffers in highmem.
I was surprised about this also.
Please see the thread in ext4-devel:
[PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: [PATCH 41/49] ext4: Add multi block allocator for ext4
2008-01-23 22:07 ` [PATCH 41/49] ext4: Add multi block allocator for ext4 Andrew Morton
@ 2008-01-23 23:20 ` Andreas Dilger
2008-01-24 7:56 ` Aneesh Kumar K.V
1 sibling, 0 replies; 23+ messages in thread
From: Andreas Dilger @ 2008-01-23 23:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Ts'o, linux-kernel, alex, adilger, aneesh.kumar,
sandeen, linux-ext4@vger.kernel.org
On Jan 23, 2008 14:07 -0800, Andrew Morton wrote:
> > +#define mb_correct_addr_and_bit(bit, addr) \
> > +{ \
> > + bit += ((unsigned long) addr & 3UL) << 3; \
> > + addr = (void *) ((unsigned long) addr & ~3UL); \
> > +}
>
> Why do these exist?
They seem to be a holdover from when mballoc stored the buddy bitmaps
on disk. That no longer happens (to avoid bitmap vs. buddy consistency
problems), so I suspect they can be removed.
I can't comment on many of the other issues because Alex wrote most
of the code.
> Gosh what a lot of code. Is it faster?
Yes, and also importantly it uses a lot less CPU to do a given amount
of allocation, which is critical in our environments where there is
very high disk bandwidth on a single node and CPU becomes the limiting
factor of the IO speed. This of course also helps any write-intensive
environment where the CPU is doing something "useful".
Some older test results include:
https://ols2006.108.redhat.com/2007/Reprints/mathur-Reprint.pdf (Section 7)
In the linux-ext4 thread "compilebench numbers for ext4":
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg03834.html
http://oss.oracle.com/~mason/compilebench/ext4/ext-create-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-compile-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-rm-compare.png
note the ext-read-compare.png graph shows lower read performance, but
a couple of bugs in mballoc were since fixed to have ext4 allocate more
contiguous extents.
In the old linux-ext4 thread "[RFC] delayed allocation testing on node zefir"
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg00587.html
: dd2048rw
: REAL UTIME STIME READ WRITTEN DETAILS
EXT3 : 58.46 23 1491 2572 2097292 17 extents
EXT4 : 44.56 19 1018 12 2097244 19 extents
REISERFS: 56.80 26 1370 2952 2097336 457 extents
JFS : 45.77 22 984 0 2097216 1 extents
XFS : 50.97 20 1394 0 2100825 7 extents
: kernuntar
: REAL UTIME STIME READ WRITTEN DETAILS
EXT3 : 56.99 5037 651 68 252016
EXT4 : 55.03 5034 553 36 249884
REISERFS: 52.55 4996 854 64 238068
JFS : 70.15 5057 630 496 288116
XFS : 72.84 5052 953 132 316798
: kernstat
: REAL UTIME STIME READ WRITTEN DETAILS
EXT3 : 2.83 8 15 5892 0
EXT4 : 0.51 9 10 5892 0
REISERFS: 0.81 7 49 2696 0
JFS : 6.19 11 49 12552 0
XFS : 2.09 9 61 6504 0
: kerncat
: REAL UTIME STIME READ WRITTEN DETAILS
EXT3 : 9.48 25 213 241624 0
EXT4 : 6.29 27 197 238560 0
REISERFS: 14.69 33 230 234744 0
JFS : 23.51 23 231 244596 0
XFS : 18.24 36 254 238548 0
: kernrm
: REAL UTIME STIME READ WRITTEN DETAILS
EXT3 : 4.82 4 108 9628 4672
EXT4 : 1.61 5 110 6536 4632
REISERFS: 3.15 8 276 2768 236
JFS : 33.90 7 168 14400 33048
XFS : 20.03 8 296 6632 86160
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: [PATCH 23/49] Add buffer head related helper functions
2008-01-23 22:06 ` [PATCH 23/49] Add buffer head related helper functions Andrew Morton
@ 2008-01-24 5:22 ` Aneesh Kumar K.V
2008-01-24 8:53 ` Andrew Morton
0 siblings, 1 reply; 23+ messages in thread
From: Aneesh Kumar K.V @ 2008-01-24 5:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Theodore Ts'o, linux-kernel, linux-ext4@vger.kernel.org
On Wed, Jan 23, 2008 at 02:06:48PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:02 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> > +}
> > +EXPORT_SYMBOL(bh_uptodate_or_lock);
> > +/**
>
> Missing newline.
>
> > + * bh_submit_read: Submit a locked buffer for reading
> > + * @bh: struct buffer_head
> > + *
> > + * Returns a negative error
> > + */
> > +int bh_submit_read(struct buffer_head *bh)
> > +{
> > + if (!buffer_locked(bh))
> > + lock_buffer(bh);
> > +
> > + if (buffer_uptodate(bh))
> > + return 0;
>
> Here it can lock the buffer then return zero
>
> > + get_bh(bh);
> > + bh->b_end_io = end_buffer_read_sync;
> > + submit_bh(READ, bh);
> > + wait_on_buffer(bh);
> > + if (buffer_uptodate(bh))
> > + return 0;
>
> Here it will unlock the buffer and return zero.
>
> This function is unusable when passed an unlocked buffer.
>
Updated patch below.
commit 70d4ca32604e0935a8b9a49c5ac8b9c64c810693
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Thu Jan 24 10:50:24 2008 +0530
Add buffer head related helper functions
Add buffer head related helper function bh_uptodate_or_lock and
bh_submit_read which can be used by file system
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..82aa2db 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3213,6 +3213,53 @@ static int buffer_cpu_notify(struct notifier_block *self,
return NOTIFY_OK;
}
+/**
+ * bh_uptodate_or_lock: Test whether the buffer is uptodate
+ * @bh: struct buffer_head
+ *
+ * Return true if the buffer is up-to-date and false,
+ * with the buffer locked, if not.
+ */
+int bh_uptodate_or_lock(struct buffer_head *bh)
+{
+ if (!buffer_uptodate(bh)) {
+ lock_buffer(bh);
+ if (!buffer_uptodate(bh))
+ return 0;
+ unlock_buffer(bh);
+ }
+ return 1;
+}
+EXPORT_SYMBOL(bh_uptodate_or_lock);
+
+/**
+ * bh_submit_read: Submit a locked buffer for reading
+ * @bh: struct buffer_head
+ *
+ * Returns zero on success and -EIO on error. If the input
+ * buffer is not locked, returns -EINVAL.
+ *
+ */
+int bh_submit_read(struct buffer_head *bh)
+{
+ if (!buffer_locked(bh))
+ return -EINVAL;
+
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ return 0;
+ }
+
+ get_bh(bh);
+ bh->b_end_io = end_buffer_read_sync;
+ submit_bh(READ, bh);
+ wait_on_buffer(bh);
+ if (buffer_uptodate(bh))
+ return 0;
+ return -EIO;
+}
+EXPORT_SYMBOL(bh_submit_read);
+
void __init buffer_init(void)
{
int nrpages;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index da0d83f..e98801f 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -192,6 +192,8 @@ int sync_dirty_buffer(struct buffer_head *bh);
int submit_bh(int, struct buffer_head *);
void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize);
+int bh_uptodate_or_lock(struct buffer_head *bh);
+int bh_submit_read(struct buffer_head *bh);
extern int buffer_heads_over_limit;
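For reference, a caller would typically pair the two helpers like this
(a sketch mirroring the ext4 bitmap readers elsewhere in this series):

	if (!bh_uptodate_or_lock(bh)) {
		/* bh is now locked and not uptodate */
		if (bh_submit_read(bh) < 0) {
			put_bh(bh);
			return NULL;	/* I/O error */
		}
	}
	/* bh is uptodate here */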
* Re: [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore.
2008-01-23 22:06 ` [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore Andrew Morton
@ 2008-01-24 5:29 ` Aneesh Kumar K.V
2008-01-24 13:00 ` Andy Whitcroft
1 sibling, 0 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2008-01-24 5:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Ts'o, linux-kernel, Andy Whitcroft,
linux-ext4@vger.kernel.org
On Wed, Jan 23, 2008 at 02:06:59PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:09 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> > +int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> > + unsigned long max_blocks, struct buffer_head *bh,
> > + int create, int extend_disksize)
> > +{
> > + int retval;
> > + if (create) {
> > + down_write((&EXT4_I(inode)->i_data_sem));
> > + } else {
> > + down_read((&EXT4_I(inode)->i_data_sem));
> > + }
> > + if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
> > + retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
> > + bh, create, extend_disksize);
> > + } else {
> > + retval = ext4_get_blocks_handle(handle, inode, block,
> > + max_blocks, bh, create, extend_disksize);
> > + }
> > + if (create) {
> > + up_write((&EXT4_I(inode)->i_data_sem));
> > + } else {
> > + up_read((&EXT4_I(inode)->i_data_sem));
> > + }
>
> This function has many unneeded braces. checkpatch used to detect this
> but it seems to have broken.
The follow up patch "ext4: Take read lock during overwrite case" removes
those single-line if statements.
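i.e. something like this (a sketch of what the result would look like,
assuming the follow-up patch keeps the same structure):
	if (create)
		down_write(&EXT4_I(inode)->i_data_sem);
	else
		down_read(&EXT4_I(inode)->i_data_sem);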
>
> > + return retval;
> > +}
> > static int ext4_get_block(struct inode *inode, sector_t iblock,
> > struct buffer_head *bh_result, int create)
>
> Missing newline.
Fixed.
-aneesh
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl
2008-01-23 22:07 ` [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl Andrew Morton
@ 2008-01-24 5:55 ` Aneesh Kumar K.V
2008-01-26 4:15 ` Theodore Tso
0 siblings, 1 reply; 23+ messages in thread
From: Aneesh Kumar K.V @ 2008-01-24 5:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: Theodore Ts'o, linux-kernel, linux-ext4@vger.kernel.org
On Wed, Jan 23, 2008 at 02:07:16PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:15 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> > The patch below adds an ioctl for migrating an ext3 indirect-block-mapped
> > inode to an ext4 extent-mapped inode.
>
> This patch adds lots of weird and inexplicable single- and double-newlines
> in inappropriate places. However it frequently forgets to add newlines
> between end-of-locals and start-of-code, which is usual practice.
>
>
> +struct list_blocks_struct {
> + ext4_lblk_t first_block, last_block;
> + ext4_fsblk_t first_pblock, last_pblock;
> +};
>
Updated patch
commit c4786b67cdc5b24d2548a69b62774fb54f8f1575
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Tue Jan 22 09:28:55 2008 +0530
ext4: Add EXT4_IOC_MIGRATE ioctl
The patch below adds an ioctl for migrating an ext3 indirect-block-mapped
inode to an ext4 extent-mapped inode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index ae6e7e5..d5fd80b 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_EXT4DEV_FS) += ext4dev.o
ext4dev-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
- ext4_jbd2.o
+ ext4_jbd2.o migrate.o
ext4dev-$(CONFIG_EXT4DEV_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4dev-$(CONFIG_EXT4DEV_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 03d1bbb..323cd76 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -75,7 +75,7 @@ static ext4_fsblk_t idx_pblock(struct ext4_extent_idx *ix)
* stores a large physical block number into an extent struct,
* breaking it into parts
*/
-static void ext4_ext_store_pblock(struct ext4_extent *ex, ext4_fsblk_t pb)
+void ext4_ext_store_pblock(struct ext4_extent *ex, ext4_fsblk_t pb)
{
ex->ee_start_lo = cpu_to_le32((unsigned long) (pb & 0xffffffff));
ex->ee_start_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index c0e5b8c..2ed7c37 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -254,6 +254,9 @@ flags_err:
return err;
}
+ case EXT4_IOC_MIGRATE:
+ return ext4_ext_migrate(inode, filp, cmd, arg);
+
default:
return -ENOTTY;
}
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
new file mode 100644
index 0000000..deb2327
--- /dev/null
+++ b/fs/ext4/migrate.c
@@ -0,0 +1,588 @@
+/*
+ * Copyright IBM Corporation, 2007
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/ext4_jbd2.h>
+#include <linux/ext4_fs_extents.h>
+
+/*
+ * The contiguous blocks details which can be
+ * represented by a single extent
+ */
+struct list_blocks_struct {
+ ext4_lblk_t first_block, last_block;
+ ext4_fsblk_t first_pblock, last_pblock;
+};
+
+static int finish_range(handle_t *handle, struct inode *inode,
+ struct list_blocks_struct *lb)
+{
+ int retval = 0, needed;
+ struct ext4_extent newext;
+ struct ext4_ext_path *path;
+
+ if (lb->first_pblock == 0)
+ return 0;
+
+ /* Add the extent to the temp inode */
+ newext.ee_block = cpu_to_le32(lb->first_block);
+ newext.ee_len = cpu_to_le16(lb->last_block - lb->first_block + 1);
+ ext4_ext_store_pblock(&newext, lb->first_pblock);
+ path = ext4_ext_find_extent(inode, lb->first_block, NULL);
+
+ if (IS_ERR(path)) {
+ retval = PTR_ERR(path);
+ goto err_out;
+ }
+
+ /*
+ * Calculate the credits needed to insert this extent.
+ * Since we are doing this in a loop we may accumulate
+ * extra credits, but below we try not to accumulate too
+ * many of them by restarting the journal.
+ */
+ needed = ext4_ext_calc_credits_for_insert(inode, path);
+
+ /*
+ * Make sure the credits we accumulated are not too high
+ */
+ if (needed && handle->h_buffer_credits >= EXT4_RESERVE_TRANS_BLOCKS) {
+ retval = ext4_journal_restart(handle, needed);
+ if (retval)
+ goto err_out;
+ }
+ if (needed) {
+ retval = ext4_journal_extend(handle, needed);
+ if (retval != 0) {
+ /*
+ * If not able to extend the journal, restart it
+ */
+ retval = ext4_journal_restart(handle, needed);
+ if (retval)
+ goto err_out;
+ }
+ }
+ retval = ext4_ext_insert_extent(handle, inode, path, &newext);
+
+err_out:
+ lb->first_pblock = 0;
+ return retval;
+}
+
+static int update_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t blk_num,
+ struct list_blocks_struct *lb)
+{
+ int retval;
+
+ /*
+ * See if we can add on to the existing range (if it exists)
+ */
+ if (lb->first_pblock &&
+ (lb->last_pblock+1 == pblock) &&
+ (lb->last_block+1 == blk_num)) {
+ lb->last_pblock = pblock;
+ lb->last_block = blk_num;
+ return 0;
+ }
+ /*
+ * Start a new range.
+ */
+ retval = finish_range(handle, inode, lb);
+ lb->first_pblock = lb->last_pblock = pblock;
+ lb->first_block = lb->last_block = blk_num;
+
+ return retval;
+}
+
+static int update_ind_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t *blk_nump,
+ struct list_blocks_struct *lb)
+{
+ struct buffer_head *bh;
+ __le32 *i_data;
+ int i, retval = 0;
+ ext4_lblk_t blk_count = *blk_nump;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ if (!pblock) {
+ /* Only update the file block number */
+ *blk_nump += max_entries;
+ return 0;
+ }
+
+ bh = sb_bread(inode->i_sb, pblock);
+ if (!bh)
+ return -EIO;
+
+ i_data = (__le32 *)bh->b_data;
+
+ for (i = 0; i < max_entries; i++, blk_count++) {
+ if (i_data[i]) {
+ retval = update_extent_range(handle, inode,
+ le32_to_cpu(i_data[i]),
+ blk_count, lb);
+ if (retval)
+ break;
+ }
+ }
+
+ /* Update the file block number */
+ *blk_nump = blk_count;
+ brelse(bh);
+ return retval;
+}
+
+static int update_dind_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t *blk_nump,
+ struct list_blocks_struct *lb)
+{
+ struct buffer_head *bh;
+ __le32 *i_data;
+ int i, retval = 0;
+ ext4_lblk_t blk_count = *blk_nump;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ if (!pblock) {
+ /* Only update the file block number */
+ *blk_nump += max_entries * max_entries;
+ return 0;
+ }
+ bh = sb_bread(inode->i_sb, pblock);
+ if (!bh)
+ return -EIO;
+
+ i_data = (__le32 *)bh->b_data;
+ for (i = 0; i < max_entries; i++) {
+ if (i_data[i]) {
+ retval = update_ind_extent_range(handle, inode,
+ le32_to_cpu(i_data[i]),
+ &blk_count, lb);
+ if (retval)
+ break;
+ } else {
+ /* Only update the file block number */
+ blk_count += max_entries;
+ }
+ }
+
+ /* Update the file block number */
+ *blk_nump = blk_count;
+ brelse(bh);
+ return retval;
+}
+
+static int update_tind_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t *blk_nump,
+ struct list_blocks_struct *lb)
+{
+ struct buffer_head *bh;
+ __le32 *i_data;
+ int i, retval = 0;
+ ext4_lblk_t blk_count = *blk_nump;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ if (!pblock) {
+ /* Only update the file block number */
+ *blk_nump += max_entries * max_entries * max_entries;
+ return 0;
+ }
+ bh = sb_bread(inode->i_sb, pblock);
+ if (!bh)
+ return -EIO;
+
+ i_data = (__le32 *)bh->b_data;
+ for (i = 0; i < max_entries; i++) {
+ if (i_data[i]) {
+ retval = update_dind_extent_range(handle, inode,
+ le32_to_cpu(i_data[i]),
+ &blk_count, lb);
+ if (retval)
+ break;
+ } else {
+ /* Only update the file block number */
+ blk_count += max_entries * max_entries;
+ }
+ }
+ /* Update the file block number */
+ *blk_nump = blk_count;
+ brelse(bh);
+ return retval;
+}
+
+static int free_dind_blocks(handle_t *handle,
+ struct inode *inode, __le32 i_data)
+{
+ int i;
+ __le32 *tmp_idata;
+ struct buffer_head *bh;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ bh = sb_bread(inode->i_sb, le32_to_cpu(i_data));
+ if (!bh)
+ return -EIO;
+
+ tmp_idata = (__le32 *)bh->b_data;
+ for (i = 0; i < max_entries; i++) {
+ if (tmp_idata[i]) {
+ ext4_free_blocks(handle, inode,
+ le32_to_cpu(tmp_idata[i]), 1);
+ }
+ }
+ brelse(bh);
+ ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1);
+ return 0;
+}
+
+static int free_tind_blocks(handle_t *handle,
+ struct inode *inode, __le32 i_data)
+{
+ int i, retval = 0;
+ __le32 *tmp_idata;
+ struct buffer_head *bh;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ bh = sb_bread(inode->i_sb, le32_to_cpu(i_data));
+ if (!bh)
+ return -EIO;
+
+ tmp_idata = (__le32 *)bh->b_data;
+ for (i = 0; i < max_entries; i++) {
+ if (tmp_idata[i]) {
+ retval = free_dind_blocks(handle,
+ inode, tmp_idata[i]);
+ if (retval) {
+ brelse(bh);
+ return retval;
+ }
+ }
+ }
+ brelse(bh);
+ ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1);
+ return 0;
+}
+
+static int free_ind_block(handle_t *handle, struct inode *inode)
+{
+ int retval;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ if (ei->i_data[EXT4_IND_BLOCK]) {
+ ext4_free_blocks(handle, inode,
+ le32_to_cpu(ei->i_data[EXT4_IND_BLOCK]), 1);
+ }
+
+ if (ei->i_data[EXT4_DIND_BLOCK]) {
+ retval = free_dind_blocks(handle, inode,
+ ei->i_data[EXT4_DIND_BLOCK]);
+ if (retval)
+ return retval;
+ }
+
+ if (ei->i_data[EXT4_TIND_BLOCK]) {
+ retval = free_tind_blocks(handle, inode,
+ ei->i_data[EXT4_TIND_BLOCK]);
+ if (retval)
+ return retval;
+ }
+ return 0;
+}
+
+static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
+ struct inode *tmp_inode, int retval)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);
+
+ retval = free_ind_block(handle, inode);
+ if (retval)
+ goto err_out;
+
+ /*
+ * One credit accounted for writing the
+ * i_data field of the original inode
+ */
+ retval = ext4_journal_extend(handle, 1);
+ if (retval != 0) {
+ retval = ext4_journal_restart(handle, 1);
+ if (retval)
+ goto err_out;
+ }
+
+ /*
+ * We have the extent map built with the tmp inode.
+ * Now copy the i_data across.
+ */
+ ei->i_flags |= EXT4_EXTENTS_FL;
+ memcpy(ei->i_data, tmp_ei->i_data, sizeof(ei->i_data));
+
+ /*
+ * Update i_blocks with the new blocks that got
+ * allocated while adding extents for extent index
+ * blocks.
+ *
+ * While converting to extents we need not
+ * update the original inode i_blocks for extent blocks
+ * via quota APIs. The quota update happened via tmp_inode already.
+ */
+ spin_lock(&inode->i_lock);
+ inode->i_blocks += tmp_inode->i_blocks;
+ spin_unlock(&inode->i_lock);
+
+ ext4_mark_inode_dirty(handle, inode);
+err_out:
+ return retval;
+}
+
+/* Will go away */
+static ext4_fsblk_t idx_pblock(struct ext4_extent_idx *ix)
+{
+ ext4_fsblk_t block;
+
+ block = le32_to_cpu(ix->ei_leaf_lo);
+ block |= ((ext4_fsblk_t) le16_to_cpu(ix->ei_leaf_hi) << 31) << 1;
+ return block;
+}
+
+static int free_ext_idx(handle_t *handle, struct inode *inode,
+ struct ext4_extent_idx *ix)
+{
+ int i, retval = 0;
+ ext4_fsblk_t block;
+ struct buffer_head *bh;
+ struct ext4_extent_header *eh;
+
+ block = idx_pblock(ix);
+ bh = sb_bread(inode->i_sb, block);
+ if (!bh)
+ return -EIO;
+
+ eh = (struct ext4_extent_header *)bh->b_data;
+ if (eh->eh_depth == 0) {
+ brelse(bh);
+ ext4_free_blocks(handle, inode, block, 1);
+ } else {
+ ix = EXT_FIRST_INDEX(eh);
+ for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+ retval = free_ext_idx(handle, inode, ix);
+ if (retval)
+ return retval;
+ }
+ }
+ return retval;
+}
+
+/*
+ * Free the extent meta data blocks only
+ */
+static int free_ext_block(handle_t *handle, struct inode *inode)
+{
+ int i, retval = 0;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_extent_header *eh = (struct ext4_extent_header *)ei->i_data;
+ struct ext4_extent_idx *ix;
+
+ if (eh->eh_depth == 0) {
+ /*
+ * No extra blocks allocated for extent meta data
+ */
+ return 0;
+ }
+ ix = EXT_FIRST_INDEX(eh);
+ for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+ retval = free_ext_idx(handle, inode, ix);
+ if (retval)
+ return retval;
+ }
+ return retval;
+}
+
+int ext4_ext_migrate(struct inode *inode, struct file *filp,
+ unsigned int cmd, unsigned long arg)
+{
+ handle_t *handle;
+ int retval = 0, i;
+ __le32 *i_data;
+ ext4_lblk_t blk_count = 0;
+ struct ext4_inode_info *ei;
+ struct inode *tmp_inode = NULL;
+ struct list_blocks_struct lb;
+ unsigned long max_entries;
+
+ if (!test_opt(inode->i_sb, EXTENTS)) {
+ /*
+ * if mounted with noextents
+ * we don't allow the migration
+ */
+ return -EINVAL;
+ }
+
+ if ((EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return -EINVAL;
+
+ down_write(&EXT4_I(inode)->i_data_sem);
+ handle = ext4_journal_start(inode,
+ EXT4_DATA_TRANS_BLOCKS(inode->i_sb) +
+ EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+ 2 * EXT4_QUOTA_INIT_BLOCKS(inode->i_sb)
+ + 1);
+ if (IS_ERR(handle)) {
+ retval = PTR_ERR(handle);
+ goto err_out;
+ }
+ tmp_inode = ext4_new_inode(handle,
+ inode->i_sb->s_root->d_inode,
+ S_IFREG);
+ if (IS_ERR(tmp_inode)) {
+ retval = -ENOMEM;
+ ext4_journal_stop(handle);
+ tmp_inode = NULL;
+ goto err_out;
+ }
+ i_size_write(tmp_inode, i_size_read(inode));
+ /*
+ * We don't want the inode to be reclaimed
+ * if we got interrupted in between. We have
+ * this tmp inode carrying reference to the
+ * data blocks of the original file. We set
+ * the i_nlink to zero at the last stage after
+ * switching the original file to extent format
+ */
+ tmp_inode->i_nlink = 1;
+
+ ext4_ext_tree_init(handle, tmp_inode);
+ ext4_orphan_add(handle, tmp_inode);
+ ext4_journal_stop(handle);
+
+ ei = EXT4_I(inode);
+ i_data = ei->i_data;
+ memset(&lb, 0, sizeof(lb));
+
+ /* 32 bit block address 4 bytes */
+ max_entries = inode->i_sb->s_blocksize >> 2;
+
+ /*
+ * start with one credit accounted for
+ * superblock modification.
+ *
+ * For the tmp_inode we have already committed the
+ * transaction that created the inode. Later, as and
+ * when we add extents, we extend the journal.
+ */
+ handle = ext4_journal_start(inode, 1);
+ for (i = 0; i < EXT4_NDIR_BLOCKS; i++, blk_count++) {
+ if (i_data[i]) {
+ retval = update_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[i]),
+ blk_count, &lb);
+ if (retval)
+ goto err_out;
+ }
+ }
+ if (i_data[EXT4_IND_BLOCK]) {
+ retval = update_ind_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[EXT4_IND_BLOCK]),
+ &blk_count, &lb);
+ if (retval)
+ goto err_out;
+ } else {
+ blk_count += max_entries;
+ }
+ if (i_data[EXT4_DIND_BLOCK]) {
+ retval = update_dind_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[EXT4_DIND_BLOCK]),
+ &blk_count, &lb);
+ if (retval)
+ goto err_out;
+ } else {
+ blk_count += max_entries * max_entries;
+ }
+ if (i_data[EXT4_TIND_BLOCK]) {
+ retval = update_tind_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[EXT4_TIND_BLOCK]),
+ &blk_count, &lb);
+ if (retval)
+ goto err_out;
+ }
+ /*
+ * Build the last extent
+ */
+ retval = finish_range(handle, tmp_inode, &lb);
+err_out:
+ /*
+ * We are either freeing extent information or indirect
+ * blocks. During this we touch superblock, group descriptor
+ * and block bitmap. Later we mark the tmp_inode dirty
+ * via ext4_ext_tree_init. So allocate a credit of 4
+ * We may update quota (user and group).
+ *
+ * FIXME!! we may be touching bitmaps in different block groups.
+ */
+ if (ext4_journal_extend(handle,
+ 4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb)) != 0) {
+ ext4_journal_restart(handle,
+ 4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
+ }
+ if (retval) {
+ /*
+ * On failure, delete the extent information associated
+ * with the tmp_inode
+ */
+ free_ext_block(handle, tmp_inode);
+ } else {
+ retval = ext4_ext_swap_inode_data(handle, inode,
+ tmp_inode, retval);
+ }
+
+ /*
+ * Mark the tmp_inode as of size zero
+ */
+ i_size_write(tmp_inode, 0);
+
+ /*
+ * set the i_blocks count to zero
+ * so that the ext4_delete_inode does the
+ * right job
+ *
+ * We don't need to take the i_lock because
+ * the inode is not visible to user space.
+ */
+ tmp_inode->i_blocks = 0;
+
+ /* Reset the extent details */
+ ext4_ext_tree_init(handle, tmp_inode);
+
+ /*
+ * Set the i_nlink to zero so that
+ * generic_drop_inode really deletes the
+ * inode
+ */
+ tmp_inode->i_nlink = 0;
+
+ ext4_journal_stop(handle);
+
+ up_write(&EXT4_I(inode)->i_data_sem);
+
+ if (tmp_inode)
+ iput(tmp_inode);
+
+ return retval;
+}
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index b609294..213974f 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -243,6 +243,7 @@ struct ext4_new_group_data {
#endif
#define EXT4_IOC_GETRSVSZ _IOR('f', 5, long)
#define EXT4_IOC_SETRSVSZ _IOW('f', 6, long)
+#define EXT4_IOC_MIGRATE _IO('f', 7)
/*
* ioctl commands in 32 bit emulation
@@ -983,6 +984,9 @@ extern int ext4_ioctl (struct inode *, struct file *, unsigned int,
unsigned long);
extern long ext4_compat_ioctl (struct file *, unsigned int, unsigned long);
+/* migrate.c */
+extern int ext4_ext_migrate(struct inode *, struct file *, unsigned int,
+ unsigned long);
/* namei.c */
extern int ext4_orphan_add(handle_t *, struct inode *);
extern int ext4_orphan_del(handle_t *, struct inode *);
diff --git a/include/linux/ext4_fs_extents.h b/include/linux/ext4_fs_extents.h
index 023683b..db64509 100644
--- a/include/linux/ext4_fs_extents.h
+++ b/include/linux/ext4_fs_extents.h
@@ -212,6 +212,7 @@ static inline int ext4_ext_get_actual_len(struct ext4_extent *ext)
(le16_to_cpu(ext->ee_len) - EXT_INIT_MAX_LEN));
}
+extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
extern int ext4_extent_tree_init(handle_t *, struct inode *);
extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
extern int ext4_ext_try_to_merge(struct inode *inode,
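For illustration, a userspace caller of the new ioctl might look like this
(a hypothetical sketch; only the EXT4_IOC_MIGRATE definition comes from the
patch, the helper name and everything else is made up):
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	#define EXT4_IOC_MIGRATE	_IO('f', 7)

	/* migrate one file from indirect block maps to extents */
	int migrate_to_extents(const char *path)
	{
		int ret, fd;

		fd = open(path, O_RDWR);
		if (fd < 0)
			return -1;
		ret = ioctl(fd, EXT4_IOC_MIGRATE);
		close(fd);
		return ret;
	}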
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 41/49] ext4: Add multi block allocator for ext4
2008-01-23 22:07 ` [PATCH 41/49] ext4: Add multi block allocator for ext4 Andrew Morton
2008-01-23 23:20 ` Andreas Dilger
@ 2008-01-24 7:56 ` Aneesh Kumar K.V
2008-01-24 9:04 ` Aneesh Kumar K.V
2008-01-24 14:53 ` Aneesh Kumar K.V
1 sibling, 2 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2008-01-24 7:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Ts'o, linux-kernel, alex, adilger, sandeen,
linux-ext4@vger.kernel.org
On Wed, Jan 23, 2008 at 02:07:27PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:20 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> > From: Alex Tomas <alex@clusterfs.com>
> >
> > Signed-off-by: Alex Tomas <alex@clusterfs.com>
> > Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> > Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> >
> > ...
> >
> > +#if BITS_PER_LONG == 64
> > +#define mb_correct_addr_and_bit(bit, addr) \
> > +{ \
> > + bit += ((unsigned long) addr & 7UL) << 3; \
> > + addr = (void *) ((unsigned long) addr & ~7UL); \
> > +}
> > +#elif BITS_PER_LONG == 32
> > +#define mb_correct_addr_and_bit(bit, addr) \
> > +{ \
> > + bit += ((unsigned long) addr & 3UL) << 3; \
> > + addr = (void *) ((unsigned long) addr & ~3UL); \
> > +}
> > +#else
> > +#error "how many bits you are?!"
> > +#endif
>
> Why do these exist?
The initial version of mballoc supported only 32-bit x86, and this was there
to give a compile-time error on 64-bit platforms. I guess we can remove it
now, or maybe we can keep it as it is because it is harmless.
>
> > +static inline int mb_test_bit(int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + return ext4_test_bit(bit, addr);
> > +}
>
> ext2_test_bit() already handles bitnum > wordsize.
>
> If mb_correct_addr_and_bit() is actually needed then some suitable comment
> would help.
ext4_test_bit() on powerpc needs the address to be 8-byte aligned; otherwise
it fails.
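For example (illustrative numbers): if the bitmap sits at base+5 with base
8-byte aligned, testing bit 3 becomes testing bit 3 + 5*8 = 43 at base, so
the underlying bitop always sees an aligned address:
	bit += ((unsigned long)addr & 7UL) << 3;	/* 3 + 40 = 43 */
	addr = (void *)((unsigned long)addr & ~7UL);	/* back to base */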
>
> > +static inline void mb_set_bit(int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + ext4_set_bit(bit, addr);
> > +}
> > +
> > +static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + ext4_set_bit_atomic(lock, bit, addr);
> > +}
> > +
> > +static inline void mb_clear_bit(int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + ext4_clear_bit(bit, addr);
> > +}
> > +
> > +static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + ext4_clear_bit_atomic(lock, bit, addr);
> > +}
> > +
> > +static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
>
> uninlining this will save about eighty squigabytes of text.
Fixed
>
> Please review all of ext4/jbd2 with a view to removing unnecessary and wrong
> inlinings.
>
> > +{
> > + char *bb;
> > +
> > + /* FIXME!! is this needed */
> > + BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b));
> > + BUG_ON(max == NULL);
> > +
> > + if (order > e4b->bd_blkbits + 1) {
> > + *max = 0;
> > + return NULL;
> > + }
> > +
> > + /* at order 0 we see each particular block */
> > + *max = 1 << (e4b->bd_blkbits + 3);
> > + if (order == 0)
> > + return EXT4_MB_BITMAP(e4b);
> > +
> > + bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order];
> > + *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order];
> > +
> > + return bb;
> > +}
> > +
> >
> > ...
> >
> > +#else
> > +#define mb_free_blocks_double(a, b, c, d)
> > +#define mb_mark_used_double(a, b, c)
> > +#define mb_cmp_bitmaps(a, b)
> > +#endif
>
> Please use the do{}while(0) thing. Or, better, proper C functions which
> have typechecking (unless this will cause undefined-var compile errors,
> which happens sometimes)
Made them static inline void functions.
>
> > +/* find most significant bit */
> > +static int fmsb(unsigned short word)
> > +{
> > + int order;
> > +
> > + if (word > 255) {
> > + order = 7;
> > + word >>= 8;
> > + } else {
> > + order = -1;
> > + }
> > +
> > + do {
> > + order++;
> > + word >>= 1;
> > + } while (word != 0);
> > +
> > + return order;
> > +}
>
> Did we just reinvent fls()?
Replaced by fls().
>
> > +/* FIXME!! need more doc */
> > +static void ext4_mb_mark_free_simple(struct super_block *sb,
> > + void *buddy, unsigned first, int len,
> > + struct ext4_group_info *grp)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > + unsigned short min;
> > + unsigned short max;
> > + unsigned short chunk;
> > + unsigned short border;
> > +
> > + BUG_ON(len >= EXT4_BLOCKS_PER_GROUP(sb));
> > +
> > + border = 2 << sb->s_blocksize_bits;
>
> Won't this explode with >= 32k blocksize?
>
> > + while (len > 0) {
> > + /* find how many blocks can be covered since this position */
> > + max = ffs(first | border) - 1;
> > +
> > + /* find how many blocks of power 2 we need to mark */
> > + min = fmsb(len);
> > +
> > + if (max < min)
> > + min = max;
> > + chunk = 1 << min;
> > +
> > + /* mark multiblock chunks only */
> > + grp->bb_counters[min]++;
> > + if (min > 0)
> > + mb_clear_bit(first >> min,
> > + buddy + sbi->s_mb_offsets[min]);
> > +
> > + len -= chunk;
> > + first += chunk;
> > + }
> > +}
> > +
> >
> > ...
> >
> > +static int ext4_mb_init_cache(struct page *page, char *incore)
> > +{
> > + int blocksize;
> > + int blocks_per_page;
> > + int groups_per_page;
> > + int err = 0;
> > + int i;
> > + ext4_group_t first_group;
> > + int first_block;
> > + struct super_block *sb;
> > + struct buffer_head *bhs;
> > + struct buffer_head **bh;
> > + struct inode *inode;
> > + char *data;
> > + char *bitmap;
> > +
> > + mb_debug("init page %lu\n", page->index);
> > +
> > + inode = page->mapping->host;
> > + sb = inode->i_sb;
> > + blocksize = 1 << inode->i_blkbits;
> > + blocks_per_page = PAGE_CACHE_SIZE / blocksize;
> > +
> > + groups_per_page = blocks_per_page >> 1;
> > + if (groups_per_page == 0)
> > + groups_per_page = 1;
> > +
> > + /* allocate buffer_heads to read bitmaps */
> > + if (groups_per_page > 1) {
> > + err = -ENOMEM;
> > + i = sizeof(struct buffer_head *) * groups_per_page;
> > + bh = kmalloc(i, GFP_NOFS);
> > + if (bh == NULL)
> > + goto out;
> > + memset(bh, 0, i);
>
> kzalloc()
Fixed
>
> > + } else
> > + bh = &bhs;
> > +
> > + first_group = page->index * blocks_per_page / 2;
> > +
> > + /* read all groups the page covers into the cache */
> > + for (i = 0; i < groups_per_page; i++) {
> > + struct ext4_group_desc *desc;
> > +
> > + if (first_group + i >= EXT4_SB(sb)->s_groups_count)
> > + break;
> > +
> > + err = -EIO;
> > + desc = ext4_get_group_desc(sb, first_group + i, NULL);
> > + if (desc == NULL)
> > + goto out;
> > +
> > + err = -ENOMEM;
> > + bh[i] = sb_getblk(sb, ext4_block_bitmap(sb, desc));
> > + if (bh[i] == NULL)
> > + goto out;
> > +
> > + if (buffer_uptodate(bh[i]))
> > + continue;
> > +
> > + lock_buffer(bh[i]);
> > + if (buffer_uptodate(bh[i])) {
> > + unlock_buffer(bh[i]);
> > + continue;
> > + }
>
> Didn't we just add a helper in fs/buffer.c to do this?
>
Fixed
> > + if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> > + ext4_init_block_bitmap(sb, bh[i],
> > + first_group + i, desc);
> > + set_buffer_uptodate(bh[i]);
> > + unlock_buffer(bh[i]);
> > + continue;
> > + }
> > + get_bh(bh[i]);
> > + bh[i]->b_end_io = end_buffer_read_sync;
[... snip... ]
> > +
> > + /* set incore so that the buddy information can be
> > + * generated using this
> > + */
> > + incore = data;
> > + }
> > + }
> > + SetPageUptodate(page);
>
> Is the page locked here?
The page is locked via find_or_create_page().
>
> > +out:
> > + if (bh) {
> > + for (i = 0; i < groups_per_page && bh[i]; i++)
> > + brelse(bh[i]);
>
> put_bh()
>
> > + if (bh != &bhs)
> > + kfree(bh);
> > + }
> > + return err;
> > +}
> > +
> >
> > ...
> >
> > +static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
> > +{
> > + __u32 *addr;
> > +
> > + len = cur + len;
> > + while (cur < len) {
> > + if ((cur & 31) == 0 && (len - cur) >= 32) {
> > + /* fast path: clear whole word at once */
>
> s/clear/set/
Fixed
>
> > + addr = bm + (cur >> 3);
> > + *addr = 0xffffffff;
> > + cur += 32;
> > + continue;
> > + }
> > + mb_set_bit_atomic(lock, cur, bm);
> > + cur++;
> > + }
> > +}
> > +
> >
> > ...
> >
> > +static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
> > + ext4_group_t group)
> > +{
> > + struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> > + struct ext4_prealloc_space *pa;
> > + struct list_head *cur;
> > + ext4_group_t groupnr;
> > + ext4_grpblk_t start;
> > + int preallocated = 0;
> > + int count = 0;
> > + int len;
> > +
> > + /* all form of preallocation discards first load group,
> > + * so the only competing code is preallocation use.
> > + * we don't need any locking here
> > + * notice we do NOT ignore preallocations with pa_deleted
> > + * otherwise we could leave used blocks available for
> > + * allocation in buddy when concurrent ext4_mb_put_pa()
> > + * is dropping preallocation
> > + */
> > + list_for_each_rcu(cur, &grp->bb_prealloc_list) {
> > + pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
> > + spin_lock(&pa->pa_lock);
> > + ext4_get_group_no_and_offset(sb, pa->pa_pstart,
> > + &groupnr, &start);
> > + len = pa->pa_len;
> > + spin_unlock(&pa->pa_lock);
> > + if (unlikely(len == 0))
> > + continue;
> > + BUG_ON(groupnr != group);
> > + mb_set_bits(sb_bgl_lock(EXT4_SB(sb), group),
> > + bitmap, start, len);
> > + preallocated += len;
> > + count++;
> > + }
>
> Seems to be missing rcu_read_lock()
>
bb_prealloc_list is actually modified under ext4_group_lock, so it is not
really RCU. I think we should be using list_for_each there.
The RCU-managed lists are i_prealloc_list and lg_prealloc_list.
> > + mb_debug("prellocated %u for group %lu\n", preallocated, group);
> > +}
> > +
> > +static void ext4_mb_pa_callback(struct rcu_head *head)
> > +{
> > + struct ext4_prealloc_space *pa;
> > + pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
> > + kmem_cache_free(ext4_pspace_cachep, pa);
> > +}
> > +#define mb_call_rcu(__pa) call_rcu(&(__pa)->u.pa_rcu, ext4_mb_pa_callback)
>
> Is there any reason why this had to be implemented as a macro?
Fixed
>
> >
> > ...
> >
> > +static int ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
> > +{
> > + struct super_block *sb = ac->ac_sb;
> > + struct ext4_prealloc_space *pa;
> > + struct ext4_group_info *grp;
> > + struct ext4_inode_info *ei;
> > +
> > + /* preallocate only when found space is larger then requested */
> > + BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
> > + BUG_ON(ac->ac_status != AC_STATUS_FOUND);
> > + BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> > +
> > + pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
>
> Do all the GFP_NOFS's in this code really need to be GFP_NOFS?
>
> > + if (pa == NULL)
> > + return -ENOMEM;
> > + ext4_lock_group(sb, ac->ac_b_ex.fe_group);
....
> > + list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
> > + ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
> > +
> > + spin_lock(pa->pa_obj_lock);
> > + list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
> > + spin_unlock(pa->pa_obj_lock);
>
> hm. Strange to see list_add_rcu() inside spinlock like this.
A few lines above we have
pa->pa_obj_lock = &ei->i_prealloc_lock;
so the spin_lock is there to prevent multiple CPUs from adding to the
prealloc list at the same time.
>
> > + return 0;
> > +}
> > +
> >
> > ...
> >
> > +static int ext4_mb_discard_group_preallocations(struct super_block *sb,
> > + ext4_group_t group, int needed)
> > +{
> > + struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> > + struct buffer_head *bitmap_bh = NULL;
> > + struct ext4_prealloc_space *pa, *tmp;
> > + struct list_head list;
> > + struct ext4_buddy e4b;
> > + int err;
> > + int busy = 0;
> > + int free = 0;
> > + /* seems this one can be freed ... */
....
> > + pa->pa_deleted = 1;
> > +
> > + /* we can trust pa_free ... */
> > + free += pa->pa_free;
> > +
> > + spin_unlock(&pa->pa_lock);
> > +
> > + list_del_rcu(&pa->pa_group_list);
> > + list_add(&pa->u.pa_tmp_list, &list);
> > + }
>
> Strange to see rcu operations outside rcu_read_lock().
That doesn't actually need to be list_del_rcu(). As I stated above, that
list is bb_prealloc_list, which is updated under ext4_group_lock.
>
> > + /* if we still need more blocks and some PAs were used, try again */
> > + if (free < needed && busy) {
> > + busy = 0;
> > + ext4_unlock_group(sb, group);
> > + /*
> > + * Yield the CPU here so that we don't get soft lockup
> > + * in non preempt case.
> > + */
> > + yield();
>
> argh, no, yield() is basically unusable. schedule_timeout(1) is preferable.
Actually it is schedule_timeout(HZ); this was a bug fix for a soft
lockup happening when we were running a non-preemptible kernel. We
just want to make sure the high-priority watchdog thread gets a chance
to run, and if there are no higher-priority threads we ourselves would
like to run. My understanding was that yield() is the right choice there.
>
> Please test this code whe there are lots of cpu-intensive tasks running.
>
> > + goto repeat;
> > + }
> > +
> > + /* found anything to free? */
> > + if (list_empty(&list)) {
> > + BUG_ON(free != 0);
> > + goto out;
> > + }
> > +
> > + /* now free all selected PAs */
> > + if (atomic_read(&pa->pa_count)) {
> > + /* this shouldn't happen often - nobody should
.....
> > + * use preallocation while we're discarding it */
> > + spin_unlock(&pa->pa_lock);
> > + spin_unlock(&ei->i_prealloc_lock);
> > + printk(KERN_ERR "uh-oh! used pa while discarding\n");
> > + dump_stack();
>
> WARN_ON(1) would be more conventional.
Fixed
>
> > + current->state = TASK_UNINTERRUPTIBLE;
> > + schedule_timeout(HZ);
>
> schedule_timeout_uninterruptible()
>
Fixed
> > + goto repeat;
> > +
> > + }
> > + if (pa->pa_deleted == 0) {
> > + pa->pa_deleted = 1;
> > + spin_unlock(&pa->pa_lock);
> > + list_del_rcu(&pa->pa_inode_list);
> > + list_add(&pa->u.pa_tmp_list, &list);
> > + continue;
> > + }
> > +
> > + /* someone is deleting pa right now */
> > + spin_unlock(&pa->pa_lock);
> > + spin_unlock(&ei->i_prealloc_lock);
> > +
> > + /* we have to wait here because pa_deleted
> > + * doesn't mean pa is already unlinked from
> > + * the list. as we might be called from
> > + * ->clear_inode() the inode will get freed
> > + * and concurrent thread which is unlinking
> > + * pa from inode's list may access already
> > + * freed memory, bad-bad-bad */
> > +
> > + /* XXX: if this happens too often, we can
> > + * add a flag to force wait only in case
> > + * of ->clear_inode(), but not in case of
> > + * regular truncate */
> > + current->state = TASK_UNINTERRUPTIBLE;
> > + schedule_timeout(HZ);
>
> ditto
>
Fixed
> > + goto repeat;
> > + }
> > + spin_unlock(&ei->i_prealloc_lock);
> > +
> > + list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
> > + BUG_ON(pa->pa_linear != 0);
> > + ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, NULL);
> > +
> > + err = ext4_mb_load_buddy(sb, group, &e4b);
> > + BUG_ON(err != 0); /* error handling here */
> > +
> > + bitmap_bh = read_block_bitmap(sb, group);
> > + if (bitmap_bh == NULL) {
> > + /* error handling here */
> > + ext4_mb_release_desc(&e4b);
> > + BUG_ON(bitmap_bh == NULL);
> > + }
> > +
> > + ext4_lock_group(sb, group);
> > + list_del_rcu(&pa->pa_group_list);
> > + ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
> > + ext4_unlock_group(sb, group);
> > +
> > + ext4_mb_release_desc(&e4b);
> > + brelse(bitmap_bh);
> > +
> > + list_del(&pa->u.pa_tmp_list);
> > + mb_call_rcu(pa);
> > + }
> > +}
>
> Would be nice to ask Paul to review all the rcu usage in here. It looks odd.
>
Will add Paul to the CC
> >
> > ...
> >
> > +#else
> > +#define ext4_mb_show_ac(x)
> > +#endif
>
> static inlined C functions are preferred (+1e6 dittoes)
Fixed
>
> > +/*
> > + * We use locality group preallocation for small size file. The size of the
> > + * file is determined by the current size or the resulting size after
> > + * allocation which ever is larger
> > + *
> > + * One can tune this size via /proc/fs/ext4/<partition>/stream_req
> > + */
> > +static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > + int bsbits = ac->ac_sb->s_blocksize_bits;
> > + loff_t size, isize;
> > +
> > + if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
> > + return;
> > +
> > + size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
> > + isize = i_size_read(ac->ac_inode) >> bsbits;
> > + if (size < isize)
> > + size = isize;
>
> min()?
>
updated as size = max(size, isize);
> > + /* don't use group allocation for large files */
> > + if (size >= sbi->s_mb_stream_request)
> > + return;
> > +
> > + if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
> > + return;
> > +
> > + BUG_ON(ac->ac_lg != NULL);
> > + ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
> > + put_cpu();
>
> Strange-looking code. I'd be interested in a description of the per-cou
> design here.
I added the doc below:
/*
* locality group prealloc space are per cpu. The reason for
* having per cpu locality group is to reduce the contention
* between block request from multiple CPUs.
*/
>
> > + /* we're going to use group allocation */
> > + ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
> > +
> > + /* serialize all allocations in the group */
> > + down(&ac->ac_lg->lg_sem);
>
> This should be a mutex, shouldn't it?
>
converted to mutex
> > +}
> > +
> >
> > ...
> >
> > +static int ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
> > + ext4_group_t group, ext4_grpblk_t block, int count)
> > +{
> > + struct ext4_group_info *db = e4b->bd_info;
> > + struct super_block *sb = e4b->bd_sb;
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > + struct ext4_free_metadata *md;
> > + int i;
> > +
> > + BUG_ON(e4b->bd_bitmap_page == NULL);
> > + BUG_ON(e4b->bd_buddy_page == NULL);
> > +
> > + ext4_lock_group(sb, group);
> > + for (i = 0; i < count; i++) {
> > + md = db->bb_md_cur;
> > + if (md && db->bb_tid != handle->h_transaction->t_tid) {
> > + db->bb_md_cur = NULL;
> > + md = NULL;
> > + }
> > +
> > + if (md == NULL) {
> > + ext4_unlock_group(sb, group);
> > + md = kmalloc(sizeof(*md), GFP_KERNEL);
>
> Why was this one not GFP_NOFS?
>
> > + if (md == NULL)
> > + return -ENOMEM;
>
> Did we just leak some memory?
>
No. At that point md is NULL, so there is nothing to leak; the -ENOMEM
return happens before anything was allocated. The data is allocated to
carry information regarding the freed blocks.
> > + md->num = 0;
> > + md->group = group;
> > +
> > + ext4_lock_group(sb, group);
> > + if (db->bb_md_cur == NULL) {
> > + spin_lock(&sbi->s_md_lock);
> > + list_add(&md->list, &sbi->s_active_transaction);
> > + spin_unlock(&sbi->s_md_lock);
> > + /* protect buddy cache from being freed,
> > + * otherwise we'll refresh it from
> > + * on-disk bitmap and lose not-yet-available
> > + * blocks */
> > + page_cache_get(e4b->bd_buddy_page);
> > + page_cache_get(e4b->bd_bitmap_page);
> > + db->bb_md_cur = md;
> > + db->bb_tid = handle->h_transaction->t_tid;
> > + mb_debug("new md 0x%p for group %lu\n",
> > + md, md->group);
> > + } else {
> > + kfree(md);
> > + md = db->bb_md_cur;
> > + }
> > + }
> > +
> > + BUG_ON(md->num >= EXT4_BB_MAX_BLOCKS);
> > + md->blocks[md->num] = block + i;
> > + md->num++;
> > + if (md->num == EXT4_BB_MAX_BLOCKS) {
> > + /* no more space, put full container on a sb's list */
> > + db->bb_md_cur = NULL;
> > + }
> > + }
> > + ext4_unlock_group(sb, group);
> > + return 0;
> > +}
> > +
> >
> > ...
> >
> > + case Opt_mballoc:
> > + set_opt(sbi->s_mount_opt, MBALLOC);
> > + break;
> > + case Opt_nomballoc:
> > + clear_opt(sbi->s_mount_opt, MBALLOC);
> > + break;
> > + case Opt_stripe:
> > + if (match_int(&args[0], &option))
> > + return 0;
> > + if (option < 0)
> > + return 0;
> > + sbi->s_stripe = option;
> > + break;
>
> These appear to be undocumented.
Updated
>
> > default:
> > printk (KERN_ERR
> > "EXT4-fs: Unrecognized mount option \"%s\" "
> > @@ -1742,6 +1762,33 @@ static ext4_fsblk_t descriptor_loc(struct super_block *sb,
> > return (has_super + ext4_group_first_block_no(sb, bg));
> > }
> >
> > +/**
> > + * ext4_get_stripe_size: Get the stripe size.
> > + * @sbi: In memory super block info
> > + *
> > + * If we have specified it via mount option, then
> > + * use the mount option value. If the value specified at mount time is
> > + * greater than the blocks per group use the super block value.
> > + * If the super block value is greater than blocks per group return 0.
> > + * Allocator needs it be less than blocks per group.
> > + *
> > + */
> > +static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi)
> > +{
> > + unsigned long stride = le16_to_cpu(sbi->s_es->s_raid_stride);
> > + unsigned long stripe_width =
> > + le32_to_cpu(sbi->s_es->s_raid_stripe_width);
> > +
> > + if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group) {
> > + return sbi->s_stripe;
> > + } else if (stripe_width <= sbi->s_blocks_per_group) {
> > + return stripe_width;
> > + } else if (stride <= sbi->s_blocks_per_group) {
> > + return stride;
> > + }
>
> unneeded braces.
I was thinking it is OK these days; checkpatch didn't warn, and I had
multiple else-ifs. I could remove those else-ifs.
>
> > + return 0;
> > +}
> >
> > ...
> >
> > +static inline
> > +struct ext4_group_info *ext4_get_group_info(struct super_block *sb,
> > + ext4_group_t group)
> > +{
> > + struct ext4_group_info ***grp_info;
> > + long indexv, indexh;
> > + grp_info = EXT4_SB(sb)->s_group_info;
> > + indexv = group >> (EXT4_DESC_PER_BLOCK_BITS(sb));
> > + indexh = group & ((EXT4_DESC_PER_BLOCK(sb)) - 1);
> > + return grp_info[indexv][indexh];
> > +}
>
> This should be uninlined.
>
>
>
> Gosh what a lot of code. Is it faster?
Performance numbers with compile bench http://ext4.wiki.kernel.org/index.php/Performance_results
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 23/49] Add buffer head related helper functions
2008-01-24 5:22 ` Aneesh Kumar K.V
@ 2008-01-24 8:53 ` Andrew Morton
0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2008-01-24 8:53 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: tytso, linux-kernel, linux-ext4
> On Thu, 24 Jan 2008 10:52:27 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> + * Returns zero on success and -EIO on error. If the input
> + * buffer is not locked, returns -EINVAL.
> + *
> + */
> +int bh_submit_read(struct buffer_head *bh)
> +{
> + if (!buffer_locked(bh))
> + return -EINVAL;
Is this case just catching a programming bug?
If so, a plain old BUG_ON would be better.
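I.e. something like this (a sketch, keeping the rest of the helper as posted):
	int bh_submit_read(struct buffer_head *bh)
	{
		BUG_ON(!buffer_locked(bh));

		if (buffer_uptodate(bh)) {
			unlock_buffer(bh);
			return 0;
		}

		get_bh(bh);
		bh->b_end_io = end_buffer_read_sync;
		submit_bh(READ, bh);
		wait_on_buffer(bh);
		if (buffer_uptodate(bh))
			return 0;
		return -EIO;
	}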
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 41/49] ext4: Add multi block allocator for ext4
2008-01-24 7:56 ` Aneesh Kumar K.V
@ 2008-01-24 9:04 ` Aneesh Kumar K.V
2008-01-24 14:53 ` Aneesh Kumar K.V
1 sibling, 0 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2008-01-24 9:04 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Ts'o, linux-kernel, alex, adilger, sandeen,
linux-ext4@vger.kernel.org
Updated patch below; waiting for the test results.
I am only attaching the diff, since the full mballoc patch is really large.
-aneesh
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 4f329af..ec7d349 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -89,6 +89,8 @@ When mounting an ext4 filesystem, the following option are accepted:
extents ext4 will use extents to address file data. The
file system will no longer be mountable by ext3.
+noextents ext4 will not use extents for newly created files.
+
journal_checksum Enable checksumming of the journal transactions.
This will allow the recovery code in e2fsck and the
kernel to detect corruption in the kernel. It is a
@@ -206,6 +208,10 @@ nobh (a) cache disk block mapping information
"nobh" option tries to avoid associating buffer
heads (supported only for "writeback" mode).
+mballoc (*) Use the multiblock allocator for block allocation
+nomballoc disables the multiblock allocator for block allocation.
+stripe=n filesystem blocks per stripe for a RAID configuration.
+
Data Mode
---------
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dec9945..4413a2d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -857,6 +857,45 @@ CPUs.
The "procs_blocked" line gives the number of processes currently blocked,
waiting for I/O to complete.
+1.9 Ext4 file system parameters
+------------------------------
+The ext4 file system has one directory per partition under /proc/fs/ext4/
+# ls /proc/fs/ext4/hdc/
+group_prealloc max_to_scan mb_groups mb_history min_to_scan order2_req
+stats stream_req
+
+mb_groups:
+This file gives the details of the multiblock allocator's buddy cache of free blocks
+
+mb_history:
+Multiblock allocation history.
+
+stats:
+This file indicates whether the multiblock allocator should start collecting
+statistics. The statistics are shown during unmount.
+
+group_prealloc:
+The multiblock allocator normalizes the block allocation request to
+group_prealloc filesystem blocks if we don't have a stripe value set.
+The stripe value can be specified at mount time or during mke2fs.
+
+max_to_scan:
+How long the multiblock allocator can search for the best extent (among found extents)
+
+min_to_scan:
+How long the multiblock allocator must search for the best extent
+
+order2_req:
+The multiblock allocator uses a 2^N buddy search only for requests greater
+than or equal to order2_req. The request size is specified in filesystem
+blocks. A value of 2 means the buddy search is used only for requests of
+4 blocks or more.
+
+stream_req:
+Files smaller than stream_req are served by the stream allocator, whose
+purpose is to pack requests as close to each other as possible to
+produce smooth I/O traffic. A value of 16 indicates that files smaller
+than 16 filesystem blocks will use group-based preallocation.
------------------------------------------------------------------------------
Summary
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0398aa0..310bad6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -489,7 +489,7 @@ struct ext4_free_extent {
*/
struct ext4_locality_group {
/* for allocator */
- struct semaphore lg_sem; /* to serialize allocates */
+ struct mutex lg_sem; /* to serialize allocates */
struct list_head lg_prealloc_list;/* list of preallocations */
spinlock_t lg_prealloc_lock;
};
@@ -563,7 +563,10 @@ struct ext4_buddy {
#define EXT4_MB_BUDDY(e4b) ((e4b)->bd_buddy)
#ifndef EXT4_MB_HISTORY
-#define ext4_mb_store_history(ac)
+static inline void ext4_mb_store_history(struct ext4_allocation_context *ac)
+{
+ return;
+}
#else
static void ext4_mb_store_history(struct ext4_allocation_context *ac);
#endif
@@ -641,6 +644,10 @@ static ext4_fsblk_t ext4_grp_offs_to_block(struct super_block *sb,
static inline int mb_test_bit(int bit, void *addr)
{
+ /*
+ * ext4_test_bit on architectures like powerpc
+ * needs an unsigned long aligned address
+ */
mb_correct_addr_and_bit(bit, addr);
return ext4_test_bit(bit, addr);
}
@@ -669,7 +676,7 @@ static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
ext4_clear_bit_atomic(lock, bit, addr);
}
-static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
+static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
{
char *bb;
@@ -752,9 +759,20 @@ static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
}
#else
-#define mb_free_blocks_double(a, b, c, d)
-#define mb_mark_used_double(a, b, c)
-#define mb_cmp_bitmaps(a, b)
+static inline void mb_free_blocks_double(struct inode *inode,
+ struct ext4_buddy *e4b, int first, int count)
+{
+ return;
+}
+static inline void mb_mark_used_double(struct ext4_buddy *e4b,
+ int first, int count)
+{
+ return;
+}
+static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
+{
+ return;
+}
#endif
#ifdef AGGRESSIVE_CHECK
@@ -877,26 +895,6 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
#define mb_check_buddy(e4b)
#endif
-/* find most significant bit */
-static int fmsb(unsigned short word)
-{
- int order;
-
- if (word > 255) {
- order = 7;
- word >>= 8;
- } else {
- order = -1;
- }
-
- do {
- order++;
- word >>= 1;
- } while (word != 0);
-
- return order;
-}
-
/* FIXME!! need more doc */
static void ext4_mb_mark_free_simple(struct super_block *sb,
void *buddy, unsigned first, int len,
@@ -917,7 +915,7 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
max = ffs(first | border) - 1;
/* find how many blocks of power 2 we need to mark */
- min = fmsb(len);
+ min = fls(len);
if (max < min)
min = max;
@@ -1029,10 +1027,9 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
if (groups_per_page > 1) {
err = -ENOMEM;
i = sizeof(struct buffer_head *) * groups_per_page;
- bh = kmalloc(i, GFP_NOFS);
+ bh = kzalloc(i, GFP_NOFS);
if (bh == NULL)
goto out;
- memset(bh, 0, i);
} else
bh = &bhs;
@@ -1055,15 +1052,9 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
if (bh[i] == NULL)
goto out;
- if (buffer_uptodate(bh[i]))
+ if (bh_uptodate_or_lock(bh[i]))
continue;
- lock_buffer(bh[i]);
- if (buffer_uptodate(bh[i])) {
- unlock_buffer(bh[i]);
- continue;
- }
-
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh[i],
first_group + i, desc);
@@ -1302,7 +1293,7 @@ static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
len = cur + len;
while (cur < len) {
if ((cur & 31) == 0 && (len - cur) >= 32) {
- /* fast path: clear whole word at once */
+ /* fast path: set whole word at once */
addr = bm + (cur >> 3);
*addr = 0xffffffff;
cur += 32;
@@ -2675,7 +2666,7 @@ int ext4_mb_init(struct super_block *sb, int needs_recovery)
for (i = 0; i < NR_CPUS; i++) {
struct ext4_locality_group *lg;
lg = &sbi->s_locality_groups[i];
- sema_init(&lg->lg_sem, 1);
+ mutex_init(&lg->lg_sem);
INIT_LIST_HEAD(&lg->lg_prealloc_list);
spin_lock_init(&lg->lg_prealloc_lock);
}
@@ -2687,6 +2678,7 @@ int ext4_mb_init(struct super_block *sb, int needs_recovery)
return 0;
}
+/* needs to be called with the ext4 group lock (ext4_lock_group) held */
static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
{
struct ext4_prealloc_space *pa;
@@ -2695,7 +2687,7 @@ static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
list_for_each_safe(cur, tmp, &grp->bb_prealloc_list) {
pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
- list_del_rcu(&pa->pa_group_list);
+ list_del(&pa->pa_group_list);
count++;
kfree(pa);
}
@@ -3441,6 +3433,7 @@ static int ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
/*
* the function goes through all preallocation in this group and marks them
* used in in-core bitmap. buddy must be generated from this bitmap
+ * Needs to be called with the ext4 group lock (ext4_lock_group) held
*/
static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
ext4_group_t group)
@@ -3462,7 +3455,7 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
* allocation in buddy when concurrent ext4_mb_put_pa()
* is dropping preallocation
*/
- list_for_each_rcu(cur, &grp->bb_prealloc_list) {
+ list_for_each(cur, &grp->bb_prealloc_list) {
pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
spin_lock(&pa->pa_lock);
ext4_get_group_no_and_offset(sb, pa->pa_pstart,
@@ -3486,7 +3479,6 @@ static void ext4_mb_pa_callback(struct rcu_head *head)
pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
kmem_cache_free(ext4_pspace_cachep, pa);
}
-#define mb_call_rcu(__pa) call_rcu(&(__pa)->u.pa_rcu, ext4_mb_pa_callback)
/*
* drops a reference to preallocated space descriptor
@@ -3528,14 +3520,14 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
* against that pair
*/
ext4_lock_group(sb, grp);
- list_del_rcu(&pa->pa_group_list);
+ list_del(&pa->pa_group_list);
ext4_unlock_group(sb, grp);
spin_lock(pa->pa_obj_lock);
list_del_rcu(&pa->pa_inode_list);
spin_unlock(pa->pa_obj_lock);
- mb_call_rcu(pa);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}
/*
@@ -3615,7 +3607,7 @@ static int ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_inode = ac->ac_inode;
ext4_lock_group(sb, ac->ac_b_ex.fe_group);
- list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
+ list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
spin_lock(pa->pa_obj_lock);
@@ -3672,7 +3664,7 @@ static int ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
pa->pa_inode = NULL;
ext4_lock_group(sb, ac->ac_b_ex.fe_group);
- list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
+ list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
spin_lock(pa->pa_obj_lock);
@@ -3853,7 +3845,7 @@ repeat:
spin_unlock(&pa->pa_lock);
- list_del_rcu(&pa->pa_group_list);
+ list_del(&pa->pa_group_list);
list_add(&pa->u.pa_tmp_list, &list);
}
@@ -3889,7 +3881,7 @@ repeat:
ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
list_del(&pa->u.pa_tmp_list);
- mb_call_rcu(pa);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}
out:
@@ -3942,9 +3934,8 @@ repeat:
spin_unlock(&pa->pa_lock);
spin_unlock(&ei->i_prealloc_lock);
printk(KERN_ERR "uh-oh! used pa while discarding\n");
- dump_stack();
- current->state = TASK_UNINTERRUPTIBLE;
- schedule_timeout(HZ);
+ WARN_ON(1);
+ schedule_timeout_uninterruptible(HZ);
goto repeat;
}
@@ -3972,8 +3963,7 @@ repeat:
* add a flag to force wait only in case
* of ->clear_inode(), but not in case of
* regular truncate */
- current->state = TASK_UNINTERRUPTIBLE;
- schedule_timeout(HZ);
+ schedule_timeout_uninterruptible(HZ);
goto repeat;
}
spin_unlock(&ei->i_prealloc_lock);
@@ -3993,7 +3983,7 @@ repeat:
}
ext4_lock_group(sb, group);
- list_del_rcu(&pa->pa_group_list);
+ list_del(&pa->pa_group_list);
ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
ext4_unlock_group(sb, group);
@@ -4001,7 +3991,7 @@ repeat:
brelse(bitmap_bh);
list_del(&pa->u.pa_tmp_list);
- mb_call_rcu(pa);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}
}
@@ -4051,7 +4041,8 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
struct ext4_prealloc_space *pa;
ext4_grpblk_t start;
struct list_head *cur;
- list_for_each_rcu(cur, &grp->bb_prealloc_list) {
+ ext4_lock_group(sb, i);
+ list_for_each(cur, &grp->bb_prealloc_list) {
pa = list_entry(cur, struct ext4_prealloc_space,
pa_group_list);
spin_lock(&pa->pa_lock);
@@ -4061,6 +4052,7 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
printk(KERN_ERR "PA:%lu:%d:%u \n", i,
start, pa->pa_len);
}
+ ext4_unlock_group(sb, i);
if (grp->bb_free == 0)
continue;
@@ -4070,7 +4062,10 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
printk(KERN_ERR "\n");
}
#else
-#define ext4_mb_show_ac(x)
+static inline void ext4_mb_show_ac(struct ext4_allocation_context *ac)
+{
+ return;
+}
#endif
/*
@@ -4091,8 +4086,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
isize = i_size_read(ac->ac_inode) >> bsbits;
- if (size < isize)
- size = isize;
+ size = max(size, isize);
/* don't use group allocation for large files */
if (size >= sbi->s_mb_stream_request)
@@ -4102,6 +4096,11 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
return;
BUG_ON(ac->ac_lg != NULL);
+ /*
+ * locality group prealloc space are per cpu. The reason for having
+ * per cpu locality group is to reduce the contention between block
+ * request from multiple CPUs.
+ */
ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
put_cpu();
@@ -4109,7 +4108,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
/* serialize all allocations in the group */
- down(&ac->ac_lg->lg_sem);
+ mutex_lock(&ac->ac_lg->lg_sem);
}
static int ext4_mb_initialize_context(struct ext4_allocation_context *ac,
@@ -4202,7 +4201,7 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
if (ac->ac_buddy_page)
page_cache_release(ac->ac_buddy_page);
if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
- up(&ac->ac_lg->lg_sem);
+ mutex_unlock(&ac->ac_lg->lg_sem);
ext4_mb_collect_stats(ac);
return 0;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 136d095..3a51ffc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1779,13 +1779,14 @@ static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi)
unsigned long stripe_width =
le32_to_cpu(sbi->s_es->s_raid_stripe_width);
- if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group) {
+ if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group)
return sbi->s_stripe;
- } else if (stripe_width <= sbi->s_blocks_per_group) {
+
+ if (stripe_width <= sbi->s_blocks_per_group)
return stripe_width;
- } else if (stride <= sbi->s_blocks_per_group) {
+
+ if (stride <= sbi->s_blocks_per_group)
return stride;
- }
return 0;
}
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore.
2008-01-23 22:06 ` [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore Andrew Morton
2008-01-24 5:29 ` Aneesh Kumar K.V
@ 2008-01-24 13:00 ` Andy Whitcroft
1 sibling, 0 replies; 23+ messages in thread
From: Andy Whitcroft @ 2008-01-24 13:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Ts'o, linux-kernel, aneesh.kumar,
linux-ext4@vger.kernel.org
On Wed, Jan 23, 2008 at 02:06:59PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:09 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> > +int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> > + unsigned long max_blocks, struct buffer_head *bh,
> > + int create, int extend_disksize)
> > +{
> > + int retval;
> > + if (create) {
> > + down_write((&EXT4_I(inode)->i_data_sem));
> > + } else {
> > + down_read((&EXT4_I(inode)->i_data_sem));
> > + }
> > + if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
> > + retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
> > + bh, create, extend_disksize);
> > + } else {
> > + retval = ext4_get_blocks_handle(handle, inode, block,
> > + max_blocks, bh, create, extend_disksize);
> > + }
> > + if (create) {
> > + up_write((&EXT4_I(inode)->i_data_sem));
> > + } else {
> > + up_read((&EXT4_I(inode)->i_data_sem));
> > + }
>
> This function has many unneeded braces. checkpatch used to detect this
> but it seems to have broken.
This is a side effect of this rule:
This does not apply if one branch of a conditional statement
is a single statement. Use braces in both branches.
Basically each arm is being considered in isolation: each arm is seen
as having a "sibling" arm with braces, so it is permitted to have
braces.
Bugger.
I guess I'll try and see if I can detect this.
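For illustration (not part of the posted patch), the style checkpatch
is meant to enforce here would drop the braces from every arm, since
each arm is a single statement:

	if (create)
		down_write(&EXT4_I(inode)->i_data_sem);
	else
		down_read(&EXT4_I(inode)->i_data_sem);

and likewise for the other two if/else pairs in ext4_get_blocks_wrap().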
> > + return retval;
> > +}
> > static int ext4_get_block(struct inode *inode, sector_t iblock,
> > struct buffer_head *bh_result, int create)
>
> Missing newline.
We could check for those ... will look to add in the next release.
-apw
* Re: [PATCH 41/49] ext4: Add multi block allocator for ext4
2008-01-24 7:56 ` Aneesh Kumar K.V
2008-01-24 9:04 ` Aneesh Kumar K.V
@ 2008-01-24 14:53 ` Aneesh Kumar K.V
1 sibling, 0 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2008-01-24 14:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Ts'o, linux-kernel, alex, adilger, sandeen,
linux-ext4@vger.kernel.org
On Thu, Jan 24, 2008 at 01:26:14PM +0530, Aneesh Kumar K.V wrote:
>
> >
> > > +/* find most significant bit */
> > > +static int fmsb(unsigned short word)
> > > +{
> > > + int order;
> > > +
> > > + if (word > 255) {
> > > + order = 7;
> > > + word >>= 8;
> > > + } else {
> > > + order = -1;
> > > + }
> > > +
> > > + do {
> > > + order++;
> > > + word >>= 1;
> > > + } while (word != 0);
> > > +
> > > + return order;
> > > +}
> >
> > Did we just reinvent fls()?
>
> replaced by fls.
>
> >
That should be fls() - 1;
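For the record, fmsb() numbers the most significant bit from 0 while
fls() numbers it from 1, so the drop-in expression is (illustrative,
not from the patch):

	/* fmsb(1) == 0 == fls(1) - 1, fmsb(256) == 8 == fls(256) - 1 */
	order = fls(word) - 1;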
The full patch is at
http://www.radian.org/~kvaneesh/ext4/jan-24-2008/mballoc-core.patch
The patch is too big to inline.
-aneesh
* Re: [PATCH 33/49] ext4: Add the journal checksum feature
2008-01-23 22:07 ` [PATCH 33/49] ext4: Add the journal checksum feature Andrew Morton
2008-01-23 22:40 ` Andreas Dilger
@ 2008-01-24 21:24 ` Mingming Cao
2008-02-01 20:50 ` Girish Shilamkar
1 sibling, 1 reply; 23+ messages in thread
From: Mingming Cao @ 2008-01-24 21:24 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Ts'o, linux-kernel, girish, adilger, shaggy,
linux-ext4@vger.kernel.org
On Wed, 2008-01-23 at 14:07 -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:12 -0500 "Theodore Ts'o" <tytso@MIT.EDU> wrote:
> > From: Girish Shilamkar <girish@clusterfs.com>
> >
> > The journal checksum feature adds two new flags i.e
> > JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.
> >
> > JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
> > checksum for the blocks described by the descriptor blocks.
> > Due to checksums, writing of the commit record no longer needs to be
> > synchronous. Now commit record can be sent to disk without waiting for
> > descriptor blocks to be written to disk. This behavior is controlled
> > using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
> > able to recover the journal with _ASYNC_COMMIT hence it is made
> > incompat.
> > The commit header has been extended to hold the checksum along with the
> > type of the checksum.
> >
> > For recovery in pass scan checksums are verified to ensure the sanity
> > and completeness(in case of _ASYNC_COMMIT) of every transaction.
> >
> > ...
> >
> > +static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
>
> unneeded inlining.
>
> > +{
> > + struct page *page = bh->b_page;
> > + char *addr;
> > + __u32 checksum;
> > +
> > + addr = kmap_atomic(page, KM_USER0);
> > + checksum = crc32_be(crc32_sum,
> > + (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
> > + kunmap_atomic(addr, KM_USER0);
> > +
> > + return checksum;
> > +}
>
> Can this buffer actually be in highmem?
>
> > static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
> > unsigned long long block)
>
> More unnecessary inlining.
>
> > +/*
> > + * jbd2_journal_clear_features () - Clear a given journal feature in the
> > + * superblock
> > + * @journal: Journal to act on.
> > + * @compat: bitmask of compatible features
> > + * @ro: bitmask of features that force read-only mount
> > + * @incompat: bitmask of incompatible features
> > + *
> > + * Clear a given journal feature as present on the
> > + * superblock. Returns true if the requested features could be reset.
> > + */
> > +int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
> > + unsigned long ro, unsigned long incompat)
> > +{
> > + journal_superblock_t *sb;
> > +
> > + jbd_debug(1, "Clear features 0x%lx/0x%lx/0x%lx\n",
> > + compat, ro, incompat);
> > +
> > + sb = journal->j_superblock;
> > +
> > + sb->s_feature_compat &= ~cpu_to_be32(compat);
> > + sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
> > + sb->s_feature_incompat &= ~cpu_to_be32(incompat);
> > +
> > + return 1;
> > +}
> > +EXPORT_SYMBOL(jbd2_journal_clear_features);
>
> Kernel usually returns 0 on success. So we can return a useful errno on
> failure.
>
> > +/*
> > + * calc_chksums calculates the checksums for the blocks described in the
> > + * descriptor block.
> > + */
> > +static int calc_chksums(journal_t *journal, struct buffer_head *bh,
> > + unsigned long *next_log_block, __u32 *crc32_sum)
> > +{
> > + int i, num_blks, err;
> > + unsigned io_block;
> > + struct buffer_head *obh;
> > +
> > + num_blks = count_tags(journal, bh);
> > + /* Calculate checksum of the descriptor block. */
> > + *crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
> > +
> > + for (i = 0; i < num_blks; i++) {
> > + io_block = (*next_log_block)++;
>
> unsigned <- unsigned long.
>
> Are all the types appropriate in here?
>
> > + wrap(journal, *next_log_block);
> > + err = jread(&obh, journal, io_block);
> > + if (err) {
> > + printk(KERN_ERR "JBD: IO error %d recovering block "
> > + "%u in log\n", err, io_block);
> > + return 1;
> > + } else {
> > + *crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
> > + obh->b_size);
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > static int do_one_pass(journal_t *journal,
> > struct recovery_info *info, enum passtype pass)
> > {
> > @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journal,
> > unsigned int sequence;
> > int blocktype;
> > int tag_bytes = journal_tag_bytes(journal);
> > + __u32 crc32_sum = ~0; /* Transactional Checksums */
> >
> > /* Precompute the maximum metadata descriptors in a descriptor block */
> > int MAX_BLOCKS_PER_DESC;
> > @@ -419,9 +452,23 @@ static int do_one_pass(journal_t *journal,
> > switch(blocktype) {
> > case JBD2_DESCRIPTOR_BLOCK:
> > /* If it is a valid descriptor block, replay it
> > - * in pass REPLAY; otherwise, just skip over the
> > - * blocks it describes. */
> > + * in pass REPLAY; if journal_checksums enabled, then
> > + * calculate checksums in PASS_SCAN, otherwise,
> > + * just skip over the blocks it describes. */
> > if (pass != PASS_REPLAY) {
> > + if (pass == PASS_SCAN &&
> > + JBD2_HAS_COMPAT_FEATURE(journal,
> > + JBD2_FEATURE_COMPAT_CHECKSUM) &&
> > + !info->end_transaction) {
> > + if (calc_chksums(journal, bh,
> > + &next_log_block,
> > + &crc32_sum)) {
>
> put_bh()
>
> > + brelse(bh);
> > + break;
> > + }
> > + brelse(bh);
> > + continue;
>
> put_bh()
>
> > + }
> > next_log_block += count_tags(journal, bh);
> > wrap(journal, next_log_block);
> > brelse(bh);
> > @@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journal,
> > continue;
> >
> > + brelse(bh);
>
> etc
>
Thanks, Updated patch below:
ext4: Add the journal checksum feature
From: Girish Shilamkar <girish@clusterfs.com>
The journal checksum feature adds two new flags,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.
The JBD2_FEATURE_COMPAT_CHECKSUM flag indicates that the commit block
contains the checksum for the blocks described by the descriptor
blocks.
With checksums in place, writing of the commit record no longer needs
to be synchronous: the commit record can be sent to disk without
waiting for the descriptor blocks to be written to disk. This behavior
is controlled by the JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT flag. Older
kernels/e2fsck would not be able to recover a journal written with
_ASYNC_COMMIT, hence it is an incompat feature.
The commit header has been extended to hold the checksum along with the
type of the checksum.
During recovery, in PASS_SCAN, the checksums are verified to ensure the
sanity and completeness (in the case of _ASYNC_COMMIT) of every
transaction.
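For example (illustrative usage; device and mount point are
placeholders):

	mount -o journal_checksum /dev/sdXN /mnt
	mount -o journal_async_commit /dev/sdXN /mnt

Since journal_async_commit selects journal_checksum internally, the
second form enables both.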
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Girish Shilamkar <girish@clusterfs.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
Documentation/filesystems/ext4.txt | 10 +
fs/Kconfig | 1
fs/ext4/super.c | 25 ++++
fs/jbd2/commit.c | 198 +++++++++++++++++++++++++++----------
fs/jbd2/journal.c | 26 ++++
fs/jbd2/recovery.c | 151 ++++++++++++++++++++++++++--
include/linux/ext4_fs.h | 3
include/linux/jbd2.h | 36 +++++-
8 files changed, 388 insertions(+), 62 deletions(-)
Index: linux-2.6.24-rc8/Documentation/filesystems/ext4.txt
===================================================================
--- linux-2.6.24-rc8.orig/Documentation/filesystems/ext4.txt 2008-01-24 11:18:08.000000000 -0800
+++ linux-2.6.24-rc8/Documentation/filesystems/ext4.txt 2008-01-24 13:00:44.000000000 -0800
@@ -89,6 +89,16 @@ When mounting an ext4 filesystem, the fo
extents ext4 will use extents to address file data. The
file system will no longer be mountable by ext3.
+journal_checksum Enable checksumming of the journal transactions.
+ This will allow the recovery code in e2fsck and the
+ kernel to detect corruption in the journal. It is a
+ compatible change and will be ignored by older kernels.
+
+journal_async_commit Commit block can be written to disk without waiting
+ for descriptor blocks. If enabled, older kernels cannot
+ mount the device. This will enable 'journal_checksum'
+ internally.
+
journal=update Update the ext4 file system's journal to the current
format.
Index: linux-2.6.24-rc8/fs/Kconfig
===================================================================
--- linux-2.6.24-rc8.orig/fs/Kconfig 2008-01-24 11:18:08.000000000 -0800
+++ linux-2.6.24-rc8/fs/Kconfig 2008-01-24 11:18:55.000000000 -0800
@@ -236,6 +236,7 @@ config JBD_DEBUG
config JBD2
tristate
+ select CRC32
help
This is a generic journaling layer for block devices that support
both 32-bit and 64-bit block numbers. It is currently used by
Index: linux-2.6.24-rc8/fs/ext4/super.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/ext4/super.c 2008-01-24 11:18:52.000000000 -0800
+++ linux-2.6.24-rc8/fs/ext4/super.c 2008-01-24 13:00:45.000000000 -0800
@@ -869,6 +869,7 @@ enum {
Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
+ Opt_journal_checksum, Opt_journal_async_commit,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
@@ -908,6 +909,8 @@ static match_table_t tokens = {
{Opt_journal_update, "journal=update"},
{Opt_journal_inum, "journal=%u"},
{Opt_journal_dev, "journal_dev=%u"},
+ {Opt_journal_checksum, "journal_checksum"},
+ {Opt_journal_async_commit, "journal_async_commit"},
{Opt_abort, "abort"},
{Opt_data_journal, "data=journal"},
{Opt_data_ordered, "data=ordered"},
@@ -1095,6 +1098,13 @@ static int parse_options (char *options,
return 0;
*journal_devnum = option;
break;
+ case Opt_journal_checksum:
+ set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
+ break;
+ case Opt_journal_async_commit:
+ set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
+ set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
+ break;
case Opt_noload:
set_opt (sbi->s_mount_opt, NOLOAD);
break;
@@ -2114,6 +2124,21 @@ static int ext4_fill_super (struct super
goto failed_mount4;
}
+ if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
+ jbd2_journal_set_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
+ jbd2_journal_set_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
+ jbd2_journal_clear_features(sbi->s_journal, 0, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ } else {
+ jbd2_journal_clear_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ }
+
/* We have now updated the journal if required, so we can
* validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
Index: linux-2.6.24-rc8/fs/jbd2/commit.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/jbd2/commit.c 2008-01-24 11:18:54.000000000 -0800
+++ linux-2.6.24-rc8/fs/jbd2/commit.c 2008-01-24 13:02:43.000000000 -0800
@@ -21,6 +21,7 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/jiffies.h>
+#include <linux/crc32.h>
/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -93,19 +94,23 @@ static int inverted_lock(journal_t *jour
return 1;
}
-/* Done it all: now write the commit record. We should have
+/*
+ * Done it all: now submit the commit record. We should have
* cleaned up our previous buffers by now, so if we are in abort
* mode we can now just skip the rest of the journal write
* entirely.
*
* Returns 1 if the journal needs to be aborted or 0 on success
*/
-static int journal_write_commit_record(journal_t *journal,
- transaction_t *commit_transaction)
+static int journal_submit_commit_record(journal_t *journal,
+ transaction_t *commit_transaction,
+ struct buffer_head **cbh,
+ __u32 crc32_sum)
{
struct journal_head *descriptor;
+ struct commit_header *tmp;
struct buffer_head *bh;
- int i, ret;
+ int ret;
int barrier_done = 0;
if (is_journal_aborted(journal))
@@ -117,21 +122,33 @@ static int journal_write_commit_record(j
bh = jh2bh(descriptor);
- /* AKPM: buglet - add `i' to tmp! */
- for (i = 0; i < bh->b_size; i += 512) {
- journal_header_t *tmp = (journal_header_t*)bh->b_data;
- tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
- tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
- tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+ tmp = (struct commit_header *)bh->b_data;
+ tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
+ tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
+ tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+
+ if (JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ tmp->h_chksum_type = JBD2_CRC32_CHKSUM;
+ tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE;
+ tmp->h_chksum[0] = cpu_to_be32(crc32_sum);
}
- JBUFFER_TRACE(descriptor, "write commit block");
+ JBUFFER_TRACE(descriptor, "submit commit block");
+ lock_buffer(bh);
+
set_buffer_dirty(bh);
- if (journal->j_flags & JBD2_BARRIER) {
+ set_buffer_uptodate(bh);
+ bh->b_end_io = journal_end_buffer_io_sync;
+
+ if (journal->j_flags & JBD2_BARRIER &&
+ !JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
set_buffer_ordered(bh);
barrier_done = 1;
}
- ret = sync_dirty_buffer(bh);
+ ret = submit_bh(WRITE, bh);
+
/* is it possible for another commit to fail at roughly
* the same time as this one? If so, we don't want to
* trust the barrier flag in the super, but instead want
@@ -152,14 +169,72 @@ static int journal_write_commit_record(j
clear_buffer_ordered(bh);
set_buffer_uptodate(bh);
set_buffer_dirty(bh);
- ret = sync_dirty_buffer(bh);
+ ret = submit_bh(WRITE, bh);
}
- put_bh(bh); /* One for getblk() */
- jbd2_journal_put_journal_head(descriptor);
+ *cbh = bh;
+ return ret;
+}
+
+/*
+ * This function along with journal_submit_commit_record
+ * allows to write the commit record asynchronously.
+ */
+static int journal_wait_on_commit_record(struct buffer_head *bh)
+{
+ int ret = 0;
+
+ clear_buffer_dirty(bh);
+ wait_on_buffer(bh);
+
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+ put_bh(bh); /* One for getblk() */
+ jbd2_journal_put_journal_head(bh2jh(bh));
- return (ret == -EIO);
+ return ret;
}
+/*
+ * Wait for all submitted IO to complete.
+ */
+static int journal_wait_on_locked_list(journal_t *journal,
+ transaction_t *commit_transaction)
+{
+ int ret = 0;
+ struct journal_head *jh;
+
+ while (commit_transaction->t_locked_list) {
+ struct buffer_head *bh;
+
+ jh = commit_transaction->t_locked_list->b_tprev;
+ bh = jh2bh(jh);
+ get_bh(bh);
+ if (buffer_locked(bh)) {
+ spin_unlock(&journal->j_list_lock);
+ wait_on_buffer(bh);
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+ spin_lock(&journal->j_list_lock);
+ }
+ if (!inverted_lock(journal, bh)) {
+ put_bh(bh);
+ spin_lock(&journal->j_list_lock);
+ continue;
+ }
+ if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+ __jbd2_journal_unfile_buffer(jh);
+ jbd_unlock_bh_state(bh);
+ jbd2_journal_remove_journal_head(bh);
+ put_bh(bh);
+ } else {
+ jbd_unlock_bh_state(bh);
+ }
+ put_bh(bh);
+ cond_resched_lock(&journal->j_list_lock);
+ }
+ return ret;
+ }
+
static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
{
int i;
@@ -275,7 +350,21 @@ write_out_data:
journal_do_submit_data(wbuf, bufs);
}
-static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
+static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
+{
+ struct page *page = bh->b_page;
+ char *addr;
+ __u32 checksum;
+
+ addr = kmap_atomic(page, KM_USER0);
+ checksum = crc32_be(crc32_sum,
+ (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
+ kunmap_atomic(addr, KM_USER0);
+
+ return checksum;
+}
+
+static void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
unsigned long long block)
{
tag->t_blocknr = cpu_to_be32(block & (u32)~0);
@@ -307,6 +396,8 @@ void jbd2_journal_commit_transaction(jou
int tag_flag;
int i;
int tag_bytes = journal_tag_bytes(journal);
+ struct buffer_head *cbh = NULL; /* For transactional checksums */
+ __u32 crc32_sum = ~0;
/*
* First job: lock down the current transaction and wait for
@@ -451,38 +542,15 @@ void jbd2_journal_commit_transaction(jou
journal_submit_data_buffers(journal, commit_transaction);
/*
- * Wait for all previously submitted IO to complete.
+ * Wait for all previously submitted IO to complete if commit
+ * record is to be written synchronously.
*/
spin_lock(&journal->j_list_lock);
- while (commit_transaction->t_locked_list) {
- struct buffer_head *bh;
+ if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
+ err = journal_wait_on_locked_list(journal,
+ commit_transaction);
- jh = commit_transaction->t_locked_list->b_tprev;
- bh = jh2bh(jh);
- get_bh(bh);
- if (buffer_locked(bh)) {
- spin_unlock(&journal->j_list_lock);
- wait_on_buffer(bh);
- if (unlikely(!buffer_uptodate(bh)))
- err = -EIO;
- spin_lock(&journal->j_list_lock);
- }
- if (!inverted_lock(journal, bh)) {
- put_bh(bh);
- spin_lock(&journal->j_list_lock);
- continue;
- }
- if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
- __jbd2_journal_unfile_buffer(jh);
- jbd_unlock_bh_state(bh);
- jbd2_journal_remove_journal_head(bh);
- put_bh(bh);
- } else {
- jbd_unlock_bh_state(bh);
- }
- put_bh(bh);
- cond_resched_lock(&journal->j_list_lock);
- }
spin_unlock(&journal->j_list_lock);
if (err)
@@ -656,6 +724,15 @@ void jbd2_journal_commit_transaction(jou
start_journal_io:
for (i = 0; i < bufs; i++) {
struct buffer_head *bh = wbuf[i];
+ /*
+ * Compute checksum.
+ */
+ if (JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ crc32_sum =
+ jbd2_checksum_data(crc32_sum, bh);
+ }
+
lock_buffer(bh);
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
@@ -672,6 +749,23 @@ start_journal_io:
}
}
+ /* Done it all: now write the commit record asynchronously. */
+
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ err = journal_submit_commit_record(journal, commit_transaction,
+ &cbh, crc32_sum);
+ if (err)
+ __jbd2_journal_abort_hard(journal);
+
+ spin_lock(&journal->j_list_lock);
+ err = journal_wait_on_locked_list(journal,
+ commit_transaction);
+ spin_unlock(&journal->j_list_lock);
+ if (err)
+ __jbd2_journal_abort_hard(journal);
+ }
+
/* Lo and behold: we have just managed to send a transaction to
the log. Before we can commit it, wait for the IO so far to
complete. Control buffers being written are on the
@@ -771,8 +865,14 @@ wait_for_iobuf:
jbd_debug(3, "JBD: commit phase 6\n");
- if (journal_write_commit_record(journal, commit_transaction))
- err = -EIO;
+ if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ err = journal_submit_commit_record(journal, commit_transaction,
+ &cbh, crc32_sum);
+ if (err)
+ __jbd2_journal_abort_hard(journal);
+ }
+ err = journal_wait_on_commit_record(cbh);
if (err)
jbd2_journal_abort(journal, err);
Index: linux-2.6.24-rc8/fs/jbd2/journal.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/jbd2/journal.c 2008-01-24 11:18:54.000000000 -0800
+++ linux-2.6.24-rc8/fs/jbd2/journal.c 2008-01-24 13:06:53.000000000 -0800
@@ -1578,6 +1578,32 @@ int jbd2_journal_set_features (journal_t
return 1;
}
+/*
+ * jbd2_journal_clear_features () - Clear a given journal feature in the
+ * superblock
+ * @journal: Journal to act on.
+ * @compat: bitmask of compatible features
+ * @ro: bitmask of features that force read-only mount
+ * @incompat: bitmask of incompatible features
+ *
+ * Clear a given journal feature as present on the
+ * superblock.
+ */
+int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
+ unsigned long ro, unsigned long incompat)
+{
+ journal_superblock_t *sb;
+
+ jbd_debug(1, "Clear features 0x%lx/0x%lx/0x%lx\n",
+ compat, ro, incompat);
+
+ sb = journal->j_superblock;
+
+ sb->s_feature_compat &= ~cpu_to_be32(compat);
+ sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
+ sb->s_feature_incompat &= ~cpu_to_be32(incompat);
+ return 0;
+}
+EXPORT_SYMBOL(jbd2_journal_clear_features);
/**
* int jbd2_journal_update_format () - Update on-disk journal structure.
Index: linux-2.6.24-rc8/fs/jbd2/recovery.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/jbd2/recovery.c 2008-01-24 11:18:08.000000000 -0800
+++ linux-2.6.24-rc8/fs/jbd2/recovery.c 2008-01-24 13:16:07.000000000 -0800
@@ -21,6 +21,7 @@
#include <linux/jbd2.h>
#include <linux/errno.h>
#include <linux/slab.h>
+#include <linux/crc32.h>
#endif
/*
@@ -316,6 +317,37 @@ static inline unsigned long long read_ta
return block;
}
+/*
+ * calc_chksums calculates the checksums for the blocks described in the
+ * descriptor block.
+ */
+static int calc_chksums(journal_t *journal, struct buffer_head *bh,
+ unsigned long *next_log_block, __u32 *crc32_sum)
+{
+ int i, num_blks, err;
+ unsigned long io_block;
+ struct buffer_head *obh;
+
+ num_blks = count_tags(journal, bh);
+ /* Calculate checksum of the descriptor block. */
+ *crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
+
+ for (i = 0; i < num_blks; i++) {
+ io_block = (*next_log_block)++;
+ wrap(journal, *next_log_block);
+ err = jread(&obh, journal, io_block);
+ if (err) {
+ printk(KERN_ERR "JBD: IO error %d recovering block "
+ "%lu in log\n", err, io_block);
+ return 1;
+ } else {
+ *crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
+ obh->b_size);
+ }
+ }
+ return 0;
+}
+
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
@@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journa
unsigned int sequence;
int blocktype;
int tag_bytes = journal_tag_bytes(journal);
+ __u32 crc32_sum = ~0; /* Transactional Checksums */
/* Precompute the maximum metadata descriptors in a descriptor block */
int MAX_BLOCKS_PER_DESC;
@@ -419,12 +452,26 @@ static int do_one_pass(journal_t *journa
switch(blocktype) {
case JBD2_DESCRIPTOR_BLOCK:
/* If it is a valid descriptor block, replay it
- * in pass REPLAY; otherwise, just skip over the
- * blocks it describes. */
+ * in pass REPLAY; if journal_checksums enabled, then
+ * calculate checksums in PASS_SCAN, otherwise,
+ * just skip over the blocks it describes. */
if (pass != PASS_REPLAY) {
+ if (pass == PASS_SCAN &&
+ JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM) &&
+ !info->end_transaction) {
+ if (calc_chksums(journal, bh,
+ &next_log_block,
+ &crc32_sum)) {
+ put_bh(bh);
+ break;
+ }
+ put_bh(bh);
+ continue;
+ }
next_log_block += count_tags(journal, bh);
wrap(journal, next_log_block);
- brelse(bh);
+ put_bh(bh);
continue;
}
@@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journa
continue;
case JBD2_COMMIT_BLOCK:
- /* Found an expected commit block: not much to
- * do other than move on to the next sequence
+ /* How to differentiate between interrupted commit
+ * and journal corruption?
+ *
+ * {nth transaction}
+ * Checksum Verification Failed
+ * |
+ * ____________________
+ * | |
+ * async_commit sync_commit
+ * | |
+ * | GO TO NEXT "Journal Corruption"
+ * | TRANSACTION
+ * |
+ * {(n+1)th transaction}
+ * |
+ * _______|______________
+ * | |
+ * Commit block found Commit block not found
+ * | |
+ * "Journal Corruption" |
+ * _____________|_________
+ * | |
+ * nth trans corrupt OR nth trans
+ * and (n+1)th interrupted interrupted
+ * before commit block
+ * could reach the disk.
+ * (Cannot find the difference in above
+ * mentioned conditions. Hence assume
+ * "Interrupted Commit".)
+ */
+
+ /* Found an expected commit block: if checksums
+ * are present verify them in PASS_SCAN; else not
+ * much to do other than move on to the next sequence
* number. */
+ if (pass == PASS_SCAN &&
+ JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ int chksum_err, chksum_seen;
+ struct commit_header *cbh =
+ (struct commit_header *)bh->b_data;
+ unsigned found_chksum =
+ be32_to_cpu(cbh->h_chksum[0]);
+
+ chksum_err = chksum_seen = 0;
+
+ if (info->end_transaction) {
+ printk(KERN_ERR "JBD: Transaction %u "
+ "found to be corrupt.\n",
+ next_commit_ID - 1);
+ brelse(bh);
+ break;
+ }
+
+ if (crc32_sum == found_chksum &&
+ cbh->h_chksum_type == JBD2_CRC32_CHKSUM &&
+ cbh->h_chksum_size ==
+ JBD2_CRC32_CHKSUM_SIZE)
+ chksum_seen = 1;
+ else if (!(cbh->h_chksum_type == 0 &&
+ cbh->h_chksum_size == 0 &&
+ found_chksum == 0 &&
+ !chksum_seen))
+ /*
+ * If fs is mounted using an old kernel and then
+ * kernel with journal_chksum is used then we
+ * get a situation where the journal flag has
+ * checksum flag set but checksums are not
+ * present i.e chksum = 0, in the individual
+ * commit blocks.
+ * Hence to avoid checksum failures, in this
+ * situation, this extra check is added.
+ */
+ chksum_err = 1;
+
+ if (chksum_err) {
+ info->end_transaction = next_commit_ID;
+
+ if (!JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)){
+ printk(KERN_ERR
+ "JBD: Transaction %u "
+ "found to be corrupt.\n",
+ next_commit_ID);
+ brelse(bh);
+ break;
+ }
+ }
+ crc32_sum = ~0;
+ }
brelse(bh);
next_commit_ID++;
continue;
@@ -554,9 +688,10 @@ static int do_one_pass(journal_t *journa
* transaction marks the end of the valid log.
*/
- if (pass == PASS_SCAN)
- info->end_transaction = next_commit_ID;
- else {
+ if (pass == PASS_SCAN) {
+ if (!info->end_transaction)
+ info->end_transaction = next_commit_ID;
+ } else {
/* It's really bad news if different passes end up at
* different places (but possible due to IO errors). */
if (info->end_transaction != next_commit_ID) {
Index: linux-2.6.24-rc8/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.24-rc8.orig/include/linux/ext4_fs.h 2008-01-24 11:18:52.000000000 -0800
+++ linux-2.6.24-rc8/include/linux/ext4_fs.h 2008-01-24 13:00:45.000000000 -0800
@@ -467,7 +467,8 @@ do { \
#define EXT4_MOUNT_USRQUOTA 0x100000 /* "old" user quota */
#define EXT4_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
#define EXT4_MOUNT_EXTENTS 0x400000 /* Extents support */
-
+#define EXT4_MOUNT_JOURNAL_CHECKSUM 0x800000 /* Journal checksums */
+#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
/* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
#define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
Index: linux-2.6.24-rc8/include/linux/jbd2.h
===================================================================
--- linux-2.6.24-rc8.orig/include/linux/jbd2.h 2008-01-24 11:18:54.000000000 -0800
+++ linux-2.6.24-rc8/include/linux/jbd2.h 2008-01-24 11:46:16.000000000 -0800
@@ -149,6 +149,28 @@ typedef struct journal_header_s
__be32 h_sequence;
} journal_header_t;
+/*
+ * Checksum types.
+ */
+#define JBD2_CRC32_CHKSUM 1
+#define JBD2_MD5_CHKSUM 2
+#define JBD2_SHA1_CHKSUM 3
+
+#define JBD2_CRC32_CHKSUM_SIZE 4
+
+#define JBD2_CHECKSUM_BYTES (32 / sizeof(u32))
+/*
+ * Commit block header for storing transactional checksums:
+ */
+struct commit_header {
+ __be32 h_magic;
+ __be32 h_blocktype;
+ __be32 h_sequence;
+ unsigned char h_chksum_type;
+ unsigned char h_chksum_size;
+ unsigned char h_padding[2];
+ __be32 h_chksum[JBD2_CHECKSUM_BYTES];
+};
/*
* The block tag: used to describe a single buffer in the journal.
@@ -242,14 +264,18 @@ typedef struct journal_superblock_s
((j)->j_format_version >= 2 && \
((j)->j_superblock->s_feature_incompat & cpu_to_be32((mask))))
-#define JBD2_FEATURE_INCOMPAT_REVOKE 0x00000001
-#define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002
+#define JBD2_FEATURE_COMPAT_CHECKSUM 0x00000001
+
+#define JBD2_FEATURE_INCOMPAT_REVOKE 0x00000001
+#define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002
+#define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004
/* Features known to this kernel version: */
-#define JBD2_KNOWN_COMPAT_FEATURES 0
+#define JBD2_KNOWN_COMPAT_FEATURES JBD2_FEATURE_COMPAT_CHECKSUM
#define JBD2_KNOWN_ROCOMPAT_FEATURES 0
#define JBD2_KNOWN_INCOMPAT_FEATURES (JBD2_FEATURE_INCOMPAT_REVOKE | \
- JBD2_FEATURE_INCOMPAT_64BIT)
+ JBD2_FEATURE_INCOMPAT_64BIT | \
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
#ifdef __KERNEL__
@@ -997,6 +1023,8 @@ extern int jbd2_journal_check_availab
(journal_t *, unsigned long, unsigned long, unsigned long);
extern int jbd2_journal_set_features
(journal_t *, unsigned long, unsigned long, unsigned long);
+extern int jbd2_journal_clear_features
+ (journal_t *, unsigned long, unsigned long, unsigned long);
extern int jbd2_journal_create (journal_t *);
extern int jbd2_journal_load (journal_t *journal);
extern void jbd2_journal_destroy (journal_t *);
* Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl
2008-01-24 5:55 ` Aneesh Kumar K.V
@ 2008-01-26 4:15 ` Theodore Tso
2008-01-26 8:42 ` Aneesh Kumar K.V
0 siblings, 1 reply; 23+ messages in thread
From: Theodore Tso @ 2008-01-26 4:15 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: Andrew Morton, linux-kernel, linux-ext4@vger.kernel.org
On Thu, Jan 24, 2008 at 11:25:32AM +0530, Aneesh Kumar K.V wrote:
> +static int free_ext_idx(handle_t *handle, struct inode *inode,
> + struct ext4_extent_idx *ix)
> +{
> + int i, retval = 0;
> + ext4_fsblk_t block;
> + struct buffer_head *bh;
> + struct ext4_extent_header *eh;
> +
> + block = idx_pblock(ix);
> + bh = sb_bread(inode->i_sb, block);
> + if (!bh)
> + return -EIO;
> +
> + eh = (struct ext4_extent_header *)bh->b_data;
> + if (eh->eh_depth == 0) {
> + brelse(bh);
> + ext4_free_blocks(handle, inode, block, 1);
> + } else {
> + ix = EXT_FIRST_INDEX(eh);
> + for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
> + retval = free_ext_idx(handle, inode, ix);
> + if (retval)
> + return retval;
> + }
> + }
> + return retval;
> +}
Aneesh, it looks like if eh->eh_depth != 0, bh gets leaked. This is
how I plan to fix it up:
+static int free_ext_idx(handle_t *handle, struct inode *inode,
+ struct ext4_extent_idx *ix)
+{
+ int i, retval = 0;
+ ext4_fsblk_t block;
+ struct buffer_head *bh;
+ struct ext4_extent_header *eh;
+
+ block = idx_pblock(ix);
+ bh = sb_bread(inode->i_sb, block);
+ if (!bh)
+ return -EIO;
+
+ eh = (struct ext4_extent_header *)bh->b_data;
+ if (eh->eh_depth == 0)
+ ext4_free_blocks(handle, inode, block, 1);
+ else {
+ ix = EXT_FIRST_INDEX(eh);
+ for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+ retval = free_ext_idx(handle, inode, ix);
+ if (retval)
+ break;
+ }
+ }
+ put_bh(bh);
+ return retval;
+}
- Ted
* Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl
2008-01-26 4:15 ` Theodore Tso
@ 2008-01-26 8:42 ` Aneesh Kumar K.V
0 siblings, 0 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2008-01-26 8:42 UTC (permalink / raw)
To: Theodore Tso, Andrew Morton, linux-kernel,
linux-ext4@vger.kernel.org
On Fri, Jan 25, 2008 at 11:15:00PM -0500, Theodore Tso wrote:
> On Thu, Jan 24, 2008 at 11:25:32AM +0530, Aneesh Kumar K.V wrote:
> > +static int free_ext_idx(handle_t *handle, struct inode *inode,
> > + struct ext4_extent_idx *ix)
> > +{
> > + int i, retval = 0;
> > + ext4_fsblk_t block;
> > + struct buffer_head *bh;
> > + struct ext4_extent_header *eh;
> > +
> > + block = idx_pblock(ix);
> > + bh = sb_bread(inode->i_sb, block);
> > + if (!bh)
> > + return -EIO;
> > +
> > + eh = (struct ext4_extent_header *)bh->b_data;
> > + if (eh->eh_depth == 0) {
> > + brelse(bh);
> > + ext4_free_blocks(handle, inode, block, 1);
> > + } else {
> > + ix = EXT_FIRST_INDEX(eh);
> > + for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
> > + retval = free_ext_idx(handle, inode, ix);
> > + if (retval)
> > + return retval;
> > + }
> > + }
> > + return retval;
> > +}
>
> Aneesh, looks like if eh->eh_depth is != 0, bh gets leaked. This is
> how I plan to fix it up:
>
> +static int free_ext_idx(handle_t *handle, struct inode *inode,
> + struct ext4_extent_idx *ix)
> +{
> + int i, retval = 0;
> + ext4_fsblk_t block;
> + struct buffer_head *bh;
> + struct ext4_extent_header *eh;
> +
> + block = idx_pblock(ix);
> + bh = sb_bread(inode->i_sb, block);
> + if (!bh)
> + return -EIO;
> +
> + eh = (struct ext4_extent_header *)bh->b_data;
> + if (eh->eh_depth == 0)
> + ext4_free_blocks(handle, inode, block, 1);
> + else {
> + ix = EXT_FIRST_INDEX(eh);
> + for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
> + retval = free_ext_idx(handle, inode, ix);
> + if (retval)
> + break;
> + }
> + }
> + put_bh(bh);
We still need to mark the index block itself as free, via
ext4_free_blocks(handle, inode, block, 1). I remember making this
change. Maybe it was related to dind/tind blocks.
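I.e. something like this on top of your version (untested sketch) -
recurse first, then free the index block itself regardless of depth:

	eh = (struct ext4_extent_header *)bh->b_data;
	if (eh->eh_depth != 0) {
		ix = EXT_FIRST_INDEX(eh);
		for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
			retval = free_ext_idx(handle, inode, ix);
			if (retval)
				break;
		}
	}
	put_bh(bh);
	/* free this index block itself, whatever its depth */
	ext4_free_blocks(handle, inode, block, 1);
	return retval;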
-aneesh
* Re: [PATCH 24/49] ext4: add block bitmap validation
2008-01-23 22:06 ` [PATCH 24/49] ext4: add block bitmap validation Andrew Morton
@ 2008-01-26 13:26 ` Theodore Tso
0 siblings, 0 replies; 23+ messages in thread
From: Theodore Tso @ 2008-01-26 13:26 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, aneesh.kumar, linux-ext4@vger.kernel.org
On Wed, Jan 23, 2008 at 02:06:54PM -0800, Andrew Morton wrote:
> brelse() should only be used when the bh might be NULL - put_bh()
> can be used here.
>
> Please review all ext4/jbd2 code for this trivial speedup.
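(For reference, brelse() is essentially a NULL-tolerant wrapper around
put_bh(), roughly:

	static inline void brelse(struct buffer_head *bh)
	{
		if (bh)
			__brelse(bh);	/* put_bh() plus a refcount sanity check */
	}

so when the bh is known to be non-NULL the NULL check is pure
overhead.)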
I've reviewed all of the pending patches in the stable queue for this
speedup, and applied them where necessary; it was useful, since I
detected a buffer head leak in one of the patches while I was at it.
The ext4/jbd2 code as a whole still needs to be reviewed for this
speedup, but I don't want to fix this in the initial stable push, lest
I break something by accident. I'll put it in the "TO DO" queue.
Regards,
- Ted
* Re: [PATCH 33/49] ext4: Add the journal checksum feature
2008-01-24 21:24 ` Mingming Cao
@ 2008-02-01 20:50 ` Girish Shilamkar
0 siblings, 0 replies; 23+ messages in thread
From: Girish Shilamkar @ 2008-02-01 20:50 UTC (permalink / raw)
To: cmm
Cc: Andrew Morton, Theodore Ts'o, linux-kernel, adilger, shaggy,
linux-ext4@vger.kernel.org
Hi,
On Thu, 2008-01-24 at 13:24 -0800, Mingming Cao wrote:
> -static int journal_write_commit_record(journal_t *journal,
> - transaction_t *commit_transaction)
> +static int journal_submit_commit_record(journal_t *journal,
> + transaction_t *commit_transaction,
> + struct buffer_head **cbh,
> + __u32 crc32_sum)
> {
> struct journal_head *descriptor;
> + struct commit_header *tmp;
> struct buffer_head *bh;
> - int i, ret;
> + int ret;
> int barrier_done = 0;
>
> if (is_journal_aborted(journal))
> @@ -117,21 +122,33 @@ static int journal_write_commit_record(j
>
> bh = jh2bh(descriptor);
>
> - /* AKPM: buglet - add `i' to tmp! */
> - for (i = 0; i < bh->b_size; i += 512) {
> - journal_header_t *tmp = (journal_header_t*)bh->b_data;
> - tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
> - tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
> - tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
> + tmp = (struct commit_header *)bh->b_data;
> + tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
> + tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
> + tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
> +
> + if (JBD2_HAS_COMPAT_FEATURE(journal,
> + JBD2_FEATURE_COMPAT_CHECKSUM)) {
> + tmp->h_chksum_type = JBD2_CRC32_CHKSUM;
> + tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE;
> + tmp->h_chksum[0] = cpu_to_be32(crc32_sum);
> }
>
> - JBUFFER_TRACE(descriptor, "write commit block");
> + JBUFFER_TRACE(descriptor, "submit commit block");
> + lock_buffer(bh);
> +
get_bh() is missing here.
The bh refcount is decremented in journal_wait_on_commit_record(), but
it is never incremented in journal_submit_commit_record().
Thanks to Johann Lombardi for pointing this out.
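Something like this should fix it (sketch; the extra reference pairs
with the put_bh() in journal_wait_on_commit_record()):

	JBUFFER_TRACE(descriptor, "submit commit block");
	lock_buffer(bh);
	get_bh(bh);	/* ref for the async IO, dropped after wait_on_buffer() */
	set_buffer_dirty(bh);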
Comments?
> set_buffer_dirty(bh);
> - if (journal->j_flags & JBD2_BARRIER) {
> + set_buffer_uptodate(bh);
> + bh->b_end_io = journal_end_buffer_io_sync;
> +
> + if (journal->j_flags & JBD2_BARRIER &&
> + !JBD2_HAS_COMPAT_FEATURE(journal,
> + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
> set_buffer_ordered(bh);
> barrier_done = 1;
> }
> - ret = sync_dirty_buffer(bh);
> + ret = submit_bh(WRITE, bh);
> +
> /* is it possible for another commit to fail at roughly
> * the same time as this one? If so, we don't want to
> * trust the barrier flag in the super, but instead want
> @@ -152,14 +169,72 @@ static int journal_write_commit_record(j
> clear_buffer_ordered(bh);
> set_buffer_uptodate(bh);
> set_buffer_dirty(bh);
> - ret = sync_dirty_buffer(bh);
> + ret = submit_bh(WRITE, bh);
> }
> - put_bh(bh); /* One for getblk() */
> - jbd2_journal_put_journal_head(descriptor);
> + *cbh = bh;
> + return ret;
> +}
> +
> +/*
> + * This function along with journal_submit_commit_record
> + * allows to write the commit record asynchronously.
> + */
> +static int journal_wait_on_commit_record(struct buffer_head *bh)
> +{
> + int ret = 0;
> +
> + clear_buffer_dirty(bh);
> + wait_on_buffer(bh);
> +
> + if (unlikely(!buffer_uptodate(bh)))
> + ret = -EIO;
> + put_bh(bh); /* One for getblk() */
> + jbd2_journal_put_journal_head(bh2jh(bh));
>
> - return (ret == -EIO);
> + return ret;
> }
>
-Girish