[PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

public inbox for linux-kernel@vger.kernel.org

From: Takashi Sato @ 2006-03-15 12:39 UTC
  To: ext2-devel, linux-kernel

Hi,

As disk sizes keep growing, some storage systems now offer
multi-terabyte capacities.  However, ext2/3 cannot support a
filesystem larger than 8TB with a 4K block size, so I think the
maximum filesystem size of ext2/3 should be extended.

I'd like to extend the maximum filesystem size of ext2/3 from 8TB to
16TB by raising the maximum number of blocks from 2G-1 (2^31-1) to
4G-1 (2^32-1), as described below.
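
The arithmetic behind the 8TB and 16TB figures can be checked with a
small sketch.  It is not part of the patch; max_fs_bytes() and its
parameters are hypothetical names used only for illustration.

```c
#include <stdint.h>

/*
 * Largest filesystem size in bytes for a given block-count limit.
 * Illustrative helper only; neither the function nor its parameter
 * names come from the kernel or e2fsprogs.
 */
static uint64_t max_fs_bytes(uint64_t max_blocks, uint32_t block_size)
{
	/* Do the multiplication in 64 bits so 2^32-1 blocks cannot overflow. */
	return max_blocks * (uint64_t)block_size;
}
```

With a 4096-byte block, 2^31-1 blocks come to one block short of 8TB
(2^43 bytes) and 2^32-1 blocks come to one block short of 16TB
(2^44 bytes).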

The number of blocks is currently limited to 2G-1 (2^31-1) on ext2/3
for the following reasons.

- The number of blocks is handled as a signed 4-byte variable in some
  of the ext2/3 kernel code.

- Some functions related to bit manipulation, such as ext2fs_set_bit()
  and ext2fs_test_bit(), use assembler instructions that can't handle
  more than 2GB.  These functions are called by mke2fs on the x86 and
  mc68000 architectures.

- Block numbers and inode numbers are printed with the signed format
  strings (%d, %ld) in many places in both the kernel and the commands.
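
The first and third problems can be illustrated with a short sketch,
assuming a typical two's-complement compiler.  It is not taken from the
patch; format_signed() and format_unsigned() are hypothetical helpers.

```c
#include <stdio.h>
#include <stdint.h>

/*
 * Format a block number the old way: stored in a signed 32-bit
 * variable and printed with %d.  Converting a value above INT32_MAX
 * to int32_t is implementation-defined; common compilers wrap it to
 * a negative number.
 */
static int format_signed(char *buf, size_t len, uint32_t block)
{
	int32_t as_signed = (int32_t)block;

	return snprintf(buf, len, "%d", (int)as_signed);
}

/* Format the same block number as unsigned, printed with %u. */
static int format_unsigned(char *buf, size_t len, uint32_t block)
{
	return snprintf(buf, len, "%u", (unsigned int)block);
}
```

For a block number such as 3000000000, which is above 2^31-1, the
signed variant typically yields a negative string while the unsigned
variant prints the value unchanged.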

This patch set is composed of two parts, for the kernel and e2fsprogs.

[1/2] Kernel (linux 2.6.16-rc6)
 - Change the signed 4-byte variables that hold a block number or an
   inode number to unsigned.

 - Change the format strings (%d, %ld) for block numbers and inode
   numbers to %u or %lu.

 - ext2_sb_info and ext3_sb_info use the percpu_counter structure to
   count blocks and inodes, and its counter is a "long".  I added a
   new structure, percpu_llcounter, whose counter is a "long long".
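
Why a wider counter helps can be shown with a userspace model.  This
is only a sketch under stated assumptions: it is not the
percpu_llcounter code from this patch, it has no locking, and every
name in it (model_llcounter, MODEL_NR_CPUS, MODEL_BATCH) is
hypothetical.  It keeps a "long long" total plus small per-CPU deltas
that are folded into the total in batches, which is also why a read is
only approximate.

```c
#define MODEL_NR_CPUS 4		/* assumed CPU count for the model */
#define MODEL_BATCH   32	/* assumed fold threshold */

struct model_llcounter {
	long long count;		/* global total, 64-bit even where long is 32-bit */
	long local[MODEL_NR_CPUS];	/* per-CPU deltas not yet folded in */
};

static void model_llcounter_init(struct model_llcounter *c)
{
	int i;

	c->count = 0;
	for (i = 0; i < MODEL_NR_CPUS; i++)
		c->local[i] = 0;
}

/* Add amount on behalf of one CPU; fold into the total once the delta is large. */
static void model_llcounter_mod(struct model_llcounter *c, int cpu,
				long long amount)
{
	long long delta = c->local[cpu] + amount;

	if (delta >= MODEL_BATCH || delta <= -MODEL_BATCH) {
		c->count += delta;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = (long)delta;
	}
}

/* Read the approximate total, clamped so it is never negative. */
static long long model_llcounter_read_positive(const struct model_llcounter *c)
{
	return c->count < 0 ? 0 : c->count;
}
```

In this model a free-block count of 2^32-1 fits in the total, whereas a
signed 32-bit "long" could not represent it.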

[2/2] Commands (e2fsprogs-1.38)
 - Call the C versions of ext2fs_set_bit() and ext2fs_test_bit()
   defined in lib/ext2fs/bitops.c on the x86 and mc68000
   architectures.  This makes it possible to create an ext2/3
   filesystem with more than 2G blocks using mke2fs with the -F
   option.
 - Change the format strings (%d, %ld) for block numbers and inode
   numbers to %u or %lu.
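
As a rough illustration of the portable C approach, bit operations
that take an unsigned bit number can look like the sketch below.  It
is not copied from lib/ext2fs/bitops.c; sketch_set_bit() and
sketch_test_bit() are hypothetical names.  Because the bit number is
unsigned, the byte offset (nr >> 3) never becomes negative, even for
bit numbers above 2^31-1.

```c
/* Set bit nr in the bitmap at addr; return nonzero if it was already set. */
static int sketch_set_bit(unsigned int nr, void *addr)
{
	unsigned char *byte = (unsigned char *)addr + (nr >> 3);
	unsigned char mask = (unsigned char)(1u << (nr & 7));
	int was_set = (*byte & mask) != 0;

	*byte |= mask;
	return was_set;
}

/* Return 1 if bit nr is set in the bitmap at addr, 0 otherwise. */
static int sketch_test_bit(unsigned int nr, const void *addr)
{
	const unsigned char *byte = (const unsigned char *)addr + (nr >> 3);

	return (*byte >> (nr & 7)) & 1;
}
```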

Any feedback and comments are welcome.

Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
---
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/balloc.c linux-2.6.16-rc6-4g/fs/ext2/balloc.c
--- linux-2.6.16-rc6.org/fs/ext2/balloc.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext2/balloc.c	2006-03-14 09:29:01.000000000 +0900
@@ -99,14 +99,14 @@ error_out:
  * Set sb->s_dirt here because the superblock was "logically" altered.  We
  * need to recalculate its free blocks count and flush it out.
  */
-static int reserve_blocks(struct super_block *sb, int count)
+static unsigned int reserve_blocks(struct super_block *sb, unsigned int count)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	struct ext2_super_block *es = sbi->s_es;
-	unsigned free_blocks;
+	unsigned int free_blocks;
 	unsigned root_blocks;

-	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+	free_blocks = percpu_llcounter_read_positive(&sbi->s_freeblocks_counter);
 	root_blocks = le32_to_cpu(es->s_r_blocks_count);

 	if (free_blocks < count)
@@ -125,23 +125,23 @@ static int reserve_blocks(struct super_b
 			return 0;
 	}

-	percpu_counter_mod(&sbi->s_freeblocks_counter, -count);
+	percpu_llcounter_mod(&sbi->s_freeblocks_counter, -count);
 	sb->s_dirt = 1;
 	return count;
 }

-static void release_blocks(struct super_block *sb, int count)
+static void release_blocks(struct super_block *sb, unsigned int count)
 {
 	if (count) {
 		struct ext2_sb_info *sbi = EXT2_SB(sb);

-		percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+		percpu_llcounter_mod(&sbi->s_freeblocks_counter, count);
 		sb->s_dirt = 1;
 	}
 }

-static int group_reserve_blocks(struct ext2_sb_info *sbi, int group_no,
-	struct ext2_group_desc *desc, struct buffer_head *bh, int count)
+static unsigned int group_reserve_blocks(struct ext2_sb_info *sbi, int group_no,
+	struct ext2_group_desc *desc, struct buffer_head *bh, unsigned int count)
 {
 	unsigned free_blocks;

@@ -159,7 +159,7 @@ static int group_reserve_blocks(struct e
 }

 static void group_release_blocks(struct super_block *sb, int group_no,
-	struct ext2_group_desc *desc, struct buffer_head *bh, int count)
+	struct ext2_group_desc *desc, struct buffer_head *bh, unsigned int count)
 {
 	if (count) {
 		struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -324,7 +324,7 @@ got_it:
  * bitmap, and then for any free bit if that fails.
  * This function also updates quota and i_blocks field.
  */
-int ext2_new_block(struct inode *inode, unsigned long goal,
+unsigned int ext2_new_block(struct inode *inode, unsigned long goal,
 			u32 *prealloc_count, u32 *prealloc_block, int *err)
 {
 	struct buffer_head *bitmap_bh = NULL;
@@ -333,8 +333,8 @@ int ext2_new_block(struct inode *inode,
 	int group_no;			/* i */
 	int ret_block;			/* j */
 	int group_idx;			/* k */
-	int target_block;		/* tmp */
-	int block = 0;
+	unsigned int target_block;	/* tmp */
+	unsigned int block = 0;
 	struct super_block *sb = inode->i_sb;
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	struct ext2_super_block *es = sbi->s_es;
@@ -447,7 +447,6 @@ retry:
 		group_alloc = 0;
 		goto retry;
 	}
-
 got_block:
 	ext2_debug("using block group %d(%d)\n",
 		group_no, desc->bg_free_blocks_count);
@@ -465,7 +464,7 @@ got_block:

 	if (target_block >= le32_to_cpu(es->s_blocks_count)) {
 		ext2_error (sb, "ext2_new_block",
-			    "block(%d) >= blocks count(%d) - "
+			    "block(%d) >= blocks count(%u) - "
 			    "block_group = %d, es == %p ", ret_block,
 			le32_to_cpu(es->s_blocks_count), group_no, es);
 		goto io_error;
@@ -504,7 +503,7 @@ got_block:
 	if (sb->s_flags & MS_SYNCHRONOUS)
 		sync_dirty_buffer(bitmap_bh);

-	ext2_debug ("allocating block %d. ", block);
+	ext2_debug ("allocating block %u. ", block);

 	*err = 0;
 out_release:
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/ext2.h linux-2.6.16-rc6-4g/fs/ext2/ext2.h
--- linux-2.6.16-rc6.org/fs/ext2/ext2.h	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext2/ext2.h	2006-03-14 09:29:01.000000000 +0900
@@ -91,7 +91,7 @@ static inline struct ext2_inode_info *EX
 /* balloc.c */
 extern int ext2_bg_has_super(struct super_block *sb, int group);
 extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
-extern int ext2_new_block (struct inode *, unsigned long,
+extern unsigned int ext2_new_block (struct inode *, unsigned long,
 			   __u32 *, __u32 *, int *);
 extern void ext2_free_blocks (struct inode *, unsigned long,
 			      unsigned long);
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/ialloc.c linux-2.6.16-rc6-4g/fs/ext2/ialloc.c
--- linux-2.6.16-rc6.org/fs/ext2/ialloc.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext2/ialloc.c	2006-03-14 09:29:01.000000000 +0900
@@ -83,7 +83,7 @@ static void ext2_release_inode(struct su
 			cpu_to_le16(le16_to_cpu(desc->bg_used_dirs_count) - 1);
 	spin_unlock(sb_bgl_lock(EXT2_SB(sb), group));
 	if (dir)
-		percpu_counter_dec(&EXT2_SB(sb)->s_dirs_counter);
+		percpu_llcounter_dec(&EXT2_SB(sb)->s_dirs_counter);
 	sb->s_dirt = 1;
 	mark_buffer_dirty(bh);
 }
@@ -276,22 +276,20 @@ static int find_group_orlov(struct super
 	struct ext2_super_block *es = sbi->s_es;
 	int ngroups = sbi->s_groups_count;
 	int inodes_per_group = EXT2_INODES_PER_GROUP(sb);
-	int freei;
+	unsigned long freei, free_blocks, ndirs;
 	int avefreei;
-	int free_blocks;
 	int avefreeb;
 	int blocks_per_dir;
-	int ndirs;
 	int max_debt, max_dirs, min_blocks, min_inodes;
 	int group = -1, i;
 	struct ext2_group_desc *desc;
 	struct buffer_head *bh;

-	freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
+	freei = percpu_llcounter_read_positive(&sbi->s_freeinodes_counter);
 	avefreei = freei / ngroups;
-	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+	free_blocks = percpu_llcounter_read_positive(&sbi->s_freeblocks_counter);
 	avefreeb = free_blocks / ngroups;
-	ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
+	ndirs = percpu_llcounter_read_positive(&sbi->s_dirs_counter);

 	if ((parent == sb->s_root->d_inode) ||
 	    (EXT2_I(parent)->i_flags & EXT2_TOPDIR_FL)) {
@@ -328,7 +326,7 @@ static int find_group_orlov(struct super
 	}

 	if (ndirs == 0)
-		ndirs = 1;	/* percpu_counters are approximate... */
+		ndirs = 1;	/* percpu_llcounters are approximate... */

 	blocks_per_dir = (le32_to_cpu(es->s_blocks_count)-free_blocks) / ndirs;

@@ -543,9 +541,9 @@ got:
 		goto fail;
 	}

-	percpu_counter_mod(&sbi->s_freeinodes_counter, -1);
+	percpu_llcounter_mod(&sbi->s_freeinodes_counter, -1);
 	if (S_ISDIR(mode))
-		percpu_counter_inc(&sbi->s_dirs_counter);
+		percpu_llcounter_inc(&sbi->s_dirs_counter);

 	spin_lock(sb_bgl_lock(sbi, group));
 	gdp->bg_free_inodes_count =
@@ -670,7 +668,7 @@ unsigned long ext2_count_free_inodes (st
 	}
 	brelse(bitmap_bh);
 	printk("ext2_count_free_inodes: stored = %lu, computed = %lu, %lu\n",
-		percpu_counter_read(&EXT2_SB(sb)->s_freeinodes_counter),
+		percpu_llcounter_read(&EXT2_SB(sb)->s_freeinodes_counter),
 		desc_count, bitmap_count);
 	unlock_super(sb);
 	return desc_count;
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/inode.c linux-2.6.16-rc6-4g/fs/ext2/inode.c
--- linux-2.6.16-rc6.org/fs/ext2/inode.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext2/inode.c	2006-03-15 21:16:51.000000000 +0900
@@ -107,7 +107,7 @@ void ext2_discard_prealloc (struct inode
 #endif
 }

-static int ext2_alloc_block (struct inode * inode, unsigned long goal, int *err)
+static unsigned int ext2_alloc_block (struct inode * inode, unsigned int goal, int *err)
 {
 #ifdef EXT2FS_DEBUG
 	static unsigned long alloc_hits, alloc_attempts;
@@ -193,8 +193,8 @@ static inline int verify_chain(Indirect
  * get there at all.
  */

-static int ext2_block_to_path(struct inode *inode,
-			long i_block, int offsets[4], int *boundary)
+static int ext2_block_to_path(struct inode *inode, unsigned long i_block,
+				unsigned int offsets[4], int *boundary)
 {
 	int ptrs = EXT2_ADDR_PER_BLOCK(inode->i_sb);
 	int ptrs_bits = EXT2_ADDR_PER_BLOCK_BITS(inode->i_sb);
@@ -263,7 +263,7 @@ static int ext2_block_to_path(struct ino
  */
 static Indirect *ext2_get_branch(struct inode *inode,
 				 int depth,
-				 int *offsets,
+				 unsigned int *offsets,
 				 Indirect chain[4],
 				 int *err)
 {
@@ -363,7 +363,7 @@ static unsigned long ext2_find_near(stru
  */

 static inline int ext2_find_goal(struct inode *inode,
-				 long block,
+				 unsigned long block,
 				 Indirect chain[4],
 				 Indirect *partial,
 				 unsigned long *goal)
@@ -418,20 +418,20 @@ static inline int ext2_find_goal(struct
 static int ext2_alloc_branch(struct inode *inode,
 			     int num,
 			     unsigned long goal,
-			     int *offsets,
+			     unsigned int *offsets,
 			     Indirect *branch)
 {
 	int blocksize = inode->i_sb->s_blocksize;
 	int n = 0;
 	int err;
 	int i;
-	int parent = ext2_alloc_block(inode, goal, &err);
+	unsigned int parent = ext2_alloc_block(inode, goal, &err);

 	branch[0].key = cpu_to_le32(parent);
 	if (parent) for (n = 1; n < num; n++) {
 		struct buffer_head *bh;
 		/* Allocate the next block */
-		int nr = ext2_alloc_block(inode, parent, &err);
+		unsigned int nr = ext2_alloc_block(inode, parent, &err);
 		if (!nr)
 			break;
 		branch[n].key = cpu_to_le32(nr);
@@ -489,7 +489,7 @@ static int ext2_alloc_branch(struct inod
  */

 static inline int ext2_splice_branch(struct inode *inode,
-				     long block,
+				     unsigned long block,
 				     Indirect chain[4],
 				     Indirect *where,
 				     int num)
@@ -547,7 +547,7 @@ changed:
 int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create)
 {
 	int err = -EIO;
-	int offsets[4];
+	unsigned int offsets[4];
 	Indirect chain[4];
 	Indirect *partial;
 	unsigned long goal;
@@ -776,7 +776,7 @@ static inline int all_zeroes(__le32 *p,

 static Indirect *ext2_find_shared(struct inode *inode,
 				int depth,
-				int offsets[4],
+				unsigned int offsets[4],
 				Indirect chain[4],
 				__le32 *top)
 {
@@ -892,7 +892,7 @@ static void ext2_free_branches(struct in
 			 */
 			if (!bh) {
 				ext2_error(inode->i_sb, "ext2_free_branches",
-					"Read failure, inode=%ld, block=%ld",
+					"Read failure, inode=%lu, block=%lu",
 					inode->i_ino, nr);
 				continue;
 			}
@@ -912,12 +912,12 @@ void ext2_truncate (struct inode * inode
 {
 	__le32 *i_data = EXT2_I(inode)->i_data;
 	int addr_per_block = EXT2_ADDR_PER_BLOCK(inode->i_sb);
-	int offsets[4];
+	unsigned int offsets[4];
 	Indirect chain[4];
 	Indirect *partial;
 	__le32 nr = 0;
 	int n;
-	long iblock;
+	unsigned long iblock;
 	unsigned blocksize;

 	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/super.c linux-2.6.16-rc6-4g/fs/ext2/super.c
--- linux-2.6.16-rc6.org/fs/ext2/super.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext2/super.c	2006-03-14 09:29:01.000000000 +0900
@@ -126,9 +126,9 @@ static void ext2_put_super (struct super
 			brelse (sbi->s_group_desc[i]);
 	kfree(sbi->s_group_desc);
 	kfree(sbi->s_debts);
-	percpu_counter_destroy(&sbi->s_freeblocks_counter);
-	percpu_counter_destroy(&sbi->s_freeinodes_counter);
-	percpu_counter_destroy(&sbi->s_dirs_counter);
+	percpu_llcounter_destroy(&sbi->s_freeblocks_counter);
+	percpu_llcounter_destroy(&sbi->s_freeinodes_counter);
+	percpu_llcounter_destroy(&sbi->s_dirs_counter);
 	brelse (sbi->s_sbh);
 	sb->s_fs_info = NULL;
 	kfree(sbi);
@@ -836,9 +836,9 @@ static int ext2_fill_super(struct super_
 		printk ("EXT2-fs: not enough memory\n");
 		goto failed_mount;
 	}
-	percpu_counter_init(&sbi->s_freeblocks_counter);
-	percpu_counter_init(&sbi->s_freeinodes_counter);
-	percpu_counter_init(&sbi->s_dirs_counter);
+	percpu_llcounter_init(&sbi->s_freeblocks_counter);
+	percpu_llcounter_init(&sbi->s_freeinodes_counter);
+	percpu_llcounter_init(&sbi->s_dirs_counter);
 	bgl_lock_init(&sbi->s_blockgroup_lock);
 	sbi->s_debts = kmalloc(sbi->s_groups_count * sizeof(*sbi->s_debts),
 			       GFP_KERNEL);
@@ -888,11 +888,11 @@ static int ext2_fill_super(struct super_
 		ext2_warning(sb, __FUNCTION__,
 			"mounting ext3 filesystem as ext2");
 	ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY);
-	percpu_counter_mod(&sbi->s_freeblocks_counter,
+	percpu_llcounter_mod(&sbi->s_freeblocks_counter,
 				ext2_count_free_blocks(sb));
-	percpu_counter_mod(&sbi->s_freeinodes_counter,
+	percpu_llcounter_mod(&sbi->s_freeinodes_counter,
 				ext2_count_free_inodes(sb));
-	percpu_counter_mod(&sbi->s_dirs_counter,
+	percpu_llcounter_mod(&sbi->s_dirs_counter,
 				ext2_count_dirs(sb));
 	return 0;

diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/xattr.c linux-2.6.16-rc6-4g/fs/ext2/xattr.c
--- linux-2.6.16-rc6.org/fs/ext2/xattr.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext2/xattr.c	2006-03-14 09:29:01.000000000 +0900
@@ -71,7 +71,7 @@

 #ifdef EXT2_XATTR_DEBUG
 # define ea_idebug(inode, f...) do { \
-		printk(KERN_DEBUG "inode %s:%ld: ", \
+		printk(KERN_DEBUG "inode %s:%lu: ", \
 			inode->i_sb->s_id, inode->i_ino); \
 		printk(f); \
 		printk("\n"); \
@@ -164,7 +164,7 @@ ext2_xattr_get(struct inode *inode, int
 	error = -ENODATA;
 	if (!EXT2_I(inode)->i_file_acl)
 		goto cleanup;
-	ea_idebug(inode, "reading block %d", EXT2_I(inode)->i_file_acl);
+	ea_idebug(inode, "reading block %u", EXT2_I(inode)->i_file_acl);
 	bh = sb_bread(inode->i_sb, EXT2_I(inode)->i_file_acl);
 	error = -EIO;
 	if (!bh)
@@ -175,7 +175,7 @@ ext2_xattr_get(struct inode *inode, int
 	if (HDR(bh)->h_magic != cpu_to_le32(EXT2_XATTR_MAGIC) ||
 	    HDR(bh)->h_blocks != cpu_to_le32(1)) {
 bad_block:	ext2_error(inode->i_sb, "ext2_xattr_get",
-			"inode %ld: bad block %d", inode->i_ino,
+			"inode %lu: bad block %u", inode->i_ino,
 			EXT2_I(inode)->i_file_acl);
 		error = -EIO;
 		goto cleanup;
@@ -264,7 +264,7 @@ ext2_xattr_list(struct inode *inode, cha
 	error = 0;
 	if (!EXT2_I(inode)->i_file_acl)
 		goto cleanup;
-	ea_idebug(inode, "reading block %d", EXT2_I(inode)->i_file_acl);
+	ea_idebug(inode, "reading block %u", EXT2_I(inode)->i_file_acl);
 	bh = sb_bread(inode->i_sb, EXT2_I(inode)->i_file_acl);
 	error = -EIO;
 	if (!bh)
@@ -275,7 +275,7 @@ ext2_xattr_list(struct inode *inode, cha
 	if (HDR(bh)->h_magic != cpu_to_le32(EXT2_XATTR_MAGIC) ||
 	    HDR(bh)->h_blocks != cpu_to_le32(1)) {
 bad_block:	ext2_error(inode->i_sb, "ext2_xattr_list",
-			"inode %ld: bad block %d", inode->i_ino,
+			"inode %lu: bad block %u", inode->i_ino,
 			EXT2_I(inode)->i_file_acl);
 		error = -EIO;
 		goto cleanup;
@@ -411,7 +411,7 @@ ext2_xattr_set(struct inode *inode, int
 		if (header->h_magic != cpu_to_le32(EXT2_XATTR_MAGIC) ||
 		    header->h_blocks != cpu_to_le32(1)) {
 bad_block:		ext2_error(sb, "ext2_xattr_set",
-				"inode %ld: bad block %d", inode->i_ino,
+				"inode %lu: bad block %u", inode->i_ino,
 				   EXT2_I(inode)->i_file_acl);
 			error = -EIO;
 			goto cleanup;
@@ -664,15 +664,15 @@ ext2_xattr_set2(struct inode *inode, str
 			ext2_xattr_cache_insert(new_bh);
 		} else {
 			/* We need to allocate a new block */
-			int goal = le32_to_cpu(EXT2_SB(sb)->s_es->
+			unsigned int goal = le32_to_cpu(EXT2_SB(sb)->s_es->
 						           s_first_data_block) +
 				   EXT2_I(inode)->i_block_group *
 				   EXT2_BLOCKS_PER_GROUP(sb);
-			int block = ext2_new_block(inode, goal,
+			unsigned int block = ext2_new_block(inode, goal,
 						   NULL, NULL, &error);
 			if (error)
 				goto cleanup;
-			ea_idebug(inode, "creating block %d", block);
+			ea_idebug(inode, "creating block %u", block);

 			new_bh = sb_getblk(sb, block);
 			if (!new_bh) {
@@ -772,7 +772,7 @@ ext2_xattr_delete_inode(struct inode *in
 	bh = sb_bread(inode->i_sb, EXT2_I(inode)->i_file_acl);
 	if (!bh) {
 		ext2_error(inode->i_sb, "ext2_xattr_delete_inode",
-			"inode %ld: block %d read error", inode->i_ino,
+			"inode %lu: block %u read error", inode->i_ino,
 			EXT2_I(inode)->i_file_acl);
 		goto cleanup;
 	}
@@ -780,7 +780,7 @@ ext2_xattr_delete_inode(struct inode *in
 	if (HDR(bh)->h_magic != cpu_to_le32(EXT2_XATTR_MAGIC) ||
 	    HDR(bh)->h_blocks != cpu_to_le32(1)) {
 		ext2_error(inode->i_sb, "ext2_xattr_delete_inode",
-			"inode %ld: bad block %d", inode->i_ino,
+			"inode %lu: bad block %u", inode->i_ino,
 			EXT2_I(inode)->i_file_acl);
 		goto cleanup;
 	}
@@ -931,13 +931,13 @@ again:
 		bh = sb_bread(inode->i_sb, ce->e_block);
 		if (!bh) {
 			ext2_error(inode->i_sb, "ext2_xattr_cache_find",
-				"inode %ld: block %ld read error",
+				"inode %lu: block %lu read error",
 				inode->i_ino, (unsigned long) ce->e_block);
 		} else {
 			lock_buffer(bh);
 			if (le32_to_cpu(HDR(bh)->h_refcount) >
 				   EXT2_XATTR_REFCOUNT_MAX) {
-				ea_idebug(inode, "block %ld refcount %d>%d",
+				ea_idebug(inode, "block %lu refcount %d>%d",
 					  (unsigned long) ce->e_block,
 					  le32_to_cpu(HDR(bh)->h_refcount),
 					  EXT2_XATTR_REFCOUNT_MAX);
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/xip.c linux-2.6.16-rc6-4g/fs/ext2/xip.c
--- linux-2.6.16-rc6.org/fs/ext2/xip.c	2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext2/xip.c	2006-03-14 09:29:01.000000000 +0900
@@ -44,8 +44,8 @@ __ext2_get_sector(struct inode *inode, s
 	return rc;
 }

-int
-ext2_clear_xip_target(struct inode *inode, int block)
+unsigned int
+ext2_clear_xip_target(struct inode *inode, unsigned int block)
 {
 	sector_t sector = block * (PAGE_SIZE/512);
 	unsigned long data;
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/balloc.c linux-2.6.16-rc6-4g/fs/ext3/balloc.c
--- linux-2.6.16-rc6.org/fs/ext3/balloc.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext3/balloc.c	2006-03-14 09:29:01.000000000 +0900
@@ -36,7 +36,6 @@
  * when a file system is mounted (see ext3_read_super).
  */

-
 #define in_range(b, first, len)	((b) >= (first) && (b) <= (first) + (len) - 1)

 struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb,
@@ -467,7 +466,7 @@ do_more:
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+	percpu_llcounter_mod(&sbi->s_freeblocks_counter, count);

 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1118,9 +1117,10 @@ out:

 static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
 {
-	int free_blocks, root_blocks;
+	unsigned long free_blocks;
+	int  root_blocks;

-	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+	free_blocks = percpu_llcounter_read_positive(&sbi->s_freeblocks_counter);
 	root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
 	if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
 		sbi->s_resuid != current->fsuid &&
@@ -1154,19 +1154,20 @@ int ext3_should_retry_alloc(struct super
  * bitmap, and then for any free bit if that fails.
  * This function also updates quota and i_blocks field.
  */
-int ext3_new_block(handle_t *handle, struct inode *inode,
+unsigned int ext3_new_block(handle_t *handle, struct inode *inode,
 			unsigned long goal, int *errp)
 {
 	struct buffer_head *bitmap_bh = NULL;
 	struct buffer_head *gdp_bh;
 	int group_no;
 	int goal_group;
-	int ret_block;
+	unsigned int ret_block;
 	int bgi;			/* blockgroup iteration index */
-	int target_block;
+	unsigned int target_block;
 	int fatal = 0, err;
 	int performed_allocation = 0;
 	int free_blocks;
+	int group_block;
 	struct super_block *sb;
 	struct ext3_group_desc *gdp;
 	struct ext3_super_block *es;
@@ -1238,17 +1239,19 @@ retry:
 		my_rsv = NULL;

 	if (free_blocks > 0) {
-		ret_block = ((goal - le32_to_cpu(es->s_first_data_block)) %
+		group_block = ((goal - le32_to_cpu(es->s_first_data_block)) %
 				EXT3_BLOCKS_PER_GROUP(sb));
 		bitmap_bh = read_block_bitmap(sb, group_no);
 		if (!bitmap_bh)
 			goto io_error;
-		ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
-					bitmap_bh, ret_block, my_rsv, &fatal);
+		group_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+					bitmap_bh, group_block, my_rsv, &fatal);
 		if (fatal)
 			goto out;
-		if (ret_block >= 0)
+		if (group_block >= 0) {
+			ret_block = group_block;
 			goto allocated;
+		}
 	}

 	ngroups = EXT3_SB(sb)->s_groups_count;
@@ -1280,12 +1283,14 @@ retry:
 		bitmap_bh = read_block_bitmap(sb, group_no);
 		if (!bitmap_bh)
 			goto io_error;
-		ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
+		group_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
 					bitmap_bh, -1, my_rsv, &fatal);
 		if (fatal)
 			goto out;
-		if (ret_block >= 0)
+		if (group_block >= 0) {
+			ret_block = group_block;
 			goto allocated;
+		}
 	}
 	/*
 	 * We may end up a bogus ealier ENOSPC error due to
@@ -1347,7 +1352,7 @@ allocated:
 				"b_committed_data\n", __FUNCTION__);
 		}
 	}
-	ext3_debug("found bit %d\n", ret_block);
+	ext3_debug("found bit %u\n", ret_block);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
 	jbd_unlock_bh_state(bitmap_bh);
 #endif
@@ -1357,8 +1362,8 @@ allocated:

 	if (ret_block >= le32_to_cpu(es->s_blocks_count)) {
 		ext3_error(sb, "ext3_new_block",
-			    "block(%d) >= blocks count(%d) - "
-			    "block_group = %d, es == %p ", ret_block,
+			    "block(%u) >= blocks count(%u) - "
+			    "block_group = %u, es == %p ", ret_block,
 			le32_to_cpu(es->s_blocks_count), group_no, es);
 		goto out;
 	}
@@ -1368,14 +1373,14 @@ allocated:
 	 * list of some description.  We don't know in advance whether
 	 * the caller wants to use it as metadata or data.
 	 */
-	ext3_debug("allocating block %d. Goal hits %d of %d.\n",
+	ext3_debug("allocating block %u. Goal hits %d of %d.\n",
 			ret_block, goal_hits, goal_attempts);

 	spin_lock(sb_bgl_lock(sbi, group_no));
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - 1);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -1);
+	percpu_llcounter_mod(&sbi->s_freeblocks_counter, -1);

 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext3_journal_dirty_metadata(handle, gdp_bh);
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/ialloc.c linux-2.6.16-rc6-4g/fs/ext3/ialloc.c
--- linux-2.6.16-rc6.org/fs/ext3/ialloc.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext3/ialloc.c	2006-03-14 09:29:01.000000000 +0900
@@ -170,9 +170,9 @@ void ext3_free_inode (handle_t *handle,
 				gdp->bg_used_dirs_count = cpu_to_le16(
 				  le16_to_cpu(gdp->bg_used_dirs_count) - 1);
 			spin_unlock(sb_bgl_lock(sbi, block_group));
-			percpu_counter_inc(&sbi->s_freeinodes_counter);
+			percpu_llcounter_inc(&sbi->s_freeinodes_counter);
 			if (is_directory)
-				percpu_counter_dec(&sbi->s_dirs_counter);
+				percpu_llcounter_dec(&sbi->s_dirs_counter);

 		}
 		BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
@@ -202,12 +202,13 @@ error_return:
 static int find_group_dir(struct super_block *sb, struct inode *parent)
 {
 	int ngroups = EXT3_SB(sb)->s_groups_count;
-	int freei, avefreei;
+	unsigned long freei;
+	int avefreei;
 	struct ext3_group_desc *desc, *best_desc = NULL;
 	struct buffer_head *bh;
 	int group, best_group = -1;

-	freei = percpu_counter_read_positive(&EXT3_SB(sb)->s_freeinodes_counter);
+	freei = percpu_llcounter_read_positive(&EXT3_SB(sb)->s_freeinodes_counter);
 	avefreei = freei / ngroups;

 	for (group = 0; group < ngroups; group++) {
@@ -261,19 +262,20 @@ static int find_group_orlov(struct super
 	struct ext3_super_block *es = sbi->s_es;
 	int ngroups = sbi->s_groups_count;
 	int inodes_per_group = EXT3_INODES_PER_GROUP(sb);
-	int freei, avefreei;
-	int freeb, avefreeb;
-	int blocks_per_dir, ndirs;
+	unsigned long freei, freeb, ndirs;
+	int avefreei;
+	int avefreeb;
+	int blocks_per_dir;
 	int max_debt, max_dirs, min_blocks, min_inodes;
 	int group = -1, i;
 	struct ext3_group_desc *desc;
 	struct buffer_head *bh;

-	freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
+	freei = percpu_llcounter_read_positive(&sbi->s_freeinodes_counter);
 	avefreei = freei / ngroups;
-	freeb = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+	freeb = percpu_llcounter_read_positive(&sbi->s_freeblocks_counter);
 	avefreeb = freeb / ngroups;
-	ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
+	ndirs = percpu_llcounter_read_positive(&sbi->s_dirs_counter);

 	if ((parent == sb->s_root->d_inode) ||
 	    (EXT3_I(parent)->i_flags & EXT3_TOPDIR_FL)) {
@@ -539,9 +541,9 @@ got:
 	err = ext3_journal_dirty_metadata(handle, bh2);
 	if (err) goto fail;

-	percpu_counter_dec(&sbi->s_freeinodes_counter);
+	percpu_llcounter_dec(&sbi->s_freeinodes_counter);
 	if (S_ISDIR(mode))
-		percpu_counter_inc(&sbi->s_dirs_counter);
+		percpu_llcounter_inc(&sbi->s_dirs_counter);
 	sb->s_dirt = 1;

 	inode->i_uid = current->fsuid;
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/inode.c linux-2.6.16-rc6-4g/fs/ext3/inode.c
--- linux-2.6.16-rc6.org/fs/ext3/inode.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext3/inode.c	2006-03-14 09:29:01.000000000 +0900
@@ -64,7 +64,7 @@ static inline int ext3_inode_is_fast_sym

 int ext3_forget(handle_t *handle, int is_metadata,
 		       struct inode *inode, struct buffer_head *bh,
-		       int blocknr)
+		       unsigned int blocknr)
 {
 	int err;

@@ -235,10 +235,10 @@ no_delete:
 	clear_inode(inode);	/* We must guarantee clearing of inode... */
 }

-static int ext3_alloc_block (handle_t *handle,
-			struct inode * inode, unsigned long goal, int *err)
+static unsigned int ext3_alloc_block (handle_t *handle,
+			struct inode * inode, unsigned int goal, int *err)
 {
-	unsigned long result;
+	unsigned int result;

 	result = ext3_new_block(handle, inode, goal, err);
 	return result;
@@ -296,7 +296,7 @@ static inline int verify_chain(Indirect
  */

 static int ext3_block_to_path(struct inode *inode,
-			long i_block, int offsets[4], int *boundary)
+			unsigned long i_block, unsigned int offsets[4], int *boundary)
 {
 	int ptrs = EXT3_ADDR_PER_BLOCK(inode->i_sb);
 	int ptrs_bits = EXT3_ADDR_PER_BLOCK_BITS(inode->i_sb);
@@ -363,7 +363,7 @@ static int ext3_block_to_path(struct ino
  *	or when it reads all @depth-1 indirect blocks successfully and finds
  *	the whole chain, all way to the data (returns %NULL, *err == 0).
  */
-static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
+static Indirect *ext3_get_branch(struct inode *inode, int depth, unsigned int *offsets,
 				 Indirect chain[4], int *err)
 {
 	struct super_block *sb = inode->i_sb;
@@ -460,7 +460,7 @@ static unsigned long ext3_find_near(stru
  *	stores it in *@goal and returns zero.
  */

-static unsigned long ext3_find_goal(struct inode *inode, long block,
+static unsigned long ext3_find_goal(struct inode *inode, unsigned long block,
 		Indirect chain[4], Indirect *partial)
 {
 	struct ext3_block_alloc_info *block_i =  EXT3_I(inode)->i_block_alloc_info;
@@ -505,21 +505,21 @@ static unsigned long ext3_find_goal(stru
 static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
 			     int num,
 			     unsigned long goal,
-			     int *offsets,
+			     unsigned int *offsets,
 			     Indirect *branch)
 {
 	int blocksize = inode->i_sb->s_blocksize;
 	int n = 0, keys = 0;
 	int err = 0;
 	int i;
-	int parent = ext3_alloc_block(handle, inode, goal, &err);
+	unsigned int parent = ext3_alloc_block(handle, inode, goal, &err);

 	branch[0].key = cpu_to_le32(parent);
 	if (parent) {
 		for (n = 1; n < num; n++) {
 			struct buffer_head *bh;
 			/* Allocate the next block */
-			int nr = ext3_alloc_block(handle, inode, parent, &err);
+			unsigned int nr = ext3_alloc_block(handle, inode, parent, &err);
 			if (!nr)
 				break;
 			branch[n].key = cpu_to_le32(nr);
@@ -585,7 +585,7 @@ static int ext3_alloc_branch(handle_t *h
  *	chain to new block and return 0.
  */

-static int ext3_splice_branch(handle_t *handle, struct inode *inode, long block,
+static int ext3_splice_branch(handle_t *handle, struct inode *inode, unsigned long block,
 			      Indirect chain[4], Indirect *where, int num)
 {
 	int i;
@@ -676,7 +676,7 @@ ext3_get_block_handle(handle_t *handle,
 		struct buffer_head *bh_result, int create, int extend_disksize)
 {
 	int err = -EIO;
-	int offsets[4];
+	unsigned int offsets[4];
 	Indirect chain[4];
 	Indirect *partial;
 	unsigned long goal;
@@ -852,7 +852,7 @@ get_block:
  * `handle' can be NULL if create is zero
  */
 struct buffer_head *ext3_getblk(handle_t *handle, struct inode * inode,
-				long block, int create, int * errp)
+				unsigned long block, int create, int * errp)
 {
 	struct buffer_head dummy;
 	int fatal = 0, err;
@@ -907,7 +907,7 @@ err:
 }

 struct buffer_head *ext3_bread(handle_t *handle, struct inode * inode,
-			       int block, int create, int *err)
+			       unsigned int block, int create, int *err)
 {
 	struct buffer_head * bh;

@@ -1754,7 +1754,7 @@ static inline int all_zeroes(__le32 *p,

 static Indirect *ext3_find_shared(struct inode *inode,
 				int depth,
-				int offsets[4],
+				unsigned int offsets[4],
 				Indirect chain[4],
 				__le32 *top)
 {
@@ -1967,7 +1967,7 @@ static void ext3_free_branches(handle_t
 			 */
 			if (!bh) {
 				ext3_error(inode->i_sb, "ext3_free_branches",
-					   "Read failure, inode=%ld, block=%ld",
+					   "Read failure, inode=%lu, block=%lu",
 					   inode->i_ino, nr);
 				continue;
 			}
@@ -2084,12 +2084,12 @@ void ext3_truncate(struct inode * inode)
 	__le32 *i_data = ei->i_data;
 	int addr_per_block = EXT3_ADDR_PER_BLOCK(inode->i_sb);
 	struct address_space *mapping = inode->i_mapping;
-	int offsets[4];
+	unsigned int offsets[4];
 	Indirect chain[4];
 	Indirect *partial;
 	__le32 nr = 0;
 	int n;
-	long last_block;
+	unsigned long last_block;
 	unsigned blocksize = inode->i_sb->s_blocksize;
 	struct page *page;

diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/namei.c linux-2.6.16-rc6-4g/fs/ext3/namei.c
--- linux-2.6.16-rc6.org/fs/ext3/namei.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext3/namei.c	2006-03-14 09:29:01.000000000 +0900
@@ -816,7 +816,8 @@ static struct buffer_head * ext3_find_en
 	int ra_ptr = 0;		/* Current index into readahead
 				   buffer */
 	int num = 0;
-	int nblocks, i, err;
+	unsigned int nblocks;
+	int i, err;
 	struct inode *dir = dentry->d_parent->d_inode;
 	int namelen;
 	const u8 *name;
@@ -1910,8 +1911,8 @@ int ext3_orphan_add(handle_t *handle, st
 	if (!err)
 		list_add(&EXT3_I(inode)->i_orphan, &EXT3_SB(sb)->s_orphan);

-	jbd_debug(4, "superblock will point to %ld\n", inode->i_ino);
-	jbd_debug(4, "orphan inode %ld will point to %d\n",
+	jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
+	jbd_debug(4, "orphan inode %lu will point to %d\n",
 			inode->i_ino, NEXT_ORPHAN(inode));
 out_unlock:
 	unlock_super(sb);
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/resize.c linux-2.6.16-rc6-4g/fs/ext3/resize.c
--- linux-2.6.16-rc6.org/fs/ext3/resize.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext3/resize.c	2006-03-14 09:29:01.000000000 +0900
@@ -37,7 +37,7 @@ static int verify_group_input(struct sup
 		 le16_to_cpu(es->s_reserved_gdt_blocks)) : 0;
 	unsigned metaend = start + overhead;
 	struct buffer_head *bh = NULL;
-	int free_blocks_count;
+	long long free_blocks_count;
 	int err = -EINVAL;

 	input->free_blocks_count = free_blocks_count =
@@ -45,7 +45,7 @@ static int verify_group_input(struct sup

 	if (test_opt(sb, DEBUG))
 		printk(KERN_DEBUG "EXT3-fs: adding %s group %u: %u blocks "
-		       "(%d free, %u reserved)\n",
+		       "(%lld free, %u reserved)\n",
 		       ext3_bg_has_super(sb, input->group) ? "normal" :
 		       "no-super", input->group, input->blocks_count,
 		       free_blocks_count, input->reserved_blocks);
@@ -138,14 +138,14 @@ static struct buffer_head *bclean(handle
  * need to use it within a single byte (to ensure we get endianness right).
  * We can use memset for the rest of the bitmap as there are no other users.
  */
-static void mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
+static void mark_bitmap_end(unsigned int start_bit, unsigned int end_bit, char *bitmap)
 {
-	int i;
+	unsigned int i;

 	if (start_bit >= end_bit)
 		return;

-	ext3_debug("mark end bits +%d through +%d used\n", start_bit, end_bit);
+	ext3_debug("mark end bits +%u through +%u used\n", start_bit, end_bit);
 	for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++)
 		ext3_set_bit(i, bitmap);
 	if (i < end_bit)
@@ -340,7 +340,7 @@ static int verify_reserved_gdb(struct su
 	while ((grp = ext3_list_backups(sb, &three, &five, &seven)) < end) {
 		if (le32_to_cpu(*p++) != grp * EXT3_BLOCKS_PER_GROUP(sb) + blk){
 			ext3_warning(sb, __FUNCTION__,
-				     "reserved GDT %ld missing grp %d (%ld)",
+				     "reserved GDT %lu missing grp %u (%lu)",
 				     blk, grp,
 				     grp * EXT3_BLOCKS_PER_GROUP(sb) + blk);
 			return -EINVAL;
@@ -619,7 +619,7 @@ exit_free:
  * at this time.  The resize which changed s_groups_count will backup again.
  */
 static void update_backups(struct super_block *sb,
-			   int blk_off, char *data, int size)
+			   unsigned int blk_off, char *data, int size)
 {
 	struct ext3_sb_info *sbi = EXT3_SB(sb);
 	const unsigned long last = sbi->s_groups_count;
@@ -869,9 +869,9 @@ int ext3_group_add(struct super_block *s
 		input->reserved_blocks);

 	/* Update the free space counts */
-	percpu_counter_mod(&sbi->s_freeblocks_counter,
+	percpu_llcounter_mod(&sbi->s_freeblocks_counter,
 			   input->free_blocks_count);
-	percpu_counter_mod(&sbi->s_freeinodes_counter,
+	percpu_llcounter_mod(&sbi->s_freeinodes_counter,
 			   EXT3_INODES_PER_GROUP(sb));

 	ext3_journal_dirty_metadata(handle, sbi->s_sbh);
@@ -990,10 +990,10 @@ int ext3_group_extend(struct super_block
 	ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh);
 	sb->s_dirt = 1;
 	unlock_super(sb);
-	ext3_debug("freeing blocks %ld through %ld\n", o_blocks_count,
+	ext3_debug("freeing blocks %lu through %lu\n", o_blocks_count,
 		   o_blocks_count + add);
 	ext3_free_blocks_sb(handle, sb, o_blocks_count, add, &freed_blocks);
-	ext3_debug("freed blocks %ld through %ld\n", o_blocks_count,
+	ext3_debug("freed blocks %lu through %lu\n", o_blocks_count,
 		   o_blocks_count + add);
 	if ((err = ext3_journal_stop(handle)))
 		goto exit_put;
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/super.c linux-2.6.16-rc6-4g/fs/ext3/super.c
--- linux-2.6.16-rc6.org/fs/ext3/super.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext3/super.c	2006-03-14 09:29:01.000000000 +0900
@@ -377,7 +377,7 @@ static void dump_orphan_list(struct supe
 	list_for_each(l, &sbi->s_orphan) {
 		struct inode *inode = orphan_list_entry(l);
 		printk(KERN_ERR "  "
-		       "inode %s:%ld at %p: mode %o, nlink %d, next %d\n",
+		       "inode %s:%lu at %p: mode %o, nlink %d, next %d\n",
 		       inode->i_sb->s_id, inode->i_ino, inode,
 		       inode->i_mode, inode->i_nlink,
 		       NEXT_ORPHAN(inode));
@@ -403,9 +403,9 @@ static void ext3_put_super (struct super
 	for (i = 0; i < sbi->s_gdb_count; i++)
 		brelse(sbi->s_group_desc[i]);
 	kfree(sbi->s_group_desc);
-	percpu_counter_destroy(&sbi->s_freeblocks_counter);
-	percpu_counter_destroy(&sbi->s_freeinodes_counter);
-	percpu_counter_destroy(&sbi->s_dirs_counter);
+	percpu_llcounter_destroy(&sbi->s_freeblocks_counter);
+	percpu_llcounter_destroy(&sbi->s_freeinodes_counter);
+	percpu_llcounter_destroy(&sbi->s_dirs_counter);
 	brelse(sbi->s_sbh);
 #ifdef CONFIG_QUOTA
 	for (i = 0; i < MAXQUOTAS; i++)
@@ -1253,17 +1253,17 @@ static void ext3_orphan_cleanup (struct
 		DQUOT_INIT(inode);
 		if (inode->i_nlink) {
 			printk(KERN_DEBUG
-				"%s: truncating inode %ld to %Ld bytes\n",
+				"%s: truncating inode %lu to %Ld bytes\n",
 				__FUNCTION__, inode->i_ino, inode->i_size);
-			jbd_debug(2, "truncating inode %ld to %Ld bytes\n",
+			jbd_debug(2, "truncating inode %lu to %Ld bytes\n",
 				  inode->i_ino, inode->i_size);
 			ext3_truncate(inode);
 			nr_truncates++;
 		} else {
 			printk(KERN_DEBUG
-				"%s: deleting unreferenced inode %ld\n",
+				"%s: deleting unreferenced inode %lu\n",
 				__FUNCTION__, inode->i_ino);
-			jbd_debug(2, "deleting unreferenced inode %ld\n",
+			jbd_debug(2, "deleting unreferenced inode %lu\n",
 				  inode->i_ino);
 			nr_orphans++;
 		}
@@ -1578,9 +1578,9 @@ static int ext3_fill_super (struct super
 		goto failed_mount;
 	}

-	percpu_counter_init(&sbi->s_freeblocks_counter);
-	percpu_counter_init(&sbi->s_freeinodes_counter);
-	percpu_counter_init(&sbi->s_dirs_counter);
+	percpu_llcounter_init(&sbi->s_freeblocks_counter);
+	percpu_llcounter_init(&sbi->s_freeinodes_counter);
+	percpu_llcounter_init(&sbi->s_dirs_counter);
 	bgl_lock_init(&sbi->s_blockgroup_lock);

 	for (i = 0; i < db_count; i++) {
@@ -1728,11 +1728,11 @@ static int ext3_fill_super (struct super
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");

-	percpu_counter_mod(&sbi->s_freeblocks_counter,
+	percpu_llcounter_mod(&sbi->s_freeblocks_counter,
 		ext3_count_free_blocks(sb));
-	percpu_counter_mod(&sbi->s_freeinodes_counter,
+	percpu_llcounter_mod(&sbi->s_freeinodes_counter,
 		ext3_count_free_inodes(sb));
-	percpu_counter_mod(&sbi->s_dirs_counter,
+	percpu_llcounter_mod(&sbi->s_dirs_counter,
 		ext3_count_dirs(sb));

 	lock_kernel();
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/xattr.c linux-2.6.16-rc6-4g/fs/ext3/xattr.c
--- linux-2.6.16-rc6.org/fs/ext3/xattr.c	2006-03-14 09:09:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/ext3/xattr.c	2006-03-14 09:29:01.000000000 +0900
@@ -75,7 +75,7 @@

 #ifdef EXT3_XATTR_DEBUG
 # define ea_idebug(inode, f...) do { \
-		printk(KERN_DEBUG "inode %s:%ld: ", \
+		printk(KERN_DEBUG "inode %s:%lu: ", \
 			inode->i_sb->s_id, inode->i_ino); \
 		printk(f); \
 		printk("\n"); \
@@ -225,7 +225,7 @@ ext3_xattr_block_get(struct inode *inode
 	error = -ENODATA;
 	if (!EXT3_I(inode)->i_file_acl)
 		goto cleanup;
-	ea_idebug(inode, "reading block %d", EXT3_I(inode)->i_file_acl);
+	ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
 	bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
 	if (!bh)
 		goto cleanup;
@@ -233,7 +233,7 @@ ext3_xattr_block_get(struct inode *inode
 		atomic_read(&(bh->b_count)), le32_to_cpu(BHDR(bh)->h_refcount));
 	if (ext3_xattr_check_block(bh)) {
 bad_block:	ext3_error(inode->i_sb, __FUNCTION__,
-			   "inode %ld: bad block %d", inode->i_ino,
+			   "inode %lu: bad block %u", inode->i_ino,
 			   EXT3_I(inode)->i_file_acl);
 		error = -EIO;
 		goto cleanup;
@@ -366,7 +366,7 @@ ext3_xattr_block_list(struct inode *inod
 	error = 0;
 	if (!EXT3_I(inode)->i_file_acl)
 		goto cleanup;
-	ea_idebug(inode, "reading block %d", EXT3_I(inode)->i_file_acl);
+	ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
 	bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
 	error = -EIO;
 	if (!bh)
@@ -375,7 +375,7 @@ ext3_xattr_block_list(struct inode *inod
 		atomic_read(&(bh->b_count)), le32_to_cpu(BHDR(bh)->h_refcount));
 	if (ext3_xattr_check_block(bh)) {
 		ext3_error(inode->i_sb, __FUNCTION__,
-			   "inode %ld: bad block %d", inode->i_ino,
+			   "inode %lu: bad block %u", inode->i_ino,
 			   EXT3_I(inode)->i_file_acl);
 		error = -EIO;
 		goto cleanup;
@@ -647,7 +647,7 @@ ext3_xattr_block_find(struct inode *inod
 			le32_to_cpu(BHDR(bs->bh)->h_refcount));
 		if (ext3_xattr_check_block(bs->bh)) {
 			ext3_error(sb, __FUNCTION__,
-				"inode %ld: bad block %d", inode->i_ino,
+				"inode %lu: bad block %u", inode->i_ino,
 				EXT3_I(inode)->i_file_acl);
 			error = -EIO;
 			goto cleanup;
@@ -792,14 +792,14 @@ inserted:
 			get_bh(new_bh);
 		} else {
 			/* We need to allocate a new block */
-			int goal = le32_to_cpu(
+			unsigned int goal = le32_to_cpu(
 					EXT3_SB(sb)->s_es->s_first_data_block) +
 				EXT3_I(inode)->i_block_group *
 				EXT3_BLOCKS_PER_GROUP(sb);
-			int block = ext3_new_block(handle, inode, goal, &error);
+			unsigned int block = ext3_new_block(handle, inode, goal, &error);
 			if (error)
 				goto cleanup;
-			ea_idebug(inode, "creating block %d", block);
+			ea_idebug(inode, "creating block %u", block);

 			new_bh = sb_getblk(sb, block);
 			if (!new_bh) {
@@ -847,7 +847,7 @@ cleanup_dquot:

 bad_block:
 	ext3_error(inode->i_sb, __FUNCTION__,
-		   "inode %ld: bad block %d", inode->i_ino,
+		   "inode %lu: bad block %u", inode->i_ino,
 		   EXT3_I(inode)->i_file_acl);
 	goto cleanup;

@@ -1076,14 +1076,14 @@ ext3_xattr_delete_inode(handle_t *handle
 	bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
 	if (!bh) {
 		ext3_error(inode->i_sb, __FUNCTION__,
-			"inode %ld: block %d read error", inode->i_ino,
+			"inode %lu: block %u read error", inode->i_ino,
 			EXT3_I(inode)->i_file_acl);
 		goto cleanup;
 	}
 	if (BHDR(bh)->h_magic != cpu_to_le32(EXT3_XATTR_MAGIC) ||
 	    BHDR(bh)->h_blocks != cpu_to_le32(1)) {
 		ext3_error(inode->i_sb, __FUNCTION__,
-			"inode %ld: bad block %d", inode->i_ino,
+			"inode %lu: bad block %u", inode->i_ino,
 			EXT3_I(inode)->i_file_acl);
 		goto cleanup;
 	}
@@ -1210,11 +1210,11 @@ again:
 		bh = sb_bread(inode->i_sb, ce->e_block);
 		if (!bh) {
 			ext3_error(inode->i_sb, __FUNCTION__,
-				"inode %ld: block %ld read error",
+				"inode %lu: block %lu read error",
 				inode->i_ino, (unsigned long) ce->e_block);
 		} else if (le32_to_cpu(BHDR(bh)->h_refcount) >=
 				EXT3_XATTR_REFCOUNT_MAX) {
-			ea_idebug(inode, "block %ld refcount %d>=%d",
+			ea_idebug(inode, "block %lu refcount %d>=%d",
 				  (unsigned long) ce->e_block,
 				  le32_to_cpu(BHDR(bh)->h_refcount),
 					  EXT3_XATTR_REFCOUNT_MAX);
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/jbd/journal.c linux-2.6.16-rc6-4g/fs/jbd/journal.c
--- linux-2.6.16-rc6.org/fs/jbd/journal.c	2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.16-rc6-4g/fs/jbd/journal.c	2006-03-14 09:29:01.000000000 +0900
@@ -761,7 +761,7 @@ journal_t * journal_init_inode (struct i
 	journal->j_dev = journal->j_fs_dev = inode->i_sb->s_bdev;
 	journal->j_inode = inode;
 	jbd_debug(1,
-		  "journal %p: inode %s/%ld, size %Ld, bits %d, blksize %ld\n",
+		  "journal %p: inode %s/%lu, size %Ld, bits %d, blksize %ld\n",
 		  journal, inode->i_sb->s_id, inode->i_ino,
 		  (long long) inode->i_size,
 		  inode->i_sb->s_blocksize_bits, inode->i_sb->s_blocksize);
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/include/linux/ext2_fs_sb.h linux-2.6.16-rc6-4g/include/linux/ext2_fs_sb.h
--- linux-2.6.16-rc6.org/include/linux/ext2_fs_sb.h	2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.16-rc6-4g/include/linux/ext2_fs_sb.h	2006-03-14 12:06:21.000000000 +0900
@@ -17,7 +17,7 @@
 #define _LINUX_EXT2_FS_SB

 #include <linux/blockgroup_lock.h>
-#include <linux/percpu_counter.h>
+#include <linux/percpu_llcounter.h>

 /*
  * second extended-fs super-block data in memory
@@ -49,9 +49,9 @@ struct ext2_sb_info {
 	u32 s_next_generation;
 	unsigned long s_dir_count;
 	u8 *s_debts;
-	struct percpu_counter s_freeblocks_counter;
-	struct percpu_counter s_freeinodes_counter;
-	struct percpu_counter s_dirs_counter;
+	struct percpu_llcounter s_freeblocks_counter;
+	struct percpu_llcounter s_freeinodes_counter;
+	struct percpu_llcounter s_dirs_counter;
 	struct blockgroup_lock s_blockgroup_lock;
 };

diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/include/linux/ext3_fs.h linux-2.6.16-rc6-4g/include/linux/ext3_fs.h
--- linux-2.6.16-rc6.org/include/linux/ext3_fs.h	2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.16-rc6-4g/include/linux/ext3_fs.h	2006-03-14 09:29:01.000000000 +0900
@@ -731,7 +731,7 @@ struct dir_private_info {
 /* balloc.c */
 extern int ext3_bg_has_super(struct super_block *sb, int group);
 extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
-extern int ext3_new_block (handle_t *, struct inode *, unsigned long, int *);
+extern unsigned int ext3_new_block (handle_t *, struct inode *, unsigned long, int *);
 extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long,
 			      unsigned long);
 extern void ext3_free_blocks_sb (handle_t *, struct super_block *,
@@ -761,7 +761,6 @@ extern int ext3_sync_file (struct file *
 extern int ext3fs_dirhash(const char *name, int len, struct
 			  dx_hash_info *hinfo);

-/* ialloc.c */
 extern struct inode * ext3_new_inode (handle_t *, struct inode *, int);
 extern void ext3_free_inode (handle_t *, struct inode *);
 extern struct inode * ext3_orphan_get (struct super_block *, unsigned long);
@@ -772,9 +771,9 @@ extern unsigned long ext3_count_free (st


 /* inode.c */
-extern int ext3_forget(handle_t *, int, struct inode *, struct buffer_head *, int);
-extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *);
-extern struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *);
+extern int ext3_forget(handle_t *, int, struct inode *, struct buffer_head *, unsigned int);
+extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, unsigned long, int, int *);
+extern struct buffer_head * ext3_bread (handle_t *, struct inode *, unsigned int, int, int *);

 extern void ext3_read_inode (struct inode *);
 extern int  ext3_write_inode (struct inode *, int);
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/include/linux/ext3_fs_sb.h linux-2.6.16-rc6-4g/include/linux/ext3_fs_sb.h
--- linux-2.6.16-rc6.org/include/linux/ext3_fs_sb.h	2006-01-03 12:21:10.000000000 +0900
+++ linux-2.6.16-rc6-4g/include/linux/ext3_fs_sb.h	2006-03-14 12:06:35.000000000 +0900
@@ -20,7 +20,7 @@
 #include <linux/timer.h>
 #include <linux/wait.h>
 #include <linux/blockgroup_lock.h>
-#include <linux/percpu_counter.h>
+#include <linux/percpu_llcounter.h>
 #endif
 #include <linux/rbtree.h>

@@ -54,9 +54,9 @@ struct ext3_sb_info {
 	u32 s_next_generation;
 	u32 s_hash_seed[4];
 	int s_def_hash_version;
-	struct percpu_counter s_freeblocks_counter;
-	struct percpu_counter s_freeinodes_counter;
-	struct percpu_counter s_dirs_counter;
+	struct percpu_llcounter s_freeblocks_counter;
+	struct percpu_llcounter s_freeinodes_counter;
+	struct percpu_llcounter s_dirs_counter;
 	struct blockgroup_lock s_blockgroup_lock;

 	/* root of the per fs reservation window tree */
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/include/linux/percpu_llcounter.h linux-2.6.16-rc6-4g/include/linux/percpu_llcounter.h
--- linux-2.6.16-rc6.org/include/linux/percpu_llcounter.h	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.16-rc6-4g/include/linux/percpu_llcounter.h	2006-03-14 13:50:54.000000000 +0900
@@ -0,0 +1,113 @@
+#ifndef _LINUX_LLPERCPU_COUNTER_H
+#define _LINUX_LLPERCPU_COUNTER_H
+/*
+ * A simple "approximate counter" for use in ext2 and ext3 superblocks.
+ *
+ * WARNING: these things are HUGE.  4 kbytes per counter on 32-way P4.
+ */
+
+#include <linux/config.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <linux/threads.h>
+#include <linux/percpu.h>
+
+#ifdef CONFIG_SMP
+
+struct percpu_llcounter {
+	spinlock_t lock;
+	long long count;
+	long long *counters;
+};
+
+#if NR_CPUS >= 16
+#define FBC_BATCH	(NR_CPUS*2)
+#else
+#define FBC_BATCH	(NR_CPUS*4)
+#endif
+
+static inline void percpu_llcounter_init(struct percpu_llcounter *fbc)
+{
+	spin_lock_init(&fbc->lock);
+	fbc->count = 0;
+	fbc->counters = alloc_percpu(long long);
+}
+
+static inline void percpu_llcounter_destroy(struct percpu_llcounter *fbc)
+{
+	free_percpu(fbc->counters);
+}
+
+void percpu_llcounter_mod(struct percpu_llcounter *fbc, long long amount);
+long long percpu_llcounter_sum(struct percpu_llcounter *fbc);
+
+static inline long long percpu_llcounter_read(struct percpu_llcounter *fbc)
+{
+	return fbc->count;
+}
+
+/*
+ * It is possible for the percpu_llcounter_read() to return a small negative
+ * number for some counter which should never be negative.
+ */
+static inline long long percpu_llcounter_read_positive(struct percpu_llcounter *fbc)
+{
+	long long ret = fbc->count;
+
+	barrier();		/* Prevent reloads of fbc->count */
+	if (ret > 0)
+		return ret;
+	return 1;
+}
+
+#else
+
+struct percpu_llcounter {
+	long long count;
+};
+
+static inline void percpu_llcounter_init(struct percpu_llcounter *fbc)
+{
+	fbc->count = 0;
+}
+
+static inline void percpu_llcounter_destroy(struct percpu_llcounter *fbc)
+{
+}
+
+static inline void
+percpu_llcounter_mod(struct percpu_llcounter *fbc, long long amount)
+{
+	preempt_disable();
+	fbc->count += amount;
+	preempt_enable();
+}
+
+static inline long long percpu_llcounter_read(struct percpu_llcounter *fbc)
+{
+	return fbc->count;
+}
+
+static inline long long percpu_llcounter_read_positive(struct percpu_llcounter *fbc)
+{
+	return fbc->count;
+}
+
+static inline long long percpu_llcounter_sum(struct percpu_llcounter *fbc)
+{
+	return percpu_llcounter_read_positive(fbc);
+}
+
+#endif	/* CONFIG_SMP */
+
+static inline void percpu_llcounter_inc(struct percpu_llcounter *fbc)
+{
+	percpu_llcounter_mod(fbc, 1);
+}
+
+static inline void percpu_llcounter_dec(struct percpu_llcounter *fbc)
+{
+	percpu_llcounter_mod(fbc, -1);
+}
+
+#endif /* _LINUX_LLPERCPU_COUNTER_H */
diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/mm/swap.c linux-2.6.16-rc6-4g/mm/swap.c
--- linux-2.6.16-rc6.org/mm/swap.c	2006-03-14 09:09:07.000000000 +0900
+++ linux-2.6.16-rc6-4g/mm/swap.c	2006-03-14 13:47:18.000000000 +0900
@@ -26,6 +26,7 @@
 #include <linux/buffer_head.h>	/* for try_to_release_page() */
 #include <linux/module.h>
 #include <linux/percpu_counter.h>
+#include <linux/percpu_llcounter.h>
 #include <linux/percpu.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
@@ -498,6 +499,27 @@ void percpu_counter_mod(struct percpu_co
 }
 EXPORT_SYMBOL(percpu_counter_mod);

+void percpu_llcounter_mod(struct percpu_llcounter *fbc, long long amount)
+{
+	long long count;
+	long long *pcount;
+	int cpu = get_cpu();
+
+	pcount = per_cpu_ptr(fbc->counters, cpu);
+	count = *pcount + amount;
+	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+		spin_lock(&fbc->lock);
+		fbc->count += count;
+		*pcount = 0;
+		spin_unlock(&fbc->lock);
+	} else {
+		*pcount = count;
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(percpu_llcounter_mod);
+
+
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()
@@ -517,6 +539,26 @@ long percpu_counter_sum(struct percpu_co
 	return ret < 0 ? 0 : ret;
 }
 EXPORT_SYMBOL(percpu_counter_sum);
+
+/*
+ * Add up all the per-cpu counts, return the result.  This is a more accurate
+ * but much slower version of percpu_llcounter_read_positive()
+ */
+long long percpu_llcounter_sum(struct percpu_llcounter *fbc)
+{
+	long long ret;
+	int cpu;
+
+	spin_lock(&fbc->lock);
+	ret = fbc->count;
+	for_each_cpu(cpu) {
+		long long *pcount = per_cpu_ptr(fbc->counters, cpu);
+		ret += *pcount;
+	}
+	spin_unlock(&fbc->lock);
+	return ret < 0 ? 0 : ret;
+}
+EXPORT_SYMBOL(percpu_llcounter_sum);
 #endif

 /*
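
For readers following the batching logic in percpu_llcounter_mod() and
percpu_llcounter_sum() above, here is a minimal user-space sketch of the
same idea.  It is not kernel code: the sketch_* names, the fixed
SKETCH_NR_CPUS value and the explicit cpu argument are assumptions made
for illustration, and the locking done with fbc->lock in the patch is
left out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS	4			/* hypothetical CPU count */
#define SKETCH_BATCH	(SKETCH_NR_CPUS * 4)	/* same rule as FBC_BATCH for NR_CPUS < 16 */

struct sketch_llcounter {
	long long count;			/* shared, approximate total */
	long long local[SKETCH_NR_CPUS];	/* per-CPU deltas not yet folded in */
};

void sketch_init(struct sketch_llcounter *c)
{
	size_t i;

	c->count = 0;
	for (i = 0; i < SKETCH_NR_CPUS; i++)
		c->local[i] = 0;
}

/*
 * Add "amount" on behalf of one CPU.  The shared total is only touched
 * once the local delta reaches the batch size in either direction.
 */
void sketch_mod(struct sketch_llcounter *c, int cpu, long long amount)
{
	long long delta;

	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	delta = c->local[cpu] + amount;
	if (delta >= SKETCH_BATCH || delta <= -SKETCH_BATCH) {
		c->count += delta;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = delta;
	}
}

/* Cheap read: ignores whatever is still parked in the per-CPU slots. */
long long sketch_read(const struct sketch_llcounter *c)
{
	return c->count;
}

/* Slower read: adds every per-CPU delta and clamps at zero. */
long long sketch_sum(const struct sketch_llcounter *c)
{
	long long ret = c->count;
	size_t i;

	for (i = 0; i < SKETCH_NR_CPUS; i++)
		ret += c->local[i];
	return ret < 0 ? 0 : ret;
}
```

With a batch size of 16, fifteen increments on one CPU leave
sketch_read() at 0 while sketch_sum() already reports 15; the sixteenth
folds the whole delta into the shared total.  That gap is why the cheap
read is only approximate.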




* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-15 12:39 [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel) Takashi Sato
@ 2006-03-15 12:56 ` Laurent Vivier
  2006-03-16  2:19 ` Mingming Cao
  2006-03-19  2:20 ` Theodore Ts'o
  2 siblings, 0 replies; 34+ messages in thread
From: Laurent Vivier @ 2006-03-15 12:56 UTC (permalink / raw)
  To: Takashi Sato; +Cc: ext2-devel, linux-kernel

Hi Takashi,

I have already done this work, and more completely, by using the
"sector_t" type, which gives 64-bit block addressing.

You can have a look at http://www.bullopensource.org/ext4

Regards,
Laurent

On Wed, 15/03/2006 at 13:39, Takashi Sato wrote:
> Hi,
> 
> As disks grow larger, storage systems with multi-terabyte capacity
> have become available.  However, ext2/3 currently cannot support a
> filesystem larger than 8TB with a 4K block size, so I think the
> maximum filesystem size of ext2/3 should be extended.
> 
> I'd like to extend the maximum filesystem size of ext2/3 from 8TB to
> 16TB by raising the maximum number of blocks from 2G-1 (2^31-1) to
> 4G-1 (2^32-1), as described below.
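
The 8TB and 16TB figures follow directly from the block-count limits.
A small sketch of the arithmetic, where max_fs_bytes() is a
hypothetical helper and not part of the patch:

```c
#include <assert.h>

/* Largest filesystem size in bytes for a given block-count limit. */
unsigned long long max_fs_bytes(unsigned long long max_blocks,
				unsigned int block_size)
{
	assert(block_size != 0);
	/* The product is computed in unsigned long long, so it cannot
	 * wrap for 32-bit block counts and block sizes up to 64K. */
	return max_blocks * block_size;
}
```

With 4096-byte blocks, 2^31-1 blocks comes to one block short of 8TiB
and 2^32-1 blocks to one block short of 16TiB.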
> 
> The maximum number of blocks is limited to 2G-1 (2^31-1) on ext2/3
> because of the following problems.
> 
> - The number of blocks is held in signed 4-byte variables in some
>   of the ext2/3 kernel code.
> 
> - Some bit-manipulation functions, such as ext2fs_set_bit() and
>   ext2fs_test_bit(), use assembler instructions that cannot handle
>   more than 2GB.  mke2fs calls these functions on the x86 and
>   mc68000 architectures.
> 
> - Block numbers and inode numbers are printed with signed format
>   strings (%d, %ld) in many places in both the kernel and the commands.
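
To make the signedness problem concrete, here is a small user-space
sketch.  It assumes a 32-bit two's-complement int, and the two
format_block_*() helpers are hypothetical names used only for this
illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Print a block number the way the patched code does, with %u. */
int format_block_unsigned(char *buf, size_t len, unsigned int block)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%u", block);
}

/*
 * Reinterpret the same 32 bits as a signed int, which is what happens
 * when a block number is kept in an "int" and printed with %d.
 */
int format_block_signed(char *buf, size_t len, unsigned int block)
{
	int as_signed;

	assert(buf != NULL);
	memcpy(&as_signed, &block, sizeof(as_signed));
	return snprintf(buf, len, "%d", as_signed);
}
```

For block 3000000000, which is above 2^31-1, the unsigned helper
produces "3000000000" while the signed one produces "-1294967296".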
> 
> This patch set is composed of two parts, for the kernel and e2fsprogs.
> 
> [1/2] kernel(linux 2.6.16-rc6)
>  - Change the signed 4-byte variables that hold block numbers and
>    inode numbers to unsigned.
> 
>  - Change the format strings (%d, %ld) used for block numbers and
>    inode numbers to %u or %lu.
> 
>  - ext2/ext3_sb_info uses the percpu_counter structure to count
>    blocks and inodes, and its counter is a "long".  I added a new
>    structure, percpu_llcounter, whose counter is a "long long".
> 
> [2/2] Commands(e2fsprogs-1.38)
>  - Call the C functions ext2fs_set_bit() and ext2fs_test_bit()
>    defined in lib/ext2fs/bitops.c on the x86 and mc68000
>    architectures.  This makes it possible to create an ext2/3
>    filesystem with more than 2G blocks using mke2fs with the -F option.
>  - Change the format strings (%d, %ld) used for block numbers and
>    inode numbers to %u or %lu.
> 
> Any feedback and comments are welcome.
> 
> Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
> ---
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/balloc.c linux-2.6.16-rc6-4g/fs/ext2/balloc.c
> --- linux-2.6.16-rc6.org/fs/ext2/balloc.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext2/balloc.c	2006-03-14 09:29:01.000000000 +0900
> @@ -99,14 +99,14 @@ error_out:
>   * Set sb->s_dirt here because the superblock was "logically" altered.  We
>   * need to recalculate its free blocks count and flush it out.
>   */
> -static int reserve_blocks(struct super_block *sb, int count)
> +static unsigned int reserve_blocks(struct super_block *sb, unsigned int count)
>  {
>  	struct ext2_sb_info *sbi = EXT2_SB(sb);
>  	struct ext2_super_block *es = sbi->s_es;
> -	unsigned free_blocks;
> +	unsigned int free_blocks;
>  	unsigned root_blocks;
> 
> -	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
> +	free_blocks = percpu_llcounter_read_positive(&sbi->s_freeblocks_counter);
>  	root_blocks = le32_to_cpu(es->s_r_blocks_count);
> 
>  	if (free_blocks < count)
> @@ -125,23 +125,23 @@ static int reserve_blocks(struct super_b
>  			return 0;
>  	}
> 
> -	percpu_counter_mod(&sbi->s_freeblocks_counter, -count);
> +	percpu_llcounter_mod(&sbi->s_freeblocks_counter, -count);
>  	sb->s_dirt = 1;
>  	return count;
>  }
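
One detail in the hunk above may be worth double-checking: count is now
an unsigned int, so -count is evaluated in unsigned arithmetic before it
is widened to the long long parameter of percpu_llcounter_mod().  The
sketch below shows the effect in user space, assuming a 32-bit unsigned
int and a 64-bit long long; demo_mod(), reserve_as_posted() and
reserve_widened() are hypothetical stand-ins, not functions from the
patch.

```c
#include <assert.h>

/* Stand-in for percpu_llcounter_mod(): it only accumulates. */
void demo_mod(long long *total, long long amount)
{
	assert(total != 0);
	*total += amount;
}

/* Same expression as in the hunk: negate first, widen afterwards. */
long long reserve_as_posted(long long free_blocks, unsigned int count)
{
	demo_mod(&free_blocks, -count);	/* -count wraps as unsigned int */
	return free_blocks;
}

/* Widen first, then negate, so the amount is really negative. */
long long reserve_widened(long long free_blocks, unsigned int count)
{
	demo_mod(&free_blocks, -(long long)count);
	return free_blocks;
}
```

Under those assumptions reserve_as_posted(100, 1) yields
100 + 4294967295 rather than 99, while the widened form subtracts as
intended.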
> 
> -static void release_blocks(struct super_block *sb, int count)
> +static void release_blocks(struct super_block *sb, unsigned int count)
>  {
>  	if (count) {
>  		struct ext2_sb_info *sbi = EXT2_SB(sb);
> 
> -		percpu_counter_mod(&sbi->s_freeblocks_counter, count);
> +		percpu_llcounter_mod(&sbi->s_freeblocks_counter, count);
>  		sb->s_dirt = 1;
>  	}
>  }
> 
> -static int group_reserve_blocks(struct ext2_sb_info *sbi, int group_no,
> -	struct ext2_group_desc *desc, struct buffer_head *bh, int count)
> +static unsigned int group_reserve_blocks(struct ext2_sb_info *sbi, int group_no,
> +	struct ext2_group_desc *desc, struct buffer_head *bh, unsigned int count)
>  {
>  	unsigned free_blocks;
> 
> @@ -159,7 +159,7 @@ static int group_reserve_blocks(struct e
>  }
> 
>  static void group_release_blocks(struct super_block *sb, int group_no,
> -	struct ext2_group_desc *desc, struct buffer_head *bh, int count)
> +	struct ext2_group_desc *desc, struct buffer_head *bh, unsigned int count)
>  {
>  	if (count) {
>  		struct ext2_sb_info *sbi = EXT2_SB(sb);
> @@ -324,7 +324,7 @@ got_it:
>   * bitmap, and then for any free bit if that fails.
>   * This function also updates quota and i_blocks field.
>   */
> -int ext2_new_block(struct inode *inode, unsigned long goal,
> +unsigned int ext2_new_block(struct inode *inode, unsigned long goal,
>  			u32 *prealloc_count, u32 *prealloc_block, int *err)
>  {
>  	struct buffer_head *bitmap_bh = NULL;
> @@ -333,8 +333,8 @@ int ext2_new_block(struct inode *inode,
>  	int group_no;			/* i */
>  	int ret_block;			/* j */
>  	int group_idx;			/* k */
> -	int target_block;		/* tmp */
> -	int block = 0;
> +	unsigned int target_block;	/* tmp */
> +	unsigned int block = 0;
>  	struct super_block *sb = inode->i_sb;
>  	struct ext2_sb_info *sbi = EXT2_SB(sb);
>  	struct ext2_super_block *es = sbi->s_es;
> @@ -447,7 +447,6 @@ retry:
>  		group_alloc = 0;
>  		goto retry;
>  	}
> -
>  got_block:
>  	ext2_debug("using block group %d(%d)\n",
>  		group_no, desc->bg_free_blocks_count);
> @@ -465,7 +464,7 @@ got_block:
> 
>  	if (target_block >= le32_to_cpu(es->s_blocks_count)) {
>  		ext2_error (sb, "ext2_new_block",
> -			    "block(%d) >= blocks count(%d) - "
> +			    "block(%d) >= blocks count(%u) - "
>  			    "block_group = %d, es == %p ", ret_block,
>  			le32_to_cpu(es->s_blocks_count), group_no, es);
>  		goto io_error;
> @@ -504,7 +503,7 @@ got_block:
>  	if (sb->s_flags & MS_SYNCHRONOUS)
>  		sync_dirty_buffer(bitmap_bh);
> 
> -	ext2_debug ("allocating block %d. ", block);
> +	ext2_debug ("allocating block %u. ", block);
> 
>  	*err = 0;
>  out_release:
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/ext2.h linux-2.6.16-rc6-4g/fs/ext2/ext2.h
> --- linux-2.6.16-rc6.org/fs/ext2/ext2.h	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext2/ext2.h	2006-03-14 09:29:01.000000000 +0900
> @@ -91,7 +91,7 @@ static inline struct ext2_inode_info *EX
>  /* balloc.c */
>  extern int ext2_bg_has_super(struct super_block *sb, int group);
>  extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
> -extern int ext2_new_block (struct inode *, unsigned long,
> +extern unsigned int ext2_new_block (struct inode *, unsigned long,
>  			   __u32 *, __u32 *, int *);
>  extern void ext2_free_blocks (struct inode *, unsigned long,
>  			      unsigned long);
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/ialloc.c linux-2.6.16-rc6-4g/fs/ext2/ialloc.c
> --- linux-2.6.16-rc6.org/fs/ext2/ialloc.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext2/ialloc.c	2006-03-14 09:29:01.000000000 +0900
> @@ -83,7 +83,7 @@ static void ext2_release_inode(struct su
>  			cpu_to_le16(le16_to_cpu(desc->bg_used_dirs_count) - 1);
>  	spin_unlock(sb_bgl_lock(EXT2_SB(sb), group));
>  	if (dir)
> -		percpu_counter_dec(&EXT2_SB(sb)->s_dirs_counter);
> +		percpu_llcounter_dec(&EXT2_SB(sb)->s_dirs_counter);
>  	sb->s_dirt = 1;
>  	mark_buffer_dirty(bh);
>  }
> @@ -276,22 +276,20 @@ static int find_group_orlov(struct super
>  	struct ext2_super_block *es = sbi->s_es;
>  	int ngroups = sbi->s_groups_count;
>  	int inodes_per_group = EXT2_INODES_PER_GROUP(sb);
> -	int freei;
> +	unsigned long freei, free_blocks, ndirs;
>  	int avefreei;
> -	int free_blocks;
>  	int avefreeb;
>  	int blocks_per_dir;
> -	int ndirs;
>  	int max_debt, max_dirs, min_blocks, min_inodes;
>  	int group = -1, i;
>  	struct ext2_group_desc *desc;
>  	struct buffer_head *bh;
> 
> -	freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
> +	freei = percpu_llcounter_read_positive(&sbi->s_freeinodes_counter);
>  	avefreei = freei / ngroups;
> -	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
> +	free_blocks = percpu_llcounter_read_positive(&sbi->s_freeblocks_counter);
>  	avefreeb = free_blocks / ngroups;
> -	ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
> +	ndirs = percpu_llcounter_read_positive(&sbi->s_dirs_counter);
> 
>  	if ((parent == sb->s_root->d_inode) ||
>  	    (EXT2_I(parent)->i_flags & EXT2_TOPDIR_FL)) {
> @@ -328,7 +326,7 @@ static int find_group_orlov(struct super
>  	}
> 
>  	if (ndirs == 0)
> -		ndirs = 1;	/* percpu_counters are approximate... */
> +		ndirs = 1;	/* percpu_llcounters are approximate... */
> 
>  	blocks_per_dir = (le32_to_cpu(es->s_blocks_count)-free_blocks) / ndirs;
> 
> @@ -543,9 +541,9 @@ got:
>  		goto fail;
>  	}
> 
> -	percpu_counter_mod(&sbi->s_freeinodes_counter, -1);
> +	percpu_llcounter_mod(&sbi->s_freeinodes_counter, -1);
>  	if (S_ISDIR(mode))
> -		percpu_counter_inc(&sbi->s_dirs_counter);
> +		percpu_llcounter_inc(&sbi->s_dirs_counter);
> 
>  	spin_lock(sb_bgl_lock(sbi, group));
>  	gdp->bg_free_inodes_count =
> @@ -670,7 +668,7 @@ unsigned long ext2_count_free_inodes (st
>  	}
>  	brelse(bitmap_bh);
>  	printk("ext2_count_free_inodes: stored = %lu, computed = %lu, %lu\n",
> -		percpu_counter_read(&EXT2_SB(sb)->s_freeinodes_counter),
> +		percpu_llcounter_read(&EXT2_SB(sb)->s_freeinodes_counter),
>  		desc_count, bitmap_count);
>  	unlock_super(sb);
>  	return desc_count;
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/inode.c linux-2.6.16-rc6-4g/fs/ext2/inode.c
> --- linux-2.6.16-rc6.org/fs/ext2/inode.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext2/inode.c	2006-03-15 21:16:51.000000000 +0900
> @@ -107,7 +107,7 @@ void ext2_discard_prealloc (struct inode
>  #endif
>  }
> 
> -static int ext2_alloc_block (struct inode * inode, unsigned long goal, int *err)
> +static unsigned int ext2_alloc_block (struct inode * inode, unsigned int goal, int *err)
>  {
>  #ifdef EXT2FS_DEBUG
>  	static unsigned long alloc_hits, alloc_attempts;
> @@ -193,8 +193,8 @@ static inline int verify_chain(Indirect
>   * get there at all.
>   */
> 
> -static int ext2_block_to_path(struct inode *inode,
> -			long i_block, int offsets[4], int *boundary)
> +static int ext2_block_to_path(struct inode *inode, unsigned long i_block,
> +				unsigned int offsets[4], int *boundary)
>  {
>  	int ptrs = EXT2_ADDR_PER_BLOCK(inode->i_sb);
>  	int ptrs_bits = EXT2_ADDR_PER_BLOCK_BITS(inode->i_sb);
> @@ -263,7 +263,7 @@ static int ext2_block_to_path(struct ino
>   */
>  static Indirect *ext2_get_branch(struct inode *inode,
>  				 int depth,
> -				 int *offsets,
> +				 unsigned int *offsets,
>  				 Indirect chain[4],
>  				 int *err)
>  {
> @@ -363,7 +363,7 @@ static unsigned long ext2_find_near(stru
>   */
> 
>  static inline int ext2_find_goal(struct inode *inode,
> -				 long block,
> +				 unsigned long block,
>  				 Indirect chain[4],
>  				 Indirect *partial,
>  				 unsigned long *goal)
> @@ -418,20 +418,20 @@ static inline int ext2_find_goal(struct
>  static int ext2_alloc_branch(struct inode *inode,
>  			     int num,
>  			     unsigned long goal,
> -			     int *offsets,
> +			     unsigned int *offsets,
>  			     Indirect *branch)
>  {
>  	int blocksize = inode->i_sb->s_blocksize;
>  	int n = 0;
>  	int err;
>  	int i;
> -	int parent = ext2_alloc_block(inode, goal, &err);
> +	unsigned int parent = ext2_alloc_block(inode, goal, &err);
> 
>  	branch[0].key = cpu_to_le32(parent);
>  	if (parent) for (n = 1; n < num; n++) {
>  		struct buffer_head *bh;
>  		/* Allocate the next block */
> -		int nr = ext2_alloc_block(inode, parent, &err);
> +		unsigned int nr = ext2_alloc_block(inode, parent, &err);
>  		if (!nr)
>  			break;
>  		branch[n].key = cpu_to_le32(nr);
> @@ -489,7 +489,7 @@ static int ext2_alloc_branch(struct inod
>   */
> 
>  static inline int ext2_splice_branch(struct inode *inode,
> -				     long block,
> +				     unsigned long block,
>  				     Indirect chain[4],
>  				     Indirect *where,
>  				     int num)
> @@ -547,7 +547,7 @@ changed:
>  int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create)
>  {
>  	int err = -EIO;
> -	int offsets[4];
> +	unsigned int offsets[4];
>  	Indirect chain[4];
>  	Indirect *partial;
>  	unsigned long goal;
> @@ -776,7 +776,7 @@ static inline int all_zeroes(__le32 *p,
> 
>  static Indirect *ext2_find_shared(struct inode *inode,
>  				int depth,
> -				int offsets[4],
> +				unsigned int offsets[4],
>  				Indirect chain[4],
>  				__le32 *top)
>  {
> @@ -892,7 +892,7 @@ static void ext2_free_branches(struct in
>  			 */
>  			if (!bh) {
>  				ext2_error(inode->i_sb, "ext2_free_branches",
> -					"Read failure, inode=%ld, block=%ld",
> +					"Read failure, inode=%lu, block=%lu",
>  					inode->i_ino, nr);
>  				continue;
>  			}
> @@ -912,12 +912,12 @@ void ext2_truncate (struct inode * inode
>  {
>  	__le32 *i_data = EXT2_I(inode)->i_data;
>  	int addr_per_block = EXT2_ADDR_PER_BLOCK(inode->i_sb);
> -	int offsets[4];
> +	unsigned int offsets[4];
>  	Indirect chain[4];
>  	Indirect *partial;
>  	__le32 nr = 0;
>  	int n;
> -	long iblock;
> +	unsigned long iblock;
>  	unsigned blocksize;
> 
>  	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/super.c linux-2.6.16-rc6-4g/fs/ext2/super.c
> --- linux-2.6.16-rc6.org/fs/ext2/super.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext2/super.c	2006-03-14 09:29:01.000000000 +0900
> @@ -126,9 +126,9 @@ static void ext2_put_super (struct super
>  			brelse (sbi->s_group_desc[i]);
>  	kfree(sbi->s_group_desc);
>  	kfree(sbi->s_debts);
> -	percpu_counter_destroy(&sbi->s_freeblocks_counter);
> -	percpu_counter_destroy(&sbi->s_freeinodes_counter);
> -	percpu_counter_destroy(&sbi->s_dirs_counter);
> +	percpu_llcounter_destroy(&sbi->s_freeblocks_counter);
> +	percpu_llcounter_destroy(&sbi->s_freeinodes_counter);
> +	percpu_llcounter_destroy(&sbi->s_dirs_counter);
>  	brelse (sbi->s_sbh);
>  	sb->s_fs_info = NULL;
>  	kfree(sbi);
> @@ -836,9 +836,9 @@ static int ext2_fill_super(struct super_
>  		printk ("EXT2-fs: not enough memory\n");
>  		goto failed_mount;
>  	}
> -	percpu_counter_init(&sbi->s_freeblocks_counter);
> -	percpu_counter_init(&sbi->s_freeinodes_counter);
> -	percpu_counter_init(&sbi->s_dirs_counter);
> +	percpu_llcounter_init(&sbi->s_freeblocks_counter);
> +	percpu_llcounter_init(&sbi->s_freeinodes_counter);
> +	percpu_llcounter_init(&sbi->s_dirs_counter);
>  	bgl_lock_init(&sbi->s_blockgroup_lock);
>  	sbi->s_debts = kmalloc(sbi->s_groups_count * sizeof(*sbi->s_debts),
>  			       GFP_KERNEL);
> @@ -888,11 +888,11 @@ static int ext2_fill_super(struct super_
>  		ext2_warning(sb, __FUNCTION__,
>  			"mounting ext3 filesystem as ext2");
>  	ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY);
> -	percpu_counter_mod(&sbi->s_freeblocks_counter,
> +	percpu_llcounter_mod(&sbi->s_freeblocks_counter,
>  				ext2_count_free_blocks(sb));
> -	percpu_counter_mod(&sbi->s_freeinodes_counter,
> +	percpu_llcounter_mod(&sbi->s_freeinodes_counter,
>  				ext2_count_free_inodes(sb));
> -	percpu_counter_mod(&sbi->s_dirs_counter,
> +	percpu_llcounter_mod(&sbi->s_dirs_counter,
>  				ext2_count_dirs(sb));
>  	return 0;
> 
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/xattr.c linux-2.6.16-rc6-4g/fs/ext2/xattr.c
> --- linux-2.6.16-rc6.org/fs/ext2/xattr.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext2/xattr.c	2006-03-14 09:29:01.000000000 +0900
> @@ -71,7 +71,7 @@
> 
>  #ifdef EXT2_XATTR_DEBUG
>  # define ea_idebug(inode, f...) do { \
> -		printk(KERN_DEBUG "inode %s:%ld: ", \
> +		printk(KERN_DEBUG "inode %s:%lu: ", \
>  			inode->i_sb->s_id, inode->i_ino); \
>  		printk(f); \
>  		printk("\n"); \
> @@ -164,7 +164,7 @@ ext2_xattr_get(struct inode *inode, int
>  	error = -ENODATA;
>  	if (!EXT2_I(inode)->i_file_acl)
>  		goto cleanup;
> -	ea_idebug(inode, "reading block %d", EXT2_I(inode)->i_file_acl);
> +	ea_idebug(inode, "reading block %u", EXT2_I(inode)->i_file_acl);
>  	bh = sb_bread(inode->i_sb, EXT2_I(inode)->i_file_acl);
>  	error = -EIO;
>  	if (!bh)
> @@ -175,7 +175,7 @@ ext2_xattr_get(struct inode *inode, int
>  	if (HDR(bh)->h_magic != cpu_to_le32(EXT2_XATTR_MAGIC) ||
>  	    HDR(bh)->h_blocks != cpu_to_le32(1)) {
>  bad_block:	ext2_error(inode->i_sb, "ext2_xattr_get",
> -			"inode %ld: bad block %d", inode->i_ino,
> +			"inode %lu: bad block %u", inode->i_ino,
>  			EXT2_I(inode)->i_file_acl);
>  		error = -EIO;
>  		goto cleanup;
> @@ -264,7 +264,7 @@ ext2_xattr_list(struct inode *inode, cha
>  	error = 0;
>  	if (!EXT2_I(inode)->i_file_acl)
>  		goto cleanup;
> -	ea_idebug(inode, "reading block %d", EXT2_I(inode)->i_file_acl);
> +	ea_idebug(inode, "reading block %u", EXT2_I(inode)->i_file_acl);
>  	bh = sb_bread(inode->i_sb, EXT2_I(inode)->i_file_acl);
>  	error = -EIO;
>  	if (!bh)
> @@ -275,7 +275,7 @@ ext2_xattr_list(struct inode *inode, cha
>  	if (HDR(bh)->h_magic != cpu_to_le32(EXT2_XATTR_MAGIC) ||
>  	    HDR(bh)->h_blocks != cpu_to_le32(1)) {
>  bad_block:	ext2_error(inode->i_sb, "ext2_xattr_list",
> -			"inode %ld: bad block %d", inode->i_ino,
> +			"inode %lu: bad block %u", inode->i_ino,
>  			EXT2_I(inode)->i_file_acl);
>  		error = -EIO;
>  		goto cleanup;
> @@ -411,7 +411,7 @@ ext2_xattr_set(struct inode *inode, int
>  		if (header->h_magic != cpu_to_le32(EXT2_XATTR_MAGIC) ||
>  		    header->h_blocks != cpu_to_le32(1)) {
>  bad_block:		ext2_error(sb, "ext2_xattr_set",
> -				"inode %ld: bad block %d", inode->i_ino,
> +				"inode %lu: bad block %u", inode->i_ino,
>  				   EXT2_I(inode)->i_file_acl);
>  			error = -EIO;
>  			goto cleanup;
> @@ -664,15 +664,15 @@ ext2_xattr_set2(struct inode *inode, str
>  			ext2_xattr_cache_insert(new_bh);
>  		} else {
>  			/* We need to allocate a new block */
> -			int goal = le32_to_cpu(EXT2_SB(sb)->s_es->
> +			unsigned int goal = le32_to_cpu(EXT2_SB(sb)->s_es->
>  						           s_first_data_block) +
>  				   EXT2_I(inode)->i_block_group *
>  				   EXT2_BLOCKS_PER_GROUP(sb);
> -			int block = ext2_new_block(inode, goal,
> +			unsigned int block = ext2_new_block(inode, goal,
>  						   NULL, NULL, &error);
>  			if (error)
>  				goto cleanup;
> -			ea_idebug(inode, "creating block %d", block);
> +			ea_idebug(inode, "creating block %u", block);
> 
>  			new_bh = sb_getblk(sb, block);
>  			if (!new_bh) {
> @@ -772,7 +772,7 @@ ext2_xattr_delete_inode(struct inode *in
>  	bh = sb_bread(inode->i_sb, EXT2_I(inode)->i_file_acl);
>  	if (!bh) {
>  		ext2_error(inode->i_sb, "ext2_xattr_delete_inode",
> -			"inode %ld: block %d read error", inode->i_ino,
> +			"inode %lu: block %u read error", inode->i_ino,
>  			EXT2_I(inode)->i_file_acl);
>  		goto cleanup;
>  	}
> @@ -780,7 +780,7 @@ ext2_xattr_delete_inode(struct inode *in
>  	if (HDR(bh)->h_magic != cpu_to_le32(EXT2_XATTR_MAGIC) ||
>  	    HDR(bh)->h_blocks != cpu_to_le32(1)) {
>  		ext2_error(inode->i_sb, "ext2_xattr_delete_inode",
> -			"inode %ld: bad block %d", inode->i_ino,
> +			"inode %lu: bad block %u", inode->i_ino,
>  			EXT2_I(inode)->i_file_acl);
>  		goto cleanup;
>  	}
> @@ -931,13 +931,13 @@ again:
>  		bh = sb_bread(inode->i_sb, ce->e_block);
>  		if (!bh) {
>  			ext2_error(inode->i_sb, "ext2_xattr_cache_find",
> -				"inode %ld: block %ld read error",
> +				"inode %lu: block %lu read error",
>  				inode->i_ino, (unsigned long) ce->e_block);
>  		} else {
>  			lock_buffer(bh);
>  			if (le32_to_cpu(HDR(bh)->h_refcount) >
>  				   EXT2_XATTR_REFCOUNT_MAX) {
> -				ea_idebug(inode, "block %ld refcount %d>%d",
> +				ea_idebug(inode, "block %lu refcount %d>%d",
>  					  (unsigned long) ce->e_block,
>  					  le32_to_cpu(HDR(bh)->h_refcount),
>  					  EXT2_XATTR_REFCOUNT_MAX);
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext2/xip.c linux-2.6.16-rc6-4g/fs/ext2/xip.c
> --- linux-2.6.16-rc6.org/fs/ext2/xip.c	2006-01-03 12:21:10.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext2/xip.c	2006-03-14 09:29:01.000000000 +0900
> @@ -44,8 +44,8 @@ __ext2_get_sector(struct inode *inode, s
>  	return rc;
>  }
> 
> -int
> -ext2_clear_xip_target(struct inode *inode, int block)
> +unsigned int
> +ext2_clear_xip_target(struct inode *inode, unsigned int block)
>  {
>  	sector_t sector = block * (PAGE_SIZE/512);
>  	unsigned long data;
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/balloc.c linux-2.6.16-rc6-4g/fs/ext3/balloc.c
> --- linux-2.6.16-rc6.org/fs/ext3/balloc.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext3/balloc.c	2006-03-14 09:29:01.000000000 +0900
> @@ -36,7 +36,6 @@
>   * when a file system is mounted (see ext3_read_super).
>   */
> 
> -
>  #define in_range(b, first, len)	((b) >= (first) && (b) <= (first) + (len) - 1)
> 
>  struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb,
> @@ -467,7 +466,7 @@ do_more:
>  		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
>  			group_freed);
>  	spin_unlock(sb_bgl_lock(sbi, block_group));
> -	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
> +	percpu_llcounter_mod(&sbi->s_freeblocks_counter, count);
> 
>  	/* We dirtied the bitmap block */
>  	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
> @@ -1118,9 +1117,10 @@ out:
> 
>  static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
>  {
> -	int free_blocks, root_blocks;
> +	unsigned long free_blocks;
> +	int root_blocks;
> 
> -	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
> +	free_blocks = percpu_llcounter_read_positive(&sbi->s_freeblocks_counter);
>  	root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
>  	if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
>  		sbi->s_resuid != current->fsuid &&
> @@ -1154,19 +1154,20 @@ int ext3_should_retry_alloc(struct super
>   * bitmap, and then for any free bit if that fails.
>   * This function also updates quota and i_blocks field.
>   */
> -int ext3_new_block(handle_t *handle, struct inode *inode,
> +unsigned int ext3_new_block(handle_t *handle, struct inode *inode,
>  			unsigned long goal, int *errp)
>  {
>  	struct buffer_head *bitmap_bh = NULL;
>  	struct buffer_head *gdp_bh;
>  	int group_no;
>  	int goal_group;
> -	int ret_block;
> +	unsigned int ret_block;
>  	int bgi;			/* blockgroup iteration index */
> -	int target_block;
> +	unsigned int target_block;
>  	int fatal = 0, err;
>  	int performed_allocation = 0;
>  	int free_blocks;
> +	int group_block;
>  	struct super_block *sb;
>  	struct ext3_group_desc *gdp;
>  	struct ext3_super_block *es;
> @@ -1238,17 +1239,19 @@ retry:
>  		my_rsv = NULL;
> 
>  	if (free_blocks > 0) {
> -		ret_block = ((goal - le32_to_cpu(es->s_first_data_block)) %
> +		group_block = ((goal - le32_to_cpu(es->s_first_data_block)) %
>  				EXT3_BLOCKS_PER_GROUP(sb));
>  		bitmap_bh = read_block_bitmap(sb, group_no);
>  		if (!bitmap_bh)
>  			goto io_error;
> -		ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
> -					bitmap_bh, ret_block, my_rsv, &fatal);
> +		group_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
> +					bitmap_bh, group_block, my_rsv, &fatal);
>  		if (fatal)
>  			goto out;
> -		if (ret_block >= 0)
> +		if (group_block >= 0) {
> +			ret_block = group_block;
>  			goto allocated;
> +		}
>  	}
> 
>  	ngroups = EXT3_SB(sb)->s_groups_count;
> @@ -1280,12 +1283,14 @@ retry:
>  		bitmap_bh = read_block_bitmap(sb, group_no);
>  		if (!bitmap_bh)
>  			goto io_error;
> -		ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
> +		group_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no,
>  					bitmap_bh, -1, my_rsv, &fatal);
>  		if (fatal)
>  			goto out;
> -		if (ret_block >= 0)
> +		if (group_block >= 0) {
> +			ret_block = group_block;
>  			goto allocated;
> +		}
>  	}
>  	/*
>  	 * We may end up a bogus ealier ENOSPC error due to
> @@ -1347,7 +1352,7 @@ allocated:
>  				"b_committed_data\n", __FUNCTION__);
>  		}
>  	}
> -	ext3_debug("found bit %d\n", ret_block);
> +	ext3_debug("found bit %u\n", ret_block);
>  	spin_unlock(sb_bgl_lock(sbi, group_no));
>  	jbd_unlock_bh_state(bitmap_bh);
>  #endif
> @@ -1357,8 +1362,8 @@ allocated:
> 
>  	if (ret_block >= le32_to_cpu(es->s_blocks_count)) {
>  		ext3_error(sb, "ext3_new_block",
> -			    "block(%d) >= blocks count(%d) - "
> -			    "block_group = %d, es == %p ", ret_block,
> +			    "block(%u) >= blocks count(%u) - "
> +			    "block_group = %u, es == %p ", ret_block,
>  			le32_to_cpu(es->s_blocks_count), group_no, es);
>  		goto out;
>  	}
> @@ -1368,14 +1373,14 @@ allocated:
>  	 * list of some description.  We don't know in advance whether
>  	 * the caller wants to use it as metadata or data.
>  	 */
> -	ext3_debug("allocating block %d. Goal hits %d of %d.\n",
> +	ext3_debug("allocating block %u. Goal hits %d of %d.\n",
>  			ret_block, goal_hits, goal_attempts);
> 
>  	spin_lock(sb_bgl_lock(sbi, group_no));
>  	gdp->bg_free_blocks_count =
>  			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - 1);
>  	spin_unlock(sb_bgl_lock(sbi, group_no));
> -	percpu_counter_mod(&sbi->s_freeblocks_counter, -1);
> +	percpu_llcounter_mod(&sbi->s_freeblocks_counter, -1);
> 
>  	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
>  	err = ext3_journal_dirty_metadata(handle, gdp_bh);
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/ialloc.c linux-2.6.16-rc6-4g/fs/ext3/ialloc.c
> --- linux-2.6.16-rc6.org/fs/ext3/ialloc.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext3/ialloc.c	2006-03-14 09:29:01.000000000 +0900
> @@ -170,9 +170,9 @@ void ext3_free_inode (handle_t *handle,
>  				gdp->bg_used_dirs_count = cpu_to_le16(
>  				  le16_to_cpu(gdp->bg_used_dirs_count) - 1);
>  			spin_unlock(sb_bgl_lock(sbi, block_group));
> -			percpu_counter_inc(&sbi->s_freeinodes_counter);
> +			percpu_llcounter_inc(&sbi->s_freeinodes_counter);
>  			if (is_directory)
> -				percpu_counter_dec(&sbi->s_dirs_counter);
> +				percpu_llcounter_dec(&sbi->s_dirs_counter);
> 
>  		}
>  		BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
> @@ -202,12 +202,13 @@ error_return:
>  static int find_group_dir(struct super_block *sb, struct inode *parent)
>  {
>  	int ngroups = EXT3_SB(sb)->s_groups_count;
> -	int freei, avefreei;
> +	unsigned long freei;
> +	int avefreei;
>  	struct ext3_group_desc *desc, *best_desc = NULL;
>  	struct buffer_head *bh;
>  	int group, best_group = -1;
> 
> -	freei = percpu_counter_read_positive(&EXT3_SB(sb)->s_freeinodes_counter);
> +	freei = percpu_llcounter_read_positive(&EXT3_SB(sb)->s_freeinodes_counter);
>  	avefreei = freei / ngroups;
> 
>  	for (group = 0; group < ngroups; group++) {
> @@ -261,19 +262,20 @@ static int find_group_orlov(struct super
>  	struct ext3_super_block *es = sbi->s_es;
>  	int ngroups = sbi->s_groups_count;
>  	int inodes_per_group = EXT3_INODES_PER_GROUP(sb);
> -	int freei, avefreei;
> -	int freeb, avefreeb;
> -	int blocks_per_dir, ndirs;
> +	unsigned long freei, freeb, ndirs;
> +	int avefreei;
> +	int avefreeb;
> +	int blocks_per_dir;
>  	int max_debt, max_dirs, min_blocks, min_inodes;
>  	int group = -1, i;
>  	struct ext3_group_desc *desc;
>  	struct buffer_head *bh;
> 
> -	freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
> +	freei = percpu_llcounter_read_positive(&sbi->s_freeinodes_counter);
>  	avefreei = freei / ngroups;
> -	freeb = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
> +	freeb = percpu_llcounter_read_positive(&sbi->s_freeblocks_counter);
>  	avefreeb = freeb / ngroups;
> -	ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
> +	ndirs = percpu_llcounter_read_positive(&sbi->s_dirs_counter);
> 
>  	if ((parent == sb->s_root->d_inode) ||
>  	    (EXT3_I(parent)->i_flags & EXT3_TOPDIR_FL)) {
> @@ -539,9 +541,9 @@ got:
>  	err = ext3_journal_dirty_metadata(handle, bh2);
>  	if (err) goto fail;
> 
> -	percpu_counter_dec(&sbi->s_freeinodes_counter);
> +	percpu_llcounter_dec(&sbi->s_freeinodes_counter);
>  	if (S_ISDIR(mode))
> -		percpu_counter_inc(&sbi->s_dirs_counter);
> +		percpu_llcounter_inc(&sbi->s_dirs_counter);
>  	sb->s_dirt = 1;
> 
>  	inode->i_uid = current->fsuid;
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/inode.c linux-2.6.16-rc6-4g/fs/ext3/inode.c
> --- linux-2.6.16-rc6.org/fs/ext3/inode.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext3/inode.c	2006-03-14 09:29:01.000000000 +0900
> @@ -64,7 +64,7 @@ static inline int ext3_inode_is_fast_sym
> 
>  int ext3_forget(handle_t *handle, int is_metadata,
>  		       struct inode *inode, struct buffer_head *bh,
> -		       int blocknr)
> +		       unsigned int blocknr)
>  {
>  	int err;
> 
> @@ -235,10 +235,10 @@ no_delete:
>  	clear_inode(inode);	/* We must guarantee clearing of inode... */
>  }
> 
> -static int ext3_alloc_block (handle_t *handle,
> -			struct inode * inode, unsigned long goal, int *err)
> +static unsigned int ext3_alloc_block (handle_t *handle,
> +			struct inode * inode, unsigned int goal, int *err)
>  {
> -	unsigned long result;
> +	unsigned int result;
> 
>  	result = ext3_new_block(handle, inode, goal, err);
>  	return result;
> @@ -296,7 +296,7 @@ static inline int verify_chain(Indirect
>   */
> 
>  static int ext3_block_to_path(struct inode *inode,
> -			long i_block, int offsets[4], int *boundary)
> +			unsigned long i_block, unsigned int offsets[4], int *boundary)
>  {
>  	int ptrs = EXT3_ADDR_PER_BLOCK(inode->i_sb);
>  	int ptrs_bits = EXT3_ADDR_PER_BLOCK_BITS(inode->i_sb);
> @@ -363,7 +363,7 @@ static int ext3_block_to_path(struct ino
>   *	or when it reads all @depth-1 indirect blocks successfully and finds
>   *	the whole chain, all way to the data (returns %NULL, *err == 0).
>   */
> -static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
> +static Indirect *ext3_get_branch(struct inode *inode, int depth, unsigned int *offsets,
>  				 Indirect chain[4], int *err)
>  {
>  	struct super_block *sb = inode->i_sb;
> @@ -460,7 +460,7 @@ static unsigned long ext3_find_near(stru
>   *	stores it in *@goal and returns zero.
>   */
> 
> -static unsigned long ext3_find_goal(struct inode *inode, long block,
> +static unsigned long ext3_find_goal(struct inode *inode, unsigned long block,
>  		Indirect chain[4], Indirect *partial)
>  {
>  	struct ext3_block_alloc_info *block_i =  EXT3_I(inode)->i_block_alloc_info;
> @@ -505,21 +505,21 @@ static unsigned long ext3_find_goal(stru
>  static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
>  			     int num,
>  			     unsigned long goal,
> -			     int *offsets,
> +			     unsigned int *offsets,
>  			     Indirect *branch)
>  {
>  	int blocksize = inode->i_sb->s_blocksize;
>  	int n = 0, keys = 0;
>  	int err = 0;
>  	int i;
> -	int parent = ext3_alloc_block(handle, inode, goal, &err);
> +	unsigned int parent = ext3_alloc_block(handle, inode, goal, &err);
> 
>  	branch[0].key = cpu_to_le32(parent);
>  	if (parent) {
>  		for (n = 1; n < num; n++) {
>  			struct buffer_head *bh;
>  			/* Allocate the next block */
> -			int nr = ext3_alloc_block(handle, inode, parent, &err);
> +			unsigned int nr = ext3_alloc_block(handle, inode, parent, &err);
>  			if (!nr)
>  				break;
>  			branch[n].key = cpu_to_le32(nr);
> @@ -585,7 +585,7 @@ static int ext3_alloc_branch(handle_t *h
>   *	chain to new block and return 0.
>   */
> 
> -static int ext3_splice_branch(handle_t *handle, struct inode *inode, long block,
> +static int ext3_splice_branch(handle_t *handle, struct inode *inode, unsigned long block,
>  			      Indirect chain[4], Indirect *where, int num)
>  {
>  	int i;
> @@ -676,7 +676,7 @@ ext3_get_block_handle(handle_t *handle,
>  		struct buffer_head *bh_result, int create, int extend_disksize)
>  {
>  	int err = -EIO;
> -	int offsets[4];
> +	unsigned int offsets[4];
>  	Indirect chain[4];
>  	Indirect *partial;
>  	unsigned long goal;
> @@ -852,7 +852,7 @@ get_block:
>   * `handle' can be NULL if create is zero
>   */
>  struct buffer_head *ext3_getblk(handle_t *handle, struct inode * inode,
> -				long block, int create, int * errp)
> +				unsigned long block, int create, int * errp)
>  {
>  	struct buffer_head dummy;
>  	int fatal = 0, err;
> @@ -907,7 +907,7 @@ err:
>  }
> 
>  struct buffer_head *ext3_bread(handle_t *handle, struct inode * inode,
> -			       int block, int create, int *err)
> +			       unsigned int block, int create, int *err)
>  {
>  	struct buffer_head * bh;
> 
> @@ -1754,7 +1754,7 @@ static inline int all_zeroes(__le32 *p,
> 
>  static Indirect *ext3_find_shared(struct inode *inode,
>  				int depth,
> -				int offsets[4],
> +				unsigned int offsets[4],
>  				Indirect chain[4],
>  				__le32 *top)
>  {
> @@ -1967,7 +1967,7 @@ static void ext3_free_branches(handle_t
>  			 */
>  			if (!bh) {
>  				ext3_error(inode->i_sb, "ext3_free_branches",
> -					   "Read failure, inode=%ld, block=%ld",
> +					   "Read failure, inode=%lu, block=%lu",
>  					   inode->i_ino, nr);
>  				continue;
>  			}
> @@ -2084,12 +2084,12 @@ void ext3_truncate(struct inode * inode)
>  	__le32 *i_data = ei->i_data;
>  	int addr_per_block = EXT3_ADDR_PER_BLOCK(inode->i_sb);
>  	struct address_space *mapping = inode->i_mapping;
> -	int offsets[4];
> +	unsigned int offsets[4];
>  	Indirect chain[4];
>  	Indirect *partial;
>  	__le32 nr = 0;
>  	int n;
> -	long last_block;
> +	unsigned long last_block;
>  	unsigned blocksize = inode->i_sb->s_blocksize;
>  	struct page *page;
> 
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/namei.c linux-2.6.16-rc6-4g/fs/ext3/namei.c
> --- linux-2.6.16-rc6.org/fs/ext3/namei.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext3/namei.c	2006-03-14 09:29:01.000000000 +0900
> @@ -816,7 +816,8 @@ static struct buffer_head * ext3_find_en
>  	int ra_ptr = 0;		/* Current index into readahead
>  				   buffer */
>  	int num = 0;
> -	int nblocks, i, err;
> +	unsigned int nblocks;
> +	int i, err;
>  	struct inode *dir = dentry->d_parent->d_inode;
>  	int namelen;
>  	const u8 *name;
> @@ -1910,8 +1911,8 @@ int ext3_orphan_add(handle_t *handle, st
>  	if (!err)
>  		list_add(&EXT3_I(inode)->i_orphan, &EXT3_SB(sb)->s_orphan);
> 
> -	jbd_debug(4, "superblock will point to %ld\n", inode->i_ino);
> -	jbd_debug(4, "orphan inode %ld will point to %d\n",
> +	jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
> +	jbd_debug(4, "orphan inode %lu will point to %d\n",
>  			inode->i_ino, NEXT_ORPHAN(inode));
>  out_unlock:
>  	unlock_super(sb);
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/resize.c linux-2.6.16-rc6-4g/fs/ext3/resize.c
> --- linux-2.6.16-rc6.org/fs/ext3/resize.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext3/resize.c	2006-03-14 09:29:01.000000000 +0900
> @@ -37,7 +37,7 @@ static int verify_group_input(struct sup
>  		 le16_to_cpu(es->s_reserved_gdt_blocks)) : 0;
>  	unsigned metaend = start + overhead;
>  	struct buffer_head *bh = NULL;
> -	int free_blocks_count;
> +	long long free_blocks_count;
>  	int err = -EINVAL;
> 
>  	input->free_blocks_count = free_blocks_count =
> @@ -45,7 +45,7 @@ static int verify_group_input(struct sup
> 
>  	if (test_opt(sb, DEBUG))
>  		printk(KERN_DEBUG "EXT3-fs: adding %s group %u: %u blocks "
> -		       "(%d free, %u reserved)\n",
> +		       "(%lld free, %u reserved)\n",
>  		       ext3_bg_has_super(sb, input->group) ? "normal" :
>  		       "no-super", input->group, input->blocks_count,
>  		       free_blocks_count, input->reserved_blocks);
> @@ -138,14 +138,14 @@ static struct buffer_head *bclean(handle
>   * need to use it within a single byte (to ensure we get endianness right).
>   * We can use memset for the rest of the bitmap as there are no other users.
>   */
> -static void mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
> +static void mark_bitmap_end(unsigned int start_bit, unsigned int end_bit, char *bitmap)
>  {
> -	int i;
> +	unsigned int i;
> 
>  	if (start_bit >= end_bit)
>  		return;
> 
> -	ext3_debug("mark end bits +%d through +%d used\n", start_bit, end_bit);
> +	ext3_debug("mark end bits +%u through +%u used\n", start_bit, end_bit);
>  	for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++)
>  		ext3_set_bit(i, bitmap);
>  	if (i < end_bit)
> @@ -340,7 +340,7 @@ static int verify_reserved_gdb(struct su
>  	while ((grp = ext3_list_backups(sb, &three, &five, &seven)) < end) {
>  		if (le32_to_cpu(*p++) != grp * EXT3_BLOCKS_PER_GROUP(sb) + blk){
>  			ext3_warning(sb, __FUNCTION__,
> -				     "reserved GDT %ld missing grp %d (%ld)",
> +				     "reserved GDT %ld missing grp %d (%lu)",
>  				     blk, grp,
>  				     grp * EXT3_BLOCKS_PER_GROUP(sb) + blk);
>  			return -EINVAL;
> @@ -619,7 +619,7 @@ exit_free:
>   * at this time.  The resize which changed s_groups_count will backup again.
>   */
>  static void update_backups(struct super_block *sb,
> -			   int blk_off, char *data, int size)
> +			   unsigned int blk_off, char *data, int size)
>  {
>  	struct ext3_sb_info *sbi = EXT3_SB(sb);
>  	const unsigned long last = sbi->s_groups_count;
> @@ -869,9 +869,9 @@ int ext3_group_add(struct super_block *s
>  		input->reserved_blocks);
> 
>  	/* Update the free space counts */
> -	percpu_counter_mod(&sbi->s_freeblocks_counter,
> +	percpu_llcounter_mod(&sbi->s_freeblocks_counter,
>  			   input->free_blocks_count);
> -	percpu_counter_mod(&sbi->s_freeinodes_counter,
> +	percpu_llcounter_mod(&sbi->s_freeinodes_counter,
>  			   EXT3_INODES_PER_GROUP(sb));
> 
>  	ext3_journal_dirty_metadata(handle, sbi->s_sbh);
> @@ -990,10 +990,10 @@ int ext3_group_extend(struct super_block
>  	ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh);
>  	sb->s_dirt = 1;
>  	unlock_super(sb);
> -	ext3_debug("freeing blocks %ld through %ld\n", o_blocks_count,
> +	ext3_debug("freeing blocks %lu through %lu\n", o_blocks_count,
>  		   o_blocks_count + add);
>  	ext3_free_blocks_sb(handle, sb, o_blocks_count, add, &freed_blocks);
> -	ext3_debug("freed blocks %ld through %ld\n", o_blocks_count,
> +	ext3_debug("freed blocks %lu through %lu\n", o_blocks_count,
>  		   o_blocks_count + add);
>  	if ((err = ext3_journal_stop(handle)))
>  		goto exit_put;
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/super.c linux-2.6.16-rc6-4g/fs/ext3/super.c
> --- linux-2.6.16-rc6.org/fs/ext3/super.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext3/super.c	2006-03-14 09:29:01.000000000 +0900
> @@ -377,7 +377,7 @@ static void dump_orphan_list(struct supe
>  	list_for_each(l, &sbi->s_orphan) {
>  		struct inode *inode = orphan_list_entry(l);
>  		printk(KERN_ERR "  "
> -		       "inode %s:%ld at %p: mode %o, nlink %d, next %d\n",
> +		       "inode %s:%lu at %p: mode %o, nlink %d, next %d\n",
>  		       inode->i_sb->s_id, inode->i_ino, inode,
>  		       inode->i_mode, inode->i_nlink,
>  		       NEXT_ORPHAN(inode));
> @@ -403,9 +403,9 @@ static void ext3_put_super (struct super
>  	for (i = 0; i < sbi->s_gdb_count; i++)
>  		brelse(sbi->s_group_desc[i]);
>  	kfree(sbi->s_group_desc);
> -	percpu_counter_destroy(&sbi->s_freeblocks_counter);
> -	percpu_counter_destroy(&sbi->s_freeinodes_counter);
> -	percpu_counter_destroy(&sbi->s_dirs_counter);
> +	percpu_llcounter_destroy(&sbi->s_freeblocks_counter);
> +	percpu_llcounter_destroy(&sbi->s_freeinodes_counter);
> +	percpu_llcounter_destroy(&sbi->s_dirs_counter);
>  	brelse(sbi->s_sbh);
>  #ifdef CONFIG_QUOTA
>  	for (i = 0; i < MAXQUOTAS; i++)
> @@ -1253,17 +1253,17 @@ static void ext3_orphan_cleanup (struct
>  		DQUOT_INIT(inode);
>  		if (inode->i_nlink) {
>  			printk(KERN_DEBUG
> -				"%s: truncating inode %ld to %Ld bytes\n",
> +				"%s: truncating inode %lu to %Ld bytes\n",
>  				__FUNCTION__, inode->i_ino, inode->i_size);
> -			jbd_debug(2, "truncating inode %ld to %Ld bytes\n",
> +			jbd_debug(2, "truncating inode %lu to %Ld bytes\n",
>  				  inode->i_ino, inode->i_size);
>  			ext3_truncate(inode);
>  			nr_truncates++;
>  		} else {
>  			printk(KERN_DEBUG
> -				"%s: deleting unreferenced inode %ld\n",
> +				"%s: deleting unreferenced inode %lu\n",
>  				__FUNCTION__, inode->i_ino);
> -			jbd_debug(2, "deleting unreferenced inode %ld\n",
> +			jbd_debug(2, "deleting unreferenced inode %lu\n",
>  				  inode->i_ino);
>  			nr_orphans++;
>  		}
> @@ -1578,9 +1578,9 @@ static int ext3_fill_super (struct super
>  		goto failed_mount;
>  	}
> 
> -	percpu_counter_init(&sbi->s_freeblocks_counter);
> -	percpu_counter_init(&sbi->s_freeinodes_counter);
> -	percpu_counter_init(&sbi->s_dirs_counter);
> +	percpu_llcounter_init(&sbi->s_freeblocks_counter);
> +	percpu_llcounter_init(&sbi->s_freeinodes_counter);
> +	percpu_llcounter_init(&sbi->s_dirs_counter);
>  	bgl_lock_init(&sbi->s_blockgroup_lock);
> 
>  	for (i = 0; i < db_count; i++) {
> @@ -1728,11 +1728,11 @@ static int ext3_fill_super (struct super
>  		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
>  		"writeback");
> 
> -	percpu_counter_mod(&sbi->s_freeblocks_counter,
> +	percpu_llcounter_mod(&sbi->s_freeblocks_counter,
>  		ext3_count_free_blocks(sb));
> -	percpu_counter_mod(&sbi->s_freeinodes_counter,
> +	percpu_llcounter_mod(&sbi->s_freeinodes_counter,
>  		ext3_count_free_inodes(sb));
> -	percpu_counter_mod(&sbi->s_dirs_counter,
> +	percpu_llcounter_mod(&sbi->s_dirs_counter,
>  		ext3_count_dirs(sb));
> 
>  	lock_kernel();
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/xattr.c linux-2.6.16-rc6-4g/fs/ext3/xattr.c
> --- linux-2.6.16-rc6.org/fs/ext3/xattr.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext3/xattr.c	2006-03-14 09:29:01.000000000 +0900
> @@ -75,7 +75,7 @@
> 
>  #ifdef EXT3_XATTR_DEBUG
>  # define ea_idebug(inode, f...) do { \
> -		printk(KERN_DEBUG "inode %s:%ld: ", \
> +		printk(KERN_DEBUG "inode %s:%lu: ", \
>  			inode->i_sb->s_id, inode->i_ino); \
>  		printk(f); \
>  		printk("\n"); \
> @@ -225,7 +225,7 @@ ext3_xattr_block_get(struct inode *inode
>  	error = -ENODATA;
>  	if (!EXT3_I(inode)->i_file_acl)
>  		goto cleanup;
> -	ea_idebug(inode, "reading block %d", EXT3_I(inode)->i_file_acl);
> +	ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
>  	bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
>  	if (!bh)
>  		goto cleanup;
> @@ -233,7 +233,7 @@ ext3_xattr_block_get(struct inode *inode
>  		atomic_read(&(bh->b_count)), le32_to_cpu(BHDR(bh)->h_refcount));
>  	if (ext3_xattr_check_block(bh)) {
>  bad_block:	ext3_error(inode->i_sb, __FUNCTION__,
> -			   "inode %ld: bad block %d", inode->i_ino,
> +			   "inode %lu: bad block %u", inode->i_ino,
>  			   EXT3_I(inode)->i_file_acl);
>  		error = -EIO;
>  		goto cleanup;
> @@ -366,7 +366,7 @@ ext3_xattr_block_list(struct inode *inod
>  	error = 0;
>  	if (!EXT3_I(inode)->i_file_acl)
>  		goto cleanup;
> -	ea_idebug(inode, "reading block %d", EXT3_I(inode)->i_file_acl);
> +	ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
>  	bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
>  	error = -EIO;
>  	if (!bh)
> @@ -375,7 +375,7 @@ ext3_xattr_block_list(struct inode *inod
>  		atomic_read(&(bh->b_count)), le32_to_cpu(BHDR(bh)->h_refcount));
>  	if (ext3_xattr_check_block(bh)) {
>  		ext3_error(inode->i_sb, __FUNCTION__,
> -			   "inode %ld: bad block %d", inode->i_ino,
> +			   "inode %lu: bad block %u", inode->i_ino,
>  			   EXT3_I(inode)->i_file_acl);
>  		error = -EIO;
>  		goto cleanup;
> @@ -647,7 +647,7 @@ ext3_xattr_block_find(struct inode *inod
>  			le32_to_cpu(BHDR(bs->bh)->h_refcount));
>  		if (ext3_xattr_check_block(bs->bh)) {
>  			ext3_error(sb, __FUNCTION__,
> -				"inode %ld: bad block %d", inode->i_ino,
> +				"inode %lu: bad block %u", inode->i_ino,
>  				EXT3_I(inode)->i_file_acl);
>  			error = -EIO;
>  			goto cleanup;
> @@ -792,14 +792,14 @@ inserted:
>  			get_bh(new_bh);
>  		} else {
>  			/* We need to allocate a new block */
> -			int goal = le32_to_cpu(
> +			unsigned int goal = le32_to_cpu(
>  					EXT3_SB(sb)->s_es->s_first_data_block) +
>  				EXT3_I(inode)->i_block_group *
>  				EXT3_BLOCKS_PER_GROUP(sb);
> -			int block = ext3_new_block(handle, inode, goal, &error);
> +			unsigned int block = ext3_new_block(handle, inode, goal, &error);
>  			if (error)
>  				goto cleanup;
> -			ea_idebug(inode, "creating block %d", block);
> +			ea_idebug(inode, "creating block %u", block);
> 
>  			new_bh = sb_getblk(sb, block);
>  			if (!new_bh) {
> @@ -847,7 +847,7 @@ cleanup_dquot:
> 
>  bad_block:
>  	ext3_error(inode->i_sb, __FUNCTION__,
> -		   "inode %ld: bad block %d", inode->i_ino,
> +		   "inode %lu: bad block %u", inode->i_ino,
>  		   EXT3_I(inode)->i_file_acl);
>  	goto cleanup;
> 
> @@ -1076,14 +1076,14 @@ ext3_xattr_delete_inode(handle_t *handle
>  	bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
>  	if (!bh) {
>  		ext3_error(inode->i_sb, __FUNCTION__,
> -			"inode %ld: block %d read error", inode->i_ino,
> +			"inode %lu: block %u read error", inode->i_ino,
>  			EXT3_I(inode)->i_file_acl);
>  		goto cleanup;
>  	}
>  	if (BHDR(bh)->h_magic != cpu_to_le32(EXT3_XATTR_MAGIC) ||
>  	    BHDR(bh)->h_blocks != cpu_to_le32(1)) {
>  		ext3_error(inode->i_sb, __FUNCTION__,
> -			"inode %ld: bad block %d", inode->i_ino,
> +			"inode %lu: bad block %u", inode->i_ino,
>  			EXT3_I(inode)->i_file_acl);
>  		goto cleanup;
>  	}
> @@ -1210,11 +1210,11 @@ again:
>  		bh = sb_bread(inode->i_sb, ce->e_block);
>  		if (!bh) {
>  			ext3_error(inode->i_sb, __FUNCTION__,
> -				"inode %ld: block %ld read error",
> +				"inode %lu: block %lu read error",
>  				inode->i_ino, (unsigned long) ce->e_block);
>  		} else if (le32_to_cpu(BHDR(bh)->h_refcount) >=
>  				EXT3_XATTR_REFCOUNT_MAX) {
> -			ea_idebug(inode, "block %ld refcount %d>=%d",
> +			ea_idebug(inode, "block %lu refcount %d>=%d",
>  				  (unsigned long) ce->e_block,
>  				  le32_to_cpu(BHDR(bh)->h_refcount),
>  					  EXT3_XATTR_REFCOUNT_MAX);
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/jbd/journal.c linux-2.6.16-rc6-4g/fs/jbd/journal.c
> --- linux-2.6.16-rc6.org/fs/jbd/journal.c	2006-01-03 12:21:10.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/jbd/journal.c	2006-03-14 09:29:01.000000000 +0900
> @@ -761,7 +761,7 @@ journal_t * journal_init_inode (struct i
>  	journal->j_dev = journal->j_fs_dev = inode->i_sb->s_bdev;
>  	journal->j_inode = inode;
>  	jbd_debug(1,
> -		  "journal %p: inode %s/%ld, size %Ld, bits %d, blksize %ld\n",
> +		  "journal %p: inode %s/%lu, size %Ld, bits %d, blksize %ld\n",
>  		  journal, inode->i_sb->s_id, inode->i_ino,
>  		  (long long) inode->i_size,
>  		  inode->i_sb->s_blocksize_bits, inode->i_sb->s_blocksize);
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/include/linux/ext2_fs_sb.h linux-2.6.16-rc6-4g/include/linux/ext2_fs_sb.h
> --- linux-2.6.16-rc6.org/include/linux/ext2_fs_sb.h	2006-01-03 12:21:10.000000000 +0900
> +++ linux-2.6.16-rc6-4g/include/linux/ext2_fs_sb.h	2006-03-14 12:06:21.000000000 +0900
> @@ -17,7 +17,7 @@
>  #define _LINUX_EXT2_FS_SB
> 
>  #include <linux/blockgroup_lock.h>
> -#include <linux/percpu_counter.h>
> +#include <linux/percpu_llcounter.h>
> 
>  /*
>   * second extended-fs super-block data in memory
> @@ -49,9 +49,9 @@ struct ext2_sb_info {
>  	u32 s_next_generation;
>  	unsigned long s_dir_count;
>  	u8 *s_debts;
> -	struct percpu_counter s_freeblocks_counter;
> -	struct percpu_counter s_freeinodes_counter;
> -	struct percpu_counter s_dirs_counter;
> +	struct percpu_llcounter s_freeblocks_counter;
> +	struct percpu_llcounter s_freeinodes_counter;
> +	struct percpu_llcounter s_dirs_counter;
>  	struct blockgroup_lock s_blockgroup_lock;
>  };
> 
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/include/linux/ext3_fs.h linux-2.6.16-rc6-4g/include/linux/ext3_fs.h
> --- linux-2.6.16-rc6.org/include/linux/ext3_fs.h	2006-01-03 12:21:10.000000000 +0900
> +++ linux-2.6.16-rc6-4g/include/linux/ext3_fs.h	2006-03-14 09:29:01.000000000 +0900
> @@ -731,7 +731,7 @@ struct dir_private_info {
>  /* balloc.c */
>  extern int ext3_bg_has_super(struct super_block *sb, int group);
>  extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
> -extern int ext3_new_block (handle_t *, struct inode *, unsigned long, int *);
> +extern unsigned int ext3_new_block (handle_t *, struct inode *, unsigned long, int *);
>  extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long,
>  			      unsigned long);
>  extern void ext3_free_blocks_sb (handle_t *, struct super_block *,
> @@ -761,7 +761,6 @@ extern int ext3_sync_file (struct file *
>  extern int ext3fs_dirhash(const char *name, int len, struct
>  			  dx_hash_info *hinfo);
> 
> -/* ialloc.c */
>  extern struct inode * ext3_new_inode (handle_t *, struct inode *, int);
>  extern void ext3_free_inode (handle_t *, struct inode *);
>  extern struct inode * ext3_orphan_get (struct super_block *, unsigned long);
> @@ -772,9 +771,9 @@ extern unsigned long ext3_count_free (st
> 
> 
>  /* inode.c */
> -extern int ext3_forget(handle_t *, int, struct inode *, struct buffer_head *, int);
> -extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *);
> -extern struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *);
> +extern int ext3_forget(handle_t *, int, struct inode *, struct buffer_head *, unsigned int);
> +extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, unsigned long, int, int *);
> +extern struct buffer_head * ext3_bread (handle_t *, struct inode *, unsigned int, int, int *);
> 
>  extern void ext3_read_inode (struct inode *);
>  extern int  ext3_write_inode (struct inode *, int);
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/include/linux/ext3_fs_sb.h linux-2.6.16-rc6-4g/include/linux/ext3_fs_sb.h
> --- linux-2.6.16-rc6.org/include/linux/ext3_fs_sb.h	2006-01-03 12:21:10.000000000 +0900
> +++ linux-2.6.16-rc6-4g/include/linux/ext3_fs_sb.h	2006-03-14 12:06:35.000000000 +0900
> @@ -20,7 +20,7 @@
>  #include <linux/timer.h>
>  #include <linux/wait.h>
>  #include <linux/blockgroup_lock.h>
> -#include <linux/percpu_counter.h>
> +#include <linux/percpu_llcounter.h>
>  #endif
>  #include <linux/rbtree.h>
> 
> @@ -54,9 +54,9 @@ struct ext3_sb_info {
>  	u32 s_next_generation;
>  	u32 s_hash_seed[4];
>  	int s_def_hash_version;
> -	struct percpu_counter s_freeblocks_counter;
> -	struct percpu_counter s_freeinodes_counter;
> -	struct percpu_counter s_dirs_counter;
> +	struct percpu_llcounter s_freeblocks_counter;
> +	struct percpu_llcounter s_freeinodes_counter;
> +	struct percpu_llcounter s_dirs_counter;
>  	struct blockgroup_lock s_blockgroup_lock;
> 
>  	/* root of the per fs reservation window tree */
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/include/linux/percpu_llcounter.h linux-2.6.16-rc6-4g/include/linux/percpu_llcounter.h
> --- linux-2.6.16-rc6.org/include/linux/percpu_llcounter.h	1970-01-01 09:00:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/include/linux/percpu_llcounter.h	2006-03-14 13:50:54.000000000 +0900
> @@ -0,0 +1,113 @@
> +#ifndef _LINUX_LLPERCPU_COUNTER_H
> +#define _LINUX_LLPERCPU_COUNTER_H
> +/*
> + * A simple "approximate counter" for use in ext2 and ext3 superblocks.
> + *
> + * WARNING: these things are HUGE.  4 kbytes per counter on 32-way P4.
> + */
> +
> +#include <linux/config.h>
> +#include <linux/spinlock.h>
> +#include <linux/smp.h>
> +#include <linux/threads.h>
> +#include <linux/percpu.h>
> +
> +#ifdef CONFIG_SMP
> +
> +struct percpu_llcounter {
> +	spinlock_t lock;
> +	long long count;
> +	long long *counters;
> +};
> +
> +#if NR_CPUS >= 16
> +#define FBC_BATCH	(NR_CPUS*2)
> +#else
> +#define FBC_BATCH	(NR_CPUS*4)
> +#endif
> +
> +static inline void percpu_llcounter_init(struct percpu_llcounter *fbc)
> +{
> +	spin_lock_init(&fbc->lock);
> +	fbc->count = 0;
> +	fbc->counters = alloc_percpu(long long);
> +}
> +
> +static inline void percpu_llcounter_destroy(struct percpu_llcounter *fbc)
> +{
> +	free_percpu(fbc->counters);
> +}
> +
> +void percpu_llcounter_mod(struct percpu_llcounter *fbc, long long amount);
> +long long percpu_llcounter_sum(struct percpu_llcounter *fbc);
> +
> +static inline long long percpu_llcounter_read(struct percpu_llcounter *fbc)
> +{
> +	return fbc->count;
> +}
> +
> +/*
> + * It is possible for the percpu_llcounter_read() to return a small negative
> + * number for some counter which should never be negative.
> + */
> +static inline long long percpu_llcounter_read_positive(struct percpu_llcounter *fbc)
> +{
> +	long long ret = fbc->count;
> +
> +	barrier();		/* Prevent reloads of fbc->count */
> +	if (ret > 0)
> +		return ret;
> +	return 1;
> +}
> +
> +#else
> +
> +struct percpu_llcounter {
> +	long long count;
> +};
> +
> +static inline void percpu_llcounter_init(struct percpu_llcounter *fbc)
> +{
> +	fbc->count = 0;
> +}
> +
> +static inline void percpu_llcounter_destroy(struct percpu_llcounter *fbc)
> +{
> +}
> +
> +static inline void
> +percpu_llcounter_mod(struct percpu_llcounter *fbc, long long amount)
> +{
> +	preempt_disable();
> +	fbc->count += amount;
> +	preempt_enable();
> +}
> +
> +static inline long long percpu_llcounter_read(struct percpu_llcounter *fbc)
> +{
> +	return fbc->count;
> +}
> +
> +static inline long long percpu_llcounter_read_positive(struct percpu_llcounter *fbc)
> +{
> +	return fbc->count;
> +}
> +
> +static inline long long percpu_llcounter_sum(struct percpu_llcounter *fbc)
> +{
> +	return percpu_llcounter_read_positive(fbc);
> +}
> +
> +#endif	/* CONFIG_SMP */
> +
> +static inline void percpu_llcounter_inc(struct percpu_llcounter *fbc)
> +{
> +	percpu_llcounter_mod(fbc, 1);
> +}
> +
> +static inline void percpu_llcounter_dec(struct percpu_llcounter *fbc)
> +{
> +	percpu_llcounter_mod(fbc, -1);
> +}
> +
> +#endif /* _LINUX_LLPERCPU_COUNTER_H */
> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/mm/swap.c linux-2.6.16-rc6-4g/mm/swap.c
> --- linux-2.6.16-rc6.org/mm/swap.c	2006-03-14 09:09:07.000000000 +0900
> +++ linux-2.6.16-rc6-4g/mm/swap.c	2006-03-14 13:47:18.000000000 +0900
> @@ -26,6 +26,7 @@
>  #include <linux/buffer_head.h>	/* for try_to_release_page() */
>  #include <linux/module.h>
>  #include <linux/percpu_counter.h>
> +#include <linux/percpu_llcounter.h>
>  #include <linux/percpu.h>
>  #include <linux/cpu.h>
>  #include <linux/notifier.h>
> @@ -498,6 +499,27 @@ void percpu_counter_mod(struct percpu_co
>  }
>  EXPORT_SYMBOL(percpu_counter_mod);
> 
> +void percpu_llcounter_mod(struct percpu_llcounter *fbc, long long amount)
> +{
> +	long long count;
> +	long long *pcount;
> +	int cpu = get_cpu();
> +
> +	pcount = per_cpu_ptr(fbc->counters, cpu);
> +	count = *pcount + amount;
> +	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
> +		spin_lock(&fbc->lock);
> +		fbc->count += count;
> +		*pcount = 0;
> +		spin_unlock(&fbc->lock);
> +	} else {
> +		*pcount = count;
> +	}
> +	put_cpu();
> +}
> +EXPORT_SYMBOL(percpu_llcounter_mod);
> +
> +
>  /*
>   * Add up all the per-cpu counts, return the result.  This is a more accurate
>   * but much slower version of percpu_counter_read_positive()
> @@ -517,6 +539,26 @@ long percpu_counter_sum(struct percpu_co
>  	return ret < 0 ? 0 : ret;
>  }
>  EXPORT_SYMBOL(percpu_counter_sum);
> +
> +/*
> + * Add up all the per-cpu counts, return the result.  This is a more accurate
> + * but much slower version of percpu_llcounter_read_positive()
> + */
> +long long percpu_llcounter_sum(struct percpu_llcounter *fbc)
> +{
> +	long long ret;
> +	int cpu;
> +
> +	spin_lock(&fbc->lock);
> +	ret = fbc->count;
> +	for_each_cpu(cpu) {
> +		long long *pcount = per_cpu_ptr(fbc->counters, cpu);
> +		ret += *pcount;
> +	}
> +	spin_unlock(&fbc->lock);
> +	return ret < 0 ? 0 : ret;
> +}
> +EXPORT_SYMBOL(percpu_llcounter_sum);
>  #endif
> 
>  /*
> 
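
The batching scheme used by percpu_llcounter_mod() and
percpu_llcounter_sum() in the patch above can be sketched in user space
as follows.  This is only an illustration: the struct and function
names, the fixed NCPUS value and the explicit "cpu" argument are
assumptions of the sketch, and it is single-threaded, so the spinlock
and get_cpu()/put_cpu() of the kernel version are left out.

```c
#include <stdint.h>

#define NCPUS 4            /* hypothetical CPU count for the sketch */
#define BATCH (NCPUS * 4)  /* mirrors FBC_BATCH for NR_CPUS < 16 */

struct llcounter {
	long long count;          /* shared, approximate total */
	long long percpu[NCPUS];  /* per-CPU deltas not yet folded in */
};

/* Add "amount" on behalf of "cpu"; fold the local delta into the
 * shared total only once it reaches the batch threshold. */
static void llcounter_mod(struct llcounter *c, int cpu, long long amount)
{
	long long local = c->percpu[cpu] + amount;

	if (local >= BATCH || local <= -BATCH) {
		c->count += local;  /* the kernel version holds fbc->lock here */
		c->percpu[cpu] = 0;
	} else {
		c->percpu[cpu] = local;
	}
}

/* Exact (but slower) value: shared total plus every per-CPU delta,
 * clamped at zero like percpu_llcounter_sum(). */
static long long llcounter_sum(const struct llcounter *c)
{
	long long ret = c->count;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++)
		ret += c->percpu[cpu];
	return ret < 0 ? 0 : ret;
}
```

Reading c->count alone is cheap but may lag behind the true value by up
to BATCH per CPU, which is why the patch keeps both a fast read and a
slow sum.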
-- 
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-15 12:39 [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel) Takashi Sato
  2006-03-15 12:56 ` [Ext2-devel] " Laurent Vivier
@ 2006-03-16  2:19 ` Mingming Cao
  2006-03-16 12:11   ` Takashi Sato
  2006-03-19  2:20 ` Theodore Ts'o
  2 siblings, 1 reply; 34+ messages in thread
From: Mingming Cao @ 2006-03-16  2:19 UTC (permalink / raw)
  To: Takashi Sato; +Cc: ext2-devel, linux-kernel

On Wed, 2006-03-15 at 21:39 +0900, Takashi Sato wrote:
> Hi,
> 
> As a disk size tends to be larger, some disk storages get to have
> the capacity to supply more than multi-TB recently.  But now ext2/3
> can't support more than 8TB filesystem in 4K-blocksize.  And then I
> think the filesystem size of ext2/3 should be extended.
> 
> I'd like to extend the max filesystem size of ext2/3 from 8TB to 16TB
> by making the number of blocks on ext2/3 extend from 2G-1(2^31-1) to
> 4G-1(2^32-1) as below.

> The max number of blocks is restricted to 2G-1(2^31-1) on ext2/3
> because of the following problems.
> 
Hi Takashi, nice work and summary.

> - The number of blocks is treated as signed 4bytes variable on some
>   codes for ext2/3 in kernel.
> 

Yes, there are a number of places in the ext2/3 code that use int to
represent the on-disk block number, especially in the block allocation
code. For example, ext3_new_block() returns an int for the newly
allocated block: it uses "-1" to indicate failure to the caller, and on
success it returns a positive int.
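
To illustrate why a signed return value caps the usable block numbers,
here is a small user-space sketch.  The helper names are hypothetical
and are not ext3 functions, and a 32-bit int is assumed.  It contrasts
the signed convention (negative values reserved for errors) with an
unsigned return plus a separate error pointer, which is the style the
patch moves towards.

```c
#include <stdint.h>

/* Signed style: negative values are reserved for errors, so only block
 * numbers up to 2^31-1 can be returned without being mistaken for one. */
static int block_fits_signed(uint32_t blk)
{
	return blk <= (uint32_t)INT32_MAX;
}

/* Unsigned style: the block number is returned by value and failure is
 * reported through *err, so all 32 bits can carry the block number. */
static uint32_t return_block_unsigned(uint32_t blk, int *err)
{
	*err = 0;  /* this sketch never fails */
	return blk;
}
```

Block 0x80000000 is the first one that no longer fits the signed
convention, while the unsigned variant can hand back anything up to
0xFFFFFFFF.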

When ext3 block reservation was merged into mainline, it continued to
use int variables for physical block numbers in several places. I did
think about fixing this limitation together with the reservation change,
but never got a chance to really work on it.

You changed most of the affected variables from "int" to "unsigned int",
which seems to allow block numbers up to 2^32-1. It is probably a good
idea to consider changing the variables to the sector_t type, so that
when we want to support 64-bit block numbers we don't have to redo
similar work.  Laurent did very similar work on this before.

> - Assembler instructions which can't treat more than 2GB is used
>   on some functions related to bit manipulation, like ext2fs_set_bit()
>   and ext2fs_test_bit().  These functions are called through mke2fs
>   on x86 and mc68000 architecture.
> 
> - A block number and an inode number is output with the format
>   string(%d, %ld) in many places on both kernel and commands.
> 

Besides these limitations, I think there is one more that limits the
ext3 filesystem size to 8TB:

- The superblock format currently stores the number of block groups as a
16-bit integer, and because (on a 4 KB blocksize filesystem) the maximum
number of blocks in a block group is 32,768, the combination of these
constraints limits the maximum size of the filesystem to 8 TB.
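
The arithmetic behind that figure can be written out as a short sketch.
The function names are hypothetical, and the sketch assumes one bitmap
block per group, so a group covers 8 * blocksize blocks.  It only
reproduces the numbers in the claim; whether the superblock field really
imposes this limit is a separate question.

```c
#include <stdint.h>

/* One bitmap block tracks a whole group: 8 bits per bitmap byte. */
static uint32_t blocks_per_group(uint32_t blocksize)
{
	return 8u * blocksize;
}

/* Total bytes covered by "groups" full block groups. */
static uint64_t max_bytes_for_groups(uint64_t groups, uint32_t blocksize)
{
	return groups * blocks_per_group(blocksize) * (uint64_t)blocksize;
}
```

With 4 KB blocks this gives 32,768 blocks per group, and 65536 groups
cover 2^16 * 2^15 * 2^12 = 2^43 bytes, i.e. 8 TB.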

> This patch set is composed of two parts, for the kernel and e2fsprogs.
> 
> [1/2] kernel(linux 2.6.16-rc6)
>  - Change signed 4bytes variables for a block number and a inode
>    number, to unsigned.
> 

I noticed that the first patch combines changes to the ext2 filesystem
with changes to the ext3 filesystem. It would be nice to split it into
separate patches, one per filesystem.

The changes you made to ext3_new_block() are okay: group_block is a
block number relative to the block group, not to the whole filesystem,
and since we will convert ret_block to the filesystem-wide block
number, keeping group_block as an int is fine.

But that doesn't fix all the problems. We still have places in the ext3
block reservation code that use int for filesystem-wide block numbers.
For example, in alloc_new_reservation(), group_first_block,
group_end_block and start_block are all filesystem-wide block numbers;
they need to be changed. I will check the code more closely tomorrow to
see if the changes will break any assumptions.

Also, I noticed that in your first patch you changed a few variables
for logical block numbers from "long" to "unsigned int". I just want to
point out that this is a separate issue: it is about enlarging the file
size, not about expanding the maximum filesystem size.

> diff -uprN -X linux-2.6.16-rc6.org/Documentation/dontdiff linux-2.6.16-rc6.org/fs/ext3/inode.c linux-2.6.16-rc6-4g/fs/ext3/inode.c
> --- linux-2.6.16-rc6.org/fs/ext3/inode.c	2006-03-14 09:09:00.000000000 +0900
> +++ linux-2.6.16-rc6-4g/fs/ext3/inode.c	2006-03-14 09:29:01.000000000 +0900

> @@ -235,10 +235,10 @@ no_delete:
>  	clear_inode(inode);	/* We must guarantee clearing of inode... */
>  }
> 
> -static int ext3_alloc_block (handle_t *handle,
> -			struct inode * inode, unsigned long goal, int *err)
> +static unsigned int ext3_alloc_block (handle_t *handle,
> +			struct inode * inode, unsigned int goal, int *err)
>  {

I made some changes to the same code to support ext3 multiple block
allocation. Those patches removed this function, ext3_alloc_block(). The
patches are sitting in the mm tree now.

BTW, why do we change from unsigned long back to unsigned int here?

>  	struct ext3_block_alloc_info *block_i =  EXT3_I(inode)->i_block_alloc_info;
> @@ -505,21 +505,21 @@ static unsigned long ext3_find_goal(stru
>  static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
>  			     int num,
>  			     unsigned long goal,
> -			     int *offsets,
> +			     unsigned int *offsets,
>  			     Indirect *branch)

The offsets[] array here stores the index positions within the indirect
blocks where the physical block numbers are stored. An indirect block
takes up one 4K block and holds up to 1K physical block numbers, so an
int is good enough for the index.
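
A simplified user-space sketch of how a logical block number breaks
down into such per-level indexes is shown below.  It assumes the usual
ext2/3 layout of 12 direct pointers followed by single, double and
triple indirect blocks; the function name and signature are
hypothetical and this is not the kernel's own routine.

```c
#include <stdint.h>

#define NDIR_BLOCKS 12  /* direct block pointers in the inode */

/* Fill offsets[] with the index used at each level and return the
 * depth (1..4), or 0 if lblock is beyond the triple-indirect range. */
static int block_to_path(uint64_t lblock, uint32_t blocksize, int offsets[4])
{
	/* 4-byte block numbers: 1024 entries per 4 KB indirect block */
	const uint64_t ptrs = blocksize / sizeof(uint32_t);
	int n = 0;

	if (lblock < NDIR_BLOCKS) {
		offsets[n++] = (int)lblock;
		return n;
	}
	lblock -= NDIR_BLOCKS;
	if (lblock < ptrs) {
		offsets[n++] = NDIR_BLOCKS;      /* single-indirect slot */
		offsets[n++] = (int)lblock;
		return n;
	}
	lblock -= ptrs;
	if (lblock < ptrs * ptrs) {
		offsets[n++] = NDIR_BLOCKS + 1;  /* double-indirect slot */
		offsets[n++] = (int)(lblock / ptrs);
		offsets[n++] = (int)(lblock % ptrs);
		return n;
	}
	lblock -= ptrs * ptrs;
	if (lblock < ptrs * ptrs * ptrs) {
		offsets[n++] = NDIR_BLOCKS + 2;  /* triple-indirect slot */
		offsets[n++] = (int)(lblock / (ptrs * ptrs));
		offsets[n++] = (int)((lblock / ptrs) % ptrs);
		offsets[n++] = (int)(lblock % ptrs);
		return n;
	}
	return 0;
}
```

Every value written to offsets[] is smaller than the number of entries
in one indirect block, so it never gets near the range where signedness
would matter.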

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-16  2:19 ` Mingming Cao
@ 2006-03-16 12:11   ` Takashi Sato
  2006-03-16 13:53     ` Theodore Ts'o
  2006-03-16 18:35     ` Andreas Dilger
  0 siblings, 2 replies; 34+ messages in thread
From: Takashi Sato @ 2006-03-16 12:11 UTC (permalink / raw)
  To: cmm; +Cc: linux-kernel, ext2-devel

Hi,

> You changed most of the affected variables from "int" to "unsigned int",
> that seems allow block number to address 2^32. It probably a good thing
> to consider change the variables to sector_t type, so when the time we
> want to support for 64 bit block number, we don't have to re-do the
> similar work again.  Laurent did very similar work on this before.

sector_t is 8 bytes in a normal configuration, and ext2/3 has many
variables for block numbers.  I thought widening them might affect
performance, so I didn't change them.

> Besides these limitations, I think there is one more to limit ext3
> filesystem size to 8TB
> 
> - The superblock format currently stores the number of block groups as a
> 16-bit integer, and because (on a 4 KB blocksize filesystem) the maximum
> number of blocks in a block group is 32,768 , a combination of these
> constraints limits the maximum size of the filesystem to 8 TB

Is it s_block_group_nr in ext3_super_block?
mke2fs sets the field to 65535 if the number of block groups is greater
than 65535.  The current kernel ignores the field and recalculates the
value from other fields.  The findsuper command is its only user, and it
simply prints the value.  So it does not limit the maximum size of the
filesystem to 8 TB.  I confirmed that mke2fs with my change could make a
filesystem with more than 65536 groups, and that it could be mounted.

> I noticed that the first patch set combines changes to ext2 filesystem
> and changes to ext3 filesystem. It would be nice to split the changes to
> two different filesystems.

Ok, I'll split it later.

> But that doesn't fix all the problems. We still have places in ext3 block
> reservation code that use int for system-wide block numbers. For e.g.,
> alloc_new_reservation(), group_first_block, group_end_block, start_block
> are all filesystem wide block numbers, they need to be changed. I will
> check the code more closely tomorrow to see if the changes will break
> any assumptions.

Thank you, I missed it.  I'm looking forward to seeing your report.

> Also, I noticed that in your first patch, you changed a few variables
> for logical block number from "long" to "unsigned int". Just want to
> point out that's a separate issue- that's for enlarge the file size, not
> for expand the max filesystem size.

Ok, I'll remove them when I update the patch next time.
They were left in because I'm also considering enlarging the maximum
file size...

>> -static int ext3_alloc_block (handle_t *handle,
>> - struct inode * inode, unsigned long goal, int *err)
>> +static unsigned int ext3_alloc_block (handle_t *handle,
>> + struct inode * inode, unsigned int goal, int *err)
>>  {
> 
> I did some changes in the same code to support ext3 multiple block
> allocation. Those patches removed this function ext3_alloc_block(). The
> patches are sitting in mm tree now.
> 
> BTW, why we change from unsigned long back to unsigned int here?

Because ext3_alloc_branch calls ext3_alloc_block with an int for the
block number, and ext3_alloc_blocks returns an int.

>>  struct ext3_block_alloc_info *block_i =  EXT3_I(inode)->i_block_alloc_info;
>> @@ -505,21 +505,21 @@ static unsigned long ext3_find_goal(stru
>>  static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
>>       int num,
>>       unsigned long goal,
>> -      int *offsets,
>> +      unsigned int *offsets,
>>       Indirect *branch)
> 
> offsets[] array here store the index position within a indirect block,
> where the physical block is stored. The indirect block takes a 4k block,
> holds up to 1K entry of physical block numbers, so int type for the
> index is good enough.

Ok, I'll update them too.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-16 12:11   ` Takashi Sato
@ 2006-03-16 13:53     ` Theodore Ts'o
  2006-03-16 18:35     ` Andreas Dilger
  1 sibling, 0 replies; 34+ messages in thread
From: Theodore Ts'o @ 2006-03-16 13:53 UTC (permalink / raw)
  To: Takashi Sato; +Cc: cmm, linux-kernel, ext2-devel

On Thu, Mar 16, 2006 at 09:11:17PM +0900, Takashi Sato wrote:
> >You changed most of the affected variables from "int" to "unsigned int",
> >that seems allow block number to address 2^32. It probably a good thing
> >to consider change the variables to sector_t type, so when the time we
> >want to support for 64 bit block number, we don't have to re-do the
> >similar work again.  Laurent did very similar work on this before.
> 
> sector_t is 8bytes on normal configuration and there are many
> variables for blocks on ext2/3.  I thought extending variables may
> influence on performance, so I didn't change.

It would be interesting to do a CPU overhead benchmark to see how much
of the overhead is actually measurable on an x86 system.  If it's only
a small percentage, it might be acceptable given that x86_64 machines
are going to be gradually taking over, and sector_t is only 64 bits
wide on 32-bit systems if CONFIG_LBD is enabled.  So for smaller
systems where LBD isn't enabled, we won't see the performance overhead,
and the code is going to have to use a typedef for ext2_blk_t which is
either __u32 or sector_t as necessary.
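
A minimal user-space sketch of that typedef idea follows.  It assumes a
hypothetical build-time switch, EXT2_BLK_64BIT, standing in for the
kernel configuration option, and uses uint32_t and uint64_t in place of
__u32 and sector_t; ext2_blk_max() is likewise only a helper for the
sketch.

```c
#include <stdint.h>

/* Pick the block-number type once, at build time. */
#ifdef EXT2_BLK_64BIT
typedef uint64_t ext2_blk_t;  /* large-block-device builds */
#else
typedef uint32_t ext2_blk_t;  /* small systems keep 32-bit block numbers */
#endif

/* Largest block number the chosen type can hold. */
static uint64_t ext2_blk_max(void)
{
	return (ext2_blk_t)~(ext2_blk_t)0;
}
```

Code written against ext2_blk_t then compiles to 32-bit arithmetic
unless the switch is set, which is the point about avoiding overhead on
smaller systems.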

> >- The superblock format currently stores the number of block groups as a
> >16-bit integer, and because (on a 4 KB blocksize filesystem) the maximum
> >number of blocks in a block group is 32,768 , a combination of these
> >constraints limits the maximum size of the filesystem to 8 TB
> 
> Is it s_block_group_nr in ext3_super_block?
> mke2fs sets 65535 to the field if the number of block groups is greater
> than 65535.  Current kernel ignores the field and re-calculate from
> other fields.  findsuper command is the only user of it and it simply prints
> the value.  So, it does not limit the maximum size of the filesystem to 8 
> TB.

s_block_group_nr is *not* the number of block groups in the
filesystem.  As Takashi-san properly pointed out, the kernel
calculates the number of block groups by dividing the number of blocks
by the blocks_per_group field.  s_block_group_nr is used to identify
the block group of a particular backup superblock.

So for the backup superblock located at block group #3,
s_block_group_nr is 3, and for the backup superblock located at block
group #5, s_block_group_nr is 5, and so on.  It is used only as a hint
so that programs like findsuper and gpart can be more intelligent about
finding the start of the filesystem when trying to recover from a
smashed partition table.
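
The group-count calculation mentioned above can be sketched as follows.
This is a user-space sketch with a hypothetical function name; the
parameters stand for the superblock fields s_blocks_count,
s_first_data_block and s_blocks_per_group, and rounding up for a
partial last group is an assumption of the sketch.

```c
#include <stdint.h>

/* Number of block groups derived from the block count, rounded up so a
 * partially filled last group still counts. */
static uint32_t groups_count(uint32_t blocks_count,
			     uint32_t first_data_block,
			     uint32_t blocks_per_group)
{
	uint64_t data_blocks = (uint64_t)blocks_count - first_data_block;

	return (uint32_t)((data_blocks + blocks_per_group - 1) /
			  blocks_per_group);
}
```

For 2^32-1 blocks with 32768 blocks per group this yields 131072 groups,
which does not fit in 16 bits but is no problem for a count that is
derived rather than stored.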

						- Ted

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-16 12:11   ` Takashi Sato
  2006-03-16 13:53     ` Theodore Ts'o
@ 2006-03-16 18:35     ` Andreas Dilger
  2006-03-16 21:26       ` Theodore Ts'o
  2006-03-17  9:35       ` Laurent Vivier
  1 sibling, 2 replies; 34+ messages in thread
From: Andreas Dilger @ 2006-03-16 18:35 UTC (permalink / raw)
  To: Takashi Sato; +Cc: cmm, linux-kernel, ext2-devel, Laurent Vivier

On Mar 16, 2006  21:11 +0900, Takashi Sato wrote:
> >Also, I noticed that in your first patch, you changed a few variables
> >for logical block number from "long" to "unsigned int". Just want to
> >point out that's a seperate issue- that's for enlarge the file size, not
> >for expand the max filesystem size.
> 
> Ok, I'll remove them when I update the patch next time.
> They are left because I'm considering enlarging the file size max too...

There was previously a patch by Goldwyn Rodrigues in linux-kernel:
"[PATCH] Pushing ext3 file size limits beyond 2TB", which at least
got as far as 4TB for the file size (for 4kB blocks).

Beyond that, we need a format change and may as well have something
like extents, but even extents still need to allow a larger i_blocks,
so that patch would be useful in any case...  though it needs some
cleanup to remove all users of i_frag and i_faddr (which have never
ever been used).

Laurent, do your 64-bit patches include support for larger i_blocks?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-16 18:35     ` Andreas Dilger
@ 2006-03-16 21:26       ` Theodore Ts'o
  2006-03-16 22:59         ` Andreas Dilger
  2006-03-17  9:35       ` Laurent Vivier
  1 sibling, 1 reply; 34+ messages in thread
From: Theodore Ts'o @ 2006-03-16 21:26 UTC (permalink / raw)
  To: Takashi Sato, cmm, linux-kernel, ext2-devel, Laurent Vivier

On Thu, Mar 16, 2006 at 11:35:49AM -0700, Andreas Dilger wrote:
> Beyond that, we need a format change and may as well have something
> like extents, but even extents still need to allow a larger i_blocks,

As a side note, one of the things that we've been talking about doing
is bundling a number of small changes together into a single INCOMPAT
flag.  Changing i_blocks so its units are in blocks rather than
512-byte sectors was one such change.
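
To put rough numbers on the i_blocks change: assuming i_blocks stays a
32-bit count, the unit it counts in bounds how much space a single
inode can account for.  The helper below is a hypothetical user-space
sketch, not kernel code.

```c
#include <stdint.h>

/* Bytes that a 32-bit i_blocks field can account for when each unit
 * is "unit_bytes" bytes long. */
static uint64_t max_accounted_bytes(uint32_t unit_bytes)
{
	return (uint64_t)UINT32_MAX * unit_bytes;
}
```

With 512-byte units the bound is just under 2 TB; counting in 4 KB
filesystem blocks raises it to just under 16 TB.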

Another was guaranteeing that for large inodes (> 128 bytes) that at
least some number of bytes (probably on the order of 32 bytes or so)
would be reserved for things like the high resolution portion of
ctime/mtime/atime, high watermark, and other inode extensions.  (One
of the problems with doing high res timestamps right is how to handle
the case where you can't make room for the high res timestamps, due to
too much space being taken up by extended attributes.  The make(1)
program gets really confused unless all files are either using or not
using high res timestamps.)

The idea was to do a quick easy strike of all of the ideas which could
be implemented quickly, and perhaps try to get them done before RHEL 5
snapshots.  Even if RHEL5 doesn't enable use of these features by
default, having it supported by RHEL5 would be extremely convenient.

> so that patch would be useful in any case...  though it needs some
> cleanup to remove all users of i_frag and i_faddr (which have never
> ever been used).

One of the things which we need to consider is whether we think we
will never support tail packing or other forms of fragments, which is
related to whether we think we will ever support large blocks (i.e.,
32k, 64k, and up).  If we do, we might want to keep those fields
around.

						- Ted

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-16 21:26       ` Theodore Ts'o
@ 2006-03-16 22:59         ` Andreas Dilger
  2006-03-18 17:07           ` Theodore Ts'o
  0 siblings, 1 reply; 34+ messages in thread
From: Andreas Dilger @ 2006-03-16 22:59 UTC (permalink / raw)
  To: Theodore Ts'o, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier

On Mar 16, 2006  16:26 -0500, Theodore Ts'o wrote:
> On Thu, Mar 16, 2006 at 11:35:49AM -0700, Andreas Dilger wrote:
> > Beyond that, we need a format change and may as well have something
> > like extents, but even extents still need to allow a larger i_blocks,
> 
> As a side note, one of the things that we've been talking about doing
> is bundling a number of small changes together into a single INCOMPAT
> flag.  Changing i_blocks so its units are in blocks rather than
> 512-byte sectors was one such change.

> Another was guaranteeing that for large inodes (> 128 bytes) that at
> least some number of bytes (probably on the order of 32 bytes or so)
> would be reserved for things like the high resolution portion of
> ctime/mtime/atime, high watermark, and other inode extensions.  (One
> of the problems with doing high res timestamps right is how to handle
> the case where you can't make room for the high res timestamps, due to
> too much space being taken up by extended attributes.  The make(1)
> program gets really confused unless all files are either using or not
> using high res timestamps.)
> 
> The idea was to do a quick easy strike of all of the ideas which could
> be implemented quickly, and perhaps try to get them done before RHEL 5
> snapshots.  Even if RHEL5 doesn't enable use of these features by
> default, having it supported by RHEL5 would be extremely convenient.

While I agree with that in theory, in practise we never end up doing
this and it just ends up delaying the acceptance of the trivial patches.
It may also be a burden later on when some of the features that could
be e.g. ROCOMPAT are bundled with an INCOMPAT change and we then
make the filesystem gratuitously INCOMPAT.
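
For readers less familiar with the three flag classes, the conventional rule can be sketched as below. This is an illustration only: the enum and function names are assumptions, not the kernel's actual mount code.

```c
#include <stdint.h>

enum mount_mode { MOUNT_RW, MOUNT_RO_ONLY, MOUNT_REFUSE };

/*
 * Sketch of the usual ext2/3 convention:
 *  - unknown COMPAT bits are ignored entirely (so not even passed in),
 *  - unknown ROCOMPAT bits allow a read-only mount only,
 *  - unknown INCOMPAT bits forbid mounting at all.
 * ro_supp / incompat_supp are the bits this implementation understands.
 */
static enum mount_mode mount_decision(uint32_t ro_compat, uint32_t incompat,
				      uint32_t ro_supp, uint32_t incompat_supp)
{
	if (incompat & ~incompat_supp)
		return MOUNT_REFUSE;
	if (ro_compat & ~ro_supp)
		return MOUNT_RO_ONLY;
	return MOUNT_RW;
}
```

Bundling a change that only needs ROCOMPAT under an INCOMPAT bit moves an older kernel from the second branch to the first, which is the "gratuitously INCOMPAT" outcome.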

In the end, I don't think having a couple of separate flags is any more
effort than having a single one.  As yet we only have about 6 of each 32
feature bits used, and if we get close to running out we can make an
EXT3_FEATURE_{,RO,IN}COMPAT_NEXT_WORD flag to continue it on.

Note that I'm not against this in practise, but I wouldn't hold up any
feature for this reason.  How long has large i_blocks been pending,
and usecond timestamps?  Many years already, even though they are trivial
to implement, so I'm hesitant to tie them together and delay further.

I think i_blocks can be considered an ROCOMPAT feature, and the large
inode reservation for usecond timestamps could be COMPAT (since an
unsupporting kernel would still update all the timestamps consistently,
even if the useconds on disk would be some constant instead of 0).

> > so that patch would be useful in any case...  though it needs some
> > cleanup to remove all users of i_frag and i_faddr (which have never
> > ever been used).
> 
> One of the things which we need to consider is whether we think we
> will never support tail packing or other forms of fragments, which is
> related to whether we think we will ever support large blocks (i.e.,
> 32k, 64k, and up).  If we do, we might want to keep those fields
> around.

I thought the long-term plan for small files was to just store them
in an EA?  That way, we can efficiently pack them inside the inode
or up to blocksize (space willing) without any usage of inode fields
(maybe with a flag to indicate that there is such a fragment to avoid
gratuitous EA searching).  This would be a net win on performance since
it avoids an IO for the in-inode case at least.

I think testing with reiserfs showed that tail packing was a net loss in
most cases, since basically every benchmark I've ever seen with reiserfs
disables tail packing or suffers.  For space constrained systems (if
there ever exists such a thing again ;-) it would probably be better to
go to compressed files.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-16 18:35     ` Andreas Dilger
  2006-03-16 21:26       ` Theodore Ts'o
@ 2006-03-17  9:35       ` Laurent Vivier
  1 sibling, 0 replies; 34+ messages in thread
From: Laurent Vivier @ 2006-03-17  9:35 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Takashi Sato, Mingming Cao, linux-kernel, ext2-devel

On Thu 16/03/2006 at 19:35, Andreas Dilger wrote:
[...]
> Laurent, do your 64-bit patches include support for larger i_blocks?

No, I only work on extending the filesystem size. Extending the file
size will be the next step...

Cheers,
Laurent
-- 
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4


* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-16 22:59         ` Andreas Dilger
@ 2006-03-18 17:07           ` Theodore Ts'o
  2006-03-20  6:36             ` Andreas Dilger
  0 siblings, 1 reply; 34+ messages in thread
From: Theodore Ts'o @ 2006-03-18 17:07 UTC (permalink / raw)
  To: Takashi Sato, cmm, linux-kernel, ext2-devel, Laurent Vivier

On Thu, Mar 16, 2006 at 03:59:13PM -0700, Andreas Dilger wrote:
> While I agree with that in theory, in practise we never end up doing
> this and it just ends up delaying the acceptance of the trivial patches.
> It may also be a burden later on when some of the features that could
> be e.g. ROCOMPAT are bundled with an INCOMPAT change and we then
> make the filesystem gratuitously INCOMPAT.
> 
> In the end, I don't think having a couple of separate flags is any more
> effort than having a single one.  As yet we only have about 6 of each 32
> feature bits used, and if we get close to running out we can make an
> EXT3_FEATURE_{,RO,IN}COMPAT_NEXT_WORD flag to continue it on.

The overhead is not running out of feature bit flags.  After all, it's
easy to add more if we need to; we just define the MSB as meaning
"check the auxiliary features compat/rocompat/incompat mask", and then
define a new 32-bit extension bitmask in the superblock.
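
A minimal sketch of that extension scheme, assuming the top bit of each existing feature word is the "go look at the auxiliary mask" marker. FEATURE_AUX_WORD and has_unknown_features() are hypothetical names; no such flag exists in the superblock today.

```c
#include <stdint.h>

/* Assumed: the MSB of a feature word means "also check the extension mask". */
#define FEATURE_AUX_WORD 0x80000000u

/*
 * Returns nonzero if the filesystem sets any feature bit that this
 * implementation does not understand.  The auxiliary word is consulted
 * only when the MSB of the primary word is set.
 */
static int has_unknown_features(uint32_t word, uint32_t aux_word,
				uint32_t supp, uint32_t aux_supp)
{
	if (word & ~(supp | FEATURE_AUX_WORD))
		return 1;
	if ((word & FEATURE_AUX_WORD) && (aux_word & ~aux_supp))
		return 1;
	return 0;
}
```

An implementation that predates the scheme would simply treat the MSB as one more unknown bit, so it would fail safe according to the class of the word it appears in.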

What I'm trying to simplify is the overhead of users trying to
understand a tangled mess of features, some compat, some incompat,
etc.

> Note that I'm not against this in practise, but I wouldn't hold up any
> feature for this reason.  How long has large i_blocks been pending,
> and usecond timestamps?  Many years already, even though they are trivial
> to implement, so I'm hesitant to tie them together and delay further.

i_blocks has been pending because the people who could push it haven't
had the time, and usec timestamps because the trivial way (without at
least an ROCOMPAT flag) doesn't work.

> I think i_blocks can be considered an ROCOMPAT feature, and the large
> inode reservation for usecond timestamps could be COMPAT I think (since
> an unsupporting kernel would still update all the timestamps consistently
> even if the useconds on disk would be some constant instead of 0).

i_blocks can be ROCOMPAT only if it is acceptable for stat(2) to return
erroneous i_blocks values.  I'm not entirely convinced that's a
good thing, and at the very least it would be extremely confusing, but
maybe.

usecond timestamps must be at least ROCOMPAT, because of the
requirement that all newly created inodes must reserve extra space and
guarantee that i_extra_isize must be at least n bytes (where n is the
size of the guaranteed extra inode fields).  If you don't do that,
then when the filesystem is mounted once again on a kernel that does
understand usec timestamps, some inodes will have room for the usec
time fields, and other inodes won't (because they have too much of the
space used for EA's), and that will cause serious problems for make(1).

> I think testing with reiserfs showed that tail packing was a net loss in
> most cases, since basically every benchmark I've ever seen with reiserfs
> disables tail packing or suffers.  For space constrained systems (if
> there ever exists such a thing again ;-) it would probably be better to
> go to compressed files.

I have to wonder if that's because of the way reiserfs implemented
tail-packing more than anything else.  I don't believe fragments hurt
performance on UFS systems quite as much as they do on reiserfs
systems.  I'm not worried about this as much for space constrained
systems, but for cases where we find that increasing the blocksize to
8k or even larger (32k?  64k) really helps, but we don't want to pay
the internal fragmentation penalty for small files.  There are other
ways to solve the problem, yes, such as by assuming that we can use a
different filesystem for database or video streams separate from the
/, /usr, and/or /var filesystems, for example.  
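
The internal fragmentation penalty is easy to quantify. A small sketch, assuming a file always occupies a whole number of blocks and there is no tail packing; slack_bytes() is a name made up for this illustration:

```c
#include <stdint.h>

/*
 * Bytes wasted in the last block of a file when every file must occupy
 * a whole number of blocks (no fragments, no tail packing).
 */
static uint64_t slack_bytes(uint64_t file_size, uint32_t block_size)
{
	uint64_t rem = file_size % block_size;

	return rem ? block_size - rem : 0;
}
```

A 1000-byte file wastes 3096 bytes with 4k blocks but 64536 bytes with 64k blocks, which is why large blocks and some form of fragment support tend to be discussed together.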

If we are ready to forever forswear wanting to use large block sizes,
then maybe we don't need to worry about fragmentation support (or
maybe the 1.8" petabyte disk drives will show up and be cheap enough
that we just won't care about wasting space on small files).  But
that's I think a decision which we need to formally make.

					- Ted

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-15 12:39 [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel) Takashi Sato
  2006-03-15 12:56 ` [Ext2-devel] " Laurent Vivier
  2006-03-16  2:19 ` Mingming Cao
@ 2006-03-19  2:20 ` Theodore Ts'o
  2006-03-20 10:11   ` Takashi Sato
  2 siblings, 1 reply; 34+ messages in thread
From: Theodore Ts'o @ 2006-03-19  2:20 UTC (permalink / raw)
  To: Takashi Sato; +Cc: ext2-devel, linux-kernel

On Wed, Mar 15, 2006 at 09:39:20PM +0900, Takashi Sato wrote:
>  - Modify to call C functions(ext2fs_set_bit(),ext2fs_test_bit())
>    defined in lib/ext2fs/bitops.c on x86 and mc68000 architecture.

I just did some quick tests, and using the C functions instead of the
asm instructions increases the CPU user time by 7.8% on a test
filesystem, and increased the wall clock time of the e2fsck regression
test suite by 4.5%.  That's not huge, and the test suite was cached so
the percentage difference will probably be less in real-life
situations, but I'd still like to avoid it if I can.
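
For reference, the portable C path being measured looks roughly like the sketch below. It is not the code from lib/ext2fs/bitops.c; the sketch_ names are placeholders. The point is that the bit number is unsigned throughout, so bit numbers of 2^31 and above index forward from the bitmap base.

```c
#include <stdint.h>

/* Set bit nr in the bitmap at addr; returns the previous value of the bit. */
static int sketch_set_bit(uint32_t nr, void *addr)
{
	unsigned char *byte = (unsigned char *)addr + (nr >> 3);
	unsigned char mask = (unsigned char)(1u << (nr & 7));
	int old = (*byte & mask) != 0;

	*byte |= mask;
	return old;
}

/* Return the value (0 or 1) of bit nr in the bitmap at addr. */
static int sketch_test_bit(uint32_t nr, const void *addr)
{
	const unsigned char *byte = (const unsigned char *)addr + (nr >> 3);

	return (*byte >> (nr & 7)) & 1;
}
```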

I've just checked my i386 assembly language reference, and I don't see
any indication that the btsl, btrl, and btl instructions don't work if
the high bit is set on the bit number.  Have you done tests showing
that these instructions do not work correctly for filesystem sizes >
2**31 blocks, or have references showing that these instructions
interpret the bit number as a signed integer?

Thanks, regards,

						- Ted

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-18 17:07           ` Theodore Ts'o
@ 2006-03-20  6:36             ` Andreas Dilger
  2006-03-20 22:38               ` Stephen C. Tweedie
  0 siblings, 1 reply; 34+ messages in thread
From: Andreas Dilger @ 2006-03-20  6:36 UTC (permalink / raw)
  To: Theodore Ts'o, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier

On Mar 18, 2006  12:07 -0500, Theodore Ts'o wrote:
> What I'm trying to simplify is the overhead of users trying to
> understand a tangled mess of features, some compat, some incompat,
> etc.

I think the real goal is that 99% of users never really see this
in the first place.  Add in support for these features now with the
most permissive COMPAT flag possible, and chances are that most
normal users won't use these features for a few years.  Most of
them are for the 1% of systems that are pushing the current filesystem
limits, and the majority of these users are also more sophisticated.
You won't have Joe Average dual-booting their Linux system with a
16TB filesystem.

> > I think i_blocks can be considered an ROCOMPAT feature, and the large
> > inode reservation for usecond timestamps could be COMPAT I think (since
> > an unsupporting kernel would still update all the timestamps consistently
> > even if the useconds on disk would be some constant instead of 0).

NB - my reference for i_blocks was the use of i_frag|i_fsize for use
     by files > 2TB, not the recent proposal for i_blocks in fs blocksize.

> i_blocks can be ROCOMPAT only if it is acceptable for stat(2) to return
> erroneous i_blocks values.  I'm not entirely convinced that's a
> good thing, and at the very least it would be extremely confusing, but
> maybe.

I think most cases where someone is ro-mounting their filesystem is
when they need emergency access to the filesystem with an older kernel.
Allowing that access IMHO is more important than exact correctness.
I'm also not aware of any tools that depend on i_blocks being correct,
though I suspect e.g. "cp --sparse" will use a discrepancy in
i_size vs (i_blocks << 9) to determine sparseness.
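
The kind of check suspected here can be sketched as follows. This is a guess at the heuristic, not code taken from cp; looks_sparse() is a hypothetical name, and blocks_512 stands for a count of 512-byte units such as st_blocks from stat(2).

```c
#include <stdint.h>

/*
 * Heuristic: a file looks sparse when the bytes actually allocated to it
 * (the 512-byte unit count converted to bytes) fall short of its size.
 */
static int looks_sparse(uint64_t size_bytes, uint64_t blocks_512)
{
	return (blocks_512 << 9) < size_bytes;
}
```

If an older kernel reports i_blocks in the wrong unit on a read-only mount, a check like this could misjudge a file in either direction, which is the sort of confusion being weighed against emergency access.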

> usecond timestamps must be at least ROCOMPAT, because of the
> requirement that all newly created inodes must reserve extra space and
> guarantee that i_extra_isize must be at least n bytes (where n is the
> size of the guaranteed extra inode fields).  If you don't do that,
> then when the filesystem is mounted once again on a kernel that does
> understand usec timestamps, some inodes will have room for the usec
> time fields, and other inodes won't (because they have too much of the
> space used for EA's), and that will cause serious problems for make(1).

What happens to existing filesystems with large inodes that don't have
enough space for the extra timestamps in the first place?  Also, if files
are created while the filesystem is mounted without usecond timestamps
they would get no usecond fields anyways.  I agree that there are some
unlikely corner conditions that might be hit (large inode filesystem, on
older kernel without usec support, fills both the in-inode and external
block so much that there isn't 12 bytes left for the usecond timestamps,
and that file happens to depend on the exact accuracy of the timestamp).
IMHO the inconvenience of the ROCOMPAT outweighs the benefits.

We have previously restricted ROCOMPAT and INCOMPAT flags for changes
that would cause corruption or crashes on older kernels.  In the end,
I'm not dead-set against making it ROCOMPAT, just trying to maintain
the maximal compatibility possible.

> for cases where we find that increasing the blocksize to
> 8k or even larger (32k?  64k) really helps, but we don't want to pay
> the internal fragmentation penalty for small files.  There are other
> ways to solve the problem, yes, such as by assuming that we can use a
> different filesystem for database or video streams separate from the
> /, /usr, and/or /var filesystems, for example.  
> 
> If we are ready to forever forswear wanting to use large block sizes,
> then maybe we don't need to worry about fragmentation support (or
> maybe the 1.8" petabyte disk drives will show up and be cheap enough
> that we just won't care about wasting space on small files).  But
> that's I think a decision which we need to formally make.

I'm not against large block support (in fact I was hoping this would be
standard by now), or fragment support.  Rather, I think that nobody
cares enough about it to actually implement it, and given the growth
of disks the demand will never materialize, just like ext2 filesystem
compression missed the window when the benefits outweighed the costs.

If and when large-page/block support makes it into a commodity CPU
(sadly, x86_64 missed the mark) or kernel we can re-evaluate it then.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-19  2:20 ` Theodore Ts'o
@ 2006-03-20 10:11   ` Takashi Sato
  2006-03-26  3:01     ` Theodore Ts'o
  0 siblings, 1 reply; 34+ messages in thread
From: Takashi Sato @ 2006-03-20 10:11 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel, ext2-devel

Hi,

> I've just checked my i386 assembly language reference, and I don't see
> any indication that the btsl, btrl, and btl instructions don't work if
> the high bit is set on the bit number.  Have you done tests showing
> that these instructions do not work correctly for filesystem sizes >
> 2**31 blocks, 

Yes, I did, and I confirmed that those instructions cause a
segmentation fault.

>                          or have references showing that these instructions
> interpret the bit number as a signed integer?

I got the developer's manual from the following site.
ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf
The explanation of "bts" on page 3-82 refers the reader to
"Bit(BitBase, BitOffset)" on page 3-10.

"Table 3-2. Range of Bit Positions Specified by Bit Offset Operands"
on page 3-10 says that a register bit offset is restricted to the range
-2^31 to 2^31 - 1.
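
The effect of that signed range can be sketched in C. This is an illustration at byte granularity, not a model of the exact addressing the manual defines, and signed_byte_displacement() is a made-up name. It shows where the access lands relative to the bitmap base once a bit number of 2^31 or more is read as a signed 32-bit offset.

```c
#include <stdint.h>

/*
 * Byte displacement from the bitmap base implied by a bit number that
 * the CPU interprets as a signed 32-bit offset (floor division by 8).
 */
static int64_t signed_byte_displacement(uint32_t bit_nr)
{
	/* Reinterpret the unsigned bit number as a two's-complement value. */
	int64_t off = (bit_nr >= 0x80000000u)
		? (int64_t)bit_nr - 4294967296LL
		: (int64_t)bit_nr;

	/* Floor division, so negative offsets round away from zero. */
	return (off >= 0) ? off / 8 : -((-off + 7) / 8);
}
```

Bit number 2^31 therefore maps to 256MB before the start of the bitmap rather than 256MB into it, which would be consistent with the segmentation fault seen in testing.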

--
Takashi Sato 

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-20  6:36             ` Andreas Dilger
@ 2006-03-20 22:38               ` Stephen C. Tweedie
  2006-03-20 23:48                 ` Andreas Dilger
  2006-03-21  4:03                 ` Theodore Ts'o
  0 siblings, 2 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 2006-03-20 22:38 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier, Stephen Tweedie

Hi,

On Sun, 2006-03-19 at 23:36 -0700, Andreas Dilger wrote:

> What happens to existing filesystems with large inodes that don't have
> enough space for the extra timestamps in the first place?

Sadly, they are basically out of luck, unless we change the way that
space in the extended inode is used.

In retrospect, perhaps we goofed.  We added that space into the inode,
but there is no guarantee that it can be used on demand for anything
other than xattrs --- precisely because xattrs can grow to use all
available space both in the external xattr block *and* in the inode.

We could have defined things such that you could either use the in-inode
space, OR the external space, for xattrs, but not both.  But that would
be a performance compromise at best, for some of the most important
xattrs (like SELinux labels, which are always there and are always
needed) really want to be accelerated in the inode.

We really ought to have reserved *some* space in the extended inode for
non-xattr fields, for compatibility purposes.

But it's probably not too late.  I would expect that the vast majority
of filesystems won't have any inodes that have fully-occupied xattr
space.  It would be easy enough to define a new flag that indicates that
there is always X amount of space reserved for inode fields, and to set
that in fsck if all inodes on the fs obey that restriction.  Then it
just comes down to picking a number X that is likely to satisfy all the
short-term demands for new inode fields.

> Also, if files
> are created while the filesystem is mounted without usecond timestamps
> they would get no usecond fields anyways.  I agree that there are some
> unlikely corner conditions that might be hit (large inode filesystem, on
> older kernel without usec support, fills both the in-inode and external
> block so much that there isn't 12 bytes left for the usecond timestamps,
> and that file happens to depend on the exact accuracy of the timestamp).
> IMHO the inconvenience of the ROCOMPAT outweighs the benefits.

That's precisely the corner case that concerns me.  The question is, do
we want the filesystem to behave correctly in all cases, or do we take
short-cuts?  

I think we're probably early enough in the adoption of large inodes that
we don't have to make that compromise, and we can reserve some space for
guaranteed use by inode fields with a single minimally-invasive compat
change (say, a flag enabling a field in the superblock which defines how
many bytes we can always safely use for extended inode fields.)

--Stephen

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-20 22:38               ` Stephen C. Tweedie
@ 2006-03-20 23:48                 ` Andreas Dilger
  2006-03-21 17:05                   ` Stephen C. Tweedie
  2006-03-21  4:03                 ` Theodore Ts'o
  1 sibling, 1 reply; 34+ messages in thread
From: Andreas Dilger @ 2006-03-20 23:48 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Theodore Ts'o, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier

On Mar 20, 2006  17:38 -0500, Stephen C. Tweedie wrote:
> On Sun, 2006-03-19 at 23:36 -0700, Andreas Dilger wrote:
> > What happens to existing filesystems with large inodes that don't have
> > enough space for the extra timestamps in the first place?
> 
> Sadly, they are basically out of luck, unless we change the way that
> space in the extended inode is used.
> 
> We could have defined things such that you could either use the in-inode
> space, OR the external space, for xattrs, but not both.  But that would
> be a performance compromise at best, for some of the most important
> xattrs (like SELinux labels, which are always there and are always
> needed) really want to be accelerated in the inode.

The fast EA space is the only reason we implemented this at all.  We
also still need the external EA space for the overflow case.

> I would expect that the vast majority of filesystems won't have any
> inodes that have fully-occupied xattr space.

I would agree.  The number of affected files is likely infinitesimal,
given that large inodes are not enabled by default (not sure if they
are even documented), Lustre doesn't use more than a single EA on a file
except in the just-released version, and Samba4 doesn't care because it
stores its timestamps in EAs anyways due to lack of usecond timestamps.

> It would be easy enough to define a new flag that indicates that
> there is always X amount of space reserved for inode fields, and to set
> that in fsck if all inodes on the fs obey that restriction.  Then it
> just comes down to picking a number X that is likely to satisfy all the
> short-term demands for new inode fields.

We could change ext3_new_inode() today to reserve, say, 12 or 16 more
bytes for timestamps, even if they are not implemented yet.  Having a
field in the superblock (tunable by the admin, conceivably) to reserve
a total of X bytes for i_extra_isize has some appeal though.

At a rough guess I'd want extended timestamps (27 bits for 10s of
nanoseconds, with the high 5 bits left for growing the number of seconds).
If we put the fields in "priority" order, then on those inodes that don't
have much space left we at least get the more important one(s), primarily
mtime I think.

	__u32 i_mtime_extended;
	__u32 i_ctime_extended;
	__u32 i_atime_extended;
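
One way such a 32-bit word could be packed is sketched below. The bit layout (sub-second part in the low 27 bits, seconds extension in the high 5) is an assumption for illustration, and the ts_extended_* names are hypothetical. 27 bits are enough because one second is 10^8 units of 10ns, and 2^27 is about 1.34 * 10^8.

```c
#include <stdint.h>

#define TS_SUBSEC_BITS 27u
#define TS_SUBSEC_MASK ((1u << TS_SUBSEC_BITS) - 1u)

/* Pack nanoseconds (0..999999999) as 10ns units plus 5 extra seconds bits. */
static uint32_t ts_extended_pack(uint32_t nsec, uint32_t sec_high5)
{
	return ((sec_high5 & 0x1fu) << TS_SUBSEC_BITS) |
	       ((nsec / 10u) & TS_SUBSEC_MASK);
}

/* Recover the sub-second part, in nanoseconds, at 10ns resolution. */
static uint32_t ts_extended_nsec(uint32_t ext)
{
	return (ext & TS_SUBSEC_MASK) * 10u;
}

/* Recover the 5 high bits that extend the seconds field. */
static uint32_t ts_extended_sec_high(uint32_t ext)
{
	return ext >> TS_SUBSEC_BITS;
}
```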

Are there any other needs right now?  My thought on the "extra" fields
in the inode is that they would always be on an "as available" basis,
so if e.g. we only had 8 bytes reserved we would get i_mtime_extended,
and i_ctime_extended, and not i_atime_extended.  If there is some added
field that is so important that the kernel/filesystem can't live without
it, it would need its own {RO,IN}COMPAT flag anyways.
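
The "as available" rule amounts to a bounds check against the reserved size. A sketch, assuming the reserved byte count covers only the three fields above, in that order; struct extra_fields and EXTRA_FIELD_FITS() are hypothetical and do not match the real on-disk inode layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the extra fields, in priority order. */
struct extra_fields {
	uint32_t i_mtime_extended;
	uint32_t i_ctime_extended;
	uint32_t i_atime_extended;
};

/* A field is usable only if it lies entirely within the reserved bytes. */
#define EXTRA_FIELD_FITS(reserved, field)				\
	((size_t)(reserved) >=						\
	 offsetof(struct extra_fields, field) +				\
	 sizeof(((struct extra_fields *)0)->field))
```

With 8 bytes reserved, the check passes for i_mtime_extended and i_ctime_extended and fails for i_atime_extended, matching the example in the paragraph above.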

I think with the advent of large inodes we can be less worried about
cannibalizing the other "unused" inode fields like i_faddr, i_frag,
i_fsize.  Candidates include an i_blocks_high field (even in the face of
Takashi's recently proposed patch we would still want another 16 or 32
bits for larger files, maybe at the same time as his patch is implemented),
a 32-bit inode checksum, and more bits for i_nlinks?

It would also be good to understand what HURD is actually doing with
those other fields (if anything, does it even exist anymore?), since
it is literally holding TB of space unusable on Linux ext3 filesystems
that could better be put to use.  There are i_translator, i_mode_high,
and i_author held hostage by HURD, and I certainly have never seen or
heard of any good description of what they do or if Linux would/could
ever use them, or if HURD could live without them.

> > Also, if files
> > are created while the filesystem is mounted without usecond timestamps
> > they would get no usecond fields anyways.  I agree that there are some
> > unlikely corner conditions that might be hit (large inode filesystem, on
> > older kernel without usec support, fills both the in-inode and external
> > block so much that there isn't 12 bytes left for the usecond timestamps,
> > and that file happens to depend on the exact accuracy of the timestamp).
> > IMHO the inconvenience of the ROCOMPAT outweighs the benefits.
> 
> That's precisely the corner case that concerns me.  The question is, do
> we want the filesystem to behave correctly in all cases, or do we take
> short-cuts?  
> 
> I think we're probably early enough in the adoption of large inodes that
> we don't have to make that compromise, and we can reserve some space for
> guaranteed use by inode fields with a single minimally-invasive compat
> change (say, a flag enabling a field in the superblock which defines how
> many bytes we can always safely use for extended inode fields.)

I'm fully in the "the chance of any real problem is vanishingly small"
camp, even though Lustre is one of the few users of large inodes.  The
presence of the COMPAT field would not really be any different than just
changing ext3_new_inode() to make i_extra_isize 16 by default, except to
cause breakage against the older e2fsprogs.  I don't think it is a bad
idea to implement kernel support for such a flag, but not actually set
it in the superblock unless done so by tune2fs.

Hmm, another "forward looking" change may be to add some masking of bits
in the inode i_flags word.  Ted did this with great success for the
EXT2_INDEX_FL.  Would it be prudent to do the same with, say, the top 4
bits of i_flags?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-20 22:38               ` Stephen C. Tweedie
  2006-03-20 23:48                 ` Andreas Dilger
@ 2006-03-21  4:03                 ` Theodore Ts'o
  1 sibling, 0 replies; 34+ messages in thread
From: Theodore Ts'o @ 2006-03-21  4:03 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andreas Dilger, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier

On Mon, Mar 20, 2006 at 05:38:02PM -0500, Stephen C. Tweedie wrote:
> But it's probably not too late.  I would expect that the vast majority
> of filesystems won't have any inodes that have fully-occupied xattr
> space.  It would be easy enough to define a new flag that indicates that
> there is always X amount of space reserved for inode fields, and to set
> that in fsck if all inodes on the fs obey that restriction.  Then it
> just comes down to picking a number X that is likely to satisfy all the
> short-term demands for new inode fields.

Yes, that's what I'm proposing that we do.  My original plan was to
use an incompat flag that would guarantee that there would be enough
space for likely short-term new inode fields, but perhaps it doesn't
have to be an incompat flag.  At least in theory it could be a compat
flag, and then we release a new e2fsprogs which enforces the guarantee
that at least that much space is reserved in every single inode, and
offers to remove one or more EA's in order to satisfy that guarantee.  

There is a chance that someone who has a filesystem with the compat
feature enabled, a kernel that has support for high-resolution
time-stamps, and an old e2fsprogs will get screwed, but only if the EA
space is totally filled up.  But maybe that's an acceptable risk, and
the worst that will happen is that make(1) will get confused.

> I think we're probably early enough in the adoption of large inodes that
> we don't have to make that compromise, and we can reserve some space for
> guaranteed use by inode fields with a single minimally-invasive compat
> change (say, a flag enabling a field in the superblock which defines how
> many bytes we can always safely use for extended inode fields.)

Ah, it sounds like you're thinking the same thing I am.  OK, that
seems like a reasonable compromise.  We are taking a bit of a
shortcut, but it seems reasonable to assume that distro's will have
the right version of e2fsprogs if they want to use this feature; if
they don't users won't be able to enable the new compat flag anyway,
which means the chances of the user noticing a real problem is pretty
low.

						- Ted

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-20 23:48                 ` Andreas Dilger
@ 2006-03-21 17:05                   ` Stephen C. Tweedie
  2006-03-21 18:38                     ` Theodore Ts'o
  2006-03-21 20:26                     ` Andreas Dilger
  0 siblings, 2 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 2006-03-21 17:05 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier, Stephen Tweedie

Hi,

On Mon, 2006-03-20 at 16:48 -0700, Andreas Dilger wrote:

> > It would be easy enough to define a new flag that indicates that
> > there is always X amount of space reserved for inode fields, and to set
> > that in fsck if all inodes on the fs obey that restriction.  Then it
> > just comes down to picking a number X that is likely to satisfy all the
> > short-term demands for new inode fields.
> 
> We could change ext3_new_inode() today to reserve, say, 12 or 16 more
> bytes for timestamps, even if they are not implemented yet.  Having a
> field in the superblock (tunable by the admin, conceivably) to reserve
> a total of X bytes for i_extra_isize has some appeal though.

Exactly, because it's more than just the timestamps that we'd like to
grow.

> 	__u32 i_(m|c|a)time_extended;

> Are there any other needs right now?

Potentially, yes.  If we want to go 64-bits, then the extent maps can
take care of indirect blocks, but we would still need:

	__le32	i_blocks;	/* Blocks count */
	__le32	i_file_acl;	/* File ACL */
	__le32	i_dir_acl;	/* Directory ACL */

to get an extra 32 bits.  And there's the old favourite,

	__le16	i_links_count;	/* Links count */

which is a completely unnecessary limit on subdirs, which would be great
to eliminate at the same time.
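
Widening any of these would most likely mean pairing the existing 32-bit field with a new "high" half stored in the extended inode area. A minimal sketch of that pairing, with a made-up helper name and no claim about where the high half would actually live on disk:

```c
#include <stdint.h>

/*
 * Combine an existing 32-bit on-disk field with a hypothetical new
 * "high" half into one 64-bit in-memory value.  Both halves are assumed
 * to have been converted from little-endian already.
 */
static uint64_t combine_hi_lo(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

An inode with no room for the high half would read it as 0 and keep today's 32-bit behaviour.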

We're not talking about a huge amount of space, here; I'd hate to
reserve too little space for the next year or so and force people to go
through a full forced fsck more than once to flag just a few more bytes
as available.

>   My thought on the "extra" fields
> in the inode is that they would always be on an "as available" basis,
> so if e.g. we only had 8 bytes reserved we would get i_mtime_extended,
> and i_ctime_extended, and not i_atime_extended.  

Yes, that's basically what we're already set up for with i_extra_isize.

The problem is that by the time we find we can't grow the inode fields,
it may be too late.  That's especially true with timestamps: ENOSPC is a
bad return code for sys_utimes()!  It's perhaps a little more reasonable
to expect to have to deal with ENOSPC when we do a mkdir() or write().
But a per-sb reserved-inode-growth field that fsck can always set, and
that the overwhelming majority of filesystems will be able to satisfy,
simply gets rid of *all* the edge cases by guaranteeing enough space.

> ...a 32-bit
> inode checksum...?

Not something that anyone is using right now, but it's exactly the sort
of thing that a superblock field would be ideal for.

> It would also be good to understand what HURD is actually doing with
> those other fields (if anything, does it even exist anymore?), since
> it is literally holding TB of space unusable on Linux ext3 filesystems
> that could better be put to use.  There are i_translator, i_mode_high,
> and i_author held hostage by HURD, and I certainly have never seen or
> heard of any good description of what they do or if Linux would/could
> ever use them, or if HURD could live without them.

If they really are 100% necessary for hurd, it might be that we could
relegate them to an xattr.  There's the slight problem of testing,
though; does anyone on ext2-devel actually run hurd, ever?

> I'm fully in the "the chance of any real problem is vanishingly small"
> camp, even though Lustre is one of the few users of large inodes.  The
> presence of the COMPAT field would not really be any different than just
> changing ext3_new_inode() to make i_extra_isize 16 by default, except to
> cause breakage against the older e2fsprogs.

Setting i_extra_isize will break older e2fsprogs anyway, won't it?
e2fsck needs to have full knowledge of all fs fields in order to
maintain consistency; if it doesn't know about some of the fields whose
presence is implied by i_extra_isize, then doesn't it have to abort?

So for future-proofing, we do need some distinction between the fields
actually *used* in i_extra_isize, and those simply reserved there.  And
that has to be per-inode, if we want to allow easy dynamic migration to
newer fields.

So a per-superblock field guaranteeing that there's at least $N bytes of
usable *potential* i_extra_isize in each inode, and a per-inode
i_extra_isize which shows which fields are *actively* used, gives us
both pieces of information that we need.
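
A minimal sketch of how the two values could be consulted, assuming
hypothetical names throughout (sb_view, inode_view, reserved_extra are
not existing ext3 identifiers) and assuming field offsets are measured
from the end of the original 128-byte inode:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GOOD_OLD_INODE_SIZE 128u  /* size of the original on-disk inode */

/* Hypothetical in-memory views of the two pieces of information. */
struct sb_view {
    uint16_t inode_size;      /* on-disk inode size, e.g. 256 */
    uint16_t reserved_extra;  /* per-sb guarantee: bytes usable in every inode */
};

struct inode_view {
    uint16_t extra_isize;     /* per-inode: bytes actively in use */
};

/* True if the field at [off, off + len) of the extra area is in active use. */
static int field_in_use(const struct inode_view *ino, size_t off, size_t len)
{
    return off + len <= ino->extra_isize;
}

/* True if the field is guaranteed to fit, so it can be activated later
 * without ever having to fail with ENOSPC. */
static int field_can_grow(const struct sb_view *sb, size_t off, size_t len)
{
    assert(sb->inode_size >= GOOD_OLD_INODE_SIZE);
    size_t room = sb->inode_size - GOOD_OLD_INODE_SIZE;
    size_t guaranteed = sb->reserved_extra < room ? sb->reserved_extra : room;
    return off + len <= guaranteed;
}
```

With this split, a field that is not yet in use but passes
field_can_grow() can be switched on per inode by raising extra_isize.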

--Stephen


* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-21 17:05                   ` Stephen C. Tweedie
@ 2006-03-21 18:38                     ` Theodore Ts'o
  2006-03-21 19:47                       ` Stephen C. Tweedie
                                         ` (3 more replies)
  2006-03-21 20:26                     ` Andreas Dilger
  1 sibling, 4 replies; 34+ messages in thread
From: Theodore Ts'o @ 2006-03-21 18:38 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andreas Dilger, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier, ams, cascardo

On Tue, Mar 21, 2006 at 12:05:22PM -0500, Stephen C. Tweedie wrote:
> > It would also be good to understand what HURD is actually doing with
> > those other fields (if anything, does it even exist anymore?), since
> > it is literally holding TB of space unusable on Linux ext3 filesystems
> > that could better be put to use.  There are i_translator, i_mode_high,
> > and i_author held hostage by HURD, and I certainly have never seen or
> > heard of any good description of what they do or if Linux would/could
> > ever use them, or if HURD could live without them.

Hurd is definitely using the translator field, and I only recently
discovered they are using it to point at a disk block where the name
of the translator program is stored (I'm not 100% sure, but I think
it's a generic, out-of-band, #! sort of functionality).  I don't know
about the other fields, but I can find out.

> If they really are 100% necessary for hurd, it might be that we could
> relegate them to an xattr.  There's the slight problem of testing,
> though; does anyone on ext2-devel actually run hurd, ever?

Relegating them to an xattr would break compatibility with existing
hurd filesystems.  We could take the arrogant "Linux is the only thing
that matters" stance, and just screw them, and the net result will
probably be that Hurd will never implement some of the advanced
features we've been talking about.  They might not anyways, though.  A
real problem is that as far as I know, the hurd ext2 developers aren't
on the ext2-devel mailing list.

I've cc'ed two people that sent me a request to add some additional
debugfs functionality to support hurd; maybe they can help by telling
us whether or not hurd is using i_mode_high and i_author, and whether
or not hurd has any likelihood of tracking new ext3 features that we
might add in the future or not.

> > I'm fully in the "the chance of any real problem is vanishingly small"
> > camp, even though Lustre is one of the few users of large inodes.  The
> > presence of the COMPAT field would not really be any different than just
> > changing ext3_new_inode() to make i_extra_isize 16 by default, except to
> > cause breakage against the older e2fsprogs.
> 
> Setting i_extra_isize will break older e2fsprogs anyway, won't it?
> e2fsck needs to have full knowledge of all fs fields in order to
> maintain consistency; if it doesn't know about some of the fields whose
> presence is implied by i_extra_isize, then doesn't it have to abort?

E2fsprogs previous to e2fsprogs 1.37 ignored i_extra_isize and didn't
check whether or not the EA's in the inode were valid.  Starting in
e2fsprogs 1.37, e2fsck understands i_extra_isize and in fact does
validate the EA's in the inode.  If we add new i_extra fields, then
currently e2fsprogs will ignore them, and that's OK for things like
the high precision time fields.  But if they are fields where e2fsck
does need to know about them, then obviously we would need a COMPAT
feature flag to signal that fact (since e2fsck will refuse to operate
on a filesystem if there is a COMPAT feature that it doesn't
understand.)

> So for future-proofing, we do need some distinction between the fields
> actually *used* in i_extra_isize, and those simply reserved there.  And
> that has to be per-inode, if we want to allow easy dynamic migration to
> newer fields.
>
> So a per-superblock field guaranteeing that there's at least $N bytes of
> usable *potential* i_extra_isize in each inode, and a per-inode
> i_extra_isize which shows which fields are *actively* used, gives us
> both pieces of information that we need.

The easiest way to do future-proofing is to state that they must be
initialized to zero.  That's how we handle unused fields in the
superblock, after all, and it means that it's relatively easy to add
new superblock fields without needing to cause compatibility
problems.  If you absolutely, positively need e2fsck to abort if it
doesn't understand a particular field, that's what a COMPAT feature
flag is for.  Otherwise, new kernels can simply check to see if the
field is non-zero, and if so, honor it, and old kernels will simply
ignore the new information.  In many cases, that's more than
sufficient.
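
As a rough sketch of both conventions (the struct, function names, and
the KNOWN_COMPAT mask below are made up for illustration, not taken
from the kernel or e2fsprogs): a reader that treats zero as "not set",
and a checker that refuses to run when it sees a feature bit it does
not know.

```c
#include <stdint.h>

/* Hypothetical decoded timestamp. */
struct fine_time {
    int64_t  sec;
    uint32_t nsec;
};

/* Decode a timestamp whose extra 32 bits were always written as zero by
 * old code.  Returns 1 if extended information was present and honored,
 * 0 if the field was zero (old writer, or simply never set). */
static int decode_time(uint32_t sec, uint32_t extra, struct fine_time *out)
{
    out->sec = (int32_t)sec;
    out->nsec = 0;
    if (extra == 0 || extra >= 1000000000u)
        return 0;            /* nothing usable: behave like an old kernel */
    out->nsec = extra;
    return 1;
}

/* Hypothetical mask of the feature bits this checker understands. */
#define KNOWN_COMPAT 0x0007u

/* e2fsck-style policy: any unknown feature bit means "do not touch". */
static int checker_may_proceed(uint32_t feature_compat)
{
    return (feature_compat & ~KNOWN_COMPAT) == 0;
}
```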

						- Ted


* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-21 18:38                     ` Theodore Ts'o
@ 2006-03-21 19:47                       ` Stephen C. Tweedie
  2006-03-21 20:40                         ` Andreas Dilger
  2006-03-21 20:16                       ` Alfred M. Szmidt
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 2006-03-21 19:47 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Andreas Dilger, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier, ams, cascardo, Stephen Tweedie

Hi,

On Tue, 2006-03-21 at 13:38 -0500, Theodore Ts'o wrote:

> Hurd is definitely using the translator field, and I only recently
> discovered they are using it to point at a disk block where the name
> of the translator program is stored (I'm not 100% sure, but I think
> it's a generic, out-of-band, #! sort of functionality).

..

> > If they really are 100% necessary for hurd, it might be that we could
> > relegate them to an xattr.  There's the slight problem of testing,
> > though; does anyone on ext2-devel actually run hurd, ever?
> 
> Relegating them to an xattr would break compatibility with existing
> hurd filesystems.

This would be an incompat change, but one that would not be hard to
maintain.  The translator stuff looks like the kind of thing that would
_easily_ suit xattrs.  

>   We could take the arrogant "Linux is the only thing
> that matters"

I'm not proposing breaking any compatibility.  The idea was simply that
if we wanted to add new fields to that space in the inode struct, it
would be an incompat change on *all* platforms, not just hurd; and that
on hurd, an extra side-effect of that incompat flag would be that we now
look for translation etc. in an xattr.

Do you know how large the translation data is, btw?  If it's typically
just a small string, then we may actually get far better efficiency by
lumping it into the xattr blocks than by keeping it out-of-band.

> E2fsprogs previous to e2fsprogs 1.37 ignored i_extra_isize and didn't
> check whether or not the EA's in the inode were valid.  Starting in
> e2fsprogs 1.37, e2fsck understands i_extra_isize and in fact does
> validate the EA's in the inode.  If we add new i_extra fields, then
> currently e2fsprogs will ignore them, and that's OK for things like
> the high precision time fields.  But if they are fields where e2fsck
> does need to know about them, then obviously we would need a COMPAT
> feature flag to signal that fact (since e2fsck will refuse to operate
> on a filesystem if there is a COMPAT feature that it doesn't
> understand.)

The timestamps are about the only things I can think of that would be
safe to ignore.  Everything else --- i_nlinks, i_blocks, checksums,
highwatermarking --- has consistency implications and e2fsck would need
to be aware of it.

> > So for future-proofing, we do need some distinction between the fields
> > actually *used* in i_extra_isize, and those simply reserved there.  And
> > that has to be per-inode, if we want to allow easy dynamic migration to
> > newer fields.
...
> The easiest way to do future-proofing is to state that they must be
> initialized to zero.

Hmm, that should work.  It certainly works nicely for overflow fields.
It might complicate things like highwatermarking: a simple HWM
implementation would record the amount of the file that is actually
initialised in the HWM field, so "0" would actually be an unusual,
important special case.  And "0" would be a potentially valid checksum
if we use CRC32, too.  Using the per-sb field for reserved space, and
the in-inode one to determine which fields are actively in use, would
avoid such ambiguous cases.

--Stephen


* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-21 18:38                     ` Theodore Ts'o
  2006-03-21 19:47                       ` Stephen C. Tweedie
@ 2006-03-21 20:16                       ` Alfred M. Szmidt
  2006-03-21 23:05                       ` Olivier Galibert
  2006-03-25 14:51                       ` cascardo
  3 siblings, 0 replies; 34+ messages in thread
From: Alfred M. Szmidt @ 2006-03-21 20:16 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: tytso, sct, adilger, sho, cmm, linux-kernel, ext2-devel,
	Laurent.Vivier, cascardo, roland

Adding Roland McGrath to the CC.

   > > It would also be good to understand what HURD is actually doing
   > > with those other fields (if anything, does it even exist
   > > anymore?), since it is literally holding TB of space unusable
   > > on Linux ext3 filesystems that could better be put to use.
   > > There are i_translator, i_mode_high, and i_author held hostage
   > > by HURD, and I certainly have never seen or heard of any good
   > > description of what they do or if Linux would/could ever use
   > > them, or if HURD could live without them.

   Hurd is definitely using the translator field, and I only recently
   discovered they are using it to point at a disk block where the
   name of the translator program is stored (I'm not 100% sure, but I
   think it's a generic, out-of-band, #! sort of functionality).  I
   don't know about the other fields, but I can find out.

Something like that.  The author field is akin to gid/uid.  I don't
recall the exact usage of i_mode_high, but it has something to do with
translators.

   > If they really are 100% necessary for hurd, it might be that we
   > could relegate them to an xattr.  There's the slight problem of
   > testing, though; does anyone on ext2-devel actually run hurd,
   > ever?

   Relegating them to an xattr would break compatibility with
   existing hurd filesystems.  We could take the arrogant "Linux is
   the only thing that matters", and just screw them, and the net
   result will probably be that Hurd will never implement some of the
   advanced features we've been talking about.  They might not
   anyways, though.  A real problem is that as far as I know, the hurd
   ext2 developers aren't on the ext2-devel mailing list.

   I've cc'ed two people that sent me a request to add some additional
   debugfs functionality to support hurd; maybe they can help by
   telling us whether or not hurd is using i_mode_high and i_author,
   and whether or not hurd has any likelihood of tracking new ext3
   features that we might add in the future or not.

Both i_mode_high and i_author are used in the Hurd.  But they are used
only if the creator of the file-system is the Hurd; the same goes for
the translator fields.

   > > I'm fully in the "the chance of any real problem is vanishingly
   > > small" camp, even though Lustre is one of the few users of
   > > large inodes.  The presence of the COMPAT field would not
   > > really be any different than just changing ext3_new_inode() to
   > > make i_extra_isize 16 by default, except to cause breakage
   > > against the older e2fsprogs.
   > 
   > Setting i_extra_isize will break older e2fsprogs anyway, won't
   > it?  e2fsck needs to have full knowledge of all fs fields in
   > order to maintain consistency; if it doesn't know about some of
   > the fields whose presence is implied by i_extra_isize, then
   > doesn't it have to abort?

   E2fsprogs previous to e2fsprogs 1.37 ignored i_extra_isize and
   didn't check whether or not the EA's in the inode were valid.
   Starting in e2fsprogs 1.37, e2fsck understands i_extra_isize and in
   fact does validate the EA's in the inode.  If we add new i_extra
   fields, then currently e2fsprogs will ignore them, and that's OK
   for things like the high precision time fields.  But if they are
   fields where e2fsck does need to know about them, then obviously we
   would need a COMPAT feature flag to signal that fact (since e2fsck
   will refuse to operate on a filesystem if there is a COMPAT feature
   that it doesn't understand.)

   > So for future-proofing, we do need some distinction between the
   > fields actually *used* in i_extra_isize, and those simply
   > reserved there.  And that has to be per-inode, if we want to
   > allow easy dynamic migration to newer fields.
   >
   > So a per-superblock field guaranteeing that there's at least $N
   > bytes of usable *potential* i_extra_isize in each inode, and a
   > per-inode i_extra_isize which shows which fields are *actively*
   > used, gives us both pieces of information that we need.

   The easiest way to do future-proofing is to state that they must be
   initialized to zero.  That's how we handle unused fields in the
   superblock, after all, and it means that it's relatively easy to
   add new superblock fields without needing to cause compatibility
   problems.  If you absolutely, positively need e2fsck to abort if
   it doesn't understand a particular field, that's what a COMPAT
   feature flag is for.  Otherwise, new kernels can simply check to
   see if the field is non-zero, and if so, honor it, and old-kernels
   will simply ignore the new information.  In many cases, that's more
   than sufficient.


* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-21 17:05                   ` Stephen C. Tweedie
  2006-03-21 18:38                     ` Theodore Ts'o
@ 2006-03-21 20:26                     ` Andreas Dilger
  1 sibling, 0 replies; 34+ messages in thread
From: Andreas Dilger @ 2006-03-21 20:26 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Theodore Ts'o, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier

On Mar 21, 2006  12:05 -0500, Stephen C. Tweedie wrote:
> If we want to go 64-bits, then the extent maps can
> take care of indirect blocks, but we would still need:
> 
> 	__le32	i_blocks;	/* Blocks count */

This has partially been addressed by Takashi's patch for fs-blocksize
i_blocks, and there was also a patch to use either i_frag|i_fsize or
i_faddr as the high bits of this value.  If we think that 2^48 * blocksize
is enough for a file (2^60 bytes for 4kB blocks, 2^64 bytes for 64kB blocks)
then it would be prudent to use (i_frag|i_fsize) as i_blocks_high.  AFAIK,
those fields have never, ever been used, and adding such a change along
with Takashi's patch makes a lot of sense.
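
To make the arithmetic concrete, here is a small sketch, assuming the
16 bits of i_frag|i_fsize are read as one high half called
i_blocks_high and that both halves have already been converted from
little-endian (the function names are illustrative only):

```c
#include <stdint.h>

/* Combine the existing 32-bit i_blocks with a 16-bit high part into a
 * 48-bit block count. */
static uint64_t blocks_from_disk(uint32_t i_blocks_lo, uint16_t i_blocks_high)
{
    return ((uint64_t)i_blocks_high << 32) | i_blocks_lo;
}

/* log2 of the largest representable size in bytes when the count is in
 * units of filesystem blocks of 2^blkbits bytes:
 * 2^count_bits * 2^blkbits = 2^(count_bits + blkbits). */
static unsigned max_size_log2(unsigned count_bits, unsigned blkbits)
{
    return count_bits + blkbits;
}
```

For 4kB blocks (blkbits = 12) this gives 48 + 12 = 60, and for 64kB
blocks (blkbits = 16) it gives 48 + 16 = 64, matching the figures above.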

> 	__le32	i_file_acl;	/* File ACL */

This needs another 32 bits for sure.  We might conceivably also fix up the
EA code to improve the external-block EA format (e.g. allow pointing at an
extent index block or another inode to allow storing larger EAs).

> 	__le32	i_dir_acl;	/* Directory ACL */

This is i_size_high for regular files, and I propose that it also become
i_size_high for directories as well, because CFS at least is hitting
limits of 2GB directories already (that's around 25M files).  It also
doesn't take into account that we need to increase the dirent size to
accommodate larger inode numbers and possibly some other attribute data,
which I propose we flag with the high 5 bits of the d_type field.

> to get an extra 32 bits.  And there's the old favourite,
> 
> 	__le16	i_links_count;	/* Links count */
> 
> which is a completely unnecessary limit on subdirs, which would be great
> to eliminate at the same time.

CFS has a patch that has been working for ages that changes the i_links_count
handling to be the same as reiserfs: if we overflow 65000 links, the
directory i_nlinks becomes 1 (which disables the "find" heuristic of only
recursing into i_nlinks subdirectories), and ext3_dir_empty() is trusted to
tell us when the directory is empty (which it does already; i_nlinks is only
used to print out a warning in any case).  Even an unpatched e2fsck and
kernel handle this gracefully.
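
The counting rule can be sketched roughly as follows.  This is only an
illustration of the behaviour described above, not the CFS patch
itself; the helper names are hypothetical, and 65000 is the threshold
quoted in this message.

```c
#include <stdint.h>

#define DIR_LINK_MAX 65000u

/* New link count for a directory after a subdirectory is added.
 * A count of 1 means "too many to track"; once reached it is sticky. */
static uint16_t dir_links_inc(uint16_t nlink)
{
    if (nlink == 1 || nlink >= DIR_LINK_MAX)
        return 1;
    return (uint16_t)(nlink + 1);
}

/* New link count after a subdirectory is removed.  Once the count has
 * collapsed to 1 it is no longer maintained, and emptiness has to be
 * decided by actually scanning the directory. */
static uint16_t dir_links_dec(uint16_t nlink)
{
    if (nlink == 1)
        return 1;
    return (uint16_t)(nlink - 1);
}
```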

> We're not talking about a huge amount of space, here; I'd hate to
> reserve too little space for the next year or so and force people to go
> through a full forced fsck more than once to flag just a few more bytes
> as available.

At the same time, if we reserve too much space, it hurts EAs fitting
into the large inode space (which is at least CFS's and Samba's primary
requirement for large inodes).

> The problem is that by the time we find we can't grow the inode fields,
> it may be too late.  That's especially true with timestamps: ENOSPC is a
> bad return code for sys_utimes()!

I'd rather return success and truncate the timestamp.

> It's perhaps a little more reasonable
> to expect to have to deal with ENOSPC when we do a mkdir() or write().
> But a per-sb reserved-inode-growth field that fsck can always set, and
> that the overwhelming majority of filesystems will be able to satisfy,
> simply gets rid of *all* the edge cases by guaranteeing enough space.

In the end, I don't think we can ever have "enough" space reserved for
all needs, so the code will have to have a graceful fallback strategy
in any case.

> > I'm fully in the "the chance of any real problem is vanishingly small"
> > camp, even though Lustre is one of the few users of large inodes.  The
> > presence of the COMPAT field would not really be any different than just
> > changing ext3_new_inode() to make i_extra_isize 16 by default, except to
> > cause breakage against the older e2fsprogs.
> 
> Setting i_extra_isize will break older e2fsprogs anyway, won't it?
> e2fsck needs to have full knowledge of all fs fields in order to
> maintain consistency; if it doesn't know about some of the fields whose
> presence is implied by i_extra_isize, then doesn't it have to abort?

Like Ted said, and I had said when the large inode patch was first proposed,
if there is something added in the large inode space that is absolutely
mandatory, then it can be covered by an appropriate *COMPAT flag.  I don't
think that the inode timestamps even warrant that protection.

> So for future-proofing, we do need some distinction between the fields
> actually *used* in i_extra_isize, and those simply reserved there.  And
> that has to be per-inode, if we want to allow easy dynamic migration to
> newer fields.

The concept of "reserving" space in i_extra_isize wasn't considered.  Instead
the design was that i_extra_isize would be large enough to cover the valid
fields, and if more space is needed, i_extra_isize would be grown to cover
this space, as applicable.

As Ted says, we could just initialize unused fields to zero and depend on
that.  It works for the timestamps and e.g. checksum at least, and new
code can be made to live with this also.

> So a per-superblock field guaranteeing that there's at least $N bytes of
> usable *potential* i_extra_isize in each inode, and a per-inode
> i_extra_isize which shows which fields are *actively* used, gives us
> both pieces of information that we need.

I thought of this also, though plain "reservation" fails in two regards:
- if the older kernel doesn't understand "s_extra_isize_min", it may still
  consume that space if the filesystem is mounted there
- if all fields in i_extra_isize are not used (e.g. i_checksum is disabled,
  but a later i_dac is enabled), we still need a way to know if the field
  is in use

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-21 19:47                       ` Stephen C. Tweedie
@ 2006-03-21 20:40                         ` Andreas Dilger
  0 siblings, 0 replies; 34+ messages in thread
From: Andreas Dilger @ 2006-03-21 20:40 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Theodore Ts'o, Takashi Sato, cmm, linux-kernel, ext2-devel,
	Laurent Vivier, ams, cascardo

On Mar 21, 2006  14:47 -0500, Stephen C. Tweedie wrote:
> On Tue, 2006-03-21 at 13:38 -0500, Theodore Ts'o wrote:
> > Hurd is definitely using the translator field, and I only recently
> > discovered they are using it to point at a disk block where the name
> > of the translator program is stored (I'm not 100% sure, but I think
> > it's a generic, out-of-band, #! sort of functionality).

Argh, sounds fragile in any case.

> I'm not proposing breaking any compatibility.  The idea was simply that
> if we wanted to add new fields to that space in the inode struct, it
> would be an incompat change on *all* platforms, not just hurd; and that
> on hurd, an extra side-effect of that incompat flag would be that we now
> look for translation etc. in an xattr.

I would rather propose that we maintain as much compatibility as possible,
given that we don't even know what those extra fields might be, and would
likely need to have yet another compatibility flag on the feature itself.
Remember that large inodes themselves are incompatible with older kernels
(maybe predating 2.6.9) so we don't need to worry about 2.4 kernels at all.

> The timestamps are about the only things I can think of that would be
> safe to ignore.  Everything else --- i_nlinks, i_blocks, checksums,
> highwatermarking --- has consistency implications and e2fsck would need
> to be aware of it.

Which would get their own superblock flags if needed.

> Hmm, that should work.  It certainly works nicely for overflow fields.
> It might complicate things like highwatermarking: a simple HWM
> implementation would record the amount of the file that is actually
> initialised in the HWM field, so "0" would actually be an unusual,
> important special case.

The HWM feature would fall under an INCOMPAT flag then, and possibly
also set a flag in the inode to indicate validity (similar to my
proposal for the i_blocks change).

> And "0" would be a potentially valid checksum if we use CRC32, too.

Hmm, is that true?  I thought that 0 was impossible for CRC32, since
even for a zero-length file the initial value should be 0xffffffff,
though I'm not 100% sure of that.  
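
For reference, a bit-at-a-time sketch of the common zlib-style CRC32
(initial value 0xffffffff, reflected polynomial 0xEDB88320, final XOR
with 0xffffffff); the function name is just for illustration.  With
that convention the two inversions cancel for a zero-length input, so
the result there is 0.  A variant that skips the final XOR would give
0xffffffff for empty input instead, so the answer depends on which
convention is picked.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32 using the zlib/IEEE 802.3 convention. */
static uint32_t crc32_ieee(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;                 /* initial value */

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            /* Shift right; XOR in the polynomial if the low bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;                   /* final inversion */
}
```

Either way a stored value of 0 cannot safely double as "no checksum",
since nothing in the CRC construction excludes 0 as a result for
non-empty data.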

> Using the per-sb field for reserved space, and
> the in-inode one to determine which fields are actively in use, would
> avoid such ambiguous cases.

But that doesn't help if i_hwm comes before some other field that is put
into use, so it has to be handled anyways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-21 18:38                     ` Theodore Ts'o
  2006-03-21 19:47                       ` Stephen C. Tweedie
  2006-03-21 20:16                       ` Alfred M. Szmidt
@ 2006-03-21 23:05                       ` Olivier Galibert
  2006-03-21 23:35                         ` Alfred M. Szmidt
  2006-03-25 14:51                       ` cascardo
  3 siblings, 1 reply; 34+ messages in thread
From: Olivier Galibert @ 2006-03-21 23:05 UTC (permalink / raw)
  To: Theodore Ts'o, Stephen C. Tweedie, Andreas Dilger,
	Takashi Sato, cmm, linux-kernel, ext2-devel, Laurent Vivier, ams,
	cascardo

On Tue, Mar 21, 2006 at 01:38:22PM -0500, Theodore Ts'o wrote:
> Hurd is definitely using the translator field, and I only recently
> discovered they are using it to point at a disk block where the name
> of the translator program is stored (I'm not 100% sure, but I think
> it's a generic, out-of-band, #! sort of functionality).

Translators on directories are a combo of automount+userland
filesystem, with the addition of having them saved in the mounted-on
filesystem.  Rather nice actually.  Replacing /etc/fstab with
local-to-the-mountpoint information has some charm.  I'm not sure if
translators-on-files actually exist.

Note that in hurd all filesystems are userland.  Whether it is a good
thing is left as an exercise to the benchmarker and the deadlock
chaser.

  OG.



* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-21 23:05                       ` Olivier Galibert
@ 2006-03-21 23:35                         ` Alfred M. Szmidt
  0 siblings, 0 replies; 34+ messages in thread
From: Alfred M. Szmidt @ 2006-03-21 23:35 UTC (permalink / raw)
  To: Olivier Galibert
  Cc: galibert, tytso, sct, adilger, sho, cmm, linux-kernel, ext2-devel,
	Laurent.Vivier, cascardo

   > Hurd is definitely using the translator field, and I only
   > recently discovered they are using it to point at a disk block
   > where the name of the translator program is stored (I'm not 100%
   > sure, but I think it's a generic, out-of-band, #! sort of
   > functionality).

   Translators on directories are a combo of automount+userland
   filesystem, with the addition of having them saved in the
   mounted-on filesystem.  Rather nice actually.  Replacing /etc/fstab
   with local-to-the-mountpoint information has some charm.  I'm not
   sure if translators-on-files actually exist.

You can set a translator on a file or a directory, it doesn't matter.
Anything that is accessed through the file-system is a translator.
/dev/null is a translator, symbolic links can be[0] translators,
/dev/hd0s1 (/dev/hda1 in GNU/Linux) is a translator, ...

[0]: They are usually implemented directly in the file-system so you
don't end up spawning a new process for each symlink.  But if the
file-system in question doesn't support symlinks you can always use
the symlink translator to get symlinks.  This will work for all
file-systems as long as you do not need it to persist across
reboots; for that you need passive translator support (which is what
those fields in ext2 are for, among other things).

Happy hacking.


* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-21 18:38                     ` Theodore Ts'o
                                         ` (2 preceding siblings ...)
  2006-03-21 23:05                       ` Olivier Galibert
@ 2006-03-25 14:51                       ` cascardo
  2006-03-26 16:27                         ` Andreas Dilger
  2006-03-27 19:55                         ` Stephen C. Tweedie
  3 siblings, 2 replies; 34+ messages in thread
From: cascardo @ 2006-03-25 14:51 UTC (permalink / raw)
  To: Theodore Ts'o, Stephen C. Tweedie, Andreas Dilger,
	Takashi Sato, cmm, linux-kernel, ext2-devel, Laurent Vivier, ams,
	cascardo


On Tue, Mar 21, 2006 at 01:38:22PM -0500, Theodore Ts'o wrote:
> On Tue, Mar 21, 2006 at 12:05:22PM -0500, Stephen C. Tweedie wrote:
> > > It would also be good to understand what HURD is actually doing with
> > > those other fields (if anything, does it even exist anymore?), since
> > > it is literally holding TB of space unusable on Linux ext3 filesystems
> > > that could better be put to use.  There are i_translator, i_mode_high,
> > > and i_author held hostage by HURD, and I certainly have never seen or
> > > heard of any good description of what they do or if Linux would/could
> > > ever use them, or if HURD could live without them.
> 
> Hurd is definitely using the translator field, and I only recently
> discovered they are using it to point at a disk block where the name
> of the translator program is stored (I'm not 100% sure, but I think
> it's a generic, out-of-band, #! sort of functionality).  I don't know
> about the other fields, but I can find out.
> 
> > If they really are 100% necessary for hurd, it might be that we could
> > relegate them to an xattr.  There's the slight problem of testing,
> > though; does anyone on ext2-devel actually run hurd, ever?
> 
> Relegating them to an xattr would break compatibility with existing
> hurd filesystems.  We could take the arrogant "Linux is the only thing
> that matters", and just screw them, and the net result will probably
> be that Hurd will never implement some of the advanced features we've
> been talking about.  They might not anyways, though.  A real problem
> is that as far as I know, the hurd ext2 developers aren't on the
> ext2-devel mailing list.
> 
> I've cc'ed two people that sent me a request to add some additional
> debugfs functionality to support hurd; maybe they can help by telling
> us whether or not hurd is using i_mode_high and i_author, and whether
> or not hurd has any likelihood of tracking new ext3 features that we
> might add in the future or not.
> 

As AMS has pointed out, the filesystem creator must be set to Hurd for
these inode fields to be used. Since ext2 seems to be the most
supported filesystem on Hurd, most of the ext2 fs used have the fs
creator set to Hurd.

Regarding compatibility, there are plans to support xattr in Hurd and
use them for these fields, translator and author. (I can't recall what
i_mode_high is used for.) With respect to that, I'd appreciate if
there is a recommendation to every ext2 implementation (not only
Linux) that supports xattr, to support gnu.translator and gnu.author
(I'll check about the i_mode_high and post about it asap.). There is a
patch by Roland McGrath for Linux that supports those besides the
reserved fields in case the fs creator is Hurd.

> > > I'm fully in the "the chance of any real problem is vanishingly small"
> > > camp, even though Lustre is one of the few users of large inodes.  The
> > > presence of the COMPAT field would not really be any different than just
> > > changing ext3_new_inode() to make i_extra_isize 16 by default, except to
> > > cause breakage against the older e2fsprogs.
> > 
> > Setting i_extra_isize will break older e2fsprogs anyway, won't it?
> > e2fsck needs to have full knowledge of all fs fields in order to
> > maintain consistency; if it doesn't know about some of the fields whose
> > presence is implied by i_extra_isize, then doesn't it have to abort?
> 
> E2fsprogs previous to e2fsprogs 1.37 ignored i_extra_isize and didn't
> check whether or not the EA's in the inode were valid.  Starting in
> e2fsprogs 1.37, e2fsck understands i_extra_isize and in fact does
> validate the EA's in the inode.  If we add new i_extra fields, then
> currently e2fsprogs will ignore them, and that's OK for things like
> the high precision time fields.  But if they are fields where e2fsck
> does need to know about them, then obviously we would need a COMPAT
> feature flag to signal that fact (since e2fsck will refuse to operate
> on a filesystem if there is a COMPAT feature that it doesn't
> understand.)

Regarding userland tools, it would be wise if they would still support
old format filesystems, including those with fs creator set to
Hurd. That would include supporting the oob block for translator when
counting used/free blocks and other operations like copying a file
using debugfs, for example.

> 
[...]
> 
> 						- Ted

Regards,
Thadeu Cascardo.
--

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-20 10:11   ` Takashi Sato
@ 2006-03-26  3:01     ` Theodore Ts'o
  0 siblings, 0 replies; 34+ messages in thread
From: Theodore Ts'o @ 2006-03-26  3:01 UTC (permalink / raw)
  To: Takashi Sato; +Cc: linux-kernel, ext2-devel

On Mon, Mar 20, 2006 at 07:11:51PM +0900, Takashi Sato wrote:
> >I've just checked my i386 assembly language reference, and I don't see
> >any indication that the btsl, btrl, and btl instructions don't work if
> >the high bit is set on the bit number.  Have you done tests showing
> >that these instructions do not work correctly for filesystem sizes >
> >2**31 blocks, 
> 
> Of course I did and confirmed to get the segmentation fault
> at those instructions.

Thanks for the clarification.  FYI, this is what I checked into the
e2fsprogs mercurial repository.  Note the comments about potential
issues with using a filesystem just under 2**32 blocks on a 32-bit
system.  

						- Ted

# HG changeset patch
# User tytso@mit.edu
# Node ID de831ae49d51575d0f59f4ee2e198fa4d6a75c23
# Parent  dd0dd259cf22059412ae4e6f3e7a9e8756d02b1e
Fix the i386 bitmap operations so they are 32-bit clean

The x86 assembly instructions for bit test-and-set, test-and-clear,
etc., interpret the bit number as a 32-bit signed number, which is
problematic for supporting filesystems > 8TB.

Added new inline functions (in C) to implement a
ext2fs_fast_set/clear_bit() that does not return the old value of the
bit, and use it for the fast block/bitmap functions.

Added a regression test suite to test the low-level bit operations
functions to make sure they work correctly.

Note that a bitmap that can address 2**32 blocks requires 2**29 bytes, or
512 megabytes.  E2fsck requires 3 (and possibly 4) block bitmaps,
which means that the block bitmaps can require 2GB all by themselves,
and this doesn't include the 4 or 5 inode bitmaps (which assuming an
8k inode ratio, will take 256 megabytes each).  This means that it's
more likely that a filesystem check of a filesystem greater than 2**31
blocks will fail if the e2fsck is dynamically linked (since the shared
libraries can consume a substantial portion of the 3GB address space
available to x86 userspace applications).  Even if e2fsck is
statically linked, for a badly damaged filesystem, which may require
additional block and/or inode bitmaps, I am not sure e2fsck will
succeed in all cases.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff -r dd0dd259cf22 -r de831ae49d51 lib/ext2fs/ChangeLog
--- a/lib/ext2fs/ChangeLog	Sat Mar 25 01:42:02 2006 -0500
+++ b/lib/ext2fs/ChangeLog	Sat Mar 25 13:42:45 2006 -0500
@@ -1,3 +1,16 @@
+2006-03-25  Theodore Ts'o  <tytso@mit.edu>
+
+	* Makefile.in: Check the bitfield operations much more carefully,
+		and arrange to have tst_bitops run from "make check"
+
+	* tst_bitops.c: Enhance tst_bitops program so that it is much more
+		thorough in testing bit operations.
+	
+	* bitops.h: Add new functions ext2fs_fast_set_bit() and
+		ext2fs_fast_clear_bit() and make the x86 functions 32-bit
+		clean.  Change the fast inode and block mark/unmark
+		functions to use ext2fs_fast_set/clear_bit()
+
 2006-03-18  Theodore Ts'o  <tytso@mit.edu>
 
 	* ext2fs.h (EXT2_FLAG_EXCLUSIVE): Define new flag which requests
diff -r dd0dd259cf22 -r de831ae49d51 lib/ext2fs/Makefile.in
--- a/lib/ext2fs/Makefile.in	Sat Mar 25 01:42:02 2006 -0500
+++ b/lib/ext2fs/Makefile.in	Sat Mar 25 13:42:45 2006 -0500
@@ -212,7 +212,7 @@
 
 tst_bitops: tst_bitops.o inline.o $(STATIC_LIBEXT2FS)
 	@echo "	LD $@"
-	@$(CC) -o tst_bitops tst_bitops.o inline.o \
+	@$(CC) -o tst_bitops tst_bitops.o inline.o $(ALL_CFLAGS) \
 		$(STATIC_LIBEXT2FS) $(LIBCOM_ERR)
 
 tst_getsectsize: tst_getsectsize.o getsectsize.o $(STATIC_LIBEXT2FS)
@@ -224,7 +224,8 @@
 	@echo "	LD $@"
 	@$(CC) -o mkjournal $(srcdir)/mkjournal.c -DDEBUG $(STATIC_LIBEXT2FS) $(LIBCOM_ERR) $(ALL_CFLAGS)
 
-check:: tst_badblocks tst_iscan @SWAPFS_CMT@ tst_byteswap
+check:: tst_bitops tst_badblocks tst_iscan @SWAPFS_CMT@ tst_byteswap
+	LD_LIBRARY_PATH=$(LIB) DYLD_LIBRARY_PATH=$(LIB) ./tst_bitops
 	LD_LIBRARY_PATH=$(LIB) DYLD_LIBRARY_PATH=$(LIB) ./tst_badblocks
 	LD_LIBRARY_PATH=$(LIB) DYLD_LIBRARY_PATH=$(LIB) ./tst_iscan
 @SWAPFS_CMT@	LD_LIBRARY_PATH=$(LIB) DYLD_LIBRARY_PATH=$(LIB) ./tst_byteswap
diff -r dd0dd259cf22 -r de831ae49d51 lib/ext2fs/bitops.h
--- a/lib/ext2fs/bitops.h	Sat Mar 25 01:42:02 2006 -0500
+++ b/lib/ext2fs/bitops.h	Sat Mar 25 13:42:45 2006 -0500
@@ -17,6 +17,8 @@
 extern int ext2fs_set_bit(unsigned int nr,void * addr);
 extern int ext2fs_clear_bit(unsigned int nr, void * addr);
 extern int ext2fs_test_bit(unsigned int nr, const void * addr);
+extern void ext2fs_fast_set_bit(unsigned int nr,void * addr);
+extern void ext2fs_fast_clear_bit(unsigned int nr, void * addr);
 extern __u16 ext2fs_swab16(__u16 val);
 extern __u32 ext2fs_swab32(__u32 val);
 
@@ -129,6 +131,28 @@
 #endif
 #endif
 
+/*
+ * Fast bit set/clear functions that don't need to return the
+ * previous bit value.
+ */
+
+_INLINE_ void ext2fs_fast_set_bit(unsigned int nr,void * addr)
+{
+	unsigned char	*ADDR = (unsigned char *) addr;
+
+	ADDR += nr >> 3;
+	*ADDR |= (1 << (nr & 0x07));
+}
+
+_INLINE_ void ext2fs_fast_clear_bit(unsigned int nr, void * addr)
+{
+	unsigned char	*ADDR = (unsigned char *) addr;
+
+	ADDR += nr >> 3;
+	*ADDR &= ~(1 << (nr & 0x07));
+}
+
+
 #if ((defined __GNUC__) && !defined(_EXT2_USE_C_VERSIONS_) && \
      (defined(__i386__) || defined(__i486__) || defined(__i586__)))
 
@@ -155,9 +179,10 @@
 {
 	int oldbit;
 
+	addr = (void *) (((unsigned char *) addr) + (nr >> 3));
 	__asm__ __volatile__("btsl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (EXT2FS_ADDR)
-		:"r" (nr));
+		:"r" (nr & 7));
 	return oldbit;
 }
 
@@ -165,9 +190,10 @@
 {
 	int oldbit;
 
+	addr = (void *) (((unsigned char *) addr) + (nr >> 3));
 	__asm__ __volatile__("btrl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (EXT2FS_ADDR)
-		:"r" (nr));
+		:"r" (nr & 7));
 	return oldbit;
 }
 
@@ -175,9 +201,10 @@
 {
 	int oldbit;
 
+	addr = (void *) (((unsigned char *) addr) + (nr >> 3));
 	__asm__ __volatile__("btl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit)
-		:"m" (EXT2FS_CONST_ADDR),"r" (nr));
+		:"m" (EXT2FS_CONST_ADDR),"r" (nr & 7));
 	return oldbit;
 }
 
@@ -263,7 +290,8 @@
 
 #endif	/* i386 */
 
-#ifdef __mc68000__
+#if ((defined __GNUC__) && !defined(_EXT2_USE_C_VERSIONS_) && \
+     (defined(__mc68000__)))
 
 #define _EXT2_HAVE_ASM_BITOPS_
 
@@ -428,7 +456,7 @@
 		return;
 	}
 #endif	
-	ext2fs_set_bit(block - bitmap->start, bitmap->bitmap);
+	ext2fs_fast_set_bit(block - bitmap->start, bitmap->bitmap);
 }
 
 _INLINE_ void ext2fs_fast_unmark_block_bitmap(ext2fs_block_bitmap bitmap,
@@ -441,7 +469,7 @@
 		return;
 	}
 #endif
-	ext2fs_clear_bit(block - bitmap->start, bitmap->bitmap);
+	ext2fs_fast_clear_bit(block - bitmap->start, bitmap->bitmap);
 }
 
 _INLINE_ int ext2fs_fast_test_block_bitmap(ext2fs_block_bitmap bitmap,
@@ -467,7 +495,7 @@
 		return;
 	}
 #endif
-	ext2fs_set_bit(inode - bitmap->start, bitmap->bitmap);
+	ext2fs_fast_set_bit(inode - bitmap->start, bitmap->bitmap);
 }
 
 _INLINE_ void ext2fs_fast_unmark_inode_bitmap(ext2fs_inode_bitmap bitmap,
@@ -480,7 +508,7 @@
 		return;
 	}
 #endif
-	ext2fs_clear_bit(inode - bitmap->start, bitmap->bitmap);
+	ext2fs_fast_clear_bit(inode - bitmap->start, bitmap->bitmap);
 }
 
 _INLINE_ int ext2fs_fast_test_inode_bitmap(ext2fs_inode_bitmap bitmap,
@@ -563,7 +591,7 @@
 		return;
 	}
 	for (i=0; i < num; i++)
-		ext2fs_set_bit(block + i - bitmap->start, bitmap->bitmap);
+		ext2fs_fast_set_bit(block + i - bitmap->start, bitmap->bitmap);
 }
 
 _INLINE_ void ext2fs_fast_mark_block_bitmap_range(ext2fs_block_bitmap bitmap,
@@ -579,7 +607,7 @@
 	}
 #endif	
 	for (i=0; i < num; i++)
-		ext2fs_set_bit(block + i - bitmap->start, bitmap->bitmap);
+		ext2fs_fast_set_bit(block + i - bitmap->start, bitmap->bitmap);
 }
 
 _INLINE_ void ext2fs_unmark_block_bitmap_range(ext2fs_block_bitmap bitmap,
@@ -593,7 +621,8 @@
 		return;
 	}
 	for (i=0; i < num; i++)
-		ext2fs_clear_bit(block + i - bitmap->start, bitmap->bitmap);
+		ext2fs_fast_clear_bit(block + i - bitmap->start, 
+				      bitmap->bitmap);
 }
 
 _INLINE_ void ext2fs_fast_unmark_block_bitmap_range(ext2fs_block_bitmap bitmap,
@@ -609,7 +638,8 @@
 	}
 #endif	
 	for (i=0; i < num; i++)
-		ext2fs_clear_bit(block + i - bitmap->start, bitmap->bitmap);
+		ext2fs_fast_clear_bit(block + i - bitmap->start, 
+				      bitmap->bitmap);
 }
 #undef _INLINE_
 #endif
diff -r dd0dd259cf22 -r de831ae49d51 lib/ext2fs/tst_bitops.c
--- a/lib/ext2fs/tst_bitops.c	Sat Mar 25 01:42:02 2006 -0500
+++ b/lib/ext2fs/tst_bitops.c	Sat Mar 25 13:42:45 2006 -0500
@@ -8,8 +8,6 @@
  * License.
  * %End-Header%
  */
-
-/* #define _EXT2_USE_C_VERSIONS_ */
 
 #include <stdio.h>
 #include <string.h>
@@ -23,6 +21,8 @@
 #if HAVE_ERRNO_H
 #include <errno.h>
 #endif
+#include <sys/time.h>
+#include <sys/resource.h>
 
 #include "ext2_fs.h"
 #include "ext2fs.h"
@@ -31,14 +31,143 @@
 	0x80, 0xF0, 0x40, 0x40, 0x0, 0x0, 0x0, 0x0, 0x10, 0x20, 0x00, 0x00
 	};
 
+int bits_list[] = {
+	7, 12, 13, 14,15, 22, 30, 68, 77, -1,
+};
+
+#define BIG_TEST_BIT   (((unsigned) 1 << 31) + 42)
+
+
 main(int argc, char **argv)
 {
-	int	i, size;
+	int	i, j, size;
+	unsigned char testarray[12];
+	unsigned char *bigarray;
 
 	size = sizeof(bitarray)*8;
+#if 0
 	i = ext2fs_find_first_bit_set(bitarray, size);
 	while (i < size) {
 		printf("Bit set: %d\n", i);
 		i = ext2fs_find_next_bit_set(bitarray, size, i+1);
 	}
+#endif
+
+	/* Test test_bit */
+	for (i=0,j=0; i < size; i++) {
+		if (ext2fs_test_bit(i, bitarray)) {
+			if (bits_list[j] == i) {
+				j++;
+			} else {
+				printf("Bit %d set, not expected\n", i);
+				exit(1);
+			}
+		} else {
+			if (bits_list[j] == i) {
+				printf("Expected bit %d to be set.\n", i);
+				exit(1);
+			}
+		}
+	}
+	printf("ext2fs_test_bit appears to be correct\n");
+
+	/* Test ext2fs_set_bit */
+	memset(testarray, 0, sizeof(testarray));
+	for (i=0; bits_list[i] > 0; i++) {
+		ext2fs_set_bit(bits_list[i], testarray);
+	}
+	if (memcmp(testarray, bitarray, sizeof(testarray)) == 0) {
+		printf("ext2fs_set_bit test succeeded.\n");
+	} else {
+		printf("ext2fs_set_bit test failed.\n");
+		for (i=0; i < sizeof(testarray); i++) {
+			printf("%02x ", testarray[i]);
+		}
+		printf("\n");
+		exit(1);
+	}
+	for (i=0; bits_list[i] > 0; i++) {
+		ext2fs_clear_bit(bits_list[i], testarray);
+	}
+	for (i=0; i < sizeof(testarray); i++) {
+		if (testarray[i]) {
+			printf("ext2fs_clear_bit failed, "
+			       "testarray[%d] is %d\n", i, testarray[i]);
+			exit(1);
+		}
+	}
+	printf("ext2fs_clear_bit test succeeded.\n");
+		
+
+	/* Do bigarray test */
+	bigarray = malloc(1 << 29);
+	if (!bigarray) {
+		fprintf(stderr, "Failed to allocate scratch memory!\n");
+		exit(1);
+	}
+
+        bigarray[BIG_TEST_BIT >> 3] = 0;
+
+	ext2fs_set_bit(BIG_TEST_BIT, bigarray);
+	printf("big bit number (%u) test: %d, expected %d\n", BIG_TEST_BIT,
+	       bigarray[BIG_TEST_BIT >> 3], (1 << (BIG_TEST_BIT & 7)));
+	if (bigarray[BIG_TEST_BIT >> 3] != (1 << (BIG_TEST_BIT & 7)))
+		exit(1);
+
+	ext2fs_clear_bit(BIG_TEST_BIT, bigarray);
+	
+	printf("big bit number (%u) test: %d, expected 0\n", BIG_TEST_BIT,
+	       bigarray[BIG_TEST_BIT >> 3]);
+	if (bigarray[BIG_TEST_BIT >> 3] != 0)
+		exit(1);
+
+	printf("ext2fs_set_bit big_test successful\n");
+
+
+	/* Now test ext2fs_fast_set_bit */
+	memset(testarray, 0, sizeof(testarray));
+	for (i=0; bits_list[i] > 0; i++) {
+		ext2fs_fast_set_bit(bits_list[i], testarray);
+	}
+	if (memcmp(testarray, bitarray, sizeof(testarray)) == 0) {
+		printf("ext2fs_fast_set_bit test succeeded.\n");
+	} else {
+		printf("ext2fs_fast_set_bit test failed.\n");
+		for (i=0; i < sizeof(testarray); i++) {
+			printf("%02x ", testarray[i]);
+		}
+		printf("\n");
+		exit(1);
+	}
+	for (i=0; bits_list[i] > 0; i++) {
+		ext2fs_fast_clear_bit(bits_list[i], testarray);
+	}
+	for (i=0; i < sizeof(testarray); i++) {
+		if (testarray[i]) {
+			printf("ext2fs_fast_clear_bit failed, "
+			       "testarray[%d] is %d\n", i, testarray[i]);
+			exit(1);
+		}
+	}
+	printf("ext2fs_fast_clear_bit test succeeded.\n");
+		
+
+        bigarray[BIG_TEST_BIT >> 3] = 0;
+
+	ext2fs_fast_set_bit(BIG_TEST_BIT, bigarray);
+	printf("big bit number (%u) test: %d, expected %d\n", BIG_TEST_BIT,
+	       bigarray[BIG_TEST_BIT >> 3], (1 << (BIG_TEST_BIT & 7)));
+	if (bigarray[BIG_TEST_BIT >> 3] != (1 << (BIG_TEST_BIT & 7)))
+		exit(1);
+
+	ext2fs_fast_clear_bit(BIG_TEST_BIT, bigarray);
+	
+	printf("big bit number (%u) test: %d, expected 0\n", BIG_TEST_BIT,
+	       bigarray[BIG_TEST_BIT >> 3]);
+	if (bigarray[BIG_TEST_BIT >> 3] != 0)
+		exit(1);
+
+	printf("ext2fs_fast_set_bit big_test successful\n");
+
+	exit(0);
 }
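
As a rough cross-check of the memory estimate in the changeset
description above, the arithmetic can be sketched in C.  This is not
part of the patch: the helper names are made up for illustration, and
the 4K block size is an assumption (the description only states the 8k
inode ratio).

```c
#include <stdint.h>

/* Bytes needed for a bitmap holding one bit per item (hypothetical helper). */
static uint64_t sketch_bitmap_bytes(uint64_t nbits)
{
	return (nbits + 7) / 8;
}

/*
 * Number of inodes mke2fs-style sizing would give for a filesystem of
 * `blocks` blocks of `block_size` bytes, at one inode per `inode_ratio`
 * bytes (hypothetical helper; 4K blocks and an 8k ratio are assumed below).
 */
static uint64_t sketch_inode_count(uint64_t blocks, uint64_t block_size,
				   uint64_t inode_ratio)
{
	return blocks * block_size / inode_ratio;
}
```

With 2**32 blocks, sketch_bitmap_bytes() gives 2**29 bytes (512MB) per
block bitmap, so four of them take 2GB.  With 4K blocks and an 8k inode
ratio, sketch_inode_count() gives 2**31 inodes, i.e. 2**28 bytes (256MB)
per inode bitmap.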

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-25 14:51                       ` cascardo
@ 2006-03-26 16:27                         ` Andreas Dilger
  2006-03-27 19:59                           ` Stephen C. Tweedie
  2006-03-27 20:36                           ` Alfred M. Szmidt
  2006-03-27 19:55                         ` Stephen C. Tweedie
  1 sibling, 2 replies; 34+ messages in thread
From: Andreas Dilger @ 2006-03-26 16:27 UTC (permalink / raw)
  To: cascardo
  Cc: Theodore Ts'o, Stephen C. Tweedie, Takashi Sato, cmm,
	linux-kernel, ext2-devel, Laurent Vivier, ams

On Mar 25, 2006  11:51 -0300, cascardo@minaslivre.org wrote:
> As AMS has pointed out, the filesystem creator must be set to Hurd for
> these inode fields to be used. Since ext2 seems to be the most
> supported filesystem on Hurd, most of the ext2 fs used have the fs
> creator set to Hurd.

So, if fs creator is Linux then HURD doesn't try to use those fields?
That would allow Linux to start using them, and if such a filesystem
is used on HURD then it could store the translator/author/mode_high
in the xattr space.  Does it even make sense to add translator/author
to existing files, or only at file creation time?

That would mean that Linux would just need to check the fs creator
field before using any of the HURD-reserved fields.

> Regarding compatibility, there are plans to support xattr in Hurd and
> use them for these fields, translator and author. (I can't recall what
> i_mode_high is used for.) With respect to that, I'd appreciate if
> there is a recommendation to every ext2 implementation (not only
> Linux) that supports xattr, to support gnu.translator and gnu.author
> (I'll check about the i_mode_high and post about it asap.).

Not that we will be in a rush to use these fields, but it would be good
to know what i_mode_high is used for.  In case it ever becomes relevant
for Linux, we would want it to keep the same meaning as on HURD.

> There is a
> patch by Roland McGrath for Linux that supports those besides the
> reserved fields in case the fs creator is Hurd.

I'm not sure what is required for supporting such EAs?  I don't think
any kernel would remove existing EAs, even if it doesn't understand
them.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
@ 2006-03-26 22:15 Chuck Ebbert
  0 siblings, 0 replies; 34+ messages in thread
From: Chuck Ebbert @ 2006-03-26 22:15 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel, ext2-devel, Takashi Sato

In-Reply-To: <20060326030114.GA2241@thunk.org>

On Sat, 25 Mar 2006 22:01:15 -0500, Theodore Ts'o wrote:

> Fix the i386 bitmap operations so they are 32-bit clean
...
> --- a/lib/ext2fs/bitops.h     Sat Mar 25 01:42:02 2006 -0500
> +++ b/lib/ext2fs/bitops.h     Sat Mar 25 13:42:45 2006 -0500
...
> @@ -155,9 +179,10 @@
>  {
>       int oldbit;
>  
> +     addr = (void *) (((unsigned char *) addr) + (nr >> 3));
>       __asm__ __volatile__("btsl %2,%1\n\tsbbl %0,%0"
>               :"=r" (oldbit),"=m" (EXT2FS_ADDR)
                                ^
  This should be "+" because that data is both read and written
by the assembler instruction.

> -             :"r" (nr));
> +             :"r" (nr & 7));
>       return oldbit;
>  }
>  
> @@ -165,9 +190,10 @@
>  {
>       int oldbit;
>  
> +     addr = (void *) (((unsigned char *) addr) + (nr >> 3));
>       __asm__ __volatile__("btrl %2,%1\n\tsbbl %0,%0"
>               :"=r" (oldbit),"=m" (EXT2FS_ADDR)
                                ^
  Same here.

> -             :"r" (nr));
> +             :"r" (nr & 7));
>       return oldbit;
>  }
 
See include/asm-i386/bitops.h where that has already been done (it's still
wrong, but less so than before.)
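
To illustrate the constraint point, here is a minimal sketch of a
read-modify-write memory operand declared with "+m".  It is not the
code from the patch: it deliberately swaps btsl for a byte-wide orb so
that the asm touches exactly the one byte named in the constraint, and
sketch_set_bit() is a made-up name.  Non-x86 builds fall back to plain C.

```c
/*
 * Set bit nr in the bitmap at addr; return nonzero if it was already set.
 * Hypothetical helper, for illustration only.
 */
static int sketch_set_bit(unsigned int nr, void *addr)
{
	/* Step to the byte first, so only the low 3 bits of nr select a bit. */
	unsigned char *byte = (unsigned char *) addr + (nr >> 3);
	unsigned char mask = (unsigned char) (1u << (nr & 7));
	int oldbit = (*byte & mask) != 0;

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/*
	 * "+m" tells the compiler that *byte is both read and written by
	 * the instruction; with "=m" it could assume the old contents are
	 * dead and discard an earlier store to that byte.
	 */
	__asm__ __volatile__("orb %1,%0"
		: "+m" (*byte)
		: "q" (mask)
		: "cc");
#else
	*byte |= mask;
#endif
	return oldbit;
}
```

Calling sketch_set_bit(10, buf) on a zeroed buffer returns 0 and leaves
buf[1] == 0x04; a second call returns 1.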

-- 
Chuck
"Penguins don't come from next door, they come from the Antarctic!"


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-25 14:51                       ` cascardo
  2006-03-26 16:27                         ` Andreas Dilger
@ 2006-03-27 19:55                         ` Stephen C. Tweedie
  2006-03-27 20:05                           ` Alfred M. Szmidt
  2006-03-28  0:14                           ` cascardo
  1 sibling, 2 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 2006-03-27 19:55 UTC (permalink / raw)
  To: cascardo
  Cc: Theodore Ts'o, Andreas Dilger, Takashi Sato, cmm,
	linux-kernel, ext2-devel, Laurent Vivier, ams, Stephen Tweedie

Hi,

On Sat, 2006-03-25 at 11:51 -0300, cascardo@minaslivre.org wrote:

> Regarding compatibility, there are plans to support xattr in Hurd and
> use them for these fields, translator and author. (I can't recall what
> i_mode_high is used for.) With respect to that, I'd appreciate if
> there is a recommendation to every ext2 implementation (not only
> Linux) that supports xattr, to support gnu.translator and gnu.author
> (I'll check about the i_mode_high and post about it asap.). 

What do you mean by "support", exactly?

There are 3 different bits of xattr design which matter here.  There's
the namespace exported to users via the *attr syscalls; there's the
encoding used on disk for those different namespaces; and there's the
exact semantics surrounding interpretation of the xattr contents.

Now, a non-Hurd system is not going to have any use for the gnu.* xattr
semantics, as translator is a Hurd-specific concept.  The user "gnu.*"
namespace is easy enough to teach to Linux: to simply reserve that
namespace, without actually implementing any part of it, I think it would be
sufficient simply to claim the name in include/linux/xattr.h.

For ext2/3, though, the key is how to store gnu.* on disk.  Right now
the different namespaces that ext* stores on disk are enumerated in

	fs/ext[23]/xattr.h

which, for ext2, currently contains:

        /* Name indexes */
        #define EXT2_XATTR_INDEX_USER			1
        #define EXT2_XATTR_INDEX_POSIX_ACL_ACCESS	2
        #define EXT2_XATTR_INDEX_POSIX_ACL_DEFAULT	3
        #define EXT2_XATTR_INDEX_TRUSTED		4
        #define	EXT2_XATTR_INDEX_LUSTRE			5
        #define EXT2_XATTR_INDEX_SECURITY	        6

If you want to reserve a new semantically-significant portion of the
namespace for use in the Hurd by gnu.* xattrs, then you'd need to submit
an authoritative Linux patch to register a new name index on ext2;
reservation of such an xattr namespace index is in effect an on-disk
format decision so needs to be agreed between implementations.

> Regarding userland tools, it would be wise if they would still support
> old format filesystems, including those with fs creator set to
> Hurd. That would include supporting the oob block for translator when
> counting used/free blocks and other operations like copying a file
> using debugfs, for example.

Certainly; I don't think anybody is arguing against that, and I regard
such backwards compatibility as an absolute requirement.

--Stephen

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-26 16:27                         ` Andreas Dilger
@ 2006-03-27 19:59                           ` Stephen C. Tweedie
  2006-03-27 20:36                           ` Alfred M. Szmidt
  1 sibling, 0 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 2006-03-27 19:59 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: cascardo, Theodore Ts'o, Takashi Sato, cmm, linux-kernel,
	ext2-devel, Laurent Vivier, ams, Stephen Tweedie

Hi,

On Sun, 2006-03-26 at 09:27 -0700, Andreas Dilger wrote:

> I'm not sure what is required for supporting such EAs?  I don't think
> any kernel would remove existing EAs, even if it doesn't understand
> them.

Right --- reservation in fs/ext[23]/xattr.h is sufficient, I think, as
all we need is to make sure that the gnu.* on-disk namespace is reserved
against reuse by any new namespaces in the future.

--Stephen



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-27 19:55                         ` Stephen C. Tweedie
@ 2006-03-27 20:05                           ` Alfred M. Szmidt
  2006-03-27 20:40                             ` Stephen C. Tweedie
  2006-03-28  0:14                           ` cascardo
  1 sibling, 1 reply; 34+ messages in thread
From: Alfred M. Szmidt @ 2006-03-27 20:05 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: cascardo, tytso, adilger, sho, cmm, linux-kernel, ext2-devel,
	Laurent.Vivier, sct

   Now, a non-Hurd system is not going to have any use for the gnu.*
   xattr semantics, as translator is a Hurd-specific concept.

gnu.* doesn't just concern itself with translators, it can also be
gnu.author (or some such) which is a normal UID, which GNU/Linux can
support without any problems.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-26 16:27                         ` Andreas Dilger
  2006-03-27 19:59                           ` Stephen C. Tweedie
@ 2006-03-27 20:36                           ` Alfred M. Szmidt
  1 sibling, 0 replies; 34+ messages in thread
From: Alfred M. Szmidt @ 2006-03-27 20:36 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: cascardo, tytso, sct, sho, cmm, linux-kernel, ext2-devel,
	Laurent.Vivier

   So, if fs creator is Linux then HURD doesn't try to use those
   fields?

As I recall it (don't have access to the source code here), the
file-system translator will return EOPNOTSUPP if you try and set a
passive translator on a non-Hurd owned file-system.  Passive
translators are the only kind of translators which write any kind of
data back to the actual file-system.

For the case of st_author/i_author, when the file is created on a
non-Hurd owned file-system, it will simply return whatever
i_uid/st_uid is.

   Not that we will be in a rush to use these fields, but it would be
   good to know what i_mode_high is used for in case it ever becomes
   relevant for Linux we would want to keep it the same meaning as
   HURD.

Once again, as I recall it (a bit better this time), i_mode_high is
used for the actual bits that define if there is a translator (and
what kind) on a node or not.

Cheers.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-27 20:05                           ` Alfred M. Szmidt
@ 2006-03-27 20:40                             ` Stephen C. Tweedie
  0 siblings, 0 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 2006-03-27 20:40 UTC (permalink / raw)
  To: ams
  Cc: cascardo, tytso, adilger, sho, cmm, linux-kernel, ext2-devel,
	Laurent.Vivier, Stephen Tweedie

Hi,

On Mon, 2006-03-27 at 22:05 +0200, Alfred M. Szmidt wrote:
>    Now, a non-Hurd system is not going to have any use for the gnu.*
>    xattr semantics, as translator is a Hurd-specific concept.
> 
> gnu.* doesn't just concern itself with translators, it can also be
> gnu.author (or some such) which is a normal UID, which GNU/Linux can
> support without any problems.

OK, but would it have any active semantics on non-Hurd kernels?  How
would the behaviour of ext3 change in the presence of a gnu.author
attribute on a file?

It would certainly be possible to add a generic ext2/3 namespace handler
to allow those fields to be set on, say, Linux hosts; but that would
just be a matter of matching the gnu.* syscall xattr encoding to the
EXT2_XATTR_INDEX_GNU on-disk encoding; it wouldn't actually deal with
any semantic expectations surrounding the use of those fields.
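
As a rough sketch of what "matching the syscall encoding to the on-disk
encoding" amounts to: the handler only has to turn a name prefix into a
name index and store the remaining suffix.  Everything named sketch_*
below is hypothetical, and the value 7 for a gnu.* index is an
assumption here, not a registered number.

```c
#include <stddef.h>
#include <string.h>

/* One syscall-level prefix and the on-disk name index it maps to. */
struct sketch_xattr_prefix {
	const char *prefix;
	int index;
};

/*
 * Indexes 1, 4 and 6 are the USER, TRUSTED and SECURITY values from
 * fs/ext2/xattr.h as listed earlier in the thread; 7 for "gnu." is an
 * assumption made for this example.
 */
static const struct sketch_xattr_prefix sketch_prefixes[] = {
	{ "user.",     1 },
	{ "trusted.",  4 },
	{ "security.", 6 },
	{ "gnu.",      7 },
};

/*
 * Map a full attribute name such as "gnu.translator" to its name index
 * and point *suffix at the part stored on disk ("translator").
 * Returns -1 and leaves *suffix untouched if no prefix matches.
 */
static int sketch_name_to_index(const char *name, const char **suffix)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_prefixes) / sizeof(sketch_prefixes[0]); i++) {
		size_t len = strlen(sketch_prefixes[i].prefix);

		if (strncmp(name, sketch_prefixes[i].prefix, len) == 0) {
			*suffix = name + len;
			return sketch_prefixes[i].index;
		}
	}
	return -1;
}
```

Nothing in this mapping interprets the attribute value, which is the
point: the encoding can be shared without sharing the semantics.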

--Stephen

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
  2006-03-27 19:55                         ` Stephen C. Tweedie
  2006-03-27 20:05                           ` Alfred M. Szmidt
@ 2006-03-28  0:14                           ` cascardo
  1 sibling, 0 replies; 34+ messages in thread
From: cascardo @ 2006-03-28  0:14 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: cascardo, Theodore Ts'o, Andreas Dilger, Takashi Sato, cmm,
	linux-kernel, ext2-devel, Laurent Vivier, ams

On Mon, Mar 27, 2006 at 02:55:01PM -0500, Stephen C. Tweedie wrote:
> Hi,
> 
> On Sat, 2006-03-25 at 11:51 -0300, cascardo@minaslivre.org wrote:
> 
> > Regarding compatibility, there are plans to support xattr in Hurd and
> > use them for these fields, translator and author. (I can't recall what
> > i_mode_high is used for.) With respect to that, I'd appreciate if
> > there is a recommendation to every ext2 implementation (not only
> > Linux) that supports xattr, to support gnu.translator and gnu.author
> > (I'll check about the i_mode_high and post about it asap.). 
> 
> What do you mean by "support", exactly?
> 
> There are 3 different bits of xattr design which matter here.  There's
> the namespace exported to users via the *attr syscalls; there's the
> encoding used on disk for those different namespaces; and there's the
> exact semantics surrounding interpretation of the xattr contents.
> 

Listing the attributes of a file should return the "gnu.*"
ones. That's the first meaning of supporting. Storing them on ext2/3
is the second. This one is already implemented for Linux by Roland
McGrath. I don't know, however, if that patch was submitted to the
right people. Is anyone here responsible for that? I can send it to
the list or privately, including the number used to store them. AFAIK,
the Linux code does not blindly list all the attributes, but only
those "supported", as you pointed out below, because they require a
reservation.

> Now, a non-Hurd system is not going to have any use for the gnu.* xattr
> semantics, as translator is a Hurd-specific concept.  The user "gnu.*"
> namespace is easy enough to teach to Linux: to simply reserve that
> namespace, without actually implementing any part of it, I think it would be
> sufficient simply to claim the name in include/linux/xattr.h.
> 

The semantics may not be supported, if they have no meaning to the
system. But star or cp should be able to keep those attributes if they
are written to do so. Does anyone know if cp can keep the xattr of a
file? Anyway, a patched cp that would keep the xattrs should keep the
"gnu.*" xattrs, and that's all (if both underlying filesystems support
them, which would be true for two ext2/3 filesystems).

> For ext2/3, though, the key is how to store gnu.* on disk.  Right now
> the different namespaces that ext* stores on disk are enumerated in
> 
> 	fs/ext[23]/xattr.h
> 
> which, for ext2, currently contains:
> 
>         /* Name indexes */
>         #define EXT2_XATTR_INDEX_USER			1
>         #define EXT2_XATTR_INDEX_POSIX_ACL_ACCESS	2
>         #define EXT2_XATTR_INDEX_POSIX_ACL_DEFAULT	3
>         #define EXT2_XATTR_INDEX_TRUSTED		4
>         #define	EXT2_XATTR_INDEX_LUSTRE			5
>         #define EXT2_XATTR_INDEX_SECURITY	        6
> 
> If you want to reserve a new semantically-significant portion of the
> namespace for use in the Hurd by gnu.* xattrs, then you'd need to submit
> an authoritative Linux patch to register a new name index on ext2;
> reservation of such an xattr namespace index is in effect an on-disk
> format decision so needs to be agreed between implementations.
> 

That's just what I meant by saying that I'd like them to be supported
by every implementation of ext2/3 xattr. Sorry if that was not
clear. That would be 7, right? That's what Roland uses in his patch.

[...]
> 
> --Stephen
> 
> 

Regards,
Thadeu Cascardo.
--

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2006-03-27 23:11 UTC | newest]

Thread overview: 34+ messages
2006-03-15 12:39 [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel) Takashi Sato
2006-03-15 12:56 ` [Ext2-devel] " Laurent Vivier
2006-03-16  2:19 ` Mingming Cao
2006-03-16 12:11   ` Takashi Sato
2006-03-16 13:53     ` Theodore Ts'o
2006-03-16 18:35     ` Andreas Dilger
2006-03-16 21:26       ` Theodore Ts'o
2006-03-16 22:59         ` Andreas Dilger
2006-03-18 17:07           ` Theodore Ts'o
2006-03-20  6:36             ` Andreas Dilger
2006-03-20 22:38               ` Stephen C. Tweedie
2006-03-20 23:48                 ` Andreas Dilger
2006-03-21 17:05                   ` Stephen C. Tweedie
2006-03-21 18:38                     ` Theodore Ts'o
2006-03-21 19:47                       ` Stephen C. Tweedie
2006-03-21 20:40                         ` Andreas Dilger
2006-03-21 20:16                       ` Alfred M. Szmidt
2006-03-21 23:05                       ` Olivier Galibert
2006-03-21 23:35                         ` Alfred M. Szmidt
2006-03-25 14:51                       ` cascardo
2006-03-26 16:27                         ` Andreas Dilger
2006-03-27 19:59                           ` Stephen C. Tweedie
2006-03-27 20:36                           ` Alfred M. Szmidt
2006-03-27 19:55                         ` Stephen C. Tweedie
2006-03-27 20:05                           ` Alfred M. Szmidt
2006-03-27 20:40                             ` Stephen C. Tweedie
2006-03-28  0:14                           ` cascardo
2006-03-21 20:26                     ` Andreas Dilger
2006-03-21  4:03                 ` Theodore Ts'o
2006-03-17  9:35       ` Laurent Vivier
2006-03-19  2:20 ` Theodore Ts'o
2006-03-20 10:11   ` Takashi Sato
2006-03-26  3:01     ` Theodore Ts'o
  -- strict thread matches above, loose matches on Subject: below --
2006-03-26 22:15 Chuck Ebbert
