All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] reiserfs: on-demand bitmap loading (testing only)
@ 2005-07-06 21:22 Jeff Mahoney
  2005-07-06 22:45 ` Hans Reiser
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Mahoney @ 2005-07-06 21:22 UTC (permalink / raw)
  To: ReiserFS List

[-- Attachment #1: Type: text/plain, Size: 2024 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

 Currently, ReiserFS will read and keep in memory all the bitmaps for
 the filesystem on mount. After the journal is replayed, it will read
 them in again. On huge filesystems, this can be a resource hog and a
 performance/ availability problem.

 For example, on a maximum size (~16 TB) ReiserFS filesystem, there are
 2^32 blocks, which require 131072 bitmaps to describe them. This means
 that without loading any of the metadata tree or accessing file data,
 just over 512M of RAM must be allocated (and is unswappable) for the
 filesystem to be mounted and completely idle. All of that data is
 distributed evenly over the entire disk, and must be read (twice!)
 on mount.

 There have been reports of large filesystems taking an unacceptably
 long time to mount. These mount times can take your 5 9's down pretty
 quickly.

 The following patch implements on-demand loading for bitmaps. Rather
 than pin all the bitmaps in memory as we do now, when a bitmap is
 needed it is read from disk. If it is needed frequently, the buffer
 cache will use existing heuristics to keep it around. The caching of
 bitmap metadata is kept, so that bitmaps that are known to be full are
 skipped completely.

 I have done some very basic testing on this, but I'd like to have some
 more eyes take a look.

 Caveats:

 The error handling in this revision is incomplete. This is a known
 issue as I would like to end up applying this patch after reworking
 the error handling in ReiserFS as a whole. Ultimately, a
 reiserfs_error() similar to ext3 will be introduced, which will allow
 smoother handling of errors than currently available.

 The "old bitmap" code is untested. In principle, the difference boils
 down to only where the bitmap block is located.

- -Jeff

- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)

iD8DBQFCzEt/LPWxlyuTD7IRAt4iAJ48L0ldd+jgowf1Qf6f6e90wbgacwCgjo5o
RSQzCzJAnYeLpYQDzkLBGVc=
=5ux1
-----END PGP SIGNATURE-----

[-- Attachment #2: reiserfs-on-demand-bitmaps.diff --]
[-- Type: text/x-patch, Size: 21685 bytes --]

From: Jeff Mahoney <jeffm@suse.com>
Subject: [PATCH] reiserfs: implement on-demand bitmap loading (testing only)

 Currently, ReiserFS will read and keep in memory all the bitmaps for the
 filesystem on mount. After the journal is replayed, it will read them in
 again. On huge filesystems, this can be a resource hog and a performance/
 availability problem.

 For example, on a maximum size (~16 TB) ReiserFS filesystem, there are
 2^32 blocks, which require 131072 bitmaps to describe them. This means that
 without loading any of the metadata tree or accessing file data, just over
 512M of RAM must be allocated (and is unswappable) for the filesystem to
 be mounted and completely idle. All of that data is distributed evenly over
 the entire disk, and must be read (twice!) on mount.

 There have been reports of large filesystems taking an unacceptably long time
 to mount. These mount times can take your 5 9's down pretty quickly.

 The following patch implements on-demand loading for bitmaps. Rather than pin
 all the bitmaps in memory as we do now, when a bitmap is needed it is read
 from disk. If it is needed frequently, the buffer cache will use existing
 heuristics to keep it around. The caching of bitmap metadata is kept, so that
 bitmaps that are known to be full are skipped completely.

 I have done some very basic testing on this, but I'd like to have some more
 eyes take a look.

 Caveats:

 The error handling in this revision is incomplete. This is a known issue
 as I would like to end up applying this patch after reworking the error
 handling in ReiserFS as a whole. Ultimately, a reiserfs_error() similar to
 ext3 will be introduced, which will allow smoother handling of errors than
 currently available.

 The "old bitmap" code is untested. In principle, the difference boils down to
 only where the bitmap block is located.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>

diff -ruNpX dontdiff linux-2.6.12.1/fs/reiserfs/bitmap.c linux-2.6.12.1.devel/fs/reiserfs/bitmap.c
--- linux-2.6.12.1/fs/reiserfs/bitmap.c	2005-06-30 12:51:42.000000000 -0400
+++ linux-2.6.12.1.devel/fs/reiserfs/bitmap.c	2005-06-30 16:40:29.000000000 -0400
@@ -61,6 +61,8 @@ static inline void get_bit_address (stru
 int is_reusable (struct super_block * s, b_blocknr_t block, int bit_value)
 {
     int i, j;
+    unsigned int bmap = block >> s->s_blocksize_bits;
+    struct buffer_head *bh;
 
     if (block == 0 || block >= SB_BLOCK_COUNT (s)) {
 	reiserfs_warning (s, "vs-4010: is_reusable: block number is out of range %lu (%u)",
@@ -68,14 +70,29 @@ int is_reusable (struct super_block * s,
 	return 0;
     }
 
-    /* it can't be one of the bitmap blocks */
-    for (i = 0; i < SB_BMAP_NR (s); i ++)
-	if (block == SB_AP_BITMAP (s)[i].bh->b_blocknr) {
-	    reiserfs_warning (s, "vs: 4020: is_reusable: "
-			      "bitmap block %lu(%u) can't be freed or reused",
-			      block, SB_BMAP_NR (s));
-	    return 0;
-	}
+    /* Old format filesystem? Unlikely, but the bitmaps are all up front so
+     * we need to account for it. */
+    if (unlikely (test_bit(REISERFS_OLD_FORMAT,
+                           &(REISERFS_SB(sb)->s_properties)))) {
+            block_nr_t bmap1 = REISERFS_SB(sb)->s_bh->b_blocknr + 1;
+            if (block >= REISERFS_SB(sb)->s_bh->b_blocknr + 1 + bmap &&
+                block <= REISERFS_SB(sb)->s_bh->b_blocknr + 1 + SB_BMAP_NR(s)) {
+                    reiserfs_warning (s, "vs: 4019: is_reusable: "
+                                      "bitmap block %lu(%u) can't be freed or "
+                                      "reused", block, SB_BMAP_NR (s));
+                    return 0
+            }
+    } else {
+            if (block & ((s->s_blocksize << 3) - 1 ) == 0)
+                    reiserfs_warning (s, "vs: 4020: is_reusable: "
+                                      "bitmap block %lu(%u) can't be freed or "
+                                      "reused", block, SB_BMAP_NR (s));
+                    return 0;
+                }
+            }
+    }
+
+
   
     get_bit_address (s, block, &i, &j);
 
@@ -85,16 +102,22 @@ int is_reusable (struct super_block * s,
 	return 0;
     }
 
+    bh  = read_bitmap_block (s, bmap);
+    if (!bh)
+        return 0;
+
     if ((bit_value == 0 && 
-         reiserfs_test_le_bit(j, SB_AP_BITMAP(s)[i].bh->b_data)) ||
+         reiserfs_test_le_bit(j, bh->b_data)) ||
 	(bit_value == 1 && 
-	 reiserfs_test_le_bit(j, SB_AP_BITMAP (s)[i].bh->b_data) == 0)) {
+	 reiserfs_test_le_bit(j, bh->b_data) == 0)) {
 	reiserfs_warning (s, "vs-4040: is_reusable: corresponding bit of block %lu does not "
 			  "match required value (i==%d, j==%d) test_bit==%d",
-		block, i, j, reiserfs_test_le_bit (j, SB_AP_BITMAP (s)[i].bh->b_data));
+		block, i, j, reiserfs_test_le_bit (j, bh->b_data));
 
+	brelse (bh);
 	return 0;
     }
+    brelse (bh);
 
     if (bit_value == 0 && block == SB_ROOT_BLOCK (s)) {
 	reiserfs_warning (s, "vs-4050: is_reusable: this is root block (%u), "
@@ -134,6 +157,7 @@ static int scan_bitmap_block (struct rei
 {
     struct super_block *s = th->t_super;
     struct reiserfs_bitmap_info *bi=&SB_AP_BITMAP(s)[bmap_n];
+    struct buffer_head *bh;
     int end, next;
     int org = *beg;
 
@@ -150,22 +174,22 @@ static int scan_bitmap_block (struct rei
 	reiserfs_warning (s, "NULL bitmap info pointer for bitmap %d", bmap_n);
 	return 0;
     }
-    if (buffer_locked (bi->bh)) {
-       PROC_INFO_INC( s, scan_bitmap.wait );
-       __wait_on_buffer (bi->bh);
-    }
+
+    bh = read_bitmap_block (s, bmap_n);
+    if (!bh)
+        return 0;
 
     while (1) {
-	cont:
 	if (bi->free_count < min)
 		return 0; // No free blocks in this bitmap
 
 	/* search for a first zero bit -- beggining of a window */
 	*beg = reiserfs_find_next_zero_le_bit
-	        ((unsigned long*)(bi->bh->b_data), boundary, *beg);
+	        ((unsigned long*)(bh->b_data), boundary, *beg);
   
 	if (*beg + min > boundary) { /* search for a zero bit fails or the rest of bitmap block
 				      * cannot contain a zero window of minimum size */
+	    brelse (bh);
 	    return 0;
 	}
 
@@ -173,7 +197,7 @@ static int scan_bitmap_block (struct rei
 	    continue;
 	/* first zero bit found; we check next bits */
 	for (end = *beg + 1;; end ++) {
-	    if (end >= *beg + max || end >= boundary || reiserfs_test_le_bit (end, bi->bh->b_data)) {
+	    if (end >= *beg + max || end >= boundary || reiserfs_test_le_bit (end, bh->b_data)) {
 		next = end;
 		break;
 	    }
@@ -187,11 +211,11 @@ static int scan_bitmap_block (struct rei
 	 * (end) points to one bit after the window end */
 	if (end - *beg >= min) { /* it seems we have found window of proper size */
 	    int i;
-	    reiserfs_prepare_for_journal (s, bi->bh, 1);
+	    reiserfs_prepare_for_journal (s, bh, 1);
 	    /* try to set all blocks used checking are they still free */
 	    for (i = *beg; i < end; i++) {
 		/* It seems that we should not check in journal again. */
-		if (reiserfs_test_and_set_le_bit (i, bi->bh->b_data)) {
+		if (reiserfs_test_and_set_le_bit (i, bh->b_data)) {
 		    /* bit was set by another process
 		     * while we slept in prepare_for_journal() */
 		    PROC_INFO_INC( s, scan_bitmap.stolen );
@@ -202,21 +226,22 @@ static int scan_bitmap_block (struct rei
 		    }
 		    /* otherwise we clear all bit were set ... */
 		    while (--i >= *beg)
-			reiserfs_test_and_clear_le_bit (i, bi->bh->b_data);
-		    reiserfs_restore_prepared_buffer (s, bi->bh);
+			reiserfs_test_and_clear_le_bit (i, bh->b_data);
+		    reiserfs_restore_prepared_buffer (s, bh);
 		    *beg = org;
 		    /* ... and search again in current block from beginning */
-		    goto cont;	
+		    continue;
 		}
 	    }
 	    bi->free_count -= (end - *beg);
-	    journal_mark_dirty (th, s, bi->bh);
+	    journal_mark_dirty (th, s, bh);
 
 	    /* free block count calculation */
 	    reiserfs_prepare_for_journal (s, SB_BUFFER_WITH_SB(s), 1);
 	    PUT_SB_FREE_BLOCKS(s, SB_FREE_BLOCKS(s) - (end - *beg));
 	    journal_mark_dirty (th, s, SB_BUFFER_WITH_SB(s));
 
+	    brelse (bh);
 	    return end - (*beg);
 	} else {
 	    *beg = next;
@@ -349,7 +374,7 @@ static void _reiserfs_free_block (struct
 {
     struct super_block * s = th->t_super;
     struct reiserfs_super_block * rs;
-    struct buffer_head * sbh;
+    struct buffer_head * sbh, *bmbh;
     struct reiserfs_bitmap_info *apbi;
     int nr, offset;
 
@@ -370,16 +395,23 @@ static void _reiserfs_free_block (struct
 	return;
     }
 
-    reiserfs_prepare_for_journal(s, apbi[nr].bh, 1 ) ;
+    bmbh = read_bitmap_block (s, nr);
+    if (!bmbh) {
+        reiserfs_warning (s, "jdm-4077: reiserfs_free_block: couldn't read bitmap %d\n", nr);
+        return;
+    }
+
+    reiserfs_prepare_for_journal(s, bmbh, 1 ) ;
 
     /* clear bit for the given block in bit map */
-    if (!reiserfs_test_and_clear_le_bit (offset, apbi[nr].bh->b_data)) {
+    if (!reiserfs_test_and_clear_le_bit (offset, bmbh->b_data)) {
 	reiserfs_warning (s, "vs-4080: reiserfs_free_block: "
 			  "free_block (%s:%lu)[dev:blocknr]: bit already cleared",
 			  reiserfs_bdevname (s), block);
     }
     apbi[nr].free_count ++;
-    journal_mark_dirty (th, s, apbi[nr].bh);
+    journal_mark_dirty (th, s, bmbh);
+    brelse (bmbh);
 
     reiserfs_prepare_for_journal(s, sbh, 1) ;
     /* update super block */
@@ -1168,3 +1200,82 @@ int reiserfs_can_fit_pages ( struct supe
 
 	return space>0?space:0;
 }
+
+/* cache_bitmap_metadata - Caches bitmap metadata so that the bitmap itself
+ *                         need not be reloaded only to find that it is full.
+ * @bh: buffer_head containing the bitmap data
+ * @info: info struct to cache the data
+ *
+ * This will cache the number of free blocks and which block is the first
+ * available. 
+ *
+ * The values are considered uncached if ->first_zero_hint is 0, since
+ * the bitmap block itself always occupies block 0 of the bitmap.
+ *
+ */
+static void
+cache_bitmap_metadata (struct super_block *sb, struct buffer_head *bh,
+                       struct reiserfs_bitmap_info *info)
+{
+	unsigned long *cur = (unsigned long *)bh->b_data;
+	int i;
+
+	for (i = sb->s_blocksize / sizeof (*cur); i > 0; i--, cur++) {
+		/* 0 and ~0 are special, we can optimize for them */
+		if (*cur == 0) {
+			info->first_zero_hint = i << 3;
+			info->free_count += sizeof (*cur) << 3;
+		} else if (*cur != ~0L) { /* A mix, investigate */
+			int b;
+			for (b = sizeof (*cur) << 3; b >= 0; b--) {
+				if (!reiserfs_test_le_bit (b, cur)) {
+					info->first_zero_hint = (i << 3) + b;
+					info->free_count++;
+				}
+			}
+		}
+	}
+
+	/* The first bit must ALWAYS be 1 */
+	BUG_ON (info->first_zero_hint == 0);
+}
+
+/* read_bitmap_block - reads a bitmap block from disk dynamically
+ * @sb: super_block associated with filesystem
+ * @bitmap: bitmap number to load
+ *
+ * On huge filesystems, the initial load time for bitmaps is unacceptably long.
+ * Mount time as long as 15 minutes have been reported on huge (~ 15 TiB)
+ * filesystems. Pinning the bitmaps in memory on these huge filesystems is
+ * also a non-trivial resource use - 480 MB on that same 15 TiB filesystem.
+ * Loading the bitmaps dynamically on demand alleviates both problems.
+ */
+struct buffer_head *
+read_bitmap_block (struct super_block *sb, unsigned int bitmap)
+{
+	unsigned int block = (sb->s_blocksize << 3) * bitmap;
+	struct reiserfs_bitmap_info *info = SB_AP_BITMAP (sb) + bitmap;
+	struct buffer_head *bh;
+
+	/* Way old format filesystems had the bitmaps packed up front.
+	 * I doubt there are any of these left, but just in case... */
+	if (test_bit(REISERFS_OLD_FORMAT, &(REISERFS_SB(sb)->s_properties))) {
+		block = REISERFS_SB (sb)->s_sbh->b_blocknr + 1 + bitmap;
+	}
+
+	bh = sb_bread (sb, block);
+
+	if (bh == NULL) {
+		reiserfs_warning (sb, "jdm-4900: Cannot read bitmap %u "
+		                      "(block %u)", bitmap, block);
+		return NULL;
+	}
+
+	/* If we've already cached the metadata, ->first_zero_hint must be > 0,
+	 * since the bitmap itself occupies the first block described by the
+	 * bitmap */
+	if (info->first_zero_hint == 0)
+		cache_bitmap_metadata (sb, bh, info);
+
+	return bh;
+}
diff -ruNpX dontdiff linux-2.6.12.1/fs/reiserfs/resize.c linux-2.6.12.1.devel/fs/reiserfs/resize.c
--- linux-2.6.12.1/fs/reiserfs/resize.c	2005-03-02 02:37:55.000000000 -0500
+++ linux-2.6.12.1.devel/fs/reiserfs/resize.c	2005-06-30 16:13:07.000000000 -0400
@@ -21,9 +21,9 @@ int reiserfs_resize (struct super_block 
 {
         int err = 0;
 	struct reiserfs_super_block * sb;
-        struct reiserfs_bitmap_info *bitmap;
+	struct reiserfs_bitmap_info *bitmap, *info;
 	struct reiserfs_bitmap_info *old_bitmap = SB_AP_BITMAP(s);
-	struct buffer_head * bh;
+	struct buffer_head * bh, *bmbh;
 	struct reiserfs_transaction_handle th;
 	unsigned int bmap_nr_new, bmap_nr;
 	unsigned int block_r_new, block_r;
@@ -107,29 +107,31 @@ int reiserfs_resize (struct super_block 
 	
 	    /* allocate additional bitmap blocks, reallocate array of bitmap
 	     * block pointers */
-	    bitmap = vmalloc(sizeof(struct reiserfs_bitmap_info) * bmap_nr_new);
+	    bitmap = vmalloc(sizeof(*bitmap) * bmap_nr_new);
 	    if (!bitmap) {
 		/* Journal bitmaps are still supersized, but the memory isn't
 		 * leaked, so I guess it's ok */
 		printk("reiserfs_resize: unable to allocate memory.\n");
 		return -ENOMEM;
 	    }
-	    memset (bitmap, 0, sizeof (struct reiserfs_bitmap_info) * SB_BMAP_NR(s));
-	    for (i = 0; i < bmap_nr; i++)
-		bitmap[i] = old_bitmap[i];
+
+	    /* Copy what we have, clear out the rest */
+	    memcpy (bitmap, old_bitmap, sizeof (*bitmap) * SB_BMAP_NR(s));
+	    memset (bitmap + SB_BMAP_NR(s), 0,
+	            sizeof (*bitmap) * (bmap_nr - SB_BMAP_NR(s)));
 
 	    /* This doesn't go through the journal, but it doesn't have to.
 	     * The changes are still atomic: We're synced up when the journal
 	     * transaction begins, and the new bitmaps don't matter if the
 	     * transaction fails. */
 	    for (i = bmap_nr; i < bmap_nr_new; i++) {
-		bitmap[i].bh = sb_getblk(s, i * s->s_blocksize * 8);
-		memset(bitmap[i].bh->b_data, 0, sb_blocksize(sb));
-		reiserfs_test_and_set_le_bit(0, bitmap[i].bh->b_data);
-
-		set_buffer_uptodate(bitmap[i].bh);
-		mark_buffer_dirty(bitmap[i].bh) ;
-		sync_dirty_buffer(bitmap[i].bh);
+		bmbh = sb_bread (s, i * s->s_blocksize * 8);
+		memset(bmbh->b_data, 0, sb_blocksize(sb));
+		reiserfs_test_and_set_le_bit(0, bmbh->b_data);
+
+		set_buffer_uptodate(bmbh);
+		mark_buffer_dirty(bmbh) ;
+		sync_dirty_buffer(bmbh);
 		// update bitmap_info stuff
 		bitmap[i].first_zero_hint=1;
 		bitmap[i].free_count = sb_blocksize(sb) * 8 - 1;
@@ -146,27 +148,46 @@ int reiserfs_resize (struct super_block 
 	if (err)
 	    return err;
 
-	/* correct last bitmap blocks in old and new disk layout */
-	reiserfs_prepare_for_journal(s, SB_AP_BITMAP(s)[bmap_nr - 1].bh, 1);
-	for (i = block_r; i < s->s_blocksize * 8; i++)
-	    reiserfs_test_and_clear_le_bit(i, 
-					   SB_AP_BITMAP(s)[bmap_nr - 1].bh->b_data);
-	SB_AP_BITMAP(s)[bmap_nr - 1].free_count += s->s_blocksize * 8 - block_r;
-	if ( !SB_AP_BITMAP(s)[bmap_nr - 1].first_zero_hint)
-	    SB_AP_BITMAP(s)[bmap_nr - 1].first_zero_hint = block_r;
+	/* Extend old last bitmap block - new blocks have been made available */
+	bmbh = read_bitmap_block (s, bmap_nr - 1);
+	if (!bmbh) {
+	    int jerr = journal_end (&th, s, 10);
+	    if (jerr)
+		return jerr;
+	    return -EIO;
+	}
 
-	journal_mark_dirty(&th, s, SB_AP_BITMAP(s)[bmap_nr - 1].bh);
+	info = SB_AP_BITMAP(s) + bmap_nr - 1;
+	reiserfs_prepare_for_journal(s, bmbh, 1);
+	for (i = block_r; i < s->s_blocksize * 8; i++)
+	    reiserfs_test_and_clear_le_bit(i, bmbh->b_data);
+	info->free_count += s->s_blocksize * 8 - block_r;
+	if ( !info->first_zero_hint)
+	    info->first_zero_hint = block_r;
+
+	journal_mark_dirty(&th, s, bmbh);
+	brelse (bmbh);
+
+	/* Correct new last bitmap block - It may not be full */
+	bmbh = read_bitmap_block (s, bmap_nr_new - 1);
+	info = SB_AP_BITMAP(s) + bmap_nr_new - 1;
+	if (!bmbh) {
+	    int jerr = journal_end (&th, s, 10);
+	    if (jerr)
+		return jerr;
+	    return -EIO;
+	}
 
-	reiserfs_prepare_for_journal(s, SB_AP_BITMAP(s)[bmap_nr_new - 1].bh, 1);
+	reiserfs_prepare_for_journal(s, bmbh, 1);
 	for (i = block_r_new; i < s->s_blocksize * 8; i++)
-	    reiserfs_test_and_set_le_bit(i,
-					 SB_AP_BITMAP(s)[bmap_nr_new - 1].bh->b_data);
-	journal_mark_dirty(&th, s, SB_AP_BITMAP(s)[bmap_nr_new - 1].bh);
+	    reiserfs_test_and_set_le_bit(i, bmbh->b_data);
+	journal_mark_dirty(&th, s, bmbh);
  
-	SB_AP_BITMAP(s)[bmap_nr_new - 1].free_count -= s->s_blocksize * 8 - block_r_new;
+	info->free_count -= s->s_blocksize * 8 - block_r_new;
 	/* Extreme case where last bitmap is the only valid block in itself. */
-	if ( !SB_AP_BITMAP(s)[bmap_nr_new - 1].free_count )
-	    SB_AP_BITMAP(s)[bmap_nr_new - 1].first_zero_hint = 0;
+	if ( !info->free_count )
+	    info->first_zero_hint = 0;
+
  	/* update super */
 	reiserfs_prepare_for_journal(s, SB_BUFFER_WITH_SB(s), 1) ;
 	free_blocks = SB_FREE_BLOCKS(s);
diff -ruNpX dontdiff linux-2.6.12.1/fs/reiserfs/super.c linux-2.6.12.1.devel/fs/reiserfs/super.c
--- linux-2.6.12.1/fs/reiserfs/super.c	2005-06-30 12:51:42.000000000 -0400
+++ linux-2.6.12.1.devel/fs/reiserfs/super.c	2005-06-30 16:15:26.000000000 -0400
@@ -417,7 +417,6 @@ int remove_save_link (struct inode * ino
 
 static void reiserfs_put_super (struct super_block * s)
 {
-  int i;
   struct reiserfs_transaction_handle th ;
   th.t_trans_id = 0;
 
@@ -445,9 +444,6 @@ static void reiserfs_put_super (struct s
   */
   journal_release(&th, s) ;
 
-  for (i = 0; i < SB_BMAP_NR (s); i ++)
-    brelse (SB_AP_BITMAP (s)[i].bh);
-
   vfree (SB_AP_BITMAP (s));
 
   brelse (SB_BUFFER_WITH_SB (s));
@@ -1188,6 +1184,20 @@ static int reiserfs_remount (struct supe
   return 0;
 }
 
+static struct reiserfs_bitmap_info *
+init_bitmap_cache (struct super_block *sb)
+{
+	struct reiserfs_bitmap_info *bitmap;
+
+	bitmap = vmalloc (sizeof (*bitmap) * SB_BMAP_NR (sb));
+	if (bitmap)
+		memset (bitmap, 0, sizeof (*bitmap) * SB_BMAP_NR (sb));
+
+	return bitmap;
+}
+
+
+#if 0
 /* load_bitmap_info_data - Sets up the reiserfs_bitmap_info structure from disk.
  * @sb - superblock for this filesystem
  * @bi - the bitmap info to be loaded. Requires that bi->bh is valid.
@@ -1293,6 +1303,7 @@ static int read_old_bitmaps (struct supe
   return 0;
 }
 
+#endif
 static int read_super_block (struct super_block * s, int offset)
 {
     struct buffer_head * bh;
@@ -1389,7 +1400,6 @@ static int read_super_block (struct supe
 
 /* after journal replay, reread all bitmap and super blocks */
 static int reread_meta_blocks(struct super_block *s) {
-  int i ;
   ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s))) ;
   wait_on_buffer(SB_BUFFER_WITH_SB(s)) ;
   if (!buffer_uptodate(SB_BUFFER_WITH_SB(s))) {
@@ -1397,6 +1407,7 @@ static int reread_meta_blocks(struct sup
     return 1 ;
   }
 
+#if 0
   for (i = 0; i < SB_BMAP_NR(s) ; i++) {
     ll_rw_block(READ, 1, &(SB_AP_BITMAP(s)[i].bh)) ;
     wait_on_buffer(SB_AP_BITMAP(s)[i].bh) ;
@@ -1406,6 +1417,7 @@ static int reread_meta_blocks(struct sup
       return 1 ;
     }
   }
+#endif
   return 0 ;
 
 }
@@ -1639,8 +1651,9 @@ static int reiserfs_fill_super (struct s
     sbi->s_mount_state = SB_REISERFS_STATE(s);
     sbi->s_mount_state = REISERFS_VALID_FS ;
 
-    if (old_format ? read_old_bitmaps(s) : read_bitmaps(s)) {
-	SWARN(silent, s, "jmacd-8: reiserfs_fill_super: unable to read bitmap");
+    SB_AP_BITMAP(s) = init_bitmap_cache (s);
+    if (!SB_AP_BITMAP(s)) {
+	SWARN(silent, s, "jdm-8: reiserfs_fill_super: unable to init bitmap cache");
 	goto error;
     }
 #ifdef CONFIG_REISERFS_CHECK
@@ -1799,12 +1812,7 @@ static int reiserfs_fill_super (struct s
 	journal_release_error(NULL, s) ;
     }
     if (SB_DISK_SUPER_BLOCK (s)) {
-	for (j = 0; j < SB_BMAP_NR (s); j ++) {
-	    if (SB_AP_BITMAP (s))
-		brelse (SB_AP_BITMAP (s)[j].bh);
-	}
-	if (SB_AP_BITMAP (s))
-	    vfree (SB_AP_BITMAP (s));
+	vfree (SB_AP_BITMAP (s));
     }
     if (SB_BUFFER_WITH_SB (s))
 	brelse(SB_BUFFER_WITH_SB (s));
diff -ruNpX dontdiff linux-2.6.12.1/include/linux/reiserfs_fs.h linux-2.6.12.1.devel/include/linux/reiserfs_fs.h
--- linux-2.6.12.1/include/linux/reiserfs_fs.h	2005-06-30 12:51:48.000000000 -0400
+++ linux-2.6.12.1.devel/include/linux/reiserfs_fs.h	2005-06-30 16:13:59.000000000 -0400
@@ -2111,6 +2111,7 @@ void reiserfs_init_alloc_options (struct
  */
 __le32 reiserfs_choose_packing(struct inode *dir);
 
+struct buffer_head * read_bitmap_block (struct super_block *sb, unsigned int bitmap);
 int is_reusable (struct super_block * s, b_blocknr_t block, int bit_value);
 void reiserfs_free_block (struct reiserfs_transaction_handle *th, struct inode *, b_blocknr_t, int for_unformatted);
 int reiserfs_allocate_blocknrs(reiserfs_blocknr_hint_t *, b_blocknr_t * , int, int);
diff -ruNpX dontdiff linux-2.6.12.1/include/linux/reiserfs_fs_sb.h linux-2.6.12.1.devel/include/linux/reiserfs_fs_sb.h
--- linux-2.6.12.1/include/linux/reiserfs_fs_sb.h	2005-03-02 02:38:09.000000000 -0500
+++ linux-2.6.12.1.devel/include/linux/reiserfs_fs_sb.h	2005-06-30 15:43:09.000000000 -0400
@@ -269,7 +269,6 @@ struct reiserfs_bitmap_info
     // FIXME: Won't work with block sizes > 8K
     __u16  first_zero_hint;
     __u16  free_count;
-    struct buffer_head *bh; /* the actual bitmap */
 };
 
 struct proc_dir_entry;
@@ -419,6 +418,7 @@ struct reiserfs_sb_info
 /* Definitions of reiserfs on-disk properties: */
 #define REISERFS_3_5 0
 #define REISERFS_3_6 1
+#define REISERFS_OLD_FORMAT 2
 
 enum reiserfs_mount_options {
 /* Mount options */

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-06 21:22 [PATCH] reiserfs: on-demand bitmap loading (testing only) Jeff Mahoney
@ 2005-07-06 22:45 ` Hans Reiser
  2005-07-07 16:04   ` Jeff Mahoney
  2005-07-08  5:35   ` Alexander Zarochentsev
  0 siblings, 2 replies; 13+ messages in thread
From: Hans Reiser @ 2005-07-06 22:45 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: ReiserFS List, Nate Diller

Jeff, are you sure that you need this code to exist?  Here are the
problems I see:

    for the average case, it is suboptimal.  The seeks to the bitmaps
are far more expensive than the averaged cost of keeping them in ram.

    for 16TB filesystems, they will have plenty of budget for ram

    it complicates code if it has to worry about such things as not
enough clean memory for holding bitmaps, etc.

    It is more appropriate to write this kind of code for the
development branch which is V4.  This kind of code is likely to have
hard to test and hit bugs.

    The mount time problem should be solved by querying the device
geometry, and inserting into the queue requests for every disk drive in
parallel.  The current code fails to keep all the spindles busy.  It
would be nice if there was general purpose code for querying about how a
device divides into spindles so that scheduling in general can be optimized.

    This should be a nondefault mount option.

That said, thanks for paying attention to a problem Namesys discussed
but lacked the manpower for addressing.  Do you think you could discuss
your plans before coding next time?  I agree that ReiserFS V3 and V4
mount time is too long.  15 minutes is clearly not acceptable.  Perhaps
there is a deeper IO scheduler problem beyond bitmaps that should be
addressed though.....

Hans

Jeff Mahoney wrote:

>  Currently, ReiserFS will read and keep in memory all the bitmaps for
>  the filesystem on mount. After the journal is replayed, it will read
>  them in again. On huge filesystems, this can be a resource hog and a
>  performance/ availability problem.
>
>  For example, on a maximum size (~16 TB) ReiserFS filesystem, there are
>  2^32 blocks, which require 131072 bitmaps to describe them. This means
>  that without loading any of the metadata tree or accessing file data,
>  just over 512M of RAM must be allocated (and is unswappable) for the
>  filesystem to be mounted and completely idle. All of that data is
>  distributed evenly over the entire disk, and must be read (twice!)
>  on mount.
>
>  There have been reports of large filesystems taking an unacceptably
>  long time to mount. These mount times can take your 5 9's down pretty
>  quickly.
>
>  The following patch implements on-demand loading for bitmaps. Rather
>  than pin all the bitmaps in memory as we do now, when a bitmap is
>  needed it is read from disk. If it is needed frequently, the buffer
>  cache will use existing heuristics to keep it around. The caching of
>  bitmap metadata is kept, so that bitmaps that are known to be full are
>  skipped completely.
>
>  I have done some very basic testing on this, but I'd like to have some
>  more eyes take a look.
>
>  Caveats:
>
>  The error handling in this revision is incomplete. This is a known
>  issue as I would like to end up applying this patch after reworking
>  the error handling in ReiserFS as a whole. Ultimately, a
>  reiserfs_error() similar to ext3 will be introduced, which will allow
>  smoother handling of errors than currently available.
>
>  The "old bitmap" code is untested. In principle, the difference boils
>  down to only where the bitmap block is located.
>
> -Jeff
>
> --
> Jeff Mahoney
> SuSE Labs


-------------------------

From: Jeff Mahoney <jeffm@suse.com>
Subject: [PATCH] reiserfs: implement on-demand bitmap loading (testing only)

 Currently, ReiserFS will read and keep in memory all the bitmaps for the
 filesystem on mount. After the journal is replayed, it will read them in
 again. On huge filesystems, this can be a resource hog and a performance/
 availability problem.

 For example, on a maximum size (~16 TB) ReiserFS filesystem, there are
 2^32 blocks, which require 131072 bitmaps to describe them. This means that
 without loading any of the metadata tree or accessing file data, just over
 512M of RAM must be allocated (and is unswappable) for the filesystem to
 be mounted and completely idle. All of that data is distributed evenly over
 the entire disk, and must be read (twice!) on mount.

 There have been reports of large filesystems taking an unacceptably
long time
 to mount. These mount times can take your 5 9's down pretty quickly.

 The following patch implements on-demand loading for bitmaps. Rather
than pin
 all the bitmaps in memory as we do now, when a bitmap is needed it is read
 from disk. If it is needed frequently, the buffer cache will use existing
 heuristics to keep it around. The caching of bitmap metadata is kept,
so that
 bitmaps that are known to be full are skipped completely.

 I have done some very basic testing on this, but I'd like to have some more
 eyes take a look.

 Caveats:

 The error handling in this revision is incomplete. This is a known issue
 as I would like to end up applying this patch after reworking the error
 handling in ReiserFS as a whole. Ultimately, a reiserfs_error() similar to
 ext3 will be introduced, which will allow smoother handling of errors than
 currently available.

 The "old bitmap" code is untested. In principle, the difference boils
down to
 only where the bitmap block is located.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>

diff -ruNpX dontdiff linux-2.6.12.1/fs/reiserfs/bitmap.c
linux-2.6.12.1.devel/fs/reiserfs/bitmap.c
--- linux-2.6.12.1/fs/reiserfs/bitmap.c    2005-06-30 12:51:42.000000000
-0400
+++ linux-2.6.12.1.devel/fs/reiserfs/bitmap.c    2005-06-30
16:40:29.000000000 -0400
@@ -61,6 +61,8 @@ static inline void get_bit_address (stru
 int is_reusable (struct super_block * s, b_blocknr_t block, int bit_value)
 {
     int i, j;
+    unsigned int bmap = block >> s->s_blocksize_bits;
+    struct buffer_head *bh;
 
     if (block == 0 || block >= SB_BLOCK_COUNT (s)) {
     reiserfs_warning (s, "vs-4010: is_reusable: block number is out of
range %lu (%u)",
@@ -68,14 +70,29 @@ int is_reusable (struct super_block * s,
     return 0;
     }
 
-    /* it can't be one of the bitmap blocks */
-    for (i = 0; i < SB_BMAP_NR (s); i ++)
-    if (block == SB_AP_BITMAP (s)[i].bh->b_blocknr) {
-        reiserfs_warning (s, "vs: 4020: is_reusable: "
-                  "bitmap block %lu(%u) can't be freed or reused",
-                  block, SB_BMAP_NR (s));
-        return 0;
-    }
+    /* Old format filesystem? Unlikely, but the bitmaps are all up front so
+     * we need to account for it. */
+    if (unlikely (test_bit(REISERFS_OLD_FORMAT,
+                           &(REISERFS_SB(sb)->s_properties)))) {
+            block_nr_t bmap1 = REISERFS_SB(sb)->s_bh->b_blocknr + 1;
+            if (block >= REISERFS_SB(sb)->s_bh->b_blocknr + 1 + bmap &&
+                block <= REISERFS_SB(sb)->s_bh->b_blocknr + 1 +
SB_BMAP_NR(s)) {
+                    reiserfs_warning (s, "vs: 4019: is_reusable: "
+                                      "bitmap block %lu(%u) can't be
freed or "
+                                      "reused", block, SB_BMAP_NR (s));
+                    return 0
+            }
+    } else {
+            if (block & ((s->s_blocksize << 3) - 1 ) == 0)
+                    reiserfs_warning (s, "vs: 4020: is_reusable: "
+                                      "bitmap block %lu(%u) can't be
freed or "
+                                      "reused", block, SB_BMAP_NR (s));
+                    return 0;
+                }
+            }
+    }
+
+
   
     get_bit_address (s, block, &i, &j);
 
@@ -85,16 +102,22 @@ int is_reusable (struct super_block * s,
     return 0;
     }
 
+    bh  = read_bitmap_block (s, bmap);
+    if (!bh)
+        return 0;
+
     if ((bit_value == 0 &&
-         reiserfs_test_le_bit(j, SB_AP_BITMAP(s)[i].bh->b_data)) ||
+         reiserfs_test_le_bit(j, bh->b_data)) ||
     (bit_value == 1 &&
-     reiserfs_test_le_bit(j, SB_AP_BITMAP (s)[i].bh->b_data) == 0)) {
+     reiserfs_test_le_bit(j, bh->b_data) == 0)) {
     reiserfs_warning (s, "vs-4040: is_reusable: corresponding bit of
block %lu does not "
               "match required value (i==%d, j==%d) test_bit==%d",
-        block, i, j, reiserfs_test_le_bit (j, SB_AP_BITMAP
(s)[i].bh->b_data));
+        block, i, j, reiserfs_test_le_bit (j, bh->b_data));
 
+    brelse (bh);
     return 0;
     }
+    brelse (bh);
 
     if (bit_value == 0 && block == SB_ROOT_BLOCK (s)) {
     reiserfs_warning (s, "vs-4050: is_reusable: this is root block (%u), "
@@ -134,6 +157,7 @@ static int scan_bitmap_block (struct rei
 {
     struct super_block *s = th->t_super;
     struct reiserfs_bitmap_info *bi=&SB_AP_BITMAP(s)[bmap_n];
+    struct buffer_head *bh;
     int end, next;
     int org = *beg;
 
@@ -150,22 +174,22 @@ static int scan_bitmap_block (struct rei
     reiserfs_warning (s, "NULL bitmap info pointer for bitmap %d", bmap_n);
     return 0;
     }
-    if (buffer_locked (bi->bh)) {
-       PROC_INFO_INC( s, scan_bitmap.wait );
-       __wait_on_buffer (bi->bh);
-    }
+
+    bh = read_bitmap_block (s, bmap_n);
+    if (!bh)
+        return 0;
 
     while (1) {
-    cont:
     if (bi->free_count < min)
         return 0; // No free blocks in this bitmap
 
     /* search for a first zero bit -- beggining of a window */
     *beg = reiserfs_find_next_zero_le_bit
-            ((unsigned long*)(bi->bh->b_data), boundary, *beg);
+            ((unsigned long*)(bh->b_data), boundary, *beg);
   
     if (*beg + min > boundary) { /* search for a zero bit fails or the
rest of bitmap block
                       * cannot contain a zero window of minimum size */
+        brelse (bh);
         return 0;
     }
 
@@ -173,7 +197,7 @@ static int scan_bitmap_block (struct rei
         continue;
     /* first zero bit found; we check next bits */
     for (end = *beg + 1;; end ++) {
-        if (end >= *beg + max || end >= boundary ||
reiserfs_test_le_bit (end, bi->bh->b_data)) {
+        if (end >= *beg + max || end >= boundary ||
reiserfs_test_le_bit (end, bh->b_data)) {
         next = end;
         break;
         }
@@ -187,11 +211,11 @@ static int scan_bitmap_block (struct rei
      * (end) points to one bit after the window end */
     if (end - *beg >= min) { /* it seems we have found window of proper
size */
         int i;
-        reiserfs_prepare_for_journal (s, bi->bh, 1);
+        reiserfs_prepare_for_journal (s, bh, 1);
         /* try to set all blocks used checking are they still free */
         for (i = *beg; i < end; i++) {
         /* It seems that we should not check in journal again. */
-        if (reiserfs_test_and_set_le_bit (i, bi->bh->b_data)) {
+        if (reiserfs_test_and_set_le_bit (i, bh->b_data)) {
             /* bit was set by another process
              * while we slept in prepare_for_journal() */
             PROC_INFO_INC( s, scan_bitmap.stolen );
@@ -202,21 +226,22 @@ static int scan_bitmap_block (struct rei
             }
             /* otherwise we clear all bit were set ... */
             while (--i >= *beg)
-            reiserfs_test_and_clear_le_bit (i, bi->bh->b_data);
-            reiserfs_restore_prepared_buffer (s, bi->bh);
+            reiserfs_test_and_clear_le_bit (i, bh->b_data);
+            reiserfs_restore_prepared_buffer (s, bh);
             *beg = org;
             /* ... and search again in current block from beginning */
-            goto cont;    
+            continue;
         }
         }
         bi->free_count -= (end - *beg);
-        journal_mark_dirty (th, s, bi->bh);
+        journal_mark_dirty (th, s, bh);
 
         /* free block count calculation */
         reiserfs_prepare_for_journal (s, SB_BUFFER_WITH_SB(s), 1);
         PUT_SB_FREE_BLOCKS(s, SB_FREE_BLOCKS(s) - (end - *beg));
         journal_mark_dirty (th, s, SB_BUFFER_WITH_SB(s));
 
+        brelse (bh);
         return end - (*beg);
     } else {
         *beg = next;
@@ -349,7 +374,7 @@ static void _reiserfs_free_block (struct
 {
     struct super_block * s = th->t_super;
     struct reiserfs_super_block * rs;
-    struct buffer_head * sbh;
+    struct buffer_head * sbh, *bmbh;
     struct reiserfs_bitmap_info *apbi;
     int nr, offset;
 
@@ -370,16 +395,23 @@ static void _reiserfs_free_block (struct
     return;
     }
 
-    reiserfs_prepare_for_journal(s, apbi[nr].bh, 1 ) ;
+    bmbh = read_bitmap_block (s, nr);
+    if (!bmbh) {
+        reiserfs_warning (s, "jdm-4077: reiserfs_free_block: couldn't
read bitmap %d\n", nr);
+        return;
+    }
+
+    reiserfs_prepare_for_journal(s, bmbh, 1 ) ;
 
     /* clear bit for the given block in bit map */
-    if (!reiserfs_test_and_clear_le_bit (offset, apbi[nr].bh->b_data)) {
+    if (!reiserfs_test_and_clear_le_bit (offset, bmbh->b_data)) {
     reiserfs_warning (s, "vs-4080: reiserfs_free_block: "
               "free_block (%s:%lu)[dev:blocknr]: bit already cleared",
               reiserfs_bdevname (s), block);
     }
     apbi[nr].free_count ++;
-    journal_mark_dirty (th, s, apbi[nr].bh);
+    journal_mark_dirty (th, s, bmbh);
+    brelse (bmbh);
 
     reiserfs_prepare_for_journal(s, sbh, 1) ;
     /* update super block */
@@ -1168,3 +1200,82 @@ int reiserfs_can_fit_pages ( struct supe
 
     return space>0?space:0;
 }
+
+/* cache_bitmap_metadata - Caches bitmap metadata so that the bitmap itself
+ *                         need not be reloaded only to find that it is
full.
+ * @bh: buffer_head containing the bitmap data
+ * @info: info struct to cache the data
+ *
+ * This will cache the number of free blocks and which block is the first
+ * available.
+ *
+ * The values are considered uncached if ->first_zero_hint is 0, since
+ * the bitmap block itself always occupies block 0 of the bitmap.
+ *
+ */
+static void
+cache_bitmap_metadata (struct super_block *sb, struct buffer_head *bh,
+                       struct reiserfs_bitmap_info *info)
+{
+    unsigned long *cur = (unsigned long *)bh->b_data;
+    int i;
+
+    for (i = sb->s_blocksize / sizeof (*cur); i > 0; i--, cur++) {
+        /* 0 and ~0 are special, we can optimize for them */
+        if (*cur == 0) {
+            info->first_zero_hint = i << 3;
+            info->free_count += sizeof (*cur) << 3;
+        } else if (*cur != ~0L) { /* A mix, investigate */
+            int b;
+            for (b = sizeof (*cur) << 3; b >= 0; b--) {
+                if (!reiserfs_test_le_bit (b, cur)) {
+                    info->first_zero_hint = (i << 3) + b;
+                    info->free_count++;
+                }
+            }
+        }
+    }
+
+    /* The first bit must ALWAYS be 1 */
+    BUG_ON (info->first_zero_hint == 0);
+}
+
+/* read_bitmap_block - reads a bitmap block from disk dynamically
+ * @sb: super_block associated with filesystem
+ * @bitmap: bitmap number to load
+ *
+ * On huge filesystems, the initial load time for bitmaps is
unacceptably long.
+ * Mount time as long as 15 minutes have been reported on huge (~ 15 TiB)
+ * filesystems. Pinning the bitmaps in memory on these huge filesystems is
+ * also a non-trivial resource use - 480 MB on that same 15 TiB filesystem.
+ * Loading the bitmaps dynamically on demand alleviates both problems.
+ */
+struct buffer_head *
+read_bitmap_block (struct super_block *sb, unsigned int bitmap)
+{
+    unsigned int block = (sb->s_blocksize << 3) * bitmap;
+    struct reiserfs_bitmap_info *info = SB_AP_BITMAP (sb) + bitmap;
+    struct buffer_head *bh;
+
+    /* Way old format filesystems had the bitmaps packed up front.
+     * I doubt there are any of these left, but just in case... */
+    if (test_bit(REISERFS_OLD_FORMAT, &(REISERFS_SB(sb)->s_properties))) {
+        block = REISERFS_SB (sb)->s_sbh->b_blocknr + 1 + bitmap;
+    }
+
+    bh = sb_bread (sb, block);
+
+    if (bh == NULL) {
+        reiserfs_warning (sb, "jdm-4900: Cannot read bitmap %u "
+                              "(block %u)", bitmap, block);
+        return NULL;
+    }
+
+    /* If we've already cached the metadata, ->first_zero_hint must be > 0,
+     * since the bitmap itself occupies the first block described by the
+     * bitmap */
+    if (info->first_zero_hint == 0)
+        cache_bitmap_metadata (sb, bh, info);
+
+    return bh;
+}
diff -ruNpX dontdiff linux-2.6.12.1/fs/reiserfs/resize.c
linux-2.6.12.1.devel/fs/reiserfs/resize.c
--- linux-2.6.12.1/fs/reiserfs/resize.c    2005-03-02 02:37:55.000000000
-0500
+++ linux-2.6.12.1.devel/fs/reiserfs/resize.c    2005-06-30
16:13:07.000000000 -0400
@@ -21,9 +21,9 @@ int reiserfs_resize (struct super_block
 {
         int err = 0;
     struct reiserfs_super_block * sb;
-        struct reiserfs_bitmap_info *bitmap;
+    struct reiserfs_bitmap_info *bitmap, *info;
     struct reiserfs_bitmap_info *old_bitmap = SB_AP_BITMAP(s);
-    struct buffer_head * bh;
+    struct buffer_head * bh, *bmbh;
     struct reiserfs_transaction_handle th;
     unsigned int bmap_nr_new, bmap_nr;
     unsigned int block_r_new, block_r;
@@ -107,29 +107,31 @@ int reiserfs_resize (struct super_block
     
         /* allocate additional bitmap blocks, reallocate array of bitmap
          * block pointers */
-        bitmap = vmalloc(sizeof(struct reiserfs_bitmap_info) *
bmap_nr_new);
+        bitmap = vmalloc(sizeof(*bitmap) * bmap_nr_new);
         if (!bitmap) {
         /* Journal bitmaps are still supersized, but the memory isn't
          * leaked, so I guess it's ok */
         printk("reiserfs_resize: unable to allocate memory.\n");
         return -ENOMEM;
         }
-        memset (bitmap, 0, sizeof (struct reiserfs_bitmap_info) *
SB_BMAP_NR(s));
-        for (i = 0; i < bmap_nr; i++)
-        bitmap[i] = old_bitmap[i];
+
+        /* Copy what we have, clear out the rest */
+        memcpy (bitmap, old_bitmap, sizeof (*bitmap) * SB_BMAP_NR(s));
+        memset (bitmap + SB_BMAP_NR(s), 0,
+                sizeof (*bitmap) * (bmap_nr - SB_BMAP_NR(s)));
 
         /* This doesn't go through the journal, but it doesn't have to.
          * The changes are still atomic: We're synced up when the journal
          * transaction begins, and the new bitmaps don't matter if the
          * transaction fails. */
         for (i = bmap_nr; i < bmap_nr_new; i++) {
-        bitmap[i].bh = sb_getblk(s, i * s->s_blocksize * 8);
-        memset(bitmap[i].bh->b_data, 0, sb_blocksize(sb));
-        reiserfs_test_and_set_le_bit(0, bitmap[i].bh->b_data);
-
-        set_buffer_uptodate(bitmap[i].bh);
-        mark_buffer_dirty(bitmap[i].bh) ;
-        sync_dirty_buffer(bitmap[i].bh);
+        bmbh = sb_bread (s, i * s->s_blocksize * 8);
+        memset(bmbh->b_data, 0, sb_blocksize(sb));
+        reiserfs_test_and_set_le_bit(0, bmbh->b_data);
+
+        set_buffer_uptodate(bmbh);
+        mark_buffer_dirty(bmbh) ;
+        sync_dirty_buffer(bmbh);
         // update bitmap_info stuff
         bitmap[i].first_zero_hint=1;
         bitmap[i].free_count = sb_blocksize(sb) * 8 - 1;
@@ -146,27 +148,46 @@ int reiserfs_resize (struct super_block
     if (err)
         return err;
 
-    /* correct last bitmap blocks in old and new disk layout */
-    reiserfs_prepare_for_journal(s, SB_AP_BITMAP(s)[bmap_nr - 1].bh, 1);
-    for (i = block_r; i < s->s_blocksize * 8; i++)
-        reiserfs_test_and_clear_le_bit(i,
-                       SB_AP_BITMAP(s)[bmap_nr - 1].bh->b_data);
-    SB_AP_BITMAP(s)[bmap_nr - 1].free_count += s->s_blocksize * 8 -
block_r;
-    if ( !SB_AP_BITMAP(s)[bmap_nr - 1].first_zero_hint)
-        SB_AP_BITMAP(s)[bmap_nr - 1].first_zero_hint = block_r;
+    /* Extend old last bitmap block - new blocks have been made
available */
+    bmbh = read_bitmap_block (s, bmap_nr - 1);
+    if (!bmbh) {
+        int jerr = journal_end (&th, s, 10);
+        if (jerr)
+        return jerr;
+        return -EIO;
+    }
 
-    journal_mark_dirty(&th, s, SB_AP_BITMAP(s)[bmap_nr - 1].bh);
+    info = SB_AP_BITMAP(s) + bmap_nr - 1;
+    reiserfs_prepare_for_journal(s, bmbh, 1);
+    for (i = block_r; i < s->s_blocksize * 8; i++)
+        reiserfs_test_and_clear_le_bit(i, bmbh->b_data);
+    info->free_count += s->s_blocksize * 8 - block_r;
+    if ( !info->first_zero_hint)
+        info->first_zero_hint = block_r;
+
+    journal_mark_dirty(&th, s, bmbh);
+    brelse (bmbh);
+
+    /* Correct new last bitmap block - It may not be full */
+    bmbh = read_bitmap_block (s, bmap_nr_new - 1);
+    info = SB_AP_BITMAP(s) + bmap_nr_new - 1;
+    if (!bmbh) {
+        int jerr = journal_end (&th, s, 10);
+        if (jerr)
+        return jerr;
+        return -EIO;
+    }
 
-    reiserfs_prepare_for_journal(s, SB_AP_BITMAP(s)[bmap_nr_new -
1].bh, 1);
+    reiserfs_prepare_for_journal(s, bmbh, 1);
     for (i = block_r_new; i < s->s_blocksize * 8; i++)
-        reiserfs_test_and_set_le_bit(i,
-                     SB_AP_BITMAP(s)[bmap_nr_new - 1].bh->b_data);
-    journal_mark_dirty(&th, s, SB_AP_BITMAP(s)[bmap_nr_new - 1].bh);
+        reiserfs_test_and_set_le_bit(i, bmbh->b_data);
+    journal_mark_dirty(&th, s, bmbh);
 
-    SB_AP_BITMAP(s)[bmap_nr_new - 1].free_count -= s->s_blocksize * 8 -
block_r_new;
+    info->free_count -= s->s_blocksize * 8 - block_r_new;
     /* Extreme case where last bitmap is the only valid block in itself. */
-    if ( !SB_AP_BITMAP(s)[bmap_nr_new - 1].free_count )
-        SB_AP_BITMAP(s)[bmap_nr_new - 1].first_zero_hint = 0;
+    if ( !info->free_count )
+        info->first_zero_hint = 0;
+
      /* update super */
     reiserfs_prepare_for_journal(s, SB_BUFFER_WITH_SB(s), 1) ;
     free_blocks = SB_FREE_BLOCKS(s);
diff -ruNpX dontdiff linux-2.6.12.1/fs/reiserfs/super.c
linux-2.6.12.1.devel/fs/reiserfs/super.c
--- linux-2.6.12.1/fs/reiserfs/super.c    2005-06-30 12:51:42.000000000
-0400
+++ linux-2.6.12.1.devel/fs/reiserfs/super.c    2005-06-30
16:15:26.000000000 -0400
@@ -417,7 +417,6 @@ int remove_save_link (struct inode * ino
 
 static void reiserfs_put_super (struct super_block * s)
 {
-  int i;
   struct reiserfs_transaction_handle th ;
   th.t_trans_id = 0;
 
@@ -445,9 +444,6 @@ static void reiserfs_put_super (struct s
   */
   journal_release(&th, s) ;
 
-  for (i = 0; i < SB_BMAP_NR (s); i ++)
-    brelse (SB_AP_BITMAP (s)[i].bh);
-
   vfree (SB_AP_BITMAP (s));
 
   brelse (SB_BUFFER_WITH_SB (s));
@@ -1188,6 +1184,20 @@ static int reiserfs_remount (struct supe
   return 0;
 }
 
+static struct reiserfs_bitmap_info *
+init_bitmap_cache (struct super_block *sb)
+{
+    struct reiserfs_bitmap_info *bitmap;
+
+    bitmap = vmalloc (sizeof (*bitmap) * SB_BMAP_NR (sb));
+    if (bitmap)
+        memset (bitmap, 0, sizeof (*bitmap) * SB_BMAP_NR (sb));
+
+    return bitmap;
+}
+
+
+#if 0
 /* load_bitmap_info_data - Sets up the reiserfs_bitmap_info structure
from disk.
  * @sb - superblock for this filesystem
  * @bi - the bitmap info to be loaded. Requires that bi->bh is valid.
@@ -1293,6 +1303,7 @@ static int read_old_bitmaps (struct supe
   return 0;
 }
 
+#endif
 static int read_super_block (struct super_block * s, int offset)
 {
     struct buffer_head * bh;
@@ -1389,7 +1400,6 @@ static int read_super_block (struct supe
 
 /* after journal replay, reread all bitmap and super blocks */
 static int reread_meta_blocks(struct super_block *s) {
-  int i ;
   ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s))) ;
   wait_on_buffer(SB_BUFFER_WITH_SB(s)) ;
   if (!buffer_uptodate(SB_BUFFER_WITH_SB(s))) {
@@ -1397,6 +1407,7 @@ static int reread_meta_blocks(struct sup
     return 1 ;
   }
 
+#if 0
   for (i = 0; i < SB_BMAP_NR(s) ; i++) {
     ll_rw_block(READ, 1, &(SB_AP_BITMAP(s)[i].bh)) ;
     wait_on_buffer(SB_AP_BITMAP(s)[i].bh) ;
@@ -1406,6 +1417,7 @@ static int reread_meta_blocks(struct sup
       return 1 ;
     }
   }
+#endif
   return 0 ;
 
 }
@@ -1639,8 +1651,9 @@ static int reiserfs_fill_super (struct s
     sbi->s_mount_state = SB_REISERFS_STATE(s);
     sbi->s_mount_state = REISERFS_VALID_FS ;
 
-    if (old_format ? read_old_bitmaps(s) : read_bitmaps(s)) {
-    SWARN(silent, s, "jmacd-8: reiserfs_fill_super: unable to read
bitmap");
+    SB_AP_BITMAP(s) = init_bitmap_cache (s);
+    if (!SB_AP_BITMAP(s)) {
+    SWARN(silent, s, "jdm-8: reiserfs_fill_super: unable to init bitmap
cache");
     goto error;
     }
 #ifdef CONFIG_REISERFS_CHECK
@@ -1799,12 +1812,7 @@ static int reiserfs_fill_super (struct s
     journal_release_error(NULL, s) ;
     }
     if (SB_DISK_SUPER_BLOCK (s)) {
-    for (j = 0; j < SB_BMAP_NR (s); j ++) {
-        if (SB_AP_BITMAP (s))
-        brelse (SB_AP_BITMAP (s)[j].bh);
-    }
-    if (SB_AP_BITMAP (s))
-        vfree (SB_AP_BITMAP (s));
+    vfree (SB_AP_BITMAP (s));
     }
     if (SB_BUFFER_WITH_SB (s))
     brelse(SB_BUFFER_WITH_SB (s));
diff -ruNpX dontdiff linux-2.6.12.1/include/linux/reiserfs_fs.h
linux-2.6.12.1.devel/include/linux/reiserfs_fs.h
--- linux-2.6.12.1/include/linux/reiserfs_fs.h    2005-06-30
12:51:48.000000000 -0400
+++ linux-2.6.12.1.devel/include/linux/reiserfs_fs.h    2005-06-30
16:13:59.000000000 -0400
@@ -2111,6 +2111,7 @@ void reiserfs_init_alloc_options (struct
  */
 __le32 reiserfs_choose_packing(struct inode *dir);
 
+struct buffer_head * read_bitmap_block (struct super_block *sb,
unsigned int bitmap);
 int is_reusable (struct super_block * s, b_blocknr_t block, int bit_value);
 void reiserfs_free_block (struct reiserfs_transaction_handle *th,
struct inode *, b_blocknr_t, int for_unformatted);
 int reiserfs_allocate_blocknrs(reiserfs_blocknr_hint_t *, b_blocknr_t *
, int, int);
diff -ruNpX dontdiff linux-2.6.12.1/include/linux/reiserfs_fs_sb.h
linux-2.6.12.1.devel/include/linux/reiserfs_fs_sb.h
--- linux-2.6.12.1/include/linux/reiserfs_fs_sb.h    2005-03-02
02:38:09.000000000 -0500
+++ linux-2.6.12.1.devel/include/linux/reiserfs_fs_sb.h    2005-06-30
15:43:09.000000000 -0400
@@ -269,7 +269,6 @@ struct reiserfs_bitmap_info
     // FIXME: Won't work with block sizes > 8K
     __u16  first_zero_hint;
     __u16  free_count;
-    struct buffer_head *bh; /* the actual bitmap */
 };
 
 struct proc_dir_entry;
@@ -419,6 +418,7 @@ struct reiserfs_sb_info
 /* Definitions of reiserfs on-disk properties: */
 #define REISERFS_3_5 0
 #define REISERFS_3_6 1
+#define REISERFS_OLD_FORMAT 2
 
 enum reiserfs_mount_options {
 /* Mount options */



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-06 22:45 ` Hans Reiser
@ 2005-07-07 16:04   ` Jeff Mahoney
  2005-07-07 17:59     ` studdugie
  2005-07-08  5:35   ` Alexander Zarochentsev
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff Mahoney @ 2005-07-07 16:04 UTC (permalink / raw)
  To: Hans Reiser; +Cc: ReiserFS List, Nate Diller

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hans Reiser wrote:
> Jeff, are you sure that you need this code to exist?  Here are the
> problems I see:
> 
>     for the average case, it is suboptimal.  The seeks to the bitmaps
> are far more expensive than the averaged cost of keeping them in ram.
> 
>     for 16TB filesystems, they will have plenty of budget for ram
> 
>     it complicates code if it has to worry about such things as not
> enough clean memory for holding bitmaps, etc.
> 
>     It is more appropriate to write this kind of code for the
> development branch which is V4.  This kind of code is likely to have
> hard to test and hit bugs.
> 
>     The mount time problem should be solved by querying the device
> geometry, and inserting into the queue requests for every disk drive in
> parallel.  The current code fails to keep all the spindles busy.  It
> would be nice if there was general purpose code for querying about how a
> device divides into spindles so that scheduling in general can be optimized.
> 
>     This should be a nondefault mount option.
> 
> That said, thanks for paying attention to a problem Namesys discussed
> but lacked the manpower for addressing.  Do you think you could discuss
> your plans before coding next time?  I agree that ReiserFS V3 and V4
> mount time is too long.  15 minutes is clearly not acceptable.  Perhaps
> there is a deeper IO scheduler problem beyond bitmaps that should be
> addressed though.....
> 

Hans -

There are two issues here: The amount of time required to read in the
bitmap blocks at mount time, and the resources that are wasted due to
maintaining unused bitmap data in memory. Your arguments are reasonable,
but the user response to each of them is the same: They will simply
choose another filesystem to deploy rather than deal with the caveats of
ReiserFS.

I agree that there may be opportunities to optimize the I/O scheduler,
but even if we ignored the blockdev<->filesystem layering violations,
and had perfect knowledge of the storage subsystem, there is still
latency associated with reading the data in. There may be any number of
abstractions between the block device presented to the filesystem and
the actual spindles (md, dm, loop, or hardware raid) and the block dev
subsystem is best equipped to handle that. The goal is not to make mount
times quicker than they are now, but to make them negligible. Suppose
for the sake of argument that somehow the I/O scheduler could be
leveraged to reduce the mount time by 90%. This is an incredibly
optimistic number and still it only reduces the 15 minute mount time to
90 seconds. That's 90 seconds *every* boot that the system becomes
unavailable. That 90 second addition adds up, and will be the difference
between a site deploying reiserfs and choosing another solution that
doesn't have that caveat.

That said, the resource savings benefit is largely secondary, but may be
quite important for many users including those deploying embedded
devices. We are not in the position to be making hardware purchasing
guidelines for our users. It's not reasonable to expect more than the
disk space required to store the filesystem itself. "Huge" filesystems
that were once reserved for large servers can now be found on the
desktop. For a few hundred dollars in hardware, I can construct a
multi-terabyte array under my desk. A typical usage for something like
this would be to store music, movies, or say an A/V editing suite. On a
system with 512MB of RAM, the 32 MB allocation for ONLY bitmaps is a
huge resource hit. On embedded systems that are tight on RAM, where they
are using alternate C libraries to shave off a few KiB of memory use,
pinning bitmaps is a total waste of resources. Telling the user "go buy
more memory" is not an acceptable solution. Again, this will only mean
another user chooses a different solution than reiserfs.

ReiserFS v3 has an established track record as a stable filesystem. V4
may be an excellent successor, but many users simply aren't interested.
They want particular features now and aren't willing to be guinea pigs
for V4 in order to get them. We've seen this time and again with feature
additions. Denying user demands with the mantra of "wait for it in V4"
has left many users frustrated, and they will once again choose
something else rather than deal without features they can have on other
filesystems.

The performance difference, I suspect, will be negligible. If the
bitmaps are really in heavy use (which is only the case for a limited
set of workloads) then the buffer cache will keep those around anyway.
If the memory is needed elsewhere, the system has the "big picture" view
and should be able to make those decisions. Having to swap out user code
or data vs. keeping ReiserFS bitmaps in memory is going to have a
performance impact either way, and I suspect the former will be the
worse case. Regarding the unavailability of memory for bitmaps, we must
already sleep in order to get the buffer heads for parts of the tree
that aren't pinned in memory. This case isn't any different. We also
already sleep waiting for bitmap blocks to become unlocked.

As for it being the default case or not, I've only posted this code for
testing purposes. Eventually, I think it should be the default case.
We've seen what happens when useful features get buried under a mount
time option (-oattrs, anyone?) - they get ignored. I think that once
this code has seen active testing, -opin_bitmaps should become an option
and reading them on-demand should become the default.

- -Jeff

- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)

iD8DBQFCzVKKLPWxlyuTD7IRAm8AAJ9i8D5VTak/puOg0yLuUtmKxvWcZQCePGZu
/UR00EcaRwM2t3qZ0D9vuF4=
=7/vu
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-07 16:04   ` Jeff Mahoney
@ 2005-07-07 17:59     ` studdugie
  2005-07-07 22:37       ` Pierre Etchemaïté
  0 siblings, 1 reply; 13+ messages in thread
From: studdugie @ 2005-07-07 17:59 UTC (permalink / raw)
  To: ReiserFS List

I agree w/ Jeff 100%. I'm not a kernel hacker, simply a user. As a
matter of fact, I was one of those people that Jeff aluded to when he
said: "There have been reports of large filesystems taking an
unacceptably long time to mount."
On 7/7/05, Jeff Mahoney <jeffm@suse.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hans Reiser wrote:
> > Jeff, are you sure that you need this code to exist?  Here are the
> > problems I see:
> >
> >     for the average case, it is suboptimal.  The seeks to the bitmaps
> > are far more expensive than the averaged cost of keeping them in ram.
> >
> >     for 16TB filesystems, they will have plenty of budget for ram
> >
> >     it complicates code if it has to worry about such things as not
> > enough clean memory for holding bitmaps, etc.
> >
> >     It is more appropriate to write this kind of code for the
> > development branch which is V4.  This kind of code is likely to have
> > hard to test and hit bugs.
> >
> >     The mount time problem should be solved by querying the device
> > geometry, and inserting into the queue requests for every disk drive in
> > parallel.  The current code fails to keep all the spindles busy.  It
> > would be nice if there was general purpose code for querying about how a
> > device divides into spindles so that scheduling in general can be optimized.
> >
> >     This should be a nondefault mount option.
> >
> > That said, thanks for paying attention to a problem Namesys discussed
> > but lacked the manpower for addressing.  Do you think you could discuss
> > your plans before coding next time?  I agree that ReiserFS V3 and V4
> > mount time is too long.  15 minutes is clearly not acceptable.  Perhaps
> > there is a deeper IO scheduler problem beyond bitmaps that should be
> > addressed though.....
> >
> 
> Hans -
> 
> There are two issues here: The amount of time required to read in the
> bitmap blocks at mount time, and the resources that are wasted due to
> maintaining unused bitmap data in memory. Your arguments are reasonable,
> but the user response to each of them is the same: They will simply
> choose another filesystem to deploy rather than deal with the caveats of
> ReiserFS.
> 
> I agree that there may be opportunities to optimize the I/O scheduler,
> but even if we ignored the blockdev<->filesystem layering violations,
> and had perfect knowledge of the storage subsystem, there is still
> latency associated with reading the data in. There may be any number of
> abstractions between the block device presented to the filesystem and
> the actual spindles (md, dm, loop, or hardware raid) and the block dev
> subsystem is best equipped to handle that. The goal is not to make mount
> times quicker than they are now, but to make them negligible. Suppose
> for the sake of argument that somehow the I/O scheduler could be
> leveraged to reduce the mount time by 90%. This is an incredibly
> optimistic number and still it only reduces the 15 minute mount time to
> 90 seconds. That's 90 seconds *every* boot that the system becomes
> unavailable. That 90 second addition adds up, and will be the difference
> between a site deploying reiserfs and choosing another solution that
> doesn't have that caveat.
> 
> That said, the resource savings benefit is largely secondary, but may be
> quite important for many users including those deploying embedded
> devices. We are not in the position to be making hardware purchasing
> guidelines for our users. It's not reasonable to expect more than the
> disk space required to store the filesystem itself. "Huge" filesystems
> that were once reserved for large servers can now be found on the
> desktop. For a few hundred dollars in hardware, I can construct a
> multi-terabyte array under my desk. A typical usage for something like
> this would be to store music, movies, or say an A/V editing suite. On a
> system with 512MB of RAM, the 32 MB allocation for ONLY bitmaps is a
> huge resource hit. On embedded systems that are tight on RAM, where they
> are using alternate C libraries to shave off a few KiB of memory use,
> pinning bitmaps is a total waste of resources. Telling the user "go buy
> more memory" is not an acceptable solution. Again, this will only mean
> another user chooses a different solution than reiserfs.
> 
> ReiserFS v3 has an established track record as a stable filesystem. V4
> may be an excellent successor, but many users simply aren't interested.
> They want particular features now and aren't willing to be guinea pigs
> for V4 in order to get them. We've seen this time and again with feature
> additions. Denying user demands with the mantra of "wait for it in V4"
> has left many users frustrated, and they will once again choose
> something else rather than deal without features they can have on other
> filesystems.
> 
> The performance difference, I suspect, will be negligible. If the
> bitmaps are really in heavy use (which is only the case for a limited
> set of workloads) then the buffer cache will keep those around anyway.
> If the memory is needed elsewhere, the system has the "big picture" view
> and should be able to make those decisions. Having to swap out user code
> or data vs. keeping ReiserFS bitmaps in memory is going to have a
> performance impact either way, and I suspect the former will be the
> worse case. Regarding the unavailability of memory for bitmaps, we must
> already sleep in order to get the buffer heads for parts of the tree
> that aren't pinned in memory. This case isn't any different. We also
> already sleep waiting for bitmap blocks to become unlocked.
> 
> As for it being the default case or not, I've only posted this code for
> testing purposes. Eventually, I think it should be the default case.
> We've seen what happens when useful features get buried under a mount
> time option (-oattrs, anyone?) - they get ignored. I think that once
> this code has seen active testing, -opin_bitmaps should become an option
> and reading them on-demand should become the default.
> 
> - -Jeff
> 
> - --
> Jeff Mahoney
> SuSE Labs
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.0 (GNU/Linux)
> 
> iD8DBQFCzVKKLPWxlyuTD7IRAm8AAJ9i8D5VTak/puOg0yLuUtmKxvWcZQCePGZu
> /UR00EcaRwM2t3qZ0D9vuF4=
> =7/vu
> -----END PGP SIGNATURE-----
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-07 17:59     ` studdugie
@ 2005-07-07 22:37       ` Pierre Etchemaïté
  2005-07-07 23:15         ` David Masover
  0 siblings, 1 reply; 13+ messages in thread
From: Pierre Etchemaïté @ 2005-07-07 22:37 UTC (permalink / raw)
  To: reiserfs-list

Le Thu, 7 Jul 2005 13:59:35 -0400, studdugie <studdugie@gmail.com> a écrit :

> I agree w/ Jeff 100%. I'm not a kernel hacker, simply a user. As a
> matter of fact, I was one of those people that Jeff aluded to when he
> said: "There have been reports of large filesystems taking an
> unacceptably long time to mount."

That also makes reiserfs uncomfortable with automount devices, specially
if they're bandwidth limited like external USB or firewire disks...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-07 22:37       ` Pierre Etchemaïté
@ 2005-07-07 23:15         ` David Masover
  2005-07-08  4:52           ` Jeffrey Mahoney
  2005-07-08 21:41           ` Jeff Mahoney
  0 siblings, 2 replies; 13+ messages in thread
From: David Masover @ 2005-07-07 23:15 UTC (permalink / raw)
  To: Pierre Etchemaïté; +Cc: reiserfs-list

Pierre Etchemaïté wrote:
> Le Thu, 7 Jul 2005 13:59:35 -0400, studdugie <studdugie@gmail.com> a écrit :
> 
> 
>>I agree w/ Jeff 100%. I'm not a kernel hacker, simply a user. As a
>>matter of fact, I was one of those people that Jeff aluded to when he
>>said: "There have been reports of large filesystems taking an
>>unacceptably long time to mount."
> 
> 
> That also makes reiserfs uncomfortable with automount devices, specially
> if they're bandwidth limited like external USB or firewire disks...

USB and firewire disks already take a little long to mount anyway.

But, it is definitely a performance enhancement, or at least a tweak. 
I'd like to see it happen -- it takes 10-15 seconds to mount my 200 gig 
Reiser4 partition, which is unacceptabe for a desktop machine -- at 
least, for a *linux* desktop machine.

To keep Hans happy about the "default case", can we load the bitmap in 
the background during boot/mount?  Basically, if it's loaded on demand, 
then we pretend to demand each part of it, one by one.  Would that 
considerably slow normal FS operation?  Could we defer it to when the 
disk is idle?  (*disk*, not FS)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-07 23:15         ` David Masover
@ 2005-07-08  4:52           ` Jeffrey Mahoney
  2005-07-08  5:52             ` Stefan Traby
  2005-07-08 16:55             ` David Masover
  2005-07-08 21:41           ` Jeff Mahoney
  1 sibling, 2 replies; 13+ messages in thread
From: Jeffrey Mahoney @ 2005-07-08  4:52 UTC (permalink / raw)
  To: David Masover; +Cc: Pierre Etchemaïté, reiserfs-list

David Masover wrote:
> Pierre Etchemaïté wrote:
> 
>> Le Thu, 7 Jul 2005 13:59:35 -0400, studdugie <studdugie@gmail.com> a
>> écrit :
>>
>>
>>> I agree w/ Jeff 100%. I'm not a kernel hacker, simply a user. As a
>>> matter of fact, I was one of those people that Jeff aluded to when he
>>> said: "There have been reports of large filesystems taking an
>>> unacceptably long time to mount."
>>
>>
>>
>> That also makes reiserfs uncomfortable with automount devices, specially
>> if they're bandwidth limited like external USB or firewire disks...
> 
> 
> USB and firewire disks already take a little long to mount anyway.
> 
> But, it is definitely a performance enhancement, or at least a tweak.
> I'd like to see it happen -- it takes 10-15 seconds to mount my 200 gig
> Reiser4 partition, which is unacceptabe for a desktop machine -- at
> least, for a *linux* desktop machine.
> 
> To keep Hans happy about the "default case", can we load the bitmap in
> the background during boot/mount?  Basically, if it's loaded on demand,
> then we pretend to demand each part of it, one by one.  Would that
> considerably slow normal FS operation?  Could we defer it to when the
> disk is idle?  (*disk*, not FS)

Hi David -

The main issue I have with this is that I don't think bitmaps should be
treated specially. They are metadata, pure and simple. We don't treat
the s-tree specially with respect to caching and it is used regardless
of whether the operations performed on the filesystem are for reads or
writes.

The root block/node sees far more activity than any bitmap and it is not
treated any differently than any other bit of metadata. It's simply
requested when it is needed. If the vm has determined that the root
block is frequently used, it stays in memory. Why should the bitmaps be
any different?

It's possible to read the bitmaps in a "delayed" fashion, but the
problem of completely wasted resources is still not addressed. I feel
the correct solution is to let the buffer cache do its job and not
assume that any particular filesystem takes priority over other resources.

-Jeff

-- 
Jeff Mahoney
SuSE Labs
jeffm@suse.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-06 22:45 ` Hans Reiser
  2005-07-07 16:04   ` Jeff Mahoney
@ 2005-07-08  5:35   ` Alexander Zarochentsev
  2005-07-08  6:04     ` Stefan Traby
  1 sibling, 1 reply; 13+ messages in thread
From: Alexander Zarochentsev @ 2005-07-08  5:35 UTC (permalink / raw)
  To: Hans Reiser, Elena Gryaznova; +Cc: reiserfs-list, Jeff Mahoney

On Thursday 07 July 2005 02:45, Hans Reiser wrote:
> Jeff, are you sure that you need this code to exist?  Here are the
> problems I see:
>
>     for the average case, it is suboptimal.  The seeks to the bitmaps
> are far more expensive than the averaged cost of keeping them in ram.

It would be good to see the benchmark results which your opinion is based on.

I know that on-demand bitmap loading in reiser4 was under suspicion 
several times but I don't remember any benchmark results which showed a
slowdown regarding that.  

>
>     for 16TB filesystems, they will have plenty of budget for ram
>
>     it complicates code if it has to worry about such things as not
> enough clean memory for holding bitmaps, etc.
>
>     It is more appropriate to write this kind of code for the
> development branch which is V4.  This kind of code is likely to have
> hard to test and hit bugs.
>

Lena, may you check how the bitmap on-demand loading patch affects, say,
mongo results for reiserfs?

Thanks, 
Alex.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-08  4:52           ` Jeffrey Mahoney
@ 2005-07-08  5:52             ` Stefan Traby
  2005-07-08 16:55             ` David Masover
  1 sibling, 0 replies; 13+ messages in thread
From: Stefan Traby @ 2005-07-08  5:52 UTC (permalink / raw)
  To: Jeffrey Mahoney; +Cc: David Masover, Pierre Etchemaïté, reiserfs-list

On Fri, Jul 08, 2005 at 12:52:52AM -0400, Jeffrey Mahoney wrote:

> It's possible to read the bitmaps in a "delayed" fashion, but the
> problem of completely wasted resources is still not addressed. I feel
> the correct solution is to let the buffer cache do its job and not
> assume that any particular filesystem takes priority over other resources.
 
I feel 100% with you.
Getting rid of super.c:read_bitmaps is nothing but a bug-fix.

-- 

  ciao - 
    Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-08  5:35   ` Alexander Zarochentsev
@ 2005-07-08  6:04     ` Stefan Traby
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Traby @ 2005-07-08  6:04 UTC (permalink / raw)
  To: Alexander Zarochentsev
  Cc: Hans Reiser, Elena Gryaznova, reiserfs-list, Jeff Mahoney

On Fri, Jul 08, 2005 at 09:35:28AM +0400, Alexander Zarochentsev wrote:

> Lena, may you check how the bitmap on-demand loading patch affects, say,
> mongo results for reiserfs?

Alex, it seems that to share Jeff's view - for me it's a question of
correctness.
You can't test correctness with benchmarks - well except your
name is Yury -> http://oesiman.de/last-words-1.html
:)

-- 

  ciao - 
    Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-08  4:52           ` Jeffrey Mahoney
  2005-07-08  5:52             ` Stefan Traby
@ 2005-07-08 16:55             ` David Masover
  1 sibling, 0 replies; 13+ messages in thread
From: David Masover @ 2005-07-08 16:55 UTC (permalink / raw)
  To: Jeffrey Mahoney; +Cc: Pierre Etchemaïté, reiserfs-list

Jeffrey Mahoney wrote:
> David Masover wrote:
> 
>>Pierre Etchemaïté wrote:
>>
>>
>>>Le Thu, 7 Jul 2005 13:59:35 -0400, studdugie <studdugie@gmail.com> a
>>>écrit :
>>>
>>>
>>>
>>>>I agree w/ Jeff 100%. I'm not a kernel hacker, simply a user. As a
>>>>matter of fact, I was one of those people that Jeff aluded to when he
>>>>said: "There have been reports of large filesystems taking an
>>>>unacceptably long time to mount."
>>>
>>>
>>>
>>>That also makes reiserfs uncomfortable with automount devices, specially
>>>if they're bandwidth limited like external USB or firewire disks...
>>
>>
>>USB and firewire disks already take a little long to mount anyway.
>>
>>But, it is definitely a performance enhancement, or at least a tweak.
>>I'd like to see it happen -- it takes 10-15 seconds to mount my 200 gig
>>Reiser4 partition, which is unacceptabe for a desktop machine -- at
>>least, for a *linux* desktop machine.
>>
>>To keep Hans happy about the "default case", can we load the bitmap in
>>the background during boot/mount?  Basically, if it's loaded on demand,
>>then we pretend to demand each part of it, one by one.  Would that
>>considerably slow normal FS operation?  Could we defer it to when the
>>disk is idle?  (*disk*, not FS)
> 
> 
> Hi David -
> 
> The main issue I have with this is that I don't think bitmaps should be
> treated specially. They are metadata, pure and simple.

This is the philosophical reason why I like on-demand loading.  There's 
also the immediate speedup of boot.

But, I think Hans has a point that it may be better for performance to 
pre-cache them.  I would rather the default behavior be to load them on 
demand, but I can see situations where people would choose to pre-cache 
them, or even (as we do now) force them to stay in kernel memory as long 
as the FS is mounted.  But that's not a sane default.

Big hard drives on desktop machines are getting more and more common, 
and even the lowly Linux gamer doesn't want to waste the RAM he bought 
for Doom 3 on a 200 gig filesystem when he's only using 1.5 gigs of it 
at the moment.

I would guess that this is a lot more common of a scenario than massive 
2TB arrays where people can throw money (RAM) at the system to make it 
faster, in any way they can.  But even if it's not, people with 2TB 
arrays are much more likely to discover the precaching feature and turn 
it on than gamers / desktop users woud be to discover it and turn it off.

And besides, for the average desktop machine, it's latency that matters 
more than throughput.  The most noticeable latency that we can optimize 
for is boot time, the next most noticeable is launching a new app / 
changing apps.  For the average desktop user, it doesn't matter if it 
takes an extra half second to load a chunk of the bitmap in order to 
load apps, and it certainly doesn't matter if it takes an extra tenth of 
a second to load a file, but it does matter if RAM wasteage pushes an 
app into swap and it takes 5-10 seconds (at least) to switch apps, and 
it does matter if it takes 5-10-20 seconds longer to boot.

That is why I think on-demand should be default.

> The root block/node sees far more activity than any bitmap and it is not
> treated any differently than any other bit of metadata. It's simply
> requested when it is needed. If the vm has determined that the root
> block is frequently used, it stays in memory. Why should the bitmaps be
> any different?

Because bitmaps are harder to seek to?  I think that's the argument -- 
the bitmap is going to be pushed out of memory because it's used less 
frequently, but it'll take much longer to load than anything else 
because it's spread over the disk.

But, you could make a similar argument about most files.

I second the request for benchmarks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-07 23:15         ` David Masover
  2005-07-08  4:52           ` Jeffrey Mahoney
@ 2005-07-08 21:41           ` Jeff Mahoney
  2005-07-09  7:46             ` Hans Reiser
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff Mahoney @ 2005-07-08 21:41 UTC (permalink / raw)
  To: David Masover; +Cc: Pierre Etchemaïté, reiserfs-list

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

David Masover wrote:
> Pierre Etchemaïté wrote:
>> Le Thu, 7 Jul 2005 13:59:35 -0400, studdugie <studdugie@gmail.com> a
>> écrit :
>>
>>
>>> I agree w/ Jeff 100%. I'm not a kernel hacker, simply a user. As a
>>> matter of fact, I was one of those people that Jeff aluded to when he
>>> said: "There have been reports of large filesystems taking an
>>> unacceptably long time to mount."
>>
>>
>> That also makes reiserfs uncomfortable with automount devices, specially
>> if they're bandwidth limited like external USB or firewire disks...
> 
> USB and firewire disks already take a little long to mount anyway.
> 
> But, it is definitely a performance enhancement, or at least a tweak.
> I'd like to see it happen -- it takes 10-15 seconds to mount my 200 gig
> Reiser4 partition, which is unacceptabe for a desktop machine -- at
> least, for a *linux* desktop machine.
> 
> To keep Hans happy about the "default case", can we load the bitmap in
> the background during boot/mount?  Basically, if it's loaded on demand,
> then we pretend to demand each part of it, one by one.  Would that
> considerably slow normal FS operation?  Could we defer it to when the
> disk is idle?  (*disk*, not FS)

I believe there are two possible methods of delayed loading. The first
is to issue all the bitmap read requests on mount and then when we need
that particular bitmap later we can wait on it. The second is to issue
the read request the first time it's needed, and don't let go of it. For
the sake of exploring other options, I decided to implment the first
one. The results were surprising.

The disk I've been testing on lately is a 40 GB ATA/100 disk mounted in
a USB2 enclosure. I tested with a range of block sizes so that the
number of bitmaps would increase without needing a larger disk. I
realize the results won't be identical to filesystems that large, but
it's the best I can do with my storage constraints. Realistically, the
times will be even longer on the larger filesystems.

The results showed that delayed allocation was only slightly faster than
waiting on the buffers. The reason for this is that the majority of the
time is actually spent issuing the block read requests, not actually
waiting for their results. The amount of time waiting on the blocks
appears not to change radically, though the amount of time issuing the
read requests does.

Here are the actual numbers from the test runs. Between each mount
attempt, I attempted to clear the system caches by allocating and
writing to all the memory on the system, as well as the disk caches by
reading 50 MB from disk. I performed the tests with four block sizes in
order to increase the number of bitmap blocks that need to be loaded at
mount time. Note that each decrease in block size increases the number
of bitmaps fourfold. This is because when the block size is halved, it
not only doubles the number of blocks, but also halves the capacity of
each bitmap block.

4k block size:                          2k block size:
10036464 blocks,                        20072928 blocks,
307 bitmaps (~= 39 GB)                  1226 bitmaps  (~= 153 GB @ 4k)
- -opin_bitmaps                           -opin_bitmaps
sb_getblk loop: 0.0s                    sb_getblk loop: 0.3999643s
ll_rw_block: 1.435871744s               ll_rw_block: 8.143272619s
wait_on_buffer: 0.513519144s            wait_on_buffer: 1.990925198s
real    0m4.551s                        real    0m10.906s
user    0m0.000s                        user    0m0.000s
sys     0m0.060s                        sys     0m0.028s

- -opin_bitmaps,delayed_bitmaps           -opin_bitmaps,delayed_bitmaps
sb_getblk loop: 0.0s                    sb_getblk loop: 0.3999643s
ll_rw_block: 1.443871029s               ll_rw_block: 8.944447839s
real    0m2.128s                        real    0m8.630s
user    0m0.000s                        user    0m0.000s
sys     0m0.016s                        sys     0m0.020s

- -odyn_bitmaps                           -odyn_bitmaps
real    0m0.626s                        real    0m0.850s
user    0m0.000s                        user    0m0.000s
sys     0m0.008s                        sys     0m0.016s

1k block size:                          512b block size:
40145856 blocks,                        80291712 blocks,
4901 bitmaps (~= 612 GB @ 4k)           19603 bitmaps (~= 2.4 TB@4k)
- -opin_bitmaps                           -opin_bitmaps
sb_getblk loop: 0.19998214s             sb_getblk loop: 0.95991426s
ll_rw_block: 33.727900516s              ll_rw_block: 110.98165711s
wait_on_buffer: 1.423872816s            wait_on_buffer: 0.749324905s
real    0m36.052s                       real    1m51.423s
user    0m0.000s                        user    0m0.000s
sys     0m0.124s                        sys     0m0.256s

- -opin_bitmaps,delayed_bitmaps           -opin_bitmaps,delayed_bitmaps
sb_getblk loop: 0.23997856s             sb_getblk loop: 0.95991426s
ll_rw_block: 33.644994731s              ll_rw_block: 109.427893721s
real    0m34.562s                       real    1m50.693s
user    0m0.004s                        user    0m0.004s
sys     0m0.060s                        sys     0m0.232s

- -odyn_bitmaps                           -odyn_bitmaps
real    0m0.516s                        real    0m0.601s
user    0m0.000s                        user    0m0.000s
sys     0m0.004s                        sys     0m0.000s

I will post runtime results of each case early next week.

- -Jeff

- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)

iD8DBQFCzvMILPWxlyuTD7IRApRLAJ0YDdH24YEef9XSE7JfzF7hp1jCAACgjWTX
+chdg0ihAJmup6GjQtKBc/o=
=WUUP
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] reiserfs: on-demand bitmap loading (testing only)
  2005-07-08 21:41           ` Jeff Mahoney
@ 2005-07-09  7:46             ` Hans Reiser
  0 siblings, 0 replies; 13+ messages in thread
From: Hans Reiser @ 2005-07-09  7:46 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: David Masover, Pierre Etchemaïté, reiserfs-list

Jeff Mahoney wrote:

>
>
> The results showed that delayed allocation was only slightly faster than
> waiting on the buffers. The reason for this is that the majority of the
> time is actually spent issuing the block read requests, not actually
> waiting for their results. 

Can you define "spent issuing"?  Perhaps this is simply a bad choice of
block device congestion configuration, and changing it would fix
things.  Because device congestion is based on NUMBER of requests, not
their size, bitmap reading would congest things more than file IO.

Block device congestion limits I suspect to be in need of serious
review, not just because of bitmaps.....

> The amount of time waiting on the blocks
> appears not to change radically, though the amount of time issuing the
> read requests does.
>
> Here are the actual numbers from the test runs. Between each mount
> attempt, I attempted to clear the system caches by allocating and
> writing to all the memory on the system, as well as the disk caches by
> reading 50 MB from disk. I performed the tests with four block sizes in
> order to increase the number of bitmap blocks that need to be loaded at
> mount time. Note that each decrease in block size increases the number
> of bitmaps fourfold. This is because when the block size is halved, it
> not only doubles the number of blocks, but also halves the capacity of
> each bitmap block.
>
> 4k block size:                          2k block size:
> 10036464 blocks,                        20072928 blocks,
> 307 bitmaps (~= 39 GB)                  1226 bitmaps  (~= 153 GB @ 4k)
> -opin_bitmaps                           -opin_bitmaps
> sb_getblk loop: 0.0s                    sb_getblk loop: 0.3999643s
> ll_rw_block: 1.435871744s               ll_rw_block: 8.143272619s
> wait_on_buffer: 0.513519144s            wait_on_buffer: 1.990925198s
> real    0m4.551s                        real    0m10.906s
> user    0m0.000s                        user    0m0.000s
> sys     0m0.060s                        sys     0m0.028s
>
> -opin_bitmaps,delayed_bitmaps           -opin_bitmaps,delayed_bitmaps
> sb_getblk loop: 0.0s                    sb_getblk loop: 0.3999643s
> ll_rw_block: 1.443871029s               ll_rw_block: 8.944447839s
> real    0m2.128s                        real    0m8.630s
> user    0m0.000s                        user    0m0.000s
> sys     0m0.016s                        sys     0m0.020s
>
> -odyn_bitmaps                           -odyn_bitmaps
> real    0m0.626s                        real    0m0.850s
> user    0m0.000s                        user    0m0.000s
> sys     0m0.008s                        sys     0m0.016s
>
> 1k block size:                          512b block size:
> 40145856 blocks,                        80291712 blocks,
> 4901 bitmaps (~= 612 GB @ 4k)           19603 bitmaps (~= 2.4 TB@4k)
> -opin_bitmaps                           -opin_bitmaps
> sb_getblk loop: 0.19998214s             sb_getblk loop: 0.95991426s
> ll_rw_block: 33.727900516s              ll_rw_block: 110.98165711s
> wait_on_buffer: 1.423872816s            wait_on_buffer: 0.749324905s
> real    0m36.052s                       real    1m51.423s
> user    0m0.000s                        user    0m0.000s
> sys     0m0.124s                        sys     0m0.256s
>
> -opin_bitmaps,delayed_bitmaps           -opin_bitmaps,delayed_bitmaps
> sb_getblk loop: 0.23997856s             sb_getblk loop: 0.95991426s
> ll_rw_block: 33.644994731s              ll_rw_block: 109.427893721s
> real    0m34.562s                       real    1m50.693s
> user    0m0.004s                        user    0m0.004s
> sys     0m0.060s                        sys     0m0.232s
>
> -odyn_bitmaps                           -odyn_bitmaps
> real    0m0.516s                        real    0m0.601s
> user    0m0.000s                        user    0m0.000s
> sys     0m0.004s                        sys     0m0.000s
>
> I will post runtime results of each case early next week.
>
> -Jeff
>
> --
> Jeff Mahoney
> SuSE Labs


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2005-07-09  7:46 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-07-06 21:22 [PATCH] reiserfs: on-demand bitmap loading (testing only) Jeff Mahoney
2005-07-06 22:45 ` Hans Reiser
2005-07-07 16:04   ` Jeff Mahoney
2005-07-07 17:59     ` studdugie
2005-07-07 22:37       ` Pierre Etchemaïté
2005-07-07 23:15         ` David Masover
2005-07-08  4:52           ` Jeffrey Mahoney
2005-07-08  5:52             ` Stefan Traby
2005-07-08 16:55             ` David Masover
2005-07-08 21:41           ` Jeff Mahoney
2005-07-09  7:46             ` Hans Reiser
2005-07-08  5:35   ` Alexander Zarochentsev
2005-07-08  6:04     ` Stefan Traby

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.