linux-ext4.vger.kernel.org archive mirror
* [RFC] delayed allocation, mballoc, etc
@ 2006-12-01  0:15 Alex Tomas
  2006-12-07 17:18 ` Valerie Clement
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Tomas @ 2006-12-01  0:15 UTC (permalink / raw)
  To: linux-ext4


Good day,

I'd like to ask the community to discuss and review a few things
I've been working on. we propose a set of patches intended
to improve the performance of ext4:

 * locality groups

   to get good performance when writing many small files
   we need to allocate them close to each other. the
   simplest way would be to place each small file in the
   next block after the previous small file. this would
   work well for a single job. with multiple jobs (a few
   concurrent untars, for example) it would break per-job
   locality and cause a performance penalty on subsequent
   access. locality groups may help here: let's group all
   files by some property, pgid for example. now, every
   time the kernel asks the filesystem to flush dirty pages,
   we flush inodes from the 1st group, then from the 2nd,
   and so on. this way we can form large contiguous
   allocations (one for a whole group), achieving good
   throughput while preserving quite good locality.
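
A minimal user-space sketch of the flush order described above, assuming groups are keyed by pgid. The names `struct dirty_inode` and `flush_order` are hypothetical and do not come from the patches; the real code works on kernel inodes and dirty pages, not on a flat array.

```c
#include <stddef.h>

/* Hypothetical stand-in for a dirty inode; only what the sketch needs. */
struct dirty_inode {
    unsigned long ino;  /* inode number */
    int pgid;           /* process group that dirtied it (grouping property) */
};

/*
 * Fill out[] with the entries of in[] reordered so that all inodes of one
 * process group are adjacent. Groups are emitted in order of first
 * appearance and the order inside a group is preserved.
 * O(n^2), which is fine for an illustration.
 */
void flush_order(const struct dirty_inode *in, size_t n,
                 struct dirty_inode *out)
{
    size_t filled = 0;

    for (size_t i = 0; i < n; i++) {
        int seen = 0;

        /* Skip pgids that an earlier iteration already emitted. */
        for (size_t k = 0; k < i; k++) {
            if (in[k].pgid == in[i].pgid) {
                seen = 1;
                break;
            }
        }
        if (seen)
            continue;

        /* Emit the whole group in one run. */
        for (size_t j = i; j < n; j++) {
            if (in[j].pgid == in[i].pgid)
                out[filled++] = in[j];
        }
    }
}
```

With two interleaved jobs (pgid 100 and pgid 200), the inodes of job 100 are flushed first as one run, then those of job 200, so each job can receive one contiguous allocation.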

 * scalable block reservation

   this is required to protect against -ENOSPC when pages enter
   the pagecache without space being allocated (delayed allocation).
   it should also scale well on high-end SMP, as every cpu has
   its own "pool" of blocks. when a pool is empty, the filesystem
   rebalances free blocks between all cpus
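
A single-threaded sketch of the per-cpu pool idea, under stated assumptions: `NR_POOLS`, `struct reserve_pools` and `reserve_blocks` are hypothetical names, and the even-split rebalance policy is an assumption, since the mail only says that blocks are rebalanced. A kernel version would need per-cpu data and locking, which are left out here.

```c
#include <errno.h>

#define NR_POOLS 4  /* hypothetical: one pool per cpu */

struct reserve_pools {
    unsigned long free[NR_POOLS];  /* blocks still available per pool */
};

/*
 * Reserve n blocks on behalf of `cpu`.
 * Fast path: take from the local pool only.
 * Slow path: gather what is left in all pools, satisfy the request from
 * the total, and spread the remainder evenly (leftover goes to `cpu`).
 * Returns 0 on success, -ENOSPC if all pools together cannot cover n.
 */
int reserve_blocks(struct reserve_pools *p, int cpu, unsigned long n)
{
    unsigned long total = 0;

    if (p->free[cpu] >= n) {
        p->free[cpu] -= n;
        return 0;
    }

    for (int i = 0; i < NR_POOLS; i++)
        total += p->free[i];
    if (total < n)
        return -ENOSPC;

    total -= n;
    for (int i = 0; i < NR_POOLS; i++)
        p->free[i] = total / NR_POOLS;
    p->free[cpu] += total % NR_POOLS;
    return 0;
}
```

The point of the design is that the common case touches only the local counter; the shared work happens only when a pool runs dry.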

 * mballoc v4

   multiblock allocator. it's supposed to be able to allocate
   many blocks at once, saving cpu.

   changes since the previously published v2:

     a) per-inode preallocation

        every regular inode may have a few preallocated chunks
        assigned to specific logical offsets. it's intended to
        help applications like IOR and p2p

     b) per-locality-group preallocation

        a locality group may have a few preallocated chunks

     c) buddy structures aren't stored on disk; instead
        they are regenerated from the on-disk bitmaps on demand

     d) has a stride option to align requests (useful for arrays)
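
Item c) can be illustrated with a small sketch that rebuilds per-order free-chunk counts from a block bitmap. This is not the mballoc code: the 32-block group, `MAX_ORDER`, and `buddy_from_bitmap` are hypothetical, chosen only to show that the buddy information is fully derivable from the bitmap and therefore does not need to live on disk.

```c
#include <stdint.h>

#define MAX_ORDER 5  /* hypothetical tiny group: 2^5 = 32 blocks */

/* A set bit means the block is in use, as in a block bitmap. */
static int range_is_free(uint32_t bitmap, unsigned start, unsigned order)
{
    uint32_t len = 1u << order;
    /* Avoid shifting by 32, which is undefined for a 32-bit type. */
    uint32_t mask = (len == 32) ? 0xffffffffu
                                : (((1u << len) - 1u) << start);

    return (bitmap & mask) == 0;
}

/* Count the largest aligned free chunks inside [start, start + 2^order). */
static void scan(uint32_t bitmap, unsigned start, unsigned order,
                 unsigned counts[MAX_ORDER + 1])
{
    if (range_is_free(bitmap, start, order)) {
        counts[order]++;
        return;
    }
    if (order == 0)
        return;  /* a single used block */

    /* Partly used: split into the two buddies of the next lower order. */
    scan(bitmap, start, order - 1, counts);
    scan(bitmap, start + (1u << (order - 1)), order - 1, counts);
}

/* counts[k] receives the number of free chunks of 2^k blocks. */
void buddy_from_bitmap(uint32_t bitmap, unsigned counts[MAX_ORDER + 1])
{
    for (unsigned i = 0; i <= MAX_ORDER; i++)
        counts[i] = 0;
    scan(bitmap, 0, MAX_ORDER, counts);
}
```

For a group where only block 0 is in use, this yields one free chunk at each of the orders 0 through 4 and none at order 5.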

 * delayed allocation

   not many changes have been made since the previous
   publication: a few bugfixes and tweaks, adapted to the new mballoc

as usual, there are tons of things yet to be done/fixed/tweaked.
I'm trying to keep them up to date in TODOs.

a few tests have been done. I'm sending the numbers (as well as
the patches) in subsequent mails. please have a look.

the whole series can be found at
    ftp://ftp.clusterfs.com/pub/people/alex/2.6.19-rc6/

to enable the features, ext4 should be mounted with the options:
   extents,mballoc,delalloc

any comments and questions are very welcome.

thanks, Alex

PS. I'd like to thank CFS for their help, especially
    Peter Braam and Andreas Dilger, who feed me ideas.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] delayed allocation, mballoc, etc
  2006-12-01  0:15 Alex Tomas
@ 2006-12-07 17:18 ` Valerie Clement
  2006-12-07 17:26   ` Alex Tomas
  0 siblings, 1 reply; 5+ messages in thread
From: Valerie Clement @ 2006-12-07 17:18 UTC (permalink / raw)
  To: Alex Tomas; +Cc: linux-ext4

[-- Attachment #1: Type: text/plain, Size: 484 bytes --]

Alex Tomas wrote:
> Good day,
> 
> I'd like to ask the community to discuss and review few things
> I've been working on. we propose set of patches with intention
> to improve performance of ext4:
> 
>  * mballoc v4
> 
Hi Alex,
I retrieved the patches from your site and started looking at them.
I found some issues in the ext4-mballoc patch concerning 48-bit
physical block number support.
The attached patch fixes the problem, if I'm not wrong.
Regards,

    Valérie


[-- Attachment #2: fix_48bit_support_in_mballoc.patch --]
[-- Type: text/plain, Size: 1772 bytes --]

Index: linux-2.6.19-rc6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.19-rc6.orig/fs/ext4/mballoc.c	2006-12-07 17:54:01.000000000 +0100
+++ linux-2.6.19-rc6/fs/ext4/mballoc.c	2006-12-07 19:19:11.000000000 +0100
@@ -745,7 +745,7 @@ static int ext4_mb_init_cache(struct pag
 			goto out;
 
 		err = -ENOMEM;
-		bh[i] = sb_getblk(sb, le32_to_cpu(desc->bg_block_bitmap));
+		bh[i] = sb_getblk(sb, ext4_block_bitmap(sb, desc));
 		if (bh[i] == NULL)
 			goto out;
 
@@ -2794,9 +2794,9 @@ int ext4_mb_mark_diskspace_used(struct e
 		+ ac->ac_b_ex.fe_start
 		+ le32_to_cpu(es->s_first_data_block);
 
-	if (block == le32_to_cpu(gdp->bg_block_bitmap) ||
-			block == le32_to_cpu(gdp->bg_inode_bitmap) ||
-			in_range(block, le32_to_cpu(gdp->bg_inode_table),
+	if (block == ext4_block_bitmap(sb, gdp) ||
+			block == ext4_inode_bitmap(sb, gdp) ||
+			in_range(block, ext4_inode_table(sb, gdp),
 				EXT4_SB(sb)->s_itb_per_group))
 		ext4_error(sb, "ext4_new_block",
 				"Allocating block in system zone - block = %lu",
@@ -3837,11 +3837,11 @@ do_more:
 	if (!gdp)
 		goto error_return;
 
-	if (in_range (le32_to_cpu(gdp->bg_block_bitmap), block, count) ||
-			in_range (le32_to_cpu(gdp->bg_inode_bitmap), block, count) ||
-			in_range (block, le32_to_cpu(gdp->bg_inode_table),
+	if (in_range (ext4_block_bitmap(sb, gdp), block, count) ||
+			in_range (ext4_inode_bitmap(sb, gdp), block, count) ||
+			in_range (block, ext4_inode_table(sb, gdp),
 				EXT4_SB(sb)->s_itb_per_group) ||
-			in_range (block + count - 1, le32_to_cpu(gdp->bg_inode_table),
+			in_range (block + count - 1, ext4_inode_table(sb, gdp),
 				EXT4_SB(sb)->s_itb_per_group))
 		ext4_error (sb, "ext4_free_blocks",
 				"Freeing blocks in system zones - "


* Re: [RFC] delayed allocation, mballoc, etc
  2006-12-07 17:18 ` Valerie Clement
@ 2006-12-07 17:26   ` Alex Tomas
  0 siblings, 0 replies; 5+ messages in thread
From: Alex Tomas @ 2006-12-07 17:26 UTC (permalink / raw)
  To: Valerie Clement; +Cc: Alex Tomas, linux-ext4

>>>>> Valerie Clement (VC) writes:

 VC> Alex Tomas wrote:
 >> Good day,
 >> 
 >> I'd like to ask the community to discuss and review few things
 >> I've been working on. we propose set of patches with intention
 >> to improve performance of ext4:
 >> 
 >> * mballoc v4
 >> 
 VC> Hi Alex,
 VC> I retrieved the patches from your site and began to have a look at them.
 VC> I found some issues in the ext4-mballoc patch concerning the 48bit
 VC> physical block number support.
 VC> The patch in attachment fixes the problem, if I'm not wrong.
 VC> Regards,

thanks for the patch, Valerie!


to be honest, I haven't implemented 48-bit support in mballoc yet.
that would be the next step.

thanks, Alex
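
For context, Valerie's patch replaces direct reads of the 32-bit descriptor fields with accessor calls. A minimal sketch of why that matters, assuming the accessor combines a low 32-bit word with a separate high word: the struct layout and the `demo_` names below are hypothetical and simplified, not the kernel definitions.

```c
#include <stdint.h>

/*
 * Hypothetical, simplified group descriptor: the location of the block
 * bitmap is split into a low and a high word, both little-endian.
 */
struct demo_group_desc {
    uint8_t block_bitmap_lo[4];
    uint8_t block_bitmap_hi[4];
};

/* Portable little-endian decode, standing in for le32_to_cpu(). */
static uint32_t le32_decode(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Accessor style: combine both words into one wide block number. */
uint64_t demo_block_bitmap(const struct demo_group_desc *gd)
{
    return (uint64_t)le32_decode(gd->block_bitmap_lo) |
           ((uint64_t)le32_decode(gd->block_bitmap_hi) << 32);
}

/* Direct field read: the high word is silently dropped. */
uint64_t demo_block_bitmap_lo_only(const struct demo_group_desc *gd)
{
    return le32_decode(gd->block_bitmap_lo);
}
```

As long as the high word is zero the two functions agree, which is why the direct reads only go wrong once block numbers no longer fit in 32 bits.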


* Re: [RFC] delayed allocation, mballoc, etc
@ 2006-12-27 11:09 sho
  2006-12-27 11:16 ` Alex Tomas
  0 siblings, 1 reply; 5+ messages in thread
From: sho @ 2006-12-27 11:09 UTC (permalink / raw)
  To: alex; +Cc: linux-ext4

Hi Alex

I found a bug in linux-2.6.19-rc6 with your patches applied.

With no files on the device, make the following system calls:
1. open with O_CREAT
	fd = open("test_file", O_RDWR|O_CREAT, 0777)
2. ftruncate (length is not aligned with blocksize)
	ftruncate(fd, 200)
3. write to the same block
	write(fd, write_buf, 100)
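
The three steps can be put together as a small user-space reproducer. This is a sketch: the function name `reproduce` and the buffer contents are assumptions, only the three calls and their arguments come from the report. On a filesystem without the bug it simply succeeds and leaves a 200-byte file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Run the reported sequence on `path`.
 * Returns 0 if all three calls succeed, -1 otherwise.
 */
int reproduce(const char *path)
{
    char write_buf[100];
    int ret = -1;
    int fd;

    memset(write_buf, 'a', sizeof(write_buf));

    /* 1. open with O_CREAT */
    fd = open(path, O_RDWR | O_CREAT, 0777);
    if (fd < 0)
        return -1;

    /* 2. ftruncate to a length not aligned to the blocksize,
     * 3. then write into the same block */
    if (ftruncate(fd, 200) == 0 &&
        write(fd, write_buf, sizeof(write_buf)) == (ssize_t)sizeof(write_buf))
        ret = 0;

    close(fd);
    return ret;
}
```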

As a result, a panic occurred at the following code:
  ext4_wb_commit_write()
          BUG_ON(EXT4_I(inode)->i_locality_group == NULL);

I tracked down the sequence that causes this panic:
1. i_locality_group is set to NULL when a file is first created

2. When ftruncate is given a length that is not aligned to the blocksize,
   the PG_dirty flag is set by __set_page_dirty_nobuffers() after the
   tail of the partial block is zeroed out
   	ext4_wb_block_truncate_page()
        	kaddr = kmap_atomic(page, KM_USER0);
        	memset(kaddr + offset, 0, length);
        	flush_dcache_page(page);
        	kunmap_atomic(kaddr, KM_USER0);
        	SetPageUptodate(page);
        	__set_page_dirty_nobuffers(page);

3. With the PG_dirty flag already set, __set_page_dirty_nobuffers()
   returns 0 in ext4_wb_commit_write(), so ext4_lg_page_enter_inode()
   is not called and i_locality_group is never set
     ext4_wb_commit_write()
		if (__set_page_dirty_nobuffers(page))
			ext4_lg_page_enter_inode(inode, page,
				PageMappedToDisk(page));

4. i_locality_group is still NULL, which triggers the BUG_ON
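
The four steps can be modelled in user space. This is only a sketch of the control flow: every `model_` name is hypothetical, and the assumption that entering the inode is what assigns the locality group is taken from the description above, not from the patch source.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_page  { bool dirty; };
struct model_inode { const char *locality_group; /* NULL until entered */ };

/* Test-and-set behaviour: nonzero only if the page was clean before. */
static int model_set_page_dirty(struct model_page *page)
{
    if (page->dirty)
        return 0;
    page->dirty = true;
    return 1;
}

/* Stand-in for entering the inode into a locality group. */
static void model_enter_inode(struct model_inode *inode)
{
    if (inode->locality_group == NULL)
        inode->locality_group = "group-0";
}

/* Step 2: truncate dirties the page but never enters the inode. */
void model_truncate(struct model_page *page)
{
    model_set_page_dirty(page);
}

/* Step 3, original logic: enter the inode only on a clean-to-dirty change. */
void model_commit_write_buggy(struct model_inode *inode,
                              struct model_page *page)
{
    if (model_set_page_dirty(page))
        model_enter_inode(inode);
}

/* Proposed logic: always enter the inode. */
void model_commit_write_fixed(struct model_inode *inode,
                              struct model_page *page)
{
    model_set_page_dirty(page);
    model_enter_inode(inode);
}
```

After `model_truncate()`, the buggy variant leaves `locality_group` at NULL because the page is already dirty, while the fixed variant sets it regardless of the page's previous state.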

I tried the attached patch, in which ext4_lg_page_enter_inode()
is always called.  The problem does not seem to occur
with this patch. What do you think?

diff -upNr -X linux-2.6.19-rc6/Documentation/dontdiff linux-2.6.19-rc6/fs/ext4/writeback.c linux-2.6.19-rc6-tmp/fs/ext4/writeback.c
--- linux-2.6.19-rc6/fs/ext4/writeback.c        2006-12-22 19:16:17.000000000 +0900
+++ linux-2.6.19-rc6-tmp/fs/ext4/writeback.c   2006-12-22 19:15:45.000000000 +0900
@@ -968,10 +968,8 @@ int ext4_wb_commit_write(struct file *fi
 
-       if (__set_page_dirty_nobuffers(page)) {
-                __set_page_dirty_nobuffers(page);
-               ext4_lg_page_enter_inode(inode, page, PageMappedToDisk(page));
-       }
+       __set_page_dirty_nobuffers(page);
+       ext4_lg_page_enter_inode(inode, page, PageMappedToDisk(page));


Cheers, Takashi


* Re: [RFC] delayed allocation, mballoc, etc
  2006-12-27 11:09 [RFC] delayed allocation, mballoc, etc sho
@ 2006-12-27 11:16 ` Alex Tomas
  0 siblings, 0 replies; 5+ messages in thread
From: Alex Tomas @ 2006-12-27 11:16 UTC (permalink / raw)
  To: sho; +Cc: alex, linux-ext4


Hi,

you're right. thanks for the patch.

thanks, Alex
