Re: [PATCH] ext4: dir inode reservation V3

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Coly Li <coyli@suse.de>
To: cmm@us.ibm.com
Cc: linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ext4: dir inode reservation V3
Date: Tue, 20 Nov 2007 12:14:03 +0800	[thread overview]
Message-ID: <47425F0B.3030604@suse.de> (raw)
In-Reply-To: <1195524061.4177.17.camel@localhost.localdomain>

Thanks for the feedback :-)

Mingming Cao wrote:
> On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
>> Basic idea of my dir inode reservation patch can be found here,
>> http://lists.openwall.net/linux-ext4/2007/11/05/3
>>
>> 1, What does dir inode reservation do
>> Dir inode reservation tries to reserve several inodes in inodes table for a directory when this
>> directory is created. When create new file under this directory, try to allocate inode from the
>> reserved inodes area. This is called as dir_ireserve inode allocator.
>>
> Thanks for the update.
> 
> Let me try to understand your method:
> 
> So the basic idea is not do linear inode allocation for directory? Inode
> structure block for directory file is only coming from block 0, N, N
> +N,... where the number of skipped blocks N is stored in the in-core
> superblock structure. 

N is not stored in in-core superblock. N = s_dir_ireserve_nr / inodes_per_block. What is stored in
in-core superblock is number of inodes to be reserved for each directory.

> 
> When ever need to allocate an inode for directory, skip N reserved bits
> (space for N*16 inodes) if the previous block is already allocated. That
> way place two directories with the hole of N*16 inodes structures, then
> allow files under the first directory stay closer with their parent
> directory. Is this correct?

The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because directory inode will also
use a inode slot from reserved area, reset slots number for files is (s_dir_ireserve_nr - 1).
Except for the reserved inodes number, your understanding exactly matches my idea.

>  
> 
>> 4, Dir inode reservation is optional
>> Dir inode reservation is optional, you can use -o followed by one of these options to enable dir
>> inode reservation during mount ext4 file system:
>>          dir_ireserve=low
>>          dir_ireserve=normal
>>          dir_ireserve=high
> 
> Would be nice to pass the tuning info low/normal/high(16/64/128 blocks
> correspondingly) via something else rather than mount options. 

Sure, I agree with you. Also I am thinking should this patch permit user to input reserved inodes
number directly other than a low/normal/high. Also I am looking for methods to display the tuning
info more convenient to users.

>  
>> Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high'
>> reserves 127 inodes. Reserving more than 127 inodes does not help to performance obviously.
>>
>>
>> 5, Performance number
>> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create
>> 50000 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I
>> tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result,
>> 			normal ext4			ext4 with dir inode reservation
>> 	mount options:	-o data=writeback		-o data=writeback,dir_ireserve=low
>> 	Create dirs:	real    0m49.101s		real    2m59.703s
>> 	Create files:	real    24m17.962s		real    21m8.161s
>> 	Unlink all:	real    24m43.788s		real    17m29.862s
>> Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating
>> directory inodes in non-linear order will cause extra hard disk seeking and block I/O.
> 
> Hmm...I suspect there is bug in your patch, the extra seek should not
> contribute to 4 times slower

I agree with you :-)

> 
>>  #include <linux/time.h>
>> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
>>  	return -1;
>>  }
>>
>> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
>> +			  int mode, ext4_group_t *group, unsigned long *ino)
>> +{
>> +	struct super_block *sb;
>> +	struct ext4_sb_info *sbi;
>> +	struct ext4_group_desc *gdp = NULL;
>> +	struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
>> +	ext4_group_t ires_group = *group;
>> +	unsigned long ires_ino;
>> +	int i, bit;
>> +
>> +	sb = dir->i_sb;
>> +	sbi = EXT4_SB(sb);
>> +
>> +	/* if the inode number is not for directory,
>> +	 * only try to allocate after directory's inode
>> +	 */
>> +	if (!S_ISDIR(mode)) {
>> +		*ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
>> +		return 0;
>> +	}
>> +
>> +	/* reserve inodes for new directory */
>> +	for (i = 0; i < sbi->s_groups_count; i++) {
>> +		gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
>> +		if (!gdp)
>> +			goto fail;
>> +		bit = 0;
>> +try_same_group:
>> +		if (bit < EXT4_INODES_PER_GROUP(sb)) {
>> +			brelse(bitmap_bh);
>> +			bitmap_bh = read_inode_bitmap(sb, ires_group);
>> +			if (!bitmap_bh)
>> +				goto fail;
>> +
>> +			BUFFER_TRACE(bitmap_bh, "get_write_access");
>> +			if (ext4_journal_get_write_access(
>> +				handle, bitmap_bh) != 0)
>> +				goto fail;
>> +			if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
>> +					bit, bitmap_bh->b_data)) {
>> +				/* we won it */
>> +				BUFFER_TRACE(bitmap_bh,
>> +					"call ext4_journal_dirty_metadata");
>> +				if (ext4_journal_dirty_metadata(handle,
>> +							bitmap_bh) != 0)
>> +					goto fail;
>> +				ires_ino = bit;
>> +				goto find;
>> +			}
>> +			/* we lost it */
>> +			jbd2_journal_release_buffer(handle, bitmap_bh);
>> +			bit += sbi->s_dir_ireserve_nr;
>> +			goto try_same_group;
>> +		}
>> +		if (++ires_group == sbi->s_groups_count)
>> +			ires_group = 0;
>> +	}
>> +	goto fail;
>> +find:
>> +	brelse(bitmap_bh);
>> +	*group = ires_group;
>> +	*ino = ires_ino;
>> +	return 0;
>> +fail:
>> +	brelse(bitmap_bh);
>> +	return -ENOSPC;
>> +}
>> +
>>  /*
>>   * There are two policies for allocating an inode.  If the new inode is
>>   * a directory, then a forward search is made for a block group with both
>> @@ -543,6 +614,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)
>>
>>  		ino = 0;
>>
>> +		if (test_opt(sb, DIR_IRESERVE)) {
>> +			err = ext4_ino_from_ireserve(handle, dir,
>> +						     mode, &group, &ino);
>> +			if ((!err) && S_ISDIR(mode))
>> +				goto got;
>> +		}
> 
> 
> So you are calling ext4_ino_from_ireserve() inside a loop iterate all
> block groups? I think this is bug, it should move outside of the loop.
> 

Hmm, sure, putting ext4_ino_from_ireserve() in this loop is redundant, it should be out of this
loop. Neat eyes:-) So this is second bug in my patch.

I will submit V4 patch and basic benchmark number before next internal conf-call. Thanks for your
review, great help to me :-)


>>  repeat_in_this_group:
>>  		ino = ext4_find_next_zero_bit((unsigned long *)
>>  				bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb), ino);
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index b626339..9b873b7 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -884,6 +884,7 @@ enum {
>>  	Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
>>  	Opt_journal_checksum, Opt_journal_async_commit,
>>  	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
>> +	Opt_dir_ireserve_low, Opt_dir_ireserve_normal, Opt_dir_ireserve_high,
>>  	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
>>  	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
>>  	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
>> @@ -929,6 +930,9 @@ static match_table_t tokens = {
>>  	{Opt_data_journal, "data=journal"},
>>  	{Opt_data_ordered, "data=ordered"},
>>  	{Opt_data_writeback, "data=writeback"},
>> +	{Opt_dir_ireserve_low, "dir_ireserve=low"},
>> +	{Opt_dir_ireserve_normal, "dir_ireserve=normal"},
>> +	{Opt_dir_ireserve_high, "dir_ireserve=high"},
>>  	{Opt_offusrjquota, "usrjquota="},
>>  	{Opt_usrjquota, "usrjquota=%s"},
>>  	{Opt_offgrpjquota, "grpjquota="},
>> @@ -1311,6 +1315,18 @@ clear_qf_name:
>>  				return 0;
>>  			sbi->s_stripe = option;
>>  			break;
>> +		case Opt_dir_ireserve_low:
>> +			set_opt(sbi->s_mount_opt, DIR_IRESERVE);
>> +			sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_LOW;
>> +			break;
>> +		case Opt_dir_ireserve_normal:
>> +			set_opt(sbi->s_mount_opt, DIR_IRESERVE);
>> +			sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_NORMAL;
>> +			break;
>> +		case Opt_dir_ireserve_high:
>> +			set_opt(sbi->s_mount_opt, DIR_IRESERVE);
>> +			sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_HIGH;
>> +			break;
>>  		default:
>>  			printk (KERN_ERR
>>  				"EXT4-fs: Unrecognized mount option \"%s\" "
>> diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
>> index bcdb59d..88f5173 100644
>> --- a/include/linux/ext4_fs.h
>> +++ b/include/linux/ext4_fs.h
>> @@ -92,6 +92,13 @@ struct ext4_allocation_request {
>>  #define EXT4_GOOD_OLD_FIRST_INO	11
>>
>>  /*
>> + * Macro-instructions used to reserve inodes for directories
>> + */
>> +#define EXT4_DIR_IRESERVE_LOW		16
>> +#define EXT4_DIR_IRESERVE_NORMAL	64
>> +#define EXT4_DIR_IRESERVE_HIGH		128
>> +
>> +/*
>>   * Maximal count of links to a file
>>   */
>>  #define EXT4_LINK_MAX		65000
>> @@ -502,6 +509,7 @@ do {									       \
>>  #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
>>  #define EXT4_MOUNT_DELALLOC		0x2000000 /* Delalloc support */
>>  #define EXT4_MOUNT_MBALLOC		0x4000000 /* Buddy allocation support */
>> +#define EXT4_MOUNT_DIR_IRESERVE		0x10000000/* dir inode reservation */
>>  /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
>>  #ifndef _LINUX_EXT2_FS_H
>>  #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
>> diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h
>> index 744e746..790b0cf 100644
>> --- a/include/linux/ext4_fs_sb.h
>> +++ b/include/linux/ext4_fs_sb.h
>> @@ -147,6 +147,8 @@ struct ext4_sb_info {
>>
>>  	/* locality groups */
>>  	struct ext4_locality_group *s_locality_groups;
>> +	/* number of inodes we reserve in a directory */
>> +	int s_dir_ireserve_nr;
>>  };
>>  #define EXT4_GROUP_INFO(sb, group)					   \
>>  	EXT4_SB(sb)->s_group_info[(group) >> EXT4_DESC_PER_BLOCK_BITS(sb)] \
>>
>>
> 

-- 
Coly Li
SuSE PRC Labs

next prev parent reply	other threads:[~2007-11-20  4:05 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-11-13 14:12 [PATCH] ext4: dir inode reservation V3 Coly Li
2007-11-13 14:09 ` Alex Tomas
2007-11-13 16:27   ` Coly Li
2007-11-13 16:43     ` Coly Li
2007-11-20  2:01 ` Mingming Cao
2007-11-20  4:14   ` Coly Li [this message]
2007-11-20 20:22     ` Mingming Cao
2007-11-21  2:09       ` Andreas Dilger
2007-11-20 15:58 ` Jan Kara
2007-11-20 16:40   ` Coly Li
2007-11-20 16:44     ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=47425F0B.3030604@suse.de \
    --to=coyli@suse.de \
    --cc=cmm@us.ibm.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.