From: Bernd Schubert <bernd.schubert@fastmail.fm>
To: colyli@gmail.com
Cc: linux-ext4@vger.kernel.org,
	Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Subject: Re: [PATCH 2/2] ext4 directory index: read-ahead blocks
Date: Fri, 17 Jun 2011 23:35:52 +0200
Message-ID: <4DFBC8B8.9060207@fastmail.fm>
In-Reply-To: <4DFBA07B.6090001@gmail.com>

On 06/17/2011 08:44 PM, Coly Li wrote:
> On 2011-06-18 00:01, Bernd Schubert wrote:
>> While creating files in large directories we noticed an endless number
>> of 4K reads. And those reads very much reduced file creation numbers
>> as shown by bonnie. While we would expect about 2000 creates/s, we
>> only got about 25 creates/s. Running the benchmarks for a long time
>> improved the numbers, but not above 200 creates/s.
>> It turned out those reads came from directory index block reads
>> and probably the bh cache never cached all dx blocks. Given by
>> the high number of directories we have (8192) and number of files required
>> to trigger the issue (16 million), rather probably bh cached dx blocks
>> got lost in favour of other less important blocks.
>> The patch below implements a read-ahead for *all* dx blocks of a directory
>> if a single dx block is missing in the cache. That also helps the LRU
>> to cache important dx blocks.
>>
>> Unfortunately, it also has a performance trade-off for the first access to
>> a directory, although the READA flag is set already.
>> Therefore at least for now, this option is disabled by default, but may
>> be enabled using 'mount -o dx_read_ahead' or 'mount -odx_read_ahead=1'
>>
>> Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
>> ---
> 
> A question is, is there any performance number for dx dir read ahead ?

Well, I have been benchmarking it all week. But FhGFS sits between
bonnie++ and ext4... What exactly do you want to know?

> My concern is, if buffer cache replacement behavior is not ideal, which may replace a dx block by other (maybe) more hot
> blocks, dx dir readahead will introduce more I/Os. In this case, we may focus on exploring why dx block is replaced out
> of buffer cache, other than using dx readahead.

I think we have to differentiate between two problems: first, we have
to get all the index blocks into memory at all, and second, we have to
keep them there. Given the high number of index blocks we have, it is
not easy to tell the two apart, and I had to add several printks
or systemtap prints to get an idea of why accessing the filesystem was so slow.

> 
> 
> [snip]
>> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
>> index 6f32da4..78290f0 100644
>> --- a/fs/ext4/namei.c
>> +++ b/fs/ext4/namei.c
>> @@ -334,6 +334,35 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
>>  #endif /* DX_DEBUG */
>>  
>>  /*
>> + * Read ahead directory index blocks
>> + */
>> +static void dx_ra_blocks(struct inode *dir, struct dx_entry * entries)
>> +{
>> +	int i, err = 0;
>> +	unsigned num_entries = dx_get_count(entries);
>> +
>> +	if (num_entries < 2 || num_entries > dx_get_limit(entries)) {
>> +		dxtrace(printk("dx read-ahead: invalid number of entries\n"));
>> +		return;
>> +	}
>> +
>> +	dxtrace(printk("dx read-ahead: %d entries in dir-ino %lu \n",
>> +			num_entries, dir->i_ino));
>> +
>> +	i = 1; /* skip first entry, it was already read in by the caller */
>> +	do {
>> +		struct dx_entry *entry;
>> +		ext4_lblk_t block;
>> +
>> +		entry = entries + i;
>> +
>> +		block = dx_get_block(entry);
>> +		err = ext4_bread_ra(dir, dx_get_block(entry));
>> +		i++;
>> +	 } while (i < num_entries && !err);
>> +}
>> +
> 
> 
> I see sync reading here (CMIIW), this is performance killer. An async background reading ahead is better.

But isn't it async? Please see the new function ext4_bread_ra().
After ll_rw_block(READA, 1, &bh) we don't wait for the buffer to become
up-to-date, but return immediately. I also thought about moving it to
worker threads, but then concluded that would only add overhead
without any gain. I didn't test or benchmark that, though.
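
As a userspace analogy only (this is not the patch's ext4_bread_ra(),
which is not shown in this mail), the same submit-and-return idea can be
sketched with posix_fadvise(POSIX_FADV_WILLNEED), which on Linux starts a
non-blocking read into the page cache. The helper name ra_blocks() and the
4K RA_BLOCK_SIZE are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

#define RA_BLOCK_SIZE 4096 /* assumed block size, matching the 4K reads */

/*
 * Hypothetical helper: hint the kernel to start reading the listed
 * blocks of fd, without waiting for any of the data to arrive.
 * Returns 0 on success, or the first error number reported by
 * posix_fadvise() (which returns the error directly, not via errno).
 */
int ra_blocks(int fd, const unsigned long *blocks, size_t count)
{
	size_t i;
	int err = 0;

	/* Stop at the first error, like the do/while loop in the patch. */
	for (i = 0; i < count && !err; i++) {
		off_t offset = (off_t)blocks[i] * RA_BLOCK_SIZE;

		err = posix_fadvise(fd, offset, RA_BLOCK_SIZE,
				    POSIX_FADV_WILLNEED);
	}
	return err;
}
```

A caller would pass an open file descriptor and the block numbers it
expects to need soon; a later read() of those blocks may then find them
already cached, but nothing in the helper waits for that.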

Thanks for your review!

Cheers,
Bernd
