* btrfs: why default 4M readahead size?
@ 2010-03-18 1:42 Shaohua Li
2010-03-18 12:53 ` Chris Mason
0 siblings, 1 reply; 7+ messages in thread
From: Shaohua Li @ 2010-03-18 1:42 UTC (permalink / raw)
To: linux-btrfs; +Cc: chris.mason, jens.axboe, fengguang.wu, shaohua.li
Btrfs uses the following expression to calculate ra_pages:
	fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
				    4 * 1024 * 1024 / PAGE_CACHE_SIZE);
Is the max() a typo for min()? As written, it makes the readahead size 4M by
default, which is too big.
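To make the arithmetic concrete, here is a small userspace sketch of what the expression evaluates to. It is not the kernel code: the helper names are hypothetical, and the values used in the note below (a 4096-byte page and a starting ra_pages of 32 pages, i.e. the 128K backing-dev default mentioned later in this mail) are assumptions.

```c
#include <assert.h>

/* Hypothetical helper: convert a readahead size in bytes to a page count. */
static unsigned long bytes_to_pages(unsigned long bytes, unsigned long page_size)
{
	assert(page_size > 0);
	return bytes / page_size;
}

/* The quoted expression: the result is never below 4M worth of pages. */
static unsigned long ra_pages_with_max(unsigned long cur_pages,
				       unsigned long page_size)
{
	unsigned long four_mb = bytes_to_pages(4UL * 1024 * 1024, page_size);

	return cur_pages > four_mb ? cur_pages : four_mb;
}

/* The min() variant asked about: the result is never above 4M worth of pages. */
static unsigned long ra_pages_with_min(unsigned long cur_pages,
				       unsigned long page_size)
{
	unsigned long four_mb = bytes_to_pages(4UL * 1024 * 1024, page_size);

	return cur_pages < four_mb ? cur_pages : four_mb;
}
```

Under those assumptions, ra_pages_with_max(32, 4096) yields 1024 pages (4M), while ra_pages_with_min(32, 4096) leaves the value at 32 pages (128K).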
I have a system with 16 CPUs, 6G of memory and 12 SATA disks. I created a btrfs
filesystem on each disk, so this isn't a RAID setup. The test is fio, which runs
12 tasks to access 12 files for each disk. The fio test is an mmap sequential
read. I measured the performance with different readahead sizes:
ra size		io throughput
4M		268288 k/s
2M		367616 k/s
1M		431104 k/s
512K		474112 k/s
256K		512000 k/s
128K		538624 k/s
The 4M default readahead size performs poorly.
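As a quick way to read the table, the sketch below expresses each result as a percentage of the 128K figure. The function name is hypothetical; the inputs are the throughput numbers reported above, and the result is rounded down by integer division.

```c
#include <assert.h>

/* Hypothetical helper: throughput as a whole-number percentage of a baseline. */
static unsigned long percent_of_baseline(unsigned long value,
					 unsigned long baseline)
{
	assert(baseline > 0);
	return value * 100 / baseline;	/* integer division rounds down */
}
```

For example, percent_of_baseline(268288, 538624) gives 49, so the 4M run reaches roughly half the throughput of the 128K run, and percent_of_baseline(474112, 538624) gives 88 for the 512K run.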
I also ran a sync sequential read test. The difference isn't that big, but
the 4M case still shows about a 10% drop compared to the 512K case.
One might ask about the case where memory isn't tight. I tried running a
one-disk setup with only one task. The 4M ra makes almost no difference compared
with the 128K ra. I guess the 128K default ra size for the backing dev was
carefully chosen to work with popular disks.
So my question is: why do we have a default 4M readahead size even in the
non-RAID case?
Thanks,
Shaohua
* Re: btrfs: why default 4M readahead size?
2010-03-18 1:42 btrfs: why default 4M readahead size? Shaohua Li
@ 2010-03-18 12:53 ` Chris Mason
2010-03-19 0:59 ` Shaohua Li
0 siblings, 1 reply; 7+ messages in thread
From: Chris Mason @ 2010-03-18 12:53 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-btrfs, jens.axboe, fengguang.wu
On Thu, Mar 18, 2010 at 09:42:57AM +0800, Shaohua Li wrote:
> Btrfs uses the following expression to calculate ra_pages:
> 	fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
> 				    4 * 1024 * 1024 / PAGE_CACHE_SIZE);
> Is the max() a typo for min()? As written, it makes the readahead size 4M by
> default, which is too big.
Looks like things have changed since I tuned that number. Fengguang has
been busy ;)
> I have a system with 16 CPUs, 6G of memory and 12 SATA disks. I created a btrfs
> filesystem on each disk, so this isn't a RAID setup. The test is fio, which runs
> 12 tasks to access 12 files for each disk. The fio test is an mmap sequential
> read. I measured the performance with different readahead sizes:
> ra size		io throughput
> 4M		268288 k/s
> 2M		367616 k/s
> 1M		431104 k/s
> 512K		474112 k/s
> 256K		512000 k/s
> 128K		538624 k/s
> The 4M default readahead size performs poorly.
> I also ran a sync sequential read test. The difference isn't that big, but
> the 4M case still shows about a 10% drop compared to the 512K case.
I'm surprised the 4M is so much slower. At any rate, the larger size
was selected because btrfs checksumming means we need a bigger buffer to
keep the disks saturated. Were you on a fancy intel box with hardware
crc32c enabled?
>
> One might ask about the case where memory isn't tight. I tried running a
> one-disk setup with only one task. The 4M ra makes almost no difference compared
> with the 128K ra. I guess the 128K default ra size for the backing dev was
> carefully chosen to work with popular disks.
> So my question is: why do we have a default 4M readahead size even in the
> non-RAID case?
I'm happy to tune it down if lower numbers are more appropriate now,
thanks for trying this!
-chris
* Re: btrfs: why default 4M readahead size?
2010-03-18 12:53 ` Chris Mason
@ 2010-03-19 0:59 ` Shaohua Li
2010-03-19 2:56 ` Shaohua Li
0 siblings, 1 reply; 7+ messages in thread
From: Shaohua Li @ 2010-03-19 0:59 UTC (permalink / raw)
To: Chris Mason, linux-btrfs, jens.axboe, fengguang.wu
On Thu, Mar 18, 2010 at 08:53:13PM +0800, Chris Mason wrote:
> I'm surprised the 4M is so much slower. At any rate, the larger size
> was selected because btrfs checksumming means we need a bigger buffer to
> keep the disks saturated. Were you on a fancy intel box with hardware
> crc32c enabled?
Yes, this machine supports the SSE4.2 instructions. Let me check the result with
checksumming disabled.
Thanks,
Shaohua
* Re: btrfs: why default 4M readahead size?
2010-03-19 0:59 ` Shaohua Li
@ 2010-03-19 2:56 ` Shaohua Li
2010-03-19 8:22 ` Jens Axboe
0 siblings, 1 reply; 7+ messages in thread
From: Shaohua Li @ 2010-03-19 2:56 UTC (permalink / raw)
To: Chris Mason, linux-btrfs, jens.axboe, fengguang.wu
On Fri, Mar 19, 2010 at 08:59:48AM +0800, Shaohua Li wrote:
> Yes, this machine supports the SSE4.2 instructions. Let me check the result with
> checksumming disabled.
There seems to be no big difference with checksumming disabled. I reformatted
the disks and reran the test:
128k ra: 539648 k/s
4m ra: 285696 k/s
Thanks,
Shaohua
* Re: btrfs: why default 4M readahead size?
2010-03-19 2:56 ` Shaohua Li
@ 2010-03-19 8:22 ` Jens Axboe
2010-03-19 9:29 ` Shaohua Li
0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2010-03-19 8:22 UTC (permalink / raw)
To: Shaohua Li; +Cc: Chris Mason, linux-btrfs, fengguang.wu
On Fri, Mar 19 2010, Shaohua Li wrote:
> There seems to be no big difference with checksumming disabled. I reformatted
> the disks and reran the test:
> 128k ra: 539648 k/s
> 4m ra: 285696 k/s
4MB is definitely a huge read-ahead size, but I do wonder why it would
perform that much worse than a 128KB window. If you narrow your test
down to a single disk (or something simpler, at least), how does 4MB
compare to 128KB? With 6GB of memory, you should not run into read-ahead
memory thrashing.
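A rough back-of-the-envelope check of the memory argument can be sketched as follows. This is only an estimate under stated assumptions: the function name is hypothetical, the number of concurrent sequential streams is not pinned down in the thread, and "two readahead windows held per stream" is a simplifying assumption rather than a statement about the kernel's exact behaviour.

```c
#include <assert.h>

/*
 * Hypothetical estimate, in MB, of page cache held by readahead windows:
 * streams * window size * windows assumed to be held per stream.
 */
static unsigned long ra_footprint_mb(unsigned long streams,
				     unsigned long window_mb,
				     unsigned long windows_per_stream)
{
	assert(windows_per_stream > 0);
	return streams * window_mb * windows_per_stream;
}
```

With a 4 MB window and two windows per stream, ra_footprint_mb(12, 4, 2) is 96 MB and ra_footprint_mb(144, 4, 2) is 1152 MB. Both stay below the 6 GB (6144 MB) of RAM reported for the test machine, although other consumers of memory are not modeled here.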
--
Jens Axboe
* Re: btrfs: why default 4M readahead size?
2010-03-19 8:22 ` Jens Axboe
@ 2010-03-19 9:29 ` Shaohua Li
2010-03-19 12:57 ` Jens Axboe
0 siblings, 1 reply; 7+ messages in thread
From: Shaohua Li @ 2010-03-19 9:29 UTC (permalink / raw)
To: Jens Axboe; +Cc: Chris Mason, linux-btrfs@vger.kernel.org, Wu, Fengguang
On Fri, Mar 19, 2010 at 04:22:11PM +0800, Jens Axboe wrote:
> 4MB is definitely a huge read-ahead size, but I do wonder why it would
> perform that much worse than a 128KB window. If you narrow your test
> down to a single disk (or something simpler, at least), how does 4MB
> compare to 128KB? With 6GB of memory, you should not run into read-ahead
> memory thrashing.
Test data for a single disk (only one run so far):
128k ra: 88513 k/s
4m ra: 87630 k/s
So no big difference.
Thanks,
Shaohua
* Re: btrfs: why default 4M readahead size?
2010-03-19 9:29 ` Shaohua Li
@ 2010-03-19 12:57 ` Jens Axboe
0 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2010-03-19 12:57 UTC (permalink / raw)
To: Shaohua Li; +Cc: Chris Mason, linux-btrfs@vger.kernel.org, Wu, Fengguang
On Fri, Mar 19 2010, Shaohua Li wrote:
> Test data for a single disk (only one run so far):
> 128k ra: 88513 k/s
> 4m ra: 87630 k/s
> So no big difference.
That looks pretty much as expected. Unless you hit some sort of memory
thrashing, a huge read-ahead window should not cause a performance
degradation, at least not of the magnitude you are seeing. I would expect
performance to reach a stable threshold once the requests are large enough
to utilize the full device bandwidth on their own, and then remain at that
plateau.
Any chance you could capture blktrace data for a run with 128KB and one
with 4MB so we could inspect the disk IO pattern?
--
Jens Axboe