From: Pallai Roland <dap@mail.index.hu>
To: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: Linux RAID Mailing List <linux-raid@vger.kernel.org>
Subject: Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
Date: Sun, 22 Apr 2007 13:38:45 +0200
Message-ID: <200704221338.45759.dap@mail.index.hu>
In-Reply-To: <Pine.LNX.4.64.0704220619390.14170@p34.internal.lan>
[-- Attachment #1: Type: text/plain, Size: 3792 bytes --]
On Sunday 22 April 2007 12:23:12 Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
> > On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
> >> On Sun, 22 Apr 2007, Pallai Roland wrote:
> >>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> >>>> How did you run your read test?
> >>>
> >>> I ran 100 parallel reader processes (dd) on top of an XFS file system;
> >>> try this:
> >>> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>> for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >>>
> >>> and don't forget to set max_sectors_kb below the chunk size (e.g. 64/128Kb):
> >>> /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
> >>>
> >>> I also set 2048/4096 readahead sectors with blockdev --setra
> >>>
> >>> You need 50-100 reader processes to trigger this issue, I think. My
> >>> kernel version is 2.6.20.3.
> >>
> >> In one xterm:
> >> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>
> >> In another:
> >> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
> >
> > Write and read files on top of XFS, not on the block device. $i isn't a
> > typo; you should write into 100 files and read them back with 100 parallel
> > threads when done. I have 1Gb of RAM; maybe you should use the mem= kernel
> > parameter on boot.
> >
> > 1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100
> > 2>/dev/null; done
> > 2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null &
> > done
> >
>
> I use a combination of 4 Silicon Image controllers (SiI) and the Intel 965
> chipset. My max_sectors_kb is 128kb and my chunk size is 128kb; why do you
> set max_sectors_kb below the chunk size?
It's the maximum on Marvell SATA chips under Linux; maybe it's a hardware
limitation. I would just have used a 128Kb chunk, but I hit this issue.
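For what it's worth, you can check whether it really is a controller limit by
comparing the hardware ceiling with the current setting in sysfs (just a
sketch, assuming sd* device names):

  # max_hw_sectors_kb is the driver/controller ceiling, max_sectors_kb the current value
  for q in /sys/block/sd*/queue; do
    echo "$q: hw=$(cat $q/max_hw_sectors_kb) cur=$(cat $q/max_sectors_kb)"
  done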
> For read-ahead, there are some good benchmarks, by SGI(?) I believe, and
> some others, that state 16MB is the best value; above that, you lose on
> either reads or writes, so 16MB appears to give the best overall value. Do
> these values look good to you, or?
Where can I find this benchmark? I did some tests on this topic, too. I think
the optimal readahead size always depends on the number of sequentially
reading processes and the available RAM. If you have 100 processes and 1Gb of
RAM, the maximum useful readahead is about 5-6Mb; set it any bigger and it
turns into readahead thrashing and undesirable context switches. Anyway, I
tried 16Mb now, but the readahead size doesn't matter for this bug: the same
context switch storm appears with any readahead window size.
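For illustration, the back-of-envelope I use (only a rough sketch; the RAM
size, reader count and the half-of-RAM budget are assumptions for this
example, and /dev/md3 is taken from your commands): give each reader a window
small enough that all windows together stay well under the page cache, then
convert Mb to 512-byte sectors for blockdev --setra:

  RAM_MB=1024                                    # roughly usable page cache
  READERS=100
  RA_MB=$(( RAM_MB / 2 / READERS ))              # ~5Mb per reader here
  blockdev --setra $(( RA_MB * 2048 )) /dev/md3  # --setra takes 512-byte sectors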
> Read 100 files on XFS simultaneously:
Is max_sectors_kb 128kb here? I think so. I do see some anomaly, but maybe
your readahead window is just too big for that many processes; it's not the
bug I'm talking about in my original post. Your high interrupt and CS counts
build up slowly, which may be a sign of readahead thrashing. In my case the
CS storm began in the first second, with no high interrupt count:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd   free buff  cache  si  so     bi  bo   in    cs us sy  id wa
 0  0    0   7220    0 940972   0   0      0   0  256    20  0  0 100  0
 0 13    0 383636    0 535520   0   0 144904  32 2804 63834  1 42   0 57
24 20    0 353312    0 558200   0   0 121524   0 2669 67604  1 40   0 59
15 21    0 314808    0 557068   0   0  91300  33 2572 53442  0 29   0 71
I attached a small kernel patch; you can measure the readahead thrashing
ratio with it (see the tail of /proc/vmstat). I think it's a handy tool for
finding the optimal RA size. And if you're interested in the bug I'm talking
about, set max_sectors_kb to 64Kb.
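With the patch applied, usage is simple: watch the new counter while the 100
readers run and compare it to the pages read in, something like this sketch:

  # a rising rathrashed relative to pgpgin means the readahead windows are
  # being thrashed; both counters appear in /proc/vmstat
  while sleep 1; do
    grep -E '^(rathrashed|pgpgin) ' /proc/vmstat
  done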
--
d
[-- Attachment #2: 01_2618+_rathr-d1.diff --]
[-- Type: text/x-diff, Size: 868 bytes --]
--- linux-2.6.18.2/include/linux/vmstat.h.orig 2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/include/linux/vmstat.h 2006-11-06 02:09:25.000000000 +0100
@@ -30,6 +30,7 @@
FOR_ALL_ZONES(PGSCAN_DIRECT),
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+ RATHRASHED,
NR_VM_EVENT_ITEMS
};
--- linux-2.6.18.2/mm/vmstat.c.orig 2006-11-06 01:55:58.000000000 +0100
+++ linux-2.6.18.2/mm/vmstat.c 2006-11-06 02:05:14.000000000 +0100
@@ -502,6 +502,8 @@
"allocstall",
"pgrotated",
+
+ "rathrashed",
#endif
};
--- linux-2.6.18.2/mm/readahead.c.orig 2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/mm/readahead.c 2006-11-06 02:13:12.000000000 +0100
@@ -568,6 +568,7 @@
ra->flags |= RA_FLAG_MISS;
ra->flags &= ~RA_FLAG_INCACHE;
ra->cache_hit = 0;
+ count_vm_event(RATHRASHED);
}
/*