From: Pallai Roland <dap@mail.index.hu>
To: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: Linux RAID Mailing List <linux-raid@vger.kernel.org>
Subject: Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
Date: Sun, 22 Apr 2007 13:38:45 +0200
Message-ID: <200704221338.45759.dap@mail.index.hu>
In-Reply-To: <Pine.LNX.4.64.0704220619390.14170@p34.internal.lan>
[-- Attachment #1: Type: text/plain, Size: 3792 bytes --]
On Sunday 22 April 2007 12:23:12 Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
> > On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
> >> On Sun, 22 Apr 2007, Pallai Roland wrote:
> >>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> >>>> How did you run your read test?
> >>>
> >>> I ran 100 parallel reader processes (dd) on top of an XFS file system;
> >>> try this:
> >>> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>> for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >>>
> >>> and don't forget to set max_sectors_kb below the chunk size (e.g. 64/128Kb):
> >>> /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
> >>>
> >>> I also set 2048/4096 readahead sectors with blockdev --setra
> >>>
> >>> You need 50-100 reader processes to trigger this issue, I think. My
> >>> kernel version is 2.6.20.3.
> >>
> >> In one xterm:
> >> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>
> >> In another:
> >> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
> >
> > Write and read files on top of XFS, not on the block device. $i isn't a
> > typo; you should write into 100 files and read them back with 100 parallel
> > threads when done. I have 1Gb of RAM; maybe you should use the mem= kernel
> > parameter on boot.
> >
> > 1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100
> > 2>/dev/null; done
> > 2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null &
> > done
> >
>
> I use a combination of 4 Silicon Image controllers (SiI) and the Intel 965
> chipset. My max_sectors_kb is 128kb and my chunk size is 128kb; why do you
> set max_sectors_kb below the chunk size?
It's the maximum on Marvell SATA chips under Linux; maybe it's a hardware
limitation. I would just have used a 128Kb chunk, but I hit this issue.
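For what it's worth, you can check whether it really is a controller limit by
comparing the hardware ceiling with the current setting in sysfs (just a
sketch, assuming sd* device names):

  # max_hw_sectors_kb is the driver/controller ceiling, max_sectors_kb the current value
  for q in /sys/block/sd*/queue; do
    echo "$q: hw=$(cat $q/max_hw_sectors_kb) cur=$(cat $q/max_sectors_kb)"
  done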
> For read-ahead, there are some good benchmarks, by SGI(?) I believe, and
> some others, that state 16MB is the best value; above that, you lose on
> either reads or writes, so 16MB appears to give the best overall value. Do
> these values look good to you, or?
Where can I find this benchmark? I did some tests on this topic, too. I think
the optimal readahead size always depends on the number of sequentially
reading processes and the available RAM. If you have 100 processes and 1Gb of
RAM, the maximum useful readahead is about 5-6Mb; set it any bigger and it
turns into readahead thrashing and undesirable context switches. Anyway, I
tried 16Mb now, but the readahead size doesn't matter for this bug: the same
context switch storm appears with any readahead window size.
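For illustration, the back-of-envelope I use (only a rough sketch; the RAM
size, reader count and the half-of-RAM budget are assumptions for this
example, and /dev/md3 is taken from your commands): give each reader a window
small enough that all windows together stay well under the page cache, then
convert Mb to 512-byte sectors for blockdev --setra:

  RAM_MB=1024                                    # roughly usable page cache
  READERS=100
  RA_MB=$(( RAM_MB / 2 / READERS ))              # ~5Mb per reader here
  blockdev --setra $(( RA_MB * 2048 )) /dev/md3  # --setra takes 512-byte sectors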
> Read 100 files on XFS simultaneously:
Is max_sectors_kb 128kb here? I think so. I do see some anomaly, but maybe
your readahead window is just too big for that many processes; it's not the
bug I'm talking about in my original post. Your high interrupt and CS counts
build up slowly, which may be a sign of readahead thrashing. In my case the
CS storm began in the first second, with no high interrupt count:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd   free buff  cache  si  so     bi  bo   in    cs us sy  id wa
 0  0    0   7220    0 940972   0   0      0   0  256    20  0  0 100  0
 0 13    0 383636    0 535520   0   0 144904  32 2804 63834  1 42   0 57
24 20    0 353312    0 558200   0   0 121524   0 2669 67604  1 40   0 59
15 21    0 314808    0 557068   0   0  91300  33 2572 53442  0 29   0 71
I attached a small kernel patch; you can measure the readahead thrashing
ratio with it (see the tail of /proc/vmstat). I think it's a handy tool for
finding the optimal RA size. And if you're interested in the bug I'm talking
about, set max_sectors_kb to 64Kb.
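With the patch applied, usage is simple: watch the new counter while the 100
readers run and compare it to the pages read in, something like this sketch:

  # a rising rathrashed relative to pgpgin means the readahead windows are
  # being thrashed; both counters appear in /proc/vmstat
  while sleep 1; do
    grep -E '^(rathrashed|pgpgin) ' /proc/vmstat
  done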
--
d
[-- Attachment #2: 01_2618+_rathr-d1.diff --]
[-- Type: text/x-diff, Size: 868 bytes --]
--- linux-2.6.18.2/include/linux/vmstat.h.orig 2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/include/linux/vmstat.h 2006-11-06 02:09:25.000000000 +0100
@@ -30,6 +30,7 @@
FOR_ALL_ZONES(PGSCAN_DIRECT),
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+ RATHRASHED,
NR_VM_EVENT_ITEMS
};
--- linux-2.6.18.2/mm/vmstat.c.orig 2006-11-06 01:55:58.000000000 +0100
+++ linux-2.6.18.2/mm/vmstat.c 2006-11-06 02:05:14.000000000 +0100
@@ -502,6 +502,8 @@
"allocstall",
"pgrotated",
+
+ "rathrashed",
#endif
};
--- linux-2.6.18.2/mm/readahead.c.orig 2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/mm/readahead.c 2006-11-06 02:13:12.000000000 +0100
@@ -568,6 +568,7 @@
ra->flags |= RA_FLAG_MISS;
ra->flags &= ~RA_FLAG_INCACHE;
ra->cache_hit = 0;
+ count_vm_event(RATHRASHED);
}
/*