From: Jens Axboe <jens.axboe@oracle.com>
To: Shan Wei <shanwei@cn.fujitsu.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: CFQ is worse than other IO schedulers in some cases
Date: Wed, 18 Feb 2009 12:37:04 +0100
Message-ID: <20090218113704.GW30821@kernel.dk>
In-Reply-To: <499BA413.2010705@cn.fujitsu.com>
On Wed, Feb 18 2009, Shan Wei wrote:
> I found that CFQ's performance is worse than that of other IO schedulers in some cases.
> I confirmed this when I ran the dump command and sysbench on 2.6.28.
>
>
> With dump (version: dump-0.4b41-2.fc6), I confirmed that
> the speed under CFQ is lower than under the other IO schedulers.
>
>
> The test result (dump):
> Unit: Mb/sec
> +--------------+--------+
> | IO scheduler | Speed  |
> +--------------+--------+
> | cfq          | 24.310 |
> | noop         | 36.885 |
> | anticipatory | 34.956 |
> | deadline     | 36.758 |
> +--------------+--------+
>
>
> Steps to reproduce(dump):
> #dump -0uf /dev/null /dev/sda6
The dump issue is a known one. It has to do with how dump uses separate
processes to interleave IO to the 'same' location. Jeff Moyer posted a
fix for that some time ago; you can also find references to the
discussion and progress right here on lkml. For reference, the patch is
included below.
> With sysbench (version: sysbench-0.4.10), I confirmed the following.
> - CFQ's performance is worse than that of other IO schedulers only in
> tests with multiple threads.
> (There is no difference in the single-thread test.)
> - It is worse than the other IO schedulers when
> I used read mode. (No regression in write mode.)
> - There is no difference among the other IO schedulers (e.g. noop, deadline).
>
>
> The test result (sysbench):
> Unit: Mb/sec
> +--------------+----------------------------------+
> | IO scheduler |          thread number           |
> |              |   1  |   3  |   5  |   7  |   9  |
> +--------------+------+------+------+------+------+
> | cfq          | 77.8 | 32.4 | 43.3 | 55.8 | 58.5 |
> | noop         | 78.2 | 79.0 | 78.2 | 77.2 | 77.0 |
> | anticipatory | 78.2 | 78.6 | 78.4 | 77.8 | 78.1 |
> | deadline     | 76.9 | 78.4 | 77.0 | 78.4 | 77.9 |
> +--------------+------+------+------+------+------+
What kind of storage hardware did you use?
------
Hi,
dump performs poorly when run under the CFQ I/O scheduler. The reason
for this is that the dump command interleaves I/O between two (or
three?) cooperating processes. This is about the worst case scenario
you can get for CFQ, as the I/O access pattern within each process is
sequential. Thus, CFQ will idle for a number of milliseconds waiting
for the current process to issue more I/O before switching to the next.
Now, this behaviour can be changed with tuning. However, if the dump
command simply shared I/O contexts between cooperating processes, CFQ
could make more intelligent decisions about I/O scheduling.
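The message above does not say which tuning is meant. One plausible candidate is CFQ's slice_idle tunable, which controls how long CFQ waits for the current process to issue more I/O; writing 0 disables that idling. A minimal sketch, assuming the tunable lives at /sys/block/<dev>/queue/iosched/slice_idle and that CFQ is the active scheduler for the device (the helper names write_tunable and cfq_disable_idle are hypothetical, not part of dump or the kernel):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Write an integer value to a sysfs-style tunable file.
 * Returns 0 on success, -1 on failure.
 */
int write_tunable(const char *path, long value)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%ld\n", value) > 0;

    /* fclose() flushes the buffer; a rejected write surfaces here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/*
 * Disable CFQ idling for one block device, e.g. dev = "sda".
 * ASSUMPTION: slice_idle is the tuning intended, and the path layout
 * below matches the running kernel. Requires root.
 */
int cfq_disable_idle(const char *dev)
{
    assert(dev != NULL);

    char path[256];
    int n = snprintf(path, sizeof(path),
                     "/sys/block/%s/queue/iosched/slice_idle", dev);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return write_tunable(path, 0);
}
```

Calling cfq_disable_idle("sda") as root would change the setting for every process using that disk, which is why sharing I/O contexts inside dump is the more targeted fix.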
So, here are the numbers, running under 2.6.28-rc3.
deadline     82241 kB/s
cfq          34143 kB/s
cfq-shared   82241 kB/s
cfq-shared denotes that the dump utility was patched with the attached
patch to share I/O contexts. As you can see, with a very small code
change, we can drastically increase the performance of dump under
CFQ (which is the default I/O scheduler used in a number of
distributions).
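The idea of sharing an I/O context can be tried outside dump with a small standalone program. The following is a minimal sketch, not the patch itself: it assumes Linux 2.6.25 or later (where CLONE_IO is available) and the x86-64 argument order for the raw clone system call; the name fork_shared_io is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* must come before any #include to expose CLONE_IO */
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_IO
#define CLONE_IO 0x80000000  /* fallback; value from the kernel headers */
#endif

/*
 * Behaves like fork(), except that the child shares the parent's
 * I/O context, so an I/O scheduler can treat both as one stream.
 * ASSUMPTION: x86-64 raw clone argument order
 * (flags, child_stack, parent_tidptr, child_tidptr).
 * Returns the child's pid in the parent, 0 in the child, -1 on error.
 */
pid_t fork_shared_io(void)
{
    unsigned long flags = (unsigned long)SIGCHLD | CLONE_IO;

    /* child_stack == 0 gives the child a copy-on-write copy of the stack */
    return (pid_t)syscall(SYS_clone, flags, 0UL, NULL, NULL);
}

/*
 * Wait for a child and return its exit code, or -1 if it did not
 * exit normally.
 */
int wait_exit_code(pid_t pid)
{
    int status = 0;

    assert(pid > 0);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would use fork_shared_io() wherever it previously called fork(), have the child do its share of the reads and call _exit(), and collect the result in the parent with wait_exit_code().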
For more information on the underlying problems, you can refer to the
following kernel discussion:
http://lkml.org/lkml/2008/11/9/133
Comments are appreciated.
Cheers,
Jeff
diff -up ./dump/tape.c.orig ./dump/tape.c
--- ./dump/tape.c.orig 2005-08-20 17:00:48.000000000 -0400
+++ ./dump/tape.c 2008-11-17 16:40:42.575792509 -0500
@@ -187,6 +187,40 @@ static sigjmp_buf jmpbuf; /* where to ju
static int gtperr = 0;
#endif
+/*
+ * Determine if we can use Linux' clone system call. If so, call it
+ * with the CLONE_IO flag so that all processes will share the same I/O
+ * context, allowing the I/O schedulers to make better scheduling decisions.
+ */
+#ifdef __linux__
+#include <syscall.h>
+
+#ifndef SYS_clone
+#define fork_clone_io fork
+#else /* SYS_clone */
+#include <linux/version.h>
+
+/*
+ * Kernel 2.5.49 introduced two extra parameters to the clone system call.
+ * Neither is useful in our case, so this is easy to handle.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,49)
+/* clone_flags, child_stack, parent_tidptr, child_tidptr */
+#define CLONE_ARGS SIGCHLD|CLONE_IO, 0, NULL, NULL
+#else
+#define CLONE_ARGS SIGCHLD|CLONE_IO, 0
+#endif /* LINUX_VERSION_CODE */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <unistd.h>
+#undef _GNU_SOURCE
+pid_t fork_clone_io(void);
+#endif /* SYS_clone */
+#else /* __linux__ not defined */
+#define fork_clone_io fork
+#endif /* __linux__ */
+
int
alloctape(void)
{
@@ -755,6 +789,16 @@ rollforward(void)
#endif
}
+#ifdef __linux__
+#ifdef SYS_clone
+pid_t
+fork_clone_io(void)
+{
+ return syscall(SYS_clone, CLONE_ARGS);
+}
+#endif
+#endif
+
/*
* We implement taking and restoring checkpoints on the tape level.
* When each tape is opened, a new process is created by forking; this
@@ -801,7 +845,7 @@ restore_check_point:
/*
* All signals are inherited...
*/
- childpid = fork();
+ childpid = fork_clone_io();
if (childpid < 0) {
msg("Context save fork fails in parent %d\n", parentpid);
Exit(X_ABORT);
@@ -1017,7 +1061,7 @@ enslave(void)
}
if (socketpair(AF_UNIX, SOCK_STREAM, 0, cmd) < 0 ||
- (slaves[i].pid = fork()) < 0)
+ (slaves[i].pid = fork_clone_io()) < 0)
quit("too many slaves, %d (recompile smaller): %s\n",
i, strerror(errno));
--
Jens Axboe