* [BLOCK] 0/4 explicit io plugging
@ 2007-01-03 7:48 Jens Axboe
2007-01-03 7:48 ` [PATCH] 1/4 qrcu: "quick" srcu implementation Jens Axboe
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Jens Axboe @ 2007-01-03 7:48 UTC (permalink / raw)
To: linux-kernel; +Cc: Nick Piggin, akpm
This series of 4 patches switches the block layer to use explicit
plugging instead of the implicit plugging that takes place now when io
is queued against an empty queue.
The first three patches update RCU to include a QRCU method similar to
SRCU. QRCU is a bit heavier on the reader side, but a _lot_ cheaper for
the synchronization part. The new plugging scheme needs to synchronize
queue plugs for barriers and queue quiescing, so it needs to be cheap.
The fourth patch is the actual meat of the series. It also has a longer
explanation of the benefits of the explicit plugging.
I'm sending this out to get some review of the code, and to ask people
to do some testing. I'm looking for both the "hey it works for me" as
well as benchmark runs. In the performance category, I'm interested in
both high end (lots of CPUs) testing to see whether this actually does
reduce lock contention and block layer cpu utilization as well as more
simplistic io performance results on "normal" boxes to make sure we are
not regressing anywhere.
This code is also available in the 'plug' branch of the block layer git
repo:
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git/
Documentation/RCU/checklist.txt | 13 +
Documentation/RCU/rcu.txt | 6
Documentation/RCU/torture.txt | 15 -
Documentation/RCU/whatisRCU.txt | 3
Documentation/block/biodoc.txt | 5
block/as-iosched.c | 15 -
block/cfq-iosched.c | 8
block/deadline-iosched.c | 9
block/elevator.c | 44 ---
block/ll_rw_blk.c | 483 ++++++++++++++++++++--------------------
block/noop-iosched.c | 8
drivers/block/cciss.c | 6
drivers/block/cpqarray.c | 3
drivers/block/floppy.c | 1
drivers/block/loop.c | 12
drivers/block/pktcdvd.c | 5
drivers/block/rd.c | 2
drivers/block/umem.c | 16 -
drivers/ide/ide-cd.c | 9
drivers/ide/ide-io.c | 25 --
drivers/md/bitmap.c | 1
drivers/md/dm-emc.c | 2
drivers/md/dm-table.c | 14 -
drivers/md/dm.c | 18 -
drivers/md/dm.h | 1
drivers/md/linear.c | 14 -
drivers/md/md.c | 3
drivers/md/multipath.c | 32 --
drivers/md/raid0.c | 17 -
drivers/md/raid1.c | 70 -----
drivers/md/raid10.c | 73 ------
drivers/md/raid5.c | 60 ----
drivers/message/i2o/i2o_block.c | 6
drivers/mmc/mmc_queue.c | 3
drivers/s390/block/dasd.c | 3
drivers/s390/char/tape_block.c | 1
drivers/scsi/ide-scsi.c | 2
drivers/scsi/scsi_lib.c | 47 +--
fs/adfs/inode.c | 1
fs/affs/file.c | 2
fs/befs/linuxvfs.c | 1
fs/bfs/file.c | 1
fs/block_dev.c | 2
fs/buffer.c | 25 --
fs/cifs/file.c | 2
fs/direct-io.c | 7
fs/ecryptfs/mmap.c | 23 -
fs/efs/inode.c | 1
fs/ext2/inode.c | 2
fs/ext3/inode.c | 3
fs/ext4/inode.c | 3
fs/fat/inode.c | 1
fs/freevxfs/vxfs_subr.c | 1
fs/fuse/inode.c | 1
fs/gfs2/ops_address.c | 1
fs/hfs/inode.c | 2
fs/hfsplus/inode.c | 2
fs/hpfs/file.c | 1
fs/isofs/inode.c | 1
fs/jfs/inode.c | 1
fs/jfs/jfs_metapage.c | 1
fs/minix/inode.c | 1
fs/ntfs/aops.c | 4
fs/ntfs/compress.c | 2
fs/ocfs2/aops.c | 1
fs/ocfs2/cluster/heartbeat.c | 4
fs/qnx4/inode.c | 1
fs/reiserfs/inode.c | 1
fs/sysv/itree.c | 1
fs/udf/file.c | 1
fs/udf/inode.c | 1
fs/ufs/inode.c | 1
fs/ufs/truncate.c | 2
fs/xfs/linux-2.6/xfs_aops.c | 1
fs/xfs/linux-2.6/xfs_buf.c | 15 -
include/linux/backing-dev.h | 3
include/linux/blkdev.h | 75 +++---
include/linux/buffer_head.h | 1
include/linux/elevator.h | 8
include/linux/fs.h | 1
include/linux/pagemap.h | 12
include/linux/raid/md.h | 1
include/linux/sched.h | 1
include/linux/srcu.h | 30 ++
include/linux/swap.h | 2
kernel/rcutorture.c | 71 +++++
kernel/sched.c | 1
kernel/srcu.c | 105 ++++++++
mm/filemap.c | 62 -----
mm/nommu.c | 4
mm/page-writeback.c | 8
mm/readahead.c | 11
mm/shmem.c | 1
mm/swap_state.c | 5
mm/swapfile.c | 37 ---
mm/vmscan.c | 6
96 files changed, 632 insertions(+), 989 deletions(-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH] 1/4 qrcu: "quick" srcu implementation
2007-01-03 7:48 [BLOCK] 0/4 explicit io plugging Jens Axboe
@ 2007-01-03 7:48 ` Jens Axboe
2007-01-03 7:48 ` [PATCH] 2/4 qrcu: add rcutorture test Jens Axboe
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2007-01-03 7:48 UTC (permalink / raw)
To: linux-kernel; +Cc: Nick Piggin, akpm, Oleg Nesterov
From: Oleg Nesterov <oleg@tv-sign.ru>
Very much based on ideas, corrections, and patient explanations from
Alan and Paul.
The current srcu implementation is very good for readers: lock/unlock
are extremely cheap. But for that reason it is not possible to avoid
synchronize_sched() and polling in synchronize_srcu().
Jens Axboe wrote:
>
> It works for me, but the overhead is still large. Before it would take
> 8-12 jiffies for a synchronize_srcu() to complete without there actually
> being any reader locks active, now it takes 2-3 jiffies. So it's
> definitely faster, and as suspected the loss of two of three
> synchronize_sched() cut down the overhead to a third.
'qrcu' behaves the same as srcu but is optimized for writers. The fast path
for synchronize_qrcu() is mutex_lock() + atomic_read() + mutex_unlock().
The slow path is __wait_event(), no polling. However, the reader does
atomic inc/dec on lock/unlock, and the counters are not per-cpu.
Also, unlike srcu, qrcu read lock/unlock can be used in interrupt context,
and 'qrcu_struct' can be compile-time initialized.
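For readers who want to see the counter scheme in isolation, here is a
minimal userspace sketch of the fast and slow paths described above. It
assumes C11 atomics and pthreads as stand-ins for the kernel's atomic_t,
mutex and wait queue; the qrcu_model names and the demo function are
hypothetical and are not part of the patch. The kernel's explicit memory
barriers are not modelled separately, since default C11 atomic operations
are sequentially consistent.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the qrcu counters; not the kernel code. */
struct qrcu_model {
	atomic_int completed;     /* low bit selects the active counter */
	atomic_int ctr[2];        /* the active counter carries a bias of 1 */
	pthread_mutex_t mutex;    /* serializes writers */
	pthread_mutex_t wq_lock;  /* wq_lock + wq stand in for the wait queue */
	pthread_cond_t wq;
};

static void qrcu_model_init(struct qrcu_model *qp)
{
	atomic_init(&qp->completed, 0);
	atomic_init(&qp->ctr[0], 1);
	atomic_init(&qp->ctr[1], 0);
	pthread_mutex_init(&qp->mutex, NULL);
	pthread_mutex_init(&qp->wq_lock, NULL);
	pthread_cond_init(&qp->wq, NULL);
}

/* Reader: increment the active counter unless it already drained to 0. */
static int qrcu_model_read_lock(struct qrcu_model *qp)
{
	for (;;) {
		int idx = atomic_load(&qp->completed) & 1;
		int old = atomic_load(&qp->ctr[idx]);

		while (old != 0)
			if (atomic_compare_exchange_weak(&qp->ctr[idx], &old, old + 1))
				return idx;
	}
}

static void qrcu_model_read_unlock(struct qrcu_model *qp, int idx)
{
	assert(idx == 0 || idx == 1);
	if (atomic_fetch_sub(&qp->ctr[idx], 1) == 1) {	/* counter hit zero */
		pthread_mutex_lock(&qp->wq_lock);
		pthread_cond_broadcast(&qp->wq);
		pthread_mutex_unlock(&qp->wq_lock);
	}
}

/* Writer: returns true if the fast path (no active readers) was taken. */
static bool qrcu_model_synchronize(struct qrcu_model *qp)
{
	bool fast = true;
	int idx;

	pthread_mutex_lock(&qp->mutex);
	idx = atomic_load(&qp->completed) & 1;
	if (atomic_load(&qp->ctr[idx]) != 1) {
		fast = false;
		atomic_fetch_add(&qp->ctr[idx ^ 1], 1);	/* bias the other counter */
		atomic_fetch_add(&qp->completed, 1);	/* new readers now use it */
		atomic_fetch_sub(&qp->ctr[idx], 1);	/* drop the old bias */
		pthread_mutex_lock(&qp->wq_lock);
		while (atomic_load(&qp->ctr[idx]) != 0)	/* sleep, no polling */
			pthread_cond_wait(&qp->wq, &qp->wq_lock);
		pthread_mutex_unlock(&qp->wq_lock);
	}
	pthread_mutex_unlock(&qp->mutex);
	return fast;
}

static void *writer_thread(void *arg)
{
	/* Encode the result in the pointer: non-NULL means fast path. */
	return qrcu_model_synchronize(arg) ? arg : NULL;
}

/* Returns 0 if the fast and slow paths behave as described above. */
int qrcu_model_demo(void)
{
	struct qrcu_model q;
	pthread_t t;
	void *ret;
	int old_idx, new_idx;

	qrcu_model_init(&q);
	if (!qrcu_model_synchronize(&q))	/* no readers: fast path */
		return 1;
	old_idx = qrcu_model_read_lock(&q);	/* hold a reader */
	if (pthread_create(&t, NULL, writer_thread, &q))
		return 3;
	while (atomic_load(&q.completed) == 0)	/* wait for the writer's flip */
		sched_yield();
	new_idx = qrcu_model_read_lock(&q);	/* lands on the new counter */
	qrcu_model_read_unlock(&q, new_idx);
	qrcu_model_read_unlock(&q, old_idx);	/* lets the writer finish */
	pthread_join(t, &ret);
	return (ret == NULL && old_idx == 0 && new_idx == 1) ? 0 : 2;
}
```

In this model, qrcu_model_demo() returns 0 when an idle synchronize takes
the fast path, and a synchronize issued while a reader is held takes the
slow path and completes only after that reader unlocks.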
See also (a long) discussion:
http://marc.theaimsgroup.com/?t=116370857600003
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
---
include/linux/srcu.h | 30 ++++++++++++++
kernel/srcu.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 135 insertions(+), 0 deletions(-)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index aca0eee..fcdb749 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -27,6 +27,8 @@
#ifndef _LINUX_SRCU_H
#define _LINUX_SRCU_H
+#include <linux/wait.h>
+
struct srcu_struct_array {
int c[2];
};
@@ -50,4 +52,32 @@ void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
void synchronize_srcu(struct srcu_struct *sp);
long srcu_batches_completed(struct srcu_struct *sp);
+/*
+ * fully compatible with srcu, but optimized for writers.
+ */
+
+struct qrcu_struct {
+ int completed;
+ atomic_t ctr[2];
+ wait_queue_head_t wq;
+ struct mutex mutex;
+};
+
+int init_qrcu_struct(struct qrcu_struct *qp);
+int qrcu_read_lock(struct qrcu_struct *qp);
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx);
+void synchronize_qrcu(struct qrcu_struct *qp);
+
+/**
+ * cleanup_qrcu_struct - deconstruct a quick-RCU structure
+ * @qp: structure to clean up.
+ *
+ * Must invoke this after you are finished using a given qrcu_struct that
+ * was initialized via init_qrcu_struct(). We reserve the right to
+ * leak memory should you fail to do this!
+ */
+static inline void cleanup_qrcu_struct(struct qrcu_struct *qp)
+{
+}
+
#endif
diff --git a/kernel/srcu.c b/kernel/srcu.c
index 3507cab..53c6989 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -256,3 +256,108 @@ EXPORT_SYMBOL_GPL(srcu_read_unlock);
EXPORT_SYMBOL_GPL(synchronize_srcu);
EXPORT_SYMBOL_GPL(srcu_batches_completed);
EXPORT_SYMBOL_GPL(srcu_readers_active);
+
+/**
+ * init_qrcu_struct - initialize a quick-RCU structure.
+ * @qp: structure to initialize.
+ *
+ * Must invoke this on a given qrcu_struct before passing that qrcu_struct
+ * to any other function. Each qrcu_struct represents a separate domain
+ * of QRCU protection.
+ */
+int init_qrcu_struct(struct qrcu_struct *qp)
+{
+ qp->completed = 0;
+ atomic_set(qp->ctr + 0, 1);
+ atomic_set(qp->ctr + 1, 0);
+ init_waitqueue_head(&qp->wq);
+ mutex_init(&qp->mutex);
+
+ return 0;
+}
+
+/**
+ * qrcu_read_lock - register a new reader for an QRCU-protected structure.
+ * @qp: qrcu_struct in which to register the new reader.
+ *
+ * Counts the new reader in the appropriate element of the qrcu_struct.
+ * Returns an index that must be passed to the matching qrcu_read_unlock().
+ */
+int qrcu_read_lock(struct qrcu_struct *qp)
+{
+ for (;;) {
+ int idx = qp->completed & 0x1;
+ if (likely(atomic_inc_not_zero(qp->ctr + idx)))
+ return idx;
+ }
+}
+
+/**
+ * qrcu_read_unlock - unregister a old reader from an QRCU-protected structure.
+ * @qp: qrcu_struct in which to unregister the old reader.
+ * @idx: return value from corresponding qrcu_read_lock().
+ *
+ * Removes the count for the old reader from the appropriate element of
+ * the qrcu_struct.
+ */
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
+{
+ if (atomic_dec_and_test(qp->ctr + idx))
+ wake_up(&qp->wq);
+}
+
+/**
+ * synchronize_qrcu - wait for prior QRCU read-side critical-section completion
+ * @qp: qrcu_struct with which to synchronize.
+ *
+ * Flip the completed counter, and wait for the old count to drain to zero.
+ * As with classic RCU, the updater must use some separate means of
+ * synchronizing concurrent updates. Can block; must be called from
+ * process context.
+ *
+ * Note that it is illegal to call synchronize_qrcu() from the corresponding
+ * QRCU read-side critical section; doing so will result in deadlock.
+ * However, it is perfectly legal to call synchronize_qrcu() on one
+ * qrcu_struct from some other qrcu_struct's read-side critical section.
+ */
+void synchronize_qrcu(struct qrcu_struct *qp)
+{
+ int idx;
+
+ /*
+ * The following memory barrier is needed to ensure that
+ * any prior data-structure manipulation is seen by other
+ * CPUs to happen before picking up the value of
+ * qp->completed.
+ */
+ smp_mb();
+ mutex_lock(&qp->mutex);
+
+ idx = qp->completed & 0x1;
+ if (atomic_read(qp->ctr + idx) == 1)
+ goto out;
+
+ atomic_inc(qp->ctr + (idx ^ 0x1));
+ /* Reduce the likelihood that qrcu_read_lock() will loop */
+ smp_mb__after_atomic_inc();
+ qp->completed++;
+
+ atomic_dec(qp->ctr + idx);
+ __wait_event(qp->wq, !atomic_read(qp->ctr + idx));
+out:
+ mutex_unlock(&qp->mutex);
+ smp_mb();
+ /*
+ * The above smp_mb() is needed in the case that we
+ * see the counter reaching zero, so that we do not
+ * need to block. In this case, we need to make
+ * sure that the CPU does not re-order any subsequent
+ * changes made by the caller to occur prior to the
+ * test, as seen by other CPUs.
+ */
+}
+
+EXPORT_SYMBOL_GPL(init_qrcu_struct);
+EXPORT_SYMBOL_GPL(qrcu_read_lock);
+EXPORT_SYMBOL_GPL(qrcu_read_unlock);
+EXPORT_SYMBOL_GPL(synchronize_qrcu);
--
1.4.4.2.g02c9
* [PATCH] 2/4 qrcu: add rcutorture test
2007-01-03 7:48 [BLOCK] 0/4 explicit io plugging Jens Axboe
2007-01-03 7:48 ` [PATCH] 1/4 qrcu: "quick" srcu implementation Jens Axboe
@ 2007-01-03 7:48 ` Jens Axboe
2007-01-03 8:31 ` [PATCH] 3/4 qrcu: add documentation Jens Axboe
` (2 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2007-01-03 7:48 UTC (permalink / raw)
To: linux-kernel; +Cc: Nick Piggin, akpm, Oleg Nesterov, Josh Triplett
From: Oleg Nesterov <oleg@tv-sign.ru>
Add rcutorture test for qrcu.
Works for me!
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
---
include/linux/srcu.h | 4 +-
kernel/rcutorture.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 71 insertions(+), 4 deletions(-)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index fcdb749..03a9010 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -64,8 +64,8 @@ struct qrcu_struct {
};
int init_qrcu_struct(struct qrcu_struct *qp);
-int qrcu_read_lock(struct qrcu_struct *qp);
-void qrcu_read_unlock(struct qrcu_struct *qp, int idx);
+int qrcu_read_lock(struct qrcu_struct *qp) __acquires(qp);
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx) __releases(qp);
void synchronize_qrcu(struct qrcu_struct *qp);
/**
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 482b11f..bd7fd49 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -465,6 +465,73 @@ static struct rcu_torture_ops srcu_ops = {
};
/*
+ * Definitions for qrcu torture testing.
+ */
+
+static struct qrcu_struct qrcu_ctl;
+
+static void qrcu_torture_init(void)
+{
+ init_qrcu_struct(&qrcu_ctl);
+ rcu_sync_torture_init();
+}
+
+static void qrcu_torture_cleanup(void)
+{
+ synchronize_qrcu(&qrcu_ctl);
+ cleanup_qrcu_struct(&qrcu_ctl);
+}
+
+static int qrcu_torture_read_lock(void) __acquires(&qrcu_ctl)
+{
+ return qrcu_read_lock(&qrcu_ctl);
+}
+
+static void qrcu_torture_read_unlock(int idx) __releases(&qrcu_ctl)
+{
+ qrcu_read_unlock(&qrcu_ctl, idx);
+}
+
+static int qrcu_torture_completed(void)
+{
+ return qrcu_ctl.completed;
+}
+
+static void qrcu_torture_synchronize(void)
+{
+ synchronize_qrcu(&qrcu_ctl);
+}
+
+static int qrcu_torture_stats(char *page)
+{
+ int cnt = 0;
+ int idx = qrcu_ctl.completed & 0x1;
+
+ cnt += sprintf(&page[cnt], "%s%s per-CPU(idx=%d):",
+ torture_type, TORTURE_FLAG, idx);
+
+ cnt += sprintf(&page[cnt], " (%d,%d)",
+ atomic_read(qrcu_ctl.ctr + 0),
+ atomic_read(qrcu_ctl.ctr + 1));
+
+ cnt += sprintf(&page[cnt], "\n");
+ return cnt;
+}
+
+static struct rcu_torture_ops qrcu_ops = {
+ .init = qrcu_torture_init,
+ .cleanup = qrcu_torture_cleanup,
+ .readlock = qrcu_torture_read_lock,
+ .readdelay = srcu_read_delay,
+ .readunlock = qrcu_torture_read_unlock,
+ .completed = qrcu_torture_completed,
+ .deferredfree = rcu_sync_torture_deferred_free,
+ .sync = qrcu_torture_synchronize,
+ .stats = qrcu_torture_stats,
+ .name = "qrcu"
+};
+
+/*
* Definitions for sched torture testing.
*/
@@ -503,8 +570,8 @@ static struct rcu_torture_ops sched_ops = {
};
static struct rcu_torture_ops *torture_ops[] =
- { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops, &srcu_ops,
- &sched_ops, NULL };
+ { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops,
+ &srcu_ops, &qrcu_ops, &sched_ops, NULL };
/*
* RCU torture writer kthread. Repeatedly substitutes a new structure
--
1.4.4.2.g02c9
* Re: [PATCH] 4/4 block: explicit plugging
[not found] ` <1167810508576-git-send-email-jens.axboe@oracle.com>
@ 2007-01-03 8:09 ` Andrew Morton
2007-01-03 8:22 ` Jens Axboe
2007-01-04 4:35 ` Nick Piggin
1 sibling, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2007-01-03 8:09 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, Nick Piggin, Nick Piggin
On Wed, 3 Jan 2007 08:48:28 +0100
Jens Axboe <jens.axboe@oracle.com> wrote:
> This is a patch to perform block device plugging explicitly in the submitting
> process context rather than implicitly by the block device.
I don't think anyone will regret the passing of address_space_operations.sync_page().
Do you have any benchmarks which got faster with these changes?
* Re: [PATCH] 4/4 block: explicit plugging
2007-01-03 8:09 ` Andrew Morton
@ 2007-01-03 8:22 ` Jens Axboe
2007-01-03 21:50 ` Chen, Kenneth W
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2007-01-03 8:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Nick Piggin, Nick Piggin
On Wed, Jan 03 2007, Andrew Morton wrote:
> On Wed, 3 Jan 2007 08:48:28 +0100
> Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > This is a patch to perform block device plugging explicitly in the submitting
> > process context rather than implicitly by the block device.
>
> I don't think anyone will regret the passing of
> address_space_operations.sync_page().
Hardly :-)
> Do you have any benchmarks which got faster with these changes?
On the hardware I have immediately available, I see no regressions wrt
performance. With instrumentation it's simple to demonstrate that most
of the queueing activity of an io heavy benchmark spends less time in
the kernel (most merging activity takes place outside of the queue lock,
hence queueing is lock free).
I've asked Ken to run this series on some of his big iron, I hope he'll
have some results for us soonish. I can run some pseudo benchmarks on a
4-way here with some simulated storage to demonstrate the locking
improvements.
I don't see 3/4 and 4/4 on lkml yet, I wonder if they got lost.
--
Jens Axboe
* [PATCH] 3/4 qrcu: add documentation
2007-01-03 7:48 [BLOCK] 0/4 explicit io plugging Jens Axboe
2007-01-03 7:48 ` [PATCH] 1/4 qrcu: "quick" srcu implementation Jens Axboe
2007-01-03 7:48 ` [PATCH] 2/4 qrcu: add rcutorture test Jens Axboe
@ 2007-01-03 8:31 ` Jens Axboe
2007-01-03 9:29 ` Tomas Carnecky
2007-01-03 9:41 ` [PATCH] 4/4 block: explicit plugging Jens Axboe
[not found] ` <1167810508576-git-send-email-jens.axboe@oracle.com>
4 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2007-01-03 8:31 UTC (permalink / raw)
To: linux-kernel; +Cc: Nick Piggin, akpm
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
---
Documentation/RCU/checklist.txt | 13 +++++++++++++
Documentation/RCU/rcu.txt | 6 ++++--
Documentation/RCU/torture.txt | 15 +++++++++------
Documentation/RCU/whatisRCU.txt | 3 +++
4 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index f4dffad..36d6185 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -259,3 +259,16 @@ over a rather long period of time, but improvements are always welcome!
Note that, rcu_assign_pointer() and rcu_dereference() relate to
SRCU just as they do to other forms of RCU.
+
+14. QRCU is very similar to SRCU, but features very fast grace-period
+ processing at the expense of heavier-weight read-side operations.
+ The correspondance between QRCU and SRCU is as follows:
+
+ QRCU SRCU
+
+ struct qrcu_struct struct srcu_struct
+ init_qrcu_struct() init_srcu_struct()
+ cleanup_qrcu_struct() cleanup_srcu_struct()
+ qrcu_read_lock() srcu_read_lock()
+ qrcu_read-unlock() srcu_read_unlock()
+ synchronize_qrcu() synchronize_srcu()
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt
index f84407c..ae1e54e 100644
--- a/Documentation/RCU/rcu.txt
+++ b/Documentation/RCU/rcu.txt
@@ -45,8 +45,10 @@ o How can I see where RCU is currently used in the Linux kernel?
Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
"rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh",
- "srcu_read_lock", "srcu_read_unlock", "synchronize_rcu",
- "synchronize_net", and "synchronize_srcu".
+ "qrcu_read_lock", qrcu_read_unlock", "srcu_read_lock",
+ "srcu_read_unlock", "synchronize_rcu", "synchronize_qrcu",
+ "synchronize_net", "synchronize_srcu", rcu_assign_pointer(),
+ and rcu_dereference().
o What guidelines should I follow when writing code that uses RCU?
diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt
index 25a3c3f..2cb0a3b 100644
--- a/Documentation/RCU/torture.txt
+++ b/Documentation/RCU/torture.txt
@@ -35,7 +35,8 @@ nfakewriters This is the number of RCU fake writer threads to run. Fake
different numbers of writers running in parallel.
nfakewriters defaults to 4, which provides enough parallelism
to trigger special cases caused by multiple writers, such as
- the synchronize_srcu() early return optimization.
+ the synchronize_srcu() and synchronize_qrcu() early return
+ optimizations.
stat_interval The number of seconds between output of torture
statistics (via printk()). Regardless of the interval,
@@ -54,11 +55,13 @@ test_no_idle_hz Whether or not to test the ability of RCU to operate in
idle CPUs. Boolean parameter, "1" to test, "0" otherwise.
torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API,
- "rcu_sync" for rcu_read_lock() with synchronous reclamation,
- "rcu_bh" for the rcu_read_lock_bh() API, "rcu_bh_sync" for
- rcu_read_lock_bh() with synchronous reclamation, "srcu" for
- the "srcu_read_lock()" API, and "sched" for the use of
- preempt_disable() together with synchronize_sched().
+ "rcu_sync" for rcu_read_lock() with synchronous
+ reclamation, "rcu_bh" for the rcu_read_lock_bh() API,
+ "rcu_bh_sync" for rcu_read_lock_bh() with synchronous
+ reclamation, "srcu" for the "srcu_read_lock()" API,
+ "qrcu" for the "qrcu_read_lock()" "quick grace period"
+ form of SRCU, and "sched" for the use of preempt_disable()
+ together with synchronize_sched().
verbose Enable debug printk()s. Default is disabled.
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index e0d6d99..e91650b 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -780,6 +780,8 @@ Markers for RCU read-side critical sections:
rcu_read_unlock_bh
srcu_read_lock
srcu_read_unlock
+ qrcu_read_lock
+ qrcu_read_unlock
RCU pointer/list traversal:
@@ -807,6 +809,7 @@ RCU grace period:
synchronize_sched
synchronize_rcu
synchronize_srcu
+ synchronize_qrcu
call_rcu
call_rcu_bh
--
1.4.4.2.g02c9
--
Jens Axboe
* Re: [PATCH] 3/4 qrcu: add documentation
2007-01-03 8:31 ` [PATCH] 3/4 qrcu: add documentation Jens Axboe
@ 2007-01-03 9:29 ` Tomas Carnecky
2007-01-03 9:39 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Tomas Carnecky @ 2007-01-03 9:29 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, Nick Piggin, akpm
Jens Axboe wrote:
> diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
> index f4dffad..36d6185 100644
> --- a/Documentation/RCU/checklist.txt
> +++ b/Documentation/RCU/checklist.txt
> @@ -259,3 +259,16 @@ over a rather long period of time, but improvements are always welcome!
>
> Note that, rcu_assign_pointer() and rcu_dereference() relate to
> SRCU just as they do to other forms of RCU.
> +
> +14. QRCU is very similar to SRCU, but features very fast grace-period
> + processing at the expense of heavier-weight read-side operations.
> + The correspondance between QRCU and SRCU is as follows:
> +
> + QRCU SRCU
> +
> + struct qrcu_struct struct srcu_struct
> + init_qrcu_struct() init_srcu_struct()
> + cleanup_qrcu_struct() cleanup_srcu_struct()
> + qrcu_read_lock() srcu_read_lock()
> + qrcu_read-unlock() srcu_read_unlock()
A small typo: qrcu_read_unlock()
tom
* Re: [PATCH] 3/4 qrcu: add documentation
2007-01-03 9:29 ` Tomas Carnecky
@ 2007-01-03 9:39 ` Jens Axboe
0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2007-01-03 9:39 UTC (permalink / raw)
To: Tomas Carnecky; +Cc: linux-kernel, Nick Piggin, akpm
On Wed, Jan 03 2007, Tomas Carnecky wrote:
> Jens Axboe wrote:
> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
> > index f4dffad..36d6185 100644
> > --- a/Documentation/RCU/checklist.txt
> > +++ b/Documentation/RCU/checklist.txt
> > @@ -259,3 +259,16 @@ over a rather long period of time, but improvements are always welcome!
> >
> > Note that, rcu_assign_pointer() and rcu_dereference() relate to
> > SRCU just as they do to other forms of RCU.
> > +
> > +14. QRCU is very similar to SRCU, but features very fast grace-period
> > + processing at the expense of heavier-weight read-side operations.
> > + The correspondance between QRCU and SRCU is as follows:
> > +
> > + QRCU SRCU
> > +
> > + struct qrcu_struct struct srcu_struct
> > + init_qrcu_struct() init_srcu_struct()
> > + cleanup_qrcu_struct() cleanup_srcu_struct()
> > + qrcu_read_lock() srcu_read_lock()
> > + qrcu_read-unlock() srcu_read_unlock()
>
> A small typo: qrcu_read_unlock()
Indeed, thanks, I'll update the repo.
--
Jens Axboe
* [PATCH] 4/4 block: explicit plugging
2007-01-03 7:48 [BLOCK] 0/4 explicit io plugging Jens Axboe
` (2 preceding siblings ...)
2007-01-03 8:31 ` [PATCH] 3/4 qrcu: add documentation Jens Axboe
@ 2007-01-03 9:41 ` Jens Axboe
[not found] ` <1167810508576-git-send-email-jens.axboe@oracle.com>
4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2007-01-03 9:41 UTC (permalink / raw)
To: linux-kernel; +Cc: Nick Piggin, akpm
[-- Attachment #1: Type: text/plain, Size: 2653 bytes --]
Not much luck with the 4th patch, I guess it's too big. I've attached it
gzipped now, with the description inlined.
---
Nick writes:
This is a patch to perform block device plugging explicitly in the submitting
process context rather than implicitly by the block device.
There are several advantages to plugging in process context over plugging
by the block device:
- Implicit plugging is only active when the queue empties, so any
advantages are lost if there is parallel IO occurring. Not so with
explicit plugging.
- Implicit plugging relies on a timer and watermarks and a kind-of-explicit
directive in lock_page which directs plugging. These are heuristics and
can cost performance due to holding a block device idle longer than it
should be. Explicit plugging avoids most of these issues by only holding
the device idle when it is known more requests will be submitted.
- This lock_page directive uses a roundabout way to attempt to minimise
intrusiveness of plugging on the VM. In doing so, it gets needlessly
complex: the VM really is in a good position to direct the block layer
as to the nature of its requests, so there is no need to try to hide
the fact.
- Explicit plugging keeps a process-private queue of requests being held.
This offers some advantages over immediately sending requests to the
block device: firstly, merging can be attempted on requests in this list
(currently only attempted on the head of the list) without taking any
locks; secondly, when unplugging occurs, the requests can be delivered
to the block device queue in a batch, thus the lock acquisitions can be
batched up.
On a parallel tiobench benchmark, of the 800 000 calls to __make_request
performed, this patch avoids 490 000 (62%) of queue_lock acquisitions by
early merging on the private plugged list.
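The process-private list described above can be illustrated with a small
single-threaded sketch, assuming a toy request model. All names here
(req_model, queue_model, plug_model, plug_submit, plug_flush) are
hypothetical and not taken from the patch; the "lock" is only a counter of
acquisitions, and merging is attempted only against the most recently held
request.

```c
#include <assert.h>
#include <stddef.h>

#define PLUG_MAX  16
#define QUEUE_MAX 64

/* Hypothetical request: a run of contiguous sectors. */
struct req_model {
	unsigned long sector;
	unsigned long nr_sectors;
};

/* Hypothetical shared device queue; lock_count stands in for queue_lock. */
struct queue_model {
	struct req_model reqs[QUEUE_MAX];
	size_t nr;
	unsigned long lock_count;	/* times the "queue lock" was taken */
};

/* Hypothetical process-private plug list; touching it needs no lock. */
struct plug_model {
	struct req_model held[PLUG_MAX];
	size_t nr;
};

/* Unplug: hand every held request to the queue under one lock acquisition. */
static void plug_flush(struct plug_model *plug, struct queue_model *q)
{
	size_t i;

	if (plug->nr == 0)
		return;
	assert(q->nr + plug->nr <= QUEUE_MAX);
	q->lock_count++;		/* one acquisition for the whole batch */
	for (i = 0; i < plug->nr; i++)
		q->reqs[q->nr++] = plug->held[i];
	plug->nr = 0;
}

/* Submit: merge into the private list if possible, otherwise hold it. */
static void plug_submit(struct plug_model *plug, struct queue_model *q,
			unsigned long sector, unsigned long nr_sectors)
{
	struct req_model *last = plug->nr ? &plug->held[plug->nr - 1] : NULL;

	/* Back-merge with the most recent held request, without any lock. */
	if (last && last->sector + last->nr_sectors == sector) {
		last->nr_sectors += nr_sectors;
		return;
	}
	if (plug->nr == PLUG_MAX)	/* private list full: flush early */
		plug_flush(plug, q);
	plug->held[plug->nr].sector = sector;
	plug->held[plug->nr].nr_sectors = nr_sectors;
	plug->nr++;
}

/* Returns 0 if four submissions end up as two requests and one lock. */
int plug_model_demo(void)
{
	struct queue_model q = { .nr = 0, .lock_count = 0 };
	struct plug_model plug = { .nr = 0 };

	plug_submit(&plug, &q, 0, 8);	/* new request */
	plug_submit(&plug, &q, 8, 8);	/* merges: sectors 0..15 */
	plug_submit(&plug, &q, 16, 8);	/* merges: sectors 0..23 */
	plug_submit(&plug, &q, 100, 8);	/* not contiguous: second request */
	plug_flush(&plug, &q);		/* explicit unplug */

	return (q.nr == 2 && q.lock_count == 1 &&
		q.reqs[0].nr_sectors == 24 && q.reqs[1].sector == 100) ? 0 : 1;
}
```

In this toy model, four submissions reach the device queue as two requests
with a single lock acquisition; without the private list each submission
would have taken the lock on its own.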
Signed-off-by: Nick Piggin <npiggin@suse.de>
Changes so far by me:
- Don't invoke ->request_fn() in blk_queue_invalidate_tags
- Fixup all filesystems for block_sync_page()
- Add blk_delay_queue() to handle the old plugging-on-shortage usage.
- Unconditionally run replug_current_nested() in ioschedule()
- Fixup queue start/stop
- Fixup all the remaining drivers
- Change the namespace (prefix the plug functions with blk_)
- Fixup ext4
- Dead code removal
- Fixup blktrace plug/unplug notifications
- __make_request() cleanups
- bio_sync() fixups
- Kill queue empty checking
- Make barriers work again, using QRCU
- Make blk_sync_queue() work again, reuse barrier SRCU handling
This patch needs more work and some dedicated testing.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
--
Jens Axboe
[-- Attachment #2: 0004-block-explicit-plugging.txt.gz --]
[-- Type: application/x-gzip, Size: 30605 bytes --]
* RE: [PATCH] 4/4 block: explicit plugging
2007-01-03 8:22 ` Jens Axboe
@ 2007-01-03 21:50 ` Chen, Kenneth W
2007-01-03 22:29 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Chen, Kenneth W @ 2007-01-03 21:50 UTC (permalink / raw)
To: Jens Axboe, Andrew Morton; +Cc: linux-kernel, Nick Piggin, Nick Piggin
Jens Axboe wrote on Wednesday, January 03, 2007 12:22 AM
> > Do you have any benchmarks which got faster with these changes?
>
> On the hardware I have immediately available, I see no regressions wrt
> performance. With instrumentation it's simple to demonstrate that most
> of the queueing activity of an io heavy benchmark spends less time in
> the kernel (most merging activity takes place outside of the queue lock,
> hence queueing is lock free).
>
> I've asked Ken to run this series on some of his big iron, I hope he'll
> have some results for us soonish.
We are having some trouble with the patch set that some of our fiber channel
host controller doesn't initialize properly anymore and thus lost whole
bunch of disks (somewhere around 200 disks out of 900) at boot time.
Presumably FC loop initialization command are done through block layer etc.
I haven't looked into the problem closely.
Jens, I assume the spin lock bug in __blk_run_queue is fixed in this patch
set?
- Ken
* Re: [PATCH] 4/4 block: explicit plugging
2007-01-03 21:50 ` Chen, Kenneth W
@ 2007-01-03 22:29 ` Jens Axboe
2007-01-03 22:34 ` Chen, Kenneth W
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2007-01-03 22:29 UTC (permalink / raw)
To: Chen, Kenneth W; +Cc: Andrew Morton, linux-kernel, Nick Piggin, Nick Piggin
On Wed, Jan 03 2007, Chen, Kenneth W wrote:
> Jens Axboe wrote on Wednesday, January 03, 2007 12:22 AM
> > > Do you have any benchmarks which got faster with these changes?
> >
> > On the hardware I have immediately available, I see no regressions wrt
> > performance. With instrumentation it's simple to demonstrate that most
> > of the queueing activity of an io heavy benchmark spends less time in
> > the kernel (most merging activity takes place outside of the queue lock,
> > hence queueing is lock free).
> >
> > I've asked Ken to run this series on some of his big iron, I hope he'll
> > have some results for us soonish.
>
> We are having some trouble with the patch set that some of our fiber channel
> host controller doesn't initialize properly anymore and thus lost whole
> bunch of disks (somewhere around 200 disks out of 900) at boot time.
> Presumably FC loop initialization command are done through block layer etc.
> I haven't looked into the problem closely.
>
> Jens, I assume the spin lock bug in __blk_run_queue is fixed in this patch
> set?
It is. Are you still seeing problems after the initial mail exchange we
had prior to christmas, or are you referencing that initial problem?
It's not likely to be a block layer issue, more likely the SCSI <->
block interactions. If you mail me a new dmesg (if your problem is with
the __blk_run_queue() fixups), I can take a look. Otherwise please do
test with the __blk_run_queue() fixup, just use the current patchset.
--
Jens Axboe
* RE: [PATCH] 4/4 block: explicit plugging
2007-01-03 22:29 ` Jens Axboe
@ 2007-01-03 22:34 ` Chen, Kenneth W
2007-01-04 14:39 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Chen, Kenneth W @ 2007-01-03 22:34 UTC (permalink / raw)
To: 'Jens Axboe'
Cc: Andrew Morton, linux-kernel, Nick Piggin, Nick Piggin
Jens Axboe wrote on Wednesday, January 03, 2007 2:30 PM
> > We are having some trouble with the patch set that some of our fiber channel
> > host controller doesn't initialize properly anymore and thus lost whole
> > bunch of disks (somewhere around 200 disks out of 900) at boot time.
> > Presumably FC loop initialization command are done through block layer etc.
> > I haven't looked into the problem closely.
> >
> > Jens, I assume the spin lock bug in __blk_run_queue is fixed in this patch
> > set?
>
> It is. Are you still seeing problems after the initial mail exchange we
> had prior to christmas,
Yes. Not the same kernel panic, but a problem with FC loop reset itself.
> or are you referencing that initial problem?
No, we got past that point thanks to the bug fix patch you gave me
prior to Christmas. That fixed a kernel panic on boot up.
> It's not likely to be a block layer issue, more likely the SCSI <->
> block interactions. If you mail me a new dmesg (if your problem is with
> the __blk_run_queue() fixups), I can take a look. Otherwise please do
> test with the __blk_run_queue() fixup, just use the current patchset.
I will just retake the tip of your plug tree and retest.
* Re: [PATCH] 4/4 block: explicit plugging
[not found] ` <1167810508576-git-send-email-jens.axboe@oracle.com>
2007-01-03 8:09 ` Andrew Morton
@ 2007-01-04 4:35 ` Nick Piggin
2007-01-05 7:23 ` Jens Axboe
1 sibling, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2007-01-04 4:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, akpm, Nick Piggin, Trond Myklebust, Neil Brown,
Mark Fasheh, Chen, Kenneth W
Jens Axboe wrote:
> Nick writes:
>
> This is a patch to perform block device plugging explicitly in the submitting
> process context rather than implicitly by the block device.
Hi Jens,
Hey thanks for doing so much hard work with this, I couldn't have fixed
all the block layer stuff myself. QRCU looks like a good solution for the
barrier/sync operations (/me worried that one wouldn't exist), and a
novel use of RCU!
The only thing I had been thinking about before it is ready for primetime
-- as far as the VM side of things goes -- is whether we should change
the hard calls to address_space operations, such that they might be
avoided or customised when there is no backing block device?
I'm sure the answer to this is "yes", so I have an idea for a simple
implementation... but I'd like to hear thoughts from network fs / raid
people?
Nick
--
SUSE Labs, Novell Inc.
* Re: [PATCH] 4/4 block: explicit plugging
2007-01-03 22:34 ` Chen, Kenneth W
@ 2007-01-04 14:39 ` Jens Axboe
2007-01-05 22:04 ` Chen, Kenneth W
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2007-01-04 14:39 UTC (permalink / raw)
To: Chen, Kenneth W; +Cc: Andrew Morton, linux-kernel, Nick Piggin, Nick Piggin
On Wed, Jan 03 2007, Chen, Kenneth W wrote:
> Jens Axboe wrote on Wednesday, January 03, 2007 2:30 PM
> > > We are having some trouble with the patch set that some of our fiber channel
> > > host controller doesn't initialize properly anymore and thus lost whole
> > > bunch of disks (somewhere around 200 disks out of 900) at boot time.
> > > Presumably FC loop initialization command are done through block layer etc.
> > > I haven't looked into the problem closely.
> > >
> > > Jens, I assume the spin lock bug in __blk_run_queue is fixed in this patch
> > > set?
> >
> > It is. Are you still seeing problems after the initial mail exchange we
> > had prior to christmas,
>
> Yes. Not the same kernel panic, but a problem with FC loop reset itself.
>
>
> > or are you referencing that initial problem?
>
> No. We got past that point thanks to the bug fix patch you gave me
> prior to Christmas. That fixed a kernel panic on boot up.
>
>
> > It's not likely to be a block layer issue, more likely the SCSI <->
> > block interactions. If you mail me a new dmesg (if your problem is with
> > the __blk_run_queue() fixups), I can take a look. Otherwise please do
> > test with the __blk_run_queue() fixup, just use the current patchset.
>
> I will just retake the tip of your plug tree and retest.
That would be great! There's a busy race fixed in the current branch,
make sure that one is included as well.
From 9174fea2184187209b1f851137bd1612728fae2c Mon Sep 17 00:00:00 2001
From: Jens Axboe <jens.axboe@oracle.com>
Date: Thu, 4 Jan 2007 10:42:33 +0100
Subject: [PATCH] scsi: race in checking sdev->device_busy
Save some code, create a new out label for the path that already checks
the busy count and delays the queue if necessary.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
drivers/scsi/scsi_lib.c | 11 ++++-------
1 files changed, 4 insertions(+), 7 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index fce5e2f..3ffa35d 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1509,12 +1509,9 @@ static void scsi_request_fn(struct request_queue *q)
 		 * Dispatch the command to the low-level driver.
 		 */
 		rtn = scsi_dispatch_cmd(cmd);
-		if (rtn) {
-			if (sdev->device_busy == 0)
-				blk_delay_queue(q, SCSI_QUEUE_DELAY);
-			goto out_nolock;
-		}
 		spin_lock_irq(q->queue_lock);
+		if (rtn)
+			goto out_delay;
 	}

 	goto out;
@@ -1533,13 +1530,13 @@ static void scsi_request_fn(struct request_queue *q)
 	spin_lock_irq(q->queue_lock);
 	blk_requeue_request(q, req);
 	sdev->device_busy--;
+out_delay:
 	if (sdev->device_busy == 0)
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
- out:
+out:
 	/* must be careful here...if we trigger the ->remove() function
 	 * we cannot be holding the q lock */
 	spin_unlock_irq(q->queue_lock);
- out_nolock:
 	put_device(&sdev->sdev_gendev);
 	spin_lock_irq(q->queue_lock);
 }
--
1.5.0.rc0.gd222
--
Jens Axboe
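
[Illustrative sketch, not part of the original mail.] In the diff above,
the old code looked at sdev->device_busy before spin_lock_irq() was
re-taken, while the new code takes the lock first and then jumps to a
shared out_delay label. The userspace model below mirrors only that
control flow. Everything in it (struct fake_sdev, fake_lock(),
fake_unlock(), fake_delay_queue(), request_fn_sketch() and its
parameters) is a hypothetical stand-in invented for illustration, not a
kernel API, and it makes no claim about how the real scsi_request_fn()
behaves beyond what the diff shows.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins; none of these are real kernel APIs. */
struct fake_sdev {
    int device_busy;  /* commands currently outstanding */
    int lock_held;    /* 1 while the pretend queue lock is held */
    int delays;       /* how often the queue was asked to re-run later */
};

static void fake_lock(struct fake_sdev *s)
{
    assert(!s->lock_held);
    s->lock_held = 1;
}

static void fake_unlock(struct fake_sdev *s)
{
    assert(s->lock_held);
    s->lock_held = 0;
}

/* Stand-in for blk_delay_queue(): only legal with the lock held. */
static void fake_delay_queue(struct fake_sdev *s)
{
    assert(s->lock_held);
    s->delays++;
}

/*
 * One pass of the patched flow. Entered and left with the lock held.
 * host_ready == 0 models the requeue path, dispatch_rtn != 0 models a
 * failed dispatch.
 */
static void request_fn_sketch(struct fake_sdev *s, int host_ready,
                              int dispatch_rtn)
{
    assert(s->lock_held);
    fake_unlock(s);              /* lock is dropped around the dispatch */

    if (!host_ready)
        goto not_ready;

    /* ... the dispatch itself would happen here ... */
    fake_lock(s);                /* re-take the lock before any check */
    if (dispatch_rtn)
        goto out_delay;
    goto out;

not_ready:
    fake_lock(s);
    /* ... the request would be requeued here ... */
    s->device_busy--;
out_delay:
    /* Single place where the busy count is checked, lock held. */
    if (s->device_busy == 0)
        fake_delay_queue(s);
out:
    assert(s->lock_held);        /* must return with the lock held */
}
```

Both the failed-dispatch path and the requeue path now reach the busy
check through the same label, so the check exists once and always runs
under the lock. The assert() in fake_delay_queue() would fire if a path
reached it unlocked, which is the shape of the problem the patch subject
describes.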
* Re: [PATCH] 4/4 block: explicit plugging
2007-01-04 4:35 ` Nick Piggin
@ 2007-01-05 7:23 ` Jens Axboe
0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2007-01-05 7:23 UTC (permalink / raw)
To: Nick Piggin
Cc: linux-kernel, akpm, Nick Piggin, Trond Myklebust, Neil Brown,
Mark Fasheh, Chen, Kenneth W
On Thu, Jan 04 2007, Nick Piggin wrote:
> Jens Axboe wrote:
> >Nick writes:
> >
> >This is a patch to perform block device plugging explicitly in the
> >submitting
> >process context rather than implicitly by the block device.
>
> Hi Jens,
>
> Hey thanks for doing so much hard work with this, I couldn't have fixed
> all the block layer stuff myself. QRCU looks like a good solution for the
> barrier/sync operations (/me worried that one wouldn't exist), and a
> novel use of RCU!
>
> The only thing I had been thinking about before it is ready for primetime
> -- as far as the VM side of things goes -- is whether we should change
> the hard calls to address_space operations, such that they might be
> avoided or customised when there is no backing block device?
>
> I'm sure the answer to this is "yes", so I have an idea for a simple
> implementation... but I'd like to hear thoughts from network fs / raid
> people?
I suppose that would be the proper thing to do, for non __make_request()
operated backing devices. I'll add the hooks, then we can cook up a raid
implementation if need be.
--
Jens Axboe
* RE: [PATCH] 4/4 block: explicit plugging
2007-01-04 14:39 ` Jens Axboe
@ 2007-01-05 22:04 ` Chen, Kenneth W
0 siblings, 0 replies; 16+ messages in thread
From: Chen, Kenneth W @ 2007-01-05 22:04 UTC (permalink / raw)
To: 'Jens Axboe'
Cc: Andrew Morton, linux-kernel, Nick Piggin, Nick Piggin
Andrew Morton wrote on Wednesday, January 03, 2007 12:09 AM
> Do you have any benchmarks which got faster with these changes?
Jens Axboe wrote on Wednesday, January 03, 2007 12:22 AM
> I've asked Ken to run this series on some of his big iron, I hope he'll
> have some results for us soonish. I can run some pseudo benchmarks on a
> 4-way here with some simulated storage to demonstrate the locking
> improvements.
> Jens Axboe wrote on Thursday, January 04, 2007 6:39 AM
> > I will just retake the tip of your plug tree and retest.
>
> That would be great! There's a busy race fixed in the current branch,
> make sure that one is included as well.
Good news: the tip of the plug tree fixed the FC loop reset issue we were
seeing earlier.
Performance-wise, our big db benchmark run came out with a 0.14% regression
compared to 2.6.20-rc2. It is small enough that we declared it a noise-level
change. Unfortunately our internal profile tool broke on 2.6.20-rc2,
so I don't have an execution profile to post.
- Ken