* [PATCH 1/3] block: add blk-iopoll, a NAPI like approach for block devices
2009-08-06 19:58 [PATCH 0/3]: blk-iopoll, a polled completion API for block devices Jens Axboe
@ 2009-08-06 19:58 ` Jens Axboe
2009-08-06 21:32 ` Alan Cox
2009-08-06 19:58 ` [PATCH 2/3] libata: add support for blk-iopoll Jens Axboe
` (2 subsequent siblings)
3 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-06 19:58 UTC (permalink / raw)
To: linux-kernel, linux-scsi; +Cc: Eric.Moore, jeff, Jens Axboe
This borrows some code from NAPI and implements a polled completion
mode for block devices. The idea is the same as NAPI - instead of
doing the command completion when the irq occurs, schedule a dedicated
softirq in the hopes that we will complete more IO when the iopoll
handler is invoked. Devices have a budget of commands assigned, and will
stay in polled mode as long as they continue to consume their budget
from the iopoll softirq handler. If they do not, the device is set back
to interrupt completion mode.
This patch holds the core bits for blk-iopoll, device driver support
sold separately.
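To illustrate the driver-side wiring, here is a minimal sketch (the my_* names
are placeholders, not part of this patch; the real conversions follow in the
libata/ahci and mptfusion patches):

static irqreturn_t my_driver_intr(int irq, void *data)
{
        struct my_host *host = data;

        if (!blk_iopoll_enabled) {
                my_complete_commands(host, -1U);
                return IRQ_HANDLED;
        }

        /* non-zero return means we own the instance and may schedule it */
        if (blk_iopoll_sched_prep(&host->iopoll)) {
                my_disable_host_irq(host);
                blk_iopoll_sched(&host->iopoll);
        }

        return IRQ_HANDLED;
}

static int my_driver_iopoll(struct blk_iopoll *iop, int budget)
{
        struct my_host *host = container_of(iop, struct my_host, iopoll);
        int done = my_complete_commands(host, budget);

        if (done < budget) {
                /* budget not exhausted: leave polled mode, unmask the irq */
                blk_iopoll_complete(iop);
                my_enable_host_irq(host);
        }

        return done;
}

Setup pairs blk_iopoll_init(&host->iopoll, weight, my_driver_iopoll) with
blk_iopoll_enable(), and teardown calls blk_iopoll_disable().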
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
diff --git a/block/Makefile b/block/Makefile
index 6c54ed0..ba74ca6 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
- ioctl.o genhd.o scsi_ioctl.o
+ blk-iopoll.o ioctl.o genhd.o scsi_ioctl.o
obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
diff --git a/block/blk-iopoll.c b/block/blk-iopoll.c
new file mode 100644
index 0000000..ca56420
--- /dev/null
+++ b/block/blk-iopoll.c
@@ -0,0 +1,227 @@
+/*
+ * Functions related to interrupt-poll handling in the block layer. This
+ * is similar to NAPI for network devices.
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/interrupt.h>
+#include <linux/cpu.h>
+#include <linux/blk-iopoll.h>
+#include <linux/delay.h>
+
+#include "blk.h"
+
+int blk_iopoll_enabled = 1;
+EXPORT_SYMBOL(blk_iopoll_enabled);
+
+static unsigned int blk_iopoll_budget __read_mostly = 256;
+
+static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
+
+/**
+ * blk_iopoll_sched - Schedule a run of the iopoll handler
+ * @iop: The parent iopoll structure
+ *
+ * Description:
+ * Add this blk_iopoll structure to the pending poll list and trigger the
+ * raise of the blk iopoll softirq. The driver must already have gotten a
+ * successful return from blk_iopoll_sched_prep() before calling this.
+ **/
+void blk_iopoll_sched(struct blk_iopoll *iop)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ list_add_tail(&iop->list, &__get_cpu_var(blk_cpu_iopoll));
+ __raise_softirq_irqoff(BLOCK_IOPOLL_SOFTIRQ);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(blk_iopoll_sched);
+
+/**
+ * __blk_iopoll_complete - Mark this @iop as un-polled again
+ * @iop: The parent iopoll structure
+ *
+ * Description:
+ * See blk_iopoll_complete(). This function must be called with interrupts
+ * disabled.
+ **/
+void __blk_iopoll_complete(struct blk_iopoll *iop)
+{
+ list_del(&iop->list);
+ smp_mb__before_clear_bit();
+ clear_bit_unlock(IOPOLL_F_SCHED, &iop->state);
+}
+EXPORT_SYMBOL(__blk_iopoll_complete);
+
+/**
+ * blk_iopoll_complete - Mark this @iop as un-polled again
+ * @iop: The parent iopoll structure
+ *
+ * Description:
+ * If a driver consumes less than the assigned budget in its run of the
+ * iopoll handler, it'll end the polled mode by calling this function. The
+ * iopoll handler will not be invoked again before blk_iopoll_sched_prep()
+ * is called.
+ **/
+void blk_iopoll_complete(struct blk_iopoll *iopoll)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __blk_iopoll_complete(iopoll);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(blk_iopoll_complete);
+
+static void blk_iopoll_softirq(struct softirq_action *h)
+{
+ struct list_head *list = &__get_cpu_var(blk_cpu_iopoll);
+ int rearm = 0, budget = blk_iopoll_budget;
+ unsigned long start_time = jiffies;
+
+ local_irq_disable();
+
+ while (!list_empty(list)) {
+ struct blk_iopoll *iop;
+ int work, weight;
+
+ /*
+ * If softirq window is exhausted then punt.
+ */
+ if (budget <= 0 || time_after(jiffies, start_time)) {
+ rearm = 1;
+ break;
+ }
+
+ local_irq_enable();
+
+ /* Even though interrupts have been re-enabled, this
+ * access is safe because interrupts can only add new
+ * entries to the tail of this list, and only ->poll()
+ * calls can remove this head entry from the list.
+ */
+ iop = list_entry(list->next, struct blk_iopoll, list);
+
+ weight = iop->weight;
+ work = 0;
+ if (test_bit(IOPOLL_F_SCHED, &iop->state))
+ work = iop->poll(iop, weight);
+
+ budget -= work;
+
+ local_irq_disable();
+
+ /*
+ * Drivers must not modify the iopoll state if they
+ * consume their assigned weight (or more; some drivers can't
+ * easily just stop processing, they have to complete an
+ * entire mask of commands). In such cases this code
+ * still "owns" the iopoll instance and therefore can
+ * move the instance around on the list at will.
+ */
+ if (work >= weight) {
+ if (blk_iopoll_disable_pending(iop))
+ __blk_iopoll_complete(iop);
+ else
+ list_move_tail(&iop->list, list);
+ }
+ }
+
+ if (rearm)
+ __raise_softirq_irqoff(BLOCK_IOPOLL_SOFTIRQ);
+
+ local_irq_enable();
+}
+
+/**
+ * blk_iopoll_disable - Disable iopoll on this @iop
+ * @iop: The parent iopoll structure
+ *
+ * Description:
+ * Disable io polling and wait for any pending callbacks to have completed.
+ **/
+void blk_iopoll_disable(struct blk_iopoll *iop)
+{
+ set_bit(IOPOLL_F_DISABLE, &iop->state);
+ while (test_and_set_bit(IOPOLL_F_SCHED, &iop->state))
+ msleep(1);
+ clear_bit(IOPOLL_F_DISABLE, &iop->state);
+}
+EXPORT_SYMBOL(blk_iopoll_disable);
+
+/**
+ * blk_iopoll_enable - Enable iopoll on this @iop
+ * @iop: The parent iopoll structure
+ *
+ * Description:
+ * Enable iopoll on this @iop. Note that this does not schedule a run of
+ * the handler, it only marks the instance as active.
+ **/
+void blk_iopoll_enable(struct blk_iopoll *iop)
+{
+ BUG_ON(!test_bit(IOPOLL_F_SCHED, &iop->state));
+ smp_mb__before_clear_bit();
+ clear_bit_unlock(IOPOLL_F_SCHED, &iop->state);
+}
+EXPORT_SYMBOL(blk_iopoll_enable);
+
+/**
+ * blk_iopoll_init - Initialize this @iop
+ * @iop: The parent iopoll structure
+ * @weight: The default weight (or command completion budget)
+ * @poll_fn: The handler to invoke
+ *
+ * Description:
+ * Initialize this blk_iopoll structure. Before being actively used, the
+ * driver must call blk_iopoll_enable().
+ **/
+void blk_iopoll_init(struct blk_iopoll *iop, int weight, blk_iopoll_fn *poll_fn)
+{
+ memset(iop, 0, sizeof(*iop));
+ INIT_LIST_HEAD(&iop->list);
+ iop->weight = weight;
+ iop->poll = poll_fn;
+ set_bit(IOPOLL_F_SCHED, &iop->state);
+}
+EXPORT_SYMBOL(blk_iopoll_init);
+
+static int __cpuinit blk_iopoll_cpu_notify(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ /*
+ * If a CPU goes away, splice its entries to the current CPU
+ * and trigger a run of the softirq
+ */
+ if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
+ int cpu = (unsigned long) hcpu;
+
+ local_irq_disable();
+ list_splice_init(&per_cpu(blk_cpu_iopoll, cpu),
+ &__get_cpu_var(blk_cpu_iopoll));
+ __raise_softirq_irqoff(BLOCK_IOPOLL_SOFTIRQ);
+ local_irq_enable();
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata blk_iopoll_cpu_notifier = {
+ .notifier_call = blk_iopoll_cpu_notify,
+};
+
+static __init int blk_iopoll_setup(void)
+{
+ int i;
+
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&per_cpu(blk_cpu_iopoll, i));
+
+ open_softirq(BLOCK_IOPOLL_SOFTIRQ, blk_iopoll_softirq);
+ register_hotcpu_notifier(&blk_iopoll_cpu_notifier);
+ return 0;
+}
+subsys_initcall(blk_iopoll_setup);
diff --git a/include/linux/blk-iopoll.h b/include/linux/blk-iopoll.h
new file mode 100644
index 0000000..b2e1739
--- /dev/null
+++ b/include/linux/blk-iopoll.h
@@ -0,0 +1,41 @@
+#ifndef BLK_IOPOLL_H
+#define BLK_IOPOLL_H
+
+struct blk_iopoll;
+typedef int (blk_iopoll_fn)(struct blk_iopoll *, int);
+
+struct blk_iopoll {
+ struct list_head list;
+ unsigned long state;
+ unsigned long data;
+ int weight;
+ int max;
+ blk_iopoll_fn *poll;
+};
+
+enum {
+ IOPOLL_F_SCHED = 0,
+ IOPOLL_F_DISABLE = 1,
+};
+
+static inline int blk_iopoll_sched_prep(struct blk_iopoll *iop)
+{
+ return !test_bit(IOPOLL_F_DISABLE, &iop->state) &&
+ !test_and_set_bit(IOPOLL_F_SCHED, &iop->state);
+}
+
+static inline int blk_iopoll_disable_pending(struct blk_iopoll *iop)
+{
+ return test_bit(IOPOLL_F_DISABLE, &iop->state);
+}
+
+extern void blk_iopoll_sched(struct blk_iopoll *);
+extern void blk_iopoll_init(struct blk_iopoll *, int, blk_iopoll_fn *);
+extern void blk_iopoll_complete(struct blk_iopoll *);
+extern void __blk_iopoll_complete(struct blk_iopoll *);
+extern void blk_iopoll_enable(struct blk_iopoll *);
+extern void blk_iopoll_disable(struct blk_iopoll *);
+
+extern int blk_iopoll_enabled;
+
+#endif
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 35e7df1..edd8d5c 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -344,6 +344,7 @@ enum
NET_TX_SOFTIRQ,
NET_RX_SOFTIRQ,
BLOCK_SOFTIRQ,
+ BLOCK_IOPOLL_SOFTIRQ,
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
HRTIMER_SOFTIRQ,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 98e0232..121837e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -91,6 +91,7 @@ extern int sysctl_nr_trim_pages;
#ifdef CONFIG_RCU_TORTURE_TEST
extern int rcutorture_runnable;
#endif /* #ifdef CONFIG_RCU_TORTURE_TEST */
+extern int blk_iopoll_enabled;
/* Constants used for minimum and maximum */
#ifdef CONFIG_DETECT_SOFTLOCKUP
@@ -989,7 +990,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
-
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "blk_iopoll",
+ .data = &blk_iopoll_enabled,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
* Re: [PATCH 1/3] block: add blk-iopoll, a NAPI like approach for block devices
2009-08-06 19:58 ` [PATCH 1/3] block: add blk-iopoll, a NAPI like approach " Jens Axboe
@ 2009-08-06 21:32 ` Alan Cox
2009-08-07 6:37 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: Alan Cox @ 2009-08-06 21:32 UTC (permalink / raw)
Cc: linux-kernel, linux-scsi, Eric.Moore, jeff, Jens Axboe
> doing the command completion when the irq occurs, schedule a dedicated
> softirq in the hopes that we will complete more IO when the iopoll
> handler is invoked. Devices have a budget of commands assigned, and will
> stay in polled mode as long as they continue to consume their budget
> from the iopoll softirq handler. If they do not, the device is set back
> to interrupt completion mode.
This seems a little odd for pure ATA except for NCQ commands. Normal ATA
is notoriously completion/reissue latency sensitive (to the point I
suspect we should be dequeuing 2 commands from SCSI and loading the next
in the completion handler as soon as we recover the result task file and
see no error, rather than going up and down the stack).
What do the numbers look like?
> This patch holds the core bits for blk-iopoll, device driver support
> sold separately.
You've been at Oracle too long ;) You'll be telling me it's not a
supported configuration next.
Alan
* Re: [PATCH 1/3] block: add blk-iopoll, a NAPI like approach for block devices
2009-08-06 21:32 ` Alan Cox
@ 2009-08-07 6:37 ` Jens Axboe
2009-08-07 8:38 ` Jeff Garzik
0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-07 6:37 UTC (permalink / raw)
To: Alan Cox; +Cc: linux-kernel, linux-scsi, Eric.Moore, jeff
On Thu, Aug 06 2009, Alan Cox wrote:
> > doing the command completion when the irq occurs, schedule a dedicated
> > softirq in the hopes that we will complete more IO when the iopoll
> > handler is invoked. Devices have a budget of commands assigned, and will
> > stay in polled mode as long as they continue to consume their budget
> > from the iopoll softirq handler. If they do not, the device is set back
> > to interrupt completion mode.
>
> This seems a little odd for pure ATA except for NCQ commands. Normal ATA
> is notoriously completion/reissue latency sensitive [to the point I
> suspect we should be dequeuing 2 commands from SCSI and loading the next
> in the completion handler as soon as we recover the result task file and
> see no error rather than going up and down the stack)
Yes certainly, it's only for devices that do queuing. If they don't,
then we will always have just the one command to complete. So not much
to poll! As to pre-prep for extra latency intensive devices, have you
tried experimenting with just pretending that non-ncq devices in libata
have a queue depth of 2? That should ensure that the first command
available upon completion of the existing command is already prepped.
Not sure how much time that would save, I would hope that our prep phase
isn't too slow to begin with (or that would be the place to fix :-)
> What do the numbers look like ?
On a slow box (with many cores), the benefits are quite huge:
blocksize   blk-iopoll   IOPS    IRQ/sec   Commands/IRQ
--------------------------------------------------------------------
512b        0            25168   ~19500    1.3
512b        1            30355   ~750      40
4096b       0            25612   ~21500    1.2
4096b       1            30231   ~1200     25
I suspect there's some cache interaction going on here too, but the
numbers do look very good. On a faster box (and different architecture),
on a test that does 50k IOPS, they perform identically but the iopoll
approach uses less CPU. The interrupt rate drops from 55k ints/sec to
39-40k ints/sec for that case.
These are all synthetic IO only benchmarks, I hope to have some numbers
for some mixed benchmarks soon too.
> > This patch holds the core bits for blk-iopoll, device driver support
> > sold separately.
>
> You've been at Oracle too long ;) You'll be telling me its not a
> supported configuration next.
;-)
--
Jens Axboe
* Re: [PATCH 1/3] block: add blk-iopoll, a NAPI like approach for block devices
2009-08-07 6:37 ` Jens Axboe
@ 2009-08-07 8:38 ` Jeff Garzik
2009-08-07 8:50 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: Jeff Garzik @ 2009-08-07 8:38 UTC (permalink / raw)
To: Jens Axboe; +Cc: Alan Cox, linux-kernel, linux-scsi, Eric.Moore
Jens Axboe wrote:
> On Thu, Aug 06 2009, Alan Cox wrote:
>>> doing the command completion when the irq occurs, schedule a dedicated
>>> softirq in the hopes that we will complete more IO when the iopoll
>>> handler is invoked. Devices have a budget of commands assigned, and will
>>> stay in polled mode as long as they continue to consume their budget
>>> from the iopoll softirq handler. If they do not, the device is set back
>>> to interrupt completion mode.
>> This seems a little odd for pure ATA except for NCQ commands. Normal ATA
>> is notoriously completion/reissue latency sensitive [to the point I
>> suspect we should be dequeuing 2 commands from SCSI and loading the next
>> in the completion handler as soon as we recover the result task file and
>> see no error rather than going up and down the stack)
>
> Yes certainly, it's only for devices that do queuing. If they don't,
> then we will always have just the one command to complete. So not much
> to poll! As to pre-prep for extra latency intensive devices, have you
> tried experimenting with just pretending that non-ncq devices in libata
> have a queue depth of 2? That should ensure that the first command
> available upon completion of the existing command is already prepped.
> Not sure how much time that would save, I would hope that our prep phase
> isn't too slow to begin with (or that would be the place to fix :-)
>
>> What do the numbers look like ?
>
> On a slow box (with many cores), the benefits are quite huge:
>
>
> blocksize blk-iopoll IOPS IRQ/sec Commands/IRQ
> --------------------------------------------------------------------
> 512b 0 25168 ~19500 1,3
> 512b 1 30355 ~750 40
> 4096b 0 25612 ~21500 1,2
> 4096b 1 30231 ~1200 25
>
> I suspect there's some cache interaction going on here too, but the
> numbers do look very good. On a faster box (and different architecture),
> on a test that does 50k IOPS, they perform identically but the iopoll
> approach uses less CPU. The interrupt rate drops from 55k ints/sec to
> 39-40k ints/sec for that case.
It's easy to move work from one place to another, so I would definitely
expect that IRQ/sec drops... but these are the more relevant numbers, IMO:
* CPU usage before/after
* latency before/after
Also, even for storage where command queueing is _possible_, there
is a problem case we saw with NAPI: sometimes the combination of a fast
computer and an under-100%-utilization workload can imply repeated cycles of
spin lock
irq disable
blk_iopoll_sched()
spin unlock
spin lock
handle a single command completion
spin unlock
blk_iopoll_complete()
which not only erases the benefit, but winds up being more costly, both
in terms of CPU usage and in terms of latency.
This makes measuring the problem much more difficult; the interesting
case I am highlighting does not occur when using a benchmarking tool to
keep a storage device at 100% utilization.
We don't want to optimize for the 100%-load case at the expense of the
_common case_, which is IMO utilization below 100%. Servers are not
100% busy all the time, which opens the possibility that a
split-completion scheme such as the one presented can actually use
_more_ CPU than the current, unmodified 2.6.31-rc kernel.
I'm not NAK'ing... just inserting some relevant NAPI field experience,
and hoping for some numbers that better measure the costs/benefits.
Jeff
* Re: [PATCH 1/3] block: add blk-iopoll, a NAPI like approach for block devices
2009-08-07 8:38 ` Jeff Garzik
@ 2009-08-07 8:50 ` Jens Axboe
2009-08-07 11:05 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-07 8:50 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Alan Cox, linux-kernel, linux-scsi, Eric.Moore
On Fri, Aug 07 2009, Jeff Garzik wrote:
> Jens Axboe wrote:
>> On Thu, Aug 06 2009, Alan Cox wrote:
>>>> doing the command completion when the irq occurs, schedule a dedicated
>>>> softirq in the hopes that we will complete more IO when the iopoll
>>>> handler is invoked. Devices have a budget of commands assigned, and will
>>>> stay in polled mode as long as they continue to consume their budget
>>>> from the iopoll softirq handler. If they do not, the device is set back
>>>> to interrupt completion mode.
>>> This seems a little odd for pure ATA except for NCQ commands. Normal ATA
>>> is notoriously completion/reissue latency sensitive [to the point I
>>> suspect we should be dequeuing 2 commands from SCSI and loading the next
>>> in the completion handler as soon as we recover the result task file and
>>> see no error rather than going up and down the stack)
>>
>> Yes certainly, it's only for devices that do queuing. If they don't,
>> then we will always have just the one command to complete. So not much
>> to poll! As to pre-prep for extra latency intensive devices, have you
>> tried experimenting with just pretending that non-ncq devices in libata
>> have a queue depth of 2? That should ensure that the first command
>> available upon completion of the existing command is already prepped.
>> Not sure how much time that would save, I would hope that our prep phase
>> isn't too slow to begin with (or that would be the place to fix :-)
>>
>>> What do the numbers look like ?
>>
>> On a slow box (with many cores), the benefits are quite huge:
>>
>>
>> blocksize blk-iopoll IOPS IRQ/sec Commands/IRQ
>> --------------------------------------------------------------------
>> 512b 0 25168 ~19500 1,3
>> 512b 1 30355 ~750 40
>> 4096b 0 25612 ~21500 1,2
>> 4096b 1 30231 ~1200 25
>>
>> I suspect there's some cache interaction going on here too, but the
>> numbers do look very good. On a faster box (and different architecture),
>> on a test that does 50k IOPS, they perform identically but the iopoll
>> approach uses less CPU. The interrupt rate drops from 55k ints/sec to
>> 39-40k ints/sec for that case.
>
> It's easy to move work from one place to another, so I would definitely
> expect that IRQ/sec drops... but these are the more relevant numbers,
> IMO:
>
> * CPU usage before/after
> * latency before/after
As I mentioned in the 0/3 email, latency for my tests was as good as or
better than the original, and CPU usage was lower. The former must largely
be due to decreased latency for commands successfully retired in addition
to the one that triggered the IRQ, since the latency for the first
command should be a little higher. Since we use softirq completion for
the command in the FIRST place anyway, it probably won't make any
difference (and the latency for the first command should be almost
immeasurably different from the non-iopoll path).
> Also, and even for storage where command queueing is _possible_, there
> is a problem case we saw with NAPI: sometimes the combination of a fast
> computer and an under-100%-utilization workload can imply repeated cycles
> of
>
> spin lock
> irq disable
> blk_iopoll_sched()
> spin unlock
>
> spin lock
> handle a single command completion
> spin unlock
> blk_iopoll_complete()
>
> which not only erases the benefit, but winds up being more costly, both
> in terms of CPU usage and in terms of latency.
It's clear that if you always only retire a single command AND you need
to lock at both ends, then it'll never be a win. I guess we could detect
such cases and be more cautious about when to enter iopoll, if that is
an issue. The ahci case looks like what you describe and I'm not seeing
any issues on the laptop, but I do concede that this is something to
look out for. If you look at the mpt conversion, we don't get cache line
bouncing on a lock there. As I also wrote, ahci is only really
interesting for test purposes, I don't envision a lot of real world win
there. But it widens the scope for testing :-)
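Going back to the "be more cautious about when to enter iopoll" idea, a
heuristic could look roughly like the sketch below (illustrative only;
nr_pending() and the my_* helpers stand in for whatever cheap readout a
driver has of how many completions are waiting, e.g. a reply FIFO depth
register):

        /* Skip the deferred path when there is little to poll */
        if (!blk_iopoll_enabled || nr_pending(host) <= 1)
                my_complete_commands(host, -1U);   /* inline, as today */
        else if (blk_iopoll_sched_prep(&host->iopoll)) {
                my_disable_host_irq(host);
                blk_iopoll_sched(&host->iopoll);
        }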
> This makes measuring the problem much more difficult; the interesting
> case I am highlighting does not occur when using a benchmarking tool to
> keep a storage device at 100% utilization.
Of course not, that case is primarily interesting to gauge potential
best case wins.
> We don't want to optimize for the 100%-load case at the expense of the
> _common case_, which is IMO utilization below 100%. Servers are not
> 100% busy all the time, which opens the possibility that a
> split-completion scheme such as the one presented can actually use
> _more_ CPU than the current, unmodified 2.6.31-rc kernel.
Depends, if the common case doesn't really suffer, then it doesn't
matter. Graceful load handling is important.
> I'm not NAK'ing... just inserting some relevant NAPI field experience,
> and hoping for some numbers that better measure the costs/benefits.
Appreciate you looking over this, and I'll certainly be posting some
more numbers on this. It'll largely depend on the storage, controller,
and workload.
--
Jens Axboe
* Re: [PATCH 1/3] block: add blk-iopoll, a NAPI like approach for block devices
2009-08-07 8:50 ` Jens Axboe
@ 2009-08-07 11:05 ` Jens Axboe
2009-08-07 11:31 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-07 11:05 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Alan Cox, linux-kernel, linux-scsi, Eric.Moore
On Fri, Aug 07 2009, Jens Axboe wrote:
> > I'm not NAK'ing... just inserting some relevant NAPI field experience,
> > and hoping for some numbers that better measure the costs/benefits.
>
> Appreciate you looking over this, and I'll certainly be posting some
> more numbers on this. It'll largely depend on both storage, controller,
> and worload.
Here's a quick set of numbers, beating on a drive with random reads.
Average of three runs for each; stddev is very low, so confidence in the
numbers should be high.
With iopoll=0 (disabled), stock:
blocksize      IOPS     ints/sec    usr      sys
------------------------------------------------------
4k             48401    ~30500      3.36%    27.26%
clat (usec): min=1052, max=21615, avg=10541.48, stdev=243.48
clat (usec): min=1066, max=22040, avg=10543.69, stdev=242.05
clat (usec): min=1057, max=23237, avg=10529.04, stdev=239.30
With iopoll=1
blocksize      IOPS     ints/sec    usr      sys
------------------------------------------------------
4k             48452    ~29000      3.37%    26.47%
clat (usec): min=1178, max=21662, avg=10542.72, stdev=247.87
clat (usec): min=1074, max=21783, avg=10534.14, stdev=240.54
clat (usec): min=1102, max=22123, avg=10509.42, stdev=225.73
The system utilization numbers are significant; I can say that for these
three runs, the iopoll=0 numbers were 27.25%, 27.28%, and 27.26%. For
iopoll=1, they were 26.44%, 26.26%, and 26.36%. The usr numbers were
equally stable. The latency numbers are too close to call here.
On a slower box, I get:
iopoll=0
blocksize      IOPS     ints/sec    usr      sys
------------------------------------------------------
4k             13100    ~12000      3.37%    19.70%
clat (msec): min=7, max=99, avg=78.32, stdev= 1.89
clat (msec): min=6, max=96, avg=77.00, stdev= 1.89
clat (msec): min=8, max=111, avg=78.27, stdev= 1.84
iopoll=1
blocksize      IOPS     ints/sec    usr      sys
------------------------------------------------------
4k             13745    ~400        3.30%    19.74%
clat (msec): min=8, max=91, avg=73.33, stdev= 1.66
clat (msec): min=7, max=90, avg=72.94, stdev= 1.64
clat (msec): min=6, max=103, avg=73.11, stdev= 1.77
Now, 13K iops isn't very much, so there isn't a huge performance
difference here and system utilization is practically identical. If we
were to hit 100k+ iops, I'm sure things would look different. If you
look at the IO completion latencies, they are actually better. This box
is a bit special, in that the 13k iops is purely limited by the softirq
that runs the completion. The controller only generates irqs on a single
CPU, so the softirqs all happen there (unless you use IO affinity by
setting rq_affinity=1, in which case you can reach 30k IOPS with the
same drive).
Anyway, just a first stack of numbers. Both of these are using the
mpt sas controller.
--
Jens Axboe
* Re: [PATCH 1/3] block: add blk-iopoll, a NAPI like approach for block devices
2009-08-07 11:05 ` Jens Axboe
@ 2009-08-07 11:31 ` Jens Axboe
2009-08-19 19:08 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-07 11:31 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Alan Cox, linux-kernel, linux-scsi, Eric.Moore
On Fri, Aug 07 2009, Jens Axboe wrote:
> On Fri, Aug 07 2009, Jens Axboe wrote:
> > > I'm not NAK'ing... just inserting some relevant NAPI field experience,
> > > and hoping for some numbers that better measure the costs/benefits.
> >
> > Appreciate you looking over this, and I'll certainly be posting some
> > more numbers on this. It'll largely depend on both storage, controller,
> > and worload.
>
> Here's a quick set of numbers, beating with random reads on a drive.
> Average of three runs for each, stddev is very low so confidence in the
> numbers should be high.
>
> With iopoll=0 (disabled), stock:
>
> blocksize IOPS ints/sec usr sys
> ------------------------------------------------------
> 4k 48401 ~30500 3.36% 27.26%
>
> clat (usec): min=1052, max=21615, avg=10541.48, stdev=243.48
> clat (usec): min=1066, max=22040, avg=10543.69, stdev=242.05
> clat (usec): min=1057, max=23237, avg=10529.04, stdev=239.30
>
>
> With iopoll=1
>
> blocksize IOPS ints/sec usr sys
> ------------------------------------------------------
> 4k 48452 ~29000 3.37% 26.47%
>
>
> clat (usec): min=1178, max=21662, avg=10542.72, stdev=247.87
> clat (usec): min=1074, max=21783, avg=10534.14, stdev=240.54
> clat (usec): min=1102, max=22123, avg=10509.42, stdev=225.73
Let's raise the bar a bit, this time using 8k reads on the faster box.
iopoll=0
blocksize      IOPS     ints/sec    usr      sys
------------------------------------------------------
8k             64050    ~76000      4.12%    45.01%
clat (usec): min=1326, max=18994, avg=7967.54, stdev=214.12
clat (usec): min=1325, max=25404, avg=7968.06, stdev=239.87
clat (usec): min=1273, max=21414, avg=7963.43, stdev=231.27
iopoll=1
blocksize      IOPS     ints/sec    usr      sys
------------------------------------------------------
8k             64162    ~55000      4.07%    42.32%
clat (usec): min=1380, max=19681, avg=7960.31, stdev=197.41
clat (usec): min=1370, max=37508, avg=7954.61, stdev=210.35
clat (usec): min=1332, max=23383, avg=7947.99, stdev=209.60
Again, purely a synthetic IO benchmark, but the sys reduction is
interesting.
--
Jens Axboe
* Re: [PATCH 1/3] block: add blk-iopoll, a NAPI like approach for block devices
2009-08-07 11:31 ` Jens Axboe
@ 2009-08-19 19:08 ` Jens Axboe
2009-08-20 11:30 ` [PATCH 1/3] block: add blk-iopoll, a NAPI like approach forblock devices jack wang
0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-19 19:08 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Alan Cox, linux-kernel, linux-scsi, Eric.Moore
On Fri, Aug 07 2009, Jens Axboe wrote:
> On Fri, Aug 07 2009, Jens Axboe wrote:
> > On Fri, Aug 07 2009, Jens Axboe wrote:
> > > > I'm not NAK'ing... just inserting some relevant NAPI field experience,
> > > > and hoping for some numbers that better measure the costs/benefits.
> > >
> > > Appreciate you looking over this, and I'll certainly be posting some
> > > more numbers on this. It'll largely depend on both storage, controller,
> > > and worload.
> >
> > Here's a quick set of numbers, beating with random reads on a drive.
> > Average of three runs for each, stddev is very low so confidence in the
> > numbers should be high.
> >
> > With iopoll=0 (disabled), stock:
> >
> > blocksize IOPS ints/sec usr sys
> > ------------------------------------------------------
> > 4k 48401 ~30500 3.36% 27.26%
> >
> > clat (usec): min=1052, max=21615, avg=10541.48, stdev=243.48
> > clat (usec): min=1066, max=22040, avg=10543.69, stdev=242.05
> > clat (usec): min=1057, max=23237, avg=10529.04, stdev=239.30
> >
> >
> > With iopoll=1
> >
> > blocksize IOPS ints/sec usr sys
> > ------------------------------------------------------
> > 4k 48452 ~29000 3.37% 26.47%
> >
> >
> > clat (usec): min=1178, max=21662, avg=10542.72, stdev=247.87
> > clat (usec): min=1074, max=21783, avg=10534.14, stdev=240.54
> > clat (usec): min=1102, max=22123, avg=10509.42, stdev=225.73
>
> Lets raise the bar a bit, this time using 8k reads on the faster box.
>
> iopoll=0
>
> blocksize IOPS ints/sec usr sys
> ------------------------------------------------------
> 8k 64050 ~76000 4.12% 45.01%
>
> clat (usec): min=1326, max=18994, avg=7967.54, stdev=214.12
> clat (usec): min=1325, max=25404, avg=7968.06, stdev=239.87
> clat (usec): min=1273, max=21414, avg=7963.43, stdev=231.27
>
>
> iopoll=1
>
> blocksize IOPS ints/sec usr sys
> ------------------------------------------------------
> 8k 64162 ~55000 4.07% 42.32%
>
> clat (usec): min=1380, max=19681, avg=7960.31, stdev=197.41
> clat (usec): min=1370, max=37508, avg=7954.61, stdev=210.35
> clat (usec): min=1332, max=23383, avg=7947.99, stdev=209.60
>
> Again, purely a synthetic IO benchmark, but the sys reduction is
> interesting.
Upping the ante a bit more, this time on a really fast box. Just to show
that iopoll works well even on just about the fastest CPU you can throw
at it.
iopoll=0
blocksize      IOPS     ints/sec    usr      sys
------------------------------------------------------
8k             64823    ~67000      4.75%    13.41%
clat (usec): min=1430, max=15770, avg=7880.60, stdev=118.95
clat (usec): min=1249, max=17810, avg=7887.34, stdev=120.39
clat (usec): min=1729, max=15473, avg=7888.13, stdev=118.70
iopoll=1
blocksize      IOPS     ints/sec    usr      sys
------------------------------------------------------
8k             64825    ~65000      4.37%    11.39%
clat (usec): min=1530, max=15195, avg=7910.01, stdev=111.43
clat (usec): min=1495, max=16180, avg=7885.11, stdev=115.56
clat (usec): min=1446, max=19733, avg=7890.46, stdev=139.05
--
Jens Axboe
* Re: [PATCH 1/3] block: add blk-iopoll, a NAPI like approach forblock devices
2009-08-19 19:08 ` Jens Axboe
@ 2009-08-20 11:30 ` jack wang
2009-08-20 11:38 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: jack wang @ 2009-08-20 11:30 UTC (permalink / raw)
To: 'Jens Axboe', 'Jeff Garzik'
Cc: 'Alan Cox', linux-kernel, linux-scsi, Eric.Moore
On Fri, Aug 07 2009, Jens Axboe wrote:
> On Fri, Aug 07 2009, Jens Axboe wrote:
> > On Fri, Aug 07 2009, Jens Axboe wrote:
> > > > I'm not NAK'ing... just inserting some relevant NAPI field
experience,
> > > > and hoping for some numbers that better measure the costs/benefits.
> > >
> > > Appreciate you looking over this, and I'll certainly be posting some
> > > more numbers on this. It'll largely depend on both storage,
controller,
> > > and worload.
> >
> > Here's a quick set of numbers, beating with random reads on a drive.
> > Average of three runs for each, stddev is very low so confidence in the
> > numbers should be high.
> >
> > With iopoll=0 (disabled), stock:
> >
> > blocksize IOPS ints/sec usr sys
> > ------------------------------------------------------
> > 4k 48401 ~30500 3.36% 27.26%
> >
> > clat (usec): min=1052, max=21615, avg=10541.48, stdev=243.48
> > clat (usec): min=1066, max=22040, avg=10543.69, stdev=242.05
> > clat (usec): min=1057, max=23237, avg=10529.04, stdev=239.30
> >
> >
> > With iopoll=1
> >
> > blocksize IOPS ints/sec usr sys
> > ------------------------------------------------------
> > 4k 48452 ~29000 3.37% 26.47%
> >
> >
> > clat (usec): min=1178, max=21662, avg=10542.72, stdev=247.87
> > clat (usec): min=1074, max=21783, avg=10534.14, stdev=240.54
> > clat (usec): min=1102, max=22123, avg=10509.42, stdev=225.73
>
> Lets raise the bar a bit, this time using 8k reads on the faster box.
>
> iopoll=0
>
> blocksize IOPS ints/sec usr sys
> ------------------------------------------------------
> 8k 64050 ~76000 4.12% 45.01%
>
> clat (usec): min=1326, max=18994, avg=7967.54, stdev=214.12
> clat (usec): min=1325, max=25404, avg=7968.06, stdev=239.87
> clat (usec): min=1273, max=21414, avg=7963.43, stdev=231.27
>
>
> iopoll=1
>
> blocksize IOPS ints/sec usr sys
> ------------------------------------------------------
> 8k 64162 ~55000 4.07% 42.32%
>
> clat (usec): min=1380, max=19681, avg=7960.31, stdev=197.41
> clat (usec): min=1370, max=37508, avg=7954.61, stdev=210.35
> clat (usec): min=1332, max=23383, avg=7947.99, stdev=209.60
>
> Again, purely a synthetic IO benchmark, but the sys reduction is
> interesting.
Upping the ante a bit more, this time on a really fast box. Just to show
that iopoll works well even on just about the fastest CPU you can throw
at it.
iopoll=0
blocksize IOPS ints/sec usr sys
------------------------------------------------------
8k 64823 ~67000 4.75% 13.41%
clat (usec): min=1430, max=15770, avg=7880.60, stdev=118.95
clat (usec): min=1249, max=17810, avg=7887.34, stdev=120.39
clat (usec): min=1729, max=15473, avg=7888.13, stdev=118.70
iopoll=1
blocksize IOPS ints/sec usr sys
------------------------------------------------------
8k 64825 ~65000 4.37% 11.39%
clat (usec): min=1530, max=15195, avg=7910.01, stdev=111.43
clat (usec): min=1495, max=16180, avg=7885.11, stdev=115.56
clat (usec): min=1446, max=19733, avg=7890.46, stdev=139.05
--
Jens Axboe
Hi Jens,
Could you tell me what tool you use to get the IO benchmark numbers?
Thanks
Jack Wang
* Re: [PATCH 1/3] block: add blk-iopoll, a NAPI like approach forblock devices
2009-08-20 11:30 ` [PATCH 1/3] block: add blk-iopoll, a NAPI like approach forblock devices jack wang
@ 2009-08-20 11:38 ` Jens Axboe
0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2009-08-20 11:38 UTC (permalink / raw)
To: jack wang
Cc: 'Jeff Garzik', 'Alan Cox', linux-kernel,
linux-scsi, Eric.Moore
On Thu, Aug 20 2009, jack wang wrote:
> Could you tell me what tool do you use to get the IO benchmark?
Sure, it's basically always the same tool: fio.
git clone git://git.kernel.dk/fio.git
or just grab the latest snapshot:
http://brick.kernel.dk/snaps/fio-git-latest.tar.gz
--
Jens Axboe
* [PATCH 2/3] libata: add support for blk-iopoll
2009-08-06 19:58 [PATCH 0/3]: blk-iopoll, a polled completion API for block devices Jens Axboe
2009-08-06 19:58 ` [PATCH 1/3] block: add blk-iopoll, a NAPI like approach " Jens Axboe
@ 2009-08-06 19:58 ` Jens Axboe
2009-08-10 17:15 ` Jonathan Corbet
2009-08-06 19:58 ` [PATCH 3/3] mptfusion: " Jens Axboe
2009-08-11 10:35 ` [PATCH 0/3]: blk-iopoll, a polled completion API for block devices Bart Van Assche
3 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-06 19:58 UTC (permalink / raw)
To: linux-kernel, linux-scsi; +Cc: Eric.Moore, jeff, Jens Axboe
This adds basic support to libata, and specific support for ahci.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
drivers/ata/ahci.c | 53 +++++++++++++++++++++++++++++++++++++++++++++--
include/linux/libata.h | 2 +
2 files changed, 52 insertions(+), 3 deletions(-)
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 958c1fa..9dda8ca 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -45,6 +45,7 @@
#include <scsi/scsi_host.h>
#include <scsi/scsi_cmnd.h>
#include <linux/libata.h>
+#include <linux/blk-iopoll.h>
#define DRV_NAME "ahci"
#define DRV_VERSION "3.0"
@@ -2053,7 +2054,7 @@ static void ahci_error_intr(struct ata_port *ap, u32 irq_stat)
ata_port_abort(ap);
}
-static void ahci_port_intr(struct ata_port *ap)
+static int ahci_port_intr(struct ata_port *ap)
{
void __iomem *port_mmio = ahci_port_base(ap);
struct ata_eh_info *ehi = &ap->link.eh_info;
@@ -2083,7 +2084,7 @@ static void ahci_port_intr(struct ata_port *ap)
if (unlikely(status & PORT_IRQ_ERROR)) {
ahci_error_intr(ap, status);
- return;
+ return 0;
}
if (status & PORT_IRQ_SDB_FIS) {
@@ -2124,7 +2125,43 @@ static void ahci_port_intr(struct ata_port *ap)
ehi->err_mask |= AC_ERR_HSM;
ehi->action |= ATA_EH_RESET;
ata_port_freeze(ap);
+ rc = 0;
}
+
+ return rc;
+}
+
+static void ap_irq_disable(struct ata_port *ap)
+{
+ void __iomem *port_mmio = ahci_port_base(ap);
+
+ writel(0, port_mmio + PORT_IRQ_MASK);
+}
+
+static void ap_irq_enable(struct ata_port *ap)
+{
+ void __iomem *port_mmio = ahci_port_base(ap);
+ struct ahci_port_priv *pp = ap->private_data;
+
+ writel(pp->intr_mask, port_mmio + PORT_IRQ_MASK);
+}
+
+static int ahci_iopoll(struct blk_iopoll *iop, int budget)
+{
+ struct ata_port *ap = container_of(iop, struct ata_port, iopoll);
+ unsigned long flags;
+ int ret;
+
+ spin_lock_irqsave(&ap->host->lock, flags);
+ ret = ahci_port_intr(ap);
+ spin_unlock_irqrestore(&ap->host->lock, flags);
+
+ if (ret < budget) {
+ blk_iopoll_complete(iop);
+ ap_irq_enable(ap);
+ }
+
+ return ret;
}
static irqreturn_t ahci_interrupt(int irq, void *dev_instance)
@@ -2157,7 +2194,12 @@ static irqreturn_t ahci_interrupt(int irq, void *dev_instance)
ap = host->ports[i];
if (ap) {
- ahci_port_intr(ap);
+ if (!blk_iopoll_enabled)
+ ahci_port_intr(ap);
+ else if (blk_iopoll_sched_prep(&ap->iopoll)) {
+ ap_irq_disable(ap);
+ blk_iopoll_sched(&ap->iopoll);
+ }
VPRINTK("port %u\n", i);
} else {
VPRINTK("port %u (no irq)\n", i);
@@ -2299,6 +2341,7 @@ static int ahci_port_resume(struct ata_port *ap)
else
ahci_pmp_detach(ap);
+ blk_iopoll_enable(&ap->iopoll);
return 0;
}
@@ -2421,6 +2464,8 @@ static int ahci_port_start(struct ata_port *ap)
ap->private_data = pp;
+ blk_iopoll_init(&ap->iopoll, 32, ahci_iopoll);
+
/* engage engines, captain */
return ahci_port_resume(ap);
}
@@ -2434,6 +2479,8 @@ static void ahci_port_stop(struct ata_port *ap)
rc = ahci_deinit_port(ap, &emsg);
if (rc)
ata_port_printk(ap, KERN_WARNING, "%s (%d)\n", emsg, rc);
+
+ blk_iopoll_disable(&ap->iopoll);
}
static int ahci_configure_dma_masks(struct pci_dev *pdev, int using_dac)
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 7cfed24..14bba58 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -37,6 +37,7 @@
#include <scsi/scsi_host.h>
#include <linux/acpi.h>
#include <linux/cdrom.h>
+#include <linux/blk-iopoll.h>
/*
* Define if arch has non-standard setup. This is a _PCI_ standard
@@ -761,6 +762,7 @@ struct ata_port {
#endif
/* owned by EH */
u8 sector_buf[ATA_SECT_SIZE] ____cacheline_aligned;
+ struct blk_iopoll iopoll;
};
/* The following initializer overrides a method to NULL whether one of
--
1.6.3.2.306.g4f4fa
* Re: [PATCH 2/3] libata: add support for blk-iopoll
2009-08-06 19:58 ` [PATCH 2/3] libata: add support for blk-iopoll Jens Axboe
@ 2009-08-10 17:15 ` Jonathan Corbet
2009-08-10 17:22 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: Jonathan Corbet @ 2009-08-10 17:15 UTC (permalink / raw)
Cc: linux-kernel, linux-scsi, Eric.Moore, jeff, Jens Axboe
Hey, Jens,
I'm a little slow in looking at this, hopefully it's not completely
noise...
> @@ -2157,7 +2194,12 @@ static irqreturn_t ahci_interrupt(int irq, void *dev_instance)
>
> ap = host->ports[i];
> if (ap) {
> - ahci_port_intr(ap);
> + if (!blk_iopoll_enabled)
> + ahci_port_intr(ap);
> + else if (blk_iopoll_sched_prep(&ap->iopoll)) {
> + ap_irq_disable(ap);
> + blk_iopoll_sched(&ap->iopoll);
> + }
> VPRINTK("port %u\n", i);
> } else {
> VPRINTK("port %u (no irq)\n", i);
It seems to me that, if blk_iopoll_sched_prep() fails, the interrupt
will be dropped on the floor; would you not need an explicit
ahci_port_intr() call in that case too? Unless I've misunderstood as
usual...
Documenting the "zero means failure" nature of blk_iopoll_sched_prep()
might also be a good idea; I predict confusion otherwise.
jon
* Re: [PATCH 2/3] libata: add support for blk-iopoll
2009-08-10 17:15 ` Jonathan Corbet
@ 2009-08-10 17:22 ` Jens Axboe
0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2009-08-10 17:22 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-kernel, linux-scsi, Eric.Moore, jeff
On Mon, Aug 10 2009, Jonathan Corbet wrote:
> Hey, Jens,
>
> I'm a little slow in looking at this, hopefully it's not completely
> noise...
>
> > @@ -2157,7 +2194,12 @@ static irqreturn_t ahci_interrupt(int irq, void *dev_instance)
> >
> > ap = host->ports[i];
> > if (ap) {
> > - ahci_port_intr(ap);
> > + if (!blk_iopoll_enabled)
> > + ahci_port_intr(ap);
> > + else if (blk_iopoll_sched_prep(&ap->iopoll)) {
> > + ap_irq_disable(ap);
> > + blk_iopoll_sched(&ap->iopoll);
> > + }
> > VPRINTK("port %u\n", i);
> > } else {
> > VPRINTK("port %u (no irq)\n", i);
>
> It seems to me that, if blk_iopoll_sched_prep() fails, the interrupt
> will be dropped on the floor; would you not need an explicit
> ahci_port_intr() call in that case too? Unless I've misunderstood as
> usual...
If that happens, it is probably a spurious IRQ since it's already
scheduled to run (and hasn't yet). So it should be fine, in reality it
should not happen since the IRQ should have been acked and the iopoll
handler scheduled.
> Documenting the "zero means failure" nature of blk_iopoll_sched_prep()
> might also be a good idea; I predict confusion otherwise.
There's no real failure case, a zero return just means "already scheduled".
But we do usually use 0 as the "normal" case, so good point anyway. I'll
change it.
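Something like this, perhaps (untested sketch of the inverted convention,
where callers would then schedule on a zero return):

static inline int blk_iopoll_sched_prep(struct blk_iopoll *iop)
{
        /*
         * 0 means "go ahead and schedule", non-zero means the instance is
         * disabled or a poll run is already pending.
         */
        return test_bit(IOPOLL_F_DISABLE, &iop->state) ||
                test_and_set_bit(IOPOLL_F_SCHED, &iop->state);
}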
--
Jens Axboe
* [PATCH 3/3] mptfusion: add support for blk-iopoll
2009-08-06 19:58 [PATCH 0/3]: blk-iopoll, a polled completion API for block devices Jens Axboe
2009-08-06 19:58 ` [PATCH 1/3] block: add blk-iopoll, a NAPI like approach " Jens Axboe
2009-08-06 19:58 ` [PATCH 2/3] libata: add support for blk-iopoll Jens Axboe
@ 2009-08-06 19:58 ` Jens Axboe
2009-08-11 10:35 ` [PATCH 0/3]: blk-iopoll, a polled completion API for block devices Bart Van Assche
3 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2009-08-06 19:58 UTC (permalink / raw)
To: linux-kernel, linux-scsi; +Cc: Eric.Moore, jeff, Jens Axboe
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
drivers/message/fusion/mptbase.c | 99 ++++++++++++++++++++++++++++++++-----
drivers/message/fusion/mptbase.h | 3 +
2 files changed, 88 insertions(+), 14 deletions(-)
diff --git a/drivers/message/fusion/mptbase.c b/drivers/message/fusion/mptbase.c
index 5d0ba4f..24549d6 100644
--- a/drivers/message/fusion/mptbase.c
+++ b/drivers/message/fusion/mptbase.c
@@ -114,6 +114,9 @@ module_param_call(mpt_fwfault_debug, param_set_int, param_get_int,
MODULE_PARM_DESC(mpt_fwfault_debug, "Enable detection of Firmware fault"
" and halt Firmware on fault - (default=0)");
+static int mpt_iopoll_w = 32;
+module_param(mpt_iopoll_w, int, 0);
+MODULE_PARM_DESC(mpt_iopoll_w, " blk iopoll budget (default=32)");
#ifdef MFCNT
@@ -515,6 +518,44 @@ mpt_reply(MPT_ADAPTER *ioc, u32 pa)
mb();
}
+static void mpt_irq_disable(MPT_ADAPTER *ioc)
+{
+ CHIPREG_WRITE32(&ioc->chip->IntMask, 0xFFFFFFFF);
+ CHIPREG_WRITE32(&ioc->chip->IntStatus, 0);
+ CHIPREG_READ32(&ioc->chip->IntStatus);
+}
+
+static void mpt_irq_enable(MPT_ADAPTER *ioc)
+{
+ CHIPREG_WRITE32(&ioc->chip->IntMask, MPI_HIM_DIM);
+}
+
+static inline void __mpt_handle_irq(MPT_ADAPTER *ioc, u32 pa)
+{
+ if (pa & MPI_ADDRESS_REPLY_A_BIT)
+ mpt_reply(ioc, pa);
+ else
+ mpt_turbo_reply(ioc, pa);
+}
+
+static int mpt_handle_irq(MPT_ADAPTER *ioc, unsigned int budget)
+{
+ int nr = 0;
+ u32 pa;
+
+ /*
+ * Drain the reply FIFO!
+ */
+ while ((pa = CHIPREG_READ32_dmasync(&ioc->chip->ReplyFifo)) != 0xffffffff) {
+ nr++;
+ __mpt_handle_irq(ioc, pa);
+ if (nr == budget)
+ break;
+ }
+
+ return nr;
+}
+
/*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
/**
* mpt_interrupt - MPT adapter (IOC) specific interrupt handler.
@@ -536,23 +577,48 @@ static irqreturn_t
mpt_interrupt(int irq, void *bus_id)
{
MPT_ADAPTER *ioc = bus_id;
- u32 pa = CHIPREG_READ32_dmasync(&ioc->chip->ReplyFifo);
+ int nr = 0;
+
+ if (!blk_iopoll_enabled)
+ nr = mpt_handle_irq(ioc, -1U);
+ else if (blk_iopoll_sched_prep(&ioc->iopoll)) {
+ mpt_irq_disable(ioc);
+ ioc->iopoll.data = CHIPREG_READ32_dmasync(&ioc->chip->ReplyFifo);
+ blk_iopoll_sched(&ioc->iopoll);
+ nr = 1;
+ } else {
+ /*
+ * Not really handled, but it will be by iopoll.
+ */
+ nr = 1;
+ }
- if (pa == 0xFFFFFFFF)
- return IRQ_NONE;
+ if (nr)
+ return IRQ_HANDLED;
- /*
- * Drain the reply FIFO!
- */
- do {
- if (pa & MPI_ADDRESS_REPLY_A_BIT)
- mpt_reply(ioc, pa);
- else
- mpt_turbo_reply(ioc, pa);
- pa = CHIPREG_READ32_dmasync(&ioc->chip->ReplyFifo);
- } while (pa != 0xFFFFFFFF);
+ return IRQ_NONE;
+}
- return IRQ_HANDLED;
+static int mpt_iopoll(struct blk_iopoll *iop, int budget)
+{
+ MPT_ADAPTER *ioc = container_of(iop, MPT_ADAPTER, iopoll);
+ int ret = 0;
+ u32 pa;
+
+ pa = iop->data;
+ iop->data = 0xffffffff;
+ if (pa != 0xffffffff) {
+ __mpt_handle_irq(ioc, pa);
+ ret = 1;
+ }
+
+ ret += mpt_handle_irq(ioc, budget - ret);
+ if (ret < budget) {
+ blk_iopoll_complete(iop);
+ mpt_irq_enable(ioc);
+ }
+
+ return ret;
}
/*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
@@ -2091,6 +2157,7 @@ mpt_suspend(struct pci_dev *pdev, pm_message_t state)
/* Clear any lingering interrupt */
CHIPREG_WRITE32(&ioc->chip->IntStatus, 0);
+ blk_iopoll_disable(&ioc->iopoll);
free_irq(ioc->pci_irq, ioc);
if (ioc->msi_enable)
pci_disable_msi(ioc->pcidev);
@@ -2358,6 +2425,8 @@ mpt_do_ioc_recovery(MPT_ADAPTER *ioc, u32 reason, int sleepFlag)
ret = -EBUSY;
goto out;
}
+ blk_iopoll_init(&ioc->iopoll, mpt_iopoll_w, mpt_iopoll);
+ blk_iopoll_enable(&ioc->iopoll);
irq_allocated = 1;
ioc->pci_irq = ioc->pcidev->irq;
pci_set_master(ioc->pcidev); /* ?? */
@@ -2578,6 +2647,7 @@ mpt_do_ioc_recovery(MPT_ADAPTER *ioc, u32 reason, int sleepFlag)
out:
if ((ret != 0) && irq_allocated) {
+ blk_iopoll_disable(&ioc->iopoll);
free_irq(ioc->pci_irq, ioc);
if (ioc->msi_enable)
pci_disable_msi(ioc->pcidev);
@@ -2786,6 +2856,7 @@ mpt_adapter_dispose(MPT_ADAPTER *ioc)
mpt_adapter_disable(ioc);
if (ioc->pci_irq != -1) {
+ blk_iopoll_disable(&ioc->iopoll);
free_irq(ioc->pci_irq, ioc);
if (ioc->msi_enable)
pci_disable_msi(ioc->pcidev);
diff --git a/drivers/message/fusion/mptbase.h b/drivers/message/fusion/mptbase.h
index 1c8514d..954a59f 100644
--- a/drivers/message/fusion/mptbase.h
+++ b/drivers/message/fusion/mptbase.h
@@ -52,6 +52,7 @@
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/mutex.h>
+#include <linux/blk-iopoll.h>
#include "lsi/mpi_type.h"
#include "lsi/mpi.h" /* Fusion MPI(nterface) basic defs */
@@ -763,6 +764,8 @@ typedef struct _MPT_ADAPTER
struct workqueue_struct *reset_work_q;
struct delayed_work fault_reset_work;
+ struct blk_iopoll iopoll;
+
u8 sg_addr_size;
u8 in_rescan;
u8 SGE_size;
--
1.6.3.2.306.g4f4fa
* Re: [PATCH 0/3]: blk-iopoll, a polled completion API for block devices
2009-08-06 19:58 [PATCH 0/3]: blk-iopoll, a polled completion API for block devices Jens Axboe
` (2 preceding siblings ...)
2009-08-06 19:58 ` [PATCH 3/3] mptfusion: " Jens Axboe
@ 2009-08-11 10:35 ` Bart Van Assche
2009-08-11 14:39 ` Jens Axboe
3 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2009-08-11 10:35 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, linux-scsi, Eric.Moore, jeff
On Thu, Aug 6, 2009 at 9:58 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> Anyway, YMMV, I would appreciate some test results (and as usual, that
> even includes just saying that it boots and functions for you). If
> people feel adventurous, patches for other controllers will be happily
> queued up for testing. I may even be convinced to implement support
> for your controller of choice, if you have some fast storage hooked up
> and would like to experiment. Generally, adding support to a driver is
> not very hard and the two conversions included were also meant to serve
> as an inspiration.
Sounds very interesting. Have you already considered patching the SRP
initiator? During the SRP performance tests I ran, CPU usage on the
initiator was more than 95% and on the target less than 10%.
Bart.
* Re: [PATCH 0/3]: blk-iopoll, a polled completion API for block devices
2009-08-11 10:35 ` [PATCH 0/3]: blk-iopoll, a polled completion API for block devices Bart Van Assche
@ 2009-08-11 14:39 ` Jens Axboe
2009-08-11 14:59 ` Bart Van Assche
0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-11 14:39 UTC (permalink / raw)
To: Bart Van Assche; +Cc: linux-kernel, linux-scsi, Eric.Moore, jeff
On Tue, Aug 11 2009, Bart Van Assche wrote:
> On Thu, Aug 6, 2009 at 9:58 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > Anyway, YMMV, I would appreciate some test results (and as usual, that
> > even includes just saying that it boots and functions for you). If
> > people feel adventurous, patches for other controllers will be happily
> > queued up for testing. I may even be convinced to implement support
> > for your controller of choice, if you have some fast storage hooked up
> > and would like to experiment. Generally, adding support to a driver is
> > not very hard and the two conversions included were also meant to serve
> > as an inspiration.
>
> Sounds very interesting. Have you already considered patching the SRP
> initiator ? During the SRP performance tests I ran CPU usage on the
> initiator was more than 95% and on the target less than 10%.
No I haven't; if you point me at the relevant srp files, I can take a look.
--
Jens Axboe
* Re: [PATCH 0/3]: blk-iopoll, a polled completion API for block devices
2009-08-11 14:39 ` Jens Axboe
@ 2009-08-11 14:59 ` Bart Van Assche
2009-08-11 17:14 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2009-08-11 14:59 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, linux-scsi, Eric.Moore, jeff
On Tue, Aug 11, 2009 at 4:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Tue, Aug 11 2009, Bart Van Assche wrote:
>> On Thu, Aug 6, 2009 at 9:58 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
>> > Anyway, YMMV, I would appreciate some test results (and as usual, that
>> > even includes just saying that it boots and functions for you). If
>> > people feel adventurous, patches for other controllers will be happily
>> > queued up for testing. I may even be convinced to implement support
>> > for your controller of choice, if you have some fast storage hooked up
>> > and would like to experiment. Generally, adding support to a driver is
>> > not very hard and the two conversions included were also meant to serve
>> > as an inspiration.
>>
>> Sounds very interesting. Have you already considered patching the SRP
>> initiator ? During the SRP performance tests I ran CPU usage on the
>> initiator was more than 95% and on the target less than 10%.
>
> No I haven't, if you point me at which srp files, I can take a look.
The relevant source files are:
include/scsi/srp.h
include/scsi/scsi_transport_srp.h
drivers/infiniband/ulp/srp/ib_srp.h
drivers/infiniband/ulp/srp/ib_srp.c
drivers/scsi/scsi_transport_srp.c
drivers/scsi/libsrp.c
drivers/scsi/scsi_transport_srp_internal.h
Bart.
* Re: [PATCH 0/3]: blk-iopoll, a polled completion API for block devices
2009-08-11 14:59 ` Bart Van Assche
@ 2009-08-11 17:14 ` Jens Axboe
2009-08-11 18:37 ` Bart Van Assche
0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-11 17:14 UTC (permalink / raw)
To: Bart Van Assche; +Cc: linux-kernel, linux-scsi, Eric.Moore, jeff
On Tue, Aug 11 2009, Bart Van Assche wrote:
> On Tue, Aug 11, 2009 at 4:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > On Tue, Aug 11 2009, Bart Van Assche wrote:
> >> On Thu, Aug 6, 2009 at 9:58 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> >> > Anyway, YMMV, I would appreciate some test results (and as usual, that
> >> > even includes just saying that it boots and functions for you). If
> >> > people feel adventurous, patches for other controllers will be happily
> >> > queued up for testing. I may even be convinced to implement support
> >> > for your controller of choice, if you have some fast storage hooked up
> >> > and would like to experiment. Generally, adding support to a driver is
> >> > not very hard and the two conversions included were also meant to serve
> >> > as an inspiration.
> >>
> >> Sounds very interesting. Have you already considered patching the SRP
> >> initiator ? During the SRP performance tests I ran CPU usage on the
> >> initiator was more than 95% and on the target less than 10%.
> >
> > No I haven't, if you point me at which srp files, I can take a look.
>
> The relevant source files are:
> include/scsi/srp.h
> include/scsi/scsi_transport_srp.h
> drivers/infiniband/ulp/srp/ib_srp.h
> drivers/infiniband/ulp/srp/ib_srp.c
> drivers/scsi/scsi_transport_srp.c
> drivers/scsi/libsrp.c
> drivers/scsi/scsi_transport_srp_internal.h
I can find | grep too :-)
Did you profile this? Where did it burn all the CPU time on the
initiator side?
--
Jens Axboe
* Re: [PATCH 0/3]: blk-iopoll, a polled completion API for block devices
2009-08-11 17:14 ` Jens Axboe
@ 2009-08-11 18:37 ` Bart Van Assche
2009-08-11 18:41 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2009-08-11 18:37 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, linux-scsi, Eric.Moore, jeff
On Tue, Aug 11, 2009 at 7:14 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Tue, Aug 11 2009, Bart Van Assche wrote:
>> On Tue, Aug 11, 2009 at 4:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
>> > On Tue, Aug 11 2009, Bart Van Assche wrote:
>> >> On Thu, Aug 6, 2009 at 9:58 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
>> >> > Anyway, YMMV, I would appreciate some test results (and as usual, that
>> >> > even includes just saying that it boots and functions for you). If
>> >> > people feel adventurous, patches for other controllers will be happily
>> >> > queued up for testing. I may even be convinced to implement support
>> >> > for your controller of choice, if you have some fast storage hooked up
>> >> > and would like to experiment. Generally, adding support to a driver is
>> >> > not very hard and the two conversions included were also meant to serve
>> >> > as an inspiration.
>> >>
>> >> Sounds very interesting. Have you already considered patching the SRP
>> >> initiator ? During the SRP performance tests I ran CPU usage on the
>> >> initiator was more than 95% and on the target less than 10%.
>> >
>> > No I haven't, if you point me at which srp files, I can take a look.
>>
>> The relevant source files are:
>> include/scsi/srp.h
>> include/scsi/scsi_transport_srp.h
>> drivers/infiniband/ulp/srp/ib_srp.h
>> drivers/infiniband/ulp/srp/ib_srp.c
>> drivers/scsi/scsi_transport_srp.c
>> drivers/scsi/libsrp.c
>> drivers/scsi/scsi_transport_srp_internal.h
>
> I can find | grep too :-)
The command "find | grep srp" would have returned a few more source
files. The most relevant starting point is the source file
drivers/infiniband/ulp/srp/ib_srp.c, which contains the struct
scsi_host_template.
> Did you profile this? Where did it burn all the CPU time on the
> initiator side?
The test I ran involved a Linux SRP initiator and a Linux SRP target
(SCST) using a RAM disk as backing storage. Read throughput is about 1700
MB/s for block sizes of 8 MB and above. But with a block size of 4 KB,
the read throughput on the initiator drops to 100 MB/s. At this block
size there are about 50,000 interrupts per second generated by the
InfiniBand HCA in the initiator system. On the same setup the
ib_send_bw tool reports a throughput of 1850 MB/s for a block size of
4 KB. This last tool is not interrupt driven but uses polling.
Bart.
* Re: [PATCH 0/3]: blk-iopoll, a polled completion API for block devices
2009-08-11 18:37 ` Bart Van Assche
@ 2009-08-11 18:41 ` Jens Axboe
2009-08-11 18:49 ` Bart Van Assche
0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2009-08-11 18:41 UTC (permalink / raw)
To: Bart Van Assche; +Cc: linux-kernel, linux-scsi, Eric.Moore, jeff
On Tue, Aug 11 2009, Bart Van Assche wrote:
> On Tue, Aug 11, 2009 at 7:14 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > On Tue, Aug 11 2009, Bart Van Assche wrote:
> >> On Tue, Aug 11, 2009 at 4:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> >> > On Tue, Aug 11 2009, Bart Van Assche wrote:
> >> >> On Thu, Aug 6, 2009 at 9:58 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> >> >> > Anyway, YMMV, I would appreciate some test results (and as usual, that
> >> >> > even includes just saying that it boots and functions for you). If
> >> >> > people feel adventurous, patches for other controllers will be happily
> >> >> > queued up for testing. I may even be convinced to implement support
> >> >> > for your controller of choice, if you have some fast storage hooked up
> >> >> > and would like to experiment. Generally, adding support to a driver is
> >> >> > not very hard and the two conversions included were also meant to serve
> >> >> > as an inspiration.
> >> >>
> >> >> Sounds very interesting. Have you already considered patching the SRP
> >> >> initiator ? During the SRP performance tests I ran CPU usage on the
> >> >> initiator was more than 95% and on the target less than 10%.
> >> >
> >> > No I haven't, if you point me at which srp files, I can take a look.
> >>
> >> The relevant source files are:
> >> include/scsi/srp.h
> >> include/scsi/scsi_transport_srp.h
> >> drivers/infiniband/ulp/srp/ib_srp.h
> >> drivers/infiniband/ulp/srp/ib_srp.c
> >> drivers/scsi/scsi_transport_srp.c
> >> drivers/scsi/libsrp.c
> >> drivers/scsi/scsi_transport_srp_internal.h
> >
> > I can find | grep too :-)
>
> The command "find | grep srp" would have returned a few more source
> files. The most relevant starting point is the source file
> drivers/infiniband/ulp/srp/ib_srp.c, which contains the struct
> scsi_host_template.
>
> > Did you profile this? Where did it burn all the CPU time on the
> > initiator side?
>
> The test I ran involved a Linux SRP initiator and a Linux SRP target
> (SCST) using a RAM disk as backstorage. Read throughput is about 1700
> MB/s for block sizes of 8 MB and above. But with a block size of 4 KB,
> the read throughput on the initiator drops to 100 MB/s. At this block
> size there are about 50.000 interrupts per second generated by the
> InfiniBand HCA in the initiator system. On the same setup the
> ib_send_bw tool reports a throughput of 1850 MB/s for a block size of
> 4 KB. This last tool is not interrupt driven but uses polling.
OK, so that looks promising at least. Which hw driver does it use? If I
look under infiniband/, I see nes, amso, ehca, various ipath and mthca.
That's where it needs to be hooked up; the srp files above mostly look like
library helpers and the target hook into the scsi layer.
--
Jens Axboe
* Re: [PATCH 0/3]: blk-iopoll, a polled completion API for block devices
2009-08-11 18:41 ` Jens Axboe
@ 2009-08-11 18:49 ` Bart Van Assche
0 siblings, 0 replies; 22+ messages in thread
From: Bart Van Assche @ 2009-08-11 18:49 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, linux-scsi, Eric.Moore, jeff
On Tue, Aug 11, 2009 at 8:41 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Tue, Aug 11 2009, Bart Van Assche wrote:
>> On Tue, Aug 11, 2009 at 7:14 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
>> > Did you profile this? Where did it burn all the CPU time on the
>> > initiator side?
>>
>> The test I ran involved a Linux SRP initiator and a Linux SRP target
>> (SCST) using a RAM disk as backstorage. Read throughput is about 1700
>> MB/s for block sizes of 8 MB and above. But with a block size of 4 KB,
>> the read throughput on the initiator drops to 100 MB/s. At this block
>> size there are about 50.000 interrupts per second generated by the
>> InfiniBand HCA in the initiator system. On the same setup the
>> ib_send_bw tool reports a throughput of 1850 MB/s for a block size of
>> 4 KB. This last tool is not interrupt driven but uses polling.
>
> OK, so that looks promising at least. Which hw driver does it use? If I
> look under infiniband/, I see nes, amso, ehca, various ipath and mthca.
> That's where it needs to be hooked up, the srp above mostly looks like
> library helpers and the target hook to the scsi layer.
The above numbers have been obtained on Mellanox ConnectX hardware.
This hardware is controlled by the mlx4_core and mlx4_ib kernel
modules. Source code for these drivers can be found in
drivers/infiniband/hw/mlx4 and drivers/net/mlx4.
Bart.