* [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand
@ 2010-09-03 22:27 Vasu Dev
2010-09-04 1:20 ` Nicholas A. Bellinger
2010-09-12 16:37 ` Bart Van Assche
0 siblings, 2 replies; 8+ messages in thread
From: Vasu Dev @ 2010-09-03 22:27 UTC (permalink / raw)
To: Mike Christie, linux-scsi, Nicholas A. Bellinger, Joe Eykholt,
James Bottomley
Currently fc_queuecommand drops this lock very early on and then
re-acquires it just before return. The re-acquired lock is then dropped
immediately by its fast-path caller, scsi_dispatch_cmd. This re-acquire
and immediate drop on every IO hurts performance, especially with small
IOs on multi-core systems; the hit is significant, about 25% with
512-byte IOs.

This lock is not needed in fc_queuecommand, and calling fc_queuecommand
without it held removes the performance hit described above.

So this patch adds an unlocked_qcmds flag to drop host_lock before
calling fc_queuecommand only, removing the re-acquire and immediate
drop on each IO. The added flag drops the lock only if the LLD asks for
it, as fcoe.ko does here for fc_queuecommand. This change won't affect
any existing LLD, since the unlocked_qcmds flag will be zero in those
cases and their host lock use effectively remains the same after this
patch.

As per Mike's and Nab's comments, the host lock is restored around the
SHOST_DEL check in scsi_dispatch_cmd.
Signed-off-by: Vasu Dev <vasu.dev@intel.com>
---
drivers/scsi/fcoe/fcoe.c | 1 +
drivers/scsi/libfc/fc_fcp.c | 14 +++++---------
drivers/scsi/scsi.c | 7 ++++++-
include/scsi/scsi_host.h | 3 +++
4 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 844d618..280a4df 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -706,6 +706,7 @@ static int fcoe_shost_config(struct fc_lport *lport, struct device *dev)
lport->host->max_id = FCOE_MAX_FCP_TARGET;
lport->host->max_channel = 0;
lport->host->max_cmd_len = FCOE_MAX_CMD_LEN;
+ lport->host->unlocked_qcmds = 1;
if (lport->vport)
lport->host->transportt = fcoe_vport_transport_template;
diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c
index 43866e6..39b6bfa 100644
--- a/drivers/scsi/libfc/fc_fcp.c
+++ b/drivers/scsi/libfc/fc_fcp.c
@@ -1751,8 +1751,7 @@ static inline int fc_fcp_lport_queue_ready(struct fc_lport *lport)
* @cmd: The scsi_cmnd to be executed
* @done: The callback function to be called when the scsi_cmnd is complete
*
- * This is the i/o strategy routine, called by the SCSI layer. This routine
- * is called with the host_lock held.
+ * This is the i/o strategy routine, called by the SCSI layer.
*/
int fc_queuecommand(struct scsi_cmnd *sc_cmd, void (*done)(struct scsi_cmnd *))
{
@@ -1770,9 +1769,8 @@ int fc_queuecommand(struct scsi_cmnd *sc_cmd, void (*done)(struct scsi_cmnd *))
if (rval) {
sc_cmd->result = rval;
done(sc_cmd);
- return 0;
+ return rc;
}
- spin_unlock_irq(lport->host->host_lock);
if (!*(struct fc_remote_port **)rport->dd_data) {
/*
@@ -1781,7 +1779,7 @@ int fc_queuecommand(struct scsi_cmnd *sc_cmd, void (*done)(struct scsi_cmnd *))
*/
sc_cmd->result = DID_IMM_RETRY << 16;
done(sc_cmd);
- goto out;
+ return rc;
}
rpriv = rport->dd_data;
@@ -1790,13 +1788,13 @@ int fc_queuecommand(struct scsi_cmnd *sc_cmd, void (*done)(struct scsi_cmnd *))
if (lport->qfull)
fc_fcp_can_queue_ramp_down(lport);
rc = SCSI_MLQUEUE_HOST_BUSY;
- goto out;
+ return rc;
}
fsp = fc_fcp_pkt_alloc(lport, GFP_ATOMIC);
if (fsp == NULL) {
rc = SCSI_MLQUEUE_HOST_BUSY;
- goto out;
+ return rc;
}
/*
@@ -1848,8 +1846,6 @@ int fc_queuecommand(struct scsi_cmnd *sc_cmd, void (*done)(struct scsi_cmnd *))
fc_fcp_pkt_release(fsp);
rc = SCSI_MLQUEUE_HOST_BUSY;
}
-out:
- spin_lock_irq(lport->host->host_lock);
return rc;
}
EXPORT_SYMBOL(fc_queuecommand);
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index ad0ed21..3819d66 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -749,11 +749,16 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
if (unlikely(host->shost_state == SHOST_DEL)) {
cmd->result = (DID_NO_CONNECT << 16);
scsi_done(cmd);
+ spin_unlock_irqrestore(host->host_lock, flags);
} else {
trace_scsi_dispatch_cmd_start(cmd);
+ if (host->unlocked_qcmds)
+ spin_unlock_irqrestore(host->host_lock, flags);
rtn = host->hostt->queuecommand(cmd, scsi_done);
+ if (!host->unlocked_qcmds)
+ spin_unlock_irqrestore(host->host_lock, flags);
}
- spin_unlock_irqrestore(host->host_lock, flags);
+
if (rtn) {
trace_scsi_dispatch_cmd_error(cmd, rtn);
if (rtn != SCSI_MLQUEUE_DEVICE_BUSY &&
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index b7bdecb..1814c51 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -636,6 +636,9 @@ struct Scsi_Host {
/* Asynchronous scan in progress */
unsigned async_scan:1;
+ /* call queuecommand without Scsi_Host lock held */
+ unsigned unlocked_qcmds:1;
+
/*
* Optional work queue to be utilized by the transport
*/
* Re: [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand
2010-09-03 22:27 [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand Vasu Dev
@ 2010-09-04 1:20 ` Nicholas A. Bellinger
2010-09-12 16:37 ` Bart Van Assche
1 sibling, 0 replies; 8+ messages in thread
From: Nicholas A. Bellinger @ 2010-09-04 1:20 UTC (permalink / raw)
To: Vasu Dev
Cc: Mike Christie, linux-scsi, Joe Eykholt, James Bottomley, devel,
Robert Love
On Fri, 2010-09-03 at 15:27 -0700, Vasu Dev wrote:
> Currently fc_queuecommand drops this lock very early on and then re-acquires
> this lock just before return, this re-acquired lock gets dropped immediately
> by its fast path caller scsi_dispatch_cmd, this re-acquire and then immediate
> drop on each IO hits performance especially with small size IOs on multi-core
> systems, this hit is significant about 25% with 512 bytes size IOs.
>
> This lock is not needed in fc_queuecommand and calling fc_queuecommand
> without this lock held removes above described performance hit.
>
> So this patch adds unlocked_qcmds flag to drop host_lock before
> calling only fc_queuecommand and removes re-acquire & then drop
> for each IO. Added flag, drops this lock only if LLD wants such
> as fcoe.ko does here for fc_queuecommand. This change won't affect
> any existing LLD since added unlocked_qcmds flag will be zero
> in those cases and their host lock uses would effectively remain
> same after this patch.
>
> As per Mike and Nab comments, restored host lock around SHOST_DEL state
> in scsi_dispatch_cmd.
>
> Signed-off-by: Vasu Dev <vasu.dev@intel.com>
> ---
> drivers/scsi/fcoe/fcoe.c | 1 +
> drivers/scsi/libfc/fc_fcp.c | 14 +++++---------
> drivers/scsi/scsi.c | 7 ++++++-
> include/scsi/scsi_host.h | 3 +++
> 4 files changed, 15 insertions(+), 10 deletions(-)
>
<SNIP>
> index b7bdecb..1814c51 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -636,6 +636,9 @@ struct Scsi_Host {
> /* Asynchronous scan in progress */
> unsigned async_scan:1;
>
> + /* call queuecommand without Scsi_Host lock held */
> + unsigned unlocked_qcmds:1;
> +
I am not sure why this structure member (or any of the other ones)
needs to be using :1 at this point, considering that
drivers/scsi/hosts.c:scsi_host_alloc() does a kzalloc() on struct
Scsi_Host to begin with...
Other than that minor nit, I have merged this patch into the
scsi-unlocked-qcmds branch of lio-core-2.6.git, available here:
git://git.kernel.org/pub/scm/linux/kernel/git/nab/lio-core-2.6.git scsi-unlocked-qcmds
and via gitweb here:
http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=shortlog;h=refs/heads/scsi-unlocked_qcmds
I will have a look at running some modified TCM_Loop and Linux/iSCSI
Initiator tests using SHT->unlocked_qcmds=1 in the very near future on
this code. 8-)
Many thanks again for your very worthwhile efforts Vasu!
Best,
--nab
* Re: [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand
2010-09-03 22:27 [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand Vasu Dev
2010-09-04 1:20 ` Nicholas A. Bellinger
@ 2010-09-12 16:37 ` Bart Van Assche
2010-09-15 17:31 ` Vasu Dev
2010-09-24 6:41 ` Jens Axboe
1 sibling, 2 replies; 8+ messages in thread
From: Bart Van Assche @ 2010-09-12 16:37 UTC (permalink / raw)
To: Vasu Dev; +Cc: linux-scsi
On Sat, Sep 4, 2010 at 12:27 AM, Vasu Dev <vasu.dev@intel.com> wrote:
>
> Currently fc_queuecommand drops this lock very early on and then re-acquires
> this lock just before return, this re-acquired lock gets dropped immediately
> by its fast path caller scsi_dispatch_cmd, this re-acquire and then immediate
> drop on each IO hits performance especially with small size IOs on multi-core
> systems, this hit is significant about 25% with 512 bytes size IOs.
>
> This lock is not needed in fc_queuecommand and calling fc_queuecommand
> without this lock held removes above described performance hit.
>
> So this patch adds unlocked_qcmds flag to drop host_lock before
> calling only fc_queuecommand and removes re-acquire & then drop
> for each IO. Added flag, drops this lock only if LLD wants such
> as fcoe.ko does here for fc_queuecommand. This change won't affect
> any existing LLD since added unlocked_qcmds flag will be zero
> in those cases and their host lock uses would effectively remain
> same after this patch.
[ ... ]
Hello Vasu,
While I don't doubt that your patch improves performance, there might
be more bottlenecks in the Linux storage stack than the SCSI host
lock. After having converted the SRP initiator (ib_srp) to use per-LUN
locking instead of per-host locking and after having enabled
unlocked_qcmds, I noticed that the largest source of contention was in
the block layer. There was ten times more contention on
request_queue.queue_lock than on any other spinlock. As you can see
below, the four most contended locks in the test I ran were:
* request_queue.queue_lock, needed by __make_request() in blk-core.c.
* srp_target_port.lock, the per-LUN lock in ib_srp.c.
* The SCSI host lock.
* An AIO lock.
The test I ran used libaio and the NOOP scheduler. Block merging was
disabled via sysfs.
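[Editor's note: a setup along the lines Bart describes might look like the following; the device name is a placeholder, and the lock statistics require a kernel built with CONFIG_LOCK_STAT=y.]

```shell
# Hedged sketch of the measurement setup; sdX is a placeholder device.
echo noop > /sys/block/sdX/queue/scheduler   # use the NOOP I/O scheduler
echo 2 > /sys/block/sdX/queue/nomerges       # disable block merging
echo 1 > /proc/sys/kernel/lock_stat          # needs CONFIG_LOCK_STAT=y
# ... run the libaio workload here ...
grep -A 2 queue_lock /proc/lock_stat         # inspect queue_lock contention
```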
From /proc/lock_stat:
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------
class name                   con-bounces  contentions  waittime-min  waittime-max  waittime-total  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------
&(&q->__queue_lock)->rlock:     11093047     11093901          0.22          9.37     10163231.15     41697489      74851829          0.00         21.14     60656375.82
--------------------------
&(&q->__queue_lock)->rlock       2093562  [<ffffffff811af406>] __make_request+0x56/0x500
&(&q->__queue_lock)->rlock             2  [<ffffffff811bcd70>] cfq_get_queue+0xf0/0x2a0
&(&q->__queue_lock)->rlock             1  [<ffffffff811bd3e4>] cfq_set_request+0x414/0x500
&(&q->__queue_lock)->rlock            36  [<ffffffff811bd049>] cfq_set_request+0x79/0x500
--------------------------
&(&q->__queue_lock)->rlock            17  [<ffffffff811bd049>] cfq_set_request+0x79/0x500
&(&q->__queue_lock)->rlock       1007349  [<ffffffff811af406>] __make_request+0x56/0x500
&(&q->__queue_lock)->rlock       3036167  [<ffffffff811af45f>] __make_request+0xaf/0x500
&(&q->__queue_lock)->rlock             1  [<ffffffff811bcd70>] cfq_get_queue+0xf0/0x2a0
...............................................................................................
&(&target->lock)->rlock:         1098815      1098945          0.21          3.67       317746.64     16107020      48401787          0.00         14.65     11997072.82
-----------------------
&(&target->lock)->rlock           220238  [<ffffffffa04e1307>] __srp_get_tx_iu+0x47/0xd0 [ib_srp]
&(&target->lock)->rlock           366255  [<ffffffffa04e2ee0>] srp_queuecommand+0xb0/0xad0 [ib_srp]
&(&target->lock)->rlock           297993  [<ffffffffa04e25b6>] srp_remove_req+0x46/0x80 [ib_srp]
&(&target->lock)->rlock           114850  [<ffffffffa04e2ae2>] srp_handle_recv+0x182/0x420 [ib_srp]
-----------------------
&(&target->lock)->rlock           277148  [<ffffffffa04e2ee0>] srp_queuecommand+0xb0/0xad0 [ib_srp]
&(&target->lock)->rlock           142360  [<ffffffffa04e2ae2>] srp_handle_recv+0x182/0x420 [ib_srp]
&(&target->lock)->rlock           245220  [<ffffffffa04e1307>] __srp_get_tx_iu+0x47/0xd0 [ib_srp]
&(&target->lock)->rlock           266847  [<ffffffffa04e25b6>] srp_remove_req+0x46/0x80 [ib_srp]
...............................................................................................
&(shost->host_lock)->rlock:       989776       989903          0.21          5.78       271707.74     20089204      38990409          0.00          7.07      9908160.13
--------------------------
&(shost->host_lock)->rlock        123198  [<ffffffffa0037583>] scsi_request_fn+0x163/0x4d0 [scsi_mod]
&(shost->host_lock)->rlock        235047  [<ffffffffa003036c>] scsi_dispatch_cmd+0x11c/0x3e0 [scsi_mod]
&(shost->host_lock)->rlock        389901  [<ffffffffa00383cd>] scsi_device_unbusy+0x3d/0xd0 [scsi_mod]
&(shost->host_lock)->rlock        241762  [<ffffffffa0036824>] scsi_run_queue+0x54/0x3d0 [scsi_mod]
--------------------------
&(shost->host_lock)->rlock        107827  [<ffffffffa0037583>] scsi_request_fn+0x163/0x4d0 [scsi_mod]
&(shost->host_lock)->rlock         95756  [<ffffffffa003036c>] scsi_dispatch_cmd+0x11c/0x3e0 [scsi_mod]
&(shost->host_lock)->rlock        316122  [<ffffffffa0036824>] scsi_run_queue+0x54/0x3d0 [scsi_mod]
&(shost->host_lock)->rlock        470203  [<ffffffffa00383cd>] scsi_device_unbusy+0x3d/0xd0 [scsi_mod]
...............................................................................................
&(&ctx->ctx_lock)->rlock:         399855       399868          0.22          4.22       143530.86      7351017      50104676          0.00          8.92     14706301.05
------------------------
&(&ctx->ctx_lock)->rlock          194753  [<ffffffff81158805>] aio_complete+0x45/0x260
&(&ctx->ctx_lock)->rlock           67287  [<ffffffff81159434>] do_io_submit+0x214/0x7b0
&(&ctx->ctx_lock)->rlock           56529  [<ffffffff81157da5>] __aio_get_req+0x95/0x1a0
&(&ctx->ctx_lock)->rlock           54461  [<ffffffff8115785d>] aio_put_req+0x2d/0x60
------------------------
&(&ctx->ctx_lock)->rlock          205740  [<ffffffff81158805>] aio_complete+0x45/0x260
&(&ctx->ctx_lock)->rlock          125933  [<ffffffff8115785d>] aio_put_req+0x2d/0x60
&(&ctx->ctx_lock)->rlock           37042  [<ffffffff81157da5>] __aio_get_req+0x95/0x1a0
&(&ctx->ctx_lock)->rlock           29197  [<ffffffff81159434>] do_io_submit+0x214/0x7b0
...............................................................................................
[ ... ]
Bart.
* Re: [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand
2010-09-12 16:37 ` Bart Van Assche
@ 2010-09-15 17:31 ` Vasu Dev
2010-09-15 18:01 ` Bart Van Assche
2010-09-24 6:41 ` Jens Axboe
1 sibling, 1 reply; 8+ messages in thread
From: Vasu Dev @ 2010-09-15 17:31 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Vasu Dev, linux-scsi
On Sun, 2010-09-12 at 18:37 +0200, Bart Van Assche wrote:
> On Sat, Sep 4, 2010 at 12:27 AM, Vasu Dev <vasu.dev@intel.com> wrote:
> >
> > Currently fc_queuecommand drops this lock very early on and then re-acquires
> > this lock just before return, this re-acquired lock gets dropped immediately
> > by its fast path caller scsi_dispatch_cmd, this re-acquire and then immediate
> > drop on each IO hits performance especially with small size IOs on multi-core
> > systems, this hit is significant about 25% with 512 bytes size IOs.
> >
> > This lock is not needed in fc_queuecommand and calling fc_queuecommand
> > without this lock held removes above described performance hit.
> >
> > So this patch adds unlocked_qcmds flag to drop host_lock before
> > calling only fc_queuecommand and removes re-acquire & then drop
> > for each IO. Added flag, drops this lock only if LLD wants such
> > as fcoe.ko does here for fc_queuecommand. This change won't affect
> > any existing LLD since added unlocked_qcmds flag will be zero
> > in those cases and their host lock uses would effectively remain
> > same after this patch.
> [ ... ]
>
> Hello Vasu,
>
> While I don't doubt that your patch improves performance, there might
> be more bottlenecks in the Linux storage stack than the SCSI host
> lock.
Yes there are. The upper layers (SCSI/block) of the fcoe stack were
still the biggest hit in my perf profiling even after unlocked_qcmds.
As I mentioned, the unlocked_qcmds change helped most with small IOs,
and that at a 10G host link, as verified by our validation; but
improving the upper-layer bottlenecks of the fcoe stack should give
better performance for all IO sizes at various link speeds for various
transports.
> After having converted the SRP initiator (ib_srp) to use per-LUN
> locking instead of per-host locking and after having enabled
Per-LUN locking looks like a good idea and should mitigate host lock
bottlenecks as long as a LUN is not used simultaneously by all host
cores; some use cases might ensure that by limiting a LUN to a VM on a
specific core. Is this change already upstream or queued for upstream?
Is that change going to eliminate the host lock on the fast path
completely?
> unlocked_qcmds, I noticed that the largest source of contention was in
> the block layer. There was ten times more contention on
> request_queue.queue_lock than on any other spinlock. As you can see
> below, the four most contended locks in the test I ran were:
> * request_queue.queue_lock, needed by __make_request() in blk-core.c.
> * srp_target_port.lock, the per-LUN lock in ib_srp.c.
> * The SCSI host lock.
> * An AIO lock.
>
> The test I ran used libaio and the NOOP scheduler. Block merging was
> disabled via sysfs.
>
> [ ... ]
Maybe block layer experts are already working on improving these code
paths; I don't know much about that part, and in any case this all goes
beyond what I was trying to accomplish with the simple unlocked_qcmds
change.
Thanks
Vasu
* Re: [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand
2010-09-15 17:31 ` Vasu Dev
@ 2010-09-15 18:01 ` Bart Van Assche
0 siblings, 0 replies; 8+ messages in thread
From: Bart Van Assche @ 2010-09-15 18:01 UTC (permalink / raw)
To: Vasu Dev; +Cc: Vasu Dev, linux-scsi
On Wed, Sep 15, 2010 at 7:31 PM, Vasu Dev <vasu.dev@linux.intel.com> wrote:
>
> On Sun, 2010-09-12 at 18:37 +0200, Bart Van Assche wrote:
> > On Sat, Sep 4, 2010 at 12:27 AM, Vasu Dev <vasu.dev@intel.com> wrote:
> > > [ ... ]
> > After having converted the SRP initiator (ib_srp) to use per-LUN
> > locking instead of per-host locking and after having enabled
>
> Per-lun locking looks good idea and should mitigate host lock
> bottlenecks while lun is not simultaneously used by all host cores, may
> be some use case would prevent that by limiting lun to VM on specific
> core. Is this change already upstream or queued for upstream ? Is that
> change going to eliminate host lock on fast path completely ?
> [ ... ]
Yes, that patch eliminates the host lock on the fast path completely.
The patch is not yet available upstream - I still have to split it and
post it for review on the linux-rdma mailing list. But if you want you
can have a look at that patch here:
http://sourceforge.net/mailarchive/message.php?msg_name=E1Ova8J-0005Mx-NV%40sfp-svn-4.v30.ch3.sourceforge.com.
Bart.
* Re: [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand
2010-09-12 16:37 ` Bart Van Assche
2010-09-15 17:31 ` Vasu Dev
@ 2010-09-24 6:41 ` Jens Axboe
2010-09-25 16:55 ` Bart Van Assche
1 sibling, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2010-09-24 6:41 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Vasu Dev, linux-scsi@vger.kernel.org
On 2010-09-12 18:37, Bart Van Assche wrote:
> On Sat, Sep 4, 2010 at 12:27 AM, Vasu Dev <vasu.dev@intel.com> wrote:
>>
>> Currently fc_queuecommand drops this lock very early on and then re-acquires
>> this lock just before return, this re-acquired lock gets dropped immediately
>> by its fast path caller scsi_dispatch_cmd, this re-acquire and then immediate
>> drop on each IO hits performance especially with small size IOs on multi-core
>> systems, this hit is significant about 25% with 512 bytes size IOs.
>>
>> This lock is not needed in fc_queuecommand and calling fc_queuecommand
>> without this lock held removes above described performance hit.
>>
>> So this patch adds unlocked_qcmds flag to drop host_lock before
>> calling only fc_queuecommand and removes re-acquire & then drop
>> for each IO. Added flag, drops this lock only if LLD wants such
>> as fcoe.ko does here for fc_queuecommand. This change won't affect
>> any existing LLD since added unlocked_qcmds flag will be zero
>> in those cases and their host lock uses would effectively remain
>> same after this patch.
> [ ... ]
>
> Hello Vasu,
>
> While I don't doubt that your patch improves performance, there might
> be more bottlenecks in the Linux storage stack than the SCSI host
> lock. After having converted the SRP initiator (ib_srp) to use per-LUN
> locking instead of per-host locking and after having enabled
> unlocked_qcmds, I noticed that the largest source of contention was in
> the block layer. There was ten times more contention on
> request_queue.queue_lock than on any other spinlock. As you can see
> below, the four most contended locks in the test I ran were:
> * request_queue.queue_lock, needed by __make_request() in blk-core.c.
> * srp_target_port.lock, the per-LUN lock in ib_srp.c.
> * The SCSI host lock.
> * An AIO lock.
>
> The test I ran used libaio and the NOOP scheduler. Block merging was
> disabled via sysfs.
Bart, can you try with this patchset added:
git://git.kernel.dk/linux-2.6-block.git blk-alloc-optimize
It's a work in progress and not suitable for general consumption yet,
but it's tested working at least. There will be more built on top of
this, but at least even this simple stuff is making a big difference
for IOPS testing for me.
--
Jens Axboe
* Re: [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand
2010-09-24 6:41 ` Jens Axboe
@ 2010-09-25 16:55 ` Bart Van Assche
2010-09-26 3:19 ` Jens Axboe
0 siblings, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2010-09-25 16:55 UTC (permalink / raw)
To: Jens Axboe; +Cc: Vasu Dev, linux-scsi@vger.kernel.org
On Fri, Sep 24, 2010 at 8:41 AM, Jens Axboe <jaxboe@fusionio.com> wrote:
>
> [ ... ]
>
> Bart, can you try with this patchset added:
>
> git://git.kernel.dk/linux-2.6-block.git blk-alloc-optimize
>
> It's a work in progress and not suitable for general consumption yet,
> but it's tested working at least. There will be more built on top of
> this, but at least even this simple stuff is making a big difference
> for IOPS testing for me.
Hello Jens,
Thanks for the feedback. I see a nice 10% speedup after having applied
the four block layer optimization patches from the blk-alloc-optimize
branch on an already patched 2.6.35.5 SRP initiator.
Note: according to the output of perf record -g, most spinlock calls
still originate from the block layer. This is what the perf tool
reported for a fio run using libaio with small blocks (512 bytes):
Event: cycles
- 7.06% fio [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 19.51% blk_run_queue
+ 13.71% blk_end_bidi_request
+ 10.04% mlx4_ib_poll_cq
+ 4.68% lock_timer_base
+ 4.22% aio_complete
+ 3.97% srp_send_completion
+ 3.71% srp_queuecommand
+ 3.55% dio_bio_end_aio
+ 3.37% __srp_get_tx_iu
+ 3.14% srp_recv_completion
+ 3.00% scsi_device_unbusy
+ 2.87% __scsi_put_command
+ 2.82% __blockdev_direct_IO_newtrunc
+ 2.76% scsi_put_command
+ 2.69% scsi_run_queue
+ 2.65% dio_bio_submit
+ 2.54% srp_remove_req
+ 2.46% mlx4_ib_post_send
+ 2.33% scsi_get_command
+ 1.95% mlx4_ib_post_recv
[ ... ]
The script I used to generate a patched kernel can be found here
(r2271): https://scst.svn.sourceforge.net/svnroot/scst/trunk/scripts/generate-kernel-with-srp-patches.
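[Editor's note: a profile of this kind can be collected roughly as follows; the fio job parameters are illustrative, not the exact workload Bart ran.]

```shell
# Hedged sketch: 512-byte random reads via libaio, profiled system-wide.
# /dev/sdX is a placeholder for the SRP-backed device.
fio --name=randread --filename=/dev/sdX --rw=randread --bs=512 \
    --ioengine=libaio --iodepth=64 --direct=1 --runtime=60 &
perf record -g -a -- sleep 60   # record call graphs on all CPUs
perf report                     # inspect e.g. _raw_spin_lock_irqsave callers
```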
Bart.
* Re: [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand
2010-09-25 16:55 ` Bart Van Assche
@ 2010-09-26 3:19 ` Jens Axboe
0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2010-09-26 3:19 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Vasu Dev, linux-scsi@vger.kernel.org
On 2010-09-26 01:55, Bart Van Assche wrote:
> On Fri, Sep 24, 2010 at 8:41 AM, Jens Axboe <jaxboe@fusionio.com> wrote:
>>
>> [ ... ]
>>
>> Bart, can you try with this patchset added:
>>
>> git://git.kernel.dk/linux-2.6-block.git blk-alloc-optimize
>>
>> It's a work in progress and not suitable for general consumption yet,
>> but it's tested working at least. There will be more built on top of
>> this, but at least even this simple stuff is making a big difference
>> for IOPS testing for me.
>
> Hello Jens,
>
> Thanks for the feedback. I see a nice 10% speedup after having applied
> the four block layer optimization patches from the blk-alloc-optimize
> branch on an already patched 2.6.35.5 SRP initiator.
Great! Not too bad for something that's still a WIP.
> Note: according to the output of perf record -g, most spinlock calls
> still originate from the block layer. This is what the perf tool
> reported for a fio run using libaio with small blocks (512 bytes):
>
> Event: cycles
> - 7.06% fio [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> - _raw_spin_lock_irqsave
> + 19.51% blk_run_queue
> + 13.71% blk_end_bidi_request
> + 10.04% mlx4_ib_poll_cq
> + 4.68% lock_timer_base
> + 4.22% aio_complete
> + 3.97% srp_send_completion
> + 3.71% srp_queuecommand
> + 3.55% dio_bio_end_aio
> + 3.37% __srp_get_tx_iu
> + 3.14% srp_recv_completion
> + 3.00% scsi_device_unbusy
> + 2.87% __scsi_put_command
> + 2.82% __blockdev_direct_IO_newtrunc
> + 2.76% scsi_put_command
> + 2.69% scsi_run_queue
> + 2.65% dio_bio_submit
> + 2.54% srp_remove_req
> + 2.46% mlx4_ib_post_send
> + 2.33% scsi_get_command
> + 1.95% mlx4_ib_post_recv
One piece of low hanging fruit is reducing the number of queue runs.
SCSI does this for every completed command to keep the device queue
full. I bet if you try an experiment where you only run the queue when
a certain number of requests have completed, you would greatly reduce
scsi_run_queue and blk_run_queue in the above profile.
--
Jens Axboe