* multipath_busy() stalls IO due to scsi_host_is_busy()
@ 2012-05-16 12:28 Bernd Schubert
2012-05-16 14:06 ` James Bottomley
2012-05-17 9:09 ` multipath_busy() stalls IO due to scsi_host_is_busy() Jun'ichi Nomura
0 siblings, 2 replies; 15+ messages in thread
From: Bernd Schubert @ 2012-05-16 12:28 UTC (permalink / raw)
To: dm-devel; +Cc: linux-scsi@vger.kernel.org
Hello,
while I actually want to benchmark FhGFS on a NetApp system, I'm somehow
running from one kernel problem to another.
Yesterday we had to recable and while we are still using multipath,
each priority group now only has one underlying device (we don't have
sufficient IB SRP ports on our test systems, but still want to benchmark
a system as close as possible to a production system).
So after recabling all failover paths disappeared, which
*shouldn't* have any influence on performance. However, performance
is now unexpectedly down to less than 50% when I'm doing buffered IO. With
direct IO it is still fine, and reducing nr_requests of the multipath
device to 8 also 'fixes' the problem. I then guessed the cause and simply
made multipath_busy() always return 0, which also fixes the issue.
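For reference, the "noop" hack I tested is essentially just this (illustrative
only, a debugging hack against drivers/md/dm-mpath.c, obviously not meant as a
real fix):

    /* debugging hack: never report the path group as busy */
    static int multipath_busy(struct dm_target *ti)
    {
            return 0;
    }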
problem:
- iostat -x -m 1 shows that alternately one of the multipath devices starts
to stall IO for several minutes
- the other multipath device then does IO during that time at about
600 to 700 MB/s, until it starts to stall IO itself
- the active NetApp controller could serve both multipath devices at
about 600 to 700 MB/s
problem solutions:
- add another passive sdX device to the multipath group
- use direct IO
- reduce /sys/block/dm-X/queue/nr_requests to 8
  (/sys/block/sdX does not need to be updated)
- disable multipath_busy() by letting it return 0
Looking through the call chain, I see the underlying problem seems to be
in scsi_host_is_busy().
> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
> {
> if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
> shost->host_blocked || shost->host_self_blocked)
> return 1;
>
> return 0;
> }
>
shost->can_queue -> 62 here
shost->host_busy -> 62 when one of the multipath groups does IO, further
multipath groups then seem to get stalled.
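To spell out the call chain as I understand it (abbreviated and paraphrased,
the intermediate helper names may differ slightly in your tree):

    dm_request_fn()
      -> multipath_busy()               /* dm-mpath .busy hook */
        -> __pgpath_busy()
          -> dm_underlying_device_busy()
            -> blk_lld_busy()
              -> scsi_lld_busy()
                -> scsi_host_is_busy()  /* 1 once host_busy >= can_queue */

So once one multipath device has filled the shared Scsi_Host queue, the busy
check makes dm hold back requests for every other multipath device on that
host as well.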
I'm not sure yet why multipath_busy() does not stall IO when there is a
passive path in the prio group.
Any idea how to properly address this problem?
Thanks,
Bernd
* Re: multipath_busy() stalls IO due to scsi_host_is_busy()
2012-05-16 12:28 multipath_busy() stalls IO due to scsi_host_is_busy() Bernd Schubert
@ 2012-05-16 14:06 ` James Bottomley
2012-05-16 14:29 ` Bernd Schubert
2012-05-17 9:09 ` multipath_busy() stalls IO due to scsi_host_is_busy() Jun'ichi Nomura
1 sibling, 1 reply; 15+ messages in thread
From: James Bottomley @ 2012-05-16 14:06 UTC (permalink / raw)
To: Bernd Schubert; +Cc: dm-devel, linux-scsi@vger.kernel.org
On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
> shost->can_queue -> 62 here
> shost->host_busy -> 62 when one of the multipath groups does IO, further
> multipath groups then seem to get stalled.
>
> I'm not sure yet why multipath_busy() does not stall IO when there is a
> passive path in the prio group.
>
> Any idea how to properly address this problem?
shost->can_queue is supposed to represent the maximum number of possible
outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
the driver got it right, the only way of increasing this is to buy a
better HBA.
James
* Re: multipath_busy() stalls IO due to scsi_host_is_busy()
2012-05-16 14:06 ` James Bottomley
@ 2012-05-16 14:29 ` Bernd Schubert
2012-05-16 15:27 ` [dm-devel] " Mike Christie
0 siblings, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2012-05-16 14:29 UTC (permalink / raw)
To: James Bottomley; +Cc: dm-devel, linux-scsi@vger.kernel.org
On 05/16/2012 04:06 PM, James Bottomley wrote:
> On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
>> shost->can_queue -> 62 here
>> shost->host_busy -> 62 when one of the multipath groups does IO, further
>> multipath groups then seem to get stalled.
>>
>> I'm not sure yet why multipath_busy() does not stall IO when there is a
>> passive path in the prio group.
>>
>> Any idea how to properly address this problem?
>
> shost->can_queue is supposed to represent the maximum number of possible
> outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
> the driver got it right, the only way of increasing this is to buy a
> better HBA.
The HBA is a Mellanox IB adapter. I have not checked yet where the limit of
62 queue entries comes from, but that is not the real problem either. The
real problem is that multipath suspends IO although it should not.
As I said, if I remove the functionality of those busy functions,
everything is fine. I think what happens is that dm-multipath suspends
IO for too long and in the meantime the other path submits IO
again. So I guess the underlying problem is an unfair queuing strategy.
Cheers,
Bernd
* Re: [dm-devel] multipath_busy() stalls IO due to scsi_host_is_busy()
2012-05-16 14:29 ` Bernd Schubert
@ 2012-05-16 15:27 ` Mike Christie
[not found] ` <4FB3C75F.3070903-hcNo3dDEHLuVc3sceRu5cw@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Mike Christie @ 2012-05-16 15:27 UTC (permalink / raw)
To: device-mapper development
Cc: Bernd Schubert, James Bottomley, linux-scsi@vger.kernel.org
On 05/16/2012 09:29 AM, Bernd Schubert wrote:
> On 05/16/2012 04:06 PM, James Bottomley wrote:
>> On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
>>> shost->can_queue -> 62 here
>>> shost->host_busy -> 62 when one of the multipath groups does IO,
>>> further
>>> multipath groups then seem to get stalled.
>>>
>>> I'm not sure yet why multipath_busy() does not stall IO when there is a
>>> passive path in the prio group.
>>>
>>> Any idea how to properly address this problem?
>>
>> shost->can_queue is supposed to represent the maximum number of possible
>> outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
>> the driver got it right, the only way of increasing this is to buy a
>> better HBA.
>
> HBA is a mellanox IB adapter. I have not checked yet where the limit of
What driver is this with? SRP or iSER or something else?
* Re: [dm-devel] multipath_busy() stalls IO due to scsi_host_is_busy()
[not found] ` <4FB3C75F.3070903-hcNo3dDEHLuVc3sceRu5cw@public.gmane.org>
@ 2012-05-16 15:54 ` Bernd Schubert
[not found] ` <4FB3CDC5.9040608-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2012-05-16 15:54 UTC (permalink / raw)
To: Mike Christie
Cc: device-mapper development, James Bottomley,
linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, David Dillow,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 05/16/2012 05:27 PM, Mike Christie wrote:
> On 05/16/2012 09:29 AM, Bernd Schubert wrote:
>> On 05/16/2012 04:06 PM, James Bottomley wrote:
>>> On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
>>>> shost->can_queue -> 62 here
>>>> shost->host_busy -> 62 when one of the multipath groups does IO,
>>>> further
>>>> multipath groups then seem to get stalled.
>>>>
>>>> I'm not sure yet why multipath_busy() does not stall IO when there is a
>>>> passive path in the prio group.
>>>>
>>>> Any idea how to properly address this problem?
>>>
>>> shost->can_queue is supposed to represent the maximum number of possible
>>> outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
>>> the driver got it right, the only way of increasing this is to buy a
>>> better HBA.
>>
>> HBA is a mellanox IB adapter. I have not checked yet where the limit of
>
> What driver is this with? SRP or iSER or something else?
It's SRP. The command queue limit comes from SRP_RQ_SIZE. The value seems
a bit low, IMHO, and it's definitely lower than needed for optimal
performance. However, given that I get good performance when
multipath_busy() is a noop, I think that is the primary issue here. And
it will always be possible for a single LUN to use up all command queue
entries; other LUNs still shouldn't be stalled completely.
So in summary we actually have two issues:
1) Unfair queuing/waiting of dm-mpath, which stalls an entire path and
brings down overall performance.
2) Low SRP command queue depth. Is there a reason why
SRP_RQ_SHIFT/SRP_RQ_SIZE and their dependent values such as SRP_CMD_SQ_SIZE
are so small?
Thanks,
Bernd
* Re: [dm-devel] multipath_busy() stalls IO due to scsi_host_is_busy()
[not found] ` <4FB3CDC5.9040608-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
@ 2012-05-16 17:03 ` David Dillow
2012-05-16 20:34 ` Bernd Schubert
2012-05-21 15:49 ` [PATCH] srp: convert SRP_RQ_SHIFT into a module parameter Bernd Schubert
0 siblings, 2 replies; 15+ messages in thread
From: David Dillow @ 2012-05-16 17:03 UTC (permalink / raw)
To: Bernd Schubert
Cc: Mike Christie, device-mapper development, James Bottomley,
linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Wed, 2012-05-16 at 11:54 -0400, Bernd Schubert wrote:
> 2) Low SRP command queues. Is there a reason why
> SRP_RQ_SHIFT/SRP_RQ_SIZE and their depend values such as SRP_RQ_SIZE are
> so small?
That's a decision that has been around since the beginning of the driver
as far as I can tell. It looks to be a balance between device needs and
memory usage, and it can certainly be raised -- I've run locally with
SRP_RQ_SHIFT set to 7 (shost.can_queue 126) and I'm sure 8 would be no
problem, either. I didn't see a performance improvement on my workload,
but maybe you will.
Because we take the minimum of our initiator queue depth and the initial
credits from the target (each request consumes a credit), going higher than a
shift of 8 doesn't buy us much, as I don't know off-hand of any target that gives
out more than 256 credits.
The memory used for the command ring will vary depending on the value of
SRP_RQ_SHIFT and the number of s/g entries allowed to be put in the
command. 255 s/g entries requires ~4200 bytes per request, which rounds up
to an 8 KB allocation, so we currently require 512 KB of buffers for the
send queue for each target. Going to a shift of 8 will require 2 MB max per
target, which probably isn't a real issue.
There's also a response ring with an allocation size that depends on the
target, but those should be pretty small buffers, say 1 KB * (1 <<
SRP_RQ_SHIFT).
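To spell out the arithmetic behind those numbers (rough figures, assuming the
~4200 byte per-request buffer rounds up to an 8 KB slab allocation):

    SRP_RQ_SHIFT = 6:   64 requests * 8 KB ~= 512 KB send buffers per target
    SRP_RQ_SHIFT = 8:  256 requests * 8 KB ~=   2 MB send buffers per target
    response ring:     (1 << SRP_RQ_SHIFT) * ~1 KB, i.e. roughly 64 KB to 256 KB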
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
* Re: [dm-devel] multipath_busy() stalls IO due to scsi_host_is_busy()
2012-05-16 17:03 ` David Dillow
@ 2012-05-16 20:34 ` Bernd Schubert
2012-05-21 15:49 ` [PATCH] srp: convert SRP_RQ_SHIFT into a module parameter Bernd Schubert
1 sibling, 0 replies; 15+ messages in thread
From: Bernd Schubert @ 2012-05-16 20:34 UTC (permalink / raw)
To: David Dillow
Cc: Mike Christie, device-mapper development, James Bottomley,
linux-scsi@vger.kernel.org, linux-rdma@vger.kernel.org
On 05/16/2012 07:03 PM, David Dillow wrote:
> On Wed, 2012-05-16 at 11:54 -0400, Bernd Schubert wrote:
>> 2) Low SRP command queues. Is there a reason why
>> SRP_RQ_SHIFT/SRP_RQ_SIZE and their depend values such as SRP_RQ_SIZE are
>> so small?
>
> That's a decision that has been around since the beginning of the driver
> as far as I can tell. It looks to be a balance between device needs and
> memory usage, and it can certainly be raised -- I've run locally with
> SRP_RQ_SHIFT set to 7 (shost.can_queue 126) and I'm sure 8 would be no
> problem, either. I didn't see a performance improvement on my workload,
> but may you will.
Ah, thanks a lot! In the past I tested the DDN S2A and figured out that a
queue size of 16 per device provides optimal performance. So with
typically 7 primary devices per server that makes 112, so SRP_RQ_SHIFT=7
is perfectly fine. But with another typical configuration of 14
devices per server and with the current multipath-busy strategy, you
should already see a performance drop.
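Spelling out my arithmetic: 7 LUNs * 16 = 112 <= 126 (can_queue with
SRP_RQ_SHIFT=7), but 14 LUNs * 16 = 224 > 126, so the host queue limit is hit
and the busy logic starts stalling paths.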
Right now I'm running tests on a NetApp and don't know yet optimal
parameters. So I set the queue size to the maximum, but didn't expect
such multipath issues...
>
> Because we take the minimum of our initiator queue depth and the initial
> credits from the target (each req consumes a credit), going higher than
> 8 doesn't buy us much, as I don't know off-hand of any target that gives
> out more than 256 credits.
>
> The memory used for the command ring will vary depending on the value of
> SRP_RQ_SHIFT and the number of s/g entries allows to be put in the
> command. 255 s/g entries requires an 8 KB allocation for each request
> (~4200 bytes), so we currently require 512 KB of buffers for the send
> queue for each target. Going to 8 will require 2 MB max per target,
> which probably isn't a real issue.
>
> There's also a response ring with an allocation size that depends on the
> target, but those should be pretty small buffers, say 1 KB * (1 <<
> SRP_RQ_SHIFT).
>
Maybe we should convert the parameter into a module option? I will
look into it tomorrow.
And unless someone else comes up with a dm-mpath patch first, I will try to
fix the first issue. I think I will simply always allow a few requests per
prio-group (rough pseudo-code below). Let's see if this gets accepted.
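Pure pseudo-code of what I have in mind (the helper names and the constant
below are made up, the real dm-mpath structures differ):

    static int multipath_busy(struct dm_target *ti)
    {
            struct multipath *m = ti->private;

            /* hypothetical: always admit a small number of requests per
             * priority group, even if the underlying host looks busy */
            if (pg_inflight_requests(m) < MPATH_MIN_INFLIGHT)
                    return 0;

            return pg_or_path_busy(m);      /* existing busy logic otherwise */
    }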
Thanks,
Bernd
PS: Thanks a lot for the ib-srp large-IO patches you sent last
year! I just noticed them last week.
* Re: multipath_busy() stalls IO due to scsi_host_is_busy()
2012-05-16 12:28 multipath_busy() stalls IO due to scsi_host_is_busy() Bernd Schubert
2012-05-16 14:06 ` James Bottomley
@ 2012-05-17 9:09 ` Jun'ichi Nomura
2012-05-17 13:46 ` Mike Snitzer
2012-05-21 15:42 ` Bernd Schubert
1 sibling, 2 replies; 15+ messages in thread
From: Jun'ichi Nomura @ 2012-05-17 9:09 UTC (permalink / raw)
To: Bernd Schubert; +Cc: dm-devel, linux-scsi@vger.kernel.org
Hi,
On 05/16/12 21:28, Bernd Schubert wrote:
> Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().
>
>> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
>> {
>> if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
>> shost->host_blocked || shost->host_self_blocked)
>> return 1;
>>
>> return 0;
>> }
multipath_busy() was introduced because, without that,
a request would be prematurely sent down to SCSI,
lose the chance of additional merges and result in
bad performance.
However, when it is target/host that is busy, I think dm should
send the request down and let SCSI, which has better knowledge
about the shared resource, do appropriate starvation control.
Could you try the attached patch?
---
Jun'ichi Nomura, NEC Corporation
If sdev is not busy but the starget and/or host is busy,
it is better to accept a request from the stacking driver.
Otherwise, the stacking device could be starved by other devices
sharing the same target/host.
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5dfd749..0eb4602 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1378,16 +1378,13 @@ static int scsi_lld_busy(struct request_queue *q)
{
struct scsi_device *sdev = q->queuedata;
struct Scsi_Host *shost;
- struct scsi_target *starget;
if (!sdev)
return 0;
shost = sdev->host;
- starget = scsi_target(sdev);
- if (scsi_host_in_recovery(shost) || scsi_host_is_busy(shost) ||
- scsi_target_is_busy(starget) || scsi_device_is_busy(sdev))
+ if (scsi_host_in_recovery(shost) || scsi_device_is_busy(sdev))
return 1;
return 0;
* Re: multipath_busy() stalls IO due to scsi_host_is_busy()
2012-05-17 9:09 ` multipath_busy() stalls IO due to scsi_host_is_busy() Jun'ichi Nomura
@ 2012-05-17 13:46 ` Mike Snitzer
2012-05-21 15:42 ` Bernd Schubert
1 sibling, 0 replies; 15+ messages in thread
From: Mike Snitzer @ 2012-05-17 13:46 UTC (permalink / raw)
To: Jun'ichi Nomura; +Cc: Bernd Schubert, dm-devel, linux-scsi@vger.kernel.org
On Thu, May 17 2012 at 5:09am -0400,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:
> Hi,
>
> On 05/16/12 21:28, Bernd Schubert wrote:
> > Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().
> >
> >> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
> >> {
> >> if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
> >> shost->host_blocked || shost->host_self_blocked)
> >> return 1;
> >>
> >> return 0;
> >> }
>
> multipath_busy() was introduced because, without that,
> a request would be prematurely sent down to SCSI,
> lose the chance of additional merges and result in
> bad performance.
>
> However, when it is target/host that is busy, I think dm should
> send the request down and let SCSI, which has better knowledge
> about the shared resource, do appropriate starvation control.
>
> Could you try the attached patch?
>
> ---
> Jun'ichi Nomura, NEC Corporation
>
> If sdev is not busy but starget and/or host is busy,
> it is better to accept a request from stacking driver.
> Otherwise, the stacking device could be starved by other device
> sharing the same target/host.
Great suggestion.
It should be noted that DM mpath is the only caller of blk_lld_busy (and
scsi_lld_busy). So even though this patch may _seem_ like the tail
(mpath) wagging the dog (SCSI), it is reasonable to change SCSI's
definition of a LLD being "busy" if it benefits multipath.
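(For context, the hook in question is tiny -- roughly, from block/blk-core.c
of this era, abbreviated:)

    int blk_lld_busy(struct request_queue *q)
    {
            if (q->lld_busy_fn)
                    return q->lld_busy_fn(q);   /* scsi_lld_busy() for SCSI */

            return 0;
    }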
* Re: multipath_busy() stalls IO due to scsi_host_is_busy()
2012-05-17 9:09 ` multipath_busy() stalls IO due to scsi_host_is_busy() Jun'ichi Nomura
2012-05-17 13:46 ` Mike Snitzer
@ 2012-05-21 15:42 ` Bernd Schubert
2012-05-22 4:31 ` Jun'ichi Nomura
1 sibling, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2012-05-21 15:42 UTC (permalink / raw)
To: Jun'ichi Nomura; +Cc: dm-devel, linux-scsi@vger.kernel.org
On 05/17/2012 11:09 AM, Jun'ichi Nomura wrote:
> Hi,
>
> On 05/16/12 21:28, Bernd Schubert wrote:
>> Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().
>>
>>> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
>>> {
>>> if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
>>> shost->host_blocked || shost->host_self_blocked)
>>> return 1;
>>>
>>> return 0;
>>> }
>
> multipath_busy() was introduced because, without that,
> a request would be prematurely sent down to SCSI,
> lose the chance of additional merges and result in
> bad performance.
>
> However, when it is target/host that is busy, I think dm should
> send the request down and let SCSI, which has better knowledge
> about the shared resource, do appropriate starvation control.
>
> Could you try the attached patch?
>
> ---
> Jun'ichi Nomura, NEC Corporation
>
> If sdev is not busy but starget and/or host is busy,
> it is better to accept a request from stacking driver.
> Otherwise, the stacking device could be starved by other device
> sharing the same target/host.
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 5dfd749..0eb4602 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1378,16 +1378,13 @@ static int scsi_lld_busy(struct request_queue *q)
> {
> struct scsi_device *sdev = q->queuedata;
> struct Scsi_Host *shost;
> - struct scsi_target *starget;
>
> if (!sdev)
> return 0;
>
> shost = sdev->host;
> - starget = scsi_target(sdev);
>
> - if (scsi_host_in_recovery(shost) || scsi_host_is_busy(shost) ||
> - scsi_target_is_busy(starget) || scsi_device_is_busy(sdev))
> + if (scsi_host_in_recovery(shost) || scsi_device_is_busy(sdev))
> return 1;
>
> return 0;
Thanks, that works fine! I had something else in mind for multipath,
but if this can go into scsi we don't need the multipath patch anymore.
Are you going to submit it officially (it is missing a Signed-off-by)?
Thanks,
Bernd
* [PATCH] srp: convert SRP_RQ_SHIFT into a module parameter
2012-05-16 17:03 ` David Dillow
2012-05-16 20:34 ` Bernd Schubert
@ 2012-05-21 15:49 ` Bernd Schubert
[not found] ` <4FBA6412.7040505-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
1 sibling, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2012-05-21 15:49 UTC (permalink / raw)
To: David Dillow
Cc: Mike Christie, James Bottomley, linux-scsi@vger.kernel.org,
linux-rdma@vger.kernel.org
On 05/16/2012 07:03 PM, David Dillow wrote:
> On Wed, 2012-05-16 at 11:54 -0400, Bernd Schubert wrote:
>> 2) Low SRP command queues. Is there a reason why
>> SRP_RQ_SHIFT/SRP_RQ_SIZE and their depend values such as SRP_RQ_SIZE are
>> so small?
>
> That's a decision that has been around since the beginning of the driver
> as far as I can tell. It looks to be a balance between device needs and
> memory usage, and it can certainly be raised -- I've run locally with
> SRP_RQ_SHIFT set to 7 (shost.can_queue 126) and I'm sure 8 would be no
> problem, either. I didn't see a performance improvement on my workload,
> but may you will.
>
> Because we take the minimum of our initiator queue depth and the initial
> credits from the target (each req consumes a credit), going higher than
> 8 doesn't buy us much, as I don't know off-hand of any target that gives
> out more than 256 credits.
>
> The memory used for the command ring will vary depending on the value of
> SRP_RQ_SHIFT and the number of s/g entries allows to be put in the
> command. 255 s/g entries requires an 8 KB allocation for each request
> (~4200 bytes), so we currently require 512 KB of buffers for the send
> queue for each target. Going to 8 will require 2 MB max per target,
> which probably isn't a real issue.
>
> There's also a response ring with an allocation size that depends on the
> target, but those should be pretty small buffers, say 1 KB * (1<<
> SRP_RQ_SHIFT).
>
David, below is a first version to convert SRP_RQ_SHIFT into a new
module option "srp_rq_size". I already tested it, but I also need to
re-read it myself.
Author: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Date: Mon May 21 10:25:01 2012 +0200
infiniband/srp: convert SRP_RQ_SHIFT into a module parameter
When SRP_RQ_SHIFT is a fixed compile-time parameter the optimal value is unclear.
The current value of 6 might be too small for several storage systems
and larger values might take too much memory for other systems.
So make it a parameter and let the admin decide.
Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0bfa545..67fa47f 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -62,6 +62,7 @@ MODULE_LICENSE("Dual BSD/GPL");
static unsigned int srp_sg_tablesize;
static unsigned int cmd_sg_entries;
static unsigned int indirect_sg_entries;
+static unsigned int srp_rq_size, srp_sq_size, srp_cmd_sq_size;
static bool allow_ext_sg;
static int topspin_workarounds = 1;
@@ -76,6 +77,9 @@ module_param(indirect_sg_entries, uint, 0444);
MODULE_PARM_DESC(indirect_sg_entries,
"Default max number of gather/scatter entries (default is 12, max is " __stringify(SCSI_MAX_SG_CHAIN_SEGMENTS) ")");
+module_param(srp_rq_size, uint, 0444);
+MODULE_PARM_DESC(srp_rq_size, "Request queue size (default is 64, max is " __stringify(SRP_RQ_MAX) ")");
+
module_param(allow_ext_sg, bool, 0444);
MODULE_PARM_DESC(allow_ext_sg,
"Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false)");
@@ -227,14 +231,14 @@ static int srp_create_target_ib(struct srp_target_port *target)
return -ENOMEM;
target->recv_cq = ib_create_cq(target->srp_host->srp_dev->dev,
- srp_recv_completion, NULL, target, SRP_RQ_SIZE, 0);
+ srp_recv_completion, NULL, target, srp_rq_size, 0);
if (IS_ERR(target->recv_cq)) {
ret = PTR_ERR(target->recv_cq);
goto err;
}
target->send_cq = ib_create_cq(target->srp_host->srp_dev->dev,
- srp_send_completion, NULL, target, SRP_SQ_SIZE, 0);
+ srp_send_completion, NULL, target, srp_sq_size, 0);
if (IS_ERR(target->send_cq)) {
ret = PTR_ERR(target->send_cq);
goto err_recv_cq;
@@ -243,8 +247,8 @@ static int srp_create_target_ib(struct srp_target_port *target)
ib_req_notify_cq(target->recv_cq, IB_CQ_NEXT_COMP);
init_attr->event_handler = srp_qp_event;
- init_attr->cap.max_send_wr = SRP_SQ_SIZE;
- init_attr->cap.max_recv_wr = SRP_RQ_SIZE;
+ init_attr->cap.max_send_wr = srp_sq_size;
+ init_attr->cap.max_recv_wr = srp_rq_size;
init_attr->cap.max_recv_sge = 1;
init_attr->cap.max_send_sge = 1;
init_attr->sq_sig_type = IB_SIGNAL_ALL_WR;
@@ -287,9 +291,9 @@ static void srp_free_target_ib(struct srp_target_port *target)
ib_destroy_cq(target->send_cq);
ib_destroy_cq(target->recv_cq);
- for (i = 0; i < SRP_RQ_SIZE; ++i)
+ for (i = 0; i < srp_rq_size; ++i)
srp_free_iu(target->srp_host, target->rx_ring[i]);
- for (i = 0; i < SRP_SQ_SIZE; ++i)
+ for (i = 0; i < srp_sq_size; ++i)
srp_free_iu(target->srp_host, target->tx_ring[i]);
}
@@ -460,7 +464,7 @@ static void srp_free_req_data(struct srp_target_port *target)
struct srp_request *req;
int i;
- for (i = 0, req = target->req_ring; i < SRP_CMD_SQ_SIZE; ++i, ++req) {
+ for (i = 0, req = target->req_ring; i < srp_cmd_sq_size; ++i, ++req) {
kfree(req->fmr_list);
kfree(req->map_page);
if (req->indirect_dma_addr) {
@@ -472,6 +476,41 @@ static void srp_free_req_data(struct srp_target_port *target)
}
}
+/**
+ * Free target rings
+ */
+static void srp_free_target_rings(struct srp_target_port *target)
+{
+ kfree(target->tx_ring);
+ kfree(target->rx_ring);
+ kfree(target->req_ring);
+}
+
+/**
+ * Target ring allocations with a size depending on module options
+ */
+static int srp_alloc_target_rings(struct srp_target_port *target)
+{
+ target->tx_ring = kzalloc(srp_sq_size * sizeof(**(target->tx_ring)), GFP_KERNEL);
+ if (!target->tx_ring)
+ return -ENOMEM;
+
+ target->rx_ring = kzalloc(srp_rq_size * sizeof(**(target->rx_ring)), GFP_KERNEL);
+ if (!target->rx_ring) {
+ kfree(target->tx_ring);
+ return -ENOMEM;
+ }
+
+ target->req_ring = kzalloc(srp_cmd_sq_size * sizeof(*(target->req_ring)), GFP_KERNEL);
+ if (!target->req_ring) {
+ kfree(target->tx_ring);
+ kfree(target->rx_ring);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
static void srp_remove_work(struct work_struct *work)
{
struct srp_target_port *target =
@@ -489,6 +528,7 @@ static void srp_remove_work(struct work_struct *work)
ib_destroy_cm_id(target->cm_id);
srp_free_target_ib(target);
srp_free_req_data(target);
+ srp_free_target_rings(target);
scsi_host_put(target->scsi_host);
}
@@ -620,14 +660,14 @@ static int srp_reconnect_target(struct srp_target_port *target)
while (ib_poll_cq(target->send_cq, 1, &wc) > 0)
; /* nothing */
- for (i = 0; i < SRP_CMD_SQ_SIZE; ++i) {
+ for (i = 0; i < srp_cmd_sq_size; ++i) {
struct srp_request *req = &target->req_ring[i];
if (req->scmnd)
srp_reset_req(target, req);
}
INIT_LIST_HEAD(&target->free_tx);
- for (i = 0; i < SRP_SQ_SIZE; ++i)
+ for (i = 0; i < srp_sq_size; ++i)
list_add(&target->tx_ring[i]->list, &target->free_tx);
target->qp_in_error = 0;
@@ -969,7 +1009,7 @@ static void srp_put_tx_iu(struct srp_target_port *target, struct srp_iu *iu,
* Note:
* An upper limit for the number of allocated information units for each
* request type is:
- * - SRP_IU_CMD: SRP_CMD_SQ_SIZE, since the SCSI mid-layer never queues
+ * - SRP_IU_CMD: srp_cmd_sq_size, since the SCSI mid-layer never queues
* more than Scsi_Host.can_queue requests.
* - SRP_IU_TSK_MGMT: SRP_TSK_MGMT_SQ_SIZE.
* - SRP_IU_RSP: 1, since a conforming SRP target never sends more than
@@ -1320,7 +1360,7 @@ static int srp_alloc_iu_bufs(struct srp_target_port *target)
{
int i;
- for (i = 0; i < SRP_RQ_SIZE; ++i) {
+ for (i = 0; i < srp_rq_size; ++i) {
target->rx_ring[i] = srp_alloc_iu(target->srp_host,
target->max_ti_iu_len,
GFP_KERNEL, DMA_FROM_DEVICE);
@@ -1328,7 +1368,7 @@ static int srp_alloc_iu_bufs(struct srp_target_port *target)
goto err;
}
- for (i = 0; i < SRP_SQ_SIZE; ++i) {
+ for (i = 0; i < srp_sq_size; ++i) {
target->tx_ring[i] = srp_alloc_iu(target->srp_host,
target->max_iu_len,
GFP_KERNEL, DMA_TO_DEVICE);
@@ -1341,12 +1381,12 @@ static int srp_alloc_iu_bufs(struct srp_target_port *target)
return 0;
err:
- for (i = 0; i < SRP_RQ_SIZE; ++i) {
+ for (i = 0; i < srp_rq_size; ++i) {
srp_free_iu(target->srp_host, target->rx_ring[i]);
target->rx_ring[i] = NULL;
}
- for (i = 0; i < SRP_SQ_SIZE; ++i) {
+ for (i = 0; i < srp_sq_size; ++i) {
srp_free_iu(target->srp_host, target->tx_ring[i]);
target->tx_ring[i] = NULL;
}
@@ -1401,7 +1441,7 @@ static void srp_cm_rep_handler(struct ib_cm_id *cm_id,
if (ret)
goto error_free;
- for (i = 0; i < SRP_RQ_SIZE; i++) {
+ for (i = 0; i < srp_rq_size; i++) {
struct srp_iu *iu = target->rx_ring[i];
ret = srp_post_recv(target, iu);
if (ret)
@@ -1649,7 +1689,7 @@ static int srp_reset_device(struct scsi_cmnd *scmnd)
if (target->tsk_mgmt_status)
return FAILED;
- for (i = 0; i < SRP_CMD_SQ_SIZE; ++i) {
+ for (i = 0; i < srp_cmd_sq_size; ++i) {
struct srp_request *req = &target->req_ring[i];
if (req->scmnd && req->scmnd->device == scmnd->device)
srp_reset_req(target, req);
@@ -1841,9 +1881,9 @@ static struct scsi_host_template srp_template = {
.eh_device_reset_handler = srp_reset_device,
.eh_host_reset_handler = srp_reset_host,
.sg_tablesize = SRP_DEF_SG_TABLESIZE,
- .can_queue = SRP_CMD_SQ_SIZE,
+ .can_queue = SRP_CMD_SQ_DEF_SIZE,
.this_id = -1,
- .cmd_per_lun = SRP_CMD_SQ_SIZE,
+ .cmd_per_lun = SRP_CMD_SQ_DEF_SIZE,
.use_clustering = ENABLE_CLUSTERING,
.shost_attrs = srp_host_attrs
};
@@ -2034,7 +2074,7 @@ static int srp_parse_options(const char *buf, struct srp_target_port *target)
printk(KERN_WARNING PFX "bad max cmd_per_lun parameter '%s'\n", p);
goto out;
}
- target->scsi_host->cmd_per_lun = min(token, SRP_CMD_SQ_SIZE);
+ target->scsi_host->cmd_per_lun = min_t(unsigned, token, srp_cmd_sq_size);
break;
case SRP_OPT_IO_CLASS:
@@ -2131,9 +2171,15 @@ static ssize_t srp_create_target(struct device *dev,
target_host->max_id = 1;
target_host->max_lun = SRP_MAX_LUN;
target_host->max_cmd_len = sizeof ((struct srp_cmd *) (void *) 0L)->cdb;
+ target_host->can_queue = srp_cmd_sq_size;
+ target_host->cmd_per_lun = srp_cmd_sq_size;
target = host_to_target(target_host);
+ ret = srp_alloc_target_rings(target);
+ if (ret)
+ goto err;
+
target->io_class = SRP_REV16A_IB_IO_CLASS;
target->scsi_host = target_host;
target->srp_host = host;
@@ -2163,7 +2209,7 @@ static ssize_t srp_create_target(struct device *dev,
spin_lock_init(&target->lock);
INIT_LIST_HEAD(&target->free_tx);
INIT_LIST_HEAD(&target->free_reqs);
- for (i = 0; i < SRP_CMD_SQ_SIZE; ++i) {
+ for (i = 0; i < srp_cmd_sq_size; ++i) {
struct srp_request *req = &target->req_ring[i];
req->fmr_list = kmalloc(target->cmd_sg_cnt * sizeof (void *),
@@ -2435,6 +2481,7 @@ static void srp_remove_one(struct ib_device *device)
ib_destroy_cm_id(target->cm_id);
srp_free_target_ib(target);
srp_free_req_data(target);
+ srp_free_target_rings(target);
scsi_host_put(target->scsi_host);
}
@@ -2479,6 +2526,17 @@ static int __init srp_init_module(void)
indirect_sg_entries = cmd_sg_entries;
}
+ if (!srp_rq_size)
+ srp_rq_size = SRP_RQ_DEF_SIZE;
+
+ if (srp_rq_size > SRP_RQ_MAX) {
+ printk(KERN_WARNING PFX "Clamping srp_rq_size to %d\n", SRP_RQ_MAX);
+ srp_rq_size = SRP_RQ_MAX;
+ }
+
+ srp_sq_size = srp_rq_size;
+ srp_cmd_sq_size = srp_sq_size - SRP_RSP_SQ_SIZE - SRP_TSK_MGMT_SQ_SIZE;
+
ib_srp_transport_template =
srp_attach_transport(&ib_srp_transport_functions);
if (!ib_srp_transport_template)
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index 020caf0..f2195fd 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -57,14 +57,13 @@ enum {
SRP_MAX_LUN = 512,
SRP_DEF_SG_TABLESIZE = 12,
- SRP_RQ_SHIFT = 6,
- SRP_RQ_SIZE = 1 << SRP_RQ_SHIFT,
+ SRP_RQ_DEF_SIZE = 64,
+ SRP_SQ_DEF_SIZE = SRP_RQ_DEF_SIZE,
+ SRP_RQ_MAX = 1024,
- SRP_SQ_SIZE = SRP_RQ_SIZE,
SRP_RSP_SQ_SIZE = 1,
- SRP_REQ_SQ_SIZE = SRP_SQ_SIZE - SRP_RSP_SQ_SIZE,
SRP_TSK_MGMT_SQ_SIZE = 1,
- SRP_CMD_SQ_SIZE = SRP_REQ_SQ_SIZE - SRP_TSK_MGMT_SQ_SIZE,
+ SRP_CMD_SQ_DEF_SIZE = SRP_SQ_DEF_SIZE - SRP_RSP_SQ_SIZE - SRP_TSK_MGMT_SQ_SIZE,
SRP_TAG_NO_REQ = ~0U,
SRP_TAG_TSK_MGMT = 1U << 31,
@@ -169,9 +168,9 @@ struct srp_target_port {
int zero_req_lim;
- struct srp_iu *tx_ring[SRP_SQ_SIZE];
- struct srp_iu *rx_ring[SRP_RQ_SIZE];
- struct srp_request req_ring[SRP_CMD_SQ_SIZE];
+ struct srp_iu **tx_ring; /* table of srp_sq_size */
+ struct srp_iu **rx_ring; /* table of srp_rq_size */
+ struct srp_request *req_ring; /* table of srp_cmd_sq_size */
struct work_struct work;
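With the patch applied, the intended usage is along these lines (parameter
name as defined above):

    # modprobe ib_srp srp_rq_size=256
    # cat /sys/module/ib_srp/parameters/srp_rq_size
    256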
* Re: [PATCH] srp: convert SRP_RQ_SHIFT into a module parameter
[not found] ` <4FBA6412.7040505-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
@ 2012-05-21 19:35 ` Bernd Schubert
2012-05-30 5:22 ` David Dillow
1 sibling, 0 replies; 15+ messages in thread
From: Bernd Schubert @ 2012-05-21 19:35 UTC (permalink / raw)
To: Bernd Schubert
Cc: David Dillow, Mike Christie, James Bottomley,
linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> @@ -2131,9 +2171,15 @@ static ssize_t srp_create_target(struct device *dev,
> target_host->max_id = 1;
> target_host->max_lun = SRP_MAX_LUN;
> target_host->max_cmd_len = sizeof ((struct srp_cmd *) (void *) 0L)->cdb;
> + target_host->can_queue = srp_cmd_sq_size;
> + target_host->cmd_per_lun = srp_cmd_sq_size;
>
> target = host_to_target(target_host);
>
> + ret = srp_alloc_target_rings(target);
> + if (ret)
> + goto err;
There is a memory leak here if something fails below; I need to add another
goto target.
* Re: multipath_busy() stalls IO due to scsi_host_is_busy()
2012-05-21 15:42 ` Bernd Schubert
@ 2012-05-22 4:31 ` Jun'ichi Nomura
0 siblings, 0 replies; 15+ messages in thread
From: Jun'ichi Nomura @ 2012-05-22 4:31 UTC (permalink / raw)
To: Bernd Schubert, Mike Snitzer; +Cc: dm-devel, linux-scsi@vger.kernel.org
Hi,
On 05/22/12 00:42, Bernd Schubert wrote:
> On 05/17/2012 11:09 AM, Jun'ichi Nomura wrote:
>> However, when it is target/host that is busy, I think dm should
>> send the request down and let SCSI, which has better knowledge
>> about the shared resource, do appropriate starvation control.
>>
>> Could you try the attached patch?
...
> Thanks, that works fine! I had something else in my mind for multipath, but if this can go into scsi we don't need the multipath patch anymore.
> Are you going to officially submit it (missing signed-off)?
Thank you for testing.
Yes, I'll submit it officially with your Reported-by/Tested-by.
Also, a note suggested by Mike Snitzer will be added to the
patch description. (Thanks Mike.)
--
Jun'ichi Nomura, NEC Corporation
* Re: [PATCH] srp: convert SRP_RQ_SHIFT into a module parameter
[not found] ` <4FBA6412.7040505-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
2012-05-21 19:35 ` Bernd Schubert
@ 2012-05-30 5:22 ` David Dillow
[not found] ` <1338355352.2361.24.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
1 sibling, 1 reply; 15+ messages in thread
From: David Dillow @ 2012-05-30 5:22 UTC (permalink / raw)
To: Bernd Schubert
Cc: Mike Christie, James Bottomley,
linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Mon, 2012-05-21 at 17:49 +0200, Bernd Schubert wrote:
> David, below is a first version to convert SRP_RQ_SHIFT into a new
> module option "srp_rq_size". I already tested it, but I also need to
> re-read it myself.
>
> Author: Bernd Schubert <bernd.schubert-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
> Date: Mon May 21 10:25:01 2012 +0200
>
> infiniband/srp: convert SRP_RQ_SHIFT into a module parameter
>
> When SRP_RQ_SHIFT is a fix parameter the optimal value is unclear.
> The current value of 6 might be too small for several storage systems
> and larger values might take too much memory for other systems.
> So make it a parameter and let the admin decide.
This shouldn't be a module parameter; if at all possible, all new
parameters should be specified on a per-target basis, with reasonable
defaults. srp_rq_size isn't the right
Also, I think we can get rid of the {tx,rx}_ring arrays and just
individually allocate the IUs and hook them on the appropriate lists.
This may make cleanup a bit more tricky, as we'd have to walk the
outstanding requests to find all of the IUs....
It may be easier to just add a singly linked list to struct srp_iu to keep
the {tx,rx} IUs on, since we always use them from a direct pointer or
from a free list. We could do a code audit and remove that field later
if needed, but the simplicity is probably worth the memory cost.
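Roughly something like this (I said singly linked, but a regular list_head
would do; the field name is made up and the existing srp_iu members are
elided):

    struct srp_iu {
            struct list_head        alloc_list;     /* hypothetical: threaded
                                                       onto a per-target list
                                                       at allocation, walked
                                                       only at teardown */
            /* ... existing fields: dma, buf, size, direction, ... */
    };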
We have to keep the req_ring array; we need to look those entries up by
index as we don't want to trust a pointer over the wire.
Eventually, I could see us delaying this initialization until we get the
login response back and know the maximum we'd use on this connection.
That should wait, as Bart's persistent connection changes could add some
complications there.
Thanks,
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
* Re: [PATCH] srp: convert SRP_RQ_SHIFT into a module parameter
[not found] ` <1338355352.2361.24.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
@ 2012-05-30 5:24 ` David Dillow
0 siblings, 0 replies; 15+ messages in thread
From: David Dillow @ 2012-05-30 5:24 UTC (permalink / raw)
To: Bernd Schubert
Cc: Mike Christie, James Bottomley,
linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On Wed, 2012-05-30 at 01:22 -0400, David Dillow wrote:
> On Mon, 2012-05-21 at 17:49 +0200, Bernd Schubert wrote:
> > David, below is a first version to convert SRP_RQ_SHIFT into a new
> > module option "srp_rq_size". I already tested it, but I also need to
> > re-read it myself.
> >
> > Author: Bernd Schubert <bernd.schubert-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
> > Date: Mon May 21 10:25:01 2012 +0200
> >
> > infiniband/srp: convert SRP_RQ_SHIFT into a module parameter
> >
> > When SRP_RQ_SHIFT is a fix parameter the optimal value is unclear.
> > The current value of 6 might be too small for several storage systems
> > and larger values might take too much memory for other systems.
> > So make it a parameter and let the admin decide.
>
> This shouldn't be a module parameter; if at all possible, all new
> parameters should be specified on a per-target basis, with reasonable
> defaults. srp_rq_size isn't the right
... name. "queue_depth" or "max_cmd" or "can_queue" would be better.
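i.e. something passed per target in the add_target string, along the lines of
(the token name here is purely illustrative):

    echo "id_ext=...,ioc_guid=...,dgid=...,pkey=ffff,service_id=...,queue_size=128" \
            > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target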
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office