* [RFC 0/8] IB/srp: scaling and bug fixes for 2.6.38
@ 2010-12-23 21:55 David Dillow
From: David Dillow @ 2010-12-23 21:55 UTC
To: linux-rdma; +Cc: linux-scsi, Bart Van Assche
[ Sorry to break threading, I botched things when editing the cover
letter to add an attachment... ]
The first patch in this series fixes a longstanding issue where we crash if we
use sg_reset to perform a bus reset, but haven't sent enough commands to
initialize all of our request structures. The remaining patches break up Bart
Van Assche's lock scaling work, and add a few optimizations on top.
The scaling work looks to have paid off pretty well. All tests were conducted
over a QDR link between two Dell R410s with 2.6GHz Xeons. To push any possible
bottlenecks to the initiator, the test target was stripped down to not transfer
the request's data -- it simply responds to the command as though it had.
For fio driving one LUN using the SG engine, refactoring the locking using
patches 2 through 6 gives a 30% increase in command throughput from 16 to 64
threads, while allowing similar (within the noise) or slight improvements for 1
to 8 threads and 128 threads and above. Unsharing the lock (patch 7) with the
SCSI mid-layer hurts a bit for the single thread case (~2%) but gives an
additional 1 to 6% with more than one thread. Cache optimization (patch 8)
returns the single thread case back to par, and gives a modest increase as
threads increase.
For fio driving multiple LUNs using the AIO engine, patches 2 through 6 give
slightly smaller increases at low thread counts with a single drive (20% over
baseline), but the improvement increases as drives are added and/or iodepth
increases, reaching 50% in many cases. Removing the shared lock typically
brings 5-10% improvement over the lock reduction work, and optimizing the cache
usage also gives a modest improvement, though more than in the SG case.
There is more investigation to be done -- for example, AIO peaked at 296k IOPS
from a single drive at an iodepth of 32 and a thread count of 32. SG peaked at
183k IOPS at 64 threads (iodepth was 1, but I did not try a survey for this
engine). I have some completion batching and blk-iopoll conversion patches as
well, but they have some interesting performance anomalies at the moment that
prevent them being a win.
I'd appreciate people's review and comments, as while the patches have over 10
billion commands on them from the performance testing and real hardware, they
involve locking and race conditions, which have a habit of not showing up until
the most inopportune time.
Once 2.6.37 is out, I'll add sign-offs and push these to my repo for 2.6.38.
David Dillow (8):
IB/srp: allow task management without a previous request
IB/srp: consolidate state change code
IB/srp: allow lockless work posting
IB/srp: don't move active requests to their own list
IB/srp: reduce local coverage for command submission and EH
IB/srp: reduce lock coverage of command completion
IB/srp: stop sharing the host lock with SCSI
IB/srp: consolidate hot-path variables into cache lines
drivers/infiniband/ulp/srp/ib_srp.c | 390 ++++++++++++++++-------------------
drivers/infiniband/ulp/srp/ib_srp.h | 46 +++--
2 files changed, 204 insertions(+), 232 deletions(-)
--
1.7.2.3
* Re: [RFC 0/8] IB/srp: scaling and bug fixes for 2.6.38
From: David Dillow @ 2010-12-23 22:08 UTC
To: linux-rdma; +Cc: linux-scsi, Bart Van Assche
[-- Attachment #1: Type: text/plain, Size: 747 bytes --]
On Thu, 2010-12-23 at 16:55 -0500, David Dillow wrote:
> [ Sorry to break threading, I botched things when editing the cover
> letter to add an attachment... ]
And here's the intended attachment for those interested in the test
results.
The configs:
00-baseline is 698fd6a2c3ca05ec796072defb5c415289a86cdc in Linus's tree.
01-shared-lock is with patches 1 through 6
02-unshared-lock adds patch 7
03-unshared-cache adds patch 8
Each test was run for 10 seconds, and repeated 7 times. I initially ran
for different runtimes to determine that there was little variability
after a few seconds, and settled on 10 to move the tests along.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
[-- Attachment #2: srp-scaling.ods --]
[-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 115310 bytes --]
[-- Attachment #3: run_tests.sh --]
[-- Type: application/x-shellscript, Size: 1908 bytes --]
* [RFC 0/8] IB/srp: scaling and bug fixes for 2.6.38
@ 2010-12-23 21:31 David Dillow
From: David Dillow @ 2010-12-23 21:31 UTC
To: linux-rdma; +Cc: linux-scsi, Bart Van Assche
[-- Attachment #1: Type: text/plain, Size: 3014 bytes --]
[-- Attachment #2: srp-scaling.ods --]
[-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 115310 bytes --]