* SRP initiator and iSER initiator performance
@ 2010-02-27 19:27 Bart Van Assche
[not found] ` <e2e108261002271127x253faa84lf6eb8aa77d3cf51a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2010-02-27 19:27 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: Chris Worley, David Dillow, OFED mailing list, scst-devel
On Mon, Jan 11, 2010 at 7:44 PM, Vladislav Bolkhovitin <vst-d+Crzxg7Rs0@public.gmane.org> wrote:
>
> [ ... ]
>
> The SRP initiator does not seem well optimized for performance. The iSER initiator is noticeably better in this area.
(replying to an e-mail of one month ago)
I'm not sure the above statement makes sense. Below you can find
performance results for 512-byte reads with a varying number of
threads against a NULLIO target. With a sufficiently high number of
threads this test saturated the two CPU cores of the initiator system
but not the CPU core of the target system. For the initiator and
target software combinations used in this test, the numbers below show
that both the latency and the CPU usage of the SRP traffic are
slightly lower than those of the iSER traffic, although the difference
is small. These numbers are quite impressive in their own right: for
both protocols the initiator system completes one I/O operation in
about 17 microseconds, or roughly 44,000 clock cycles.
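As a sanity check (not part of the original mail), the quoted per-I/O cost can be derived from the peak numbers reported below, assuming both initiator cores are fully busy at about 116K IOPS:

```shell
# Back-of-the-envelope check: CPU time and clock cycles per I/O on the
# initiator (two E6750 cores at 2.66 GHz, ~116K IOPS at 64 threads).
awk 'BEGIN {
    cores = 2; ghz = 2.66; iops = 116000
    us_per_io = cores / iops * 1e6      # CPU-core time per I/O, microseconds
    cycles    = us_per_io * ghz * 1e3   # clock cycles per I/O
    printf "%.1f us per I/O, %.0f cycles\n", us_per_io, cycles
}'
```

This reproduces the "about 17 microseconds or 44000 clock cycles" figure (more precisely, about 17.2 us and about 45,900 cycles).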
iSER:
1 read : io=128MB, bw=13,755KB/s, iops=27,510, runt= 9529msec
2 read : io=256MB, bw=26,118KB/s, iops=52,235, runt= 10037msec
4 read : io=512MB, bw=48,985KB/s, iops=97,970, runt= 10703msec
8 read : io=1,024MB, bw=57,519KB/s, iops=115K, runt= 18230msec
16 read : io=2,048MB, bw=57,880KB/s, iops=116K, runt= 36233msec
32 read : io=4,096MB, bw=57,990KB/s, iops=116K, runt= 72328msec
64 read : io=8,192MB, bw=58,066KB/s, iops=116K, runt=144468msec
CPU load for 64 threads (according to vmstat 2): 20% us + 80% sy on
the initiator and 40% us + 20% sy + 40% id on the target.
SRP:
1 read : io=128MB, bw=14,211KB/s, iops=28,422, runt= 9223msec
2 read : io=256MB, bw=26,275KB/s, iops=52,549, runt= 9977msec
4 read : io=512MB, bw=49,257KB/s, iops=98,513, runt= 10644msec
8 read : io=1,024MB, bw=60,322KB/s, iops=121K, runt= 17383msec
16 read : io=2,048MB, bw=61,272KB/s, iops=123K, runt= 34227msec
32 read : io=4,096MB, bw=61,176KB/s, iops=122K, runt= 68561msec
64 read : io=8,192MB, bw=60,963KB/s, iops=122K, runt=137602msec
CPU load for 64 threads (according to vmstat 2): 20% us + 80% sy on
the initiator and 0% us + 50% sy + 50% id on the target.
Setup details:
* The above output was generated with the following command:
for i in 1 2 4 8 16 32 64; do printf "%2d " $i; io-load 512 $i ${initiator_device} | grep runt; done
* The io-load script is as follows:
#!/bin/sh
blocksize="${1:-512}"
threads="${2:-1}"
dev="${3:-sdj}"
fio --bs="${blocksize}" --buffered=0 --size=128M --ioengine=sg \
    --rw=read --invalidate=1 --end_fsync=1 --thread --numjobs="${threads}" \
    --loops=1 --group_reporting --name=nullio --filename="/dev/${dev}"
* SRP target software: SCST r1522 compiled in release mode.
* iSER target software: tgt 1.0.2.
* InfiniBand hardware: QDR PCIe 2.0 HCAs.
* Initiator system:
2.6.33-rc7 kernel (for-next branch of Roland's InfiniBand repository,
without the recently posted iSER and SRP performance improvement
patches).
The SRP initiator was loaded with parameter srp_sg_tablesize=128.
Frequency scaling was disabled.
Runlevel: 3.
CPU: E6750 @ 2.66GHz.
* Target system:
2.6.30.7 kernel + SCST patches.
Frequency scaling was disabled.
Runlevel: 3.
CPU: E8400 @ 3.00GHz booted with maxcpus=1.
Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: SRP initiator and iSER initiator performance
[not found] ` <e2e108261002271127x253faa84lf6eb8aa77d3cf51a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-03-01 20:12 ` Vladislav Bolkhovitin
[not found] ` <4B8C1FBF.8060001-d+Crzxg7Rs0@public.gmane.org>
[not found] ` <e2e108261003011238h331e473bge905b8ea695f7483@mail.gmail.com>
0 siblings, 2 replies; 6+ messages in thread
From: Vladislav Bolkhovitin @ 2010-03-01 20:12 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Chris Worley, David Dillow, OFED mailing list, scst-devel
Bart Van Assche, on 02/27/2010 10:27 PM wrote:
> On Mon, Jan 11, 2010 at 7:44 PM, Vladislav Bolkhovitin <vst-d+Crzxg7Rs0@public.gmane.org> wrote:
>> [ ... ]
>>
>> The SRP initiator does not seem well optimized for performance. The iSER initiator is noticeably better in this area.
>
> (replying to an e-mail of one month ago)
>
> [ ... full message quoted; benchmark results and setup details as above ... ]
It's good if my impression was wrong. But you've got suspiciously low
IOPS numbers; on your hardware you should see much more. It seems you
hit a bottleneck on the initiator somewhere above the driver level
(fio? the sg engine? IRQ or context-switch counts?), so your results
may not really be related to the topic. Oprofile and lockstat output
could shed more light on this.
Vlad
* Re: SRP initiator and iSER initiator performance
[not found] ` <4B8C1FBF.8060001-d+Crzxg7Rs0@public.gmane.org>
@ 2010-03-02 6:59 ` Bart Van Assche
[not found] ` <e2e108261003012259o7c9b217j5011006e3a772d6a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2010-03-02 6:59 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: Chris Worley, David Dillow, OFED mailing list, scst-devel
On Mon, Mar 1, 2010 at 9:12 PM, Vladislav Bolkhovitin <vst-d+Crzxg7Rs0@public.gmane.org> wrote:
> [ ... ]
> It's good if my impression was wrong. But you've got suspiciously low IOPS
> numbers. On your hardware you should have much more. Seems you experienced a
> bottleneck on the initiator somewhere above the drivers level (fio? sg
> engine? IRQs or context switches count?), so your results could be not
> really related to the topic. Oprofile and lockstat output can shed more
> light on this.
You didn't understand the purpose of the test. My goal was not to
achieve record IOPS numbers but to stress the SRP and iSER initiators
as much as possible. I chose the sg I/O engine in order to bypass the
block layer.
Bart.
* Re: SRP initiator and iSER initiator performance
[not found] ` <e2e108261003011238h331e473bge905b8ea695f7483-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-03-03 20:23 ` Vladislav Bolkhovitin
[not found] ` <4B8EC526.4060006-d+Crzxg7Rs0@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Vladislav Bolkhovitin @ 2010-03-03 20:23 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Chris Worley, David Dillow, OFED mailing list, scst-devel
Bart Van Assche, on 03/01/2010 11:38 PM wrote:
> On Mon, Mar 1, 2010 at 9:12 PM, Vladislav Bolkhovitin <vst-d+Crzxg7Rs0@public.gmane.org
> <mailto:vst-d+Crzxg7Rs0@public.gmane.org>> wrote:
>
> [ ... ]
> It's good if my impression was wrong. But you've got suspiciously
> low IOPS numbers. On your hardware you should have much more. Seems
> you experienced a bottleneck on the initiator somewhere above the
> drivers level (fio? sg engine? IRQs or context switches count?), so
> your results could be not really related to the topic. Oprofile and
> lockstat output can shed more light on this.
>
>
> The number of IOPS I obtained is really high considering that I used the
> sg I/O engine. This means that no buffering has been used and none of
> the I/O requests were combined into larger requests. I chose the sg I/O
> engine on purpose in order to bypass the block layer. I was not
> interested in record IOPS numbers but in a test where most of the time
> is spent in the SRP / iSER initiator instead of the block layer.
116K IOPS isn't high; it's pretty low for QDR IB. Even 4 Gbps FC can
outperform it. Remember, Microsoft managed to get 1 million IOPS from
10GbE, and your card should be much faster. This is why I strongly
suspect the test is incorrect.
Let's estimate what your IB card can achieve. It has 1 us latency on
1-byte packets, so it can perform at least 1 million ops/sec. This is
a conservative estimate, because (1) if the card has a multi-core
setup, this number can be several times bigger, and (2) it includes
data transfers. On the other hand, your card can read data at 2.9
GB/s. If we assume that transferring a 512-byte packet has 100%
overhead (also conservative, because I can't believe that such a
low-latency HPC interconnect has so large a data transfer overhead),
this gives 2.9e9 / (512 * 2) = roughly 2.8 million IOPS. So your IB
hardware should be capable of at least 1 million I/O transfers per
second, which is about 10 times more than you measured.
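These back-of-the-envelope bounds can be reproduced numerically (a sketch using the figures from the mail: 1 us packet latency, 2.9 GB/s read bandwidth, and an assumed 100% overhead on 512-byte transfers; the mail rounds the bandwidth bound up to 2.9 million):

```shell
# Two conservative capability bounds for the HCA, from the figures above.
awk 'BEGIN {
    lat_bound = 1 / 1e-6              # ops/s if each op costs the 1 us latency
    bw_bound  = 2.9e9 / (512 * 2)     # ops/s at 2.9 GB/s with 100% overhead
    printf "latency bound:   %.1f M ops/s\n", lat_bound / 1e6
    printf "bandwidth bound: %.1f M ops/s\n", bw_bound / 1e6
}'
```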
So you definitely need to find the bottleneck. I would start by
checking:
1. fio may be implemented inefficiently. This can be checked with the
null ioengine.
2. You may have only one outstanding command at a time (queue depth
1). You can check this during the test either with iostat on the
initiator, or (better) on the SCST target in the
/proc/scsi_tgt/sessions and /proc/scsi_tgt/sgv files.
3. The sg engine may be used by fio in indirect mode, i.e. it copies
data between user and kernel space. This can be checked by looking at
fio's sources or with oprofile.
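Check (1) could be driven by a standalone fio job file along the lines of the following sketch (the job name is made up, and the availability of the null ioengine in this fio build is an assumption):

```ini
; null-check.fio -- hypothetical job: the null ioengine discards all I/O,
; so the measured IOPS reflect fio's own per-command overhead only.
[null-check]
bs=512
size=128M
rw=read
ioengine=null
thread
numjobs=64
loops=1
group_reporting
```

Run it with `fio null-check.fio`; if the null-engine IOPS are not far above 116K, fio itself is the bottleneck.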
Vlad
* Re: SRP initiator and iSER initiator performance
[not found] ` <e2e108261003012259o7c9b217j5011006e3a772d6a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-03-03 20:23 ` Vladislav Bolkhovitin
0 siblings, 0 replies; 6+ messages in thread
From: Vladislav Bolkhovitin @ 2010-03-03 20:23 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Chris Worley, David Dillow, OFED mailing list, scst-devel
Bart Van Assche, on 03/02/2010 09:59 AM wrote:
> On Mon, Mar 1, 2010 at 9:12 PM, Vladislav Bolkhovitin <vst-d+Crzxg7Rs0@public.gmane.org> wrote:
>> [ ... ]
>> It's good if my impression was wrong. But you've got suspiciously low IOPS
>> numbers. On your hardware you should have much more. Seems you experienced a
>> bottleneck on the initiator somewhere above the drivers level (fio? sg
>> engine? IRQs or context switches count?), so your results could be not
>> really related to the topic. Oprofile and lockstat output can shed more
>> light on this.
>
> You didn't understand the purpose of the test. My goal was not to
> achieve record IOPS numbers but to stress the SRP and iSER initiators
> as much as possible. I choose the sg I/O engine in order to bypass the
> block layer.
No, Bart, I understood your purpose very well. Let me illustrate my
point with an example. Suppose we want to compare a Ferrari and a
Toyota Corolla. The only track available has a 60 km/h speed limit,
and we drive strictly within that limit. Would we get a correct
comparison of the cars' capabilities, or would we only be comparing
their speedometers' errors? If the Toyota's speedometer lets its
driver stay closer to the limit, the Toyota can beat the Ferrari. But
would it still win with a 180 km/h limit? Or with no speed limit at
all?
The same applies to our topic. We can consider your experiment correct
only if the bottleneck is the driver or the hardware, which isn't
likely.
Vlad
* Re: SRP initiator and iSER initiator performance
[not found] ` <4B8EC526.4060006-d+Crzxg7Rs0@public.gmane.org>
@ 2010-03-04 7:03 ` Bart Van Assche
0 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2010-03-04 7:03 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: Chris Worley, David Dillow, OFED mailing list, scst-devel
On Wed, Mar 3, 2010 at 9:23 PM, Vladislav Bolkhovitin <vst-d+Crzxg7Rs0@public.gmane.org> wrote:
> Bart Van Assche, on 03/01/2010 11:38 PM wrote:
>>
>> On Mon, Mar 1, 2010 at 9:12 PM, Vladislav Bolkhovitin <vst-d+Crzxg7Rs0@public.gmane.org
>> <mailto:vst-d+Crzxg7Rs0@public.gmane.org>> wrote:
>>
>> [ ... ]
>> It's good if my impression was wrong. But you've got suspiciously
>> low IOPS numbers. On your hardware you should have much more. Seems
>> you experienced a bottleneck on the initiator somewhere above the
>> drivers level (fio? sg engine? IRQs or context switches count?), so
>> your results could be not really related to the topic. Oprofile and
>> lockstat output can shed more light on this.
>>
>>
>> The number of IOPS I obtained is really high considering that I used the
>> sg I/O engine. This means that no buffering has been used and none of the
>> I/O requests were combined into larger requests. I chose the sg I/O engine
>> on purpose in order to bypass the block layer. I was not interested in
>> record IOPS numbers but in a test where most of the time is spent in the SRP
>> / iSER initiator instead of the block layer.
>
> 116K IOPS isn't high; it's pretty low for QDR IB. [ ... ]
It looks like it's time for you to familiarize yourself with the
difference between CPU-bound, I/O-bound, and memory-bound workloads.
It is essential to understand this terminology before commenting on
performance tests.
Bart.