* Re: dm-multipath has great throughput but we'd like more!
2006-05-18 7:44 ` Bob Gautier
@ 2006-05-18 7:55 ` Jonathan E Brassow
2006-05-18 7:59 ` Luca Berra
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Jonathan E Brassow @ 2006-05-18 7:55 UTC (permalink / raw)
To: rgautier; +Cc: device-mapper development, consult-list
On May 18, 2006, at 2:44 AM, Bob Gautier wrote:
> On Thu, 2006-05-18 at 02:25 -0500, Jonathan E Brassow wrote:
>> The system bus isn't a limiting factor is it? 64-bit PCI-X will get
>> 8.5 GB/s (plenty), but 32-bit PCI 33MHz got 133MB/s.
>>
>> Can your disks sustain that much bandwidth? 10 striped drives might
>> get
>> better than 200MB/s if done right, I suppose.
>>
>> Don't the switches run at 2 Gbits/s? 2 Gbits/s / 10 (throw in 2 bits
>> for protocol) ~= 200MB/s.
>>
>
> Thanks for the fast responses:
>
> The card is a 64-bit PCI-X, so I don't think the bus is the bottleneck,
> and anyway the vendor specifies a maximum throughput of 200Mbyte/s per
> card.
>
> The disk array does not appear to be the bottleneck because we get
> 200Mbyte/s when we use *two* HBAs in load-balanced mode.
>
> The question is really about why we only see O(100Mbyte/s) with one HBA
> when we can achieve O(200MByte/s) with two cards, given that one card
> should be able to achieve that throughput.
>
> I don't think the method of producing the traffic (bonnie++ or
> something
> else) should be relevant but if it were that would be very interesting
> for the benchmark authors!
>
> The storage is an HDS 9980 (I think?)
>
I guess I was thinking you were asking why you weren't getting 240MB/s,
and I overlooked the obvious question. I guess I don't know the answer
(or even the right questions). :(
brassow
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: dm-multipath has great throughput but we'd like more!
2006-05-18 7:44 ` Bob Gautier
2006-05-18 7:55 ` Jonathan E Brassow
@ 2006-05-18 7:59 ` Luca Berra
2006-05-18 8:04 ` [Consult-list] " Nicholas C. Strugnell
2006-05-18 20:28 ` Steve Lord
3 siblings, 0 replies; 15+ messages in thread
From: Luca Berra @ 2006-05-18 7:59 UTC (permalink / raw)
To: device-mapper development
On Thu, May 18, 2006 at 08:44:14AM +0100, Bob Gautier wrote:
>The card is a 64-bit PCI-X, so I don't think the bus is the bottleneck,
>and anyway the vendor specifies a maximum throughput of 200Mbyte/s per
>card.
>
>The disk array does not appear to be the bottleneck because we get
>200Mbyte/s when we use *two* HBAs in load-balanced mode.
>
>The question is really about why we only see O(100Mbyte/s) with one HBA
>when we can achieve O(200MByte/s) with two cards, given that one card
>should be able to achieve that throughput.
>
>I don't think the method of producing the traffic (bonnie++ or something
>else) should be relevant but if it were that would be very interesting
>for the benchmark authors!
>
>The storage is an HDS 9980 (I think?)
I am not an expert on Hitachi storage, but:
does each HBA map to a different controller on the storage?
Do you have any statistics on disk usage from the storage side?
L.
--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Consult-list] Re: dm-multipath has great throughput but we'd like more!
2006-05-18 7:44 ` Bob Gautier
2006-05-18 7:55 ` Jonathan E Brassow
2006-05-18 7:59 ` Luca Berra
@ 2006-05-18 8:04 ` Nicholas C. Strugnell
2006-05-18 9:42 ` Nicholas C. Strugnell
2006-05-18 20:28 ` Steve Lord
3 siblings, 1 reply; 15+ messages in thread
From: Nicholas C. Strugnell @ 2006-05-18 8:04 UTC (permalink / raw)
To: rgautier; +Cc: device-mapper development, consult-list
On Thu, 2006-05-18 at 08:44 +0100, Bob Gautier wrote:
> On Thu, 2006-05-18 at 02:25 -0500, Jonathan E Brassow wrote:
> > The system bus isn't a limiting factor is it? 64-bit PCI-X will get
> > 8.5 GB/s (plenty), but 32-bit PCI 33MHz got 133MB/s.
> >
> > Can your disks sustain that much bandwidth? 10 striped drives might get
> > better than 200MB/s if done right, I suppose.
> >
This is an HDS Lightning - 64GB of mirrored write cache - I doubt if any
of the writes even see disk :-)
> > Don't the switches run at 2 Gbits/s? 2 Gbits/s / 10 (throw in 2 bits
> > for protocol) ~= 200MB/s.
> >
>
> Thanks for the fast responses:
>
> The card is a 64-bit PCI-X, so I don't think the bus is the bottleneck,
> and anyway the vendor specifies a maximum throughput of 200Mbyte/s per
> card.
>
> The disk array does not appear to be the bottleneck because we get
> 200Mbyte/s when we use *two* HBAs in load-balanced mode.
>
> The question is really about why we only see O(100Mbyte/s) with one HBA
> when we can achieve O(200MByte/s) with two cards, given that one card
> should be able to achieve that throughput.
>
We've just done _exactly_ the same test against an EVA 8000 with 8
active paths - theoretically we should be able to get 4Gb/s via two HBAs
but in fact we saw max 200MB/s with ext2, dropping to 160MB/s with ext3
- this was to a fairly slow RAID5 but that is irrelevant as we have a
16GB write cache and we were only writing 4GB files with bonnie++.
I'm not sure where the overhead is. The fact that we see a 20%
performance drop when we switch journalling on suggests that the
overhead might be in the filesystem perhaps?
It might make sense to test raw writes to a device with dd and see if
that gets comparable performance figures - I'll just try that myself
actually.
Nick
--
M: +44 (0)7736 665171 Skype: nstrug
http://europe.redhat.com
GPG FPR: 9C6C 093C 756A 6C57 49A1 E211 BBBA F5F5 C440 5DE0
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Consult-list] Re: dm-multipath has great throughput but we'd like more!
2006-05-18 8:04 ` [Consult-list] " Nicholas C. Strugnell
@ 2006-05-18 9:42 ` Nicholas C. Strugnell
2006-05-18 10:28 ` Richard Keech
2006-05-22 15:31 ` Ed Wilts
0 siblings, 2 replies; 15+ messages in thread
From: Nicholas C. Strugnell @ 2006-05-18 9:42 UTC (permalink / raw)
To: rgautier; +Cc: device-mapper development, consult-list
On Thu, 2006-05-18 at 10:04 +0200, Nicholas C. Strugnell wrote:
> On Thu, 2006-05-18 at 08:44 +0100, Bob Gautier wrote:
> > On Thu, 2006-05-18 at 02:25 -0500, Jonathan E Brassow wrote:
> > > The system bus isn't a limiting factor is it? 64-bit PCI-X will get
> > > 8.5 GB/s (plenty), but 32-bit PCI 33MHz got 133MB/s.
> > >
> > > Can your disks sustain that much bandwidth? 10 striped drives might get
> > > better than 200MB/s if done right, I suppose.
> > >
>
> It might make sense to test raw writes to a device with dd and see if
> that gets comparable performance figures - I'll just try that myself
> actually.
write throughput to EVA 8000 (8GB write cache), host DL380 with 2x2Gb/s
HBAs, 2GB RAM
testing 4GB files:
on filesystems: bonnie++ -d /mnt/tmp -s 4g -f -n 0 -u root
ext3: 129MB/s sd=0.43
ext2: 202MB/s sd=21.34
on raw: 216MB/s sd=3.93 (dd if=/dev/zero of=/dev/mpath/3600508b4001048ba0000b00001400000 bs=4k count=1048576)
NB I did not have exclusive access to the SAN or this particular storage
array - this is a big corp. SAN network under quite heavy load and disk
array under moderate load - not even sure if I had exclusive access to
the disks. All values averaged over 20 runs.
The very low deviation of write speed on ext3 vs. ext2 or raw is
interesting - not sure if it means anything.
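To put the deviation figures on a common scale, here is a minimal Python sketch that expresses each reported standard deviation as a fraction of its mean (the coefficient of variation). It uses only the three mean/sd pairs reported above; the function and variable names are illustrative.

```python
# Reported mean throughput (MB/s) and standard deviation over 20 runs.
results = {
    "ext3": (129.0, 0.43),
    "ext2": (202.0, 21.34),
    "raw": (216.0, 3.93),
}


def coefficient_of_variation(mean: float, sd: float) -> float:
    """Return the standard deviation as a fraction of the mean."""
    return sd / mean


for name, (mean, sd) in results.items():
    cv = coefficient_of_variation(mean, sd)
    print(f"{name:5s} mean={mean:6.1f} MB/s  sd={sd:5.2f}  cv={cv:.2%}")
```

On these numbers the ext3 runs vary by well under 1% of the mean, while the ext2 runs vary by roughly a tenth of the mean. That says nothing about the cause, only that the difference in spread is large.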
In any case, we don't manage to get very close to the theoretical
throughput of the 2 HBAs, 512MB/s
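As a rough cross-check of the link arithmetic, here is a small Python sketch. It assumes 8b/10b line encoding (10 bits on the wire per data byte), the same rule of thumb quoted earlier in the thread, and a nominal 2 Gbit/s line rate; the function name is illustrative and the result is an estimate, not a vendor figure. The 512MB/s number appears to be 4Gb/s divided by 8 with no allowance for encoding.

```python
def fc_payload_mb_per_s(line_rate_gbit: float, links: int = 1) -> float:
    """Approximate payload bandwidth in MB/s, assuming 8b/10b encoding."""
    bits_on_wire_per_byte = 10  # assumption: 8 data bits + 2 encoding bits
    bytes_per_s = line_rate_gbit * 1e9 / bits_on_wire_per_byte
    return links * bytes_per_s / 1e6


one_hba = fc_payload_mb_per_s(2.0)
two_hbas = fc_payload_mb_per_s(2.0, links=2)
measured_raw = 216.0  # MB/s, the raw dd figure reported above

print(f"one HBA : {one_hba:.0f} MB/s")
print(f"two HBAs: {two_hbas:.0f} MB/s")
print(f"raw dd reached {measured_raw / two_hbas:.0%} of the two-HBA estimate")
```

Under these assumptions the ceiling is about 200 MB/s per HBA and 400 MB/s for the pair, so the raw dd result sits a little over half of the estimate.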
Nick
--
M: +44 (0)7736 665171 Skype: nstrug
http://europe.redhat.com
GPG FPR: 9C6C 093C 756A 6C57 49A1 E211 BBBA F5F5 C440 5DE0
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Consult-list] Re: dm-multipath has great throughput but we'd like more!
2006-05-18 9:42 ` Nicholas C. Strugnell
@ 2006-05-18 10:28 ` Richard Keech
2006-05-22 15:31 ` Ed Wilts
1 sibling, 0 replies; 15+ messages in thread
From: Richard Keech @ 2006-05-18 10:28 UTC (permalink / raw)
To: Nicholas C. Strugnell; +Cc: device-mapper development, consult-list
Nicholas C. Strugnell wrote:
>On Thu, 2006-05-18 at 10:04 +0200, Nicholas C. Strugnell wrote:
>>On Thu, 2006-05-18 at 08:44 +0100, Bob Gautier wrote:
>>>On Thu, 2006-05-18 at 02:25 -0500, Jonathan E Brassow wrote:
>>>>The system bus isn't a limiting factor is it? 64-bit PCI-X will get
>>>>8.5 GB/s (plenty), but 32-bit PCI 33MHz got 133MB/s.
>>>>
>>>>Can your disks sustain that much bandwidth? 10 striped drives might get
>>>>better than 200MB/s if done right, I suppose.
>>>>
>>
>>It might make sense to test raw writes to a device with dd and see if
>>that gets comparable performance figures - I'll just try that myself
>>actually.
>>
>
>write throughput to EVA 8000 (8GB write cache), host DL380 with 2x2Gb/s
>HBAs, 2GB RAM
>
>testing 4GB files:
>
>on filesystems: bonnie++ -d /mnt/tmp -s 4g -f -n 0 -u root
>
>ext3: 129MB/s sd=0.43
>
I presume this is with data=ordered. Try with data=writeback.
I've seen benchmarks which suggest it can be close to the speed of ext2.
--
Red Hat Home
www.redhat.com.au <http://www.redhat.com>
*Richard Keech*
Chief Technology Architect
Red Hat Asia-Pacific
email: rkeech@redhat.com <mailto:rkeech@redhat.com>
mobile: +61 419 036 463
Level 50, 120 Collins Street
Melbourne VIC 3000
phone: +61 3 9225 5258
fax: +61 3 9225 5050
support: 1800 733 428
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Consult-list] Re: dm-multipath has great throughput but we'd like more!
2006-05-18 9:42 ` Nicholas C. Strugnell
2006-05-18 10:28 ` Richard Keech
@ 2006-05-22 15:31 ` Ed Wilts
1 sibling, 0 replies; 15+ messages in thread
From: Ed Wilts @ 2006-05-22 15:31 UTC (permalink / raw)
To: device-mapper development; +Cc: consult-list
On Thu, May 18, 2006 at 11:42:36AM +0200, Nicholas C. Strugnell wrote:
> write throughput to EVA 8000 (8GB write cache), host DL380 with 2x2Gb/s
> HBAs, 2GB RAM
>
> testing 4GB files:
>
> on filesystems: bonnie++ -d /mnt/tmp -s 4g -f -n 0 -u root
>
> ext3: 129MB/s sd=0.43
>
> ext2: 202MB/s sd=21.34
>
> on raw: 216MB/s sd=3.93 (dd if=/dev/zero of=/dev/mpath/3600508b4001048ba0000b00001400000 bs=4k count=1048576)
>
>
> NB I did not have exclusive access to the SAN or this particular storage
> array - this is a big corp. SAN network under quite heavy load and disk
> array under moderate load - not even sure if I had exclusive access to
> the disks. All values averaged over 20 runs.
Since I manage a half-dozen EVAs, I'll pretend I actually know something
about them :-). First, there are multiple ways of setting up the LUNs
on the frame - anywhere from a small LUN with RAID5 to a large LUN with
raid 0. The differences should be significant. A small RAID5 LUN will
give you very limited balancing across physical disks. Because of the
virtualization of the disks within the frame, you most definitely do not
have exclusive access to the physical disks. It's quite possible that
your raid 5 partition is on the same physical disk as a very busy
database. The EVA spreads the lun across multiple spindles - the larger
the lun, the more spindles you can get working for you.
If you can, get the storage group to assign you a large raid 0 lun and
redo your tests. You should see different results.
.../Ed
--
Ed Wilts, RHCE
Mounds View, MN, USA
mailto:ewilts@ewilts.org
Member #1, Red Hat Community Ambassador Program
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: dm-multipath has great throughput but we'd like more!
2006-05-18 7:44 ` Bob Gautier
` (2 preceding siblings ...)
2006-05-18 8:04 ` [Consult-list] " Nicholas C. Strugnell
@ 2006-05-18 20:28 ` Steve Lord
3 siblings, 0 replies; 15+ messages in thread
From: Steve Lord @ 2006-05-18 20:28 UTC (permalink / raw)
To: rgautier, device-mapper development; +Cc: consult-list
Provided you have things cabled right and you have 2 HBA ports going either
into a switch, or into the controllers of the raid (raid probably has 4
ports), then the theoretical bandwidth is closer to 400 Mbytes/sec. Pretty sure
any reasonable Hitachi raid will sustain close to that. Using other software and
raid hardware I can generally sustain 375 Mbytes/sec from 2 qlogic hba ports in
a fairly old dell server box, and that is going through 3 switches in the
middle.
You need to have sustained I/O which is directed at both sides of the
raid though. Not sure about the HDS 9980, but I think that is an
active/active raid, which means each controller can access each lun
in parallel. You really need to be striping your I/O across the luns
and controllers though. You can pull tricks to measure the fabric
capacity vs the storage bandwidth by using the raid's cache. Ensure you
have caching enabled in the raid, and have a file which is laid out
across multiple luns. Read a file which is a large percentage of the
cache size using O_DIRECT (lmdd can be built with direct I/O support).
Then run the read again; if you did it right, you just eliminated the
spindles from the I/O.
Not sure about the hitachi raid again, but a lun would generally
belong to a controller on the raid, and there are usually two
controllers. Make sure that when you build the volume you stripe
luns so that they alternate between controllers. Then you need to
make sure that your I/Os are large enough to hit multiple disks
at once. There are lots of tricks to tuning this type of setup.
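The alternating layout described above can be sketched in a few lines of Python. This is only an illustration of the ordering idea: the device names are hypothetical, and which LUN belongs to which controller has to come from the array's own management tools.

```python
from itertools import chain, zip_longest


def alternate_by_controller(luns_a, luns_b):
    """Interleave two LUN lists so neighbouring stripe members use different controllers."""
    pairs = zip_longest(luns_a, luns_b)
    return [lun for lun in chain.from_iterable(pairs) if lun is not None]


# Hypothetical device names: five LUNs owned by each of two controllers.
controller_a = [f"/dev/mapper/lun_a{i}" for i in range(5)]
controller_b = [f"/dev/mapper/lun_b{i}" for i in range(5)]

stripe_order = alternate_by_controller(controller_a, controller_b)
print("\n".join(stripe_order))
```

The resulting list is the order in which the devices would be handed to whatever builds the striped volume, so that consecutive stripe chunks land on different controllers.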
The problem with the load balancing in dm-multipath is that it is not
really load balancing: it is round robin, on a per-LUN basis I think, and
it has no global picture of how much other load is currently going
to each HBA or controller port. The best you can do is drop the value
of rr_min_io in the /etc/multipath.conf file to a small value, try
something like 1 or 2.
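To see why a small rr_min_io matters, here is a deliberately simplified Python model of round-robin path selection: each path receives rr_min_io consecutive I/Os before the selector moves on. This is a sketch of the behaviour described above, not the kernel's path-selector code; the path names, the burst size of 64, and the value 1000 used as an example of a large setting are all assumptions for illustration.

```python
from collections import Counter


def round_robin_paths(num_ios, paths, rr_min_io):
    """Yield the path used for each I/O, switching path every rr_min_io I/Os."""
    for i in range(num_ios):
        yield paths[(i // rr_min_io) % len(paths)]


paths = ["hba0", "hba1"]  # hypothetical path names
burst = 64                # assumed number of I/Os in one short burst

for rr_min_io in (1000, 2, 1):
    spread = Counter(round_robin_paths(burst, paths, rr_min_io))
    print(f"rr_min_io={rr_min_io:4d} -> {dict(spread)}")
```

In this model a burst shorter than rr_min_io goes entirely down one path, while a value of 1 or 2 splits the same burst evenly across both.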
Steve
Bob Gautier wrote:
> On Thu, 2006-05-18 at 02:25 -0500, Jonathan E Brassow wrote:
>> The system bus isn't a limiting factor is it? 64-bit PCI-X will get
>> 8.5 GB/s (plenty), but 32-bit PCI 33MHz got 133MB/s.
>>
>> Can your disks sustain that much bandwidth? 10 striped drives might get
>> better than 200MB/s if done right, I suppose.
>>
>> Don't the switches run at 2 Gbits/s? 2 Gbits/s / 10 (throw in 2 bits
>> for protocol) ~= 200MB/s.
>>
>
> Thanks for the fast responses:
>
> The card is a 64-bit PCI-X, so I don't think the bus is the bottleneck,
> and anyway the vendor specifies a maximum throughput of 200Mbyte/s per
> card.
>
> The disk array does not appear to be the bottleneck because we get
> 200Mbyte/s when we use *two* HBAs in load-balanced mode.
>
> The question is really about why we only see O(100Mbyte/s) with one HBA
> when we can achieve O(200MByte/s) with two cards, given that one card
> should be able to achieve that throughput.
>
> I don't think the method of producing the traffic (bonnie++ or something
> else) should be relevant but if it were that would be very interesting
> for the benchmark authors!
>
> The storage is an HDS 9980 (I think?)
>
>> Could be a bunch of reasons...
>>
>> brassow
>>
>> On May 18, 2006, at 2:05 AM, Bob Gautier wrote:
>>
>>> Yesterday my client was testing of multipath load balancing and
>>> failover
>>> on a system running ext3 on a logical volume which comprises about ten
>>> SAN LUNs all reached using multipath in multibus mode over two QL2340
>>> HBAs.
>>>
>>> On the one hand, the client is very impressed: running bonnie++
>>> (inspired by Ronan's GFS v VxFS example) we get just over 200Mbyte/s
>>> over the two HBAs, and when we pull a link we get about 120MByte/s.
>>>
>>> The throughput and failover response times are better than the client
>>> has ever seen, but we're wondering why we are not seeing higher
>>> throughput per-HBA -- the QL2340 datasheet says it should manage
>>> 200Mbyte/s and all switches etc. run at 2GBps.
>>>
>>> Any ideas?
>>>
>>> Bob Gautier
>>> +44 7921 700996
>>>
^ permalink raw reply [flat|nested] 15+ messages in thread