dm-devel.redhat.com archive mirror
* dd to a striped device with 9 disks gets much lower throughput when oflag=direct used
@ 2012-01-27  1:06 Richard Sharpe
  2012-01-27  6:54 ` Hannes Reinecke
  2012-01-27  8:52 ` Christoph Hellwig
From: Richard Sharpe @ 2012-01-27  1:06 UTC (permalink / raw)
  To: dm-devel

Hi,

Perhaps I am doing something stupid, but I would like to understand
why there is a difference in the following situation.

I have defined a stripe device thusly:

     "echo 0 17560535040 striped 9 8 /dev/sdd 0 /dev/sde 0 /dev/sdf 0
/dev/sdg 0 /dev/sdh 0 /dev/sdi 0 /dev/sdj 0 /dev/sdk 0 /dev/sdl 0 |
dmsetup create stripe_dev"
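
(For reference, my understanding of the dm "striped" table fields is:

     <start sector> <length in sectors> striped <#stripes> <chunk size in sectors> [<device> <offset>]...

so the "9 8" above should mean 9 stripes with an 8-sector, i.e. 4KiB,
chunk on each disk.)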

Then I did the following:

    dd if=/dev/zero of=/dev/mapper/stripe_dev bs=262144 count=1000000

and I got 880 MB/s

However, when I changed that command to:

    dd if=/dev/zero of=/dev/mapper/stripe_dev bs=262144 count=1000000
oflag=direct

I get 210 MB/s reliably.

The system in question is a 16-core (probably two CPUs) Intel Xeon
E5620 @ 2.40GHz with 64GB of memory and 12 7200RPM SATA drives
connected to an LSI SAS controller but set up as a JBOD of 12 drives.

Why do I see such a big performance difference? Does writing to the
device also use the page cache if I don't specify DIRECT IO?

-- 
Regards,
Richard Sharpe


* Re: dd to a striped device with 9 disks gets much lower throughput when oflag=direct used
  2012-01-27  1:06 dd to a striped device with 9 disks gets much lower throughput when oflag=direct used Richard Sharpe
@ 2012-01-27  6:54 ` Hannes Reinecke
  2012-01-27  8:52 ` Christoph Hellwig
From: Hannes Reinecke @ 2012-01-27  6:54 UTC (permalink / raw)
  To: device-mapper development

On 01/27/2012 02:06 AM, Richard Sharpe wrote:
> Hi,
> 
> Perhaps I am doing something stupid, but I would like to understand
> why there is a difference in the following situation.
> 
> I have defined a stripe device thusly:
> 
>      "echo 0 17560535040 striped 9 8 /dev/sdd 0 /dev/sde 0 /dev/sdf 0
> /dev/sdg 0 /dev/sdh 0 /dev/sdi 0 /dev/sdj 0 /dev/sdk 0 /dev/sdl 0 |
> dmsetup create stripe_dev"
> 
> Then I did the following:
> 
>     dd if=/dev/zero of=/dev/mapper/stripe_dev bs=262144 count=1000000
> 
> and I got 880 MB/s
> 
> However, when I changed that command to:
> 
>     dd if=/dev/zero of=/dev/mapper/stripe_dev bs=262144 count=1000000
> oflag=direct
> 
> I get 210 MB/s reliably.
> 
> The system in question is a 16-core (probably two CPUs) Intel Xeon
> E5620 @ 2.40GHz with 64GB of memory and 12 7200RPM SATA drives
> connected to an LSI SAS controller but set up as a JBOD of 12 drives.
> 
> Why do I see such a big performance difference? Does writing to the
> device also use the page cache if I don't specify DIRECT IO?
> 
Yes. All I/O issued through normal read/write calls goes via the page cache.
The only way to circumvent this is to use direct I/O (O_DIRECT).
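
If you want to watch it happen, one rough check (a sketch, using the
same dd runs as above) is to keep an eye on the dirty-page counters
while dd is running:

     # Dirty/Writeback grow during the buffered run and stay near zero
     # with oflag=direct
     watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'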

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: dd to a striped device with 9 disks gets much lower throughput when oflag=direct used
  2012-01-27  1:06 dd to a striped device with 9 disks gets much lower throughput when oflag=direct used Richard Sharpe
  2012-01-27  6:54 ` Hannes Reinecke
@ 2012-01-27  8:52 ` Christoph Hellwig
  2012-01-27 15:03   ` Richard Sharpe
From: Christoph Hellwig @ 2012-01-27  8:52 UTC (permalink / raw)
  To: device-mapper development

On Thu, Jan 26, 2012 at 05:06:42PM -0800, Richard Sharpe wrote:
> Why do I see such a big performance difference? Does writing to the
> device also use the page cache if I don't specify DIRECT IO?

Yes.  Try adding conv=fdatasync to both versions to get more
realistic results.
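
For example, something like (same device and sizes as your original runs):

    dd if=/dev/zero of=/dev/mapper/stripe_dev bs=262144 count=1000000 conv=fdatasync
    dd if=/dev/zero of=/dev/mapper/stripe_dev bs=262144 count=1000000 oflag=direct conv=fdatasync

The trailing fdatasync makes the buffered run pay for flushing the page
cache before dd reports its rate, so the two numbers become comparable.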


* Re: dd to a striped device with 9 disks gets much lower throughput when oflag=direct used
  2012-01-27  8:52 ` Christoph Hellwig
@ 2012-01-27 15:03   ` Richard Sharpe
  2012-01-27 15:16     ` Zdenek Kabelac
From: Richard Sharpe @ 2012-01-27 15:03 UTC (permalink / raw)
  To: device-mapper development

On Fri, Jan 27, 2012 at 12:52 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Jan 26, 2012 at 05:06:42PM -0800, Richard Sharpe wrote:
>> Why do I see such a big performance difference? Does writing to the
>> device also use the page cache if I don't specify DIRECT IO?
>
> Yes.  Try adding conv=fdatasync to both versions to get more
> realistic results.

Thank you for that advice. I am comparing btrfs vs rolling my own
thing using the new dm thin-provisioning approach to get something
with resilient metadata, but I need to support two different types of
IO, one that uses directio and one that can take advantage of the page
cache.

So far, btrfs gives me around 800MB/s with a similar setup (can't get
exactly the same setup) without DIRECTIO and 450MB/s with DIRECTIO. A
dm striped setup is giving me about 10% better throughput without
DIRECTIO but only about 45% of the performance with DIRECTIO.

Anyway, I now understand. I will run my scripts with conv=fdatasync as well.

-- 
Regards,
Richard Sharpe


* Re: dd to a striped device with 9 disks gets much lower throughput when oflag=direct used
  2012-01-27 15:03   ` Richard Sharpe
@ 2012-01-27 15:16     ` Zdenek Kabelac
  2012-01-27 15:28       ` Richard Sharpe
From: Zdenek Kabelac @ 2012-01-27 15:16 UTC (permalink / raw)
  To: dm-devel

On 27.1.2012 16:03, Richard Sharpe wrote:
> On Fri, Jan 27, 2012 at 12:52 AM, Christoph Hellwig<hch@infradead.org>  wrote:
>> On Thu, Jan 26, 2012 at 05:06:42PM -0800, Richard Sharpe wrote:
>>> Why do I see such a big performance difference? Does writing to the
>>> device also use the page cache if I don't specify DIRECT IO?
>>
>> Yes.  Try adding conv=fdatasync to both versions to get more
>> realistic results.
>
> Thank you for that advice. I am comparing btrfs vs rolling my own
> thing using the new dm thin-provisioning approach to get something
> with resilient metadata, but I need to support two different types of
> IO, one that uses directio and one that can take advantage of the page
> cache.
>
> So far, btrfs gives me around 800MB/s with a similar setup (can't get
> exactly the same setup) without DIRECTIO and 450MB/s with DIRECTIO. A
> dm striped setup is giving me about 10% better throughput without
> DIRECTIO but only about 45% of the performance with DIRECTIO.
>

You've mentioned you are using a thinp device with striping - do you have
the stripes properly aligned on the data-block-size of the thinp device?
(I think 9 disks are probably quite hard to align on a 3.2 kernel,
since the data block size needs to be a power of 2 - I think 3.3 will have
this relaxed to a page-size boundary.)
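
To spell out the arithmetic (a rough sketch, assuming the 8-sector chunk
from your original table):

     chunk per disk        = 8 * 512B = 4KiB
     full stripe (9 disks) = 9 * 4KiB = 36KiB = 72 sectors

No power of two is a multiple of 72 (it carries a factor of 9), so a
power-of-two data_block_size can never line up with full 9-disk stripes.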

Zdenek


* Re: dd to a striped device with 9 disks gets much lower throughput when oflag=direct used
  2012-01-27 15:16     ` Zdenek Kabelac
@ 2012-01-27 15:28       ` Richard Sharpe
  2012-01-27 17:24         ` Zdenek Kabelac
From: Richard Sharpe @ 2012-01-27 15:28 UTC (permalink / raw)
  To: device-mapper development

On Fri, Jan 27, 2012 at 7:16 AM, Zdenek Kabelac <zkabelac@redhat.com> wrote:
> On 27.1.2012 16:03, Richard Sharpe wrote:
>
>> On Fri, Jan 27, 2012 at 12:52 AM, Christoph Hellwig<hch@infradead.org>
>>  wrote:
>>>
>>> On Thu, Jan 26, 2012 at 05:06:42PM -0800, Richard Sharpe wrote:
>>>>
>>>> Why do I see such a big performance difference? Does writing to the
>>>> device also use the page cache if I don't specify DIRECT IO?
>>>
>>>
>>> Yes.  Try adding conv=fdatasync to both versions to get more
>>> realistic results.
>>
>>
>> Thank you for that advice. I am comparing btrfs vs rolling my own
>> thing using the new dm thin-provisioning approach to get something
>> with resilient metadata, but I need to support two different types of
>> IO, one that uses directio and one that can take advantage of the page
>> cache.
>>
>> So far, btrfs gives me around 800MB/s with a similar setup (can't get
>> exactly the same setup) without DIRECTIO and 450MB/s with DIRECTIO. A
>> dm striped setup is giving me about 10% better throughput without
>> DIRECTIO but only about 45% of the performance with DIRECTIO.
>>
>
> You've mentioned you are using a thinp device with striping - do you have
> the stripes properly aligned on the data-block-size of the thinp device?
> (I think 9 disks are probably quite hard to align on a 3.2 kernel,
> since the data block size needs to be a power of 2 - I think 3.3 will have
> this relaxed to a page-size boundary.)

Actually, so far I have not used any thinp devices, since from reading
the documentation it seemed that, for what I am doing, I need to give
thinp a mirrored device for its metadata and a striped device for its
data, so I thought I would try just a striped device.

Actually, I can cut that back to 8 devices in the stripe. I am using
4kiB block sizes and writing 256kiB blocks in the dd requests and
there is no parity involved so there should be no read-modify-write
cycles.

I imagine that if I push the write sizes up to a MB or more at a time,
throughput will get better, because at the moment each device is being
given 32KiB or 16KiB (a few devices) with DIRECTIO, and with a larger
write size they will get more data at a time.
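
Something like the following, with the count only as a placeholder, should
show whether the per-disk request size is the limiting factor:

     dd if=/dev/zero of=/dev/mapper/stripe_dev bs=1048576 count=250000 oflag=direct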

-- 
Regards,
Richard Sharpe


* Re: dd to a striped device with 9 disks gets much lower throughput when oflag=direct used
  2012-01-27 15:28       ` Richard Sharpe
@ 2012-01-27 17:24         ` Zdenek Kabelac
  2012-01-27 17:48           ` Richard Sharpe
From: Zdenek Kabelac @ 2012-01-27 17:24 UTC (permalink / raw)
  To: device-mapper development

On 27.1.2012 16:28, Richard Sharpe wrote:
> On Fri, Jan 27, 2012 at 7:16 AM, Zdenek Kabelac<zkabelac@redhat.com>  wrote:
>> On 27.1.2012 16:03, Richard Sharpe wrote:
>>
>>> On Fri, Jan 27, 2012 at 12:52 AM, Christoph Hellwig<hch@infradead.org>
>>>   wrote:
>>>>
>>>> On Thu, Jan 26, 2012 at 05:06:42PM -0800, Richard Sharpe wrote:
>>>>>
>>>>> Why do I see such a big performance difference? Does writing to the
>>>>> device also use the page cache if I don't specify DIRECT IO?
>>>>
>>>>
>>>> Yes.  Try adding conv=fdatasync to both versions to get more
>>>> realistic results.
>>>
>>>
>>> Thank you for that advice. I am comparing btrfs vs rolling my own
>>> thing using the new dm thin-provisioning approach to get something
>>> with resilient metadata, but I need to support two different types of
>>> IO, one that uses directio and one that can take advantage of the page
>>> cache.
>>>
>>> So far, btrfs gives me around 800MB/s with a similar setup (can't get
>>> exactly the same setup) without DIRECTIO and 450MB/s with DIRECTIO. A
>>> dm striped setup is giving me about 10% better throughput without
>>> DIRECTIO but only about 45% of the performance with DIRECTIO.
>>>
>>
>> You've mentioned you are using a thinp device with striping - do you have
>> the stripes properly aligned on the data-block-size of the thinp device?
>> (I think 9 disks are probably quite hard to align on a 3.2 kernel,
>> since the data block size needs to be a power of 2 - I think 3.3 will have
>> this relaxed to a page-size boundary.)
>
> Actually, so far I have not used any thinp devices, since from reading
> the documentation it seemed that, for what I am doing, I need to give
> thinp a mirrored device for its metadata and a striped device for its
> data, so I thought I would try just a striped device.
>
> Actually, I can cut that back to 8 devices in the stripe. I am using
> 4kiB block sizes and writing 256kiB blocks in the dd requests and
> there is no parity involved so there should be no read-modify-write
> cycles.
>
> I imagine that if I push the write sizes up to a MB or more at a time
> throughput will get better because at the moment each device is being
> given 32KiB or 16KiB (a few devices) with DIRECTIO, and with a larger
> write size they will get more data at a time.
>

Well, I cannot tell how big an influence proper alignment has here, but it
would be good to measure it in your case.
Do you use a data_block_size equal to the stripe size (256KiB = 512 sectors)?

Zdenek


* Re: dd to a striped device with 9 disks gets much lower throughput when oflag=direct used
  2012-01-27 17:24         ` Zdenek Kabelac
@ 2012-01-27 17:48           ` Richard Sharpe
  2012-01-27 18:06             ` Zdenek Kabelac
From: Richard Sharpe @ 2012-01-27 17:48 UTC (permalink / raw)
  To: device-mapper development

On Fri, Jan 27, 2012 at 9:24 AM, Zdenek Kabelac <zkabelac@redhat.com> wrote:
> On 27.1.2012 16:28, Richard Sharpe wrote:
>> Actually, so far I have not used any thinp devices, since from reading
>> the documentation it seemed that, for what I am doing, I need to give
>> thinp a mirrored device for its metadata and a striped device for its
>> data, so I thought I would try just a striped device.
>>
>> Actually, I can cut that back to 8 devices in the stripe. I am using
>> 4kiB block sizes and writing 256kiB blocks in the dd requests and
>> there is no parity involved so there should be no read-modify-write
>> cycles.
>>
>> I imagine that if I push the write sizes up to a MB or more at a time
>> throughput will get better because at the moment each device is being
>>> given 32KiB or 16KiB (a few devices) with DIRECTIO, and with a larger
>> write size they will get more data at a time.
>>
>
> Well, I cannot tell how big an influence proper alignment has here, but
> it would be good to measure it in your case.
> Do you use a data_block_size equal to the stripe size (256KiB = 512 sectors)?

I suspect not :-) However, I am not sure what you are asking. I
believe that the stripe size is 9 * 8 * 512B, or 36kiB because I think
I told it to use 8 sectors per device. This might be sub-optimal.

Based on that, I think it will take my write blocks, of 256kiB, and
write sectors that are (offset/512 + 256) mod 9 = {0, 1, 2, ... 8} to
{disk 0, disk 1, disk 2, ... disk 8}.

If I wanted perfectly stripe-aligned writes then I think I should write
something like 32*9kiB rather than the 32*8kiB I am currently writing.
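
To make that concrete (still assuming my 8-sector chunk):

     chunk per disk        = 8 * 512B = 4KiB
     full stripe (9 disks) = 9 * 4KiB = 36KiB
     8 full stripes        = 8 * 36KiB = 288KiB = 32 * 9KiB

so a stripe-aligned run might be something like:

     dd if=/dev/zero of=/dev/mapper/stripe_dev bs=294912 count=100000 oflag=direct

with 294912 = 288 * 1024 and the count just a placeholder.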

Is that what you are asking me?

-- 
Regards,
Richard Sharpe


* Re: dd to a striped device with 9 disks gets much lower throughput when oflag=direct used
  2012-01-27 17:48           ` Richard Sharpe
@ 2012-01-27 18:06             ` Zdenek Kabelac
From: Zdenek Kabelac @ 2012-01-27 18:06 UTC (permalink / raw)
  To: dm-devel

On 27.1.2012 18:48, Richard Sharpe wrote:
> On Fri, Jan 27, 2012 at 9:24 AM, Zdenek Kabelac<zkabelac@redhat.com>  wrote:
>> On 27.1.2012 16:28, Richard Sharpe wrote:
>>> Actually, so far I have not used any thinp devices, since from reading
>>> the documentation it seemed that, for what I am doing, I need to give
>>> thinp a mirrored device for its metadata and a striped device for its
>>> data, so I thought I would try just a striped device.
>>>
>>> Actually, I can cut that back to 8 devices in the stripe. I am using
>>> 4kiB block sizes and writing 256kiB blocks in the dd requests and
>>> there is no parity involved so there should be no read-modify-write
>>> cycles.
>>>
>>> I imagine that if I push the write sizes up to a MB or more at a time
>>> throughput will get better because at the moment each device is being
>>> given 32KiB or 16KiB (a few devices) with DIRECTIO, and with a larger
>>> write size they will get more data at a time.
>>>
>>
>> Well, I cannot tell how big an influence proper alignment has here, but
>> it would be good to measure it in your case.
>> Do you use a data_block_size equal to the stripe size (256KiB = 512 sectors)?
>
> I suspect not :-) However, I am not sure what you are asking. I
> believe that the stripe size is 9 * 8 * 512B, or 36kiB because I think
> I told it to use 8 sectors per device. This might be sub-optimal.
>
> Based on that, I think it will take my write blocks, of 256kiB, and
> write sectors that are (offset/512 + 256) mod 9 = {0, 1, 2, ... 8} to
> {disk 0, disk 1, disk 2, ... disk 8}.
>
> If I wanted perfectly stripe-aligned writes then I think I should write
> something like 32*9kiB rather than the 32*8kiB I am currently writing.
>
> Is that what you are asking me?
>

There are surely a number of things to test to get optimal performance from
a striped array, and you will probably need to run several experiments
yourself to figure out the best settings.

I'd suggest using 32KiB on each disk and combining them (8 x 32KiB) into a
256KiB stripe. Then use a data_block_size of 512 (sectors) for thinp creation.
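
A minimal sketch of what the thin-pool table could then look like - the
device names, size and low-water mark below are placeholders, not tested
values:

     echo "0 <data_dev_size_in_sectors> thin-pool <metadata_dev> <striped_data_dev> 512 32768" | dmsetup create thin_pool

Here 512 is the data_block_size in sectors (256KiB) and 32768 is the
low_water_mark in blocks.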

You could also try just 4KiB on each drive to get a 64KiB stripe, and
use a data_block_size of 128 (sectors) for thinp.

For 9 disks it's hard to say what the 'optimal' number is with a 3.2 kernel
and thinp - so it will need some playtime.
Maybe 32KiB on each disk - and use a 128KiB data_block_size on the 288KiB
stripe. (Though the data block size heavily depends on the use case.)

Zdenek

