* Linux RAID Partition Offset 63 cylinders / 30% performance hit?
@ 2007-12-19 14:50 Justin Piszcz
2007-12-19 15:01 ` Mattias Wadenstein
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Justin Piszcz @ 2007-12-19 14:50 UTC (permalink / raw)
To: linux-raid; +Cc: apiszcz
The (up to) 30% figure is mentioned here:
http://insights.oetiker.ch/linux/raidoptimization.html
On http://forums.storagereview.net/index.php?showtopic=25786:
This user writes about the problem:
XP, and virtually every O/S and partitioning software of XP's day, by default places the first partition on a disk at sector 63. Being an odd number, and 31.5 KB into the drive, it isn't ever going to align with any stripe size. This is an unfortunate industry standard.
Vista, on the other hand, aligns the first partition at sector 2048 by default, as a by-product of its revisions to support large-sector hard drives. As RAID5 arrays in write mode mimic the performance characteristics of large-sector hard drives, this comes as a great, if inadvertent, benefit. 2048 is evenly divisible by 2 and 4 (allowing for 3- and 5-drive arrays optimally) and by virtually every stripe size in common use. If you are, however, using a 4-drive RAID5, you're SOOL.
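The arithmetic behind those two defaults:
63 sectors   * 512 B = 32,256 B    = 31.5 KiB -> not a multiple of any power-of-two stripe unit
2048 sectors * 512 B = 1,048,576 B = 1 MiB    -> a multiple of every common chunk size (64/128/256/512/1024 KiB)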
Page 9 in this PDF (EMC_BestPractice_R22.pdf) shows the problem graphically:
http://bbs.doit.com.cn/attachment.php?aid=6757
------
Now to my setup / question:
# fdisk -l /dev/sdc
Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a
Device Boot Start End Blocks Id System
/dev/sdc1 1 18241 146520801 fd Linux raid autodetect
---
If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct
start and end size if I wanted to make sure the RAID5 was stripe aligned?
Or is there a better way to do this, does parted handle this situation
better?
What is the best (and correct) way to calculate stripe-alignment on the
RAID5 device itself?
---
The EMC paper recommends:
Disk partition adjustment for Linux systems
In Linux, align the partition table before data is written to the LUN, as
the partition map will be rewritten
and all data on the LUN destroyed. In the following example, the LUN is
mapped to
/dev/emcpowerah, and the LUN stripe element size is 128 blocks. Arguments
for the fdisk utility are
as follows:
fdisk /dev/emcpowerah
x # expert mode
b # adjust starting block number
1 # choose partition 1
128 # set it to 128, our stripe element size
w # write the new partition
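Presumably the same recipe carries over to the setup above; for instance, with a 1024 KiB chunk and 512-byte sectors the starting block would be 2048 (2048 * 512 B = 1 MiB). That is my assumption, not something the EMC paper states:
fdisk /dev/sdc
x # expert mode
b # adjust starting block number
1 # choose partition 1
2048 # 2048 sectors * 512 B = 1 MiB, a multiple of a 1024 KiB chunk
w # write the new partition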
---
Does this also apply to Linux/SW RAID5? Or are there any caveats that are
not taken into account since it is SW-based rather than HW?
---
What it currently looks like:
Command (m for help): x
Expert command (m for help): p
Disk /dev/sdc: 255 heads, 63 sectors, 18241 cylinders
Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
1 00 1 1 0 254 63 1023 63 293041602 fd
2 00 0 0 0 0 0 0 0 0 00
3 00 0 0 0 0 0 0 0 0 00
4 00 0 0 0 0 0 0 0 0 00
Justin.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 14:50 Linux RAID Partition Offset 63 cylinders / 30% performance hit? Justin Piszcz
@ 2007-12-19 15:01 ` Mattias Wadenstein
2007-12-19 15:04 ` Justin Piszcz
2007-12-20 10:33 ` Gabor Gombas
2007-12-19 21:44 ` Michal Soltys
2007-12-19 21:59 ` Robin Hill
2 siblings, 2 replies; 26+ messages in thread
From: Mattias Wadenstein @ 2007-12-19 15:01 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, apiszcz
On Wed, 19 Dec 2007, Justin Piszcz wrote:
> ------
>
> Now to my setup / question:
>
> # fdisk -l /dev/sdc
>
> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
> 255 heads, 63 sectors/track, 18241 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Disk identifier: 0x5667c24a
>
> Device Boot Start End Blocks Id System
> /dev/sdc1 1 18241 146520801 fd Linux raid autodetect
>
> ---
>
> If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct start
> and end size if I wanted to make sure the RAID5 was stripe aligned?
>
> Or is there a better way to do this, does parted handle this situation
> better?
From that setup it seems simple: scrap the partition table and use the
disk device for raid. This is what we do for all data storage disks (hw
raid) and sw raid members.
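For example, a minimal sketch for the 10-disk case in question (device names, chunk size and filesystem are only examples, not a recommendation):
# whole-disk members, 1024 KiB chunk; no partition table, so no partition offset to worry about
mdadm --create /dev/md0 --level=5 --raid-devices=10 --chunk=1024 /dev/sd[c-l]
mkfs.xfs /dev/md0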
/Mattias Wadenstein
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 15:01 ` Mattias Wadenstein
@ 2007-12-19 15:04 ` Justin Piszcz
2007-12-19 15:06 ` Jon Nelson
2007-12-19 17:40 ` Bill Davidsen
2007-12-20 10:33 ` Gabor Gombas
1 sibling, 2 replies; 26+ messages in thread
From: Justin Piszcz @ 2007-12-19 15:04 UTC (permalink / raw)
To: Mattias Wadenstein; +Cc: linux-raid, apiszcz
On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
> On Wed, 19 Dec 2007, Justin Piszcz wrote:
>
>> ------
>>
>> Now to my setup / question:
>>
>> # fdisk -l /dev/sdc
>>
>> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
>> 255 heads, 63 sectors/track, 18241 cylinders
>> Units = cylinders of 16065 * 512 = 8225280 bytes
>> Disk identifier: 0x5667c24a
>>
>> Device Boot Start End Blocks Id System
>> /dev/sdc1 1 18241 146520801 fd Linux raid
>> autodetect
>>
>> ---
>>
>> If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct
>> start and end size if I wanted to make sure the RAID5 was stripe aligned?
>>
>> Or is there a better way to do this, does parted handle this situation
>> better?
>
>> From that setup it seems simple, scrap the partition table and use the
> disk device for raid. This is what we do for all data storage disks (hw raid)
> and sw raid members.
>
> /Mattias Wadenstein
>
Is there any downside to doing that? I remember when I had to take my
machine apart for a BIOS downgrade: when I plugged the SATA devices back
in I did not plug them in in the same order. Everything worked, of course,
but when I ran LILO it said the disk was not part of the RAID set, because
/dev/sda had become /dev/sdg, and it overwrote the MBR on that disk. If I
had not used partitions here, would I have lost one (or more) of the drives
due to a bad LILO run?
Justin.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 15:04 ` Justin Piszcz
@ 2007-12-19 15:06 ` Jon Nelson
2007-12-19 15:31 ` Justin Piszcz
2007-12-19 17:40 ` Bill Davidsen
1 sibling, 1 reply; 26+ messages in thread
From: Jon Nelson @ 2007-12-19 15:06 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Mattias Wadenstein, linux-raid, apiszcz
On 12/19/07, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
> >> From that setup it seems simple, scrap the partition table and use the
> > disk device for raid. This is what we do for all data storage disks (hw raid)
> > and sw raid members.
> >
> > /Mattias Wadenstein
> >
>
> Is there any downside to doing that? I remember when I had to take my
There is one (just pointed out to me yesterday): having the partition
and having it labeled as raid makes identification quite a bit easier
for humans and software, too.
--
Jon
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 15:06 ` Jon Nelson
@ 2007-12-19 15:31 ` Justin Piszcz
2007-12-20 10:37 ` Gabor Gombas
0 siblings, 1 reply; 26+ messages in thread
From: Justin Piszcz @ 2007-12-19 15:31 UTC (permalink / raw)
To: Jon Nelson; +Cc: Mattias Wadenstein, linux-raid, apiszcz
On Wed, 19 Dec 2007, Jon Nelson wrote:
> On 12/19/07, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>>
>>
>> On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
>>>> From that setup it seems simple, scrap the partition table and use the
>>> disk device for raid. This is what we do for all data storage disks (hw raid)
>>> and sw raid members.
>>>
>>> /Mattias Wadenstein
>>>
>>
>> Is there any downside to doing that? I remember when I had to take my
>
> There is one (just pointed out to me yesterday): having the partition
> and having it labeled as raid makes identification quite a bit easier
> for humans and software, too.
>
> --
> Jon
>
Some nice graphs found here:
http://sqlblog.com/blogs/linchi_shea/archive/2007/02/01/performance-impact-of-disk-misalignment.aspx
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 17:40 ` Bill Davidsen
@ 2007-12-19 17:37 ` Jon Nelson
2007-12-19 17:37 ` Jon Nelson
2007-12-19 17:55 ` Justin Piszcz
2 siblings, 0 replies; 26+ messages in thread
From: Jon Nelson @ 2007-12-19 17:37 UTC (permalink / raw)
Cc: linux-raid
On 12/19/07, Bill Davidsen <davidsen@tmr.com> wrote:
> As other posts have detailed, putting the partition on a 64k aligned
> boundary can address the performance problems. However, a poor choice of
> chunk size, cache_buffer size, or just random i/o in small sizes can eat
> up a lot of the benefit.
>
> I don't think you need to give up your partitions to get the benefit of
> alignment.
How might that benefit be realized?
Assume I have 3 disks, /dev/sd{b,c,d} all partitioned identically with
4 partitions, and I want to use /dev/sd{b,c,d}3 for a new SW raid.
What sequence of steps can I take to ensure that my raid is aligned on
a 64K boundary?
What effect do the different superblock formats have, if any, in this situation?
--
Jon
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 17:40 ` Bill Davidsen
2007-12-19 17:37 ` Jon Nelson
@ 2007-12-19 17:37 ` Jon Nelson
2007-12-19 17:55 ` Justin Piszcz
2 siblings, 0 replies; 26+ messages in thread
From: Jon Nelson @ 2007-12-19 17:37 UTC (permalink / raw)
Cc: linux-raid
On 12/19/07, Bill Davidsen <davidsen@tmr.com> wrote:
> As other posts have detailed, putting the partition on a 64k aligned
> boundary can address the performance problems. However, a poor choice of
> chunk size, cache_buffer size, or just random i/o in small sizes can eat
> up a lot of the benefit.
>
> I don't think you need to give up your partitions to get the benefit of
> alignment.
How might that benefit be realized?
Assume I have 3 disks, /dev/sd{b,c,d} all partitioned identically with
4 partitions, and I want to use /dev/sd{b,c,d}3 for a new SW raid.
What sequence of steps can I take to ensure that my raid is aligned on
a 64K boundary?
What effect do the different superblock formats have, if any, in this situation?
--
Jon
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 15:04 ` Justin Piszcz
2007-12-19 15:06 ` Jon Nelson
@ 2007-12-19 17:40 ` Bill Davidsen
2007-12-19 17:37 ` Jon Nelson
` (2 more replies)
1 sibling, 3 replies; 26+ messages in thread
From: Bill Davidsen @ 2007-12-19 17:40 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Mattias Wadenstein, linux-raid, apiszcz
Justin Piszcz wrote:
>
>
> On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
>
>> On Wed, 19 Dec 2007, Justin Piszcz wrote:
>>
>>> ------
>>>
>>> Now to my setup / question:
>>>
>>> # fdisk -l /dev/sdc
>>>
>>> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
>>> 255 heads, 63 sectors/track, 18241 cylinders
>>> Units = cylinders of 16065 * 512 = 8225280 bytes
>>> Disk identifier: 0x5667c24a
>>>
>>> Device Boot Start End Blocks Id System
>>> /dev/sdc1 1 18241 146520801 fd Linux raid
>>> autodetect
>>>
>>> ---
>>>
>>> If I use 10-disk RAID5 with 1024 KiB stripe, what would be the
>>> correct start and end size if I wanted to make sure the RAID5 was
>>> stripe aligned?
>>>
>>> Or is there a better way to do this, does parted handle this
>>> situation better?
>>
>>> From that setup it seems simple, scrap the partition table and use the
>> disk device for raid. This is what we do for all data storage disks
>> (hw raid) and sw raid members.
>>
>> /Mattias Wadenstein
>>
>
> Is there any downside to doing that? I remember when I had to take my
> machine apart for a BIOS downgrade when I plugged in the sata devices
> again I did not plug them back in the same order, everything worked of
> course but when I ran LILO it said it was not part of the RAID set,
> because /dev/sda had become /dev/sdg and overwrote the MBR on the
> disk, if I had not used partitions here, I'd have lost (or more of the
> drives) due to a bad LILO run?
As other posts have detailed, putting the partition on a 64k aligned
boundary can address the performance problems. However, a poor choice of
chunk size, cache_buffer size, or just random i/o in small sizes can eat
up a lot of the benefit.
I don't think you need to give up your partitions to get the benefit of
alignment.
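One way to do that, as a sketch (it follows the fdisk expert-mode recipe quoted from the EMC paper; 128 sectors = 64 KiB with 512-byte sectors, and the device names are only examples):
# on the first member: create the partition (n), set type fd (t), then in expert mode (x)
# use b to move the start of partition 1 to sector 128, and w to write
fdisk /dev/sdb
# replicate the aligned layout to the remaining members
# (older sfdisk versions may need --force when the start is not cylinder-aligned)
sfdisk -d /dev/sdb | sfdisk /dev/sdc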
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 17:40 ` Bill Davidsen
2007-12-19 17:37 ` Jon Nelson
2007-12-19 17:37 ` Jon Nelson
@ 2007-12-19 17:55 ` Justin Piszcz
2007-12-19 19:18 ` Bill Davidsen
2007-12-20 10:24 ` Gabor Gombas
2 siblings, 2 replies; 26+ messages in thread
From: Justin Piszcz @ 2007-12-19 17:55 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Mattias Wadenstein, linux-raid, apiszcz
On Wed, 19 Dec 2007, Bill Davidsen wrote:
> Justin Piszcz wrote:
>>
>>
>> On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
>>
>>> On Wed, 19 Dec 2007, Justin Piszcz wrote:
>>>
>>>> ------
>>>>
>>>> Now to my setup / question:
>>>>
>>>> # fdisk -l /dev/sdc
>>>>
>>>> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
>>>> 255 heads, 63 sectors/track, 18241 cylinders
>>>> Units = cylinders of 16065 * 512 = 8225280 bytes
>>>> Disk identifier: 0x5667c24a
>>>>
>>>> Device Boot Start End Blocks Id System
>>>> /dev/sdc1 1 18241 146520801 fd Linux raid
>>>> autodetect
>>>>
>>>> ---
>>>>
>>>> If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct
>>>> start and end size if I wanted to make sure the RAID5 was stripe aligned?
>>>>
>>>> Or is there a better way to do this, does parted handle this situation
>>>> better?
>>>
>>>> From that setup it seems simple, scrap the partition table and use the
>>> disk device for raid. This is what we do for all data storage disks (hw
>>> raid) and sw raid members.
>>>
>>> /Mattias Wadenstein
>>>
>>
>> Is there any downside to doing that? I remember when I had to take my
>> machine apart for a BIOS downgrade when I plugged in the sata devices again
>> I did not plug them back in the same order, everything worked of course but
>> when I ran LILO it said it was not part of the RAID set, because /dev/sda
>> had become /dev/sdg and overwrote the MBR on the disk, if I had not used
>> partitions here, I'd have lost (or more of the drives) due to a bad LILO
>> run?
>
> As other posts have detailed, putting the partition on a 64k aligned boundary
> can address the performance problems. However, a poor choice of chunk size,
> cache_buffer size, or just random i/o in small sizes can eat up a lot of the
> benefit.
>
> I don't think you need to give up your partitions to get the benefit of
> alignment.
>
> --
> Bill Davidsen <davidsen@tmr.com>
> "Woe unto the statesman who makes war without a reason that will still
> be valid when the war is over..." Otto von Bismark
>
Hrmm..
I am doing a benchmark now with:
6 x 400GB (SATA) / 256 KiB stripe with unaligned vs. aligned raid setup.
unaligned: just fdisk /dev/sdc, make the partition, type fd raid.
aligned: fdisk, expert mode, start at 512 as the offset.
Per a Microsoft KB:
Example of alignment calculations in kilobytes for a 256-KB stripe unit
size:
(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1
These examples show that the partition is not aligned correctly for a
256-KB stripe unit size until it is created by using an offset of 512
sectors (512 bytes per sector).
So I should start at 512 for a 256k chunk size.
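Generalizing that arithmetic (assuming 512-byte sectors), the starting offset in sectors just has to be a multiple of twice the stripe unit size in KiB:
offset_in_sectors = stripe_unit_KiB * 2
256 KiB chunk  -> 512 sectors (as above)
1024 KiB chunk -> 2048 sectors (= 1 MiB, the offset Vista uses by default)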
I ran bonnie++ three consecutive times and took the average for the
unaligned, rebuilding the RAID5 now and then I will re-execute the test 3
additional times and take the average of that.
Justin.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 17:55 ` Justin Piszcz
@ 2007-12-19 19:18 ` Bill Davidsen
2007-12-19 19:44 ` Justin Piszcz
2007-12-19 21:31 ` Justin Piszcz
2007-12-20 10:24 ` Gabor Gombas
1 sibling, 2 replies; 26+ messages in thread
From: Bill Davidsen @ 2007-12-19 19:18 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Mattias Wadenstein, linux-raid, apiszcz
Justin Piszcz wrote:
>
>
> On Wed, 19 Dec 2007, Bill Davidsen wrote:
>
>> Justin Piszcz wrote:
>>>
>>>
>>> On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
>>>
>>>> On Wed, 19 Dec 2007, Justin Piszcz wrote:
>>>>
>>>>> ------
>>>>>
>>>>> Now to my setup / question:
>>>>>
>>>>> # fdisk -l /dev/sdc
>>>>>
>>>>> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
>>>>> 255 heads, 63 sectors/track, 18241 cylinders
>>>>> Units = cylinders of 16065 * 512 = 8225280 bytes
>>>>> Disk identifier: 0x5667c24a
>>>>>
>>>>> Device Boot Start End Blocks Id System
>>>>> /dev/sdc1 1 18241 146520801 fd Linux raid
>>>>> autodetect
>>>>>
>>>>> ---
>>>>>
>>>>> If I use 10-disk RAID5 with 1024 KiB stripe, what would be the
>>>>> correct start and end size if I wanted to make sure the RAID5 was
>>>>> stripe aligned?
>>>>>
>>>>> Or is there a better way to do this, does parted handle this
>>>>> situation better?
>>>>
>>>>> From that setup it seems simple, scrap the partition table and use
>>>>> the
>>>> disk device for raid. This is what we do for all data storage disks
>>>> (hw raid) and sw raid members.
>>>>
>>>> /Mattias Wadenstein
>>>>
>>>
>>> Is there any downside to doing that? I remember when I had to take
>>> my machine apart for a BIOS downgrade when I plugged in the sata
>>> devices again I did not plug them back in the same order, everything
>>> worked of course but when I ran LILO it said it was not part of the
>>> RAID set, because /dev/sda had become /dev/sdg and overwrote the MBR
>>> on the disk, if I had not used partitions here, I'd have lost (or
>>> more of the drives) due to a bad LILO run?
>>
>> As other posts have detailed, putting the partition on a 64k aligned
>> boundary can address the performance problems. However, a poor choice
>> of chunk size, cache_buffer size, or just random i/o in small sizes
>> can eat up a lot of the benefit.
>>
>> I don't think you need to give up your partitions to get the benefit
>> of alignment.
>>
>> --
>> Bill Davidsen <davidsen@tmr.com>
>> "Woe unto the statesman who makes war without a reason that will still
>> be valid when the war is over..." Otto von Bismark
>
> Hrmm..
>
> I am doing a benchmark now with:
>
> 6 x 400GB (SATA) / 256 KiB stripe with unaligned vs. aligned raid setup.
>
> unligned, just fdisk /dev/sdc, mkpartition, fd raid.
> aligned, fdisk, expert, start at 512 as the off-set
>
> Per a Microsoft KB:
>
> Example of alignment calculations in kilobytes for a 256-KB stripe
> unit size:
> (63 * .5) / 256 = 0.123046875
> (64 * .5) / 256 = 0.125
> (128 * .5) / 256 = 0.25
> (256 * .5) / 256 = 0.5
> (512 * .5) / 256 = 1
> These examples shows that the partition is not aligned correctly for a
> 256-KB stripe unit size until the partition is created by using an
> offset of 512 sectors (512 bytes per sector).
>
> So I should start at 512 for a 256k chunk size.
>
> I ran bonnie++ three consecutive times and took the average for the
> unaligned, rebuilding the RAID5 now and then I will re-execute the
> test 3 additional times and take the average of that.
I'm going to try another approach, I'll describe it when I get results
(or not).
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 19:18 ` Bill Davidsen
@ 2007-12-19 19:44 ` Justin Piszcz
2007-12-19 21:31 ` Justin Piszcz
1 sibling, 0 replies; 26+ messages in thread
From: Justin Piszcz @ 2007-12-19 19:44 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Mattias Wadenstein, linux-raid, apiszcz
On Wed, 19 Dec 2007, Bill Davidsen wrote:
> Justin Piszcz wrote:
>>
>>
>> On Wed, 19 Dec 2007, Bill Davidsen wrote:
>>
>>> Justin Piszcz wrote:
>>>>
>>>>
>>>> On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
>>>>
>>>>> On Wed, 19 Dec 2007, Justin Piszcz wrote:
>>>>>
>>>>>> ------
>>>>>>
>>>>>> Now to my setup / question:
>>>>>>
>>>>>> # fdisk -l /dev/sdc
>>>>>>
>>>>>> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
>>>>>> 255 heads, 63 sectors/track, 18241 cylinders
>>>>>> Units = cylinders of 16065 * 512 = 8225280 bytes
>>>>>> Disk identifier: 0x5667c24a
>>>>>>
>>>>>> Device Boot Start End Blocks Id System
>>>>>> /dev/sdc1 1 18241 146520801 fd Linux raid
>>>>>> autodetect
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct
>>>>>> start and end size if I wanted to make sure the RAID5 was stripe
>>>>>> aligned?
>>>>>>
>>>>>> Or is there a better way to do this, does parted handle this situation
>>>>>> better?
>>>>>
>>>>>> From that setup it seems simple, scrap the partition table and use the
>>>>> disk device for raid. This is what we do for all data storage disks (hw
>>>>> raid) and sw raid members.
>>>>>
>>>>> /Mattias Wadenstein
>>>>>
>>>>
>>>> Is there any downside to doing that? I remember when I had to take my
>>>> machine apart for a BIOS downgrade when I plugged in the sata devices
>>>> again I did not plug them back in the same order, everything worked of
>>>> course but when I ran LILO it said it was not part of the RAID set,
>>>> because /dev/sda had become /dev/sdg and overwrote the MBR on the disk,
>>>> if I had not used partitions here, I'd have lost (or more of the drives)
>>>> due to a bad LILO run?
>>>
>>> As other posts have detailed, putting the partition on a 64k aligned
>>> boundary can address the performance problems. However, a poor choice of
>>> chunk size, cache_buffer size, or just random i/o in small sizes can eat
>>> up a lot of the benefit.
>>>
>>> I don't think you need to give up your partitions to get the benefit of
>>> alignment.
>>>
>>> --
>>> Bill Davidsen <davidsen@tmr.com>
>>> "Woe unto the statesman who makes war without a reason that will still
>>> be valid when the war is over..." Otto von Bismark
>>
>> Hrmm..
>>
>> I am doing a benchmark now with:
>>
>> 6 x 400GB (SATA) / 256 KiB stripe with unaligned vs. aligned raid setup.
>>
>> unligned, just fdisk /dev/sdc, mkpartition, fd raid.
>> aligned, fdisk, expert, start at 512 as the off-set
>>
>> Per a Microsoft KB:
>>
>> Example of alignment calculations in kilobytes for a 256-KB stripe unit
>> size:
>> (63 * .5) / 256 = 0.123046875
>> (64 * .5) / 256 = 0.125
>> (128 * .5) / 256 = 0.25
>> (256 * .5) / 256 = 0.5
>> (512 * .5) / 256 = 1
>> These examples shows that the partition is not aligned correctly for a
>> 256-KB stripe unit size until the partition is created by using an offset
>> of 512 sectors (512 bytes per sector).
>>
>> So I should start at 512 for a 256k chunk size.
>>
>> I ran bonnie++ three consecutive times and took the average for the
>> unaligned, rebuilding the RAID5 now and then I will re-execute the test 3
>> additional times and take the average of that.
>
> I'm going to try another approach, I'll describe it when I get results (or
> not).
Waiting for the raid to rebuild then I will re-run thereafter.
[=================>...] recovery = 86.7% (339104640/390708480)
finish=30.8min speed=27835K/sec
...
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 19:18 ` Bill Davidsen
2007-12-19 19:44 ` Justin Piszcz
@ 2007-12-19 21:31 ` Justin Piszcz
2007-12-20 15:18 ` Bill Davidsen
1 sibling, 1 reply; 26+ messages in thread
From: Justin Piszcz @ 2007-12-19 21:31 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Mattias Wadenstein, linux-raid, apiszcz
On Wed, 19 Dec 2007, Bill Davidsen wrote:
> I'm going to try another approach, I'll describe it when I get results (or
> not).
http://home.comcast.net/~jpiszcz/align_vs_noalign/
Hardly any difference whatsoever; only the per-char read/write figures
are any faster..?
Average of 3 runs taken:
$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:100000:16/64,1334210,10,330,2,120,1,3978,10,312,2
p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:100000:16/64,1252548,6,296,1,115,1,7927,20,373,2
p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:100000:16/64,1242445,6,303,1,117,1,6767,17,359,2
$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:100000:16/64,1353180,8,314,1,117,1,8684,22,283,2
p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:100000:16/64,12211519,29,297,1,113,1,3218,8,325,2
p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:100000:16/64,1243229,8,307,1,120,1,4247,11,313,2
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 14:50 Linux RAID Partition Offset 63 cylinders / 30% performance hit? Justin Piszcz
2007-12-19 15:01 ` Mattias Wadenstein
@ 2007-12-19 21:44 ` Michal Soltys
2007-12-19 22:12 ` Jon Nelson
2007-12-19 21:59 ` Robin Hill
2 siblings, 1 reply; 26+ messages in thread
From: Michal Soltys @ 2007-12-19 21:44 UTC (permalink / raw)
To: linux-raid; +Cc: Justin Piszcz
Justin Piszcz wrote:
>
> Or is there a better way to do this, does parted handle this situation
> better?
>
> What is the best (and correct) way to calculate stripe-alignment on the
> RAID5 device itself?
>
>
> Does this also apply to Linux/SW RAID5? Or are there any caveats that
> are not taken into account since it is based in SW vs. HW?
>
> ---
In the case of SW or HW raid, when you place a raid-aware filesystem directly
on it, I don't see any potential problems.
Also, it would be pretty strange if md's superblock version/placement actually
mattered. The space available for actual use - be it partitions or a filesystem
directly - should always be nicely aligned. I don't know that for sure, though.
If you use SW partitionable raid, or HW raid with partitions, then you would
have to align the partitions on a chunk boundary manually. No self-respecting
OS should complain that a partition doesn't start on a cylinder boundary these
days. LVM can complicate life a bit too, if you want its volumes to be
chunk-aligned.
With NTFS the problem is that it's not aware of the underlying raid in any
way. It starts with a 16-sector boot sector, somewhat compatible with
ancient FAT. My blind guess would be to try to align the very first sector
of $Mft with your chunk. Also, the boot sector just mentioned is referenced as
$Boot, so I don't know whether a large cluster size would automatically extend
it to a full cluster. Experiment, YMMV :)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 14:50 Linux RAID Partition Offset 63 cylinders / 30% performance hit? Justin Piszcz
2007-12-19 15:01 ` Mattias Wadenstein
2007-12-19 21:44 ` Michal Soltys
@ 2007-12-19 21:59 ` Robin Hill
2007-12-19 22:03 ` Justin Piszcz
2007-12-25 19:06 ` Bill Davidsen
2 siblings, 2 replies; 26+ messages in thread
From: Robin Hill @ 2007-12-19 21:59 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 1656 bytes --]
On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:
> The (up to) 30% percent figure is mentioned here:
> http://insights.oetiker.ch/linux/raidoptimization.html
>
That looks to be referring to partitioning a RAID device - this'll only
apply to hardware RAID or partitionable software RAID, not to the normal
use case. When you're creating an array out of standard partitions then
you know the array stripe size will align with the disks (there's no way
it cannot), and you can set the filesystem stripe size to align as well
(XFS will do this automatically).
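For example, a sketch of setting the XFS stripe geometry by hand (the values are only examples: a 6-disk RAID5 with a 256 KiB chunk, i.e. 5 data disks):
# su = stripe unit (the md chunk size), sw = number of data disks in the stripe
mkfs.xfs -d su=256k,sw=5 /dev/md0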
I've actually done tests on this with hardware RAID to try to find the
correct partition offset, but wasn't able to see any difference (using
bonnie++ and moving the partition start by one sector at a time).
> # fdisk -l /dev/sdc
>
> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
> 255 heads, 63 sectors/track, 18241 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Disk identifier: 0x5667c24a
>
> Device Boot Start End Blocks Id System
> /dev/sdc1 1 18241 146520801 fd Linux raid
> autodetect
>
This looks to be a normal disk - the partition offsets shouldn't be
relevant here (barring any knowledge of the actual physical disk layout
anyway, and block remapping may well make that rather irrelevant).
That's my take on this one anyway.
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 21:59 ` Robin Hill
@ 2007-12-19 22:03 ` Justin Piszcz
2007-12-25 19:06 ` Bill Davidsen
1 sibling, 0 replies; 26+ messages in thread
From: Justin Piszcz @ 2007-12-19 22:03 UTC (permalink / raw)
To: Robin Hill; +Cc: linux-raid, apiszcz
On Wed, 19 Dec 2007, Robin Hill wrote:
> On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:
>
>> The (up to) 30% percent figure is mentioned here:
>> http://insights.oetiker.ch/linux/raidoptimization.html
>>
> That looks to be referring to partitioning a RAID device - this'll only
> apply to hardware RAID or partitionable software RAID, not to the normal
> use case. When you're creating an array out of standard partitions then
> you know the array stripe size will align with the disks (there's no way
> it cannot), and you can set the filesystem stripe size to align as well
> (XFS will do this automatically).
>
> I've actually done tests on this with hardware RAID to try to find the
> correct partition offset, but wasn't able to see any difference (using
> bonnie++ and moving the partition start by one sector at a time).
>
>> # fdisk -l /dev/sdc
>>
>> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
>> 255 heads, 63 sectors/track, 18241 cylinders
>> Units = cylinders of 16065 * 512 = 8225280 bytes
>> Disk identifier: 0x5667c24a
>>
>> Device Boot Start End Blocks Id System
>> /dev/sdc1 1 18241 146520801 fd Linux raid
>> autodetect
>>
> This looks to be a normal disk - the partition offsets shouldn't be
> relevant here (barring any knowledge of the actual physical disk layout
> anyway, and block remapping may well make that rather irrelevant).
>
> That's my take on this one anyway.
>
> Cheers,
> Robin
> --
> ___
> ( ' } | Robin Hill <robin@robinhill.me.uk> |
> / / ) | Little Jim says .... |
> // !! | "He fallen in de water !!" |
>
Interesting, yes, I am using XFS as well, thanks for the response.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 21:44 ` Michal Soltys
@ 2007-12-19 22:12 ` Jon Nelson
2007-12-20 13:01 ` Michal Soltys
0 siblings, 1 reply; 26+ messages in thread
From: Jon Nelson @ 2007-12-19 22:12 UTC (permalink / raw)
Cc: linux-raid
On 12/19/07, Michal Soltys <soltys@ziu.info> wrote:
> Justin Piszcz wrote:
> >
> > Or is there a better way to do this, does parted handle this situation
> > better?
> >
> > What is the best (and correct) way to calculate stripe-alignment on the
> > RAID5 device itself?
> >
> >
> > Does this also apply to Linux/SW RAID5? Or are there any caveats that
> > are not taken into account since it is based in SW vs. HW?
> >
> > ---
>
> In case of SW or HW raid, when you place raid aware filesystem directly on
> it, I don't see any potential poblems
>
> Also, if md's superblock version/placement actually mattered, it'd be pretty
> strange. The space available for actual use - be it partitions or filesystem
> directly - should be always nicely aligned. I don't know that for sure though.
>
> If you use SW partitionable raid, or HW raid with partitions, then you would
> have to align it on a chunk boundary manually. Any selfrespecting os
> shouldn't complain a partition doesn't start on cylinder boundary these
> days. LVM can complicate life a bit too - if you want it's volumes to be
> chunk-aligned.
That, for me, is the next question: how can one educate LVM about the
underlying block device so that logical volumes carved out of that space
align properly? Many of us have experienced 30% (or so) performance losses
for the convenience of LVM (and mighty convenient it is).
--
Jon
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 17:55 ` Justin Piszcz
2007-12-19 19:18 ` Bill Davidsen
@ 2007-12-20 10:24 ` Gabor Gombas
1 sibling, 0 replies; 26+ messages in thread
From: Gabor Gombas @ 2007-12-20 10:24 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Bill Davidsen, Mattias Wadenstein, linux-raid, apiszcz
On Wed, Dec 19, 2007 at 12:55:16PM -0500, Justin Piszcz wrote:
> unligned, just fdisk /dev/sdc, mkpartition, fd raid.
> aligned, fdisk, expert, start at 512 as the off-set
No, that won't show any difference. You need to partition _the RAID
device_. If the partitioning is below the RAID level, then alignment does
not matter.
What is missing from your original quote is that the original reporter
used fake-HW RAID which can only handle full disks, and not individual
partitions. So if you want to experience the same performance drop, you
should also RAID full disks together and then put partitions on top of
the RAID array.
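A sketch of that configuration - the one that does exhibit the offset problem. The partitionable-array options and device names are examples and depend on your mdadm version:
# build the array from whole disks, then partition the array device itself
mdadm --create /dev/md_d0 --auto=part --level=5 --raid-devices=10 --chunk=1024 /dev/sd[c-l]
fdisk /dev/md_d0 # now the partition's start sector relative to the array is what matters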
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 15:01 ` Mattias Wadenstein
2007-12-19 15:04 ` Justin Piszcz
@ 2007-12-20 10:33 ` Gabor Gombas
1 sibling, 0 replies; 26+ messages in thread
From: Gabor Gombas @ 2007-12-20 10:33 UTC (permalink / raw)
To: Mattias Wadenstein; +Cc: Justin Piszcz, linux-raid, apiszcz
On Wed, Dec 19, 2007 at 04:01:43PM +0100, Mattias Wadenstein wrote:
> From that setup it seems simple, scrap the partition table and use the disk
> device for raid. This is what we do for all data storage disks (hw raid)
> and sw raid members.
And that's _exactly_ when you run into the alignment problem. The common
SW RAID case (first partitioning, then building RAID arrays from
individual partitions) does not suffer from this issue.
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 15:31 ` Justin Piszcz
@ 2007-12-20 10:37 ` Gabor Gombas
0 siblings, 0 replies; 26+ messages in thread
From: Gabor Gombas @ 2007-12-20 10:37 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Jon Nelson, Mattias Wadenstein, linux-raid, apiszcz
On Wed, Dec 19, 2007 at 10:31:12AM -0500, Justin Piszcz wrote:
> Some nice graphs found here:
> http://sqlblog.com/blogs/linchi_shea/archive/2007/02/01/performance-impact-of-disk-misalignment.aspx
Again, this is a HW RAID, and the partitioning is done _on top of_ the
RAID.
Gabor
--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 22:12 ` Jon Nelson
@ 2007-12-20 13:01 ` Michal Soltys
0 siblings, 0 replies; 26+ messages in thread
From: Michal Soltys @ 2007-12-20 13:01 UTC (permalink / raw)
To: Jon Nelson; +Cc: linux-raid
Jon Nelson wrote:
>
> That, for me, is the next question - how can one educate LVM about the
> underlying block device such that logical volumes carved out of that
> space align properly - many of us have experienced 30% (or so)
> performance losses for the convenience of LVM (and mighty convenient
> it is).
>
When you do pvcreate you can specify --metadatasize. It will add padding
to hit the next 64K boundary.
So, for example, if you do
#pvcreate --metadatasize 256K /dev/sda
your metadata area will be 320K,
but with
#pvcreate --metadatasize 255K /dev/sda
it will be 256K.
After that, the LVM extents follow, with the size specified during
vgcreate. Also, I tend to use rather large extents (512M), as I
don't really need the granularity offered by the default 4M.
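One way to check where the data area actually ends up, assuming your LVM2 version knows the pe_start field (a verification sketch, not part of the recipe above):
# show the offset of the first physical extent on each PV, in KiB
pvs -o +pe_start --units k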
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-20 15:18 ` Bill Davidsen
@ 2007-12-20 15:00 ` Justin Piszcz
0 siblings, 0 replies; 26+ messages in thread
From: Justin Piszcz @ 2007-12-20 15:00 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Mattias Wadenstein, linux-raid, apiszcz
On Thu, 20 Dec 2007, Bill Davidsen wrote:
> Justin Piszcz wrote:
>>
>>
>> On Wed, 19 Dec 2007, Bill Davidsen wrote:
>>
>>> I'm going to try another approach, I'll describe it when I get results (or
>>> not).
>>
>> http://home.comcast.net/~jpiszcz/align_vs_noalign/
>>
>> Hardly any difference at whatsoever, only on the per char for read/write is
>> it any faster..?
>
> Am I misreading what you are doing here... you have the underlying data on
> the actual hardware devices 64k aligned by using either the whole device or
> starting a partition on a 64k boundary? I'm dubious that you will see a
> difference any other way, after all the translations take place.
>
> I'm trying creating a raid array using loop devices created with the "offset"
> parameter, but I suspect that I will wind up doing a test after just
> repartitioning the drives, painful as that will be.
>>
>> Average of 3 runs taken:
>>
>> $ cat align/*log|grep ,
>> p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:100000:16/64,1334210,10,330,2,120,1,3978,10,312,2
>> p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:100000:16/64,1252548,6,296,1,115,1,7927,20,373,2
>> p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:100000:16/64,1242445,6,303,1,117,1,6767,17,359,2
>>
>> $ cat noalign/*log|grep ,
>> p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:100000:16/64,1353180,8,314,1,117,1,8684,22,283,2
>> p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:100000:16/64,12211519,29,297,1,113,1,3218,8,325,2
>> p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:100000:16/64,1243229,8,307,1,120,1,4247,11,313,2
>>
>
>
> --
> Bill Davidsen <davidsen@tmr.com>
> "Woe unto the statesman who makes war without a reason that will still
> be valid when the war is over..." Otto von Bismark
>
1. For the first test I made partitions on each drive like I normally do.
2. For the second test I followed the EMC document on how to properly
align the partitions, and Microsoft's document on how to calculate the
correct offset; I used 512 for the 256k stripe.
Justin.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 21:31 ` Justin Piszcz
@ 2007-12-20 15:18 ` Bill Davidsen
2007-12-20 15:00 ` Justin Piszcz
0 siblings, 1 reply; 26+ messages in thread
From: Bill Davidsen @ 2007-12-20 15:18 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Mattias Wadenstein, linux-raid, apiszcz
Justin Piszcz wrote:
>
>
> On Wed, 19 Dec 2007, Bill Davidsen wrote:
>
>> I'm going to try another approach, I'll describe it when I get
>> results (or not).
>
> http://home.comcast.net/~jpiszcz/align_vs_noalign/
>
> Hardly any difference at whatsoever, only on the per char for
> read/write is it any faster..?
Am I misreading what you are doing here... you have the underlying data
on the actual hardware devices 64k aligned by using either the whole
device or starting a partition on a 64k boundary? I'm dubious that you
will see a difference any other way, after all the translations take place.
I'm trying to create a raid array using loop devices created with the
"offset" parameter, but I suspect that I will wind up doing a test after
just repartitioning the drives, painful as that will be.
>
> Average of 3 runs taken:
>
> $ cat align/*log|grep ,
> p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:100000:16/64,1334210,10,330,2,120,1,3978,10,312,2
>
> p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:100000:16/64,1252548,6,296,1,115,1,7927,20,373,2
>
> p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:100000:16/64,1242445,6,303,1,117,1,6767,17,359,2
>
>
> $ cat noalign/*log|grep ,
> p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:100000:16/64,1353180,8,314,1,117,1,8684,22,283,2
>
> p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:100000:16/64,12211519,29,297,1,113,1,3218,8,325,2
>
> p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:100000:16/64,1243229,8,307,1,120,1,4247,11,313,2
>
>
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-19 21:59 ` Robin Hill
2007-12-19 22:03 ` Justin Piszcz
@ 2007-12-25 19:06 ` Bill Davidsen
2007-12-29 17:22 ` dean gaudet
1 sibling, 1 reply; 26+ messages in thread
From: Bill Davidsen @ 2007-12-25 19:06 UTC (permalink / raw)
To: linux-raid
Robin Hill wrote:
> On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:
>
>
>> The (up to) 30% percent figure is mentioned here:
>> http://insights.oetiker.ch/linux/raidoptimization.html
>>
>>
> That looks to be referring to partitioning a RAID device - this'll only
> apply to hardware RAID or partitionable software RAID, not to the normal
> use case. When you're creating an array out of standard partitions then
> you know the array stripe size will align with the disks (there's no way
> it cannot), and you can set the filesystem stripe size to align as well
> (XFS will do this automatically).
>
> I've actually done tests on this with hardware RAID to try to find the
> correct partition offset, but wasn't able to see any difference (using
> bonnie++ and moving the partition start by one sector at a time).
>
>
>> # fdisk -l /dev/sdc
>>
>> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
>> 255 heads, 63 sectors/track, 18241 cylinders
>> Units = cylinders of 16065 * 512 = 8225280 bytes
>> Disk identifier: 0x5667c24a
>>
>> Device Boot Start End Blocks Id System
>> /dev/sdc1 1 18241 146520801 fd Linux raid
>> autodetect
>>
>>
> This looks to be a normal disk - the partition offsets shouldn't be
> relevant here (barring any knowledge of the actual physical disk layout
> anyway, and block remapping may well make that rather irrelevant).
>
The issue I'm thinking about is hardware sector size, which on modern
drives may be larger than 512b and therefore entail a read-alter-rewrite
(RAR) cycle when writing a 512b block. With larger writes, if the
alignment is poor and the write size is some multiple of 512, it's
possible to have an RAR at each end of the write. The only way to have a
hope of controlling the alignment is to write to a raw device, or to use a
filesystem which can be configured to have blocks which are a multiple
of the sector size and to do all i/o in whole blocks, starting each file on
a block boundary. That may be possible with ext[234] set up properly.
Why this is important: the physical layout of the drive is useful, but
for a large write the drive will have to make some number of steps from
one cylinder to another. By carefully choosing the starting point, the
best improvement will be to eliminate 2 track-to-track seek times, one
at the start and one at the end. If the writes are small, only one t2t
saving is possible.
Now consider a RAR process. The drive is typically spinning at 7200 rpm,
or 8.333 ms/rev. A read might take .5 rev on average, and a RAR will
take 1.5 rev, because it takes a full revolution after the original data
is read before the altered data can be rewritten. Larger sectors give
more capacity, but reduced write performance. And doing small writes
can result in paying the RAR penalty on every write. So there may be a
measurable benefit to getting that alignment right at the drive level.
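Putting rough numbers on the figures above:
one revolution at 7200 rpm = 60,000 ms / 7,200 = ~8.33 ms
a plain read or write      = ~0.5 rev on average = ~4.2 ms of rotational latency
a read-alter-rewrite       = ~1.5 rev = ~12.5 ms, i.e. roughly one extra revolution (~8.33 ms) per misaligned small write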
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-25 19:06 ` Bill Davidsen
@ 2007-12-29 17:22 ` dean gaudet
2007-12-29 17:34 ` Justin Piszcz
0 siblings, 1 reply; 26+ messages in thread
From: dean gaudet @ 2007-12-29 17:22 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-raid
On Tue, 25 Dec 2007, Bill Davidsen wrote:
> The issue I'm thinking about is hardware sector size, which on modern drives
> may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle
> when writing a 512b block.
i'm not sure any shipping SATA disks have larger than 512B sectors yet...
do you know of any? (or is this thread about SCSI which i don't pay
attention to...)
on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:
255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
so sda1 starts at a non-multiple of 4096 into the disk.
i ran some random seek+write experiments using
<http://arctic.org/~dean/randomio/>, here are the results using 512 byte
and 4096 byte writes (fsync after each write), 8 threads, on sda1:
# ./randomio /dev/sda1 8 1 1 512 10 6
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
--------+-----------------------------------+----------------------------------
148.5 | 0.0 inf nan 0.0 nan | 148.5 0.2 53.7 89.3 19.5
129.2 | 0.0 inf nan 0.0 nan | 129.2 37.2 61.9 96.7 9.3
131.2 | 0.0 inf nan 0.0 nan | 131.2 40.3 61.0 90.4 9.3
132.0 | 0.0 inf nan 0.0 nan | 132.0 39.6 60.6 89.3 9.1
130.7 | 0.0 inf nan 0.0 nan | 130.7 39.8 61.3 98.1 8.9
131.4 | 0.0 inf nan 0.0 nan | 131.4 40.0 60.8 101.0 9.6
# ./randomio /dev/sda1 8 1 1 4096 10 6
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
--------+-----------------------------------+----------------------------------
141.7 | 0.0 inf nan 0.0 nan | 141.7 0.3 56.3 99.3 21.1
132.4 | 0.0 inf nan 0.0 nan | 132.4 43.3 60.4 91.8 8.5
131.6 | 0.0 inf nan 0.0 nan | 131.6 41.4 60.9 111.0 9.6
131.8 | 0.0 inf nan 0.0 nan | 131.8 41.4 60.7 85.3 8.6
130.6 | 0.0 inf nan 0.0 nan | 130.6 41.7 61.3 95.0 9.4
131.4 | 0.0 inf nan 0.0 nan | 131.4 42.2 60.8 90.5 8.4
i think the anomalous results in the first 10s samples are perhaps the drive
coming out of a standby state.
and here are the results aligned using the sda raw device itself:
# ./randomio /dev/sda 8 1 1 512 10 6
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
--------+-----------------------------------+----------------------------------
147.3 | 0.0 inf nan 0.0 nan | 147.3 0.3 54.1 93.7 20.1
132.4 | 0.0 inf nan 0.0 nan | 132.4 37.4 60.6 91.8 9.2
132.5 | 0.0 inf nan 0.0 nan | 132.5 37.7 60.3 93.7 9.3
131.8 | 0.0 inf nan 0.0 nan | 131.8 39.4 60.7 92.7 9.0
133.9 | 0.0 inf nan 0.0 nan | 133.9 41.7 59.8 90.7 8.5
130.2 | 0.0 inf nan 0.0 nan | 130.2 40.8 61.5 88.6 8.9
# ./randomio /dev/sda 8 1 1 4096 10 6
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
--------+-----------------------------------+----------------------------------
145.4 | 0.0 inf nan 0.0 nan | 145.4 0.3 54.9 94.0 20.1
130.3 | 0.0 inf nan 0.0 nan | 130.3 36.0 61.4 92.7 9.6
130.6 | 0.0 inf nan 0.0 nan | 130.6 38.2 61.2 96.7 9.2
132.1 | 0.0 inf nan 0.0 nan | 132.1 39.0 60.5 93.5 9.2
131.8 | 0.0 inf nan 0.0 nan | 131.8 43.1 60.8 93.8 9.1
129.0 | 0.0 inf nan 0.0 nan | 129.0 40.2 62.0 96.4 8.8
it looks pretty much the same to me...
-dean
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-29 17:22 ` dean gaudet
@ 2007-12-29 17:34 ` Justin Piszcz
2007-12-30 1:33 ` Michael Tokarev
0 siblings, 1 reply; 26+ messages in thread
From: Justin Piszcz @ 2007-12-29 17:34 UTC (permalink / raw)
To: dean gaudet; +Cc: Bill Davidsen, linux-raid, Alan Piszcz
On Sat, 29 Dec 2007, dean gaudet wrote:
> On Tue, 25 Dec 2007, Bill Davidsen wrote:
>
>> The issue I'm thinking about is hardware sector size, which on modern drives
>> may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle
>> when writing a 512b block.
>
> i'm not sure any shipping SATA disks have larger than 512B sectors yet...
> do you know of any? (or is this thread about SCSI which i don't pay
> attention to...)
>
> on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:
>
> 255 heads, 63 sectors/track, 91201 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> so sda1 starts at a non-multiple of 4096 into the disk.
>
> i ran some random seek+write experiments using
> <http://arctic.org/~dean/randomio/>, here are the results using 512 byte
> and 4096 byte writes (fsync after each write), 8 threads, on sda1:
>
> # ./randomio /dev/sda1 8 1 1 512 10 6
> total | read: latency (ms) | write: latency (ms)
> iops | iops min avg max sdev | iops min avg max sdev
> --------+-----------------------------------+----------------------------------
> 148.5 | 0.0 inf nan 0.0 nan | 148.5 0.2 53.7 89.3 19.5
> 129.2 | 0.0 inf nan 0.0 nan | 129.2 37.2 61.9 96.7 9.3
> 131.2 | 0.0 inf nan 0.0 nan | 131.2 40.3 61.0 90.4 9.3
> 132.0 | 0.0 inf nan 0.0 nan | 132.0 39.6 60.6 89.3 9.1
> 130.7 | 0.0 inf nan 0.0 nan | 130.7 39.8 61.3 98.1 8.9
> 131.4 | 0.0 inf nan 0.0 nan | 131.4 40.0 60.8 101.0 9.6
> # ./randomio /dev/sda1 8 1 1 4096 10 6
> total | read: latency (ms) | write: latency (ms)
> iops | iops min avg max sdev | iops min avg max sdev
> --------+-----------------------------------+----------------------------------
> 141.7 | 0.0 inf nan 0.0 nan | 141.7 0.3 56.3 99.3 21.1
> 132.4 | 0.0 inf nan 0.0 nan | 132.4 43.3 60.4 91.8 8.5
> 131.6 | 0.0 inf nan 0.0 nan | 131.6 41.4 60.9 111.0 9.6
> 131.8 | 0.0 inf nan 0.0 nan | 131.8 41.4 60.7 85.3 8.6
> 130.6 | 0.0 inf nan 0.0 nan | 130.6 41.7 61.3 95.0 9.4
> 131.4 | 0.0 inf nan 0.0 nan | 131.4 42.2 60.8 90.5 8.4
>
>
> i think the anomalous results in the first 10s samples are perhaps the drive
> coming out of a standby state.
>
> and here are the results aligned using the sda raw device itself:
>
> # ./randomio /dev/sda 8 1 1 512 10 6
> total | read: latency (ms) | write: latency (ms)
> iops | iops min avg max sdev | iops min avg max sdev
> --------+-----------------------------------+----------------------------------
> 147.3 | 0.0 inf nan 0.0 nan | 147.3 0.3 54.1 93.7 20.1
> 132.4 | 0.0 inf nan 0.0 nan | 132.4 37.4 60.6 91.8 9.2
> 132.5 | 0.0 inf nan 0.0 nan | 132.5 37.7 60.3 93.7 9.3
> 131.8 | 0.0 inf nan 0.0 nan | 131.8 39.4 60.7 92.7 9.0
> 133.9 | 0.0 inf nan 0.0 nan | 133.9 41.7 59.8 90.7 8.5
> 130.2 | 0.0 inf nan 0.0 nan | 130.2 40.8 61.5 88.6 8.9
> # ./randomio /dev/sda 8 1 1 4096 10 6
> total | read: latency (ms) | write: latency (ms)
> iops | iops min avg max sdev | iops min avg max sdev
> --------+-----------------------------------+----------------------------------
> 145.4 | 0.0 inf nan 0.0 nan | 145.4 0.3 54.9 94.0 20.1
> 130.3 | 0.0 inf nan 0.0 nan | 130.3 36.0 61.4 92.7 9.6
> 130.6 | 0.0 inf nan 0.0 nan | 130.6 38.2 61.2 96.7 9.2
> 132.1 | 0.0 inf nan 0.0 nan | 132.1 39.0 60.5 93.5 9.2
> 131.8 | 0.0 inf nan 0.0 nan | 131.8 43.1 60.8 93.8 9.1
> 129.0 | 0.0 inf nan 0.0 nan | 129.0 40.2 62.0 96.4 8.8
>
> it looks pretty much the same to me...
>
> -dean
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Good to know/have it confirmed by someone else: the alignment does not
matter with Linux/SW RAID.
Justin.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
2007-12-29 17:34 ` Justin Piszcz
@ 2007-12-30 1:33 ` Michael Tokarev
0 siblings, 0 replies; 26+ messages in thread
From: Michael Tokarev @ 2007-12-30 1:33 UTC (permalink / raw)
To: Justin Piszcz; +Cc: dean gaudet, Bill Davidsen, linux-raid, Alan Piszcz
Justin Piszcz wrote:
[]
> Good to know/have it confirmed by someone else, the alignment does not
> matter with Linux/SW RAID.
Alignment matters when one partitions a Linux/SW raid array.
If the inside partitions are not aligned on a stripe boundary -
especially in the worst case, when filesystem blocks cross the
stripe boundary (I wonder if that's even possible... and I think
it is, if a partition starts at some odd 512-byte boundary and the
filesystem block size is 4Kb) - there is just no chance for the
inside filesystem to ever do full-stripe writes, so (modulo stripe
cache size) all writes will go the read-modify-write or similar way.
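A concrete example of that worst case (assuming a 64 KiB chunk and 512-byte sectors):
a partition inside the array starting at sector 63 sits 63 * 512 B = 32,256 B = 31.5 KiB into the array;
31.5 KiB is not a multiple of 4 KiB, let alone 64 KiB, so the filesystem's 4Kb blocks can never line up
with the chunks (or the full stripe), and md has to fall back to read-modify-write.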
And that's what the original article is about, by the way.
It just happens that hardware raid arrays are more often split
into partitions (using native tools) than linux software raid
arrays are.
And that's what has been pointed out in this thread, as well... ;)
/mjt
^ permalink raw reply [flat|nested] 26+ messages in thread
Thread overview: 26+ messages
2007-12-19 14:50 Linux RAID Partition Offset 63 cylinders / 30% performance hit? Justin Piszcz
2007-12-19 15:01 ` Mattias Wadenstein
2007-12-19 15:04 ` Justin Piszcz
2007-12-19 15:06 ` Jon Nelson
2007-12-19 15:31 ` Justin Piszcz
2007-12-20 10:37 ` Gabor Gombas
2007-12-19 17:40 ` Bill Davidsen
2007-12-19 17:37 ` Jon Nelson
2007-12-19 17:37 ` Jon Nelson
2007-12-19 17:55 ` Justin Piszcz
2007-12-19 19:18 ` Bill Davidsen
2007-12-19 19:44 ` Justin Piszcz
2007-12-19 21:31 ` Justin Piszcz
2007-12-20 15:18 ` Bill Davidsen
2007-12-20 15:00 ` Justin Piszcz
2007-12-20 10:24 ` Gabor Gombas
2007-12-20 10:33 ` Gabor Gombas
2007-12-19 21:44 ` Michal Soltys
2007-12-19 22:12 ` Jon Nelson
2007-12-20 13:01 ` Michal Soltys
2007-12-19 21:59 ` Robin Hill
2007-12-19 22:03 ` Justin Piszcz
2007-12-25 19:06 ` Bill Davidsen
2007-12-29 17:22 ` dean gaudet
2007-12-29 17:34 ` Justin Piszcz
2007-12-30 1:33 ` Michael Tokarev