* Raid5 performance question
@ 2007-04-16 17:42 mickg
0 siblings, 0 replies; 11+ messages in thread
From: mickg @ 2007-04-16 17:42 UTC (permalink / raw)
To: linux-raid
I have a raid5 array /dev/md1, being used as the pv of an LVM Volume Group.
I have a very peculiar problem with the array:
Writes to it happen at 40MB/sec or so.
But *reads* from it happen at 10MB/sec.
This, given how raid5 works, is a bit weird.
dd if=/dev/md1 of=/disks/raid_backup/test_speed bs=10M
gives 21MB/sec, which is, while not good, not bad for dd.
while:
dd if=/dev/mapper/system-raid_lvm of=/disks/raid_backup/test_speed bs=10M
4+0 records in
3+0 records out
31457280 bytes (31 MB) copied, 8.75426 seconds, 3.6 MB/s
which is obviously bogus.
dd if=/dev/sdg of=/disks/raid_lvm/test.speed bs=10M
24+0 records in
23+0 records out
241172480 bytes (241 MB) copied, 11.6211 seconds, 20.8 MB/s
So, writes to the system VG happen at least 4x faster than reads.
This to me seems insane.
Note that:
pvs
PV VG Fmt Attr PSize PFree
/dev/md0 backup lvm2 a- 1.46T 159.20G
/dev/md1 system lvm2 a- 1.82T 63.04G
/disks/raid_backup/ is actually the backup VG.
/disks/raid_lvm/ is actually the system VG.
Any ideas on why this happens would be very appreciated.
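In case it helps, a rough way to repeat the comparison with the page cache
taken out of the picture (same devices and mount points as above; the sizes
are arbitrary):
dd if=/dev/md1 of=/dev/null bs=10M count=100 iflag=direct
dd if=/dev/mapper/system-raid_lvm of=/dev/null bs=10M count=100 iflag=direct
dd if=/dev/zero of=/disks/raid_lvm/test.speed bs=10M count=100 conv=fdatasync
The first two are direct reads from the md device and the LV on top of it;
the last one writes through the filesystem and flushes before reporting.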
mickg@mickg.net
* raid5 performance question
@ 2006-03-06 11:46 Raz Ben-Jehuda(caro)
2006-03-06 11:59 ` Gordon Henderson
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2006-03-06 11:46 UTC (permalink / raw)
To: Linux RAID Mailing List; +Cc: Neil Brown
Hello Neil.
I have a performance question.
I am using raid5 stripe size 1024K over 4 disks.
I am benchmarking it with an asynchronous tester.
This tester submits 100 IOs of 1024K each -- the same size as the stripe.
It reads raw io from the device, no file system is involved.
I am making the following comparison:
1. Reading 4 disks at the same time using 1 MB buffer in random manner.
2. Reading 1 raid5 device using 1MB buffer in random manner.
I am getting terrible results in scenario 2: where scenario 1 gives 120 MB/s
from the 4 disks, the raid5 device gives 35 MB/s.
It is as if I were reading a single disk, yet iostat shows that all the
disks are active, just with low throughput.
Any idea?
Thank you.
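A crude way to approximate the two scenarios without my tester, using plain
dd (only a sketch: dd here reads sequentially rather than randomly, and the
member disk names are just an example):
# scenario 1: the 4 member disks read in parallel
for d in sdb sdc sdd sde; do dd if=/dev/$d of=/dev/null bs=1M count=1000 iflag=direct & done; wait
# scenario 2: a single reader on the raid5 device
dd if=/dev/md1 of=/dev/null bs=1M count=1000 iflag=direct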
--
Raz
* Re: raid5 performance question
2006-03-06 11:46 raid5 " Raz Ben-Jehuda(caro)
@ 2006-03-06 11:59 ` Gordon Henderson
2006-03-06 12:56 ` Raz Ben-Jehuda(caro)
2006-03-06 22:17 ` Guy
2006-03-06 22:24 ` Neil Brown
2 siblings, 1 reply; 11+ messages in thread
From: Gordon Henderson @ 2006-03-06 11:59 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List
On Mon, 6 Mar 2006, Raz Ben-Jehuda(caro) wrote:
> Neil Hello .
> I have a performance question.
>
> I am using raid5 stripe size 1024K over 4 disks.
> I am benchmarking it with an asynchronous tester.
> This tester submits 100 IOs of size of 1024 K --> as the stripe size.
> It reads raw io from the device, no file system is involved.
>
> I am making the following comparsion:
>
> 1. Reading 4 disks at the same time using 1 MB buffer in random manner.
> 2. Reading 1 raid5 device using 1MB buffer in random manner.
>
> I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from
> 4 disks, the raid5 device gives 35 MB/s .
> it is like i am reading a single disk , but by looking at iostat i can
> see that all
> disks are active but with low throughput.
>
> Any idea ?
Is this reading the block device direct, or via a filesystem? If the
latter, what filesystem?
If ext2/3, have you tried mkfs with a stride option?
See:
http://www.tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.11
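For example (just a sketch: with a 1024K chunk and 4K blocks the stride
would be 1024/4 = 256):
mke2fs -b 4096 -R stride=256 /dev/md1
(newer e2fsprogs spell this -E stride=256)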
Gordon
* Re: raid5 performance question
2006-03-06 11:59 ` Gordon Henderson
@ 2006-03-06 12:56 ` Raz Ben-Jehuda(caro)
0 siblings, 0 replies; 11+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2006-03-06 12:56 UTC (permalink / raw)
To: Gordon Henderson; +Cc: Linux RAID Mailing List
It reads raw; no filesystem whatsoever.
On 3/6/06, Gordon Henderson <gordon@drogon.net> wrote:
> On Mon, 6 Mar 2006, Raz Ben-Jehuda(caro) wrote:
>
> > Neil Hello .
> > I have a performance question.
> >
> > I am using raid5 stripe size 1024K over 4 disks.
> > I am benchmarking it with an asynchronous tester.
> > This tester submits 100 IOs of size of 1024 K --> as the stripe size.
> > It reads raw io from the device, no file system is involved.
> >
> > I am making the following comparsion:
> >
> > 1. Reading 4 disks at the same time using 1 MB buffer in random manner.
> > 2. Reading 1 raid5 device using 1MB buffer in random manner.
> >
> > I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from
> > 4 disks, the raid5 device gives 35 MB/s .
> > it is like i am reading a single disk , but by looking at iostat i can
> > see that all
> > disks are active but with low throughput.
> >
> > Any idea ?
>
> Is this reading the block device direct, or via a filesystem? If the
> latter, what filesystem?
>
> If ext2/3 have you tried mkfs with a stride option?
>
> See:
> http://www.tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.11
>
> Gordon
>
--
Raz
* RE: raid5 performance question
2006-03-06 11:46 raid5 " Raz Ben-Jehuda(caro)
2006-03-06 11:59 ` Gordon Henderson
@ 2006-03-06 22:17 ` Guy
2006-03-06 22:24 ` Neil Brown
2 siblings, 0 replies; 11+ messages in thread
From: Guy @ 2006-03-06 22:17 UTC (permalink / raw)
To: 'Raz Ben-Jehuda(caro)', 'Linux RAID Mailing List'
Cc: 'Neil Brown'
Does test 1 have 4 processes?
Does test 2 have 1 process?
The number of testing processes should be the same in both tests.
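For example (a rough sketch; the device names are assumptions), compare like
with like:
# 4 concurrent readers against the member disks ...
for d in sdb sdc sdd sde; do dd if=/dev/$d of=/dev/null bs=1M count=1000 & done; wait
# ... versus 4 concurrent readers against the raid device at different offsets
for i in 0 1 2 3; do dd if=/dev/md1 of=/dev/null bs=1M count=1000 skip=$((i*1000)) & done; wait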
} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of Raz Ben-Jehuda(caro)
} Sent: Monday, March 06, 2006 6:46 AM
} To: Linux RAID Mailing List
} Cc: Neil Brown
} Subject: raid5 performance question
}
} Neil Hello .
} I have a performance question.
}
} I am using raid5 stripe size 1024K over 4 disks.
} I am benchmarking it with an asynchronous tester.
} This tester submits 100 IOs of size of 1024 K --> as the stripe size.
} It reads raw io from the device, no file system is involved.
}
} I am making the following comparsion:
}
} 1. Reading 4 disks at the same time using 1 MB buffer in random manner.
} 2. Reading 1 raid5 device using 1MB buffer in random manner.
}
} I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s
} from
} 4 disks, the raid5 device gives 35 MB/s .
} it is like i am reading a single disk , but by looking at iostat i can
} see that all
} disks are active but with low throughput.
}
} Any idea ?
}
} Thank you.
} --
} Raz
* Re: raid5 performance question
2006-03-06 11:46 raid5 " Raz Ben-Jehuda(caro)
2006-03-06 11:59 ` Gordon Henderson
2006-03-06 22:17 ` Guy
@ 2006-03-06 22:24 ` Neil Brown
2006-03-07 8:40 ` Raz Ben-Jehuda(caro)
2006-03-08 6:45 ` thunder7
2 siblings, 2 replies; 11+ messages in thread
From: Neil Brown @ 2006-03-06 22:24 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List
On Monday March 6, raziebe@gmail.com wrote:
> Neil Hello .
> I have a performance question.
>
> I am using raid5 stripe size 1024K over 4 disks.
I assume you mean a chunksize of 1024K rather than a stripe size.
With a 4 disk array, the stripe size will be 3 times the chunksize,
and so could not possibly be 1024K.
> I am benchmarking it with an asynchronous tester.
> This tester submits 100 IOs of size of 1024 K --> as the stripe size.
> It reads raw io from the device, no file system is involved.
>
> I am making the following comparsion:
>
> 1. Reading 4 disks at the same time using 1 MB buffer in random manner.
> 2. Reading 1 raid5 device using 1MB buffer in random manner.
If your chunk size is 1MB, then you will need larger sequential reads
to get good throughput.
You can also try increasing the size of the stripe cache in
/sys/block/mdX/md/stripe_cache_size
The units are in pages (normally 4K) per device. The default is 256 which fits
only one stripe with a 1 Meg chunk size.
Try 1024 ?
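i.e. something along the lines of (mdX and the value are only examples):
cat /sys/block/mdX/md/stripe_cache_size        # current size, in pages per device
echo 1024 > /sys/block/mdX/md/stripe_cache_size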
NeilBrown
>
> I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from
> 4 disks, the raid5 device gives 35 MB/s .
> it is like i am reading a single disk , but by looking at iostat i can
> see that all
> disks are active but with low throughput.
>
> Any idea ?
>
> Thank you.
> --
> Raz
* Re: raid5 performance question
2006-03-06 22:24 ` Neil Brown
@ 2006-03-07 8:40 ` Raz Ben-Jehuda(caro)
2006-03-07 23:03 ` Neil Brown
2006-03-08 6:45 ` thunder7
1 sibling, 1 reply; 11+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2006-03-07 8:40 UTC (permalink / raw)
To: Neil Brown; +Cc: Linux RAID Mailing List
Neil.
What is the stripe_cache exactly?
First, here are some numbers.
Setting it to 1024 gives me 85 MB/s.
Setting it to 4096 gives me 105 MB/s.
Setting it to 8192 gives me 115 MB/s.
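Those numbers are from my tester; a crude dd-based sweep of the setting
would look something like:
for n in 256 1024 4096 8192; do
  echo $n > /sys/block/md1/md/stripe_cache_size
  dd if=/dev/md1 of=/dev/null bs=1M count=10000 skip=630000
done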
md.txt does not say much about it, just that it is the number of
entries.
Here are some tests I have made:
Test 1:
When I set the stripe_cache to zero and run
"dd if=/dev/md1 of=/dev/zero bs=1M count=100000 skip=630000"
I get 120 MB/s.
When I set the stripe cache to 4096 and issue the same command I get
120 MB/s as well.
Test 2:
Let me describe what this tester does:
It opens N descriptors over a device.
It issues N IOs to the target and waits for the completion of each IO.
When an IO completes the tester has two choices:
1. Calculate a new seek position over the target.
2. Move sequentially to the next position, meaning that if one reads a
1MB buffer, the next position is current+1M.
I am using direct IO and asynchronous IO.
Option 1 simulates non-contiguous files; option 2 simulates contiguous files.
The numbers above were measured with option 2.
If I use option 1 I get 95 MB/s with stripe_cache_size=4096.
A single disk in this manner (option 1) gives ~28 MB/s.
A single disk in scenario 2 gives ~30 MB/s.
I understand that the question of IO distribution is something to consider,
but I am submitting 250 IOs, so the load on the raid should be heavy.
Questions:
1. How can the stripe cache give me a boost when I have totally random
access to the disk?
2. Does direct IO pass through this cache?
3. How can a dd with a 1MB block size over a 1MB chunk size achieve the
high throughput of 4 disks even though it does not get the stripe cache
benefits?
Thank you,
Raz
On 3/7/06, Neil Brown <neilb@suse.de> wrote:
> On Monday March 6, raziebe@gmail.com wrote:
> > Neil Hello .
> > I have a performance question.
> >
> > I am using raid5 stripe size 1024K over 4 disks.
>
> I assume you mean a chunksize of 1024K rather than a stripe size.
> With a 4 disk array, the stripe size will be 3 times the chunksize,
> and so could not possibly by 1024K.
>
> > I am benchmarking it with an asynchronous tester.
> > This tester submits 100 IOs of size of 1024 K --> as the stripe size.
> > It reads raw io from the device, no file system is involved.
> >
> > I am making the following comparsion:
> >
> > 1. Reading 4 disks at the same time using 1 MB buffer in random manner.
> > 2. Reading 1 raid5 device using 1MB buffer in random manner.
>
> If your chunk size is 1MB, then you will need larger sequential reads
> to get good throughput.
>
> You can also try increasing the size of the stripe cache in
> /sys/block/mdX/md/stripe_cache_size
>
> The units are in pages (normally 4K) per device. The default is 256 which fits
> only one stripe with a 1 Meg chunk size.
>
> Try 1024 ?
>
> NeilBrown
>
>
> >
> > I am getting terrible results in scenario 2. if scenario 1 gives 120 MB/s from
> > 4 disks, the raid5 device gives 35 MB/s .
> > it is like i am reading a single disk , but by looking at iostat i can
> > see that all
> > disks are active but with low throughput.
> >
> > Any idea ?
> >
> > Thank you.
> > --
> > Raz
>
--
Raz
* Re: raid5 performance question
2006-03-07 8:40 ` Raz Ben-Jehuda(caro)
@ 2006-03-07 23:03 ` Neil Brown
2006-03-22 13:22 ` Bill Davidsen
0 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2006-03-07 23:03 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List
On Tuesday March 7, raziebe@gmail.com wrote:
> Neil.
> what is the stripe_cache exacly ?
In order to ensure correctness of data, all IO operations on a raid5
pass through the 'stripe cache'. This is a cache of stripes where each
stripe is one page wide across all devices.
e.g. to write a block, we allocate one stripe in the cache to cover
that block, pre-read anything that might be needed, copy in the new
data and update parity, and write out anything that has changed.
Similarly to read, we allocate a stripe to cover the block, read in
the required parts, and copy out of the stripe cache into the
destination.
Requiring all reads to pass through the stripe cache is not strictly
necessary, but it keeps the code a lot easier to manage (fewer special
cases). Bypassing the cache for simple read requests when the array
is non-degraded is on my list....
>
> First , here are some numbers.
>
> Setting it to 1024 gives me 85 MB/s.
> Setting it to 4096 gives me 105 MB/s.
> Setting it to 8192 gives me 115 MB/s.
Not surprisingly, a larger cache gives better throughput as it allows
more parallelism. There is probably a link between optimal cache size
and chunk size.
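As a rough worked example (assuming 4K pages and the 4-disk, 1MB-chunk
array discussed here): stripe_cache_size = 4096 is 4096 x 4K = 16MB per
device, 64MB across the 4 devices, i.e. room for 16 full chunk-wide
stripes in flight, whereas the default of 256 holds just one.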
>
> the md.txt does not say much about it just that it is the number of
> entries.
No. I should fix that.
>
> here are some tests i have made:
>
> test1:
> when i set the stripe_cache to zero and run:
Setting it to zero is a no-op. Only values from 17 to 32768 are
permitted.
>
> "dd if=/dev/md1 of=/dev/zero bs=1M count=100000 skip=630000"
> i am getting 120MB/s.
> when i set the stripe cache to 4096 and : issue the same command i am
> getting 120 MB/s
> as well.
This sort of operation will cause the kernel's read-ahead to keep the
drives reading constantly. Providing the stripe cache is large enough
to hold 2 full chunk-sized stripes, you should get very good
throughput.
>
> test 2:
> I would describe what this tester does:
>
> It opens N descriptors over a device.
> It issues N IOs to the target and waits for the completion of each IO.
> When the IO is completed the tester has two choices:
>
> 1. calculate a new seek posistion over the target.
>
> 2. move sequetially to the next position. meaning , if one reads 1MB
> buffer, the next
> position is current+1M.
>
> I am using direct IO and asynchrnous IO.
>
> option 1 simulates non contigous files. option 2 simulates contiguous files.
> the above numbers were made with option 2.
> if i am using option 1 i am getting 95 MB/s with stripe_size=4096.
>
> A single disk in this manner ( option 1 ) gives ~28 MB/s.
> A single disk in scenario 2 gives ~30 MB/s.
>
> I understand the a question of the IO distribution is something to talk
> about. but i am submitting 250 IOs so i suppose to be heavy on the raid.
>
> Questions
> 1. how can the stripe size cache gives me a boost when i have total
> random access
> to the disk ?
It doesn't give you a boost exactly. It is just that a small cache
can get in your way by reducing the possible parallelism.
>
> 2. Does direct IO passes this cache ?
Yes. Everything does.
>
> 3. How can a dd of 1 MB over 1MB chunck size acheive this high
> throughputs of 4 disks
> even if does not get the stripe cache benifits ?
read-ahead performed by the kernel.
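For reference, the read-ahead on the array can be inspected and raised
with blockdev; the value is in 512-byte sectors, and 8192 here is only an
example:
blockdev --getra /dev/mdX
blockdev --setra 8192 /dev/mdX      # 8192 sectors = 4MB of read-ahead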
NeilBrown
* Re: raid5 performance question
2006-03-07 23:03 ` Neil Brown
@ 2006-03-22 13:22 ` Bill Davidsen
2006-03-24 4:40 ` Neil Brown
0 siblings, 1 reply; 11+ messages in thread
From: Bill Davidsen @ 2006-03-22 13:22 UTC (permalink / raw)
To: Neil Brown; +Cc: Raz Ben-Jehuda(caro), Linux RAID Mailing List
Neil Brown wrote:
>On Tuesday March 7, raziebe@gmail.com wrote:
>
>
>>Neil.
>>what is the stripe_cache exacly ?
>>
>>
>
>In order to ensure correctness of data, all IO operations on a raid5
>pass through the 'stripe cache' This is a cache of stripes where each
>stripe is one page wide across all devices.
>
>e.g. to write a block, we allocate one stripe in the cache to cover
>that block, pre-read anything that might be needed, copy in the new
>data and update parity, and write out anything that has changed.
>
>
I can see that you would have to read the old data and parity blocks for
RAID-5; I assume that is what you mean by "might be needed", and not a
read of every drive to rebuild the parity from scratch. That would not
only be slower, it would also require complex error recovery if a read of
unneeded data failed.
>Similarly to read, we allocate a stripe to cover the block, read in
>the requires parts, and copy out of the stripe cache into the
>destination.
>
>Requiring all reads to pass through the stripe cache is not strictly
>necessary, but it keeps the code a lot easier to manage (fewer special
>cases). Bypassing the cache for simple read requests when the array
>is non-degraded is on my list....
>
It sounds as if you do a memory copy with each read, even if a read to
user buffer would be possible. Hopefully I'm reading that wrong.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: raid5 performance question
2006-03-22 13:22 ` Bill Davidsen
@ 2006-03-24 4:40 ` Neil Brown
0 siblings, 0 replies; 11+ messages in thread
From: Neil Brown @ 2006-03-24 4:40 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Raz Ben-Jehuda(caro), Linux RAID Mailing List
On Wednesday March 22, davidsen@tmr.com wrote:
> Neil Brown wrote:
>
> >On Tuesday March 7, raziebe@gmail.com wrote:
> >
> >
> >>Neil.
> >>what is the stripe_cache exacly ?
> >>
> >>
> >
> >In order to ensure correctness of data, all IO operations on a raid5
> >pass through the 'stripe cache' This is a cache of stripes where each
> >stripe is one page wide across all devices.
> >
> >e.g. to write a block, we allocate one stripe in the cache to cover
> >that block, pre-read anything that might be needed, copy in the new
> >data and update parity, and write out anything that has changed.
> >
> >
> I can see that you would have to read the old data and parity blocks for
> RAID-5, I assume that's what you mean by "might be needed" and not a
> read of every drive to get the data to rebuild the parity from scratch.
> That would be not only slower, but require complex error recovery on an
> error reading unneeded data.
"might be needed" because sometime raid5 reads the old copies of the
blocks it is about to over-write, and sometimes it reads all the
blocks that it is NOT going to over-write instead. And if it is
over-writing all blocks in the stripe, it doesn't need to read
anything.
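As a worked example for a 4-disk array (3 data blocks + 1 parity block per
stripe): over-writing 1 of the 3 data blocks needs 2 pre-reads either way
(old data + old parity, or the 2 untouched data blocks); over-writing 2 of
the 3 needs 3 pre-reads the first way but only 1 the second way; and
over-writing all 3 needs no pre-reads at all.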
>
> >Similarly to read, we allocate a stripe to cover the block, read in
> >the requires parts, and copy out of the stripe cache into the
> >destination.
> >
> >Requiring all reads to pass through the stripe cache is not strictly
> >necessary, but it keeps the code a lot easier to manage (fewer special
> >cases). Bypassing the cache for simple read requests when the array
> >is non-degraded is on my list....
> >
> It sounds as if you do a memory copy with each read, even if a read to
> user buffer would be possible. Hopefully I'm reading that wrong.
Unfortunately you are reading it correctly.
NeilBrown
* Re: raid5 performance question
2006-03-06 22:24 ` Neil Brown
2006-03-07 8:40 ` Raz Ben-Jehuda(caro)
@ 2006-03-08 6:45 ` thunder7
1 sibling, 0 replies; 11+ messages in thread
From: thunder7 @ 2006-03-08 6:45 UTC (permalink / raw)
To: Linux RAID Mailing List
From: Neil Brown <neilb@suse.de>
Date: Tue, Mar 07, 2006 at 09:24:26AM +1100
> You can also try increasing the size of the stripe cache in
> /sys/block/mdX/md/stripe_cache_size
>
> The units are in pages (normally 4K) per device. The default is 256 which fits
> only one stripe with a 1 Meg chunk size.
>
> Try 1024 ?
>
Interesting. I noticed I don't have such a file for my raid6 device. Can
you explain why? I thought raid6 and raid5 worked a lot like each other.
Thanks,
Jurriaan