* Very long raid5 init/rebuild times
@ 2014-01-21 7:35 Marc MERLIN
2014-01-21 16:37 ` Marc MERLIN
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Marc MERLIN @ 2014-01-21 7:35 UTC (permalink / raw)
To: linux-raid
Howdy,
I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt.
Question #1:
Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite
(raid5 first, and then dmcrypt on top)?
I used:
cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1
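Concretely, the two layouts being compared look roughly like this (a sketch only;
the luksOpen mapper names and the mdadm options are illustrative, not the exact
commands used here):

Option A, dmcrypt under the raid (what the luksFormat above sets up):
~$ cryptsetup luksOpen /dev/sdm1 crypt_sdm1        # repeat for sdn1..sdq1
~$ mdadm --create /dev/md5 --level=5 --raid-devices=5 /dev/mapper/crypt_sd[mnopq]1

Option B, raid under dmcrypt:
~$ mdadm --create /dev/md5 --level=5 --raid-devices=5 /dev/sd[mnopq]1
~$ cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/md5
~$ cryptsetup luksOpen /dev/md5 crypt_md5

In option A every member gets its own dm-crypt device (and its own encryption
thread); in option B all I/O funnels through a single dm-crypt device sitting on
top of the array.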
Question #2:
In order to copy data from a working system, I connected the drives via an external
enclosure which uses a SATA PMP. As a result, things are slow:
md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0]
15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_]
[>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec
bitmap: 0/30 pages [0KB], 65536KB chunk
2.5 days for an init or rebuild is going to be painful.
I already checked that I'm not CPU/dmcrypt bound.
I read Neil's message on why the init is still required:
http://marc.info/?l=linux-raid&m=112044009718483&w=2
Even so, on brand new blank drives full of 0s, I'm thinking this could be faster
by just assuming the array is clean (all 0s give a parity of 0).
Is it really unsafe to do so? (Actually, since I built the array on top of dmcrypt
here, the members won't read back as 0s, so that way around the init is
unfortunately still necessary.)
I suppose that 1 day-ish rebuild times are kind of a given with 4TB drives anyway?
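For reference, the "skip the init" idea above corresponds to mdadm's --assume-clean
flag; a sketch using the device and mapper names from this thread (the caveats in
Neil's message still apply):

~$ mdadm --create /dev/md5 --level=5 --raid-devices=5 --chunk=512 --assume-clean \
     /dev/mapper/crypt_sd[mnopq]1

With RAID5's XOR parity, an all-zero stripe does produce all-zero parity, so this is
only consistent if the members genuinely read back as zeros, which, as noted above,
dmcrypt members won't.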
Question #3:
Since I'm going to put btrfs on top, I'm almost tempted to skip the md raid5
layer and just use the native support, but the raid code in btrfs still
seems a bit younger than I'm comfortable with.
Is anyone using it, and has anyone been through disk failures, replaces, and all that?
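For comparison, the native-btrfs variant being weighed here would look something like
the following (a sketch; the profile mix is illustrative, and btrfs raid5/6 was still
flagged experimental at this point in time):

~$ mkfs.btrfs -d raid5 -m raid1 /dev/mapper/crypt_sd[mnopq]1
~$ mount /dev/mapper/crypt_sdm1 /mnt/btrfs_pool1
~$ btrfs replace start /dev/mapper/crypt_sdq1 /dev/mapper/crypt_new1 /mnt/btrfs_pool1

That last command is the disk-replace path the question is really about.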
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 41+ messages in thread

* Re: Very long raid5 init/rebuild times
2014-01-21 7:35 Very long raid5 init/rebuild times Marc MERLIN
@ 2014-01-21 16:37 ` Marc MERLIN
2014-01-21 17:08 ` Mark Knecht
` (2 more replies)
2014-01-21 18:31 ` Chris Murphy
2014-01-22 13:46 ` Ethan Wilson
2 siblings, 3 replies; 41+ messages in thread
From: Marc MERLIN @ 2014-01-21 16:37 UTC (permalink / raw)
To: linux-raid

On Mon, Jan 20, 2014 at 11:35:40PM -0800, Marc MERLIN wrote:
> Howdy,
>
> I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt.
>
> Question #1:
> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite
> (raid5 first, and then dmcrypt)
> I used:
> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1

I should have said that this is seemingly a stupid question since obviously
if you encrypt each drive separately, you're going through the encryption
layer 5 times during rebuilds instead of just once.
However in my case, I'm not CPU-bound, so that didn't seem to be an issue
and I was more curious to know if the dmcrypt and dmraid5 layers stacked the
same regardless of which one was on top and which one at the bottom.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times
2014-01-21 16:37 ` Marc MERLIN
@ 2014-01-21 17:08 ` Mark Knecht
2014-01-21 18:42 ` Chris Murphy
2014-01-22 7:55 ` Stan Hoeppner
2 siblings, 0 replies; 41+ messages in thread
From: Mark Knecht @ 2014-01-21 17:08 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Linux-RAID

On Tue, Jan 21, 2014 at 8:37 AM, Marc MERLIN <marc@merlins.org> wrote:
> On Mon, Jan 20, 2014 at 11:35:40PM -0800, Marc MERLIN wrote:
>> Howdy,
>>
>> I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt.
>>
>> Question #1:
>> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite
>> (raid5 first, and then dmcrypt)
>> I used:
>> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1
>
> I should have said that this is seemingly a stupid question since obviously
> if you encrypt each drive separately, you're going through the encryption
> layer 5 times during rebuilds instead of just once.
> However in my case, I'm not CPU-bound, so that didn't seem to be an issue
> and I was more curious to know if the dmcrypt and dmraid5 layers stacked the
> same regardless of which one was on top and which one at the bottom.
>
> Marc

I know nothing about dmcrypt, but as someone pointed out to me in another
thread recently about alternative parity methods for RAID6, you might be
able to do some tests using loopback devices instead of real hard drives
to speed up your investigation times.

Cheers,
Mark

^ permalink raw reply [flat|nested] 41+ messages in thread
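A minimal sketch of the loopback approach Mark describes (file names, sizes and the
md number are arbitrary):

~$ for i in 0 1 2 3 4; do truncate -s 1G /tmp/disk$i.img; losetup /dev/loop$i /tmp/disk$i.img; done
~$ mdadm --create /dev/md100 --level=5 --raid-devices=5 /dev/loop[0-4]
~$ cat /proc/mdstat        # the resync on 1GB members finishes in seconds

This makes it cheap to compare the two dmcrypt/md stacking orders, chunk sizes and
so on without waiting days for 4TB members.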
* Re: Very long raid5 init/rebuild times 2014-01-21 16:37 ` Marc MERLIN 2014-01-21 17:08 ` Mark Knecht @ 2014-01-21 18:42 ` Chris Murphy 2014-01-22 7:55 ` Stan Hoeppner 2 siblings, 0 replies; 41+ messages in thread From: Chris Murphy @ 2014-01-21 18:42 UTC (permalink / raw) To: linux-raid@vger.kernel.org Mailing List On Jan 21, 2014, at 9:37 AM, Marc MERLIN <marc@merlins.org> wrote: > I should have said that this is seemingly a stupid question since obviously > if you encrypt each drive separately, you're going through the encryption > layer 5 times during rebuilds instead of just once. It wasn't a stupid question, but I think you've succeeded in confusing yourself into thinking more work is happening by encrypting the drives rather than the logical md device. > However in my case, I'm not CPU-bound, so that didn't seem to be an issue > and I was more curious to know if the dmcrypt and dmraid5 layers stacked the > same regardless of which one was on top and which one at the bottom. md raid isn't dmraid. I'm actually not sure where the dmraid work is at. I'm under the impression most of that work is happening within LVM2 - they now have their own raid 0,1,10,5,6 implementation. My understanding is it uses md kernel code, but uses lvm tools to create and monitor, rather than mdadm. Chris Murphy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-21 16:37 ` Marc MERLIN 2014-01-21 17:08 ` Mark Knecht 2014-01-21 18:42 ` Chris Murphy @ 2014-01-22 7:55 ` Stan Hoeppner 2014-01-22 17:48 ` Marc MERLIN 2014-01-22 19:38 ` Opal 2.0 SEDs on linux, was: " Chris Murphy 2 siblings, 2 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-22 7:55 UTC (permalink / raw) To: Marc MERLIN, linux-raid On 1/21/2014 10:37 AM, Marc MERLIN wrote: > On Mon, Jan 20, 2014 at 11:35:40PM -0800, Marc MERLIN wrote: >> Howdy, >> >> I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt. >> >> Question #1: >> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite >> (raid5 first, and then dmcrypt) For maximum throughput and to avoid hitting a ceiling with one thread on one core, using one dmcrypt thread per physical device is a way to achieve this. >> I used: >> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1 Changing the key size or the encryption method may decrease latency a bit, but likely not enough. > I should have said that this is seemingly a stupid question since obviously > if you encrypt each drive separately, you're going through the encryption > layer 5 times during rebuilds instead of just once. Each dmcrypt thread is handling 1/5th of the IOs. The low init throughput isn't caused by using 5 threads. One thread would likely do no better. > However in my case, I'm not CPU-bound, so that didn't seem to be an issue > and I was more curious to know if the dmcrypt and dmraid5 layers stacked the > same regardless of which one was on top and which one at the bottom. You are not CPU bound, nor hardware bandwidth bound. You are latency bound, just like every dmcrypt user. dmcrypt adds a non trivial amount of latency to every IO. Latency with serial IO equals low throughput. Experiment with these things to increase throughput. If you're using the CFQ elevator switch to deadline. Try smaller md chunk sizes, key lengths, different ciphers, etc. Turn off automatic CPU frequency scaling. I've read reports of encryption causing the frequency to drop instead of increase. In general, to increase serial IO throughput on a high latency path one must: 1. Issue lots of IOs asynchronously 2. And/or issue lots of IOs in parallel Or both. AFAIK both of these require code rewrites for md maintenance operations. Once in production, if your application workloads do 1 or 2 above then you may see higher throughput than the 18MB/s you see with the init. If your workloads are serial maybe not much more. Common sense says that encrypting 16TB of storage at the block level, using software libraries and optimized CPU instructions, is not a smart thing to do. Not if one desires decent performance, and especially if one doesn't need all 16TB encrypted. If you in fact don't need all 16TB encrypted, and I'd argue very few do, especially John and Jane Doe, then tear this down, build a regular array, and maintain an encrypted directory or few. If you actually *need* to encrypt all 16TB at the block level, and require decent performance, you need to acquire a dedicated crypto board. One board will cost more than your complete server. The cost of such devices should be a strong clue as to who does and does not need to encrypt their entire storage. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
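The elevator and frequency-scaling knobs mentioned above live in sysfs; roughly (sdm
stands in for each member drive, and these are the usual generic paths rather than
anything specific to this box):

~$ cat /sys/block/sdm/queue/scheduler              # shows e.g. noop deadline [cfq]
~$ echo deadline > /sys/block/sdm/queue/scheduler
~$ for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $g; done

The scheduler change has to be applied to each member drive, not to the md device.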
* Re: Very long raid5 init/rebuild times 2014-01-22 7:55 ` Stan Hoeppner @ 2014-01-22 17:48 ` Marc MERLIN 2014-01-22 23:17 ` Stan Hoeppner 2014-01-23 2:37 ` Stan Hoeppner 2014-01-22 19:38 ` Opal 2.0 SEDs on linux, was: " Chris Murphy 1 sibling, 2 replies; 41+ messages in thread From: Marc MERLIN @ 2014-01-22 17:48 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Wed, Jan 22, 2014 at 01:55:34AM -0600, Stan Hoeppner wrote: > >> Question #1: > >> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite > >> (raid5 first, and then dmcrypt) > > For maximum throughput and to avoid hitting a ceiling with one thread on > one core, using one dmcrypt thread per physical device is a way to > achieve this. There is that, but at rebuild time, if dmcrypt is after raid5, the raid5 rebuild would happen without going through encryption, and hence would save 5 core's worth of encryption bandwidth, would it not (for 5 drives) I agree that during non rebuild operation, I do get 5 cores of encryption bandwidth insttead of 1, so if I'm willing to suck up the CPU from rebuild time, it may be a good thing anyway. > >> I used: > >> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1 > > Changing the key size or the encryption method may decrease latency a > bit, but likely not enough. Ok, thanks. > > I should have said that this is seemingly a stupid question since obviously > > if you encrypt each drive separately, you're going through the encryption > > layer 5 times during rebuilds instead of just once. > > Each dmcrypt thread is handling 1/5th of the IOs. The low init > throughput isn't caused by using 5 threads. One thread would likely do > no better. If crypt is on top of raid5, it seems (and that makes sense) that no encryption is neded for the rebuild. However in my test I can confirm that the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth and I think tha'ts because of the port multiplier. > > However in my case, I'm not CPU-bound, so that didn't seem to be an issue > > and I was more curious to know if the dmcrypt and dmraid5 layers stacked the > > same regardless of which one was on top and which one at the bottom. > > You are not CPU bound, nor hardware bandwidth bound. You are latency > bound, just like every dmcrypt user. dmcrypt adds a non trivial amount > of latency to every IO. Latency with serial IO equals low throughput. Are you sure that applies here in the rebuild time? I see no crypt thread running. > Experiment with these things to increase throughput. If you're using > the CFQ elevator switch to deadline. Try smaller md chunk sizes, key > lengths, different ciphers, etc. Turn off automatic CPU frequency > scaling. I've read reports of encryption causing the frequency to drop > instead of increase. I'll check those too, they can't hurt. > Once in production, if your application workloads do 1 or 2 above then > you may see higher throughput than the 18MB/s you see with the init. If > your workloads are serial maybe not much more. I expect to see more because the drives will move inside the array that is directly connected to the SATA card without going through a PMP (with PMP all the SATA IO is shared on a single SATA chip). > Common sense says that encrypting 16TB of storage at the block level, > using software libraries and optimized CPU instructions, is not a smart > thing to do. Not if one desires decent performance, and especially if > one doesn't need all 16TB encrypted. 
I encrypt everything now because I think it's good general hygiene, and I don't want to think about where my drives and data end up 5 years later, or worry if they get stolen. Software encryption on linux has been close enough to wire speed for a little while now, I encrypt my 500MB/s capable SSD on my laptop and barely see slowdowns (except a bit of extra latency as you point out). > If you in fact don't need all 16TB encrypted, and I'd argue very few do, > especially John and Jane Doe, then tear this down, build a regular > array, and maintain an encrypted directory or few. Not bad advise in general. > If you actually *need* to encrypt all 16TB at the block level, and > require decent performance, you need to acquire a dedicated crypto > board. One board will cost more than your complete server. The cost of > such devices should be a strong clue as to who does and does not need to > encrypt their entire storage. I'm not actually convinced that the CPU is the bottleneck, and as pointed out if I put dmcrypt on top of raid5, the rebuild happens without any encryption. Or did I miss something? Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-22 17:48 ` Marc MERLIN @ 2014-01-22 23:17 ` Stan Hoeppner 2014-01-23 14:28 ` John Stoffel 2014-01-23 2:37 ` Stan Hoeppner 1 sibling, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-01-22 23:17 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/22/2014 11:48 AM, Marc MERLIN wrote: ... > If crypt is on top of raid5, it seems (and that makes sense) that no > encryption is neded for the rebuild. However in my test I can confirm that > the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth > and I think tha'ts because of the port multiplier. Ok, now I think we're finally getting to the heart of this. Given the fact that you're doing full array encryption, and after reading your bio on your website the other day, I think I've been giving you too much credit. So let's get back to md basics. Have you performed any md optimizations? The default value of /sys/block/mdX/md/stripe_cache_size is 256. This default is woefully inadequate for modern systems, and will yield dreadfully low throughput. To fix this execute ~$ echo 2048 > /sys/block/mdX/md/stripe_cache_size To specifically address slow resync speed try ~$ echo 50000 > /proc/sys/dev/raid/speed_limit_min And you also likely need to increase readahead from the default 128KB to something like 1MB (in 512KiB units) ~$ blockdev --setra 2048 /dev/mdX Since kernel 2.6.23 Linux does on demand readahead, so small random IO won't trigger it. Thus a large value here will not negatively impact random IO. See: http://lwn.net/Articles/235181/ These changes should give you just a bit of a boost to resync throughput, and streaming workloads in general. Please test and post your results. I don't think your problems have anything to do with crypto. However, after you get md running at peak performance you then may start to see limitations in your crypto setup, if you have chosen to switch to dmcrypt above md. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
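One caveat (an assumption here, not something covered above): none of these three
settings persist across a reboot, so they need to be reapplied at boot time, for
example appended to /etc/rc.local or set via a udev rule:

echo 2048 > /sys/block/md5/md/stripe_cache_size
echo 50000 > /proc/sys/dev/raid/speed_limit_min
blockdev --setra 2048 /dev/md5

The speed_limit knob can alternatively go into sysctl.conf as dev.raid.speed_limit_min.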
* Re: Very long raid5 init/rebuild times 2014-01-22 23:17 ` Stan Hoeppner @ 2014-01-23 14:28 ` John Stoffel 2014-01-24 1:02 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: John Stoffel @ 2014-01-23 14:28 UTC (permalink / raw) To: stan; +Cc: Marc MERLIN, linux-raid >>>>> "Stan" == Stan Hoeppner <stan@hardwarefreak.com> writes: Stan> On 1/22/2014 11:48 AM, Marc MERLIN wrote: Stan> ... >> If crypt is on top of raid5, it seems (and that makes sense) that no >> encryption is neded for the rebuild. However in my test I can confirm that >> the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth >> and I think tha'ts because of the port multiplier. Stan> Ok, now I think we're finally getting to the heart of this. Given the Stan> fact that you're doing full array encryption, and after reading your bio Stan> on your website the other day, I think I've been giving you too much Stan> credit. So let's get back to md basics. Have you performed any md Stan> optimizations? The default value of Stan> /sys/block/mdX/md/stripe_cache_size Stan> is 256. This default is woefully inadequate for modern systems, and Stan> will yield dreadfully low throughput. To fix this execute Stan> ~$ echo 2048 > /sys/block/mdX/md/stripe_cache_size Which Linux kernel version is this in? I'm running 3.9.3 on my main home server and I'm not finding this at all. Nor do I find it on my Linux Mint 16 desktop running 3.11.0-12-generic either. # cat /proc/version Linux version 3.9.3 (root@quad) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Wed May 22 12:15:10 EDT 2013 # find /sys -name "*stripe*" # find /dev -name "*stripe*" # find /proc -name "stripe*" # Oh wait... I'm a total moron here. This feature is only for RAID[456] arrays, and all I have are RAID1 mirrors for all my disks. Hmm... it would almost make more sense for it to be named something different, but legacy systems would be impacted. But more importantly, maybe it would make sense to have this number automatically scale with memory size? If you only have 1gig stay at 256, but then jump more aggresively to 1024, 2048, 4196 and 8192 and then (for now) capping at 8192. John ^ permalink raw reply [flat|nested] 41+ messages in thread
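A quick way to see whether a given array has the knob at all is to check the
personality in sysfs; stripe_cache_size is only created for the parity levels
(md0 here is illustrative):

~$ cat /sys/block/md0/md/level                     # prints raid1, raid5, raid6, ...
~$ ls /sys/block/md0/md/ | grep stripe             # stripe_cache_size only shows up for raid4/5/6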
* Re: Very long raid5 init/rebuild times 2014-01-23 14:28 ` John Stoffel @ 2014-01-24 1:02 ` Stan Hoeppner 2014-01-24 3:07 ` NeilBrown 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-01-24 1:02 UTC (permalink / raw) To: John Stoffel; +Cc: Marc MERLIN, linux-raid On 1/23/2014 8:28 AM, John Stoffel wrote: > But more importantly, maybe it would make sense to have this number > automatically scale with memory size? If you only have 1gig stay at > 256, but then jump more aggresively to 1024, 2048, 4196 and 8192 and > then (for now) capping at 8192. Setting the default based strictly on memory capacity won't work. See this discussion for background. http://www.spinics.net/lists/raid/msg45364.html -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-24 1:02 ` Stan Hoeppner @ 2014-01-24 3:07 ` NeilBrown 2014-01-24 8:24 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: NeilBrown @ 2014-01-24 3:07 UTC (permalink / raw) To: stan; +Cc: John Stoffel, Marc MERLIN, linux-raid [-- Attachment #1: Type: text/plain, Size: 781 bytes --] On Thu, 23 Jan 2014 19:02:21 -0600 Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 1/23/2014 8:28 AM, John Stoffel wrote: > > > But more importantly, maybe it would make sense to have this number > > automatically scale with memory size? If you only have 1gig stay at > > 256, but then jump more aggresively to 1024, 2048, 4196 and 8192 and > > then (for now) capping at 8192. > > Setting the default based strictly on memory capacity won't work. See > this discussion for background. > > http://www.spinics.net/lists/raid/msg45364.html > I would like to see the stripe cache grow on demand, shrink when idle, and use the "shrinker" interface to shrink even when not idle if there is memory pressure. So if someone wants a project.... NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-24 3:07 ` NeilBrown @ 2014-01-24 8:24 ` Stan Hoeppner 0 siblings, 0 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-24 8:24 UTC (permalink / raw) To: NeilBrown; +Cc: John Stoffel, Marc MERLIN, linux-raid On 1/23/2014 9:07 PM, NeilBrown wrote: > On Thu, 23 Jan 2014 19:02:21 -0600 Stan Hoeppner <stan@hardwarefreak.com> > wrote: > >> On 1/23/2014 8:28 AM, John Stoffel wrote: >> >>> But more importantly, maybe it would make sense to have this number >>> automatically scale with memory size? If you only have 1gig stay at >>> 256, but then jump more aggresively to 1024, 2048, 4196 and 8192 and >>> then (for now) capping at 8192. >> >> Setting the default based strictly on memory capacity won't work. See >> this discussion for background. >> >> http://www.spinics.net/lists/raid/msg45364.html >> > > I would like to see the stripe cache grow on demand, shrink when idle, and > use the "shrinker" interface to shrink even when not idle if there is memory > pressure. > So if someone wants a project.... > > NeilBrown I'm a user, not a kernel hacker, and I don't know C. Three strikes right there. :( Otherwise I'd love to tackle it. I do have some comments/ideas on the subject. Progressively growing and shrinking the cache should be relatively straightforward. We can do it dynamically today by modifying a system variable. What's needed is code to track data input volume or rate to md and to interface with the shrinker. I think the difficult aspect of this will be determining the upper bound on the cache size for a given system, as the optimum cache size directly correlates to the throughput of the hardware. With the current power of 2 restrictions, less than thorough testing indicates that disk based arrays seem to prefer a value of 1024-2048 for max throughput whereas SSD arrays seem to prefer 4096. In either case, going to the next legal value decreases throughput and eats double the RAM while doing so. So here we need some way to determine device throughput or at least device class, and set an upper bound accordingly. I also think we should consider unhitching our wagon from powers of 2 if we're going to be dynamically growing/shrinking the cache. I think grow/shrink should be progressive with smaller jumps. With 5 drives growing from 2048 to 4096 is going to grab 40MB of pages, likewise dumping 40MB for the impending shrink iteration, then 20MB, 10MB, and finally dumping 5MB arriving back at the 1MB/drive default. This may cause a lot of memory thrashing on some systems and workloads, evicting application data from L2/L3 caches. So we may want to be careful about how much memory we're shuffling and how often. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-22 17:48 ` Marc MERLIN 2014-01-22 23:17 ` Stan Hoeppner @ 2014-01-23 2:37 ` Stan Hoeppner 2014-01-23 9:13 ` Marc MERLIN 1 sibling, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-01-23 2:37 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/22/2014 11:48 AM, Marc MERLIN wrote: ... > If crypt is on top of raid5, it seems (and that makes sense) that no > encryption is neded for the rebuild. However in my test I can confirm that > the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth > and I think tha'ts because of the port multiplier. I didn't address this earlier as I assumed you, and anyone else reading this thread, would do a little background reading and realize no SATA PMP would behave in this manner. No SATA PMP, not Silicon Image, not Marvell, none of them, will limit host port throughput to 20MB/s. All of them achieve pretty close to wire speed throughput. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-23 2:37 ` Stan Hoeppner @ 2014-01-23 9:13 ` Marc MERLIN 2014-01-23 12:24 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Marc MERLIN @ 2014-01-23 9:13 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Wed, Jan 22, 2014 at 08:37:49PM -0600, Stan Hoeppner wrote: > On 1/22/2014 11:48 AM, Marc MERLIN wrote: > ... > > If crypt is on top of raid5, it seems (and that makes sense) that no > > encryption is neded for the rebuild. However in my test I can confirm that > > the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth > > and I think tha'ts because of the port multiplier. > > I didn't address this earlier as I assumed you, and anyone else reading > this thread, would do a little background reading and realize no SATA > PMP would behave in this manner. No SATA PMP, not Silicon Image, not > Marvell, none of them, will limit host port throughput to 20MB/s. All > of them achieve pretty close to wire speed throughput. I haven't answered your other message, as I'm getting more data to do so, but I can assure you that this is incorrect :) I've worked with 3 different PMP boards and three different SATA cards over the last 6 years (sil3124, 3132, and marvel), and got similarly slow results on all of them. The marvel was faster than sil3124 but it stopped being stable in kernels in the last year and fell unsupported (no one to fix the bugs), so I went back to sil3124. I'm not saying that they can't go faster somehow, but in my experience that has not been the case. In case you don't believe me, I just switched my drives from the PMP to directly connected to the motherboard and a marvel card, and my rebuild speed changed from 19MB/s to 99MB/s. (I made no other setting changes, but I did try your changes without saving them before and after the PMP change and will report below) You also said: > Ok, now I think we're finally getting to the heart of this. Given the > fact that you're doing full array encryption, and after reading your bio > on your website the other day, I think I've been giving you too much > credit. So let's get back to md basics. Have you performed any md > optimizations? The default value of Can't hurt to ask, you never know if I may have forgotten or not know about one. > /sys/block/mdX/md/stripe_cache_size > is 256. This default is woefully inadequate for modern systems, and > will yield dreadfully low throughput. To fix this execute > ~$ echo 2048 > /sys/block/mdX/md/stripe_cache_size Thanks for that one. It made no speed difference on the PMP or without, but can't hurt to do anyway. > To specifically address slow resync speed try > ~$ echo 50000 > /proc/sys/dev/raid/speed_limit_min I had this, but good reminder. > And you also likely need to increase readahead from the default 128KB to > something like 1MB (in 512KiB units) > > ~$ blockdev --setra 2048 /dev/mdX I had this already set to 8192, but again, thanks for asking too. > Since kernel 2.6.23 Linux does on demand readahead, so small random IO > won't trigger it. Thus a large value here will not negatively impact > random IO. See: http://lwn.net/Articles/235181/ > > Please test and post your results. I don't think your problems have > anything to do with crypto. However, after you get md running at peak > performance you then may start to see limitations in your crypto setup, > if you have chosen to switch to dmcrypt above md. Looks like so far my only problem was the PMP. Thank you for your suggestions though. 
Back to my original questions: > Question #1: > Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite > (raid5 first, and then dmcrypt) > I used: > cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1 As you did point out, the array will be faster when I use it because the encryption will be sharded over my CPUs, but rebuilding is going to create 5 encryption threads whereas if md5 is first and encryption is on top, rebuilds do not involve any encryption on CPU. So it depends what's more important. > Question #2: > In order to copy data from a working system, I connected the drives via an external > enclosure which uses a SATA PMP. As a result, things are slow: > > md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0] > 15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_] > [>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec > bitmap: 0/30 pages [0KB], 65536KB chunk > > 2.5 days for an init or rebuild is going to be painful. > I already checked that I'm not CPU/dmcrpyt pegged. > > I read Neil's message why init is still required: > http://marc.info/?l=linux-raid&m=112044009718483&w=2 > even if somehow on brand new blank drives full of 0s I'm thinking this could be faster > by just assuming the array is clean (all 0s give a parity of 0). > Is it really unsafe to do so? (actually if you do this on top of dmcrypt > like I did here, I won't get 0s, so that way around, it's unfortunately > necessary). Still curious on this: if the drives are brand new, is it safe to assume t> hey're full of 0's and tell mdadm to skip the re-init? (parity of X x 0 = 0) > Question #3: > Since I'm going to put btrfs on top, I'm almost tempted to skip the md raid5 > layer and just use the native support, but the raid code in btrfs still > seems a bit younger than I'm comfortable with. > Is anyone using it and has done disk failures, replaces, and all? Ok, this is not a btrfs list, so I'll asume no one tried that here, no biggie. Cheers, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-23 9:13 ` Marc MERLIN @ 2014-01-23 12:24 ` Stan Hoeppner 2014-01-23 21:01 ` Marc MERLIN 2014-01-30 20:18 ` Phillip Susi 0 siblings, 2 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-23 12:24 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/23/2014 3:13 AM, Marc MERLIN wrote: > On Wed, Jan 22, 2014 at 08:37:49PM -0600, Stan Hoeppner wrote: >> On 1/22/2014 11:48 AM, Marc MERLIN wrote: >> ... >>> If crypt is on top of raid5, it seems (and that makes sense) that no >>> encryption is neded for the rebuild. However in my test I can confirm that >>> the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth >>> and I think tha'ts because of the port multiplier. >> >> I didn't address this earlier as I assumed you, and anyone else reading >> this thread, would do a little background reading and realize no SATA >> PMP would behave in this manner. No SATA PMP, not Silicon Image, not >> Marvell, none of them, will limit host port throughput to 20MB/s. All >> of them achieve pretty close to wire speed throughput. > > I haven't answered your other message, as I'm getting more data to do > so, but I can assure you that this is incorrect :) > > I've worked with 3 different PMP boards and three different SATA cards > over the last 6 years (sil3124, 3132, and marvel), and got similarly > slow results on all of them. > The marvel was faster than sil3124 but it stopped being stable in > kernels in the last year and fell unsupported (no one to fix the bugs), > so I went back to sil3124. > > I'm not saying that they can't go faster somehow, but in my experience > that has not been the case. Others don't seem to be having such PMP problems. Not in modern times anyway. Maybe it's just your specific hardware mix. If eliminating the PMP increased your read-only resync speed by a factor of 5x, I'm elated to be wrong here. > In case you don't believe me, I just switched my drives from the PMP to > directly connected to the motherboard and a marvel card, and my rebuild > speed changed from 19MB/s to 99MB/s. > (I made no other setting changes, but I did try your changes without > saving them before and after the PMP change and will report below) Why would you assume I wouldn't believe you? > You also said: >> Ok, now I think we're finally getting to the heart of this. Given the >> fact that you're doing full array encryption, and after reading your bio >> on your website the other day, I think I've been giving you too much >> credit. So let's get back to md basics. Have you performed any md >> optimizations? The default value of > > Can't hurt to ask, you never know if I may have forgotten or not know about one. > >> /sys/block/mdX/md/stripe_cache_size >> is 256. This default is woefully inadequate for modern systems, and >> will yield dreadfully low throughput. To fix this execute >> ~$ echo 2048 > /sys/block/mdX/md/stripe_cache_size > > Thanks for that one. > It made no speed difference on the PMP or without, but can't hurt to do anyway. If you're not writing it won't. The problem here is that you're apparently using a non-destructive resync as a performance benchmark. Don't do that. It's representative of nothing but read-only resync speed. Increasing stripe_cache_size above the default as I suggested will ALWAYS increase write speed, often by a factor of 2-3x or more on modern hardware. It should speed up destructive resyncs considerably, as well as normal write IO. 
Once your array has settled down after the inits and resyncs and what not, run some parallel FIO write tests with the default of 256 and then with 2048. You can try 4096 as well, but with 5 rusty drives 4096 will probably cause a slight tailing off of throughput. 2048 should be your sweet spot. You can also just time a few large parallel file copies. You'll be amazed at the gains. The reason is simply that the default of 256 was selected some ~10 years ago when disks were much slower. Increasing this default has been a topic of much discussion recently, because bumping it up increases throughput for everyone, substantially, even with 3 disk RAID5 arrays. >> To specifically address slow resync speed try >> ~$ echo 50000 > /proc/sys/dev/raid/speed_limit_min > > I had this, but good reminder. > >> And you also likely need to increase readahead from the default 128KB to >> something like 1MB (in 512KiB units) >> >> ~$ blockdev --setra 2048 /dev/mdX > > I had this already set to 8192, but again, thanks for asking too. > >> Since kernel 2.6.23 Linux does on demand readahead, so small random IO >> won't trigger it. Thus a large value here will not negatively impact >> random IO. See: http://lwn.net/Articles/235181/ >> >> Please test and post your results. I don't think your problems have >> anything to do with crypto. However, after you get md running at peak >> performance you then may start to see limitations in your crypto setup, >> if you have chosen to switch to dmcrypt above md. > > Looks like so far my only problem was the PMP. That's because you've not been looking deep enough. > Thank you for your suggestions though. You're welcome. > Back to my original questions: >> Question #1: >> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite >> (raid5 first, and then dmcrypt) >> I used: >> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1 > > As you did point out, the array will be faster when I use it because the > encryption will be sharded over my CPUs, but rebuilding is going to create 5 encryption > threads whereas if md5 is first and encryption is on top, rebuilds do > not involve any encryption on CPU. > > So it depends what's more important. Yep. If you post what CPU you're using I can probably give you a good idea if one core is sufficient for dmcrypt. I'll also reiterate that encrypting a 16TB array device is silly when you can simply carve off an LV for files that need to be encrypted, and run dmcrypt only against that LV. You can always expand an LV. This is a huge performance win for all other files, such your media collections, which don't need to be encrypted. >> Question #2: >> In order to copy data from a working system, I connected the drives via an external >> enclosure which uses a SATA PMP. As a result, things are slow: >> >> md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0] >> 15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_] >> [>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec >> bitmap: 0/30 pages [0KB], 65536KB chunk >> >> 2.5 days for an init or rebuild is going to be painful. With stripe_cache_size=2048 this should drop from 2.5 days to less than a day. >> I already checked that I'm not CPU/dmcrpyt pegged. 
>> >> I read Neil's message why init is still required: >> http://marc.info/?l=linux-raid&m=112044009718483&w=2 >> even if somehow on brand new blank drives full of 0s I'm thinking this could be faster >> by just assuming the array is clean (all 0s give a parity of 0). >> Is it really unsafe to do so? (actually if you do this on top of dmcrypt >> like I did here, I won't get 0s, so that way around, it's unfortunately >> necessary). > > Still curious on this: if the drives are brand new, is it safe to assume > t> hey're full of 0's and tell mdadm to skip the re-init? > (parity of X x 0 = 0) No, for a few reasons: 1. Because not all bits are always 0 out of the factory. 2. Bad sectors may exist and need to be discovered/remapped 3. With the increased stripe_cache_size, and if your CPU turns out to be fast enough for dmcrypt in front of md, resync speed won't be as much of an issue, eliminating your motivation for skipping the init. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
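One crude way to see the stripe_cache_size effect described above, short of setting
up fio, is to time the same large write at both values (a single-stream dd understates
parallel throughput, but the trend shows; the path and size are arbitrary):

~$ echo 256 > /sys/block/md5/md/stripe_cache_size
~$ dd if=/dev/zero of=/mnt/btrfs_pool1/sctest1 bs=1M count=4096 conv=fdatasync
~$ echo 2048 > /sys/block/md5/md/stripe_cache_size
~$ dd if=/dev/zero of=/mnt/btrfs_pool1/sctest2 bs=1M count=4096 conv=fdatasync
~$ rm /mnt/btrfs_pool1/sctest[12]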
* Re: Very long raid5 init/rebuild times 2014-01-23 12:24 ` Stan Hoeppner @ 2014-01-23 21:01 ` Marc MERLIN 2014-01-24 5:13 ` Stan Hoeppner 2014-01-30 20:18 ` Phillip Susi 1 sibling, 1 reply; 41+ messages in thread From: Marc MERLIN @ 2014-01-23 21:01 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Thu, Jan 23, 2014 at 06:24:39AM -0600, Stan Hoeppner wrote: > > In case you don't believe me, I just switched my drives from the PMP to > > directly connected to the motherboard and a marvel card, and my rebuild > > speed changed from 19MB/s to 99MB/s. > > (I made no other setting changes, but I did try your changes without > > saving them before and after the PMP change and will report below) > > Why would you assume I wouldn't believe you? You seemed incredulous that PMPs could make things so slow :) > > Thanks for that one. > > It made no speed difference on the PMP or without, but can't hurt to do anyway. > > If you're not writing it won't. The problem here is that you're > apparently using a non-destructive resync as a performance benchmark. > Don't do that. It's representative of nothing but read-only resync speed. Let me think about this: the resync is done at build array time. If all the drives are full of 0's indeed there will be nothing to write. Given that, I think you're right. > Increasing stripe_cache_size above the default as I suggested will > ALWAYS increase write speed, often by a factor of 2-3x or more on modern > hardware. It should speed up destructive resyncs considerably, as well > as normal write IO. Once your array has settled down after the inits > and resyncs and what not, run some parallel FIO write tests with the > default of 256 and then with 2048. You can try 4096 as well, but with 5 > rusty drives 4096 will probably cause a slight tailing off of > throughput. 2048 should be your sweet spot. You can also just time a > few large parallel file copies. You'll be amazed at the gains. Will do, thanks. > The reason is simply that the default of 256 was selected some ~10 years > ago when disks were much slower. Increasing this default has been a > topic of much discussion recently, because bumping it up increases > throughput for everyone, substantially, even with 3 disk RAID5 arrays. Great to hear that the default may hopefully be increased for all. > > As you did point out, the array will be faster when I use it because the > > encryption will be sharded over my CPUs, but rebuilding is going to create 5 encryption > > threads whereas if md5 is first and encryption is on top, rebuilds do > > not involve any encryption on CPU. > > > > So it depends what's more important. > > Yep. If you post what CPU you're using I can probably give you a good > idea if one core is sufficient for dmcrypt. Oh, I did forget to post that. 
That server is a low power-ish dual core with 4 HT units: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i3-2100T CPU @ 2.50GHz stepping : 7 microcode : 0x28 cpu MHz : 2500.000 cache size : 3072 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 2 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave avx lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 5150.14 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: > I'll also reiterate that encrypting a 16TB array device is silly when > you can simply carve off an LV for files that need to be encrypted, and > run dmcrypt only against that LV. You can always expand an LV. This is > a huge performance win for all other files, such your media collections, > which don't need to be encrypted. I use btrfs for LV management, so it's easier to encrypt the entire pool. I also encrypt any data on any drive at this point, kind of like I wash my hands. I'm not saying it's the right thing to do for all, but it's my personal choice. I've seen too many drives end up on ebay with data, and I don't want to have to worry about this later, or even erasing my own drives before sending them back to warranty, especially in cases where maybe I can't erase them, but the manufacturer can read them anyway. You get the idea... I've used LVM for too many years (15 was it?) and I'm happy to switch away now :) (I know thin snapshots were recently added, but basically I've been not super happy with LVM performance, and LVM snapshots have been abysmal if you keep them long term). Also, this is off topic here, but I like the fact that I can compute snapshot diffs with btfrs and use that for super fast backups of changed blocks instead of a very slow rsync that has to scan millions of inodes (which is what I've been doing so far). > >> Question #2: > >> In order to copy data from a working system, I connected the drives via an external > >> enclosure which uses a SATA PMP. As a result, things are slow: > >> > >> md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0] > >> 15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_] > >> [>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec > >> bitmap: 0/30 pages [0KB], 65536KB chunk > >> > >> 2.5 days for an init or rebuild is going to be painful. > > With stripe_cache_size=2048 this should drop from 2.5 days to less than > a day. It didn't since it PMP limited, but I made that change for the other reasons you suggested. > > Still curious on this: if the drives are brand new, is it safe to assume > > t> hey're full of 0's and tell mdadm to skip the re-init? > > (parity of X x 0 = 0) > > No, for a few reasons: > > 1. Because not all bits are always 0 out of the factory. > 2. Bad sectors may exist and need to be discovered/remapped > 3. 
With the increased stripe_cache_size, and if your CPU turns out to > be fast enough for dmcrypt in front of md, resync speed won't be as much > of an issue, eliminating your motivation for skipping the init. All fair points, thanks for explaining. For now, I put dmcrypt on top of md5, I get 100MB/s raw block write speed (actually just writing a big file in btrfs and going through all the layers) even though it's only using one CPU thread for encryption instead of 2 or more if each disk were encrypted under the md5 layer. Since 100MB/s was also the resync speed I was getting without encryption involved, looks like a single CPU thread can keep up with the raw IO of the array, so I guess I'll leave things that way. As another test gargamel:/mnt/btrfs_pool1# dd if=/dev/md5 of=/dev/null bs=1M count=1024 1073741824 bytes (1.1 GB) copied, 9.78191 s, 110 MB/s So it looks like 100-110MB/s is the read and write speed limit of that array. The drives are rated for 150MB/s each so I'm not too sure which limit I'm hitting, but 100MB/s is fast enough for my intended use. Thanks for you answers again, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
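For what it's worth, a plain dd read from the md device goes through the page cache
and readahead, so the number can be skewed either way; a slightly more controlled
single-stream read test would be something like:

~$ echo 3 > /proc/sys/vm/drop_caches
~$ dd if=/dev/md5 of=/dev/null bs=1M count=4096 iflag=direct

It is still only a lower bound, as the parallel fio runs discussed below make clear.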
* Re: Very long raid5 init/rebuild times 2014-01-23 21:01 ` Marc MERLIN @ 2014-01-24 5:13 ` Stan Hoeppner 2014-01-25 8:36 ` Marc MERLIN 2014-01-30 20:36 ` Phillip Susi 0 siblings, 2 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-24 5:13 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/23/2014 3:01 PM, Marc MERLIN wrote: > On Thu, Jan 23, 2014 at 06:24:39AM -0600, Stan Hoeppner wrote: >>> In case you don't believe me, I just switched my drives from the PMP to >>> directly connected to the motherboard and a marvel card, and my rebuild >>> speed changed from 19MB/s to 99MB/s. >>> (I made no other setting changes, but I did try your changes without >>> saving them before and after the PMP change and will report below) >> >> Why would you assume I wouldn't believe you? > > You seemed incredulous that PMPs could make things so slow :) Well, no, not really. I know there are some real quality issues with a lot of cheap PMP JBODs out there. I was just surprised to see an experienced Linux sysadmin have bad luck with 3/3 of em. Most folks using Silicon Image HBAs with SiI PMPs seem to get good performance. Personally, I've never used PMPs. Given the cost ratio between drives and HBA ports, a quality 4/8 port SAS HBA such as one of the LSIs is a better solution all around. 4TB drives average $200 each. A five drive array is $1000. An LSI 8 port 12G SAS HBA with guaranteed compatibility, quality, support, and performance is $300. A cheap 2 port SATA HBA and 5 port PMP card gives sub optimal performance, iffy compatibility, and low quality, and is ~$130. $1300 vs $1130. Going with a cheap SATA HBA and PMP makes no sense. >>> Thanks for that one. >>> It made no speed difference on the PMP or without, but can't hurt to do anyway. >> >> If you're not writing it won't. The problem here is that you're >> apparently using a non-destructive resync as a performance benchmark. >> Don't do that. It's representative of nothing but read-only resync speed. > > Let me think about this: the resync is done at build array time. > If all the drives are full of 0's indeed there will be nothing to write. > Given that, I think you're right. The initial resync is read-only. It won't modify anything unless there's a discrepancy. So the stripe cache isn't in play. The larger stripe cache should indeed increase rebuild rate though. >> Increasing stripe_cache_size above the default as I suggested will >> ALWAYS increase write speed, often by a factor of 2-3x or more on modern >> hardware. It should speed up destructive resyncs considerably, as well >> as normal write IO. Once your array has settled down after the inits >> and resyncs and what not, run some parallel FIO write tests with the >> default of 256 and then with 2048. You can try 4096 as well, but with 5 >> rusty drives 4096 will probably cause a slight tailing off of >> throughput. 2048 should be your sweet spot. You can also just time a >> few large parallel file copies. You'll be amazed at the gains. > > Will do, thanks. > >> The reason is simply that the default of 256 was selected some ~10 years >> ago when disks were much slower. Increasing this default has been a >> topic of much discussion recently, because bumping it up increases >> throughput for everyone, substantially, even with 3 disk RAID5 arrays. > > Great to hear that the default may hopefully be increased for all. It may be a while, or never. Neil's last note suggests the default likely won't change, but eventually we may have automated stripe cache size management. 
>>> As you did point out, the array will be faster when I use it because the >>> encryption will be sharded over my CPUs, but rebuilding is going to create 5 encryption >>> threads whereas if md5 is first and encryption is on top, rebuilds do >>> not involve any encryption on CPU. >>> >>> So it depends what's more important. >> >> Yep. If you post what CPU you're using I can probably give you a good >> idea if one core is sufficient for dmcrypt. > > Oh, I did forget to post that. > > That server is a low power-ish dual core with 4 HT units: ... > model name : Intel(R) Core(TM) i3-2100T CPU @ 2.50GHz ... > cache size : 3072 KB ... Actually, instead of me making an educated guess, I'd suggest you run ~$ cryptsetup benchmark This will tell you precisely what your throughput is with various settings and ciphers. Depending on what this spits back you may want to change your setup, assuming we get the IO throughput where it should be. >> I'll also reiterate that encrypting a 16TB array device is silly when >> you can simply carve off an LV for files that need to be encrypted, and >> run dmcrypt only against that LV. You can always expand an LV. This is >> a huge performance win for all other files, such your media collections, >> which don't need to be encrypted. > > I use btrfs for LV management, so it's easier to encrypt the entire pool. I > also encrypt any data on any drive at this point, kind of like I wash my > hands. I'm not saying it's the right thing to do for all, but it's my > personal choice. I've seen too many drives end up on ebay with data, and I > don't want to have to worry about this later, or even erasing my own drives > before sending them back to warranty, especially in cases where maybe I > can't erase them, but the manufacturer can read them anyway. > You get the idea... So be it. Now let's work to see if we can squeeze every ounce of performance out of it. ... >>>> Question #2: >>>> In order to copy data from a working system, I connected the drives via an external >>>> enclosure which uses a SATA PMP. As a result, things are slow: >>>> >>>> md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0] >>>> 15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_] >>>> [>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec >>>> bitmap: 0/30 pages [0KB], 65536KB chunk >>>> >>>> 2.5 days for an init or rebuild is going to be painful. >> >> With stripe_cache_size=2048 this should drop from 2.5 days to less than >> a day. > > It didn't since it PMP limited, but I made that change for the other reasons > you suggested. You said you had pulled the PMP and connected direct to an HBA, bumping from 19MB/s to 99MB/s. Did you switch back to the PMP and are now getting 100MB/s through the PMP? We should be able to get much higher if it's 3/6G SATA, a little higher if it's 1/5G. >>> Still curious on this: if the drives are brand new, is it safe to assume >>> t> hey're full of 0's and tell mdadm to skip the re-init? >>> (parity of X x 0 = 0) >> >> No, for a few reasons: >> >> 1. Because not all bits are always 0 out of the factory. >> 2. Bad sectors may exist and need to be discovered/remapped >> 3. With the increased stripe_cache_size, and if your CPU turns out to >> be fast enough for dmcrypt in front of md, resync speed won't be as much >> of an issue, eliminating your motivation for skipping the init. I shouldn't have included #3 here as it doesn't affect initial resync, only rebuild. > All fair points, thanks for explaining. 
> For now, I put dmcrypt on top of md5, I get 100MB/s raw block write speed (actually > just writing a big file in btrfs and going through all the layers) even > though it's only using one CPU thread for encryption instead of 2 or more if > each disk were encrypted under the md5 layer. 100MB/s sequential read throughput is very poor for a 5 drive RAID5, especially with new 4TB drives which can stream well over 130MB/s each. > Since 100MB/s was also the resync speed I was getting without encryption > involved, looks like a single CPU thread can keep up with the raw IO of the > array, so I guess I'll leave things that way. 100MB/s is leaving big performance on the table. And 100 isn't the peak array throughput of your current configuration. > As another test > gargamel:/mnt/btrfs_pool1# dd if=/dev/md5 of=/dev/null bs=1M count=1024 > 1073741824 bytes (1.1 GB) copied, 9.78191 s, 110 MB/s dd single stream copies are not a valid test of array throughput. This tells you only the -minimum- throughput of the array. > So it looks like 100-110MB/s is the read and write speed limit of that array. > The drives are rated for 150MB/s each so I'm not too sure which limit I'm > hitting, but 100MB/s is fast enough for my intended use. To test real maximum throughput install fio, save and run this job file, and post your results. Monitor CPU burn of dmcrypt, using top is fine, while running the job to see if it eats all of one core. The job runs in multiple steps, first creating the eight 1GB test files, then running the read/write tests against those files. [global] directory=/some/directory zero_buffers numjobs=4 group_reporting blocksize=1024k ioengine=libaio iodepth=16 direct=1 size=1g [read] rw=read stonewall [write] rw=write stonewall > Thanks for you answers again, You're welcome. If you wish to wring maximum possible performance from this rig I'll stick with ya until we get there. You're not far. Just takes some testing and tweaking unless you have a real hardware limitation, not a driver setting or firmware issue. BTW, I don't recall you mentioning which HBA and PMP you're using at the moment, and whether the PMP is an Addonics card or integrated in a JBOD. Nor if you're 1.5/3/6G from HBA through PMP to each drive. Post your dmesg output showing the drive link speeds if you would, i.e. ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
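To use the job file above, point directory= at a path on the array, save it under any
name (the name below is arbitrary), and run it while watching CPU use in another
terminal; the cryptsetup benchmark invocation matching the cipher and key size used
earlier in the thread is shown as well:

~$ fio raid5-test.fio
~$ top                                             # watch for a single core saturating on dm-crypt work
~$ cryptsetup benchmark -c aes-xts-plain64 -s 256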
* Re: Very long raid5 init/rebuild times 2014-01-24 5:13 ` Stan Hoeppner @ 2014-01-25 8:36 ` Marc MERLIN 2014-01-28 7:46 ` Stan Hoeppner 2014-01-30 20:36 ` Phillip Susi 1 sibling, 1 reply; 41+ messages in thread From: Marc MERLIN @ 2014-01-25 8:36 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Thu, Jan 23, 2014 at 11:13:41PM -0600, Stan Hoeppner wrote: > Well, no, not really. I know there are some real quality issues with a > lot of cheap PMP JBODs out there. I was just surprised to see an > experienced Linux sysadmin have bad luck with 3/3 of em. Most folks > using Silicon Image HBAs with SiI PMPs seem to get good performance. I've worked with the raw chips on silicon, have the firmware flashing tool for the PMP, and never saw better than that. So I'm not sure who those most folks are, or what chips they have, but obviously the experience you describe is very different from the one I've seen, or even from what the 2 kernel folks I know who used to maintain them have, since they've abandonned using them due to them being more trouble than they're worth and the performance poor. To be fair, at the time I cared about performance on PMP, I was also using snapshots on LVM and those were so bad that they actually were the performance issue sometimes I got as slow as 5MB/s. Yes, LVM snapshots were horrible for performance, which is why I switched to brtfs now. > Personally, I've never used PMPs. Given the cost ratio between drives > and HBA ports, a quality 4/8 port SAS HBA such as one of the LSIs is a > better solution all around. 4TB drives average $200 each. A five drive > array is $1000. An LSI 8 port 12G SAS HBA with guaranteed > compatibility, quality, support, and performance is $300. A cheap 2 You are correct. When I started with PMPs there was not a single good SATA card that had 10 ports or more and didn't cost $900. That was 4-5 years ago though. Today, I don't use PMPs anymore, except for some enclosures where it's easy to just have one cable and where what you describe would need 5 sata cables to the enclosure, would it not? (unless you use something like USB3, but that's another interface I've had my share of driver bug problems with, so it's not a net win either). > port SATA HBA and 5 port PMP card gives sub optimal performance, iffy > compatibility, and low quality, and is ~$130. $1300 vs $1130. Going > with a cheap SATA HBA and PMP makes no sense. I generally agree. Here I was using it to transfer data off some drives, but indeed I wouldn't use this for a main array. > > Let me think about this: the resync is done at build array time. > > If all the drives are full of 0's indeed there will be nothing to write. > > Given that, I think you're right. > > The initial resync is read-only. It won't modify anything unless > there's a discrepancy. So the stripe cache isn't in play. The larger > stripe cache should indeed increase rebuild rate though. Right, I understood that the first time you explained it. > Actually, instead of me making an educated guess, I'd suggest you run > > ~$ cryptsetup benchmark > > This will tell you precisely what your throughput is with various > settings and ciphers. Depending on what this spits back you may want to > change your setup, assuming we get the IO throughput where it should be. Sigh, debian unstable doesn't have the brand new cryptsetup with that option yet, will have to get it. Either way, I already know my CPU is not a bottleneck, so it's not that important. 
> > I use btrfs for LV management, so it's easier to encrypt the entire pool. I > > also encrypt any data on any drive at this point, kind of like I wash my > > hands. I'm not saying it's the right thing to do for all, but it's my > > personal choice. I've seen too many drives end up on ebay with data, and I > > don't want to have to worry about this later, or even erasing my own drives > > before sending them back to warranty, especially in cases where maybe I > > can't erase them, but the manufacturer can read them anyway. > > You get the idea... > > So be it. Now let's work to see if we can squeeze every ounce of > performance out of it. Since I get the same speed writing through all the layers as raid5 gets doing a resync without writes and the other layers, I'm not sure how you're suggesting that I can get extra performance. Well, unless you mean just raw swraid5 can be made faster with my drives still. That is likely possible if I get a better sata card to put in my machine or find another way to increase cpu to drive throughput. > You said you had pulled the PMP and connected direct to an HBA, bumping > from 19MB/s to 99MB/s. Did you switch back to the PMP and are now > getting 100MB/s through the PMP? We should be able to get much higher > if it's 3/6G SATA, a little higher if it's 1/5G. No, I did not. I'm not planning on having my destination array (the one I'm writing to) behind a PMP for the reasons we discussed above. The ports are 3MB/s. Obviously I'm not getting the right speed, but I think there is something wrong with the motherboard of the system this is in, causing some bus conflicts and slowdowns. This is something I'll need to investigate outside of this list since it's not related to raid anymore. > > For now, I put dmcrypt on top of md5, I get 100MB/s raw block write speed (actually > > just writing a big file in btrfs and going through all the layers) even > > though it's only using one CPU thread for encryption instead of 2 or more if > > each disk were encrypted under the md5 layer. > > 100MB/s sequential read throughput is very poor for a 5 drive RAID5, > especially with new 4TB drives which can stream well over 130MB/s each. Yes, I totally agree. > > As another test > > gargamel:/mnt/btrfs_pool1# dd if=/dev/md5 of=/dev/null bs=1M count=1024 > > 1073741824 bytes (1.1 GB) copied, 9.78191 s, 110 MB/s > > dd single stream copies are not a valid test of array throughput. This > tells you only the -minimum- throughput of the array. If the array is idle, how is that not a valid block read test? > > So it looks like 100-110MB/s is the read and write speed limit of that array. > To test real maximum throughput install fio, save and run this job file, > and post your results. Monitor CPU burn of dmcrypt, using top is fine, > while running the job to see if it eats all of one core. The job runs > in multiple steps, first creating the eight 1GB test files, then running > the read/write tests against those files. > > [global] > directory=/some/directory > zero_buffers > numjobs=4 > group_reporting > blocksize=1024k > ioengine=libaio > iodepth=16 > direct=1 > size=1g > > [read] > rw=read > stonewall > > [write] > rw=write > stonewall Yeah, I have fio, didn't seem needed here, but I'll it a shot when I get a chance. > > Thanks for you answers again, > > You're welcome. If you wish to wring maximum possible performance from > this rig I'll stick with ya until we get there. You're not far. 
Just > takes some testing and tweaking unless you have a real hardware > limitation, not a driver setting or firmware issue. Thanks for your offer, although to be honest, I think I'm hitting a hardware problem which I need to look into when I get a chance. > BTW, I don't recall you mentioning which HBA and PMP you're using at the > moment, and whether the PMP is an Addonics card or integrated in a JBOD. > Nor if you're 1.5/3/6G from HBA through PMP to each drive. That PMP is integrated in the jbod, I haven't torn it apart to check which one it was, but I've pretty much gotten slow speeds from those things and more importantly PMPs have bugs during drive hangs and retries which can cause recovery problems and killing swraid5 arrays, so that's why I stopped using them for serious use. The driver authors know about the issues, and some are in the PMP firmware and not something they can work around. > Post your dmesg output showing the drive link speeds if you would, i.e. > ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Yep, very familiar with that unfortunately from my PMP debugging days [ 6.188660] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 6.211533] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 6.444897] ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 6.444918] ata1.01: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 6.445087] ata2.00: SATA link up 6.0 Gbps (SStatus 133 SControl 330) [ 6.445109] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 14.179297] ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 14.675693] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 15.516390] ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 16.008800] ata12: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 19.339559] ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 0) [ 19.692273] ata14.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320) [ 20.705263] ata14.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 21.785956] ata14.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 22.899091] ata14.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 23.935813] ata14.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Of course, I'm not getting that speed, but again, I'll look into it. Thanks for your suggestions for tweaks. Best, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
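One quick way to tell whether the shortfall is per drive or in the shared path (HBA, PMP, or bus) is to read each member device on its own with the cache bypassed; with the real member names substituted in:

for d in /dev/sd[mnopq]; do echo $d; dd if=$d of=/dev/null bs=1M count=1024 iflag=direct; done

If every drive reads at the expected 130-150MB/s alone but the array tops out near 100MB/s, the loss is in the aggregate path rather than the disks.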
* Re: Very long raid5 init/rebuild times 2014-01-25 8:36 ` Marc MERLIN @ 2014-01-28 7:46 ` Stan Hoeppner 2014-01-28 16:50 ` Marc MERLIN 2014-01-30 20:47 ` Phillip Susi 0 siblings, 2 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-28 7:46 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/25/2014 2:36 AM, Marc MERLIN wrote: > On Thu, Jan 23, 2014 at 11:13:41PM -0600, Stan Hoeppner wrote: >> Well, no, not really. I know there are some real quality issues with a >> lot of cheap PMP JBODs out there. I was just surprised to see an >> experienced Linux sysadmin have bad luck with 3/3 of em. Most folks >> using Silicon Image HBAs with SiI PMPs seem to get good performance. > > I've worked with the raw chips on silicon, have the firmware flashing tool > for the PMP, and never saw better than that. > So I'm not sure who those most folks are, or what chips they have, but > obviously the experience you describe is very different from the one I've > seen, or even from what the 2 kernel folks I know who used to maintain them > have, since they've abandonned using them due to them being more trouble > than they're worth and the performance poor. The first that comes to mind is Backblaze, a cloud storage provider for consumer file backup. They're on their 3rd generation of storage pod, and they're still using the original Syba SiI 3132 PCIe, Addonics SiI 3124 PCI cards, and SiI 3726 PMP backplane boards, since 2009. All Silicon Image ASICs both HBA and PMP. Each pod has 4 SATA cards and 9 PMPs boards with 45 drive slots. The version 3.0 pod offers 180TB of storage. They have a few hundred of these storage pods in service backing up user files over the net. Here's the original design. The post has links to version 2 and 3. http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ The key to their success is obviously working closely with all their vendors to make sure the SATA cards and PMPs have the correct firmware versions to work reliably with each other. Consumers buying cheap big box store HBAs and enclosures don't have this advantage. > To be fair, at the time I cared about performance on PMP, I was also using > snapshots on LVM and those were so bad that they actually were the > performance issue sometimes I got as slow as 5MB/s. Yes, LVM snapshots were > horrible for performance, which is why I switched to brtfs now. > >> Personally, I've never used PMPs. Given the cost ratio between drives >> and HBA ports, a quality 4/8 port SAS HBA such as one of the LSIs is a >> better solution all around. 4TB drives average $200 each. A five drive >> array is $1000. An LSI 8 port 12G SAS HBA with guaranteed >> compatibility, quality, support, and performance is $300. A cheap 2 > > You are correct. When I started with PMPs there was not a single good SATA > card that had 10 ports or more and didn't cost $900. That was 4-5 years ago > though. > Today, I don't use PMPs anymore, except for some enclosures where it's easy > to just have one cable and where what you describe would need 5 sata cables > to the enclosure, would it not? No. For external JBOD storage you go with an SAS expander unit instead of a PMP. You have a single SFF 8088 cable to the host which carries 4 SAS/SATA channels, up to 2.4 GB/s with 6G interfaces. > (unless you use something like USB3, but that's another interface I've had > my share of driver bug problems with, so it's not a net win either). Yes, USB is a horrible interface for RAID storage. 
>> port SATA HBA and 5 port PMP card gives sub optimal performance, iffy >> compatibility, and low quality, and is ~$130. $1300 vs $1130. Going >> with a cheap SATA HBA and PMP makes no sense. > > I generally agree. Here I was using it to transfer data off some drives, but > indeed I wouldn't use this for a main array. Your original posts left me with the impression that you were using this as a production array. Apologies for not digesting those correctly. ... > Since I get the same speed writing through all the layers as raid5 gets > doing a resync without writes and the other layers, I'm not sure how you're > suggesting that I can get extra performance. You don't get extra performance. You expose the performance you already have. Serial submission typically doesn't reach peak throughput. Both the resync operation and dd copy are serial submitters. You usually must submit asynchronously or in parallel to reach maximum throughput. Being limited by a PMP it may not matter. But with your direct connected drives of your production array you should see a substantial increase in throughput with parallel submission. > Well, unless you mean just raw swraid5 can be made faster with my drives > still. > That is likely possible if I get a better sata card to put in my machine > or find another way to increase cpu to drive throughput. To significantly increase single streaming throughput you need AIO. A faster CPU won't make any difference. Neither will a better SATA card, unless your current one is defective, or limits port throughput will more than one port active--I've heard of couple that do so. >> You said you had pulled the PMP and connected direct to an HBA, bumping >> from 19MB/s to 99MB/s. Did you switch back to the PMP and are now >> getting 100MB/s through the PMP? We should be able to get much higher >> if it's 3/6G SATA, a little higher if it's 1/5G. > > No, I did not. I'm not planning on having my destination array (the one I'm > writing to) behind a PMP for the reasons we discussed above. > The ports are 3MB/s. Obviously I'm not getting the right speed, but I think > there is something wrong with the motherboard of the system this is in, > causing some bus conflicts and slowdowns. > This is something I'll need to investigate outside of this list since it's > not related to raid anymore. Interesting. >>> For now, I put dmcrypt on top of md5, I get 100MB/s raw block write speed (actually >>> just writing a big file in btrfs and going through all the layers) even >>> though it's only using one CPU thread for encryption instead of 2 or more if >>> each disk were encrypted under the md5 layer. >> >> 100MB/s sequential read throughput is very poor for a 5 drive RAID5, >> especially with new 4TB drives which can stream well over 130MB/s each. > > Yes, I totally agree. > >>> As another test >>> gargamel:/mnt/btrfs_pool1# dd if=/dev/md5 of=/dev/null bs=1M count=1024 >>> 1073741824 bytes (1.1 GB) copied, 9.78191 s, 110 MB/s >> >> dd single stream copies are not a valid test of array throughput. This >> tells you only the -minimum- throughput of the array. > > If the array is idle, how is that not a valid block read test? See above WRT asynchronous and parallel submission. >>> So it looks like 100-110MB/s is the read and write speed limit of that array. >> To test real maximum throughput install fio, save and run this job file, >> and post your results. Monitor CPU burn of dmcrypt, using top is fine, >> while running the job to see if it eats all of one core. 
The job runs >> in multiple steps, first creating the eight 1GB test files, then running >> the read/write tests against those files. >> >> [global] >> directory=/some/directory >> zero_buffers >> numjobs=4 >> group_reporting >> blocksize=1024k >> ioengine=libaio >> iodepth=16 >> direct=1 >> size=1g >> >> [read] >> rw=read >> stonewall >> >> [write] >> rw=write >> stonewall > > Yeah, I have fio, didn't seem needed here, but I'll it a shot when I get a > chance. With your setup and its apparent hardware limitations, parallel submission may not reveal any more performance. On the vast majority of systems it does. >>> Thanks for you answers again, >> >> You're welcome. If you wish to wring maximum possible performance from >> this rig I'll stick with ya until we get there. You're not far. Just >> takes some testing and tweaking unless you have a real hardware >> limitation, not a driver setting or firmware issue. > > Thanks for your offer, although to be honest, I think I'm hitting a hardware > problem which I need to look into when I get a chance. Got it. >> BTW, I don't recall you mentioning which HBA and PMP you're using at the >> moment, and whether the PMP is an Addonics card or integrated in a JBOD. >> Nor if you're 1.5/3/6G from HBA through PMP to each drive. > > That PMP is integrated in the jbod, I haven't torn it apart to check which > one it was, but I've pretty much gotten slow speeds from those things and > more importantly PMPs have bugs during drive hangs and retries which can > cause recovery problems and killing swraid5 arrays, so that's why I stopped > using them for serious use. Probably a good call WRT consumer PMP JBODs. > The driver authors know about the issues, and some are in the PMP firmware > and not something they can work around. > >> Post your dmesg output showing the drive link speeds if you would, i.e. >> ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) > > Yep, very familiar with that unfortunately from my PMP debugging days > [ 6.188660] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 6.211533] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 6.444897] ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 6.444918] ata1.01: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 6.445087] ata2.00: SATA link up 6.0 Gbps (SStatus 133 SControl 330) > [ 6.445109] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 14.179297] ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 14.675693] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 15.516390] ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 16.008800] ata12: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 19.339559] ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 0) > [ 19.692273] ata14.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320) > [ 20.705263] ata14.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 21.785956] ata14.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 22.899091] ata14.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 23.935813] ata14.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > > Of course, I'm not getting that speed, but again, I'll look into it. Yeah, something's definitely up with that. All drives are 3G sync, so you 'should' have 300 MB/s data rate through the PMP. > Thanks for your suggestions for tweaks. No problem Marc. Have you noticed the right hand side of my email address? :) I'm kinda like a dog with a bone when it comes to hardware issues. 
Apologies if I've been a bit too tenacious with this. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-28 7:46 ` Stan Hoeppner @ 2014-01-28 16:50 ` Marc MERLIN 2014-01-29 0:56 ` Stan Hoeppner 2014-01-30 20:47 ` Phillip Susi 1 sibling, 1 reply; 41+ messages in thread From: Marc MERLIN @ 2014-01-28 16:50 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Tue, Jan 28, 2014 at 01:46:28AM -0600, Stan Hoeppner wrote: > > Today, I don't use PMPs anymore, except for some enclosures where it's easy > > to just have one cable and where what you describe would need 5 sata cables > > to the enclosure, would it not? > > No. For external JBOD storage you go with an SAS expander unit instead > of a PMP. You have a single SFF 8088 cable to the host which carries 4 > SAS/SATA channels, up to 2.4 GB/s with 6G interfaces. Yeah, I know about those, but I have 5 drives in my enclosures, so that's one short :) > > I generally agree. Here I was using it to transfer data off some drives, but > > indeed I wouldn't use this for a main array. > > Your original posts left me with the impression that you were using this > as a production array. Apologies for not digesting those correctly. I likely wasn't clear, sorry about that. > You don't get extra performance. You expose the performance you already > have. Serial submission typically doesn't reach peak throughput. Both > the resync operation and dd copy are serial submitters. You usually > must submit asynchronously or in parallel to reach maximum throughput. > Being limited by a PMP it may not matter. But with your direct > connected drives of your production array you should see a substantial > increase in throughput with parallel submission. I agree, it should be faster. > >> [global] > >> directory=/some/directory > >> zero_buffers > >> numjobs=4 > >> group_reporting > >> blocksize=1024k > >> ioengine=libaio > >> iodepth=16 > >> direct=1 > >> size=1g > >> > >> [read] > >> rw=read > >> stonewall > >> > >> [write] > >> rw=write > >> stonewall > > > > Yeah, I have fio, didn't seem needed here, but I'll it a shot when I get a > > chance. > > With your setup and its apparent hardware limitations, parallel > submission may not reveal any more performance. On the vast majority of > systems it does. fio said: Run status group 0 (all jobs): READ: io=4096.0MB, aggrb=77695KB/s, minb=77695KB/s, maxb=77695KB/s, mint=53984msec, maxt=53984msec Run status group 1 (all jobs): WRITE: io=4096.0MB, aggrb=77006KB/s, minb=77006KB/s, maxb=77006KB/s, mint=54467msec, maxt=54467msec > > Of course, I'm not getting that speed, but again, I'll look into it. > > Yeah, something's definitely up with that. All drives are 3G sync, so > you 'should' have 300 MB/s data rate through the PMP. Right. > > Thanks for your suggestions for tweaks. > > No problem Marc. Have you noticed the right hand side of my email > address? :) I'm kinda like a dog with a bone when it comes to hardware > issues. Apologies if I've been a bit too tenacious with this. I had not :) I usually try to optimize stuff as much as possible when it's worth it or when I really care and have time. I agree this one is puzzling me a bit and even if it's fast enough for my current needs and the time I have right now, I'll try and move it to another system to see. I'm pretty sure that one system has a weird bottleneck. Cheers, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... 
what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
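When the aggregate number lands this far below the per-drive spec, it is usually informative to watch the members while the fio job runs; assuming the sysstat package is installed:

iostat -x -m 2    # per-device await and %util for the md5 members during the run

Drives sitting at high await with low MB/s point at the path to the disks; low %util across the board points back at the submission side.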
* Re: Very long raid5 init/rebuild times 2014-01-28 16:50 ` Marc MERLIN @ 2014-01-29 0:56 ` Stan Hoeppner 2014-01-29 1:01 ` Marc MERLIN 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-01-29 0:56 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/28/2014 10:50 AM, Marc MERLIN wrote: > On Tue, Jan 28, 2014 at 01:46:28AM -0600, Stan Hoeppner wrote: >>> Today, I don't use PMPs anymore, except for some enclosures where it's easy >>> to just have one cable and where what you describe would need 5 sata cables >>> to the enclosure, would it not? >> >> No. For external JBOD storage you go with an SAS expander unit instead >> of a PMP. You have a single SFF 8088 cable to the host which carries 4 >> SAS/SATA channels, up to 2.4 GB/s with 6G interfaces. > > Yeah, I know about those, but I have 5 drives in my enclosures, so that's > one short :) I think you misunderstood. I was referring to a JBOD chassis with SAS expander, up to 32 drives, typically 12-24 drives with two host or two daisy chain ports. Maybe an example would help here. http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047 Obviously this is in a difference cost category, and not typical for consumer use. Smaller units are available for less $$ but you pay more per drive, as the expander board is the majority of the cost. Steel and plastic are cheap, as are PSUs. >>> I generally agree. Here I was using it to transfer data off some drives, but >>> indeed I wouldn't use this for a main array. >> >> Your original posts left me with the impression that you were using this >> as a production array. Apologies for not digesting those correctly. > > I likely wasn't clear, sorry about that. > >> You don't get extra performance. You expose the performance you already >> have. Serial submission typically doesn't reach peak throughput. Both >> the resync operation and dd copy are serial submitters. You usually >> must submit asynchronously or in parallel to reach maximum throughput. >> Being limited by a PMP it may not matter. But with your direct >> connected drives of your production array you should see a substantial >> increase in throughput with parallel submission. > > I agree, it should be faster. > >>>> [global] >>>> directory=/some/directory >>>> zero_buffers >>>> numjobs=4 >>>> group_reporting >>>> blocksize=1024k >>>> ioengine=libaio >>>> iodepth=16 >>>> direct=1 >>>> size=1g >>>> >>>> [read] >>>> rw=read >>>> stonewall >>>> >>>> [write] >>>> rw=write >>>> stonewall >>> >>> Yeah, I have fio, didn't seem needed here, but I'll it a shot when I get a >>> chance. >> >> With your setup and its apparent hardware limitations, parallel >> submission may not reveal any more performance. On the vast majority of >> systems it does. > > fio said: > Run status group 0 (all jobs): > READ: io=4096.0MB, aggrb=77695KB/s, minb=77695KB/s, maxb=77695KB/s, mint=53984msec, maxt=53984msec > > Run status group 1 (all jobs): > WRITE: io=4096.0MB, aggrb=77006KB/s, minb=77006KB/s, maxb=77006KB/s, mint=54467msec, maxt=54467msec Something is definitely not right if parallel FIO submission is ~25% lower than single submission dd. But you were running your dd tests through buffer cache IIRC. This FIO test uses O_DIRECT. So it's not apples to apples. When testing IO throughput one should also bypass buffer cache. >>> Of course, I'm not getting that speed, but again, I'll look into it. >> >> Yeah, something's definitely up with that. All drives are 3G sync, so >> you 'should' have 300 MB/s data rate through the PMP. > > Right. 
> >>> Thanks for your suggestions for tweaks. >> >> No problem Marc. Have you noticed the right hand side of my email >> address? :) I'm kinda like a dog with a bone when it comes to hardware >> issues. Apologies if I've been a bit too tenacious with this. > > I had not :) I usually try to optimize stuff as much as possible when it's > worth it or when I really care and have time. I agree this one is puzzling > me a bit and even if it's fast enough for my current needs and the time I > have right now, I'll try and move it to another system to see. I'm pretty > sure that one system has a weird bottleneck. Yeah, something definitely not right. Your RAID throughput is less than a single 7.2K SATA drive. It's probably just something funky with that JBOD chassis. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
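As a quick cross-check, dd itself can be told to bypass the page cache so the buffered-versus-O_DIRECT difference drops out of the comparison; for example (the test file name is a placeholder):

dd if=/dev/md5 of=/dev/null bs=1M count=4096 iflag=direct
dd if=/dev/zero of=/mnt/btrfs_pool1/ddtest bs=1M count=4096 oflag=direct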
* Re: Very long raid5 init/rebuild times 2014-01-29 0:56 ` Stan Hoeppner @ 2014-01-29 1:01 ` Marc MERLIN 0 siblings, 0 replies; 41+ messages in thread From: Marc MERLIN @ 2014-01-29 1:01 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Tue, Jan 28, 2014 at 06:56:32PM -0600, Stan Hoeppner wrote: > On 1/28/2014 10:50 AM, Marc MERLIN wrote: > > On Tue, Jan 28, 2014 at 01:46:28AM -0600, Stan Hoeppner wrote: > >>> Today, I don't use PMPs anymore, except for some enclosures where it's easy > >>> to just have one cable and where what you describe would need 5 sata cables > >>> to the enclosure, would it not? > >> > >> No. For external JBOD storage you go with an SAS expander unit instead > >> of a PMP. You have a single SFF 8088 cable to the host which carries 4 > >> SAS/SATA channels, up to 2.4 GB/s with 6G interfaces. > > > > Yeah, I know about those, but I have 5 drives in my enclosures, so that's > > one short :) > > I think you misunderstood. I was referring to a JBOD chassis with SAS > expander, up to 32 drives, typically 12-24 drives with two host or two > daisy chain ports. Maybe an example would help here. > > http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047 Ah, yes, that. So indeed in the price category of a PMP chassis with 5 drives ($150-ish), I haven't found anything that isn't PMP or 5 sata cables. > Yeah, something definitely not right. Your RAID throughput is less than > a single 7.2K SATA drive. It's probably just something funky with that > JBOD chassis. That's also possible. If/when I have time, I'll unplug things and plug the drives directly to the card as well as try another MB. Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
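Before re-cabling, it may be worth confirming what each drive itself negotiated end to end; recent smartmontools print both the drive's maximum and current link rate, and libata exposes the same through sysfs on reasonably recent kernels (device name is a placeholder):

smartctl -i /dev/sdm | grep -i 'SATA Version'
cat /sys/class/ata_link/link*/sata_spd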
* Re: Very long raid5 init/rebuild times 2014-01-28 7:46 ` Stan Hoeppner 2014-01-28 16:50 ` Marc MERLIN @ 2014-01-30 20:47 ` Phillip Susi 2014-02-01 22:39 ` Stan Hoeppner 1 sibling, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-01-30 20:47 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 1/28/2014 2:46 AM, Stan Hoeppner wrote: > You usually must submit asynchronously or in parallel to reach > maximum throughput. Being limited by a PMP it may not matter. But > with your direct connected drives of your production array you > should see a substantial increase in throughput with parallel > submission. Not for streaming IO; you just need to make sure your cache is big enough so the drive is never waiting for the app. > To significantly increase single streaming throughput you need AIO. > A faster CPU won't make any difference. Neither will a better SATA > card, unless your current one is defective, or limits port > throughput will more than one port active--I've heard of couple > that do so. What AIO gets you is the ability to use O_DIRECT to avoid a memory copy to/from the kernel page cache. That saves you some cpu time, but doesn't make *that* much difference unless you have a crazy fast storage array, or crazy slow ram. And since almost nobody uses it, it's a bit of an unrealistic benchmark. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS6rpiAAoJEI5FoCIzSKrw6oMH/jDcFOs8Gu2wjbbuE1eoGtG7 aHeUvF6klWWV5VWCVBd4tHieVkj1zyg3nQa3DGaOvqBnz6mtIQUx6Pg5MgYkJAhD EY1f3zVH+hxBEyJwwmMIDIyVsDCbdsryKndfPuYolaqNSgXLyWpAcL6g/SM9vjoG nH29w1GC3TJP5Py1DNP4P04Q+kJMTYnY/4AFJOtsMRK5XRpno784YZauS/basEH3 rpSf/JvhcZMbk6nE8jkqIYnMbA35E8f+GfSa60epqDSSM3hU5U1xYnh6vCZSSndK pMCFv26O9AVoFdyPZTJwM32gqGXdsGkDanK2+0y/j2im5IT0PxKCWO+uCLO/1mQ= =9NYg -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-30 20:47 ` Phillip Susi @ 2014-02-01 22:39 ` Stan Hoeppner 2014-02-02 18:53 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-01 22:39 UTC (permalink / raw) To: Phillip Susi, Marc MERLIN; +Cc: linux-raid On 1/30/2014 2:47 PM, Phillip Susi wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 1/28/2014 2:46 AM, Stan Hoeppner wrote: >> You usually must submit asynchronously or in parallel to reach >> maximum throughput. Being limited by a PMP it may not matter. But >> with your direct connected drives of your production array you >> should see a substantial increase in throughput with parallel >> submission. > > Not for streaming IO; you just need to make sure your cache is big > enough so the drive is never waiting for the app. > >> To significantly increase single streaming throughput you need AIO. >> A faster CPU won't make any difference. Neither will a better SATA >> card, unless your current one is defective, or limits port >> throughput will more than one port active--I've heard of couple >> that do so. > > What AIO gets you is the ability to use O_DIRECT to avoid a memory > copy to/from the kernel page cache. That saves you some cpu time, but > doesn't make *that* much difference unless you have a crazy fast > storage array, or crazy slow ram. And since almost nobody uses it, > it's a bit of an unrealistic benchmark. Phillip, you seem to be arguing application performance. For an application/data type that doesn't need fsync you'd be correct. However, the purpose of this exchange has been to determine the maximum hardware throughput of the OP's array. It's not possible to accurately measure IO throughput doing buffered writes. Thus O_DIRECT is needed. But, as I already stated, because O_DIRECT is synchronous with significant completion latency, regardless of CPU/RAM speed, a single write stream typically won't saturate the storage. Thus, one needs to use either use AIO or parallel submission, or possibly both, to saturate the storage. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-01 22:39 ` Stan Hoeppner @ 2014-02-02 18:53 ` Phillip Susi 2014-02-03 6:34 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-02 18:53 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 On 02/01/2014 05:39 PM, Stan Hoeppner wrote: > Phillip, you seem to be arguing application performance. For an > application/data type that doesn't need fsync you'd be correct. > > However, the purpose of this exchange has been to determine the > maximum hardware throughput of the OP's array. It's not possible > to accurately measure IO throughput doing buffered writes. Thus > O_DIRECT is needed. No, you can get there just fine with buffered IO as well, unless you have an obscenely fast array and very slow ram, the overhead of buffering won't really matter for IO throughput ( just the extra cpu ). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCgAGBQJS7pRFAAoJEI5FoCIzSKrw/5AH/2EUXwPjdaqKyg9oD5igX92p VLLjCCDJyzd2MaC3u4DPpQUtzIXPEkjuKT6hBMNwx/+gFnODXQjssZ3siXVgc2Mi JAGYwRWYbLDYQscKagYyQFDiNjg5b1zA/KEjKYTO5hpkFQELDgE115PPn6NhV/Q9 r/vjzEtvDLjuN5Y7uQDpZv87bEy2O7aJX1TFygPN0MazJo6O93yFfweUwI2JUm1H q1DL96Evd/CCfzhPWiPnWrP4MpuUG7B1OKjwfXnlw5tW1tj7wRA6tRnyuCeAHCfb 1/wvxXm426rxflSNa85sP///amh8mnccN+4ZcOYEql4XSt8qh8G2sdH35Kp3og8= =BnZb -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-02 18:53 ` Phillip Susi @ 2014-02-03 6:34 ` Stan Hoeppner 2014-02-03 14:42 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-03 6:34 UTC (permalink / raw) To: Phillip Susi, Marc MERLIN; +Cc: linux-raid On 2/2/2014 12:53 PM, Phillip Susi wrote: > On 02/01/2014 05:39 PM, Stan Hoeppner wrote: >> It's not possible >> to accurately measure IO throughput doing buffered writes. Thus >> O_DIRECT is needed. > No, you can get there just fine with buffered IO as well, unless you > have an obscenely fast array and very slow ram, the overhead of > buffering won't really matter for IO throughput ( just the extra cpu ). Please reread my statement above. Now let me restate that as: Measuring disk throughput when writing through the buffer cache isn't a measurement of disk throughput as much as it is a measurement of cache throughput. Thus, such measurements do not demonstrate actual disk throughput. Do you disagree? -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-03 6:34 ` Stan Hoeppner @ 2014-02-03 14:42 ` Phillip Susi 2014-02-04 3:30 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-03 14:42 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2/3/2014 1:34 AM, Stan Hoeppner wrote: > Please reread my statement above. Now let me restate that as: > > Measuring disk throughput when writing through the buffer cache > isn't a measurement of disk throughput as much as it is a > measurement of cache throughput. Thus, such measurements do not > demonstrate actual disk throughput. > > Do you disagree? Yes, I do because cache throughput is >>>> disk throughput. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS76rMAAoJEI5FoCIzSKrw+1oH/1xdPTZwoaw4MKQrwQV22sgM zu1BXTs1+/wEjyxJdwr9Rpa6W/aLlaMYmoriNWVXG+MLm2aemrGq4nHD5i3GhESU T1R65IY92fVPqCAYNUjYQftGryYcZjWxiNurHI4/Tt0BH4hPn0Ol34xwuTE7/mg7 ozn7mzqYFxJltomRjtARuXulJz4DW5p0tKjsgBRnqAqXyywU/bEC5fpb0xuqhNWK xm4asE9FPJxCxV8QuqQUwehU+IAQ3ObqgDIsdcCX0wSZurDQCbPBfbW2OuYpSBSd QOrjulMciutBxIyejA1OC20z9DEPhoWfJEn2HSUrMEMgpvuz/+q7ybBkCcHDs5Y= =4ti3 -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-03 14:42 ` Phillip Susi @ 2014-02-04 3:30 ` Stan Hoeppner 2014-02-04 17:59 ` Larry Fenske 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-04 3:30 UTC (permalink / raw) To: Phillip Susi; +Cc: linux-raid On 2/3/2014 8:42 AM, Phillip Susi wrote: > On 2/3/2014 1:34 AM, Stan Hoeppner wrote: >> Please reread my statement above. Now let me restate that as: >> >> Measuring disk throughput when writing through the buffer cache >> isn't a measurement of disk throughput as much as it is a >> measurement of cache throughput. Thus, such measurements do not >> demonstrate actual disk throughput. >> >> Do you disagree? > > Yes, I do because cache throughput is >>>> disk throughput. It is because buffer cache throughput is greater that measurements of disk throughput are not accurate. If one issues a sync after writing through buffer cache the measured throughput should be fairly close. But without issuing a sync you're measuring buffer cache throughput. Thus, as I said previously, it is better to do parallel O_DIRECT writes or use AIO with O_DIRECT for testing disk throughput as one doesn't have to worry about these buffer cache issues. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
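In dd terms, "issuing a sync" can be folded into the measurement itself so the reported rate includes the time to flush the cache out to the disks; for example (path is a placeholder):

dd if=/dev/zero of=/mnt/btrfs_pool1/ddtest bs=1M count=4096 conv=fdatasync

Without conv=fdatasync (or oflag=direct), dd reports how fast the page cache absorbed the data, which is the cache-throughput number being objected to above.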
* Re: Very long raid5 init/rebuild times 2014-02-04 3:30 ` Stan Hoeppner @ 2014-02-04 17:59 ` Larry Fenske 2014-02-04 18:08 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Larry Fenske @ 2014-02-04 17:59 UTC (permalink / raw) To: stan, Phillip Susi; +Cc: linux-raid On 02/03/2014 08:30 PM, Stan Hoeppner wrote: > On 2/3/2014 8:42 AM, Phillip Susi wrote: > >> On 2/3/2014 1:34 AM, Stan Hoeppner wrote: >>> Please reread my statement above. Now let me restate that as: >>> >>> Measuring disk throughput when writing through the buffer cache >>> isn't a measurement of disk throughput as much as it is a >>> measurement of cache throughput. Thus, such measurements do not >>> demonstrate actual disk throughput. >>> >>> Do you disagree? >> Yes, I do because cache throughput is >>>> disk throughput. > It is because buffer cache throughput is greater that measurements of > disk throughput are not accurate. If one issues a sync after writing > through buffer cache the measured throughput should be fairly close. > But without issuing a sync you're measuring buffer cache throughput. > > Thus, as I said previously, it is better to do parallel O_DIRECT writes > or use AIO with O_DIRECT for testing disk throughput as one doesn't have > to worry about these buffer cache issues. > Perhaps Phillip is doing the obvious and only measuring throughput after the cache is full. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 17:59 ` Larry Fenske @ 2014-02-04 18:08 ` Phillip Susi 2014-02-04 18:43 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-04 18:08 UTC (permalink / raw) To: Larry Fenske, stan; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2/4/2014 12:59 PM, Larry Fenske wrote: > Perhaps Phillip is doing the obvious and only measuring throughput > after the cache is full. Or the obvious and dropping the cache first, the way hdparm -t does. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS8SygAAoJEI5FoCIzSKrwdesH/3ZuonrTN58MTmKalrdbmrvO P9UJCY90Nzv57TBm6POzipDKe7cdBjBRSr97DU2Ea8yOqxTo9ErbbS2prUmDeC04 RrTUJnWw5eP5Zrt2TT4tnUJbKmmhxbXMxTPz8ZrzaV/2PJzh2PWbj5HjGgceyCVS 2V7iuMJfpPvL/EiiTm32gXVAp9FlWtOpiKdBg+eaD4UfLemYMObRScbhmS0+1XYh lQ7Ce7RUE+y0zkgaeLSBDTpyUL+mVF6l9C2q0dxK8U2mhUChQl7RX81yk3ePD6DD w+SzG+UaQjdNM9jbldHKdXSYT04u/ZX9KmZszFQGPRGvvw9tMxihjUmmawN9GWQ= =2/Lt -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 18:08 ` Phillip Susi @ 2014-02-04 18:43 ` Stan Hoeppner 2014-02-04 18:55 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-04 18:43 UTC (permalink / raw) To: Phillip Susi, Larry Fenske; +Cc: linux-raid On 2/4/2014 12:08 PM, Phillip Susi wrote: > Or the obvious and dropping the cache first, the way hdparm -t does. "hdparm -t" is a read test. Everything we've been discussing has been about maximizing write throughput. The fact that you argue this at this point makes it crystal clear that you have no understanding of the differences in the read/write paths and how buffer cache affects each differently. Further discussion is thus pointless. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 18:43 ` Stan Hoeppner @ 2014-02-04 18:55 ` Phillip Susi 2014-02-04 19:15 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-04 18:55 UTC (permalink / raw) To: stan, Larry Fenske; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2/4/2014 1:43 PM, Stan Hoeppner wrote: > Everything we've been discussing has been about maximizing write > throughput. The fact that you argue this at this point makes it > crystal clear that you don't have no understanding of the > differences in the read/write paths and how buffer cache affects > each differently. Further discussion is thus pointless. I am intimately familiar with the two code paths, having written several applications using them, studied the kernel code extensively, and been one of the original strong advocates for the kernel to grow direct aio apis in the first place, since it worked swimmingly well on WinNT. So I say again: switching to direct aio, while saving a decent chunk of cpu time, makes very little difference in streaming write throughput. If it did, there would be something terribly broken with the buffer cache if it couldn't keep the disk queues full. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS8TeVAAoJEI5FoCIzSKrwMJ0IAJB3+mGIVXJ+qMCHwSGjFI7G dNJp8/0NFdy42Eww1Yu3EOBRPim4sDxmRE6bqRX9Ytbb5jqnlr22c/fjcUmqH3wr fO7qqj2T6FiaFgrudFNukCAqRiCWTS3nkxzrAs5HV1PBukJtAXugQEBYEtHcVZ7l EoeTu16N70RMnywK0vbHx7Gqx9AOps9xe6qyStN7KptgGbkX/b0OkDLRjSedLput qQNyLA8/kuoGfVvswSzKqneK/CC0GAdbdQt0rP0hC3Icsh2qKQZLdsAwKgL3L0f6 zyALvUBvuRSD6ZQW8VdNI+i4BCyeYSCwZT/5pPXI5AtRZtIUkymQkZtW7cjNKpM= =ILJS -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 18:55 ` Phillip Susi @ 2014-02-04 19:15 ` Stan Hoeppner 2014-02-04 20:16 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-04 19:15 UTC (permalink / raw) To: Phillip Susi, Larry Fenske; +Cc: linux-raid On 2/4/2014 12:55 PM, Phillip Susi wrote: > On 2/4/2014 1:43 PM, Stan Hoeppner wrote: >> Everything we've been discussing has been about maximizing write >> throughput. The fact that you argue this at this point makes it >> crystal clear that you have no understanding of the >> differences in the read/write paths and how buffer cache affects >> each differently. Further discussion is thus pointless. > > I am intimately familiar with the two code paths, having written > several applications using them, studied the kernel code extensively, > and been one of the original strong advocates for the kernel to grow > direct aio apis in the first place, since it worked swimmingly well on > WinNT. > > So I say again: switching to direct aio, while saving a decent chunk > of cpu time, makes very little difference in streaming write > throughput. If it did, there would be something terribly broken with > the buffer cache if it couldn't keep the disk queues full. If all this is true, then why do you keep making tangential arguments that are not relevant? I never argued that the buffer cache path is slower. It is in fact much faster in most cases. I argued that accurately measuring the actual data throughput at the disks isn't possible when writing through buffer cache. At least not in a straightforward manner as with O_DIRECT. I've made the point in the last two or three replies. Yet instead of directly addressing that, rebutting that, you keep making these tangential irrelevant arguments... -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 19:15 ` Stan Hoeppner @ 2014-02-04 20:16 ` Phillip Susi 2014-02-04 21:58 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-04 20:16 UTC (permalink / raw) To: stan, Larry Fenske; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2/4/2014 2:15 PM, Stan Hoeppner wrote: > I never argued that the buffer cache path is slower. It is in fact > much faster in most cases. > > I argued that accurately measuring the actual data throughput at > the disks isn't possible when writing through buffer cache. At > least not in a straightforward manner as with O_DIRECT. I've made > the point in the last two or three replies. Yet instead of > directly addressing that, rebutting that, you keep making these > tangential irrelevant arguments... You originally said "To significantly increase single streaming throughput you need AIO." Now you appear to be saying otherwise. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS8UqkAAoJEI5FoCIzSKrwIMYH/AvTA9Z1+SGarMym9hy1DPis czb3W4+MVeisH2QwyfTbMPkicOw4pffa3Hc9ZLLI0yvGnU8b6XbFvG+2sWYBtqhj HQX1Osjy0ZP7GuVU5TtydbNNXba4f+iIm6FIpzX3eseAjZgBJeDeG2s0oePw8q/d b+P0PAZSqA99CNNpqOw7GTnYZqh++SM9CYPmr7KC4LYFyaqklj3eS0XQPDT+rbej K1ly2ZibE348Nol/A6gT63x6WnuMFj4jAUK40O//farqbjsDTJgWaz9x0aV58c6x Uxsrzx9a92tZC5AL0BW6dqLeY0BMZ3j9hqF/51nz5MrGtzm7qninrsSp1hJJRuM= =Ls3I -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 20:16 ` Phillip Susi @ 2014-02-04 21:58 ` Stan Hoeppner 2014-02-05 1:19 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-04 21:58 UTC (permalink / raw) To: Phillip Susi, Larry Fenske; +Cc: linux-raid On 2/4/2014 2:16 PM, Phillip Susi wrote: > On 2/4/2014 2:15 PM, Stan Hoeppner wrote: >> I never argued that the buffer cache path is slower. It is in fact >> much faster in most cases. >> >> I argued that accurately measuring the actual data throughput at >> the disks isn't possible when writing through buffer cache. At >> least not in a straightforward manner as with O_DIRECT. I've made >> the point in the last two or three replies. Yet instead of >> directly addressing that, rebutting that, you keep making these >> tangential irrelevant arguments... > > You originally said "To significantly increase single streaming > throughput you need AIO." Yes, I stated that in this post http://www.spinics.net/lists/raid/msg45726.html in the context of achieving greater throughput with an FIO job file configured to use O_DIRECT, a job file I created, that the OP was using for testing. That job file is quoted further down in this same post, and is included in my posts prior this one in the thread. Apparently you ignored them. The context of my comment above is clearly established multiple times earlier in the thread. In my paragraph directly preceding the statement you quote above, I stated this: "Serial submission typically doesn't reach peak throughput... You usually must submit asynchronously or in parallel to reach maximum throughput." And again this is in the context of the FIO job file using O_DIRECT, and this statement is factual. As I repeated earlier today, O_DIRECT is used because measuring actual throughput at the disks is straightforward. To increase O_DIRECT write throughput in FIO, you typically need parallel submission or AIO. This is well known. > Now you appear to be saying otherwise. No, I have not contradicted myself Phillip. I've stated the same thing until becoming blue in the face, and it's quite frustrating. The fact of the matter is that you took a single sentence out of context in a very long thread, and attacked it, attacked me, as being wrong, when in fact I am, and have been, correct throughout. Context matters, always. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
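The effect being described is easy to reproduce from the command line by running the same O_DIRECT write once through a synchronous engine and once through libaio with a deeper queue; file name and size are arbitrary:

fio --name=directsync --filename=/mnt/btrfs_pool1/fiotest --rw=write --bs=1M --size=1g --direct=1 --ioengine=sync
fio --name=directaio --filename=/mnt/btrfs_pool1/fiotest --rw=write --bs=1M --size=1g --direct=1 --ioengine=libaio --iodepth=16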
* Re: Very long raid5 init/rebuild times 2014-02-04 21:58 ` Stan Hoeppner @ 2014-02-05 1:19 ` Phillip Susi 2014-02-05 1:42 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-05 1:19 UTC (permalink / raw) To: stan, Larry Fenske; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 On 02/04/2014 04:58 PM, Stan Hoeppner wrote: > Yes, I stated that in this post > > http://www.spinics.net/lists/raid/msg45726.html > > in the context of achieving greater throughput with an FIO job > file configured to use O_DIRECT, a job file I created, that the OP > was using for testing. That job file is quoted further down in > this same post, and is included in my posts prior this one in the > thread. Apparently you ignored them. The context of my comment > above is clearly established multiple times earlier in the thread. > > In my paragraph directly preceding the statement you quote above, > I stated this: > > "Serial submission typically doesn't reach peak throughput... You > usually must submit asynchronously or in parallel to reach maximum > throughput." > > And again this is in the context of the FIO job file using > O_DIRECT, and this statement is factual. As I repeated earlier > today, O_DIRECT is used because measuring actual throughput at the > disks is straightforward. To increase O_DIRECT write throughput in > FIO, you typically need parallel submission or AIO. This is well > known. Ahh, I did not gather that O_DIRECT was already assumed. In that case, then I was simply restating the same thing: that you want aio with O_DIRECT, but otherwise, buffered IO works fine too ( which is what the OP was using with dd, which is why it sounded like you were saying not to do that, that you must use O_DIRECT + aio because buffered IO won't get you the performance you're looking for ). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCgAGBQJS8ZGfAAoJEI5FoCIzSKrw43gH/2sUIEVB97YOGpTj5H8XkySb wJAQxU//LyZRcUiK37TNeIF+6QUfqVtD/VFYxjTFfV8gmLSmu7JzwfMZQjJ2Rrb5 I08Pks2xCrU/XvfLKqum5JQHreJaz8jQQVIByXAziDAj+H46k5NV34rUNDP5glyk 18uKN1ty0//jyKNlzWhRZllw7Uo7CAvJvfTHSxvoTGgTmzeea2Q6eADIv0Ov96Lb ZeNKnZXTwDyIXskEduDToWQdGL01TYSKXiV8zTqnhMsMBUZ33oE7r5l+a/o/m6Kv ZKWE+JG/5xzZiFipNj1ELYuPwM/SD6cCPBRfwh2tWmKTG3Z/waD+kjytIwieUDY= =1T19 -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-05 1:19 ` Phillip Susi @ 2014-02-05 1:42 ` Stan Hoeppner 0 siblings, 0 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-02-05 1:42 UTC (permalink / raw) To: Phillip Susi, Larry Fenske; +Cc: linux-raid On 2/4/2014 7:19 PM, Phillip Susi wrote: > On 02/04/2014 04:58 PM, Stan Hoeppner wrote: >> Yes, I stated that in this post >> >> http://www.spinics.net/lists/raid/msg45726.html >> >> in the context of achieving greater throughput with an FIO job >> file configured to use O_DIRECT, a job file I created, that the OP >> was using for testing. That job file is quoted further down in >> this same post, and is included in my posts prior this one in the >> thread. Apparently you ignored them. The context of my comment >> above is clearly established multiple times earlier in the thread. >> >> In my paragraph directly preceding the statement you quote above, >> I stated this: >> >> "Serial submission typically doesn't reach peak throughput... You >> usually must submit asynchronously or in parallel to reach maximum >> throughput." >> >> And again this is in the context of the FIO job file using >> O_DIRECT, and this statement is factual. As I repeated earlier >> today, O_DIRECT is used because measuring actual throughput at the >> disks is straightforward. To increase O_DIRECT write throughput in >> FIO, you typically need parallel submission or AIO. This is well >> known. > > Ahh, I did not gather that O_DIRECT was already assumed. In that > case, then I was simply restating the same thing: that you want aio > with O_DIRECT, but otherwise, buffered IO works fine too ( which is > what the OP was using with dd, which is why it sounded like you were > saying not to do that, that you must use O_DIRECT + aio because > buffered IO won't get you the performance you're looking for ). I guess maybe I wasn't clear with my wording at that time. Yes, IIRC he was doing dd through buffer cache. My point to him was that O_DIRECT dd gives more accurate throughput numbers, but a single stream may not be sufficient to peak the disks. Which is why I recommended FIO and provided a job file, as it can do multiple O_DIRECT streams. AIO can reduce dispatch latency thus increasing throughput, but not to the extent that multiple streams will, as the latter can fully overlap the single stream dispatch latency, keeping the pipeline full. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-24 5:13 ` Stan Hoeppner 2014-01-25 8:36 ` Marc MERLIN @ 2014-01-30 20:36 ` Phillip Susi 1 sibling, 0 replies; 41+ messages in thread From: Phillip Susi @ 2014-01-30 20:36 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 1/24/2014 12:13 AM, Stan Hoeppner wrote: > The initial resync is read-only. It won't modify anything unless > there's a discrepancy. So the stripe cache isn't in play. The > larger stripe cache should indeed increase rebuild rate though. What? That makes no sense. It doesn't really take any longer to write the parity than to read it, and the odds are pretty good that it is going to need to write it anyhow, so doing a read first would waste a *lot* of time. I think you are thinking of when you manually issue a check_repair on the array after it is fully initialized. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS6rexAAoJEI5FoCIzSKrwm8sIAJ93OT39nxYau3//33skn0lW XOT0z+EyBvAMBHMz4x36GJilHsd9gNwRLSETHu3sSTFve+0hkTWPRfLFt+OMkX2S 30hZnVWh+fd02enTE3uw6kaCcuU709hKDwCUOf1wQhm3bUeJGIOTRrkqnGtKpncR 9qxENIdhcRrMIkh1F1dmJOZvehlGc6doa1ddodM8QfSESlQtTu8N9nyxAbVtLf5S lGebMF+dsf3DEK1UXn7RUus8IE28DifWQnMlfUPtI7u2A50cjADQh9mu1moJdDdo HVgz/Y5sveYq5KfRMco4cNVGQiyR3t1LYmZQFTcAcCUqbQ6PctXp3pCEfogLYTw= =zlqj -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
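For reference, those manual passes are driven through sysfs on an assembled array; a read-only check versus a repair, assuming the array is md5:

echo check > /sys/block/md5/md/sync_action
cat /sys/block/md5/md/mismatch_cnt    # non-zero after the check means parity discrepancies were found
echo repair > /sys/block/md5/md/sync_action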
* Re: Very long raid5 init/rebuild times 2014-01-23 12:24 ` Stan Hoeppner 2014-01-23 21:01 ` Marc MERLIN @ 2014-01-30 20:18 ` Phillip Susi 1 sibling, 0 replies; 41+ messages in thread From: Phillip Susi @ 2014-01-30 20:18 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 1/23/2014 7:24 AM, Stan Hoeppner wrote: > Increasing stripe_cache_size above the default as I suggested will > ALWAYS increase write speed, often by a factor of 2-3x or more on > modern hardware. It should speed up destructive resyncs > considerably, as well as normal write IO. Once your array has > settled down after the inits and resyncs and what not, run some > parallel FIO write tests with the default of 256 and then with > 2048. You can try 4096 as well, but with 5 rusty drives 4096 will > probably cause a slight tailing off of throughput. 2048 should be > your sweet spot. You can also just time a few large parallel file > copies. You'll be amazed at the gains. I have never seen it make a difference on destructive syncs on 3 or 4 disk raid5 arrays and when you think about it, that makes perfect sense. The point of the cache is to coalesce small random writes into full stripes so you don't have to do a RMW cycle, so for streaming IO, it isn't going to do a thing. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS6rOcAAoJEI5FoCIzSKrwz0YIAJMJG6C0aePvKfXJlQy1mmbp AaFHQX8MgIgj3kgiNj8s83uWEdVZfzaQpcc3oJcB0PD2FTHxt+204e8C2wZz0b5N 4zZim1YRec67LTRkLwNeko5HrkiapmWf0FYmx95d3gNvb0UGUbh98hRItgSX78NS Lu+afQWOLCqiv3UjMHpG4Blb37oT0cp2pttuGKbDZTS4OSgd/qWRvcQ5sRQ09338 n40EiCWaIIlWlSQJ0r6GUylTOiys+JziB+qcHK6SvK+9gF3VdYN955GXNKxMn7t1 A3rFbeVsC0MSRcMytVmMk3cFQDn+xhZ1ZvHvkbLagj9i0uLcC+1cR6KIeScdQC4= =iFoR -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
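The tunable being debated is cheap to experiment with; assuming md5, reading the default and raising it looks like the following, at a memory cost of roughly stripe_cache_size x page size x number of member devices:

cat /sys/block/md5/md/stripe_cache_size     # default is 256
echo 2048 > /sys/block/md5/md/stripe_cache_size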
* Re: Opal 2.0 SEDs on linux, was: Very long raid5 init/rebuild times 2014-01-22 7:55 ` Stan Hoeppner 2014-01-22 17:48 ` Marc MERLIN @ 2014-01-22 19:38 ` Chris Murphy 1 sibling, 0 replies; 41+ messages in thread From: Chris Murphy @ 2014-01-22 19:38 UTC (permalink / raw) To: linux-raid On Jan 22, 2014, at 12:55 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote: > > You are not CPU bound, nor hardware bandwidth bound. You are latency > bound, just like every dmcrypt user. dmcrypt adds a non trivial amount > of latency to every IO. Latency with serial IO equals low throughput. A self-encrypting drive (a drive that always & only writes ciphertext to the platters) I'd expect totally eliminates this. It's difficult finding information whether and how these drives are configurable purely for data drives (non-bootable) which ought to be easier to implement than the boot use case. For booting, it appears necessary to have computer firmware that explicitly supports the TCG OPAL spec: an EFI application might be sufficient for unlocking the drive(s) at boot time, but I'm not certain about that. It seems necessary for hibernate support that firmware and kernel support is required. Chris Murphy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-21 7:35 Very long raid5 init/rebuild times Marc MERLIN 2014-01-21 16:37 ` Marc MERLIN @ 2014-01-21 18:31 ` Chris Murphy 2014-01-22 13:46 ` Ethan Wilson 2 siblings, 0 replies; 41+ messages in thread From: Chris Murphy @ 2014-01-21 18:31 UTC (permalink / raw) To: linux-raid On Jan 21, 2014, at 12:35 AM, Marc MERLIN <marc@merlins.org> wrote: > Howdy, > > I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt. > > Question #1: > Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite > (raid5 first, and then dmcrypt) I'm pretty sure there are multithreading patches for dmcrypt, but I don't know what versions it's accepted into, and that's the question you need to answer. If the version you're using still only supports one encryption thread per block device, then you want to dmcrypt 5 drives, then create md on top of that. This way you get 5 threads doing encryption rather than 1 thread and hence one core trying to encrypt for 5 drives. > I suppose that 1 day-ish rebuild times are kind of a given with 4TB drives anyway? 4000000MB / 135MB/s = 8.23 hours. So 24 hours seems like a long time and I'd wonder what the bottleneck is. > Question #3: > Since I'm going to put btrfs on top, I'm almost tempted to skip the md raid5 > layer and just use the native support, but the raid code in btrfs still > seems a bit younger than I'm comfortable with. > Is anyone using it and has done disk failures, replaces, and all? Like I mentioned on linux-btrfs@ it's there for testing. If you're prepared to use it with the intent of breaking it and reporting breakage, then great, that's useful. If you're just being impatient, expect to get bitten. And it's undergoing quite a bit of changes in btrfs-next. I'm not sure if those changes will appear in 3.14 or 3.15 Chris Murphy ^ permalink raw reply [flat|nested] 41+ messages in thread
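A rough sketch of that crypt-below-md ordering, assuming the five partitions have already been luksFormat-ed and with the mapper names chosen arbitrarily:

for d in m n o p q; do cryptsetup luksOpen /dev/sd${d}1 crypt_sd${d}; done
mdadm --create /dev/md5 --level=5 --raid-devices=5 /dev/mapper/crypt_sd[mnopq]

Each mapping then gets its own encryption thread, at the cost of entering (or scripting) the passphrase five times.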
* Re: Very long raid5 init/rebuild times
From: Ethan Wilson @ 2014-01-22 13:46 UTC (permalink / raw)
To: linux-raid

On 21/01/2014 08:35, Marc MERLIN wrote:
> Howdy,
>
> I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt.
>
> Question #1:
> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite
> (raid5 first, and then dmcrypt)

Crypt above (dmcrypt on top of the md array), or you will need to enter
the password 5 times. Array checks and rebuilds would also be slower.
And when working at a low level with mdadm commands, it would be too
easy to get confused and specify an underlying volume instead of the
one above LUKS, wiping all data as a result. A sketch of the "crypt
above" layering follows this message.

> Question #2:
> In order to copy data from a working system, I connected the drives via an external
> enclosure which uses a SATA PMP. As a result, things are slow:
>
> md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0]
>       15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_]
>       [>....................]  recovery =  0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec
>       bitmap: 0/30 pages [0KB], 65536KB chunk
>
> 2.5 days for an init or rebuild is going to be painful.
> I already checked that I'm not CPU/dmcrpyt pegged.
>
> I read Neil's message why init is still required:
> http://marc.info/?l=linux-raid&m=112044009718483&w=2
> even if somehow on brand new blank drives full of 0s I'm thinking this could be faster
> by just assuming the array is clean (all 0s give a parity of 0).
> Is it really unsafe to do so? (actually if you do this on top of dmcrypt
> like I did here, I won't get 0s, so that way around, it's unfortunately
> necessary).

Yes, it is unsafe, because raid5 does shortcut read-modify-write (RMW):
it uses the current (wrong) parity to compute the new parity, which
therefore also comes out wrong. The parities of your array will never
be correct, so you won't be able to withstand a disk failure. You need
to do the initial init/rebuild. You can start writing to the array
right away, but keep in mind that such data will only be safe after the
first init/rebuild has completed.

> I suppose that 1 day-ish rebuild times are kind of a given with 4TB drives anyway?

I think around 13 hours if your connections to the disks are fast.

> Question #3:
> Since I'm going to put btrfs on top, I'm almost tempted to skip the md raid5
> layer and just use the native support, but the raid code in btrfs still
> seems a bit younger than I'm comfortable with.

Native btrfs raid5 is WAY experimental at this stage. Only raid0/1/10
is reasonably stable right now.

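A minimal sketch of the "crypt above" ordering Ethan recommends, i.e.
a single LUKS container on top of the finished array rather than five
containers below it. The device names, chunk size, and mapping name are
illustrative assumptions, not a recipe given in the thread:

  # build the raid5 from the raw partitions first ...
  mdadm --create /dev/md5 --level=5 --raid-devices=5 --chunk=512 \
      /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1
  # ... then one LUKS layer on the md device, unlocked once with one password
  cryptsetup luksFormat /dev/md5
  cryptsetup luksOpen /dev/md5 md5crypt
  # the filesystem goes on the opened mapping
  mkfs.btrfs /dev/mapper/md5crypt

With this ordering, mdadm only ever sees the raw partitions and
/dev/md5, so there is no per-drive encrypted device to mistake for its
plaintext counterpart, and a resync reads and writes ciphertext without
passing through dm-crypt at all.
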
End of thread. Thread overview: 41+ messages (newest: 2014-02-05 1:42 UTC)

2014-01-21  7:35 Very long raid5 init/rebuild times Marc MERLIN
2014-01-21 16:37 ` Marc MERLIN
2014-01-21 17:08 ` Mark Knecht
2014-01-21 18:42 ` Chris Murphy
2014-01-22  7:55 ` Stan Hoeppner
2014-01-22 17:48 ` Marc MERLIN
2014-01-22 23:17 ` Stan Hoeppner
2014-01-23 14:28 ` John Stoffel
2014-01-24  1:02 ` Stan Hoeppner
2014-01-24  3:07 ` NeilBrown
2014-01-24  8:24 ` Stan Hoeppner
2014-01-23  2:37 ` Stan Hoeppner
2014-01-23  9:13 ` Marc MERLIN
2014-01-23 12:24 ` Stan Hoeppner
2014-01-23 21:01 ` Marc MERLIN
2014-01-24  5:13 ` Stan Hoeppner
2014-01-25  8:36 ` Marc MERLIN
2014-01-28  7:46 ` Stan Hoeppner
2014-01-28 16:50 ` Marc MERLIN
2014-01-29  0:56 ` Stan Hoeppner
2014-01-29  1:01 ` Marc MERLIN
2014-01-30 20:47 ` Phillip Susi
2014-02-01 22:39 ` Stan Hoeppner
2014-02-02 18:53 ` Phillip Susi
2014-02-03  6:34 ` Stan Hoeppner
2014-02-03 14:42 ` Phillip Susi
2014-02-04  3:30 ` Stan Hoeppner
2014-02-04 17:59 ` Larry Fenske
2014-02-04 18:08 ` Phillip Susi
2014-02-04 18:43 ` Stan Hoeppner
2014-02-04 18:55 ` Phillip Susi
2014-02-04 19:15 ` Stan Hoeppner
2014-02-04 20:16 ` Phillip Susi
2014-02-04 21:58 ` Stan Hoeppner
2014-02-05  1:19 ` Phillip Susi
2014-02-05  1:42 ` Stan Hoeppner
2014-01-30 20:36 ` Phillip Susi
2014-01-30 20:18 ` Phillip Susi
2014-01-22 19:38 ` Opal 2.0 SEDs on linux, was: Very long raid5 init/rebuild times Chris Murphy
2014-01-21 18:31 ` Chris Murphy
2014-01-22 13:46 ` Ethan Wilson