* RAID-10 explicitly defined drive pairs?

From: Jan Kasprzak @ 2011-12-12 11:54 UTC
To: linux-raid

Hello, Linux RAID gurus,

I have a new server with two identical external disk shelves (22 drives each), which will be connected to the server with a pair of SAS cables. I want to use RAID-10 on these disks, but I want it to be configured so that the data will always be mirrored between the shelves. I.e. I want to be able to survive the complete failure of a single shelf.

Is there any way to tell mdadm explicitly how to set up the pairs of mirrored drives inside a RAID-10 volume?

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839    Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E  |
| http://www.fi.muni.cz/~kas/   Journal: http://www.fi.muni.cz/~kas/blog/ |
Please don't top post and in particular don't attach entire digests to your mail or we'll all soon be using bittorrent to read the list. --Alan Cox
* Re: RAID-10 explicitly defined drive pairs?

From: John Robinson @ 2011-12-12 15:33 UTC
To: Jan Kasprzak; +Cc: linux-raid

On 12/12/2011 11:54, Jan Kasprzak wrote:
> Hello, Linux RAID gurus,
>
> I have a new server with two identical external disk shelves (22 drives each),
> which will be connected to the server with a pair of SAS cables.
> I want to use RAID-10 on these disks, but I want it to be configured
> so that the data will always be mirrored between the shelves.
> I.e. I want to be able to survive the complete failure of a single shelf.
>
> Is there any way to tell mdadm explicitly how to set up
> the pairs of mirrored drives inside a RAID-10 volume?

If you're using RAID10,n2 (the default layout) then adjacent pairs of drives in the create command will be mirrors, so your command line should be something like:

# mdadm --create /dev/mdX -l10 -pn2 -n44 /dev/shelf1drive1 /dev/shelf2drive1 /dev/shelf1drive2 ...

Having said that, if you think there's a real chance of a shelf failing, you probably ought to think about adding more redundancy within the shelves so that you can survive another drive failure or two while you're running on just one shelf.

If you are sticking with RAID10, you can potentially get double the read performance using the far layout - -pf2 - and with the same order of drives you can still survive a shelf failure, though your use of port multipliers may well limit your performance anyway.

Hope this helps!

Cheers,

John.
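For reference, a minimal shell sketch of generating that interleaved device order rather than typing all 44 paths by hand, assuming the shelves expose stable per-slot names (the /dev/shelf*drive* paths are placeholders for whatever persistent names udev actually provides):

    #!/bin/bash
    # Interleave the two shelves: drive N of shelf 1 is immediately followed
    # by drive N of shelf 2, so with the default n2 layout each such pair
    # becomes a mirror. The device names are placeholders.
    devices=()
    for n in $(seq 1 22); do
        devices+=("/dev/shelf1drive${n}" "/dev/shelf2drive${n}")
    done
    mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=44 "${devices[@]}"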
* Re: RAID-10 explicitly defined drive pairs?

From: Jan Kasprzak @ 2012-01-06 15:08 UTC
To: linux-raid; +Cc: John Robinson

John Robinson wrote:
: On 12/12/2011 11:54, Jan Kasprzak wrote:
: > Is there any way to tell mdadm explicitly how to set up
: > the pairs of mirrored drives inside a RAID-10 volume?
:
: If you're using RAID10,n2 (the default layout) then adjacent pairs
: of drives in the create command will be mirrors, so your command
: line should be something like:
:
: # mdadm --create /dev/mdX -l10 -pn2 -n44 /dev/shelf1drive1
: /dev/shelf2drive1 /dev/shelf1drive2 ...

OK, this works, thanks!

: Having said that, if you think there's a real chance of a shelf
: failing, you probably ought to think about adding more redundancy
: within the shelves so that you can survive another drive failure or
: two while you're running on just one shelf.

I am aware of that. I don't think the whole shelf will fail, but who knows :-)

: If you are sticking with RAID10, you can potentially get double the
: read performance using the far layout - -pf2 - and with the same
: order of drives you can still survive a shelf failure, though your
: use of port multipliers may well limit your performance anyway.

On the older hardware I have a majority of writes, so the far layout is probably not good for me (reads can be cached pretty well at the OS level).

After some experiments with my new hardware, I have discovered one more serious problem: I have simulated an enclosure failure, so half of the disks forming the RAID-10 volume disappeared. After removing them using mdadm --remove, and adding them back, iostat reports that they are resynced one disk at a time, not all just-added disks in parallel.

Is there any way of adding more than one disk to the degraded RAID-10 volume, and getting the volume restored as fast as the hardware permits? Otherwise, it would be better for us to discard RAID-10 altogether, and use several independent RAID-1 volumes joined together using LVM (which we will probably use on top of the RAID-10 volume anyway).

I have tried mdadm --add /dev/mdN /dev/sd.. /dev/sd.. /dev/sd.., but it behaves the same way as issuing mdadm --add one drive at a time.

Thanks,

-Yenya
* Re: RAID-10 explicitly defined drive pairs?

From: Peter Grandi @ 2012-01-06 16:39 UTC
To: Linux RAID

>>> Is there any way to tell mdadm explicitly how to set up
>>> the pairs of mirrored drives inside a RAID-10 volume?

>> If you're using RAID10,n2 (the default layout) then adjacent
>> pairs of drives in the create command will be mirrors, [ ... ]

I did that once with a pair of MD1000 shelves from Dell and it worked pretty well (except that it was very painful to configure the shelves with each disk as a separate volume).

> half of the disks forming the RAID-10 volume disappeared.
> After removing them using mdadm --remove, and adding them
> back, iostat reports that they are resynced one disk at a time,
> not all just-added disks in parallel.

That's very interesting news. Thanks for reporting this, though; it is something to keep in mind.

> [ ... ] Otherwise it would be better for us to discard RAID-10
> altogether, and use several independent RAID-1 volumes joined
> together

I suspect that MD runs one recovery per array at a time, and a 'raid10' array is a single array. It would be interesting to know how this works in general, for example how many drives would be rebuilt at the same time after a 2-drive failure on a RAID6.

You might try a two-layer arrangement, a 'raid0' of 'raid1' pairs, instead of a 'raid10'. The two things with MD are not the same; for example, you can do layouts like a 3-drive 'raid10'.

> using LVM (which we will probably use on top of the RAID-10
> volume anyway).

Oh no! LVM is nowhere near as nice as MD for RAIDing and is otherwise largely useless (except, regrettably, for snapshots), and it has some annoying limitations.
* Re: RAID-10 explicitly defined drive pairs?

From: Stan Hoeppner @ 2012-01-06 19:16 UTC
To: Linux RAID

I often forget that the vger lists don't provide a List-Post: header (while 'everyone' else does). Thus my apologies for this reply initially going only to Peter and not to the list.

On 1/6/2012 10:39 AM, Peter Grandi wrote:
> You might try a two-layer arrangement, a 'raid0' of 'raid1'
> pairs, instead of a 'raid10'. The two things with MD are not the
> same; for example, you can do layouts like a 3-drive 'raid10'.

The MD 3-drive 'RAID 10' layout is similar or equivalent to SNIA RAID 1E, IIRC.

-- 
Stan
* Re: RAID-10 explicitly defined drive pairs?

From: Jan Kasprzak @ 2012-01-06 20:11 UTC
To: Peter Grandi; +Cc: Linux RAID

Peter Grandi wrote:
: > half of the disks forming the RAID-10 volume disappeared.
: > After removing them using mdadm --remove, and adding them
: > back, iostat reports that they are resynced one disk at a time,
: > not all just-added disks in parallel.
:
: That's very interesting news. Thanks for reporting this, though;
: it is something to keep in mind.

Yes. My HBA is able to do 4 GByte/s bursts according to the documentation, and I am able to get 2.4 GByte/s sustained. So getting only about 120-150 MByte/s for RAID-10 resync is really disappointing.

: > [ ... ] Otherwise it would be better for us to discard RAID-10
: > altogether, and use several independent RAID-1 volumes joined
: > together
:
: I suspect that MD runs one recovery per array at a time,
: and a 'raid10' array is a single array.

Yes, but when the array is being assembled initially (without --assume-clean), the MD RAID-10 can resync all pairs of disks at once. It is still limited to two threads (mdX_resync and mdX_raid10), so for a widely-interleaved RAID-10 the CPUs can still be a bottleneck (see my post in this thread from last May or April). But it is still much better than 120-150 MByte/s.

: You might try a two-layer arrangement, a 'raid0' of 'raid1'
: pairs, instead of a 'raid10'. The two things with MD are not the
: same; for example, you can do layouts like a 3-drive 'raid10'.
:
: > using LVM (which we will probably use on top of the RAID-10
: > volume anyway).
:
: Oh no! LVM is nowhere near as nice as MD for RAIDing and is otherwise
: largely useless (except, regrettably, for snapshots), and it has some
: annoying limitations.

I think LVM on top of RAID-10 (or multiple RAID-1 volumes) is actually pretty nice. With RAID-10 it is a bit easier to handle, because the upper layer (LVM) does not need to know about proper interleaving of lower layers. And I suspect that XFS swidth/sunit settings will still work with RAID-10 parameters even over plain LVM logical volume on top of that RAID 10, while the settings would be more tricky when used with interleaved LVM logical volume on top of several RAID-1 pairs (LVM interleaving uses LE/PE-sized stripes, IIRC).

-Yenya
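For what it's worth, when the geometry cannot be detected automatically through the LVM layer it can be passed to mkfs.xfs by hand. A minimal sketch, assuming a hypothetical 44-drive RAID10,n2 (22 data-bearing members) with a 512 KiB chunk and a logical volume named /dev/vg0/lv0; both the numbers and the names must be adjusted to the real setup:

    # su = per-disk chunk size, sw = number of data-bearing members
    # (44-drive RAID10,n2 -> 22); all values are hypothetical here.
    mkfs.xfs -d su=512k,sw=22 /dev/vg0/lv0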
* Re: RAID-10 explicitly defined drive pairs?

From: Stan Hoeppner @ 2012-01-06 22:55 UTC
To: Jan Kasprzak; +Cc: Peter Grandi, Linux RAID

On 1/6/2012 2:11 PM, Jan Kasprzak wrote:
> And I suspect that XFS swidth/sunit
> settings will still work with RAID-10 parameters even over plain
> LVM logical volume on top of that RAID 10, while the settings would
> be more tricky when used with interleaved LVM logical volume on top
> of several RAID-1 pairs (LVM interleaving uses LE/PE-sized stripes, IIRC).

If one is using many RAID1 pairs, s/he probably isn't after single large file performance anyway, or s/he would just use RAID10. Thus sunit/swidth settings aren't tricky in this case. One would use a linear concatenation and drive parallelism with XFS allocation groups, i.e. for a 24 drive chassis you'd set up an mdraid or lvm linear array of 12 RAID1 pairs and format with something like:

$ mkfs.xfs -d agcount=24 [device]

As long as one's workload writes files relatively evenly across 24 or more directories, one receives fantastic concurrency/parallelism, in this case 24 concurrent transactions, 2 to each mirror pair. In the case of 15K SAS drives this is far more than sufficient to saturate the seek bandwidth of the drives. One may need more AGs to achieve the concurrency necessary to saturate good SSDs.

-- 
Stan
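A minimal sketch of that layout for a hypothetical 24-drive chassis; the drive letters and md numbers are placeholders, and the agcount matches the suggestion above:

    #!/bin/bash
    # Build 12 RAID1 pairs, join them into one linear (concatenated) array,
    # then create XFS with one allocation group per member drive.
    drives=(/dev/sd{b..y})              # 24 hypothetical drives
    pairs=()
    for i in $(seq 0 11); do
        mdadm --create "/dev/md$((i+1))" --level=1 --raid-devices=2 \
            "${drives[2*i]}" "${drives[2*i+1]}"
        pairs+=("/dev/md$((i+1))")
    done
    mdadm --create /dev/md100 --level=linear --raid-devices=12 "${pairs[@]}"
    mkfs.xfs -d agcount=24 /dev/md100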
* Re: RAID-10 explicitly defined drive pairs?

From: Peter Grandi @ 2012-01-07 14:25 UTC
To: Linux RAID

>> And I suspect that XFS swidth/sunit settings will still work
>> with RAID-10 parameters even over plain LVM logical volume on
>> top of that RAID 10, while the settings would be more tricky
>> when used with interleaved LVM logical volume on top of
>> several RAID-1 pairs (LVM interleaving uses LE/PE-sized
>> stripes, IIRC).

Stripe alignment is only relevant for parity RAID types, as it is meant to minimize read-modify-write. There is no RMW problem with RAID0, RAID1 or combinations. But there is a case for 'sunit'/'swidth' with single flash based SSDs as they do have a RMW-like issue with erase blocks. In other cases whether they are of benefit is rather questionable.

> One would use a linear concatenation and drive parallelism
> with XFS allocation groups, i.e. for a 24 drive chassis you'd
> set up an mdraid or lvm linear array of 12 RAID1 pairs and
> format with something like: $ mkfs.xfs -d agcount=24 [device]

> As long as one's workload writes files relatively evenly
> across 24 or more directories, one receives fantastic
> concurrency/parallelism, in this case 24 concurrent
> transactions, 2 to each mirror pair.

That to me sounds a bit too fragile; RAID0 is almost always preferable to "concat", even with AG multiplication, and I would be avoiding LVM more than avoiding MD.
* Re: RAID-10 explicitly defined drive pairs?

From: Stan Hoeppner @ 2012-01-07 16:25 UTC
To: Peter Grandi; +Cc: Linux RAID

On 1/7/2012 8:25 AM, Peter Grandi wrote:

I wrote:
>>> And I suspect that XFS swidth/sunit settings will still work
>>> with RAID-10 parameters even over plain LVM logical volume on
>>> top of that RAID 10, while the settings would be more tricky
>>> when used with interleaved LVM logical volume on top of
>>> several RAID-1 pairs (LVM interleaving uses LE/PE-sized
>>> stripes, IIRC).

> Stripe alignment is only relevant for parity RAID types, as it
> is meant to minimize read-modify-write.

The benefits aren't limited to parity arrays. Tuning the stripe parameters yields benefits on RAID0/10 arrays as well, mainly by packing a full stripe of data when possible, avoiding many partial stripe width writes in the non aligned case. Granted the gains are workload dependent, but overall you get a bump from aligned writes.

> There is no RMW problem
> with RAID0, RAID1 or combinations.

Which is one of the reasons the linear concat over RAID1 pairs works very well for some workloads.

> But there is a case for
> 'sunit'/'swidth' with single flash based SSDs as they do have a
> RMW-like issue with erase blocks. In other cases whether they
> are of benefit is rather questionable.

I'd love to see some documentation supporting this sunit/swidth with a single SSD device theory.

I wrote:
>> One would use a linear concatenation and drive parallelism
>> with XFS allocation groups, i.e. for a 24 drive chassis you'd
>> set up an mdraid or lvm linear array of 12 RAID1 pairs and
>> format with something like: $ mkfs.xfs -d agcount=24 [device]

>> As long as one's workload writes files relatively evenly
>> across 24 or more directories, one receives fantastic
>> concurrency/parallelism, in this case 24 concurrent
>> transactions, 2 to each mirror pair.

> That to me sounds a bit too fragile; RAID0 is almost always
> preferable to "concat", even with AG multiplication, and I would
> be avoiding LVM more than avoiding MD.

This wholly depends on the workload. For something like maildir RAID0 would give you no benefit as the mail files are going to be smaller than a sane MDRAID chunk size for such an array, so you get no striping performance benefit.

And RAID0 is far more fragile here than a concat. If you lose both drives in a mirror pair, say to controller, backplane, cable, etc failure, you've lost your entire array, and your XFS filesystem. With a concat you can lose a mirror pair, run an xfs_repair and very likely end up with a functioning filesystem, sans the directories and files that resided on that pair. With RAID0 you're totally hosed. With a concat you're probably mostly still in business.

-- 
Stan
* Re: RAID-10 explicitly defined drive pairs?

From: Peter Grandi @ 2012-01-09 13:46 UTC
To: Linux RAID

[ ... ]

>> Stripe alignment is only relevant for parity RAID types, as it
>> is meant to minimize read-modify-write.

> The benefits aren't limited to parity arrays. Tuning the
> stripe parameters yields benefits on RAID0/10 arrays as well,
> mainly by packing a full stripe of data when possible,
> avoiding many partial stripe width writes in the non aligned
> case.

This seems like handwaving gibberish to me, or (being very generous) a misunderestimation of the general notion that larger (as opposed to *aligned*) transactions are (sometimes) of greater benefit than smaller ones.

Note: there is, with 'ext'-style filesystems, the 'stride', which is designed to interleave data and metadata so they are likely to be on different disks, but that is in some ways the opposite of 'sunit'/'swidth'-style address/length alignment, and is rather more similar to multiple AGs than to aligning IO on RMW-free boundaries.

How can «packing a full stripe of data» by itself be of benefit on RAID0/RAID1/RAID10, if that is in any way different from just doing larger transactions, or if it is different from an argument about chunk size vs. transaction size? A single N-wide write (or even a close sequence of N 1-wide writes) on a RAID0/1/10 will result in optimal N concurrent writes if that is possible, whether it is address/length aligned or not. Why would «avoiding many partial stripe width writes» have a significant effect in the RAID0 or RAID1 case, given that there is no RMW problem?

> Granted the gains are workload dependent, but overall you get
> a bump from aligned writes.

Perhaps in a small way because of buffering effects or RAM or cache alignment effects, but that would be unrelated to the storage geometry.

>> There is no RMW problem with RAID0, RAID1 or combinations.

> Which is one of the reasons the linear concat over RAID1 pairs
> works very well for some workloads.

But the two are completely unrelated. Your argument was that 'concat' plus AGs works well if the workload is distributed over different directories in a number similar to the number of drives. Concat plus AGs may work well for special workloads, but RAID0 plus AGs might work better. To me 'concat' is just like RAID0 but sillier, regardless of special cases. It is largely pointless. Please show how 'concat' is indeed preferable to RAID0 in the general case or any significant special case.

>> But there is a case for 'sunit'/'swidth' with single flash
>> based SSDs as they do have a RMW-like issue with erase
>> blocks. In other cases whether they are of benefit is rather
>> questionable.

> I'd love to see some documentation supporting this sunit/swidth
> with a single SSD device theory.

You have already read it above: internally, SSDs have a big RMW problem because of (erase) ''flash blocks'' being much larger (around 512KiB/1MiB) than (''write''/read) ''flash pages'', which are anyhow rather larger (usually 4KiB/8KiB) than logical 512B sectors.

RMW avoidance is all that there is to address/length alignment. It has nothing to do with RAIDness per se, and indeed in a different domain address/length aligned writes work very well with RAM, because it too has a big RMW problem.

Note: the case for RMW address/length aligned writes on single SSDs is not clear only because FTL firmware simulates a non-RMW device by using something (quite) similar to a small-granule log-structured filesystem on top of the flash storage, and this might "waste" the extra alignment by the filesystem.

The same applies, for example, to partition alignment: you can easily find on the web documentation that explains in accessible terms that having ''parity block''-aligned partitions is good for parity RAID, and other documentation that explains that ''erase block''-aligned partitions are good for SSDs too, and in both cases the reason is RMW, whether the reason for RMW is parity or erasing.

Those able to do a web search with the relevant keywords and read documentation can find some mentions of single SSD RMW and address/length alignment, for example here:

  http://research.cs.wisc.edu/adsl/Publications/ssd-usenix08.pdf
  http://research.microsoft.com/en-us/projects/flashlight/winhec08-ssd.pptx
  http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-2.pdf

Mentioned in passing as something pretty obvious, and there are other similar mentions that come up in web searches because it is a pretty natural application of thinking about RMW issues.

Now I eagerly await your explanation of the amazing "Hoeppner effect" by which address/length aligned writes on RAID0/1/10 have significant benefits and of the audacious "Hoeppner principle" by which 'concat' is as good as RAID0 over the same disks.
* Re: RAID-10 explicitly defined drive pairs?

From: Stan Hoeppner @ 2012-01-10 3:54 UTC
To: Peter Grandi; +Cc: Linux RAID

On 1/9/2012 7:46 AM, Peter Grandi wrote:

> Those able to do a web search with the relevant keywords and
> read documentation can find some mentions of single SSD RMW and
> address/length alignment, for example here:
>
> http://research.cs.wisc.edu/adsl/Publications/ssd-usenix08.pdf
> http://research.microsoft.com/en-us/projects/flashlight/winhec08-ssd.pptx
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-2.pdf
>
> Mentioned in passing as something pretty obvious, and there are
> other similar mentions that come up in web searches because it
> is a pretty natural application of thinking about RMW issues.

Yes, I've read such things. I was alluding to the fact that there are at least a half dozen different erase block sizes and algorithms in use by different SSD manufacturers. There is no standard. And not all of them are published. There is no reliable way to do such optimization generically.

> Now I eagerly await your explanation of the amazing "Hoeppner
> effect" by which address/length aligned writes on RAID0/1/10
> have significant benefits and of the audacious "Hoeppner
> principle" by which 'concat' is as good as RAID0 over the same
> disks.

IIRC from a previous discussion I had with Neil Brown on this list, mdraid0, as with all the striped array code, runs as a single kernel thread, limiting its performance to that of a single CPU. A linear concatenation does not run as a single kernel thread, but is simply an offset calculation routine that, IIRC, executes on the same CPU as the caller. Thus one can theoretically achieve near 100% CPU scalability when using concat instead of mdraid0. So the issue isn't partial stripe writes at the media level, but the CPU overhead caused by millions of the little bastards with heavy random IOPS workloads, along with increased numbers of smaller IOs through the SCSI/SATA interface, causing more interrupts thus more CPU time, etc.

I've not run into this single stripe thread limitation myself, but have read multiple cases where OPs can't get maximum performance from their storage hardware because their top level mdraid stripe thread is peaking a single CPU in their X-way system. Moving from RAID10 to a linear concat gets around this limitation for small file random IOPS workloads. Only when using XFS and a proper AG configuration, obviously. This is my recollection of Neil's description of the code behavior. I could very well have misunderstood, and I'm sure he'll correct me if that's the case, or you, or both. ;)

Dave Chinner had some input WRT XFS on concat for this type of workload, stating it's a little better than RAID10 (ambiguous as to hard/soft). Did you read that thread Peter? I know you're on the XFS list as well. I can't exactly recall at this time Dave's specific reasoning, I'll try to dig it up. I'm thinking it had to do with the different distribution of metadata IOs between the two AG layouts, and the amount of total head seeking required for the workload being somewhat higher for RAID10 than for the concat of RAID1 pairs. Again, I could be wrong on that, but it seems familiar. That discussion was many months ago.

-- 
Stan
* Re: RAID-10 explicitly defined drive pairs?

From: NeilBrown @ 2012-01-10 4:13 UTC
To: stan; +Cc: Peter Grandi, Linux RAID

On Mon, 09 Jan 2012 21:54:56 -0600 Stan Hoeppner <stan@hardwarefreak.com> wrote:

> On 1/9/2012 7:46 AM, Peter Grandi wrote:
>
> > Those able to do a web search with the relevant keywords and
> > read documentation can find some mentions of single SSD RMW and
> > address/length alignment, for example here:
> >
> > http://research.cs.wisc.edu/adsl/Publications/ssd-usenix08.pdf
> > http://research.microsoft.com/en-us/projects/flashlight/winhec08-ssd.pptx
> > http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-2.pdf
> >
> > Mentioned in passing as something pretty obvious, and there are
> > other similar mentions that come up in web searches because it
> > is a pretty natural application of thinking about RMW issues.
>
> Yes, I've read such things. I was alluding to the fact that there are at
> least a half dozen different erase block sizes and algorithms in use by
> different SSD manufacturers. There is no standard. And not all of them
> are published. There is no reliable way to do such optimization
> generically.
>
> > Now I eagerly await your explanation of the amazing "Hoeppner
> > effect" by which address/length aligned writes on RAID0/1/10
> > have significant benefits and of the audacious "Hoeppner
> > principle" by which 'concat' is as good as RAID0 over the same
> > disks.
>
> IIRC from a previous discussion I had with Neil Brown on this list,
> mdraid0, as with all the striped array code, runs as a single kernel
> thread, limiting its performance to that of a single CPU. A linear
> concatenation does not run as a single kernel thread, but is simply an
> offset calculation routine that, IIRC, executes on the same CPU as the
> caller. Thus one can theoretically achieve near 100% CPU scalability
> when using concat instead of mdraid0. So the issue isn't partial stripe
> writes at the media level, but the CPU overhead caused by millions of
> the little bastards with heavy random IOPS workloads, along with
> increased numbers of smaller IOs through the SCSI/SATA interface,
> causing more interrupts thus more CPU time, etc.
>
> I've not run into this single stripe thread limitation myself, but have
> read multiple cases where OPs can't get maximum performance from their
> storage hardware because their top level mdraid stripe thread is peaking
> a single CPU in their X-way system. Moving from RAID10 to a linear
> concat gets around this limitation for small file random IOPS workloads.
> Only when using XFS and a proper AG configuration, obviously. This is
> my recollection of Neil's description of the code behavior. I could
> very well have misunderstood, and I'm sure he'll correct me if that's
> the case, or you, or both. ;)

(oh dear, someone is Wrong on the Internet! Quick, duck into the telephone booth and pop out as ....)

Hi Stan,
I think you must be misremembering.
Neither RAID0 nor Linear has any threads involved. They just redirect the request to the appropriate devices. Multiple threads can submit multiple requests down through RAID0 and Linear concurrently.

RAID1, RAID10, and RAID5/6 are different. For reads they normally have no contention with other requests, but for writes things do get single-threaded at some point.

Hm... your text above sometimes talks about RAID0 vs Linear, and sometimes about RAID10 vs Linear. So maybe you are remembering correctly, but presenting it incorrectly in part ....

NeilBrown

> Dave Chinner had some input WRT XFS on concat for this type of workload,
> stating it's a little better than RAID10 (ambiguous as to hard/soft).
> Did you read that thread Peter? I know you're on the XFS list as well.
> I can't exactly recall at this time Dave's specific reasoning, I'll try
> to dig it up. I'm thinking it had to do with the different distribution
> of metadata IOs between the two AG layouts, and the amount of total head
> seeking required for the workload being somewhat higher for RAID10 than
> for the concat of RAID1 pairs. Again, I could be wrong on that, but it
> seems familiar. That discussion was many months ago.
* Re: RAID-10 explicitly defined drive pairs?

From: Stan Hoeppner @ 2012-01-10 16:25 UTC
To: NeilBrown; +Cc: Peter Grandi, Linux RAID

On 1/9/2012 10:13 PM, NeilBrown wrote:

>> IIRC from a previous discussion I had with Neil Brown on this list,
>> mdraid0, as with all the striped array code, runs as a single kernel
>> thread, limiting its performance to that of a single CPU. A linear
>> concatenation does not run as a single kernel thread, but is simply an
>> offset calculation routine that, IIRC, executes on the same CPU as the
>> caller. Thus one can theoretically achieve near 100% CPU scalability
>> when using concat instead of mdraid0. So the issue isn't partial stripe
>> writes at the media level, but the CPU overhead caused by millions of
>> the little bastards with heavy random IOPS workloads, along with
>> increased numbers of smaller IOs through the SCSI/SATA interface,
>> causing more interrupts thus more CPU time, etc.
>>
>> I've not run into this single stripe thread limitation myself, but have
>> read multiple cases where OPs can't get maximum performance from their
>> storage hardware because their top level mdraid stripe thread is peaking
>> a single CPU in their X-way system. Moving from RAID10 to a linear
>> concat gets around this limitation for small file random IOPS workloads.
>> Only when using XFS and a proper AG configuration, obviously. This is
>> my recollection of Neil's description of the code behavior. I could
>> very well have misunderstood, and I'm sure he'll correct me if that's
>> the case, or you, or both. ;)
>
> (oh dear, someone is Wrong on the Internet! Quick, duck into the telephone
> booth and pop out as ....)
>
> Hi Stan,
> I think you must be misremembering.
> Neither RAID0 nor Linear has any threads involved. They just redirect the
> request to the appropriate devices. Multiple threads can submit multiple
> requests down through RAID0 and Linear concurrently.
>
> RAID1, RAID10, and RAID5/6 are different. For reads they normally have
> no contention with other requests, but for writes things do get
> single-threaded at some point.
>
> Hm... your text above sometimes talks about RAID0 vs Linear, and sometimes
> about RAID10 vs Linear. So maybe you are remembering correctly, but
> presenting it incorrectly in part ....

Yes, I believe that's where we are. My apologies for allowing myself to become slightly confused. I'm sure I'm the only human being working with Linux to ever become so. ;)

Peter kept referencing RAID0 after I'd explicitly referenced RAID10 in my statement. I guess I assumed he was simply referring to the striped component of RAID10, which apparently wasn't the case. So I did recall correctly that mdraid10 does have some threading limitations. So what needs clarification at this point is whether those limitations are greater than any such limitations with the concatenated RAID1 pair case using XFS AGs to drive the parallelism.

Thanks for your input Neil, and for your clarifications thus far.

-- 
Stan
* Re: RAID-10 explicitly defined drive pairs?

From: Peter Grandi @ 2012-01-12 11:58 UTC
To: Linux RAID

I have pulled bits of the original posts in to give some context.

[ .... ]

>>>> Stripe alignment is only relevant for parity RAID types, as
>>>> it is meant to minimize read-modify-write. There is no RMW
>>>> problem with RAID0, RAID1 or combinations. [ ... ]

>>> The benefits aren't limited to parity arrays. Tuning the
>>> stripe parameters yields benefits on RAID0/10 arrays as
>>> well, mainly by packing a full stripe of data when possible,
>>> avoiding many partial stripe width writes in the non aligned
>>> case. Granted the gains are workload dependent, but overall
>>> you get a bump from aligned writes. [ ... ]

>>>> But there is a case for 'sunit'/'swidth' with single flash
>>>> based SSDs as they do have a RMW-like issue with erase
>>>> blocks. In other cases whether they are of benefit is
>>>> rather questionable.

>>> I'd love to see some documentation supporting this
>>> sunit/swidth with a single SSD device theory. [ ... ]

Well, at least in some cases there are some details on erase block sizes for some devices, and most contemporary devices seem to max out at 8KiB ''flash pages'' and 1MiB ''flash blocks'' (most contemporary low-cost flash SSDs are RAID0-like interleavings of chips with those parameters). There is (hopefully) little cost in further alignment, so I think that 16KiB as the 'sunit' and 2MiB as the 'swidth' on a single SSD should cover some further tightening of the requirements.

But as I wrote previously, the biggest issue with the expectation that address/length alignment matters with flash SSDs is the Flash Translation Layer firmware they use, which may make attempts to perform higher-level geometry adaptation not so relevant.

While there isn't any good argument that address/length alignment matters other than for RMW storage devices, I must say that because of my estimate that address/length alignment is not costly, and my intuition that it might help, I specify address/length alignments on *everything* (even non-parity RAID on non-RMW storage, even single disks on non-RMW storage).

One of my guesses as to why is that it might help keep free space more contiguous, and thus in general may lead to lower fragmentation of allocated files (which does not matter a lot for flash SSDs, but then perhaps the RMW issue matters), probably because it leads to allocations being done in bigger and more aligned chunks than otherwise. That is a *static* effect at the file system level, rather than the dynamic effects at the array level mentioned in your (euphemism alert) rather weak arguments about multithreading or better scheduling of IO operations at the array level.

The cost is mostly that when there is little free space, the remaining free space is probably more fragmented than it would otherwise have been. But I try to keep at least 15-20% free space available regardless.
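To make those numbers concrete, this is one way such alignment hints could be passed to XFS on a single SSD, assuming (and it is only an assumption) that 16KiB and 2MiB really do cover the device's flash page and erase block sizes; the device name is a placeholder, and as noted above the FTL may render the whole exercise moot:

    # 16 KiB stripe unit, 2 MiB stripe width (128 units of 16 KiB),
    # chosen per the erase-block reasoning above, not from any device spec.
    mkfs.xfs -d su=16k,sw=128 /dev/sdX1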
* Re: RAID-10 explicitly defined drive pairs?

From: Peter Grandi @ 2012-01-12 12:47 UTC
To: Linux RAID

[ ... ]

>> That to me sounds a bit too fragile; RAID0 is almost always
>> preferable to "concat", even with AG multiplication, and I
>> would be avoiding LVM more than avoiding MD.

> This wholly depends on the workload. For something like
> maildir RAID0 would give you no benefit as the mail files are
> going to be smaller than a sane MDRAID chunk size for such an
> array, so you get no striping performance benefit.

That seems to me an unfortunate argument and example:

* As an example, putting a mail archive on a RAID0 or 'concat' seems a bit at odds with the usual expectations of availability for them. Unless it is a RAID0 or 'concat' over RAID1. Because anyhow a 'maildir' mail archive is a horribly bad idea regardless, because it maps very badly onto current storage technology.

* The issue of chunk size is one of my pet peeves, as there is very little case for it being larger than file system block size. Sure there are many "benchmarks" that show that larger chunk sizes correspond to higher transfer rates, but that is because of unrealistic transaction size effects. Which don't matter for a mostly random-access shared mail archive, never mind a maildir one.

* Regardless, an argument that there is no striping benefit in that case is not an argument that 'concat' is better. I'd still default to RAID0.

* Consider the dubious joys of an 'fsck' or 'rsync' (and other bulk maintenance operations, like indexing the archive), and how RAID0 may help (even if not a lot) the scanning of metadata with respect to 'concat' (unless one relies totally on parallelism across multiple AGs).

Perhaps one could make a case that 'concat' is no worse than 'RAID0' if one has a very special case that is equivalent to painting oneself into a corner, but it is not a very interesting case.

> And RAID0 is far more fragile here than a concat. If you lose
> both drives in a mirror pair, say to controller, backplane,
> cable, etc failure, you've lost your entire array, and your
> XFS filesystem.

Uhm, sometimes it is not a good idea to structure mirror pairs so that they have blatant common modes of failure. But then most arrays I have seen were built out of drives of the same make and model and taken out of the same carton....

> With a concat you can lose a mirror pair, run an xfs_repair and
> very likely end up with a functioning filesystem, sans the
> directories and files that resided on that pair. With RAID0
> you're totally hosed. With a concat you're probably mostly
> still in business.

That sounds (euphemism alert) rather optimistic to me, because it is based on the expectation that files, and files within the same directory, tend to be allocated entirely within a single segment of a 'concat'. Even with distributing AGs around for file system types that support that, that's a bit wistful (as is the expectation that AGs are indeed wholly contained in specific segments of a 'concat').

Usually if there is a case for a 'concat' there is a rather better case for separate, smaller filesystems mounted under a common location, as an alternative to RAID0. It is often a better case because data is often partitionable, there is no large advantage to a single free space pool as most files are not that large, and one can do fully independent and parallel 'fsck', 'rsync' and other bulk maintenance operations (including restores). Then we might as well get into distributed partitioned file systems with a single namespace like Lustre or DPM.

But your (euphemism alert) edgy recovery example above triggers a couple of my long standing pet peeves:

* The correct response to a damaged (in the sense of data loss) storage system is not to ignore the hole, patch up the filetree in it, and restart it, but to restore the filetree from backups. Because in any case one would have to run a verification pass against backups to see what has been lost and whether any partial file losses have happened.

* If availability requirements are so exigent that a restore from backup is not acceptable to the customer, and random data loss is better accepted, we have a strange situation. Which is that the customer really wants a Very Large DataBase (a database so large that it cannot be taken offline for maintenance, such as backups or recovery) style storage system, but they don't want to pay for it. A sysadm may then look good by playing to these politics by pretending they have done one on the cheap, by tacitly dropping data integrity, but these are scary politics.

[ ... ]
* Re: RAID-10 explicitly defined drive pairs?

From: Stan Hoeppner @ 2012-01-12 21:24 UTC
To: Linux RAID

On 1/12/2012 6:47 AM, Peter Grandi wrote:
> [ ... ]
>
>>> That to me sounds a bit too fragile; RAID0 is almost always
>>> preferable to "concat", even with AG multiplication, and I
>>> would be avoiding LVM more than avoiding MD.
>
>> This wholly depends on the workload. For something like
>> maildir RAID0 would give you no benefit as the mail files are
>> going to be smaller than a sane MDRAID chunk size for such an
>> array, so you get no striping performance benefit.
>
> That seems to me an unfortunate argument and example:
>
> * As an example, putting a mail archive on a RAID0 or 'concat'
> seems a bit at odds with the usual expectations of availability
> for them. Unless it is a RAID0 or 'concat' over RAID1. Because anyhow
> a 'maildir' mail archive is a horribly bad idea regardless,
> because it maps very badly onto current storage technology.

WRT availability, both are identical to textbook RAID10--half the drives can fail as long as no two are in the same mirror pair. In essence, RAID0 over mirrors _is_ RAID10.

I totally agree with your maildir sentiments--far too much physical IO (metadata) needed for the same job as mbox, Dovecot's mdbox, etc. But maildir is still extremely popular and in wide use, and will be for quite some time.

> * The issue of chunk size is one of my pet peeves, as there is
> very little case for it being larger than file system block
> size. Sure there are many "benchmarks" that show that larger
> chunk sizes correspond to higher transfer rates, but that is
> because of unrealistic transaction size effects. Which don't
> matter for a mostly random-access shared mail archive, never
> mind a maildir one.

I absolutely agree, which is why the concat makes sense for such workloads. From a 10,000 ft view, it is little different from having a group of mirror pairs, putting a filesystem on each, and manually spreading one's user mailboxen over those filesystems. XFS over a concat simply takes the manual spreading aspect out of this, and yields pretty good transaction load distribution.

> * Regardless, an argument that there is no striping benefit in
> that case is not an argument that 'concat' is better. I'd still
> default to RAID0.

The issue with RAID0 or RAID10 here is tuning XFS. With a striped array XFS works best if sunit/swidth match the stripe characteristics of the array, as it attempts to pack a full stripe worth of writes before pushing down the stack. This works well with large files, but can be an impediment to performance with lots of small writes. Free space fragmentation becomes a problem as XFS attempts to stripe-align all writes. So with maildir you often end up with lots of partial stripe writes, each on a different stripe. Once an XFS filesystem ages sufficiently (i.e. fills up), more head seeking is required to write files into the fragmented free space. At least, this is my understanding from Dave's previous explanation.

Additionally, when using a striped array, all XFS AGs are striped down the virtual cylinder that is the array. So when searching a large directory btree, you may generate seeks across all drives in the array to find a single entry. With a properly created XFS on a concat, AGs are aligned and wholly contained within a mirror pair. And since all files created within an AG have their metadata also within that AG, any btree walking is done only on an AG within that mirror pair, reducing head seeking to one drive vs all drives in a striped array.

The concat setup also has the advantage that per-drive read-ahead is more likely to cache blocks that will actually be needed shortly, i.e. the next file in an inbox, whereas with a striped array it's very likely that the next few blocks contain a different user's email, a user who may not even be logged in.

> * Consider the dubious joys of an 'fsck' or 'rsync' (and other
> bulk maintenance operations, like indexing the archive), and
> how RAID0 may help (even if not a lot) the scanning of metadata
> with respect to 'concat' (unless one relies totally on
> parallelism across multiple AGs).

This concat setup is specific to XFS and only XFS. It is useless with any other (Linux anyway) filesystem, because no others use an allocation group design nor can derive meaningful parallelism in the absence of striping.

> Perhaps one could make a case that 'concat' is no worse than
> 'RAID0' if one has a very special case that is equivalent to
> painting oneself into a corner, but it is not a very interesting
> case.

It's better than a RAID0/10 stripe for small file random IO workloads, for the reasons above.

>> And RAID0 is far more fragile here than a concat. If you lose
>> both drives in a mirror pair, say to controller, backplane,
>> cable, etc failure, you've lost your entire array, and your
>> XFS filesystem.
>
> Uhm, sometimes it is not a good idea to structure mirror pairs so
> that they have blatant common modes of failure. But then most
> arrays I have seen were built out of drives of the same make and
> model and taken out of the same carton....

I was demonstrating the worst-case scenario that could take down both array types, and the fact that when using XFS on both, you lose everything with RAID0, but can likely recover to a large degree with the concat, specifically because of the allocation group design and how the AGs are physically laid down on the concat disks.

>> With a concat you can lose a mirror pair, run an xfs_repair and
>> very likely end up with a functioning filesystem, sans the
>> directories and files that resided on that pair. With RAID0
>> you're totally hosed. With a concat you're probably mostly
>> still in business.
>
> That sounds (euphemism alert) rather optimistic to me, because it
> is based on the expectation that files, and files within the same
> directory, tend to be allocated entirely within a single segment
> of a 'concat'.

This is exactly the case. With 16x1TB drives in an mdraid linear concat with XFS and 16 AGs, you get exactly 1 AG on each drive. In practice in this case one would probably want 2 AGs per drive, as files are clustered around the directories. With the small file random IO workload this decreases head seeking between the directory write op and the file write op, which typically occur in rapid succession.

> Even with distributing AGs around for file system
> types that support that, that's a bit wistful (as is the
> expectation that AGs are indeed wholly contained in specific
> segments of a 'concat').

No, it is not, and yes, they are.

> Usually if there is a case for a 'concat' there is a rather
> better case for separate, smaller filesystems mounted under a
> common location, as an alternative to RAID0.

Absolutely agreed, for the most part. If the application itself has the ability to spread the file transaction load across multiple directories, this is often better than relying on the filesystem to do it automagically. And if you lose one filesystem for any reason, you've only lost access to a portion of the data, not all of it. The minor downside is managing multiple filesystems instead of one, but that's not a big deal really, given the extra safety margin.

In the case of the maildir workload, Dovecot, for instance, allows specifying a mailbox location on a per-user basis. I recall one Dovecot OP who is doing this with 16 mirror pairs and 16 EXTx filesystems on top. IIRC he was bitten more than once by single large hardware RAID setups going down--I don't recall the specifics. Check the Dovecot list archives.

> It is often a better case because data is often partitionable,
> there is no large advantage to a single free space pool as most
> files are not that large, and one can do fully independent and
> parallel 'fsck', 'rsync' and other bulk maintenance operations
> (including restores).

Agreed, if the data set can be partitioned and your application permits doing so. Some do not.

> Then we might as well get into distributed partitioned file
> systems with a single namespace like Lustre or DPM.

Lustre wasn't designed for, nor is it suitable for, high-IOPS, low-latency, small-file workloads, which is, or at least was, the topic we are discussing. I'm not familiar with DPM. Most distributed filesystems aren't suitable for this type of workload due to multiple types of latency.

> But your (euphemism alert) edgy recovery example above triggers a
> couple of my long standing pet peeves:
>
> * The correct response to a damaged (in the sense of data loss)
> storage system is not to ignore the hole, patch up the filetree
> in it, and restart it, but to restore the filetree from backups.
> Because in any case one would have to run a verification pass
> against backups to see what has been lost and whether any
> partial file losses have happened.

I believe you missed the point, and are making some incorrect assumptions WRT SOP in this field and the wherewithal of your colleagues. In my concat example you can likely be back up and running "right now", with some loss, _while_ you troubleshoot/fix/restore. In the RAID0 scenario, you're completely down _until_ you troubleshoot/fix/restore. Nobody is going to slap a bandaid on and "ignore the hole". I never stated nor implied that. I operate on the assumption that my colleagues here know what they're doing for the most part, so I don't expend extra unnecessary paragraphs on SOP minutia.

[snipped]

-- 
Stan
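For the hypothetical 16x1TB example above, the two-AGs-per-drive variant would be created along these lines (the md device name is a placeholder for the linear array built from the 16 mirror pairs):

    # 16 member drives in the linear concat, 2 allocation groups per drive.
    mkfs.xfs -d agcount=32 /dev/md100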
* Re: RAID-10 explicitly defined drive pairs?

From: NeilBrown @ 2012-01-06 20:55 UTC
To: Jan Kasprzak; +Cc: linux-raid, John Robinson

On Fri, 6 Jan 2012 16:08:23 +0100 Jan Kasprzak <kas@fi.muni.cz> wrote:

> John Robinson wrote:
> : On 12/12/2011 11:54, Jan Kasprzak wrote:
> : > Is there any way to tell mdadm explicitly how to set up
> : > the pairs of mirrored drives inside a RAID-10 volume?
> :
> : If you're using RAID10,n2 (the default layout) then adjacent pairs
> : of drives in the create command will be mirrors, so your command
> : line should be something like:
> :
> : # mdadm --create /dev/mdX -l10 -pn2 -n44 /dev/shelf1drive1
> : /dev/shelf2drive1 /dev/shelf1drive2 ...
>
> OK, this works, thanks!
>
> : Having said that, if you think there's a real chance of a shelf
> : failing, you probably ought to think about adding more redundancy
> : within the shelves so that you can survive another drive failure or
> : two while you're running on just one shelf.
>
> I am aware of that. I don't think the whole shelf will fail,
> but who knows :-)
>
> : If you are sticking with RAID10, you can potentially get double the
> : read performance using the far layout - -pf2 - and with the same
> : order of drives you can still survive a shelf failure, though your
> : use of port multipliers may well limit your performance anyway.
>
> On the older hardware I have a majority of writes, so the far
> layout is probably not good for me (reads can be cached pretty well
> at the OS level).
>
> After some experiments with my new hardware, I have discovered
> one more serious problem: I have simulated an enclosure failure,
> so half of the disks forming the RAID-10 volume disappeared.
> After removing them using mdadm --remove, and adding them back,
> iostat reports that they are resynced one disk at a time, not all
> just-added disks in parallel.
>
> Is there any way of adding more than one disk to the degraded
> RAID-10 volume, and getting the volume restored as fast as the hardware permits?
> Otherwise, it would be better for us to discard RAID-10 altogether,
> and use several independent RAID-1 volumes joined together using LVM
> (which we will probably use on top of the RAID-10 volume anyway).
>
> I have tried mdadm --add /dev/mdN /dev/sd.. /dev/sd.. /dev/sd..,
> but it behaves the same way as issuing mdadm --add one drive at a time.

I would expect that to first recover just the first device added, then recover all the rest at once.

If you:

echo frozen > /sys/block/mdN/md/sync_action
mdadm --add /dev/mdN /dev......
echo recover > /sys/block/mdN/md/sync_action

it should do them all at once.

I should teach mdadm about this..

NeilBrown
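Spelled out for a concrete (hypothetical) case where /dev/md0 has lost one shelf and its drives have come back as /dev/sdw, /dev/sdx and /dev/sdy, the sequence is:

    # Freeze recovery so the first --add does not start a one-drive rebuild,
    # add all returned drives, then let recovery run on all of them at once.
    echo frozen > /sys/block/md0/md/sync_action
    mdadm --add /dev/md0 /dev/sdw /dev/sdx /dev/sdy
    echo recover > /sys/block/md0/md/sync_action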
* Re: RAID-10 explicitly defined drive pairs?

From: Jan Kasprzak @ 2012-01-06 21:02 UTC
To: NeilBrown; +Cc: linux-raid, John Robinson

NeilBrown wrote:
: > I have tried mdadm --add /dev/mdN /dev/sd.. /dev/sd.. /dev/sd..,
: > but it behaves the same way as issuing mdadm --add one drive at a time.
:
: I would expect that to first recover just the first device added, then
: recover all the rest at once.
:
: If you:
: echo frozen > /sys/block/mdN/md/sync_action
: mdadm --add /dev/mdN /dev......
: echo recover > /sys/block/mdN/md/sync_action
:
: it should do them all at once.

Wow, it works! Thanks!

: I should teach mdadm about this..

It would be nice if mdadm --add /dev/mdN <multiple devices> did this.

-Yenya
* Re: RAID-10 explicitly defined drive pairs?

From: Alexander Lyakas @ 2012-03-22 10:01 UTC
To: NeilBrown; +Cc: Jan Kasprzak, linux-raid, John Robinson

Neil,

> echo frozen > /sys/block/mdN/md/sync_action
> mdadm --add /dev/mdN /dev......
> echo recover > /sys/block/mdN/md/sync_action
>
> it should do them all at once.
>
> I should teach mdadm about this..

What is required to do that from mdadm? I don't see any other place where MD_RECOVERY_FROZEN is set, except via sysfs. So do you suggest that mdadm use sysfs for that?

Also, what should be done if mdadm succeeds to "freeze" the array, but fails to "unfreeze" it for some reason?

Thanks,
Alex.
* Re: RAID-10 explicitly defined drive pairs? 2012-03-22 10:01 ` Alexander Lyakas @ 2012-03-22 10:31 ` NeilBrown 2012-03-25 9:30 ` Alexander Lyakas 0 siblings, 1 reply; 25+ messages in thread From: NeilBrown @ 2012-03-22 10:31 UTC (permalink / raw) To: Alexander Lyakas; +Cc: Jan Kasprzak, linux-raid, John Robinson [-- Attachment #1: Type: text/plain, Size: 832 bytes --] On Thu, 22 Mar 2012 12:01:48 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote: > Neil, > > > echo frozen > /sys/block/mdN/md/sync_action > > mdadm --add /dev/mdN /dev...... > > echo recover > /sys/block/mdN/md/sync_action > > > > it should do them all at once. > > > > I should teach mdadm about this.. > > What is required to do that from mdadm? I don't see any other place > where MD_RECOVERY_FROZEN is set, except via sysfs. So do you suggest > that mdadm use sysfs for that? Yes. http://neil.brown.name/git?p=mdadm;a=commitdiff;h=9f58469128c99c0d7f434d28657f86789334f253 > Also, what should be done if mdadm succeeds to "freeze" the array, but > fails to "unfreeze" it for some reason? What could possibly cause that? I guess if someone kills mdadm while it was running.. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
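[On the question of failing to unfreeze: until mdadm handles this itself, a defensive way to script the sequence is to make the unfreeze unconditional, so the array is not left frozen if the add step fails or the script is interrupted (a SIGKILL at the wrong moment is still unprotected, as Neil notes). This is only a sketch of the shell sequence above, not of what the linked mdadm commit does internally; /dev/md0 is a placeholder and the drives to add are passed as arguments.]

    #!/bin/sh
    # freeze, add the given devices, then always restore recovery
    md=/dev/md0
    sync_action=/sys/block/md0/md/sync_action

    echo frozen > "$sync_action" || exit 1

    # unfreeze on any exit path, whether mdadm --add succeeds or not
    trap 'echo recover > "$sync_action"' EXIT

    mdadm --add "$md" "$@"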
* Re: RAID-10 explicitly defined drive pairs? 2012-03-22 10:31 ` NeilBrown @ 2012-03-25 9:30 ` Alexander Lyakas 2012-04-04 16:56 ` Alexander Lyakas 0 siblings, 1 reply; 25+ messages in thread From: Alexander Lyakas @ 2012-03-25 9:30 UTC (permalink / raw) To: NeilBrown; +Cc: Jan Kasprzak, linux-raid, John Robinson Thanks, Neil! I will merge this in and test. On Thu, Mar 22, 2012 at 12:31 PM, NeilBrown <neilb@suse.de> wrote: > On Thu, 22 Mar 2012 12:01:48 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> > wrote: > >> Neil, >> >> > echo frozen > /sys/block/mdN/md/sync_action >> > mdadm --add /dev/mdN /dev...... >> > echo recover > /sys/block/mdN/md/sync_action >> > >> > it should do them all at once. >> > >> > I should teach mdadm about this.. >> >> What is required to do that from mdadm? I don't see any other place >> where MD_RECOVERY_FROZEN is set, except via sysfs. So do you suggest >> that mdadm use sysfs for that? > > Yes. > > http://neil.brown.name/git?p=mdadm;a=commitdiff;h=9f58469128c99c0d7f434d28657f86789334f253 > >> Also, what should be done if mdadm succeeds to "freeze" the array, but >> fails to "unfreeze" it for some reason? > > What could possibly cause that? > I guess if someone kills mdadm while it was running.. > > > NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs? 2012-03-25 9:30 ` Alexander Lyakas @ 2012-04-04 16:56 ` Alexander Lyakas 2014-06-09 14:26 ` Alexander Lyakas 0 siblings, 1 reply; 25+ messages in thread From: Alexander Lyakas @ 2012-04-04 16:56 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid Hi Neil, I fetched this commit, and also the "fixed sysfs_freeze_array array to work properly with Manage_subdevs" commit fd324b08dbfa8404558534dd0a2321213ffb7257, which looks relevant. I experimented with this, and I have some questions: Is there any special purpose for calling sysfs_attribute_available("sync_action")? If this call fails, the code returns 1, which will make the caller attempt to un-freeze. Without this call, however, sysfs_get_str("sync_action") will fail and return 0, which will prevent the caller from unfreezing, which looks better. Why do you refuse to freeze the array if it's not "idle"? What will happen is that the current recover/resync will abort, drives will be added, and on unfreezing, the array will resume (restart?) recovery with all drives. If the array was resyncing, however, it will start recovering the newly added drives, because the kernel prefers recovery over resync (as we discussed earlier). Are there any other caveats of this freezing/unfreezing that you can think of? Thanks, Alex. On Sun, Mar 25, 2012 at 12:30 PM, Alexander Lyakas <alex.bolshoy@gmail.com> wrote: > Thanks, Neil! > I will merge this in and test. > > > On Thu, Mar 22, 2012 at 12:31 PM, NeilBrown <neilb@suse.de> wrote: >> On Thu, 22 Mar 2012 12:01:48 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> >> wrote: >> >>> Neil, >>> >>> > echo frozen > /sys/block/mdN/md/sync_action >>> > mdadm --add /dev/mdN /dev...... >>> > echo recover > /sys/block/mdN/md/sync_action >>> > >>> > it should do them all at once. >>> > >>> > I should teach mdadm about this.. >>> >>> What is required to do that from mdadm? I don't see any other place >>> where MD_RECOVERY_FROZEN is set, except via sysfs. So do you suggest >>> that mdadm use sysfs for that? >> >> Yes. >> >> http://neil.brown.name/git?p=mdadm;a=commitdiff;h=9f58469128c99c0d7f434d28657f86789334f253 >> >>> Also, what should be done if mdadm succeeds to "freeze" the array, but >>> fails to "unfreeze" it for some reason? >> >> What could possibly cause that? >> I guess if someone kills mdadm while it was running.. >> >> >> NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs? 2012-04-04 16:56 ` Alexander Lyakas @ 2014-06-09 14:26 ` Alexander Lyakas 2014-06-10 0:11 ` NeilBrown 0 siblings, 1 reply; 25+ messages in thread From: Alexander Lyakas @ 2014-06-09 14:26 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid > Why do you refuse to freeze the array if it's not "idle"? What will > happen is that current recover/resync will abort, drives will be > added, and on unfreezing, array will resume (restart?) recovery with > all drives. If array was resyncing, however, it will start recovering > the newly added drives, because kernel prefers recovery over resync > (as we discussed earlier). Indeed, since commit dea3786ae2cf74ecb0087d1bea1aa04e9091ad5c, I see that you now agree to freeze the array also while it is recovering. Alex. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs? 2014-06-09 14:26 ` Alexander Lyakas @ 2014-06-10 0:11 ` NeilBrown 2014-06-11 16:05 ` Alexander Lyakas 0 siblings, 1 reply; 25+ messages in thread From: NeilBrown @ 2014-06-10 0:11 UTC (permalink / raw) To: Alexander Lyakas; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1026 bytes --] On Mon, 9 Jun 2014 17:26:38 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote: > > Why do you refuse to freeze the array if it's not "idle"? What will > > happen is that current recover/resync will abort, drives will be > > added, and on unfreezing, array will resume (restart?) recovery with > > all drives. If array was resyncing, however, it will start recovering > > the newly added drives, because kernel prefers recovery over resync > > (as we discussed earlier). > Indeed, since dea3786ae2cf74ecb0087d1bea1aa04e9091ad5c, I see that you > agree to freeze the array also in case it is recovering. I guess I did..... though I don't remember seeing the email that you have quoted. I can see it in my inbox, but it seems that I never replied. Maybe I was too busy that day :-( If there are other outstanding issues, feel free to resend. (If I don't reply it is more likely to be careless than deliberate, so in general you should feel free to resend if I don't respond in a week or so). NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs? 2014-06-10 0:11 ` NeilBrown @ 2014-06-11 16:05 ` Alexander Lyakas 0 siblings, 0 replies; 25+ messages in thread From: Alexander Lyakas @ 2014-06-11 16:05 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid Thanks, Neil. What we do now: - we refuse to freeze/re-add drives if the array is neither idle nor recovering (e.g., if it is resyncing) - otherwise, we freeze, add/re-add the drives, and unfreeze Thanks! Alex. On Tue, Jun 10, 2014 at 3:11 AM, NeilBrown <neilb@suse.de> wrote: > On Mon, 9 Jun 2014 17:26:38 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> > wrote: > >> > Why do you refuse to freeze the array if it's not "idle"? What will >> > happen is that current recover/resync will abort, drives will be >> > added, and on unfreezing, array will resume (restart?) recovery with >> > all drives. If array was resyncing, however, it will start recovering >> > the newly added drives, because kernel prefers recovery over resync >> > (as we discussed earlier). >> Indeed, since dea3786ae2cf74ecb0087d1bea1aa04e9091ad5c, I see that you >> agree to freeze the array also in case it is recovering. > > I guess I did..... though I don't remember seeing the email that you have > quoted. I can see it in my inbox, but it seems that I never replied. Maybe > I was too busy that day :-( > > If there other outstanding issues, feel free to resend. > (If I don't reply it is more likely to be careless than deliberate, so in > general you should feel free to resend if I don't respond in a week or so). > > NeilBrown ^ permalink raw reply [flat|nested] 25+ messages in thread
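[The policy Alex describes maps directly onto the sync_action attribute; a rough sketch, using the same placeholder naming as the earlier examples. It also covers the earlier question about probing for the attribute first: if sync_action is absent, simply don't attempt the freeze.]

    #!/bin/sh
    md=/dev/md0
    sync_action=/sys/block/md0/md/sync_action

    # sync_action may be absent (e.g. arrays without redundancy); don't freeze then
    [ -e "$sync_action" ] || { echo "no sync_action attribute, not freezing" >&2; exit 1; }

    case "$(cat "$sync_action")" in
        idle|recover)
            # idle or recovering: freezing loses nothing, and recovery
            # restarts with all the new drives once we unfreeze
            echo frozen > "$sync_action"
            mdadm --add "$md" "$@"
            echo recover > "$sync_action"
            ;;
        *)
            # resync/check/repair in progress: adding drives now would abort it
            # in favour of recovery, so decline
            echo "array is busy: $(cat "$sync_action")" >&2
            exit 1
            ;;
    esac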
end of thread -- Thread overview: 25+ messages: 2011-12-12 11:54 RAID-10 explicitly defined drive pairs? Jan Kasprzak 2011-12-12 15:33 ` John Robinson 2012-01-06 15:08 ` Jan Kasprzak 2012-01-06 16:39 ` Peter Grandi 2012-01-06 19:16 ` Stan Hoeppner 2012-01-06 20:11 ` Jan Kasprzak 2012-01-06 22:55 ` Stan Hoeppner 2012-01-07 14:25 ` Peter Grandi 2012-01-07 16:25 ` Stan Hoeppner 2012-01-09 13:46 ` Peter Grandi 2012-01-10 3:54 ` Stan Hoeppner 2012-01-10 4:13 ` NeilBrown 2012-01-10 16:25 ` Stan Hoeppner 2012-01-12 11:58 ` Peter Grandi 2012-01-12 12:47 ` Peter Grandi 2012-01-12 21:24 ` Stan Hoeppner 2012-01-06 20:55 ` NeilBrown 2012-01-06 21:02 ` Jan Kasprzak 2012-03-22 10:01 ` Alexander Lyakas 2012-03-22 10:31 ` NeilBrown 2012-03-25 9:30 ` Alexander Lyakas 2012-04-04 16:56 ` Alexander Lyakas 2014-06-09 14:26 ` Alexander Lyakas 2014-06-10 0:11 ` NeilBrown 2014-06-11 16:05 ` Alexander Lyakas