* mkfs.xfs states log stripe unit is too large
@ 2012-06-23 12:50 Ingo Jürgensmann
2012-06-23 23:44 ` Dave Chinner
0 siblings, 1 reply; 19+ messages in thread
From: Ingo Jürgensmann @ 2012-06-23 12:50 UTC (permalink / raw)
To: xfs
Hi!
I already brought this up yesterday on #xfs@freenode, where it was suggested that I write to this mailing list. Here I go...
I'm running Debian unstable on my desktop and lately added a new RAID set consisting of 3x 4 TB disks (namely Hitachi HDS724040ALE640). My partition layout is:
Model: ATA Hitachi HDS72404 (scsi)
Disk /dev/sdd: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number Start End Size File system Name Flags
1 17.4kB 1018kB 1000kB bios_grub
2 2097kB 212MB 210MB ext3 raid
3 212MB 1286MB 1074MB xfs raid
4 1286MB 4001GB 4000GB raid
Partition #2 is intended as /boot disk (RAID1), partition #3 as small rescue disk or swap (RAID1), partition #4 will be used as physical device for LVM (RAID5).
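For completeness, the array on the fourth partitions and the LVM stack on top of it were created with the tools' defaults; the sketch below is from memory, so the exact invocation and the volume size are illustrative rather than a transcript:

  # RAID5 over the three 4 TB partitions, no --chunk given (mdadm default applies)
  mdadm --create /dev/md7 --level=5 --raid-devices=3 /dev/sdd4 /dev/sde4 /dev/sdf4
  # LVM on top: volume group "lv", logical volume "usr"
  pvcreate /dev/md7
  vgcreate lv /dev/md7
  lvcreate -n usr -L 20G lv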
muaddib:~# mdadm --detail /dev/md7
/dev/md7:
Version : 1.2
Creation Time : Fri Jun 22 22:47:15 2012
Raid Level : raid5
Array Size : 7811261440 (7449.40 GiB 7998.73 GB)
Used Dev Size : 3905630720 (3724.70 GiB 3999.37 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Sat Jun 23 13:47:19 2012
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : muaddib:7 (local to host muaddib)
UUID : 0be7f76d:90fe734e:ac190ee4:9b5f7f34
Events : 20
Number Major Minor RaidDevice State
0 8 68 0 active sync /dev/sde4
1 8 52 1 active sync /dev/sdd4
3 8 84 2 active sync /dev/sdf4
So, a cat /proc/mdstat shows all of my RAID devices:
muaddib:~# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid5 sdf4[3] sdd4[1] sde4[0]
7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
md6 : active raid1 sdd3[0] sdf3[2] sde3[1]
1048564 blocks super 1.2 [3/3] [UUU]
md5 : active (auto-read-only) raid1 sdd2[0] sdf2[2] sde2[1]
204788 blocks super 1.2 [3/3] [UUU]
md4 : active raid5 sdc6[0] sda6[2] sdb6[1]
1938322304 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md3 : active (auto-read-only) raid1 sdc5[0] sda5[2] sdb5[1]
1052160 blocks [3/3] [UUU]
md2 : active raid1 sdc3[0] sda3[2] sdb3[1]
4192896 blocks [3/3] [UUU]
md1 : active (auto-read-only) raid1 sdc2[0] sda2[2] sdb2[1]
2096384 blocks [3/3] [UUU]
md0 : active raid1 sdc1[0] sda1[2] sdb1[1]
256896 blocks [3/3] [UUU]
unused devices: <none>
The RAID devices /dev/md0 to /dev/md4 are on my old 3x 1 TB Seagate disks. Anyway, to finally come to the problem, when I try to create a filesystem on the new RAID5 I get the following:
muaddib:~# mkfs.xfs /dev/lv/usr
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/lv/usr isize=256 agcount=16, agsize=327552 blks
= sectsz=512 attr=2, projid32bit=0
data = bsize=4096 blocks=5240832, imaxpct=25
= sunit=128 swidth=256 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
As you can see I follow the "mkfs.xfs knows best, don't fiddle around with options unless you know what you're doing!"-advice. But apparently mkfs.xfs wanted to create a log stripe unit of 512 kiB, most likely because it's the same chunk size as the underlying RAID device.
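Just to spell out what mkfs.xfs derived on its own: the equivalent explicit invocation should be roughly the following (untested, values copied from the output above):

  # data stripe geometry: 512k chunk, 2 data disks; log stripe unit pinned to 32k
  mkfs.xfs -d su=512k,sw=2 -l su=32k /dev/lv/usr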
The problem seems to be related to RAID5, because when I try to make a filesystem on /dev/md6 (RAID1), there's no error message:
muaddib:~# mkfs.xfs /dev/md6
meta-data=/dev/md6 isize=256 agcount=8, agsize=32768 blks
= sectsz=512 attr=2, projid32bit=0
data = bsize=4096 blocks=262141, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=1200, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Additional info:
I first bought two 4 TB disks and ran them for about 6 weeks as a RAID1 and already did some tests (because the 4 TB Hitachis were sold out in the meantime). I can't remember seeing the log stripe error message during those tests while working with a RAID1.
So, the question is:
- is this a bug somewhere in XFS, LVM or Linux's software RAID implementation?
- will performance suffer from log stripe size adjusted to just 32 kiB? Some of my logical volumes will just store data, but one or the other will have some workload acting as storage for BackupPC.
- would it be worth the effort to raise log stripe to at least 256 kiB?
- or would it be better to run with external log on the old 1 TB RAID?
End note: the 4 TB disks are not yet "in production", so I can run tests with both RAID setup as well as mkfs.xfs. Reshaping the RAID will take up to 10 hours, though...
--
Ciao... // Fon: 0381-2744150
Ingo \X/ http://blog.windfluechter.net
gpg pubkey: http://www.juergensmann.de/ij_public_key.asc
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 19+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
2012-06-23 12:50 mkfs.xfs states log stripe unit is too large Ingo Jürgensmann
@ 2012-06-23 23:44 ` Dave Chinner
2012-06-24 2:20 ` Eric Sandeen
2012-06-25 10:33 ` Ingo Jürgensmann
0 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2012-06-23 23:44 UTC (permalink / raw)
To: Ingo Jürgensmann; +Cc: xfs
On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote:
> muaddib:~# cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md7 : active raid5 sdf4[3] sdd4[1] sde4[0]
> 7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
.....
> The RAID devices /dev/md0 to /dev/md4 are on my old 3x 1 TB
> Seagate disks. Anyway, to finally come to the problem, when I try
> to create a filesystem on the new RAID5 I get the following:
>
> muaddib:~# mkfs.xfs /dev/lv/usr
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/lv/usr isize=256 agcount=16, agsize=327552 blks
> = sectsz=512 attr=2, projid32bit=0
> data = bsize=4096 blocks=5240832, imaxpct=25
> = sunit=128 swidth=256 blks
> naming =version 2 bsize=4096 ascii-ci=0
> log =internal log bsize=4096 blocks=2560, version=2
> = sectsz=512 sunit=8 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> As you can see I follow the "mkfs.xfs knows best, don't fiddle
> around with options unless you know what you're doing!"-advice.
> But apparently mkfs.xfs wanted to create a log stripe unit of 512
> kiB, most likely because it's the same chunk size as the
> underlying RAID device.

Exactly. Best thing in general is to align all log writes to the
underlying stripe unit of the array. That way as multiple frequent
log writes occur, it is guaranteed to form full stripe writes and
basically have no RMW overhead.

32k is chosen by default because that's the default log buffer size
and hence the typical size of log writes. If you increase the log
stripe unit, you also increase the minimum log buffer size that the
filesystem supports. The filesystem can support up to 256k log
buffers, and hence the limit on maximum log stripe alignment.

> The problem seems to be related to RAID5, because when I try to
> make a filesystem on /dev/md6 (RAID1), there's no error message:

They don't have a stripe unit/stripe width, so no alignment is
needed or configured.

> So, the question is:
> - is this a bug somewhere in XFS, LVM or Linux's software RAID
> implementation?

Not a bug at all.

> - will performance suffer from log stripe size adjusted to just 32
> kiB? Some of my logical volumes will just store data, but one or
> the other will have some workload acting as storage for BackupPC.

For data volumes, no. For BackupPC, it depends on whether the MD
RAID stripe cache can turn all the sequential log writes into a full
stripe write. In general, this is not a problem, and is almost never
a problem for HW RAID with BBWC....

> - would it be worth the effort to raise log stripe to at least 256
> kiB?

Depends on your workload. If it is fsync heavy, I'd advise against
it, as every log write will be padded out to 256k, even if you only
write 500 bytes worth of transaction data....

> - or would it be better to run with external log on the old 1 TB
> RAID?

External logs provide much less benefit with delayed logging than
they used to. As it is, your external log needs to have the same
reliability characteristics as the main volume - lose the log,
corrupt the filesystem. Hence for RAID5 volumes, you need a RAID1
log, and for RAID6 you either need RAID6 or a 3-way mirror to
provide the same reliability....

> End note: the 4 TB disks are not yet "in production", so I can run
> tests with both RAID setup as well as mkfs.xfs. Reshaping the RAID
> will take up to 10 hours, though...

IMO, RAID reshaping is just a bad idea - it changes the alignment
characteristic of the volume, hence everything that the filesystem
laid down in an aligned fashion is now unaligned, and you have to
tell the filesystem the new alignment before new files will be
correctly aligned. Also, it's usually faster to back up, recreate
and restore than reshape and that puts a lot less load on your
disks, too...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 19+ messages in thread
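To make the relationship between log stripe unit and log buffer size concrete, a minimal sketch of the combination Dave describes - device name as in this thread, values illustrative, and only sensible if the workload is not fsync heavy:

  # format with the maximum supported log stripe unit...
  mkfs.xfs -l su=256k /dev/lv/usr
  # ...and mount with log buffers at least that large
  mount -o logbsize=256k /dev/lv/usr /usr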
* Re: mkfs.xfs states log stripe unit is too large 2012-06-23 23:44 ` Dave Chinner @ 2012-06-24 2:20 ` Eric Sandeen 2012-06-24 13:05 ` Stan Hoeppner 2012-06-25 10:33 ` Ingo Jürgensmann 1 sibling, 1 reply; 19+ messages in thread From: Eric Sandeen @ 2012-06-24 2:20 UTC (permalink / raw) To: Dave Chinner; +Cc: Ingo Jürgensmann, xfs On 6/23/12 6:44 PM, Dave Chinner wrote: > On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote: >> muaddib:~# cat /proc/mdstat >> Personalities : [raid1] [raid6] [raid5] [raid4] >> md7 : active raid5 sdf4[3] sdd4[1] sde4[0] >> 7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU] > ..... > >> The RAID devices /dev/md0 to /dev/md4 are on my old 3x 1 TB >> Seagate disks. Anyway, to finally come to the problem, when I try >> to create a filesystem on the new RAID5 I get the following: >> >> muaddib:~# mkfs.xfs /dev/lv/usr >> log stripe unit (524288 bytes) is too large (maximum is 256KiB) >> log stripe unit adjusted to 32KiB ... > >> So, the question is: >> - is this a bug somewhere in XFS, LVM or Linux's software RAID >> implementation? > > Not a bug at all. Dave, I'd suggest that we should remove the warning though, if XFS picks the wrong defaults and then overrides itself. Rule of Silence: When a program has nothing surprising to say, it should say nothing. ;) -Eric _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large
2012-06-24 2:20 ` Eric Sandeen
@ 2012-06-24 13:05 ` Stan Hoeppner
2012-06-24 13:17 ` Ingo Jürgensmann
2012-06-24 15:03 ` Ingo Jürgensmann
0 siblings, 2 replies; 19+ messages in thread
From: Stan Hoeppner @ 2012-06-24 13:05 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Ingo Jürgensmann, xfs
On 6/23/2012 9:20 PM, Eric Sandeen wrote:
> On 6/23/12 6:44 PM, Dave Chinner wrote:
>> On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote:
>>> muaddib:~# cat /proc/mdstat
>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>> md7 : active raid5 sdf4[3] sdd4[1] sde4[0]
>>> 7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
                                         ^^^^^^^^^^

The log stripe unit mismatch error is a direct result of Ingo
manually choosing a rather large chunk size for his two stripe spindle
md array, yielding a 1MB stripe, and using an internal log with it.
Maybe there is a good reason for this, but I'm going to challenge it.

The default md chunk size IIRC is 64KB, 8x smaller than Ingo's chunk.
With the default it would require 16 stripe spindles to reach a 1MB
stripe. Ingo has TWO stripe spindles.

In the default case with a 1MB stripe and 16 spindles, each aligned
aggregated stripe write out will be 256 XFS blocks, or 16 blocks to
each spindle, 128 sectors (512 byte). In Ingo's case, it will be 128
XFS blocks, 1024 sectors.

Does BackupPC perform better writing 2048 sectors per stripe write,
1024 per spindle, with two spindles, than 256 sectors per stripe write,
128 per spindle, using two spindles?

>> .....
>>
>>> The RAID devices /dev/md0 to /dev/md4 are on my old 3x 1 TB
>>> Seagate disks. Anyway, to finally come to the problem, when I try
>>> to create a filesystem on the new RAID5 I get the following:
>>>
>>> muaddib:~# mkfs.xfs /dev/lv/usr
>>> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
>>> log stripe unit adjusted to 32KiB
>
> ...
>
>>> So, the question is:
>>> - is this a bug somewhere in XFS, LVM or Linux's software RAID
>>> implementation?
>>
>> Not a bug at all.
>
> Dave, I'd suggest that we should remove the warning though, if XFS picks
> the wrong defaults and then overrides itself.
>
> Rule of Silence: When a program has nothing surprising to say, it should say nothing.

I think this goes to the heart of the matter. Ingo chose an arbitrarily
large chunk size apparently without understanding the ramifications.
mkfs.xfs was written to read md parameters, I believe, with the
assumption that the parameters were md defaults. It obviously wasn't
written to gracefully deal with a manually configured, arbitrarily
large md chunk size.

Maybe a better solution than silence here would be education. Flag the
mismatch as we do now, and provide a URL to a new FAQ entry that
explains why this occurs, and possible solutions to the problem, the
first recommendation being to choose a sane chunk size.

Question: does this occur with hardware RAID when entering all the same
parameters manually on the command line? Or is this error limited to
the md interrogation path?

--
Stan
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large 2012-06-24 13:05 ` Stan Hoeppner @ 2012-06-24 13:17 ` Ingo Jürgensmann 2012-06-24 19:28 ` Stan Hoeppner 2012-06-24 15:03 ` Ingo Jürgensmann 1 sibling, 1 reply; 19+ messages in thread From: Ingo Jürgensmann @ 2012-06-24 13:17 UTC (permalink / raw) To: xfs; +Cc: stan Am 24.06.2012 um 15:05 schrieb Stan Hoeppner: > On 6/23/2012 9:20 PM, Eric Sandeen wrote: >> On 6/23/12 6:44 PM, Dave Chinner wrote: >>> On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote: >>>> muaddib:~# cat /proc/mdstat >>>> Personalities : [raid1] [raid6] [raid5] [raid4] >>>> md7 : active raid5 sdf4[3] sdd4[1] sde4[0] >>>> 7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU] > ^^^^^^^^^^ > > The the log stripe unit mismatch error is a direct result of Ingo > manually choosing a rather large chunk size for his two stripe spindle > md array, yielding a 1MB stripe, and using an internal log with it. > Maybe there is a good reason for this, but I'm going to challenge it. Correction: I did not manually choose that chunk size, but it was automatically chosen by mdadm when creating the RAID5. > The default md chunk size IIRC is 64KB, 8x smaller than Ingo's chunk. 64k is the default for creating RAIDs with 0.90 format superblock. My RAID5 has a 1.2 format superblock. > Does backup PC perform better writing 2048 sectors per stripe write, > 1024 per spindle, with two spindles, than 256 sectors per stripe write, > 128 per spindle, using two spindles? Don't know how BackupPC actually writes the data, but it does make extensive use of hardlinks to save some diskspace. Some sort of deduplicating, if you like to say it that way. >> Rule of Silence: When a program has nothing surprising to say, it should say nothing. > I think this goes to the heart of the matter. Ingo chose an arbitrarily > large chunk size apparently without understanding the ramifications. That's wrong! I've just worked with the defaults. -- Ciao... // Fon: 0381-2744150 Ingo \X/ http://blog.windfluechter.net gpg pubkey: http://www.juergensmann.de/ij_public_key.asc _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large 2012-06-24 13:17 ` Ingo Jürgensmann @ 2012-06-24 19:28 ` Stan Hoeppner 2012-06-24 19:51 ` Ingo Jürgensmann 0 siblings, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-24 19:28 UTC (permalink / raw) To: Ingo Jürgensmann; +Cc: xfs On 6/24/2012 8:17 AM, Ingo Jürgensmann wrote: > Am 24.06.2012 um 15:05 schrieb Stan Hoeppner: > >> On 6/23/2012 9:20 PM, Eric Sandeen wrote: >>> On 6/23/12 6:44 PM, Dave Chinner wrote: >>>> On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote: >>>>> muaddib:~# cat /proc/mdstat >>>>> Personalities : [raid1] [raid6] [raid5] [raid4] >>>>> md7 : active raid5 sdf4[3] sdd4[1] sde4[0] >>>>> 7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU] >> ^^^^^^^^^^ >> >> The the log stripe unit mismatch error is a direct result of Ingo >> manually choosing a rather large chunk size for his two stripe spindle >> md array, yielding a 1MB stripe, and using an internal log with it. >> Maybe there is a good reason for this, but I'm going to challenge it. > > Correction: I did not manually choose that chunk size, but it was automatically chosen by mdadm when creating the RAID5. > >> The default md chunk size IIRC is 64KB, 8x smaller than Ingo's chunk. > > 64k is the default for creating RAIDs with 0.90 format superblock. My RAID5 has a 1.2 format superblock. > >> Does backup PC perform better writing 2048 sectors per stripe write, >> 1024 per spindle, with two spindles, than 256 sectors per stripe write, >> 128 per spindle, using two spindles? > > Don't know how BackupPC actually writes the data, but it does make extensive use of hardlinks to save some diskspace. Some sort of deduplicating, if you like to say it that way. > >>> Rule of Silence: When a program has nothing surprising to say, it should say nothing. >> I think this goes to the heart of the matter. Ingo chose an arbitrarily >> large chunk size apparently without understanding the ramifications. > > That's wrong! I've just worked with the defaults. At this point I get the feeling you're sandbagging us Ingo. AFAIK you have the distinction of being the very first person on earth to report this problem. This would suggest you're the first XFS user with an internal log to use the mdadm defaults. Do you think that's likely? Thus, I'd guess that the metadata format changed from 0.90 to 1.2 with a very recent release of mdadm. Are you using distro supplied mdadm, a backported more recent mdadm, or did you build mdadm from the most recent source? If either of the latter two, don't you think it would have been wise to inform us that "hay, I'm using the bleeding edge mdadm just released"? Or if you're using a brand new distro release? -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large 2012-06-24 19:28 ` Stan Hoeppner @ 2012-06-24 19:51 ` Ingo Jürgensmann 2012-06-24 22:15 ` Stan Hoeppner 0 siblings, 1 reply; 19+ messages in thread From: Ingo Jürgensmann @ 2012-06-24 19:51 UTC (permalink / raw) To: stan; +Cc: xfs Am 24.06.2012 um 21:28 schrieb Stan Hoeppner: > Thus, I'd guess that the metadata format changed from 0.90 to 1.2 with a > very recent release of mdadm. Are you using distro supplied mdadm, a > backported more recent mdadm, or did you build mdadm from the most > recent source? As I already wrote, I'm using Debian unstable, therefore distro supplied mdadm. Otherwise I'd have said this. > If either of the latter two, don't you think it would have been wise to > inform us that "hay, I'm using the bleeding edge mdadm just released"? > Or if you're using a brand new distro release? I don't think that Debian unstable is bleeding edge. I find it strange that you've misinterpreted citing the mdadm man page as "sandbagging us". =:-O -- Ciao... // Fon: 0381-2744150 Ingo \X/ http://blog.windfluechter.net gpg pubkey: http://www.juergensmann.de/ij_public_key.asc _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large 2012-06-24 19:51 ` Ingo Jürgensmann @ 2012-06-24 22:15 ` Stan Hoeppner 2012-06-25 5:25 ` Ingo Jürgensmann 0 siblings, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-24 22:15 UTC (permalink / raw) To: Ingo Jürgensmann; +Cc: xfs On 6/24/2012 2:51 PM, Ingo Jürgensmann wrote: > Am 24.06.2012 um 21:28 schrieb Stan Hoeppner: > >> Thus, I'd guess that the metadata format changed from 0.90 to 1.2 with a >> very recent release of mdadm. Are you using distro supplied mdadm, a >> backported more recent mdadm, or did you build mdadm from the most >> recent source? > > As I already wrote, I'm using Debian unstable, therefore distro supplied mdadm. Otherwise I'd have said this. Yes, you did mention SID, and I missed it. SID is the problem here, or I should say, the cause of the error message. SID is leading (better?) edge, and is obviously using a recent mdadm release, which defaults to metadata 1.2, and chunk of 512KB. As more distros adopt newer mdadm, reports of this will be more prevalent. So Eric's idea is likely preferable than mine. XFS making a recommendation against an md default would fly like a lead balloon... > I don't think that Debian unstable is bleeding edge. It's apparently close enough in the case of mdadm, given you're the first to report this, AFAIK. > I find it strange that you've misinterpreted citing the mdadm man page as "sandbagging us". =:-O Sandbagging simply means holding something back, withholding information. Had you actually not mentioned your OS/version, this would have been an accurate take on the situation. But again, youd did, and I simply missed it. So again, my apologies for missing your mention of SID in your opening email. That would have prevented my skeptical demeanor. -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large 2012-06-24 22:15 ` Stan Hoeppner @ 2012-06-25 5:25 ` Ingo Jürgensmann [not found] ` <4FE8CEED.7070505@hardwarefreak.com> 0 siblings, 1 reply; 19+ messages in thread From: Ingo Jürgensmann @ 2012-06-25 5:25 UTC (permalink / raw) To: stan; +Cc: xfs On 2012-06-25 00:15, Stan Hoeppner wrote: >> As I already wrote, I'm using Debian unstable, therefore distro >> supplied mdadm. Otherwise I'd have said this. > SID is the problem here, or I should say, the cause of the error > message. SID is leading (better?) edge, and is obviously using a > recent > mdadm release, which defaults to metadata 1.2, and chunk of 512KB. > As more distros adopt newer mdadm, reports of this will be more > prevalent. So Eric's idea is likely preferable than mine. XFS > making a > recommendation against an md default would fly like a lead balloon... Actually, even man page of Debian stable (Squeeze) mentions: -c, --chunk= Specify chunk size of kibibytes. The default when creating an array is 512KB. To ensure compatibility with earlier versions, the default when Building and array with no persis‐ tent metadata is 64KB. This is only meaningful for RAID0, RAID4, RAID5, RAID6, and RAID10. So, the question is: why did mdadm choose 1.2 format superblock this time? My guess is, that's because of GPT labelled disks instead of MBR, but it's only a guess. Maybe it's because the new md device is bigger in size. All of my other md devices on MBR labelled disks do have 0.90 format superblock, all md devices on the GPT disks are of 1.2 format. Although it doesn't seem a new default in mdadm for me, your assumption would still stand if the cause would turn out to be the GPT label. More and more people will start using GPT labelled disks. >> I find it strange that you've misinterpreted citing the mdadm man >> page as "sandbagging us". =:-O > Sandbagging simply means holding something back, withholding > information. Are ok, I misread "sandboxing us" as "boxing onto us like at a sandbox". So, my apologies here. :-) -- Ciao... // Fon: 0381-2744150 . Ingo \X/ http://blog.windfluechter.net gpg pubkey: http://www.juergensmann.de/ij_public_key. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
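For checking which superblock format an array actually uses, or forcing one at creation time, the usual mdadm invocations look roughly like this (device names as in this thread; the commands are a sketch, not a record of what was run):

  # report the metadata version of the assembled array
  mdadm --detail /dev/md7 | grep Version
  # or inspect a member device directly
  mdadm --examine /dev/sdd4 | grep Version
  # force the old 0.90 format explicitly, if that is really wanted
  mdadm --create /dev/md7 --level=5 --raid-devices=3 --metadata=0.90 /dev/sdd4 /dev/sde4 /dev/sdf4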
[parent not found: <4FE8CEED.7070505@hardwarefreak.com>]
* Re: mkfs.xfs states log stripe unit is too large [not found] ` <4FE8CEED.7070505@hardwarefreak.com> @ 2012-06-25 21:18 ` Ingo Jürgensmann 0 siblings, 0 replies; 19+ messages in thread From: Ingo Jürgensmann @ 2012-06-25 21:18 UTC (permalink / raw) To: stan; +Cc: xfs On 2012-06-25 22:49, Stan Hoeppner wrote: > I've never understand exactly what this means, but it's apparently > involved with some of the arrays you've built with Stable and SID: > > "To ensure compatibility with earlier versions, the default when > Building and array with no persistent metadata is 64KB." > > How does one "build an array with no persistent metadata"? Does this > simply mean forcing metadata .90 on the command line? IIRC, the metadata in 1.2 is populated over the RAID whereas in 0.90 it was only at the beginning. But take that with care. I've no source for that assumption. It's only somewhere in my mind that I think I might have read about this somewhere, somewhen... Someone else will know better and correct me for sure. :-) >> So, the question is: why did mdadm choose 1.2 format superblock this >> time? My guess is, that's because of GPT labelled disks instead of >> MBR, >> but it's only a guess. Maybe it's because the new md device is >> bigger in >> size. All of my other md devices on MBR labelled disks do have 0.90 >> format superblock, all md devices on the GPT disks are of 1.2 >> format. >> Although it doesn't seem a new default in mdadm for me, your >> assumption >> would still stand if the cause would turn out to be the GPT label. >> More >> and more people will start using GPT labelled disks. > Ok this is really interesting as this is undocumented behavior, if > indeed this is occurring. Would you mind firing up a thread about > this > on the linux-raid list? I've talked to some guys on #debian.de in the meantime. I don't think now that this has anything to do with GPT labels. According to #debian.de the default behaviour in mdadm was changed after release of Squeeze. Already before Squeeze, metadata format 0.90 was obsolete and was only kept for Squeeze for backward compatibility reasons. So, it's indeed a changed default within Debian, but nothing new for upstream mdadm and it's likely that other distros have adopted the upstream default way before Debian did. -- Ciao... // Fon: 0381-2744150 . Ingo \X/ http://blog.windfluechter.net gpg pubkey: http://www.juergensmann.de/ij_public_key. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large
2012-06-24 13:05 ` Stan Hoeppner
2012-06-24 13:17 ` Ingo Jürgensmann
@ 2012-06-24 15:03 ` Ingo Jürgensmann
2012-06-26 2:30 ` Dave Chinner
1 sibling, 1 reply; 19+ messages in thread
From: Ingo Jürgensmann @ 2012-06-24 15:03 UTC (permalink / raw)
To: xfs
On 2012-06-24 15:05, Stan Hoeppner wrote:
> The the log stripe unit mismatch error is a direct result of Ingo
> manually choosing a rather large chunk size for his two stripe
> spindle
> md array, yielding a 1MB stripe, and using an internal log with it.
> Maybe there is a good reason for this, but I'm going to challenge it.

To cite man mdadm:

  -c, --chunk=
         Specify chunk size of kibibytes. The default when
         creating an array is 512KB. To ensure compatibility
         with earlier versions, the default when Building and
         array with no persistent metadata is 64KB. This is
         only meaningful for RAID0, RAID4, RAID5, RAID6, and
         RAID10.

So, actually there's a mismatch with the defaults of mdadm and
mkfs.xfs. Maybe it's worthwhile to think of raising the log stripe
maximum size to at least 512 kiB? I don't know what implications
this could have, though...

--
Ciao... // Fon: 0381-2744150 .
Ingo \X/ http://blog.windfluechter.net
gpg pubkey: http://www.juergensmann.de/ij_public_key.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 19+ messages in thread
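If the mismatch is to be resolved on the md side rather than in XFS, the chunk size can simply be given explicitly when the array is created; a sketch with the old 64K default (device names as in this thread):

  mdadm --create /dev/md7 --level=5 --raid-devices=3 --chunk=64 /dev/sdd4 /dev/sde4 /dev/sdf4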
* Re: mkfs.xfs states log stripe unit is too large
2012-06-24 15:03 ` Ingo Jürgensmann
@ 2012-06-26 2:30 ` Dave Chinner
2012-06-26 8:02 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Dave Chinner @ 2012-06-26 2:30 UTC (permalink / raw)
To: Ingo Jürgensmann; +Cc: xfs
On Sun, Jun 24, 2012 at 05:03:47PM +0200, Ingo Jürgensmann wrote:
> On 2012-06-24 15:05, Stan Hoeppner wrote:
>
> >The the log stripe unit mismatch error is a direct result of Ingo
> >manually choosing a rather large chunk size for his two stripe
> >spindle
> >md array, yielding a 1MB stripe, and using an internal log with it.
> >Maybe there is a good reason for this, but I'm going to challenge it.
>
> To cite man mdadm:
>
>   -c, --chunk=
>          Specify chunk size of kibibytes. The default when
>          creating an array is 512KB. To ensure compatibility
>          with earlier versions, the default when Building and
>          array with no persistent metadata is 64KB. This is
>          only meaningful for RAID0, RAID4, RAID5, RAID6, and
>          RAID10.
>
> So, actually there's a mismatch with the default of mdadm an
> mkfs.xfs. Maybe it's worthwhile to think of raising the log stripe
> maximum size to at least 512 kiB? I don't know what implications
> this could have, though...

You can't, simple as that. The maximum supported is 256k. As it is,
a default chunk size of 512k is probably harmful to most workloads -
large chunk sizes mean that just about every write will trigger a
RMW cycle in the RAID because it is pretty much impossible to issue
full stripe writes. Writeback doesn't do any alignment of IO (the
generic page cache writeback path is the problem here), so we will
almost always be doing unaligned IO to the RAID, and there will be
little opportunity for sequential IOs to merge and form full stripe
writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).

IOWs, every time you do a small isolated write, the MD RAID volume
will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
Given that most workloads are not doing lots and lots of large
sequential writes this is, IMO, a pretty bad default given typical
RAID5/6 volume configurations we see....

Without the warning, nobody would have noticed this. I think the
warning has value - even if it is just to indicate MD now uses a
bad default value for common workloads..

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 19+ messages in thread
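The arithmetic behind those numbers is easy to reproduce; a quick shell sketch using the RAID6 example from the paragraph above (not the original poster's 3-disk RAID5):

  chunk_kb=512
  disks=24
  parity=2                            # RAID6
  data_disks=$((disks - parity))
  echo "data in one full stripe:  $((chunk_kb * data_disks)) KiB"   # ~11MB read on RMW
  echo "written including parity: $((chunk_kb * disks)) KiB"        # ~12MB written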
* Re: mkfs.xfs states log stripe unit is too large 2012-06-26 2:30 ` Dave Chinner @ 2012-06-26 8:02 ` Christoph Hellwig [not found] ` <20120702061827.GB16671@infradead.org> 2012-06-26 19:34 ` Ingo Jürgensmann 2012-06-27 2:06 ` Eric Sandeen 2 siblings, 1 reply; 19+ messages in thread From: Christoph Hellwig @ 2012-06-26 8:02 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-raid, Ingo J?rgensmann, xfs On Tue, Jun 26, 2012 at 12:30:59PM +1000, Dave Chinner wrote: > You can't, simple as that. The maximum supported is 256k. As it is, > a default chunk size of 512k is probably harmful to most workloads - > large chunk sizes mean that just about every write will trigger a > RMW cycle in the RAID because it is pretty much impossible to issue > full stripe writes. Writeback doesn't do any alignment of IO (the > generic page cache writeback path is the problem here), so we will > lamost always be doing unaligned IO to the RAID, and there will be > little opportunity for sequential IOs to merge and form full stripe > writes (24 disks @ 512k each on RAID6 is a 11MB full stripe write). > > IOWs, every time you do a small isolated write, the MD RAID volume > will do a RMW cycle, reading 11MB and writing 12MB of data to disk. > Given that most workloads are not doing lots and lots of large > sequential writes this is, IMO, a pretty bad default given typical > RAID5/6 volume configurations we see.... Not too long ago I benchmarked out mdraid stripe sizes, and at least for XFS 32kb was a clear winner, anything larger decreased performance. ext4 didn't get hit that badly with larger stripe sizes, probably because they still internally bump the writeback size like crazy, but they did not actually get faster with larger stripes either. This was streaming data heavy workloads, anything more metadata heavy probably will suffer from larger stripes even more. Ccing the linux-raid list if there actually is any reason for these defaults, something I wanted to ask for a long time but never really got back to. Also I'm pretty sure back then the md default was 256kb writes, not 512 so it seems the defaults further increased. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <20120702061827.GB16671@infradead.org>]
* Re: mkfs.xfs states log stripe unit is too large [not found] ` <20120702061827.GB16671@infradead.org> @ 2012-07-02 6:41 ` NeilBrown 2012-07-02 8:08 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2012-07-02 6:41 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-raid, Ingo J?rgensmann, xfs [-- Attachment #1.1: Type: text/plain, Size: 3438 bytes --] On Mon, 2 Jul 2012 02:18:27 -0400 Christoph Hellwig <hch@infradead.org> wrote: > Ping to Neil / the raid list. Thanks for the reminder. > > On Tue, Jun 26, 2012 at 04:02:17AM -0400, Christoph Hellwig wrote: > > On Tue, Jun 26, 2012 at 12:30:59PM +1000, Dave Chinner wrote: > > > You can't, simple as that. The maximum supported is 256k. As it is, > > > a default chunk size of 512k is probably harmful to most workloads - > > > large chunk sizes mean that just about every write will trigger a > > > RMW cycle in the RAID because it is pretty much impossible to issue > > > full stripe writes. Writeback doesn't do any alignment of IO (the > > > generic page cache writeback path is the problem here), so we will > > > lamost always be doing unaligned IO to the RAID, and there will be > > > little opportunity for sequential IOs to merge and form full stripe > > > writes (24 disks @ 512k each on RAID6 is a 11MB full stripe write). > > > > > > IOWs, every time you do a small isolated write, the MD RAID volume > > > will do a RMW cycle, reading 11MB and writing 12MB of data to disk. > > > Given that most workloads are not doing lots and lots of large > > > sequential writes this is, IMO, a pretty bad default given typical > > > RAID5/6 volume configurations we see.... > > > > Not too long ago I benchmarked out mdraid stripe sizes, and at least > > for XFS 32kb was a clear winner, anything larger decreased performance. > > > > ext4 didn't get hit that badly with larger stripe sizes, probably > > because they still internally bump the writeback size like crazy, but > > they did not actually get faster with larger stripes either. > > > > This was streaming data heavy workloads, anything more metadata heavy > > probably will suffer from larger stripes even more. > > > > Ccing the linux-raid list if there actually is any reason for these > > defaults, something I wanted to ask for a long time but never really got > > back to. > > > > Also I'm pretty sure back then the md default was 256kb writes, not 512 > > so it seems the defaults further increased. "originally" the default chunksize was 64K. It was changed in late 2009 to 512K - this first appeared in mdadm 3.1.1 I don't recall the details of why it was changed but I'm fairly sure that it was based on measurements that I had made and measurements that others had made. I suspect the tests were largely run on ext3. I don't think there is anything close to a truly optimal chunk size. What works best really depends on your hardware, your filesystem, and your work load. If 512K is always suboptimal for XFS then that is unfortunate but I don't think it is really possible to choose a default that everyone will be happy with. Maybe we just need more documentation and warning emitted by various tools. Maybe mkfs.xfs could augment the "stripe unit too large" message with some text about choosing a smaller chunk size? 
NeilBrown > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > ---end quoted text--- > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #1.2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large
2012-07-02 6:41 ` NeilBrown
@ 2012-07-02 8:08 ` Dave Chinner
2012-07-09 12:02 ` kedacomkernel
0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2012-07-02 8:08 UTC (permalink / raw)
To: NeilBrown; +Cc: Christoph Hellwig, linux-raid, Ingo Jürgensmann, xfs
On Mon, Jul 02, 2012 at 04:41:13PM +1000, NeilBrown wrote:
> On Mon, 2 Jul 2012 02:18:27 -0400 Christoph Hellwig <hch@infradead.org> wrote:
>
> > Ping to Neil / the raid list.
>
> Thanks for the reminder.
>
> > On Tue, Jun 26, 2012 at 04:02:17AM -0400, Christoph Hellwig wrote:
> > > On Tue, Jun 26, 2012 at 12:30:59PM +1000, Dave Chinner wrote:
> > > > You can't, simple as that. The maximum supported is 256k. As it is,
> > > > a default chunk size of 512k is probably harmful to most workloads -
> > > > large chunk sizes mean that just about every write will trigger a
> > > > RMW cycle in the RAID because it is pretty much impossible to issue
> > > > full stripe writes. Writeback doesn't do any alignment of IO (the
> > > > generic page cache writeback path is the problem here), so we will
> > > > lamost always be doing unaligned IO to the RAID, and there will be
> > > > little opportunity for sequential IOs to merge and form full stripe
> > > > writes (24 disks @ 512k each on RAID6 is a 11MB full stripe write).
> > > >
> > > > IOWs, every time you do a small isolated write, the MD RAID volume
> > > > will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
> > > > Given that most workloads are not doing lots and lots of large
> > > > sequential writes this is, IMO, a pretty bad default given typical
> > > > RAID5/6 volume configurations we see....
> > >
> > > Not too long ago I benchmarked out mdraid stripe sizes, and at least
> > > for XFS 32kb was a clear winner, anything larger decreased performance.
> > >
> > > ext4 didn't get hit that badly with larger stripe sizes, probably
> > > because they still internally bump the writeback size like crazy, but
> > > they did not actually get faster with larger stripes either.
> > >
> > > This was streaming data heavy workloads, anything more metadata heavy
> > > probably will suffer from larger stripes even more.
> > >
> > > Ccing the linux-raid list if there actually is any reason for these
> > > defaults, something I wanted to ask for a long time but never really got
> > > back to.
> > >
> > > Also I'm pretty sure back then the md default was 256kb writes, not 512
> > > so it seems the defaults further increased.
>
> "originally" the default chunksize was 64K.
> It was changed in late 2009 to 512K - this first appeared in mdadm 3.1.1
>
> I don't recall the details of why it was changed but I'm fairly sure that
> it was based on measurements that I had made and measurements that others had
> made. I suspect the tests were largely run on ext3.
>
> I don't think there is anything close to a truly optimal chunk size. What
> works best really depends on your hardware, your filesystem, and your work
> load.

That's true, but the characteristics of spinning disks have not
changed in the past 20 years, nor have the typical file size
distributions in filesystems, nor have the RAID5/6 algorithms. So
it's not really clear to me why you'd even consider changing the
default - the downsides of large chunk sizes on RAID5/6 volumes are
well known. This may well explain the apparent increase in "XFS has
hung but it's really just waiting for lots of really slow IO on MD"
cases I've seen over the past couple of years.
The only time I'd ever consider stripe -widths- of more than 512k or 1MB with RAID5/6 is if I knew my workload is almost exclusively using large files and sequential access with little metadata load, and there's relatively few workloads where that is the case. Typically those workloads measure throughput in GB/s and everyone uses hardware RAID for them because MD simply doesn't scale to this sort of usage. > If 512K is always suboptimal for XFS then that is unfortunate but I don't I think 512k chunk sizes are suboptimal for most users, regardless of the filesystem or workload.... > think it is really possible to choose a default that everyone will be happy > with. Maybe we just need more documentation and warning emitted by various > tools. Maybe mkfs.xfs could augment the "stripe unit too large" message with > some text about choosing a smaller chunk size? We work to the mantra that XFS should always choose the defaults that give the best overall performance and aging characteristics so users don't need to be a storage expert to get the best the filesystem can offer. The XFS warning is there to indicate that the user might be doing something wrong. If that's being emitted with a default MD configuration, then that indicates that the MD defaults need to be revised.... If you know what a stripe unit or chunk size is, then you know how to deal with the problem. But for the majority of people, that's way more knowledge than they are prepared to learn about or should be forced to learn about. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Re: mkfs.xfs states log stripe unit is too large 2012-07-02 8:08 ` Dave Chinner @ 2012-07-09 12:02 ` kedacomkernel 0 siblings, 0 replies; 19+ messages in thread From: kedacomkernel @ 2012-07-09 12:02 UTC (permalink / raw) To: Dave Chinner, Neil Brown Cc: Christoph Hellwig, linux-raid, Ingo J?rgensmann, xfs On 2012-07-02 16:08 Dave Chinner <david@fromorbit.com> Wrote: >On Mon, Jul 02, 2012 at 04:41:13PM +1000, NeilBrown wrote: >> On Mon, 2 Jul 2012 02:18:27 -0400 Christoph Hellwig <hch@infradead.org> wrote: >> >> > Ping to Neil / the raid list. >> >> Thanks for the reminder. >> >> > [snip] > >That's true, but the characterisitics of spinning disks have not >changed in the past 20 years, nor has the typical file size >distributions in filesystems, nor have the RAID5/6 algorithms. So >it's not really clear to me why you;d woul deven consider changing >the default the downsides of large chunk sizes on RAID5/6 volumes is >well known. This may well explain the apparent increase in "XFS has >hung but it's really just waiting for lots of really slow IO on MD" >cases I've seen over the past couple of years. > At present, cat /sys/block/sdb/queue/max_sectors_kb: is 512k. Maybe because this. >The only time I'd ever consider stripe -widths- of more than 512k or >1MB with RAID5/6 is if I knew my workload is almost exclusively >using large files and sequential access with little metadata load, >and there's relatively few workloads where that is the case. >Typically those workloads measure throughput in GB/s and everyone >uses hardware RAID for them because MD simply doesn't scale to this >sort of usage. > >> If 512K is always suboptimal for XFS then that is unfortunate but I don't > >I think 512k chunk sizes are suboptimal for most users, regardless >of the filesystem or workload.... > >> think it is really possible to choose a default that everyone will be happy >> with. Maybe we just need more documentation and warning emitted by various >> tools. Maybe mkfs.xfs could augment the "stripe unit too large" message with >> some text about choosing a smaller chunk size? > >We work to the mantra that XFS should always choose the defaults >that give the best overall performance and aging characteristics so >users don't need to be a storage expert to get the best the >filesystem can offer. The XFS warning is there to indicate that the >user might be doing something wrong. If that's being emitted with a >default MD configuration, then that indicates that the MD defaults >need to be revised.... > >If you know what a stripe unit or chunk size is, then you know how >to deal with the problem. But for the majority of people, that's way >more knowledge than they are prepared to learn about or should be >forced to learn about. > >Cheers, > >Dave. >-- >Dave Chinner >david@fromorbit.com >-- >To unsubscribe from this list: send the line "unsubscribe linux-raid" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large
2012-06-26 2:30 ` Dave Chinner
2012-06-26 8:02 ` Christoph Hellwig
@ 2012-06-26 19:34 ` Ingo Jürgensmann
2012-06-27 2:06 ` Eric Sandeen
2 siblings, 0 replies; 19+ messages in thread
From: Ingo Jürgensmann @ 2012-06-26 19:34 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Am 26.06.2012 um 04:30 schrieb Dave Chinner:
> On Sun, Jun 24, 2012 at 05:03:47PM +0200, Ingo Jürgensmann wrote:
>> On 2012-06-24 15:05, Stan Hoeppner wrote:
>>
>>> The the log stripe unit mismatch error is a direct result of Ingo
>>> manually choosing a rather large chunk size for his two stripe
>>> spindle
>>> md array, yielding a 1MB stripe, and using an internal log with it.
>>> Maybe there is a good reason for this, but I'm going to challenge it.
>>
>> To cite man mdadm:
>>
>>   -c, --chunk=
>>          Specify chunk size of kibibytes. The default when
>>          creating an array is 512KB. To ensure compatibility
>>          with earlier versions, the default when Building and
>>          array with no persistent metadata is 64KB. This is
>>          only meaningful for RAID0, RAID4, RAID5, RAID6, and
>>          RAID10.
>>
>> So, actually there's a mismatch with the default of mdadm an
>> mkfs.xfs. Maybe it's worthwhile to think of raising the log stripe
>> maximum size to at least 512 kiB? I don't know what implications
>> this could have, though...
>
> You can't, simple as that. The maximum supported is 256k. As it is,
> a default chunk size of 512k is probably harmful to most workloads -
> large chunk sizes mean that just about every write will trigger a
> RMW cycle in the RAID because it is pretty much impossible to issue
> full stripe writes. Writeback doesn't do any alignment of IO (the
> generic page cache writeback path is the problem here), so we will
> lamost always be doing unaligned IO to the RAID, and there will be
> little opportunity for sequential IOs to merge and form full stripe
> writes (24 disks @ 512k each on RAID6 is a 11MB full stripe write).
>
> IOWs, every time you do a small isolated write, the MD RAID volume
> will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
> Given that most workloads are not doing lots and lots of large
> sequential writes this is, IMO, a pretty bad default given typical
> RAID5/6 volume configurations we see....
>
> Without the warning, nobody would have noticed this. I think the
> warning has value - even if it is just to indicate MD now uses a
> bad default value for common workloads..

Seconded. But I think the warning, as it is, can confuse the user - like me. ;)

Maybe you can add an URL to this warning message and point it to a detailed explanation:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Q: mkfs.xfs states log stripe unit is too large

A: On RAID devices created with mdadm and a 1.2 format superblock, the
default chunk size is 512 kiB. When creating a filesystem with mkfs.xfs
on top of such a device, mkfs.xfs will use the chunk size of the
underlying RAID device to set some parameters of the filesystem, e.g.
log stripe size. XFS is limited to 256 kiB of log stripe size, so
mkfs.xfs falls back to its default value of 32 kiB when it can't use
larger values from underlying chunk sizes. This is, in general, a good
decision for your filesystem.

Best thing in general is to align all log writes to the underlying
stripe unit of the array. That way as multiple frequent log writes
occur, it is guaranteed to form full stripe writes and basically have
no RMW overhead.

32k is chosen by default because that's the default log buffer size and
hence the typical size of log writes. If you increase the log stripe
unit, you also increase the minimum log buffer size that the filesystem
supports. The filesystem can support up to 256k log buffers, and hence
the limit on maximum log stripe alignment. The maximum supported log
stripe size in XFS is 256k.

As it is, a default chunk size of 512k is probably harmful to most
workloads - large chunk sizes mean that just about every write will
trigger a RMW cycle in the RAID because it is pretty much impossible to
issue full stripe writes. Writeback doesn't do any alignment of IO (the
generic page cache writeback path is the problem here), so we will
almost always be doing unaligned IO to the RAID, and there will be
little opportunity for sequential IOs to merge and form full stripe
writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).

IOWs, every time you do a small isolated write, the MD RAID volume will
do a RMW cycle, reading 11MB and writing 12MB of data to disk. Given
that most workloads are not doing lots and lots of large sequential
writes this is, IMO, a pretty bad default given typical RAID5/6 volume
configurations we see....

When benchmarking mdraid stripe sizes, a size of 32kb for XFS is a
clear winner; anything larger decreases performance.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

As you can see, I've condensed some answers from Dave and Christoph that
helped me to understand the issue and the implications of log stripe
size. I would welcome a FAQ entry and an URL to it included in the
already existing warning message.

Regardless whether you will do so, I've blogged today about this issue
and the "solution":
http://blog.windfluechter.net/content/blog/2012/06/26/1475-confusion-about-mkfsxfs-and-log-stripe-size-being-too-big

Maybe this helps other people to not come up with the same question... :-)

Many thanks to all who helped me to understand this "issue"! :-)

--
Ciao... // Fon: 0381-2744150
Ingo \X/ http://blog.windfluechter.net
gpg pubkey: http://www.juergensmann.de/ij_public_key.asc
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 19+ messages in thread
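For anyone wanting to verify what a given filesystem actually ended up with, the geometry (sunit/swidth and the log sunit) can be read back after mkfs; a small sketch, assuming the filesystem is mounted under /usr:

  xfs_info /usr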
* Re: mkfs.xfs states log stripe unit is too large 2012-06-26 2:30 ` Dave Chinner 2012-06-26 8:02 ` Christoph Hellwig 2012-06-26 19:34 ` Ingo Jürgensmann @ 2012-06-27 2:06 ` Eric Sandeen 2 siblings, 0 replies; 19+ messages in thread From: Eric Sandeen @ 2012-06-27 2:06 UTC (permalink / raw) To: Dave Chinner; +Cc: Ingo Jürgensmann, xfs On 6/25/12 10:30 PM, Dave Chinner wrote: ... > Without the warning, nobody would have noticed this. I think the > warning has value - even if it is just to indicate MD now uses a > bad default value for common workloads.. Fair enough. log stripe unit (524288 bytes) is too large (maximum is 256KiB) log stripe unit adjusted to 32KiB It just tweaked me a little to complain about something the user didn't specify, but thinking about it from the perspective of letting the user know that the _device_ has a stripe unit larger than xfs can handle makes sense. -Eric _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: mkfs.xfs states log stripe unit is too large 2012-06-23 23:44 ` Dave Chinner 2012-06-24 2:20 ` Eric Sandeen @ 2012-06-25 10:33 ` Ingo Jürgensmann 1 sibling, 0 replies; 19+ messages in thread From: Ingo Jürgensmann @ 2012-06-25 10:33 UTC (permalink / raw) To: xfs On 2012-06-24 01:44, Dave Chinner wrote: > If you increase the log stripe unit, you also increase the minimum > log buffer size that the filesystem supports. The filesystem can > support up to 256k log buffers, and hence the limit on maximum log > stripe alignment. So, no way to increase log buffers to match 1.2 format superblocks default size of 512 kiB, I guess, because it would change on disk-format? >> - will performance suffer from log stripe size adjusted to just 32 >> kiB? Some of my logical volumes will just store data, but one or >> the other will have some workload acting as storage for BackupPC. > For data volumes, no. For backupPC, it depends on whether the MD > RAID stripe cache can turn all the sequential log writes into a full > stripe write. In general, this is not a problem, and is almost never > a problem for HW RAID with BBWC.... Well, the external log would have been on my other RAID disks. Having a RAID1 for this would be doable, but I decided to not go that way. It would limit me too much to replace those 1 TB disks by bigger ones somewhen in the future. Regarding BackupPC: it might more likely benefit from a smaller log stripe size, because BackupPC makes extensive use of hardlinks, so I guess the overhead will be smaller when using 32 kiB log stripe size, as you suggests as well below: >> - would it be worth the effort to raise log stripe to at least 256 >> kiB? > Depends on your workload. If it is fsync heavy, I'd advise against > it, as every log write will be padded out to 256k, even if you only > write 500 bytes worth of transaction data.... BackupPC will check against its pool of files, whether a file is already in it (by comparing md5sum or shaXXXsum) or not. If it's in the pool already it will hardlink to it, if it's not it will copy the file and hardlink then. Therefore I assume that the workload will mainly be fsyncs. >> - or would it be better to run with external log on the old 1 TB >> RAID? > External logs provide muchless benefit with delayed logging than hey > use to. As it is, your external log needs to have the same > reliability characteristics as the main volume - lose the log, > corrupt the filesystem. Hence for RAID5 volumes, you need a RAID1 > log, and for RAID6 you either need RAID6 or a 3-way mirror to > provide the same reliability.... That would be possible. But as stated above, I won't go that way for practical reasons. >> End note: the 4 TB disks are not yet "in production", so I can run >> tests with both RAID setup as well as mkfs.xfs. Reshaping the RAID >> will take up to 10 hours, though... > IMO, RAID reshaping is just a bad idea - it changes the alignment > characteristic of the volume, hence everything that the > filesystemlaid down in an aligned fashion is now unaligned, and you > have to tell the filesytemteh new alignment before new files will be > correctly aligned. Also, it's usually faster to back up, recreate > and restore than reshape and that puts a lot less load on your > disks, too... True. Therefor I've re-created the RAID again instead of still running it from re-shaped RAID1-to-RAID5. Anyway, reshaping is only an issue as long as there's already a FS on it. But a bad feeling still persists... ;) Thanks for your explanation, Dave! -- Ciao... 
// Fon: 0381-2744150 . Ingo \X/ http://blog.windfluechter.net gpg pubkey: http://www.juergensmann.de/ij_public_key. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
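If an array does get reshaped, the new alignment can at least be handed to an existing filesystem at mount time; a sketch with values matching a 512k chunk and two data disks (the XFS sunit/swidth mount options are in 512-byte units, and only newly allocated files benefit):

  mount -o sunit=1024,swidth=2048 /dev/lv/usr /usr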
end of thread, other threads:[~2012-07-09 12:01 UTC | newest]
Thread overview: 19+ messages
2012-06-23 12:50 mkfs.xfs states log stripe unit is too large Ingo Jürgensmann
2012-06-23 23:44 ` Dave Chinner
2012-06-24 2:20 ` Eric Sandeen
2012-06-24 13:05 ` Stan Hoeppner
2012-06-24 13:17 ` Ingo Jürgensmann
2012-06-24 19:28 ` Stan Hoeppner
2012-06-24 19:51 ` Ingo Jürgensmann
2012-06-24 22:15 ` Stan Hoeppner
2012-06-25 5:25 ` Ingo Jürgensmann
[not found] ` <4FE8CEED.7070505@hardwarefreak.com>
2012-06-25 21:18 ` Ingo Jürgensmann
2012-06-24 15:03 ` Ingo Jürgensmann
2012-06-26 2:30 ` Dave Chinner
2012-06-26 8:02 ` Christoph Hellwig
[not found] ` <20120702061827.GB16671@infradead.org>
2012-07-02 6:41 ` NeilBrown
2012-07-02 8:08 ` Dave Chinner
2012-07-09 12:02 ` kedacomkernel
2012-06-26 19:34 ` Ingo Jürgensmann
2012-06-27 2:06 ` Eric Sandeen
2012-06-25 10:33 ` Ingo Jürgensmann