* 30 TB RAID6 + XFS slow write performance
  From: John Bokma @ 2011-07-18 19:58 UTC
  To: xfs

Dear list members,

A customer of mine is currently struggling with the performance of a
30 TB RAID6 which uses XFS as the file system. I am fairly sure it's
not XFS that's causing the performance issue, but my expertise is
neither XFS nor RAID; I just wrote the software that, after moving to
the larger RAID (from a much smaller one, ~3 TB, using ext3), suddenly
seems to have a huge drop in write performance.

The software I wrote writes many small (50-150K) files in parallel
(100+ processes), thousands of times per hour. Writing a file of
50-150K now and then seems to take between 30 and 90 seconds, and more
rarely can take over 200 seconds (several times an hour).

When all processes are stopped and restarted, the 30-90 second delays
start happening once about 16-20+ processes are running.

To me this sounds like something has been configured wrong. I have
already recommended that my customer find someone who is capable of
configuring the RAID correctly; to me it sounds like a
hardware/configuration issue.

Any insights are very welcome.

Hardware:
  card: MegaRAID SAS 9260-16i
  disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares)
  RAID6, ~30 TB

Thanks for reading,

John

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: 30 TB RAID6 + XFS slow write performance
  From: Eric Sandeen @ 2011-07-19 0:00 UTC
  To: John Bokma; +Cc: xfs

On 7/18/11 2:58 PM, John Bokma wrote:
> Dear list members,
>
> A customer of mine is currently struggling with the performance of a
> 30 TB RAID6 which uses XFS as the file system. I am fairly sure it's
> not XFS that's causing the performance issue, but my expertise is
> neither XFS nor RAID; I just wrote the software that, after moving to
> the larger RAID (from a much smaller one, ~3 TB, using ext3),
> suddenly seems to have a huge drop in write performance.
>
> The software I wrote writes many small (50-150K) files in parallel
> (100+ processes), thousands of times per hour. Writing a file of
> 50-150K now and then seems to take between 30 and 90 seconds, and
> more rarely can take over 200 seconds (several times an hour).
>
> When all processes are stopped and restarted, the 30-90 second
> delays start happening once about 16-20+ processes are running.
>
> To me this sounds like something has been configured wrong. I have
> already recommended that my customer find someone who is capable of
> configuring the RAID correctly; to me it sounds like a
> hardware/configuration issue.
>
> Any insights are very welcome.
>
> Hardware:
>   card: MegaRAID SAS 9260-16i
>   disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares)
>   RAID6, ~30 TB

My first suggestion would be to check the partition alignment on the
RAID (if it is partitioned), and be sure it is aligned with the
underlying RAID geometry. And then make sure you give mkfs.xfs the
proper geometry as well.

After that, does the RAID card have a battery-backed write cache? If
so, you can safely disable barriers.

More info is always good too; for starters, what kernel and what
xfsprogs version? What mkfs and mount options?
-Eric
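As a footnote to Eric's alignment suggestion: a quick way to sanity-check the existing setup might look something like the following sketch (the device and mount point are placeholders for the customer's actual RAID device):

```shell
# Show the geometry the existing filesystem was created with;
# sunit=0, swidth=0 means mkfs.xfs was given no RAID geometry
xfs_info /data

# List partitions with their start offsets in 512-byte sectors;
# the partition start should be a multiple of the controller's
# stripe unit, or every stripe-sized write straddles two stripes
fdisk -lu /dev/sdb
```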
* Re: 30 TB RAID6 + XFS slow write performance
  From: Emmanuel Florac @ 2011-07-19 8:37 UTC
  To: John Bokma; +Cc: xfs

On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:

> card: MegaRAID SAS 9260-16i
> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares)
> RAID6
> ~30 TB

This card doesn't activate the write cache without a BBU present. Be
sure you have a BBU, or the performance will always be unbearably
awful.

Then proceed as Eric suggested. Initialize your filesystem with the
right options: su= your RAID stripe size, sw= the number of data
members in your RAID array (for RAID6, the total number minus 2).
Don't forget the useful option -l lazy-count=1, and mount with
nobarrier and inode64.

BTW, apparently you're confusing hot spares and parity drives. A
RAID6 array has 2 parity drives; it may or may not also have one or
more hot spares (generally one is enough). I suppose your array is
actually 12 data + 2 parity drives.

regards,
--
------------------------------------------------------------------------
Emmanuel Florac  |  Direction technique  |  Intellique
                 |  <eflorac@intellique.com>  |  +33 1 78 94 84 02
------------------------------------------------------------------------
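Put together, Emmanuel's suggestions might look like the following sketch. The device and mount point are placeholders, su=64k assumes the controller's default 64KB strip size (which must be confirmed against the actual configuration), and sw=12 assumes 12 data drives (14 minus 2 parity):

```shell
# su = controller stripe (strip) size, sw = number of data drives,
# lazy-count=1 reduces contention on the log superblock counters
mkfs.xfs -d su=64k,sw=12 -l lazy-count=1 /dev/sdb1

# nobarrier is only safe with a battery-backed write cache on the
# controller; inode64 spreads inodes across all allocation groups
mount -o nobarrier,inode64 /dev/sdb1 /data
```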
* Re: 30 TB RAID6 + XFS slow write performance
  From: Stan Hoeppner @ 2011-07-19 22:37 UTC
  To: Emmanuel Florac; +Cc: xfs, John Bokma

On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
> On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
>
>> card: MegaRAID SAS 9260-16i
>> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares)
>> RAID6
>> ~30 TB
>
> This card doesn't activate the write cache without a BBU present. Be
> sure you have a BBU or the performance will always be unbearably
> awful.

In addition to all the other recommendations, once the BBU is
installed, disable the individual drive caches (if this isn't done
automatically), and set the controller cache mode to 'write back'.
The write-through and direct I/O cache modes will deliver horrible
RAID6 write performance.

And, BTW, RAID6 is a horrible choice for a parallel, small-file, high
random I/O workload such as you've described. RAID10 would be much
more suitable. Actually, any striped RAID is less than optimal for
such a small-file workload. The default stripe size for the LSI RAID
controllers, IIRC, is 64KB. With 14 spindles of stripe width you end
up with 64*14 = 896KB.

XFS will try to pack as many of these 50-150K files into a single
extent, but you're talking 6 to 18 files per extent, and this is
wholly dependent on the parallel write pattern, and on which of the
allocation groups XFS decides to write each file to. XFS isn't going
to be 100% efficient in this case. Thus, you will end up with many
partial stripe width writes, eliminating much of the performance
advantage of striping.

These are large 7200 rpm SATA drives which have poor seek performance
to begin with, unlike the 'small' 300GB 15k SAS drives. You're
robbing that poor seek performance further by:

1. Using double-parity striped RAID
2.
Writing thousands of small files in parallel

This workload is very similar to the case of a mail server using the
maildir storage format. If you read the list archives you'll see
recommendations for an optimal storage stack setup for this workload.
It goes something like this:

1. Create a linear array of hardware RAID1 mirror sets. Do this all
   in the controller if it can do it. If not, use Linux RAID (mdadm)
   to create a '--linear' array of the multiple (7 in your case,
   apparently) hardware RAID1 mirror sets.

2. Now let XFS handle the write parallelism. Format the resulting
   7-spindle Linux RAID device with, for example:

   mkfs.xfs -d agcount=14 /dev/md0

By using this configuration you eliminate the excessive head seeking
associated with the partial stripe write problems of RAID6, restoring
performance efficiency to the array. Using 14 allocation groups
allows XFS to write, at minimum, 14 such files in parallel. This may
not seem like a lot given you have ~200 writers, but it's actually
far more than what you're getting now, or what you'll get with
striped parity RAID.

Consider the 150KB file case: 14*150KB = 2.1MB. Assuming this
hardware and software stack can sink 210MB/s with this workload,
that's ~1400 files written per second, or 84,000 files per minute.
Would this be sufficient for your application?

Now that we've covered the XFS and hardware RAID side of this
equation, does your application run directly on this machine, or are
you writing over NFS or CIFS to this XFS filesystem? If so, that's
another fly in the ointment we may have to deal with.

--
Stan
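Stan's two-step recipe might be sketched as follows. The device names are placeholders, assuming the seven hardware RAID1 mirror sets appear to Linux as /dev/sd[b-h]:

```shell
# Join the 7 RAID1 mirror sets end to end (no striping); a linear
# array simply concatenates the member devices' address spaces
mdadm --create /dev/md0 --level=linear --raid-devices=7 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh

# 14 allocation groups = 2 per mirror pair; no su/sw is specified,
# since a linear concat has no stripe geometry to align to
mkfs.xfs -d agcount=14 /dev/md0
```

Because each AG sits wholly on one mirror pair, XFS's per-AG allocation spreads concurrent writers across the spindles without any partial-stripe-write penalty.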
* Re: 30 TB RAID6 + XFS slow write performance
  From: Dave Chinner @ 2011-07-20 0:20 UTC
  To: Stan Hoeppner; +Cc: John Bokma, xfs

On Tue, Jul 19, 2011 at 05:37:25PM -0500, Stan Hoeppner wrote:
> On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
> > On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
> >
> >> card: MegaRAID SAS 9260-16i
> >> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares)
> >> RAID6
> >> ~30 TB
> >
> > This card doesn't activate the write cache without a BBU present.
> > Be sure you have a BBU or the performance will always be
> > unbearably awful.
>
> In addition to all the other recommendations, once the BBU is
> installed, disable the individual drive caches (if this isn't done
> automatically), and set the controller cache mode to 'write back'.
> The write-through and direct I/O cache modes will deliver horrible
> RAID6 write performance.
>
> And, BTW, RAID6 is a horrible choice for a parallel, small-file,
> high random I/O workload such as you've described. RAID10 would be
> much more suitable. Actually, any striped RAID is less than optimal
> for such a small-file workload. The default stripe size for the LSI
> RAID controllers, IIRC, is 64KB. With 14 spindles of stripe width
> you end up with 64*14 = 896KB.

All good up to here.

> XFS will try to pack as many of these 50-150K files
> into a single extent, but you're talking 6 to 18 files per extent,

I think you've got your terminology wrong. An extent can only belong
to a single inode, but an inode can contain many extents, as can a
stripe width. We do not pack data from multiple files into a single
extent.

For new files on a su/sw aware filesystem, however, XFS will *not*
pack multiple files into the same stripe unit.
It will try to align the first extent of the file to sunit, or, if
you have the swalloc mount option set and the allocation is for more
than a swidth of space, it will align to swidth rather than sunit.

So if you have a small file workload, specifying sunit/swidth can
actually -decrease- performance because it allocates the file extents
sparsely. IOWs, stripe alignment is important for bandwidth-intensive
applications because it allows full stripe writes to occur much more
frequently, but can be harmful to small file performance as the
aligned allocation pattern can prevent full stripe writes from
occurring.....

> and
> this is wholly dependent on the parallel write pattern, and in which
> of the allocation groups XFS decides to write each file.

That's pretty much irrelevant for small files, as a single allocation
is done for each file during writeback.

> XFS isn't going
> to be 100% efficient in this case. Thus, you will end up with many
> partial stripe width writes, eliminating much of the performance
> advantage of striping.

Yes, that's the ultimate problem, but not for the reasons you
suggested. ;)

> These are large 7200 rpm SATA drives which have poor seek
> performance to begin with, unlike the 'small' 300GB 15k SAS drives.
> You're robbing that poor seek performance further by:
>
> 1. Using double-parity striped RAID
> 2. Writing thousands of small files in parallel

The writing in parallel is only an issue if it is direct or
synchronous IO. If it's using normal buffered writes, then writeback
is mostly single-threaded and delayed allocation should be preventing
fragmentation completely. That still doesn't guarantee that writeback
avoids RAID RMW cycles (see above about allocation alignment).

> This workload is very similar to the case of a mail server using the
> maildir storage format.

There's not enough detail in the workload description to make that
assumption.
> If you read the list archives you'll see
> recommendations for an optimal storage stack setup for this
> workload. It goes something like this:
>
> 1. Create a linear array of hardware RAID1 mirror sets.
>    Do this all in the controller if it can do it.
>    If not, use Linux RAID (mdadm) to create a '--linear' array of
>    the multiple (7 in your case, apparently) hardware RAID1 mirror
>    sets
>
> 2. Now let XFS handle the write parallelism. Format the resulting
>    7 spindle Linux RAID device with, for example:
>
>    mkfs.xfs -d agcount=14 /dev/md0
>
> By using this configuration you eliminate the excessive head seeking
> associated with the partial stripe write problems of RAID6,
> restoring performance efficiency to the array. Using 14 allocation
> groups allows XFS to write, at minimum, 14 such files in parallel.

That's not correct. 14 AGs means that if the files are laid out
across all AGs then there can be 14 -allocations- in parallel at
once. If IO does not require allocation, then they don't serialise at
all on the AGs. IOWs, if allocation takes 1ms of work in an AG, then
you could have 1,000 allocations per second per AG. With 14 AGs, that
gives an allocation capability of up to 14,000/s.

And given that not all writes require allocation, and allocation is
usually only a small percentage of the total IO time, you can have
many, many more write IOs in flight than you can do allocations in an
AG....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
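As an aside on Dave's point about aligned allocation: one way to see where the allocator actually placed a file's extents, and whether they landed on stripe boundaries, is xfs_bmap (the file path is a placeholder):

```shell
# With -v, xfs_bmap prints a FLAGS column whose bits mark extents
# that do not begin or end on a stripe unit or stripe width
# boundary; -vv additionally prints the legend for each flag bit
xfs_bmap -vv /data/some/small/file
```

Running this over a sample of the 50-150K files would show directly whether sparse, stripe-aligned placement is occurring on the customer's filesystem.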
* Re: 30 TB RAID6 + XFS slow write performance
  From: Stan Hoeppner @ 2011-07-20 5:16 UTC
  To: Dave Chinner; +Cc: John Bokma, xfs

On 7/19/2011 7:20 PM, Dave Chinner wrote:
> On Tue, Jul 19, 2011 at 05:37:25PM -0500, Stan Hoeppner wrote:
>> On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
>>> On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
>>>
>>>> card: MegaRAID SAS 9260-16i
>>>> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares)
>>>> RAID6
>>>> ~30 TB
>>>
>>> This card doesn't activate the write cache without a BBU present.
>>> Be sure you have a BBU or the performance will always be
>>> unbearably awful.
>>
>> In addition to all the other recommendations, once the BBU is
>> installed, disable the individual drive caches (if this isn't done
>> automatically), and set the controller cache mode to 'write back'.
>> The write-through and direct I/O cache modes will deliver horrible
>> RAID6 write performance.
>>
>> And, BTW, RAID6 is a horrible choice for a parallel, small-file,
>> high random I/O workload such as you've described. RAID10 would be
>> much more suitable. Actually, any striped RAID is less than optimal
>> for such a small-file workload. The default stripe size for the LSI
>> RAID controllers, IIRC, is 64KB. With 14 spindles of stripe width
>> you end up with 64*14 = 896KB.
>
> All good up to here.

And then my lack of understanding of XFS internals begins to show. :(

>> XFS will try to pack as many of these 50-150K files
>> into a single extent, but you're talking 6 to 18 files per extent,
>
> I think you've got your terminology wrong. An extent can only belong
> to a single inode, but an inode can contain many extents, as can a
> stripe width. We do not pack data from multiple files into a single
> extent.

Yes, I think I meant stripe unit, the 896KB.
> For new files on a su/sw aware filesystem, however, XFS will *not*
> pack multiple files into the same stripe unit. It will try to align
> the first extent of the file to sunit, or if you have the swalloc
> mount option set and the allocation is for more than a swidth of
> space it will align to swidth rather than sunit.

Interesting. Didn't realize this.

> So if you have a small file workload, specifying sunit/swidth can
> actually -decrease- performance because it allocates the file
> extents sparsely. IOWs, stripe alignment is important for
> bandwidth-intensive applications because it allows full stripe
> writes to occur much more frequently, but can be harmful to small
> file performance as the aligned allocation pattern can prevent full
> stripe writes from occurring.....

I don't recall reading this before, Dave. Thank you for this tidbit.
How much of a performance decrease are we looking at here? An
mkfs.xfs of an mdraid striped array will by default create
sunit/swidth values, right? And thus this lower performance with
small files.

>> and
>> this is wholly dependent on the parallel write pattern, and in
>> which of the allocation groups XFS decides to write each file.
>
> That's pretty much irrelevant for small files as a single allocation
> is done for each file during writeback.

I believe I was already thinking of the concatenated array at this
point and accidentally dropped those thoughts into the striped array
discussion.

>> XFS isn't going
>> to be 100% efficient in this case. Thus, you will end up with many
>> partial stripe width writes, eliminating much of the performance
>> advantage of striping.
>
> Yes, that's the ultimate problem, but not for the reasons you
> suggested. ;)

Thanks for saving me, Dave. :) I had the big picture right but
FUBAR'd some of the details. Maybe there's a job in politics waiting
for me. ;)

>> These are large 7200 rpm SATA drives which have poor seek
>> performance to begin with, unlike the 'small' 300GB 15k SAS drives.
>> You're robbing that poor seek performance further by:
>>
>> 1. Using double-parity striped RAID
>> 2. Writing thousands of small files in parallel
>
> The writing in parallel is only an issue if it is direct or
> synchronous IO. If it's using normal buffered writes, then writeback
> is mostly single-threaded and delayed allocation should be
> preventing fragmentation completely. That still doesn't guarantee
> that writeback avoids RAID RMW cycles (see above about allocation
> alignment).

The RMW was mainly what I was concerned with here.

>> This workload is very similar to the case of a mail server using
>> the maildir storage format.
>
> There's not enough detail in the workload description to make that
> assumption.

Good point. I should have said "at first glance... seems similar".

>> If you read the list archives you'll see
>> recommendations for an optimal storage stack setup for this
>> workload. It goes something like this:
>>
>> 1. Create a linear array of hardware RAID1 mirror sets.
>>    Do this all in the controller if it can do it.
>>    If not, use Linux RAID (mdadm) to create a '--linear' array of
>>    the multiple (7 in your case, apparently) hardware RAID1 mirror
>>    sets
>>
>> 2. Now let XFS handle the write parallelism. Format the resulting
>>    7 spindle Linux RAID device with, for example:
>>
>>    mkfs.xfs -d agcount=14 /dev/md0
>>
>> By using this configuration you eliminate the excessive head
>> seeking associated with the partial stripe write problems of RAID6,
>> restoring performance efficiency to the array. Using 14 allocation
>> groups allows XFS to write, at minimum, 14 such files in parallel.
>
> That's not correct. 14 AGs means that if the files are laid out
> across all AGs then there can be 14 -allocations- in parallel at
> once. If IO does not require allocation, then they don't serialise
> at all on the AGs. IOWs, if allocation takes 1ms of work in an AG,
> then you could have 1,000 allocations per second per AG.
> With 14 AGs, that gives allocation capability of up to 14,000/s

So are you saying that we have no guarantee, nor high probability,
that the small files in this case will be spread out across all AGs,
thus making more efficient use of each disk's performance in the
concatenated array vs. a striped array? Or are you merely pointing
out a detail I have incorrect, which I've yet to fully understand?

> And given that not all writes require allocation and allocation is
> usually only a small percentage of the total IO time. You can have
> many, many more write IOs in flight than you can do allocations in
> an AG....

Ahh, I think I see your point. For the maildir case, more of the IO
is likely due to things like updating message flags, etc., than
actually writing new mail files into the directory. Such operations
don't require allocation. With the workload mentioned by the OP, it's
possible that all of the small file writes may indeed require
allocation, unlike the maildir workload. But if this is the case,
wouldn't the concatenated array still yield better overall
performance than RAID6, or any other striped array?

If I misunderstood your last point, or any points, please guide me to
the light, Dave.

--
Stan
* Re: 30 TB RAID6 + XFS slow write performance
  From: Dave Chinner @ 2011-07-20 6:44 UTC
  To: Stan Hoeppner; +Cc: John Bokma, xfs

On Wed, Jul 20, 2011 at 12:16:15AM -0500, Stan Hoeppner wrote:
> On 7/19/2011 7:20 PM, Dave Chinner wrote:
> > On Tue, Jul 19, 2011 at 05:37:25PM -0500, Stan Hoeppner wrote:
> >> On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
> >>> On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
> >>>
> >>>> card: MegaRAID SAS 9260-16i
> >>>> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares)
> >>>> RAID6
> >>>> ~30 TB
> >>>
> >>> This card doesn't activate the write cache without a BBU
> >>> present. Be sure you have a BBU or the performance will always
> >>> be unbearably awful.
> >>
> >> In addition to all the other recommendations, once the BBU is
> >> installed, disable the individual drive caches (if this isn't
> >> done automatically), and set the controller cache mode to 'write
> >> back'. The write-through and direct I/O cache modes will deliver
> >> horrible RAID6 write performance.
> >>
> >> And, BTW, RAID6 is a horrible choice for a parallel, small-file,
> >> high random I/O workload such as you've described. RAID10 would
> >> be much more suitable. Actually, any striped RAID is less than
> >> optimal for such a small-file workload. The default stripe size
> >> for the LSI RAID controllers, IIRC, is 64KB. With 14 spindles of
> >> stripe width you end up with 64*14 = 896KB.
> >
> > All good up to here.
>
> And then my lack of understanding of XFS internals begins to
> show. :(

The fact you are trying to understand them is the important bit!

....

> > So if you have a small file workload, specifying sunit/swidth can
> > actually -decrease- performance because it allocates the file
> > extents sparsely.
> > IOWs, stripe alignment is important for bandwidth-intensive
> > applications because it allows full stripe writes to occur much
> > more frequently, but can be harmful to small file performance as
> > the aligned allocation pattern can prevent full stripe writes
> > from occurring.....
>
> I don't recall reading this before, Dave. Thank you for this tidbit.

I'm sure I've said this before, but it's possible I've said it this
time in a way that is obvious and understandable. Most people
struggle with the concept of allocation alignment and why it might be
important, let alone understand it well enough to discuss intricate
details of the allocator and tuning it for different workloads...

> How much performance decrease are we looking at here?

Depends on your hardware and the workload. It may not be measurable,
or it may be very noticeable. Benchmarking your system with your
workload is the only way to really know.

> An mkfs.xfs of an
> mdraid striped array will by default create sunit/swidth values
> right? And thus this lower performance w/small files.

In general, sunit/swidth being specified provides a better tradeoff
for maintaining consistent performance on files across the
filesystem. It might cost a little for small files, but unaligned IO
on large files causes much more noticeable performance problems...

....

> >> If you read the list archives you'll see
> >> recommendations for an optimal storage stack setup for this
> >> workload. It goes something like this:
> >>
> >> 1. Create a linear array of hardware RAID1 mirror sets.
> >>    Do this all in the controller if it can do it.
> >>    If not, use Linux RAID (mdadm) to create a '--linear' array
> >>    of the multiple (7 in your case, apparently) hardware RAID1
> >>    mirror sets
> >>
> >> 2. Now let XFS handle the write parallelism.
> >>    Format the resulting 7 spindle Linux RAID device with, for
> >>    example:
> >>
> >>    mkfs.xfs -d agcount=14 /dev/md0
> >>
> >> By using this configuration you eliminate the excessive head
> >> seeking associated with the partial stripe write problems of
> >> RAID6, restoring performance efficiency to the array. Using 14
> >> allocation groups allows XFS to write, at minimum, 14 such files
> >> in parallel.
> >
> > That's not correct. 14 AGs means that if the files are laid out
> > across all AGs then there can be 14 -allocations- in parallel at
> > once. If IO does not require allocation, then they don't
> > serialise at all on the AGs. IOWs, if allocation takes 1ms of
> > work in an AG, then you could have 1,000 allocations per second
> > per AG. With 14 AGs, that gives allocation capability of up to
> > 14,000/s
>
> So are you saying that we have no guarantee, nor high probability,
> that the small files in this case will be spread out across all
> AGs, thus making more efficient use of each disk's performance in
> the concatenated array, vs a striped array? Or, are you merely
> pointing out a detail I have incorrect, which I've yet to fully
> understand?

Yet to fully understand. It's not limited to small files, either. XFS
doesn't guarantee that specific allocations are evenly distributed
across AGs, but it does try to spread the overall contents of the
filesystem across all AGs. It does have concepts of locality of
reference, but they change depending on the allocator in use.

Take, for example, inode32 vs inode64, which are the two most common
allocation strategies, and assume we have a 16TB fs with 1TB AGs.

The inode32 allocator will place all inodes and most directory
metadata in the first AG, below one TB. There is basically no
metadata allocation parallelism in this strategy, so metadata
performance is limited and will often serialise.
Metadata tends to have good locality of reference - all directories
and inodes will tend to be close together on disk because they are in
the same AG. Data, on the other hand, is rotored around AGs 2-16 on a
per-file basis, so there is no locality between inodes and their
data, nor of data between two adjacent files in the same directory.
There is, however, data allocation parallelism, because files are
spread across allocation groups...

Hence for inode32, metadata is closely located, but data is spread
out widely. Hence metadata operations don't scale at all well on a
linear concat (e.g. they hit only one disk/mirror pair), but data
allocations are spread effectively and hence parallelise and scale
quite well. The downside to this is that data lookups involve large
seeks if you have a stripe, and hence can be quite slow. Data reads
on a linear concat are not guaranteed to evenly load the disks,
either, simply because there's no correlation between the location of
the data and the access patterns.

For inode64, locality of reference clusters around the directory
structure. The inodes for files in a directory will be allocated in
the same AG as the directory inode, and the data for each file will
be allocated in the same AG as the file inodes. When you create a new
directory, it gets placed in a different AG, and the pattern repeats.
So for inode64, distributing files across all AGs is caused by
distributing the directory structure. FWIW, an example is a kernel
source tree:

~/src/kern/xfsdev$ find . -type d -exec sudo xfs_bmap -v {} \; | \
    awk '/ 0: / { print $4 }' | sort -n | uniq -c
     76 0
     66 1
     85 2
     81 3
     82 4
     69 5
     89 6
     74 7
     90 8
     81 9
     96 10
     84 11
     85 12
     84 13
     86 14
     71 15

As you can see, there's a relatively even spread of the directories
across all 16 AGs in that directory structure, and the file data will
follow this pattern.
Because of its better metadata<->data locality of reference, inode64
tends to be significantly faster on workloads that mix metadata
operations with data operations (e.g. a recursive grep across a
kernel source tree), as the seek cost between the inode and its data
is much less than for inode32....

However, if your workload does not spread across directories, then IO
will tend to be limited to specific silos in the linear concat while
other disks sit idle. If you have a stripe, then the seeks to get to
the data are small, and hence much faster than inode32 on similar
workloads.

This is all ignoring stripe-aligned allocation - that is often lost
in the noise compared to bigger issues like seeking from AG 0 to AG
15 when reading the inode and then the data, or having a workload
only use a single AG because it is all confined to a single
directory.

IOWs, the best, most optimal filesystem layout and allocation
strategy is both workload and hardware dependent, and there's no one
right answer. The defaults select the best balance for typical
usage - beyond that, benchmarking the workload is the only way to
really measure whether your tweaks are the right ones or not. IOWs,
you need to understand the filesystem, your storage hardware and
-the application IO patterns- to make the right tuning decisions.

> > And given that not all writes require allocation and allocation
> > is usually only a small percentage of the total IO time. You can
> > have many, many more write IOs in flight than you can do
> > allocations in an AG....
>
> Ahh, I think I see your point. For the maildir case, more of the IO
> is likely due to things like updating message flags, etc, than
> actually writing new mail files into the directory.

I wasn't really talking about maildir here, just pointing out that
allocation is generally not the limiting factor in doing large
amounts of concurrent write IO.

> Such operations don't
> require allocation.
> With the workload mentioned by the OP, it's
> possible that all of the small file writes may indeed require
> allocation, unlike the maildir workload. But if this is the case,
> wouldn't the concatenated array still yield better overall
> performance than RAID6, or any other striped array?

<shrug> Quite possibly, but I can't say conclusively - I simply don't
know enough about the workload or the fs configuration.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: 30 TB RAID6 + XFS slow write performance
  From: Stan Hoeppner @ 2011-07-20 12:10 UTC
  To: Dave Chinner; +Cc: John Bokma, xfs

On 7/20/2011 1:44 AM, Dave Chinner wrote:
> On Wed, Jul 20, 2011 at 12:16:15AM -0500, Stan Hoeppner wrote:
>> On 7/19/2011 7:20 PM, Dave Chinner wrote:
>>> On Tue, Jul 19, 2011 at 05:37:25PM -0500, Stan Hoeppner wrote:
>>>> On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
>>>>> On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
>>>>>
>>>>>> card: MegaRAID SAS 9260-16i
>>>>>> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares)
>>>>>> RAID6
>>>>>> ~30 TB
>>>>>
>>>>> This card doesn't activate the write cache without a BBU
>>>>> present. Be sure you have a BBU or the performance will always
>>>>> be unbearably awful.
>>>>
>>>> In addition to all the other recommendations, once the BBU is
>>>> installed, disable the individual drive caches (if this isn't
>>>> done automatically), and set the controller cache mode to 'write
>>>> back'. The write-through and direct I/O cache modes will deliver
>>>> horrible RAID6 write performance.
>>>>
>>>> And, BTW, RAID6 is a horrible choice for a parallel, small-file,
>>>> high random I/O workload such as you've described. RAID10 would
>>>> be much more suitable. Actually, any striped RAID is less than
>>>> optimal for such a small-file workload. The default stripe size
>>>> for the LSI RAID controllers, IIRC, is 64KB. With 14 spindles of
>>>> stripe width you end up with 64*14 = 896KB.
>>>
>>> All good up to here.
>>
>> And then my lack of understanding of XFS internals begins to
>> show. :(
>
> The fact you are trying to understand them is the important bit!

I've always found XFS fascinating (as with most of SGI's creations).
The more I use XFS, and the more I participate here, the more I want
to understand how the cogs turn.
And as you mentioned previously, it's beneficial to this list if users can effectively answer other users' questions, giving devs more time for developing. :) > .... >>> So if you have a small file workload, specifying sunit/swidth can >>> actually -decrease- performance because it allocates the file >>> extents sparsely. IOWs, stripe alignment is important for bandwidth >>> intensive applications because it allows full stripe writes to occur >>> much more frequently, but can be harmful to small file performance >>> as the aligned allocation pattern can prevent full stripe writes >>> from occurring..... >> >> I don't recall reading this before, Dave. Thank you for this tidbit. > > I'm sure I've said this before, but it's possible I've said it this > time in a way that is obvious and understandable. Most people > struggle with the concept of allocation alignment and why it might be > important, let alone understand it well enough to discuss intricate > details of the allocator and tuning it for different workloads... In general I've understood for quite some time that large stripes were typically bad for small file performance due to the partial stripe write issue. However, I misunderstood something you said quite some time ago about XFS having some tricks to somewhat mitigate partial stripe writes during writeback. I thought this was packing multiple small files into a single stripe write, which you just explained XFS does not do. Thinking back, you were probably talking about some other aggregation that occurs in the allocator to cut down on the number of physical IOs required to write the data, or something like that. ... >> An mkfs.xfs of an >> mdraid striped array will by default create sunit/swidth values, right? >> And thus this lower performance w/small files. > > In general, sunit/swidth being specified provides a better tradeoff > for maintaining consistent performance on files across the > filesystem. 
It might cost a little for small files, but unaligned IO > on large files causes much more noticeable performance problems... The reason I asked is to get something in Google. If a user has a purely small file workload, such as maildir, but insists on using an mdraid striped array, would it be better to override the mkfs.xfs defaults here so sunit/swidth aren't defined? If so, would one specify zero for each parameter on the command line? > .... > >>>> If you read the list archives you'll see >>>> recommendations for an optimal storage stack setup for this workload. >>>> It goes something like this: >>>> >>>> 1. Create a linear array of hardware RAID1 mirror sets. >>>> Do this all in the controller if it can do it. >>>> If not, use Linux RAID (mdadm) to create a '--linear' array of the >>>> multiple (7 in your case, apparently) hardware RAID1 mirror sets >>>> >>>> 2. Now let XFS handle the write parallelism. Format the resulting >>>> 7 spindle Linux RAID device with, for example: >>>> >>>> mkfs.xfs -d agcount=14 /dev/md0 >>>> >>>> By using this configuration you eliminate the excessive head seeking >>>> associated with the partial stripe write problems of RAID6, restoring >>>> performance efficiency to the array. Using 14 allocation groups allows >>>> XFS to write, at minimum, 14 such files in parallel. >>> >>> That's not correct. 14 AG means that if the files are laid out >>> across all AGs then there can be 14 -allocations- in parallel at >>> once. If IO does not require allocation, then they don't serialise >>> at all on the AGs. IOWs, if allocation takes 1ms of work in an AG, >>> then you could have 1,000 allocations per second per AG. With 14 >>> AGs, that gives allocation capability of up to 14,000/s >> >> So are you saying that we have no guarantee, nor high probability, that >> the small files in this case will be spread out across all AGs, thus >> making more efficient use of each disk's performance in the concatenated >> array, vs a striped array? 
Or, are you merely pointing out a detail I >> have incorrect, which I've yet to fully understand? > > Yet to fully understand. It's not limited to small files, either. > > XFS doesn't guarantee that specific allocations are evenly > distributed across AGs, but it does try to spread the overall > contents of the filesystem across all AGs. It does have concepts of > locality of reference, but they change depending on the allocator in > use. > > Take, for example, inode32 vs inode64, which are the two most common > allocation strategies, and assume we have a 16TB fs with 1TB AGs. > The inode32 allocator will place all inodes and most directory > metadata in the first AG, below one TB. There is basically no > metadata allocation parallelism in this strategy, so metadata > performance is limited and will often serialise. Metadata tends to > have good locality of reference - all directories and inodes will > tend to be close together on disk because they are in the same AG. I'd forgotten this. I do recall discussions of all the directories and inodes being in the first 1TB on an inode32 filesystem. IIRC, those were focused on people "running out of space" when they still had many hundreds of Gigs or a TB free, simply because they ran out of space for inodes. Until now I hadn't tied this together with the potential metadata performance issue, and specifically with a linear concat setup. > Data, on the other hand, is rotored around AGs 2-16 on a per file > basis, so there is no locality between inodes and their data, nor of > data between two adjacent files in the same directory. There is, > however, data allocation parallelism because files are spread > across allocation groups... > > Hence for inode32, metadata is closely located, but data is spread > out widely. Hence metadata operations don't scale at all well on a > linear concat (e.g. hit only one disk/mirror pair), but data > allocations are spread effectively and hence parallelise and scale > quite well. 
The downside to this is that data lookups involve large > seeks if you have a stripe, and hence can be quite slow. Data reads > on a linear concat are not guaranteed to evenly load the disks, > either, simply because there's no correlation between the location > of the data and the access patterns. Got it. > For inode64, locality of reference clusters around the directory > structure. The inodes for files in a directory will be allocated in > the same AG as the directory inode, and the data for each file will > be allocated in the same AG as the file inodes. When you create a > new directory, it gets placed in a different AG, and the pattern > repeats. So for inode64, distributing files across all AGs is caused > by distributing the directory structure. And this is why maildir works very well with a linear concat on an inode64 filesystem, as each mailbox is in a different directory, thus spreading all the small mail files and metadata across all AGs. Which is why I've been recommending it. I don't think I've been specifying inode64 though in my previous recommendations. I should probably be doing that. I guess I assumed everyone running XFS today is running a 64bit kernel/user space--probably not good to simply assume that. > FWIW, an example is a > kernel source tree: > > ~/src/kern/xfsdev$ find . -type d -exec sudo xfs_bmap -v {} \; | awk '/ 0: / { print $4 }' |sort -n |uniq -c > 76 0 > 66 1 > 85 2 > 81 3 > 82 4 > 69 5 > 89 6 > 74 7 > 90 8 > 81 9 > 96 10 > 84 11 > 85 12 > 84 13 > 86 14 > 71 15 > > As you can see, there's a relatively even spread of the directories > across all 16 AGs in that directory structure, and the file data > will follow this pattern. Because of its better metadata<->data > locality of reference, inode64 tends to be significantly faster on > workloads that mix metadata operations with data operations (e.g. > recursive grep across a kernel source tree) as the seek cost between > the inode and its data is much less than for inode32.... 
Right. > However, if your workload does not spread across directories, then > IO will tend to be limited to specific silos in the linear concat > while other disks sit idle. If you have a stripe, then the seeks to > get to the data are small, and hence much faster than inode32 on > similar workloads. And now I understand your previous comment that we don't know enough about the user's workload to make the linear concat recommendation. If he's writing all those hundreds of thousands of small files into the same directory, the performance of a linear concat would be horrible. > This is all ignoring stripe aligned allocation - that is often lost > in the noise compared to bigger issues like seeking from AG 0 to AG > 15 when reading the inode then the data or having a workload only > use a single AG because it is all confined to a single directory. > > IOWs, the best, most optimal filesystem layout and allocation > strategy is both workload and hardware dependent, and there's no one > right answer. The defaults select the best balance for typical usage > - beyond that benchmarking the workload is the only way to really > measure whether your tweaks are the right ones or not. IOWs, you > need to understand the filesystem, your storage hardware and -the > application IO patterns- to make the right tuning decisions. Got it. When I prematurely recommended the linear concat I'd simply forgotten that our AG parallelism is dependent on having many directories, not just many small files. >>> And given that not all writes require allocation and allocation is >>> usually only a small percentage of the total IO time. You can have >>> many, many more write IOs in flight than you can do allocations in >>> an AG.... >> >> Ahh, I think I see your point. For the maildir case, more of the IO is >> likely due to things like updating message flags, etc, than actually >> writing new mail files into the directory. 
> > I wasn't really talking about maildir here, just pointing out that > allocation is generally not the limiting factor in doing large > amounts of concurrent write IO. Got it. In the specific case the OP posted about, hundreds of thousands of small file writes, allocation could be a limiting factor though, correct? >> Such operations don't >> require allocation. With the workload mentioned by the OP, it's >> possible that all of the small file writes may indeed require >> allocation, unlike the maildir workload. But if this is the case, >> wouldn't the concatenated array still yield better overall performance >> than RAID6, or any other striped array? > > <shrug> > > Quite possibly, but I can't say conclusively - I simply don't know > enough about the workload or the fs configuration. Don't shrug, Dave. :) You already answered this question up above. Well, you provided me some new information, and reminded me of things I already knew, which allowed me to answer this for myself. Thanks for spending the time you have in this thread to do some serious teaching. You provided some valuable information that isn't in the XFS User Guide, nor the XFS File System Structure document. If it is there, it's not in a format that a mere mortal such as myself can digest. You make deeper aspects of XFS understandable, and I really appreciate that. -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
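The linear concat recipe discussed in the exchange above gives the mkfs command but never shows the md layer below it. A minimal sketch of step 1, assuming the seven hardware RAID1 mirror sets show up as /dev/sdb through /dev/sdh (the device names are placeholders, not taken from the OP's system):

```shell
# Sketch only -- /dev/sdb..sdh are assumed names for the 7 mirror sets;
# check lsblk / the controller's tools for the real names before running.
mdadm --create /dev/md0 --level=linear --raid-devices=7 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh

# Then, per step 2 of the recipe, let XFS provide the parallelism:
mkfs.xfs -d agcount=14 /dev/md0
```

Note that with a linear concat there is no stripe, so no sunit/swidth is passed to mkfs.xfs; the allocation group count does the work instead.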
* Re: 30 TB RAID6 + XFS slow write performance 2011-07-20 12:10 ` Stan Hoeppner @ 2011-07-20 14:04 ` Michael Monnerie 2011-07-20 23:01 ` Dave Chinner 0 siblings, 1 reply; 17+ messages in thread From: Michael Monnerie @ 2011-07-20 14:04 UTC (permalink / raw) To: xfs; +Cc: Stan Hoeppner, John Bokma [-- Attachment #1.1: Type: Text/Plain, Size: 1023 bytes --] On Mittwoch, 20. Juli 2011 Stan Hoeppner wrote: > I thought this was packing multiple small files into > a single stripe write, which you just explained XFS does not do. This is interesting, so I'll jump in here. Does that mean that if I have an XFS volume with sw=14,su=64k (14*64=896KiB) that when I write 10 small files in the same dir with 2KB each, each file would be placed at a 896KiB boundary? That way, all stripes of a 1GB partition would be full when there are roughly 1170 files (1170*896KiB ~ 1GB). What would happen when I create other files - is XFS "full" then, or would it start using sub-stripes? If sub-stripes, would they start at su (=64KiB) distances, or at single block (e.g. 4KiB) distances? I hope I could explain my thoughts in an understandable way ;-) -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services: Protéger http://proteger.at [gesprochen: Prot-e-schee] Tel: +43 660 / 415 6531 // Haus zu verkaufen: http://zmi.at/langegg/ [-- Attachment #1.2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
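Michael's "roughly 1170" figure is internally consistent with his premise of one file per full-stripe boundary (taking 1GB as 1GiB), even though the premise itself is corrected in the reply that follows. A quick check:

```shell
#!/bin/sh
# Michael's premise: one small file at every full-stripe boundary
# of a 1 GiB partition (the premise, not XFS's actual behaviour).
part_kib=$((1024 * 1024))      # 1 GiB in KiB
stripe_kib=$((64 * 14))        # su=64k * sw=14 = 896 KiB
echo "full stripes in 1 GiB: $((part_kib / stripe_kib))"   # 1170
```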
* Re: 30 TB RAID6 + XFS slow write performance 2011-07-20 14:04 ` Michael Monnerie @ 2011-07-20 23:01 ` Dave Chinner 2011-07-21 6:19 ` Michael Monnerie 0 siblings, 1 reply; 17+ messages in thread From: Dave Chinner @ 2011-07-20 23:01 UTC (permalink / raw) To: Michael Monnerie; +Cc: John Bokma, Stan Hoeppner, xfs On Wed, Jul 20, 2011 at 04:04:31PM +0200, Michael Monnerie wrote: > On Mittwoch, 20. Juli 2011 Stan Hoeppner wrote: > > I thought this was packing multiple small files into > > a single stripe write, which you just explained XFS does not do. > > This is interesting, so I'll jump in here. Does that mean that if I have an XFS > volume with sw=14,su=64k (14*64=896KiB) that when I write 10 small files > in the same dir with 2KB each, each file would be placed at a 896KiB > boundary? No, they'll get sunit aligned by default, which would be on 64k boundaries. > That way, all stripes of a 1GB partition would be full when > there are roughly 1170 files (1170*896KiB ~ 1GB). What would happen when > I create other files - is XFS "full" then, or would it start using sub-stripes? If sub-stripes, would they start at su (=64KiB) distances, or > at single block (e.g. 4KiB) distances? It starts packing files tightly into remaining free space when no free aligned extents are available for allocation in the AG. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: 30 TB RAID6 + XFS slow write performance 2011-07-20 23:01 ` Dave Chinner @ 2011-07-21 6:19 ` Michael Monnerie 2011-07-21 6:48 ` Dave Chinner 0 siblings, 1 reply; 17+ messages in thread From: Michael Monnerie @ 2011-07-21 6:19 UTC (permalink / raw) To: xfs; +Cc: Stan Hoeppner, John Bokma [-- Attachment #1.1: Type: Text/Plain, Size: 1866 bytes --] On Donnerstag, 21. Juli 2011 Dave Chinner wrote: > No, they'll get sunit aligned by default, which would be on 64k > boundaries. OK, so only when <quote Dave> "swalloc mount option set and the allocation is for more than a swidth of space it will align to swidth rather than sunit" </quote Dave>. So even when I specify swalloc but a file is generated with only 4KB, it will very probably be sunit aligned on disk. > > That way, all stripes of a 1GB partition would be full when > > there are roughly 1170 files (1170*896KiB ~ 1GB). What would happen > > when I create other files - is XFS "full" then, or would it start > > using sub-stripes? If sub-stripes, would they start at su > > (=64KiB) distances, or at single block (e.g. 4KiB) distances? > > It starts packing files tightly into remaining free space when no > free aligned extents are available for allocation in the AG. That means for the above example, that 16384 x 2KiB files could be created, and each be sunit aligned on disk. Then all sunit start blocks are full, so additional files will be sub-sunit "packed", is that it? That would mean fragmentation is likely to occur from that moment, if there are files that grow. And files >64KiB are immediately fragmented then. At this time, there are only 16384 * 2KiB = 32MiB used, which is 3,125% of the disk. I can't believe my numbers, are they true? OK, this is a worst case scenario, and as you've said before, any filesystem can be considered full at 85% fill grade. But it's incredible how quickly you could fuck up a filesystem when using su/sw and writing small files. -- mit freundlichen Grüssen, Michael Monnerie, Ing. 
BSc it-management Internet Services: Protéger http://proteger.at [gesprochen: Prot-e-schee] Tel: +43 660 / 415 6531 // Haus zu verkaufen: http://zmi.at/langegg/ [-- Attachment #1.2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: 30 TB RAID6 + XFS slow write performance 2011-07-21 6:19 ` Michael Monnerie @ 2011-07-21 6:48 ` Dave Chinner 2011-07-22 6:10 ` Michael Monnerie 0 siblings, 1 reply; 17+ messages in thread From: Dave Chinner @ 2011-07-21 6:48 UTC (permalink / raw) To: Michael Monnerie; +Cc: John Bokma, Stan Hoeppner, xfs On Thu, Jul 21, 2011 at 08:19:54AM +0200, Michael Monnerie wrote: > On Donnerstag, 21. Juli 2011 Dave Chinner wrote: > > No, they'll get sunit aligned by default, which would be on 64k > > boundaries. > > OK, so only when <quote Dave> "swalloc mount option set and the > allocation is for more than a swidth of space it will align to swidth > rather than sunit" </quote Dave>. > > So even when I specify swalloc but a file is generated with only 4KB, it > will very probably be sunit aligned on disk. > > > > That way, all stripes of a 1GB partition would be full when > > > there are roughly 1170 files (1170*896KiB ~ 1GB). What would happen > > > when I create other files - is XFS "full" then, or would it start > > > using sub-stripes? If sub-stripes, would they start at su > > > (=64KiB) distances, or at single block (e.g. 4KiB) distances? > > > > It starts packing files tightly into remaining free space when no > > free aligned extents are available for allocation in the AG. > > That means for the above example, that 16384 x 2KiB files could be created, > and each be sunit aligned on disk. Then all sunit start blocks are full, > so additional files will be sub-sunit "packed", is that it? Effectively. > That would mean fragmentation is likely to occur from that moment, if > there are files that grow. If you are writing files that grow like this, then you are doing something wrong. If the app can't do its IO differently, then this is exactly the reason we have userspace-controlled preallocation interfaces. Filesystems cannot prevent user stupidity from screwing something up.... > And files >64KiB are immediately fragmented > then. 
At this time, there are only 16384 * 2KiB = 32MiB used, which is > 3,125% of the disk. I can't believe my numbers, are they true? No, because most filesystems have a 4k block size. Not to mention that fragmentation is likely to be limited to the single AG the files in the directory belong to. i.e. even if we can't allocate a sunit aligned chunk in an AG, we won't switch to another AG just to do sunit aligned allocation. > OK, this is a worst case scenario, and as you've said before, any > filesystem can be considered full at 85% fill grade. But it's incredible > how quickly you could fuck up a filesystem when using su/sw and writing > small files. Well, don't use a filesystem that is optimised for storing large sizes, large files and high bandwidth for storing lots of small files, then. Indeed, the point of not packing the files is so they -don't fragment as they grow-. XFS is not designed to be optimal for small filesystems or small files. In most cases it will deal with them just fine, so in reality your concerns are mostly unfounded... BTW, ext3/ext4 do exactly the same thing with spreading files out over block groups before packing them tightly when there are no more empty block groups left.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
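Putting Michael's worst case together with Dave's corrections (alignment is to the 64KiB sunit, not the stripe width, and allocation happens in 4KiB blocks), the numbers look like this. A sketch of the arithmetic only, not of the real allocator:

```shell
#!/bin/sh
# Worst case from the exchange above: a 1 GiB region, su=64 KiB,
# one tiny file started in every sunit-aligned slot before packing begins.
region_kib=$((1024 * 1024))    # 1 GiB in KiB
sunit_kib=64
slots=$((region_kib / sunit_kib))
echo "sunit-aligned slots: ${slots}"                       # 16384
echo "2 KiB of data per file: $((slots * 2 / 1024)) MiB"   # 32 MiB
# With 4 KiB filesystem blocks each file occupies at least one block,
# so the space actually consumed is double Michael's figure:
echo "4 KiB block per file: $((slots * 4 / 1024)) MiB"     # 64 MiB, 6.25%
```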
* Re: 30 TB RAID6 + XFS slow write performance 2011-07-21 6:48 ` Dave Chinner @ 2011-07-22 6:10 ` Michael Monnerie 2011-07-22 18:05 ` Stan Hoeppner 0 siblings, 1 reply; 17+ messages in thread From: Michael Monnerie @ 2011-07-22 6:10 UTC (permalink / raw) To: xfs; +Cc: Stan Hoeppner, John Bokma [-- Attachment #1.1: Type: Text/Plain, Size: 2684 bytes --] On Donnerstag, 21. Juli 2011 Dave Chinner wrote: > If you are writing files that grow like this, then you are doing > something wrong. If the app can't do its IO differently, then this > is exactly the reason we have userspace-controlled preallocation > interfaces. > > Filesystems cannot prevent user stupidity from screwing something > up.... This can happen if you copy a syslog server over to a new disk, then let it start its work again. Many files that start small and grow. Luckily, the logs are rotated monthly at the latest, so it shouldn't be too bad. > > And files >64KiB are immediately fragmented > > then. At this time, there are only 16384 * 2KiB = 32MiB used, which > > is 3,125% of the disk. I can't believe my numbers, are they true? > > No, because most filesystems have a 4k block size. I just meant pure disk usage. Of 1GB, only 32MB are used, and this worst case example hits us badly. > Not to mention > that fragmentation is likely to be limited to the single AG the files > in the directory belong to. i.e. even if we can't allocate a sunit > aligned chunk in an AG, we won't switch to another AG just to do > sunit aligned allocation. This is good to know also, thanks. > > OK, this is a worst case scenario, and as you've said before, any > > filesystem can be considered full at 85% fill grade. But it's > > incredible how quickly you could fuck up a filesystem when using > > su/sw and writing small files. > > Well, don't use a filesystem that is optimised for storing large > sizes, large files and high bandwidth for storing lots of small > files, then. 
Indeed, the point of not packing the files is so they > -don't fragment as they grow-. XFS is not designed to be optimal > for small filesystems or small files. In most cases it will deal > with them just fine, so in reality your concerns are mostly > unfounded... Yes, I just wanted to know about the corner cases, and how XFS behaves. Actually, we're changing over to using NetApps, and with their WAFL anyway I should drop all su/sw usage and just use 4KB blocks. And even when XFS is optimized for large files, there are often small ones. Think of a MySQL server with hundreds of DBs and innodb_file_per_table set. Even when some DBs are large, there are many small files. But this thread has drifted a bit. XFS does great work, and now I understand the background a bit more. Thanks, Dave. -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services: Protéger http://proteger.at [gesprochen: Prot-e-schee] Tel: +43 660 / 415 6531 // Haus zu verkaufen: http://zmi.at/langegg/ [-- Attachment #1.2: This is a digitally signed message part. --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: 30 TB RAID6 + XFS slow write performance 2011-07-22 6:10 ` Michael Monnerie @ 2011-07-22 18:05 ` Stan Hoeppner 2011-07-22 23:10 ` Dave Chinner 0 siblings, 1 reply; 17+ messages in thread From: Stan Hoeppner @ 2011-07-22 18:05 UTC (permalink / raw) To: Michael Monnerie; +Cc: John Bokma, xfs On 7/22/2011 1:10 AM, Michael Monnerie wrote: > Yes, I just wanted to know about the corner cases, and how XFS behaves. > Actually, we're changing over to using NetApps, and with their WAFL > anyway I should drop all su/sw usage and just use 4KB blocks. I've never used a NetApp filer myself. However, that said, I would assume that WAFL is only in play for NFS/CIFS transactions since WAFL is itself a filesystem. When exposing LUNs from the same filer to FC and iSCSI hosts I would assume the filer acts just as any other SAN controller would. In this case I would think you'd probably still want to align your XFS filesystem to the underlying RAID stripe from which the LUN was carved. -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: 30 TB RAID6 + XFS slow write performance 2011-07-22 18:05 ` Stan Hoeppner @ 2011-07-22 23:10 ` Dave Chinner 2011-07-24 6:14 ` Stan Hoeppner 0 siblings, 1 reply; 17+ messages in thread From: Dave Chinner @ 2011-07-22 23:10 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Michael Monnerie, John Bokma, xfs On Fri, Jul 22, 2011 at 01:05:14PM -0500, Stan Hoeppner wrote: > On 7/22/2011 1:10 AM, Michael Monnerie wrote: > > > Yes, I just wanted to know about the corner cases, and how XFS behaves. > > Actually, we're changing over to using NetApps, and with their WAFL > > anyway I should drop all su/sw usage and just use 4KB blocks. > > I've never used a NetApp filer myself. However, that said, I would > assume that WAFL is only in play for NFS/CIFS transactions since WAFL is > itself a filesystem. Netapp's website is busted, so here's a cached link: http://webcache.googleusercontent.com/search?q=cache:9DdO2a16hdIJ:blogs.netapp.com/extensible_netapp/2008/10/what-is-wafl--3.html+netapp+san+wafl&cd=1&hl=en&ct=clnk&source=www.google.com "The point is that WAFL is the part of the code that provides the 'read or write from-disk' mechanisms to both NFS and CIFS and SAN. The semantics of how the blocks are accessed are provided by higher level code not by WAFL, which means WAFL is not a file system." If you can be bothered trolling for that entire series of blog posts in the google cache, it's probably a good idea so you can get a basic understanding of what WAFL actually is. > When exposing LUNs from the same filer to FC and iSCSI hosts I would > assume the filer acts just as any other SAN controller would. It has its own quirks, just like any other FC attached RAID array... > In this case I would think you'd probably still want to align your > XFS filesystem to the underlying RAID stripe from which the LUN > was carved. 
Which actually matters very little when WAFL sits between the FS and the disk, because WAFL uses copy-on-write and stages all its writes through NVRAM, and so you've got no idea what the alignment of any given address in the filesystem maps to, anyway. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: 30 TB RAID6 + XFS slow write performance 2011-07-22 23:10 ` Dave Chinner @ 2011-07-24 6:14 ` Stan Hoeppner 2011-07-24 8:47 ` Michael Monnerie 0 siblings, 1 reply; 17+ messages in thread From: Stan Hoeppner @ 2011-07-24 6:14 UTC (permalink / raw) To: Dave Chinner; +Cc: Michael Monnerie, xfs, John Bokma On 7/22/2011 6:10 PM, Dave Chinner wrote: > On Fri, Jul 22, 2011 at 01:05:14PM -0500, Stan Hoeppner wrote: >> I've never used a NetApp filer myself. However, that said, I would >> assume that WAFL is only in play for NFS/CIFS transactions since WAFL is >> itself a filesystem. > > Netapp's website is busted, so here's a cached link: > > http://webcache.googleusercontent.com/search?q=cache:9DdO2a16hdIJ:blogs.netapp.com/extensible_netapp/2008/10/what-is-wafl--3.html+netapp+san+wafl&cd=1&hl=en&ct=clnk&source=www.google.com This is interesting: http://communities.netapp.com/community/netapp-blogs/dave/blog/2008/12/08/is-wafl-a-filesystem The author implemented WAFL in two layers. The bottom layer handles block stuff including volume management, dedup, snapshots, etc, and the top layer functions as multiple file systems, amongst other duties. > If you can be bothered trolling for that entire series of blog posts > in the google cache, it's probably a good idea so you can get a > basic understanding of what WAFL actually is. It's never a bother to learn something new. :) >> When exposing LUNs from the same filer to FC and iSCSI hosts I would >> assume the filer acts just as any other SAN controller would. > > It has it's own quirks, just like any other FC attached RAID array... > >> In this case I would think you'd probably still want to align your >> XFS filesystem to the underlying RAID stripe from which the LUN >> was carved. 
> > Which actually matters very little when WAFL between the FS and the > disk because WAFL uses copy-on-write and stages all it's writes > through NVRAM and so you've got no idea what the alignment of any > given address in the filesystem maps to, anyway. Is the NetApp FC/iSCSI attachment performance still competitive for large file/streaming IO, given that one can't optimize XFS stripe alignment, and with no indication of where the file fragments are actually written on the media? Or does it lag behind something like a roughly equivalent class Infinite Storage array, or IBM DS? -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: 30 TB RAID6 + XFS slow write performance 2011-07-24 6:14 ` Stan Hoeppner @ 2011-07-24 8:47 ` Michael Monnerie 0 siblings, 0 replies; 17+ messages in thread From: Michael Monnerie @ 2011-07-24 8:47 UTC (permalink / raw) To: xfs; +Cc: Stan Hoeppner, John Bokma [-- Attachment #1.1: Type: Text/Plain, Size: 1410 bytes --] On Sonntag, 24. Juli 2011 Stan Hoeppner wrote: > Is the NetApp FC/iSCSI attachment performance still competitive for > large file/streaming IO, given that one can't optimize XFS stripe > alignment, and with no indication of where the file fragments are > actually written on the media? Or does it lag behind something like > a roughly equivalent class Infinite Storage array, or IBM DS? I can't tell about the performance difference. But I'd like to explain two fundamental differences to all other storages: 1) WAFL *never* overwrites an existing block. Whenever there's a write to an existing block, that block is instead written to a new location, and afterwards the old block is mapped to the new one. This is a key factor in keeping performance up when using snapshots and deduplication. 2) WAFL never does small or random writes. All writes are collected in NVRAM, and then written as one large sequential write; always one full stripe is written. That means for workloads with lots of small random writes, NetApp storages beat the hell out of the disks, compared to other storages. I can't tell for large seq. writes, though, I don't have such a workload. -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services: Protéger http://proteger.at [gesprochen: Prot-e-schee] Tel: +43 660 / 415 6531 // Haus zu verkaufen: http://zmi.at/langegg/ [-- Attachment #1.2: This is a digitally signed message part. 
--] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2011-07-24 8:47 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-07-18 19:58 30 TB RAID6 + XFS slow write performance John Bokma 2011-07-19 0:00 ` Eric Sandeen 2011-07-19 8:37 ` Emmanuel Florac 2011-07-19 22:37 ` Stan Hoeppner 2011-07-20 0:20 ` Dave Chinner 2011-07-20 5:16 ` Stan Hoeppner 2011-07-20 6:44 ` Dave Chinner 2011-07-20 12:10 ` Stan Hoeppner 2011-07-20 14:04 ` Michael Monnerie 2011-07-20 23:01 ` Dave Chinner 2011-07-21 6:19 ` Michael Monnerie 2011-07-21 6:48 ` Dave Chinner 2011-07-22 6:10 ` Michael Monnerie 2011-07-22 18:05 ` Stan Hoeppner 2011-07-22 23:10 ` Dave Chinner 2011-07-24 6:14 ` Stan Hoeppner 2011-07-24 8:47 ` Michael Monnerie