* Filesystem writes on RAID5 too slow
@ 2013-11-18 16:02 Martin Boutin
  2013-11-18 18:28 ` Eric Sandeen
  2013-11-18 18:41 ` Roman Mamedov
  0 siblings, 2 replies; 19+ messages in thread
From: Martin Boutin @ 2013-11-18 16:02 UTC (permalink / raw)
  To: Kernel.org-Linux-RAID; +Cc: Kernel.org-Linux-XFS, Kernel.org-Linux-EXT4

Dear list,

I am writing about an apparent issue (or maybe it is normal, that's my
question) regarding filesystem write speed in a linux raid device.
More specifically, I have linux-3.10.10 running in an Intel Haswell
embedded system with 3 HDDs in a RAID-5 configuration.

The hard disks have 4k physical sectors which are reported as 512 bytes
logical size. I made sure the partitions underlying the raid device
start at sector 2048. The RAID device has version 1.2 metadata and 4k
(bytes) of data offset, therefore the data should also be 4k aligned.
The raid chunk size is 512K.

I have the md0 raid device formatted as ext3 with a 4k block size, and
stride and stripe-width correctly chosen to match the raid chunk size,
that is, stride=128,stripe-width=256.

While working on a small university project, I noticed that the write
speeds when using a filesystem over raid are *much* slower than when
writing directly to the raid device (or even compared to filesystem
read speeds).

The command line for measuring filesystem read and write speeds was:

$ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct

The command line for measuring raw read and write speeds was:

$ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
$ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct

Here are some speed measures using dd (an average of 20 runs):

device      raw/fs   mode    speed (MB/s)   slowdown (%)
/dev/md0    raw      read    207
/dev/md0    raw      write   209
/dev/md1    raw      read    214
/dev/md1    raw      write   212

/dev/md0    xfs      read    188            9
/dev/md0    xfs      write    35            83

/dev/md1    ext3     read    199            7
/dev/md1    ext3     write    36            83

/dev/md0    ufs      read    212            0
/dev/md0    ufs      write    53            75

/dev/md0    ext2     read    202            2
/dev/md0    ext2     write    34            84

Is it possible that the filesystem has such an enormous impact on the
write speed? We are talking about a slowdown of 80%! Even a filesystem
as simple as ufs has a slowdown of 75%. What am I missing?

Thank you,
-- 
Martin Boutin

^ permalink raw reply	[flat|nested] 19+ messages in thread
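For reference, a rough sketch of how the setup described above might have been
created and checked. The actual command lines were not posted (they are asked
for later in the thread), so everything here is a reconstruction under the
stated parameters; device and partition names are taken from later messages.

# 3-disk RAID5, 512KiB chunk, v1.2 metadata (illustrative reconstruction)
$ mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=512 \
        --metadata=1.2 /dev/sda2 /dev/sdb2 /dev/sdc2

# ext3: stride = chunk / fs block = 512KiB / 4KiB = 128
#       stripe-width = stride * 2 data disks = 256
$ mkfs.ext3 -b 4096 -E stride=128,stripe-width=256 /dev/md0

# Confirm what the array and the partitions actually report
$ mdadm --detail /dev/md0
$ cat /sys/block/md0/md/chunk_size
$ parted /dev/sda unit s print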
* Re: Filesystem writes on RAID5 too slow 2013-11-18 16:02 Filesystem writes on RAID5 too slow Martin Boutin @ 2013-11-18 18:28 ` Eric Sandeen 2013-11-19 0:57 ` Dave Chinner 2013-11-18 18:41 ` Roman Mamedov 1 sibling, 1 reply; 19+ messages in thread From: Eric Sandeen @ 2013-11-18 18:28 UTC (permalink / raw) To: Martin Boutin, Kernel.org-Linux-RAID; +Cc: xfs-oss, Kernel.org-Linux-EXT4 On 11/18/13, 10:02 AM, Martin Boutin wrote: > Dear list, > > I am writing about an apparent issue (or maybe it is normal, that's my > question) regarding filesystem write speed in in a linux raid device. > More specifically, I have linux-3.10.10 running in an Intel Haswell > embedded system with 3 HDDs in a RAID-5 configuration. > The hard disks have 4k physical sectors which are reported as 512 > logical size. I made sure the partitions underlying the raid device > start at sector 2048. (fixed cc: to xfs list) > The RAID device has version 1.2 metadata and 4k (bytes) of data > offset, therefore the data should also be 4k aligned. The raid chunk > size is 512K. > > I have the md0 raid device formatted as ext3 with a 4k block size, and > stride and stripes correctly chosen to match the raid chunk size, that > is, stride=128,stripe-width=256. > > While I was working in a small university project, I just noticed that > the write speeds when using a filesystem over raid are *much* slower > than when writing directly to the raid device (or even compared to > filesystem read speeds). > > The command line for measuring filesystem read and write speeds was: > > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct > > The command line for measuring raw read and write speeds was: > > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct > > Here are some speed measures using dd (an average of 20 runs).: > > device raw/fs mode speed (MB/s) slowdown (%) > /dev/md0 raw read 207 > /dev/md0 raw write 209 > /dev/md1 raw read 214 > /dev/md1 raw write 212 > > /dev/md0 xfs read 188 9 > /dev/md0 xfs write 35 83 > > /dev/md1 ext3 read 199 7 > /dev/md1 ext3 write 36 83 > > /dev/md0 ufs read 212 0 > /dev/md0 ufs write 53 75 > > /dev/md0 ext2 read 202 2 > /dev/md0 ext2 write 34 84 > > Is it possible that the filesystem has such enormous impact in the > write speed? We are talking about a slowdown of 80%!!! Even a > filesystem as simple as ufs has a slowdown of 75%! What am I missing? One thing you're missing is enough info to debug this. /proc/mdstat, kernel version, xfs_info output, mkfs commandlines used, partition table details, etc. If something is misaligned and you are doing RMW for these IOs it could hurt a lot. -Eric > Thank you, > ^ permalink raw reply [flat|nested] 19+ messages in thread
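A minimal sketch of the kind of information being asked for here; the device
names and the mount point are assumptions carried over from the original
report, and the ext3 device is shown with tune2fs since xfs_info only applies
to the XFS case.

$ uname -a
$ cat /proc/mdstat
$ mdadm --detail /dev/md0
$ fdisk -l -u /dev/sda /dev/sdb /dev/sdc        # partition start sectors
$ xfs_info /tmp/diskmnt                         # XFS geometry as mounted
$ tune2fs -l /dev/md1 | grep -iE 'stride|stripe' # ext3 stride/stripe-width
$ grep md /proc/mounts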
* Re: Filesystem writes on RAID5 too slow 2013-11-18 18:28 ` Eric Sandeen @ 2013-11-19 0:57 ` Dave Chinner 2013-11-21 9:11 ` Martin Boutin 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2013-11-19 0:57 UTC (permalink / raw) To: Eric Sandeen Cc: Martin Boutin, Kernel.org-Linux-RAID, Kernel.org-Linux-EXT4, xfs-oss On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote: > On 11/18/13, 10:02 AM, Martin Boutin wrote: > > Dear list, > > > > I am writing about an apparent issue (or maybe it is normal, that's my > > question) regarding filesystem write speed in in a linux raid device. > > More specifically, I have linux-3.10.10 running in an Intel Haswell > > embedded system with 3 HDDs in a RAID-5 configuration. > > The hard disks have 4k physical sectors which are reported as 512 > > logical size. I made sure the partitions underlying the raid device > > start at sector 2048. > > (fixed cc: to xfs list) > > > The RAID device has version 1.2 metadata and 4k (bytes) of data > > offset, therefore the data should also be 4k aligned. The raid chunk > > size is 512K. > > > > I have the md0 raid device formatted as ext3 with a 4k block size, and > > stride and stripes correctly chosen to match the raid chunk size, that > > is, stride=128,stripe-width=256. > > > > While I was working in a small university project, I just noticed that > > the write speeds when using a filesystem over raid are *much* slower > > than when writing directly to the raid device (or even compared to > > filesystem read speeds). > > > > The command line for measuring filesystem read and write speeds was: > > > > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct > > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct > > > > The command line for measuring raw read and write speeds was: > > > > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct > > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct > > > > Here are some speed measures using dd (an average of 20 runs).: > > > > device raw/fs mode speed (MB/s) slowdown (%) > > /dev/md0 raw read 207 > > /dev/md0 raw write 209 > > /dev/md1 raw read 214 > > /dev/md1 raw write 212 So, that's writing to the first 1GB of /dev/md0, and all the writes are going to be aligned to the MD stripe. > > /dev/md0 xfs read 188 9 > > /dev/md0 xfs write 35 83o And these will not be written to the first 1GB of the block device but somewhere else. Most likely a region that hasn't otherwise been used, and so isn't going to be overwriting the same blocks like the /dev/md0 case is going to be. Perhaps there's some kind of stripe caching effect going on here? Was the md device fully initialised before you ran these tests? > > > > /dev/md1 ext3 read 199 7 > > /dev/md1 ext3 write 36 83 > > > > /dev/md0 ufs read 212 0 > > /dev/md0 ufs write 53 75 > > > > /dev/md0 ext2 read 202 2 > > /dev/md0 ext2 write 34 84 I suspect what you are seeing here is either the latency introduced by having to allocate blocks before issuing the IO, or the file layout due to allocation is not idea. Single threaded direct IO is latency bound, not bandwidth bound and, as such, is IO size sensitive. Allocation for direct IO is also IO size sensitive - there's typically an allocation per IO, so the more IO you have to do, the more allocation that occurs. So, on XFS, what does "xfs_bmap -vvp /tmp/diskmnt/filewr.zero" output for the file you wrote? 
Specifically, I'm interested whether it aligned the allocations to the
stripe unit boundary, and if so, what offset into the device those
extents sit at....

Also, you should run iostat and blktrace to determine if MD is doing
RMW cycles when being written to through the filesystem.

> > Is it possible that the filesystem has such enormous impact in the
> > write speed? We are talking about a slowdown of 80%!!! Even a
> > filesystem as simple as ufs has a slowdown of 75%! What am I missing?
>
> One thing you're missing is enough info to debug this.
>
> /proc/mdstat, kernel version, xfs_info output, mkfs commandlines used,
> partition table details, etc.

There's a good list here:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> If something is misaligned and you are doing RMW for these IOs it could
> hurt a lot.
>
> -Eric
>
> > Thank you,
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread
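A sketch of how the iostat/blktrace check suggested above might be run while
the dd write test is in progress. The member partition names (sda2/sdb2/sdc2)
only appear later in the thread, so they are assumptions here, and the sysfs
stripe-cache files exist only for raid5/6 arrays.

# Reads on the member disks during a pure write workload are a strong
# hint that MD is doing read-modify-write cycles.
$ iostat -d -k 5 /dev/sda2 /dev/sdb2 /dev/sdc2

# Trace one member for 30 seconds and inspect request sizes/offsets
$ blktrace -d /dev/sda2 -w 30 -o sda2.trace
$ blkparse -i sda2.trace | less

# MD stripe cache activity while the test runs
$ cat /sys/block/md0/md/stripe_cache_size
$ cat /sys/block/md0/md/stripe_cache_active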
* Re: Filesystem writes on RAID5 too slow 2013-11-19 0:57 ` Dave Chinner @ 2013-11-21 9:11 ` Martin Boutin 2013-11-21 9:26 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: Martin Boutin @ 2013-11-21 9:11 UTC (permalink / raw) To: Dave Chinner Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss, Kernel.org-Linux-EXT4 On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote: >> On 11/18/13, 10:02 AM, Martin Boutin wrote: >> > Dear list, >> > >> > I am writing about an apparent issue (or maybe it is normal, that's my >> > question) regarding filesystem write speed in in a linux raid device. >> > More specifically, I have linux-3.10.10 running in an Intel Haswell >> > embedded system with 3 HDDs in a RAID-5 configuration. >> > The hard disks have 4k physical sectors which are reported as 512 >> > logical size. I made sure the partitions underlying the raid device >> > start at sector 2048. >> >> (fixed cc: to xfs list) >> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data >> > offset, therefore the data should also be 4k aligned. The raid chunk >> > size is 512K. >> > >> > I have the md0 raid device formatted as ext3 with a 4k block size, and >> > stride and stripes correctly chosen to match the raid chunk size, that >> > is, stride=128,stripe-width=256. >> > >> > While I was working in a small university project, I just noticed that >> > the write speeds when using a filesystem over raid are *much* slower >> > than when writing directly to the raid device (or even compared to >> > filesystem read speeds). >> > >> > The command line for measuring filesystem read and write speeds was: >> > >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct >> > >> > The command line for measuring raw read and write speeds was: >> > >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct >> > >> > Here are some speed measures using dd (an average of 20 runs).: >> > >> > device raw/fs mode speed (MB/s) slowdown (%) >> > /dev/md0 raw read 207 >> > /dev/md0 raw write 209 >> > /dev/md1 raw read 214 >> > /dev/md1 raw write 212 > > So, that's writing to the first 1GB of /dev/md0, and all the writes > are going to be aligned to the MD stripe. > >> > /dev/md0 xfs read 188 9 >> > /dev/md0 xfs write 35 83o > > And these will not be written to the first 1GB of the block device > but somewhere else. Most likely a region that hasn't otherwise been > used, and so isn't going to be overwriting the same blocks like the > /dev/md0 case is going to be. Perhaps there's some kind of stripe > caching effect going on here? Was the md device fully initialised > before you ran these tests? > >> > >> > /dev/md1 ext3 read 199 7 >> > /dev/md1 ext3 write 36 83 >> > >> > /dev/md0 ufs read 212 0 >> > /dev/md0 ufs write 53 75 >> > >> > /dev/md0 ext2 read 202 2 >> > /dev/md0 ext2 write 34 84 > > I suspect what you are seeing here is either the latency introduced > by having to allocate blocks before issuing the IO, or the file > layout due to allocation is not idea. Single threaded direct IO is > latency bound, not bandwidth bound and, as such, is IO size > sensitive. Allocation for direct IO is also IO size sensitive - > there's typically an allocation per IO, so the more IO you have to > do, the more allocation that occurs. 
I just did a few more tests, this time with ext4: device raw/fs mode speed (MB/s) slowdown (%) /dev/md0 ext4 read 199 4% /dev/md0 ext4 write 210 0% This time, no slowdown at all on ext4. I believe this is due to the multiblock allocation feature of ext4 (I'm using O_DIRECT, so it should be it). So I guess for the other filesystems, it was indeed the latency introduced by block allocation. Thanks, - Martin > > So, on XFS, what does "xfs_bmap -vvp /tmp/diskmnt/filewr.zero" > output for the file you wrote? Specifically, I'm interested whether > it aligned the allocations to the stripe unit boundary, and if so, > what offset into the device those extents sit at.... > > Also, you should run iostat and blktrace to determine if MD is > doing RMW cycles when being written to through the filesystem. > >> > Is it possible that the filesystem has such enormous impact in the >> > write speed? We are talking about a slowdown of 80%!!! Even a >> > filesystem as simple as ufs has a slowdown of 75%! What am I missing? >> >> One thing you're missing is enough info to debug this. >> >> /proc/mdstat, kernel version, xfs_info output, mkfs commandlines used, >> partition table details, etc. > > THere's a good list here: > > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F > >> If something is misaligned and you are doing RMW for these IOs it could >> hurt a lot. >> >> -Eric >> >> > Thank you, >> > >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
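One way to sanity-check the allocation-latency theory is to compare how many
extents each filesystem ended up allocating for the same 1GB O_DIRECT write.
This is only a sketch; the file path is the one used in the earlier dd
commands, and filefrag (from e2fsprogs) is assumed to be available.

# Extent count and layout of the test file (works on ext2/3/4, and on
# most filesystems supporting FIEMAP)
$ filefrag -v /tmp/diskmnt/filewr.zero

# On XFS, the equivalent plus stripe alignment flags
$ xfs_bmap -vvp /tmp/diskmnt/filewr.zero

A single large extent on ext4 versus many small extents on ext3/ext2 would
support the per-IO block allocation explanation; near-identical layouts would
point at something else, such as alignment.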
* Re: Filesystem writes on RAID5 too slow 2013-11-21 9:11 ` Martin Boutin @ 2013-11-21 9:26 ` Dave Chinner 2013-11-21 9:50 ` Martin Boutin 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2013-11-21 9:26 UTC (permalink / raw) To: Martin Boutin Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss, Kernel.org-Linux-EXT4 On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote: > On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote: > >> On 11/18/13, 10:02 AM, Martin Boutin wrote: > >> > Dear list, > >> > > >> > I am writing about an apparent issue (or maybe it is normal, that's my > >> > question) regarding filesystem write speed in in a linux raid device. > >> > More specifically, I have linux-3.10.10 running in an Intel Haswell > >> > embedded system with 3 HDDs in a RAID-5 configuration. > >> > The hard disks have 4k physical sectors which are reported as 512 > >> > logical size. I made sure the partitions underlying the raid device > >> > start at sector 2048. > >> > >> (fixed cc: to xfs list) > >> > >> > The RAID device has version 1.2 metadata and 4k (bytes) of data > >> > offset, therefore the data should also be 4k aligned. The raid chunk > >> > size is 512K. > >> > > >> > I have the md0 raid device formatted as ext3 with a 4k block size, and > >> > stride and stripes correctly chosen to match the raid chunk size, that > >> > is, stride=128,stripe-width=256. > >> > > >> > While I was working in a small university project, I just noticed that > >> > the write speeds when using a filesystem over raid are *much* slower > >> > than when writing directly to the raid device (or even compared to > >> > filesystem read speeds). > >> > > >> > The command line for measuring filesystem read and write speeds was: > >> > > >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct > >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct > >> > > >> > The command line for measuring raw read and write speeds was: > >> > > >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct > >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct > >> > > >> > Here are some speed measures using dd (an average of 20 runs).: > >> > > >> > device raw/fs mode speed (MB/s) slowdown (%) > >> > /dev/md0 raw read 207 > >> > /dev/md0 raw write 209 > >> > /dev/md1 raw read 214 > >> > /dev/md1 raw write 212 > > > > So, that's writing to the first 1GB of /dev/md0, and all the writes > > are going to be aligned to the MD stripe. > > > >> > /dev/md0 xfs read 188 9 > >> > /dev/md0 xfs write 35 83o > > > > And these will not be written to the first 1GB of the block device > > but somewhere else. Most likely a region that hasn't otherwise been > > used, and so isn't going to be overwriting the same blocks like the > > /dev/md0 case is going to be. Perhaps there's some kind of stripe > > caching effect going on here? Was the md device fully initialised > > before you ran these tests? > > > >> > > >> > /dev/md1 ext3 read 199 7 > >> > /dev/md1 ext3 write 36 83 > >> > > >> > /dev/md0 ufs read 212 0 > >> > /dev/md0 ufs write 53 75 > >> > > >> > /dev/md0 ext2 read 202 2 > >> > /dev/md0 ext2 write 34 84 > > > > I suspect what you are seeing here is either the latency introduced > > by having to allocate blocks before issuing the IO, or the file > > layout due to allocation is not idea. 
Single threaded direct IO is > > latency bound, not bandwidth bound and, as such, is IO size > > sensitive. Allocation for direct IO is also IO size sensitive - > > there's typically an allocation per IO, so the more IO you have to > > do, the more allocation that occurs. > > I just did a few more tests, this time with ext4: > > device raw/fs mode speed (MB/s) slowdown (%) > /dev/md0 ext4 read 199 4% > /dev/md0 ext4 write 210 0% > > This time, no slowdown at all on ext4. I believe this is due to the > multiblock allocation feature of ext4 (I'm using O_DIRECT, so it > should be it). So I guess for the other filesystems, it was indeed > the latency introduced by block allocation. Except that XFS does extent based allocation as well, so that's not likely the reason. The fact that ext4 doesn't see a slowdown like every other filesystem really doesn't make a lot of sense to me, either from an IO dispatch point of view or an IO alignment point of view. Why? Because all the filesystems align identically to the underlying device and all should be doing 4k block aligned IO, and XFS has roughly the same allocation overhead for this workload as ext4. Did you retest XFS or any of the other filesystems directly after running the ext4 tests (i.e. confirm you are testing apples to apples)? What we need to determine why other filesystems are slow (and why ext4 is fast) is more information about your configuration and block traces showing what is happening at the IO level, like was requested in a previous email.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
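A sketch of one way to separate allocation overhead from the rest of the
write path: preallocate the file first, then repeat the O_DIRECT write into
the already-allocated blocks. The fallocate(1) tool is an addition not used
in the thread; it needs a filesystem that supports fallocate (XFS and ext4
do, ext3 does not), and on XFS the overwrite still pays for unwritten-extent
conversion, so the comparison is indicative rather than exact.

# Pass 1: write into a freshly created file (allocation on every IO)
$ rm -f /tmp/diskmnt/filewr.zero
$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct

# Pass 2: preallocate, then overwrite in place
$ fallocate -l 1048576000 /tmp/diskmnt/prealloc.zero
$ dd if=/dev/zero of=/tmp/diskmnt/prealloc.zero bs=1M count=1000 \
     oflag=direct conv=notrunc

If pass 2 is much faster than pass 1, per-IO allocation is the bottleneck; if
both are equally slow, the problem is more likely in the resulting layout.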
* Re: Filesystem writes on RAID5 too slow 2013-11-21 9:26 ` Dave Chinner @ 2013-11-21 9:50 ` Martin Boutin 2013-11-21 13:31 ` Martin Boutin 0 siblings, 1 reply; 19+ messages in thread From: Martin Boutin @ 2013-11-21 9:50 UTC (permalink / raw) To: Dave Chinner Cc: Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4, xfs-oss On Thu, Nov 21, 2013 at 4:26 AM, Dave Chinner <david@fromorbit.com> wrote: > On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote: >> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote: >> >> On 11/18/13, 10:02 AM, Martin Boutin wrote: >> >> > Dear list, >> >> > >> >> > I am writing about an apparent issue (or maybe it is normal, that's my >> >> > question) regarding filesystem write speed in in a linux raid device. >> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell >> >> > embedded system with 3 HDDs in a RAID-5 configuration. >> >> > The hard disks have 4k physical sectors which are reported as 512 >> >> > logical size. I made sure the partitions underlying the raid device >> >> > start at sector 2048. >> >> >> >> (fixed cc: to xfs list) >> >> >> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data >> >> > offset, therefore the data should also be 4k aligned. The raid chunk >> >> > size is 512K. >> >> > >> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and >> >> > stride and stripes correctly chosen to match the raid chunk size, that >> >> > is, stride=128,stripe-width=256. >> >> > >> >> > While I was working in a small university project, I just noticed that >> >> > the write speeds when using a filesystem over raid are *much* slower >> >> > than when writing directly to the raid device (or even compared to >> >> > filesystem read speeds). >> >> > >> >> > The command line for measuring filesystem read and write speeds was: >> >> > >> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct >> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct >> >> > >> >> > The command line for measuring raw read and write speeds was: >> >> > >> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct >> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct >> >> > >> >> > Here are some speed measures using dd (an average of 20 runs).: >> >> > >> >> > device raw/fs mode speed (MB/s) slowdown (%) >> >> > /dev/md0 raw read 207 >> >> > /dev/md0 raw write 209 >> >> > /dev/md1 raw read 214 >> >> > /dev/md1 raw write 212 >> > >> > So, that's writing to the first 1GB of /dev/md0, and all the writes >> > are going to be aligned to the MD stripe. >> > >> >> > /dev/md0 xfs read 188 9 >> >> > /dev/md0 xfs write 35 83o >> > >> > And these will not be written to the first 1GB of the block device >> > but somewhere else. Most likely a region that hasn't otherwise been >> > used, and so isn't going to be overwriting the same blocks like the >> > /dev/md0 case is going to be. Perhaps there's some kind of stripe >> > caching effect going on here? Was the md device fully initialised >> > before you ran these tests? 
>> > >> >> > >> >> > /dev/md1 ext3 read 199 7 >> >> > /dev/md1 ext3 write 36 83 >> >> > >> >> > /dev/md0 ufs read 212 0 >> >> > /dev/md0 ufs write 53 75 >> >> > >> >> > /dev/md0 ext2 read 202 2 >> >> > /dev/md0 ext2 write 34 84 >> > >> > I suspect what you are seeing here is either the latency introduced >> > by having to allocate blocks before issuing the IO, or the file >> > layout due to allocation is not idea. Single threaded direct IO is >> > latency bound, not bandwidth bound and, as such, is IO size >> > sensitive. Allocation for direct IO is also IO size sensitive - >> > there's typically an allocation per IO, so the more IO you have to >> > do, the more allocation that occurs. >> >> I just did a few more tests, this time with ext4: >> >> device raw/fs mode speed (MB/s) slowdown (%) >> /dev/md0 ext4 read 199 4% >> /dev/md0 ext4 write 210 0% >> >> This time, no slowdown at all on ext4. I believe this is due to the >> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it >> should be it). So I guess for the other filesystems, it was indeed >> the latency introduced by block allocation. > > Except that XFS does extent based allocation as well, so that's not > likely the reason. The fact that ext4 doesn't see a slowdown like > every other filesystem really doesn't make a lot of sense to > me, either from an IO dispatch point of view or an IO alignment > point of view. > > Why? Because all the filesystems align identically to the underlying > device and all should be doing 4k block aligned IO, and XFS has > roughly the same allocation overhead for this workload as ext4. > Did you retest XFS or any of the other filesystems directly after > running the ext4 tests (i.e. confirm you are testing apples to > apples)? Yes I did, the performance figures did not change for either XFS or ext3. > > What we need to determine why other filesystems are slow (and why > ext4 is fast) is more information about your configuration and block > traces showing what is happening at the IO level, like was requested > in a previous email.... Ok, I'm going to try coming up with meaningful data. Thanks. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Martin Boutin _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
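Regarding the question above about whether the md device was fully
initialised: a quick sketch of how that can be checked before benchmarking,
since an array that is still resyncing or running degraded behaves very
differently from a clean one. Device names follow the report.

$ cat /proc/mdstat                    # look for "resync"/"recovery" or [_U]
$ mdadm --detail /dev/md0 | grep -E 'State|Rebuild'
$ cat /sys/block/md0/md/sync_action   # "idle" once initialisation is done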
* Re: Filesystem writes on RAID5 too slow 2013-11-21 9:50 ` Martin Boutin @ 2013-11-21 13:31 ` Martin Boutin 2013-11-21 16:35 ` Martin Boutin 2013-11-21 23:41 ` Dave Chinner 0 siblings, 2 replies; 19+ messages in thread From: Martin Boutin @ 2013-11-21 13:31 UTC (permalink / raw) To: Dave Chinner Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss, Kernel.org-Linux-EXT4 $ uname -a Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013 i686 GNU/Linux $ xfs_repair -V xfs_repair version 3.1.4 $ cat /proc/cpuinfo | grep processor processor : 0 processor : 1 $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0 $ mount -t xfs /dev/md0 /tmp/diskmnt/ $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s $ cat /proc/meminfo MemTotal: 1313956 kB MemFree: 1099936 kB Buffers: 13232 kB Cached: 141452 kB SwapCached: 0 kB Active: 128960 kB Inactive: 55936 kB Active(anon): 30548 kB Inactive(anon): 1096 kB Active(file): 98412 kB Inactive(file): 54840 kB Unevictable: 0 kB Mlocked: 0 kB HighTotal: 626696 kB HighFree: 452472 kB LowTotal: 687260 kB LowFree: 647464 kB SwapTotal: 72256 kB SwapFree: 72256 kB Dirty: 8 kB Writeback: 0 kB AnonPages: 30172 kB Mapped: 15764 kB Shmem: 1432 kB Slab: 14720 kB SReclaimable: 6632 kB SUnreclaim: 8088 kB KernelStack: 1792 kB PageTables: 1176 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 729232 kB Committed_AS: 734116 kB VmallocTotal: 327680 kB VmallocUsed: 10192 kB VmallocChunk: 294904 kB DirectMap4k: 12280 kB DirectMap4M: 692224 kB $ cat /proc/mounts (...) /dev/md0 /tmp/diskmnt xfs rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0 $ cat /proc/partitions major minor #blocks name 8 0 976762584 sda 8 1 10281600 sda1 8 2 966479960 sda2 8 16 976762584 sdb 8 17 10281600 sdb1 8 18 966479960 sdb2 8 32 976762584 sdc 8 33 10281600 sdc1 8 34 966479960 sdc2 (...) 9 1 20560896 md1 9 0 1932956672 md0 # same layout for other disks $ fdisk -c -u /dev/sda The device presents a logical sector size that is smaller than the physical sector size. Aligning to a physical sector (or optimal I/O) size boundary is recommended, or performance may be impacted. Command (m for help): p Disk /dev/sda: 1000.2 GB, 1000204886016 bytes 255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 4096 bytes / 4096 bytes Disk identifier: 0x00000000 Device Boot Start End Blocks Id System /dev/sda1 2048 20565247 10281600 83 Linux /dev/sda2 20565248 1953525167 966479960 83 Linux # unfortunately I had to reinitelize the array and recovery takes a while.. it does not impact performance much though. $ cat /proc/mdstat Personalities : [linear] [raid6] [raid5] [raid4] md0 : active raid5 sda2[0] sdc2[3] sdb2[1] 1932956672 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_] [>....................] recovery = 2.4% (23588740/966478336) finish=156.6min speed=100343K/sec bitmap: 0/1 pages [0KB], 2097152KB chunk # sda sdb and sdc are the same model $ hdparm -I /dev/sda /dev/sda: ATA device, with non-removable media Model Number: HGST HCC541010A9E680 (...) 
Firmware Revision: JA0OA560 Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6; Revision: ATA8-AST T13 Project D1697 Revision 0b Standards: Used: unknown (minor revision code 0x0028) Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 1953525168 Logical Sector size: 512 bytes Physical Sector size: 4096 bytes Logical Sector-0 offset: 0 bytes device size with M = 1024*1024: 953869 MBytes device size with M = 1000*1000: 1000204 MBytes (1000 GB) cache/buffer size = 8192 KBytes (type=DualPortCache) Form Factor: 2.5 inch Nominal Media Rotation Rate: 5400 Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = 16 Advanced power management level: 128 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns $ hdparm -I /dev/sd{a,b,c} | grep "Write cache" * Write cache * Write cache * Write cache # therefore write cache is enabled in all drives $ xfs_info /dev/md0 meta-data=/dev/md0 isize=256 agcount=32, agsize=15101312 blks = sectsz=4096 attr=2 data = bsize=4096 blocks=483239168, imaxpct=5 = sunit=128 swidth=256 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal bsize=4096 blocks=8192, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero /tmp/diskmnt/filewr.zero: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2047999]: 2049056..4097055 0 (2049056..4097055) 2048000 01111 FLAG Values: 010000 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 000010 Doesn't begin on stripe width 000001 Doesn't end on stripe width # this does not look good, does it? # run while dd was executing, looks like we have almost the half writes as reads.... $ iostat -d -k 30 2 /dev/sda2 /dev/sdb2 /dev/sdc2 Linux 3.10.10 (haswell1) 11/21/2013 _i686_ (2 CPU) Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda2 13.75 6639.52 232.17 78863819 2757731 sdb2 13.74 6639.42 232.24 78862660 2758483 sdc2 13.68 55.86 6813.67 663443 80932375 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda2 78.27 11191.20 22556.07 335736 676682 sdb2 78.30 11175.73 22589.13 335272 677674 sdc2 78.30 5506.13 28258.47 165184 847754 Thanks - Martin On Thu, Nov 21, 2013 at 4:50 AM, Martin Boutin <martboutin@gmail.com> wrote: > On Thu, Nov 21, 2013 at 4:26 AM, Dave Chinner <david@fromorbit.com> wrote: >> On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote: >>> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote: >>> >> On 11/18/13, 10:02 AM, Martin Boutin wrote: >>> >> > Dear list, >>> >> > >>> >> > I am writing about an apparent issue (or maybe it is normal, that's my >>> >> > question) regarding filesystem write speed in in a linux raid device. >>> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell >>> >> > embedded system with 3 HDDs in a RAID-5 configuration. >>> >> > The hard disks have 4k physical sectors which are reported as 512 >>> >> > logical size. 
I made sure the partitions underlying the raid device >>> >> > start at sector 2048. >>> >> >>> >> (fixed cc: to xfs list) >>> >> >>> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data >>> >> > offset, therefore the data should also be 4k aligned. The raid chunk >>> >> > size is 512K. >>> >> > >>> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and >>> >> > stride and stripes correctly chosen to match the raid chunk size, that >>> >> > is, stride=128,stripe-width=256. >>> >> > >>> >> > While I was working in a small university project, I just noticed that >>> >> > the write speeds when using a filesystem over raid are *much* slower >>> >> > than when writing directly to the raid device (or even compared to >>> >> > filesystem read speeds). >>> >> > >>> >> > The command line for measuring filesystem read and write speeds was: >>> >> > >>> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct >>> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct >>> >> > >>> >> > The command line for measuring raw read and write speeds was: >>> >> > >>> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct >>> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct >>> >> > >>> >> > Here are some speed measures using dd (an average of 20 runs).: >>> >> > >>> >> > device raw/fs mode speed (MB/s) slowdown (%) >>> >> > /dev/md0 raw read 207 >>> >> > /dev/md0 raw write 209 >>> >> > /dev/md1 raw read 214 >>> >> > /dev/md1 raw write 212 >>> > >>> > So, that's writing to the first 1GB of /dev/md0, and all the writes >>> > are going to be aligned to the MD stripe. >>> > >>> >> > /dev/md0 xfs read 188 9 >>> >> > /dev/md0 xfs write 35 83o >>> > >>> > And these will not be written to the first 1GB of the block device >>> > but somewhere else. Most likely a region that hasn't otherwise been >>> > used, and so isn't going to be overwriting the same blocks like the >>> > /dev/md0 case is going to be. Perhaps there's some kind of stripe >>> > caching effect going on here? Was the md device fully initialised >>> > before you ran these tests? >>> > >>> >> > >>> >> > /dev/md1 ext3 read 199 7 >>> >> > /dev/md1 ext3 write 36 83 >>> >> > >>> >> > /dev/md0 ufs read 212 0 >>> >> > /dev/md0 ufs write 53 75 >>> >> > >>> >> > /dev/md0 ext2 read 202 2 >>> >> > /dev/md0 ext2 write 34 84 >>> > >>> > I suspect what you are seeing here is either the latency introduced >>> > by having to allocate blocks before issuing the IO, or the file >>> > layout due to allocation is not idea. Single threaded direct IO is >>> > latency bound, not bandwidth bound and, as such, is IO size >>> > sensitive. Allocation for direct IO is also IO size sensitive - >>> > there's typically an allocation per IO, so the more IO you have to >>> > do, the more allocation that occurs. >>> >>> I just did a few more tests, this time with ext4: >>> >>> device raw/fs mode speed (MB/s) slowdown (%) >>> /dev/md0 ext4 read 199 4% >>> /dev/md0 ext4 write 210 0% >>> >>> This time, no slowdown at all on ext4. I believe this is due to the >>> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it >>> should be it). So I guess for the other filesystems, it was indeed >>> the latency introduced by block allocation. >> >> Except that XFS does extent based allocation as well, so that's not >> likely the reason. 
The fact that ext4 doesn't see a slowdown like >> every other filesystem really doesn't make a lot of sense to >> me, either from an IO dispatch point of view or an IO alignment >> point of view. >> >> Why? Because all the filesystems align identically to the underlying >> device and all should be doing 4k block aligned IO, and XFS has >> roughly the same allocation overhead for this workload as ext4. >> Did you retest XFS or any of the other filesystems directly after >> running the ext4 tests (i.e. confirm you are testing apples to >> apples)? > > Yes I did, the performance figures did not change for either XFS or ext3. >> >> What we need to determine why other filesystems are slow (and why >> ext4 is fast) is more information about your configuration and block >> traces showing what is happening at the IO level, like was requested >> in a previous email.... > > Ok, I'm going to try coming up with meaningful data. Thanks. >> >> Cheers, >> >> Dave. >> -- >> Dave Chinner >> david@fromorbit.com > > > > -- > Martin Boutin ^ permalink raw reply [flat|nested] 19+ messages in thread
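Working the numbers from the xfs_bmap output above: BLOCK-RANGE is reported
in 512-byte basic blocks, and the mount options show sunit=1024, swidth=2048
in the same units (512KiB and 1MiB). Alignment of the extent start can
therefore be checked with simple modulo arithmetic on the values already
posted:

# Extent starts at basic block 2049056; stripe unit = 1024 blocks (512KiB),
# stripe width = 2048 blocks (1MiB).
$ echo $((2049056 % 1024))    # -> 32, i.e. 16KiB past a stripe unit boundary
$ echo $((2049056 % 2048))    # -> 1056, far from a stripe width boundary

That 16KiB offset is consistent with the 01111 flags above, and a
misaligned 1MB-per-IO write stream forces MD to read back data or parity to
complete partial stripes, which is consistent with the read traffic visible
on the member disks in the iostat sample.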
* Re: Filesystem writes on RAID5 too slow 2013-11-21 13:31 ` Martin Boutin @ 2013-11-21 16:35 ` Martin Boutin 2013-11-21 23:41 ` Dave Chinner 1 sibling, 0 replies; 19+ messages in thread From: Martin Boutin @ 2013-11-21 16:35 UTC (permalink / raw) To: Dave Chinner Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss, Kernel.org-Linux-EXT4 Sorry for the spam but I just noticed that the XFS stripe unit does not match the strip unit of the underlying RAID device. I tried to do a mkfs.xfs with a stripe of 512KiB but mkfs.xfs complains that the maximum stripe width is 256KiB. So I recreated the RAID with a stripe of 256KiB: $ cat /proc/mdstat Personalities : [linear] [raid6] [raid5] [raid4] md0 : active raid5 sdc2[3] sdb2[1] sda2[0] 1932957184 blocks super 1.2 level 5, 256k chunk, algorithm 2 [3/2] [UU_] resync=DELAYED bitmap: 1/1 pages [4KB], 2097152KB chunk and called mkf.xfs with proper parameters: $ mkfs.xfs -d sunit=512,swidth=1024 -f -l size=32m /dev/md0 Unfortunately the file is still created unaligned to the RAID stripe. $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero /tmp/diskmnt/filewr.zero: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..507903]: 2048544..2556447 0 (2048544..2556447) 507904 01111 FLAG Values: 010000 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 000010 Doesn't begin on stripe width 000001 Doesn't end on stripe width Now I'm out of ideas.. - Martin On Thu, Nov 21, 2013 at 8:31 AM, Martin Boutin <martboutin@gmail.com> wrote: > $ uname -a > Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013 > i686 GNU/Linux > > $ xfs_repair -V > xfs_repair version 3.1.4 > > $ cat /proc/cpuinfo | grep processor > processor : 0 > processor : 1 > > $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0 > $ mount -t xfs /dev/md0 /tmp/diskmnt/ > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct > 1000+0 records in > 1000+0 records out > 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s > > $ cat /proc/meminfo > MemTotal: 1313956 kB > MemFree: 1099936 kB > Buffers: 13232 kB > Cached: 141452 kB > SwapCached: 0 kB > Active: 128960 kB > Inactive: 55936 kB > Active(anon): 30548 kB > Inactive(anon): 1096 kB > Active(file): 98412 kB > Inactive(file): 54840 kB > Unevictable: 0 kB > Mlocked: 0 kB > HighTotal: 626696 kB > HighFree: 452472 kB > LowTotal: 687260 kB > LowFree: 647464 kB > SwapTotal: 72256 kB > SwapFree: 72256 kB > Dirty: 8 kB > Writeback: 0 kB > AnonPages: 30172 kB > Mapped: 15764 kB > Shmem: 1432 kB > Slab: 14720 kB > SReclaimable: 6632 kB > SUnreclaim: 8088 kB > KernelStack: 1792 kB > PageTables: 1176 kB > NFS_Unstable: 0 kB > Bounce: 0 kB > WritebackTmp: 0 kB > CommitLimit: 729232 kB > Committed_AS: 734116 kB > VmallocTotal: 327680 kB > VmallocUsed: 10192 kB > VmallocChunk: 294904 kB > DirectMap4k: 12280 kB > DirectMap4M: 692224 kB > > $ cat /proc/mounts > (...) > /dev/md0 /tmp/diskmnt xfs > rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0 > > $ cat /proc/partitions > major minor #blocks name > > 8 0 976762584 sda > 8 1 10281600 sda1 > 8 2 966479960 sda2 > 8 16 976762584 sdb > 8 17 10281600 sdb1 > 8 18 966479960 sdb2 > 8 32 976762584 sdc > 8 33 10281600 sdc1 > 8 34 966479960 sdc2 > (...) > 9 1 20560896 md1 > 9 0 1932956672 md0 > > # same layout for other disks > $ fdisk -c -u /dev/sda > > The device presents a logical sector size that is smaller than > the physical sector size. 
Aligning to a physical sector (or optimal > I/O) size boundary is recommended, or performance may be impacted. > > Command (m for help): p > > Disk /dev/sda: 1000.2 GB, 1000204886016 bytes > 255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors > Units = sectors of 1 * 512 = 512 bytes > Sector size (logical/physical): 512 bytes / 4096 bytes > I/O size (minimum/optimal): 4096 bytes / 4096 bytes > Disk identifier: 0x00000000 > > Device Boot Start End Blocks Id System > /dev/sda1 2048 20565247 10281600 83 Linux > /dev/sda2 20565248 1953525167 966479960 83 Linux > > # unfortunately I had to reinitelize the array and recovery takes a > while.. it does not impact performance much though. > $ cat /proc/mdstat > Personalities : [linear] [raid6] [raid5] [raid4] > md0 : active raid5 sda2[0] sdc2[3] sdb2[1] > 1932956672 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_] > [>....................] recovery = 2.4% (23588740/966478336) > finish=156.6min speed=100343K/sec > bitmap: 0/1 pages [0KB], 2097152KB chunk > > > # sda sdb and sdc are the same model > $ hdparm -I /dev/sda > > /dev/sda: > > ATA device, with non-removable media > Model Number: HGST HCC541010A9E680 > (...) > Firmware Revision: JA0OA560 > Transport: Serial, ATA8-AST, SATA 1.0a, SATA II > Extensions, SATA Rev 2.5, SATA Rev 2.6; Revision: ATA8-AST T13 Project > D1697 Revision 0b > Standards: > Used: unknown (minor revision code 0x0028) > Supported: 8 7 6 5 > Likely used: 8 > Configuration: > Logical max current > cylinders 16383 16383 > heads 16 16 > sectors/track 63 63 > -- > CHS current addressable sectors: 16514064 > LBA user addressable sectors: 268435455 > LBA48 user addressable sectors: 1953525168 > Logical Sector size: 512 bytes > Physical Sector size: 4096 bytes > Logical Sector-0 offset: 0 bytes > device size with M = 1024*1024: 953869 MBytes > device size with M = 1000*1000: 1000204 MBytes (1000 GB) > cache/buffer size = 8192 KBytes (type=DualPortCache) > Form Factor: 2.5 inch > Nominal Media Rotation Rate: 5400 > Capabilities: > LBA, IORDY(can be disabled) > Queue depth: 32 > Standby timer values: spec'd by Standard, no device specific minimum > R/W multiple sector transfer: Max = 16 Current = 16 > Advanced power management level: 128 > DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 > Cycle time: min=120ns recommended=120ns > PIO: pio0 pio1 pio2 pio3 pio4 > Cycle time: no flow control=120ns IORDY flow control=120ns > > $ hdparm -I /dev/sd{a,b,c} | grep "Write cache" > * Write cache > * Write cache > * Write cache > # therefore write cache is enabled in all drives > > $ xfs_info /dev/md0 > meta-data=/dev/md0 isize=256 agcount=32, agsize=15101312 blks > = sectsz=4096 attr=2 > data = bsize=4096 blocks=483239168, imaxpct=5 > = sunit=128 swidth=256 blks > naming =version 2 bsize=4096 ascii-ci=0 > log =internal bsize=4096 blocks=8192, version=2 > = sectsz=4096 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > > $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero > /tmp/diskmnt/filewr.zero: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..2047999]: 2049056..4097055 0 (2049056..4097055) 2048000 01111 > FLAG Values: > 010000 Unwritten preallocated extent > 001000 Doesn't begin on stripe unit > 000100 Doesn't end on stripe unit > 000010 Doesn't begin on stripe width > 000001 Doesn't end on stripe width > # this does not look good, does it? > > # run while dd was executing, looks like we have almost the half > writes as reads.... 
> $ iostat -d -k 30 2 /dev/sda2 /dev/sdb2 /dev/sdc2 > Linux 3.10.10 (haswell1) 11/21/2013 _i686_ (2 CPU) > > Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn > sda2 13.75 6639.52 232.17 78863819 2757731 > sdb2 13.74 6639.42 232.24 78862660 2758483 > sdc2 13.68 55.86 6813.67 663443 80932375 > > Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn > sda2 78.27 11191.20 22556.07 335736 676682 > sdb2 78.30 11175.73 22589.13 335272 677674 > sdc2 78.30 5506.13 28258.47 165184 847754 > > Thanks > - Martin > > On Thu, Nov 21, 2013 at 4:50 AM, Martin Boutin <martboutin@gmail.com> wrote: >> On Thu, Nov 21, 2013 at 4:26 AM, Dave Chinner <david@fromorbit.com> wrote: >>> On Thu, Nov 21, 2013 at 04:11:41AM -0500, Martin Boutin wrote: >>>> On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>>> > On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote: >>>> >> On 11/18/13, 10:02 AM, Martin Boutin wrote: >>>> >> > Dear list, >>>> >> > >>>> >> > I am writing about an apparent issue (or maybe it is normal, that's my >>>> >> > question) regarding filesystem write speed in in a linux raid device. >>>> >> > More specifically, I have linux-3.10.10 running in an Intel Haswell >>>> >> > embedded system with 3 HDDs in a RAID-5 configuration. >>>> >> > The hard disks have 4k physical sectors which are reported as 512 >>>> >> > logical size. I made sure the partitions underlying the raid device >>>> >> > start at sector 2048. >>>> >> >>>> >> (fixed cc: to xfs list) >>>> >> >>>> >> > The RAID device has version 1.2 metadata and 4k (bytes) of data >>>> >> > offset, therefore the data should also be 4k aligned. The raid chunk >>>> >> > size is 512K. >>>> >> > >>>> >> > I have the md0 raid device formatted as ext3 with a 4k block size, and >>>> >> > stride and stripes correctly chosen to match the raid chunk size, that >>>> >> > is, stride=128,stripe-width=256. >>>> >> > >>>> >> > While I was working in a small university project, I just noticed that >>>> >> > the write speeds when using a filesystem over raid are *much* slower >>>> >> > than when writing directly to the raid device (or even compared to >>>> >> > filesystem read speeds). >>>> >> > >>>> >> > The command line for measuring filesystem read and write speeds was: >>>> >> > >>>> >> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct >>>> >> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct >>>> >> > >>>> >> > The command line for measuring raw read and write speeds was: >>>> >> > >>>> >> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct >>>> >> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct >>>> >> > >>>> >> > Here are some speed measures using dd (an average of 20 runs).: >>>> >> > >>>> >> > device raw/fs mode speed (MB/s) slowdown (%) >>>> >> > /dev/md0 raw read 207 >>>> >> > /dev/md0 raw write 209 >>>> >> > /dev/md1 raw read 214 >>>> >> > /dev/md1 raw write 212 >>>> > >>>> > So, that's writing to the first 1GB of /dev/md0, and all the writes >>>> > are going to be aligned to the MD stripe. >>>> > >>>> >> > /dev/md0 xfs read 188 9 >>>> >> > /dev/md0 xfs write 35 83o >>>> > >>>> > And these will not be written to the first 1GB of the block device >>>> > but somewhere else. Most likely a region that hasn't otherwise been >>>> > used, and so isn't going to be overwriting the same blocks like the >>>> > /dev/md0 case is going to be. Perhaps there's some kind of stripe >>>> > caching effect going on here? 
Was the md device fully initialised >>>> > before you ran these tests? >>>> > >>>> >> > >>>> >> > /dev/md1 ext3 read 199 7 >>>> >> > /dev/md1 ext3 write 36 83 >>>> >> > >>>> >> > /dev/md0 ufs read 212 0 >>>> >> > /dev/md0 ufs write 53 75 >>>> >> > >>>> >> > /dev/md0 ext2 read 202 2 >>>> >> > /dev/md0 ext2 write 34 84 >>>> > >>>> > I suspect what you are seeing here is either the latency introduced >>>> > by having to allocate blocks before issuing the IO, or the file >>>> > layout due to allocation is not idea. Single threaded direct IO is >>>> > latency bound, not bandwidth bound and, as such, is IO size >>>> > sensitive. Allocation for direct IO is also IO size sensitive - >>>> > there's typically an allocation per IO, so the more IO you have to >>>> > do, the more allocation that occurs. >>>> >>>> I just did a few more tests, this time with ext4: >>>> >>>> device raw/fs mode speed (MB/s) slowdown (%) >>>> /dev/md0 ext4 read 199 4% >>>> /dev/md0 ext4 write 210 0% >>>> >>>> This time, no slowdown at all on ext4. I believe this is due to the >>>> multiblock allocation feature of ext4 (I'm using O_DIRECT, so it >>>> should be it). So I guess for the other filesystems, it was indeed >>>> the latency introduced by block allocation. >>> >>> Except that XFS does extent based allocation as well, so that's not >>> likely the reason. The fact that ext4 doesn't see a slowdown like >>> every other filesystem really doesn't make a lot of sense to >>> me, either from an IO dispatch point of view or an IO alignment >>> point of view. >>> >>> Why? Because all the filesystems align identically to the underlying >>> device and all should be doing 4k block aligned IO, and XFS has >>> roughly the same allocation overhead for this workload as ext4. >>> Did you retest XFS or any of the other filesystems directly after >>> running the ext4 tests (i.e. confirm you are testing apples to >>> apples)? >> >> Yes I did, the performance figures did not change for either XFS or ext3. >>> >>> What we need to determine why other filesystems are slow (and why >>> ext4 is fast) is more information about your configuration and block >>> traces showing what is happening at the IO level, like was requested >>> in a previous email.... >> >> Ok, I'm going to try coming up with meaningful data. Thanks. >>> >>> Cheers, >>> >>> Dave. >>> -- >>> Dave Chinner >>> david@fromorbit.com >> >> >> >> -- >> Martin Boutin -- Martin Boutin ^ permalink raw reply [flat|nested] 19+ messages in thread
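One easy-to-miss detail in the retry above: mkfs.xfs takes sunit/swidth in
512-byte sectors, while mdadm chunk sizes are given in KiB, so the two sets
of numbers never look alike. A sketch of the equivalent byte-based spelling
for the 256KiB-chunk, 2-data-disk array described above; the automatic
geometry detection shown second is the usual behaviour on MD, not something
confirmed in this thread.

# sunit=512 sectors == su=256k; swidth=1024 sectors == sw=2 stripe units
$ mkfs.xfs -f -l size=32m -d su=256k,sw=2 /dev/md0

# mkfs.xfs should normally pick the geometry up from MD by itself:
$ mkfs.xfs -f -l size=32m /dev/md0
$ xfs_info /dev/md0 | grep -E 'sunit|swidth'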
* Re: Filesystem writes on RAID5 too slow
  2013-11-21 13:31 ` Martin Boutin
  2013-11-21 16:35 ` Martin Boutin
@ 2013-11-21 23:41 ` Dave Chinner
  2013-11-22 9:21 ` Christoph Hellwig
  ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Dave Chinner @ 2013-11-21 23:41 UTC (permalink / raw)
  To: Martin Boutin
  Cc: Eric Sandeen, Kernel.org-Linux-RAID, xfs-oss, Kernel.org-Linux-EXT4

On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux

Oh, it's a 32 bit system. Things you don't know from the obfuscating
codenames everyone uses these days...

> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
....
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0

sunit/swidth is 512k/1MB

> # same layout for other disks
> $ fdisk -c -u /dev/sda
....
> Device Boot Start End Blocks Id System
> /dev/sda1 2048 20565247 10281600 83 Linux

Aligned to 1 MB.

> /dev/sda2 20565248 1953525167 966479960 83 Linux

And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
aligned to 4k, though, so there shouldn't be any hardware RMW cycles.

> $ xfs_info /dev/md0
> meta-data=/dev/md0 isize=256 agcount=32, agsize=15101312 blks
> = sectsz=4096 attr=2
> data = bsize=4096 blocks=483239168, imaxpct=5
> = sunit=128 swidth=256 blks

sunit/swidth of 512k/1MB, so it matches the MD device.

> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
> EXT: FILE-OFFSET    BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
>   0: [0..2047999]:  2049056..4097055  0 (2049056..4097055) 2048000 01111
> FLAG Values:
>    010000 Unwritten preallocated extent
>    001000 Doesn't begin on stripe unit
>    000100 Doesn't end on stripe unit
>    000010 Doesn't begin on stripe width
>    000001 Doesn't end on stripe width
> # this does not look good, does it?

Yup, looks broken. /me digs through git. Yup, commit 27a3f8f ("xfs:
introduce xfs_bmap_last_extent") broke the code that sets stripe unit
alignment for the initial allocation way back in 3.2.

[ Hmm, that would explain the very occasional failure that generic/223
throws out (maybe once a month I see it fail). ]

Which means MD is doing RMW cycles for its parity calculations, and
that's where performance is going south.

Current code:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET    BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..2097151]:  1056..2098207     0 (1056..2098207)  2097152 11111
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
$

Which indicates that even if we take direct IO based allocation out of
the picture, the allocation does not get aligned properly. This is on a
3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.
With a fixed kernel:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET    BLOCK-RANGE       AG AG-OFFSET          TOTAL FLAGS
   0: [0..2097151]:  6293504..8390655   0 (6293504..8390655) 2097152 10000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
$

It's clear we have completely stripe swidth aligned allocation and it's
25% faster.

Take fallocate out of the picture so the direct IO does the allocation:

$ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
testfile:
 EXT: FILE-OFFSET    BLOCK-RANGE       AG AG-OFFSET          TOTAL FLAGS
   0: [0..2097151]:  2099200..4196351   0 (2099200..4196351) 2097152 00000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end on stripe width

It's slower than with preallocation (no surprise - no allocation
overhead per write(2) call after preallocation is done) but the
allocation is still correctly aligned.

The patch below should fix the unaligned allocation problem you are
seeing, but because XFS defaults to stripe unit alignment for large
allocations, you might still see RMW cycles when it aligns to a stripe
unit that is not the first in a MD stripe. I'll have a quick look at
fixing that behaviour when the swalloc mount option is specified....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: align initial file allocations correctly.

From: Dave Chinner <dchinner@redhat.com>

The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.

Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is causing write performance to drop in
alignment sensitive configurations.

Fix it by considering allocation into empty files as requiring
aligned allocation again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 3ef11b2..8401f11 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
  * blocks at the end of the file which do not start at the previous data block,
  * we will try to align the new blocks at stripe unit boundaries.
  *
- * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
+ * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
  * at, or past the EOF.
  */
 STATIC int
@@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
 	bma->aeof = 0;
 	error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
 				     &is_empty);
-	if (error || is_empty)
+	if (error)
 		return error;
 
+	if (is_empty) {
+		bma->aeof = 1;
+		return 0;
+	}
+
 	/*
 	 * Check if we are allocation or past the last extent, or at least into
 	 * the last delayed allocated extent.
^ permalink raw reply related [flat|nested] 19+ messages in thread
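The alignment arithmetic above is easy to repeat for any partition. Below is a minimal sketch, assuming the geometry discussed in this thread (512 KiB md chunk, 1 MiB stripe width across the two data disks of the 3-disk RAID5); the sda1/sda2 names are illustrative only, and the partition start offset is read from sysfs in 512-byte sectors.

#!/bin/sh
# Sketch: report whether partition start offsets line up with the md chunk
# (sunit) and the full stripe width (swidth). The 512 KiB / 1 MiB values are
# assumptions matching the array in this thread.
CHUNK_SECTORS=$((512 * 2))     # 512 KiB chunk, in 512-byte sectors
SWIDTH_SECTORS=$((1024 * 2))   # 1 MiB stripe width, in 512-byte sectors

for part in sda1 sda2; do
    start=$(cat /sys/class/block/$part/start)
    if [ $((start % 8)) -ne 0 ]; then
        echo "$part: start sector $start is not even 4 KiB aligned (hardware RMW likely)"
    elif [ $((start % SWIDTH_SECTORS)) -eq 0 ]; then
        echo "$part: start sector $start is stripe width aligned"
    elif [ $((start % CHUNK_SECTORS)) -eq 0 ]; then
        echo "$part: start sector $start is chunk aligned, but not stripe width aligned"
    else
        echo "$part: start sector $start is 4 KiB aligned only"
    fi
done

For sda2 this lands in the 4 KiB-only branch, which is just Dave's 20565248 / 2048 = 10041.625 observation restated.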
* Re: Filesystem writes on RAID5 too slow 2013-11-21 23:41 ` Dave Chinner @ 2013-11-22 9:21 ` Christoph Hellwig 2013-11-22 22:40 ` Dave Chinner 2013-11-22 13:33 ` Martin Boutin 2013-12-10 19:18 ` Christoph Hellwig 2 siblings, 1 reply; 19+ messages in thread From: Christoph Hellwig @ 2013-11-22 9:21 UTC (permalink / raw) To: Dave Chinner Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4, xfs-oss > From: Dave Chinner <dchinner@redhat.com> > > The function xfs_bmap_isaeof() is used to indicate that an > allocation is occurring at or past the end of file, and as such > should be aligned to the underlying storage geometry if possible. > > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the > behaviour of this function for empty files - it turned off > allocation alignment for this case accidentally. Hence large initial > allocations from direct IO are not getting correctly aligned to the > underlying geometry, and that is cause write performance to drop in > alignment sensitive configurations. > > Fix it by considering allocation into empty files as requiring > aligned allocation again. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> Ooops. The fix looks good, Reviewed-by: Christoph Hellwig <hch@lst.de> Might be worth cooking up a test for this; scsi_debug can expose geometry, and we already have it wired up to large sector size testing in xfstests. ^ permalink raw reply [flat|nested] 19+ messages in thread
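For reference, one possible starting point for such a reproducer, sketched under the assumption that the scsi_debug parameters of that era (dev_size_mb, sector_size, physblk_exp, lowest_aligned) behave as documented; it only fakes the 512-byte-logical / 4 KiB-physical sector geometry of the drives in this thread, and sdX is a placeholder for whatever device node scsi_debug registers.

#!/bin/sh
# Sketch (assumption, not the existing xfstests wiring): a RAM-backed disk
# that advertises 512-byte logical sectors with a 4 KiB physical sector size
# (physblk_exp=3 => 512 * 2^3 = 4096).
modprobe scsi_debug dev_size_mb=1024 sector_size=512 physblk_exp=3 lowest_aligned=0

# Replace sdX with the device scsi_debug just created (check dmesg), then
# confirm what the block layer reports:
cat /sys/block/sdX/queue/logical_block_size    # expect 512
cat /sys/block/sdX/queue/physical_block_size   # expect 4096
cat /sys/block/sdX/queue/minimum_io_size
cat /sys/block/sdX/queue/optimal_io_size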
* Re: Filesystem writes on RAID5 too slow 2013-11-22 9:21 ` Christoph Hellwig @ 2013-11-22 22:40 ` Dave Chinner 2013-11-23 8:41 ` Christoph Hellwig 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2013-11-22 22:40 UTC (permalink / raw) To: Christoph Hellwig Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4, xfs-oss On Fri, Nov 22, 2013 at 01:21:36AM -0800, Christoph Hellwig wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > The function xfs_bmap_isaeof() is used to indicate that an > > allocation is occurring at or past the end of file, and as such > > should be aligned to the underlying storage geometry if possible. > > > > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the > > behaviour of this function for empty files - it turned off > > allocation alignment for this case accidentally. Hence large initial > > allocations from direct IO are not getting correctly aligned to the > > underlying geometry, and that is cause write performance to drop in > > alignment sensitive configurations. > > > > Fix it by considering allocation into empty files as requiring > > aligned allocation again. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > Ooops. The fix looks good, > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > Might be worth cooking up a test for this, scsi_debug can expose > geometry, and we already have it wired to to large sector size > testing in xfstests. We don't need to screw around with the sector size - that is irrelevant to the problem, and we have an allocation alignment test that is supposed to catch these issues: generic/223. As I said, I have seen occasional failures of that test (once a month, on average) as a result of this bug. It was simply not often enough - running in a hard loop didn't increase the frequency of failures - to be able to debug it or to reach my "there's a regression I need to look at" threshold. Perhaps we need to revisit that test and see if we can make it more likely to trigger failures... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Filesystem writes on RAID5 too slow 2013-11-22 22:40 ` Dave Chinner @ 2013-11-23 8:41 ` Christoph Hellwig 2013-11-24 23:21 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: Christoph Hellwig @ 2013-11-23 8:41 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4, xfs-oss On Sat, Nov 23, 2013 at 09:40:38AM +1100, Dave Chinner wrote: > > geometry, and we already have it wired to to large sector size > > testing in xfstests. > > We don't need to screw around with the sector size - that is > irrelevant to the problem, and we have an allocation alignment > test that is supposed to catch these issues: generic/223. I didn't imply we need large sector sizes, but the same mechanism that exposes a large sector size can also be used to present large stripe units/widths. > As I said, I have seen occasional failures of that test (once a > month, on average) as a result of this bug. It was simply not often > enough - running in a hard loop didn't increase the frequency of > failures - to be able debug it or to reach my "there's a regression > I need to look at" threshold. Perhaps we need to revisit that test > and see if we can make it more likely to trigger failures... Seems like 223 should have caught it regularly with the explicit alignment options at mkfs time. Maybe we also need a test mirroring the plain dd more closely? I've not seen 223 fail for a long time... ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Filesystem writes on RAID5 too slow 2013-11-23 8:41 ` Christoph Hellwig @ 2013-11-24 23:21 ` Dave Chinner 0 siblings, 0 replies; 19+ messages in thread From: Dave Chinner @ 2013-11-24 23:21 UTC (permalink / raw) To: Christoph Hellwig Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4, xfs-oss On Sat, Nov 23, 2013 at 12:41:06AM -0800, Christoph Hellwig wrote: > On Sat, Nov 23, 2013 at 09:40:38AM +1100, Dave Chinner wrote: > > > geometry, and we already have it wired to to large sector size > > > testing in xfstests. > > > > We don't need to screw around with the sector size - that is > > irrelevant to the problem, and we have an allocation alignment > > test that is supposed to catch these issues: generic/223. > > It didn't imply we need large sector sizes, but the same mechanism > to expodse a large sector size can also be used to present large > stripe units/width. > > > As I said, I have seen occasional failures of that test (once a > > month, on average) as a result of this bug. It was simply not often > > enough - running in a hard loop didn't increase the frequency of > > failures - to be able debug it or to reach my "there's a regression > > I need to look at" threshold. Perhaps we need to revisit that test > > and see if we can make it more likely to trigger failures... > > Seems like 233 should have cought it regularly with the explicit > alignment options on mkfs time. Maybe we also need a test mirroring > the plain dd more closely? Preallocation showed the problem, too, so we probably don't even need dd to check whether allocation alignment is working properly. We should probably write a test that specifically checks all the different alignment/extent size combinations we can use. Preallocation should behave very similarly to direct IO, but I'm pretty sure that it won't do things like round up allocations to stripe unit/widths like direct IO does. The fact that we do allocation sunit/swidth size alignment for direct IO outside the allocator and sunit/swidth offset alignment inside the allocator is kinda funky.... > I've not seen 233 fail for a long time.. Not surprising, it is a one-in-several-hundred test runs occurrence here... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
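A sketch of the kind of check being discussed, with the device, mount point and su/sw values as illustrative assumptions: preallocate files of several sizes on a filesystem made with explicit stripe geometry, then read back the xfs_bmap flags of the first extent.

#!/bin/sh
# Sketch: exercise EOF allocation alignment for a few preallocation sizes and
# print the xfs_bmap flags field of the first extent. For a large, aligned,
# unwritten preallocation the flags should be 10000; any of the low four bits
# set means a stripe unit/width boundary was missed.
DEV=/dev/md9      # illustrative test device
MNT=/mnt/test     # illustrative mount point

mkfs.xfs -f -d su=512k,sw=2 "$DEV"    # explicit geometry at mkfs time
mount "$DEV" "$MNT"

for size in 64m 256m 1g; do
    f="$MNT/prealloc.$size"
    xfs_io -f -c "falloc 0 $size" "$f"
    flags=$(xfs_bmap -vvp "$f" | awk '$1 == "0:" { print $NF }')
    echo "falloc $size -> first extent flags $flags"
done

Which sizes are expected to be stripe width rather than stripe unit aligned is policy (XFS defaults to stripe unit alignment unless swalloc is used), so a real test would need to encode the expected flags per size.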
* Re: Filesystem writes on RAID5 too slow 2013-11-21 23:41 ` Dave Chinner 2013-11-22 9:21 ` Christoph Hellwig @ 2013-11-22 13:33 ` Martin Boutin 2013-12-10 19:18 ` Christoph Hellwig 2 siblings, 0 replies; 19+ messages in thread From: Martin Boutin @ 2013-11-22 13:33 UTC (permalink / raw) To: Dave Chinner Cc: Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4, xfs-oss Dave, I just applied your patch to my vanilla 3.10.10 Linux. Here are the new performance figures for XFS: $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 4.95292 s, 212 MB/s : ) So things make more sense now: I hit a bug in XFS, and ext3 and ufs apparently do not support this kind of aligned multiblock allocation. Thank you all, - Martin On Thu, Nov 21, 2013 at 6:41 PM, Dave Chinner <david@fromorbit.com> wrote: > On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote: >> $ uname -a >> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013 >> i686 GNU/Linux > > Oh, it's 32 bit system. Things you don't know from the obfuscating > codenames everyone uses these days... > >> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0 >> $ mount -t xfs /dev/md0 /tmp/diskmnt/ >> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct >> 1000+0 records in >> 1000+0 records out >> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s > .... >> $ cat /proc/mounts >> (...) >> /dev/md0 /tmp/diskmnt xfs >> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0 > > sunit/swidth is 512k/1MB > >> # same layout for other disks >> $ fdisk -c -u /dev/sda > .... >> Device Boot Start End Blocks Id System >> /dev/sda1 2048 20565247 10281600 83 Linux > > Aligned to 1 MB. > >> /dev/sda2 20565248 1953525167 966479960 83 Linux > > And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is > aligned to 4k, though, so there shouldn't be any hardware RMW > cycles. > >> $ xfs_info /dev/md0 >> meta-data=/dev/md0 isize=256 agcount=32, agsize=15101312 blks >> = sectsz=4096 attr=2 >> data = bsize=4096 blocks=483239168, imaxpct=5 >> = sunit=12 > > sunit/swidth of 512k/1MB, so it matches the MD device. > >> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero >> /tmp/diskmnt/filewr.zero: >> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS >> 0: [0..2047999]: 2049056..4097055 0 (2049056..4097055) 2048000 01111 >> FLAG Values: >> 010000 Unwritten preallocated extent >> 001000 Doesn't begin on stripe unit >> 000100 Doesn't end on stripe unit >> 000010 Doesn't begin on stripe width >> 000001 Doesn't end on stripe width >> # this does not look good, does it? > > Yup, looks broken. > > /me digs through git. > > Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke > the code that sets stripe unit alignment for the initial allocation > way back in 3.2. > > [ Hmmm, that would explain the very occasional failure that > generic/223 throws outi (maybe once a month I see it fail). ] > > Which means MD is doing RMW cycles for it's parity calculations, and > that's where performance is going south. 
> > Current code: > > $ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile > testfile: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..2097151]: 1056..2098207 0 (1056..2098207) 2097152 11111 > FLAG Values: > 010000 Unwritten preallocated extent > 001000 Doesn't begin on stripe unit > 000100 Doesn't end on stripe unit > 000010 Doesn't begin on stripe width > 000001 Doesn't end on stripe width > wrote 1073741824/1073741824 bytes at offset 0 > 1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec) > $ > > Which indicates that even if we take direct IO based allocation out > of the picture, the allocation does not get aligned properly. This > in on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k. > > With a fixed kernel: > > $ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile > testfile: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..2097151]: 6293504..8390655 0 (6293504..8390655) 2097152 10000 > FLAG Values: > 010000 Unwritten preallocated extent > 001000 Doesn't begin on stripe unit > 000100 Doesn't end on stripe unit > 000010 Doesn't begin on stripe width > 000001 Doesn't end on stripe width > wrote 1073741824/1073741824 bytes at offset 0 > 1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec) > $ > > It;s clear we have completely stripe swidth aligned allocation and it's 25% faster. > > Take fallocate out of the picture so the direct IO does the > allocation: > > $ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile > wrote 1073741824/1073741824 bytes at offset 0 > 1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec) > testfile: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..2097151]: 2099200..4196351 0 (2099200..4196351) 2097152 00000 > FLAG Values: > 010000 Unwritten preallocated extent > 001000 Doesn't begin on stripe unit > 000100 Doesn't end on stripe unit > 000010 Doesn't begin on stripe width > 000001 Doesn't end on stripe width > > It's slower than with preallocation (no surprise - no allocation > overhead per write(2) call after preallocation is done) but the > allocation is still correctly aligned. > > The patch below should fix the unaligned allocation problem you are > seeing, but because XFS defaults to stripe unit alignment for large > allocations, you might still see RMW cycles when it aligns to a > stripe unit that is not the first in a MD stripe. I'll have a quick > look at fixing that behaviour when the swalloc mount option is > specified.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > xfs: align initial file allocations correctly. > > From: Dave Chinner <dchinner@redhat.com> > > The function xfs_bmap_isaeof() is used to indicate that an > allocation is occurring at or past the end of file, and as such > should be aligned to the underlying storage geometry if possible. > > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the > behaviour of this function for empty files - it turned off > allocation alignment for this case accidentally. Hence large initial > allocations from direct IO are not getting correctly aligned to the > underlying geometry, and that is cause write performance to drop in > alignment sensitive configurations. > > Fix it by considering allocation into empty files as requiring > aligned allocation again. 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_bmap.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c > index 3ef11b2..8401f11 100644 > --- a/fs/xfs/xfs_bmap.c > +++ b/fs/xfs/xfs_bmap.c > @@ -1635,7 +1635,7 @@ xfs_bmap_last_extent( > * blocks at the end of the file which do not start at the previous data block, > * we will try to align the new blocks at stripe unit boundaries. > * > - * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be > + * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be > * at, or past the EOF. > */ > STATIC int > @@ -1650,9 +1650,14 @@ xfs_bmap_isaeof( > bma->aeof = 0; > error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec, > &is_empty); > - if (error || is_empty) > + if (error) > return error; > > + if (is_empty) { > + bma->aeof = 1; > + return 0; > + } > + > /* > * Check if we are allocation or past the last extent, or at least into > * the last delayed allocated extent. -- Martin Boutin _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
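Nothing on the ext3/ufs side of the original table was retested in this thread, but as a hedged aside: ext3 has no stripe-aware multiblock allocator, whereas ext4's mballoc can use the stride/stripe-width hints. A sketch of what that comparison could look like, reusing the thread's geometry (512 KiB chunk => stride=128, two data disks => stripe-width=256, in 4 KiB blocks); the choice of ext4 here is an assumption, not something the thread verified.

#!/bin/sh
# Sketch (not verified in this thread): rebuild the second array's filesystem
# as ext4 so the multiblock allocator can honour the RAID layout, then rerun
# the same direct IO write test. Device and mount point are taken from the
# thread.
mkfs.ext4 -b 4096 -E stride=128,stripe-width=256 /dev/md1
mount -t ext4 /dev/md1 /tmp/diskmnt
dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct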
* Re: Filesystem writes on RAID5 too slow 2013-11-21 23:41 ` Dave Chinner 2013-11-22 9:21 ` Christoph Hellwig 2013-11-22 13:33 ` Martin Boutin @ 2013-12-10 19:18 ` Christoph Hellwig 2013-12-11 0:27 ` Dave Chinner 2 siblings, 1 reply; 19+ messages in thread From: Christoph Hellwig @ 2013-12-10 19:18 UTC (permalink / raw) To: Dave Chinner Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4, xfs-oss > xfs: align initial file allocations correctly. > > From: Dave Chinner <dchinner@redhat.com> > > The function xfs_bmap_isaeof() is used to indicate that an > allocation is occurring at or past the end of file, and as such > should be aligned to the underlying storage geometry if possible. > > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the > behaviour of this function for empty files - it turned off > allocation alignment for this case accidentally. Hence large initial > allocations from direct IO are not getting correctly aligned to the > underlying geometry, and that is cause write performance to drop in > alignment sensitive configurations. > > Fix it by considering allocation into empty files as requiring > aligned allocation again. Seems like this one didn't get picked up yet? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Filesystem writes on RAID5 too slow 2013-12-10 19:18 ` Christoph Hellwig @ 2013-12-11 0:27 ` Dave Chinner 2013-12-11 19:09 ` Ben Myers 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2013-12-11 0:27 UTC (permalink / raw) To: Christoph Hellwig Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen, Kernel.org-Linux-EXT4, xfs-oss On Tue, Dec 10, 2013 at 11:18:03AM -0800, Christoph Hellwig wrote: > > xfs: align initial file allocations correctly. > > > > From: Dave Chinner <dchinner@redhat.com> > > > > The function xfs_bmap_isaeof() is used to indicate that an > > allocation is occurring at or past the end of file, and as such > > should be aligned to the underlying storage geometry if possible. > > > > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the > > behaviour of this function for empty files - it turned off > > allocation alignment for this case accidentally. Hence large initial > > allocations from direct IO are not getting correctly aligned to the > > underlying geometry, and that is cause write performance to drop in > > alignment sensitive configurations. > > > > Fix it by considering allocation into empty files as requiring > > aligned allocation again. > > Seems like this one didn't get picked up yet? I'm about to resend all my outstanding patches... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Filesystem writes on RAID5 too slow 2013-12-11 0:27 ` Dave Chinner @ 2013-12-11 19:09 ` Ben Myers 0 siblings, 0 replies; 19+ messages in thread From: Ben Myers @ 2013-12-11 19:09 UTC (permalink / raw) To: Dave Chinner Cc: Martin Boutin, Kernel.org-Linux-RAID, Eric Sandeen, xfs-oss, Christoph Hellwig, Kernel.org-Linux-EXT4 Hi, On Wed, Dec 11, 2013 at 11:27:53AM +1100, Dave Chinner wrote: > On Tue, Dec 10, 2013 at 11:18:03AM -0800, Christoph Hellwig wrote: > > > xfs: align initial file allocations correctly. > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > The function xfs_bmap_isaeof() is used to indicate that an > > > allocation is occurring at or past the end of file, and as such > > > should be aligned to the underlying storage geometry if possible. > > > > > > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the > > > behaviour of this function for empty files - it turned off > > > allocation alignment for this case accidentally. Hence large initial > > > allocations from direct IO are not getting correctly aligned to the > > > underlying geometry, and that is cause write performance to drop in > > > alignment sensitive configurations. > > > > > > Fix it by considering allocation into empty files as requiring > > > aligned allocation again. > > > > Seems like this one didn't get picked up yet? > > I'm about to resend all my outstanding patches... Sorry I didn't see that one. If you stick the keyword 'patch' in the subject I tend to do a bit better. Regards, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Filesystem writes on RAID5 too slow 2013-11-18 16:02 Filesystem writes on RAID5 too slow Martin Boutin 2013-11-18 18:28 ` Eric Sandeen @ 2013-11-18 18:41 ` Roman Mamedov 2013-11-18 19:25 ` Roman Mamedov 1 sibling, 1 reply; 19+ messages in thread From: Roman Mamedov @ 2013-11-18 18:41 UTC (permalink / raw) To: Martin Boutin Cc: Kernel.org-Linux-RAID, Kernel.org-Linux-XFS, Kernel.org-Linux-EXT4 [-- Attachment #1: Type: text/plain, Size: 709 bytes --] On Mon, 18 Nov 2013 11:02:15 -0500 Martin Boutin <martboutin@gmail.com> wrote: > I have the md0 raid device formatted as ext3 with a 4k block size, and > stride and stripes correctly chosen to match the raid chunk size, that > is, stride=128,stripe-width=256. What is your stripe cache size? http://peterkieser.com/2009/11/29/raid-mdraid-stripe_cache_size-vs-write-transfer/ > The command line for measuring filesystem read and write speeds was: > > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct Try testing with "fdatasync" instead of "direct" here. -- With respect, Roman [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
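The stripe cache Roman asks about is exposed in sysfs; a minimal sketch of checking and raising it, with md0 and the 8192 value as illustrative choices (the linked article benchmarks a range of values).

#!/bin/sh
# Sketch: inspect and enlarge the RAID5/6 stripe cache. The value is the
# number of cached stripe entries; memory use is roughly
# value * 4 KiB * number of member disks, and the setting does not persist
# across reboots.
cat /sys/block/md0/md/stripe_cache_size         # default is 256
echo 8192 > /sys/block/md0/md/stripe_cache_size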
* Re: Filesystem writes on RAID5 too slow 2013-11-18 18:41 ` Roman Mamedov @ 2013-11-18 19:25 ` Roman Mamedov 0 siblings, 0 replies; 19+ messages in thread From: Roman Mamedov @ 2013-11-18 19:25 UTC (permalink / raw) To: Martin Boutin Cc: Kernel.org-Linux-RAID, Kernel.org-Linux-XFS, Kernel.org-Linux-EXT4 [-- Attachment #1: Type: text/plain, Size: 464 bytes --] On Tue, 19 Nov 2013 00:41:40 +0600 Roman Mamedov <rm@romanrm.net> wrote: > > The command line for measuring filesystem read and write speeds was: > > > > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct > > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct > > Try testing with "fdatasync" instead of "direct" here. Sorry, "conv=fdatasync" instead of "oflag=direct". -- With respect, Roman [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
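The two dd variants measure different things, which is why the swap is suggested: oflag=direct bypasses the page cache and waits for each 1 MiB request, so misaligned requests pay the read-modify-write penalty one by one, while conv=fdatasync goes through the page cache and flushes once at the end, giving md a chance to assemble full stripes. A sketch of running both against the same mount, with the path taken from the thread:

#!/bin/sh
# Sketch: the same 1 GiB write measured two ways. The direct IO run completes
# each 1 MiB write before issuing the next; the buffered run's reported rate
# includes one fdatasync at the end.
dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
rm -f /tmp/diskmnt/filewr.zero
dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 conv=fdatasync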
End of thread. Thread overview: 19+ messages -- 2013-11-18 16:02 Filesystem writes on RAID5 too slow Martin Boutin 2013-11-18 18:28 ` Eric Sandeen 2013-11-19 0:57 ` Dave Chinner 2013-11-21 9:11 ` Martin Boutin 2013-11-21 9:26 ` Dave Chinner 2013-11-21 9:50 ` Martin Boutin 2013-11-21 13:31 ` Martin Boutin 2013-11-21 16:35 ` Martin Boutin 2013-11-21 23:41 ` Dave Chinner 2013-11-22 9:21 ` Christoph Hellwig 2013-11-22 22:40 ` Dave Chinner 2013-11-23 8:41 ` Christoph Hellwig 2013-11-24 23:21 ` Dave Chinner 2013-11-22 13:33 ` Martin Boutin 2013-12-10 19:18 ` Christoph Hellwig 2013-12-11 0:27 ` Dave Chinner 2013-12-11 19:09 ` Ben Myers 2013-11-18 18:41 ` Roman Mamedov 2013-11-18 19:25 ` Roman Mamedov