* ext4 performance regression 2.6.27-stable versus 2.6.32 and later
@ 2010-07-28 19:51 Kay Diederichs
  2010-07-28 21:00 ` Greg Freemyer
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Kay Diederichs @ 2010-07-28 19:51 UTC (permalink / raw)
  To: linux; +Cc: Ext4 Developers List, Karsten Schaefer

Dear all,

we reproducibly find significantly worse ext4 performance when our
fileservers run 2.6.32 or later kernels, compared to the 2.6.27-stable
series.

The hardware is a RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an
external eSATA enclosure (STARDOM ST6600); the disks are not partitioned,
the complete disks are used:

md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1]
      3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]

The enclosure is connected through a Silicon Image PCIe-x1 adapter
(supported by sata_sil24) to one of our fileservers (either the backup
fileserver, 32bit desktop hardware with an Intel(R) Pentium(R) D CPU
3.40GHz, or a production fileserver, a 64bit Precision WorkStation 670
with 2 Xeon 3.2GHz).

The ext4 filesystem was created using
  mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg
It is mounted with noatime,data=writeback

As operating system we usually use RHEL5.5, but to exclude problems with
self-compiled kernels we also booted USB sticks with the latest Fedora 12
and Fedora 13.

Our benchmarks consist of copying 100 6MB files from and to the RAID5
over NFS (NFSv3, GB ethernet, TCP, async export), and tar-ing and
rsync-ing kernel trees back and forth. Before and after each individual
benchmark part we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on both
the client and the server.

The problem:
with 2.6.27.48 we typically get:
  44 seconds for preparations
  23 seconds to rsync 100 frames with 597M from nfs directory
  33 seconds to rsync 100 frames with 595M to nfs directory
  50 seconds to untar 24353 kernel files with 323M to nfs directory
  56 seconds to rsync 24353 kernel files with 323M from nfs directory
  67 seconds to run xds_par in nfs directory (reads and writes 600M)
 301 seconds to run the script

with 2.6.32.16 we find:
  49 seconds for preparations
  23 seconds to rsync 100 frames with 597M from nfs directory
 261 seconds to rsync 100 frames with 595M to nfs directory
  74 seconds to untar 24353 kernel files with 323M to nfs directory
  67 seconds to rsync 24353 kernel files with 323M from nfs directory
 290 seconds to run xds_par in nfs directory (reads and writes 600M)
 797 seconds to run the script

This is quite reproducible (times vary by about 1-2%). All times include
reading and writing on the client side (stock CentOS5.5 Nehalem machines
with fast single SATA disks). The 2.6.32.16 times are the same with FC12
and FC13 (booted from USB stick).

The 2.6.27-versus-2.6.32+ regression cannot be due to barriers, because
md RAID5 does not support barriers ("JBD: barrier-based sync failed on
md5 - disabling barriers").

What we tried: noop and deadline schedulers instead of cfq; modifications
of /sys/block/sd[c-g]/queue/max_sectors_kb; switching NCQ on/off;
blockdev --setra 8192 /dev/md5; increasing
/sys/block/md5/md/stripe_cache_size.

When looking at the I/O statistics while the benchmark is running, we see
very choppy patterns for 2.6.32, but quite smooth stats for 2.6.27-stable.

It is not an NFS problem; we see the same effect when transferring the
data using an rsync daemon. We believe, but are not sure, that the problem
does not exist with ext3 - it's not so quick to re-format a 4TB volume.

Any ideas? We cannot believe that a general ext4 regression would have
gone unnoticed. So is it due to the interaction of ext4 with md-RAID5?

thanks,
Kay
--
Kay Diederichs                http://strucbio.biologie.uni-konstanz.de
email: Kay.Diederichs@uni-konstanz.de    Tel +49 7531 88 4049 Fax 3183
Fachbereich Biologie, Universität Konstanz, Box M647, D-78457 Konstanz.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread
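For reference, the benchmark procedure described above can be driven by a
small shell script along these lines (a minimal sketch only; host names,
paths and the omitted xds_par step are placeholders for the local setup,
not the exact script used):

  #!/bin/bash
  # Sketch of the benchmark driver; all paths and hosts are illustrative.
  # drop_caches needs root on both machines.
  NFSDIR=/mnt/nfs-test            # client-side mount of the exported md5/ext4 volume
  LOCAL=/scratch/nfs-test         # fast local SATA disk on the client
  SERVER=fileserver               # NFS server exporting /dev/md5

  drop_caches() {                 # flush dirty data and empty the page cache on both sides
      sync; echo 3 > /proc/sys/vm/drop_caches
      ssh $SERVER 'sync; echo 3 > /proc/sys/vm/drop_caches'
  }

  run() {                         # time one benchmark step
      drop_caches
      local t0=$(date +%s)
      "$@" > /dev/null 2>&1
      echo "$(( $(date +%s) - t0 )) seconds: $*"
  }

  run rsync -a $NFSDIR/frames/ $LOCAL/frames/        # 100 x 6MB from nfs directory
  run rsync -a $LOCAL/frames/ $NFSDIR/frames.copy/   # 100 x 6MB to nfs directory
  run tar -C $NFSDIR/tree -xf $LOCAL/linux.tar       # untar kernel tree to nfs directory
  run rsync -a $NFSDIR/tree/ $LOCAL/tree/            # kernel tree from nfs directory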
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-28 19:51 ext4 performance regression 2.6.27-stable versus 2.6.32 and later Kay Diederichs @ 2010-07-28 21:00 ` Greg Freemyer 2010-08-02 10:47 ` Kay Diederichs 2010-07-29 23:28 ` Dave Chinner 2010-07-30 2:20 ` Ted Ts'o 2 siblings, 1 reply; 15+ messages in thread From: Greg Freemyer @ 2010-07-28 21:00 UTC (permalink / raw) To: Kay Diederichs; +Cc: linux, Ext4 Developers List, Karsten Schaefer On Wed, Jul 28, 2010 at 3:51 PM, Kay Diederichs <Kay.Diederichs@uni-konstanz.de> wrote: > Dear all, > > we reproducibly find significantly worse ext4 performance when our > fileservers run 2.6.32 or later kernels, when compared to the > 2.6.27-stable series. > > The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an > external eSATA enclosure (STARDOM ST6600); disks are not partitioned but > rather the complete disks are used: > md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1] > 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] > [UUUUU] > > The enclosure is connected using a Silicon Image (supported by > sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup > fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU > 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2 > Xeon 3.2GHz). > > The ext4 filesystem was created using > mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg > It is mounted with noatime,data=writeback > > As operating system we usually use RHEL5.5, but to exclude problems with > self-compiled kernels, we also booted USB sticks with latest Fedora12 > and FC13 . > > Our benchmarks consist of copying 100 6MB files from and to the RAID5, > over NFS (NVSv3, GB ethernet, TCP, async export), and tar-ing and > rsync-ing kernel trees back and forth. Before and after each individual > benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on > both the client and the server. > > The problem: > with 2.6.27.48 we typically get: > 44 seconds for preparations > 23 seconds to rsync 100 frames with 597M from nfs directory > 33 seconds to rsync 100 frames with 595M to nfs directory > 50 seconds to untar 24353 kernel files with 323M to nfs directory > 56 seconds to rsync 24353 kernel files with 323M from nfs directory > 67 seconds to run xds_par in nfs directory (reads and writes 600M) > 301 seconds to run the script > > with 2.6.32.16 we find: > 49 seconds for preparations > 23 seconds to rsync 100 frames with 597M from nfs directory > 261 seconds to rsync 100 frames with 595M to nfs directory > 74 seconds to untar 24353 kernel files with 323M to nfs directory > 67 seconds to rsync 24353 kernel files with 323M from nfs directory > 290 seconds to run xds_par in nfs directory (reads and writes 600M) > 797 seconds to run the script > > This is quite reproducible (times varying about 1-2% or so). All times > include reading and writing on the client side (stock CentOS5.5 Nehalem > machines with fast single SATA disks). The 2.6.32.16 times are the same > with FC12 and FC13 (booted from USB stick). > > The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because > md RAID5 does not support barriers ("JBD: barrier-based sync failed on > md5 - disabling barriers"). 
> > What we tried: noop and deadline schedulers instead of cfq; > modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching > on/off NCQ; blockdev --setra 8192 /dev/md5; increasing > /sys/block/md5/md/stripe_cache_size > > When looking at the I/O statistics while the benchmark is running, we > see very choppy patterns for 2.6.32, but quite smooth stats for > 2.6.27-stable. > > It is not an NFS problem; we see the same effect when transferring the > data using an rsync daemon. We believe, but are not sure, that the > problem does not exist with ext3 - it's not so quick to re-format a 4 TB > volume. > > Any ideas? We cannot believe that a general ext4 regression should have > gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ? > > thanks, > > Kay Kay, I didn't read your whole e-mail, but 2.6.27 has known issues with barriers not working in many raid configs. Thus it is more likely to experience data loss in the event of a power failure. With newer kernels, If you prefer to have performance over robustness, you can mount with the "nobarrier" option. So now you have your choice whereas with 2.6.27, with raid5 you effectively had nobarriers as your only choice. Greg ^ permalink raw reply [flat|nested] 15+ messages in thread
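For completeness, disabling barriers on the newer kernels is just a mount
option; nobarrier and barrier=0 are equivalent spellings for ext4 (device
and mount point as in the report above):

  mount -t ext4 -o noatime,data=writeback,nobarrier /dev/md5 /mnt/md5
  # or, on an already mounted filesystem:
  mount -o remount,nobarrier /mnt/md5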
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-28 21:00 ` Greg Freemyer @ 2010-08-02 10:47 ` Kay Diederichs 2010-08-02 16:04 ` Henrique de Moraes Holschuh 0 siblings, 1 reply; 15+ messages in thread From: Kay Diederichs @ 2010-08-02 10:47 UTC (permalink / raw) To: Greg Freemyer; +Cc: linux, Ext4 Developers List, Karsten Schaefer [-- Attachment #1: Type: text/plain, Size: 5041 bytes --] Greg Freemyer schrieb: > On Wed, Jul 28, 2010 at 3:51 PM, Kay Diederichs > <Kay.Diederichs@uni-konstanz.de> wrote: >> Dear all, >> >> we reproducibly find significantly worse ext4 performance when our >> fileservers run 2.6.32 or later kernels, when compared to the >> 2.6.27-stable series. >> >> The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an >> external eSATA enclosure (STARDOM ST6600); disks are not partitioned but >> rather the complete disks are used: >> md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1] >> 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] >> [UUUUU] >> >> The enclosure is connected using a Silicon Image (supported by >> sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup >> fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU >> 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2 >> Xeon 3.2GHz). >> >> The ext4 filesystem was created using >> mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg >> It is mounted with noatime,data=writeback >> >> As operating system we usually use RHEL5.5, but to exclude problems with >> self-compiled kernels, we also booted USB sticks with latest Fedora12 >> and FC13 . >> >> Our benchmarks consist of copying 100 6MB files from and to the RAID5, >> over NFS (NVSv3, GB ethernet, TCP, async export), and tar-ing and >> rsync-ing kernel trees back and forth. Before and after each individual >> benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on >> both the client and the server. >> >> The problem: >> with 2.6.27.48 we typically get: >> 44 seconds for preparations >> 23 seconds to rsync 100 frames with 597M from nfs directory >> 33 seconds to rsync 100 frames with 595M to nfs directory >> 50 seconds to untar 24353 kernel files with 323M to nfs directory >> 56 seconds to rsync 24353 kernel files with 323M from nfs directory >> 67 seconds to run xds_par in nfs directory (reads and writes 600M) >> 301 seconds to run the script >> >> with 2.6.32.16 we find: >> 49 seconds for preparations >> 23 seconds to rsync 100 frames with 597M from nfs directory >> 261 seconds to rsync 100 frames with 595M to nfs directory >> 74 seconds to untar 24353 kernel files with 323M to nfs directory >> 67 seconds to rsync 24353 kernel files with 323M from nfs directory >> 290 seconds to run xds_par in nfs directory (reads and writes 600M) >> 797 seconds to run the script >> >> This is quite reproducible (times varying about 1-2% or so). All times >> include reading and writing on the client side (stock CentOS5.5 Nehalem >> machines with fast single SATA disks). The 2.6.32.16 times are the same >> with FC12 and FC13 (booted from USB stick). >> >> The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because >> md RAID5 does not support barriers ("JBD: barrier-based sync failed on >> md5 - disabling barriers"). 
>> >> What we tried: noop and deadline schedulers instead of cfq; >> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching >> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing >> /sys/block/md5/md/stripe_cache_size >> >> When looking at the I/O statistics while the benchmark is running, we >> see very choppy patterns for 2.6.32, but quite smooth stats for >> 2.6.27-stable. >> >> It is not an NFS problem; we see the same effect when transferring the >> data using an rsync daemon. We believe, but are not sure, that the >> problem does not exist with ext3 - it's not so quick to re-format a 4 TB >> volume. >> >> Any ideas? We cannot believe that a general ext4 regression should have >> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ? >> >> thanks, >> >> Kay > > Kay, > > I didn't read your whole e-mail, but 2.6.27 has known issues with > barriers not working in many raid configs. Thus it is more likely to > experience data loss in the event of a power failure. > > With newer kernels, If you prefer to have performance over robustness, > you can mount with the "nobarrier" option. > > So now you have your choice whereas with 2.6.27, with raid5 you > effectively had nobarriers as your only choice. > > Greg Greg, 2.6.33 and later support md5 write barriers, whereas 2.6.27-stable doesn't. I looked thru the 2.6.32.* Changelogs at http://kernel.org/pub/linux/kernel/v2.6/ but could not find anything indicating that md5 write barriers were backported to 2.6.32-stable. Anyway, we do not get the message "JBD: barrier-based sync failed on md5 - disabling barriers" when using 2.6.32.16 which might indicate that write barriers are indeed active when specifying no options in this respect. Performance-wise, we tried mounting with barrier versus nobarrier (or barrier=1 versus barrier=0) and re-did the 2.6.32+ benchmarks. It turned out that the benchmark difference with and without barrier is less than the variation between runs (which is much higher with 2.6.32+ than with 2.6.27-stable), so the influence seems to be minor. best, Kay [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/x-pkcs7-signature, Size: 5236 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
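The barrier-versus-nobarrier comparison boils down to remounting between
runs and checking the kernel log; roughly (a sketch, with run-benchmark.sh
standing in for the benchmark script):

  mount -o remount,barrier=1 /mnt/md5 && ./run-benchmark.sh
  mount -o remount,barrier=0 /mnt/md5 && ./run-benchmark.sh
  dmesg | grep -i barrier   # 2.6.27 prints "JBD: barrier-based sync failed
                            # on md5 - disabling barriers"; 2.6.32.16 does not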
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-08-02 10:47 ` Kay Diederichs @ 2010-08-02 16:04 ` Henrique de Moraes Holschuh 2010-08-02 16:10 ` Henrique de Moraes Holschuh 0 siblings, 1 reply; 15+ messages in thread From: Henrique de Moraes Holschuh @ 2010-08-02 16:04 UTC (permalink / raw) To: Kay Diederichs Cc: Greg Freemyer, linux, Ext4 Developers List, Karsten Schaefer On Mon, 02 Aug 2010, Kay Diederichs wrote: > Performance-wise, we tried mounting with barrier versus nobarrier (or > barrier=1 versus barrier=0) and re-did the 2.6.32+ benchmarks. It turned > out that the benchmark difference with and without barrier is less than > the variation between runs (which is much higher with 2.6.32+ than with > 2.6.27-stable), so the influence seems to be minor. Did you check interactions with the IO scheduler? -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-08-02 16:04 ` Henrique de Moraes Holschuh @ 2010-08-02 16:10 ` Henrique de Moraes Holschuh 0 siblings, 0 replies; 15+ messages in thread From: Henrique de Moraes Holschuh @ 2010-08-02 16:10 UTC (permalink / raw) To: Kay Diederichs Cc: Greg Freemyer, linux, Ext4 Developers List, Karsten Schaefer On Mon, 02 Aug 2010, Henrique de Moraes Holschuh wrote: > On Mon, 02 Aug 2010, Kay Diederichs wrote: > > Performance-wise, we tried mounting with barrier versus nobarrier (or > > barrier=1 versus barrier=0) and re-did the 2.6.32+ benchmarks. It turned > > out that the benchmark difference with and without barrier is less than > > the variation between runs (which is much higher with 2.6.32+ than with > > 2.6.27-stable), so the influence seems to be minor. > > Did you check interactions with the IO scheduler? Never mind, I reread your first message, and you did. I apologise for the noise. -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-28 19:51 ext4 performance regression 2.6.27-stable versus 2.6.32 and later Kay Diederichs 2010-07-28 21:00 ` Greg Freemyer @ 2010-07-29 23:28 ` Dave Chinner 2010-08-02 14:52 ` Kay Diederichs 2010-07-30 2:20 ` Ted Ts'o 2 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2010-07-29 23:28 UTC (permalink / raw) To: Kay Diederichs; +Cc: linux, Ext4 Developers List, Karsten Schaefer On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: > Dear all, > > we reproducibly find significantly worse ext4 performance when our > fileservers run 2.6.32 or later kernels, when compared to the > 2.6.27-stable series. > > The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an > external eSATA enclosure (STARDOM ST6600); disks are not partitioned but > rather the complete disks are used: > md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1] > 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] > [UUUUU] > > The enclosure is connected using a Silicon Image (supported by > sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup > fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU > 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2 > Xeon 3.2GHz). > > The ext4 filesystem was created using > mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg > It is mounted with noatime,data=writeback > > As operating system we usually use RHEL5.5, but to exclude problems with > self-compiled kernels, we also booted USB sticks with latest Fedora12 > and FC13 . > > Our benchmarks consist of copying 100 6MB files from and to the RAID5, > over NFS (NVSv3, GB ethernet, TCP, async export), and tar-ing and > rsync-ing kernel trees back and forth. Before and after each individual > benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on > both the client and the server. > > The problem: > with 2.6.27.48 we typically get: > 44 seconds for preparations > 23 seconds to rsync 100 frames with 597M from nfs directory > 33 seconds to rsync 100 frames with 595M to nfs directory > 50 seconds to untar 24353 kernel files with 323M to nfs directory > 56 seconds to rsync 24353 kernel files with 323M from nfs directory > 67 seconds to run xds_par in nfs directory (reads and writes 600M) > 301 seconds to run the script > > with 2.6.32.16 we find: > 49 seconds for preparations > 23 seconds to rsync 100 frames with 597M from nfs directory > 261 seconds to rsync 100 frames with 595M to nfs directory > 74 seconds to untar 24353 kernel files with 323M to nfs directory > 67 seconds to rsync 24353 kernel files with 323M from nfs directory > 290 seconds to run xds_par in nfs directory (reads and writes 600M) > 797 seconds to run the script > > This is quite reproducible (times varying about 1-2% or so). All times > include reading and writing on the client side (stock CentOS5.5 Nehalem > machines with fast single SATA disks). The 2.6.32.16 times are the same > with FC12 and FC13 (booted from USB stick). > > The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because > md RAID5 does not support barriers ("JBD: barrier-based sync failed on > md5 - disabling barriers"). 
>
> What we tried: noop and deadline schedulers instead of cfq;
> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching
> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing
> /sys/block/md5/md/stripe_cache_size
>
> When looking at the I/O statistics while the benchmark is running, we
> see very choppy patterns for 2.6.32, but quite smooth stats for
> 2.6.27-stable.
>
> It is not an NFS problem; we see the same effect when transferring the
> data using an rsync daemon. We believe, but are not sure, that the
> problem does not exist with ext3 - it's not so quick to re-format a 4 TB
> volume.
>
> Any ideas? We cannot believe that a general ext4 regression should have
> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ?

Try reverting 50797481a7bdee548589506d7d7b48b08bc14dcd (ext4: Avoid
group preallocation for closed files). IIRC it caused the same sort
of severe performance regressions for postmark....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread
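A minimal sketch of testing that suggestion on a 2.6.32-stable tree (the
repository URL and build steps are illustrative; the commit is in 2.6.32.y
because it went into mainline before 2.6.32, so a plain revert should
apply):

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.32.y.git
  cd linux-2.6.32.y
  git revert 50797481a7bdee548589506d7d7b48b08bc14dcd   # ext4: Avoid group preallocation for closed files
  make oldconfig && make -j4 bzImage modules && make modules_install install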
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-29 23:28 ` Dave Chinner @ 2010-08-02 14:52 ` Kay Diederichs 2010-08-02 16:12 ` Eric Sandeen 0 siblings, 1 reply; 15+ messages in thread From: Kay Diederichs @ 2010-08-02 14:52 UTC (permalink / raw) To: Dave Chinner; +Cc: linux, Ext4 Developers List, Karsten Schaefer Dave Chinner schrieb: > On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: >> Dear all, >> >> we reproducibly find significantly worse ext4 performance when our >> fileservers run 2.6.32 or later kernels, when compared to the >> 2.6.27-stable series. >> >> The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an >> external eSATA enclosure (STARDOM ST6600); disks are not partitioned but >> rather the complete disks are used: >> md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1] >> 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] >> [UUUUU] >> >> The enclosure is connected using a Silicon Image (supported by >> sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup >> fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU >> 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2 >> Xeon 3.2GHz). >> >> The ext4 filesystem was created using >> mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg >> It is mounted with noatime,data=writeback >> >> As operating system we usually use RHEL5.5, but to exclude problems with >> self-compiled kernels, we also booted USB sticks with latest Fedora12 >> and FC13 . >> >> Our benchmarks consist of copying 100 6MB files from and to the RAID5, >> over NFS (NVSv3, GB ethernet, TCP, async export), and tar-ing and >> rsync-ing kernel trees back and forth. Before and after each individual >> benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on >> both the client and the server. >> >> The problem: >> with 2.6.27.48 we typically get: >> 44 seconds for preparations >> 23 seconds to rsync 100 frames with 597M from nfs directory >> 33 seconds to rsync 100 frames with 595M to nfs directory >> 50 seconds to untar 24353 kernel files with 323M to nfs directory >> 56 seconds to rsync 24353 kernel files with 323M from nfs directory >> 67 seconds to run xds_par in nfs directory (reads and writes 600M) >> 301 seconds to run the script >> >> with 2.6.32.16 we find: >> 49 seconds for preparations >> 23 seconds to rsync 100 frames with 597M from nfs directory >> 261 seconds to rsync 100 frames with 595M to nfs directory >> 74 seconds to untar 24353 kernel files with 323M to nfs directory >> 67 seconds to rsync 24353 kernel files with 323M from nfs directory >> 290 seconds to run xds_par in nfs directory (reads and writes 600M) >> 797 seconds to run the script >> >> This is quite reproducible (times varying about 1-2% or so). All times >> include reading and writing on the client side (stock CentOS5.5 Nehalem >> machines with fast single SATA disks). The 2.6.32.16 times are the same >> with FC12 and FC13 (booted from USB stick). >> >> The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because >> md RAID5 does not support barriers ("JBD: barrier-based sync failed on >> md5 - disabling barriers"). 
>> >> What we tried: noop and deadline schedulers instead of cfq; >> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching >> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing >> /sys/block/md5/md/stripe_cache_size >> >> When looking at the I/O statistics while the benchmark is running, we >> see very choppy patterns for 2.6.32, but quite smooth stats for >> 2.6.27-stable. >> >> It is not an NFS problem; we see the same effect when transferring the >> data using an rsync daemon. We believe, but are not sure, that the >> problem does not exist with ext3 - it's not so quick to re-format a 4 TB >> volume. >> >> Any ideas? We cannot believe that a general ext4 regression should have >> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ? > > Try reverting 50797481a7bdee548589506d7d7b48b08bc14dcd (ext4: Avoid > group preallocation for closed files). IIRC it caused the same sort > of isevere performance regressions for postmark.... > > Cheers, > > Dave. Dave, as you suggested, we reverted "ext4: Avoid group preallocation for closed files" and this indeed fixes a big part of the problem: after booting the NFS server we get NFS-Server: turn5 2.6.32.16p i686 NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64 exported directory on the nfs-server: /dev/md5 /mnt/md5 ext4 rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0 48 seconds for preparations 28 seconds to rsync 100 frames with 597M from nfs directory 57 seconds to rsync 100 frames with 595M to nfs directory 70 seconds to untar 24353 kernel files with 323M to nfs directory 57 seconds to rsync 24353 kernel files with 323M from nfs directory 133 seconds to run xds_par in nfs directory 425 seconds to run the script For blktrace details, see my next email which is a response to Ted's. best, Kay ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-08-02 14:52 ` Kay Diederichs @ 2010-08-02 16:12 ` Eric Sandeen 2010-08-02 21:08 ` Kay Diederichs 2010-08-03 13:31 ` Kay Diederichs 0 siblings, 2 replies; 15+ messages in thread From: Eric Sandeen @ 2010-08-02 16:12 UTC (permalink / raw) To: Kay Diederichs Cc: Dave Chinner, linux, Ext4 Developers List, Karsten Schaefer [-- Attachment #1: Type: text/plain, Size: 2162 bytes --] On 08/02/2010 09:52 AM, Kay Diederichs wrote: > Dave, > > as you suggested, we reverted "ext4: Avoid group preallocation for > closed files" and this indeed fixes a big part of the problem: after > booting the NFS server we get > > NFS-Server: turn5 2.6.32.16p i686 > NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64 > > exported directory on the nfs-server: > /dev/md5 /mnt/md5 ext4 > rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0 > > 48 seconds for preparations > 28 seconds to rsync 100 frames with 597M from nfs directory > 57 seconds to rsync 100 frames with 595M to nfs directory > 70 seconds to untar 24353 kernel files with 323M to nfs directory > 57 seconds to rsync 24353 kernel files with 323M from nfs directory > 133 seconds to run xds_par in nfs directory > 425 seconds to run the script Interesting, I had found this commit to be a problem for small files which are constantly created & deleted; the commit had the effect of packing the newly created files in the first free space that could be found, rather than walking down the disk leaving potentially fragmented freespace behind (see seekwatcher graph attached). Reverting the patch sped things up for this test, but left the filesystem freespace in bad shape. But you seem to see one of the largest effects in here: 261 seconds to rsync 100 frames with 595M to nfs directory vs 57 seconds to rsync 100 frames with 595M to nfs directory with the patch reverted making things go faster. So you are doing 100 6MB writes to the server, correct? Is the filesystem mkfs'd fresh before each test, or is it aged? If not mkfs'd, is it at least completely empty prior to the test, or does data remain on it? I'm just wondering if fragmented freespace is contributing to this behavior as well. If there is fragmented freespace, then with the patch I think the allocator is more likely to hunt around for small discontiguous chunks of free sapce, rather than going further out in the disk looking for a large area to allocate from. It might be interesting to use seekwatcher on the server to visualize the allocation/IO patterns for the test running just this far? -Eric [-- Attachment #2: rhel6_ext4_comparison.png --] [-- Type: image/png, Size: 113533 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later
  2010-08-02 16:12           ` Eric Sandeen
@ 2010-08-02 21:08             ` Kay Diederichs
  2010-08-03 13:31             ` Kay Diederichs
  1 sibling, 0 replies; 15+ messages in thread
From: Kay Diederichs @ 2010-08-02 21:08 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Chinner, linux, Ext4 Developers List, Karsten Schaefer

[-- Attachment #1: Type: text/plain, Size: 4265 bytes --]

Am 02.08.2010 18:12, schrieb Eric Sandeen:
> On 08/02/2010 09:52 AM, Kay Diederichs wrote:
>> Dave,
>>
>> as you suggested, we reverted "ext4: Avoid group preallocation for
>> closed files" and this indeed fixes a big part of the problem: after
>> booting the NFS server we get
>>
>> NFS-Server: turn5 2.6.32.16p i686
>> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>>
>> exported directory on the nfs-server:
>> /dev/md5 /mnt/md5 ext4
>> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>>
>>  48 seconds for preparations
>>  28 seconds to rsync 100 frames with 597M from nfs directory
>>  57 seconds to rsync 100 frames with 595M to nfs directory
>>  70 seconds to untar 24353 kernel files with 323M to nfs directory
>>  57 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 133 seconds to run xds_par in nfs directory
>> 425 seconds to run the script
>
> Interesting, I had found this commit to be a problem for small files
> which are constantly created & deleted; the commit had the effect of
> packing the newly created files in the first free space that could be
> found, rather than walking down the disk leaving potentially fragmented
> freespace behind (see seekwatcher graph attached). Reverting the patch
> sped things up for this test, but left the filesystem freespace in bad
> shape.
>
> But you seem to see one of the largest effects in here:
>
> 261 seconds to rsync 100 frames with 595M to nfs directory
> vs
> 57 seconds to rsync 100 frames with 595M to nfs directory
>
> with the patch reverted making things go faster. So you are doing 100
> 6MB writes to the server, correct?

correct.

> Is the filesystem mkfs'd fresh
> before each test, or is it aged?

it is too big to "just create it freshly". It was actually created a
week ago, and filled by a single ~ 10-hour rsync job run on the server
such that the filesystem should be filled in the most linear way
possible. Since then, the benchmarking has created and deleted lots of
files.

> If not mkfs'd, is it at least
> completely empty prior to the test, or does data remain on it? I'm just

it's not empty: df -h reports
Filesystem            Size  Used Avail Use% Mounted on
/dev/md5              3.7T  2.8T  712G  80% /mnt/md5

e2freefrag-1.41.12 reports:
Device: /dev/md5
Blocksize: 4096 bytes
Total blocks: 976761344
Free blocks: 235345984 (24.1%)

Min. free extent: 4 KB
Max. free extent: 99348 KB
Avg. free extent: 1628 KB

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :          1858          1858    0.00%
    8K...   16K-  :          3415          8534    0.00%
   16K...   32K-  :          9952         54324    0.02%
   32K...   64K-  :         23884        288848    0.12%
   64K...  128K-  :         27901        658130    0.28%
  128K...  256K-  :         25761       1211519    0.51%
  256K...  512K-  :         35863       3376274    1.43%
  512K... 1024K-  :         48643       9416851    4.00%
    1M...    2M-  :        150311      60704033   25.79%
    2M...    4M-  :        244895     148283666   63.01%
    4M...    8M-  :          3970       5508499    2.34%
    8M...   16M-  :           187        551835    0.23%
   16M...   32M-  :           302       1765912    0.75%
   32M...   64M-  :           282       2727162    1.16%
   64M...  128M-  :            42        788539    0.34%

> wondering if fragmented freespace is contributing to this behavior as
> well. If there is fragmented freespace, then with the patch I think the
> allocator is more likely to hunt around for small discontiguous chunks
> of free sapce, rather than going further out in the disk looking for a
> large area to allocate from.

the last step of the benchmark, "xds_par", reads 600MB and writes 50MB.
It has 16 threads which might put some additional pressure on the
freespace hunting. That step also is fast in 2.6.27.48 but slow in
2.6.32+ .

> It might be interesting to use seekwatcher on the server to visualize
> the allocation/IO patterns for the test running just this far?
>
> -Eric

will try to install seekwatcher.

thanks,
Kay

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5236 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later
  2010-08-02 16:12           ` Eric Sandeen
  2010-08-02 21:08             ` Kay Diederichs
@ 2010-08-03 13:31             ` Kay Diederichs
  1 sibling, 0 replies; 15+ messages in thread
From: Kay Diederichs @ 2010-08-03 13:31 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Dave Chinner, linux, Ext4 Developers List, Karsten Schaefer, Ted Ts'o

[-- Attachment #1.1: Type: text/plain, Size: 4058 bytes --]

Eric Sandeen schrieb:
> On 08/02/2010 09:52 AM, Kay Diederichs wrote:
>> Dave,
>>
>> as you suggested, we reverted "ext4: Avoid group preallocation for
>> closed files" and this indeed fixes a big part of the problem: after
>> booting the NFS server we get
>>
>> NFS-Server: turn5 2.6.32.16p i686
>> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>>
>> exported directory on the nfs-server:
>> /dev/md5 /mnt/md5 ext4
>> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>>
>>  48 seconds for preparations
>>  28 seconds to rsync 100 frames with 597M from nfs directory
>>  57 seconds to rsync 100 frames with 595M to nfs directory
>>  70 seconds to untar 24353 kernel files with 323M to nfs directory
>>  57 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 133 seconds to run xds_par in nfs directory
>> 425 seconds to run the script
>
> Interesting, I had found this commit to be a problem for small files
> which are constantly created & deleted; the commit had the effect of
> packing the newly created files in the first free space that could be
> found, rather than walking down the disk leaving potentially fragmented
> freespace behind (see seekwatcher graph attached). Reverting the patch
> sped things up for this test, but left the filesystem freespace in bad
> shape.
>
> But you seem to see one of the largest effects in here:
>
> 261 seconds to rsync 100 frames with 595M to nfs directory
> vs
> 57 seconds to rsync 100 frames with 595M to nfs directory
>
> with the patch reverted making things go faster. So you are doing 100
> 6MB writes to the server, correct? Is the filesystem mkfs'd fresh
> before each test, or is it aged? If not mkfs'd, is it at least
> completely empty prior to the test, or does data remain on it? I'm just
> wondering if fragmented freespace is contributing to this behavior as
> well. If there is fragmented freespace, then with the patch I think the
> allocator is more likely to hunt around for small discontiguous chunks
> of free sapce, rather than going further out in the disk looking for a
> large area to allocate from.
>
> It might be interesting to use seekwatcher on the server to visualize
> the allocation/IO patterns for the test running just this far?
>
> -Eric

Eric,

seekwatcher does not seem to understand the blktrace output of old
kernels, so I rolled my own primitive plotting, e.g.

blkparse -i md5.xds_par.2.6.32.16p_run1 > blkparse.out
grep flush blkparse.out | grep W > flush_W
grep flush blkparse.out | grep R > flush_R
grep nfsd  blkparse.out | grep R > nfsd_R
grep nfsd  blkparse.out | grep W > nfsd_W
grep sync  blkparse.out | grep R > sync_R
grep sync  blkparse.out | grep W > sync_W
gnuplot<<EOF
set term png
set out '2.6.32.16p_run1.png'
set key outside
set title "2.6.32.16p_run1"
plot 'nfsd_W' us 4:8,'flush_W' us 4:8,'sync_W' us 4:8,'nfsd_R' us 4:8,'flush_R' us 4:8
EOF

I attach the resulting plots for 2.6.27.48_run1 (after booting) and
2.6.27.48_run2 (after run1 ; sync; and drop_cache).
They show seconds on the x axis (horizontal) and block numbers (512-byte blocks, I suppose; the ext4 filesystem has 976761344 4096-byte blocks so that would be about 8e+09 512-byte blocks) on the y axis (vertical). You'll have to do the real interpretation of the plots yourself, but even someone who does not know exactly what the pdflush (in 2.6.27.48) or flush (in 2.6.32+) kernel threads are supposed to do can tell that the kernels behave _very_ differently. In particular, stock 2.6.32.16 every time (only run1 is shown, but run2 is the same) has the flush thread visiting all of the filesystem, in steps of 263168 blocks. I have no idea why it does this. Roughly the first 1/3 of the filesystem is also visited by kernels 2.6.27.48 and the patched 2.6.32.16 that Dave Chinner suggested, but only in the first run after booting. Subsequent runs are fast and do not employ the flush thread much. Hope this helps to pin down the regression. thanks, Kay [-- Attachment #1.2: 2.6.27.48_run1.png --] [-- Type: image/png, Size: 5146 bytes --] [-- Attachment #1.3: 2.6.27.48_run2.png --] [-- Type: image/png, Size: 4484 bytes --] [-- Attachment #1.4: 2.6.32.16p_run1.png --] [-- Type: image/png, Size: 4935 bytes --] [-- Attachment #1.5: 2.6.32.16p_run2.png --] [-- Type: image/png, Size: 4443 bytes --] [-- Attachment #1.6: 2.6.32.16.png --] [-- Type: image/png, Size: 5359 bytes --] [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/x-pkcs7-signature, Size: 5236 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-28 19:51 ext4 performance regression 2.6.27-stable versus 2.6.32 and later Kay Diederichs 2010-07-28 21:00 ` Greg Freemyer 2010-07-29 23:28 ` Dave Chinner @ 2010-07-30 2:20 ` Ted Ts'o 2010-07-30 21:01 ` Kay Diederichs ` (2 more replies) 2 siblings, 3 replies; 15+ messages in thread From: Ted Ts'o @ 2010-07-30 2:20 UTC (permalink / raw) To: Kay Diederichs; +Cc: linux, Ext4 Developers List, Karsten Schaefer On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: > > When looking at the I/O statistics while the benchmark is running, we > see very choppy patterns for 2.6.32, but quite smooth stats for > 2.6.27-stable. Could you try to do two things for me? Using (preferably from a recent e2fsprogs, such as 1.41.11 or 12) run filefrag -v on the files created from your 2.6.27 run and your 2.6.32 run? Secondly can capture blktrace results from 2.6.27 and 2.6.32? That would be very helpful to understand what might be going on. Either would be helpful; both would be greatly appreciated. Thanks, - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
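Capturing what is asked for on the server amounts to something like the
following (a sketch; the frame path and trace file names are illustrative,
run-benchmark.sh stands for the benchmark script, and blktrace/blkparse
must run as root against the md device backing the export):

  filefrag -v /mnt/md5/scratch/nfs-test/tmp/xds/frames/*.cbf > filefrag-$(uname -r).txt

  blktrace -d /dev/md5 -o md5.trace &    # start tracing the md device
  ./run-benchmark.sh                     # run the workload being measured
  kill -INT $!                           # stop blktrace (flushes its buffers)
  blkparse -i md5.trace > blkparse-$(uname -r).txt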
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-30 2:20 ` Ted Ts'o @ 2010-07-30 21:01 ` Kay Diederichs 2010-08-01 23:02 ` Ted Ts'o 2010-08-02 15:28 ` Kay Diederichs [not found] ` <4C56E47B.8080600@uni-konstanz.de> 2 siblings, 1 reply; 15+ messages in thread From: Kay Diederichs @ 2010-07-30 21:01 UTC (permalink / raw) To: Ted Ts'o, linux, Ext4 Developers List, Karsten Schaefer [-- Attachment #1: Type: text/plain, Size: 1556 bytes --] Am 30.07.2010 04:20, schrieb Ted Ts'o: > On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: >> >> When looking at the I/O statistics while the benchmark is running, we >> see very choppy patterns for 2.6.32, but quite smooth stats for >> 2.6.27-stable. > > Could you try to do two things for me? Using (preferably from a > recent e2fsprogs, such as 1.41.11 or 12) run filefrag -v on the files > created from your 2.6.27 run and your 2.6.32 run? > > Secondly can capture blktrace results from 2.6.27 and 2.6.32? That > would be very helpful to understand what might be going on. > > Either would be helpful; both would be greatly appreciated. > > Thanks, > > - Ted Ted, a typical example of filefrag -v output for 2.6.27.48 is Filesystem type is: ef53 File size of /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf is 6229688 (1521 blocks, blocksize 4096) ext logical physical expected length flags 0 0 796160000 1024 1 1024 826381312 796161023 497 eof (99 out of 100 files have 2 extents) whereas for 2.6.32.16 the result is typically Filesystem type is: ef53 File size of /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf is 6229688 (1521 blocks, blocksize 4096) ext logical physical expected length flags 0 0 826376200 1521 eof /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf: 1 extent found (99 out of 100 files have 1 extent) We'll try the blktrace ASAP and report back. thanks, Kay [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 5236 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
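The per-file layout can be summarized over all 100 frames in one go, e.g.
(a sketch, using the same path as in the filefrag output above):

  for f in /mnt/md5/scratch/nfs-test/tmp/xds/frames/*.cbf; do
      filefrag "$f"                     # prints "<file>: N extents found"
  done | awk '{print $(NF-2)}' | sort | uniq -c   # histogram of extent counts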
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-30 21:01 ` Kay Diederichs @ 2010-08-01 23:02 ` Ted Ts'o 0 siblings, 0 replies; 15+ messages in thread From: Ted Ts'o @ 2010-08-01 23:02 UTC (permalink / raw) To: Kay Diederichs; +Cc: linux, Ext4 Developers List, Karsten Schaefer On Fri, Jul 30, 2010 at 11:01:36PM +0200, Kay Diederichs wrote: > whereas for 2.6.32.16 the result is typically > Filesystem type is: ef53 > File size of > /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf is > 6229688 (1521 blocks, blocksize 4096) > ext logical physical expected length flags > 0 0 826376200 1521 eof > /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf: 1 extent found OK, so 2.6.32 is actually doing a better job laying out the files.... The blktrace will be interesting, but at this point I'm wondering if this is a generic kernel-wide writeback regression. At $WORK we've noticed some performance regressions between 2.6.26-based kernels and 2.6.33- and 2.6.34-based kernels with both ext2 and ext4 (in no journal mode) that we've been trying to track down. We have a pretty large number of patches applied to both 2.6.26 and 2.6.33/34 which is why I haven't mentioned it up until now, but at this point it seems pretty clear there are some writeback issues in the mainline kernel. There are half a dozen or so patch series on LKML that are addressing writeback in one way or another, and writeback is a major topic at the upcoming Linux Storage and Filesystem workshop. So if this is the cause, hopefully there will be some improvements in this area in the near future. - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
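A low-tech way to compare writeback behaviour between the kernels is to
watch the dirty and writeback counters on the server while one benchmark
step runs, for example:

  while sleep 1; do
      grep -E '^(Dirty|Writeback):' /proc/meminfo | tr '\n' ' '; echo
  done

Smooth, steadily draining numbers versus long plateaus followed by bursts
would point at writeback/flush behaviour rather than at block allocation.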
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-30 2:20 ` Ted Ts'o 2010-07-30 21:01 ` Kay Diederichs @ 2010-08-02 15:28 ` Kay Diederichs [not found] ` <4C56E47B.8080600@uni-konstanz.de> 2 siblings, 0 replies; 15+ messages in thread From: Kay Diederichs @ 2010-08-02 15:28 UTC (permalink / raw) To: Ted Ts'o, Kay Diederichs, linux, Ext4 Developers List, Karsten Schaefer < Ted Ts'o schrieb: > On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: >> When looking at the I/O statistics while the benchmark is running, we >> see very choppy patterns for 2.6.32, but quite smooth stats for >> 2.6.27-stable. > > Could you try to do two things for me? Using (preferably from a > recent e2fsprogs, such as 1.41.11 or 12) run filefrag -v on the files > created from your 2.6.27 run and your 2.6.32 run? > > Secondly can capture blktrace results from 2.6.27 and 2.6.32? That > would be very helpful to understand what might be going on. > > Either would be helpful; both would be greatly appreciated. > > Thanks, > > - Ted Ted, we pared down the benchmark to the last step (called "run xds_par in nfs directory (reads 600M, and writes 50M)") because this captures most of the problem. Here we report kernel messages with stacktrace, and the blktrace output that you requested. Kernel messages: with 2.6.32.16 we observe [ 6961.838032] INFO: task jbd2/md5-8:2010 blocked for more than 120 seconds. [ 6961.838111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 6961.838191] jbd2/md5-8 D 00000634 0 2010 2 0x00000000 [ 6961.838200] f5171e78 00000046 231a9999 00000634 ddf91f2c f652cc4c 00000001 f5171e1c [ 6961.838307] c0a6f140 c0a6f140 c0a6f140 c0a6a6ac f6354c20 f6354ecc c2008140 00000000 [ 6961.838412] 00637f84 00000003 f652cc58 00000000 00000292 00000048 c20036ac f6354ecc [ 6961.838518] Call Trace: [ 6961.838556] [<c056c39e>] jbd2_journal_commit_transaction+0x1d9/0x1187 [ 6961.838627] [<c040220a>] ? __switch_to+0xd5/0x147 [ 6961.838681] [<c07a390a>] ? schedule+0x837/0x885 [ 6961.838734] [<c0455e5f>] ? autoremove_wake_function+0x0/0x38 [ 6961.838799] [<c0448c84>] ? try_to_del_timer_sync+0x58/0x60 [ 6961.838859] [<c0572426>] kjournald2+0xa2/0x1be [ 6961.838909] [<c0455e5f>] ? autoremove_wake_function+0x0/0x38 [ 6961.838971] [<c0572384>] ? kjournald2+0x0/0x1be [ 6961.839035] [<c0455c11>] kthread+0x66/0x6b [ 6961.839089] [<c0455bab>] ? kthread+0x0/0x6b [ 6961.839139] [<c0404167>] kernel_thread_helper+0x7/0x10 [ 6961.839215] INFO: task sync:11600 blocked for more than 120 seconds. [ 6961.839286] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 6961.839367] sync D 00000632 0 11600 11595 0x00000000 [ 6961.839375] ddf91ea4 00000086 dca59d8b 00000632 76f570ee 00000048 00001dec 773fd796 [ 6961.839486] c0a6f140 c0a6f140 c0a6f140 c200819c f4ce0000 f4ce02ac c2008140 00000000 [ 6961.839600] ddf91ea8 dca5c36b 00000632 dca5bb77 c2008180 773fd796 00000282 f4ce02ac [ 6961.839727] Call Trace: [ 6961.839762] [<c04f4de6>] bdi_sched_wait+0xd/0x11 [ 6961.841438] [<c07a3ede>] __wait_on_bit+0x3b/0x62 [ 6961.843109] [<c04f4dd9>] ? bdi_sched_wait+0x0/0x11 [ 6961.844782] [<c07a3fb5>] out_of_line_wait_on_bit+0xb0/0xb8 [ 6961.846479] [<c04f4dd9>] ? bdi_sched_wait+0x0/0x11 [ 6961.848181] [<c0455e97>] ? 
wake_bit_function+0x0/0x48
[ 6961.849906] [<c04f4c75>] wait_on_bit+0x25/0x31
[ 6961.851601] [<c04f4e5d>] sync_inodes_sb+0x73/0x121
[ 6961.853287] [<c04f8acc>] __sync_filesystem+0x48/0x69
[ 6961.854983] [<c04f8b72>] sync_filesystems+0x85/0xc7
[ 6961.856670] [<c04f8c04>] sys_sync+0x20/0x32
[ 6961.858363] [<c040351b>] sysenter_do_call+0x12/0x28

Blktrace:
blktrace was run for 2.6.27.48, 2.6.32.16 and a patched 2.6.32.16 (called
2.6.32.16p below and in the .tar file), where the patch just reverts
"ext4: Avoid group preallocation for closed files". This revert removes a
substantial part of the regression.

For 2.6.32.16p and 2.6.27.48 there are two runs: run1 is directly after
booting; then the directory is unexported, unmounted, mounted, exported,
and run2 is done. For 2.6.32.16 there is just run1; all subsequent runs
yield approximately the same results, i.e. they are as slow as run1.

Some numbers (time, and number of lines with flush|nfsd|sync in the
blkparse output):

            2.6.27.48       2.6.32.16       2.6.32.16p
            run1    run2    run1    run2    run1    run2
 wallclock  113s    61s     280s    ~280s   137s    61s
 flush      25362   9285    71861           32656   12066
 nfsd       7595    8580    8685            8359    8444
 sync       2860    3925    303             212     169

The total time seems to be dominated by the number of flushes. It should
be noted that all these runs used barrier=0; barrier=1 does not have a
significant effect, though.

So we find:
a) in 2.6.32.16 there is a problem which manifests itself in kernel
   messages associated with the jbd2/md5-8 and sync tasks, and in a
   vastly increased number of flush operations
b) reverting the patch "ext4: Avoid group preallocation for closed files"
   cures part of the problem
c) even after reverting that patch, the first run takes much longer than
   subsequent runs, despite "sync", "echo 3 > /proc/sys/vm/drop_caches",
   and umounting/re-mounting the filesystem.

The blktrace files are at
http://strucbio.biologie.uni-konstanz.de/~dikay/blktraces.tar.bz2 .
Should we test any other patches?

thanks,
Kay

^ permalink raw reply	[flat|nested] 15+ messages in thread
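The flush/nfsd/sync line counts above were presumably obtained by
filtering the blkparse output per task name, along the lines of (a sketch,
using the same trace file naming as the plotting commands elsewhere in the
thread):

  blkparse -i md5.xds_par.2.6.32.16_run1 > blkparse.out
  for task in flush nfsd sync; do
      printf '%-10s %d\n' "$task" "$(grep -c $task blkparse.out)"
  done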
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later [not found] ` <20100802202123.GC25653@thunk.org> @ 2010-08-04 8:18 ` Kay Diederichs 0 siblings, 0 replies; 15+ messages in thread From: Kay Diederichs @ 2010-08-04 8:18 UTC (permalink / raw) To: Ted Ts'o; +Cc: Dave Chinner, Ext4 Developers List, linux, Karsten Schaefer [-- Attachment #1: Type: text/plain, Size: 3337 bytes --] Am 02.08.2010 22:21, schrieb Ted Ts'o: > On Mon, Aug 02, 2010 at 05:30:03PM +0200, Kay Diederichs wrote: >> >> we pared down the benchmark to the last step (called "run xds_par in nfs >> directory (reads 600M, and writes 50M)") because this captures most of >> the problem. Here we report kernel messages with stacktrace, and the >> blktrace output that you requested. > > Thanks, I'll take a look at it. > > Is NFS required to reproduce the problem? If you simply copy the 100 > files using rsync, or cp -r while logged onto the server, do you > notice the performance regression? > > Thanks, regards, > > - Ted Ted, we've run the benchmarks internally on the file server; it turns out that NFS is not required to reproduce the problem. We also took the opportunity to try 2.6.32.17 which just came out. 2.6.32.17 behaves similar to 2.6.32.16-patched (i.e. with reverted "ext4: Avoid group preallocation for closed files"); 2.6.32.17 has quite a few ext4 patches so one or a couple of those seems to have a similar effect as reverting "ext4: Avoid group preallocation for closed files". These are the times for the second (and higher) benchmark runs; the first run is always slower. The last step ("run xds_par") is slower than in the NFS case because it's heavy in CPU usage (total CPU time is more than 200 seconds); the NFS client is a 8-core (+HT) Nehalem-type machine, whereas the NFS server is just a 2-core Pentium D @ 3.40GHz Local machine: turn5 2.6.27.48 i686 Raid5: /dev/md5 /mnt/md5 ext4dev rw,noatime,barrier=1,stripe=512,data=writeback 0 0 32 seconds for preparations 19 seconds to rsync 100 frames with 597M from raid5,ext4 directory 17 seconds to rsync 100 frames with 595M to raid5,ext4 directory 36 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory 31 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory 267 seconds to run xds_par in raid5,ext4 directory 427 seconds to run the script Local machine: turn5 2.6.32.16 i686 (vanilla, i.e. 
not patched) Raid5: /dev/md5 /mnt/md5 ext4 rw,seclabel,noatime,barrier=0,stripe=512,data=writeback 0 0 36 seconds for preparations 18 seconds to rsync 100 frames with 597M from raid5,ext4 directory 33 seconds to rsync 100 frames with 595M to raid5,ext4 directory 68 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory 40 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory 489 seconds to run xds_par in raid5,ext4 directory 714 seconds to run the script Local machine: turn5 2.6.32.17 i686 Raid5: /dev/md5 /mnt/md5 ext4 rw,seclabel,noatime,barrier=0,stripe=512,data=writeback 0 0 38 seconds for preparations 18 seconds to rsync 100 frames with 597M from raid5,ext4 directory 33 seconds to rsync 100 frames with 595M to raid5,ext4 directory 67 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory 41 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory 266 seconds to run xds_par in raid5,ext4 directory 492 seconds to run the script So even if the patches that went into 2.6.32.17 seem to fix the worst stalls, it is obvious that untarring and rsyncing kernel files is significantly slower on 2.6.32.17 than 2.6.27.48 . HTH, Kay [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 5236 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread