* ext4 performance regression 2.6.27-stable versus 2.6.32 and later
@ 2010-07-28 19:51 Kay Diederichs
  2010-07-28 21:00 ` Greg Freemyer
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Kay Diederichs @ 2010-07-28 19:51 UTC (permalink / raw)
  To: linux; +Cc: Ext4 Developers List, Karsten Schaefer

Dear all,

we reproducibly find significantly worse ext4 performance when our
fileservers run 2.6.32 or later kernels, compared to the 2.6.27-stable
series.

The hardware is a RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an
external eSATA enclosure (STARDOM ST6600); the disks are not partitioned,
the complete disks are used:

md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1]
      3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]

The enclosure is connected through a Silicon Image PCIe-x1 adapter
(supported by sata_sil24) to one of our fileservers (either the backup
fileserver, 32bit desktop hardware with an Intel(R) Pentium(R) D CPU
3.40GHz, or a production fileserver, a 64bit Precision WorkStation 670
with 2 Xeon 3.2GHz).

The ext4 filesystem was created using
  mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg
It is mounted with noatime,data=writeback

As operating system we usually use RHEL5.5, but to exclude problems with
self-compiled kernels we also booted USB sticks with the latest Fedora 12
and Fedora 13.

Our benchmarks consist of copying 100 6MB files from and to the RAID5
over NFS (NFSv3, GB ethernet, TCP, async export), and tar-ing and
rsync-ing kernel trees back and forth. Before and after each individual
benchmark part we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on both
the client and the server.

The problem:
with 2.6.27.48 we typically get:
  44 seconds for preparations
  23 seconds to rsync 100 frames with 597M from nfs directory
  33 seconds to rsync 100 frames with 595M to nfs directory
  50 seconds to untar 24353 kernel files with 323M to nfs directory
  56 seconds to rsync 24353 kernel files with 323M from nfs directory
  67 seconds to run xds_par in nfs directory (reads and writes 600M)
 301 seconds to run the script

with 2.6.32.16 we find:
  49 seconds for preparations
  23 seconds to rsync 100 frames with 597M from nfs directory
 261 seconds to rsync 100 frames with 595M to nfs directory
  74 seconds to untar 24353 kernel files with 323M to nfs directory
  67 seconds to rsync 24353 kernel files with 323M from nfs directory
 290 seconds to run xds_par in nfs directory (reads and writes 600M)
 797 seconds to run the script

This is quite reproducible (times vary by about 1-2%). All times include
reading and writing on the client side (stock CentOS5.5 Nehalem machines
with fast single SATA disks). The 2.6.32.16 times are the same with FC12
and FC13 (booted from USB stick).

The 2.6.27-versus-2.6.32+ regression cannot be due to barriers, because
md RAID5 does not support barriers ("JBD: barrier-based sync failed on
md5 - disabling barriers").

What we tried: noop and deadline schedulers instead of cfq; modifications
of /sys/block/sd[c-g]/queue/max_sectors_kb; switching NCQ on/off;
blockdev --setra 8192 /dev/md5; increasing
/sys/block/md5/md/stripe_cache_size.

When looking at the I/O statistics while the benchmark is running, we see
very choppy patterns for 2.6.32, but quite smooth stats for 2.6.27-stable.

It is not an NFS problem; we see the same effect when transferring the
data using an rsync daemon. We believe, but are not sure, that the problem
does not exist with ext3 - it's not so quick to re-format a 4TB volume.

Any ideas? We cannot believe that a general ext4 regression would have
gone unnoticed. So is it due to the interaction of ext4 with md-RAID5?

thanks,
Kay
--
Kay Diederichs                http://strucbio.biologie.uni-konstanz.de
email: Kay.Diederichs@uni-konstanz.de    Tel +49 7531 88 4049 Fax 3183
Fachbereich Biologie, Universität Konstanz, Box M647, D-78457 Konstanz.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread
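For reference, the benchmark procedure described above can be driven by a
small shell script along these lines (a minimal sketch only; host names,
paths and the omitted xds_par step are placeholders for the local setup,
not the exact script used):

  #!/bin/bash
  # Sketch of the benchmark driver; all paths and hosts are illustrative.
  # drop_caches needs root on both machines.
  NFSDIR=/mnt/nfs-test            # client-side mount of the exported md5/ext4 volume
  LOCAL=/scratch/nfs-test         # fast local SATA disk on the client
  SERVER=fileserver               # NFS server exporting /dev/md5

  drop_caches() {                 # flush dirty data and empty the page cache on both sides
      sync; echo 3 > /proc/sys/vm/drop_caches
      ssh $SERVER 'sync; echo 3 > /proc/sys/vm/drop_caches'
  }

  run() {                         # time one benchmark step
      drop_caches
      local t0=$(date +%s)
      "$@" > /dev/null 2>&1
      echo "$(( $(date +%s) - t0 )) seconds: $*"
  }

  run rsync -a $NFSDIR/frames/ $LOCAL/frames/        # 100 x 6MB from nfs directory
  run rsync -a $LOCAL/frames/ $NFSDIR/frames.copy/   # 100 x 6MB to nfs directory
  run tar -C $NFSDIR/tree -xf $LOCAL/linux.tar       # untar kernel tree to nfs directory
  run rsync -a $NFSDIR/tree/ $LOCAL/tree/            # kernel tree from nfs directory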
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-28 19:51 ext4 performance regression 2.6.27-stable versus 2.6.32 and later Kay Diederichs @ 2010-07-28 21:00 ` Greg Freemyer 2010-08-02 10:47 ` Kay Diederichs 2010-07-29 23:28 ` Dave Chinner 2010-07-30 2:20 ` Ted Ts'o 2 siblings, 1 reply; 15+ messages in thread From: Greg Freemyer @ 2010-07-28 21:00 UTC (permalink / raw) To: Kay Diederichs; +Cc: linux, Ext4 Developers List, Karsten Schaefer On Wed, Jul 28, 2010 at 3:51 PM, Kay Diederichs <Kay.Diederichs@uni-konstanz.de> wrote: > Dear all, > > we reproducibly find significantly worse ext4 performance when our > fileservers run 2.6.32 or later kernels, when compared to the > 2.6.27-stable series. > > The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an > external eSATA enclosure (STARDOM ST6600); disks are not partitioned but > rather the complete disks are used: > md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1] > 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] > [UUUUU] > > The enclosure is connected using a Silicon Image (supported by > sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup > fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU > 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2 > Xeon 3.2GHz). > > The ext4 filesystem was created using > mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg > It is mounted with noatime,data=writeback > > As operating system we usually use RHEL5.5, but to exclude problems with > self-compiled kernels, we also booted USB sticks with latest Fedora12 > and FC13 . > > Our benchmarks consist of copying 100 6MB files from and to the RAID5, > over NFS (NVSv3, GB ethernet, TCP, async export), and tar-ing and > rsync-ing kernel trees back and forth. Before and after each individual > benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on > both the client and the server. > > The problem: > with 2.6.27.48 we typically get: > 44 seconds for preparations > 23 seconds to rsync 100 frames with 597M from nfs directory > 33 seconds to rsync 100 frames with 595M to nfs directory > 50 seconds to untar 24353 kernel files with 323M to nfs directory > 56 seconds to rsync 24353 kernel files with 323M from nfs directory > 67 seconds to run xds_par in nfs directory (reads and writes 600M) > 301 seconds to run the script > > with 2.6.32.16 we find: > 49 seconds for preparations > 23 seconds to rsync 100 frames with 597M from nfs directory > 261 seconds to rsync 100 frames with 595M to nfs directory > 74 seconds to untar 24353 kernel files with 323M to nfs directory > 67 seconds to rsync 24353 kernel files with 323M from nfs directory > 290 seconds to run xds_par in nfs directory (reads and writes 600M) > 797 seconds to run the script > > This is quite reproducible (times varying about 1-2% or so). All times > include reading and writing on the client side (stock CentOS5.5 Nehalem > machines with fast single SATA disks). The 2.6.32.16 times are the same > with FC12 and FC13 (booted from USB stick). > > The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because > md RAID5 does not support barriers ("JBD: barrier-based sync failed on > md5 - disabling barriers"). 
> > What we tried: noop and deadline schedulers instead of cfq; > modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching > on/off NCQ; blockdev --setra 8192 /dev/md5; increasing > /sys/block/md5/md/stripe_cache_size > > When looking at the I/O statistics while the benchmark is running, we > see very choppy patterns for 2.6.32, but quite smooth stats for > 2.6.27-stable. > > It is not an NFS problem; we see the same effect when transferring the > data using an rsync daemon. We believe, but are not sure, that the > problem does not exist with ext3 - it's not so quick to re-format a 4 TB > volume. > > Any ideas? We cannot believe that a general ext4 regression should have > gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ? > > thanks, > > Kay Kay, I didn't read your whole e-mail, but 2.6.27 has known issues with barriers not working in many raid configs. Thus it is more likely to experience data loss in the event of a power failure. With newer kernels, If you prefer to have performance over robustness, you can mount with the "nobarrier" option. So now you have your choice whereas with 2.6.27, with raid5 you effectively had nobarriers as your only choice. Greg ^ permalink raw reply [flat|nested] 15+ messages in thread
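For completeness, disabling barriers on the newer kernels is just a mount
option; nobarrier and barrier=0 are equivalent spellings for ext4 (device
and mount point as in the report above):

  mount -t ext4 -o noatime,data=writeback,nobarrier /dev/md5 /mnt/md5
  # or, on an already mounted filesystem:
  mount -o remount,nobarrier /mnt/md5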
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-28 21:00 ` Greg Freemyer @ 2010-08-02 10:47 ` Kay Diederichs 2010-08-02 16:04 ` Henrique de Moraes Holschuh 0 siblings, 1 reply; 15+ messages in thread From: Kay Diederichs @ 2010-08-02 10:47 UTC (permalink / raw) To: Greg Freemyer; +Cc: linux, Ext4 Developers List, Karsten Schaefer [-- Attachment #1: Type: text/plain, Size: 5041 bytes --] Greg Freemyer schrieb: > On Wed, Jul 28, 2010 at 3:51 PM, Kay Diederichs > <Kay.Diederichs@uni-konstanz.de> wrote: >> Dear all, >> >> we reproducibly find significantly worse ext4 performance when our >> fileservers run 2.6.32 or later kernels, when compared to the >> 2.6.27-stable series. >> >> The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an >> external eSATA enclosure (STARDOM ST6600); disks are not partitioned but >> rather the complete disks are used: >> md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1] >> 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] >> [UUUUU] >> >> The enclosure is connected using a Silicon Image (supported by >> sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup >> fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU >> 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2 >> Xeon 3.2GHz). >> >> The ext4 filesystem was created using >> mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg >> It is mounted with noatime,data=writeback >> >> As operating system we usually use RHEL5.5, but to exclude problems with >> self-compiled kernels, we also booted USB sticks with latest Fedora12 >> and FC13 . >> >> Our benchmarks consist of copying 100 6MB files from and to the RAID5, >> over NFS (NVSv3, GB ethernet, TCP, async export), and tar-ing and >> rsync-ing kernel trees back and forth. Before and after each individual >> benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on >> both the client and the server. >> >> The problem: >> with 2.6.27.48 we typically get: >> 44 seconds for preparations >> 23 seconds to rsync 100 frames with 597M from nfs directory >> 33 seconds to rsync 100 frames with 595M to nfs directory >> 50 seconds to untar 24353 kernel files with 323M to nfs directory >> 56 seconds to rsync 24353 kernel files with 323M from nfs directory >> 67 seconds to run xds_par in nfs directory (reads and writes 600M) >> 301 seconds to run the script >> >> with 2.6.32.16 we find: >> 49 seconds for preparations >> 23 seconds to rsync 100 frames with 597M from nfs directory >> 261 seconds to rsync 100 frames with 595M to nfs directory >> 74 seconds to untar 24353 kernel files with 323M to nfs directory >> 67 seconds to rsync 24353 kernel files with 323M from nfs directory >> 290 seconds to run xds_par in nfs directory (reads and writes 600M) >> 797 seconds to run the script >> >> This is quite reproducible (times varying about 1-2% or so). All times >> include reading and writing on the client side (stock CentOS5.5 Nehalem >> machines with fast single SATA disks). The 2.6.32.16 times are the same >> with FC12 and FC13 (booted from USB stick). >> >> The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because >> md RAID5 does not support barriers ("JBD: barrier-based sync failed on >> md5 - disabling barriers"). 
>> >> What we tried: noop and deadline schedulers instead of cfq; >> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching >> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing >> /sys/block/md5/md/stripe_cache_size >> >> When looking at the I/O statistics while the benchmark is running, we >> see very choppy patterns for 2.6.32, but quite smooth stats for >> 2.6.27-stable. >> >> It is not an NFS problem; we see the same effect when transferring the >> data using an rsync daemon. We believe, but are not sure, that the >> problem does not exist with ext3 - it's not so quick to re-format a 4 TB >> volume. >> >> Any ideas? We cannot believe that a general ext4 regression should have >> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ? >> >> thanks, >> >> Kay > > Kay, > > I didn't read your whole e-mail, but 2.6.27 has known issues with > barriers not working in many raid configs. Thus it is more likely to > experience data loss in the event of a power failure. > > With newer kernels, If you prefer to have performance over robustness, > you can mount with the "nobarrier" option. > > So now you have your choice whereas with 2.6.27, with raid5 you > effectively had nobarriers as your only choice. > > Greg Greg, 2.6.33 and later support md5 write barriers, whereas 2.6.27-stable doesn't. I looked thru the 2.6.32.* Changelogs at http://kernel.org/pub/linux/kernel/v2.6/ but could not find anything indicating that md5 write barriers were backported to 2.6.32-stable. Anyway, we do not get the message "JBD: barrier-based sync failed on md5 - disabling barriers" when using 2.6.32.16 which might indicate that write barriers are indeed active when specifying no options in this respect. Performance-wise, we tried mounting with barrier versus nobarrier (or barrier=1 versus barrier=0) and re-did the 2.6.32+ benchmarks. It turned out that the benchmark difference with and without barrier is less than the variation between runs (which is much higher with 2.6.32+ than with 2.6.27-stable), so the influence seems to be minor. best, Kay [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/x-pkcs7-signature, Size: 5236 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
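The barrier-versus-nobarrier comparison boils down to remounting between
runs and checking the kernel log; roughly (a sketch, with run-benchmark.sh
standing in for the benchmark script):

  mount -o remount,barrier=1 /mnt/md5 && ./run-benchmark.sh
  mount -o remount,barrier=0 /mnt/md5 && ./run-benchmark.sh
  dmesg | grep -i barrier   # 2.6.27 prints "JBD: barrier-based sync failed
                            # on md5 - disabling barriers"; 2.6.32.16 does not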
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-08-02 10:47 ` Kay Diederichs @ 2010-08-02 16:04 ` Henrique de Moraes Holschuh 2010-08-02 16:10 ` Henrique de Moraes Holschuh 0 siblings, 1 reply; 15+ messages in thread From: Henrique de Moraes Holschuh @ 2010-08-02 16:04 UTC (permalink / raw) To: Kay Diederichs Cc: Greg Freemyer, linux, Ext4 Developers List, Karsten Schaefer On Mon, 02 Aug 2010, Kay Diederichs wrote: > Performance-wise, we tried mounting with barrier versus nobarrier (or > barrier=1 versus barrier=0) and re-did the 2.6.32+ benchmarks. It turned > out that the benchmark difference with and without barrier is less than > the variation between runs (which is much higher with 2.6.32+ than with > 2.6.27-stable), so the influence seems to be minor. Did you check interactions with the IO scheduler? -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-08-02 16:04 ` Henrique de Moraes Holschuh @ 2010-08-02 16:10 ` Henrique de Moraes Holschuh 0 siblings, 0 replies; 15+ messages in thread From: Henrique de Moraes Holschuh @ 2010-08-02 16:10 UTC (permalink / raw) To: Kay Diederichs Cc: Greg Freemyer, linux, Ext4 Developers List, Karsten Schaefer On Mon, 02 Aug 2010, Henrique de Moraes Holschuh wrote: > On Mon, 02 Aug 2010, Kay Diederichs wrote: > > Performance-wise, we tried mounting with barrier versus nobarrier (or > > barrier=1 versus barrier=0) and re-did the 2.6.32+ benchmarks. It turned > > out that the benchmark difference with and without barrier is less than > > the variation between runs (which is much higher with 2.6.32+ than with > > 2.6.27-stable), so the influence seems to be minor. > > Did you check interactions with the IO scheduler? Never mind, I reread your first message, and you did. I apologise for the noise. -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-28 19:51 ext4 performance regression 2.6.27-stable versus 2.6.32 and later Kay Diederichs 2010-07-28 21:00 ` Greg Freemyer @ 2010-07-29 23:28 ` Dave Chinner 2010-08-02 14:52 ` Kay Diederichs 2010-07-30 2:20 ` Ted Ts'o 2 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2010-07-29 23:28 UTC (permalink / raw) To: Kay Diederichs; +Cc: linux, Ext4 Developers List, Karsten Schaefer On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: > Dear all, > > we reproducibly find significantly worse ext4 performance when our > fileservers run 2.6.32 or later kernels, when compared to the > 2.6.27-stable series. > > The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an > external eSATA enclosure (STARDOM ST6600); disks are not partitioned but > rather the complete disks are used: > md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1] > 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] > [UUUUU] > > The enclosure is connected using a Silicon Image (supported by > sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup > fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU > 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2 > Xeon 3.2GHz). > > The ext4 filesystem was created using > mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg > It is mounted with noatime,data=writeback > > As operating system we usually use RHEL5.5, but to exclude problems with > self-compiled kernels, we also booted USB sticks with latest Fedora12 > and FC13 . > > Our benchmarks consist of copying 100 6MB files from and to the RAID5, > over NFS (NVSv3, GB ethernet, TCP, async export), and tar-ing and > rsync-ing kernel trees back and forth. Before and after each individual > benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on > both the client and the server. > > The problem: > with 2.6.27.48 we typically get: > 44 seconds for preparations > 23 seconds to rsync 100 frames with 597M from nfs directory > 33 seconds to rsync 100 frames with 595M to nfs directory > 50 seconds to untar 24353 kernel files with 323M to nfs directory > 56 seconds to rsync 24353 kernel files with 323M from nfs directory > 67 seconds to run xds_par in nfs directory (reads and writes 600M) > 301 seconds to run the script > > with 2.6.32.16 we find: > 49 seconds for preparations > 23 seconds to rsync 100 frames with 597M from nfs directory > 261 seconds to rsync 100 frames with 595M to nfs directory > 74 seconds to untar 24353 kernel files with 323M to nfs directory > 67 seconds to rsync 24353 kernel files with 323M from nfs directory > 290 seconds to run xds_par in nfs directory (reads and writes 600M) > 797 seconds to run the script > > This is quite reproducible (times varying about 1-2% or so). All times > include reading and writing on the client side (stock CentOS5.5 Nehalem > machines with fast single SATA disks). The 2.6.32.16 times are the same > with FC12 and FC13 (booted from USB stick). > > The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because > md RAID5 does not support barriers ("JBD: barrier-based sync failed on > md5 - disabling barriers"). 
>
> What we tried: noop and deadline schedulers instead of cfq;
> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching
> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing
> /sys/block/md5/md/stripe_cache_size
>
> When looking at the I/O statistics while the benchmark is running, we
> see very choppy patterns for 2.6.32, but quite smooth stats for
> 2.6.27-stable.
>
> It is not an NFS problem; we see the same effect when transferring the
> data using an rsync daemon. We believe, but are not sure, that the
> problem does not exist with ext3 - it's not so quick to re-format a 4 TB
> volume.
>
> Any ideas? We cannot believe that a general ext4 regression should have
> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ?

Try reverting 50797481a7bdee548589506d7d7b48b08bc14dcd (ext4: Avoid
group preallocation for closed files). IIRC it caused the same sort
of severe performance regressions for postmark....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread
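A minimal sketch of testing that suggestion on a 2.6.32-stable tree (the
repository URL and build steps are illustrative; the commit is in 2.6.32.y
because it went into mainline before 2.6.32, so a plain revert should
apply):

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.32.y.git
  cd linux-2.6.32.y
  git revert 50797481a7bdee548589506d7d7b48b08bc14dcd   # ext4: Avoid group preallocation for closed files
  make oldconfig && make -j4 bzImage modules && make modules_install install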
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-29 23:28 ` Dave Chinner @ 2010-08-02 14:52 ` Kay Diederichs 2010-08-02 16:12 ` Eric Sandeen 0 siblings, 1 reply; 15+ messages in thread From: Kay Diederichs @ 2010-08-02 14:52 UTC (permalink / raw) To: Dave Chinner; +Cc: linux, Ext4 Developers List, Karsten Schaefer Dave Chinner schrieb: > On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: >> Dear all, >> >> we reproducibly find significantly worse ext4 performance when our >> fileservers run 2.6.32 or later kernels, when compared to the >> 2.6.27-stable series. >> >> The hardware is RAID5 of 5 1TB WD10EACS disks (giving almost 4TB) in an >> external eSATA enclosure (STARDOM ST6600); disks are not partitioned but >> rather the complete disks are used: >> md5 : active raid5 sde[0] sdg[5] sdd[3] sdc[2] sdf[1] >> 3907045376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] >> [UUUUU] >> >> The enclosure is connected using a Silicon Image (supported by >> sata_sil24) PCIe-X1 adapter to one of our fileservers (either the backup >> fileserver, 32bit desktop hardware with Intel(R) Pentium(R) D CPU >> 3.40GHz, or a production-fileserver 64bit Precision WorkStation 670 w/ 2 >> Xeon 3.2GHz). >> >> The ext4 filesystem was created using >> mke2fs -j -T largefile -E stride=128,stripe_width=512 -O extent,uninit_bg >> It is mounted with noatime,data=writeback >> >> As operating system we usually use RHEL5.5, but to exclude problems with >> self-compiled kernels, we also booted USB sticks with latest Fedora12 >> and FC13 . >> >> Our benchmarks consist of copying 100 6MB files from and to the RAID5, >> over NFS (NVSv3, GB ethernet, TCP, async export), and tar-ing and >> rsync-ing kernel trees back and forth. Before and after each individual >> benchmark part, we "sync" and "echo 3 > /proc/sys/vm/drop_caches" on >> both the client and the server. >> >> The problem: >> with 2.6.27.48 we typically get: >> 44 seconds for preparations >> 23 seconds to rsync 100 frames with 597M from nfs directory >> 33 seconds to rsync 100 frames with 595M to nfs directory >> 50 seconds to untar 24353 kernel files with 323M to nfs directory >> 56 seconds to rsync 24353 kernel files with 323M from nfs directory >> 67 seconds to run xds_par in nfs directory (reads and writes 600M) >> 301 seconds to run the script >> >> with 2.6.32.16 we find: >> 49 seconds for preparations >> 23 seconds to rsync 100 frames with 597M from nfs directory >> 261 seconds to rsync 100 frames with 595M to nfs directory >> 74 seconds to untar 24353 kernel files with 323M to nfs directory >> 67 seconds to rsync 24353 kernel files with 323M from nfs directory >> 290 seconds to run xds_par in nfs directory (reads and writes 600M) >> 797 seconds to run the script >> >> This is quite reproducible (times varying about 1-2% or so). All times >> include reading and writing on the client side (stock CentOS5.5 Nehalem >> machines with fast single SATA disks). The 2.6.32.16 times are the same >> with FC12 and FC13 (booted from USB stick). >> >> The 2.6.27-versus-2.6.32+ regression cannot be due to barriers because >> md RAID5 does not support barriers ("JBD: barrier-based sync failed on >> md5 - disabling barriers"). 
>> >> What we tried: noop and deadline schedulers instead of cfq; >> modifications of /sys/block/sd[c-g]/queue/max_sectors_kb; switching >> on/off NCQ; blockdev --setra 8192 /dev/md5; increasing >> /sys/block/md5/md/stripe_cache_size >> >> When looking at the I/O statistics while the benchmark is running, we >> see very choppy patterns for 2.6.32, but quite smooth stats for >> 2.6.27-stable. >> >> It is not an NFS problem; we see the same effect when transferring the >> data using an rsync daemon. We believe, but are not sure, that the >> problem does not exist with ext3 - it's not so quick to re-format a 4 TB >> volume. >> >> Any ideas? We cannot believe that a general ext4 regression should have >> gone unnoticed. So is it due to the interaction of ext4 with md-RAID5 ? > > Try reverting 50797481a7bdee548589506d7d7b48b08bc14dcd (ext4: Avoid > group preallocation for closed files). IIRC it caused the same sort > of isevere performance regressions for postmark.... > > Cheers, > > Dave. Dave, as you suggested, we reverted "ext4: Avoid group preallocation for closed files" and this indeed fixes a big part of the problem: after booting the NFS server we get NFS-Server: turn5 2.6.32.16p i686 NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64 exported directory on the nfs-server: /dev/md5 /mnt/md5 ext4 rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0 48 seconds for preparations 28 seconds to rsync 100 frames with 597M from nfs directory 57 seconds to rsync 100 frames with 595M to nfs directory 70 seconds to untar 24353 kernel files with 323M to nfs directory 57 seconds to rsync 24353 kernel files with 323M from nfs directory 133 seconds to run xds_par in nfs directory 425 seconds to run the script For blktrace details, see my next email which is a response to Ted's. best, Kay ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-08-02 14:52 ` Kay Diederichs @ 2010-08-02 16:12 ` Eric Sandeen 2010-08-02 21:08 ` Kay Diederichs 2010-08-03 13:31 ` Kay Diederichs 0 siblings, 2 replies; 15+ messages in thread From: Eric Sandeen @ 2010-08-02 16:12 UTC (permalink / raw) To: Kay Diederichs Cc: Dave Chinner, linux, Ext4 Developers List, Karsten Schaefer [-- Attachment #1: Type: text/plain, Size: 2162 bytes --] On 08/02/2010 09:52 AM, Kay Diederichs wrote: > Dave, > > as you suggested, we reverted "ext4: Avoid group preallocation for > closed files" and this indeed fixes a big part of the problem: after > booting the NFS server we get > > NFS-Server: turn5 2.6.32.16p i686 > NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64 > > exported directory on the nfs-server: > /dev/md5 /mnt/md5 ext4 > rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0 > > 48 seconds for preparations > 28 seconds to rsync 100 frames with 597M from nfs directory > 57 seconds to rsync 100 frames with 595M to nfs directory > 70 seconds to untar 24353 kernel files with 323M to nfs directory > 57 seconds to rsync 24353 kernel files with 323M from nfs directory > 133 seconds to run xds_par in nfs directory > 425 seconds to run the script Interesting, I had found this commit to be a problem for small files which are constantly created & deleted; the commit had the effect of packing the newly created files in the first free space that could be found, rather than walking down the disk leaving potentially fragmented freespace behind (see seekwatcher graph attached). Reverting the patch sped things up for this test, but left the filesystem freespace in bad shape. But you seem to see one of the largest effects in here: 261 seconds to rsync 100 frames with 595M to nfs directory vs 57 seconds to rsync 100 frames with 595M to nfs directory with the patch reverted making things go faster. So you are doing 100 6MB writes to the server, correct? Is the filesystem mkfs'd fresh before each test, or is it aged? If not mkfs'd, is it at least completely empty prior to the test, or does data remain on it? I'm just wondering if fragmented freespace is contributing to this behavior as well. If there is fragmented freespace, then with the patch I think the allocator is more likely to hunt around for small discontiguous chunks of free sapce, rather than going further out in the disk looking for a large area to allocate from. It might be interesting to use seekwatcher on the server to visualize the allocation/IO patterns for the test running just this far? -Eric [-- Attachment #2: rhel6_ext4_comparison.png --] [-- Type: image/png, Size: 113533 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later
  2010-08-02 16:12           ` Eric Sandeen
@ 2010-08-02 21:08             ` Kay Diederichs
  2010-08-03 13:31             ` Kay Diederichs
  1 sibling, 0 replies; 15+ messages in thread
From: Kay Diederichs @ 2010-08-02 21:08 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Chinner, linux, Ext4 Developers List, Karsten Schaefer

[-- Attachment #1: Type: text/plain, Size: 4265 bytes --]

Am 02.08.2010 18:12, schrieb Eric Sandeen:
> On 08/02/2010 09:52 AM, Kay Diederichs wrote:
>> Dave,
>>
>> as you suggested, we reverted "ext4: Avoid group preallocation for
>> closed files" and this indeed fixes a big part of the problem: after
>> booting the NFS server we get
>>
>> NFS-Server: turn5 2.6.32.16p i686
>> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>>
>> exported directory on the nfs-server:
>> /dev/md5 /mnt/md5 ext4
>> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>>
>>  48 seconds for preparations
>>  28 seconds to rsync 100 frames with 597M from nfs directory
>>  57 seconds to rsync 100 frames with 595M to nfs directory
>>  70 seconds to untar 24353 kernel files with 323M to nfs directory
>>  57 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 133 seconds to run xds_par in nfs directory
>> 425 seconds to run the script
>
> Interesting, I had found this commit to be a problem for small files
> which are constantly created & deleted; the commit had the effect of
> packing the newly created files in the first free space that could be
> found, rather than walking down the disk leaving potentially fragmented
> freespace behind (see seekwatcher graph attached). Reverting the patch
> sped things up for this test, but left the filesystem freespace in bad
> shape.
>
> But you seem to see one of the largest effects in here:
>
> 261 seconds to rsync 100 frames with 595M to nfs directory
> vs
> 57 seconds to rsync 100 frames with 595M to nfs directory
>
> with the patch reverted making things go faster. So you are doing 100
> 6MB writes to the server, correct?

correct.

> Is the filesystem mkfs'd fresh
> before each test, or is it aged?

it is too big to "just create it freshly". It was actually created a
week ago, and filled by a single ~ 10-hour rsync job run on the server
such that the filesystem should be filled in the most linear way
possible. Since then, the benchmarking has created and deleted lots of
files.

> If not mkfs'd, is it at least
> completely empty prior to the test, or does data remain on it? I'm just

it's not empty: df -h reports
Filesystem            Size  Used Avail Use% Mounted on
/dev/md5              3.7T  2.8T  712G  80% /mnt/md5

e2freefrag-1.41.12 reports:
Device: /dev/md5
Blocksize: 4096 bytes
Total blocks: 976761344
Free blocks: 235345984 (24.1%)

Min. free extent: 4 KB
Max. free extent: 99348 KB
Avg. free extent: 1628 KB

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :          1858          1858    0.00%
    8K...   16K-  :          3415          8534    0.00%
   16K...   32K-  :          9952         54324    0.02%
   32K...   64K-  :         23884        288848    0.12%
   64K...  128K-  :         27901        658130    0.28%
  128K...  256K-  :         25761       1211519    0.51%
  256K...  512K-  :         35863       3376274    1.43%
  512K... 1024K-  :         48643       9416851    4.00%
    1M...    2M-  :        150311      60704033   25.79%
    2M...    4M-  :        244895     148283666   63.01%
    4M...    8M-  :          3970       5508499    2.34%
    8M...   16M-  :           187        551835    0.23%
   16M...   32M-  :           302       1765912    0.75%
   32M...   64M-  :           282       2727162    1.16%
   64M...  128M-  :            42        788539    0.34%

> wondering if fragmented freespace is contributing to this behavior as
> well. If there is fragmented freespace, then with the patch I think the
> allocator is more likely to hunt around for small discontiguous chunks
> of free sapce, rather than going further out in the disk looking for a
> large area to allocate from.

the last step of the benchmark, "xds_par", reads 600MB and writes 50MB.
It has 16 threads which might put some additional pressure on the
freespace hunting. That step also is fast in 2.6.27.48 but slow in
2.6.32+ .

> It might be interesting to use seekwatcher on the server to visualize
> the allocation/IO patterns for the test running just this far?
>
> -Eric

will try to install seekwatcher.

thanks,
Kay

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5236 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later
  2010-08-02 16:12           ` Eric Sandeen
  2010-08-02 21:08             ` Kay Diederichs
@ 2010-08-03 13:31             ` Kay Diederichs
  1 sibling, 0 replies; 15+ messages in thread
From: Kay Diederichs @ 2010-08-03 13:31 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Dave Chinner, linux, Ext4 Developers List, Karsten Schaefer, Ted Ts'o

[-- Attachment #1.1: Type: text/plain, Size: 4058 bytes --]

Eric Sandeen schrieb:
> On 08/02/2010 09:52 AM, Kay Diederichs wrote:
>> Dave,
>>
>> as you suggested, we reverted "ext4: Avoid group preallocation for
>> closed files" and this indeed fixes a big part of the problem: after
>> booting the NFS server we get
>>
>> NFS-Server: turn5 2.6.32.16p i686
>> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>>
>> exported directory on the nfs-server:
>> /dev/md5 /mnt/md5 ext4
>> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>>
>>  48 seconds for preparations
>>  28 seconds to rsync 100 frames with 597M from nfs directory
>>  57 seconds to rsync 100 frames with 595M to nfs directory
>>  70 seconds to untar 24353 kernel files with 323M to nfs directory
>>  57 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 133 seconds to run xds_par in nfs directory
>> 425 seconds to run the script
>
> Interesting, I had found this commit to be a problem for small files
> which are constantly created & deleted; the commit had the effect of
> packing the newly created files in the first free space that could be
> found, rather than walking down the disk leaving potentially fragmented
> freespace behind (see seekwatcher graph attached). Reverting the patch
> sped things up for this test, but left the filesystem freespace in bad
> shape.
>
> But you seem to see one of the largest effects in here:
>
> 261 seconds to rsync 100 frames with 595M to nfs directory
> vs
> 57 seconds to rsync 100 frames with 595M to nfs directory
>
> with the patch reverted making things go faster. So you are doing 100
> 6MB writes to the server, correct? Is the filesystem mkfs'd fresh
> before each test, or is it aged? If not mkfs'd, is it at least
> completely empty prior to the test, or does data remain on it? I'm just
> wondering if fragmented freespace is contributing to this behavior as
> well. If there is fragmented freespace, then with the patch I think the
> allocator is more likely to hunt around for small discontiguous chunks
> of free sapce, rather than going further out in the disk looking for a
> large area to allocate from.
>
> It might be interesting to use seekwatcher on the server to visualize
> the allocation/IO patterns for the test running just this far?
>
> -Eric

Eric,

seekwatcher does not seem to understand the blktrace output of old
kernels, so I rolled my own primitive plotting, e.g.

blkparse -i md5.xds_par.2.6.32.16p_run1 > blkparse.out
grep flush blkparse.out | grep W > flush_W
grep flush blkparse.out | grep R > flush_R
grep nfsd  blkparse.out | grep R > nfsd_R
grep nfsd  blkparse.out | grep W > nfsd_W
grep sync  blkparse.out | grep R > sync_R
grep sync  blkparse.out | grep W > sync_W
gnuplot<<EOF
set term png
set out '2.6.32.16p_run1.png'
set key outside
set title "2.6.32.16p_run1"
plot 'nfsd_W' us 4:8,'flush_W' us 4:8,'sync_W' us 4:8,'nfsd_R' us 4:8,'flush_R' us 4:8
EOF

I attach the resulting plots for 2.6.27.48_run1 (after booting) and
2.6.27.48_run2 (after run1 ; sync; and drop_cache).
They show seconds on the x axis (horizontal) and block numbers (512-byte blocks, I suppose; the ext4 filesystem has 976761344 4096-byte blocks so that would be about 8e+09 512-byte blocks) on the y axis (vertical). You'll have to do the real interpretation of the plots yourself, but even someone who does not know exactly what the pdflush (in 2.6.27.48) or flush (in 2.6.32+) kernel threads are supposed to do can tell that the kernels behave _very_ differently. In particular, stock 2.6.32.16 every time (only run1 is shown, but run2 is the same) has the flush thread visiting all of the filesystem, in steps of 263168 blocks. I have no idea why it does this. Roughly the first 1/3 of the filesystem is also visited by kernels 2.6.27.48 and the patched 2.6.32.16 that Dave Chinner suggested, but only in the first run after booting. Subsequent runs are fast and do not employ the flush thread much. Hope this helps to pin down the regression. thanks, Kay [-- Attachment #1.2: 2.6.27.48_run1.png --] [-- Type: image/png, Size: 5146 bytes --] [-- Attachment #1.3: 2.6.27.48_run2.png --] [-- Type: image/png, Size: 4484 bytes --] [-- Attachment #1.4: 2.6.32.16p_run1.png --] [-- Type: image/png, Size: 4935 bytes --] [-- Attachment #1.5: 2.6.32.16p_run2.png --] [-- Type: image/png, Size: 4443 bytes --] [-- Attachment #1.6: 2.6.32.16.png --] [-- Type: image/png, Size: 5359 bytes --] [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/x-pkcs7-signature, Size: 5236 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-28 19:51 ext4 performance regression 2.6.27-stable versus 2.6.32 and later Kay Diederichs 2010-07-28 21:00 ` Greg Freemyer 2010-07-29 23:28 ` Dave Chinner @ 2010-07-30 2:20 ` Ted Ts'o 2010-07-30 21:01 ` Kay Diederichs ` (2 more replies) 2 siblings, 3 replies; 15+ messages in thread From: Ted Ts'o @ 2010-07-30 2:20 UTC (permalink / raw) To: Kay Diederichs; +Cc: linux, Ext4 Developers List, Karsten Schaefer On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: > > When looking at the I/O statistics while the benchmark is running, we > see very choppy patterns for 2.6.32, but quite smooth stats for > 2.6.27-stable. Could you try to do two things for me? Using (preferably from a recent e2fsprogs, such as 1.41.11 or 12) run filefrag -v on the files created from your 2.6.27 run and your 2.6.32 run? Secondly can capture blktrace results from 2.6.27 and 2.6.32? That would be very helpful to understand what might be going on. Either would be helpful; both would be greatly appreciated. Thanks, - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
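Capturing what is asked for on the server amounts to something like the
following (a sketch; the frame path and trace file names are illustrative,
run-benchmark.sh stands for the benchmark script, and blktrace/blkparse
must run as root against the md device backing the export):

  filefrag -v /mnt/md5/scratch/nfs-test/tmp/xds/frames/*.cbf > filefrag-$(uname -r).txt

  blktrace -d /dev/md5 -o md5.trace &    # start tracing the md device
  ./run-benchmark.sh                     # run the workload being measured
  kill -INT $!                           # stop blktrace (flushes its buffers)
  blkparse -i md5.trace > blkparse-$(uname -r).txt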
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-30 2:20 ` Ted Ts'o @ 2010-07-30 21:01 ` Kay Diederichs 2010-08-01 23:02 ` Ted Ts'o 2010-08-02 15:28 ` Kay Diederichs [not found] ` <4C56E47B.8080600@uni-konstanz.de> 2 siblings, 1 reply; 15+ messages in thread From: Kay Diederichs @ 2010-07-30 21:01 UTC (permalink / raw) To: Ted Ts'o, linux, Ext4 Developers List, Karsten Schaefer [-- Attachment #1: Type: text/plain, Size: 1556 bytes --] Am 30.07.2010 04:20, schrieb Ted Ts'o: > On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: >> >> When looking at the I/O statistics while the benchmark is running, we >> see very choppy patterns for 2.6.32, but quite smooth stats for >> 2.6.27-stable. > > Could you try to do two things for me? Using (preferably from a > recent e2fsprogs, such as 1.41.11 or 12) run filefrag -v on the files > created from your 2.6.27 run and your 2.6.32 run? > > Secondly can capture blktrace results from 2.6.27 and 2.6.32? That > would be very helpful to understand what might be going on. > > Either would be helpful; both would be greatly appreciated. > > Thanks, > > - Ted Ted, a typical example of filefrag -v output for 2.6.27.48 is Filesystem type is: ef53 File size of /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf is 6229688 (1521 blocks, blocksize 4096) ext logical physical expected length flags 0 0 796160000 1024 1 1024 826381312 796161023 497 eof (99 out of 100 files have 2 extents) whereas for 2.6.32.16 the result is typically Filesystem type is: ef53 File size of /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf is 6229688 (1521 blocks, blocksize 4096) ext logical physical expected length flags 0 0 826376200 1521 eof /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf: 1 extent found (99 out of 100 files have 1 extent) We'll try the blktrace ASAP and report back. thanks, Kay [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 5236 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
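The per-file layout can be summarized over all 100 frames in one go, e.g.
(a sketch, using the same path as in the filefrag output above):

  for f in /mnt/md5/scratch/nfs-test/tmp/xds/frames/*.cbf; do
      filefrag "$f"                     # prints "<file>: N extents found"
  done | awk '{print $(NF-2)}' | sort | uniq -c   # histogram of extent counts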
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-30 21:01 ` Kay Diederichs @ 2010-08-01 23:02 ` Ted Ts'o 0 siblings, 0 replies; 15+ messages in thread From: Ted Ts'o @ 2010-08-01 23:02 UTC (permalink / raw) To: Kay Diederichs; +Cc: linux, Ext4 Developers List, Karsten Schaefer On Fri, Jul 30, 2010 at 11:01:36PM +0200, Kay Diederichs wrote: > whereas for 2.6.32.16 the result is typically > Filesystem type is: ef53 > File size of > /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf is > 6229688 (1521 blocks, blocksize 4096) > ext logical physical expected length flags > 0 0 826376200 1521 eof > /mnt/md5/scratch/nfs-test/tmp/xds/frames/h2g28_1_00000.cbf: 1 extent found OK, so 2.6.32 is actually doing a better job laying out the files.... The blktrace will be interesting, but at this point I'm wondering if this is a generic kernel-wide writeback regression. At $WORK we've noticed some performance regressions between 2.6.26-based kernels and 2.6.33- and 2.6.34-based kernels with both ext2 and ext4 (in no journal mode) that we've been trying to track down. We have a pretty large number of patches applied to both 2.6.26 and 2.6.33/34 which is why I haven't mentioned it up until now, but at this point it seems pretty clear there are some writeback issues in the mainline kernel. There are half a dozen or so patch series on LKML that are addressing writeback in one way or another, and writeback is a major topic at the upcoming Linux Storage and Filesystem workshop. So if this is the cause, hopefully there will be some improvements in this area in the near future. - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
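A low-tech way to compare writeback behaviour between the kernels is to
watch the dirty and writeback counters on the server while one benchmark
step runs, for example:

  while sleep 1; do
      grep -E '^(Dirty|Writeback):' /proc/meminfo | tr '\n' ' '; echo
  done

Smooth, steadily draining numbers versus long plateaus followed by bursts
would point at writeback/flush behaviour rather than at block allocation.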
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later 2010-07-30 2:20 ` Ted Ts'o 2010-07-30 21:01 ` Kay Diederichs @ 2010-08-02 15:28 ` Kay Diederichs [not found] ` <4C56E47B.8080600@uni-konstanz.de> 2 siblings, 0 replies; 15+ messages in thread From: Kay Diederichs @ 2010-08-02 15:28 UTC (permalink / raw) To: Ted Ts'o, Kay Diederichs, linux, Ext4 Developers List, Karsten Schaefer < Ted Ts'o schrieb: > On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote: >> When looking at the I/O statistics while the benchmark is running, we >> see very choppy patterns for 2.6.32, but quite smooth stats for >> 2.6.27-stable. > > Could you try to do two things for me? Using (preferably from a > recent e2fsprogs, such as 1.41.11 or 12) run filefrag -v on the files > created from your 2.6.27 run and your 2.6.32 run? > > Secondly can capture blktrace results from 2.6.27 and 2.6.32? That > would be very helpful to understand what might be going on. > > Either would be helpful; both would be greatly appreciated. > > Thanks, > > - Ted Ted, we pared down the benchmark to the last step (called "run xds_par in nfs directory (reads 600M, and writes 50M)") because this captures most of the problem. Here we report kernel messages with stacktrace, and the blktrace output that you requested. Kernel messages: with 2.6.32.16 we observe [ 6961.838032] INFO: task jbd2/md5-8:2010 blocked for more than 120 seconds. [ 6961.838111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 6961.838191] jbd2/md5-8 D 00000634 0 2010 2 0x00000000 [ 6961.838200] f5171e78 00000046 231a9999 00000634 ddf91f2c f652cc4c 00000001 f5171e1c [ 6961.838307] c0a6f140 c0a6f140 c0a6f140 c0a6a6ac f6354c20 f6354ecc c2008140 00000000 [ 6961.838412] 00637f84 00000003 f652cc58 00000000 00000292 00000048 c20036ac f6354ecc [ 6961.838518] Call Trace: [ 6961.838556] [<c056c39e>] jbd2_journal_commit_transaction+0x1d9/0x1187 [ 6961.838627] [<c040220a>] ? __switch_to+0xd5/0x147 [ 6961.838681] [<c07a390a>] ? schedule+0x837/0x885 [ 6961.838734] [<c0455e5f>] ? autoremove_wake_function+0x0/0x38 [ 6961.838799] [<c0448c84>] ? try_to_del_timer_sync+0x58/0x60 [ 6961.838859] [<c0572426>] kjournald2+0xa2/0x1be [ 6961.838909] [<c0455e5f>] ? autoremove_wake_function+0x0/0x38 [ 6961.838971] [<c0572384>] ? kjournald2+0x0/0x1be [ 6961.839035] [<c0455c11>] kthread+0x66/0x6b [ 6961.839089] [<c0455bab>] ? kthread+0x0/0x6b [ 6961.839139] [<c0404167>] kernel_thread_helper+0x7/0x10 [ 6961.839215] INFO: task sync:11600 blocked for more than 120 seconds. [ 6961.839286] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 6961.839367] sync D 00000632 0 11600 11595 0x00000000 [ 6961.839375] ddf91ea4 00000086 dca59d8b 00000632 76f570ee 00000048 00001dec 773fd796 [ 6961.839486] c0a6f140 c0a6f140 c0a6f140 c200819c f4ce0000 f4ce02ac c2008140 00000000 [ 6961.839600] ddf91ea8 dca5c36b 00000632 dca5bb77 c2008180 773fd796 00000282 f4ce02ac [ 6961.839727] Call Trace: [ 6961.839762] [<c04f4de6>] bdi_sched_wait+0xd/0x11 [ 6961.841438] [<c07a3ede>] __wait_on_bit+0x3b/0x62 [ 6961.843109] [<c04f4dd9>] ? bdi_sched_wait+0x0/0x11 [ 6961.844782] [<c07a3fb5>] out_of_line_wait_on_bit+0xb0/0xb8 [ 6961.846479] [<c04f4dd9>] ? bdi_sched_wait+0x0/0x11 [ 6961.848181] [<c0455e97>] ? 
wake_bit_function+0x0/0x48
[ 6961.849906] [<c04f4c75>] wait_on_bit+0x25/0x31
[ 6961.851601] [<c04f4e5d>] sync_inodes_sb+0x73/0x121
[ 6961.853287] [<c04f8acc>] __sync_filesystem+0x48/0x69
[ 6961.854983] [<c04f8b72>] sync_filesystems+0x85/0xc7
[ 6961.856670] [<c04f8c04>] sys_sync+0x20/0x32
[ 6961.858363] [<c040351b>] sysenter_do_call+0x12/0x28

Blktrace:
blktrace was run for 2.6.27.48, 2.6.32.16 and a patched 2.6.32.16 (called
2.6.32.16p below and in the .tar file), where the patch just reverts
"ext4: Avoid group preallocation for closed files". This revert removes a
substantial part of the regression.

For 2.6.32.16p and 2.6.27.48 there are two runs: run1 is directly after
booting; then the directory is unexported, unmounted, mounted, exported,
and run2 is done. For 2.6.32.16 there is just run1; all subsequent runs
yield approximately the same results, i.e. they are as slow as run1.

Some numbers (time, and number of lines with flush|nfsd|sync in the
blkparse output):

            2.6.27.48       2.6.32.16       2.6.32.16p
            run1    run2    run1    run2    run1    run2
 wallclock  113s    61s     280s    ~280s   137s    61s
 flush      25362   9285    71861           32656   12066
 nfsd       7595    8580    8685            8359    8444
 sync       2860    3925    303             212     169

The total time seems to be dominated by the number of flushes. It should
be noted that all these runs used barrier=0; barrier=1 does not have a
significant effect, though.

So we find:
a) in 2.6.32.16 there is a problem which manifests itself in kernel
   messages associated with the jbd2/md5-8 and sync tasks, and in a
   vastly increased number of flush operations
b) reverting the patch "ext4: Avoid group preallocation for closed files"
   cures part of the problem
c) even after reverting that patch, the first run takes much longer than
   subsequent runs, despite "sync", "echo 3 > /proc/sys/vm/drop_caches",
   and umounting/re-mounting the filesystem.

The blktrace files are at
http://strucbio.biologie.uni-konstanz.de/~dikay/blktraces.tar.bz2 .
Should we test any other patches?

thanks,
Kay

^ permalink raw reply	[flat|nested] 15+ messages in thread
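The flush/nfsd/sync line counts above were presumably obtained by
filtering the blkparse output per task name, along the lines of (a sketch,
using the same trace file naming as the plotting commands elsewhere in the
thread):

  blkparse -i md5.xds_par.2.6.32.16_run1 > blkparse.out
  for task in flush nfsd sync; do
      printf '%-10s %d\n' "$task" "$(grep -c $task blkparse.out)"
  done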
* Re: ext4 performance regression 2.6.27-stable versus 2.6.32 and later [not found] ` <20100802202123.GC25653@thunk.org> @ 2010-08-04 8:18 ` Kay Diederichs 0 siblings, 0 replies; 15+ messages in thread From: Kay Diederichs @ 2010-08-04 8:18 UTC (permalink / raw) To: Ted Ts'o; +Cc: Dave Chinner, Ext4 Developers List, linux, Karsten Schaefer [-- Attachment #1: Type: text/plain, Size: 3337 bytes --] Am 02.08.2010 22:21, schrieb Ted Ts'o: > On Mon, Aug 02, 2010 at 05:30:03PM +0200, Kay Diederichs wrote: >> >> we pared down the benchmark to the last step (called "run xds_par in nfs >> directory (reads 600M, and writes 50M)") because this captures most of >> the problem. Here we report kernel messages with stacktrace, and the >> blktrace output that you requested. > > Thanks, I'll take a look at it. > > Is NFS required to reproduce the problem? If you simply copy the 100 > files using rsync, or cp -r while logged onto the server, do you > notice the performance regression? > > Thanks, regards, > > - Ted Ted, we've run the benchmarks internally on the file server; it turns out that NFS is not required to reproduce the problem. We also took the opportunity to try 2.6.32.17 which just came out. 2.6.32.17 behaves similar to 2.6.32.16-patched (i.e. with reverted "ext4: Avoid group preallocation for closed files"); 2.6.32.17 has quite a few ext4 patches so one or a couple of those seems to have a similar effect as reverting "ext4: Avoid group preallocation for closed files". These are the times for the second (and higher) benchmark runs; the first run is always slower. The last step ("run xds_par") is slower than in the NFS case because it's heavy in CPU usage (total CPU time is more than 200 seconds); the NFS client is a 8-core (+HT) Nehalem-type machine, whereas the NFS server is just a 2-core Pentium D @ 3.40GHz Local machine: turn5 2.6.27.48 i686 Raid5: /dev/md5 /mnt/md5 ext4dev rw,noatime,barrier=1,stripe=512,data=writeback 0 0 32 seconds for preparations 19 seconds to rsync 100 frames with 597M from raid5,ext4 directory 17 seconds to rsync 100 frames with 595M to raid5,ext4 directory 36 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory 31 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory 267 seconds to run xds_par in raid5,ext4 directory 427 seconds to run the script Local machine: turn5 2.6.32.16 i686 (vanilla, i.e. 
not patched) Raid5: /dev/md5 /mnt/md5 ext4 rw,seclabel,noatime,barrier=0,stripe=512,data=writeback 0 0 36 seconds for preparations 18 seconds to rsync 100 frames with 597M from raid5,ext4 directory 33 seconds to rsync 100 frames with 595M to raid5,ext4 directory 68 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory 40 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory 489 seconds to run xds_par in raid5,ext4 directory 714 seconds to run the script Local machine: turn5 2.6.32.17 i686 Raid5: /dev/md5 /mnt/md5 ext4 rw,seclabel,noatime,barrier=0,stripe=512,data=writeback 0 0 38 seconds for preparations 18 seconds to rsync 100 frames with 597M from raid5,ext4 directory 33 seconds to rsync 100 frames with 595M to raid5,ext4 directory 67 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory 41 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory 266 seconds to run xds_par in raid5,ext4 directory 492 seconds to run the script So even if the patches that went into 2.6.32.17 seem to fix the worst stalls, it is obvious that untarring and rsyncing kernel files is significantly slower on 2.6.32.17 than 2.6.27.48 . HTH, Kay [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 5236 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread