From mboxrd@z Thu Jan  1 00:00:00 1970
From: Karsten Weiss
Subject: Bad ext4 sync performance on 16 TB GPT partition
Date: Fri, 26 Feb 2010 11:18:47 +0100 (CET)
Message-ID: 
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
To: linux-ext4@vger.kernel.org

Hi,

(please Cc: me, I'm not subscribed to the list)

We were running some ext4 tests on a 16 TB GPT partition and hit this
issue when writing a single large file with dd and syncing afterwards.
The problem: dd itself is fast (cached), but the following sync is
*very* slow.

# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.9423 seconds, 658 MB/s
0.01user 441.40system 7:26.10elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+794minor)pagefaults 0swaps

dd:   ~16 seconds
sync: ~7 minutes

(The same test finishes in 57 s with xfs!)

Here's the "iostat -m /dev/sdb 1" output during the dd write:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    6,62   19,35    0,00   74,03

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             484,00         0,00       242,00          0        242

"iostat -m /dev/sdb 1" during the sync looks like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   12,48    0,00    0,00   87,52

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb              22,00         0,00         8,00          0          8

However, the sync performance is fine if we ...
* use xfs, or
* disable the ext4 journal, or
* disable ext4 extents (but with the journal enabled).

Here's a kernel profile of the test:

# readprofile -r
# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB_3 bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.8261 seconds, 663 MB/s
0.01user 448.55system 7:32.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+788minor)pagefaults 0swaps
# readprofile -m /boot/System.map-2.6.18-190.el5 | sort -nr -k 3 | head -15
3450304 default_idle                             43128.8000
   9532 mod_zone_page_state                        733.2308
  58594 find_get_pages                             537.5596
  58499 find_get_pages_tag                         427.0000
  72404 __set_page_dirty_nobuffers                 310.7468
  10740 __wake_up_bit                              238.6667
   7786 unlock_page                                165.6596
   1996 dec_zone_page_state                        153.5385
  12230 clear_page_dirty_for_io                     63.0412
   5938 page_waitqueue                              60.5918
  14440 release_pages                               41.7341
  12664 __mark_inode_dirty                          34.6011
   5281 copy_user_generic_unrolled                  30.7035
    323 redirty_page_for_writepage                  26.9167
  15537 write_cache_pages                           18.9939

Here are three call traces from the "sync" command:

sync          R running task       0  5041   5032          (NOTLB)
 00000000ffffffff 000000000000000c ffff81022bb16510 00000000001ec2a3
 ffff81022bb16548 ffffffffffffff10 ffffffff800d1964 0000000000000010
 0000000000000286 ffff8101ff56bb48 0000000000000018 ffff81033686e970
Call Trace:
 [] page_mkclean+0x255/0x281
 [] find_get_pages_tag+0x34/0x89
 [] write_cache_pages+0x21b/0x332
 [] :ext4:__mpage_da_writepage+0x0/0x162
 [] :ext4:ext4_da_writepages+0x317/0x4fe
 [] do_writepages+0x20/0x2f
 [] __writeback_single_inode+0x1ae/0x328
 [] wait_on_page_writeback_range+0xd6/0x12e
 [] sync_sb_inodes+0x1b5/0x26f
 [] sync_inodes_sb+0x99/0xa9
 [] __sync_inodes+0x5d/0xaa
 [] sync_inodes+0x11/0x29
 [] do_sync+0x12/0x5a
 [] sys_sync+0xe/0x12
 [] tracesys+0xd5/0xe0

sync          R running task       0  5041   5032          (NOTLB)
 ffff81022a3e4e88 ffffffff8000e930 ffff8101ff56bee8 ffff8101ff56bcd8
 0000000000000000 ffffffff886fbdbc ffff8101ff56bc68 ffff8101ff56bc68
 00000000002537c3
 000000000001dfcd 0000000000000007 ffff8101ff56bc90
Call Trace:
 [] __set_page_dirty_nobuffers+0xde/0xe9
 [] find_get_pages+0x2f/0x6d
 [] :ext4:mpage_da_submit_io+0xd0/0x12c
 [] :ext4:ext4_da_writepages+0x343/0x4fe
 [] do_writepages+0x20/0x2f
 [] __writeback_single_inode+0x1ae/0x328
 [] wait_on_page_writeback_range+0xd6/0x12e
 [] sync_sb_inodes+0x1b5/0x26f
 [] sync_inodes_sb+0x99/0xa9
 [] __sync_inodes+0x5d/0xaa
 [] sync_inodes+0x11/0x29
 [] do_sync+0x12/0x5a
 [] sys_sync+0xe/0x12
 [] tracesys+0xd5/0xe0

sync          R running task       0  5353   5348          (NOTLB)
 ffff810426e04048 0000000000001200 0000000100000001 0000000000000001
 0000000100000000 ffff8103dcecadf8 ffff810426dfd000 ffff8102ebd81b48
 ffff810426e04048 ffff810239bbbc40 0000000000000008 00000000008447f8
Call Trace:
 [] elv_merged_request+0x1e/0x26
 [] __make_request+0x324/0x401
 [] cache_alloc_refill+0x106/0x186
 [] :ext4:walk_page_buffers+0x65/0x8b
 [] __set_page_dirty_nobuffers+0xde/0xe9
 [] :ext4:ext4_writepage+0x9b/0x333
 [] :ext4:mpage_da_submit_io+0xd0/0x12c
 [] :ext4:ext4_da_writepages+0x343/0x4fe
 [] do_writepages+0x20/0x2f
 [] __writeback_single_inode+0x1ae/0x328
 [] wait_on_page_writeback_range+0xd6/0x12e
 [] sync_sb_inodes+0x1b5/0x26f
 [] sync_inodes_sb+0x99/0xa9
 [] __sync_inodes+0x5d/0xaa
 [] sync_inodes+0x11/0x29
 [] do_sync+0x12/0x5a
 [] sys_sync+0xe/0x12
 [] tracesys+0xd5/0xe0

I've tried some more options. These do *not* influence the (bad) result:

* data=writeback and data=ordered
* disabling/enabling uninit_bg
* max_sectors_kb=512 or 4096
* io scheduler: cfq or noop

Some background information about the system:

OS: CentOS 5.4
Memory: 16 GB
CPUs: 2x Quad-Core Opteron 2356
IO scheduler: CFQ
Kernels:
* 2.6.18-164.11.1.el5 x86_64 (latest CentOS 5.4 kernel)
* 2.6.18-190.el5 x86_64 (latest Red Hat EL5 test kernel I found at
  http://people.redhat.com/jwilson/el5/; according to the rpm's
  changelog its ext4 code was updated from the 2.6.32 ext4 codebase)
* I have not tried a vanilla kernel so far.
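For completeness, the journal/extents variants and the block-layer knobs
mentioned above can be toggled roughly like this (a sketch only, run as
root; the tune2fs/mkfs.ext4 steps are destructive and assume the
/dev/sdb1 test partition, and the exact extents feature name may differ
between e2fsprogs versions):

```shell
# Disable the ext4 journal (the filesystem must be unmounted first):
umount /mnt/large
tune2fs -O ^has_journal /dev/sdb1

# Or re-create the filesystem without extents but with a journal
# (destroys all data on /dev/sdb1):
mkfs.ext4 -O ^extent /dev/sdb1

# Block-layer knobs varied in the tests, via sysfs:
echo 512  > /sys/block/sdb/queue/max_sectors_kb   # or 4096
echo noop > /sys/block/sdb/queue/scheduler        # or cfq
```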
# df -h /dev/sdb{1,2}
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1              15T  9.9G   14T   1% /mnt/large
/dev/sdb2             7.3T  179M  7.0T   1% /mnt/small

(parted) print
Model: easyRAID easyRAID_Q16PS (scsi)
Disk /dev/sdb: 24,0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name   Flags
 1      1049kB  16,0TB  16,0TB  ext3         large
 2      16,0TB  24,0TB  8003GB  ext3         small

("Start 1049kB" is at sector 2048.)

sdb is an FC HW-RAID (easyRAID_Q16PS): a RAID6 volume built from 14
disks with a 128 KB chunk size.

QLogic Fibre Channel HBA Driver: 8.03.01.04.05.05-k
QLogic QLE2462 - PCI-Express Dual Channel 4Gb Fibre Channel HBA
ISP2432: PCIe (2.5Gb/s x4) @ 0000:01:00.1 hdma+, host#=8, fw=4.04.09 (486)

The ext4 filesystem was created with

  mkfs.ext4 -T ext4,largefile4 -E stride=32,stripe-width=$((32*(14-2))) /dev/sdb1

or

  mkfs.ext4 -T ext4 /dev/sdb1

Mount options: defaults,noatime,data=writeback

Any ideas?

-- 
Karsten Weiss
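P.S. For reference, the stride/stripe-width values in the first mkfs
command follow from the RAID geometry (a quick sketch; the 4 KiB
filesystem block size is an assumption, the mkfs.ext4 default for
large filesystems):

```shell
chunk_kb=128   # RAID6 chunk size
block_kb=4     # assumed ext4 block size
disks=14
parity=2       # RAID6 has two parity disks per stripe

stride=$((chunk_kb / block_kb))              # filesystem blocks per chunk
stripe_width=$((stride * (disks - parity)))  # blocks per full data stripe
echo "stride=$stride stripe-width=$stripe_width"
# -> stride=32 stripe-width=384
```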