From: Dave Chinner <david@fromorbit.com>
To: Theodore Tso <tytso@MIT.EDU>
Cc: xfs@oss.sgi.com, linux-ext4@vger.kernel.org,
linux-fsdevel@vger.kernel.org
Subject: Re: [RFC][PATCH 0/8] xfstests: rework large filesystem testing
Date: Tue, 22 Nov 2011 20:28:43 +1100
Message-ID: <20111122092843.GA8098@dastard>
In-Reply-To: <CF9FF112-852A-473B-99E4-3A5B790AE147@mit.edu>
On Mon, Nov 21, 2011 at 07:10:45AM -0500, Theodore Tso wrote:
>
> On Nov 21, 2011, at 6:31 AM, Dave Chinner wrote:
>
> > ext4, however, still has serious issues with this - either we take
> > the mkfs.ext4 time hit to initialise all the block groups, or we
> > take it during the preallocation. IOWs, the "don't do work at mkfs
> > but do it after mount" hack^Wtradeoff simply does not work for
> > testing large filesystems in this manner. While it is possible to
> > run large filesystem tests on ext4 using this mechanism, it is
> > extremely painful to do so.
>
> For testing, we can disable the "do it after the mount" aspect
> of ext4 by using the mount option "noinit_itable". We basically
> only need to zero the inode table to make sure e2fsck doesn't
> confuse old inode tables as new ones in the event that the block
It's not the deferred inode table initialisation that causes the
problem during preallocation immediately after a mkfs and mount -
it's the block group initialisation:
[363806.042907] SysRq : Show Blocked State
[363806.044586] task PC stack pid father
[363806.046400] xfs_io D ffff8801099aed08 0 7264 7064 0x00000000
[363806.046400] ffff880117e33868 0000000000000086 0000000000000000 ffffffffb13a2903
[363806.046400] ffff8801099ae980 ffff880117e33fd8 ffff880117e33fd8 ffff880117e33fd8
[363806.046400] ffff88011afb44c0 ffff8801099ae980 ffff880117e33868 00000001810b59ed
[363806.046400] Call Trace:
[363806.046400] [<ffffffff8118eec0>] ? __wait_on_buffer+0x30/0x30
[363806.046400] [<ffffffff81aab3af>] schedule+0x3f/0x60
[363806.046400] [<ffffffff81aab45f>] io_schedule+0x8f/0xd0
[363806.046400] [<ffffffff8118eece>] sleep_on_buffer+0xe/0x20
[363806.046400] [<ffffffff81aabc2f>] __wait_on_bit+0x5f/0x90
[363806.046400] [<ffffffff8167e177>] ? generic_make_request+0xc7/0x100
[363806.046400] [<ffffffff8118eec0>] ? __wait_on_buffer+0x30/0x30
[363806.046400] [<ffffffff81aabcdc>] out_of_line_wait_on_bit+0x7c/0x90
[363806.046400] [<ffffffff810ac360>] ? autoremove_wake_function+0x40/0x40
[363806.046400] [<ffffffff8118eebe>] __wait_on_buffer+0x2e/0x30
[363806.046400] [<ffffffff812824c3>] ext4_mb_init_cache+0x223/0x9c0
[363806.046400] [<ffffffff81118583>] ? add_to_page_cache_locked+0xb3/0x100
[363806.046400] [<ffffffff81282dae>] ext4_mb_init_group+0x14e/0x210
[363806.046400] [<ffffffff812832d9>] ext4_mb_load_buddy+0x339/0x350
[363806.046400] [<ffffffff8128465b>] ext4_mb_find_by_goal+0x6b/0x2b0
[363806.046400] [<ffffffff81285034>] ext4_mb_regular_allocator+0x64/0x430
[363806.046400] [<ffffffff81286d8d>] ext4_mb_new_blocks+0x40d/0x560
[363806.046400] [<ffffffff81aad1ee>] ? _raw_spin_lock+0xe/0x20
[363806.046400] [<ffffffff81aad1ee>] ? _raw_spin_lock+0xe/0x20
[363806.046400] [<ffffffff8127c6a1>] ext4_ext_map_blocks+0xfa1/0x1d10
[363806.046400] [<ffffffff8129a6aa>] ? jbd2__journal_start+0xca/0x110
[363806.046400] [<ffffffff81252535>] ext4_map_blocks+0x1b5/0x280
[363806.046400] [<ffffffff8127ddf5>] ext4_fallocate+0x1c5/0x530
[363806.046400] [<ffffffff8115e992>] do_fallocate+0xf2/0x160
[363806.046400] [<ffffffff8115ea4b>] sys_fallocate+0x4b/0x70
[363806.046400] [<ffffffff81ab5082>] system_call_fastpath+0x16/0x1b
This initialisation runs at about 50MB/s for some periods of the
preallocation. Sample from iostat -d -x -m 5:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdc 0.00 1352.30 46.91 178.64 0.18 51.44 468.74 4.38 19.42 18.57 19.64 4.00 90.30
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdc 0.00 1405.40 47.20 184.40 0.18 50.97 452.34 5.99 25.84 18.31 27.77 3.91 90.56
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdc 0.00 4302.40 38.60 377.40 0.15 57.49 283.79 31.68 76.17 23.50 81.55 2.20 91.68
shows it is close to IO bound. This is on a 12 disk RAID-0 array w/
a 512MB BBWC. That indicates that most of the IO being done is
random. perf top shows that the limited amount of CPU time
being spent is distributed like this:
samples pcnt function DSO
_______ _____ _____________________________ _________________
83.00 6.7% ext4_init_block_bitmap [kernel.kallsyms]
82.00 6.6% crc16 [kernel.kallsyms]
73.00 5.9% __find_get_block [kernel.kallsyms]
65.00 5.2% ext4_num_overhead_clusters [kernel.kallsyms]
62.00 5.0% ext4_set_bits [kernel.kallsyms]
56.00 4.5% ext4_ext_find_extent [kernel.kallsyms]
55.00 4.4% ext4_mark_iloc_dirty [kernel.kallsyms]
53.00 4.3% jbd2_journal_add_journal_head [kernel.kallsyms]
50.00 4.0% do_get_write_access [kernel.kallsyms]
45.00 3.6% mb_find_order_for_block [kernel.kallsyms]
41.00 3.3% ext4_ext_map_blocks [kernel.kallsyms]
34.00 2.7% jbd2_journal_cancel_revoke [kernel.kallsyms]
28.00 2.3% jbd2_journal_dirty_metadata [kernel.kallsyms]
27.00 2.2% jbd2_journal_put_journal_head [kernel.kallsyms]
The rest of the time, there is no IO and the preallocation is
severely CPU bound. Top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7523 root 20 0 10848 792 636 R 99 0.0 0:23.39 xfs_io
and perf top -p <pid of xfs_io> shows:
samples pcnt function DSO
_______ _____ ___________________________ _________________
13840.00 89.2% ext4_mb_good_group [kernel.kallsyms]
1218.00 7.8% ext4_mb_regular_allocator [kernel.kallsyms]
148.00 1.0% mb_find_order_for_block [kernel.kallsyms]
85.00 0.5% find_next_zero_bit [kernel.kallsyms]
78.00 0.5% radix_tree_lookup_element [kernel.kallsyms]
54.00 0.3% find_get_page [kernel.kallsyms]
53.00 0.3% mb_find_extent.constprop.31 [kernel.kallsyms]
12.00 0.1% mb_find_buddy [kernel.kallsyms]
10.00 0.1% ext4_mb_load_buddy [kernel.kallsyms]
which, if I read the code correctly, is CPU bound searching for a
block group to allocate from.
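The shape of that search is easy to model: the allocator walks block groups from the goal group looking for one with enough free space, so once most groups are full - as they are towards the end of a large preallocation - each allocation degenerates into a near-complete linear scan. A toy model of that behaviour (purely illustrative, not the real ext4_mb_good_group() logic):

```python
def find_good_group(free_blocks, goal, wanted):
    """Toy model of a mballoc-style group search: scan groups starting
    at the goal group, wrapping around, until one has enough free
    space.  Purely illustrative; the real ext4 code checks much more
    per-group state than a free block count."""
    n = len(free_blocks)
    for i in range(n):
        g = (goal + i) % n
        if free_blocks[g] >= wanted:
            return g, i + 1          # group found, groups examined
    return -1, n                     # no group has enough space

# When almost every group is full, each allocation examines nearly
# every group before finding space - O(groups) CPU per allocation.
free_blocks = [0] * 131071 + [32768]   # only the last group has space
g, examined = find_good_group(free_blocks, goal=0, wanted=100)
print(g, examined)
```

With ~131072 groups (the back-of-envelope figure above), that per-allocation scan cost is consistent with ext4_mb_good_group dominating the profile.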
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com