From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Andreas Dilger <adilger@dilger.ca>
Cc: tytso@mit.edu, linux-ext4@vger.kernel.org
Subject: Re: [PATCH 37/49] e2fsck: read-ahead metadata during passes 1, 2, and 4
Date: Mon, 17 Mar 2014 23:50:01 -0700 [thread overview]
Message-ID: <20140318065001.GI9070@birch.djwong.org> (raw)
In-Reply-To: <20140318044231.GA20038@birch.djwong.org>
On Mon, Mar 17, 2014 at 09:42:31PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 17, 2014 at 05:10:22PM -0600, Andreas Dilger wrote:
> >
> > On Mar 11, 2014, at 12:57 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > > e2fsck pass1 is modified to use the block group data prefetch function
> > > to try to fetch the inode tables into the pagecache before it is
> > > needed. In order to avoid cache thrashing, we limit ourselves to
> > > prefetching at most half the available memory.
> >
> > It looks like the prefetching is done in huge chunks, and not incrementally?
> > It makes more sense to have a steady amount of prefetch happening instead
> > of waiting for it to all be consumed before starting a new batch. See in
> > e2fsck_pass1() below.
>
> I agree that prefetch ought not to wait until the entire inode table is
> consumed.
>
> > > pass2 is modified to use the dirblock prefetching function to prefetch
> > > the list of directory blocks that are assembled in pass1. So long as
> > > we don't anticipate rehashing the dirs (pass 3a), we can release the
> > > dirblocks as soon as we're done checking them.
> > >
> > > pass4 is modified to prefetch the block and inode bitmaps in
> > > anticipation of pass 5, because pass4 is entirely CPU bound.
> > >
> > > In general, these mechanisms can halve fsck time, if the host system
> > > has sufficient memory and the storage system can provide a lot of
> > > IOPs. SSDs and multi-spindle RAIDs see the most speedup; single disks
> > > experience a modest speedup, and single-spindle USB mass storage
> > > devices see hardly any benefit.
> > >
> > > By default, readahead will try to fill half the physical memory in the
> > > system. The -R option can be given to specify the amount of memory to
> > > use for readahead, or zero to disable it entirely; or an option can be
> > > given in e2fsck.conf.
> > >
> > >
> > > +static void *pass1_readahead(void *p)
> > > +{
> > > + struct pass1ra_ctx *c = p;
> > > + errcode_t err;
> > > +
> > > + ext2fs_readahead(c->fs, EXT2_READA_ITABLE, c->group, c->ngroups);
> > > + return NULL;
> > > +}
> > > +
> > > +static errcode_t initiate_readahead(e2fsck_t ctx, dgrp_t group, dgrp_t ngroups)
> > > +{
> > > + struct pass1ra_ctx *ractx;
> > > + errcode_t err;
> > > +
> > > + err = ext2fs_get_mem(sizeof(*ractx), &ractx);
> > > + if (err)
> > > + return err;
> > > +
> > > + ractx->fs = ctx->fs;
> > > + ractx->group = group;
> > > + ractx->ngroups = ngroups;
> > > +
> > > + err = e2fsck_run_thread(&ctx->ra_thread, pass1_readahead,
> > > + pass1_readahead_cleanup, ractx);
> > > + if (err)
> > > + ext2fs_free_mem(&ractx);
> > > +
> > > + return err;
> > > +}
> > > +
> > > void e2fsck_pass1(e2fsck_t ctx)
> > > {
> > > int i;
> > > @@ -611,10 +654,37 @@ void e2fsck_pass1(e2fsck_t ctx)
> > > int busted_fs_time = 0;
> > > int inode_size;
> > > int failed_csum = 0;
> > > + dgrp_t grp;
> > > + ext2_ino_t ra_threshold = 0;
> > > + dgrp_t ra_groups = 0;
> > > + errcode_t err;
> > >
> > > init_resource_track(&rtrack, ctx->fs->io);
> > > clear_problem_context(&pctx);
> > >
> > > + /* If we can do readahead, figure out how many groups to pull in. */
> > > + if (!ext2fs_can_readahead(ctx->fs))
> > > + ctx->readahead_mem_kb = 0;
> > > + if (ctx->readahead_mem_kb) {
> > > + ra_groups = ctx->readahead_mem_kb /
> > > + (fs->inode_blocks_per_group * fs->blocksize /
> > > + 1024);
> > > + if (ra_groups < 16)
> > > + ra_groups = 0;
> >
> > It probably always makes sense to prefetch one group if possible?
>
> I was intending to skip pass1 RA if there wasn't a lot of memory around. Not
> that I did a lot of work to figure out if < 16 groups really was a "lowmem"
> situation.
>
> > > + else if (ra_groups > fs->group_desc_count)
> > > + ra_groups = fs->group_desc_count;
> > > + if (ra_groups) {
> > > + err = initiate_readahead(ctx, grp, ra_groups);
> >
> > Looks like "grp" is used uninitialized here. Should be "grp = 0" to start.
>
> Oops, good catch.
>
> > > + if (err) {
> > > + com_err(ctx->program_name, err, "%s",
> > > + _("while starting pass1 readahead"));
> > > + ra_groups = 0;
> > > + }
> > > + ra_threshold = ra_groups *
> > > + fs->super->s_inodes_per_group;
> >
> > This is the threshold of the last inode to be prefetched.
>
> Yes.
>
> > > + }
> > > + }
> > > +
> > > if (!(ctx->options & E2F_OPT_PREEN))
> > > fix_problem(ctx, PR_1_PASS_HEADER, &pctx);
> > >
> > > @@ -778,6 +848,19 @@ void e2fsck_pass1(e2fsck_t ctx)
> > > if (e2fsck_mmp_update(fs))
> > > fatal_error(ctx, 0);
> > > }
> > > + if (ra_groups && ino > ra_threshold) {
> >
> > This doesn't start prefetching again until the last inode is checked.
> > It probably makes sense to have a sliding window to start readahead
> > again once half of the memory has been consumed or so. Otherwise,
> > the scanning will block here until the next inode table is read from
> > disk, instead of the readahead being started earlier and it is in RAM.
>
> You're right, it would be even faster if ra_threshold were to start RA a couple
> of block groups *before* we run out of prefetched data.
>
> > > + grp = (ino - 1) / fs->super->s_inodes_per_group;
> > > + ra_threshold = (grp + ra_groups) *
> > > + fs->super->s_inodes_per_group;
> >
> > > + err = initiate_readahead(ctx, grp, ra_groups);
> > > + if (err == EAGAIN) {
> > > + printf("Disabling slow readahead.\n");
> > > + ra_groups = 0;
> >
> > I see that EAGAIN comes from e2fsck_run_thread(), if there is still a
> > readahead thread running. Does it make sense to stop readahead in
> > that case? It would seem to me that if readahead is taking a long
> > time and the inode processing is catching up to it (i.e. IO bound)
> > then it is even more important to do readahead in that case.
>
> This is tricky -- POSIX_FADV_WILLNEED starts a non-blocking readahead, so there
> really isn't any good way to tell if the inode checker has caught up to RA.
> Here I'm interpreting "RA thread still running" as a warning that soon the
> inode checker will be ahead of the RA, so we might as well stop the RA.
> However, there still isn't really much good way to find out exactly where RA
> is.
>
> > Something like the following to readahead half of the inode tables once
> > half of them have been processed, and shrink the readahead window if the
> > readahead is being called too often:
>
> Hmm. I will give this a shot and report back; this seems like it ought to
> produce a better result than "two before" as I suggested above.
>
> > if (ra_groups != 0 && ino > ra_threshold - (ra_groups + 1) / 2 *
> > fs->super->s_inodes_per_group) {
> > if (ra_threshold < ino)
> > ra_threshold = ino;
> > grp = (ra_threshold -1) / fs->super->s_inodes_per_group;
> > err = initiate_readahead(ctx, grp, (ra_groups + 1) / 2);
> > if (err == EAGAIN)
> > ra_groups = (ra_groups + 1) / 2;
> > else if (err)
> > com_err(ctx->program_name, err, "%s",
> > _("while starting pass1 readahead"));
> > else
> > ra_threshold += (ra_groups + 1) / 2 *
> > fs->super->s_inodes_per_group;
> > }
Now that I've thought about this a little harder, even this isn't quite
sufficient -- since the inode scan skips inode_uninit blockgroups, we have to
figure out which group our new ra_threshold inode is in and scan backwards
through the groups until we find a bg that isn't inode_uninit. If we don't do
this, the scan will skip right past our ra_threshold, which means that RA
starts late or possibly even after we've started scanning inodes from the group
we're RAing.
That said, even doing that I don't see much more of a speed up.
--D
> >
> > > + } else if (err) {
> > > + com_err(ctx->program_name, err, "%s",
> > > + _("while starting pass1 readahead"));
> > > + }
> > > + }
> > > old_op = ehandler_operation(_("getting next inode from scan"));
> > > pctx.errcode = ext2fs_get_next_inode_full(scan, &ino,
> > > inode, inode_size);
> > > diff --git a/e2fsck/unix.c b/e2fsck/unix.c
> > > index 80ebdb1..d6ef8c5 100644
> > > --- a/e2fsck/unix.c
> > > +++ b/e2fsck/unix.c
> > > @@ -74,7 +74,7 @@ static void usage(e2fsck_t ctx)
> > > _("Usage: %s [-panyrcdfvtDFV] [-b superblock] [-B blocksize]\n"
> > > "\t\t[-I inode_buffer_blocks] [-P process_inode_size]\n"
> > > "\t\t[-l|-L bad_blocks_file] [-C fd] [-j external_journal]\n"
> > > - "\t\t[-E extended-options] device\n"),
> > > + "\t\t[-E extended-options] [-R readahead_kb] device\n"),
> >
> > Note that "-R" is only recently deprecated for raid options, why not make
> > this an option under "-E"?
>
> Ok.
>
> --D
> >
> > > ctx->program_name);
> > >
> > > fprintf(stderr, "%s", _("\nEmergency help:\n"
> > > @@ -90,6 +90,7 @@ static void usage(e2fsck_t ctx)
> > > " -j external_journal Set location of the external journal\n"
> > > " -l bad_blocks_file Add to badblocks list\n"
> > > " -L bad_blocks_file Set badblocks list\n"
> > > + " -R readahead_kb Allow this much readahead.\n"
> >
> >
> > Cheers, Andreas
> >
> >
> >
> >
> >
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2014-03-18 6:50 UTC|newest]
Thread overview: 88+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-03-11 6:53 [PATCH 00/49] e2fsprogs patchbomb 3/14 Darrick J. Wong
2014-03-11 6:54 ` [PATCH 01/49] create_inode: clean up return mess in do_write_internal Darrick J. Wong
2014-03-11 20:30 ` Andreas Dilger
2014-03-11 20:41 ` Darrick J. Wong
2014-03-11 21:08 ` Theodore Ts'o
2014-03-12 3:24 ` Theodore Ts'o
2014-03-11 6:54 ` [PATCH 02/49] create_inode: minor cleanups Darrick J. Wong
2014-03-11 20:31 ` Andreas Dilger
2014-03-12 3:25 ` Theodore Ts'o
2014-03-12 3:27 ` Theodore Ts'o
2014-03-11 6:54 ` [PATCH 03/49] create_inode: whitespace fixes Darrick J. Wong
2014-03-12 3:27 ` Theodore Ts'o
2014-03-11 6:54 ` [PATCH 04/49] create_inode: move debugfs internal state back to debugfs Darrick J. Wong
2014-03-12 3:31 ` Theodore Ts'o
2014-03-11 6:54 ` [PATCH 05/49] create_inode: handle hard link inum mappings per populate_fs invocation Darrick J. Wong
2014-03-12 3:46 ` Theodore Ts'o
2014-03-11 6:54 ` [PATCH 06/49] libext2fs: support modifying arbitrary extended attributes (v5) Darrick J. Wong
2014-03-12 3:51 ` Theodore Ts'o
2014-03-11 6:54 ` [PATCH 07/49] debugfs: create commands to edit extended attributes Darrick J. Wong
2014-03-12 3:51 ` Theodore Ts'o
2014-03-11 6:54 ` [PATCH 08/49] e2fsck: don't rehash inline directories Darrick J. Wong
2014-03-13 3:52 ` Theodore Ts'o
2014-03-13 5:38 ` Darrick J. Wong
2014-03-13 12:13 ` Theodore Ts'o
2014-03-11 6:54 ` [PATCH 09/49] libext2fs: don't fail when doing a strict rewrite of inline data Darrick J. Wong
2014-03-14 13:19 ` Theodore Ts'o
2014-03-11 6:55 ` [PATCH 10/49] libext2fs: fix iblocks correctly when expanding an inline_data file Darrick J. Wong
2014-03-12 16:38 ` Andreas Dilger
2014-03-12 17:01 ` Darrick J. Wong
2014-03-14 13:25 ` Theodore Ts'o
2014-03-11 6:55 ` [PATCH 11/49] e2fsck: zero errcode when checking inline data blocks Darrick J. Wong
2014-03-14 13:26 ` Theodore Ts'o
2014-03-11 6:55 ` [PATCH 12/49] libext2fs: during inlinedata expand, don't corrupt inode Darrick J. Wong
2014-03-14 13:29 ` Theodore Ts'o
2014-03-11 6:55 ` [PATCH 13/49] libext2fs: repair side effects when iterating dirents in inline dirs Darrick J. Wong
2014-03-14 13:30 ` Theodore Ts'o
2014-03-11 6:55 ` [PATCH 14/49] resize2fs: add inline dirs for remapping Darrick J. Wong
2014-03-14 13:31 ` Theodore Ts'o
2014-03-11 6:55 ` [PATCH 15/49] all: Introduce cppcheck static checking for make C=1 Darrick J. Wong
2014-03-14 13:33 ` Theodore Ts'o
2014-03-11 6:55 ` [PATCH 16/49] misc: cppcheck cleanups Darrick J. Wong
2014-03-14 13:34 ` Theodore Ts'o
2014-03-11 6:55 ` [PATCH 17/49] libext2fs: fix 64bit overflow in ext2fs_block_alloc_stats_range Darrick J. Wong
2014-03-14 13:35 ` Theodore Ts'o
2014-03-11 6:55 ` [PATCH 18/49] misc: fix header complaints and resource leaks in e2fsprogs Darrick J. Wong
2014-03-14 13:39 ` Theodore Ts'o
2014-03-14 13:53 ` Theodore Ts'o
2014-03-14 19:23 ` Darrick J. Wong
2014-03-11 6:55 ` [PATCH 19/49] libext2fs: fix memory leak when drastically shrinking extent tree depth Darrick J. Wong
2014-03-14 13:56 ` Theodore Ts'o
2014-03-11 6:56 ` [PATCH 20/49] libext2fs: fix parents when modifying extents Darrick J. Wong
2014-03-14 14:01 ` Theodore Ts'o
2014-03-14 20:13 ` Darrick J. Wong
2014-03-15 15:46 ` Theodore Ts'o
2014-03-17 16:59 ` Darrick J. Wong
2014-03-11 6:56 ` [PATCH 21/49] e2fsck: print runs of duplicate blocks instead of all of them Darrick J. Wong
2014-03-15 16:19 ` Theodore Ts'o
2014-03-11 6:56 ` [PATCH 22/49] e2fsck: verify checksums after checking everything else Darrick J. Wong
2014-03-11 6:56 ` [PATCH 23/49] e2fsck: fix the extended attribute checksum error message Darrick J. Wong
2014-03-11 6:56 ` [PATCH 24/49] e2fsck: insert a missing dirent tail for checksums if possible Darrick J. Wong
2014-03-11 6:56 ` [PATCH 25/49] e2fsck: write dir blocks after new inode when reconstructing root/lost+found Darrick J. Wong
2014-03-11 6:56 ` [PATCH 26/49] tests: add test for corrupted checksummed root directory block Darrick J. Wong
2014-03-11 6:56 ` [PATCH 27/49] dumpe2fs: add switch to disable checksum verification Darrick J. Wong
2014-03-11 6:56 ` [PATCH 28/49] mke2fs: set block_validity as a default mount option Darrick J. Wong
2014-03-11 6:57 ` [PATCH 29/49] libext2fs: support allocating uninit blocks in bmap2() Darrick J. Wong
2014-03-11 6:57 ` [PATCH 30/49] libext2fs: file IO routines should handle uninit blocks Darrick J. Wong
2014-03-11 6:57 ` [PATCH 31/49] resize2fs: convert fs to and from 64bit mode Darrick J. Wong
2014-03-11 6:57 ` [PATCH 32/49] resize2fs: when toggling 64bit, don't free in-use bg data clusters Darrick J. Wong
2014-03-11 6:57 ` [PATCH 33/49] resize2fs: adjust reserved_gdt_blocks when changing group descriptor size Darrick J. Wong
2014-03-11 6:57 ` [PATCH 34/49] libext2fs: have UNIX IO manager use pread/pwrite Darrick J. Wong
2014-03-11 6:57 ` [PATCH 35/49] ext2fs: add readahead method to improve scanning Darrick J. Wong
2014-03-17 22:07 ` Andreas Dilger
2014-03-11 6:57 ` [PATCH 36/49] libext2fs: allow clients to read-ahead metadata Darrick J. Wong
2014-03-17 23:11 ` Andreas Dilger
2014-03-11 6:57 ` [PATCH 37/49] e2fsck: read-ahead metadata during passes 1, 2, and 4 Darrick J. Wong
2014-03-17 23:10 ` Andreas Dilger
2014-03-18 4:42 ` Darrick J. Wong
2014-03-18 6:50 ` Darrick J. Wong [this message]
2014-03-11 6:58 ` [PATCH 38/49] libext2fs: when appending to a file, don't split an index block in equal halves Darrick J. Wong
2014-03-11 6:58 ` [PATCH 39/49] libext2fs: find inode goal when allocating blocks Darrick J. Wong
2014-03-11 6:58 ` [PATCH 40/49] libext2fs: find a range of empty blocks Darrick J. Wong
2014-03-11 6:58 ` [PATCH 41/49] libext2fs: provide a function to set inode size Darrick J. Wong
2014-03-11 6:58 ` [PATCH 42/49] libext2fs: implement fallocate Darrick J. Wong
2014-03-11 6:58 ` [PATCH 44/49] fuse2fs: translate ACL structures Darrick J. Wong
2014-03-11 6:58 ` [PATCH 45/49] fuse2fs: handle 64-bit dates correctly Darrick J. Wong
2014-03-11 6:58 ` [PATCH 46/49] fuse2fs: implement fallocate Darrick J. Wong
2014-03-11 6:59 ` [PATCH 48/49] tests: enable using fuse2fs with metadata checksum test Darrick J. Wong
2014-03-11 6:59 ` [PATCH 49/49] tests: test date handling Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140318065001.GI9070@birch.djwong.org \
--to=darrick.wong@oracle.com \
--cc=adilger@dilger.ca \
--cc=linux-ext4@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).