Re: [PATCH 37/49] e2fsck: read-ahead metadata during passes 1, 2, and 4

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Andreas Dilger <adilger@dilger.ca>
Cc: tytso@mit.edu, linux-ext4@vger.kernel.org
Subject: Re: [PATCH 37/49] e2fsck: read-ahead metadata during passes 1, 2, and 4
Date: Mon, 17 Mar 2014 23:50:01 -0700	[thread overview]
Message-ID: <20140318065001.GI9070@birch.djwong.org> (raw)
In-Reply-To: <20140318044231.GA20038@birch.djwong.org>

On Mon, Mar 17, 2014 at 09:42:31PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 17, 2014 at 05:10:22PM -0600, Andreas Dilger wrote:
> > 
> > On Mar 11, 2014, at 12:57 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > 
> > > e2fsck pass1 is modified to use the block group data prefetch function
> > > to try to fetch the inode tables into the pagecache before it is
> > > needed.  In order to avoid cache thrashing, we limit ourselves to
> > > prefetching at most half the available memory.
> > 
> > It looks like the prefetching is done in huge chunks, and not incrementally?
> > It makes more sense to have a steady amount of prefetch happening instead
> > of waiting for it to all be consumed before starting a new batch.  See in
> > e2fsck_pass1() below.
> 
> I agree that prefetch ought not to wait until the entire inode table is
> consumed.
> 
> > > pass2 is modified to use the dirblock prefetching function to prefetch
> > > the list of directory blocks that are assembled in pass1.  So long as
> > > we don't anticipate rehashing the dirs (pass 3a), we can release the
> > > dirblocks as soon as we're done checking them.
> > > 
> > > pass4 is modified to prefetch the block and inode bitmaps in
> > > anticipation of pass 5, because pass4 is entirely CPU bound.
> > > 
> > > In general, these mechanisms can halve fsck time, if the host system
> > > has sufficient memory and the storage system can provide a lot of
> > > IOPs.  SSDs and multi-spindle RAIDs see the most speedup; single disks
> > > experience a modest speedup, and single-spindle USB mass storage
> > > devices see hardly any benefit.
> > > 
> > > By default, readahead will try to fill half the physical memory in the
> > > system.  The -R option can be given to specify the amount of memory to
> > > use for readahead, or zero to disable it entirely; or an option can be
> > > given in e2fsck.conf.
> > > 
> > > 
> > > +static void *pass1_readahead(void *p)
> > > +{
> > > +	struct pass1ra_ctx *c = p;
> > > +	errcode_t err;
> > > +
> > > +	ext2fs_readahead(c->fs, EXT2_READA_ITABLE, c->group, c->ngroups);
> > > +	return NULL;
> > > +}
> > > +
> > > +static errcode_t initiate_readahead(e2fsck_t ctx, dgrp_t group, dgrp_t ngroups)
> > > +{
> > > +	struct pass1ra_ctx *ractx;
> > > +	errcode_t err;
> > > +
> > > +	err = ext2fs_get_mem(sizeof(*ractx), &ractx);
> > > +	if (err)
> > > +		return err;
> > > +
> > > +	ractx->fs = ctx->fs;
> > > +	ractx->group = group;
> > > +	ractx->ngroups = ngroups;
> > > +
> > > +	err = e2fsck_run_thread(&ctx->ra_thread, pass1_readahead,
> > > +				pass1_readahead_cleanup, ractx);
> > > +	if (err)
> > > +		ext2fs_free_mem(&ractx);
> > > +
> > > +	return err;
> > > +}
> > > +
> > >  void e2fsck_pass1(e2fsck_t ctx)
> > >  {
> > > 	int	i;
> > > @@ -611,10 +654,37 @@ void e2fsck_pass1(e2fsck_t ctx)
> > > 	int		busted_fs_time = 0;
> > > 	int		inode_size;
> > > 	int		failed_csum = 0;
> > > +	dgrp_t		grp;
> > > +	ext2_ino_t	ra_threshold = 0;
> > > +	dgrp_t		ra_groups = 0;
> > > +	errcode_t	err;
> > > 
> > > 	init_resource_track(&rtrack, ctx->fs->io);
> > > 	clear_problem_context(&pctx);
> > > 
> > > +	/* If we can do readahead, figure out how many groups to pull in. */
> > > +	if (!ext2fs_can_readahead(ctx->fs))
> > > +		ctx->readahead_mem_kb = 0;
> > > +	if (ctx->readahead_mem_kb) {
> > > +		ra_groups = ctx->readahead_mem_kb /
> > > +			    (fs->inode_blocks_per_group * fs->blocksize /
> > > +			     1024);
> > > +		if (ra_groups < 16)
> > > +			ra_groups = 0;
> > 
> > It probably always makes sense to prefetch one group if possible?
> 
> I was intending to skip pass1 RA if there wasn't a lot of memory around.  Not
> that I did a lot of work to figure out if < 16 groups really was a "lowmem"
> situation.
> 
> > > +		else if (ra_groups > fs->group_desc_count)
> > > +			ra_groups = fs->group_desc_count;
> > > +		if (ra_groups) {
> > > +			err = initiate_readahead(ctx, grp, ra_groups);
> > 
> > Looks like "grp" is used uninitialized here.  Should be "grp = 0" to start.
> 
> Oops, good catch.
> 
> > > +			if (err) {
> > > +				com_err(ctx->program_name, err, "%s",
> > > +					_("while starting pass1 readahead"));
> > > +				ra_groups = 0;
> > > +			}
> > > +			ra_threshold = ra_groups *
> > > +				       fs->super->s_inodes_per_group;
> > 
> > This is the threshold of the last inode to be prefetched.
> 
> Yes.
> 
> > > +		}
> > > +	}
> > > +
> > > 	if (!(ctx->options & E2F_OPT_PREEN))
> > > 		fix_problem(ctx, PR_1_PASS_HEADER, &pctx);
> > > 
> > > @@ -778,6 +848,19 @@ void e2fsck_pass1(e2fsck_t ctx)
> > > 			if (e2fsck_mmp_update(fs))
> > > 				fatal_error(ctx, 0);
> > > 		}
> > > +		if (ra_groups && ino > ra_threshold) {
> > 
> > This doesn't start prefetching again until the last inode is checked.
> > It probably makes sense to have a sliding window to start readahead
> > again once half of the memory has been consumed or so.  Otherwise,
> > the scanning will block here until the next inode table is read from
> > disk, instead of the readahead being started earlier and it is in RAM.
> 
> You're right, it would be even faster if ra_threshold were to start RA a couple
> of block groups *before* we run out of prefetched data.
> 
> > > +			grp = (ino - 1) / fs->super->s_inodes_per_group;
> > > +			ra_threshold = (grp + ra_groups) *
> > > +				       fs->super->s_inodes_per_group;
> > 
> > > +			err = initiate_readahead(ctx, grp, ra_groups);
> > > +			if (err == EAGAIN) {
> > > +				printf("Disabling slow readahead.\n");
> > > +				ra_groups = 0;
> > 
> > I see that EAGAIN comes from e2fsck_run_thread(), if there is still a
> > readahead thread running.  Does it make sense to stop readahead in
> > that case?  It would seem to me that if readahead is taking a long
> > time and the inode processing is catching up to it (i.e. IO bound)
> > then it is even more important to do readahead in that case.
> 
> This is tricky -- POSIX_FADV_WILLNEED starts a non-blocking readahead, so there
> really isn't any good way to tell if the inode checker has caught up to RA.
> Here I'm interpreting "RA thread still running" as a warning that soon the
> inode checker will be ahead of the RA, so we might as well stop the RA.
> However, there still isn't really much good way to find out exactly where RA
> is.
> 
> > Something like the following to readahead half of the inode tables once
> > half of them have been processed, and shrink the readahead window if the
> > readahead is being called too often:
> 
> Hmm.  I will give this a shot and report back; this seems like it ought to
> produce a better result than "two before" as I suggested above.
> 
> > 	if (ra_groups != 0 && ino > ra_threshold - (ra_groups + 1) / 2 *
> > 					fs->super->s_inodes_per_group) {
> >		if (ra_threshold < ino)
> > 			ra_threshold = ino;
> > 		grp = (ra_threshold -1) / fs->super->s_inodes_per_group;
> > 		err = initiate_readahead(ctx, grp, (ra_groups + 1) / 2);
> > 		if (err == EAGAIN)
> > 			ra_groups = (ra_groups + 1) / 2;
> > 		else if (err)
> > 			com_err(ctx->program_name, err, "%s",
> > 				_("while starting pass1 readahead"));
> > 		else
> > 			ra_threshold += (ra_groups + 1) / 2 *
> > 				fs->super->s_inodes_per_group;
> > 	}

Now that I've thought about this a little harder, even this isn't quite
sufficient -- since the inode scan skips inode_uninit blockgroups, we have to
figure out which group our new ra_threshold inode is in and scan backwards
through the groups until we find a bg that isn't inode_uninit.  If we don't do
this, the scan will skip right past our ra_threshold, which means that RA
starts late or possibly even after we've started scanning inodes from the group
we're RAing.

That said, even doing that I don't see much more of a speed up.

--D

> > 
> > > +			} else if (err) {
> > > +				com_err(ctx->program_name, err, "%s",
> > > +					_("while starting pass1 readahead"));
> > > +			}
> > > +		}
> > > 		old_op = ehandler_operation(_("getting next inode from scan"));
> > > 		pctx.errcode = ext2fs_get_next_inode_full(scan, &ino,
> > > 							  inode, inode_size);
> > > diff --git a/e2fsck/unix.c b/e2fsck/unix.c
> > > index 80ebdb1..d6ef8c5 100644
> > > --- a/e2fsck/unix.c
> > > +++ b/e2fsck/unix.c
> > > @@ -74,7 +74,7 @@ static void usage(e2fsck_t ctx)
> > > 		_("Usage: %s [-panyrcdfvtDFV] [-b superblock] [-B blocksize]\n"
> > > 		"\t\t[-I inode_buffer_blocks] [-P process_inode_size]\n"
> > > 		"\t\t[-l|-L bad_blocks_file] [-C fd] [-j external_journal]\n"
> > > -		"\t\t[-E extended-options] device\n"),
> > > +		"\t\t[-E extended-options] [-R readahead_kb] device\n"),
> > 
> > Note that "-R" is only recently deprecated for raid options, why not make
> > this an option under "-E"?
> 
> Ok.
> 
> --D
> > 
> > > 		ctx->program_name);
> > > 
> > > 	fprintf(stderr, "%s", _("\nEmergency help:\n"
> > > @@ -90,6 +90,7 @@ static void usage(e2fsck_t ctx)
> > > 		" -j external_journal  Set location of the external journal\n"
> > > 		" -l bad_blocks_file   Add to badblocks list\n"
> > > 		" -L bad_blocks_file   Set badblocks list\n"
> > > +		" -R readahead_kb      Allow this much readahead.\n"
> > 
> > 
> > Cheers, Andreas
> > 
> > 
> > 
> > 
> > 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2014-03-18  6:50 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-11  6:53 [PATCH 00/49] e2fsprogs patchbomb 3/14 Darrick J. Wong
2014-03-11  6:54 ` [PATCH 01/49] create_inode: clean up return mess in do_write_internal Darrick J. Wong
2014-03-11 20:30   ` Andreas Dilger
2014-03-11 20:41     ` Darrick J. Wong
2014-03-11 21:08       ` Theodore Ts'o
2014-03-12  3:24         ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 02/49] create_inode: minor cleanups Darrick J. Wong
2014-03-11 20:31   ` Andreas Dilger
2014-03-12  3:25     ` Theodore Ts'o
2014-03-12  3:27     ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 03/49] create_inode: whitespace fixes Darrick J. Wong
2014-03-12  3:27   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 04/49] create_inode: move debugfs internal state back to debugfs Darrick J. Wong
2014-03-12  3:31   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 05/49] create_inode: handle hard link inum mappings per populate_fs invocation Darrick J. Wong
2014-03-12  3:46   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 06/49] libext2fs: support modifying arbitrary extended attributes (v5) Darrick J. Wong
2014-03-12  3:51   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 07/49] debugfs: create commands to edit extended attributes Darrick J. Wong
2014-03-12  3:51   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 08/49] e2fsck: don't rehash inline directories Darrick J. Wong
2014-03-13  3:52   ` Theodore Ts'o
2014-03-13  5:38     ` Darrick J. Wong
2014-03-13 12:13       ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 09/49] libext2fs: don't fail when doing a strict rewrite of inline data Darrick J. Wong
2014-03-14 13:19   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 10/49] libext2fs: fix iblocks correctly when expanding an inline_data file Darrick J. Wong
2014-03-12 16:38   ` Andreas Dilger
2014-03-12 17:01     ` Darrick J. Wong
2014-03-14 13:25       ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 11/49] e2fsck: zero errcode when checking inline data blocks Darrick J. Wong
2014-03-14 13:26   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 12/49] libext2fs: during inlinedata expand, don't corrupt inode Darrick J. Wong
2014-03-14 13:29   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 13/49] libext2fs: repair side effects when iterating dirents in inline dirs Darrick J. Wong
2014-03-14 13:30   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 14/49] resize2fs: add inline dirs for remapping Darrick J. Wong
2014-03-14 13:31   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 15/49] all: Introduce cppcheck static checking for make C=1 Darrick J. Wong
2014-03-14 13:33   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 16/49] misc: cppcheck cleanups Darrick J. Wong
2014-03-14 13:34   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 17/49] libext2fs: fix 64bit overflow in ext2fs_block_alloc_stats_range Darrick J. Wong
2014-03-14 13:35   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 18/49] misc: fix header complaints and resource leaks in e2fsprogs Darrick J. Wong
2014-03-14 13:39   ` Theodore Ts'o
2014-03-14 13:53   ` Theodore Ts'o
2014-03-14 19:23     ` Darrick J. Wong
2014-03-11  6:55 ` [PATCH 19/49] libext2fs: fix memory leak when drastically shrinking extent tree depth Darrick J. Wong
2014-03-14 13:56   ` Theodore Ts'o
2014-03-11  6:56 ` [PATCH 20/49] libext2fs: fix parents when modifying extents Darrick J. Wong
2014-03-14 14:01   ` Theodore Ts'o
2014-03-14 20:13     ` Darrick J. Wong
2014-03-15 15:46       ` Theodore Ts'o
2014-03-17 16:59         ` Darrick J. Wong
2014-03-11  6:56 ` [PATCH 21/49] e2fsck: print runs of duplicate blocks instead of all of them Darrick J. Wong
2014-03-15 16:19   ` Theodore Ts'o
2014-03-11  6:56 ` [PATCH 22/49] e2fsck: verify checksums after checking everything else Darrick J. Wong
2014-03-11  6:56 ` [PATCH 23/49] e2fsck: fix the extended attribute checksum error message Darrick J. Wong
2014-03-11  6:56 ` [PATCH 24/49] e2fsck: insert a missing dirent tail for checksums if possible Darrick J. Wong
2014-03-11  6:56 ` [PATCH 25/49] e2fsck: write dir blocks after new inode when reconstructing root/lost+found Darrick J. Wong
2014-03-11  6:56 ` [PATCH 26/49] tests: add test for corrupted checksummed root directory block Darrick J. Wong
2014-03-11  6:56 ` [PATCH 27/49] dumpe2fs: add switch to disable checksum verification Darrick J. Wong
2014-03-11  6:56 ` [PATCH 28/49] mke2fs: set block_validity as a default mount option Darrick J. Wong
2014-03-11  6:57 ` [PATCH 29/49] libext2fs: support allocating uninit blocks in bmap2() Darrick J. Wong
2014-03-11  6:57 ` [PATCH 30/49] libext2fs: file IO routines should handle uninit blocks Darrick J. Wong
2014-03-11  6:57 ` [PATCH 31/49] resize2fs: convert fs to and from 64bit mode Darrick J. Wong
2014-03-11  6:57 ` [PATCH 32/49] resize2fs: when toggling 64bit, don't free in-use bg data clusters Darrick J. Wong
2014-03-11  6:57 ` [PATCH 33/49] resize2fs: adjust reserved_gdt_blocks when changing group descriptor size Darrick J. Wong
2014-03-11  6:57 ` [PATCH 34/49] libext2fs: have UNIX IO manager use pread/pwrite Darrick J. Wong
2014-03-11  6:57 ` [PATCH 35/49] ext2fs: add readahead method to improve scanning Darrick J. Wong
2014-03-17 22:07   ` Andreas Dilger
2014-03-11  6:57 ` [PATCH 36/49] libext2fs: allow clients to read-ahead metadata Darrick J. Wong
2014-03-17 23:11   ` Andreas Dilger
2014-03-11  6:57 ` [PATCH 37/49] e2fsck: read-ahead metadata during passes 1, 2, and 4 Darrick J. Wong
2014-03-17 23:10   ` Andreas Dilger
2014-03-18  4:42     ` Darrick J. Wong
2014-03-18  6:50       ` Darrick J. Wong [this message]
2014-03-11  6:58 ` [PATCH 38/49] libext2fs: when appending to a file, don't split an index block in equal halves Darrick J. Wong
2014-03-11  6:58 ` [PATCH 39/49] libext2fs: find inode goal when allocating blocks Darrick J. Wong
2014-03-11  6:58 ` [PATCH 40/49] libext2fs: find a range of empty blocks Darrick J. Wong
2014-03-11  6:58 ` [PATCH 41/49] libext2fs: provide a function to set inode size Darrick J. Wong
2014-03-11  6:58 ` [PATCH 42/49] libext2fs: implement fallocate Darrick J. Wong
2014-03-11  6:58 ` [PATCH 44/49] fuse2fs: translate ACL structures Darrick J. Wong
2014-03-11  6:58 ` [PATCH 45/49] fuse2fs: handle 64-bit dates correctly Darrick J. Wong
2014-03-11  6:58 ` [PATCH 46/49] fuse2fs: implement fallocate Darrick J. Wong
2014-03-11  6:59 ` [PATCH 48/49] tests: enable using fuse2fs with metadata checksum test Darrick J. Wong
2014-03-11  6:59 ` [PATCH 49/49] tests: test date handling Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140318065001.GI9070@birch.djwong.org \
    --to=darrick.wong@oracle.com \
    --cc=adilger@dilger.ca \
    --cc=linux-ext4@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).