linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Andreas Dilger <adilger@dilger.ca>
Cc: tytso@mit.edu, linux-ext4@vger.kernel.org
Subject: Re: [PATCH 37/49] e2fsck: read-ahead metadata during passes 1, 2, and 4
Date: Mon, 17 Mar 2014 21:42:31 -0700	[thread overview]
Message-ID: <20140318044231.GA20038@birch.djwong.org> (raw)
In-Reply-To: <8715537A-D22B-4CDC-B76B-75B62FDBC1D7@dilger.ca>

On Mon, Mar 17, 2014 at 05:10:22PM -0600, Andreas Dilger wrote:
> 
> On Mar 11, 2014, at 12:57 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > e2fsck pass1 is modified to use the block group data prefetch function
> > to try to fetch the inode tables into the pagecache before it is
> > needed.  In order to avoid cache thrashing, we limit ourselves to
> > prefetching at most half the available memory.
> 
> It looks like the prefetching is done in huge chunks, and not incrementally?
> It makes more sense to have a steady amount of prefetch happening instead
> of waiting for it to all be consumed before starting a new batch.  See in
> e2fsck_pass1() below.

I agree that prefetch ought not to wait until the entire inode table is
consumed.

> > pass2 is modified to use the dirblock prefetching function to prefetch
> > the list of directory blocks that are assembled in pass1.  So long as
> > we don't anticipate rehashing the dirs (pass 3a), we can release the
> > dirblocks as soon as we're done checking them.
> > 
> > pass4 is modified to prefetch the block and inode bitmaps in
> > anticipation of pass 5, because pass4 is entirely CPU bound.
> > 
> > In general, these mechanisms can halve fsck time, if the host system
> > has sufficient memory and the storage system can provide a lot of
> > IOPs.  SSDs and multi-spindle RAIDs see the most speedup; single disks
> > experience a modest speedup, and single-spindle USB mass storage
> > devices see hardly any benefit.
> > 
> > By default, readahead will try to fill half the physical memory in the
> > system.  The -R option can be given to specify the amount of memory to
> > use for readahead, or zero to disable it entirely; or an option can be
> > given in e2fsck.conf.
> > 
> > 
> > +static void *pass1_readahead(void *p)
> > +{
> > +	struct pass1ra_ctx *c = p;
> > +	errcode_t err;
> > +
> > +	ext2fs_readahead(c->fs, EXT2_READA_ITABLE, c->group, c->ngroups);
> > +	return NULL;
> > +}
> > +
> > +static errcode_t initiate_readahead(e2fsck_t ctx, dgrp_t group, dgrp_t ngroups)
> > +{
> > +	struct pass1ra_ctx *ractx;
> > +	errcode_t err;
> > +
> > +	err = ext2fs_get_mem(sizeof(*ractx), &ractx);
> > +	if (err)
> > +		return err;
> > +
> > +	ractx->fs = ctx->fs;
> > +	ractx->group = group;
> > +	ractx->ngroups = ngroups;
> > +
> > +	err = e2fsck_run_thread(&ctx->ra_thread, pass1_readahead,
> > +				pass1_readahead_cleanup, ractx);
> > +	if (err)
> > +		ext2fs_free_mem(&ractx);
> > +
> > +	return err;
> > +}
> > +
> >  void e2fsck_pass1(e2fsck_t ctx)
> >  {
> > 	int	i;
> > @@ -611,10 +654,37 @@ void e2fsck_pass1(e2fsck_t ctx)
> > 	int		busted_fs_time = 0;
> > 	int		inode_size;
> > 	int		failed_csum = 0;
> > +	dgrp_t		grp;
> > +	ext2_ino_t	ra_threshold = 0;
> > +	dgrp_t		ra_groups = 0;
> > +	errcode_t	err;
> > 
> > 	init_resource_track(&rtrack, ctx->fs->io);
> > 	clear_problem_context(&pctx);
> > 
> > +	/* If we can do readahead, figure out how many groups to pull in. */
> > +	if (!ext2fs_can_readahead(ctx->fs))
> > +		ctx->readahead_mem_kb = 0;
> > +	if (ctx->readahead_mem_kb) {
> > +		ra_groups = ctx->readahead_mem_kb /
> > +			    (fs->inode_blocks_per_group * fs->blocksize /
> > +			     1024);
> > +		if (ra_groups < 16)
> > +			ra_groups = 0;
> 
> It probably always makes sense to prefetch one group if possible?

I was intending to skip pass1 RA if there wasn't a lot of memory around.  Not
that I did a lot of work to figure out if < 16 groups really was a "lowmem"
situation.

> > +		else if (ra_groups > fs->group_desc_count)
> > +			ra_groups = fs->group_desc_count;
> > +		if (ra_groups) {
> > +			err = initiate_readahead(ctx, grp, ra_groups);
> 
> Looks like "grp" is used uninitialized here.  Should be "grp = 0" to start.

Oops, good catch.

> > +			if (err) {
> > +				com_err(ctx->program_name, err, "%s",
> > +					_("while starting pass1 readahead"));
> > +				ra_groups = 0;
> > +			}
> > +			ra_threshold = ra_groups *
> > +				       fs->super->s_inodes_per_group;
> 
> This is the threshold of the last inode to be prefetched.

Yes.

> > +		}
> > +	}
> > +
> > 	if (!(ctx->options & E2F_OPT_PREEN))
> > 		fix_problem(ctx, PR_1_PASS_HEADER, &pctx);
> > 
> > @@ -778,6 +848,19 @@ void e2fsck_pass1(e2fsck_t ctx)
> > 			if (e2fsck_mmp_update(fs))
> > 				fatal_error(ctx, 0);
> > 		}
> > +		if (ra_groups && ino > ra_threshold) {
> 
> This doesn't start prefetching again until the last inode is checked.
> It probably makes sense to have a sliding window to start readahead
> again once half of the memory has been consumed or so.  Otherwise,
> the scanning will block here until the next inode table is read from
> disk, instead of the readahead being started earlier and it is in RAM.

You're right, it would be even faster if ra_threshold were to start RA a couple
of block groups *before* we run out of prefetched data.

> > +			grp = (ino - 1) / fs->super->s_inodes_per_group;
> > +			ra_threshold = (grp + ra_groups) *
> > +				       fs->super->s_inodes_per_group;
> 
> > +			err = initiate_readahead(ctx, grp, ra_groups);
> > +			if (err == EAGAIN) {
> > +				printf("Disabling slow readahead.\n");
> > +				ra_groups = 0;
> 
> I see that EAGAIN comes from e2fsck_run_thread(), if there is still a
> readahead thread running.  Does it make sense to stop readahead in
> that case?  It would seem to me that if readahead is taking a long
> time and the inode processing is catching up to it (i.e. IO bound)
> then it is even more important to do readahead in that case.

This is tricky -- POSIX_FADV_WILLNEED starts a non-blocking readahead, so there
really isn't any good way to tell if the inode checker has caught up to RA.
Here I'm interpreting "RA thread still running" as a warning that soon the
inode checker will be ahead of the RA, so we might as well stop the RA.
However, there still isn't really much good way to find out exactly where RA
is.

> Something like the following to readahead half of the inode tables once
> half of them have been processed, and shrink the readahead window if the
> readahead is being called too often:

Hmm.  I will give this a shot and report back; this seems like it ought to
produce a better result than "two before" as I suggested above.

> 	if (ra_groups != 0 && ino > ra_threshold - (ra_groups + 1) / 2 *
> 					fs->super->s_inodes_per_group) {
>		if (ra_threshold < ino)
> 			ra_threshold = ino;
> 		grp = (ra_threshold -1) / fs->super->s_inodes_per_group;
> 		err = initiate_readahead(ctx, grp, (ra_groups + 1) / 2);
> 		if (err == EAGAIN)
> 			ra_groups = (ra_groups + 1) / 2;
> 		else if (err)
> 			com_err(ctx->program_name, err, "%s",
> 				_("while starting pass1 readahead"));
> 		else
> 			ra_threshold += (ra_groups + 1) / 2 *
> 				fs->super->s_inodes_per_group;
> 	}
> 
> > +			} else if (err) {
> > +				com_err(ctx->program_name, err, "%s",
> > +					_("while starting pass1 readahead"));
> > +			}
> > +		}
> > 		old_op = ehandler_operation(_("getting next inode from scan"));
> > 		pctx.errcode = ext2fs_get_next_inode_full(scan, &ino,
> > 							  inode, inode_size);
> > diff --git a/e2fsck/unix.c b/e2fsck/unix.c
> > index 80ebdb1..d6ef8c5 100644
> > --- a/e2fsck/unix.c
> > +++ b/e2fsck/unix.c
> > @@ -74,7 +74,7 @@ static void usage(e2fsck_t ctx)
> > 		_("Usage: %s [-panyrcdfvtDFV] [-b superblock] [-B blocksize]\n"
> > 		"\t\t[-I inode_buffer_blocks] [-P process_inode_size]\n"
> > 		"\t\t[-l|-L bad_blocks_file] [-C fd] [-j external_journal]\n"
> > -		"\t\t[-E extended-options] device\n"),
> > +		"\t\t[-E extended-options] [-R readahead_kb] device\n"),
> 
> Note that "-R" is only recently deprecated for raid options, why not make
> this an option under "-E"?

Ok.

--D
> 
> > 		ctx->program_name);
> > 
> > 	fprintf(stderr, "%s", _("\nEmergency help:\n"
> > @@ -90,6 +90,7 @@ static void usage(e2fsck_t ctx)
> > 		" -j external_journal  Set location of the external journal\n"
> > 		" -l bad_blocks_file   Add to badblocks list\n"
> > 		" -L bad_blocks_file   Set badblocks list\n"
> > +		" -R readahead_kb      Allow this much readahead.\n"
> 
> 
> Cheers, Andreas
> 
> 
> 
> 
> 



  reply	other threads:[~2014-03-18  4:42 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-11  6:53 [PATCH 00/49] e2fsprogs patchbomb 3/14 Darrick J. Wong
2014-03-11  6:54 ` [PATCH 01/49] create_inode: clean up return mess in do_write_internal Darrick J. Wong
2014-03-11 20:30   ` Andreas Dilger
2014-03-11 20:41     ` Darrick J. Wong
2014-03-11 21:08       ` Theodore Ts'o
2014-03-12  3:24         ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 02/49] create_inode: minor cleanups Darrick J. Wong
2014-03-11 20:31   ` Andreas Dilger
2014-03-12  3:25     ` Theodore Ts'o
2014-03-12  3:27     ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 03/49] create_inode: whitespace fixes Darrick J. Wong
2014-03-12  3:27   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 04/49] create_inode: move debugfs internal state back to debugfs Darrick J. Wong
2014-03-12  3:31   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 05/49] create_inode: handle hard link inum mappings per populate_fs invocation Darrick J. Wong
2014-03-12  3:46   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 06/49] libext2fs: support modifying arbitrary extended attributes (v5) Darrick J. Wong
2014-03-12  3:51   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 07/49] debugfs: create commands to edit extended attributes Darrick J. Wong
2014-03-12  3:51   ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 08/49] e2fsck: don't rehash inline directories Darrick J. Wong
2014-03-13  3:52   ` Theodore Ts'o
2014-03-13  5:38     ` Darrick J. Wong
2014-03-13 12:13       ` Theodore Ts'o
2014-03-11  6:54 ` [PATCH 09/49] libext2fs: don't fail when doing a strict rewrite of inline data Darrick J. Wong
2014-03-14 13:19   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 10/49] libext2fs: fix iblocks correctly when expanding an inline_data file Darrick J. Wong
2014-03-12 16:38   ` Andreas Dilger
2014-03-12 17:01     ` Darrick J. Wong
2014-03-14 13:25       ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 11/49] e2fsck: zero errcode when checking inline data blocks Darrick J. Wong
2014-03-14 13:26   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 12/49] libext2fs: during inlinedata expand, don't corrupt inode Darrick J. Wong
2014-03-14 13:29   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 13/49] libext2fs: repair side effects when iterating dirents in inline dirs Darrick J. Wong
2014-03-14 13:30   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 14/49] resize2fs: add inline dirs for remapping Darrick J. Wong
2014-03-14 13:31   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 15/49] all: Introduce cppcheck static checking for make C=1 Darrick J. Wong
2014-03-14 13:33   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 16/49] misc: cppcheck cleanups Darrick J. Wong
2014-03-14 13:34   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 17/49] libext2fs: fix 64bit overflow in ext2fs_block_alloc_stats_range Darrick J. Wong
2014-03-14 13:35   ` Theodore Ts'o
2014-03-11  6:55 ` [PATCH 18/49] misc: fix header complaints and resource leaks in e2fsprogs Darrick J. Wong
2014-03-14 13:39   ` Theodore Ts'o
2014-03-14 13:53   ` Theodore Ts'o
2014-03-14 19:23     ` Darrick J. Wong
2014-03-11  6:55 ` [PATCH 19/49] libext2fs: fix memory leak when drastically shrinking extent tree depth Darrick J. Wong
2014-03-14 13:56   ` Theodore Ts'o
2014-03-11  6:56 ` [PATCH 20/49] libext2fs: fix parents when modifying extents Darrick J. Wong
2014-03-14 14:01   ` Theodore Ts'o
2014-03-14 20:13     ` Darrick J. Wong
2014-03-15 15:46       ` Theodore Ts'o
2014-03-17 16:59         ` Darrick J. Wong
2014-03-11  6:56 ` [PATCH 21/49] e2fsck: print runs of duplicate blocks instead of all of them Darrick J. Wong
2014-03-15 16:19   ` Theodore Ts'o
2014-03-11  6:56 ` [PATCH 22/49] e2fsck: verify checksums after checking everything else Darrick J. Wong
2014-03-11  6:56 ` [PATCH 23/49] e2fsck: fix the extended attribute checksum error message Darrick J. Wong
2014-03-11  6:56 ` [PATCH 24/49] e2fsck: insert a missing dirent tail for checksums if possible Darrick J. Wong
2014-03-11  6:56 ` [PATCH 25/49] e2fsck: write dir blocks after new inode when reconstructing root/lost+found Darrick J. Wong
2014-03-11  6:56 ` [PATCH 26/49] tests: add test for corrupted checksummed root directory block Darrick J. Wong
2014-03-11  6:56 ` [PATCH 27/49] dumpe2fs: add switch to disable checksum verification Darrick J. Wong
2014-03-11  6:56 ` [PATCH 28/49] mke2fs: set block_validity as a default mount option Darrick J. Wong
2014-03-11  6:57 ` [PATCH 29/49] libext2fs: support allocating uninit blocks in bmap2() Darrick J. Wong
2014-03-11  6:57 ` [PATCH 30/49] libext2fs: file IO routines should handle uninit blocks Darrick J. Wong
2014-03-11  6:57 ` [PATCH 31/49] resize2fs: convert fs to and from 64bit mode Darrick J. Wong
2014-03-11  6:57 ` [PATCH 32/49] resize2fs: when toggling 64bit, don't free in-use bg data clusters Darrick J. Wong
2014-03-11  6:57 ` [PATCH 33/49] resize2fs: adjust reserved_gdt_blocks when changing group descriptor size Darrick J. Wong
2014-03-11  6:57 ` [PATCH 34/49] libext2fs: have UNIX IO manager use pread/pwrite Darrick J. Wong
2014-03-11  6:57 ` [PATCH 35/49] ext2fs: add readahead method to improve scanning Darrick J. Wong
2014-03-17 22:07   ` Andreas Dilger
2014-03-11  6:57 ` [PATCH 36/49] libext2fs: allow clients to read-ahead metadata Darrick J. Wong
2014-03-17 23:11   ` Andreas Dilger
2014-03-11  6:57 ` [PATCH 37/49] e2fsck: read-ahead metadata during passes 1, 2, and 4 Darrick J. Wong
2014-03-17 23:10   ` Andreas Dilger
2014-03-18  4:42     ` Darrick J. Wong [this message]
2014-03-18  6:50       ` Darrick J. Wong
2014-03-11  6:58 ` [PATCH 38/49] libext2fs: when appending to a file, don't split an index block in equal halves Darrick J. Wong
2014-03-11  6:58 ` [PATCH 39/49] libext2fs: find inode goal when allocating blocks Darrick J. Wong
2014-03-11  6:58 ` [PATCH 40/49] libext2fs: find a range of empty blocks Darrick J. Wong
2014-03-11  6:58 ` [PATCH 41/49] libext2fs: provide a function to set inode size Darrick J. Wong
2014-03-11  6:58 ` [PATCH 42/49] libext2fs: implement fallocate Darrick J. Wong
2014-03-11  6:58 ` [PATCH 44/49] fuse2fs: translate ACL structures Darrick J. Wong
2014-03-11  6:58 ` [PATCH 45/49] fuse2fs: handle 64-bit dates correctly Darrick J. Wong
2014-03-11  6:58 ` [PATCH 46/49] fuse2fs: implement fallocate Darrick J. Wong
2014-03-11  6:59 ` [PATCH 48/49] tests: enable using fuse2fs with metadata checksum test Darrick J. Wong
2014-03-11  6:59 ` [PATCH 49/49] tests: test date handling Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140318044231.GA20038@birch.djwong.org \
    --to=darrick.wong@oracle.com \
    --cc=adilger@dilger.ca \
    --cc=linux-ext4@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).