From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 4/7] repair: parallelise phase 6
Date: Wed, 21 Oct 2020 23:11:00 -0700 [thread overview]
Message-ID: <20201022061100.GP9832@magnolia> (raw)
In-Reply-To: <20201022051537.2286402-5-david@fromorbit.com>
On Thu, Oct 22, 2020 at 04:15:34PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> A recent metadump provided to us caused repair to take hours in
> phase6. It wasn't IO bound - it was fully CPU bound the entire time.
> The only way to speed it up is to make phase 6 run multiple
> concurrent processing threads.
>
> The obvious way to do this is to spread the concurrency across AGs,
> like the other phases, and while this works it is not optimal. When
> a processing thread hits a really large directory, it essentially
> sits CPU bound until that directory is processed. IF an AG has lots
> of large directories, we end up with a really long single threaded
> tail that limits concurrency.
>
> Hence we also need to have concurrency /within/ the AG. This is
> realtively easy, as the inode chunk records allow for a simple
> concurrency mechanism within an AG. We can simply feed each chunk
> record to a workqueue, and we get concurrency within the AG for
> free. However, this allows prefetch to run way ahead of processing
> and this blows out the buffer cache size and can cause OOM.
>
> However, we can use the new workqueue depth limiting to limit the
> number of inode chunks queued, and this then backs up the inode
> prefetching to it's maximum queue depth.
I'm interested in (some day) hooking up xfs_scrub to max_queued, since
it has the same concurrency problem when one of the AGs has a number of
hugely fragmented files.
> Hence we prevent having the
> prefetch code queue the entire AG's inode chunks on the workqueue
> blowing out memory by throttling the prefetch consumer.
>
> This takes phase 6 from taking many, many hours down to:
>
> Phase 6: 10/30 21:12:58 10/30 21:40:48 27 minutes, 50 seconds
>
> And burning 20-30 cpus that entire time on my test rig.
Yay!
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> repair/phase6.c | 43 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 35 insertions(+), 8 deletions(-)
>
> diff --git a/repair/phase6.c b/repair/phase6.c
> index 70d32089bb57..bf0719c186fb 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -6,6 +6,7 @@
>
> #include "libxfs.h"
> #include "threads.h"
> +#include "threads.h"
> #include "prefetch.h"
> #include "avl.h"
> #include "globals.h"
> @@ -3109,20 +3110,45 @@ check_for_orphaned_inodes(
> }
>
> static void
> -traverse_function(
> +do_dir_inode(
> struct workqueue *wq,
> - xfs_agnumber_t agno,
> + xfs_agnumber_t agno,
> void *arg)
> {
> - ino_tree_node_t *irec;
> + struct ino_tree_node *irec = arg;
> int i;
> +
> + for (i = 0; i < XFS_INODES_PER_CHUNK; i++) {
> + if (inode_isadir(irec, i))
> + process_dir_inode(wq->wq_ctx, agno, irec, i);
> + }
> +}
> +
> +static void
> +traverse_function(
> + struct workqueue *wq,
> + xfs_agnumber_t agno,
> + void *arg)
> +{
> + struct ino_tree_node *irec;
> prefetch_args_t *pf_args = arg;
> + struct workqueue lwq;
> + struct xfs_mount *mp = wq->wq_ctx;
> +
>
> wait_for_inode_prefetch(pf_args);
>
> if (verbose)
> do_log(_(" - agno = %d\n"), agno);
>
> + /*
> + * The more AGs we have in flight at once, the fewer processing threads
> + * per AG. This means we don't overwhelm the machine with hundreds of
> + * threads when we start acting on lots of AGs at once. We just want
> + * enough that we can keep multiple CPUs busy across multiple AGs.
> + */
> + workqueue_create_bound(&lwq, mp, ag_stride, 1000);
Eeeeee, magic number! :)
/me tosses in obligatory hand-wringing about 2000 CPU systems running
out of work. How about ag_stride * 50 or something? :P
(Aside from that this all looks ok to me)
--D
> +
> for (irec = findfirst_inode_rec(agno); irec; irec = next_ino_rec(irec)) {
> if (irec->ino_isa_dir == 0)
> continue;
> @@ -3130,18 +3156,19 @@ traverse_function(
> if (pf_args) {
> sem_post(&pf_args->ra_count);
> #ifdef XR_PF_TRACE
> + {
> + int i;
> sem_getvalue(&pf_args->ra_count, &i);
> pftrace(
> "processing inode chunk %p in AG %d (sem count = %d)",
> irec, agno, i);
> + }
> #endif
> }
>
> - for (i = 0; i < XFS_INODES_PER_CHUNK; i++) {
> - if (inode_isadir(irec, i))
> - process_dir_inode(wq->wq_ctx, agno, irec, i);
> - }
> + queue_work(&lwq, do_dir_inode, agno, irec);
> }
> + destroy_work_queue(&lwq);
> cleanup_inode_prefetch(pf_args);
> }
>
> @@ -3169,7 +3196,7 @@ static void
> traverse_ags(
> struct xfs_mount *mp)
> {
> - do_inode_prefetch(mp, 0, traverse_function, false, true);
> + do_inode_prefetch(mp, ag_stride, traverse_function, false, true);
> }
>
> void
> --
> 2.28.0
>
next prev parent reply other threads:[~2020-10-22 6:11 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-10-22 5:15 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
2020-10-22 5:15 ` [PATCH 1/7] workqueue: bound maximum queue depth Dave Chinner
2020-10-22 5:54 ` Darrick J. Wong
2020-10-22 8:11 ` Dave Chinner
2020-10-25 4:41 ` Darrick J. Wong
2020-10-26 22:29 ` Dave Chinner
2020-10-26 22:40 ` Darrick J. Wong
2020-10-26 22:57 ` Dave Chinner
2020-10-22 5:15 ` [PATCH 2/7] repair: Protect bad inode list with mutex Dave Chinner
2020-10-22 5:45 ` Darrick J. Wong
2020-10-29 9:35 ` Christoph Hellwig
2020-10-22 5:15 ` [PATCH 3/7] repair: protect inode chunk tree records with a mutex Dave Chinner
2020-10-22 6:02 ` Darrick J. Wong
2020-10-22 8:15 ` Dave Chinner
2020-10-29 16:45 ` Darrick J. Wong
2020-10-22 5:15 ` [PATCH 4/7] repair: parallelise phase 6 Dave Chinner
2020-10-22 6:11 ` Darrick J. Wong [this message]
2020-10-27 5:10 ` Dave Chinner
2020-10-29 17:20 ` Darrick J. Wong
2020-10-22 5:15 ` [PATCH 5/7] repair: don't duplicate names in " Dave Chinner
2020-10-22 6:21 ` Darrick J. Wong
2020-10-22 8:23 ` Dave Chinner
2020-10-22 15:53 ` Darrick J. Wong
2020-10-29 9:39 ` Christoph Hellwig
2020-10-22 5:15 ` [PATCH 6/7] repair: convert the dir byaddr hash to a radix tree Dave Chinner
2020-10-29 16:41 ` Darrick J. Wong
2020-10-22 5:15 ` [PATCH 7/7] repair: scale duplicate name checking in phase 6 Dave Chinner
2020-10-29 16:29 ` Darrick J. Wong
-- strict thread matches above, loose matches on Subject: below --
2021-03-19 1:33 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
2021-03-19 1:33 ` [PATCH 4/7] repair: parallelise phase 6 Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201022061100.GP9832@magnolia \
--to=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.