From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 4/7] repair: parallelise phase 6
Date: Wed, 21 Oct 2020 23:11:00 -0700 [thread overview]
Message-ID: <20201022061100.GP9832@magnolia> (raw)
In-Reply-To: <20201022051537.2286402-5-david@fromorbit.com>
On Thu, Oct 22, 2020 at 04:15:34PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> A recent metadump provided to us caused repair to take hours in
> phase6. It wasn't IO bound - it was fully CPU bound the entire time.
> The only way to speed it up is to make phase 6 run multiple
> concurrent processing threads.
>
> The obvious way to do this is to spread the concurrency across AGs,
> like the other phases, and while this works it is not optimal. When
> a processing thread hits a really large directory, it essentially
> sits CPU bound until that directory is processed. If an AG has lots
> of large directories, we end up with a really long single-threaded
> tail that limits concurrency.
>
> Hence we also need to have concurrency /within/ the AG. This is
> relatively easy, as the inode chunk records allow for a simple
> concurrency mechanism within an AG. We can simply feed each chunk
> record to a workqueue, and we get concurrency within the AG for
> free. However, this allows prefetch to run way ahead of processing
> and this blows out the buffer cache size and can cause OOM.
>
> However, we can use the new workqueue depth limiting to limit the
> number of inode chunks queued, and this then backs up the inode
> prefetching to its maximum queue depth.
I'm interested in (some day) hooking up xfs_scrub to max_queued, since
it has the same concurrency problem when one of the AGs has a number of
hugely fragmented files.
> Hence we prevent the prefetch code from queueing the entire AG's
> inode chunks on the workqueue and blowing out memory, by throttling
> the prefetch consumer.
>
> This takes phase 6 from taking many, many hours down to:
>
> Phase 6: 10/30 21:12:58 10/30 21:40:48 27 minutes, 50 seconds
>
> And burning 20-30 cpus that entire time on my test rig.
Yay!
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> repair/phase6.c | 43 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 35 insertions(+), 8 deletions(-)
>
> diff --git a/repair/phase6.c b/repair/phase6.c
> index 70d32089bb57..bf0719c186fb 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -6,6 +6,7 @@
>
> #include "libxfs.h"
> #include "threads.h"
> +#include "threads.h"
> #include "prefetch.h"
> #include "avl.h"
> #include "globals.h"
> @@ -3109,20 +3110,45 @@ check_for_orphaned_inodes(
> }
>
> static void
> -traverse_function(
> +do_dir_inode(
> struct workqueue *wq,
> - xfs_agnumber_t agno,
> + xfs_agnumber_t agno,
> void *arg)
> {
> - ino_tree_node_t *irec;
> + struct ino_tree_node *irec = arg;
> int i;
> +
> + for (i = 0; i < XFS_INODES_PER_CHUNK; i++) {
> + if (inode_isadir(irec, i))
> + process_dir_inode(wq->wq_ctx, agno, irec, i);
> + }
> +}
> +
> +static void
> +traverse_function(
> + struct workqueue *wq,
> + xfs_agnumber_t agno,
> + void *arg)
> +{
> + struct ino_tree_node *irec;
> prefetch_args_t *pf_args = arg;
> + struct workqueue lwq;
> + struct xfs_mount *mp = wq->wq_ctx;
> +
>
> wait_for_inode_prefetch(pf_args);
>
> if (verbose)
> do_log(_(" - agno = %d\n"), agno);
>
> + /*
> + * The more AGs we have in flight at once, the fewer processing threads
> + * per AG. This means we don't overwhelm the machine with hundreds of
> + * threads when we start acting on lots of AGs at once. We just want
> + * enough that we can keep multiple CPUs busy across multiple AGs.
> + */
> + workqueue_create_bound(&lwq, mp, ag_stride, 1000);
Eeeeee, magic number! :)
/me tosses in obligatory hand-wringing about 2000 CPU systems running
out of work. How about ag_stride * 50 or something? :P
(Aside from that this all looks ok to me)
--D
> +
> for (irec = findfirst_inode_rec(agno); irec; irec = next_ino_rec(irec)) {
> if (irec->ino_isa_dir == 0)
> continue;
> @@ -3130,18 +3156,19 @@ traverse_function(
> if (pf_args) {
> sem_post(&pf_args->ra_count);
> #ifdef XR_PF_TRACE
> + {
> + int i;
> sem_getvalue(&pf_args->ra_count, &i);
> pftrace(
> "processing inode chunk %p in AG %d (sem count = %d)",
> irec, agno, i);
> + }
> #endif
> }
>
> - for (i = 0; i < XFS_INODES_PER_CHUNK; i++) {
> - if (inode_isadir(irec, i))
> - process_dir_inode(wq->wq_ctx, agno, irec, i);
> - }
> + queue_work(&lwq, do_dir_inode, agno, irec);
> }
> + destroy_work_queue(&lwq);
> cleanup_inode_prefetch(pf_args);
> }
>
> @@ -3169,7 +3196,7 @@ static void
> traverse_ags(
> struct xfs_mount *mp)
> {
> - do_inode_prefetch(mp, 0, traverse_function, false, true);
> + do_inode_prefetch(mp, ag_stride, traverse_function, false, true);
> }
>
> void
> --
> 2.28.0
>