From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 42639C433F5 for ; Tue, 15 Feb 2022 01:54:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230257AbiBOBy1 (ORCPT ); Mon, 14 Feb 2022 20:54:27 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:39692 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230050AbiBOBy0 (ORCPT ); Mon, 14 Feb 2022 20:54:26 -0500 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 927D9140742 for ; Mon, 14 Feb 2022 17:54:17 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 1E4A66165D for ; Tue, 15 Feb 2022 01:54:17 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 69BB5C340E9; Tue, 15 Feb 2022 01:54:16 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1644890056; bh=+exkmfdsyxAkmlu+4WhLDbEAzpPB+iDXysugvlpmFLY=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=Go76PhD4nDOD8gczQ3q9B92O+7MIBNrVnXYopEvwlz1Yu79HhOd4k6FphZMjTLjb0 aHeRieFcokaVfdEp43HGA4Ek1kFwmzJd9QSGqBhvB/Myo3T/UDa5/cOTCNVQs9Ym3E 7NQPNbQjHmOcQ8Mpo6mkBHwOdyfPir3CnjU51CyRyQUWGQqOPFYc6F4zIY72jeuF98 c+StzrRlMwbqKGG0IyZEqxxVktjeTNfNMdUMgWH0eK8aSAD/5bQkbZq4GIz49HdYxD GTeiyeQF5FhZ7V1F7ludncnMVda6jojh/UPfmNjcCGX0eBVSjqkQWx4RJYHb35AL2Y ZhH8EZTRMrSrQ== Date: Mon, 14 Feb 2022 17:54:16 -0800 From: "Darrick J. Wong" To: Dave Chinner Cc: Brian Foster , linux-xfs@vger.kernel.org Subject: Re: [PATCH 2/2] xfs: run blockgc on freeze to avoid iget stalls after reclaim Message-ID: <20220215015416.GG8338@magnolia> References: <20220114213043.GB90423@magnolia> <20220118185647.GB13563@magnolia> <20220120003636.GF13563@magnolia> <20220202022240.GY59729@dread.disaster.area> <20220210230806.GO59729@dread.disaster.area> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220210230806.GO59729@dread.disaster.area> Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Fri, Feb 11, 2022 at 10:08:06AM +1100, Dave Chinner wrote: > On Thu, Feb 10, 2022 at 02:03:23PM -0500, Brian Foster wrote: > > On Wed, Feb 02, 2022 at 01:22:40PM +1100, Dave Chinner wrote: > > > On Mon, Jan 24, 2022 at 11:57:12AM -0500, Brian Foster wrote: > > > > On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote: > > > > > > Of course if you wanted to recycle inactive inodes or do something else > > > > > > entirely, then it's probably not worth going down this path.. > > > > > > > > > > I'm a bit partial to /trying/ to recycle inactive inodes because (a) > > > > > it's less tangling with -fsdevel for you and (b) inode scans in the > > > > > online repair patchset got a little weird once xfs_iget lost the ability > > > > > to recycle _NEEDS_INACTIVE inodes... > > > > > > > > > > OTOH I tried to figure out how to deal with the lockless list that those > > > > > inodes are put on, and I couldn't figure out how to get them off the > > > > > list safely, so that might be a dead end. If you have any ideas I'm all > > > > > ears. :) > > > > > > > > > > > > > So one of the things that I've been kind of unclear on about the current > > > > deferred inactivation implementation is the primary goal of the percpu > > > > optimization. I obviously can see the value of the percpu list in > > > > general, but how much processing needs to occur in percpu context to > > > > achieve the primary goal? > > > > > > > > For example, I can see how a single or small multi threaded sustained > > > > file removal might be batched efficiently, but what happens if said task > > > > happens to bounce around many cpus? > > > > > > In that case we have a scheduler problem, not a per-cpu > > > infrastructure issue. > > > > > > > Last I tested on my box, a single threaded rm -rf had executed on > > something like 24 of the 80 cpus available after about ~1m of runtime. > > Not surprising - the scheduler will select a sibling cores that > share caches when the previous CPU the task was running on is busy > running CPU affine tasks (i.e. the per-cpu inactive kworker thread). > > But how many CPUs it bounces the workload around over a long period > is not important. What is important is the cache residency between > the inodes when they are queued and when they are then inactivated. > That's measured in microseconds to single digit milliseconds (i.e. > within a single scheduling tick), and this is the metric that > matters the most. It doesn't matter if the first batch is unlinked > on CPU 1 and then inactived on CPU 1 while the scheduler moves the > unlink task to CPU 2 where it queues the next batch to be > inactivated on CPU 2, and so on. What matters is the data set in > each batch remains on the same CPU for inactivation processing. > > The tracing I've done indicates taht the majority of the time that > the scehduler bounces the tasks between two or three sibling CPUs > repeatedly. This occurs when the inactivation is keeping up with the > unlink queueing side. When we have lots of extents to remove in > inactivation, the amount of inactivation work is much greater than > the unlink() work, and so we see inactivation batch processing > taking longer and the CPU spread of the unlink task gets larger > because there are more CPUs running CPU-affine tasks doing > background inactivation. > > IOWs, having the number of CPUs the unlink task is scheduled on > grow over the long term is not at all unexpected - this is exactly > what we'd expect to see when we move towards async background > processing of complex operations... > > > Of course the inactivation work for an inode occurs on the cpu that the > > rm task was running on when the inode was destroyed, but I don't think > > there's any guarantee that kworker task will be the next task selected > > on the cpu if the system is loaded with other runnable tasks (and then > > whatever task is selected doesn't pollute the cache). > > The scheduler will select the next CPU based on phsyical machine > topology - core latencies, shared caches, numa distances, whether > the target CPU has affinity bound tasks already queued, etc. > > > For example, I > > also noticed rm- -> rm- context switching on concurrent > > remove tests. That is obviously not a caching issue in this case because > > both tasks are still doing remove work, but behavior is less > > deterministic of the target task happens to be something unrelated. Of > > course, that doesn't mean the approach might not otherwise be effective > > for the majority of common workloads.. > > As long as the context switch rate does not substantially increase, > having tasks migrate between sibling cores every so often isn't a > huge problem. > > > > That per-ag based back end processing is exactly what Darrick's > > > original proposals did: > > > > > > https://lore.kernel.org/linux-xfs/162758431072.332903.17159226037941080971.stgit@magnolia/ > > > > > > It used radix tree walks run in background kworker threads to batch > > > all the inode inactivation in a given AG via radix tree walks to > > > find them. > > > > > > It performed poorly at scale because it destroyed the CPU affinity > > > between the unlink() operation and the inactivation operation of the > > > unlinked inode when the last reference to the inode is dropped and > > > the inode evicted from task syscall exit context. REgardless of > > > whether there is a per-cpu front end or not, the per-ag based > > > processing will destroy the CPU affinity of the data set being > > > processed because we cannot guarantee that the per-ag objects are > > > all local to the CPU that is processing the per-ag objects.... > > > > > > > Ok. The role/significance of cpu caching was not as apparent to me when > > I had last replied to this thread. The discussion around the rcu inode > > lifecycle issue helped clear some of that up. > > > > That said, why not conditionally tag and divert to a background worker > > when the inodegc is disabled? That could allow NEEDS_INACTIVE inodes to > > be claimed/recycled from other contexts in scenarios like when the fs is > > frozen, since they won't be stuck in inaccessible and inactive percpu > > queues, but otherwise preserves current behavior in normal runtime > > conditions. Darrick mentioned online repair wanting to do something > > similar earlier, but it's not clear to me if scrub could or would want > > to disable the percpu inodegc workers in favor of a temporary/background > > mode while repair is running. I'm just guessing that performance is > > probably small enough of a concern in that situation that it wouldn't be > > a mitigating factor. Hm? > > WE probably could do this, but I'm not sure the complexity is > justified by the rarity of the problem it is trying to avoid. > Freezes are not long term, nor are they particularly common for > performance sensitive workloads. Hence I'm just not this corner case > is important enough to justify doing the work given that we've had > similar freeze-will-delay-some-stuff-indefinitely behaviour for a > long time... We /do/ have a few complaints lodged about hangcheck warnings when the filesystem has to be frozen for a very long time. It'd be nice to unblock the callers that want to grab a still-reachable inode, though. As for the online repair case that Brian mentioned, I've largely eliminated the need for scrub freezes by using jump labels, notifier chains, and an in-memory shadow btree to make it so that full fs scans can run in the background while the rest of the fs continues running. That'll be discussed ... soon, when I get to the point where I'm ready to start a full design review. (Hilariously, between that and better management of the DONTCACHE flag, I don't actually need deferred inactivation for repair anymore... :P) --D > > Dave. > -- > Dave Chinner > david@fromorbit.com