Date: Mon, 24 Jan 2022 18:29:18 -0500
From: Brian Foster
To: Dave Chinner
Cc: linux-xfs@vger.kernel.org, Al Viro , Ian Kent , rcu@vger.kernel.org
Subject: Re: [PATCH] xfs: require an rcu grace period before inode recycle
Message-ID:
References: <20220121142454.1994916-1-bfoster@redhat.com> <20220124220853.GN59729@dread.disaster.area>
In-Reply-To: <20220124220853.GN59729@dread.disaster.area>
X-Mailing-List: linux-xfs@vger.kernel.org

On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote:
> On Mon, Jan 24, 2022 at 10:02:27AM -0500, Brian Foster wrote:
> > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote:
> > > The XFS inode allocation algorithm aggressively reuses recently freed inodes. This is historical behavior that has been in place for quite some time, since XFS was imported to mainline Linux. Once the VFS adopted RCU-walk path lookups (also some time ago), this behavior became slightly incompatible because the inode recycle path doesn't isolate concurrent access to the inode from the VFS.
> > >
> > > This has recently manifested as problems in the VFS when XFS happens to change the type or properties of a recently unlinked inode while it is still involved in an RCU lookup. For example, if the VFS refers to a previous incarnation of a symlink inode, obtains the ->get_link() callback from inode_operations, and the latter happens to change to a non-symlink type via a recycle event, the ->get_link() callback pointer is reset to NULL and the lookup results in a crash.
> > >
> > > To avoid this class of problem, isolate in-core inodes for recycling with an RCU grace period. This is the same level of protection the VFS expects for inactivated inodes that are never reused, and so guarantees no further concurrent access before the type or properties of the inode change. We don't want an unconditional synchronize_rcu() here because that would impose a significant performance penalty on mixed inode allocation workloads.
> > >
> > > Fortunately, we can take advantage of the recently added deferred inactivation mechanism to mitigate the need for an RCU wait in most cases. Deferred inactivation queues and batches the on-disk freeing of recently destroyed inodes, and so significantly increases the likelihood that a grace period has elapsed by the time an inode is freed and observable by the allocation code as a reuse candidate. Capture the current RCU grace period cookie at inode destroy time and refer to it at allocation time to conditionally wait for an RCU grace period if one hadn't expired in the meantime. Since only unlinked inodes are recycle candidates and unlinked inodes always require inactivation, we only need to poll and assign RCU state in the inactivation codepath. Slightly adjust struct xfs_inode to fit the new field into padding holes that conveniently preexist in the same cacheline as the deferred inactivation list.
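
(For illustration only, the mechanism described above boils down to something like the following sketch built on the kernel's polled grace-period API; the i_destroy_gp field and the example_* helpers are hypothetical names for this sketch, not the actual patch:)

	/*
	 * Hypothetical sketch: record a grace period cookie when the inode
	 * is queued for inactivation, and wait for that grace period (only
	 * if it hasn't already expired) before the inode can be recycled.
	 */
	struct xfs_inode {
		/* ... existing fields ... */
		unsigned long	i_destroy_gp;	/* RCU grace period cookie */
	};

	/* destroy/inactivation side: poll (and start, if needed) a GP */
	static void example_inodegc_queue(struct xfs_inode *ip)
	{
		ip->i_destroy_gp = start_poll_synchronize_rcu();
		/* ... add the inode to the deferred inactivation queue ... */
	}

	/* recycle side: only stall if that grace period has not elapsed */
	static void example_iget_recycle(struct xfs_inode *ip)
	{
		if (ip->i_destroy_gp)
			cond_synchronize_rcu(ip->i_destroy_gp);
		/* ... now safe to reinitialise the inode for reuse ... */
	}

cond_synchronize_rcu() only blocks when the grace period recorded at destroy time has not yet completed, which is what keeps the common (already-expired) case cheap.
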
> > > Finally, note that the ideal long term solution here is to rearchitect bits of XFS' internal inode lifecycle management such that this additional stall point is not required, but that requires more thought, time and work to address. This approach restores functional correctness in the meantime.
> > >
> > > Signed-off-by: Brian Foster
> > > ---
> > >
> > > Hi all,
> > >
> > > Here's the RCU fixup patch for inode reuse that I've been playing with, re: the vfs patch discussion [1]. I've put it in pretty much the most basic form, but I think there are a couple of aspects worth thinking about:
> > >
> > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. get_state_synchronize_rcu()). The former is a bit more active than the latter in that it triggers the start of a grace period, when necessary. It is currently invoked per inode, which is the ideal frequency in theory, but this could be reduced, associated with the xfs_inodegc thresholds in some manner, etc., if there is good reason to do that.
> > >
> > > 2. The rcu cookie lifecycle. This variant updates it on inactivation queueing and nowhere else because the RCU docs imply that counter rollover is not a significant problem. In practice, I think this means that if an inode is stamped at least once, and the counter rolls over, future (non-inactivation, non-unlinked) eviction -> repopulation cycles could trigger rcu syncs. I think this would require repeated eviction/reinstantiation cycles within a small window to be noticeable, so I'm not sure how likely it is to occur. We could be more defensive by resetting or refreshing the cookie. E.g., refresh (or reset to zero) at recycle time, unconditionally refresh at destroy time (using get_state_synchronize_rcu() for non-inactivation), etc.
> > >
> > > Otherwise testing is ongoing, but this version at least survives an fstests regression run.
> > >
> >
> > FYI, I modified my repeated alloc/free test to do some batching and formed it into something better able to measure the potential side effect / cost of the grace period sync. The test is a single-threaded file alloc/free loop using a variable per-iteration batch size. The test runs for ~60s and reports how many total files were allocated/freed in that period with the specified batch size. Note that this particular test ran without any background workload. Results are as follows:
> >
> > files     baseline    test
> >
> > 1         38480       38437
> > 4         126055      111080
> > 8         218299      134469
> > 16        306619      141968
> > 32        397909      152267
> > 64        418603      200875
> > 128       469077      289365
> > 256       684117      566016
> > 512       931328      878933
> > 1024      1126741     1118891
>
> Can you post the test code, because 38,000 alloc/unlinks in 60s is extremely slow for a single tight open-unlink-close loop. I'd be expecting at least ~10,000 alloc/unlink iterations per second, not 650/second.
>

Hm, Ok. My test was just a bash script doing a 'touch ; rm ' loop. I know there was application overhead because if I tweaked the script to open an fd directly rather than use touch, the single-file performance jumped up a bit, but the difference seemed to wash away as I increased the file count, so I kept running it with larger sizes. This seems off, so I'll port it over to C code and see how much the numbers change.
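
For reference, a C port of that loop might look roughly like the sketch below; the batching scheme, file naming and fixed ~60s runtime are illustrative assumptions for this sketch, not the actual test code:

	/* Illustrative batched alloc/free loop, not the actual test. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		char path[1024];
		long total = 0;
		time_t end;
		int batch, i;

		if (argc < 3) {
			fprintf(stderr, "usage: %s batchsize dir\n", argv[0]);
			return 1;
		}
		batch = atoi(argv[1]);		/* files per batch */
		end = time(NULL) + 60;		/* run for ~60s */

		while (time(NULL) < end) {
			/* allocate a batch of files... */
			for (i = 0; i < batch; i++) {
				int fd;

				snprintf(path, sizeof(path), "%s/f%d", argv[2], i);
				fd = open(path, O_CREAT | O_RDWR, 0644);
				if (fd < 0) {
					perror("open");
					exit(1);
				}
				close(fd);
			}
			/* ...then unlink the whole batch */
			for (i = 0; i < batch; i++) {
				snprintf(path, sizeof(path), "%s/f%d", argv[2], i);
				unlink(path);
			}
			total += batch;
		}
		printf("batch %d: %ld files allocated/freed in ~60s\n",
		       batch, total);
		return 0;
	}

Sweeping the batch size argument across the values in the table above would reproduce the shape of that test without the shell overhead.
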
> A quick test here with a "batch size == 1" main loop on a vanilla
> 5.17-rc1 kernel:
>
>	for (i = 0; i < iters; i++) {
>		int fd = open(file, O_CREAT|O_RDWR, 0777);
>
>		if (fd < 0) {
>			perror("open");
>			exit(1);
>		}
>		unlink(file);
>		close(fd);
>	}
>
> $ time ./open-unlink 10000 /mnt/scratch/blah
>
> real	0m0.962s
> user	0m0.022s
> sys	0m0.775s
>
> Shows pretty much 10,000 alloc/unlinks a second without any specific
> batching on my slow machine. And my "fast" machine (3yr old 2.1GHz
> Xeons):
>
> $ time sudo ./open-unlink 40000 /mnt/scratch/foo
>
> real	0m0.958s
> user	0m0.033s
> sys	0m0.770s
>
> Runs single loop iterations at 40,000 alloc/unlink iterations per second.
>
> So I'm either not understanding the test you are running and/or the kernel/patches that you are comparing here. Is the "baseline" just a vanilla, unmodified upstream kernel, or something else?
>

Yeah, the baseline was just the XFS for-next branch.

> > That's just a test of a quick hack, however. Since there is no real urgency to inactivate an unlinked inode (it has no potential users until it's freed),
>
> On the contrary, there is extreme urgency to inactivate inodes quickly.
>

Ok, I think we're talking about slightly different things. What I mean above is that if a task removes a file and goes off doing unrelated $work, that inode will just sit on the percpu queue indefinitely. That's fine, as there's no functional need for us to process it immediately unless we're around -ENOSPC thresholds or some such that demand reclaim of the inode.

It sounds like what you're talking about is specifically the behavior/performance of sustained file removal (which is important, obviously), where apparently there is a notable degradation if the queues become deep enough to push the inode batches out of CPU cache. So that makes sense...

> Darrick made the original assumption that we could delay inactivation indefinitely and so he allowed really deep queues of up to 64k deferred inactivations. But with queues this deep, we could never get that background inactivation code to perform anywhere near the original synchronous background inactivation code. e.g. I measured 60-70% performance degradations on my scalability tests, and nothing stood out in the profiles until I started looking at CPU data cache misses.
>

... but could you elaborate on the scalability tests involved here so I can get a better sense of it in practice and perhaps observe the impact of changes in this path?

Brian

> What we found was that if we don't run the background inactivation while the inodes are still hot in the CPU cache, the cost of bringing the inodes back into the CPU cache at a later time is extremely expensive and cannot be avoided. That's where all the performance was lost, and that is exactly what the current per-cpu background inactivation implementation avoids. i.e. we have shallow queues, early throttling and CPU affinity to ensure that the inodes are processed before they are evicted from the CPU caches and to ensure we don't take a performance hit.
>
> IOWs, the deferred inactivation queues are designed to minimise inactivation delay, generally trying to delay inactivation for a couple of milliseconds at most during typical fast-path inactivations (i.e. an extent or two per inode needing to be freed, plus maybe the inode itself). Such inactivations generally take 50-100us of CPU time each to process, and we try to keep the inactivation batch size down to 32 inodes...
>
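
As a purely illustrative aside, the shallow, CPU-affine queueing described there amounts to something like the sketch below; the structure, helper and threshold names are invented for illustration and do not match the real XFS inodegc code:

	/* Illustrative only -- not the real XFS structures or thresholds. */
	#include <linux/llist.h>
	#include <linux/percpu.h>
	#include <linux/smp.h>
	#include <linux/workqueue.h>

	#define EXAMPLE_GC_BATCH	32	/* keep batches small and cache-hot */

	struct example_inode {
		struct llist_node	i_gclist;
		/* ... */
	};

	struct example_gc_pcpu {
		struct llist_head	list;	/* inodes awaiting inactivation */
		struct work_struct	work;	/* worker bound to this CPU */
		unsigned int		items;	/* drained/reset by the worker */
	};

	static DEFINE_PER_CPU(struct example_gc_pcpu, example_gc);

	static void example_gc_queue(struct example_inode *ip)
	{
		struct example_gc_pcpu *gc = &get_cpu_var(example_gc);

		llist_add(&ip->i_gclist, &gc->list);
		/*
		 * Kick the worker on this CPU once a small batch has built
		 * up, so inodes are processed while still hot in this CPU's
		 * cache rather than accumulating in a deep, cache-cold queue.
		 */
		if (++gc->items >= EXAMPLE_GC_BATCH)
			queue_work_on(smp_processor_id(), system_wq, &gc->work);
		put_cpu_var(example_gc);
	}
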
> > I suspect that result can be further optimized to absorb the cost of an rcu delay by deferring the steps that make the inode available for reallocation in the first place.
>
> A typical RCU grace period delay is longer than the latency we require to keep the inodes hot in cache for efficient background inactivation. We can't move the "we need to RCU delay inactivation" overhead into the background inactivation code without taking a global hit to filesystem performance due to the CPU cache thrashing it will introduce....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>