Date: Thu, 11 Jul 2024 21:55:02 +1000
From: Dave Chinner
To: NeilBrown
Cc: Christoph Hellwig, Mike Snitzer, linux-xfs@vger.kernel.org,
	Brian Foster, linux-nfs@vger.kernel.org
Subject: Re: [PATCH v2] xfs: enable WQ_MEM_RECLAIM on m_sync_workqueue
References: <> <172056677808.15471.5200774043985229799@noble.neil.brown.name>
In-Reply-To: <172056677808.15471.5200774043985229799@noble.neil.brown.name>
X-Mailing-List: linux-nfs@vger.kernel.org

On Wed, Jul 10, 2024 at 09:12:58AM +1000, NeilBrown wrote:
> On Thu, 04 Jul 2024, Christoph Hellwig wrote:
> > On Wed, Jul 03, 2024 at 09:29:00PM +1000, NeilBrown wrote:
> > > I know nothing of this stance.  Do you have a reference?
> >
> > No particular one.
> >
> > > I have put a modest amount of work into ensuring NFS to a server on the
> > > same machine works and last I checked it did - though I'm more
> > > confident of NFSv3 than NFSv4 because of the state manager thread.
> >
> > How do you propagate the NOFS flag (and NOIO for a loop device) to
> > the server and the workqueues run by the server and the file system
> > called by it?  How do you ensure WQ_MEM_RECLAIM gets propagated to
> > all workqueues that could be called by the file system on the
> > server (the problem kicking off this discussion)?
> >
>
> Do we need to propagate these?
>
> NOFS is for deadlock avoidance.
> A filesystem "backend" (Dave's term - I
> think for the parts of the fs that handle write-back) might allocate
> memory, that might block waiting for memory reclaim, memory reclaim
> might re-enter the filesystem backend and might block on a lock (or
> similar) held while allocating memory.  NOFS breaks that deadlock.
>
> The important thing here isn't the NOFS flag, it is breaking any
> possible deadlock.

NOFS doesn't "break" any deadlocks. It simply prevents recursion
from one filesystem context to another. We don't have to use
NOFS if recursion is safe and won't deadlock.

That is, it may be safe for a filesystem to use GFP_KERNEL
allocations in its writeback path. If the filesystem doesn't
implement ->writepage (like most of the major filesystems these
days) there is no way for memory reclaim to recurse back into the fs
writeback path. Hence GFP_NOFS is not needed in writeback context to
prevent reclaim recursion back into the filesystem writeback
path....

And the superblock shrinkers can't deadlock - they are non-blocking
and only act on unreferenced inodes. Hence any code that has a
locked inode is either evicting an unreferenced inode or holds a
reference to the inode. If we are doing an allocation with either of
those sorts of inodes locked, there is no way that memory reclaim
recursion can trip over the locked inode and deadlock.

So between the removal of ->writepage, non-blocking shrinkers, and
scoped NOIO context for loop devices, I'm not sure there are any
generic reclaim recursion paths that can actually deadlock.

i.e. GFP_NOFS really only needs to be used if the filesystem itself
cannot safely recurse back into itself.

> Layered filesystems introduce a new complexity.

Nothing new about layered filesystems - we've been doing this for
decades...

> The backend for one
> filesystem can call into the front end of another filesystem.
> That front-end is not required to use NOFS and even if we impose
> PF_MEMALLOC_NOFS, the front-end might wait for some work-queue action
> which doesn't inherit the NOFS flag.
>
> But this doesn't necessarily matter.  Calling into the filesystem is not
> the problem - blocking waiting for a reply is the problem.  It is
> blocking that creates deadlocks.  So if the backend of one filesystem
> queues to a separate thread the work for the front end of the other
> filesystem and doesn't wait for the work to complete, then a deadlock
> cannot be introduced.
>
> /dev/loop uses the loop%d workqueue for this.  loop-back NFS hands the
> front-end work over to nfsd.  The proposed localio implementation uses a
> nfslocaliod workqueue for exactly the same task.  These remove the
> possibility of deadlock and mean that there is no need to pass NOFS
> through to the front-end of the backing filesystem.

I think this logic is fundamentally flawed. Pushing IO submission to
a separate thread context which runs them in GFP_KERNEL context does
not help if the deadlock occurs during IO submission. With loop
devices, there's a "global" lock in the lower filesystem on the loop
device - the image file inode lock. The IO issued by the loop device
will -always- hit the same inode and the same inode locks. Hence if
we do memory allocation with an inode lock held exclusive in the
lower filesystem (e.g. a page cache folio for a buffered write), we
cannot allow memory reclaim during any allocation with the image
file inode locked to recurse into the upper filesystem. If the upper
filesystem then performs an operation that requires IO to be
submitted and completed to make progress then we have a deadlock
condition due to recursion from the lower to upper filesystem
regardless of the fact that the lower IO submission is run from a
different task.

Hence the loop device sets up the backing file mapping as:

	lo->lo_backing_file = file;
	lo->old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));

GFP_NOIO context. It also sets up worker task context as:

	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;

GFP_NOIO context. IOWs, all allocation in the IO submission path is
explicitly GFP_NOIO to prevent any sort of reclaim recursion into
filesystems or the block layer. That's the only sane thing to do,
because multi-filesystem deadlocks are an utter PITA to triage and
solve...

Keep in mind that PF_LOCAL_THROTTLE also prevents IO submission
deadlocks in the lower filesystem. If the lower filesystem IO
submission dirties pages (i.e. buffered writes) it can get throttled
on the dirty page threshold. If it gets throttled like this trying
to clean dirty pages from the upper filesystem we have a deadlock.
The localio submission task will need to prevent that deadlock, too.

IOWs, just moving IO submission to another thread does not avoid
the possibility of lower-to-upper filesystem recursion or lower
filesystem dirty page throttling deadlocks.

> Note that there is a separate question concerning pageout to a swap
> file.  pageout needs more than just deadlock avoidance.  It needs
> guaranteed progress in low memory conditions.  It needs PF_MEMALLOC (or
> mempools) and that cannot be finessed using work queues.  I don't think
> that Linux is able to support pageout through layered filesystems.

I don't think we ever want to go there.

-Dave.
-- 
Dave Chinner
david@fromorbit.com