Date: Tue, 28 Jan 2025 10:34:06 +1100
From: Dave Chinner
To: Amir Goldstein
Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik, Jeff Layton
Subject: Re: [LSF/MM/BPF TOPIC] vfs write barriers
X-Mailing-List: linux-fsdevel@vger.kernel.org

On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner wrote:
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
>
> That's a good question. A bit hard to explain but I will try.
>
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
>
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded as
> metadata in the same filesystem, prior to the modification and
> those metadata records are strictly ordered by the filesystem before
> the actual change.

Uh, ok.

I've read the docco and I think I understand what the prototype
you've pointed me at is doing.

It is using a separate chunk of the filesystem as a database to
persist change records for data in the filesystem. It is doing this
by creating an empty(?) file per change record in a per time
period (T) directory instance.

i.e.

write()
 -> pre-modify
  -> fanotify
   -> userspace HSM
    -> create file in dir T named ""

And then you're relying on the filesystem to make that directory
entry T/ stable before the data the pre-modify record was generated
for ever gets written.

IOWs, you're specifically relying on *all unrelated metadata changes
in the filesystem* having strict global ordering *and* being
persisted before any data written after the metadata was created
is persisted.

Sure, this might work right now on XFS because the journalling
implementation -currently- provides global metadata ordering and
data/metadata ordering based on IO completion to submission
ordering.

However, we do not guarantee that XFS will -always- have this
behaviour. This is an *implementation detail*, not a guaranteed
behaviour we will preserve for all time. i.e. we reserve the right
to change how we do unrelated metadata and data/metadata ordering
internally.

This reminds me of how applications observed that ext3 ordered mode
didn't require fsync to guarantee the data got written before the
metadata, so they elided the fsync() because it was really expensive
on ext3. i.e. they started relying on a specific filesystem
implementation detail for "correct crash consistency behaviour",
without understanding that it -only worked on ext3- and gave broken
crash consistency behaviour on all other filesystems.
That was *bad*, and it took a long time to get the message across
that applications *must* use fsync() for correct crash consistency
behaviour...

What you are describing for your prototype HSM to provide crash
consistent change tracking really seems to me like it is reliant
on the side effects of specific filesystem implementation choices,
not a behaviour that all filesystems guarantee.

i.e. not all filesystems provide strict global metadata ordering
semantics, and some fs maintainers are on record explicitly stating
that they will not provide or guarantee them. e.g. ext4, especially
with fast commits enabled, will not provide global strictly ordered
metadata semantics. btrfs doesn't provide such a guarantee, either.

> I would love to discuss the merits and pitfalls of this method, but the
> main thing I wanted to get feedback on is whether anyone finds the
> described vfs API useful for anything other than the change tracking
> system that I described.

If my understanding is correct, then this HSM prototype change
tracking mechanism seems like a fragile, unsupportable architecture.
I don't think we should be trying to add new VFS infrastructure to
make it work, because I think the underlying behaviours it requires
from filesystems are simply not guaranteed to exist.

-Dave.
-- 
Dave Chinner
david@fromorbit.com
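
As a minimal sketch of the fsync() rule described above: a userspace
tool that wants a change record to be durable before the data it
describes is modified would have to say so explicitly, by calling
fsync() on both the record file and its parent directory, rather than
rely on a filesystem's internal ordering. This assumes a POSIX system.
The helper name persist_change_record and the record name used below
are hypothetical and are not taken from the HSM prototype.

```python
import os
import tempfile


def persist_change_record(dir_path: str, name: str) -> str:
    """Create an empty record file and make it durable before returning.

    Hypothetical helper for illustration only; not the prototype's code.
    """
    record_path = os.path.join(dir_path, name)

    # Create the (empty) record file and flush it to stable storage.
    fd = os.open(record_path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

    # fsync the parent directory as well, so that the new directory
    # entry itself is persisted, not just the file it points to.
    dir_fd = os.open(dir_path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    return record_path


if __name__ == "__main__":
    # A temporary directory stands in for the per time period directory T.
    with tempfile.TemporaryDirectory() as period_dir:
        path = persist_change_record(period_dir, "record-0001")
        # Only after this returns would the data modification proceed.
        print(os.path.exists(path))
```

The two explicit fsync() calls are what would turn "the record is
stable before the data is written" from a side effect of one
filesystem's implementation into something the application asked for.
The cost is a synchronous flush per record.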