From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 53F7BC43219 for ; Fri, 3 May 2019 14:18:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 239FA2075E for ; Fri, 3 May 2019 14:18:25 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="RZGJY1dm" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728115AbfECOSY (ORCPT ); Fri, 3 May 2019 10:18:24 -0400 Received: from mail-yw1-f67.google.com ([209.85.161.67]:44529 "EHLO mail-yw1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728099AbfECOSY (ORCPT ); Fri, 3 May 2019 10:18:24 -0400 Received: by mail-yw1-f67.google.com with SMTP id j4so4378994ywk.11 for ; Fri, 03 May 2019 07:18:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=A4XtQ9BE6S/+vFeJxumqTjxJbaQPwqJr03SgLnlZUfE=; b=RZGJY1dmnCmXGVjbiwtEyKAK0Tw0jsq6wSLoBAzzjbLWQgxXVSJYgZWt6c279Eoa85 LIrwwpmaV/Om3qxIVnwZSIhqjrZCGAt/9IOo97L7VygddHyr7p+v0fXkvGlDi3PAuQkk Mddc9sX67afmDzJ7QNjos71OPFIfnJU6VnVhXw0AURKsxYr1i2Z9zC5UdBB//kz/07Y9 f2MPumzc18SzLVf1L104+HV3tmxl7sk3YOfvqPJGp5UNwHeBrwcflK+BPO+G8dW3sqgY 2Pe4JIhymDgahAjpu0pLC1uN0HnxPG9Eud8W2CChog5VOcUWQdbWPFu9B+6LWCM52Zpr dEIA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=A4XtQ9BE6S/+vFeJxumqTjxJbaQPwqJr03SgLnlZUfE=; b=IBJEXrK9AkpkGsIE3OgwCiactk+HaostndjR8uTSGokPyNDBxUG71XF3qB5YGHSUNP 4qEQKTMEjOGHl0nSyxIajsG2XtL/dF+nk15xJeo5WRF5iW3lZd7v+LGQ2v/TFbSRDOvF 8JVoMUMdNA3gX9kuIhXT1PcbmTCJYXjDTi1WPPBJGb71b9Iger1vEHlCAXcugi5tuAHG 9vpzr5draVAjO2yopVugfCjYvCPa0oj93/AO4OmQ8GaScsPpNLqjgs0jM0miVwiBMX44 KWpHJjkSWa+RHw/7pp5GcyGvrsGz3s2AAR6cAj4CVBPmQJUsT2IiLBAes8WK7cbz2h0k RlLA== X-Gm-Message-State: APjAAAXR3RwUGk+CvIOyF1P1PRKRaaSax2U6RCvvO5PFhRGW/QAgtPf0 m2RN98UEHe8wM3D5vDTqB83zZW5fhk1yGd8RUI8= X-Google-Smtp-Source: APXvYqxKnAUJZFTnI0T3Ay4fZk8u3wC7XSpyb3jxa5YhMVc7ztm0FobY6eT/szRKIQcwh1v6c4jNkseBft1/5OGROks= X-Received: by 2002:a25:952:: with SMTP id u18mr7914528ybm.397.1556893102987; Fri, 03 May 2019 07:18:22 -0700 (PDT) MIME-Version: 1.0 References: <20190503023043.GB23724@mit.edu> <20190503095846.GE23724@mit.edu> In-Reply-To: <20190503095846.GE23724@mit.edu> From: Amir Goldstein Date: Fri, 3 May 2019 10:18:11 -0400 Message-ID: Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract To: "Theodore Ts'o" Cc: Vijay Chidambaram , lsf-pc@lists.linux-foundation.org, Dave Chinner , "Darrick J. Wong" , Jan Kara , linux-fsdevel , Jayashree Mohan , Filipe Manana , Chris Mason , lwn@lwn.net Content-Type: text/plain; charset="UTF-8" Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Fri, May 3, 2019 at 5:59 AM Theodore Ts'o wrote: > > On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote: > > OK. we can leave that one for later. > > Although I am not sure what the concern is. > > If we are able to agree and document a LINK_ATOMIC flag, > > what would be the down side of documenting a RENAME_ATOMIC > > flag with same semantics? After all, as I said, this is what many users > > already expect when renaming a temp file (as ext4 heuristics prove). > > The problem is if the "temp file" has been hardlinked to 1000 > different directories, does the rename() have to guarantee that we > have to make sure that the changes to all 1000 directories have been > persisted to disk? And all of the parent directories of those 1000 > directories have also *all* been persisted to disk, all the way up to > the root? > > With the O_TMPFILE linkat case, we know that inode hasn't been > hard-linked to any other directory, and mercifully directories have > only one parent directory, so we only have to check one set of > directory inodes all the way up to the root having been persisted. > > But.... I can already imagine someone complaining that if due to bind > mounts and 1000 mount namespaces, there is some *other* directory > pathname which could be used to reach said "tmpfile", we have to > guarantee that all parent directories which could be used to reach > said "tmpfile" even if they span a dozen different file systems, > *also* have to be persisted due to sloppy drafting of what the > atomicity rules might happen to be. > > If we are only guaranteeing the persistence of the containing > directories of the source and destination files, that's pretty easy. > But then the consistency rules need to *explicitly* state this. Some > of the handwaving definitions of what would be guaranteed.... scare > me. > I see. So the issue is with the language: "metadata modifications made to the file before being linked" that may be interpreted that hardlinking a file is making a modification to the file. I can't help myself writing the pun "nlink doesn't count". Tough one. We can include more exclusive language, but that is not going to aid the goal of a simple documented API. OK, I'll withdraw RENAME_ATOMIC for now and concede to having LINK_ATOMIC fail when trying to link and nlink > 0. How about if I implement RENAME_ATOMIC for in-kernel users only at this point in time? Overlayfs needs it for correctness of directory copy up operation. > > P.S. If we were going to do this, we'd probably want to simply define > a flag to be AT_FSYNC, using the strict POSIX definition of fsync, > which is to say, as a result of the linkat or renameat, the file in > question, and its associated metadata, are guaranteed to be persisted > to disk. No other guarantees about any other inode's metadata > regardless of when they might be made, would be guaranteed. > I agree that may be useful. Not to my use case though. > If people really want "global barrier" semantics, then perhaps it > would be better to simply define a barrierfs(2) system call that works > like syncfs(2) --- it applies to the whole file system, and guarantees > that all changes made after barrierfs(2) will be visible if any > changes made *after* barrierfs(2) are visible. Amir, you used "global > ordering" a few times; if you really need that, let's define a new > system call which guarantees that. Maybe some of the research > proposals for exotic changes to SSD semantics, etc., would allow > barrierfs(2) semantics to be something that we could implement more > efficiently than syncfs(2). But let's make this be explicit, as > opposed to some magic guarantee that falls out as a side effect of the > fsync(2) system call to a single inode. Yes, maybe. For xfs/ext4. Not sure about btrfs. Seems like fbarrier(2) would have been more natural for btrfs model (file and all its dependencies). I think barrierfs(2) would be useful, but I think it is harder to explain to users. See barrierfs() should not flush all inode pages that would be counter productive, so what does it really mean to end users? We would end up with the same problem of misunderstood sync_file_range(). I would have been happy with this API: sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE_AND_WAIT); barrierfs(fd); rename(...)/link(...) Perhaps atomic_rename()/atomic_link() should be library functions wrapping the lower level API to hide those details from end users. Thanks, Amir.