From: Chao Yu <chao@kernel.org>
To: Jayashree Mohan <jayashree2912@gmail.com>,
Amir Goldstein <amir73il@gmail.com>
Cc: fstests <fstests@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
linux-doc@vger.kernel.org,
Vijaychidambaram Velayudhan Pillai <vijay@cs.utexas.edu>,
Dave Chinner <david@fromorbit.com>, Theodore Tso <tytso@mit.edu>,
Filipe Manana <fdmanana@gmail.com>,
linux-f2fs-devel@lists.sourceforge.net
Subject: Re: [PATCH] Documenting the crash-recovery guarantees of Linux file systems
Date: Tue, 12 Mar 2019 21:13:30 +0800
Message-ID: <c4ef60cb-0083-3dad-0f01-e71ff9b440f8@kernel.org>
In-Reply-To: <CA+EzBbD9S6JN861H+5HRBbh_uSfo=1bCR4-NvnFmD1N2qw2h7g@mail.gmail.com>
Hi Jayashree,
Sorry for the delay.
On 2019-3-8 2:51, Jayashree Mohan wrote:
> [cc : f2fs-dev]
> Thanks for the suggestions! Will incorporate these changes and send out a v2.
>
> We would also like to update the document to correctly reflect whether each file
> system is SOMC compliant. As of now, we only know for sure that xfs provides
> SOMC. Could developers of ext4, btrfs and F2FS comment whether your file system
> is SOMC compliant (or aims to be compliant)? @Theodore Ts'o <tytso@mit.edu>,
> @Chao Yu <chao@kernel.org>, @Filipe Manana <fdmanana@gmail.com>
>
> @Chao Yu <chao@kernel.org> We are also unsure about the fsync behaviour of
> F2FS. Is it just POSIX in the default mode, and SOMC if mounted with
> fsync_mode=strict?
Yes, that's the rule f2fs tries to keep. :)
Thanks,
>
> Thanks,
> Jayashree Mohan
>
>
>
> On Wed, Mar 6, 2019 at 3:14 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Mar 6, 2019 at 4:59 AM Jayashree <jaya@cs.utexas.edu> wrote:
> >
> > In this file, we document the crash-recovery guarantees
> > provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
> > present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
> > (SOMC), which is provided by xfs. It is not clear to us if other file systems
> > provide SOMC
>
> Nice work.
> You may add
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>
> Few nits below.
>
> > ; we would be happy to modify the document if file-system
> > developers claim that their system provides (or aims to provide) SOMC.
>
> This part belongs after the --- line
> IOW, it does not belong in the commit message.
>
> >
> > Signed-off-by: Jayashree Mohan <jaya@cs.utexas.edu>
> > ---
> > .../filesystems/crash-recovery-guarantees.txt | 173 +++++++++++++++++++++
> > 1 file changed, 173 insertions(+)
> > create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt
> >
> > diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt
> > new file mode 100644
> > index 0000000..4d1a9c6b
> > --- /dev/null
> > +++ b/Documentation/filesystems/crash-recovery-guarantees.txt
> > @@ -0,0 +1,173 @@
> > +=====================================================================
> > +File System Crash-Recovery Guarantees
> > +=====================================================================
> > +Linux file systems provide certain guarantees to user-space
> > +applications about what happens to their data if the system crashes
> > +(due to power loss or kernel panic). These are termed crash-recovery
> > +guarantees.
> > +
> > +Crash-recovery guarantees only pertain to data or metadata that has
> > +been explicitly persisted to storage with fsync(), fdatasync(), or
> > +sync() system calls. By default, write(), mkdir(), and other
> > +file-system related system calls only affect the in-memory state of
> > +the file system.
> > +
> > +The crash-recovery guarantees provided by most Linux file systems are
> > +significantly stronger than what is required by POSIX. POSIX is vague,
> > +even allowing fsync() to do nothing (Mac OSX takes advantage of
> > +this). However, the guarantees provided by file systems are not
> > +documented, and vary between file systems. This document seeks to
> > +describe the current crash-recovery guarantees provided by major Linux
> > +file systems.
> > +
> > +What does the fsync() operation guarantee?
> > +----------------------------------------------------
> > +The fsync() operation is meant to force the physical write of data
> > +corresponding to a file from the buffer cache, along with the file
> > +metadata. Note that the guarantees mentioned for each file system below
> > +are in addition to the ones provided by POSIX.
> > +
> > +POSIX
> > +-----
> > +fsync(file) : Flushes the data and metadata associated with the
> > +file. However, if the directory entry for the file has not been
> > +previously persisted, or has been modified, it is not guaranteed to be
> > +persisted by the fsync of the file [1]. What this means is, if a file
> > +is newly created, you will have to fsync(parent directory) in addition
> > +to fsync(file) in order to ensure that the file data has safely
> > +reached the disk.
>
> No. In order to ensure that the file's *directory entry* will persist.
> Throughout the doc, if you just say "file will persist" the meaning
> is ambiguous. "file data will persist" "file metadata will persist"
> and "file directory entry will persist" are three distinct
> outcomes.
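
To make the three outcomes concrete, here is a minimal Python sketch of the
POSIX-portable pattern: fsync the file for its data and metadata, then fsync
the parent directory for the directory entry. The helper name create_durably
is hypothetical and not taken from the patch; the os calls are thin wrappers
around the system calls of the same name.

```python
import os


def create_durably(path: str, data: bytes) -> None:
    """Create `path`, write `data`, then fsync the file and its parent dir.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        # os.write() may write fewer bytes than requested, so loop.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Persists the file data and the file (inode) metadata.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Persists the directory entry that names the new file.
    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

On file systems that already persist the new directory entry on fsync(file),
the second fsync should be redundant but harmless.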
>
> > +
> > +fsync(dir) : Flushes directory data and directory entries. However if
> > +you created a new file within the directory and wrote data to the
> > +file, then the file data is not guaranteed to be persisted, unless an
> > +explicit fsync() is issued on the file.
> > +
> > +ext4
> > +-----
> > +fsync(file) : Ensures that a newly created file is persisted (no need
>
> newly created file directory entry is persisted
>
> > +to explicitly persist the parent directory). However, if you create
> > +multiple names of the file (hard links), then they are not guaranteed
> > +to persist unless each one of the hard links are persisted [2].
>
> "...then the hard linked directory entries are not guaranteed to persist
> unless each one of the parent directories is persisted."
>
> > +
> > +fsync(dir) : All file names within the persisted directory will exist,
> > +but the file data is not guaranteed to be persisted.
> > +
> > +btrfs
> > +------
> > +fsync(file) : Ensures that the newly created file is persisted, along
> > +with all its hard links. You do not need to persist individual hard
> > +links to the file.
>
> Rephrase to disambiguate
>
> > +
> > +fsync(dir) : All the file names within the directory persist. All the
> > +rename and unlink operations within the directory are persisted. Due
> > +to the design choices made by btrfs, fsync of a directory could lead
> > +to an iterative fsync on sub-directories, thereby requiring a full
> > +file system commit. So btrfs does not advocate persisting directories
> > +[2].
> > +
> > +fsync(symlink)
> > +--------------
> > +A symlink inode cannot be directly opened for IO, which means there is
> > +no such thing as fsync of a symlink [3]. You could be tricked by the
> > +fact that open and fsync of a symlink succeed without returning an
> > +error, but what happens in reality is as follows.
> > +
> > +Suppose we have a symlink “foo”, which points to the file “A/bar”
> > +
> > +fd = open("foo", O_CREAT | O_RDWR)
> > +fsync(fd)
> > +
> > +Both of the above operations succeed, but if you crash after fsync, the
> > +symlink could still be missing.
> > +
> > +When you try to open the symlink “foo”, you are actually trying to
> > +open the file that the symlink resolves to, which in this case is
> > +“A/bar”. When you fsync the inode returned by the open system call, you
> > +are actually persisting the file “A/bar” and not the symlink. Note
> > +that if the file “A/bar” does not exist and you try to open the
> > +symlink “foo” without the O_CREAT flag, then file open will fail. To
> > +obtain the file descriptor associated with the symlink inode, you
> > +could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the
> > +file descriptor obtained this way can only be used to indicate a
> > +location in the file-system tree and to perform operations that act
> > +purely at the file descriptor level. Operations like read(), write(),
> > +fsync(), etc. cannot be performed on such file descriptors.
> > +
> > +Bottom line: You cannot fsync() a symlink.
> > +
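
The symlink behaviour described above can be checked from user space. The
sketch below is an illustration under assumptions, not part of the patch:
both function names are hypothetical, and it assumes Linux, where os.O_PATH
is available. It compares inode numbers to show which object the descriptor
refers to, and reports the errno that fsync() returns on an O_PATH
descriptor.

```python
import os


def opened_inode_is_target(link: str) -> bool:
    """Return True if a plain open() of `link` yields the target's inode."""
    fd = os.open(link, os.O_RDWR)
    try:
        opened = os.fstat(fd)
        target = os.stat(link)    # follows the symlink
        symlink = os.lstat(link)  # the symlink inode itself
        same_as_target = (opened.st_dev, opened.st_ino) == (
            target.st_dev,
            target.st_ino,
        )
        same_as_symlink = (opened.st_dev, opened.st_ino) == (
            symlink.st_dev,
            symlink.st_ino,
        )
        return same_as_target and not same_as_symlink
    finally:
        os.close(fd)


def fsync_symlink_inode_errno(link: str) -> int:
    """Try to fsync the symlink inode itself; return errno, or 0 on success."""
    fd = os.open(link, os.O_PATH | os.O_NOFOLLOW)
    try:
        os.fsync(fd)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
```

With a symlink foo pointing at an existing file A/bar, the first function
should return True, and the second should return errno.EBADF, since an
O_PATH descriptor cannot be used for I/O operations such as fsync().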
> > +fsync(special files)
> > +--------------------
> > +Special files in Linux include block and character device files
> > +(created using mknod), FIFOs (created using mkfifo), etc. Just like the
> > +behavior of fsync on symlinks described above, these special files do
> > +not have an fsync function defined. Similar to symlinks, you
> > +cannot fsync a special file [4].
> > +
> > +
> > +Strictly Ordered Metadata Consistency
> > +-------------------------------------
> > +Because each file system provides a different level of persistence
> > +guarantees, a consensus would let application developers work with
> > +fixed assumptions about file system guarantees. Dave Chinner
> > +proposed a unified model called the
> > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > +
> > +Under this scheme, the file system guarantees to persist all previous
> > +dependent modifications to the object upon fsync(). If you fsync() an
> > +inode, it will persist all the changes required to reference the inode
> > +and its data. SOMC can be defined as follows [6]:
> > +
> > +If op1 precedes op2 in program order (in-memory execution order), and
> > +op1 and op2 share a dependency, then op2 must not be observed by a
> > +user after recovery without also observing op1.
> > +
> > +Unfortunately, SOMC's definition depends upon whether two operations
> > +share a dependency, which is file-system specific. A developer would
> > +need to understand file-system internals to know if SOMC would order
> > +one operation before another. It is worth noting that a file system
> > +can be crash-consistent (according to POSIX), without providing SOMC
> > +[7].
> > +
> > +Example
> > +-------
> > +touch A/foo
> > +echo "hello" > A/foo
> > +sync
> > +
> > +mv A/foo A/bar
> > +echo "world" > A/foo
> > +fsync A/foo
> > +CRASH
> > +
> > +What would you expect on recovery, if the file system crashed after
> > +the final fsync returned successfully?
> > +
> > +Non-SOMC file systems will not persist the file
> > +A/bar because it was not explicitly fsync-ed. This means you will
> > +find only the file A/foo with data “world” after the crash, thereby losing
> > +the previously persisted file with data “hello” [8]. You will need to
> > +explicitly persist the directory A to ensure the rename operation is
> > +safely persisted on disk.
> > +
> > +Under SOMC, to correctly reference the new inode via A/foo,
> > +the previous rename operation must persist as well. Therefore,
> > +fsync() of A/foo will persist the renamed file A/bar as well.
> > +On recovery you will find both A/bar (with data “hello”)
> > +and A/foo (with data “world”).
> > +
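
For readers who want to replay the workload, the sequence can be written with
the underlying system calls. This is only a sketch: it cannot simulate the
crash, the names write_and_fsync and somc_example are hypothetical, and the
fsync_dir flag is an assumption added here to show the extra directory fsync
that the text says a non-SOMC file system needs.

```python
import os


def write_and_fsync(path: str, data: bytes) -> None:
    """Create or truncate `path`, write `data`, and fsync the file."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            view = view[os.write(fd, view):]
        os.fsync(fd)
    finally:
        os.close(fd)


def somc_example(a_dir: str, fsync_dir: bool = True) -> None:
    """Replay the example workload inside the existing directory `a_dir`."""
    foo = os.path.join(a_dir, "foo")
    bar = os.path.join(a_dir, "bar")

    # touch A/foo; echo "hello" > A/foo; sync
    write_and_fsync(foo, b"hello\n")
    os.sync()

    # mv A/foo A/bar; echo "world" > A/foo; fsync A/foo
    os.rename(foo, bar)
    write_and_fsync(foo, b"world\n")

    if fsync_dir:
        # Explicitly persist directory A so the rename does not depend on
        # SOMC-like ordering in the file system.
        dir_fd = os.open(a_dir, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A crash-testing harness would still be needed to observe the difference
between SOMC and non-SOMC behaviour; without a crash, both files are simply
present afterwards.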
> > +It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict)
> > +and btrfs provide SOMC-like behaviour in this particular example.
> > +However, only XFS explicitly claims to provide SOMC.
> > +It is not clear if ext4, F2FS and btrfs provide strictly ordered
> > +metadata consistency.
> > +
> > +--------------------------------------------------------
> > +[1] http://man7.org/linux/man-pages/man2/fdatasync.2.html
> > +[2] https://www.spinics.net/lists/linux-btrfs/msg77340.html
> > +[3] https://www.spinics.net/lists/fstests/msg09370.html
> > +[4] https://bugzilla.kernel.org/show_bug.cgi?id=202485
> > +[5] https://marc.info/?l=fstests&m=155010885626284&w=2
> > +[6] https://marc.info/?l=fstests&m=155011123126916&w=2
> > +[7] https://www.spinics.net/lists/fstests/msg09379.html
> > +[8] https://patchwork.kernel.org/patch/10132305/
> > +
> > --
> > 2.7.4
> >
>