From: Marc Lehmann <schmorp@schmorp.de>
To: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: linux-f2fs-devel@lists.sourceforge.net
Subject: Re: sync/umount hang on 3.18.21, 1.4TB gone after crash
Date: Sat, 26 Sep 2015 05:08:33 +0200
Message-ID: <20150926030833.GC3669@schmorp.de>
In-Reply-To: <20150925184202.GD6998@jaegeuk-mac02>

On Fri, Sep 25, 2015 at 11:42:02AM -0700, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> AFAIK, there-in *commit* means syncing metadata, not userdata. Doesn't it?

In general, no, they commit userdata, even if not necessarily at the same
time. ext* for example has three modes, and the default commits userdata
before the corresponding metadata (data=ordered).

But even when you relax this (data=writeback), a few minutes after a file is
written, both userdata and metadata are there (usually after 30s). Data that
was just being written is generally mixed, but that's an easy-to-handle
trade-off.

(and then there is data=journal, which should get perfectly ordered
behaviour for both, at high cost, and flushoncommit).

Early (linux) versions of XFS were more like a brutal version of writeback
- files recently written before a crash were frequently filled with zero
bytes (something I haven't seen with irix, which frequently crashed :).
But they somehow made it work - I used to be a frequent victim of
zero-filled files, but it has not happened to me for many years. So while I
don't know if it's a guarantee, in practice, file data is there together
with the metadata, and usually within the writeback period configured in the
kernel (+ whatever time it takes to write out the data, which can be
substantial, but can also be limited in /proc/sys/vm, especially dirty_bytes
and dirty_background_bytes).
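
To see which writeback limits are in effect on a given box, the knobs can be
read straight from /proc/sys/vm. This is a minimal sketch, assuming a Linux
system; the function name read_vm_knobs and the choice of Python are only for
illustration and are not part of the original discussion.

```python
from pathlib import Path

# Knob names as they appear under /proc/sys/vm on Linux.
KNOBS = (
    "dirty_bytes",
    "dirty_background_bytes",
    "dirty_ratio",
    "dirty_background_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_vm_knobs(base="/proc/sys/vm"):
    """Return {knob: int} for every knob that exists and parses under base."""
    values = {}
    for name in KNOBS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            continue  # not Linux, or the knob is missing or unreadable
    return values


if __name__ == "__main__":
    for name, value in read_vm_knobs().items():
        print(f"vm.{name} = {value}")
```

Knobs that are missing are skipped rather than reported as errors, so the
same script also runs (and prints nothing) on a non-Linux system.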

Note also that filesystems often special-case write + rename over old
file, and various other cases, to give a good user experience.
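
The write + rename pattern looks roughly like this from the application
side. This is a sketch for a POSIX system, not something taken from any
particular program: atomic_write is a hypothetical helper name, and the
explicit fsync calls are what an application adds when it does not want to
rely on the filesystem's special-casing.

```python
import contextlib
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace path with data so a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # userdata is on the device before the rename
        os.replace(tmp, path)     # atomic rename over the old file
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)
        raise
    # fsync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp creates the file with mode 0600, so a real program would
usually copy the permissions of the file it replaces.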

So for traditional filesystems, /proc/sys/vm/dirty_bytes + dirty_expire
gives a pretty good way of defining a) how much data is lost and b) within
which timeframe. Such filesystems also have their own setting for metadata
commit, but they are generally within the timeframe of a few seconds to
half a minute.

It does not have the nice "exact version of a point in time" qualities you
can get from a log-based file system, but they give quite nice guarantees in
practice - if a file was half-written, it does not have its full length
but corrupted data inside, for example.

For things like database files, this could be an issue, as indeed you
don't control the order of things written, but programs *know* about this
problem and fsync accordingly (and the kernel has extra support for these
things, as in sync_page_range and so on).
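
What "fsync accordingly" means on the application side can be sketched as
follows: a record only counts as committed once fsync has returned. The
helper name durable_append is made up for this example, and real databases
do considerably more (checksums, write-ahead logs, batching of syncs).

```python
import os


def durable_append(path, record: bytes):
    """Append record to path and return only after fsync has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(record)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # do not acknowledge the record before this returns
    finally:
        os.close(fd)
```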

So, in general, filesystems only commit metadata, but the kernel commits
userdata on its own, and as an extra feature, "good" filesystems such
as xfs or ext* have extra logic to commit userdata before committing
corresponding metadata (or after).

Note also that with most journal-based filesystems, commit just forces the
issue, both metadata and userdata usually hit the disk much earlier.

In addition, the issue at hand is f2fs losing metadata, not userdata,
as all data had been written to the device hours before the crash. The
userdata was all there, but the filesystem forgot how to access it.

> So, even if you saw no data loss, filesystem doesn't guarantee all the data were
> completely recovered, since sync or fsync was not called for that file.

No, but I feel fairly confident that a file written over a minute ago
on a box that is sitting idle for a minute is still there after a crash,
barring hardware faults.

Now, I am not necessarily criticizing f2fs here, after all, the problem at
hand is f2fs _hanging_, which is clearly a bug. I don't know how well f2fs
performs with this bug fixed, regarding data loss.

Also, f2fs is a different beast - syncs can take a veeery long time on f2fs
compared to xfs or ext4, and maybe that is due to the design of f2fs (I
suppose so, but you can correct me). In which case it might not be such a
good idea to commit every 30s. Maybe my performance problem was because
f2fs committed every 30s.

> I think you need to tune the system-wide parameters related to flusher mentioned
> by Chao for your workloads.

I already do configure these extensively, according to my workload. On the
box I did my recent tests:

   vm.dirty_ratio = 80
   vm.dirty_background_ratio = 4
   vm.dirty_writeback_centisecs = 100
   vm.dirty_expire_centisecs = 100

These are pretty aggressive. The reason is that the box has 32GB of RAM, and
with default values it is not uncommon to get 10-20GB of dirty data before a
writeback, which then more or less freezes everything and can take a long
time. So the above values don't wait long to write userdata, and make sure a
process generating lots of dirty blocks can't freeze the system.

Specifically, in the case of tar writing files, tar will start blocking after
only ~1.3GB of dirty data.

That means with a "conventional" filesystem, I lose at most 1.3GB of data
+ less than 30s, on a crash.
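
The ~1.3GB figure is consistent with vm.dirty_background_ratio = 4 applied to
32GB of RAM. A rough sketch of that arithmetic follows, under the simplifying
assumption that the ratios apply to total RAM; the kernel applies them to
dirtyable memory, which is somewhat smaller, so real thresholds come out a bit
lower. The function name dirty_thresholds is hypothetical.

```python
GIB = 1024 ** 3


def dirty_thresholds(mem_bytes, dirty_ratio, dirty_background_ratio):
    """Return (background_bytes, limit_bytes) as plain percentages of mem_bytes."""
    background = mem_bytes * dirty_background_ratio // 100
    limit = mem_bytes * dirty_ratio // 100
    return background, limit


# Values from the sysctl settings quoted above, on a 32GB box.
background, limit = dirty_thresholds(
    32 * GIB, dirty_ratio=80, dirty_background_ratio=4
)
print(f"vm.dirty_background_ratio threshold: {background / GIB:.2f} GiB")
print(f"vm.dirty_ratio threshold:            {limit / GIB:.1f} GiB")
```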

> And, we need to expect periodic checkpoints are able to recover the previously
> flushed data.

Yes, I would consider this a must, however, again, I can accept if f2fs
needs much higher "commit" intervals than other filesystems (say, 10
minutes), if that is needed to make it performant.

But some form of fixed timeframe is needed, I think, whether it's seconds
or minutes.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\
