From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
Cc: dsterba@suse.cz, Waxhead <waxhead@online.no>,
linux-btrfs@vger.kernel.org
Subject: Re: Is stability a joke? (wiki updated)
Date: Sun, 18 Sep 2016 23:47:02 -0400 [thread overview]
Message-ID: <20160919034701.GE21290@hungrycats.org> (raw)
In-Reply-To: <8df2691f-94c1-61de-881f-075682d4a28d@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2539 bytes --]
On Mon, Sep 12, 2016 at 12:56:03PM -0400, Austin S. Hemmelgarn wrote:
> 4. File Range Cloning and Out-of-band Dedupe: Similarly, work fine if the FS
> is healthy.
I've found issues with OOB dedup (clone/extent-same):
1. Don't dedup data that has not been committed--either call fsync()
on it, or check the generation numbers on each extent before deduping
it, or make sure the data is not being actively modified during dedup;
otherwise, a race condition may lead to the the filesystem locking up and
becoming inaccessible until the kernel is rebooted. This is particularly
important if you are doing bedup-style incremental dedup on a live system.
I've worked around #1 by placing a fsync() call on the src FD immediately
before calling FILE_EXTENT_SAME. When I do an A/B experiment with and
without the fsync, "with-fsync" runs for weeks at a time without issues,
while "without-fsync" hangs, sometimes in just a matter of hours. Note
that the fsync() doesn't resolve the underlying race condition, it just
makes the filesystem hang less often.
2. There is a practical limit to the number of times a single duplicate
extent can be deduplicated. As more references to a shared extent
are created, any part of the filesystem that uses backref walking code
gets slower. This includes dedup itself, balance, device replace/delete,
FIEMAP, LOGICAL_INO, and mmap() (which can be bad news if the duplicate
files are executables). Several factors (including file size and number
of snapshots) are involved, making it difficult to devise workarounds or
set up test cases. 99.5% of the time, these operations just get slower
by a few ms each time a new reference is created, but the other 0.5% of
the time, write operations will abruptly grow to consume hours of CPU
time or dozens of gigabytes of RAM (in millions of kmalloc-32 slabs)
when they touch one of these over-shared extents. When this occurs,
it effectively (but not literally) crashes the host machine.
I've worked around #2 by building tables of "toxic" hashes that occur too
frequently in a filesystem to be deduped, and using these tables in dedup
software to ignore any duplicate data matching them. These tables can
be relatively small as they only need to list hashes that are repeated
more than a few thousand times, and typical filesystems (up to 10TB or
so) have only a few hundred such hashes.
I happened to have a couple of machines taken down by these issues this
very weekend, so I can confirm the issues are present in kernels 4.4.21,
4.5.7, and 4.7.4.
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
next prev parent reply other threads:[~2016-09-19 3:47 UTC|newest]
Thread overview: 93+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-09-11 8:55 Is stability a joke? Waxhead
2016-09-11 9:56 ` Steven Haigh
2016-09-11 10:23 ` Martin Steigerwald
2016-09-11 11:21 ` Zoiled
2016-09-11 11:43 ` Martin Steigerwald
2016-09-11 12:05 ` Martin Steigerwald
2016-09-11 12:39 ` Waxhead
2016-09-11 13:02 ` Hugo Mills
2016-09-11 14:59 ` Martin Steigerwald
2016-09-11 20:14 ` Chris Murphy
2016-09-12 12:20 ` Austin S. Hemmelgarn
2016-09-12 12:59 ` Michel Bouissou
2016-09-12 13:14 ` Austin S. Hemmelgarn
2016-09-12 14:04 ` Lionel Bouton
2016-09-15 1:05 ` Nicholas D Steeves
2016-09-15 8:02 ` Martin Steigerwald
2016-09-16 7:13 ` Helmut Eller
2016-09-15 5:55 ` Kai Krakow
2016-09-15 8:05 ` Martin Steigerwald
2016-09-11 14:54 ` Martin Steigerwald
2016-09-11 15:19 ` Martin Steigerwald
2016-09-11 20:21 ` Chris Murphy
2016-09-11 17:46 ` Marc MERLIN
2016-09-20 16:33 ` Chris Murphy
2016-09-11 17:11 ` Duncan
2016-09-12 12:26 ` Austin S. Hemmelgarn
2016-09-11 12:30 ` Waxhead
2016-09-11 14:36 ` Martin Steigerwald
2016-09-12 12:48 ` Swâmi Petaramesh
2016-09-12 13:53 ` Chris Mason
2016-09-12 17:36 ` Zoiled
2016-09-12 17:44 ` Waxhead
2016-09-15 1:12 ` Nicholas D Steeves
2016-09-12 14:27 ` David Sterba
2016-09-12 14:54 ` Austin S. Hemmelgarn
2016-09-12 16:51 ` David Sterba
2016-09-12 17:31 ` Austin S. Hemmelgarn
2016-09-15 1:07 ` Nicholas D Steeves
2016-09-15 1:13 ` Steven Haigh
2016-09-15 2:14 ` stability matrix (was: Is stability a joke?) Christoph Anton Mitterer
2016-09-15 9:49 ` stability matrix Hans van Kranenburg
2016-09-15 11:54 ` Austin S. Hemmelgarn
2016-09-15 14:15 ` Chris Murphy
2016-09-15 14:56 ` Martin Steigerwald
2016-09-19 14:38 ` David Sterba
2016-09-19 15:27 ` stability matrix (was: Is stability a joke?) David Sterba
2016-09-19 17:18 ` stability matrix Austin S. Hemmelgarn
2016-09-19 19:52 ` Christoph Anton Mitterer
2016-09-19 20:07 ` Chris Mason
2016-09-19 20:36 ` Christoph Anton Mitterer
2016-09-19 21:03 ` Chris Mason
2016-09-19 19:45 ` stability matrix (was: Is stability a joke?) Christoph Anton Mitterer
2016-09-20 7:59 ` Duncan
2016-09-20 8:19 ` Hugo Mills
2016-09-20 8:34 ` David Sterba
2016-09-19 15:38 ` Is stability a joke? David Sterba
2016-09-19 21:25 ` Hans van Kranenburg
2016-09-12 16:27 ` Is stability a joke? (wiki updated) David Sterba
2016-09-12 16:56 ` Austin S. Hemmelgarn
2016-09-12 17:29 ` Filipe Manana
2016-09-12 17:42 ` Austin S. Hemmelgarn
2016-09-12 20:08 ` Chris Murphy
2016-09-13 11:35 ` Austin S. Hemmelgarn
2016-09-15 18:01 ` Chris Murphy
2016-09-15 18:20 ` Austin S. Hemmelgarn
2016-09-15 19:02 ` Chris Murphy
2016-09-15 20:16 ` Hugo Mills
2016-09-15 20:26 ` Chris Murphy
2016-09-16 12:00 ` Austin S. Hemmelgarn
2016-09-19 2:57 ` Zygo Blaxell
2016-09-19 12:37 ` Austin S. Hemmelgarn
2016-09-19 4:08 ` Zygo Blaxell
2016-09-19 15:27 ` Sean Greenslade
2016-09-19 17:38 ` Austin S. Hemmelgarn
2016-09-19 18:27 ` Chris Murphy
2016-09-19 18:34 ` Austin S. Hemmelgarn
2016-09-19 20:15 ` Zygo Blaxell
2016-09-20 12:09 ` Austin S. Hemmelgarn
2016-09-15 21:23 ` Christoph Anton Mitterer
2016-09-16 12:13 ` Austin S. Hemmelgarn
2016-09-19 3:47 ` Zygo Blaxell [this message]
2016-09-19 12:32 ` Austin S. Hemmelgarn
2016-09-19 15:33 ` Zygo Blaxell
2016-09-12 19:57 ` Martin Steigerwald
2016-09-12 20:21 ` Pasi Kärkkäinen
2016-09-12 20:35 ` Martin Steigerwald
2016-09-12 20:44 ` Chris Murphy
2016-09-13 11:28 ` Austin S. Hemmelgarn
2016-09-13 11:39 ` Martin Steigerwald
2016-09-14 5:53 ` Marc Haber
2016-09-12 20:48 ` Waxhead
2016-09-13 8:38 ` Timofey Titovets
2016-09-13 11:26 ` Austin S. Hemmelgarn
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160919034701.GE21290@hungrycats.org \
--to=ce3g8jdj@umail.furryterror.org \
--cc=ahferroin7@gmail.com \
--cc=dsterba@suse.cz \
--cc=linux-btrfs@vger.kernel.org \
--cc=waxhead@online.no \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).