From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Andrei Borzenkov <arvidjaar@gmail.com>
Cc: Chris Murphy <lists@colorremedies.com>,
kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Fri, 24 Jun 2016 12:39:29 -0400
Message-ID: <20160624163929.GE14667@hungrycats.org>
In-Reply-To: <576CB0DA.6030409@gmail.com>
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> I don't read code well enough, but I'd be surprised if Btrfs
> >> reconstructs from parity and doesn't then check the resulting
> >> reconstructed data to its EXTENT_CSUM.
> >
> > I wouldn't be surprised if both things happen in different code paths,
> > given the number of different paths leading into the raid56 code and
> > the number of distinct failure modes it seems to have.
>
> Well, the problem is that parity block cannot be redirected on write as
> data blocks; which makes it impossible to version control it. The only
> solution I see is to always use full stripe writes by either wasting
> time in fixed width stripe or using variable width, so that every stripe
> always gets new version of parity. This makes it possible to keep parity
> checksums like data checksums.

The allocator could try harder to avoid partial stripe writes. We can
write multiple small extents to the same stripe as long as we always do
it all within one transaction, and then later treat the entire stripe
as read-only until every extent is removed. It would be possible to do
that by fudging extent lengths (effectively adding a bunch of prealloc-ish
space if we have a partial write after all the delalloc stuff is done),
but it could also waste some blocks on every single transaction, or
create a bunch of "free but unavailable" space that makes df/statvfs
output even more wrong than it usually is.
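
To put a rough number on the padding cost, here is a minimal sketch
(plain Python, not btrfs code). The function name and the 64 KiB stripe
unit are assumptions for illustration, and it assumes the write starts
on a full-stripe boundary:

```python
STRIPE_UNIT = 64 * 1024  # bytes per device per stripe (assumed value)


def pad_to_full_stripe(write_len, data_disks, stripe_unit=STRIPE_UNIT):
    """Return (padded_len, wasted) for a write that starts on a
    full-stripe boundary and must also end on one.

    data_disks is the number of data (non-parity) devices per stripe.
    """
    full_stripe = data_disks * stripe_unit
    remainder = write_len % full_stripe
    # Nothing to pad if the write already ends on a stripe boundary.
    wasted = 0 if remainder == 0 else full_stripe - remainder
    return write_len + wasted, wasted


if __name__ == "__main__":
    # 3-device raid5 = 2 data disks + 1 parity disk per stripe.
    padded, wasted = pad_to_full_stripe(4096, data_disks=2)
    print(padded, wasted)
```

Under those assumptions, a lone 4 KiB write on a 3-device raid5 would
pin 124 KiB of padding until every extent in the stripe is removed.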

The raid5 read-modify-write (rmw) code could try to relocate the other
extents sharing a stripe, but I fear that with the current state of
backref walking code that would make raid5 spectacularly slow if a
filesystem is anywhere near full.

We could also write rmw parity block updates to a journal (like another
log tree). That would enable us to at least fix up the parity blocks
after a crash, and close the write hole. That's an on-disk format
change though.
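
For anyone following along who wants the write hole spelled out, here
is a toy model (plain Python, not btrfs code). The Stripe class, its
method names, and the single-slot "journal" are all hypothetical. It
only models a crash between the data write and the parity write; a
real design would also have to handle a crash before the data block
reaches disk, which this sketch ignores:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class Stripe:
    """Toy 3-device raid5 stripe: two data blocks and one parity block."""

    def __init__(self, d0: bytes, d1: bytes):
        self.d0, self.d1 = d0, d1
        self.parity = xor(d0, d1)
        self.journal = None  # hypothetical parity journal entry

    def rmw_write_d0(self, new_d0, use_journal, crash_before_parity):
        # Read-modify-write: new parity = old parity ^ old data ^ new data.
        new_parity = xor(xor(self.parity, self.d0), new_d0)
        if use_journal:
            # Assumed durable before the stripe itself is touched.
            self.journal = new_parity
        self.d0 = new_d0
        if crash_before_parity:
            return  # parity on disk is now stale
        self.parity = new_parity
        self.journal = None

    def replay_journal(self):
        """After a crash, fix up the parity block from the journal."""
        if self.journal is not None:
            self.parity = self.journal
            self.journal = None

    def rebuild_d1(self) -> bytes:
        """Reconstruct d1 as if its device had been lost."""
        return xor(self.d0, self.parity)


if __name__ == "__main__":
    hole = Stripe(b"AAAA", b"BBBB")
    hole.rmw_write_d0(b"CCCC", use_journal=False, crash_before_parity=True)
    print(hole.rebuild_d1() == b"BBBB")  # stale parity corrupts d1

    fixed = Stripe(b"AAAA", b"BBBB")
    fixed.rmw_write_d0(b"CCCC", use_journal=True, crash_before_parity=True)
    fixed.replay_journal()
    print(fixed.rebuild_d1() == b"BBBB")  # parity repaired before rebuild
```

Note that the block that gets corrupted in the first case is d1, which
was never written to: that is what makes the write hole nasty.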