btrfs raid1 metadata, single data

* btrfs raid1 metadata, single data
@ 2015-08-07  8:49 Robert Krig
  2015-08-07  9:18 ` Russell Coker
  2015-08-07 10:26 ` Duncan
  0 siblings, 2 replies; 8+ messages in thread
From: Robert Krig @ 2015-08-07  8:49 UTC
  To: linux-btrfs

Hi, I was wondering.

What exactly is contained in btrfs metadata?

I've read about some users setting up their btrfs volumes as
data=single, but metadata=raid1

Is there any actual benefit to that? I mean, if you keep your data as
single, but have multiple copies of metadata, does that still allow you
to recover from data corruption? Or is metadata redundancy a benefit to
ensure that your btrfs volume remains mountable/readable?

* Re: btrfs raid1 metadata, single data
  2015-08-07  8:49 btrfs raid1 metadata, single data Robert Krig
@ 2015-08-07  9:18 ` Russell Coker
  2015-08-07  9:47   ` Sjoerd
  2015-08-07 10:26 ` Duncan
  1 sibling, 1 reply; 8+ messages in thread
From: Russell Coker @ 2015-08-07  9:18 UTC
  To: Robert Krig; +Cc: linux-btrfs

On Fri, 7 Aug 2015 06:49:58 PM Robert Krig wrote:
> What exactly is contained in btrfs metadata?

Much the same as in metadata for every other filesystem.

> I've read about some users setting up their btrfs volumes as
> data=single, but metadata=raid1
> 
> Is there any actual benefit to that? I mean, if you keep your data as
> single, but have multiple copies of metadata, does that still allow you
> to recover from data corruption? Or is metadata redundancy a benefit to
> ensure that your btrfs volume remains mountable/readable?

If you have redundant metadata and experience corruption, then you will know
the name of every file that has data corruption, which is really useful when
restoring from backup.  You will also be protected against corruption of a
root directory causing massive data loss.

If you have the bad luck to have certain metadata structures corrupted with no
redundancy, then you can face massive data loss and possibly have the entire
filesystem become at least temporarily unusable.  While corruption of the root
directory is unlikely, it is possible.  With "dup" metadata I've seen a BTRFS
filesystem remain usable after 12,000+ read errors.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

* Re: btrfs raid1 metadata, single data
  2015-08-07  9:18 ` Russell Coker
@ 2015-08-07  9:47   ` Sjoerd
  2015-08-07 10:40     ` Mike Fleetwood
  2015-08-07 11:13     ` Duncan
  0 siblings, 2 replies; 8+ messages in thread
From: Sjoerd @ 2015-08-07  9:47 UTC
  To: linux-btrfs

On Friday 07 August 2015 19:18:02 Russell Coker wrote:
> On Fri, 7 Aug 2015 06:49:58 PM Robert Krig wrote:
> > What exactly is contained in btrfs metadata?
> 
> Much the same as in metadata for every other filesystem.
> 
> > I've read about some users setting up their btrfs volumes as
> > data=single, but metadata=raid1
> > 
> > Is there any actual benefit to that? I mean, if you keep your data as
> > single, but have multiple copies of metadata, does that still allow you
> > to recover from data corruption? Or is metadata redundancy a benefit to
> > ensure that your btrfs volume remains mountable/readable?
> 
> If you have redundant metadata and experience corruption, then you will know
> the name of every file that has data corruption, which is really useful when
> restoring from backup.  You will also be protected against corruption of a
> root directory causing massive data loss.
> 
> If you have the bad luck to have certain metadata structures corrupted with
> no redundancy, then you can face massive data loss and possibly have the
> entire filesystem become at least temporarily unusable.  While corruption
> of the root directory is unlikely, it is possible.  With "dup" metadata I've
> seen a BTRFS filesystem remain usable after 12,000+ read errors.

While we're at it: any idea why the default for SSDs is single for metadata,
as described on the wiki?
(https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Filesystem_creation)

I was looking for an answer as to why my SSD had just single metadata, while I
expected it to be DUP, and stumbled on this wiki article. I can't find a reason
why an SSD would be different.

Cheers,
Sjoerd

* Re: btrfs raid1 metadata, single data
  2015-08-07  9:47   ` Sjoerd
@ 2015-08-07 10:40     ` Mike Fleetwood
  2015-08-07 11:38       ` Austin S Hemmelgarn
  2015-08-07 20:07       ` Sjoerd
  2015-08-07 11:13     ` Duncan
  1 sibling, 2 replies; 8+ messages in thread
From: Mike Fleetwood @ 2015-08-07 10:40 UTC
  To: Sjoerd; +Cc: linux-btrfs

On 7 August 2015 at 10:47, Sjoerd <sjoerd@sjomar.eu> wrote:
> While we're at it: any idea why the default for SSDs is single for metadata,
> as described on the wiki?
> (https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Filesystem_creation)
>
> I was looking for an answer as to why my SSD had just single metadata, while I
> expected it to be DUP, and stumbled on this wiki article. I can't find a reason
> why an SSD would be different.
>
> Cheers,
> Sjoerd

I would assume that it is because some SSD controllers
deduplicate by default [1].  The developers probably think that when
it comes to your data, the truth, no matter how ugly, is preferable to a
false sense of security (Btrfs thinking it has 2 copies of metadata
when the SSD has actually stored only 1 copy).

[1] How SSDs can hose your data
http://www.zdnet.com/article/how-ssds-can-hose-your-data/
"Researchers found that at least 1 Sandforce SSD controller - the
SF1200 - does block-level deduplication by default. Which can be a
problem.

Many file systems - NTFS, most Unix/Linux FSs, ZFS are some - write
critical metadata to multiple blocks in case one copy gets corrupted.
But what if, unbeknownst to you, your SSD de-duplicates that block,
leaving your file system with only 1 copy? "
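
A minimal Python sketch of that concern, assuming a controller that
deduplicates by content hash. The class DedupStore and its methods are
hypothetical names invented for this illustration; real firmware is not
public and will certainly differ.

```python
import hashlib


class DedupStore:
    """Toy model of a deduplicating SSD controller (hypothetical)."""

    def __init__(self):
        self.logical = {}   # logical block address -> content hash
        self.physical = {}  # content hash -> the single stored payload

    def write(self, lba, payload):
        digest = hashlib.sha256(payload).hexdigest()
        # Identical payloads end up sharing one physical copy.
        self.physical.setdefault(digest, payload)
        self.logical[lba] = digest

    def read(self, lba):
        return self.physical[self.logical[lba]]

    def corrupt_physical(self, lba):
        """Damage the physical block that backs this logical address."""
        self.physical[self.logical[lba]] = b"\x00" * 8


store = DedupStore()
metadata = b"btrfs metadata node"
store.write(100, metadata)  # first "dup" copy
store.write(200, metadata)  # second "dup" copy, identical content

physical_copies = len(store.physical)
store.corrupt_physical(100)
both_lost = store.read(100) != metadata and store.read(200) != metadata
print(physical_copies, both_lost)
```

In this model the filesystem believes it has two copies at addresses 100
and 200, yet a single damaged physical block takes out both.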

Thanks,
Mike

* Re: btrfs raid1 metadata, single data
  2015-08-07 10:40     ` Mike Fleetwood
@ 2015-08-07 11:38       ` Austin S Hemmelgarn
  2015-08-07 20:07       ` Sjoerd
  1 sibling, 0 replies; 8+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-07 11:38 UTC
  To: Mike Fleetwood, Sjoerd; +Cc: linux-btrfs

On 2015-08-07 06:40, Mike Fleetwood wrote:
> On 7 August 2015 at 10:47, Sjoerd <sjoerd@sjomar.eu> wrote:
>> While we're at it: any idea why the default for SSDs is single for metadata,
>> as described on the wiki?
>> (https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Filesystem_creation)
>>
>> I was looking for an answer as to why my SSD had just single metadata, while I
>> expected it to be DUP, and stumbled on this wiki article. I can't find a reason
>> why an SSD would be different.
>>
>> Cheers,
>> Sjoerd
>
> I would assume that it is because some SSD controllers
> deduplicate by default [1].  The developers probably think that when
> it comes to your data, the truth, no matter how ugly, is preferable to a
> false sense of security (Btrfs thinking it has 2 copies of metadata
> when the SSD has actually stored only 1 copy).
>
> [1] How SSDs can hose your data
> http://www.zdnet.com/article/how-ssds-can-hose-your-data/
> "Researchers found that at least 1 Sandforce SSD controller - the
> SF1200 - does block-level deduplication by default. Which can be a
> problem.
>
> Many file systems - NTFS, most Unix/Linux FSs, ZFS are some - write
> critical metadata to multiple blocks in case one copy gets corrupted.
> But what if, unbeknownst to you, your SSD de-duplicates that block,
> leaving your file system with only 1 copy? "
>
And of course there's the counter-argument from the manufacturers that
do this:
"But we use ECC, so only having one copy of the data is still safe!"
That is obviously something from their marketing department and not from
the people who actually understand how this works. Most SSDs only do
SECDED (single error correction, double error detection) ECC, which is
very much insufficient for MLC flash, as losing a single cell causes
multiple bits to go bad.
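
To illustrate what SECDED can and cannot do, here is a small Python
sketch using an extended Hamming(8,4) code. It is a textbook toy chosen
for brevity, not the ECC of any particular SSD, and it assumes that both
bits of a failed MLC cell land in the same codeword. The function names
are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an extended Hamming(8,4) codeword."""
    code = [0] * 8  # index 0: overall parity; 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def extract(code):
    """Pull the 4 data bits back out of a codeword."""
    return code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)


def decode(code):
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok", extract(code)
    if overall == 1:
        # Exactly one bit flipped; the syndrome names its position
        # (0 means the overall parity bit itself).
        fixed = list(code)
        fixed[syndrome] ^= 1
        return "corrected", extract(fixed)
    # Non-zero syndrome with even overall parity: two bits flipped.
    return "uncorrectable", None


word = encode(0b1011)

one_bit = list(word)
one_bit[5] ^= 1             # a single flipped bit
single_result = decode(one_bit)

two_bits = list(word)
two_bits[5] ^= 1            # both bits of one assumed MLC cell go bad
two_bits[6] ^= 1
double_result = decode(two_bits)

print(single_result, double_result)
```

The single flip is repaired, while the double flip is only detected, so
the data still has to come from somewhere else.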

The other reason, which applies to all SSDs, is that the disk layout the
filesystem sees is very different from how things are actually stored on
the flash chips themselves. Most firmware will group writes temporally,
likely causing both copies of the metadata to be put in the same erase
block, which in turn means that the duplication provides essentially zero
protection beyond having a single copy.



* Re: btrfs raid1 metadata, single data
  2015-08-07 10:40     ` Mike Fleetwood
  2015-08-07 11:38       ` Austin S Hemmelgarn
@ 2015-08-07 20:07       ` Sjoerd
  1 sibling, 0 replies; 8+ messages in thread
From: Sjoerd @ 2015-08-07 20:07 UTC
  To: linux-btrfs

On Friday 07 August 2015 11:40:24 Mike Fleetwood wrote:
> On 7 August 2015 at 10:47, Sjoerd <sjoerd@sjomar.eu> wrote:
> > While we're at it: any idea why the default for SSDs is single for
> > metadata, as described on the wiki?
> > (https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Filesystem_creation)
> > 
> > I was looking for an answer as to why my SSD had just single metadata,
> > while I expected it to be DUP, and stumbled on this wiki article. I can't
> > find a reason why an SSD would be different.
> > 
> > Cheers,
> > Sjoerd
> 
> I would assume that it is because some SSD controllers
> deduplicate by default [1].  The developers probably think that when
> it comes to your data, the truth, no matter how ugly, is preferable to a
> false sense of security (Btrfs thinking it has 2 copies of metadata
> when the SSD has actually stored only 1 copy).
> 
> [1] How SSDs can hose your data
> http://www.zdnet.com/article/how-ssds-can-hose-your-data/
> "Researchers found that at least 1 Sandforce SSD controller - the
> SF1200 - does block-level deduplication by default. Which can be a
> problem.
> 
> Many file systems - NTFS, most Unix/Linux FSs, ZFS are some - write
> critical metadata to multiple blocks in case one copy gets corrupted.
> But what if, unbeknownst to you, your SSD de-duplicates that block,
> leaving your file system with only 1 copy? "
> 
> Thanks,
> Mike

Thanks for the explanation, Mike. Sounds plausible.

* Re: btrfs raid1 metadata, single data
  2015-08-07  9:47   ` Sjoerd
  2015-08-07 10:40     ` Mike Fleetwood
@ 2015-08-07 11:13     ` Duncan
  1 sibling, 0 replies; 8+ messages in thread
From: Duncan @ 2015-08-07 11:13 UTC
  To: linux-btrfs

Sjoerd posted on Fri, 07 Aug 2015 11:47:57 +0200 as excerpted:

> While we're at it: any idea why the default for SSDs is single for
> metadata, as described on the wiki?
> (https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Filesystem_creation)
> 
> I was looking for an answer as to why my SSD had just single metadata,
> while I expected it to be DUP, and stumbled on this wiki article. I can't
> find a reason why an SSD would be different.

I've seen two variant answers given, both having to do with the FTL in
the SSDs, but neither particularly satisfying to me.

The first applies to only some SSDs, in particular those with dedup
features, like those with SandForce controllers.  These will detect the
duplicated metadata and dedup it, so there will still be only one copy.
In this case, writing the disappearing dup only wastes CPU and device
cycles and bandwidth, so yes, for these devices single is appropriate.
But since only a fraction of SSDs are affected, I personally still don't
see that justifying changing the default for all SSDs.

The second variant suggests that since the two copies will be updated so
close to each other in time, even non-deduping FTLs will very likely map
both of them to the same erase block, moving them together for
wear-leveling, etc.  As such, they will very likely be located close to
each other on the physical media, and chip damage or block wear-out that
affects one will very likely affect the other as well.  So even where the
two copies aren't deduped into one as in the first variant, there's likely
very little if any advantage in having that second copy on SSD, since both
copies are likely to be corrupted together in any case.  Meanwhile it
costs /both/ extra CPU and device cycles and bandwidth, /and/ extra write
cycles on write-cycle-limited media.
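
As a rough illustration of this second variant, the following Python
sketch models an FTL that simply appends incoming writes to the current
erase block. The class LogFTL and the 64-page geometry are assumptions
made up for this example; real FTLs are far more elaborate.

```python
PAGES_PER_ERASE_BLOCK = 64  # assumed, illustrative geometry


class LogFTL:
    """Toy log-structured FTL: writes land in arrival order (hypothetical)."""

    def __init__(self):
        self.next_page = 0
        self.mapping = {}  # logical block address -> physical page number

    def write(self, lba):
        self.mapping[lba] = self.next_page
        self.next_page += 1

    def erase_block_of(self, lba):
        return self.mapping[lba] // PAGES_PER_ERASE_BLOCK


ftl = LogFTL()
for lba in range(10):   # some unrelated earlier writes
    ftl.write(lba)

ftl.write(1_000_000)    # first dup copy
ftl.write(9_000_000)    # second dup copy, logically far away

same_block = ftl.erase_block_of(1_000_000) == ftl.erase_block_of(9_000_000)
print(same_block)
```

The two logical addresses are millions of blocks apart, but because the
writes arrive back to back they occupy neighbouring pages in one erase
block.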

This variant seems more credible to me, but even then, I'm not sure it
justifies changing the default.  If a user wants to spare his SSD the
trouble, it's easy enough to override the default.  Otherwise, it seems
to me the dup default should remain as it is, both because in some cases
it might still provide useful redundancy, and because having an exception
like that is a pain when it comes to documentation and explanation.  And
that pain has a cost as well...

But... most of my btrfs are dual-device raid1 for both data and metadata
anyway.  The exception is /boot, which I specifically set up as dup,
despite it being on SSD.  (There's a backup /boot on the other device,
complete with its own backup gpt-bios partition and grub setup, so each
device can be selected in the BIOS and booted independently, regardless of
whether the other device is there or not.)  Being /boot it's small, so
it's mixed-bg, meaning data and metadata are both dup, which halves my
effective storage capacity, but I'd still rather have dup than single.

And I'd really recommend dual device raid1 both data/metadata, for the 
reasons I explained in my earlier response to this thread -- in addition 
to the additional physical device redundancy of raid1, btrfs data 
integrity means raid1 gives you a second copy of everything, so if btrfs 
detects a bad copy, it actually has a second, hopefully good, copy to 
overwrite the bad copy with.  Without that second copy, it'll just error 
out when you try to access the blocks of the file that checksum-fail.  
With it, you'll get the good copy instead, and btrfs can overwrite the 
bad one too. =:^)
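
The behaviour described above can be sketched in a few lines of Python.
This is only a model of the idea, not btrfs code: zlib.crc32 stands in
for the filesystem's real checksum, and read_with_repair is a
hypothetical helper name.

```python
import zlib


def checksum(data: bytes) -> int:
    # Stand-in checksum for this sketch; not the one btrfs actually uses.
    return zlib.crc32(data)


def read_with_repair(copies, expected_csum):
    """Return a copy that passes verification and overwrite any bad ones.

    copies is a mutable list holding one bytes object per mirror.
    """
    good = next((c for c in copies if checksum(c) == expected_csum), None)
    if good is None:
        raise OSError("all copies fail checksum verification")
    for i, c in enumerate(copies):
        if checksum(c) != expected_csum:
            copies[i] = good  # repair the bad copy from the good one
    return good


block = b"file contents"
csum = checksum(block)

raid1 = [b"file c0ntents", block]  # first mirror is corrupted
result = read_with_repair(raid1, csum)

single = [b"file c0ntents"]        # the only copy is corrupted
try:
    read_with_repair(single, csum)
    single_failed = False
except OSError:
    single_failed = True

print(result == block, raid1[0] == block, single_failed)
```

With two copies the read succeeds and the bad mirror is rewritten; with a
single copy the corruption is detected but all the caller gets is an
error.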

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: btrfs raid1 metadata, single data
  2015-08-07  8:49 btrfs raid1 metadata, single data Robert Krig
  2015-08-07  9:18 ` Russell Coker
@ 2015-08-07 10:26 ` Duncan
  1 sibling, 0 replies; 8+ messages in thread
From: Duncan @ 2015-08-07 10:26 UTC
  To: linux-btrfs

Robert Krig posted on Fri, 07 Aug 2015 10:49:58 +0200 as excerpted:

> Hi, I was wondering.
> 
> What exactly is contained in btrfs metadata?
> 
> I've read about some users setting up their btrfs volumes as
> data=single, but metadata=raid1
> 
> Is there any actual benefit to that? I mean, if you keep your data as
> single, but have multiple copies of metadata, does that still allow you
> to recover from data corruption? Or is metadata redundancy a benefit to
> ensure that your btrfs volume remains mountable/readable?

The latter.

Metadata includes information /about/ the files, the names and file 
permissions, what extents the data is actually saved in on the device(s), 
etc.  In btrfs, really small files (under a few KiB, depending on the 
size of your metadata nodes, which are typically 16 KiB) are normally 
stored directly in metadata, instead of mapping to a data extent, as 
well.  Critically, the checksum data for btrfs' file integrity features
is stored in metadata as well.

Because the metadata for a file is typically much smaller than a file, 
and also much smaller than the default 16 KiB metadata node size, the 
metadata for many files is stored in a single metadata node.  As such, 
damage to a metadata node risks making many files irretrievable.
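
A back-of-the-envelope Python calculation shows the scale involved. The
16 KiB node size comes from the text above; the 300 bytes of metadata per
file is purely an assumed, illustrative figure, since the real amount
varies with name length, extent count and so on.

```python
NODE_SIZE = 16 * 1024         # default metadata node size, in bytes
ASSUMED_BYTES_PER_FILE = 300  # hypothetical per-file metadata footprint

files_per_node = NODE_SIZE // ASSUMED_BYTES_PER_FILE
print(files_per_node)
```

Under that assumption, one unreadable node would affect the metadata of
several dozen files at once.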

Additionally, that extent mapping is critical.  If it points to the wrong 
place, a file may contain chunks of other files, perhaps not owned by the 
same user/group so potentially a security/privacy leak as well as file 
corruption.

Of course btrfs metadata is itself checksummed so damage to it should be 
detected, and when there's another copy of that metadata, btrfs will 
verify it and if it's good, use it instead.

And of course, on btrfs, various btrfs-specific features such as
snapshotting depend on good metadata as well.  A snapshot basically locks
in place a reference to the current file extents.  Any later changes are
written elsewhere due to COW, so they change the current version but not
the snapshotted one, and there are now two mappings of the file in the
metadata: one for the snapshot, another for the changed, current version.

All that is why metadata is so critical, so much so that on single-device
btrfs, the default for metadata is dup, meaning two copies, instead of the
single copy that is the default for data.  If one metadata node copy is
bad (fails checksum validation or can't be read at all), by default
there's a second copy to read from, even on a single-device filesystem.
(Do note, however, that this doesn't apply to SSDs, where for various
reasons having to do with how SSD FTLs work, the metadata default for
single-device is single, not dup.)

On a multi-device filesystem, the metadata default changes to raid1,
ensuring that the two copies are kept on different devices.  (Note that
currently, there are never more than two copies, no matter how many
devices there are in the filesystem.)  That does help ensure metadata
integrity even if a device is lost, which should indeed help recover any
single-mode files that weren't partly on that device.  But that's not the
real reason it's the metadata default, since with the default single-mode
data it's still reasonably likely that part of any particular file will be
on the failed device.  The real reason is the same as the reason the
metadata default on a single-device btrfs is dup: so there are always two
copies.  And with at least two devices available, it simply makes sense to
ensure the two copies are on different devices, even if the benefit is
only incremental over allowing both copies to land on whatever device,
possibly the same one.
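
The difference between the profiles when a whole device disappears can be
modelled with a short Python sketch. The placement rules are simplified
to the bare idea described above, and place_copies and survives_loss are
hypothetical helper names, not btrfs interfaces.

```python
def place_copies(profile, devices):
    """Return the device holding each copy of a chunk (simplified model)."""
    if profile == "single":
        return [devices[0]]
    if profile == "dup":
        # Two copies, both on the same device.
        return [devices[0], devices[0]]
    if profile == "raid1":
        if len(devices) < 2:
            raise ValueError("raid1 needs at least two devices")
        # Two copies, always on different devices.
        return [devices[0], devices[1]]
    raise ValueError(f"unknown profile: {profile}")


def survives_loss(placement, lost_device):
    """True if at least one copy lives on a device that is still present."""
    return any(dev != lost_device for dev in placement)


devices = ["sda", "sdb"]
outcome = {
    profile: survives_loss(place_copies(profile, devices), "sda")
    for profile in ("single", "dup", "raid1")
}
print(outcome)
```

Both dup and raid1 keep two copies, but only raid1 still has one left
after the device holding the first copy is gone.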

But if one can afford double the dataspace usage, raid1 (or raid10 for 4+
devices; raid5 and 6 are still immature) for both data and metadata is
quite appealing on btrfs, particularly with the checksumming and data
integrity features.  With both data and metadata as raid1, there's always
a second copy to fall back on should one copy fail checksum verification.
That's actually what I use here, and it's one of the big reasons I'm using
btrfs in the first place, since few other solutions provide that level of
both redundancy and verified integrity.  mdraid1, for instance, allows
multiple copies, but doesn't checksum or verify their validity.  mdraid5
and mdraid6 have parity and could in theory verify in normal operation as
well as repair after replacement of a lost device, but they don't: parity
is only checked on rebuild after replacement of a lost device, not in
normal operation.

FWIW, what I'd LIKE to use in btrfs is N-way-mirroring, so I could have,
for instance, three copies of the data and metadata instead of just two.
It's on the roadmap and scheduled as the next raid feature for
implementation after raid5/6 (which is now complete but still immature),
but it isn't available yet.  Three copies is really my sweet spot, tho
with N-way-mirroring four or more would be possible.  Every time a btrfs
scrub corrects a bad copy of something by overwriting it with the good
copy, I worry about what would have happened if that only remaining good
copy had gone bad at the same time as well.  The chance of three bad
copies at the same time is enough lower than that of two bad copies that I
find a third copy worth it, while the incremental risk reduction from
adding a fourth copy isn't worth the management time and cost to me.  So
I've been waiting for that N-way-mirroring implementation to get my three
copies.  Even when it's implemented, it'll take a while to stabilize (I've
been recommending a year of stabilization for raid56, and that people
continue with raid1 or raid10 in the meantime).  So I not only have to
wait for the feature to be introduced, but then I have to either wait even
longer for stabilization, or expect to be the guinea pig finding and
reporting bugs in the implementation for the first few kernel cycles.
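
The diminishing return of each extra copy can be put into numbers with a
small Python sketch. It assumes each copy goes bad independently with the
same probability, and the 1-in-100 figure is an arbitrary illustrative
value; real failures are often correlated, which is exactly the worry
about the last good copy.

```python
from fractions import Fraction


def all_copies_bad(p, copies):
    """Chance that every copy is bad, assuming independent failures."""
    return p ** copies


p = Fraction(1, 100)  # assumed per-copy failure probability

risk = {n: all_copies_bad(p, n) for n in (2, 3, 4)}
gain_third = risk[2] - risk[3]   # risk removed by adding a third copy
gain_fourth = risk[3] - risk[4]  # risk removed by adding a fourth copy

for n, r in risk.items():
    print(n, "copies:", r)
print("third copy removes", gain_third / gain_fourth,
      "times as much risk as the fourth")
```

Under these assumptions each added copy cuts the remaining risk by the
same factor, but the absolute risk removed shrinks quickly, while the
cost of each extra copy stays the same.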

HTH explain things! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
