staggered stripes

* staggered stripes
@ 2014-05-15  9:00 Russell Coker
  2014-05-15  9:31 ` Duncan
  2014-05-15  9:34 ` Hugo Mills
  0 siblings, 2 replies; 7+ messages in thread
From: Russell Coker @ 2014-05-15  9:00 UTC (permalink / raw)
  To: linux-btrfs

http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.html

Page 13 of the above paper says:

# Figure 12 presents for each block number, the number of disk drives of disk
# model ‘E-1’ that developed a checksum mismatch at that block number. We see
# in the figure that many disks develop corruption for a specific set of block
# numbers. We also verified that (i) other disk models did not develop
# multiple check-sum mismatches for the same set of block numbers (ii) the
# disks that developed mismatches at the same block numbers belong to
# different storage systems, and (iii) our software stack has no specific data
# structure that is placed at the block numbers of interest.
#
# These observations indicate that hardware or firmware bugs that affect
# specific sets of block numbers might exist. Therefore, RAID system designers
# may be well-advised to use staggered stripes such that the blocks that form
# a stripe (providing the required redundancy) are placed at different block
# numbers on different disks.

Does the BTRFS RAID functionality do such staggered stripes?  If not could it 
be added?

I guess there's nothing stopping a sysadmin from allocating an unused 
partition at the start of each disk and using a different size for each disk.  
But I think it would be best to do this inside the filesystem.
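
As a rough illustration of the padding-partition workaround, the sketch below
computes a different data-partition start sector for each disk. The function
name, the 1 GiB step and the 2048-sector first offset are assumptions chosen
for the example; nothing here comes from btrfs or from any partitioning tool.

```python
# Hypothetical helper, not part of btrfs or any partitioning tool.
def padding_offsets(num_disks, step_sectors=2048 * 1024, first_sector=2048):
    """Return the start sector of the data partition on each disk.

    Disk i gets i * step_sectors of unused padding in front of its data
    partition, so the same filesystem offset lands on a different block
    number on every disk. The default step is 1 GiB, assuming 512-byte
    sectors.
    """
    if num_disks < 1 or step_sectors < 1:
        raise ValueError("need at least one disk and a positive step")
    return [first_sector + i * step_sectors for i in range(num_disks)]


if __name__ == "__main__":
    for disk, start in enumerate(padding_offsets(4)):
        print(f"disk {disk}: data partition starts at sector {start}")
```

The cost of this approach is the padding itself: the last disk loses
(num_disks - 1) steps of capacity unless the padding is reused for something
else.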

Also this is another reason for having DUP+RAID-1.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

* Re: staggered stripes
  2014-05-15  9:00 staggered stripes Russell Coker
@ 2014-05-15  9:31 ` Duncan
  2014-05-15 14:38   ` Russell Coker
       [not found]   ` <2Ee51o00g0uXw0U01Ee7j5>
  2014-05-15  9:34 ` Hugo Mills
  1 sibling, 2 replies; 7+ messages in thread
From: Duncan @ 2014-05-15  9:31 UTC (permalink / raw)
  To: linux-btrfs

Russell Coker posted on Thu, 15 May 2014 19:00:10 +1000 as excerpted:

> http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.html
> 
> Page 13 of the above paper says:
> 
> # Figure 12 [...] We see in the figure that many disks develop
> # corruption for a specific set of block numbers.  [T]herefore,
> # RAID system designers may be well-advised to use staggered
> # stripes such that the blocks that form a stripe (providing
> # the required redundancy) are placed at different block numbers
> # on different disks.
> 
> Does the BTRFS RAID functionality do such staggered stripes?  If not
> could it be added?

AFAIK nothing like that yet, but it's reasonably likely to be implemented 
later.  N-way-mirroring is roadmapped for next up after raid56 
completion, however. 

You do mention the partition alternative, but not as I'd do it for such a 
case.  Instead of doing a different sized buffer partition (or using the 
mkfs.btrfs option to start at some offset into the device) on each 
device, I'd simply do multiple partitions and reorder them on each 
device.  Tho N-way-mirroring would sure help here too.  If a given area 
around the same address is assumed to be weak on each device, I'd sure 
like more than the current 2-way-mirroring, even if I had a different 
filesystem/partition at that spot on each one.  With only 
two-way-mirroring, if one copy is assumed to be weak, guess what, you're 
down to only one reasonably reliable copy now.  That's not a good spot to 
be in if that one copy happens to be hit by a cosmic ray or otherwise 
fails checksum, since the only copy left to repair it from is the one 
already in the weak area.

Another alternative would be using something like mdraid's raid10 "far" 
layout, with btrfs on top of that...
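
To show why a "far" layout gives a stagger, here is a simplified model of
where the copies of a chunk land. It assumes the usual description of the md
raid10 far layout: the first copy is striped RAID0-style across the early
section of every device, and each further copy is striped across a later
section, shifted by one device. The function and parameter names are made up
for the sketch, and it ignores md's real on-disk details.

```python
# Simplified, hypothetical model of a raid10 "far" layout; not md's code.
def far_layout(chunk, num_devices, section_chunks, copies=2):
    """Return [(device, chunk_offset_on_device), ...], one entry per copy.

    section_chunks is the number of chunks each device holds per section.
    Copy k of a chunk sits k sections further into the device and k devices
    to the right (wrapping around), so copies never share a device or an
    offset.
    """
    if copies > num_devices:
        raise ValueError("cannot place more copies than devices")
    row = chunk // num_devices
    if row >= section_chunks:
        raise ValueError("chunk does not fit in one section")
    return [((chunk + k) % num_devices, k * section_chunks + row)
            for k in range(copies)]


if __name__ == "__main__":
    for chunk in range(4):
        print(chunk, far_layout(chunk, num_devices=2, section_chunks=100))
```

In this model the two copies of any chunk are always a full section apart, so
a defect tied to one block-number range cannot hit both copies.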

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: staggered stripes
  2014-05-15  9:31 ` Duncan
@ 2014-05-15 14:38   ` Russell Coker
  2014-05-15 16:15     ` Brendan Hide
  2014-05-15 16:18     ` Hugo Mills
       [not found]   ` <2Ee51o00g0uXw0U01Ee7j5>
  1 sibling, 2 replies; 7+ messages in thread
From: Russell Coker @ 2014-05-15 14:38 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Thu, 15 May 2014 09:31:42 Duncan wrote:
> > Does the BTRFS RAID functionality do such staggered stripes?  If not
> > could it be added?
> 
> AFAIK nothing like that yet, but it's reasonably likely to be implemented
> later.  N-way-mirroring is roadmapped for next up after raid56
> completion, however.

It's with RAID-5/6 that we really need such staggering.  It's a reasonably common 
configuration choice to use two different brands of disk for a RAID-1 array.  
As the correlation between parts of the disks with errors only applied to 
disks of the same make and model (and this is expected due to 
firmware/manufacturing issues) the people who care about such things on RAID-1 
have probably already dealt with the issue.

> You do mention the partition alternative, but not as I'd do it for such a
> case.  Instead of doing a different sized buffer partition (or using the
> mkfs.btrfs option to start at some offset into the device) on each
> device, I'd simply do multiple partitions and reorder them on each
> device.

If there are multiple partitions on a device then that will probably make 
performance suck.  Also does BTRFS even allow special treatment of them or 
will it put two copies from a RAID-10 on the same disk?

> Tho N-way-mirroring would sure help here too, since if a given
> area around the same address is assumed to be weak on each device, I'd
> sure like greater than the current 2-way-mirroring, even if if I had a
> different filesystem/partition at that spot on each one, since with only
> two-way-mirroring if one copy is assumed to be weak, guess what, you're
> down to only one reasonably reliable copy now, and that's not a good spot
> to be in if that one copy happens to be hit by a cosmic ray or otherwise
> fail checksum, without another reliable copy to fix it since that other
> copy is in the weak area already.
> 
> Another alternative would be using something like mdraid's raid10 "far"
> layout, with btrfs on top of that...

In the "copies= option" thread Brendan Hide stated that this sort of thing is 
planned.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


* Re: staggered stripes
  2014-05-15 14:38   ` Russell Coker
@ 2014-05-15 16:15     ` Brendan Hide
  2014-05-15 16:18     ` Hugo Mills
  1 sibling, 0 replies; 7+ messages in thread
From: Brendan Hide @ 2014-05-15 16:15 UTC (permalink / raw)
  To: russell, Duncan; +Cc: linux-btrfs

On 2014/05/15 04:38 PM, Russell Coker wrote:
> On Thu, 15 May 2014 09:31:42 Duncan wrote:
>>> Does the BTRFS RAID functionality do such staggered stripes?  If not
>>> could it be added?
>> AFAIK nothing like that yet, but it's reasonably likely to be implemented
>> later.  N-way-mirroring is roadmapped for next up after raid56
>> completion, however.
> It's RAID-5/6 when we really need such staggering.  It's a reasonably common
> configuration choice to use two different brands of disk for a RAID-1 array.
> As the correlation between parts of the disks with errors only applied to
> disks of the same make and model (and this is expected due to
> firmware/manufacturing issues) the people who care about such things on RAID-1
> have probably already dealt with the issue.
>
>> You do mention the partition alternative, but not as I'd do it for such a
>> case.  Instead of doing a different sized buffer partition (or using the
>> mkfs.btrfs option to start at some offset into the device) on each
>> device, I'd simply do multiple partitions and reorder them on each
>> device.
> If there are multiple partitions on a device then that will probably make
> performance suck.  Also does BTRFS even allow special treatment of them or
> will it put two copies from a RAID-10 on the same disk?

I suspect the approach is similar to the following:
sd[abcd][1234....] each configured as LVM PVs
sda[1234....] as an LVM VG
sdb[2345....] as an LVM VG
sdc[3456....] as an LVM VG
sdd[4567....] as an LVM VG
btrfs across all four VGs

^ Um - the above is ignoring "DOS"-style partition limitations

>> Tho N-way-mirroring would sure help here too, since if a given
>> area around the same address is assumed to be weak on each device, I'd
>> sure like greater than the current 2-way-mirroring, even if if I had a
>> different filesystem/partition at that spot on each one, since with only
>> two-way-mirroring if one copy is assumed to be weak, guess what, you're
>> down to only one reasonably reliable copy now, and that's not a good spot
>> to be in if that one copy happens to be hit by a cosmic ray or otherwise
>> fail checksum, without another reliable copy to fix it since that other
>> copy is in the weak area already.
>>
>> Another alternative would be using something like mdraid's raid10 "far"
>> layout, with btrfs on top of that...
> In the "copies= option" thread Brendan Hide stated that this sort of thing is
> planned.
>

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


* Re: staggered stripes
  2014-05-15 14:38   ` Russell Coker
  2014-05-15 16:15     ` Brendan Hide
@ 2014-05-15 16:18     ` Hugo Mills
  1 sibling, 0 replies; 7+ messages in thread
From: Hugo Mills @ 2014-05-15 16:18 UTC (permalink / raw)
  To: Russell Coker; +Cc: Duncan, linux-btrfs

On Fri, May 16, 2014 at 12:38:04AM +1000, Russell Coker wrote:
> On Thu, 15 May 2014 09:31:42 Duncan wrote:
> > > Does the BTRFS RAID functionality do such staggered stripes?  If not
> > > could it be added?
> > 
> > AFAIK nothing like that yet, but it's reasonably likely to be implemented
> > later.  N-way-mirroring is roadmapped for next up after raid56
> > completion, however.
> 
> It's RAID-5/6 when we really need such staggering.  It's a reasonably common 
> configuration choice to use two different brands of disk for a RAID-1 array.  
> As the correlation between parts of the disks with errors only applied to 
> disks of the same make and model (and this is expected due to 
> firmware/manufacturing issues) the people who care about such things on RAID-1 
> have probably already dealt with the issue.
> 
> > You do mention the partition alternative, but not as I'd do it for such a
> > case.  Instead of doing a different sized buffer partition (or using the
> > mkfs.btrfs option to start at some offset into the device) on each
> > device, I'd simply do multiple partitions and reorder them on each
> > device.
> 
> If there are multiple partitions on a device then that will probably make 
> performance suck.  Also does BTRFS even allow special treatment of them or 
> will it put two copies from a RAID-10 on the same disk?

   It will do. However, we should be able to fix that with the new
allocator, if I ever get it finished...

   Hugo.

> > Tho N-way-mirroring would sure help here too, since if a given
> > area around the same address is assumed to be weak on each device, I'd
> > sure like greater than the current 2-way-mirroring, even if if I had a
> > different filesystem/partition at that spot on each one, since with only
> > two-way-mirroring if one copy is assumed to be weak, guess what, you're
> > down to only one reasonably reliable copy now, and that's not a good spot
> > to be in if that one copy happens to be hit by a cosmic ray or otherwise
> > fail checksum, without another reliable copy to fix it since that other
> > copy is in the weak area already.
> > 
> > Another alternative would be using something like mdraid's raid10 "far"
> > layout, with btrfs on top of that...
> 
> In the "copies= option" thread Brendan Hide stated that this sort of thing is 
> planned.
> 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
                 --- Stick them with the pointy end. ---                 

* Re: staggered stripes
       [not found]   ` <2Ee51o00g0uXw0U01Ee7j5>
@ 2014-05-16  4:17     ` Duncan
  0 siblings, 0 replies; 7+ messages in thread
From: Duncan @ 2014-05-16  4:17 UTC (permalink / raw)
  To: russell; +Cc: linux-btrfs

On Fri, 16 May 2014 00:38:04 +1000
Russell Coker <russell@coker.com.au> wrote:

> > You do mention the partition alternative, but not as I'd do it for
> > such a case.  Instead of doing a different sized buffer partition
> > (or using the mkfs.btrfs option to start at some offset into the
> > device) on each device, I'd simply do multiple partitions and
> > reorder them on each device.  
> 
> If there are multiple partitions on a device then that will probably
> make performance suck.  Also does BTRFS even allow special treatment
> of them or will it put two copies from a RAID-10 on the same disk?

I try to be brief, omitting the "common sense" stuff as readable
between the lines, and people don't...

What I meant is a layout like the one I have now, only with staggered
partitions.  Rather than describe the ideas, here's my actual sda
layout. sdb is identical, but would have the same partitions reordered
if set up as discussed here.  These are actually SSD so the firmware
will be scrambling and write-leveling the erase-blocks in any case,
but I've long used the same basic layout on spinning rust too, tweaking
it only a bit over several generations:

# gdisk -l /dev/sda

[...]
Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 500118192 sectors, 238.5 GiB
[...]
Partitions will be aligned on 2048-sector boundaries
Total free space is 246364781 sectors (117.5 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048            8191   3.0 MiB     EF02  bi0238gcn1+35l0
   2            8192          262143   124.0 MiB   EF00  ef0238gcn1+35l0
   3          262144          786431   256.0 MiB   8300  bt0238gcn1+35l0
   4          786432         2097151   640.0 MiB   8300  lg0238gcn1+35l0
   5         2097152        18874367   8.0 GiB     8300  rt0238gcn1+35l0
   6        18874368        60817407   20.0 GiB    8300  hm0238gcn1+35l0
   7        60817408       111149055   24.0 GiB    8300  pk0238gcn1+35l0
   8       111149056       127926271   8.0 GiB     8300  nr0238gcn1+35l0
   9       127926272       144703487   8.0 GiB     8300  rt0238gcn1+35l1
  10       144703488       186646527   20.0 GiB    8300  hm0238gcn1+35l1
  11       186646528       236978175   24.0 GiB    8300  pk0238gcn1+35l1
  12       236978176       253755391   8.0 GiB     8300  nr0238gcn1+35l1

You will note that partitioning is GPT for reliability and simplicity,
even tho my system's standard BIOS.  You'll also note I use GPT
partition naming to keep track of what's what, with the first two
characters denoting partition function (rt=root, hm=home, pk=package,
etc), and the last denoting working copy or backup N.[1]

Partition #1 is BIOS reserved -- that's where grub2 puts its core.  It
starts at the 1 MiB boundary and is 3 MiB, so everything after it is on
a 4 MiB boundary minimum.

#2 is EFI reserved, so I don't have to repartition if I upgrade to
UEFI and want to try it.  It starts at 4 MiB and is 124 MiB size, so
ends at 128 MiB, and everything after it is at minimum 128 MiB
boundaries.

Thus the first 128 MiB is special-purpose reserved.  Below that,
starting with #3, are my normal partitions, all btrfs, raid1 both
data/metadata except for /boot.
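
The boundary arithmetic for the first four partitions can be checked against
the gdisk listing. This is a small sketch that assumes 512-byte logical
sectors, which is consistent with the listing (500118192 sectors reported as
238.5 GiB); the helper names are made up for the example.

```python
SECTOR_BYTES = 512          # assumption: 512-byte logical sectors
MIB = 1024 * 1024

# (number, start sector, end sector), copied from the gdisk listing
PARTITIONS = [
    (1, 2048, 8191),
    (2, 8192, 262143),
    (3, 262144, 786431),
    (4, 786432, 2097151),
]


def mib(sectors):
    """Convert a sector count to MiB."""
    return sectors * SECTOR_BYTES / MIB


def describe(partitions):
    """Return (number, start MiB, size MiB, end boundary MiB) per partition."""
    rows = []
    for number, start, end in partitions:
        size = end - start + 1          # end sector is inclusive
        rows.append((number, mib(start), mib(size), mib(end + 1)))
    return rows


if __name__ == "__main__":
    for number, start_mib, size_mib, end_mib in describe(PARTITIONS):
        print(f"#{number}: starts at {start_mib:g} MiB, "
              f"size {size_mib:g} MiB, ends at {end_mib:g} MiB")
```

With these inputs partition #2 ends on the 128 MiB boundary and partition #4
ends on the 1 GiB boundary (1024 MiB), matching the description in the text.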

#3 is /boot.  Starting at 128 MiB it is 256 MiB size so ends at 384 MiB.

Unlike my other btrfs, /boot is single-device dup-mode mixed-bg, with
its primary backup on the partner hardware device (sda3 and sdb3,
working /boot and primary /boot backup).  This is because
it's FAR easier to simply point the grub on each device at its
own /boot partition, using the BIOS boot-device selector to
decide which one to boot, than it is to dynamically tell grub to
use a different /boot at boot-time (tho unlike with grub1, with grub2
it's actually possible due to grub rescue mode).

Btrfs dup-mode-mixed-bg effectively means I have only half capacity,
128 MiB, but that's enough for /boot.

#4 is /var/log.  Starting at 384 MiB it is 640 MiB in size (per device),
so it ends at the 1 GiB boundary and all partitions beyond it are whole
GiB sized so begin and end on whole GiB boundaries.  As it's under a
GiB per device it's btrfs mixed-bg mode, not separate data/metadata,
and btrfs raid1.

Unlike my other btrfs, log has no independent backup copy as I don't
find a backup of /var/log particularly useful.  But like the others
with the exception of /boot and its backup, it's btrfs raid1, so losing
a device doesn't mean losing the logs.

I'd probably leave the partitions thru #4 as-is, since they're sub-GiB
and end on a GiB boundary.  If /var/log happens to be on a weak part of
the device, oh, well, I'll take the loss, /boot is independent with the
backup written far less than the working copy anyway, so if that's a
weak spot, the working copy should go out first, with plenty of warning
before the [backup is affected too].

The next 8 partitions are split into two sets of four.  All are btrfs
raid1 mode for both data and metadata.

#5 is root (/).  It's 8 GiB and contains very nearly
everything that the package manager installs including the package
database, with the exception of /var/log as mentioned above and
some /var/lib/ subdirs as discussed below.  I once had a tough disaster
recovery where I ended up restoring root, /usr and /var from
backups done at three separate times, such that after the initial
recovery the installed package database on /var didn't match what was
actually on either the rootfs (including /etc) or /usr.  *NEVER*
*AGAIN*!!  It's (almost) all on the same partition and backup now, so
while I might end up restoring from an old backup, the package
installation database will always be in sync with what's actually
installed.  The "(almost)" is log and state and if they're out of sync
I can just blow them away and start over, but all documentation and
configuration files as well as the actual operational files for a
package will remain synced.

8 GiB is plenty for my installation.  Btrfs fi show says the devices
are only 4.53 GiB used.  Btrfs raid1 both data/metadata.

#6 is /home.  It's 20 GiB, which is enough, given I have a separate,
dedicated media partition (on spinning rust as access is reasonably
sequential and doesn't need the speed of ssd so I save on cost too,
and it's actually not btrfs). Btrfs raid1 both data/metadata.

#7 is distro package tree and cache.

I run gentoo so the distro package tree means build-scripts, and cached
sources.  I have the binpkg feature set, however, so I keep tarballed
binpkg backups of all packages needed for a complete reinstall, plus a
reasonable binpkg version history, in case I need to roll back.  In
addition to the build-scripts and source tarballs, the binpkgs are on
this filesystem too.  And I run ccache to speed up builds, with ccache
located on the packages filesystem too.

Additionally, I keep a second, independent set of binpkgs and ccache
for my 32-bit-only netbook, and that's on this partition too.  That's
why it's so big, 24 GiB, as it contains the distro tree and source
tarballs, plus both the binpkg tarballs and ccache for two independent
build sets.

Btrfs raid1 both data/metadata, of course.

#8 is the netbook's rootfs build image. Again, 8 GiB, just as is the
main rootfs.

That's the first set of four partitions, my working copy set, 60 GiB
total, beginning at 1 GiB so ending at 61 GiB.

The second set of four partitions mirrors the first set in size and
function, forming my first/primary backup, on the same pair of SSD
physical devices.  So it's 60 GiB total also, ending at 121 GiB.

The SSDs are 238.5 GiB (256 GB SI units) in size, so I've only actually
allocated just under 51% of the SSDs, plenty of overprovisioning to
allow the firmware lots and lots of room to do its wear-leveling.
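
The "just under 51%" figure follows from the gdisk numbers. The sketch below
treats everything up to the end of partition 12 as allocated, which is a
simplification (it counts the small unpartitioned gap before partition 1),
and again assumes 512-byte sectors. The names are illustrative only.

```python
SECTOR_BYTES = 512                   # assumption: 512-byte logical sectors
GIB = 1024 ** 3

TOTAL_SECTORS = 500118192            # "Disk /dev/sda" line of the listing
LAST_ALLOCATED_SECTOR = 253755391    # end sector of partition 12


def allocated_gib(last_sector):
    """Space from sector 0 through last_sector (inclusive), in GiB."""
    return (last_sector + 1) * SECTOR_BYTES / GIB


def allocated_fraction(total_sectors, last_sector):
    """Fraction of the device at or before last_sector."""
    return (last_sector + 1) / total_sectors


if __name__ == "__main__":
    frac = allocated_fraction(TOTAL_SECTORS, LAST_ALLOCATED_SECTOR)
    print(f"allocated: {allocated_gib(LAST_ALLOCATED_SECTOR):g} GiB "
          f"({frac:.1%} of the device)")
```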

Given that these ARE SSDs and the firmware DOES do wear-level
shuffling, I don't see the point in staggering the partitions here and
the layouts are identical on both, with btrfs using sda5/sdb5 as my
working root partition, for instance. However, were I on spinning rust,
I'd likely set up the btrfs raid1s such that working root was sda5/sdb9
while backup root was sda9/sdb5, thus staggering the partitions on each
device, while each filesystem would still consist of only a single
partition on each device.
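
The staggered pairing can be written out for all four filesystems. This
sketch uses the partition numbers from the listing and the two-letter
function prefixes (rt, hm, pk, nr); the function name and the dictionaries
are made up for the example, and it only builds the device names, it does
not create any filesystem.

```python
# Partition numbers from the listing: working set 5-8, backup set 9-12.
WORKING = {"rt": 5, "hm": 6, "pk": 7, "nr": 8}
BACKUP = {"rt": 9, "hm": 10, "pk": 11, "nr": 12}


def staggered_members(role, copy):
    """Return the two raid1 member partitions for one filesystem.

    copy 0 is the working filesystem, copy 1 is the backup. The sda member
    comes from one partition set and the sdb member from the other, so the
    two halves of each raid1 sit at different offsets on the two devices.
    """
    if copy not in (0, 1):
        raise ValueError("copy must be 0 (working) or 1 (backup)")
    on_sda = WORKING[role] if copy == 0 else BACKUP[role]
    on_sdb = BACKUP[role] if copy == 0 else WORKING[role]
    return [f"/dev/sda{on_sda}", f"/dev/sdb{on_sdb}"]


if __name__ == "__main__":
    for role in WORKING:
        print(role, "working:", staggered_members(role, 0),
              "backup:", staggered_members(role, 1))
```

Each filesystem still has exactly one partition per device, so btrfs raid1
cannot end up with both copies on the same physical disk.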

*THAT* is what I meant.

---
[1]  I have a standard scheme I use for both partition and filesystem
names/labels that allows me to uniquely identify devices, partitions and
filesystems by function, size, brand, target machine,
intended-working-copy or backup number, etc.  Only the first two
characters, partition/filesystem function, and the last character,
working copy (0) or backup N, are of interest for this post, however.

-- 
Duncan - No HTML messages please, as they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: staggered stripes
  2014-05-15  9:00 staggered stripes Russell Coker
  2014-05-15  9:31 ` Duncan
@ 2014-05-15  9:34 ` Hugo Mills
  1 sibling, 0 replies; 7+ messages in thread
From: Hugo Mills @ 2014-05-15  9:34 UTC (permalink / raw)
  To: Russell Coker; +Cc: linux-btrfs

On Thu, May 15, 2014 at 07:00:10PM +1000, Russell Coker wrote:
> http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.html
> 
> Page 13 of the above paper says:
> 
> # Figure 12 presents for each block number, the number of disk drives of disk
> # model ‘E-1’ that developed a checksum mismatch at that block number. We see
> # in the figure that many disks develop corruption for a specific set of block
> # numbers. We also verified that (i) other disk models did not develop
> # multiple check-sum mismatches for the same set of block numbers (ii) the
> # disks that developed mismatches at the same block numbers belong to
> # different storage systems, and (iii) our software stack has no specific data
> # structure that is placed at the block numbers of interest.
> #
> # These observations indicate that hardware or firmware bugs that affect
> # specific sets of block numbers might exist. Therefore, RAID system designers
> # may be well-advised to use staggered stripes such that the blocks that form
> # a stripe (providing the required redundancy) are placed at different block
> # numbers on different disks.
> 
> Does the BTRFS RAID functionality do such staggered stripes?  If not could it 
> be added?

   Yes, it could, by simply shifting around the chunk locations at
allocation time. I'm working in this area at the moment, and I think
it should be feasible within the scope of what I'm doing. I'll add it
to my list of things to look at.
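
   One way to picture "shifting around the chunk locations" is the toy
allocator below. It is only a sketch of the idea under simple assumptions
(equal-sized devices, fixed-size chunks, one chunk per device per stripe);
it is not how the btrfs allocator works, and all names are hypothetical.

```python
# Toy model of staggered chunk placement; not the btrfs allocator.
def staggered_chunk_offsets(stripe_index, num_devices, chunk_size, device_size):
    """Return the byte offset of one stripe's chunk on each device.

    Every device is divided into device_size // chunk_size slots. Device d
    uses slot (stripe_index + d * shift) % slots, where shift spreads the
    devices evenly over the slot range. The chunks of one stripe therefore
    sit at different offsets on different devices, while each device still
    uses every slot exactly once over a full pass of stripe indices.
    """
    slots = device_size // chunk_size
    if slots < num_devices:
        raise ValueError("need at least one slot per device to stagger")
    shift = slots // num_devices
    return [((stripe_index + dev * shift) % slots) * chunk_size
            for dev in range(num_devices)]


if __name__ == "__main__":
    GIB = 1024 ** 3
    for stripe in range(3):
        offsets = staggered_chunk_offsets(stripe, 4, GIB, 8 * GIB)
        print(stripe, [o // GIB for o in offsets])
```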

   Hugo.

> I guess there's nothing stopping a sysadmin from allocating an unused 
> partition at the start of each disk and use a different size for each disk.  
> But I think it would be best to do this inside the filesystem.
> 
> Also this is another reason for having DUP+RAID-1.
> 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
         --- If you're not part of the solution, you're part ---         
                           of the precipitate.                            
