linux-btrfs.vger.kernel.org archive mirror
* Update to Project_ideas wiki page
@ 2010-11-17  3:19 Chris Ball
  2010-11-17 14:31 ` Hugo Mills
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Ball @ 2010-11-17  3:19 UTC (permalink / raw)
  To: linux-btrfs

Hi,

Chris Mason has posted a bunch of interesting updates to the
Project_ideas wiki page.  If you're interested in working on any
of these, feel free to speak up and ask for more information if
you need it.  Here are the new sections, for the curious:

== Block group reclaim ==

The split between data and metadata block groups means that we
sometimes have mostly empty block groups dedicated to only data or
metadata.  As files are deleted, we should be able to reclaim these
and put the space back into the free space pool.

We also need rebalancing ioctls that focus only on specific raid
levels.
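
As a purely hypothetical sketch of what a targeted-rebalance interface
might look like (the struct and field names below are invented for
illustration, not an existing btrfs ABI):

#include <linux/types.h>

/* Hypothetical: ask the kernel to rebalance only a subset of chunks. */
struct balance_filter_args {
	__u64 flags;		/* which of the filters below are active */
	__u64 profiles;		/* bitmask of raid levels to touch (e.g. only RAID1) */
	__u32 usage_max;	/* only repack chunks below this percent usage */
	__u32 pad;
};

/*
 * Userspace could then request e.g. "repack all RAID1 data chunks that
 * are under 10% full"; the kernel walks only those block groups and the
 * emptied ones fall back into the free space pool.
 */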

== RBtree lock contention ==

Btrfs uses a number of rbtrees to index in-memory data structures.
Some of these are dominated by reads, and the lock contention from
searching them is showing up in profiles.  We need to look into an RCU
and sequence counter combination to allow lockless reads.
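
A rough kernel-style sketch of that pattern (the structure and helper
names here are invented for illustration, not existing btrfs code):

#include <linux/seqlock.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/rbtree.h>

/* Hypothetical read-mostly index: spinlock for writers plus a seqcount. */
struct cache_index {
	struct rb_root root;
	seqcount_t seq;
	spinlock_t lock;
};

/* Hypothetical plain rbtree helpers. */
extern struct rb_node *do_rb_search(struct rb_root *root, u64 key);
extern void do_rb_insert(struct rb_root *root, struct rb_node *node, u64 key);

/* Reader: takes no lock; retries the walk if a writer raced with it. */
static struct rb_node *lockless_lookup(struct cache_index *idx, u64 key)
{
	struct rb_node *node;
	unsigned int seq;

	rcu_read_lock();
	do {
		seq = read_seqcount_begin(&idx->seq);
		node = do_rb_search(&idx->root, key);
	} while (read_seqcount_retry(&idx->seq, seq));
	rcu_read_unlock();
	return node;
}

/* Writer: serialises on the spinlock and bumps the sequence counter. */
static void index_insert(struct cache_index *idx, struct rb_node *node, u64 key)
{
	spin_lock(&idx->lock);
	write_seqcount_begin(&idx->seq);
	do_rb_insert(&idx->root, node, key);
	write_seqcount_end(&idx->seq);
	spin_unlock(&idx->lock);
}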

== Forced readonly mounts on errors ==

The sources have a number of BUG() statements that could easily be
replaced with code to force the filesystem readonly.  This is the
first step in being more fault tolerant of disk corruption.  We need
to add a framework for generating errors that should result in the
filesystem going readonly; the conversion from BUG() to that
framework can then happen incrementally.
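
A hypothetical sketch of what such a framework could look like (the
names btrfs_fs_error(), FS_STATE_ERROR and the fs_state field are
invented here for illustration, not the eventual implementation):

/* Remember that the fs is wedged and flip the superblock readonly. */
static void __btrfs_force_readonly(struct btrfs_fs_info *fs_info)
{
	set_bit(FS_STATE_ERROR, &fs_info->fs_state);	/* hypothetical state bit */
	fs_info->sb->s_flags |= MS_RDONLY;
}

#define btrfs_fs_error(fs_info, errno, fmt, args...)			\
do {									\
	printk(KERN_CRIT "btrfs: error %d: " fmt "\n", (errno), ##args);\
	__btrfs_force_readonly(fs_info);				\
} while (0)

/*
 * A call site then changes from BUG() to something like:
 *
 *	if (ret) {
 *		btrfs_fs_error(fs_info, ret, "failed to update extent tree");
 *		return ret;
 *	}
 */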

== Dedicated metadata drives ==

We're able to split data and metadata IO very easily.  Metadata tends
to be dominated by seeks and for many applications it makes sense to
put the metadata onto faster SSDs.

== Readonly snapshots ==

Btrfs snapshots are writable by default.  A small number of checks
would allow us to create readonly snapshots instead.

== Per file / directory controls for COW and compression ==

Data compression and data cow are controlled across the entire FS by
mount options right now.  ioctls are needed to set this on a per file
or per directory basis.  This has been proposed previously, but VFS
developers wanted us to use generic ioctls rather than btrfs-specific
ones.  Can we use some of the same ioctls that ext4 uses?  This task
is mostly organizational rather than technical.
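
For reference, the generic route would look roughly like this from
userspace, riding on the ext2/ext4-style attribute ioctls in
<linux/fs.h>.  The FS_NOCOW_FL bit is an assumption for illustration
(hence the fallback define); only FS_IOC_GETFLAGS/FS_IOC_SETFLAGS are
the real generic interface:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#ifndef FS_NOCOW_FL
#define FS_NOCOW_FL 0x00800000	/* assumed bit for "no data copy-on-write" */
#endif

int main(int argc, char **argv)
{
	int fd, attr;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file-or-dir>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0) {
		perror("get flags");
		return 1;
	}
	attr |= FS_NOCOW_FL;	/* disable data COW for this inode */
	if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0) {
		perror("set flags");
		return 1;
	}
	close(fd);
	return 0;
}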

== Chunk tree backups ==

The chunk tree is critical to mapping logical block numbers to
physical locations on the drive.  We need to make the mappings
discoverable via a block device scan so that we can recover from
corrupted chunk trees.

== Rsync integration ==

Now that we have code to efficiently find newly updated files, we need
to tie it into tools such as rsync and dirvish.  (For bonus points, we
can even tell rsync _which blocks_ inside a file have changed.  Would
need to work with the rsync developers on that one.)

== Atomic write API ==

The Btrfs implementation of data=ordered only updates metadata to
point to new data blocks when the data IO is finished.  This makes it
easy for us to implement atomic writes of an arbitrary size.  Some
hardware is coming out that can support this down in the block layer
as well.

== Backref walking utilities ==

Given a block number on a disk, the Btrfs metadata can find all the
files and directories that use or care about that block.  Some
utilities to walk these back refs and print the results would help
debug corruptions.

Given an inode, the Btrfs metadata can find all the directories that
point to the inode.  We should have utils to walk these back refs as
well.

== Scrubbing ==

We need a periodic daemon that can walk the filesystem and verify
that the contents of all copies of all allocated blocks are correct.
This is mostly equivalent to "find | xargs cat >/dev/null", but
with the constraint that we don't want to thrash the page cache,
so direct I/O should be used instead.

If we find a bad copy during this process, and we're using RAID,
we should queue up an overwrite of the bad copy with a good one.
The overwrite can happen in-place.
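
A minimal userspace sketch of that direct-I/O read pass over a single
file (no checksum logic of its own -- a bad copy would surface as a
read error -- and 4096-byte alignment is assumed to satisfy O_DIRECT):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (1 << 20)		/* read 1 MiB at a time */

int main(int argc, char **argv)
{
	void *buf;
	ssize_t n;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	if (posix_memalign(&buf, 4096, CHUNK)) {
		perror("posix_memalign");
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Read and discard; the data never touches the page cache. */
	while ((n = read(fd, buf, CHUNK)) > 0)
		;
	if (n < 0)
		perror("read");
	close(fd);
	free(buf);
	return 0;
}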

== Drive swapping ==

Right now when we replace a drive, we do so with a full FS balance.
If we are inserting a new drive to remove an old one, we can do a
much less expensive operation where we just put valid copies of all
the blocks onto the new drive.

== IO error tracking ==

As we get bad csums or IO errors from drives, we should track the
failures and kick out the drive if it is clearly going bad.

== Random write performance ==

Random writes introduce small extents and fragmentation.  We need new
file layout code to improve this and defrag the files as they are
being changed.

== Free inode number cache ==

As the filesystem fills up, finding a free inode number will become
expensive.  Free inode numbers should be cached the same way we cache
free blocks.

== Snapshot aware defrag ==

As we defragment files, we break any sharing from other snapshots.
The balancing code will preserve the sharing, and defrag needs to grow
this ability as well.

== Btree lock contention ==

The btree locks, especially on the root block, can be very hot.
We need to improve this, especially in read-mostly workloads.

== Changing RAID levels ==

We need ioctls to change between different raid levels.  Some of these
are quite easy -- e.g. for RAID0 to RAID1, we just halve the available
bytes on the fs, then queue a rebalance.

== DISCARD utilities ==

For SSDs with discard support, we could use a scrubber that goes
through the fs and performs discard on anything that is unused.  You
could first use the balance operation to compact data to the front of
the drive, then discard the rest.
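
For the "discard everything unused" part, a minimal userspace sketch
using the generic FITRIM ioctl from <linux/fs.h> (assuming the kernel
and the filesystem implement it) would be:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	struct fstrim_range range;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&range, 0, sizeof(range));
	range.len = (__u64)-1;		/* trim every free range we can find */
	if (ioctl(fd, FITRIM, &range) < 0) {
		perror("FITRIM");
		return 1;
	}
	printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}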

-- 
Chris Ball   <cjb@laptop.org>
One Laptop Per Child

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17  3:19 Update to Project_ideas wiki page Chris Ball
@ 2010-11-17 14:31 ` Hugo Mills
  2010-11-17 15:12   ` Bart Noordervliet
  2010-11-26 14:57   ` Paul Komkoff
  0 siblings, 2 replies; 15+ messages in thread
From: Hugo Mills @ 2010-11-17 14:31 UTC (permalink / raw)
  To: Chris Ball; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1326 bytes --]

On Tue, Nov 16, 2010 at 10:19:45PM -0500, Chris Ball wrote:
> Hi,
> 
> Chris Mason has posted a bunch of interesting updates to the
> Project_ideas wiki page.  If you're interested in working on any
> of these, feel free to speak up and ask for more information if
> you need it.  Here are the new sections, for the curious:
> 
> == Block group reclaim ==
> 
> The split between data and metadata block groups means that we
> sometimes have mostly empty block groups dedicated to only data or
> metadata.  As files are deleted, we should be able to reclaim these
> and put the space back into the free space pool.
> 
> We also need rebalancing ioctls that focus only on specific raid
> levels.

> == Changing RAID levels ==
> 
> We need ioctls to change between different raid levels.  Some of these
> are quite easy -- e.g. for RAID0 to RAID1, we just halve the available
> bytes on the fs, then queue a rebalance.

   I would be interested in the rebalancing ioctls, and in RAID level
management. I'm still very much trying to learn the basics, though, so
I may go very slowly at first...

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- We demand rigidly defined areas of doubt and uncertainty! ---    

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 14:31 ` Hugo Mills
@ 2010-11-17 15:12   ` Bart Noordervliet
  2010-11-17 17:19     ` Xavier Nicollet
                       ` (2 more replies)
  2010-11-26 14:57   ` Paul Komkoff
  1 sibling, 3 replies; 15+ messages in thread
From: Bart Noordervliet @ 2010-11-17 15:12 UTC (permalink / raw)
  To: Hugo Mills, Chris Ball, linux-btrfs

On Wed, Nov 17, 2010 at 15:31, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:
> On Tue, Nov 16, 2010 at 10:19:45PM -0500, Chris Ball wrote:
>> == Changing RAID levels ==
>>
>> We need ioctls to change between different raid levels.  Some of these
>> are quite easy -- e.g. for RAID0 to RAID1, we just halve the available
>> bytes on the fs, then queue a rebalance.
>
>    I would be interested in the rebalancing ioctls, and in RAID level
> management. I'm still very much trying to learn the basics, though, so
> I may go very slowly at first...
>
>    Hugo.

Can I suggest we combine this new RAID level management with a
modernisation of the terminology for storage redundancy, as has been
discussed previously in the "Raid1 with 3 drives" thread of March this
year? I.e. abandon the burdened raid* terminology in favour of
something that makes more sense for a filesystem.

Mostly this would involve a discussion about what terms would make
most sense, though some changes in the behaviour of btrfs redundancy
modes may be warranted if they make things more intuitive.

I could help you make these changes in your patches, or write my own
patches against yours, though I'm also completely new to kernel
development.

Best regards,

Bart

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 15:12   ` Bart Noordervliet
@ 2010-11-17 17:19     ` Xavier Nicollet
  2010-11-17 17:52     ` Mike Fedyk
  2010-11-17 17:56     ` Hugo Mills
  2 siblings, 0 replies; 15+ messages in thread
From: Xavier Nicollet @ 2010-11-17 17:19 UTC (permalink / raw)
  To: Bart Noordervliet; +Cc: Hugo Mills, Chris Ball, linux-btrfs

On 17 November 2010 at 16:12, Bart Noordervliet wrote:
> Can I suggest we combine this new RAID level management with a
> modernisation of the terminology for storage redundancy, as has been
> discussed previously in the "Raid1 with 3 drives" thread of March this
> year? I.e. abandon the burdened raid* terminology in favour of
> something that makes more sense for a filesystem.

I would agree with that.

-- 
Xavier Nicollet

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 15:12   ` Bart Noordervliet
  2010-11-17 17:19     ` Xavier Nicollet
@ 2010-11-17 17:52     ` Mike Fedyk
  2010-11-17 17:56     ` Hugo Mills
  2 siblings, 0 replies; 15+ messages in thread
From: Mike Fedyk @ 2010-11-17 17:52 UTC (permalink / raw)
  To: Bart Noordervliet; +Cc: Hugo Mills, Chris Ball, linux-btrfs

On Wed, Nov 17, 2010 at 7:12 AM, Bart Noordervliet
<bart@noordervliet.net> wrote:
> On Wed, Nov 17, 2010 at 15:31, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:
>> On Tue, Nov 16, 2010 at 10:19:45PM -0500, Chris Ball wrote:
>>> == Changing RAID levels ==
>>>
>>> We need ioctls to change between different raid levels.  Some of these
>>> are quite easy -- e.g. for RAID0 to RAID1, we just halve the available
>>> bytes on the fs, then queue a rebalance.
>>
>>    I would be interested in the rebalancing ioctls, and in RAID level
>> management. I'm still very much trying to learn the basics, though, so
>> I may go very slowly at first...
>>
>>    Hugo.
>
> Can I suggest we combine this new RAID level management with a
> modernisation of the terminology for storage redundancy, as has been
> discussed previously in the "Raid1 with 3 drives" thread of March this
> year? I.e. abandon the burdened raid* terminology in favour of
> something that makes more sense for a filesystem.
>
> Mostly this would involve a discussion about what terms would make
> most sense, though some changes in the behaviour of btrfs redundancy
> modes may be warranted if they make things more intuitive.
>
> I could help you make these changes in your patches, or write my own
> patches against yours, though I'm also completely new to kernel
> development.
>

That would inherently solve the need to convert between dup and raid1
as well.  Why those are separate and why dup does not become raid1
when there are N > 1 drives is beyond me.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 15:12   ` Bart Noordervliet
  2010-11-17 17:19     ` Xavier Nicollet
  2010-11-17 17:52     ` Mike Fedyk
@ 2010-11-17 17:56     ` Hugo Mills
  2010-11-17 18:07       ` Gordan Bobic
  2010-11-17 18:14       ` Andreas Philipp
  2 siblings, 2 replies; 15+ messages in thread
From: Hugo Mills @ 2010-11-17 17:56 UTC (permalink / raw)
  To: Bart Noordervliet; +Cc: Hugo Mills, Chris Ball, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2336 bytes --]

On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
> Can I suggest we combine this new RAID level management with a
> modernisation of the terminology for storage redundancy, as has been
> discussed previously in the "Raid1 with 3 drives" thread of March this
> year? I.e. abandon the burdened raid* terminology in favour of
> something that makes more sense for a filesystem.

   Well, our current RAID modes are:

 * 1 Copy ("SINGLE")
 * 2 Copies ("DUP")
 * 2 Copies, different spindles ("RAID1")
 * 1 Copy, 2 Stripes ("RAID0")
 * 2 Copies, 2 Stripes [each] ("RAID10")

   The forthcoming RAID5/6 code will expand on that, with

 * 1 Copy, n Stripes + 1 Parity ("RAID5")
 * 1 Copy, n Stripes + 2 Parity ("RAID6")

   (I'm not certain how "n" will be selected -- it could be a config
option, or simply selected on the basis of the number of
spindles/devices currently in the FS).

   We could further postulate a RAID50/RAID60 mode, which would be

 * 2 Copies, n Stripes + 1 Parity
 * 2 Copies, n Stripes + 2 Parity

   For brevity, we could collapse these names down to: 1C, 2C, 2CR,
1C2S, 2C2S, 1CnS1P, 1CnS2P, 2CnS1P, 2CnS2P. However, that's probably a
bit too condensed for useful readability. I'd support some set of
terms based on this taxonomy, though, as it's fairly extensible, and
tells you the details of the duplication strategy in question.
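
As a toy illustration of that taxonomy (plain userspace C, nothing
btrfs-specific, and it ignores the spindle-distinct "R" variant):

#include <stdio.h>

/* Build a "<copies>C[<stripes>S][<parity>P]" name from the parameters. */
static void profile_name(char *buf, size_t len,
			 int copies, int stripes, int parity)
{
	int off = snprintf(buf, len, "%dC", copies);

	if (stripes > 1)
		off += snprintf(buf + off, len - off, "%dS", stripes);
	if (parity > 0)
		snprintf(buf + off, len - off, "%dP", parity);
}

int main(void)
{
	char name[16];

	profile_name(name, sizeof(name), 2, 1, 0);	/* 2 copies             -> "2C"     */
	printf("%s\n", name);
	profile_name(name, sizeof(name), 1, 4, 1);	/* 4 stripes + 1 parity -> "1C4S1P" */
	printf("%s\n", name);
	profile_name(name, sizeof(name), 2, 4, 2);	/* mirrored, 2 parity   -> "2C4S2P" */
	printf("%s\n", name);
	return 0;
}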

> Mostly this would involve a discussion about what terms would make
> most sense, though some changes in the behaviour of btrfs redundancy
> modes may be warranted if they make things more intuitive.

   Consider the above a first suggestion. :)

> I could help you make these changes in your patches, or write my own
> patches against yours, though I'm also completely new to kernel
> development.

   Probably best to keep the kernel internals unchanged for this
particular issue, as they don't make much difference to the naming,
but patches to the userspace side of things (mkfs.btrfs and btrfs fi
df specifically) should be fairly straightforward.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- <gdb> The enemy have elected for Death by Powerpoint.  That's ---  
                          what they shall get.                           

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 17:56     ` Hugo Mills
@ 2010-11-17 18:07       ` Gordan Bobic
  2010-11-17 18:41         ` Bart Kus
  2010-11-18 14:31         ` Bart Noordervliet
  2010-11-17 18:14       ` Andreas Philipp
  1 sibling, 2 replies; 15+ messages in thread
From: Gordan Bobic @ 2010-11-17 18:07 UTC (permalink / raw)
  To: linux-btrfs

On 11/17/2010 05:56 PM, Hugo Mills wrote:
> On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
>> Can I suggest we combine this new RAID level management with a
>> modernisation of the terminology for storage redundancy, as has been
>> discussed previously in the "Raid1 with 3 drives" thread of March this
>> year? I.e. abandon the burdened raid* terminology in favour of
>> something that makes more sense for a filesystem.
>
>     Well, our current RAID modes are:
>
>   * 1 Copy ("SINGLE")
>   * 2 Copies ("DUP")
>   * 2 Copies, different spindles ("RAID1")
>   * 1 Copy, 2 Stripes ("RAID0")
>   * 2 Copies, 2 Stripes [each] ("RAID10")
>
>     The forthcoming RAID5/6 code will expand on that, with
>
>   * 1 Copy, n Stripes + 1 Parity ("RAID5")
>   * 1 Copy, n Stripes + 2 Parity ("RAID6")
>
>     (I'm not certain how "n" will be selected -- it could be a config
> option, or simply selected on the basis of the number of
> spindles/devices currently in the FS).
>
>     We could further postulate a RAID50/RAID60 mode, which would be
>
>   * 2 Copies, n Stripes + 1 Parity
>   * 2 Copies, n Stripes + 2 Parity

Since BTRFS is already doing some relatively radical things, I would 
like to suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't 
safely usable for arrays bigger than about 5TB with disks that have a 
specified error rate of 10^-14. RAID6 pushes that problem a little 
further away, but in the longer term, I would argue that RAID (n+m) 
would work best. We specify that of (n+m) disks in the array, we want n 
data disks and m redundancy disks. If this is implemented in a generic 
way, then there won't be a need to implement additional RAID modes later.

Gordan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 17:56     ` Hugo Mills
  2010-11-17 18:07       ` Gordan Bobic
@ 2010-11-17 18:14       ` Andreas Philipp
  2010-11-17 18:34         ` Hugo Mills
  1 sibling, 1 reply; 15+ messages in thread
From: Andreas Philipp @ 2010-11-17 18:14 UTC (permalink / raw)
  To: Hugo Mills, Bart Noordervliet, Chris Ball, linux-btrfs


On 17.11.2010 18:56, Hugo Mills wrote:
> On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
>> Can I suggest we combine this new RAID level management with a
>> modernisation of the terminology for storage redundancy, as has been
>> discussed previously in the "Raid1 with 3 drives" thread of March this
>> year? I.e. abandon the burdened raid* terminology in favour of
>> something that makes more sense for a filesystem.
>
> Well, our current RAID modes are:
>
> * 1 Copy ("SINGLE")
> * 2 Copies ("DUP")
> * 2 Copies, different spindles ("RAID1")
> * 1 Copy, 2 Stripes ("RAID0")
> * 2 Copies, 2 Stripes [each] ("RAID10")
>
> The forthcoming RAID5/6 code will expand on that, with
>
> * 1 Copy, n Stripes + 1 Parity ("RAID5")
> * 1 Copy, n Stripes + 2 Parity ("RAID6")
>
> (I'm not certain how "n" will be selected -- it could be a config
> option, or simply selected on the basis of the number of
> spindles/devices currently in the FS).
Just one question on "small n": If one has N = 3*k >= 6 spindles, then
RAID5 with n = N/2-1 results in something like RAID50? So having an
option for "small n" might realize RAID50 given the right choice for n.
>
> We could further postulate a RAID50/RAID60 mode, which would be
>
> * 2 Copies, n Stripes + 1 Parity
> * 2 Copies, n Stripes + 2 Parity
Isn't this RAID51/RAID61 (or 15/16, I'm not sure which way round), and would
RAID50/RAID60 correspond to

* 2 Stripes, n Stripes + 1 Parity
* 2 Stripes, n Stripes + 2 Parity
>
> For brevity, we could collapse these names down to: 1C, 2C, 2CR,
> 1C2S, 2C2S, 1CnS1P, 1CnS2P, 2CnS1P, 2CnS2P. However, that's probably a
> bit too condensed for useful readability. I'd support some set of
> terms based on this taxonomy, though, as it's fairly extensible, and
> tells you the details of the duplication strategy in question.
>
>> Mostly this would involve a discussion about what terms would make
>> most sense, though some changes in the behaviour of btrfs redundancy
>> modes may be warranted if they make things more intuitive.
>
> Consider the above a first suggestion. :)
>
>> I could help you make these changes in your patches, or write my own
>> patches against yours, though I'm also completely new to kernel
>> development.
>
> Probably best to keep the kernel internals unchanged for this
> particular issue, as they don't make much difference to the naming,
> but patches to the userspace side of things (mkfs.btrfs and btrfs fi
> df specifically) should be fairly straightforward.
>
> Hugo.
Andreas


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 18:14       ` Andreas Philipp
@ 2010-11-17 18:34         ` Hugo Mills
  0 siblings, 0 replies; 15+ messages in thread
From: Hugo Mills @ 2010-11-17 18:34 UTC (permalink / raw)
  To: Andreas Philipp; +Cc: Hugo Mills, Bart Noordervliet, Chris Ball, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3469 bytes --]

On Wed, Nov 17, 2010 at 07:14:47PM +0100, Andreas Philipp wrote:
> On 17.11.2010 18:56, Hugo Mills wrote:
> > On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
> >> Can I suggest we combine this new RAID level management with a
> >> modernisation of the terminology for storage redundancy, as has been
> >> discussed previously in the "Raid1 with 3 drives" thread of March this
> >> year? I.e. abandon the burdened raid* terminology in favour of
> >> something that makes more sense for a filesystem.
> >
> > Well, our current RAID modes are:
> >
> > * 1 Copy ("SINGLE")
> > * 2 Copies ("DUP")
> > * 2 Copies, different spindles ("RAID1")
> > * 1 Copy, 2 Stripes ("RAID0")
> > * 2 Copies, 2 Stripes [each] ("RAID10")
> >
> > The forthcoming RAID5/6 code will expand on that, with
> >
> > * 1 Copy, n Stripes + 1 Parity ("RAID5")
> > * 1 Copy, n Stripes + 2 Parity ("RAID6")
> >
> > (I'm not certain how "n" will be selected -- it could be a config
> > option, or simply selected on the basis of the number of
> > spindles/devices currently in the FS).
> Just one question on "small n": If one has N = 3*k >= 6 spindles, then
> RAID5 with n = N/2-1 results in something like RAID50? So having an
> option for "small n" might realize RAID50 given the right choice for n.

   I see what you're getting at, but actually, that would just be
RAID-5 with small n. It merely happens to spread chunks out over more
spindles than the minimum n+1 required to give you what you asked for.
(See the explanation below for why).

> > We could further postulate a RAID50/RAID60 mode, which would be
> >
> > * 2 Copies, n Stripes + 1 Parity
> > * 2 Copies, n Stripes + 2 Parity
> Isn't this RAID51/RAID61 (or 15/16 unsure on how to put) and would
> RAID50/RAID60 correspond to

   Errr... yes, you're right. My mistake. Although... again, see the
conclusion below. :)

> * 2 Stripes, n Stripes + 1 Parity
> * 2 Stripes, n Stripes + 2 Parity

   I'm not sure talking about RAID50-like things (as you state above)
makes much sense, given the internal data structures that btrfs uses:

   As far as I know(*), data is firstly allocated in chunks of about
1GiB per device. Chunks are grouped together to give you replication.
So, for a RAID-0 or RAID-1 arrangement, chunks are allocated in pairs,
picked from different devices. For RAID-10, they're allocated in
quartets, again on different devices. For RAID-5, they'd be allocated
in groups of n+1. For RAID-61, we'd use 2n+4 chunks in an allocation.

   For replication strategies where it matters (anything other than
DUP, SINGLE, RAID-1 so far), the chunks are then subdivided into
stripes of a fixed width. Data written to the disk is spread across
the stripes in an appropriate manner.

   From this point of view, RAID50 and RAID51 look much the same,
unless the stripe size for the "5" is different to the stripe size for
the "0" or "1". I'm not sure that's the case. If the stripe sizes are
the same, you'll basically get the same layout of data across the 2n+2
chunks -- it's just that (possibly) the internal labels of the chunks
which indicate which bit of data they're holding in the pattern will
be different.

   Hugo.

(*) I could be wrong, hopefully someone will correct me if so.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
       --- A cross? Oy vey, have you picked the wrong vampire! ---       

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 18:07       ` Gordan Bobic
@ 2010-11-17 18:41         ` Bart Kus
  2010-11-18  8:36           ` Gordan Bobic
  2010-11-18 14:31         ` Bart Noordervliet
  1 sibling, 1 reply; 15+ messages in thread
From: Bart Kus @ 2010-11-17 18:41 UTC (permalink / raw)
  To: Gordan Bobic; +Cc: linux-btrfs

On 11/17/2010 10:07 AM, Gordan Bobic wrote:
> On 11/17/2010 05:56 PM, Hugo Mills wrote:
>> On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
>>> Can I suggest we combine this new RAID level management with a
>>> modernisation of the terminology for storage redundancy, as has been
>>> discussed previously in the "Raid1 with 3 drives" thread of March this
>>> year? I.e. abandon the burdened raid* terminology in favour of
>>> something that makes more sense for a filesystem.
>>
>>     Well, our current RAID modes are:
>>
>>   * 1 Copy ("SINGLE")
>>   * 2 Copies ("DUP")
>>   * 2 Copies, different spindles ("RAID1")
>>   * 1 Copy, 2 Stripes ("RAID0")
>>   * 2 Copies, 2 Stripes [each] ("RAID10")
>>
>>     The forthcoming RAID5/6 code will expand on that, with
>>
>>   * 1 Copy, n Stripes + 1 Parity ("RAID5")
>>   * 1 Copy, n Stripes + 2 Parity ("RAID6")
>>
>>     (I'm not certain how "n" will be selected -- it could be a config
>> option, or simply selected on the basis of the number of
>> spindles/devices currently in the FS).
>>
>>     We could further postulate a RAID50/RAID60 mode, which would be
>>
>>   * 2 Copies, n Stripes + 1 Parity
>>   * 2 Copies, n Stripes + 2 Parity
>
> Since BTRFS is already doing some relatively radical things, I would 
> like to suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't 
> safely usable for arrays bigger than about 5TB with disks that have a 
> specified error rate of 10^-14. RAID6 pushes that problem a little 
> further away, but in the longer term, I would argue that RAID (n+m) 
> would work best. We specify that of (n+m) disks in the array, we want 
> n data disks and m redundancy disks. If this is implemented in a 
> generic way, then there won't be a need to implement additional RAID 
> modes later.

Not to throw a wrench in the works, but has anyone given any thought as 
to how to best deal with SSD-based RAIDs?  Normal RAID algorithms will 
maximize synchronized failures of those devices.  Perhaps there's a 
chance here to fix that issue?

I like the RAID n+m mode of thinking though.  It'd also be nice to have 
spares which are spun-down until needed.

Lastly, perhaps there's also a chance here to employ SSD-based caching 
when doing RAID, as is done in the most recent RAID controllers?  
Exposure to media failures in the SSD does make me nervous about that 
though.  Does anyone know if those controllers write some sort of extra 
data to the SSD for redundancy/error recovery purposes?

--Bart


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 18:41         ` Bart Kus
@ 2010-11-18  8:36           ` Gordan Bobic
  0 siblings, 0 replies; 15+ messages in thread
From: Gordan Bobic @ 2010-11-18  8:36 UTC (permalink / raw)
  To: linux-btrfs

Bart Kus wrote:
> On 11/17/2010 10:07 AM, Gordan Bobic wrote:
>> On 11/17/2010 05:56 PM, Hugo Mills wrote:
>>> On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
>>>> Can I suggest we combine this new RAID level management with a
>>>> modernisation of the terminology for storage redundancy, as has been
>>>> discussed previously in the "Raid1 with 3 drives" thread of March this
>>>> year? I.e. abandon the burdened raid* terminology in favour of
>>>> something that makes more sense for a filesystem.
>>>
>>>     Well, our current RAID modes are:
>>>
>>>   * 1 Copy ("SINGLE")
>>>   * 2 Copies ("DUP")
>>>   * 2 Copies, different spindles ("RAID1")
>>>   * 1 Copy, 2 Stripes ("RAID0")
>>>   * 2 Copies, 2 Stripes [each] ("RAID10")
>>>
>>>     The forthcoming RAID5/6 code will expand on that, with
>>>
>>>   * 1 Copy, n Stripes + 1 Parity ("RAID5")
>>>   * 1 Copy, n Stripes + 2 Parity ("RAID6")
>>>
>>>     (I'm not certain how "n" will be selected -- it could be a config
>>> option, or simply selected on the basis of the number of
>>> spindles/devices currently in the FS).
>>>
>>>     We could further postulate a RAID50/RAID60 mode, which would be
>>>
>>>   * 2 Copies, n Stripes + 1 Parity
>>>   * 2 Copies, n Stripes + 2 Parity
>>
>> Since BTRFS is already doing some relatively radical things, I would 
>> like to suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't 
>> safely usable for arrays bigger than about 5TB with disks that have a 
>> specified error rate of 10^-14. RAID6 pushes that problem a little 
>> further away, but in the longer term, I would argue that RAID (n+m) 
>> would work best. We specify that of (n+m) disks in the array, we want 
>> n data disks and m redundancy disks. If this is implemented in a 
>> generic way, then there won't be a need to implement additional RAID 
>> modes later.
> 
> Not to throw a wrench in the works, but has anyone given any thought as 
> to how to best deal with SSD-based RAIDs?  Normal RAID algorithms will 
> maximize synchronized failures of those devices.  Perhaps there's a 
> chance here to fix that issue?

The wear-out failure of SSDs (the exact failure you are talking about) is 
very predictable. Current generation of SSDs provide a reading via SMART 
of how much life (in %) there is left in the SSD. When this gets down to 
single figures, the disks should be replaced. Provided that the disks 
are correctly monitored, it shouldn't be an issue.

On a related issue, I am not convinced that wear-out based SSD failure 
is an issue provided that:

1) there is at least a rudimentary amount of wear leveling done in the 
firmware. This is the case even for cheap CF/SD card media, and is not 
hard to implement. And considering I recently got a number of cheap-ish 
32GB CF cards with lifetime warranty, it's safe to assume they will have 
wear leveling built in, or Kingston will rue the day they sold them with 
lifetime warranty. ;)

2) Reasonable effort is made to not put write-heavy things onto SSDs 
(think /tmp, /var/tmp, /var/lock, /var/run, swap, etc.). These can 
safely be put on tmpfs instead, and for swap you can use ramzswap 
(compcache). You'll both get better performance and prolong the life of 
the SSD significantly. Switching off atime on the FS helps a lot, too. 
And switching off journaling can make a difference of over 50% on 
metadata-heavy operations.

And assuming that you write 40GB of data per day to your 40GB SSD 
(unlikely for most applications), you'll still get a 10,000 day life 
expectancy on that disk. That's 30 years. Does anyone still use any 
disks from 30 years ago? What about 20 years ago? 10? RAM and storage 
capacities have grown by about 10x in the last 10 years. It seems 
unlikely that our current generation of SSDs will still be useful in 10 
years' time, let alone 30.

> I like the RAID n+m mode of thinking though.  It'd also be nice to have 
> spares which are spun-down until needed.
 >
> Lastly, perhaps there's also a chance here to employ SSD-based caching 
> when doing RAID, as is done in the most recent RAID controllers?  

Tiered storage capability would be nice. What would it take to keep 
statistics on how frequently various file blocks are accessed, and put 
the most frequently accessed file blocks on SSD? It would be nice if 
this could be done by the accesses/day with some reasonable limit on the 
number of days over which accesses are considered.
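
A back-of-envelope sketch of that "accesses per day over a limited
horizon" statistic, with all names invented for illustration:

#include <stdio.h>
#include <math.h>
#include <time.h>

#define HORIZON_DAYS 7.0	/* accesses much older than a week stop mattering */

struct block_heat {
	double score;		/* exponentially decayed access count */
	time_t last_access;
};

/* Decay the old score by the elapsed time, then count this access. */
static void record_access(struct block_heat *h, time_t now)
{
	double days = difftime(now, h->last_access) / 86400.0;

	h->score = h->score * exp(-days / HORIZON_DAYS) + 1.0;
	h->last_access = now;
}

int main(void)
{
	struct block_heat h = { 0.0, time(NULL) };

	record_access(&h, time(NULL) + 3600);	/* a second access an hour later */
	printf("score: %.3f\n", h.score);
	return 0;
}

Blocks with the highest scores would be the candidates for migration to
the SSD tier.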

> Exposure to media failures in the SSD does make me nervous about that 
> though.

You'd need a pretty substantial churn rate for that to happen quickly. 
With the caching strategy I described above, churn should be much lower 
than the naive LRU while providing a much better overall hit rate.

> Does anyone know if those controllers write some sort of extra 
> data to the SSD for redundancy/error recovery purposes?

SSDs handle that internally. The predictability of failures due to 
wear-out on SSDs makes this relatively easy to handle.

Another thing that would be nice to have - defrag with ability to 
specify where particular files should be kept. One thing I've been 
pondering writing for ext2 when I have a month of spare time is a defrag 
utility that can be passed an ordered list of files to put at the very 
front of the disk.

Such a list could easily be generated using inotify. This would log all 
file accesses during the boot/login process. Defragging the disk in such 
a way that all files read-accessed from the disk are laid out 
sequentially with no gaps at the front of the disk would ensure that 
boot times are actually faster than on an SSD*.
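
A minimal sketch of such an inotify-based access logger (it watches a
single directory; a real tool would add watches recursively during boot):

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(int argc, char **argv)
{
	char buf[4096] __attribute__((aligned(8)));
	ssize_t len;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <dir>\n", argv[0]);
		return 1;
	}
	fd = inotify_init();
	if (fd < 0 || inotify_add_watch(fd, argv[1], IN_OPEN | IN_ACCESS) < 0) {
		perror("inotify");
		return 1;
	}
	while ((len = read(fd, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (p < buf + len) {
			struct inotify_event *ev = (struct inotify_event *)p;

			if (ev->len)
				printf("%s/%s\n", argv[1], ev->name);
			p += sizeof(*ev) + ev->len;
		}
	}
	return 0;
}

The resulting list, ordered by first access, is what the defrag utility
would be fed.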

*Access time on a decent SSD is about 100us. With pre-fetch on a 
rotating disk, most, if not all, of the data that is going to be 
accessed will already be cached by the time we even ask for it, so this 
might actually provide even higher performance.

Gordan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 18:07       ` Gordan Bobic
  2010-11-17 18:41         ` Bart Kus
@ 2010-11-18 14:31         ` Bart Noordervliet
  2010-11-18 15:02           ` Justin Ossevoort
  2010-11-18 15:06           ` Gordan Bobic
  1 sibling, 2 replies; 15+ messages in thread
From: Bart Noordervliet @ 2010-11-18 14:31 UTC (permalink / raw)
  To: Gordan Bobic; +Cc: linux-btrfs

On Wed, Nov 17, 2010 at 19:07, Gordan Bobic <gordan@bobich.net> wrote:
> Since BTRFS is already doing some relatively radical things, I would like to
> suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't safely usable
> for arrays bigger than about 5TB with disks that have a specified error rate
> of 10^-14. RAID6 pushes that problem a little further away, but in the
> longer term, I would argue that RAID (n+m) would work best. We specify that
> of (n+m) disks in the array, we want n data disks and m redundancy disks. If
> this is implemented in a generic way, then there won't be a need to
> implement additional RAID modes later.

I presume you're talking about the uncaught read errors that make
many people avoid RAID5. Btrfs actually enables us to use it with
confidence again, since using checksums it's able to detect these
errors and prevent corruption of the array. So to the contrary, I see
a lot of potential for parity-based redundancy in combination with
btrfs.

Regards,

Bart

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-18 14:31         ` Bart Noordervliet
@ 2010-11-18 15:02           ` Justin Ossevoort
  2010-11-18 15:06           ` Gordan Bobic
  1 sibling, 0 replies; 15+ messages in thread
From: Justin Ossevoort @ 2010-11-18 15:02 UTC (permalink / raw)
  To: Bart Noordervliet; +Cc: Gordan Bobic, linux-btrfs

On 18/11/10 15:31, Bart Noordervliet wrote:
> On Wed, Nov 17, 2010 at 19:07, Gordan Bobic <gordan@bobich.net> wrote:
>> Since BTRFS is already doing some relatively radical things, I would like to
>> suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't safely usable
>> for arrays bigger than about 5TB with disks that have a specified error rate
>> of 10^-14. RAID6 pushes that problem a little further away, but in the
>> longer term, I would argue that RAID (n+m) would work best. We specify that
>> of (n+m) disks in the array, we want n data disks and m redundancy disks. If
>> this is implemented in a generic way, then there won't be a need to
>> implement additional RAID modes later.
> 
> I presume you're talking about the uncaught read errors that make
> many people avoid RAID5. Btrfs actually enables us to use it with
> confidence again, since using checksums it's able to detect these
> errors and prevent corruption of the array. So to the contrary, I see
> a lot of potential for parity-based redundancy in combination with
> btrfs.


No, he's talking about the high chance of triggering another error during
the long time it takes to perform the recovery (and before your data is
redundant again). This is often also attributed to multiple disks being
from the same batch and having the same flaws and life expectancy.

But since btrfs would do this on a per-object basis instead of across the
whole array, only the objects whose blocks have gone are at risk (not
necessarily the whole filesystem). Furthermore, additional read errors
often only impact a subset of the files that were at risk. And if recovery
is half-way done when another error is triggered, the part already done
will still be available.

So the real strength is that corruptions are more likely only to impact
a small subset of the filesystem and that different objects can have
different amount of redundancy. So 'raid1' for metadata and other very
important files, no raid for unimportant data and raid5/6 for large
objects or for objects which only need a basic level of protection.

Regards,

    justin....

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-18 14:31         ` Bart Noordervliet
  2010-11-18 15:02           ` Justin Ossevoort
@ 2010-11-18 15:06           ` Gordan Bobic
  1 sibling, 0 replies; 15+ messages in thread
From: Gordan Bobic @ 2010-11-18 15:06 UTC (permalink / raw)
  To: linux-btrfs

Bart Noordervliet wrote:
> On Wed, Nov 17, 2010 at 19:07, Gordan Bobic <gordan@bobich.net> wrote:
>> Since BTRFS is already doing some relatively radical things, I would like to
>> suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't safely usable
>> for arrays bigger than about 5TB with disks that have a specified error rate
>> of 10^-14. RAID6 pushes that problem a little further away, but in the
>> longer term, I would argue that RAID (n+m) would work best. We specify that
>> of (n+m) disks in the array, we want n data disks and m redundancy disks. If
>> this is implemented in a generic way, then there won't be a need to
>> implement additional RAID modes later.
> 
> I presume you're talking about the uncaught read errors that make
> many people avoid RAID5. Btrfs actually enables us to use it with
> confidence again, since using checksums it's able to detect these
> errors and prevent corruption of the array. So to the contrary, I see
> a lot of potential for parity-based redundancy in combination with
> btrfs.

No. What I'm talking about is the probability of finding an error 
during the process of rebuilding a degraded array. With a 6TB (usable) 
array and disks with 10^-14 error rate, the probability of getting an 
unrecoverable read error exceeds 50%. n+1 RAID isn't fit for use with 
the current generation of drives where n > 1-5TB depending on how 
important your data and downtime are and how good your backups are.
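
(A rough back-of-envelope version of that figure: rebuilding a 6TB array
means reading about 4.8x10^13 bits; at one unrecoverable error per 10^14
bits that is an expected ~0.5 errors per rebuild -- close to a coin-flip
that at least one sector comes back unreadable. The exact probability,
1 - (1 - 10^-14)^(4.8x10^13), is about 38%, and it climbs past 50% for
arrays in the 8-10TB range, sooner if the real-world rate is worse than
the datasheet.)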

And I don't put much stock in the manufacturer figures, either, so 
assume that the quoted 10^-14 is optimistic. On high-capacity drives 
(especially 1TB Seagates, both 3- and 4-platter variants) I am 
certainly seeing a higher error rate than that on a significant 
fraction of the disks.

Gordan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update to Project_ideas wiki page
  2010-11-17 14:31 ` Hugo Mills
  2010-11-17 15:12   ` Bart Noordervliet
@ 2010-11-26 14:57   ` Paul Komkoff
  1 sibling, 0 replies; 15+ messages in thread
From: Paul Komkoff @ 2010-11-26 14:57 UTC (permalink / raw)
  To: Hugo Mills, Chris Ball, linux-btrfs

On Wed, Nov 17, 2010 at 2:31 PM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:
>> == Changing RAID levels ==
>>
>> We need ioctls to change between different raid levels.  Some of these
>> are quite easy -- e.g. for RAID0 to RAID1, we just halve the available
>> bytes on the fs, then queue a rebalance.

Can we please do it properly? That is, change raid levels on a
per-file, per-tree basis?

Thanks.
-- 
This message represents the official view of the voices in my head

^ permalink raw reply	[flat|nested] 15+ messages in thread

Thread overview: 15+ messages
2010-11-17  3:19 Update to Project_ideas wiki page Chris Ball
2010-11-17 14:31 ` Hugo Mills
2010-11-17 15:12   ` Bart Noordervliet
2010-11-17 17:19     ` Xavier Nicollet
2010-11-17 17:52     ` Mike Fedyk
2010-11-17 17:56     ` Hugo Mills
2010-11-17 18:07       ` Gordan Bobic
2010-11-17 18:41         ` Bart Kus
2010-11-18  8:36           ` Gordan Bobic
2010-11-18 14:31         ` Bart Noordervliet
2010-11-18 15:02           ` Justin Ossevoort
2010-11-18 15:06           ` Gordan Bobic
2010-11-17 18:14       ` Andreas Philipp
2010-11-17 18:34         ` Hugo Mills
2010-11-26 14:57   ` Paul Komkoff
