Some very basic questions


* Some very basic questions
@ 2008-10-21 11:23 Stephan von Krawczynski
  2008-10-21 12:13 ` Andi Kleen
                   ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-21 11:23 UTC (permalink / raw)
  To: linux-btrfs

Hello all,

reading the list for a while it looks like all kinds of implementational
topics are covered but no basic user requests or talks are going on. Since I
have found no other list on vger covering these issues I choose this one,
forgive my ignorance if it is the wrong place.
Like many people on the planet we try to handle quite some amounts of data
(TBs) and try to solve this with several linux-based fileservers.
Years of (mostly bad) experience led us to the following minimum requirements
for a new fs on our servers:

1. filesystem-check
1.1 it should not
    - delay boot process (we have to wait for hours currently)
    - prevent mount in case of errors
    - be a part of the mount process at all
    - always check the whole fs
1.2 it should be able 
    - to always be started interactively by user
    - to check parts/subtrees of the fs
    - to run purely informational (reporting, non-modifying)
    - to run on a mounted fs
2. general requirements
    - fs errors without file/dir names are useless
    - errors in parts of the fs are no reason for a fs to go offline as a whole
    - mounting must not delay the system startup significantly
    - resizing during runtime (up and down)
    - parallel mounts (very important!)
      (two or more hosts mount the same fs concurrently for reading and
      writing)
    - journaling
    - versioning (file and dir)
    - undelete (file and dir)
    - snapshots
    - run into hd errors more than once for the same file (as an option)
    - map out dead blocks
      (and of course display of the currently mapped out list)
    - no size limitations (more or less)
    - performant handling of large numbers of files inside single dirs
      (to check that use > 100.000 files in a dir, understand that it is
      no good idea to spread inode-blocks over the whole hd because of seek
      times)
    - power loss at any time must not corrupt the fs (atomic fs modification)
      (new-data loss is acceptable)
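
The large-directory check in the list above (more than 100.000 files in a
single dir) could be scripted roughly like this. It is only a sketch under
stated assumptions: the function name, the file naming scheme and the choice
to time plain create and stat calls are illustrative, not an established
benchmark.

```python
import os
import tempfile
import time


def time_large_directory(parent, count=100_000):
    """Create `count` empty files in one fresh directory under `parent`,
    then stat each of them. Returns (create_seconds, stat_seconds).
    The directory and its contents are removed afterwards."""
    with tempfile.TemporaryDirectory(dir=parent) as d:
        start = time.perf_counter()
        for i in range(count):
            # Mode "x" creates the file and fails if the name already exists.
            with open(os.path.join(d, f"f{i:07d}"), "x"):
                pass
        created = time.perf_counter() - start

        start = time.perf_counter()
        for i in range(count):
            os.stat(os.path.join(d, f"f{i:07d}"))
        looked_up = time.perf_counter() - start
    return created, looked_up
```

Calling time_large_directory("/mnt/test") on the filesystem under test (the
mount point is a placeholder) and comparing the two timings across
filesystems gives a first impression; it says nothing about seek behaviour
on a cold cache.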

Remember, this is not meant to be a request for features, it is a list that
built up over 10 years of handling data and the failures we experienced. To
our knowledge no fs meets this list, but hey, is that a reason for not talking
about it? Our goal is pretty simple: maximize fs uptime.
How does btrfs match?
-- 
Regards,
Stephan


* Re: Some very basic questions
  2008-10-21 11:23 Some very basic questions Stephan von Krawczynski
@ 2008-10-21 12:13 ` Andi Kleen
  2008-10-21 14:22   ` Stephan von Krawczynski
  2008-10-21 13:20 ` jim owens
  2008-10-21 13:59 ` Chris Mason
  2 siblings, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2008-10-21 12:13 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-btrfs

Stephan von Krawczynski <skraw@ithnet.com> writes:

> reading the list for a while it looks like all kinds of implementational
> topics are covered but no basic user requests or talks are going on. Since I
> have found no other list on vger covering these issues I choose this one,
> forgive my ignorance if it is the wrong place.
> Like many people on the planet we try to handle quite some amounts of data
> (TBs) and try to solve this with several linux-based fileservers.
> Years of (mostly bad) experience led us to the following minimum requirements
> for a new fs on our servers:

If those are the minimum requirements, what are the maximum ones?

Also you realize that some of the requirements (like parallel read/write
aka a full cluster file system) are extremely hard?

Perhaps it would make more sense if you extracted the top 10 items
and ranked them by importance and posted again.

-Andi

-- 
ak@linux.intel.com


* Re: Some very basic questions
  2008-10-21 12:13 ` Andi Kleen
@ 2008-10-21 14:22   ` Stephan von Krawczynski
  2008-10-21 15:34     ` jim owens
  0 siblings, 1 reply; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-21 14:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-btrfs

On Tue, 21 Oct 2008 14:13:33 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> Stephan von Krawczynski <skraw@ithnet.com> writes:
> 
> > reading the list for a while it looks like all kinds of implementational
> > topics are covered but no basic user requests or talks are going on. Since I
> > have found no other list on vger covering these issues I choose this one,
> > forgive my ignorance if it is the wrong place.
> > Like many people on the planet we try to handle quite some amounts of data
> > (TBs) and try to solve this with several linux-based fileservers.
> > Years of (mostly bad) experience led us to the following minimum requirements
> > for a new fs on our servers:
> 
> If those are the minimum requirements, what are the maximum ones?
> 
> Also you realize that some of the requirements (like parallel read/write
> aka a full cluster file system) are extremely hard?
> 
> Perhaps it would make more sense if you extracted the top 10 items
> and ranked them by importance and posted again.

Hello Andi,

thanks for your feedback. Understand "minimum requirement" as "minimum
requirement to drop the current installation and migrate the data to a
new fs platform".
Of course you are right, dealing with multiple/parallel mounts can be quite a
nasty job if the fs was not originally planned with this feature in mind.
On the other hand I cannot really imagine how to deal with TBs of data in the
future without such a feature.
If you look at the big picture the things I mentioned allow you to have
redundant front-ends for the fileservice doing the same or completely
different applications. You can use one mount (host) for tape backup purposes
only without heavy loss in standard file service. You can even mount for
filesystem check purposes, a box that does nothing else but check the
structure and keep you informed what is really going on with your data - and
your data is still in production in the meantime.
Whatever happens you have a real chance of keeping your file service up, even
if parts of your fs go nuts because some underlying hd got partially damaged.
Keeping it up and running is the most important part, performance is only
second on the list.
If you take a close look there are not really 10 different items on my list,
depending on the level of abstraction you prefer, nevertheless:

1) parallel mounts
2) mounting must not delay the system startup significantly
3) errors in parts of the fs are no reason for a fs to go offline as a whole
4) power loss at any time must not corrupt the fs
5) fsck on a mounted fs, interactively, not part of the mount (all fsck
features)
6) journaling
7) undelete (file and dir)
8) resizing during runtime (up and down)
9) snapshots
10) performant handling of large numbers of files inside single dirs

-- 
Regards,
Stephan


* Re: Some very basic questions
  2008-10-21 14:22   ` Stephan von Krawczynski
@ 2008-10-21 15:34     ` jim owens
  2008-10-22 11:36       ` Stephan von Krawczynski
  0 siblings, 1 reply; 79+ messages in thread
From: jim owens @ 2008-10-21 15:34 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-btrfs

Hearing what users think they want is always good, but...

Stephan von Krawczynski wrote:
> 
> thanks for your feedback. Understand "minimum requirement" as "minimum
> requirement to drop the current installation and migrate the data to a
> new fs platform".

I would sure like to know what existing platform and filesystem
you have that you think has all 10 of your features.

> Of course you are right, dealing with multiple/parallel mounts can be quite a
> nasty job if the fs was not originally planned with this feature in mind.
> On the other hand I cannot really imagine how to deal with TBs of data in the
> future without such a feature.
> If you look at the big picture the things I mentioned allow you to have
> redundant front-ends for the fileservice doing the same or completely
> different applications. You can use one mount (host) for tape backup purposes
> only without heavy loss in standard file service. You can even mount for
> filesystem check purposes, a box that does nothing else but check the
> structure and keep you informed what is really going on with your data - and
> your data is still in production in the meantime.
> Whatever happens you have a real chance of keeping your file service up, even
> if parts of your fs go nuts because some underlying hd got partially damaged.
> Keeping it up and running is the most important part, performance is only
> second on the list.
> If you take a close look there are not really 10 different items on my list,
> depending on the level of abstraction you prefer, nevertheless:
> 
> 1) parallel mounts

What I see from that explanation is you have a "system design" idea
using parallel machines to fix problems you have had in the past.
To implement your design, you need a filesystem to fit it.  I think
it is better to just design a filesystem without the problems and
configure the hardware to handle the necessary load.

> 2) mounting must not delay the system startup significantly
> 3) errors in parts of the fs are no reason for a fs to go offline as a whole
> 4) power loss at any time must not corrupt the fs
> 5) fsck on a mounted fs, interactively, not part of the mount (all fsck
> features)

I think all of these are part of the "reliability" goal for btrfs
and when you say "fsck" it is probably misleading if I understand
your real requirement to be the same as my customers:

   - *NO* fsck
   - filesystem design "prevents problems we have had before"
   - filesystem autodetects, isolates, and (possibly) repairs errors
   - online "scan, check, repair filesystem" tool initiated by admin
   - Reliability so high that they never run that check-and-fix tool

Note that I personally have never seen a first release meet
the "no problems, no need to fix" criteria that would obviate
any need for a check/fix tool.

jim


* Re: Some very basic questions
  2008-10-21 15:34     ` jim owens
@ 2008-10-22 11:36       ` Stephan von Krawczynski
  2008-10-22 12:15         ` Avi Kivity
  0 siblings, 1 reply; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-22 11:36 UTC (permalink / raw)
  To: jim owens; +Cc: linux-btrfs

On Tue, 21 Oct 2008 11:34:20 -0400
jim owens <jowens@hp.com> wrote:

> Hearing what users think they want is always good, but...
> 
> Stephan von Krawczynski wrote:
> > 
> > thanks for your feedback. Understand "minimum requirement" as "minimum
> > requirement to drop the current installation and migrate the data to a
> > new fs platform".
> 
> I would sure like to know what existing platform and filesystem
> you have that you think has all 10 of your features.

Obviously none, else I would not speak up and try to find one. :-)

> > [...]
> > 1) parallel mounts
> 
> What I see from that explanation is you have a "system design" idea
> using parallel machines to fix problems you have had in the past.
> To implement your design, you need a filesystem to fit it.

Well, I can hardly deny that. Let's just name the (simple) problem, different
names for the very same thing: uptime, availability, redundancy

>  I think
> it is better to just design a filesystem without the problems and
> configure the hardware to handle the necessary load.

Ok, now you see me astonished. You really think that there is one piece of
software around that is "without problems" ?
My idea of the world is really very different from that:
The world is far from perfect. That is why I try to deploy solutions that have
redundancy for all kinds of problems I can think of and hopefully for a few
that I haven't thought of.

> > 2) mounting must not delay the system startup significantly
> > 3) errors in parts of the fs are no reason for a fs to go offline as a whole
> > 4) power loss at any time must not corrupt the fs
> > 5) fsck on a mounted fs, interactively, not part of the mount (all fsck
> > features)
> 
> I think all of these are part of the "reliability" goal for btrfs
> and when you say "fsck" it is probably misleading if I understand
> your real requirement to be the same as my customers:
> 
>    - *NO* fsck
>    - filesystem design "prevents problems we have had before"
>    - filesystem autodetects, isolates, and (possibly) repairs errors
>    - online "scan, check, repair filesystem" tool initiated by admin
>    - Reliability so high that they never run that check-and-fix tool

That is _wrong_ (to a certain extent). You _want to run_ diagnostic tools to
make sure that there is no problem. And you don't want some software (not even
HAL) to repair errors without prior admin knowledge/permission.

> Note that I personally have never seen a first release meet
> the "no problems, no need to fix" criteria that would obviate
> any need for a check/fix tool.

That really does not depend on the release number of _your_ special software.
Your software always depends on other components (hw or sw) that (can) have
bugs and weird behaviour. And this is the fact: no perfect world, so don't
count on your or others' perfectness. If you do you will fail.

> jim

-- 
Regards,
Stephan


* Re: Some very basic questions
  2008-10-22 11:36       ` Stephan von Krawczynski
@ 2008-10-22 12:15         ` Avi Kivity
  2008-10-22 13:03           ` Ric Wheeler
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 12:15 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: jim owens, linux-btrfs

Stephan von Krawczynski wrote:
>
>>    - filesystem autodetects, isolates, and (possibly) repairs errors
>>    - online "scan, check, repair filesystem" tool initiated by admin
>>    - Reliability so high that they never run that check-and-fix tool
>>     
>
> That is _wrong_ (to a certain extent). You _want to run_ diagnostic tools to
> make sure that there is no problem. And you don't want some software (not even
> HAL) to repair errors without prior admin knowledge/permission

I think there's a place for a scrubber to continuously verify filesystem 
data and metadata, at low io priority, and correct any correctable 
errors.  The admin can read the error correction report at their 
leisure, and then take any action that's outside the filesystem's 
capabilities (like ordering and installing new disks).

-- 
error compiling committee.c: too many arguments to function



* Re: Some very basic questions
  2008-10-22 12:15         ` Avi Kivity
@ 2008-10-22 13:03           ` Ric Wheeler
  2008-10-22 13:13             ` Chris Mason
  2008-10-22 13:16             ` Avi Kivity
  0 siblings, 2 replies; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 13:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Stephan von Krawczynski, jim owens, linux-btrfs

Avi Kivity wrote:
> Stephan von Krawczynski wrote:
>>
>>>    - filesystem autodetects, isolates, and (possibly) repairs errors
>>>    - online "scan, check, repair filesystem" tool initiated by admin
>>>    - Reliability so high that they never run that check-and-fix tool
>>>     
>>
>> That is _wrong_ (to a certain extent). You _want to run_ diagnostic 
>> tools to
>> make sure that there is no problem. And you don't want some software 
>> (not even
>> HAL) to repair errors without prior admin knowledge/permission
>
> I think there's a place for a scrubber to continuously verify 
> filesystem data and metadata, at low io priority, and correct any 
> correctable errors.  The admin can read the error correction report at 
> their leisure, and then take any action that's outside the 
> filesystem's capabilities (like ordering and installing new disks).
>
Scrubbing is key for many scenarios since errors can "grow" even in 
places where previous IO has been completed without flagging an error.

Some neat tricks are:

    (1) use block level scrubbing to detect any media errors. If you can 
map that sector level error into a file system object (meta data, file 
data or unallocated space), tools can recover (fsck, get another copy of 
the file or just ignore it!). There is a special command called 
"READ_VERIFY" that can be used to validate the sectors without actually 
moving data from the target to the host, so you can scrub without 
consuming page cache, etc.

    (2) sign and validate the object at the file level, say by 
validating a digital signature. This can catch high level errors (say 
the app messed up).

Note that this scrubbing needs to be carefully tuned to not interfere 
with the foreground workload, using something like IO nice or the other 
IO controllers being kicked about might help :-)
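
Trick (2) can be illustrated with a small user-space sketch. It is only an
assumption-laden example: it uses plain SHA-256 digests kept in a manifest
instead of real digital signatures, the helper names are made up, and it
reads through the page cache (unlike the READ_VERIFY approach in (1)).

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map every regular file under `root` (relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest


def scrub(root, manifest):
    """Re-read every file named in the manifest and return the relative
    paths that are unreadable or whose content no longer matches.
    Purely informational: nothing on disk is modified."""
    bad = []
    for rel, expected in sorted(manifest.items()):
        try:
            actual = file_digest(os.path.join(root, rel))
        except OSError:
            bad.append(rel)  # missing file or I/O error while reading
            continue
        if actual != expected:
            bad.append(rel)
    return bad
```

A file that was changed on purpose after the manifest was built shows up as
"bad" too, so a real tool would have to refresh the stored digest on every
legitimate write.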

ric



* Re: Some very basic questions
  2008-10-22 13:03           ` Ric Wheeler
@ 2008-10-22 13:13             ` Chris Mason
  2008-10-22 13:16             ` Avi Kivity
  1 sibling, 0 replies; 79+ messages in thread
From: Chris Mason @ 2008-10-22 13:13 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Avi Kivity, Stephan von Krawczynski, jim owens, linux-btrfs

On Wed, 2008-10-22 at 09:03 -0400, Ric Wheeler wrote:
> Avi Kivity wrote:
> > Stephan von Krawczynski wrote:
> >>
> >>>    - filesystem autodetects, isolates, and (possibly) repairs errors
> >>>    - online "scan, check, repair filesystem" tool initiated by admin
> >>>    - Reliability so high that they never run that check-and-fix tool
> >>>     
> >>
> >> That is _wrong_ (to a certain extent). You _want to run_ diagnostic 
> >> tools to
> >> make sure that there is no problem. And you don't want some software 
> >> (not even
> >> HAL) to repair errors without prior admin knowledge/permission
> >
> > I think there's a place for a scrubber to continuously verify 
> > filesystem data and metadata, at low io priority, and correct any 
> > correctable errors.  The admin can read the error correction report at 
> > their leisure, and then take any action that's outside the 
> > filesystem's capabilities (like ordering and installing new disks).
> >
> Scrubbing is key for many scenarios since errors can "grow" even in 
> places where previous IO has been completed without flagging an error.
> 

We'll definitely have background scrubbing.  It is a key part of the
health of the FS I think.

-chris




* Re: Some very basic questions
  2008-10-22 13:03           ` Ric Wheeler
  2008-10-22 13:13             ` Chris Mason
@ 2008-10-22 13:16             ` Avi Kivity
  1 sibling, 0 replies; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 13:16 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Stephan von Krawczynski, jim owens, linux-btrfs

Ric Wheeler wrote:
> Scrubbing is key for many scenarios since errors can "grow" even in 
> places where previous IO has been completed without flagging an error.
>
> Some neat tricks are:
>
>    (1) use block level scrubbing to detect any media errors. If you 
> can map that sector level error into a file system object (meta data, 
> file data or unallocated space), tools can recover (fsck, get another 
> copy of the file or just ignore it!). There is a special command 
> called "READ_VERIFY" that can be used to validate the sectors without 
> actually moving data from the target to the host, so you can scrub 
> without consuming page cache, etc.
>

This has the disadvantage of not catching errors that were introduced 
while writing; the very errors that btrfs checksums can catch.

>    (2) sign and validate the object at the file level, say by 
> validating a digital signature. This can catch high level errors (say 
> the app messed up).

Btrfs extent-level checksums can be used for this.  This is just below 
the application level, but good enough IMO.

> Note that this scrubbing needs to be carefully tuned to not interfere 
> with the foreground workload, using something like IO nice or the 
> other IO controllers being kicked about might help :-)

Right.  Further, reading the disk by logical block order will help 
reduce seeks.  Btrfs's back references, if cached properly, will help 
with this as well.
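
To make the seek argument concrete, here is a toy sketch. The Extent tuple
and both helpers are hypothetical and have nothing to do with the actual
btrfs data structures; the model simply counts how far the head travels
between consecutive reads.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    owner: str   # file or metadata object the extent belongs to
    start: int   # first logical block on the device
    length: int  # number of blocks


def scrub_order(extents):
    """Return the extents sorted by starting block, i.e. in device order."""
    return sorted(extents, key=lambda e: e.start)


def head_travel(extents):
    """Total distance in blocks between the end of one read and the start
    of the next, beginning at block 0."""
    position = 0
    travel = 0
    for e in extents:
        travel += abs(e.start - position)
        position = e.start + e.length
    return travel


extents = [Extent("a", 9000, 10), Extent("b", 100, 10), Extent("c", 5000, 10)]
print(head_travel(extents))               # 22800 in the order given
print(head_travel(scrub_order(extents)))  # 8980 in block order
```

Sorting by start block is where back references would come in: the scrubber
walks the device in block order and needs a cheap way to find out which
object owns each extent it has just verified.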

-- 
error compiling committee.c: too many arguments to function



* Re: Some very basic questions
  2008-10-21 11:23 Some very basic questions Stephan von Krawczynski
  2008-10-21 12:13 ` Andi Kleen
@ 2008-10-21 13:20 ` jim owens
  2008-10-21 17:01   ` Stephan von Krawczynski
  2008-10-21 13:59 ` Chris Mason
  2 siblings, 1 reply; 79+ messages in thread
From: jim owens @ 2008-10-21 13:20 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-btrfs

btrfs has many of the same goals... but they are goals not code
so when you might see them is indeterminate.

I believe these should not be in btrfs:

Stephan von Krawczynski wrote:

>     - parallel mounts (very important!)

as Andi said, you want a cluster or distributed fs.  There
are layered designs (CRFS or network filesystems) that can do
the job and trying to do it in btrfs causes too many problems.

>     - journaling

I assume you *do not* mean metadata journaling, you mean
sending all file updates to a single output stream (as in one
disk, tape, or network link).  I've done that, but would not
recommend it in btrfs because it limits the total fs bandwidth
to what the single stream can support.  This is normally done
today by applications like databases, not in the filesystem.

>     - map out dead blocks

Useless... a waste of time, code, and metadata structures.
With current device technology, any device reporting bad blocks
the device cannot map out is about to die and needs to be replaced!

jim


* Re: Some very basic questions
  2008-10-21 13:20 ` jim owens
@ 2008-10-21 17:01   ` Stephan von Krawczynski
  2008-10-21 17:15     ` Christoph Hellwig
  0 siblings, 1 reply; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-21 17:01 UTC (permalink / raw)
  To: jim owens; +Cc: linux-btrfs

On Tue, 21 Oct 2008 09:20:16 -0400
jim owens <jowens@hp.com> wrote:

> btrfs has many of the same goals... but they are goals not code
> so when you might see them is indeterminate.

no big issue, my pension is 20 years away, I got time ;-)

> I believe these should not be in btrfs:
> 
> Stephan von Krawczynski wrote:
> 
> >     - parallel mounts (very important!)
> 
> as Andi said, you want a cluster or distributed fs.  There
> are layered designs (CRFS or network filesystems) that can do
> the job and trying to do it in btrfs causes too many problems.

question is: if you had such an implementation, would you expect drawbacks
for the single-mount case? If not I'd vote for it because there are not really
many alternatives "on the market".

> >     - journaling
> 
> I assume you *do not* mean metadata journaling, you mean
> sending all file updates to a single output stream (as in one
> disk, tape, or network link).  I've done that, but would not
> recommend it in btrfs because it limits the total fs bandwidth
> to what the single stream can support.  This is normally done
> today by applications like databases, not in the filesystem.

As far as I know metadata journaling is in, right?
If what you mean is capable of creating live or offline images of the fs you
got me right.

> >     - map out dead blocks
> 
> Useless... a waste of time, code, and metadata structures.
> With current device technology, any device reporting bad blocks
> the device cannot map out is about to die and needs to be replaced!

Sure, but what you say only reflects the ideal world. On a file service, you
never have that. In fact you do not even have good control over what is going
on. Let's say you have a setup that creates, reads and deletes files 24h a day
from numerous clients. At two o'clock in the morning some hd decides to
partially die. Files get created on it, fill data up to errors, get
deleted and another bunch of data arrives and yet again fs tries to allocate
the same dead areas. You lose a lot more data only because the fs did not map
out the already known dead blocks. Of course you would replace the dead drive
later on, but in the meantime you have a lot of fun.
In other words: give me a tool to freeze the world right at the time the
errors show up, or map out dead blocks (only because it is a lot easier).
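
What I mean by mapping out can be shown with a toy allocator. This is just a
sketch with made-up names, not a proposal for how btrfs should do it: once a
block has produced an error it goes onto a list and is never handed out
again, even after the file that hit it has been deleted.

```python
class BlockAllocator:
    """Toy free-block allocator that remembers known-bad blocks."""

    def __init__(self, total_blocks):
        self.free = set(range(total_blocks))
        self.bad = set()

    def allocate(self):
        """Hand out the lowest-numbered free block."""
        if not self.free:
            raise RuntimeError("no usable blocks left")
        block = min(self.free)
        self.free.remove(block)
        return block

    def release(self, block):
        """Return a block to the free pool unless it has been mapped out."""
        if block not in self.bad:
            self.free.add(block)

    def mark_bad(self, block):
        """Map out a block after an I/O error so it is never reused."""
        self.bad.add(block)
        self.free.discard(block)

    def mapped_out(self):
        """The currently mapped-out list, for display to the admin."""
        return sorted(self.bad)


alloc = BlockAllocator(4)
b = alloc.allocate()      # block 0
alloc.mark_bad(b)         # a write to block 0 failed
alloc.release(b)          # the file gets deleted; block 0 stays mapped out
print(alloc.allocate())   # 1, not 0 again
print(alloc.mapped_out()) # [0]
```

Without the bad list, release() would put block 0 straight back into the
free pool and the next file created would run into the same error.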

> jim

-- 
Regards,
Stephan


* Re: Some very basic questions
  2008-10-21 17:01   ` Stephan von Krawczynski
@ 2008-10-21 17:15     ` Christoph Hellwig
  2008-10-21 17:31       ` Ric Wheeler
  2008-10-22 11:40       ` Stephan von Krawczynski
  0 siblings, 2 replies; 79+ messages in thread
From: Christoph Hellwig @ 2008-10-21 17:15 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: jim owens, linux-btrfs

On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
> Sure, but what you say only reflects the ideal world. On a file service, you
> never have that. In fact you do not even have good control over what is going
> on. Let's say you have a setup that creates, reads and deletes files 24h a day
> from numerous clients. At two o'clock in the morning some hd decides to
> partially die. Files get created on it, fill data up to errors, get
> deleted and another bunch of data arrives and yet again fs tries to allocate
> the same dead areas. You lose a lot more data only because the fs did not map
> out the already known dead blocks. Of course you would replace the dead drive
> later on, but in the meantime you have a lot of fun.
> In other words: give me a tool to freeze the world right at the time the
> errors show up, or map out dead blocks (only because it is a lot easier).

When modern disks can't solve the problems with their internal driver
remapping anymore you better replace it ASAP as it is a very strong
disk failure indication.  Last year's FAST has some very interesting
statistics showing this in the field.


* Re: Some very basic questions
  2008-10-21 17:15     ` Christoph Hellwig
@ 2008-10-21 17:31       ` Ric Wheeler
  2008-10-22 12:27         ` Stephan von Krawczynski
  2008-10-22 11:40       ` Stephan von Krawczynski
  1 sibling, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-21 17:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Stephan von Krawczynski, jim owens, linux-btrfs

Christoph Hellwig wrote:
> On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
>   
>> Sure, but what you say only reflects the ideal world. On a file service, you
>> never have that. In fact you do not even have good control over what is going
>> on. Let's say you have a setup that creates, reads and deletes files 24h a day
>> from numerous clients. At two o'clock in the morning some hd decides to
>> partially die. Files get created on it, fill data up to errors, get
>> deleted and another bunch of data arrives and yet again fs tries to allocate
>> the same dead areas. You lose a lot more data only because the fs did not map
>> out the already known dead blocks. Of course you would replace the dead drive
>> later on, but in the meantime you have a lot of fun.
>> In other words: give me a tool to freeze the world right at the time the
>> errors show up, or map out dead blocks (only because it is a lot easier).
>>     
>
> When modern disks can't solve the problems with their internal driver
> remapping anymore you better replace it ASAP as it is a very strong
> disk failure indication.  Last year's FAST has some very interesting
> statistics showing this in the field.
>   

Doing proactive drive pulls is kind of a black art, but looking for 
*lots* of remapped sectors is always a pretty reliable clue. Note that 
modern S-ATA disks might have room to remap 2-3 thousand sectors, so you 
should not worry too much about a handful (say 20 or so). Sometimes the 
remapping happens because of transient things (junk on the platter, 
vibrations, out of spec temperature range, etc) so your drive might be 
perfectly healthy.

If you have remapped a big chunk of the sectors (say more than 10%), you 
should grab the data off the disk asap and replace it. Worry less about 
errors during read, writes indicate more serious errors.
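
As a rough sketch, the rule of thumb above could be written down like this.
The function name and the three labels are invented for illustration, and
reading "more than 10%" as 10% of the drive's spare remap area is my
assumption here; the default of 2000 spare sectors is the low end of the
2-3 thousand mentioned above.

```python
def remap_advice(remapped, spare_capacity=2000, handful=20,
                 replace_fraction=0.10):
    """Classify a drive by its remapped-sector count.

    remapped         -- sectors the drive has remapped so far
    spare_capacity   -- assumed size of the drive's spare remap area
    handful          -- counts up to this are not worth worrying about
    replace_fraction -- above this share of the spare area, replace
    """
    if remapped < 0 or spare_capacity <= 0:
        raise ValueError("counts must be non-negative, capacity positive")
    if remapped <= handful:
        return "ok"
    if remapped > replace_fraction * spare_capacity:
        return "replace"
    return "watch"


print(remap_advice(20))   # ok
print(remap_advice(150))  # watch
print(remap_advice(250))  # replace
```

The actual remapped count would come from the drive's SMART data; how that
is read is left out here.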

The file system should not have to worry about remapping sectors 
internally, by the time writes fail and you have consumed all remapped 
sectors, you should definitely be in read-only mode and well on the way 
to replacing the disk :-)

ric



* Re: Some very basic questions
  2008-10-21 17:31       ` Ric Wheeler
@ 2008-10-22 12:27         ` Stephan von Krawczynski
  2008-10-22 13:15           ` Chris Mason
  0 siblings, 1 reply; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-22 12:27 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Christoph Hellwig, jim owens, linux-btrfs

On Tue, 21 Oct 2008 13:31:37 -0400
Ric Wheeler <ricwheeler@gmail.com> wrote:

> [...]
> If you have remapped a big chunk of the sectors (say more than 10%), you 
> should grab the data off the disk asap and replace it. Worry less about 
> errors during read, writes indicate more serious errors.

Ok, now for the bad news: money is invented.
If you replace a disk before real failure you won't get replacement from the
manufacturer. That may sound irrelevant to someone handling 5 disks, but is
significant if handling 500 or more. The replacement rate is indeed much
higher than people think from their home pcs.

> [...]
> ric

-- 
Regards,
Stephan


* Re: Some very basic questions
  2008-10-22 12:27         ` Stephan von Krawczynski
@ 2008-10-22 13:15           ` Chris Mason
  2008-10-22 13:27             ` Ric Wheeler
  2008-10-22 13:52             ` Stephan von Krawczynski
  0 siblings, 2 replies; 79+ messages in thread
From: Chris Mason @ 2008-10-22 13:15 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Ric Wheeler, Christoph Hellwig, jim owens, linux-btrfs

On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote:
> On Tue, 21 Oct 2008 13:31:37 -0400
> Ric Wheeler <ricwheeler@gmail.com> wrote:
> 
> > [...]
> > If you have remapped a big chunk of the sectors (say more than 10%), you 
> > should grab the data off the disk asap and replace it. Worry less about 
> > errors during read, writes indicate more serious errors.
> 
> Ok, now for the bad news: money is invented.
> If you replace a disk before real failure you won't get replacement from the
> manufacturer. That may sound irrelevant to someone handling 5 disks, but is
> significant if handling 500 or more. The replacement rate is indeed much
> higher than people think from their home pcs.

Hardware vendors already do replace disks based on policies defined by
their own array hardware.  These are already predictive.

-chris






* Re: Some very basic questions
  2008-10-22 13:15           ` Chris Mason
@ 2008-10-22 13:27             ` Ric Wheeler
  2008-10-22 14:32               ` Avi Kivity
  2008-10-22 13:52             ` Stephan von Krawczynski
  1 sibling, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 13:27 UTC (permalink / raw)
  To: Chris Mason
  Cc: Stephan von Krawczynski, Christoph Hellwig, jim owens,
	linux-btrfs

Chris Mason wrote:
> On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote:
>   
>> On Tue, 21 Oct 2008 13:31:37 -0400
>> Ric Wheeler <ricwheeler@gmail.com> wrote:
>>
>>     
>>> [...]
>>> If you have remapped a big chunk of the sectors (say more than 10%), you 
>>> should grab the data off the disk asap and replace it. Worry less about 
>>> errors during read, writes indicate more serious errors.
>>>       
>> Ok, now for the bad news: money is invented.
>> If you replace a disk before real failure you won't get replacement from the
>> manufacturer. That may sound irrelevant to someone handling 5 disks, but is
>> significant if handling 500 or more. The replacement rate is indeed much
>> higher than people think from their home pcs.
>>     
>
> Hardware vendors already do replace disks based on policies defined by
> their own array hardware.  These are already predictive.
>
> -chris
>
>
>
>   
One key is not to replace the drives too early - you often can recover 
significant amounts of data from a drive that is on its last legs. This 
can be useful even in RAID rebuilds since with today's enormous drive 
capacities, you might hit a latent error during the rebuild on one of 
the presumed healthy drives.
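
To put a rough number on the latent-error risk during a rebuild, here is a
minimal sketch. It assumes an unrecoverable read error rate of 1 per 10^14
bits read (a commonly quoted datasheet figure, not something stated in this
thread), independent errors per bit, and 1 TB drives; the function name
p_latent_error is hypothetical.

```python
import math

def p_latent_error(capacity_bytes, drives_read, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error while reading
    `drives_read` full drives, assuming independent errors per bit."""
    bits = capacity_bytes * 8 * drives_read
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# Assumed example: rebuilding a 5+1 RAID5 of 1 TB drives reads all 5 survivors.
p = p_latent_error(1e12, 5)
print(f"chance of a latent error during rebuild: {p:.1%}")
```

Under these assumptions the risk grows with both drive capacity and the
number of surviving drives that must be read in full, which is the reason a
half-dead drive can still be worth keeping around as a second data source.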

Of course, if you don't have a spare drive in your configuration, this 
is not practical...

ric




* Re: Some very basic questions
  2008-10-22 13:27             ` Ric Wheeler
@ 2008-10-22 14:32               ` Avi Kivity
  2008-10-22 14:36                 ` Chris Mason
  2008-10-22 14:46                 ` Ric Wheeler
  0 siblings, 2 replies; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 14:32 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Ric Wheeler wrote: 
> One key is not to replace the drives too early - you often can recover 
> significant amounts of data from a drive that is on its last legs. 
> This can be useful even in RAID rebuilds since with today's enormous 
> drive capacities, you might hit a latent error during the rebuild on 
> one of the presumed healthy drives.
>
> Of course, if you don't have a spare drive in your configuration, this 
> is not practical...

Why would you have a spare drive?  That's a wasted spindle.

You want to have spare capacity, enough for one or two (or fifteen) 
drives' worth of data.  When a drive goes bad, you rebuild into the 
spare capacity you have.

When you replace the drive, the filesystem moves data into the new drive 
to take advantage of the new spindle.

-- 
error compiling committee.c: too many arguments to function



* Re: Some very basic questions
  2008-10-22 14:32               ` Avi Kivity
@ 2008-10-22 14:36                 ` Chris Mason
  2008-10-22 14:40                   ` Avi Kivity
  2008-10-22 14:46                 ` Ric Wheeler
  1 sibling, 1 reply; 79+ messages in thread
From: Chris Mason @ 2008-10-22 14:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ric Wheeler, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

On Wed, 2008-10-22 at 16:32 +0200, Avi Kivity wrote:
> Ric Wheeler wrote: 
> > One key is not to replace the drives too early - you often can recover 
> > significant amounts of data from a drive that is on its last legs. 
> > This can be useful even in RAID rebuilds since with today's enormous 
> > drive capacities, you might hit a latent error during the rebuild on 
> > one of the presumed healthy drives.
> >
> > Of course, if you don't have a spare drive in your configuration, this 
> > is not practical...
> 
> Why would you have a spare drive?  That's a wasted spindle.
> 
> You want to have spare capacity, enough for one or two (or fifteen) 
> drives' worth of data.  When a drive goes bad, you rebuild into the 
> spare capacity you have.
> 

You want spare capacity that does not degrade your raid levels if you
move the data onto it.  In some configs, this will be a hot spare, in
others it'll just be free space.

-chris




* Re: Some very basic questions
  2008-10-22 14:36                 ` Chris Mason
@ 2008-10-22 14:40                   ` Avi Kivity
  0 siblings, 0 replies; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 14:40 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ric Wheeler, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Chris Mason wrote:
>> You want to have spare capacity, enough for one or two (or fifteen) 
>> drives' worth of data.  When a drive goes bad, you rebuild into the 
>> spare capacity you have.
>>
>>     
>
> You want spare capacity that does not degrade your raid levels if you
> move the data onto it.  In some configs, this will be a hot spare, in
> others it'll just be free space.
>   

What kind of configuration would prefer a spare disk to spare capacity?  
RAID6 with a small number of disks?

-- 
error compiling committee.c: too many arguments to function



* Re: Some very basic questions
  2008-10-22 14:32               ` Avi Kivity
  2008-10-22 14:36                 ` Chris Mason
@ 2008-10-22 14:46                 ` Ric Wheeler
  2008-10-22 14:54                   ` Avi Kivity
  1 sibling, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 14:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Mason, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Avi Kivity wrote:
> Ric Wheeler wrote:
>> One key is not to replace the drives too early - you often can 
>> recover significant amounts of data from a drive that is on its last 
>> legs. This can be useful even in RAID rebuilds since with today's 
>> enormous drive capacities, you might hit a latent error during the 
>> rebuild on one of the presumed healthy drives.
>>
>> Of course, if you don't have a spare drive in your configuration, 
>> this is not practical...
>
> Why would you have a spare drive?  That's a wasted spindle.

You have a spare drive because you care about data integrity and have 
too many years of experience in disk arrays to go without :-)
>
> You want to have spare capacity, enough for one or two (or fifteen) 
> drives' worth of data.  When a drive goes bad, you rebuild into the 
> spare capacity you have.

That is a different model (and one that makes sense, we used that in 
Centera for object level protection schemes). It is a nice model as 
well, but not how most storage works today.
>
> When you replace the drive, the filesystem moves data into the new 
> drive to take advantage of the new spindle.
>

When you buy a storage solution (hardware or software), the key here is 
"utilized capacity." If you have an enclosure that can host say 12-15 
drives in a 2U enclosure, people normally leave one drive as spare.  
RAID6 is another way to do this. You can do a 4+2 and 4+2 with 66% 
utilized capacity in RAID 6 or possibly a RAID5 scheme using like 5+1 
and 4+1 with one global spare (75% utilized capacity).
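
The utilized-capacity figures above can be checked with a short sketch. It
assumes a 12-drive enclosure of equal-sized drives; the function name
utilized_capacity is hypothetical.

```python
def utilized_capacity(groups, spares=0):
    """groups: list of (data_drives, parity_drives) per RAID group.
    spares: number of global hot spares. Assumes equal-sized drives."""
    data = sum(d for d, _ in groups)
    total = sum(d + p for d, p in groups) + spares
    return data / total

# Two RAID6 groups of 4+2: 8 of 12 drives hold data.
raid6 = utilized_capacity([(4, 2), (4, 2)])
# RAID5 groups of 5+1 and 4+1 plus one global spare: 9 of 12 drives hold data.
raid5 = utilized_capacity([(5, 1), (4, 1)], spares=1)

print(f"RAID6 4+2, 4+2:          {raid6:.1%}")
print(f"RAID5 5+1, 4+1, 1 spare: {raid5:.1%}")
```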

That gives you the chance to rebuild your RAID group without having 
to physically visit the data center. You can also do fancy stuff with 
the spare (like migrate as many blocks as possible before the RAID 
rebuild to that spare) which reduces your exposure to the 2nd drive 
failure and speeds up your rebuild time.

In the end, whether you use a block based RAID solution or an object 
based solution, you just need to figure out how to balance your utilized 
capacity against performance and data integrity needs.

ric


* Re: Some very basic questions
  2008-10-22 14:46                 ` Ric Wheeler
@ 2008-10-22 14:54                   ` Avi Kivity
  2008-10-22 15:02                     ` Ric Wheeler
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 14:54 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Ric Wheeler wrote:
>> You want to have spare capacity, enough for one or two (or fifteen) 
>> drives' worth of data.  When a drive goes bad, you rebuild into the 
>> spare capacity you have.
>
> That is a different model (and one that makes sense, we used that in 
> Centera for object level protection schemes). It is a nice model as 
> well, but not how most storage works today.

Well, btrfs is not about duplicating how most storage works today.  
Spare capacity has significant advantages over spare disks, such as 
being able to mix disk sizes, RAID levels, and better performance.

>>
>> When you replace the drive, the filesystem moves data into the new 
>> drive to take advantage of the new spindle.
>>
>
> When you buy a storage solution (hardware or software), the key here 
> is "utilized capacity." If you have an enclosure that can host say 
> 12-15 drives in a 2U enclosure, people normally leave one drive as 
> spare.  RAID6 is another way to do this. You can do a 4+2 and 4+2 with 
> 66% utilized capacity in RAID 6 or possibly a RAID5 scheme using like 
> 5+1 and 4+1 with one global spare (75% utilized capacity).
>
> That gives you the chance to rebuild your RAID group without 
> having to physically visit the data center. You can also do fancy 
> stuff with the spare (like migrate as many blocks as possible before 
> the RAID rebuild to that spare) which reduces your exposure to the 2nd 
> drive failure and speeds up your rebuild time.
>
> In the end, whether you use a block based RAID solution or an object 
> based solution, you just need to figure out how to balance your 
> utilized capacity against performance and data integrity needs.

In both models (spare disk and spare capacity) the storage utilization 
is the same, or nearly so.  But with spare capacity you get better 
performance since you have more spindles seeking for your data, and 
since less of the disk surface is occupied by data, making your seeks 
shorter.

-- 
error compiling committee.c: too many arguments to function



* Re: Some very basic questions
  2008-10-22 14:54                   ` Avi Kivity
@ 2008-10-22 15:02                     ` Ric Wheeler
  2008-10-22 15:13                       ` Avi Kivity
  0 siblings, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 15:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Mason, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Avi Kivity wrote:
> Ric Wheeler wrote:
>>> You want to have spare capacity, enough for one or two (or fifteen) 
>>> drives' worth of data.  When a drive goes bad, you rebuild into the 
>>> spare capacity you have.
>>
>> That is a different model (and one that makes sense, we used that in 
>> Centera for object level protection schemes). It is a nice model as 
>> well, but not how most storage works today.
>
> Well, btrfs is not about duplicating how most storage works today.  
> Spare capacity has significant advantages over spare disks, such as 
> being able to mix disk sizes, RAID levels, and better performance.

Sure, there are advantages in favour of one approach or the other. 
But btrfs is also about being able to use common hardware 
configurations without having to reinvent where we can avoid it (if we 
have a working RAID or enough drives to do RAID5 with spares or RAID6, 
we want to be able to delegate that off to something else if we can).
>
>>>
>>> When you replace the drive, the filesystem moves data into the new 
>>> drive to take advantage of the new spindle.
>>>
>>
>> When you buy a storage solution (hardware or software), the key here 
>> is "utilized capacity." If you have an enclosure that can host say 
>> 12-15 drives in a 2U enclosure, people normally leave one drive as 
>> spare.  RAID6 is another way to do this. You can do a 4+2 and 4+2 
>> with 66% utilized capacity in RAID 6 or possibly a RAID5 scheme using 
>> like 5+1 and 4+1 with one global spare (75% utilized capacity).
>>
>> That gives you the chance to rebuild your RAID group without 
>> having to physically visit the data center. You can also do fancy 
>> stuff with the spare (like migrate as many blocks as possible before 
>> the RAID rebuild to that spare) which reduces your exposure to the 
>> 2nd drive failure and speeds up your rebuild time.
>>
>> In the end, whether you use a block based RAID solution or an object 
>> based solution, you just need to figure out how to balance your 
>> utilized capacity against performance and data integrity needs.
>
> In both models (spare disk and spare capacity) the storage utilization 
> is the same, or nearly so.  But with spare capacity you get better 
> performance since you have more spindles seeking for your data, and 
> since less of the disk surface is occupied by data, making your seeks 
> shorter.
>
True, you can get more performance if you use all of the hardware you 
have all of the time.

The major difficulty with the spare capacity model is that your recovery 
is not as simple and well understood as RAID rebuilds. If you assume 
that whole drives fail under btrfs mirroring, you are not really doing 
anything more than simple RAID, or do I misunderstand your suggestion?

I don't see the point about head seeking. In RAID, you also have the 
same layout so you minimize head movement (just move more heads per IO 
in parallel).

ric


* Re: Some very basic questions
  2008-10-22 15:02                     ` Ric Wheeler
@ 2008-10-22 15:13                       ` Avi Kivity
  2008-10-22 15:25                         ` Ric Wheeler
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 15:13 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Ric Wheeler wrote:
>>
>> Well, btrfs is not about duplicating how most storage works today.  
>> Spare capacity has significant advantages over spare disks, such as 
>> being able to mix disk sizes, RAID levels, and better performance.
>
> Sure, there are advantages in favour of one approach or the other. 
> But btrfs is also about being able to use common hardware 
> configurations without having to reinvent where we can avoid it (if we 
> have a working RAID or enough drives to do RAID5 with spares or RAID6, 
> we want to be able to delegate that off to something else if we can).

Well, if you have an existing RAID (or have lots of $$$ to buy a new 
one), you needn't tell Btrfs about it.  Just be sure not to enable Btrfs 
data redundancy, or you'll have redundant redundancy, which is expensive.

What Btrfs enables with its multiple device capabilities is to assemble 
a JBOD into a filesystem-level data redundancy system, which is cheaper, 
more flexible (per-file data redundancy levels), and faster (no need for 
RMW, since you're always COWing).

> The major difficulty with the spare capacity model is that your 
> recovery is not as simple and well understood as RAID rebuilds. 

That's Chris's problem. :-)

> If you assume that whole drives fail under btrfs mirroring, you are 
> not really doing anything more than simple RAID, or do I misunderstand 
> your suggestion?

I do assume that whole drives fail, but RAIDing and rebuilding is file 
level.  So one extent on a failed disk might be part of a mirrored file, 
while another extent can be part of a 14-member RAID6 extent.

A rebuild would iterate over all disk extents (making use of the backref 
tree), determine which file contains that extent, and rebuild that 
extent using spare storage on other disks.
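
A minimal sketch of that rebuild loop, under loud assumptions: the Extent
record, the rebuild function and the device names are all hypothetical, free
space is counted in extent-sized units, and the backref lookup is modelled as
a field already filled in on each extent. It is not btrfs code.

```python
from dataclasses import dataclass

@dataclass
class Extent:
    file: str      # owning file (what a backref lookup would return)
    profile: str   # redundancy profile of that file, e.g. "raid1"
    devices: list  # devices holding a copy or stripe member of this extent

def rebuild(extents, failed, free_space):
    """Re-home every extent that touched `failed` onto the device with the
    most free space that does not already hold part of that extent."""
    moved = []
    for ext in extents:
        if failed not in ext.devices:
            continue  # unaffected extent; free space is never rebuilt
        candidates = [d for d in free_space
                      if d != failed and d not in ext.devices]
        if not candidates:
            raise RuntimeError(f"no spare capacity for {ext.file}")
        target = max(candidates, key=free_space.get)
        ext.devices[ext.devices.index(failed)] = target
        free_space[target] -= 1  # one extent-sized unit, for simplicity
        moved.append((ext.file, ext.profile, target))
    return moved

extents = [
    Extent("a.log", "raid1", ["sda", "sdb"]),
    Extent("b.iso", "raid6", ["sda", "sdb", "sdc", "sdd"]),
    Extent("c.tmp", "raid1", ["sdc", "sdd"]),
]
free = {"sda": 0, "sdb": 3, "sdc": 5, "sdd": 4, "sde": 6}
result = rebuild(extents, "sda", free)
print(result)
```

The point of the sketch is that a mirrored extent and a RAID6 extent on the
same failed disk are handled by the same loop, each keeping its own profile.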

> I don't see the point about head seeking. In RAID, you also have the 
> same layout so you minimize head movement (just move more heads per IO 
> in parallel).

Suppose you have 5 disks with 1 spare.  Suppose you are reading from a 
full fs.  On a disk-level RAID, all disks are full.  So you have 5 
spindles seeking over 100% of the disk surface.  With spare capacity, 
you have 6 disks which are 5/6 full (retaining the same utilization as 
old-school RAID).  So you have 6 spindles, each with a seek range that 
is 5/6 of a whole disk, so more seek heads _and_ faster individual seeks.
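
The same arithmetic as a small sketch, assuming equal-sized disks and a full
filesystem; the function name layout is hypothetical.

```python
from fractions import Fraction

def layout(data_disks, spare_disks, spare_as_capacity):
    """Return (active spindles, fraction of each disk occupied by data)
    for a full filesystem on equal-sized disks."""
    total = data_disks + spare_disks
    if spare_as_capacity:
        # Data is spread over every disk, so each one is only partly full.
        return total, Fraction(data_disks, total)
    # The spare sits idle and the data disks are completely full.
    return data_disks, Fraction(1)

spare_disk = layout(5, 1, spare_as_capacity=False)
spare_capacity = layout(5, 1, spare_as_capacity=True)
print(spare_disk, spare_capacity)
```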

-- 
error compiling committee.c: too many arguments to function


* Re: Some very basic questions
  2008-10-22 15:13                       ` Avi Kivity
@ 2008-10-22 15:25                         ` Ric Wheeler
  2008-10-22 15:33                           ` Chris Mason
  2008-10-22 15:39                           ` Avi Kivity
  0 siblings, 2 replies; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 15:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Mason, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Avi Kivity wrote:
> Ric Wheeler wrote:
>>>
>>> Well, btrfs is not about duplicating how most storage works today.  
>>> Spare capacity has significant advantages over spare disks, such as 
>>> being able to mix disk sizes, RAID levels, and better performance.
>>
>> Sure, there are advantages in favour of one approach or the other. 
>> But btrfs is also about being able to use common hardware 
>> configurations without having to reinvent where we can avoid it (if 
>> we have a working RAID or enough drives to do RAID5 with spares or 
>> RAID6, we want to be able to delegate that off to something else if 
>> we can).
>
> Well, if you have an existing RAID (or have lots of $$$ to buy a new 
> one), you needn't tell Btrfs about it.  Just be sure not to enable 
> Btrfs data redundancy, or you'll have redundant redundancy, which is 
> expensive.
>
> What Btrfs enables with its multiple device capabilities is to 
> assemble a JBOD into a filesystem-level data redundancy system, which 
> is cheaper, more flexible (per-file data redundancy levels), and 
> faster (no need for RMW, since you're always COWing).
I think that the btrfs plan is still to push more complicated RAID 
schemes off to MD (RAID6, etc) so this is an issue even with a JBOD. It 
will be interesting to map out the possible ways to use built in 
mirroring, etc vs the external RAID and actually measure the utilized 
capacity and performance (online & during rebuilds).
>
>> The major difficulty with the spare capacity model is that your 
>> recovery is not as simple and well understood as RAID rebuilds. 
>
> That's Chris's problem. :-)
Unless he can pawn it off on some other lucky developer :-)

>
>> If you assume that whole drives fail under btrfs mirroring, you are 
>> not really doing anything more than simple RAID, or do I 
>> misunderstand your suggestion?
>
> I do assume that whole drives fail, but RAIDing and rebuilding is file 
> level.  So one extent on a failed disk might be part of a mirrored 
> file, while another extent can be part of a 14-member RAID6 extent.
>
> A rebuild would iterate over all disk extents (making use of the 
> backref tree), determine which file contains that extent, and rebuild 
> that extent using spare storage on other disks.
>
>> I don't see the point about head seeking. In RAID, you also have the 
>> same layout so you minimize head movement (just move more heads per 
>> IO in parallel).
>
> Suppose you have 5 disks with 1 spare.  Suppose you are reading from a 
> full fs.  On a disk-level RAID, all disks are full.  So you have 5 
> spindles seeking over 100% of the disk surface.  With spare capacity, 
> you have 6 disks which are 5/6 full (retaining the same utilization as 
> old-school RAID).  So you have 6 spindles, each with a seek range that 
> is 5/6 of a whole disk, so more seek heads _and_ faster individual seeks.
>
I think that this is somewhat correct, but most likely offset by the 
performance levels of streaming IO vs IO with any seeks (at least for 
full file systems). Certainly, the spare capacity model is increasingly 
better when you have really lightly utilized file systems...

Don't think that I am arguing against the model, just saying that it is 
not always as clear cut as you might think....

ric



* Re: Some very basic questions
  2008-10-22 15:25                         ` Ric Wheeler
@ 2008-10-22 15:33                           ` Chris Mason
  2008-10-22 15:43                             ` Avi Kivity
  2008-10-22 15:39                           ` Avi Kivity
  1 sibling, 1 reply; 79+ messages in thread
From: Chris Mason @ 2008-10-22 15:33 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Avi Kivity, Stephan von Krawczynski, Christoph Hellwig, jim owens,
	linux-btrfs

On Wed, 2008-10-22 at 11:25 -0400, Ric Wheeler wrote:
> Avi Kivity wrote:
> > Ric Wheeler wrote:
> >>>
> >>> Well, btrfs is not about duplicating how most storage works today.  
> >>> Spare capacity has significant advantages over spare disks, such as 
> >>> being able to mix disk sizes, RAID levels, and better performance.
> >>
> >> Sure, there are advantages in favour of one approach or the other. 
> >> But btrfs is also about being able to use common hardware 
> >> configurations without having to reinvent where we can avoid it (if 
> >> we have a working RAID or enough drives to do RAID5 with spares or 
> >> RAID6, we want to be able to delegate that off to something else if 
> >> we can).
> >
> > Well, if you have an existing RAID (or have lots of $$$ to buy a new 
> > one), you needn't tell Btrfs about it.  Just be sure not to enable 
> > Btrfs data redundancy, or you'll have redundant redundancy, which is 
> > expensive.
> >
> > What Btrfs enables with its multiple device capabilities is to 
> > assemble a JBOD into a filesystem-level data redundancy system, which 
> > is cheaper, more flexible (per-file data redundancy levels), and 
> > faster (no need for RMW, since you're always COWing).
>
> I think that the btrfs plan is still to push more complicated RAID 
> schemes off to MD (RAID6, etc) so this is an issue even with a JBOD.

At least v1.0 won't have raid6.  Over the longer term I hope to include
it because managing the storage once in btrfs and once in md is going to
be a bit clumsy.  It also limits the mixed mode functionality like
different stripe sizes for data vs metadata or metadata mirroring and
data raid6 that will allow us to perform well.

The goal will be to make a library of raid routines based on md that
other storage will be able to use.  I know Christoph has been interested
in this as well.

But in general, the btrfs raid code can do either spare disks or spare
capacity modes safely.  It enforces the correct number of devices in
each raid mode (as long as the admin doesn't lie to us and feed
partitions off the same device).

I'll leave the rest up to the admin.  One problem with the spare
capacity model is the general trend where drives from the same batch
that get hammered on in the same way tend to die at the same time.  Some
shops will sleep better knowing there's a hot spare and that's fine by
me.

-chris


* Re: Some very basic questions
  2008-10-22 15:33                           ` Chris Mason
@ 2008-10-22 15:43                             ` Avi Kivity
  2008-10-22 15:54                               ` Ric Wheeler
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 15:43 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ric Wheeler, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Chris Mason wrote:
> One problem with the spare
> capacity model is the general trend where drives from the same batch
> that get hammered on in the same way tend to die at the same time.  Some
> shops will sleep better knowing there's a hot spare and that's fine by
> me.
>   

How does hot sparing help?  All your disks die except the spare.

Of course, I've no objection to disk sparing as an additional option; I 
just feel that capacity sparing is superior.

-- 
error compiling committee.c: too many arguments to function



* Re: Some very basic questions
  2008-10-22 15:43                             ` Avi Kivity
@ 2008-10-22 15:54                               ` Ric Wheeler
  2008-10-22 18:28                                 ` Avi Kivity
  0 siblings, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 15:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Mason, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Avi Kivity wrote:
> Chris Mason wrote:
>> One problem with the spare
>> capacity model is the general trend where drives from the same batch
>> that get hammered on in the same way tend to die at the same time.  Some
>> shops will sleep better knowing there's a hot spare and that's fine by
>> me.
>>   
>
> How does hot sparing help?  All your disks die except the spare.
>
> Of course, I've no objection to disk sparing as an additional option; 
> I just feel that capacity sparing is superior.
>
For any given set of disks, you "just" need to do the math to compute 
the utilized capacity, the expected rate of drive failure, the rebuild 
time and then see whether you can recover from your first failure before 
a 2nd disk dies.
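
A minimal sketch of that calculation, assuming independent drives with
exponentially distributed lifetimes. The 3% annual failure rate, the 24 hour
rebuild and the function name p_second_failure are illustrative assumptions,
not figures from this thread.

```python
import math

def p_second_failure(surviving, afr, rebuild_hours):
    """Probability that at least one of `surviving` drives fails before the
    rebuild finishes, for an annual failure rate `afr` per drive."""
    rate_per_hour = -math.log1p(-afr) / (365 * 24)
    return -math.expm1(-surviving * rate_per_hour * rebuild_hours)

# Assumed example: 11 surviving drives, 3% AFR, 24 hour rebuild.
p = p_second_failure(11, 0.03, 24)
print(f"chance of a 2nd failure during rebuild: {p:.3%}")
```

The independence assumption is the weak point: drives from the same batch
failing together push the real risk above what this model reports.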

In practice, this is not an academic question since drives do 
occasionally fail in batches (and drives from the same batch get stuffed 
into the same system).

I suspect that what will be used in mission critical deployments will be 
more conservative than what is used in less critical path systems, but 
this will be up to the end user to configure...

ric



* Re: Some very basic questions
  2008-10-22 15:54                               ` Ric Wheeler
@ 2008-10-22 18:28                                 ` Avi Kivity
  0 siblings, 0 replies; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 18:28 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Ric Wheeler wrote:
> For any given set of disks, you "just" need to do the math to compute 
> the utilized capacity, the expected rate of drive failure, the rebuild 
> time and then see whether you can recover from your first failure 
> before a 2nd disk dies.
>

Spare disks have the advantage of a fully linear access pattern 
(ignoring normal working load).  Spare capacity has the advantage of 
utilizing all devices (if you have a hundred-disk fs, all surviving 
disks participate in the rebuild, whereas with spare disks you are 
limited to the surviving raidset members).

Spare capacity also has the advantage that you don't need to rebuild 
free space.
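
A rough sketch of how these effects trade off, under simplifying
assumptions: the rebuild is limited by write throughput only, the spare disk
is written linearly, and the spare-capacity rebuild gets a lower per-disk
rate to reflect its non-linear access. All figures and function names are
hypothetical.

```python
def rebuild_hours_spare_disk(disk_gb, mb_per_s):
    # The whole disk is rewritten and the single spare is the bottleneck.
    return disk_gb * 1024 / mb_per_s / 3600

def rebuild_hours_spare_capacity(disk_gb, fill, survivors, mb_per_s):
    # Only allocated data is rebuilt and every survivor takes a share.
    return disk_gb * fill * 1024 / (mb_per_s * survivors) / 3600

# Assumed example: 1000 GB disks, half full, 99 survivors in a 100-disk fs.
spare = rebuild_hours_spare_disk(1000, mb_per_s=80)
capacity = rebuild_hours_spare_capacity(1000, fill=0.5, survivors=99,
                                        mb_per_s=40)
print(f"spare disk: {spare:.2f} h, spare capacity: {capacity:.2f} h")
```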
> In practice, this is not an academic question since drives do 
> occasionally fail in batches (and drives from the same batch get 
> stuffed into the same system).

This seems to be orthogonal to the sparing method used; and in both 
cases the answer is to tolerate dual failures.  File-based redundancy 
has the advantage here of allowing triple mirroring for metadata and 
frequently written files, and double parity raid for large files.

> I suspect that what will be used in mission critical deployments will 
> be more conservative than what is used in less critical path systems

That's true, unfortunately.  But with time people will trust the newer, 
more efficient methods.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: Some very basic questions
  2008-10-22 15:25                         ` Ric Wheeler
  2008-10-22 15:33                           ` Chris Mason
@ 2008-10-22 15:39                           ` Avi Kivity
  1 sibling, 0 replies; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 15:39 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, Stephan von Krawczynski, Christoph Hellwig,
	jim owens, linux-btrfs

Ric Wheeler wrote:
> I think that the btrfs plan is still to push more complicated RAID 
> schemes off to MD (RAID6, etc) so this is an issue even with a JBOD. 
> It will be interesting to map out the possible ways to use built in 
> mirroring, etc vs the external RAID and actually measure the utilized 
> capacity and performance (online & during rebuilds).

That's leaving a lot of performance and features on the table, IMO.  We 
definitely want to have metadata and small files using mirroring 
(perhaps even three copies for some metadata).  Use RAID[56] for large 
files.  Perhaps even start files at RAID1, and have the scrubber convert 
them to RAID[56] when it notices they are only ever read.  Keep 
temporary or unimportant files at RAID0.  Play games with asymmetric 
setups (small fast disks + large slow disks). etc etc etc.

Delegating things to MD throws out a lot of metadata so these things 
become impossible.
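
As a sketch of the kind of per-file policy this would allow: the function
pick_profile, the profile labels and the 64 MiB threshold are all
hypothetical and are not btrfs options.

```python
def pick_profile(is_metadata=False, size=0, temporary=False,
                 read_only_so_far=False, large=64 * 1024 * 1024):
    """Choose a redundancy profile from information only the fs has."""
    if temporary:
        return "raid0"   # unimportant data: no redundancy
    if is_metadata:
        return "mirror"  # metadata stays mirrored
    if size >= large and read_only_so_far:
        return "raid6"   # large, read-mostly files move to parity
    return "mirror"      # new and small files start mirrored

print(pick_profile(is_metadata=True))
print(pick_profile(size=1 << 30, read_only_so_far=True))
print(pick_profile(size=4096))
print(pick_profile(temporary=True))
```

A block-level MD device sees none of these inputs (file size, access
history, metadata versus data), so it cannot make this decision.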

-- 
error compiling committee.c: too many arguments to function


* Re: Some very basic questions
  2008-10-22 13:15           ` Chris Mason
  2008-10-22 13:27             ` Ric Wheeler
@ 2008-10-22 13:52             ` Stephan von Krawczynski
  2008-10-22 15:56               ` Michel Salim
  1 sibling, 1 reply; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-22 13:52 UTC (permalink / raw)
  To: Chris Mason; +Cc: Ric Wheeler, Christoph Hellwig, jim owens, linux-btrfs

On Wed, 22 Oct 2008 09:15:45 -0400
Chris Mason <chris.mason@oracle.com> wrote:

> On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote:
> > On Tue, 21 Oct 2008 13:31:37 -0400
> > Ric Wheeler <ricwheeler@gmail.com> wrote:
> > 
> > > [...]
> > > If you have remapped a big chunk of the sectors (say more than 10%), you 
> > > should grab the data off the disk asap and replace it. Worry less about 
> > > errors during read, writes indicate more serious errors.
> > 
> > Ok, now for the bad news: money is invented.
> > If you replace a disk before real failure you won't get replacement from the
> > manufacturer. That may sound irrelevant to someone handling 5 disks, but is
> > significant if handling 500 or more. The replacement rate is indeed much
> > higher than people think from their home pcs.
> 
> Hardware vendors already do replace disks based on policies defined by
> their own array hardware.  These are already predictive.

Let's agree that the market for drives, arrays and related stuff is big and
contains just about any example one needs for arguing :-)
Nevertheless we probably agree that if John Doe meets a big player and tries to
warranty-replace a non-dead drive he will have trouble.

> -chris

-- 
Regards,
Stephan


* Re: Some very basic questions
  2008-10-22 13:52             ` Stephan von Krawczynski
@ 2008-10-22 15:56               ` Michel Salim
  2008-10-22 16:56                 ` jim owens
  2008-10-23  9:47                 ` Stephan von Krawczynski
  0 siblings, 2 replies; 79+ messages in thread
From: Michel Salim @ 2008-10-22 15:56 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Chris Mason, Ric Wheeler, Christoph Hellwig, jim owens,
	linux-btrfs

On Wed, Oct 22, 2008 at 9:52 AM, Stephan von Krawczynski
<skraw@ithnet.com> wrote:
> On Wed, 22 Oct 2008 09:15:45 -0400
> Chris Mason <chris.mason@oracle.com> wrote:
>
>> On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote:
>> > On Tue, 21 Oct 2008 13:31:37 -0400
>> > Ric Wheeler <ricwheeler@gmail.com> wrote:
>> >
>> > > [...]
>> > > If you have remapped a big chunk of the sectors (say more than 10%), you
>> > > should grab the data off the disk asap and replace it. Worry less about
>> > > errors during read, writes indicate more serious errors.
>> >
>> > Ok, now for the bad news: money is invented.
>> > If you replace a disk before real failure you won't get replacement from the
>> > manufacturer. That may sound irrelevant to someone handling 5 disks, but is
>> > significant if handling 500 or more. The replacement rate is indeed much
>> > higher than people think from their home pcs.
>>
>> Hardware vendors already do replace disks based on policies defined by
>> their own array hardware.  These are already predictive.
>
> Lets agree that the market for drives, arrays and related stuff is big and
> contains just about any example one needs for arguing :-)
> Nevertheless we probably agree that if john doe meets big-player and tries to
> warranty-replace a non-dead drive he will have troubles.
>
If John Doe is using redundant storage in the first place, he just
needs an emergency disk that can be swapped-in for a failing disk, and
then stress-test the failing disk to death, get it replaced by
manufacturer, and the replacement becomes the next standby/emergency
disk.

Though it would be nice to have a tool that would provide enough
information to make a warranty claim -- does btrfs keep enough
information for such a tool to be written?

Thanks,

-- 
miʃel salim  •  http://hircus.jaiku.com/
IUCS         •  msalim@cs.indiana.edu
Fedora       •  salimma@fedoraproject.org
MacPorts     •  hircus@macports.org

* Re: Some very basic questions
  2008-10-22 15:56               ` Michel Salim
@ 2008-10-22 16:56                 ` jim owens
  2008-10-23  9:47                 ` Stephan von Krawczynski
  1 sibling, 0 replies; 79+ messages in thread
From: jim owens @ 2008-10-22 16:56 UTC (permalink / raw)
  To: Michel Salim; +Cc: Stephan von Krawczynski, linux-btrfs

Michel Salim wrote:
> 
> Though it would be nice to have a tool that would provide enough
> information to make a warranty claim -- does btrfs keep enough
> information for such a tool to be written?

Failed device I/O (rather than bad checksums and other
fs-specific error detections) should be logged at a lower
layer in the standard system logs.
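
As a rough illustration of what such a tool could start from, here is a
minimal Python sketch that collects the failed sectors per device from
kernel log lines. The message shape it matches ("I/O error, dev <name>,
sector <n>") and the function name failed_sectors_by_device are assumptions
made for this example; the exact wording of these messages varies between
kernel versions and drivers.

```python
import re
from collections import defaultdict

# Assumed message shape, for example:
#   "end_request: I/O error, dev sdb, sector 12345768"
IO_ERROR = re.compile(r"I/O error, dev ([\w-]+), sector (\d+)")


def failed_sectors_by_device(log_lines):
    """Collect the distinct failed sectors reported for each device."""
    failures = defaultdict(set)
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            failures[match.group(1)].add(int(match.group(2)))
    # Sort the sectors so the report is stable and easy to read.
    return {dev: sorted(sectors) for dev, sectors in failures.items()}


if __name__ == "__main__":
    sample = [
        "kernel: end_request: I/O error, dev sdb, sector 12345768",
        "kernel: end_request: I/O error, dev sdb, sector 12345768",
        "kernel: end_request: I/O error, dev sdc, sector 42",
        "kernel: EXT3-fs: mounted filesystem with ordered data mode.",
    ]
    for dev, sectors in failed_sectors_by_device(sample).items():
        print(dev, len(sectors), "failed sector(s):", sectors)
```

Repeated errors on the same sector are counted once, so the result shows
how many distinct sectors failed rather than how often they were retried.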

Warranties are really about who you buy your drives from,
if you go cheap don't expect any replacements.  If you buy
quality stuff, the failures usually occur right after
the warranty expires :)  In the case of bad manufacturing
batches, the good vendors figure that out real fast and
don't hassle you about replacing them as they fail.

And even from a good vendor, don't expect you can run
a drive with a 1-year 20% duty-cycle warranty like
it was a 100% duty-cycle drive and get the vendor
to replace them if they fail in < 1 year.  People often
complain the vendor does not stand behind the warranty
when they are really badly violating the usage terms.

jim

* Re: Some very basic questions
  2008-10-22 15:56               ` Michel Salim
  2008-10-22 16:56                 ` jim owens
@ 2008-10-23  9:47                 ` Stephan von Krawczynski
  1 sibling, 0 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-23  9:47 UTC (permalink / raw)
  To: Michel Salim
  Cc: Chris Mason, Ric Wheeler, Christoph Hellwig, jim owens,
	linux-btrfs

On Wed, 22 Oct 2008 11:56:58 -0400
"Michel Salim" <michel.sylvan@gmail.com> wrote:

> > [...]
> > Let's agree that the market for drives, arrays and related stuff is big and
> > contains just about any example one needs for arguing :-)
> > Nevertheless we probably agree that if John Doe meets a big player and tries to
> > warranty-replace a non-dead drive, he will have trouble.
> >
> If John Doe is using redundant storage in the first place, he just
> needs an emergency disk that can be swapped-in for a failing disk, and
> then stress-test the failing disk to death, get it replaced by
> manufacturer, and the replacement becomes the next standby/emergency
> disk.

Even more expensive than drives is working time. So you just swapped the
problem the wrong way round.
I would not have expected that it is hard to argue why it makes sense to
replace dead disks when they are dead, because you then know that they are dead
and everybody else looking at the brick knows it too - without spending time
and money for testing and arguing about warranty issues.
Does anybody remember the phrase "keep it simple"?

PS: of course we agree with your description of a minimal replacement strategy.

-- 
Regards,
Stephan

* Re: Some very basic questions
  2008-10-21 17:15     ` Christoph Hellwig
  2008-10-21 17:31       ` Ric Wheeler
@ 2008-10-22 11:40       ` Stephan von Krawczynski
  1 sibling, 0 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-22 11:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: jim owens, linux-btrfs

On Tue, 21 Oct 2008 13:15:13 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
> > Sure, but what you say only reflects the ideal world. On a file service, you
> > never have that. In fact you do not even have good control over what is going
> > on. Let's say you have a setup that creates, reads and deletes files 24h a day
> > from numerous clients. At two o'clock in the morning some hd decides to
> > partially die. Files get created on it, fill data up to errors, get
> > deleted and another bunch of data arrives and yet again fs tries to allocate
> > the same dead areas. You lose a lot more data only because the fs did not map
> > out the already known dead blocks. Of course you would replace the dead drive
> > later on, but in the meantime you have a lot of fun.
> > In other words: give me a tool to freeze the world right at the time the
> > errors show up, or map out dead blocks (only because it is a lot easier).
> 
> When a modern disk can't solve the problems with its internal remapping
> anymore, you had better replace it ASAP, as that is a very strong
> indication of disk failure.  Last year's FAST had some very interesting
> statistics showing this in the field.

And of course a "disk" is always a "disk", right? 

-- 
Regards,
Stephan


* Re: Some very basic questions
  2008-10-21 11:23 Some very basic questions Stephan von Krawczynski
  2008-10-21 12:13 ` Andi Kleen
  2008-10-21 13:20 ` jim owens
@ 2008-10-21 13:59 ` Chris Mason
  2008-10-21 16:09   ` Andi Kleen
                     ` (2 more replies)
  2 siblings, 3 replies; 79+ messages in thread
From: Chris Mason @ 2008-10-21 13:59 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-btrfs

On Tue, 2008-10-21 at 13:23 +0200, Stephan von Krawczynski wrote:
> Hello all,
> 
> reading the list for a while it looks like all kinds of implementational
> topics are covered but no basic user requests or talks are going on. Since I
> have found no other list on vger covering these issues I choose this one,
> forgive my ignorance if it is the wrong place.
> Like many people on the planet we try to handle quite some amounts of data
> (TBs) and try to solve this with several linux-based fileservers.
> Years of (mostly bad) experience led us to the following minimum requirements
> for a new fs on our servers:
> 

Thanks for this input and for taking the time to post it.

> 1. filesystem-check
> 1.1 it should not
>     - delay boot process (we have to wait for hours currently)
>     - prevent mount in case of errors
>     - be a part of the mount process at all
>     - always check the whole fs

For this, you have to define filesystem-check very carefully.  In
reality, corruptions can prevent mounting.  We can try very very hard to
limit the class of corruptions that prevent mounting, and use
duplication and replication to create configurations that address the
remaining cases.

In general, we'll be able to make things much better than they are
today.

> 1.2 it should be able 
>     - to always be started interactively by user
>     - to check parts/subtrees of the fs
>     - to run purely informational (reporting, non-modifying)
>     - to run on a mounted fs

Started interactively?  I'm not entirely sure what that means, but in
general when you ask the user a question about if/how to fix a
corruption, they will have no idea what the correct answer is.

> 2. general requirements
>     - fs errors without file/dir names are useless
>     - errors in parts of the fs are no reason for a fs to go offline as a whole

These two are in progress.  Btrfs won't always be able to give a file
and directory name, but it will be able to give something that can be
turned into a file or directory name.  You don't want important
diagnostic messages delayed by name lookup.

>     - mounting must not delay the system startup significantly

Mounts are fast

>     - resizing during runtime (up and down)

Resize is done

>     - parallel mounts (very important!)
>       (two or more hosts mount the same fs concurrently for reading and
>       writing)

As Jim and Andi have said, parallel mounts are not in the feature list
for Btrfs.  Network filesystems will provide these features.

>     - journaling

Btrfs doesn't journal.  The tree logging code is close, it provides
optimized fsync and O_SYNC operations.  The same basic structures could
be used for remote replication.

>     - versioning (file and dir)

From a data structure point of view, version control is fairly easy.
From a user interface and policy point of view, it gets difficult very
quickly.  Aside from snapshotting, version control is outside the scope
of btrfs.

There are lots of good version control systems available, I'd suggest
you use them instead.

>     - undelete (file and dir)

Undelete is easy but I think best done at a layer above the FS.

>     - snapshots

Done

>     - run into hd errors more than once for the same file (as an option)

Sorry, I'm not sure what you mean here.

>     - map out dead blocks
>       (and of course display of the currently mapped out list)

I agree with Jim on this one.  Drives remap dead sectors, and when they
stop remapping them, the drive should be replaced.

>     - no size limitations (more or less)
>     - performant handling of large numbers of files inside single dirs
>       (to check that use > 100.000 files in a dir, understand that it is
>       no good idea to spread inode-blocks over the whole hd because of seek
>       times)

Everyone has different ideas on "large" numbers of files inside a single
dir.  The directory indexing done by btrfs can easily handle 100,000 files.

>     - power loss at any time must not corrupt the fs (atomic fs modification)
>       (new-data loss is acceptable)

Done.  Btrfs already uses barriers as required for sata drives.

> 
> Remember, this is not meant to be a request for features, it is a list that
> built up over 10 years of handling data and the failures we experienced. To
> our knowledge no fs meets this list, but hey, is that a reason for not talking
> about it? Our goal is pretty simple: maximize fs uptime.
> How does btrfs match?

-chris

* Re: Some very basic questions
  2008-10-21 13:59 ` Chris Mason
@ 2008-10-21 16:09   ` Andi Kleen
  2008-10-22 11:43     ` Stephan von Krawczynski
  2008-10-21 16:27   ` Stephan von Krawczynski
  2008-10-21 20:54   ` Eric Anopolsky
  2 siblings, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2008-10-21 16:09 UTC (permalink / raw)
  To: Chris Mason; +Cc: Stephan von Krawczynski, linux-btrfs

Chris Mason <chris.mason@oracle.com> writes:
>
> Started interactively?  I'm not entirely sure what that means, but in
> general when you ask the user a question about if/how to fix a
> corruption, they will have no idea what the correct answer is.

While that's true today, I'm not sure it has to be true always.
I always thought traditional fsck user interfaces were a
UI disaster and could be done much better with some simple tweaks. 

For example the fsck could present the user a list of files that ended
up in lost+found and let them examine them, instead of asking a lot of
useless questions. Or it could give a high level summary on how many
files in which part of the directory tree were corrupted, etc.  Or
it could default to a high level mode that only gives such high level
information to the user.
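
A minimal sketch of the "high level summary" idea, in Python: given a list
of paths that a checker found damaged, count them per top-level directory.
The function name summarize_by_subtree and the depth parameter are made up
for this illustration, and the input list stands in for whatever a real
fsck would report.

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize_by_subtree(damaged_paths, depth=1):
    """Count damaged files per containing directory, `depth` levels deep."""
    counts = Counter()
    for path in damaged_paths:
        # Use the parent directory so files directly under "/" count as "/".
        parts = PurePosixPath(path).parent.parts
        subtree = PurePosixPath(*parts[: depth + 1])
        counts[str(subtree)] += 1
    return counts


if __name__ == "__main__":
    damaged = [
        "/home/alice/a.txt",
        "/home/bob/b.txt",
        "/var/log/syslog",
        "/vmlinuz",
    ]
    for subtree, n in summarize_by_subtree(damaged).most_common():
        print(f"{n:6d} damaged file(s) under {subtree}")
```

Raising depth gives a finer breakdown (for example per home directory)
without changing how the list of damaged paths is produced.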

So I don't think all corruptions could be handled in a perfectly
user-friendly way, but at least the basic user friendliness in many
situations could be much improved.

-Andi

-- 
ak@linux.intel.com

* Re: Some very basic questions
  2008-10-21 16:09   ` Andi Kleen
@ 2008-10-22 11:43     ` Stephan von Krawczynski
  0 siblings, 0 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-22 11:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Chris Mason, linux-btrfs

On Tue, 21 Oct 2008 18:09:40 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> While that's true today, I'm not sure it has to be true always.
> I always thought traditional fsck user interfaces were a
> UI disaster and could be done much better with some simple tweaks. 
> [...]

You are completely right.

> -Andi

-- 
Regards,
Stephan

* Re: Some very basic questions
  2008-10-21 13:59 ` Chris Mason
  2008-10-21 16:09   ` Andi Kleen
@ 2008-10-21 16:27   ` Stephan von Krawczynski
  2008-10-21 16:59     ` Andi Kleen
  2008-10-21 17:49     ` Chris Mason
  2008-10-21 20:54   ` Eric Anopolsky
  2 siblings, 2 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-21 16:27 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Hello Chris, 

let me clarify some things a bit, see ...

On Tue, 21 Oct 2008 09:59:40 -0400
Chris Mason <chris.mason@oracle.com> wrote:

> Thanks for this input and for taking the time to post it.
> 
> > 1. filesystem-check
> > 1.1 it should not
> >     - delay boot process (we have to wait for hours currently)
> >     - prevent mount in case of errors
> >     - be a part of the mount process at all
> >     - always check the whole fs
> 
> For this, you have to define filesystem-check very carefully.  In
> reality, corruptions can prevent mounting.  We can try very very hard to
> limit the class of corruptions that prevent mounting, and use
> duplication and replication to create configurations that address the
> remaining cases.

What we would like to have is a way to check an already mounted and
active fs for corruption; that's the reporting part.
If some corruption is found we should be able to correct the
data/metadata/whatever on the _still active_ fs, let's say by starting fsck in
modify mode. It is often preferable not to run over the complete fs but
only over certain (already known-to-be-corrupted) parts/subtrees.
Obviously the fs should not go offline then, even if something very
ugly happens.
You can imagine:
Run fsck via cron every night. Then look at the logs in the morning and, if bad
news arrived, try to correct the broken subtree or exclude it from further
usage.

> In general, we'll be able to make things much better than they are
> today.

I am pretty sure about that ;-)

> > 1.2 it should be able 
> >     - to always be started interactively by user
> >     - to check parts/subtrees of the fs
> >     - to run purely informational (reporting, non-modifying)
> >     - to run on a mounted fs
> 
> Started interactively?  I'm not entirely sure what that means, but in
> general when you ask the user a question about if/how to fix a
> corruption, they will have no idea what the correct answer is.

see above explanation. We don't expect the classical y/n-questions during
fsck. Honestly there are only 3 types of modification modes in fsck:
- try correction in place
- exclude (i.e. delete) whole problem subtree
- duplicate to another subtree whatever can be rescued from the original place
  (and leave problem subtree as-is)

> > 2. general requirements
> >     - fs errors without file/dir names are useless
> >     - errors in parts of the fs are no reason for a fs to go offline as a whole
> 
> These two are in progress.  Btrfs won't always be able to give a file
> and directory name, but it will be able to give something that can be
> turned into a file or directory name.  You don't want important
> diagnostic messages delayed by name lookup.

That's a point I really never understood. Why is it non-trivial for a fs to
know what file or dir (name) it is currently working on?
It really sounds strange to me that a layer that is managing files on some
device does not know at any time during runtime what file or dir it is
actually handling. If _it_ does not know, how should the _user_, probably hours
later reading the logs, know based on inode numbers or whatever cryptic logs
are thrown out? I mean filenames are nothing more than a human-readable
descriptive data structure, mostly of type char. Their only reason for existence
is readability, so why not in logs?

> 
> >     - mounting must not delay the system startup significantly
> 
> Mounts are fast
> 
> >     - resizing during runtime (up and down)
> 
> Resize is done
> 
> >     - parallel mounts (very important!)
> >       (two or more hosts mount the same fs concurrently for reading and
> >       writing)
> 
> As Jim and Andi have said, parallel mounts are not in the feature list
> for Btrfs.  Network filesystems will provide these features.

Can you explain what "network filesystems" stands for in this statement?
Please name two or three examples.

> >     - journaling
> 
> Btrfs doesn't journal.  The tree logging code is close, it provides
> optimized fsync and O_SYNC operations.  The same basic structures could
> be used for remote replication.
> 
> >     - versioning (file and dir)
> 
> From a data structure point of view, version control is fairly easy.
> From a user interface and policy point of view, it gets difficult very
> quickly.  Aside from snapshotting, version control is outside the scope
> of btrfs.
> 
> There are lots of good version control systems available, I'd suggest
> you use them instead.

To me versioning sounds like a not-so-easy-to-implement feature. Nevertheless
I trust your experience. If a basic implementation is possible and not too
complex, why deny a feature? 

> >     - undelete (file and dir)
> 
> Undelete is easy

Yes, we hear and say that all the time, name one linux fs doing it, please.

> but I think best done at a layer above the FS.

Before we got into the linux community we used n.vell netware. Undelete has
been there since about the first day. More than ten years later (nowadays) it
is still missing in linux. I really do suggest providing _some_ solution and
_then_ let's talk about the _better_ solution.

> >     - snapshots
> 
> Done
> 
> >     - run into hd errors more than once for the same file (as an option)
> 
> Sorry, I'm not sure what you mean here.

If your hd is going dead you often find out that touching broken files takes
ages. If the fs finds out a file is corrupt because the device has errors it
could just flag the file as broken and not re-read the same error a thousand
times more. Obviously you want that as an option, because there can be good
reasons for re-reading dead files...
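
The behaviour described above can be sketched in userspace Python as a small
registry that remembers which files already hit a device error and fails
fast on later reads, with a switch for the "as an option" part. The class
name BrokenFileRegistry and the reader callback are hypothetical; this only
illustrates the policy and says nothing about how a filesystem would
implement it internally.

```python
class BrokenFileRegistry:
    """Remember files that hit a device error so later reads fail fast."""

    def __init__(self, retry_broken=False):
        # retry_broken=True restores the "keep hitting the device" behaviour.
        self.retry_broken = retry_broken
        self.broken = set()

    def read(self, path, reader):
        """Read path via reader(path), skipping files flagged as broken."""
        if path in self.broken and not self.retry_broken:
            raise OSError(f"{path} is flagged as broken; device not touched")
        try:
            data = reader(path)
        except OSError:
            self.broken.add(path)  # flag it so the next read fails fast
            raise
        self.broken.discard(path)  # a successful retry clears the flag
        return data
```

With the default setting, only the first read of a bad file reaches the
device; every later attempt returns an error immediately until the flag is
cleared or retry_broken is switched on.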

> >     - map out dead blocks
> >       (and of course display of the currently mapped out list)
> 
> I agree with Jim on this one.  Drives remap dead sectors, and when they
> stop remapping them, the drive should be replaced.

If your life depends on it, would you use one rope or two to secure yourself?

> 
> >     - no size limitations (more or less)
> >     - performant handling of large numbers of files inside single dirs
> >       (to check that use > 100.000 files in a dir, understand that it is
> >       no good idea to spread inode-blocks over the whole hd because of seek
> >       times)
> 
> Everyone has different ideas on "large" numbers of files inside a single
> dir.  The directory indexing done by btrfs can easily handle 100,000

The story is not really about whether it can but how fast it can. You know that
most time is spent in seeks these days. If you have 100000 blocks to read
right across the whole disk for scanning through a dir (fstat every file) you
will see quite a difference to a situation where the relevant data can be read
within few (or zero) seeks. It's a question of fs layout on the disk.
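
A back-of-the-envelope calculation makes the difference concrete. The
numbers below are assumptions chosen for illustration only: 10 ms per seek,
one seek per file when inode blocks are scattered, and one seek per hundred
files when they are clustered. The helper name scan_time_seconds is made up
for this sketch.

```python
def scan_time_seconds(n_files, seeks_per_file, seek_ms):
    """Estimate the seek time spent on stat()ing n_files, in seconds."""
    return n_files * seeks_per_file * seek_ms / 1000.0


if __name__ == "__main__":
    n_files = 100_000
    seek_ms = 10  # assumed average seek time of a rotating disk

    scattered = scan_time_seconds(n_files, seeks_per_file=1, seek_ms=seek_ms)
    clustered = scan_time_seconds(n_files, seeks_per_file=0.01, seek_ms=seek_ms)

    print(f"scattered inode blocks: about {scattered:.0f} s")
    print(f"clustered inode blocks: about {clustered:.0f} s")
```

Under these assumptions the scattered layout spends roughly 1000 seconds
seeking while the clustered one spends roughly 10, which is the gap the
paragraph above is about.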

> >     - power loss at any time must not corrupt the fs (atomic fs modification)
> >       (new-data loss is acceptable)
> 
> Done.  Btrfs already uses barriers as required for sata drives.
> [...]
> -chris

-- 
Regards,
Stephan

* Re: Some very basic questions
  2008-10-21 16:27   ` Stephan von Krawczynski
@ 2008-10-21 16:59     ` Andi Kleen
  2008-10-22 11:46       ` Stephan von Krawczynski
  2008-10-21 17:49     ` Chris Mason
  1 sibling, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2008-10-21 16:59 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Chris Mason, linux-btrfs

Stephan von Krawczynski <skraw@ithnet.com> writes:
>
> Yes, we hear and say that all the time, name one linux fs doing it, please.

ext[234] support it to some extent. It has some limitations
(especially when the files are large, and you shouldn't do too much follow-on
IO, to prevent the data from being overwritten) and the user frontends are not
very nice, but it's there.

-Andi

-- 
ak@linux.intel.com

* Re: Some very basic questions
  2008-10-21 16:59     ` Andi Kleen
@ 2008-10-22 11:46       ` Stephan von Krawczynski
  0 siblings, 0 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-22 11:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Chris Mason, linux-btrfs

On Tue, 21 Oct 2008 18:59:26 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> Stephan von Krawczynski <skraw@ithnet.com> writes:
> >
> > Yes, we hear and say that all the time, name one linux fs doing it, please.
> 
> ext[234] support it to some extent. It has some limitations
> (especially when the files are large, and you shouldn't do too much follow-on
> IO, to prevent the data from being overwritten) and the user frontends are not
> very nice, but it's there.

Well, they must be pretty ugly, I really never heard of that. But really, it
is not very important, because extX is completely useless with TB-size disks
unless you feel good waiting hours for fsck (I did, and will never do again). 
_All_ customers for whom we deployed ext3 urged us to go back to reiserfs3 ...

> -Andi

-- 
Regards,
Stephan

* Re: Some very basic questions
  2008-10-21 16:27   ` Stephan von Krawczynski
  2008-10-21 16:59     ` Andi Kleen
@ 2008-10-21 17:49     ` Chris Mason
  2008-10-22 12:19       ` Stephan von Krawczynski
  1 sibling, 1 reply; 79+ messages in thread
From: Chris Mason @ 2008-10-21 17:49 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-btrfs

On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote:

> > > 2. general requirements
> > >     - fs errors without file/dir names are useless
> > >     - errors in parts of the fs are no reason for a fs to go offline as a whole
> > 
> > These two are in progress.  Btrfs won't always be able to give a file
> > and directory name, but it will be able to give something that can be
> > turned into a file or directory name.  You don't want important
> > diagnostic messages delayed by name lookup.
> 
> That's a point I really never understood. Why is it non-trivial for a fs to
> know what file or dir (name) it is currently working on?

The name lives in block A, but you might find a corruption while
processing block B.  Block A might not be in ram anymore, or it might be
in ram but locked by another process.

On top of all of that, when we print errors it's because things haven't
gone well.  They are deep inside of various parts of the filesystem, and
we might not be able to take the required locks or read from the disk in
order to find the name of the thing we're operating on.
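
To illustrate the "something that can be turned into a file or directory
name" part: assuming the error message carries an inode number, the name
can be looked up later from userspace, when nothing is locked and a slow
walk does no harm. The sketch below does roughly what "find -inum" does;
the function name paths_for_inode is made up for this example, and the
sketch is not meant to describe how Btrfs itself resolves names.

```python
import os


def paths_for_inode(root, inode):
    """Return every path under root (same filesystem) with this inode number."""
    root_dev = os.lstat(root).st_dev
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            # Inode numbers are only unique within one filesystem.
            if st.st_dev == root_dev and st.st_ino == inode:
                matches.append(path)
    return matches
```

A file with several hard links yields several paths, which is one more
reason the lookup is easier to do after the fact than in the middle of
error handling.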

> > 
> > >     - mounting must not delay the system startup significantly
> > 
> > Mounts are fast
> > 
> > >     - resizing during runtime (up and down)
> > 
> > Resize is done
> > 
> > >     - parallel mounts (very important!)
> > >       (two or more hosts mount the same fs concurrently for reading and
> > >       writing)
> > 
> > As Jim and Andi have said, parallel mounts are not in the feature list
> > for Btrfs.  Network filesystems will provide these features.
> 
> Can you explain what "network filesystems" stands for in this statement,
> please name two or three examples.
> 
NFS (done), CRFS (under development), maybe ceph as well, which is also
under development.

> > >     - journaling
> > 
> > Btrfs doesn't journal.  The tree logging code is close, it provides
> > optimized fsync and O_SYNC operations.  The same basic structures could
> > be used for remote replication.
> > 
> > >     - versioning (file and dir)
> > 
> > From a data structure point of view, version control is fairly easy.
> > From a user interface and policy point of view, it gets difficult very
> > quickly.  Aside from snapshotting, version control is outside the scope
> > of btrfs.
> > 
> > There are lots of good version control systems available, I'd suggest
> > you use them instead.
> 
> To me versioning sounds like a not-so-easy-to-implement feature. Nevertheless
> I trust your experience. If a basic implementation is possible and not too
> complex, why deny a feature? 
> 

In general I think snapshotting solves enough of the problem for most of
the people most of the time.  I'd love for Btrfs to be the perfect FS,
but I'm afraid everyone has a different definition of perfect.

Storing multiple versions of something is pretty easy.  Making a usable
interface around those versions is the hard part, especially because you
need groups of files to be versioned together in atomic groups
(something that looks a lot like a snapshot).

Versioning is solved in userspace.  We would never be able to implement
everything that git or mercurial can do inside the filesystem.

> > >     - undelete (file and dir)
> > 
> > Undelete is easy
> 
> Yes, we hear and say that all the time, name one linux fs doing it, please.
> 

The fact that nobody is doing it is not a good argument for why it
should be done ;)  Undelete is a policy decision about what to do with
files as they are removed.  I'd much rather see it implemented above the
filesystems instead of individually in each filesystem.

This doesn't mean I'll never code it, it just means it won't get
implemented directly inside of Btrfs.  In comparison with all of the
other features pending, undelete is pretty far down on the list.

> > but I think best done at a layer above the FS.
> 
> Before we got into the linux community we used n.vell netware. Undelete has
> been there since about the first day. More than ten years later (nowadays) it
> is still missing in linux. I really do suggest providing _some_ solution and
> _then_ let's talk about the _better_ solution.
> 
> > >     - snapshots
> > 
> > Done
> > 
> > >     - run into hd errors more than once for the same file (as an option)
> > 
> > Sorry, I'm not sure what you mean here.
> 
> If your hd is going dead you often find out that touching broken files takes
> ages. If the fs finds out a file is corrupt because the device has errors it
> could just flag the file as broken and not re-read the same error a thousand
> times more. Obviously you want that as an option, because there can be good
> reasons for re-reading dead files...

I really agree that we want to avoid beating on a dead drive.

Btrfs will record some error information about the drive so it can
decide what to do with failures.  But, remembering that sector #12345768
is bad doesn't help much.  When the drive returned the IO error it
remapped the sector and the next write will probably succeed.

> 
> > >     - map out dead blocks
> > >       (and of course display of the currently mapped out list)
> > 
> > I agree with Jim on this one.  Drives remap dead sectors, and when they
> > stop remapping them, the drive should be replaced.
> 
> If your life depends on it, would you use one rope or two to secure yourself?
> 

Btrfs will keep the dead drive around as a fallback for sectors that
fail on the other mirrors when data is being rebuilt.  Beyond that,
we'll expect you to toss the bad drive once the rebuild has finished.

There's an interesting paper about how netapp puts the drive into rehab
and is able to avoid service calls by rewriting the bad sectors and
checking them over.  That's a little ways off for Btrfs.

> > 
> > >     - no size limitations (more or less)
> > >     - performant handling of large numbers of files inside single dirs
> > >       (to check that use > 100.000 files in a dir, understand that it is
> > >       no good idea to spread inode-blocks over the whole hd because of seek
> > >       times)
> > 
> > Everyone has different ideas on "large" numbers of files inside a single
> > dir.  The directory indexing done by btrfs can easily handle 100,000
> 
> The story is not really about whether it can but how fast it can. You know that
> most time is spent in seeks these days. If you have 100000 blocks to read
> right across the whole disk for scanning through a dir (fstat every file) you
> will see quite a difference to a situation where the relevant data can be read
> within few (or zero) seeks. It's a question of fs layout on the disk.
> 

Yes, btrfs already performs well in this workload.

-chris

* Re: Some very basic questions
  2008-10-21 17:49     ` Chris Mason
@ 2008-10-22 12:19       ` Stephan von Krawczynski
  2008-10-22 12:48         ` Jeff Schroeder
                           ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-22 12:19 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Tue, 21 Oct 2008 13:49:43 -0400
Chris Mason <chris.mason@oracle.com> wrote:

> On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote:
> 
> > > > 2. general requirements
> > > >     - fs errors without file/dir names are useless
> > > >     - errors in parts of the fs are no reason for a fs to go offline as a whole
> > > 
> > > These two are in progress.  Btrfs won't always be able to give a file
> > > and directory name, but it will be able to give something that can be
> > > turned into a file or directory name.  You don't want important
> > > diagnostic messages delayed by name lookup.
> > 
> > That's a point I really never understood. Why is it non-trivial for a fs to
> > know what file or dir (name) it is currently working on?
> 
> The name lives in block A, but you might find a corruption while
> processing block B.  Block A might not be in ram anymore, or it might be
> in ram but locked by another process.
> 
> On top of all of that, when we print errors it's because things haven't
> gone well.  They are deep inside of various parts of the filesystem, and
> we might not be able to take the required locks or read from the disk in
> order to find the name of the thing we're operating on.
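
A minimal sketch of how an identifier can be turned back into a name after the fact, outside the error path. It assumes the identifier reported is an inode number (the thread does not say what Btrfs will actually print), and `paths_for_inode` is a hypothetical helper, not a Btrfs tool.

```python
import os


def paths_for_inode(root, inode):
    """Return every path below root whose inode number equals inode."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat: look at the entry itself, do not follow symlinks
                if os.lstat(path).st_ino == inode:
                    matches.append(path)
            except OSError:
                # entry vanished or is unreadable; skip it
                continue
    return sorted(matches)
```

Walking the whole tree is slow, which fits the argument above: the lookup is acceptable for an admin reading the log later, but not something to do while the error is being reported.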

Ok, this is interesting. In another thread I was told parallel mounts are
really complex and you cannot do good things in such an environment that you
can do with a single mount. Well, then, why don't we do it? All boxes I know
have tons of RAM, but the fs finds no place in RAM to put large parts (if not
all) of the structural fs data including filenames? Besides the simple fact that
RAM is always faster than any known disk, be it rotating or not, and that RAM
is just there, what's the reason for not doing it?

> > > >     - parallel mounts (very important!)
> > > >       (two or more hosts mount the same fs concurrently for reading and
> > > >       writing)
> > > 
> > > As Jim and Andi have said, parallel mounts are not in the feature list
> > > for Btrfs.  Network filesystems will provide these features.
> > 
> > Can you explain what "network filesystems" stands for in this statement,
> > please name two or three examples.
> > 
> NFS (done) CRFS (under development), maybe ceph as well which is also
> under development.

NFS is a good example of a fs that never got redesigned for the modern world. I
hope it will, but currently it's like a Model T on a highway.
You have an NFS server with clients. Your NFS server dies, your backup server
cannot take over the clients without them resetting their NFS-link (which
means a reboot for many applications) - no way.
Besides that you still need another fs below NFS to bring your data onto some
medium, which means you still have the problem of how to create redundancy in
your server architecture.

> > > >     - versioning (file and dir)
> > > 
> > > From a data structure point of view, version control is fairly easy.
> > > From a user interface and policy point of view, it gets difficult very
> > > quickly.  Aside from snapshotting, version control is outside the scope
> > > of btrfs.
> > > 
> > > There are lots of good version control systems available, I'd suggest
> > > you use them instead.
> > 
> > To me versioning sounds like a not-so-easy-to-implement feature. Nevertheless
> > I trust your experience. If a basic implementation is possible and not too
> > complex, why deny a feature? 
> > 
> 
> In general I think snapshotting solves enough of the problem for most of
> the people most of the time.  I'd love for Btrfs to be the perfect FS,
> but I'm afraid everyone has a different definition of perfect.
> 
> Storing multiple versions of something is pretty easy.  Making a usable
> interface around those versions is the hard part, especially because you
> need groups of files to be versioned together in atomic groups
> (something that looks a lot like a snapshot).
> 
> Versioning is solved in userspace.  We would never be able to implement
> everything that git or mercurial can do inside the filesystem.

Well, quite often the question is not about whole trees of data to be
versioned. Even single (few) files or dirs can be of interest. And you want
people to set up a complete user space monster to version three openoffice
documents (only a rather flawed example of course)? 
Lots of people need a basic solution, not the groundbreaking answer to all
questions.

> > > >     - undelete (file and dir)
> > > 
> > > Undelete is easy
> > 
> > Yes, we hear and say that all the time, name one linux fs doing it, please.
> > 
> 
> The fact that nobody is doing it is not a good argument for why it
> should be done ;)

Believe me, if NTFS had a simple undelete tool come with it, we (in linux fs)
would have it, too. Why do we always want to be _second best_?

>  Undelete is a policy decision about what to do with
> files as they are removed.  I'd much rather see it implemented above the
> filesystems instead of individually in each filesystem.
> 
> This doesn't mean I'll never code it, it just means it won't get
> implemented directly inside of Btrfs.  In comparison with all of the
> other features pending, undelete is pretty far down on the list.
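
One way to picture undelete as a policy above the filesystem is a trash directory: removal becomes a rename, and undelete is the rename back. This is only a minimal sketch; `soft_delete`, `undelete` and the trash layout are illustrative names, not an existing tool.

```python
import os
import shutil
import time


def soft_delete(path, trash_dir):
    """Move path into trash_dir instead of unlinking it; return the new location."""
    os.makedirs(trash_dir, exist_ok=True)
    # Prefix with a timestamp so files with the same name do not collide.
    dest = os.path.join(trash_dir, f"{time.time_ns()}-{os.path.basename(path)}")
    shutil.move(path, dest)
    return dest


def undelete(trashed_path, original_path):
    """Move a trashed file back to where it came from."""
    shutil.move(trashed_path, original_path)
```

The policy questions (how long to keep trashed files, who may restore them, what happens when the disk fills up) are untouched by this sketch, and they are the part that differs from site to site.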

Nobody talks about a solution for a problem he does not have; it's of minor
priority. Up to the day he needs it, of course. Suddenly the priority jumps
up :-)
Come on, it is simple and it is useful and it is a question that will never
arise again after its solution.

> > If your hd is going dead you often find out that touching broken files takes
> > ages. If the fs finds out a file is corrupt because the device has errors it
> > could just flag the file as broken and not re-read the same error a thousand
> > times more. Obviously you want that as an option, because there can be good
> > reasons for re-reading dead files...
> 
> I really agree that we want to avoid beating on a dead drive.
> 
> Btrfs will record some error information about the drive so it can
> decide what to do with failures.  But, remembering that sector #12345768
> is bad doesn't help much.  When the drive returned the IO error it
> remapped the sector and the next write will probably succeed.

The problem with probability is that software is pretty bad at judging it.
That's why my proposal was: let's do it and make it configurable for an admin
who has a better idea of the current probability.

> > > >     - map out dead blocks
> > > >       (and of course display of the currently mapped out list)
> > > 
> > > I agree with Jim on this one.  Drives remap dead sectors, and when they
> > > stop remapping them, the drive should be replaced.
> > 
> > If your life depends on it, would you use one rope or two to secure yourself?
> > 
> 
> Btrfs will keep the dead drive around as a fallback for sectors that
> fail on the other mirrors when data is being rebuilt.  Beyond that,
> we'll expect you to toss the bad drive once the rebuild has finished.
> 
> There's an interesting paper about how netapp puts the drive into rehab
> and is able to avoid service calls by rewriting the bad sectors and
> checking them over.  That's a little ways off for Btrfs.

It will become more interesting what remapping means in a world full of
flash-disks. Does it mean a disk must be replaced when some or even lots of
sectors are dead? How about being faster in understanding we don't know all
future parameters than in buying?

> [...]
> -chris

-- 
Regards,
Stephan

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 12:19       ` Stephan von Krawczynski
@ 2008-10-22 12:48         ` Jeff Schroeder
  2008-10-22 14:02           ` Stephan von Krawczynski
  2008-10-22 13:50         ` Chris Mason
  2008-10-24  8:39         ` Chris Samuel
  2 siblings, 1 reply; 79+ messages in thread
From: Jeff Schroeder @ 2008-10-22 12:48 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Chris Mason, linux-btrfs

On Wed, Oct 22, 2008 at 5:19 AM, Stephan von Krawczynski
<skraw@ithnet.com> wrote:
> On Tue, 21 Oct 2008 13:49:43 -0400
> Chris Mason <chris.mason@oracle.com> wrote:
>
>> On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote:
>>
>> > > > 2. general requirements
>> > > >     - fs errors without file/dir names are useless
>> > > >     - errors in parts of the fs are no reason for a fs to go offline as a whole
>> > >
>> > > These two are in progress.  Btrfs won't always be able to give a file
>> > > and directory name, but it will be able to give something that can be
>> > > turned into a file or directory name.  You don't want important
>> > > diagnostic messages delayed by name lookup.
>> >
>> > That's a point I really never understood. Why is it non-trivial for a fs to
>> > know what file or dir (name) it is currently working on?
>>
>> The name lives in block A, but you might find a corruption while
>> processing block B.  Block A might not be in ram anymore, or it might be
>> in ram but locked by another process.
>>
>> On top of all of that, when we print errors it's because things haven't
>> gone well.  They are deep inside of various parts of the filesystem, and
>> we might not be able to take the required locks or read from the disk in
>> order to find the name of the thing we're operating on.
>
> Ok, this is interesting. In another thread I was told parallel mounts are
> really complex and you cannot do good things in such an environment that you
> can do with single mount. Well, then, why don't we do it? All boxes I know
> have tons of RAM, but fs finds no place in RAM to put large parts (if not all)
> of the structural fs data including filenames? Besides the simple fact that
> RAM is always faster than any known disk be it rotating or not, and that RAM
> is just there, whats the word for not doing it?

Google "Daniel Phillips Ramback faster than a speeding bullet". He is on this
list and may have some insight.

>> > > >     - parallel mounts (very important!)
>> > > >       (two or more hosts mount the same fs concurrently for reading and
>> > > >       writing)
>> > >
>> > > As Jim and Andi have said, parallel mounts are not in the feature list
>> > > for Btrfs.  Network filesystems will provide these features.
>> >
>> > Can you explain what "network filesystems" stands for in this statement,
>> > please name two or three examples.
>> >
>> NFS (done) CRFS (under development), maybe ceph as well which is also
>> under development.
>
> NFS is a good example for a fs that never got redesigned for modern world. I
> hope it will, but currently it's like Model T on a highway.
> You have a NFS server with clients. Your NFS server dies, your backup server
> cannot take over the clients without them resetting their NFS-link (which
> means reboot to many applications) - no way.
> Besides that you still need another fs below NFS to bring your data onto some
> medium, which means you still have the problem how to create redundancy in
> your server architecture.

You are somewhat misinformed on this. Perhaps the Linux nfs server can't cope,
but I doubt it. NFS was designed to be stateless. I've got a fair amount of
experience with a dual head netapp architecture. When 1 head dies, the other
transparently fails over. During the brief downtime, the clients will go into
I/O wait if at all instead of being disconnected. You might be able to do
something similar using nfsd and keepalived if both servers were connected to
the same storage. Setting that up would be trivial. You just need the clients
mounting the vip and a reliable mechanism to provide the data from that vip.
You could use heartbeat, but it is overly complex. Also look at clustered nfs
or pnfs, both of which are nfs redesigns like you speak of.

-- 
Jeff Schroeder

Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 12:48         ` Jeff Schroeder
@ 2008-10-22 14:02           ` Stephan von Krawczynski
  0 siblings, 0 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-22 14:02 UTC (permalink / raw)
  To: jeffschroeder; +Cc: Jeff Schroeder, Chris Mason, linux-btrfs

On Wed, 22 Oct 2008 05:48:30 -0700
"Jeff Schroeder" <jeffschroed@gmail.com> wrote:

> > NFS is a good example for a fs that never got redesigned for modern world. I
> > hope it will, but currently it's like Model T on a highway.
> > You have a NFS server with clients. Your NFS server dies, your backup server
> > cannot take over the clients without them resetting their NFS-link (which
> > means reboot to many applications) - no way.
> > Besides that you still need another fs below NFS to bring your data onto some
> > medium, which means you still have the problem how to create redundancy in
> > your server architecture.
> 
> You are somewhat misinformed on this. Perhaps the Linux nfs server can't cope,
> but I doubt it. NFS was designed to be stateless. I've got a fair amount of
> experience with a dual head netapp architecture. When 1 head dies, the other
> transparently fails over. During the brief downtime, the clients will go into
> I/O wait if at all instead of being disconnected. You might be able to do
> something similar using nfsd and keepalived if both servers were connected to
> the same storage. Setting that up would be trivial. You just need the clients
> mounting the vip and a reliable mechanism to provide the data from that vip.
> You could use heartbeat, but it is overly complex. Also look at clustered nfs
> or pnfs, both of which are nfs redesigns like you speak of.

We tried that with pure linux nfs, and it does not work. The clients do not
recover. After trying ourselves and failing we found several docs on the net
that described just the same problem and its reasons. Very likely netapp found
that out too and did something about it.

Ah yes, and btw, your description contains another discussed problem: "both
servers were connected to the same storage". If you mean that both servers
really access the same storage at the same time your software options are
pretty few in number.

> -- 
> Jeff Schroeder

-- 
Regards,
Stephan


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 12:19       ` Stephan von Krawczynski
  2008-10-22 12:48         ` Jeff Schroeder
@ 2008-10-22 13:50         ` Chris Mason
  2008-10-22 14:04           ` Matthias Wächter
  2008-10-24  8:42           ` Chris Samuel
  2008-10-24  8:39         ` Chris Samuel
  2 siblings, 2 replies; 79+ messages in thread
From: Chris Mason @ 2008-10-22 13:50 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-btrfs

On Wed, 2008-10-22 at 14:19 +0200, Stephan von Krawczynski wrote:
> On Tue, 21 Oct 2008 13:49:43 -0400
> Chris Mason <chris.mason@oracle.com> wrote:
> 
> > On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote:
> > 
> > > > > 2. general requirements
> > > > >     - fs errors without file/dir names are useless
> > > > >     - errors in parts of the fs are no reason for a fs to go offline as a whole
> > > > 
> > > > These two are in progress.  Btrfs won't always be able to give a file
> > > > and directory name, but it will be able to give something that can be
> > > > turned into a file or directory name.  You don't want important
> > > > diagnostic messages delayed by name lookup.
> > > 
> > > That's a point I really never understood. Why is it non-trivial for a fs to
> > > know what file or dir (name) it is currently working on?
> > 
> > The name lives in block A, but you might find a corruption while
> > processing block B.  Block A might not be in ram anymore, or it might be
> > in ram but locked by another process.
> > 
> > On top of all of that, when we print errors it's because things haven't
> > gone well.  They are deep inside of various parts of the filesystem, and
> > we might not be able to take the required locks or read from the disk in
> > order to find the name of the thing we're operating on.
> 
> Ok, this is interesting. In another thread I was told parallel mounts are
> really complex and you cannot do good things in such an environment that you
> can do with single mount. Well, then, why don't we do it? All boxes I know
> have tons of RAM, but fs finds no place in RAM to put large parts (if not all)
> of the structural fs data including filenames?

I'm afraid it just isn't practical to keep all of the metadata in ram
all of the time.

>  Besides the simple fact that
> RAM is always faster than any known disk be it rotating or not, and that RAM
> is just there, whats the word for not doing it?
> 

People expect the OS to use the expensive ram for the data they use most
often.

> > > > >     - parallel mounts (very important!)
> > > > >       (two or more hosts mount the same fs concurrently for reading and
> > > > >       writing)
> > > > 
> > > > As Jim and Andi have said, parallel mounts are not in the feature list
> > > > for Btrfs.  Network filesystems will provide these features.
> > > 
> > > Can you explain what "network filesystems" stands for in this statement,
> > > please name two or three examples.
> > > 
> > NFS (done) CRFS (under development), maybe ceph as well which is also
> > under development.
> 
> NFS is a good example for a fs that never got redesigned for modern world. I
> hope it will, but currently it's like Model T on a highway.
> You have a NFS server with clients. Your NFS server dies, your backup server
> cannot take over the clients without them resetting their NFS-link (which
> means reboot to many applications) - no way.
> Besides that you still need another fs below NFS to bring your data onto some
> medium, which means you still have the problem how to create redundancy in
> your server architecture.
> 

As someone else replied, NFS is stateless, and they have made a large
number of design tradeoffs to stay that way.  So, your example above
isn't quite fair, it is one of the things the NFS protocol can handle
well.

With that said, CRFS is a network filesystem designed explicitly for
btrfs, and I have high hopes for it.

> > > > >     - versioning (file and dir)
> > > > 
> > > > From a data structure point of view, version control is fairly easy.
> > > > From a user interface and policy point of view, it gets difficult very
> > > > quickly.  Aside from snapshotting, version control is outside the scope
> > > > of btrfs.
> > > > 
> > > > There are lots of good version control systems available, I'd suggest
> > > > you use them instead.
> > > 
> > > To me versioning sounds like a not-so-easy-to-implement feature. Nevertheless
> > > I trust your experience. If a basic implementation is possible and not too
> > > complex, why deny a feature? 
> > > 
> > 
> > In general I think snapshotting solves enough of the problem for most of
> > the people most of the time.  I'd love for Btrfs to be the perfect FS,
> > but I'm afraid everyone has a different definition of perfect.
> > 
> > Storing multiple versions of something is pretty easy.  Making a usable
> > interface around those versions is the hard part, especially because you
> > need groups of files to be versioned together in atomic groups
> > (something that looks a lot like a snapshot).
> > 
> > Versioning is solved in userspace.  We would never be able to implement
> > everything that git or mercurial can do inside the filesystem.
> 
> Well, quite often the question is not about whole trees of data to be
> versioned. Even single (few) files or dirs can be of interest. And you want
> people to set up a complete user space monster to version three openoffice
> documents (only a rather flawed example of course)? 
> Lots of people need a basic solution, not the groundbreaking answer to all
> questions.
> 
One of the things that makes FS design so difficult is that people try
to solve lots of problems with filesystems.  Every feature we include is
a mixture of disk format, policy and userland interface that must be
tested in combination with all of the other features, and maintained
pretty much forever.

A big part of my job is to find the features that are sufficient to
justify the expense of starting from scratch, and to get things finished
within a reasonable amount of time.

Btrfs already has an ioctl to create a COW copy of a file (see the bcp
command in btrfs-progs).  This is enough for applications to do their
own single file versioning.
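
A rough sketch of what application-level single-file versioning on top of a COW copy could look like. The ioctl interface of that time is not shown in this thread, so this assumes the `FICLONE` ioctl of current Linux kernels; `save_version` and the `.vN` naming are made up for illustration. On filesystems without reflink support it falls back to a plain copy.

```python
import fcntl
import os
import shutil

# Assumption: Linux FICLONE, i.e. _IOW(0x94, 9, int).
FICLONE = 0x40049409


def save_version(path, version):
    """Copy path to '<path>.v<version>', sharing extents where the fs allows it."""
    dest = f"{path}.v{version}"
    with open(path, "rb") as src, open(dest, "wb") as dst:
        try:
            # Ask the filesystem to make dst a COW clone of src.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return dest
        except OSError:
            pass  # no reflink support here; fall through to a normal copy
    shutil.copyfile(path, dest)
    return dest
```

An application would call `save_version(doc, n)` before each overwrite. When the clone succeeds, the copy costs no extra data blocks until one of the two files is modified.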

I understand this isn't the automatic system you would like for the use
case above, but I have to draw the line somewhere in terms of providing
the tools needed to implement features vs including all the features in
the FS.

A big part of why Btrfs is gaining ground today is that we're focusing
on finishing the features we have instead of adding the kitchen sink.
It is very hard to say no to interested users, but it's a reality of
actually bringing the software to market.

> > > If your hd is going dead you often find out that touching broken files takes
> > > ages. If the fs finds out a file is corrupt because the device has errors it
> > > could just flag the file as broken and not re-read the same error a thousand
> > > times more. Obviously you want that as an option, because there can be good
> > > reasons for re-reading dead files...
> > 
> > I really agree that we want to avoid beating on a dead drive.
> > 
> > Btrfs will record some error information about the drive so it can
> > decide what to do with failures.  But, remembering that sector #12345768
> > is bad doesn't help much.  When the drive returned the IO error it
> > remapped the sector and the next write will probably succeed.
> 
> Problem with probability is that software is pretty bad in judging. That's why
> my proposal was, lets do it and make it configurable for an admin that has a
> better idea of the current probability.
> 

Let me reword my answer ;).  The next write will always succeed unless
the drive is out of remapping sectors.  If the drive is out, it is only
good for reads and holding down paper on your desk.

This means we'll want to do a raid rebuild, which won't use that drive
unless something horrible has gone wrong.

> > > > >     - map out dead blocks
> > > > >       (and of course display of the currently mapped out list)
> > > > 
> > > > I agree with Jim on this one.  Drives remap dead sectors, and when they
> > > > stop remapping them, the drive should be replaced.
> > > 
> > > If your life depends on it, would you use one rope or two to secure yourself?
> > > 
> > 
> > Btrfs will keep the dead drive around as a fallback for sectors that
> > fail on the other mirrors when data is being rebuilt.  Beyond that,
> > we'll expect you to toss the bad drive once the rebuild has finished.
> > 
> > There's an interesting paper about how netapp puts the drive into rehab
> > and is able to avoid service calls by rewriting the bad sectors and
> > checking them over.  That's a little ways off for Btrfs.
> 
> It will become more interesting what remapping means in a world full of
> flash-disks. Does it mean a disk must be replaced when some or even lots of
> sectors are dead? 

Yes the disk must be replaced.  Our job here is not to provide people
with hope they can get to some of the data some of the time.  Our job is
to tell them a given component is bad and to have it replaced.

-chris



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 13:50         ` Chris Mason
@ 2008-10-22 14:04           ` Matthias Wächter
  2008-10-22 14:32             ` Ric Wheeler
  2008-10-24  8:42           ` Chris Samuel
  1 sibling, 1 reply; 79+ messages in thread
From: Matthias Wächter @ 2008-10-22 14:04 UTC (permalink / raw)
  To: Chris Mason; +Cc: Stephan von Krawczynski, linux-btrfs

On 10/22/2008 3:50 PM, Chris Mason wrote:

> Let me reword my answer ;).  The next write will always succeed unless
> the drive is out of remapping sectors.  If the drive is out, it is only
> good for reads and holding down paper on your desk.

I have a fairly new SATA disk with about 3000 hours of 24/7 duty
(very light load), 0 remapped sectors and 8 consecutive sectors with
read/write errors. Still, it did not perform remapping when facing heavy
writes on the bad sectors. Now what? For whatever reason, remapping
does not always work (or mine was produced with a total of zero
remapping sectors…).

- Matthias
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 14:04           ` Matthias Wächter
@ 2008-10-22 14:32             ` Ric Wheeler
  2008-10-22 14:44               ` jim owens
  0 siblings, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 14:32 UTC (permalink / raw)
  To: Matthias Wächter; +Cc: Chris Mason, Stephan von Krawczynski, linux-btrfs

Matthias Wächter wrote:
> On 10/22/2008 3:50 PM, Chris Mason wrote:
>
>> Let me reword my answer ;).  The next write will always succeed unless
>> the drive is out of remapping sectors.  If the drive is out, it is only
>> good for reads and holding down paper on your desk.
>
> I have a fairly new SATA disk with about 3000 hours of 24/7 duty
> (very light load), 0 remapped sectors and 8 consecutive sectors with
> read/write errors. Still, it did not perform remapping facing heavy
> writes on the bad sectors. Now what? For whatever reason, remapping
> not always works (or mine was produced with a total of zero
> remapping sectors…).
>
> - Matthias

It sounds like this drive is actually fine; you might have seen some
transient issues.

Are you positive that the writes went directly to the sectors in
question? That should either clear the error or cause it to remap the
sectors internally. (Reads will continue to fail).

Mark Lord has added some options to hdparm that you might be able to use
to expressly clear the sectors in question in a more direct way.

Regards,

Ric


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 14:32             ` Ric Wheeler
@ 2008-10-22 14:44               ` jim owens
  0 siblings, 0 replies; 79+ messages in thread
From: jim owens @ 2008-10-22 14:44 UTC (permalink / raw)
  To: Matthias Wächter; +Cc: Stephan von Krawczynski, linux-btrfs

Ric Wheeler wrote:
> Matthias Wächter wrote:
>> On 10/22/2008 3:50 PM, Chris Mason wrote:
>>
>>> Let me reword my answer ;).  The next write will always succeed unless
>>> the drive is out of remapping sectors.  If the drive is out, it is only
>>> good for reads and holding down paper on your desk.
>>
>> I have a fairly new SATA disk with about 3000 hours of 24/7 duty
>> (very light load), 0 remapped sectors and 8 consecutive sectors with
>> read/write errors. Still, it did not perform remapping facing heavy
>> writes on the bad sectors. Now what? For whatever reason, remapping
>> not always works (or mine was produced with a total of zero
>> remapping sectors…).
>>
>> - Matthias
>
> It sounds like this drive is actually fine, you might have seen some
> transient issues.
>
> Are you positive that the writes went directly to the sectors in
> question - that should either clear the error or cause it to remap the
> sectors internally. (Reads will continue to fail).
>
> Mark Lord has added some options to hdparm that you might be able to use
> to expressly clear the sectors in question in a more direct way.

Let me add 2 other thoughts from my experience with other drive types:

    - check for firmware updates.

    - some drives have a remapping mode where it fails the write,
      reports to the host, then the host will send a remap-this-sector
      command.  this mode might be selectable on the drive. if the
      host driver does not do the remap that sector will continue
      to fail.

jim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 13:50         ` Chris Mason
  2008-10-22 14:04           ` Matthias Wächter
@ 2008-10-24  8:42           ` Chris Samuel
  1 sibling, 0 replies; 79+ messages in thread
From: Chris Samuel @ 2008-10-24  8:42 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 353 bytes --]

On Thu, 23 Oct 2008 12:50:33 am Chris Mason wrote:

> As someone else replied, NFS is stateless

NFS up to and including v3 is, but NFSv4 is stateful.

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 12:19       ` Stephan von Krawczynski
  2008-10-22 12:48         ` Jeff Schroeder
  2008-10-22 13:50         ` Chris Mason
@ 2008-10-24  8:39         ` Chris Samuel
  2 siblings, 0 replies; 79+ messages in thread
From: Chris Samuel @ 2008-10-24  8:39 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 740 bytes --]

On Wed, 22 Oct 2008 11:19:06 pm Stephan von Krawczynski wrote:

> You have a NFS server with clients. Your NFS server dies, your backup
> server cannot take over the clients without them resetting their NFS-link
> (which means reboot to many applications) - no way.

We're getting way off btrfs here, but did you set the fsids for all your
exports on the primary and backup NFS servers and make sure they were also set
to the same values?

e.g.

/home   10.0.0.0/255.0.0.0(async,no_subtree_check,rw,fsid=0111)
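
A quick way to check this across two servers is to compare the `fsid=` values per exported path. The sketch below only assumes the exports syntax shown in the example line; `export_fsids` and `mismatched_exports` are hypothetical helper names, and the parsing is deliberately naive.

```python
import re


def export_fsids(exports_text):
    """Map each exported path to the set of fsid values found in its options."""
    fsids = {}
    for line in exports_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)           # path, then client(options) entries
        options = parts[1] if len(parts) > 1 else ""
        fsids[parts[0]] = set(re.findall(r"fsid=([^,)\s]+)", options))
    return fsids


def mismatched_exports(primary_text, backup_text):
    """Paths on the primary whose fsid is missing or differs on the backup."""
    primary = export_fsids(primary_text)
    backup = export_fsids(backup_text)
    return sorted(
        path for path, ids in primary.items()
        if not ids or ids != backup.get(path)
    )
```

Feeding it the contents of /etc/exports from both machines should return an empty list when every export carries the same fsid on both sides.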

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-21 13:59 ` Chris Mason
  2008-10-21 16:09   ` Andi Kleen
  2008-10-21 16:27   ` Stephan von Krawczynski
@ 2008-10-21 20:54   ` Eric Anopolsky
  2008-10-21 22:18     ` Ric Wheeler
  2 siblings, 1 reply; 79+ messages in thread
From: Eric Anopolsky @ 2008-10-21 20:54 UTC (permalink / raw)
  To: Chris Mason; +Cc: Stephan von Krawczynski, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 359 bytes --]

On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
> >     - power loss at any time must not corrupt the fs (atomic fs modification)
> >       (new-data loss is acceptable)
> 
> Done.  Btrfs already uses barriers as required for sata drives.

Aren't there situations in which write barriers don't do what they're
supposed to do?

Cheers,
Eric



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-21 20:54   ` Eric Anopolsky
@ 2008-10-21 22:18     ` Ric Wheeler
  2008-10-22  2:29       ` Eric Anopolsky
  0 siblings, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-21 22:18 UTC (permalink / raw)
  To: Eric Anopolsky; +Cc: Chris Mason, Stephan von Krawczynski, linux-btrfs

Eric Anopolsky wrote:
> On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
>   
>>>     - power loss at any time must not corrupt the fs (atomic fs modification)
>>>       (new-data loss is acceptable)
>>>       
>> Done.  Btrfs already uses barriers as required for sata drives.
>>     
>
> Aren't there situations in which write barriers don't do what they're
> supposed to do?
>
> Cheers,
> Eric
>
>   
If the drive effectively "lies" to you about flushing the write cache, 
you might have an issue. I have not seen that first hand with recent 
disk drives (and I have seen a lot :-))
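
For comparison, this is roughly what an application does on its side to get the "new-data loss is acceptable, corruption is not" behaviour for a single file. It is a sketch, not anything Btrfs-specific: `durable_write` is an illustrative name, and it assumes a POSIX system where a directory can be opened and fsynced.

```python
import os


def durable_write(path, data):
    """Replace path with data so that a crash leaves either the old or the new file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # ask the kernel to push the data to the device
    os.replace(tmp, path)         # atomic rename over the old file
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)           # make the rename itself durable
    finally:
        os.close(dirfd)
```

All of this still depends on the point above: if the drive reports a cache flush as complete without having done it, no amount of fsync calls in the application can help.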

Ric



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-21 22:18     ` Ric Wheeler
@ 2008-10-22  2:29       ` Eric Anopolsky
  2008-10-22 10:42         ` Ric Wheeler
  0 siblings, 1 reply; 79+ messages in thread
From: Eric Anopolsky @ 2008-10-22  2:29 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Chris Mason, Stephan von Krawczynski, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1976 bytes --]

On Tue, 2008-10-21 at 18:18 -0400, Ric Wheeler wrote:
> Eric Anopolsky wrote:
> > On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
> >   
> >>>     - power loss at any time must not corrupt the fs (atomic fs modification)
> >>>       (new-data loss is acceptable)
> >>>       
> >> Done.  Btrfs already uses barriers as required for sata drives.
> >>     
> >
> > Aren't there situations in which write barriers don't do what they're
> > supposed to do?
> >
> > Cheers,
> > Eric
> >
> >   
> If the drive effectively "lies" to you about flushing the write cache, 
> you might have an issue. I have not seen that first hand with recent 
> disk drives (and I have seen a lot :-))

That does not match the understanding I get from reading the
notes/caveats section of Documentation/block/barrier.txt:

"Note that block drivers must not requeue preceding requests while
completing latter requests in an ordered sequence.  Currently, no
error checking is done against this."

and perhaps more importantly:

"[a technical scenario involving disk writes]
The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that.  Sadly, that
isn't true in this case anymore.  IOW, the success of a I/O barrier
should also be dependent on success of some of the preceding requests,
where only upper layer (filesystem) knows what 'some' is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower lever drivers to resume operation on error only after
block layer tells it to do so.

As the probability of this happening is very low and the drive should
be faulty, implementing the fix is probably an overkill.  But, still,
it's there."

Cheers,
Eric

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22  2:29       ` Eric Anopolsky
@ 2008-10-22 10:42         ` Ric Wheeler
  2008-10-22 10:53           ` Tejun Heo
  0 siblings, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 10:42 UTC (permalink / raw)
  To: Eric Anopolsky
  Cc: Chris Mason, Stephan von Krawczynski, linux-btrfs, Tejun Heo

Eric Anopolsky wrote:
> On Tue, 2008-10-21 at 18:18 -0400, Ric Wheeler wrote:
>   
>> Eric Anopolsky wrote:
>>     
>>> On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
>>>   
>>>       
>>>>>     - power loss at any time must not corrupt the fs (atomic fs modification)
>>>>>       (new-data loss is acceptable)
>>>>>       
>>>>>           
>>>> Done.  Btrfs already uses barriers as required for sata drives.
>>>>     
>>>>         
>>> Aren't there situations in which write barriers don't do what they're
>>> supposed to do?
>>>
>>> Cheers,
>>> Eric
>>>
>>>   
>>>       
>> If the drive effectively "lies" to you about flushing the write cache, 
>> you might have an issue. I have not seen that first hand with recent 
>> disk drives (and I have seen a lot :-))
>>     
>
> That does not match the understanding I get from reading the
> notes/caveats section of Documentation/block/barrier.txt:
>
> "Note that block drivers must not requeue preceding requests while
> completing latter requests in an ordered sequence.  Currently, no
> error checking is done against this."
>
> and perhaps more importantly:
>
> "[a technical scenario involving disk writes]
> The problem here is that the barrier request is *supposed* to indicate
> that filesystem update requests [2] and [3] made it safely to the
> physical medium and, if the machine crashes after the barrier is
> written, filesystem recovery code can depend on that.  Sadly, that
> isn't true in this case anymore.  IOW, the success of a I/O barrier
> should also be dependent on success of some of the preceding requests,
> where only upper layer (filesystem) knows what 'some' is.
>
> This can be solved by implementing a way to tell the block layer which
> requests affect the success of the following barrier request and
> making lower lever drivers to resume operation on error only after
> block layer tells it to do so.
>
> As the probability of this happening is very low and the drive should
> be faulty, implementing the fix is probably an overkill.  But, still,
> it's there."
>
> Cheers,
> Eric
>
>   
The cache flush command for ATA devices will block and wait until all of 
the device's write cache has been written back.

What I assume Tejun was referring to here is that some IO might have 
been written out to the device and an error happened when the device 
tried to write the cache back (say due to normal drive microcode cache 
destaging). The problem with this is that there is no outstanding IO 
context between the host and the storage to report the error to (i.e., 
the drive has already ack'ed the write).

If this is what is being described, there is a non-zero chance that this 
might happen, but it is extremely infrequent.  The checksumming that we 
have in btrfs will catch these bad writes when you replay the journal 
after a crash (or even when you read data blocks) so I would contend 
that this is about as good as we can do.
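
To make the checksum argument concrete, here is a minimal sketch of how a
checksum recorded by the filesystem exposes a write that the drive
acknowledged but never got onto the medium. It is a toy model, not btrfs
code: the Disk class, the lost= flag and the use of CRC32 are assumptions
made purely for illustration.

```python
import zlib


class Disk:
    """Toy disk (hypothetical): block number -> bytes actually on the medium."""

    def __init__(self):
        self.blocks = {}

    def write(self, blockno, data, lost=False):
        # A "lost" write is acknowledged to the host but never reaches
        # the medium, e.g. because destaging from the cache failed later.
        if not lost:
            self.blocks[blockno] = data

    def read(self, blockno):
        return self.blocks.get(blockno, b"")


def checksum(data):
    # CRC32 stands in for whatever checksum the filesystem really uses.
    return zlib.crc32(data) & 0xFFFFFFFF


def verify(disk, blockno, expected_csum):
    """Return True if the block on the medium matches the recorded checksum."""
    return checksum(disk.read(blockno)) == expected_csum


disk = Disk()
disk.write(7, b"old contents")

# The filesystem records the checksum of the data it intended to write.
recorded = checksum(b"new contents")
disk.write(7, b"new contents", lost=True)  # acked, but silently dropped

stale = not verify(disk, 7, recorded)  # the old data no longer matches
```

The check only detects the stale block; recovering the intended contents
still needs a second copy (a mirror or similar).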

Tejun, Chris, does this match your understanding?

Thanks!

Ric



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 10:42         ` Ric Wheeler
@ 2008-10-22 10:53           ` Tejun Heo
  2008-10-22 12:57             ` Ric Wheeler
  2008-10-22 12:57             ` Ric Wheeler
  0 siblings, 2 replies; 79+ messages in thread
From: Tejun Heo @ 2008-10-22 10:53 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Eric Anopolsky, Chris Mason, Stephan von Krawczynski, linux-btrfs

Ric Wheeler wrote:
> The cache flush command for ATA devices will block and wait until all of
> the device's write cache has been written back.
> 
> What I assume Tejun was referring to here is that some IO might have
> been written out to the device and an error happened when the device
> tried to write the cache back (say due to normal drive microcode cache
> destaging). The problem with this is that there is no outstanding IO
> context between the host and the storage to report the error to (i.e.,
> the drive has already ack'ed the write).
> 
> If this is what is being described, there is a non-zero chance that this
> might happen, but it is extremely infrequent.  The checksumming that we
> have in btrfs will catch these bad writes when you replay the journal
> after a crash (or even when you read data blocks) so I would contend
> that this is about as good as we can do.

Please consider the following scenario.

1. FS issues lots of writes which are queued in the block elevator.
2. FS issues barrier.
3. Elevator pushes out all the writes.
4. One of the writes fails for some reason.  Media failure or what
   not.  Failure is propagated to upper layer.
5. Whether there was preceding failure or not, block queue processing
   continues and writes out all the pending requests.
6. Elevator issues FLUSH and it gets executed by the device.
7. Elevator issues barrier write and it gets executed by the device.
8. *POWER LOSS*

The thing is that currently there is no defined way for the FS to take
action once #4 happens unless it waits for all outstanding
writes to complete before issuing the barrier.  One way to solve this
would be to make the failure status sticky such that any barrier
following any number of uncleared errors will fail too, so that the
filesystem can think about what it should do with the write failure.
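
A rough sketch of the sticky-failure idea, assuming a toy queue model
rather than the real block layer: the Queue class, its method names and
the failing_sectors parameter are all hypothetical. A barrier fails for
as long as an earlier write error has not been explicitly cleared.

```python
class Queue:
    """Toy block queue with a sticky error flag (hypothetical model)."""

    def __init__(self, failing_sectors=()):
        self.failing = set(failing_sectors)
        self.pending = []
        self.sticky_error = False

    def submit_write(self, sector):
        self.pending.append(sector)

    def _drain(self):
        # As in step 5, processing continues past a failed write,
        # but the failure is remembered instead of being forgotten.
        for sector in self.pending:
            if sector in self.failing:
                self.sticky_error = True
        self.pending.clear()

    def barrier(self):
        """Drain the queue; succeed only if no uncleared error precedes us."""
        self._drain()
        return not self.sticky_error

    def clear_error(self):
        # The filesystem calls this once it has dealt with the failure.
        self.sticky_error = False
```

With this model a barrier issued after a failed write returns False, and
so does every later barrier until clear_error() is called, which gives
the filesystem a defined point at which to react.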

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 10:53           ` Tejun Heo
@ 2008-10-22 12:57             ` Ric Wheeler
  2008-10-22 12:57             ` Ric Wheeler
  1 sibling, 0 replies; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 12:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ric Wheeler, Eric Anopolsky, Chris Mason, Stephan von Krawczynski,
	linux-btrfs

Tejun Heo wrote:
> Ric Wheeler wrote:
>   
>> The cache flush command for ATA devices will block and wait until all of
>> the device's write cache has been written back.
>>
>> What I assume Tejun was referring to here is that some IO might have
>> been written out to the device and an error happened when the device
>> tried to write the cache back (say due to normal drive microcode cache
>> destaging). The problem with this is that there is no outstanding IO
>> context between the host and the storage to report the error to (i.e.,
>> the drive has already ack'ed the write).
>>
>> If this is what is being described, there is a non-zero chance that this
>> might happen, but it is extremely infrequent.  The checksumming that we
>> have in btrfs will catch these bad writes when you replay the journal
>> after a crash (or even when you read data blocks) so I would contend
>> that this is about as good as we can do.
>>     
>
> Please consider the following scenario.
>
> 1. FS issues lots of writes which are queued in the block elevator.
> 2. FS issues barrier.
> 3. Elevator pushes out all the writes.
> 4. One of the writes fails for some reason.  Media failure or what
>    not.  Failure is propagated to upper layer.
> 5. Whether there was preceding failure or not, block queue processing
>    continues and writes out all the pending requests.
> 6. Elevator issues FLUSH and it gets executed by the device.
> 7. Elevator issues barrier write and it gets executed by the device.
> 8. *POWER LOSS*
>
> The thing is that currently there is no defined way for FS to take
> action after #4 once happens unless it waits for all outstanding
> writes to complete before issuing the barrier.  One way to solve this
> would be to make the failure status sticky such that any barrier
> following any number of uncleared errors will fail too, so that the
> filesystem can think about what it should do with the write failure.
>
> Thanks.
>   
I think that we do handle a failure in the case that you outline above 
since the FS will be able to notice the error before it sends a commit 
down (and that commit is wrapped in the barrier flush calls). This is 
the easy case since we still have the context for the IO.

It is more challenging (and kind of related) if the IO done in (4) has
been ack'ed by the drive, the drive later destages its write cache (not
as part of the flush), and then an error happens. In this case, there is
nothing waiting on the initiator side to receive the IO error. We have
effectively lost the context for that IO.

The only way to detect this is on replay: either the journal has
checksums enabled, or the error is flagged as a media error.

Ric



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 12:57             ` Ric Wheeler
@ 2008-10-22 13:15               ` Tejun Heo
  2008-10-22 13:19                 ` Chris Mason
  2008-10-22 13:23                 ` Ric Wheeler
  0 siblings, 2 replies; 79+ messages in thread
From: Tejun Heo @ 2008-10-22 13:15 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Eric Anopolsky, Chris Mason, Stephan von Krawczynski, linux-btrfs

Ric Wheeler wrote:
> I think that we do handle a failure in the case that you outline above
> since the FS will be able to notice the error before it sends a commit
> down (and that commit is wrapped in the barrier flush calls). This is
> the easy case since we still have the context for the IO.

I'm no FS guy, but for that to be true the FS would have to wait for all
the outstanding IOs to finish before issuing a barrier, in which case it
doesn't actually need barriers at all - it can do the same with
flush_cache.

> It is more challenging  (and kind of related) if the IO done in (4) has
> been ack'ed by drive, the drive later destages (not as part of the
> flush) its write cache and then an error happens. In this case, there is
> nothing waiting on the initiator side to receive the IO error. We have
> effectively lost the context for that IO.

IIUC, that should be detectable from FLUSH whether the destaging
occurred as part of flush or not, no?

> The only way to detect this is on replay (if the journal has checksums
> enabled or the error will be flagged as a media error).

If it's not reported on FLUSH, it basically amounts to silent data
corruption and only checksums can help.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 13:15               ` Tejun Heo
@ 2008-10-22 13:19                 ` Chris Mason
  2008-10-22 13:38                   ` Ric Wheeler
  2008-10-22 13:23                 ` Ric Wheeler
  1 sibling, 1 reply; 79+ messages in thread
From: Chris Mason @ 2008-10-22 13:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ric Wheeler, Eric Anopolsky, Stephan von Krawczynski, linux-btrfs

On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote:
> Ric Wheeler wrote:
> > I think that we do handle a failure in the case that you outline above
> > since the FS will be able to notice the error before it sends a commit
> > down (and that commit is wrapped in the barrier flush calls). This is
> > the easy case since we still have the context for the IO.
> 
> I'm no FS guy but for that to be true FS should be waiting for all the
> outstanding IOs to finish before issuing a barrier and actually
> doesn't need barriers at all - it can do the same with flush_cache.
> 

We wait and then barrier.  If the barrier returned status that a
previously ack'd IO had actually failed, we could do something to make
sure the FS was consistent.
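
For readers following along, the "wait and then barrier" ordering can be
sketched roughly like this. The Drive class is a toy model of a drive with
a volatile write cache and commit() is a hypothetical helper, not the
btrfs commit path.

```python
class Drive:
    """Toy drive with a volatile write cache (hypothetical model)."""

    def __init__(self):
        self.cache = {}   # acked by the drive, but lost on power failure
        self.medium = {}  # survives power failure

    def write(self, block, data):
        # The ack only means "in the cache", not "on the medium".
        self.cache[block] = data
        return True

    def flush(self):
        self.medium.update(self.cache)
        self.cache.clear()
        return True

    def power_loss(self):
        self.cache.clear()


def commit(drive, updates, commit_block):
    # 1. Issue the transaction's writes and wait for every ack, so an
    #    error is seen while we still have the context for the IO.
    if not all(drive.write(block, data) for block, data in updates.items()):
        return False
    # 2. Flush, so the dependent writes reach the medium first.
    if not drive.flush():
        return False
    # 3. Only now write the commit block, and flush that as well.
    drive.write(commit_block, b"COMMIT")
    return drive.flush()
```

The point of the ordering is that the commit block can never be on the
medium without the writes it depends on. What the sketch cannot model is
the case discussed here, where flush() reports success even though an
earlier destage failed.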

-chris



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 13:19                 ` Chris Mason
@ 2008-10-22 13:38                   ` Ric Wheeler
  2008-10-22 13:59                     ` Chris Mason
  0 siblings, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 13:38 UTC (permalink / raw)
  To: Chris Mason
  Cc: Tejun Heo, Eric Anopolsky, Stephan von Krawczynski, linux-btrfs

Chris Mason wrote:
> On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote:
>   
>> Ric Wheeler wrote:
>>     
>>> I think that we do handle a failure in the case that you outline above
>>> since the FS will be able to notice the error before it sends a commit
>>> down (and that commit is wrapped in the barrier flush calls). This is
>>> the easy case since we still have the context for the IO.
>>>       
>> I'm no FS guy but for that to be true FS should be waiting for all the
>> outstanding IOs to finish before issuing a barrier and actually
>> doesn't need barriers at all - it can do the same with flush_cache.
>>
>>     
>
> We wait and then barrier.  If the barrier returned status that a
> previously ack'd IO had actually failed, we could do something to make
> sure the FS was consistent.
>
> -chris
>
>
>   
As I mentioned in a reply to Tejun, I am not sure that we can count on 
the barrier op giving us status for IO's that failed to destage cleanly.

Waiting and then doing the FLUSH seems to give us the best coverage for 
normal failures (and your own testing shows that it is hugely effective 
in reducing some types of corruption at least :-)).

If you look at the types of common drive failures, I would break them 
into two big groups. 

The first group would be transient errors - i.e., this IO fails (usually 
a read), but a subsequent IO will succeed with or without a sector 
remapping happening.  Causes might be:

    (1) just a bad read due to dirt on the surface of the drive - the 
read will always fail, a write might clean the surface and restore it to 
useful life.
    (2) vibrations (dropping your laptop, rolling a big machine down the 
data center, passing trains :-))
    (3) adjacent sector writes - hot spotting on drives can degrade the 
data on adjacent tracks. This causes IO errors on reads for data that 
was successfully written before, but the track itself is still perfectly 
fine.

All of these first types of errors need robust error handling on IO 
errors (i.e., quickly fail, check for IO errors and isolate the impact 
of the error as best as we can) but do not indicate a bad drive.

The second group would be persistent failures - no matter what you do to 
the drive, it is going to kick the bucket! Common causes might be:

    (1) a few bad sectors (1-5% of the drive's remapped sector table for 
example).
    (2) a bad disk head - this is a very common failure, you will see a 
large number of bad sectors.
    (3) bad components (say bad memory chips in the write cache) can 
produce consistent errors
    (4) failure to spin up (total drive failure).

The challenging part is to figure out as best as we can how to 
differentiate the causes of IO failures or checksum failures and to 
respond correctly.  Array vendors spend a lot of time pulling out hair 
trying to do predictive drive failure, but it is really, really hard to 
get correct...
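
As a very rough illustration of the two groups, a first-pass check could
retry the read and then try a rewrite before declaring the sector dead.
This is only a sketch under simplifying assumptions: classify(), its
callables and the retry count are hypothetical, and a real tool would
need a good copy of the data (for example from a mirror) to rewrite.

```python
def classify(read_sector, rewrite_sector, sector, retries=3):
    """Classify a sector as 'ok', 'transient' or 'persistent'.

    read_sector(sector) and rewrite_sector(sector) are caller-supplied
    callables (hypothetical) that return True on success.
    """
    if read_sector(sector):
        return "ok"
    # Group 1: a later IO may simply succeed (vibration, heat, ...).
    for _ in range(retries):
        if read_sector(sector):
            return "transient"
    # Still group 1 if a rewrite restores the sector, with or without
    # the drive remapping it.
    if rewrite_sector(sector) and read_sector(sector):
        return "transient"
    # Group 2: nothing we do brings the sector back.
    return "persistent"
```

A single sector result says little about the drive as a whole; the hard
part described above is deciding, from many such results, whether the
drive should stay or go.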

ric

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 13:38                   ` Ric Wheeler
@ 2008-10-22 13:59                     ` Chris Mason
  2008-10-22 14:23                       ` Ric Wheeler
  0 siblings, 1 reply; 79+ messages in thread
From: Chris Mason @ 2008-10-22 13:59 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Tejun Heo, Eric Anopolsky, Stephan von Krawczynski, linux-btrfs

On Wed, 2008-10-22 at 09:38 -0400, Ric Wheeler wrote:
> Chris Mason wrote:
> > On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote:
> >   
> >> Ric Wheeler wrote:
> >>     
> >>> I think that we do handle a failure in the case that you outline above
> >>> since the FS will be able to notice the error before it sends a commit
> >>> down (and that commit is wrapped in the barrier flush calls). This is
> >>> the easy case since we still have the context for the IO.
> >>>       
> >> I'm no FS guy but for that to be true FS should be waiting for all the
> >> outstanding IOs to finish before issuing a barrier and actually
> >> doesn't need barriers at all - it can do the same with flush_cache.
> >>
> >>     
> >
> > We wait and then barrier.  If the barrier returned status that a
> > previously ack'd IO had actually failed, we could do something to make
> > sure the FS was consistent.
> >   
> As I mentioned in a reply to Tejun, I am not sure that we can count on 
> the barrier op giving us status for IO's that failed to destage cleanly.
> 
> Waiting and then doing the FLUSH seems to give us the best coverage for 
> normal failures (and your own testing shows that it is hugely effective 
> in reducing some types of corruption at least :-)).
> 
> If you look at the types of common drive failures, I would break them 
> into two big groups. 
> 
> The first group would be transient errors - i.e., this IO fails (usually 
> a read), but a subsequent IO will succeed with or without a sector 
> remapping happening.  Causes might be:
> 
>     (1) just a bad read due to dirt on the surface of the drive - the 
> read will always fail, a write might clean the surface and restore it to 
> useful life.
>     (2) vibrations (dropping your laptop, rolling a big machine down the 
> data center, passing trains :-))
>     (3) adjacent sector writes - hot spotting on drives can degrade the 
> data on adjacent tracks. This causes IO errors on reads for data that 
> was successfully written before, but the track itself is still perfectly 
> fine.
> 

4) Transient conditions such as heat or other problems made the drive
give errors.

Combine your matrix with the single drive install vs the mirrored
configuration and we get a lot of variables.  What I'd love to have is a
rehab tool for drives that works it over and decides if it should stay
or go.

It is somewhat difficult to run the rehab on a mounted single disk
install, but we can start with the multi-device config and work our way
out from there.

For barrier flush, io errors reported back by the barrier flush would
allow us to know when corrective action was required.

-chris



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 13:59                     ` Chris Mason
@ 2008-10-22 14:23                       ` Ric Wheeler
  0 siblings, 0 replies; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 14:23 UTC (permalink / raw)
  To: Chris Mason
  Cc: Tejun Heo, Eric Anopolsky, Stephan von Krawczynski, linux-btrfs,
	Mark Lord

Chris Mason wrote:
> On Wed, 2008-10-22 at 09:38 -0400, Ric Wheeler wrote:
>   
>> Chris Mason wrote:
>>     
>>> On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote:
>>>   
>>>       
>>>> Ric Wheeler wrote:
>>>>     
>>>>         
>>>>> I think that we do handle a failure in the case that you outline above
>>>>> since the FS will be able to notice the error before it sends a commit
>>>>> down (and that commit is wrapped in the barrier flush calls). This is
>>>>> the easy case since we still have the context for the IO.
>>>>>       
>>>>>           
>>>> I'm no FS guy but for that to be true FS should be waiting for all the
>>>> outstanding IOs to finish before issuing a barrier and actually
>>>> doesn't need barriers at all - it can do the same with flush_cache.
>>>>
>>>>     
>>>>         
>>> We wait and then barrier.  If the barrier returned status that a
>>> previously ack'd IO had actually failed, we could do something to make
>>> sure the FS was consistent.
>>>   
>>>       
>> As I mentioned in a reply to Tejun, I am not sure that we can count on 
>> the barrier op giving us status for IO's that failed to destage cleanly.
>>
>> Waiting and then doing the FLUSH seems to give us the best coverage for 
>> normal failures (and your own testing shows that it is hugely effective 
>> in reducing some types of corruption at least :-)).
>>
>> If you look at the types of common drive failures, I would break them 
>> into two big groups. 
>>
>> The first group would be transient errors - i.e., this IO fails (usually 
>> a read), but a subsequent IO will succeed with or without a sector 
>> remapping happening.  Causes might be:
>>
>>     (1) just a bad read due to dirt on the surface of the drive - the 
>> read will always fail, a write might clean the surface and restore it to 
>> useful life.
>>     (2) vibrations (dropping your laptop, rolling a big machine down the 
>> data center, passing trains :-))
>>     (3) adjacent sector writes - hot spotting on drives can degrade the 
>> data on adjacent tracks. This causes IO errors on reads for data that 
>> was successfully written before, but the track itself is still perfectly 
>> fine.
>>
>>     
>
> 4) Transient conditions such as heat or other problems made the drive
> give errors.
>   

Yes, heat is an issue (as well as severe cold) since drives have parts 
that expand and contract :-)
> Combine your matrix with the single drive install vs the mirrored
> configuration and we get a lot of variables.  What I'd love to have is a
> rehab tool for drives that works it over and decides if it should stay
> or go.
>   

That would be a really nice thing to have and not really that difficult 
to sketch out. MD has some of that built in, but this is also something 
that we could do pretty easily up in user space.
> It is somewhat difficult to run the rehab on a mounted single disk
> install, but we can start with the multi-device config and work out way
> out from there.
>   

Scanning with read-verify or object level signature checking can be 
done on mounted file systems...
> For barrier flush, io errors reported back by the barrier flush would
> allow us to know when corrective action was required.
>
> -chris
>
>
>   

As I mentioned before, this would be great, but I am not sure that it 
would work that way (certainly not consistently across devices).

ric


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 13:15               ` Tejun Heo
  2008-10-22 13:19                 ` Chris Mason
@ 2008-10-22 13:23                 ` Ric Wheeler
  2008-10-22 16:14                   ` Tejun Heo
  1 sibling, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 13:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ric Wheeler, Eric Anopolsky, Chris Mason, Stephan von Krawczynski,
	linux-btrfs

Tejun Heo wrote:
> Ric Wheeler wrote:
>   
>> I think that we do handle a failure in the case that you outline above
>> since the FS will be able to notice the error before it sends a commit
>> down (and that commit is wrapped in the barrier flush calls). This is
>> the easy case since we still have the context for the IO.
>>     
>
> I'm no FS guy but for that to be true FS should be waiting for all the
> outstanding IOs to finish before issuing a barrier and actually
> doesn't need barriers at all - it can do the same with flush_cache.
>   

Waiting for the target to ack an IO is not sufficient, since the target 
ack does not (with write cache enabled) mean that it is on persistent 
storage.

The key is to make your transaction commit ensure that the commit block 
itself is not written out of sequence without flushing the dependent IO 
from the transaction.

If we disable the write cache, then file systems effectively do exactly 
the right thing today as you describe :-)
>   
>> It is more challenging  (and kind of related) if the IO done in (4) has
>> been ack'ed by drive, the drive later destages (not as part of the
>> flush) its write cache and then an error happens. In this case, there is
>> nothing waiting on the initiator side to receive the IO error. We have
>> effectively lost the context for that IO.
>>     
>
> IIUC, that should be detectable from FLUSH whether the destaging
> occurred as part of flush or not, no?
>   

I am not sure what happens to a write that fails to get destaged from 
cache. It probably depends on the target firmware, but I imagine that 
the target cannot hold onto it forever (or all subsequent flushes would 
always fail).
>   
>> The only way to detect this is on replay (if the journal has checksums
>> enabled or the error will be flagged as a media error).
>>     
>
> If it's not reported on FLUSH, it basically amounts to silent data
> corruption and only checksums can help.
>
> Thanks.
>
>   

Agreed - checksums (or proper handling of media errors) are the only way 
to detect this.

Ric



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 13:23                 ` Ric Wheeler
@ 2008-10-22 16:14                   ` Tejun Heo
  2008-10-22 16:34                     ` Ric Wheeler
                                       ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Tejun Heo @ 2008-10-22 16:14 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Eric Anopolsky, Chris Mason, Stephan von Krawczynski, linux-btrfs

Ric Wheeler wrote:
> Waiting for the target to ack an IO is not sufficient, since the target
> ack does not (with write cache enabled) mean that it is on persistent
> storage.

Having the FS wait for completion of all the dependent writes isn't too
good latency and throughput-wise, though.  It would be best if the FS
could indicate dependencies between write commands and the barrier so
that the barrier doesn't have to empty the whole queue.  Hmm... Can
someone tell me how much such a scheme would help?

> The key is to make your transaction commit insure that the commit block
> itself is not written out of sequence without flushing the dependent IO
> from the transaction.
> 
> If we disable the write cache, then file systems effectively do exactly
> the right thing today as you describe :-)

For most SATA drives, disabling write back cache seems to take a high
toll on write throughput.  :-(

>> IIUC, that should be detectable from FLUSH whether the destaging
>> occurred as part of flush or not, no?
>>   
> 
> I am not sure what happens to a write that fails to get destaged from
> cache. It probably depends on the target firmware, but I imagine that
> the target cannot hold onto it forever (or all subsequent flushes would
> always fail).

As long as the error status is sticky, the drive doesn't have to hold on
to the data; it's not gonna be able to write it anyway.  It has to
hold onto the failure information only.  Yeah, but fully agreed
that it's most likely dependent on the specific firmware.  There isn't
any requirement on how to handle write back failure in the ATA spec.
It wouldn't be too surprising if there are some drives which happily
report the old data after silent write failure followed by flush and
power loss at the right timing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: Some very basic questions
  2008-10-22 16:14                   ` Tejun Heo
@ 2008-10-22 16:34                     ` Ric Wheeler
  2008-10-23  3:59                       ` Tejun Heo
  2008-10-22 18:32                     ` Avi Kivity
  2008-10-22 21:31                     ` Eric Anopolsky
  2 siblings, 1 reply; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 16:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ric Wheeler, Eric Anopolsky, Chris Mason, Stephan von Krawczynski,
	linux-btrfs

Tejun Heo wrote:
> Ric Wheeler wrote:
>   
>> Waiting for the target to ack an IO is not sufficient, since the target
>> ack does not (with write cache enabled) mean that it is on persistent
>> storage.
>>     
>
> FS waiting for completion of all the dependent writes isn't too good
> latency and throughput-wise tho.  It would be best if FS can indicate
> dependencies between write commands and barrier so that barrier
> doesn't have to empty the whole queue.  Hmm... Can someone tell me how
> much such scheme would help?
>
>   
I think that this is where SCSI ordered tags come in (or similar
schemes). The idea would be to tag all IO. You bump the tag, for
example after you send down the journal data blocks, to a new tag which
is used for the commit block data sequence.

The ordering would require that lower ranked tags must all be destaged 
to persistent storage before a subsequent tag is written out.
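
A toy model of that ordering rule, in Python. This is only a sketch: the
function name destage_order and the (tag, block) tuples are made up for
illustration and do not correspond to any real SCSI or block-layer
interface. Within one tag the device may reorder freely; across tags it
may not.

```python
import random

def destage_order(writes, rng=None):
    """Return one legal destage order for tagged writes.

    writes: list of (tag, block) tuples in submission order.
    Writes sharing a tag may be reordered by the device; every write
    with a lower tag must reach persistent storage before any write
    with a higher tag.
    """
    if rng is None:
        rng = random.Random(0)
    by_tag = {}
    for tag, block in writes:
        by_tag.setdefault(tag, []).append(block)
    order = []
    for tag in sorted(by_tag):
        group = list(by_tag[tag])
        rng.shuffle(group)  # the device is free to reorder inside a tag
        order.extend((tag, block) for block in group)
    return order

# Journal data blocks go out under tag 0, the commit block under tag 1.
writes = [(0, "journal-1"), (0, "journal-2"), (0, "journal-3"), (1, "commit")]
order = destage_order(writes)
```

Whatever the shuffle does, the commit block can only come out last,
which is the property the transaction commit needs.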

T13 had a Microsoft proposal in this area:

http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command_Proposal.doc


>> The key is to make your transaction commit insure that the commit block
>> itself is not written out of sequence without flushing the dependent IO
>> from the transaction.
>>
>> If we disable the write cache, then file systems effectively do exactly
>> the right thing today as you describe :-)
>>     
>
> For most SATA drives, disabling write back cache seems to take high
> toll on write throughput.  :-(
>   

I have seen a 50% reduction in my testing on S-ATA :-(

>   
>>> IIUC, that should be detectable from FLUSH whether the destaging
>>> occurred as part of flush or not, no?
>>>   
>>>       
>> I am not sure what happens to a write that fails to get destaged from
>> cache. It probably depends on the target firmware, but I imagine that
>> the target cannot hold onto it forever (or all subsequent flushes would
>> always fail).
>>     
>
> As long as the error status is sticky, it doesn't have to hold on to
> the data, it's not gonna be able to write it anyway.  The drive has to
> hold onto the failure information only.  Yeah, but fully agreed on
> that it's most likely dependent on the specific firmware.  There isn't
> any requirement on how to handle write back failure in the ATA spec.
> It wouldn't be too surprising if there are some drives which happily
> report the old data after silent write failure followed by flush and
> power loss at the right timing.
>
> Thanks.
>
>   
agreed....

ric



* Re: Some very basic questions
  2008-10-22 16:34                     ` Ric Wheeler
@ 2008-10-23  3:59                       ` Tejun Heo
  0 siblings, 0 replies; 79+ messages in thread
From: Tejun Heo @ 2008-10-23  3:59 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Eric Anopolsky, Chris Mason, Stephan von Krawczynski, linux-btrfs

Ric Wheeler wrote:
>> FS waiting for completion of all the dependent writes isn't too good
>> latency and throughput-wise tho.  It would be best if FS can indicate
>> dependencies between write commands and barrier so that barrier
>> doesn't have to empty the whole queue.  Hmm... Can someone tell me how
>> much such scheme would help?
>>
>>   
> I think that this is where SCSI ordered tags come in (or similar
> schemes). The idea would be to have tag all IO. You bump the tag, for
> example after you send down the journal data blocks to a new tag which
> is used for the commit block data sequence.
> 
> The ordering would require that lower ranked tags must all be destaged
> to persistent storage before a subsequent tag is written out.
> 
> The T13 had a microsoft proposal that is in this area:
> 
> http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command_Proposal.doc

Yeah, that's one thing although it still has the
undetected-write-errors in front of barrier problem (SCSI spec doesn't
have a way to detect that).

There's another queue, which can be considerably larger than the
on-device buffer - the block elevator queue.  Currently, as the
elevator doesn't know what's dependent on what, it has to dump the
whole content of the elevator before doing a barrier.  I don't know how much
it would help to do it selectively tho.
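
To make the question concrete, here is a toy comparison in Python of a
full drain versus a selective one. It is a sketch under a big
assumption: that each queued request carries an explicit list of the
requests it depends on. The real elevator has no such information,
which is exactly the limitation described above, and the names
must_flush_before and the request ids are hypothetical.

```python
def must_flush_before(queue, barrier_deps):
    """Return ids of queued requests that a barrier actually depends on.

    queue: dict mapping request id -> set of ids it depends on.
    barrier_deps: ids the barrier write depends on directly.
    Dependencies are followed transitively; everything else could
    stay in the queue.
    """
    needed = set()
    stack = list(barrier_deps)
    while stack:
        rid = stack.pop()
        if rid in needed or rid not in queue:
            continue
        needed.add(rid)
        stack.extend(queue[rid])
    return needed

queue = {
    "j1": set(),       # journal block
    "j2": {"j1"},      # journal block that must follow j1
    "d1": set(),       # unrelated data writes
    "d2": set(),
    "d3": set(),
}
selective = must_flush_before(queue, {"j2"})
full_drain = set(queue)
```

In this made-up queue the selective drain touches two requests and the
full drain five; how often real workloads look like that is the open
question.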

Thanks.

-- 
tejun


* Re: Some very basic questions
  2008-10-22 16:14                   ` Tejun Heo
  2008-10-22 16:34                     ` Ric Wheeler
@ 2008-10-22 18:32                     ` Avi Kivity
  2008-10-22 19:13                       ` jim owens
  2008-10-22 19:59                       ` Ric Wheeler
  2008-10-22 21:31                     ` Eric Anopolsky
  2 siblings, 2 replies; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 18:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ric Wheeler, Eric Anopolsky, Chris Mason, Stephan von Krawczynski,
	linux-btrfs

Tejun Heo wrote:
> For most SATA drives, disabling write back cache seems to take high
> toll on write throughput.  :-(
>
>   

I measured this yesterday.  This is true for pure write workloads; for 
mixed read/write workloads the throughput decrease is negligible.

> As long as the error status is sticky, it doesn't have to hold on to
> the data, it's not gonna be able to write it anyway.  The drive has to
> hold onto the failure information only.  Yeah, but fully agreed on
> that it's most likely dependent on the specific firmware.  There isn't
> any requirement on how to handle write back failure in the ATA spec.
> It wouldn't be too surprising if there are some drives which happily
> report the old data after silent write failure followed by flush and
> power loss at the right timing.

I got flamed for this on another list, but let's disable the write cache 
and live with the performance drop.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: Some very basic questions
  2008-10-22 18:32                     ` Avi Kivity
@ 2008-10-22 19:13                       ` jim owens
  2008-10-22 19:22                         ` Avi Kivity
  2008-10-22 19:59                       ` Ric Wheeler
  1 sibling, 1 reply; 79+ messages in thread
From: jim owens @ 2008-10-22 19:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Tejun Heo, linux-btrfs

Avi Kivity wrote:
> Tejun Heo wrote:
>> For most SATA drives, disabling write back cache seems to take high
>> toll on write throughput.  :-(
> 
> I measured this yesterday.  This is true for pure write workloads; for 
> mixed read/write workloads the throughput decrease is negligible.

Different tests on different hardware
give different results at different times!

>> As long as the error status is sticky, it doesn't have to hold on to
>> the data, it's not gonna be able to write it anyway.  The drive has to
>> hold onto the failure information only.  Yeah, but fully agreed on
>> that it's most likely dependent on the specific firmware.  There isn't
>> any requirement on how to handle write back failure in the ATA spec.
>> It wouldn't be too surprising if there are some drives which happily
>> report the old data after silent write failure followed by flush and
>> power loss at the right timing.
> 
> I got flamed for this on another list, but let's disable the write cache 
> and live with the performance drop.

We don't get to decide this, customers do.
As they say in the raid forum... fast, cheap, good - pick any 2

We just need to ensure we don't turn good into bad with fs mistakes.

jim

P.S. no flames because we chose no-battery == disable-write-cache


* Re: Some very basic questions
  2008-10-22 19:13                       ` jim owens
@ 2008-10-22 19:22                         ` Avi Kivity
  0 siblings, 0 replies; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 19:22 UTC (permalink / raw)
  To: jim owens; +Cc: Tejun Heo, linux-btrfs

jim owens wrote:
>>> For most SATA drives, disabling write back cache seems to take high
>>> toll on write throughput.  :-(
>>
>> I measured this yesterday.  This is true for pure write workloads; 
>> for mixed read/write workloads the throughput decrease is negligible.
>
> Different tests on different hardware
> give different results at different times!
>

True.  But data loss is forever!

>>
>> I got flamed for this on another list, but let's disable the write 
>> cache and live with the performance drop.
>
> We don't get to decide this, customers do.

We get to pick the defaults.

> As they say in the raid forum... fast, cheap, good - pick any 2

We can upgrade slow to fast, but !good gets upgraded to another fs.

> P.S. no flames because we chose no-battery == disable-write-cache

ACK!

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: Some very basic questions
  2008-10-22 18:32                     ` Avi Kivity
  2008-10-22 19:13                       ` jim owens
@ 2008-10-22 19:59                       ` Ric Wheeler
  1 sibling, 0 replies; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 19:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Tejun Heo, Eric Anopolsky, Chris Mason, Stephan von Krawczynski,
	linux-btrfs

Avi Kivity wrote:
> Tejun Heo wrote:
>> For most SATA drives, disabling write back cache seems to take high
>> toll on write throughput.  :-(
>>
>>   
>
> I measured this yesterday.  This is true for pure write workloads; for 
> mixed read/write workloads the throughput decrease is negligible.
>
That depends on your workload. I have measured (back at Centera) a
significant win from the write cache for mixed read/write as well (at
least 20%), depending on file size.

>> As long as the error status is sticky, it doesn't have to hold on to
>> the data, it's not gonna be able to write it anyway.  The drive has to
>> hold onto the failure information only.  Yeah, but fully agreed on
>> that it's most likely dependent on the specific firmware.  There isn't
>> any requirement on how to handle write back failure in the ATA spec.
>> It wouldn't be too surprising if there are some drives which happily
>> report the old data after silent write failure followed by flush and
>> power loss at the right timing.
>
> I got flamed for this on another list, but let's disable the write 
> cache and live with the performance drop.
>

Won't ever happen, no one wants to lose 50% of their performance :-)

ric




* Re: Some very basic questions
  2008-10-22 16:14                   ` Tejun Heo
  2008-10-22 16:34                     ` Ric Wheeler
  2008-10-22 18:32                     ` Avi Kivity
@ 2008-10-22 21:31                     ` Eric Anopolsky
  2008-10-22 21:56                       ` Ric Wheeler
  2 siblings, 1 reply; 79+ messages in thread
From: Eric Anopolsky @ 2008-10-22 21:31 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Ric Wheeler, Chris Mason, Stephan von Krawczynski, linux-btrfs

On Thu, 2008-10-23 at 01:14 +0900, Tejun Heo wrote:
> Ric Wheeler wrote:
> > Waiting for the target to ack an IO is not sufficient, since the target
> > ack does not (with write cache enabled) mean that it is on persistent
> > storage.
> 
> FS waiting for completion of all the dependent writes isn't too good
> latency and throughput-wise tho.  It would be best if FS can indicate
> dependencies between write commands and barrier so that barrier
> doesn't have to empty the whole queue.  Hmm... Can someone tell me how
> much such scheme would help?

The extent of my coding for ZFS on FUSE was in this area. Solaris has a
generic ioctl to flush the write cache on a block device but Linux does
not. I wrote a few routines to detect the type of block device and flush
the cache by talking to the hardware via an ioctl.
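
For comparison, a sketch of the plain POSIX route in Python: write, then
fsync(). This is not the ZFS on FUSE code described above, and
durable_write is a hypothetical helper. Whether the fsync actually
reaches the drive's write cache depends on the kernel, the filesystem
and its barrier settings, and the device, so treat it as a request
rather than a guarantee.

```python
import os
import tempfile

def durable_write(path, data):
    """Write data to path and ask the OS to push it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # os.write may write fewer bytes than asked for
            written += os.write(fd, data[written:])
        os.fsync(fd)  # returns only after the OS reports the flush done
    finally:
        os.close(fd)
    return written

with tempfile.TemporaryDirectory() as tmpdir:
    count = durable_write(os.path.join(tmpdir, "commit.blk"), b"commit")
```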

Tests with bonnie++ on my laptop showed that throughput and metadata
operations per second were not noticeably affected by completely
flushing the write cache when necessary versus never flushing the write
cache or using any kind of IO barrier.

Caveats:
*Not every HDD is a laptop HDD.
*ZFS on FUSE got average to poor results for metadata operations per
second since it hadn't been optimized for that yet.

Maybe fancier schemes aren't necessary?

Cheers,
Eric



* Re: Some very basic questions
  2008-10-22 21:31                     ` Eric Anopolsky
@ 2008-10-22 21:56                       ` Ric Wheeler
  0 siblings, 0 replies; 79+ messages in thread
From: Ric Wheeler @ 2008-10-22 21:56 UTC (permalink / raw)
  To: Eric Anopolsky
  Cc: Tejun Heo, Chris Mason, Stephan von Krawczynski, linux-btrfs

Eric Anopolsky wrote:
> On Thu, 2008-10-23 at 01:14 +0900, Tejun Heo wrote:
>   
>> Ric Wheeler wrote:
>>     
>>> Waiting for the target to ack an IO is not sufficient, since the target
>>> ack does not (with write cache enabled) mean that it is on persistent
>>> storage.
>>>       
>> FS waiting for completion of all the dependent writes isn't too good
>> latency and throughput-wise tho.  It would be best if FS can indicate
>> dependencies between write commands and barrier so that barrier
>> doesn't have to empty the whole queue.  Hmm... Can someone tell me how
>> much such scheme would help?
>>     
>
> The extent of my coding for ZFS on FUSE was in this area. Solaris has a
> generic ioctl to flush the write cache on a block device but Linux does
> not. I wrote a few routines to detect the type of block device and flush
> the cache by talking to the hardware via an ioctl.
>
> Tests with bonnie++ on my laptop showed that throughput and metadata
> operations per second were not noticeably affected by completely
> flushing the write cache when necessary versus never flushing the write
> cache or using any kind of IO barrier.
>
> Caveats:
> *Not every HDD is a laptop HDD.
> *ZFS on FUSE got average to poor results for metadata operations per
> second since it hadn't been optimized for that yet.
>
> Maybe fancier schemes aren't necessary?
>
> Cheers,
> Eric
>
>   
What I have seen so far with meta-data heavy workloads & the write 
barrier (working correctly!) is a pretty close match to the specs of the 
drive, at least for single threaded writing.

For example, if you have an average seek time of 20ms, you should see no 
more than 50 files/sec (if only one barrier is issued per file write).  
In practice, we see closer to 30 files/sec.

If nothing else, you can always detect a broken (or disabled) write 
barrier by exceeding that spec for single writers :-)
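
The arithmetic above as a small Python sketch. The helper names are
made up, and the model is deliberately crude: it assumes each barrier
costs about one average seek and ignores rotational latency and
transfer time, which is part of why real numbers land below the bound.

```python
def max_files_per_sec(avg_seek_ms, barriers_per_file=1):
    """Upper bound on single-writer files/sec when each barrier costs
    roughly one average seek."""
    return 1000.0 / (avg_seek_ms * barriers_per_file)

def barrier_looks_broken(measured_files_per_sec, avg_seek_ms,
                         barriers_per_file=1):
    """True if the measured rate exceeds what the drive could deliver
    with a working write barrier."""
    limit = max_files_per_sec(avg_seek_ms, barriers_per_file)
    return measured_files_per_sec > limit

limit = max_files_per_sec(20)                 # 20ms seek, 1 barrier per file
plausible = barrier_looks_broken(30, 20)      # 30 files/sec is under the bound
suspicious = barrier_looks_broken(400, 20)    # 400 files/sec is not
```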

ric





* Re: Some very basic questions
@ 2008-10-21 17:37 calin
  2008-10-21 20:08 ` jim owens
  0 siblings, 1 reply; 79+ messages in thread
From: calin @ 2008-10-21 17:37 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-btrfs

> question is: if you had such an implementation, are there
> drawbacks expectable for the single-mount case? If not I'd vote for it
> because there are not really many alternatives "on the market".

As I understand it, the largest issue is in locking and boundaries.  Two
different systems could mount a filesystem, and try to use some sort of
on-disk markers to keep from writing to the same area at the same time...
but there is often some bit of time between when a system sends data to
the disk and when it would become available to read from the disk, and
little or no guarantee about the order in which the data is written.  All
the work that goes into making transactions atomic depends on there only
being a single path to the disk - through the code that handles
transactions.  If data can arrive on the disk without being managed by
that code, all bets are off.


* Re: Some very basic questions
  2008-10-21 17:37 calin
@ 2008-10-21 20:08 ` jim owens
  2008-10-22  7:15   ` Avi Kivity
  0 siblings, 1 reply; 79+ messages in thread
From: jim owens @ 2008-10-21 20:08 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-btrfs

calin wrote:
>> question is: if you had such an implementation, are there
>> drawbacks expectable for the single-mount case? If not I'd vote for it
>> because there are not really many alternatives "on the market".
> 
> As I understand it, the largest issue is in locking and boundaries. 

Correct, that is the first big issue.  As soon as 2 machines can
access the same device, you must design for distributed locking.
And that means a lot more code, lower performance, and a lot of
things a local-only filesystem could do that must be disallowed.

The second issue is what is the purpose of more than 1 host
accessing the data directly from the device.  There are cases
where this is a good thing because the application is designed
with data partitioning and multi-instance coordination.  It is
a bad thing for random uncoordinated use like backups or fsck.

Remember that the device bandwidth is the limiter so even
when each host has a dedicated path to the device (as in
dual port SAS or FC), that 2nd host cuts the throughput by
more than 1/2 with uncoordinated seeks and transfers.

And if the host device drivers are not designed for multiple
host sharing, this can cause timeouts, resets, and false
device-failed states.

And yes... even read-only access from a 2nd host is trouble
in many parts of the design and does not come for free.

jim


* Re: Some very basic questions
  2008-10-21 20:08 ` jim owens
@ 2008-10-22  7:15   ` Avi Kivity
  2008-10-22 14:13     ` jim owens
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2008-10-22  7:15 UTC (permalink / raw)
  To: jim owens; +Cc: Stephan von Krawczynski, linux-btrfs

jim owens wrote:
>
> Remember that the device bandwidth is the limiter so even
> when each host has a dedicated path to the device (as in
> dual port SAS or FC), that 2nd host cuts the throughput by
> more than 1/2 with uncoordinated seeks and transfers.

That's only a problem if there is a single shared device.  Since btrfs 
supports multiple devices, each host could own a device set and access 
from other hosts would be through the owner.  You would need RDMA to get 
reasonable performance and some kind of dual-porting to get high 
availability.  Each host could control the allocation tree for its devices.

Of course, this doesn't solve the other problems with parallel mounts.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: Some very basic questions
  2008-10-22  7:15   ` Avi Kivity
@ 2008-10-22 14:13     ` jim owens
  2008-10-22 14:25       ` Avi Kivity
  0 siblings, 1 reply; 79+ messages in thread
From: jim owens @ 2008-10-22 14:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Stephan von Krawczynski, linux-btrfs

Avi Kivity wrote:
> jim owens wrote:
>>
>> Remember that the device bandwidth is the limiter so even
>> when each host has a dedicated path to the device (as in
>> dual port SAS or FC), that 2nd host cuts the throughput by
>> more than 1/2 with uncoordinated seeks and transfers.
> 
> That's only a problem if there is a single shared device.  Since btrfs 
> supports multiple devices, each host could own a device set and access 
> from other hosts would be through the owner.  You would need RDMA to get 
> reasonable performance and some kind of dual-porting to get high 
> availability.  Each host could control the allocation tree for its devices.

No.  Every device including a monster $$$ array has the problem.

As I said before, unless the application is partitioned
there is always data host2 needs from host1's disk and that
slows down host1.

If host2 seldom needs any host1 data, then you are describing
a configuration that can be done easily by each host having a
separate filesystem for the device it owns by default.  Each
host nfs mounts the other host's data and if host1 fails, host2
can direct mount host1-fs from the shared array.

Even with multiple disks under the same filesystem as separate
allocated storage there is still the problem of shared namespace
metadata that slows down both hosts.  If you don't need shared
namespaces then you absolutely don't want a cluster fs.

A cluster fs is useful, but the cost can be high so using
it for a single-host fs is not a good idea.

jim


* Re: Some very basic questions
  2008-10-22 14:13     ` jim owens
@ 2008-10-22 14:25       ` Avi Kivity
  0 siblings, 0 replies; 79+ messages in thread
From: Avi Kivity @ 2008-10-22 14:25 UTC (permalink / raw)
  To: jim owens; +Cc: Stephan von Krawczynski, linux-btrfs

jim owens wrote:
> Avi Kivity wrote:
>> jim owens wrote:
>>>
>>> Remember that the device bandwidth is the limiter so even
>>> when each host has a dedicated path to the device (as in
>>> dual port SAS or FC), that 2nd host cuts the throughput by
>>> more than 1/2 with uncoordinated seeks and transfers.
>>
>> That's only a problem if there is a single shared device.  Since 
>> btrfs supports multiple devices, each host could own a device set and 
>> access from other hosts would be through the owner.  You would need 
>> RDMA to get reasonable performance and some kind of dual-porting to 
>> get high availability.  Each host could control the allocation tree 
>> for its devices.
>
> No.  Every device including a monster $$$ array has the problem.
>
> As I said before, unless the application is partitioned
> there is always data host2 needs from host1's disk and that
> slows down host1.

The CPU load should not be significant if you have RDMA.  Or are you 
talking about the seek load?  Since host1's load should be distributed 
over all devices in the system, overall seek capacity increases as you 
add more nodes.

>
> If host2 seldom needs any host1 data, then you are describing
> a configuration that can be done easily by each host having a
> separate filesystem for the device it owns by default.  Each
> host nfs mounts the other host's data and if host1 fails, host2
> can direct mount host1-fs from the shared array.
>

Separate namespaces are uninteresting to me.  That's just pushing back 
the problem to the user.

> Even with multiple disks under the same filesystem as separate
> allocated storage there is still the problem of shared namespace
> metadata that slows down both hosts.  If you don't need shared
> namespaces then you absolutely don't want a cluster fs.
>

If you separate the allocation metadata to the storage owning node, and 
the file metadata to the actively using node, the slowdown should be low 
in most cases.  Problems begin when all nodes access the same file, but 
that's relatively rare.  Even then, when the file size does not change 
and when the data is preallocated, it's possible to achieve acceptable 
overhead.

> A cluster fs is useful, but the cost can be high so using
> it for a single-host fs is not a good idea.

Development costs, yes.  But I don't see why the runtime overhead can't 
disappear when running on a single host.  Sort of like running an smp 
kernel on uniprocessor (I agree the fs problem is much bigger).

-- 
error compiling committee.c: too many arguments to function


* Re: Some very basic questions
@ 2008-10-22 14:35 dbz
  2008-10-27 15:43 ` Stephan von Krawczynski
  0 siblings, 1 reply; 79+ messages in thread
From: dbz @ 2008-10-22 14:35 UTC (permalink / raw)
  To: linux-btrfs

Concerning this discussion, I'd like to put up some "requests" which
strongly oppose those brought up initially:

- if you run into an error in the fs structure or any IO error that prevents
you from bringing the fs into a consistent state, please simply oops. If a
user feels that availability is a main issue, he has to use a failover
solution. In this case a fast and clean cut is desirable and no
"pray-and-hope-mode" or "90%-mode". If availability is not the issue, it is
in any case most important that data on the fs is safe. If you don't oops,
you risk doing further damage to the filesystem and ending up with a
completely destroyed fs.

- if you get any IO error, please **don't** put up a number of retries or
anything. If the device reports an error simply believe it. It is bad enough
that many block drivers or controllers try to be smart and put up hundreds
of retries. Adding further retries, you only end up wasting hours on
useless retries. If availability is an issue, the user again has to put up a
failover solution. Again, a clean cut is what is needed. The user has to
make sure he uses an appropriate configuration according to the importance
of his data (mirroring on the fs and/or RAID, failover ...)

- if during mount something unexpected comes up and you can't be sure that
the fs will work properly, please deny mounting and request a fsck. This can
be easily handled by a start- or mount-script. During mount, take the time
you need to ensure that the fs looks proper and safe to use. I'd rather know
during boot that something is wrong than run with a foul fs and end up
with data loss or any other mixup later on.

- btrfs is no cluster fs, so there is no point of even thinking about it. If 
somebody feels he needs multiple writeable mounts of the same fs, please use 
a cluster fs. Of course, you have to live with the tradeoffs. Dreaming of a 
fs that uses something like witchcraft to do things like locking, quorums, 
cache synchronisation without penalty and, of course, without any 
configuration, is pointless.

In my opinion, the whole thing comes from the idea of using cheap hardware
and out-of-the-box configurations to keep promises of reliability and
availability which are not realistic. There is a reason why there are more
expensive HDDs, RAIDs, SANs with volume mirroring, multipathing and so on.
Simply ignoring the fact that you have to use the proper tools to address
specific problems and praying to the tooth fairy to put a
solve-all-my-problems-fs under your pillow is no solution. I'd rather have a
solid fs with deterministic behavior and some state-of-the-art features.

Just my 2c.
(Gerald) 



* Re: Some very basic questions
  2008-10-22 14:35 dbz
@ 2008-10-27 15:43 ` Stephan von Krawczynski
  0 siblings, 0 replies; 79+ messages in thread
From: Stephan von Krawczynski @ 2008-10-27 15:43 UTC (permalink / raw)
  To: dbz; +Cc: linux-btrfs

On Wed, 22 Oct 2008 16:35:55 +0200
"dbz" <hwallenstone@gmx.de> wrote:

> Concerning this discussion, I'd like to put up some "requests" which
> strongly oppose those brought up initially:
> 
> - if you run into an error in the fs structure or any IO error that prevents
> you from bringing the fs into a consistent state, please simply oops. If a
> user feels that availability is a main issue, he has to use a failover
> solution. In this case a fast and clean cut is desirable and no
> "pray-and-hope-mode" or "90%-mode". If availability is not the issue, it is
> in any case most important that data on the fs is safe. If you don't oops,
> you risk doing further damage to the filesystem and ending up with a
> completely destroyed fs.

Hi Gerald,

this is a good proposal to explain why most failover setups do indeed not
work. If you look at numerous internet howtos about building failover you will
recognise that 95% talk about servers that synchronise their fs by all kinds of
tools _offline_, like drbd - or choose some network-dependent raid, like nbd
or enbd. All these have in common that they are unreliable just because of the
needed mounting during failover. In your example: if box 1 oopses because of
some error, chances are that box 2 trying to mount the very same data (which
it should be, because of raid or sync) will indeed fail to mount, too. That
leaves you with exactly nothing in hand.

> - if you get any IO error, please **don't** put up a number of retries or
> anything. If the device reports an error simply believe it. It is bad enough
> that many block drivers or controllers try to be smart and put up hundreds
> of retries. Adding further retries, you only end up wasting hours on
> useless retries. If availability is an issue, the user again has to put up a
> failover solution. Again, a clean cut is what is needed. The user has to
> make sure he uses an appropriate configuration according to the importance
> of his data (mirroring on the fs and/or RAID, failover ...)

Well, this leaves you with my proposal to optionally stop retrying, marking
files or (better) blocks as dead.

> - if during mount something unexpected comes up and you can't be sure that
> the fs will work properly, please deny mounting and request a fsck. This can
> be easily handled by a start- or mount-script. During mount, take the time
> you need to ensure that the fs looks proper and safe to use. I'd rather know
> during boot that something is wrong than run with a foul fs and end up
> with data loss or any other mixup later on.

As explained above, it is exactly the lack of parallel mounts that leaves
you without a lot of time during mount. A failover that takes only 10 minutes
for re-mount is no failover, it is sh.t. ext? btw hardly ever mounts TBs in
under 10 minutes.

> - btrfs is no cluster fs, so there is no point of even thinking about it. If 
> somebody feels he needs multiple writeable mounts of the same fs, please use 
> a cluster fs. Of course, you have to live with the tradeoffs. Dreaming of a 
> fs that uses something like witchcraft to do things like locking, quorums, 
> cache synchronisation without penalty and, of course, without any 
> configuration, is pointless.

This reads pretty much like "a processor is a processor and not multiple
processors". We all know today that this time has passed. In 5 years you
should pretty much say the same for "single fs" vs. "cluster fs". 

> In my opinion, the whole thing comes from the idea of using cheap hardware
> and out-of-the-box configurations to keep promises of reliability and
> availability which are not realistic. There is a reason why there are more
> expensive HDDs, RAIDs, SANs with volume mirroring, multipathing and so on.
> Simply ignoring the fact that you have to use the proper tools to address
> specific problems and praying to the tooth fairy to put a
> solve-all-my-problems-fs under your pillow is no solution. I'd rather have a
> solid fs with deterministic behavior and some state-of-the-art features.

Well, sorry to say, but I begin to sound a bit like Joseph Stiglitz
trying to explain why neoliberalism does not work out.
Please accept that this world is full of failure of all kinds. If you deny
that, all your models and ideas will only be failures, too.
All I am saying is that we should accept that dead sectors, braindead
firmware-programmers, production in jungle-environment, transportation in
rough areas, high temperatures, high humidity, harddisks that have no disks
and so on are facts of life. And only a child's answer can be: "oops"
(sorry could not resist this one ;-)

> Just my 2c.
> (Gerald) 

-- 
Regards,
Stephan


end of thread, other threads:[~2008-10-27 15:43 UTC | newest]

Thread overview: 79+ messages
2008-10-21 11:23 Some very basic questions Stephan von Krawczynski
2008-10-21 12:13 ` Andi Kleen
2008-10-21 14:22   ` Stephan von Krawczynski
2008-10-21 15:34     ` jim owens
2008-10-22 11:36       ` Stephan von Krawczynski
2008-10-22 12:15         ` Avi Kivity
2008-10-22 13:03           ` Ric Wheeler
2008-10-22 13:13             ` Chris Mason
2008-10-22 13:16             ` Avi Kivity
2008-10-21 13:20 ` jim owens
2008-10-21 17:01   ` Stephan von Krawczynski
2008-10-21 17:15     ` Christoph Hellwig
2008-10-21 17:31       ` Ric Wheeler
2008-10-22 12:27         ` Stephan von Krawczynski
2008-10-22 13:15           ` Chris Mason
2008-10-22 13:27             ` Ric Wheeler
2008-10-22 14:32               ` Avi Kivity
2008-10-22 14:36                 ` Chris Mason
2008-10-22 14:40                   ` Avi Kivity
2008-10-22 14:46                 ` Ric Wheeler
2008-10-22 14:54                   ` Avi Kivity
2008-10-22 15:02                     ` Ric Wheeler
2008-10-22 15:13                       ` Avi Kivity
2008-10-22 15:25                         ` Ric Wheeler
2008-10-22 15:33                           ` Chris Mason
2008-10-22 15:43                             ` Avi Kivity
2008-10-22 15:54                               ` Ric Wheeler
2008-10-22 18:28                                 ` Avi Kivity
2008-10-22 15:39                           ` Avi Kivity
2008-10-22 13:52             ` Stephan von Krawczynski
2008-10-22 15:56               ` Michel Salim
2008-10-22 16:56                 ` jim owens
2008-10-23  9:47                 ` Stephan von Krawczynski
2008-10-22 11:40       ` Stephan von Krawczynski
2008-10-21 13:59 ` Chris Mason
2008-10-21 16:09   ` Andi Kleen
2008-10-22 11:43     ` Stephan von Krawczynski
2008-10-21 16:27   ` Stephan von Krawczynski
2008-10-21 16:59     ` Andi Kleen
2008-10-22 11:46       ` Stephan von Krawczynski
2008-10-21 17:49     ` Chris Mason
2008-10-22 12:19       ` Stephan von Krawczynski
2008-10-22 12:48         ` Jeff Schroeder
2008-10-22 14:02           ` Stephan von Krawczynski
2008-10-22 13:50         ` Chris Mason
2008-10-22 14:04           ` Matthias Wächter
2008-10-22 14:32             ` Ric Wheeler
2008-10-22 14:44               ` jim owens
2008-10-24  8:42           ` Chris Samuel
2008-10-24  8:39         ` Chris Samuel
2008-10-21 20:54   ` Eric Anopolsky
2008-10-21 22:18     ` Ric Wheeler
2008-10-22  2:29       ` Eric Anopolsky
2008-10-22 10:42         ` Ric Wheeler
2008-10-22 10:53           ` Tejun Heo
2008-10-22 12:57             ` Ric Wheeler
2008-10-22 12:57             ` Ric Wheeler
2008-10-22 13:15               ` Tejun Heo
2008-10-22 13:19                 ` Chris Mason
2008-10-22 13:38                   ` Ric Wheeler
2008-10-22 13:59                     ` Chris Mason
2008-10-22 14:23                       ` Ric Wheeler
2008-10-22 13:23                 ` Ric Wheeler
2008-10-22 16:14                   ` Tejun Heo
2008-10-22 16:34                     ` Ric Wheeler
2008-10-23  3:59                       ` Tejun Heo
2008-10-22 18:32                     ` Avi Kivity
2008-10-22 19:13                       ` jim owens
2008-10-22 19:22                         ` Avi Kivity
2008-10-22 19:59                       ` Ric Wheeler
2008-10-22 21:31                     ` Eric Anopolsky
2008-10-22 21:56                       ` Ric Wheeler
  -- strict thread matches above, loose matches on Subject: below --
2008-10-21 17:37 calin
2008-10-21 20:08 ` jim owens
2008-10-22  7:15   ` Avi Kivity
2008-10-22 14:13     ` jim owens
2008-10-22 14:25       ` Avi Kivity
2008-10-22 14:35 dbz
2008-10-27 15:43 ` Stephan von Krawczynski
