* defrag vs autodefrag @ 2015-12-21 1:26 Donald Pearson 2015-12-21 3:22 ` Duncan 2015-12-21 8:14 ` Hugo Mills 0 siblings, 2 replies; 17+ messages in thread From: Donald Pearson @ 2015-12-21 1:26 UTC (permalink / raw) To: Btrfs BTRFS I read an implication in a different thread that defrag and autodefrag behave differently in that autodefrag is more snapshot friendly for COW data. Did I understand that correctly? I have not been doing defrag on my virtual machine image directory because I do use a snapshot schedule and the way I understood things, a defrag would basically decouple the live data from the snapshots and greatly increase utilization. It sounded like autodefrag does not have this problem? If that's true, is there any case where it would not be best practice to mount with autodefrag enabled? Thanks, Donald ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: defrag vs autodefrag 2015-12-21 1:26 defrag vs autodefrag Donald Pearson @ 2015-12-21 3:22 ` Duncan 2015-12-21 8:14 ` Hugo Mills 1 sibling, 0 replies; 17+ messages in thread From: Duncan @ 2015-12-21 3:22 UTC (permalink / raw) To: linux-btrfs Donald Pearson posted on Sun, 20 Dec 2015 19:26:07 -0600 as excerpted: > I read an implication in a different thread that defrag and autodefrag > behave differently in that autodefrag is more snapshot friendly for COW > data. > > Did I understand that correctly? I have not been doing defrag on my > virtual machine image directory because I do use a snapshot schedule and > the way I understood things, a defrag would basically decouple the live > data from the snapshots and greatly increase utilization. > > It sounded like autodefrag does not have this problem? I'd definitely like more details on that myself, as Hugo did say that and I normally trust his advice as he's one of the few folks who can and does actually read the code, but spends most of his time helping admin-level users, not coding. However, without further context, that claim simply doesn't make a lot of sense to me, so I don't really know what to make of it. Was he mistaken in this case? Is he correct and we just don't know the mechanism that allows autodefrag to work well here while normal defrag doesn't? Are we using different definitions and thus actually talking about different things? I don't know. The claim simply doesn't make sense on its own, so one way or another, there's gotta be more to the story than other list-regular "admin level" people here seem to know. > If that's true, is there any case where it would not be best practice to > mount with autodefrag enabled? The primary issue with autodefrag is performance. You mentioned VMs, but didn't say how busy they are or whether you're spinning rust or SSDs, and both factors make a big difference. 
In general, the performance issue comes from write amplification: autodefrag queues up (I believe) entire files for rewrite. With slow enough media and large and rewrite-busy enough files, there simply comes a point at which change-writes are coming in faster than the entire file can be more or less constantly rewritten due to the autodefrag flag along with the busy VM/DB/whatever. If the VMs/DBs/other-large-file-rewrite-apps aren't particularly busy, then obviously it'll take a larger file or slower media to trigger the problem. Similarly, on fast media it'll take a larger file or busier file rewriting to trigger the problem, and if the file in question is small enough, the problem generally doesn't trigger because it gets rewritten fast enough even on slow media with a busy rewriter. On reasonably fast SSDs, I'd say try it and see, particularly if your VM images tend to be under say 4 GiB, as there's a good chance they'll be fast enough to eliminate the problem. Similarly, if your VMs are sitting around doing nothing a lot of the time anyway, and/or if when they are busy their work isn't time critical so you don't care about the autodefrag file rewrite bottlenecking, it's worth at least a try. As for file size, under a quarter GiB (which would I guess be a pretty rare VM, but it covers most trivial SQLite databases, like those firefox and thunderbird use, quite well) people rarely have problems. At half a GiB, some are reporting issues, and by a GiB in size, most on spinning rust will be seeing a noticeable slowdown. Unfortunately, I guess most VM images are multiple GiB, so the VM use-case on spinning rust isn't likely to be a good match, unless of course the VMs simply aren't busy enough to make it an issue. Assuming you do decide to try autodefrag... 
Note that if you turn on autodefrag on already heavily fragmented files such as yours are likely to be, with enough fragmentation that existing free space is likely to be quite fragmented as well so the initial rewrite isn't going to be able to fix it, you're likely to see a much higher initial hit, until fragmentation calms down quite a bit, than you would had you been running autodefrag the entire time. So assuming you have the additional space and time available to do so, it may be worth copying the existing VM images over to a fresh btrfs, mounted with autodefrag from the get-go, so they start unfragmented and you don't have that huge initial hit to take before things calm down. (That's what I've done with all my btrfs here, mounted with autodefrag from the very first mount. Tho my use-case doesn't involve VMs or snapshots, so it's unsurprising I've not personally had to deal with this, the primary autodefrag issue, in my own usage.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 17+ messages in thread
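Duncan's write-amplification argument above can be sketched as a toy model. This is a rough illustration only: the throughput and change-interval numbers are assumed for the example, not measurements, and the premise that autodefrag requeues roughly the whole file is Duncan's "I believe", not confirmed behavior.

```python
# Toy model of the write-amplification argument above. Assumption (per
# Duncan's "I believe"): autodefrag effectively requeues the whole file
# for rewrite, so the file must be rewritable faster than changes arrive.

def rewrite_seconds(file_size_gib, media_mib_per_s):
    """Time to rewrite the whole file once at the media's sequential rate."""
    return file_size_gib * 1024 / media_mib_per_s

def keeps_up(file_size_gib, media_mib_per_s, change_interval_s):
    """True if a full-file rewrite finishes before the next burst of changes."""
    return rewrite_seconds(file_size_gib, media_mib_per_s) < change_interval_s

# A quarter-GiB SQLite file on spinning rust (~100 MiB/s), changes every 10 s:
print(keeps_up(0.25, 100, 10))   # True: small files rarely trigger the problem
# A 4 GiB VM image on the same disk, same change rate:
print(keeps_up(4, 100, 10))      # False: rewrites can no longer keep pace
```

On these assumed numbers the quarter-GiB file rewrites in under three seconds while the 4 GiB image needs about 41 seconds per pass, which lines up with the anecdotal quarter-GiB/half-GiB/one-GiB thresholds Duncan cites.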
* Re: defrag vs autodefrag 2015-12-21 1:26 defrag vs autodefrag Donald Pearson 2015-12-21 3:22 ` Duncan @ 2015-12-21 8:14 ` Hugo Mills 2015-12-21 9:28 ` Filipe Manana 1 sibling, 1 reply; 17+ messages in thread From: Hugo Mills @ 2015-12-21 8:14 UTC (permalink / raw) To: Donald Pearson; +Cc: Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 1082 bytes --] On Sun, Dec 20, 2015 at 07:26:07PM -0600, Donald Pearson wrote: > I read an implication in a different thread that defrag and autodefrag > behave differently in that autodefrag is more snapshot friendly for > COW data. > > Did I understand that correctly? I have not been doing defrag on my > virtual machine image directory because I do use a snapshot schedule > and the way I understood things, a defrag would basically decouple the > live data from the snapshots and greatly increase utilization. > > It sounded like autodefrag does not have this problem? Correct. > If that's true, is there any case where it would not be best practice > to mount with autodefrag enabled? When you are already tight on I/O bandwidth for your application. autodefrag increases the amount of I/O the disks do (because it's rewriting parts of the file near each write, as well as just the piece that's being written by userspace). Hugo. -- Hugo Mills | UNIX: Japanese brand of food containers hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4 | [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: defrag vs autodefrag 2015-12-21 8:14 ` Hugo Mills @ 2015-12-21 9:28 ` Filipe Manana 2015-12-22 20:16 ` Christoph Anton Mitterer ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Filipe Manana @ 2015-12-21 9:28 UTC (permalink / raw) To: Hugo Mills, Donald Pearson, Btrfs BTRFS On Mon, Dec 21, 2015 at 8:14 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > On Sun, Dec 20, 2015 at 07:26:07PM -0600, Donald Pearson wrote: >> I read an implication in a different thread that defrag and autodefrag >> behave differently in that autodefrag is more snapshot friendly for >> COW data. >> >> Did I understand that correctly? I have not been doing defrag on my >> virtual machine image directory because I do use a snapshot schedule >> and the way I understood things, a defrag would basically decouple the >> live data from the snapshots and greatly increase utilization. >> >> It sounded like autodefrag does not have this problem? > > Correct. Hum? How is that so? Snapshot-aware defrag was disabled almost 2 years ago, and that piece of code is used both by a "manual" defrag (ioctl) and by automatic defrag. > >> If that's true, is there any case where it would not be best practice >> to mount with autodefrag enabled? > > When you are already tight on I/O bandwidth for your application. > autodefrag increases the amount of I/O the disks do (because it's > rewriting parts of the file near each write, as well as just the piece > that's being written by userspace). > > Hugo. > > -- > Hugo Mills | UNIX: Japanese brand of food containers > hugo@... carfax.org.uk | > http://carfax.org.uk/ | > PGP: E2AB1DE4 | -- Filipe David Manana, "Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men." ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: defrag vs autodefrag 2015-12-21 9:28 ` Filipe Manana @ 2015-12-22 20:16 ` Christoph Anton Mitterer 2015-12-22 20:30 ` Hugo Mills 2015-12-27 3:03 ` [PATCH] improve documentation of snapshot unaware defrag Christoph Anton Mitterer 2 siblings, 0 replies; 17+ messages in thread From: Christoph Anton Mitterer @ 2015-12-22 20:16 UTC (permalink / raw) To: fdmanana, Hugo Mills, Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 656 bytes --] On Mon, 2015-12-21 at 09:28 +0000, Filipe Manana wrote: > Hum? > How is that so? Snapshot-aware defrag was disabled almost 2 years > ago, > and that piece of code is used both by a "manual" defrag (ioctl) and > by automatic defrag. Thanks for clearing that up. Could someone then please add a corresponding warning to the btrfs-mount(5) manpage, as there is for btrfs-filesystem(5)?! It really shouldn't happen that such dramatic changes get made without being properly documented, making it quite likely that some users' filesystems get "blown up" (well, at least in the sense that quite some space may be eaten up). Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5313 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: defrag vs autodefrag 2015-12-21 9:28 ` Filipe Manana 2015-12-22 20:16 ` Christoph Anton Mitterer @ 2015-12-22 20:30 ` Hugo Mills 2015-12-23 2:16 ` Duncan 2015-12-27 3:03 ` [PATCH] improve documentation of snapshot unaware defrag Christoph Anton Mitterer 2 siblings, 1 reply; 17+ messages in thread From: Hugo Mills @ 2015-12-22 20:30 UTC (permalink / raw) To: Filipe Manana; +Cc: Donald Pearson, Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 2585 bytes --] On Mon, Dec 21, 2015 at 09:28:37AM +0000, Filipe Manana wrote: > On Mon, Dec 21, 2015 at 8:14 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > > On Sun, Dec 20, 2015 at 07:26:07PM -0600, Donald Pearson wrote: > >> I read an implication in a different thread that defrag and autodefrag > >> behave differently in that autodefrag is more snapshot friendly for > >> COW data. > >> > >> Did I understand that correctly? I have not been doing defrag on my > >> virtual machine image directory because I do use a snapshot schedule > >> and the way I understood things, a defrag would basically decouple the > >> live data from the snapshots and greatly increase utilization. > >> > >> It sounded like autodefrag does not have this problem? > > > > Correct. > > Hum? > How is that so? Snapshot-aware defrag was disabled almost 2 years ago, > and that piece of code is used both by a "manual" defrag (ioctl) and > by automatic defrag. A couple of days ago (it's been a busy week) I started writing a rebuttal of this based on the behaviour of filesystems with autodefrag set on. Then I realised that actually, we're both right. :) If autodefrag behaves as you say (and I'm sure it does, since you know the code), then when you write data to a file, a piece of the file around the write(s) can be copied and written elsewhere by autodefrag. That's going to be the same as the manual defrag. However, it's a relatively small separation effect that's likely lost in the noise. 
If you rewrite the whole file, then it'll separate the file completely -- which is what manual defrag does -- but then, you'd expect that behaviour without autodefrag, too. So, ultimately, autodefrag *does* separate reflink copies, but only near where writes are made to the file, so the overall effect is very similar to the effect you'd have from writing to the file without autodefrag. And I am now better informed than I was before. :) Hugo. > >> If that's true, is there any case where it would not be best practice > >> to mount with autodefrag enabled? > > > > When you are already tight on I/O bandwidth for your application. > > autodefrag increases the amount of I/O the disks do (because it's > > rewriting parts of the file near each write, as well as just the piece > > that's being written by userspace). > > > > Hugo. > > -- Hugo Mills | If the first-ever performance is the première, is hugo@... carfax.org.uk | the last-ever performance the derrière? http://carfax.org.uk/ | PGP: E2AB1DE4 | [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
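Hugo's reconciliation can be illustrated with a small simulation. This is a deliberately simplified model, not btrfs's actual defrag granularity: the file is a list of extents all initially shared with one snapshot, each write CoW-unshares the written extent, and autodefrag is modeled as additionally rewriting a small assumed window of extents around each write.

```python
# Illustrative simulation of the point above: plain CoW unshares only the
# written extent, an autodefrag-like pass also unshares a small
# neighbourhood around each write, and a manual whole-file defrag
# unshares everything. Window sizes here are assumptions for illustration.

def unshared_after_writes(n_extents, writes, window):
    """Return how many of n_extents stay shared with a snapshot after
    CoW-writing each extent index in `writes`, where each write also
    rewrites `window` extents on either side (window=0 models plain CoW,
    a small window models autodefrag, window >= n models manual defrag)."""
    shared = [True] * n_extents
    for w in writes:
        for i in range(max(0, w - window), min(n_extents, w + window + 1)):
            shared[i] = False
    return sum(shared)

writes = [100, 500, 900]
print(unshared_after_writes(1000, writes, 0))     # 997 shared: plain CoW
print(unshared_after_writes(1000, writes, 4))     # 973 shared: autodefrag-like
print(unshared_after_writes(1000, writes, 1000))  # 0 shared: manual defrag
```

The middle case is the "relatively small separation effect that's likely lost in the noise": autodefrag unshares a few extra extents per write, nowhere near the total separation a manual defrag causes.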
* Re: defrag vs autodefrag 2015-12-22 20:30 ` Hugo Mills @ 2015-12-23 2:16 ` Duncan 0 siblings, 0 replies; 17+ messages in thread From: Duncan @ 2015-12-23 2:16 UTC (permalink / raw) To: linux-btrfs Hugo Mills posted on Tue, 22 Dec 2015 20:30:41 +0000 as excerpted: > On Mon, Dec 21, 2015 at 09:28:37AM +0000, Filipe Manana wrote: >> On Mon, Dec 21, 2015 at 8:14 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >> > On Sun, Dec 20, 2015 at 07:26:07PM -0600, Donald Pearson wrote: >> >> I read an implication in a different thread that defrag and >> >> autodefrag behave differently in that autodefrag is more snapshot >> >> friendly for COW data. >> > Correct. >> How is that so? Snapshot-aware defrag was disabled almost 2 years ago, >> and that piece of code is used both by a "manual" defrag (ioctl) and by >> automatic defrag. > Then I realised that actually, we're both right. :) > > If autodefrag behaves as you say (and I'm sure it does, since you > know the code), then when you write data to a file, a piece of the file > around the write(s) can be copied and written elsewhere by autodefrag. > That's going to be the same as the manual defrag. However, > it's a relatively small separation effect that's likely lost in the > noise. If you rewrite the whole file, then it'll separate the file > completely -- which is what manual defrag does -- but then, you'd expect > that behaviour without autodefrag, too. > > So, ultimately, autodefrag *does* separate reflink copies, but only > near where writes are made to the file, so the overall effect is very > similar to the effect you'd have from writing to the file without > autodefrag. > > And I am now better informed than I was before. :) And now so are we. Thanks, Hugo. I /knew/ there had to be a good explanation, as what you were saying just didn't fit in with what I knew. Now I have that explanation, and it makes perfect sense. =:^) -- Duncan - List replies preferred. No HTML msgs. 
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH] improve documentation of snapshot unaware defrag 2015-12-21 9:28 ` Filipe Manana 2015-12-22 20:16 ` Christoph Anton Mitterer 2015-12-22 20:30 ` Hugo Mills @ 2015-12-27 3:03 ` Christoph Anton Mitterer 2015-12-27 3:10 ` Christoph Anton Mitterer 2015-12-27 7:09 ` Duncan 2 siblings, 2 replies; 17+ messages in thread From: Christoph Anton Mitterer @ 2015-12-27 3:03 UTC (permalink / raw) To: linux-btrfs; +Cc: calestyo In btrfs-filesystem(8), improved the documentation of snapshot unaware defragmentation and included the exact kernel version numbers being affected as well as the possible effects. No longer use the word "unlink", which is easily understood as "deleting a file". Moved the warning more to the beginning of the "defragment" subcommand's documentation where it's more visible to readers. Added the same warning to the "autodefrag" option of btrfs-mount(5). Signed-off-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name> --- Documentation/btrfs-filesystem.asciidoc | 11 +++++++---- Documentation/btrfs-mount.asciidoc | 7 +++++++ 2 files changed, 14 insertions(+), 4 deletions(-) diff --git a/Documentation/btrfs-filesystem.asciidoc b/Documentation/btrfs-filesystem.asciidoc index 31cd51b..600bbac 100644 --- a/Documentation/btrfs-filesystem.asciidoc +++ b/Documentation/btrfs-filesystem.asciidoc @@ -55,6 +55,13 @@ if the free space is too fragmented. Use 0 to take the kernel default, which is 256kB but may change in the future. You can also turn on compression in defragment operations. + +WARNING: Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as well as +with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or ≥ 3.13.4 will break up +the ref-links of CoW data (for example files copied with `cp --reflink`, +snapshots or de-duplicated data). +This may cause considerable increase of space usage depending on the broken up +ref-links. 
++ `Options` + -v:::: @@ -79,10 +86,6 @@ target extent size, do not touch extents bigger than <size> For <start>, <len>, <size> it is possible to append units designator: \'K', \'M', \'G', \'T', \'P', or \'E', which represent KiB, MiB, GiB, TiB, PiB, or EiB, respectively. Case does not matter. -+ -WARNING: defragmenting with kernels up to 2.6.37 will unlink COW-ed copies of data, -don't use it if you use snapshots, have de-duplicated your data or made -copies with `cp --reflink`. *label* [<dev>|<mountpoint>] [<newlabel>]:: Show or update the label of a filesystem. diff --git a/Documentation/btrfs-mount.asciidoc b/Documentation/btrfs-mount.asciidoc index 39215a8..d364594 100644 --- a/Documentation/btrfs-mount.asciidoc +++ b/Documentation/btrfs-mount.asciidoc @@ -26,6 +26,13 @@ MOUNT OPTIONS Auto defragmentation detects small random writes into files and queue them up for the defrag process. Works best for small files; Not well suited for large database workloads. + + + WARNING: Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as + well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or + ≥ 3.13.4 will break up the ref-links of CoW data (for example files + copied with `cp --reflink`, snapshots or de-duplicated data). + This may cause considerable increase of space usage depending on the + broken up ref-links. *check_int*:: *check_int_data*:: -- 2.6.4 ^ permalink raw reply related [flat|nested] 17+ messages in thread
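The version ranges in that warning can be expressed as a small checker. This is a hypothetical helper, not part of btrfs-progs; the ranges are taken verbatim from the patch text, and for simplicity it treats 3.14-rc2 as the 3.14 cutoff.

```python
# Hypothetical checker for the kernel-version ranges stated in the patch
# above: defrag breaks CoW ref-links before 3.9 and from 3.14 onward, and
# also in the listed stable series once snapshot-aware defrag was disabled.

DISABLED_IN_STABLE = {(3, 10): 31, (3, 12): 12, (3, 13): 4}

def defrag_breaks_reflinks(major, minor, patch=0):
    """True if defrag on kernel major.minor.patch breaks up CoW ref-links,
    per the warning text (3.14-rc2 simplified to the 3.14 boundary)."""
    if (major, minor) < (3, 9) or (major, minor) >= (3, 14):
        return True
    first_bad = DISABLED_IN_STABLE.get((major, minor))
    return first_bad is not None and patch >= first_bad

print(defrag_breaks_reflinks(3, 9))      # False: snapshot-aware window
print(defrag_breaks_reflinks(3, 13, 4))  # True: disabled in 3.13.4 stable
print(defrag_breaks_reflinks(4, 3))      # True: after removal in 3.14-rc2
```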
* Re: [PATCH] improve documentation of snapshot unaware defrag 2015-12-27 3:03 ` [PATCH] improve documentation of snapshot unaware defrag Christoph Anton Mitterer @ 2015-12-27 3:10 ` Christoph Anton Mitterer 2015-12-27 7:09 ` Duncan 1 sibling, 0 replies; 17+ messages in thread From: Christoph Anton Mitterer @ 2015-12-27 3:10 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 708 bytes --] On Sun, 2015-12-27 at 04:03 +0100, Christoph Anton Mitterer wrote: > -WARNING: defragmenting with kernels up to 2.6.37 will unlink COW-ed Perhaps someone can also check the above. I was looking through the git history, but couldn't find anything wrt 2.6.37... The commits I've basically searched for in the non-stable repo were 38c227d87c49ad5d173cb5d4374d49acec6a495d (adding the ref-link aware defrag) 8101c8dbf6243ba517aab58d69bf1bc37d8b7b9c (removing it) But maybe I've missed something. Also, the wiki mentioned it for 3.13, I changed that: https://btrfs.wiki.kernel.org/index.php?title=Changelog&action=historysubmit&diff=29765&oldid=29697 Please correct if wrong. HTH, Chris [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5313 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] improve documentation of snapshot unaware defrag 2015-12-27 3:03 ` [PATCH] improve documentation of snapshot unaware defrag Christoph Anton Mitterer 2015-12-27 3:10 ` Christoph Anton Mitterer @ 2015-12-27 7:09 ` Duncan 2015-12-28 0:50 ` Christoph Anton Mitterer 1 sibling, 1 reply; 17+ messages in thread From: Duncan @ 2015-12-27 7:09 UTC (permalink / raw) To: linux-btrfs Christoph Anton Mitterer posted on Sun, 27 Dec 2015 04:03:27 +0100 as excerpted: [Rewrapped here but all added lines.] > +WARNING: Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 > as well as > +with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or ≥ > 3.13.4 will break up > +the ref-links of CoW data (for example files copied with > `cp --reflink`, > +snapshots or de-duplicated data). > +This may cause considerable increase of space usage depending on the > broken up +ref-links. Thanks. I had looked at that a few times and thought it needed updating, but I think it hadn't reached my pain threshold yet[1], so I hadn't yet posted about it. Glad it reached someone's pain threshold. =:^) --- [1] Pain threshold: Or more like, I was always doing something else at the time, which is probably everybody else's excuse too. But by contrast it can be noted that I posted right away when I noticed the mkfs.btrfs manpage totally lost raid1 mode with one update, because I use it, regardless of what else I was doing. I guess that must have hit my pain threshold... -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] improve documentation of snapshot unaware defrag 2015-12-27 7:09 ` Duncan @ 2015-12-28 0:50 ` Christoph Anton Mitterer 2015-12-28 1:58 ` Hugo Mills 2015-12-28 2:51 ` Duncan 0 siblings, 2 replies; 17+ messages in thread From: Christoph Anton Mitterer @ 2015-12-28 0:50 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 203 bytes --] On Sun, 2015-12-27 at 07:09 +0000, Duncan wrote: > raid1 mode I wonder when that reaches my pain threshold... and I submit a patch that renames it "notreallyraid1" in all places ;-) Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5313 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] improve documentation of snapshot unaware defrag 2015-12-28 0:50 ` Christoph Anton Mitterer @ 2015-12-28 1:58 ` Hugo Mills 2015-12-28 2:07 ` Christoph Anton Mitterer 2015-12-28 9:12 ` Duncan 2015-12-28 2:51 ` Duncan 1 sibling, 2 replies; 17+ messages in thread From: Hugo Mills @ 2015-12-28 1:58 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 721 bytes --] On Mon, Dec 28, 2015 at 01:50:09AM +0100, Christoph Anton Mitterer wrote: > On Sun, 2015-12-27 at 07:09 +0000, Duncan wrote: > > raid1 mode > I wonder when that reaches my pain threshold... and I submit a patch > that renames it "notreallyraid1" in all places ;-) Isn't this an FAQ already? There is already a patch to rename the RAID modes. It's been sitting in the progs patch queue for about 2 years, because none of the senior devs has acked it yet (since it's a big user-visible change). Hugo. -- Hugo Mills | Talking about music is like dancing about hugo@... carfax.org.uk | architecture http://carfax.org.uk/ | PGP: E2AB1DE4 | Frank Zappa [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] improve documentation of snapshot unaware defrag 2015-12-28 1:58 ` Hugo Mills @ 2015-12-28 2:07 ` Christoph Anton Mitterer 2015-12-28 9:12 ` Duncan 0 siblings, 0 replies; 17+ messages in thread From: Christoph Anton Mitterer @ 2015-12-28 2:07 UTC (permalink / raw) To: Hugo Mills; +Cc: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1135 bytes --] On Mon, 2015-12-28 at 01:58 +0000, Hugo Mills wrote: > Isn't this an FAQ already? There is already a patch to rename the > RAID modes. It's been sitting in the progs patch queue for about 2 > years, because none of the senior devs has acked it yet (since it's a > big user-visible change). Uhm... yeah, it's a bit invasive... but that happens when such improper naming is done in the first place :-/ It's similar to when tools wrongly or ambiguously use SI prefixes, instead of the proper kB, MB, GB, etc. (for base 1000) or KiB, MiB, GiB, etc. (for base 1024). Especially just using K, M, G is simply evil and should lead to public punishment ;-) I'm also not really fond of what btrfs took over from LVM, namely kKmMgGtTpPeE,... it's ambiguous or at least unclean as well... Probably best would be not to use "raid" as names at all (or only as aliases for the actual canonical names), but rather names that describe what's actually done. E.g. classic RAID1 = mirror btrfs RAID1 = dup2 or something similar like clone2, replica2 classic RAID0 = striped RAID5/6 = parityN Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5313 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
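The prefix ambiguity Christoph is complaining about is easy to put in numbers; a bare "K", "M" or "G" hides a discrepancy that grows with each prefix:

```python
# SI (base-1000) vs IEC (base-1024) prefixes: the ratio between the two
# interpretations of each "same" prefix, which is what makes a bare
# K/M/G ambiguous.

SI = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
IEC = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

for (si, s), (iec, i) in zip(SI.items(), IEC.items()):
    print(f"{si} vs {iec}: {i / s:.3f}x")  # 1.024x, 1.049x, 1.074x, 1.100x
```

By the terabyte level the two readings differ by almost 10%, which is why ambiguous tool output keeps generating confused "where did my space go" questions.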
* Re: [PATCH] improve documentation of snapshot unaware defrag 2015-12-28 1:58 ` Hugo Mills 2015-12-28 2:07 ` Christoph Anton Mitterer @ 2015-12-28 9:12 ` Duncan 1 sibling, 0 replies; 17+ messages in thread From: Duncan @ 2015-12-28 9:12 UTC (permalink / raw) To: linux-btrfs Hugo Mills posted on Mon, 28 Dec 2015 01:58:07 +0000 as excerpted: > On Mon, Dec 28, 2015 at 01:50:09AM +0100, Christoph Anton Mitterer > wrote: >> On Sun, 2015-12-27 at 07:09 +0000, Duncan wrote: >> > raid1 mode >> I wonder when that reaches my pain threshold... and I submit a patch >> that renames it "notreallyraid1" in all places ;-) > > Isn't this an FAQ already? There is already a patch to rename the > RAID modes. It's been sitting in the progs patch queue for about 2 > years, because none of the senior devs has acked it yet (since it's a > big user-visible change). I don't see it in the FAQ, but I see hints on both the sysadmin's guide and the usecases pages. (Either the wiki or firefox seems to be having certificate problems ATM and all I'm getting is an OCSP response has an invalid sig error. But the resurrect this page extension to the rescue, click the resurrect via google and I get it. Links has no problem loading the page, but lynx does, so it's not just firefox.) UseCases: First section is RAID, first question there is on creating a raid1 mirror in btrfs. It has this to say at the end of the answer: >>>>> NOTE This does not do the 'usual thing' for 3 or more drives. Until "N- Way" (traditional) RAID-1 is implemented: Loss of more than one drive might crash the array. For now, RAID-1 means 'one copy of what's important exists on two of the drives in the array no matter how many drives there may be in it'. <<<<< SysadminGuide: Second section is data usage and allocation. First subsection there is RAID and data replication. The first paragraph there is: >>>>> Btrfs's "RAID" implementation bears only passing resemblance to traditional RAID implementations. 
Instead, btrfs replicates data on a per-chunk basis. If the filesystem is configured to use "RAID-1", for example, chunks are allocated in pairs, with each chunk of the pair being taken from a different block device. Data written to such a chunk pair will be duplicated across both chunks. <<<<< The multi-device page has little theoretical discussion, only discussing current status and having a bunch of specific commandline examples. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 17+ messages in thread
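The wiki's chunk-pair description can be sketched as a toy allocator. Assumptions: each raid1 chunk is mirrored on exactly two devices, the allocator always picks the two devices with the most free space (a simplification of btrfs's actual policy), and 1 GiB chunks and the device sizes are illustrative.

```python
# Toy model of btrfs raid1 chunk-pair allocation as described above:
# every chunk exists on exactly two devices, pairs are placed greedily
# on the two devices with the most free space.

def raid1_usable(devices_gib, chunk=1):
    """Greedily allocate mirrored chunk pairs; return usable data capacity."""
    free = list(devices_gib)
    usable = 0
    while True:
        order = sorted(range(len(free)), key=lambda i: free[i], reverse=True)
        a, b = order[0], order[1]
        if free[b] < chunk:  # a chunk pair needs two distinct devices
            return usable
        free[a] -= chunk
        free[b] -= chunk
        usable += chunk

print(raid1_usable([1000, 1000]))        # 1000: classic two-device mirror
print(raid1_usable([1000, 1000, 1000]))  # 1500: pairs rotate over 3 devices
print(raid1_usable([2000, 500, 500]))    # 1000: big device capped by the rest
```

The three-device case shows the chunk-level difference from a traditional three-way RAID-1 mirror, which would store three copies and yield only 1000; btrfs raid1 keeps exactly two copies of each chunk no matter how many devices are in the array.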
* Re: [PATCH] improve documentation of snapshot unaware defrag 2015-12-28 0:50 ` Christoph Anton Mitterer 2015-12-28 1:58 ` Hugo Mills @ 2015-12-28 2:51 ` Duncan 2015-12-28 3:03 ` Christoph Anton Mitterer 1 sibling, 1 reply; 17+ messages in thread From: Duncan @ 2015-12-28 2:51 UTC (permalink / raw) To: linux-btrfs Christoph Anton Mitterer posted on Mon, 28 Dec 2015 01:50:09 +0100 as excerpted: > On Sun, 2015-12-27 at 07:09 +0000, Duncan wrote: >> raid1 mode > I wonder when that reaches my pain threshold... and I submit a patch > that renames it "notreallyraid1" in all places ;-) I've seen two responses to that, both correct, AFAIK. 1) Btrfs very specifically and deliberately uses *lowercase* raidN in part to make that distinction, as the btrfs variants are chunk-level (and designed so that at some point in the future they can be subvolume and/or file level), not device-level (and at that future point, not necessarily filesystem level either). As we've seen in discussion in other threads, for raid10 in particular, that makes a profound difference in robustness in the multi-device failure case. 2) Regarding btrfs raid1 and raid10's current very specific two-way- mirroring in particular, limiting to two-way-mirroring in the 3+ devices case is well within established definitions and historic usage. Apparently, the N-devices = N-way-mirroring usage is relatively new, arguably first popularized by Linux mdraid, after which various hardware raid suppliers also implemented it due to competitive pressure. But only two-way-mirroring is required by the RAID-1 definition. Even were that not the case, point #1, btrfs' very specific use of *lowercase* raid1, still covers the two-way-limitation case just as well as it covers the chunk-level case. That said, that the limited pair-mirroring btrfs implements even in the 3+ device case still meets formal RAID-1 definitions was originally news to me as well, however well I might now accept the fact. 
But once my earlier naive assumptions were corrected, the remaining clarification issues fell below my pain threshold. For those for whom it's still very close to their pain threshold, due to the above, a patch effectively doing s/raid1/notreallyraid1/g is unlikely to be accepted. Much more likely to be accepted would be a patch to the btrfs-balance and mkfs.btrfs manpages adding a note, preferably accounting for the raid10 situation as well, explaining that btrfs raid (lowercase) isn't RAID (uppercase) in the traditional sense, that it's chunk-scope not device-scope, that this has implications for, for instance, robustness in the raid10 multi-device failure case, and that both raid1 and raid10 are (currently) limited to two-way-mirroring. Meanwhile, for anyone considering writing that patch, I'd also strongly recommend that the two-way-mirroring wording be separated out, at least onto its own lines if not a separate paragraph, so it can be cleanly deleted and/or modified once N-way-mirroring is introduced as a feature, without having to rewrite the chunk-level and raid10 bit as well. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] improve documentation of snapshot unaware defrag 2015-12-28 2:51 ` Duncan @ 2015-12-28 3:03 ` Christoph Anton Mitterer 2015-12-28 6:12 ` Duncan 0 siblings, 1 reply; 17+ messages in thread From: Christoph Anton Mitterer @ 2015-12-28 3:03 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2055 bytes --] On Mon, 2015-12-28 at 02:51 +0000, Duncan wrote: > 1) Btrfs very specifically and deliberately uses *lowercase* raidN > in part to make that distinction, as the btrfs variants are chunk- > level (and designed so that at some point in the future they can be > subvolume and/or file level), not device-level (and at that future > point, not necessarily filesystem level either). I guess no "normal" user would expect or understand that lower/upper case would imply any distinction. > 2) Regarding btrfs raid1 and raid10's current very specific two-way- > mirroring in particular, limiting to two-way-mirroring in the 3+ > devices > case is well within established definitions and historic usage. > Apparently, the N-devices = N-way-mirroring usage is relatively new, > arguably first popularized by Linux mdraid, after which various > hardware > raid suppliers also implemented it due to competitive pressure. But > only > two-way-mirroring is required by the RAID-1 definition. No, this is not true. This http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf is the original paper on RAID. Chapter 7 describes RAID1, and it clearly says "all disks are duplicated" as well as "Level 1 RAID has only one data disk". I don't know of a single case of a HW RAID controller (and we've had quite a few of them here at the Tier2) or other software implementation where RAID1 had any meaning other than "N disks, N mirrors". > Even were that not the case, point #1, btrfs' very specific use of > *lowercase* raid1, still covers the two-way-limitation case just as well
Hmm, that still wouldn't change anything, IMHO... saying "lower case RAID is something different than upper case RAID" would be just a bit ... uhm... weird. Actually, btrfs doing it at the chunk level (while RAID is at the device level) rather proves my point that "raid" or "RAID" or any other lower/upper case combination shouldn't be used at all. Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5313 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] improve documentation of snapshot unaware defrag 2015-12-28 3:03 ` Christoph Anton Mitterer @ 2015-12-28 6:12 ` Duncan 0 siblings, 0 replies; 17+ messages in thread From: Duncan @ 2015-12-28 6:12 UTC (permalink / raw) To: linux-btrfs Christoph Anton Mitterer posted on Mon, 28 Dec 2015 04:03:05 +0100 as excerpted: > On Mon, 2015-12-28 at 02:51 +0000, Duncan wrote: >> 1) Btrfs very specifically and deliberately uses *lowercase* raidN in >> part to make that distinction, as the btrfs variants are chunk-level >> (and designed so that at some point in the future they can be subvolume >> and/or file level), not device-level (and at that future point, not >> necessarily filesystem level either). > I guess no "normal" user would expect or understand that lower/upper > case would imply any distinction. I /could/ argue the case based on the definition of the "normal" in "normal user", but I won't, as in any case I agree with you at least to the extent that a better explanation of the details should eventually be found both on the wiki (where it is arguably already covered in the sysadmin's and multiple devices pages) and in the btrfs-balance and mkfs.btrfs manpages (where it remains uncovered). >> 2) Regarding btrfs raid1 and raid10's current very specific two-way-mirroring >> in particular, limiting to two-way-mirroring in the 3+ >> devices case is well within established definitions and historic usage. >> Apparently, the N-devices = N-way-mirroring usage is relatively new, >> arguably first popularized by Linux mdraid, after which various >> hardware raid suppliers also implemented it due to competitive >> pressure. But only two-way-mirroring is required by the RAID-1 >> definition. > No, this is not true. > > This http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf is > the original paper on RAID. > Chapter 7 describes RAID1 and it clearly says "all disks are > duplicated" as well as "Level 1 RAID has only one data disk".
Kudos for digging up the reference. =:^) Never-the-less, I (and others from whom I got the position) believe your interpretation is arguably in error. More precisely... 1) In the context of the Level 1 RAID discussed in chapter 7, from earlier in the paper, in chapter 6, introducing RAID, on page 6 of the paper, which is page 8 of the PDF (quotes here between the >>>>> and <<<<< demarcs, [...] indicating elision, as traditional): >>>>> Reliability: Our basic approach will be to break the arrays into reliability groups, with each group having extra "check" disks containing redundant information. [...] Here are some other terms that we use: D = total number of disks with data (not including the extra check disks); G = number of data disks in a group (not including the extra check disks); [...] C = number of check disks in a group; <<<<< That's the context, disks grouped for reliability, with data and check disks in a group, but multiple such groups. Then later in the paper, in the First Level RAID discussion in chapter 7, starting on page 9 of the paper, page 11 of the pdf: >>>>> Mirrored disks are a traditional approach for improving reliability of magnetic disks. This is the most expensive option since all disks are duplicated (G=1 and C=1), and every write to a data disk is also a write to a check disk. <<<<< With the definitions and context above, we see that the "(G=1 and C=1)" defines First Level RAID as exactly one data disk and one check disk in a reliability group, with multiple such groups. So yes, it has "only one data disk"... in a defined context where that's per group, with exactly one check disk as well, with multiple groups, such that each write to a group writes to exactly one data disk and one check disk, but a full write may be to many groups.
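To make the group accounting above concrete, here is a quick illustrative sketch (the function name and parameter names are mine, not the paper's) of the D/G/C disk-count arithmetic:

```python
def total_disks(data_disks, group_size, check_per_group):
    # The paper's reliability-group model: D data disks split into
    # D/G groups, with each group carrying C extra check disks.
    assert data_disks % group_size == 0, "D must split evenly into groups"
    groups = data_disks // group_size
    return data_disks + groups * check_per_group

# First Level RAID (mirroring) is G=1, C=1: every data disk is paired
# with its own check disk, so the array totals 2D disks and usable
# capacity is 50%.
print(total_disks(10, 1, 1))  # -> 20
```

For comparison, a higher-level RAID grouping such as G=10, C=1 gives total_disks(10, 10, 1) = 11 disks for the same 10 data disks, which is why the paper calls Level 1 "the most expensive option".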
This can be further seen by examining Table II on page 10 of the paper (12 of the pdf), where the total number of disks is declared to be 2D (twice the number of data disks, based on the above definition of D), and usable storage capacity to be 50%. Further, in the commentary on the same page, "Since a Level 1 RAID has only one data disk in its group, we assume that the large transfer requires the same number of disk acting in concert as found in groups of the higher level RAIDs: 10 to 25 disks." Again, that emphasizes the per-group aspect of the G=1, C=1 definition, and the fact that there's many such groups in the deployment. Finally, "Duplicating all disks can mean doubling the cost of the database system or using only 50% of the disk storage capacity." Again, very clearly pair-mirroring, with many such pair-mirrors in the array. Which, other than the per-chunk rather than per-disk granularity, is _exactly_ what btrfs does. It would actually seem that the N-way-mirroring, where N=number-of-devices, usage of so-called raid1 is out of kilter with the original definition, not btrfs' very specific two-way-mirroring, regardless of the number of devices, which is actually very close to the original definition of two devices per group, many such groups in an array. Tho I'll certainly agree that in today's usage, RAID-1 certainly /incorporates/ the N-way-mirroring usage, and would even agree that, within my rather limited exposure at least, it's the more common usage. But that doesn't make it the original usage, nor does it mean that there's no room in today's broader definition for the original usage, which then must remain as valid as the broader usage, today. So other than the per-chunk scope, btrfs raid1 would indeed seem to be real RAID-1. Never-the-less, given the broader usage today, there's definitely a need for some word of explanation in the mkfs.btrfs and btrfs-balance manpages.
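The chunk-level two-copy behavior just described is also what makes btrfs raid1 capacity differ from N-way mirroring. As a rough illustration (a simplified model I'm sketching here, ignoring chunk granularity and metadata overhead; this is not btrfs's actual allocator code):

```python
def btrfs_raid1_usable(device_sizes):
    # btrfs raid1 writes each chunk as exactly two copies, on two
    # different devices, regardless of how many devices exist.
    # Usable space is therefore about half the total, capped by how
    # much of the largest device the remaining devices can mirror.
    total = sum(device_sizes)
    largest = max(device_sizes)
    if largest >= total - largest:
        # the largest device can only be mirrored up to the combined
        # size of all the other devices
        return total - largest
    return total // 2

# Three equal 1000 GiB devices: ~1500 GiB usable, not 1000 GiB as
# N-way (three-copy) mirroring would give.
print(btrfs_raid1_usable([1000, 1000, 1000]))  # -> 1500
```

Under an N-devices = N-way-mirrors reading of RAID-1, the same three devices would yield only 1000 GiB, which is exactly the distinction a manpage note would need to spell out.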
I'll agree there, but then I never disagreed with that in the first place, and indeed, that was my opinion from when I myself thought pair-mirroring wasn't proper raid1 -- that much hasn't changed. Meanwhile, I've actually quoted about 50% of the original paper's raid1 discussion in the above. The Level 1 RAID discussion is actually quite short, under a double-spaced page in the original paper, which itself is only 26 pdf pages long, including two pages of title and blank page at the beginning (thus the pdf page numbering being two pages higher than the paper's page numbering), and two plus pages of acknowledgments, references and appendix at the end, so only 22 pages of well spaced actual content. Those who haven't clicked thru to actually read it may be interested in doing so. Here it is again for convenience. =:^) http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf > I wouldn't know any single case of a HW RAID controller (and we've had > quite a few of them here at the Tier2) or other software implementation > where RAID1 had another meaning than "N disks, N mirrors". That may be. I'm sure you have more experience with it than I do. But that doesn't change the original definition, or mean that usage consistent with that original definition is incorrect, even if uncommon today. >> Even were that not the case, point #1, btrfs' very specific use of >> *lowercase* raid1, still covers the two-way-limitation case just as >> well as it covers the chunk-level case. > Hmm, that still wouldn't change anything, IMHO... saying "lower case RAID is > something different than upper case RAID" would be just a bit ... uhm... > weird. > > Actually, btrfs doing it at the chunk level (while RAID is at > the device level) rather proves my point that "raid" or "RAID" or any > other lower/upper case combination shouldn't be used at all. I don't actually disagree with you there. Weird it is, agreed.
But it's also the case, at least currently, and based on what Hugo said about a patch to change the terminology being in limbo for two years, during which the currently used terminology has become even more entrenched as btrfs is widely deployed in distro installations now (even if it isn't entirely stable yet), that it's unlikely to change. The best that could be done at this point is make raid1 an alias for something else, but even then, I'd guess the raid1 terminology would continue pretty much unabated, since it's already widely used and well entrenched in the various google engines as well as the archives for this list. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2015-12-28 9:12 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-12-21 1:26 defrag vs autodefrag Donald Pearson 2015-12-21 3:22 ` Duncan 2015-12-21 8:14 ` Hugo Mills 2015-12-21 9:28 ` Filipe Manana 2015-12-22 20:16 ` Christoph Anton Mitterer 2015-12-22 20:30 ` Hugo Mills 2015-12-23 2:16 ` Duncan 2015-12-27 3:03 ` [PATCH] improve documentation of snapshot unaware defrag Christoph Anton Mitterer 2015-12-27 3:10 ` Christoph Anton Mitterer 2015-12-27 7:09 ` Duncan 2015-12-28 0:50 ` Christoph Anton Mitterer 2015-12-28 1:58 ` Hugo Mills 2015-12-28 2:07 ` Christoph Anton Mitterer 2015-12-28 9:12 ` Duncan 2015-12-28 2:51 ` Duncan 2015-12-28 3:03 ` Christoph Anton Mitterer 2015-12-28 6:12 ` Duncan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).