* device balance times
@ 2014-10-21 18:59 Tomasz Chmielewski
2014-10-21 20:14 ` Piotr Pawłow
0 siblings, 1 reply; 35+ messages in thread
From: Tomasz Chmielewski @ 2014-10-21 18:59 UTC (permalink / raw)
To: linux-btrfs
FYI - after a failed disk and replacing it I've run a balance; it took
almost 3 weeks to complete, for 120 GBs of data:
# time btrfs balance start -v /home
Dumping filters: flags 0x7, state 0x0, force is off
DATA (flags 0x0): balancing
METADATA (flags 0x0): balancing
SYSTEM (flags 0x0): balancing
Done, had to relocate 124 out of 124 chunks
real 30131m52.873s
user 0m0.000s
sys 74m59.180s
Kernel is 3.17.0-rc7.
Filesystem is not that big, merely 124 GB used out of 1.8 TB:
/dev/sdb4       1.8T  124G  1.6T   8% /home
# btrfs fi df /home
Data, RAID1: total=121.00GiB, used=117.56GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=8.00GiB, used=4.99GiB
GlobalReserve, single: total=512.00MiB, used=0.00
# btrfs fi show /home
Label: none uuid: 84d087aa-3a32-46da-844f-a233237cf04f
Total devices 2 FS bytes used 122.56GiB
devid 2 size 1.71TiB used 129.03GiB path /dev/sdb4
devid 3 size 1.71TiB used 129.03GiB path /dev/sda4
The only special thing about this filesystem is that there are ~250
snapshots there:
# btrfs sub list /home|wc -l
253
It's using compression:
/dev/sdb4 on /home type btrfs (rw,noatime,compress=lzo,space_cache)
Other than taking occasional backups from remote, the server is idle.
# hdparm -t /dev/sda /dev/sdb
/dev/sda:
Timing buffered disk reads: 394 MB in 3.01 seconds = 131.03 MB/sec
/dev/sdb:
Timing buffered disk reads: 402 MB in 3.00 seconds = 133.86 MB/sec
How long does the balance take for others with "many" snapshots?
--
Tomasz Chmielewski
http://www.sslrack.com
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-21 18:59 device balance times Tomasz Chmielewski
@ 2014-10-21 20:14 ` Piotr Pawłow
2014-10-21 20:44 ` Arnaud Kapp
2014-10-22 1:43 ` Chris Murphy
0 siblings, 2 replies; 35+ messages in thread
From: Piotr Pawłow @ 2014-10-21 20:14 UTC (permalink / raw)
To: Tomasz Chmielewski, linux-btrfs
On 21.10.2014 20:59, Tomasz Chmielewski wrote:
> FYI - after a failed disk and replacing it I've run a balance; it took
> almost 3 weeks to complete, for 120 GBs of data:
Looks normal to me. Last time I started a balance after adding 6th
device to my FS, it took 4 days to move 25GBs of data. Some chunks took
20 hours to move. I currently have 156 snapshots on this FS (nightly
rsync backups).
I think it is so slow, because it's disassembling chunks piece by piece
and stuffing these pieces elsewhere, instead of moving chunks as a
whole. If you have a lot of little pieces (as I do), it will take a while...
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-21 20:14 ` Piotr Pawłow
@ 2014-10-21 20:44 ` Arnaud Kapp
2014-10-22 1:10 ` 5 _thousand_ snapshots? even 160? (was: device balance times) Robert White
2014-10-22 11:22 ` device balance times Austin S Hemmelgarn
2014-10-22 1:43 ` Chris Murphy
1 sibling, 2 replies; 35+ messages in thread
From: Arnaud Kapp @ 2014-10-21 20:44 UTC (permalink / raw)
To: linux-btrfs
Hello,
I would like to ask if the balance time is related to the number of
snapshot or if this is related only to data (or both).
I currently have about 4TB of data and around 5k snapshots. I'm thinking
of going raid1 instead of single. From the numbers I see this seems
totally impossible as it would take *way* too long.
Would destroying snapshots (they are hourly snapshots meant to guard
against stupid errors, like `rm my_important_file`) help?
Should I reconsider moving to raid1 because of the time it would take?
Sorry if I'm somehow hijacking this thread, but it seemed related :)
Thanks,
On 10/21/2014 10:14 PM, Piotr Pawłow wrote:
> On 21.10.2014 20:59, Tomasz Chmielewski wrote:
>> FYI - after a failed disk and replacing it I've run a balance; it took
>> almost 3 weeks to complete, for 120 GBs of data:
>
> Looks normal to me. Last time I started a balance after adding 6th
> device to my FS, it took 4 days to move 25GBs of data. Some chunks took
> 20 hours to move. I currently have 156 snapshots on this FS (nightly
> rsync backups).
>
> I think it is so slow, because it's disassembling chunks piece by piece
> and stuffing these pieces elsewhere, instead of moving chunks as a
> whole. If you have a lot of little pieces (as I do), it will take a
> while...
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-21 20:44 ` Arnaud Kapp
@ 2014-10-22 1:10 ` Robert White
2014-10-22 4:02 ` Zygo Blaxell
` (3 more replies)
2014-10-22 11:22 ` device balance times Austin S Hemmelgarn
1 sibling, 4 replies; 35+ messages in thread
From: Robert White @ 2014-10-22 1:10 UTC (permalink / raw)
To: Arnaud Kapp, linux-btrfs
That's an unmanageably large and probably pointless number of snapshots,
guys.
I mean 150 is a heck of a lot, and 5000 is almost unfathomable in terms
of possible usefulness.
Snapshots are cheap but they aren't free.
Each snapshot is effectively stapling down one version of your entire
metadata tree, right? So imagine leaving tape spikes (little marks on
the floor to keep track of where something is so you can put it back)
for the last 150 or 5000 positions of the chair you are sitting in. At
some point the clarity and purpose of those marks becomes the opposite
of useful.
Hourly for a day, daily for a week, weekly for a month, monthly for a
year. And it's not a "backup" if you haven't moved it to another device.
If you have 5k snapshots of a file that didn't change, you are still
just one bad disk sector away from never having that data again because
there's only one copy of the actual data stapled down in all of those
snapshots.
The ability to avoid fragmentation and cruft is diminished by excessive
snapshots on a live media.
Go get a backup drive or whatever. Snapshot your live media, send the
snapshot to that backup. If you want to hoard them, hoard them on the
backup drive.
There is an old saying. If you haven't run the restore operation your
backup scheme is untested. Have you _really_ considered how you would go
about scavenging through 5k of snapshots? Have you really done the
exercise-of-consideration about what you are safeguarding by having 156
or more paths to the same single disk sector?
More than four snapshots on the live disk and you are playing with it
(ha ha).
Excessive snapshotting _will_ complicate many operations because you are
permuting the choices the system has to consider and you are leaving
allocated the ghosts of long dead files (like old logs in /var/log and
umpteen copies of your browser cookies and history, and copies of the
window layout from the last several hundred times you logged out of your
desktop).
I don't think balance will _ever_ move the contents of a read only
snapshot. I could be wrong. I think you just end up with an endlessly
fragmented storage space and balance has to take each chunk and search
for someplace else it might better fit. Which explains why it took so long.
And just _forget_ single-extent large files at that point.
(Of course I could be wrong about the "never move" rule, but that would
just make the checksums on the potentially hundreds or thousands of
references need to be recalculated after a move, which would make
incremental send/receive unfathomable.)
On 10/21/2014 01:44 PM, Arnaud Kapp wrote:
> Hello,
>
> I would like to ask if the balance time is related to the number of
> snapshot or if this is related only to data (or both).
>
> I currently have about 4TB of data and around 5k snapshots. I'm thinking
> of going raid1 instead of single. From the numbers I see this seems
> totally impossible as it would take *way* too long.
>
> Would destroying snapshots (those are hourly snapshots to prevent stupid
> error to happens, like `rm my_important_file`) help?
>
> Should I reconsider moving to raid1 because of the time it would take?
>
> Sorry if I'm somehow hijacking this thread, but it seemed related :)
>
> Thanks,
>
> On 10/21/2014 10:14 PM, Piotr Pawłow wrote:
>> On 21.10.2014 20:59, Tomasz Chmielewski wrote:
>>> FYI - after a failed disk and replacing it I've run a balance; it took
>>> almost 3 weeks to complete, for 120 GBs of data:
>>
>> Looks normal to me. Last time I started a balance after adding 6th
>> device to my FS, it took 4 days to move 25GBs of data. Some chunks took
>> 20 hours to move. I currently have 156 snapshots on this FS (nightly
>> rsync backups).
>>
>> I think it is so slow, because it's disassembling chunks piece by piece
>> and stuffing these pieces elsewhere, instead of moving chunks as a
>> whole. If you have a lot of little pieces (as I do), it will take a
>> while...
>>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-21 20:14 ` Piotr Pawłow
2014-10-21 20:44 ` Arnaud Kapp
@ 2014-10-22 1:43 ` Chris Murphy
2014-10-22 12:40 ` Piotr Pawłow
2014-10-22 16:15 ` Chris Murphy
1 sibling, 2 replies; 35+ messages in thread
From: Chris Murphy @ 2014-10-22 1:43 UTC (permalink / raw)
To: linux-btrfs
On Oct 21, 2014, at 4:14 PM, Piotr Pawłow <pp@siedziba.pl> wrote:
> On 21.10.2014 20:59, Tomasz Chmielewski wrote:
>> FYI - after a failed disk and replacing it I've run a balance; it took almost 3 weeks to complete, for 120 GBs of data:
>
> Looks normal to me. Last time I started a balance after adding 6th device to my FS, it took 4 days to move 25GBs of data.
It's long term untenable. At some point it must be fixed. It's way, way slower than md raid.
> I think it is so slow, because it's disassembling chunks piece by piece and stuffing these pieces elsewhere, instead of moving chunks as a whole. If you have a lot of little pieces (as I do), it will take a while…
At a certain point it needs to fall back to block-level copying, with a ~32KB block. It can't be treating things as if they're 1K files, doing file-level copying that takes forever. It's just too risky that another device fails in the meantime.
Chris Murphy
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-22 1:10 ` 5 _thousand_ snapshots? even 160? (was: device balance times) Robert White
@ 2014-10-22 4:02 ` Zygo Blaxell
2014-10-22 4:05 ` Duncan
` (2 subsequent siblings)
3 siblings, 0 replies; 35+ messages in thread
From: Zygo Blaxell @ 2014-10-22 4:02 UTC (permalink / raw)
To: Robert White; +Cc: Arnaud Kapp, linux-btrfs
On Tue, Oct 21, 2014 at 06:10:27PM -0700, Robert White wrote:
> That's an unmanageably large and probably pointless number of
> snapshots guys.
>
> I mean 150 is a heck of a lot, and 5000 is almost unfathomable in
> terms of possible usefulness.
>
> Snapshots are cheap but they aren't free.
This could be better documented. Most of the documentation about btrfs
snapshots frankly just says how to make them and how awesome they are.
In small-scale tests they seem cheap--a little more expensive than
mkdir--but on larger scales they can become nightmares.
Things I learned the hard way about ~200 snapshots on a half-TB
filesystem:
btrfs balance cancel/pause take a long time. Like 24+ hours. Not the
entire balance operation--just one block group. Reboot and mount with
skip_balance is a crude but effective way out.
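A rough sketch of that escape hatch, with the device and mount point
borrowed from earlier in this thread (adjust to your own layout):

  # mount with balance resumption disabled, then clear the paused balance
  mount -o skip_balance /dev/sdb4 /home
  btrfs balance cancel /home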
Making new snapshots is cheap. Deleting them is not. btrfs-cleaner
will systematically disassemble each snapshot, sequentially, one extent
(or some similarly tiny fragment) at a time, for more than a week.
fsync() will take several seconds on that filesystem for the entire
time btrfs-cleaner is running. There is no way to pause or stop the
btrfs-cleaner kernel thread and as far as I can tell it's more or less
immune to attempts to limit the amount of I/O bandwidth it demands with
cgroup blkio.weight or ionice. Rebooting won't even stop it--it will
just resume the next time the filesystem is mounted. mkfs and restore
the filesystem from a backup of the original subvolume--it's an order
of magnitude faster even with rsync.
2 or 3 are good numbers for snapshots. 0 or 1 are even better.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-22 1:10 ` 5 _thousand_ snapshots? even 160? (was: device balance times) Robert White
2014-10-22 4:02 ` Zygo Blaxell
@ 2014-10-22 4:05 ` Duncan
2014-10-23 20:38 ` 5 _thousand_ snapshots? even 160? Arnaud Kapp
2014-10-22 11:30 ` Austin S Hemmelgarn
2014-10-22 17:32 ` Goffredo Baroncelli
3 siblings, 1 reply; 35+ messages in thread
From: Duncan @ 2014-10-22 4:05 UTC (permalink / raw)
To: linux-btrfs
Robert White posted on Tue, 21 Oct 2014 18:10:27 -0700 as excerpted:
> Each snapshot is effectively stapling down one version of your entire
> metadata tree, right? So imagine leaving tape spikes (little marks on
> the floor to keep track of where something is so you can put it back)
> for the last 150 or 5000 positions of the chair you are sitting in. At
> some point the clarity and purpose of those marks becomes the opposite
> of useful.
>
> Hourly for a day, daily for a week, weekly for a month, monthly for a
> year. And it's not a "backup" if you haven't moved it to another device.
> If you have 5k snapshots of a file that didn't change, you are still
> just one bad disk sector away from never having that data again because
> there's only one copy of the actual data stapled down in all of those
> snapshots.
Exactly.
I explain the same thing in different words:
(Note: "You" in this post is variously used to indicate the parent
poster, and a "general you", including but not limited to the grandparent
poster inquiring about his 5000 hourly snapshots. As I'm not trying to
write a book or a term paper, I'll simply assume it's clear which "you"
I'm referring to in each case based on context...)
Say you are taking hourly snapshots of a file, and you mistakenly delete
it or need a copy from some time earlier.
If you figure that out a day later, yes, the hour the snapshot was taken
can make a big difference.
If you don't figure it out until a month later, then is it going to be
REALLY critical which HOUR you pick, or is simply picking one hour in the
correct day (or possibly half-day) going to be as good, knowing that if
you guess wrong you can always go back or forward another whole day?
And if it's a year later, is even the particular day going to matter, or
is going forward or backward a week or a month going to be good enough?
And say it *IS* a year later, and the actual hour *DOES* matter. A year
later, exactly how are you planning to remember the EXACT hour you need,
such that simply randomly picking just one out of the day or week is
going to make THAT big a difference?
As you said but adjusted slightly to even out the weeks vs months, hourly
for a day (or two), daily to complete the week (or two), weekly to
complete the quarter (13 weeks), and if desired, quarterly for a year or
two.
But as you also rightly pointed out, just as if it's not tested it's not
a backup, if it's not on an entirely separate device and filesystem, it's
not a backup.
And if you don't have real backups at least every quarter, why on earth
are you worrying about a year's worth of hourly snapshots? If disaster
strikes and the filesystem blows up, without a separate backup, they're
all gone, so why the trouble to keep them around in the first place?
And once you have that quarterly or whatever backup, then the advantage
of continuing to lock down those 90-day-stale copies of all those files
and metadata goes down dramatically, since if worst comes to worst, you
simply retrieve it from backup, but meanwhile, all that stale locked-down
data and metadata is eating up room and dramatically complicating the job
btrfs must do to manage it all!
Yes, there are use-cases and there are use-cases. But if you aren't
keeping at least quarterly backups, perhaps you better examine your
backup plan and see if it really DOES match your use-case, ESPECIALLY if
you're keeping thousands of snapshots around. And once you DO have those
quarterly or whatever backups, then do you REALLY need to keep around
even quarterly snapshots covering the SAME period?
But let's say you do:
48 hourly snapshots, thinned after that to...
12 daily snapshots (2 weeks = 14, minus the two days of hourly), thinned
after that to...
11 weekly snapshots (1 quarter = 13 weeks, minus the two weeks of daily),
thinned after that to...
7 quarterly snapshots (2 years = 8 quarters, minus the quarter of weekly).
48 + 12 + 11 + 7 = ...
78 snapshots, appropriately spaced by age, covering two full years.
I've even done the math for the extreme case of per-minute snapshots.
With reasonable thinning along the lines of the above, even per-minute
snapshots ends up well under 300 snapshots being reasonably managed at
any single time.
And keeping it under 300 snapshots really DOES help btrfs in terms of
management task time-scaling.
If you're doing hourly, as I said, 78, tho killing the quarterly
snapshots entirely because they're backed up reduces that to 71, but
let's just say, EASILY under 100.
Tho that is of course per subvolume. If you have multiple subvolumes on
the same filesystem, that can still end up being a thousand or two
snapshots per filesystem. But those are all groups of something under
300 (under 100 with hourly) highly connected to each other, with the
interweaving inside each of those groups being the real complexity in
terms of btrfs management.
But 5000 snapshots?
Why? Are you *TRYING* to test btrfs until it breaks, or TRYING to
demonstrate a balance taking an entire year?
Do a real backup (or more than one, using those snapshots) if you need
to, then thin the snapshots to something reasonable. As the above
example shows, if it's a single subvolume being snapshotted, with hourly
snapshots, 100 is /more/ than reasonable.
With some hard questions, keeping in mind the cost in extra maintenance
time for each additional snapshot, you might even find that minimum 6-
hour snapshots (four per day) instead of 1-hour snapshots (24 per day)
are fine. Or you might find that you only need to keep hourly snapshots
for 12 hours instead of the 48 I assumed above, and daily snapshots for a
week instead of the two I assumed above. Throwing in the nothing over a
quarter because it's backed up assumption as well, that's...
8 4x-daily snapshots (2 days)
5 daily snapshots (a week, minus the two days above)
12 weekly snapshots (a quarter, minus the week above, then it's backed up
to other storage)
8 + 5 + 12 = ...
25 snapshots total, 6-hours apart (four per day) at maximum frequency aka
minimum spacing, reasonably spaced by age to no more than a week apart,
with real backups taking over after a quarter.
Btrfs should be able to work thru that in something actually approaching
reasonable time, even if you /are/ dealing with 4 TB of data. =:^)
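A thinning policy like that is easy enough to script. A minimal sketch
of just the outermost rule (drop anything older than a quarter),
assuming snapshots live under /home/.snapshots and are named
home-YYYY-MM-DD_HH -- both pure assumptions, adjust to your own layout:

cutoff=$(date -d '90 days ago' +%Y-%m-%d)
for snap in /home/.snapshots/home-*; do
    day=${snap##*/home-}   # strip path and prefix, leaving YYYY-MM-DD_HH
    day=${day%_*}          # keep only the date part
    [ "$day" \< "$cutoff" ] && btrfs subvolume delete "$snap"
done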
Bonus hints:
Btrfs quotas significantly complicate management as well. If you really
need them, fine, but don't unnecessarily use them just because they are
there.
Look into defrag.
If you don't have any half-gig plus VMs or databases or similar "internal
rewrite pattern" files, consider the autodefrag mount option. Note that
if you haven't been using it and your files are highly fragmented, it can
slow things down at first, but a manual defrag, possibly a directory tree
at a time to split things up into reasonable size and timeframes, can
help.
If you are running large VMs or databases or other half-gig-plus sized
internal-rewrite-pattern files, the autodefrag mount option may not
perform well for you. There's other options for that, including separate
subvolumes, setting nocow on those files, and setting up a scheduled
defrag. That's out of scope for this post, so do your research. It has
certainly been discussed enough on-list.
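(As one concrete illustration of the nocow route -- a sketch only, the
path is just an example: set the attribute on the directory so that
newly created files inherit it.

chattr +C /var/lib/libvirt/images   # new files created here are nocow
lsattr -d /var/lib/libvirt/images   # the 'C' flag confirms it

Note that +C only takes full effect on empty or newly created files.)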
Meanwhile, do note that defrag is currently snapshot-aware-disabled, due
to scaling issues. IOW, if your files are highly fragmented as they may
well be if you haven't been regularly defragging them, expect the defrag
to eat a lot of space since it'll break the sharing with older snapshots
as anything that defrag moves will be unshared. However, if you've
reduced snapshots to the quarter-max before off-filesystem backup as
recommended above, a quarter from now all the undefragged snapshots will
be expired and off the system and you'll have reclaimed that extra space.
Meanwhile, your system should be /much/ easier to manage and will likely
be snappier in its response as well. =:^)
With all these points applied, balance performance should improve
dramatically. However, with 4 TB of data the sheer data size will remain
a factor. Even in the best case typical thruput on spinning rust won't
reach the ideal. 10 MiB/sec is a reasonable guide. 4 TiB/10 MiB/sec...
4*1024*1024 (MiB) / 10 MiB / sec = ...
nearly 420 thousand seconds ... / 60 sec/min = ...
7000 minutes ... / 60 min/hour = ...
nearly 120 hours or ...
a bit under 5 days.
So 4 TiB on spinning rust could reasonably take about 5 days to balance
even under quite good conditions. That's due to the simple mechanics of
head seek to read, head seek again to write, on spinning rust, and the
sheer size of 4 TB of data and metadata (tho with a bit of luck some of
that will disappear as you thin out those thousands of snapshots, and
it'll be more like 3 TB than 4, or possibly even down to 2 TiB, by the
time you actually do it).
IOW, it's not going to be instant, by any means.
But the good part of it is that you don't have to do it all at once. You
can use balance filters and balance start/pause/resume/cancel as
necessary, to do only a portion of it at a time, and restart the balance
using the convert,soft options so it doesn't redo already converted
chunks when you have time to let it run. As long as it completes at
least one chunk each run it'll make progress.
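Assuming the filesystem is mounted at /mnt, that workflow is roughly:

# 'soft' skips chunks already converted to the target profile
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt &
btrfs balance status /mnt   # check progress
btrfs balance pause /mnt    # stop after the current block group
btrfs balance resume /mnt   # pick up again later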
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
@ 2014-10-22 7:14 Tomasz Chmielewski
2014-10-22 7:41 ` Duncan
0 siblings, 1 reply; 35+ messages in thread
From: Tomasz Chmielewski @ 2014-10-22 7:14 UTC (permalink / raw)
To: linux-btrfs
> But 5000 snapshots?
>
> Why? Are you *TRYING* to test btrfs until it breaks, or TRYING to
> demonstrate a balance taking an entire year?
Remember a given btrfs filesystem is not necessarily a backup
destination for data from one source.
It can be, say, 30 or 60 daily snapshots, plus several monthly, for each
data source * number of data sources.
So while it probably will make a difference (5000 snapshots from one
source, vs 5000 snapshots made from many sources) for balance times, I
wouldn't call a large number of snapshots that unusual.
--
Tomasz Chmielewski
http://www.sslrack.com
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-22 7:14 5 _thousand_ snapshots? even 160? (was: device balance times) Tomasz Chmielewski
@ 2014-10-22 7:41 ` Duncan
2014-10-22 20:08 ` Zygo Blaxell
0 siblings, 1 reply; 35+ messages in thread
From: Duncan @ 2014-10-22 7:41 UTC (permalink / raw)
To: linux-btrfs
Tomasz Chmielewski posted on Wed, 22 Oct 2014 09:14:14 +0200 as excerpted:
> Remember a given btrfs filesystem is not necessarily a backup
> destination for data from one source.
>
> It can be, say, 30 or 60 daily snapshots, plus several monthly, for each
> data source * number of data sources.
>
> So while it probably will make a difference (5000 snapshots from one
> source, vs 5000 snapshots made from many sources) for balance times, I
> wouldn't call a large number of snapshots that unusual.
That's what this paragraph, just above the paragraph you quoted, was all
about:
>> Tho that is of course per subvolume. If you have multiple subvolumes
>> on the same filesystem, that can still end up being a thousand or two
>> snapshots per filesystem. But those are all groups of something under
>> 300 (under 100 with hourly) highly connected to each other, with the
>> interweaving inside each of those groups being the real complexity in
>> terms of btrfs management.
IOW, if you thin down the snapshots per subvolume to something reasonable
(under 300 for sure, preferably under 100), then depending on the number
of subvolumes you're snapshotting, you might have a thousand or two.
However, of those couple thousand, btrfs will only have to deal with the
under 300 and preferably well under a hundred in the same group, that are
snapshots of the same thing and thus related to each other, at any given
time. The other snapshots will be there but won't be adding to the
complexity near as much since they're of different subvolumes and aren't
logically interwoven together with the ones being considered at that
moment.
But even then, at say 250 snapshots per subvolume, 2000 snapshots is 8
independent subvolumes. That could happen. But 5000 snapshots? That'd
be 20 independent subvolumes, which is heading toward the extreme again.
Yes it could happen, but better if it does to cut down on the per-
subvolume snapshots further, to say the 25 per subvolume I mentioned, or
perhaps even further. 25 snapshots per subvolume with those same 20
subvolumes... 500 snapshots total instead of 5000. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-21 20:44 ` Arnaud Kapp
2014-10-22 1:10 ` 5 _thousand_ snapshots? even 160? (was: device balance times) Robert White
@ 2014-10-22 11:22 ` Austin S Hemmelgarn
1 sibling, 0 replies; 35+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-22 11:22 UTC (permalink / raw)
To: Arnaud Kapp, linux-btrfs
On 2014-10-21 16:44, Arnaud Kapp wrote:
> Hello,
>
> I would like to ask if the balance time is related to the number of
> snapshot or if this is related only to data (or both).
>
> I currently have about 4TB of data and around 5k snapshots. I'm thinking
> of going raid1 instead of single. From the numbers I see this seems
> totally impossible as it would take *way* too long.
>
> Would destroying snapshots (those are hourly snapshots to prevent stupid
> error to happens, like `rm my_important_file`) help?
>
> Should I reconsider moving to raid1 because of the time it would take?
>
> Sorry if I'm somehow hijacking this thread, but it seemed related :)
>
> Thanks,
>
The issue is the snapshots: I regularly do a full re-balance of my home
directory on my desktop, which is ~150GB on a BTRFS raid10 setup with
only 3 or 4 snapshots (I only do daily snapshots, because anything I need
finer granularity on I have under git), and that takes only about 2 or 3
hours depending on how many empty chunks I have.
I would remove the snapshots, and also start keeping fewer of them (5k
hourly snapshots is more than six months' worth of file versions), and
then run the balance. I would also suggest converting data by itself
first, and then converting metadata, as converting data chunks will
require re-writing large parts of the metadata.
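Roughly, and assuming the volume is mounted at /mnt (system chunks may
additionally need -sconvert with -f):

btrfs balance start -dconvert=raid1 /mnt   # data chunks first
btrfs balance start -mconvert=raid1 /mnt   # then metadata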
> On 10/21/2014 10:14 PM, Piotr Pawłow wrote:
>> On 21.10.2014 20:59, Tomasz Chmielewski wrote:
>>> FYI - after a failed disk and replacing it I've run a balance; it took
>>> almost 3 weeks to complete, for 120 GBs of data:
>>
>> Looks normal to me. Last time I started a balance after adding 6th
>> device to my FS, it took 4 days to move 25GBs of data. Some chunks took
>> 20 hours to move. I currently have 156 snapshots on this FS (nightly
>> rsync backups).
>>
>> I think it is so slow, because it's disassembling chunks piece by piece
>> and stuffing these pieces elsewhere, instead of moving chunks as a
>> whole. If you have a lot of little pieces (as I do), it will take a
>> while...
>>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160?
2014-10-22 1:10 ` 5 _thousand_ snapshots? even 160? (was: device balance times) Robert White
2014-10-22 4:02 ` Zygo Blaxell
2014-10-22 4:05 ` Duncan
@ 2014-10-22 11:30 ` Austin S Hemmelgarn
2014-10-22 17:32 ` Goffredo Baroncelli
3 siblings, 0 replies; 35+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-22 11:30 UTC (permalink / raw)
To: Robert White, Arnaud Kapp, linux-btrfs
On 2014-10-21 21:10, Robert White wrote:
>
> I don't think balance will _ever_ move the contents of a read only
> snapshot. I could be wrong. I think you just end up with an endlessly
> fragmented storage space and balance has to take each chunk and search
> for someplace else it might better fit. Which explains why it took so long.
>
> And just _forget_ single-extent large files at that point.
>
> (Of course I could be wrong about the "never move" rule, but that would
> just make the checksums on the potentially hundreds or thousands of
> references need to be recalculated after a move, which would make
> incremental send/receive unfathomable.)
>
Balance doesn't do anything different for snapshots from what it does
with regular data. I think you are confusing balance with
defragmentation, as that does (in theory) handle snapshots differently.
Balance just takes all of the blocks selected by the filters, and
sends them through the block allocator again, and then updates the
metadata to point to the new blocks. It can result in some
fragmentation, but usually only for files bigger than about 256M, and
even then it doesn't always cause fragmentation.
>
> On 10/21/2014 01:44 PM, Arnaud Kapp wrote:
>> Hello,
>>
>> I would like to ask if the balance time is related to the number of
>> snapshot or if this is related only to data (or both).
>>
>> I currently have about 4TB of data and around 5k snapshots. I'm thinking
>> of going raid1 instead of single. From the numbers I see this seems
>> totally impossible as it would take *way* too long.
>>
>> Would destroying snapshots (those are hourly snapshots to prevent stupid
>> error to happens, like `rm my_important_file`) help?
>>
>> Should I reconsider moving to raid1 because of the time it would take?
>>
>> Sorry if I'm somehow hijacking this thread, but it seemed related :)
>>
>> Thanks,
>>
>> On 10/21/2014 10:14 PM, Piotr Pawłow wrote:
>>> On 21.10.2014 20:59, Tomasz Chmielewski wrote:
>>>> FYI - after a failed disk and replacing it I've run a balance; it took
>>>> almost 3 weeks to complete, for 120 GBs of data:
>>>
>>> Looks normal to me. Last time I started a balance after adding 6th
>>> device to my FS, it took 4 days to move 25GBs of data. Some chunks took
>>> 20 hours to move. I currently have 156 snapshots on this FS (nightly
>>> rsync backups).
>>>
>>> I think it is so slow, because it's disassembling chunks piece by piece
>>> and stuffing these pieces elsewhere, instead of moving chunks as a
>>> whole. If you have a lot of little pieces (as I do), it will take a
>>> while...
>>>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-22 1:43 ` Chris Murphy
@ 2014-10-22 12:40 ` Piotr Pawłow
2014-10-22 16:59 ` Bob Marley
2014-10-23 9:19 ` Miao Xie
2014-10-22 16:15 ` Chris Murphy
1 sibling, 2 replies; 35+ messages in thread
From: Piotr Pawłow @ 2014-10-22 12:40 UTC (permalink / raw)
To: Chris Murphy, linux-btrfs
On 22.10.2014 03:43, Chris Murphy wrote:
> On Oct 21, 2014, at 4:14 PM, Piotr Pawłow<pp@siedziba.pl> wrote:
>> Looks normal to me. Last time I started a balance after adding 6th device to my FS, it took 4 days to move 25GBs of data.
> It's long term untenable. At some point it must be fixed. It's way, way slower than md raid.
> At a certain point it needs to fallback to block level copying, with a ~ 32KB block. It can't be treating things as if they're 1K files, doing file level copying that takes forever. It's just too risky that another device fails in the meantime.
There's "device replace" for restoring redundancy, which is fast, but
not implemented yet for RAID5/6.
I think the problem is that balance was originally used for balancing
data / metadata split - moving stuff out of mostly empty chunks to free
them and use for something else. It pretty much has to be done on the
extent level.
Then balance was repurposed for things like converting RAID profiles and
restoring redundancy and balancing device usage in multi-device
configurations. It works, but the approach to do it extent by extent is
slow.
I wonder if we could do some of these operations by just copying whole
chunks in bulk. Wasn't that the point of introducing logical addresses?
- to be able to move chunks around quickly without changing anything
except updating chunk pointers?
BTW: I'd love a simple interface to be able to select a chunk and tell
it to move somewhere else. I'd like to tell chunks with metadata, or
with tons of extents: Hey, chunks! Why don't you move to my SSDs? :)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-22 1:43 ` Chris Murphy
2014-10-22 12:40 ` Piotr Pawłow
@ 2014-10-22 16:15 ` Chris Murphy
2014-10-23 2:44 ` Duncan
1 sibling, 1 reply; 35+ messages in thread
From: Chris Murphy @ 2014-10-22 16:15 UTC (permalink / raw)
To: linux-btrfs
On Oct 21, 2014, at 9:43 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Oct 21, 2014, at 4:14 PM, Piotr Pawłow <pp@siedziba.pl> wrote:
>
>> On 21.10.2014 20:59, Tomasz Chmielewski wrote:
>>> FYI - after a failed disk and replacing it I've run a balance; it took almost 3 weeks to complete, for 120 GBs of data:
>>
>> Looks normal to me. Last time I started a balance after adding 6th device to my FS, it took 4 days to move 25GBs of data.
>
> It's long term untenable. At some point it must be fixed. It's way, way slower than md raid.
Granted I'm ignoring the fact there are 5000+ snapshots, as well as the fact we were told something like ~200 is OK but things rapidly get pathological above that. Josef has some patches to better deal with this, but I'm fairly sure they're not merged yet. My "long term untenable" assertion is in relation to scalability. Another thing is the snapshot aware defrag stuff was recently pulled out and needs to be redone, which may also be a factor with this. The short term, maybe even medium term, it's "doctor, it hurts when I do this!" and the doctor says, "well then don't do that!"
Chris Murphy
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-22 12:40 ` Piotr Pawłow
@ 2014-10-22 16:59 ` Bob Marley
2014-10-23 7:39 ` Russell Coker
2014-10-23 9:19 ` Miao Xie
1 sibling, 1 reply; 35+ messages in thread
From: Bob Marley @ 2014-10-22 16:59 UTC (permalink / raw)
To: linux-btrfs
On 22/10/2014 14:40, Piotr Pawłow wrote:
> On 22.10.2014 03:43, Chris Murphy wrote:
>> On Oct 21, 2014, at 4:14 PM, Piotr Pawłow<pp@siedziba.pl> wrote:
>>> Looks normal to me. Last time I started a balance after adding 6th
>>> device to my FS, it took 4 days to move 25GBs of data.
>> It's long term untenable. At some point it must be fixed. It's way,
>> way slower than md raid.
>> At a certain point it needs to fallback to block level copying, with
>> a ~ 32KB block. It can't be treating things as if they're 1K files,
>> doing file level copying that takes forever. It's just too risky that
>> another device fails in the meantime.
>
> There's "device replace" for restoring redundancy, which is fast, but
> not implemented yet for RAID5/6.
"Device replace" on raid 0,1,10 works if the device to be replaced is
still alive, otherwise the operation is as long as a rebalance and works
similarly (AFAIR).
Which is way too long in terms of the likelihood of another disk failing.
Additionally, it seeks like crazy during the operation, which also
greatly increases the likelihood of another disk failing.
Until this is fixed I am not confident in using btrfs on a production
system which requires RAID redundancy.
The operation needs to be streamlined: it should be as sequential as
possible (sort files according to their LBA before reading/writing),
with the fewest seeks on every disk, and with large buffers,
so that reads from the source disk(s) and writes to the replacement disk
go at platter speed or near it.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160?
2014-10-22 1:10 ` 5 _thousand_ snapshots? even 160? (was: device balance times) Robert White
` (2 preceding siblings ...)
2014-10-22 11:30 ` Austin S Hemmelgarn
@ 2014-10-22 17:32 ` Goffredo Baroncelli
3 siblings, 0 replies; 35+ messages in thread
From: Goffredo Baroncelli @ 2014-10-22 17:32 UTC (permalink / raw)
To: Robert White, Arnaud Kapp, linux-btrfs
On 10/22/2014 03:10 AM, Robert White wrote:
> Each snapshot is effectively stapling down one version of your
> entire metadata tree, right ?
To the best of my knowledge, I cannot confirm that.
My understanding (please feel free to correct me if I am wrong) is that each snapshot creates a copy of the changed leaf (of the (b)tree) and updates (by copying) all the nodes up to the tree root.
[...]
> (Of course I could be wrong about the "never move" rule, but that
> would just make the checksums on the potentially hundreds or
> thousands of references need to be recalculated after a move, which
> would make incremental send/receive unfathomable.)
Between the physical data on the disk and the logical data seen by the tree(s) there is an indirection layer: the chunk tree.
Nearly all the trees refer to data in terms of "logical" position. The logical position is translated to the physical one by the chunk tree.
Balance moves chunks between the disks, but the data itself is unchanged.
Anyway I fully agree with you when you say:
> Snapshots are cheap but they aren't free.
BR
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-22 7:41 ` Duncan
@ 2014-10-22 20:08 ` Zygo Blaxell
2014-10-22 20:37 ` Robert White
2014-10-23 4:30 ` Chris Murphy
0 siblings, 2 replies; 35+ messages in thread
From: Zygo Blaxell @ 2014-10-22 20:08 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Wed, Oct 22, 2014 at 07:41:32AM +0000, Duncan wrote:
> Tomasz Chmielewski posted on Wed, 22 Oct 2014 09:14:14 +0200 as excerpted:
> >> Tho that is of course per subvolume. If you have multiple subvolumes
> >> on the same filesystem, that can still end up being a thousand or two
> >> snapshots per filesystem. But those are all groups of something under
> >> 300 (under 100 with hourly) highly connected to each other, with the
> >> interweaving inside each of those groups being the real complexity in
> >> terms of btrfs management.
>
> IOW, if you thin down the snapshots per subvolume to something reasonable
> (under 300 for sure, preferably under 100), then depending on the number
> of subvolumes you're snapshotting, you might have a thousand or two.
> However, of those couple thousand, btrfs will only have to deal with the
> under 300 and preferably well under a hundred in the same group, that are
> snapshots of the same thing and thus related to each other, at any given
> time. The other snapshots will be there but won't be adding to the
> complexity near as much since they're of different subvolumes and aren't
> logically interwoven together with the ones being considered at that
> moment.
>
> But even then, at say 250 snapshots per subvolume, 2000 snapshots is 8
> independent subvolumes. That could happen. But 5000 snapshots? That'd
> be 20 independent subvolumes, which is heading toward the extreme again.
> Yes it could happen, but better if it does to cut down on the per-
> subvolume snapshots further, to say the 25 per subvolume I mentioned, or
> perhaps even further. 25 snapshots per subvolume with those same 20
> subvolumes... 500 snapshots total instead of 5000. =:^)
If you have one subvolume per user and 1000 user directories on a server,
it's only 5 snapshots per user (last hour, last day, last week, last
month, and last year). I hear this is a normal use case in the ZFS world.
It would certainly be attractive if there was working quota support.
I have datasets where I record 14000+ snapshots of filesystem directory
trees scraped from test machines and aggregated onto a single server
for deduplication...but I store each snapshot as a git commit, not as
a btrfs snapshot or even subvolume.
We do sometimes run queries like "in the last two years, how many times
did $CONDITION occur?" which will scan a handful files in all of the
snapshots. The use case itself isn't unreasonable, although using the
filesystem instead of a more domain-specific tool to achieve it may be.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-22 20:08 ` Zygo Blaxell
@ 2014-10-22 20:37 ` Robert White
2014-10-23 3:09 ` Zygo Blaxell
2014-10-23 4:30 ` Chris Murphy
1 sibling, 1 reply; 35+ messages in thread
From: Robert White @ 2014-10-22 20:37 UTC (permalink / raw)
To: Zygo Blaxell, Duncan; +Cc: linux-btrfs
On 10/22/2014 01:08 PM, Zygo Blaxell wrote:
> I have datasets where I record 14000+ snapshots of filesystem directory
> trees scraped from test machines and aggregated onto a single server
> for deduplication...but I store each snapshot as a git commit, not as
> a btrfs snapshot or even subvolume.
>
> We do sometimes run queries like "in the last two years, how many times
> did $CONDITION occur?" which will scan a handful files in all of the
> snapshots. The use case itself isn't unreasonable, although using the
> filesystem instead of a more domain-specific tool to achieve it may be.
>
Okay, sure. And as stated by others, there _are_ use cases that are
exceptional.
But such an archival system most likely does not _need_ to be balanced
etc with any frequency, or likely ever because it isn't experiencing
churn from dynamic use.
In the world of trade-offs, trade-offs happen.
The guy who cited the 5000 snapshots said they were hourly and taken
because he might remove an important file or something. This is _way_
more action than the feared condition.
ASIDE: While fixing someone's document archive RAID device (a Sun
hardware device the size of a fridge) back in 1997 or so I discovered
that they'd disabled _all_ the hardware cache features. When asked I was
told that "the procedure for replacing a failed drive required the cache
device to be cleared by pressing the red button" and they were afraid
that, should that day come, someone would forget to press that button...
so they'd turned off the feature entirely. This is a form of
unreasonable paranoia. They were afraid that someone in the future would
not follow the directions would be printed on both the machine and the
new drive (these were _not_ commodity parts).
When an over-abundance of caution passes beyond reasonable expectations,
performance will suffer. The system is immaterial, the rule holds.
What's worse is that it becomes very like "security theater", only it's "a
backup show" where no actual backing up is really happening in any
useful sense. And god save you picking which version of a file was the
last "good one".
So in your use case, your git repository of snapshots isn't actually
"live" on the production server you are archiving, right?
So too, it would be reasonable to btrfs send periodic snapshots to an
archive system, retain lots and lots of them, and expect reasonable
performance of your queries.
And you could expect reasonable performance in your maintenance.
But "reasonable performance" in the maintenance case is massively
different than reasonable performance in use cases. Indeed if you try to
balance multiple terabytes of data spread across thousands of snapshots
you'll be taking a lot of time. A _perfectly_ _reasonable_ lot of time
for the operation at hand.
But if you expect to be able to do maintenance (like btrfsck your
production box with its 5k snapshots) in just a few minutes when you've
got logarithmic-rate meta data to shuffle through... well good luck with
that.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-22 16:15 ` Chris Murphy
@ 2014-10-23 2:44 ` Duncan
0 siblings, 0 replies; 35+ messages in thread
From: Duncan @ 2014-10-23 2:44 UTC (permalink / raw)
To: linux-btrfs
Chris Murphy posted on Wed, 22 Oct 2014 12:15:25 -0400 as excerpted:
> Granted I'm ignoring the fact there are 5000+ snapshots[.]
> The short term, maybe even medium term, it's "doctor, it hurts
> when I do this!" and the doctor says, "well then don't do that!"
LOL! Nicely said! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-22 20:37 ` Robert White
@ 2014-10-23 3:09 ` Zygo Blaxell
0 siblings, 0 replies; 35+ messages in thread
From: Zygo Blaxell @ 2014-10-23 3:09 UTC (permalink / raw)
To: Robert White; +Cc: Duncan, linux-btrfs
On Wed, Oct 22, 2014 at 01:37:15PM -0700, Robert White wrote:
> On 10/22/2014 01:08 PM, Zygo Blaxell wrote:
> >I have datasets where I record 14000+ snapshots of filesystem directory
> >trees scraped from test machines and aggregated onto a single server
> >for deduplication...but I store each snapshot as a git commit, not as
> >a btrfs snapshot or even subvolume.
> >
> >We do sometimes run queries like "in the last two years, how many times
> >did $CONDITION occur?" which will scan a handful files in all of the
> >snapshots. The use case itself isn't unreasonable, although using the
> >filesystem instead of a more domain-specific tool to achieve it may be.
>
> Okay, sure. And as stated by others, there _are_ use cases that are
> exceptional.
>
> But such an archival system most likely does not _need_ to be
> balanced etc with any frequency, or likely ever because it isn't
> experiencing churn from dynamic use.
>
> In the world of trade-offs, trade-offs happen.
>
> The guy who cited the 5000 snapshots said they were hourly and taken
> because he might remove an important file or something. This is
> _way_ more action than the feared condition.
> When an over-abundance of caution passes beyond reasonable
> expectations, performance will suffer. The system is immaterial, the
> rule holds.
>
> What's worse is it becomes very like "security theater" only its "a
> backup show" where no actual backing up is really happening in any
> useful sense. And god save you picking which version of a file was
> the last "good one".
>
> So in your use case, your git repository of snapshots isn't actually
> "live" on the production server you are archiving, right?
I'm not sure if this question is directed at me or 5000-snapshots-guy.
5000-snapshots-guy was not using git. I'd say he's using btrfs snapshots
to do a job git is much better at.
My 14K-snapshot git repo is on my production server. The data in the repo
is aggregated to that server from many other smaller (tiny) machines.
Each smaller machine's data is replicated to the git working directory
and a git commit generated from it. After the commit is created,
we replace the working tree with the next machine's data and repeat.
The entire git repo is replicated by git fetch/push to a couple of other
servers, so if the main git repo goes down we just swap it with one of
the clone servers in a few minutes. No btrfs snapshots are involved.
One of the clone servers doesn't even have btrfs at all.
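Very roughly, and with a purely hypothetical host list and paths, the
aggregation loop looks something like:

while read host; do
    # replace the working tree with this machine's scraped tree
    rsync -a --delete "$host":/scraped/tree/ /srv/archive/worktree/
    git -C /srv/archive/worktree add -A
    git -C /srv/archive/worktree commit -q -m "snapshot of $host $(date -u +%FT%TZ)"
done < hosts.txt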
git's packing algorithm does a much more thorough job of compression and
deduplication than a mere filesystem can. We can store 3,000 current
snapshots of 15GB of raw data in 10GB of disk space, and another 11,000
older snapshots in 25GB. I don't think btrfs snapshots can do better than,
or even come close to, that kind of storage efficiency.
The point I was making was that btrfs snapshots aren't the best tool,
or even a reasonably good tool, for this particular use case (5000
snapshots of a single subvolume on one server).
> So too, it would be reasonable to btrfs send periodic snapshots to
> an archive system, retain lots and lots of them, and expect
> reasonable performance of your queries.
>
> And you cold expect reasonable performance in your maintenance.
>
> But "reasonable performance" in the maintenance case is massively
> different than reasonable performance in use cases. Indeed if you
> try to balance multiple terabytes of data spread across thousands of
> snapshots you'll be taking a lot of time. A _perfectly_ _reasonable_
> lot of time for the operation at hand.
...unless you are balancing because you're replacing a RAID1 disk.
That's a good reason to be wary of too many snapshots even if the data
is fairly static.
> But if you expect to be able to do maintenance (like btrfsck your
> production box with its 5k snapshots) in just a few minutes when
> you've got logarithmic-rate meta data to shuffle through... well
> good luck with that.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-22 20:08 ` Zygo Blaxell
2014-10-22 20:37 ` Robert White
@ 2014-10-23 4:30 ` Chris Murphy
2014-10-23 5:18 ` Robert White
1 sibling, 1 reply; 35+ messages in thread
From: Chris Murphy @ 2014-10-23 4:30 UTC (permalink / raw)
To: Btrfs BTRFS
On Oct 22, 2014, at 4:08 PM, Zygo Blaxell <zblaxell@furryterror.org> wrote:
>
> If you have one subvolume per user and 1000 user directories on a server,
> it's only 5 snapshots per user (last hour, last day, last week, last
> month, and last year).
Sure. So if Btrfs is meant to address scalability, then perhaps at the moment it's falling short. As it's easy to add large drives and get very large multiple device volumes, the snapshotting needs to scale also.
I'd say per user, it's reasonable to have 24 hourly (one snapshot per hour for a day), 7 daily, 4 weekly, and 12 monthly snapshots, or 47 snapshots. That's 47,000 snapshots if it's sane for a single Btrfs volume to host 1000 users. Arguably, such a system is better off with a distributed fs: Gluster FS or GFS2 or Ceph.
Chris Murphy
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-23 4:30 ` Chris Murphy
@ 2014-10-23 5:18 ` Robert White
2014-10-23 8:38 ` Duncan
2014-10-23 13:15 ` Zygo Blaxell
0 siblings, 2 replies; 35+ messages in thread
From: Robert White @ 2014-10-23 5:18 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS
On 10/22/2014 09:30 PM, Chris Murphy wrote:
> Sure. So if Btrfs is meant to address scalability, then perhaps at the moment it's falling short. As it's easy to add large drives and get very large multiple device volumes, the snapshotting needs to scale also.
>
> I'd say per user, it's reasonable to have 24 hourly (one snapshot per hour for a day), 7 daily, 4 weekly, and 12 monthly snapshots, or 47 snapshots. That's 47,000 snapshots if it's sane for a single Btrfs volume to host 1000 users. Arguably, such a system is better off with a distributed fs: Gluster FS or GFS2 or Ceph.
Is one subvolume per user a rational expectation? Is it even
particularly smart? Doable, sure, but as a best practice it doesn't
seem that useful because it multiplies the maintenance by the user base.
Presuming a linux standard base layout (which is very presumptive)
having the 47 snapshots of /home instead of the 47,000 snapshots of
/home/X(1000) is just as workable, if not more so. A reflink recursive
copy of /home/X(n) from /home_Backup_date/X(n) is only trivially longer
than resnapshotting the individual user.
Again this gets into the question not of what exercises well to create
the snapshot but what functions well during a restore.
People constantly create "backup solutions" without really looking at
the restore path.
I can't get anybody here to answer the question about "btrfs fi li -s /"
and setting/resetting the "snapshot" status of a subvolume. I've been
told "snapshots are subvolumes" which is fine, but since there _is_ a
classification mechanism things get all caca if you rely on the "-s" in
your scripting and then promote a snapshot back into prime activity.
(seriously compare the listing with and without -s, note its natural
affinity for classifying subvolumes, then imagine the horror of needing
to take /home_backup_date and make it /home.)
Similar problems obtain as soon as you consider the daunting task of
shuffling through 47,000 snapshots instead of just 47.
And if you set up each user on their own subvolume, what happens the first
time two users want to hard-link a file betwixt them?
Excessive segmentation of storage is an evil unto itself.
YMMV, of course.
An orthogonal example:
If you give someone six disks and tell them to make an encrypted raid6
via cryptsetup and mdadm, at least eight out of ten will encrypt the
drives and then raid the result. But it's _massively_ more efficient to
raid the drives and then encrypt the result. Why? Because writing a
block with the latter involves only one block being encrypted/decrypted.
The former, if the raid is fine, involves several encryptions/decryptions
and _many_ if the raid is degraded.
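A sketch of the better ordering, with device names purely illustrative:

mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]
cryptsetup luksFormat /dev/md0      # one encryption layer over the whole array
cryptsetup luksOpen /dev/md0 secure_md
mkfs.btrfs /dev/mapper/secure_md    # or whatever filesystem you prefer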
The above is a mental constraint, a mistake, that is all too common
because people expect encryption to be "better" the closer you get to the
spinning rust.
So too people expect that segmentation is somehow better if it most
closely matches the abstract groupings (like per user) but in practical
terms it is better matched to the modality, where, for instance, all
users are one kind of thing, while all data stores are another kind of
thing.
We were just talking about putting all your VMs and larger NOCOW files
into a separate subvolume/domain because of their radically different
write behaviors. That's a sterling reason to subdivide the storage. So is
/ vs. /var vs. /home as three different domains with radically different
update profiles.
So while the natural impulse is to give each user its own subvolume it's
not likely to be that great an idea in practice because... um... 47,000
snapshots dude, and so on.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-22 16:59 ` Bob Marley
@ 2014-10-23 7:39 ` Russell Coker
2014-10-23 8:49 ` Duncan
0 siblings, 1 reply; 35+ messages in thread
From: Russell Coker @ 2014-10-23 7:39 UTC (permalink / raw)
To: Bob Marley, linux-btrfs
Also, a device replace operation requires that the replacement be the same size (or larger), while a device add and remove allows the replacement to be merely large enough to contain all the data. Given the size variation between what different manufacturers call the same size of disk, this isn't uncommon - unless you just get a replacement of the next size up (which is a good option too).
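Roughly the two paths, with the device names and devid purely illustrative:

# replace: the target must be at least as large as the old device
btrfs replace start 2 /dev/sdc /home
# add + delete: the new device only needs to be big enough for the data
btrfs device add /dev/sdc /home
btrfs device delete missing /home   # or name the old device if it is still present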
On October 23, 2014 3:59:31 AM GMT+11:00, Bob Marley <bobmarley@shiftmail.org> wrote:
>On 22/10/2014 14:40, Piotr Pawłow wrote:
>> On 22.10.2014 03:43, Chris Murphy wrote:
>>> On Oct 21, 2014, at 4:14 PM, Piotr Pawłow<pp@siedziba.pl> wrote:
>>>> Looks normal to me. Last time I started a balance after adding 6th
>>>> device to my FS, it took 4 days to move 25GBs of data.
>>> It's long term untenable. At some point it must be fixed. It's way,
>>> way slower than md raid.
>>> At a certain point it needs to fallback to block level copying, with
>
>>> a ~ 32KB block. It can't be treating things as if they're 1K files,
>>> doing file level copying that takes forever. It's just too risky
>that
>>> another device fails in the meantime.
>>
>> There's "device replace" for restoring redundancy, which is fast, but
>
>> not implemented yet for RAID5/6.
>
>"Device replace" on raid 0,1,10 works if the device to be replaced is
>still alive, otherwise the operation is as long as a rebalance and
>works
>similarly (AFAIR).
>Which is way too long in terms of the likelihood of another disk
>failing.
>Additionally, it seeks like crazy during the operation, which also
>greatly increases the likelihood of another disk failing.
>
>Until this is fixed I am not confident in using btrfs on a production
>system which requires RAID redundancy.
>
>The operation needs to be streamlined: it should be as sequential as
>possible (sort files according to their LBA before reading/writing),
>with the fewest number of seeks on every disk, and with large buffers,
>so that reads from the source disk(s) and writes to the replacement
>disk
>goes at platter-speed or near there.
>
>
--
Sent from my Samsung Galaxy Note 3 with K-9 Mail.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-23 5:18 ` Robert White
@ 2014-10-23 8:38 ` Duncan
2014-10-23 13:15 ` Zygo Blaxell
1 sibling, 0 replies; 35+ messages in thread
From: Duncan @ 2014-10-23 8:38 UTC (permalink / raw)
To: linux-btrfs
Robert White posted on Wed, 22 Oct 2014 22:18:09 -0700 as excerpted:
> On 10/22/2014 09:30 PM, Chris Murphy wrote:
>> Sure. So if Btrfs is meant to address scalability, then perhaps at the
>> moment it's falling short. As it's easy to add large drives and get
>> very large multiple device volumes, the snapshotting needs to scale
>> also.
I believe it's a fair statement to say that many aspects of btrfs in
general simply don't scale well at this point. Many of the features are
there; we're now getting to the point where many of those features are
reasonably bug-free, altho that's definitely an ongoing thing; but pretty
much wherever you turn and whatever you look at, btrfs is in general not
yet optimized.
* The raid1 device-read-selection algorithm is simply even/odd-PID-
based. That's great for a first implementation, since it's simple enough
to implement and it works well enough to know that reading from either
copy works, but it's horrible for a final, scalable implementation, since
too many use-cases will be nearly all even or all odd pids.
* Btrfs multi-device-writes are nearly all serialized, one at a time,
instead of scheduling writes to all devices at once in order to
maximize bandwidth over the individual-device-speed bottleneck.
* Btrfs snapshot-aware-defrag was introduced to much fanfare, and then
disabled a couple kernel series later when it became very apparent it
simply didn't scale, and the lack of scaling meant it didn't work
/at/ /all/ for many users.
* The quota implementation was just recently pretty much entirely
rewritten due to serious corner-case breakage and lack of scaling (one of
the contributors to the defrag and balance scaling issues, as it happens).
* The autodefrag mount option doesn't scale well beyond files of a few
hundred MiB, or with frag-triggering file updates coming in faster than the entire
file can be rewritten (there's plans to make this better, but the time to
code and test simply hasn't been available yet).
* This thread is about the balance scaling issues, a good portion of
which boil down to extremely poorly optimized quota and snapshot
handling, and another portion comes down to having no choice but extent-based
operations, which are great for some things but don't work well when all
you want to do is duplicate chunks in a conversion to raid1 mode, for
instance.
That's what I think of off the top of my head. I'm sure there's more.
However, specifically addressing snapshotting, while optimizing for speed
and scale will certainly help, I'm not sure btrfs will ever be what might
be called a speed demon in the area.
If that is indeed the case, and I don't know but it's certainly possible,
then for the future, and regardless of the future, definitely for the
present, that means it's absolutely /critical/ for the human side to
keep things like the number of snapshots from growing out
of reasonable management range.
...
>> I'd say per user, it's reasonable to have 24 hourly (one snapshot per
>> hour for a day), 7 daily, 4 weekly, and 12 monthly snapshots, or 47
>> snapshots. That's 47,000 snapshots if it's sane for a single Btrfs
>> volume to host 1000 users. Arguably, such a system is better off with a
>> distributed fs: Gluster FS or GFS2 or Ceph.
>
> Is one subvolume per user a rational expectation? Is it even
> particularly smart? Doable, sure, but as a best practice it doesn't
> seem that useful because it multiplies the maintenance by the user base.
>
> Presuming a linux standard base layout (which is very presumptive)
> having the 47 snapshots of /home instead of the 47,000 snapshots of
> /home/X(1000) is just as workable, if not more so. A reflink recursive
> copy of /home/X(n) from /home_Backup_date/X(n) is only trivially longer
> than resnapshotting the individual user.
>
> Again this gets into the question not of what exercises well to create
> the snapshot but what functions well during a restore.
>
> People constantly create "backup solutions" without really looking at
> the restore path.
>
... Which is where this discussion comes in.
FWIW, over more than a decade of fine tuning and experience with a number
of disaster and recovery cases here, I've come up with what is for me a
reasonably close to ideal multiple partition layout. That's actually one
of the big reasons I don't use subvolumes here; I don't need to because I
already have a nearly perfect... for my use-case... independent
partitions layout.
Here's my point. Via trial and error I concluded almost the exact same
point that Chris is making about subvolumes, only for independent
partitions.
The discussed example is subvolumes for individual users in /home, vs.
one big subvolume for /home itself (or arguably, if there's a convenient
user-role-based separation, perhaps a subvolume for say teacher-home and
another for student-home, or for user-home, group-leader-home, upper-
management-home, etc).
The same lesson, however, applies to say all distro-update storage, in my
(gentoo) case, the main distro package ebuild tree, the overlays, binpkgs
for my 64-bit builds, binpkgs for my 32-bit builds, 64-bit ccache, 32-bit
ccache, the separate (mainstream git) kernel repo and individual build
dirs for 32-bit and 64-bit kernels, etc.
Back when I first split things up, most of those were in individual
partitions. Now I just have one partition with all those components in
it, in separate subdirs, and symlinks from the various other locations in
my layout to the various components in this partition.
Why? Because managing all of them separately was a pain, and I tended to
mount and unmount most of them together anyway, when I did system updates.
Similarly, in my original setup I had the traditional small /, with /etc,
but with an independent /usr and /var, and /var/log independent of /var.
I still have /var/log independent of the others since limiting a runaway
logging scenario to an independent log partition makes very good sense
and logging is its own logical task, but /usr is on / now, as is /var
itself, but with individual subdirs of, for instance, /var/lib symlinked
elsewhere.
Why? Because at one point I had an A/C failure, here in Phoenix in the
middle of the summer, when I was gone. I came home to a 50C+ house, a
heat-induced head-crashed disk and a frozen CPU.
Recovering from that disaster was a nightmare, because while I had
backups, I ended up with a root backup from one date, a /usr backup from
a different date, and a /var, including the record of what packages and
files were installed, from a third date. So my record (in /var/db) of
what was installed didn't match /usr, which didn't match /!
So everything that the distro actually installs, including the database
of what is actually installed, with one limited exception, is now all on
the same partition, root! If I have to recover from backup, whether I'm
recovering from yesterday's backup or one made a year ago, there's one
thing I'll be sure of, the database of what's installed will match what's
actually there, because it's on the same system root partition and thus
the same backup!
And that system partition is now mounted read-only by default. I only
mount it writable in ordered to update, either packages or system
configuration. When I'm done with the update I sudo systemctl emergency,
hit ctrl-d to get back to normal mode without logging into emergency mode
and log back into normal mode, systemctl daemon-reexec if systemd itself
was updated (thus restarting all programs including init/systemd itself,
so no stale and deleted libs are still in use), and remount ro /.
Which brings me to the limited /var exception I mentioned earlier.
While /var is supposed to be variable/writable, in practice, many system
processes only need to write to their /var subdir if they're updated.
These can stay on /var. The few others, the ones that really need a
writable /var subdir, have that subdir actually linked back to a
similarly named subdir in /home/var/, which is mounted writable by
default. That lets me keep /, including the rest of /var (with the
exception of /var/tmp, which is tmpfs), read-only by default.
As it happens, these writable-necessary /var/ subdirs aren't critical to
early boot, and if for some reason /home is unmountable (as it was along
with /var/log at one point recently, when the /home btrfs, writable at the
time of the crash, refused to mount, while /, being read-only at the crash,
wasn't harmed),
I can either do without temporarily, or reconstruct an empty or default
/var subdir for them in tmpfs or whatever.
One critical thing that makes this all work is the existence of symlinks
in various locations, pointing to the real locations on the partition
where logical function-grouping places them.
But all this simply reinforces the point. Logically function-group
subdirs on subvolumes much as I logically function-group subdirs on
independent partitions, and you won't be /dealing/ with 47K snapshots, 47
snapshots each of 1000 individual user subvolumes. You might be dealing
with 47 snapshots of /home, with 1000 users on it, or you might function-
group a bit further and have 47 snapshots each of mgmt-home, user-home,
and teamldr-home, so 141 snapshots total for all of homedirs, but that's
still reasonable with btrfs' current scaling, while 47K snapshots, forget
it!
And take it from me, it makes it FAR easier to deal with backup and
testing, with disaster recovery should it be necessary, and with other
sysadmin level maintenance as well.
Tho of course I'd personally argue that for reliability and
recoverability reasons, each of those function-groups should be an
independent partition, not just a subvolume, since should the filesystem go
down, it'll take all the subvolumes with it. But that's an entirely
different argument to be had... Regardless of whether it's subvolumes or
independent partitions and filesystems, however, the same point applies.
> I can't get anybody here to answer the question about "btrfs fi li -s /"
> and setting/resetting the "snapshot" status of a subvolume.
I wish I knew the answer. But as I said above, I don't do subvolumes, or
for that matter, snapshots, myself, preferring fully independent
partitions, and fully independent "snapshot" backups to same-size backup
partitions located elsewhere, so I can simply point the mount at the
backup and mount it in place of the previously working copy for recovery
should it be necessary. So I've not had to investigate that for personal
reasons, and while I've an academic interest as well as an interest in
knowing it simply to help others here, I've seen nobody else post a
satisfactory answer, so...
I share your frustration, tho at the academic and help-others level, not
the personal installation operations level.
> I've been
> told "snapshots are subvolumes" which is fine, but since there _is_ a
> classification mechanism things get all caca if you rely on the "-s" in
> your scripting and then promote a snapshot back into prime activity.
> (seriously compare the listing with and without -s, note its natural
> affinity for classifying subvolumes, then imagine the horror of needing
> to take /home_backup_date and make it /home.)
By all means explain to me how this won't work if so, but it seems to me
the following is a reasonably effective workaround that shouldn't take
/too/ much more time...
1) According to the wiki, cross-subvolume reflinks now (since 3.6) work.
See the explanatory text at:
https://btrfs.wiki.kernel.org/index.php/UseCases#Can_I_take_a_snapshot_of_a_directory.3F
Note that based on the above link, reflinks won't work if the subvolumes
are separately mounted, that is, across separate mount-points. However,
as long as it's a single common "parent" mount, with the subvolumes
simply accessed under it as if they were subdirs, reflink-copying should
"just work".
Based on that...
2) Mount a common parent (which might be the root subvolume) that contains
both the backup snapshot and the intended target subvolume, creating the new
target subvolume as necessary.
3) Reflink-copy recursively from the backup to the target, as if you were
traditionally backup-restoring from a backup mounted elsewhere, except
using the parent-subvolume mount paths so you don't cross mount-points,
and using reflink-copying to dramatically speed the process.
4) When you are done, you should have a non-snapshot subvolume restored
and ready for use, almost as if you were able to directly mount the
snapshot in place of the non-snapshot original, removing its snapshot
property in the process.
5) If desired, delete the backup snapshot, thus completing the parallel.
Alternatively, keep it where it is. After all, you needed to restore
from it once, what's to say something else won't happen to kill the
restored version, thus triggering the need for another restore? Surely
that'd be bad juju, but better to still have that backup snapshot on
hand, than to have just moved it to production, and then lost it too.
=:^)
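For concreteness, a minimal sketch of steps 2 and 3, assuming the top-level
subvolume is mounted at /mnt/top and that both the backup snapshot and the
new target live directly under it (device, paths and names here are
hypothetical):

  # mount the top-level subvolume so both trees sit under one mount point
  mount -o subvolid=5 /dev/sdX4 /mnt/top

  # create the new target subvolume
  btrfs subvolume create /mnt/top/home

  # reflink-copy everything out of the backup snapshot into the new subvolume
  cp -a --reflink=always /mnt/top/home_backup_date/. /mnt/top/home/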
> Similar problems obtain as soon as you consider the daunting task of
> shuffling through 47,000 snapshots instead of just 47.
>
> And if you setup each user on their own snapshot what happens the first
> time two users want to hard-link a file betwixt them?
See the above cross-subvol reflink discussion...
> Excessive segmentation of storage is an evil unto itself.
... But never-the-less, absolutely agreed. =:^)
> YMMV, of course.
>
> An orthogonal example:
>
> If you give someone six disks and tell them to make an encrypted raid6
> via cryptsetup and mdadm, at least eight out of ten will encrypt the
> drives and then raid the result. But it's _massively_ more efficient to
> raid the drives and then encrypt the result. Why? Because writing a
> block with the latter involves only one block being encrypted/decrypted.
> The former, if the raid is fine, involves several encryptions/decryptions
> and _many_ if the raid is degraded.
>
> The above is a mental constraint, a mistake, that is all too common
> because people expect encryption to be "better" the closer you get to the
> spinning rust.
This totally unexpected but useful jewel is part of why I'm addicted to
newsgroups and mailing lists. (FWIW, I do this list as a newsgroup via
gmane.org's list2news service.) Totally unexpected "orthogonal
examples", which can be immensely useful all on their own. =:^)
FWIW I haven't gotten much into encrypted storage here, but I keep
thinking about it, and were I to have done so before reading this, I
might have made exactly that mistake myself.
OTOH, with btrfs raid replacing mdraid, individually encrypted block
devices are (currently) necessary, because btrfs merges the filesystem
and raid levels. Tho direct btrfs encryption support is apparently
planned, and if/when that's implemented, one could expect they'll address
your point and internally layer the encryption over the raid. Tho I
expect that'd be a weaker encryption implementation if done that way,
because part of the advantage of btrfs raid is that the filesystem
structures work down thru the raid level as well, so individual chunk
structures appear at the device level below the raid. If encryption is
then built over the raid, that would mean the encryption would need to
pass the individual chunk structures thru so btrfs raid could still use
them, and that would be a critical information leak past the encryption.
So if btrfs does implement encryption, one would hope they'd either have
a config option for above or below the raid level, or that they'd do it
the less efficient multi-encryption way below the raid, thus not having
to pass that information thru the encryption to the raid, leaking it in
the process.
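(For reference, the current per-device stacking looks roughly like this; a
hedged sketch with hypothetical device names:)

  # one LUKS container per member device...
  cryptsetup luksFormat /dev/sda2
  cryptsetup luksFormat /dev/sdb2
  cryptsetup luksOpen /dev/sda2 crypt_a
  cryptsetup luksOpen /dev/sdb2 crypt_b

  # ...with btrfs raid1 built across the decrypted mappings
  mkfs.btrfs -m raid1 -d raid1 /dev/mapper/crypt_a /dev/mapper/crypt_b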
> So while the natural impulse is to give each user its own subvolume it's
> not likely to be that great an idea in practice because... um... 47,000
> snapshots dude, and so on.
Agreed. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-23 7:39 ` Russell Coker
@ 2014-10-23 8:49 ` Duncan
0 siblings, 0 replies; 35+ messages in thread
From: Duncan @ 2014-10-23 8:49 UTC (permalink / raw)
To: linux-btrfs
Russell Coker posted on Thu, 23 Oct 2014 18:39:52 +1100 as excerpted:
> Also a device replace operation requires that the replacement be the
> same size (or maybe larger). While a remove and replace allows the
> replacement to be merely large enough to contain all the data. Given the
> size variation in what might be called the same size disk by
> manufacturers this isn't uncommon - unless you just get a replacement of
> the next size up (which is a good option too).
Good argument for using user-configured partitions, even if you just use
one covering most of the physical device, instead of the manufacturer's
rather arbitrary, convenience-chosen raw physical device size.
In such a scenario if you deliberately undersize that single partition by
a few gigs, you can be reasonably sure that any physical device of the
same claimed size as the original physical device can handle it.
Plus that reserves room for GPT and dedicated EFI and/or reserved-BIOS
partitions too, since the BIOS partition might be a couple MiB and the
EFI partition a couple hundred MiB, well under a gig combined. =:^)
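A sketch of such a layout (device name and sizes are hypothetical; the main
partition deliberately stops a few GiB short of the end of the disk):

  parted -s /dev/sdX mklabel gpt
  parted -s /dev/sdX mkpart biosgrub 1MiB 3MiB
  parted -s /dev/sdX set 1 bios_grub on
  parted -s /dev/sdX mkpart esp fat32 3MiB 515MiB
  parted -s /dev/sdX set 2 boot on
  # leave roughly 4 GiB of slack at the end of the disk
  parted -s /dev/sdX mkpart main 515MiB -4GiB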
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-22 12:40 ` Piotr Pawłow
2014-10-22 16:59 ` Bob Marley
@ 2014-10-23 9:19 ` Miao Xie
2014-10-23 11:39 ` Austin S Hemmelgarn
1 sibling, 1 reply; 35+ messages in thread
From: Miao Xie @ 2014-10-23 9:19 UTC (permalink / raw)
To: Piotr Pawłow, Chris Murphy, linux-btrfs
On Wed, 22 Oct 2014 14:40:47 +0200, Piotr Pawłow wrote:
> On 22.10.2014 03:43, Chris Murphy wrote:
>> On Oct 21, 2014, at 4:14 PM, Piotr Pawłow<pp@siedziba.pl> wrote:
>>> Looks normal to me. Last time I started a balance after adding 6th device to my FS, it took 4 days to move 25GBs of data.
>> It's long term untenable. At some point it must be fixed. It's way, way slower than md raid.
>> At a certain point it needs to fallback to block level copying, with a ~ 32KB block. It can't be treating things as if they're 1K files, doing file level copying that takes forever. It's just too risky that another device fails in the meantime.
>
> There's "device replace" for restoring redundancy, which is fast, but not implemented yet for RAID5/6.
Now my colleague and I are implementing the scrub/replace for RAID5/6,
and I have a plan to reimplement the balance and split it off from the metadata/file data process. The main idea is:
- allocate a new chunk which has the same size as the relocated one, but don't insert it into the block group list, so we don't
allocate free space from it.
- set the source chunk to be read-only
- copy the data from the source chunk to the new chunk
- replace the extent map of the source chunk with the one of the new chunk (the new chunk has
the same logical address and length as the old one)
- release the source chunk
This way, we needn't deal with the data one extent at a time, and needn't do any space reservation,
so the speed will be very fast even if we have lots of snapshots.
Thanks
Miao
>
> I think the problem is that balance was originally used for balancing data / metadata split - moving stuff out of mostly empty chunks to free them and use for something else. It pretty much has to be done on the extent level.
>
> Then balance was repurposed for things like converting RAID profiles and restoring redundancy and balancing device usage in multi-device configurations. It works, but the approach to do it extent by extent is slow.
>
> I wonder if we could do some of these operations by just copying whole chunks in bulk. Wasn't that the point of introducing logical addresses? - to be able to move chunks around quickly without changing anything except updating chunk pointers?
>
> BTW: I'd love a simple interface to be able to select a chunk and tell it to move somewhere else. I'd like to tell chunks with metadata, or with tons of extents: Hey, chunks! Why don't you move to my SSDs? :)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-23 9:19 ` Miao Xie
@ 2014-10-23 11:39 ` Austin S Hemmelgarn
2014-10-24 1:05 ` Duncan
0 siblings, 1 reply; 35+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-23 11:39 UTC (permalink / raw)
To: miaox, Piotr Pawłow, Chris Murphy, linux-btrfs
On 2014-10-23 05:19, Miao Xie wrote:
> On Wed, 22 Oct 2014 14:40:47 +0200, Piotr Pawłow wrote:
>> On 22.10.2014 03:43, Chris Murphy wrote:
>>> On Oct 21, 2014, at 4:14 PM, Piotr Pawłow<pp@siedziba.pl> wrote:
>>>> Looks normal to me. Last time I started a balance after adding 6th device to my FS, it took 4 days to move 25GBs of data.
>>> It's long term untenable. At some point it must be fixed. It's way, way slower than md raid.
>>> At a certain point it needs to fallback to block level copying, with a ~ 32KB block. It can't be treating things as if they're 1K files, doing file level copying that takes forever. It's just too risky that another device fails in the meantime.
>>
>> There's "device replace" for restoring redundancy, which is fast, but not implemented yet for RAID5/6.
>
> Now my colleague and I is implementing the scrub/replace for RAID5/6
> and I have a plan to reimplement the balance and split it off from the metadata/file data process. the main idea is
> - allocate a new chunk which has the same size as the relocated one, but don't insert it into the block group list, so we don't
> allocate the free space from it.
> - set the source chunk to be Read-only
> - copy the data from the source chunk to the new chunk
> - replace the extent map of the source chunk with the one of the new chunk(The new chunk has
> the same logical address and the length as the old one)
> - release the source chunk
>
> By this way, we needn't deal the data one extent by one extent, and needn't do any space reservation,
> so the speed will be very fast even we have lots of snapshots.
>
Even if balance gets re-implemented this way, we should still provide
some way to consolidate the data from multiple partially full chunks.
Maybe keep the old balance path and have some option (maybe call it
aggressive?) that turns it on instead of the new code.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160? (was: device balance times)
2014-10-23 5:18 ` Robert White
2014-10-23 8:38 ` Duncan
@ 2014-10-23 13:15 ` Zygo Blaxell
1 sibling, 0 replies; 35+ messages in thread
From: Zygo Blaxell @ 2014-10-23 13:15 UTC (permalink / raw)
To: Robert White; +Cc: Chris Murphy, Btrfs BTRFS
On Wed, Oct 22, 2014 at 10:18:09PM -0700, Robert White wrote:
> On 10/22/2014 09:30 PM, Chris Murphy wrote:
> >Sure. So if Btrfs is meant to address scalability, then perhaps at the moment it's falling short. As it's easy to add large drives and get very large multiple device volumes, the snapshotting needs to scale also.
> >
> >I'd say per user, it's reasonable to have 24 hourly (one snapshot per hour for a day), 7 daily, 4 weekly, and 12 monthly snapshots, or 47 snapshots. That's 47,000 snapshots if it's sane for a single Btrfs volume to host 1000 users. Arguably, such a system is better off with a distributed fs: Gluster FS or GFS2 or Ceph.
>
> Is one subvolume per user a rational expectation? Is it even
> particularly smart? Doable, sure, but as a best practice it doesn't
> seem that useful because it multiplies the maintenance by the user
> base.
For snapshots alone it doesn't make much sense, but there are other
btrfs features that work in subvolume units. Some people want quota
and send/receive to work on a per-user level too.
If 'btrfs subvolume' had a '-r' recursive option, it would make management
easier. Even without -r, /home/* can be managed by a simple shell loop
with a wildcard:
makeSnapshot () {
    # $1 is the destination directory for this run's snapshots
    btrfs sub create "$1"
    for x in /home/*; do
        # one snapshot per user subvolume, named after the user
        btrfs sub snap "$x" "$1/$(basename "$x")"
    done
}
makeSnapshot "/home/.snapshots/$(date +%Y-%m-%d-%H-%M-%S)"
> Presuming a linux standard base layout (which is very presumptive)
> having the 47 snapshots of /home instead of the 47,000 snapshots of
> /home/X(1000) is just as workable, if not more so. A reflink
> recursive copy of /home/X(n) from /home_Backup_date/X(n) is only
> trivially longer than resnapshotting the individual user.
reflink copies are much slower than snapshots. For that matter, making
a writable snapshot of the entire /home as one subvolume, then using
'rm -rf' to get rid of what we don't need for one particular snapshot
is *also* faster than reflink copies.
More precisely, the bulk of the total life-cycle execution time is at the
beginning with reflink copies (have to create shared extent ref items,
traverse the source directories, and allocate new directory trees, all
while racing against data modifications) and at the end with snapshots
(btrfs-cleaner has to remove unreferenced tree nodes and extents in the
background) or snap-then-trim (replace btrfs-cleaner with rm -rf).
> Again this gets into the question not of what exercises well to
> create the snapshot but what functions well during a restore.
>
> People constantly create "backup solutions" without really looking
> at the restore path.
It's not all about backups.
> And if you setup each user on their own snapshot what happens the
> first time two users want to hard-link a file betwixt them?
One of the features of per-user subvolumes is that such things are
completely forbidden. Security issues, user confusion, and all that.
Deduplication by extent sharing (and reflink copy) doesn't care about
subvolumes as long as you do the clone through a common parent of the
user subvolumes (i.e. /home). The result isn't a hardlink which keeps
users happy, and shares underlying storage which keeps admins with
storage budget issues happy.
> Excessive segmentation of storage is an evil unto itself.
>
> YMMV, of course.
>
> An orthoginal example:
>
> If you give someone six disks and tell them to make an encrypted
> raid6 via cryptsetup and mdadm, at least eight out of ten will
> encrypt the drives and then raid the result. But it's _massively_
> more efficient to raid the drives and then encrypt the result. Why?
That seems...implausible. They would need to enter the passphrases six
times too.
> Because writing a block with the latter involves only one block
> being encrypted/decrypted. The former, if the raid is fine involves
> several encryptions/decryptions and _many_ if the raid is degraded.
It would be the correct answer if you needed to keep the structure of
the storage array secret...or if you wanted to use btrfs to implement
the RAID layer, and needed the encrypted layer to be divided along
the same boundaries as the physical layer.
> The above is a mental constraint, a mistake, that is all too common
> because people expect encryption to be "better" the closer you get to
> the spinning rust.
>
> So too people expect that segmentation is somehow better if it most
> closely matches the abstract groupings (like per user) but in
> practical terms it is better matched to the modality, where, for
> instance, all users are one kind of thing, while all data stores are
> another kind of thing.
>
> We were just talking about putting all your VMs and larger NOCOW
> files into a separate subvolume/domain because of their radically
> different write behaviors. That's a sterling reason to subdivide the
> storage. So is / vs. /var vs. /home as three different domains with
> radically different update profiles.
>
> So while the natural impulse is to give each user its own subvolume
> it's not likely to be that great an idea in practice because...
> um... 47,000 snapshots dude, and so on.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: 5 _thousand_ snapshots? even 160?
2014-10-22 4:05 ` Duncan
@ 2014-10-23 20:38 ` Arnaud Kapp
0 siblings, 0 replies; 35+ messages in thread
From: Arnaud Kapp @ 2014-10-23 20:38 UTC (permalink / raw)
To: linux-btrfs
Hello,
First, I'd like to thank you for this interesting discussion
and for pointing out efficient snapshotting strategies.
My 5k snapshots actually come from 4 subvolumes. I create 8 snapshots
per hour because I actually create both a read-only and a writable
snapshot for each of my volumes. Yeah, this may sound dumb, but this
setup was my first use of btrfs --> oh, some cool features, let's abuse
them!
The reason I did that is simple: w/o reading this mailing list, I would
have continued to think that snapshots were really that cheap (a la
git-branch). Turns out it's not the case (yet?).
I will now rethink my snapshotting plan thanks to you.
On 10/22/2014 06:05 AM, Duncan wrote:
> Robert White posted on Tue, 21 Oct 2014 18:10:27 -0700 as excerpted:
>
>> Each snapshot is effectively stapling down one version of your entire
>> metadata tree, right? So imagine leaving tape spikes (little marks on
>> the floor to keep track of where something is so you can put it back)
>> for the last 150 or 5000 positions of the chair you are sitting in. At
>> some point the clarity and purpose of those marks becomes the opposite
>> of useful.
>>
>> Hourly for a day, daily for a week, weekly for a month, monthly for a
>> year. And it's not a "backup" if you haven't moved it to another device.
>> If you have 5k snapshots of a file that didn't change, you are still
>> just one bad disk sector away from never having that data again because
>> there's only one copy of the actual data stapled down in all of those
>> snapshots.
>
> Exactly.
>
> I explain the same thing in different words:
>
> (Note: "You" in this post is variously used to indicate the parent
> poster, and a "general you", including but not limited to the grandparent
> poster inquiring about his 5000 hourly snapshots. As I'm not trying to
> write a book or a term paper I actively suppose it should be clear to
> which "you" I'm referring in each case based on context...)
>
> Say you are taking hourly snapshots of a file, and you mistakenly delete
> it or need a copy from some time earlier.
>
> If you figure that out a day later, yes, the hour the snapshot was taken
> can make a big difference.
>
> If you don't figure it out until a month later, then is it going to be
> REALLY critical which HOUR you pick, or is simply picking one hour in the
> correct day (or possibly half-day) going to be as good, knowing that if
> you guess wrong you can always go back or forward another whole day?
>
> And if it's a year later, is even the particular day going to matter, or
> will going forward or backward a week or a month going to be good enough?
>
> And say it *IS* a year later, and the actual hour *DOES* matter. A year
> later, exactly how are you planning to remember the EXACT hour you need,
> such that simply randomly picking just one out of the day or week is
> going to make THAT big a difference?
>
> As you said but adjusted slightly to even out the weeks vs months, hourly
> for a day (or two), daily to complete the week (or two), weekly to
> complete the quarter (13 weeks), and if desired, quarterly for a year or
> two.
>
> But as you also rightly pointed out, just as if it's not tested it's not
> a backup, if it's not on an entirely separate device and filesystem, it's
> not a backup.
>
> And if you don't have real backups at least every quarter, why on earth
> are you worrying about a year's worth of hourly snapshots? If disaster
> strikes and the filesystem blows up, without a separate backup, they're
> all gone, so why the trouble to keep them around in the first place?
>
> And once you have that quarterly or whatever backup, then the advantage
> of continuing to lock down those 90-day-stale copies of all those files
> and metadata goes down dramatically, since if worse comes to worse, you
> simply retrieve it from backup, but meanwhile, all that stale locked down
> data and metadata is eating up room and dramatically complicating the job
> btrfs must do to manage it all!
>
> Yes, there are use-cases and there are use-cases. But if you aren't
> keeping at least quarterly backups, perhaps you better examine your
> backup plan and see if it really DOES match your use-case, ESPECIALLY if
> you're keeping thousands of snapshots around. And once you DO have those
> quarterly or whatever backups, then do you REALLY need to keep around
> even quarterly snapshots covering the SAME period?
>
> But let's say you do:
>
> 48 hourly snapshots, thinned after that to...
>
> 12 daily snapshots (2 weeks = 14, minus the two days of hourly), thinned
> after that to...
>
> 11 weekly snapshots (1 quarter = 13 weeks, minus the two weeks of daily),
> thinned after that to...
>
> 7 quarterly snapshots (2 years = 8 quarters, minus the quarter of weekly).
>
> 48 + 12 + 11 + 7 = ...
>
> 78 snapshots, appropriately spaced by age, covering two full years.
>
> I've even done the math for the extreme case of per-minute snapshots.
> With reasonable thinning along the lines of the above, even per-minute
> snapshots ends up well under 300 snapshots being reasonably managed at
> any single time.
>
> And keeping it under 300 snapshots really DOES help btrfs in terms of
> management task time-scaling.
>
> If you're doing hourly, as I said, 78, tho killing the quarterly
> snapshots entirely because they're backed up reduces that to 71, but
> let's just say, EASILY under 100.
>
> Tho that is of course per subvolume. If you have multiple subvolumes on
> the same filesystem, that can still end up being a thousand or two
> snapshots per filesystem. But those are all groups of something under
> 300 (under 100 with hourly) highly connected to each other, with the
> interweaving inside each of those groups being the real complexity in
> terms of btrfs management.
>
> But 5000 snapshots?
>
> Why? Are you *TRYING* to test btrfs until it breaks, or TRYING to
> demonstrate a balance taking an entire year?
>
> Do a real backup (or more than one, using those snapshots) if you need
> to, then thin the snapshots to something reasonable. As the above
> example shows, if it's a single subvolume being snapshotted, with hourly
> snapshots, 100 is /more/ than reasonable.
>
> With some hard questions, keeping in mind the cost in extra maintenance
> time for each additional snapshot, you might even find that minimum 6-
> hour snapshots (four per day) instead of 1-hour snapshots (24 per day)
> are fine. Or you might find that you only need to keep hourly snapshots
> for 12 hours instead of the 48 I assumed above, and daily snapshots for a
> week instead of the two I assumed above. Throwing in the nothing over a
> quarter because it's backed up assumption as well, that's...
>
> 8 4x-daily snapshots (2 days)
>
> 5 daily snapshots (a week, minus the two days above)
>
> 12 weekly snapshots (a quarter, minus the week above, then it's backed up
> to other storage)
>
> 8 + 5 + 12 = ...
>
> 25 snapshots total, 6-hours apart (four per day) at maximum frequency aka
> minimum spacing, reasonably spaced by age to no more than a week apart,
> with real backups taking over after a quarter.
>
> Btrfs should be able to work thru that in something actually approaching
> reasonable time, even if you /are/ dealing with 4 TB of data. =:^)
>
> Bonus hints:
>
> Btrfs quotas significantly complicate management as well. If you really
> need them, fine, but don't unnecessarily use them just because they are
> there.
>
> Look into defrag.
>
> If you don't have any half-gig plus VMs or databases or similar "internal
> rewrite pattern" files, consider the autodefrag mount option. Note that
> if you haven't been using it and your files are highly fragmented, it can
> slow things down at first, but a manual defrag, possibly a directory tree
> at a time to split things up into reasonable size and timeframes, can
> help.
>
> If you are running large VMs or databases or other half-gig-plus sized
> internal-rewrite-pattern files, the autodefrag mount option may not
> perform well for you. There's other options for that, including separate
> subvolumes, setting nocow on those files, and setting up a scheduled
> defrag. That's out of scope for this post, so do your research. It has
> certainly been discussed enough on-list.
>
> Meanwhile, do note that defrag is currently snapshot-aware-disabled, due
> to scaling issues. IOW, if your files are highly fragmented as they may
> well be if you haven't been regularly defragging them, expect the defrag
> to eat a lot of space since it'll break the sharing with older snapshots
> as anything that defrag moves will be unshared. However, if you've
> reduced snapshots to the quarter-max before off-filesystem backup as
> recommended above, a quarter from now all the undefragged snapshots will
> be expired and off the system and you'll have reclaimed that extra space.
> Meanwhile, your system should be /much/ easier to manage and will likely
> be snappier in its response as well. =:^)
>
> With all these points applied, balance performance should improve
> dramatically. However, with 4 TB of data the shear data size will remain
> a factor. Even in the best case typical thruput on spinning rust won't
> reach the ideal. 10 MiB/sec is a reasonable guide. 4 TiB/10 MiB/sec...
>
> 4*1024*1024 (MiB) / 10 MiB / sec = ...
>
> nearly 420 thousand seconds ... / 60 sec/min = ...
>
> 7000 minutes ... / 60 min/hour = ...
>
> nearly 120 hours or ...
>
> a bit under 5 days.
>
>
> So 4 TiB on spinning rust could reasonably take about 5 days to balance
> even under quite good conditions. That's due to the simple mechanics of
> head seek to read, head seek again to write, on spinning rust, and the
> shear size of 4 TB of data and metadata (tho with a bit of luck some of
> that will disappear as you thin out those thousands of snapshots, and
> it'll be more like 3 TB than 4, or possibly even down to 2 TiB, by the
> time you actually do it).
>
> IOW, it's not going to be instant, by any means.
>
> But the good part of it is that you don't have to do it all at once. You
> can use balance filters and balance start/pause/resume/cancel as
> necessary, to do only a portion of it at a time, and restart the balance
> using the convert,soft options so it doesn't redo already converted
> chunks when you have time to let it run. As long as it completes at
> least one chunk each run it'll make progress.
>
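(For reference, a restartable conversion along the lines described above
might look like this; a hedged sketch with a hypothetical mount point. The
soft modifier skips chunks that already match the target profile, so an
interrupted conversion can be resumed without redoing finished chunks:)

  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt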
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-23 11:39 ` Austin S Hemmelgarn
@ 2014-10-24 1:05 ` Duncan
2014-10-24 2:35 ` Zygo Blaxell
0 siblings, 1 reply; 35+ messages in thread
From: Duncan @ 2014-10-24 1:05 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Thu, 23 Oct 2014 07:39:28 -0400 as
excerpted:
> On 2014-10-23 05:19, Miao Xie wrote:
>>
>> Now my colleague and I is implementing the scrub/replace for RAID5/6
>> and I have a plan to reimplement the balance and split it off from the
>> metadata/file data process. the main idea is
>> - allocate a new chunk which has the same size as the relocated one,
>> but don't insert it into the block group list, so we don't allocate
>> the free space from it.
>> - set the source chunk to be Read-only
>> - copy the data from the source chunk to the new chunk
>> - replace the extent map of the source chunk with the one of the new
>> chunk(The new chunk has the same logical address and the length as
>> the old one)
>> - release the source chunk
>>
>> By this way, we needn't deal the data one extent by one extent, and
>> needn't do any space reservation, so the speed will be very fast even
>> [if] we have lots of snapshots.
>>
> Even if balance gets re-implemented this way, we should still provide
> some way to consolidate the data from multiple partially full chunks.
> Maybe keep the old balance path and have some option (maybe call it
> aggressive?) that turns it on instead of the new code.
IMO:
* Keep normal default balance behavior as-is.
* Add two new options, --fast, and --aggressive.
* --aggressive behaves as today and is the normal default.
* --fast is the new chunk-by-chunk behavior. This becomes the default if
the convert filter is used, or if balance detects that it /is/ changing
the mode, thus converting or filling in missing chunk copies, even when
the convert filter was not specifically set. Thus, if there's only one
chunk copy (single or raid0 mode, or raid1/10 or dup with a missing/
invalid copy) and the balance would result in two copies, default to
--fast. Similarly, if it's raid1/10 and switching to single/raid0,
default to --fast. If no conversion is being done, keep the normal
--aggressive default.
* Users could always specify the behavior they want, overriding the
default, using the appropriate option.
* Of course defaults may result in some chunks being rebalanced in fast
mode, while others are rebalanced in aggressive mode, if for instance
it's a 3+ device raid1 mode filesystem with one device missing, since in
that case there'd be the usual two copies of some chunks and those would
default to aggressive, while there'd be one copy of chunks where the
other one was on the missing device. However, users could always specify
the desired behavior using the last point above, thus getting the same
behavior for the entire balance.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-24 1:05 ` Duncan
@ 2014-10-24 2:35 ` Zygo Blaxell
2014-10-24 5:13 ` Duncan
2014-10-24 10:58 ` Rich Freeman
0 siblings, 2 replies; 35+ messages in thread
From: Zygo Blaxell @ 2014-10-24 2:35 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Fri, Oct 24, 2014 at 01:05:39AM +0000, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 23 Oct 2014 07:39:28 -0400 as
> excerpted:
>
> > On 2014-10-23 05:19, Miao Xie wrote:
> >>
> >> Now my colleague and I is implementing the scrub/replace for RAID5/6
> >> and I have a plan to reimplement the balance and split it off from the
> >> metadata/file data process. the main idea is
> >> - allocate a new chunk which has the same size as the relocated one,
> >> but don't insert it into the block group list, so we don't allocate
> >> the free space from it.
> >> - set the source chunk to be Read-only
> >> - copy the data from the source chunk to the new chunk
> >> - replace the extent map of the source chunk with the one of the new
> >> chunk(The new chunk has the same logical address and the length as
> >> the old one)
> >> - release the source chunk
> >>
> >> By this way, we needn't deal the data one extent by one extent, and
> >> needn't do any space reservation, so the speed will be very fast even
> >> [if] we have lots of snapshots.
> >>
> > Even if balance gets re-implemented this way, we should still provide
> > some way to consolidate the data from multiple partially full chunks.
> > Maybe keep the old balance path and have some option (maybe call it
> > aggressive?) that turns it on instead of the new code.
>
> IMO:
>
> * Keep normal default balance behavior as-is.
>
> * Add two new options, --fast, and --aggressive.
>
> * --aggressive behaves as today and is the normal default.
>
> * --fast is the new chunk-by-chunk behavior. This becomes the default if
> the convert filter is used, or if balance detects that it /is/ changing
> the mode, thus converting or filling in missing chunk copies, even when
> the convert filter was not specifically set. Thus, if there's only one
> chunk copy (single or raid0 mode, or raid1/10 or dup with a missing/
> invalid copy) and the balance would result in two copies, default to
> --fast. Similarly, if it's raid1/10 and switching to single/raid0,
> default to --fast. If no conversion is being done, keep the normal
> --aggressive default.
My pet peeve: if balance is converting profiles from RAID1 to single,
the conversion should be *instantaneous* (or at least small_constant *
number_of_block_groups). Pick one mirror, keep all the chunks on that
mirror, delete all the corresponding chunks on the other mirror.
Sometimes when a RAID1 mirror dies we want to temporarily convert
the remaining disk to single data / DUP metadata while we wait for
a replacement. Right now if we try to do this, we discover:
- if the system reboots during the rebalance, btrfs now sees a
mix of single and RAID1 data profiles on the disk. The rebalance
takes a long time, and a hardware replacement has been ordered,
so the probability of this happening is pretty close to 1.0.
- one disk is missing, so there's a check in the mount code path
that counts missing disks like this:
- RAID1 profile: we can tolerate 1 missing disk so just
mount rw,degraded
- single profile: we can tolerate zero missing disks,
so we don't allow rw mounts even if degraded.
That filesystem is now permanently read-only (or at least it was in 3.14).
It's not even possible to add or replace disks any more since that
requires mounting the filesystem read-write.
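(The conversion being attempted is along these lines; a hedged sketch with a
hypothetical mount point:)

  # downgrade data to single and metadata to dup on the surviving disk
  btrfs balance start -dconvert=single -mconvert=dup /mnt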
> * Users could always specify the behavior they want, overriding the
> default, using the appropriate option.
>
> * Of course defaults may result in some chunks being rebalanced in fast
> mode, while others are rebalanced in aggressive mode, if for instance
> it's 3+ device raid1 mode filesystem with one device missing, since in
> that case there'd be the usual two copies of some chunks and those would
> default to aggressive, while there'd be one copy of chunks where the
> other one was on the missing device. However, users could always specify
> the desired behavior using the last point above, thus getting the same
> behavior for the entire balance.
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-24 2:35 ` Zygo Blaxell
@ 2014-10-24 5:13 ` Duncan
2014-10-24 15:18 ` Zygo Blaxell
2014-10-24 10:58 ` Rich Freeman
1 sibling, 1 reply; 35+ messages in thread
From: Duncan @ 2014-10-24 5:13 UTC (permalink / raw)
To: linux-btrfs
Zygo Blaxell posted on Thu, 23 Oct 2014 22:35:29 -0400 as excerpted:
> My pet peeve: if balance is converting profiles from RAID1 to single,
> the conversion should be *instantaneous* (or at least small_constant *
> number_of_block_groups). Pick one mirror, keep all the chunks on that
> mirror, delete all the corresponding chunks on the other mirror.
That would argue for either a third balance mode, --convert-only, or a
different tool, avoiding a rewrite of existing chunks entirely, simply
replicating them if adding redundancy or deleting a copy if reducing it,
as necessary.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-24 2:35 ` Zygo Blaxell
2014-10-24 5:13 ` Duncan
@ 2014-10-24 10:58 ` Rich Freeman
2014-10-24 16:07 ` Zygo Blaxell
1 sibling, 1 reply; 35+ messages in thread
From: Rich Freeman @ 2014-10-24 10:58 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Duncan, Btrfs BTRFS
On Thu, Oct 23, 2014 at 10:35 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> - single profile: we can tolerate zero missing disks,
> so we don't allow rw mounts even if degraded.
>
That seems like the wrong logic here. By all means mount read-only by
default for safety, but there should be a way to force a read-write
mount on any filesystem, precisely because the RAID modes can be mixed
and even if you lose two devices on a RAID1 system not ALL the data is
lost if you have more than two drives.
By all means return an error when reading a file that is completely
missing. By all means have an extra fsck mode that goes ahead and
deletes all the missing files (assuming it has metadata) or perhaps
moves them all to a new "lost+notfound" subvolume or something.
Indeed, if the lost device just happens to not actually contain any
data you might be lucky and not lose any data at all when losing a
single device in a filesystem that entirely uses the single profile.
That would be a bit of an edge case though, but one that is
automatically handled if you give the admin the ability to force
read-write/etc.
--
Rich
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-24 5:13 ` Duncan
@ 2014-10-24 15:18 ` Zygo Blaxell
0 siblings, 0 replies; 35+ messages in thread
From: Zygo Blaxell @ 2014-10-24 15:18 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Fri, Oct 24, 2014 at 05:13:27AM +0000, Duncan wrote:
> Zygo Blaxell posted on Thu, 23 Oct 2014 22:35:29 -0400 as excerpted:
>
> > My pet peeve: if balance is converting profiles from RAID1 to single,
> > the conversion should be *instantaneous* (or at least small_constant *
> > number_of_block_groups). Pick one mirror, keep all the chunks on that
> > mirror, delete all the corresponding chunks on the other mirror.
>
> That would argue for either a third balance mode, --convert-only, or a
> different tool, avoiding a rewrite of existing chunks entirely, simply
> replicating them if adding redundancy or deleting a copy if reducing it,
> as necessary.
Isn't that what soft does? [reading noises] OK, maybe not.
'soft' leaves a chunk alone if it already fits all the target profile
requirements; however, in this case the profile (and only the profile,
no data) is changing.
I think just two modes are sufficient: one that does everything the most
thorough way (throw scrub and defrag in there too, so we can do a single
pass over the filesystem that does all the maintenance tasks at once),
and one that takes advantage of every special-case shortcut available
to achieve specific goals in the shortest time.
I also think it's a little odd that conversion and balance are the
same tool. Traditional RAID conversions don't care about filesystem
layout, because they work on a completely separate layer (i.e. at the
block level). It's certainly possible to perform a RAID conversion
by reallocating all the filesystem-level objects, but just because
you can doesn't mean you should. ;)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-24 10:58 ` Rich Freeman
@ 2014-10-24 16:07 ` Zygo Blaxell
2014-10-24 19:58 ` Rich Freeman
0 siblings, 1 reply; 35+ messages in thread
From: Zygo Blaxell @ 2014-10-24 16:07 UTC (permalink / raw)
To: Rich Freeman; +Cc: Duncan, Btrfs BTRFS
On Fri, Oct 24, 2014 at 06:58:25AM -0400, Rich Freeman wrote:
> On Thu, Oct 23, 2014 at 10:35 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > - single profile: we can tolerate zero missing disks,
> > so we don't allow rw mounts even if degraded.
>
> That seems like the wrong logic here. By all means mount read-only by
> default for safety, but there should be a way to force a read-write
> mount on any filesystem, precisely because the RAID modes can be mixed
> and even if you lose two devices on a RAID1 system not ALL the data is
> lost if you have more than two drives.
I agree, but https://bugzilla.kernel.org/show_bug.cgi?id=60594 does not:
Stefan Behrens 2013-08-23 13:42:16 UTC
The way out is to mount read-only, copy the data aside and be
happy that no data was lost.
The #1 goal (IMO) is to avoid data loss. Therefore the filesystem
goes read-only if less devices are functional for writing than
required by the selected RAID levels. And in order to avoid
the surprise of a filesystem going read-only 30 seconds after
mounting it, this is also enforced at mount time. [...]
We could also leave this as an option to the user "mount -o
degraded-and-I-want-to-lose-my-data", but in my opinion the use
case is very, very exceptional.
IMHO the use case is common any time restoring the entire filesystem
from backups is inconvenient. That covers a *lot* of users. I never
have a machine with more than 50% of its raw disk space devoted to btrfs
because I need raw space on the disk to do mkfs+rsync from the broken
read-only btrfs filesystems.
Somewhere in the future for btrfs is online fsck; however, we're not there
yet. The kernel still blows up over relatively minor structural errors.
FWIW I'd like to be able to mount a broken btrfs read-write, add more
storage (either grow existing disks or add new ones), and then use the new
storage as temporary space to build a cleaned copy of the old metadata
with unreachable or broken objects dropped (preferably leaving some
object behind that returns EIO when read, but can be written or deleted).
Once there is clean metadata, we can rebuild free space maps (possibly
collecting allocated orphan extents into lost+found), then the surviving
data can be rebalanced or moved fairly easily. The grown/added disks
can be shrunk/removed at the end.
> By all means return an error when reading a file that is completely
> missing. By all means have an extra fsck mode that goes ahead and
> deletes all the missing files (assuming it has metadata) or perhaps
> moves them all to a new "lost+notfound" subvolume or something.
>
> Indeed, if the lost device happens not to contain any data at all,
> you might get lucky and lose nothing when a single device fails in a
> filesystem that uses the single profile throughout. That would be a
> bit of an edge case, but one that is handled automatically if you
> give the admin the ability to force read-write/etc.
>
> --
> Rich
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: device balance times
2014-10-24 16:07 ` Zygo Blaxell
@ 2014-10-24 19:58 ` Rich Freeman
0 siblings, 0 replies; 35+ messages in thread
From: Rich Freeman @ 2014-10-24 19:58 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Duncan, Btrfs BTRFS
On Fri, Oct 24, 2014 at 12:07 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> We could also leave this as an option to the user "mount -o
> degraded-and-I-want-to-lose-my-data", but in my opinion the use
> case is very, very exceptional.
Well, as far as I understand it, it is only exceptional if you never
shut down during a conversion to raid1. :)
>
> IMHO the use case is common any time restoring the entire filesystem
> from backups is inconvenient. That covers a *lot* of users. I never
> have a machine with more than 50% of its raw disk space devoted to btrfs
> because I need raw space on the disk to do mkfs+rsync from the broken
> read-only btrfs filesystems.
The problem is that if you want btrfs raid1 and you ALSO want to have
an extra set of spares for copying your entire RAID1 to something
else, you're talking about a lot of extra disk space. I really don't
want to maintain a SAN just in case I have a btrfs problem. :)
I realize things are still somewhat experimental now, but we need to
at least think about how things will work long-term. Copying all your
data to another filesystem and re-creating the btrfs filesystem isn't
really a good recovery mode.
Restoring from backups is also becoming increasingly difficult. IO
bandwidth just has not kept pace with disk capacity. It can take the
better part of a day to copy a multi-TB array, and if you need to copy
it two ways you have to double the time, not to mention having
multiple TB of disks lying around.
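As a rough back-of-envelope: at the ~130 MB/s a current spinning disk
manages, a 4 TB array is about 4,000,000 MB / 130 MB/s ≈ 31,000 s, or
roughly 8.5 hours one way, and twice that if the data has to go out
and come back.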
--
Rich
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2014-10-24 19:58 UTC | newest]
Thread overview: 35+ messages
2014-10-21 18:59 device balance times Tomasz Chmielewski
2014-10-21 20:14 ` Piotr Pawłow
2014-10-21 20:44 ` Arnaud Kapp
2014-10-22 1:10 ` 5 _thousand_ snapshots? even 160? (was: device balance times) Robert White
2014-10-22 4:02 ` Zygo Blaxell
2014-10-22 4:05 ` Duncan
2014-10-23 20:38 ` 5 _thousand_ snapshots? even 160? Arnaud Kapp
2014-10-22 11:30 ` Austin S Hemmelgarn
2014-10-22 17:32 ` Goffredo Baroncelli
2014-10-22 11:22 ` device balance times Austin S Hemmelgarn
2014-10-22 1:43 ` Chris Murphy
2014-10-22 12:40 ` Piotr Pawłow
2014-10-22 16:59 ` Bob Marley
2014-10-23 7:39 ` Russell Coker
2014-10-23 8:49 ` Duncan
2014-10-23 9:19 ` Miao Xie
2014-10-23 11:39 ` Austin S Hemmelgarn
2014-10-24 1:05 ` Duncan
2014-10-24 2:35 ` Zygo Blaxell
2014-10-24 5:13 ` Duncan
2014-10-24 15:18 ` Zygo Blaxell
2014-10-24 10:58 ` Rich Freeman
2014-10-24 16:07 ` Zygo Blaxell
2014-10-24 19:58 ` Rich Freeman
2014-10-22 16:15 ` Chris Murphy
2014-10-23 2:44 ` Duncan
-- strict thread matches above, loose matches on Subject: below --
2014-10-22 7:14 5 _thousand_ snapshots? even 160? (was: device balance times) Tomasz Chmielewski
2014-10-22 7:41 ` Duncan
2014-10-22 20:08 ` Zygo Blaxell
2014-10-22 20:37 ` Robert White
2014-10-23 3:09 ` Zygo Blaxell
2014-10-23 4:30 ` Chris Murphy
2014-10-23 5:18 ` Robert White
2014-10-23 8:38 ` Duncan
2014-10-23 13:15 ` Zygo Blaxell