linux-btrfs.vger.kernel.org archive mirror
* Many questions from a potential btrfs user
@ 2013-10-14  2:54 Rogério Brito
  2013-10-14  7:48 ` Hugo Mills
  0 siblings, 1 reply; 2+ messages in thread
From: Rogério Brito @ 2013-10-14  2:54 UTC (permalink / raw)
  To: linux-btrfs

Hi.

I am seriously considering employing btrfs on my systems, particularly due
to some space-saving features that it has (namely, deduplication and
compression).

In fact, I was (a few moments ago) trying to back up some of my systems to a
2TB HD that has an ext4 filesystem and, in the middle of the last one, I got
the error message that the backup HD was full.

Given that what I back up there comes from systems where some of the data is
present multiple times (e.g., my mailbox, which is synced via offlineimap, or
videos that I download from online learning sites), and that such data
consists of many small, highly compressible files (the e-mails) or large
files (the videos), I would like to employ btrfs.

So, after reading the documentation on https://btrfs.wiki.kernel.org/, I am
still unsure of some points and I would like to have some clarifications
and/or expectations set straight.


* I understand that I can convert an ext4 filesystem to btrfs. Will such
  conversion work with an almost full ext4 filesystem? How much overhead
  will be needed to perform the conversion? I can (temporarily) remove some
  files that already are on this backup.

* Is it possible to deduplicate the files that are already in it? As
  mentioned before, there are likely to be many, and some of them are on the
  order of 1 to 2 GB.

* Doing a defragmentation with the filesystem mounted with compression will
  recompress the files (if they are deemed compressible by the
  filesystem). Is that understanding correct?  Will compressed blocks among
  many files also be deduplicated?

* How exactly do the recently merged offline deduplication features in the
  kernel interfere with what was (in my limited understanding) already
  possible with userspace tools like <https://github.com/g2p/bedup>?  Are
  such third-party tools likely to be integrated into btrfs-progs? Are they
  supposed to be kept separate?

* Does this change the on-disk format? Putting it another way, will it be
  safe to possibly go back to a previous kernel, if there is some problem
  with the current kernels? (Not that I necessarily want to go back to a
  previous kernel, but, sometimes, one would need to, say, git bisect the
  kernel).

* I most likely *don't* want to use online deduplication (given my bad
  experiences with ZFS).  With that in mind, is the current userspace
  deduplication intended to be run as a cron job? Is the offline
  deduplication too memory-intensive?  How much RAM would be needed for a
  2TB filesystem? Is 2GB enough? How about 4GB?

* Will further runs of the offline deduplication be "incremental" in some
  imprecise sense of the word? That is, if I run the deduplication once and
  immediately run it again (supposing nothing changes), will the 2nd time be
  faster than the first?  (If the disk caches are dropped?)

* Will I be able to add further HDs to my btrfs filesystem, once I get some
  more money to run something like a RAID0 configuration? If I get more HDs
  later, will I be able to change the configuration to, say, RAID5 or RAID6?
  I don't intend to use lvm, unless I have to.


I think that I had other questions, but since it is now past bed time, I
can't remember them. :)

Any further comments and/or guidance will be gladly accepted.


Thanks in advance,

Rogério Brito.


-- 
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://cynic.cc/blog/ : github.com/rbrito : profiles.google.com/rbrito
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br


* Re: Many questions from a potential btrfs user
  2013-10-14  2:54 Many questions from a potential btrfs user Rogério Brito
@ 2013-10-14  7:48 ` Hugo Mills
  0 siblings, 0 replies; 2+ messages in thread
From: Hugo Mills @ 2013-10-14  7:48 UTC (permalink / raw)
  To: Rogério Brito; +Cc: linux-btrfs


On Sun, Oct 13, 2013 at 11:54:42PM -0300, Rogério Brito wrote:
> Hi.
> 
> I am seriously considering employing btrfs on my systems, particularly due
> to some space-saving features that it has (namely, deduplication and
> compression).
> 
> In fact, I was (a few moments ago) trying to back up some of my systems to a
> 2TB HD that has an ext4 filesystem and, in the middle of the last one, I got
> the error message that the backup HD was full.
> 
> Given that what I back up there comes from systems where some of the data is
> present multiple times (e.g., my mailbox, which is synced via offlineimap, or
> videos that I download from online learning sites), and that such data
> consists of many small, highly compressible files (the e-mails) or large
> files (the videos), I would like to employ btrfs.
> 
> So, after reading the documentation on https://btrfs.wiki.kernel.org/, I am
> still unsure of some points and I would like to have some clarifications
> and/or expectations set straight.
> 
> 
> * I understand that I can convert an ext4 filesystem to btrfs. Will such
>   conversion work with an almost full ext4 filesystem? How much overhead
>   will be needed to perform the conversion? I can (temporarily) remove some
>   files that already are on this backup.

   I don't think we've ever explored the bounds of exactly how much
space you need for conversion. It'll be an absolute minimum of 0.1% of
the data used, probably quite a bit more, for the metadata.
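For reference, the conversion itself is done with btrfs-convert from btrfs-progs. A rough sketch of the procedure (the device name is a placeholder; the saved-image subvolume is named ext2_saved by the tool):

```shell
# Sketch of an in-place conversion; /dev/sdb1 is a placeholder for the
# backup partition. Keep a separate backup first -- this rewrites metadata.
umount /dev/sdb1
fsck.ext4 -f /dev/sdb1        # convert refuses a dirty filesystem
btrfs-convert /dev/sdb1       # builds btrfs metadata in the ext4 free space
mount /dev/sdb1 /mnt

# The original ext4 image is preserved in a subvolume, so a rollback
# is possible:  umount /mnt && btrfs-convert -r /dev/sdb1
# Once satisfied, reclaim that space by deleting the saved image:
btrfs subvolume delete /mnt/ext2_saved
```

Since the old ext4 image is kept until you delete it, the conversion needs enough free space on the ext4 side to hold the new btrfs metadata, which is where the 0.1%-plus figure above comes in.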

> * Is it possible to deduplicate the files that are already in it? As
>   mentioned before, there are likely to be many, and some of them are on the
>   order of 1 to 2 GB.

   Yes, there's an out-of-band deduplicator. I'll have to go and look
it up to work out exactly what tools you need to make it work. :)
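For what it's worth, one such third-party tool (bedup, mentioned further down in this mail) is typically run roughly like this -- the mount point is a placeholder:

```shell
# Sketch: out-of-band deduplication of existing files with bedup
# (https://github.com/g2p/bedup). /mnt/backup is a placeholder.
pip install --user bedup      # bedup is a Python tool
bedup scan /mnt/backup        # index files into bedup's tracking database
bedup dedup /mnt/backup       # merge the extents of files found identical
```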

> * Doing a defragmentation with the filesystem mounted with compression will
>   recompress the files (if they are deemed compressible by the
>   filesystem). Is that understanding correct?  Will compressed blocks among
>   many files also be deduplicated?

   You'll probably need to add -c to the defrag command, but yes, you
can persuade the FS to recompress files. I'm not sure how this affects
deduplication.
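A sketch of the defrag invocation in question, with the mount point as a placeholder:

```shell
# Recursively defragment and recompress existing files.
# -r: recurse into the tree; -czlib: compress rewritten extents with zlib.
btrfs filesystem defragment -r -czlib /mnt/backup
```

One caveat worth hedging on: defragmentation rewrites extents, so it can un-share extents that snapshots or deduplication had shared, which is presumably why the interaction with dedup is unclear.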

> * How exactly do the recently merged offline deduplication features in the
>   kernel interfere with what was (in my limited understanding) already
>   possible with userspace tools like <https://github.com/g2p/bedup>?  Are
>   such third-party tools likely to be integrated into btrfs-progs? Are they
>   supposed to be kept separate?

   The out-of-band (rather than offline) dedup kernel features simply
give a more reliable API call for merging identical extents, as it
allows them to be locked during the process -- without that API call,
there's a race condition that could potentially lead to data loss.
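Concretely, the API call in question is the BTRFS_IOC_FILE_EXTENT_SAME ioctl merged in 3.12. A minimal, hedged C sketch (file names come from argv, error handling is trimmed, and it requires kernel headers new enough to define the ioctl):

```c
/* Sketch: ask the kernel to dedupe the first `len` bytes of two files.
 * The kernel compares the ranges byte-for-byte while they are locked and
 * only merges the extents if they are identical -- which is exactly what
 * closes the race that userspace-only tools had. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <src> <dst> <len>\n", argv[0]);
        return 1;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);
    unsigned long long len = strtoull(argv[3], NULL, 0);

    /* The args struct carries a flexible array of per-destination records. */
    struct btrfs_ioctl_same_args *args =
        calloc(1, sizeof(*args) + sizeof(struct btrfs_ioctl_same_extent_info));
    args->logical_offset = 0;
    args->length = len;
    args->dest_count = 1;
    args->info[0].fd = dst;
    args->info[0].logical_offset = 0;

    if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0)
        perror("BTRFS_IOC_FILE_EXTENT_SAME");
    else
        printf("deduped %llu bytes, status %d\n",
               (unsigned long long)args->info[0].bytes_deduped,
               args->info[0].status);
    free(args);
    return 0;
}
```

(Needs to be run against files on a mounted btrfs filesystem; on anything else the ioctl simply fails.)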

> * Does this change the on-disk format? Putting it another way, will it be
>   safe to possibly go back to a previous kernel, if there is some problem
>   with the current kernels? (Not that I necessarily want to go back to a
>   previous kernel, but, sometimes, one would need to, say, git bisect the
>   kernel).

   No, that feature doesn't change the on-disk format.

> * I most likely *don't* want to use online deduplication (given my bad
>   experiences with ZFS).  With that in mind, is the current userspace
>   deduplication intended to be run as a cron job? Is the offline
>   deduplication too memory-intensive?  How much RAM would be needed for a
>   2TB filesystem? Is 2GB enough? How about 4GB?

   Out-of-band dedup is indeed the kind of thing you'd run as a cron
job. However, there are a couple of better approaches you can use. I
don't know about RAM usage, I'm afraid.

   If you use rsync for backups, then you can keep one subvolume as
the "current" version of the backups, and use the --inplace option of
rsync. Then, immediately after finishing a backup run, you can
snapshot that subvolume to give yourself a read-only historical
record. This will ensure that the maximum quantity of data is shared
between the individual backups without having to use OOB dedup.
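The scheme above can be sketched as follows (paths are placeholders; "current" is a btrfs subvolume on the backup disk):

```shell
# Update the working copy in place so unchanged blocks are not rewritten:
rsync -a --delete --inplace /home/ /mnt/backup/current/
# Snapshot the result read-only (-r) as a cheap historical record;
# extents stay shared between the snapshots until files actually change.
btrfs subvolume snapshot -r /mnt/backup/current \
    /mnt/backup/snap-$(date +%Y-%m-%d)
```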

   If your source FS is btrfs as well, you can do pretty much the same
thing (it's a little more complicated to set up) with btrfs
send/receive, which uses the inherent knowledge of the FS to work out
the differences more efficiently than rsync.
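A rough sketch of that variant, assuming btrfs on both ends (snapshot names and paths are placeholders):

```shell
# Take a read-only snapshot of the source to send from:
btrfs subvolume snapshot -r /home /home/.snap-new
# -p names the previous snapshot as the parent, so only the
# differences between the two are streamed to the backup disk:
btrfs send -p /home/.snap-old /home/.snap-new | btrfs receive /mnt/backup/
```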

> * Will further runs of the offline deduplication be "incremental" in some
>   imprecise sense of the word? That is, if I run the deduplication once and
>   immediately run it again (supposing nothing changes), will the 2nd time be
>   faster than the first?  (If the disk caches are dropped?)

   I don't know, but probably (since it should be able to tell that
the extents are already CoW copies).

> * Will I be able to add further HDs to my btrfs filesystem, once I get some
>   more money to run something like a RAID0 configuration? If I get more HDs
>   later, will I be able to change the configuration to, say, RAID5 or RAID6?
>   I don't intend to use lvm, unless I have to.

   Yes, you can change RAID levels on the fly, while the FS is mounted.
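A sketch of what that looks like in practice (device names and mount point are placeholders; note that the RAID5/6 code was still quite young at the time of this thread):

```shell
# Add a second disk to the mounted filesystem:
btrfs device add /dev/sdc /mnt/backup
# Rewrite existing data as raid0 and metadata as raid1 across the devices:
btrfs balance start -dconvert=raid0 -mconvert=raid1 /mnt/backup
```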

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
            --- I can resist everything except temptation ---            


