linux-btrfs.vger.kernel.org archive mirror
* Content based storage
@ 2010-03-16  9:21 David Brown
  2010-03-16 22:45 ` Fabio
  2010-03-17  0:45 ` Hubert Kario
  0 siblings, 2 replies; 16+ messages in thread
From: David Brown @ 2010-03-16  9:21 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I was wondering if there has been any thought or progress in 
content-based storage for btrfs beyond the suggestion in the "Project 
ideas" wiki page?

The basic idea, as I understand it, is that a longer data extent 
checksum is used (long enough to make collisions unrealistic), and merge 
data extents with the same checksums.  The result is that "cp foo bar" 
will have pretty much the same effect as "cp --reflink foo bar" - the 
two copies will share COW data extents - as long as they remain the 
same, they will share the disk space.  But you can still access each 
file independently, unlike with a traditional hard link.
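
To make this concrete, this is roughly the behaviour I mean on a btrfs
mount (file names are only for illustration):

   cp foo bar                    # ordinary copy: bar gets its own extents
   cp --reflink=always foo baz   # reflink copy: baz shares foo's COW extents

With content-based storage, the first form would eventually end up sharing
extents as well, once the duplicate data had been detected and merged.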

I can see at least three cases where this could be a big win - I'm sure 
there are more.

Developers often have multiple copies of source code trees as branches, 
snapshots, etc.  For larger projects (I have multiple "buildroot" trees 
for one project) this can take a lot of space.  Content-based storage 
would give the space efficiency of hard links with the independence of 
straight copies.  Using "cp --reflink" would help for the initial 
snapshot or branch, of course, but it could not help after the copy.

On servers using lightweight virtual servers such as OpenVZ, you have 
multiple "root" file systems each with their own copy of "/usr", etc. 
With OpenVZ, all the virtual roots are part of the host's file system 
(i.e., not hidden within virtual disks), so content-based storage could 
merge these, making them very much more efficient.  Because each of 
these virtual roots can be updated independently, it is not possible to 
use "cp --reflink" to keep them merged.

For backup systems, you will often have multiple copies of the same 
files.  A common scheme is to use rsync and "cp -al" to make hard-linked 
(and therefore space-efficient) snapshots of the trees.  But sometimes 
these things get out of synchronisation - perhaps your remote rsync dies 
halfway, and you end up with multiple independent copies of the same 
files.  Content-based storage can then re-merge these files.
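
As a sketch of the scheme I mean (paths are hypothetical):

   # yesterday's tree becomes a hard-linked, space-efficient snapshot...
   cp -al /backup/current /backup/2010-03-15
   # ...then the current tree is brought up to date in place
   rsync -a --delete remote:/data/ /backup/current/

If the rsync dies and is re-run into a fresh tree, the hard links are lost
and the copies become independent - exactly the situation where
content-based storage could re-merge them.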


I would imagine that content-based storage will sometimes be a 
performance win, sometimes a loss.  It would be a win when merging 
results in better use of the file system cache - OpenVZ virtual serving 
would be an example where you would be using multiple copies of the same 
file at the same time.  For other uses, such as backups, there would be 
no performance gain since you seldom (hopefully!) read the backup files. 
  But in that situation, speed is not a major issue.


mvh.,

David


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-16  9:21 Content based storage David Brown
@ 2010-03-16 22:45 ` Fabio
  2010-03-17  8:21   ` David Brown
  2010-03-17  0:45 ` Hubert Kario
  1 sibling, 1 reply; 16+ messages in thread
From: Fabio @ 2010-03-16 22:45 UTC (permalink / raw)
  To: David Brown; +Cc: linux-btrfs

Some years ago I was searching for that kind of functionality and found 
an experimental ext3 patch to allow the so-called COW-links: 
http://lwn.net/Articles/76616/

There was a later discussion on LWN (http://lwn.net/Articles/77972/) about 
whether an approach like COW-links would break POSIX standards.

I am not very technical and don't know if it's feasible in btrfs.
I think most likely you'll have to run a userspace tool to find and 
merge identical files based on checksums (which already sounds good to me).
The only thing we can ask the developers at the moment is if something 
like that would be possible without changes to the on-disk format.


PS. Another great scenario is shared-hosting web/file servers: tens of 
thousands of websites with mostly the same tiny PHP Joomla files.
If you can get the benefits of compression + "content based"/cowlinks + 
FS cache... that would really make Btrfs FLY on hard disks, and make SSD 
devices feasible for storage (because of the space efficiency).

-- 
Fabio


David Brown ha scritto:
> Hi,
>
> I was wondering if there has been any thought or progress in 
> content-based storage for btrfs beyond the suggestion in the "Project 
> ideas" wiki page?
>
> The basic idea, as I understand it, is that a longer data extent 
> checksum is used (long enough to make collisions unrealistic), and 
> merge data extents with the same checksums.  The result is that "cp 
> foo bar" will have pretty much the same effect as "cp --reflink foo 
> bar" - the two copies will share COW data extents - as long as they 
> remain the same, they will share the disk space.  But you can still 
> access each file independently, unlike with a traditional hard link.
>
> I can see at least three cases where this could be a big win - I'm 
> sure there are more.
>
> Developers often have multiple copies of source code trees as 
> branches, snapshots, etc.  For larger projects (I have multiple 
> "buildroot" trees for one project) this can take a lot of space.  
> Content-based storage would give the space efficiency of hard links 
> with the independence of straight copies.  Using "cp --reflink" would 
> help for the initial snapshot or branch, of course, but it could not 
> help after the copy.
>
> On servers using lightweight virtual servers such as OpenVZ, you have 
> multiple "root" file systems each with their own copy of "/usr", etc. 
> With OpenVZ, all the virtual roots are part of the host's file system 
> (i.e., not hidden within virtual disks), so content-based storage 
> could merge these, making them very much more efficient.  Because each 
> of these virtual roots can be updated independently, it is not 
> possible to use "cp --reflink" to keep them merged.
>
> For backup systems, you will often have multiple copies of the same 
> files.  A common scheme is to use rsync and "cp -al" to make 
> hard-linked (and therefore space-efficient) snapshots of the trees.  
> But sometimes these things get out of synchronisation - perhaps your 
> remote rsync dies halfway, and you end up with multiple independent 
> copies of the same files.  Content-based storage can then re-merge 
> these files.
>
>
> I would imagine that content-based storage will sometimes be a 
> performance win, sometimes a loss.  It would be a win when merging 
> results in better use of the file system cache - OpenVZ virtual 
> serving would be an example where you would be using multiple copies 
> of the same file at the same time.  For other uses, such as backups, 
> there would be no performance gain since you seldom (hopefully!) read 
> the backup files.  But in that situation, speed is not a major issue.
>
>
> mvh.,
>
> David
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-16  9:21 Content based storage David Brown
  2010-03-16 22:45 ` Fabio
@ 2010-03-17  0:45 ` Hubert Kario
  2010-03-17  8:27   ` David Brown
  2010-03-18 23:33   ` create debian package of btrfs kernel from git tree rk
  1 sibling, 2 replies; 16+ messages in thread
From: Hubert Kario @ 2010-03-17  0:45 UTC (permalink / raw)
  To: David Brown; +Cc: linux-btrfs

On Tuesday 16 March 2010 10:21:43 David Brown wrote:
> Hi,
> 
> I was wondering if there has been any thought or progress in
> content-based storage for btrfs beyond the suggestion in the "Project
> ideas" wiki page?
> 
> The basic idea, as I understand it, is that a longer data extent
> checksum is used (long enough to make collisions unrealistic), and merge
> data extents with the same checksums.  The result is that "cp foo bar"
> will have pretty much the same effect as "cp --reflink foo bar" - the
> two copies will share COW data extents - as long as they remain the
> same, they will share the disk space.  But you can still access each
> file independently, unlike with a traditional hard link.
> 
> I can see at least three cases where this could be a big win - I'm sure
> there are more.
> 
> Developers often have multiple copies of source code trees as branches,
> snapshots, etc.  For larger projects (I have multiple "buildroot" trees
> for one project) this can take a lot of space.  Content-based storage
> would give the space efficiency of hard links with the independence of
> straight copies.  Using "cp --reflink" would help for the initial
> snapshot or branch, of course, but it could not help after the copy.
> 
> On servers using lightweight virtual servers such as OpenVZ, you have
> multiple "root" file systems each with their own copy of "/usr", etc.
> With OpenVZ, all the virtual roots are part of the host's file system
> (i.e., not hidden within virtual disks), so content-based storage could
> merge these, making them very much more efficient.  Because each of
> these virtual roots can be updated independently, it is not possible to
> use "cp --reflink" to keep them merged.
> 
> For backup systems, you will often have multiple copies of the same
> files.  A common scheme is to use rsync and "cp -al" to make hard-linked
> (and therefore space-efficient) snapshots of the trees.  But sometimes
> these things get out of synchronisation - perhaps your remote rsync dies
> halfway, and you end up with multiple independent copies of the same
> files.  Content-based storage can then re-merge these files.
> 
> 
> I would imagine that content-based storage will sometimes be a
> performance win, sometimes a loss.  It would be a win when merging
> results in better use of the file system cache - OpenVZ virtual serving
> would be an example where you would be using multiple copies of the same
> file at the same time.  For other uses, such as backups, there would be
> no performance gain since you seldom (hopefully!) read the backup files.
> But in that situation, speed is not a major issue.
> 
> 
> mvh.,
> 
> David

From what I could read, content based storage is supposed to be in-line
deduplication; there are already plans to do (probably) a userland daemon
traversing the FS and merging identical extents -- giving you post-process
deduplication.

For a rather heavily used host (such as a VM host) you'd probably want to use
post-process dedup -- as the daemon can be easily stopped or be given lower
priority. In-line dedup is quite CPU intensive.

In-line dedup is very nice for backup though -- you don't need the temporary
storage before the (mostly unchanged) data is deduplicated.
-- 
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-16 22:45 ` Fabio
@ 2010-03-17  8:21   ` David Brown
  0 siblings, 0 replies; 16+ messages in thread
From: David Brown @ 2010-03-17  8:21 UTC (permalink / raw)
  To: linux-btrfs

On 16/03/2010 23:45, Fabio wrote:
> Some years ago I was searching for that kind of functionality and found
> an experimental ext3 patch to allow the so-called COW-links:
> http://lwn.net/Articles/76616/
>

I'd read about the COW patches for ext3 before.  While there is 
certainly some similarity here, there are a fair number of differences. 
  One is that those patches were aimed only at copying - there was no 
way to merge files later.  Another is that it was (as far as I can see) 
just an experimental hack to try out the concept.  Since it didn't take 
off, I think it is worth learning from, but not building on.

> There was a later discussion on LWN (http://lwn.net/Articles/77972/) about
> whether an approach like COW-links would break POSIX standards.
>

I think a lot of the problems here were concerning inode numbers.  As 
far as I understand it, when you made an ext3-cow copy, the copy and the 
original had different inode numbers.  That meant the userspace programs 
saw them as different files, and you could have different owners, 
attributes, etc., while keeping the data linked.  But that broke a 
common optimisation when doing large diff's - thus some people wanted to 
have the same inode for each file and that /definitely/ broke posix.

With btrfs, the file copies would each have their own inode - it would, 
I think, be POSIX-compliant, as it is transparent to user programs.  The 
diff optimisation discussed in the articles you cited would not work - 
but if btrfs becomes the standard Linux file system, then user 
applications like diff can be extended with btrfs-specific optimisations 
if necessary.

> I am not very technical and don't know if it's feasible in btrfs.

Nor am I very knowledgeable in this area (most of my programming is on 
8-bit processors), but I believe btrfs is already designed to support 
larger checksums (32-bit CRCs are not enough to say that data is 
identical), and the "cp --reflink" shows how the underlying link is made.
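
As a rough illustration of the difference in hash strength (the file name
is hypothetical):

   cksum foo       # 32-bit CRC - fine for spotting corruption, far too
                   # short to treat a matching value as "identical data"
   sha256sum foo   # 256-bit digest - collisions are unrealistic in practice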

> I think most likely you'll have to run a userspace tool to find and
> merge identical files based on checksums (which already sounds good to me).

This sounds right to me.  In fact, it would be possible to do today, 
entirely from within user space - but files would need to be compared 
long-hand before merging.  With larger checksums, the userspace daemon 
would be much more efficient.
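
Something along these lines, for example - purely a sketch of the idea, not
an existing tool; it assumes "cp --reflink" works on the filesystem in
question and that file names contain no whitespace:

   # group files by SHA-256, verify byte-for-byte, then re-create the
   # duplicate as a reflinked copy so both files share COW extents
   find /data -type f -print0 | xargs -0 sha256sum | sort | \
   while read sum file; do
       if [ "$sum" = "$prev_sum" ] && cmp -s "$prev_file" "$file"; then
           # naive: replacing the duplicate loses its owner/mode/mtime
           cp --reflink=always "$prev_file" "$file.tmp" && mv "$file.tmp" "$file"
       fi
       prev_sum=$sum; prev_file=$file
   done

A real daemon would presumably work on extents rather than whole files, and
would want kernel support to do the merge safely.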

> The only thing we can ask the developers at the moment is if something
> like that would be possible without changes to the on-disk format.
>

I guess that's partly why I made these posts!

>
> PS. Another great scenario is shared-hosting web/file servers: tens of
> thousands of websites with mostly the same tiny PHP Joomla files.
> If you can get the benefits of compression + "content based"/cowlinks +
> FS cache... that would really make Btrfs FLY on hard disks, and make SSD
> devices feasible for storage (because of the space efficiency).
>

That's a good point.

People often think that hard disk space is cheap these days - but being 
space-efficient means you can use an SSD instead of a hard disk.  And 
for on-disk backups, it means you can use a small number of disks even 
though the users think "I've got a huge hard disk, I can make lots of 
copies of these files"!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-17  0:45 ` Hubert Kario
@ 2010-03-17  8:27   ` David Brown
  2010-03-17  8:48     ` Heinz-Josef Claes
  2010-03-18 23:33   ` create debian package of btrfs kernel from git tree rk
  1 sibling, 1 reply; 16+ messages in thread
From: David Brown @ 2010-03-17  8:27 UTC (permalink / raw)
  To: linux-btrfs

On 17/03/2010 01:45, Hubert Kario wrote:
> On Tuesday 16 March 2010 10:21:43 David Brown wrote:
>> Hi,
>>
>> I was wondering if there has been any thought or progress in
>> content-based storage for btrfs beyond the suggestion in the "Project
>> ideas" wiki page?
>>
>> The basic idea, as I understand it, is that a longer data extent
>> checksum is used (long enough to make collisions unrealistic), and merge
>> data extents with the same checksums.  The result is that "cp foo bar"
>> will have pretty much the same effect as "cp --reflink foo bar" - the
>> two copies will share COW data extents - as long as they remain the
>> same, they will share the disk space.  But you can still access each
>> file independently, unlike with a traditional hard link.
>>
>> I can see at least three cases where this could be a big win - I'm sure
>> there are more.
>>
>> Developers often have multiple copies of source code trees as branches,
>> snapshots, etc.  For larger projects (I have multiple "buildroot" trees
>> for one project) this can take a lot of space.  Content-based storage
>> would give the space efficiency of hard links with the independence of
>> straight copies.  Using "cp --reflink" would help for the initial
>> snapshot or branch, of course, but it could not help after the copy.
>>
>> On servers using lightweight virtual servers such as OpenVZ, you have
>> multiple "root" file systems each with their own copy of "/usr", etc.
>> With OpenVZ, all the virtual roots are part of the host's file system
>> (i.e., not hidden within virtual disks), so content-based storage could
>> merge these, making them very much more efficient.  Because each of
>> these virtual roots can be updated independently, it is not possible to
>> use "cp --reflink" to keep them merged.
>>
>> For backup systems, you will often have multiple copies of the same
>> files.  A common scheme is to use rsync and "cp -al" to make hard-linked
>> (and therefore space-efficient) snapshots of the trees.  But sometimes
>> these things get out of synchronisation - perhaps your remote rsync dies
>> halfway, and you end up with multiple independent copies of the same
>> files.  Content-based storage can then re-merge these files.
>>
>>
>> I would imagine that content-based storage will sometimes be a
>> performance win, sometimes a loss.  It would be a win when merging
>> results in better use of the file system cache - OpenVZ virtual serving
>> would be an example where you would be using multiple copies of the same
>> file at the same time.  For other uses, such as backups, there would be
>> no performance gain since you seldom (hopefully!) read the backup files.
>>    But in that situation, speed is not a major issue.
>>
>>
>> mvh.,
>>
>> David
>
>  From what I could read, content based storage is supposed to be in-line
> deduplication; there are already plans to do (probably) a userland daemon
> traversing the FS and merging identical extents -- giving you post-process
> deduplication.
>
> For a rather heavily used host (such as a VM host) you'd probably want to use
> post-process dedup -- as the daemon can be easily stopped or be given lower
> priority. In-line dedup is quite CPU intensive.
>
> In-line dedup is very nice for backup though -- you don't need the temporary
> storage before the (mostly unchanged) data is deduplicated.

I think post-process deduplication is the way to go here, using a 
userspace daemon.  It's the most flexible solution.  As you say, inline 
dedup could be nice in some cases, such as for backups, since the cpu 
time cost is not an issue there.  However, in a typical backup 
situation, the new files are often written fairly slowly (for remote 
backups).  Even for local backups, there is generally not that much 
/new/ data, since you normally use some sort of incremental backup 
scheme (such as rsync, combined with cp -al or cp --reflink).  Thus it 
should be fine to copy over the data, then de-dup it later or in the 
background.
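
And since such a merge pass is just an ordinary process, it is easy to keep 
it out of the way of real work - something like the following, where the 
dedup-pass command is hypothetical and only nice/ionice are real:

   nice -n 19 ionice -c 3 dedup-pass /backup &

The idle I/O class means the scan only gets disk time when nothing else 
wants it.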


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-17  8:27   ` David Brown
@ 2010-03-17  8:48     ` Heinz-Josef Claes
  2010-03-17 15:25       ` Hubert Kario
  0 siblings, 1 reply; 16+ messages in thread
From: Heinz-Josef Claes @ 2010-03-17  8:48 UTC (permalink / raw)
  To: linux-btrfs

Hi,

just want to add one correction to your thoughts:

Storage is not cheap if you think about enterprise storage on a SAN, 
replicated to another data centre. Using dedup on the storage boxes leads to 
performance issues and other problems - only NetApp is offering this at the 
moment and it's not heavily used (because of the issues).

So I think it would be a big advantage for professional use to have dedup 
built into the filesystem - processors are faster and faster today and are not 
the cost drivers any more.  I do not think it's a problem to "spend" one core 
of a 2-socket box with 12 cores for this purpose.
Storage is cost intensive:
- SAN boxes are expensive
- RAID5 in two locations is expensive
- FC lines between locations are expensive (depending very much on where you 
are).

Naturally, you would not use this feature for all kinds of use cases (e.g. a 
heavily used database), but I think there is enough need.

my 2 cents,
Heinz-Josef Claes

On Wednesday 17 March 2010 09:27:15 you wrote:
> On 17/03/2010 01:45, Hubert Kario wrote:
> > On Tuesday 16 March 2010 10:21:43 David Brown wrote:
> >> Hi,
> >> 
> >> I was wondering if there has been any thought or progress in
> >> content-based storage for btrfs beyond the suggestion in the "Project
> >> ideas" wiki page?
> >> 
> >> The basic idea, as I understand it, is that a longer data extent
> >> checksum is used (long enough to make collisions unrealistic), and merge
> >> data extents with the same checksums.  The result is that "cp foo bar"
> >> will have pretty much the same effect as "cp --reflink foo bar" - the
> >> two copies will share COW data extents - as long as they remain the
> >> same, they will share the disk space.  But you can still access each
> >> file independently, unlike with a traditional hard link.
> >> 
> >> I can see at least three cases where this could be a big win - I'm sure
> >> there are more.
> >> 
> >> Developers often have multiple copies of source code trees as branches,
> >> snapshots, etc.  For larger projects (I have multiple "buildroot" trees
> >> for one project) this can take a lot of space.  Content-based storage
> >> would give the space efficiency of hard links with the independence of
> >> straight copies.  Using "cp --reflink" would help for the initial
> >> snapshot or branch, of course, but it could not help after the copy.
> >> 
> >> On servers using lightweight virtual servers such as OpenVZ, you have
> >> multiple "root" file systems each with their own copy of "/usr", etc.
> >> With OpenVZ, all the virtual roots are part of the host's file system
> >> (i.e., not hidden within virtual disks), so content-based storage could
> >> merge these, making them very much more efficient.  Because each of
> >> these virtual roots can be updated independently, it is not possible to
> >> use "cp --reflink" to keep them merged.
> >> 
> >> For backup systems, you will often have multiple copies of the same
> >> files.  A common scheme is to use rsync and "cp -al" to make hard-linked
> >> (and therefore space-efficient) snapshots of the trees.  But sometimes
> >> these things get out of synchronisation - perhaps your remote rsync dies
> >> halfway, and you end up with multiple independent copies of the same
> >> files.  Content-based storage can then re-merge these files.
> >> 
> >> 
> >> I would imagine that content-based storage will sometimes be a
> >> performance win, sometimes a loss.  It would be a win when merging
> >> results in better use of the file system cache - OpenVZ virtual serving
> >> would be an example where you would be using multiple copies of the same
> >> file at the same time.  For other uses, such as backups, there would be
> >> no performance gain since you seldom (hopefully!) read the backup files.
> >> 
> >>    But in that situation, speed is not a major issue.
> >> 
> >> mvh.,
> >> 
> >> David
> >> 
> >  From what I could read, content based storage is supposed to be in-line
> > deduplication; there are already plans to do (probably) a userland daemon
> > traversing the FS and merging identical extents -- giving you
> > post-process deduplication.
> > 
> > For a rather heavily used host (such as a VM host) you'd probably want to
> > use post-process dedup -- as the daemon can be easily stopped or be given
> > lower priority. In-line dedup is quite CPU intensive.
> > 
> > In-line dedup is very nice for backup though -- you don't need the
> > temporary storage before the (mostly unchanged) data is deduplicated.
> 
> I think post-process deduplication is the way to go here, using a
> userspace daemon.  It's the most flexible solution.  As you say, inline
> dedup could be nice in some cases, such as for backups, since the cpu
> time cost is not an issue there.  However, in a typical backup
> situation, the new files are often written fairly slowly (for remote
> backups).  Even for local backups, there is generally not that much
> /new/ data, since you normally use some sort of incremental backup
> scheme (such as rsync, combined with cp -al or cp --reflink).  Thus it
> should be fine to copy over the data, then de-dup it later or in the
> background.
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-17  8:48     ` Heinz-Josef Claes
@ 2010-03-17 15:25       ` Hubert Kario
  2010-03-17 15:33         ` Leszek Ciesielski
  0 siblings, 1 reply; 16+ messages in thread
From: Hubert Kario @ 2010-03-17 15:25 UTC (permalink / raw)
  To: linux-btrfs

On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote:
> Hi,
> 
> just want to add one correction to your thoughts:
> 
> Storage is not cheap if you think about enterprise storage on a SAN,
> replicated to another data centre. Using dedup on the storage boxes leads
> to performance issues and other problems - only NetApp is offering this at
> the moment and it's not heavily used (because of the issues).

there are at least two other suppliers with inline dedup products, and there
is an OSS solution: lessfs

> So I think it would be a big advantage for professional use to have dedup
> built into the filesystem - processors are faster and faster today and are
> not the cost drivers any more.  I do not think it's a problem to "spend"
> one core of a 2-socket box with 12 cores for this purpose.
> Storage is cost intensive:
> - SAN boxes are expensive
> - RAID5 in two locations is expensive
> - FC lines between locations are expensive (depending very much on where
> you are).

In-line dedup is expensive in two ways: first you have to cache the data going
to disk and generate a checksum for it, then you have to look up whether such
a block is already stored -- if the database doesn't fit into RAM (for a VM
host it's more than likely) it requires at least a few disk seeks, if not a
few dozen for really big databases. Then you should read the block/extent back
and compare them bit for bit. And only then write the data to the disk. That
reduces your IOPS by at least an order of magnitude, if not more.

For post-process dedup you can go as fast as your HDDs will allow you. And
then, when your machine is mostly idle, you can go and churn through the data.

IMHO in-line dedup is a good thing only as storage for backups -- when you
have a high probability that the stored data is duplicated (and with a 1:10
dedup ratio, 90% of it is).

So the CPU cost is only one factor. HDDs are a major bottleneck too.

All things considered, it would be best to have both post-process and in-line
data deduplication, but I think that in-line dedup will see much less use.

> 
> Naturally, you would not use this feature for all kinds of use cases (e.g.
> a heavily used database), but I think there is enough need.
> 
> my 2 cents,
> Heinz-Josef Claes
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

Quality Management System
compliant with the ISO 9001:2000 standard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-17 15:25       ` Hubert Kario
@ 2010-03-17 15:33         ` Leszek Ciesielski
  2010-03-17 19:43           ` Hubert Kario
  0 siblings, 1 reply; 16+ messages in thread
From: Leszek Ciesielski @ 2010-03-17 15:33 UTC (permalink / raw)
  To: Hubert Kario, linux-btrfs

On Wed, Mar 17, 2010 at 4:25 PM, Hubert Kario <hka@qbs.com.pl> wrote:
> On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote:
>> Hi,
>>
>> just want to add one correction to your thoughts:
>>
>> Storage is not cheap if you think about enterprise storage on a SAN,
>> replicated to another data centre. Using dedup on the storage boxes leads
>> to performance issues and other problems - only NetApp is offering this
>> at the moment and it's not heavily used (because of the issues).
>
> there are at least two other suppliers with inline dedup products and
> there is an OSS solution: lessfs
>
>> So I think it would be a big advantage for professional use to have dedup
>> built into the filesystem - processors are faster and faster today and
>> are not the cost drivers any more.  I do not think it's a problem to
>> "spend" one core of a 2-socket box with 12 cores for this purpose.
>> Storage is cost intensive:
>> - SAN boxes are expensive
>> - RAID5 in two locations is expensive
>> - FC lines between locations are expensive (depending very much on where
>> you are).
>
> In-line dedup is expensive in two ways: first you have to cache the data
> going to disk and generate a checksum for it, then you have to look up
> whether such a block is already stored -- if the database doesn't fit into
> RAM (for a VM host it's more than likely) it requires at least a few disk
> seeks, if not a few dozen for really big databases. Then you should read
> the block/extent back and compare them bit for bit. And only then write
> the data to the disk. That reduces your IOPS by at least an order of
> magnitude, if not more.

Sun decided that with SHA256 (which ZFS uses for normal checksumming)
collisions are unlikely enough to skip the read/compare step:
http://blogs.sun.com/bonwick/entry/zfs_dedup . That's not the case, of
course, with the CRC32 that btrfs currently uses, but a switch to a stronger
hash would be recommended to reduce collisions anyway. And yes, for the truly
paranoid, a forced verification (after the hashes match) is always an
option.
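
For reference, I believe the knobs on the ZFS side look roughly like this
(pool name hypothetical):

   zfs set dedup=on tank         # trust the SHA256 match
   zfs set dedup=verify tank     # byte-compare blocks before sharing them
   zfs get dedup tank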

>
> For post-process dedup you can go as fast as your HDDs will allow you. And
> then, when your machine is mostly idle, you can go and churn through the
> data.
>
> IMHO in-line dedup is a good thing only as storage for backups -- when you
> have a high probability that the stored data is duplicated (and with a 1:10
> dedup ratio, 90% of it is).
>
> So the CPU cost is only one factor. HDDs are a major bottleneck too.
>
> All things considered, it would be best to have both post-process and
> in-line data deduplication, but I think that in-line dedup will see much
> less use.
>
>> Naturally, you would not use this feature for all kinds of use cases (e.g.
>> a heavily used database), but I think there is enough need.
>>
>> my 2 cents,
>> Heinz-Josef Claes
> --
> Hubert Kario
> QBS - Quality Business Software
> 02-656 Warszawa, ul. Ksawerów 30/85
> tel. +48 (22) 646-61-51, 646-74-24
> www.qbs.com.pl
>
> Quality Management System
> compliant with the ISO 9001:2000 standard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-17 15:33         ` Leszek Ciesielski
@ 2010-03-17 19:43           ` Hubert Kario
  2010-03-20  2:46             ` Boyd Waters
  0 siblings, 1 reply; 16+ messages in thread
From: Hubert Kario @ 2010-03-17 19:43 UTC (permalink / raw)
  To: linux-btrfs

On Wednesday 17 March 2010 16:33:41 Leszek Ciesielski wrote:
> On Wed, Mar 17, 2010 at 4:25 PM, Hubert Kario <hka@qbs.com.pl> wrote:
> > On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote:
> >> Hi,
> >>
> >> just want to add one correction to your thoughts:
> >>
> >> Storage is not cheap if you think about enterprise storage on a SAN,
> >> replicated to another data centre. Using dedup on the storage boxes
> >> leads to performance issues and other problems - only NetApp is offering
> >> this at the moment and it's not heavily used (because of the issues).
> >
> > there are at least two other suppliers with inline dedup products and
> > there is an OSS solution: lessfs
> >
> >> So I think it would be a big advantage for professional use to have
> >> dedup built into the filesystem - processors are faster and faster today
> >> and are not the cost drivers any more.  I do not think it's a problem to
> >> "spend" one core of a 2-socket box with 12 cores for this purpose.
> >> Storage is cost intensive:
> >> - SAN boxes are expensive
> >> - RAID5 in two locations is expensive
> >> - FC lines between locations are expensive (depending very much on where
> >> you are).
> >
> > In-line dedup is expensive in two ways: first you have to cache the data
> > going to disk and generate a checksum for it, then you have to look up
> > whether such a block is already stored -- if the database doesn't fit
> > into RAM (for a VM host it's more than likely) it requires at least a few
> > disk seeks, if not a few dozen for really big databases. Then you should
> > read the block/extent back and compare them bit for bit. And only then
> > write the data to the disk. That reduces your IOPS by at least an order
> > of magnitude, if not more.
> 
> Sun decided that with SHA256 (which ZFS uses for normal checksumming)
> collisions are unlikely enough to skip the read/compare step:
> http://blogs.sun.com/bonwick/entry/zfs_dedup . That's not the case, of
> course, with the CRC32 that btrfs currently uses, but a switch to a
> stronger hash would be recommended to reduce collisions anyway. And yes,
> for the truly paranoid, a forced verification (after the hashes match) is
> always an option.
> 

If the server contains financial data I'd prefer the "impossible", not 
"unlikely".

Read further: Sun did provide a way to enable the compare step, by using 
"verify" instead of "on":
zfs set dedup=verify <pool>

And, yes, I know that the probability of hardware malfunction is vastly higher 
than the probability of collision (that's why I wrote "should" - next time I'll 
write it as SHOULD as per RFC 2119 ;) - but, as history has shown, all hash 
algorithms eventually get broken, the question is only when.  If the FS does 
verify the data, then an attacker can't use collisions to get at data it 
shouldn't have access to.
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

Quality Management System
compliant with the ISO 9001:2000 standard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* create debian package of btrfs kernel from git tree
  2010-03-17  0:45 ` Hubert Kario
  2010-03-17  8:27   ` David Brown
@ 2010-03-18 23:33   ` rk
  1 sibling, 0 replies; 16+ messages in thread
From: rk @ 2010-03-18 23:33 UTC (permalink / raw)
  To: linux-btrfs

Hello, would somebody please write down how to create deb kernel package
with latest btrfs from the git tree -- that would be a big help.
thanks,
rk



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-17 19:43           ` Hubert Kario
@ 2010-03-20  2:46             ` Boyd Waters
  2010-03-20 13:05               ` Ric Wheeler
  0 siblings, 1 reply; 16+ messages in thread
From: Boyd Waters @ 2010-03-20  2:46 UTC (permalink / raw)
  To: linux-btrfs

2010/3/17 Hubert Kario <hka@qbs.com.pl>:
>
> Read further, Sun did provide a way to enable the compare step by using
> "verify" instead of "on":
> zfs set dedup=verify <pool>

I have tested ZFS deduplication on the same data set that I'm using to
test btrfs. I used a 5-element raidz, dedup=on, which uses SHA256 for
ZFS checksumming and duplication detection on Build 133 of OpenSolaris
for x86_64.

Subjectively, I felt that the array writes were slower than without
dedup. For a while, the option for "dedup=fletcher4,verify" was in the
system, which permitted the (faster, more prone to collisions)
fletcher4 hash for ZFS checksum, and full comparison in the
(relatively rare) case of collision. Darren Moffat worked to unify the
ZFS SHA256 code with the OpenSolaris crypto-api implementation, which
improved performance [1]. But I was not able to test that
implementation.

My dataset reported a dedup factor of 1.28 for about 4TB, meaning that
almost a third of the dataset was duplicated. This seemed plausible,
as the dataset includes multiple backups of a 400GB data set, as well
as numerous VMWare virtual machines.

Despite the performance hit, I'd be pleased to see work on this
continue. Darren Moffat's performance improvements were encouraging,
and the data set integrity was rock-solid. I had a disk failure during
this test, which almost certainly had far more impact on performance
than the deduplication: failed writes to the disk were blocking I/O,
and it got pretty bad before I was able to replace the disk. I never
lost any data, and array management was dead simple.

So anyway FWIW the ZFS dedup implementation is a good one, and had
headroom for improvement.

Finally, ZFS also lets you set a minimum number of duplicates that you
would like applied to the dataset; it only starts pointing to existing
blocks after the "duplication minimum" is reached. (dedupditto
property) [2]


[1] http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via
[2] http://opensolaris.org/jive/thread.jspa?messageID=426661
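
For anyone repeating the experiment, this kind of dedup factor can be read
off the pool statistics roughly like this (pool name hypothetical):

   zpool list tank               # the DEDUP column shows the overall ratio
   zpool get dedupratio tank
   zdb -DD tank                  # histogram of the deduplication table (DDT)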

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-20  2:46             ` Boyd Waters
@ 2010-03-20 13:05               ` Ric Wheeler
  2010-03-20 21:24                 ` Boyd Waters
  0 siblings, 1 reply; 16+ messages in thread
From: Ric Wheeler @ 2010-03-20 13:05 UTC (permalink / raw)
  To: Boyd Waters; +Cc: linux-btrfs

On 03/19/2010 10:46 PM, Boyd Waters wrote:
> 2010/3/17 Hubert Kario<hka@qbs.com.pl>:
>    
>> Read further, Sun did provide a way to enable the compare step by using
>> "verify" instead of "on":
>> zfs set dedup=verify<pool>
>>      
> I have tested ZFS deduplication on the same data set that I'm using to
> test btrfs. I used a 5-element raidz, dedup=on, which uses SHA256 for
> ZFS checksumming and duplication detection on Build 133 of OpenSolaris
> for x86_64.
>
> Subjectively, I felt that the array writes were slower than without
> dedup. For a while, the option for "dedup=fletcher4,verify" was in the
> system, which permitted the (faster, more prone to collisions)
> fletcher4 hash for ZFS checksum, and full comparison in the
> (relatively rare) case of collision. Darren Moffat worked to unify the
> ZFS SHA256 code with the OpenSolaris crypto-api implementation, which
> improved performance [1]. But I was not able to test that
> implementation.
>
> My dataset reported a dedup factor of 1.28 for about 4TB, meaning that
> almost a third of the dataset was duplicated. This seemed plausible,
> as the dataset includes multiple backups of a 400GB data set, as well
> as numerous VMWare virtual machines.
>    

It is always interesting to compare this to the rate you would get with 
old fashioned compression to see how effective this is. Seems to be not 
that aggressive if I understand your results correctly.

Any idea of how compressible your data set was?

Regards,

Ric


> Despite the performance hit, I'd be pleased to see work on this
> continue. Darren Moffat's performance improvements were encouraging,
> and the data set integrity was rock-solid. I had a disk failure during
> this test, which almost certainly had far more impact on performance
> than the deduplication: failed writes to the disk were blocking I/O,
> and it got pretty bad before I was able to replace the disk. I never
> lost any data, and array management was dead simple.
>
> So anyway FWIW the ZFS dedup implementation is a good one, and had
> headroom for improvement.
>
> Finally, ZFS also lets you set a minimum number of duplicates that you
> would like applied to the dataset; it only starts pointing to existing
> blocks after the "duplication minimum" is reached. (dedupditto
> property) [2]
>
>
> [1] http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via
> [2] http://opensolaris.org/jive/thread.jspa?messageID=426661
>
>    


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-20 13:05               ` Ric Wheeler
@ 2010-03-20 21:24                 ` Boyd Waters
  2010-03-20 22:16                   ` Ric Wheeler
  0 siblings, 1 reply; 16+ messages in thread
From: Boyd Waters @ 2010-03-20 21:24 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-btrfs@vger.kernel.org

On Mar 20, 2010, at 9:05 AM, Ric Wheeler <rwheeler@redhat.com> wrote:
>>
>> My dataset reported a dedup factor of 1.28 for about 4TB, meaning
>> that
>> almost a third of the dataset was duplicated.

> It is always interesting to compare this to the rate you would get
> with old fashioned compression to see how effective this is. Seems
> to be not that aggressive if I understand your results correctly.
>
> Any idea of how compressible your data set was?

Well, of course if I used zip on the whole 4 TB that would deal with
my duplication issues, and give me a useless, static blob with no
checksumming. I haven't tried.
>

One thing that I did do, seven (!) years ago, was to detect duplicate
files (not blocks) and use hard links. I was able to squeeze out all
of the air in a series of backups, and was able to see all of them. I
used a Perl script for all this. It was nuts, but now I understand why
Apple implemented hard links to directories in HFS+ in order to get
their Time Machine product.  I didn't have copy-on-write, so btrfs
snapshots completely spank a manual system like this, but I did get
7-to-1 compression. These days you can use rsync with "--link-dest" to
make hard-linked duplicates of large directory trees. Tar, cpio, and
friends tend to break when transferring hundreds of gigabytes with
thousands of hard links. Or they ignore the hard links.
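
For example, something like this gives much the same effect with stock
rsync (paths hypothetical):

   rsync -a --link-dest=/backup/2010-03-19 remote:/data/ /backup/2010-03-20/

Unchanged files become hard links into the previous snapshot; only changed
files take new space.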

Good times. I'm not sure how this is germane to btrfs, except to point
out pathological file-system usage that I've actually attempted in
real life. I actually use a lot of the ZFS feature set, and I look
forward to btrfs stability. I think btrfs can get there.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-20 21:24                 ` Boyd Waters
@ 2010-03-20 22:16                   ` Ric Wheeler
  2010-03-20 22:44                     ` Ric Wheeler
  0 siblings, 1 reply; 16+ messages in thread
From: Ric Wheeler @ 2010-03-20 22:16 UTC (permalink / raw)
  To: Boyd Waters; +Cc: linux-btrfs@vger.kernel.org

On 03/20/2010 05:24 PM, Boyd Waters wrote:
> On Mar 20, 2010, at 9:05 AM, Ric Wheeler<rwheeler@redhat.com>  wrote:
>>>
>>> My dataset reported a dedup factor of 1.28 for about 4TB, meaning
>>> that
>>> almost a third of the dataset was duplicated.
>
>> It is always interesting to compare this to the rate you would get
>> with old fashioned compression to see how effective this is. Seems
>> to be not that aggressive if I understand your results correctly.
>>
>> Any idea of how compressible your data set was?
>
> Well, of course if I used zip on the whole 4 TB that would deal with
> my duplication issues, and give me a useless, static blob with no
> checksumming. I haven't tried.

gzip/bzip2 of the block device was not meant to give a best case estimate of 
what traditional compression can do. Many block devices (including some single 
spindle disks) can do encryption internally.

>
> One thing that I did do, seven (!) years ago, was to detect duplicate
> files (not blocks) and use hard links. I was able to squeeze out all
> of the air in a series of backups, and was able to see all of them. I
> used a Perl script for all this. It was nuts, but now I understand why
> Apple implemented hard links to directories in HFS+ in order to get
> their Time Machine product.  I didn't have copy-on-write, so btrfs
> snapshots completely spank a manual system like this, but I did get
> 7-to-1 compression. These days you can use rsync with "--link-dest" to
> make hard-linked duplicates of large directory trees. Tar, cpio, and
> friends tend to break when transferring hundreds of gigabytes with
> thousands of hard links. Or they ignore the hard links.
>
> Good times. I'm not sure how this is germane to btrfs, except to point
> out pathological file-system usage that I've actually attempted in
> real life. I actually use a lot of the ZFS feature set, and I look
> forward to btrfs stability. I think btrfs can get there.

File-level dedup is something we did in a group I worked with before, and it 
can certainly be quite effective. Even better, it is much easier to map onto 
normal user expectations :-)

ric


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-20 22:16                   ` Ric Wheeler
@ 2010-03-20 22:44                     ` Ric Wheeler
  2010-03-21  6:55                       ` Boyd Waters
  0 siblings, 1 reply; 16+ messages in thread
From: Ric Wheeler @ 2010-03-20 22:44 UTC (permalink / raw)
  To: Boyd Waters; +Cc: linux-btrfs@vger.kernel.org

On 03/20/2010 06:16 PM, Ric Wheeler wrote:
> On 03/20/2010 05:24 PM, Boyd Waters wrote:
>> On Mar 20, 2010, at 9:05 AM, Ric Wheeler<rwheeler@redhat.com>  wrote:
>>>>
>>>> My dataset reported a dedup factor of 1.28 for about 4TB, meaning
>>>> that
>>>> almost a third of the dataset was duplicated.
>>
>>> It is always interesting to compare this to the rate you would get
>>> with old fashioned compression to see how effective this is. Seems
>>> to be not that aggressive if I understand your results correctly.
>>>
>>> Any idea of how compressible your data set was?
>>
>> Well, of course if I used zip on the whole 4 TB that would deal with
>> my duplication issues, and give me a useless, static blob with no
>> checksumming. I haven't tried.
>
> gzip/bzip2 of the block device was not meant to give a best case 
> estimate of what traditional compression can do. Many block devices 
> (including some single spindle disks) can do encryption internally.

I meant to say that it was not meant to provide useful compression, just 
to measure how well block-level encryption could do.

ric

>
>>
>> One thing that I did do, seven (!) years ago, was to detect duplicate
>> files (not blocks) and use hard links. I was able to squeeze out all
>> of the air in a series of backups, and was able to see all of them. I
>> used a Perl script for all this. It was nuts, but now I understand why
>> Apple implemented hard links to directories in HFS+ in order to get
>> their Time Machine product.  I didn't have copy-on-write, so btrfs
>> snapshots completely spank a manual system like this, but I did get
>> 7-to-1 compression. These days you can use rsync with "--link-dest" to
>> make hard-linked duplicates of large directory trees. Tar, cpio, and
>> friends tend to break when transferring hundreds of gigabytes with
>> thousands of hard links. Or they ignore the hard links.
>>
>> Good times. I'm not sure how this is germane to btrfs, except to point
>> out pathological file-system usage that I've actually attempted in
>> real life. I actually use a lot of the ZFS feature set, and I look
>> forward to btrfs stability. I think btrfs can get there.
>
> File-level dedup is something we did in a group I worked with before,
> and it can certainly be quite effective. Even better, it is much easier
> to map onto normal user expectations :-)
>
> ric
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Content based storage
  2010-03-20 22:44                     ` Ric Wheeler
@ 2010-03-21  6:55                       ` Boyd Waters
  0 siblings, 0 replies; 16+ messages in thread
From: Boyd Waters @ 2010-03-21  6:55 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

I realize that I've posted some dumb things in this thread so here's a
re-cast summary:

1) In the past, I experimented with filesystem backups, using my own
file-level checksumming that would detect when a file was already in
the backup repository, and add a hard link rather than allocate new
blocks. You can do that today on any [POSIX] filesystem that supports
hard links, by using rsync.

But you are far, far better off using snapshots.

2) I said that I got 7-to-1 "deduplication" using my hard-link system.
That's a meaningless statement, but anyway I was able to save twelve
or so backups of a 100GB dataset on a 160GB hard disk.

You would almost certainly see much better results by using snapshots
on ZFS or btrfs, where a snapshot takes almost no storage to create,
and only uses extra space for any changed blocks. Snapshots are
block-level.
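
For example, on btrfs a snapshot of a subvolume is a one-liner (assuming
/data is a btrfs subvolume; paths hypothetical):

   btrfs subvolume snapshot /data /data/snapshots/2010-03-21

The snapshot costs almost nothing up front and only consumes space as
blocks in the original are rewritten.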

3) Another meaningless statement was my subjective notion that ZFS
dedup led to performance degradation. Forget I said that, as actually
I have no idea. My system was operating with failing drives at the time.

Some people report better performance with ZFS dedup, as it decreases
the number of disk writes.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2010-03-21  6:55 UTC | newest]

Thread overview: 16+ messages
2010-03-16  9:21 Content based storage David Brown
2010-03-16 22:45 ` Fabio
2010-03-17  8:21   ` David Brown
2010-03-17  0:45 ` Hubert Kario
2010-03-17  8:27   ` David Brown
2010-03-17  8:48     ` Heinz-Josef Claes
2010-03-17 15:25       ` Hubert Kario
2010-03-17 15:33         ` Leszek Ciesielski
2010-03-17 19:43           ` Hubert Kario
2010-03-20  2:46             ` Boyd Waters
2010-03-20 13:05               ` Ric Wheeler
2010-03-20 21:24                 ` Boyd Waters
2010-03-20 22:16                   ` Ric Wheeler
2010-03-20 22:44                     ` Ric Wheeler
2010-03-21  6:55                       ` Boyd Waters
2010-03-18 23:33   ` create debian package of btrfs kernel from git tree rk
