Linux Btrfs filesystem development
* Data-deduplication?
@ 2008-10-12  2:06 Ray Van Dolson
  2008-10-13  8:52 ` Data-deduplication? Andi Kleen
  2008-10-13 11:02 ` Data-deduplication? Chris Mason
  0 siblings, 2 replies; 14+ messages in thread
From: Ray Van Dolson @ 2008-10-12  2:06 UTC (permalink / raw)
  To: linux-btrfs

I recall there being a thread here a number of months back regarding
data-deduplication support for btrfs.

Did anyone end up picking that up and giving a go at it?  Block level
data dedup would be *awesome* in a Linux filesystem.  It does wonders
for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
have this feature yet (although I've read discussions on them looking
to add it).

Thanks for everyone's hard work!

Ray

-- 
Ray Van Dolson <rayvd@bludgeon.org>
GPG Fingerprint: 175B D779 4BC9 D5FF 5CC9  CE79 BCB4 0703 B51E 9F1A

* Re: Data-deduplication?
  2008-10-12  2:06 Data-deduplication? Ray Van Dolson
@ 2008-10-13  8:52 ` Andi Kleen
  2008-10-15 13:39   ` Data-deduplication? Avi Kivity
  2008-10-13 11:02 ` Data-deduplication? Chris Mason
  1 sibling, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2008-10-13  8:52 UTC (permalink / raw)
  To: Ray Van Dolson; +Cc: linux-btrfs

Ray Van Dolson <rayvd@bludgeon.org> writes:

> I recall there being a thread here a number of months back regarding
> data-deduplication support for btrfs.
>
> Did anyone end up picking that up and giving a go at it?  Block level
> data dedup would be *awesome* in a Linux filesystem.  It does wonders
> for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
> have this feature yet (although I've read discussions on them looking
> to add it).

There are some patches to do this in QEMU's cow format for KVM. That's
user level only.

-Andi
-- 
ak@linux.intel.com

* Re: Data-deduplication?
  2008-10-12  2:06 Data-deduplication? Ray Van Dolson
  2008-10-13  8:52 ` Data-deduplication? Andi Kleen
@ 2008-10-13 11:02 ` Chris Mason
  2008-10-16 19:25   ` Data-deduplication? Valerie Aurora Henson
  1 sibling, 1 reply; 14+ messages in thread
From: Chris Mason @ 2008-10-13 11:02 UTC (permalink / raw)
  To: Ray Van Dolson; +Cc: linux-btrfs

On Sat, 2008-10-11 at 19:06 -0700, Ray Van Dolson wrote:
> I recall there being a thread here a number of months back regarding
> data-deduplication support for btrfs.
> 
> Did anyone end up picking that up and giving a go at it?  Block level
> data dedup would be *awesome* in a Linux filesystem.  It does wonders
> for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
> have this feature yet (although I've read discussions on them looking
> to add it).
> 

So far nobody has grabbed this one, but I've had more requests (no
shocker there, the kvm people are interested in it too).  It probably
won't make 1.0 but the disk format will be able to support it.

-chris



* Re: Data-deduplication?
  2008-10-13  8:52 ` Data-deduplication? Andi Kleen
@ 2008-10-15 13:39   ` Avi Kivity
  2008-10-15 14:15     ` Data-deduplication? Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Avi Kivity @ 2008-10-15 13:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ray Van Dolson, linux-btrfs

Andi Kleen wrote:
> Ray Van Dolson <rayvd@bludgeon.org> writes:
>
>   
>> I recall there being a thread here a number of months back regarding
>> data-deduplication support for btrfs.
>>
>> Did anyone end up picking that up and giving a go at it?  Block level
>> data dedup would be *awesome* in a Linux filesystem.  It does wonders
>> for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
>> have this feature yet (although I've read discussions on them looking
>> to add it).
>>     
>
> There are some patches to do in QEMU's cow format for KVM. That's
> user level only.
>   

And thus, doesn't work for sharing between different images, especially 
at runtime. I'd really, really [any number of reallies], really like to 
see btrfs deduplication.

-- 
error compiling committee.c: too many arguments to function


* Re: Data-deduplication?
  2008-10-15 13:39   ` Data-deduplication? Avi Kivity
@ 2008-10-15 14:15     ` Andi Kleen
  2008-10-15 14:43       ` Data-deduplication? Miguel Sousa Filipe
  2008-10-15 17:49       ` Data-deduplication? Avi Kivity
  0 siblings, 2 replies; 14+ messages in thread
From: Andi Kleen @ 2008-10-15 14:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Andi Kleen, Ray Van Dolson, linux-btrfs

On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> >Ray Van Dolson <rayvd@bludgeon.org> writes:
> >
> >  
> >>I recall there being a thread here a number of months back regarding
> >>data-deduplication support for btrfs.
> >>
> >>Did anyone end up picking that up and giving a go at it?  Block level
> >>data dedup would be *awesome* in a Linux filesystem.  It does wonders
> >>for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
> >>have this feature yet (although I've read discussions on them looking
> >>to add it).
> >>    
> >
> >There are some patches to do in QEMU's cow format for KVM. That's
> >user level only.
> >  
> 
> And thus, doesn't work for sharing between different images, especially 
> at runtime. 

It would work if the images are all based on one reference image, wouldn't it?
I would imagine that's the common situation for installing lots of VMs.

> I'd really, really [any number of reallies], really like to 
> see btrfs deduplication.

Sure it would be useful for a couple of things.

-Andi

-- 
ak@linux.intel.com

* Re: Data-deduplication?
  2008-10-15 14:15     ` Data-deduplication? Andi Kleen
@ 2008-10-15 14:43       ` Miguel Sousa Filipe
  2008-10-15 15:00         ` Data-deduplication? Andi Kleen
  2008-10-15 17:49       ` Data-deduplication? Avi Kivity
  1 sibling, 1 reply; 14+ messages in thread
From: Miguel Sousa Filipe @ 2008-10-15 14:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Avi Kivity, Ray Van Dolson, linux-btrfs

On Wed, Oct 15, 2008 at 3:15 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote:
>> Andi Kleen wrote:
>> >Ray Van Dolson <rayvd@bludgeon.org> writes:
>> >
>> >
>> >>I recall there being a thread here a number of months back regarding
>> >>data-deduplication support for btrfs.
>> >>
>> >>Did anyone end up picking that up and giving a go at it?  Block level
>> >>data dedup would be *awesome* in a Linux filesystem.  It does wonders
>> >>for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
>> >>have this feature yet (although I've read discussions on them looking
>> >>to add it).
>> >>
>> >
>> >There are some patches to do in QEMU's cow format for KVM. That's
>> >user level only.
>> >
>>
>> And thus, doesn't work for sharing between different images, especially
>> at runtime.
>
> It would work if the images are all based once on a reference image, won't it?
> I would imagine that's the common situation for installing lots of VMs.

Like, using bcp (btrfs specific cp) for creating "new" images from a base one?
Will that suffice?
With modifications after that being COW, that could be a simple way of
having a "stupid/hack" no duplication.


>
>> I'd really, really [any number of reallies], really like to
>> see btrfs deduplication.
>
> Sure it would be useful for a couple of things.
>
> -Andi
>
> --
> ak@linux.intel.com



-- 
Miguel Sousa Filipe

* Re: Data-deduplication?
  2008-10-15 14:43       ` Data-deduplication? Miguel Sousa Filipe
@ 2008-10-15 15:00         ` Andi Kleen
  0 siblings, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2008-10-15 15:00 UTC (permalink / raw)
  To: Miguel Sousa Filipe; +Cc: Andi Kleen, Avi Kivity, Ray Van Dolson, linux-btrfs

> Like, using bcp (btrfs specific cp) for creating "new" images from a base one?
> Will that suffice?
> With modifications after that being COW, that could be a simple way of
> having a "stupid/hack" no duplication.

qcow already supports that. The challenge is just to deduplicate 
later e.g. when you start applying security updates to the images.

-Andi

* Re: Data-deduplication?
  2008-10-15 14:15     ` Data-deduplication? Andi Kleen
  2008-10-15 14:43       ` Data-deduplication? Miguel Sousa Filipe
@ 2008-10-15 17:49       ` Avi Kivity
  1 sibling, 0 replies; 14+ messages in thread
From: Avi Kivity @ 2008-10-15 17:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ray Van Dolson, linux-btrfs

Andi Kleen wrote:

   

>>> There are some patches to do in QEMU's cow format for KVM. That's
>>> user level only.
>>>  
>>>       
>> And thus, doesn't work for sharing between different images, especially 
>> at runtime. 
>>     
>
> It would work if the images are all based once on a reference image, won't it?
>   

Yes and no.  It's difficult to do it at runtime, and it allows one qemu 
to access another guest's data (for read-only).

Also, it's almost impossible to do at runtime.


-- 
error compiling committee.c: too many arguments to function


* Re: Data-deduplication?
  2008-10-13 11:02 ` Data-deduplication? Chris Mason
@ 2008-10-16 19:25   ` Valerie Aurora Henson
  2008-10-16 19:30     ` Data-deduplication? Chris Mason
  2008-10-17 20:10     ` Data-deduplication? Christoph Hellwig
  0 siblings, 2 replies; 14+ messages in thread
From: Valerie Aurora Henson @ 2008-10-16 19:25 UTC (permalink / raw)
  To: Chris Mason; +Cc: Ray Van Dolson, linux-btrfs

On Mon, Oct 13, 2008 at 07:02:14AM -0400, Chris Mason wrote:
> On Sat, 2008-10-11 at 19:06 -0700, Ray Van Dolson wrote:
> > I recall there being a thread here a number of months back regarding
> > data-deduplication support for btrfs.
> > 
> > Did anyone end up picking that up and giving a go at it?  Block level
> > data dedup would be *awesome* in a Linux filesystem.  It does wonders
> > for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
> > have this feature yet (although I've read discussions on them looking
> > to add it).
> > 
> 
> So far nobody has grabbed this one, but I've had more requests (no
> shocker there, the kvm people are interested in it too).  It probably
> won't make 1.0 but the disk format will be able to support it.

Both deduplication and compression have an interesting side effect in
which a write to a previously "allocated" block can return ENOSPC.
This is even more exciting when you factor in mmap.  Any thoughts on
how to handle this?
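
For illustration only (this sketch is not from the thread): a toy Python
model of why an overwrite can hit ENOSPC under deduplication.  The
DedupStore class and its fields are hypothetical, and it says nothing
about how btrfs would actually implement this.  Two logical blocks with
identical contents share one physical block; overwriting one of them with
different data needs a fresh physical block, which may not exist.

```python
import errno
import hashlib


class DedupStore:
    """Toy block store in which identical blocks share one physical block."""

    def __init__(self, physical_blocks):
        self.capacity = physical_blocks
        self.refs = {}     # content digest -> number of logical blocks using it
        self.logical = {}  # logical block number -> content digest

    def write(self, lbn, data):
        new = hashlib.sha256(data).hexdigest()
        old = self.logical.get(lbn)
        if new == old:
            return
        needs_block = new not in self.refs
        # Simplification: a block whose last reference goes away is reused
        # immediately; a real COW filesystem would release it later.
        frees_block = old is not None and self.refs[old] == 1
        if needs_block and not frees_block and len(self.refs) >= self.capacity:
            raise OSError(errno.ENOSPC, "no free physical block")
        if old is not None:
            self.refs[old] -= 1
            if self.refs[old] == 0:
                del self.refs[old]
        self.refs[new] = self.refs.get(new, 0) + 1
        self.logical[lbn] = new


store = DedupStore(physical_blocks=1)
store.write(0, b"A")
store.write(1, b"A")      # deduplicated: still one physical block in use
try:
    store.write(1, b"B")  # overwrite of an already "allocated" logical block
    result = "ok"
except OSError as e:
    result = errno.errorcode[e.errno]
```

Here the application sees two blocks it has already written, yet the
third write fails because the sharing has to be broken.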

-VAL

* Re: Data-deduplication?
  2008-10-16 19:25   ` Data-deduplication? Valerie Aurora Henson
@ 2008-10-16 19:30     ` Chris Mason
  2008-10-17 18:24       ` Data-deduplication? Valerie Aurora Henson
  2008-10-17 20:10     ` Data-deduplication? Christoph Hellwig
  1 sibling, 1 reply; 14+ messages in thread
From: Chris Mason @ 2008-10-16 19:30 UTC (permalink / raw)
  To: Valerie Aurora Henson; +Cc: Ray Van Dolson, linux-btrfs

On Thu, 2008-10-16 at 15:25 -0400, Valerie Aurora Henson wrote:
> On Mon, Oct 13, 2008 at 07:02:14AM -0400, Chris Mason wrote:
> > On Sat, 2008-10-11 at 19:06 -0700, Ray Van Dolson wrote:
> > > I recall there being a thread here a number of months back regarding
> > > data-deduplication support for btrfs.
> > > 
> > > Did anyone end up picking that up and giving a go at it?  Block level
> > > data dedup would be *awesome* in a Linux filesystem.  It does wonders
> > > for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
> > > have this feature yet (although I've read discussions on them looking
> > > to add it).
> > > 
> > 
> > So far nobody has grabbed this one, but I've had more requests (no
> > shocker there, the kvm people are interested in it too).  It probably
> > won't make 1.0 but the disk format will be able to support it.
> 
> Both deduplication and compression have an interesting side effect in
> which a write to a previously "allocated" block can return ENOSPC.
> This is even more exciting when you factor in mmap.  Any thoughts on
> how to handle this?

Unfortunately we'll have a number of places where ENOSPC will jump in
where people don't expect it, and this includes any COW overwrite of an
existing extent.  The old extent isn't freed until snapshot deletion
time, which won't happen until after the current transaction commits.

Another example is fallocate.  The extent will have a little flag that
says I'm a preallocated extent, which is how we'll know we're allowed to
overwrite it directly instead of doing COW.

But, to write to the fallocated extent, we'll have to clear the flag.
So, we'll have to cow the block that holds the file extent pointer,
which means we can enospc.
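
For illustration only (not part of the original message): a minimal
Python sketch of the COW overwrite case.  CowFs and its methods are
hypothetical names, and the model simplifies by releasing old copies at
commit, whereas the message above says they are freed at snapshot
deletion time after the commit.  The point is only that the new copy is
allocated before the old one is given back.

```python
import errno


class CowFs:
    """Toy COW allocator: overwritten blocks are released only at commit."""

    def __init__(self, total_blocks):
        self.free = total_blocks
        self.pending_free = 0  # old copies waiting for the transaction to commit

    def _alloc(self, nblocks):
        if nblocks > self.free:
            raise OSError(errno.ENOSPC, "no space for new copy")
        self.free -= nblocks

    def write_new(self, nblocks):
        self._alloc(nblocks)

    def overwrite(self, nblocks):
        self._alloc(nblocks)          # the new copy is allocated first
        self.pending_free += nblocks  # the old copy is only queued for release

    def commit(self):
        self.free += self.pending_free
        self.pending_free = 0


fs = CowFs(total_blocks=10)
fs.write_new(8)          # 2 blocks left
try:
    fs.overwrite(4)      # net usage would not change, but 4 free blocks are needed now
    outcome = "ok"
except OSError as e:
    outcome = errno.errorcode[e.errno]
fs.overwrite(2)          # fits in the remaining free space
free_before_commit = fs.free
fs.commit()
free_after_commit = fs.free
```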

-chris



* Re: Data-deduplication?
  2008-10-16 19:30     ` Data-deduplication? Chris Mason
@ 2008-10-17 18:24       ` Valerie Aurora Henson
  2008-10-20  0:16         ` Data-deduplication? Chris Mason
  0 siblings, 1 reply; 14+ messages in thread
From: Valerie Aurora Henson @ 2008-10-17 18:24 UTC (permalink / raw)
  To: Chris Mason; +Cc: Ray Van Dolson, linux-btrfs

On Thu, Oct 16, 2008 at 03:30:49PM -0400, Chris Mason wrote:
> On Thu, 2008-10-16 at 15:25 -0400, Valerie Aurora Henson wrote:
> > 
> > Both deduplication and compression have an interesting side effect in
> > which a write to a previously "allocated" block can return ENOSPC.
> > This is even more exciting when you factor in mmap.  Any thoughts on
> > how to handle this?
> 
> Unfortunately we'll have a number of places where ENOSPC will jump in
> where people don't expect it, and this includes any COW overwrite of an
> existing extent.  The old extent isn't freed until snapshot deletion
> time, which won't happen until after the current transaction commits.
> 
> Another example is fallocate.  The extent will have a little flag that
> says I'm a preallocated extent, which is how we'll know we're allowed to
> overwrite it directly instead of doing COW.
> 
> But, to write to the fallocated extent, we'll have to clear the flag.
> So, we'll have to cow the block that holds the file extent pointer,
> which means we can enospc.

I'm sure you know this, but for the peanut gallery: You can avoid some
of these sort of purely copy-on-write ENOSPC cases.  Any operation
where the space used afterwards is less than or equal to the space
used before - like in your fallocate case - can avoid ENOSPC as long
as you reserve a certain amount of space on the fs and break down the
changes into small enough groups.  Most file systems don't let you
fill up beyond 90-95% anyway because performance goes to hell.  You
also need to do this so you can delete when your file system is full.
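
For illustration only (not part of the original message): a small Python
sketch of the "reserve plus small groups" idea, under the assumption that
each group's old copies are released when that group commits.  The
function cow_rewrite and its parameters are hypothetical.

```python
import errno


def cow_rewrite(free_blocks, nblocks, batch):
    """Rewrite nblocks in place via COW, committing after each batch.

    Returns the number of commits. Each batch needs `batch` free blocks for
    the new copies; the old copies are released when the batch commits.
    """
    commits = 0
    remaining = nblocks
    while remaining > 0:
        step = min(batch, remaining)
        if step > free_blocks:
            raise OSError(errno.ENOSPC, "batch larger than reserve")
        free_blocks -= step   # allocate the new copies
        free_blocks += step   # commit: old copies are released
        commits += 1
        remaining -= step
    return commits


# A 4-block reserve is enough to rewrite 100 blocks, 4 at a time.
small_batches = cow_rewrite(free_blocks=4, nblocks=100, batch=4)

# The same rewrite fails if it is attempted in groups larger than the reserve.
try:
    cow_rewrite(free_blocks=4, nblocks=100, batch=8)
    big_batches = "ok"
except OSError as e:
    big_batches = errno.errorcode[e.errno]
```

The space used before and after is the same in both runs; only the group
size relative to the reserve decides whether ENOSPC appears.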

In general, it'd be nice to say that if your app can't handle surprise
ENOSPC, then if you run without snapshots, compression, or data dedup,
we guarantee you'll only get ENOSPC in the "normal" cases.  What do
you think?

-VAL

* Re: Data-deduplication?
  2008-10-16 19:25   ` Data-deduplication? Valerie Aurora Henson
  2008-10-16 19:30     ` Data-deduplication? Chris Mason
@ 2008-10-17 20:10     ` Christoph Hellwig
  1 sibling, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2008-10-17 20:10 UTC (permalink / raw)
  To: Valerie Aurora Henson; +Cc: Chris Mason, Ray Van Dolson, linux-btrfs

On Thu, Oct 16, 2008 at 03:25:01PM -0400, Valerie Aurora Henson wrote:
> Both deduplication and compression have an interesting side effect in
> which a write to a previously "allocated" block can return ENOSPC.
> This is even more exciting when you factor in mmap.  Any thoughts on
> how to handle this?

Note that this can already happen in today's filesystems.  Writing into
some preallocated space can always cause splits of the allocation or
bmap btrees as the previous big preallocated extent now is split into
one allocated and at least one (or two if writing into the middle)
preallocated extents.
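
For illustration only (not part of the original message): a Python sketch
of the split described above.  The function split_prealloc and the
(start, length, state) record layout are hypothetical; real filesystems
keep these records in btrees, and it is the extra records that can force
a btree node to split and need new space.

```python
def split_prealloc(ext_start, ext_len, wr_start, wr_len):
    """Split one preallocated extent around a write that lies inside it.

    Returns a list of (start, length, state) records.
    """
    ext_end = ext_start + ext_len
    wr_end = wr_start + wr_len
    if not (ext_start <= wr_start and wr_end <= ext_end and wr_len > 0):
        raise ValueError("write must lie inside the extent")
    pieces = []
    if wr_start > ext_start:
        pieces.append((ext_start, wr_start - ext_start, "prealloc"))
    pieces.append((wr_start, wr_len, "allocated"))
    if wr_end < ext_end:
        pieces.append((wr_end, ext_end - wr_end, "prealloc"))
    return pieces


at_start = split_prealloc(0, 100, 0, 10)    # one record becomes two
in_middle = split_prealloc(0, 100, 40, 10)  # one record becomes three
```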


* Re: Data-deduplication?
  2008-10-17 18:24       ` Data-deduplication? Valerie Aurora Henson
@ 2008-10-20  0:16         ` Chris Mason
  2008-10-21 20:33           ` Data-deduplication? Valerie Aurora Henson
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Mason @ 2008-10-20  0:16 UTC (permalink / raw)
  To: Valerie Aurora Henson; +Cc: Ray Van Dolson, linux-btrfs

On Fri, 2008-10-17 at 14:24 -0400, Valerie Aurora Henson wrote:
> On Thu, Oct 16, 2008 at 03:30:49PM -0400, Chris Mason wrote:
> > On Thu, 2008-10-16 at 15:25 -0400, Valerie Aurora Henson wrote:
> > > 
> > > Both deduplication and compression have an interesting side effect in
> > > which a write to a previously "allocated" block can return ENOSPC.
> > > This is even more exciting when you factor in mmap.  Any thoughts on
> > > how to handle this?
> > 
> > Unfortunately we'll have a number of places where ENOSPC will jump in
> > where people don't expect it, and this includes any COW overwrite of an
> > existing extent.  The old extent isn't freed until snapshot deletion
> > time, which won't happen until after the current transaction commits.
> > 
> > Another example is fallocate.  The extent will have a little flag that
> > says I'm a preallocated extent, which is how we'll know we're allowed to
> > overwrite it directly instead of doing COW.
> > 
> > But, to write to the fallocated extent, we'll have to clear the flag.
> > So, we'll have to cow the block that holds the file extent pointer,
> > which means we can enospc.
> 
> I'm sure you know this, but for the peanut gallery: You can avoid some
> of these sort of purely copy-on-write ENOSPC cases.  Any operation
> where the space used afterwards is less than or equal to the space
> used before - like in your fallocate case - can avoid ENOSPC as long
> as you reserve a certain amount of space on the fs and break down the
> changes into small enough groups.  Most file systems don't let you
> fill up beyond 90-95% anyway because performance goes to hell.  You
> also need to do this so you can delete when your file system is full.
> 
> In general, it'd be nice to say that if your app can't handle surprise
> ENOSPC, then if you run without snapshots, compression, or data dedup,
> we guarantee you'll only get ENOSPC in the "normal" cases.  What do
> you think?

I think I'll have to come back to this after getting ENOSPC to work at
all ;)  You're right that reserved space can do wonders to dig us out of
holes, it has to be reserved at a multiple of the number of procs that I
allow into the transaction.

I should be able to go into an emergency one-writer-at-a-time mode as
space gets really tight, but there are lots of missing pieces that
haven't been coded yet in that area.
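
For illustration only (not part of the original message, and not btrfs
code): a Python sketch of sizing admission to a transaction from the
reserve.  The function writers_allowed, its parameters, and the numbers
are all hypothetical.

```python
def writers_allowed(free_blocks, worst_case_per_writer, max_writers):
    """How many writers can join a transaction without risking ENOSPC.

    Every admitted writer must be able to consume its worst case, so the
    free space has to cover worst_case_per_writer * writers.
    """
    if worst_case_per_writer <= 0:
        raise ValueError("worst_case_per_writer must be positive")
    return min(max_writers, free_blocks // worst_case_per_writer)


plenty = writers_allowed(free_blocks=10_000, worst_case_per_writer=64, max_writers=16)
tight = writers_allowed(free_blocks=100, worst_case_per_writer=64, max_writers=16)
full = writers_allowed(free_blocks=10, worst_case_per_writer=64, max_writers=16)
```

With plenty of space all 16 writers are admitted; as space gets tight the
count drops to a single writer, and to zero when not even one worst case
fits.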

-chris



* Re: Data-deduplication?
  2008-10-20  0:16         ` Data-deduplication? Chris Mason
@ 2008-10-21 20:33           ` Valerie Aurora Henson
  0 siblings, 0 replies; 14+ messages in thread
From: Valerie Aurora Henson @ 2008-10-21 20:33 UTC (permalink / raw)
  To: Chris Mason; +Cc: Ray Van Dolson, linux-btrfs

On Sun, Oct 19, 2008 at 08:16:31PM -0400, Chris Mason wrote:
> 
> I think I'll have to come back to this after getting ENOSPC to work at
> all ;)  You're right that reserved space can do wonders to dig us out of

:) Having been through this before, the ENOSPC accounting was
incredibly hard to get right.  It's at least worth thinking about the
edge cases while you're writing the first version, although you will
probably just have to throw one away no matter what.

> holes, it has to be reserved at a multiple of the number of procs that I
> allow into the transaction.
> 
> I should be able to go into an emergency one writer at a time theme as
> space gets really tight, but there are lots of missing pieces that
> haven't been coded yet in that area.

Makes sense.

I have the following "behave like I expect" rules for things that
often aren't right in the first version of a COW file system.

* If a write could succeed in the future without any user-level
  changes to the file system, then it will succeed the first time.

Basically, this is reflecting what happens when space used by the
previous version of the fs is freed after the next COW version is
written out.  A naive implementation of COW will fail the write if it
happens while enough other writes are outstanding, even if there would
be enough space after the other writes have been synced to disk and
the blocks from the old version are freed.  This means backing off to
the one-writer-at-a-time mode you are talking about.

* Rewriting metadata will always succeed.

Again, with naive COW, you can get into a state where doing a chmod()
on a file could end up returning ENOSPC.  Totally uncool.  Pretty much
just requires a little reserved space.

* Deletion will always succeed.

Again, reserved space, plus a little forethought in metadata design.
It is not automatically the case that your metadata will be designed
such that deletion will always result in more free space afterwards,
so it's worth a review pass just to be sure.

One thing I ran into before is that it's non-trivial to calculate
exactly how many blocks will need to be COW'd for even the tiniest
write.  Leaves split, directories grow another block, the inode block
has to be copied, the tree grows another level, you have to allocate a
new free space extent, etc., etc.  The worst case can be hundreds of
KB per 1-byte write.  Logically, you may only be writing a few bytes,
but they may require megabytes of free space to sync out to disk.
Very annoying.
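
For illustration only (not part of the original message): a
back-of-the-envelope Python estimate of that worst case in a toy B-tree
model.  The function worst_case_cow_bytes, the tree height, the number of
trees touched, and the 4096-byte block size are all assumptions chosen
for the example, not measurements of any real filesystem.

```python
def worst_case_cow_bytes(tree_height, trees_touched, block_size=4096):
    """Upper bound on bytes COW'd for one tiny write in a toy B-tree model.

    Assumes every node on the root-to-leaf path splits (two blocks per
    level) and the tree grows one level (one extra block for a new root),
    in each tree the write touches, plus one block for the data itself.
    """
    metadata_blocks = trees_touched * (2 * tree_height + 1)
    return (metadata_blocks + 1) * block_size


# Three trees of height 4: 28 blocks, i.e. 112 KiB, for a 1-byte write.
estimate = worst_case_cow_bytes(tree_height=4, trees_touched=3)
```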

-VAL
