* btrfs wastes disk space after snapshot deletion.
@ 2013-02-04 9:08 Moshe
2013-02-04 15:56 ` Josef Bacik
2013-02-06 2:34 ` Liu Bo
0 siblings, 2 replies; 7+ messages in thread
From: Moshe @ 2013-02-04 9:08 UTC (permalink / raw)
To: linux-btrfs
Hello,
If I write a large sequential file on a snapshot, then create another snapshot,
overwrite the file with a small amount of data, and delete the first snapshot,
the second snapshot is left with a very large data extent of which only a small
part is used.
For example, if I use the following sequence:
mkfs.btrfs /dev/sdn
mount -o noatime,nodatacow,nospace_cache /dev/sdn /mnt/b
btrfs sub snap /mnt/b /mnt/b/snap1
dd if=/dev/zero of=/mnt/b/snap1/t count=15000 bs=65535
sync
btrfs sub snap /mnt/b/snap1 /mnt/b/snap2
dd if=/dev/zero of=/mnt/b/snap2/t seek=3 count=1 bs=2048
sync
btrfs sub delete /mnt/b/snap1
btrfs-debug-tree /dev/sdn
I see the following data extents:
item 6 key (257 EXTENT_DATA 0) itemoff 3537 itemsize 53
extent data disk byte 1103101952 nr 194641920
extent data offset 0 nr 4096 ram 194641920
extent compression 0
item 7 key (257 EXTENT_DATA 4096) itemoff 3484 itemsize 53
extent data disk byte 2086129664 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0
In item 6, only 4096 of the 194641920 bytes (about 185 MiB) are in use; the rest of the space is wasted.
If I defragment with "btrfs filesystem defragment /mnt/b/snap2/t", the wasted
space is released. But I cannot use defragmentation, because with several
snapshots I would need to run it on each snapshot, and that breaks the
relation between snapshots and creates multiple copies of the same data.
In our tests, which create and delete snapshots while writing data, we end up
with a few GB of disk space wasted.
Is it possible to limit the size of allocated data extents?
Is it possible to defragment a subvolume without breaking the relations between snapshots?
Any other ideas for recovering the wasted space?
Thanks,
Moshe Melnikov
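A rough way to quantify the waste described above is to compare what btrfs reports as used data space against the logical size of the files that are still reachable. This is only a sketch using the mount point from the reproduction; exact numbers will also include metadata overhead.
  # space consumed by allocated data extents
  btrfs filesystem df /mnt/b
  # logical size of what is actually still referenced by files
  du -sh --apparent-size /mnt/b
  # a large gap between "Data ... used" and the apparent size points at
  # extents that are still allocated but mostly unreferenced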
* Re: btrfs wastes disk space after snapshot deletion.
2013-02-04 9:08 btrfs wastes disk space after snapshot deletion Moshe
@ 2013-02-04 15:56 ` Josef Bacik
2013-02-05 9:09 ` Moshe
2013-02-06 2:34 ` Liu Bo
1 sibling, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2013-02-04 15:56 UTC (permalink / raw)
To: Moshe; +Cc: linux-btrfs@vger.kernel.org
On Mon, Feb 04, 2013 at 02:08:01AM -0700, Moshe wrote:
> Hello,
>
> If I write a large sequential file on a snapshot, then create another snapshot,
> overwrite the file with a small amount of data, and delete the first snapshot,
> the second snapshot is left with a very large data extent of which only a small
> part is used.
> For example, if I use the following sequence:
> mkfs.btrfs /dev/sdn
> mount -o noatime,nodatacow,nospace_cache /dev/sdn /mnt/b
> btrfs sub snap /mnt/b /mnt/b/snap1
> dd if=/dev/zero of=/mnt/b/snap1/t count=15000 bs=65535
> sync
> btrfs sub snap /mnt/b/snap1 /mnt/b/snap2
> dd if=/dev/zero of=/mnt/b/snap2/t seek=3 count=1 bs=2048
> sync
> btrfs sub delete /mnt/b/snap1
> btrfs-debug-tree /dev/sdn
> I see the following data extents:
> item 6 key (257 EXTENT_DATA 0) itemoff 3537 itemsize 53
> extent data disk byte 1103101952 nr 194641920
> extent data offset 0 nr 4096 ram 194641920
> extent compression 0
> item 7 key (257 EXTENT_DATA 4096) itemoff 3484 itemsize 53
> extent data disk byte 2086129664 nr 4096
> extent data offset 0 nr 4096 ram 4096
> extent compression 0
>
> In item 6, only 4096 of the 194641920 bytes (about 185 MiB) are in use; the rest of the space is wasted.
>
> If I defragment with "btrfs filesystem defragment /mnt/b/snap2/t", the wasted
> space is released. But I cannot use defragmentation, because with several
> snapshots I would need to run it on each snapshot, and that breaks the
> relation between snapshots and creates multiple copies of the same data.
>
> In our tests, which create and delete snapshots while writing data, we end up
> with a few GB of disk space wasted.
>
> Is it possible to limit the size of allocated data extents?
> Is it possible to defragment a subvolume without breaking the relations between snapshots?
> Any other ideas for recovering the wasted space?
This is all by design, to try to limit the size of the extent tree. Instead of
splitting references in the extent tree to account for the split extent, we do
it in the file tree. In your case it results in a lot of wasted space. This is
on the list of things to fix; we will just split the references in the extent
tree and deal with the larger extent tree, but it's on the back burner while we
get things a bit more stable. Thanks,
Josef
* Re: btrfs wastes disk space after snapshot deletion.
2013-02-04 15:56 ` Josef Bacik
@ 2013-02-05 9:09 ` Moshe
2013-02-05 14:41 ` Josef Bacik
0 siblings, 1 reply; 7+ messages in thread
From: Moshe @ 2013-02-05 9:09 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs
Thanks for your reply, Josef.
I want to experiment with extent sizes to see how they influence the size of
the extent tree. Can you point me to the code I could change to limit the size
of data extents?
Thanks,
Moshe Melnikov
* Re: btrfs wastes disk space after snapshot deletion.
2013-02-05 9:09 ` Moshe
@ 2013-02-05 14:41 ` Josef Bacik
2013-02-05 15:27 ` Moshe Melnikov
0 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2013-02-05 14:41 UTC (permalink / raw)
To: Moshe; +Cc: Josef Bacik, linux-btrfs@vger.kernel.org
On Tue, Feb 05, 2013 at 02:09:02AM -0700, Moshe wrote:
> Thanks for your reply, Josef.
> I want to experiment with extent sizes to see how they influence the size of
> the extent tree. Can you point me to the code I could change to limit the
> size of data extents?
So it's not the size of the data extents, it's how we deal with the references
to them. Let me map out what happens now:
1) we do a write and create a 1 gig data extent.
2) create a file extent item in the fs tree pointing to the extent
3) create a reference with a count of 1 for the entire extent
4) create a snapshot of the data extent
5) write 4k to the middle of the extent
6a) we cow down to the file extent item we need to split and add a ref to the
original 1 gig extent because of the snapshot.
6b) split the file extent item in the fs tree into 3 extents.
- one from 0 to the random offset
- one from random offset to random offset + 4k
- one from random offset + 4k to the end of the original extent
this points to an offset within the original 1 gig extent
6c) in the split we increase the refcount of the original 1 gig extent by 1
7) add an extent reference for the 4k extent we wrote.
So at the end of this our original 1 gig extent has 3 references: 1 for the
original snapshot with its unmodified extent, and 2 for the snapshot that
includes a reference to each chunk of the split extent. In order to free up
this space you would have to overwrite the entirety of the remaining chunks of
the original extent in the snapshot and also free up the extent in the original
fs by some means.
So say you delete the file in the original file system and then do something
horrible like overwrite every other 4k block in the file: you'd end up with
around 1.5 gig of data in use for what is logically 1 gig of actual space. The
way to fix this is in 6c.
In file.c you have __btrfs_drop_extents, which does this btrfs_inc_extent_ref
on an extent it has to split on two sides. Instead of doing this we would
probably add another delayed extent operation for splitting the extent
reference. So instead of having file extents that span large areas and stick
around forever, we would just fix the extent references to account for the
actual file extents, so when you drop a part you actually recover the space.
There is no code for this yet because this is kind of an overhaul of how things
are done, and I'm still getting "if I do blah it panics the box" emails, so I
want to spend time stabilizing. If this is something you want to tackle, go for
it, but be prepared to spend a few months on it. Thanks,
Josef
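As a rough way to watch the walkthrough above happen on disk, the reference counts can be inspected with btrfs-debug-tree on a scratch filesystem. This is only a sketch: /dev/sdX and /mnt/test are example names, a 1 gig write may be split into a few large extents rather than exactly one, and the exact output format depends on the btrfs-progs version.
  # scratch filesystem (example device)
  mkfs.btrfs /dev/sdX
  mount /dev/sdX /mnt/test
  # steps 1-3: one large sequential write -> a few large data extents, each with refs 1
  dd if=/dev/zero of=/mnt/test/f bs=1M count=1024
  sync
  btrfs-debug-tree /dev/sdX | grep -A1 EXTENT_ITEM   # note the "refs" lines for the large data extents
  # step 4: snapshot the subvolume
  btrfs sub snap /mnt/test /mnt/test/snap
  # steps 5-7: overwrite 4k in the middle of the file in the snapshot, keeping the rest
  dd if=/dev/zero of=/mnt/test/snap/f bs=4096 count=1 seek=131072 conv=notrunc
  sync
  btrfs-debug-tree /dev/sdX | grep -A1 EXTENT_ITEM   # the split extent now carries additional refs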
* Re: btrfs wastes disk space after snapshot deletion.
2013-02-05 14:41 ` Josef Bacik
@ 2013-02-05 15:27 ` Moshe Melnikov
2013-02-06 2:24 ` Liu Bo
0 siblings, 1 reply; 7+ messages in thread
From: Moshe Melnikov @ 2013-02-05 15:27 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs
Is it possible in step 1) to create a few smaller extents instead of one 1 gig
data extent?
Moshe
* Re: btrfs wastes disk space after snapshot deletion.
2013-02-05 15:27 ` Moshe Melnikov
@ 2013-02-06 2:24 ` Liu Bo
0 siblings, 0 replies; 7+ messages in thread
From: Liu Bo @ 2013-02-06 2:24 UTC (permalink / raw)
To: Moshe Melnikov; +Cc: Josef Bacik, linux-btrfs
On Tue, Feb 05, 2013 at 05:27:45PM +0200, Moshe Melnikov wrote:
>
> Is it possible in step 1) to create a few smaller extents instead of one
> 1 gig data extent?
DIO or O_SYNC can help to create extents whose size matches your 'bs=xxx',
but as you know, this is not expected to be as fast as buffered writes.
thanks,
liubo
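For example, rewriting the large write from the original reproduction as direct I/O should leave each write in its own small extent rather than one huge extent. This is only a sketch reusing the paths from the report above; O_DIRECT needs an aligned block size, so 64K is used instead of 65535, and oflag=sync would behave similarly (and also slowly).
  # same amount of data, but each 64K write is allocated as its own extent
  dd if=/dev/zero of=/mnt/b/snap1/t bs=64K count=15000 oflag=direct
  sync
  # the file tree should now show many small EXTENT_DATA items
  btrfs-debug-tree /dev/sdn | grep EXTENT_DATA | head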
* Re: btrfs wastes disk space after snapshot deletion.
2013-02-04 9:08 btrfs wastes disk space after snapshot deletion Moshe
2013-02-04 15:56 ` Josef Bacik
@ 2013-02-06 2:34 ` Liu Bo
1 sibling, 0 replies; 7+ messages in thread
From: Liu Bo @ 2013-02-06 2:34 UTC (permalink / raw)
To: Moshe; +Cc: linux-btrfs
On Mon, Feb 04, 2013 at 11:08:01AM +0200, Moshe wrote:
> Hello,
>
> If I write a large sequential file on a snapshot, then create another
> snapshot, overwrite the file with a small amount of data, and delete the
> first snapshot, the second snapshot is left with a very large data extent
> of which only a small part is used.
> For example, if I use the following sequence:
> mkfs.btrfs /dev/sdn
> mount -o noatime,nodatacow,nospace_cache /dev/sdn /mnt/b
> btrfs sub snap /mnt/b /mnt/b/snap1
> dd if=/dev/zero of=/mnt/b/snap1/t count=15000 bs=65535
> sync
> btrfs sub snap /mnt/b/snap1 /mnt/b/snap2
> dd if=/dev/zero of=/mnt/b/snap2/t seek=3 count=1 bs=2048
> sync
> btrfs sub delete /mnt/b/snap1
> btrfs-debug-tree /dev/sdn
> I see the following data extents:
> item 6 key (257 EXTENT_DATA 0) itemoff 3537 itemsize 53
> extent data disk byte 1103101952 nr 194641920
> extent data offset 0 nr 4096 ram 194641920
> extent compression 0
> item 7 key (257 EXTENT_DATA 4096) itemoff 3484 itemsize 53
> extent data disk byte 2086129664 nr 4096
> extent data offset 0 nr 4096 ram 4096
> extent compression 0
>
> In item 6, only 4096 of the 194641920 bytes (about 185 MiB) are in use; the rest of the space is wasted.
>
> If I defragment with "btrfs filesystem defragment /mnt/b/snap2/t", the
> wasted space is released. But I cannot use defragmentation, because with
> several snapshots I would need to run it on each snapshot, and that breaks
> the relation between snapshots and creates multiple copies of the same data.
Well, just for this case, you can try our experimental feature,
'snapshot-aware defrag', which is designed for exactly this kind of problem.
It's still floating on the ML, and I've no idea when it'll land
upstream.
Currently the latest patch is v6, and NOTE: if you want to use
autodefrag (which is recommended), you need to apply the v6 patch along
with another patch for autodefrag, otherwise it may crash your box.
FYI,
- snapshot-aware defrag
https://patchwork.kernel.org/patch/2058911/
- autodefrag fix
https://patchwork.kernel.org/patch/2058921/
thanks,
liubo
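With those two patches applied, usage would presumably look like the sketch below; the paths come from the original reproduction, and the exact behavior depends on the patch revision that finally lands.
  # explicitly defragment the file in the remaining snapshot; with
  # snapshot-aware defrag this should release the mostly-unused large
  # extent without duplicating data that is still shared
  btrfs filesystem defragment /mnt/b/snap2/t
  sync
  # going forward, autodefrag queues files that receive small random
  # writes for background defragmentation
  mount -o remount,autodefrag /dev/sdn /mnt/b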