* btrfs is using 25% more disk than it should
@ 2014-12-18 14:59 Daniele Testa
2014-12-19 18:53 ` Phillip Susi
2014-12-19 21:10 ` Josef Bacik
0 siblings, 2 replies; 22+ messages in thread
From: Daniele Testa @ 2014-12-18 14:59 UTC (permalink / raw)
To: linux-btrfs
Hey,
I am hoping you guys can shed some light on my issue. I know it's a
common question that people see differences in "disk used" depending on
how it is calculated, but I still think that my case is weird.
root@s4 / # mount
/dev/md3 on /opt/drives/ssd type btrfs
(rw,noatime,compress=zlib,discard,nospace_cache)
root@s4 / # btrfs filesystem df /opt/drives/ssd
Data: total=407.97GB, used=404.08GB
System, DUP: total=8.00MB, used=52.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.25GB, used=672.21MB
Metadata: total=8.00MB, used=0.00
root@s4 /opt/drives/ssd # ls -alhs
total 302G
4.0K drwxr-xr-x 1 root root 42 Dec 18 14:34 .
4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49 disk_208.img
0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu 0 Dec 18 10:08 snapshots
root@s4 /opt/drives/ssd # du -h
0 ./snapshots
302G .
As seen above, I have a 410GB SSD mounted at "/opt/drives/ssd". On
that partition, I have one single sparse file, using 302GB of space
(315GB apparent size). The snapshots directory is completely empty.
However, for some weird reason, btrfs seems to think it takes 404GB.
The big file is a disk that I use in a virtual server and when I write
stuff inside that virtual server, the disk-usage of the btrfs
partition on the host keeps increasing even if the sparse-file is
constant at 302GB. I even have 100GB of "free" disk-space inside that
virtual disk-file. Writing 1GB inside the virtual disk-file seems to
increase the usage by about 4-5GB on the "outside".
Does anyone have a clue about what is going on? How can the usage
differ and behave like this when I have just one single file? Is it
also normal to have 672MB of metadata for a single file?
Regards,
Daniele
* Re: btrfs is using 25% more disk than it should
2014-12-18 14:59 btrfs is using 25% more disk than it should Daniele Testa
@ 2014-12-19 18:53 ` Phillip Susi
2014-12-19 19:59 ` Daniele Testa
2014-12-19 21:10 ` Josef Bacik
1 sibling, 1 reply; 22+ messages in thread
From: Phillip Susi @ 2014-12-19 18:53 UTC (permalink / raw)
To: Daniele Testa, linux-btrfs
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 12/18/2014 9:59 AM, Daniele Testa wrote:
> As seen above, I have a 410GB SSD mounted at "/opt/drives/ssd". On
> that partition, I have one single starse file, taking 302GB of
> space (max 315GB). The snapshots directory is completely empty.
So you don't have any snapshots or other subvolumes?
> However, for some weird reason, btrfs seems to think it takes
> 404GB. The big file is a disk that I use in a virtual server and
> when I write stuff inside that virtual server, the disk-usage of
> the btrfs partition on the host keeps increasing even if the
> sparse-file is constant at 302GB. I even have 100GB of "free"
> disk-space inside that virtual disk-file. Writing 1GB inside the
> virtual disk-file seems to increase the usage about 4-5GB on the
> "outside".
Did you flag the file as nodatacow?
> Does anyone have a clue on what is going on? How can the
> difference and behaviour be like this when I just have one single
> file? Is it also normal to have 672MB of metadata for a single
> file?
You probably have the data checksums enabled and that isn't
unreasonable for checksums on 302g of data.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
iQEcBAEBAgAGBQJUlHQyAAoJENRVrw2cjl5RZWEIAKfdDzlNVrD/IYDZ5wzIeg5P
DR5H8anGGc2QPTAD76vEX/XA7/j1Kg+PbQRHGdz6Iq2+Vq4CGno/yIi46oVVVYaL
H4XvuH7GvPJyzHJ+XCMHjPGLrSCBxgIm1XSluNXmFNCwqi/FONk8TUhWsw7JchaZ
yCVe/82YI+MLZhmJdudt48MeNFzW6LYi58dQo/JfYnTGnpZAFutdgBM7vLmnqLY2
WVLQUNHZsHBa7solttCuRtc4h8ku9FBObfKKYNPAEn1YWfx7bihWgPeBMH/blsza
yhpMq96OMhIfn2SmIZMSwGh2ys+AxQQfymYR69fyGYTIajHmJEhJUzltuQD9Yg8=
=Z9/S
-----END PGP SIGNATURE-----
* Re: btrfs is using 25% more disk than it should
2014-12-19 18:53 ` Phillip Susi
@ 2014-12-19 19:59 ` Daniele Testa
2014-12-19 20:35 ` Phillip Susi
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Daniele Testa @ 2014-12-19 19:59 UTC (permalink / raw)
To: Phillip Susi; +Cc: linux-btrfs
No, I don't have any snapshots or subvolumes. Only that single file.
The file has both checksums and datacow on it. I will do "chattr +C"
on the parent dir and re-create the file to make sure all files are
marked as "nodatacow".
Should I also turn off checksums with the mount flags if this
filesystem only contains big VM files? Or is it not needed if I put +C
on the parent dir?
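For the record, that plan spelled out as commands would be roughly the
following (untested sketch: the scratch location is made up, the other
paths are the ones from this thread, and +C only takes effect for files
created after the attribute is set, hence the copy-off-and-back):
cp --sparse=always /opt/drives/ssd/disk_208.img /mnt/scratch/disk_208.img  # stash a copy elsewhere first
rm /opt/drives/ssd/disk_208.img
chattr +C /opt/drives/ssd                    # files created here from now on inherit nodatacow
cp --sparse=always /mnt/scratch/disk_208.img /opt/drives/ssd/disk_208.img  # re-create the image under +C
lsattr /opt/drives/ssd/disk_208.img          # should now show the C attribute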
2014-12-20 2:53 GMT+08:00 Phillip Susi <psusi@ubuntu.com>:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 12/18/2014 9:59 AM, Daniele Testa wrote:
>> As seen above, I have a 410GB SSD mounted at "/opt/drives/ssd". On
>> that partition, I have one single starse file, taking 302GB of
>> space (max 315GB). The snapshots directory is completely empty.
>
> So you don't have any snapshots or other subvolumes?
>
>> However, for some weird reason, btrfs seems to think it takes
>> 404GB. The big file is a disk that I use in a virtual server and
>> when I write stuff inside that virtual server, the disk-usage of
>> the btrfs partition on the host keeps increasing even if the
>> sparse-file is constant at 302GB. I even have 100GB of "free"
>> disk-space inside that virtual disk-file. Writing 1GB inside the
>> virtual disk-file seems to increase the usage about 4-5GB on the
>> "outside".
>
> Did you flag the file as nodatacow?
>
>> Does anyone have a clue on what is going on? How can the
>> difference and behaviour be like this when I just have one single
>> file? Is it also normal to have 672MB of metadata for a single
>> file?
>
> You probably have the data checksums enabled and that isn't
> unreasonable for checksums on 302g of data.
>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.17 (MingW32)
>
> iQEcBAEBAgAGBQJUlHQyAAoJENRVrw2cjl5RZWEIAKfdDzlNVrD/IYDZ5wzIeg5P
> DR5H8anGGc2QPTAD76vEX/XA7/j1Kg+PbQRHGdz6Iq2+Vq4CGno/yIi46oVVVYaL
> H4XvuH7GvPJyzHJ+XCMHjPGLrSCBxgIm1XSluNXmFNCwqi/FONk8TUhWsw7JchaZ
> yCVe/82YI+MLZhmJdudt48MeNFzW6LYi58dQo/JfYnTGnpZAFutdgBM7vLmnqLY2
> WVLQUNHZsHBa7solttCuRtc4h8ku9FBObfKKYNPAEn1YWfx7bihWgPeBMH/blsza
> yhpMq96OMhIfn2SmIZMSwGh2ys+AxQQfymYR69fyGYTIajHmJEhJUzltuQD9Yg8=
> =Z9/S
> -----END PGP SIGNATURE-----
* Re: btrfs is using 25% more disk than it should
2014-12-19 19:59 ` Daniele Testa
@ 2014-12-19 20:35 ` Phillip Susi
2014-12-19 21:15 ` Josef Bacik
2014-12-20 1:33 ` Duncan
2 siblings, 0 replies; 22+ messages in thread
From: Phillip Susi @ 2014-12-19 20:35 UTC (permalink / raw)
To: Daniele Testa; +Cc: linux-btrfs
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 12/19/2014 2:59 PM, Daniele Testa wrote:
> No, I don't have any snapshots or subvolumes. Only that single
> file.
>
> The file has both checksums and datacow on it. I will do "chattr
> +C" on the parent dir and re-create the file to make sure all files
> are marked as "nodatacow".
>
> Should I also turn off checksums with the mount-flags if this
> filesystem only contain big VM-files? Or is it not needed if I put
> +C on the parent dir?
If you don't want the overhead of those checksums, then yeah. Also I
would question why you are using btrfs to hold only big VM files in
the first place. You would be better off using LVM thinp volumes
instead of files, though personally I prefer to just use regular LVM
volumes and manually allocate enough space. That avoids the
fragmentation you get from thin provisioning (or qcow2) at the cost
of a bit of overallocated space and the need to do some manual
resizing to add more if and when it is needed.
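For reference, the LVM route would look roughly like this (volume group
and LV names are made up; sizes are the ones from this thread):
lvcreate -L 315G -n vm208 vg0                  # plain LV, manually sized
lvcreate -L 400G --thinpool pool0 vg0          # or: create a thin pool...
lvcreate -V 315G --thin -n vm208 vg0/pool0     # ...and a thin-provisioned LV inside it
You would then point the guest at /dev/vg0/vm208 instead of at an image file.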
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
iQEcBAEBAgAGBQJUlIwGAAoJENRVrw2cjl5RlGEH/1OYz07C/OjGBASA9IHTCVMV
NkYHnO3/s2+SOsafQj4ej/RifgX9aG43b8Y6z9XAdosG/X+8z7xRjW9Nic0H5beK
JZRpwP+02Dw02A3/RSPjGqJBeAmS8yi9yTlunnPaCau+m1kPYL4M/vFM8/hqrGeU
Jy+jbffX+XtOedBWptxnDVIyXpYskgVyH8AmQ9d3TGrv52jw/QY1BxkuoVG60hBU
Fk4Q8ed43C9zjCVihmkDOeER6Ygr1roDb1/gFLoeCk4FwVLO9Kusft2Qi2oXyHy1
iTkoVJan8NRzXBhrPtZexxQdewHSw9Z4wyHxlal3b/xIbRf6/DRwPRHfgG5djvM=
=AqC/
-----END PGP SIGNATURE-----
* Re: btrfs is using 25% more disk than it should
2014-12-18 14:59 btrfs is using 25% more disk than it should Daniele Testa
2014-12-19 18:53 ` Phillip Susi
@ 2014-12-19 21:10 ` Josef Bacik
2014-12-19 21:17 ` Josef Bacik
2014-12-21 3:04 ` Robert White
1 sibling, 2 replies; 22+ messages in thread
From: Josef Bacik @ 2014-12-19 21:10 UTC (permalink / raw)
To: Daniele Testa, linux-btrfs
On 12/18/2014 09:59 AM, Daniele Testa wrote:
> Hey,
>
> I am hoping you guys can shed some light on my issue. I know that it's
> a common question that people see differences in the "disk used" when
> running different calculations, but I still think that my issue is
> weird.
>
> root@s4 / # mount
> /dev/md3 on /opt/drives/ssd type btrfs
> (rw,noatime,compress=zlib,discard,nospace_cache)
>
> root@s4 / # btrfs filesystem df /opt/drives/ssd
> Data: total=407.97GB, used=404.08GB
> System, DUP: total=8.00MB, used=52.00KB
> System: total=4.00MB, used=0.00
> Metadata, DUP: total=1.25GB, used=672.21MB
> Metadata: total=8.00MB, used=0.00
>
> root@s4 /opt/drives/ssd # ls -alhs
> total 302G
> 4.0K drwxr-xr-x 1 root root 42 Dec 18 14:34 .
> 4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
> 302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49 disk_208.img
> 0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu 0 Dec 18 10:08 snapshots
>
> root@s4 /opt/drives/ssd # du -h
> 0 ./snapshots
> 302G .
>
> As seen above, I have a 410GB SSD mounted at "/opt/drives/ssd". On
> that partition, I have one single starse file, taking 302GB of space
> (max 315GB). The snapshots directory is completely empty.
>
> However, for some weird reason, btrfs seems to think it takes 404GB.
> The big file is a disk that I use in a virtual server and when I write
> stuff inside that virtual server, the disk-usage of the btrfs
> partition on the host keeps increasing even if the sparse-file is
> constant at 302GB. I even have 100GB of "free" disk-space inside that
> virtual disk-file. Writing 1GB inside the virtual disk-file seems to
> increase the usage about 4-5GB on the "outside".
>
> Does anyone have a clue on what is going on? How can the difference
> and behaviour be like this when I just have one single file? Is it
> also normal to have 672MB of metadata for a single file?
>
Hello and welcome to the wonderful world of btrfs, where COW can really
suck hard without being super clear why! It's 4pm on a Friday right
before I'm gone for 2 weeks so I'm a bit happy and drunk so I'm going to
use pretty pictures. You have this case to start with
file offset 0 offset 302g
[-------------------------prealloced 302g extent----------------------]
(man it's impressive I got all that lined up right)
On disk you have 2 things. First your file, which has file extent items
that say
inode 256, file offset 0, size 302g, offset 0, disk bytenr 123, disklen 302g
and then the extent tree, which keeps track of actual allocated space,
has this
extent bytenr 123, len 302g, refs 1
Now say you boot up your virt image and it writes 1 4k block to offset
0. Now you have this
[4k][--------------------302g-4k--------------------------------------]
And for your inode you now have this
inode 256, file offset 0, size 4k, offset 0, diskbytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 302g-4k, offset 4k, diskbytenr 123,
disklen 302g
and in your extent tree you have
extent bytenr 123, len 302g, refs 1
extent bytenr whatever, len 4k, refs 1
See that? Your file is still the same size, it is still 302g. If you
cp'ed it right now it would copy 302g of information. But what you have
actually allocated on disk? Well that's now 302g + 4k. Now let's say
your virt thing decides to write to the middle, let's say at offset 12k,
now you have this
inode 256, file offset 0, size 4k, offset 0, diskbytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 8k, offset 4k, diskbytenr 123, disklen 302g
inode 256, file offset 12k, size 4k, offset 0, diskbytenr whatever,
disklen 4k
inode 256, file offset 16k, size 302g - 16k, offset 16k, diskbytenr 123,
disklen 302g
and in the extent tree you have this
extent bytenr 123, len 302g, refs 2
extent bytenr whatever, len 4k, refs 1
extent bytenr notimportant, len 4k, refs 1
See that refs 2 change? We split the original extent, so we have 2 file
extents pointing to the same physical extents, so we bumped the ref
count. This will happen over and over again until we have completely
overwritten the original extent, at which point your space usage will go
back down to ~302g.
We split big extents with cow, so unless you've got lots of space to
spare or are going to use nodatacow you should probably not pre-allocate
virt images. Thanks,
Josef
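For anyone who wants to watch this happen, an untested sketch on a
throwaway filesystem (device name, mountpoint and sizes are made up;
exact numbers will vary by version):
mkfs.btrfs /dev/sdX && mount /dev/sdX /mnt/test
fallocate -l 1G /mnt/test/img                # one big preallocated extent
sync; btrfs filesystem df /mnt/test          # note Data "used"
dd if=/dev/urandom of=/mnt/test/img bs=4K count=1 seek=3 conv=notrunc,fsync  # one 4k write at offset 12k
sync; btrfs filesystem df /mnt/test          # Data "used" should grow by a block while the 1G extent stays fully allocated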
* Re: btrfs is using 25% more disk than it should
2014-12-19 19:59 ` Daniele Testa
2014-12-19 20:35 ` Phillip Susi
@ 2014-12-19 21:15 ` Josef Bacik
2014-12-19 21:53 ` Phillip Susi
2014-12-20 1:33 ` Duncan
2 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2014-12-19 21:15 UTC (permalink / raw)
To: Daniele Testa, Phillip Susi; +Cc: linux-btrfs
On 12/19/2014 02:59 PM, Daniele Testa wrote:
> No, I don't have any snapshots or subvolumes. Only that single file.
>
> The file has both checksums and datacow on it. I will do "chattr +C"
> on the parent dir and re-create the file to make sure all files are
> marked as "nodatacow".
>
> Should I also turn off checksums with the mount-flags if this
> filesystem only contain big VM-files? Or is it not needed if I put +C
> on the parent dir?
Please God don't turn off checksums. Checksums are tracked in
metadata anyway, they won't show up in the data accounting. Our csums
are 8 bytes per block, so basic math says you are going to max out at
604 megabytes for that big of a file.
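(Taking the 8-bytes-per-4KiB-block figure above at face value, the
arithmetic in shell, purely to show where the 604 comes from:)
echo $(( 302 * 1024 * 1024 / 4 * 8 / 1024 / 1024 ))   # => 604 (MiB of csums for 302 GiB of data)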
Please, people, try to only take advice from people who know what they
are talking about. So unless it's from somebody who has commits in
btrfs/btrfs-progs, take their feedback with a grain of salt. Thanks,
Josef
* Re: btrfs is using 25% more disk than it should
2014-12-19 21:10 ` Josef Bacik
@ 2014-12-19 21:17 ` Josef Bacik
2014-12-20 1:38 ` Duncan
` (3 more replies)
2014-12-21 3:04 ` Robert White
1 sibling, 4 replies; 22+ messages in thread
From: Josef Bacik @ 2014-12-19 21:17 UTC (permalink / raw)
To: Daniele Testa, linux-btrfs
On 12/19/2014 04:10 PM, Josef Bacik wrote:
> On 12/18/2014 09:59 AM, Daniele Testa wrote:
>> Hey,
>>
>> I am hoping you guys can shed some light on my issue. I know that it's
>> a common question that people see differences in the "disk used" when
>> running different calculations, but I still think that my issue is
>> weird.
>>
>> root@s4 / # mount
>> /dev/md3 on /opt/drives/ssd type btrfs
>> (rw,noatime,compress=zlib,discard,nospace_cache)
>>
>> root@s4 / # btrfs filesystem df /opt/drives/ssd
>> Data: total=407.97GB, used=404.08GB
>> System, DUP: total=8.00MB, used=52.00KB
>> System: total=4.00MB, used=0.00
>> Metadata, DUP: total=1.25GB, used=672.21MB
>> Metadata: total=8.00MB, used=0.00
>>
>> root@s4 /opt/drives/ssd # ls -alhs
>> total 302G
>> 4.0K drwxr-xr-x 1 root root 42 Dec 18 14:34 .
>> 4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
>> 302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49
>> disk_208.img
>> 0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu 0 Dec 18 10:08 snapshots
>>
>> root@s4 /opt/drives/ssd # du -h
>> 0 ./snapshots
>> 302G .
>>
>> As seen above, I have a 410GB SSD mounted at "/opt/drives/ssd". On
>> that partition, I have one single starse file, taking 302GB of space
>> (max 315GB). The snapshots directory is completely empty.
>>
>> However, for some weird reason, btrfs seems to think it takes 404GB.
>> The big file is a disk that I use in a virtual server and when I write
>> stuff inside that virtual server, the disk-usage of the btrfs
>> partition on the host keeps increasing even if the sparse-file is
>> constant at 302GB. I even have 100GB of "free" disk-space inside that
>> virtual disk-file. Writing 1GB inside the virtual disk-file seems to
>> increase the usage about 4-5GB on the "outside".
>>
>> Does anyone have a clue on what is going on? How can the difference
>> and behaviour be like this when I just have one single file? Is it
>> also normal to have 672MB of metadata for a single file?
>>
>
> Hello and welcome to the wonderful world of btrfs, where COW can really
> suck hard without being super clear why! It's 4pm on a Friday right
> before I'm gone for 2 weeks so I'm a bit happy and drunk so I'm going to
> use pretty pictures. You have this case to start with
>
> file offset 0 offset 302g
> [-------------------------prealloced 302g extent----------------------]
>
> (man it's impressive I got all that lined up right)
>
> On disk you have 2 things. First your file which has file extents which
> says
>
> inode 256, file offset 0, size 302g, offset0, disk bytenr 123, disklen 302g
>
> and then in the extent tree, who keeps track of actual allocated space
> has this
>
> extent bytenr 123, len 302g, refs 1
>
> Now say you boot up your virt image and it writes 1 4k block to offset
> 0. Now you have this
>
> [4k][--------------------302g-4k--------------------------------------]
>
> And for your inode you now have this
>
> inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
> disklen 4k
> inode 256, file offset 4k, size 302g-4k, offset 4k, diskbytenr 123,
> disklen 302g
>
> and in your extent tree you have
>
> extent bytenr 123, len 302g, refs 1
> extent bytenr whatever, len 4k, refs 1
>
> See that? Your file is still the same size, it is still 302g. If you
> cp'ed it right now it would copy 302g of information. But what you have
> actually allocated on disk? Well that's now 302g + 4k. Now lets say
> your virt thing decides to write to the middle, lets say at offset 12k,
> now you have this
>
> inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
> disklen 4k
> inode 256, file offset 4k, size 3k, offset 4k, diskbytenr 123, disklen 302g
> inode 256, file offset 12k, size 4k, offset 0, diskebytenr whatever,
> disklen 4k
> inode 256, file offset 16k, size 302g - 16k, offset 16k, diskbytenr 123,
> disklen 302g
>
> and in the extent tree you have this
>
> extent bytenr 123, len 302g, refs 2
> extent bytenr whatever, len 4k, refs 1
> extent bytenr notimportant, len 4k, refs 1
>
> See that refs 2 change? We split the original extent, so we have 2 file
> extents pointing to the same physical extents, so we bumped the ref
> count. This will happen over and over again until we have completely
> overwritten the original extent, at which point your space usage will go
> back down to ~302g.
>
> We split big extents with cow, so unless you've got lots of space to
> spare or are going to use nodatacow you should probably not pre-allocate
> virt images. Thanks,
>
Sorry, I should have added a
tl;dr: COW means you can, in the worst case, end up using 2 * filesize -
blocksize of data on disk while the file still appears to be filesize. Thanks,
Josef
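(Plugged into the numbers from this thread, a 315G image with 4 KiB
blocks, that worst case works out to:)
echo $(( 2 * 315 * 1024 * 1024 - 4 ))   # => 660602876 KiB on disk, just under 630G, for a file that still reports 315G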
* Re: btrfs is using 25% more disk than it should
2014-12-19 21:15 ` Josef Bacik
@ 2014-12-19 21:53 ` Phillip Susi
2014-12-19 22:06 ` Josef Bacik
0 siblings, 1 reply; 22+ messages in thread
From: Phillip Susi @ 2014-12-19 21:53 UTC (permalink / raw)
To: Josef Bacik, Daniele Testa; +Cc: linux-btrfs
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 12/19/2014 4:15 PM, Josef Bacik wrote:
> Please God don't turn off of checksums. Checksums are tracked in
> metadata anyway, they won't show up in the data accounting. Our
> csums are 8 bytes per block, so basic math says you are going to
> max out at 604 megabytes for that big of a file.
Yes, and it is exactly that metadata space he is complaining about.
So if you don't want to use up all of that space (and have no use for
the checksums), then you turn them off.
> Please people try to only take advice from people who know what
> they are talking about. So unless it's from somebody who has
> commits in btrfs/btrfs-progs take their feedback with a grain of
> salt. Thanks,
Well that is rather arrogant and rude. For that matter, I *do* have
commits in btrfs-progs.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
iQEcBAEBAgAGBQJUlJ5gAAoJENRVrw2cjl5RZ5MIAI0Ok0q0hFTMcYYXu1U48R4Z
AsuRg6zQDMOa9C1SqZucH2cuiiaGU8XixKcscaquoJDzzaND2kuy+sxp0k2YQnGz
+/269OmZUtwjYil1NcSFTJiE2bYUAx1R+xWUGax/03NsXRr672f0EtAQ2sIitTaG
WsNUhiU0GREpQL6pK403fO79eD2vRmgCx2w50gB2OYPQYciJ+YN0YAJ7z8VEmUro
M9xqce2oc7haAHliDvazl+7IDRkkiZ7FcpSs2nBSqiHiUhgVaxuTzHZEXvUasE5l
LamJCwiSwuevWWPCDE4N/r7qVcamKM2K/DMvZCiOuPkSm3YkcVyrUd8x4i8OEJs=
=8R13
-----END PGP SIGNATURE-----
* Re: btrfs is using 25% more disk than it should
2014-12-19 21:53 ` Phillip Susi
@ 2014-12-19 22:06 ` Josef Bacik
0 siblings, 0 replies; 22+ messages in thread
From: Josef Bacik @ 2014-12-19 22:06 UTC (permalink / raw)
To: Phillip Susi, Daniele Testa; +Cc: linux-btrfs
On 12/19/2014 04:53 PM, Phillip Susi wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 12/19/2014 4:15 PM, Josef Bacik wrote:
>> Please God don't turn off of checksums. Checksums are tracked in
>> metadata anyway, they won't show up in the data accounting. Our
>> csums are 8 bytes per block, so basic math says you are going to
>> max out at 604 megabytes for that big of a file.
>
> Yes, and it is exactly that metadata space he is complaining about.
> So if you don't want to use up all of that space ( and have no use for
> the checksums ), then you turn them off.
>
>> Please people try to only take advice from people who know what
>> they are talking about. So unless it's from somebody who has
>> commits in btrfs/btrfs-progs take their feedback with a grain of
>> salt. Thanks,
>
> Well that is rather arrogant and rude. For that matter, I *do* have
> commits in btrfs-progs.
>
root@destiny ~/btrfs-progs# git log --oneline --author="Phillip Susi"
c65345d btrfs-progs: document --rootdir mkfs switch
f6b6e93 btrfs-progs: removed extraneous whitespace from mkfs man page
Sorry, I should have qualified that statement better:
unless it's from somebody who has had commits to meaningful portions
of btrfs/btrfs-progs, take their feedback with a grain of salt.
There are too many people on this list who give random, horribly wrong
advice to users that can result in data loss or corruption. Now I'll
admit I read her question wrong, so what you said wasn't incorrect; I'm
sorry for that. I've seen a lot of people I don't recognize responding
to questions recently who have been completely full of crap, and I just
assumed you were in that camp as well. Thanks,
Josef
* Re: btrfs is using 25% more disk than it should
2014-12-19 19:59 ` Daniele Testa
2014-12-19 20:35 ` Phillip Susi
2014-12-19 21:15 ` Josef Bacik
@ 2014-12-20 1:33 ` Duncan
2 siblings, 0 replies; 22+ messages in thread
From: Duncan @ 2014-12-20 1:33 UTC (permalink / raw)
To: linux-btrfs
Daniele Testa posted on Sat, 20 Dec 2014 03:59:42 +0800 as excerpted:
> The file has both checksums and datacow on it. I will do "chattr +C"
> on the parent dir and re-create the file to make sure all files are
> marked as "nodatacow".
>
> Should I also turn off checksums with the mount-flags if this filesystem
> only contain big VM-files? Or is it not needed if I put +C on the parent
> dir?
FWIW...
Turning off datacow, whether by chattr +C on the parent dir before
creating the file, or via mount option, turns off checksumming as well.
(For completeness, it also turns off compression, but I don't think that
applies in your case.)
In general, active VM images (and database files) with default flags tend
to get very highly fragmented very fast, due to btrfs' default COW on a
file with a heavy "internal rewrite" pattern (as opposed to append-only
or full rename/replace on rewrite). For relatively small files with this
rewrite pattern (think typical desktop firefox sqlite database files of a
quarter GiB or less), the btrfs autodefrag mount option can be helpful.
But because it triggers a rewrite of the entire file, the viability of
autodefrag goes down as filesize goes up; at somewhere around half a
gig it doesn't work so well any more, particularly on very active files
where the incoming rewrite stream may be faster than btrfs can rewrite
the entire file.
Making heavy-internal-rewrite files of over, say, half a GiB in size
nocow is one suggested solution. However, snapshots lock the existing
version in place, forcing a one-time COW on the first write to each
block after a snapshot. If people are doing frequent automated snapshots
(say once an hour), this can be a big problem, as the file ends up
fragmenting pretty badly through these one-time-cow writes as well;
that's how snapshots come back into the picture even for nocow files.
There are ways to work around the problem (put the files in question on a
subvolume and don't snapshot it as often as the parent, set up a cron job
to do, say, a weekly defrag of the files in question, etc.), but since you
don't have snapshots going anyway, that's not a concern for you except as
a preventative -- consider it if you /do/ start doing snapshots.
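(For the cron-job option above, something as small as this should do;
the path and the -t extent-size target are only illustrative, and keep
in mind that defragmenting can unshare extents if snapshots are ever in
play:)
#!/bin/sh
# e.g. dropped into /etc/cron.weekly/ as defrag-vm-images
btrfs filesystem defragment -t 32M /opt/drives/ssd/disk_208.img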
So anyway, as I said, creating the file nocow (whether by mount option or
chattr) will turn off checksumming too. But on something that is
frequently internally rewritten, where corruption will very likely
corrupt the VM anyway and there are already mechanisms in place to deal
with that (VM integrity mechanisms, backups, or simply disposable VMs
you fire up anew when necessary), and at least with btrfs single-mode
data where there's no second copy to restore from if the checksum /does/
fail, turning off checksumming isn't necessarily as bad as it may seem.
And it /should/ save you some on the metadata... tho I'd not consider
that savings worth turning off checksumming if that were the /only/
reason, on its own. The metadata difference is more a nice side-effect
of an already commonly recommended practice for large VM image files,
than something you'd turn off checksumming for in the first place.
Certainly, on most files I'd prefer the checksums, and in fact am running
btrfs raid1 mode here specifically to get the benefit of having a second
copy to retrieve from if the first attempted copy fails checksum. But VM
images and database files are a bit of an exception.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs is using 25% more disk than it should
2014-12-19 21:17 ` Josef Bacik
@ 2014-12-20 1:38 ` Duncan
2014-12-20 5:52 ` Zygo Blaxell
` (2 subsequent siblings)
3 siblings, 0 replies; 22+ messages in thread
From: Duncan @ 2014-12-20 1:38 UTC (permalink / raw)
To: linux-btrfs
Josef Bacik posted on Fri, 19 Dec 2014 16:17:08 -0500 as excerpted:
> tl;dr: Cow means you can in the worst case end up using 2 * filesize -
> blocksize of data on disk and the file will appear to be filesize.
Thanks for the tl;dr /and/ the very sensible longer explanation. That's
a very nice thing to know and to file away for further reference. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs is using 25% more disk than it should
2014-12-19 21:17 ` Josef Bacik
2014-12-20 1:38 ` Duncan
@ 2014-12-20 5:52 ` Zygo Blaxell
2014-12-20 6:18 ` Daniele Testa
2014-12-20 11:28 ` Josef Bacik
2014-12-20 9:15 ` Daniele Testa
2014-12-20 11:23 ` Robert White
3 siblings, 2 replies; 22+ messages in thread
From: Zygo Blaxell @ 2014-12-20 5:52 UTC (permalink / raw)
To: Josef Bacik; +Cc: Daniele Testa, linux-btrfs
On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:
> >And for your inode you now have this
> >
> >inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
> >disklen 4k
> >inode 256, file offset 4k, size 302g-4k, offset 4k, diskbytenr 123,
> >disklen 302g
> >
> >and in your extent tree you have
> >
> >extent bytenr 123, len 302g, refs 1
> >extent bytenr whatever, len 4k, refs 1
> >
> >See that? Your file is still the same size, it is still 302g. If you
> >cp'ed it right now it would copy 302g of information. But what you have
> >actually allocated on disk? Well that's now 302g + 4k. Now lets say
> >your virt thing decides to write to the middle, lets say at offset 12k,
> >now you have this
> >
> >inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
> >disklen 4k
> >inode 256, file offset 4k, size 3k, offset 4k, diskbytenr 123, disklen 302g
> >inode 256, file offset 12k, size 4k, offset 0, diskebytenr whatever,
> >disklen 4k
> >inode 256, file offset 16k, size 302g - 16k, offset 16k, diskbytenr 123,
> >disklen 302g
> >
> >and in the extent tree you have this
> >
> >extent bytenr 123, len 302g, refs 2
> >extent bytenr whatever, len 4k, refs 1
> >extent bytenr notimportant, len 4k, refs 1
> >
> >See that refs 2 change? We split the original extent, so we have 2 file
> >extents pointing to the same physical extents, so we bumped the ref
> >count. This will happen over and over again until we have completely
> >overwritten the original extent, at which point your space usage will go
> >back down to ~302g.
Wait, *what*?
OK, I did a small experiment, and found that btrfs actually does do
something like this. Can't argue with facts, though it would be nice if
btrfs could be smarter and drop unused portions of the original extent
sooner. :-P
The above quoted scenario is a little oversimplified. Chances are that
the 302G file is made of much smaller extents (128M..256M). If the VM is
writing 4K randomly everywhere then those 128M+ extents are not going
away any time soon. Even the extents that are dropped stick around for
a few btrfs transaction commits before they go away.
I couldn't reproduce this behavior until I realized the extents I was
overwriting in my tests were exactly the same size and position as
the extents on disk. I changed the offset slightly and found that
partially-overwritten extents do in fact stick around in their entirety.
There seems to be an unexpected benefit for compression here: compression
keeps the extents small, so many small updates will be less likely to
leave big mostly-unused extents lying around the filesystem.
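(A read-only way to see the extent sizes being described, using filefrag
from e2fsprogs on the file from this thread; as far as I know compressed
extents are capped at 128K of data each, which is why compression keeps
them small:)
filefrag -v /opt/drives/ssd/disk_208.img | head -25   # one line per on-disk extent, lengths in 4K blocks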
* Re: btrfs is using 25% more disk than it should
2014-12-20 5:52 ` Zygo Blaxell
@ 2014-12-20 6:18 ` Daniele Testa
2014-12-20 6:59 ` Duncan
2014-12-20 11:02 ` Josef Bacik
2014-12-20 11:28 ` Josef Bacik
1 sibling, 2 replies; 22+ messages in thread
From: Daniele Testa @ 2014-12-20 6:18 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Josef Bacik, linux-btrfs
But I read somewhere that compression should be turned off on mounts
that just store large VM-images. Is that wrong?
Btw, I am not pre-allocating space for the images. I use sparse files with:
dd if=/dev/zero of=drive.img bs=1 count=1 seek=300G
It creates the file in a few ms.
Is it better to use "fallocate" with btrfs?
If I use sparse files, it adds a benefit when I want to copy/move the
image-file to another server.
Like if the 300GB sparse file just has 10GB of data in it, I only need
to copy 10GB when moving it to another server.
Would the same be true with "fallocate"?
Anyways, would disabling CoW (by putting +C on the parent dir) prevent
the performance issues and 2*filesize issue?
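(For comparison, the two ways of creating the file and a quick way to
see the difference; the file names are made up, truncate is from
coreutils and fallocate from util-linux:)
truncate -s 300G sparse.img                      # sparse: 300G apparent size, almost nothing allocated
fallocate -l 300G prealloc.img                   # preallocated: the full 300G reserved up front
du -h --apparent-size sparse.img prealloc.img    # apparent sizes (both 300G)
du -h sparse.img prealloc.img                    # blocks actually allocated (~0 vs 300G)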
2014-12-20 13:52 GMT+08:00 Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:
>> >And for your inode you now have this
>> >
>> >inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
>> >disklen 4k
>> >inode 256, file offset 4k, size 302g-4k, offset 4k, diskbytenr 123,
>> >disklen 302g
>> >
>> >and in your extent tree you have
>> >
>> >extent bytenr 123, len 302g, refs 1
>> >extent bytenr whatever, len 4k, refs 1
>> >
>> >See that? Your file is still the same size, it is still 302g. If you
>> >cp'ed it right now it would copy 302g of information. But what you have
>> >actually allocated on disk? Well that's now 302g + 4k. Now lets say
>> >your virt thing decides to write to the middle, lets say at offset 12k,
>> >now you have this
>> >
>> >inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
>> >disklen 4k
>> >inode 256, file offset 4k, size 3k, offset 4k, diskbytenr 123, disklen 302g
>> >inode 256, file offset 12k, size 4k, offset 0, diskebytenr whatever,
>> >disklen 4k
>> >inode 256, file offset 16k, size 302g - 16k, offset 16k, diskbytenr 123,
>> >disklen 302g
>> >
>> >and in the extent tree you have this
>> >
>> >extent bytenr 123, len 302g, refs 2
>> >extent bytenr whatever, len 4k, refs 1
>> >extent bytenr notimportant, len 4k, refs 1
>> >
>> >See that refs 2 change? We split the original extent, so we have 2 file
>> >extents pointing to the same physical extents, so we bumped the ref
>> >count. This will happen over and over again until we have completely
>> >overwritten the original extent, at which point your space usage will go
>> >back down to ~302g.
>
> Wait, *what*?
>
> OK, I did a small experiment, and found that btrfs actually does do
> something like this. Can't argue with fact, though it would be nice if
> btrfs could be smarter and drop unused portions of the original extent
> sooner. :-P
>
> The above quoted scenario is a little oversimplified. Chances are that
> 302G file is made of much smaller extents (128M..256M). If the VM is
> writing 4K randomly everywhere then those 128M+ extents are not going
> away any time soon. Even the extents that are dropped stick around for
> a few btrfs transaction commits before they go away.
>
> I couldn't reproduce this behavior until I realized the extents I was
> overwriting in my tests were exactly the same size and position of
> the extents on disk. I changed the offset slightly and found that
> partially-overwritten extents do in fact stick around in their entirety.
>
> There seems to be an unexpected benefit for compression here: compression
> keeps the extents small, so many small updates will be less likely to
> leave big mostly-unused extents lying around the filesystem.
* Re: btrfs is using 25% more disk than it should
2014-12-20 6:18 ` Daniele Testa
@ 2014-12-20 6:59 ` Duncan
2014-12-20 11:02 ` Josef Bacik
1 sibling, 0 replies; 22+ messages in thread
From: Duncan @ 2014-12-20 6:59 UTC (permalink / raw)
To: linux-btrfs
Daniele Testa posted on Sat, 20 Dec 2014 14:18:31 +0800 as excerpted:
> Anyways, would disabling CoW (by putting +C on the parent dir) prevent
> the performance issues and 2*filesize issue?
It should, provided you don't then start snapshotting the file (which I
don't believe you intend to do but just in case...).
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs is using 25% more disk than it should
2014-12-19 21:17 ` Josef Bacik
2014-12-20 1:38 ` Duncan
2014-12-20 5:52 ` Zygo Blaxell
@ 2014-12-20 9:15 ` Daniele Testa
2014-12-20 11:23 ` Robert White
3 siblings, 0 replies; 22+ messages in thread
From: Daniele Testa @ 2014-12-20 9:15 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs
Ok, so this is what I did:
1. Copied the sparse 315GB file (with 302GB of data inside) to another server
2. Re-formatted the btrfs partition
3. chattr +C on the parent dir
4. Copied the 315GB file back to the btrfs partition (the file is not
sparse any more due to the copying)
This is the end result:
root@s4 /opt/drives/ssd # ls -alhs
total 316G
16K drwxr-xr-x 1 libvirt-qemu libvirt-qemu 42 Dec 20 07:00 .
4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
315G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 20 09:11 disk_208.img
0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu 0 Dec 20 06:53 snapshots
root@s4 /opt/drives/ssd # du -h
0 ./snapshots
316G .
root@s4 /opt/drives/ssd # df -h
/dev/md3        411G  316G   94G  78% /opt/drives/ssd
root@s4 /opt/drives/ssd # btrfs filesystem df /opt/drives/ssd
Data, single: total=323.01GiB, used=315.08GiB
System, DUP: total=8.00MiB, used=64.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=880.00KiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=16.00MiB, used=0.00
root@s4 /opt/drives/ssd # lsattr
---------------- ./snapshots
---------------C ./disk_208.img
As you can see, it looks much better now. The file takes as much space
as it should and the metadata is only 880KB.
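(If the sparseness was only lost because of the plain copy, it can be
kept or recovered; these are the standard flags as far as I remember
them, and the remote path is just an example:)
cp --sparse=always disk_208.img /opt/drives/ssd/disk_208.img    # punch the holes back in while copying
rsync --sparse disk_208.img root@s4:/opt/drives/ssd/            # same idea over the network
fallocate --dig-holes /opt/drives/ssd/disk_208.img              # or re-sparsify in place (newer util-linux)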
I will do some writes inside the VM and see if the usage grows on the
"outside". If everything is ok, it should not.
2014-12-20 5:17 GMT+08:00 Josef Bacik <jbacik@fb.com>:
> On 12/19/2014 04:10 PM, Josef Bacik wrote:
>>
>> On 12/18/2014 09:59 AM, Daniele Testa wrote:
>>>
>>> Hey,
>>>
>>> I am hoping you guys can shed some light on my issue. I know that it's
>>> a common question that people see differences in the "disk used" when
>>> running different calculations, but I still think that my issue is
>>> weird.
>>>
>>> root@s4 / # mount
>>> /dev/md3 on /opt/drives/ssd type btrfs
>>> (rw,noatime,compress=zlib,discard,nospace_cache)
>>>
>>> root@s4 / # btrfs filesystem df /opt/drives/ssd
>>> Data: total=407.97GB, used=404.08GB
>>> System, DUP: total=8.00MB, used=52.00KB
>>> System: total=4.00MB, used=0.00
>>> Metadata, DUP: total=1.25GB, used=672.21MB
>>> Metadata: total=8.00MB, used=0.00
>>>
>>> root@s4 /opt/drives/ssd # ls -alhs
>>> total 302G
>>> 4.0K drwxr-xr-x 1 root root 42 Dec 18 14:34 .
>>> 4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
>>> 302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49
>>> disk_208.img
>>> 0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu 0 Dec 18 10:08 snapshots
>>>
>>> root@s4 /opt/drives/ssd # du -h
>>> 0 ./snapshots
>>> 302G .
>>>
>>> As seen above, I have a 410GB SSD mounted at "/opt/drives/ssd". On
>>> that partition, I have one single starse file, taking 302GB of space
>>> (max 315GB). The snapshots directory is completely empty.
>>>
>>> However, for some weird reason, btrfs seems to think it takes 404GB.
>>> The big file is a disk that I use in a virtual server and when I write
>>> stuff inside that virtual server, the disk-usage of the btrfs
>>> partition on the host keeps increasing even if the sparse-file is
>>> constant at 302GB. I even have 100GB of "free" disk-space inside that
>>> virtual disk-file. Writing 1GB inside the virtual disk-file seems to
>>> increase the usage about 4-5GB on the "outside".
>>>
>>> Does anyone have a clue on what is going on? How can the difference
>>> and behaviour be like this when I just have one single file? Is it
>>> also normal to have 672MB of metadata for a single file?
>>>
>>
>> Hello and welcome to the wonderful world of btrfs, where COW can really
>> suck hard without being super clear why! It's 4pm on a Friday right
>> before I'm gone for 2 weeks so I'm a bit happy and drunk so I'm going to
>> use pretty pictures. You have this case to start with
>>
>> file offset 0 offset 302g
>> [-------------------------prealloced 302g extent----------------------]
>>
>> (man it's impressive I got all that lined up right)
>>
>> On disk you have 2 things. First your file which has file extents which
>> says
>>
>> inode 256, file offset 0, size 302g, offset0, disk bytenr 123, disklen
>> 302g
>>
>> and then in the extent tree, who keeps track of actual allocated space
>> has this
>>
>> extent bytenr 123, len 302g, refs 1
>>
>> Now say you boot up your virt image and it writes 1 4k block to offset
>> 0. Now you have this
>>
>> [4k][--------------------302g-4k--------------------------------------]
>>
>> And for your inode you now have this
>>
>> inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
>> disklen 4k
>> inode 256, file offset 4k, size 302g-4k, offset 4k, diskbytenr 123,
>> disklen 302g
>>
>> and in your extent tree you have
>>
>> extent bytenr 123, len 302g, refs 1
>> extent bytenr whatever, len 4k, refs 1
>>
>> See that? Your file is still the same size, it is still 302g. If you
>> cp'ed it right now it would copy 302g of information. But what you have
>> actually allocated on disk? Well that's now 302g + 4k. Now lets say
>> your virt thing decides to write to the middle, lets say at offset 12k,
>> now you have this
>>
>> inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
>> disklen 4k
>> inode 256, file offset 4k, size 3k, offset 4k, diskbytenr 123, disklen
>> 302g
>> inode 256, file offset 12k, size 4k, offset 0, diskebytenr whatever,
>> disklen 4k
>> inode 256, file offset 16k, size 302g - 16k, offset 16k, diskbytenr 123,
>> disklen 302g
>>
>> and in the extent tree you have this
>>
>> extent bytenr 123, len 302g, refs 2
>> extent bytenr whatever, len 4k, refs 1
>> extent bytenr notimportant, len 4k, refs 1
>>
>> See that refs 2 change? We split the original extent, so we have 2 file
>> extents pointing to the same physical extents, so we bumped the ref
>> count. This will happen over and over again until we have completely
>> overwritten the original extent, at which point your space usage will go
>> back down to ~302g.
>>
>> We split big extents with cow, so unless you've got lots of space to
>> spare or are going to use nodatacow you should probably not pre-allocate
>> virt images. Thanks,
>>
>
> Sorry should have added a
>
> tl;dr: Cow means you can in the worst case end up using 2 * filesize -
> blocksize of data on disk and the file will appear to be filesize. Thanks,
>
> Josef
>
* Re: btrfs is using 25% more disk than it should
2014-12-20 6:18 ` Daniele Testa
2014-12-20 6:59 ` Duncan
@ 2014-12-20 11:02 ` Josef Bacik
1 sibling, 0 replies; 22+ messages in thread
From: Josef Bacik @ 2014-12-20 11:02 UTC (permalink / raw)
To: Daniele Testa, Zygo Blaxell; +Cc: linux-btrfs
On 12/20/2014 01:18 AM, Daniele Testa wrote:
> But I read somewhere that compression should be turned off on mounts
> that just store large VM-images. Is that wrong?
>
It doesn't really matter, frankly. Usually virt images are preallocated
with fallocate, which means compression doesn't happen, since writes into
fallocated areas aren't compressed; but you aren't doing that, so you
would be getting some compression.
> Btw, I am not pre-allocation space for the images. I use sparse files with:
>
> dd if=/dev/zero of=drive.img bs=1 count=1 seek=300G
>
> It creates the file in a few ms.
> Is it better to use "fallocate" with btrfs?
>
It depends. If you are going to use nodatacow for your virt images then
I would definitely suggest using fallocate, since you'll get a nice
contiguous chunk of space for them.
> If I use sparse files, it adds a benefit when I want to copy/move the
> image-file to another server.
> Like if the 300GB sparse file just has 10GB of data in it, I only need
> to copy 10GB when moving it to another server.
> Would the same be true with "fallocate"?
>
No, but send/receive would only copy 10GB, and the resulting file would
be sparse.
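(For completeness, the send/receive path needs the image to live in a
subvolume and a read-only snapshot to send from; a rough sketch with
made-up names, and keep in mind the snapshot itself forces one round of
COW on a nodatacow file:)
btrfs subvolume snapshot -r /opt/drives/ssd /opt/drives/ssd/snapshots/xfer
btrfs send /opt/drives/ssd/snapshots/xfer | ssh otherhost 'btrfs receive /opt/drives/ssd'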
> Anyways, would disabling CoW (by putting +C on the parent dir) prevent
> the performance issues and 2*filesize issue?
>
Yes. Thanks,
Josef
* Re: btrfs is using 25% more disk than it should
2014-12-19 21:17 ` Josef Bacik
` (2 preceding siblings ...)
2014-12-20 9:15 ` Daniele Testa
@ 2014-12-20 11:23 ` Robert White
2014-12-20 11:39 ` Josef Bacik
3 siblings, 1 reply; 22+ messages in thread
From: Robert White @ 2014-12-20 11:23 UTC (permalink / raw)
To: Josef Bacik, Daniele Testa, linux-btrfs
On 12/19/2014 01:17 PM, Josef Bacik wrote:
> tl;dr: Cow means you can in the worst case end up using 2 * filesize -
> blocksize of data on disk and the file will appear to be filesize. Thanks,
Isn't the worst case more like N^log(N) (where N is the file size in
blocks) in the pernicious case?
Staggered block overwrites can "peer down" through gaps to create more
than two layers of retention. The only real requirement is that each
layer be smaller than the one before it, so as to leave some of its
predecessor visible.
So if I make a file of N blocks, then overwrite it with N-1 blocks,
then overwrite it again with N-2 blocks (etc.), I can easily create a
deep slope of obscured data.
[-----------------]
[----------------]
[---------------]
[--------------]
[-------------]
[------------]
[-----------]
[----------]
[---------]
(etc...)
Or would I have to bracket the front and back
----------
--------
------
Or could I bracket the sides
---------
---- ----
--- ---
-- --
- -
There have got to be pathological patterns like this that can end up with
a heck of a lot of "hidden" data.
* Re: btrfs is using 25% more disk than it should
2014-12-20 5:52 ` Zygo Blaxell
2014-12-20 6:18 ` Daniele Testa
@ 2014-12-20 11:28 ` Josef Bacik
2014-12-23 21:51 ` Zygo Blaxell
1 sibling, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2014-12-20 11:28 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Daniele Testa, linux-btrfs
On 12/20/2014 12:52 AM, Zygo Blaxell wrote:
> On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:
>>> And for your inode you now have this
>>>
>>> inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
>>> disklen 4k
>>> inode 256, file offset 4k, size 302g-4k, offset 4k, diskbytenr 123,
>>> disklen 302g
>>>
>>> and in your extent tree you have
>>>
>>> extent bytenr 123, len 302g, refs 1
>>> extent bytenr whatever, len 4k, refs 1
>>>
>>> See that? Your file is still the same size, it is still 302g. If you
>>> cp'ed it right now it would copy 302g of information. But what you have
>>> actually allocated on disk? Well that's now 302g + 4k. Now lets say
>>> your virt thing decides to write to the middle, lets say at offset 12k,
>>> now you have this
>>>
>>> inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
>>> disklen 4k
>>> inode 256, file offset 4k, size 3k, offset 4k, diskbytenr 123, disklen 302g
>>> inode 256, file offset 12k, size 4k, offset 0, diskebytenr whatever,
>>> disklen 4k
>>> inode 256, file offset 16k, size 302g - 16k, offset 16k, diskbytenr 123,
>>> disklen 302g
>>>
>>> and in the extent tree you have this
>>>
>>> extent bytenr 123, len 302g, refs 2
>>> extent bytenr whatever, len 4k, refs 1
>>> extent bytenr notimportant, len 4k, refs 1
>>>
>>> See that refs 2 change? We split the original extent, so we have 2 file
>>> extents pointing to the same physical extents, so we bumped the ref
>>> count. This will happen over and over again until we have completely
>>> overwritten the original extent, at which point your space usage will go
>>> back down to ~302g.
>
> Wait, *what*?
>
> OK, I did a small experiment, and found that btrfs actually does do
> something like this. Can't argue with fact, though it would be nice if
> btrfs could be smarter and drop unused portions of the original extent
> sooner. :-P
>
So we've thought about changing this, and will eventually, but it's kind
of difficult. The above is an example of what happens currently; the
split code for file extents is kind of big and scary, check
__btrfs_drop_extents. We would have to fix that to adjust the
disk_bytenr and disk_num_bytes, which isn't too bad since we already are
doing this dance and adjusting offset. The trick is that when updating
the extent references, we would have to split those extents. So say we
have a 128mb extent and we write 4k at 1mb. If we split the extent refs
we'd have this afterwards
(note this isn't how they'd be ordered on disk, just written this way so
it makes logical sense)
extent bytenr 0, len 1mb, refs 1
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 128mb-4k, refs 1
Ok so now we have 3 extents in the extent tree to describe essentially 2
ranges that are in use, but we get back the 4k so that's nice. But wait
there's more! What if we're snapshotted? We can't just drop that 4k
because somebody else has a reference to it. So what do we do? Well we
could do something like this
extent bytenr 0, len 1mb, refs 1
extent bytenr 0, len 128mb, refs 1
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 128mb-4k, refs 1
This creates all sorts of problems for us. We now have two extents with
the same bytenr but with different lengths. This could be ok, we'd have
to add a bunch of checks to make sure we're looking at the right extent,
but it wouldn't be horrible. I imagine we'd be fixing weird corruption
bugs for a few releases though while we found all of the corner cases we
missed.
Then there is the problem of actually returning the free space. Now if
we drop all of the refs for an extent we know the space is free and we
return it to the allocator. With the above example we can't do that
anymore, we have to check the extent tree for any area that is left
overlapping the area we just freed. This adds another search to every
btrfs_free_extent operation, which slows the whole system down and again
leaves us with weird corner cases and pain for the users. Plus this
would be an incompatible format change, so it would require setting a
feature flag in the fs and being rolled out voluntarily.
Now I have another solution, but I'm not convinced it's awesome either.
Take the same example above, but instead we split the original extent
in the extent tree so we avoid all the mess of having overlapping ranges
and get this instead
extent bytenr 0, len 1mb, refs 2
extent bytenr 1mb, len 4k, refs 1 <-- part of the original extent
pointed to by the snapshot
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 128mb-4k, refs 2
So yay, we've solved the problem of overlapping extents and, as a bonus,
this is backwards compatible. So why don't we do this? Well, for all the
reasons I listed above about corner cases and much pain for our users. This
wouldn't require a format change so everybody would get this behaviour
as soon as we turned it on, and I feel I would be doing a lot of fsck
work for the next 6 months. Plus we would have to add a 'split'
operation to the extent operations that copies all of the extent
references around and drops the proper reference. Keep in mind that
I've been showing a dumbed down version of extent refs, what it would
really look like is this
extent bytenr 0, len 128mb, refs 2
root 5, owner 256, refs 1
root 256, owner 256, refs 1
So when we do our split operation we'd copy this extent entry twice,
update the two sides with their new offset and len, and drop the
original inode from the middle thing, and finally add our new extent.
That is a lot more work for one operation than just adding a new entry
or removing an old entry. Not only is it more work but it adds more
metadata to the extent root, which makes extent operations more
expensive which again slows the whole file system down.
Welcome to file system development: you spin the giant wheel of trade-offs
and decide which sucks less for you and your users. Years ago we chose
simplicity in one of the more complex areas of btrfs, at the cost of
wasting space on overwrites. It's not super clear that was the right
choice, so we're considering changing it, but as you can see it ain't
going to be fun, and it will require other trade-offs which may have
unintended consequences later on. Thanks,
Josef
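(If anyone wants to watch this bookkeeping on a real filesystem, the
extent tree (tree id 2) can be dumped; btrfs-progs of this era ship the
tool as btrfs-debug-tree, newer releases as an inspect-internal
subcommand. Best done on an unmounted or quiesced filesystem; the device
name is the one from this thread:)
btrfs inspect-internal dump-tree -t 2 /dev/md3 | grep -c EXTENT_ITEM   # count the extent records
# older progs: btrfs-debug-tree /dev/md3 (dumps every tree, so expect a lot of output)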
* Re: btrfs is using 25% more disk than it should
2014-12-20 11:23 ` Robert White
@ 2014-12-20 11:39 ` Josef Bacik
2014-12-21 1:40 ` Robert White
0 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2014-12-20 11:39 UTC (permalink / raw)
To: Robert White, Daniele Testa, linux-btrfs
On 12/20/2014 06:23 AM, Robert White wrote:
> On 12/19/2014 01:17 PM, Josef Bacik wrote:
>> tl;dr: Cow means you can in the worst case end up using 2 * filesize -
>> blocksize of data on disk and the file will appear to be filesize.
>> Thanks,
>
> Doesn't the worst case more like N^log(N) (when N is file in blocksize)
> in the pernicious case?
>
> Staggered block overwrites can "peer down" through gaps to create more
> than two layers of retention. The only real requirement is that each
> layer get smaller than the one before it so as to leave some of each of
> it's predecessor visible.
>
> So if I make a file size N blocks, then overwrite it with N-1 blocks,
> then overwrite it again with N-2 blocks (etc). I can easily create a
> deep slop of obscured data.
>
> [-----------------]
> [----------------]
> [---------------]
> [--------------]
> [-------------]
> [------------]
> [-----------]
> [----------]
> [---------]
> (etc...)
>
>
> Or would I have to bracket the front and back
>
> ----------
> --------
> ------
>
> Or could I bracket the sides
>
> ---------
> ---- ----
> --- ---
> -- --
> - -
>
> There's got to be pahological patterns like this that can end up with a
> heck of a lot of "hidden" data.
Just the sloped case would do it; the pathological case would result in
way more space used than you expect. So I guess the worst case would be
something like
(num_blocks + (num_blocks - 1)!) * blocksize
in actual space usage. Our extents are limited to 128mb in size, but
that still ends up being pretty huge. I'm actually going to try this
locally and see what happens. Thanks,
Josef
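(A shrunken, untested version of that experiment for a scratch
filesystem; device, mountpoint and sizes are made up, and how far "used"
actually balloons depends on free space and on how quickly stale extents
get reclaimed:)
mkfs.btrfs /dev/sdX && mount /dev/sdX /mnt/test
fallocate -l 4M /mnt/test/f                    # 1024 blocks of 4K
for i in $(seq 1023 -1 1); do                  # each pass overwrites a one-block-shorter prefix
    dd if=/dev/urandom of=/mnt/test/f bs=4K count=$i conv=notrunc,fsync 2>/dev/null
done
btrfs filesystem df /mnt/test                  # Data "used" can end up hundreds of times the 4M file size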
* Re: btrfs is using 25% more disk than it should
2014-12-20 11:39 ` Josef Bacik
@ 2014-12-21 1:40 ` Robert White
0 siblings, 0 replies; 22+ messages in thread
From: Robert White @ 2014-12-21 1:40 UTC (permalink / raw)
To: Josef Bacik, Daniele Testa, linux-btrfs
On 12/20/2014 03:39 AM, Josef Bacik wrote:
> On 12/20/2014 06:23 AM, Robert White wrote:
>> On 12/19/2014 01:17 PM, Josef Bacik wrote:
>>> tl;dr: Cow means you can in the worst case end up using 2 * filesize -
>>> blocksize of data on disk and the file will appear to be filesize.
>>> Thanks,
>>
>> Doesn't the worst case more like N^log(N) (when N is file in blocksize)
>> in the pernicious case?
>>
>> Staggered block overwrites can "peer down" through gaps to create more
>> than two layers of retention. The only real requirement is that each
>> layer get smaller than the one before it so as to leave some of each of
>> it's predecessor visible.
>>
>> So if I make a file size N blocks, then overwrite it with N-1 blocks,
>> then overwrite it again with N-2 blocks (etc). I can easily create a
>> deep slop of obscured data.
>>
>> [-----------------]
>> [----------------]
>> [---------------]
>> [--------------]
>> [-------------]
>> [------------]
>> [-----------]
>> [----------]
>> [---------]
>> (etc...)
>>
>>
>> Or would I have to bracket the front and back
>>
>> ----------
>> --------
>> ------
>>
>> Or could I bracket the sides
>>
>> ---------
>> ---- ----
>> --- ---
>> -- --
>> - -
>>
>> There's got to be pahological patterns like this that can end up with a
>> heck of a lot of "hidden" data.
>
> Just the sloped case would do it, the pathological case would result in
> way more used than you expect. So I guess the worst case would be
> something like
>
> (num_blocks + (num_blocks - 1)!) * blocksize
I think that for a single file it's not a factorial but a consecutive sum
(one of Gauss' equations), so
max = ((n * (n+1)) / 2) * blocksize
A lot smaller than a factorial, but still on the order of (n^2 + n)/2
blocks, which is nothing to discard lightly.
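(Putting a number on that with btrfs' 128mb extent cap, i.e. n = 32768
four-KiB blocks; an upper bound on paper, limited in practice by free
space and by stale extents eventually being reclaimed:)
echo $(( 32768 * 32769 / 2 * 4 / 1024 / 1024 ))   # => 2048 GiB, i.e. about 2TiB that can hide behind a single 128mb extent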
>
> in actually size usage. Our extents are limited to 128mb in size, but
> still that ends up being pretty huge. I'm actually going to do this
> locally and see what happens. Thanks,
>
> Josef
>
* Re: btrfs is using 25% more disk than it should
2014-12-19 21:10 ` Josef Bacik
2014-12-19 21:17 ` Josef Bacik
@ 2014-12-21 3:04 ` Robert White
1 sibling, 0 replies; 22+ messages in thread
From: Robert White @ 2014-12-21 3:04 UTC (permalink / raw)
To: Josef Bacik, Daniele Testa, linux-btrfs
On 12/19/2014 01:10 PM, Josef Bacik wrote:
> On 12/18/2014 09:59 AM, Daniele Testa wrote:
>> Hey,
>>
>> I am hoping you guys can shed some light on my issue. I know that it's
>> a common question that people see differences in the "disk used" when
>> running different calculations, but I still think that my issue is
>> weird.
>>
>> root@s4 / # mount
>> /dev/md3 on /opt/drives/ssd type btrfs
>> (rw,noatime,compress=zlib,discard,nospace_cache)
>>
>> root@s4 / # btrfs filesystem df /opt/drives/ssd
>> Data: total=407.97GB, used=404.08GB
>> System, DUP: total=8.00MB, used=52.00KB
>> System: total=4.00MB, used=0.00
>> Metadata, DUP: total=1.25GB, used=672.21MB
>> Metadata: total=8.00MB, used=0.00
>>
>> root@s4 /opt/drives/ssd # ls -alhs
>> total 302G
>> 4.0K drwxr-xr-x 1 root root 42 Dec 18 14:34 .
>> 4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
>> 302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49
>> disk_208.img
>> 0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu 0 Dec 18 10:08 snapshots
>>
>> root@s4 /opt/drives/ssd # du -h
>> 0 ./snapshots
>> 302G .
>>
>> As seen above, I have a 410GB SSD mounted at "/opt/drives/ssd". On
>> that partition, I have one single starse file, taking 302GB of space
>> (max 315GB). The snapshots directory is completely empty.
>>
>> However, for some weird reason, btrfs seems to think it takes 404GB.
>> The big file is a disk that I use in a virtual server and when I write
>> stuff inside that virtual server, the disk-usage of the btrfs
>> partition on the host keeps increasing even if the sparse-file is
>> constant at 302GB. I even have 100GB of "free" disk-space inside that
>> virtual disk-file. Writing 1GB inside the virtual disk-file seems to
>> increase the usage about 4-5GB on the "outside".
>>
>> Does anyone have a clue on what is going on? How can the difference
>> and behaviour be like this when I just have one single file? Is it
>> also normal to have 672MB of metadata for a single file?
>>
>
> Hello and welcome to the wonderful world of btrfs, where COW can really
> suck hard without being super clear why! It's 4pm on a Friday right
> before I'm gone for 2 weeks so I'm a bit happy and drunk so I'm going to
> use pretty pictures. You have this case to start with
>
> file offset 0 offset 302g
> [-------------------------prealloced 302g extent----------------------]
>
> (man it's impressive I got all that lined up right)
>
> On disk you have 2 things. First your file which has file extents which
> says
>
> inode 256, file offset 0, size 302g, offset0, disk bytenr 123, disklen 302g
>
> and then in the extent tree, who keeps track of actual allocated space
> has this
>
> extent bytenr 123, len 302g, refs 1
>
> Now say you boot up your virt image and it writes 1 4k block to offset
> 0. Now you have this
>
> [4k][--------------------302g-4k--------------------------------------]
>
> And for your inode you now have this
>
> inode 256, file offset 0, size 4k, offset 0, diskbytenr (123+302g),
> disklen 4k
> inode 256, file offset 4k, size 302g-4k, offset 4k, diskbytenr 123,
> disklen 302g
>
> and in your extent tree you have
>
> extent bytenr 123, len 302g, refs 1
> extent bytenr whatever, len 4k, refs 1
>
> See that? Your file is still the same size, it is still 302g. If you
> cp'ed it right now it would copy 302g of information. But what have you
> actually allocated on disk? Well that's now 302g + 4k. Now let's say
> your virt thing decides to write to the middle, let's say at offset 12k,
> now you have this
>
> inode 256, file offset 0, size 4k, offset 0, diskbytenr (123+302g),
> disklen 4k
> inode 256, file offset 4k, size 8k, offset 4k, diskbytenr 123, disklen 302g
> inode 256, file offset 12k, size 4k, offset 0, diskbytenr whatever,
> disklen 4k
> inode 256, file offset 16k, size 302g - 16k, offset 16k, diskbytenr 123,
> disklen 302g
>
> and in the extent tree you have this
>
> extent bytenr 123, len 302g, refs 2
> extent bytenr whatever, len 4k, refs 1
> extent bytenr notimportant, len 4k, refs 1
>
> See that refs 2 change? We split the original extent, so we have 2 file
> extents pointing to the same physical extents, so we bumped the ref
> count. This will happen over and over again until we have completely
> overwritten the original extent, at which point your space usage will go
> back down to ~302g.
>
> We split big extents with cow, so unless you've got lots of space to
> spare or are going to use nodatacow you should probably not pre-allocate
> virt images. Thanks,
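
The quoted walkthrough can be sanity-checked with a toy model; the
dicts and the overwrite() helper below are purely illustrative, assume
4 KiB writes, and map to no real btrfs structure or function:

(python sketch)

# Toy rendering of the quoted walkthrough: one preallocated 302g extent
# plus two 4 KiB CoW overwrites, tracking apparent size vs. allocation.
GiB = 2**30
file_extents = [dict(file_off=0, size=302 * GiB, disk_bytenr=123)]
allocated = {123: 302 * GiB}              # extent bytenr -> bytes on disk

def overwrite(off, length, new_bytenr):
    """CoW a range: carve it out of the file extent items and point it at
    a freshly allocated extent, leaving the old allocation untouched."""
    global file_extents
    updated = []
    for fe in file_extents:
        start, end = fe["file_off"], fe["file_off"] + fe["size"]
        if end <= off or start >= off + length:
            updated.append(fe)            # not touched by this write
            continue
        if start < off:                   # surviving head still points at 123
            updated.append(dict(fe, size=off - start))
        if off + length < end:            # surviving tail still points at 123
            updated.append(dict(fe, file_off=off + length,
                                size=end - (off + length)))
    updated.append(dict(file_off=off, size=length, disk_bytenr=new_bytenr))
    file_extents = sorted(updated, key=lambda fe: fe["file_off"])
    allocated[new_bytenr] = length        # the old extent stays fully allocated

overwrite(0, 4096, "A")                   # 4k write at offset 0
overwrite(12 * 1024, 4096, "B")           # 4k write at offset 12k
print("apparent file size:", sum(fe["size"] for fe in file_extents))  # 302g
print("allocated on disk :", sum(allocated.values()))                 # 302g + 8k

(end python sketch)

Both prints show the point of the walkthrough: the apparent file size
stays at 302g while the allocation has grown to 302g plus 8k.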
Still too new to the code base to offer much other than pseudocode...
Is it "easy" to find all the inodes that are using a particular extent
at runtime?
It occurs to me that since every extent starts life with exactly one
owner, a scrupulous breaking of extents can prevent the unbounded
left-overlap problem...
If the preexisting extent is always broken up into two or three new
extents wherever it's being referenced, then problematic overlaps are
eliminated and dead data can be discarded as soon as it's actually dead.
So in the exemplar case
'.' == preexisting extent
'+' == new written extent
'-' == preexisting described by new extent records
The core operations: multiple lines are used because the brackets
overlap in the ASCII art. 8-)
case 1:
[................]
     [++++]
     [----]
[---]      [-----]
case 2: (new write overlaps the tail of the old extent and runs past its end)
[................]
            [+++++++++]
            [----]
[----------]
case 3: (new write overlaps the head of the old extent and starts before it)
      [................]
[+++++++++]
      [---]
           [-----------]
case 4: (trivial, extent is just derefed by 1)
[....]
[++++++++++++++++]
I am going to introduce the word "shatter" for convenience. We will be
"shattering" the existing extent etc.
So:

Ignoring for a time the existing filesystems with problematic layouts,
which will continue with the (n^2+n)/2 worst case, we know a few things.
For all files, there exists no sliding window over storage. That is,
there is no ioctl() to discard the leading N bytes of a file by just
moving the various offsets inside the inode-specific reference. Nor is
there an ioctl to insert data at the front of the file. Both of these
operations would be "easy" to create in BTRFS, but they do not exist at
this time.
All users of an extent, not counting theoretical deduplication, follow
from a single original allocation via reflink, clone, or snapshot.
IFF all extents were always shattered when _any_ file using them had an
overwrite event THEN all references to extentX would have a reference
using an extent-offset of zero. That is, breaking up extents would result in:
inode 256, file offset 12k, size 4k, offset 0, diskbytenr whatever,
disklen same-as-size
So at the time an extent is shattered, iff all the other users of the
extent can be found easily, a fairly cheap per-inode substitution can be
computed and performed.
(pseudocode)

transaction_start;

foreach existing_extent overlapped by new_extent
do
    new_set peer_users = (all referencing inodes but self)
    new_set fragments_other = (empty)
    new_set fragments_self = (empty)

    if (existing.start < overlap.start) then
        left_extent = new_extent_map[existing.start, overlap.start)
        fragments_self += left_extent
        fragments_other += left_extent
        left = overlap.start
    else
        left = existing.start
    fi

    if (overlap.end < existing.end) then
        right = overlap.end
        right_extent = new_extent_map[overlap.end, existing.end)
        fragments_self += right_extent
        fragments_other += right_extent
    else
        right = existing.end
    fi

    old_fragment = new_extent_map[left, right)
    if (old_fragment != existing_extent) then
        fragments_other += old_fragment
    fi

    if (not_empty(fragments_other) and not_empty(peer_users)) then
        foreach peer_user do
            replace_extent(peer_user, existing_extent, fragments_other)
        done
    fi

    replace_extent(self, existing_extent, fragments_self)
done

add_extent(self, new_extent)

transaction_end;

(end pseudocode)
Not optimized (no point in assembling fragments_other if there are no
peers for example) but it should be logically correct if I didn't make
some first-year error. 8-)
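
For something executable, the same splitting geometry can be sketched in
Python on half-open byte ranges; shatter() and the tuples below are
placeholders for illustration only and correspond to nothing in the
btrfs code:

(python sketch)

# Minimal model of the "shatter" step on half-open [start, end) ranges.
# Only the geometry is modelled; refcounts, trees and transactions are not.

def shatter(existing, new):
    """Split `existing` around the overwrite `new`.

    Returns (fragments_self, fragments_other):
      fragments_self  - pieces of the old extent the writer still needs
      fragments_other - pieces every peer user should reference instead of
                        the whole old extent; empty means the old extent was
                        completely spanned and peers are left alone
    """
    lo, hi = max(existing[0], new[0]), min(existing[1], new[1])
    if lo >= hi:
        return None                      # no overlap, nothing to shatter

    fragments_self, fragments_other = [], []
    if existing[0] < lo:                 # surviving head of the old extent
        fragments_self.append((existing[0], lo))
        fragments_other.append((existing[0], lo))
    if hi < existing[1]:                 # surviving tail of the old extent
        fragments_self.append((hi, existing[1]))
        fragments_other.append((hi, existing[1]))
    if fragments_other:                  # peers still reference the middle
        fragments_other.append((lo, hi))
        fragments_other.sort()
    return fragments_self, fragments_other

# Case 1 above: a write in the middle of an extent.
print(shatter((0, 16), (5, 9)))   # ([(0, 5), (9, 16)], [(0, 5), (5, 9), (9, 16)])
# Case 4 above: the old extent is completely spanned.
print(shatter((0, 4), (0, 16)))   # ([], [])

(end python sketch)

The empty second list in the last call is the "completely spanned" case
described just below: the writer simply drops its reference and peers
keep the old extent as-is.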
In practical terms of complexity, the leading edge of a new extent can
either lie on an existing extent boundary or somewhere in the heart of
an extent. The trailing edge can exist within, upon, or beyond the end
of an existing extent ("beyond the end" being a proper append to a file).
Any extent that is completely spanned by the new extent will get
dereferenced by replace_extent(self, existing, {}), i.e. the empty set,
and skipped over entirely because "fragments_other" would be empty.
Any extent otherwise split will be split everywhere.
The new_extent is never split, so we tend to optimize layout until
further overwrite.
Deep divergence, such as large extents that should previously have been
shattered elsewhere, just sort of happens when the search for peers
doesn't find the necessary match to add those inodes to the peer list.
Cost should be manageable since it really only affects zero, one, or two
existing extents, but cost does scale with the number of peers.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: btrfs is using 25% more disk than it should
2014-12-20 11:28 ` Josef Bacik
@ 2014-12-23 21:51 ` Zygo Blaxell
0 siblings, 0 replies; 22+ messages in thread
From: Zygo Blaxell @ 2014-12-23 21:51 UTC (permalink / raw)
To: Josef Bacik; +Cc: Daniele Testa, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 1813 bytes --]
On Sat, Dec 20, 2014 at 06:28:22AM -0500, Josef Bacik wrote:
> We now have two extents
> with the same bytenr but with different lengths.
>[...]
> Then there is the problem of actually returning the free space. Now
> if we drop all of the refs for an extent we know the space is free
> and we return it to the allocator. With the above example we can't
> do that anymore, we have to check the extent tree for any area that
> is left overlapping the area we just freed. This adds another
> search to every btrfs_free_extent operation, which slows the whole
> system down and again leaves us with weird corner cases and pain for
> the users. Plus this would be an incompatible format change, so it
> would require setting a feature flag in the fs and being adopted
> voluntarily.
Ouchie.
> Now I have another solution, but I'm not convinced it's awesome
> either. Take the same example above, but instead we split the
> original extent in the extent tree so we avoid all the mess of
> having overlapping ranges
Would this work for a read-only snapshot? For a read-write snapshot
it would be as if we had modified both (or all, if there are multiple
snapshots) versions of the tree with split extents.
> This wouldn't require a format change so everybody would get
> this behaviour as soon as we turned it on
It could be a mount option, like autodefrag, off by default until the
bugs were worked out.
Arguably there could be a 'garbage-collection tool' similar to 'btrfs
fi defrag' that could be used to clean out any large partially-obscured
extents from specific files. This might be important for deduplication
as well (although the extent-same code looks like it does split extents?).
Definitely something to think about. Thanks for the detailed
explanations.
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2014-12-23 21:51 UTC | newest]
Thread overview: 22+ messages
2014-12-18 14:59 btrfs is using 25% more disk than it should Daniele Testa
2014-12-19 18:53 ` Phillip Susi
2014-12-19 19:59 ` Daniele Testa
2014-12-19 20:35 ` Phillip Susi
2014-12-19 21:15 ` Josef Bacik
2014-12-19 21:53 ` Phillip Susi
2014-12-19 22:06 ` Josef Bacik
2014-12-20 1:33 ` Duncan
2014-12-19 21:10 ` Josef Bacik
2014-12-19 21:17 ` Josef Bacik
2014-12-20 1:38 ` Duncan
2014-12-20 5:52 ` Zygo Blaxell
2014-12-20 6:18 ` Daniele Testa
2014-12-20 6:59 ` Duncan
2014-12-20 11:02 ` Josef Bacik
2014-12-20 11:28 ` Josef Bacik
2014-12-23 21:51 ` Zygo Blaxell
2014-12-20 9:15 ` Daniele Testa
2014-12-20 11:23 ` Robert White
2014-12-20 11:39 ` Josef Bacik
2014-12-21 1:40 ` Robert White
2014-12-21 3:04 ` Robert White