* new database files not compressed
From: Hamish Moffatt @ 2020-08-30 9:35 UTC (permalink / raw)
To: linux-btrfs
I am trying to store Firebird database files compressed on btrfs.
Although I have mounted the file system with -o compress-force, new
files created by Firebird are not being compressed according to
compsize. If I copy them, or use btrfs filesystem defrag, they compress
well.
Other files seem to be compressed automatically OK. Why are the Firebird
files different?
$ uname -a
Linux packer-debian-10-amd64 5.7.0-3-amd64 #1 SMP Debian 5.7.17-1
(2020-08-23) x86_64 GNU/Linux
$ sudo mkfs.btrfs -m single -f /dev/sdb
btrfs-progs v4.20.1
See http://btrfs.wiki.kernel.org for more information.
Label: (null)
UUID: 949f7f3f-681b-40fb-97b9-522756d5d619
Node size: 16384
Sector size: 4096
Filesystem size: 29.98GiB
Block group profiles:
Data: single 8.00MiB
Metadata: single 8.00MiB
System: single 4.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Number of devices: 1
Devices:
ID SIZE PATH
1 29.98GiB /dev/sdb
$ sudo mount -o compress-force /dev/sdb /mnt/test
$ sudo mkdir /mnt/test/db
$ cd /mnt/test/db
Now I restore a backup to create a database:
$ zcat ~/*.zip | gbak -REP stdin test.fdb
$ sudo compsize test.fdb
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 182M 182M 175M
none 100% 182M 182M 175M
$ cat test.fdb > test2.fdb
$ sudo compsize test2.fdb
Type Perc Disk Usage Uncompressed Referenced
TOTAL 10% 18M 175M 175M
zlib 10% 18M 175M 175M
The same thing occurs if I create a brand new database:
$ isql-fb
Use CONNECT or CREATE DATABASE to specify a database
SQL> create database 'test3.fdb';
SQL> ^D$
$ sync
$ sudo compsize test3.fdb
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 784K 784K 784K
none 100% 784K 784K 784K
$ cp test3.fdb test4.fdb
$ sync
$ sudo compsize test4.fdb
Type Perc Disk Usage Uncompressed Referenced
TOTAL 7% 60K 784K 784K
zlib 7% 60K 784K 784K
If I create a database with SQLite it is compressed:
$ cat test.sql
create table test ( id integer primary key asc autoincrement, timestamp
text default (datetime()), data text);
$ sqlite3 foo.db < test.sql
$ sudo compsize foo.db
Type Perc Disk Usage Uncompressed Referenced
TOTAL 33% 4.0K 12K 12K
zlib 33% 4.0K 12K 12K
I ran isql-fb under strace to see if there is something special in the
open(2) flags:
$ strace -o trace isql-fb
Use CONNECT or CREATE DATABASE to specify a database
SQL> create database 'new.fdb';
SQL> ^D$
$ grep new.fdb trace
readlink("/mnt/test/db/new.fdb", 0x7ffd9cf70810, 4096) = -1 ENOENT (No
such file or directory)
stat("/mnt/test/db/new.fdb", 0x7ffd9cf705b0) = -1 ENOENT (No such file
or directory)
stat("/mnt/test/db/new.fdb", 0x7ffd9cf71480) = -1 ENOENT (No such file
or directory)
stat("/mnt/test/db/new.fdb", 0x7ffd9cf70e40) = -1 ENOENT (No such file
or directory)
openat(AT_FDCWD, "/mnt/test/db/new.fdb", O_RDWR) = -1 ENOENT (No such
file or directory)
openat(AT_FDCWD, "/mnt/test/db/new.fdb", O_RDONLY) = -1 ENOENT (No such
file or directory)
readlink("/mnt/test/db/new.fdb", 0x7ffd9cf70800, 4096) = -1 ENOENT (No
such file or directory)
stat("/mnt/test/db/new.fdb", 0x7ffd9cf705a0) = -1 ENOENT (No such file
or directory)
stat("/mnt/test/db/new.fdb", 0x7ffd9cf71470) = -1 ENOENT (No such file
or directory)
stat("/mnt/test/db/new.fdb", 0x7ffd9cf70dc0) = -1 ENOENT (No such file
or directory)
stat("/mnt/test/db/new.fdb", 0x7ffd9cf70eb0) = -1 ENOENT (No such file
or directory)
openat(AT_FDCWD, "/mnt/test/db/new.fdb", O_RDWR|O_CREAT|O_EXCL, 0666) = 6
readlink("/mnt/test/db/new.fdb", 0x7ffd9cf6ff00, 4096) = -1 EINVAL
(Invalid argument)
thanks
Hamish
* Re: new database files not compressed
From: Eric Wong @ 2020-08-31 2:20 UTC (permalink / raw)
To: Hamish Moffatt; +Cc: linux-btrfs

Hamish Moffatt <hamish-btrfs@moffatt.email> wrote:
> I am trying to store Firebird database files compressed on btrfs. Although I
> have mounted the file system with -o compress-force, new files created by
> Firebird are not being compressed according to compsize. If I copy them, or
> use btrfs filesystem defrag, they compress well.
>
> Other files seem to be compressed automatically OK. Why are the Firebird
> files different?

Maybe Firebird creates the DB with the No_COW attribute?
"lsattr -l /path/to/file" to check.

I don't know much about Firebird, but No_COW is pretty much
required for big databases, VM images, etc., which are subject to
random writes. Unfortunately, neither compression nor
checksumming is available with No_COW set.

Big SQLite and Xapian DBs gave me trouble even on an SSD before
I recreated them with No_COW. Small DBs can probably get away
with autodefrag.
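The `lsattr` check Eric suggests can also be done programmatically, via the same flags ioctl that `lsattr` itself uses. A minimal sketch, assuming 64-bit Linux (the `FS_IOC_GETFLAGS` request number below is architecture-dependent, and the function name is hypothetical):

```python
import fcntl
import os
import struct
import tempfile

FS_IOC_GETFLAGS = 0x80086601  # _IOR('f', 1, long); value assumed for 64-bit Linux
FS_NOCOW_FL = 0x00800000      # the "C" / No_COW attribute bit from linux/fs.h

def is_nocow(path):
    """True/False for the No_COW flag, or None when the filesystem
    does not support the flags ioctl (roughly: the 'C' column of
    `lsattr path`)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
        return bool(struct.unpack("l", buf)[0] & FS_NOCOW_FL)
    except OSError:
        return None  # e.g. ENOTTY on filesystems without attribute support
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile() as f:
    print(is_nocow(f.name))
```

On btrfs, a freshly created file reports `False` unless its directory carries the `C` attribute (set with `chattr +C`), which new files inherit.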
* Re: new database files not compressed
From: Hamish Moffatt @ 2020-08-31 2:44 UTC (permalink / raw)
To: Eric Wong, Hamish Moffatt; +Cc: linux-btrfs

On 31/8/20 12:20 pm, Eric Wong wrote:
> Maybe Firebird creates the DB with the No_COW attribute?
> "lsattr -l /path/to/file" to check.
>
> <SNIP>
>
> Big SQLite and Xapian DBs gave me trouble even on an SSD before
> I recreated them with No_COW. Small DBs can probably get away
> with autodefrag.

I don't see anything in the lsattr output:

$ isql-fb
Use CONNECT or CREATE DATABASE to specify a database
SQL> create database 'foo.fdb';
SQL> ^D
$ lsattr foo.fdb
------------------- foo.fdb
$ lsattr -l foo.fdb
foo.fdb ---
$ sudo compsize foo.fdb
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 768K 768K 768K
none 100% 768K 768K 768K

Still, if you're telling me that trying to use compression on database
files is a bad idea, thanks for the warning and I'll let it be.

Hamish
* Re: new database files not compressed
From: A L @ 2020-08-31 3:15 UTC (permalink / raw)
To: Eric Wong, Hamish Moffatt; +Cc: linux-btrfs

---- From: Eric Wong <e@80x24.org> -- Sent: 2020-08-31 - 04:20 ----
> Maybe Firebird creates the DB with the No_COW attribute?
> "lsattr -l /path/to/file" to check.

It could also be that it is using direct I/O. Direct I/O prevents
checksums and compression too.

> I don't know much about Firebird, but No_COW is pretty much
> required for big databases, VM images, etc., which are subject to
> random writes. Unfortunately, neither compression nor
> checksumming is available with No_COW set.

I'd not agree with this in general. Nodatacow can help in the case
where you are really bottlenecked by disk I/O, but I think the general
recommendation to use nocow is dangerous, as it reduces the integrity
of the filesystem for those files.

> Big SQLite and Xapian DBs gave me trouble even on an SSD before
> I recreated them with No_COW. Small DBs can probably get away
> with autodefrag.

This mostly depends on your application workload, not the size of the
files. I found that with MariaDB/InnoDB it is possible to adjust its
settings to achieve good performance on btrfs. I use both VM images and
SQL databases on btrfs with full CoW without issues.
* Re: new database files not compressed
From: Zygo Blaxell @ 2020-08-31 3:47 UTC (permalink / raw)
To: Hamish Moffatt; +Cc: linux-btrfs

On Sun, Aug 30, 2020 at 07:35:59PM +1000, Hamish Moffatt wrote:
> I am trying to store Firebird database files compressed on btrfs. Although I
> have mounted the file system with -o compress-force, new files created by
> Firebird are not being compressed according to compsize. If I copy them, or
> use btrfs filesystem defrag, they compress well.
>
> Other files seem to be compressed automatically OK. Why are the Firebird
> files different?

If it is writing single 4K blocks with fsync() between writes, or writing
4K blocks to discontiguous file offsets, then the extents will be 4K
and there can be no compression.

Allocation is in 4K blocks (with default mkfs options on popular CPUs).
To save any space, compression must reduce the size of an extent by at
least 4K. A 4K extent can't be compressed because even a single bit of
compressed output would round the extent size back up to 4K, resulting
in no size reduction on disk.

8K extents can be compressed if the compression ratio is 50% or higher,
12K extents can be compressed if the ratio is at least 33%, 16K extents
can be compressed if the ratio is at least 25%, and so on. Larger writes
are better for compression.

Defrag and copies are able to compress because they write contiguously up
to the maximum compressed extent size of 128K; however, after defrag,
small random writes will not release the large contiguous extents,
and total space usage reported by compsize can reach over 100% of the
original uncompressed file size. With nodatacow (and no compression)
the disk usage of the database remains stable at 100% of the file size.

With defrag and compression the disk usage varies from the best compressed
size to (size_of_compressed_database + uncompressed_file_size) over time.
e.g. if you have a 50% compression ratio on a 1MB database then the disk
usage varies from 512K immediately after defrag to a maximum of 1502K
in the worst case (out of every 32 blocks, 31 are written in separate
transactions, which leaves references in the file to all of the compressed
extents, and adds 31 uncompressed 4K extents for each compressed extent).
This means that if you want to keep a database compressed with a 4K
database page size, you have to run defrag frequently.

Another way to get compression is to increase the database page size.
Sizes up to 128K are useful--128K is the maximum btrfs compressed extent
size, and increasing the database page size higher will have no further
compression benefit. Most databases I've encountered max out at 64K
pages, but even 64K gives some compression.

> $ uname -a
> Linux packer-debian-10-amd64 5.7.0-3-amd64 #1 SMP Debian 5.7.17-1
> (2020-08-23) x86_64 GNU/Linux
>
> <SNIP>
>
> thanks
>
> Hamish
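Zygo's arithmetic can be stated compactly: with a 4K sector size, compression only pays off when it shrinks an extent by at least one whole sector, so the minimum required savings fraction falls as the extent grows. A small sketch of that relationship (the function name is illustrative, not part of btrfs):

```python
SECTOR = 4096  # 4K sector size, per the default mkfs options in this thread

def min_saving(extent_bytes):
    """Minimum fraction of an extent that compression must save for
    btrfs to store it compressed: savings are only realized in whole
    4K sectors, so at least one sector must be eliminated."""
    if extent_bytes <= SECTOR:
        return None  # a 4K extent can never be stored compressed
    return SECTOR / extent_bytes

for size in (8192, 12288, 16384, 131072):
    print(f"{size // 1024}K extent needs >= {min_saving(size):.0%} saving")
```

This reproduces the thread's numbers: 8K needs 50%, 12K needs 33%, 16K needs 25%, and a full 128K extent needs only about 3%.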
* Re: new database files not compressed
From: Hamish Moffatt @ 2020-08-31 8:53 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

On 31/8/20 1:47 pm, Zygo Blaxell wrote:
> If it is writing single 4K blocks with fsync() between writes, or writing
> 4K blocks to discontiguous file offsets, then the extents will be 4K
> and there can be no compression.
>
> <SNIP>
>
> Another way to get compression is to increase the database page size.
> Sizes up to 128K are useful--128K is the maximum btrfs compressed extent
> size, and increasing the database page size higher will have no further
> compression benefit. Most databases I've encountered max out at 64K
> pages, but even 64K gives some compression.

Understood. Thanks for this explanation.

Perhaps I'm missing something more fundamental, because I don't seem to
get compression even if I create a file full of zeroes with dd:

$ sudo mount -O compress-force=zstd /dev/sdb /mnt/test
$ cd /mnt/test/db
$ dd if=/dev/zero of=zero bs=16k count=1024
1024+0 records in
1024+0 records out
16777216 bytes (17 MB, 16 MiB) copied, 0.0154404 s, 1.1 GB/s
$ sudo compsize zero
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 16M 16M 16M
none 100% 16M 16M 16M
$ sudo btrfs fi defrag -czstd zero
$ sudo compsize zero
Type Perc Disk Usage Uncompressed Referenced
TOTAL 3% 512K 16M 16M
zstd 3% 512K 16M 16M

I did try my Firebird tests with a 16k database page size and didn't
see any compression there either.

Hamish
* Re: new database files not compressed
From: Nikolay Borisov @ 2020-08-31 9:25 UTC (permalink / raw)
To: Hamish Moffatt, Zygo Blaxell; +Cc: linux-btrfs

On 31.08.20 г. 11:53 ч., Hamish Moffatt wrote:
> <SNIP>
>
> Perhaps I'm missing something more fundamental, because I don't seem to
> get compression even if I create a file full of zeroes with dd:
>
> $ sudo mount -O compress-force=zstd /dev/sdb /mnt/test
> $ cd /mnt/test/db
> $ dd if=/dev/zero of=zero bs=16k count=1024
> 1024+0 records in
> 1024+0 records out
> 16777216 bytes (17 MB, 16 MiB) copied, 0.0154404 s, 1.1 GB/s
> $ sudo compsize zero
> Type Perc Disk Usage Uncompressed Referenced
> TOTAL 100% 16M 16M 16M
> none 100% 16M 16M 16M
> $ sudo btrfs fi defrag -czstd zero
> $ sudo compsize zero
> Type Perc Disk Usage Uncompressed Referenced
> TOTAL 3% 512K 16M 16M
> zstd 3% 512K 16M 16M
>
> I did try my Firebird tests with a 16k database page size and didn't
> see any compression there either.

Doing the following test:

root@ubuntu18:~# mount -O compress-force=zstd /dev/vdc /media/scratch/
root@ubuntu18:~# rm -rf /media/scratch/zero
root@ubuntu18:~# dd if=/dev/zero of=/media/scratch/zero bs=16k count=1024
sync
btrfs inspect-internal dump-tree -t5 /dev/vdc

results in:

item 6 key (259 EXTENT_DATA 0) itemoff 15816 itemsize 53
generation 12 type 1 (regular)
extent data disk byte 315621376 nr 4096
extent data offset 0 nr 131072 ram 131072
extent compression 3 (zstd)
item 7 key (259 EXTENT_DATA 131072) itemoff 15763 itemsize 53
generation 12 type 1 (regular)
extent data disk byte 315625472 nr 4096
extent data offset 0 nr 131072 ram 131072
extent compression 3 (zstd)
item 8 key (259 EXTENT_DATA 262144) itemoff 15710 itemsize 53
generation 12 type 1 (regular)
extent data disk byte 315629568 nr 4096
extent data offset 0 nr 131072 ram 131072
extent compression 3 (zstd)

I.e. a bunch of 128K extents, which in fact take only 4K on disk each.

Whereas if I write the same file but without the compress-force mount
option I get:

item 138 key (260 EXTENT_DATA 0) itemoff 8787 itemsize 53
generation 14 type 1 (regular)
extent data disk byte 298844160 nr 16777216
extent data offset 0 nr 16777216 ram 16777216
extent compression 0 (none)

I.e. a single extent, 16M in size. So instead of using this compsize
utility or whatever it is, can you dump the state of the filesystem as
per the btrfs inspect-internal command shown above?
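The dump-tree output and compsize report the same thing from different angles: summing each extent's on-disk bytes (`disk byte ... nr`) against its uncompressed bytes (`ram`) reproduces compsize's percentage. A sketch using the extent shape from Nikolay's zstd test, scaled to the 16 MiB file (128 extents of 128K, each occupying one 4K sector; the list of pairs is a stand-in for parsed dump-tree items):

```python
# Each (disk_bytes, ram_bytes) pair mirrors one EXTENT_DATA item from
# `btrfs inspect-internal dump-tree`: 4K on disk holding 128K of file data.
extents = [(4096, 131072)] * 128  # 16 MiB file of zstd-compressed extents

disk = sum(d for d, _ in extents)  # total on-disk usage
ram = sum(r for _, r in extents)   # total uncompressed size

print(f"disk={disk // 1024}K uncompressed={ram // 1024}K perc={disk / ram:.0%}")
# prints: disk=512K uncompressed=16384K perc=3%
```

The 3% here matches the `TOTAL 3% 512K 16M` line compsize printed after the defrag in the previous message.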
* Re: new database files not compressed
From: Hamish Moffatt @ 2020-08-31 10:40 UTC (permalink / raw)
To: Nikolay Borisov, Zygo Blaxell; +Cc: linux-btrfs

On 31/8/20 7:25 pm, Nikolay Borisov wrote:
> <SNIP>
>
> I.e. a single extent, 16M in size. So instead of using this compsize
> utility or whatever it is, can you dump the state of the filesystem as
> per the btrfs inspect-internal command shown above?

I used compsize as recommended by the wiki page:
https://btrfs.wiki.kernel.org/index.php/Compression#How_can_I_determine_compressed_size_of_a_file.3F

I created a new file system, ran dd bs=32k count=1024, then the dump
shows me items all the way up to item 161: https://pastebin.com/0MPPdqqM

item 160 key (258 EXTENT_DATA 33292288) itemoff 7750 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 14671872 nr 4096
extent data offset 0 nr 131072 ram 131072
extent compression 3 (zstd)
item 161 key (258 EXTENT_DATA 33423360) itemoff 7697 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 14675968 nr 4096
extent data offset 0 nr 131072 ram 131072
extent compression 3 (zstd)
total bytes 32196018176
bytes used 1245184
uuid 53eaed91-1060-445b-bf7c-5efe9917adbc

From mount: /dev/sdb on /mnt/test type btrfs
(rw,relatime,compress-force=zstd:3,space_cache,subvolid=5,subvol=/)

Hamish
* Re: new database files not compressed
From: Nikolay Borisov @ 2020-08-31 10:47 UTC (permalink / raw)
To: Hamish Moffatt, Zygo Blaxell; +Cc: linux-btrfs

<SNIP>

> I created a new file system, ran dd bs=32k count=1024, then the dump
> shows me items all the way up to item 161: https://pastebin.com/0MPPdqqM
>
> item 160 key (258 EXTENT_DATA 33292288) itemoff 7750 itemsize 53
> generation 7 type 1 (regular)
> extent data disk byte 14671872 nr 4096
> extent data offset 0 nr 131072 ram 131072
> extent compression 3 (zstd)
> item 161 key (258 EXTENT_DATA 33423360) itemoff 7697 itemsize 53
> generation 7 type 1 (regular)
> extent data disk byte 14675968 nr 4096
> extent data offset 0 nr 131072 ram 131072
> extent compression 3 (zstd)
> total bytes 32196018176
> bytes used 1245184
> uuid 53eaed91-1060-445b-bf7c-5efe9917adbc

This output indicates that the extents are compressed; what does
compsize report on this file?
* Re: new database files not compressed
From: Hamish Moffatt @ 2020-08-31 12:56 UTC (permalink / raw)
To: Nikolay Borisov, Zygo Blaxell

On 31/8/20 8:47 pm, Nikolay Borisov wrote:
> This output indicates that the extents are compressed; what does
> compsize report on this file?

Yes, it is right actually; I had the wrong mount options before, as
Roman observed.

$ sudo compsize zeroes
Type Perc Disk Usage Uncompressed Referenced
TOTAL 3% 1.0M 32M 32M
zstd 3% 1.0M 32M 32M

But it is still not working for Firebird.

$ isql-fb
Use CONNECT or CREATE DATABASE to specify a database
SQL> create database 'test3.fdb' page_size 16384;
SQL> ^D
$ sudo compsize test3.fdb
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 2.2M 2.2M 2.2M
none 100% 2.2M 2.2M 2.2M

Hamish
* Re: new database files not compressed
From: Roman Mamedov @ 2020-08-31 11:15 UTC (permalink / raw)
To: Hamish Moffatt; +Cc: Zygo Blaxell, linux-btrfs

On Mon, 31 Aug 2020 18:53:54 +1000
Hamish Moffatt <hamish-btrfs@moffatt.email> wrote:

> $ sudo mount -O compress-force=zstd /dev/sdb /mnt/test

Specifying the filesystem mount options is done with -o, not -O.
See "man mount".

--
With respect,
Roman
* Re: new database files not compressed
From: Hamish Moffatt @ 2020-08-31 12:54 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Zygo Blaxell, linux-btrfs

On 31/8/20 9:15 pm, Roman Mamedov wrote:
> On Mon, 31 Aug 2020 18:53:54 +1000
> Hamish Moffatt <hamish-btrfs@moffatt.email> wrote:
>
>> $ sudo mount -O compress-force=zstd /dev/sdb /mnt/test
>
> Specifying the filesystem mount options is done with -o, not -O.
> See "man mount".

Argh, I messed up the dd test. It does work for dd from /dev/zero when
mounted properly. It still doesn't compress my Firebird files though.

$ mount | grep btrfs
/dev/sdb on /mnt/test type btrfs
(rw,relatime,compress-force=zstd:3,space_cache,subvolid=5,subvol=/)
$ dd if=/dev/zero of=zeroes bs=32k count=1024
1024+0 records in
1024+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 0.0244041 s, 1.4 GB/s
$ sudo compsize zeroes
Type Perc Disk Usage Uncompressed Referenced
TOTAL 3% 1.0M 32M 32M
zstd 3% 1.0M 32M 32M
$ zcat ~/*.zip | gbak -REP -page 16384 stdin test2.fdb
$ sudo compsize test2.fdb
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 194M 194M 191M
none 100% 194M 194M 191M
$ isql-fb
Use CONNECT or CREATE DATABASE to specify a database
SQL> create database 'test3.fdb' page_size 16384;
SQL> ^D
$ sudo compsize test3.fdb
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 2.2M 2.2M 2.2M
none 100% 2.2M 2.2M 2.2M

Hamish
* Re: new database files not compressed
2020-08-31 12:54 ` Hamish Moffatt
@ 2020-08-31 12:57 ` Nikolay Borisov
2020-08-31 23:50 ` Hamish Moffatt
0 siblings, 1 reply; 31+ messages in thread
From: Nikolay Borisov @ 2020-08-31 12:57 UTC (permalink / raw)
To: Hamish Moffatt, Roman Mamedov; +Cc: Zygo Blaxell, linux-btrfs

On 31.08.20 г. 15:54 ч., Hamish Moffatt wrote:
> On 31/8/20 9:15 pm, Roman Mamedov wrote:
>> On Mon, 31 Aug 2020 18:53:54 +1000
>> Hamish Moffatt <hamish-btrfs@moffatt.email> wrote:
>>
>>> $ sudo mount -O compress-force=zstd /dev/sdb /mnt/test
>> Specifying the filesystem mount options is done with -o, not -O.
>> See "man mount".
>>
> Argh, I messed up the dd test. It does work for dd from /dev/zeroes when
> mounted properly. It still doesn't compress my Firebird files though.
>
>
> $ mount | grep btrfs
> /dev/sdb on /mnt/test type btrfs
> (rw,relatime,compress-force=zstd:3,space_cache,subvolid=5,subvol=/)
> $ dd if=/dev/zero of=zeroes bs=32k count=1024
> 1024+0 records in
> 1024+0 records out
> 33554432 bytes (34 MB, 32 MiB) copied, 0.0244041 s, 1.4 GB/s
> $ sudo compsize zeroes
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL        3%      1.0M          32M          32M
> zstd         3%      1.0M          32M          32M
> $ zcat ~/*.zip | gbak -REP -page 16384 stdin test2.fdb
> $ sudo compsize test2.fdb
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      194M         194M         191M
> none       100%      194M         194M         191M
> $ isql-fb
> Use CONNECT or CREATE DATABASE to specify a database
> SQL> create database 'test3.fdb' page_size 16384;
> SQL> ^D
> $ sudo compsize test3.fdb
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      2.2M         2.2M         2.2M
> none       100%      2.2M         2.2M         2.2M
>

This means the data being passed to btrfs is not compressible. I.e.
after compression the data is not smaller than the original input data.

>
> Hamish
>

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-08-31 12:57 ` Nikolay Borisov @ 2020-08-31 23:50 ` Hamish Moffatt 2020-09-01 5:15 ` Nikolay Borisov 0 siblings, 1 reply; 31+ messages in thread From: Hamish Moffatt @ 2020-08-31 23:50 UTC (permalink / raw) To: Nikolay Borisov, Roman Mamedov; +Cc: Zygo Blaxell, linux-btrfs On 31/8/20 10:57 pm, Nikolay Borisov wrote: > > This means the data being passed to btrfs is not compressible. I.e after > coompression the data is not smaller than the original, input data. It is though - if I copy it, or run defrag, it compresses very well: $ mount | grep btrfs /dev/sdb on /mnt/test type btrfs (rw,relatime,compress-force=zstd:3,space_cache,subvolid=5,subvol=/) $ zcat ~/*.zip | gbak -REP -page 16384 stdin test2.fdb $ sudo compsize test2.fdb Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 194M 194M 191M none 100% 194M 194M 191M $ dd if=test2.fdb of=test2.fdb2 bs=16k 12250+0 records in 12250+0 records out 200704000 bytes (201 MB, 191 MiB) copied, 0.151375 s, 1.3 GB/s $ sync $ sudo compsize test2.fdb2 Type Perc Disk Usage Uncompressed Referenced TOTAL 8% 17M 191M 191M zstd 8% 17M 191M 191M $ sudo btrfs fi defrag -czstd test2.fdb $ sudo compsize test2.fdb Type Perc Disk Usage Uncompressed Referenced TOTAL 8% 17M 191M 191M zstd 8% 17M 191M 191M So it must be something about how Firebird is creating or writing the file, as Zygo wrote. I set the page size to 16k (default 4k) and although I see Firebird making 16k writes, it is not affecting the result. Hamish ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-08-31 23:50 ` Hamish Moffatt @ 2020-09-01 5:15 ` Nikolay Borisov 2020-09-01 8:55 ` Hamish Moffatt 0 siblings, 1 reply; 31+ messages in thread From: Nikolay Borisov @ 2020-09-01 5:15 UTC (permalink / raw) To: Hamish Moffatt, Roman Mamedov; +Cc: Zygo Blaxell, linux-btrfs On 1.09.20 г. 2:50 ч., Hamish Moffatt wrote: > On 31/8/20 10:57 pm, Nikolay Borisov wrote: >> >> This means the data being passed to btrfs is not compressible. I.e after >> coompression the data is not smaller than the original, input data. > > It is though - if I copy it, or run defrag, it compresses very well: > > As Zygo explained - with 16k writes you'd need at least 25% compression in order for btrfs to deem it useful. If firebird's 16k writes are not 25% compressible then it won't compress. It also depends on whether it issues fsync after every write to ensure consistency meaning it won't allow more data to accumulate. > $ mount | grep btrfs > /dev/sdb on /mnt/test type btrfs > (rw,relatime,compress-force=zstd:3,space_cache,subvolid=5,subvol=/) > $ zcat ~/*.zip | gbak -REP -page 16384 stdin test2.fdb > $ sudo compsize test2.fdb > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 194M 194M 191M > none 100% 194M 194M 191M > > $ dd if=test2.fdb of=test2.fdb2 bs=16k > 12250+0 records in > 12250+0 records out > 200704000 bytes (201 MB, 191 MiB) copied, 0.151375 s, 1.3 GB/s > $ sync > $ sudo compsize test2.fdb2 > Type Perc Disk Usage Uncompressed Referenced > TOTAL 8% 17M 191M 191M > zstd 8% 17M 191M 191M > > $ sudo btrfs fi defrag -czstd test2.fdb > $ sudo compsize test2.fdb > Type Perc Disk Usage Uncompressed Referenced > TOTAL 8% 17M 191M 191M > zstd 8% 17M 191M 191M > > > So it must be something about how Firebird is creating or writing the > file, as Zygo wrote. I set the page size to 16k (default 4k) and > although I see Firebird making 16k writes, it is not affecting the result. 
> > > Hamish > ^ permalink raw reply [flat|nested] 31+ messages in thread
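[Editorial note: the 25% figure follows from write granularity — a 16k write can only save whole 4k sectors, so compression must shrink it by at least one sector (16k to 12k or less) before keeping the compressed copy pays off. This back-of-the-envelope model can be checked in userspace with ordinary zlib; it is a rough illustration, not the kernel's actual code path, and the sample data is made up.]

```python
import os
import zlib

PAGE = 4096
WRITE = 16384  # one Firebird page, as discussed in the thread

def btrfs_would_keep(data: bytes) -> bool:
    """Rough model (not the kernel's heuristic): a compressed write
    only pays off if it saves at least one 4K sector, i.e. a 16K
    write must compress by >= 25%."""
    compressed = zlib.compress(data)
    saved = len(data) - len(compressed)
    return saved >= PAGE

repetitive = b"INSERT INTO t VALUES (1, 'hello');\n" * 500
compressible = repetitive[:WRITE]       # text-like, compresses well
incompressible = os.urandom(WRITE)      # random, does not compress

print(btrfs_would_keep(compressible))   # True
print(btrfs_would_keep(incompressible)) # False: random data only grows
```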
* Re: new database files not compressed
2020-09-01 5:15 ` Nikolay Borisov
@ 2020-09-01 8:55 ` Hamish Moffatt
2020-09-02 0:32 ` Hamish Moffatt
0 siblings, 1 reply; 31+ messages in thread
From: Hamish Moffatt @ 2020-09-01 8:55 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: linux-btrfs

On 1/9/20 3:15 pm, Nikolay Borisov wrote:
>
> On 1.09.20 г. 2:50 ч., Hamish Moffatt wrote:
>> On 31/8/20 10:57 pm, Nikolay Borisov wrote:
>>> This means the data being passed to btrfs is not compressible. I.e after
>>> coompression the data is not smaller than the original, input data.
>> It is though - if I copy it, or run defrag, it compresses very well:
>>
>>
> As Zygo explained - with 16k writes you'd need at least 25% compression
> in order for btrfs to deem it useful. If firebird's 16k writes are not
> 25% compressible then it won't compress. It also depends on whether it
> issues fsync after every write to ensure consistency meaning it won't
> allow more data to accumulate.

I understand, but I think these conditions are being met.

1. strace shows Firebird is writing sequential 16k blocks. I don't see
fsync either. Example trace: https://hastebin.com/ecayosilog.pl

2. Copying the file with 'dd if=foo.fdb of=copy bs=16k', even with
'oflag=sync' or 'oflag=direct', results in a compressed copy.

There are no special attributes on the file according to lsattr. What
else could Firebird be doing that would cause the file not to compress?

Hamish

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-09-01 8:55 ` Hamish Moffatt @ 2020-09-02 0:32 ` Hamish Moffatt 2020-09-02 5:57 ` Nikolay Borisov 0 siblings, 1 reply; 31+ messages in thread From: Hamish Moffatt @ 2020-09-02 0:32 UTC (permalink / raw) To: Nikolay Borisov; +Cc: linux-btrfs On 1/9/20 6:55 pm, Hamish Moffatt wrote: > On 1/9/20 3:15 pm, Nikolay Borisov wrote: >> >> On 1.09.20 г. 2:50 ч., Hamish Moffatt wrote: >>> On 31/8/20 10:57 pm, Nikolay Borisov wrote: >>>> This means the data being passed to btrfs is not compressible. I.e >>>> after >>>> coompression the data is not smaller than the original, input data. >>> It is though - if I copy it, or run defrag, it compresses very well: >>> >>> >> As Zygo explained - with 16k writes you'd need at least 25% compression >> in order for btrfs to deem it useful. If firebird's 16k writes are not >> 25% compressible then it won't compress. It also depends on whether it >> issues fsync after every write to ensure consistency meaning it won't >> allow more data to accumulate. I've been able to reproduce this with a trivial test program which mimics the I/O behaviour of Firebird. It is calling fallocate() to set up a bunch of blocks and then writing them with pwrite(). It seems to be the fallocate() step which is preventing compression. Here is my trivial test program which just writes zeroes to a file. The output file does not get compressed by btrfs. #define _GNU_SOURCE #include <stdio.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <string.h> #define BLOCK_SIZE 16384 int main() { unlink("fill"); int fd = open("fill", O_RDWR | O_CREAT | O_EXCL, 0666); char buf[BLOCK_SIZE]; memset(buf, 0, BLOCK_SIZE); for (int count = 0; count < 256; ++count) { if (count % 8 == 0) fallocate(fd, 0, count * BLOCK_SIZE, 8 * BLOCK_SIZE); pwrite(fd, buf, BLOCK_SIZE, count * BLOCK_SIZE); } close(fd); return 0; } Hamish ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-09-02 0:32 ` Hamish Moffatt @ 2020-09-02 5:57 ` Nikolay Borisov 2020-09-02 6:05 ` Hamish Moffatt 2020-09-02 9:57 ` A L 0 siblings, 2 replies; 31+ messages in thread From: Nikolay Borisov @ 2020-09-02 5:57 UTC (permalink / raw) To: Hamish Moffatt; +Cc: linux-btrfs On 2.09.20 г. 3:32 ч., Hamish Moffatt wrote: > On 1/9/20 6:55 pm, Hamish Moffatt wrote: >> On 1/9/20 3:15 pm, Nikolay Borisov wrote: >>> >>> On 1.09.20 г. 2:50 ч., Hamish Moffatt wrote: >>>> On 31/8/20 10:57 pm, Nikolay Borisov wrote: >>>>> This means the data being passed to btrfs is not compressible. I.e >>>>> after >>>>> coompression the data is not smaller than the original, input data. >>>> It is though - if I copy it, or run defrag, it compresses very well: >>>> >>>> >>> As Zygo explained - with 16k writes you'd need at least 25% compression >>> in order for btrfs to deem it useful. If firebird's 16k writes are not >>> 25% compressible then it won't compress. It also depends on whether it >>> issues fsync after every write to ensure consistency meaning it won't >>> allow more data to accumulate. > > I've been able to reproduce this with a trivial test program which > mimics the I/O behaviour of Firebird. > > It is calling fallocate() to set up a bunch of blocks and then writing > them with pwrite(). It seems to be the fallocate() step which is > preventing compression. > > Here is my trivial test program which just writes zeroes to a file. The > output file does not get compressed by btrfs. Ag yes, this makes sense, because fallocate creates PREALLOC extents which are NOCOW (since they are essentially empty so it makes no sense to CoW them) hence they go through a different path which doesn't perform compression. 
> > > #define _GNU_SOURCE > > #include <stdio.h> > #include <unistd.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > #include <string.h> > > #define BLOCK_SIZE 16384 > > int main() > { > unlink("fill"); > int fd = open("fill", O_RDWR | O_CREAT | O_EXCL, 0666); > > char buf[BLOCK_SIZE]; > memset(buf, 0, BLOCK_SIZE); > > for (int count = 0; count < 256; ++count) { > if (count % 8 == 0) > fallocate(fd, 0, count * BLOCK_SIZE, 8 * BLOCK_SIZE); > pwrite(fd, buf, BLOCK_SIZE, count * BLOCK_SIZE); > } > > close(fd); > > return 0; > } > > > Hamish > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-09-02 5:57 ` Nikolay Borisov @ 2020-09-02 6:05 ` Hamish Moffatt 2020-09-02 6:10 ` Nikolay Borisov 2020-09-02 9:57 ` A L 1 sibling, 1 reply; 31+ messages in thread From: Hamish Moffatt @ 2020-09-02 6:05 UTC (permalink / raw) To: Nikolay Borisov; +Cc: linux-btrfs On 2/9/20 3:57 pm, Nikolay Borisov wrote: > > On 2.09.20 г. 3:32 ч., Hamish Moffatt wrote: >> >> I've been able to reproduce this with a trivial test program which >> mimics the I/O behaviour of Firebird. >> >> It is calling fallocate() to set up a bunch of blocks and then writing >> them with pwrite(). It seems to be the fallocate() step which is >> preventing compression. >> >> Here is my trivial test program which just writes zeroes to a file. The >> output file does not get compressed by btrfs. > Ag yes, this makes sense, because fallocate creates PREALLOC extents > which are NOCOW (since they are essentially empty so it makes no sense > to CoW them) hence they go through a different path which doesn't > perform compression. >> for (int count = 0; count < 256; ++count) { >> if (count % 8 == 0) >> fallocate(fd, 0, count * BLOCK_SIZE, 8 * BLOCK_SIZE); >> pwrite(fd, buf, BLOCK_SIZE, count * BLOCK_SIZE); >> } OK, and they can't be compressed when data is written to those extents afterwards? Thanks, it's good to understand why it's not working for Firebird. Hamish ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: new database files not compressed
2020-09-02 6:05 ` Hamish Moffatt
@ 2020-09-02 6:10 ` Nikolay Borisov
0 siblings, 0 replies; 31+ messages in thread
From: Nikolay Borisov @ 2020-09-02 6:10 UTC (permalink / raw)
To: Hamish Moffatt; +Cc: linux-btrfs

On 2.09.20 г. 9:05 ч., Hamish Moffatt wrote:
> On 2/9/20 3:57 pm, Nikolay Borisov wrote:
>>
>> On 2.09.20 г. 3:32 ч., Hamish Moffatt wrote:
>>>
>>> I've been able to reproduce this with a trivial test program which
>>> mimics the I/O behaviour of Firebird.
>>>
>>> It is calling fallocate() to set up a bunch of blocks and then writing
>>> them with pwrite(). It seems to be the fallocate() step which is
>>> preventing compression.
>>>
>>> Here is my trivial test program which just writes zeroes to a file. The
>>> output file does not get compressed by btrfs.
>> Ag yes, this makes sense, because fallocate creates PREALLOC extents
>> which are NOCOW (since they are essentially empty so it makes no sense
>> to CoW them) hence they go through a different path which doesn't
>> perform compression.
>
>>> for (int count = 0; count < 256; ++count) {
>>>     if (count % 8 == 0)
>>>         fallocate(fd, 0, count * BLOCK_SIZE, 8 * BLOCK_SIZE);
>>>     pwrite(fd, buf, BLOCK_SIZE, count * BLOCK_SIZE);
>>> }
>
>
> OK, and they can't be compressed when data is written to those extents
> afterwards?

Yes, compression entails creating NEW extents. But with fallocate,
extents are created during the fallocate operation, so when you issue
subsequent writes into the prealloc range you simply write the data to
disk and don't bother with metadata management.

>
>
> Thanks, it's good to understand why it's not working for Firebird.
>
>
> Hamish
>

^ permalink raw reply	[flat|nested] 31+ messages in thread
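[Editorial note: the up-front allocation Nikolay describes is visible from userspace — after posix_fallocate() the file already owns blocks before a single byte is written. A small sketch; exact block counts vary by filesystem, and only btrfs turns the reservation into the PREALLOC extents that bypass compression.]

```python
import os
import tempfile

BLOCK_SIZE = 16384

# Preallocate 8 blocks (128 KiB), the chunk size Firebird uses above.
# st_blocks (in 512-byte units) shows that space is reserved by
# fallocate itself, before any write lands.
fd, path = tempfile.mkstemp()
try:
    assert os.fstat(fd).st_blocks == 0         # nothing allocated yet
    os.posix_fallocate(fd, 0, 8 * BLOCK_SIZE)
    st = os.fstat(fd)
    print("size:", st.st_size)                 # 131072: logical size grew
    print("blocks:", st.st_blocks)             # > 0: space reserved pre-write
finally:
    os.close(fd)
    os.unlink(path)
```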
* Re: new database files not compressed 2020-09-02 5:57 ` Nikolay Borisov 2020-09-02 6:05 ` Hamish Moffatt @ 2020-09-02 9:57 ` A L 2020-09-02 10:09 ` Nikolay Borisov 2020-09-02 16:16 ` Zygo Blaxell 1 sibling, 2 replies; 31+ messages in thread From: A L @ 2020-09-02 9:57 UTC (permalink / raw) To: Nikolay Borisov, Hamish Moffatt, linux-btrfs; +Cc: linux-btrfs ---- From: Nikolay Borisov <nborisov@suse.com> -- Sent: 2020-09-02 - 07:57 ---- >> >> I've been able to reproduce this with a trivial test program which >> mimics the I/O behaviour of Firebird. >> >> It is calling fallocate() to set up a bunch of blocks and then writing >> them with pwrite(). It seems to be the fallocate() step which is >> preventing compression. >> >> Here is my trivial test program which just writes zeroes to a file. The >> output file does not get compressed by btrfs. > > Ag yes, this makes sense, because fallocate creates PREALLOC extents > which are NOCOW (since they are essentially empty so it makes no sense > to CoW them) hence they go through a different path which doesn't > perform compression. > Hi, This is interesting. I think that a lot of applications use fallocate in their normal operations. This is probably why we see weird compsize results every now and then. A file that is nocow will also not have checksums. Is this true for these fallocated files (that has data written to them) too? I would really like to see that Btrfs was corrected so that writes to an fallocated area will be compressed (if one is using compression that is). Thanks. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-09-02 9:57 ` A L @ 2020-09-02 10:09 ` Nikolay Borisov 2020-09-03 15:04 ` A L 2020-09-02 16:16 ` Zygo Blaxell 1 sibling, 1 reply; 31+ messages in thread From: Nikolay Borisov @ 2020-09-02 10:09 UTC (permalink / raw) To: A L, Hamish Moffatt, linux-btrfs On 2.09.20 г. 12:57 ч., A L wrote: > > > ---- From: Nikolay Borisov <nborisov@suse.com> -- Sent: 2020-09-02 - 07:57 ---- >>> >>> I've been able to reproduce this with a trivial test program which >>> mimics the I/O behaviour of Firebird. >>> >>> It is calling fallocate() to set up a bunch of blocks and then writing >>> them with pwrite(). It seems to be the fallocate() step which is >>> preventing compression. >>> >>> Here is my trivial test program which just writes zeroes to a file. The >>> output file does not get compressed by btrfs. >> >> Ag yes, this makes sense, because fallocate creates PREALLOC extents >> which are NOCOW (since they are essentially empty so it makes no sense >> to CoW them) hence they go through a different path which doesn't >> perform compression. >> > > Hi, > > This is interesting. I think that a lot of applications use fallocate in their normal operations. This is probably why we see weird compsize results every now and then. > > A file that is nocow will also not have checksums. Is this true for these fallocated files (that has data written to them) too? No, fallocated files will have checksums. It's just that compression is not integrated into it. BTRFS is an open source project so patches are welcomed. > > I would really like to see that Btrfs was corrected so that writes to an fallocated area will be compressed (if one is using compression that is). > > Thanks. > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-09-02 10:09 ` Nikolay Borisov @ 2020-09-03 15:04 ` A L 0 siblings, 0 replies; 31+ messages in thread From: A L @ 2020-09-03 15:04 UTC (permalink / raw) To: Nikolay Borisov, Hamish Moffatt, linux-btrfs On 2020-09-02 12:09, Nikolay Borisov wrote: > > On 2.09.20 г. 12:57 ч., A L wrote: >> >> ---- From: Nikolay Borisov <nborisov@suse.com> -- Sent: 2020-09-02 - 07:57 ---- >>>> I've been able to reproduce this with a trivial test program which >>>> mimics the I/O behaviour of Firebird. >>>> >>>> It is calling fallocate() to set up a bunch of blocks and then writing >>>> them with pwrite(). It seems to be the fallocate() step which is >>>> preventing compression. >>>> >>>> Here is my trivial test program which just writes zeroes to a file. The >>>> output file does not get compressed by btrfs. >>> Ag yes, this makes sense, because fallocate creates PREALLOC extents >>> which are NOCOW (since they are essentially empty so it makes no sense >>> to CoW them) hence they go through a different path which doesn't >>> perform compression. >>> >> Hi, >> >> This is interesting. I think that a lot of applications use fallocate in their normal operations. This is probably why we see weird compsize results every now and then. >> >> A file that is nocow will also not have checksums. Is this true for these fallocated files (that has data written to them) too? > No, fallocated files will have checksums. It's just that compression is > not integrated into it. BTRFS is an open source project so patches are > welcomed. > Is it a very big task to add the compression code path here, since we already do checksums? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-09-02 9:57 ` A L 2020-09-02 10:09 ` Nikolay Borisov @ 2020-09-02 16:16 ` Zygo Blaxell 2020-09-03 12:53 ` Hamish Moffatt 2020-09-03 15:03 ` A L 1 sibling, 2 replies; 31+ messages in thread From: Zygo Blaxell @ 2020-09-02 16:16 UTC (permalink / raw) To: A L; +Cc: Nikolay Borisov, Hamish Moffatt, linux-btrfs On Wed, Sep 02, 2020 at 11:57:41AM +0200, A L wrote: > > > ---- From: Nikolay Borisov <nborisov@suse.com> -- Sent: 2020-09-02 - 07:57 ---- > >> > >> I've been able to reproduce this with a trivial test program which > >> mimics the I/O behaviour of Firebird. > >> > >> It is calling fallocate() to set up a bunch of blocks and then writing > >> them with pwrite(). It seems to be the fallocate() step which is > >> preventing compression. > >> > >> Here is my trivial test program which just writes zeroes to a file. The > >> output file does not get compressed by btrfs. > > > > Ag yes, this makes sense, because fallocate creates PREALLOC extents > > which are NOCOW (since they are essentially empty so it makes no sense > > to CoW them) hence they go through a different path which doesn't > > perform compression. > > > > Hi, > > This is interesting. I think that a lot of applications use fallocate > in their normal operations. This is probably why we see weird compsize > results every now and then. fallocate doesn't make a lot of sense on btrfs, except in the special case of nodatacow files without snapshots. fallocate breaks compression, and snapshots/reflinks break fallocate. bees deallocates preallocated extents on sight (dedupes them with holes). I didn't bother to implement an option not to, and so far nobody has asked for one. Before bees (and even before btrfs), I had LD_PRELOAD hacks to replace fallocate() with { return 0; }. On certain other filesystems fallocate is spectacularly slow, and one application-that-shall-not-be-named in particular liked to reserve a lot more space than it ever ultimately used. 
> A file that is nocow will also not have checksums. Is this true for
> these fallocated files (that has data written to them) too?
>
> I would really like to see that Btrfs was corrected so that writes
> to an fallocated area will be compressed (if one is using compression
> that is).

This is difficult to do with the semantics of fallocate, which dictate
that a write to a file within the preallocated region shall not return
ENOSPC.

A fallocated extent is overwritten in-place on the first write, but
later writes will be normal copy-on-write. To be clear, that means they
will occupy other locations on the disk, and may encounter ENOSPC
during write. fallocate is really broken on btrfs datacow files. The
requirements of each are mutually exclusive unless you allow solutions
that can reserve indefinite amounts of auxiliary space for journalling
data blocks. It's better not to use fallocate at all.

If a fallocated extent is shared (cloned or snapshotted) then it
becomes temporarily copy-on-write until only one reference remains,
then reverts back to in-place writing--even on nodatacow files. If
you're using btrfs send for backups, fallocate doesn't work properly at
best, and wastes space and time at worst.

The unsharing _should_ also mean that the second write to a
preallocated extent should be compressed, but I tried this on kernel
5.7.15 and it turns out they're not... :-O

Here's a 1g fallocated file:

# fallocate -l 1g test
# sync test
# filefrag -v test
Filesystem type is: 9123683e
File size of test is 1073741824 (262144 blocks of 4096 bytes)
 ext:   logical_offset:          physical_offset: length:    expected: flags:
   0:        0..  65535: 21281040720..21281106255:  65536:              unwritten
   1:    65536..  98303: 20476350122..20476382889:  32768: 21281106256: unwritten
   2:    98304.. 131071: 20479845152..20479877919:  32768: 20476382890: unwritten
   3:   131072.. 163839: 20483351132..20483383899:  32768: 20479877920: unwritten
   4:   163840.. 196607: 20485055258..20485088025:  32768: 20483383900: unwritten
   5:   196608.. 229375: 20485546782..20485579549:  32768: 20485088026: unwritten
   6:   229376.. 262143: 20675234358..20675267125:  32768: 20485579550: last,unwritten,eof
test: 7 extents found

Let's turn on compression and dump some compressed data on it:

# chattr +c test
# dd if=/boot/System.map-5.7.15 bs=512K seek=1 of=test conv=notrunc
12+1 records in
12+1 records out
6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.070041 s, 94.3 MB/s
# sync test
# filefrag -v test
Filesystem type is: 9123683e
File size of test is 1073741824 (262144 blocks of 4096 bytes)
 ext:   logical_offset:          physical_offset: length:    expected: flags:
   0:        0..    127: 21281040720..21281040847:    128:              unwritten
   1:      128..   1741: 21281040848..21281042461:   1614:

720 + (512 / 4 = 128) = 848, this is written in-place on the extent.

   2:     1742..  65535: 21281042462..21281106255:  63794:              unwritten
   3:    65536..  98303: 20476350122..20476382889:  32768: 21281106256: unwritten
   4:    98304.. 131071: 20479845152..20479877919:  32768: 20476382890: unwritten
   5:   131072.. 163839: 20483351132..20483383899:  32768: 20479877920: unwritten
   6:   163840.. 196607: 20485055258..20485088025:  32768: 20483383900: unwritten
   7:   196608.. 229375: 20485546782..20485579549:  32768: 20485088026: unwritten
   8:   229376.. 262143: 20675234358..20675267125:  32768: 20485579550: last,unwritten,eof
test: 7 extents found

OK so we wrote to a fallocated extent and the write was in-place and
not compressed. Good so far, now let's try writing over the same place
again. The extent is no longer PREALLOC, so it should move to a
different place, and compression can happen:

# dd if=/boot/System.map-5.7.15 bs=512K seek=1 of=test conv=notrunc
12+1 records in
12+1 records out
6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.0597547 s, 111 MB/s
# sync test
# filefrag -v test
Filesystem type is: 9123683e
File size of test is 1073741824 (262144 blocks of 4096 bytes)
 ext:   logical_offset:          physical_offset: length:    expected: flags:
   0:        0..    127: 21281040720..21281040847:    128:              unwritten
   1:      128..   1741: 21281106256..21281107869:   1614: 21281040848:

OK, it moved from 21281040848 to 21281106256, but still isn't compressed.

   2:     1742..  65535: 21281042462..21281106255:  63794: 21281107870: unwritten
   3:    65536..  98303: 20476350122..20476382889:  32768: 21281106256: unwritten
   4:    98304.. 131071: 20479845152..20479877919:  32768: 20476382890: unwritten
   5:   131072.. 163839: 20483351132..20483383899:  32768: 20479877920: unwritten
   6:   163840.. 196607: 20485055258..20485088025:  32768: 20483383900: unwritten
   7:   196608.. 229375: 20485546782..20485579549:  32768: 20485088026: unwritten
   8:   229376.. 262143: 20675234358..20675267125:  32768: 20485579550: last,unwritten,eof
test: 9 extents found

It looks like compressed writes have been disabled for the whole file:

# dd if=/boot/System.map-5.7.15 bs=512K seek=10000 of=test conv=notrunc
12+1 records in
12+1 records out
6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.144441 s, 45.7 MB/s
# sync test
# filefrag -v test
Filesystem type is: 9123683e
File size of test is 5249487498 (1281614 blocks of 4096 bytes)
 ext:   logical_offset:          physical_offset: length:    expected: flags:
   0:        0..    127: 21281040720..21281040847:    128:              unwritten
   1:      128..   1741: 21281106256..21281107869:   1614: 21281040848:
   2:     1742..  65535: 21281042462..21281106255:  63794: 21281107870: unwritten
   3:    65536..  98303: 20476350122..20476382889:  32768: 21281106256: unwritten
   4:    98304.. 131071: 20479845152..20479877919:  32768: 20476382890: unwritten
   5:   131072.. 163839: 20483351132..20483383899:  32768: 20479877920: unwritten
   6:   163840.. 196607: 20485055258..20485088025:  32768: 20483383900: unwritten
   7:   196608.. 229375: 20485546782..20485579549:  32768: 20485088026: unwritten
   8:   229376.. 262143: 20675234358..20675267125:  32768: 20485579550: unwritten
   9:  1280000..1281613: 21281107870..21281109483:   1614: 20676284982: last,eof
test: 10 extents found

# getfattr -n btrfs.compression test
# file: test
btrfs.compression="zstd"

# lsattr test
--------c---------- test

This works OK if fallocate is not used:

# truncate -s 1g test2
# chattr +c test2
# sync test2
# filefrag -v test2
Filesystem type is: 9123683e
File size of test2 is 1073741824 (262144 blocks of 4096 bytes)
test2: 0 extents found
# dd if=/boot/System.map-5.7.15 bs=512K seek=1 of=test2 conv=notrunc
12+1 records in
12+1 records out
6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.110609 s, 59.7 MB/s
# sync test2
# filefrag -v test2
Filesystem type is: 9123683e
File size of test2 is 1073741824 (262144 blocks of 4096 bytes)
 ext:   logical_offset:          physical_offset: length:    expected: flags:
   0:      128..    159:   8663165813..8663165844:     32:         128: encoded
   1:      160..    191:   8663166005..8663166036:     32:  8663165845: encoded
   2:      192..    223:   8663165607..8663165638:     32:  8663166037: encoded
   3:      224..    255:   8663166052..8663166083:     32:  8663165639: encoded
[...snip...]
  48:     1664..   1695:   8663178516..8663178547:     32:  8663176668: encoded
  49:     1696..   1727:   8663178709..8663178740:     32:  8663178548: encoded
  50:     1728..   1741:   8663176937..8663176950:     14:  8663178741: last,encoded
test2: 51 extents found

> Thanks.
>

^ permalink raw reply	[flat|nested] 31+ messages in thread
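[Editorial note: short of patching the application, an already-fallocated file's PREALLOC extents can be rewritten from userspace — the same effect Hamish got from `cat`/`dd` copies and `btrfs fi defrag -czstd`. A sketch; on btrfs with compress-force the rewritten copy lands in fresh (compressible) extents, on other filesystems it is just a copy. The file name is made up for the demo.]

```python
import os
import shutil
import tempfile

def rewrite_in_place(path: str, chunk: int = 1 << 20) -> None:
    """Copy the file through new (non-preallocated) extents, then
    rename over the original.  Note: this breaks reflink/snapshot
    sharing and is not atomic with respect to concurrent writers."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            shutil.copyfileobj(src, dst, chunk)
            dst.flush()
            os.fsync(dst.fileno())
        os.rename(tmp, path)  # atomic replace on the same filesystem
    except BaseException:
        os.unlink(tmp)
        raise

# demo: build a preallocated file, then rewrite it
with open("demo.fdb", "wb") as f:
    os.posix_fallocate(f.fileno(), 0, 65536)
    f.write(b"x" * 65536)
rewrite_in_place("demo.fdb")
with open("demo.fdb", "rb") as f:
    print(f.read() == b"x" * 65536)  # True: content preserved
```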
* Re: new database files not compressed 2020-09-02 16:16 ` Zygo Blaxell @ 2020-09-03 12:53 ` Hamish Moffatt 2020-09-03 19:44 ` Zygo Blaxell 2020-09-03 15:03 ` A L 1 sibling, 1 reply; 31+ messages in thread From: Hamish Moffatt @ 2020-09-03 12:53 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On 3/9/20 2:16 am, Zygo Blaxell wrote: > > fallocate doesn't make a lot of sense on btrfs, except in the special > case of nodatacow files without snapshots. fallocate breaks compression, > and snapshots/reflinks break fallocate. I recompiled Firebird with fallocate disabled (it has a fallback for non-linux OSs), and now I have compressed database files. It may be that de-duplication suits my application better anyway. Will compsize tell me how much space is being saved by de-duplication, or is there another way to find out? Hamish ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-09-03 12:53 ` Hamish Moffatt @ 2020-09-03 19:44 ` Zygo Blaxell 2020-09-04 8:07 ` Hamish Moffatt 0 siblings, 1 reply; 31+ messages in thread From: Zygo Blaxell @ 2020-09-03 19:44 UTC (permalink / raw) To: Hamish Moffatt; +Cc: linux-btrfs On Thu, Sep 03, 2020 at 10:53:23PM +1000, Hamish Moffatt wrote: > On 3/9/20 2:16 am, Zygo Blaxell wrote: > > > > fallocate doesn't make a lot of sense on btrfs, except in the special > > case of nodatacow files without snapshots. fallocate breaks compression, > > and snapshots/reflinks break fallocate. > > > I recompiled Firebird with fallocate disabled (it has a fallback for > non-linux OSs), and now I have compressed database files. > > It may be that de-duplication suits my application better anyway. Will > compsize tell me how much space is being saved by de-duplication, or is > there another way to find out? Compsize reports "Uncompressed" and "Referenced" columns. "Uncompressed" is the physical size of the uncompressed data (i.e. how many bytes you would need to hold all of the extents on disk without compression but with dedupe). "Referenced" is the logical size of the data, after counting each reference (i.e. how many bytes you would need to hold all of the data without compression or dedupe). The "none" and "zstd" rows will tell you how much dedupe you're getting on uncompressed and compressed extents separately. > > Hamish > ^ permalink raw reply [flat|nested] 31+ messages in thread
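[Editorial note: Zygo's two columns can be turned into numbers with a few lines of Python. A sketch that assumes compsize's usual single-letter binary suffixes; the sample values are taken from Hamish's earlier run.]

```python
def parse_size(s: str) -> float:
    """Turn a compsize size like '17M' or '1.0G' into bytes
    (suffixes treated as binary units, as compsize prints them)."""
    units = {"B": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}
    return float(s[:-1]) * units[s[-1]] if s[-1] in units else float(s)

def savings(disk: str, uncompressed: str, referenced: str) -> dict:
    """Split space saved into compression vs dedupe, following Zygo's
    definitions: Uncompressed = extents on disk without compression
    (dedupe still counted), Referenced = the full logical data."""
    d, u, r = map(parse_size, (disk, uncompressed, referenced))
    return {
        "compression_saved": u - d,  # Uncompressed -> Disk Usage
        "dedupe_saved": r - u,       # Referenced -> Uncompressed
    }

# TOTAL row from the earlier compsize run: 8% 17M 191M 191M
s = savings("17M", "191M", "191M")
print(s["compression_saved"] / (1 << 20))  # 174.0 MiB saved by compression
print(s["dedupe_saved"])                   # 0.0 -- no dedupe yet
```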
* Re: new database files not compressed
2020-09-03 19:44 ` Zygo Blaxell
@ 2020-09-04 8:07 ` Hamish Moffatt
2020-09-05 4:07 ` Zygo Blaxell
0 siblings, 1 reply; 31+ messages in thread
From: Hamish Moffatt @ 2020-09-04 8:07 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

On 4/9/20 5:44 am, Zygo Blaxell wrote:
> On Thu, Sep 03, 2020 at 10:53:23PM +1000, Hamish Moffatt wrote:
>
>> I recompiled Firebird with fallocate disabled (it has a fallback for
>> non-linux OSs), and now I have compressed database files.
>>
>> It may be that de-duplication suits my application better anyway. Will
>> compsize tell me how much space is being saved by de-duplication, or is
>> there another way to find out?
> Compsize reports "Uncompressed" and "Referenced" columns. "Uncompressed"
> is the physical size of the uncompressed data (i.e. how many bytes
> you would need to hold all of the extents on disk without compression
> but with dedupe). "Referenced" is the logical size of the data, after
> counting each reference (i.e. how many bytes you would need to hold all
> of the data without compression or dedupe).
>
> The "none" and "zstd" rows will tell you how much dedupe you're getting
> on uncompressed and compressed extents separately.

Great, I have bees running and I see the deduplication in compsize as
you said.

What is the appropriate place to ask questions about bees - here,
github or elsewhere?

I added some files, restarted bees and it ran a deduplication, but then
I added some more files (8 hours ago) and there's been some regular
logging but the new files haven't been deduplicated.

Hamish

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-09-04 8:07 ` Hamish Moffatt @ 2020-09-05 4:07 ` Zygo Blaxell 0 siblings, 0 replies; 31+ messages in thread From: Zygo Blaxell @ 2020-09-05 4:07 UTC (permalink / raw) To: Hamish Moffatt; +Cc: linux-btrfs On Fri, Sep 04, 2020 at 06:07:32PM +1000, Hamish Moffatt wrote: > On 4/9/20 5:44 am, Zygo Blaxell wrote: > > On Thu, Sep 03, 2020 at 10:53:23PM +1000, Hamish Moffatt wrote: > > > > > I recompiled Firebird with fallocate disabled (it has a fallback for > > > non-linux OSs), and now I have compressed database files. > > > > > > It may be that de-duplication suits my application better anyway. Will > > > compsize tell me how much space is being saved by de-duplication, or is > > > there another way to find out? > > Compsize reports "Uncompressed" and "Referenced" columns. "Uncompressed" > > is the physical size of the uncompressed data (i.e. how many bytes > > you would need to hold all of the extents on disk without compression > > but with dedupe). "Referenced" is the logical size of the data, after > > counting each reference (i.e. how many bytes you would need to hold all > > of the data without compression or dedupe). > > > > The "none" and "zstd" rows will tell you how much dedupe you're getting > > on uncompressed and compressed extents separately. > > > Great, I have it bees running and I see the deduplication in compsize as you > said. > > What is the appropriate place to ask question about bees - here, github or > elsewhere? I'm in all three places, though I might miss it if it's only posted to the linux-btrfs list. If you need to send a log file and you don't want it to be fully public, there's an email address in the bees README. > I added some files, restarted bees and it ran a deduplication, but then I > added some more files (8 hours ago) and there's been some regularly logging > but the new files haven't been deduplicated. 
Information we'd need for this: - kernel version you're running - bees log, preferably at a level high enough to see the individual dedupe ops - btrfs dump-tree of the subvol trees containing the files or a small reproducer. If there's something like this in the log: 2020-09-05 04:03:13 4464.4468<5> crawl_5: WORKAROUND: toxic address: addr = 0x544eb000, sys_usage_delta = 0.136, user_usage_delta = 0.02, rt_age = 0.251574, refs 1 2020-09-05 04:03:13 4464.4468<4> crawl_5: WORKAROUND: discovered toxic match at found_addr 0x544eb000 matching bbd BeesBlockData { 4K 0x1db000 fd = 9 '/try/31264-2', address = 0x54823000, hash = 0x9a2052ab5fe19ae, data[4096] } then it's bees detecting btrfs taking too long to do an extent lookup, and blacklisting the extent to avoid crippling performance problems. On 5.7 it seems to be happening a lot...ironic, given that 5.7 has much faster backref code. > Hamish > > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: new database files not compressed 2020-09-02 16:16 ` Zygo Blaxell 2020-09-03 12:53 ` Hamish Moffatt @ 2020-09-03 15:03 ` A L 2020-09-03 21:52 ` Zygo Blaxell 1 sibling, 1 reply; 31+ messages in thread From: A L @ 2020-09-03 15:03 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Nikolay Borisov, Hamish Moffatt, linux-btrfs On 2020-09-02 18:16, Zygo Blaxell wrote: > On Wed, Sep 02, 2020 at 11:57:41AM +0200, A L wrote: >> >> ---- From: Nikolay Borisov <nborisov@suse.com> -- Sent: 2020-09-02 - 07:57 ---- >>>> I've been able to reproduce this with a trivial test program which >>>> mimics the I/O behaviour of Firebird. >>>> >>>> It is calling fallocate() to set up a bunch of blocks and then writing >>>> them with pwrite(). It seems to be the fallocate() step which is >>>> preventing compression. >>>> >>>> Here is my trivial test program which just writes zeroes to a file. The >>>> output file does not get compressed by btrfs. >>> Ah yes, this makes sense, because fallocate creates PREALLOC extents >>> which are NOCOW (since they are essentially empty so it makes no sense >>> to CoW them) hence they go through a different path which doesn't >>> perform compression. >>> >> Hi, >> >> This is interesting. I think that a lot of applications use fallocate >> in their normal operations. This is probably why we see weird compsize >> results every now and then. > fallocate doesn't make a lot of sense on btrfs, except in the special > case of nodatacow files without snapshots. fallocate breaks compression, > and snapshots/reflinks break fallocate. Isn't this a strong use-case to improve fallocate behavior on Btrfs? > bees deallocates preallocated extents on sight (dedupes them with holes). > I didn't bother to implement an option not to, and so far nobody has > asked for one. > > Before bees (and even before btrfs), I had LD_PRELOAD hacks to replace > fallocate() with { return 0; }.
On certain other filesystems fallocate > is spectacularly slow, and one application-that-shall-not-be-named in > particular liked to reserve a lot more space than it ever ultimately used. > >> A file that is nocow will also not have checksums. Is this true for >> these fallocated files (that has data written to them) too? >> >> I would really like to see that Btrfs was corrected so that writes >> to an fallocated area will be compressed (if one is using compression >> that is). > This is difficult to do with the semantics of fallocate, which dictate that > a write to a file within the preallocated region shall not return ENOSPC. Is this the case when you do `fallocate -l <larger-than-fs-size>`? > A fallocated extent is overwritten in-place on the first write, but later > writes will be normal copy-on-write. To be clear, that means they will > occupy other locations on the disk, and may encounter ENOSPC during write. > > fallocate is really broken on btrfs datacow files. The requirements of > each are mutually exclusive unless you allow solutions that can reserve > indefinite amounts of auxiliary space for journalling data blocks. > It's better not to use fallocate at all. > > If a fallocated extent is shared (cloned or snapshotted) then it > becomes temporarily copy-on-write until only one reference remains, > then reverts back to in-place writing--even on nodatacow files. > If you're using btrfs send for backups, fallocate doesn't work properly > at best, and wastes space and time at worst. > > The unsharing _should_ also mean that the second write to a preallocated > extent should be compressed, but I tried this on kernel 5.7.15 and it > turns out they're not... :-O > > Here's a 1g fallocated file: > > # fallocate -l 1g test > # sync test > # filefrag -v test > Filesystem type is: 9123683e > File size of test is 1073741824 (262144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 
65535: 21281040720..21281106255: 65536: unwritten > 1: 65536.. 98303: 20476350122..20476382889: 32768: 21281106256: unwritten > 2: 98304.. 131071: 20479845152..20479877919: 32768: 20476382890: unwritten > 3: 131072.. 163839: 20483351132..20483383899: 32768: 20479877920: unwritten > 4: 163840.. 196607: 20485055258..20485088025: 32768: 20483383900: unwritten > 5: 196608.. 229375: 20485546782..20485579549: 32768: 20485088026: unwritten > 6: 229376.. 262143: 20675234358..20675267125: 32768: 20485579550: last,unwritten,eof > test: 7 extents found > > Let's turn on compression and dump some compressed data on it: > > # chattr +c test > # dd if=/boot/System.map-5.7.15 bs=512K seek=1 of=test conv=notrunc > 12+1 records in > 12+1 records out > 6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.070041 s, 94.3 MB/s > # sync test > # filefrag -v test > Filesystem type is: 9123683e > File size of test is 1073741824 (262144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 127: 21281040720..21281040847: 128: unwritten > 1: 128.. 1741: 21281040848..21281042461: 1614: > > 720 + (512 / 4 = 128) = 848, this is written in-place on the extent. > > 2: 1742.. 65535: 21281042462..21281106255: 63794: unwritten > 3: 65536.. 98303: 20476350122..20476382889: 32768: 21281106256: unwritten > 4: 98304.. 131071: 20479845152..20479877919: 32768: 20476382890: unwritten > 5: 131072.. 163839: 20483351132..20483383899: 32768: 20479877920: unwritten > 6: 163840.. 196607: 20485055258..20485088025: 32768: 20483383900: unwritten > 7: 196608.. 229375: 20485546782..20485579549: 32768: 20485088026: unwritten > 8: 229376.. 262143: 20675234358..20675267125: 32768: 20485579550: last,unwritten,eof > test: 7 extents found > > OK so we wrote to a fallocated extent and the write as in-place and not > compressed. Good so far, now let's try writing over the same place again. 
> The extent is no longer PREALLOC, so it should move to a different place, > and compression can happen: > > # dd if=/boot/System.map-5.7.15 bs=512K seek=1 of=test conv=notrunc > 12+1 records in > 12+1 records out > 6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.0597547 s, 111 MB/s > # sync test > # filefrag -v test > Filesystem type is: 9123683e > File size of test is 1073741824 (262144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 127: 21281040720..21281040847: 128: unwritten > 1: 128.. 1741: 21281106256..21281107869: 1614: 21281040848: > > OK, it moved from 21281040848 to 21281106256, but still isn't compressed. > > 2: 1742.. 65535: 21281042462..21281106255: 63794: 21281107870: unwritten > 3: 65536.. 98303: 20476350122..20476382889: 32768: 21281106256: unwritten > 4: 98304.. 131071: 20479845152..20479877919: 32768: 20476382890: unwritten > 5: 131072.. 163839: 20483351132..20483383899: 32768: 20479877920: unwritten > 6: 163840.. 196607: 20485055258..20485088025: 32768: 20483383900: unwritten > 7: 196608.. 229375: 20485546782..20485579549: 32768: 20485088026: unwritten > 8: 229376.. 262143: 20675234358..20675267125: 32768: 20485579550: last,unwritten,eof > test: 9 extents found > > It looks like compressed writes have been disabled for the whole file: But this is odd. So we have a file with no special attributes that in effect is like a nodatacow file? What happens if we snapshot and then write to the file? > > # dd if=/boot/System.map-5.7.15 bs=512K seek=10000 of=test conv=notrunc > 12+1 records in > 12+1 records out > 6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.144441 s, 45.7 MB/s > # sync test > # filefrag -v test > Filesystem type is: 9123683e > File size of test is 5249487498 (1281614 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 127: 21281040720..21281040847: 128: unwritten > 1: 128.. 1741: 21281106256..21281107869: 1614: 21281040848: > 2: 1742.. 
65535: 21281042462..21281106255: 63794: 21281107870: unwritten > 3: 65536.. 98303: 20476350122..20476382889: 32768: 21281106256: unwritten > 4: 98304.. 131071: 20479845152..20479877919: 32768: 20476382890: unwritten > 5: 131072.. 163839: 20483351132..20483383899: 32768: 20479877920: unwritten > 6: 163840.. 196607: 20485055258..20485088025: 32768: 20483383900: unwritten > 7: 196608.. 229375: 20485546782..20485579549: 32768: 20485088026: unwritten > 8: 229376.. 262143: 20675234358..20675267125: 32768: 20485579550: unwritten > 9: 1280000.. 1281613: 21281107870..21281109483: 1614: 20676284982: last,eof > test: 10 extents found > # getfattr -n btrfs.compression test > # file: test > btrfs.compression="zstd" > > # lsattr test > --------c---------- test > > This works OK if fallocate is not used: > > # truncate -s 1g test2 > # chattr +c test2 > # sync test2 > # filefrag -v test2 > Filesystem type is: 9123683e > File size of test2 is 1073741824 (262144 blocks of 4096 bytes) > test2: 0 extents found > # dd if=/boot/System.map-5.7.15 bs=512K seek=1 of=test2 conv=notrunc > 12+1 records in > 12+1 records out > 6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.110609 s, 59.7 MB/s > # sync test2 > # filefrag -v test2 > Filesystem type is: 9123683e > File size of test2 is 1073741824 (262144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 128.. 159: 8663165813..8663165844: 32: 128: encoded > 1: 160.. 191: 8663166005..8663166036: 32: 8663165845: encoded > 2: 192.. 223: 8663165607..8663165638: 32: 8663166037: encoded > 3: 224.. 255: 8663166052..8663166083: 32: 8663165639: encoded > [...snip...] > 48: 1664.. 1695: 8663178516..8663178547: 32: 8663176668: encoded > 49: 1696.. 1727: 8663178709..8663178740: 32: 8663178548: encoded > 50: 1728.. 1741: 8663176937..8663176950: 14: 8663178741: last,encoded > test2: 51 extents found > >> Thanks. >> ^ permalink raw reply [flat|nested] 31+ messages in thread
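The LD_PRELOAD hack quoted above can be sketched in a few lines of C. This is an illustration, not the original shim; it swallows every fallocate mode, whereas a production version would forward nonzero modes (such as hole punching) to the real syscall:

```c
/* nofallocate.c -- illustrative LD_PRELOAD shim that makes fallocate
 * a no-op, so applications fall back to ordinary writes, which btrfs
 * can compress.
 * Build: gcc -shared -fPIC -o nofallocate.so nofallocate.c
 * Use:   LD_PRELOAD=./nofallocate.so someprogram
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/* Report success without reserving anything. */
int fallocate(int fd, int mode, off_t offset, off_t len)
{
    (void)fd; (void)mode; (void)offset; (void)len;
    return 0;
}

/* glibc's posix_fallocate() issues the fallocate syscall itself rather
 * than going through the wrapper above, so override it separately. */
int posix_fallocate(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return 0;
}
```

Note the caveat about file descriptors: both overrides ignore their arguments entirely, so even an invalid fd "succeeds" -- fine for this trick, wrong for anything that checks fallocate's error reporting.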
* Re: new database files not compressed 2020-09-03 15:03 ` A L @ 2020-09-03 21:52 ` Zygo Blaxell 0 siblings, 0 replies; 31+ messages in thread From: Zygo Blaxell @ 2020-09-03 21:52 UTC (permalink / raw) To: A L; +Cc: Nikolay Borisov, Hamish Moffatt, linux-btrfs On Thu, Sep 03, 2020 at 05:03:15PM +0200, A L wrote: > On 2020-09-02 18:16, Zygo Blaxell wrote: > > On Wed, Sep 02, 2020 at 11:57:41AM +0200, A L wrote: > > > This is interesting. I think that a lot of applications use fallocate > > > in their normal operations. This is probably why we see weird compsize > > > results every now and then. > > fallocate doesn't make a lot of sense on btrfs, except in the special > > case of nodatacow files without snapshots. fallocate breaks compression, > > and snapshots/reflinks break fallocate. > > Isn't this a strong use-case to improve fallocate behavior on Btrfs? fallocate: you shall write data in exactly in one specific location. copy-on-write: you shall write data anywhere but in one specific location. It's the same specific location for both, so the requirements are mutually exclusive. You can only implement the complete requirements for one by not implementing the requirements for the other, or by restricting each to separate parts of the filesystem (e.g. datacow and nodatacow files). btrfs silently ignores fallocate whenever it conflicts with copy-on-write requirements. IMHO it would be better for btrfs to reject fallocate with an error when it is used in one of the allocate-disk-space modes on a datacow file, but that's just MHO. It would be clearer if the system call just failed, so that applications know not to expect the various guarantees mentioned in the fallocate(2) man page. 
I suppose it's possible to have a space reservation system where every fallocated extent reserves enough space to guarantee that copy-on-write won't run out of space, and every subvol tracks how much reserved space it has so that a duplicate space can be reserved in the event of a non-read-only snapshot. But that's probably a worse result overall: suddenly snapshotting a subvol or reflink cloning a file wants a surprising amount of extra space, and dedupe would never be able to make that reserved space go away (or would it? What does it even mean to dedupe a fallocated extent over a non-fallocated one? Does that transfer the space reservation, duplicate it, or delete it? What about deduping the other way?) Another possibility would be make fallocate able to reserve space without committing to a location, i.e. "guarantee I can allocate 500 MB for overwrites in this file at a later time" but without committing to a specific location within the file as fallocate requires now. This would cover data overwrite cases which are not covered by the current btrfs implementation of the fallocate system call, but it would require yet another "not-supported-by-all-filesystems" new fallocate flag. To be strictly correct, fallocate on nodatacow files would have to mark the subvol non-snapshottable as long as the fallocated extents exist, and disallow reflink copies of those extents. That would require some on-disk format changes to track fallocated extents that contain data. Administrators would probably want a "disallow fallocate" bit for subvols if they want to be able to make send/receive backups of them. There are probably more traps and pitfalls on this path, and good reasons btrfs didn't go there. > > > I would really like to see that Btrfs was corrected so that writes > > > to an fallocated area will be compressed (if one is using compression > > > that is). 
> > This is difficult to do with the semantics of fallocate, which dictate that > > a write to a file within the preallocated region shall not return ENOSPC. > Is this the case when you do `fallocate -l <larger-than-fs-size>`? In theory the system call would always fail because it can't allocate that amount of space, so it doesn't have to guarantee anything about ENOSPC. Note that fallocate is often emulated by userspace tools and libraries, so it may behave differently from the kernel call. Emulation is usually done by writing zeros with normal write calls, and that doesn't guarantee anything on btrfs (if anything, it does the opposite, making ENOSPC _more_ likely when writing in the fallocated region). > > It looks like compressed writes have been disabled for the whole file: > > But this is odd. So we have a file with no special attributes that in effect > is like a nodatacow file? What happens if we snapshot and then write to the > file? Prealloc is an attribute of the _extent reference_, not the file or the extent (prealloc data blocks logically contain zero, and on disk their contents are undefined). The prealloc attribute is removed when data is written to the extent. Prealloc is the only part of fallocate's allocation modes that are implemented by btrfs for datacow files. fallocate on existing blocks makes no guarantees about ENOSPC when overwriting those blocks in a btrfs datacow file, and making an extent unshared has no effect on allocation behavior in a datacow file. On nodatacow files, nodatacow is a persistent inode attribute that affects all extents referenced by the inode. btrfs fallocate behaves more or less as described by the fallocate(2) man page on nodatacow files as long as only one reference to each extent exists (snapshot or reflink). If the extent reference is prealloc or belongs to a nodatacow inode, and the extent is not shared, then a write puts data in-place within the existing extent. 
This doesn't require allocating space, so the ENOSPC guarantees of fallocate(2) hold. If the extent is shared, then a write allocates a new extent and puts the data there. If a forced copy-on-write occurs, the original extent semantics resume with the new extent (i.e. an extent created by a write becomes nodatacow in a nodatacow file, and remains datacow in a datacow file). If there is not sufficient space for copy-on-write, then the write fails with ENOSPC--this is how you get ENOSPC when writing to an existing block in a nodatacow file. In a nutshell, any write (*) to a prealloc or nodatasum extent triggers a partial backref lookup to see whether the extent is shared or not, and if the extent is shared, then nodatacow and prealloc attributes are ignored for the duration of the current write. (*) I left out some special cases that might come up if you try to follow along with btrfs-dump-tree, e.g. when one extent reference is split into two by writing to the middle of the extent reference with a seek, that technically makes two references to one extent, but doesn't trigger shared extent behavior because it is not necessary. > > # dd if=/boot/System.map-5.7.15 bs=512K seek=10000 of=test conv=notrunc > > 12+1 records in > > 12+1 records out > > 6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.144441 s, 45.7 MB/s > > # sync test > > # filefrag -v test > > Filesystem type is: 9123683e > > File size of test is 5249487498 (1281614 blocks of 4096 bytes) > > ext: logical_offset: physical_offset: length: expected: flags: > > 0: 0.. 127: 21281040720..21281040847: 128: unwritten > > 1: 128.. 1741: 21281106256..21281107869: 1614: 21281040848: > > 2: 1742.. 65535: 21281042462..21281106255: 63794: 21281107870: unwritten > > 3: 65536.. 98303: 20476350122..20476382889: 32768: 21281106256: unwritten > > 4: 98304.. 131071: 20479845152..20479877919: 32768: 20476382890: unwritten > > 5: 131072.. 163839: 20483351132..20483383899: 32768: 20479877920: unwritten > > 6: 163840.. 
196607: 20485055258..20485088025: 32768: 20483383900: unwritten > > 7: 196608.. 229375: 20485546782..20485579549: 32768: 20485088026: unwritten > > 8: 229376.. 262143: 20675234358..20675267125: 32768: 20485579550: unwritten > > 9: 1280000.. 1281613: 21281107870..21281109483: 1614: 20676284982: last,eof > > test: 10 extents found > > # getfattr -n btrfs.compression test > > # file: test > > btrfs.compression="zstd" > > > > # lsattr test > > --------c---------- test > > > > This works OK if fallocate is not used: > > > > # truncate -s 1g test2 > > # chattr +c test2 > > # sync test2 > > # filefrag -v test2 > > Filesystem type is: 9123683e > > File size of test2 is 1073741824 (262144 blocks of 4096 bytes) > > test2: 0 extents found > > # dd if=/boot/System.map-5.7.15 bs=512K seek=1 of=test2 conv=notrunc > > 12+1 records in > > 12+1 records out > > 6607498 bytes (6.6 MB, 6.3 MiB) copied, 0.110609 s, 59.7 MB/s > > # sync test2 > > # filefrag -v test2 > > Filesystem type is: 9123683e > > File size of test2 is 1073741824 (262144 blocks of 4096 bytes) > > ext: logical_offset: physical_offset: length: expected: flags: > > 0: 128.. 159: 8663165813..8663165844: 32: 128: encoded > > 1: 160.. 191: 8663166005..8663166036: 32: 8663165845: encoded > > 2: 192.. 223: 8663165607..8663165638: 32: 8663166037: encoded > > 3: 224.. 255: 8663166052..8663166083: 32: 8663165639: encoded > > [...snip...] > > 48: 1664.. 1695: 8663178516..8663178547: 32: 8663176668: encoded > > 49: 1696.. 1727: 8663178709..8663178740: 32: 8663178548: encoded > > 50: 1728.. 1741: 8663176937..8663176950: 14: 8663178741: last,encoded > > test2: 51 extents found > > > > > Thanks. > > > > ^ permalink raw reply [flat|nested] 31+ messages in thread
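The trivial reproducer described earlier in the thread (fallocate a region, then fill it with pwrite, mimicking Firebird's I/O) comes down to a few lines; a sketch, with an arbitrary file name and size:

```c
/* Sketch of the Firebird-like I/O pattern discussed in this thread:
 * preallocate a file, then overwrite the preallocated range with
 * pwrite().  On btrfs the fallocate step creates PREALLOC (unwritten)
 * extents, which per this thread bypass compression even under
 * compress-force. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Returns 0 on success, -1 on error. */
int make_prealloc_file(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* Step 1: reserve the blocks (PREALLOC extents on btrfs). */
    if (fallocate(fd, 0, 0, size) != 0) {
        close(fd);
        return -1;
    }
    /* Step 2: fill the preallocated range with zero pages. */
    char buf[8192];
    memset(buf, 0, sizeof(buf));
    for (off_t off = 0; off < size; off += (off_t)sizeof(buf)) {
        if (pwrite(fd, buf, sizeof(buf), off) < 0) {
            close(fd);
            return -1;
        }
    }
    return close(fd);
}
```

On a compress-force btrfs mount, compsize should then show the resulting file as entirely uncompressed, while a plain `cat` copy of the same file compresses well.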
* Re: new database files not compressed 2020-08-30 9:35 new database files not compressed Hamish Moffatt 2020-08-31 2:20 ` Eric Wong 2020-08-31 3:47 ` Zygo Blaxell @ 2020-09-01 1:43 ` Chris Murphy 2 siblings, 0 replies; 31+ messages in thread From: Chris Murphy @ 2020-09-01 1:43 UTC (permalink / raw) To: Hamish Moffatt; +Cc: Btrfs BTRFS I wonder if this is at all related to a case I've seen where VM images aren't being compressed at all. 'nodatacow' is not set. Does not matter if compress or compress-force is set. And if I merely copy the file, --reflink=never, the result is fairly well compressed. VM is qemu-kvm based. And I've tried either cache none or writeback (maybe both, I forget) and cache unsafe. No compression. *shrug* -- Chris Murphy ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads: [~2020-09-05 4:07 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-30 9:35 new database files not compressed Hamish Moffatt
2020-08-31 2:20 ` Eric Wong
2020-08-31 2:44 ` Hamish Moffatt
2020-08-31 3:15 ` A L
2020-08-31 3:47 ` Zygo Blaxell
2020-08-31 8:53 ` Hamish Moffatt
2020-08-31 9:25 ` Nikolay Borisov
2020-08-31 10:40 ` Hamish Moffatt
2020-08-31 10:47 ` Nikolay Borisov
2020-08-31 12:56 ` Hamish Moffatt
2020-08-31 11:15 ` Roman Mamedov
2020-08-31 12:54 ` Hamish Moffatt
2020-08-31 12:57 ` Nikolay Borisov
2020-08-31 23:50 ` Hamish Moffatt
2020-09-01 5:15 ` Nikolay Borisov
2020-09-01 8:55 ` Hamish Moffatt
2020-09-02 0:32 ` Hamish Moffatt
2020-09-02 5:57 ` Nikolay Borisov
2020-09-02 6:05 ` Hamish Moffatt
2020-09-02 6:10 ` Nikolay Borisov
2020-09-02 9:57 ` A L
2020-09-02 10:09 ` Nikolay Borisov
2020-09-03 15:04 ` A L
2020-09-02 16:16 ` Zygo Blaxell
2020-09-03 12:53 ` Hamish Moffatt
2020-09-03 19:44 ` Zygo Blaxell
2020-09-04 8:07 ` Hamish Moffatt
2020-09-05 4:07 ` Zygo Blaxell
2020-09-03 15:03 ` A L
2020-09-03 21:52 ` Zygo Blaxell
2020-09-01 1:43 ` Chris Murphy