* btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-11 20:26 UTC
To: linux-btrfs
Hey.
I think the following has now happened for the 2nd time. I have a
Debian stable system with kernel 6.1.55 running Prometheus.

There's one separate btrfs filesystem, just for the Prometheus time
series database.
# btrfs check /dev/vdb
Opening filesystem to check...
Checking filesystem on /dev/vdb
UUID: decdc81d-7cc4-431c-ab84-e03771f6de5d
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 42427637760 bytes used, no error found
total csum bytes: 27362284
total tree bytes: 32686080
total fs tree bytes: 1982464
total extent tree bytes: 360448
btree space waste bytes: 2839648
file data blocks allocated: 54877196288
referenced 28014796800
# mount /data/main/
# df | grep main
/dev/vdb btrfs 43G 43G 25k 100% /data/main
=> df thinks it's full
# btrfs filesystem usage /data/main/
Overall:
Device size: 40.00GiB
Device allocated: 40.00GiB
Device unallocated: 1.00MiB
Device missing: 0.00B
Device slack: 0.00B
Used: 39.54GiB
Free (estimated): 24.00KiB (min: 24.00KiB)
Free (statfs, df): 24.00KiB
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 29.22MiB (used: 0.00B)
Multiple profiles: no
Data,single: Size:39.48GiB, Used:39.48GiB (100.00%)
/dev/vdb 39.48GiB
Metadata,DUP: Size:256.00MiB, Used:31.16MiB (12.17%)
/dev/vdb 512.00MiB
System,DUP: Size:8.00MiB, Used:16.00KiB (0.20%)
/dev/vdb 16.00MiB
Unallocated:
/dev/vdb 1.00MiB
=> btrfs thinks so, too
# btrfs subvolume list -pagu /data/main/
ID 257 gen 2347947 parent 5 top level 5 uuid ae3fa7ff-f5a4-cf44-8555-ad579195036c path <FS_TREE>/data
=> no snapshots involved
# du --apparent-size --total -s --si /data/main/
29G /data/main/
29G total
=> but when actually counting the file sizes, there should be 11G left.
:/data/main/prometheus# dd if=/dev/zero of=foo bs=1M count=1
dd: error writing 'foo': No space left on device
1+0 records in
0+0 records out
0 bytes copied, 0,0876783 s, 0,0 kB/s
And it really is full.
Any ideas how this can happen?
Thanks,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-11 20:57 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 06:56, Christoph Anton Mitterer wrote:
> Hey.
>
> I think the following has now happened for the 2nd time. I have a
> Debian stable system with kernel 6.1.55 running Prometheus.
>
> There's one separate btrfs filesystem, just for the Prometheus time
> series database.
>
> # btrfs check /dev/vdb
> Opening filesystem to check...
> Checking filesystem on /dev/vdb
> UUID: decdc81d-7cc4-431c-ab84-e03771f6de5d
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space tree
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 42427637760 bytes used, no error found
> total csum bytes: 27362284
> total tree bytes: 32686080
> total fs tree bytes: 1982464
> total extent tree bytes: 360448
> btree space waste bytes: 2839648
> file data blocks allocated: 54877196288
> referenced 28014796800

That's pretty good.

> # mount /data/main/
>
> # df | grep main
> /dev/vdb btrfs 43G 43G 25k 100% /data/main
>
> => df thinks it's full
>
> # btrfs filesystem usage /data/main/
> Overall:
> Device size: 40.00GiB
> Device allocated: 40.00GiB
> Device unallocated: 1.00MiB

Already full from the perspective of chunk space. No new chunk can be
allocated.

> Device missing: 0.00B
> Device slack: 0.00B
> Used: 39.54GiB
> Free (estimated): 24.00KiB (min: 24.00KiB)
> Free (statfs, df): 24.00KiB
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve: 29.22MiB (used: 0.00B)
> Multiple profiles: no
>
> Data,single: Size:39.48GiB, Used:39.48GiB (100.00%)

Data chunks are already exhausted.

> /dev/vdb 39.48GiB
>
> Metadata,DUP: Size:256.00MiB, Used:31.16MiB (12.17%)

A single metadata chunk, which is not full.

> /dev/vdb 512.00MiB
>
> System,DUP: Size:8.00MiB, Used:16.00KiB (0.20%)
> /dev/vdb 16.00MiB
>
> Unallocated:
> /dev/vdb 1.00MiB
>
> => btrfs thinks so, too
>
> # btrfs subvolume list -pagu /data/main/
> ID 257 gen 2347947 parent 5 top level 5 uuid ae3fa7ff-f5a4-cf44-8555-ad579195036c path <FS_TREE>/data

Is your current mounted subvolume the fs tree? Or already the data
subvolume?

In the latter case, there are some files you cannot access from your
current mount point.

Thus it's recommended to use qgroup to show a correct full view of the
used space by each subvolume.

Thanks,
Qu

> => no snapshots involved
>
> # du --apparent-size --total -s --si /data/main/
> 29G /data/main/
> 29G total
>
> => but when actually counting the file sizes, there should be 11G left.
>
> :/data/main/prometheus# dd if=/dev/zero of=foo bs=1M count=1
> dd: error writing 'foo': No space left on device
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 0,0876783 s, 0,0 kB/s
>
> And it really is full.
>
> Any ideas how this can happen?
>
> Thanks,
> Chris.
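
For reference, a minimal way to get the per-subvolume view Qu suggests
(a sketch; quotas are not enabled on this filesystem yet, and enabling
them adds some runtime overhead, so they can be disabled again
afterwards):

# btrfs quota enable /data/main
# btrfs quota rescan -w /data/main
# btrfs qgroup show /data/main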

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-11 22:23 UTC
To: Qu Wenruo, linux-btrfs

Hey Qu.

On Tue, 2023-12-12 at 07:27 +1030, Qu Wenruo wrote:
> Is your current mounted subvolume the fs tree? Or already the data
> subvolume?

Well, actually both. I always have a "service" mountpoint of the root
volume as well, and had just unmounted that to not confuse things with
"double" mount entries.

In reality it looks like:
# mount | grep vdb
/dev/vdb on /data/main type btrfs (rw,noatime,space_cache=v2,subvolid=257,subvol=/data)
/dev/vdb on /data/btrfs-top-level-subvolumes/data-main type btrfs (rw,noatime,space_cache=v2,subvolid=5,subvol=/)

But all data (except for 2 empty dirs, where on other systems I would
place btrbk snapshots) is in the data subvolume:

data/btrfs-top-level-subvolumes/data-main# ls -al
total 16
drwxr-xr-x 1 root root 26 Feb 21  2023 .
drwxr-xr-x 1 root root 30 Nov  9 23:49 ..
drwxr-xr-x 1 root root 20 Feb 21  2023 data
drwx------ 1 root root 10 Feb 21  2023 snapshots
/data/btrfs-top-level-subvolumes/data-main# du --apparent-size --total -s --si *
29G   data
10    snapshots
29G   total

> In the latter case, there are some files you cannot access from your
> current mount point.

No, it's not that (that would have been quite embarrassing ^^).
/data/btrfs-top-level-subvolumes/data-main# tree -a snapshots/
snapshots/
└── btrbk

2 directories, 0 files

> Thus it's recommended to use qgroup to show a correct full view of the
> used space by each subvolume.

Uhm... qgroups? How would they help me here?

Cheers,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-11 22:26 UTC
To: Qu Wenruo, linux-btrfs

Oh, and btw: I did fully unmount all mountpoints of the fs in
question... so it cannot just be some process that still holds 11G in
deleted file(s).

Cheers,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-11 23:20 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 08:53, Christoph Anton Mitterer wrote:
[...]
>> Thus it's recommended to use qgroup to show a correct full view of the
>> used space by each subvolume.
>
> Uhm... qgroups? How would they help me here?

It shows exactly which subvolume uses how many bytes, including orphan
ones which are pending deletion.

Thanks,
Qu

> Cheers,
> Chris.
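
Pending (orphan) subvolume deletions can also be inspected and waited
for directly, without qgroups (a sketch):

# btrfs subvolume list -d /data/main
# btrfs subvolume sync /data/main

The first lists deleted subvolumes that are not yet cleaned up; the
second blocks until the queued deletions have actually been removed.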

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-11 23:38 UTC
To: Qu Wenruo, linux-btrfs

On Tue, 2023-12-12 at 09:50 +1030, Qu Wenruo wrote:
> It shows exactly which subvolume uses how many bytes, including orphan
> ones which are pending deletion.

Well... here we go:
# btrfs qgroup show .
Qgroupid    Referenced    Exclusive   Path
--------    ----------    ---------   ----
0/5           16.00KiB     16.00KiB   <toplevel>
0/257         39.48GiB     39.48GiB   data
0/258         16.00KiB     16.00KiB   <stale>
0/259         16.00KiB     16.00KiB   a
0/260         16.00KiB     16.00KiB   b
1/100         32.00KiB     32.00KiB   <0 member qgroups>
1/101            0.00B        0.00B   <0 member qgroups>

I've just created a and b to get qgroup (somehow? ^^) working.

Nevertheless:
I'm 100% sure that before, there were never any subvolumes on that fs
other than the toplevel and data, unless btrfs somehow creates/deletes
them automatically.

But the above output, AFAIU, still shows that "everything" is in data,
while counting the bytes of the files there still yields a much lower
number.

Any other ideas what I could test?

Thanks,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-11 23:54 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 10:08, Christoph Anton Mitterer wrote:
> On Tue, 2023-12-12 at 09:50 +1030, Qu Wenruo wrote:
>> It shows exactly which subvolume uses how many bytes, including orphan
>> ones which are pending deletion.
>
> Well... here we go:
> # btrfs qgroup show .
> Qgroupid    Referenced    Exclusive   Path
> --------    ----------    ---------   ----
> 0/5           16.00KiB     16.00KiB   <toplevel>
> 0/257         39.48GiB     39.48GiB   data
> 0/258         16.00KiB     16.00KiB   <stale>
> 0/259         16.00KiB     16.00KiB   a
> 0/260         16.00KiB     16.00KiB   b
> 1/100         32.00KiB     32.00KiB   <0 member qgroups>
> 1/101            0.00B        0.00B   <0 member qgroups>
>
> I've just created a and b to get qgroup (somehow? ^^) working.
>
> Nevertheless:
> I'm 100% sure that before, there were never any subvolumes on that fs
> other than the toplevel and data, unless btrfs somehow creates/deletes
> them automatically.
>
> But the above output, AFAIU, still shows that "everything" is in data,
> while counting the bytes of the files there still yields a much lower
> number.

OK, then everything looks fine.

> Any other ideas what I could test?

Then the last thing is extent bookends.

COW and small random writes can easily lead to extra space wasted by
extent bookends.

E.g. you write a 16M data extent, then over-write the trailing 8M; now
we have two data extents, the old 16M one and the new 8M one, wasting
8M of space.

In that case, you can try defrag, but you still need to delete some
data first so that you can do the defrag...

Thanks,
Qu

> Thanks,
> Chris.
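
Qu's example can be tried out directly on a scratch btrfs filesystem
(a sketch; /mnt/scratch is a hypothetical mount point, and the exact
numbers are what one would expect, not verified here):

# dd if=/dev/urandom of=/mnt/scratch/f bs=1M count=16 conv=fsync
# dd if=/dev/urandom of=/mnt/scratch/f bs=1M count=8 seek=8 conv=notrunc,fsync
# sync
# compsize -b /mnt/scratch/f

Disk usage should come out at around 24M (the old 16M extent plus the
new 8M one) for a file that only references 16M of data.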

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-12 0:12 UTC
To: Qu Wenruo, linux-btrfs

On Tue, 2023-12-12 at 10:24 +1030, Qu Wenruo wrote:
> Then the last thing is extent bookends.
>
> COW and small random writes can easily lead to extra space wasted by
> extent bookends.

Is there a way to check this? Would I just see maaany extents when I
look at the files with filefrag?

I mean, Prometheus continuously collects metrics from a number of nodes
and (sooner or later) writes them to disk.
I don't really know their code, so I have no idea whether they already
write every tiny metric, or only large bunches thereof.

Since they do maintain a WAL, I'd assume the former.

Every now and then, the WAL is written to chunk files, which are rather
large, well ~160M or so in my case, but that depends on how many
metrics one collects. I think they always write data for a period of
2h.
Later on, they further compact those chunks (I think after 8 hours and
so on), in which case some larger rewritings would be done.
Though in my case this doesn't happen, as I run Thanos on top of
Prometheus, and for that one needs to disable Prometheus' own
compaction.

I had already looked at the extents for these "compacted" chunk files,
but the worst file had only 32 extents (as reported by filefrag).

Looking at the WAL files:
/data/main/prometheus/metrics2/wal# filefrag * | grep -v ' 0 extents found'
00001030: 82 extents found
00001031: 81 extents found
00001032: 79 extents found
00001033: 82 extents found
00001034: 78 extents found
00001035: 78 extents found
00001036: 81 extents found
00001037: 79 extents found
00001038: 79 extents found
00001039: 89 extents found
00001040: 80 extents found
00001041: 74 extents found
00001042: 81 extents found
00001043: 97 extents found
00001044: 101 extents found
00001045: 316 extents found
checkpoint.00001029: FIBMAP/FIEMAP unsupported

(I did the grep -v, because there were a gazillion of empty wal files,
presumably created when the fs was already full.)

The above numbers still don't look too bad, do they?

And checking all:
# find /data/main/ -type f -execdir filefrag {} \; | cut -d : -f 2 | sort | uniq -c | sort -V
   3706  0 extents found
    450  1 extent found
     25  3 extents found
     62  2 extents found
      1  8 extents found
      1  9 extents found
      1  10 extents found
      1  11 extents found
      1  32 extents found
      1  74 extents found
      1  80 extents found
      1  89 extents found
      1  97 extents found
      1  101 extents found
      1  316 extents found
      2  78 extents found
      2  82 extents found
      3  5 extents found
      3  79 extents found
      3  81 extents found
      6  4 extents found

> E.g. you write a 16M data extent, then over-write the trailing 8M; now
> we have two data extents, the old 16M one and the new 8M one, wasting
> 8M of space.
>
> In that case, you can try defrag, but you still need to delete some
> data first so that you can do the defrag...

Well, my main concern is rather how to prevent this from happening in
the first place... the data is all backed up into Thanos already, so I
could also just wipe the fs.
But this seems to occur repeatedly (well, okay, only twice so far O:-) ).
So that would mean we have some IO pattern that "kills" btrfs.

Cheers,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-12 0:58 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 10:42, Christoph Anton Mitterer wrote:
> On Tue, 2023-12-12 at 10:24 +1030, Qu Wenruo wrote:
>> Then the last thing is extent bookends.
>>
>> COW and small random writes can easily lead to extra space wasted by
>> extent bookends.
>
> Is there a way to check this? Would I just see maaany extents when I
> look at the files with filefrag?
>
> I mean, Prometheus continuously collects metrics from a number of nodes
> and (sooner or later) writes them to disk.
> I don't really know their code, so I have no idea whether they already
> write every tiny metric, or only large bunches thereof.
>
> Since they do maintain a WAL, I'd assume the former.
>
> Every now and then, the WAL is written to chunk files, which are rather
> large, well ~160M or so in my case, but that depends on how many
> metrics one collects. I think they always write data for a period of
> 2h.
> Later on, they further compact those chunks (I think after 8 hours and
> so on), in which case some larger rewritings would be done.
> Though in my case this doesn't happen, as I run Thanos on top of
> Prometheus, and for that one needs to disable Prometheus' own
> compaction.
>
> I had already looked at the extents for these "compacted" chunk files,
> but the worst file had only 32 extents (as reported by filefrag).

Filefrag doesn't work that well on btrfs AFAIK, as btrfs is emitting
merged extents to the fiemap ioctl, but for fragmented ones, filefrag
should be enough to detect them.

> Looking at the WAL files:
> /data/main/prometheus/metrics2/wal# filefrag * | grep -v ' 0 extents found'
> 00001030: 82 extents found
> 00001031: 81 extents found
> 00001032: 79 extents found
> 00001033: 82 extents found
> 00001034: 78 extents found
> 00001035: 78 extents found
> 00001036: 81 extents found
> 00001037: 79 extents found
> 00001038: 79 extents found
> 00001039: 89 extents found
> 00001040: 80 extents found
> 00001041: 74 extents found
> 00001042: 81 extents found
> 00001043: 97 extents found
> 00001044: 101 extents found
> 00001045: 316 extents found
> checkpoint.00001029: FIBMAP/FIEMAP unsupported
>
> (I did the grep -v, because there were a gazillion of empty wal files,
> presumably created when the fs was already full.)
>
> The above numbers still don't look too bad, do they?

Depends; in my previous 16M case, you only got 2 extents, but still
wasted 8M (33.3% of the space).

But WAL indeed looks like a bad pattern for btrfs.

> And checking all:
> # find /data/main/ -type f -execdir filefrag {} \; | cut -d : -f 2 | sort | uniq -c | sort -V
>    3706  0 extents found
>     450  1 extent found
>      25  3 extents found
>      62  2 extents found
>       1  8 extents found
>       1  9 extents found
>       1  10 extents found
>       1  11 extents found
>       1  32 extents found
>       1  74 extents found
>       1  80 extents found
>       1  89 extents found
>       1  97 extents found
>       1  101 extents found
>       1  316 extents found
>       2  78 extents found
>       2  82 extents found
>       3  5 extents found
>       3  79 extents found
>       3  81 extents found
>       6  4 extents found
>
>> E.g. you write a 16M data extent, then over-write the trailing 8M; now
>> we have two data extents, the old 16M one and the new 8M one, wasting
>> 8M of space.
>>
>> In that case, you can try defrag, but you still need to delete some
>> data first so that you can do the defrag...
>
> Well, my main concern is rather how to prevent this from happening in
> the first place... the data is all backed up into Thanos already, so I
> could also just wipe the fs.
> But this seems to occur repeatedly (well, okay, only twice so far O:-) ).
> So that would mean we have some IO pattern that "kills" btrfs.

Thus we have the "autodefrag" mount option for such use cases.

Thanks,
Qu

> Cheers,
> Chris.
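
For reference, autodefrag is a mount option, so trying it here would
look something like this (a sketch; it acts on the whole filesystem,
not on individual directories, and mainly targets small random
overwrites of already-written files):

# mount -o remount,autodefrag /data/main

or the corresponding entry in the options field of /etc/fstab.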

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-12 2:30 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 11:28, Qu Wenruo wrote:
> On 2023/12/12 10:42, Christoph Anton Mitterer wrote:
[...]
>> Is there a way to check this? Would I just see maaany extents when I
>> look at the files with filefrag?

IIRC compsize can do it.
https://github.com/kilobyte/compsize

Thanks,
Qu

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-12 3:27 UTC
To: Qu Wenruo, linux-btrfs

Hey.

On Tue, 2023-12-12 at 13:00 +1030, Qu Wenruo wrote:
> IIRC compsize can do it.
> https://github.com/kilobyte/compsize

Okay... that seems promising:
/data/main/prometheus/metrics2# compsize 01H*
Processed 544 files, 399 regular extents (447 refs), 272 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       37G          37G          23G
none       100%       37G          37G          23G

01H* are the subdirs for the "semi-final" chunks.

So here's my stolen storage ;-)

I'm a bit puzzled how that can happen; I mean, for the chunks I'd have
naively assumed that they just write them more or less at once and in
sequence.

Interestingly, the WAL seems good, though:
/data/main/prometheus/metrics2# compsize wal/
Processed 3723 files, 1617 regular extents (1617 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      1.9G         1.9G         1.9G
none       100%      1.9G         1.9G         1.9G

One thing I've noted by chance and don't understand:
I assumed "Referenced" was the number of unique bytes actually
referenced (by someone). So when I run compsize on a single file,
Referenced should be the file size?

/data/main/prometheus/metrics2/wal# lll 00001030
251052 -rw-rw-r-- 1 106 106 ? 134217728 2023-12-10 04:51:58.665808973 +0100 00001030
/data/main/prometheus/metrics2/wal# compsize -b 00001030
Processed 1 file, 83 regular extents (83 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     134451200    134451200    134217728
none       100%     134451200    134451200    134217728

=> okay, here it is

/data/main/prometheus/metrics2/wal# lll 00001045
251947 -rw-rw-r-- 1 106 106 ? 33034564 2023-12-10 08:57:01.892017049 +0100 00001045
/data/main/prometheus/metrics2/wal# compsize -b 00001045
Processed 1 file, 316 regular extents (316 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     33116160     33116160     33038336
none       100%     33116160     33116160     33038336

=> here, Referenced is 3772 bytes larger than the actual file size?
How can that happen?

On Tue, 2023-12-12 at 11:28 +1030, Qu Wenruo wrote:
> But WAL indeed looks like a bad pattern for btrfs.
> Thus we have the "autodefrag" mount option for such use cases.

Well, the manpage warns against using it on large DB workloads... I
mean, Prometheus is not exactly like a DB, and I would have naively
assumed that at least the chunks were written not as many small random
writes... but apparently they are.

Also, this is a VM, so the storage volume is actually something Ceph
backed, which the university's supercomputing centre provides us with.
I wonder, if I do autodefrag on all that, whether it won't just kill
off our performance even more?

Thanks :-)
Chris.
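
The 3772-byte discrepancy is plausibly just block-size rounding:
extents are tracked in whole 4096-byte blocks, so compsize's
Referenced would be the file size rounded up to the next block
boundary (a quick check of the arithmetic):

# echo $(( (33034564 + 4095) / 4096 * 4096 ))
33038336
# echo $(( 33038336 - 33034564 ))
3772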

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-12 3:40 UTC
To: Qu Wenruo, linux-btrfs

I just noticed that others have already had that problem with
Prometheus:
https://github.com/prometheus/prometheus/issues/9107

Some users there wondered whether the issue could be caused by too
aggressive preallocation via `fallocate`.

Do you think that would be something that could also cause the wasted
space?

Thanks,
Chris.
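
One way to check whether preallocated-but-never-written space is still
attached to a file is verbose fiemap output, e.g. via filefrag on one
of the chunk files (a sketch; preallocated ranges should show up in
the flags column as "unwritten"):

# filefrag -v /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001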

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-12 4:13 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 14:10, Christoph Anton Mitterer wrote:
> I just noticed that others have already had that problem with
> Prometheus:
> https://github.com/prometheus/prometheus/issues/9107
>
> Some users there wondered whether the issue could be caused by too
> aggressive preallocation via `fallocate`.
>
> Do you think that would be something that could also cause the wasted
> space?

Well, preallocated inodes (any inodes with any preallocated extents
during their lifespan; it's a btrfs-specific flag that won't be cleared
until the inode is evicted) would only lead btrfs to try NOCOW first,
then fall back to COW if NOCOW failed (missing the compression path).

It's not a direct cause of the problem.

The direct cause is frequent fsync()/sync() with overwrites.

Btrfs really relies on merging writes between transactions; if
fsync()/sync() is called too frequently (like by some databases) and
the program is doing overwrites, this is exactly what you would get.

IIRC we can set AUTODEFRAG for a directory?

Thanks,
Qu

> Thanks,
> Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Chris Murphy @ 2023-12-15 2:33 UTC
To: Qu Wenruo, Christoph Mitterer, Btrfs BTRFS

On Mon, Dec 11, 2023, at 11:13 PM, Qu Wenruo wrote:
> IIRC we can set AUTODEFRAG for a directory?

How? Would be useful to isolate autodefrag for the bookends and small
database (web browser) use case, but not for the large busy database
use case.

--
Chris Murphy

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-15 3:12 UTC
To: Chris Murphy, Christoph Mitterer, Btrfs BTRFS

On 2023/12/15 13:03, Chris Murphy wrote:
> On Mon, Dec 11, 2023, at 11:13 PM, Qu Wenruo wrote:
>> IIRC we can set AUTODEFRAG for a directory?
>
> How? Would be useful to isolate autodefrag for the bookends and small
> database (web browser) use case, but not for the large busy database
> use case.

I confused it with NODATACOW; that would help a lot, as long as the
subvolume doesn't get snapshotted.

Thanks,
Qu
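
For reference, NODATACOW can indeed be set per directory via the file
attribute flag, so that newly created files inherit it (a sketch; it
only affects files created after the flag is set, and it disables data
checksums and compression for those files):

# chattr +C /data/main/prometheus
# lsattr -d /data/main/prometheus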

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-18 16:24 UTC
To: Qu Wenruo, linux-btrfs

Hey again.

Seems that even the manual defrag doesn't help at all:

After:
btrfs filesystem defragment -v -r -t 100000M

there's still:
# compsize .
Processed 309 files, 324 regular extents (324 refs), 146 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       22G          22G          13G
none       100%       22G          22G          13G

Any other ideas how this could be solved?

Cheers,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Goffredo Baroncelli @ 2023-12-18 19:18 UTC
To: Christoph Anton Mitterer, Qu Wenruo, linux-btrfs

On 18/12/2023 17.24, Christoph Anton Mitterer wrote:
> Hey again.
>
> Seems that even the manual defrag doesn't help at all:
>
> After:
> btrfs filesystem defragment -v -r -t 100000M

Since there are only 309 files, I suggest finding one file as a test
case and starting to inspect what is happening.

> there's still:
> # compsize .
> Processed 309 files, 324 regular extents (324 refs), 146 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%       22G          22G          13G
> none       100%       22G          22G          13G
>
> Any other ideas how this could be solved?
>
> Cheers,
> Chris.

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Goffredo Baroncelli @ 2023-12-18 20:04 UTC
To: Christoph Anton Mitterer, Qu Wenruo, linux-btrfs

On 18/12/2023 20.18, Goffredo Baroncelli wrote:
> On 18/12/2023 17.24, Christoph Anton Mitterer wrote:
>> Hey again.
>>
>> Seems that even the manual defrag doesn't help at all:
>>
>> After:
>> btrfs filesystem defragment -v -r -t 100000M
>
> Since there are only 309 files, I suggest finding one file as a test
> case and starting to inspect what is happening.

I don't know if this helps; however, I tried to reproduce this
situation, and here is what I found:

$ python3 mktestfile.py
$ sudo /usr/sbin/compsize test.bin
Processed 1 file, 3 regular extents (3 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      3.0M         3.0M         2.0M
none       100%      3.0M         3.0M         2.0M
$ btrfs fi defra -v test.bin
test.bin
$ sudo /usr/sbin/compsize test.bin
Processed 1 file, 3 regular extents (3 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      3.0M         3.0M         2.0M   <------------- 3M
none       100%      3.0M         3.0M         2.0M
$ sync
$ sudo /usr/sbin/compsize test.bin
Processed 1 file, 2 regular extents (2 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      2.0M         2.0M         2.0M   <------------- 2M after a sync
none       100%      2.0M         2.0M         2.0M

So until a sync, the numbers are not updated.

#------------------------------------------
$ cat mktestfile.py
import os

# Write 3 x 1MiB; after each write, fsync and then seek back 0.5MiB,
# so that the next write half-overwrites the previous extent.
f = open("test.bin", "wb")
p = 0
s = 1024 * 1024
for i in range(3):
    f.write(b"x" * s)
    p += s
    f.flush()                # flush Python's userspace buffer...
    os.fsync(f.fileno())     # ...and force the data to disk
    p -= s // 2
    f.seek(p, 0)
f.close()
#------------------------------------------

>> there's still:
>> # compsize .
>> Processed 309 files, 324 regular extents (324 refs), 146 inline.
>> Type       Perc     Disk Usage   Uncompressed Referenced
>> TOTAL      100%       22G          22G          13G
>> none       100%       22G          22G          13G
>>
>> Any other ideas how this could be solved?
>>
>> Cheers,
>> Chris.

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-18 22:38 UTC
To: kreijack, Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 11465 bytes --]

Hey.

On Mon, 2023-12-18 at 20:18 +0100, Goffredo Baroncelli wrote:
> Since there are only 309 files, I suggest finding one file as a test
> case and starting to inspect what is happening.

I made a small wrapper around compsize:

#!/bin/sh

# For every filename read on stdin, print the file if its disk usage
# exceeds its referenced bytes by at least 0.5 MiB.
while IFS='' read -r file; do
	tmp="$(compsize -b "$file" | grep '^none' | sed -E 's/ +/ /g')"
	du="$(printf '%s\n' "$tmp" | cut -d ' ' -f 3)"
	ref="$(printf '%s\n' "$tmp" | cut -d ' ' -f 5)"
	delta="$(( $du - $ref ))"
	if [ "$delta" -ge 524288 ]; then
		printf '%s\t%s\n' "$delta" "$file"
	fi
done

called like:
# find /data/main -type f | ~/compsize-helper
252653568  /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
217198592  /data/main/prometheus/metrics2/01HHFFHGP4SDPEGGY4ME3GT0Q7/chunks/000001
107094016  /data/main/prometheus/metrics2/01HHFPD50EKCPVC2WZCP006WDV/chunks/000001
121311232  /data/main/prometheus/metrics2/01HHFX8YPK0D55GRM4V5TN46J3/chunks/000001
106512384  /data/main/prometheus/metrics2/01HHG44J1F5R8K3527V2VVT5VA/chunks/000001
102907904  /data/main/prometheus/metrics2/01HHGB0A32RWYKWZBDVMH7NG70/chunks/000001
105345024  /data/main/prometheus/metrics2/01HHGHW20CFYANMMNF1DGSMQE2/chunks/000001
105590784  /data/main/prometheus/metrics2/01HHGRQSR0M0F53JJBKZH52S7H/chunks/000001
106688512  /data/main/prometheus/metrics2/01HHGZKJEXEKW3594B74DR9NTH/chunks/000001
106213376  /data/main/prometheus/metrics2/01HHH6FA6YRWC6HAJGQJX9QATF/chunks/000001
106409984  /data/main/prometheus/metrics2/01HHHDB1EHS6V9NYKYAMJTSXBB/chunks/000001
225648640  /data/main/prometheus/metrics2/01HHHM6S6JYGAXCQV9XVMGMW83/chunks/000001
107253760  /data/main/prometheus/metrics2/01HHHV2DZYXNFZ5PRHB2M12E7Z/chunks/000001
106340352  /data/main/prometheus/metrics2/01HHJ1Y5QS4DS408MMCH97WC02/chunks/000001
105254912  /data/main/prometheus/metrics2/01HHJ8SXF3W7DKW5N0YACG3ZWT/chunks/000001
104161280  /data/main/prometheus/metrics2/01HHJFNKQZ3AEZYFXWEQ5RQ3HB/chunks/000001
104771584  /data/main/prometheus/metrics2/01HHJPHBFVMBRB27ECMNB0J3GG/chunks/000001
101986304  /data/main/prometheus/metrics2/01HHJXD37EPQXB5599MHWEESMZ/chunks/000001
106614784  /data/main/prometheus/metrics2/01HHK48TZ2KNFNWT1VTH5HE0HF/chunks/000001
231501824  /data/main/prometheus/metrics2/01HHKB4K6M286JNX0QAX6M5P7C/chunks/000001
107102208  /data/main/prometheus/metrics2/01HHKJ07GSVRGR3B3RW7EY2M03/chunks/000001
106823680  /data/main/prometheus/metrics2/01HHKRVZ85RXB1TZ0M1Q3453D2/chunks/000001
105611264  /data/main/prometheus/metrics2/01HHKZQPZTQV3DCVQCCVH7DXGE/chunks/000001
104497152  /data/main/prometheus/metrics2/01HHM6KB9WNF18ZNC7JE81Z3QW/chunks/000001
107216896  /data/main/prometheus/metrics2/01HHMDF5FRRAX8DCNGCGTNDR98/chunks/000001
105299968  /data/main/prometheus/metrics2/01HHMMAVRD2BTWKCZPYZ1WD6DT/chunks/000001
105328640  /data/main/prometheus/metrics2/01HHMV6KZS6F476HG6DT4ZM180/chunks/000001
223375360  /data/main/prometheus/metrics2/01HHN22CPP9FVDZG1MGET13J4X/chunks/000001
224215040  /data/main/prometheus/metrics2/01HHN8Y10XMFF9M2CRPQJWG94Y/chunks/000001
99090432   /data/main/prometheus/metrics2/01HHNFSS84Y6C6BBVY7E3QMAZS/chunks/000001
104562688  /data/main/prometheus/metrics2/01HHNPNH00AZHAV1RPB6SW8443/chunks/000001
107819008  /data/main/prometheus/metrics2/01HHNXH8QE32FQEMA7DDVZK8M8/chunks/000001
103890944  /data/main/prometheus/metrics2/01HHP4D1EBSX48Y3CY66VV703Y/chunks/000001
105041920  /data/main/prometheus/metrics2/01HHPB8P8AN1W5SH6ZTC90R48Q/chunks/000001
107278336  /data/main/prometheus/metrics2/01HHPJ4EZG9TB32P8W4MDMQSZG/chunks/000001
106553344  /data/main/prometheus/metrics2/01HHPS02SFZEC187ZJJDVSD1Z0/chunks/000001
156815360  /data/main/prometheus/metrics2/01HHPZVV0P5QCGG8QA89C36RVW/chunks/000001
153927680  /data/main/prometheus/metrics2/01HHQ6QJRFY0WF1K4QHM74A12S/chunks/000001
125075456  /data/main/prometheus/metrics2/01HHQDKBFRFFQ7J2XJ836CRHW3/chunks/000001
209260544  /data/main/prometheus/metrics2/01HHQMF375KTC4BWMF9RBCM6RS/chunks/000001
106807296  /data/main/prometheus/metrics2/01HHQVAVESRRXJ5KQ9FN8AQ0XA/chunks/000001
105955328  /data/main/prometheus/metrics2/01HHR26E9C5B2RYFCQPG7G7M4H/chunks/000001
105402368  /data/main/prometheus/metrics2/01HHR92618K1C6QCR1JP3NKRR5/chunks/000001
107376640  /data/main/prometheus/metrics2/01HHRFXY8DJPBQ924RJJWZB3BY/chunks/000001
107413504  /data/main/prometheus/metrics2/01HHRPSP069YJXRDHF3P8CPM5S/chunks/000001
105881600  /data/main/prometheus/metrics2/01HHRXNE7JFSYFQ3QF63WDZ0AB/chunks/000001
217497600  /data/main/prometheus/metrics2/01HHS4H21ZSZVA10JVDNHNQ7Q8/chunks/000001
108908544  /data/main/prometheus/metrics2/01HHSBCT94CC9AW1N9JQA6VF0K/chunks/000001
109170688  /data/main/prometheus/metrics2/01HHSJ8J18HQTQ4V45AZTSN336/chunks/000001
109060096  /data/main/prometheus/metrics2/01HHSS4AQMF186EXPQKT77KG8F/chunks/000001
108732416  /data/main/prometheus/metrics2/01HHSZZWKYJG4AQ8S6X3A0V1RF/chunks/000001
159817728  /data/main/prometheus/metrics2/01HHT6VMTVPDT2RMR76P642HZ9/chunks/000001
109006848  /data/main/prometheus/metrics2/01HHTDQD2H7VJRXBC5BA7K4018/chunks/000001
108662784  /data/main/prometheus/metrics2/01HHTMK68T7QSGE4MHFWBJHM1W/chunks/000001
201793536  /data/main/prometheus/metrics2/01HHTVEYZYPDXDZSGDM91S9QFW/chunks/000001
109015040  /data/main/prometheus/metrics2/01HHV2AQ7DEHX5BNSXX2VQC2TW/chunks/000001
109756416  /data/main/prometheus/metrics2/01HHV96BHBAP1TZY54AQ8BA0YB/chunks/000001
103170048  /data/main/prometheus/metrics2/01HHVG238JVD74GD07F2W87Y9Y/chunks/000001
210407424  /data/main/prometheus/metrics2/01HHVPXVG00GAK4E3YY6ADGDG9/chunks/000001
108261376  /data/main/prometheus/metrics2/01HHVXSMQ5XWWKXAK7SRR61AK8/chunks/000001
108457984  /data/main/prometheus/metrics2/01HHW4N72CMZ90VTVACG7QD87M/chunks/000001
109596672  /data/main/prometheus/metrics2/01HHWBH08ZYZRNEC38VXK9Y946/chunks/000001
194060288  /data/main/prometheus/metrics2/01HHWJCRG7WC5NHB684RXCNKMV/chunks/000001
110596096  /data/main/prometheus/metrics2/01HHWS8GQGHNWH4Z4WETJN454Q/chunks/000001
110592000  /data/main/prometheus/metrics2/01HHX048F9110B1CMK300XCE99/chunks/000001
107954176  /data/main/prometheus/metrics2/01HHX6ZTAZEQ69MGKQDSZ7TG96/chunks/000001
157130752  /data/main/prometheus/metrics2/01HHXDVJJG0YMR3T2ZSS7TMGPA/chunks/000001
122175488  /data/main/prometheus/metrics2/01HHXMQE7152WG1HARM51JPKFT/chunks/000001
208908288  /data/main/prometheus/metrics2/01HHXVK5YTHRVQXEHXWR1K270F/chunks/000001
140976128  /data/main/prometheus/metrics2/01HHY2EXPC19MBPRGM05X3ZB75/chunks/000001
140902400  /data/main/prometheus/metrics2/01HHY9AKFSADNWJVW4XN06QTZY/chunks/000001
250085376  /data/main/prometheus/metrics2/01HHYG6AQEAYV7WBSWJSSMHR02/chunks/000001
114298880  /data/main/prometheus/metrics2/01HHYQ22YNG446ARA9KDTJH9RP/chunks/000001
108421120  /data/main/prometheus/metrics2/01HHYXXT096QKW02KAAXH791SG/chunks/000001
109977600  /data/main/prometheus/metrics2/01HHZ4SEH5S13B9EYV7S4NV4EZ/chunks/000001
99586048   /data/main/prometheus/metrics2/01HHZBN68RG5YTA2EP50TRC6T4/chunks/000001

To answer Qu's question from his reply:
No snapshots are involved, and unless Prometheus itself makes reflink
copies, no such copies should be involved either.
I do run the Thanos sidecar on top of Prometheus, which indeed makes
copies of the files before uploading them to some remote storage, but
it also deletes them afterwards. And looking at /proc/<thanos>/fd, it
doesn't keep them open as deleted.
No compression should be used (again, unless Prometheus would manually
set chattr +c or so), and no btrfs RAID. Nothing except a plain btrfs.
I used to have quotas enabled when Qu asked me to last week, but
disabled them afterwards.

I've also attached the output of:
# find /data/main -type f -exec sh -c 'echo "$1"; compsize "$1"' '' {} \; > compsize.log
in case it helps anyone.

The above shows IMO that most data is lost in the chunk files (these
000001 files are the beef of the data).

Looking at the reverse (files where the delta is less than 0.5 MiB):
233472     /data/main/prometheus/metrics2/wal/00000522
155648     /data/main/prometheus/metrics2/wal/00000523
184320     /data/main/prometheus/metrics2/wal/00000524
200704     /data/main/prometheus/metrics2/wal/00000525
126976     /data/main/prometheus/metrics2/wal/00000526
192512     /data/main/prometheus/metrics2/wal/00000527
200704     /data/main/prometheus/metrics2/wal/00000528
139264     /data/main/prometheus/metrics2/wal/00000529
200704     /data/main/prometheus/metrics2/wal/00000530
172032     /data/main/prometheus/metrics2/wal/00000531
57344      /data/main/prometheus/metrics2/wal/00000532
20480      /data/main/prometheus/metrics2/chunks_head/000151
12288      /data/main/prometheus/metrics2/chunks_head/000152
28672      /data/main/prometheus/metrics2/chunks_head/000153
45056      /data/main/prometheus/metrics2/01HHFFHGP4SDPEGGY4ME3GT0Q7/index
229376     /data/main/prometheus/metrics2/01HHFPD50EKCPVC2WZCP006WDV/index
40960      /data/main/prometheus/metrics2/01HHHM6S6JYGAXCQV9XVMGMW83/index
40960      /data/main/prometheus/metrics2/01HHHV2DZYXNFZ5PRHB2M12E7Z/index
40960      /data/main/prometheus/metrics2/01HHJ1Y5QS4DS408MMCH97WC02/index
40960      /data/main/prometheus/metrics2/01HHJ8SXF3W7DKW5N0YACG3ZWT/index
40960      /data/main/prometheus/metrics2/01HHJPHBFVMBRB27ECMNB0J3GG/index
40960      /data/main/prometheus/metrics2/01HHJXD37EPQXB5599MHWEESMZ/index
40960      /data/main/prometheus/metrics2/01HHKB4K6M286JNX0QAX6M5P7C/index
40960      /data/main/prometheus/metrics2/01HHMV6KZS6F476HG6DT4ZM180/index
40960      /data/main/prometheus/metrics2/01HHN22CPP9FVDZG1MGET13J4X/index
40960      /data/main/prometheus/metrics2/01HHN8Y10XMFF9M2CRPQJWG94Y/index
40960      /data/main/prometheus/metrics2/01HHNPNH00AZHAV1RPB6SW8443/index
40960      /data/main/prometheus/metrics2/01HHPZVV0P5QCGG8QA89C36RVW/index
40960      /data/main/prometheus/metrics2/01HHQ6QJRFY0WF1K4QHM74A12S/index
40960      /data/main/prometheus/metrics2/01HHQDKBFRFFQ7J2XJ836CRHW3/index
40960      /data/main/prometheus/metrics2/01HHQMF375KTC4BWMF9RBCM6RS/index
40960      /data/main/prometheus/metrics2/01HHR92618K1C6QCR1JP3NKRR5/index
40960      /data/main/prometheus/metrics2/01HHRFXY8DJPBQ924RJJWZB3BY/index
40960      /data/main/prometheus/metrics2/01HHS4H21ZSZVA10JVDNHNQ7Q8/index
40960      /data/main/prometheus/metrics2/01HHSBCT94CC9AW1N9JQA6VF0K/index
40960      /data/main/prometheus/metrics2/01HHT6VMTVPDT2RMR76P642HZ9/index
40960      /data/main/prometheus/metrics2/01HHTVEYZYPDXDZSGDM91S9QFW/index
40960      /data/main/prometheus/metrics2/01HHVPXVG00GAK4E3YY6ADGDG9/index
40960      /data/main/prometheus/metrics2/01HHWJCRG7WC5NHB684RXCNKMV/index
40960      /data/main/prometheus/metrics2/01HHX048F9110B1CMK300XCE99/index
40960      /data/main/prometheus/metrics2/01HHXDVJJG0YMR3T2ZSS7TMGPA/index
86016      /data/main/prometheus/metrics2/01HHXVK5YTHRVQXEHXWR1K270F/index
40960      /data/main/prometheus/metrics2/01HHYG6AQEAYV7WBSWJSSMHR02/index
4096       /data/main/prometheus/metrics2/01HHZBN68RG5YTA2EP50TRC6T4/index

So the WAL is not that much of a problem.

As for Goffredo's idea about syncing, that doesn't seem to change
things either:
# btrfs filesystem sync /data/main
# compsize /data/main
Processed 321 files, 793 regular extents (794 refs), 152 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       23G          23G          14G
none       100%       23G          23G          14G

And this is anyway running long-term... so I'd assume that sooner or
later btrfs syncs its stuff?

Cheers,
Chris.

[-- Attachment #2: compsize.log.xz --]
[-- Type: application/x-xz, Size: 2984 bytes --]

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Andrei Borzenkov @ 2023-12-19 8:22 UTC
To: Christoph Anton Mitterer
Cc: kreijack, Qu Wenruo, linux-btrfs

On Tue, Dec 19, 2023 at 4:00 AM Christoph Anton Mitterer
<calestyo@scientia.org> wrote:
>
> Hey.
>
> On Mon, 2023-12-18 at 20:18 +0100, Goffredo Baroncelli wrote:
> > Since there are only 309 files, I suggest finding one file as a test
> > case and starting to inspect what is happening.
> ...
> I've also attached the output of:
> # find /data/main -type f -exec sh -c 'echo "$1"; compsize "$1"' '' {} \; > compsize.log
> in case it helps anyone.

/data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      256M         256M          15M
none       100%      256M         256M          15M

I would try to find out whether this single extent is shared, and
where the data is located inside this extent. Could it be that the
file was truncated or a hole was punched in it?
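
Whether anything else still references that extent can be checked
directly with btrfs filesystem du, which splits a file's usage into
exclusive and shared bytes (a sketch):

# btrfs filesystem du /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001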

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Goffredo Baroncelli @ 2023-12-19 19:09 UTC
To: Andrei Borzenkov, Christoph Anton Mitterer
Cc: Qu Wenruo, linux-btrfs

On 19/12/2023 09.22, Andrei Borzenkov wrote:
> On Tue, Dec 19, 2023 at 4:00 AM Christoph Anton Mitterer
> <calestyo@scientia.org> wrote:
>
> /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> Processed 1 file, 1 regular extents (1 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      256M         256M          15M
> none       100%      256M         256M          15M
>
> I would try to find out whether this single extent is shared, and
> where the data is located inside this extent. Could it be that the
> file was truncated or a hole was punched in it?

Ok, now we have the case study.
To be sure, could you try a defrag (+ sync) of this single file?
Then, what is the lsof output?

Does anyone know a way to extract the "owners" of an extent? I think
that we should go through the backrefs, but I never did. I don't want
to re-invent the wheel, so I am asking if someone knows a tool that
can help to find the owners of an extent.

BR

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
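
btrfs-progs ships something close to this: inspect-internal
logical-resolve walks the backrefs from a logical address to the
files referencing it (a sketch; the logical address can be taken from
dump-tree or fiemap output):

# btrfs inspect-internal logical-resolve -v <logical address> /data/main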

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-21 13:53 UTC
To: kreijack, Andrei Borzenkov
Cc: Qu Wenruo, linux-btrfs

Hey Goffredo.

On Tue, 2023-12-19 at 20:09 +0100, Goffredo Baroncelli wrote:
> Ok, now we have the case study.
> To be sure, could you try a defrag (+ sync) of this single file?

# btrfs filesystem defragment /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
# btrfs filesystem defragment -t 1000M /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
# sync
# btrfs filesystem sync /data/main/
# compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      256M         256M          15M
none       100%      256M         256M          15M
#

> Then, what is the lsof output?

# lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
COMMAND       PID       USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
prometheu 2327412 prometheus 12r   REG   0,43 15781418  642 /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
#

I also stopped prometheus, synced, and checked then:
# systemctl stop prometheus.service
# lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
# btrfs filesystem sync /data/main/
# compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      256M         256M          15M
none       100%      256M         256M          15M

> Does anyone know a way to extract the "owners" of an extent? I think
> that we should go through the backrefs, but I never did. I don't want
> to re-invent the wheel, so I am asking if someone knows a tool that
> can help to find the owners of an extent.

Not me ;-) ... Does it help if I'd provide something like dump-tree
data?

The fs is soon to be full again, so I'll likely have to delete some of
the (test) data...

Thanks,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Goffredo Baroncelli @ 2023-12-21 18:03 UTC
To: Christoph Anton Mitterer, Andrei Borzenkov
Cc: Qu Wenruo, linux-btrfs

On 21/12/2023 14.53, Christoph Anton Mitterer wrote:
> Hey Goffredo.
>
> On Tue, 2023-12-19 at 20:09 +0100, Goffredo Baroncelli wrote:
>> Ok, now we have the case study.
>> To be sure, could you try a defrag (+ sync) of this single file?
>
> # btrfs filesystem defragment /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> # btrfs filesystem defragment -t 1000M /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> # sync
> # btrfs filesystem sync /data/main/
> # compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> Processed 1 file, 1 regular extents (1 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      256M         256M          15M
> none       100%      256M         256M          15M
> #
>
>> Then, what is the lsof output?
>
> # lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> COMMAND       PID       USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
> prometheu 2327412 prometheus 12r   REG   0,43 15781418  642 /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> #
>
> I also stopped prometheus, synced, and checked then:
> # systemctl stop prometheus.service
> # lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001

Here you should do a defrag, after stopping prometheus.

> # btrfs filesystem sync /data/main/
> # compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> Processed 1 file, 1 regular extents (1 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      256M         256M          15M
> none       100%      256M         256M          15M
>
>> Does anyone know a way to extract the "owners" of an extent? I think
>> that we should go through the backrefs, but I never did. I don't want
>> to re-invent the wheel, so I am asking if someone knows a tool that
>> can help to find the owners of an extent.
>
> Not me ;-) ... Does it help if I'd provide something like dump-tree
> data?

I am trying to write a tool that walks the backrefs to find the owners.
I hope to have a prototype to test by tomorrow.

> The fs is soon to be full again, so I'll likely have to delete some of
> the (test) data...
>
> Thanks,
> Chris.

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-21 22:06 UTC
To: kreijack, Andrei Borzenkov
Cc: Qu Wenruo, linux-btrfs

On Thu, 2023-12-21 at 19:03 +0100, Goffredo Baroncelli wrote:
>> # lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
>
> Here you should do a defrag, after stopping prometheus.

No difference. Even after syncing, and even after unmounting/mounting.

btw: I did that:
/data/main# compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      256M         256M          15M
none       100%      256M         256M          15M
/data/main# cat /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001 > foo
/data/main# compsize -b foo
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     268435456    268435456    15781888
none       100%     268435456    268435456    15781888
/data/main# ls -al foo
-rw-r--r-- 1 root root 15781418 Dec 21 23:02 foo

=> Wouldn't have expected that, and not only the discrepancy between
Referenced and ls: even the freshly cat'ed file has that space waste.
There should be no holes, or any other monkey business, involved?

# du --apparent-size --total -s --block-size=1 /data/main/
22112706625  /data/main/
22112706625  total
# compsize -b /data/main/
Processed 463 files, 865 regular extents (874 refs), 224 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     35752870045  35752870045  22097375389
none       100%     35752870045  35752870045  22097375389

> I am trying to write a tool that walks the backrefs to find the owners.
> I hope to have a prototype to test by tomorrow.

Thanks!

Cheers,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-19  8:22 ` Andrei Borzenkov
  2023-12-19 19:09 ` Goffredo Baroncelli
@ 2023-12-21 13:46 ` Christoph Anton Mitterer
  2023-12-21 20:41 ` Qu Wenruo
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-21 13:46 UTC (permalink / raw)
To: Andrei Borzenkov; +Cc: kreijack, Qu Wenruo, linux-btrfs

On Tue, 2023-12-19 at 11:22 +0300, Andrei Borzenkov wrote:
> /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> Processed 1 file, 1 regular extents (1 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      256M         256M          15M
> none       100%      256M         256M          15M
>
> I would try to find out whether this single extent is shared, where
> the data is located inside this extent. Could it be that file was
> truncated or the hole was punched in it?

How would I do that? :-)

Thanks,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 13:46 ` Christoph Anton Mitterer
@ 2023-12-21 20:41 ` Qu Wenruo
  2023-12-21 22:15 ` Christoph Anton Mitterer
  0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2023-12-21 20:41 UTC (permalink / raw)
To: Christoph Anton Mitterer, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On 2023/12/22 00:16, Christoph Anton Mitterer wrote:
> On Tue, 2023-12-19 at 11:22 +0300, Andrei Borzenkov wrote:
>> /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
>> Processed 1 file, 1 regular extents (1 refs), 0 inline.
>> Type       Perc     Disk Usage   Uncompressed Referenced
>> TOTAL      100%      256M         256M          15M
>> none       100%      256M         256M          15M
>>
>> I would try to find out whether this single extent is shared, where
>> the data is located inside this extent. Could it be that file was
>> truncated or the hole was punched in it?
>
> How would I do that? :-)

Grab the INODE number of that file (`stat` is good enough).

Know the subvolume id.

Then `btrfs ins dump-tree -t <subvolid> <device> | grep -A7 "key (256 "`

I guess it's time to add a way to dump all the items of a single inode
for dump-tree now.

Thanks,
Qu

> Thanks,
> Chris.
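A sketch of those two lookups with concrete commands (paths assumed
from earlier in the thread): stat -c %i prints just the inode number,
and btrfs inspect-internal rootid prints the id of the subvolume
containing a path:

# stat -c %i /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
# btrfs inspect-internal rootid /data/main/prometheus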
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 20:41 ` Qu Wenruo
@ 2023-12-21 22:15 ` Christoph Anton Mitterer
  2023-12-21 22:41 ` Qu Wenruo
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-21 22:15 UTC (permalink / raw)
To: Qu Wenruo, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Fri, 2023-12-22 at 07:11 +1030, Qu Wenruo wrote:
> Grab the INODE number of that file (`stat` is good enough).

# stat /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
  File: /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
  Size: 15781418   Blocks: 30824   IO Block: 4096   regular file
Device: 0,43   Inode: 642   Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 106/prometheus)   Gid: ( 106/prometheus)
Access: 2023-12-12 17:50:26.968485936 +0100
Modify: 2023-12-12 17:50:28.748495544 +0100
Change: 2023-12-12 17:50:57.280649521 +0100
 Birth: 2023-12-12 17:50:26.968485936 +0100

> Know the subvolume id.

# btrfs subvolume list -pagu /data/main/
ID 257 gen 2371697 parent 5 top level 5 uuid ae3fa7ff-f5a4-cf44-8555-ad579195036c path <FS_TREE>/data

> Then `btrfs ins dump-tree -t <subvolid> <device> | grep -A7 "key (256 "`

I assume 256 should be the inode number?
If so:
# btrfs ins dump-tree -t 257 /dev/vdb | grep -A7 "key (642 "
                location key (642 INODE_ITEM 0) type FILE
                transid 2348290 data_len 0 name_len 6
                name: 000001
        item 128 key (638 DIR_INDEX 3) itemoff 9441 itemsize 36
                location key (642 INODE_ITEM 0) type FILE
                transid 2348290 data_len 0 name_len 6
                name: 000001
        item 129 key (639 INODE_ITEM 0) itemoff 9281 itemsize 160
                generation 2348289 transid 2348290 size 17788225 nbytes 17788928
                block group 0 mode 100664 links 1 uid 106 gid 106 rdev 0
                sequence 408 flags 0x0(none)
                atime 1702399826.500483413 (2023-12-12 17:50:26)
--
        item 132 key (642 INODE_ITEM 0) itemoff 9053 itemsize 160
                generation 2348289 transid 2348290 size 15781418 nbytes 15781888
                block group 0 mode 100664 links 1 uid 106 gid 106 rdev 0
                sequence 3362 flags 0x10(PREALLOC)
                atime 1702399826.968485936 (2023-12-12 17:50:26)
                ctime 1702399857.280649521 (2023-12-12 17:50:57)
                mtime 1702399828.748495544 (2023-12-12 17:50:28)
                otime 1702399826.968485936 (2023-12-12 17:50:26)
        item 133 key (642 INODE_REF 638) itemoff 9037 itemsize 16
                index 3 namelen 6 name: 000001
        item 134 key (642 EXTENT_DATA 0) itemoff 8984 itemsize 53
                generation 2348290 type 1 (regular)
                extent data disk byte 9500291072 nr 268435456
                extent data offset 0 nr 15781888 ram 268435456
                extent compression 0 (none)
        item 135 key (643 INODE_ITEM 0) itemoff 8824 itemsize 160
                generation 2348290 transid 2363471 size 283 nbytes 283
                block group 0 mode 100664 links 1 uid 106 gid 106 rdev 0

If you need the whole output of btrfs ins dump-tree -t 257 /dev/vdb,
it's only 72k compressed, and AFAIU shouldn't contain any private data
(well, nothing on the whole fs is private ^^).

Cheers,
Chris
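Item 134 already quantifies the waste: the on-disk extent is 268435456
bytes (256 MiB), while the file references only 15781888 bytes (about
15 MiB) of it at offset 0. That leaves 268435456 - 15781888 = 252653568
bytes, roughly 241 MiB, allocated but unreachable for this one file,
matching the compsize numbers earlier in the thread.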
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 22:15 ` Christoph Anton Mitterer
@ 2023-12-21 22:41 ` Qu Wenruo
  2023-12-21 22:54 ` Christoph Anton Mitterer
  0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2023-12-21 22:41 UTC (permalink / raw)
To: Christoph Anton Mitterer, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On 2023/12/22 08:45, Christoph Anton Mitterer wrote:
> On Fri, 2023-12-22 at 09:11 +1030, Qu Wenruo wrote:
>> Grab the INODE number of that file (`stat` is good enough).
> # stat /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
>   File: /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
>   Size: 15781418   Blocks: 30824   IO Block: 4096   regular file
> Device: 0,43   Inode: 642   Links: 1

642 is your inode number.

> [...]
>
> If you need the whole output of btrfs ins dump-tree -t 257 /dev/vdb,
> it's only 72k compressed, and AFAIU shouldn't contain any private data
> (well, nothing on the whole fs is private ^^).

The whole one is easier for me to check. But I still strongly recommend
using "--hide-names", just in case.

Meanwhile you may want to upload the extent tree too (which can be
pretty large though), as my final step would be to cross-check against
the extent tree to be sure.

Thanks,
Qu

> Cheers,
> Chris
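For the upload, a sketch of the anonymized dumps (device name assumed
from earlier in the thread; --hide-names replaces file names in the
dump-tree output):

# btrfs inspect-internal dump-tree --hide-names -t 257 /dev/vdb > subvol.txt
# btrfs inspect-internal dump-tree --hide-names -e /dev/vdb > extent.txt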
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 22:41 ` Qu Wenruo
@ 2023-12-21 22:54 ` Christoph Anton Mitterer
  2023-12-22  0:53 ` Qu Wenruo
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-21 22:54 UTC (permalink / raw)
To: Qu Wenruo, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Fri, 2023-12-22 at 09:11 +1030, Qu Wenruo wrote:
> The whole one is easier for me to check.

https://drive.google.com/file/d/1-TJKoL85e23u5mJN0Nuoa9qyvgLRssFM/view?usp=sharing

Should contain both (-t 257 and -e).

> But I still strongly recommend using "--hide-names", just in case.

Checked it, and it's really only the prometheus filenames, all of which
are completely non-sensitive... and it makes life easier if we can see
the names.

> Meanwhile you may want to upload the extent tree too (which can be
> pretty large though), as my final step would be to cross-check against
> the extent tree to be sure.

Here we go, thanks in advance.

Cheers,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 22:54 ` Christoph Anton Mitterer
@ 2023-12-22  0:53 ` Qu Wenruo
  2023-12-22  0:56 ` Christoph Anton Mitterer
  0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2023-12-22 0:53 UTC (permalink / raw)
To: Christoph Anton Mitterer, Qu Wenruo, Andrei Borzenkov
Cc: kreijack, linux-btrfs

On 2023/12/22 09:24, Christoph Anton Mitterer wrote:
> On Fri, 2023-12-22 at 09:11 +1030, Qu Wenruo wrote:
>> The whole one is easier for me to check.
>
> https://drive.google.com/file/d/1-TJKoL85e23u5mJN0Nuoa9qyvgLRssFM/view?usp=sharing
>
> Should contain both (-t 257 and -e).

The situation is in fact simpler than I thought.

The original extent is 256M, and I believe the whole 256M was
preallocated but later truncated to the current size, which is only
15+M. That explains the size problem.

But the question remains why defrag doesn't work.
I'll look into this during the holiday season, but I strongly believe
it's the PREALLOC inode flag.
We should take it seriously now.

Thanks,
Qu

>> But I still strongly recommend using "--hide-names", just in case.
>
> Checked it, and it's really only the prometheus filenames, all of which
> are completely non-sensitive... and it makes life easier if we can see
> the names.
>
>> Meanwhile you may want to upload the extent tree too (which can be
>> pretty large though), as my final step would be to cross-check against
>> the extent tree to be sure.
>
> Here we go, thanks in advance.
>
> Cheers,
> Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-22  0:53 ` Qu Wenruo
@ 2023-12-22  0:56 ` Christoph Anton Mitterer
  2023-12-22  1:13 ` Qu Wenruo
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-22 0:56 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Fri, 2023-12-22 at 11:23 +1030, Qu Wenruo wrote:
> But the question remains why defrag doesn't work.
> I'll look into this during the holiday season, but I strongly believe
> it's the PREALLOC inode flag.
> We should take it seriously now.

Oh, and keep in mind that - as I hopefully mentioned in the beginning -
this is all on 6.1.55 (Debian stable).

Thanks,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-22  0:56 ` Christoph Anton Mitterer
@ 2023-12-22  1:13 ` Qu Wenruo
  2023-12-22  1:23 ` Christoph Anton Mitterer
  2024-01-05  3:30 ` Christoph Anton Mitterer
  0 siblings, 2 replies; 43+ messages in thread
From: Qu Wenruo @ 2023-12-22 1:13 UTC (permalink / raw)
To: Christoph Anton Mitterer, Qu Wenruo, Andrei Borzenkov
Cc: kreijack, linux-btrfs

On 2023/12/22 11:26, Christoph Anton Mitterer wrote:
> On Fri, 2023-12-22 at 11:23 +1030, Qu Wenruo wrote:
>> But the question remains why defrag doesn't work.
>> I'll look into this during the holiday season, but I strongly believe
>> it's the PREALLOC inode flag.
>> We should take it seriously now.
>
> Oh, and keep in mind that - as I hopefully mentioned in the beginning -
> this is all on 6.1.55 (Debian stable).

That's not a big deal, because before sending that reply I had already
reproduced the problem using a 6.6 kernel, so it's a long-existing
problem.
Just like the whole PREALLOC and compression problem.

Thanks,
Qu

> Thanks,
> Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-22  1:13 ` Qu Wenruo
@ 2023-12-22  1:23 ` Christoph Anton Mitterer
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-22 1:23 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Fri, 2023-12-22 at 11:43 +1030, Qu Wenruo wrote:
> That's not a big deal, because before sending that reply I had already
> reproduced the problem using a 6.6 kernel, so it's a long-existing
> problem.
> Just like the whole PREALLOC and compression problem.

Ah, I see. I just wanted to mention it, so that you don't waste hours
searching for an issue that might have been fixed in the meantime
without anyone noticing.

Cheers,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-22  1:13 ` Qu Wenruo
  2023-12-22  1:23 ` Christoph Anton Mitterer
@ 2024-01-05  3:30 ` Christoph Anton Mitterer
  2024-01-05  7:07 ` Qu Wenruo
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2024-01-05 3:30 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

Hey there.

On Fri, 2023-12-22 at 11:43 +1030, Qu Wenruo wrote:
> That's not a big deal, because before sending that reply I had already
> reproduced the problem using a 6.6 kernel, so it's a long-existing
> problem.
> Just like the whole PREALLOC and compression problem.

Just wondered whether there's anything new on this, or whether the best
option for now would be to switch filesystems?

Also, do you think that NODATACOW would also be affected by the
underlying "issue"?

Cheers,
Chris
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-05  3:30 ` Christoph Anton Mitterer
@ 2024-01-05  7:07 ` Qu Wenruo
  2024-01-06  0:42 ` Christoph Anton Mitterer
  0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2024-01-05 7:07 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs

On 2024/1/5 14:00, Christoph Anton Mitterer wrote:
> Hey there.
>
> On Fri, 2023-12-22 at 11:43 +1030, Qu Wenruo wrote:
>> That's not a big deal, because before sending that reply I had already
>> reproduced the problem using a 6.6 kernel, so it's a long-existing
>> problem.
>> Just like the whole PREALLOC and compression problem.
>
> Just wondered whether there's anything new on this, or whether the best
> option for now would be to switch filesystems?

The root cause is pinned down. It's very stupid, but it's pretty hard
to find a good way to improve it:

For that offending inode, it only has one extent.

During defrag, we need to determine whether to defrag each extent.
For this particular case, we choose not to defrag it at all, the reason
being:

- The single extent cannot be merged with any other extent

So even if we're only utilizing a single 4K of a 128M extent, defrag
chooses not to defrag at all.

But this rule applies pretty well to really fragmented files, thus I
have no good idea how to continue.

Maybe we can add a new rule to treat very under-utilized extents as
defrag targets? But I'm sure that may make other corner cases unhappy
though.

> Also, do you think that NODATACOW would also be affected by the
> underlying "issue"?

My previous guess was totally wrong; it has nothing to do with the
NODATACOW/PREALLOC flags at all.

It's a defrag-only problem.

Thanks,
Qu

> Cheers.
> Chris
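Concretely, with the 000001 file from earlier in the thread: defrag
sees a single extent mapping covering the whole ~15M file, finds no
neighbouring extent to merge it with, and drops it from the target
list, no matter how little of the underlying 256M on-disk extent is
actually referenced. That is also why the -t 1000M attempts earlier
made no difference: the 15M mapping was well below that threshold, but
there was still nothing to merge it with.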
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-05  7:07 ` Qu Wenruo
@ 2024-01-06  0:42 ` Christoph Anton Mitterer
  2024-01-06  5:40 ` Qu Wenruo
  2024-12-14 19:09 ` Christoph Anton Mitterer
  0 siblings, 2 replies; 43+ messages in thread
From: Christoph Anton Mitterer @ 2024-01-06 0:42 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

On Fri, 2024-01-05 at 17:37 +1030, Qu Wenruo wrote:
> but it's pretty hard to find a good way to improve it:

I guess it would not be an alternative to do the work on truncate (i.e.
check whether a lot is wasted and, if so, create a new extent with just
the right size), because that would need a full re-write of the
extent... or would it?

Also, a while ago in one of my mails I saw that:
$ cat affected-file > new-file
would also cause new-file to be affected... which I found pretty
strange.

Any idea why that happens?

> My previous guess was totally wrong; it has nothing to do with the
> NODATACOW/PREALLOC flags at all.
>
> It's a defrag-only problem.

Sure, but I meant: if a file is NODATACOW and were preallocated to a
large size and then truncated - would it also lose the extra space?

And do you think that other CoW filesystems would be affected, too?
IIRC XFS is also about to get CoW features... so maybe it's simply an
IO pattern that developers need to avoid with modern filesystems.

Cheers,
Chris
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-06  0:42 ` Christoph Anton Mitterer
@ 2024-01-06  5:40 ` Qu Wenruo
  2024-01-06  8:12 ` Andrei Borzenkov
  1 sibling, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2024-01-06 5:40 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs

On 2024/1/6 11:12, Christoph Anton Mitterer wrote:
> On Fri, 2024-01-05 at 17:37 +1030, Qu Wenruo wrote:
>> but it's pretty hard to find a good way to improve it:
>
> I guess it would not be an alternative to do the work on truncate (i.e.
> check whether a lot is wasted and, if so, create a new extent with just
> the right size), because that would need a full re-write of the
> extent... or would it?

I'm not sure it's a valid idea to do all of this on truncation.
It would involve doing the whole backref walk to determine whether the
COW is needed.

> Also, a while ago in one of my mails I saw that:
> $ cat affected-file > new-file
> would also cause new-file to be affected... which I found pretty
> strange.
>
> Any idea why that happens?

Initially I thought that was impossible, until I tried it:

  mkfs.btrfs -f $dev1
  mount $dev1 $mnt
  xfs_io -f -c "pwrite 0 128m" $mnt/file
  sync
  truncate -s 4k $mnt/file
  sync
  cat $mnt/file > $mnt/new
  sync

Then dump-tree indeed shows the new file sharing the same large extent:

        item 6 key (257 INODE_ITEM 0) itemoff 15817 itemsize 160
                generation 7 transid 8 size 4096 nbytes 4096
                block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                sequence 32770 flags 0x0(none)
        item 7 key (257 INODE_REF 256) itemoff 15803 itemsize 14
                index 2 namelen 4 name: file
        item 8 key (257 EXTENT_DATA 0) itemoff 15750 itemsize 53
                generation 7 type 1 (regular)
                extent data disk byte 298844160 nr 134217728   <<<
                extent data offset 0 nr 4096 ram 134217728
                extent compression 0 (none)
        item 9 key (258 INODE_ITEM 0) itemoff 15590 itemsize 160
                generation 9 transid 9 size 4096 nbytes 4096
                block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
                sequence 1 flags 0x0(none)
        item 10 key (258 INODE_REF 256) itemoff 15577 itemsize 13
                index 3 namelen 3 name: new
        item 11 key (258 EXTENT_DATA 0) itemoff 15524 itemsize 53
                generation 7 type 1 (regular)
                extent data disk byte 298844160 nr 134217728   <<<
                extent data offset 0 nr 4096 ram 134217728
                extent compression 0 (none)

My guess is bash is doing something weird, thus making the whole cat +
redirection into a reflink.

But at least dd works as expected by creating a new extent.

>> My previous guess was totally wrong; it has nothing to do with the
>> NODATACOW/PREALLOC flags at all.
>>
>> It's a defrag-only problem.
>
> Sure, but I meant: if a file is NODATACOW and were preallocated to a
> large size and then truncated - would it also lose the extra space?

That would be the same.

> And do you think that other CoW filesystems would be affected, too?
> IIRC XFS is also about to get CoW features... so maybe it's simply an
> IO pattern that developers need to avoid with modern filesystems.

Not sure, maybe XFS would do extra extent splits to solve the problem.

Thanks,
Qu
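To see the same sharing without a tree dump (a sketch on the reproducer
above): pointing compsize at both files should report 4K referenced per
file against the same underlying 128M extent counted once on disk:

# compsize -b $mnt/file $mnt/new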
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-06  5:40 ` Qu Wenruo
@ 2024-01-06  8:12 ` Andrei Borzenkov
  0 siblings, 0 replies; 43+ messages in thread
From: Andrei Borzenkov @ 2024-01-06 8:12 UTC (permalink / raw)
To: Qu Wenruo, Christoph Anton Mitterer; +Cc: linux-btrfs

On 06.01.2024 08:40, Qu Wenruo wrote:
> On 2024/1/6 11:12, Christoph Anton Mitterer wrote:
>> Also, a while ago in one of my mails I saw that:
>> $ cat affected-file > new-file
>> would also cause new-file to be affected... which I found pretty
>> strange.
>>
>> Any idea why that happens?
>
> [...]
>
> My guess is bash is doing something weird, thus making the whole cat +
> redirection into a reflink.

It is not bash (which just opens the file and dup's the descriptor). It
is cat from GNU coreutils, which defaults to using copy_file_range:

10393 openat(AT_FDCWD, "/mnt/file", O_RDONLY) = 3
10393 fstat(3, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
10393 fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
10393 uname({sysname="Linux", nodename="tw", ...}) = 0
10393 copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 4096
10393 copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 0

In case of btrfs this ends up in btrfs_remap_file_range(). So it is a
more or less self-inflicted wound. Maybe btrfs should not reflink
partial extents in such cases.

> But at least dd works as expected by creating a new extent.
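Following from the dd observation above, a minimal workaround sketch
for getting a real (small) copy instead of a partial-extent reflink,
using the paths from the reproducer (an assumption; adapt to the real
files):

# dd if=$mnt/file of=$mnt/new bs=1M
# sync
# compsize $mnt/new

dd does plain read()/write(), so the data lands in a fresh extent sized
to the actual content, and compsize should then report disk usage
matching the referenced bytes.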
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-06  0:42 ` Christoph Anton Mitterer
  2024-01-06  5:40 ` Qu Wenruo
@ 2024-12-14 19:09 ` Christoph Anton Mitterer
  1 sibling, 0 replies; 43+ messages in thread
From: Christoph Anton Mitterer @ 2024-12-14 19:09 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

Hey Qu, et al.

Is the issue behind this still being looked at? There used to be some
patches out there, but I think they were never merged.

I'm still hitting the problem quite massively (well, only where I use
Prometheus): on filesystems where the application does that
pre-allocation, a lot of space is wasted.

IIRC your patches tried to put that into defrag, but the downside with
that is IMO that you have to defrag - and didn't that break up
reflinked extents?

Thanks,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-18 16:24 ` Christoph Anton Mitterer
  2023-12-18 19:18 ` Goffredo Baroncelli
@ 2023-12-18 19:54 ` Qu Wenruo
  1 sibling, 0 replies; 43+ messages in thread
From: Qu Wenruo @ 2023-12-18 19:54 UTC (permalink / raw)
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/19 02:54, Christoph Anton Mitterer wrote:
> Hey again.
>
> Seems that even the manual defrag doesn't help at all:
>
> After:
> btrfs filesystem defragment -v -r -t 100000M
>
> there's still:
> # compsize .
> Processed 309 files, 324 regular extents (324 refs), 146 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%       22G          22G          13G
> none       100%       22G          22G          13G
>
> Any other ideas how this could be solved?

Snapshots or reflinks (remember, cp now goes reflink by default)?

Thanks,
Qu

> Cheers,
> Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-12  4:13 ` Qu Wenruo
  2023-12-15  2:33 ` Chris Murphy
  2023-12-18 16:24 ` Christoph Anton Mitterer
@ 2023-12-18 22:30 ` Christoph Anton Mitterer
  2 siblings, 0 replies; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-18 22:30 UTC (permalink / raw)
To: linux-btrfs

Hey.

I had already sent the mail below this afternoon, but just got a bounce:
  <linux-btrfs@vger.kernel.org>: lost connection with
  smtp.subspace.kernel.org[44.238.234.78] while receiving the initial
  server greeting

So here it is again... effectively it just says that autodefrag didn't
help either.

Cheers,
Chris.

On Tue, 2023-12-12 at 14:43 +1030, Qu Wenruo wrote:
> The direct cause is frequent fsync()/sync() with overwrites.
> Btrfs really relies on merging the writes between transactions; if
> fsync()/sync() is called too frequently (like some databases) and the
> program is doing overwrites, this is exactly what you would have.
>
> IIRC we can set AUTODEFRAG for a directory?

I have meanwhile tried with autodefrag for a few days, but that doesn't
cure the problem; not sure why it doesn't seem to kick in.

The way Prometheus writes, together with btrfs, causes extensive loss
of space:
compsize /data/main/prometheus/metrics2
Processed 305 files, 567 regular extents (586 refs), 146 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       21G          21G          13G
none       100%       21G          21G          13G

I'll try with a manual defrag now, but it's a bit unfortunate that this
happens without manual intervention. Or would it be better, or even
help at all, to balance?

And nodatacow isn't IMO a real alternative either, as one loses one of
the greatest btrfs benefits with it (checksumming).

Cheers,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-12  3:27 ` Christoph Anton Mitterer
  2023-12-12  3:40 ` Christoph Anton Mitterer
@ 2023-12-13  1:49 ` Remi Gauvin
  1 sibling, 0 replies; 43+ messages in thread
From: Remi Gauvin @ 2023-12-13 1:49 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs

On 2023-12-11 10:27 p.m., Christoph Anton Mitterer wrote:
>
> Well, the manpage warns against using it on large DB workloads... I
> mean, Prometheus is not exactly like a DB, and I would have naively
> assumed that at least the chunks were written not as many small random
> writes... but apparently they are.
>
> Also, this is a VM, so the storage volume is actually something Ceph
> backed, which the university's super computing centre provides us with.
>
> I wonder, if I do autodefrag on all that, if it doesn't just kill off
> our performance even more?

On SSD, I like to use the compress-force option and a daily defrag with
target size 128k to eliminate space wasted by fragmentation.
(Compressed extents have a max size of 128KB.)

However, compression will destroy sequential file read speed on
spinning disks (presumably due to the small extent size, possibly not
landing on disk in order; I'm not really sure why read speed is badly
affected when write speed is not.)

If there are no snapshots or reflink copies, you can use defrag with
target size 128MB (at whatever frequency you need) to eliminate wasted
space.
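A sketch of that setup (device, mount point, and schedule are
assumptions; adapt to the actual system):

# mount -o compress-force=zstd /dev/vdb /data/main
# btrfs filesystem defragment -r -czstd -t 128K /data/main

and, when no snapshots or reflinks need preserving:

# btrfs filesystem defragment -r -t 128M /data/main

Note that defragmenting rewrites data, so on a snapshotted filesystem
it can unshare extents and temporarily increase space usage.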
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-12  0:12 ` Christoph Anton Mitterer
  2023-12-12  0:58 ` Qu Wenruo
@ 2023-12-13  8:29 ` Andrea Gelmini
  1 sibling, 0 replies; 43+ messages in thread
From: Andrea Gelmini @ 2023-12-13 8:29 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: Qu Wenruo, linux-btrfs

On Tue, 12 Dec 2023 at 01:12, Christoph Anton Mitterer
<calestyo@scientia.org> wrote:
> Is there a way to check this? Would I just see maaany extents when I
> look at the files with filefrag?

I use this:
https://github.com/CyberShadow/btdu.git

And no, maaannyyyy extents don't imply wasted space.
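btdu samples extents and shows where the space actually goes, including
portions of extents that are allocated but no longer reachable from any
file. A usage sketch (mount point assumed from earlier in the thread;
it needs root):

# btdu /data/main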