* btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-11 20:26 UTC
To: linux-btrfs
Hey.
I think the following has now happened for the 2nd time. I have a
Debian stable system with kernel 6.1.55 running Prometheus.

There's one separate btrfs filesystem, just for the Prometheus time
series database.
# btrfs check /dev/vdb
Opening filesystem to check...
Checking filesystem on /dev/vdb
UUID: decdc81d-7cc4-431c-ab84-e03771f6de5d
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 42427637760 bytes used, no error found
total csum bytes: 27362284
total tree bytes: 32686080
total fs tree bytes: 1982464
total extent tree bytes: 360448
btree space waste bytes: 2839648
file data blocks allocated: 54877196288
referenced 28014796800
# mount /data/main/
# df | grep main
/dev/vdb btrfs 43G 43G 25k 100% /data/main
=> df thinks it's full
# btrfs filesystem usage /data/main/
Overall:
Device size: 40.00GiB
Device allocated: 40.00GiB
Device unallocated: 1.00MiB
Device missing: 0.00B
Device slack: 0.00B
Used: 39.54GiB
Free (estimated): 24.00KiB (min: 24.00KiB)
Free (statfs, df): 24.00KiB
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 29.22MiB (used: 0.00B)
Multiple profiles: no
Data,single: Size:39.48GiB, Used:39.48GiB (100.00%)
/dev/vdb 39.48GiB
Metadata,DUP: Size:256.00MiB, Used:31.16MiB (12.17%)
/dev/vdb 512.00MiB
System,DUP: Size:8.00MiB, Used:16.00KiB (0.20%)
/dev/vdb 16.00MiB
Unallocated:
/dev/vdb 1.00MiB
=> btrfs thinks so, too
# btrfs subvolume list -pagu /data/main/
ID 257 gen 2347947 parent 5 top level 5 uuid ae3fa7ff-f5a4-cf44-8555-ad579195036c path <FS_TREE>/data
=> no snapshots involved
# du --apparent-size --total -s --si /data/main/
29G /data/main/
29G total
=> but when actually counting the file sizes, there should be 11G left.
:/data/main/prometheus# dd if=/dev/zero of=foo bs=1M count=1
dd: error writing 'foo': No space left on device
1+0 records in
0+0 records out
0 bytes copied, 0,0876783 s, 0,0 kB/s
And it really is full.
Any ideas how this can happen?
Thanks,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-11 20:57 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 06:56, Christoph Anton Mitterer wrote:
> Hey.
>
> I think the following has now happened for the 2nd time. I have a
> Debian stable system with kernel 6.1.55 running Prometheus.
>
> There's one separate btrfs filesystem, just for the Prometheus time
> series database.
>
> # btrfs check /dev/vdb
> Opening filesystem to check...
> Checking filesystem on /dev/vdb
> UUID: decdc81d-7cc4-431c-ab84-e03771f6de5d
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space tree
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 42427637760 bytes used, no error found
> total csum bytes: 27362284
> total tree bytes: 32686080
> total fs tree bytes: 1982464
> total extent tree bytes: 360448
> btree space waste bytes: 2839648
> file data blocks allocated: 54877196288
> referenced 28014796800

That's pretty good.

> # mount /data/main/
>
> # df | grep main
> /dev/vdb btrfs 43G 43G 25k 100% /data/main
>
> => df thinks it's full
>
> # btrfs filesystem usage /data/main/
> Overall:
> Device size: 40.00GiB
> Device allocated: 40.00GiB
> Device unallocated: 1.00MiB

Already full from the perspective of chunk space. No new chunk can be
allocated.

> Device missing: 0.00B
> Device slack: 0.00B
> Used: 39.54GiB
> Free (estimated): 24.00KiB (min: 24.00KiB)
> Free (statfs, df): 24.00KiB
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve: 29.22MiB (used: 0.00B)
> Multiple profiles: no
>
> Data,single: Size:39.48GiB, Used:39.48GiB (100.00%)

Data chunks are already exhausted.

> /dev/vdb 39.48GiB
>
> Metadata,DUP: Size:256.00MiB, Used:31.16MiB (12.17%)

A single metadata chunk, which is not full.

> /dev/vdb 512.00MiB
>
> System,DUP: Size:8.00MiB, Used:16.00KiB (0.20%)
> /dev/vdb 16.00MiB
>
> Unallocated:
> /dev/vdb 1.00MiB
>
> => btrfs thinks so, too
>
> # btrfs subvolume list -pagu /data/main/
> ID 257 gen 2347947 parent 5 top level 5 uuid ae3fa7ff-f5a4-cf44-8555-ad579195036c path <FS_TREE>/data

Is your current mounted subvolume the fs tree? Or already the data
subvolume?

In the latter case, there are some files you cannot access from your
current mount point.

Thus it's recommended to use qgroup to show a correct full view of the
used space by each subvolume.

Thanks,
Qu

> => no snapshots involved
>
> # du --apparent-size --total -s --si /data/main/
> 29G /data/main/
> 29G total
>
> => but when actually counting the file sizes, there should be 11G left.
>
> :/data/main/prometheus# dd if=/dev/zero of=foo bs=1M count=1
> dd: error writing 'foo': No space left on device
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 0,0876783 s, 0,0 kB/s
>
> And it really is full.
>
> Any ideas how this can happen?
>
> Thanks,
> Chris.
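
For reference, a minimal way to get the per-subvolume view Qu suggests
(a sketch; quotas are not enabled on this filesystem yet, and enabling
them adds some runtime overhead, so they can be disabled again
afterwards):

# btrfs quota enable /data/main
# btrfs quota rescan -w /data/main
# btrfs qgroup show /data/main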

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-11 22:23 UTC
To: Qu Wenruo, linux-btrfs

Hey Qu.

On Tue, 2023-12-12 at 07:27 +1030, Qu Wenruo wrote:
> Is your current mounted subvolume the fs tree? Or already the data
> subvolume?

Well, actually both. I always have a "service" mountpoint of the root
volume as well, and had just unmounted that to not confuse things with
"double" mount entries.

In reality it looks like:
# mount | grep vdb
/dev/vdb on /data/main type btrfs (rw,noatime,space_cache=v2,subvolid=257,subvol=/data)
/dev/vdb on /data/btrfs-top-level-subvolumes/data-main type btrfs (rw,noatime,space_cache=v2,subvolid=5,subvol=/)

But all data (except for 2 empty dirs, where on other systems I would
place btrbk snapshots) is in the data subvolume:

data/btrfs-top-level-subvolumes/data-main# ls -al
total 16
drwxr-xr-x 1 root root 26 Feb 21  2023 .
drwxr-xr-x 1 root root 30 Nov  9 23:49 ..
drwxr-xr-x 1 root root 20 Feb 21  2023 data
drwx------ 1 root root 10 Feb 21  2023 snapshots
/data/btrfs-top-level-subvolumes/data-main# du --apparent-size --total -s --si *
29G   data
10    snapshots
29G   total

> In the latter case, there are some files you cannot access from your
> current mount point.

No, it's not that (that would have been quite embarrassing ^^).
/data/btrfs-top-level-subvolumes/data-main# tree -a snapshots/
snapshots/
└── btrbk

2 directories, 0 files

> Thus it's recommended to use qgroup to show a correct full view of the
> used space by each subvolume.

Uhm... qgroups? How would they help me here?

Cheers,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-11 22:26 UTC
To: Qu Wenruo, linux-btrfs

Oh, and btw: I did fully unmount all mountpoints of the fs in
question... so it cannot just be some process that still holds 11G in
deleted file(s).

Cheers,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-11 23:20 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 08:53, Christoph Anton Mitterer wrote:
[...]
>> Thus it's recommended to use qgroup to show a correct full view of the
>> used space by each subvolume.
>
> Uhm... qgroups? How would they help me here?

It shows exactly which subvolume uses how many bytes, including orphan
ones which are pending deletion.

Thanks,
Qu

> Cheers,
> Chris.
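
Pending (orphan) subvolume deletions can also be inspected and waited
for directly, without qgroups (a sketch):

# btrfs subvolume list -d /data/main
# btrfs subvolume sync /data/main

The first lists deleted subvolumes that are not yet cleaned up; the
second blocks until the queued deletions have actually been removed.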

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-11 23:38 UTC
To: Qu Wenruo, linux-btrfs

On Tue, 2023-12-12 at 09:50 +1030, Qu Wenruo wrote:
> It shows exactly which subvolume uses how many bytes, including orphan
> ones which are pending deletion.

Well... here we go:
# btrfs qgroup show .
Qgroupid    Referenced    Exclusive   Path
--------    ----------    ---------   ----
0/5           16.00KiB     16.00KiB   <toplevel>
0/257         39.48GiB     39.48GiB   data
0/258         16.00KiB     16.00KiB   <stale>
0/259         16.00KiB     16.00KiB   a
0/260         16.00KiB     16.00KiB   b
1/100         32.00KiB     32.00KiB   <0 member qgroups>
1/101            0.00B        0.00B   <0 member qgroups>

I've just created a and b to get qgroup (somehow? ^^) working.

Nevertheless:
I'm 100% sure that before, there were never any subvolumes on that fs
other than the toplevel and data, unless btrfs somehow creates/deletes
them automatically.

But the above output, AFAIU, still shows that "everything" is in data,
while counting the bytes of the files there still yields a much lower
number.

Any other ideas what I could test?

Thanks,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-11 23:54 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 10:08, Christoph Anton Mitterer wrote:
> On Tue, 2023-12-12 at 09:50 +1030, Qu Wenruo wrote:
>> It shows exactly which subvolume uses how many bytes, including orphan
>> ones which are pending deletion.
>
> Well... here we go:
> # btrfs qgroup show .
> Qgroupid    Referenced    Exclusive   Path
> --------    ----------    ---------   ----
> 0/5           16.00KiB     16.00KiB   <toplevel>
> 0/257         39.48GiB     39.48GiB   data
> 0/258         16.00KiB     16.00KiB   <stale>
> 0/259         16.00KiB     16.00KiB   a
> 0/260         16.00KiB     16.00KiB   b
> 1/100         32.00KiB     32.00KiB   <0 member qgroups>
> 1/101            0.00B        0.00B   <0 member qgroups>
>
> I've just created a and b to get qgroup (somehow? ^^) working.
>
> Nevertheless:
> I'm 100% sure that before, there were never any subvolumes on that fs
> other than the toplevel and data, unless btrfs somehow creates/deletes
> them automatically.
>
> But the above output, AFAIU, still shows that "everything" is in data,
> while counting the bytes of the files there still yields a much lower
> number.

OK, then everything looks fine.

> Any other ideas what I could test?

Then the last thing is extent bookends.

COW and small random writes can easily lead to extra space wasted by
extent bookends.

E.g. you write a 16M data extent, then over-write the trailing 8M; now
we have two data extents, the old 16M one and the new 8M one, wasting
8M of space.

In that case, you can try defrag, but you still need to delete some
data first so that you can do the defrag...

Thanks,
Qu

> Thanks,
> Chris.
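
Qu's example can be tried out directly on a scratch btrfs filesystem
(a sketch; /mnt/scratch is a hypothetical mount point, and the exact
numbers are what one would expect, not verified here):

# dd if=/dev/urandom of=/mnt/scratch/f bs=1M count=16 conv=fsync
# dd if=/dev/urandom of=/mnt/scratch/f bs=1M count=8 seek=8 conv=notrunc,fsync
# sync
# compsize -b /mnt/scratch/f

Disk usage should come out at around 24M (the old 16M extent plus the
new 8M one) for a file that only references 16M of data.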

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-12 0:12 UTC
To: Qu Wenruo, linux-btrfs

On Tue, 2023-12-12 at 10:24 +1030, Qu Wenruo wrote:
> Then the last thing is extent bookends.
>
> COW and small random writes can easily lead to extra space wasted by
> extent bookends.

Is there a way to check this? Would I just see maaany extents when I
look at the files with filefrag?

I mean, Prometheus continuously collects metrics from a number of nodes
and (sooner or later) writes them to disk.
I don't really know their code, so I have no idea whether they already
write every tiny metric, or only large bunches thereof.

Since they do maintain a WAL, I'd assume the former.

Every now and then, the WAL is written to chunk files, which are rather
large, well ~160M or so in my case, but that depends on how many
metrics one collects. I think they always write data for a period of
2h.
Later on, they further compact those chunks (I think after 8 hours and
so on), in which case some larger rewritings would be done.
Though in my case this doesn't happen, as I run Thanos on top of
Prometheus, and for that one needs to disable Prometheus' own
compaction.

I had already looked at the extents for these "compacted" chunk files,
but the worst file had only 32 extents (as reported by filefrag).

Looking at the WAL files:
/data/main/prometheus/metrics2/wal# filefrag * | grep -v ' 0 extents found'
00001030: 82 extents found
00001031: 81 extents found
00001032: 79 extents found
00001033: 82 extents found
00001034: 78 extents found
00001035: 78 extents found
00001036: 81 extents found
00001037: 79 extents found
00001038: 79 extents found
00001039: 89 extents found
00001040: 80 extents found
00001041: 74 extents found
00001042: 81 extents found
00001043: 97 extents found
00001044: 101 extents found
00001045: 316 extents found
checkpoint.00001029: FIBMAP/FIEMAP unsupported

(I did the grep -v, because there were a gazillion of empty wal files,
presumably created when the fs was already full.)

The above numbers still don't look too bad, do they?

And checking all:
# find /data/main/ -type f -execdir filefrag {} \; | cut -d : -f 2 | sort | uniq -c | sort -V
   3706  0 extents found
    450  1 extent found
     25  3 extents found
     62  2 extents found
      1  8 extents found
      1  9 extents found
      1  10 extents found
      1  11 extents found
      1  32 extents found
      1  74 extents found
      1  80 extents found
      1  89 extents found
      1  97 extents found
      1  101 extents found
      1  316 extents found
      2  78 extents found
      2  82 extents found
      3  5 extents found
      3  79 extents found
      3  81 extents found
      6  4 extents found

> E.g. you write a 16M data extent, then over-write the trailing 8M; now
> we have two data extents, the old 16M one and the new 8M one, wasting
> 8M of space.
>
> In that case, you can try defrag, but you still need to delete some
> data first so that you can do the defrag...

Well, my main concern is rather how to prevent this from happening in
the first place... the data is all backed up into Thanos already, so I
could also just wipe the fs.
But this seems to occur repeatedly (well, okay, only twice so far O:-) ).
So that would mean we have some IO pattern that "kills" btrfs.

Cheers,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-12 0:58 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 10:42, Christoph Anton Mitterer wrote:
> On Tue, 2023-12-12 at 10:24 +1030, Qu Wenruo wrote:
>> Then the last thing is extent bookends.
>>
>> COW and small random writes can easily lead to extra space wasted by
>> extent bookends.
>
> Is there a way to check this? Would I just see maaany extents when I
> look at the files with filefrag?
>
> I mean, Prometheus continuously collects metrics from a number of nodes
> and (sooner or later) writes them to disk.
> I don't really know their code, so I have no idea whether they already
> write every tiny metric, or only large bunches thereof.
>
> Since they do maintain a WAL, I'd assume the former.
>
> Every now and then, the WAL is written to chunk files, which are rather
> large, well ~160M or so in my case, but that depends on how many
> metrics one collects. I think they always write data for a period of
> 2h.
> Later on, they further compact those chunks (I think after 8 hours and
> so on), in which case some larger rewritings would be done.
> Though in my case this doesn't happen, as I run Thanos on top of
> Prometheus, and for that one needs to disable Prometheus' own
> compaction.
>
> I had already looked at the extents for these "compacted" chunk files,
> but the worst file had only 32 extents (as reported by filefrag).

Filefrag doesn't work that well on btrfs AFAIK, as btrfs is emitting
merged extents to the fiemap ioctl, but for fragmented ones, filefrag
should be enough to detect them.

> Looking at the WAL files:
> /data/main/prometheus/metrics2/wal# filefrag * | grep -v ' 0 extents found'
> 00001030: 82 extents found
> 00001031: 81 extents found
> 00001032: 79 extents found
> 00001033: 82 extents found
> 00001034: 78 extents found
> 00001035: 78 extents found
> 00001036: 81 extents found
> 00001037: 79 extents found
> 00001038: 79 extents found
> 00001039: 89 extents found
> 00001040: 80 extents found
> 00001041: 74 extents found
> 00001042: 81 extents found
> 00001043: 97 extents found
> 00001044: 101 extents found
> 00001045: 316 extents found
> checkpoint.00001029: FIBMAP/FIEMAP unsupported
>
> (I did the grep -v, because there were a gazillion of empty wal files,
> presumably created when the fs was already full.)
>
> The above numbers still don't look too bad, do they?

Depends; in my previous 16M case, you only got 2 extents, but still
wasted 8M (33.3% of the space).

But WAL indeed looks like a bad pattern for btrfs.

> And checking all:
> # find /data/main/ -type f -execdir filefrag {} \; | cut -d : -f 2 | sort | uniq -c | sort -V
>    3706  0 extents found
>     450  1 extent found
>      25  3 extents found
>      62  2 extents found
>       1  8 extents found
>       1  9 extents found
>       1  10 extents found
>       1  11 extents found
>       1  32 extents found
>       1  74 extents found
>       1  80 extents found
>       1  89 extents found
>       1  97 extents found
>       1  101 extents found
>       1  316 extents found
>       2  78 extents found
>       2  82 extents found
>       3  5 extents found
>       3  79 extents found
>       3  81 extents found
>       6  4 extents found
>
>> E.g. you write a 16M data extent, then over-write the trailing 8M; now
>> we have two data extents, the old 16M one and the new 8M one, wasting
>> 8M of space.
>>
>> In that case, you can try defrag, but you still need to delete some
>> data first so that you can do the defrag...
>
> Well, my main concern is rather how to prevent this from happening in
> the first place... the data is all backed up into Thanos already, so I
> could also just wipe the fs.
> But this seems to occur repeatedly (well, okay, only twice so far O:-) ).
> So that would mean we have some IO pattern that "kills" btrfs.

Thus we have the "autodefrag" mount option for such use cases.

Thanks,
Qu

> Cheers,
> Chris.
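
For reference, autodefrag is a mount option, so trying it here would
look something like this (a sketch; it acts on the whole filesystem,
not on individual directories, and mainly targets small random
overwrites of already-written files):

# mount -o remount,autodefrag /data/main

or the corresponding entry in the options field of /etc/fstab.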

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-12 2:30 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 11:28, Qu Wenruo wrote:
> On 2023/12/12 10:42, Christoph Anton Mitterer wrote:
[...]
>> Is there a way to check this? Would I just see maaany extents when I
>> look at the files with filefrag?

IIRC compsize can do it.
https://github.com/kilobyte/compsize

Thanks,
Qu

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-12 3:27 UTC
To: Qu Wenruo, linux-btrfs

Hey.

On Tue, 2023-12-12 at 13:00 +1030, Qu Wenruo wrote:
> IIRC compsize can do it.
> https://github.com/kilobyte/compsize

Okay... that seems promising:
/data/main/prometheus/metrics2# compsize 01H*
Processed 544 files, 399 regular extents (447 refs), 272 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       37G          37G          23G
none       100%       37G          37G          23G

01H* are the subdirs for the "semi-final" chunks.

So here's my stolen storage ;-)

I'm a bit puzzled how that can happen; I mean, for the chunks I'd have
naively assumed that they just write them more or less at once and in
sequence.

Interestingly, the WAL seems good, though:
/data/main/prometheus/metrics2# compsize wal/
Processed 3723 files, 1617 regular extents (1617 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      1.9G         1.9G         1.9G
none       100%      1.9G         1.9G         1.9G

One thing I've noted by chance and don't understand:
I assumed "Referenced" was the number of unique bytes actually
referenced (by someone). So when I run compsize on a single file,
Referenced should be the file size?

/data/main/prometheus/metrics2/wal# lll 00001030
251052 -rw-rw-r-- 1 106 106 ? 134217728 2023-12-10 04:51:58.665808973 +0100 00001030
/data/main/prometheus/metrics2/wal# compsize -b 00001030
Processed 1 file, 83 regular extents (83 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     134451200    134451200    134217728
none       100%     134451200    134451200    134217728

=> okay, here it is

/data/main/prometheus/metrics2/wal# lll 00001045
251947 -rw-rw-r-- 1 106 106 ? 33034564 2023-12-10 08:57:01.892017049 +0100 00001045
/data/main/prometheus/metrics2/wal# compsize -b 00001045
Processed 1 file, 316 regular extents (316 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     33116160     33116160     33038336
none       100%     33116160     33116160     33038336

=> here, Referenced is 3772 bytes larger than the actual file size?
How can that happen?

On Tue, 2023-12-12 at 11:28 +1030, Qu Wenruo wrote:
> But WAL indeed looks like a bad pattern for btrfs.
> Thus we have the "autodefrag" mount option for such use cases.

Well, the manpage warns against using it on large DB workloads... I
mean, Prometheus is not exactly like a DB, and I would have naively
assumed that at least the chunks were written not as many small random
writes... but apparently they are.

Also, this is a VM, so the storage volume is actually something Ceph
backed, which the university's supercomputing centre provides us with.
I wonder, if I do autodefrag on all that, whether it won't just kill
off our performance even more?

Thanks :-)
Chris.
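
The 3772-byte discrepancy is plausibly just block-size rounding:
extents are tracked in whole 4096-byte blocks, so compsize's
Referenced would be the file size rounded up to the next block
boundary (a quick check of the arithmetic):

# echo $(( (33034564 + 4095) / 4096 * 4096 ))
33038336
# echo $(( 33038336 - 33034564 ))
3772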

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-12 3:40 UTC
To: Qu Wenruo, linux-btrfs

I just noticed that others have already had that problem with
Prometheus:
https://github.com/prometheus/prometheus/issues/9107

Some users there wondered whether the issue could be caused by too
aggressive preallocation via `fallocate`.

Do you think that would be something that could also cause the wasted
space?

Thanks,
Chris.
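
One way to check whether preallocated-but-never-written space is still
attached to a file is verbose fiemap output, e.g. via filefrag on one
of the chunk files (a sketch; preallocated ranges should show up in
the flags column as "unwritten"):

# filefrag -v /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001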

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-12 4:13 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/12 14:10, Christoph Anton Mitterer wrote:
> I just noticed that others have already had that problem with
> Prometheus:
> https://github.com/prometheus/prometheus/issues/9107
>
> Some users there wondered whether the issue could be caused by too
> aggressive preallocation via `fallocate`.
>
> Do you think that would be something that could also cause the wasted
> space?

Well, preallocated inodes (any inodes with any preallocated extents
during their lifespan; it's a btrfs-specific flag that won't be cleared
until the inode is evicted) would only lead btrfs to try NOCOW first,
then fall back to COW if NOCOW failed (missing the compression path).

It's not a direct cause of the problem.

The direct cause is frequent fsync()/sync() with overwrites.

Btrfs really relies on merging writes between transactions; if
fsync()/sync() is called too frequently (like by some databases) and
the program is doing overwrites, this is exactly what you would get.

IIRC we can set AUTODEFRAG for a directory?

Thanks,
Qu

> Thanks,
> Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Chris Murphy @ 2023-12-15 2:33 UTC
To: Qu Wenruo, Christoph Mitterer, Btrfs BTRFS

On Mon, Dec 11, 2023, at 11:13 PM, Qu Wenruo wrote:
> IIRC we can set AUTODEFRAG for a directory?

How? Would be useful to isolate autodefrag for the bookends and small
database (web browser) use case, but not for the large busy database
use case.

--
Chris Murphy

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Qu Wenruo @ 2023-12-15 3:12 UTC
To: Chris Murphy, Christoph Mitterer, Btrfs BTRFS

On 2023/12/15 13:03, Chris Murphy wrote:
> On Mon, Dec 11, 2023, at 11:13 PM, Qu Wenruo wrote:
>> IIRC we can set AUTODEFRAG for a directory?
>
> How? Would be useful to isolate autodefrag for the bookends and small
> database (web browser) use case, but not for the large busy database
> use case.

I confused it with NODATACOW; that would help a lot, as long as the
subvolume doesn't get snapshotted.

Thanks,
Qu
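
For reference, NODATACOW can indeed be set per directory via the file
attribute flag, so that newly created files inherit it (a sketch; it
only affects files created after the flag is set, and it disables data
checksums and compression for those files):

# chattr +C /data/main/prometheus
# lsattr -d /data/main/prometheus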

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-18 16:24 UTC
To: Qu Wenruo, linux-btrfs

Hey again.

Seems that even the manual defrag doesn't help at all:

After:
btrfs filesystem defragment -v -r -t 100000M

there's still:
# compsize .
Processed 309 files, 324 regular extents (324 refs), 146 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       22G          22G          13G
none       100%       22G          22G          13G

Any other ideas how this could be solved?

Cheers,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Goffredo Baroncelli @ 2023-12-18 19:18 UTC
To: Christoph Anton Mitterer, Qu Wenruo, linux-btrfs

On 18/12/2023 17.24, Christoph Anton Mitterer wrote:
> Hey again.
>
> Seems that even the manual defrag doesn't help at all:
>
> After:
> btrfs filesystem defragment -v -r -t 100000M

Since there are only 309 files, I suggest finding one file as a test
case and starting to inspect what is happening.

> there's still:
> # compsize .
> Processed 309 files, 324 regular extents (324 refs), 146 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%       22G          22G          13G
> none       100%       22G          22G          13G
>
> Any other ideas how this could be solved?
>
> Cheers,
> Chris.

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Goffredo Baroncelli @ 2023-12-18 20:04 UTC
To: Christoph Anton Mitterer, Qu Wenruo, linux-btrfs

On 18/12/2023 20.18, Goffredo Baroncelli wrote:
> On 18/12/2023 17.24, Christoph Anton Mitterer wrote:
>> Hey again.
>>
>> Seems that even the manual defrag doesn't help at all:
>>
>> After:
>> btrfs filesystem defragment -v -r -t 100000M
>
> Since there are only 309 files, I suggest finding one file as a test
> case and starting to inspect what is happening.

I don't know if this helps; however, I tried to reproduce this
situation, and here is what I found:

$ python3 mktestfile.py
$ sudo /usr/sbin/compsize test.bin
Processed 1 file, 3 regular extents (3 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      3.0M         3.0M         2.0M
none       100%      3.0M         3.0M         2.0M
$ btrfs fi defra -v test.bin
test.bin
$ sudo /usr/sbin/compsize test.bin
Processed 1 file, 3 regular extents (3 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      3.0M         3.0M         2.0M   <------------- 3M
none       100%      3.0M         3.0M         2.0M
$ sync
$ sudo /usr/sbin/compsize test.bin
Processed 1 file, 2 regular extents (2 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      2.0M         2.0M         2.0M   <------------- 2M after a sync
none       100%      2.0M         2.0M         2.0M

So until a sync, the numbers are not updated.

#------------------------------------------
$ cat mktestfile.py
import os

# Write 3 x 1MiB; after each write, fsync and then seek back 0.5MiB,
# so that the next write half-overwrites the previous extent.
f = open("test.bin", "wb")
p = 0
s = 1024 * 1024
for i in range(3):
    f.write(b"x" * s)
    p += s
    f.flush()                # flush Python's userspace buffer...
    os.fsync(f.fileno())     # ...and force the data to disk
    p -= s // 2
    f.seek(p, 0)
f.close()
#------------------------------------------

>> there's still:
>> # compsize .
>> Processed 309 files, 324 regular extents (324 refs), 146 inline.
>> Type       Perc     Disk Usage   Uncompressed Referenced
>> TOTAL      100%       22G          22G          13G
>> none       100%       22G          22G          13G
>>
>> Any other ideas how this could be solved?
>>
>> Cheers,
>> Chris.

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-18 22:38 UTC
To: kreijack, Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 11465 bytes --]

Hey.

On Mon, 2023-12-18 at 20:18 +0100, Goffredo Baroncelli wrote:
> Since there are only 309 files, I suggest finding one file as a test
> case and starting to inspect what is happening.

I made a small wrapper around compsize:

#!/bin/sh

# For every filename read on stdin, print the file if its disk usage
# exceeds its referenced bytes by at least 0.5 MiB.
while IFS='' read -r file; do
	tmp="$(compsize -b "$file" | grep '^none' | sed -E 's/ +/ /g')"
	du="$(printf '%s\n' "$tmp" | cut -d ' ' -f 3)"
	ref="$(printf '%s\n' "$tmp" | cut -d ' ' -f 5)"
	delta="$(( $du - $ref ))"
	if [ "$delta" -ge 524288 ]; then
		printf '%s\t%s\n' "$delta" "$file"
	fi
done

called like:
# find /data/main -type f | ~/compsize-helper
252653568  /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
217198592  /data/main/prometheus/metrics2/01HHFFHGP4SDPEGGY4ME3GT0Q7/chunks/000001
107094016  /data/main/prometheus/metrics2/01HHFPD50EKCPVC2WZCP006WDV/chunks/000001
121311232  /data/main/prometheus/metrics2/01HHFX8YPK0D55GRM4V5TN46J3/chunks/000001
106512384  /data/main/prometheus/metrics2/01HHG44J1F5R8K3527V2VVT5VA/chunks/000001
102907904  /data/main/prometheus/metrics2/01HHGB0A32RWYKWZBDVMH7NG70/chunks/000001
105345024  /data/main/prometheus/metrics2/01HHGHW20CFYANMMNF1DGSMQE2/chunks/000001
105590784  /data/main/prometheus/metrics2/01HHGRQSR0M0F53JJBKZH52S7H/chunks/000001
106688512  /data/main/prometheus/metrics2/01HHGZKJEXEKW3594B74DR9NTH/chunks/000001
106213376  /data/main/prometheus/metrics2/01HHH6FA6YRWC6HAJGQJX9QATF/chunks/000001
106409984  /data/main/prometheus/metrics2/01HHHDB1EHS6V9NYKYAMJTSXBB/chunks/000001
225648640  /data/main/prometheus/metrics2/01HHHM6S6JYGAXCQV9XVMGMW83/chunks/000001
107253760  /data/main/prometheus/metrics2/01HHHV2DZYXNFZ5PRHB2M12E7Z/chunks/000001
106340352  /data/main/prometheus/metrics2/01HHJ1Y5QS4DS408MMCH97WC02/chunks/000001
105254912  /data/main/prometheus/metrics2/01HHJ8SXF3W7DKW5N0YACG3ZWT/chunks/000001
104161280  /data/main/prometheus/metrics2/01HHJFNKQZ3AEZYFXWEQ5RQ3HB/chunks/000001
104771584  /data/main/prometheus/metrics2/01HHJPHBFVMBRB27ECMNB0J3GG/chunks/000001
101986304  /data/main/prometheus/metrics2/01HHJXD37EPQXB5599MHWEESMZ/chunks/000001
106614784  /data/main/prometheus/metrics2/01HHK48TZ2KNFNWT1VTH5HE0HF/chunks/000001
231501824  /data/main/prometheus/metrics2/01HHKB4K6M286JNX0QAX6M5P7C/chunks/000001
107102208  /data/main/prometheus/metrics2/01HHKJ07GSVRGR3B3RW7EY2M03/chunks/000001
106823680  /data/main/prometheus/metrics2/01HHKRVZ85RXB1TZ0M1Q3453D2/chunks/000001
105611264  /data/main/prometheus/metrics2/01HHKZQPZTQV3DCVQCCVH7DXGE/chunks/000001
104497152  /data/main/prometheus/metrics2/01HHM6KB9WNF18ZNC7JE81Z3QW/chunks/000001
107216896  /data/main/prometheus/metrics2/01HHMDF5FRRAX8DCNGCGTNDR98/chunks/000001
105299968  /data/main/prometheus/metrics2/01HHMMAVRD2BTWKCZPYZ1WD6DT/chunks/000001
105328640  /data/main/prometheus/metrics2/01HHMV6KZS6F476HG6DT4ZM180/chunks/000001
223375360  /data/main/prometheus/metrics2/01HHN22CPP9FVDZG1MGET13J4X/chunks/000001
224215040  /data/main/prometheus/metrics2/01HHN8Y10XMFF9M2CRPQJWG94Y/chunks/000001
99090432   /data/main/prometheus/metrics2/01HHNFSS84Y6C6BBVY7E3QMAZS/chunks/000001
104562688  /data/main/prometheus/metrics2/01HHNPNH00AZHAV1RPB6SW8443/chunks/000001
107819008  /data/main/prometheus/metrics2/01HHNXH8QE32FQEMA7DDVZK8M8/chunks/000001
103890944  /data/main/prometheus/metrics2/01HHP4D1EBSX48Y3CY66VV703Y/chunks/000001
105041920  /data/main/prometheus/metrics2/01HHPB8P8AN1W5SH6ZTC90R48Q/chunks/000001
107278336  /data/main/prometheus/metrics2/01HHPJ4EZG9TB32P8W4MDMQSZG/chunks/000001
106553344  /data/main/prometheus/metrics2/01HHPS02SFZEC187ZJJDVSD1Z0/chunks/000001
156815360  /data/main/prometheus/metrics2/01HHPZVV0P5QCGG8QA89C36RVW/chunks/000001
153927680  /data/main/prometheus/metrics2/01HHQ6QJRFY0WF1K4QHM74A12S/chunks/000001
125075456  /data/main/prometheus/metrics2/01HHQDKBFRFFQ7J2XJ836CRHW3/chunks/000001
209260544  /data/main/prometheus/metrics2/01HHQMF375KTC4BWMF9RBCM6RS/chunks/000001
106807296  /data/main/prometheus/metrics2/01HHQVAVESRRXJ5KQ9FN8AQ0XA/chunks/000001
105955328  /data/main/prometheus/metrics2/01HHR26E9C5B2RYFCQPG7G7M4H/chunks/000001
105402368  /data/main/prometheus/metrics2/01HHR92618K1C6QCR1JP3NKRR5/chunks/000001
107376640  /data/main/prometheus/metrics2/01HHRFXY8DJPBQ924RJJWZB3BY/chunks/000001
107413504  /data/main/prometheus/metrics2/01HHRPSP069YJXRDHF3P8CPM5S/chunks/000001
105881600  /data/main/prometheus/metrics2/01HHRXNE7JFSYFQ3QF63WDZ0AB/chunks/000001
217497600  /data/main/prometheus/metrics2/01HHS4H21ZSZVA10JVDNHNQ7Q8/chunks/000001
108908544  /data/main/prometheus/metrics2/01HHSBCT94CC9AW1N9JQA6VF0K/chunks/000001
109170688  /data/main/prometheus/metrics2/01HHSJ8J18HQTQ4V45AZTSN336/chunks/000001
109060096  /data/main/prometheus/metrics2/01HHSS4AQMF186EXPQKT77KG8F/chunks/000001
108732416  /data/main/prometheus/metrics2/01HHSZZWKYJG4AQ8S6X3A0V1RF/chunks/000001
159817728  /data/main/prometheus/metrics2/01HHT6VMTVPDT2RMR76P642HZ9/chunks/000001
109006848  /data/main/prometheus/metrics2/01HHTDQD2H7VJRXBC5BA7K4018/chunks/000001
108662784  /data/main/prometheus/metrics2/01HHTMK68T7QSGE4MHFWBJHM1W/chunks/000001
201793536  /data/main/prometheus/metrics2/01HHTVEYZYPDXDZSGDM91S9QFW/chunks/000001
109015040  /data/main/prometheus/metrics2/01HHV2AQ7DEHX5BNSXX2VQC2TW/chunks/000001
109756416  /data/main/prometheus/metrics2/01HHV96BHBAP1TZY54AQ8BA0YB/chunks/000001
103170048  /data/main/prometheus/metrics2/01HHVG238JVD74GD07F2W87Y9Y/chunks/000001
210407424  /data/main/prometheus/metrics2/01HHVPXVG00GAK4E3YY6ADGDG9/chunks/000001
108261376  /data/main/prometheus/metrics2/01HHVXSMQ5XWWKXAK7SRR61AK8/chunks/000001
108457984  /data/main/prometheus/metrics2/01HHW4N72CMZ90VTVACG7QD87M/chunks/000001
109596672  /data/main/prometheus/metrics2/01HHWBH08ZYZRNEC38VXK9Y946/chunks/000001
194060288  /data/main/prometheus/metrics2/01HHWJCRG7WC5NHB684RXCNKMV/chunks/000001
110596096  /data/main/prometheus/metrics2/01HHWS8GQGHNWH4Z4WETJN454Q/chunks/000001
110592000  /data/main/prometheus/metrics2/01HHX048F9110B1CMK300XCE99/chunks/000001
107954176  /data/main/prometheus/metrics2/01HHX6ZTAZEQ69MGKQDSZ7TG96/chunks/000001
157130752  /data/main/prometheus/metrics2/01HHXDVJJG0YMR3T2ZSS7TMGPA/chunks/000001
122175488  /data/main/prometheus/metrics2/01HHXMQE7152WG1HARM51JPKFT/chunks/000001
208908288  /data/main/prometheus/metrics2/01HHXVK5YTHRVQXEHXWR1K270F/chunks/000001
140976128  /data/main/prometheus/metrics2/01HHY2EXPC19MBPRGM05X3ZB75/chunks/000001
140902400  /data/main/prometheus/metrics2/01HHY9AKFSADNWJVW4XN06QTZY/chunks/000001
250085376  /data/main/prometheus/metrics2/01HHYG6AQEAYV7WBSWJSSMHR02/chunks/000001
114298880  /data/main/prometheus/metrics2/01HHYQ22YNG446ARA9KDTJH9RP/chunks/000001
108421120  /data/main/prometheus/metrics2/01HHYXXT096QKW02KAAXH791SG/chunks/000001
109977600  /data/main/prometheus/metrics2/01HHZ4SEH5S13B9EYV7S4NV4EZ/chunks/000001
99586048   /data/main/prometheus/metrics2/01HHZBN68RG5YTA2EP50TRC6T4/chunks/000001

To answer Qu's question from his reply:
No snapshots are involved, and unless Prometheus itself makes reflink
copies, no such copies should be involved either.
I do run the Thanos sidecar on top of Prometheus, which indeed makes
copies of the files before uploading them to some remote storage, but
it also deletes them afterwards. And looking at /proc/<thanos>/fd, it
doesn't keep them open as deleted.
No compression should be used (again, unless Prometheus would manually
set chattr +c or so), and no btrfs RAID. Nothing except a plain btrfs.
I used to have quotas enabled when Qu asked me to last week, but
disabled them afterwards.

I've also attached the output of:
# find /data/main -type f -exec sh -c 'echo "$1"; compsize "$1"' '' {} \; > compsize.log
in case it helps anyone.

The above shows IMO that most data is lost in the chunk files (these
000001 files are the beef of the data).

Looking at the reverse (files where the delta is less than 0.5 MiB):
233472     /data/main/prometheus/metrics2/wal/00000522
155648     /data/main/prometheus/metrics2/wal/00000523
184320     /data/main/prometheus/metrics2/wal/00000524
200704     /data/main/prometheus/metrics2/wal/00000525
126976     /data/main/prometheus/metrics2/wal/00000526
192512     /data/main/prometheus/metrics2/wal/00000527
200704     /data/main/prometheus/metrics2/wal/00000528
139264     /data/main/prometheus/metrics2/wal/00000529
200704     /data/main/prometheus/metrics2/wal/00000530
172032     /data/main/prometheus/metrics2/wal/00000531
57344      /data/main/prometheus/metrics2/wal/00000532
20480      /data/main/prometheus/metrics2/chunks_head/000151
12288      /data/main/prometheus/metrics2/chunks_head/000152
28672      /data/main/prometheus/metrics2/chunks_head/000153
45056      /data/main/prometheus/metrics2/01HHFFHGP4SDPEGGY4ME3GT0Q7/index
229376     /data/main/prometheus/metrics2/01HHFPD50EKCPVC2WZCP006WDV/index
40960      /data/main/prometheus/metrics2/01HHHM6S6JYGAXCQV9XVMGMW83/index
40960      /data/main/prometheus/metrics2/01HHHV2DZYXNFZ5PRHB2M12E7Z/index
40960      /data/main/prometheus/metrics2/01HHJ1Y5QS4DS408MMCH97WC02/index
40960      /data/main/prometheus/metrics2/01HHJ8SXF3W7DKW5N0YACG3ZWT/index
40960      /data/main/prometheus/metrics2/01HHJPHBFVMBRB27ECMNB0J3GG/index
40960      /data/main/prometheus/metrics2/01HHJXD37EPQXB5599MHWEESMZ/index
40960      /data/main/prometheus/metrics2/01HHKB4K6M286JNX0QAX6M5P7C/index
40960      /data/main/prometheus/metrics2/01HHMV6KZS6F476HG6DT4ZM180/index
40960      /data/main/prometheus/metrics2/01HHN22CPP9FVDZG1MGET13J4X/index
40960      /data/main/prometheus/metrics2/01HHN8Y10XMFF9M2CRPQJWG94Y/index
40960      /data/main/prometheus/metrics2/01HHNPNH00AZHAV1RPB6SW8443/index
40960      /data/main/prometheus/metrics2/01HHPZVV0P5QCGG8QA89C36RVW/index
40960      /data/main/prometheus/metrics2/01HHQ6QJRFY0WF1K4QHM74A12S/index
40960      /data/main/prometheus/metrics2/01HHQDKBFRFFQ7J2XJ836CRHW3/index
40960      /data/main/prometheus/metrics2/01HHQMF375KTC4BWMF9RBCM6RS/index
40960      /data/main/prometheus/metrics2/01HHR92618K1C6QCR1JP3NKRR5/index
40960      /data/main/prometheus/metrics2/01HHRFXY8DJPBQ924RJJWZB3BY/index
40960      /data/main/prometheus/metrics2/01HHS4H21ZSZVA10JVDNHNQ7Q8/index
40960      /data/main/prometheus/metrics2/01HHSBCT94CC9AW1N9JQA6VF0K/index
40960      /data/main/prometheus/metrics2/01HHT6VMTVPDT2RMR76P642HZ9/index
40960      /data/main/prometheus/metrics2/01HHTVEYZYPDXDZSGDM91S9QFW/index
40960      /data/main/prometheus/metrics2/01HHVPXVG00GAK4E3YY6ADGDG9/index
40960      /data/main/prometheus/metrics2/01HHWJCRG7WC5NHB684RXCNKMV/index
40960      /data/main/prometheus/metrics2/01HHX048F9110B1CMK300XCE99/index
40960      /data/main/prometheus/metrics2/01HHXDVJJG0YMR3T2ZSS7TMGPA/index
86016      /data/main/prometheus/metrics2/01HHXVK5YTHRVQXEHXWR1K270F/index
40960      /data/main/prometheus/metrics2/01HHYG6AQEAYV7WBSWJSSMHR02/index
4096       /data/main/prometheus/metrics2/01HHZBN68RG5YTA2EP50TRC6T4/index

So the WAL is not that much of a problem.

As for Goffredo's idea about syncing, that doesn't seem to change
things either:
# btrfs filesystem sync /data/main
# compsize /data/main
Processed 321 files, 793 regular extents (794 refs), 152 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       23G          23G          14G
none       100%       23G          23G          14G

And this is anyway running long-term... so I'd assume that sooner or
later btrfs syncs its stuff?

Cheers,
Chris.

[-- Attachment #2: compsize.log.xz --]
[-- Type: application/x-xz, Size: 2984 bytes --]

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Andrei Borzenkov @ 2023-12-19 8:22 UTC
To: Christoph Anton Mitterer
Cc: kreijack, Qu Wenruo, linux-btrfs

On Tue, Dec 19, 2023 at 4:00 AM Christoph Anton Mitterer
<calestyo@scientia.org> wrote:
>
> Hey.
>
> On Mon, 2023-12-18 at 20:18 +0100, Goffredo Baroncelli wrote:
> > Since there are only 309 files, I suggest finding one file as a test
> > case and starting to inspect what is happening.
> ...
> I've also attached the output of:
> # find /data/main -type f -exec sh -c 'echo "$1"; compsize "$1"' '' {} \; > compsize.log
> in case it helps anyone.

/data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      256M         256M          15M
none       100%      256M         256M          15M

I would try to find out whether this single extent is shared, and
where the data is located inside this extent. Could it be that the
file was truncated or a hole was punched in it?
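
Whether anything else still references that extent can be checked
directly with btrfs filesystem du, which splits a file's usage into
exclusive and shared bytes (a sketch):

# btrfs filesystem du /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001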

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Goffredo Baroncelli @ 2023-12-19 19:09 UTC
To: Andrei Borzenkov, Christoph Anton Mitterer
Cc: Qu Wenruo, linux-btrfs

On 19/12/2023 09.22, Andrei Borzenkov wrote:
> On Tue, Dec 19, 2023 at 4:00 AM Christoph Anton Mitterer
> <calestyo@scientia.org> wrote:
>
> /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> Processed 1 file, 1 regular extents (1 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      256M         256M          15M
> none       100%      256M         256M          15M
>
> I would try to find out whether this single extent is shared, and
> where the data is located inside this extent. Could it be that the
> file was truncated or a hole was punched in it?

Ok, now we have the case study.
To be sure, could you try a defrag (+ sync) of this single file?
Then, what is the lsof output?

Does anyone know a way to extract the "owners" of an extent? I think
that we should go through the backrefs, but I never did. I don't want
to re-invent the wheel, so I am asking if someone knows a tool that
can help to find the owners of an extent.

BR

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
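
btrfs-progs ships something close to this: inspect-internal
logical-resolve walks the backrefs from a logical address to the
files referencing it (a sketch; the logical address can be taken from
dump-tree or fiemap output):

# btrfs inspect-internal logical-resolve -v <logical address> /data/main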

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-21 13:53 UTC
To: kreijack, Andrei Borzenkov
Cc: Qu Wenruo, linux-btrfs

Hey Goffredo.

On Tue, 2023-12-19 at 20:09 +0100, Goffredo Baroncelli wrote:
> Ok, now we have the case study.
> To be sure, could you try a defrag (+ sync) of this single file?

# btrfs filesystem defragment /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
# btrfs filesystem defragment -t 1000M /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
# sync
# btrfs filesystem sync /data/main/
# compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      256M         256M          15M
none       100%      256M         256M          15M
#

> Then, what is the lsof output?

# lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
COMMAND       PID       USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
prometheu 2327412 prometheus 12r   REG   0,43 15781418  642 /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
#

I also stopped prometheus, synced, and checked then:
# systemctl stop prometheus.service
# lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
# btrfs filesystem sync /data/main/
# compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      256M         256M          15M
none       100%      256M         256M          15M

> Does anyone know a way to extract the "owners" of an extent? I think
> that we should go through the backrefs, but I never did. I don't want
> to re-invent the wheel, so I am asking if someone knows a tool that
> can help to find the owners of an extent.

Not me ;-) ... Does it help if I'd provide something like dump-tree
data?

The fs is soon to be full again, so I'll likely have to delete some of
the (test) data...

Thanks,
Chris.

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Goffredo Baroncelli @ 2023-12-21 18:03 UTC
To: Christoph Anton Mitterer, Andrei Borzenkov
Cc: Qu Wenruo, linux-btrfs

On 21/12/2023 14.53, Christoph Anton Mitterer wrote:
> Hey Goffredo.
>
> On Tue, 2023-12-19 at 20:09 +0100, Goffredo Baroncelli wrote:
>> Ok, now we have the case study.
>> To be sure, could you try a defrag (+ sync) of this single file?
>
> # btrfs filesystem defragment /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> # btrfs filesystem defragment -t 1000M /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> # sync
> # btrfs filesystem sync /data/main/
> # compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> Processed 1 file, 1 regular extents (1 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      256M         256M          15M
> none       100%      256M         256M          15M
> #
>
>> Then, what is the lsof output?
>
> # lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> COMMAND       PID       USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
> prometheu 2327412 prometheus 12r   REG   0,43 15781418  642 /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> #
>
> I also stopped prometheus, synced, and checked then:
> # systemctl stop prometheus.service
> # lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001

Here you should do a defrag, after stopping prometheus.

> # btrfs filesystem sync /data/main/
> # compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> Processed 1 file, 1 regular extents (1 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      256M         256M          15M
> none       100%      256M         256M          15M
>
>> Does anyone know a way to extract the "owners" of an extent? I think
>> that we should go through the backrefs, but I never did. I don't want
>> to re-invent the wheel, so I am asking if someone knows a tool that
>> can help to find the owners of an extent.
>
> Not me ;-) ... Does it help if I'd provide something like dump-tree
> data?

I am trying to write a tool that walks the backrefs to find the owners.
I hope to have a prototype to test by tomorrow.

> The fs is soon to be full again, so I'll likely have to delete some of
> the (test) data...
>
> Thanks,
> Chris.

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

* Re: btrfs thinks fs is full, though 11GB should be still free
From: Christoph Anton Mitterer @ 2023-12-21 22:06 UTC
To: kreijack, Andrei Borzenkov
Cc: Qu Wenruo, linux-btrfs

On Thu, 2023-12-21 at 19:03 +0100, Goffredo Baroncelli wrote:
>> # lsof -- /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
>
> Here you should do a defrag, after stopping prometheus.

No difference. Even after syncing, and even after unmounting/mounting.

btw: I did that:
/data/main# compsize /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      256M         256M          15M
none       100%      256M         256M          15M
/data/main# cat /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001 > foo
/data/main# compsize -b foo
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     268435456    268435456    15781888
none       100%     268435456    268435456    15781888
/data/main# ls -al foo
-rw-r--r-- 1 root root 15781418 Dec 21 23:02 foo

=> Wouldn't have expected that, and not only the discrepancy between
Referenced and ls: even the freshly cat'ed file has that space waste.
There should be no holes, or any other monkey business, involved?

# du --apparent-size --total -s --block-size=1 /data/main/
22112706625  /data/main/
22112706625  total
# compsize -b /data/main/
Processed 463 files, 865 regular extents (874 refs), 224 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     35752870045  35752870045  22097375389
none       100%     35752870045  35752870045  22097375389

> I am trying to write a tool that walks the backrefs to find the owners.
> I hope to have a prototype to test by tomorrow.

Thanks!

Cheers,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-19  8:22 ` Andrei Borzenkov
  2023-12-19 19:09 ` Goffredo Baroncelli
@ 2023-12-21 13:46 ` Christoph Anton Mitterer
  2023-12-21 20:41 ` Qu Wenruo
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-21 13:46 UTC (permalink / raw)
To: Andrei Borzenkov; +Cc: kreijack, Qu Wenruo, linux-btrfs

On Tue, 2023-12-19 at 11:22 +0300, Andrei Borzenkov wrote:
> /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
> Processed 1 file, 1 regular extents (1 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%      256M         256M          15M
> none       100%      256M         256M          15M
>
> I would try to find out whether this single extent is shared, where
> the data is located inside this extent. Could it be that file was
> truncated or the hole was punched in it?

How would I do that? :-)

Thanks,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 13:46 ` Christoph Anton Mitterer
@ 2023-12-21 20:41 ` Qu Wenruo
  2023-12-21 22:15 ` Christoph Anton Mitterer
  0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2023-12-21 20:41 UTC (permalink / raw)
To: Christoph Anton Mitterer, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On 2023/12/22 00:16, Christoph Anton Mitterer wrote:
> On Tue, 2023-12-19 at 11:22 +0300, Andrei Borzenkov wrote:
>> /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
>> Processed 1 file, 1 regular extents (1 refs), 0 inline.
>> Type       Perc     Disk Usage   Uncompressed Referenced
>> TOTAL      100%      256M         256M          15M
>> none       100%      256M         256M          15M
>>
>> I would try to find out whether this single extent is shared, where
>> the data is located inside this extent. Could it be that file was
>> truncated or the hole was punched in it?
>
> How would I do that? :-)

Grab the INODE number of that file (`stat` is good enough).

Know the subvolume id.

Then `btrfs ins dump-tree -t <subvolid> <device> | grep -A7 "key (256 "`

I guess it's time to add a way to dump all the items of a single inode
for dump-tree now.

Thanks,
Qu

> Thanks,
> Chris.
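A sketch of those two lookups with concrete commands (paths assumed
from earlier in the thread): stat -c %i prints just the inode number,
and btrfs inspect-internal rootid prints the id of the subvolume
containing a path:

# stat -c %i /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
# btrfs inspect-internal rootid /data/main/prometheus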
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 20:41 ` Qu Wenruo
@ 2023-12-21 22:15 ` Christoph Anton Mitterer
  2023-12-21 22:41 ` Qu Wenruo
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-21 22:15 UTC (permalink / raw)
To: Qu Wenruo, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Fri, 2023-12-22 at 07:11 +1030, Qu Wenruo wrote:
> Grab the INODE number of that file (`stat` is good enough).

# stat /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
  File: /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
  Size: 15781418   Blocks: 30824   IO Block: 4096   regular file
Device: 0,43   Inode: 642   Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 106/prometheus)   Gid: ( 106/prometheus)
Access: 2023-12-12 17:50:26.968485936 +0100
Modify: 2023-12-12 17:50:28.748495544 +0100
Change: 2023-12-12 17:50:57.280649521 +0100
 Birth: 2023-12-12 17:50:26.968485936 +0100

> Know the subvolume id.

# btrfs subvolume list -pagu /data/main/
ID 257 gen 2371697 parent 5 top level 5 uuid ae3fa7ff-f5a4-cf44-8555-ad579195036c path <FS_TREE>/data

> Then `btrfs ins dump-tree -t <subvolid> <device> | grep -A7 "key (256 "`

I assume 256 should be the inode number?
If so:
# btrfs ins dump-tree -t 257 /dev/vdb | grep -A7 "key (642 "
                location key (642 INODE_ITEM 0) type FILE
                transid 2348290 data_len 0 name_len 6
                name: 000001
        item 128 key (638 DIR_INDEX 3) itemoff 9441 itemsize 36
                location key (642 INODE_ITEM 0) type FILE
                transid 2348290 data_len 0 name_len 6
                name: 000001
        item 129 key (639 INODE_ITEM 0) itemoff 9281 itemsize 160
                generation 2348289 transid 2348290 size 17788225 nbytes 17788928
                block group 0 mode 100664 links 1 uid 106 gid 106 rdev 0
                sequence 408 flags 0x0(none)
                atime 1702399826.500483413 (2023-12-12 17:50:26)
--
        item 132 key (642 INODE_ITEM 0) itemoff 9053 itemsize 160
                generation 2348289 transid 2348290 size 15781418 nbytes 15781888
                block group 0 mode 100664 links 1 uid 106 gid 106 rdev 0
                sequence 3362 flags 0x10(PREALLOC)
                atime 1702399826.968485936 (2023-12-12 17:50:26)
                ctime 1702399857.280649521 (2023-12-12 17:50:57)
                mtime 1702399828.748495544 (2023-12-12 17:50:28)
                otime 1702399826.968485936 (2023-12-12 17:50:26)
        item 133 key (642 INODE_REF 638) itemoff 9037 itemsize 16
                index 3 namelen 6 name: 000001
        item 134 key (642 EXTENT_DATA 0) itemoff 8984 itemsize 53
                generation 2348290 type 1 (regular)
                extent data disk byte 9500291072 nr 268435456
                extent data offset 0 nr 15781888 ram 268435456
                extent compression 0 (none)
        item 135 key (643 INODE_ITEM 0) itemoff 8824 itemsize 160
                generation 2348290 transid 2363471 size 283 nbytes 283
                block group 0 mode 100664 links 1 uid 106 gid 106 rdev 0

If you need the whole output of btrfs ins dump-tree -t 257 /dev/vdb,
it's only 72k compressed, and AFAIU shouldn't contain any private data
(well, nothing on the whole fs is private ^^).

Cheers,
Chris
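Item 134 already quantifies the waste: the on-disk extent is 268435456
bytes (256 MiB), while the file references only 15781888 bytes (about
15 MiB) of it at offset 0. That leaves 268435456 - 15781888 = 252653568
bytes, roughly 241 MiB, allocated but unreachable for this one file,
matching the compsize numbers earlier in the thread.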
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 22:15 ` Christoph Anton Mitterer
@ 2023-12-21 22:41 ` Qu Wenruo
  2023-12-21 22:54 ` Christoph Anton Mitterer
  0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2023-12-21 22:41 UTC (permalink / raw)
To: Christoph Anton Mitterer, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On 2023/12/22 08:45, Christoph Anton Mitterer wrote:
> On Fri, 2023-12-22 at 09:11 +1030, Qu Wenruo wrote:
>> Grab the INODE number of that file (`stat` is good enough).
> # stat /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
>   File: /data/main/prometheus/metrics2/01HHFEZPJ8TPFVYTXV11R7ZH4X/chunks/000001
>   Size: 15781418   Blocks: 30824   IO Block: 4096   regular file
> Device: 0,43   Inode: 642   Links: 1

642 is your inode number.

> [...]
>
> If you need the whole output of btrfs ins dump-tree -t 257 /dev/vdb,
> it's only 72k compressed, and AFAIU shouldn't contain any private data
> (well, nothing on the whole fs is private ^^).

The whole one is easier for me to check. But I still strongly recommend
using "--hide-names", just in case.

Meanwhile you may want to upload the extent tree too (which can be
pretty large though), as my final step would be to cross-check against
the extent tree to be sure.

Thanks,
Qu

> Cheers,
> Chris
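For the upload, a sketch of the anonymized dumps (device name assumed
from earlier in the thread; --hide-names replaces file names in the
dump-tree output):

# btrfs inspect-internal dump-tree --hide-names -t 257 /dev/vdb > subvol.txt
# btrfs inspect-internal dump-tree --hide-names -e /dev/vdb > extent.txt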
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 22:41 ` Qu Wenruo
@ 2023-12-21 22:54 ` Christoph Anton Mitterer
  2023-12-22  0:53 ` Qu Wenruo
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-21 22:54 UTC (permalink / raw)
To: Qu Wenruo, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Fri, 2023-12-22 at 09:11 +1030, Qu Wenruo wrote:
> The whole one is easier for me to check.

https://drive.google.com/file/d/1-TJKoL85e23u5mJN0Nuoa9qyvgLRssFM/view?usp=sharing

Should contain both (-t 257 and -e).

> But I still strongly recommend using "--hide-names", just in case.

Checked it, and it's really only the prometheus filenames, all of which
are completely non-sensitive... and it makes life easier if we can see
the names.

> Meanwhile you may want to upload the extent tree too (which can be
> pretty large though), as my final step would be to cross-check against
> the extent tree to be sure.

Here we go, thanks in advance.

Cheers,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-21 22:54 ` Christoph Anton Mitterer
@ 2023-12-22  0:53 ` Qu Wenruo
  2023-12-22  0:56 ` Christoph Anton Mitterer
  0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2023-12-22 0:53 UTC (permalink / raw)
To: Christoph Anton Mitterer, Qu Wenruo, Andrei Borzenkov
Cc: kreijack, linux-btrfs

On 2023/12/22 09:24, Christoph Anton Mitterer wrote:
> On Fri, 2023-12-22 at 09:11 +1030, Qu Wenruo wrote:
>> The whole one is easier for me to check.
>
> https://drive.google.com/file/d/1-TJKoL85e23u5mJN0Nuoa9qyvgLRssFM/view?usp=sharing
>
> Should contain both (-t 257 and -e).

The situation is in fact simpler than I thought.

The original extent is 256M, and I believe the whole 256M was
preallocated but later truncated to the current size, which is only
15+M. That explains the size problem.

But the question remains why defrag doesn't work.
I'll look into this during the holiday season, but I strongly believe
it's the PREALLOC inode flag.
We should take it seriously now.

Thanks,
Qu

>> But I still strongly recommend using "--hide-names", just in case.
>
> Checked it, and it's really only the prometheus filenames, all of which
> are completely non-sensitive... and it makes life easier if we can see
> the names.
>
>> Meanwhile you may want to upload the extent tree too (which can be
>> pretty large though), as my final step would be to cross-check against
>> the extent tree to be sure.
>
> Here we go, thanks in advance.
>
> Cheers,
> Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-22  0:53 ` Qu Wenruo
@ 2023-12-22  0:56 ` Christoph Anton Mitterer
  2023-12-22  1:13 ` Qu Wenruo
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-22 0:56 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Fri, 2023-12-22 at 11:23 +1030, Qu Wenruo wrote:
> But the question remains why defrag doesn't work.
> I'll look into this during the holiday season, but I strongly believe
> it's the PREALLOC inode flag.
> We should take it seriously now.

Oh, and keep in mind that - as I hopefully mentioned in the beginning -
this is all on 6.1.55 (Debian stable).

Thanks,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-22  0:56 ` Christoph Anton Mitterer
@ 2023-12-22  1:13 ` Qu Wenruo
  2023-12-22  1:23 ` Christoph Anton Mitterer
  2024-01-05  3:30 ` Christoph Anton Mitterer
  0 siblings, 2 replies; 43+ messages in thread
From: Qu Wenruo @ 2023-12-22 1:13 UTC (permalink / raw)
To: Christoph Anton Mitterer, Qu Wenruo, Andrei Borzenkov
Cc: kreijack, linux-btrfs

On 2023/12/22 11:26, Christoph Anton Mitterer wrote:
> On Fri, 2023-12-22 at 11:23 +1030, Qu Wenruo wrote:
>> But the question remains why defrag doesn't work.
>> I'll look into this during the holiday season, but I strongly believe
>> it's the PREALLOC inode flag.
>> We should take it seriously now.
>
> Oh, and keep in mind that - as I hopefully mentioned in the beginning -
> this is all on 6.1.55 (Debian stable).

That's not a big deal, because before sending that reply I had already
reproduced the problem using a 6.6 kernel, so it's a long-existing
problem.
Just like the whole PREALLOC and compression problem.

Thanks,
Qu

> Thanks,
> Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-22  1:13 ` Qu Wenruo
@ 2023-12-22  1:23 ` Christoph Anton Mitterer
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-22 1:23 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Fri, 2023-12-22 at 11:43 +1030, Qu Wenruo wrote:
> That's not a big deal, because before sending that reply I had already
> reproduced the problem using a 6.6 kernel, so it's a long-existing
> problem.
> Just like the whole PREALLOC and compression problem.

Ah, I see. I just wanted to mention it, so that you don't waste hours
searching for an issue that might have been fixed in the meantime
without anyone noticing.

Cheers,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-22  1:13 ` Qu Wenruo
  2023-12-22  1:23 ` Christoph Anton Mitterer
@ 2024-01-05  3:30 ` Christoph Anton Mitterer
  2024-01-05  7:07 ` Qu Wenruo
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Anton Mitterer @ 2024-01-05 3:30 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

Hey there.

On Fri, 2023-12-22 at 11:43 +1030, Qu Wenruo wrote:
> That's not a big deal, because before sending that reply I had already
> reproduced the problem using a 6.6 kernel, so it's a long-existing
> problem.
> Just like the whole PREALLOC and compression problem.

Just wondered whether there's anything new on this, or whether the best
option for now would be to switch filesystems?

Also, do you think that NODATACOW would also be affected by the
underlying "issue"?

Cheers,
Chris
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-05  3:30 ` Christoph Anton Mitterer
@ 2024-01-05  7:07 ` Qu Wenruo
  2024-01-06  0:42 ` Christoph Anton Mitterer
  0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2024-01-05 7:07 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs

On 2024/1/5 14:00, Christoph Anton Mitterer wrote:
> Hey there.
>
> On Fri, 2023-12-22 at 11:43 +1030, Qu Wenruo wrote:
>> That's not a big deal, because before sending that reply I had already
>> reproduced the problem using a 6.6 kernel, so it's a long-existing
>> problem.
>> Just like the whole PREALLOC and compression problem.
>
> Just wondered whether there's anything new on this, or whether the best
> option for now would be to switch filesystems?

The root cause is pinned down. It's very stupid, but it's pretty hard
to find a good way to improve it:

For that offending inode, it only has one extent.

During defrag, we need to determine whether to defrag each extent.
For this particular case, we choose not to defrag it at all, the reason
being:

- The single extent cannot be merged with any other extent

So even if we're only utilizing a single 4K of a 128M extent, defrag
chooses not to defrag at all.

But this rule applies pretty well to really fragmented files, thus I
have no good idea how to continue.

Maybe we can add a new rule to treat very under-utilized extents as
defrag targets? But I'm sure that may make other corner cases unhappy
though.

> Also, do you think that NODATACOW would also be affected by the
> underlying "issue"?

My previous guess was totally wrong; it has nothing to do with the
NODATACOW/PREALLOC flags at all.

It's a defrag-only problem.

Thanks,
Qu

> Cheers.
> Chris
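Concretely, with the 000001 file from earlier in the thread: defrag
sees a single extent mapping covering the whole ~15M file, finds no
neighbouring extent to merge it with, and drops it from the target
list, no matter how little of the underlying 256M on-disk extent is
actually referenced. That is also why the -t 1000M attempts earlier
made no difference: the 15M mapping was well below that threshold, but
there was still nothing to merge it with.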
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-05  7:07 ` Qu Wenruo
@ 2024-01-06  0:42 ` Christoph Anton Mitterer
  2024-01-06  5:40 ` Qu Wenruo
  2024-12-14 19:09 ` Christoph Anton Mitterer
  0 siblings, 2 replies; 43+ messages in thread
From: Christoph Anton Mitterer @ 2024-01-06 0:42 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

On Fri, 2024-01-05 at 17:37 +1030, Qu Wenruo wrote:
> but it's pretty hard to find a good way to improve it:

I guess it would not be an alternative to do the work on truncate (i.e.
check whether a lot is wasted and, if so, create a new extent with just
the right size), because that would need a full re-write of the
extent... or would it?

Also, a while ago in one of my mails I saw that:
$ cat affected-file > new-file
would also cause new-file to be affected... which I found pretty
strange.

Any idea why that happens?

> My previous guess was totally wrong; it has nothing to do with the
> NODATACOW/PREALLOC flags at all.
>
> It's a defrag-only problem.

Sure, but I meant: if a file is NODATACOW and were preallocated to a
large size and then truncated - would it also lose the extra space?

And do you think that other CoW filesystems would be affected, too?
IIRC XFS is also about to get CoW features... so maybe it's simply an
IO pattern that developers need to avoid with modern filesystems.

Cheers,
Chris
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-06  0:42 ` Christoph Anton Mitterer
@ 2024-01-06  5:40 ` Qu Wenruo
  2024-01-06  8:12 ` Andrei Borzenkov
  1 sibling, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2024-01-06 5:40 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs

On 2024/1/6 11:12, Christoph Anton Mitterer wrote:
> On Fri, 2024-01-05 at 17:37 +1030, Qu Wenruo wrote:
>> but it's pretty hard to find a good way to improve it:
>
> I guess it would not be an alternative to do the work on truncate (i.e.
> check whether a lot is wasted and, if so, create a new extent with just
> the right size), because that would need a full re-write of the
> extent... or would it?

I'm not sure it's a valid idea to do all of this on truncation.
It would involve doing the whole backref walk to determine whether the
COW is needed.

> Also, a while ago in one of my mails I saw that:
> $ cat affected-file > new-file
> would also cause new-file to be affected... which I found pretty
> strange.
>
> Any idea why that happens?

Initially I thought that was impossible, until I tried it:

  mkfs.btrfs -f $dev1
  mount $dev1 $mnt
  xfs_io -f -c "pwrite 0 128m" $mnt/file
  sync
  truncate -s 4k $mnt/file
  sync
  cat $mnt/file > $mnt/new
  sync

Then dump-tree indeed shows the new file sharing the same large extent:

        item 6 key (257 INODE_ITEM 0) itemoff 15817 itemsize 160
                generation 7 transid 8 size 4096 nbytes 4096
                block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                sequence 32770 flags 0x0(none)
        item 7 key (257 INODE_REF 256) itemoff 15803 itemsize 14
                index 2 namelen 4 name: file
        item 8 key (257 EXTENT_DATA 0) itemoff 15750 itemsize 53
                generation 7 type 1 (regular)
                extent data disk byte 298844160 nr 134217728   <<<
                extent data offset 0 nr 4096 ram 134217728
                extent compression 0 (none)
        item 9 key (258 INODE_ITEM 0) itemoff 15590 itemsize 160
                generation 9 transid 9 size 4096 nbytes 4096
                block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
                sequence 1 flags 0x0(none)
        item 10 key (258 INODE_REF 256) itemoff 15577 itemsize 13
                index 3 namelen 3 name: new
        item 11 key (258 EXTENT_DATA 0) itemoff 15524 itemsize 53
                generation 7 type 1 (regular)
                extent data disk byte 298844160 nr 134217728   <<<
                extent data offset 0 nr 4096 ram 134217728
                extent compression 0 (none)

My guess is bash is doing something weird, thus making the whole cat +
redirection into a reflink.

But at least dd works as expected by creating a new extent.

>> My previous guess was totally wrong; it has nothing to do with the
>> NODATACOW/PREALLOC flags at all.
>>
>> It's a defrag-only problem.
>
> Sure, but I meant: if a file is NODATACOW and were preallocated to a
> large size and then truncated - would it also lose the extra space?

That would be the same.

> And do you think that other CoW filesystems would be affected, too?
> IIRC XFS is also about to get CoW features... so maybe it's simply an
> IO pattern that developers need to avoid with modern filesystems.

Not sure, maybe XFS would do extra extent splits to solve the problem.

Thanks,
Qu
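To see the same sharing without a tree dump (a sketch on the reproducer
above): pointing compsize at both files should report 4K referenced per
file against the same underlying 128M extent counted once on disk:

# compsize -b $mnt/file $mnt/new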
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-06  5:40 ` Qu Wenruo
@ 2024-01-06  8:12 ` Andrei Borzenkov
  0 siblings, 0 replies; 43+ messages in thread
From: Andrei Borzenkov @ 2024-01-06 8:12 UTC (permalink / raw)
To: Qu Wenruo, Christoph Anton Mitterer; +Cc: linux-btrfs

On 06.01.2024 08:40, Qu Wenruo wrote:
> On 2024/1/6 11:12, Christoph Anton Mitterer wrote:
>> Also, a while ago in one of my mails I saw that:
>> $ cat affected-file > new-file
>> would also cause new-file to be affected... which I found pretty
>> strange.
>>
>> Any idea why that happens?
>
> [...]
>
> My guess is bash is doing something weird, thus making the whole cat +
> redirection into a reflink.

It is not bash (which just opens the file and dup's the descriptor). It
is cat from GNU coreutils, which defaults to using copy_file_range:

10393 openat(AT_FDCWD, "/mnt/file", O_RDONLY) = 3
10393 fstat(3, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
10393 fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
10393 uname({sysname="Linux", nodename="tw", ...}) = 0
10393 copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 4096
10393 copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 0

In case of btrfs this ends up in btrfs_remap_file_range(). So it is a
more or less self-inflicted wound. Maybe btrfs should not reflink
partial extents in such cases.

> But at least dd works as expected by creating a new extent.
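Following from the dd observation above, a minimal workaround sketch
for getting a real (small) copy instead of a partial-extent reflink,
using the paths from the reproducer (an assumption; adapt to the real
files):

# dd if=$mnt/file of=$mnt/new bs=1M
# sync
# compsize $mnt/new

dd does plain read()/write(), so the data lands in a fresh extent sized
to the actual content, and compsize should then report disk usage
matching the referenced bytes.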
* Re: btrfs thinks fs is full, though 11GB should be still free
  2024-01-06  0:42 ` Christoph Anton Mitterer
  2024-01-06  5:40 ` Qu Wenruo
@ 2024-12-14 19:09 ` Christoph Anton Mitterer
  1 sibling, 0 replies; 43+ messages in thread
From: Christoph Anton Mitterer @ 2024-12-14 19:09 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

Hey Qu, et al.

Is the issue behind this still being looked at? There used to be some
patches out there, but I think they were never merged.

I'm still hitting the problem quite massively (well, only where I use
Prometheus): on filesystems where the application does that
pre-allocation, a lot of space is wasted.

IIRC your patches tried to put that into defrag, but the downside with
that is IMO that you have to defrag - and didn't that break up
reflinked extents?

Thanks,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-18 16:24 ` Christoph Anton Mitterer
  2023-12-18 19:18 ` Goffredo Baroncelli
@ 2023-12-18 19:54 ` Qu Wenruo
  1 sibling, 0 replies; 43+ messages in thread
From: Qu Wenruo @ 2023-12-18 19:54 UTC (permalink / raw)
To: Christoph Anton Mitterer, linux-btrfs

On 2023/12/19 02:54, Christoph Anton Mitterer wrote:
> Hey again.
>
> Seems that even the manual defrag doesn't help at all:
>
> After:
> btrfs filesystem defragment -v -r -t 100000M
>
> there's still:
> # compsize .
> Processed 309 files, 324 regular extents (324 refs), 146 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%       22G          22G          13G
> none       100%       22G          22G          13G
>
> Any other ideas how this could be solved?

Snapshots or reflinks (remember, cp now goes reflink by default)?

Thanks,
Qu

> Cheers,
> Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-12  4:13 ` Qu Wenruo
  2023-12-15  2:33 ` Chris Murphy
  2023-12-18 16:24 ` Christoph Anton Mitterer
@ 2023-12-18 22:30 ` Christoph Anton Mitterer
  2 siblings, 0 replies; 43+ messages in thread
From: Christoph Anton Mitterer @ 2023-12-18 22:30 UTC (permalink / raw)
To: linux-btrfs

Hey.

I had already sent the mail below this afternoon, but just got a bounce:
  <linux-btrfs@vger.kernel.org>: lost connection with
  smtp.subspace.kernel.org[44.238.234.78] while receiving the initial
  server greeting

So here it is again... effectively it just says that autodefrag didn't
help either.

Cheers,
Chris.

On Tue, 2023-12-12 at 14:43 +1030, Qu Wenruo wrote:
> The direct cause is frequent fsync()/sync() with overwrites.
> Btrfs really relies on merging the writes between transactions; if
> fsync()/sync() is called too frequently (like some databases) and the
> program is doing overwrites, this is exactly what you would have.
>
> IIRC we can set AUTODEFRAG for a directory?

I have meanwhile tried with autodefrag for a few days, but that doesn't
cure the problem; not sure why it doesn't seem to kick in.

The way Prometheus writes, together with btrfs, causes extensive loss
of space:
compsize /data/main/prometheus/metrics2
Processed 305 files, 567 regular extents (586 refs), 146 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       21G          21G          13G
none       100%       21G          21G          13G

I'll try with a manual defrag now, but it's a bit unfortunate that this
happens without manual intervention. Or would it be better, or even
help at all, to balance?

And nodatacow isn't IMO a real alternative either, as one loses one of
the greatest btrfs benefits with it (checksumming).

Cheers,
Chris.
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-12  3:27 ` Christoph Anton Mitterer
  2023-12-12  3:40 ` Christoph Anton Mitterer
@ 2023-12-13  1:49 ` Remi Gauvin
  1 sibling, 0 replies; 43+ messages in thread
From: Remi Gauvin @ 2023-12-13 1:49 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs

On 2023-12-11 10:27 p.m., Christoph Anton Mitterer wrote:
>
> Well, the manpage warns against using it on large DB workloads... I
> mean, Prometheus is not exactly like a DB, and I would have naively
> assumed that at least the chunks were written not as many small random
> writes... but apparently they are.
>
> Also, this is a VM, so the storage volume is actually something Ceph
> backed, which the university's super computing centre provides us with.
>
> I wonder, if I do autodefrag on all that, if it doesn't just kill off
> our performance even more?

On SSD, I like to use the compress-force option and a daily defrag with
target size 128k to eliminate space wasted by fragmentation.
(Compressed extents have a max size of 128KB.)

However, compression will destroy sequential file read speed on
spinning disks (presumably due to the small extent size, possibly not
landing on disk in order; I'm not really sure why read speed is badly
affected when write speed is not.)

If there are no snapshots or reflink copies, you can use defrag with
target size 128MB (at whatever frequency you need) to eliminate wasted
space.
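A sketch of that setup (device, mount point, and schedule are
assumptions; adapt to the actual system):

# mount -o compress-force=zstd /dev/vdb /data/main
# btrfs filesystem defragment -r -czstd -t 128K /data/main

and, when no snapshots or reflinks need preserving:

# btrfs filesystem defragment -r -t 128M /data/main

Note that defragmenting rewrites data, so on a snapshotted filesystem
it can unshare extents and temporarily increase space usage.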
* Re: btrfs thinks fs is full, though 11GB should be still free
  2023-12-12  0:12 ` Christoph Anton Mitterer
  2023-12-12  0:58 ` Qu Wenruo
@ 2023-12-13  8:29 ` Andrea Gelmini
  1 sibling, 0 replies; 43+ messages in thread
From: Andrea Gelmini @ 2023-12-13 8:29 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: Qu Wenruo, linux-btrfs

On Tue, 12 Dec 2023 at 01:12, Christoph Anton Mitterer
<calestyo@scientia.org> wrote:
> Is there a way to check this? Would I just see maaany extents when I
> look at the files with filefrag?

I use this:
https://github.com/CyberShadow/btdu.git

And no, maaannyyyy extents don't imply wasted space.
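btdu samples extents and shows where the space actually goes, including
portions of extents that are allocated but no longer reachable from any
file. A usage sketch (mount point assumed from earlier in the thread;
it needs root):

# btdu /data/main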