linux-btrfs.vger.kernel.org archive mirror
* Deleting large amounts of data causes system freeze due to OOM.
@ 2023-09-13  2:28 fdavidl073rnovn
  2023-09-13  5:55 ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: fdavidl073rnovn @ 2023-09-13  2:28 UTC (permalink / raw)
  To: Linux Btrfs

Dear Btrfs Mailing List,

Full disclosure I reported this on kernel.org but am hoping to get more exposure on the mailing list. 

When I delete several terabytes of data, memory usage increases until the system becomes entirely unresponsive. This has been an issue for several kernel versions, since at least 5.19, and continues to be an issue up to 6.5.2-artix1-1. This is on an older computer with several hard drives, eight gigabytes of memory, and a four-core x86_64 CPU. Slabtop output right before the system becomes unresponsive shows about four gigabytes used by khugepaged_mm_slot and three used by btrfs_extent_map. This all happens over the span of a couple of minutes, and during that time btrfs-transaction is using a moderate amount of CPU time.
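For anyone wanting to watch this happen, I was checking the slab usage with something along these lines (from memory, not an exact transcript):

  # watch the two slab caches that blow up (reading /proc/slabinfo usually needs root)
  watch -n 5 "grep -E 'khugepaged_mm_slot|btrfs_extent_map' /proc/slabinfo"

  # or interactively, sorted by cache size
  slabtop -s c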

While this is happening, the free space reported by btrfs filesystem usage slowly falls until the system is unresponsive. If I delete smaller amounts of data at a time, memory usage still increases, but as long as the system doesn't run out of memory all the disk space is eventually freed and memory usage comes back down. Deleting things bit by bit isn't a useful workaround, because this also happens when deleting a snapshot even if it won't free any disk space, and I am trying to use this computer for incremental backups.

The only things that seem to make a difference are the checksum used and slower hard drives. The checksum changes the behavior of the issue: with xxhash, when I remount the filesystem it seems to try to either restart or continue the delete operation, causing another out-of-memory condition, but with the default crc32 the remounted filesystem is in its original state from before the delete command was issued and nothing further happens (I haven't tried any other checksums). Having slower (SMR) drives as part of the filesystem causes the out of memory to happen much faster. Nothing else, like RAID level, compression, kernel version, or block group tree, has seemed to change anything.

My speculation is that operations to finish the delete are being queued up in memory faster than they can be completed until the system completely runs out of memory. That would explain what's happening, why slower drives make it worse, and why deleting small amounts of data works. I'm not sure why checksum seems to change the behavior when remounting the filesystem.

I am willing to do destructive testing on this data to hopefully get this fixed.

Sincerely,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-09-13  2:28 Deleting large amounts of data causes system freeze due to OOM fdavidl073rnovn
@ 2023-09-13  5:55 ` Qu Wenruo
  2023-09-14  3:38   ` fdavidl073rnovn
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2023-09-13  5:55 UTC (permalink / raw)
  To: fdavidl073rnovn, Linux Btrfs



On 2023/9/13 11:58, fdavidl073rnovn@tutanota.com wrote:
> Dear Btrfs Mailing List,
> 
> Full disclosure I reported this on kernel.org but am hoping to get more exposure on the mailing list.
> 
> When I delete several terabytes of data memory usage increases until the system becomes entirely unresponsive. This has been an issue for several kernel version since at least 5.19 and continues to be an issue up to 6.5.2-artix1-1. This is on an older computer with several hard drives, eight gigabytes of memory, and a four core x86_64 cpu. Slabtop output right before the system becomes unresponsive shows about four gigabytes used by khugepaged_mm_slot and three used by btrfs_extent_map. This happens in over the span of a couple minutes and during this time btrfs-transaction is using a moderate amount of cpu time.

This looks exactly like something caused by btrfs qgroup.

Could you try to disable qgroup to see if it helps?
The CPU time and IO overhead of qgroup is directly related to the
number of extents being updated.

For normal writes the IO itself takes most of the CPU/memory, thus 
qgroup is not a big deal.
But for a massive snapshot drop or file deletion, the qgroup work can 
be too large to handle in just one transaction.

For now you can disable the qgroup as a workaround.
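
Something like the following should show whether qgroups are enabled at
all and, if so, turn them off (the mount point is a placeholder):

  # check whether quotas/qgroups are enabled
  btrfs qgroup show /mnt/backup

  # if they are, disable them as a workaround
  btrfs quota disable /mnt/backup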

Thanks,
Qu
> 
> While this is happening the free space reported by btrfs filesystem usage slowly falls until the system is unresponsive. If I delete smaller amounts of data at a time memory usage increases but if the system doesn't go out of memory all the disk space is freed and memory usage comes back down. Deleting things bit by bit isn't a useful workaround because this also happens when deleting a snapshot even if it won't free any disk space and I am trying to use this computer for incremental backups.
> 
> The only things that seem to cause a difference are the checksum used and slower hard drives. Checksum changes the behavior of the issue. If using xxhash when I remount the filesystem it seems to try to either restart or continue the delete operation causing another out of memory condition but using the default crc32 remounting the filesystem has it in the original state before the delete command was issued and nothing happens (I haven't tried any other checksums). Having slower (SMR) drives as part of the device causes the out of memory to happen much faster. Nothing else like raid level, compression, kernel version, block group tree have seemed to change anything.
> 
> My speculation is that operations to finish the delete are being queued up in memory faster than they can be completed until the system completely runs out of memory. That would explain what's happening, why slower drives make it worse, and why deleting small amounts of data works. I'm not sure why checksum seems to change the behavior when remounting the filesystem.
> 
> I am willing to do destructive testing on this data to hopefully get this fixed.
> 
> Sincerely,
> David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-09-13  5:55 ` Qu Wenruo
@ 2023-09-14  3:38   ` fdavidl073rnovn
  2023-09-14  5:12     ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: fdavidl073rnovn @ 2023-09-14  3:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Linux Btrfs

Sep 13, 2023, 05:55 by wqu@suse.com:

>
>
> On 2023/9/13 11:58, fdavidl073rnovn@tutanota.com wrote:
>
>> Dear Btrfs Mailing List,
>>
>> Full disclosure I reported this on kernel.org but am hoping to get more exposure on the mailing list.
>>
>> When I delete several terabytes of data memory usage increases until the system becomes entirely unresponsive. This has been an issue for several kernel version since at least 5.19 and continues to be an issue up to 6.5.2-artix1-1. This is on an older computer with several hard drives, eight gigabytes of memory, and a four core x86_64 cpu. Slabtop output right before the system becomes unresponsive shows about four gigabytes used by khugepaged_mm_slot and three used by btrfs_extent_map. This happens in over the span of a couple minutes and during this time btrfs-transaction is using a moderate amount of cpu time.
>>
>
> This looks exactly like something caused by btrfs qgroup.
>
> Could you try to disable qgroup to see if it helps?
> The amount of CPU time and IO of qgroup overhead is directly related to the amount of extent being updated.
>
> For normal writes the IO itself would take most of the CPU/memory thus qgroup is not a big deal.
> But for massive snapshots drop or file deletion qgroup can be too large to be handled in just one transaction.
>
> For now you can disable the qgroup as a workaround.
>
> Thanks,
> Qu
>
I've never enabled quotas, and my most recent attempt using the single profile for data was on kernel 6.4, so they would have been disabled by default. Running "btrfs qgroup show [path]" returns "ERROR: can't list qgroups: quotas not enabled".

Sincerely,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-09-14  3:38   ` fdavidl073rnovn
@ 2023-09-14  5:12     ` Qu Wenruo
  2023-09-14 23:08       ` fdavidl073rnovn
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2023-09-14  5:12 UTC (permalink / raw)
  To: fdavidl073rnovn, Qu Wenruo; +Cc: Linux Btrfs



On 2023/9/14 13:08, fdavidl073rnovn@tutanota.com wrote:
> Sep 13, 2023, 05:55 by wqu@suse.com:
>
>>
>>
>> On 2023/9/13 11:58, fdavidl073rnovn@tutanota.com wrote:
>>
>>> Dear Btrfs Mailing List,
>>>
>>> Full disclosure I reported this on kernel.org but am hoping to get more exposure on the mailing list.
>>>
>>> When I delete several terabytes of data memory usage increases until the system becomes entirely unresponsive. This has been an issue for several kernel version since at least 5.19 and continues to be an issue up to 6.5.2-artix1-1. This is on an older computer with several hard drives, eight gigabytes of memory, and a four core x86_64 cpu. Slabtop output right before the system becomes unresponsive shows about four gigabytes used by khugepaged_mm_slot and three used by btrfs_extent_map. This happens in over the span of a couple minutes and during this time btrfs-transaction is using a moderate amount of cpu time.
>>>
>>
>> This looks exactly like something caused by btrfs qgroup.
>>
>> Could you try to disable qgroup to see if it helps?
>> The amount of CPU time and IO of qgroup overhead is directly related to the amount of extent being updated.
>>
>> For normal writes the IO itself would take most of the CPU/memory thus qgroup is not a big deal.
>> But for massive snapshots drop or file deletion qgroup can be too large to be handled in just one transaction.
>>
>> For now you can disable the qgroup as a workaround.
>>
>> Thanks,
>> Qu
>>
> I've never enabled quotas and my most recent attempt using the single profile for data was on kernel 6.4 so they would have been disabled by default. Running "btrfs qgroup show [path]" returns "ERROR: can't list qgroups: quotas not enabled".

OK, at least we can rule out qgroup.

Would you mind providing more info (a rough way to gather most of it is
sketched after the list)? Including:

- How many files are involved?
   A large file vs a ton of small files makes for very different
   workloads. Any numbers on the average file size would also help.

- Is the fs using the v1 or v2 space cache?
- Do the deleted files have any snapshots/reflinks?
- Are there any other processes reading the to-be-deleted files?
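
Something along these lines should gather most of that; the paths are
placeholders:

  # rough file/directory counts for the data being deleted
  find /mnt/backup/subvol -type f | wc -l
  find /mnt/backup/subvol -type d | wc -l

  # the mount options normally show which space cache is in use
  # (space_cache=v2 means the free space tree)
  grep btrfs /proc/mounts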

One of my concerns is the btrfs_extent_map usage; that's mostly used by
regular files as an in-memory cache so that they don't need to look up
the on-disk tree.

I just checked the code: evicting an inode won't trigger
btrfs_extent_map usage, it's mostly reads/writes that trigger it.

Thus there must be something else causing the unexpected
btrfs_extent_map usage.

Thanks,
Qu
>
> Sincerely,
> David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-09-14  5:12     ` Qu Wenruo
@ 2023-09-14 23:08       ` fdavidl073rnovn
  2023-09-27  1:46         ` fdavidl073rnovn
  0 siblings, 1 reply; 13+ messages in thread
From: fdavidl073rnovn @ 2023-09-14 23:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Linux Btrfs


Sep 14, 2023, 05:12 by quwenruo.btrfs@gmx.com:

>
>
> On 2023/9/14 13:08, fdavidl073rnovn@tutanota.com wrote:
>
>> Sep 13, 2023, 05:55 by wqu@suse.com:
>>
>>>
>>>
>>> On 2023/9/13 11:58, fdavidl073rnovn@tutanota.com wrote:
>>>
>>>> Dear Btrfs Mailing List,
>>>>
>>>> Full disclosure I reported this on kernel.org but am hoping to get more exposure on the mailing list.
>>>>
>>>> When I delete several terabytes of data memory usage increases until the system becomes entirely unresponsive. This has been an issue for several kernel version since at least 5.19 and continues to be an issue up to 6.5.2-artix1-1. This is on an older computer with several hard drives, eight gigabytes of memory, and a four core x86_64 cpu. Slabtop output right before the system becomes unresponsive shows about four gigabytes used by khugepaged_mm_slot and three used by btrfs_extent_map. This happens in over the span of a couple minutes and during this time btrfs-transaction is using a moderate amount of cpu time.
>>>>
>>>
>>> This looks exactly like something caused by btrfs qgroup.
>>>
>>> Could you try to disable qgroup to see if it helps?
>>> The amount of CPU time and IO of qgroup overhead is directly related to the amount of extent being updated.
>>>
>>> For normal writes the IO itself would take most of the CPU/memory thus qgroup is not a big deal.
>>> But for massive snapshots drop or file deletion qgroup can be too large to be handled in just one transaction.
>>>
>>> For now you can disable the qgroup as a workaround.
>>>
>>> Thanks,
>>> Qu
>>>
>> I've never enabled quotas and my most recent attempt using the single profile for data was on kernel 6.4 so they would have been disabled by default. Running "btrfs qgroup show [path]" returns "ERROR: can't list qgroups: quotas not enabled".
>>
>
> OK, at least we can rule out qgroup.
>
> Mind to provide more info? Including:
>
> - How many files are involved?
>  A large file vs a ton of small files have very different workloads.
>  Any values on the average file size would also help.
>
> - Is the fs using v1 or v2 space cache?
> - Do the deleted files have any snapshot/reflink?
> - Is there any other processes reading the to-be-deleted files?
>
> One of my concern is the btrfs_extent_map usage, that's mostly used by
> regular files as an in-memory cache so that they don't need to lookup
> the tree on-disk.
>
> I just checked the code, evicting an inode won't trigger
> btrfs_extent_map usage, it's mostly read/write triggering such
> btrfs_extent_map usage.
>
> Thus there must be something else causing the unexpected
> btrfs_extent_map usage.
>
> Thanks,
> Qu
>
>>
>> Sincerely,
>> David
>>
On my latest attempt using the single profile there are about fifteen terabytes of space used in total, around eight hundred and fifty thousand files, over nine thousand directories, and three very large files (two of two terabytes and one of four terabytes). There are also about two terabytes of compressed files using zstd at roughly a fifty percent ratio.

The device is using space cache version two, there are no reflinks or snapshots as far as I know, and nothing else is reading or otherwise active when this occurs. The system idles at about three hundred megabytes of memory used with negligible CPU activity before this happens.

For some context, the device is currently mounted with compress-force=zstd:3 and noatime. The data currently on the device was transferred via send-receive protocol version two (and was already compressed) as a snapshot, but it is the only copy of it on the disk, so I am not sure if that counts as a snapshot. I do not think the snapshot is related, because I deleted a single four-terabyte file (from the snapshot) as a test and memory usage went from about three hundred megabytes to over a gigabyte before going back down. I assume that was the same thing, only the system did not run out of memory.
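
(For anyone wanting to check a compression ratio like that, something along the lines of the following works, assuming the compsize tool is installed and with the path as a placeholder:)

  # reports compressed vs. uncompressed usage for everything under the path
  compsize /mnt/backup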

Sincerely,
David


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-09-14 23:08       ` fdavidl073rnovn
@ 2023-09-27  1:46         ` fdavidl073rnovn
  2023-09-27  4:53           ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: fdavidl073rnovn @ 2023-09-27  1:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Linux Btrfs


Sep 14, 2023, 23:08 by fdavidl073rnovn@tutanota.com:

>
> Sep 14, 2023, 05:12 by quwenruo.btrfs@gmx.com:
>
>>
>>
>> On 2023/9/14 13:08, fdavidl073rnovn@tutanota.com wrote:
>>
>>> Sep 13, 2023, 05:55 by wqu@suse.com:
>>>
>>>>
>>>>
>>>> On 2023/9/13 11:58, fdavidl073rnovn@tutanota.com wrote:
>>>>
>>>>> Dear Btrfs Mailing List,
>>>>>
>>>>> Full disclosure I reported this on kernel.org but am hoping to get more exposure on the mailing list.
>>>>>
>>>>> When I delete several terabytes of data memory usage increases until the system becomes entirely unresponsive. This has been an issue for several kernel version since at least 5.19 and continues to be an issue up to 6.5.2-artix1-1. This is on an older computer with several hard drives, eight gigabytes of memory, and a four core x86_64 cpu. Slabtop output right before the system becomes unresponsive shows about four gigabytes used by khugepaged_mm_slot and three used by btrfs_extent_map. This happens in over the span of a couple minutes and during this time btrfs-transaction is using a moderate amount of cpu time.
>>>>>
>>>>
>>>> This looks exactly like something caused by btrfs qgroup.
>>>>
>>>> Could you try to disable qgroup to see if it helps?
>>>> The amount of CPU time and IO of qgroup overhead is directly related to the amount of extent being updated.
>>>>
>>>> For normal writes the IO itself would take most of the CPU/memory thus qgroup is not a big deal.
>>>> But for massive snapshots drop or file deletion qgroup can be too large to be handled in just one transaction.
>>>>
>>>> For now you can disable the qgroup as a workaround.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>> I've never enabled quotas and my most recent attempt using the single profile for data was on kernel 6.4 so they would have been disabled by default. Running "btrfs qgroup show [path]" returns "ERROR: can't list qgroups: quotas not enabled".
>>>
>>
>> OK, at least we can rule out qgroup.
>>
>> Mind to provide more info? Including:
>>
>> - How many files are involved?
>> A large file vs a ton of small files have very different workloads.
>> Any values on the average file size would also help.
>>
>> - Is the fs using v1 or v2 space cache?
>> - Do the deleted files have any snapshot/reflink?
>> - Is there any other processes reading the to-be-deleted files?
>>
>> One of my concern is the btrfs_extent_map usage, that's mostly used by
>> regular files as an in-memory cache so that they don't need to lookup
>> the tree on-disk.
>>
>> I just checked the code, evicting an inode won't trigger
>> btrfs_extent_map usage, it's mostly read/write triggering such
>> btrfs_extent_map usage.
>>
>> Thus there must be something else causing the unexpected
>> btrfs_extent_map usage.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Sincerely,
>>> David
>>>
> On my latest attempt using the single profile there is about fifteen terabytes total of space used, around eight hundred and fifty thousand files, over 9000 directories, and there are three very large files (two two terabyte and one four terabyte). There are also about two terabytes of compressed files using zstd at a fifty percent ratio.
>
> The device is using space cache version two, there are no reflink or snapshots as far as I know and nothing else is reading or happening when this occurs. The system idles at about three hundred megabytes of memory used with negligible cpu activity before this happens.
>
> For some context the device is currently mounted with compress-force=zstd:3 and noatime. The data currently on the device was transferred via send-receive version two (and was already compressed) as a snapshot but it is the only copy of it on the disk so I am not sure if that counts as a snapshot. I do not think the snapshot is related because I have deleted a single four terabyte file (from the snapshot) as a test and the memory usage went from about three hundred megabytes to over a gigabyte before going back down. I assume that was the same thing but the system just did not run out of memory.
>
> Sincerely,
> David
>
>
To follow up on this, I tried creating a ten-terabyte file and then deleting it, and then tried creating approximately ten terabytes of files randomly sized between one and thirty-two megabytes and deleting that folder (a rough sketch is below). I tried this both at the root of the btrfs device and inside a subvolume. Each trial did increase memory usage by up to one gigabyte at points, but did not cause the system to run out of memory.
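
Roughly what I ran for the small-file trial, reconstructed from memory (counts, sizes, and paths are approximate; bash):

  # fill roughly ten terabytes with files randomly sized 1-32 MiB, then delete them
  mkdir -p /mnt/backup/testdata
  for i in $(seq 1 600000); do
      sz=$(( RANDOM % 32 + 1 ))                  # size in MiB
      dd if=/dev/urandom of=/mnt/backup/testdata/f$i bs=1M count=$sz status=none
  done
  rm -rf /mnt/backup/testdata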

I still believe the cause is that requests are being queued faster than they are completed until there is no memory left, so my current thought is that this either has something to do with nested directories or my real backup being significantly more fragmented. Either of those possibilities might cause significantly more seeks for the hard drives and slow down how fast operations are completed, causing them to pile up.

I might try putting together something that makes nested directories with lots of small files and deleting that, but otherwise I am out of ideas (I cannot think of an easy way to properly replicate fragmentation). If you have any thoughts or things you think would be worthwhile to test, I would love to hear them.

Sincerely,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-09-27  1:46         ` fdavidl073rnovn
@ 2023-09-27  4:53           ` Qu Wenruo
  2023-09-28 23:32             ` fdavidl073rnovn
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2023-09-27  4:53 UTC (permalink / raw)
  To: fdavidl073rnovn; +Cc: Qu Wenruo, Linux Btrfs



On 2023/9/27 11:16, fdavidl073rnovn@tutanota.com wrote:
>
> Sep 14, 2023, 23:08 by fdavidl073rnovn@tutanota.com:
>
>>
>> Sep 14, 2023, 05:12 by quwenruo.btrfs@gmx.com:
>>
>>>
>>>
>>> On 2023/9/14 13:08, fdavidl073rnovn@tutanota.com wrote:
>>>
>>>> Sep 13, 2023, 05:55 by wqu@suse.com:
>>>>
>>>>>
>>>>>
>>>>> On 2023/9/13 11:58, fdavidl073rnovn@tutanota.com wrote:
>>>>>
>>>>>> Dear Btrfs Mailing List,
>>>>>>
>>>>>> Full disclosure I reported this on kernel.org but am hoping to get more exposure on the mailing list.
>>>>>>
>>>>>> When I delete several terabytes of data memory usage increases until the system becomes entirely unresponsive. This has been an issue for several kernel version since at least 5.19 and continues to be an issue up to 6.5.2-artix1-1. This is on an older computer with several hard drives, eight gigabytes of memory, and a four core x86_64 cpu. Slabtop output right before the system becomes unresponsive shows about four gigabytes used by khugepaged_mm_slot and three used by btrfs_extent_map. This happens in over the span of a couple minutes and during this time btrfs-transaction is using a moderate amount of cpu time.
>>>>>>
>>>>>
>>>>> This looks exactly like something caused by btrfs qgroup.
>>>>>
>>>>> Could you try to disable qgroup to see if it helps?
>>>>> The amount of CPU time and IO of qgroup overhead is directly related to the amount of extent being updated.
>>>>>
>>>>> For normal writes the IO itself would take most of the CPU/memory thus qgroup is not a big deal.
>>>>> But for massive snapshots drop or file deletion qgroup can be too large to be handled in just one transaction.
>>>>>
>>>>> For now you can disable the qgroup as a workaround.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>> I've never enabled quotas and my most recent attempt using the single profile for data was on kernel 6.4 so they would have been disabled by default. Running "btrfs qgroup show [path]" returns "ERROR: can't list qgroups: quotas not enabled".
>>>>
>>>
>>> OK, at least we can rule out qgroup.
>>>
>>> Mind to provide more info? Including:
>>>
>>> - How many files are involved?
>>> A large file vs a ton of small files have very different workloads.
>>> Any values on the average file size would also help.
>>>
>>> - Is the fs using v1 or v2 space cache?
>>> - Do the deleted files have any snapshot/reflink?
>>> - Is there any other processes reading the to-be-deleted files?
>>>
>>> One of my concern is the btrfs_extent_map usage, that's mostly used by
>>> regular files as an in-memory cache so that they don't need to lookup
>>> the tree on-disk.
>>>
>>> I just checked the code, evicting an inode won't trigger
>>> btrfs_extent_map usage, it's mostly read/write triggering such
>>> btrfs_extent_map usage.
>>>
>>> Thus there must be something else causing the unexpected
>>> btrfs_extent_map usage.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> Sincerely,
>>>> David
>>>>
>> On my latest attempt using the single profile there is about fifteen terabytes total of space used, around eight hundred and fifty thousand files, over 9000 directories, and there are three very large files (two two terabyte and one four terabyte). There are also about two terabytes of compressed files using zstd at a fifty percent ratio.

Compression is the easiest way to create tons of small file extents
(the size limit of a compressed extent is only 128K).

Furthermore, each file extent needs an in-memory structure (struct
extent_map; on a debug kernel it's 122 bytes) to cache its contents.

Thus for an 8 TiB file with all compressed file extents at their max
size (pretty common if it's only for backup), we still end up with
around 64M file extents.

Just multiply that by 122 bytes and you can see how this goes crazy.
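
Back-of-envelope with plain shell arithmetic (the 122 bytes per
extent_map is the debug-kernel figure above; non-debug kernels will
differ somewhat):

  # 8 TiB expressed in KiB, divided by the 128 KiB compressed extent size
  echo $(( 8 * 1024 * 1024 * 1024 / 128 ))      # -> 67108864, i.e. ~64M extents

  # multiply by ~122 bytes per extent_map, shown here in MiB
  echo $(( 67108864 * 122 / 1024 / 1024 ))      # -> 7808 (MiB)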

But still, if you're only deleting the file, the result shouldn't go
this crazy, as the deletion itself won't try to read the file extents
and thus creates no such cache.

However, as soon as we start doing reads/writes, the cache can grow
very large, especially if you use compression, and it only gets
released when the whole inode is released by the kernel.

On the other hand, if you go with uncompressed data, the maximum file
extent size is enlarged to 128M (a 1024x increase), thus a huge
reduction in the number of extents.

In the long run I guess we need some way to release extent_maps when
low on memory.
But for now, I'm afraid I don't have a better suggestion than turning
off compression and defragging the compressed files using a newer
kernel (v6.2 and newer); a sketch is below.

In v6.2 there is a patch that prevents defrag from populating the
extent map cache, so defrag alone won't take all the memory.
And with all those files converted away from compression, I believe
the situation would be greatly improved.
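
A minimal sketch of that workaround; the mount point is a placeholder,
and it assumes no compression property is left set on the files so the
defrag rewrite goes out uncompressed:

  # stop compressing new writes (including the data rewritten by defrag)
  mount -o remount,compress=no /mnt/backup

  # rewrite the existing compressed files with a large target extent size
  # (run this on v6.2+ so defrag itself does not populate the extent map cache)
  btrfs filesystem defragment -r -t 128M /mnt/backup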

Thanks,
Qu


>>
>> The device is using space cache version two, there are no reflink or snapshots as far as I know and nothing else is reading or happening when this occurs. The system idles at about three hundred megabytes of memory used with negligible cpu activity before this happens.
>>
>> For some context the device is currently mounted with compress-force=zstd:3 and noatime. The data currently on the device was transferred via send-receive version two (and was already compressed) as a snapshot but it is the only copy of it on the disk so I am not sure if that counts as a snapshot. I do not think the snapshot is related because I have deleted a single four terabyte file (from the snapshot) as a test and the memory usage went from about three hundred megabytes to over a gigabyte before going back down. I assume that was the same thing but the system just did not run out of memory.
>>
>> Sincerely,
>> David
>>
>>
> To follow up on this I've tried creating a ten terabyte file then deleting it then tried creating approximately ten terabytes of files randomly between one and thirty two megabytes then deleting that folder. I tried this both at the root of the btrfs device and inside a subvolume. Each trial did increase the memory usage by up to one gigabyte at points but did not cause the system to run out of memory.
>
> I still believe the cause is that requests are being queued faster than they're completed until there is no memory left so my current thought is that this either has something to do with nested directories or my real backup is significantly more fragmented. I think either of those possibilities might cause significantly more  seeks for the harddrives and slow down how fast operations are completed causing them to pile up.
>
> I might try to put together something to make nested directories with lots of small files and delete that but otherwise I am out of ideas (I cannot think how I could properly replicate fragmentation easily). If you have any thoughts or things you think it'd be worthwhile to test I would love to hear them.
>
> Sincerely,
> David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-09-27  4:53           ` Qu Wenruo
@ 2023-09-28 23:32             ` fdavidl073rnovn
  2023-09-29  1:01               ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: fdavidl073rnovn @ 2023-09-28 23:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Linux Btrfs



Sep 27, 2023, 04:53 by quwenruo.btrfs@gmx.com:

>
> The compression is the easily way to create tons of small file extents
> (the limit of a compressed extent is only 128K).
>
> Furthermore, each file extent would need an in-memory structure (struct
> extent_map, for a debug kernel, it's 122 bytes) to cache the contents.
>
> Thus for a 8TiB file with all compressed file extents at their max size
> (pretty common if it's only for backup).
> Then we still have 512M file extents.
>
> Just multiple that by 122, you can see how this go crazy.
>
> But still, if you're only deleting the file, the result shouldn't go
> this crazy, as deleting itself won't try to read the file extents thus
> no such cache.
>
> However as long as we start doing read/write, the cache can go very
> large, especially if you use compress, and only get released when the
> whole inode get released from kernel.
>
> On the other hand, if you go uncompressed data, the maximum file extent
> size is enlarged to 128M (a 1024x increase), thus a huge reduce in the
> number of extents.
>
> In the long run I guess we need some way to release the extent_map when
> low on memory.
> But for now, I'm afraid I don't have better suggestion other than
> turning off compression and defrag the compressed files using newer
> kernel (v6.2 and newer).
>
> In v6.2, there is a patch to prevent defrag from populating the extent
> map cache, thus it won't take all the memory just by defrag.
> And with all those files converted from compression, I believe the
> situation would be greatly improved.
>
> Thanks,
> Qu
>
The backup itself is gone and will need to be re-sent. If I'm understanding things properly, then by mounting the backup btrfs device without compression and enforcing send protocol one, the data should be written uncompressed, which will avoid the issue, correct?
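
In other words something like this, with made-up paths, assuming my btrfs-progs is new enough to have the --proto option:

  # receive onto a target mounted without any compress option; protocol v1
  # send streams only ever carry decompressed file data
  btrfs send --proto 1 /pool/snapshots/2023-10-01 | btrfs receive /mnt/backup/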

I was also looking at the source code and it seems relatively straightforward to change BTRFS_MAX_COMPRESSED and BTRFS_MAX_UNCOMPRESSED to SZ_128M, or to somewhere in between like SZ_8M. Do you have any thoughts on how well that might work?

Do you have any idea how complicated the long-term fix is, or when it might be added? v6.8 maybe?

Thank you for your prompt responses. Sending the backup again will take some days, but I will email you to tell you whether disabling compression fixes the issue.

Sincerely,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-09-28 23:32             ` fdavidl073rnovn
@ 2023-09-29  1:01               ` Qu Wenruo
  2023-10-13 22:28                 ` fdavidl073rnovn
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2023-09-29  1:01 UTC (permalink / raw)
  To: fdavidl073rnovn; +Cc: Qu Wenruo, Linux Btrfs



On 2023/9/29 09:02, fdavidl073rnovn@tutanota.com wrote:
>
>
> Sep 27, 2023, 04:53 by quwenruo.btrfs@gmx.com:
>
>>
>> The compression is the easily way to create tons of small file extents
>> (the limit of a compressed extent is only 128K).
>>
>> Furthermore, each file extent would need an in-memory structure (struct
>> extent_map, for a debug kernel, it's 122 bytes) to cache the contents.
>>
>> Thus for a 8TiB file with all compressed file extents at their max size
>> (pretty common if it's only for backup).
>> Then we still have 512M file extents.
>>
>> Just multiple that by 122, you can see how this go crazy.
>>
>> But still, if you're only deleting the file, the result shouldn't go
>> this crazy, as deleting itself won't try to read the file extents thus
>> no such cache.
>>
>> However as long as we start doing read/write, the cache can go very
>> large, especially if you use compress, and only get released when the
>> whole inode get released from kernel.
>>
>> On the other hand, if you go uncompressed data, the maximum file extent
>> size is enlarged to 128M (a 1024x increase), thus a huge reduce in the
>> number of extents.
>>
>> In the long run I guess we need some way to release the extent_map when
>> low on memory.
>> But for now, I'm afraid I don't have better suggestion other than
>> turning off compression and defrag the compressed files using newer
>> kernel (v6.2 and newer).
>>
>> In v6.2, there is a patch to prevent defrag from populating the extent
>> map cache, thus it won't take all the memory just by defrag.
>> And with all those files converted from compression, I believe the
>> situation would be greatly improved.
>>
>> Thanks,
>> Qu
>>
> The backup itself is gone and will need to be re-sent. If I'm understanding things properly then by mounting the btrfs device for the backup without compression and enforcing send protocol one it should be written uncompressed which will avoid the issue correct?

IIRC yes.

The send stream only contains the decompressed content, thus as long as
it's mounted without compression, the received data on-disk would not be
compressed either.

>
> I was also looking at the source code and it seems relatively straight forward to change BTRFS_MAX_COMPRESSED and BTRFS_MAX_UNCOMPRESSED to SZ_128M or somewhere in between like SZ_8M. Do you have any thoughts on how well that might work?

The size is a trade-off between space wasted by COW and memory needed to
decompress an extent.

Remember that even if we only need part of a compressed extent, we
still need to decompress the whole extent.
Imagine if we have to read 8 compressed extents at the same time with
BTRFS_MAX_COMPRESSED at 128M; that is already 1GiB just for the
decompressed data.

So I'm afraid we can not go super large on that value.
>
> Do you have any idea on how complicated the long term fix is or when it might added? v6.8 maybe?

Not in the near term, at least; I'm not aware of any ongoing project
related to this.

Thanks,
Qu
>
> Thank you for your prompt responses. Sending the backup again will take some days but I will email you to tell you if disabling compression fixes the issue.
>
> Sincerely,
> David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-09-29  1:01               ` Qu Wenruo
@ 2023-10-13 22:28                 ` fdavidl073rnovn
  2023-10-13 22:32                   ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: fdavidl073rnovn @ 2023-10-13 22:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Linux Btrfs


Sep 29, 2023, 01:02 by quwenruo.btrfs@gmx.com:

>
>
> On 2023/9/29 09:02, fdavidl073rnovn@tutanota.com wrote:
>
>>
>>
>> Sep 27, 2023, 04:53 by quwenruo.btrfs@gmx.com:
>>
>>>
>>> The compression is the easily way to create tons of small file extents
>>> (the limit of a compressed extent is only 128K).
>>>
>>> Furthermore, each file extent would need an in-memory structure (struct
>>> extent_map, for a debug kernel, it's 122 bytes) to cache the contents.
>>>
>>> Thus for a 8TiB file with all compressed file extents at their max size
>>> (pretty common if it's only for backup).
>>> Then we still have 512M file extents.
>>>
>>> Just multiple that by 122, you can see how this go crazy.
>>>
>>> But still, if you're only deleting the file, the result shouldn't go
>>> this crazy, as deleting itself won't try to read the file extents thus
>>> no such cache.
>>>
>>> However as long as we start doing read/write, the cache can go very
>>> large, especially if you use compress, and only get released when the
>>> whole inode get released from kernel.
>>>
>>> On the other hand, if you go uncompressed data, the maximum file extent
>>> size is enlarged to 128M (a 1024x increase), thus a huge reduce in the
>>> number of extents.
>>>
>>> In the long run I guess we need some way to release the extent_map when
>>> low on memory.
>>> But for now, I'm afraid I don't have better suggestion other than
>>> turning off compression and defrag the compressed files using newer
>>> kernel (v6.2 and newer).
>>>
>>> In v6.2, there is a patch to prevent defrag from populating the extent
>>> map cache, thus it won't take all the memory just by defrag.
>>> And with all those files converted from compression, I believe the
>>> situation would be greatly improved.
>>>
>>> Thanks,
>>> Qu
>>>
>> The backup itself is gone and will need to be re-sent. If I'm understanding things properly then by mounting the btrfs device for the backup without compression and enforcing send protocol one it should be written uncompressed which will avoid the issue correct?
>>
>
> IIRC yes.
>
> The send stream only contains the decompressed content, thus as long as
> it's mounted without compression, the received data on-disk would not be
> compressed either.
>
>>
>> I was also looking at the source code and it seems relatively straight forward to change BTRFS_MAX_COMPRESSED and BTRFS_MAX_UNCOMPRESSED to SZ_128M or somewhere in between like SZ_8M. Do you have any thoughts on how well that might work?
>>
>
> The size is a trade-off between space wasted by COW and memory needed to
> decompress an extent.
>
> Remember even if we only need part of the compressed extent, we still
> need to decompress the whole extent.
> Image if we have to read 8 compressed extents in the same time, and the
> BTRFS_MAX_COMPRESSED is 128M.
>
> So I'm afraid we can not got super large on the value.
>
>>
>> Do you have any idea on how complicated the long term fix is or when it might added? v6.8 maybe?
>>
>
> At least not near term, I'm not aware of any ongoing project related to
> this.
>
> Thanks,
> Qu
>
>>
>> Thank you for your prompt responses. Sending the backup again will take some days but I will email you to tell you if disabling compression fixes the issue.
>>
>> Sincerely,
>> David
>>
To follow up on this, I was successfully able to transfer my backup and then both make and delete snapshots of it without running out of memory. I will update my ticket on the kernel.org bug tracker, and I think there should be a warning about this in the documentation.

Is there anything else I can do to make sure this is addressed at some point? I would like to eventually be able to re-enable compression as it was saving me several terabytes.

Sincerely,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-10-13 22:28                 ` fdavidl073rnovn
@ 2023-10-13 22:32                   ` Qu Wenruo
  2023-10-14 19:09                     ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2023-10-13 22:32 UTC (permalink / raw)
  To: fdavidl073rnovn; +Cc: Qu Wenruo, Linux Btrfs



On 2023/10/14 08:58, fdavidl073rnovn@tutanota.com wrote:
>
> Sep 29, 2023, 01:02 by quwenruo.btrfs@gmx.com:
>
>>
>>
>> On 2023/9/29 09:02, fdavidl073rnovn@tutanota.com wrote:
>>
>>>
>>>
>>> Sep 27, 2023, 04:53 by quwenruo.btrfs@gmx.com:
>>>
>>>>
>>>> The compression is the easily way to create tons of small file extents
>>>> (the limit of a compressed extent is only 128K).
>>>>
>>>> Furthermore, each file extent would need an in-memory structure (struct
>>>> extent_map, for a debug kernel, it's 122 bytes) to cache the contents.
>>>>
>>>> Thus for a 8TiB file with all compressed file extents at their max size
>>>> (pretty common if it's only for backup).
>>>> Then we still have 512M file extents.
>>>>
>>>> Just multiple that by 122, you can see how this go crazy.
>>>>
>>>> But still, if you're only deleting the file, the result shouldn't go
>>>> this crazy, as deleting itself won't try to read the file extents thus
>>>> no such cache.
>>>>
>>>> However as long as we start doing read/write, the cache can go very
>>>> large, especially if you use compress, and only get released when the
>>>> whole inode get released from kernel.
>>>>
>>>> On the other hand, if you go uncompressed data, the maximum file extent
>>>> size is enlarged to 128M (a 1024x increase), thus a huge reduce in the
>>>> number of extents.
>>>>
>>>> In the long run I guess we need some way to release the extent_map when
>>>> low on memory.
>>>> But for now, I'm afraid I don't have better suggestion other than
>>>> turning off compression and defrag the compressed files using newer
>>>> kernel (v6.2 and newer).
>>>>
>>>> In v6.2, there is a patch to prevent defrag from populating the extent
>>>> map cache, thus it won't take all the memory just by defrag.
>>>> And with all those files converted from compression, I believe the
>>>> situation would be greatly improved.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>> The backup itself is gone and will need to be re-sent. If I'm understanding things properly then by mounting the btrfs device for the backup without compression and enforcing send protocol one it should be written uncompressed which will avoid the issue correct?
>>>
>>
>> IIRC yes.
>>
>> The send stream only contains the decompressed content, thus as long as
>> it's mounted without compression, the received data on-disk would not be
>> compressed either.
>>
>>>
>>> I was also looking at the source code and it seems relatively straight forward to change BTRFS_MAX_COMPRESSED and BTRFS_MAX_UNCOMPRESSED to SZ_128M or somewhere in between like SZ_8M. Do you have any thoughts on how well that might work?
>>>
>>
>> The size is a trade-off between space wasted by COW and memory needed to
>> decompress an extent.
>>
>> Remember even if we only need part of the compressed extent, we still
>> need to decompress the whole extent.
>> Image if we have to read 8 compressed extents in the same time, and the
>> BTRFS_MAX_COMPRESSED is 128M.
>>
>> So I'm afraid we can not got super large on the value.
>>
>>>
>>> Do you have any idea on how complicated the long term fix is or when it might added? v6.8 maybe?
>>>
>>
>> At least not near term, I'm not aware of any ongoing project related to
>> this.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thank you for your prompt responses. Sending the backup again will take some days but I will email you to tell you if disabling compression fixes the issue.
>>>
>>> Sincerely,
>>> David
>>>
> To follow up on this I was successfully able to transfer my backup then both make and delete snapshots of it without running out of memory. I will update my ticket on there bug tracker if and I think there should be a warning about this in the documents.
>
> Is there anything else I can do to make sure this is addressed at some point? I would like to eventually be able to re-enable compression as it was saving me several terabytes.

I believe Filipe has been working on improving the extent map code
recently. You may want to test his patchset when it comes out.

Otherwise you may need to keep away from compression for now.

Thanks,
Qu
>
> Sincerely,
> David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-10-13 22:32                   ` Qu Wenruo
@ 2023-10-14 19:09                     ` Chris Murphy
  2023-10-14 22:10                       ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2023-10-14 19:09 UTC (permalink / raw)
  To: Qu Wenruo, fdavidl073rnovn; +Cc: Qu WenRuo, Btrfs BTRFS



On Fri, Oct 13, 2023, at 6:32 PM, Qu Wenruo wrote:
> On 2023/10/14 08:58, fdavidl073rnovn@tutanota.com wrote:

>> Is there anything else I can do to make sure this is addressed at some point? I would like to eventually be able to re-enable compression as it was saving me several terabytes.
>
> I believe Filipe is working on improving the extent map code recently.
> You may want to test his patchset when it comes out.
>
> Otherwise you may need to keep away from compression for now.

Is the cost of tracking extents reduced at all by increasing leaf/node size? The number of extents is the same, so that cost wouldn't be reduced - and maybe that's the bulk of the problem. But if it's also related to the cost of having so many leaves, maybe it would help?
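
(For reference I mean the mkfs-time nodesize setting, i.e. something like the sketch below, with the device name as a placeholder.)

  # node/leaf size can only be chosen at mkfs time; 16k is the default and 64k the maximum
  mkfs.btrfs --nodesize 64k /dev/sdX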

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Deleting large amounts of data causes system freeze due to OOM.
  2023-10-14 19:09                     ` Chris Murphy
@ 2023-10-14 22:10                       ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2023-10-14 22:10 UTC (permalink / raw)
  To: Chris Murphy, Qu Wenruo, fdavidl073rnovn; +Cc: Btrfs BTRFS



On 2023/10/15 05:39, Chris Murphy wrote:
> 
> 
> On Fri, Oct 13, 2023, at 6:32 PM, Qu Wenruo wrote:
>> On 2023/10/14 08:58, fdavidl073rnovn@tutanota.com wrote:
> 
>>> Is there anything else I can do to make sure this is addressed at some point? I would like to eventually be able to re-enable compression as it was saving me several terabytes.
>>
>> I believe Filipe is working on improving the extent map code recently.
>> You may want to test his patchset when it comes out.
>>
>> Otherwise you may need to keep away from compression for now.
> 
> Is the cost of tracking extents reduced at all by increasing leaf/node size?

Unfortunately no.

The cost is related to the size of the extent_map structure, which is 
independent of node/leaf size.


> The number of extents is the same, so that cost wouldn't be reduced - and maybe that's the bulk of the problem. But if it's also related to the cost of having so many leaves, maybe it would help?

For metadata, the pages are cached in an inode's address space, so the 
MM layer is able to properly drop the unused ones AFAIK.

Meanwhile we have no way to free unused extent_maps for now.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-10-14 22:10 UTC | newest]

Thread overview: 13+ messages
2023-09-13  2:28 Deleting large amounts of data causes system freeze due to OOM fdavidl073rnovn
2023-09-13  5:55 ` Qu Wenruo
2023-09-14  3:38   ` fdavidl073rnovn
2023-09-14  5:12     ` Qu Wenruo
2023-09-14 23:08       ` fdavidl073rnovn
2023-09-27  1:46         ` fdavidl073rnovn
2023-09-27  4:53           ` Qu Wenruo
2023-09-28 23:32             ` fdavidl073rnovn
2023-09-29  1:01               ` Qu Wenruo
2023-10-13 22:28                 ` fdavidl073rnovn
2023-10-13 22:32                   ` Qu Wenruo
2023-10-14 19:09                     ` Chris Murphy
2023-10-14 22:10                       ` Qu Wenruo
