linux-btrfs.vger.kernel.org archive mirror
* Deleting large amounts of data causes system freeze due to OOM.
@ 2023-09-13  2:28 fdavidl073rnovn
From: fdavidl073rnovn @ 2023-09-13  2:28 UTC
  To: Linux Btrfs

Dear Btrfs Mailing List,

Full disclosure: I reported this on kernel.org, but I am hoping to get more exposure on the mailing list.

When I delete several terabytes of data, memory usage increases until the system becomes entirely unresponsive. This has been an issue across several kernel versions, since at least 5.19, and continues up to 6.5.2-artix1-1. The machine is an older computer with several hard drives, eight gigabytes of memory, and a four-core x86_64 CPU. Slabtop output right before the system becomes unresponsive shows about four gigabytes used by khugepaged_mm_slot and three gigabytes used by btrfs_extent_map. All of this happens over the span of a couple of minutes, and during that time btrfs-transaction uses a moderate amount of CPU time.
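
In case anyone wants to reproduce the monitoring, here is a rough sketch of the kind of slab watching involved (illustrative only: it assumes the standard slabinfo 2.x column layout and needs root to read /proc/slabinfo):

#!/usr/bin/env python3
# Sketch: poll /proc/slabinfo and print the memory held by the two
# caches that grow during the delete. Assumes the usual column layout
# (name, active_objs, num_objs, objsize, ...); run as root.
import time

CACHES = ("khugepaged_mm_slot", "btrfs_extent_map")

def slab_bytes():
    usage = {}
    with open("/proc/slabinfo") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] in CACHES:
                # num_objs * objsize approximates the cache's memory use
                usage[fields[0]] = int(fields[2]) * int(fields[3])
    return usage

while True:
    for name, nbytes in sorted(slab_bytes().items()):
        print(f"{name}: {nbytes / 2**30:.2f} GiB")
    time.sleep(5)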

While this is happening, the free space reported by btrfs filesystem usage slowly falls until the system is unresponsive. If I delete smaller amounts of data at a time, memory usage still increases, but as long as the system doesn't run out of memory, all the disk space is eventually freed and memory usage comes back down. Deleting things bit by bit isn't a useful workaround, though: the same thing happens when deleting a snapshot, even one that won't free any disk space, and I am trying to use this computer for incremental backups.
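
If the numbers would help, something like the following could log available memory alongside the filesystem's free space until the freeze (a sketch only: the mount point and log path are placeholders, and it assumes the "Free (estimated)" line in btrfs filesystem usage -b output):

#!/usr/bin/env python3
# Sketch: record MemAvailable and the filesystem's estimated free space
# every few seconds, so the last samples before the freeze survive on disk.
import re
import subprocess
import time

MOUNT = "/mnt/backup"  # placeholder mount point; adjust as needed

def mem_available_kib():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    return 0

def btrfs_free_bytes():
    out = subprocess.run(["btrfs", "filesystem", "usage", "-b", MOUNT],
                         capture_output=True, text=True, check=True).stdout
    m = re.search(r"Free \(estimated\):\s+(\d+)", out)
    return int(m.group(1)) if m else 0

with open("/var/tmp/btrfs-oom.log", "a", buffering=1) as log:
    while True:
        log.write(f"{time.time():.0f} mem_avail_kib={mem_available_kib()} "
                  f"free_bytes={btrfs_free_bytes()}\n")
        time.sleep(5)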

The only variables that seem to make a difference are the checksum algorithm and slower hard drives. The checksum changes how the issue behaves on remount: with xxhash, remounting the filesystem seems to restart or continue the delete operation, causing another out-of-memory condition, while with the default crc32c the filesystem comes back in its original state from before the delete command was issued and nothing further happens (I haven't tried any other checksums). Having slower (SMR) drives as part of the filesystem makes the out-of-memory condition happen much faster. Nothing else (RAID level, compression, kernel version, block group tree) has seemed to change anything.

My speculation is that the operations needed to finish the delete are being queued up in memory faster than they can be completed, until the system runs out of memory entirely. That would explain the symptoms, why slower drives make it worse, and why deleting small amounts of data at a time works. I'm not sure why the checksum algorithm changes the behavior on remount.
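
As a back-of-the-envelope check on that theory (the 144-byte object size below is only an assumed ballpark; slabtop's objsize column gives the real per-object size for a given kernel):

slab_bytes = 3 * 2**30   # ~3 GiB reported for btrfs_extent_map
objsize = 144            # assumed bytes per extent map object; check slabtop
print(f"~{slab_bytes // objsize / 1e6:.0f} million extent maps")
# prints: ~22 million extent maps

Tens of millions of extent maps pinned at once would be consistent with the slab numbers, and with slower drives making things worse, since the queue would drain more slowly than it fills.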

I am willing to do destructive testing on this data to hopefully get this fixed.

Sincerely,
David

Thread overview: 13+ messages
2023-09-13  2:28 Deleting large amounts of data causes system freeze due to OOM fdavidl073rnovn
2023-09-13  5:55 ` Qu Wenruo
2023-09-14  3:38   ` fdavidl073rnovn
2023-09-14  5:12     ` Qu Wenruo
2023-09-14 23:08       ` fdavidl073rnovn
2023-09-27  1:46         ` fdavidl073rnovn
2023-09-27  4:53           ` Qu Wenruo
2023-09-28 23:32             ` fdavidl073rnovn
2023-09-29  1:01               ` Qu Wenruo
2023-10-13 22:28                 ` fdavidl073rnovn
2023-10-13 22:32                   ` Qu Wenruo
2023-10-14 19:09                     ` Chris Murphy
2023-10-14 22:10                       ` Qu Wenruo
