From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v3 08/10] btrfs: add a shrinker for extent maps
Date: Tue, 16 Apr 2024 13:25:55 -0400 [thread overview]
Message-ID: <20240416172555.GA2094489@perftesting> (raw)
In-Reply-To: <4bfde54904b5a91a71eb0d86b9c78367865f93d8.1713267925.git.fdmanana@suse.com>
On Tue, Apr 16, 2024 at 02:08:10PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Extent maps are used either to represent existing file extent items, or to
> represent new extents that are going to be written and the respective file
> extent items are created when the ordered extent completes.
>
> We currently don't have any limit for how many extent maps we can have,
> neither per inode nor globally. Most of the time this not too noticeable
> because extent maps are removed in the following situations:
>
> 1) When evicting an inode;
>
> 2) When releasing folios (pages) through the btrfs_release_folio() address
> space operation callback.
>
> However we won't release extent maps in the folio range if the folio is
> either dirty or under writeback or if the inode's i_size is less than
> or equals to 16M (see try_release_extent_mapping(). This 16M i_size
> constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
> extent_io and extent_state optimizations"), but there's no explanation
> about why we have it or why the 16M value.
>
> This means that for buffered IO we can reach an OOM situation due to too
> many extent maps if either of the following happens:
>
> 1) There's a set of tasks constantly doing IO on many files with a size
> not larger than 16M, specially if they keep the files open for very
> long periods, therefore preventing inode eviction.
>
> This requires a really high number of such files, and having many non
> mergeable extent maps (due to random 4K writes for example) and a
> machine with very little memory;
>
> 2) There's a set tasks constantly doing random write IO (therefore
> creating many non mergeable extent maps) on files and keeping them
> open for long periods of time, so inode eviction doesn't happen and
> there's always a lot of dirty pages or pages under writeback,
> preventing btrfs_release_folio() from releasing the respective extent
> maps.
>
> This second case was actually reported in the thread pointed by the Link
> tag below, and it requires a very large file under heavy IO and a machine
> with very little amount of RAM, which is probably hard to happen in
> practice in a real world use case.
>
> However when using direct IO this is not so hard to happen, because the
> page cache is not used, and therefore btrfs_release_folio() is never
> called. Which means extent maps are dropped only when evicting the inode,
> and that means that if we have tasks that keep a file descriptor open and
> keep doing IO on a very large file (or files), we can exhaust memory due
> to an unbounded amount of extent maps. This is especially easy to happen
> if we have a huge file with millions of small extents and their extent
> maps are not mergeable (non contiguous offsets and disk locations).
> This was reported in that thread with the following fio test:
>
> $ cat test.sh
> #!/bin/bash
>
> DEV=/dev/sdj
> MNT=/mnt/sdj
> MOUNT_OPTIONS="-o ssd"
> MKFS_OPTIONS=""
>
> cat <<EOF > /tmp/fio-job.ini
> [global]
> name=fio-rand-write
> filename=$MNT/fio-rand-write
> rw=randwrite
> bs=4K
> direct=1
> numjobs=16
> fallocate=none
> time_based
> runtime=90000
>
> [file1]
> size=300G
> ioengine=libaio
> iodepth=16
>
> EOF
>
> umount $MNT &> /dev/null
> mkfs.btrfs -f $MKFS_OPTIONS $DEV
> mount $MOUNT_OPTIONS $DEV $MNT
>
> fio /tmp/fio-job.ini
> umount $MNT
>
> Monitoring the btrfs_extent_map slab while running the test with:
>
> $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
> /sys/kernel/slab/btrfs_extent_map/total_objects'
>
> Shows the number of active and total extent maps skyrocketing to tens of
> millions, and on systems with a short amount of memory it's easy and quick
> to get into an OOM situation, as reported in that thread.
>
> So to avoid this issue add a shrinker that will remove extents maps, as
> long as they are not pinned, and takes proper care with any concurrent
> fsync to avoid missing extents (setting the full sync flag while in the
> middle of a fast fsync). This shrinker is triggered through the callbacks
> nr_cached_objects and free_cached_objects of struct super_operations.
>
> The shrinker will iterates over all roots and over all inodes of each
> root, and keeps track of the last scanned root and inode, so that the
> next time it runs, it starts from that root and from the next inode.
> This is similar to what xfs does for its inode reclaim (implements those
> callbacks, and cycles through inodes by starting from where it ended
> last time).
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
This is great, thanks Filipe!
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
next prev parent reply other threads:[~2024-04-16 17:25 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
2024-04-10 11:28 ` [PATCH 01/11] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 02/11] btrfs: tests: error out on unexpected extent map reference count fdmanana
2024-04-10 11:28 ` [PATCH 03/11] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
2024-04-10 11:28 ` [PATCH 04/11] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 05/11] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-10 11:28 ` [PATCH 06/11] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 07/11] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 08/11] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-11 5:39 ` Qu Wenruo
2024-04-11 10:09 ` Filipe Manana
2024-04-10 11:28 ` [PATCH 09/11] btrfs: add a shrinker for " fdmanana
2024-04-11 5:58 ` Qu Wenruo
2024-04-11 10:15 ` Filipe Manana
2024-04-10 11:28 ` [PATCH 10/11] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-10 11:28 ` [PATCH 11/11] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-11 5:25 ` [PATCH 00/11] btrfs: add a shrinker for extent maps Qu Wenruo
2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
2024-04-11 16:18 ` [PATCH v2 01/15] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
2024-04-11 16:18 ` [PATCH v2 02/15] btrfs: tests: error out on unexpected extent map reference count fdmanana
2024-04-11 16:18 ` [PATCH v2 03/15] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
2024-04-11 16:18 ` [PATCH v2 04/15] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-11 16:18 ` [PATCH v2 05/15] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-11 16:19 ` [PATCH v2 06/15] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-11 16:19 ` [PATCH v2 07/15] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-11 16:19 ` [PATCH v2 08/15] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
2024-04-11 23:25 ` Qu Wenruo
2024-04-11 16:19 ` [PATCH v2 09/15] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
2024-04-11 23:25 ` Qu Wenruo
2024-04-11 16:19 ` [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-15 2:47 ` kernel test robot
2024-04-11 16:19 ` [PATCH v2 11/15] btrfs: export find_next_inode() as btrfs_find_first_inode() fdmanana
2024-04-11 23:14 ` Qu Wenruo
2024-04-11 16:19 ` [PATCH v2 12/15] btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() fdmanana
2024-04-11 23:15 ` Qu Wenruo
2024-04-11 16:19 ` [PATCH v2 13/15] btrfs: add a shrinker for extent maps fdmanana
2024-04-12 20:06 ` Josef Bacik
2024-04-13 11:07 ` Filipe Manana
2024-04-14 10:38 ` Filipe Manana
2024-04-14 13:02 ` Josef Bacik
2024-04-15 11:24 ` Filipe Manana
2024-04-11 16:19 ` [PATCH v2 14/15] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-11 16:19 ` [PATCH v2 15/15] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
2024-04-16 13:08 ` [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-17 11:08 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 02/10] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-17 11:09 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-17 11:10 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-17 11:11 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
2024-04-17 11:11 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 06/10] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
2024-04-17 11:12 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 07/10] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-17 11:14 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 08/10] btrfs: add a shrinker for " fdmanana
2024-04-16 17:25 ` Josef Bacik [this message]
2024-04-16 13:08 ` [PATCH v3 09/10] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-16 13:08 ` [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-16 17:26 ` Josef Bacik
2024-04-17 11:20 ` Johannes Thumshirn
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240416172555.GA2094489@perftesting \
--to=josef@toxicpanda.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox