From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v3 08/10] btrfs: add a shrinker for extent maps
Date: Tue, 16 Apr 2024 13:25:55 -0400 [thread overview]
Message-ID: <20240416172555.GA2094489@perftesting> (raw)
In-Reply-To: <4bfde54904b5a91a71eb0d86b9c78367865f93d8.1713267925.git.fdmanana@suse.com>
On Tue, Apr 16, 2024 at 02:08:10PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Extent maps are used either to represent existing file extent items, or to
> represent new extents that are going to be written and the respective file
> extent items are created when the ordered extent completes.
>
> We currently don't have any limit for how many extent maps we can have,
> neither per inode nor globally. Most of the time this not too noticeable
> because extent maps are removed in the following situations:
>
> 1) When evicting an inode;
>
> 2) When releasing folios (pages) through the btrfs_release_folio() address
> space operation callback.
>
> However we won't release extent maps in the folio range if the folio is
> either dirty or under writeback or if the inode's i_size is less than
> or equals to 16M (see try_release_extent_mapping(). This 16M i_size
> constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
> extent_io and extent_state optimizations"), but there's no explanation
> about why we have it or why the 16M value.
>
> This means that for buffered IO we can reach an OOM situation due to too
> many extent maps if either of the following happens:
>
> 1) There's a set of tasks constantly doing IO on many files with a size
> not larger than 16M, specially if they keep the files open for very
> long periods, therefore preventing inode eviction.
>
> This requires a really high number of such files, and having many non
> mergeable extent maps (due to random 4K writes for example) and a
> machine with very little memory;
>
> 2) There's a set tasks constantly doing random write IO (therefore
> creating many non mergeable extent maps) on files and keeping them
> open for long periods of time, so inode eviction doesn't happen and
> there's always a lot of dirty pages or pages under writeback,
> preventing btrfs_release_folio() from releasing the respective extent
> maps.
>
> This second case was actually reported in the thread pointed by the Link
> tag below, and it requires a very large file under heavy IO and a machine
> with very little amount of RAM, which is probably hard to happen in
> practice in a real world use case.
>
> However when using direct IO this is not so hard to happen, because the
> page cache is not used, and therefore btrfs_release_folio() is never
> called. Which means extent maps are dropped only when evicting the inode,
> and that means that if we have tasks that keep a file descriptor open and
> keep doing IO on a very large file (or files), we can exhaust memory due
> to an unbounded amount of extent maps. This is especially easy to happen
> if we have a huge file with millions of small extents and their extent
> maps are not mergeable (non contiguous offsets and disk locations).
> This was reported in that thread with the following fio test:
>
> $ cat test.sh
> #!/bin/bash
>
> DEV=/dev/sdj
> MNT=/mnt/sdj
> MOUNT_OPTIONS="-o ssd"
> MKFS_OPTIONS=""
>
> cat <<EOF > /tmp/fio-job.ini
> [global]
> name=fio-rand-write
> filename=$MNT/fio-rand-write
> rw=randwrite
> bs=4K
> direct=1
> numjobs=16
> fallocate=none
> time_based
> runtime=90000
>
> [file1]
> size=300G
> ioengine=libaio
> iodepth=16
>
> EOF
>
> umount $MNT &> /dev/null
> mkfs.btrfs -f $MKFS_OPTIONS $DEV
> mount $MOUNT_OPTIONS $DEV $MNT
>
> fio /tmp/fio-job.ini
> umount $MNT
>
> Monitoring the btrfs_extent_map slab while running the test with:
>
> $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
> /sys/kernel/slab/btrfs_extent_map/total_objects'
>
> Shows the number of active and total extent maps skyrocketing to tens of
> millions, and on systems with a short amount of memory it's easy and quick
> to get into an OOM situation, as reported in that thread.
>
> So to avoid this issue add a shrinker that will remove extents maps, as
> long as they are not pinned, and takes proper care with any concurrent
> fsync to avoid missing extents (setting the full sync flag while in the
> middle of a fast fsync). This shrinker is triggered through the callbacks
> nr_cached_objects and free_cached_objects of struct super_operations.
>
> The shrinker will iterates over all roots and over all inodes of each
> root, and keeps track of the last scanned root and inode, so that the
> next time it runs, it starts from that root and from the next inode.
> This is similar to what xfs does for its inode reclaim (implements those
> callbacks, and cycles through inodes by starting from where it ended
> last time).
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
This is great, thanks Filipe!
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
next prev parent reply other threads:[~2024-04-16 17:25 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
2024-04-10 11:28 ` [PATCH 01/11] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 02/11] btrfs: tests: error out on unexpected extent map reference count fdmanana
2024-04-10 11:28 ` [PATCH 03/11] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
2024-04-10 11:28 ` [PATCH 04/11] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 05/11] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-10 11:28 ` [PATCH 06/11] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 07/11] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 08/11] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-11 5:39 ` Qu Wenruo
2024-04-11 10:09 ` Filipe Manana
2024-04-10 11:28 ` [PATCH 09/11] btrfs: add a shrinker for " fdmanana
2024-04-11 5:58 ` Qu Wenruo
2024-04-11 10:15 ` Filipe Manana
2024-04-10 11:28 ` [PATCH 10/11] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-10 11:28 ` [PATCH 11/11] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-11 5:25 ` [PATCH 00/11] btrfs: add a shrinker for extent maps Qu Wenruo
2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
2024-04-11 16:18 ` [PATCH v2 01/15] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
2024-04-11 16:18 ` [PATCH v2 02/15] btrfs: tests: error out on unexpected extent map reference count fdmanana
2024-04-11 16:18 ` [PATCH v2 03/15] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
2024-04-11 16:18 ` [PATCH v2 04/15] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-11 16:18 ` [PATCH v2 05/15] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-11 16:19 ` [PATCH v2 06/15] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-11 16:19 ` [PATCH v2 07/15] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-11 16:19 ` [PATCH v2 08/15] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
2024-04-11 23:25 ` Qu Wenruo
2024-04-11 16:19 ` [PATCH v2 09/15] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
2024-04-11 23:25 ` Qu Wenruo
2024-04-11 16:19 ` [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-15 2:47 ` kernel test robot
2024-04-11 16:19 ` [PATCH v2 11/15] btrfs: export find_next_inode() as btrfs_find_first_inode() fdmanana
2024-04-11 23:14 ` Qu Wenruo
2024-04-11 16:19 ` [PATCH v2 12/15] btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() fdmanana
2024-04-11 23:15 ` Qu Wenruo
2024-04-11 16:19 ` [PATCH v2 13/15] btrfs: add a shrinker for extent maps fdmanana
2024-04-12 20:06 ` Josef Bacik
2024-04-13 11:07 ` Filipe Manana
2024-04-14 10:38 ` Filipe Manana
2024-04-14 13:02 ` Josef Bacik
2024-04-15 11:24 ` Filipe Manana
2024-04-11 16:19 ` [PATCH v2 14/15] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-11 16:19 ` [PATCH v2 15/15] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
2024-04-16 13:08 ` [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-17 11:08 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 02/10] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-17 11:09 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-17 11:10 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-17 11:11 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
2024-04-17 11:11 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 06/10] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
2024-04-17 11:12 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 07/10] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-17 11:14 ` Johannes Thumshirn
2024-04-16 13:08 ` [PATCH v3 08/10] btrfs: add a shrinker for " fdmanana
2024-04-16 17:25 ` Josef Bacik [this message]
2024-04-16 13:08 ` [PATCH v3 09/10] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-16 13:08 ` [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-16 17:26 ` Josef Bacik
2024-04-17 11:20 ` Johannes Thumshirn
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240416172555.GA2094489@perftesting \
--to=josef@toxicpanda.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.