Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v3 08/10] btrfs: add a shrinker for extent maps
Date: Tue, 16 Apr 2024 13:25:55 -0400	[thread overview]
Message-ID: <20240416172555.GA2094489@perftesting> (raw)
In-Reply-To: <4bfde54904b5a91a71eb0d86b9c78367865f93d8.1713267925.git.fdmanana@suse.com>

On Tue, Apr 16, 2024 at 02:08:10PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Extent maps are used either to represent existing file extent items, or to
> represent new extents that are going to be written and the respective file
> extent items are created when the ordered extent completes.
> 
> We currently don't have any limit for how many extent maps we can have,
> neither per inode nor globally. Most of the time this not too noticeable
> because extent maps are removed in the following situations:
> 
> 1) When evicting an inode;
> 
> 2) When releasing folios (pages) through the btrfs_release_folio() address
>    space operation callback.
> 
>    However we won't release extent maps in the folio range if the folio is
>    either dirty or under writeback or if the inode's i_size is less than
>    or equals to 16M (see try_release_extent_mapping(). This 16M i_size
>    constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
>    extent_io and extent_state optimizations"), but there's no explanation
>    about why we have it or why the 16M value.
> 
> This means that for buffered IO we can reach an OOM situation due to too
> many extent maps if either of the following happens:
> 
> 1) There's a set of tasks constantly doing IO on many files with a size
>    not larger than 16M, specially if they keep the files open for very
>    long periods, therefore preventing inode eviction.
> 
>    This requires a really high number of such files, and having many non
>    mergeable extent maps (due to random 4K writes for example) and a
>    machine with very little memory;
> 
> 2) There's a set tasks constantly doing random write IO (therefore
>    creating many non mergeable extent maps) on files and keeping them
>    open for long periods of time, so inode eviction doesn't happen and
>    there's always a lot of dirty pages or pages under writeback,
>    preventing btrfs_release_folio() from releasing the respective extent
>    maps.
> 
> This second case was actually reported in the thread pointed by the Link
> tag below, and it requires a very large file under heavy IO and a machine
> with very little amount of RAM, which is probably hard to happen in
> practice in a real world use case.
> 
> However when using direct IO this is not so hard to happen, because the
> page cache is not used, and therefore btrfs_release_folio() is never
> called. Which means extent maps are dropped only when evicting the inode,
> and that means that if we have tasks that keep a file descriptor open and
> keep doing IO on a very large file (or files), we can exhaust memory due
> to an unbounded amount of extent maps. This is especially easy to happen
> if we have a huge file with millions of small extents and their extent
> maps are not mergeable (non contiguous offsets and disk locations).
> This was reported in that thread with the following fio test:
> 
>    $ cat test.sh
>    #!/bin/bash
> 
>    DEV=/dev/sdj
>    MNT=/mnt/sdj
>    MOUNT_OPTIONS="-o ssd"
>    MKFS_OPTIONS=""
> 
>    cat <<EOF > /tmp/fio-job.ini
>    [global]
>    name=fio-rand-write
>    filename=$MNT/fio-rand-write
>    rw=randwrite
>    bs=4K
>    direct=1
>    numjobs=16
>    fallocate=none
>    time_based
>    runtime=90000
> 
>    [file1]
>    size=300G
>    ioengine=libaio
>    iodepth=16
> 
>    EOF
> 
>    umount $MNT &> /dev/null
>    mkfs.btrfs -f $MKFS_OPTIONS $DEV
>    mount $MOUNT_OPTIONS $DEV $MNT
> 
>    fio /tmp/fio-job.ini
>    umount $MNT
> 
> Monitoring the btrfs_extent_map slab while running the test with:
> 
>    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
>                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> 
> Shows the number of active and total extent maps skyrocketing to tens of
> millions, and on systems with a short amount of memory it's easy and quick
> to get into an OOM situation, as reported in that thread.
> 
> So to avoid this issue add a shrinker that will remove extents maps, as
> long as they are not pinned, and takes proper care with any concurrent
> fsync to avoid missing extents (setting the full sync flag while in the
> middle of a fast fsync). This shrinker is triggered through the callbacks
> nr_cached_objects and free_cached_objects of struct super_operations.
> 
> The shrinker will iterates over all roots and over all inodes of each
> root, and keeps track of the last scanned root and inode, so that the
> next time it runs, it starts from that root and from the next inode.
> This is similar to what xfs does for its inode reclaim (implements those
> callbacks, and cycles through inodes by starting from where it ended
> last time).
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

This is great, thanks Filipe!

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

  reply	other threads:[~2024-04-16 17:25 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-10 11:28 [PATCH 00/11] btrfs: add a shrinker for extent maps fdmanana
2024-04-10 11:28 ` [PATCH 01/11] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 02/11] btrfs: tests: error out on unexpected extent map reference count fdmanana
2024-04-10 11:28 ` [PATCH 03/11] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
2024-04-10 11:28 ` [PATCH 04/11] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 05/11] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-10 11:28 ` [PATCH 06/11] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 07/11] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-10 11:28 ` [PATCH 08/11] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-11  5:39   ` Qu Wenruo
2024-04-11 10:09     ` Filipe Manana
2024-04-10 11:28 ` [PATCH 09/11] btrfs: add a shrinker for " fdmanana
2024-04-11  5:58   ` Qu Wenruo
2024-04-11 10:15     ` Filipe Manana
2024-04-10 11:28 ` [PATCH 10/11] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-10 11:28 ` [PATCH 11/11] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-11  5:25 ` [PATCH 00/11] btrfs: add a shrinker for extent maps Qu Wenruo
2024-04-11 16:18 ` [PATCH v2 00/15] " fdmanana
2024-04-11 16:18   ` [PATCH v2 01/15] btrfs: pass an inode to btrfs_add_extent_mapping() fdmanana
2024-04-11 16:18   ` [PATCH v2 02/15] btrfs: tests: error out on unexpected extent map reference count fdmanana
2024-04-11 16:18   ` [PATCH v2 03/15] btrfs: simplify add_extent_mapping() by removing pointless label fdmanana
2024-04-11 16:18   ` [PATCH v2 04/15] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-11 16:18   ` [PATCH v2 05/15] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-11 16:19   ` [PATCH v2 06/15] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-11 16:19   ` [PATCH v2 07/15] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-11 16:19   ` [PATCH v2 08/15] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
2024-04-11 23:25     ` Qu Wenruo
2024-04-11 16:19   ` [PATCH v2 09/15] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
2024-04-11 23:25     ` Qu Wenruo
2024-04-11 16:19   ` [PATCH v2 10/15] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-15  2:47     ` kernel test robot
2024-04-11 16:19   ` [PATCH v2 11/15] btrfs: export find_next_inode() as btrfs_find_first_inode() fdmanana
2024-04-11 23:14     ` Qu Wenruo
2024-04-11 16:19   ` [PATCH v2 12/15] btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() fdmanana
2024-04-11 23:15     ` Qu Wenruo
2024-04-11 16:19   ` [PATCH v2 13/15] btrfs: add a shrinker for extent maps fdmanana
2024-04-12 20:06     ` Josef Bacik
2024-04-13 11:07       ` Filipe Manana
2024-04-14 10:38         ` Filipe Manana
2024-04-14 13:02         ` Josef Bacik
2024-04-15 11:24           ` Filipe Manana
2024-04-11 16:19   ` [PATCH v2 14/15] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-11 16:19   ` [PATCH v2 15/15] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-16 13:08 ` [PATCH v3 00/10] btrfs: add a shrinker for extent maps fdmanana
2024-04-16 13:08   ` [PATCH v3 01/10] btrfs: pass the extent map tree's inode to add_extent_mapping() fdmanana
2024-04-17 11:08     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 02/10] btrfs: pass the extent map tree's inode to clear_em_logging() fdmanana
2024-04-17 11:09     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping() fdmanana
2024-04-17 11:10     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping() fdmanana
2024-04-17 11:11     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping() fdmanana
2024-04-17 11:11     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 06/10] btrfs: pass the extent map tree's inode to try_merge_map() fdmanana
2024-04-17 11:12     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 07/10] btrfs: add a global per cpu counter to track number of used extent maps fdmanana
2024-04-17 11:14     ` Johannes Thumshirn
2024-04-16 13:08   ` [PATCH v3 08/10] btrfs: add a shrinker for " fdmanana
2024-04-16 17:25     ` Josef Bacik [this message]
2024-04-16 13:08   ` [PATCH v3 09/10] btrfs: update comment for btrfs_set_inode_full_sync() about locking fdmanana
2024-04-16 13:08   ` [PATCH v3 10/10] btrfs: add tracepoints for extent map shrinker events fdmanana
2024-04-16 17:26     ` Josef Bacik
2024-04-17 11:20     ` Johannes Thumshirn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240416172555.GA2094489@perftesting \
    --to=josef@toxicpanda.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox