public inbox for linux-btrfs@vger.kernel.org
From: Filipe Manana <fdmanana@kernel.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 2/2] btrfs: send: avoid trashing the page cache
Date: Tue, 17 May 2022 11:46:28 +0100
Message-ID: <20220517104628.GA2611561@falcondesktop>
In-Reply-To: <4174cbc9-6321-4055-6824-a43f7d222846@gmx.com>

On Tue, May 17, 2022 at 06:36:57PM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/5/17 17:39, Filipe Manana wrote:
> > On Tue, May 17, 2022 at 03:26:14PM +0800, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2022/5/17 14:35, Qu Wenruo wrote:
> > > > 
> > > > 
> > > > On 2022/5/6 01:16, fdmanana@kernel.org wrote:
> > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > > 
> > > > > A send operation reads extent data using the buffered IO path for
> > > > > getting extent data to send in write commands, both because it's
> > > > > simple and because it makes use of the generic readahead
> > > > > infrastructure, which results in a massive speedup.
> > > > > 
> > > > > However this fills the page cache with data that, most of the time,
> > > > > is really only used by the send operation - once the write commands
> > > > > are sent, it's not useful to have the data in the page cache anymore.
> > > > > For large snapshots, bringing all data into the page cache eventually
> > > > > leads to the need to evict other data from the page cache that may be
> > > > > more useful for applications (and kernel subsystems).
> > > > > 
> > > > > Even if extents are shared with the subvolume on which a snapshot is
> > > > > based and the data is currently in the page cache due to being read
> > > > > through the subvolume, attempting to read the data through the
> > > > > snapshot will always result in bringing a new copy of the data into
> > > > > another location in the page cache (there's currently no shared
> > > > > memory for shared extents).
> > > > > 
> > > > > So make send evict the data it has read if, when it first opened the
> > > > > inode, its mapping had no pages currently loaded - that is, when
> > > > > inode->i_mapping->nrpages has a value of 0. Do this instead of
> > > > > deciding based on the return value of filemap_range_has_page() before
> > > > > reading an extent, because the generic readahead mechanism may read
> > > > > pages beyond the range we request (and it very often does), which
> > > > > means a call to filemap_range_has_page() would return true due to the
> > > > > readahead that was triggered when processing a previous extent - we
> > > > > don't have a simple way to distinguish this case from the case where
> > > > > the data was brought into the page cache by someone else. So checking
> > > > > whether the mapping's number of pages is 0 when we first open the
> > > > > inode is simple, cheap and generally accomplishes the goal of not
> > > > > trashing the page cache. The only exception is if part of the data
> > > > > was previously loaded into the page cache through the snapshot by
> > > > > some other process - in that case we end up not evicting any data
> > > > > send brings into the page cache, just like before this change - but
> > > > > that is not the common case.
> > > > > 
> > > > > Example scenario, on a box with 32G of RAM:
> > > > > 
> > > > >     $ btrfs subvolume create /mnt/sv1
> > > > >     $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
> > > > > 
> > > > >     $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
> > > > > 
> > > > >     $ free -m
> > > > >                    total        used        free      shared  buff/cache   available
> > > > >     Mem:           31937         186       26866           0        4883       31297
> > > > >     Swap:           8188           0        8188
> > > > >
> > > > >     # After this we have about 4G less of free memory.
> > > > >     $ btrfs send /mnt/snap1 >/dev/null
> > > > >
> > > > >     $ free -m
> > > > >                    total        used        free      shared  buff/cache   available
> > > > >     Mem:           31937         186       22814           0        8935       31297
> > > > >     Swap:           8188           0        8188
> > > > > 
> > > > > The same, obviously, applies to an incremental send.
> > > > > 
> > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > > 
> > > > Unfortunately, this patch seems to cause subpage cases to fail test case
> > > > btrfs/007. The reproducibility is around 50%, so it's better to run with
> > > > "-I 8" to be extra safe.
> > > > 
> > > > And I believe it also causes other send-related failures for subpage cases.
> > > > 
> > > > I guess it's truncate_inode_pages_range() only truncating full pages,
> > > > but for the subpage case, since one sector is smaller than one page, it
> > > > doesn't work as expected?
> > > 
> > > It looks like that's the case.
> > > 
> > > The following diff fixed the send-related bugs here:
> > > (mail client seems to screw up the indent)
> > > 
> > > https://paste.opensuse.org/95871661
> > 
> > That may seem to work often, but it's not really correct because prev_extent_end
> > is already aligned to the page size, except possibly for the first extent processed.
> 
> You're right, my snippet is really just a dirty hack.
> 
> > 
> > For the case of the first processed extent not starting at an offset that is page
> > size aligned, we also need to round down that offset so that we evict the page
> > (and not simply zero out part of its content).
> > 
> > Can you try this instead:  https://pastebin.com/raw/kRPZKYxG ?
> 
> Tested with btrfs/007 and it passed without problems (8 runs).
> 
> Also tested it with the full send group, no regression found here.
> 
> 
> And with this fix, the following subpage failures now all pass.
> 
> - btrfs/007
> - btrfs/016
> - btrfs/025
> - btrfs/034
> - btrfs/097
> - btrfs/149
> - btrfs/252

Thanks for testing and the report! I'll send a v2 with that patch incorporated
(and a few comment fixes).
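
For the record, the gist of that fix (a rough sketch based on the
discussion above, not the exact diff from the paste, so treat it as
illustrative only): keep both ends of the range passed to
truncate_inode_pages_range() page aligned, since that function only
evicts pages fully covered by the range and zeroes the in-range part of
partially covered ones - which is why this only bit the subpage
(sectorsize < PAGE_SIZE) configurations. As prev_extent_end is already
page aligned except possibly for the first processed extent, rounding it
down covers that case:

    if (sctx->clean_page_cache) {
        const u64 end = round_up(offset + len, PAGE_SIZE);

        /*
         * Round the start down too, so that for a first extent that
         * does not start at a page aligned offset we evict the whole
         * page instead of just zeroing part of its content.
         */
        truncate_inode_pages_range(&sctx->cur_inode->i_data,
                       round_down(sctx->prev_extent_end, PAGE_SIZE),
                       end - 1);
        sctx->prev_extent_end = end;
    }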

> 
> Thanks,
> Qu
> 
> > 
> > Thanks.
> > 
> > 
> > > 
> > > Thanks,
> > > Qu
> > > > 
> > > > If needed, I can provide you access to my aarch64 VM for debugging.
> > > > 
> > > > Thanks,
> > > > Qu
> > > > 
> > > > > ---
> > > > >    fs/btrfs/send.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++--
> > > > >    1 file changed, 77 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> > > > > index 55275ba90cb4..d899049dea53 100644
> > > > > --- a/fs/btrfs/send.c
> > > > > +++ b/fs/btrfs/send.c
> > > > > @@ -137,6 +137,8 @@ struct send_ctx {
> > > > >         */
> > > > >        struct inode *cur_inode;
> > > > >        struct file_ra_state ra;
> > > > > +    u64 prev_extent_end;
> > > > > +    bool clean_page_cache;
> > > > > 
> > > > >        /*
> > > > >         * We process inodes by their increasing order, so if before an
> > > > > @@ -5157,6 +5159,28 @@ static int send_extent_data(struct send_ctx *sctx,
> > > > >            }
> > > > >            memset(&sctx->ra, 0, sizeof(struct file_ra_state));
> > > > >            file_ra_state_init(&sctx->ra, sctx->cur_inode->i_mapping);
> > > > > +
> > > > > +        /*
> > > > > +         * It's very likely there are no pages from this inode in the page
> > > > > +         * cache, so after reading extents and sending their data, we clean
> > > > > +         * the page cache to avoid trashing the page cache (adding pressure
> > > > > +         * to the page cache and forcing eviction of other data more useful
> > > > > +         * for applications).
> > > > > +         *
> > > > > +         * We decide if we should clean the page cache simply by checking
> > > > > +         * if the inode's mapping nrpages is 0 when we first open it, and
> > > > > +         * not by using something like filemap_range_has_page() before
> > > > > +         * reading an extent because when we ask the readahead code to
> > > > > +         * read a given file range, it may (and almost always does) read
> > > > > +         * pages from beyond that range (see the documentation for
> > > > > +         * page_cache_sync_readahead()), so it would not be reliable,
> > > > > +         * because after reading the first extent future calls to
> > > > > +         * filemap_range_has_page() would return true because the readahead
> > > > > +         * on the previous extent resulted in reading pages of the current
> > > > > +         * extent as well.
> > > > > +         */
> > > > > +        sctx->clean_page_cache = (sctx->cur_inode->i_mapping->nrpages == 0);
> > > > > +        sctx->prev_extent_end = offset;
> > > > >        }
> > > > > 
> > > > >        while (sent < len) {
> > > > > @@ -5168,6 +5192,33 @@ static int send_extent_data(struct send_ctx *sctx,
> > > > >                return ret;
> > > > >            sent += size;
> > > > >        }
> > > > > +
> > > > > +    if (sctx->clean_page_cache) {
> > > > > +        const u64 end = round_up(offset + len, PAGE_SIZE);
> > > > > +
> > > > > +        /*
> > > > > +         * Always start from the end offset of the last processed
> > > > > extent.
> > > > > +         * This is because the readahead code may (and very often does)
> > > > > +         * read pages beyond the range we request for readahead. So if
> > > > > +         * we have an extent layout like this:
> > > > > +         *
> > > > > +         *            [ extent A ] [ extent B ] [ extent C ]
> > > > > +         *
> > > > > +         * When we ask page_cache_sync_readahead() to read extent A, it
> > > > > +         * may also trigger reads for pages of extent B. If we are doing
> > > > > +         * an incremental send and extent B has not changed between the
> > > > > +         * parent and send snapshots, some or all of its pages may end
> > > > > +         * up being read and placed in the page cache. So when truncating
> > > > > +         * the page cache we always start from the end offset of the
> > > > > +         * previously processed extent up to the end of the current
> > > > > +         * extent.
> > > > > +         */
> > > > > +        truncate_inode_pages_range(&sctx->cur_inode->i_data,
> > > > > +                       sctx->prev_extent_end,
> > > > > +                       end - 1);
> > > > > +        sctx->prev_extent_end = end;
> > > > > +    }
> > > > > +
> > > > >        return 0;
> > > > >    }
> > > > > 
> > > > > @@ -6172,6 +6223,30 @@ static int btrfs_unlink_all_paths(struct send_ctx *sctx)
> > > > >        return ret;
> > > > >    }
> > > > > 
> > > > > +static void close_current_inode(struct send_ctx *sctx)
> > > > > +{
> > > > > +    u64 i_size;
> > > > > +
> > > > > +    if (sctx->cur_inode == NULL)
> > > > > +        return;
> > > > > +
> > > > > +    i_size = i_size_read(sctx->cur_inode);
> > > > > +
> > > > > +    /*
> > > > > +     * If we are doing an incremental send, we may have extents between
> > > > > +     * the last processed extent and the i_size that have not been
> > > > > +     * processed because they haven't changed but we may have read some
> > > > > +     * of their pages through readahead, see the comments at
> > > > > +     * send_extent_data().
> > > > > +     */
> > > > > +    if (sctx->clean_page_cache && sctx->prev_extent_end < i_size)
> > > > > +        truncate_inode_pages_range(&sctx->cur_inode->i_data,
> > > > > +                       sctx->prev_extent_end,
> > > > > +                       round_up(i_size, PAGE_SIZE) - 1);
> > > > > +
> > > > > +    iput(sctx->cur_inode);
> > > > > +    sctx->cur_inode = NULL;
> > > > > +}
> > > > > +
> > > > >    static int changed_inode(struct send_ctx *sctx,
> > > > >                 enum btrfs_compare_tree_result result)
> > > > >    {
> > > > > @@ -6182,8 +6257,7 @@ static int changed_inode(struct send_ctx *sctx,
> > > > >        u64 left_gen = 0;
> > > > >        u64 right_gen = 0;
> > > > > 
> > > > > -    iput(sctx->cur_inode);
> > > > > -    sctx->cur_inode = NULL;
> > > > > +    close_current_inode(sctx);
> > > > > 
> > > > >        sctx->cur_ino = key->objectid;
> > > > >        sctx->cur_inode_new_gen = 0;
> > > > > @@ -7671,7 +7745,7 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
> > > > > 
> > > > >            name_cache_free(sctx);
> > > > > 
> > > > > -        iput(sctx->cur_inode);
> > > > > +        close_current_inode(sctx);
> > > > > 
> > > > >            kfree(sctx);
> > > > >        }


Thread overview: 13+ messages
2022-05-05 17:16 [PATCH 0/2] btrfs: teach send to avoid trashing the page cache with data fdmanana
2022-05-05 17:16 ` [PATCH 1/2] btrfs: send: keep the current inode open while processing it fdmanana
2022-05-05 17:16 ` [PATCH 2/2] btrfs: send: avoid trashing the page cache fdmanana
2022-05-17  6:35   ` Qu Wenruo
2022-05-17  7:26     ` Qu Wenruo
2022-05-17  9:39       ` Filipe Manana
2022-05-17 10:36         ` Qu Wenruo
2022-05-17 10:46           ` Filipe Manana [this message]
2022-05-09 19:08 ` [PATCH 0/2] btrfs: teach send to avoid trashing the page cache with data David Sterba
2022-05-17 10:47 ` [PATCH v2 " fdmanana
2022-05-17 10:47   ` [PATCH 1/2] btrfs: send: keep the current inode open while processing it fdmanana
2022-05-17 10:47   ` [PATCH 2/2] btrfs: send: avoid trashing the page cache fdmanana
2022-05-17 18:18   ` [PATCH v2 0/2] btrfs: teach send to avoid trashing the page cache with data David Sterba
