From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from localhost (cpe-174-109-172-136.nc.res.rr.com.
[174.109.172.136]) by smtp.gmail.com with ESMTPSA id q2-20020a05620a0d8200b006b555509398sm1432238qkl.136.2022.09.02.06.25.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 02 Sep 2022 06:25:48 -0700 (PDT) Date: Fri, 2 Sep 2022 09:25:46 -0400 From: Josef Bacik To: Filipe Manana Cc: linux-btrfs@vger.kernel.org Subject: Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate reporting extent sharedness Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Thu, Sep 01, 2022 at 04:04:17PM +0100, Filipe Manana wrote: > On Thu, Sep 1, 2022 at 3:35 PM Josef Bacik wrote: > > > > On Thu, Sep 01, 2022 at 02:18:30PM +0100, fdmanana@kernel.org wrote: > > > From: Filipe Manana > > > > > > The current fiemap implementation does not scale very well with the number > > > of extents a file has. This is both because the main algorithm to find out > > > the extents has a high algorithmic complexity and because for each extent > > > we have to check if it's shared. This second part, checking if an extent > > > is shared, is significantly improved by the two previous patches in this > > > patchset, while the first part is improved by this specific patch. Every > > > now and then we get reports from users mentioning fiemap is too slow or > > > even unusable for files with a very large number of extents, such as the > > > two recent reports referred to by the Link tags at the bottom of this > > > change log. > > > > > > To understand why the part of finding which extents a file has is very > > > inneficient, consider the example of doing a full ranged fiemap against > > > a file that has over 100K extents (normal for example for a file with > > > more than 10G of data and using compression, which limits the extent size > > > to 128K). When we enter fiemap at extent_fiemap(), the following happens: > > > > > > 1) Before entering the main loop, we call get_extent_skip_holes() to get > > > the first extent map. This leads us to btrfs_get_extent_fiemap(), which > > > in turn calls btrfs_get_extent(), to find the first extent map that > > > covers the file range [0, LLONG_MAX). > > > > > > btrfs_get_extent() will first search the inode's extent map tree, to > > > see if we have an extent map there that covers the range. If it does > > > not find one, then it will search the inode's subvolume b+tree for a > > > fitting file extent item. After finding the file extent item, it will > > > allocate an extent map, fill it in with information extracted from the > > > file extent item, and add it to the inode's extent map tree (which > > > requires a search for insertion in the tree). > > > > > > 2) Then we enter the main loop at extent_fiemap(), emit the details of > > > the extent, and call again get_extent_skip_holes(), with a start > > > offset matching the end of the extent map we previously processed. > > > > > > We end up at btrfs_get_extent() again, will search the extent map tree > > > and then search the subvolume b+tree for a file extent item if we could > > > not find an extent map in the extent tree. We allocate an extent map, > > > fill it in with the details in the file extent item, and then insert > > > it into the extent map tree (yet another search in this tree). > > > > > > 3) The second step is repeated over and over, until we have processed the > > > whole file range. 
Each iteration ends at btrfs_get_extent(), which > > > does a red black tree search on the extent map tree, then searches the > > > subvolume b+tree, allocates an extent map and then does another search > > > in the extent map tree in order to insert the extent map. > > > > > > In the best scenario we have all the extent maps already in the extent > > > tree, and so for each extent we do a single search on a red black tree, > > > so we have a complexity of O(n log n). > > > > > > In the worst scenario we don't have any extent map already loaded in > > > the extent map tree, or have very few already there. In this case the > > > complexity is much higher since we do: > > > > > > - A red black tree search on the extent map tree, which has O(log n) > > > complexity, initially very fast since the tree is empty or very > > > small, but as we end up allocating extent maps and adding them to > > > the tree when we don't find them there, each subsequent search on > > > the tree gets slower, since it's getting bigger and bigger after > > > each iteration. > > > > > > - A search on the subvolume b+tree, also O(log n) complexity, but it > > > has items for all inodes in the subvolume, not just items for our > > > inode. Plus on a filesystem with concurrent operations on other > > > inodes, we can block doing the search due to lock contention on > > > b+tree nodes/leaves. > > > > > > - Allocate an extent map - this can block, and can also fail if we > > > are under serious memory pressure. > > > > > > - Do another search on the extent maps red black tree, with the goal > > > of inserting the extent map we just allocated. Again, after every > > > iteration this tree is getting bigger by 1 element, so after many > > > iterations the searches are slower and slower. > > > > > > - We will not need the allocated extent map anymore, so it's pointless > > > to add it to the extent map tree. It's just wasting time and memory. > > > > > > In short we end up searching the extent map tree multiple times, on a > > > tree that is growing bigger and bigger after each iteration. And > > > besides that we visit the same leaf of the subvolume b+tree many times, > > > since a leaf with the default size of 16K can easily have more than 200 > > > file extent items. > > > > > > This is very inneficient overall. This patch changes the algorithm to > > > instead iterate over the subvolume b+tree, visiting each leaf only once, > > > and only searching in the extent map tree for file ranges that have holes > > > or prealloc extents, in order to figure out if we have delalloc there. > > > It will never allocate an extent map and add it to the extent map tree. > > > This is very similar to what was previously done for the lseek's hole and > > > data seeking features. > > > > > > Also, the current implementation relying on extent maps for figuring out > > > which extents we have is not correct. This is because extent maps can be > > > merged even if they represent different extents - we do this to minimize > > > memory utilization and keep extent map trees smaller. For example if we > > > have two extents that are contiguous on disk, once we load the two extent > > > maps, they get merged into a single one - however if only one of the > > > extents is shared, we end up reporting both as shared or both as not > > > shared, which is incorrect. 
> > > > > > This reproducer triggers that bug: > > > > > > $ cat fiemap-bug.sh > > > #!/bin/bash > > > > > > DEV=/dev/sdj > > > MNT=/mnt/sdj > > > > > > mkfs.btrfs -f $DEV > > > mount $DEV $MNT > > > > > > # Create a file with two 256K extents. > > > # Since there is no other write activity, they will be contiguous, > > > # and their extent maps merged, despite having two distinct extents. > > > xfs_io -f -c "pwrite -S 0xab 0 256K" \ > > > -c "fsync" \ > > > -c "pwrite -S 0xcd 256K 256K" \ > > > -c "fsync" \ > > > $MNT/foo > > > > > > # Now clone only the second extent into another file. > > > xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar > > > > > > # Filefrag will report a single 512K extent, and say it's not shared. > > > echo > > > filefrag -v $MNT/foo > > > > > > umount $MNT > > > > > > Running the reproducer: > > > > > > $ ./fiemap-bug.sh > > > wrote 262144/262144 bytes at offset 0 > > > 256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec) > > > wrote 262144/262144 bytes at offset 262144 > > > 256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec) > > > linked 262144/262144 bytes at offset 0 > > > 256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec) > > > > > > Filesystem type is: 9123683e > > > File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes) > > > ext: logical_offset: physical_offset: length: expected: flags: > > > 0: 0.. 127: 3328.. 3455: 128: last,eof > > > /mnt/sdj/foo: 1 extent found > > > > > > We end up reporting that we have a single 512K that is not shared, however > > > we have two 256K extents, and the second one is shared. Changing the > > > reproducer to clone instead the first extent into file 'bar', makes us > > > report a single 512K extent that is shared, which is algo incorrect since > > > we have two 256K extents and only the first one is shared. > > > > > > This patch is part of a larger patchset that is comprised of the following > > > patches: > > > > > > btrfs: allow hole and data seeking to be interruptible > > > btrfs: make hole and data seeking a lot more efficient > > > btrfs: remove check for impossible block start for an extent map at fiemap > > > btrfs: remove zero length check when entering fiemap > > > btrfs: properly flush delalloc when entering fiemap > > > btrfs: allow fiemap to be interruptible > > > btrfs: rename btrfs_check_shared() to a more descriptive name > > > btrfs: speedup checking for extent sharedness during fiemap > > > btrfs: skip unnecessary extent buffer sharedness checks during fiemap > > > btrfs: make fiemap more efficient and accurate reporting extent sharedness > > > > > > The patchset was tested on a machine running a non-debug kernel (Debian's > > > default config) and compared the tests below on a branch without the > > > patchset versus the same branch with the whole patchset applied. > > > > > > The following test for a large compressed file without holes: > > > > > > $ cat fiemap-perf-test.sh > > > #!/bin/bash > > > > > > DEV=/dev/sdi > > > MNT=/mnt/sdi > > > > > > mkfs.btrfs -f $DEV > > > mount -o compress=lzo $DEV $MNT > > > > > > # 40G gives 327680 128K file extents (due to compression). 
> > > xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar > > > > > > umount $MNT > > > mount -o compress=lzo $DEV $MNT > > > > > > start=$(date +%s%N) > > > filefrag $MNT/foobar > > > end=$(date +%s%N) > > > dur=$(( (end - start) / 1000000 )) > > > echo "fiemap took $dur milliseconds (metadata not cached)" > > > > > > start=$(date +%s%N) > > > filefrag $MNT/foobar > > > end=$(date +%s%N) > > > dur=$(( (end - start) / 1000000 )) > > > echo "fiemap took $dur milliseconds (metadata cached)" > > > > > > umount $MNT > > > > > > Before patchset: > > > > > > $ ./fiemap-perf-test.sh > > > (...) > > > /mnt/sdi/foobar: 327680 extents found > > > fiemap took 3597 milliseconds (metadata not cached) > > > /mnt/sdi/foobar: 327680 extents found > > > fiemap took 2107 milliseconds (metadata cached) > > > > > > After patchset: > > > > > > $ ./fiemap-perf-test.sh > > > (...) > > > /mnt/sdi/foobar: 327680 extents found > > > fiemap took 1214 milliseconds (metadata not cached) > > > /mnt/sdi/foobar: 327680 extents found > > > fiemap took 684 milliseconds (metadata cached) > > > > > > That's a speedup of about 3x for both cases (no metadata cached and all > > > metadata cached). > > > > > > The test provided by Pavel (first Link tag at the bottom), which uses > > > files with a large number of holes, was also used to measure the gains, > > > and it consists on a small C program and a shell script to invoke it. > > > The C program is the following: > > > > > > $ cat pavels-test.c > > > #include > > > #include > > > #include > > > #include > > > > > > #include > > > #include > > > #include > > > > > > #include > > > #include > > > > > > #define FILE_INTERVAL (1<<13) /* 8Kb */ > > > > > > long long interval(struct timeval t1, struct timeval t2) > > > { > > > long long val = 0; > > > val += (t2.tv_usec - t1.tv_usec); > > > val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000; > > > return val; > > > } > > > > > > int main(int argc, char **argv) > > > { > > > struct fiemap fiemap = {}; > > > struct timeval t1, t2; > > > char data = 'a'; > > > struct stat st; > > > int fd, off, file_size = FILE_INTERVAL; > > > > > > if (argc != 3 && argc != 2) { > > > printf("usage: %s [size]\n", argv[0]); > > > return 1; > > > } > > > > > > if (argc == 3) > > > file_size = atoi(argv[2]); > > > if (file_size < FILE_INTERVAL) > > > file_size = FILE_INTERVAL; > > > file_size -= file_size % FILE_INTERVAL; > > > > > > fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644); > > > if (fd < 0) { > > > perror("open"); > > > return 1; > > > } > > > > > > for (off = 0; off < file_size; off += FILE_INTERVAL) { > > > if (pwrite(fd, &data, 1, off) != 1) { > > > perror("pwrite"); > > > close(fd); > > > return 1; > > > } > > > } > > > > > > if (ftruncate(fd, file_size)) { > > > perror("ftruncate"); > > > close(fd); > > > return 1; > > > } > > > > > > if (fstat(fd, &st) < 0) { > > > perror("fstat"); > > > close(fd); > > > return 1; > > > } > > > > > > printf("size: %ld\n", st.st_size); > > > printf("actual size: %ld\n", st.st_blocks * 512); > > > > > > fiemap.fm_length = FIEMAP_MAX_OFFSET; > > > gettimeofday(&t1, NULL); > > > if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) { > > > perror("fiemap"); > > > close(fd); > > > return 1; > > > } > > > gettimeofday(&t2, NULL); > > > > > > printf("fiemap: fm_mapped_extents = %d\n", > > > fiemap.fm_mapped_extents); > > > printf("time = %lld us\n", interval(t1, t2)); > > > > > > close(fd); > > > return 0; > > > } > > > > > > $ gcc -o pavels_test pavels_test.c > > > > > > And the wrapper shell script: > > > > > > $ cat 
fiemap-pavels-test.sh > > > > > > #!/bin/bash > > > > > > DEV=/dev/sdi > > > MNT=/mnt/sdi > > > > > > mkfs.btrfs -f -O no-holes $DEV > > > mount $DEV $MNT > > > > > > echo > > > echo "*********** 256M ***********" > > > echo > > > > > > ./pavels-test $MNT/testfile $((1 << 28)) > > > echo > > > ./pavels-test $MNT/testfile $((1 << 28)) > > > > > > echo > > > echo "*********** 512M ***********" > > > echo > > > > > > ./pavels-test $MNT/testfile $((1 << 29)) > > > echo > > > ./pavels-test $MNT/testfile $((1 << 29)) > > > > > > echo > > > echo "*********** 1G ***********" > > > echo > > > > > > ./pavels-test $MNT/testfile $((1 << 30)) > > > echo > > > ./pavels-test $MNT/testfile $((1 << 30)) > > > > > > umount $MNT > > > > > > Running his reproducer before applying the patchset: > > > > > > *********** 256M *********** > > > > > > size: 268435456 > > > actual size: 134217728 > > > fiemap: fm_mapped_extents = 32768 > > > time = 4003133 us > > > > > > size: 268435456 > > > actual size: 134217728 > > > fiemap: fm_mapped_extents = 32768 > > > time = 4895330 us > > > > > > *********** 512M *********** > > > > > > size: 536870912 > > > actual size: 268435456 > > > fiemap: fm_mapped_extents = 65536 > > > time = 30123675 us > > > > > > size: 536870912 > > > actual size: 268435456 > > > fiemap: fm_mapped_extents = 65536 > > > time = 33450934 us > > > > > > *********** 1G *********** > > > > > > size: 1073741824 > > > actual size: 536870912 > > > fiemap: fm_mapped_extents = 131072 > > > time = 224924074 us > > > > > > size: 1073741824 > > > actual size: 536870912 > > > fiemap: fm_mapped_extents = 131072 > > > time = 217239242 us > > > > > > Running it after applying the patchset: > > > > > > *********** 256M *********** > > > > > > size: 268435456 > > > actual size: 134217728 > > > fiemap: fm_mapped_extents = 32768 > > > time = 29475 us > > > > > > size: 268435456 > > > actual size: 134217728 > > > fiemap: fm_mapped_extents = 32768 > > > time = 29307 us > > > > > > *********** 512M *********** > > > > > > size: 536870912 > > > actual size: 268435456 > > > fiemap: fm_mapped_extents = 65536 > > > time = 58996 us > > > > > > size: 536870912 > > > actual size: 268435456 > > > fiemap: fm_mapped_extents = 65536 > > > time = 59115 us > > > > > > *********** 1G *********** > > > > > > size: 1073741824 > > > actual size: 536870912 > > > fiemap: fm_mapped_extents = 116251 > > > time = 124141 us > > > > > > size: 1073741824 > > > actual size: 536870912 > > > fiemap: fm_mapped_extents = 131072 > > > time = 119387 us > > > > > > The speedup is massive, both on the first fiemap call and on the second > > > one as well, as his test creates files with many holes and small extents > > > (every extent follows a hole and precedes another hole). > > > > > > For the 256M file we go from 4 seconds down to 29 milliseconds in the > > > first run, and then from 4.9 seconds down to 29 milliseconds again in the > > > second run, a speedup of 138x and 169x, respectively. > > > > > > For the 512M file we go from 30.1 seconds down to 59 milliseconds in the > > > first run, and then from 33.5 seconds down to 59 milliseconds again in the > > > second run, a speedup of 510x and 568x, respectively. > > > > > > For the 1G file, we go from 225 seconds down to 124 milliseconds in the > > > first run, and then from 217 seconds down to 119 milliseconds in the > > > second run, a speedup of 1815x and 1824x, respectively. 
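The sharedness reporting discussed above can also be inspected directly from user space. What follows is a minimal illustrative sketch, not part of Filipe's patch or of the test programs quoted above (the file name fiemap-shared.c is made up). Unlike pavels-test.c, it passes a non-zero fm_extent_count, so the kernel fills in an extent array - and therefore runs the sharedness checks this patchset optimizes - instead of only returning a count:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int i;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Room for up to 512 extents; a real tool would loop, advancing fm_start. */
	fm = calloc(1, sizeof(*fm) + 512 * sizeof(struct fiemap_extent));
	if (!fm) {
		perror("calloc");
		close(fd);
		return 1;
	}

	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_extent_count = 512;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("fiemap");
		free(fm);
		close(fd);
		return 1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		const struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("extent %u: logical %llu, length %llu%s\n", i,
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_length,
		       (fe->fe_flags & FIEMAP_EXTENT_SHARED) ? ", shared" : "");
	}

	free(fm);
	close(fd);
	return 0;
}

$ gcc -o fiemap-shared fiemap-shared.c

Run against $MNT/foo from the fiemap-bug.sh reproducer above, a kernel with this patch should list two 256K extents with only the second one marked shared, while an unpatched kernel lists a single 512K extent with no shared flag, matching the filefrag output shown earlier.
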
> > > > > > Reported-by: Pavel Tikhomirov > > > Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/ > > > Reported-by: Dominique MARTINET > > > Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/ > > > Signed-off-by: Filipe Manana > > > --- > > > fs/btrfs/ctree.h | 4 +- > > > fs/btrfs/extent_io.c | 714 +++++++++++++++++++++++++++++-------------- > > > fs/btrfs/file.c | 16 +- > > > fs/btrfs/inode.c | 140 +-------- > > > 4 files changed, 506 insertions(+), 368 deletions(-) > > > > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > > index f7fe7f633eb5..7b266f9dc8b4 100644 > > > --- a/fs/btrfs/ctree.h > > > +++ b/fs/btrfs/ctree.h > > > @@ -3402,8 +3402,6 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio, > > > u64 start, u64 end); > > > int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio, > > > u32 bio_offset, struct page *page, u32 pgoff); > > > -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode, > > > - u64 start, u64 len); > > > noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len, > > > u64 *orig_start, u64 *orig_block_len, > > > u64 *ram_bytes, bool strict); > > > @@ -3583,6 +3581,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end); > > > int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos, > > > size_t *write_bytes); > > > void btrfs_check_nocow_unlock(struct btrfs_inode *inode); > > > +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end, > > > + u64 *delalloc_start_ret, u64 *delalloc_end_ret); > > > > > > /* tree-defrag.c */ > > > int btrfs_defrag_leaves(struct btrfs_trans_handle *trans, > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > > index 0e3fa9b08aaf..50bb2182e795 100644 > > > --- a/fs/btrfs/extent_io.c > > > +++ b/fs/btrfs/extent_io.c > > > @@ -5353,42 +5353,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) > > > return try_release_extent_state(tree, page, mask); > > > } > > > > > > -/* > > > - * helper function for fiemap, which doesn't want to see any holes. > > > - * This maps until we find something past 'last' > > > - */ > > > -static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode, > > > - u64 offset, u64 last) > > > -{ > > > - u64 sectorsize = btrfs_inode_sectorsize(inode); > > > - struct extent_map *em; > > > - u64 len; > > > - > > > - if (offset >= last) > > > - return NULL; > > > - > > > - while (1) { > > > - len = last - offset; > > > - if (len == 0) > > > - break; > > > - len = ALIGN(len, sectorsize); > > > - em = btrfs_get_extent_fiemap(inode, offset, len); > > > - if (IS_ERR(em)) > > > - return em; > > > - > > > - /* if this isn't a hole return it */ > > > - if (em->block_start != EXTENT_MAP_HOLE) > > > - return em; > > > - > > > - /* this is a hole, advance to the next extent */ > > > - offset = extent_map_end(em); > > > - free_extent_map(em); > > > - if (offset >= last) > > > - break; > > > - } > > > - return NULL; > > > -} > > > - > > > /* > > > * To cache previous fiemap extent > > > * > > > @@ -5418,6 +5382,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo, > > > { > > > int ret = 0; > > > > > > + /* Set at the end of extent_fiemap(). 
*/ > > > + ASSERT((flags & FIEMAP_EXTENT_LAST) == 0); > > > + > > > if (!cache->cached) > > > goto assign; > > > > > > @@ -5446,11 +5413,10 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo, > > > */ > > > if (cache->offset + cache->len == offset && > > > cache->phys + cache->len == phys && > > > - (cache->flags & ~FIEMAP_EXTENT_LAST) == > > > - (flags & ~FIEMAP_EXTENT_LAST)) { > > > + cache->flags == flags) { > > > cache->len += len; > > > cache->flags |= flags; > > > - goto try_submit_last; > > > + return 0; > > > } > > > > > > /* Not mergeable, need to submit cached one */ > > > @@ -5465,13 +5431,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo, > > > cache->phys = phys; > > > cache->len = len; > > > cache->flags = flags; > > > -try_submit_last: > > > - if (cache->flags & FIEMAP_EXTENT_LAST) { > > > - ret = fiemap_fill_next_extent(fieinfo, cache->offset, > > > - cache->phys, cache->len, cache->flags); > > > - cache->cached = false; > > > - } > > > - return ret; > > > + > > > + return 0; > > > } > > > > > > /* > > > @@ -5501,229 +5462,534 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo, > > > return ret; > > > } > > > > > > -int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, > > > - u64 start, u64 len) > > > +static int fiemap_next_leaf_item(struct btrfs_inode *inode, > > > + struct btrfs_path *path) > > > { > > > - int ret = 0; > > > - u64 off; > > > - u64 max = start + len; > > > - u32 flags = 0; > > > - u32 found_type; > > > - u64 last; > > > - u64 last_for_get_extent = 0; > > > - u64 disko = 0; > > > - u64 isize = i_size_read(&inode->vfs_inode); > > > - struct btrfs_key found_key; > > > - struct extent_map *em = NULL; > > > - struct extent_state *cached_state = NULL; > > > - struct btrfs_path *path; > > > + struct extent_buffer *clone; > > > + struct btrfs_key key; > > > + int slot; > > > + int ret; > > > + > > > + path->slots[0]++; > > > + if (path->slots[0] < btrfs_header_nritems(path->nodes[0])) > > > + return 0; > > > + > > > + ret = btrfs_next_leaf(inode->root, path); > > > + if (ret != 0) > > > + return ret; > > > + > > > + /* > > > + * Don't bother with cloning if there are no more file extent items for > > > + * our inode. > > > + */ > > > + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); > > > + if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY) > > > + return 1; > > > + > > > + /* See the comment at fiemap_search_slot() about why we clone. */ > > > + clone = btrfs_clone_extent_buffer(path->nodes[0]); > > > + if (!clone) > > > + return -ENOMEM; > > > + > > > + slot = path->slots[0]; > > > + btrfs_release_path(path); > > > + path->nodes[0] = clone; > > > + path->slots[0] = slot; > > > + > > > + return 0; > > > +} > > > + > > > +/* > > > + * Search for the first file extent item that starts at a given file offset or > > > + * the one that starts immediately before that offset. > > > + * Returns: 0 on success, < 0 on error, 1 if not found. 
> > > + */ > > > +static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path, > > > + u64 file_offset) > > > +{ > > > + const u64 ino = btrfs_ino(inode); > > > struct btrfs_root *root = inode->root; > > > - struct fiemap_cache cache = { 0 }; > > > - struct btrfs_backref_shared_cache *backref_cache; > > > - struct ulist *roots; > > > - struct ulist *tmp_ulist; > > > - int end = 0; > > > - u64 em_start = 0; > > > - u64 em_len = 0; > > > - u64 em_end = 0; > > > + struct extent_buffer *clone; > > > + struct btrfs_key key; > > > + int slot; > > > + int ret; > > > > > > - backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL); > > > - path = btrfs_alloc_path(); > > > - roots = ulist_alloc(GFP_KERNEL); > > > - tmp_ulist = ulist_alloc(GFP_KERNEL); > > > - if (!backref_cache || !path || !roots || !tmp_ulist) { > > > - ret = -ENOMEM; > > > - goto out_free_ulist; > > > + key.objectid = ino; > > > + key.type = BTRFS_EXTENT_DATA_KEY; > > > + key.offset = file_offset; > > > + > > > + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > > > + if (ret < 0) > > > + return ret; > > > + > > > + if (ret > 0 && path->slots[0] > 0) { > > > + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1); > > > + if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY) > > > + path->slots[0]--; > > > + } > > > + > > > + if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { > > > + ret = btrfs_next_leaf(root, path); > > > + if (ret != 0) > > > + return ret; > > > + > > > + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); > > > + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) > > > + return 1; > > > } > > > > > > /* > > > - * We can't initialize that to 'start' as this could miss extents due > > > - * to extent item merging > > > + * We clone the leaf and use it during fiemap. This is because while > > > + * using the leaf we do expensive things like checking if an extent is > > > + * shared, which can take a long time. In order to prevent blocking > > > + * other tasks for too long, we use a clone of the leaf. We have locked > > > + * the file range in the inode's io tree, so we know none of our file > > > + * extent items can change. This way we avoid blocking other tasks that > > > + * want to insert items for other inodes in the same leaf or b+tree > > > + * rebalance operations (triggered for example when someone is trying > > > + * to push items into this leaf when trying to insert an item in a > > > + * neighbour leaf). > > > + * We also need the private clone because holding a read lock on an > > > + * extent buffer of the subvolume's b+tree will make lockdep unhappy > > > + * when we call fiemap_fill_next_extent(), because that may cause a page > > > + * fault when filling the user space buffer with fiemap data. > > > */ > > > - off = 0; > > > - start = round_down(start, btrfs_inode_sectorsize(inode)); > > > - len = round_up(max, btrfs_inode_sectorsize(inode)) - start; > > > + clone = btrfs_clone_extent_buffer(path->nodes[0]); > > > + if (!clone) > > > + return -ENOMEM; > > > + > > > + slot = path->slots[0]; > > > + btrfs_release_path(path); > > > + path->nodes[0] = clone; > > > + path->slots[0] = slot; > > > + > > > + return 0; > > > +} > > > + > > > +/* > > > + * Process a range which is a hole or a prealloc extent in the inode's subvolume > > > + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc > > > + * extent. The end offset (@end) is inclusive. 
> > > + */ > > > +static int fiemap_process_hole(struct btrfs_inode *inode, > > > + struct fiemap_extent_info *fieinfo, > > > + struct fiemap_cache *cache, > > > + struct btrfs_backref_shared_cache *backref_cache, > > > + u64 disk_bytenr, u64 extent_offset, > > > + u64 extent_gen, > > > + struct ulist *roots, struct ulist *tmp_ulist, > > > + u64 start, u64 end) > > > +{ > > > + const u64 i_size = i_size_read(&inode->vfs_inode); > > > + const u64 ino = btrfs_ino(inode); > > > + u64 cur_offset = start; > > > + u64 last_delalloc_end = 0; > > > + u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN; > > > + bool checked_extent_shared = false; > > > + int ret; > > > > > > /* > > > - * lookup the last file extent. We're not using i_size here > > > - * because there might be preallocation past i_size > > > + * There can be no delalloc past i_size, so don't waste time looking for > > > + * it beyond i_size. > > > */ > > > - ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1, > > > - 0); > > > - if (ret < 0) { > > > - goto out_free_ulist; > > > - } else { > > > - WARN_ON(!ret); > > > - if (ret == 1) > > > - ret = 0; > > > - } > > > + while (cur_offset < end && cur_offset < i_size) { > > > + u64 delalloc_start; > > > + u64 delalloc_end; > > > + u64 prealloc_start; > > > + u64 prealloc_len = 0; > > > + bool delalloc; > > > + > > > + delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end, > > > + &delalloc_start, > > > + &delalloc_end); > > > + if (!delalloc) > > > + break; > > > > > > - path->slots[0]--; > > > - btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]); > > > - found_type = found_key.type; > > > - > > > - /* No extents, but there might be delalloc bits */ > > > - if (found_key.objectid != btrfs_ino(inode) || > > > - found_type != BTRFS_EXTENT_DATA_KEY) { > > > - /* have to trust i_size as the end */ > > > - last = (u64)-1; > > > - last_for_get_extent = isize; > > > - } else { > > > /* > > > - * remember the start of the last extent. There are a > > > - * bunch of different factors that go into the length of the > > > - * extent, so its much less complex to remember where it started > > > + * If this is a prealloc extent we have to report every section > > > + * of it that has no delalloc. 
> > > */ > > > - last = found_key.offset; > > > - last_for_get_extent = last + 1; > > > + if (disk_bytenr != 0) { > > > + if (last_delalloc_end == 0) { > > > + prealloc_start = start; > > > + prealloc_len = delalloc_start - start; > > > + } else { > > > + prealloc_start = last_delalloc_end + 1; > > > + prealloc_len = delalloc_start - prealloc_start; > > > + } > > > + } > > > + > > > + if (prealloc_len > 0) { > > > + if (!checked_extent_shared && fieinfo->fi_extents_max) { > > > + ret = btrfs_is_data_extent_shared(inode->root, > > > + ino, disk_bytenr, > > > + extent_gen, roots, > > > + tmp_ulist, > > > + backref_cache); > > > + if (ret < 0) > > > + return ret; > > > + else if (ret > 0) > > > + prealloc_flags |= FIEMAP_EXTENT_SHARED; > > > + > > > + checked_extent_shared = true; > > > + } > > > + ret = emit_fiemap_extent(fieinfo, cache, prealloc_start, > > > + disk_bytenr + extent_offset, > > > + prealloc_len, prealloc_flags); > > > + if (ret) > > > + return ret; > > > + extent_offset += prealloc_len; > > > + } > > > + > > > + ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0, > > > + delalloc_end + 1 - delalloc_start, > > > + FIEMAP_EXTENT_DELALLOC | > > > + FIEMAP_EXTENT_UNKNOWN); > > > + if (ret) > > > + return ret; > > > + > > > + last_delalloc_end = delalloc_end; > > > + cur_offset = delalloc_end + 1; > > > + extent_offset += cur_offset - delalloc_start; > > > + cond_resched(); > > > + } > > > + > > > + /* > > > + * Either we found no delalloc for the whole prealloc extent or we have > > > + * a prealloc extent that spans i_size or starts at or after i_size. > > > + */ > > > + if (disk_bytenr != 0 && last_delalloc_end < end) { > > > + u64 prealloc_start; > > > + u64 prealloc_len; > > > + > > > + if (last_delalloc_end == 0) { > > > + prealloc_start = start; > > > + prealloc_len = end + 1 - start; > > > + } else { > > > + prealloc_start = last_delalloc_end + 1; > > > + prealloc_len = end + 1 - prealloc_start; > > > + } > > > + > > > + if (!checked_extent_shared && fieinfo->fi_extents_max) { > > > + ret = btrfs_is_data_extent_shared(inode->root, > > > + ino, disk_bytenr, > > > + extent_gen, roots, > > > + tmp_ulist, > > > + backref_cache); > > > + if (ret < 0) > > > + return ret; > > > + else if (ret > 0) > > > + prealloc_flags |= FIEMAP_EXTENT_SHARED; > > > + } > > > + ret = emit_fiemap_extent(fieinfo, cache, prealloc_start, > > > + disk_bytenr + extent_offset, > > > + prealloc_len, prealloc_flags); > > > + if (ret) > > > + return ret; > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static int fiemap_find_last_extent_offset(struct btrfs_inode *inode, > > > + struct btrfs_path *path, > > > + u64 *last_extent_end_ret) > > > +{ > > > + const u64 ino = btrfs_ino(inode); > > > + struct btrfs_root *root = inode->root; > > > + struct extent_buffer *leaf; > > > + struct btrfs_file_extent_item *ei; > > > + struct btrfs_key key; > > > + u64 disk_bytenr; > > > + int ret; > > > + > > > + /* > > > + * Lookup the last file extent. We're not using i_size here because > > > + * there might be preallocation past i_size. 
> > > + */ > > > + ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0); > > > + /* There can't be a file extent item at offset (u64)-1 */ > > > + ASSERT(ret != 0); > > > + if (ret < 0) > > > + return ret; > > > + > > > + /* > > > + * For a non-existing key, btrfs_search_slot() always leaves us at a > > > + * slot > 0, except if the btree is empty, which is impossible because > > > + * at least it has the inode item for this inode and all the items for > > > + * the root inode 256. > > > + */ > > > + ASSERT(path->slots[0] > 0); > > > + path->slots[0]--; > > > + leaf = path->nodes[0]; > > > + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); > > > + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) { > > > + /* No file extent items in the subvolume tree. */ > > > + *last_extent_end_ret = 0; > > > + return 0; > > > } > > > - btrfs_release_path(path); > > > > > > /* > > > - * we might have some extents allocated but more delalloc past those > > > - * extents. so, we trust isize unless the start of the last extent is > > > - * beyond isize > > > + * For an inline extent, the disk_bytenr is where inline data starts at, > > > + * so first check if we have an inline extent item before checking if we > > > + * have an implicit hole (disk_bytenr == 0). > > > */ > > > - if (last < isize) { > > > - last = (u64)-1; > > > - last_for_get_extent = isize; > > > + ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item); > > > + if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) { > > > + *last_extent_end_ret = btrfs_file_extent_end(path); > > > + return 0; > > > } > > > > > > - lock_extent_bits(&inode->io_tree, start, start + len - 1, > > > - &cached_state); > > > + /* > > > + * Find the last file extent item that is not a hole (when NO_HOLES is > > > + * not enabled). This should take at most 2 iterations in the worst > > > + * case: we have one hole file extent item at slot 0 of a leaf and > > > + * another hole file extent item as the last item in the previous leaf. > > > + * This is because we merge file extent items that represent holes. > > > + */ > > > + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); > > > + while (disk_bytenr == 0) { > > > + ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY); > > > + if (ret < 0) { > > > + return ret; > > > + } else if (ret > 0) { > > > + /* No file extent items that are not holes. 
*/ > > > + *last_extent_end_ret = 0; > > > + return 0; > > > + } > > > + leaf = path->nodes[0]; > > > + ei = btrfs_item_ptr(leaf, path->slots[0], > > > + struct btrfs_file_extent_item); > > > + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); > > > + } > > > > > > - em = get_extent_skip_holes(inode, start, last_for_get_extent); > > > - if (!em) > > > - goto out; > > > - if (IS_ERR(em)) { > > > - ret = PTR_ERR(em); > > > + *last_extent_end_ret = btrfs_file_extent_end(path); > > > + return 0; > > > +} > > > + > > > +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, > > > + u64 start, u64 len) > > > +{ > > > + const u64 ino = btrfs_ino(inode); > > > + struct extent_state *cached_state = NULL; > > > + struct btrfs_path *path; > > > + struct btrfs_root *root = inode->root; > > > + struct fiemap_cache cache = { 0 }; > > > + struct btrfs_backref_shared_cache *backref_cache; > > > + struct ulist *roots; > > > + struct ulist *tmp_ulist; > > > + u64 last_extent_end; > > > + u64 prev_extent_end; > > > + u64 lockstart; > > > + u64 lockend; > > > + bool stopped = false; > > > + int ret; > > > + > > > + backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL); > > > + path = btrfs_alloc_path(); > > > + roots = ulist_alloc(GFP_KERNEL); > > > + tmp_ulist = ulist_alloc(GFP_KERNEL); > > > + if (!backref_cache || !path || !roots || !tmp_ulist) { > > > + ret = -ENOMEM; > > > goto out; > > > } > > > > > > - while (!end) { > > > - u64 offset_in_extent = 0; > > > + lockstart = round_down(start, btrfs_inode_sectorsize(inode)); > > > + lockend = round_up(start + len, btrfs_inode_sectorsize(inode)); > > > + prev_extent_end = lockstart; > > > > > > - /* break if the extent we found is outside the range */ > > > - if (em->start >= max || extent_map_end(em) < off) > > > - break; > > > + lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state); > > > > > > - /* > > > - * get_extent may return an extent that starts before our > > > - * requested range. We have to make sure the ranges > > > - * we return to fiemap always move forward and don't > > > - * overlap, so adjust the offsets here > > > - */ > > > - em_start = max(em->start, off); > > > + ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end); > > > + if (ret < 0) > > > + goto out_unlock; > > > + btrfs_release_path(path); > > > > > > + path->reada = READA_FORWARD; > > > + ret = fiemap_search_slot(inode, path, lockstart); > > > + if (ret < 0) { > > > + goto out_unlock; > > > + } else if (ret > 0) { > > > /* > > > - * record the offset from the start of the extent > > > - * for adjusting the disk offset below. Only do this if the > > > - * extent isn't compressed since our in ram offset may be past > > > - * what we have actually allocated on disk. > > > + * No file extent item found, but we may have delalloc between > > > + * the current offset and i_size. So check for that. 
> > > */ > > > - if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) > > > - offset_in_extent = em_start - em->start; > > > - em_end = extent_map_end(em); > > > - em_len = em_end - em_start; > > > - flags = 0; > > > - if (em->block_start < EXTENT_MAP_LAST_BYTE) > > > - disko = em->block_start + offset_in_extent; > > > - else > > > - disko = 0; > > > + ret = 0; > > > + goto check_eof_delalloc; > > > + } > > > + > > > + while (prev_extent_end < lockend) { > > > + struct extent_buffer *leaf = path->nodes[0]; > > > + struct btrfs_file_extent_item *ei; > > > + struct btrfs_key key; > > > + u64 extent_end; > > > + u64 extent_len; > > > + u64 extent_offset = 0; > > > + u64 extent_gen; > > > + u64 disk_bytenr = 0; > > > + u64 flags = 0; > > > + int extent_type; > > > + u8 compression; > > > + > > > + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); > > > + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) > > > + break; > > > + > > > + extent_end = btrfs_file_extent_end(path); > > > > > > /* > > > - * bump off for our next call to get_extent > > > + * The first iteration can leave us at an extent item that ends > > > + * before our range's start. Move to the next item. > > > */ > > > - off = extent_map_end(em); > > > - if (off >= max) > > > - end = 1; > > > - > > > - if (em->block_start == EXTENT_MAP_INLINE) { > > > - flags |= (FIEMAP_EXTENT_DATA_INLINE | > > > - FIEMAP_EXTENT_NOT_ALIGNED); > > > - } else if (em->block_start == EXTENT_MAP_DELALLOC) { > > > - flags |= (FIEMAP_EXTENT_DELALLOC | > > > - FIEMAP_EXTENT_UNKNOWN); > > > - } else if (fieinfo->fi_extents_max) { > > > - u64 extent_gen; > > > - u64 bytenr = em->block_start - > > > - (em->start - em->orig_start); > > > + if (extent_end <= lockstart) > > > + goto next_item; > > > > > > - /* > > > - * If two extent maps are merged, then their generation > > > - * is set to the maximum between their generations. > > > - * Otherwise its generation matches the one we have in > > > - * corresponding file extent item. If we have a merged > > > - * extent map, don't use its generation to speedup the > > > - * sharedness check below. > > > - */ > > > - if (test_bit(EXTENT_FLAG_MERGED, &em->flags)) > > > - extent_gen = 0; > > > - else > > > - extent_gen = em->generation; > > > + /* We have in implicit hole (NO_HOLES feature enabled). */ > > > + if (prev_extent_end < key.offset) { > > > + const u64 range_end = min(key.offset, lockend) - 1; > > > > > > - /* > > > - * As btrfs supports shared space, this information > > > - * can be exported to userspace tools via > > > - * flag FIEMAP_EXTENT_SHARED. If fi_extents_max == 0 > > > - * then we're just getting a count and we can skip the > > > - * lookup stuff. > > > - */ > > > - ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode), > > > - bytenr, extent_gen, > > > - roots, tmp_ulist, > > > - backref_cache); > > > - if (ret < 0) > > > - goto out_free; > > > - if (ret) > > > - flags |= FIEMAP_EXTENT_SHARED; > > > - ret = 0; > > > - } > > > - if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) > > > - flags |= FIEMAP_EXTENT_ENCODED; > > > - if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) > > > - flags |= FIEMAP_EXTENT_UNWRITTEN; > > > + ret = fiemap_process_hole(inode, fieinfo, &cache, > > > + backref_cache, 0, 0, 0, > > > + roots, tmp_ulist, > > > + prev_extent_end, range_end); > > > + if (ret < 0) { > > > + goto out_unlock; > > > + } else if (ret > 0) { > > > + /* fiemap_fill_next_extent() told us to stop. 
*/ > > > + stopped = true; > > > + break; > > > + } > > > > > > - free_extent_map(em); > > > - em = NULL; > > > - if ((em_start >= last) || em_len == (u64)-1 || > > > - (last == (u64)-1 && isize <= em_end)) { > > > - flags |= FIEMAP_EXTENT_LAST; > > > - end = 1; > > > + /* We've reached the end of the fiemap range, stop. */ > > > + if (key.offset >= lockend) { > > > + stopped = true; > > > + break; > > > + } > > > } > > > > > > - /* now scan forward to see if this is really the last extent. */ > > > - em = get_extent_skip_holes(inode, off, last_for_get_extent); > > > - if (IS_ERR(em)) { > > > - ret = PTR_ERR(em); > > > - goto out; > > > + extent_len = extent_end - key.offset; > > > + ei = btrfs_item_ptr(leaf, path->slots[0], > > > + struct btrfs_file_extent_item); > > > + compression = btrfs_file_extent_compression(leaf, ei); > > > + extent_type = btrfs_file_extent_type(leaf, ei); > > > + extent_gen = btrfs_file_extent_generation(leaf, ei); > > > + > > > + if (extent_type != BTRFS_FILE_EXTENT_INLINE) { > > > + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); > > > + if (compression == BTRFS_COMPRESS_NONE) > > > + extent_offset = btrfs_file_extent_offset(leaf, ei); > > > } > > > - if (!em) { > > > - flags |= FIEMAP_EXTENT_LAST; > > > - end = 1; > > > + > > > + if (compression != BTRFS_COMPRESS_NONE) > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > + > > > + if (extent_type == BTRFS_FILE_EXTENT_INLINE) { > > > + flags |= FIEMAP_EXTENT_DATA_INLINE; > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > + ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0, > > > + extent_len, flags); > > > + } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) { > > > + ret = fiemap_process_hole(inode, fieinfo, &cache, > > > + backref_cache, > > > + disk_bytenr, extent_offset, > > > + extent_gen, roots, tmp_ulist, > > > + key.offset, extent_end - 1); > > > + } else if (disk_bytenr == 0) { > > > + /* We have an explicit hole. */ > > > + ret = fiemap_process_hole(inode, fieinfo, &cache, > > > + backref_cache, 0, 0, 0, > > > + roots, tmp_ulist, > > > + key.offset, extent_end - 1); > > > + } else { > > > + /* We have a regular extent. */ > > > + if (fieinfo->fi_extents_max) { > > > + ret = btrfs_is_data_extent_shared(root, ino, > > > + disk_bytenr, > > > + extent_gen, > > > + roots, > > > + tmp_ulist, > > > + backref_cache); > > > + if (ret < 0) > > > + goto out_unlock; > > > + else if (ret > 0) > > > + flags |= FIEMAP_EXTENT_SHARED; > > > + } > > > + > > > + ret = emit_fiemap_extent(fieinfo, &cache, key.offset, > > > + disk_bytenr + extent_offset, > > > + extent_len, flags); > > > } > > > - ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko, > > > - em_len, flags); > > > - if (ret) { > > > - if (ret == 1) > > > - ret = 0; > > > - goto out_free; > > > + > > > + if (ret < 0) { > > > + goto out_unlock; > > > + } else if (ret > 0) { > > > + /* fiemap_fill_next_extent() told us to stop. */ > > > + stopped = true; > > > + break; > > > } > > > > > > + prev_extent_end = extent_end; > > > +next_item: > > > if (fatal_signal_pending(current)) { > > > ret = -EINTR; > > > - goto out_free; > > > + goto out_unlock; > > > } > > > + > > > + ret = fiemap_next_leaf_item(inode, path); > > > + if (ret < 0) { > > > + goto out_unlock; > > > + } else if (ret > 0) { > > > + /* No more file extent items for this inode. 
*/ > > > + break; > > > + } > > > + cond_resched(); > > > } > > > -out_free: > > > - if (!ret) > > > - ret = emit_last_fiemap_cache(fieinfo, &cache); > > > - free_extent_map(em); > > > -out: > > > - unlock_extent_cached(&inode->io_tree, start, start + len - 1, > > > - &cached_state); > > > > > > -out_free_ulist: > > > +check_eof_delalloc: > > > + /* > > > + * Release (and free) the path before emitting any final entries to > > > + * fiemap_fill_next_extent() to keep lockdep happy. This is because > > > + * once we find no more file extent items exist, we may have a > > > + * non-cloned leaf, and fiemap_fill_next_extent() can trigger page > > > + * faults when copying data to the user space buffer. > > > + */ > > > + btrfs_free_path(path); > > > + path = NULL; > > > + > > > + if (!stopped && prev_extent_end < lockend) { > > > + ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache, > > > + 0, 0, 0, roots, tmp_ulist, > > > + prev_extent_end, lockend - 1); > > > + if (ret < 0) > > > + goto out_unlock; > > > + prev_extent_end = lockend; > > > + } > > > + > > > + if (cache.cached && cache.offset + cache.len >= last_extent_end) { > > > + const u64 i_size = i_size_read(&inode->vfs_inode); > > > + > > > + if (prev_extent_end < i_size) { > > > + u64 delalloc_start; > > > + u64 delalloc_end; > > > + bool delalloc; > > > + > > > + delalloc = btrfs_find_delalloc_in_range(inode, > > > + prev_extent_end, > > > + i_size - 1, > > > + &delalloc_start, > > > + &delalloc_end); > > > + if (!delalloc) > > > + cache.flags |= FIEMAP_EXTENT_LAST; > > > + } else { > > > + cache.flags |= FIEMAP_EXTENT_LAST; > > > + } > > > + } > > > + > > > + ret = emit_last_fiemap_cache(fieinfo, &cache); > > > + > > > +out_unlock: > > > + unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state); > > > +out: > > > kfree(backref_cache); > > > btrfs_free_path(path); > > > ulist_free(roots); > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > > > index b292a8ada3a4..636b3ec46184 100644 > > > --- a/fs/btrfs/file.c > > > +++ b/fs/btrfs/file.c > > > @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode, > > > } > > > > > > /* > > > - * Helper for have_delalloc_in_range(). Find a subrange in a given range that > > > - * has unflushed and/or flushing delalloc. There might be other adjacent > > > - * subranges after the one it found, so have_delalloc_in_range() keeps looping > > > - * while it gets adjacent subranges, and merging them together. > > > + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range > > > + * that has unflushed and/or flushing delalloc. There might be other adjacent > > > + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps > > > + * looping while it gets adjacent subranges, and merging them together. > > > */ > > > static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end, > > > u64 *delalloc_start_ret, u64 *delalloc_end_ret) > > > @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end > > > * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and > > > * end offsets of the subrange. 
> > > */ > > > -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end, > > > - u64 *delalloc_start_ret, u64 *delalloc_end_ret) > > > +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end, > > > + u64 *delalloc_start_ret, u64 *delalloc_end_ret) > > > { > > > u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize); > > > u64 prev_delalloc_end = 0; > > > @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence, > > > u64 delalloc_end; > > > bool delalloc; > > > > > > - delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start, > > > - &delalloc_end); > > > + delalloc = btrfs_find_delalloc_in_range(inode, start, end, > > > + &delalloc_start, &delalloc_end); > > > if (delalloc && whence == SEEK_DATA) { > > > *start_ret = delalloc_start; > > > return true; > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > > index 2c7d31990777..8be1e021513a 100644 > > > --- a/fs/btrfs/inode.c > > > +++ b/fs/btrfs/inode.c > > > @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode, > > > return em; > > > } > > > > > > -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode, > > > - u64 start, u64 len) > > > -{ > > > - struct extent_map *em; > > > - struct extent_map *hole_em = NULL; > > > - u64 delalloc_start = start; > > > - u64 end; > > > - u64 delalloc_len; > > > - u64 delalloc_end; > > > - int err = 0; > > > - > > > - em = btrfs_get_extent(inode, NULL, 0, start, len); > > > - if (IS_ERR(em)) > > > - return em; > > > - /* > > > - * If our em maps to: > > > - * - a hole or > > > - * - a pre-alloc extent, > > > - * there might actually be delalloc bytes behind it. > > > - */ > > > - if (em->block_start != EXTENT_MAP_HOLE && > > > - !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) > > > - return em; > > > - else > > > - hole_em = em; > > > - > > > - /* check to see if we've wrapped (len == -1 or similar) */ > > > - end = start + len; > > > - if (end < start) > > > - end = (u64)-1; > > > - else > > > - end -= 1; > > > - > > > - em = NULL; > > > - > > > - /* ok, we didn't find anything, lets look for delalloc */ > > > - delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start, > > > - end, len, EXTENT_DELALLOC, 1); > > > - delalloc_end = delalloc_start + delalloc_len; > > > - if (delalloc_end < delalloc_start) > > > - delalloc_end = (u64)-1; > > > - > > > - /* > > > - * We didn't find anything useful, return the original results from > > > - * get_extent() > > > - */ > > > - if (delalloc_start > end || delalloc_end <= start) { > > > - em = hole_em; > > > - hole_em = NULL; > > > - goto out; > > > - } > > > - > > > - /* > > > - * Adjust the delalloc_start to make sure it doesn't go backwards from > > > - * the start they passed in > > > - */ > > > - delalloc_start = max(start, delalloc_start); > > > - delalloc_len = delalloc_end - delalloc_start; > > > - > > > - if (delalloc_len > 0) { > > > - u64 hole_start; > > > - u64 hole_len; > > > - const u64 hole_end = extent_map_end(hole_em); > > > - > > > - em = alloc_extent_map(); > > > - if (!em) { > > > - err = -ENOMEM; > > > - goto out; > > > - } > > > - > > > - ASSERT(hole_em); > > > - /* > > > - * When btrfs_get_extent can't find anything it returns one > > > - * huge hole > > > - * > > > - * Make sure what it found really fits our range, and adjust to > > > - * make sure it is based on the start from the caller > > > - */ > > > - if (hole_end <= start || hole_em->start > end) { > > > - 
free_extent_map(hole_em); > > > - hole_em = NULL; > > > - } else { > > > - hole_start = max(hole_em->start, start); > > > - hole_len = hole_end - hole_start; > > > - } > > > - > > > - if (hole_em && delalloc_start > hole_start) { > > > - /* > > > - * Our hole starts before our delalloc, so we have to > > > - * return just the parts of the hole that go until the > > > - * delalloc starts > > > - */ > > > - em->len = min(hole_len, delalloc_start - hole_start); > > > - em->start = hole_start; > > > - em->orig_start = hole_start; > > > - /* > > > - * Don't adjust block start at all, it is fixed at > > > - * EXTENT_MAP_HOLE > > > - */ > > > - em->block_start = hole_em->block_start; > > > - em->block_len = hole_len; > > > - if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags)) > > > - set_bit(EXTENT_FLAG_PREALLOC, &em->flags); > > > - } else { > > > - /* > > > - * Hole is out of passed range or it starts after > > > - * delalloc range > > > - */ > > > - em->start = delalloc_start; > > > - em->len = delalloc_len; > > > - em->orig_start = delalloc_start; > > > - em->block_start = EXTENT_MAP_DELALLOC; > > > - em->block_len = delalloc_len; > > > - } > > > - } else { > > > - return hole_em; > > > - } > > > -out: > > > - > > > - free_extent_map(hole_em); > > > - if (err) { > > > - free_extent_map(em); > > > - return ERR_PTR(err); > > > - } > > > - return em; > > > -} > > > - > > > static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, > > > const u64 start, > > > const u64 len, > > > @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, > > > * in the compression of data (in an async thread) and will return > > > * before the compression is done and writeback is started. A second > > > * filemap_fdatawrite_range() is needed to wait for the compression to > > > - * complete and writeback to start. Without this, our user is very > > > - * likely to get stale results, because the extents and extent maps for > > > - * delalloc regions are only allocated when writeback starts. > > > + * complete and writeback to start. We also need to wait for ordered > > > + * extents to complete, because our fiemap implementation uses mainly > > > + * file extent items to list the extents, searching for extent maps > > > + * only for file ranges with holes or prealloc extents to figure out > > > + * if we have delalloc in those ranges. > > > */ > > > if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) { > > > - ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX); > > > - if (ret) > > > - return ret; > > > - ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX); > > > + ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX); > > > if (ret) > > > return ret; > > > } > > > > Hmm this bit should be in "btrfs: properly flush delalloc when entering fiemap" > > instead. Thanks, > > Nop, the change is done here for a good reason: before this change, we > only needed > to wait for writeback to complete (actually just to start and create > the new extent maps), > so that's why that other patch only waits for writeback to complete, > just like the generic code. > > After this change we need to wait for ordered extents to complete, > since we use the file > extent items to get extent information for fiemap - that's why that > change is in this patch. > Got it, you can add Reviewed-by: Josef Bacik Thanks, Josef
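
A short note on the FIEMAP_FLAG_SYNC path discussed in the last two replies, as a minimal sketch using only the standard linux/fiemap.h UAPI (nothing btrfs specific, and not taken from the patch): user space opts into the flush-and-wait behaviour that now ends in btrfs_wait_ordered_range() by setting FIEMAP_FLAG_SYNC before issuing the ioctl, for example dropped into a program like pavels-test.c above:

	struct fiemap fm = { 0 };

	fm.fm_flags = FIEMAP_FLAG_SYNC;    /* flush dirty data before mapping */
	fm.fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
	fm.fm_extent_count = 0;            /* 0 means only count extents */
	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
		perror("fiemap");

e2fsprogs' filefrag sets the same flag with its -s option. Without the flag, btrfs_fiemap() skips the flush and wait entirely, since the hunk above sits inside the FIEMAP_FLAG_SYNC check.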