From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-btrfs-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 47FAEECAAD1
	for <linux-btrfs@archiver.kernel.org>; Thu,  1 Sep 2022 14:35:50 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S234303AbiIAOfs (ORCPT <rfc822;linux-btrfs@archiver.kernel.org>);
        Thu, 1 Sep 2022 10:35:48 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53194 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233832AbiIAOfq (ORCPT
        <rfc822;linux-btrfs@vger.kernel.org>); Thu, 1 Sep 2022 10:35:46 -0400
Received: from mail-qk1-x733.google.com (mail-qk1-x733.google.com [IPv6:2607:f8b0:4864:20::733])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C7A0C4C617
        for <linux-btrfs@vger.kernel.org>; Thu,  1 Sep 2022 07:35:42 -0700 (PDT)
Received: by mail-qk1-x733.google.com with SMTP id j6so13360362qkl.10
        for <linux-btrfs@vger.kernel.org>; Thu, 01 Sep 2022 07:35:42 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=toxicpanda-com.20210112.gappssmtp.com; s=20210112;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:from:to:cc;
        bh=IVNyCNMAF8ai6vyXf76k9AcT0BfCzBjXW10zP6DgMrY=;
        b=iolW9F0EiPmzjydMmM1lUb3bf/BmxhN+UJBjWDFaV1jE9kTKXUtkpnzIN0V1Mc1IPO
         TVh4TGVuAykHOUAG0dSD9NK3113Y8OBRG7llvtx7AGbY7kKp/O4xcBwyUcrR3/5WxvcD
         x0tD0KHlJHMms1HfNVmAxvbE59JXMJfrXsFhf/mtpz12lc6FwVFV6qRY8G2h5h9K/oNR
         0iXS4O3/h2F6Y6CnK6wmNyZcpCJ27Gf9/q2pO9dGdYsF7FQED2QkgB+mwe7PkTEA6gEi
         LlhThoz3sZPOUpdnEC1izgpVvFRp/e3jhvUbLFcvfJvsoEAZHyp0Hh12VTq4stL6OpQ+
         em0A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:x-gm-message-state:from:to:cc;
        bh=IVNyCNMAF8ai6vyXf76k9AcT0BfCzBjXW10zP6DgMrY=;
        b=FmQBBnEgKjaY0Y3c6fUww65BS4qLjqDkwqy7RvM3EVfiBH5d/H8hOVgydjmBmQAagh
         1L202C3TMvx4NnDWXss5fUDov6sAWZ6S/0ITrSIPHfyHLbVSjLDxgsPdedqNtMQeJfF+
         whA6rQ/tG7Cm7ZK0EpiJpagGqoxU9yGSNX53tCnrCMJlkr8KOflhVNioMmCfNL6Ue3xx
         GRNm9/qBSTDOpbirqubotMFhCHyOan/GcxI/sxtrvB9nHZeyrug+9O2UPc09wCDPpZfT
         nRvtEtw493s7KroaMBPzb+LADygEpZaEQdw2pvmYDyEg2/dEnFcyYYrel/X4P1D8OPPD
         OV3w==
X-Gm-Message-State: ACgBeo3/MWqmo0Zx1oLgq5yzOVIVSbJBPQC/0Nt23gVvNCUUoDzzhGKg
        Mn2VUprvkXDfQfYNLq+mq10ZimfnHOJUIg==
X-Google-Smtp-Source: AA6agR6fvvhbW09pBVsNuaC5NKmhLgkH3pn7G/sdhqRb9UkXddT8E2fMjUyh+Gk8J6cgGyIf7vRd2w==
X-Received: by 2002:a37:42d3:0:b0:69c:830d:6e51 with SMTP id p202-20020a3742d3000000b0069c830d6e51mr19806734qka.302.1662042941099;
        Thu, 01 Sep 2022 07:35:41 -0700 (PDT)
Received: from localhost (cpe-174-109-172-136.nc.res.rr.com. [174.109.172.136])
        by smtp.gmail.com with ESMTPSA id s14-20020a05620a254e00b006b8fb2a1145sm12509945qko.124.2022.09.01.07.35.36
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 01 Sep 2022 07:35:40 -0700 (PDT)
Date:   Thu, 1 Sep 2022 10:35:34 -0400
From:   Josef Bacik <josef@toxicpanda.com>
To:     fdmanana@kernel.org
Cc:     linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 10/10] btrfs: make fiemap more efficient and accurate
 reporting extent sharedness
Message-ID: <YxDDNvsbrlxoaJxh@localhost.localdomain>
References: <cover.1662022922.git.fdmanana@suse.com>
 <afad772c28ed5fc236785df6f3c43282d5c12534.1662022922.git.fdmanana@suse.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <afad772c28ed5fc236785df6f3c43282d5c12534.1662022922.git.fdmanana@suse.com>
Precedence: bulk
List-ID: <linux-btrfs.vger.kernel.org>
X-Mailing-List: linux-btrfs@vger.kernel.org

On Thu, Sep 01, 2022 at 02:18:30PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The current fiemap implementation does not scale very well with the number
> of extents a file has. This is both because the main algorithm to find out
> the extents has a high algorithmic complexity and because for each extent
> we have to check if it's shared. This second part, checking if an extent
> is shared, is significantly improved by the two previous patches in this
> patchset, while the first part is improved by this specific patch. Every
> now and then we get reports from users mentioning fiemap is too slow or
> even unusable for files with a very large number of extents, such as the
> two recent reports referred to by the Link tags at the bottom of this
> change log.
> 
> To understand why the part of finding which extents a file has is very
> inneficient, consider the example of doing a full ranged fiemap against
> a file that has over 100K extents (normal for example for a file with
> more than 10G of data and using compression, which limits the extent size
> to 128K). When we enter fiemap at extent_fiemap(), the following happens:
> 
> 1) Before entering the main loop, we call get_extent_skip_holes() to get
>    the first extent map. This leads us to btrfs_get_extent_fiemap(), which
>    in turn calls btrfs_get_extent(), to find the first extent map that
>    covers the file range [0, LLONG_MAX).
> 
>    btrfs_get_extent() will first search the inode's extent map tree, to
>    see if we have an extent map there that covers the range. If it does
>    not find one, then it will search the inode's subvolume b+tree for a
>    fitting file extent item. After finding the file extent item, it will
>    allocate an extent map, fill it in with information extracted from the
>    file extent item, and add it to the inode's extent map tree (which
>    requires a search for insertion in the tree).
> 
> 2) Then we enter the main loop at extent_fiemap(), emit the details of
>    the extent, and call again get_extent_skip_holes(), with a start
>    offset matching the end of the extent map we previously processed.
> 
>    We end up at btrfs_get_extent() again, will search the extent map tree
>    and then search the subvolume b+tree for a file extent item if we could
>    not find an extent map in the extent tree. We allocate an extent map,
>    fill it in with the details in the file extent item, and then insert
>    it into the extent map tree (yet another search in this tree).
> 
> 3) The second step is repeated over and over, until we have processed the
>    whole file range. Each iteration ends at btrfs_get_extent(), which
>    does a red black tree search on the extent map tree, then searches the
>    subvolume b+tree, allocates an extent map and then does another search
>    in the extent map tree in order to insert the extent map.
> 
>    In the best scenario we have all the extent maps already in the extent
>    tree, and so for each extent we do a single search on a red black tree,
>    so we have a complexity of O(n log n).
> 
>    In the worst scenario we don't have any extent map already loaded in
>    the extent map tree, or have very few already there. In this case the
>    complexity is much higher since we do:
> 
>    - A red black tree search on the extent map tree, which has O(log n)
>      complexity, initially very fast since the tree is empty or very
>      small, but as we end up allocating extent maps and adding them to
>      the tree when we don't find them there, each subsequent search on
>      the tree gets slower, since it's getting bigger and bigger after
>      each iteration.
> 
>    - A search on the subvolume b+tree, also O(log n) complexity, but it
>      has items for all inodes in the subvolume, not just items for our
>      inode. Plus on a filesystem with concurrent operations on other
>      inodes, we can block doing the search due to lock contention on
>      b+tree nodes/leaves.
> 
>    - Allocate an extent map - this can block, and can also fail if we
>      are under serious memory pressure.
> 
>    - Do another search on the extent maps red black tree, with the goal
>      of inserting the extent map we just allocated. Again, after every
>      iteration this tree is getting bigger by 1 element, so after many
>      iterations the searches are slower and slower.
> 
>    - We will not need the allocated extent map anymore, so it's pointless
>      to add it to the extent map tree. It's just wasting time and memory.
> 
>    In short we end up searching the extent map tree multiple times, on a
>    tree that is growing bigger and bigger after each iteration. And
>    besides that we visit the same leaf of the subvolume b+tree many times,
>    since a leaf with the default size of 16K can easily have more than 200
>    file extent items.
> 
> This is very inneficient overall. This patch changes the algorithm to
> instead iterate over the subvolume b+tree, visiting each leaf only once,
> and only searching in the extent map tree for file ranges that have holes
> or prealloc extents, in order to figure out if we have delalloc there.
> It will never allocate an extent map and add it to the extent map tree.
> This is very similar to what was previously done for the lseek's hole and
> data seeking features.
> 
> Also, the current implementation relying on extent maps for figuring out
> which extents we have is not correct. This is because extent maps can be
> merged even if they represent different extents - we do this to minimize
> memory utilization and keep extent map trees smaller. For example if we
> have two extents that are contiguous on disk, once we load the two extent
> maps, they get merged into a single one - however if only one of the
> extents is shared, we end up reporting both as shared or both as not
> shared, which is incorrect.
> 
> This reproducer triggers that bug:
> 
>     $ cat fiemap-bug.sh
>     #!/bin/bash
> 
>     DEV=/dev/sdj
>     MNT=/mnt/sdj
> 
>     mkfs.btrfs -f $DEV
>     mount $DEV $MNT
> 
>     # Create a file with two 256K extents.
>     # Since there is no other write activity, they will be contiguous,
>     # and their extent maps merged, despite having two distinct extents.
>     xfs_io -f -c "pwrite -S 0xab 0 256K" \
>               -c "fsync" \
>               -c "pwrite -S 0xcd 256K 256K" \
>               -c "fsync" \
>               $MNT/foo
> 
>     # Now clone only the second extent into another file.
>     xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
> 
>     # Filefrag will report a single 512K extent, and say it's not shared.
>     echo
>     filefrag -v $MNT/foo
> 
>     umount $MNT
> 
> Running the reproducer:
> 
>     $ ./fiemap-bug.sh
>     wrote 262144/262144 bytes at offset 0
>     256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
>     wrote 262144/262144 bytes at offset 262144
>     256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
>     linked 262144/262144 bytes at offset 0
>     256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
> 
>     Filesystem type is: 9123683e
>     File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
>      ext:     logical_offset:        physical_offset: length:   expected: flags:
>        0:        0..     127:       3328..      3455:    128:             last,eof
>     /mnt/sdj/foo: 1 extent found
> 
> We end up reporting that we have a single 512K that is not shared, however
> we have two 256K extents, and the second one is shared. Changing the
> reproducer to clone instead the first extent into file 'bar', makes us
> report a single 512K extent that is shared, which is algo incorrect since
> we have two 256K extents and only the first one is shared.
> 
> This patch is part of a larger patchset that is comprised of the following
> patches:
> 
>     btrfs: allow hole and data seeking to be interruptible
>     btrfs: make hole and data seeking a lot more efficient
>     btrfs: remove check for impossible block start for an extent map at fiemap
>     btrfs: remove zero length check when entering fiemap
>     btrfs: properly flush delalloc when entering fiemap
>     btrfs: allow fiemap to be interruptible
>     btrfs: rename btrfs_check_shared() to a more descriptive name
>     btrfs: speedup checking for extent sharedness during fiemap
>     btrfs: skip unnecessary extent buffer sharedness checks during fiemap
>     btrfs: make fiemap more efficient and accurate reporting extent sharedness
> 
> The patchset was tested on a machine running a non-debug kernel (Debian's
> default config) and compared the tests below on a branch without the
> patchset versus the same branch with the whole patchset applied.
> 
> The following test for a large compressed file without holes:
> 
>     $ cat fiemap-perf-test.sh
>     #!/bin/bash
> 
>     DEV=/dev/sdi
>     MNT=/mnt/sdi
> 
>     mkfs.btrfs -f $DEV
>     mount -o compress=lzo $DEV $MNT
> 
>     # 40G gives 327680 128K file extents (due to compression).
>     xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
> 
>     umount $MNT
>     mount -o compress=lzo $DEV $MNT
> 
>     start=$(date +%s%N)
>     filefrag $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "fiemap took $dur milliseconds (metadata not cached)"
> 
>     start=$(date +%s%N)
>     filefrag $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "fiemap took $dur milliseconds (metadata cached)"
> 
>     umount $MNT
> 
> Before patchset:
> 
>     $ ./fiemap-perf-test.sh
>     (...)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 3597 milliseconds (metadata not cached)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 2107 milliseconds (metadata cached)
> 
> After patchset:
> 
>     $ ./fiemap-perf-test.sh
>     (...)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 1214 milliseconds (metadata not cached)
>     /mnt/sdi/foobar: 327680 extents found
>     fiemap took 684 milliseconds (metadata cached)
> 
> That's a speedup of about 3x for both cases (no metadata cached and all
> metadata cached).
> 
> The test provided by Pavel (first Link tag at the bottom), which uses
> files with a large number of holes, was also used to measure the gains,
> and it consists on a small C program and a shell script to invoke it.
> The C program is the following:
> 
>     $ cat pavels-test.c
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <stdlib.h>
>     #include <fcntl.h>
> 
>     #include <sys/stat.h>
>     #include <sys/time.h>
>     #include <sys/ioctl.h>
> 
>     #include <linux/fs.h>
>     #include <linux/fiemap.h>
> 
>     #define FILE_INTERVAL (1<<13) /* 8Kb */
> 
>     long long interval(struct timeval t1, struct timeval t2)
>     {
>         long long val = 0;
>         val += (t2.tv_usec - t1.tv_usec);
>         val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
>         return val;
>     }
> 
>     int main(int argc, char **argv)
>     {
>         struct fiemap fiemap = {};
>         struct timeval t1, t2;
>         char data = 'a';
>         struct stat st;
>         int fd, off, file_size = FILE_INTERVAL;
> 
>         if (argc != 3 && argc != 2) {
>                 printf("usage: %s <path> [size]\n", argv[0]);
>                 return 1;
>         }
> 
>         if (argc == 3)
>                 file_size = atoi(argv[2]);
>         if (file_size < FILE_INTERVAL)
>                 file_size = FILE_INTERVAL;
>         file_size -= file_size % FILE_INTERVAL;
> 
>         fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
>         if (fd < 0) {
>             perror("open");
>             return 1;
>         }
> 
>         for (off = 0; off < file_size; off += FILE_INTERVAL) {
>             if (pwrite(fd, &data, 1, off) != 1) {
>                 perror("pwrite");
>                 close(fd);
>                 return 1;
>             }
>         }
> 
>         if (ftruncate(fd, file_size)) {
>             perror("ftruncate");
>             close(fd);
>             return 1;
>         }
> 
>         if (fstat(fd, &st) < 0) {
>             perror("fstat");
>             close(fd);
>             return 1;
>         }
> 
>         printf("size: %ld\n", st.st_size);
>         printf("actual size: %ld\n", st.st_blocks * 512);
> 
>         fiemap.fm_length = FIEMAP_MAX_OFFSET;
>         gettimeofday(&t1, NULL);
>         if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
>             perror("fiemap");
>             close(fd);
>             return 1;
>         }
>         gettimeofday(&t2, NULL);
> 
>         printf("fiemap: fm_mapped_extents = %d\n",
>                fiemap.fm_mapped_extents);
>         printf("time = %lld us\n", interval(t1, t2));
> 
>         close(fd);
>         return 0;
>     }
> 
>     $ gcc -o pavels_test pavels_test.c
> 
> And the wrapper shell script:
> 
>     $ cat fiemap-pavels-test.sh
> 
>     #!/bin/bash
> 
>     DEV=/dev/sdi
>     MNT=/mnt/sdi
> 
>     mkfs.btrfs -f -O no-holes $DEV
>     mount $DEV $MNT
> 
>     echo
>     echo "*********** 256M ***********"
>     echo
> 
>     ./pavels-test $MNT/testfile $((1 << 28))
>     echo
>     ./pavels-test $MNT/testfile $((1 << 28))
> 
>     echo
>     echo "*********** 512M ***********"
>     echo
> 
>     ./pavels-test $MNT/testfile $((1 << 29))
>     echo
>     ./pavels-test $MNT/testfile $((1 << 29))
> 
>     echo
>     echo "*********** 1G ***********"
>     echo
> 
>     ./pavels-test $MNT/testfile $((1 << 30))
>     echo
>     ./pavels-test $MNT/testfile $((1 << 30))
> 
>     umount $MNT
> 
> Running his reproducer before applying the patchset:
> 
>     *********** 256M ***********
> 
>     size: 268435456
>     actual size: 134217728
>     fiemap: fm_mapped_extents = 32768
>     time = 4003133 us
> 
>     size: 268435456
>     actual size: 134217728
>     fiemap: fm_mapped_extents = 32768
>     time = 4895330 us
> 
>     *********** 512M ***********
> 
>     size: 536870912
>     actual size: 268435456
>     fiemap: fm_mapped_extents = 65536
>     time = 30123675 us
> 
>     size: 536870912
>     actual size: 268435456
>     fiemap: fm_mapped_extents = 65536
>     time = 33450934 us
> 
>     *********** 1G ***********
> 
>     size: 1073741824
>     actual size: 536870912
>     fiemap: fm_mapped_extents = 131072
>     time = 224924074 us
> 
>     size: 1073741824
>     actual size: 536870912
>     fiemap: fm_mapped_extents = 131072
>     time = 217239242 us
> 
> Running it after applying the patchset:
> 
>     *********** 256M ***********
> 
>     size: 268435456
>     actual size: 134217728
>     fiemap: fm_mapped_extents = 32768
>     time = 29475 us
> 
>     size: 268435456
>     actual size: 134217728
>     fiemap: fm_mapped_extents = 32768
>     time = 29307 us
> 
>     *********** 512M ***********
> 
>     size: 536870912
>     actual size: 268435456
>     fiemap: fm_mapped_extents = 65536
>     time = 58996 us
> 
>     size: 536870912
>     actual size: 268435456
>     fiemap: fm_mapped_extents = 65536
>     time = 59115 us
> 
>     *********** 1G ***********
> 
>     size: 1073741824
>     actual size: 536870912
>     fiemap: fm_mapped_extents = 116251
>     time = 124141 us
> 
>     size: 1073741824
>     actual size: 536870912
>     fiemap: fm_mapped_extents = 131072
>     time = 119387 us
> 
> The speedup is massive, both on the first fiemap call and on the second
> one as well, as his test creates files with many holes and small extents
> (every extent follows a hole and precedes another hole).
> 
> For the 256M file we go from 4 seconds down to 29 milliseconds in the
> first run, and then from 4.9 seconds down to 29 milliseconds again in the
> second run, a speedup of 138x and 169x, respectively.
> 
> For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
> first run, and then from 33.5 seconds down to 59 milliseconds again in the
> second run, a speedup of 510x and 568x, respectively.
> 
> For the 1G file, we go from 225 seconds down to 124 milliseconds in the
> first run, and then from 217 seconds down to 119 milliseconds in the
> second run, a speedup of 1815x and 1824x, respectively.
> 
> Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
> Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
> Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/ctree.h     |   4 +-
>  fs/btrfs/extent_io.c | 714 +++++++++++++++++++++++++++++--------------
>  fs/btrfs/file.c      |  16 +-
>  fs/btrfs/inode.c     | 140 +--------
>  4 files changed, 506 insertions(+), 368 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f7fe7f633eb5..7b266f9dc8b4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3402,8 +3402,6 @@ unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
>  				    u64 start, u64 end);
>  int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
>  			  u32 bio_offset, struct page *page, u32 pgoff);
> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> -					   u64 start, u64 len);
>  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>  			      u64 *orig_start, u64 *orig_block_len,
>  			      u64 *ram_bytes, bool strict);
> @@ -3583,6 +3581,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
>  int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
>  			   size_t *write_bytes);
>  void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				  u64 *delalloc_start_ret, u64 *delalloc_end_ret);
>  
>  /* tree-defrag.c */
>  int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0e3fa9b08aaf..50bb2182e795 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5353,42 +5353,6 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
>  	return try_release_extent_state(tree, page, mask);
>  }
>  
> -/*
> - * helper function for fiemap, which doesn't want to see any holes.
> - * This maps until we find something past 'last'
> - */
> -static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
> -						u64 offset, u64 last)
> -{
> -	u64 sectorsize = btrfs_inode_sectorsize(inode);
> -	struct extent_map *em;
> -	u64 len;
> -
> -	if (offset >= last)
> -		return NULL;
> -
> -	while (1) {
> -		len = last - offset;
> -		if (len == 0)
> -			break;
> -		len = ALIGN(len, sectorsize);
> -		em = btrfs_get_extent_fiemap(inode, offset, len);
> -		if (IS_ERR(em))
> -			return em;
> -
> -		/* if this isn't a hole return it */
> -		if (em->block_start != EXTENT_MAP_HOLE)
> -			return em;
> -
> -		/* this is a hole, advance to the next extent */
> -		offset = extent_map_end(em);
> -		free_extent_map(em);
> -		if (offset >= last)
> -			break;
> -	}
> -	return NULL;
> -}
> -
>  /*
>   * To cache previous fiemap extent
>   *
> @@ -5418,6 +5382,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>  {
>  	int ret = 0;
>  
> +	/* Set at the end of extent_fiemap(). */
> +	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
> +
>  	if (!cache->cached)
>  		goto assign;
>  
> @@ -5446,11 +5413,10 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>  	 */
>  	if (cache->offset + cache->len  == offset &&
>  	    cache->phys + cache->len == phys  &&
> -	    (cache->flags & ~FIEMAP_EXTENT_LAST) ==
> -			(flags & ~FIEMAP_EXTENT_LAST)) {
> +	    cache->flags == flags) {
>  		cache->len += len;
>  		cache->flags |= flags;
> -		goto try_submit_last;
> +		return 0;
>  	}
>  
>  	/* Not mergeable, need to submit cached one */
> @@ -5465,13 +5431,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>  	cache->phys = phys;
>  	cache->len = len;
>  	cache->flags = flags;
> -try_submit_last:
> -	if (cache->flags & FIEMAP_EXTENT_LAST) {
> -		ret = fiemap_fill_next_extent(fieinfo, cache->offset,
> -				cache->phys, cache->len, cache->flags);
> -		cache->cached = false;
> -	}
> -	return ret;
> +
> +	return 0;
>  }
>  
>  /*
> @@ -5501,229 +5462,534 @@ static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
>  	return ret;
>  }
>  
> -int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> -		  u64 start, u64 len)
> +static int fiemap_next_leaf_item(struct btrfs_inode *inode,
> +				 struct btrfs_path *path)
>  {
> -	int ret = 0;
> -	u64 off;
> -	u64 max = start + len;
> -	u32 flags = 0;
> -	u32 found_type;
> -	u64 last;
> -	u64 last_for_get_extent = 0;
> -	u64 disko = 0;
> -	u64 isize = i_size_read(&inode->vfs_inode);
> -	struct btrfs_key found_key;
> -	struct extent_map *em = NULL;
> -	struct extent_state *cached_state = NULL;
> -	struct btrfs_path *path;
> +	struct extent_buffer *clone;
> +	struct btrfs_key key;
> +	int slot;
> +	int ret;
> +
> +	path->slots[0]++;
> +	if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
> +		return 0;
> +
> +	ret = btrfs_next_leaf(inode->root, path);
> +	if (ret != 0)
> +		return ret;
> +
> +	/*
> +	 * Don't bother with cloning if there are no more file extent items for
> +	 * our inode.
> +	 */
> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +	if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY)
> +		return 1;
> +
> +	/* See the comment at fiemap_search_slot() about why we clone. */
> +	clone = btrfs_clone_extent_buffer(path->nodes[0]);
> +	if (!clone)
> +		return -ENOMEM;
> +
> +	slot = path->slots[0];
> +	btrfs_release_path(path);
> +	path->nodes[0] = clone;
> +	path->slots[0] = slot;
> +
> +	return 0;
> +}
> +
> +/*
> + * Search for the first file extent item that starts at a given file offset or
> + * the one that starts immediately before that offset.
> + * Returns: 0 on success, < 0 on error, 1 if not found.
> + */
> +static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
> +			      u64 file_offset)
> +{
> +	const u64 ino = btrfs_ino(inode);
>  	struct btrfs_root *root = inode->root;
> -	struct fiemap_cache cache = { 0 };
> -	struct btrfs_backref_shared_cache *backref_cache;
> -	struct ulist *roots;
> -	struct ulist *tmp_ulist;
> -	int end = 0;
> -	u64 em_start = 0;
> -	u64 em_len = 0;
> -	u64 em_end = 0;
> +	struct extent_buffer *clone;
> +	struct btrfs_key key;
> +	int slot;
> +	int ret;
>  
> -	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> -	path = btrfs_alloc_path();
> -	roots = ulist_alloc(GFP_KERNEL);
> -	tmp_ulist = ulist_alloc(GFP_KERNEL);
> -	if (!backref_cache || !path || !roots || !tmp_ulist) {
> -		ret = -ENOMEM;
> -		goto out_free_ulist;
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = file_offset;
> +
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ret > 0 && path->slots[0] > 0) {
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> +		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> +			path->slots[0]--;
> +	}
> +
> +	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> +		ret = btrfs_next_leaf(root, path);
> +		if (ret != 0)
> +			return ret;
> +
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> +			return 1;
>  	}
>  
>  	/*
> -	 * We can't initialize that to 'start' as this could miss extents due
> -	 * to extent item merging
> +	 * We clone the leaf and use it during fiemap. This is because while
> +	 * using the leaf we do expensive things like checking if an extent is
> +	 * shared, which can take a long time. In order to prevent blocking
> +	 * other tasks for too long, we use a clone of the leaf. We have locked
> +	 * the file range in the inode's io tree, so we know none of our file
> +	 * extent items can change. This way we avoid blocking other tasks that
> +	 * want to insert items for other inodes in the same leaf or b+tree
> +	 * rebalance operations (triggered for example when someone is trying
> +	 * to push items into this leaf when trying to insert an item in a
> +	 * neighbour leaf).
> +	 * We also need the private clone because holding a read lock on an
> +	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
> +	 * when we call fiemap_fill_next_extent(), because that may cause a page
> +	 * fault when filling the user space buffer with fiemap data.
>  	 */
> -	off = 0;
> -	start = round_down(start, btrfs_inode_sectorsize(inode));
> -	len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
> +	clone = btrfs_clone_extent_buffer(path->nodes[0]);
> +	if (!clone)
> +		return -ENOMEM;
> +
> +	slot = path->slots[0];
> +	btrfs_release_path(path);
> +	path->nodes[0] = clone;
> +	path->slots[0] = slot;
> +
> +	return 0;
> +}
> +
> +/*
> + * Process a range which is a hole or a prealloc extent in the inode's subvolume
> + * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
> + * extent. The end offset (@end) is inclusive.
> + */
> +static int fiemap_process_hole(struct btrfs_inode *inode,
> +			       struct fiemap_extent_info *fieinfo,
> +			       struct fiemap_cache *cache,
> +			       struct btrfs_backref_shared_cache *backref_cache,
> +			       u64 disk_bytenr, u64 extent_offset,
> +			       u64 extent_gen,
> +			       struct ulist *roots, struct ulist *tmp_ulist,
> +			       u64 start, u64 end)
> +{
> +	const u64 i_size = i_size_read(&inode->vfs_inode);
> +	const u64 ino = btrfs_ino(inode);
> +	u64 cur_offset = start;
> +	u64 last_delalloc_end = 0;
> +	u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
> +	bool checked_extent_shared = false;
> +	int ret;
>  
>  	/*
> -	 * lookup the last file extent.  We're not using i_size here
> -	 * because there might be preallocation past i_size
> +	 * There can be no delalloc past i_size, so don't waste time looking for
> +	 * it beyond i_size.
>  	 */
> -	ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1,
> -				       0);
> -	if (ret < 0) {
> -		goto out_free_ulist;
> -	} else {
> -		WARN_ON(!ret);
> -		if (ret == 1)
> -			ret = 0;
> -	}
> +	while (cur_offset < end && cur_offset < i_size) {
> +		u64 delalloc_start;
> +		u64 delalloc_end;
> +		u64 prealloc_start;
> +		u64 prealloc_len = 0;
> +		bool delalloc;
> +
> +		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
> +							&delalloc_start,
> +							&delalloc_end);
> +		if (!delalloc)
> +			break;
>  
> -	path->slots[0]--;
> -	btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
> -	found_type = found_key.type;
> -
> -	/* No extents, but there might be delalloc bits */
> -	if (found_key.objectid != btrfs_ino(inode) ||
> -	    found_type != BTRFS_EXTENT_DATA_KEY) {
> -		/* have to trust i_size as the end */
> -		last = (u64)-1;
> -		last_for_get_extent = isize;
> -	} else {
>  		/*
> -		 * remember the start of the last extent.  There are a
> -		 * bunch of different factors that go into the length of the
> -		 * extent, so its much less complex to remember where it started
> +		 * If this is a prealloc extent we have to report every section
> +		 * of it that has no delalloc.
>  		 */
> -		last = found_key.offset;
> -		last_for_get_extent = last + 1;
> +		if (disk_bytenr != 0) {
> +			if (last_delalloc_end == 0) {
> +				prealloc_start = start;
> +				prealloc_len = delalloc_start - start;
> +			} else {
> +				prealloc_start = last_delalloc_end + 1;
> +				prealloc_len = delalloc_start - prealloc_start;
> +			}
> +		}
> +
> +		if (prealloc_len > 0) {
> +			if (!checked_extent_shared && fieinfo->fi_extents_max) {
> +				ret = btrfs_is_data_extent_shared(inode->root,
> +							  ino, disk_bytenr,
> +							  extent_gen, roots,
> +							  tmp_ulist,
> +							  backref_cache);
> +				if (ret < 0)
> +					return ret;
> +				else if (ret > 0)
> +					prealloc_flags |= FIEMAP_EXTENT_SHARED;
> +
> +				checked_extent_shared = true;
> +			}
> +			ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> +						 disk_bytenr + extent_offset,
> +						 prealloc_len, prealloc_flags);
> +			if (ret)
> +				return ret;
> +			extent_offset += prealloc_len;
> +		}
> +
> +		ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
> +					 delalloc_end + 1 - delalloc_start,
> +					 FIEMAP_EXTENT_DELALLOC |
> +					 FIEMAP_EXTENT_UNKNOWN);
> +		if (ret)
> +			return ret;
> +
> +		last_delalloc_end = delalloc_end;
> +		cur_offset = delalloc_end + 1;
> +		extent_offset += cur_offset - delalloc_start;
> +		cond_resched();
> +	}
> +
> +	/*
> +	 * Either we found no delalloc for the whole prealloc extent or we have
> +	 * a prealloc extent that spans i_size or starts at or after i_size.
> +	 */
> +	if (disk_bytenr != 0 && last_delalloc_end < end) {
> +		u64 prealloc_start;
> +		u64 prealloc_len;
> +
> +		if (last_delalloc_end == 0) {
> +			prealloc_start = start;
> +			prealloc_len = end + 1 - start;
> +		} else {
> +			prealloc_start = last_delalloc_end + 1;
> +			prealloc_len = end + 1 - prealloc_start;
> +		}
> +
> +		if (!checked_extent_shared && fieinfo->fi_extents_max) {
> +			ret = btrfs_is_data_extent_shared(inode->root,
> +							  ino, disk_bytenr,
> +							  extent_gen, roots,
> +							  tmp_ulist,
> +							  backref_cache);
> +			if (ret < 0)
> +				return ret;
> +			else if (ret > 0)
> +				prealloc_flags |= FIEMAP_EXTENT_SHARED;
> +		}
> +		ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
> +					 disk_bytenr + extent_offset,
> +					 prealloc_len, prealloc_flags);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
> +					  struct btrfs_path *path,
> +					  u64 *last_extent_end_ret)
> +{
> +	const u64 ino = btrfs_ino(inode);
> +	struct btrfs_root *root = inode->root;
> +	struct extent_buffer *leaf;
> +	struct btrfs_file_extent_item *ei;
> +	struct btrfs_key key;
> +	u64 disk_bytenr;
> +	int ret;
> +
> +	/*
> +	 * Lookup the last file extent. We're not using i_size here because
> +	 * there might be preallocation past i_size.
> +	 */
> +	ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
> +	/* There can't be a file extent item at offset (u64)-1 */
> +	ASSERT(ret != 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * For a non-existing key, btrfs_search_slot() always leaves us at a
> +	 * slot > 0, except if the btree is empty, which is impossible because
> +	 * at least it has the inode item for this inode and all the items for
> +	 * the root inode 256.
> +	 */
> +	ASSERT(path->slots[0] > 0);
> +	path->slots[0]--;
> +	leaf = path->nodes[0];
> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +	if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> +		/* No file extent items in the subvolume tree. */
> +		*last_extent_end_ret = 0;
> +		return 0;
>  	}
> -	btrfs_release_path(path);
>  
>  	/*
> -	 * we might have some extents allocated but more delalloc past those
> -	 * extents.  so, we trust isize unless the start of the last extent is
> -	 * beyond isize
> +	 * For an inline extent, the disk_bytenr is where inline data starts at,
> +	 * so first check if we have an inline extent item before checking if we
> +	 * have an implicit hole (disk_bytenr == 0).
>  	 */
> -	if (last < isize) {
> -		last = (u64)-1;
> -		last_for_get_extent = isize;
> +	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> +	if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
> +		*last_extent_end_ret = btrfs_file_extent_end(path);
> +		return 0;
>  	}
>  
> -	lock_extent_bits(&inode->io_tree, start, start + len - 1,
> -			 &cached_state);
> +	/*
> +	 * Find the last file extent item that is not a hole (when NO_HOLES is
> +	 * not enabled). This should take at most 2 iterations in the worst
> +	 * case: we have one hole file extent item at slot 0 of a leaf and
> +	 * another hole file extent item as the last item in the previous leaf.
> +	 * This is because we merge file extent items that represent holes.
> +	 */
> +	disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +	while (disk_bytenr == 0) {
> +		ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
> +		if (ret < 0) {
> +			return ret;
> +		} else if (ret > 0) {
> +			/* No file extent items that are not holes. */
> +			*last_extent_end_ret = 0;
> +			return 0;
> +		}
> +		leaf = path->nodes[0];
> +		ei = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +	}
>  
> -	em = get_extent_skip_holes(inode, start, last_for_get_extent);
> -	if (!em)
> -		goto out;
> -	if (IS_ERR(em)) {
> -		ret = PTR_ERR(em);
> +	*last_extent_end_ret = btrfs_file_extent_end(path);
> +	return 0;
> +}
> +
> +int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
> +		  u64 start, u64 len)
> +{
> +	const u64 ino = btrfs_ino(inode);
> +	struct extent_state *cached_state = NULL;
> +	struct btrfs_path *path;
> +	struct btrfs_root *root = inode->root;
> +	struct fiemap_cache cache = { 0 };
> +	struct btrfs_backref_shared_cache *backref_cache;
> +	struct ulist *roots;
> +	struct ulist *tmp_ulist;
> +	u64 last_extent_end;
> +	u64 prev_extent_end;
> +	u64 lockstart;
> +	u64 lockend;
> +	bool stopped = false;
> +	int ret;
> +
> +	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
> +	path = btrfs_alloc_path();
> +	roots = ulist_alloc(GFP_KERNEL);
> +	tmp_ulist = ulist_alloc(GFP_KERNEL);
> +	if (!backref_cache || !path || !roots || !tmp_ulist) {
> +		ret = -ENOMEM;
>  		goto out;
>  	}
>  
> -	while (!end) {
> -		u64 offset_in_extent = 0;
> +	lockstart = round_down(start, btrfs_inode_sectorsize(inode));
> +	lockend = round_up(start + len, btrfs_inode_sectorsize(inode));
> +	prev_extent_end = lockstart;
>  
> -		/* break if the extent we found is outside the range */
> -		if (em->start >= max || extent_map_end(em) < off)
> -			break;
> +	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>  
> -		/*
> -		 * get_extent may return an extent that starts before our
> -		 * requested range.  We have to make sure the ranges
> -		 * we return to fiemap always move forward and don't
> -		 * overlap, so adjust the offsets here
> -		 */
> -		em_start = max(em->start, off);
> +	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
> +	if (ret < 0)
> +		goto out_unlock;
> +	btrfs_release_path(path);
>  
> +	path->reada = READA_FORWARD;
> +	ret = fiemap_search_slot(inode, path, lockstart);
> +	if (ret < 0) {
> +		goto out_unlock;
> +	} else if (ret > 0) {
>  		/*
> -		 * record the offset from the start of the extent
> -		 * for adjusting the disk offset below.  Only do this if the
> -		 * extent isn't compressed since our in ram offset may be past
> -		 * what we have actually allocated on disk.
> +		 * No file extent item found, but we may have delalloc between
> +		 * the current offset and i_size. So check for that.
>  		 */
> -		if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> -			offset_in_extent = em_start - em->start;
> -		em_end = extent_map_end(em);
> -		em_len = em_end - em_start;
> -		flags = 0;
> -		if (em->block_start < EXTENT_MAP_LAST_BYTE)
> -			disko = em->block_start + offset_in_extent;
> -		else
> -			disko = 0;
> +		ret = 0;
> +		goto check_eof_delalloc;
> +	}
> +
> +	while (prev_extent_end < lockend) {
> +		struct extent_buffer *leaf = path->nodes[0];
> +		struct btrfs_file_extent_item *ei;
> +		struct btrfs_key key;
> +		u64 extent_end;
> +		u64 extent_len;
> +		u64 extent_offset = 0;
> +		u64 extent_gen;
> +		u64 disk_bytenr = 0;
> +		u64 flags = 0;
> +		int extent_type;
> +		u8 compression;
> +
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> +			break;
> +
> +		extent_end = btrfs_file_extent_end(path);
>  
>  		/*
> -		 * bump off for our next call to get_extent
> +		 * The first iteration can leave us at an extent item that ends
> +		 * before our range's start. Move to the next item.
>  		 */
> -		off = extent_map_end(em);
> -		if (off >= max)
> -			end = 1;
> -
> -		if (em->block_start == EXTENT_MAP_INLINE) {
> -			flags |= (FIEMAP_EXTENT_DATA_INLINE |
> -				  FIEMAP_EXTENT_NOT_ALIGNED);
> -		} else if (em->block_start == EXTENT_MAP_DELALLOC) {
> -			flags |= (FIEMAP_EXTENT_DELALLOC |
> -				  FIEMAP_EXTENT_UNKNOWN);
> -		} else if (fieinfo->fi_extents_max) {
> -			u64 extent_gen;
> -			u64 bytenr = em->block_start -
> -				(em->start - em->orig_start);
> +		if (extent_end <= lockstart)
> +			goto next_item;
>  
> -			/*
> -			 * If two extent maps are merged, then their generation
> -			 * is set to the maximum between their generations.
> -			 * Otherwise its generation matches the one we have in
> -			 * corresponding file extent item. If we have a merged
> -			 * extent map, don't use its generation to speedup the
> -			 * sharedness check below.
> -			 */
> -			if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
> -				extent_gen = 0;
> -			else
> -				extent_gen = em->generation;
> +		/* We have in implicit hole (NO_HOLES feature enabled). */
> +		if (prev_extent_end < key.offset) {
> +			const u64 range_end = min(key.offset, lockend) - 1;
>  
> -			/*
> -			 * As btrfs supports shared space, this information
> -			 * can be exported to userspace tools via
> -			 * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
> -			 * then we're just getting a count and we can skip the
> -			 * lookup stuff.
> -			 */
> -			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
> -							  bytenr, extent_gen,
> -							  roots, tmp_ulist,
> -							  backref_cache);
> -			if (ret < 0)
> -				goto out_free;
> -			if (ret)
> -				flags |= FIEMAP_EXTENT_SHARED;
> -			ret = 0;
> -		}
> -		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
> -			flags |= FIEMAP_EXTENT_ENCODED;
> -		if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> -			flags |= FIEMAP_EXTENT_UNWRITTEN;
> +			ret = fiemap_process_hole(inode, fieinfo, &cache,
> +						  backref_cache, 0, 0, 0,
> +						  roots, tmp_ulist,
> +						  prev_extent_end, range_end);
> +			if (ret < 0) {
> +				goto out_unlock;
> +			} else if (ret > 0) {
> +				/* fiemap_fill_next_extent() told us to stop. */
> +				stopped = true;
> +				break;
> +			}
>  
> -		free_extent_map(em);
> -		em = NULL;
> -		if ((em_start >= last) || em_len == (u64)-1 ||
> -		   (last == (u64)-1 && isize <= em_end)) {
> -			flags |= FIEMAP_EXTENT_LAST;
> -			end = 1;
> +			/* We've reached the end of the fiemap range, stop. */
> +			if (key.offset >= lockend) {
> +				stopped = true;
> +				break;
> +			}
>  		}
>  
> -		/* now scan forward to see if this is really the last extent. */
> -		em = get_extent_skip_holes(inode, off, last_for_get_extent);
> -		if (IS_ERR(em)) {
> -			ret = PTR_ERR(em);
> -			goto out;
> +		extent_len = extent_end - key.offset;
> +		ei = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		compression = btrfs_file_extent_compression(leaf, ei);
> +		extent_type = btrfs_file_extent_type(leaf, ei);
> +		extent_gen = btrfs_file_extent_generation(leaf, ei);
> +
> +		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
> +			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +			if (compression == BTRFS_COMPRESS_NONE)
> +				extent_offset = btrfs_file_extent_offset(leaf, ei);
>  		}
> -		if (!em) {
> -			flags |= FIEMAP_EXTENT_LAST;
> -			end = 1;
> +
> +		if (compression != BTRFS_COMPRESS_NONE)
> +			flags |= FIEMAP_EXTENT_ENCODED;
> +
> +		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> +			flags |= FIEMAP_EXTENT_DATA_INLINE;
> +			flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> +			ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
> +						 extent_len, flags);
> +		} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
> +			ret = fiemap_process_hole(inode, fieinfo, &cache,
> +						  backref_cache,
> +						  disk_bytenr, extent_offset,
> +						  extent_gen, roots, tmp_ulist,
> +						  key.offset, extent_end - 1);
> +		} else if (disk_bytenr == 0) {
> +			/* We have an explicit hole. */
> +			ret = fiemap_process_hole(inode, fieinfo, &cache,
> +						  backref_cache, 0, 0, 0,
> +						  roots, tmp_ulist,
> +						  key.offset, extent_end - 1);
> +		} else {
> +			/* We have a regular extent. */
> +			if (fieinfo->fi_extents_max) {
> +				ret = btrfs_is_data_extent_shared(root, ino,
> +								  disk_bytenr,
> +								  extent_gen,
> +								  roots,
> +								  tmp_ulist,
> +								  backref_cache);
> +				if (ret < 0)
> +					goto out_unlock;
> +				else if (ret > 0)
> +					flags |= FIEMAP_EXTENT_SHARED;
> +			}
> +
> +			ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
> +						 disk_bytenr + extent_offset,
> +						 extent_len, flags);
>  		}
> -		ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
> -					   em_len, flags);
> -		if (ret) {
> -			if (ret == 1)
> -				ret = 0;
> -			goto out_free;
> +
> +		if (ret < 0) {
> +			goto out_unlock;
> +		} else if (ret > 0) {
> +			/* fiemap_fill_next_extent() told us to stop. */
> +			stopped = true;
> +			break;
>  		}
>  
> +		prev_extent_end = extent_end;
> +next_item:
>  		if (fatal_signal_pending(current)) {
>  			ret = -EINTR;
> -			goto out_free;
> +			goto out_unlock;
>  		}
> +
> +		ret = fiemap_next_leaf_item(inode, path);
> +		if (ret < 0) {
> +			goto out_unlock;
> +		} else if (ret > 0) {
> +			/* No more file extent items for this inode. */
> +			break;
> +		}
> +		cond_resched();
>  	}
> -out_free:
> -	if (!ret)
> -		ret = emit_last_fiemap_cache(fieinfo, &cache);
> -	free_extent_map(em);
> -out:
> -	unlock_extent_cached(&inode->io_tree, start, start + len - 1,
> -			     &cached_state);
>  
> -out_free_ulist:
> +check_eof_delalloc:
> +	/*
> +	 * Release (and free) the path before emitting any final entries to
> +	 * fiemap_fill_next_extent() to keep lockdep happy. This is because
> +	 * once we find no more file extent items exist, we may have a
> +	 * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
> +	 * faults when copying data to the user space buffer.
> +	 */
> +	btrfs_free_path(path);
> +	path = NULL;
> +
> +	if (!stopped && prev_extent_end < lockend) {
> +		ret = fiemap_process_hole(inode, fieinfo, &cache, backref_cache,
> +					  0, 0, 0, roots, tmp_ulist,
> +					  prev_extent_end, lockend - 1);
> +		if (ret < 0)
> +			goto out_unlock;
> +		prev_extent_end = lockend;
> +	}
> +
> +	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
> +		const u64 i_size = i_size_read(&inode->vfs_inode);
> +
> +		if (prev_extent_end < i_size) {
> +			u64 delalloc_start;
> +			u64 delalloc_end;
> +			bool delalloc;
> +
> +			delalloc = btrfs_find_delalloc_in_range(inode,
> +								prev_extent_end,
> +								i_size - 1,
> +								&delalloc_start,
> +								&delalloc_end);
> +			if (!delalloc)
> +				cache.flags |= FIEMAP_EXTENT_LAST;
> +		} else {
> +			cache.flags |= FIEMAP_EXTENT_LAST;
> +		}
> +	}
> +
> +	ret = emit_last_fiemap_cache(fieinfo, &cache);
> +
> +out_unlock:
> +	unlock_extent_cached(&inode->io_tree, lockstart, lockend, &cached_state);
> +out:
>  	kfree(backref_cache);
>  	btrfs_free_path(path);
>  	ulist_free(roots);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b292a8ada3a4..636b3ec46184 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3602,10 +3602,10 @@ static long btrfs_fallocate(struct file *file, int mode,
>  }
>  
>  /*
> - * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> - * has unflushed and/or flushing delalloc. There might be other adjacent
> - * subranges after the one it found, so have_delalloc_in_range() keeps looping
> - * while it gets adjacent subranges, and merging them together.
> + * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
> + * that has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so btrfs_find_delalloc_in_range() keeps
> + * looping while it gets adjacent subranges, and merging them together.
>   */
>  static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
>  				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> @@ -3740,8 +3740,8 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
>   * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
>   * end offsets of the subrange.
>   */
> -static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> -				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				  u64 *delalloc_start_ret, u64 *delalloc_end_ret)
>  {
>  	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
>  	u64 prev_delalloc_end = 0;
> @@ -3804,8 +3804,8 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
>  	u64 delalloc_end;
>  	bool delalloc;
>  
> -	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> -					  &delalloc_end);
> +	delalloc = btrfs_find_delalloc_in_range(inode, start, end,
> +						&delalloc_start, &delalloc_end);
>  	if (delalloc && whence == SEEK_DATA) {
>  		*start_ret = delalloc_start;
>  		return true;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2c7d31990777..8be1e021513a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7064,133 +7064,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>  	return em;
>  }
>  
> -struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
> -					   u64 start, u64 len)
> -{
> -	struct extent_map *em;
> -	struct extent_map *hole_em = NULL;
> -	u64 delalloc_start = start;
> -	u64 end;
> -	u64 delalloc_len;
> -	u64 delalloc_end;
> -	int err = 0;
> -
> -	em = btrfs_get_extent(inode, NULL, 0, start, len);
> -	if (IS_ERR(em))
> -		return em;
> -	/*
> -	 * If our em maps to:
> -	 * - a hole or
> -	 * - a pre-alloc extent,
> -	 * there might actually be delalloc bytes behind it.
> -	 */
> -	if (em->block_start != EXTENT_MAP_HOLE &&
> -	    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> -		return em;
> -	else
> -		hole_em = em;
> -
> -	/* check to see if we've wrapped (len == -1 or similar) */
> -	end = start + len;
> -	if (end < start)
> -		end = (u64)-1;
> -	else
> -		end -= 1;
> -
> -	em = NULL;
> -
> -	/* ok, we didn't find anything, lets look for delalloc */
> -	delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
> -				 end, len, EXTENT_DELALLOC, 1);
> -	delalloc_end = delalloc_start + delalloc_len;
> -	if (delalloc_end < delalloc_start)
> -		delalloc_end = (u64)-1;
> -
> -	/*
> -	 * We didn't find anything useful, return the original results from
> -	 * get_extent()
> -	 */
> -	if (delalloc_start > end || delalloc_end <= start) {
> -		em = hole_em;
> -		hole_em = NULL;
> -		goto out;
> -	}
> -
> -	/*
> -	 * Adjust the delalloc_start to make sure it doesn't go backwards from
> -	 * the start they passed in
> -	 */
> -	delalloc_start = max(start, delalloc_start);
> -	delalloc_len = delalloc_end - delalloc_start;
> -
> -	if (delalloc_len > 0) {
> -		u64 hole_start;
> -		u64 hole_len;
> -		const u64 hole_end = extent_map_end(hole_em);
> -
> -		em = alloc_extent_map();
> -		if (!em) {
> -			err = -ENOMEM;
> -			goto out;
> -		}
> -
> -		ASSERT(hole_em);
> -		/*
> -		 * When btrfs_get_extent can't find anything it returns one
> -		 * huge hole
> -		 *
> -		 * Make sure what it found really fits our range, and adjust to
> -		 * make sure it is based on the start from the caller
> -		 */
> -		if (hole_end <= start || hole_em->start > end) {
> -		       free_extent_map(hole_em);
> -		       hole_em = NULL;
> -		} else {
> -		       hole_start = max(hole_em->start, start);
> -		       hole_len = hole_end - hole_start;
> -		}
> -
> -		if (hole_em && delalloc_start > hole_start) {
> -			/*
> -			 * Our hole starts before our delalloc, so we have to
> -			 * return just the parts of the hole that go until the
> -			 * delalloc starts
> -			 */
> -			em->len = min(hole_len, delalloc_start - hole_start);
> -			em->start = hole_start;
> -			em->orig_start = hole_start;
> -			/*
> -			 * Don't adjust block start at all, it is fixed at
> -			 * EXTENT_MAP_HOLE
> -			 */
> -			em->block_start = hole_em->block_start;
> -			em->block_len = hole_len;
> -			if (test_bit(EXTENT_FLAG_PREALLOC, &hole_em->flags))
> -				set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
> -		} else {
> -			/*
> -			 * Hole is out of passed range or it starts after
> -			 * delalloc range
> -			 */
> -			em->start = delalloc_start;
> -			em->len = delalloc_len;
> -			em->orig_start = delalloc_start;
> -			em->block_start = EXTENT_MAP_DELALLOC;
> -			em->block_len = delalloc_len;
> -		}
> -	} else {
> -		return hole_em;
> -	}
> -out:
> -
> -	free_extent_map(hole_em);
> -	if (err) {
> -		free_extent_map(em);
> -		return ERR_PTR(err);
> -	}
> -	return em;
> -}
> -
>  static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>  						  const u64 start,
>  						  const u64 len,
> @@ -8265,15 +8138,14 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  	 * in the compression of data (in an async thread) and will return
>  	 * before the compression is done and writeback is started. A second
>  	 * filemap_fdatawrite_range() is needed to wait for the compression to
> -	 * complete and writeback to start. Without this, our user is very
> -	 * likely to get stale results, because the extents and extent maps for
> -	 * delalloc regions are only allocated when writeback starts.
> +	 * complete and writeback to start. We also need to wait for ordered
> +	 * extents to complete, because our fiemap implementation uses mainly
> +	 * file extent items to list the extents, searching for extent maps
> +	 * only for file ranges with holes or prealloc extents to figure out
> +	 * if we have delalloc in those ranges.
>  	 */
>  	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
> -		ret = btrfs_fdatawrite_range(inode, 0, LLONG_MAX);
> -		if (ret)
> -			return ret;
> -		ret = filemap_fdatawait_range(inode->i_mapping, 0, LLONG_MAX);
> +		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
>  		if (ret)
>  			return ret;
>  	}

Hmm this bit should be in "btrfs: properly flush delalloc when entering fiemap"
instead.  Thanks,

Josef