From mboxrd@z Thu Jan  1 00:00:00 1970
From: Bob Peterson <rpeterso@redhat.com>
Subject: [PATCH][try5] fs: if block_map clears buffer_holesize bit skip hole
 size from b_size
Date: Thu, 11 Sep 2014 11:29:32 -0400 (EDT)
Message-ID: <1501522376.20961158.1410449372237.JavaMail.zimbra@redhat.com>
References: <877076335.17542169.1409875465535.JavaMail.zimbra@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
To: Alexander Viro <aviro@redhat.com>, linux-fsdevel@vger.kernel.org
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mx5-phx2.redhat.com ([209.132.183.37]:42684 "EHLO
	mx5-phx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750860AbaIKP3c (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Thu, 11 Sep 2014 11:29:32 -0400
In-Reply-To: <877076335.17542169.1409875465535.JavaMail.zimbra@redhat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Hi,

This version uses a new buffer flag, holesize, as Dave Chinner
suggested. It also incorporates a suggestion from Steve Whitehouse.

The problem:
If you do a fiemap operation on a very large sparse file, it can take
an extremely long amount of time (we're talking days here) because
function __generic_block_fiemap does a block-for-block search when it
encounters a hole.

The solution:
Allow the underlying file system to return the hole size so that function
__generic_block_fiemap can quickly skip the hole.

Patch description:

This patch changes function __generic_block_fiemap so that it sets a new
buffer_holesize bit. The new bit signals to the underlying file system
to return a hole size from its block_map function (if possible) in the
event that a hole is encountered at the requested block. If the block_map
function encounters a hole, and clears buffer_holesize, fiemap takes the
returned b_size to be the size of the hole, in bytes. It then skips the
hole and moves to the next block. This may be repeated several times
in a row, especially for large holes, due to possible limitations of the
fs-specific block_map function. This is still much faster than trying
each block individually when large holes are encountered. If the
block_map function does not clear buffer_holesize, the request for
holesize has been ignored, and it falls back to today's method of doing a
block-by-block search for the next valid block.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <rpeterso@redhat.com> 
---
 fs/ioctl.c                  | 7 ++++++-
 include/linux/buffer_head.h | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 8ac3fad..ae63b1f 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -291,13 +291,18 @@ int __generic_block_fiemap(struct inode *inode,
 		memset(&map_bh, 0, sizeof(struct buffer_head));
 		map_bh.b_size = len;
 
+		set_buffer_holesize(&map_bh); /* return hole size if able */
 		ret = get_block(inode, start_blk, &map_bh, 0);
 		if (ret)
 			break;
 
 		/* HOLE */
 		if (!buffer_mapped(&map_bh)) {
-			start_blk++;
+			if (buffer_holesize(&map_bh)) /* holesize ignored */
+				start_blk++;
+			else
+				start_blk += logical_to_blk(inode,
+							    map_bh.b_size);
 
 			/*
 			 * We want to handle the case where there is an
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 324329c..b8ce396 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -37,6 +37,7 @@ enum bh_state_bits {
 	BH_Meta,	/* Buffer contains metadata */
 	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
 	BH_Defer_Completion, /* Defer AIO completion to workqueue */
+	BH_Holesize,    /* Return hole size (and clear) if possible */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -128,6 +129,7 @@ BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Unwritten, unwritten)
 BUFFER_FNS(Meta, meta)
+BUFFER_FNS(Holesize, holesize)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)