* [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups
@ 2015-09-30 10:28 Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size Chandan Rajendra
` (12 more replies)
0 siblings, 13 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
The patches posted along with this cover letter are cleanups made
during the development of the subpagesize-blocksize patchset. I believe
that they can be integrated with the mainline kernel. Hence I have
posted them separately from the subpagesize-blocksize patchset.
I have tested the patchset by running xfstests on ppc64 and
x86_64. On ppc64, some of the Btrfs specific tests and generic/255
fail because they assume 4K as the filesystem's block size. I have
fixed some of the test cases. I will fix the rest and mail them to the
fstests mailing list in the near future.
Changes from V4:
1. Removed the RFC tag.
Changes from V3:
Two new issues have been fixed by the patches:
1. Btrfs: prepare_pages: Retry adding a page to the page cache.
2. Btrfs: Return valid delalloc range when the page does not have
PG_Dirty flag set or has been invalidated.
IMHO, the above issues are also applicable to the "page size == block
size" scenario, but for reasons unknown to me they aren't seen even
when the tests are run for a long time.
Changes from V2:
1. For detecting logical errors, use ASSERT() calls instead of calls to
BUG_ON().
2. In the patch "Btrfs: Compute and look up csums based on sectorsized
blocks", fix usage of kmap_atomic/kunmap_atomic such that between the
kmap_atomic() and kunmap_atomic() calls we do not invoke any function
that might cause the current task to sleep.
Changes from V1:
1. Call the round_[down,up]() functions instead of doing hard-coded alignment.
Chandan Rajendra (13):
Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to
block size
Btrfs: Compute and look up csums based on sectorsized blocks
Btrfs: Direct I/O read: Work on sectorsized blocks
Btrfs: fallocate: Work with sectorsized blocks
Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
Btrfs: Search for all ordered extents that could span across a page
Btrfs: Use (eb->start, seq) as search key for tree modification log
Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
Btrfs: Limit inline extents to root->sectorsize
Btrfs: Fix block size returned to user space
Btrfs: Clean pte corresponding to page straddling i_size
Btrfs: prepare_pages: Retry adding a page to the page cache
Btrfs: Return valid delalloc range when the page does not have
PG_Dirty flag set or has been invalidated
fs/btrfs/ctree.c | 34 ++++----
fs/btrfs/ctree.h | 2 +-
fs/btrfs/extent_io.c | 5 +-
fs/btrfs/file-item.c | 93 ++++++++++++--------
fs/btrfs/file.c | 119 +++++++++++++++++--------
fs/btrfs/inode.c | 239 ++++++++++++++++++++++++++++++++++++---------------
6 files changed, 331 insertions(+), 161 deletions(-)
--
2.1.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH V5 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-10-01 14:37 ` Josef Bacik
2015-09-30 10:28 ` [PATCH V5 02/13] Btrfs: Compute and look up csums based on sectorsized blocks Chandan Rajendra
` (11 subsequent siblings)
12 siblings, 1 reply; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
units. Fix this by doing reservations/releases in block size units.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/file.c | 44 +++++++++++++++++++++++++++++++-------------
1 file changed, 31 insertions(+), 13 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..12ce401 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -499,7 +499,7 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
loff_t isize = i_size_read(inode);
start_pos = pos & ~((u64)root->sectorsize - 1);
- num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);
+ num_bytes = round_up(write_bytes + pos - start_pos, root->sectorsize);
end_of_last_block = start_pos + num_bytes - 1;
err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
@@ -1362,16 +1362,19 @@ fail:
static noinline int
lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
size_t num_pages, loff_t pos,
+ size_t write_bytes,
u64 *lockstart, u64 *lockend,
struct extent_state **cached_state)
{
+ struct btrfs_root *root = BTRFS_I(inode)->root;
u64 start_pos;
u64 last_pos;
int i;
int ret = 0;
- start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1);
- last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1;
+ start_pos = round_down(pos, root->sectorsize);
+ last_pos = start_pos
+ + round_up(pos + write_bytes - start_pos, root->sectorsize) - 1;
if (start_pos < inode->i_size) {
struct btrfs_ordered_extent *ordered;
@@ -1489,6 +1492,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
while (iov_iter_count(i) > 0) {
size_t offset = pos & (PAGE_CACHE_SIZE - 1);
+ size_t sector_offset;
size_t write_bytes = min(iov_iter_count(i),
nrptrs * (size_t)PAGE_CACHE_SIZE -
offset);
@@ -1497,6 +1501,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
size_t reserve_bytes;
size_t dirty_pages;
size_t copied;
+ size_t dirty_sectors;
+ size_t num_sectors;
WARN_ON(num_pages > nrptrs);
@@ -1509,8 +1515,12 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
break;
}
- reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+ sector_offset = pos & (root->sectorsize - 1);
+ reserve_bytes = round_up(write_bytes + sector_offset,
+ root->sectorsize);
+
ret = btrfs_check_data_free_space(inode, reserve_bytes, write_bytes);
+
if (ret == -ENOSPC &&
(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
BTRFS_INODE_PREALLOC))) {
@@ -1523,7 +1533,10 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
*/
num_pages = DIV_ROUND_UP(write_bytes + offset,
PAGE_CACHE_SIZE);
- reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+ reserve_bytes = round_up(write_bytes
+ + sector_offset,
+ root->sectorsize);
+
ret = 0;
} else {
ret = -ENOSPC;
@@ -1558,8 +1571,8 @@ again:
break;
ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages,
- pos, &lockstart, &lockend,
- &cached_state);
+ pos, write_bytes, &lockstart,
+ &lockend, &cached_state);
if (ret < 0) {
if (ret == -EAGAIN)
goto again;
@@ -1595,9 +1608,14 @@ again:
* we still have an outstanding extent for the chunk we actually
* managed to copy.
*/
- if (num_pages > dirty_pages) {
- release_bytes = (num_pages - dirty_pages) <<
- PAGE_CACHE_SHIFT;
+ num_sectors = reserve_bytes >> inode->i_blkbits;
+ dirty_sectors = round_up(copied + sector_offset,
+ root->sectorsize);
+ dirty_sectors >>= inode->i_blkbits;
+
+ if (num_sectors > dirty_sectors) {
+ release_bytes = (write_bytes - copied)
+ & ~((u64)root->sectorsize - 1);
if (copied > 0) {
spin_lock(&BTRFS_I(inode)->lock);
BTRFS_I(inode)->outstanding_extents++;
@@ -1611,7 +1629,8 @@ again:
release_bytes);
}
- release_bytes = dirty_pages << PAGE_CACHE_SHIFT;
+ release_bytes = round_up(copied + sector_offset,
+ root->sectorsize);
if (copied > 0)
ret = btrfs_dirty_pages(root, inode, pages,
@@ -1632,8 +1651,7 @@ again:
if (only_release_metadata && copied > 0) {
lockstart = round_down(pos, root->sectorsize);
- lockend = lockstart +
- (dirty_pages << PAGE_CACHE_SHIFT) - 1;
+ lockend = round_up(pos + copied, root->sectorsize) - 1;
set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
lockend, EXTENT_NORESERVE, NULL,
--
2.1.0
* [PATCH V5 02/13] Btrfs: Compute and look up csums based on sectorsized blocks
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-10-01 14:39 ` Josef Bacik
2015-09-30 10:28 ` [PATCH V5 03/13] Btrfs: Direct I/O read: Work " Chandan Rajendra
` (10 subsequent siblings)
12 siblings, 1 reply; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
Checksums are applicable to sectorsize units. The current code uses
bio->bv_len units to compute and look up checksums. This works on machines
where sectorsize == PAGE_SIZE. This patch makes the checksum computation and
lookup code work with sectorsize units.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/file-item.c | 93 +++++++++++++++++++++++++++++++++-------------------
1 file changed, 59 insertions(+), 34 deletions(-)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 58ece65..818c859 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
u64 item_start_offset = 0;
u64 item_last_offset = 0;
u64 disk_bytenr;
+ u64 page_bytes_left;
u32 diff;
int nblocks;
int bio_index = 0;
@@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
if (dio)
offset = logical_offset;
+
+ page_bytes_left = bvec->bv_len;
while (bio_index < bio->bi_vcnt) {
if (!dio)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
@@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (BTRFS_I(inode)->root->root_key.objectid ==
BTRFS_DATA_RELOC_TREE_OBJECTID) {
set_extent_bits(io_tree, offset,
- offset + bvec->bv_len - 1,
+ offset + root->sectorsize - 1,
EXTENT_NODATASUM, GFP_NOFS);
} else {
btrfs_info(BTRFS_I(inode)->root->fs_info,
@@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
found:
csum += count * csum_size;
nblocks -= count;
- bio_index += count;
+
while (count--) {
- disk_bytenr += bvec->bv_len;
- offset += bvec->bv_len;
- bvec++;
+ disk_bytenr += root->sectorsize;
+ offset += root->sectorsize;
+ page_bytes_left -= root->sectorsize;
+ if (!page_bytes_left) {
+ bio_index++;
+ bvec++;
+ page_bytes_left = bvec->bv_len;
+ }
+
}
}
btrfs_free_path(path);
@@ -432,6 +441,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
struct bio_vec *bvec = bio->bi_io_vec;
int bio_index = 0;
int index;
+ int nr_sectors;
+ int i;
unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
u64 offset;
@@ -451,7 +462,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
ordered = btrfs_lookup_ordered_extent(inode, offset);
- BUG_ON(!ordered); /* Logic error */
+ ASSERT(ordered); /* Logic error */
sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
index = 0;
@@ -459,41 +470,55 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
if (!contig)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
- if (offset >= ordered->file_offset + ordered->len ||
- offset < ordered->file_offset) {
- unsigned long bytes_left;
- sums->len = this_sum_bytes;
- this_sum_bytes = 0;
- btrfs_add_ordered_sum(inode, ordered, sums);
- btrfs_put_ordered_extent(ordered);
+ data = kmap_atomic(bvec->bv_page);
- bytes_left = bio->bi_iter.bi_size - total_bytes;
+ nr_sectors = (bvec->bv_len + root->sectorsize - 1)
+ >> inode->i_blkbits;
+
+ for (i = 0; i < nr_sectors; i++) {
+ if (offset >= ordered->file_offset + ordered->len ||
+ offset < ordered->file_offset) {
+ unsigned long bytes_left;
+
+ kunmap_atomic(data);
+ sums->len = this_sum_bytes;
+ this_sum_bytes = 0;
+ btrfs_add_ordered_sum(inode, ordered, sums);
+ btrfs_put_ordered_extent(ordered);
+
+ bytes_left = bio->bi_iter.bi_size - total_bytes;
+
+ sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
+ GFP_NOFS);
+ BUG_ON(!sums); /* -ENOMEM */
+ sums->len = bytes_left;
+ ordered = btrfs_lookup_ordered_extent(inode,
+ offset);
+ ASSERT(ordered); /* Logic error */
+ sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9)
+ + total_bytes;
+ index = 0;
+
+ data = kmap_atomic(bvec->bv_page);
+ }
- sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
- GFP_NOFS);
- BUG_ON(!sums); /* -ENOMEM */
- sums->len = bytes_left;
- ordered = btrfs_lookup_ordered_extent(inode, offset);
- BUG_ON(!ordered); /* Logic error */
- sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
- total_bytes;
- index = 0;
+ sums->sums[index] = ~(u32)0;
+ sums->sums[index]
+ = btrfs_csum_data(data + bvec->bv_offset
+ + (i * root->sectorsize),
+ sums->sums[index],
+ root->sectorsize);
+ btrfs_csum_final(sums->sums[index],
+ (char *)(sums->sums + index));
+ index++;
+ offset += root->sectorsize;
+ this_sum_bytes += root->sectorsize;
+ total_bytes += root->sectorsize;
}
- data = kmap_atomic(bvec->bv_page);
- sums->sums[index] = ~(u32)0;
- sums->sums[index] = btrfs_csum_data(data + bvec->bv_offset,
- sums->sums[index],
- bvec->bv_len);
kunmap_atomic(data);
- btrfs_csum_final(sums->sums[index],
- (char *)(sums->sums + index));
bio_index++;
- index++;
- total_bytes += bvec->bv_len;
- this_sum_bytes += bvec->bv_len;
- offset += bvec->bv_len;
bvec++;
}
this_sum_bytes = 0;
--
2.1.0
* [PATCH V5 03/13] Btrfs: Direct I/O read: Work on sectorsized blocks
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 02/13] Btrfs: Compute and look up csums based on sectorsized blocks Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 04/13] Btrfs: fallocate: Work with " Chandan Rajendra
` (9 subsequent siblings)
12 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
The direct I/O read's endio and corresponding repair functions work on
page-sized blocks. This commit adds the ability for direct I/O read to work on
subpagesized blocks.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/inode.c | 96 ++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 73 insertions(+), 23 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b7e439b..5a47731 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7664,9 +7664,9 @@ static int btrfs_check_dio_repairable(struct inode *inode,
}
static int dio_read_error(struct inode *inode, struct bio *failed_bio,
- struct page *page, u64 start, u64 end,
- int failed_mirror, bio_end_io_t *repair_endio,
- void *repair_arg)
+ struct page *page, unsigned int pgoff,
+ u64 start, u64 end, int failed_mirror,
+ bio_end_io_t *repair_endio, void *repair_arg)
{
struct io_failure_record *failrec;
struct bio *bio;
@@ -7687,7 +7687,9 @@ static int dio_read_error(struct inode *inode, struct bio *failed_bio,
return -EIO;
}
- if (failed_bio->bi_vcnt > 1)
+ if ((failed_bio->bi_vcnt > 1)
+ || (failed_bio->bi_io_vec->bv_len
+ > BTRFS_I(inode)->root->sectorsize))
read_mode = READ_SYNC | REQ_FAILFAST_DEV;
else
read_mode = READ_SYNC;
@@ -7695,7 +7697,7 @@ static int dio_read_error(struct inode *inode, struct bio *failed_bio,
isector = start - btrfs_io_bio(failed_bio)->logical;
isector >>= inode->i_sb->s_blocksize_bits;
bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
- 0, isector, repair_endio, repair_arg);
+ pgoff, isector, repair_endio, repair_arg);
if (!bio) {
free_io_failure(inode, failrec);
return -EIO;
@@ -7725,12 +7727,17 @@ struct btrfs_retry_complete {
static void btrfs_retry_endio_nocsum(struct bio *bio, int err)
{
struct btrfs_retry_complete *done = bio->bi_private;
+ struct inode *inode;
struct bio_vec *bvec;
int i;
if (err)
goto end;
+ ASSERT(bio->bi_vcnt == 1);
+ inode = bio->bi_io_vec->bv_page->mapping->host;
+ ASSERT(bio->bi_io_vec->bv_len == BTRFS_I(inode)->root->sectorsize);
+
done->uptodate = 1;
bio_for_each_segment_all(bvec, bio, i)
clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
@@ -7745,22 +7752,30 @@ static int __btrfs_correct_data_nocsum(struct inode *inode,
struct bio_vec *bvec;
struct btrfs_retry_complete done;
u64 start;
+ unsigned int pgoff;
+ u32 sectorsize;
+ int nr_sectors;
int i;
int ret;
+ sectorsize = BTRFS_I(inode)->root->sectorsize;
+
start = io_bio->logical;
done.inode = inode;
bio_for_each_segment_all(bvec, &io_bio->bio, i) {
-try_again:
+ nr_sectors = bvec->bv_len >> inode->i_blkbits;
+ pgoff = bvec->bv_offset;
+
+next_block_or_try_again:
done.uptodate = 0;
done.start = start;
init_completion(&done.done);
- ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
- start + bvec->bv_len - 1,
- io_bio->mirror_num,
- btrfs_retry_endio_nocsum, &done);
+ ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
+ pgoff, start, start + sectorsize - 1,
+ io_bio->mirror_num,
+ btrfs_retry_endio_nocsum, &done);
if (ret)
return ret;
@@ -7768,10 +7783,15 @@ try_again:
if (!done.uptodate) {
/* We might have another mirror, so try again */
- goto try_again;
+ goto next_block_or_try_again;
}
- start += bvec->bv_len;
+ start += sectorsize;
+
+ if (--nr_sectors) {
+ pgoff += sectorsize;
+ goto next_block_or_try_again;
+ }
}
return 0;
@@ -7781,7 +7801,9 @@ static void btrfs_retry_endio(struct bio *bio, int err)
{
struct btrfs_retry_complete *done = bio->bi_private;
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+ struct inode *inode;
struct bio_vec *bvec;
+ u64 start;
int uptodate;
int ret;
int i;
@@ -7790,13 +7812,20 @@ static void btrfs_retry_endio(struct bio *bio, int err)
goto end;
uptodate = 1;
+
+ start = done->start;
+
+ ASSERT(bio->bi_vcnt == 1);
+ inode = bio->bi_io_vec->bv_page->mapping->host;
+ ASSERT(bio->bi_io_vec->bv_len == BTRFS_I(inode)->root->sectorsize);
+
bio_for_each_segment_all(bvec, bio, i) {
ret = __readpage_endio_check(done->inode, io_bio, i,
- bvec->bv_page, 0,
- done->start, bvec->bv_len);
+ bvec->bv_page, bvec->bv_offset,
+ done->start, bvec->bv_len);
if (!ret)
clean_io_failure(done->inode, done->start,
- bvec->bv_page, 0);
+ bvec->bv_page, bvec->bv_offset);
else
uptodate = 0;
}
@@ -7814,16 +7843,30 @@ static int __btrfs_subio_endio_read(struct inode *inode,
struct btrfs_retry_complete done;
u64 start;
u64 offset = 0;
+ u32 sectorsize;
+ int nr_sectors;
+ unsigned int pgoff;
+ int csum_pos;
int i;
int ret;
+ unsigned char blocksize_bits;
+
+ blocksize_bits = inode->i_blkbits;
+ sectorsize = BTRFS_I(inode)->root->sectorsize;
err = 0;
start = io_bio->logical;
done.inode = inode;
bio_for_each_segment_all(bvec, &io_bio->bio, i) {
- ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
- 0, start, bvec->bv_len);
+ nr_sectors = bvec->bv_len >> blocksize_bits;
+ pgoff = bvec->bv_offset;
+next_block:
+ csum_pos = offset >> blocksize_bits;
+
+ ret = __readpage_endio_check(inode, io_bio, csum_pos,
+ bvec->bv_page, pgoff, start,
+ sectorsize);
if (likely(!ret))
goto next;
try_again:
@@ -7831,10 +7874,10 @@ try_again:
done.start = start;
init_completion(&done.done);
- ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
- start + bvec->bv_len - 1,
- io_bio->mirror_num,
- btrfs_retry_endio, &done);
+ ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
+ pgoff, start, start + sectorsize - 1,
+ io_bio->mirror_num,
+ btrfs_retry_endio, &done);
if (ret) {
err = ret;
goto next;
@@ -7847,8 +7890,15 @@ try_again:
goto try_again;
}
next:
- offset += bvec->bv_len;
- start += bvec->bv_len;
+ offset += sectorsize;
+ start += sectorsize;
+
+ ASSERT(nr_sectors);
+
+ if (--nr_sectors) {
+ pgoff += sectorsize;
+ goto next_block;
+ }
}
return err;
--
2.1.0
* [PATCH V5 04/13] Btrfs: fallocate: Work with sectorsized blocks
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (2 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 03/13] Btrfs: Direct I/O read: Work " Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 05/13] Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units Chandan Rajendra
` (8 subsequent siblings)
12 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
This commit makes the fallocate code work with sectorsized blocks. While at
it, it also changes btrfs_truncate_page() to truncate sectorsized blocks
instead of pages. Hence the function has been renamed to
btrfs_truncate_block().
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/ctree.h | 2 +-
fs/btrfs/file.c | 47 +++++++++++++++++++++++++----------------------
fs/btrfs/inode.c | 52 +++++++++++++++++++++++++++-------------------------
3 files changed, 53 insertions(+), 48 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..99a0fff 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3893,7 +3893,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct inode *dir, u64 objectid,
const char *name, int name_len);
-int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
+int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
int front);
int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 12ce401..360d56d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2280,23 +2280,26 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
u64 tail_len;
u64 orig_start = offset;
u64 cur_offset;
+ unsigned char blocksize_bits;
u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
u64 drop_end;
int ret = 0;
int err = 0;
int rsv_count;
- bool same_page;
+ bool same_block;
bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES);
u64 ino_size;
- bool truncated_page = false;
+ bool truncated_block = false;
bool updated_inode = false;
+ blocksize_bits = inode->i_blkbits;
+
ret = btrfs_wait_ordered_range(inode, offset, len);
if (ret)
return ret;
mutex_lock(&inode->i_mutex);
- ino_size = round_up(inode->i_size, PAGE_CACHE_SIZE);
+ ino_size = round_up(inode->i_size, root->sectorsize);
ret = find_first_non_hole(inode, &offset, &len);
if (ret < 0)
goto out_only_mutex;
@@ -2309,31 +2312,30 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize);
lockend = round_down(offset + len,
BTRFS_I(inode)->root->sectorsize) - 1;
- same_page = ((offset >> PAGE_CACHE_SHIFT) ==
- ((offset + len - 1) >> PAGE_CACHE_SHIFT));
-
+ same_block = ((offset >> blocksize_bits)
+ == ((offset + len - 1) >> blocksize_bits));
/*
- * We needn't truncate any page which is beyond the end of the file
+ * We needn't truncate any block which is beyond the end of the file
* because we are sure there is no data there.
*/
/*
- * Only do this if we are in the same page and we aren't doing the
- * entire page.
+ * Only do this if we are in the same block and we aren't doing the
+ * entire block.
*/
- if (same_page && len < PAGE_CACHE_SIZE) {
+ if (same_block && len < root->sectorsize) {
if (offset < ino_size) {
- truncated_page = true;
- ret = btrfs_truncate_page(inode, offset, len, 0);
+ truncated_block = true;
+ ret = btrfs_truncate_block(inode, offset, len, 0);
} else {
ret = 0;
}
goto out_only_mutex;
}
- /* zero back part of the first page */
+ /* zero back part of the first block */
if (offset < ino_size) {
- truncated_page = true;
- ret = btrfs_truncate_page(inode, offset, 0, 0);
+ truncated_block = true;
+ ret = btrfs_truncate_block(inode, offset, 0, 0);
if (ret) {
mutex_unlock(&inode->i_mutex);
return ret;
@@ -2368,9 +2370,10 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if (!ret) {
/* zero the front end of the last page */
if (tail_start + tail_len < ino_size) {
- truncated_page = true;
- ret = btrfs_truncate_page(inode,
- tail_start + tail_len, 0, 1);
+ truncated_block = true;
+ ret = btrfs_truncate_block(inode,
+ tail_start + tail_len,
+ 0, 1);
if (ret)
goto out_only_mutex;
}
@@ -2537,7 +2540,7 @@ out:
unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
&cached_state, GFP_NOFS);
out_only_mutex:
- if (!updated_inode && truncated_page && !ret && !err) {
+ if (!updated_inode && truncated_block && !ret && !err) {
/*
* If we only end up zeroing part of a page, we still need to
* update the inode item, so that all the time fields are
@@ -2605,10 +2608,10 @@ static long btrfs_fallocate(struct file *file, int mode,
} else {
/*
* If we are fallocating from the end of the file onward we
- * need to zero out the end of the page if i_size lands in the
- * middle of a page.
+ * need to zero out the end of the block if i_size lands in the
+ * middle of a block.
*/
- ret = btrfs_truncate_page(inode, inode->i_size, 0, 0);
+ ret = btrfs_truncate_block(inode, inode->i_size, 0, 0);
if (ret)
goto out;
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5a47731..5301d4e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4540,17 +4540,17 @@ error:
}
/*
- * btrfs_truncate_page - read, zero a chunk and write a page
+ * btrfs_truncate_block - read, zero a chunk and write a block
* @inode - inode that we're zeroing
* @from - the offset to start zeroing
* @len - the length to zero, 0 to zero the entire range respective to the
* offset
* @front - zero up to the offset instead of from the offset on
*
- * This will find the page for the "from" offset and cow the page and zero the
+ * This will find the block for the "from" offset and cow the block and zero the
* part we want to zero. This is used with truncate and hole punching.
*/
-int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
+int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
int front)
{
struct address_space *mapping = inode->i_mapping;
@@ -4561,30 +4561,30 @@ int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
char *kaddr;
u32 blocksize = root->sectorsize;
pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned offset = from & (blocksize - 1);
struct page *page;
gfp_t mask = btrfs_alloc_write_mask(mapping);
int ret = 0;
- u64 page_start;
- u64 page_end;
+ u64 block_start;
+ u64 block_end;
if ((offset & (blocksize - 1)) == 0 &&
(!len || ((len & (blocksize - 1)) == 0)))
goto out;
- ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+ ret = btrfs_delalloc_reserve_space(inode, blocksize);
if (ret)
goto out;
again:
page = find_or_create_page(mapping, index, mask);
if (!page) {
- btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+ btrfs_delalloc_release_space(inode, blocksize);
ret = -ENOMEM;
goto out;
}
- page_start = page_offset(page);
- page_end = page_start + PAGE_CACHE_SIZE - 1;
+ block_start = round_down(from, blocksize);
+ block_end = block_start + blocksize - 1;
if (!PageUptodate(page)) {
ret = btrfs_readpage(NULL, page);
@@ -4601,12 +4601,12 @@ again:
}
wait_on_page_writeback(page);
- lock_extent_bits(io_tree, page_start, page_end, 0, &cached_state);
+ lock_extent_bits(io_tree, block_start, block_end, 0, &cached_state);
set_page_extent_mapped(page);
- ordered = btrfs_lookup_ordered_extent(inode, page_start);
+ ordered = btrfs_lookup_ordered_extent(inode, block_start);
if (ordered) {
- unlock_extent_cached(io_tree, page_start, page_end,
+ unlock_extent_cached(io_tree, block_start, block_end,
&cached_state, GFP_NOFS);
unlock_page(page);
page_cache_release(page);
@@ -4615,38 +4615,40 @@ again:
goto again;
}
- clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, page_end,
+ clear_extent_bit(&BTRFS_I(inode)->io_tree, block_start, block_end,
EXTENT_DIRTY | EXTENT_DELALLOC |
EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
0, 0, &cached_state, GFP_NOFS);
- ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
+ ret = btrfs_set_extent_delalloc(inode, block_start, block_end,
&cached_state);
if (ret) {
- unlock_extent_cached(io_tree, page_start, page_end,
+ unlock_extent_cached(io_tree, block_start, block_end,
&cached_state, GFP_NOFS);
goto out_unlock;
}
- if (offset != PAGE_CACHE_SIZE) {
+ if (offset != blocksize) {
if (!len)
- len = PAGE_CACHE_SIZE - offset;
+ len = blocksize - offset;
kaddr = kmap(page);
if (front)
- memset(kaddr, 0, offset);
+ memset(kaddr + (block_start - page_offset(page)),
+ 0, offset);
else
- memset(kaddr + offset, 0, len);
+ memset(kaddr + (block_start - page_offset(page)) + offset,
+ 0, len);
flush_dcache_page(page);
kunmap(page);
}
ClearPageChecked(page);
set_page_dirty(page);
- unlock_extent_cached(io_tree, page_start, page_end, &cached_state,
+ unlock_extent_cached(io_tree, block_start, block_end, &cached_state,
GFP_NOFS);
out_unlock:
if (ret)
- btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+ btrfs_delalloc_release_space(inode, blocksize);
unlock_page(page);
page_cache_release(page);
out:
@@ -4717,11 +4719,11 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size)
int err = 0;
/*
- * If our size started in the middle of a page we need to zero out the
- * rest of the page before we expand the i_size, otherwise we could
+ * If our size started in the middle of a block we need to zero out the
+ * rest of the block before we expand the i_size, otherwise we could
* expose stale data.
*/
- err = btrfs_truncate_page(inode, oldsize, 0, 0);
+ err = btrfs_truncate_block(inode, oldsize, 0, 0);
if (err)
return err;
--
2.1.0
* [PATCH V5 05/13] Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (3 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 04/13] Btrfs: fallocate: Work with " Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 06/13] Btrfs: Search for all ordered extents that could span across a page Chandan Rajendra
` (7 subsequent siblings)
12 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
In the subpagesize-blocksize scenario, if i_size occurs in a block which is not
the last block in the page, then the space to be reserved should be calculated
appropriately.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/inode.c | 36 +++++++++++++++++++++++++++++++-----
1 file changed, 31 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5301d4e..5e6052d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8659,11 +8659,24 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
loff_t size;
int ret;
int reserved = 0;
+ u64 reserved_space;
u64 page_start;
u64 page_end;
+ u64 end;
+
+ reserved_space = PAGE_CACHE_SIZE;
sb_start_pagefault(inode->i_sb);
- ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+
+ /*
+ * Reserving delalloc space after obtaining the page lock can lead to
+ * deadlock. For example, if a dirty page is locked by this function
+ * and the call to btrfs_delalloc_reserve_space() ends up triggering
+ * dirty page writeout, then btrfs_writepage() could end up waiting
+ * indefinitely to get a lock on the page currently being processed
+ * by btrfs_page_mkwrite().
+ */
+ ret = btrfs_delalloc_reserve_space(inode, reserved_space);
if (!ret) {
ret = file_update_time(vma->vm_file);
reserved = 1;
@@ -8684,6 +8697,7 @@ again:
size = i_size_read(inode);
page_start = page_offset(page);
page_end = page_start + PAGE_CACHE_SIZE - 1;
+ end = page_end;
if ((page->mapping != inode->i_mapping) ||
(page_start >= size)) {
@@ -8699,7 +8713,7 @@ again:
* we can't set the delalloc bits if there are pending ordered
* extents. Drop our locks and wait for them to finish
*/
- ordered = btrfs_lookup_ordered_extent(inode, page_start);
+ ordered = btrfs_lookup_ordered_range(inode, page_start, page_end);
if (ordered) {
unlock_extent_cached(io_tree, page_start, page_end,
&cached_state, GFP_NOFS);
@@ -8709,6 +8723,18 @@ again:
goto again;
}
+ if (page->index == ((size - 1) >> PAGE_CACHE_SHIFT)) {
+ reserved_space = round_up(size - page_start, root->sectorsize);
+ if (reserved_space < PAGE_CACHE_SIZE) {
+ end = page_start + reserved_space - 1;
+ spin_lock(&BTRFS_I(inode)->lock);
+ BTRFS_I(inode)->outstanding_extents++;
+ spin_unlock(&BTRFS_I(inode)->lock);
+ btrfs_delalloc_release_space(inode,
+ PAGE_CACHE_SIZE - reserved_space);
+ }
+ }
+
/*
* XXX - page_mkwrite gets called every time the page is dirtied, even
* if it was already dirty, so for space accounting reasons we need to
@@ -8716,12 +8742,12 @@ again:
* is probably a better way to do this, but for now keep consistent with
* prepare_pages in the normal write path.
*/
- clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, page_end,
+ clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, end,
EXTENT_DIRTY | EXTENT_DELALLOC |
EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
0, 0, &cached_state, GFP_NOFS);
- ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
+ ret = btrfs_set_extent_delalloc(inode, page_start, end,
&cached_state);
if (ret) {
unlock_extent_cached(io_tree, page_start, page_end,
@@ -8760,7 +8786,7 @@ out_unlock:
}
unlock_page(page);
out:
- btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+ btrfs_delalloc_release_space(inode, reserved_space);
out_noreserve:
sb_end_pagefault(inode->i_sb);
return ret;
--
2.1.0
* [PATCH V5 06/13] Btrfs: Search for all ordered extents that could span across a page
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (4 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 05/13] Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 07/13] Btrfs: Use (eb->start, seq) as search key for tree modification log Chandan Rajendra
` (6 subsequent siblings)
12 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
In the subpagesize-blocksize scenario it is not sufficient to search using only
the first byte of the page to make sure that there are no ordered extents
present across the page. Fix this by searching the entire range mapped by the
page.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/extent_io.c | 3 ++-
fs/btrfs/inode.c | 25 ++++++++++++++++++-------
2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 11aa8f7..0ee486a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3224,7 +3224,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
while (1) {
lock_extent(tree, start, end);
- ordered = btrfs_lookup_ordered_extent(inode, start);
+ ordered = btrfs_lookup_ordered_range(inode, start,
+ PAGE_CACHE_SIZE);
if (!ordered)
break;
unlock_extent(tree, start, end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5e6052d..4fbe9de 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1975,7 +1975,8 @@ again:
if (PagePrivate2(page))
goto out;
- ordered = btrfs_lookup_ordered_extent(inode, page_start);
+ ordered = btrfs_lookup_ordered_range(inode, page_start,
+ PAGE_CACHE_SIZE);
if (ordered) {
unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
page_end, &cached_state, GFP_NOFS);
@@ -8552,6 +8553,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
struct extent_state *cached_state = NULL;
u64 page_start = page_offset(page);
u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
+ u64 start;
+ u64 end;
int inode_evicting = inode->i_state & I_FREEING;
/*
@@ -8571,14 +8574,18 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
if (!inode_evicting)
lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
- ordered = btrfs_lookup_ordered_extent(inode, page_start);
+again:
+ start = page_start;
+ ordered = btrfs_lookup_ordered_range(inode, start,
+ page_end - start + 1);
if (ordered) {
+ end = min(page_end, ordered->file_offset + ordered->len - 1);
/*
* IO on this page will never be started, so we need
* to account for any ordered extents now
*/
if (!inode_evicting)
- clear_extent_bit(tree, page_start, page_end,
+ clear_extent_bit(tree, start, end,
EXTENT_DIRTY | EXTENT_DELALLOC |
EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
EXTENT_DEFRAG, 1, 0, &cached_state,
@@ -8595,22 +8602,26 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
spin_lock_irq(&tree->lock);
set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
- new_len = page_start - ordered->file_offset;
+ new_len = start - ordered->file_offset;
if (new_len < ordered->truncated_len)
ordered->truncated_len = new_len;
spin_unlock_irq(&tree->lock);
if (btrfs_dec_test_ordered_pending(inode, &ordered,
- page_start,
- PAGE_CACHE_SIZE, 1))
+ start,
+ end - start + 1, 1))
btrfs_finish_ordered_io(ordered);
}
btrfs_put_ordered_extent(ordered);
if (!inode_evicting) {
cached_state = NULL;
- lock_extent_bits(tree, page_start, page_end, 0,
+ lock_extent_bits(tree, start, end, 0,
&cached_state);
}
+
+ start = end + 1;
+ if (start < page_end)
+ goto again;
}
if (!inode_evicting) {
--
2.1.0
* [PATCH V5 07/13] Btrfs: Use (eb->start, seq) as search key for tree modification log
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (5 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 06/13] Btrfs: Search for all ordered extents that could span across a page Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 08/13] Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length Chandan Rajendra
` (5 subsequent siblings)
12 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
In the subpagesize-blocksize scenario a page can map multiple extent buffers,
so using (page index, seq) as the search key is incorrect. For example, a
search through the tree modification log can return an entry associated with
the first extent buffer mapped by the page (if such an entry exists), even
when we are actually searching for entries associated with extent buffers
mapped at later positions in the page.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/ctree.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 5f745ea..719ed3c 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -311,7 +311,7 @@ struct tree_mod_root {
struct tree_mod_elem {
struct rb_node node;
- u64 index; /* shifted logical */
+ u64 logical;
u64 seq;
enum mod_log_op op;
@@ -435,11 +435,11 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
/*
* key order of the log:
- * index -> sequence
+ * node/leaf start address -> sequence
*
- * the index is the shifted logical of the *new* root node for root replace
- * operations, or the shifted logical of the affected block for all other
- * operations.
+ * The 'start address' is the logical address of the *new* root node
+ * for root replace operations, or the logical address of the affected
+ * block for all other operations.
*
* Note: must be called with write lock (tree_mod_log_write_lock).
*/
@@ -460,9 +460,9 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm)
while (*new) {
cur = container_of(*new, struct tree_mod_elem, node);
parent = *new;
- if (cur->index < tm->index)
+ if (cur->logical < tm->logical)
new = &((*new)->rb_left);
- else if (cur->index > tm->index)
+ else if (cur->logical > tm->logical)
new = &((*new)->rb_right);
else if (cur->seq < tm->seq)
new = &((*new)->rb_left);
@@ -523,7 +523,7 @@ alloc_tree_mod_elem(struct extent_buffer *eb, int slot,
if (!tm)
return NULL;
- tm->index = eb->start >> PAGE_CACHE_SHIFT;
+ tm->logical = eb->start;
if (op != MOD_LOG_KEY_ADD) {
btrfs_node_key(eb, &tm->key, slot);
tm->blockptr = btrfs_node_blockptr(eb, slot);
@@ -588,7 +588,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
goto free_tms;
}
- tm->index = eb->start >> PAGE_CACHE_SHIFT;
+ tm->logical = eb->start;
tm->slot = src_slot;
tm->move.dst_slot = dst_slot;
tm->move.nr_items = nr_items;
@@ -699,7 +699,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
goto free_tms;
}
- tm->index = new_root->start >> PAGE_CACHE_SHIFT;
+ tm->logical = new_root->start;
tm->old_root.logical = old_root->start;
tm->old_root.level = btrfs_header_level(old_root);
tm->generation = btrfs_header_generation(old_root);
@@ -739,16 +739,15 @@ __tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq,
struct rb_node *node;
struct tree_mod_elem *cur = NULL;
struct tree_mod_elem *found = NULL;
- u64 index = start >> PAGE_CACHE_SHIFT;
tree_mod_log_read_lock(fs_info);
tm_root = &fs_info->tree_mod_log;
node = tm_root->rb_node;
while (node) {
cur = container_of(node, struct tree_mod_elem, node);
- if (cur->index < index) {
+ if (cur->logical < start) {
node = node->rb_left;
- } else if (cur->index > index) {
+ } else if (cur->logical > start) {
node = node->rb_right;
} else if (cur->seq < min_seq) {
node = node->rb_left;
@@ -1230,9 +1229,10 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
return NULL;
/*
- * the very last operation that's logged for a root is the replacement
- * operation (if it is replaced at all). this has the index of the *new*
- * root, making it the very first operation that's logged for this root.
+ * the very last operation that's logged for a root is the
+ * replacement operation (if it is replaced at all). this has
+ * the logical address of the *new* root, making it the very
+ * first operation that's logged for this root.
*/
while (1) {
tm = tree_mod_log_search_oldest(fs_info, root_logical,
@@ -1336,7 +1336,7 @@ __tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
if (!next)
break;
tm = container_of(next, struct tree_mod_elem, node);
- if (tm->index != first_tm->index)
+ if (tm->logical != first_tm->logical)
break;
}
tree_mod_log_read_unlock(fs_info);
--
2.1.0
* [PATCH V5 08/13] Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (6 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 07/13] Btrfs: Use (eb->start, seq) as search key for tree modification log Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 09/13] Btrfs: Limit inline extents to root->sectorsize Chandan Rajendra
` (4 subsequent siblings)
12 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
In the subpagesize-blocksize scenario, map_length can be less than the length
of a bio vector. Such a condition may cause btrfs_submit_direct_hook() to
submit a zero-length bio. Fix this by comparing map_length against the block
size rather than against bv_len.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/inode.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4fbe9de..b1ceba4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8148,9 +8148,11 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
u64 file_offset = dip->logical_offset;
u64 submit_len = 0;
u64 map_length;
- int nr_pages = 0;
- int ret;
+ u32 blocksize = root->sectorsize;
int async_submit = 0;
+ int nr_sectors;
+ int ret;
+ int i;
map_length = orig_bio->bi_iter.bi_size;
ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
@@ -8180,9 +8182,12 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
atomic_inc(&dip->pending_bios);
while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
- if (map_length < submit_len + bvec->bv_len ||
- bio_add_page(bio, bvec->bv_page, bvec->bv_len,
- bvec->bv_offset) < bvec->bv_len) {
+ nr_sectors = bvec->bv_len >> inode->i_blkbits;
+ i = 0;
+next_block:
+ if (unlikely(map_length < submit_len + blocksize ||
+ bio_add_page(bio, bvec->bv_page, blocksize,
+ bvec->bv_offset + (i * blocksize)) < blocksize)) {
/*
* inc the count before we submit the bio so
* we know the end IO handler won't happen before
@@ -8203,7 +8208,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
file_offset += submit_len;
submit_len = 0;
- nr_pages = 0;
bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev,
start_sector, GFP_NOFS);
@@ -8221,9 +8225,14 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
bio_put(bio);
goto out_err;
}
+
+ goto next_block;
} else {
- submit_len += bvec->bv_len;
- nr_pages++;
+ submit_len += blocksize;
+ if (--nr_sectors) {
+ i++;
+ goto next_block;
+ }
bvec++;
}
}
--
2.1.0
* [PATCH V5 09/13] Btrfs: Limit inline extents to root->sectorsize
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (7 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 08/13] Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 10/13] Btrfs: Fix block size returned to user space Chandan Rajendra
` (3 subsequent siblings)
12 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
cow_file_range_inline() limits the size of an inline extent to
PAGE_CACHE_SIZE. This breaks in subpagesize-blocksize scenarios. Fix this by
comparing against root->sectorsize.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b1ceba4..b2eedb9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -257,7 +257,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
data_len = compressed_size;
if (start > 0 ||
- actual_end > PAGE_CACHE_SIZE ||
+ actual_end > root->sectorsize ||
data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
(!compressed_size &&
(actual_end & (root->sectorsize - 1)) == 0) ||
--
2.1.0
* [PATCH V5 10/13] Btrfs: Fix block size returned to user space
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (8 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 09/13] Btrfs: Limit inline extents to root->sectorsize Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-10-01 14:58 ` Josef Bacik
2015-09-30 10:28 ` [PATCH V5 11/13] Btrfs: Clean pte corresponding to page straddling i_size Chandan Rajendra
` (2 subsequent siblings)
12 siblings, 1 reply; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
btrfs_getattr() returns PAGE_CACHE_SIZE as the block size. Since
generic_fillattr() already does the right thing (by obtaining block size
from inode->i_blkbits), just remove the statement from btrfs_getattr.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/inode.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2eedb9..c937357 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9197,7 +9197,6 @@ static int btrfs_getattr(struct vfsmount *mnt,
generic_fillattr(inode, stat);
stat->dev = BTRFS_I(inode)->root->anon_dev;
- stat->blksize = PAGE_CACHE_SIZE;
spin_lock(&BTRFS_I(inode)->lock);
delalloc_bytes = BTRFS_I(inode)->delalloc_bytes;
--
2.1.0
* [PATCH V5 11/13] Btrfs: Clean pte corresponding to page straddling i_size
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (9 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 10/13] Btrfs: Fix block size returned to user space Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-10-01 14:57 ` Josef Bacik
2015-09-30 10:28 ` [PATCH V5 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 13/13] Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated Chandan Rajendra
12 siblings, 1 reply; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
When extending a file either by "truncate up" or by writing beyond i_size, the
page that contains i_size needs to be marked "read only" so that a future
write to the page via the mmap interface causes btrfs_page_mkwrite() to be
invoked. Otherwise, a write performed after extending the file via the mmap
interface will find the page writable and continue writing to it without
invoking btrfs_page_mkwrite(), i.e. we end up writing to a file without
reserving disk space.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/file.c | 12 ++++++++++--
fs/btrfs/inode.c | 2 +-
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 360d56d..5715e29 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1757,6 +1757,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t err;
loff_t pos;
size_t count;
+ loff_t oldsize;
+ int clean_page = 0;
mutex_lock(&inode->i_mutex);
err = generic_write_checks(iocb, from);
@@ -1795,14 +1797,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
pos = iocb->ki_pos;
count = iov_iter_count(from);
start_pos = round_down(pos, root->sectorsize);
- if (start_pos > i_size_read(inode)) {
+ oldsize = i_size_read(inode);
+ if (start_pos > oldsize) {
/* Expand hole size to cover write data, preventing empty gap */
end_pos = round_up(pos + count, root->sectorsize);
- err = btrfs_cont_expand(inode, i_size_read(inode), end_pos);
+ err = btrfs_cont_expand(inode, oldsize, end_pos);
if (err) {
mutex_unlock(&inode->i_mutex);
goto out;
}
+ if (start_pos > round_up(oldsize, root->sectorsize))
+ clean_page = 1;
}
if (sync)
@@ -1814,6 +1819,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
num_written = __btrfs_buffered_write(file, from, pos);
if (num_written > 0)
iocb->ki_pos = pos + num_written;
+ if (clean_page)
+ pagecache_isize_extended(inode, oldsize,
+ i_size_read(inode));
}
mutex_unlock(&inode->i_mutex);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c937357..f31da87 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4853,7 +4853,6 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
}
if (newsize > oldsize) {
- truncate_pagecache(inode, newsize);
/*
* Don't do an expanding truncate while snapshoting is ongoing.
* This is to ensure the snapshot captures a fully consistent
@@ -4876,6 +4875,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
i_size_write(inode, newsize);
btrfs_ordered_update_i_size(inode, i_size_read(inode), NULL);
+ pagecache_isize_extended(inode, oldsize, newsize);
ret = btrfs_update_inode(trans, root, inode);
btrfs_end_write_no_snapshoting(root);
btrfs_end_transaction(trans, root);
--
2.1.0
* [PATCH V5 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (10 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 11/13] Btrfs: Clean pte corresponding to page straddling i_size Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-10-01 14:50 ` Josef Bacik
2015-09-30 10:28 ` [PATCH V5 13/13] Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated Chandan Rajendra
12 siblings, 1 reply; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
When reading the page from disk, we can race with direct I/O, which can take
the page lock (before prepare_uptodate_page() gets it) and go ahead and
invalidate the page. Hence, if the page no longer belongs to the inode's
address space, retry getting the page.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/file.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5715e29..76db77c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1316,6 +1316,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
int faili;
for (i = 0; i < num_pages; i++) {
+again:
pages[i] = find_or_create_page(inode->i_mapping, index + i,
mask | __GFP_WRITE);
if (!pages[i]) {
@@ -1330,6 +1331,21 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
if (i == num_pages - 1)
err = prepare_uptodate_page(pages[i],
pos + write_bytes, false);
+
+ /*
+ * When reading the page from the disk, we can race
+ * with direct i/o which can get the page lock (before
+ * prepare_uptodate_page() gets it) and can go ahead
+ * and invalidate the page. Hence if the page no
+ * longer belongs to the inode's address space,
+ * retry the operation of getting the page.
+ */
+ if (unlikely(pages[i]->mapping != inode->i_mapping)) {
+ unlock_page(pages[i]);
+ page_cache_release(pages[i]);
+ goto again;
+ }
+
if (err) {
page_cache_release(pages[i]);
faili = i - 1;
--
2.1.0
* [PATCH V5 13/13] Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
` (11 preceding siblings ...)
2015-09-30 10:28 ` [PATCH V5 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache Chandan Rajendra
@ 2015-09-30 10:28 ` Chandan Rajendra
2015-10-01 14:48 ` Josef Bacik
12 siblings, 1 reply; 23+ messages in thread
From: Chandan Rajendra @ 2015-09-30 10:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Chandan Rajendra, jbacik, clm, bo.li.liu, dsterba, quwenruo,
chandan
The following issue was observed when running the generic/095 test with the
subpagesize-blocksize patchset applied.
Assume that we are trying to write a dirty page that is mapping file offset
range [159744, 163839].
writepage_delalloc()
  find_lock_delalloc_range(*start = 159744, *end = 0)
    find_delalloc_range()
      Returns range [X, Y] where (X > 163839)
    lock_delalloc_pages()
      One of the pages in range [X, Y] has dirty flag cleared;
    Loop once more restricting the delalloc range to span only
    PAGE_CACHE_SIZE bytes;
    find_delalloc_range()
      Returns range [356352, 360447];
    lock_delalloc_pages()
      The page [356352, 360447] has dirty flag cleared;
    Returns with *start = 159744 and *end = 0;
  *start = *end + 1;
  find_lock_delalloc_range(*start = 1, *end = 0)
    Finds and returns delalloc range [1, 12288];
  cow_file_range()
    Clears delalloc range [1, 12288];
    Creates ordered extent for range [1, 12288];
The ordered extent thus created above breaks the rule that extents have to be
aligned to the filesystem's block size.
In cases where lock_delalloc_pages() fails (either due to PG_dirty flag being
cleared or the page no longer being a member of the inode's page cache), this
patch sets and returns the delalloc range that was found by
find_delalloc_range().
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
fs/btrfs/extent_io.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0ee486a..3912d1f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1731,6 +1731,8 @@ again:
goto again;
} else {
found = 0;
+ *start = delalloc_start;
+ *end = delalloc_end;
goto out_failed;
}
}
--
2.1.0
* Re: [PATCH V5 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size
2015-09-30 10:28 ` [PATCH V5 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size Chandan Rajendra
@ 2015-10-01 14:37 ` Josef Bacik
0 siblings, 0 replies; 23+ messages in thread
From: Josef Bacik @ 2015-10-01 14:37 UTC (permalink / raw)
To: Chandan Rajendra, linux-btrfs; +Cc: clm, bo.li.liu, dsterba, quwenruo, chandan
On 09/30/2015 06:28 AM, Chandan Rajendra wrote:
> Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
> units. Fix this by doing reservation/releases in block size units.
>
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
> fs/btrfs/file.c | 44 +++++++++++++++++++++++++++++++-------------
> 1 file changed, 31 insertions(+), 13 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b823fac..12ce401 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -499,7 +499,7 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
> loff_t isize = i_size_read(inode);
>
> start_pos = pos & ~((u64)root->sectorsize - 1);
> - num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);
> + num_bytes = round_up(write_bytes + pos - start_pos, root->sectorsize);
>
> end_of_last_block = start_pos + num_bytes - 1;
> err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
> @@ -1362,16 +1362,19 @@ fail:
> static noinline int
> lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
> size_t num_pages, loff_t pos,
> + size_t write_bytes,
> u64 *lockstart, u64 *lockend,
> struct extent_state **cached_state)
> {
> + struct btrfs_root *root = BTRFS_I(inode)->root;
> u64 start_pos;
> u64 last_pos;
> int i;
> int ret = 0;
>
> - start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1);
> - last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1;
> + start_pos = round_down(pos, root->sectorsize);
> + last_pos = start_pos
> + + round_up(pos + write_bytes - start_pos, root->sectorsize) - 1;
>
> if (start_pos < inode->i_size) {
> struct btrfs_ordered_extent *ordered;
> @@ -1489,6 +1492,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>
> while (iov_iter_count(i) > 0) {
> size_t offset = pos & (PAGE_CACHE_SIZE - 1);
> + size_t sector_offset;
> size_t write_bytes = min(iov_iter_count(i),
> nrptrs * (size_t)PAGE_CACHE_SIZE -
> offset);
> @@ -1497,6 +1501,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
> size_t reserve_bytes;
> size_t dirty_pages;
> size_t copied;
> + size_t dirty_sectors;
> + size_t num_sectors;
>
> WARN_ON(num_pages > nrptrs);
>
> @@ -1509,8 +1515,12 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
> break;
> }
>
> - reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
> + sector_offset = pos & (root->sectorsize - 1);
> + reserve_bytes = round_up(write_bytes + sector_offset,
> + root->sectorsize);
> +
> ret = btrfs_check_data_free_space(inode, reserve_bytes, write_bytes);
> +
> if (ret == -ENOSPC &&
> (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
> BTRFS_INODE_PREALLOC))) {
> @@ -1523,7 +1533,10 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
> */
> num_pages = DIV_ROUND_UP(write_bytes + offset,
> PAGE_CACHE_SIZE);
> - reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
> + reserve_bytes = round_up(write_bytes
> + + sector_offset,
> + root->sectorsize);
> +
> ret = 0;
> } else {
> ret = -ENOSPC;
> @@ -1558,8 +1571,8 @@ again:
> break;
>
> ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages,
> - pos, &lockstart, &lockend,
> - &cached_state);
> + pos, write_bytes, &lockstart,
> + &lockend, &cached_state);
> if (ret < 0) {
> if (ret == -EAGAIN)
> goto again;
> @@ -1595,9 +1608,14 @@ again:
> * we still have an outstanding extent for the chunk we actually
> * managed to copy.
> */
> - if (num_pages > dirty_pages) {
> - release_bytes = (num_pages - dirty_pages) <<
> - PAGE_CACHE_SHIFT;
> + num_sectors = reserve_bytes >> inode->i_blkbits;
> + dirty_sectors = round_up(copied + sector_offset,
> + root->sectorsize);
> + dirty_sectors >>= inode->i_blkbits;
> +
> + if (num_sectors > dirty_sectors) {
> + release_bytes = (write_bytes - copied)
> + & ~((u64)root->sectorsize - 1);
> if (copied > 0) {
> spin_lock(&BTRFS_I(inode)->lock);
> BTRFS_I(inode)->outstanding_extents++;
> @@ -1611,7 +1629,8 @@ again:
> release_bytes);
> }
>
> - release_bytes = dirty_pages << PAGE_CACHE_SHIFT;
> + release_bytes = round_up(copied + sector_offset,
> + root->sectorsize);
>
> if (copied > 0)
> ret = btrfs_dirty_pages(root, inode, pages,
> @@ -1632,8 +1651,7 @@ again:
>
> if (only_release_metadata && copied > 0) {
> lockstart = round_down(pos, root->sectorsize);
> - lockend = lockstart +
> - (dirty_pages << PAGE_CACHE_SHIFT) - 1;
> + lockend = round_up(pos + copied, root->sectorsize) - 1;
>
> set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
> lockend, EXTENT_NORESERVE, NULL,
>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Thanks,
Josef
* Re: [PATCH V5 02/13] Btrfs: Compute and look up csums based on sectorsized blocks
2015-09-30 10:28 ` [PATCH V5 02/13] Btrfs: Compute and look up csums based on sectorsized blocks Chandan Rajendra
@ 2015-10-01 14:39 ` Josef Bacik
2015-10-02 12:20 ` Chandan Rajendra
0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2015-10-01 14:39 UTC (permalink / raw)
To: Chandan Rajendra, linux-btrfs; +Cc: clm, bo.li.liu, dsterba, quwenruo, chandan
On 09/30/2015 06:28 AM, Chandan Rajendra wrote:
> Checksums are applicable to sectorsize units. The current code uses
> bio->bv_len units to compute and look up checksums. This works on machines
> where sectorsize == PAGE_SIZE. This patch makes the checksum computation and
> look up code to work with sectorsize units.
>
> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
> Reviewed-by: Josef Bacik <jbacik@fb.com>
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
> fs/btrfs/file-item.c | 93 +++++++++++++++++++++++++++++++++-------------------
> 1 file changed, 59 insertions(+), 34 deletions(-)
>
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 58ece65..818c859 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
> u64 item_start_offset = 0;
> u64 item_last_offset = 0;
> u64 disk_bytenr;
> + u64 page_bytes_left;
> u32 diff;
> int nblocks;
> int bio_index = 0;
> @@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
> disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
> if (dio)
> offset = logical_offset;
> +
> + page_bytes_left = bvec->bv_len;
> while (bio_index < bio->bi_vcnt) {
> if (!dio)
> offset = page_offset(bvec->bv_page) + bvec->bv_offset;
> @@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
> if (BTRFS_I(inode)->root->root_key.objectid ==
> BTRFS_DATA_RELOC_TREE_OBJECTID) {
> set_extent_bits(io_tree, offset,
> - offset + bvec->bv_len - 1,
> + offset + root->sectorsize - 1,
> EXTENT_NODATASUM, GFP_NOFS);
> } else {
> btrfs_info(BTRFS_I(inode)->root->fs_info,
> @@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
> found:
> csum += count * csum_size;
> nblocks -= count;
> - bio_index += count;
> +
> while (count--) {
> - disk_bytenr += bvec->bv_len;
> - offset += bvec->bv_len;
> - bvec++;
> + disk_bytenr += root->sectorsize;
> + offset += root->sectorsize;
> + page_bytes_left -= root->sectorsize;
> + if (!page_bytes_left) {
> + bio_index++;
> + bvec++;
> + page_bytes_left = bvec->bv_len;
> + }
> +
> }
> }
> btrfs_free_path(path);
> @@ -432,6 +441,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
> struct bio_vec *bvec = bio->bi_io_vec;
> int bio_index = 0;
> int index;
> + int nr_sectors;
> + int i;
> unsigned long total_bytes = 0;
> unsigned long this_sum_bytes = 0;
> u64 offset;
> @@ -451,7 +462,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
> offset = page_offset(bvec->bv_page) + bvec->bv_offset;
>
> ordered = btrfs_lookup_ordered_extent(inode, offset);
> - BUG_ON(!ordered); /* Logic error */
> + ASSERT(ordered); /* Logic error */
> sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
> index = 0;
>
> @@ -459,41 +470,55 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
> if (!contig)
> offset = page_offset(bvec->bv_page) + bvec->bv_offset;
>
> - if (offset >= ordered->file_offset + ordered->len ||
> - offset < ordered->file_offset) {
> - unsigned long bytes_left;
> - sums->len = this_sum_bytes;
> - this_sum_bytes = 0;
> - btrfs_add_ordered_sum(inode, ordered, sums);
> - btrfs_put_ordered_extent(ordered);
> + data = kmap_atomic(bvec->bv_page);
>
> - bytes_left = bio->bi_iter.bi_size - total_bytes;
> + nr_sectors = (bvec->bv_len + root->sectorsize - 1)
> + >> inode->i_blkbits;
> +
> + for (i = 0; i < nr_sectors; i++) {
> + if (offset >= ordered->file_offset + ordered->len ||
> + offset < ordered->file_offset) {
> + unsigned long bytes_left;
> +
> + kunmap_atomic(data);
> + sums->len = this_sum_bytes;
> + this_sum_bytes = 0;
> + btrfs_add_ordered_sum(inode, ordered, sums);
> + btrfs_put_ordered_extent(ordered);
> +
> + bytes_left = bio->bi_iter.bi_size - total_bytes;
> +
> + sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
> + GFP_NOFS);
> + BUG_ON(!sums); /* -ENOMEM */
> + sums->len = bytes_left;
> + ordered = btrfs_lookup_ordered_extent(inode,
> + offset);
> + ASSERT(ordered); /* Logic error */
> + sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9)
> + + total_bytes;
> + index = 0;
> +
> + data = kmap_atomic(bvec->bv_page);
> + }
>
> - sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
> - GFP_NOFS);
> - BUG_ON(!sums); /* -ENOMEM */
> - sums->len = bytes_left;
> - ordered = btrfs_lookup_ordered_extent(inode, offset);
> - BUG_ON(!ordered); /* Logic error */
> - sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
> - total_bytes;
> - index = 0;
> + sums->sums[index] = ~(u32)0;
> + sums->sums[index]
> + = btrfs_csum_data(data + bvec->bv_offset
> + + (i * root->sectorsize),
> + sums->sums[index],
> + root->sectorsize);
> + btrfs_csum_final(sums->sums[index],
> + (char *)(sums->sums + index));
> + index++;
> + offset += root->sectorsize;
> + this_sum_bytes += root->sectorsize;
> + total_bytes += root->sectorsize;
> }
>
What I said about this area in the other email I sent just ignore, I
misread the patch. The other stuff is still valid tho. Thanks,
Josef
* Re: [PATCH V5 13/13] Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated
2015-09-30 10:28 ` [PATCH V5 13/13] Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated Chandan Rajendra
@ 2015-10-01 14:48 ` Josef Bacik
0 siblings, 0 replies; 23+ messages in thread
From: Josef Bacik @ 2015-10-01 14:48 UTC (permalink / raw)
To: Chandan Rajendra, linux-btrfs; +Cc: clm, bo.li.liu, dsterba, quwenruo, chandan
On 09/30/2015 06:28 AM, Chandan Rajendra wrote:
> The following issue was observed when running the generic/095 test on the
> subpagesize-blocksize patchset.
>
> Assume that we are trying to write a dirty page that is mapping file offset
> range [159744, 163839].
>
> writepage_delalloc()
> find_lock_delalloc_range(*start = 159744, *end = 0)
> find_delalloc_range()
> Returns range [X, Y] where (X > 163839)
> lock_delalloc_pages()
> One of the pages in range [X, Y] has dirty flag cleared;
> Loop once more restricting the delalloc range to span only
> PAGE_CACHE_SIZE bytes;
> find_delalloc_range()
> Returns range [356352, 360447];
> lock_delalloc_pages()
> The page [356352, 360447] has dirty flag cleared;
> Returns with *start = 159744 and *end = 0;
> *start = *end + 1;
> find_lock_delalloc_range(*start = 1, *end = 0)
> Finds and returns delalloc range [1, 12288];
> cow_file_range()
> Clears delalloc range [1, 12288]
> Create ordered extent for range [1, 12288]
>
> The ordered extent thus created above breaks the rule that extents have to be
> aligned to the filesystem's block size.
>
> In cases where lock_delalloc_pages() fails (either due to PG_dirty flag being
> cleared or the page no longer being a member of the inode's page cache), this
> patch sets and returns the delalloc range that was found by
> find_delalloc_range().
>
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Thanks,
Josef
* Re: [PATCH V5 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache
2015-09-30 10:28 ` [PATCH V5 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache Chandan Rajendra
@ 2015-10-01 14:50 ` Josef Bacik
2015-10-02 12:24 ` Chandan Rajendra
0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2015-10-01 14:50 UTC (permalink / raw)
To: Chandan Rajendra, linux-btrfs; +Cc: clm, bo.li.liu, dsterba, quwenruo, chandan
On 09/30/2015 06:28 AM, Chandan Rajendra wrote:
> When reading the page from the disk, we can race with Direct I/O which can get
> the page lock (before prepare_uptodate_page() gets it) and can go ahead and
> invalidate the page. Hence if the page is not found in the inode's address
> space, retry the operation of getting a page.
>
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
Huh, how in the world did you make that happen?
Reviewed-by: Josef Bacik <jbacik@fb.com>
Thanks,
Josef
* Re: [PATCH V5 11/13] Btrfs: Clean pte corresponding to page straddling i_size
2015-09-30 10:28 ` [PATCH V5 11/13] Btrfs: Clean pte corresponding to page straddling i_size Chandan Rajendra
@ 2015-10-01 14:57 ` Josef Bacik
2015-10-02 16:34 ` Chandan Rajendra
0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2015-10-01 14:57 UTC (permalink / raw)
To: Chandan Rajendra, linux-btrfs; +Cc: clm, bo.li.liu, dsterba, quwenruo, chandan
On 09/30/2015 06:28 AM, Chandan Rajendra wrote:
> When extending a file by either "truncate up" or by writing beyond i_size, the
> page which had i_size needs to be marked "read only" so that future writes to
> the page via the mmap interface cause btrfs_page_mkwrite() to be invoked. If not,
> a write performed after extending the file via the mmap interface will find
> the page to be writable and continue writing to the page without invoking
> btrfs_page_mkwrite(), i.e. we end up writing to a file without reserving disk
> space.
>
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
> fs/btrfs/file.c | 12 ++++++++++--
> fs/btrfs/inode.c | 2 +-
> 2 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 360d56d..5715e29 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1757,6 +1757,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> ssize_t err;
> loff_t pos;
> size_t count;
> + loff_t oldsize;
> + int clean_page = 0;
>
> mutex_lock(&inode->i_mutex);
> err = generic_write_checks(iocb, from);
> @@ -1795,14 +1797,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> pos = iocb->ki_pos;
> count = iov_iter_count(from);
> start_pos = round_down(pos, root->sectorsize);
> - if (start_pos > i_size_read(inode)) {
> + oldsize = i_size_read(inode);
> + if (start_pos > oldsize) {
> /* Expand hole size to cover write data, preventing empty gap */
> end_pos = round_up(pos + count, root->sectorsize);
> - err = btrfs_cont_expand(inode, i_size_read(inode), end_pos);
> + err = btrfs_cont_expand(inode, oldsize, end_pos);
> if (err) {
> mutex_unlock(&inode->i_mutex);
> goto out;
> }
> + if (start_pos > round_up(oldsize, root->sectorsize))
> + clean_page = 1;
> }
>
> if (sync)
> @@ -1814,6 +1819,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> num_written = __btrfs_buffered_write(file, from, pos);
> if (num_written > 0)
> iocb->ki_pos = pos + num_written;
> + if (clean_page)
> + pagecache_isize_extended(inode, oldsize,
> + i_size_read(inode));
> }
>
> mutex_unlock(&inode->i_mutex);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c937357..f31da87 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4853,7 +4853,6 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
> }
>
> if (newsize > oldsize) {
> - truncate_pagecache(inode, newsize);
So I don't understand why we are dropping this bit here, could you
explain? Otherwise the patch looks fine to me. Thanks,
Josef
* Re: [PATCH V5 10/13] Btrfs: Fix block size returned to user space
2015-09-30 10:28 ` [PATCH V5 10/13] Btrfs: Fix block size returned to user space Chandan Rajendra
@ 2015-10-01 14:58 ` Josef Bacik
0 siblings, 0 replies; 23+ messages in thread
From: Josef Bacik @ 2015-10-01 14:58 UTC (permalink / raw)
To: Chandan Rajendra, linux-btrfs; +Cc: clm, bo.li.liu, dsterba, quwenruo, chandan
On 09/30/2015 06:28 AM, Chandan Rajendra wrote:
> btrfs_getattr() returns PAGE_CACHE_SIZE as the block size. Since
> generic_fillattr() already does the right thing (by obtaining block size
> from inode->i_blkbits), just remove the statement from btrfs_getattr.
>
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Thanks,
Josef
* Re: [PATCH V5 02/13] Btrfs: Compute and look up csums based on sectorsized blocks
2015-10-01 14:39 ` Josef Bacik
@ 2015-10-02 12:20 ` Chandan Rajendra
0 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-10-02 12:20 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, clm, bo.li.liu, dsterba, quwenruo, chandan
On Thursday 01 Oct 2015 10:39:29 Josef Bacik wrote:
> On 09/30/2015 06:28 AM, Chandan Rajendra wrote:
> > Checksums are applicable to sectorsize units. The current code uses
> > bio->bv_len units to compute and look up checksums. This works on machines
> > where sectorsize == PAGE_SIZE. This patch makes the checksum computation
> > and look up code to work with sectorsize units.
> >
> > Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
> > Reviewed-by: Josef Bacik <jbacik@fb.com>
> > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > ---
> >
> > fs/btrfs/file-item.c | 93
> > +++++++++++++++++++++++++++++++++------------------- 1 file changed, 59
> > insertions(+), 34 deletions(-)
> >
> > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> > index 58ece65..818c859 100644
> > --- a/fs/btrfs/file-item.c
> > +++ b/fs/btrfs/file-item.c
> > @@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root
> > *root,>
> > u64 item_start_offset = 0;
> > u64 item_last_offset = 0;
> > u64 disk_bytenr;
> >
> > + u64 page_bytes_left;
> >
> > u32 diff;
> > int nblocks;
> > int bio_index = 0;
> >
> > @@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root
> > *root,>
> > disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
> > if (dio)
> >
> > offset = logical_offset;
> >
> > +
> > + page_bytes_left = bvec->bv_len;
> >
> > while (bio_index < bio->bi_vcnt) {
> >
> > if (!dio)
> >
> > offset = page_offset(bvec->bv_page) + bvec->bv_offset;
> >
> > @@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root
> > *root,>
> > if (BTRFS_I(inode)->root->root_key.objectid ==
> >
> > BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >
> > set_extent_bits(io_tree, offset,
> >
> > - offset + bvec->bv_len - 1,
> > + offset + root->sectorsize - 1,
> >
> > EXTENT_NODATASUM, GFP_NOFS);
> >
> > } else {
> >
> > btrfs_info(BTRFS_I(inode)->root->fs_info,
> >
> > @@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root
> > *root,>
> > found:
> > csum += count * csum_size;
> > nblocks -= count;
> >
> > - bio_index += count;
> > +
> >
> > while (count--) {
> >
> > - disk_bytenr += bvec->bv_len;
> > - offset += bvec->bv_len;
> > - bvec++;
> > + disk_bytenr += root->sectorsize;
> > + offset += root->sectorsize;
> > + page_bytes_left -= root->sectorsize;
> > + if (!page_bytes_left) {
> > + bio_index++;
> > + bvec++;
> > + page_bytes_left = bvec->bv_len;
> > + }
> > +
> >
> I don't understand why this needs to be changed, bv_len is still the
> amount we're copying, irrespective of the page size.
Josef, assume bvec[0] has 2 blocks worth of data and bvec[1] has 4 blocks
worth of data. For the first iteration of the loop, assume that
btrfs_find_ordered_sum() returned 4 csums, i.e. the csums associated with the
first 4 blocks of the bio. In such a scenario, the first of the several csums
returned during the second iteration of the loop applies to the 3rd block
mapped by bvec[1]. Knowing this wouldn't be possible by using only
bvec->bv_len. Hence page_bytes_left helps us figure out the block inside a
bvec to which the first of the new set of csums applies, and also to decide
whether to move to the next bvec or not.
> >
> > }
> >
> > }
> > btrfs_free_path(path);
> >
> > @@ -432,6 +441,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct
> > inode *inode,>
> > struct bio_vec *bvec = bio->bi_io_vec;
> > int bio_index = 0;
> > int index;
> >
> > + int nr_sectors;
> > + int i;
> >
> > unsigned long total_bytes = 0;
> > unsigned long this_sum_bytes = 0;
> > u64 offset;
> >
> > @@ -451,7 +462,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct
> > inode *inode,>
> > offset = page_offset(bvec->bv_page) + bvec->bv_offset;
> >
> > ordered = btrfs_lookup_ordered_extent(inode, offset);
> >
> > - BUG_ON(!ordered); /* Logic error */
> > + ASSERT(ordered); /* Logic error */
> >
>
> Don't worry about converting existing BUG_ON()'s, just don't add new ones.
Ok.
> > sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
> > index = 0;
> >
> > @@ -459,41 +470,55 @@ int btrfs_csum_one_bio(struct btrfs_root *root,
> > struct inode *inode,>
> > if (!contig)
> >
> > offset = page_offset(bvec->bv_page) + bvec->bv_offset;
> >
> > - if (offset >= ordered->file_offset + ordered->len ||
> > - offset < ordered->file_offset) {
> > - unsigned long bytes_left;
> > - sums->len = this_sum_bytes;
> > - this_sum_bytes = 0;
> > - btrfs_add_ordered_sum(inode, ordered, sums);
> > - btrfs_put_ordered_extent(ordered);
> > + data = kmap_atomic(bvec->bv_page);
> >
> > - bytes_left = bio->bi_iter.bi_size - total_bytes;
> > + nr_sectors = (bvec->bv_len + root->sectorsize - 1)
> > + >> inode->i_blkbits;
> > +
> So I've seen similar sort of math in the previous patch for this as
> well, lets make this into a helper.
I agree. I will add a helper function to do that and invoke it in the
appropriate places.
> > + for (i = 0; i < nr_sectors; i++) {
> > + if (offset >= ordered->file_offset + ordered->len ||
> > + offset < ordered->file_offset) {
> > + unsigned long bytes_left;
> > +
> > + kunmap_atomic(data);
> > + sums->len = this_sum_bytes;
> > + this_sum_bytes = 0;
> > + btrfs_add_ordered_sum(inode, ordered, sums);
> > + btrfs_put_ordered_extent(ordered);
> > +
> > + bytes_left = bio->bi_iter.bi_size - total_bytes;
> > +
> > + sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
> > + GFP_NOFS);
> > + BUG_ON(!sums); /* -ENOMEM */
> > + sums->len = bytes_left;
> > + ordered = btrfs_lookup_ordered_extent(inode,
> > + offset);
> > + ASSERT(ordered); /* Logic error */
> > + sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9)
> > + + total_bytes;
> > + index = 0;
> > +
> > + data = kmap_atomic(bvec->bv_page);
> > + }
> >
> > - sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
> > - GFP_NOFS);
> > - BUG_ON(!sums); /* -ENOMEM */
> > - sums->len = bytes_left;
> > - ordered = btrfs_lookup_ordered_extent(inode, offset);
> > - BUG_ON(!ordered); /* Logic error */
> > - sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
> > - total_bytes;
> > - index = 0;
> > + sums->sums[index] = ~(u32)0;
> > + sums->sums[index]
> > + = btrfs_csum_data(data + bvec->bv_offset
> > + + (i * root->sectorsize),
> > + sums->sums[index],
> > + root->sectorsize);
> > + btrfs_csum_final(sums->sums[index],
> > + (char *)(sums->sums + index));
> > + index++;
> > + offset += root->sectorsize;
> > + this_sum_bytes += root->sectorsize;
> > + total_bytes += root->sectorsize;
> >
> > }
>
> What I said about this area in the other email I sent just ignore, I
> misread the patch. The other stuff is still valid tho. Thanks,
>
> Josef
--
chandan
* Re: [PATCH V5 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache
2015-10-01 14:50 ` Josef Bacik
@ 2015-10-02 12:24 ` Chandan Rajendra
0 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-10-02 12:24 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, clm, bo.li.liu, dsterba, quwenruo, chandan, jpa
On Thursday 01 Oct 2015 10:50:30 Josef Bacik wrote:
> On 09/30/2015 06:28 AM, Chandan Rajendra wrote:
> > When reading the page from the disk, we can race with Direct I/O which can
> > get the page lock (before prepare_uptodate_page() gets it) and can go
> > ahead and invalidate the page. Hence if the page is not found in the
> > inode's address space, retry the operation of getting a page.
> >
> > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > ---
>
> Huh, how in the world did you make that happen?
>
The issue is seen when the generic/095 test is run in a loop. I would like to add,
Reported-by: Jakub Palider <jpa@semihalf.com>
> Reviewed-by: Josef Bacik <jbacik@fb.com>
>
> Thanks,
>
> Josef
--
chandan
* Re: [PATCH V5 11/13] Btrfs: Clean pte corresponding to page straddling i_size
2015-10-01 14:57 ` Josef Bacik
@ 2015-10-02 16:34 ` Chandan Rajendra
0 siblings, 0 replies; 23+ messages in thread
From: Chandan Rajendra @ 2015-10-02 16:34 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, clm, bo.li.liu, dsterba, quwenruo, chandan
On Thursday 01 Oct 2015 10:57:52 Josef Bacik wrote:
> On 09/30/2015 06:28 AM, Chandan Rajendra wrote:
> > When extending a file by either "truncate up" or by writing beyond i_size,
> > the page which had i_size needs to be marked "read only" so that future
> > writes to the page via the mmap interface cause btrfs_page_mkwrite() to be
> > invoked. If not, a write performed after extending the file via the mmap
> > interface will find the page to be writable and continue writing to the
> > page without invoking btrfs_page_mkwrite(), i.e. we end up writing to a
> > file without reserving disk space.
> >
> > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > ---
> >
> > fs/btrfs/file.c | 12 ++++++++++--
> > fs/btrfs/inode.c | 2 +-
> > 2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 360d56d..5715e29 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1757,6 +1757,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb
> > *iocb,>
> > ssize_t err;
> > loff_t pos;
> > size_t count;
> >
> > + loff_t oldsize;
> > + int clean_page = 0;
> >
> > mutex_lock(&inode->i_mutex);
> > err = generic_write_checks(iocb, from);
> >
> > @@ -1795,14 +1797,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb
> > *iocb,>
> > pos = iocb->ki_pos;
> > count = iov_iter_count(from);
> > start_pos = round_down(pos, root->sectorsize);
> >
> > - if (start_pos > i_size_read(inode)) {
> > + oldsize = i_size_read(inode);
> > + if (start_pos > oldsize) {
> >
> > /* Expand hole size to cover write data, preventing empty gap */
> > end_pos = round_up(pos + count, root->sectorsize);
> >
> > - err = btrfs_cont_expand(inode, i_size_read(inode), end_pos);
> > + err = btrfs_cont_expand(inode, oldsize, end_pos);
> >
> > if (err) {
> >
> > mutex_unlock(&inode->i_mutex);
> > goto out;
> >
> > }
> >
> > + if (start_pos > round_up(oldsize, root->sectorsize))
> > + clean_page = 1;
> >
> > }
> >
> > if (sync)
> >
> > @@ -1814,6 +1819,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb
> > *iocb,>
> > num_written = __btrfs_buffered_write(file, from, pos);
> > if (num_written > 0)
> >
> > iocb->ki_pos = pos + num_written;
> >
> > + if (clean_page)
> > + pagecache_isize_extended(inode, oldsize,
> > + i_size_read(inode));
> >
> > }
> >
> > mutex_unlock(&inode->i_mutex);
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index c937357..f31da87 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -4853,7 +4853,6 @@ static int btrfs_setsize(struct inode *inode, struct
> > iattr *attr)>
> > }
> >
> > if (newsize > oldsize) {
> >
> > - truncate_pagecache(inode, newsize);
>
> So I don't understand why we are dropping this bit here, could you
> explain? Otherwise the patch looks fine to me. Thanks,
>
Josef, as per our previous discussion on IRC, we found the
"truncate_pagecache(inode, newsize)" statement to be a relic of the
past. During the test runs, I haven't seen any sort of failure caused by the
removal of this statement.
--
chandan
end of thread, other threads:[~2015-10-02 16:34 UTC | newest]
Thread overview: 23+ messages
2015-09-30 10:28 [PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size Chandan Rajendra
2015-10-01 14:37 ` Josef Bacik
2015-09-30 10:28 ` [PATCH V5 02/13] Btrfs: Compute and look up csums based on sectorsized blocks Chandan Rajendra
2015-10-01 14:39 ` Josef Bacik
2015-10-02 12:20 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 03/13] Btrfs: Direct I/O read: Work " Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 04/13] Btrfs: fallocate: Work with " Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 05/13] Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 06/13] Btrfs: Search for all ordered extents that could span across a page Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 07/13] Btrfs: Use (eb->start, seq) as search key for tree modification log Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 08/13] Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 09/13] Btrfs: Limit inline extents to root->sectorsize Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 10/13] Btrfs: Fix block size returned to user space Chandan Rajendra
2015-10-01 14:58 ` Josef Bacik
2015-09-30 10:28 ` [PATCH V5 11/13] Btrfs: Clean pte corresponding to page straddling i_size Chandan Rajendra
2015-10-01 14:57 ` Josef Bacik
2015-10-02 16:34 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache Chandan Rajendra
2015-10-01 14:50 ` Josef Bacik
2015-10-02 12:24 ` Chandan Rajendra
2015-09-30 10:28 ` [PATCH V5 13/13] Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated Chandan Rajendra
2015-10-01 14:48 ` Josef Bacik