[RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2)

* [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2)
@ 2012-12-24  7:55 Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 1/9 v1] ext4: fixup metadata reserve block warning when bigalloc and delalloc are enabled Zheng Liu
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC
  To: linux-ext4; +Cc: Zheng Liu

Hi all,

This is the first attempt at implementing the second step of the extent status
tree.  This step tries to address the following:
 - fix a metadata reserve space warning when bigalloc and delalloc are enabled
 - track all extent status in this tree
 - look up block mappings in this tree, using it as an extent tree cache
 - improve unwritten extent conversion
 - improve dio performance

The patch series is not perfect, and there is still some work left on my TODO
list (see below).  But I believe that I need to send it out as early as possible
to let others review it.  Any comments, suggestions, or feedback are welcome!


The patch series can be split into 5 parts.

Patch 1:
  ext4: fixup metadata reserve block warning when bigalloc and delalloc
    are enabled

  This patch tries to fix a metadata reserve space warning from
ext4_da_update_reserve_space() when bigalloc and delalloc are enabled.  This
warning can be triggered by xfstest #13.

Patch 2:
  ext4: refine extent status tree

  This patch refines the extent status tree code.  The major change is adding
an 'es_' prefix to the extent_status members (e.g. es.start becomes
es.es_lblk).  Some comments are also updated.

Patch 3-5:
  ext4: add physical block and status member into extent status tree
  ext4: adjust interfaces of extent status tree
  ext4: track all extent status in extent status tree

  These patches make the extent status tree track all extent status in memory.
We first add two members (physical block and status) to the tree, and adjust
the related functions to save them in the tree.  Then, whenever we create or
look up an extent in *_map_blocks, the extent is inserted into the extent status
tree.  Currently we don't load all extent status in the alloc_inode function,
because if a file is opened and closed very frequently, doing so would cost too
much memory and add latency to every open.  So for now the solution is to load
extent status on demand.
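
As a rough illustration of what an entry carrying a physical block and a status
could look like, here is a minimal userspace sketch.  The names es_entry,
es_status and es_can_merge are hypothetical and are not taken from the patches;
the rule that two entries merge only when they share a status and, for mapped
extents, are also physically contiguous is an assumption for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status values; the real patches may encode these differently. */
enum es_status { ES_WRITTEN, ES_UNWRITTEN, ES_DELAYED, ES_HOLE };

/* Illustrative stand-in for an extent status entry (not the kernel struct). */
struct es_entry {
	uint32_t lblk;          /* first logical block */
	uint32_t len;           /* number of blocks */
	uint64_t pblk;          /* first physical block; unused for delayed/hole */
	enum es_status status;
};

/*
 * Return true if b directly follows a logically, has the same status and,
 * for extents that have a physical mapping, also follows a physically.
 */
static bool es_can_merge(const struct es_entry *a, const struct es_entry *b)
{
	assert(a->len > 0 && b->len > 0);

	if (a->lblk + a->len != b->lblk)
		return false;
	if (a->status != b->status)
		return false;
	/* Delayed extents and holes have no physical block to compare. */
	if (a->status == ES_DELAYED || a->status == ES_HOLE)
		return true;
	return a->pblk + a->len == b->pblk;
}
```

With entries like these, a block mapping found by *_map_blocks could be inserted
as one entry and merged with a neighbour only when the predicate holds.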

Patch 6:
  ext4: lookup block mapping in extent status tree

  This patch makes the extent status tree act as an in-memory extent cache to
avoid potential disk I/O: we don't need to look up the extent tree if the
lookup hits this cache.  Because the tree does not yet hold complete extent
status, the performance benefit is not very obvious.  But it is useful for
improving unwritten extent conversion.
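
The lookup order described above can be sketched in userspace C as follows.
This is only a toy model: map_cache, map_block and slow_extent_lookup are
hypothetical names, the cache is a small array instead of an rb-tree, and the
slow path is a made-up mapping that stands in for walking the on-disk extent
tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8

struct mapping {
	uint32_t lblk;  /* first logical block */
	uint32_t len;   /* number of blocks */
	uint64_t pblk;  /* first physical block */
};

struct map_cache {
	struct mapping slot[CACHE_SLOTS];
	size_t used;
	unsigned long slow_lookups;  /* times the slow path was taken */
};

/*
 * Hypothetical slow path standing in for an extent tree walk.  Every
 * 16-block aligned run is mapped at pblk = 1000 + lblk.
 */
static struct mapping slow_extent_lookup(uint32_t lblk)
{
	struct mapping m;

	m.lblk = lblk & ~15u;
	m.len = 16;
	m.pblk = 1000 + (uint64_t)m.lblk;
	return m;
}

/* Try the in-memory cache first; on a miss, do the slow lookup and cache it. */
static uint64_t map_block(struct map_cache *c, uint32_t lblk)
{
	struct mapping m;
	size_t i;

	assert(c != NULL);

	for (i = 0; i < c->used; i++) {
		const struct mapping *hit = &c->slot[i];

		if (lblk >= hit->lblk && lblk - hit->lblk < hit->len)
			return hit->pblk + (lblk - hit->lblk);
	}

	m = slow_extent_lookup(lblk);
	c->slow_lookups++;
	if (c->used < CACHE_SLOTS)
		c->slot[c->used++] = m;
	return m.pblk + (lblk - m.lblk);
}
```

Two lookups inside the same cached run take the slow path only once, which is
the disk I/O that the cache is meant to avoid.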

Patch 7-9:
  ext4: add a new convert function to convert an unwritten extent in
    extent status tree
  ext4: refine unwritten extent conversion
  ext4: set dioread_nolock by default for extent-based files

  These patches aim to improve unwritten extent conversion and dio performance.
The first patch adds a new function to convert an unwritten extent in the extent
status tree.  The second patch refines the unwritten extent conversion and
improves dio performance.  Before this patch, every unwritten extent conversion
has to be done in a work queue, because the dio end_io function runs in irq
context and must not take i_data_sem there.  As a result, aio_complete and
inode_dio_done cannot be called to notify the upper level that a dio has
finished until the conversion is done.  When dioread_nolock is enabled, a reader
must wait for the conversion to avoid getting stale data.  After this patch, we
convert the unwritten extent in the extent status tree in the dio end_io
function, and then call aio_complete and inode_dio_done.  We don't need to
worry about exposing stale data here because we always look up a block mapping
in the extent status tree first.  Then we finish the conversion in a work queue
to convert the unwritten extent on disk.  Meanwhile a reader with
dioread_nolock never needs to wait for the conversion, which reduces latency.
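
The ordering described above can be modelled with a small userspace sketch.
It is a toy state machine, not kernel code: toy_extent, toy_end_io, toy_worker
and toy_reader_sees_data are hypothetical names, and the two boolean flags only
stand in for calling aio_complete/inode_dio_done and for queueing work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ext_state { EXT_UNWRITTEN, EXT_WRITTEN };

/* Toy model of one extent: in-memory (cached) state and on-disk state. */
struct toy_extent {
	enum ext_state cached;   /* state in the extent status tree */
	enum ext_state on_disk;  /* state in the on-disk extent tree */
	bool io_completed;       /* stands in for aio_complete/inode_dio_done */
	bool conversion_queued;  /* on-disk conversion deferred to a worker */
};

/* Completion path (irq context in the kernel): only touches memory state. */
static void toy_end_io(struct toy_extent *e)
{
	assert(e != NULL);
	e->cached = EXT_WRITTEN;      /* convert in the status tree first */
	e->io_completed = true;       /* then notify the upper level */
	e->conversion_queued = true;  /* on-disk conversion happens later */
}

/* Deferred work (process context): finishes the on-disk conversion. */
static void toy_worker(struct toy_extent *e)
{
	if (e->conversion_queued) {
		e->on_disk = EXT_WRITTEN;
		e->conversion_queued = false;
	}
}

/* A reader consults the cached state first, so it need not wait for disk. */
static bool toy_reader_sees_data(const struct toy_extent *e)
{
	return e->cached == EXT_WRITTEN;
}
```

In this model the reader already sees written data between toy_end_io() and
toy_worker(), while the on-disk state is still unwritten.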

TODO list in this step:
 - Use the cache when inserting a new extent.  Currently, when a new extent is
   inserted into the extent status tree, the cache is simply invalidated to
   avoid some complexity.  We could use the cache to speed up this process.

 - Refactor the delayed space reservation code.  Delayed space reservation has
   been simplified but it still has some problems.  So maybe a refactor is a
   good choice.

 - Avoid changing the extent status tree when we convert an unwritten extent in
   ext4_convert_unwritten_extents().  Currently ext4_map_blocks is called by
   ext4_convert_unwritten_extents() to convert an unwritten extent.  But by
   that time the unwritten extent has already been converted in the extent
   status tree.

 - Refactor ext4_map_blocks.  In ext4 some operations call this function even
   though these operations are only for extent-based files.  So maybe we need to
   refactor this function to simplify the code.

Here I use fio to do a simple test to verify that dio latency is reduced
after applying this patch series.  The result shows that the max latency
is reduced.  For reads, max submission latency drops from 228903 usec to 19734
usec, and max completion latency drops from 1002.3K usec to 845251 usec.
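
To put the read numbers above in relative terms, a small helper like the one
below can be used.  It is only an illustration: percent_reduction is a
hypothetical name, and 1002.3K usec is taken as 1002300 usec, which assumes fio
prints K as a factor of 1000.

```c
#include <assert.h>

/* Relative reduction, in percent, when a value drops from before to after. */
static double percent_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Under that assumption, percent_reduction(228903, 19734) is a bit over 91% for
max submission latency and percent_reduction(1002300, 845251) is a bit under
16% for max completion latency.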

[fio config file]

[global]
ioengine=libaio
direct=1
bs=4k
thread
group_reporting
directory=/mnt/sda1/
filename=testfile
filesize=10g
size=10g
runtime=120
iodepth=16

[fio]
rw=randrw
numjobs=4

[result]
== w/o patches ==
Starting 4 threads
Jobs: 4 (f=4): [mmmm] [100.0% done] [8862K/8755K/0K /s] [2215 /2188 /0  iops]
[eta 00m:00s] 
fio: (groupid=0, jobs=4): err= 0: pid=14214: Sun Dec 23 23:25:03 2012
  read : io=1457.9MB, bw=12440KB/s, iops=3109 , runt=120007msec
    slat (usec): min=3 , max=228903 , avg=13.00, stdev=534.68
    clat (usec): min=67 , max=1002.3K, avg=10239.69, stdev=46513.08
     lat (usec): min=167 , max=1002.3K, avg=10253.04, stdev=46515.61
    clat percentiles (usec):
     |  1.00th=[  266],  5.00th=[  524], 10.00th=[  660], 20.00th=[  924],
     | 30.00th=[ 1240], 40.00th=[ 1544], 50.00th=[ 1832], 60.00th=[ 2128],
     | 70.00th=[ 2896], 80.00th=[ 3568], 90.00th=[ 4768], 95.00th=[ 7200],
     | 99.00th=[232448], 99.50th=[276480], 99.90th=[468992], 99.95th=[561152],
     | 99.99th=[618496]
    bw (KB/s)  : min=    7, max= 6728, per=25.08%, avg=3119.32, stdev=1100.92
  write: io=1457.5MB, bw=12436KB/s, iops=3109 , runt=120007msec
    slat (usec): min=3 , max=219742 , avg=14.50, stdev=519.13
    clat (usec): min=82 , max=1002.4K, avg=10308.26, stdev=47075.41
     lat (usec): min=100 , max=1002.4K, avg=10323.12, stdev=47083.93
    clat percentiles (usec):
     |  1.00th=[  199],  5.00th=[  346], 10.00th=[  572], 20.00th=[  788],
     | 30.00th=[ 1112], 40.00th=[ 1448], 50.00th=[ 1720], 60.00th=[ 1992],
     | 70.00th=[ 2640], 80.00th=[ 3440], 90.00th=[ 4640], 95.00th=[ 7456],
     | 99.00th=[232448], 99.50th=[276480], 99.90th=[473088], 99.95th=[561152],
     | 99.99th=[618496]
    bw (KB/s)  : min=   23, max= 6424, per=25.07%, avg=3117.85, stdev=1080.65
    lat (usec) : 100=0.01%, 250=1.55%, 500=4.61%, 750=10.12%, 1000=8.25%
    lat (msec) : 2=33.76%, 4=27.86%, 10=9.95%, 20=0.46%, 50=0.18%
    lat (msec) : 100=0.11%, 250=2.56%, 500=0.52%, 750=0.07%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.54%, sys=2.31%, ctx=330224, majf=0,
minf=18446744073709500708
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=373217/w=373112/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=1457.9MB, aggrb=12439KB/s, minb=12439KB/s, maxb=12439KB/s,
mint=120007msec, maxt=120007msec
  WRITE: io=1457.5MB, aggrb=12436KB/s, minb=12436KB/s, maxb=12436KB/s,
mint=120007msec, maxt=120007msec

Disk stats (read/write):
  sda: ios=372594/372606, merge=248/233, ticks=3800094/3825295,
in_queue=7630213, util=100.00%

== w/ patches ==
Starting 4 threads
Jobs: 4 (f=4): [mmmm] [100.0% done] [12518K/12358K/0K /s] [3129 /3089 /0  iops]
[eta 00m:00s]
fio: (groupid=0, jobs=4): err= 0: pid=13551: Sun Dec 23 23:17:12 2012
  read : io=1465.6MB, bw=12501KB/s, iops=3125 , runt=120010msec
    slat (usec): min=3 , max=19734 , avg=11.20, stdev=69.57
    clat (usec): min=70 , max=845251 , avg=10183.20, stdev=46813.94
     lat (usec): min=167 , max=845266 , avg=10194.76, stdev=46813.77
    clat percentiles (usec):
     |  1.00th=[  266],  5.00th=[  524], 10.00th=[  652], 20.00th=[  916],
     | 30.00th=[ 1240], 40.00th=[ 1544], 50.00th=[ 1816], 60.00th=[ 2096],
     | 70.00th=[ 2832], 80.00th=[ 3536], 90.00th=[ 4640], 95.00th=[ 6816],
     | 99.00th=[232448], 99.50th=[305152], 99.90th=[497664], 99.95th=[585728],
     | 99.99th=[618496]
    bw (KB/s)  : min=   53, max= 6528, per=25.20%, avg=3149.71, stdev=1136.70
  write: io=1459.9MB, bw=12457KB/s, iops=3114 , runt=120010msec
    slat (usec): min=3 , max=19539 , avg=12.68, stdev=76.27
    clat (usec): min=79 , max=847388 , avg=10301.65, stdev=47597.19
     lat (usec): min=96 , max=847407 , avg=10314.69, stdev=47598.35
    clat percentiles (usec):
     |  1.00th=[  199],  5.00th=[  342], 10.00th=[  572], 20.00th=[  780],
     | 30.00th=[ 1112], 40.00th=[ 1448], 50.00th=[ 1720], 60.00th=[ 1976],
     | 70.00th=[ 2544], 80.00th=[ 3376], 90.00th=[ 4448], 95.00th=[ 6944],
     | 99.00th=[232448], 99.50th=[313344], 99.90th=[497664], 99.95th=[569344],
     | 99.99th=[626688]
    bw (KB/s)  : min=   38, max= 6696, per=25.20%, avg=3139.33, stdev=1133.35
    lat (usec) : 100=0.01%, 250=1.52%, 500=4.79%, 750=10.01%, 1000=8.39%
    lat (msec) : 2=34.14%, 4=27.93%, 10=9.40%, 20=0.42%, 50=0.15%
    lat (msec) : 100=0.10%, 250=2.44%, 500=0.60%, 750=0.10%, 1000=0.01%
  cpu          : usr=0.52%, sys=2.28%, ctx=333031, majf=0,
minf=18446744073709500709
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=375055/w=373729/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=1465.6MB, aggrb=12500KB/s, minb=12500KB/s, maxb=12500KB/s,
mint=120010msec, maxt=120010msec
  WRITE: io=1459.9MB, aggrb=12456KB/s, minb=12456KB/s, maxb=12456KB/s,
mint=120010msec, maxt=120010msec

Disk stats (read/write):
  sda: ios=374445/373178, merge=203/232, ticks=3803894/3836417,
in_queue=7645242, util=100.00%


Regards,
					- Zheng

Zheng Liu (9):
  ext4: fixup metadata reserve block warning when bigalloc and delalloc
    are enabled
  ext4: refine extent status tree
  ext4: add physical block and status member into extent status tree
  ext4: adjust interfaces of extent status tree
  ext4: track all extent status in extent status tree
  ext4: lookup block mapping in extent status tree
  ext4: add a new convert function to convert an unwritten extent in
    extent status tree
  ext4: refine unwritten extent conversion
  ext4: set dioread_nolock by default for extent-based files

 Documentation/filesystems/ext4.txt |   5 +-
 fs/ext4/ext4.h                     |   2 +-
 fs/ext4/extents.c                  |  26 +-
 fs/ext4/extents_status.c           | 545 +++++++++++++++++++++++++++----------
 fs/ext4/extents_status.h           |  37 ++-
 fs/ext4/file.c                     |  14 +-
 fs/ext4/indirect.c                 |  11 +-
 fs/ext4/inode.c                    | 150 +++++++---
 fs/ext4/page-io.c                  |  26 +-
 fs/ext4/super.c                    |   8 +
 include/trace/events/ext4.h        |  62 +++--
 11 files changed, 650 insertions(+), 236 deletions(-)

-- 
1.7.12.rc2.18.g61b472e



* [RFC][PATCH 1/9 v1] ext4: fixup metadata reserve block warning when bigalloc and delalloc are enabled
  2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
@ 2012-12-24  7:55 ` Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 2/9 v1] ext4: refine extent status tree Zheng Liu
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC
  To: linux-ext4; +Cc: Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

When bigalloc and delalloc are enabled, some stress tests (e.g. xfstest #13)
will trigger this warning because all reserved metadata space is released
after dirty pages are written out.  But we still need to allocate a block,
e.g. for growing the extent tree.  So *DO NOT* release these metadata blocks
when bigalloc and delalloc are enabled.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/inode.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cb1c1ab..91542be 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -375,10 +375,16 @@ void ext4_da_update_reserve_space(struct inode *inode,
 		 * We can release all of the reserved metadata blocks
 		 * only when we have written all of the delayed
 		 * allocation blocks.
+		 * When bigalloc and delalloc are enabled, we couldn't release
+		 * all of reserved metadata blocks, although all delay blocks
+		 * are written out, because it still has some metadata blocks
+		 * which are allocated.
 		 */
-		percpu_counter_sub(&sbi->s_dirtyclusters_counter,
-				   ei->i_reserved_meta_blocks);
-		ei->i_reserved_meta_blocks = 0;
+		if (sbi->s_cluster_ratio == 1) {
+			percpu_counter_sub(&sbi->s_dirtyclusters_counter,
+					   ei->i_reserved_meta_blocks);
+			ei->i_reserved_meta_blocks = 0;
+		}
 		ei->i_da_metadata_calc_len = 0;
 	}
 	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
-- 
1.7.12.rc2.18.g61b472e



* [RFC][PATCH 2/9 v1] ext4: refine extent status tree
  2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 1/9 v1] ext4: fixup metadata reserve block warning when bigalloc and delalloc are enabled Zheng Liu
@ 2012-12-24  7:55 ` Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 3/9 v1] ext4: add physical block and status member into " Zheng Liu
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC
  To: linux-ext4; +Cc: Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

This patch tries to refine the extent status tree.  A prefix 'es_' is added to
make the code clearer.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents.c           |  21 ++--
 fs/ext4/extents_status.c    | 300 ++++++++++++++++++++++++--------------------
 fs/ext4/extents_status.h    |   8 +-
 fs/ext4/file.c              |  12 +-
 include/trace/events/ext4.h |  40 +++---
 5 files changed, 207 insertions(+), 174 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 26af228..0a0f635 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3511,13 +3511,14 @@ static int ext4_find_delalloc_range(struct inode *inode,
 {
 	struct extent_status es;
 
-	es.start = lblk_start;
-	ext4_es_find_extent(inode, &es);
-	if (es.len == 0)
+	es.es_lblk = lblk_start;
+	(void)ext4_es_find_extent(inode, &es);
+	if (es.es_len == 0)
 		return 0; /* there is no delay extent in this tree */
-	else if (es.start <= lblk_start && lblk_start < es.start + es.len)
+	else if (es.es_lblk <= lblk_start &&
+		 lblk_start < es.es_lblk + es.es_len)
 		return 1;
-	else if (lblk_start <= es.start && es.start <= lblk_end)
+	else if (lblk_start <= es.es_lblk && es.es_lblk <= lblk_end)
 		return 1;
 	else
 		return 0;
@@ -4553,7 +4554,7 @@ static int ext4_find_delayed_extent(struct inode *inode,
 	struct extent_status es;
 	ext4_lblk_t next_del;
 
-	es.start = newex->ec_block;
+	es.es_lblk = newex->ec_block;
 	next_del = ext4_es_find_extent(inode, &es);
 
 	if (newex->ec_start == 0) {
@@ -4561,18 +4562,18 @@ static int ext4_find_delayed_extent(struct inode *inode,
 		 * No extent in extent-tree contains block @newex->ec_start,
 		 * then the block may stay in 1)a hole or 2)delayed-extent.
 		 */
-		if (es.len == 0)
+		if (es.es_len == 0)
 			/* A hole found. */
 			return 0;
 
-		if (es.start > newex->ec_block) {
+		if (es.es_lblk > newex->ec_block) {
 			/* A hole found. */
-			newex->ec_len = min(es.start - newex->ec_block,
+			newex->ec_len = min(es.es_lblk - newex->ec_block,
 					    newex->ec_len);
 			return 0;
 		}
 
-		newex->ec_len = es.start + es.len - newex->ec_block;
+		newex->ec_len = es.es_lblk + es.es_len - newex->ec_block;
 	}
 
 	return next_del;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 564d981..5878fb3 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -26,9 +26,10 @@
  * extent tree, whose goal is only track delay extent in memory to
  * simplify the implementation of fiemap and bigalloc, and introduce
  * lseek SEEK_DATA/SEEK_HOLE support.  That is why it is still called
- * delay extent tree at the following comment.  But for better
- * understand what it does, it has been rename to extent status tree.
+ * delay extent tree at the first commit.  But for better understand
+ * what it does, it has been rename to extent status tree.
  *
+ * Step1:
  * Currently the first step has been done.  All delay extents are
  * tracked in the tree.  It maintains the delay extent when a delay
  * allocation is issued, and the delay extent is written out or
@@ -37,26 +38,38 @@
  *
  * The following comment describes the implemenmtation of extent
  * status tree and future works.
+ *
+ * Step2:
+ * In this step all extent status is tracked by extent status tree.
+ * Thus, we can first try to lookup a block mapping in this tree before
+ * find it in extent tree.  Unwritten extent conversion also can be
+ * improved.  Currently this conversion need to be done in a workqueue
+ * because this conversion can not be done in end_io function due to it
+ * needs to take i_data_sem locking in a irq context.  After looking block
+ * mapping in extent status tree, we can first convert unwritten extent in
+ * extent status tree, call aio_complete() and inode_dio_done() in end_io
+ * function, and don't need to be worried about expose a stale data.
+ * Meanwhile when dioread_nolock is enabled, reader won't need to wait
+ * this unwritten extent conversion, and latency also is reduced.
  */
 
 /*
- * extents status tree implementation for ext4.
+ * Extent status tree implementation for ext4.
  *
  *
  * ==========================================================================
- * Extents status encompass delayed extents and extent locks
+ * Extent status tracks all extent status.
  *
- * 1. Why delayed extent implementation ?
+ * 1. Why we need to implement extent status tree?
  *
- * Without delayed extent, ext4 identifies a delayed extent by looking
+ * Without extent status tree, ext4 identifies a delayed extent by looking
  * up page cache, this has several deficiencies - complicated, buggy,
  * and inefficient code.
  *
- * FIEMAP, SEEK_HOLE/DATA, bigalloc, punch hole and writeout all need
- * to know if a block or a range of blocks are belonged to a delayed
- * extent.
+ * FIEMAP, SEEK_HOLE/DATA, bigalloc, and writeout all need to know if a
+ * block or a range of blocks are belonged to a delayed extent.
  *
- * Let us have a look at how they do without delayed extents implementation.
+ * Let us have a look at how they do without extent status tree.
  *   --	FIEMAP
  *	FIEMAP looks up page cache to identify delayed allocations from holes.
  *
@@ -68,47 +81,43 @@
  *	already under delayed allocation or not to determine whether
  *	quota reserving is needed for the cluster.
  *
- *   -- punch hole
- *	punch hole looks up page cache to identify a delayed extent.
- *
  *   --	writeout
  *	Writeout looks up whole page cache to see if a buffer is
  *	mapped, If there are not very many delayed buffers, then it is
  *	time comsuming.
  *
- * With delayed extents implementation, FIEMAP, SEEK_HOLE/DATA,
+ * With extent status tree implementation, FIEMAP, SEEK_HOLE/DATA,
  * bigalloc and writeout can figure out if a block or a range of
  * blocks is under delayed allocation(belonged to a delayed extent) or
- * not by searching the delayed extent tree.
+ * not by searching the extent status tree.
  *
  *
  * ==========================================================================
- * 2. ext4 delayed extents impelmentation
+ * 2. Ext4 extent status tree implementation
  *
- *   --	delayed extent
- *	A delayed extent is a range of blocks which are contiguous
- *	logically and under delayed allocation.  Unlike extent in
- *	ext4, delayed extent in ext4 is a in-memory struct, there is
- *	no corresponding on-disk data.  There is no limit on length of
- *	delayed extent, so a delayed extent can contain as many blocks
- *	as they are contiguous logically.
+ *   --	extent
+ *	A extent is a range of blocks which are contiguous logically and
+ *	physically.  Unlike extent in extent tree, this extent in ext4 is
+ *	a in-memory struct, there is no corresponding on-disk data.  There
+ *	is no limit on length of extent, so an extent can contain as many
+ *	blocks as they are contiguous logically and physically.
  *
- *   --	delayed extent tree
- *	Every inode has a delayed extent tree and all under delayed
- *	allocation blocks are added to the tree as delayed extents.
- *	Delayed extents in the tree are ordered by logical block no.
+ *   --	extent status tree
+ *	Every inode has an extent status tree and all allocation blocks
+ *	are added to the tree with different status.  The extent in the
+ *	tree are ordered by logical block no.
  *
- *   --	operations on a delayed extent tree
- *	There are three operations on a delayed extent tree: find next
- *	delayed extent, adding a space(a range of blocks) and removing
- *	a space.
+ *   --	operations on a extent status tree
+ *	There are three important operations on an extent status tree: find
+ *	next extent, adding a extent(a range of blocks) and removing a extent.
  *
- *   --	race on a delayed extent tree
- *	Delayed extent tree is protected inode->i_es_lock.
+ *   --	race on a extent status tree
+ *	Extent status tree is protected by inode->i_es_lock.
  *
  *
  * ==========================================================================
- * 3. performance analysis
+ * 3. Performance analysis
+ *
  *   --	overhead
  *	1. There is a cache extent for write access, so if writes are
  *	not very random, adding space operaions are in O(1) time.
@@ -120,15 +129,19 @@
  *
  * ==========================================================================
  * 4. TODO list
- *   -- Track all extent status
  *
- *   -- Improve get block process
+ *   -- Refactor delayed reserve space
  *
  *   -- Extent-level locking
  */
 
 static struct kmem_cache *ext4_es_cachep;
 
+static int __es_insert_extent(struct ext4_es_tree *tree,
+			      struct extent_status *newes);
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+				 ext4_lblk_t end);
+
 int __init ext4_init_es(void)
 {
 	ext4_es_cachep = KMEM_CACHE(extent_status, SLAB_RECLAIM_ACCOUNT);
@@ -161,7 +174,7 @@ static void ext4_es_print_tree(struct inode *inode)
 	while (node) {
 		struct extent_status *es;
 		es = rb_entry(node, struct extent_status, rb_node);
-		printk(KERN_DEBUG " [%u/%u)", es->start, es->len);
+		printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
 		node = rb_next(node);
 	}
 	printk(KERN_DEBUG "\n");
@@ -172,8 +185,8 @@ static void ext4_es_print_tree(struct inode *inode)
 
 static inline ext4_lblk_t extent_status_end(struct extent_status *es)
 {
-	BUG_ON(es->start + es->len < es->start);
-	return es->start + es->len - 1;
+	BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
+	return es->es_lblk + es->es_len - 1;
 }
 
 /*
@@ -181,25 +194,25 @@ static inline ext4_lblk_t extent_status_end(struct extent_status *es)
  * it can't be found, try to find next extent.
  */
 static struct extent_status *__es_tree_search(struct rb_root *root,
-					      ext4_lblk_t offset)
+					      ext4_lblk_t lblk)
 {
 	struct rb_node *node = root->rb_node;
 	struct extent_status *es = NULL;
 
 	while (node) {
 		es = rb_entry(node, struct extent_status, rb_node);
-		if (offset < es->start)
+		if (lblk < es->es_lblk)
 			node = node->rb_left;
-		else if (offset > extent_status_end(es))
+		else if (lblk > extent_status_end(es))
 			node = node->rb_right;
 		else
 			return es;
 	}
 
-	if (es && offset < es->start)
+	if (es && lblk < es->es_lblk)
 		return es;
 
-	if (es && offset > extent_status_end(es)) {
+	if (es && lblk > extent_status_end(es)) {
 		node = rb_next(&es->rb_node);
 		return node ? rb_entry(node, struct extent_status, rb_node) :
 			      NULL;
@@ -209,8 +222,8 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
 }
 
 /*
- * ext4_es_find_extent: find the 1st delayed extent covering @es->start
- * if it exists, otherwise, the next extent after @es->start.
+ * ext4_es_find_extent: find the 1st delayed extent covering @es->es_lblk
+ * if it exists, otherwise, the next extent after @es->es_lblk.
  *
  * @inode: the inode which owns delayed extents
  * @es: delayed extent that we found
@@ -226,7 +239,7 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	struct rb_node *node;
 	ext4_lblk_t ret = EXT_MAX_BLOCKS;
 
-	trace_ext4_es_find_extent_enter(inode, es->start);
+	trace_ext4_es_find_extent_enter(inode, es->es_lblk);
 
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
@@ -234,25 +247,25 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	/* find delay extent in cache firstly */
 	if (tree->cache_es) {
 		es1 = tree->cache_es;
-		if (in_range(es->start, es1->start, es1->len)) {
+		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
 			es_debug("%u cached by [%u/%u)\n",
-				 es->start, es1->start, es1->len);
+				 es->es_lblk, es1->es_lblk, es1->es_len);
 			goto out;
 		}
 	}
 
-	es->len = 0;
-	es1 = __es_tree_search(&tree->root, es->start);
+	es->es_len = 0;
+	es1 = __es_tree_search(&tree->root, es->es_lblk);
 
 out:
 	if (es1) {
 		tree->cache_es = es1;
-		es->start = es1->start;
-		es->len = es1->len;
+		es->es_lblk = es1->es_lblk;
+		es->es_len = es1->es_len;
 		node = rb_next(&es1->rb_node);
 		if (node) {
 			es1 = rb_entry(node, struct extent_status, rb_node);
-			ret = es1->start;
+			ret = es1->es_lblk;
 		}
 	}
 
@@ -263,14 +276,14 @@ out:
 }
 
 static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t start, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
 {
 	struct extent_status *es;
 	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
 	if (es == NULL)
 		return NULL;
-	es->start = start;
-	es->len = len;
+	es->es_lblk = lblk;
+	es->es_len = len;
 	return es;
 }
 
@@ -279,6 +292,20 @@ static void ext4_es_free_extent(struct extent_status *es)
 	kmem_cache_free(ext4_es_cachep, es);
 }
 
+/*
+ * Check whether or not two extents can be merged
+ * Condition:
+ *  - logical block number is contiguous
+ */
+static int ext4_es_can_be_merged(struct extent_status *es1,
+				 struct extent_status *es2)
+{
+	if (es1->es_lblk + es1->es_len != es2->es_lblk)
+		return 0;
+
+	return 1;
+}
+
 static struct extent_status *
 ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
 {
@@ -290,8 +317,8 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
 		return es;
 
 	es1 = rb_entry(node, struct extent_status, rb_node);
-	if (es->start == extent_status_end(es1) + 1) {
-		es1->len += es->len;
+	if (ext4_es_can_be_merged(es1, es)) {
+		es1->es_len += es->es_len;
 		rb_erase(&es->rb_node, &tree->root);
 		ext4_es_free_extent(es);
 		es = es1;
@@ -311,8 +338,8 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
 		return es;
 
 	es1 = rb_entry(node, struct extent_status, rb_node);
-	if (es1->start == extent_status_end(es) + 1) {
-		es->len += es1->len;
+	if (ext4_es_can_be_merged(es, es1)) {
+		es->es_len += es1->es_len;
 		rb_erase(node, &tree->root);
 		ext4_es_free_extent(es1);
 	}
@@ -320,92 +347,85 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
 	return es;
 }
 
-static int __es_insert_extent(struct ext4_es_tree *tree, ext4_lblk_t offset,
-			      ext4_lblk_t len)
+static int __es_insert_extent(struct ext4_es_tree *tree,
+			      struct extent_status *newes)
 {
 	struct rb_node **p = &tree->root.rb_node;
 	struct rb_node *parent = NULL;
 	struct extent_status *es;
-	ext4_lblk_t end = offset + len - 1;
-
-	BUG_ON(end < offset);
-	es = tree->cache_es;
-	if (es && offset == (extent_status_end(es) + 1)) {
-		es_debug("cached by [%u/%u)\n", es->start, es->len);
-		es->len += len;
-		es = ext4_es_try_to_merge_right(tree, es);
-		goto out;
-	} else if (es && es->start == end + 1) {
-		es_debug("cached by [%u/%u)\n", es->start, es->len);
-		es->start = offset;
-		es->len += len;
-		es = ext4_es_try_to_merge_left(tree, es);
-		goto out;
-	} else if (es && es->start <= offset &&
-		   end <= extent_status_end(es)) {
-		es_debug("cached by [%u/%u)\n", es->start, es->len);
-		goto out;
-	}
+
+	/* invalidate cache */
+	/* TODO: first try to lookup in cache */
+	tree->cache_es = NULL;
 
 	while (*p) {
 		parent = *p;
 		es = rb_entry(parent, struct extent_status, rb_node);
 
-		if (offset < es->start) {
-			if (es->start == end + 1) {
-				es->start = offset;
-				es->len += len;
+		if (newes->es_lblk < es->es_lblk) {
+			if (ext4_es_can_be_merged(newes, es)) {
+				es->es_lblk = newes->es_lblk;
+				es->es_len += newes->es_len;
 				es = ext4_es_try_to_merge_left(tree, es);
 				goto out;
 			}
 			p = &(*p)->rb_left;
-		} else if (offset > extent_status_end(es)) {
-			if (offset == extent_status_end(es) + 1) {
-				es->len += len;
+		} else if (newes->es_lblk > extent_status_end(es)) {
+			if (ext4_es_can_be_merged(es, newes)) {
+				es->es_len += newes->es_len;
 				es = ext4_es_try_to_merge_right(tree, es);
 				goto out;
 			}
 			p = &(*p)->rb_right;
 		} else {
-			if (extent_status_end(es) <= end)
-				es->len = offset - es->start + len;
-			goto out;
+			BUG_ON(1);
+			return -EINVAL;
 		}
 	}
 
-	es = ext4_es_alloc_extent(offset, len);
+	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
 	if (!es)
 		return -ENOMEM;
 	rb_link_node(&es->rb_node, parent, p);
 	rb_insert_color(&es->rb_node, &tree->root);
 
 out:
-	tree->cache_es = es;
 	return 0;
 }
 
 /*
- * ext4_es_insert_extent() adds a space to a delayed extent tree.
- * Caller holds inode->i_es_lock.
+ * ext4_es_insert_extent() adds a space to a extent status tree.
  *
  * ext4_es_insert_extent is called by ext4_da_write_begin and
  * ext4_es_remove_extent.
  *
  * Return 0 on success, error code on failure.
  */
-int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
+int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			  ext4_lblk_t len)
 {
 	struct ext4_es_tree *tree;
+	struct extent_status newes;
+	ext4_lblk_t end = lblk + len - 1;
 	int err = 0;
 
-	trace_ext4_es_insert_extent(inode, offset, len);
+	trace_ext4_es_insert_extent(inode, lblk, len);
 	es_debug("add [%u/%u) to extent status tree of inode %lu\n",
-		 offset, len, inode->i_ino);
+		 lblk, len, inode->i_ino);
+
+	BUG_ON(end < lblk);
+
+	newes.es_lblk = lblk;
+	newes.es_len = len;
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
-	err = __es_insert_extent(tree, offset, len);
+	err = __es_remove_extent(tree, lblk, end);
+	if (err != 0)
+		goto error;
+	err = __es_insert_extent(tree, &newes);
+
+error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 
 	ext4_es_print_tree(inode);
@@ -413,57 +433,46 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
 	return err;
 }
 
-/*
- * ext4_es_remove_extent() removes a space from a delayed extent tree.
- * Caller holds inode->i_es_lock.
- *
- * Return 0 on success, error code on failure.
- */
-int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
-			  ext4_lblk_t len)
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+				 ext4_lblk_t end)
 {
 	struct rb_node *node;
-	struct ext4_es_tree *tree;
 	struct extent_status *es;
 	struct extent_status orig_es;
-	ext4_lblk_t len1, len2, end;
+	ext4_lblk_t len1, len2;
 	int err = 0;
 
-	trace_ext4_es_remove_extent(inode, offset, len);
-	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
-		 offset, len, inode->i_ino);
-
-	end = offset + len - 1;
-	BUG_ON(end < offset);
-	write_lock(&EXT4_I(inode)->i_es_lock);
-	tree = &EXT4_I(inode)->i_es_tree;
-	es = __es_tree_search(&tree->root, offset);
+	es = __es_tree_search(&tree->root, lblk);
 	if (!es)
 		goto out;
-	if (es->start > end)
+	if (es->es_lblk > end)
 		goto out;
 
 	/* Simply invalidate cache_es. */
 	tree->cache_es = NULL;
 
-	orig_es.start = es->start;
-	orig_es.len = es->len;
-	len1 = offset > es->start ? offset - es->start : 0;
+	orig_es.es_lblk = es->es_lblk;
+	orig_es.es_len = es->es_len;
+	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
 	len2 = extent_status_end(es) > end ?
 	       extent_status_end(es) - end : 0;
 	if (len1 > 0)
-		es->len = len1;
+		es->es_len = len1;
 	if (len2 > 0) {
 		if (len1 > 0) {
-			err = __es_insert_extent(tree, end + 1, len2);
+			struct extent_status newes;
+
+			newes.es_lblk = end + 1;
+			newes.es_len = len2;
+			err = __es_insert_extent(tree, &newes);
 			if (err) {
-				es->start = orig_es.start;
-				es->len = orig_es.len;
+				es->es_lblk = orig_es.es_lblk;
+				es->es_len = orig_es.es_len;
 				goto out;
 			}
 		} else {
-			es->start = end + 1;
-			es->len = len2;
+			es->es_lblk = end + 1;
+			es->es_len = len2;
 		}
 		goto out;
 	}
@@ -487,13 +496,38 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
 		es = rb_entry(node, struct extent_status, rb_node);
 	}
 
-	if (es && es->start < end + 1) {
+	if (es && es->es_lblk < end + 1) {
 		len1 = extent_status_end(es) - end;
-		es->start = end + 1;
-		es->len = len1;
+		es->es_lblk = end + 1;
+		es->es_len = len1;
 	}
 
 out:
+	return err;
+}
+/*
+ * ext4_es_remove_extent() removes a space from an extent status tree.
+ *
+ * Return 0 on success, error code on failure.
+ */
+int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
+			  ext4_lblk_t len)
+{
+	struct ext4_es_tree *tree;
+	ext4_lblk_t end;
+	int err = 0;
+
+	trace_ext4_es_remove_extent(inode, lblk, len);
+	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
+		 lblk, len, inode->i_ino);
+
+	end = lblk + len - 1;
+	BUG_ON(end < lblk);
+
+	tree = &EXT4_I(inode)->i_es_tree;
+
+	write_lock(&EXT4_I(inode)->i_es_lock);
+	err = __es_remove_extent(tree, lblk, end);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	ext4_es_print_tree(inode);
 	return err;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 077f82d..81e9339 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -22,8 +22,8 @@
 
 struct extent_status {
 	struct rb_node rb_node;
-	ext4_lblk_t start;	/* first block extent covers */
-	ext4_lblk_t len;	/* length of extent in block */
+	ext4_lblk_t es_lblk;	/* first logical block extent covers */
+	ext4_lblk_t es_len;	/* length of extent in block */
 };
 
 struct ext4_es_tree {
@@ -35,9 +35,9 @@ extern int __init ext4_init_es(void);
 extern void ext4_exit_es(void);
 extern void ext4_es_init_tree(struct ext4_es_tree *tree);
 
-extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
-extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
 extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
 				struct extent_status *es);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b64a60b..8b65a01 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -472,10 +472,9 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		 * If there is a delay extent at this offset,
 		 * it will be as a data.
 		 */
-		es.start = last;
+		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.start &&
-		    last < es.start + es.len) {
+		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			if (last != start)
 				dataoff = last << blkbits;
 			break;
@@ -557,11 +556,10 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 		 * If there is a delay extent at this offset,
 		 * we will skip this extent.
 		 */
-		es.start = last;
+		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.start &&
-		    last < es.start + es.len) {
-			last = es.start + es.len;
+		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+			last = es.es_lblk + es.es_len;
 			holeoff = last << blkbits;
 			continue;
 		}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f6372b0..11374b7 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2056,75 +2056,75 @@ TRACE_EVENT(ext4_ext_remove_space_done,
 );
 
 TRACE_EVENT(ext4_es_insert_extent,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
 
-	TP_ARGS(inode, start, len),
+	TP_ARGS(inode, lblk, len),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
-		__field(	loff_t,	start			)
+		__field(	loff_t,	lblk			)
 		__field(	loff_t, len			)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= start;
+		__entry->lblk	= lblk;
 		__entry->len	= len;
 	),
 
 	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->start, __entry->len)
+		  __entry->lblk, __entry->len)
 );
 
 TRACE_EVENT(ext4_es_remove_extent,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
 
-	TP_ARGS(inode, start, len),
+	TP_ARGS(inode, lblk, len),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
-		__field(	loff_t,	start			)
+		__field(	loff_t,	lblk			)
 		__field(	loff_t,	len			)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= start;
+		__entry->lblk	= lblk;
 		__entry->len	= len;
 	),
 
 	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->start, __entry->len)
+		  __entry->lblk, __entry->len)
 );
 
 TRACE_EVENT(ext4_es_find_extent_enter,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start),
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
 
-	TP_ARGS(inode, start),
+	TP_ARGS(inode, lblk),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,		dev		)
 		__field(	ino_t,		ino		)
-		__field(	ext4_lblk_t,	start		)
+		__field(	ext4_lblk_t,	lblk		)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= start;
+		__entry->lblk	= lblk;
 	),
 
-	TP_printk("dev %d,%d ino %lu start %u",
+	TP_printk("dev %d,%d ino %lu lblk %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  (unsigned long) __entry->ino, __entry->start)
+		  (unsigned long) __entry->ino, __entry->lblk)
 );
 
 TRACE_EVENT(ext4_es_find_extent_exit,
@@ -2136,7 +2136,7 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 	TP_STRUCT__entry(
 		__field(	dev_t,		dev		)
 		__field(	ino_t,		ino		)
-		__field(	ext4_lblk_t,	start		)
+		__field(	ext4_lblk_t,	lblk		)
 		__field(	ext4_lblk_t,	len		)
 		__field(	ext4_lblk_t,	ret		)
 	),
@@ -2144,15 +2144,15 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= es->start;
-		__entry->len	= es->len;
+		__entry->lblk	= es->es_lblk;
+		__entry->len	= es->es_len;
 		__entry->ret	= ret;
 	),
 
 	TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->start, __entry->len, __entry->ret)
+		  __entry->lblk, __entry->len, __entry->ret)
 );
 
 #endif /* _TRACE_EXT4_H */
-- 
1.7.12.rc2.18.g61b472e



* [RFC][PATCH 3/9 v1] ext4: add physical block and status member into extent status tree
  2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 1/9 v1] ext4: fixup metadata reserve block warning when bigalloc and delalloc are enabled Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 2/9 v1] ext4: refine extent status tree Zheng Liu
@ 2012-12-24  7:55 ` Zheng Liu
  2012-12-31 21:49   ` Jan Kara
  2012-12-24  7:55 ` [RFC][PATCH 4/9 v1] ext4: adjust interfaces of " Zheng Liu
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

es_pblk records the first physical block that the extent maps to on disk.
es_status records the status of the extent.  Three statuses are defined:
written, unwritten, and delayed.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 81e9339..85115bb 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -20,10 +20,18 @@
 #define es_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
+enum {
+	EXTENT_STATUS_WRITTEN = 0,	/* written extent */
+	EXTENT_STATUS_UNWRITTEN = 1,	/* unwritten extent */
+	EXTENT_STATUS_DELAYED = 2,	/* delayed extent */
+};
+
 struct extent_status {
 	struct rb_node rb_node;
 	ext4_lblk_t es_lblk;	/* first logical block extent covers */
 	ext4_lblk_t es_len;	/* length of extent in block */
+	ext4_fsblk_t es_pblk;	/* first physical block */
+	int es_status;		/* record the status of extent */
 };
 
 struct ext4_es_tree {
-- 
1.7.12.rc2.18.g61b472e



* [RFC][PATCH 4/9 v1] ext4: adjust interfaces of extent status tree
  2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
                   ` (2 preceding siblings ...)
  2012-12-24  7:55 ` [RFC][PATCH 3/9 v1] ext4: add physical block and status member into " Zheng Liu
@ 2012-12-24  7:55 ` Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 5/9 v1] ext4: track all extent status in " Zheng Liu
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

Because two members have been added to the extent status tree, all interfaces
need to be adjusted.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.c    | 51 +++++++++++++++++++++++++++++++++++++--------
 fs/ext4/extents_status.h    | 18 +++++++++++++++-
 fs/ext4/inode.c             |  3 ++-
 include/trace/events/ext4.h | 34 +++++++++++++++++++-----------
 4 files changed, 83 insertions(+), 23 deletions(-)
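
An illustrative user-space sketch (not part of the patch) of the two rules the
adjusted interfaces depend on: when two neighbouring extents may be merged, and
how the physical block moves when the front of an extent is cut off.  The names
es_sketch, lblk_t, fsblk_t and ES_NO_PBLK are hypothetical stand-ins chosen for
this example; only the conditions themselves are taken from the diff below.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t lblk_t;	/* stand-in for ext4_lblk_t (assumption) */
typedef uint64_t fsblk_t;	/* stand-in for ext4_fsblk_t (assumption) */

/* Delayed extents have no physical block yet; the patch stores ~0. */
#define ES_NO_PBLK (~(fsblk_t)0)

enum { ES_WRITTEN = 0, ES_UNWRITTEN = 1, ES_DELAYED = 2 };

struct es_sketch {
	lblk_t es_lblk;		/* first logical block */
	lblk_t es_len;		/* length in blocks */
	fsblk_t es_pblk;	/* first physical block */
	int es_status;		/* written, unwritten or delayed */
};

/*
 * Return 1 if 'right' can be appended to 'left': logical blocks must be
 * contiguous, the status must match, and, unless the extents are delayed,
 * the physical blocks must be contiguous too.
 */
static int es_sketch_can_merge(const struct es_sketch *left,
			       const struct es_sketch *right)
{
	if (left->es_lblk + left->es_len != right->es_lblk)
		return 0;
	if (left->es_status != right->es_status)
		return 0;
	if (left->es_status != ES_DELAYED &&
	    left->es_pblk + left->es_len != right->es_pblk)
		return 0;
	return 1;
}

/*
 * Drop the first 'cut' blocks of an extent.  The physical start advances
 * by the same amount, which matches orig_pblk + orig_len - remaining_len
 * in the patch because cut == orig_len - remaining_len.
 */
static void es_sketch_trim_front(struct es_sketch *es, lblk_t cut)
{
	assert(cut < es->es_len);
	es->es_lblk += cut;
	es->es_len -= cut;
	es->es_pblk = es->es_status != ES_DELAYED ?
			es->es_pblk + cut : ES_NO_PBLK;
}
```

For instance, trimming three blocks from a written extent [0/4) that starts at
physical block 1000 leaves [3/1) starting at physical block 1003, while a
delayed extent keeps ES_NO_PBLK no matter how it is trimmed.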

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 5878fb3..3e6fa43 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -174,7 +174,8 @@ static void ext4_es_print_tree(struct inode *inode)
 	while (node) {
 		struct extent_status *es;
 		es = rb_entry(node, struct extent_status, rb_node);
-		printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
+		printk(KERN_DEBUG " [%u/%u) %llu %d",
+		       es->es_lblk, es->es_len, es->es_pblk, es->es_status);
 		node = rb_next(node);
 	}
 	printk(KERN_DEBUG "\n");
@@ -248,8 +249,9 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	if (tree->cache_es) {
 		es1 = tree->cache_es;
 		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
-			es_debug("%u cached by [%u/%u)\n",
-				 es->es_lblk, es1->es_lblk, es1->es_len);
+			es_debug("%u cached by [%u/%u) %llu %d\n",
+				 es->es_lblk, es1->es_lblk, es1->es_len,
+				 es1->es_pblk, es1->es_status);
 			goto out;
 		}
 	}
@@ -262,6 +264,8 @@ out:
 		tree->cache_es = es1;
 		es->es_lblk = es1->es_lblk;
 		es->es_len = es1->es_len;
+		es->es_pblk = es1->es_pblk;
+		es->es_status = es1->es_status;
 		node = rb_next(&es1->rb_node);
 		if (node) {
 			es1 = rb_entry(node, struct extent_status, rb_node);
@@ -276,7 +280,8 @@ out:
 }
 
 static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len,
+		     ext4_fsblk_t pblk, int status)
 {
 	struct extent_status *es;
 	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
@@ -284,6 +289,8 @@ ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
 		return NULL;
 	es->es_lblk = lblk;
 	es->es_len = len;
+	es->es_pblk = pblk;
+	es->es_status = status;
 	return es;
 }
 
@@ -296,6 +303,8 @@ static void ext4_es_free_extent(struct extent_status *es)
  * Check whether or not two extents can be merged
  * Condition:
  *  - logical block number is contiguous
+ *  - physical block number is contiguous
+ *  - status is equal
  */
 static int ext4_es_can_be_merged(struct extent_status *es1,
 				 struct extent_status *es2)
@@ -303,6 +312,13 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
 	if (es1->es_lblk + es1->es_len != es2->es_lblk)
 		return 0;
 
+	if (es1->es_status != es2->es_status)
+		return 0;
+
+	if (!ext4_es_is_delayed(es1) &&
+	    (es1->es_pblk + es1->es_len != es2->es_pblk))
+		return 0;
+
 	return 1;
 }
 
@@ -366,6 +382,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
 			if (ext4_es_can_be_merged(newes, es)) {
 				es->es_lblk = newes->es_lblk;
 				es->es_len += newes->es_len;
+				es->es_pblk = !ext4_es_is_delayed(es) ?
+						newes->es_pblk : ~0;
 				es = ext4_es_try_to_merge_left(tree, es);
 				goto out;
 			}
@@ -383,7 +401,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
 		}
 	}
 
-	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
+	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len,
+				  newes->es_pblk, newes->es_status);
 	if (!es)
 		return -ENOMEM;
 	rb_link_node(&es->rb_node, parent, p);
@@ -402,21 +421,23 @@ out:
  * Return 0 on success, error code on failure.
  */
 int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
-			  ext4_lblk_t len)
+			  ext4_lblk_t len, ext4_fsblk_t pblk, int status)
 {
 	struct ext4_es_tree *tree;
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 	int err = 0;
 
-	trace_ext4_es_insert_extent(inode, lblk, len);
-	es_debug("add [%u/%u) to extent status tree of inode %lu\n",
-		 lblk, len, inode->i_ino);
+	es_debug("add [%u/%u) %llu %d to extent status tree of inode %lu\n",
+		 lblk, len, pblk, status, inode->i_ino);
 
 	BUG_ON(end < lblk);
 
 	newes.es_lblk = lblk;
 	newes.es_len = len;
+	newes.es_pblk = pblk;
+	newes.es_status = status;
+	trace_ext4_es_insert_extent(inode, &newes);
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
@@ -453,6 +474,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 
 	orig_es.es_lblk = es->es_lblk;
 	orig_es.es_len = es->es_len;
+	orig_es.es_pblk = es->es_pblk;
+	orig_es.es_status = es->es_status;
+
 	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
 	len2 = extent_status_end(es) > end ?
 	       extent_status_end(es) - end : 0;
@@ -464,6 +488,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 
 			newes.es_lblk = end + 1;
 			newes.es_len = len2;
+			newes.es_pblk = !ext4_es_is_delayed(&orig_es) ?
+				orig_es.es_pblk + orig_es.es_len - len2 : ~0;
+			newes.es_status = orig_es.es_status;
 			err = __es_insert_extent(tree, &newes);
 			if (err) {
 				es->es_lblk = orig_es.es_lblk;
@@ -473,6 +500,8 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 		} else {
 			es->es_lblk = end + 1;
 			es->es_len = len2;
+			es->es_pblk = !ext4_es_is_delayed(es) ?
+				orig_es.es_pblk + orig_es.es_len - len2 : ~0;
 		}
 		goto out;
 	}
@@ -497,9 +526,13 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 	}
 
 	if (es && es->es_lblk < end + 1) {
+		ext4_lblk_t orig_len = es->es_len;
+
 		len1 = extent_status_end(es) - end;
 		es->es_lblk = end + 1;
 		es->es_len = len1;
+		es->es_pblk = !ext4_es_is_delayed(es) ?
+				es->es_pblk + orig_len - len1 : ~0;
 	}
 
 out:
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 85115bb..d1516fd 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -44,10 +44,26 @@ extern void ext4_exit_es(void);
 extern void ext4_es_init_tree(struct ext4_es_tree *tree);
 
 extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
-				 ext4_lblk_t len);
+				 ext4_lblk_t len, ext4_fsblk_t pblk,
+				 int status);
 extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
 extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
 				struct extent_status *es);
 
+static inline int ext4_es_is_written(struct extent_status *es)
+{
+	return (es->es_status == EXTENT_STATUS_WRITTEN);
+}
+
+static inline int ext4_es_is_unwritten(struct extent_status *es)
+{
+	return (es->es_status == EXTENT_STATUS_UNWRITTEN);
+}
+
+static inline int ext4_es_is_delayed(struct extent_status *es)
+{
+	return (es->es_status == EXTENT_STATUS_DELAYED);
+}
+
 #endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 91542be..f7135fe 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1827,7 +1827,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 				goto out_unlock;
 		}
 
-		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len);
+		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					       ~0, EXTENT_STATUS_DELAYED);
 		if (retval)
 			goto out_unlock;
 
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 11374b7..53e117d 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2056,28 +2056,33 @@ TRACE_EVENT(ext4_ext_remove_space_done,
 );
 
 TRACE_EVENT(ext4_es_insert_extent,
-	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
+	TP_PROTO(struct inode *inode, struct extent_status *es),
 
-	TP_ARGS(inode, lblk, len),
+	TP_ARGS(inode, es),
 
 	TP_STRUCT__entry(
-		__field(	dev_t,	dev			)
-		__field(	ino_t,	ino			)
-		__field(	loff_t,	lblk			)
-		__field(	loff_t, len			)
+		__field(	dev_t,		dev		)
+		__field(	ino_t,		ino		)
+		__field(	ext4_lblk_t,	lblk		)
+		__field(	ext4_lblk_t,	len		)
+		__field(	ext4_fsblk_t,	pblk		)
+		__field(	int,		status		)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->lblk	= lblk;
-		__entry->len	= len;
+		__entry->lblk	= es->es_lblk;
+		__entry->len	= es->es_len;
+		__entry->pblk	= es->es_pblk;
+		__entry->status	= es->es_status;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->lblk, __entry->len)
+		  __entry->lblk, __entry->len,
+		  __entry->pblk, __entry->status)
 );
 
 TRACE_EVENT(ext4_es_remove_extent,
@@ -2138,6 +2143,8 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 		__field(	ino_t,		ino		)
 		__field(	ext4_lblk_t,	lblk		)
 		__field(	ext4_lblk_t,	len		)
+		__field(	ext4_fsblk_t,	pblk		)
+		__field(	int,		status		)
 		__field(	ext4_lblk_t,	ret		)
 	),
 
@@ -2146,13 +2153,16 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 		__entry->ino	= inode->i_ino;
 		__entry->lblk	= es->es_lblk;
 		__entry->len	= es->es_len;
+		__entry->pblk	= es->es_pblk;
+		__entry->status	= es->es_status;
 		__entry->ret	= ret;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %d ret %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->lblk, __entry->len, __entry->ret)
+		  __entry->lblk, __entry->len,
+		  __entry->pblk, __entry->status, __entry->ret)
 );
 
 #endif /* _TRACE_EXT4_H */
-- 
1.7.12.rc2.18.g61b472e



* [RFC][PATCH 5/9 v1] ext4: track all extent status in extent status tree
  2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
                   ` (3 preceding siblings ...)
  2012-12-24  7:55 ` [RFC][PATCH 4/9 v1] ext4: adjust interfaces of " Zheng Liu
@ 2012-12-24  7:55 ` Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 6/9 v1] ext4: lookup block mapping " Zheng Liu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

Now that the physical block and status are recorded, the extent status tree can
track the status of every extent.  When we call the _map_blocks functions to
look up an extent or create a new written/unwritten/delayed extent, this extent
is inserted into the extent status tree.

We don't load all extents from disk in alloc_inode() because it costs too much
memory, and, when a file is opened and closed very frequently, it would take too
much time to load all of the extent information.  So currently an extent is
inserted into the extent status tree only when we create it or look it up.
Hence, the status in the extent status tree might be incomplete.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents.c |  5 +++-
 fs/ext4/file.c    |  6 +++--
 fs/ext4/inode.c   | 73 ++++++++++++++++++++++++++++++++++++-------------------
 3 files changed, 56 insertions(+), 28 deletions(-)
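
An illustrative user-space sketch (not part of the patch) of how the status of
an inserted extent is chosen in the two paths described above.  The SK_* flag
names and their bit values are hypothetical and are NOT the kernel's
EXT4_GET_BLOCKS_* or EXT4_MAP_* values; only the order of the checks follows
the ext4_map_blocks() hunks below.

```c
#include <assert.h>	/* assert() is handy for callers that check results */

enum { ES_WRITTEN = 0, ES_UNWRITTEN = 1, ES_DELAYED = 2 };

/* Hypothetical request flags for this sketch (arbitrary bit values). */
enum {
	SK_CREATE	= 1 << 0,	/* allocate blocks */
	SK_UNINIT_EXT	= 1 << 1,	/* allocate as unwritten */
	SK_PRE_IO	= 1 << 2,	/* prepare an unwritten range for I/O */
	SK_CONVERT	= 1 << 3,	/* convert unwritten to written */
};

/* Hypothetical result flag: the mapping found on disk is unwritten. */
#define SK_MAP_UNWRITTEN (1 << 0)

/* Lookup path: the status comes from what the on-disk mapping reports. */
static int sk_status_for_lookup(unsigned int m_flags)
{
	return (m_flags & SK_MAP_UNWRITTEN) ? ES_UNWRITTEN : ES_WRITTEN;
}

/*
 * Allocation path: the status comes from the request flags.  The checks
 * are ordered, so SK_PRE_IO wins over SK_CONVERT, which wins over
 * SK_UNINIT_EXT, which wins over plain SK_CREATE.  The patch calls
 * BUG_ON(1) when no flag matches; this sketch returns -1 instead.
 */
static int sk_status_for_alloc(int flags)
{
	if (flags & SK_PRE_IO)
		return ES_UNWRITTEN;
	if (flags & SK_CONVERT)
		return ES_WRITTEN;
	if (flags & SK_UNINIT_EXT)
		return ES_UNWRITTEN;
	if (flags & SK_CREATE)
		return ES_WRITTEN;
	return -1;
}
```

The ordering matters because the flags can be combined: a request carrying both
SK_CREATE and SK_UNINIT_EXT is recorded as unwritten, not written.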

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0a0f635..d537cfa 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2074,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
 		}
 
 		/* This is possible iff next == next_del == EXT_MAX_BLOCKS */
-		if (next == next_del) {
+		if (next == next_del && next_del == EXT_MAX_BLOCKS) {
 			flags |= FIEMAP_EXTENT_LAST;
 			if (unlikely(next_del != EXT_MAX_BLOCKS ||
 				     next != EXT_MAX_BLOCKS)) {
@@ -4566,6 +4566,9 @@ static int ext4_find_delayed_extent(struct inode *inode,
 			/* A hole found. */
 			return 0;
 
+		if (!ext4_es_is_delayed(&es))
+			return 0;
+
 		if (es.es_lblk > newex->ec_block) {
 			/* A hole found. */
 			newex->ec_len = min(es.es_lblk - newex->ec_block,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8b65a01..dbc581b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -474,7 +474,8 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		 */
 		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+		if (ext4_es_is_delayed(&es) &&
+		    last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			if (last != start)
 				dataoff = last << blkbits;
 			break;
@@ -558,7 +559,8 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 		 */
 		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+		if (ext4_es_is_delayed(&es) &&
+		    last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			last = es.es_lblk + es.es_len;
 			holeoff = last << blkbits;
 			continue;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f7135fe..290c2c2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -533,20 +533,20 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		retval = ext4_ind_map_blocks(handle, inode, map, flags &
 					     EXT4_GET_BLOCKS_KEEP_SIZE);
 	}
+	if (retval > 0) {
+		int ret, status;
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk,
+					    map->m_len, map->m_pblk, status);
+		if (ret < 0)
+			retval = ret;
+	}
 	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
 		up_read((&EXT4_I(inode)->i_data_sem));
 
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
-		int ret;
-		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
-			/* delayed alloc may be allocated by fallocate and
-			 * coverted to initialized by directIO.
-			 * we need to handle delayed extent here.
-			 */
-			down_write((&EXT4_I(inode)->i_data_sem));
-			goto delayed_mapped;
-		}
-		ret = check_block_validity(inode, map);
+		int ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
 	}
@@ -621,18 +621,27 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
 			ext4_da_update_reserve_space(inode, retval, 1);
 	}
-	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
+	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
 		ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
 
-		if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
-			int ret;
-delayed_mapped:
-			/* delayed allocation blocks has been allocated */
-			ret = ext4_es_remove_extent(inode, map->m_lblk,
-						    map->m_len);
-			if (ret < 0)
-				retval = ret;
-		}
+	if (retval > 0) {
+		int ret, status;
+
+		if (flags & EXT4_GET_BLOCKS_PRE_IO)
+			status = EXTENT_STATUS_UNWRITTEN;
+		else if (flags & EXT4_GET_BLOCKS_CONVERT)
+			status = EXTENT_STATUS_WRITTEN;
+		else if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
+			status = EXTENT_STATUS_UNWRITTEN;
+		else if (flags & EXT4_GET_BLOCKS_CREATE)
+			status = EXTENT_STATUS_WRITTEN;
+		else
+			BUG_ON(1);
+
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    map->m_pblk, status);
+		if (ret < 0)
+			retval = ret;
 	}
 
 	up_write((&EXT4_I(inode)->i_data_sem));
@@ -1814,6 +1823,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
 
 	if (retval == 0) {
+		int ret;
 		/*
 		 * XXX: __block_prepare_write() unmaps passed block,
 		 * is it OK?
@@ -1821,16 +1831,20 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		/* If the block was allocated from previously allocated cluster,
 		 * then we dont need to reserve it again. */
 		if (!(map->m_flags & EXT4_MAP_FROM_CLUSTER)) {
-			retval = ext4_da_reserve_space(inode, iblock);
-			if (retval)
+			ret = ext4_da_reserve_space(inode, iblock);
+			if (ret) {
 				/* not enough space to reserve */
+				retval = ret;
 				goto out_unlock;
+			}
 		}
 
-		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-					       ~0, EXTENT_STATUS_DELAYED);
-		if (retval)
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    ~0, EXTENT_STATUS_DELAYED);
+		if (ret) {
+			retval = ret;
 			goto out_unlock;
+		}
 
 		/* Clear EXT4_MAP_FROM_CLUSTER flag since its purpose is served
 		 * and it should not appear on the bh->b_state.
@@ -1840,6 +1854,15 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		map_bh(bh, inode->i_sb, invalid_block);
 		set_buffer_new(bh);
 		set_buffer_delay(bh);
+	} else if (retval > 0) {
+		int ret, status;
+
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    map->m_pblk, status);
+		if (ret != 0)
+			retval = ret;
 	}
 
 out_unlock:
-- 
1.7.12.rc2.18.g61b472e



* [RFC][PATCH 6/9 v1] ext4: lookup block mapping in extent status tree
  2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
                   ` (4 preceding siblings ...)
  2012-12-24  7:55 ` [RFC][PATCH 5/9 v1] ext4: track all extent status in " Zheng Liu
@ 2012-12-24  7:55 ` Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 7/9 v1] ext4: add a new convert function to convert an unwritten extent " Zheng Liu
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

After tracking all extent status, we already have an extent cache in memory.
Every time we want to look up a block mapping, we can first try to look it up in
the extent status tree to avoid a potential disk I/O.

A new function called ext4_es_lookup_extent is defined to do this work.
When we look up a block mapping, we always call ext4_map_blocks and/or
ext4_da_map_blocks.  So in these functions we first try to look up the block
mapping in the extent status tree.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/extents_status.h |  1 +
 fs/ext4/inode.c          | 55 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 113 insertions(+)
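
An illustrative user-space sketch (not part of the patch) of what a cache hit
has to do: translate the requested logical block into a physical block and
clamp the returned length to what the cached extent still covers.  The names
es_sketch, sk_map, SK_MAPPED and SK_UNWRITTEN are hypothetical stand-ins; the
arithmetic mirrors the ext4_map_blocks() hunk below.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t lblk_t;	/* stand-in for ext4_lblk_t (assumption) */
typedef uint64_t fsblk_t;	/* stand-in for ext4_fsblk_t (assumption) */

enum { ES_WRITTEN = 0, ES_UNWRITTEN = 1, ES_DELAYED = 2 };

/* Hypothetical result flags (arbitrary bit values, not the kernel's). */
#define SK_MAPPED	(1u << 0)
#define SK_UNWRITTEN	(1u << 1)

struct es_sketch {
	lblk_t es_lblk;		/* first logical block */
	lblk_t es_len;		/* length in blocks */
	fsblk_t es_pblk;	/* first physical block */
	int es_status;		/* written, unwritten or delayed */
};

struct sk_map {
	lblk_t m_lblk;		/* in: first logical block requested */
	unsigned int m_len;	/* in: blocks requested, out: blocks mapped */
	fsblk_t m_pblk;		/* out: first physical block */
	unsigned int m_flags;	/* out: SK_MAPPED or SK_UNWRITTEN */
};

/*
 * Fill 'map' from a cached extent that covers map->m_lblk.  Returns the
 * number of blocks mapped, or 0 for a delayed extent, which has no
 * physical blocks to report.
 */
static int sk_map_from_es(const struct es_sketch *es, struct sk_map *map)
{
	lblk_t off;
	unsigned int avail;

	assert(map->m_lblk >= es->es_lblk);
	assert(map->m_lblk - es->es_lblk < es->es_len);

	if (es->es_status == ES_DELAYED)
		return 0;

	off = map->m_lblk - es->es_lblk;	/* offset into the extent */
	avail = es->es_len - off;		/* blocks left after offset */

	map->m_pblk = es->es_pblk + off;
	map->m_flags |= es->es_status == ES_WRITTEN ?
			SK_MAPPED : SK_UNWRITTEN;
	if (avail < map->m_len)
		map->m_len = avail;
	return (int)map->m_len;
}
```

With a written extent [100/8) that starts at physical block 5000, a request for
16 blocks from logical block 103 is answered with 5 blocks starting at physical
block 5003; the caller has to issue another lookup for anything beyond that.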

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 3e6fa43..ccd940c 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -454,6 +454,63 @@ error:
 	return err;
 }
 
+/*
+ * ext4_es_lookup_extent() looks up an extent in extent status tree.
+ *
+ * ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
+ *
+ * Return: 1 if found, 0 if not
+ */
+int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es)
+{
+	struct ext4_es_tree *tree;
+	struct extent_status *es1;
+	struct rb_node *node;
+	int found = 0;
+
+	es_debug("lookup extent in block %u\n", es->es_lblk);
+
+	tree = &EXT4_I(inode)->i_es_tree;
+	read_lock(&EXT4_I(inode)->i_es_lock);
+
+	/* check the cached extent first */
+	if (tree->cache_es) {
+		es1 = tree->cache_es;
+		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
+			es_debug("%u cached by [%u/%u)\n",
+				 es->es_lblk, es1->es_lblk, es1->es_len);
+			found = 1;
+			goto out;
+		}
+	}
+
+	es->es_len = 0;
+	node = tree->root.rb_node;
+	while (node) {
+		es1 = rb_entry(node, struct extent_status, rb_node);
+		if (es->es_lblk < es1->es_lblk)
+			node = node->rb_left;
+		else if (es->es_lblk > extent_status_end(es1))
+			node = node->rb_right;
+		else {
+			found = 1;
+			break;
+		}
+	}
+
+out:
+	if (found) {
+		es->es_lblk = es1->es_lblk;
+		es->es_len = es1->es_len;
+		es->es_pblk = es1->es_pblk;
+		es->es_status = es1->es_status;
+	}
+
+	read_unlock(&EXT4_I(inode)->i_es_lock);
+
+	return found;
+}
+
 static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 				 ext4_lblk_t end)
 {
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index d1516fd..1890f80 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -50,6 +50,7 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
 extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
 				struct extent_status *es);
+extern int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es);
 
 static inline int ext4_es_is_written(struct extent_status *es)
 {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 290c2c2..6610dc7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -514,12 +514,39 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
 int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		    struct ext4_map_blocks *map, int flags)
 {
+	struct extent_status es;
 	int retval;
 
 	map->m_flags = 0;
 	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
 		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
 		  (unsigned long) map->m_lblk);
+
+	/* Lookup extent status tree firstly */
+	es.es_lblk = map->m_lblk;
+	if (ext4_es_lookup_extent(inode, &es)) {
+		if (ext4_es_is_written(&es)) {
+			map->m_pblk = es.es_pblk + map->m_lblk - es.es_lblk;
+			map->m_flags |= EXT4_MAP_MAPPED;
+			retval = es.es_len - (map->m_lblk - es.es_lblk);
+			if (retval > map->m_len)
+				retval = map->m_len;
+			map->m_len = retval;
+		} else if (ext4_es_is_unwritten(&es)) {
+			map->m_pblk = es.es_pblk + map->m_lblk - es.es_lblk;
+			map->m_flags |= EXT4_MAP_UNWRITTEN;
+			retval = es.es_len - (map->m_lblk - es.es_lblk);
+			if (retval > map->m_len)
+				retval = map->m_len;
+			map->m_len = retval;
+		} else if (ext4_es_is_delayed(&es)) {
+			retval = 0;
+		} else {
+			BUG_ON(1);
+		}
+		goto found;
+	}
+
 	/*
 	 * Try to see if we can get the block without requesting a new
 	 * file system block.
@@ -545,6 +572,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
 		up_read((&EXT4_I(inode)->i_data_sem));
 
+found:
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
 		int ret = check_block_validity(inode, map);
 		if (ret != 0)
@@ -1790,6 +1818,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 			      struct ext4_map_blocks *map,
 			      struct buffer_head *bh)
 {
+	struct extent_status es;
 	int retval;
 	sector_t invalid_block = ~((sector_t) 0xffff);
 
@@ -1800,6 +1829,32 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	ext_debug("ext4_da_map_blocks(): inode %lu, max_blocks %u,"
 		  "logical block %lu\n", inode->i_ino, map->m_len,
 		  (unsigned long) map->m_lblk);
+
+	/* Lookup extent status tree firstly */
+	es.es_lblk = iblock;
+	if (ext4_es_lookup_extent(inode, &es)) {
+		map->m_pblk = es.es_pblk + iblock - es.es_lblk;
+		retval = es.es_len - (iblock - es.es_lblk);
+		if (retval > map->m_len)
+			retval = map->m_len;
+		map->m_len = retval;
+		if (ext4_es_is_written(&es)) {
+			map->m_flags |= EXT4_MAP_MAPPED;
+		} else if (ext4_es_is_unwritten(&es)) {
+			map->m_flags |= EXT4_MAP_UNWRITTEN;
+		} else if (ext4_es_is_delayed(&es)) {
+			map_bh(bh, inode->i_sb, invalid_block);
+			set_buffer_new(bh);
+			set_buffer_delay(bh);
+
+			return 0;
+		} else {
+			BUG_ON(1);
+		}
+
+		return retval;
+	}
+
 	/*
 	 * Try to see if we can get the block without requesting a new
 	 * file system block.
-- 
1.7.12.rc2.18.g61b472e



* [RFC][PATCH 7/9 v1] ext4: add a new convert function to convert an unwritten extent in extent status tree
  2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
                   ` (5 preceding siblings ...)
  2012-12-24  7:55 ` [RFC][PATCH 6/9 v1] ext4: lookup block mapping " Zheng Liu
@ 2012-12-24  7:55 ` Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion Zheng Liu
  2012-12-24  7:55 ` [RFC][PATCH 9/9 v1] ext4: set dioread_nolock by default for extent-based files Zheng Liu
  8 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

A new function, ext4_es_convert_unwritten_extents(), is defined to convert
a range of unwritten extents to written in the extent status tree.

This function aims to improve the unwritten extent conversion in DIO end_io.
All users of i_es_lock are also changed to the irqsave/irqrestore variants
because DIO end_io runs in irq context.
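
The range handling behind this conversion can be sketched in userspace C.
This is only an illustration under stated assumptions, not the code in this
patch: struct ext_model and split_for_conversion() are hypothetical names,
block ranges are treated as inclusive, and the range being converted is
assumed to overlap the extent.

```c
#include <stdint.h>

/* Hypothetical userspace model of one extent; not struct extent_status. */
struct ext_model {
	uint32_t lblk;		/* first logical block */
	uint32_t len;		/* number of blocks */
	uint64_t pblk;		/* first physical block */
	int written;		/* 0 = unwritten, 1 = written */
};

/*
 * Split extent e around the inclusive range [lblk, end] being converted.
 * Up to three pieces are stored in out[] in logical order: an untouched
 * head, the converted middle, and an untouched tail.  Returns the number
 * of pieces.  Assumes [lblk, end] overlaps e.
 */
int split_for_conversion(const struct ext_model *e, uint32_t lblk,
			 uint32_t end, struct ext_model out[3])
{
	uint32_t e_end = e->lblk + e->len - 1;
	/* blocks in front of the converted range */
	uint32_t len1 = lblk > e->lblk ? lblk - e->lblk : 0;
	/* blocks behind the converted range */
	uint32_t len2 = e_end > end ? e_end - end : 0;
	int n = 0;

	if (len1 > 0)
		out[n++] = (struct ext_model){ e->lblk, len1, e->pblk,
					       e->written };

	/* the middle piece is the only one whose status changes */
	out[n++] = (struct ext_model){ e->lblk + len1, e->len - len1 - len2,
				       e->pblk + len1, 1 };

	if (len2 > 0)
		out[n++] = (struct ext_model){ end + 1, len2,
					       e->pblk + e->len - len2,
					       e->written };
	return n;
}
```

The physical block of the tail is derived from the original extent
(pblk + len - len2), so every piece keeps the same logical-to-physical
offset as before the split.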

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.c | 161 ++++++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/extents_status.h |   2 +
 2 files changed, 155 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index ccd940c..9db9e05 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -239,10 +239,11 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	struct extent_status *es1 = NULL;
 	struct rb_node *node;
 	ext4_lblk_t ret = EXT_MAX_BLOCKS;
+	unsigned long flags;
 
 	trace_ext4_es_find_extent_enter(inode, es->es_lblk);
 
-	read_lock(&EXT4_I(inode)->i_es_lock);
+	read_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
 	tree = &EXT4_I(inode)->i_es_tree;
 
 	/* find delay extent in cache firstly */
@@ -273,7 +274,7 @@ out:
 		}
 	}
 
-	read_unlock(&EXT4_I(inode)->i_es_lock);
+	read_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
 
 	trace_ext4_es_find_extent_exit(inode, es, ret);
 	return ret;
@@ -426,6 +427,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct ext4_es_tree *tree;
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
+	unsigned long flags;
 	int err = 0;
 
 	es_debug("add [%u/%u) %llu %d to extent status tree of inode %lu\n",
@@ -439,7 +441,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	newes.es_status = status;
 	trace_ext4_es_insert_extent(inode, &newes);
 
-	write_lock(&EXT4_I(inode)->i_es_lock);
+	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
 	tree = &EXT4_I(inode)->i_es_tree;
 	err = __es_remove_extent(tree, lblk, end);
 	if (err != 0)
@@ -447,7 +449,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	err = __es_insert_extent(tree, &newes);
 
 error:
-	write_unlock(&EXT4_I(inode)->i_es_lock);
+	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
 
 	ext4_es_print_tree(inode);
 
@@ -466,12 +468,13 @@ int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es)
 	struct ext4_es_tree *tree;
 	struct extent_status *es1;
 	struct rb_node *node;
+	unsigned long flags;
 	int found = 0;
 
 	es_debug("lookup extent in block %u\n", es->es_lblk);
 
 	tree = &EXT4_I(inode)->i_es_tree;
-	read_lock(&EXT4_I(inode)->i_es_lock);
+	read_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
 
 	/* find delay extent in cache firstly */
 	if (tree->cache_es) {
@@ -506,7 +509,7 @@ out:
 		es->es_status = es1->es_status;
 	}
 
-	read_unlock(&EXT4_I(inode)->i_es_lock);
+	read_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
 
 	return found;
 }
@@ -605,6 +608,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 {
 	struct ext4_es_tree *tree;
 	ext4_lblk_t end;
+	unsigned long flags;
 	int err = 0;
 
 	trace_ext4_es_remove_extent(inode, lblk, len);
@@ -616,9 +620,150 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 
 	tree = &EXT4_I(inode)->i_es_tree;
 
-	write_lock(&EXT4_I(inode)->i_es_lock);
+	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
 	err = __es_remove_extent(tree, lblk, end);
-	write_unlock(&EXT4_I(inode)->i_es_lock);
+	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
+	ext4_es_print_tree(inode);
+	return err;
+}
+
+int ext4_es_convert_unwritten_extents(struct inode *inode, loff_t offset,
+				      size_t size)
+{
+	struct ext4_es_tree *tree;
+	struct rb_node *node;
+	struct extent_status *es, orig_es, conv_es;
+	ext4_lblk_t end, len1, len2;
+	ext4_lblk_t lblk = 0, len = 0;
+	unsigned long flags;
+	unsigned int blkbits;
+	int err = 0;
+
+	/* add trace point and debug */
+	blkbits = inode->i_blkbits;
+	lblk = offset >> blkbits;
+	len = (EXT4_BLOCK_ALIGN(offset + size, blkbits) >> blkbits) - lblk;
+
+	end = lblk + len - 1;
+	BUG_ON(end < lblk);
+
+	tree = &EXT4_I(inode)->i_es_tree;
+
+	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
+
+	es = __es_tree_search(&tree->root, lblk);
+	if (!es)
+		goto out;
+	if (es->es_lblk > end)
+		goto out;
+
+	tree->cache_es = NULL;
+
+	orig_es.es_lblk = es->es_lblk;
+	orig_es.es_len = es->es_len;
+	orig_es.es_pblk = es->es_pblk;
+	orig_es.es_status = es->es_status;
+
+	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
+	len2 = extent_status_end(es) > end ?
+	       extent_status_end(es) - end : 0;
+	if (len1 > 0)
+		es->es_len = len1;
+	if (len2 > 0) {
+		if (len1 > 0) {
+			struct extent_status newes;
+
+			newes.es_lblk = end + 1;
+			newes.es_len = len2;
+			newes.es_pblk = orig_es.es_pblk + orig_es.es_len - len2;
+			newes.es_status = orig_es.es_status;
+			/*BUG_ON(newes.es_status != EXTENT_STATUS_UNWRITTEN);*/
+			err = __es_insert_extent(tree, &newes);
+			if (err) {
+				es->es_lblk = orig_es.es_lblk;
+				es->es_len = orig_es.es_len;
+				goto out;
+			}
+
+			conv_es.es_lblk = orig_es.es_lblk + len1;
+			conv_es.es_len = orig_es.es_len - len1 - len2;
+			conv_es.es_pblk = orig_es.es_pblk + len1;
+			conv_es.es_status = EXTENT_STATUS_WRITTEN;
+			err = __es_insert_extent(tree, &conv_es);
+			if (err) {
+				int err2;
+				err2 = __es_remove_extent(tree, newes.es_lblk,
+						extent_status_end(&newes));
+				if (err2)
+					goto out;
+				es->es_lblk = orig_es.es_lblk;
+				es->es_len = orig_es.es_len;
+				goto out;
+			}
+		} else {
+			es->es_lblk = end + 1;
+			es->es_len = len2;
+			es->es_pblk = orig_es.es_pblk + orig_es.es_len - len2;
+			/*BUG_ON(newes.es_status != EXTENT_STATUS_UNWRITTEN);*/
+
+			conv_es.es_lblk = orig_es.es_lblk;
+			conv_es.es_len = orig_es.es_len - len2;
+			conv_es.es_pblk = orig_es.es_pblk;
+			conv_es.es_status = EXTENT_STATUS_WRITTEN;
+			err = __es_insert_extent(tree, &conv_es);
+			if (err) {
+				es->es_lblk = orig_es.es_lblk;
+				es->es_len = orig_es.es_len;
+				es->es_pblk = orig_es.es_pblk;
+			}
+		}
+
+		goto out;
+	}
+
+	if (len1 > 0) {
+		node = rb_next(&es->rb_node);
+		if (node)
+			es = rb_entry(node, struct extent_status, rb_node);
+		else
+			es = NULL;
+	}
+
+	while (es && extent_status_end(es) <= end) {
+		node = rb_next(&es->rb_node);
+		es->es_status = EXTENT_STATUS_WRITTEN;
+		if (!node) {
+			es = NULL;
+			break;
+		}
+		es = rb_entry(node, struct extent_status, rb_node);
+	}
+
+	if (es && es->es_lblk < end + 1) {
+		ext4_lblk_t orig_len = es->es_len;
+
+		/*
+		 * Set up conv_es first so that the original values of es do
+		 * not need to be copied to a temporary variable.
+		 */
+		len1 = extent_status_end(es) - end;
+		conv_es.es_lblk = es->es_lblk;
+		conv_es.es_len = es->es_len - len1;
+		conv_es.es_pblk = es->es_pblk;
+		conv_es.es_status = EXTENT_STATUS_WRITTEN;
+
+		es->es_lblk = end + 1;
+		es->es_len = len1;
+		es->es_pblk = es->es_pblk + orig_len - len1;
+
+		err = __es_insert_extent(tree, &conv_es);
+		if (err)
+			goto out;
+	}
+
+out:
+	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
+
 	ext4_es_print_tree(inode);
 	return err;
 }
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 1890f80..9069ecf 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -51,6 +51,8 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
 				struct extent_status *es);
 extern int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es);
+extern int ext4_es_convert_unwritten_extents(struct inode *inode,
+					     loff_t offset, size_t size);
 
 static inline int ext4_es_is_written(struct extent_status *es)
 {
-- 
1.7.12.rc2.18.g61b472e



* [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion
  2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
                   ` (6 preceding siblings ...)
  2012-12-24  7:55 ` [RFC][PATCH 7/9 v1] ext4: add a new convert function to convert an unwritten extent " Zheng Liu
@ 2012-12-24  7:55 ` Zheng Liu
  2012-12-31 16:36   ` Jan Kara
  2012-12-31 21:58   ` Jan Kara
  2012-12-24  7:55 ` [RFC][PATCH 9/9 v1] ext4: set dioread_nolock by default for extent-based files Zheng Liu
  8 siblings, 2 replies; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: Jan Kara, Darrick J. Wong, Christoph Hellwig, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

Currently all unwritten extent conversion work is pushed to a workqueue
because DIO end_io runs in irq context and the conversion needs to take
i_data_sem, and a semaphore cannot be taken in irq context.  Now that the
extent status tree tracks all extent status, we can first convert the unwritten
extent in the extent status tree and then call aio_complete() and
inode_dio_done() to notify the upper level that the dio is done.  We don't need
to worry about exposing stale data because block mapping looks up the extent
status tree first, so it always sees the latest extent status.  Meanwhile we
queue a work item to convert the unwritten extent in the extent tree.  After
this change, a reader no longer needs to wait for this conversion to finish
when dioread_nolock is enabled.
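
The ordering described above can be modelled in a few lines of userspace C.
This is a rough sketch under simplifying assumptions, not kernel code: it
tracks a single extent, and struct inode_model and the model_*() functions are
hypothetical names invented for the illustration.

```c
#include <stdbool.h>

enum blk_state { BLK_UNWRITTEN = 0, BLK_WRITTEN = 1 };

/* Hypothetical single-extent model of one inode. */
struct inode_model {
	bool cached;			/* extent present in the status tree? */
	enum blk_state es_state;	/* in-memory extent status tree view */
	enum blk_state disk_state;	/* extent tree view */
	int pending_conversions;	/* conversions queued but not yet run */
};

/*
 * Completion path (irq context in the kernel): only the in-memory state is
 * updated, the extent tree conversion is deferred.  The caller may report
 * the I/O as complete as soon as this returns.
 */
void model_end_io(struct inode_model *ino)
{
	ino->es_state = BLK_WRITTEN;
	ino->cached = true;
	ino->pending_conversions++;
}

/* Deferred work (process context): converts the extent tree. */
void model_conversion_worker(struct inode_model *ino)
{
	if (ino->pending_conversions > 0) {
		ino->disk_state = BLK_WRITTEN;
		ino->pending_conversions--;
	}
}

/* Block lookup: the status tree is consulted before the extent tree. */
enum blk_state model_lookup(const struct inode_model *ino)
{
	return ino->cached ? ino->es_state : ino->disk_state;
}
```

Between model_end_io() and model_conversion_worker() the two views disagree,
and a reader only sees the written state because the lookup prefers the status
tree.  If the cached entry were dropped in that window, the lookup would fall
back to the stale extent tree state.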

CC: Jan Kara <jack@suse.cz>
CC: "Darrick J. Wong" <darrick.wong@oracle.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
[Cc'ed Jan, Darrick, and Christoph because Christoph is trying to handle
O_(D)SYNC for AIO]

Hi Jan, Darrick, and Christoph,

This patch refines the unwritten extent conversion in ext4.  Now we can call
aio_complete() and inode_dio_done() in the end_io function.  I believe
Christoph's patch will also work well once this patch is applied.  Could you
please review this patch?

Regards,
					- Zheng

 fs/ext4/ext4.h     |  2 +-
 fs/ext4/indirect.c | 11 ++++-------
 fs/ext4/inode.c    | 11 ++++-------
 fs/ext4/page-io.c  | 26 ++++++++++++++++++++++----
 4 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8462eb3..b76dc49 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -195,7 +195,6 @@ struct mpage_da_data {
 #define	EXT4_IO_END_UNWRITTEN	0x0001
 #define EXT4_IO_END_ERROR	0x0002
 #define EXT4_IO_END_QUEUED	0x0004
-#define EXT4_IO_END_DIRECT	0x0008
 
 struct ext4_io_page {
 	struct page	*p_page;
@@ -2538,6 +2537,7 @@ extern void ext4_ioend_wait(struct inode *);
 extern void ext4_free_io_end(ext4_io_end_t *io);
 extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
 extern void ext4_io_submit(struct ext4_io_submit *io);
+extern void ext4_end_io_bh(ext4_io_end_t *io_end, int is_dio);
 extern int ext4_bio_write_page(struct ext4_io_submit *io,
 			       struct page *page,
 			       int len,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 20862f9..c6d7f7f 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -807,11 +807,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode)) {
-		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
-			mutex_lock(&inode->i_mutex);
-			ext4_flush_unwritten_io(inode);
-			mutex_unlock(&inode->i_mutex);
-		}
 		/*
 		 * Nolock dioread optimization may be dynamically disabled
 		 * via ext4_inode_block_unlocked_dio(). Check inode's state
@@ -831,8 +826,10 @@ retry:
 		inode_dio_done(inode);
 	} else {
 locked:
-		ret = blockdev_direct_IO(rw, iocb, inode, iov,
-				 offset, nr_segs, ext4_get_block);
+		ret = __blockdev_direct_IO(rw, iocb, inode,
+				 inode->i_sb->s_bdev, iov,
+				 offset, nr_segs,
+				 ext4_get_block, NULL, NULL, DIO_LOCKING);
 
 		if (unlikely((rw & WRITE) && ret < 0)) {
 			loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6610dc7..4549103 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3024,9 +3024,8 @@ static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
 			       EXT4_GET_BLOCKS_NO_LOCK);
 }
 
-static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
-			    ssize_t size, void *private, int ret,
-			    bool is_async)
+static void ext4_end_dio(struct kiocb *iocb, loff_t offset, ssize_t size,
+			 void *private, int ret, bool is_async)
 {
 	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
         ext4_io_end_t *io_end = iocb->private;
@@ -3058,8 +3057,7 @@ out:
 		io_end->iocb = iocb;
 		io_end->result = ret;
 	}
-
-	ext4_add_complete_io(io_end);
+	ext4_end_io_bh(io_end, 1);
 }
 
 static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
@@ -3195,7 +3193,6 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 			ret = -ENOMEM;
 			goto retake_lock;
 		}
-		io_end->flag |= EXT4_IO_END_DIRECT;
 		iocb->private = io_end;
 		/*
 		 * we save the io structure for current async direct
@@ -3216,7 +3213,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 				   inode->i_sb->s_bdev, iov,
 				   offset, nr_segs,
 				   get_block_func,
-				   ext4_end_io_dio,
+				   ext4_end_dio,
 				   NULL,
 				   dio_flags);
 
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 0016fbc..6b2d88d 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -103,17 +103,35 @@ static int ext4_end_io(ext4_io_end_t *io)
 			 "(inode %lu, offset %llu, size %zd, error %d)",
 			 inode->i_ino, offset, size, ret);
 	}
-	if (io->iocb)
-		aio_complete(io->iocb, io->result, 0);
 
-	if (io->flag & EXT4_IO_END_DIRECT)
-		inode_dio_done(inode);
 	/* Wake up anyone waiting on unwritten extent conversion */
 	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
 		wake_up_all(ext4_ioend_wq(inode));
+
 	return ret;
 }
 
+void ext4_end_io_bh(ext4_io_end_t *io_end, int is_dio)
+{
+	struct inode *inode;
+
+	inode = io_end->inode;
+	(void)ext4_es_convert_unwritten_extents(inode, io_end->offset,
+						io_end->size);
+	ext4_add_complete_io(io_end);
+
+	/*
+	 * Here we can safely notify upper level that aio has done because
+	 * unwritten extent in extent status tree has been converted.  Thus,
+	 * others won't get a stale data because we always lookup extent status
+	 * tree firstly in get_block_t.
+	 */
+	if (io_end->iocb)
+		aio_complete(io_end->iocb, io_end->result, 0);
+	if (is_dio)
+		inode_dio_done(inode);
+}
+
 static void dump_completed_IO(struct inode *inode)
 {
 #ifdef	EXT4FS_DEBUG
-- 
1.7.12.rc2.18.g61b472e



* [RFC][PATCH 9/9 v1] ext4: set dioread_nolock by default for extent-based files
  2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
                   ` (7 preceding siblings ...)
  2012-12-24  7:55 ` [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion Zheng Liu
@ 2012-12-24  7:55 ` Zheng Liu
  8 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2012-12-24  7:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

Now that the dio improvement is in place, we can set dioread_nolock by
default for extent-based files when block size is not smaller than page size.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 Documentation/filesystems/ext4.txt | 5 ++---
 fs/ext4/super.c                    | 8 ++++++++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 34ea4f1..b377a1e 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -368,9 +368,8 @@ dioread_nolock		locking. If the dioread_nolock option is specified
 			speed storages. However this does not work with
 			data journaling and dioread_nolock option will be
 			ignored with kernel warning. Note that dioread_nolock
-			code path is only used for extent-based files.
-			Because of the restrictions this options comprises
-			it is off by default (e.g. dioread_lock).
+			code path is only used for extent-based files.  It is
+			on by default for extent-based files.
 
 max_dir_size_kb=n	This limits the size of directories so that any
 			attempt to expand them beyond the specified
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3cdb0a2..e6def09 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3402,6 +3402,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		set_opt(sb, DELALLOC);
 
 	/*
+	 * enable dioread_nolock by default
+	 * Use -o dioread_lock to turn it off
+	 */
+	if (!IS_EXT3_SB(sb) && !IS_EXT2_SB(sb) &&
+	    EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS))
+		set_opt(sb, DIOREAD_NOLOCK);
+
+	/*
 	 * set default s_li_wait_mult for lazyinit, for the case there is
 	 * no mount option specified.
 	 */
-- 
1.7.12.rc2.18.g61b472e



* Re: [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion
  2012-12-24  7:55 ` [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion Zheng Liu
@ 2012-12-31 16:36   ` Jan Kara
  2012-12-31 17:04     ` Jan Kara
  2012-12-31 21:58   ` Jan Kara
  1 sibling, 1 reply; 22+ messages in thread
From: Jan Kara @ 2012-12-31 16:36 UTC (permalink / raw)
  To: Zheng Liu
  Cc: linux-ext4, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Zheng Liu

On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Currently all unwritten extent conversion work is pushed into a workqueue to be
> done because DIO end_io is in a irq context and this conversion needs to take
> i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> tracking all extent status, we can first convert this unwritten extent in extent
> status tree, and call aio_complete() and inode_dio_done() to notify upper level
> that this dio has done.  We don't need to be worried about exposing a stale data
> because we first try to lookup in extent status tree.  So it makes us to see the
> latest extent status.  Meanwhile we queue a work to convert this unwritten
> extent in extent tree.  After this improvement, reader also needn't wait this
> conversion to be done when dioread_nolock is enabled.
> 
> CC: Jan Kara <jack@suse.cz>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> CC: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> ---
> [Cc' to Jan, Darrick, and Christoph because Christoph is trying to handle
> O_(D)SYNC for AIO]
> 
> Hi Jan, Darrick, and Christoph,
> 
> This patch refines the unwritten extent conversion in ext4.  Now we can call
> aio_complete() and inode_dio_done() in end_io function.  I believe Christoph's
> patch also can work well after applied this patch.  Could you please review
> this patch?
  Umm, I don't understand one thing (please bear with me, I've not followed
the extent status tree work in detail): After you report IO completion, you
must make sure a subsequent read returns the data you wrote. Thus you would
also need to track the physical location of all extents that are written but
not yet converted in the extent status tree. I'm not sure which patches are in
flight but it's definitely not happening right now and it seems to me it
would complicate the extent status tree (effectively making a full extent
cache out of it, making it considerably heavier etc.). If the extent tree is
really going that way, then what you propose is probably a good idea. I'm
still somewhat uneasy about completing the IO before it's really on disk
(we still need flushing of conversions in various places) but doing the
conversion before completing the IO has its own (locking) issues, especially
for the writeback path. So the solution using the extent status tree is fine.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion
  2012-12-31 16:36   ` Jan Kara
@ 2012-12-31 17:04     ` Jan Kara
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2012-12-31 17:04 UTC (permalink / raw)
  To: Zheng Liu
  Cc: linux-ext4, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Zheng Liu

On Mon 31-12-12 17:36:21, Jan Kara wrote:
> On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > done because DIO end_io is in a irq context and this conversion needs to take
> > i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> > tracking all extent status, we can first convert this unwritten extent in extent
> > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > that this dio has done.  We don't need to be worried about exposing a stale data
> > because we first try to lookup in extent status tree.  So it makes us to see the
> > latest extent status.  Meanwhile we queue a work to convert this unwritten
> > extent in extent tree.  After this improvement, reader also needn't wait this
> > conversion to be done when dioread_nolock is enabled.
> > 
> > CC: Jan Kara <jack@suse.cz>
> > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > CC: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > ---
> > [Cc' to Jan, Darrick, and Christoph because Christoph is trying to handle
> > O_(D)SYNC for AIO]
> > 
> > Hi Jan, Darrick, and Christoph,
> > 
> > This patch refines the unwritten extent conversion in ext4.  Now we can call
> > aio_complete() and inode_dio_done() in end_io function.  I believe Christoph's
> > patch also can work well after applied this patch.  Could you please review
> > this patch?
>   Umm, I don't understand one thing (please bear with me, I've not followed
> extent status tree work in detail): After you report IO completion, you
> must make sure subsequent read returns data you wrote. Thus you would need
> to track also physical location of all extents that are written but not yet
> converted in the extent status tree. I'm not sure which patches are in
> flight but it's definitely not happening right now and it seems to me it
> would complicate the extent status tree (effectively making a full extent
> cache out of it, making it considerably heavier etc.). If extent tree is
> really going that way, then what you propose is probably a good idea. I'm
> still somewhat uneasy about completing the IO before it's really on disk
> (we still need flushing of conversions in various places) but doing the
> conversion before completing the IO has its own (locking) issues especially
> for writeback path. So the solution using extent status tree is fine.
  Ah, the patch is part of a series which changes the extent tree :) OK,
I'm looking into the patches...

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH 3/9 v1] ext4: add physical block and status member into extent status tree
  2012-12-24  7:55 ` [RFC][PATCH 3/9 v1] ext4: add physical block and status member into " Zheng Liu
@ 2012-12-31 21:49   ` Jan Kara
  2013-01-01  5:16     ` Zheng Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Kara @ 2012-12-31 21:49 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu

On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> es_pblk is used to record physical block that maps to the disk.  es_status is
> used to record the status of the extent.  Three status are defined, which are
> written, unwritten and delayed.
  So this means one extent is 48 bytes on 64-bit architectures. If I'm a
nasty user and create an artificially fragmented file (by allocating every
second block), the extent status tree takes 6 MB per GB of file. That's quite
a bit and I think you need to provide a way for the kernel to reclaim extent
structures...
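
The arithmetic behind that figure can be checked with a small C helper.
This is a sketch only: es_tree_bytes() is a hypothetical name, and the 4 KiB
block size is an assumption (the message does not state one).  The 48-byte
node size is taken from the message above.

```c
#include <stdint.h>

/*
 * Estimated memory used by extent status nodes for a file in which every
 * second block is allocated, so that each allocated block forms its own
 * extent.  node_size is the size of one tree node in bytes.
 */
uint64_t es_tree_bytes(uint64_t file_bytes, uint32_t block_size,
		       uint32_t node_size)
{
	uint64_t blocks = file_bytes / block_size;	/* blocks in the file */
	uint64_t extents = blocks / 2;			/* every second block */

	return extents * node_size;
}
```

With a 1 GiB file, 4 KiB blocks and 48-byte nodes this gives 262144 blocks,
131072 extents and 6291456 bytes, i.e. 6 MiB.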

								Honza
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> ---
>  fs/ext4/extents_status.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 81e9339..85115bb 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -20,10 +20,18 @@
>  #define es_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
>  #endif
>  
> +enum {
> +	EXTENT_STATUS_WRITTEN = 0,	/* written extent */
> +	EXTENT_STATUS_UNWRITTEN = 1,	/* unwritten extent */
> +	EXTENT_STATUS_DELAYED = 2,	/* delayed extent */
> +};
> +
>  struct extent_status {
>  	struct rb_node rb_node;
>  	ext4_lblk_t es_lblk;	/* first logical block extent covers */
>  	ext4_lblk_t es_len;	/* length of extent in block */
> +	ext4_fsblk_t es_pblk;	/* first physical block */
> +	int es_status;		/* record the status of extent */
>  };
>  
>  struct ext4_es_tree {
> -- 
> 1.7.12.rc2.18.g61b472e
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion
  2012-12-24  7:55 ` [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion Zheng Liu
  2012-12-31 16:36   ` Jan Kara
@ 2012-12-31 21:58   ` Jan Kara
  2013-01-01  5:24     ` Zheng Liu
  1 sibling, 1 reply; 22+ messages in thread
From: Jan Kara @ 2012-12-31 21:58 UTC (permalink / raw)
  To: Zheng Liu
  Cc: linux-ext4, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Zheng Liu

On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Currently all unwritten extent conversion work is pushed into a workqueue to be
> done because DIO end_io is in a irq context and this conversion needs to take
> i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> tracking all extent status, we can first convert this unwritten extent in extent
> status tree, and call aio_complete() and inode_dio_done() to notify upper level
> that this dio has done.  We don't need to be worried about exposing a stale data
> because we first try to lookup in extent status tree.  So it makes us to see the
> latest extent status.  Meanwhile we queue a work to convert this unwritten
> extent in extent tree.  After this improvement, reader also needn't wait this
> conversion to be done when dioread_nolock is enabled.
> 
> CC: Jan Kara <jack@suse.cz>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> CC: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
  OK, so after reading the other patches this should work fine. I just think
we should somehow mark in the extent status tree that the extent tree is
inconsistent with what's on disk - something like an extent dirty bit. It
would be set for UNWRITTEN extents where conversion is pending; logically it
would also make sense to have it set for DELAYED extents. Then if we need to
reclaim some extents due to memory pressure, we know we have to keep dirty
extents because those cache irreplaceable information. What do you think?
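
One possible shape of such a check, as a userspace C sketch.  Nothing here
exists in the series: struct es_model, the dirty field and
es_model_can_reclaim() are hypothetical names, and treating DELAYED extents as
always dirty is just the suggestion above written down.

```c
#include <stdbool.h>
#include <stddef.h>

/* Status values mirror the enum added in patch 3/9. */
enum es_status_model {
	ES_WRITTEN = 0,
	ES_UNWRITTEN = 1,
	ES_DELAYED = 2,
};

/* Hypothetical model of one cached extent. */
struct es_model {
	enum es_status_model status;
	bool dirty;	/* conversion pending, disk not yet up to date */
};

/* An extent may be dropped only if it can be rebuilt from disk. */
bool es_model_can_reclaim(const struct es_model *es)
{
	if (es->status == ES_DELAYED)
		return false;	/* delayed extents exist only in memory */
	return !es->dirty;
}

/* Count how many of the n cached extents a shrinker could free. */
size_t es_model_count_reclaimable(const struct es_model *es, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (es_model_can_reclaim(&es[i]))
			count++;
	return count;
}
```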

								Honza
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8462eb3..b76dc49 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -195,7 +195,6 @@ struct mpage_da_data {
>  #define	EXT4_IO_END_UNWRITTEN	0x0001
>  #define EXT4_IO_END_ERROR	0x0002
>  #define EXT4_IO_END_QUEUED	0x0004
> -#define EXT4_IO_END_DIRECT	0x0008
>  
>  struct ext4_io_page {
>  	struct page	*p_page;
> @@ -2538,6 +2537,7 @@ extern void ext4_ioend_wait(struct inode *);
>  extern void ext4_free_io_end(ext4_io_end_t *io);
>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>  extern void ext4_io_submit(struct ext4_io_submit *io);
> +extern void ext4_end_io_bh(ext4_io_end_t *io_end, int is_dio);
>  extern int ext4_bio_write_page(struct ext4_io_submit *io,
>  			       struct page *page,
>  			       int len,
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 20862f9..c6d7f7f 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -807,11 +807,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  
>  retry:
>  	if (rw == READ && ext4_should_dioread_nolock(inode)) {
> -		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
> -			mutex_lock(&inode->i_mutex);
> -			ext4_flush_unwritten_io(inode);
> -			mutex_unlock(&inode->i_mutex);
> -		}
>  		/*
>  		 * Nolock dioread optimization may be dynamically disabled
>  		 * via ext4_inode_block_unlocked_dio(). Check inode's state
> @@ -831,8 +826,10 @@ retry:
>  		inode_dio_done(inode);
>  	} else {
>  locked:
> -		ret = blockdev_direct_IO(rw, iocb, inode, iov,
> -				 offset, nr_segs, ext4_get_block);
> +		ret = __blockdev_direct_IO(rw, iocb, inode,
> +				 inode->i_sb->s_bdev, iov,
> +				 offset, nr_segs,
> +				 ext4_get_block, NULL, NULL, DIO_LOCKING);
>  
>  		if (unlikely((rw & WRITE) && ret < 0)) {
>  			loff_t isize = i_size_read(inode);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6610dc7..4549103 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3024,9 +3024,8 @@ static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
>  			       EXT4_GET_BLOCKS_NO_LOCK);
>  }
>  
> -static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> -			    ssize_t size, void *private, int ret,
> -			    bool is_async)
> +static void ext4_end_dio(struct kiocb *iocb, loff_t offset, ssize_t size,
> +			 void *private, int ret, bool is_async)
>  {
>  	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
>          ext4_io_end_t *io_end = iocb->private;
> @@ -3058,8 +3057,7 @@ out:
>  		io_end->iocb = iocb;
>  		io_end->result = ret;
>  	}
> -
> -	ext4_add_complete_io(io_end);
> +	ext4_end_io_bh(io_end, 1);
>  }
>  
>  static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
> @@ -3195,7 +3193,6 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  			ret = -ENOMEM;
>  			goto retake_lock;
>  		}
> -		io_end->flag |= EXT4_IO_END_DIRECT;
>  		iocb->private = io_end;
>  		/*
>  		 * we save the io structure for current async direct
> @@ -3216,7 +3213,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  				   inode->i_sb->s_bdev, iov,
>  				   offset, nr_segs,
>  				   get_block_func,
> -				   ext4_end_io_dio,
> +				   ext4_end_dio,
>  				   NULL,
>  				   dio_flags);
>  
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 0016fbc..6b2d88d 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -103,17 +103,35 @@ static int ext4_end_io(ext4_io_end_t *io)
>  			 "(inode %lu, offset %llu, size %zd, error %d)",
>  			 inode->i_ino, offset, size, ret);
>  	}
> -	if (io->iocb)
> -		aio_complete(io->iocb, io->result, 0);
>  
> -	if (io->flag & EXT4_IO_END_DIRECT)
> -		inode_dio_done(inode);
>  	/* Wake up anyone waiting on unwritten extent conversion */
>  	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
>  		wake_up_all(ext4_ioend_wq(inode));
> +
>  	return ret;
>  }
>  
> +void ext4_end_io_bh(ext4_io_end_t *io_end, int is_dio)
> +{
> +	struct inode *inode;
> +
> +	inode = io_end->inode;
> +	(void)ext4_es_convert_unwritten_extents(inode, io_end->offset,
> +						io_end->size);
> +	ext4_add_complete_io(io_end);
> +
> +	/*
> +	 * Here we can safely notify upper level that aio has done because
> +	 * unwritten extent in extent status tree has been converted.  Thus,
> +	 * others won't get a stale data because we always lookup extent status
> +	 * tree firstly in get_block_t.
> +	 */
> +	if (io_end->iocb)
> +		aio_complete(io_end->iocb, io_end->result, 0);
> +	if (is_dio)
> +		inode_dio_done(inode);
> +}
> +
>  static void dump_completed_IO(struct inode *inode)
>  {
>  #ifdef	EXT4FS_DEBUG
> -- 
> 1.7.12.rc2.18.g61b472e
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [RFC][PATCH 3/9 v1] ext4: add physical block and status member into extent status tree
  2012-12-31 21:49   ` Jan Kara
@ 2013-01-01  5:16     ` Zheng Liu
  2013-01-02 11:22       ` Jan Kara
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-01-01  5:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu

On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > es_pblk is used to record physical block that maps to the disk.  es_status is
> > used to record the status of the extent.  Three status are defined, which are
> > written, unwritten and delayed.
>   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> nasty user and create artificially fragmented file (by allocating every
> second block), extent tree takes 6 MB per GB of file. That's quite a bit
> and I think you need to provide a way for kernel to reclaim extent
> structures...

Indeed, when a file is heavily fragmented, the status tree will occupy
a lot of memory.  That is why it is loaded on demand.  When I wrote
it, there were two ways to load the status tree.  One is loading
on demand, and the other is loading the complete extent tree in
ext4_alloc_inode().  In the end I chose the former because it reduces
memory pressure most of the time.  But it has the disadvantage that the
status tree cannot be fully trusted, because it does not track the
complete status of the extent tree on disk.
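
As a rough check of the "6 MB per GB" figure quoted above, here is a small
sketch of the arithmetic.  It assumes a 4 KiB block size (not stated in the
thread) and takes the 48-byte entry size from Jan's mail; the function name
is made up for illustration.

```c
#include <stdint.h>

/*
 * Memory used by extent status entries for a file in which every second
 * block is allocated, so that no two allocated blocks can be merged into
 * one extent.  All parameters are in bytes.
 */
static uint64_t es_bytes_for_fragmented_file(uint64_t file_bytes,
					     uint64_t block_size,
					     uint64_t entry_size)
{
	uint64_t blocks = file_bytes / block_size;	/* blocks spanned by the file */
	uint64_t extents = blocks / 2;			/* every second block is allocated */

	return extents * entry_size;
}
```

With a 1 GiB file, 4 KiB blocks and 48-byte entries this gives 131072
extents, i.e. 6 MiB of extent status entries.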

I will provide a way to reclaim extent structures from the status tree.  My
current idea is that we can reclaim all extents in WRITTEN/UNWRITTEN
status, because we always need the DELAYED extents in the
fiemap, seek_data/hole and bigalloc code.  Furthermore, as you said in
another mail, an unwritten extent that is waiting to be converted to
written must not be reclaimed either.

Another question is when these extents should be reclaimed.  Currently the
whole status tree is freed when clear_inode() is called.  Maybe
a switch in sysfs is an option.  Any thoughts?

Thanks,
                                                - Zheng

* Re: [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion
  2012-12-31 21:58   ` Jan Kara
@ 2013-01-01  5:24     ` Zheng Liu
  2013-01-03 10:56       ` Jan Kara
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-01-01  5:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Darrick J. Wong, Christoph Hellwig, Zheng Liu

On Mon, Dec 31, 2012 at 10:58:15PM +0100, Jan Kara wrote:
> On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > done because DIO end_io is in a irq context and this conversion needs to take
> > i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> > tracking all extent status, we can first convert this unwritten extent in extent
> > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > that this dio has done.  We don't need to be worried about exposing a stale data
> > because we first try to lookup in extent status tree.  So it makes us to see the
> > latest extent status.  Meanwhile we queue a work to convert this unwritten
> > extent in extent tree.  After this improvement, reader also needn't wait this
> > conversion to be done when dioread_nolock is enabled.
> > 
> > CC: Jan Kara <jack@suse.cz>
> > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > CC: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
>   OK, so after reading other patches this should work fine. Just I think we
> should somehow mark in the extent status tree that the extent tree is
> inconsistent with what's on disk - something like extent dirty bit. It will
> be set for UNWRITTEN extents where conversion is pending logically it would
> also make sence to have it set for DELAYED extents. Then if we need to
> reclaim some extents due to memory pressure we know we have to keep dirty
> extents because those cache irreplacible information. What do you think?

A dirty bit is a good idea for UNWRITTEN extents because we can then freely
reclaim all WRITTEN extents and all UNWRITTEN extents that do not have
the dirty bit set.  But we cannot reclaim DELAYED extents, no matter
whether they are dirty or not, because they are used to look up delayed
extents in fiemap, seek_data/hole, and bigalloc.  So at least the DELAYED
extents must be kept in the status tree.  That is why in step 1 the status tree
only tracks DELAYED extents.
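
The reclaim rule described above could be sketched roughly as the predicate
below.  This is only an illustration of the policy under discussion, not
code from the patch series: the enum, the struct and the function name are
all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical status values mirroring the three states in the thread. */
enum es_status { ES_WRITTEN, ES_UNWRITTEN, ES_DELAYED };

/* Hypothetical cached entry: a status plus the proposed dirty bit. */
struct es_entry {
	enum es_status status;
	bool dirty;	/* set while an unwritten->written conversion is pending */
};

/* Return true if the entry may be dropped under memory pressure. */
static bool es_can_reclaim(const struct es_entry *es)
{
	switch (es->status) {
	case ES_DELAYED:
		/* Needed by fiemap, seek_data/hole and bigalloc: never reclaim. */
		return false;
	case ES_UNWRITTEN:
		/* Reclaimable only when no conversion is pending. */
		return !es->dirty;
	case ES_WRITTEN:
		/* Can always be re-read from the on-disk extent tree. */
		return true;
	}
	return false;
}
```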

Thanks,
                                                - Zheng

* Re: [RFC][PATCH 3/9 v1] ext4: add physical block and status member into extent status tree
  2013-01-01  5:16     ` Zheng Liu
@ 2013-01-02 11:22       ` Jan Kara
  2013-01-05  2:44         ` Zheng Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Kara @ 2013-01-02 11:22 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, linux-ext4, Zheng Liu

On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > es_pblk is used to record physical block that maps to the disk.  es_status is
> > > used to record the status of the extent.  Three status are defined, which are
> > > written, unwritten and delayed.
> >   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > nasty user and create artificially fragmented file (by allocating every
> > second block), extent tree takes 6 MB per GB of file. That's quite a bit
> > and I think you need to provide a way for kernel to reclaim extent
> > structures...
> 
> Indeed, when a file has a lot of fragmentations, status tree will occupy
> a number of memory.  That is why it will be loaded on-demand.  When I make
> it, there are two solutions to load status tree.  One is loading
> on-demand, and another is loading complete extent tree in
> ext4_alloc_inode().  Finally I choose the former because it can reduce
> the pressure of memory at most of time.  But it has a disadvantage that
> status tree doesn't be fully trusted because it hasn't track a
> completely status of extent tree on disk.
  Not reading the whole extent tree in ext4_alloc_inode() is a good start
but it's not the whole solution IMHO. It saves us from unnecessary reading
of extents but still if someone reads the whole filesystem (like
grep -R "foo" /) you will still end up with all extents cached. And that
will make ext4 inodes pretty heavy in memory. Surely inode reclaim will
eventually release these inodes including cached extents but it is usually
more beneficial to cache the inode itself than more extents so allowing us
to strip cached extents without releasing inode itself would be good.

> I will provide a way to reclaim extent structures from status tree.  Now
> I have an idea in my mind that we can reclaim all extent which are
> WRITTEN/UNWRITTEN status because we always need DELAYED extent in
> fiemap, seek_data/hole and bigalloc code.  Furthermore, as you said in
> another mail, some unwritten extent which will be converted into
> written also doesn't be reclaimed.
> 
> Another question is when do these extents reclaim?  Currently when
> clear_inode() is called, the whole status tree will be reclaimed.  Maybe
> a switch in sysfs is a optional choice.  Any thoughts?
  The natural way to handle the shrinking is using 'shrinker' framework. In
this case, we could register a shrinker for shrinking extents. Just having
LRU of extents would increase the size of extent structure by 2 pointers
which is too big I'd think and I'm not yet sure how to choose extents for
reclaim in some other way. I will think about it...
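
To illustrate the size concern, the sketch below models an extent status
entry in userspace and compares it with a variant that carries LRU
linkage.  The field layout is an assumption chosen to be consistent with
the 48-byte figure mentioned earlier in the thread; it is not copied from
the patch, and all struct names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's rb_node and list_head. */
struct rb_node_model {
	uintptr_t parent_color;
	struct rb_node_model *right, *left;
};

struct list_head_model {
	struct list_head_model *next, *prev;
};

/* Assumed layout: tree linkage, logical start, length, physical block, status. */
struct es_model {
	struct rb_node_model node;
	uint32_t lblk;
	uint32_t len;
	uint64_t pblk;
	int status;
};

/* The same entry with per-extent LRU linkage added. */
struct es_model_lru {
	struct es_model es;
	struct list_head_model lru;
};

/* Extra bytes per extent that the LRU linkage costs in this model. */
static size_t lru_overhead_bytes(void)
{
	return sizeof(struct es_model_lru) - sizeof(struct es_model);
}
```

On a typical 64-bit (LP64) build this model grows from 48 to 64 bytes per
extent, which is the two-pointer increase referred to above.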

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion
  2013-01-01  5:24     ` Zheng Liu
@ 2013-01-03 10:56       ` Jan Kara
  2013-01-04  4:26         ` Zheng Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Kara @ 2013-01-03 10:56 UTC (permalink / raw)
  To: Zheng Liu
  Cc: Jan Kara, linux-ext4, Darrick J. Wong, Christoph Hellwig,
	Zheng Liu

On Tue 01-01-13 13:24:45, Zheng Liu wrote:
> On Mon, Dec 31, 2012 at 10:58:15PM +0100, Jan Kara wrote:
> > On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > > done because DIO end_io is in a irq context and this conversion needs to take
> > > i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> > > tracking all extent status, we can first convert this unwritten extent in extent
> > > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > > that this dio has done.  We don't need to be worried about exposing a stale data
> > > because we first try to lookup in extent status tree.  So it makes us to see the
> > > latest extent status.  Meanwhile we queue a work to convert this unwritten
> > > extent in extent tree.  After this improvement, reader also needn't wait this
> > > conversion to be done when dioread_nolock is enabled.
> > > 
> > > CC: Jan Kara <jack@suse.cz>
> > > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > CC: Christoph Hellwig <hch@infradead.org>
> > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> >   OK, so after reading other patches this should work fine. Just I think we
> > should somehow mark in the extent status tree that the extent tree is
> > inconsistent with what's on disk - something like extent dirty bit. It will
> > be set for UNWRITTEN extents where conversion is pending logically it would
> > also make sence to have it set for DELAYED extents. Then if we need to
> > reclaim some extents due to memory pressure we know we have to keep dirty
> > extents because those cache irreplacible information. What do you think?
> 
> Dirty bit is a good idea for UNWRITTEN extent because we can feel free
> to reclaim all WRITTEN extents and all UNWRITTEN extents that are
> without dirty bit.  But we can not reclaim DEALYED extents no matter
> whether they are dirty or not because they are used to lookup an delayed
> extent in fiemap, seek_data/hole, and bigalloc.  So at least DEALYED
> extent must be kept in status tree.  That is why in step 1 status tree
> only tracks all DELAYED extents in the tree.
  So I was thinking about this some more and also testing some code and I
realized that using extent status tree won't be enough (sadly). In case of
AIO DIO with O_SYNC set, we have to perform extent conversion on *disk*
before we can complete the AIO. So extent status tree won't help us in any
way.

Furthermore O_SYNC writes end up calling ->fsync() after IO is finished
which currently waits for all unwritten extents to convert and that
effectively deadlocks the conversion thread if there are more DIO
conversions pending. To fix this we would have to hack around
ext4_file_fsync() to avoid waiting in case of O_SYNC AIO writes and that
gets nasty quickly. So I'm currently back to my original plan of completing
IO only after extent conversion happens... I'll see how that works out.

Eventually we could *optimize* that by doing the extent conversion only in
the extent status tree if possible but I wouldn't bother with it right now.
For one thing, I also realized we would probably have to somehow throttle
writers so that there are not too many outstanding conversions (when we
complete AIO only after the conversion is finished, writer is naturally
limited by the amount of AIOs allowed).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion
  2013-01-03 10:56       ` Jan Kara
@ 2013-01-04  4:26         ` Zheng Liu
  0 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2013-01-04  4:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Darrick J. Wong, Christoph Hellwig, Zheng Liu

On Thu, Jan 03, 2013 at 11:56:13AM +0100, Jan Kara wrote:
> On Tue 01-01-13 13:24:45, Zheng Liu wrote:
> > On Mon, Dec 31, 2012 at 10:58:15PM +0100, Jan Kara wrote:
> > > On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > 
> > > > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > > > done because DIO end_io is in a irq context and this conversion needs to take
> > > > i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> > > > tracking all extent status, we can first convert this unwritten extent in extent
> > > > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > > > that this dio has done.  We don't need to be worried about exposing a stale data
> > > > because we first try to lookup in extent status tree.  So it makes us to see the
> > > > latest extent status.  Meanwhile we queue a work to convert this unwritten
> > > > extent in extent tree.  After this improvement, reader also needn't wait this
> > > > conversion to be done when dioread_nolock is enabled.
> > > > 
> > > > CC: Jan Kara <jack@suse.cz>
> > > > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > > CC: Christoph Hellwig <hch@infradead.org>
> > > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > >   OK, so after reading other patches this should work fine. Just I think we
> > > should somehow mark in the extent status tree that the extent tree is
> > > inconsistent with what's on disk - something like extent dirty bit. It will
> > > be set for UNWRITTEN extents where conversion is pending logically it would
> > > also make sence to have it set for DELAYED extents. Then if we need to
> > > reclaim some extents due to memory pressure we know we have to keep dirty
> > > extents because those cache irreplacible information. What do you think?
> > 
> > Dirty bit is a good idea for UNWRITTEN extent because we can feel free
> > to reclaim all WRITTEN extents and all UNWRITTEN extents that are
> > without dirty bit.  But we can not reclaim DEALYED extents no matter
> > whether they are dirty or not because they are used to lookup an delayed
> > extent in fiemap, seek_data/hole, and bigalloc.  So at least DEALYED
> > extent must be kept in status tree.  That is why in step 1 status tree
> > only tracks all DELAYED extents in the tree.
>   So I was thinking about this some more and also testing some code and I
> realized that using extent status tree won't be enough (sadly). In case of
> AIO DIO with O_SYNC set, we have to perform extent conversion on *disk*
> before we can complete the AIO. So extent status tree won't help us in any
> way.

Ah, I see.  The conversion must be done on disk before aio_complete() is
called for AIO DIO with O_SYNC set.  So for now we must call aio_complete()
after ext4_convert_unwritten_extents() in ext4_end_io().
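
The ordering constraint can be written down as a small userspace model.
This is only a sketch of the discussion, not kernel code: the step names
and plan_end_io() are hypothetical, and the non-O_SYNC branch shows the
ordering that patch 8 proposed, which Jan suggests not relying on for now.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the steps discussed in the thread. */
enum io_step {
	CONVERT_IN_STATUS_TREE,	/* in-memory extent status tree only */
	CONVERT_ON_DISK,	/* on-disk extent tree */
	COMPLETE_AIO,		/* aio_complete() / inode_dio_done() */
};

/*
 * Fill out[] with the order of steps for one AIO DIO write to an
 * unwritten extent and return the number of steps.
 */
static size_t plan_end_io(bool o_sync, enum io_step out[3])
{
	size_t n = 0;

	if (o_sync) {
		/* O_SYNC: the conversion must be on disk before completion. */
		out[n++] = CONVERT_ON_DISK;
		out[n++] = COMPLETE_AIO;
	} else {
		/* Patch 8 proposal: complete early, convert on disk later. */
		out[n++] = CONVERT_IN_STATUS_TREE;
		out[n++] = COMPLETE_AIO;
		out[n++] = CONVERT_ON_DISK;
	}
	return n;
}
```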

> 
> Furthermore O_SYNC writes end up calling ->fsync() after IO is finished
> which currently waits for all unwritten extents to convert and that
> effectively deadlocks the conversion thread if there are more DIO
> conversions pending. To fix this we would have to hack around
> ext4_file_fsync() to avoid waiting in case of O_SYNC AIO writes and that
> gets nasty quickly. So I'm currently back to my original plan of completing
> IO only after extent conversion happens... I'll see how that works out.
> 
> Eventually we could *optimize* that by doing the extent conversion only in
> the extent status tree if possible but I wouldn't bother with it right now.
> For one thing, I also realized we would probably have to somehow throttle
> writers so that there are not too many outstanding conversions (when we
> complete AIO only after the conversion is finished, writer is naturally
> limited by the amount of AIOs allowed).

Sorry, I don't have any ideas right now.  I will think about it. :-(

Thanks,
                                                - Zheng

* Re: [RFC][PATCH 3/9 v1] ext4: add physical block and status member into extent status tree
  2013-01-02 11:22       ` Jan Kara
@ 2013-01-05  2:44         ` Zheng Liu
  2013-01-08  1:27           ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-01-05  2:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu

On Wed, Jan 02, 2013 at 12:22:55PM +0100, Jan Kara wrote:
> On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> > On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > 
> > > > es_pblk is used to record physical block that maps to the disk.  es_status is
> > > > used to record the status of the extent.  Three status are defined, which are
> > > > written, unwritten and delayed.
> > >   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > > nasty user and create artificially fragmented file (by allocating every
> > > second block), extent tree takes 6 MB per GB of file. That's quite a bit
> > > and I think you need to provide a way for kernel to reclaim extent
> > > structures...
> > 
> > Indeed, when a file has a lot of fragmentations, status tree will occupy
> > a number of memory.  That is why it will be loaded on-demand.  When I make
> > it, there are two solutions to load status tree.  One is loading
> > on-demand, and another is loading complete extent tree in
> > ext4_alloc_inode().  Finally I choose the former because it can reduce
> > the pressure of memory at most of time.  But it has a disadvantage that
> > status tree doesn't be fully trusted because it hasn't track a
> > completely status of extent tree on disk.
>   Not reading the whole extent tree in ext4_alloc_inode() is a good start
> but it's not the whole solution IMHO. It saves us from unnecessary reading
> of extents but still if someone reads the whole filesystem (like
> grep -R "foo" /) you will still end up with all extents cached. And that
> will make ext4 inodes pretty heavy in memory. Surely inode reclaim will
> eventually release these inodes including cached extents but it is usually
> more beneficial to cache the inode itself than more extents so allowing us
> to strip cached extents without releasing inode itself would be good.
> 
> > I will provide a way to reclaim extent structures from status tree.  Now
> > I have an idea in my mind that we can reclaim all extent which are
> > WRITTEN/UNWRITTEN status because we always need DELAYED extent in
> > fiemap, seek_data/hole and bigalloc code.  Furthermore, as you said in
> > another mail, some unwritten extent which will be converted into
> > written also doesn't be reclaimed.
> > 
> > Another question is when do these extents reclaim?  Currently when
> > clear_inode() is called, the whole status tree will be reclaimed.  Maybe
> > a switch in sysfs is a optional choice.  Any thoughts?
>   The natural way to handle the shrinking is using 'shrinker' framework. In
> this case, we could register a shrinker for shrinking extents. Just having
> LRU of extents would increase the size of extent structure by 2 pointers
> which is too big I'd think and I'm not yet sure how to choose extents for
> reclaim in some other way. I will think about it...

Hi Jan,

Sorry for the delay.  The 'shrinker' framework is an option.  We can define
a callback function to reclaim extents from the status tree.  When we access
an extent in an inode, we move this inode to the tail of an LRU list.
But this approach has a drawback: the spinlock that protects the LRU list
would be heavily contended, because all inodes need to take this lock.  I
guess this overhead is unacceptable for us.  Any comments?
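
The "move the inode to the tail of the LRU list" step could look roughly
like the userspace sketch below.  The types and function names are
hypothetical, and the locking is left out: the comment marks where the
global spinlock discussed above would have to be held.

```c
#include <stddef.h>

/* Minimal circular doubly-linked list with a sentinel head. */
struct lru_node {
	struct lru_node *prev, *next;
};

struct lru_list {
	struct lru_node head;	/* head.next is the coldest, head.prev the hottest */
};

static void lru_init(struct lru_list *l)
{
	l->head.prev = l->head.next = &l->head;
}

/* A node that is not on any list points to itself. */
static void lru_node_init(struct lru_node *n)
{
	n->prev = n->next = n;
}

/*
 * Mark an inode as recently used by moving it to the tail of the list.
 * In the kernel this whole function would run under the LRU spinlock,
 * which is the contention point raised in the thread.
 */
static void lru_touch(struct lru_list *l, struct lru_node *n)
{
	/* Unlink from the current position (a no-op for a self-linked node). */
	n->prev->next = n->next;
	n->next->prev = n->prev;

	/* Re-insert just before the sentinel, i.e. at the tail. */
	n->prev = l->head.prev;
	n->next = &l->head;
	l->head.prev->next = n;
	l->head.prev = n;
}
```

A shrinker callback would then walk from head.next, the least recently
used end, when asked to free memory.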

Thanks,
                                                - Zheng

* Re: [RFC][PATCH 3/9 v1] ext4: add physical block and status member into extent status tree
  2013-01-05  2:44         ` Zheng Liu
@ 2013-01-08  1:27           ` Dave Chinner
  2013-01-08  2:25             ` Zheng Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2013-01-08  1:27 UTC (permalink / raw)
  To: Jan Kara, linux-ext4, Zheng Liu

On Sat, Jan 05, 2013 at 10:44:01AM +0800, Zheng Liu wrote:
> On Wed, Jan 02, 2013 at 12:22:55PM +0100, Jan Kara wrote:
> > On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> > > On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > > > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > 
> > > > > es_pblk is used to record physical block that maps to the disk.  es_status is
> > > > > used to record the status of the extent.  Three status are defined, which are
> > > > > written, unwritten and delayed.
> > > >   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > > > nasty user and create artificially fragmented file (by allocating every
> > > > second block), extent tree takes 6 MB per GB of file. That's quite a bit
> > > > and I think you need to provide a way for kernel to reclaim extent
> > > > structures...
> > > 
> > > Indeed, when a file has a lot of fragmentations, status tree will occupy
> > > a number of memory.  That is why it will be loaded on-demand.  When I make
> > > it, there are two solutions to load status tree.  One is loading
> > > on-demand, and another is loading complete extent tree in
> > > ext4_alloc_inode().  Finally I choose the former because it can reduce
> > > the pressure of memory at most of time.  But it has a disadvantage that
> > > status tree doesn't be fully trusted because it hasn't track a
> > > completely status of extent tree on disk.
> >   Not reading the whole extent tree in ext4_alloc_inode() is a good start
> > but it's not the whole solution IMHO. It saves us from unnecessary reading
> > of extents but still if someone reads the whole filesystem (like
> > grep -R "foo" /) you will still end up with all extents cached. And that
> > will make ext4 inodes pretty heavy in memory. Surely inode reclaim will
> > eventually release these inodes including cached extents but it is usually
> > more beneficial to cache the inode itself than more extents so allowing us
> > to strip cached extents without releasing inode itself would be good.
> > 
> > > I will provide a way to reclaim extent structures from status tree.  Now
> > > I have an idea in my mind that we can reclaim all extent which are
> > > WRITTEN/UNWRITTEN status because we always need DELAYED extent in
> > > fiemap, seek_data/hole and bigalloc code.  Furthermore, as you said in
> > > another mail, some unwritten extent which will be converted into
> > > written also doesn't be reclaimed.
> > > 
> > > Another question is when do these extents reclaim?  Currently when
> > > clear_inode() is called, the whole status tree will be reclaimed.  Maybe
> > > a switch in sysfs is a optional choice.  Any thoughts?
> >   The natural way to handle the shrinking is using 'shrinker' framework. In
> > this case, we could register a shrinker for shrinking extents. Just having
> > LRU of extents would increase the size of extent structure by 2 pointers
> > which is too big I'd think and I'm not yet sure how to choose extents for
> > reclaim in some other way. I will think about it...
> 
> Hi Jan,
> 
> Sorry for the delay.  'shrinker' framework is an option.  We can define
> a callback function to reclaim extents from status tree.  When we access
> an extent in an inode, we will move this inode into the tail of LRU list.
> But this way has a defect that the spinlock which protects the LRU list
> has a heavy contention because all inodes need to take this lock.  I
> guess this overhead is unacceptable for us.  Any comments?

Measure it first. There are several filesystem global locks still
in existence at the VFS level. Solve the simple problem first, and
then the hard problem might get solved for you by someone else. e.g.:

http://oss.sgi.com/archives/xfs/2012-11/msg00643.html

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC][PATCH 3/9 v1] ext4: add physical block and status member into extent status tree
  2013-01-08  1:27           ` Dave Chinner
@ 2013-01-08  2:25             ` Zheng Liu
  0 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2013-01-08  2:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-ext4, Zheng Liu

On Tue, Jan 08, 2013 at 12:27:54PM +1100, Dave Chinner wrote:
> On Sat, Jan 05, 2013 at 10:44:01AM +0800, Zheng Liu wrote:
> > On Wed, Jan 02, 2013 at 12:22:55PM +0100, Jan Kara wrote:
> > > On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> > > > On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > > > > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > > 
> > > > > > es_pblk is used to record physical block that maps to the disk.  es_status is
> > > > > > used to record the status of the extent.  Three status are defined, which are
> > > > > > written, unwritten and delayed.
> > > > >   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > > > > nasty user and create artificially fragmented file (by allocating every
> > > > > second block), extent tree takes 6 MB per GB of file. That's quite a bit
> > > > > and I think you need to provide a way for kernel to reclaim extent
> > > > > structures...
> > > > 
> > > > Indeed, when a file has a lot of fragmentations, status tree will occupy
> > > > a number of memory.  That is why it will be loaded on-demand.  When I make
> > > > it, there are two solutions to load status tree.  One is loading
> > > > on-demand, and another is loading complete extent tree in
> > > > ext4_alloc_inode().  Finally I choose the former because it can reduce
> > > > the pressure of memory at most of time.  But it has a disadvantage that
> > > > status tree doesn't be fully trusted because it hasn't track a
> > > > completely status of extent tree on disk.
> > >   Not reading the whole extent tree in ext4_alloc_inode() is a good start
> > > but it's not the whole solution IMHO. It saves us from unnecessary reading
> > > of extents but still if someone reads the whole filesystem (like
> > > grep -R "foo" /) you will still end up with all extents cached. And that
> > > will make ext4 inodes pretty heavy in memory. Surely inode reclaim will
> > > eventually release these inodes including cached extents but it is usually
> > > more beneficial to cache the inode itself than more extents so allowing us
> > > to strip cached extents without releasing inode itself would be good.
> > > 
> > > > I will provide a way to reclaim extent structures from status tree.  Now
> > > > I have an idea in my mind that we can reclaim all extent which are
> > > > WRITTEN/UNWRITTEN status because we always need DELAYED extent in
> > > > fiemap, seek_data/hole and bigalloc code.  Furthermore, as you said in
> > > > another mail, some unwritten extent which will be converted into
> > > > written also doesn't be reclaimed.
> > > > 
> > > > Another question is when do these extents reclaim?  Currently when
> > > > clear_inode() is called, the whole status tree will be reclaimed.  Maybe
> > > > a switch in sysfs is a optional choice.  Any thoughts?
> > >   The natural way to handle the shrinking is using 'shrinker' framework. In
> > > this case, we could register a shrinker for shrinking extents. Just having
> > > LRU of extents would increase the size of extent structure by 2 pointers
> > > which is too big I'd think and I'm not yet sure how to choose extents for
> > > reclaim in some other way. I will think about it...
> > 
> > Hi Jan,
> > 
> > > Sorry for the delay.  The 'shrinker' framework is an option.  We can
> > > define a callback function to reclaim extents from the status tree.
> > > When we access an extent in an inode, we move this inode to the tail
> > > of an LRU list.  The drawback is that the spinlock protecting the LRU
> > > list will be heavily contended, because every inode needs to take this
> > > lock.  I guess this overhead is unacceptable for us.  Any comments?
> 
> Measure it first. There are several filesystem global locks still
> in existence at the VFS level. Solve the simple problem first, and
> then the hard problem might get solved for you by someone else, e.g.:
> 
> http://oss.sgi.com/archives/xfs/2012-11/msg00643.html

Thanks for teaching me. :-)  I will measure its overhead first.
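
To make the scheme under discussion concrete, here is a rough userspace
model of it, not kernel code: every extent access moves the inode to the
tail of one global LRU list, and a shrinker-style callback strips
WRITTEN/UNWRITTEN extents from the coldest inodes while keeping DELAYED
ones.  All names (model_inode, inode_touch, shrink_extents, and so on) are
hypothetical, and the spinlock is replaced by a counter so that the number
of acquisitions can simply be read back.

```c
/* Userspace model only: none of these names are ext4 or kernel APIs. */
#include <assert.h>
#include <stddef.h>

enum es_status { ES_WRITTEN, ES_UNWRITTEN, ES_DELAYED };

#define MAX_EXTENTS 8

struct model_inode {
	struct model_inode *prev, *next;  /* links in the global LRU list */
	enum es_status extents[MAX_EXTENTS];
	int nr_extents;
};

/* Sentinel node: lru.next is the coldest inode, lru.prev the hottest. */
static struct model_inode lru = { &lru, &lru, {0}, 0 };

/* Stand-in for the global spinlock: counts how often it would be taken. */
static unsigned long lru_lock_taken;

static void lru_unlink(struct model_inode *ino)
{
	ino->prev->next = ino->next;
	ino->next->prev = ino->prev;
	ino->prev = ino->next = ino;
}

static void inode_init(struct model_inode *ino)
{
	ino->prev = ino->next = ino;
	ino->nr_extents = 0;
}

static void add_extent(struct model_inode *ino, enum es_status st)
{
	assert(ino->nr_extents < MAX_EXTENTS);
	ino->extents[ino->nr_extents++] = st;
}

/* Called on every extent access: move the inode to the LRU tail. */
static void inode_touch(struct model_inode *ino)
{
	lru_lock_taken++;  /* every access goes through the one lock */
	lru_unlink(ino);
	ino->prev = lru.prev;
	ino->next = &lru;
	lru.prev->next = ino;
	lru.prev = ino;
}

/* Drop WRITTEN/UNWRITTEN extents from one inode; DELAYED ones stay. */
static int inode_strip(struct model_inode *ino)
{
	int kept = 0, freed;

	for (int i = 0; i < ino->nr_extents; i++)
		if (ino->extents[i] == ES_DELAYED)
			ino->extents[kept++] = ES_DELAYED;
	freed = ino->nr_extents - kept;
	ino->nr_extents = kept;
	return freed;
}

/* Shrinker-style callback: scan up to nr_to_scan inodes from the cold end. */
static int shrink_extents(int nr_to_scan)
{
	int freed = 0;

	lru_lock_taken++;
	while (nr_to_scan-- > 0 && lru.next != &lru) {
		struct model_inode *ino = lru.next;

		lru_unlink(ino);  /* re-added on the next inode_touch() */
		freed += inode_strip(ino);
	}
	return freed;
}
```

In this model lru_lock_taken grows by one per access, which is the cost
in question; a real measurement would of course need the actual spinlock
and a multi-threaded workload.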

Regards,
                                                - Zheng

end of thread, other threads:[~2013-01-08  2:12 UTC | newest]

Thread overview: 22+ messages
2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 1/9 v1] ext4: fixup metadata reserve block warning when bigalloc and delalloc are enabled Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 2/9 v1] ext4: refine extent status tree Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 3/9 v1] ext4: add physical block and status member into " Zheng Liu
2012-12-31 21:49   ` Jan Kara
2013-01-01  5:16     ` Zheng Liu
2013-01-02 11:22       ` Jan Kara
2013-01-05  2:44         ` Zheng Liu
2013-01-08  1:27           ` Dave Chinner
2013-01-08  2:25             ` Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 4/9 v1] ext4: adjust interfaces of " Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 5/9 v1] ext4: track all extent status in " Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 6/9 v1] ext4: lookup block mapping " Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 7/9 v1] ext4: add a new convert function to convert an unwritten extent " Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion Zheng Liu
2012-12-31 16:36   ` Jan Kara
2012-12-31 17:04     ` Jan Kara
2012-12-31 21:58   ` Jan Kara
2013-01-01  5:24     ` Zheng Liu
2013-01-03 10:56       ` Jan Kara
2013-01-04  4:26         ` Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 9/9 v1] ext4: set dioread_nolock by default for extent-based files Zheng Liu
