* [PATCH 0/9 v4] ext4: extent status tree (step2)
@ 2013-01-31 5:17 Zheng Liu
2013-01-31 5:17 ` [PATCH 1/9 v4] ext4: refine extent status tree Zheng Liu
` (8 more replies)
0 siblings, 9 replies; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
Hi all,
Here's the fourth round to implement the second step of extent status
tree. In the first step, extent status tree can track all delayed
extents, improve bigalloc, delayed allocation, and introduce lseek(2)
seek data/hole. The first step has been applied. In the second step,
extent status tree will try to track all extent status, and it will be
as a extent cache for extent tree. Then we can first lookup a block
mapping from extent status tree before traversing extent tree.
The major change in this version is to rework the patch that reclaims
extent from extent status tree. Now when the extent status tree of an
inode is accessing, this inode will be moved into the tail of the lru
list. A counter is used to count the number of written/unwritten
extents in a extent status tree. Meanwhile a normal shrinker is
registered to reclaim memeory when we are under a high memory pressure.
ext4_es_reclaim_extent_count() traverses the lru list to count all
cached objects. ext4_es_shrink() tries to reclaim written/unwritten
extent from the tree. Here we never reclaim delayed extent in the tree
because bigalloc, seek data/hole, and fiemap need it.
I do the following test to verify that the shrinker works well. 50
files that the length of file is 128m are created with 2048 extents,
and mlock(2) is used to occupy the memory as much as possible. Then
grep(1) is used to touch all extents of every files. I can see from the
tracepoint in ext4_es_shrink() that extents are reclaimed.
changelog:
v4 <- v3:
- register a normal shrinker to reclaim extent from extent status tree
v3 <- v2:
- use prune_super() to reclaim extents from extent status tree
- stashed es_status into es_pblk
- remove single extent cache
- rebase against 3.8-rc4
v2 <- v1:
- drop patches that try to improve unwritten extent conversion
- remove EXT4_MAP_FROM_CLUSTER flag
- add tracepoint for ext4_es_lookup_extent()
- drop a patch, which tries to fix a warning when bigalloc and delalloc
are enabled
- add a shrinker to reclaim memory from extent status tree
- rebase against 3.8-rc2
v3: http://lwn.net/Articles/533730/
v2: http://lwn.net/Articles/532446/
v1: http://lwn.net/Articles/531065/
Any comments or feedbacks are appreciated.
Regards,
- Zheng
Zheng Liu (9):
ext4: refine extent status tree
ext4: remove EXT4_MAP_FROM_CLUSTER flag
ext4: add physical block and status member into extent status tree
ext4: adjust interfaces of extent status tree
ext4: track all extent status in extent status tree
ext4: lookup block mapping in extent status tree
ext4: remove single extent cache
ext4: adjust some functions for reclaiming extents from extent status
tree
ext4: reclaim extents from extent status tree
fs/ext4/ext4.h | 34 +--
fs/ext4/ext4_extents.h | 6 -
fs/ext4/extents.c | 256 ++++---------------
fs/ext4/extents_status.c | 588 ++++++++++++++++++++++++++++++++------------
fs/ext4/extents_status.h | 50 +++-
fs/ext4/file.c | 14 +-
fs/ext4/inode.c | 132 +++++++---
fs/ext4/move_extent.c | 3 -
fs/ext4/super.c | 8 +-
include/trace/events/ext4.h | 178 ++++++++++++--
10 files changed, 802 insertions(+), 467 deletions(-)
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/9 v4] ext4: refine extent status tree
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
@ 2013-01-31 5:17 ` Zheng Liu
2013-01-31 5:17 ` [PATCH 2/9 v4] ext4: remove EXT4_MAP_FROM_CLUSTER flag Zheng Liu
` (7 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
This commit refines the extent status tree code.
1) A prefix 'es_' is added to to the extent status tree structure
members.
2) Refactored es_remove_extent() so that __es_remove_extent() can be
used by es_insert_extent() to remove the old extent entry(-ies) before
inserting a new one.
3) Rename extent_status_end() to ext4_es_end()
4) ext4_es_can_be_merged() is define to check whether two extents can
be merged or not.
5) Update and clarified comments.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents.c | 21 +--
fs/ext4/extents_status.c | 318 ++++++++++++++++++++++++--------------------
fs/ext4/extents_status.h | 8 +-
fs/ext4/file.c | 12 +-
include/trace/events/ext4.h | 40 +++---
5 files changed, 217 insertions(+), 182 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5ae1674..f7bf616 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3525,13 +3525,14 @@ static int ext4_find_delalloc_range(struct inode *inode,
{
struct extent_status es;
- es.start = lblk_start;
- ext4_es_find_extent(inode, &es);
- if (es.len == 0)
+ es.es_lblk = lblk_start;
+ (void)ext4_es_find_extent(inode, &es);
+ if (es.es_len == 0)
return 0; /* there is no delay extent in this tree */
- else if (es.start <= lblk_start && lblk_start < es.start + es.len)
+ else if (es.es_lblk <= lblk_start &&
+ lblk_start < es.es_lblk + es.es_len)
return 1;
- else if (lblk_start <= es.start && es.start <= lblk_end)
+ else if (lblk_start <= es.es_lblk && es.es_lblk <= lblk_end)
return 1;
else
return 0;
@@ -4567,7 +4568,7 @@ static int ext4_find_delayed_extent(struct inode *inode,
struct extent_status es;
ext4_lblk_t next_del;
- es.start = newex->ec_block;
+ es.es_lblk = newex->ec_block;
next_del = ext4_es_find_extent(inode, &es);
if (newex->ec_start == 0) {
@@ -4575,18 +4576,18 @@ static int ext4_find_delayed_extent(struct inode *inode,
* No extent in extent-tree contains block @newex->ec_start,
* then the block may stay in 1)a hole or 2)delayed-extent.
*/
- if (es.len == 0)
+ if (es.es_len == 0)
/* A hole found. */
return 0;
- if (es.start > newex->ec_block) {
+ if (es.es_lblk > newex->ec_block) {
/* A hole found. */
- newex->ec_len = min(es.start - newex->ec_block,
+ newex->ec_len = min(es.es_lblk - newex->ec_block,
newex->ec_len);
return 0;
}
- newex->ec_len = es.start + es.len - newex->ec_block;
+ newex->ec_len = es.es_lblk + es.es_len - newex->ec_block;
}
return next_del;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 564d981..aa4d346 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -23,40 +23,53 @@
* (e.g. Reservation space warning), and provide extent-level locking.
* Delay extent tree is the first step to achieve this goal. It is
* original built by Yongqiang Yang. At that time it is called delay
- * extent tree, whose goal is only track delay extent in memory to
+ * extent tree, whose goal is only track delayed extents in memory to
* simplify the implementation of fiemap and bigalloc, and introduce
* lseek SEEK_DATA/SEEK_HOLE support. That is why it is still called
- * delay extent tree at the following comment. But for better
- * understand what it does, it has been rename to extent status tree.
+ * delay extent tree at the first commit. But for better understand
+ * what it does, it has been rename to extent status tree.
*
- * Currently the first step has been done. All delay extents are
- * tracked in the tree. It maintains the delay extent when a delay
- * allocation is issued, and the delay extent is written out or
+ * Step1:
+ * Currently the first step has been done. All delayed extents are
+ * tracked in the tree. It maintains the delayed extent when a delayed
+ * allocation is issued, and the delayed extent is written out or
* invalidated. Therefore the implementation of fiemap and bigalloc
* are simplified, and SEEK_DATA/SEEK_HOLE are introduced.
*
* The following comment describes the implemenmtation of extent
* status tree and future works.
+ *
+ * Step2:
+ * In this step all extent status is tracked by extent status tree.
+ * Thus, we can first try to lookup a block mapping in this tree before
+ * finding it in extent tree. Hence, single extent cache can be removed
+ * because extent status tree can do a better job. Extents in status
+ * tree are loaded on-demand. Therefore, the extent status tree may not
+ * contain all of the extents in a file. Meanwhile we add
+ * nr_cached_objects and free_cached_objects callback functions to
+ * reclaim extents from extent status tree. These functions make us
+ * reclaim written/unwritten extents from the tree under a heavy memory
+ * pressure. Delayed extents will not be reclaimed because fiemap,
+ * bigalloc, and seek_data/hole need it.
*/
/*
- * extents status tree implementation for ext4.
+ * Extent status tree implementation for ext4.
*
*
* ==========================================================================
- * Extents status encompass delayed extents and extent locks
+ * Extent status tree tracks all extent status.
*
- * 1. Why delayed extent implementation ?
+ * 1. Why we need to implement extent status tree?
*
- * Without delayed extent, ext4 identifies a delayed extent by looking
+ * Without extent status tree, ext4 identifies a delayed extent by looking
* up page cache, this has several deficiencies - complicated, buggy,
* and inefficient code.
*
- * FIEMAP, SEEK_HOLE/DATA, bigalloc, punch hole and writeout all need
- * to know if a block or a range of blocks are belonged to a delayed
- * extent.
+ * FIEMAP, SEEK_HOLE/DATA, bigalloc, and writeout all need to know if a
+ * block or a range of blocks are belonged to a delayed extent.
*
- * Let us have a look at how they do without delayed extents implementation.
+ * Let us have a look at how they do without extent status tree.
* -- FIEMAP
* FIEMAP looks up page cache to identify delayed allocations from holes.
*
@@ -68,47 +81,48 @@
* already under delayed allocation or not to determine whether
* quota reserving is needed for the cluster.
*
- * -- punch hole
- * punch hole looks up page cache to identify a delayed extent.
- *
* -- writeout
* Writeout looks up whole page cache to see if a buffer is
* mapped, If there are not very many delayed buffers, then it is
* time comsuming.
*
- * With delayed extents implementation, FIEMAP, SEEK_HOLE/DATA,
+ * With extent status tree implementation, FIEMAP, SEEK_HOLE/DATA,
* bigalloc and writeout can figure out if a block or a range of
* blocks is under delayed allocation(belonged to a delayed extent) or
- * not by searching the delayed extent tree.
+ * not by searching the extent tree.
*
*
* ==========================================================================
- * 2. ext4 delayed extents impelmentation
+ * 2. Ext4 extent status tree impelmentation
+ *
+ * -- extent
+ * A extent is a range of blocks which are contiguous logically and
+ * physically. Unlike extent in extent tree, this extent in ext4 is
+ * a in-memory struct, there is no corresponding on-disk data. There
+ * is no limit on length of extent, so an extent can contain as many
+ * blocks as they are contiguous logically and physically.
*
- * -- delayed extent
- * A delayed extent is a range of blocks which are contiguous
- * logically and under delayed allocation. Unlike extent in
- * ext4, delayed extent in ext4 is a in-memory struct, there is
- * no corresponding on-disk data. There is no limit on length of
- * delayed extent, so a delayed extent can contain as many blocks
- * as they are contiguous logically.
+ * -- extent status tree
+ * Every inode has an extent status tree and all allocation blocks
+ * are added to the tree with different status. The extent in the
+ * tree are ordered by logical block no.
*
- * -- delayed extent tree
- * Every inode has a delayed extent tree and all under delayed
- * allocation blocks are added to the tree as delayed extents.
- * Delayed extents in the tree are ordered by logical block no.
+ * -- operations on a extent status tree
+ * There are three important operations on a delayed extent tree: find
+ * next extent, adding a extent(a range of blocks) and removing a extent.
*
- * -- operations on a delayed extent tree
- * There are three operations on a delayed extent tree: find next
- * delayed extent, adding a space(a range of blocks) and removing
- * a space.
+ * -- race on a extent status tree
+ * Extent status tree is protected inode->i_es_lock.
*
- * -- race on a delayed extent tree
- * Delayed extent tree is protected inode->i_es_lock.
+ * -- memory consumption
+ * Fragmented extent tree will make extent status tree cost too much
+ * memory. Hence, we will reclaim written/unwritten extents from the
+ * tree under a heavy memory pressure.
*
*
* ==========================================================================
- * 3. performance analysis
+ * 3. Performance analysis
+ *
* -- overhead
* 1. There is a cache extent for write access, so if writes are
* not very random, adding space operaions are in O(1) time.
@@ -120,15 +134,19 @@
*
* ==========================================================================
* 4. TODO list
- * -- Track all extent status
*
- * -- Improve get block process
+ * -- Refactor delayed space reservation
*
* -- Extent-level locking
*/
static struct kmem_cache *ext4_es_cachep;
+static int __es_insert_extent(struct ext4_es_tree *tree,
+ struct extent_status *newes);
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+ ext4_lblk_t end);
+
int __init ext4_init_es(void)
{
ext4_es_cachep = KMEM_CACHE(extent_status, SLAB_RECLAIM_ACCOUNT);
@@ -161,7 +179,7 @@ static void ext4_es_print_tree(struct inode *inode)
while (node) {
struct extent_status *es;
es = rb_entry(node, struct extent_status, rb_node);
- printk(KERN_DEBUG " [%u/%u)", es->start, es->len);
+ printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
node = rb_next(node);
}
printk(KERN_DEBUG "\n");
@@ -170,10 +188,10 @@ static void ext4_es_print_tree(struct inode *inode)
#define ext4_es_print_tree(inode)
#endif
-static inline ext4_lblk_t extent_status_end(struct extent_status *es)
+static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
{
- BUG_ON(es->start + es->len < es->start);
- return es->start + es->len - 1;
+ BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
+ return es->es_lblk + es->es_len - 1;
}
/*
@@ -181,25 +199,25 @@ static inline ext4_lblk_t extent_status_end(struct extent_status *es)
* it can't be found, try to find next extent.
*/
static struct extent_status *__es_tree_search(struct rb_root *root,
- ext4_lblk_t offset)
+ ext4_lblk_t lblk)
{
struct rb_node *node = root->rb_node;
struct extent_status *es = NULL;
while (node) {
es = rb_entry(node, struct extent_status, rb_node);
- if (offset < es->start)
+ if (lblk < es->es_lblk)
node = node->rb_left;
- else if (offset > extent_status_end(es))
+ else if (lblk > ext4_es_end(es))
node = node->rb_right;
else
return es;
}
- if (es && offset < es->start)
+ if (es && lblk < es->es_lblk)
return es;
- if (es && offset > extent_status_end(es)) {
+ if (es && lblk > ext4_es_end(es)) {
node = rb_next(&es->rb_node);
return node ? rb_entry(node, struct extent_status, rb_node) :
NULL;
@@ -209,8 +227,8 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
}
/*
- * ext4_es_find_extent: find the 1st delayed extent covering @es->start
- * if it exists, otherwise, the next extent after @es->start.
+ * ext4_es_find_extent: find the 1st delayed extent covering @es->lblk
+ * if it exists, otherwise, the next extent after @es->lblk.
*
* @inode: the inode which owns delayed extents
* @es: delayed extent that we found
@@ -226,7 +244,7 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
struct rb_node *node;
ext4_lblk_t ret = EXT_MAX_BLOCKS;
- trace_ext4_es_find_extent_enter(inode, es->start);
+ trace_ext4_es_find_extent_enter(inode, es->es_lblk);
read_lock(&EXT4_I(inode)->i_es_lock);
tree = &EXT4_I(inode)->i_es_tree;
@@ -234,25 +252,25 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
/* find delay extent in cache firstly */
if (tree->cache_es) {
es1 = tree->cache_es;
- if (in_range(es->start, es1->start, es1->len)) {
+ if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
es_debug("%u cached by [%u/%u)\n",
- es->start, es1->start, es1->len);
+ es->es_lblk, es1->es_lblk, es1->es_len);
goto out;
}
}
- es->len = 0;
- es1 = __es_tree_search(&tree->root, es->start);
+ es->es_len = 0;
+ es1 = __es_tree_search(&tree->root, es->es_lblk);
out:
if (es1) {
tree->cache_es = es1;
- es->start = es1->start;
- es->len = es1->len;
+ es->es_lblk = es1->es_lblk;
+ es->es_len = es1->es_len;
node = rb_next(&es1->rb_node);
if (node) {
es1 = rb_entry(node, struct extent_status, rb_node);
- ret = es1->start;
+ ret = es1->es_lblk;
}
}
@@ -263,14 +281,14 @@ out:
}
static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t start, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
{
struct extent_status *es;
es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
if (es == NULL)
return NULL;
- es->start = start;
- es->len = len;
+ es->es_lblk = lblk;
+ es->es_len = len;
return es;
}
@@ -279,6 +297,20 @@ static void ext4_es_free_extent(struct extent_status *es)
kmem_cache_free(ext4_es_cachep, es);
}
+/*
+ * Check whether or not two extents can be merged
+ * Condition:
+ * - logical block number is contiguous
+ */
+static int ext4_es_can_be_merged(struct extent_status *es1,
+ struct extent_status *es2)
+{
+ if (es1->es_lblk + es1->es_len != es2->es_lblk)
+ return 0;
+
+ return 1;
+}
+
static struct extent_status *
ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
{
@@ -290,8 +322,8 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
return es;
es1 = rb_entry(node, struct extent_status, rb_node);
- if (es->start == extent_status_end(es1) + 1) {
- es1->len += es->len;
+ if (ext4_es_can_be_merged(es1, es)) {
+ es1->es_len += es->es_len;
rb_erase(&es->rb_node, &tree->root);
ext4_es_free_extent(es);
es = es1;
@@ -311,8 +343,8 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
return es;
es1 = rb_entry(node, struct extent_status, rb_node);
- if (es1->start == extent_status_end(es) + 1) {
- es->len += es1->len;
+ if (ext4_es_can_be_merged(es, es1)) {
+ es->es_len += es1->es_len;
rb_erase(node, &tree->root);
ext4_es_free_extent(es1);
}
@@ -320,60 +352,39 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
return es;
}
-static int __es_insert_extent(struct ext4_es_tree *tree, ext4_lblk_t offset,
- ext4_lblk_t len)
+static int __es_insert_extent(struct ext4_es_tree *tree,
+ struct extent_status *newes)
{
struct rb_node **p = &tree->root.rb_node;
struct rb_node *parent = NULL;
struct extent_status *es;
- ext4_lblk_t end = offset + len - 1;
-
- BUG_ON(end < offset);
- es = tree->cache_es;
- if (es && offset == (extent_status_end(es) + 1)) {
- es_debug("cached by [%u/%u)\n", es->start, es->len);
- es->len += len;
- es = ext4_es_try_to_merge_right(tree, es);
- goto out;
- } else if (es && es->start == end + 1) {
- es_debug("cached by [%u/%u)\n", es->start, es->len);
- es->start = offset;
- es->len += len;
- es = ext4_es_try_to_merge_left(tree, es);
- goto out;
- } else if (es && es->start <= offset &&
- end <= extent_status_end(es)) {
- es_debug("cached by [%u/%u)\n", es->start, es->len);
- goto out;
- }
while (*p) {
parent = *p;
es = rb_entry(parent, struct extent_status, rb_node);
- if (offset < es->start) {
- if (es->start == end + 1) {
- es->start = offset;
- es->len += len;
+ if (newes->es_lblk < es->es_lblk) {
+ if (ext4_es_can_be_merged(newes, es)) {
+ es->es_lblk = newes->es_lblk;
+ es->es_len += newes->es_len;
es = ext4_es_try_to_merge_left(tree, es);
goto out;
}
p = &(*p)->rb_left;
- } else if (offset > extent_status_end(es)) {
- if (offset == extent_status_end(es) + 1) {
- es->len += len;
+ } else if (newes->es_lblk > ext4_es_end(es)) {
+ if (ext4_es_can_be_merged(es, newes)) {
+ es->es_len += newes->es_len;
es = ext4_es_try_to_merge_right(tree, es);
goto out;
}
p = &(*p)->rb_right;
} else {
- if (extent_status_end(es) <= end)
- es->len = offset - es->start + len;
- goto out;
+ BUG_ON(1);
+ return -EINVAL;
}
}
- es = ext4_es_alloc_extent(offset, len);
+ es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
if (!es)
return -ENOMEM;
rb_link_node(&es->rb_node, parent, p);
@@ -385,27 +396,38 @@ out:
}
/*
- * ext4_es_insert_extent() adds a space to a delayed extent tree.
- * Caller holds inode->i_es_lock.
+ * ext4_es_insert_extent() adds a space to a extent status tree.
*
* ext4_es_insert_extent is called by ext4_da_write_begin and
* ext4_es_remove_extent.
*
* Return 0 on success, error code on failure.
*/
-int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
+int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len)
{
struct ext4_es_tree *tree;
+ struct extent_status newes;
+ ext4_lblk_t end = lblk + len - 1;
int err = 0;
- trace_ext4_es_insert_extent(inode, offset, len);
+ trace_ext4_es_insert_extent(inode, lblk, len);
es_debug("add [%u/%u) to extent status tree of inode %lu\n",
- offset, len, inode->i_ino);
+ lblk, len, inode->i_ino);
+
+ BUG_ON(end < lblk);
+
+ newes.es_lblk = lblk;
+ newes.es_len = len;
write_lock(&EXT4_I(inode)->i_es_lock);
tree = &EXT4_I(inode)->i_es_tree;
- err = __es_insert_extent(tree, offset, len);
+ err = __es_remove_extent(tree, lblk, end);
+ if (err != 0)
+ goto error;
+ err = __es_insert_extent(tree, &newes);
+
+error:
write_unlock(&EXT4_I(inode)->i_es_lock);
ext4_es_print_tree(inode);
@@ -413,57 +435,45 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
return err;
}
-/*
- * ext4_es_remove_extent() removes a space from a delayed extent tree.
- * Caller holds inode->i_es_lock.
- *
- * Return 0 on success, error code on failure.
- */
-int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
- ext4_lblk_t len)
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+ ext4_lblk_t end)
{
struct rb_node *node;
- struct ext4_es_tree *tree;
struct extent_status *es;
struct extent_status orig_es;
- ext4_lblk_t len1, len2, end;
+ ext4_lblk_t len1, len2;
int err = 0;
- trace_ext4_es_remove_extent(inode, offset, len);
- es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
- offset, len, inode->i_ino);
-
- end = offset + len - 1;
- BUG_ON(end < offset);
- write_lock(&EXT4_I(inode)->i_es_lock);
- tree = &EXT4_I(inode)->i_es_tree;
- es = __es_tree_search(&tree->root, offset);
+ es = __es_tree_search(&tree->root, lblk);
if (!es)
goto out;
- if (es->start > end)
+ if (es->es_lblk > end)
goto out;
/* Simply invalidate cache_es. */
tree->cache_es = NULL;
- orig_es.start = es->start;
- orig_es.len = es->len;
- len1 = offset > es->start ? offset - es->start : 0;
- len2 = extent_status_end(es) > end ?
- extent_status_end(es) - end : 0;
+ orig_es.es_lblk = es->es_lblk;
+ orig_es.es_len = es->es_len;
+ len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
+ len2 = ext4_es_end(es) > end ? ext4_es_end(es) - end : 0;
if (len1 > 0)
- es->len = len1;
+ es->es_len = len1;
if (len2 > 0) {
if (len1 > 0) {
- err = __es_insert_extent(tree, end + 1, len2);
+ struct extent_status newes;
+
+ newes.es_lblk = end + 1;
+ newes.es_len = len2;
+ err = __es_insert_extent(tree, &newes);
if (err) {
- es->start = orig_es.start;
- es->len = orig_es.len;
+ es->es_lblk = orig_es.es_lblk;
+ es->es_len = orig_es.es_len;
goto out;
}
} else {
- es->start = end + 1;
- es->len = len2;
+ es->es_lblk = end + 1;
+ es->es_len = len2;
}
goto out;
}
@@ -476,7 +486,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
es = NULL;
}
- while (es && extent_status_end(es) <= end) {
+ while (es && ext4_es_end(es) <= end) {
node = rb_next(&es->rb_node);
rb_erase(&es->rb_node, &tree->root);
ext4_es_free_extent(es);
@@ -487,13 +497,39 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
es = rb_entry(node, struct extent_status, rb_node);
}
- if (es && es->start < end + 1) {
- len1 = extent_status_end(es) - end;
- es->start = end + 1;
- es->len = len1;
+ if (es && es->es_lblk < end + 1) {
+ len1 = ext4_es_end(es) - end;
+ es->es_lblk = end + 1;
+ es->es_len = len1;
}
out:
+ return err;
+}
+
+/*
+ * ext4_es_remove_extent() removes a space from a extent status tree.
+ *
+ * Return 0 on success, error code on failure.
+ */
+int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
+ ext4_lblk_t len)
+{
+ struct ext4_es_tree *tree;
+ ext4_lblk_t end;
+ int err = 0;
+
+ trace_ext4_es_remove_extent(inode, lblk, len);
+ es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
+ lblk, len, inode->i_ino);
+
+ end = lblk + len - 1;
+ BUG_ON(end < lblk);
+
+ tree = &EXT4_I(inode)->i_es_tree;
+
+ write_lock(&EXT4_I(inode)->i_es_lock);
+ err = __es_remove_extent(tree, lblk, end);
write_unlock(&EXT4_I(inode)->i_es_lock);
ext4_es_print_tree(inode);
return err;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 077f82d..81e9339 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -22,8 +22,8 @@
struct extent_status {
struct rb_node rb_node;
- ext4_lblk_t start; /* first block extent covers */
- ext4_lblk_t len; /* length of extent in block */
+ ext4_lblk_t es_lblk; /* first logical block extent covers */
+ ext4_lblk_t es_len; /* length of extent in block */
};
struct ext4_es_tree {
@@ -35,9 +35,9 @@ extern int __init ext4_init_es(void);
extern void ext4_exit_es(void);
extern void ext4_es_init_tree(struct ext4_es_tree *tree);
-extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
-extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
struct extent_status *es);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 405565a..718c49f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -464,10 +464,9 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
* If there is a delay extent at this offset,
* it will be as a data.
*/
- es.start = last;
+ es.es_lblk = last;
(void)ext4_es_find_extent(inode, &es);
- if (last >= es.start &&
- last < es.start + es.len) {
+ if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
if (last != start)
dataoff = last << blkbits;
break;
@@ -549,11 +548,10 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
* If there is a delay extent at this offset,
* we will skip this extent.
*/
- es.start = last;
+ es.es_lblk = last;
(void)ext4_es_find_extent(inode, &es);
- if (last >= es.start &&
- last < es.start + es.len) {
- last = es.start + es.len;
+ if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+ last = es.es_lblk + es.es_len;
holeoff = last << blkbits;
continue;
}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 7e8c36b..952628a 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2068,75 +2068,75 @@ TRACE_EVENT(ext4_ext_remove_space_done,
);
TRACE_EVENT(ext4_es_insert_extent,
- TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
- TP_ARGS(inode, start, len),
+ TP_ARGS(inode, lblk, len),
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( loff_t, start )
+ __field( loff_t, lblk )
__field( loff_t, len )
),
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->start = start;
+ __entry->lblk = lblk;
__entry->len = len;
),
TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->start, __entry->len)
+ __entry->lblk, __entry->len)
);
TRACE_EVENT(ext4_es_remove_extent,
- TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
- TP_ARGS(inode, start, len),
+ TP_ARGS(inode, lblk, len),
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( loff_t, start )
+ __field( loff_t, lblk )
__field( loff_t, len )
),
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->start = start;
+ __entry->lblk = lblk;
__entry->len = len;
),
TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->start, __entry->len)
+ __entry->lblk, __entry->len)
);
TRACE_EVENT(ext4_es_find_extent_enter,
- TP_PROTO(struct inode *inode, ext4_lblk_t start),
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
- TP_ARGS(inode, start),
+ TP_ARGS(inode, lblk),
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( ext4_lblk_t, start )
+ __field( ext4_lblk_t, lblk )
),
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->start = start;
+ __entry->lblk = lblk;
),
- TP_printk("dev %d,%d ino %lu start %u",
+ TP_printk("dev %d,%d ino %lu lblk %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
- (unsigned long) __entry->ino, __entry->start)
+ (unsigned long) __entry->ino, __entry->lblk)
);
TRACE_EVENT(ext4_es_find_extent_exit,
@@ -2148,7 +2148,7 @@ TRACE_EVENT(ext4_es_find_extent_exit,
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( ext4_lblk_t, start )
+ __field( ext4_lblk_t, lblk )
__field( ext4_lblk_t, len )
__field( ext4_lblk_t, ret )
),
@@ -2156,15 +2156,15 @@ TRACE_EVENT(ext4_es_find_extent_exit,
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->start = es->start;
- __entry->len = es->len;
+ __entry->lblk = es->es_lblk;
+ __entry->len = es->es_len;
__entry->ret = ret;
),
TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->start, __entry->len, __entry->ret)
+ __entry->lblk, __entry->len, __entry->ret)
);
#endif /* _TRACE_EXT4_H */
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/9 v4] ext4: remove EXT4_MAP_FROM_CLUSTER flag
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
2013-01-31 5:17 ` [PATCH 1/9 v4] ext4: refine extent status tree Zheng Liu
@ 2013-01-31 5:17 ` Zheng Liu
2013-01-31 5:17 ` [PATCH 3/9 v4] ext4: add physical block and status member into extent status tree Zheng Liu
` (6 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
EXT4_MAP_FROM_CLUSTER flag is only used in ext4_da_map_blocks() and
ext4_ext_map_blocks() to indicate whether a cluster has been reserved or
not. But in ext4_da_map_blocks() we can check it directly. Meanwhile
this flag couldn't appear on the buffer head. So it can be replaced
with a local variable.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ext4.h | 15 +--------------
fs/ext4/extents.c | 18 ++++--------------
fs/ext4/inode.c | 11 ++---------
3 files changed, 7 insertions(+), 37 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8462eb3..36145ef1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -155,17 +155,9 @@ struct ext4_allocation_request {
#define EXT4_MAP_UNWRITTEN (1 << BH_Unwritten)
#define EXT4_MAP_BOUNDARY (1 << BH_Boundary)
#define EXT4_MAP_UNINIT (1 << BH_Uninit)
-/* Sometimes (in the bigalloc case, from ext4_da_get_block_prep) the caller of
- * ext4_map_blocks wants to know whether or not the underlying cluster has
- * already been accounted for. EXT4_MAP_FROM_CLUSTER conveys to the caller that
- * the requested mapping was from previously mapped (or delayed allocated)
- * cluster. We use BH_AllocFromCluster only for this flag. BH_AllocFromCluster
- * should never appear on buffer_head's state flags.
- */
-#define EXT4_MAP_FROM_CLUSTER (1 << BH_AllocFromCluster)
#define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
- EXT4_MAP_UNINIT | EXT4_MAP_FROM_CLUSTER)
+ EXT4_MAP_UNINIT)
struct ext4_map_blocks {
ext4_fsblk_t m_pblk;
@@ -2553,11 +2545,6 @@ extern int ext4_mmp_csum_verify(struct super_block *sb,
enum ext4_state_bits {
BH_Uninit /* blocks are allocated but uninitialized on disk */
= BH_JBDPrivateStart,
- BH_AllocFromCluster, /* allocated blocks were part of already
- * allocated cluster. Note that this flag will
- * never, ever appear in a buffer_head's state
- * flag. See EXT4_MAP_FROM_CLUSTER to see where
- * this is used. */
};
BUFFER_FNS(Uninit, uninit)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f7bf616..aa9a6d2 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3894,6 +3894,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext4_io_end_t *io = ext4_inode_aio(inode);
ext4_lblk_t cluster_offset;
int set_unwritten = 0;
+ int from_cluster = 0;
ext_debug("blocks %u/%u requested for inode %lu\n",
map->m_lblk, map->m_len, inode->i_ino);
@@ -3902,10 +3903,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* check in cache */
if (ext4_ext_in_cache(inode, map->m_lblk, &newex)) {
if (!newex.ee_start_lo && !newex.ee_start_hi) {
- if ((sbi->s_cluster_ratio > 1) &&
- ext4_find_delalloc_cluster(inode, map->m_lblk))
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
-
if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
/*
* block isn't allocated yet and
@@ -3916,8 +3913,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* we should allocate requested block */
} else {
/* block is already allocated */
- if (sbi->s_cluster_ratio > 1)
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
newblock = map->m_lblk
- le32_to_cpu(newex.ee_block)
+ ext4_ext_pblock(&newex);
@@ -3990,10 +3985,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
}
}
- if ((sbi->s_cluster_ratio > 1) &&
- ext4_find_delalloc_cluster(inode, map->m_lblk))
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
-
/*
* requested block isn't allocated yet;
* we couldn't try to create block if create flag is zero
@@ -4010,7 +4001,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/*
* Okay, we need to do block allocation.
*/
- map->m_flags &= ~EXT4_MAP_FROM_CLUSTER;
newex.ee_block = cpu_to_le32(map->m_lblk);
cluster_offset = map->m_lblk & (sbi->s_cluster_ratio-1);
@@ -4022,7 +4012,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
+ from_cluster = 1;
goto got_allocated_blocks;
}
@@ -4043,7 +4033,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
get_implied_cluster_alloc(inode->i_sb, map, ex2, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
+ from_cluster = 1;
goto got_allocated_blocks;
}
@@ -4168,7 +4158,7 @@ got_allocated_blocks:
*/
reserved_clusters = get_reserved_cluster_alloc(inode,
map->m_lblk, allocated);
- if (map->m_flags & EXT4_MAP_FROM_CLUSTER) {
+ if (from_cluster) {
if (reserved_clusters) {
/*
* We have clusters reserved for this range.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cbfe13b..4d066f3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1798,9 +1798,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
* of mapping from cluster so that the reserved space
* is calculated properly.
*/
- if ((EXT4_SB(inode->i_sb)->s_cluster_ratio > 1) &&
- ext4_find_delalloc_cluster(inode, map->m_lblk))
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
retval = 0;
} else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
retval = ext4_ext_map_blocks(NULL, inode, map, 0);
@@ -1814,7 +1811,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
*/
/* If the block was allocated from previously allocated cluster,
* then we dont need to reserve it again. */
- if (!(map->m_flags & EXT4_MAP_FROM_CLUSTER)) {
+ if ((EXT4_SB(inode->i_sb)->s_cluster_ratio == 1) ||
+ !ext4_find_delalloc_cluster(inode, map->m_lblk)) {
retval = ext4_da_reserve_space(inode, iblock);
if (retval)
/* not enough space to reserve */
@@ -1825,11 +1823,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
if (retval)
goto out_unlock;
- /* Clear EXT4_MAP_FROM_CLUSTER flag since its purpose is served
- * and it should not appear on the bh->b_state.
- */
- map->m_flags &= ~EXT4_MAP_FROM_CLUSTER;
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 3/9 v4] ext4: add physical block and status member into extent status tree
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
2013-01-31 5:17 ` [PATCH 1/9 v4] ext4: refine extent status tree Zheng Liu
2013-01-31 5:17 ` [PATCH 2/9 v4] ext4: remove EXT4_MAP_FROM_CLUSTER flag Zheng Liu
@ 2013-01-31 5:17 ` Zheng Liu
2013-01-31 5:17 ` [PATCH 4/9 v4] ext4: adjust interfaces of " Zheng Liu
` (5 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
es_pblk is used to record physical block that maps to the disk.
es_status is used to record the status of the extent. Three status
are defined, which are written, unwritten and delayed.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents_status.h | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 81e9339..2eb9cc3 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -20,10 +20,22 @@
#define es_debug(fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif
+enum {
+ EXTENT_STATUS_WRITTEN = 0, /* written extent */
+ EXTENT_STATUS_UNWRITTEN = 1, /* unwritten extent */
+ EXTENT_STATUS_DELAYED = 2, /* delayed extent */
+};
+
+/*
+ * Here for save memory es_status is stashed into es_pblk because we only have
+ * 48 bits physical block and es_status only needs 2 bits.
+ */
struct extent_status {
struct rb_node rb_node;
- ext4_lblk_t es_lblk; /* first logical block extent covers */
- ext4_lblk_t es_len; /* length of extent in block */
+ ext4_lblk_t es_lblk; /* first logical block extent covers */
+ ext4_lblk_t es_len; /* length of extent in block */
+ ext4_fsblk_t es_pblk : 62; /* first physical block */
+ ext4_fsblk_t es_status : 2; /* record the status of extent */
};
struct ext4_es_tree {
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 4/9 v4] ext4: adjust interfaces of extent status tree
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
` (2 preceding siblings ...)
2013-01-31 5:17 ` [PATCH 3/9 v4] ext4: add physical block and status member into extent status tree Zheng Liu
@ 2013-01-31 5:17 ` Zheng Liu
2013-01-31 16:02 ` Jan Kara
2013-01-31 5:17 ` [PATCH 5/9 v4] ext4: track all extent status in " Zheng Liu
` (4 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
Due to two members are added into extent status tree, all interfaces
need to be adjusted.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents_status.c | 56 ++++++++++++++++++++++++++++++++++++---------
fs/ext4/extents_status.h | 24 ++++++++++++++++++-
fs/ext4/inode.c | 3 ++-
include/trace/events/ext4.h | 34 +++++++++++++++++----------
4 files changed, 92 insertions(+), 25 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index aa4d346..9c7984c 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -179,7 +179,8 @@ static void ext4_es_print_tree(struct inode *inode)
while (node) {
struct extent_status *es;
es = rb_entry(node, struct extent_status, rb_node);
- printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
+ printk(KERN_DEBUG " [%u/%u) %llu %u",
+ es->es_lblk, es->es_len, es->es_pblk, es->es_status);
node = rb_next(node);
}
printk(KERN_DEBUG "\n");
@@ -234,7 +235,7 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
* @es: delayed extent that we found
*
* Returns the first block of the next extent after es, otherwise
- * EXT_MAX_BLOCKS if no delay extent is found.
+ * EXT_MAX_BLOCKS if no extent is found.
* Delayed extent is returned via @es.
*/
ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
@@ -249,12 +250,14 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
read_lock(&EXT4_I(inode)->i_es_lock);
tree = &EXT4_I(inode)->i_es_tree;
- /* find delay extent in cache firstly */
+ /* find extent in cache firstly */
if (tree->cache_es) {
es1 = tree->cache_es;
if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
- es_debug("%u cached by [%u/%u)\n",
- es->es_lblk, es1->es_lblk, es1->es_len);
+ es_debug("%u cached by [%u/%u) %llu %u\n",
+ es->es_lblk, es1->es_lblk, es1->es_len,
+ (unsigned long long)es1->es_pblk,
+ es1->es_status);
goto out;
}
}
@@ -267,6 +270,8 @@ out:
tree->cache_es = es1;
es->es_lblk = es1->es_lblk;
es->es_len = es1->es_len;
+ es->es_pblk = es1->es_pblk;
+ es->es_status = es1->es_status;
node = rb_next(&es1->rb_node);
if (node) {
es1 = rb_entry(node, struct extent_status, rb_node);
@@ -281,7 +286,8 @@ out:
}
static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len,
+ ext4_fsblk_t pblk, int status)
{
struct extent_status *es;
es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
@@ -289,6 +295,8 @@ ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
return NULL;
es->es_lblk = lblk;
es->es_len = len;
+ es->es_pblk = pblk;
+ es->es_status = status;
return es;
}
@@ -301,6 +309,8 @@ static void ext4_es_free_extent(struct extent_status *es)
* Check whether or not two extents can be merged
* Condition:
* - logical block number is contiguous
+ * - physical block number is contiguous
+ * - status is equal
*/
static int ext4_es_can_be_merged(struct extent_status *es1,
struct extent_status *es2)
@@ -308,6 +318,13 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
if (es1->es_lblk + es1->es_len != es2->es_lblk)
return 0;
+ if (es1->es_status != es2->es_status)
+ return 0;
+
+ if (!ext4_es_is_delayed(es1) &&
+ (es1->es_pblk + es1->es_len != es2->es_pblk))
+ return 0;
+
return 1;
}
@@ -367,6 +384,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
if (ext4_es_can_be_merged(newes, es)) {
es->es_lblk = newes->es_lblk;
es->es_len += newes->es_len;
+ es->es_pblk = ext4_es_get_pblock(es,
+ newes->es_pblk);
es = ext4_es_try_to_merge_left(tree, es);
goto out;
}
@@ -384,7 +403,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
}
}
- es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
+ es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len,
+ newes->es_pblk, newes->es_status);
if (!es)
return -ENOMEM;
rb_link_node(&es->rb_node, parent, p);
@@ -404,21 +424,23 @@ out:
* Return 0 on success, error code on failure.
*/
int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
- ext4_lblk_t len)
+ ext4_lblk_t len, ext4_fsblk_t pblk, int status)
{
struct ext4_es_tree *tree;
struct extent_status newes;
ext4_lblk_t end = lblk + len - 1;
int err = 0;
- trace_ext4_es_insert_extent(inode, lblk, len);
- es_debug("add [%u/%u) to extent status tree of inode %lu\n",
- lblk, len, inode->i_ino);
+ es_debug("add [%u/%u) %llu %d to extent status tree of inode %lu\n",
+ lblk, len, pblk, status, inode->i_ino);
BUG_ON(end < lblk);
newes.es_lblk = lblk;
newes.es_len = len;
+ newes.es_pblk = pblk;
+ newes.es_status = status;
+ trace_ext4_es_insert_extent(inode, &newes);
write_lock(&EXT4_I(inode)->i_es_lock);
tree = &EXT4_I(inode)->i_es_tree;
@@ -455,6 +477,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
orig_es.es_lblk = es->es_lblk;
orig_es.es_len = es->es_len;
+ orig_es.es_pblk = es->es_pblk;
+ orig_es.es_status = es->es_status;
+
len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
len2 = ext4_es_end(es) > end ? ext4_es_end(es) - end : 0;
if (len1 > 0)
@@ -465,6 +490,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
newes.es_lblk = end + 1;
newes.es_len = len2;
+ newes.es_pblk = ext4_es_get_pblock(&orig_es,
+ orig_es.es_pblk + orig_es.es_len - len2);
+ newes.es_status = orig_es.es_status;
err = __es_insert_extent(tree, &newes);
if (err) {
es->es_lblk = orig_es.es_lblk;
@@ -474,6 +502,8 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
} else {
es->es_lblk = end + 1;
es->es_len = len2;
+ es->es_pblk = ext4_es_get_pblock(es,
+ orig_es.es_pblk + orig_es.es_len - len2);
}
goto out;
}
@@ -498,9 +528,13 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
}
if (es && es->es_lblk < end + 1) {
+ ext4_lblk_t orig_len = es->es_len;
+
len1 = ext4_es_end(es) - end;
es->es_lblk = end + 1;
es->es_len = len1;
+ es->es_pblk = ext4_es_get_pblock(es,
+ es->es_pblk + orig_len - len1);
}
out:
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 2eb9cc3..1345c06 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -48,10 +48,32 @@ extern void ext4_exit_es(void);
extern void ext4_es_init_tree(struct ext4_es_tree *tree);
extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
- ext4_lblk_t len);
+ ext4_lblk_t len, ext4_fsblk_t pblk,
+ int status);
extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
struct extent_status *es);
+static inline int ext4_es_is_written(struct extent_status *es)
+{
+ return (es->es_status == EXTENT_STATUS_WRITTEN);
+}
+
+static inline int ext4_es_is_unwritten(struct extent_status *es)
+{
+ return (es->es_status == EXTENT_STATUS_UNWRITTEN);
+}
+
+static inline int ext4_es_is_delayed(struct extent_status *es)
+{
+ return (es->es_status == EXTENT_STATUS_DELAYED);
+}
+
+static inline ext4_fsblk_t ext4_es_get_pblock(struct extent_status *es,
+ ext4_fsblk_t pb)
+{
+ return (ext4_es_is_delayed(es) ? ~0 : pb);
+}
+
#endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4d066f3..e09c7cf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1819,7 +1819,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
goto out_unlock;
}
- retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len);
+ retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ ~0, EXTENT_STATUS_DELAYED);
if (retval)
goto out_unlock;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 952628a..43f335a 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2068,28 +2068,33 @@ TRACE_EVENT(ext4_ext_remove_space_done,
);
TRACE_EVENT(ext4_es_insert_extent,
- TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
+ TP_PROTO(struct inode *inode, struct extent_status *es),
- TP_ARGS(inode, lblk, len),
+ TP_ARGS(inode, es),
TP_STRUCT__entry(
- __field( dev_t, dev )
- __field( ino_t, ino )
- __field( loff_t, lblk )
- __field( loff_t, len )
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( ext4_lblk_t, lblk )
+ __field( ext4_lblk_t, len )
+ __field( ext4_fsblk_t, pblk )
+ __field( int, status )
),
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->lblk = lblk;
- __entry->len = len;
+ __entry->lblk = es->es_lblk;
+ __entry->len = es->es_len;
+ __entry->pblk = es->es_pblk;
+ __entry->status = es->es_status;
),
- TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+ TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->lblk, __entry->len)
+ __entry->lblk, __entry->len,
+ __entry->pblk, __entry->status)
);
TRACE_EVENT(ext4_es_remove_extent,
@@ -2150,6 +2155,8 @@ TRACE_EVENT(ext4_es_find_extent_exit,
__field( ino_t, ino )
__field( ext4_lblk_t, lblk )
__field( ext4_lblk_t, len )
+ __field( ext4_fsblk_t, pblk )
+ __field( int, status )
__field( ext4_lblk_t, ret )
),
@@ -2158,13 +2165,16 @@ TRACE_EVENT(ext4_es_find_extent_exit,
__entry->ino = inode->i_ino;
__entry->lblk = es->es_lblk;
__entry->len = es->es_len;
+ __entry->pblk = es->es_pblk;
+ __entry->status = es->es_status;
__entry->ret = ret;
),
- TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
+ TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %u ret %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->lblk, __entry->len, __entry->ret)
+ __entry->lblk, __entry->len,
+ __entry->pblk, __entry->status, __entry->ret)
);
#endif /* _TRACE_EXT4_H */
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 5/9 v4] ext4: track all extent status in extent status tree
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
` (3 preceding siblings ...)
2013-01-31 5:17 ` [PATCH 4/9 v4] ext4: adjust interfaces of " Zheng Liu
@ 2013-01-31 5:17 ` Zheng Liu
2013-01-31 16:50 ` Jan Kara
2013-01-31 5:17 ` [PATCH 6/9 v4] ext4: lookup block mapping " Zheng Liu
` (3 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
By recording the phycisal block and status, extent status tree is able
to track the status of every extents. When we call _map_blocks
functions to lookup an extent or create a new written/unwritten/delayed
extent, this extent will be inserted into extent status tree.
We don't load all extents from disk in alloc_inode() because it costs
too much memory, and if a file is opened and closed frequently it will
takes too much time to load all extent information. So currently when
we create/lookup an extent, this extent will be inserted into extent
status tree. Hence, the extent status tree may not comprehensively
contain all of the extents found in the file.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents.c | 5 +++-
fs/ext4/file.c | 6 +++--
fs/ext4/inode.c | 73 ++++++++++++++++++++++++++++++++++++-------------------
3 files changed, 56 insertions(+), 28 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index aa9a6d2..d23a654 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2074,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
}
/* This is possible iff next == next_del == EXT_MAX_BLOCKS */
- if (next == next_del) {
+ if (next == next_del && next_del == EXT_MAX_BLOCKS) {
flags |= FIEMAP_EXTENT_LAST;
if (unlikely(next_del != EXT_MAX_BLOCKS ||
next != EXT_MAX_BLOCKS)) {
@@ -4570,6 +4570,9 @@ static int ext4_find_delayed_extent(struct inode *inode,
/* A hole found. */
return 0;
+ if (!ext4_es_is_delayed(&es))
+ return 0;
+
if (es.es_lblk > newex->ec_block) {
/* A hole found. */
newex->ec_len = min(es.es_lblk - newex->ec_block,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 718c49f..afaf9f15 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -466,7 +466,8 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
*/
es.es_lblk = last;
(void)ext4_es_find_extent(inode, &es);
- if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+ if (ext4_es_is_delayed(&es) &&
+ last >= es.es_lblk && last < es.es_lblk + es.es_len) {
if (last != start)
dataoff = last << blkbits;
break;
@@ -550,7 +551,8 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
*/
es.es_lblk = last;
(void)ext4_es_find_extent(inode, &es);
- if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+ if (ext4_es_is_delayed(&es) &&
+ last >= es.es_lblk && last < es.es_lblk + es.es_len) {
last = es.es_lblk + es.es_len;
holeoff = last << blkbits;
continue;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e09c7cf..f0dda2a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -527,20 +527,20 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
retval = ext4_ind_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
}
+ if (retval > 0) {
+ int ret, status;
+ status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+ EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+ ret = ext4_es_insert_extent(inode, map->m_lblk,
+ map->m_len, map->m_pblk, status);
+ if (ret < 0)
+ retval = ret;
+ }
if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
up_read((&EXT4_I(inode)->i_data_sem));
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
- int ret;
- if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
- /* delayed alloc may be allocated by fallocate and
- * coverted to initialized by directIO.
- * we need to handle delayed extent here.
- */
- down_write((&EXT4_I(inode)->i_data_sem));
- goto delayed_mapped;
- }
- ret = check_block_validity(inode, map);
+ int ret = check_block_validity(inode, map);
if (ret != 0)
return ret;
}
@@ -615,18 +615,27 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
ext4_da_update_reserve_space(inode, retval, 1);
}
- if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
+ if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
- if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
- int ret;
-delayed_mapped:
- /* delayed allocation blocks has been allocated */
- ret = ext4_es_remove_extent(inode, map->m_lblk,
- map->m_len);
- if (ret < 0)
- retval = ret;
- }
+ if (retval > 0) {
+ int ret, status;
+
+ if (flags & EXT4_GET_BLOCKS_PRE_IO)
+ status = EXTENT_STATUS_UNWRITTEN;
+ else if (flags & EXT4_GET_BLOCKS_CONVERT)
+ status = EXTENT_STATUS_WRITTEN;
+ else if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
+ status = EXTENT_STATUS_UNWRITTEN;
+ else if (flags & EXT4_GET_BLOCKS_CREATE)
+ status = EXTENT_STATUS_WRITTEN;
+ else
+ BUG_ON(1);
+
+ ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status);
+ if (ret < 0)
+ retval = ret;
}
up_write((&EXT4_I(inode)->i_data_sem));
@@ -1805,6 +1814,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
retval = ext4_ind_map_blocks(NULL, inode, map, 0);
if (retval == 0) {
+ int ret;
/*
* XXX: __block_prepare_write() unmaps passed block,
* is it OK?
@@ -1813,20 +1823,33 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
* then we dont need to reserve it again. */
if ((EXT4_SB(inode->i_sb)->s_cluster_ratio == 1) ||
!ext4_find_delalloc_cluster(inode, map->m_lblk)) {
- retval = ext4_da_reserve_space(inode, iblock);
- if (retval)
+ ret = ext4_da_reserve_space(inode, iblock);
+ if (ret) {
/* not enough space to reserve */
+ retval = ret;
goto out_unlock;
+ }
}
- retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
- ~0, EXTENT_STATUS_DELAYED);
- if (retval)
+ ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ ~0, EXTENT_STATUS_DELAYED);
+ if (ret) {
+ retval = ret;
goto out_unlock;
+ }
map_bh(bh, inode->i_sb, invalid_block);
set_buffer_new(bh);
set_buffer_delay(bh);
+ } else if (retval > 0) {
+ int ret, status;
+
+ status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+ EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+ ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status);
+ if (ret != 0)
+ retval = ret;
}
out_unlock:
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 6/9 v4] ext4: lookup block mapping in extent status tree
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
` (4 preceding siblings ...)
2013-01-31 5:17 ` [PATCH 5/9 v4] ext4: track all extent status in " Zheng Liu
@ 2013-01-31 5:17 ` Zheng Liu
2013-01-31 5:17 ` [PATCH 7/9 v4] ext4: remove single extent cache Zheng Liu
` (2 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
After tracking all extent status, we already have a extent cache in
memory. Every time we want to lookup a block mapping, we can first
try to lookup it in extent status tree to avoid a potential disk I/O.
A new function called ext4_es_lookup_extent is defined to finish this
work. When we try to lookup a block mapping, we always call
ext4_map_blocks and/or ext4_da_map_blocks. So in these functions we
first try to lookup a block mapping in extent status tree.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents_status.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/extents_status.h | 1 +
fs/ext4/inode.c | 49 +++++++++++++++++++++++++++++++++++++
include/trace/events/ext4.h | 56 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 165 insertions(+)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 9c7984c..e792d34 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -457,6 +457,65 @@ error:
return err;
}
+/*
+ * ext4_es_lookup_extent() looks up an extent in extent status tree.
+ *
+ * ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
+ *
+ * Return: 1 on found, 0 on not
+ */
+int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es)
+{
+ struct ext4_es_tree *tree;
+ struct extent_status *es1;
+ struct rb_node *node;
+ int found = 0;
+
+ trace_ext4_es_lookup_extent_enter(inode, es->es_lblk);
+ es_debug("lookup extent in block %u\n", es->es_lblk);
+
+ tree = &EXT4_I(inode)->i_es_tree;
+ read_lock(&EXT4_I(inode)->i_es_lock);
+
+ /* find extent in cache firstly */
+ if (tree->cache_es) {
+ es1 = tree->cache_es;
+ if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
+ es_debug("%u cached by [%u/%u)\n",
+ es->es_lblk, es1->es_lblk, es1->es_len);
+ found = 1;
+ goto out;
+ }
+ }
+
+ es->es_len = 0;
+ node = tree->root.rb_node;
+ while (node) {
+ es1 = rb_entry(node, struct extent_status, rb_node);
+ if (es->es_lblk < es1->es_lblk)
+ node = node->rb_left;
+ else if (es->es_lblk > ext4_es_end(es1))
+ node = node->rb_right;
+ else {
+ found = 1;
+ break;
+ }
+ }
+
+out:
+ if (found) {
+ es->es_lblk = es1->es_lblk;
+ es->es_len = es1->es_len;
+ es->es_pblk = es1->es_pblk;
+ es->es_status = es1->es_status;
+ }
+
+ read_unlock(&EXT4_I(inode)->i_es_lock);
+
+ trace_ext4_es_lookup_extent_exit(inode, es, found);
+ return found;
+}
+
static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
ext4_lblk_t end)
{
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 1345c06..6350d35 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -54,6 +54,7 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
struct extent_status *es);
+extern int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es);
static inline int ext4_es_is_written(struct extent_status *es)
{
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f0dda2a..561af16 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -508,12 +508,33 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
int ext4_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags)
{
+ struct extent_status es;
int retval;
map->m_flags = 0;
ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
"logical block %lu\n", inode->i_ino, flags, map->m_len,
(unsigned long) map->m_lblk);
+
+ /* Lookup extent status tree firstly */
+ es.es_lblk = map->m_lblk;
+ if (ext4_es_lookup_extent(inode, &es)) {
+ if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
+ map->m_pblk = es.es_pblk + map->m_lblk - es.es_lblk;
+ map->m_flags |= ext4_es_is_written(&es) ?
+ EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
+ retval = es.es_len - (map->m_lblk - es.es_lblk);
+ if (retval > map->m_len)
+ retval = map->m_len;
+ map->m_len = retval;
+ } else if (ext4_es_is_delayed(&es)) {
+ retval = 0;
+ } else {
+ BUG_ON(1);
+ }
+ goto found;
+ }
+
/*
* Try to see if we can get the block without requesting a new
* file system block.
@@ -539,6 +560,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
up_read((&EXT4_I(inode)->i_data_sem));
+found:
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
int ret = check_block_validity(inode, map);
if (ret != 0)
@@ -1784,6 +1806,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
struct ext4_map_blocks *map,
struct buffer_head *bh)
{
+ struct extent_status es;
int retval;
sector_t invalid_block = ~((sector_t) 0xffff);
@@ -1794,6 +1817,32 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
ext_debug("ext4_da_map_blocks(): inode %lu, max_blocks %u,"
"logical block %lu\n", inode->i_ino, map->m_len,
(unsigned long) map->m_lblk);
+
+ /* Lookup extent status tree firstly */
+ es.es_lblk = iblock;
+ if (ext4_es_lookup_extent(inode, &es)) {
+ map->m_pblk = es.es_pblk + iblock - es.es_lblk;
+ retval = es.es_len - (iblock - es.es_lblk);
+ if (retval > map->m_len)
+ retval = map->m_len;
+ map->m_len = retval;
+ if (ext4_es_is_written(&es)) {
+ map->m_flags |= EXT4_MAP_MAPPED;
+ } else if (ext4_es_is_unwritten(&es)) {
+ map->m_flags |= EXT4_MAP_UNWRITTEN;
+ } else if (ext4_es_is_delayed(&es)) {
+ map_bh(bh, inode->i_sb, invalid_block);
+ set_buffer_new(bh);
+ set_buffer_delay(bh);
+
+ return 0;
+ } else {
+ BUG_ON(1);
+ }
+
+ return retval;
+ }
+
/*
* Try to see if we can get the block without requesting a new
* file system block.
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 43f335a..f23c177 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2177,6 +2177,62 @@ TRACE_EVENT(ext4_es_find_extent_exit,
__entry->pblk, __entry->status, __entry->ret)
);
+TRACE_EVENT(ext4_es_lookup_extent_enter,
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
+
+ TP_ARGS(inode, lblk),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( ext4_lblk_t, lblk )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->lblk = lblk;
+ ),
+
+ TP_printk("dev %d,%d ino %lu lblk %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->lblk)
+);
+
+TRACE_EVENT(ext4_es_lookup_extent_exit,
+ TP_PROTO(struct inode *inode, struct extent_status *es,
+ int found),
+
+ TP_ARGS(inode, es, found),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( ext4_lblk_t, lblk )
+ __field( ext4_lblk_t, len )
+ __field( ext4_fsblk_t, pblk )
+ __field( int, status )
+ __field( int, found )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->lblk = es->es_lblk;
+ __entry->len = es->es_len;
+ __entry->pblk = es->es_pblk;
+ __entry->status = es->es_status;
+ __entry->found = found;
+ ),
+
+ TP_printk("dev %d,%d ino %lu found %d [%u/%u) %llu %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->found,
+ __entry->lblk, __entry->len,
+ __entry->found ? __entry->pblk : 0,
+ __entry->found ? __entry->status : 0)
+);
+
#endif /* _TRACE_EXT4_H */
/* This part must be outside protection */
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 7/9 v4] ext4: remove single extent cache
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
` (5 preceding siblings ...)
2013-01-31 5:17 ` [PATCH 6/9 v4] ext4: lookup block mapping " Zheng Liu
@ 2013-01-31 5:17 ` Zheng Liu
2013-01-31 17:05 ` Jan Kara
2013-01-31 5:17 ` [PATCH 8/9 v4] ext4: adjust some functions for reclaiming extents from extent status tree Zheng Liu
2013-01-31 5:17 ` [PATCH 9/9 v4] ext4: reclaim " Zheng Liu
8 siblings, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
Single extent cache could be removed because we have extent status tree
as a extent cache, and it would be better.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ext4.h | 12 ---
fs/ext4/ext4_extents.h | 6 --
fs/ext4/extents.c | 220 ++++++++-----------------------------------------
fs/ext4/move_extent.c | 3 -
fs/ext4/super.c | 1 -
5 files changed, 34 insertions(+), 208 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 36145ef1..60d16d1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -802,17 +802,6 @@ do { \
#endif /* defined(__KERNEL__) || defined(__linux__) */
-/*
- * storage for cached extent
- * If ec_len == 0, then the cache is invalid.
- * If ec_start == 0, then the cache represents a gap (null mapping)
- */
-struct ext4_ext_cache {
- ext4_fsblk_t ec_start;
- ext4_lblk_t ec_block;
- __u32 ec_len; /* must be 32bit to return holes */
-};
-
#include "extents_status.h"
/*
@@ -879,7 +868,6 @@ struct ext4_inode_info {
struct inode vfs_inode;
struct jbd2_inode *jinode;
- struct ext4_ext_cache i_cached_extent;
/*
* File creation time. Its function is same as that of
* struct timespec i_{a,c,m}time in the generic inode.
diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index 487fda1..8643ff5 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -193,12 +193,6 @@ static inline unsigned short ext_depth(struct inode *inode)
return le16_to_cpu(ext_inode_hdr(inode)->eh_depth);
}
-static inline void
-ext4_ext_invalidate_cache(struct inode *inode)
-{
- EXT4_I(inode)->i_cached_extent.ec_len = 0;
-}
-
static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
{
/* We can not have an uninitialized extent of zero length! */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d23a654..dedd557 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -112,7 +112,7 @@ static int ext4_split_extent_at(handle_t *handle,
int flags);
static int ext4_find_delayed_extent(struct inode *inode,
- struct ext4_ext_cache *newex);
+ struct extent_status *newes);
static int ext4_ext_truncate_extend_restart(handle_t *handle,
struct inode *inode,
@@ -714,7 +714,6 @@ int ext4_ext_tree_init(handle_t *handle, struct inode *inode)
eh->eh_magic = EXT4_EXT_MAGIC;
eh->eh_max = cpu_to_le16(ext4_ext_space_root(inode, 0));
ext4_mark_inode_dirty(handle, inode);
- ext4_ext_invalidate_cache(inode);
return 0;
}
@@ -1960,7 +1959,6 @@ cleanup:
ext4_ext_drop_refs(npath);
kfree(npath);
}
- ext4_ext_invalidate_cache(inode);
return err;
}
@@ -1969,8 +1967,8 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
struct fiemap_extent_info *fieinfo)
{
struct ext4_ext_path *path = NULL;
- struct ext4_ext_cache newex;
struct ext4_extent *ex;
+ struct extent_status es;
ext4_lblk_t next, next_del, start = 0, end = 0;
ext4_lblk_t last = block + num;
int exists, depth = 0, err = 0;
@@ -2044,31 +2042,31 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
BUG_ON(end <= start);
if (!exists) {
- newex.ec_block = start;
- newex.ec_len = end - start;
- newex.ec_start = 0;
+ es.es_lblk = start;
+ es.es_len = end - start;
+ es.es_pblk = 0;
} else {
- newex.ec_block = le32_to_cpu(ex->ee_block);
- newex.ec_len = ext4_ext_get_actual_len(ex);
- newex.ec_start = ext4_ext_pblock(ex);
+ es.es_lblk = le32_to_cpu(ex->ee_block);
+ es.es_len = ext4_ext_get_actual_len(ex);
+ es.es_pblk = ext4_ext_pblock(ex);
if (ext4_ext_is_uninitialized(ex))
flags |= FIEMAP_EXTENT_UNWRITTEN;
}
/*
- * Find delayed extent and update newex accordingly. We call
- * it even in !exists case to find out whether newex is the
+ * Find delayed extent and update es accordingly. We call
+ * it even in !exists case to find out whether es is the
* last existing extent or not.
*/
- next_del = ext4_find_delayed_extent(inode, &newex);
+ next_del = ext4_find_delayed_extent(inode, &es);
if (!exists && next_del) {
exists = 1;
flags |= FIEMAP_EXTENT_DELALLOC;
}
up_read(&EXT4_I(inode)->i_data_sem);
- if (unlikely(newex.ec_len == 0)) {
- EXT4_ERROR_INODE(inode, "newex.ec_len == 0");
+ if (unlikely(es.es_len == 0)) {
+ EXT4_ERROR_INODE(inode, "es.es_len == 0");
err = -EIO;
break;
}
@@ -2089,9 +2087,9 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
if (exists) {
err = fiemap_fill_next_extent(fieinfo,
- (__u64)newex.ec_block << blksize_bits,
- (__u64)newex.ec_start << blksize_bits,
- (__u64)newex.ec_len << blksize_bits,
+ (__u64)es.es_lblk << blksize_bits,
+ (__u64)es.es_pblk << blksize_bits,
+ (__u64)es.es_len << blksize_bits,
flags);
if (err < 0)
break;
@@ -2101,7 +2099,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
}
}
- block = newex.ec_block + newex.ec_len;
+ block = es.es_lblk + es.es_len;
}
if (path) {
@@ -2112,115 +2110,6 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
return err;
}
-static void
-ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
- __u32 len, ext4_fsblk_t start)
-{
- struct ext4_ext_cache *cex;
- BUG_ON(len == 0);
- spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
- trace_ext4_ext_put_in_cache(inode, block, len, start);
- cex = &EXT4_I(inode)->i_cached_extent;
- cex->ec_block = block;
- cex->ec_len = len;
- cex->ec_start = start;
- spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
-}
-
-/*
- * ext4_ext_put_gap_in_cache:
- * calculate boundaries of the gap that the requested block fits into
- * and cache this gap
- */
-static void
-ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
- ext4_lblk_t block)
-{
- int depth = ext_depth(inode);
- unsigned long len;
- ext4_lblk_t lblock;
- struct ext4_extent *ex;
-
- ex = path[depth].p_ext;
- if (ex == NULL) {
- /* there is no extent yet, so gap is [0;-] */
- lblock = 0;
- len = EXT_MAX_BLOCKS;
- ext_debug("cache gap(whole file):");
- } else if (block < le32_to_cpu(ex->ee_block)) {
- lblock = block;
- len = le32_to_cpu(ex->ee_block) - block;
- ext_debug("cache gap(before): %u [%u:%u]",
- block,
- le32_to_cpu(ex->ee_block),
- ext4_ext_get_actual_len(ex));
- } else if (block >= le32_to_cpu(ex->ee_block)
- + ext4_ext_get_actual_len(ex)) {
- ext4_lblk_t next;
- lblock = le32_to_cpu(ex->ee_block)
- + ext4_ext_get_actual_len(ex);
-
- next = ext4_ext_next_allocated_block(path);
- ext_debug("cache gap(after): [%u:%u] %u",
- le32_to_cpu(ex->ee_block),
- ext4_ext_get_actual_len(ex),
- block);
- BUG_ON(next == lblock);
- len = next - lblock;
- } else {
- lblock = len = 0;
- BUG();
- }
-
- ext_debug(" -> %u:%lu\n", lblock, len);
- ext4_ext_put_in_cache(inode, lblock, len, 0);
-}
-
-/*
- * ext4_ext_in_cache()
- * Checks to see if the given block is in the cache.
- * If it is, the cached extent is stored in the given
- * cache extent pointer.
- *
- * @inode: The files inode
- * @block: The block to look for in the cache
- * @ex: Pointer where the cached extent will be stored
- * if it contains block
- *
- * Return 0 if cache is invalid; 1 if the cache is valid
- */
-static int
-ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
- struct ext4_extent *ex)
-{
- struct ext4_ext_cache *cex;
- int ret = 0;
-
- /*
- * We borrow i_block_reservation_lock to protect i_cached_extent
- */
- spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
- cex = &EXT4_I(inode)->i_cached_extent;
-
- /* has cache valid data? */
- if (cex->ec_len == 0)
- goto errout;
-
- if (in_range(block, cex->ec_block, cex->ec_len)) {
- ex->ee_block = cpu_to_le32(cex->ec_block);
- ext4_ext_store_pblock(ex, cex->ec_start);
- ex->ee_len = cpu_to_le16(cex->ec_len);
- ext_debug("%u cached by %u:%u:%llu\n",
- block,
- cex->ec_block, cex->ec_len, cex->ec_start);
- ret = 1;
- }
-errout:
- trace_ext4_ext_in_cache(inode, block, ret);
- spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
- return ret;
-}
-
/*
* ext4_ext_rm_idx:
* removes index from the index block.
@@ -2658,8 +2547,6 @@ static int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
return PTR_ERR(handle);
again:
- ext4_ext_invalidate_cache(inode);
-
trace_ext4_ext_remove_space(inode, start, depth);
/*
@@ -3900,29 +3787,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
map->m_lblk, map->m_len, inode->i_ino);
trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
- /* check in cache */
- if (ext4_ext_in_cache(inode, map->m_lblk, &newex)) {
- if (!newex.ee_start_lo && !newex.ee_start_hi) {
- if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
- /*
- * block isn't allocated yet and
- * user doesn't want to allocate it
- */
- goto out2;
- }
- /* we should allocate requested block */
- } else {
- /* block is already allocated */
- newblock = map->m_lblk
- - le32_to_cpu(newex.ee_block)
- + ext4_ext_pblock(&newex);
- /* number of remaining blocks in the extent */
- allocated = ext4_ext_get_actual_len(&newex) -
- (map->m_lblk - le32_to_cpu(newex.ee_block));
- goto out;
- }
- }
-
/* find extent for this block */
path = ext4_ext_find_extent(inode, map->m_lblk, NULL);
if (IS_ERR(path)) {
@@ -3969,15 +3833,9 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
ee_block, ee_len, newblock);
- /*
- * Do not put uninitialized extent
- * in the cache
- */
- if (!ext4_ext_is_uninitialized(ex)) {
- ext4_ext_put_in_cache(inode, ee_block,
- ee_len, ee_start);
+ if (!ext4_ext_is_uninitialized(ex))
goto out;
- }
+
allocated = ext4_ext_handle_uninitialized_extents(
handle, inode, map, path, flags,
allocated, newblock);
@@ -3989,14 +3847,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
* requested block isn't allocated yet;
* we couldn't try to create block if create flag is zero
*/
- if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
- /*
- * put just found gap into cache to speed up
- * subsequent requests
- */
- ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
+ if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
goto out2;
- }
/*
* Okay, we need to do block allocation.
@@ -4232,10 +4084,9 @@ got_allocated_blocks:
* Cache the extent and update transaction to commit on fdatasync only
* when it is _not_ an uninitialized extent.
*/
- if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
- ext4_ext_put_in_cache(inode, map->m_lblk, allocated, newblock);
+ if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0)
ext4_update_inode_fsync_trans(handle, inode, 1);
- } else
+ else
ext4_update_inode_fsync_trans(handle, inode, 0);
out:
if (allocated > map->m_len)
@@ -4294,7 +4145,6 @@ void ext4_ext_truncate(struct inode *inode)
goto out_stop;
down_write(&EXT4_I(inode)->i_data_sem);
- ext4_ext_invalidate_cache(inode);
ext4_discard_preallocations(inode);
@@ -4544,26 +4394,26 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
}
/*
- * If newex is not existing extent (newex->ec_start equals zero) find
- * delayed extent at start of newex and update newex accordingly and
+ * If newes is not existing extent (newes->ec_pblk equals zero) find
+ * delayed extent at start of newes and update newes accordingly and
* return start of the next delayed extent.
*
- * If newex is existing extent (newex->ec_start is not equal zero)
+ * If newes is existing extent (newes->ec_pblk is not equal zero)
* return start of next delayed extent or EXT_MAX_BLOCKS if no delayed
- * extent found. Leave newex unmodified.
+ * extent found. Leave newes unmodified.
*/
static int ext4_find_delayed_extent(struct inode *inode,
- struct ext4_ext_cache *newex)
+ struct extent_status *newes)
{
struct extent_status es;
ext4_lblk_t next_del;
- es.es_lblk = newex->ec_block;
+ es.es_lblk = newes->es_lblk;
next_del = ext4_es_find_extent(inode, &es);
- if (newex->ec_start == 0) {
+ if (newes->es_pblk == 0) {
/*
- * No extent in extent-tree contains block @newex->ec_start,
+ * No extent in extent-tree contains block @newes->es_pblk,
* then the block may stay in 1)a hole or 2)delayed-extent.
*/
if (es.es_len == 0)
@@ -4573,14 +4423,14 @@ static int ext4_find_delayed_extent(struct inode *inode,
if (!ext4_es_is_delayed(&es))
return 0;
- if (es.es_lblk > newex->ec_block) {
+ if (es.es_lblk > newes->es_lblk) {
/* A hole found. */
- newex->ec_len = min(es.es_lblk - newex->ec_block,
- newex->ec_len);
+ newes->es_len = min(es.es_lblk - newes->es_lblk,
+ newes->es_len);
return 0;
}
- newex->ec_len = es.es_lblk + es.es_len - newex->ec_block;
+ newes->es_len = es.es_lblk + es.es_len - newes->es_lblk;
}
return next_del;
@@ -4780,14 +4630,12 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
goto out;
down_write(&EXT4_I(inode)->i_data_sem);
- ext4_ext_invalidate_cache(inode);
ext4_discard_preallocations(inode);
err = ext4_es_remove_extent(inode, first_block,
stop_block - first_block);
err = ext4_ext_remove_space(inode, first_block, stop_block - 1);
- ext4_ext_invalidate_cache(inode);
ext4_discard_preallocations(inode);
if (IS_SYNC(inode))
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index d9cc5ee..b9222c8 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -761,9 +761,6 @@ out:
kfree(donor_path);
}
- ext4_ext_invalidate_cache(orig_inode);
- ext4_ext_invalidate_cache(donor_inode);
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 8/9 v4] ext4: adjust some functions for reclaiming extents from extent status tree
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
` (6 preceding siblings ...)
2013-01-31 5:17 ` [PATCH 7/9 v4] ext4: remove single extent cache Zheng Liu
@ 2013-01-31 5:17 ` Zheng Liu
2013-01-31 5:17 ` [PATCH 9/9 v4] ext4: reclaim " Zheng Liu
8 siblings, 0 replies; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
This commit changes some interfaces in extent status tree because we
need to use inode to count the cached objects in a extent status tree.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents_status.c | 49 +++++++++++++++++++++++-------------------------
1 file changed, 23 insertions(+), 26 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e792d34..01fb000 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -142,9 +142,8 @@
static struct kmem_cache *ext4_es_cachep;
-static int __es_insert_extent(struct ext4_es_tree *tree,
- struct extent_status *newes);
-static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
+static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end);
int __init ext4_init_es(void)
@@ -286,7 +285,7 @@ out:
}
static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len,
+ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
ext4_fsblk_t pblk, int status)
{
struct extent_status *es;
@@ -300,7 +299,7 @@ ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len,
return es;
}
-static void ext4_es_free_extent(struct extent_status *es)
+static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
kmem_cache_free(ext4_es_cachep, es);
}
@@ -329,8 +328,9 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
}
static struct extent_status *
-ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
+ext4_es_try_to_merge_left(struct inode *inode, struct extent_status *es)
{
+ struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
struct extent_status *es1;
struct rb_node *node;
@@ -342,7 +342,7 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
if (ext4_es_can_be_merged(es1, es)) {
es1->es_len += es->es_len;
rb_erase(&es->rb_node, &tree->root);
- ext4_es_free_extent(es);
+ ext4_es_free_extent(inode, es);
es = es1;
}
@@ -350,8 +350,9 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
}
static struct extent_status *
-ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
+ext4_es_try_to_merge_right(struct inode *inode, struct extent_status *es)
{
+ struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
struct extent_status *es1;
struct rb_node *node;
@@ -363,15 +364,15 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
if (ext4_es_can_be_merged(es, es1)) {
es->es_len += es1->es_len;
rb_erase(node, &tree->root);
- ext4_es_free_extent(es1);
+ ext4_es_free_extent(inode, es1);
}
return es;
}
-static int __es_insert_extent(struct ext4_es_tree *tree,
- struct extent_status *newes)
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
{
+ struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
struct rb_node **p = &tree->root.rb_node;
struct rb_node *parent = NULL;
struct extent_status *es;
@@ -386,14 +387,14 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
es->es_len += newes->es_len;
es->es_pblk = ext4_es_get_pblock(es,
newes->es_pblk);
- es = ext4_es_try_to_merge_left(tree, es);
+ es = ext4_es_try_to_merge_left(inode, es);
goto out;
}
p = &(*p)->rb_left;
} else if (newes->es_lblk > ext4_es_end(es)) {
if (ext4_es_can_be_merged(es, newes)) {
es->es_len += newes->es_len;
- es = ext4_es_try_to_merge_right(tree, es);
+ es = ext4_es_try_to_merge_right(inode, es);
goto out;
}
p = &(*p)->rb_right;
@@ -403,7 +404,7 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
}
}
- es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len,
+ es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len,
newes->es_pblk, newes->es_status);
if (!es)
return -ENOMEM;
@@ -426,7 +427,6 @@ out:
int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len, ext4_fsblk_t pblk, int status)
{
- struct ext4_es_tree *tree;
struct extent_status newes;
ext4_lblk_t end = lblk + len - 1;
int err = 0;
@@ -443,11 +443,10 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
trace_ext4_es_insert_extent(inode, &newes);
write_lock(&EXT4_I(inode)->i_es_lock);
- tree = &EXT4_I(inode)->i_es_tree;
- err = __es_remove_extent(tree, lblk, end);
+ err = __es_remove_extent(inode, lblk, end);
if (err != 0)
goto error;
- err = __es_insert_extent(tree, &newes);
+ err = __es_insert_extent(inode, &newes);
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
@@ -516,9 +515,10 @@ out:
return found;
}
-static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
- ext4_lblk_t end)
+static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
+ ext4_lblk_t end)
{
+ struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
struct rb_node *node;
struct extent_status *es;
struct extent_status orig_es;
@@ -552,7 +552,7 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
newes.es_pblk = ext4_es_get_pblock(&orig_es,
orig_es.es_pblk + orig_es.es_len - len2);
newes.es_status = orig_es.es_status;
- err = __es_insert_extent(tree, &newes);
+ err = __es_insert_extent(inode, &newes);
if (err) {
es->es_lblk = orig_es.es_lblk;
es->es_len = orig_es.es_len;
@@ -578,7 +578,7 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
while (es && ext4_es_end(es) <= end) {
node = rb_next(&es->rb_node);
rb_erase(&es->rb_node, &tree->root);
- ext4_es_free_extent(es);
+ ext4_es_free_extent(inode, es);
if (!node) {
es = NULL;
break;
@@ -608,7 +608,6 @@ out:
int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len)
{
- struct ext4_es_tree *tree;
ext4_lblk_t end;
int err = 0;
@@ -619,10 +618,8 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
end = lblk + len - 1;
BUG_ON(end < lblk);
- tree = &EXT4_I(inode)->i_es_tree;
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 9/9 v4] ext4: reclaim extents from extent status tree
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
` (7 preceding siblings ...)
2013-01-31 5:17 ` [PATCH 8/9 v4] ext4: adjust some functions for reclaiming extents from extent status tree Zheng Liu
@ 2013-01-31 5:17 ` Zheng Liu
8 siblings, 0 replies; 21+ messages in thread
From: Zheng Liu @ 2013-01-31 5:17 UTC (permalink / raw)
To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
Although extent status is loaded on-demand, we also need to reclaim
extent from the tree when we are under a heavy memory pressure because
in some cases fragmented extent tree causes status tree costs too much
memory.
Here we maintain a lru list in super_block. When the extent status of
an inode is accessed and changed, this inode will be move to the tail
of the list. The inode will be dropped from this list when it is
cleared. In the inode, a counter is added to count the number of
cached objects in extent status tree. Here only written/unwritten
extent is counted because delayed extent doesn't be reclaimed due to
fiemap, bigalloc and seek_data/hole need it. The counter will be
increased as a new extent is allocated, and it will be decreased as a
extent is freed.
In this commit we use normal shrinker framework to reclaim memory from
the status tree. ext4_es_reclaim_extents_count() traverses the lru list
to count the number of reclaimable extents. ext4_es_shrink() tries to
reclaim written/unwritten extents from extent status tree. The inode
that has been shrunk is moved to the tail of lru list.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ext4.h | 7 ++
fs/ext4/extents_status.c | 156 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/extents_status.h | 5 ++
fs/ext4/super.c | 7 ++
include/trace/events/ext4.h | 60 +++++++++++++++++
5 files changed, 235 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 60d16d1..a565701 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -881,6 +881,8 @@ struct ext4_inode_info {
/* extents status tree */
struct ext4_es_tree i_es_tree;
rwlock_t i_es_lock;
+ struct list_head i_es_lru;
+ unsigned int i_es_lru_nr; /* protected by i_es_lock */
/* ialloc */
ext4_group_t i_last_alloc_group;
@@ -1296,6 +1298,11 @@ struct ext4_sb_info {
/* Precomputed FS UUID checksum for seeding other checksums */
__u32 s_csum_seed;
+
+ /* Reclaim extents from extent status tree */
+ struct shrinker s_es_shrinker;
+ struct list_head s_es_lru;
+ spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
};
static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 01fb000..9727597 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -145,6 +145,9 @@ static struct kmem_cache *ext4_es_cachep;
static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end);
+static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
+ int nr_to_scan);
+static int ext4_es_reclaim_extents_count(struct super_block *sb);
int __init ext4_init_es(void)
{
@@ -280,6 +283,7 @@ out:
read_unlock(&EXT4_I(inode)->i_es_lock);
+ ext4_es_lru_add(inode);
trace_ext4_es_find_extent_exit(inode, es, ret);
return ret;
}
@@ -296,11 +300,24 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
es->es_len = len;
es->es_pblk = pblk;
es->es_status = status;
+
+ /*
+ * We don't count delayed extent because we never try to reclaim them
+ */
+ if (!ext4_es_is_delayed(es))
+ EXT4_I(inode)->i_es_lru_nr++;
+
return es;
}
static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
+ /* Decrease the lru counter when this es is not delayed */
+ if (!ext4_es_is_delayed(es)) {
+ BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
+ EXT4_I(inode)->i_es_lru_nr--;
+ }
+
kmem_cache_free(ext4_es_cachep, es);
}
@@ -451,6 +468,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
+ ext4_es_lru_add(inode);
ext4_es_print_tree(inode);
return err;
@@ -511,6 +529,7 @@ out:
read_unlock(&EXT4_I(inode)->i_es_lock);
+ ext4_es_lru_add(inode);
trace_ext4_es_lookup_extent_exit(inode, es, found);
return found;
}
@@ -624,3 +643,140 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_es_print_tree(inode);
return err;
}
+
+static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
+{
+ struct ext4_sb_info *sbi = container_of(shrink,
+ struct ext4_sb_info, s_es_shrinker);
+ struct ext4_inode_info *ei;
+ struct list_head *cur, *tmp, scanned;
+ int nr_to_scan = sc->nr_to_scan;
+ int ret, nr_shrunk = 0;
+
+ trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan);
+
+ if (!nr_to_scan)
+ return ext4_es_reclaim_extents_count(sbi->s_sb);
+
+ INIT_LIST_HEAD(&scanned);
+
+ spin_lock(&sbi->s_es_lru_lock);
+ list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
+ list_move_tail(cur, &scanned);
+
+ ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
+
+ read_lock(&ei->i_es_lock);
+ if (ei->i_es_lru_nr == 0) {
+ read_unlock(&ei->i_es_lock);
+ continue;
+ }
+ read_unlock(&ei->i_es_lock);
+
+ write_lock(&ei->i_es_lock);
+ ret = __es_try_to_reclaim_extents(ei, nr_to_scan);
+ write_unlock(&ei->i_es_lock);
+
+ nr_shrunk += ret;
+ nr_to_scan -= ret;
+ if (nr_to_scan == 0)
+ break;
+ }
+ list_splice_tail(&scanned, &sbi->s_es_lru);
+ spin_unlock(&sbi->s_es_lru_lock);
+ trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk);
+
+ return ext4_es_reclaim_extents_count(sbi->s_sb);
+}
+
+void ext4_es_register_shrinker(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi;
+
+ sbi = EXT4_SB(sb);
+ INIT_LIST_HEAD(&sbi->s_es_lru);
+ spin_lock_init(&sbi->s_es_lru_lock);
+ sbi->s_es_shrinker.shrink = ext4_es_shrink;
+ sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
+ register_shrinker(&sbi->s_es_shrinker);
+}
+
+void ext4_es_unregister_shrinker(struct super_block *sb)
+{
+ unregister_shrinker(&EXT4_SB(sb)->s_es_shrinker);
+}
+
+void ext4_es_lru_add(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ spin_lock(&sbi->s_es_lru_lock);
+ if (list_empty(&ei->i_es_lru))
+ list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
+ else
+ list_move_tail(&ei->i_es_lru, &sbi->s_es_lru);
+ spin_unlock(&sbi->s_es_lru_lock);
+}
+
+void ext4_es_lru_del(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ spin_lock(&sbi->s_es_lru_lock);
+ if (!list_empty(&ei->i_es_lru))
+ list_del_init(&ei->i_es_lru);
+ spin_unlock(&sbi->s_es_lru_lock);
+}
+
+static int ext4_es_reclaim_extents_count(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_inode_info *ei;
+ struct list_head *cur;
+ int nr_cached = 0;
+
+ spin_lock(&sbi->s_es_lru_lock);
+ list_for_each(cur, &sbi->s_es_lru) {
+ ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
+ read_lock(&ei->i_es_lock);
+ nr_cached += ei->i_es_lru_nr;
+ read_unlock(&ei->i_es_lock);
+ }
+ spin_unlock(&sbi->s_es_lru_lock);
+ trace_ext4_es_reclaim_extents_count(sb, nr_cached);
+ return nr_cached;
+}
+
+static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
+ int nr_to_scan)
+{
+ struct inode *inode = &ei->vfs_inode;
+ struct ext4_es_tree *tree = &ei->i_es_tree;
+ struct rb_node *node;
+ struct extent_status *es;
+ int nr_shrunk = 0;
+
+ if (ei->i_es_lru_nr == 0)
+ return 0;
+
+ node = rb_first(&tree->root);
+ while (node != NULL) {
+ es = rb_entry(node, struct extent_status, rb_node);
+ node = rb_next(&es->rb_node);
+ /*
+ * We can't reclaim delayed extent from status tree because
+ * fiemap, bigallic, and seek_data/hole need to use it.
+ */
+ if (!ext4_es_is_delayed(es)) {
+ rb_erase(&es->rb_node, &tree->root);
+ ext4_es_free_extent(inode, es);
+ nr_shrunk++;
+ if (--nr_to_scan == 0)
+ break;
+ }
+ }
+ tree->cache_es = NULL;
+ return nr_shrunk;
+}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 6350d35..d199a51 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -77,4 +77,9 @@ static inline ext4_fsblk_t ext4_es_get_pblock(struct extent_status *es,
return (ext4_es_is_delayed(es) ? ~0 : pb);
}
+extern void ext4_es_register_shrinker(struct super_block *sb);
+extern void ext4_es_unregister_shrinker(struct super_block *sb);
+extern void ext4_es_lru_add(struct inode *inode);
+extern void ext4_es_lru_del(struct inode *inode);
+
#endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a35c6c1..64d78b1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -858,6 +858,7 @@ static void ext4_put_super(struct super_block *sb)
ext4_abort(sb, "Couldn't clean up the journal");
}
+ ext4_es_unregister_shrinker(sb);
del_timer(&sbi->s_err_report);
ext4_release_system_zone(sb);
ext4_mb_release(sb);
@@ -943,6 +944,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
spin_lock_init(&ei->i_prealloc_lock);
ext4_es_init_tree(&ei->i_es_tree);
rwlock_init(&ei->i_es_lock);
+ INIT_LIST_HEAD(&ei->i_es_lru);
+ ei->i_es_lru_nr = 0;
ei->i_reserved_data_blocks = 0;
ei->i_reserved_meta_blocks = 0;
ei->i_allocated_meta_blocks = 0;
@@ -1030,6 +1033,7 @@ void ext4_clear_inode(struct inode *inode)
dquot_drop(inode);
ext4_discard_preallocations(inode);
ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
+ ext4_es_lru_del(inode);
if (EXT4_I(inode)->jinode) {
jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
EXT4_I(inode)->jinode);
@@ -3771,6 +3775,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_max_writeback_mb_bump = 128;
sbi->s_extent_max_zeroout_kb = 32;
+ /* Register extent status tree shrinker */
+ ext4_es_register_shrinker(sb);
+
/*
* set up enough so that it can read an inode
*/
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f23c177..05318dc 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2233,6 +2233,66 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
__entry->found ? __entry->status : 0)
);
+TRACE_EVENT(ext4_es_reclaim_extents_count,
+ TP_PROTO(struct super_block *sb, int nr_cached),
+
+ TP_ARGS(sb, nr_cached),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( int, nr_cached )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->nr_cached = nr_cached;
+ ),
+
+ TP_printk("dev %d,%d cached objects nr %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->nr_cached)
+);
+
+TRACE_EVENT(ext4_es_shrink_enter,
+ TP_PROTO(struct super_block *sb, int nr_to_scan),
+
+ TP_ARGS(sb, nr_to_scan),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( int, nr_to_scan )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->nr_to_scan = nr_to_scan;
+ ),
+
+ TP_printk("dev %d,%d nr to scan %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->nr_to_scan)
+);
+
+TRACE_EVENT(ext4_es_shrink_exit,
+ TP_PROTO(struct super_block *sb, int shrunk_nr),
+
+ TP_ARGS(sb, shrunk_nr),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( int, shrunk_nr )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->shrunk_nr = shrunk_nr;
+ ),
+
+ TP_printk("dev %d,%d nr to scan %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->shrunk_nr)
+);
+
#endif /* _TRACE_EXT4_H */
/* This part must be outside protection */
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 4/9 v4] ext4: adjust interfaces of extent status tree
2013-01-31 5:17 ` [PATCH 4/9 v4] ext4: adjust interfaces of " Zheng Liu
@ 2013-01-31 16:02 ` Jan Kara
2013-02-01 2:51 ` Zheng Liu
0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2013-01-31 16:02 UTC (permalink / raw)
To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o
On Thu 31-01-13 13:17:52, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
>
> Due to two members are added into extent status tree, all interfaces
> need to be adjusted.
I have one minor comment below...
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> ---
> fs/ext4/extents_status.c | 56 ++++++++++++++++++++++++++++++++++++---------
> fs/ext4/extents_status.h | 24 ++++++++++++++++++-
> fs/ext4/inode.c | 3 ++-
> include/trace/events/ext4.h | 34 +++++++++++++++++----------
> 4 files changed, 92 insertions(+), 25 deletions(-)
>
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index aa4d346..9c7984c 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
<snip>
> @@ -465,6 +490,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
>
> newes.es_lblk = end + 1;
> newes.es_len = len2;
> + newes.es_pblk = ext4_es_get_pblock(&orig_es,
> + orig_es.es_pblk + orig_es.es_len - len2);
> + newes.es_status = orig_es.es_status;
> err = __es_insert_extent(tree, &newes);
> if (err) {
> es->es_lblk = orig_es.es_lblk;
> @@ -474,6 +502,8 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
> } else {
> es->es_lblk = end + 1;
> es->es_len = len2;
> + es->es_pblk = ext4_es_get_pblock(es,
> + orig_es.es_pblk + orig_es.es_len - len2);
> }
> goto out;
> }
> @@ -498,9 +528,13 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
> }
>
> if (es && es->es_lblk < end + 1) {
> + ext4_lblk_t orig_len = es->es_len;
> +
> len1 = ext4_es_end(es) - end;
> es->es_lblk = end + 1;
> es->es_len = len1;
> + es->es_pblk = ext4_es_get_pblock(es,
> + es->es_pblk + orig_len - len1);
> }
>
> out:
So ext4_es_get_pblock() seems a bit confusing. I understand that for
delayed extents physical block doesn't make sence and that you wanted to
save some typing by that function but IMHO it's making the code less
readable. I'd rather see there e.g.:
if (!ext4_es_is_delayed(es))
es->es_pblk += orig_len - len1;
and for delayed extents we can just leave es_pblk undefined because nobody
should really look at it anyway.
Honza
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 2eb9cc3..1345c06 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -48,10 +48,32 @@ extern void ext4_exit_es(void);
> extern void ext4_es_init_tree(struct ext4_es_tree *tree);
>
> extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> - ext4_lblk_t len);
> + ext4_lblk_t len, ext4_fsblk_t pblk,
> + int status);
> extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> ext4_lblk_t len);
> extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
> struct extent_status *es);
>
> +static inline int ext4_es_is_written(struct extent_status *es)
> +{
> + return (es->es_status == EXTENT_STATUS_WRITTEN);
> +}
> +
> +static inline int ext4_es_is_unwritten(struct extent_status *es)
> +{
> + return (es->es_status == EXTENT_STATUS_UNWRITTEN);
> +}
> +
> +static inline int ext4_es_is_delayed(struct extent_status *es)
> +{
> + return (es->es_status == EXTENT_STATUS_DELAYED);
> +}
> +
> +static inline ext4_fsblk_t ext4_es_get_pblock(struct extent_status *es,
> + ext4_fsblk_t pb)
> +{
> + return (ext4_es_is_delayed(es) ? ~0 : pb);
> +}
> +
> #endif /* _EXT4_EXTENTS_STATUS_H */
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4d066f3..e09c7cf 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1819,7 +1819,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
> goto out_unlock;
> }
>
> - retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len);
> + retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> + ~0, EXTENT_STATUS_DELAYED);
> if (retval)
> goto out_unlock;
>
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 952628a..43f335a 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -2068,28 +2068,33 @@ TRACE_EVENT(ext4_ext_remove_space_done,
> );
>
> TRACE_EVENT(ext4_es_insert_extent,
> - TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
> + TP_PROTO(struct inode *inode, struct extent_status *es),
>
> - TP_ARGS(inode, lblk, len),
> + TP_ARGS(inode, es),
>
> TP_STRUCT__entry(
> - __field( dev_t, dev )
> - __field( ino_t, ino )
> - __field( loff_t, lblk )
> - __field( loff_t, len )
> + __field( dev_t, dev )
> + __field( ino_t, ino )
> + __field( ext4_lblk_t, lblk )
> + __field( ext4_lblk_t, len )
> + __field( ext4_fsblk_t, pblk )
> + __field( int, status )
> ),
>
> TP_fast_assign(
> __entry->dev = inode->i_sb->s_dev;
> __entry->ino = inode->i_ino;
> - __entry->lblk = lblk;
> - __entry->len = len;
> + __entry->lblk = es->es_lblk;
> + __entry->len = es->es_len;
> + __entry->pblk = es->es_pblk;
> + __entry->status = es->es_status;
> ),
>
> - TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
> + TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %u",
> MAJOR(__entry->dev), MINOR(__entry->dev),
> (unsigned long) __entry->ino,
> - __entry->lblk, __entry->len)
> + __entry->lblk, __entry->len,
> + __entry->pblk, __entry->status)
> );
>
> TRACE_EVENT(ext4_es_remove_extent,
> @@ -2150,6 +2155,8 @@ TRACE_EVENT(ext4_es_find_extent_exit,
> __field( ino_t, ino )
> __field( ext4_lblk_t, lblk )
> __field( ext4_lblk_t, len )
> + __field( ext4_fsblk_t, pblk )
> + __field( int, status )
> __field( ext4_lblk_t, ret )
> ),
>
> @@ -2158,13 +2165,16 @@ TRACE_EVENT(ext4_es_find_extent_exit,
> __entry->ino = inode->i_ino;
> __entry->lblk = es->es_lblk;
> __entry->len = es->es_len;
> + __entry->pblk = es->es_pblk;
> + __entry->status = es->es_status;
> __entry->ret = ret;
> ),
>
> - TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
> + TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %u ret %u",
> MAJOR(__entry->dev), MINOR(__entry->dev),
> (unsigned long) __entry->ino,
> - __entry->lblk, __entry->len, __entry->ret)
> + __entry->lblk, __entry->len,
> + __entry->pblk, __entry->status, __entry->ret)
> );
>
> #endif /* _TRACE_EXT4_H */
> --
> 1.7.12.rc2.18.g61b472e
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 5/9 v4] ext4: track all extent status in extent status tree
2013-01-31 5:17 ` [PATCH 5/9 v4] ext4: track all extent status in " Zheng Liu
@ 2013-01-31 16:50 ` Jan Kara
2013-02-01 5:33 ` Zheng Liu
0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2013-01-31 16:50 UTC (permalink / raw)
To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o
On Thu 31-01-13 13:17:53, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
>
> By recording the phycisal block and status, extent status tree is able
> to track the status of every extents. When we call _map_blocks
> functions to lookup an extent or create a new written/unwritten/delayed
> extent, this extent will be inserted into extent status tree.
>
> We don't load all extents from disk in alloc_inode() because it costs
> too much memory, and if a file is opened and closed frequently it will
> takes too much time to load all extent information. So currently when
> we create/lookup an extent, this extent will be inserted into extent
> status tree. Hence, the extent status tree may not comprehensively
> contain all of the extents found in the file.
>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> ---
> fs/ext4/extents.c | 5 +++-
> fs/ext4/file.c | 6 +++--
> fs/ext4/inode.c | 73 ++++++++++++++++++++++++++++++++++++-------------------
> 3 files changed, 56 insertions(+), 28 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index aa9a6d2..d23a654 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2074,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
> }
>
> /* This is possible iff next == next_del == EXT_MAX_BLOCKS */
> - if (next == next_del) {
> + if (next == next_del && next_del == EXT_MAX_BLOCKS) {
This doesn't seem to be related, does it?
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e09c7cf..f0dda2a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -615,18 +615,27 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
> ext4_da_update_reserve_space(inode, retval, 1);
> }
> - if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
> + if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
>
> - if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> - int ret;
> -delayed_mapped:
> - /* delayed allocation blocks has been allocated */
> - ret = ext4_es_remove_extent(inode, map->m_lblk,
> - map->m_len);
> - if (ret < 0)
> - retval = ret;
> - }
> + if (retval > 0) {
> + int ret, status;
> +
> + if (flags & EXT4_GET_BLOCKS_PRE_IO)
> + status = EXTENT_STATUS_UNWRITTEN;
> + else if (flags & EXT4_GET_BLOCKS_CONVERT)
> + status = EXTENT_STATUS_WRITTEN;
> + else if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
> + status = EXTENT_STATUS_UNWRITTEN;
> + else if (flags & EXT4_GET_BLOCKS_CREATE)
> + status = EXTENT_STATUS_WRITTEN;
> + else
> + BUG_ON(1);
> +
> + ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> + map->m_pblk, status);
> + if (ret < 0)
> + retval = ret;
Hum, are you sure the extent status will be correct? Won't it be safer to
just use whatever we have in 'map'?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 7/9 v4] ext4: remove single extent cache
2013-01-31 5:17 ` [PATCH 7/9 v4] ext4: remove single extent cache Zheng Liu
@ 2013-01-31 17:05 ` Jan Kara
2013-02-01 3:08 ` Zheng Liu
0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2013-01-31 17:05 UTC (permalink / raw)
To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o
On Thu 31-01-13 13:17:55, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
>
> Single extent cache could be removed because we have extent status tree
> as a extent cache, and it would be better.
Just one note: The original extent cache has a capability of containing
information "there's a hole in range x-y" so we don't have to walk the tree
again only to find there's nothing for given block. It might be useful to
put this extent type in extent status tree as well for caching purposes...
Honza
>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> ---
> fs/ext4/ext4.h | 12 ---
> fs/ext4/ext4_extents.h | 6 --
> fs/ext4/extents.c | 220 ++++++++-----------------------------------------
> fs/ext4/move_extent.c | 3 -
> fs/ext4/super.c | 1 -
> 5 files changed, 34 insertions(+), 208 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 36145ef1..60d16d1 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -802,17 +802,6 @@ do { \
>
> #endif /* defined(__KERNEL__) || defined(__linux__) */
>
> -/*
> - * storage for cached extent
> - * If ec_len == 0, then the cache is invalid.
> - * If ec_start == 0, then the cache represents a gap (null mapping)
> - */
> -struct ext4_ext_cache {
> - ext4_fsblk_t ec_start;
> - ext4_lblk_t ec_block;
> - __u32 ec_len; /* must be 32bit to return holes */
> -};
> -
> #include "extents_status.h"
>
> /*
> @@ -879,7 +868,6 @@ struct ext4_inode_info {
> struct inode vfs_inode;
> struct jbd2_inode *jinode;
>
> - struct ext4_ext_cache i_cached_extent;
> /*
> * File creation time. Its function is same as that of
> * struct timespec i_{a,c,m}time in the generic inode.
> diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
> index 487fda1..8643ff5 100644
> --- a/fs/ext4/ext4_extents.h
> +++ b/fs/ext4/ext4_extents.h
> @@ -193,12 +193,6 @@ static inline unsigned short ext_depth(struct inode *inode)
> return le16_to_cpu(ext_inode_hdr(inode)->eh_depth);
> }
>
> -static inline void
> -ext4_ext_invalidate_cache(struct inode *inode)
> -{
> - EXT4_I(inode)->i_cached_extent.ec_len = 0;
> -}
> -
> static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
> {
> /* We can not have an uninitialized extent of zero length! */
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index d23a654..dedd557 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -112,7 +112,7 @@ static int ext4_split_extent_at(handle_t *handle,
> int flags);
>
> static int ext4_find_delayed_extent(struct inode *inode,
> - struct ext4_ext_cache *newex);
> + struct extent_status *newes);
>
> static int ext4_ext_truncate_extend_restart(handle_t *handle,
> struct inode *inode,
> @@ -714,7 +714,6 @@ int ext4_ext_tree_init(handle_t *handle, struct inode *inode)
> eh->eh_magic = EXT4_EXT_MAGIC;
> eh->eh_max = cpu_to_le16(ext4_ext_space_root(inode, 0));
> ext4_mark_inode_dirty(handle, inode);
> - ext4_ext_invalidate_cache(inode);
> return 0;
> }
>
> @@ -1960,7 +1959,6 @@ cleanup:
> ext4_ext_drop_refs(npath);
> kfree(npath);
> }
> - ext4_ext_invalidate_cache(inode);
> return err;
> }
>
> @@ -1969,8 +1967,8 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
> struct fiemap_extent_info *fieinfo)
> {
> struct ext4_ext_path *path = NULL;
> - struct ext4_ext_cache newex;
> struct ext4_extent *ex;
> + struct extent_status es;
> ext4_lblk_t next, next_del, start = 0, end = 0;
> ext4_lblk_t last = block + num;
> int exists, depth = 0, err = 0;
> @@ -2044,31 +2042,31 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
> BUG_ON(end <= start);
>
> if (!exists) {
> - newex.ec_block = start;
> - newex.ec_len = end - start;
> - newex.ec_start = 0;
> + es.es_lblk = start;
> + es.es_len = end - start;
> + es.es_pblk = 0;
> } else {
> - newex.ec_block = le32_to_cpu(ex->ee_block);
> - newex.ec_len = ext4_ext_get_actual_len(ex);
> - newex.ec_start = ext4_ext_pblock(ex);
> + es.es_lblk = le32_to_cpu(ex->ee_block);
> + es.es_len = ext4_ext_get_actual_len(ex);
> + es.es_pblk = ext4_ext_pblock(ex);
> if (ext4_ext_is_uninitialized(ex))
> flags |= FIEMAP_EXTENT_UNWRITTEN;
> }
>
> /*
> - * Find delayed extent and update newex accordingly. We call
> - * it even in !exists case to find out whether newex is the
> + * Find delayed extent and update es accordingly. We call
> + * it even in !exists case to find out whether es is the
> * last existing extent or not.
> */
> - next_del = ext4_find_delayed_extent(inode, &newex);
> + next_del = ext4_find_delayed_extent(inode, &es);
> if (!exists && next_del) {
> exists = 1;
> flags |= FIEMAP_EXTENT_DELALLOC;
> }
> up_read(&EXT4_I(inode)->i_data_sem);
>
> - if (unlikely(newex.ec_len == 0)) {
> - EXT4_ERROR_INODE(inode, "newex.ec_len == 0");
> + if (unlikely(es.es_len == 0)) {
> + EXT4_ERROR_INODE(inode, "es.es_len == 0");
> err = -EIO;
> break;
> }
> @@ -2089,9 +2087,9 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
>
> if (exists) {
> err = fiemap_fill_next_extent(fieinfo,
> - (__u64)newex.ec_block << blksize_bits,
> - (__u64)newex.ec_start << blksize_bits,
> - (__u64)newex.ec_len << blksize_bits,
> + (__u64)es.es_lblk << blksize_bits,
> + (__u64)es.es_pblk << blksize_bits,
> + (__u64)es.es_len << blksize_bits,
> flags);
> if (err < 0)
> break;
> @@ -2101,7 +2099,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
> }
> }
>
> - block = newex.ec_block + newex.ec_len;
> + block = es.es_lblk + es.es_len;
> }
>
> if (path) {
> @@ -2112,115 +2110,6 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
> return err;
> }
>
> -static void
> -ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
> - __u32 len, ext4_fsblk_t start)
> -{
> - struct ext4_ext_cache *cex;
> - BUG_ON(len == 0);
> - spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
> - trace_ext4_ext_put_in_cache(inode, block, len, start);
> - cex = &EXT4_I(inode)->i_cached_extent;
> - cex->ec_block = block;
> - cex->ec_len = len;
> - cex->ec_start = start;
> - spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> -}
> -
> -/*
> - * ext4_ext_put_gap_in_cache:
> - * calculate boundaries of the gap that the requested block fits into
> - * and cache this gap
> - */
> -static void
> -ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
> - ext4_lblk_t block)
> -{
> - int depth = ext_depth(inode);
> - unsigned long len;
> - ext4_lblk_t lblock;
> - struct ext4_extent *ex;
> -
> - ex = path[depth].p_ext;
> - if (ex == NULL) {
> - /* there is no extent yet, so gap is [0;-] */
> - lblock = 0;
> - len = EXT_MAX_BLOCKS;
> - ext_debug("cache gap(whole file):");
> - } else if (block < le32_to_cpu(ex->ee_block)) {
> - lblock = block;
> - len = le32_to_cpu(ex->ee_block) - block;
> - ext_debug("cache gap(before): %u [%u:%u]",
> - block,
> - le32_to_cpu(ex->ee_block),
> - ext4_ext_get_actual_len(ex));
> - } else if (block >= le32_to_cpu(ex->ee_block)
> - + ext4_ext_get_actual_len(ex)) {
> - ext4_lblk_t next;
> - lblock = le32_to_cpu(ex->ee_block)
> - + ext4_ext_get_actual_len(ex);
> -
> - next = ext4_ext_next_allocated_block(path);
> - ext_debug("cache gap(after): [%u:%u] %u",
> - le32_to_cpu(ex->ee_block),
> - ext4_ext_get_actual_len(ex),
> - block);
> - BUG_ON(next == lblock);
> - len = next - lblock;
> - } else {
> - lblock = len = 0;
> - BUG();
> - }
> -
> - ext_debug(" -> %u:%lu\n", lblock, len);
> - ext4_ext_put_in_cache(inode, lblock, len, 0);
> -}
> -
> -/*
> - * ext4_ext_in_cache()
> - * Checks to see if the given block is in the cache.
> - * If it is, the cached extent is stored in the given
> - * cache extent pointer.
> - *
> - * @inode: The files inode
> - * @block: The block to look for in the cache
> - * @ex: Pointer where the cached extent will be stored
> - * if it contains block
> - *
> - * Return 0 if cache is invalid; 1 if the cache is valid
> - */
> -static int
> -ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
> - struct ext4_extent *ex)
> -{
> - struct ext4_ext_cache *cex;
> - int ret = 0;
> -
> - /*
> - * We borrow i_block_reservation_lock to protect i_cached_extent
> - */
> - spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
> - cex = &EXT4_I(inode)->i_cached_extent;
> -
> - /* has cache valid data? */
> - if (cex->ec_len == 0)
> - goto errout;
> -
> - if (in_range(block, cex->ec_block, cex->ec_len)) {
> - ex->ee_block = cpu_to_le32(cex->ec_block);
> - ext4_ext_store_pblock(ex, cex->ec_start);
> - ex->ee_len = cpu_to_le16(cex->ec_len);
> - ext_debug("%u cached by %u:%u:%llu\n",
> - block,
> - cex->ec_block, cex->ec_len, cex->ec_start);
> - ret = 1;
> - }
> -errout:
> - trace_ext4_ext_in_cache(inode, block, ret);
> - spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> - return ret;
> -}
> -
> /*
> * ext4_ext_rm_idx:
> * removes index from the index block.
> @@ -2658,8 +2547,6 @@ static int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
> return PTR_ERR(handle);
>
> again:
> - ext4_ext_invalidate_cache(inode);
> -
> trace_ext4_ext_remove_space(inode, start, depth);
>
> /*
> @@ -3900,29 +3787,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> map->m_lblk, map->m_len, inode->i_ino);
> trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
>
> - /* check in cache */
> - if (ext4_ext_in_cache(inode, map->m_lblk, &newex)) {
> - if (!newex.ee_start_lo && !newex.ee_start_hi) {
> - if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
> - /*
> - * block isn't allocated yet and
> - * user doesn't want to allocate it
> - */
> - goto out2;
> - }
> - /* we should allocate requested block */
> - } else {
> - /* block is already allocated */
> - newblock = map->m_lblk
> - - le32_to_cpu(newex.ee_block)
> - + ext4_ext_pblock(&newex);
> - /* number of remaining blocks in the extent */
> - allocated = ext4_ext_get_actual_len(&newex) -
> - (map->m_lblk - le32_to_cpu(newex.ee_block));
> - goto out;
> - }
> - }
> -
> /* find extent for this block */
> path = ext4_ext_find_extent(inode, map->m_lblk, NULL);
> if (IS_ERR(path)) {
> @@ -3969,15 +3833,9 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
> ee_block, ee_len, newblock);
>
> - /*
> - * Do not put uninitialized extent
> - * in the cache
> - */
> - if (!ext4_ext_is_uninitialized(ex)) {
> - ext4_ext_put_in_cache(inode, ee_block,
> - ee_len, ee_start);
> + if (!ext4_ext_is_uninitialized(ex))
> goto out;
> - }
> +
> allocated = ext4_ext_handle_uninitialized_extents(
> handle, inode, map, path, flags,
> allocated, newblock);
> @@ -3989,14 +3847,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> * requested block isn't allocated yet;
> * we couldn't try to create block if create flag is zero
> */
> - if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
> - /*
> - * put just found gap into cache to speed up
> - * subsequent requests
> - */
> - ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
> + if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
> goto out2;
> - }
>
> /*
> * Okay, we need to do block allocation.
> @@ -4232,10 +4084,9 @@ got_allocated_blocks:
> * Cache the extent and update transaction to commit on fdatasync only
> * when it is _not_ an uninitialized extent.
> */
> - if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
> - ext4_ext_put_in_cache(inode, map->m_lblk, allocated, newblock);
> + if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0)
> ext4_update_inode_fsync_trans(handle, inode, 1);
> - } else
> + else
> ext4_update_inode_fsync_trans(handle, inode, 0);
> out:
> if (allocated > map->m_len)
> @@ -4294,7 +4145,6 @@ void ext4_ext_truncate(struct inode *inode)
> goto out_stop;
>
> down_write(&EXT4_I(inode)->i_data_sem);
> - ext4_ext_invalidate_cache(inode);
>
> ext4_discard_preallocations(inode);
>
> @@ -4544,26 +4394,26 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
> }
>
> /*
> - * If newex is not existing extent (newex->ec_start equals zero) find
> - * delayed extent at start of newex and update newex accordingly and
> + * If newes is not existing extent (newes->ec_pblk equals zero) find
> + * delayed extent at start of newes and update newes accordingly and
> * return start of the next delayed extent.
> *
> - * If newex is existing extent (newex->ec_start is not equal zero)
> + * If newes is existing extent (newes->ec_pblk is not equal zero)
> * return start of next delayed extent or EXT_MAX_BLOCKS if no delayed
> - * extent found. Leave newex unmodified.
> + * extent found. Leave newes unmodified.
> */
> static int ext4_find_delayed_extent(struct inode *inode,
> - struct ext4_ext_cache *newex)
> + struct extent_status *newes)
> {
> struct extent_status es;
> ext4_lblk_t next_del;
>
> - es.es_lblk = newex->ec_block;
> + es.es_lblk = newes->es_lblk;
> next_del = ext4_es_find_extent(inode, &es);
>
> - if (newex->ec_start == 0) {
> + if (newes->es_pblk == 0) {
> /*
> - * No extent in extent-tree contains block @newex->ec_start,
> + * No extent in extent-tree contains block @newes->es_pblk,
> * then the block may stay in 1)a hole or 2)delayed-extent.
> */
> if (es.es_len == 0)
> @@ -4573,14 +4423,14 @@ static int ext4_find_delayed_extent(struct inode *inode,
> if (!ext4_es_is_delayed(&es))
> return 0;
>
> - if (es.es_lblk > newex->ec_block) {
> + if (es.es_lblk > newes->es_lblk) {
> /* A hole found. */
> - newex->ec_len = min(es.es_lblk - newex->ec_block,
> - newex->ec_len);
> + newes->es_len = min(es.es_lblk - newes->es_lblk,
> + newes->es_len);
> return 0;
> }
>
> - newex->ec_len = es.es_lblk + es.es_len - newex->ec_block;
> + newes->es_len = es.es_lblk + es.es_len - newes->es_lblk;
> }
>
> return next_del;
> @@ -4780,14 +4630,12 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> goto out;
>
> down_write(&EXT4_I(inode)->i_data_sem);
> - ext4_ext_invalidate_cache(inode);
> ext4_discard_preallocations(inode);
>
> err = ext4_es_remove_extent(inode, first_block,
> stop_block - first_block);
> err = ext4_ext_remove_space(inode, first_block, stop_block - 1);
>
> - ext4_ext_invalidate_cache(inode);
> ext4_discard_preallocations(inode);
>
> if (IS_SYNC(inode))
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index d9cc5ee..b9222c8 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -761,9 +761,6 @@ out:
> kfree(donor_path);
> }
>
> - ext4_ext_invalidate_cache(orig_inode);
> - ext4_ext_invalidate_cache(donor_inode);
> -
> return replaced_count;
> }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3d4fb81..a35c6c1 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -939,7 +939,6 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
> return NULL;
>
> ei->vfs_inode.i_version = 1;
> - memset(&ei->i_cached_extent, 0, sizeof(struct ext4_ext_cache));
> INIT_LIST_HEAD(&ei->i_prealloc_list);
> spin_lock_init(&ei->i_prealloc_lock);
> ext4_es_init_tree(&ei->i_es_tree);
> --
> 1.7.12.rc2.18.g61b472e
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/9 v4] ext4: adjust interfaces of extent status tree
2013-01-31 16:02 ` Jan Kara
@ 2013-02-01 2:51 ` Zheng Liu
0 siblings, 0 replies; 21+ messages in thread
From: Zheng Liu @ 2013-02-01 2:51 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o
On Thu, Jan 31, 2013 at 05:02:06PM +0100, Jan Kara wrote:
[snip]
> > @@ -465,6 +490,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
> >
> > newes.es_lblk = end + 1;
> > newes.es_len = len2;
> > + newes.es_pblk = ext4_es_get_pblock(&orig_es,
> > + orig_es.es_pblk + orig_es.es_len - len2);
> > + newes.es_status = orig_es.es_status;
> > err = __es_insert_extent(tree, &newes);
> > if (err) {
> > es->es_lblk = orig_es.es_lblk;
> > @@ -474,6 +502,8 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
> > } else {
> > es->es_lblk = end + 1;
> > es->es_len = len2;
> > + es->es_pblk = ext4_es_get_pblock(es,
> > + orig_es.es_pblk + orig_es.es_len - len2);
> > }
> > goto out;
> > }
> > @@ -498,9 +528,13 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
> > }
> >
> > if (es && es->es_lblk < end + 1) {
> > + ext4_lblk_t orig_len = es->es_len;
> > +
> > len1 = ext4_es_end(es) - end;
> > es->es_lblk = end + 1;
> > es->es_len = len1;
> > + es->es_pblk = ext4_es_get_pblock(es,
> > + es->es_pblk + orig_len - len1);
> > }
> >
> > out:
> So ext4_es_get_pblock() seems a bit confusing. I understand that for
> delayed extents physical block doesn't make sence and that you wanted to
> save some typing by that function but IMHO it's making the code less
> readable. I'd rather see there e.g.:
> if (!ext4_es_is_delayed(es))
> es->es_pblk += orig_len - len1;
> and for delayed extents we can just leave es_pblk undefined because nobody
> should really look at it anyway.
Agree. It will be fixed in next version.
Thanks,
- Zheng
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 7/9 v4] ext4: remove single extent cache
2013-01-31 17:05 ` Jan Kara
@ 2013-02-01 3:08 ` Zheng Liu
0 siblings, 0 replies; 21+ messages in thread
From: Zheng Liu @ 2013-02-01 3:08 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o
On Thu, Jan 31, 2013 at 06:05:43PM +0100, Jan Kara wrote:
> On Thu 31-01-13 13:17:55, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> >
> > Single extent cache could be removed because we have extent status tree
> > as a extent cache, and it would be better.
> Just one note: The original extent cache has a capability of containing
> information "there's a hole in range x-y" so we don't have to walk the tree
> again only to find there's nothing for given block. It might be useful to
> put this extent type in extent status tree as well for caching purposes...
Yes, when I removed extent cache, I hesitated whether or not a new status
for a hole is added because that might occupies too much memory. Let me
consider what happens after adding this status. If we have a fragmented
file that has 2048 extents, we will cost double spaces to track these
holes in memory when a grep(1) is run. *But* now I think maybe you are
right because extent status tree has ability to reclaim memory when we
are under a high memory pressure. Meanwhile tracking all holes for a
file let us avoid to walk the extent tree in disk.
FWIW, I revise the patch (ext4: Remove bogus wait for unwritten extents
in ext4_ind_direct_IO) and I have an idea that let us not flush
unwritten io using extent status tree, and I will try it.
Thanks,
- Zheng
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 5/9 v4] ext4: track all extent status in extent status tree
2013-01-31 16:50 ` Jan Kara
@ 2013-02-01 5:33 ` Zheng Liu
2013-02-04 11:27 ` Jan Kara
0 siblings, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2013-02-01 5:33 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o
On Thu, Jan 31, 2013 at 05:50:55PM +0100, Jan Kara wrote:
> On Thu 31-01-13 13:17:53, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> >
> > By recording the phycisal block and status, extent status tree is able
> > to track the status of every extents. When we call _map_blocks
> > functions to lookup an extent or create a new written/unwritten/delayed
> > extent, this extent will be inserted into extent status tree.
> >
> > We don't load all extents from disk in alloc_inode() because it costs
> > too much memory, and if a file is opened and closed frequently it will
> > takes too much time to load all extent information. So currently when
> > we create/lookup an extent, this extent will be inserted into extent
> > status tree. Hence, the extent status tree may not comprehensively
> > contain all of the extents found in the file.
> >
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > ---
> > fs/ext4/extents.c | 5 +++-
> > fs/ext4/file.c | 6 +++--
> > fs/ext4/inode.c | 73 ++++++++++++++++++++++++++++++++++++-------------------
> > 3 files changed, 56 insertions(+), 28 deletions(-)
> >
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index aa9a6d2..d23a654 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -2074,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
> > }
> >
> > /* This is possible iff next == next_del == EXT_MAX_BLOCKS */
> > - if (next == next_del) {
> > + if (next == next_del && next_del == EXT_MAX_BLOCKS) {
> This doesn't seem to be related, does it?
ext4_ext_next_allocated_block() will return EXT_MAX_BLOCKS when it
reaches the end of file. ext4_find_delayed_extent() does the same
thing. Before tracking written/unwritten extent it is correct because
next never equals to next_del unless both of them equal to
EXT_MAX_BLOCKS. However, after that next is possible to equal to
next_del when they don't reach the end of file. So we need to make sure
next equals to next_del and both of them equal to EXT_MAX_BLOCKS. In
this condition it indicates that we reach the end of file. Am I miss
something?
>
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e09c7cf..f0dda2a 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -615,18 +615,27 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> > (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
> > ext4_da_update_reserve_space(inode, retval, 1);
> > }
> > - if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
> > + if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> > ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
> >
> > - if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> > - int ret;
> > -delayed_mapped:
> > - /* delayed allocation blocks has been allocated */
> > - ret = ext4_es_remove_extent(inode, map->m_lblk,
> > - map->m_len);
> > - if (ret < 0)
> > - retval = ret;
> > - }
> > + if (retval > 0) {
> > + int ret, status;
> > +
> > + if (flags & EXT4_GET_BLOCKS_PRE_IO)
> > + status = EXTENT_STATUS_UNWRITTEN;
> > + else if (flags & EXT4_GET_BLOCKS_CONVERT)
> > + status = EXTENT_STATUS_WRITTEN;
> > + else if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
> > + status = EXTENT_STATUS_UNWRITTEN;
> > + else if (flags & EXT4_GET_BLOCKS_CREATE)
> > + status = EXTENT_STATUS_WRITTEN;
> > + else
> > + BUG_ON(1);
> > +
> > + ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> > + map->m_pblk, status);
> > + if (ret < 0)
> > + retval = ret;
> Hum, are you sure the extent status will be correct? Won't it be safer to
> just use whatever we have in 'map'?
Your meaning is that we need to ignore the error when we insert a extent
into the extent status tree, right? But that would causes an
inconsistency between status tree and extent tree. Further,
ext4_es_insert_extent() returns EINVAL or ENOMEM. I believe that
reporting an error is a better choice. What do you think?
Thanks,
- Zheng
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 5/9 v4] ext4: track all extent status in extent status tree
2013-02-01 5:33 ` Zheng Liu
@ 2013-02-04 11:27 ` Jan Kara
2013-02-05 3:32 ` Zheng Liu
0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2013-02-04 11:27 UTC (permalink / raw)
To: Zheng Liu; +Cc: Jan Kara, linux-ext4, Zheng Liu, Theodore Ts'o
On Fri 01-02-13 13:33:08, Zheng Liu wrote:
> On Thu, Jan 31, 2013 at 05:50:55PM +0100, Jan Kara wrote:
> > On Thu 31-01-13 13:17:53, Zheng Liu wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > >
> > > By recording the phycisal block and status, extent status tree is able
> > > to track the status of every extents. When we call _map_blocks
> > > functions to lookup an extent or create a new written/unwritten/delayed
> > > extent, this extent will be inserted into extent status tree.
> > >
> > > We don't load all extents from disk in alloc_inode() because it costs
> > > too much memory, and if a file is opened and closed frequently it will
> > > takes too much time to load all extent information. So currently when
> > > we create/lookup an extent, this extent will be inserted into extent
> > > status tree. Hence, the extent status tree may not comprehensively
> > > contain all of the extents found in the file.
> > >
> > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > > ---
> > > fs/ext4/extents.c | 5 +++-
> > > fs/ext4/file.c | 6 +++--
> > > fs/ext4/inode.c | 73 ++++++++++++++++++++++++++++++++++++-------------------
> > > 3 files changed, 56 insertions(+), 28 deletions(-)
> > >
> > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > > index aa9a6d2..d23a654 100644
> > > --- a/fs/ext4/extents.c
> > > +++ b/fs/ext4/extents.c
> > > @@ -2074,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
> > > }
> > >
> > > /* This is possible iff next == next_del == EXT_MAX_BLOCKS */
> > > - if (next == next_del) {
> > > + if (next == next_del && next_del == EXT_MAX_BLOCKS) {
> > This doesn't seem to be related, does it?
>
> ext4_ext_next_allocated_block() will return EXT_MAX_BLOCKS when it
> reaches the end of file. ext4_find_delayed_extent() does the same
> thing.
Yes.
> Before tracking written/unwritten extent it is correct because
> next never equals to next_del unless both of them equal to
> EXT_MAX_BLOCKS. However, after that next is possible to equal to
> next_del when they don't reach the end of file.
Hum, I have to say I'm somewhat lost in ext4_find_delayed_extent().
ext4_es_find_extent() returns the first extent from extent status tree
containing at / after the given offset. So es.len == 0 iff there's extent
in status tree at or after newex->ec_block. The comment before that
condition suggest something different and I'd expect the return value to be
EXT_MAX_BLOCKS, not 0? Also ext4_find_delayed_extent() would be much less
confusing if it just returned position of next delalloc extent after given
block and didn't try to return length of a hole before that extent in
newex? That value can be easily computed from 'next' and 'next_del' in
ext4_fill_fiemap_extents() anyway... But that's slightly off topic.
To our current discussion ext4_find_delayed_extent() either returns 0 or
start of next delalloc extent (if we found extent of other type in the
status tree we just return 0) so I don't see how next_del == next for other
value than EXT_MAX_BLOCKS. But maybe I miss something.
> So we need to make sure next equals to next_del and both of them equal to
> EXT_MAX_BLOCKS. In this condition it indicates that we reach the end of
> file. Am I miss something?
>
> >
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index e09c7cf..f0dda2a 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -615,18 +615,27 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> > > (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
> > > ext4_da_update_reserve_space(inode, retval, 1);
> > > }
> > > - if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
> > > + if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> > > ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
> > >
> > > - if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> > > - int ret;
> > > -delayed_mapped:
> > > - /* delayed allocation blocks has been allocated */
> > > - ret = ext4_es_remove_extent(inode, map->m_lblk,
> > > - map->m_len);
> > > - if (ret < 0)
> > > - retval = ret;
> > > - }
> > > + if (retval > 0) {
> > > + int ret, status;
> > > +
> > > + if (flags & EXT4_GET_BLOCKS_PRE_IO)
> > > + status = EXTENT_STATUS_UNWRITTEN;
> > > + else if (flags & EXT4_GET_BLOCKS_CONVERT)
> > > + status = EXTENT_STATUS_WRITTEN;
> > > + else if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
> > > + status = EXTENT_STATUS_UNWRITTEN;
> > > + else if (flags & EXT4_GET_BLOCKS_CREATE)
> > > + status = EXTENT_STATUS_WRITTEN;
> > > + else
> > > + BUG_ON(1);
> > > +
> > > + ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> > > + map->m_pblk, status);
> > > + if (ret < 0)
> > > + retval = ret;
> > Hum, are you sure the extent status will be correct? Won't it be safer to
> > just use whatever we have in 'map'?
>
> Your meaning is that we need to ignore the error when we insert a extent
> into the extent status tree, right? But that would causes an
> inconsistency between status tree and extent tree. Further,
> ext4_es_insert_extent() returns EINVAL or ENOMEM. I believe that
> reporting an error is a better choice. What do you think?
No, I meant something else. For example you decide extent at given
position is 'UNWRITTEN' just on the basis that someone passed
EXT4_GET_BLOCKS_PRE_IO as get_blocks flags. How do you know? Cannot someone
pass EXT4_GET_BLOCKS_PRE_IO and we actually find out the extent at given
position is fully allocated extent (i.e. WRITTEN) so we do nothing? Then
you would cache bad state in the extent tree... That's why I'd rather see
we derive current 'status' from 'map' where we are sure to have correct
information and don't have to guess it from passed flags.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 5/9 v4] ext4: track all extent status in extent status tree
2013-02-04 11:27 ` Jan Kara
@ 2013-02-05 3:32 ` Zheng Liu
2013-02-05 12:08 ` Jan Kara
0 siblings, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2013-02-05 3:32 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o
On Mon, Feb 04, 2013 at 12:27:09PM +0100, Jan Kara wrote:
> On Fri 01-02-13 13:33:08, Zheng Liu wrote:
> > On Thu, Jan 31, 2013 at 05:50:55PM +0100, Jan Kara wrote:
> > > On Thu 31-01-13 13:17:53, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > >
> > > > By recording the phycisal block and status, extent status tree is able
> > > > to track the status of every extents. When we call _map_blocks
> > > > functions to lookup an extent or create a new written/unwritten/delayed
> > > > extent, this extent will be inserted into extent status tree.
> > > >
> > > > We don't load all extents from disk in alloc_inode() because it costs
> > > > too much memory, and if a file is opened and closed frequently it will
> > > > takes too much time to load all extent information. So currently when
> > > > we create/lookup an extent, this extent will be inserted into extent
> > > > status tree. Hence, the extent status tree may not comprehensively
> > > > contain all of the extents found in the file.
> > > >
> > > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > > > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > > > ---
> > > > fs/ext4/extents.c | 5 +++-
> > > > fs/ext4/file.c | 6 +++--
> > > > fs/ext4/inode.c | 73 ++++++++++++++++++++++++++++++++++++-------------------
> > > > 3 files changed, 56 insertions(+), 28 deletions(-)
> > > >
> > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > > > index aa9a6d2..d23a654 100644
> > > > --- a/fs/ext4/extents.c
> > > > +++ b/fs/ext4/extents.c
> > > > @@ -2074,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
> > > > }
> > > >
> > > > /* This is possible iff next == next_del == EXT_MAX_BLOCKS */
> > > > - if (next == next_del) {
> > > > + if (next == next_del && next_del == EXT_MAX_BLOCKS) {
> > > This doesn't seem to be related, does it?
> >
> > ext4_ext_next_allocated_block() will return EXT_MAX_BLOCKS when it
> > reaches the end of file. ext4_find_delayed_extent() does the same
> > thing.
> Yes.
>
> > Before tracking written/unwritten extent it is correct because
> > next never equals to next_del unless both of them equal to
> > EXT_MAX_BLOCKS. However, after that next is possible to equal to
> > next_del when they don't reach the end of file.
> Hum, I have to say I'm somewhat lost in ext4_find_delayed_extent().
> ext4_es_find_extent() returns the first extent from extent status tree
> containing at / after the given offset. So es.len == 0 iff there's extent
> in status tree at or after newex->ec_block. The comment before that
> condition suggest something different and I'd expect the return value to be
> EXT_MAX_BLOCKS, not 0? Also ext4_find_delayed_extent() would be much less
> confusing if it just returned position of next delalloc extent after given
> block and didn't try to return length of a hole before that extent in
> newex? That value can be easily computed from 'next' and 'next_del' in
> ext4_fill_fiemap_extents() anyway... But that's slightly off topic.
Yes, ext4_find_delayed_extent is not very clearly now, and I agree with
you that it need to be refactored. Certainly it is another topic.
>
> To our current discussion ext4_find_delayed_extent() either returns 0 or
> start of next delalloc extent (if we found extent of other type in the
> status tree we just return 0) so I don't see how next_del == next for other
> value than EXT_MAX_BLOCKS. But maybe I miss something.
After tracking all extent status in status tree, ext4_es_find_extent()
returns not only delayed extent, but also written/unwritten extents. So
it is possible that next_del == next and its value is not
EXT_MAX_BLOCKS. *But* in latest version ext4_es_find_extent() will be
changed to only return delayed extent. So the problem will be fixed.
>
> > So we need to make sure next equals to next_del and both of them equal to
> > EXT_MAX_BLOCKS. In this condition it indicates that we reach the end of
> > file. Am I miss something?
> >
> > >
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index e09c7cf..f0dda2a 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -615,18 +615,27 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> > > > (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
> > > > ext4_da_update_reserve_space(inode, retval, 1);
> > > > }
> > > > - if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
> > > > + if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> > > > ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
> > > >
> > > > - if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> > > > - int ret;
> > > > -delayed_mapped:
> > > > - /* delayed allocation blocks has been allocated */
> > > > - ret = ext4_es_remove_extent(inode, map->m_lblk,
> > > > - map->m_len);
> > > > - if (ret < 0)
> > > > - retval = ret;
> > > > - }
> > > > + if (retval > 0) {
> > > > + int ret, status;
> > > > +
> > > > + if (flags & EXT4_GET_BLOCKS_PRE_IO)
> > > > + status = EXTENT_STATUS_UNWRITTEN;
> > > > + else if (flags & EXT4_GET_BLOCKS_CONVERT)
> > > > + status = EXTENT_STATUS_WRITTEN;
> > > > + else if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
> > > > + status = EXTENT_STATUS_UNWRITTEN;
> > > > + else if (flags & EXT4_GET_BLOCKS_CREATE)
> > > > + status = EXTENT_STATUS_WRITTEN;
> > > > + else
> > > > + BUG_ON(1);
> > > > +
> > > > + ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> > > > + map->m_pblk, status);
> > > > + if (ret < 0)
> > > > + retval = ret;
> > > Hum, are you sure the extent status will be correct? Won't it be safer to
> > > just use whatever we have in 'map'?
> >
> > Your meaning is that we need to ignore the error when we insert a extent
> > into the extent status tree, right? But that would causes an
> > inconsistency between status tree and extent tree. Further,
> > ext4_es_insert_extent() returns EINVAL or ENOMEM. I believe that
> > reporting an error is a better choice. What do you think?
> No, I meant something else. For example you decide extent at given
> position is 'UNWRITTEN' just on the basis that someone passed
> EXT4_GET_BLOCKS_PRE_IO as get_blocks flags. How do you know? Cannot someone
> pass EXT4_GET_BLOCKS_PRE_IO and we actually find out the extent at given
> position is fully allocated extent (i.e. WRITTEN) so we do nothing? Then
> you would cache bad state in the extent tree... That's why I'd rather see
> we derive current 'status' from 'map' where we are sure to have correct
> information and don't have to guess it from passed flags.
First of all, we don't need to worry about this problem because we
always lookup an extent before trying to create it. So when it is an
written extent, we will return from ext4_map_blocks() directly and won't
try to create it. So status tree don't be touched.
Secondly, as far as I know, ext4_ext_map_blocks() will never return
EXT4_MAP_UNWRITTEN flags after it tries to allocate an unwritten extent.
So that means that 'm_flags' in map is incorrect. Maybe I miss
something.
Thanks,
- Zheng
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 5/9 v4] ext4: track all extent status in extent status tree
2013-02-05 3:32 ` Zheng Liu
@ 2013-02-05 12:08 ` Jan Kara
2013-02-05 13:24 ` Zheng Liu
0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2013-02-05 12:08 UTC (permalink / raw)
To: Zheng Liu; +Cc: Jan Kara, linux-ext4, Zheng Liu, Theodore Ts'o
On Tue 05-02-13 11:32:21, Zheng Liu wrote:
> On Mon, Feb 04, 2013 at 12:27:09PM +0100, Jan Kara wrote:
> > On Fri 01-02-13 13:33:08, Zheng Liu wrote:
> > > On Thu, Jan 31, 2013 at 05:50:55PM +0100, Jan Kara wrote:
> > > > On Thu 31-01-13 13:17:53, Zheng Liu wrote:
> > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > >
> > > > > By recording the phycisal block and status, extent status tree is able
> > > > > to track the status of every extents. When we call _map_blocks
> > > > > functions to lookup an extent or create a new written/unwritten/delayed
> > > > > extent, this extent will be inserted into extent status tree.
> > > > >
> > > > > We don't load all extents from disk in alloc_inode() because it costs
> > > > > too much memory, and if a file is opened and closed frequently it will
> > > > > takes too much time to load all extent information. So currently when
> > > > > we create/lookup an extent, this extent will be inserted into extent
> > > > > status tree. Hence, the extent status tree may not comprehensively
> > > > > contain all of the extents found in the file.
> > > > >
> > > > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > > > > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > > > > ---
> > > > > fs/ext4/extents.c | 5 +++-
> > > > > fs/ext4/file.c | 6 +++--
> > > > > fs/ext4/inode.c | 73 ++++++++++++++++++++++++++++++++++++-------------------
> > > > > 3 files changed, 56 insertions(+), 28 deletions(-)
> > > > >
> > > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > > > > index aa9a6d2..d23a654 100644
> > > > > --- a/fs/ext4/extents.c
> > > > > +++ b/fs/ext4/extents.c
> > > > > @@ -2074,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
> > > > > }
> > > > >
> > > > > /* This is possible iff next == next_del == EXT_MAX_BLOCKS */
> > > > > - if (next == next_del) {
> > > > > + if (next == next_del && next_del == EXT_MAX_BLOCKS) {
> > > > This doesn't seem to be related, does it?
> > >
> > > ext4_ext_next_allocated_block() will return EXT_MAX_BLOCKS when it
> > > reaches the end of file. ext4_find_delayed_extent() does the same
> > > thing.
> > Yes.
> >
> > > Before tracking written/unwritten extent it is correct because
> > > next never equals to next_del unless both of them equal to
> > > EXT_MAX_BLOCKS. However, after that next is possible to equal to
> > > next_del when they don't reach the end of file.
<snip>
> > To our current discussion ext4_find_delayed_extent() either returns 0 or
> > start of next delalloc extent (if we found extent of other type in the
> > status tree we just return 0) so I don't see how next_del == next for other
> > value than EXT_MAX_BLOCKS. But maybe I miss something.
>
> After tracking all extent status in status tree, ext4_es_find_extent()
> returns not only delayed extent, but also written/unwritten extents. So
> it is possible that next_del == next and its value is not
> EXT_MAX_BLOCKS. *But* in latest version ext4_es_find_extent() will be
> changed to only return delayed extent. So the problem will be fixed.
Ah, now I see. You added the condition checking whether extent is delayed
only to the newex->ec_start == 0 branch. So if we don't take that branch,
we could have returned an extent which isn't delayed.
IMHO it is a wrong decision for ext4_es_find_extent() to return only
delayed extents. That should really return any extent that contains given
block (or is after it). It is ext4_find_delayed_extent() that should really
be changed to return only delayed extents as its name suggests...
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index e09c7cf..f0dda2a 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -615,18 +615,27 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> > > > > (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
> > > > > ext4_da_update_reserve_space(inode, retval, 1);
> > > > > }
> > > > > - if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
> > > > > + if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> > > > > ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
> > > > >
> > > > > - if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> > > > > - int ret;
> > > > > -delayed_mapped:
> > > > > - /* delayed allocation blocks has been allocated */
> > > > > - ret = ext4_es_remove_extent(inode, map->m_lblk,
> > > > > - map->m_len);
> > > > > - if (ret < 0)
> > > > > - retval = ret;
> > > > > - }
> > > > > + if (retval > 0) {
> > > > > + int ret, status;
> > > > > +
> > > > > + if (flags & EXT4_GET_BLOCKS_PRE_IO)
> > > > > + status = EXTENT_STATUS_UNWRITTEN;
> > > > > + else if (flags & EXT4_GET_BLOCKS_CONVERT)
> > > > > + status = EXTENT_STATUS_WRITTEN;
> > > > > + else if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
> > > > > + status = EXTENT_STATUS_UNWRITTEN;
> > > > > + else if (flags & EXT4_GET_BLOCKS_CREATE)
> > > > > + status = EXTENT_STATUS_WRITTEN;
> > > > > + else
> > > > > + BUG_ON(1);
> > > > > +
> > > > > + ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> > > > > + map->m_pblk, status);
> > > > > + if (ret < 0)
> > > > > + retval = ret;
> > > > Hum, are you sure the extent status will be correct? Won't it be safer to
> > > > just use whatever we have in 'map'?
> > >
> > > Your meaning is that we need to ignore the error when we insert a extent
> > > into the extent status tree, right? But that would causes an
> > > inconsistency between status tree and extent tree. Further,
> > > ext4_es_insert_extent() returns EINVAL or ENOMEM. I believe that
> > > reporting an error is a better choice. What do you think?
> > No, I meant something else. For example you decide extent at given
> > position is 'UNWRITTEN' just on the basis that someone passed
> > EXT4_GET_BLOCKS_PRE_IO as get_blocks flags. How do you know? Cannot someone
> > pass EXT4_GET_BLOCKS_PRE_IO and we actually find out the extent at given
> > position is fully allocated extent (i.e. WRITTEN) so we do nothing? Then
> > you would cache bad state in the extent tree... That's why I'd rather see
> > we derive current 'status' from 'map' where we are sure to have correct
> > information and don't have to guess it from passed flags.
>
> First of all, we don't need to worry about this problem because we
> always lookup an extent before trying to create it. So when it is an
> written extent, we will return from ext4_map_blocks() directly and won't
> try to create it. So status tree don't be touched.
So my argument isn't as much about whether you can deduce the correct
result from flags passed to ext4_map_blocks() but rather that it simply
isn't the right place where to look. The right place where to look what
extent is at given position is 'map' where we store what we found. And you
are right that ext4_ext_map_blocks() isn't properly returning
EXT4_MAP_UNWRITTEN in some cases - thanks for noticing that - but then the
right answer is to fix ext4_ext_map_blocks() to return it and not to hack
around that in extent cache code... Believe me it will save us quite some
head scratching later.
> Secondly, as far as I know, ext4_ext_map_blocks() will never return
> EXT4_MAP_UNWRITTEN flags after it tries to allocate an unwritten extent.
> So that means that 'm_flags' in map is incorrect. Maybe I miss
> something.
Bye
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 5/9 v4] ext4: track all extent status in extent status tree
2013-02-05 12:08 ` Jan Kara
@ 2013-02-05 13:24 ` Zheng Liu
2013-02-05 13:27 ` Jan Kara
0 siblings, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2013-02-05 13:24 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o
On Tue, Feb 05, 2013 at 01:08:54PM +0100, Jan Kara wrote:
[snip]
> > After tracking all extent status in status tree, ext4_es_find_extent()
> > returns not only delayed extent, but also written/unwritten extents. So
> > it is possible that next_del == next and its value is not
> > EXT_MAX_BLOCKS. *But* in latest version ext4_es_find_extent() will be
> > changed to only return delayed extent. So the problem will be fixed.
> Ah, now I see. You added the condition checking whether extent is delayed
> only to the newex->ec_start == 0 branch. So if we don't take that branch,
> we could have returned an extent which isn't delayed.
>
> IMHO it is a wrong decision for ext4_es_find_extent() to return only
> delayed extents. That should really return any extent that contains given
> block (or is after it). It is ext4_find_delayed_extent() that should really
> be changed to return only delayed extents as its name suggests...
I revised the patch series and found that ext4_es_find_extent() is
only used to lookup a delayed extent by the following functions:
- ext4_find_delalloc_range()
- ext4_find_delayed_extent()
- ext4_seek_data()
- ext4_seek_hole()
So IMHO the better decision is to rename it to ext4_es_find_delayed_extent()
and let it only return delayed extent. In patch 6/9, a new function
called ext4_es_lookup_extent() is defined to do this job that returns an
extent that contains given block. What do you think?
[snip]
> > > > > Hum, are you sure the extent status will be correct? Won't it be safer to
> > > > > just use whatever we have in 'map'?
> > > >
> > > > Your meaning is that we need to ignore the error when we insert a extent
> > > > into the extent status tree, right? But that would causes an
> > > > inconsistency between status tree and extent tree. Further,
> > > > ext4_es_insert_extent() returns EINVAL or ENOMEM. I believe that
> > > > reporting an error is a better choice. What do you think?
> > > No, I meant something else. For example you decide extent at given
> > > position is 'UNWRITTEN' just on the basis that someone passed
> > > EXT4_GET_BLOCKS_PRE_IO as get_blocks flags. How do you know? Cannot someone
> > > pass EXT4_GET_BLOCKS_PRE_IO and we actually find out the extent at given
> > > position is fully allocated extent (i.e. WRITTEN) so we do nothing? Then
> > > you would cache bad state in the extent tree... That's why I'd rather see
> > > we derive current 'status' from 'map' where we are sure to have correct
> > > information and don't have to guess it from passed flags.
> >
> > First of all, we don't need to worry about this problem because we
> > always lookup an extent before trying to create it. So when it is an
> > written extent, we will return from ext4_map_blocks() directly and won't
> > try to create it. So status tree don't be touched.
> So my argument isn't as much about whether you can deduce the correct
> result from flags passed to ext4_map_blocks() but rather that it simply
> isn't the right place where to look. The right place where to look what
> extent is at given position is 'map' where we store what we found. And you
> are right that ext4_ext_map_blocks() isn't properly returning
> EXT4_MAP_UNWRITTEN in some cases - thanks for noticing that - but then the
> right answer is to fix ext4_ext_map_blocks() to return it and not to hack
> around that in extent cache code... Believe me it will save us quite some
> head scratching later.
Fair enough. I will try to fix it.
Thanks for your suggestion,
- Zheng
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 5/9 v4] ext4: track all extent status in extent status tree
2013-02-05 13:24 ` Zheng Liu
@ 2013-02-05 13:27 ` Jan Kara
0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2013-02-05 13:27 UTC (permalink / raw)
To: Zheng Liu; +Cc: Jan Kara, linux-ext4, Zheng Liu, Theodore Ts'o
On Tue 05-02-13 21:24:33, Zheng Liu wrote:
> On Tue, Feb 05, 2013 at 01:08:54PM +0100, Jan Kara wrote:
> [snip]
> > > After tracking all extent status in status tree, ext4_es_find_extent()
> > > returns not only delayed extent, but also written/unwritten extents. So
> > > it is possible that next_del == next and its value is not
> > > EXT_MAX_BLOCKS. *But* in latest version ext4_es_find_extent() will be
> > > changed to only return delayed extent. So the problem will be fixed.
> > Ah, now I see. You added the condition checking whether extent is delayed
> > only to the newex->ec_start == 0 branch. So if we don't take that branch,
> > we could have returned an extent which isn't delayed.
> >
> > IMHO it is a wrong decision for ext4_es_find_extent() to return only
> > delayed extents. That should really return any extent that contains given
> > block (or is after it). It is ext4_find_delayed_extent() that should really
> > be changed to return only delayed extents as its name suggests...
>
> I revised the patch series and found that ext4_es_find_extent() is
> only used to lookup a delayed extent by the following functions:
>
> - ext4_find_delalloc_range()
> - ext4_find_delayed_extent()
> - ext4_seek_data()
> - ext4_seek_hole()
>
> So IMHO the better decision is to rename it to ext4_es_find_delayed_extent()
> and let it only return delayed extent. In patch 6/9, a new function
> called ext4_es_lookup_extent() is defined to do this job that returns an
> extent that contains given block. What do you think?
Yes, this works for me. Thanks for looking into it.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2013-02-05 13:27 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-01-31 5:17 [PATCH 0/9 v4] ext4: extent status tree (step2) Zheng Liu
2013-01-31 5:17 ` [PATCH 1/9 v4] ext4: refine extent status tree Zheng Liu
2013-01-31 5:17 ` [PATCH 2/9 v4] ext4: remove EXT4_MAP_FROM_CLUSTER flag Zheng Liu
2013-01-31 5:17 ` [PATCH 3/9 v4] ext4: add physical block and status member into extent status tree Zheng Liu
2013-01-31 5:17 ` [PATCH 4/9 v4] ext4: adjust interfaces of " Zheng Liu
2013-01-31 16:02 ` Jan Kara
2013-02-01 2:51 ` Zheng Liu
2013-01-31 5:17 ` [PATCH 5/9 v4] ext4: track all extent status in " Zheng Liu
2013-01-31 16:50 ` Jan Kara
2013-02-01 5:33 ` Zheng Liu
2013-02-04 11:27 ` Jan Kara
2013-02-05 3:32 ` Zheng Liu
2013-02-05 12:08 ` Jan Kara
2013-02-05 13:24 ` Zheng Liu
2013-02-05 13:27 ` Jan Kara
2013-01-31 5:17 ` [PATCH 6/9 v4] ext4: lookup block mapping " Zheng Liu
2013-01-31 5:17 ` [PATCH 7/9 v4] ext4: remove single extent cache Zheng Liu
2013-01-31 17:05 ` Jan Kara
2013-02-01 3:08 ` Zheng Liu
2013-01-31 5:17 ` [PATCH 8/9 v4] ext4: adjust some functions for reclaiming extents from extent status tree Zheng Liu
2013-01-31 5:17 ` [PATCH 9/9 v4] ext4: reclaim " Zheng Liu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).