* [PATCH 01/10 v3] ext4: refine extent status tree
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
@ 2013-01-23 12:03 ` Zheng Liu
2013-01-23 12:03 ` [PATCH 02/10 v3] ext4: remove EXT4_MAP_FROM_CLUSTER flag Zheng Liu
` (8 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:03 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
This commit refines the extent status tree code.
1) A prefix 'es_' is added to to the extent status tree structure
members.
2) Refactored es_remove_extent() so that __es_remove_extent() can be
used by es_insert_extent() to remove the old extent entry(-ies) before
inserting a new one.
3) Rename extent_status_end() to ext4_es_end()
4) ext4_es_can_be_merged() is defined to check whether two extents can
be merged or not.
5) Update and clarified comments.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents.c | 21 +--
fs/ext4/extents_status.c | 322 +++++++++++++++++++++++++-------------------
fs/ext4/extents_status.h | 8 +-
fs/ext4/file.c | 12 +-
include/trace/events/ext4.h | 40 +++---
5 files changed, 220 insertions(+), 183 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5ae1674..f7bf616 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3525,13 +3525,14 @@ static int ext4_find_delalloc_range(struct inode *inode,
{
struct extent_status es;
- es.start = lblk_start;
- ext4_es_find_extent(inode, &es);
- if (es.len == 0)
+ es.es_lblk = lblk_start;
+ (void)ext4_es_find_extent(inode, &es);
+ if (es.es_len == 0)
return 0; /* there is no delay extent in this tree */
- else if (es.start <= lblk_start && lblk_start < es.start + es.len)
+ else if (es.es_lblk <= lblk_start &&
+ lblk_start < es.es_lblk + es.es_len)
return 1;
- else if (lblk_start <= es.start && es.start <= lblk_end)
+ else if (lblk_start <= es.es_lblk && es.es_lblk <= lblk_end)
return 1;
else
return 0;
@@ -4567,7 +4568,7 @@ static int ext4_find_delayed_extent(struct inode *inode,
struct extent_status es;
ext4_lblk_t next_del;
- es.start = newex->ec_block;
+ es.es_lblk = newex->ec_block;
next_del = ext4_es_find_extent(inode, &es);
if (newex->ec_start == 0) {
@@ -4575,18 +4576,18 @@ static int ext4_find_delayed_extent(struct inode *inode,
* No extent in extent-tree contains block @newex->ec_start,
* then the block may stay in 1)a hole or 2)delayed-extent.
*/
- if (es.len == 0)
+ if (es.es_len == 0)
/* A hole found. */
return 0;
- if (es.start > newex->ec_block) {
+ if (es.es_lblk > newex->ec_block) {
/* A hole found. */
- newex->ec_len = min(es.start - newex->ec_block,
+ newex->ec_len = min(es.es_lblk - newex->ec_block,
newex->ec_len);
return 0;
}
- newex->ec_len = es.start + es.len - newex->ec_block;
+ newex->ec_len = es.es_lblk + es.es_len - newex->ec_block;
}
return next_del;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 564d981..1ce4a69 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -23,40 +23,53 @@
* (e.g. Reservation space warning), and provide extent-level locking.
* Delay extent tree is the first step to achieve this goal. It is
* original built by Yongqiang Yang. At that time it is called delay
- * extent tree, whose goal is only track delay extent in memory to
+ * extent tree, whose goal is only track delayed extents in memory to
* simplify the implementation of fiemap and bigalloc, and introduce
* lseek SEEK_DATA/SEEK_HOLE support. That is why it is still called
- * delay extent tree at the following comment. But for better
- * understand what it does, it has been rename to extent status tree.
+ * delay extent tree at the first commit. But for better understand
+ * what it does, it has been rename to extent status tree.
*
- * Currently the first step has been done. All delay extents are
- * tracked in the tree. It maintains the delay extent when a delay
- * allocation is issued, and the delay extent is written out or
+ * Step1:
+ * Currently the first step has been done. All delayed extents are
+ * tracked in the tree. It maintains the delayed extent when a delayed
+ * allocation is issued, and the delayed extent is written out or
* invalidated. Therefore the implementation of fiemap and bigalloc
* are simplified, and SEEK_DATA/SEEK_HOLE are introduced.
*
* The following comment describes the implemenmtation of extent
* status tree and future works.
+ *
+ * Step2:
+ * In this step all extent status is tracked by extent status tree.
+ * Thus, we can first try to lookup a block mapping in this tree before
+ * finding it in extent tree. Hence, single extent cache can be removed
+ * because extent status tree can do a better job. Extents in status
+ * tree are loaded on-demand. Therefore, the extent status tree may not
+ * contain all of the extents in a file. Meanwhile we add
+ * nr_cached_objects and free_cached_objects callback functions to
+ * reclaim extents from extent status tree. These functions make us
+ * reclaim written/unwritten extents from the tree under a heavy memory
+ * pressure. Delayed extents will not be reclaimed because fiemap,
+ * bigalloc, and seek_data/hole need it.
*/
/*
- * extents status tree implementation for ext4.
+ * Extent status tree implementation for ext4.
*
*
* ==========================================================================
- * Extents status encompass delayed extents and extent locks
+ * Extent status tree tracks all extent status.
*
- * 1. Why delayed extent implementation ?
+ * 1. Why we need to implement extent status tree?
*
- * Without delayed extent, ext4 identifies a delayed extent by looking
+ * Without extent status tree, ext4 identifies a delayed extent by looking
* up page cache, this has several deficiencies - complicated, buggy,
* and inefficient code.
*
- * FIEMAP, SEEK_HOLE/DATA, bigalloc, punch hole and writeout all need
- * to know if a block or a range of blocks are belonged to a delayed
- * extent.
+ * FIEMAP, SEEK_HOLE/DATA, bigalloc, and writeout all need to know if a
+ * block or a range of blocks are belonged to a delayed extent.
*
- * Let us have a look at how they do without delayed extents implementation.
+ * Let us have a look at how they do without extent status tree.
* -- FIEMAP
* FIEMAP looks up page cache to identify delayed allocations from holes.
*
@@ -68,47 +81,48 @@
* already under delayed allocation or not to determine whether
* quota reserving is needed for the cluster.
*
- * -- punch hole
- * punch hole looks up page cache to identify a delayed extent.
- *
* -- writeout
* Writeout looks up whole page cache to see if a buffer is
* mapped, If there are not very many delayed buffers, then it is
* time comsuming.
*
- * With delayed extents implementation, FIEMAP, SEEK_HOLE/DATA,
+ * With extent status tree implementation, FIEMAP, SEEK_HOLE/DATA,
* bigalloc and writeout can figure out if a block or a range of
* blocks is under delayed allocation(belonged to a delayed extent) or
- * not by searching the delayed extent tree.
+ * not by searching the extent tree.
*
*
* ==========================================================================
- * 2. ext4 delayed extents impelmentation
+ * 2. Ext4 extent status tree impelmentation
*
- * -- delayed extent
- * A delayed extent is a range of blocks which are contiguous
- * logically and under delayed allocation. Unlike extent in
- * ext4, delayed extent in ext4 is a in-memory struct, there is
- * no corresponding on-disk data. There is no limit on length of
- * delayed extent, so a delayed extent can contain as many blocks
- * as they are contiguous logically.
+ * -- extent
+ * A extent is a range of blocks which are contiguous logically and
+ * physically. Unlike extent in extent tree, this extent in ext4 is
+ * a in-memory struct, there is no corresponding on-disk data. There
+ * is no limit on length of extent, so an extent can contain as many
+ * blocks as they are contiguous logically and physically.
*
- * -- delayed extent tree
- * Every inode has a delayed extent tree and all under delayed
- * allocation blocks are added to the tree as delayed extents.
- * Delayed extents in the tree are ordered by logical block no.
+ * -- extent status tree
+ * Every inode has an extent status tree and all allocation blocks
+ * are added to the tree with different status. The extent in the
+ * tree are ordered by logical block no.
*
- * -- operations on a delayed extent tree
- * There are three operations on a delayed extent tree: find next
- * delayed extent, adding a space(a range of blocks) and removing
- * a space.
+ * -- operations on a extent status tree
+ * There are three important operations on a delayed extent tree: find
+ * next extent, adding a extent(a range of blocks) and removing a extent.
*
- * -- race on a delayed extent tree
- * Delayed extent tree is protected inode->i_es_lock.
+ * -- race on a extent status tree
+ * Extent status tree is protected inode->i_es_lock.
+ *
+ * -- memory consumption
+ * Fragmented extent tree will make extent status tree cost too much
+ * memory. Hence, we will reclaim written/unwritten extents from the
+ * tree under a heavy memory pressure.
*
*
* ==========================================================================
- * 3. performance analysis
+ * 3. Performance analysis
+ *
* -- overhead
* 1. There is a cache extent for write access, so if writes are
* not very random, adding space operaions are in O(1) time.
@@ -120,15 +134,19 @@
*
* ==========================================================================
* 4. TODO list
- * -- Track all extent status
*
- * -- Improve get block process
+ * -- Refactor delayed reserve space
*
* -- Extent-level locking
*/
static struct kmem_cache *ext4_es_cachep;
+static int __es_insert_extent(struct ext4_es_tree *tree,
+ struct extent_status *newes);
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+ ext4_lblk_t end);
+
int __init ext4_init_es(void)
{
ext4_es_cachep = KMEM_CACHE(extent_status, SLAB_RECLAIM_ACCOUNT);
@@ -161,7 +179,7 @@ static void ext4_es_print_tree(struct inode *inode)
while (node) {
struct extent_status *es;
es = rb_entry(node, struct extent_status, rb_node);
- printk(KERN_DEBUG " [%u/%u)", es->start, es->len);
+ printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
node = rb_next(node);
}
printk(KERN_DEBUG "\n");
@@ -170,10 +188,10 @@ static void ext4_es_print_tree(struct inode *inode)
#define ext4_es_print_tree(inode)
#endif
-static inline ext4_lblk_t extent_status_end(struct extent_status *es)
+static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
{
- BUG_ON(es->start + es->len < es->start);
- return es->start + es->len - 1;
+ BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
+ return es->es_lblk + es->es_len - 1;
}
/*
@@ -181,25 +199,25 @@ static inline ext4_lblk_t extent_status_end(struct extent_status *es)
* it can't be found, try to find next extent.
*/
static struct extent_status *__es_tree_search(struct rb_root *root,
- ext4_lblk_t offset)
+ ext4_lblk_t lblk)
{
struct rb_node *node = root->rb_node;
struct extent_status *es = NULL;
while (node) {
es = rb_entry(node, struct extent_status, rb_node);
- if (offset < es->start)
+ if (lblk < es->es_lblk)
node = node->rb_left;
- else if (offset > extent_status_end(es))
+ else if (lblk > ext4_es_end(es))
node = node->rb_right;
else
return es;
}
- if (es && offset < es->start)
+ if (es && lblk < es->es_lblk)
return es;
- if (es && offset > extent_status_end(es)) {
+ if (es && lblk > ext4_es_end(es)) {
node = rb_next(&es->rb_node);
return node ? rb_entry(node, struct extent_status, rb_node) :
NULL;
@@ -209,8 +227,8 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
}
/*
- * ext4_es_find_extent: find the 1st delayed extent covering @es->start
- * if it exists, otherwise, the next extent after @es->start.
+ * ext4_es_find_extent: find the 1st delayed extent covering @es->lblk
+ * if it exists, otherwise, the next extent after @es->lblk.
*
* @inode: the inode which owns delayed extents
* @es: delayed extent that we found
@@ -226,7 +244,7 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
struct rb_node *node;
ext4_lblk_t ret = EXT_MAX_BLOCKS;
- trace_ext4_es_find_extent_enter(inode, es->start);
+ trace_ext4_es_find_extent_enter(inode, es->es_lblk);
read_lock(&EXT4_I(inode)->i_es_lock);
tree = &EXT4_I(inode)->i_es_tree;
@@ -234,25 +252,25 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
/* find delay extent in cache firstly */
if (tree->cache_es) {
es1 = tree->cache_es;
- if (in_range(es->start, es1->start, es1->len)) {
+ if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
es_debug("%u cached by [%u/%u)\n",
- es->start, es1->start, es1->len);
+ es->es_lblk, es1->es_lblk, es1->es_len);
goto out;
}
}
- es->len = 0;
- es1 = __es_tree_search(&tree->root, es->start);
+ es->es_len = 0;
+ es1 = __es_tree_search(&tree->root, es->es_lblk);
out:
if (es1) {
tree->cache_es = es1;
- es->start = es1->start;
- es->len = es1->len;
+ es->es_lblk = es1->es_lblk;
+ es->es_len = es1->es_len;
node = rb_next(&es1->rb_node);
if (node) {
es1 = rb_entry(node, struct extent_status, rb_node);
- ret = es1->start;
+ ret = es1->es_lblk;
}
}
@@ -263,14 +281,14 @@ out:
}
static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t start, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
{
struct extent_status *es;
es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
if (es == NULL)
return NULL;
- es->start = start;
- es->len = len;
+ es->es_lblk = lblk;
+ es->es_len = len;
return es;
}
@@ -279,6 +297,20 @@ static void ext4_es_free_extent(struct extent_status *es)
kmem_cache_free(ext4_es_cachep, es);
}
+/*
+ * Check whether or not two extents can be merged
+ * Condition:
+ * - logical block number is contiguous
+ */
+static int ext4_es_can_be_merged(struct extent_status *es1,
+ struct extent_status *es2)
+{
+ if (es1->es_lblk + es1->es_len != es2->es_lblk)
+ return 0;
+
+ return 1;
+}
+
static struct extent_status *
ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
{
@@ -290,8 +322,8 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
return es;
es1 = rb_entry(node, struct extent_status, rb_node);
- if (es->start == extent_status_end(es1) + 1) {
- es1->len += es->len;
+ if (ext4_es_can_be_merged(es1, es)) {
+ es1->es_len += es->es_len;
rb_erase(&es->rb_node, &tree->root);
ext4_es_free_extent(es);
es = es1;
@@ -311,8 +343,8 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
return es;
es1 = rb_entry(node, struct extent_status, rb_node);
- if (es1->start == extent_status_end(es) + 1) {
- es->len += es1->len;
+ if (ext4_es_can_be_merged(es, es1)) {
+ es->es_len += es1->es_len;
rb_erase(node, &tree->root);
ext4_es_free_extent(es1);
}
@@ -320,92 +352,84 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
return es;
}
-static int __es_insert_extent(struct ext4_es_tree *tree, ext4_lblk_t offset,
- ext4_lblk_t len)
+static int __es_insert_extent(struct ext4_es_tree *tree,
+ struct extent_status *newes)
{
struct rb_node **p = &tree->root.rb_node;
struct rb_node *parent = NULL;
struct extent_status *es;
- ext4_lblk_t end = offset + len - 1;
-
- BUG_ON(end < offset);
- es = tree->cache_es;
- if (es && offset == (extent_status_end(es) + 1)) {
- es_debug("cached by [%u/%u)\n", es->start, es->len);
- es->len += len;
- es = ext4_es_try_to_merge_right(tree, es);
- goto out;
- } else if (es && es->start == end + 1) {
- es_debug("cached by [%u/%u)\n", es->start, es->len);
- es->start = offset;
- es->len += len;
- es = ext4_es_try_to_merge_left(tree, es);
- goto out;
- } else if (es && es->start <= offset &&
- end <= extent_status_end(es)) {
- es_debug("cached by [%u/%u)\n", es->start, es->len);
- goto out;
- }
+
+ /* invalidate cache */
+ tree->cache_es = NULL;
while (*p) {
parent = *p;
es = rb_entry(parent, struct extent_status, rb_node);
- if (offset < es->start) {
- if (es->start == end + 1) {
- es->start = offset;
- es->len += len;
+ if (newes->es_lblk < es->es_lblk) {
+ if (ext4_es_can_be_merged(newes, es)) {
+ es->es_lblk = newes->es_lblk;
+ es->es_len += newes->es_len;
es = ext4_es_try_to_merge_left(tree, es);
goto out;
}
p = &(*p)->rb_left;
- } else if (offset > extent_status_end(es)) {
- if (offset == extent_status_end(es) + 1) {
- es->len += len;
+ } else if (newes->es_lblk > ext4_es_end(es)) {
+ if (ext4_es_can_be_merged(es, newes)) {
+ es->es_len += newes->es_len;
es = ext4_es_try_to_merge_right(tree, es);
goto out;
}
p = &(*p)->rb_right;
} else {
- if (extent_status_end(es) <= end)
- es->len = offset - es->start + len;
- goto out;
+ BUG_ON(1);
+ return -EINVAL;
}
}
- es = ext4_es_alloc_extent(offset, len);
+ es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
if (!es)
return -ENOMEM;
rb_link_node(&es->rb_node, parent, p);
rb_insert_color(&es->rb_node, &tree->root);
out:
- tree->cache_es = es;
return 0;
}
/*
- * ext4_es_insert_extent() adds a space to a delayed extent tree.
- * Caller holds inode->i_es_lock.
+ * ext4_es_insert_extent() adds a space to a extent status tree.
*
* ext4_es_insert_extent is called by ext4_da_write_begin and
* ext4_es_remove_extent.
*
* Return 0 on success, error code on failure.
*/
-int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
+int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len)
{
struct ext4_es_tree *tree;
+ struct extent_status newes;
+ ext4_lblk_t end = lblk + len - 1;
int err = 0;
- trace_ext4_es_insert_extent(inode, offset, len);
+ trace_ext4_es_insert_extent(inode, lblk, len);
es_debug("add [%u/%u) to extent status tree of inode %lu\n",
- offset, len, inode->i_ino);
+ lblk, len, inode->i_ino);
+
+ BUG_ON(end < lblk);
+
+ newes.es_lblk = lblk;
+ newes.es_len = len;
write_lock(&EXT4_I(inode)->i_es_lock);
tree = &EXT4_I(inode)->i_es_tree;
- err = __es_insert_extent(tree, offset, len);
+ err = __es_remove_extent(tree, lblk, end);
+ if (err != 0)
+ goto error;
+ err = __es_insert_extent(tree, &newes);
+
+error:
write_unlock(&EXT4_I(inode)->i_es_lock);
ext4_es_print_tree(inode);
@@ -413,57 +437,45 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
return err;
}
-/*
- * ext4_es_remove_extent() removes a space from a delayed extent tree.
- * Caller holds inode->i_es_lock.
- *
- * Return 0 on success, error code on failure.
- */
-int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
- ext4_lblk_t len)
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+ ext4_lblk_t end)
{
struct rb_node *node;
- struct ext4_es_tree *tree;
struct extent_status *es;
struct extent_status orig_es;
- ext4_lblk_t len1, len2, end;
+ ext4_lblk_t len1, len2;
int err = 0;
- trace_ext4_es_remove_extent(inode, offset, len);
- es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
- offset, len, inode->i_ino);
-
- end = offset + len - 1;
- BUG_ON(end < offset);
- write_lock(&EXT4_I(inode)->i_es_lock);
- tree = &EXT4_I(inode)->i_es_tree;
- es = __es_tree_search(&tree->root, offset);
+ es = __es_tree_search(&tree->root, lblk);
if (!es)
goto out;
- if (es->start > end)
+ if (es->es_lblk > end)
goto out;
/* Simply invalidate cache_es. */
tree->cache_es = NULL;
- orig_es.start = es->start;
- orig_es.len = es->len;
- len1 = offset > es->start ? offset - es->start : 0;
- len2 = extent_status_end(es) > end ?
- extent_status_end(es) - end : 0;
+ orig_es.es_lblk = es->es_lblk;
+ orig_es.es_len = es->es_len;
+ len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
+ len2 = ext4_es_end(es) > end ? ext4_es_end(es) - end : 0;
if (len1 > 0)
- es->len = len1;
+ es->es_len = len1;
if (len2 > 0) {
if (len1 > 0) {
- err = __es_insert_extent(tree, end + 1, len2);
+ struct extent_status newes;
+
+ newes.es_lblk = end + 1;
+ newes.es_len = len2;
+ err = __es_insert_extent(tree, &newes);
if (err) {
- es->start = orig_es.start;
- es->len = orig_es.len;
+ es->es_lblk = orig_es.es_lblk;
+ es->es_len = orig_es.es_len;
goto out;
}
} else {
- es->start = end + 1;
- es->len = len2;
+ es->es_lblk = end + 1;
+ es->es_len = len2;
}
goto out;
}
@@ -476,7 +488,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
es = NULL;
}
- while (es && extent_status_end(es) <= end) {
+ while (es && ext4_es_end(es) <= end) {
node = rb_next(&es->rb_node);
rb_erase(&es->rb_node, &tree->root);
ext4_es_free_extent(es);
@@ -487,13 +499,39 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
es = rb_entry(node, struct extent_status, rb_node);
}
- if (es && es->start < end + 1) {
- len1 = extent_status_end(es) - end;
- es->start = end + 1;
- es->len = len1;
+ if (es && es->es_lblk < end + 1) {
+ len1 = ext4_es_end(es) - end;
+ es->es_lblk = end + 1;
+ es->es_len = len1;
}
out:
+ return err;
+}
+
+/*
+ * ext4_es_remove_extent() removes a space from a extent status tree.
+ *
+ * Return 0 on success, error code on failure.
+ */
+int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
+ ext4_lblk_t len)
+{
+ struct ext4_es_tree *tree;
+ ext4_lblk_t end;
+ int err = 0;
+
+ trace_ext4_es_remove_extent(inode, lblk, len);
+ es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
+ lblk, len, inode->i_ino);
+
+ end = lblk + len - 1;
+ BUG_ON(end < lblk);
+
+ tree = &EXT4_I(inode)->i_es_tree;
+
+ write_lock(&EXT4_I(inode)->i_es_lock);
+ err = __es_remove_extent(tree, lblk, end);
write_unlock(&EXT4_I(inode)->i_es_lock);
ext4_es_print_tree(inode);
return err;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 077f82d..81e9339 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -22,8 +22,8 @@
struct extent_status {
struct rb_node rb_node;
- ext4_lblk_t start; /* first block extent covers */
- ext4_lblk_t len; /* length of extent in block */
+ ext4_lblk_t es_lblk; /* first logical block extent covers */
+ ext4_lblk_t es_len; /* length of extent in block */
};
struct ext4_es_tree {
@@ -35,9 +35,9 @@ extern int __init ext4_init_es(void);
extern void ext4_exit_es(void);
extern void ext4_es_init_tree(struct ext4_es_tree *tree);
-extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
-extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
struct extent_status *es);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 405565a..718c49f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -464,10 +464,9 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
* If there is a delay extent at this offset,
* it will be as a data.
*/
- es.start = last;
+ es.es_lblk = last;
(void)ext4_es_find_extent(inode, &es);
- if (last >= es.start &&
- last < es.start + es.len) {
+ if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
if (last != start)
dataoff = last << blkbits;
break;
@@ -549,11 +548,10 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
* If there is a delay extent at this offset,
* we will skip this extent.
*/
- es.start = last;
+ es.es_lblk = last;
(void)ext4_es_find_extent(inode, &es);
- if (last >= es.start &&
- last < es.start + es.len) {
- last = es.start + es.len;
+ if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+ last = es.es_lblk + es.es_len;
holeoff = last << blkbits;
continue;
}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 7e8c36b..952628a 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2068,75 +2068,75 @@ TRACE_EVENT(ext4_ext_remove_space_done,
);
TRACE_EVENT(ext4_es_insert_extent,
- TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
- TP_ARGS(inode, start, len),
+ TP_ARGS(inode, lblk, len),
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( loff_t, start )
+ __field( loff_t, lblk )
__field( loff_t, len )
),
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->start = start;
+ __entry->lblk = lblk;
__entry->len = len;
),
TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->start, __entry->len)
+ __entry->lblk, __entry->len)
);
TRACE_EVENT(ext4_es_remove_extent,
- TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
- TP_ARGS(inode, start, len),
+ TP_ARGS(inode, lblk, len),
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( loff_t, start )
+ __field( loff_t, lblk )
__field( loff_t, len )
),
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->start = start;
+ __entry->lblk = lblk;
__entry->len = len;
),
TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->start, __entry->len)
+ __entry->lblk, __entry->len)
);
TRACE_EVENT(ext4_es_find_extent_enter,
- TP_PROTO(struct inode *inode, ext4_lblk_t start),
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
- TP_ARGS(inode, start),
+ TP_ARGS(inode, lblk),
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( ext4_lblk_t, start )
+ __field( ext4_lblk_t, lblk )
),
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->start = start;
+ __entry->lblk = lblk;
),
- TP_printk("dev %d,%d ino %lu start %u",
+ TP_printk("dev %d,%d ino %lu lblk %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
- (unsigned long) __entry->ino, __entry->start)
+ (unsigned long) __entry->ino, __entry->lblk)
);
TRACE_EVENT(ext4_es_find_extent_exit,
@@ -2148,7 +2148,7 @@ TRACE_EVENT(ext4_es_find_extent_exit,
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( ext4_lblk_t, start )
+ __field( ext4_lblk_t, lblk )
__field( ext4_lblk_t, len )
__field( ext4_lblk_t, ret )
),
@@ -2156,15 +2156,15 @@ TRACE_EVENT(ext4_es_find_extent_exit,
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->start = es->start;
- __entry->len = es->len;
+ __entry->lblk = es->es_lblk;
+ __entry->len = es->es_len;
__entry->ret = ret;
),
TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->start, __entry->len, __entry->ret)
+ __entry->lblk, __entry->len, __entry->ret)
);
#endif /* _TRACE_EXT4_H */
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 02/10 v3] ext4: remove EXT4_MAP_FROM_CLUSTER flag
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
2013-01-23 12:03 ` [PATCH 01/10 v3] ext4: refine extent status tree Zheng Liu
@ 2013-01-23 12:03 ` Zheng Liu
2013-01-23 12:03 ` [PATCH 03/10 v3] ext4: add physical block and status member into extent status tree Zheng Liu
` (7 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:03 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
EXT4_MAP_FROM_CLUSTER flag is only used in ext4_da_map_blocks() and
ext4_ext_map_blocks() to indicate whether a cluster has been reserved or
not. But in ext4_da_map_blocks() we can check it directly. Meanwhile
this flag couldn't appear on the buffer head. So it can be replaced
with a local variable.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ext4.h | 15 +--------------
fs/ext4/extents.c | 18 ++++--------------
fs/ext4/inode.c | 11 ++---------
3 files changed, 7 insertions(+), 37 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8462eb3..36145ef1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -155,17 +155,9 @@ struct ext4_allocation_request {
#define EXT4_MAP_UNWRITTEN (1 << BH_Unwritten)
#define EXT4_MAP_BOUNDARY (1 << BH_Boundary)
#define EXT4_MAP_UNINIT (1 << BH_Uninit)
-/* Sometimes (in the bigalloc case, from ext4_da_get_block_prep) the caller of
- * ext4_map_blocks wants to know whether or not the underlying cluster has
- * already been accounted for. EXT4_MAP_FROM_CLUSTER conveys to the caller that
- * the requested mapping was from previously mapped (or delayed allocated)
- * cluster. We use BH_AllocFromCluster only for this flag. BH_AllocFromCluster
- * should never appear on buffer_head's state flags.
- */
-#define EXT4_MAP_FROM_CLUSTER (1 << BH_AllocFromCluster)
#define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
- EXT4_MAP_UNINIT | EXT4_MAP_FROM_CLUSTER)
+ EXT4_MAP_UNINIT)
struct ext4_map_blocks {
ext4_fsblk_t m_pblk;
@@ -2553,11 +2545,6 @@ extern int ext4_mmp_csum_verify(struct super_block *sb,
enum ext4_state_bits {
BH_Uninit /* blocks are allocated but uninitialized on disk */
= BH_JBDPrivateStart,
- BH_AllocFromCluster, /* allocated blocks were part of already
- * allocated cluster. Note that this flag will
- * never, ever appear in a buffer_head's state
- * flag. See EXT4_MAP_FROM_CLUSTER to see where
- * this is used. */
};
BUFFER_FNS(Uninit, uninit)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f7bf616..aa9a6d2 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3894,6 +3894,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext4_io_end_t *io = ext4_inode_aio(inode);
ext4_lblk_t cluster_offset;
int set_unwritten = 0;
+ int from_cluster = 0;
ext_debug("blocks %u/%u requested for inode %lu\n",
map->m_lblk, map->m_len, inode->i_ino);
@@ -3902,10 +3903,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* check in cache */
if (ext4_ext_in_cache(inode, map->m_lblk, &newex)) {
if (!newex.ee_start_lo && !newex.ee_start_hi) {
- if ((sbi->s_cluster_ratio > 1) &&
- ext4_find_delalloc_cluster(inode, map->m_lblk))
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
-
if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
/*
* block isn't allocated yet and
@@ -3916,8 +3913,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* we should allocate requested block */
} else {
/* block is already allocated */
- if (sbi->s_cluster_ratio > 1)
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
newblock = map->m_lblk
- le32_to_cpu(newex.ee_block)
+ ext4_ext_pblock(&newex);
@@ -3990,10 +3985,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
}
}
- if ((sbi->s_cluster_ratio > 1) &&
- ext4_find_delalloc_cluster(inode, map->m_lblk))
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
-
/*
* requested block isn't allocated yet;
* we couldn't try to create block if create flag is zero
@@ -4010,7 +4001,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/*
* Okay, we need to do block allocation.
*/
- map->m_flags &= ~EXT4_MAP_FROM_CLUSTER;
newex.ee_block = cpu_to_le32(map->m_lblk);
cluster_offset = map->m_lblk & (sbi->s_cluster_ratio-1);
@@ -4022,7 +4012,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
+ from_cluster = 1;
goto got_allocated_blocks;
}
@@ -4043,7 +4033,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
get_implied_cluster_alloc(inode->i_sb, map, ex2, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
+ from_cluster = 1;
goto got_allocated_blocks;
}
@@ -4168,7 +4158,7 @@ got_allocated_blocks:
*/
reserved_clusters = get_reserved_cluster_alloc(inode,
map->m_lblk, allocated);
- if (map->m_flags & EXT4_MAP_FROM_CLUSTER) {
+ if (from_cluster) {
if (reserved_clusters) {
/*
* We have clusters reserved for this range.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cbfe13b..4d066f3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1798,9 +1798,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
* of mapping from cluster so that the reserved space
* is calculated properly.
*/
- if ((EXT4_SB(inode->i_sb)->s_cluster_ratio > 1) &&
- ext4_find_delalloc_cluster(inode, map->m_lblk))
- map->m_flags |= EXT4_MAP_FROM_CLUSTER;
retval = 0;
} else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
retval = ext4_ext_map_blocks(NULL, inode, map, 0);
@@ -1814,7 +1811,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
*/
/* If the block was allocated from previously allocated cluster,
* then we dont need to reserve it again. */
- if (!(map->m_flags & EXT4_MAP_FROM_CLUSTER)) {
+ if ((EXT4_SB(inode->i_sb)->s_cluster_ratio == 1) ||
+ !ext4_find_delalloc_cluster(inode, map->m_lblk)) {
retval = ext4_da_reserve_space(inode, iblock);
if (retval)
/* not enough space to reserve */
@@ -1825,11 +1823,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
if (retval)
goto out_unlock;
- /* Clear EXT4_MAP_FROM_CLUSTER flag since its purpose is served
- * and it should not appear on the bh->b_state.
- */
- map->m_flags &= ~EXT4_MAP_FROM_CLUSTER;
-
map_bh(bh, inode->i_sb, invalid_block);
set_buffer_new(bh);
set_buffer_delay(bh);
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 03/10 v3] ext4: add physical block and status member into extent status tree
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
2013-01-23 12:03 ` [PATCH 01/10 v3] ext4: refine extent status tree Zheng Liu
2013-01-23 12:03 ` [PATCH 02/10 v3] ext4: remove EXT4_MAP_FROM_CLUSTER flag Zheng Liu
@ 2013-01-23 12:03 ` Zheng Liu
2013-01-29 3:03 ` Theodore Ts'o
2013-01-23 12:03 ` [PATCH 04/10 v3] ext4: adjust interfaces of " Zheng Liu
` (6 subsequent siblings)
9 siblings, 1 reply; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:03 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
es_pblk is used to record physical block that maps to the disk.
es_status is used to record the status of the extent. Three status
are defined, which are written, unwritten and delayed.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents_status.h | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 81e9339..2eb9cc3 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -20,10 +20,22 @@
#define es_debug(fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif
+enum {
+ EXTENT_STATUS_WRITTEN = 0, /* written extent */
+ EXTENT_STATUS_UNWRITTEN = 1, /* unwritten extent */
+ EXTENT_STATUS_DELAYED = 2, /* delayed extent */
+};
+
+/*
+ * Here for save memory es_status is stashed into es_pblk because we only have
+ * 48 bits physical block and es_status only needs 2 bits.
+ */
struct extent_status {
struct rb_node rb_node;
- ext4_lblk_t es_lblk; /* first logical block extent covers */
- ext4_lblk_t es_len; /* length of extent in block */
+ ext4_lblk_t es_lblk; /* first logical block extent covers */
+ ext4_lblk_t es_len; /* length of extent in block */
+ ext4_fsblk_t es_pblk : 62; /* first physical block */
+ ext4_fsblk_t es_status : 2; /* record the status of extent */
};
struct ext4_es_tree {
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 03/10 v3] ext4: add physical block and status member into extent status tree
2013-01-23 12:03 ` [PATCH 03/10 v3] ext4: add physical block and status member into extent status tree Zheng Liu
@ 2013-01-29 3:03 ` Theodore Ts'o
2013-01-29 5:34 ` Zheng Liu
0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2013-01-29 3:03 UTC (permalink / raw)
To: Zheng Liu; +Cc: linux-ext4, linux-fsdevel, Zheng Liu
On Wed, Jan 23, 2013 at 08:03:53PM +0800, Zheng Liu wrote:
> + ext4_fsblk_t es_pblk : 62; /* first physical block */
> + ext4_fsblk_t es_status : 2; /* record the status of extent */
I'll accept this for now but note that ext4_fsblk_t is typedefed to be
an unsigned long long, and C99 only guarantees that bitfields can be
made from Bool, signed int, and unsigned int. Gcc accepts unsigned
long long based bit fields as an extension, but it's not portable
code. This is kernel code, though, and we have plenty more gcc
specific code....
- Ted
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 03/10 v3] ext4: add physical block and status member into extent status tree
2013-01-29 3:03 ` Theodore Ts'o
@ 2013-01-29 5:34 ` Zheng Liu
2013-01-29 17:28 ` Theodore Ts'o
0 siblings, 1 reply; 15+ messages in thread
From: Zheng Liu @ 2013-01-29 5:34 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4, linux-fsdevel, Zheng Liu
On Mon, Jan 28, 2013 at 10:03:53PM -0500, Theodore Ts'o wrote:
> On Wed, Jan 23, 2013 at 08:03:53PM +0800, Zheng Liu wrote:
> > + ext4_fsblk_t es_pblk : 62; /* first physical block */
> > + ext4_fsblk_t es_status : 2; /* record the status of extent */
>
> I'll accept this for now but note that ext4_fsblk_t is typedefed to be
> an unsigned long long, and C99 only guarantees that bitfields can be
> made from Bool, signed int, and unsigned int. Gcc accepts unsigned
> long long based bit fields as an extension, but it's not portable
> code. This is kernel code, though, and we have plenty more gcc
> specific code....
Thanks for pointing out. When I tried to implement this code, there are
two choices. One is like this that bit field is used. IMHO it is clear
enough, although it is not portable.
Another choice is like this:
struct extent_status {
...
ext4_fsblk_t es_pblk; /* first physical block */
};
#define EXTENT_STATUS_WRITTEN (1ULL << 60)
#define EXTENT_STATUS_UNWRITTEN (1ULL << 61)
#define EXTENT_STATUS_DELAYED (1ULL << 62)
When we want to set extent status, we will need to do like the following:
es->es_pblk |= EXTENT_STATUS_WRITTEN;
This can make us avoid non-protable code. I am happy to refine this
patch if you think the latter one is better.
Thanks,
- Zheng
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 03/10 v3] ext4: add physical block and status member into extent status tree
2013-01-29 5:34 ` Zheng Liu
@ 2013-01-29 17:28 ` Theodore Ts'o
2013-01-30 2:43 ` Zheng Liu
0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2013-01-29 17:28 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel, Zheng Liu
On Tue, Jan 29, 2013 at 01:34:15PM +0800, Zheng Liu wrote:
>
> Another choice is like this:
>
> struct extent_status {
> ...
> ext4_fsblk_t es_pblk; /* first physical block */
> };
>
> #define EXTENT_STATUS_WRITTEN (1ULL << 60)
> #define EXTENT_STATUS_UNWRITTEN (1ULL << 61)
> #define EXTENT_STATUS_DELAYED (1ULL << 62)
>
> When we want to set extent status, we will need to do like the following:
>
> es->es_pblk |= EXTENT_STATUS_WRITTEN;
>
> This can make us avoid non-protable code. I am happy to refine this
> patch if you think the latter one is better.
This is probably the way I would have done it myself, but the then you
need to make sure that all of the places where es_pblk is used you
have to mask off the high bits.
At this point, though, I don't think it's worth it to make the change
now, especially since we're almost at -rc6, I want to make sure this
gets into linux-next and so we get lots of testing.
As a matter of fact, I've already started testing the v3 vesion of the
extent status patches from January 23rd, with the v2 version of the
slab reclaim patch. It's in the unstable portion of the ext4 git
tree, at:
http://repo.or.cz/w/ext4-patch-queue.git /
git://repo.or.cz/ext4-patch-queue.git
I'm waiting for your next version of your patch series before I move
it into the dev branch which will get fed into linux-next; my
understanding is you're just about ready to push it out, right?
If we want to move away from using bitfields, we can do that as a
separate patch that gets submitted later, since that's pretty easy to
audit and verify for correctness. Also, I've since tested clang and
noted that it supports bitfields for unsigned long long. There is
some differences between how gcc and clang handles sign extension for
unsigned long values, though:
#include <stdio.h>
struct s
{
unsigned long long a:2;
unsigned long long b:40;
unsigned long long c:22;
};
int main()
{
struct s t = {1, 2, 3};
printf("0x%llx\n",(t.b-8));
}
Gcc 4.7.2 will print "0xfffffffffa", while clang 3.0-6 will print
"0xfffffffffffffffa" for the same program.
I don't think this is a huge issue for us, but it's worth keeping in
mind...
So let's go ahead and keep the bitfields at least for the initial
patch submission.
Thanks for all your on this patch series!
- Ted
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 03/10 v3] ext4: add physical block and status member into extent status tree
2013-01-29 17:28 ` Theodore Ts'o
@ 2013-01-30 2:43 ` Zheng Liu
0 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-30 2:43 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4, linux-fsdevel, Zheng Liu
On Tue, Jan 29, 2013 at 12:28:14PM -0500, Theodore Ts'o wrote:
> On Tue, Jan 29, 2013 at 01:34:15PM +0800, Zheng Liu wrote:
> >
> > Another choice is like this:
> >
> > struct extent_status {
> > ...
> > ext4_fsblk_t es_pblk; /* first physical block */
> > };
> >
> > #define EXTENT_STATUS_WRITTEN (1ULL << 60)
> > #define EXTENT_STATUS_UNWRITTEN (1ULL << 61)
> > #define EXTENT_STATUS_DELAYED (1ULL << 62)
> >
> > When we want to set extent status, we will need to do like the following:
> >
> > es->es_pblk |= EXTENT_STATUS_WRITTEN;
> >
> > This can make us avoid non-protable code. I am happy to refine this
> > patch if you think the latter one is better.
>
> This is probably the way I would have done it myself, but the then you
> need to make sure that all of the places where es_pblk is used you
> have to mask off the high bits.
>
> At this point, though, I don't think it's worth it to make the change
> now, especially since we're almost at -rc6, I want to make sure this
> gets into linux-next and so we get lots of testing.
>
> As a matter of fact, I've already started testing the v3 vesion of the
> extent status patches from January 23rd, with the v2 version of the
> slab reclaim patch. It's in the unstable portion of the ext4 git
> tree, at:
>
> http://repo.or.cz/w/ext4-patch-queue.git /
> git://repo.or.cz/ext4-patch-queue.git
>
> I'm waiting for your next version of your patch series before I move
> it into the dev branch which will get fed into linux-next; my
> understanding is you're just about ready to push it out, right?
Yes, I am running xfstests to make sure that the patch series doesn't
break anything. Later it will be sent out.
>
> If we want to move away from using bitfields, we can do that as a
> separate patch that gets submitted later, since that's pretty easy to
> audit and verify for correctness. Also, I've since tested clang and
> noted that it supports bitfields for unsigned long long. There is
> some differences between how gcc and clang handles sign extension for
> unsigned long values, though:
>
> #include <stdio.h>
> struct s
> {
> unsigned long long a:2;
> unsigned long long b:40;
> unsigned long long c:22;
> };
>
> int main()
> {
> struct s t = {1, 2, 3};
> printf("0x%llx\n",(t.b-8));
> }
>
> Gcc 4.7.2 will print "0xfffffffffa", while clang 3.0-6 will print
> "0xfffffffffffffffa" for the same program.
Clang is first coming in my mind. I know that some one try to use it
to build a linux kernel and get a lot of problems that are about gcc
extension. But for us it seems that things are not too bad. ;)
>
> I don't think this is a huge issue for us, but it's worth keeping in
> mind...
>
> So let's go ahead and keep the bitfields at least for the initial
> patch submission.
Yes, just keep in mind and go ahead.
Thanks for teaching me a lot,
- Zheng
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 04/10 v3] ext4: adjust interfaces of extent status tree
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
` (2 preceding siblings ...)
2013-01-23 12:03 ` [PATCH 03/10 v3] ext4: add physical block and status member into extent status tree Zheng Liu
@ 2013-01-23 12:03 ` Zheng Liu
2013-01-23 12:03 ` [PATCH 05/10 v3] ext4: track all extent status in " Zheng Liu
` (5 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:03 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
Due to two members are added into extent status tree, all interfaces
need to be adjusted.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents_status.c | 56 ++++++++++++++++++++++++++++++++++++---------
fs/ext4/extents_status.h | 24 ++++++++++++++++++-
fs/ext4/inode.c | 3 ++-
include/trace/events/ext4.h | 34 +++++++++++++++++----------
4 files changed, 92 insertions(+), 25 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 1ce4a69..7cbb916 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -179,7 +179,8 @@ static void ext4_es_print_tree(struct inode *inode)
while (node) {
struct extent_status *es;
es = rb_entry(node, struct extent_status, rb_node);
- printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
+ printk(KERN_DEBUG " [%u/%u) %llu %u",
+ es->es_lblk, es->es_len, es->es_pblk, es->es_status);
node = rb_next(node);
}
printk(KERN_DEBUG "\n");
@@ -234,7 +235,7 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
* @es: delayed extent that we found
*
* Returns the first block of the next extent after es, otherwise
- * EXT_MAX_BLOCKS if no delay extent is found.
+ * EXT_MAX_BLOCKS if no extent is found.
* Delayed extent is returned via @es.
*/
ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
@@ -249,12 +250,14 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
read_lock(&EXT4_I(inode)->i_es_lock);
tree = &EXT4_I(inode)->i_es_tree;
- /* find delay extent in cache firstly */
+ /* find extent in cache firstly */
if (tree->cache_es) {
es1 = tree->cache_es;
if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
- es_debug("%u cached by [%u/%u)\n",
- es->es_lblk, es1->es_lblk, es1->es_len);
+ es_debug("%u cached by [%u/%u) %llu %u\n",
+ es->es_lblk, es1->es_lblk, es1->es_len,
+ (unsigned long long)es1->es_pblk,
+ es1->es_status);
goto out;
}
}
@@ -267,6 +270,8 @@ out:
tree->cache_es = es1;
es->es_lblk = es1->es_lblk;
es->es_len = es1->es_len;
+ es->es_pblk = es1->es_pblk;
+ es->es_status = es1->es_status;
node = rb_next(&es1->rb_node);
if (node) {
es1 = rb_entry(node, struct extent_status, rb_node);
@@ -281,7 +286,8 @@ out:
}
static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len,
+ ext4_fsblk_t pblk, int status)
{
struct extent_status *es;
es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
@@ -289,6 +295,8 @@ ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
return NULL;
es->es_lblk = lblk;
es->es_len = len;
+ es->es_pblk = pblk;
+ es->es_status = status;
return es;
}
@@ -301,6 +309,8 @@ static void ext4_es_free_extent(struct extent_status *es)
* Check whether or not two extents can be merged
* Condition:
* - logical block number is contiguous
+ * - physical block number is contiguous
+ * - status is equal
*/
static int ext4_es_can_be_merged(struct extent_status *es1,
struct extent_status *es2)
@@ -308,6 +318,13 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
if (es1->es_lblk + es1->es_len != es2->es_lblk)
return 0;
+ if (es1->es_status != es2->es_status)
+ return 0;
+
+ if (!ext4_es_is_delayed(es1) &&
+ (es1->es_pblk + es1->es_len != es2->es_pblk))
+ return 0;
+
return 1;
}
@@ -370,6 +387,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
if (ext4_es_can_be_merged(newes, es)) {
es->es_lblk = newes->es_lblk;
es->es_len += newes->es_len;
+ es->es_pblk = ext4_es_get_pblock(es,
+ newes->es_pblk);
es = ext4_es_try_to_merge_left(tree, es);
goto out;
}
@@ -387,7 +406,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
}
}
- es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
+ es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len,
+ newes->es_pblk, newes->es_status);
if (!es)
return -ENOMEM;
rb_link_node(&es->rb_node, parent, p);
@@ -406,21 +426,23 @@ out:
* Return 0 on success, error code on failure.
*/
int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
- ext4_lblk_t len)
+ ext4_lblk_t len, ext4_fsblk_t pblk, int status)
{
struct ext4_es_tree *tree;
struct extent_status newes;
ext4_lblk_t end = lblk + len - 1;
int err = 0;
- trace_ext4_es_insert_extent(inode, lblk, len);
- es_debug("add [%u/%u) to extent status tree of inode %lu\n",
- lblk, len, inode->i_ino);
+ es_debug("add [%u/%u) %llu %d to extent status tree of inode %lu\n",
+ lblk, len, pblk, status, inode->i_ino);
BUG_ON(end < lblk);
newes.es_lblk = lblk;
newes.es_len = len;
+ newes.es_pblk = pblk;
+ newes.es_status = status;
+ trace_ext4_es_insert_extent(inode, &newes);
write_lock(&EXT4_I(inode)->i_es_lock);
tree = &EXT4_I(inode)->i_es_tree;
@@ -457,6 +479,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
orig_es.es_lblk = es->es_lblk;
orig_es.es_len = es->es_len;
+ orig_es.es_pblk = es->es_pblk;
+ orig_es.es_status = es->es_status;
+
len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
len2 = ext4_es_end(es) > end ? ext4_es_end(es) - end : 0;
if (len1 > 0)
@@ -467,6 +492,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
newes.es_lblk = end + 1;
newes.es_len = len2;
+ newes.es_pblk = ext4_es_get_pblock(&orig_es,
+ orig_es.es_pblk + orig_es.es_len - len2);
+ newes.es_status = orig_es.es_status;
err = __es_insert_extent(tree, &newes);
if (err) {
es->es_lblk = orig_es.es_lblk;
@@ -476,6 +504,8 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
} else {
es->es_lblk = end + 1;
es->es_len = len2;
+ es->es_pblk = ext4_es_get_pblock(es,
+ orig_es.es_pblk + orig_es.es_len - len2);
}
goto out;
}
@@ -500,9 +530,13 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
}
if (es && es->es_lblk < end + 1) {
+ ext4_lblk_t orig_len = es->es_len;
+
len1 = ext4_es_end(es) - end;
es->es_lblk = end + 1;
es->es_len = len1;
+ es->es_pblk = ext4_es_get_pblock(es,
+ es->es_pblk + orig_len - len1);
}
out:
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 2eb9cc3..1345c06 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -48,10 +48,32 @@ extern void ext4_exit_es(void);
extern void ext4_es_init_tree(struct ext4_es_tree *tree);
extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
- ext4_lblk_t len);
+ ext4_lblk_t len, ext4_fsblk_t pblk,
+ int status);
extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
struct extent_status *es);
+static inline int ext4_es_is_written(struct extent_status *es)
+{
+ return (es->es_status == EXTENT_STATUS_WRITTEN);
+}
+
+static inline int ext4_es_is_unwritten(struct extent_status *es)
+{
+ return (es->es_status == EXTENT_STATUS_UNWRITTEN);
+}
+
+static inline int ext4_es_is_delayed(struct extent_status *es)
+{
+ return (es->es_status == EXTENT_STATUS_DELAYED);
+}
+
+static inline ext4_fsblk_t ext4_es_get_pblock(struct extent_status *es,
+ ext4_fsblk_t pb)
+{
+ return (ext4_es_is_delayed(es) ? ~0 : pb);
+}
+
#endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4d066f3..e09c7cf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1819,7 +1819,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
goto out_unlock;
}
- retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len);
+ retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ ~0, EXTENT_STATUS_DELAYED);
if (retval)
goto out_unlock;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 952628a..43f335a 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2068,28 +2068,33 @@ TRACE_EVENT(ext4_ext_remove_space_done,
);
TRACE_EVENT(ext4_es_insert_extent,
- TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
+ TP_PROTO(struct inode *inode, struct extent_status *es),
- TP_ARGS(inode, lblk, len),
+ TP_ARGS(inode, es),
TP_STRUCT__entry(
- __field( dev_t, dev )
- __field( ino_t, ino )
- __field( loff_t, lblk )
- __field( loff_t, len )
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( ext4_lblk_t, lblk )
+ __field( ext4_lblk_t, len )
+ __field( ext4_fsblk_t, pblk )
+ __field( int, status )
),
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->lblk = lblk;
- __entry->len = len;
+ __entry->lblk = es->es_lblk;
+ __entry->len = es->es_len;
+ __entry->pblk = es->es_pblk;
+ __entry->status = es->es_status;
),
- TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+ TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->lblk, __entry->len)
+ __entry->lblk, __entry->len,
+ __entry->pblk, __entry->status)
);
TRACE_EVENT(ext4_es_remove_extent,
@@ -2150,6 +2155,8 @@ TRACE_EVENT(ext4_es_find_extent_exit,
__field( ino_t, ino )
__field( ext4_lblk_t, lblk )
__field( ext4_lblk_t, len )
+ __field( ext4_fsblk_t, pblk )
+ __field( int, status )
__field( ext4_lblk_t, ret )
),
@@ -2158,13 +2165,16 @@ TRACE_EVENT(ext4_es_find_extent_exit,
__entry->ino = inode->i_ino;
__entry->lblk = es->es_lblk;
__entry->len = es->es_len;
+ __entry->pblk = es->es_pblk;
+ __entry->status = es->es_status;
__entry->ret = ret;
),
- TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
+ TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %u ret %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->lblk, __entry->len, __entry->ret)
+ __entry->lblk, __entry->len,
+ __entry->pblk, __entry->status, __entry->ret)
);
#endif /* _TRACE_EXT4_H */
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 05/10 v3] ext4: track all extent status in extent status tree
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
` (3 preceding siblings ...)
2013-01-23 12:03 ` [PATCH 04/10 v3] ext4: adjust interfaces of " Zheng Liu
@ 2013-01-23 12:03 ` Zheng Liu
2013-01-23 12:03 ` [PATCH 06/10 v3] ext4: lookup block mapping " Zheng Liu
` (4 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:03 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
By recording the phycisal block and status, extent status tree is able
to track the status of every extents. When we call _map_blocks
functions to lookup an extent or create a new written/unwritten/delayed
extent, this extent will be inserted into extent status tree.
We don't load all extents from disk in alloc_inode() because it costs
too much memory, and if a file is opened and closed frequently it will
takes too much time to load all extent information. So currently when
we create/lookup an extent, this extent will be inserted into extent
status tree. Hence, the extent status tree may not comprehensively
contain all of the extents found in the file.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents.c | 5 +++-
fs/ext4/file.c | 6 +++--
fs/ext4/inode.c | 73 ++++++++++++++++++++++++++++++++++++-------------------
3 files changed, 56 insertions(+), 28 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index aa9a6d2..d23a654 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2074,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
}
/* This is possible iff next == next_del == EXT_MAX_BLOCKS */
- if (next == next_del) {
+ if (next == next_del && next_del == EXT_MAX_BLOCKS) {
flags |= FIEMAP_EXTENT_LAST;
if (unlikely(next_del != EXT_MAX_BLOCKS ||
next != EXT_MAX_BLOCKS)) {
@@ -4570,6 +4570,9 @@ static int ext4_find_delayed_extent(struct inode *inode,
/* A hole found. */
return 0;
+ if (!ext4_es_is_delayed(&es))
+ return 0;
+
if (es.es_lblk > newex->ec_block) {
/* A hole found. */
newex->ec_len = min(es.es_lblk - newex->ec_block,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 718c49f..afaf9f15 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -466,7 +466,8 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
*/
es.es_lblk = last;
(void)ext4_es_find_extent(inode, &es);
- if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+ if (ext4_es_is_delayed(&es) &&
+ last >= es.es_lblk && last < es.es_lblk + es.es_len) {
if (last != start)
dataoff = last << blkbits;
break;
@@ -550,7 +551,8 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
*/
es.es_lblk = last;
(void)ext4_es_find_extent(inode, &es);
- if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+ if (ext4_es_is_delayed(&es) &&
+ last >= es.es_lblk && last < es.es_lblk + es.es_len) {
last = es.es_lblk + es.es_len;
holeoff = last << blkbits;
continue;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e09c7cf..f0dda2a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -527,20 +527,20 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
retval = ext4_ind_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
}
+ if (retval > 0) {
+ int ret, status;
+ status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+ EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+ ret = ext4_es_insert_extent(inode, map->m_lblk,
+ map->m_len, map->m_pblk, status);
+ if (ret < 0)
+ retval = ret;
+ }
if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
up_read((&EXT4_I(inode)->i_data_sem));
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
- int ret;
- if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
- /* delayed alloc may be allocated by fallocate and
- * coverted to initialized by directIO.
- * we need to handle delayed extent here.
- */
- down_write((&EXT4_I(inode)->i_data_sem));
- goto delayed_mapped;
- }
- ret = check_block_validity(inode, map);
+ int ret = check_block_validity(inode, map);
if (ret != 0)
return ret;
}
@@ -615,18 +615,27 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
ext4_da_update_reserve_space(inode, retval, 1);
}
- if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
+ if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
- if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
- int ret;
-delayed_mapped:
- /* delayed allocation blocks has been allocated */
- ret = ext4_es_remove_extent(inode, map->m_lblk,
- map->m_len);
- if (ret < 0)
- retval = ret;
- }
+ if (retval > 0) {
+ int ret, status;
+
+ if (flags & EXT4_GET_BLOCKS_PRE_IO)
+ status = EXTENT_STATUS_UNWRITTEN;
+ else if (flags & EXT4_GET_BLOCKS_CONVERT)
+ status = EXTENT_STATUS_WRITTEN;
+ else if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
+ status = EXTENT_STATUS_UNWRITTEN;
+ else if (flags & EXT4_GET_BLOCKS_CREATE)
+ status = EXTENT_STATUS_WRITTEN;
+ else
+ BUG_ON(1);
+
+ ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status);
+ if (ret < 0)
+ retval = ret;
}
up_write((&EXT4_I(inode)->i_data_sem));
@@ -1805,6 +1814,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
retval = ext4_ind_map_blocks(NULL, inode, map, 0);
if (retval == 0) {
+ int ret;
/*
* XXX: __block_prepare_write() unmaps passed block,
* is it OK?
@@ -1813,20 +1823,33 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
* then we dont need to reserve it again. */
if ((EXT4_SB(inode->i_sb)->s_cluster_ratio == 1) ||
!ext4_find_delalloc_cluster(inode, map->m_lblk)) {
- retval = ext4_da_reserve_space(inode, iblock);
- if (retval)
+ ret = ext4_da_reserve_space(inode, iblock);
+ if (ret) {
/* not enough space to reserve */
+ retval = ret;
goto out_unlock;
+ }
}
- retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
- ~0, EXTENT_STATUS_DELAYED);
- if (retval)
+ ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ ~0, EXTENT_STATUS_DELAYED);
+ if (ret) {
+ retval = ret;
goto out_unlock;
+ }
map_bh(bh, inode->i_sb, invalid_block);
set_buffer_new(bh);
set_buffer_delay(bh);
+ } else if (retval > 0) {
+ int ret, status;
+
+ status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+ EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+ ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+ map->m_pblk, status);
+ if (ret != 0)
+ retval = ret;
}
out_unlock:
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 06/10 v3] ext4: lookup block mapping in extent status tree
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
` (4 preceding siblings ...)
2013-01-23 12:03 ` [PATCH 05/10 v3] ext4: track all extent status in " Zheng Liu
@ 2013-01-23 12:03 ` Zheng Liu
2013-01-23 12:03 ` [PATCH 07/10 v3] ext4: remove single extent cache Zheng Liu
` (3 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:03 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
After tracking all extent status, we already have a extent cache in
memory. Every time we want to lookup a block mapping, we can first
try to lookup it in extent status tree to avoid a potential disk I/O.
A new function called ext4_es_lookup_extent is defined to finish this
work. When we try to lookup a block mapping, we always call
ext4_map_blocks and/or ext4_da_map_blocks. So in these functions we
first try to lookup a block mapping in extent status tree.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents_status.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/extents_status.h | 1 +
fs/ext4/inode.c | 55 ++++++++++++++++++++++++++++++++++++++++++
include/trace/events/ext4.h | 56 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 171 insertions(+)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 7cbb916..fd4354e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -459,6 +459,65 @@ error:
return err;
}
+/*
+ * ext4_es_lookup_extent() looks up an extent in extent status tree.
+ *
+ * ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
+ *
+ * Return: 1 on found, 0 on not
+ */
+int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es)
+{
+ struct ext4_es_tree *tree;
+ struct extent_status *es1;
+ struct rb_node *node;
+ int found = 0;
+
+ trace_ext4_es_lookup_extent_enter(inode, es->es_lblk);
+ es_debug("lookup extent in block %u\n", es->es_lblk);
+
+ tree = &EXT4_I(inode)->i_es_tree;
+ read_lock(&EXT4_I(inode)->i_es_lock);
+
+ /* find extent in cache firstly */
+ if (tree->cache_es) {
+ es1 = tree->cache_es;
+ if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
+ es_debug("%u cached by [%u/%u)\n",
+ es->es_lblk, es1->es_lblk, es1->es_len);
+ found = 1;
+ goto out;
+ }
+ }
+
+ es->es_len = 0;
+ node = tree->root.rb_node;
+ while (node) {
+ es1 = rb_entry(node, struct extent_status, rb_node);
+ if (es->es_lblk < es1->es_lblk)
+ node = node->rb_left;
+ else if (es->es_lblk > ext4_es_end(es1))
+ node = node->rb_right;
+ else {
+ found = 1;
+ break;
+ }
+ }
+
+out:
+ if (found) {
+ es->es_lblk = es1->es_lblk;
+ es->es_len = es1->es_len;
+ es->es_pblk = es1->es_pblk;
+ es->es_status = es1->es_status;
+ }
+
+ read_unlock(&EXT4_I(inode)->i_es_lock);
+
+ trace_ext4_es_lookup_extent_exit(inode, es, found);
+ return found;
+}
+
static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
ext4_lblk_t end)
{
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 1345c06..6350d35 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -54,6 +54,7 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
struct extent_status *es);
+extern int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es);
static inline int ext4_es_is_written(struct extent_status *es)
{
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f0dda2a..c86b086 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -508,12 +508,39 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
int ext4_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags)
{
+ struct extent_status es;
int retval;
map->m_flags = 0;
ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
"logical block %lu\n", inode->i_ino, flags, map->m_len,
(unsigned long) map->m_lblk);
+
+ /* Lookup extent status tree firstly */
+ es.es_lblk = map->m_lblk;
+ if (ext4_es_lookup_extent(inode, &es)) {
+ if (ext4_es_is_written(&es)) {
+ map->m_pblk = es.es_pblk + map->m_lblk - es.es_lblk;
+ map->m_flags |= EXT4_MAP_MAPPED;
+ retval = es.es_len - (map->m_lblk - es.es_lblk);
+ if (retval > map->m_len)
+ retval = map->m_len;
+ map->m_len = retval;
+ } else if (ext4_es_is_unwritten(&es)) {
+ map->m_pblk = es.es_pblk + map->m_lblk - es.es_lblk;
+ map->m_flags |= EXT4_MAP_UNWRITTEN;
+ retval = es.es_len - (map->m_lblk - es.es_lblk);
+ if (retval > map->m_len)
+ retval = map->m_len;
+ map->m_len = retval;
+ } else if (ext4_es_is_delayed(&es)) {
+ retval = 0;
+ } else {
+ BUG_ON(1);
+ }
+ goto found;
+ }
+
/*
* Try to see if we can get the block without requesting a new
* file system block.
@@ -539,6 +566,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
up_read((&EXT4_I(inode)->i_data_sem));
+found:
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
int ret = check_block_validity(inode, map);
if (ret != 0)
@@ -1784,6 +1812,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
struct ext4_map_blocks *map,
struct buffer_head *bh)
{
+ struct extent_status es;
int retval;
sector_t invalid_block = ~((sector_t) 0xffff);
@@ -1794,6 +1823,32 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
ext_debug("ext4_da_map_blocks(): inode %lu, max_blocks %u,"
"logical block %lu\n", inode->i_ino, map->m_len,
(unsigned long) map->m_lblk);
+
+ /* Lookup extent status tree firstly */
+ es.es_lblk = iblock;
+ if (ext4_es_lookup_extent(inode, &es)) {
+ map->m_pblk = es.es_pblk + iblock - es.es_lblk;
+ retval = es.es_len - (iblock - es.es_lblk);
+ if (retval > map->m_len)
+ retval = map->m_len;
+ map->m_len = retval;
+ if (ext4_es_is_written(&es)) {
+ map->m_flags |= EXT4_MAP_MAPPED;
+ } else if (ext4_es_is_unwritten(&es)) {
+ map->m_flags |= EXT4_MAP_UNWRITTEN;
+ } else if (ext4_es_is_delayed(&es)) {
+ map_bh(bh, inode->i_sb, invalid_block);
+ set_buffer_new(bh);
+ set_buffer_delay(bh);
+
+ return 0;
+ } else {
+ BUG_ON(1);
+ }
+
+ return retval;
+ }
+
/*
* Try to see if we can get the block without requesting a new
* file system block.
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 43f335a..f23c177 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2177,6 +2177,62 @@ TRACE_EVENT(ext4_es_find_extent_exit,
__entry->pblk, __entry->status, __entry->ret)
);
+TRACE_EVENT(ext4_es_lookup_extent_enter,
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
+
+ TP_ARGS(inode, lblk),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( ext4_lblk_t, lblk )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->lblk = lblk;
+ ),
+
+ TP_printk("dev %d,%d ino %lu lblk %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->lblk)
+);
+
+TRACE_EVENT(ext4_es_lookup_extent_exit,
+ TP_PROTO(struct inode *inode, struct extent_status *es,
+ int found),
+
+ TP_ARGS(inode, es, found),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( ext4_lblk_t, lblk )
+ __field( ext4_lblk_t, len )
+ __field( ext4_fsblk_t, pblk )
+ __field( int, status )
+ __field( int, found )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->lblk = es->es_lblk;
+ __entry->len = es->es_len;
+ __entry->pblk = es->es_pblk;
+ __entry->status = es->es_status;
+ __entry->found = found;
+ ),
+
+ TP_printk("dev %d,%d ino %lu found %d [%u/%u) %llu %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->found,
+ __entry->lblk, __entry->len,
+ __entry->found ? __entry->pblk : 0,
+ __entry->found ? __entry->status : 0)
+);
+
#endif /* _TRACE_EXT4_H */
/* This part must be outside protection */
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 07/10 v3] ext4: remove single extent cache
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
` (5 preceding siblings ...)
2013-01-23 12:03 ` [PATCH 06/10 v3] ext4: lookup block mapping " Zheng Liu
@ 2013-01-23 12:03 ` Zheng Liu
2013-01-23 12:03 ` [PATCH 08/10 v3] fs: allow for fs-specific objects to be pruned as part of pruning inodes Zheng Liu
` (2 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:03 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
Single extent cache could be removed because we have extent status tree
as a extent cache, and it can do a better job.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ext4.h | 12 ---
fs/ext4/ext4_extents.h | 6 --
fs/ext4/extents.c | 220 ++++++++-----------------------------------------
fs/ext4/move_extent.c | 3 -
fs/ext4/super.c | 1 -
5 files changed, 34 insertions(+), 208 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 36145ef1..60d16d1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -802,17 +802,6 @@ do { \
#endif /* defined(__KERNEL__) || defined(__linux__) */
-/*
- * storage for cached extent
- * If ec_len == 0, then the cache is invalid.
- * If ec_start == 0, then the cache represents a gap (null mapping)
- */
-struct ext4_ext_cache {
- ext4_fsblk_t ec_start;
- ext4_lblk_t ec_block;
- __u32 ec_len; /* must be 32bit to return holes */
-};
-
#include "extents_status.h"
/*
@@ -879,7 +868,6 @@ struct ext4_inode_info {
struct inode vfs_inode;
struct jbd2_inode *jinode;
- struct ext4_ext_cache i_cached_extent;
/*
* File creation time. Its function is same as that of
* struct timespec i_{a,c,m}time in the generic inode.
diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index 487fda1..8643ff5 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -193,12 +193,6 @@ static inline unsigned short ext_depth(struct inode *inode)
return le16_to_cpu(ext_inode_hdr(inode)->eh_depth);
}
-static inline void
-ext4_ext_invalidate_cache(struct inode *inode)
-{
- EXT4_I(inode)->i_cached_extent.ec_len = 0;
-}
-
static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
{
/* We can not have an uninitialized extent of zero length! */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d23a654..dedd557 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -112,7 +112,7 @@ static int ext4_split_extent_at(handle_t *handle,
int flags);
static int ext4_find_delayed_extent(struct inode *inode,
- struct ext4_ext_cache *newex);
+ struct extent_status *newes);
static int ext4_ext_truncate_extend_restart(handle_t *handle,
struct inode *inode,
@@ -714,7 +714,6 @@ int ext4_ext_tree_init(handle_t *handle, struct inode *inode)
eh->eh_magic = EXT4_EXT_MAGIC;
eh->eh_max = cpu_to_le16(ext4_ext_space_root(inode, 0));
ext4_mark_inode_dirty(handle, inode);
- ext4_ext_invalidate_cache(inode);
return 0;
}
@@ -1960,7 +1959,6 @@ cleanup:
ext4_ext_drop_refs(npath);
kfree(npath);
}
- ext4_ext_invalidate_cache(inode);
return err;
}
@@ -1969,8 +1967,8 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
struct fiemap_extent_info *fieinfo)
{
struct ext4_ext_path *path = NULL;
- struct ext4_ext_cache newex;
struct ext4_extent *ex;
+ struct extent_status es;
ext4_lblk_t next, next_del, start = 0, end = 0;
ext4_lblk_t last = block + num;
int exists, depth = 0, err = 0;
@@ -2044,31 +2042,31 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
BUG_ON(end <= start);
if (!exists) {
- newex.ec_block = start;
- newex.ec_len = end - start;
- newex.ec_start = 0;
+ es.es_lblk = start;
+ es.es_len = end - start;
+ es.es_pblk = 0;
} else {
- newex.ec_block = le32_to_cpu(ex->ee_block);
- newex.ec_len = ext4_ext_get_actual_len(ex);
- newex.ec_start = ext4_ext_pblock(ex);
+ es.es_lblk = le32_to_cpu(ex->ee_block);
+ es.es_len = ext4_ext_get_actual_len(ex);
+ es.es_pblk = ext4_ext_pblock(ex);
if (ext4_ext_is_uninitialized(ex))
flags |= FIEMAP_EXTENT_UNWRITTEN;
}
/*
- * Find delayed extent and update newex accordingly. We call
- * it even in !exists case to find out whether newex is the
+ * Find delayed extent and update es accordingly. We call
+ * it even in !exists case to find out whether es is the
* last existing extent or not.
*/
- next_del = ext4_find_delayed_extent(inode, &newex);
+ next_del = ext4_find_delayed_extent(inode, &es);
if (!exists && next_del) {
exists = 1;
flags |= FIEMAP_EXTENT_DELALLOC;
}
up_read(&EXT4_I(inode)->i_data_sem);
- if (unlikely(newex.ec_len == 0)) {
- EXT4_ERROR_INODE(inode, "newex.ec_len == 0");
+ if (unlikely(es.es_len == 0)) {
+ EXT4_ERROR_INODE(inode, "es.es_len == 0");
err = -EIO;
break;
}
@@ -2089,9 +2087,9 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
if (exists) {
err = fiemap_fill_next_extent(fieinfo,
- (__u64)newex.ec_block << blksize_bits,
- (__u64)newex.ec_start << blksize_bits,
- (__u64)newex.ec_len << blksize_bits,
+ (__u64)es.es_lblk << blksize_bits,
+ (__u64)es.es_pblk << blksize_bits,
+ (__u64)es.es_len << blksize_bits,
flags);
if (err < 0)
break;
@@ -2101,7 +2099,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
}
}
- block = newex.ec_block + newex.ec_len;
+ block = es.es_lblk + es.es_len;
}
if (path) {
@@ -2112,115 +2110,6 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
return err;
}
-static void
-ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
- __u32 len, ext4_fsblk_t start)
-{
- struct ext4_ext_cache *cex;
- BUG_ON(len == 0);
- spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
- trace_ext4_ext_put_in_cache(inode, block, len, start);
- cex = &EXT4_I(inode)->i_cached_extent;
- cex->ec_block = block;
- cex->ec_len = len;
- cex->ec_start = start;
- spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
-}
-
-/*
- * ext4_ext_put_gap_in_cache:
- * calculate boundaries of the gap that the requested block fits into
- * and cache this gap
- */
-static void
-ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
- ext4_lblk_t block)
-{
- int depth = ext_depth(inode);
- unsigned long len;
- ext4_lblk_t lblock;
- struct ext4_extent *ex;
-
- ex = path[depth].p_ext;
- if (ex == NULL) {
- /* there is no extent yet, so gap is [0;-] */
- lblock = 0;
- len = EXT_MAX_BLOCKS;
- ext_debug("cache gap(whole file):");
- } else if (block < le32_to_cpu(ex->ee_block)) {
- lblock = block;
- len = le32_to_cpu(ex->ee_block) - block;
- ext_debug("cache gap(before): %u [%u:%u]",
- block,
- le32_to_cpu(ex->ee_block),
- ext4_ext_get_actual_len(ex));
- } else if (block >= le32_to_cpu(ex->ee_block)
- + ext4_ext_get_actual_len(ex)) {
- ext4_lblk_t next;
- lblock = le32_to_cpu(ex->ee_block)
- + ext4_ext_get_actual_len(ex);
-
- next = ext4_ext_next_allocated_block(path);
- ext_debug("cache gap(after): [%u:%u] %u",
- le32_to_cpu(ex->ee_block),
- ext4_ext_get_actual_len(ex),
- block);
- BUG_ON(next == lblock);
- len = next - lblock;
- } else {
- lblock = len = 0;
- BUG();
- }
-
- ext_debug(" -> %u:%lu\n", lblock, len);
- ext4_ext_put_in_cache(inode, lblock, len, 0);
-}
-
-/*
- * ext4_ext_in_cache()
- * Checks to see if the given block is in the cache.
- * If it is, the cached extent is stored in the given
- * cache extent pointer.
- *
- * @inode: The files inode
- * @block: The block to look for in the cache
- * @ex: Pointer where the cached extent will be stored
- * if it contains block
- *
- * Return 0 if cache is invalid; 1 if the cache is valid
- */
-static int
-ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
- struct ext4_extent *ex)
-{
- struct ext4_ext_cache *cex;
- int ret = 0;
-
- /*
- * We borrow i_block_reservation_lock to protect i_cached_extent
- */
- spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
- cex = &EXT4_I(inode)->i_cached_extent;
-
- /* has cache valid data? */
- if (cex->ec_len == 0)
- goto errout;
-
- if (in_range(block, cex->ec_block, cex->ec_len)) {
- ex->ee_block = cpu_to_le32(cex->ec_block);
- ext4_ext_store_pblock(ex, cex->ec_start);
- ex->ee_len = cpu_to_le16(cex->ec_len);
- ext_debug("%u cached by %u:%u:%llu\n",
- block,
- cex->ec_block, cex->ec_len, cex->ec_start);
- ret = 1;
- }
-errout:
- trace_ext4_ext_in_cache(inode, block, ret);
- spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
- return ret;
-}
-
/*
* ext4_ext_rm_idx:
* removes index from the index block.
@@ -2658,8 +2547,6 @@ static int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
return PTR_ERR(handle);
again:
- ext4_ext_invalidate_cache(inode);
-
trace_ext4_ext_remove_space(inode, start, depth);
/*
@@ -3900,29 +3787,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
map->m_lblk, map->m_len, inode->i_ino);
trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
- /* check in cache */
- if (ext4_ext_in_cache(inode, map->m_lblk, &newex)) {
- if (!newex.ee_start_lo && !newex.ee_start_hi) {
- if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
- /*
- * block isn't allocated yet and
- * user doesn't want to allocate it
- */
- goto out2;
- }
- /* we should allocate requested block */
- } else {
- /* block is already allocated */
- newblock = map->m_lblk
- - le32_to_cpu(newex.ee_block)
- + ext4_ext_pblock(&newex);
- /* number of remaining blocks in the extent */
- allocated = ext4_ext_get_actual_len(&newex) -
- (map->m_lblk - le32_to_cpu(newex.ee_block));
- goto out;
- }
- }
-
/* find extent for this block */
path = ext4_ext_find_extent(inode, map->m_lblk, NULL);
if (IS_ERR(path)) {
@@ -3969,15 +3833,9 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
ee_block, ee_len, newblock);
- /*
- * Do not put uninitialized extent
- * in the cache
- */
- if (!ext4_ext_is_uninitialized(ex)) {
- ext4_ext_put_in_cache(inode, ee_block,
- ee_len, ee_start);
+ if (!ext4_ext_is_uninitialized(ex))
goto out;
- }
+
allocated = ext4_ext_handle_uninitialized_extents(
handle, inode, map, path, flags,
allocated, newblock);
@@ -3989,14 +3847,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
* requested block isn't allocated yet;
* we couldn't try to create block if create flag is zero
*/
- if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
- /*
- * put just found gap into cache to speed up
- * subsequent requests
- */
- ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
+ if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
goto out2;
- }
/*
* Okay, we need to do block allocation.
@@ -4232,10 +4084,9 @@ got_allocated_blocks:
* Cache the extent and update transaction to commit on fdatasync only
* when it is _not_ an uninitialized extent.
*/
- if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
- ext4_ext_put_in_cache(inode, map->m_lblk, allocated, newblock);
+ if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0)
ext4_update_inode_fsync_trans(handle, inode, 1);
- } else
+ else
ext4_update_inode_fsync_trans(handle, inode, 0);
out:
if (allocated > map->m_len)
@@ -4294,7 +4145,6 @@ void ext4_ext_truncate(struct inode *inode)
goto out_stop;
down_write(&EXT4_I(inode)->i_data_sem);
- ext4_ext_invalidate_cache(inode);
ext4_discard_preallocations(inode);
@@ -4544,26 +4394,26 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
}
/*
- * If newex is not existing extent (newex->ec_start equals zero) find
- * delayed extent at start of newex and update newex accordingly and
+ * If newes is not existing extent (newes->ec_pblk equals zero) find
+ * delayed extent at start of newes and update newes accordingly and
* return start of the next delayed extent.
*
- * If newex is existing extent (newex->ec_start is not equal zero)
+ * If newes is existing extent (newes->ec_pblk is not equal zero)
* return start of next delayed extent or EXT_MAX_BLOCKS if no delayed
- * extent found. Leave newex unmodified.
+ * extent found. Leave newes unmodified.
*/
static int ext4_find_delayed_extent(struct inode *inode,
- struct ext4_ext_cache *newex)
+ struct extent_status *newes)
{
struct extent_status es;
ext4_lblk_t next_del;
- es.es_lblk = newex->ec_block;
+ es.es_lblk = newes->es_lblk;
next_del = ext4_es_find_extent(inode, &es);
- if (newex->ec_start == 0) {
+ if (newes->es_pblk == 0) {
/*
- * No extent in extent-tree contains block @newex->ec_start,
+ * No extent in extent-tree contains block @newes->es_pblk,
* then the block may stay in 1)a hole or 2)delayed-extent.
*/
if (es.es_len == 0)
@@ -4573,14 +4423,14 @@ static int ext4_find_delayed_extent(struct inode *inode,
if (!ext4_es_is_delayed(&es))
return 0;
- if (es.es_lblk > newex->ec_block) {
+ if (es.es_lblk > newes->es_lblk) {
/* A hole found. */
- newex->ec_len = min(es.es_lblk - newex->ec_block,
- newex->ec_len);
+ newes->es_len = min(es.es_lblk - newes->es_lblk,
+ newes->es_len);
return 0;
}
- newex->ec_len = es.es_lblk + es.es_len - newex->ec_block;
+ newes->es_len = es.es_lblk + es.es_len - newes->es_lblk;
}
return next_del;
@@ -4780,14 +4630,12 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
goto out;
down_write(&EXT4_I(inode)->i_data_sem);
- ext4_ext_invalidate_cache(inode);
ext4_discard_preallocations(inode);
err = ext4_es_remove_extent(inode, first_block,
stop_block - first_block);
err = ext4_ext_remove_space(inode, first_block, stop_block - 1);
- ext4_ext_invalidate_cache(inode);
ext4_discard_preallocations(inode);
if (IS_SYNC(inode))
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index d9cc5ee..b9222c8 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -761,9 +761,6 @@ out:
kfree(donor_path);
}
- ext4_ext_invalidate_cache(orig_inode);
- ext4_ext_invalidate_cache(donor_inode);
-
return replaced_count;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3d4fb81..a35c6c1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -939,7 +939,6 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
return NULL;
ei->vfs_inode.i_version = 1;
- memset(&ei->i_cached_extent, 0, sizeof(struct ext4_ext_cache));
INIT_LIST_HEAD(&ei->i_prealloc_list);
spin_lock_init(&ei->i_prealloc_lock);
ext4_es_init_tree(&ei->i_es_tree);
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 08/10 v3] fs: allow for fs-specific objects to be pruned as part of pruning inodes
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
` (6 preceding siblings ...)
2013-01-23 12:03 ` [PATCH 07/10 v3] ext4: remove single extent cache Zheng Liu
@ 2013-01-23 12:03 ` Zheng Liu
2013-01-23 12:03 ` [PATCH 09/10 v3] ext4: adjust some functions for reclaiming extents from extent status tree Zheng Liu
2013-01-23 12:04 ` [PATCH 10/10 v3] ext4: reclaim " Zheng Liu
9 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:03 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Theodore Ts'o, Dave Chinner, Zheng Liu
From: Theodore Ts'o <tytso@mit.edu>
The VFS's prune_super() function allows for the file system to prune
file-system specific objects. Ext4 would like to use this to prune
parts of the inode's extent cache. The object lifetime rules used by
ext4 is somewhat different from the those of the dentry and inode in
the VFS. Ext4's extent cache objects can be pruned without removing
the inode; however if an inode is pruned, all of the extent cache
objects associated with the inode are immediately removed.
To accomodate this rule, we measure the number of fs-specific objects
before the dentry and inodes are pruned, and then measure them again
afterwards. If the number of fs-specific objects have decreased, we
credit that decrease as part of the shrink operation, so that we do
not end up removing too many fs-specific objects.
In the case where fs-specific objects are not removed when inodes are
removed, this will not change the behavior of prune_super() in any
appreciable way. (Currently the only other user of this facility is
XFS, and this change should not affect XFS's usage of this facility
for this reason.)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Zheng Liu <wenqing.lz@taobao.com>
---
fs/super.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index 12f1237..fb57bd2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
if (sc->nr_to_scan) {
int dentries;
int inodes;
+ int fs_to_scan = 0;
/* proportion the scan between the caches */
dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
@@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
total_objects;
if (fs_objects)
- fs_objects = (sc->nr_to_scan * fs_objects) /
+ fs_to_scan = (sc->nr_to_scan * fs_objects) /
total_objects;
/*
* prune the dcache first as the icache is pinned by it, then
@@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
prune_dcache_sb(sb, dentries);
prune_icache_sb(sb, inodes);
- if (fs_objects && sb->s_op->free_cached_objects) {
- sb->s_op->free_cached_objects(sb, fs_objects);
+ /*
+ * If as a result of pruning the icache, we released some
+ * of the fs_objects, give credit to the fact and
+ * reduce the number of fs objects that we should try
+ * to release.
+ */
+ if (fs_to_scan) {
+ int fs_objects_now = sb->s_op->nr_cached_objects(sb);
+
+ if (fs_objects_now < fs_objects)
+ fs_to_scan -= fs_objects - fs_objects_now;
+ if (fs_to_scan < 0)
+ fs_to_scan = 0;
+ }
+
+ if (fs_to_scan && sb->s_op->free_cached_objects) {
+ sb->s_op->free_cached_objects(sb, fs_to_scan);
fs_objects = sb->s_op->nr_cached_objects(sb);
}
total_objects = sb->s_nr_dentry_unused +
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 09/10 v3] ext4: adjust some functions for reclaiming extents from extent status tree
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
` (7 preceding siblings ...)
2013-01-23 12:03 ` [PATCH 08/10 v3] fs: allow for fs-specific objects to be pruned as part of pruning inodes Zheng Liu
@ 2013-01-23 12:03 ` Zheng Liu
2013-01-23 12:04 ` [PATCH 10/10 v3] ext4: reclaim " Zheng Liu
9 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:03 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
This commit changes some interfaces in extent status tree because we
need to use inode to count the number of cached objects in a extent
status tree.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/extents_status.c | 49 +++++++++++++++++++++++-------------------------
1 file changed, 23 insertions(+), 26 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index fd4354e..badfaa2 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -142,9 +142,8 @@
static struct kmem_cache *ext4_es_cachep;
-static int __es_insert_extent(struct ext4_es_tree *tree,
- struct extent_status *newes);
-static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
+static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end);
int __init ext4_init_es(void)
@@ -286,7 +285,7 @@ out:
}
static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len,
+ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
ext4_fsblk_t pblk, int status)
{
struct extent_status *es;
@@ -300,7 +299,7 @@ ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len,
return es;
}
-static void ext4_es_free_extent(struct extent_status *es)
+static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
kmem_cache_free(ext4_es_cachep, es);
}
@@ -329,8 +328,9 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
}
static struct extent_status *
-ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
+ext4_es_try_to_merge_left(struct inode *inode, struct extent_status *es)
{
+ struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
struct extent_status *es1;
struct rb_node *node;
@@ -342,7 +342,7 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
if (ext4_es_can_be_merged(es1, es)) {
es1->es_len += es->es_len;
rb_erase(&es->rb_node, &tree->root);
- ext4_es_free_extent(es);
+ ext4_es_free_extent(inode, es);
es = es1;
}
@@ -350,8 +350,9 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
}
static struct extent_status *
-ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
+ext4_es_try_to_merge_right(struct inode *inode, struct extent_status *es)
{
+ struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
struct extent_status *es1;
struct rb_node *node;
@@ -363,15 +364,15 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
if (ext4_es_can_be_merged(es, es1)) {
es->es_len += es1->es_len;
rb_erase(node, &tree->root);
- ext4_es_free_extent(es1);
+ ext4_es_free_extent(inode, es1);
}
return es;
}
-static int __es_insert_extent(struct ext4_es_tree *tree,
- struct extent_status *newes)
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
{
+ struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
struct rb_node **p = &tree->root.rb_node;
struct rb_node *parent = NULL;
struct extent_status *es;
@@ -389,14 +390,14 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
es->es_len += newes->es_len;
es->es_pblk = ext4_es_get_pblock(es,
newes->es_pblk);
- es = ext4_es_try_to_merge_left(tree, es);
+ es = ext4_es_try_to_merge_left(inode, es);
goto out;
}
p = &(*p)->rb_left;
} else if (newes->es_lblk > ext4_es_end(es)) {
if (ext4_es_can_be_merged(es, newes)) {
es->es_len += newes->es_len;
- es = ext4_es_try_to_merge_right(tree, es);
+ es = ext4_es_try_to_merge_right(inode, es);
goto out;
}
p = &(*p)->rb_right;
@@ -406,7 +407,7 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
}
}
- es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len,
+ es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len,
newes->es_pblk, newes->es_status);
if (!es)
return -ENOMEM;
@@ -428,7 +429,6 @@ out:
int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len, ext4_fsblk_t pblk, int status)
{
- struct ext4_es_tree *tree;
struct extent_status newes;
ext4_lblk_t end = lblk + len - 1;
int err = 0;
@@ -445,11 +445,10 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
trace_ext4_es_insert_extent(inode, &newes);
write_lock(&EXT4_I(inode)->i_es_lock);
- tree = &EXT4_I(inode)->i_es_tree;
- err = __es_remove_extent(tree, lblk, end);
+ err = __es_remove_extent(inode, lblk, end);
if (err != 0)
goto error;
- err = __es_insert_extent(tree, &newes);
+ err = __es_insert_extent(inode, &newes);
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
@@ -518,9 +517,10 @@ out:
return found;
}
-static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
- ext4_lblk_t end)
+static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
+ ext4_lblk_t end)
{
+ struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
struct rb_node *node;
struct extent_status *es;
struct extent_status orig_es;
@@ -554,7 +554,7 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
newes.es_pblk = ext4_es_get_pblock(&orig_es,
orig_es.es_pblk + orig_es.es_len - len2);
newes.es_status = orig_es.es_status;
- err = __es_insert_extent(tree, &newes);
+ err = __es_insert_extent(inode, &newes);
if (err) {
es->es_lblk = orig_es.es_lblk;
es->es_len = orig_es.es_len;
@@ -580,7 +580,7 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
while (es && ext4_es_end(es) <= end) {
node = rb_next(&es->rb_node);
rb_erase(&es->rb_node, &tree->root);
- ext4_es_free_extent(es);
+ ext4_es_free_extent(inode, es);
if (!node) {
es = NULL;
break;
@@ -610,7 +610,6 @@ out:
int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len)
{
- struct ext4_es_tree *tree;
ext4_lblk_t end;
int err = 0;
@@ -621,10 +620,8 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
end = lblk + len - 1;
BUG_ON(end < lblk);
- tree = &EXT4_I(inode)->i_es_tree;
-
write_lock(&EXT4_I(inode)->i_es_lock);
- err = __es_remove_extent(tree, lblk, end);
+ err = __es_remove_extent(inode, lblk, end);
write_unlock(&EXT4_I(inode)->i_es_lock);
ext4_es_print_tree(inode);
return err;
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 10/10 v3] ext4: reclaim extents from extent status tree
2013-01-23 12:03 [PATCH 00/10 v3] ext4: extent status tree (step2) Zheng Liu
` (8 preceding siblings ...)
2013-01-23 12:03 ` [PATCH 09/10 v3] ext4: adjust some functions for reclaiming extents from extent status tree Zheng Liu
@ 2013-01-23 12:04 ` Zheng Liu
9 siblings, 0 replies; 15+ messages in thread
From: Zheng Liu @ 2013-01-23 12:04 UTC (permalink / raw)
To: linux-ext4, linux-fsdevel; +Cc: Zheng Liu, Theodore Ts'o
From: Zheng Liu <wenqing.lz@taobao.com>
Although extent status is loaded on-demand, we also need to reclaim
extent from the tree when we are under a heavy memory pressure because
in some cases fragmented extent tree causes status tree costs too much
memory.
Here we maintain a lru list in super_block. When the extent status of
an inode is accessed and changed, this inode will be move to the tail
of the list. The inode will be dropped from this list when it is
cleared. In the inode, a counter is added to count the number of
cached objects in extent status tree. Here only written/unwritten
extent is counted because delayed extent doesn't be reclaimed due to
fiemap, bigalloc and seek_data/hole need it. The counter will be
increased as a new extent is allocated, and it will be decreased as a
extent is freed.
In this commit we define nr_cached_objects and free_cached_objects
callback functions in order to reclaim extents from the tree. In
nr_cached_objects, every inodes in lru list are traversed and all
counters are counted. In free_cached_objects, we traverse every inodes
in the list and reclaim all written/unwritten extents from the extent
status tree of the inode until the number of reclaimed objects is equal
to nr_to_scan or all cached objects are reclaimed.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ext4.h | 6 +++
fs/ext4/extents_status.c | 118 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/extents_status.h | 5 ++
fs/ext4/super.c | 19 +++++++
include/trace/events/ext4.h | 40 +++++++++++++++
5 files changed, 188 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 60d16d1..317aa0b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -881,6 +881,8 @@ struct ext4_inode_info {
/* extents status tree */
struct ext4_es_tree i_es_tree;
rwlock_t i_es_lock;
+ struct list_head i_es_lru;
+ unsigned int i_es_lru_nr; /* protected by i_es_lock */
/* ialloc */
ext4_group_t i_last_alloc_group;
@@ -1296,6 +1298,10 @@ struct ext4_sb_info {
/* Precomputed FS UUID checksum for seeding other checksums */
__u32 s_csum_seed;
+
+ /* Reclaim extents from extent status tree */
+ struct list_head s_es_lru;
+ spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
};
static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index badfaa2..dbdcb2b 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -145,6 +145,8 @@ static struct kmem_cache *ext4_es_cachep;
static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end);
+static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
+ int nr_to_scan);
int __init ext4_init_es(void)
{
@@ -280,6 +282,7 @@ out:
read_unlock(&EXT4_I(inode)->i_es_lock);
+ ext4_es_lru_add(inode);
trace_ext4_es_find_extent_exit(inode, es, ret);
return ret;
}
@@ -296,11 +299,24 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
es->es_len = len;
es->es_pblk = pblk;
es->es_status = status;
+
+ /*
+ * We don't count delayed extent because we never try to reclaim them
+ */
+ if (!ext4_es_is_delayed(es))
+ EXT4_I(inode)->i_es_lru_nr++;
+
return es;
}
static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
+ /* Decrease the lru count when this es is not delayed */
+ if (!ext4_es_is_delayed(es)) {
+ BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
+ EXT4_I(inode)->i_es_lru_nr--;
+ }
+
kmem_cache_free(ext4_es_cachep, es);
}
@@ -453,6 +469,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
+ ext4_es_lru_add(inode);
ext4_es_print_tree(inode);
return err;
@@ -513,6 +530,7 @@ out:
read_unlock(&EXT4_I(inode)->i_es_lock);
+ ext4_es_lru_add(inode);
trace_ext4_es_lookup_extent_exit(inode, es, found);
return found;
}
@@ -626,3 +644,103 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_es_print_tree(inode);
return err;
}
+
+void ext4_es_lru_add(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ spin_lock(&sbi->s_es_lru_lock);
+ if (list_empty(&ei->i_es_lru))
+ list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
+ else
+ list_move_tail(&ei->i_es_lru, &sbi->s_es_lru);
+ spin_unlock(&sbi->s_es_lru_lock);
+}
+
+void ext4_es_lru_del(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ spin_lock(&sbi->s_es_lru_lock);
+ if (!list_empty(&ei->i_es_lru))
+ list_del_init(&ei->i_es_lru);
+ spin_unlock(&sbi->s_es_lru_lock);
+}
+
+int ext4_es_reclaim_extents_count(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_inode_info *ei;
+ struct list_head *cur;
+ int nr_cached = 0;
+
+ spin_lock(&sbi->s_es_lru_lock);
+ list_for_each(cur, &sbi->s_es_lru) {
+ ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
+ read_lock(&ei->i_es_lock);
+ nr_cached += ei->i_es_lru_nr;
+ read_unlock(&ei->i_es_lock);
+ }
+ spin_unlock(&sbi->s_es_lru_lock);
+ trace_ext4_es_reclaim_extents_count(sb, nr_cached);
+ return nr_cached;
+}
+
+static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
+ int nr_to_scan)
+{
+ struct inode *inode = &ei->vfs_inode;
+ struct ext4_es_tree *tree = &ei->i_es_tree;
+ struct rb_node *node;
+ struct extent_status *es;
+ int nr_shrunk = 0;
+
+ if (ei->i_es_lru_nr == 0)
+ return 0;
+
+ node = rb_first(&tree->root);
+ while (node != NULL) {
+ es = rb_entry(node, struct extent_status, rb_node);
+ node = rb_next(&es->rb_node);
+ /*
+ * We can't reclaim delayed extent from status tree because
+ * fiemap, bigallic, and seek_data/hole need to use it.
+ */
+ if (!ext4_es_is_delayed(es)) {
+ rb_erase(&es->rb_node, &tree->root);
+ ext4_es_free_extent(inode, es);
+ nr_shrunk++;
+ if (--nr_to_scan == 0)
+ break;
+ }
+ }
+ tree->cache_es = NULL;
+ return nr_shrunk;
+}
+
+void ext4_es_reclaim_extents(struct super_block *sb, int nr_to_scan)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_inode_info *ei;
+ struct list_head *cur;
+
+ trace_ext4_es_reclaim_extents(sb, nr_to_scan);
+ spin_lock(&sbi->s_es_lru_lock);
+ list_for_each(cur, &sbi->s_es_lru) {
+ ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
+ read_lock(&ei->i_es_lock);
+ if (ei->i_es_lru_nr == 0) {
+ read_unlock(&ei->i_es_lock);
+ continue;
+ }
+ read_unlock(&ei->i_es_lock);
+ write_lock(&ei->i_es_lock);
+ nr_to_scan -= __es_try_to_reclaim_extents(ei, nr_to_scan);
+ write_unlock(&ei->i_es_lock);
+ if (nr_to_scan == 0)
+ break;
+ }
+ spin_unlock(&sbi->s_es_lru_lock);
+}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 6350d35..17e7a28 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -77,4 +77,9 @@ static inline ext4_fsblk_t ext4_es_get_pblock(struct extent_status *es,
return (ext4_es_is_delayed(es) ? ~0 : pb);
}
+extern void ext4_es_lru_add(struct inode *inode);
+extern void ext4_es_lru_del(struct inode *inode);
+extern int ext4_es_reclaim_extents_count(struct super_block *sb);
+extern void ext4_es_reclaim_extents(struct super_block *sb, int nr_to_scan);
+
#endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a35c6c1..3a2c3a7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -943,6 +943,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
spin_lock_init(&ei->i_prealloc_lock);
ext4_es_init_tree(&ei->i_es_tree);
rwlock_init(&ei->i_es_lock);
+ INIT_LIST_HEAD(&ei->i_es_lru);
+ ei->i_es_lru_nr = 0;
ei->i_reserved_data_blocks = 0;
ei->i_reserved_meta_blocks = 0;
ei->i_allocated_meta_blocks = 0;
@@ -1030,6 +1032,7 @@ void ext4_clear_inode(struct inode *inode)
dquot_drop(inode);
ext4_discard_preallocations(inode);
ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
+ ext4_es_lru_del(inode);
if (EXT4_I(inode)->jinode) {
jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
EXT4_I(inode)->jinode);
@@ -1101,6 +1104,16 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
return try_to_free_buffers(page);
}
+static int nr_cached_objects(struct super_block *sb)
+{
+ return ext4_es_reclaim_extents_count(sb);
+}
+
+static void free_cached_objects(struct super_block *sb, int nr_to_scan)
+{
+ ext4_es_reclaim_extents(sb, nr_to_scan);
+}
+
#ifdef CONFIG_QUOTA
#define QTYPE2NAME(t) ((t) == USRQUOTA ? "user" : "group")
#define QTYPE2MOPT(on, t) ((t) == USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA))
@@ -1176,6 +1189,8 @@ static const struct super_operations ext4_sops = {
.quota_write = ext4_quota_write,
#endif
.bdev_try_to_free_page = bdev_try_to_free_page,
+ .nr_cached_objects = nr_cached_objects,
+ .free_cached_objects = free_cached_objects,
};
static const struct super_operations ext4_nojournal_sops = {
@@ -1194,6 +1209,8 @@ static const struct super_operations ext4_nojournal_sops = {
.quota_write = ext4_quota_write,
#endif
.bdev_try_to_free_page = bdev_try_to_free_page,
+ .nr_cached_objects = nr_cached_objects,
+ .free_cached_objects = free_cached_objects,
};
static const struct export_operations ext4_export_ops = {
@@ -3770,6 +3787,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_stripe = ext4_get_stripe_size(sbi);
sbi->s_max_writeback_mb_bump = 128;
sbi->s_extent_max_zeroout_kb = 32;
+ INIT_LIST_HEAD(&sbi->s_es_lru);
+ spin_lock_init(&sbi->s_es_lru_lock);
/*
* set up enough so that it can read an inode
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f23c177..816f0ae 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2233,6 +2233,46 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
__entry->found ? __entry->status : 0)
);
+TRACE_EVENT(ext4_es_reclaim_extents_count,
+ TP_PROTO(struct super_block *sb, int nr_cached),
+
+ TP_ARGS(sb, nr_cached),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( int, nr_cached )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->nr_cached = nr_cached;
+ ),
+
+ TP_printk("dev %d,%d cached objects nr %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->nr_cached)
+);
+
+TRACE_EVENT(ext4_es_reclaim_extents,
+ TP_PROTO(struct super_block *sb, int nr_to_scan),
+
+ TP_ARGS(sb, nr_to_scan),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( int, nr_to_scan )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->nr_to_scan = nr_to_scan;
+ ),
+
+ TP_printk("dev %d,%d nr to scan %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->nr_to_scan)
+);
+
#endif /* _TRACE_EXT4_H */
/* This part must be outside protection */
--
1.7.12.rc2.18.g61b472e
^ permalink raw reply related [flat|nested] 15+ messages in thread