ocfs2-devel.oss.oracle.com archive mirror
* [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1.
@ 2010-12-28 11:40 Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 1/8] Ocfs2: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2 Tristan Ye
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 11:40 UTC (permalink / raw)
  To: ocfs2-devel

Hi All,

	This is a rough v1 patch series for online defragmentation on OCFS2. It is
workable, though it may still look ugly ;) The essence of online file defragmentation
is moving extents, as btrfs and ext4 do. Adding the 'OCFS2_IOC_MOVE_EXT' ioctl to
ocfs2 allows two defragmentation strategies:

1. Simple defragmentation in the kernel: the kernel is responsible for
   claiming new clusters and packing the defragmented extents according to a
   user-specified threshold.

2. Simple extent moving: userspace plays a much larger role here. It must
   specify the new physical blk_offset to which the extents will be moved, and
   the kernel does nothing more than move the extents as requested. The kernel
   may also need to probe/validate the new blk_offset to guarantee that there
   is enough free space around it.

Both operations use the same OCFS2_IOC_MOVE_EXT ioctl:
-------------------------------------------------------------------------------
#define OCFS2_MOVE_EXT_FL_AUTO_DEFRAG   (0x00000001)    /* Kernel manages to
                                                           claim new clusters
                                                           as the goal place
                                                           for extents moving */
#define OCFS2_MOVE_EXT_FL_COMPLETE      (0x00000002)    /* Move or defragmentation
                                                           completely gets done.
                                                         */
struct ocfs2_move_extents {
/* All values are in bytes */
        /* in */
        __u64 me_start;         /* Virtual start in the file to move */
        __u64 me_len;           /* Length of the extents to be moved */
        __u64 me_goal;          /* Physical offset of the goal */
        __u64 me_thresh;        /* Maximum distance from goal or threshold
                                   for auto defragmentation */
        __u64 me_flags;         /* flags for the operation:
                                 * - auto defragmentation.
                                 * - refcount,xattr cases.
                                 */

        /* out */
        __u64 me_moved_len;     /* moved length, are we completely done? */
        __u64 me_new_offset;    /* Resulting physical location */
        __u32 me_reserved[3];   /* reserved for future */
};
-------------------------------------------------------------------------------

	The current V1 patch set focuses mostly on strategy #1, since strategy #2
is still under discussion.
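
For illustration, a userspace caller of strategy #1 might fill in the request
roughly as sketched below. This is a minimal sketch and not part of the patch
set: the struct layout and ioctl number are copied from patch 1/8 (which
declares me_reserved[2]), uint64_t/uint32_t stand in for __u64/__u32, and
fill_auto_defrag() and defrag_range() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Copied from patch 1/8; fixed-width types stand in for __u64/__u32. */
#define OCFS2_MOVE_EXT_FL_AUTO_DEFRAG	(0x00000001)
#define OCFS2_MOVE_EXT_FL_COMPLETE	(0x00000002)

struct ocfs2_move_extents {
	/* in, all values in bytes */
	uint64_t me_start;	/* virtual start in the file to move */
	uint64_t me_len;	/* length of the extents to be moved */
	uint64_t me_goal;	/* physical offset of the goal */
	uint64_t me_thresh;	/* threshold for auto defragmentation */
	uint64_t me_flags;	/* flags for the operation */
	/* out */
	uint64_t me_moved_len;	/* moved length */
	uint64_t me_new_offset;	/* resulting physical location */
	uint32_t me_reserved[2];
};

#define OCFS2_IOC_MOVE_EXT	_IOW('o', 6, struct ocfs2_move_extents)

/* Hypothetical helper: prepare a request for in-kernel auto defrag. */
static void fill_auto_defrag(struct ocfs2_move_extents *req,
			     uint64_t start, uint64_t len, uint64_t thresh)
{
	assert(req != NULL);

	memset(req, 0, sizeof(*req));
	req->me_start = start;
	req->me_len = len;
	req->me_thresh = thresh;
	req->me_flags = OCFS2_MOVE_EXT_FL_AUTO_DEFRAG;
}

/* Hypothetical helper: returns 0 on success, -1 with errno set on failure. */
static int defrag_range(int fd, struct ocfs2_move_extents *req)
{
	return ioctl(fd, OCFS2_IOC_MOVE_EXT, req);
}
```

A caller would open the file for writing, call fill_auto_defrag() and then
defrag_range(); on return, me_moved_len should tell how much of the range was
actually processed.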

	Following are some interesting data gathered from simple tests:

1. Performance improvement gained on I/O reads:
-------------------------------------------------------------------------------
* Before defragmentation *

[root@ocfs2-box4 ~]# sync
[root@ocfs2-box4 ~]# echo 3>/proc/sys/vm/drop_caches 
[root@ocfs2-box4 ~]# time dd if=/storage/testfile-1 of=/dev/null
640000+0 records in
640000+0 records out
327680000 bytes (328 MB) copied, 19.9351 s, 16.4 MB/s

real	0m19.954s
user	0m0.246s
sys	0m1.111s

* Do defragmentation *

[root@ocfs2-box4 defrag]# ./defrag -s 0 -l 293601280  -t 3145728 /storage/testfile-1

* After defragmentation *

[root@ocfs2-box4 ~]# sync
[root@ocfs2-box4 ~]# echo 3>/proc/sys/vm/drop_caches
[root@ocfs2-box4 ~]# time dd if=/storage/testfile-1 of=/dev/null
640000+0 records in
640000+0 records out
327680000 bytes (328 MB) copied, 6.79885 s, 48.2 MB/s

real	0m6.969s
user	0m0.209s
sys	0m1.063s
-------------------------------------------------------------------------------
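
The dd figures above can be cross-checked with a little arithmetic. The sketch
below recomputes the throughput from the byte count and elapsed time that dd
printed, and the ratio of the two elapsed times, which comes out at roughly
2.9x. The helper names throughput_mb_s() and speedup() are hypothetical and
exist only for this illustration.

```c
#include <assert.h>

/* Throughput in MB/s, with 1 MB = 10^6 bytes as dd reports it. */
static double throughput_mb_s(double bytes, double seconds)
{
	assert(seconds > 0.0);
	return bytes / seconds / 1e6;
}

/* Ratio of elapsed times: how many times faster the second run was. */
static double speedup(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s;
}
```

For example, throughput_mb_s(327680000.0, 19.9351) reproduces the "before"
figure and speedup(19.9351, 6.79885) gives the improvement factor.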


2. Extent tree layout via debugfs.ocfs2:
-------------------------------------------------------------------------------
* Before defragmentation *

        Tree Depth: 1   Count: 243   Next Free Rec: 8
        ## Offset        Clusters       Block#
        0  0             1173           86561
        1  1173          1173           84527
        2  2346          1151           81468
        3  3497          1173           76362
        4  4670          1173           74328
        5  5843          1172           66150
        6  7015          1460           70260
        7  8475          662            87680
        SubAlloc Bit: 1   SubAlloc Slot: 0
        Blknum: 86561   Next Leaf: 84527
        CRC32: abf06a6b   ECC: 44bc
        Tree Depth: 0   Count: 252   Next Free Rec: 252
        ## Offset        Clusters       Block#          Flags
        0  1             16             516104          0x0
        1  17            1              554632          0x0
        2  18            7              560144          0x0
        3  25            1              565960          0x0
        4  26            1              572632          0x
	...
	/* around 1700 extent records were hidden there */
	...
	138 9131          1              258968          0x0
        139 9132          1              259568          0x0
        140 9133          1              260168          0x0
        141 9134          1              260768          0x0
        142 9135          1              261368          0x0
        143 9136          1              261968          0x0

* After defragmentation *

      Tree Depth: 1   Count: 243   Next Free Rec: 1
	## Offset        Clusters       Block#
	0  0             9137           66081
	SubAlloc Bit: 1   SubAlloc Slot: 0
	Blknum: 66081   Next Leaf: 0
	CRC32: 22897d34   ECC: 0619
	Tree Depth: 0   Count: 252   Next Free Rec: 6
	## Offset        Clusters       Block#          Flags
	0  1             1600           4412936         0x0 
	1  1601          1595           20669448        0x0 
	2  3196          1600           9358856         0x0 
	3  4796          1404           14516232        0x0 
	4  6200          1600           21627400        0x0 
	5  7800          1337           7483400         0x0 
-------------------------------------------------------------------------------


TO-DO:

1. Complete strategy #2
2. Adding refcount/xattr/unwritten_extents support.
3. Free space defragmentation.


Go to http://oss.oracle.com/osswiki/OCFS2/DesignDocs/OnlineDefrag for more details.


Tristan.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 1/8] Ocfs2: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
  2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
@ 2010-12-28 11:40 ` Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 2/8] Ocfs2: Add basic framework and source files for extent moving Tristan Ye
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 11:40 UTC (permalink / raw)
  To: ocfs2-devel

This patch also adds a structure for manipulating this ioctl.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
---
 fs/ocfs2/ocfs2_ioctl.h |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/ocfs2_ioctl.h b/fs/ocfs2/ocfs2_ioctl.h
index b46f39b..9340ad3 100644
--- a/fs/ocfs2/ocfs2_ioctl.h
+++ b/fs/ocfs2/ocfs2_ioctl.h
@@ -171,4 +171,31 @@ enum ocfs2_info_type {
 
 #define OCFS2_IOC_INFO		_IOR('o', 5, struct ocfs2_info)
 
+#define OCFS2_MOVE_EXT_FL_AUTO_DEFRAG	(0x00000001)	/* Kernel manages to
+							   claim new clusters
+							   as the goal place
+							   for extents moving */
+#define OCFS2_MOVE_EXT_FL_COMPLETE	(0x00000002)	/* Move or defragmentation
+							   completely gets done.
+							 */
+struct ocfs2_move_extents {
+/* All values are in bytes */
+	/* in */
+	__u64 me_start;		/* Virtual start in the file to move */
+	__u64 me_len;		/* Length of the extents to be moved */
+	__u64 me_goal;		/* Physical offset of the goal */
+	__u64 me_thresh;	/* Maximum distance from goal or threshold
+				   for auto defragmentation */
+	__u64 me_flags;		/* flags for the operation:
+				 * - auto defragmentation.
+				 * - refcount,xattr cases.
+				 */
+	/* out */
+	__u64 me_moved_len;	/* moved length */
+	__u64 me_new_offset;	/* Resulting physical location */
+	__u32 me_reserved[2];	/* reserved for future */
+};
+
+#define OCFS2_IOC_MOVE_EXT	_IOW('o', 6, struct ocfs2_move_extents)
+
 #endif /* OCFS2_IOCTL_H */
-- 
1.5.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 2/8] Ocfs2: Add basic framework and source files for extent moving.
  2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 1/8] Ocfs2: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2 Tristan Ye
@ 2010-12-28 11:40 ` Tristan Ye
  2010-12-28 15:11   ` Tao Ma
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 3/8] Ocfs2: Duplicate old clusters into new blk_offset by dirty and remap pages Tristan Ye
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 11:40 UTC (permalink / raw)
  To: ocfs2-devel

This patch does the following:

1. Adds new files move_extents.[c|h] and fills them with a basic framework.

2. Adds a handler for the new ioctl OCFS2_IOC_MOVE_EXT.

The pre- and post-work for extent moving is much like ocfs2_change_file_space():
it also needs to validate permissions, acquire several kinds of locks, update
mtime, and so on.

Besides these, this patch also introduces a new type of context structure
to make the extent-moving code easier to manage.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
---
 fs/ocfs2/Makefile          |    1 +
 fs/ocfs2/cluster/masklog.h |    1 +
 fs/ocfs2/ioctl.c           |    5 +
 fs/ocfs2/move_extents.c    |  218 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/move_extents.h    |   22 +++++
 5 files changed, 247 insertions(+), 0 deletions(-)
 create mode 100644 fs/ocfs2/move_extents.c
 create mode 100644 fs/ocfs2/move_extents.h

diff --git a/fs/ocfs2/Makefile b/fs/ocfs2/Makefile
index 07d9fd8..5166f2d 100644
--- a/fs/ocfs2/Makefile
+++ b/fs/ocfs2/Makefile
@@ -30,6 +30,7 @@ ocfs2-objs := \
 	namei.o 		\
 	refcounttree.o		\
 	reservations.o		\
+	move_extents.o		\
 	resize.o		\
 	slot_map.o 		\
 	suballoc.o 		\
diff --git a/fs/ocfs2/cluster/masklog.h b/fs/ocfs2/cluster/masklog.h
index ea2ed9f..23f9312 100644
--- a/fs/ocfs2/cluster/masklog.h
+++ b/fs/ocfs2/cluster/masklog.h
@@ -121,6 +121,7 @@
 #define ML_KTHREAD	0x0000000400000000ULL /* kernel thread activity */
 #define ML_RESERVATIONS	0x0000000800000000ULL /* ocfs2 alloc reservations */
 #define ML_CLUSTER	0x0000001000000000ULL /* cluster stack */
+#define ML_MOVE_EXT	0x0000002000000000ULL /* extent moving */
 
 #define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_NOTICE)
 #define MLOG_INITIAL_NOT_MASK (ML_ENTRY|ML_EXIT)
diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index 7a48681..b8fe7a7 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -23,6 +23,7 @@
 #include "ioctl.h"
 #include "resize.h"
 #include "refcounttree.h"
+#include "move_extents.h"
 
 #include <linux/ext2_fs.h>
 
@@ -523,6 +524,8 @@ long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 			return -EFAULT;
 
 		return ocfs2_info_handle(inode, &info, 0);
+	case OCFS2_IOC_MOVE_EXT:
+		return ocfs2_ioctl_move_extents(filp, (void __user *)arg);
 	default:
 		return -ENOTTY;
 	}
@@ -565,6 +568,8 @@ long ocfs2_compat_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 			return -EFAULT;
 
 		return ocfs2_info_handle(inode, &info, 1);
+	case OCFS2_IOC_MOVE_EXT:
+		break;
 	default:
 		return -ENOIOCTLCMD;
 	}
diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
new file mode 100644
index 0000000..f5bbb28
--- /dev/null
+++ b/fs/ocfs2/move_extents.c
@@ -0,0 +1,218 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * move_extents.c
+ *
+ * Copyright (C) 2010 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/fs.h>
+#include <linux/types.h>
+#include <linux/mount.h>
+#include <linux/swap.h>
+
+#define MLOG_MASK_PREFIX ML_MOVE_EXT
+#include <cluster/masklog.h>
+
+#include "ocfs2.h"
+#include "ocfs2_ioctl.h"
+
+#include "alloc.h"
+#include "aops.h"
+#include "dlmglue.h"
+#include "extent_map.h"
+#include "inode.h"
+#include "journal.h"
+#include "suballoc.h"
+#include "uptodate.h"
+#include "super.h"
+#include "move_extents.h"
+
+struct ocfs2_move_extents_context {
+	struct inode *inode;
+	struct file *file;
+	int auto_defrag;
+	int credits;
+	u32 clusters_moved;
+	struct ocfs2_move_extents *range;
+	struct ocfs2_extent_tree et;
+	struct ocfs2_alloc_context *meta_ac;
+	struct ocfs2_alloc_context *data_ac;
+	struct ocfs2_cached_dealloc_ctxt dealloc;
+};
+
+static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
+{
+	int status;
+	handle_t *handle;
+	struct inode *inode = context->inode;
+	struct buffer_head *di_bh = NULL;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+	if (!inode)
+		return -ENOENT;
+
+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
+		return -EROFS;
+
+	mutex_lock(&inode->i_mutex);
+
+	/*
+	 * This prevents concurrent writes from other nodes
+	 */
+	status = ocfs2_rw_lock(inode, 1);
+	if (status) {
+		mlog_errno(status);
+		goto out;
+	}
+
+	status = ocfs2_inode_lock(inode, &di_bh, 1);
+	if (status) {
+		mlog_errno(status);
+		goto out_rw_unlock;
+	}
+
+	/*
+	 * remember ip_xattr_sem also needs to be held if necessary
+	 */
+	down_write(&OCFS2_I(inode)->ip_alloc_sem);
+
+	/*
+	 * the real extent-moving code will be filled in later.
+	 *
+	 * status = __ocfs2_move_extents_range(di_bh, context);
+	 */
+
+	up_write(&OCFS2_I(inode)->ip_alloc_sem);
+	if (status) {
+		mlog_errno(status);
+		goto out_inode_unlock;
+	}
+
+	/*
+	 * We update mtime for these changes
+	 */
+	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
+	if (IS_ERR(handle)) {
+		status = PTR_ERR(handle);
+		mlog_errno(status);
+		goto out_inode_unlock;
+	}
+
+	inode->i_mtime = CURRENT_TIME;
+	status = ocfs2_mark_inode_dirty(handle, inode, di_bh);
+	if (status < 0)
+		mlog_errno(status);
+
+	ocfs2_commit_trans(osb, handle);
+
+out_inode_unlock:
+	brelse(di_bh);
+	ocfs2_inode_unlock(inode, 1);
+out_rw_unlock:
+	ocfs2_rw_unlock(inode, 1);
+out:
+	mutex_unlock(&inode->i_mutex);
+
+	return status;
+}
+
+int ocfs2_ioctl_move_extents(struct file *filp, void __user *argp)
+{
+	int status;
+
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct ocfs2_move_extents range;
+	struct ocfs2_move_extents_context *context = NULL;
+
+	status = mnt_want_write(filp->f_path.mnt);
+	if (status)
+		return status;
+
+	status = -EINVAL;
+
+	if (!S_ISREG(inode->i_mode)) {
+		goto out;
+	} else {
+		if (!(filp->f_mode & FMODE_WRITE))
+			goto out;
+	}
+
+	if (inode->i_flags & (S_IMMUTABLE|S_APPEND)) {
+		status = -EPERM;
+		goto out;
+	}
+
+	context = kzalloc(sizeof(struct ocfs2_move_extents_context), GFP_NOFS);
+	if (!context) {
+		status = -ENOMEM;
+		mlog_errno(status);
+		goto out;
+	}
+
+	context->inode = inode;
+	context->file = filp;
+
+	if (!argp) {
+		memset(&range, 0, sizeof(range));
+		range.me_len = (u64)-1;
+		range.me_flags |= OCFS2_MOVE_EXT_FL_AUTO_DEFRAG;
+		context->auto_defrag = 1;
+	} else {
+		if (copy_from_user(&range, (struct ocfs2_move_extents *)argp,
+				   sizeof(range))) {
+			status = -EFAULT;
+			goto out;
+		}
+	}
+
+	context->range = &range;
+
+	if (range.me_flags & OCFS2_MOVE_EXT_FL_AUTO_DEFRAG) {
+		context->auto_defrag = 1;
+		if (!range.me_thresh)
+			range.me_thresh = 1024 * 1024;
+	} else {
+		/*
+		 * TO-DO XXX: validate the range.me_goal here.
+		 *
+		 * - should be cluster aligned.
+		 * - should contain enough free clusters around range.me_goal.
+		 * - strategy of moving extent to an appropriate goal is still
+		 *   being discussed.
+		 */
+	}
+
+	/*
+	 * returning -EINVAL here.
+	 */
+	if (range.me_start > i_size_read(inode))
+		goto out;
+
+	if (range.me_start + range.me_len > i_size_read(inode))
+			range.me_len = i_size_read(inode) - range.me_start;
+
+	status = ocfs2_move_extents(context);
+	if (status)
+		mlog_errno(status);
+
+	if (argp) {
+		if (copy_to_user((struct ocfs2_move_extents *)argp, &range,
+				sizeof(range)))
+			status = -EFAULT;
+	}
+out:
+	mnt_drop_write(filp->f_path.mnt);
+
+	kfree(context);
+
+	return status;
+}
diff --git a/fs/ocfs2/move_extents.h b/fs/ocfs2/move_extents.h
new file mode 100644
index 0000000..d9f0f29
--- /dev/null
+++ b/fs/ocfs2/move_extents.h
@@ -0,0 +1,22 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * move_extents.h
+ *
+ * Copyright (C) 2010 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef OCFS2_MOVE_EXTENTS_H
+#define OCFS2_MOVE_EXTENTS_H
+
+int ocfs2_ioctl_move_extents(struct file *filp,  void __user *argp);
+
+#endif /* OCFS2_MOVE_EXTENTS_H */
-- 
1.5.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 3/8] Ocfs2: Duplicate old clusters into new blk_offset by dirty and remap pages.
  2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 1/8] Ocfs2: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2 Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 2/8] Ocfs2: Add basic framework and source files for extent moving Tristan Ye
@ 2010-12-28 11:40 ` Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 4/8] Ocfs2: move a range of extent Tristan Ye
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 11:40 UTC (permalink / raw)
  To: ocfs2-devel

Most of the code for duplicating clusters into the new_blkoffset, where the
extent will be moved, was copied from refcounttree's CoW logic. The intention of
making an exact copy here is to get the extent-moving code working from the very
beginning, and also to avoid interfering with refcounttree's code as far as
possible.

The final version will share some common functions with the refcounttree part.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
---
 fs/ocfs2/move_extents.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index f5bbb28..3601545 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -49,6 +49,113 @@ struct ocfs2_move_extents_context {
 	struct ocfs2_cached_dealloc_ctxt dealloc;
 };
 
+static int ocfs2_clear_cow_buffer(handle_t *handle, struct buffer_head *bh)
+{
+	BUG_ON(buffer_dirty(bh));
+
+	clear_buffer_mapped(bh);
+
+	return 0;
+}
+
+#define MAX_CONTIG_BYTES        1048576
+
+static inline unsigned int ocfs2_cow_contig_clusters(struct super_block *sb)
+{
+	return ocfs2_clusters_for_bytes(sb, MAX_CONTIG_BYTES);
+}
+
+/*
+ * Duplicate clusters into the new blk_offset; the following code was mostly
+ * taken from the CoW path in refcounttree.c.
+ */
+int ocfs2_duplicate_clusters_by_page(handle_t *handle,
+				     struct ocfs2_move_extents_context *context,
+				     u32 cpos, u32 old_cluster,
+				     u32 new_cluster, u32 len)
+{
+	int ret = 0, partial;
+	struct inode *inode = context->inode;
+	struct ocfs2_caching_info *ci = INODE_CACHE(inode);
+	struct super_block *sb = ocfs2_metadata_cache_get_super(ci);
+	u64 new_block = ocfs2_clusters_to_blocks(sb, new_cluster);
+	struct page *page;
+	pgoff_t page_index;
+	unsigned int from, to, readahead_pages;
+	loff_t offset, end, map_end;
+	struct address_space *mapping = inode->i_mapping;
+
+	mlog(0, "old_cluster %u, new %u, len %u at offset %u\n", old_cluster,
+	     new_cluster, len, cpos);
+
+	readahead_pages =
+		(ocfs2_cow_contig_clusters(sb) <<
+		 OCFS2_SB(sb)->s_clustersize_bits) >> PAGE_CACHE_SHIFT;
+	offset = ((loff_t)cpos) << OCFS2_SB(sb)->s_clustersize_bits;
+	end = offset + ((loff_t)len << OCFS2_SB(sb)->s_clustersize_bits);
+
+	if (end > i_size_read(inode))
+		end = i_size_read(inode);
+
+	while (offset < end) {
+		page_index = offset >> PAGE_CACHE_SHIFT;
+		map_end = ((loff_t)page_index + 1) << PAGE_CACHE_SHIFT;
+		if (map_end > end)
+			map_end = end;
+
+		from = offset & (PAGE_CACHE_SIZE - 1);
+		to = PAGE_CACHE_SIZE;
+		if (map_end & (PAGE_CACHE_SIZE - 1))
+			to = map_end & (PAGE_CACHE_SIZE - 1);
+
+		page = find_or_create_page(mapping, page_index, GFP_NOFS);
+
+		if (PAGE_CACHE_SIZE <= OCFS2_SB(sb)->s_clustersize)
+			BUG_ON(PageDirty(page));
+
+		if (PageReadahead(page) && context->file) {
+			page_cache_async_readahead(mapping,
+						   &context->file->f_ra,
+						   context->file,
+						   page, page_index,
+						   readahead_pages);
+		}
+
+		if (!PageUptodate(page)) {
+			ret = block_read_full_page(page, ocfs2_get_block);
+			if (ret) {
+				mlog_errno(ret);
+				goto unlock;
+			}
+			lock_page(page);
+		}
+
+		if (page_has_buffers(page)) {
+			ret = walk_page_buffers(handle, page_buffers(page),
+						from, to, &partial,
+						ocfs2_clear_cow_buffer);
+			if (ret) {
+				mlog_errno(ret);
+				goto unlock;
+			}
+		}
+
+		ocfs2_map_and_dirty_page(inode,
+					 handle, from, to,
+					 page, 0, &new_block);
+		mark_page_accessed(page);
+unlock:
+		unlock_page(page);
+		page_cache_release(page);
+		page = NULL;
+		offset = map_end;
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
 static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
 {
 	int status;
-- 
1.5.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 4/8] Ocfs2: move a range of extent.
  2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
                   ` (2 preceding siblings ...)
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 3/8] Ocfs2: Duplicate old clusters into new blk_offset by dirty and remap pages Tristan Ye
@ 2010-12-28 11:40 ` Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 5/8] Ocfs2: lock allocators and reserve metadata blocks and data clusters for extents moving Tristan Ye
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 11:40 UTC (permalink / raw)
  To: ocfs2-devel

__ocfs2_move_extent() always operates on a range within a single extent. It
consists of the following parts:

1. Duplicate the clusters, page by page, into new_blkoffset, where the extent is to be moved.

2. Split the original extent with the new extent, coalescing nearby extents if possible.

3. Append the old clusters to the truncate log.
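
The three steps above can be modelled with a toy in-memory "disk". This is a
deliberately simplified sketch, not the kernel logic: it ignores journaling,
extent splitting and coalescing, and all names (toy_extent, toy_freed,
toy_move_extent(), TOY_CLUSTER_SIZE) are hypothetical. It assumes the old and
new cluster ranges both lie inside the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_CLUSTER_SIZE	4u	/* bytes per cluster in this toy model */

struct toy_extent {
	uint32_t cpos;		/* logical start, in clusters */
	uint32_t p_cpos;	/* physical start, in clusters */
	uint32_t clusters;	/* length, in clusters */
};

struct toy_freed {
	uint32_t p_cpos;	/* old physical start to be released */
	uint32_t clusters;
};

static struct toy_freed toy_move_extent(uint8_t *disk, struct toy_extent *rec,
					uint32_t new_p_cpos)
{
	struct toy_freed old;

	assert(disk != NULL && rec != NULL);
	old.p_cpos = rec->p_cpos;
	old.clusters = rec->clusters;

	/* 1. Duplicate the data at the new physical location. */
	memmove(disk + (size_t)new_p_cpos * TOY_CLUSTER_SIZE,
		disk + (size_t)rec->p_cpos * TOY_CLUSTER_SIZE,
		(size_t)rec->clusters * TOY_CLUSTER_SIZE);

	/* 2. Point the extent record at the new clusters. */
	rec->p_cpos = new_p_cpos;

	/* 3. Hand back the old range so the caller can queue it for freeing. */
	return old;
}
```

In the real code the returned range corresponds to what gets appended to the
truncate log, and step 2 is done through ocfs2_split_extent() rather than by a
plain assignment.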

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
---
 fs/ocfs2/move_extents.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 3601545..7e9daab 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -156,6 +156,88 @@ unlock:
 	return ret;
 }
 
+static int __ocfs2_move_extent(handle_t *handle,
+			       struct ocfs2_move_extents_context *context,
+			       u32 cpos, u32 len, u32 p_cpos, u32 new_p_cpos)
+{
+	int ret = 0, index;
+	struct inode *inode = context->inode;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	struct ocfs2_extent_rec *rec, replace_rec;
+	struct ocfs2_path *path = NULL;
+	struct ocfs2_extent_list *el;
+	u64 ino = ocfs2_metadata_cache_owner(context->et.et_ci);
+	u64 old_blkno = ocfs2_clusters_to_blocks(inode->i_sb, p_cpos);
+
+	ret = ocfs2_duplicate_clusters_by_page(handle, context, cpos, p_cpos,
+					       new_p_cpos, len);
+	if (ret) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+	memset(&replace_rec, 0, sizeof(replace_rec));
+	replace_rec.e_cpos = cpu_to_le32(cpos);
+	replace_rec.e_leaf_clusters = cpu_to_le16(len);
+	replace_rec.e_blkno = cpu_to_le64(ocfs2_clusters_to_blocks(inode->i_sb,
+								   new_p_cpos));
+
+	path = ocfs2_new_path_from_et(&context->et);
+	if (!path) {
+		ret = -ENOMEM;
+		mlog_errno(ret);
+		goto out;
+	}
+
+	ret = ocfs2_find_path(INODE_CACHE(inode), path, cpos);
+	if (ret) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+	el = path_leaf_el(path);
+
+	index = ocfs2_search_extent_list(el, cpos);
+	if (index == -1 || index >= le16_to_cpu(el->l_next_free_rec)) {
+		ocfs2_error(inode->i_sb,
+			    "Inode %llu has an extent at cpos %u which can no "
+			    "longer be found.\n",
+			    (unsigned long long)ino, cpos);
+		ret = -EROFS;
+		goto out;
+	}
+
+	rec = &el->l_recs[index];
+	replace_rec.e_flags = rec->e_flags;
+
+	ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode),
+				      context->et.et_root_bh,
+				      OCFS2_JOURNAL_ACCESS_WRITE);
+	if (ret) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+	ret = ocfs2_split_extent(handle, &context->et, path, index,
+				 &replace_rec, context->meta_ac,
+				 &context->dealloc);
+	if (ret) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+	ocfs2_journal_dirty(handle, context->et.et_root_bh);
+
+	/*
+	 * do we need to append the old clusters to the truncate log?
+	 */
+	if (old_blkno)
+		ret = ocfs2_truncate_log_append(osb, handle, old_blkno, len);
+
+out:
+	return ret;
+}
+
 static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
 {
 	int status;
-- 
1.5.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 5/8] Ocfs2: lock allocators and reserve metadata blocks and data clusters for extents moving.
  2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
                   ` (3 preceding siblings ...)
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 4/8] Ocfs2: move a range of extent Tristan Ye
@ 2010-12-28 11:40 ` Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 6/8] Ocfs2: defrag a range of extent Tristan Ye
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 11:40 UTC (permalink / raw)
  To: ocfs2-devel

ocfs2_lock_allocators_move_extents() is like the common ocfs2_lock_allocators():
it locks the metadata and data allocators during extent moving, reserves the
appropriate metadata blocks and data clusters, and also makes a best-effort
calculation of the credits needed by the journal transaction for one run of movement.
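
The decision about whether extra metadata blocks are needed can be sketched on
its own. The arithmetic mirrors the diff below (2 * extents_to_split +
clusters_to_move as the worst-case record count), but the helper names
max_recs_needed() and needs_tree_growth() are hypothetical, and the sparse
allocation flag is passed in as a plain int rather than read from the
superblock.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case number of extent records one run of movement may need. */
static unsigned int max_recs_needed(uint32_t clusters_to_move,
				    uint32_t extents_to_split)
{
	return 2 * extents_to_split + clusters_to_move;
}

/*
 * Returns 1 if the extent tree may have to grow, i.e. extra metadata
 * blocks should be reserved, and 0 otherwise.
 */
static int needs_tree_growth(int num_free_extents, int sparse_alloc,
			     unsigned int recs_needed)
{
	assert(num_free_extents >= 0);

	return !num_free_extents ||
	       (sparse_alloc &&
		(unsigned int)num_free_extents < recs_needed);
}
```

For a 16-cluster extent split once, max_recs_needed() yields 18, so a sparse
file with only 10 free records would reserve extra metadata, while one with 20
free records would not.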

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
---
 fs/ocfs2/move_extents.c |   61 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 7e9daab..e32dd40 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -238,6 +238,67 @@ out:
 	return ret;
 }
 
+/*
+ * lock allocators, and reserving appropriate number of bits for
+ * meta blocks and data clusters.
+ *
+ * in some cases, we don't need to reserve clusters, just let data_ac
+ * be NULL.
+ */
+static int ocfs2_lock_allocators_move_extents(struct inode *inode,
+					struct ocfs2_extent_tree *et,
+					u32 clusters_to_move,
+					u32 extents_to_split,
+					struct ocfs2_alloc_context **meta_ac,
+					struct ocfs2_alloc_context **data_ac,
+					int extra_blocks,
+					int *credits)
+{
+	int ret, num_free_extents;
+	unsigned int max_recs_needed = 2 * extents_to_split + clusters_to_move;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+	num_free_extents = ocfs2_num_free_extents(osb, et);
+	if (num_free_extents < 0) {
+		ret = num_free_extents;
+		mlog_errno(ret);
+		goto out;
+	}
+
+	if (!num_free_extents ||
+	    (ocfs2_sparse_alloc(osb) && num_free_extents < max_recs_needed))
+		extra_blocks += ocfs2_extend_meta_needed(et->et_root_el);
+
+	ret = ocfs2_reserve_new_metadata_blocks(osb, extra_blocks, meta_ac);
+	if (ret) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+	if (data_ac) {
+		ret = ocfs2_reserve_clusters(osb, clusters_to_move, data_ac);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	*credits += ocfs2_calc_extend_credits(osb->sb, et->et_root_el,
+					      clusters_to_move + 2);
+
+	mlog(0, "reserve metadata_blocks: %d, data_clusters: %u, credits: %d\n",
+	     extra_blocks, clusters_to_move, *credits);
+out:
+	if (ret) {
+		if (*meta_ac) {
+			ocfs2_free_alloc_context(*meta_ac);
+			*meta_ac = NULL;
+		}
+	}
+
+	return ret;
+}
+
 static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
 {
 	int status;
-- 
1.5.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 6/8] Ocfs2: defrag a range of extent.
  2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
                   ` (4 preceding siblings ...)
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 5/8] Ocfs2: lock allocators and reserve metadata blocks and data clusters for extents moving Tristan Ye
@ 2010-12-28 11:40 ` Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 7/8] Ocfs2: move entire/partial extent Tristan Ye
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 11:40 UTC (permalink / raw)
  To: ocfs2-devel

This is a relatively complete function that defragments an entire or partial
extent, keeping a single journal handle for the whole operation. Logically it
does one more thing than ocfs2_move_extent(): it claims the new clusters
itself ;-)

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
---
 fs/ocfs2/move_extents.c |   97 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index e32dd40..6d2d130 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -299,6 +299,103 @@ out:
 	return ret;
 }
 
+/*
+ * Using one journal handle to guarantee the data consistency in case
+ * crash happens anywhere.
+ */
+static int ocfs2_defrag_extent(struct ocfs2_move_extents_context *context,
+			       u32 cpos, u32 phys_cpos, u32 len, int flags)
+{
+	int ret, credits = 0;
+	handle_t *handle;
+	struct inode *inode = context->inode;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	struct inode *tl_inode = osb->osb_tl_inode;
+	u32 new_phys_cpos, new_len;
+
+	ret = ocfs2_lock_allocators_move_extents(inode, &context->et, len, 1,
+						 &context->meta_ac,
+						 &context->data_ac,
+						 0, &credits);
+	if (ret) {
+		mlog_errno(ret);
+		return ret;
+	}
+
+	/*
+	 * should be using allocation reservation strategy there?
+	 *
+	 * if (context->data_ac)
+	 *	context->data_ac->ac_resv = &OCFS2_I(inode)->ip_la_data_resv;
+	 */
+
+	mutex_lock(&tl_inode->i_mutex);
+
+	if (ocfs2_truncate_log_needs_flush(osb)) {
+		ret = __ocfs2_flush_truncate_log(osb);
+		if (ret < 0) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	handle = ocfs2_start_trans(osb, credits);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		mlog_errno(ret);
+		goto out;
+	}
+
+	ret = __ocfs2_claim_clusters(handle, context->data_ac, 1, len,
+				     &new_phys_cpos, &new_len);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_commit;
+	}
+
+	/*
+	 * We are not patient enough here to make multiple attempts at
+	 * claiming enough clusters. Failing to claim the requested number
+	 * is not a disaster though: it only means that part of the range
+	 * was defragmented or moved. Users can try again whenever they
+	 * wish, since they are returned '-ENOSPC' together with the
+	 * completed length of this movement.
+	 *
+	 */
+	if (new_len != len) {
+		mlog(0, "len_claimed: %u, len: %u\n", new_len, len);
+		context->range->me_flags &= ~OCFS2_MOVE_EXT_FL_COMPLETE;
+		ret = -ENOSPC;
+		goto out_commit;
+	}
+
+	mlog(0, "cpos: %u, phys_cpos: %u, new_phys_cpos: %u\n", cpos,
+	     phys_cpos, new_phys_cpos);
+
+	ret = __ocfs2_move_extent(handle, context, cpos, len, phys_cpos,
+				  new_phys_cpos);
+	if (ret)
+		mlog_errno(ret);
+
+out_commit:
+	ocfs2_commit_trans(osb, handle);
+
+out:
+	mutex_unlock(&tl_inode->i_mutex);
+
+	if (context->data_ac) {
+		ocfs2_free_alloc_context(context->data_ac);
+		context->data_ac = NULL;
+	}
+
+	if (context->meta_ac) {
+		ocfs2_free_alloc_context(context->meta_ac);
+		context->meta_ac = NULL;
+	}
+
+	return ret;
+}
+
 static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
 {
 	int status;
-- 
1.5.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 7/8] Ocfs2: move entire/partial extent.
  2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
                   ` (5 preceding siblings ...)
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 6/8] Ocfs2: defrag a range of extent Tristan Ye
@ 2010-12-28 11:40 ` Tristan Ye
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 8/8] Ocfs2: move extents within a certain range Tristan Ye
  2010-12-28 12:05 ` [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Wengang Wang
  8 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 11:40 UTC (permalink / raw)
  To: ocfs2-devel

The only difference between ocfs2_move_extent() and ocfs2_defrag_extent() is
that the former does not claim new clusters itself ;)

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
---
 fs/ocfs2/move_extents.c |   50 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 6d2d130..02d4d01 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -396,6 +396,56 @@ out:
 	return ret;
 }
 
+static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
+			     u32 cpos, u32 phys_cpos, u32 new_phys_cpos,
+			     u32 len, int flags)
+{
+	int ret, credits = 0;
+	handle_t *handle;
+	struct inode *inode = context->inode;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	struct inode *tl_inode = osb->osb_tl_inode;
+
+	ret = ocfs2_lock_allocators_move_extents(inode, &context->et, len, 1,
+						 &context->meta_ac,
+						 NULL, 0, &credits);
+	if (ret) {
+		mlog_errno(ret);
+		return ret;
+	}
+
+	mutex_lock(&tl_inode->i_mutex);
+
+	handle = ocfs2_start_trans(osb, credits);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		mlog_errno(ret);
+		goto out;
+	}
+
+	ret = __ocfs2_move_extent(handle, context, cpos, len, phys_cpos,
+				  new_phys_cpos);
+	if (ret)
+		mlog_errno(ret);
+
+	ocfs2_commit_trans(osb, handle);
+
+out:
+	mutex_unlock(&tl_inode->i_mutex);
+
+	if (context->data_ac) {
+		ocfs2_free_alloc_context(context->data_ac);
+		context->data_ac = NULL;
+	}
+
+	if (context->meta_ac) {
+		ocfs2_free_alloc_context(context->meta_ac);
+		context->meta_ac = NULL;
+	}
+
+	return ret;
+}
+
 static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
 {
 	int status;
-- 
1.5.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 8/8] Ocfs2: move extents within a certain range.
  2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
                   ` (6 preceding siblings ...)
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 7/8] Ocfs2: move entire/partial extent Tristan Ye
@ 2010-12-28 11:40 ` Tristan Ye
  2010-12-28 12:05 ` [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Wengang Wang
  8 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 11:40 UTC (permalink / raw)
  To: ocfs2-devel

The basic logic of moving extents for a file is much like the hole-punching
sequence: walk the extents within the user-specified range, calculate an
appropriate length to defrag/move, then let ocfs2_defrag_extent() or
ocfs2_move_extent() do the actual moving.

This function ends up returning 'OCFS2_MOVE_EXT_FL_COMPLETE' to userspace if
the operation completes successfully.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
---
 fs/ocfs2/move_extents.c |  169 +++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 164 insertions(+), 5 deletions(-)
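
For readers following the arithmetic, here is a minimal userspace sketch (not
part of the patch) of the two calculations made in this patch: the threshold
logic of ocfs2_calc_extent_defrag_len() and the byte-to-cluster range
conversion at the top of __ocfs2_move_extents_range(). The names
calc_defrag_len(), simulate_walk() and byte_range_to_clusters() are made up
for illustration. The walk assumes a file without holes and assumes that each
lookup returns the remaining length of the extent at the current position.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace copy of the decision made by ocfs2_calc_extent_defrag_len().
 * Returns 1 when the extent should be skipped, 0 when *alloc_size clusters
 * should be handed to the defrag path.
 */
static int calc_defrag_len(uint32_t *alloc_size, uint32_t *len_defraged,
			   uint32_t threshold)
{
	if (*alloc_size + *len_defraged < threshold) {
		/* Still below the threshold: keep accumulating. */
		*len_defraged += *alloc_size;
		return 0;
	}
	if (*len_defraged == 0)
		return 1;	/* this extent alone reaches the threshold */

	/* Take only what is needed to reach the threshold, then start over. */
	*alloc_size = threshold - *len_defraged;
	*len_defraged = 0;
	return 0;
}

/*
 * Hypothetical stand-in for the main loop: walk extent lengths (in clusters)
 * and return how many clusters would be passed to the defrag path.
 */
static uint32_t simulate_walk(const uint32_t *extents, int nr,
			      uint32_t threshold)
{
	uint32_t moved = 0, len_defraged = 0, off = 0;
	int i = 0;

	assert(nr >= 0);
	while (i < nr) {
		/* Remaining length of the current extent. */
		uint32_t alloc_size = extents[i] - off;

		if (!calc_defrag_len(&alloc_size, &len_defraged, threshold))
			moved += alloc_size;

		/* Advance by the (possibly trimmed) length, as the patch does. */
		off += alloc_size;
		if (off >= extents[i]) {
			i++;
			off = 0;
		}
	}
	return moved;
}

/*
 * Clusters fully covered by the byte range [start, start + len):
 * the start is rounded up and the end is rounded down.
 */
static void byte_range_to_clusters(uint64_t start, uint64_t len, unsigned bits,
				   uint32_t *cstart, uint32_t *clen)
{
	uint64_t first = (start + (1ULL << bits) - 1) >> bits;
	uint64_t end = (start + len) >> bits;

	*cstart = (uint32_t)first;
	*clen = end >= first ? (uint32_t)(end - first) : 0;
}
```

With a threshold of 8 clusters, extents of 3, 2, 5, 20 and 1 clusters give 17
clusters passed to the defrag path: the 5-cluster extent is split into 3 + 2,
the 20-cluster extent contributes its first 6 clusters, and its remaining 14
clusters are skipped as a large extent.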

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 02d4d01..b9bc1cf 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -446,6 +446,169 @@ out:
 	return ret;
 }
 
+/*
+ * Helper to calculate the defraging length in one run according to threshold.
+ */
+static void ocfs2_calc_extent_defrag_len(u32 *alloc_size, u32 *len_defraged,
+					 u32 threshold, int *skip)
+{
+	if ((*alloc_size + *len_defraged) < threshold) {
+		/*
+		 * proceed defragmentation until we meet the thresh
+		 */
+		*len_defraged += *alloc_size;
+	} else if (*len_defraged == 0) {
+		/*
+		 * XXX: skip a large extent.
+		 */
+		*skip = 1;
+	} else {
+		/*
+		 * split this extent to coalesce with former pieces as
+		 * to reach the threshold.
+		 *
+		 * we're done here with one cycle of defragmentation
+		 * in a size of 'thresh', resetting 'len_defraged'
+		 * forces a new defragmentation.
+		 */
+		*alloc_size = threshold - *len_defraged;
+		*len_defraged = 0;
+	}
+}
+
+static int __ocfs2_move_extents_range(struct buffer_head *di_bh,
+				struct ocfs2_move_extents_context *context)
+{
+	int ret = 0, flags, do_defrag, skip = 0;
+	u32 cpos, phys_cpos, move_start, len_to_move, alloc_size;
+	u32 len_defraged = 0, defrag_thresh = 0, new_phys_cpos = 0;
+
+	struct inode *inode = context->inode;
+	struct ocfs2_move_extents *range = context->range;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+	if ((inode->i_size == 0) || (range->me_len == 0))
+		return 0;
+
+	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)
+		return 0;
+
+	ocfs2_init_dinode_extent_tree(&context->et, INODE_CACHE(inode), di_bh);
+	ocfs2_init_dealloc_ctxt(&context->dealloc);
+
+	/*
+	 * TO-DO XXX:
+	 *
+	 * - refcount tree.
+	 * - xattr extents.
+	 * - unwritten extents.
+	 */
+
+	do_defrag = context->auto_defrag;
+
+	/*
+	 * extents moving happens in unit of clusters, for the sake
+	 * of simplicity, we may ignore two clusters where 'byte_start'
+	 * and 'byte_start + len' were within.
+	 */
+	move_start = ocfs2_clusters_for_bytes(osb->sb, range->me_start);
+	len_to_move = (range->me_start + range->me_len) >>
+						osb->s_clustersize_bits;
+	if (len_to_move >= move_start)
+		len_to_move -= move_start;
+	else
+		len_to_move = 0;
+
+	if (do_defrag)
+		defrag_thresh = range->me_thresh >> osb->s_clustersize_bits;
+	else
+		new_phys_cpos = ocfs2_clusters_for_bytes(osb->sb,
+							 range->me_goal);
+
+	mlog(0, "Inode: %llu, start: %llu, len: %llu, cstart: %u, clen: %u, "
+	     "thresh: %u\n",
+	     (unsigned long long)OCFS2_I(inode)->ip_blkno,
+	     (unsigned long long)range->me_start,
+	     (unsigned long long)range->me_len,
+	     move_start, len_to_move, defrag_thresh);
+
+	cpos = move_start;
+	while (len_to_move) {
+		ret = ocfs2_get_clusters(inode, cpos, &phys_cpos, &alloc_size,
+					 &flags);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		if (alloc_size > len_to_move)
+			alloc_size = len_to_move;
+
+		/*
+		 * XXX: how to deal with a hole:
+		 *
+		 * - skip the hole of course
+		 * - force a new defragmentation
+		 */
+		if (!phys_cpos) {
+			if (do_defrag)
+				len_defraged = 0;
+
+			goto next;
+		}
+
+		if (do_defrag) {
+			ocfs2_calc_extent_defrag_len(&alloc_size, &len_defraged,
+						     defrag_thresh, &skip);
+			/*
+			 * skip large extents
+			 */
+			if (skip) {
+				skip = 0;
+				goto next;
+			}
+
+			mlog(0, "#Defrag: cpos: %u, phys_cpos: %u, "
+			     "alloc_size: %u, len_defraged: %u\n",
+			     cpos, phys_cpos, alloc_size, len_defraged);
+
+			ret = ocfs2_defrag_extent(context, cpos, phys_cpos,
+						  alloc_size, flags);
+		} else {
+			ret = ocfs2_move_extent(context, cpos, phys_cpos,
+						new_phys_cpos, alloc_size,
+						flags);
+
+			new_phys_cpos += alloc_size;
+		}
+
+		if (ret < 0) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		context->clusters_moved += alloc_size;
+next:
+		cpos += alloc_size;
+		len_to_move -= alloc_size;
+	}
+
+	range->me_flags |= OCFS2_MOVE_EXT_FL_COMPLETE;
+
+out:
+	range->me_moved_len = ocfs2_clusters_to_bytes(osb->sb,
+						      context->clusters_moved);
+	/*
+	 * also need to return 'me_new_offset' for the non-defragmentation case
+	 */
+
+	ocfs2_schedule_truncate_log_flush(osb, 1);
+	ocfs2_run_deallocs(osb, &context->dealloc);
+
+	return ret;
+}
+
+
 static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
 {
 	int status;
@@ -482,11 +645,7 @@ static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
 	 */
 	down_write(&OCFS2_I(inode)->ip_alloc_sem);
 
-	/*
-	 * real extents moving codes will be fulfilled later.
-	 *
-	 * status = __ocfs2_move_extents_range(di_bh, context);
-	 */
+	status = __ocfs2_move_extents_range(di_bh, context);
 
 	up_write(&OCFS2_I(inode)->ip_alloc_sem);
 	if (status) {
-- 
1.5.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1.
  2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
                   ` (7 preceding siblings ...)
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 8/8] Ocfs2: move extents within a certain range Tristan Ye
@ 2010-12-28 12:05 ` Wengang Wang
  2010-12-28 15:44   ` Tristan Ye
  8 siblings, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2010-12-28 12:05 UTC (permalink / raw)
  To: ocfs2-devel

Hi Guy,

I like it. Having it is definitely better than not.
I will play with it when I am free :)

thanks,
wengang.
On 10-12-28 19:40, Tristan Ye wrote:
> Hi All,
> 
> 	It's a quite rough patches series v1 for online defragmentation on OCFS2, it's
> workable anyway, may look ugly though;) The essence of online file defragmentation is
> extents moving like what btrfs and ext4 were doing, adding 'OCFS2_IOC_MOVE_EXT' ioctl
> to ocfs2 allows two strategies upon defragmentation:
> 
> 1. simple-defragmentation-in-kernl, which means kernel will be responsible for
>    claiming new clusters, and packing the defragmented extents according to a
>    user-specified threshold.
> 
> 2. simple-extents moving, in this case, userspace play much more important role
>    when doing defragmentation, it needs to specify the new physical blk_offset
>    where extents will be moved, kernel itself will not do anything more than
>    moving the extents per requested, maybe kernel also needs to manage to
>    probe/validate the new_blkoffset to guarantee enough free space around there.
> 
> Above two operations using the same OCFS2_IOC_MOVE_EXT:
> -------------------------------------------------------------------------------
> #define OCFS2_MOVE_EXT_FL_AUTO_DEFRAG   (0x00000001)    /* Kernel manages to
>                                                            claim new clusters
>                                                            as the goal place
>                                                            for extents moving */
> #define OCFS2_MOVE_EXT_FL_COMPLETE      (0x00000002)    /* Move or defragmenation
>                                                            completely gets done.
>                                                          */
> struct ocfs2_move_extents {
> /* All values are in bytes */
>         /* in */
>         __u64 me_start;         /* Virtual start in the file to move */
>         __u64 me_len;           /* Length of the extents to be moved */
>         __u64 me_goal;          /* Physical offset of the goal */
>         __u64 me_thresh;        /* Maximum distance from goal or threshold
>                                    for auto defragmentation */
>         __u64 me_flags;         /* flags for the operation:
>                                  * - auto defragmentation.
>                                  * - refcount,xattr cases.
>                                  */
> 
>         /* out */
>         __u64 me_moved_len;     /* moved length, are we completely done? */
>         __u64 me_new_offset;    /* Resulting physical location */
>         __u32 me_reserved[3];   /* reserved for futhure */
> };
> -------------------------------------------------------------------------------
> 
> 	Current V1 patches set will be focusing mostly on strategy #1 though, since #2
> strategy is still there under discussion.
> 
> 	Following are some interesting data gathered from simple tests:
> 
> 1. Performance improvement gained on I/O reads:
> -------------------------------------------------------------------------------
> * Before defragmentation *
> 
> [root at ocfs2-box4 ~]# sync
> [root at ocfs2-box4 ~]# echo 3>/proc/sys/vm/drop_caches 
> [root at ocfs2-box4 ~]# time dd if=/storage/testfile-1 of=/dev/null
> 640000+0 records in
> 640000+0 records out
> 327680000 bytes (328 MB) copied, 19.9351 s, 16.4 MB/s
> 
> real	0m19.954s
> user	0m0.246s
> sys	0m1.111s
> 
> * Do defragmentation *
> 
> [root at ocfs2-box4 defrag]# ./defrag -s 0 -l 293601280  -t 3145728 /storage/testfile-1
> 
> * After defragmentation *
> 
> [root at ocfs2-box4 ~]# sync
> [root at ocfs2-box4 ~]# echo 3>/proc/sys/vm/drop_caches
> [root at ocfs2-box4 ~]# time dd if=/storage/testfile-1 of=/dev/null
> 640000+0 records in
> 640000+0 records out
> 327680000 bytes (328 MB) copied, 6.79885 s, 48.2 MB/s
> 
> real	0m6.969s
> user	0m0.209s
> sys	0m1.063s
> -------------------------------------------------------------------------------
> 
> 
> 2. Extent tree layout via debugfs.ocfs2:
> -------------------------------------------------------------------------------
> * Before defragmentation *
> 
>         Tree Depth: 1   Count: 243   Next Free Rec: 8
>         ## Offset        Clusters       Block#
>         0  0             1173           86561
>         1  1173          1173           84527
>         2  2346          1151           81468
>         3  3497          1173           76362
>         4  4670          1173           74328
>         5  5843          1172           66150
>         6  7015          1460           70260
>         7  8475          662            87680
>         SubAlloc Bit: 1   SubAlloc Slot: 0
>         Blknum: 86561   Next Leaf: 84527
>         CRC32: abf06a6b   ECC: 44bc
>         Tree Depth: 0   Count: 252   Next Free Rec: 252
>         ## Offset        Clusters       Block#          Flags
>         0  1             16             516104          0x0
>         1  17            1              554632          0x0
>         2  18            7              560144          0x0
>         3  25            1              565960          0x0
>         4  26            1              572632          0x
> 	...
> 	/* around 1700 extent records were hidden there */
> 	...
> 	138 9131          1              258968          0x0
>         139 9132          1              259568          0x0
>         140 9133          1              260168          0x0
>         141 9134          1              260768          0x0
>         142 9135          1              261368          0x0
>         143 9136          1              261968          0x0
> 
> * After defragmentation *
> 
>       Tree Depth: 1   Count: 243   Next Free Rec: 1
> 	## Offset        Clusters       Block#
> 	0  0             9137           66081
> 	SubAlloc Bit: 1   SubAlloc Slot: 0
> 	Blknum: 66081   Next Leaf: 0
> 	CRC32: 22897d34   ECC: 0619
> 	Tree Depth: 0   Count: 252   Next Free Rec: 6
> 	## Offset        Clusters       Block#          Flags
> 	0  1             1600           4412936         0x0 
> 	1  1601          1595           20669448        0x0 
> 	2  3196          1600           9358856         0x0 
> 	3  4796          1404           14516232        0x0 
> 	4  6200          1600           21627400        0x0 
> 	5  7800          1337           7483400         0x0 
> -------------------------------------------------------------------------------
> 
> 
> TO-DO:
> 
> 1. Complete strategy #2
> 2. Adding refcount/xattr/unwritten_extents support.
> 3. Free space defragmentation.
> 
> 
> Go to http://oss.oracle.com/osswiki/OCFS2/DesignDocs/OnlineDefrag for more details.
> 
> 
> Tristan.
> 
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 2/8] Ocfs2: Add basic framework and source files for extent moving.
  2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 2/8] Ocfs2: Add basic framework and source files for extent moving Tristan Ye
@ 2010-12-28 15:11   ` Tao Ma
  2010-12-28 15:38     ` Tristan Ye
  0 siblings, 1 reply; 16+ messages in thread
From: Tao Ma @ 2010-12-28 15:11 UTC (permalink / raw)
  To: ocfs2-devel

Hi Tristan,
	Some comments inlined.
On 12/28/2010 07:40 PM, Tristan Ye wrote:
<snip>
> +static int ocfs2_move_extents(struct ocfs2_move_extents_context *context)
> +{
> +	int status;
> +	handle_t *handle;
> +	struct inode *inode = context->inode;
> +	struct buffer_head *di_bh = NULL;
> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +
> +	if (!inode)
> +		return -ENOENT;
> +
> +	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
> +		return -EROFS;
> +
> +	mutex_lock(&inode->i_mutex);
> +
> +	/*
> +	 * This prevents concurrent writes from other nodes
> +	 */
> +	status = ocfs2_rw_lock(inode, 1);
> +	if (status) {
> +		mlog_errno(status);
> +		goto out;
> +	}
> +
> +	status = ocfs2_inode_lock(inode, &di_bh, 1);
> +	if (status) {
> +		mlog_errno(status);
> +		goto out_rw_unlock;
> +	}
> +
> +	/*
> +	 * rememer ip_xattr_sem also needs to be held if necessary
> +	 */
> +	down_write(&OCFS2_I(inode)->ip_alloc_sem);
> +
> +	/*
> +	 * real extents moving codes will be fulfilled later.
> +	 *
> +	 * status = __ocfs2_move_extents_range(di_bh, context);
> +	 */
OK, so this function does nothing for now? ;)
> +
> +	up_write(&OCFS2_I(inode)->ip_alloc_sem);
> +	if (status) {
> +		mlog_errno(status);
> +		goto out_inode_unlock;
> +	}
> +
> +	/*
> +	 * We update mtime for these changes
> +	 */
> +	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
> +	if (IS_ERR(handle)) {
> +		status = PTR_ERR(handle);
> +		mlog_errno(status);
> +		goto out_inode_unlock;
> +	}
> +
> +	inode->i_mtime = CURRENT_TIME;
> +	status = ocfs2_mark_inode_dirty(handle, inode, di_bh);
Do we really need such a heavy function when you just want to set
di->i_mtime and di->i_mtime_nsec?
> +	if (status < 0)
> +		mlog_errno(status);
> +
> +	ocfs2_commit_trans(osb, handle);
> +
> +out_inode_unlock:
> +	brelse(di_bh);
> +	ocfs2_inode_unlock(inode, 1);
> +out_rw_unlock:
> +	ocfs2_rw_unlock(inode, 1);
> +out:
> +	mutex_unlock(&inode->i_mutex);
> +
> +	return status;
> +}
> +
> +int ocfs2_ioctl_move_extents(struct file *filp, void __user *argp)
> +{
> +	int status;
> +
> +	struct inode *inode = filp->f_path.dentry->d_inode;
> +	struct ocfs2_move_extents range;
> +	struct ocfs2_move_extents_context *context = NULL;
> +
> +	status = mnt_want_write(filp->f_path.mnt);
> +	if (status)
> +		return status;
> +
> +	status = -EINVAL;
> +
> +	if (!S_ISREG(inode->i_mode)) {
> +		goto out;
> +	} else {
> +		if (!(filp->f_mode & FMODE_WRITE))
> +			goto out;
> +	}
these can be just changed to:
	if (!S_ISREG(inode->i_mode) || !(filp->f_mode & FMODE_WRITE))
		goto out;
> +
> +	if (inode->i_flags & (S_IMMUTABLE|S_APPEND)) {
> +		status = -EPERM;
> +		goto out;
> +	}
> +
> +	context = kzalloc(sizeof(struct ocfs2_move_extents_context), GFP_NOFS);
> +	if (!context) {
> +		status = -ENOMEM;
> +		mlog_errno(status);
> +		goto out;
> +	}
> +
> +	context->inode = inode;
> +	context->file = filp;
> +
> +	if (!argp) {
> +		memset(&range, 0, sizeof(range));
> +		range.me_len = (u64)-1;
> +		range.me_flags |= OCFS2_MOVE_EXT_FL_AUTO_DEFRAG;
> +		context->auto_defrag = 1;
> +	} else {
> +		if (copy_from_user(&range, (struct ocfs2_move_extents *)argp,
> +				   sizeof(range))) {
> +			status = -EFAULT;
> +			goto out;
> +		}
> +	}
> +
> +	context->range = &range;
> +
> +	if (range.me_flags & OCFS2_MOVE_EXT_FL_AUTO_DEFRAG) {
> +		context->auto_defrag = 1;
> +		if (!range.me_thresh)
> +			range.me_thresh = 1024 * 1024;
> +	} else {
> +		/*
> +		 * TO-DO XXX: validate the range.me_goal here.
> +		 *
> +		 * - should be cluster aligned.
> +		 * - should contain enough free clusters around range.me_goal.
> +		 * - strategy of moving extent to an appropriate goal is still
> +		 *   being discussed.
> +		 */
> +	}
> +
> +	/*
> +	 * returning -EINVAL here.
> +	 */
> +	if (range.me_start > i_size_read(inode))
Do we allow overlap here? If not, I guess it should be
	if (range.me_start + range.me_len > i_size_read(inode))
> +		goto out;
> +
> +	if (range.me_start + range.me_len > i_size_read(inode))
> +			range.me_len = i_size_read(inode) - range.me_start;
> +
> +	status = ocfs2_move_extents(context);
> +	if (status)
> +		mlog_errno(status);
> +
> +	if (argp) {
> +		if (copy_to_user((struct ocfs2_move_extents *)argp, &range,
> +				sizeof(range)))
> +			status = -EFAULT;
> +	}
> +out:
> +	mnt_drop_write(filp->f_path.mnt);
> +
> +	kfree(context);
> +
> +	return status;
> +}

Regards,
Tao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 2/8] Ocfs2: Add basic framework and source files for extent moving.
  2010-12-28 15:11   ` Tao Ma
@ 2010-12-28 15:38     ` Tristan Ye
  2010-12-29  5:45       ` Wengang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 15:38 UTC (permalink / raw)
  To: ocfs2-devel

On 12/28/2010 11:11 PM, Tao Ma wrote:
> Hi Tristan,
>     Some comments inlined.

Tao,

     It's good to see your comments here, very nice;)


> On 12/28/2010 07:40 PM, Tristan Ye wrote:
> <snip>
>> +static int ocfs2_move_extents(struct ocfs2_move_extents_context 
>> *context)
>> +{
>> +    int status;
>> +    handle_t *handle;
>> +    struct inode *inode = context->inode;
>> +    struct buffer_head *di_bh = NULL;
>> +    struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>> +
>> +    if (!inode)
>> +        return -ENOENT;
>> +
>> +    if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
>> +        return -EROFS;
>> +
>> +    mutex_lock(&inode->i_mutex);
>> +
>> +    /*
>> +     * This prevents concurrent writes from other nodes
>> +     */
>> +    status = ocfs2_rw_lock(inode, 1);
>> +    if (status) {
>> +        mlog_errno(status);
>> +        goto out;
>> +    }
>> +
>> +    status = ocfs2_inode_lock(inode, &di_bh, 1);
>> +    if (status) {
>> +        mlog_errno(status);
>> +        goto out_rw_unlock;
>> +    }
>> +
>> +    /*
>> +     * rememer ip_xattr_sem also needs to be held if necessary
>> +     */
>> +    down_write(&OCFS2_I(inode)->ip_alloc_sem);
>> +
>> +    /*
>> +     * real extents moving codes will be fulfilled later.
>> +     *
>> +     * status = __ocfs2_move_extents_range(di_bh, context);
>> +     */
> OK, so this function does nothing for now? ;)

Actually, this function will be added in the following patches; I left it
commented out here ONLY to make the flow easier to understand.

So until the last patch is applied, it actually does nothing ;-)

>> +
>> +    up_write(&OCFS2_I(inode)->ip_alloc_sem);
>> +    if (status) {
>> +        mlog_errno(status);
>> +        goto out_inode_unlock;
>> +    }
>> +
>> +    /*
>> +     * We update mtime for these changes
>> +     */
>> +    handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
>> +    if (IS_ERR(handle)) {
>> +        status = PTR_ERR(handle);
>> +        mlog_errno(status);
>> +        goto out_inode_unlock;
>> +    }
>> +
>> +    inode->i_mtime = CURRENT_TIME;
>> +    status = ocfs2_mark_inode_dirty(handle, inode, di_bh);
> Do we really need such a heavy function when you just want to set
> di->i_mtime and di->i_mtime_nsec?


The code above was mostly borrowed from __ocfs2_change_file_space(), which
simply calls ocfs2_mark_inode_dirty() when updating the mtime/ctime.

Maybe you're right, I'll think it over again.


>> +    if (status < 0)
>> +        mlog_errno(status);
>> +
>> +    ocfs2_commit_trans(osb, handle);
>> +
>> +out_inode_unlock:
>> +    brelse(di_bh);
>> +    ocfs2_inode_unlock(inode, 1);
>> +out_rw_unlock:
>> +    ocfs2_rw_unlock(inode, 1);
>> +out:
>> +    mutex_unlock(&inode->i_mutex);
>> +
>> +    return status;
>> +}
>> +
>> +int ocfs2_ioctl_move_extents(struct file *filp, void __user *argp)
>> +{
>> +    int status;
>> +
>> +    struct inode *inode = filp->f_path.dentry->d_inode;
>> +    struct ocfs2_move_extents range;
>> +    struct ocfs2_move_extents_context *context = NULL;
>> +
>> +    status = mnt_want_write(filp->f_path.mnt);
>> +    if (status)
>> +        return status;
>> +
>> +    status = -EINVAL;
>> +
>> +    if (!S_ISREG(inode->i_mode)) {
>> +        goto out;
>> +    } else {
>> +        if (!(filp->f_mode & FMODE_WRITE))
>> +            goto out;
>> +    }
> these can be just changed to:
>     if (!S_ISREG(inode->i_mode) || !(filp->f_mode & FMODE_WRITE))
>         goto out;

Good point.

>> +
>> +    if (inode->i_flags & (S_IMMUTABLE|S_APPEND)) {
>> +        status = -EPERM;
>> +        goto out;
>> +    }
>> +
>> +    context = kzalloc(sizeof(struct ocfs2_move_extents_context), 
>> GFP_NOFS);
>> +    if (!context) {
>> +        status = -ENOMEM;
>> +        mlog_errno(status);
>> +        goto out;
>> +    }
>> +
>> +    context->inode = inode;
>> +    context->file = filp;
>> +
>> +    if (!argp) {
>> +        memset(&range, 0, sizeof(range));
>> +        range.me_len = (u64)-1;
>> +        range.me_flags |= OCFS2_MOVE_EXT_FL_AUTO_DEFRAG;
>> +        context->auto_defrag = 1;
>> +    } else {
>> +        if (copy_from_user(&range, (struct ocfs2_move_extents *)argp,
>> +                   sizeof(range))) {
>> +            status = -EFAULT;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    context->range = &range;
>> +
>> +    if (range.me_flags & OCFS2_MOVE_EXT_FL_AUTO_DEFRAG) {
>> +        context->auto_defrag = 1;
>> +        if (!range.me_thresh)
>> +            range.me_thresh = 1024 * 1024;
>> +    } else {
>> +        /*
>> +         * TO-DO XXX: validate the range.me_goal here.
>> +         *
>> +         * - should be cluster aligned.
>> +         * - should contain enough free clusters around range.me_goal.
>> +         * - strategy of moving extent to an appropriate goal is still
>> +         *   being discussed.
>> +         */
>> +    }
>> +
>> +    /*
>> +     * returning -EINVAL here.
>> +     */
>> +    if (range.me_start > i_size_read(inode))
> Do we allow overlap here? If not, I guess it should be
>     if (range.me_start + range.me_len > i_size_read(inode))


What do you mean? 'range.me_start + range.me_len > i_size_read(inode)'
is checked by the code that follows.

It has nothing to do with the overlap issue: range.me_start is not
range.me_new_start ;)

>> +        goto out;
>> +
>> +    if (range.me_start + range.me_len > i_size_read(inode))
>> +            range.me_len = i_size_read(inode) - range.me_start;
>> +
>> +    status = ocfs2_move_extents(context);
>> +    if (status)
>> +        mlog_errno(status);
>> +
>> +    if (argp) {
>> +        if (copy_to_user((struct ocfs2_move_extents *)argp, &range,
>> +                sizeof(range)))
>> +            status = -EFAULT;
>> +    }
>> +out:
>> +    mnt_drop_write(filp->f_path.mnt);
>> +
>> +    kfree(context);
>> +
>> +    return status;
>> +}
>
> Regards,
> Tao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1.
  2010-12-28 12:05 ` [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Wengang Wang
@ 2010-12-28 15:44   ` Tristan Ye
  0 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-28 15:44 UTC (permalink / raw)
  To: ocfs2-devel

On 12/28/2010 08:05 PM, Wengang Wang wrote:
> Hi Guy,
>
> I like it. Having it is definitely better than not.
> I will play with it when I am free :)

Hi wengang,

     Very cool to hear you're showing interest in it ;-)

Actually, the motivation at the very beginning was inspired by your
original discussion on the wiki page.

Thanks a lot;)

Tristan



>
> thanks,
> wengang.
> On 10-12-28 19:40, Tristan Ye wrote:
>> Hi All,
>>
>> 	It's a quite rough patches series v1 for online defragmentation on OCFS2, it's
>> workable anyway, may look ugly though;) The essence of online file defragmentation is
>> extents moving like what btrfs and ext4 were doing, adding 'OCFS2_IOC_MOVE_EXT' ioctl
>> to ocfs2 allows two strategies upon defragmentation:
>>
>> 1. simple-defragmentation-in-kernl, which means kernel will be responsible for
>>     claiming new clusters, and packing the defragmented extents according to a
>>     user-specified threshold.
>>
>> 2. simple-extents moving, in this case, userspace play much more important role
>>     when doing defragmentation, it needs to specify the new physical blk_offset
>>     where extents will be moved, kernel itself will not do anything more than
>>     moving the extents per requested, maybe kernel also needs to manage to
>>     probe/validate the new_blkoffset to guarantee enough free space around there.
>>
>> Above two operations using the same OCFS2_IOC_MOVE_EXT:
>> -------------------------------------------------------------------------------
>> #define OCFS2_MOVE_EXT_FL_AUTO_DEFRAG   (0x00000001)    /* Kernel manages to
>>                                                             claim new clusters
>>                                                             as the goal place
>>                                                             for extents moving */
>> #define OCFS2_MOVE_EXT_FL_COMPLETE      (0x00000002)    /* Move or defragmenation
>>                                                             completely gets done.
>>                                                           */
>> struct ocfs2_move_extents {
>> /* All values are in bytes */
>>          /* in */
>>          __u64 me_start;         /* Virtual start in the file to move */
>>          __u64 me_len;           /* Length of the extents to be moved */
>>          __u64 me_goal;          /* Physical offset of the goal */
>>          __u64 me_thresh;        /* Maximum distance from goal or threshold
>>                                     for auto defragmentation */
>>          __u64 me_flags;         /* flags for the operation:
>>                                   * - auto defragmentation.
>>                                   * - refcount,xattr cases.
>>                                   */
>>
>>          /* out */
>>          __u64 me_moved_len;     /* moved length, are we completely done? */
>>          __u64 me_new_offset;    /* Resulting physical location */
>>          __u32 me_reserved[3];   /* reserved for futhure */
>> };
>> -------------------------------------------------------------------------------
>>
>> 	Current V1 patches set will be focusing mostly on strategy #1 though, since #2
>> strategy is still there under discussion.
>>
>> 	Following are some interesting data gathered from simple tests:
>>
>> 1. Performance improvement gained on I/O reads:
>> -------------------------------------------------------------------------------
>> * Before defragmentation *
>>
>> [root@ocfs2-box4 ~]# sync
>> [root@ocfs2-box4 ~]# echo 3 > /proc/sys/vm/drop_caches
>> [root@ocfs2-box4 ~]# time dd if=/storage/testfile-1 of=/dev/null
>> 640000+0 records in
>> 640000+0 records out
>> 327680000 bytes (328 MB) copied, 19.9351 s, 16.4 MB/s
>>
>> real	0m19.954s
>> user	0m0.246s
>> sys	0m1.111s
>>
>> * Do defragmentation *
>>
>> [root@ocfs2-box4 defrag]# ./defrag -s 0 -l 293601280 -t 3145728 /storage/testfile-1
>>
>> * After defragmentation *
>>
>> [root@ocfs2-box4 ~]# sync
>> [root@ocfs2-box4 ~]# echo 3 > /proc/sys/vm/drop_caches
>> [root@ocfs2-box4 ~]# time dd if=/storage/testfile-1 of=/dev/null
>> 640000+0 records in
>> 640000+0 records out
>> 327680000 bytes (328 MB) copied, 6.79885 s, 48.2 MB/s
>>
>> real	0m6.969s
>> user	0m0.209s
>> sys	0m1.063s
>> -------------------------------------------------------------------------------
>>
>>
>> 2. Extent tree layout via debugfs.ocfs2:
>> -------------------------------------------------------------------------------
>> * Before defragmentation *
>>
>>          Tree Depth: 1   Count: 243   Next Free Rec: 8
>>          ## Offset        Clusters       Block#
>>          0  0             1173           86561
>>          1  1173          1173           84527
>>          2  2346          1151           81468
>>          3  3497          1173           76362
>>          4  4670          1173           74328
>>          5  5843          1172           66150
>>          6  7015          1460           70260
>>          7  8475          662            87680
>>          SubAlloc Bit: 1   SubAlloc Slot: 0
>>          Blknum: 86561   Next Leaf: 84527
>>          CRC32: abf06a6b   ECC: 44bc
>>          Tree Depth: 0   Count: 252   Next Free Rec: 252
>>          ## Offset        Clusters       Block#          Flags
>>          0  1             16             516104          0x0
>>          1  17            1              554632          0x0
>>          2  18            7              560144          0x0
>>          3  25            1              565960          0x0
>>          4  26            1              572632          0x
>> 	...
>> 	/* around 1700 extent records were hidden there */
>> 	...
>> 	138 9131          1              258968          0x0
>>          139 9132          1              259568          0x0
>>          140 9133          1              260168          0x0
>>          141 9134          1              260768          0x0
>>          142 9135          1              261368          0x0
>>          143 9136          1              261968          0x0
>>
>> * After defragmentation *
>>
>>        Tree Depth: 1   Count: 243   Next Free Rec: 1
>> 	## Offset        Clusters       Block#
>> 	0  0             9137           66081
>> 	SubAlloc Bit: 1   SubAlloc Slot: 0
>> 	Blknum: 66081   Next Leaf: 0
>> 	CRC32: 22897d34   ECC: 0619
>> 	Tree Depth: 0   Count: 252   Next Free Rec: 6
>> 	## Offset        Clusters       Block#          Flags
>> 	0  1             1600           4412936         0x0
>> 	1  1601          1595           20669448        0x0
>> 	2  3196          1600           9358856         0x0
>> 	3  4796          1404           14516232        0x0
>> 	4  6200          1600           21627400        0x0
>> 	5  7800          1337           7483400         0x0
>> -------------------------------------------------------------------------------
>>
>>
>> TO-DO:
>>
>> 1. Complete strategy #2.
>> 2. Add refcount/xattr/unwritten_extents support.
>> 3. Free space defragmentation.
>>
>>
>> Go to http://oss.oracle.com/osswiki/OCFS2/DesignDocs/OnlineDefrag for more details.
>>
>>
>> Tristan.
>>


* [Ocfs2-devel] [PATCH 2/8] Ocfs2: Add basic framework and source files for extent moving.
  2010-12-28 15:38     ` Tristan Ye
@ 2010-12-29  5:45       ` Wengang Wang
  2010-12-29  6:30         ` Tristan Ye
  0 siblings, 1 reply; 16+ messages in thread
From: Wengang Wang @ 2010-12-29  5:45 UTC (permalink / raw)
  To: ocfs2-devel

Hi,

On 10-12-28 23:38, Tristan Ye wrote:
> On 12/28/2010 11:11 PM, Tao Ma wrote:
> 
> >> +
> >> +    up_write(&OCFS2_I(inode)->ip_alloc_sem);
> >> +    if (status) {
> >> +        mlog_errno(status);
> >> +        goto out_inode_unlock;
> >> +    }
> >> +
> >> +    /*
> >> +     * We update mtime for these changes
> >> +     */
> >> +    handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
> >> +    if (IS_ERR(handle)) {
> >> +        status = PTR_ERR(handle);
> >> +        mlog_errno(status);
> >> +        goto out_inode_unlock;
> >> +    }
> >> +
> >> +    inode->i_mtime = CURRENT_TIME;
> >> +    status = ocfs2_mark_inode_dirty(handle, inode, di_bh);
> > Do we really need such a heavy function if you just want to set 
> > di->i_mtime and di->i_mtime_nsec?
> 
> 
> This code was mostly borrowed from __ocfs2_change_file_space(), which
> simply calls ocfs2_mark_inode_dirty() when updating the mtime/ctime.
> 
> Maybe you're right, I'll rethink it.
> 

Do we really need to update the modify time or the change time?
Modify time is "Time of last file write". I think there is no need, since this
is not a write (the file contents are not touched).
Change time is "Time of last inode change". I think the inode status here
includes mode, gid, uid, size, ino... There is no change in those areas,
right? So I guess there is no need to update either mtime or ctime.

Also, even if we need to update them, shall we check the status of
__ocfs2_move_extents_range() to see whether anything was really moved? Think of
the case where another node has just finished defragmenting this file.

And we have likely modified di_bh in __ocfs2_move_extents_range() and
journaled it (though I have not checked yet). So if needed, we'd better merge
the transactions.

BTW, just a thought, I have not checked everything: do we only allow root to
do this? It seems there is no security risk from this task to the FS
itself... attacks?

thanks,
wengang.


* [Ocfs2-devel] [PATCH 2/8] Ocfs2: Add basic framework and source files for extent moving.
  2010-12-29  5:45       ` Wengang Wang
@ 2010-12-29  6:30         ` Tristan Ye
  0 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2010-12-29  6:30 UTC (permalink / raw)
  To: ocfs2-devel

Wengang Wang wrote:
> Hi,
>
> On 10-12-28 23:38, Tristan Ye wrote:
>> On 12/28/2010 11:11 PM, Tao Ma wrote:
>>
>>>> +
>>>> +    up_write(&OCFS2_I(inode)->ip_alloc_sem);
>>>> +    if (status) {
>>>> +        mlog_errno(status);
>>>> +        goto out_inode_unlock;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * We update mtime for these changes
>>>> +     */
>>>> +    handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
>>>> +    if (IS_ERR(handle)) {
>>>> +        status = PTR_ERR(handle);
>>>> +        mlog_errno(status);
>>>> +        goto out_inode_unlock;
>>>> +    }
>>>> +
>>>> +    inode->i_mtime = CURRENT_TIME;
>>>> +    status = ocfs2_mark_inode_dirty(handle, inode, di_bh);
>>> Do we really need such a heavy function if you just want to set 
>>> di->i_mtime and di->i_mtime_nsec?
>>
>> This code was mostly borrowed from __ocfs2_change_file_space(), which
>> simply calls ocfs2_mark_inode_dirty() when updating the mtime/ctime.
>>
>> Maybe you're right, I'll rethink it.
>>
>
> Do we really need to update the modify time or the change time?
> Modify time is "Time of last file write". I think there is no need, since this
> is not a write (the file contents are not touched).
> Change time is "Time of last inode change". I think the inode status here
> includes mode, gid, uid, size, ino... There is no change in those areas,
> right? So I guess there is no need to update either mtime or ctime.
>
> Also, even if we need to update them, shall we check the status of
> __ocfs2_move_extents_range() to see whether anything was really moved? Think of
> the case where another node has just finished defragmenting this file.

    Actually I borrowed this code from ocfs2_change_file_space() at the very
beginning. Your comments do make sense, but wait: defragmentation can change
the dinode a bit, since it holds the btree root, right?

    mtime may not need to be updated, as you said ;)


>
> And we have likely modified di_bh in __ocfs2_move_extents_range() and
> journaled it (though I have not checked yet). So if needed, we'd better merge
> the transactions.

Good point.
   
>
> BTW, just a thought, I have not checked everything: do we only allow root to
> do this? It seems there is no security risk from this task to the FS
> itself... attacks?

    I did allow ordinary users to do this in this patch, didn't I?

>
> thanks,
> wengang.
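
One possible reading of the timestamp discussion above, written as a small
userspace model rather than kernel code: leave mtime alone because the file
contents do not change, and update ctime only when something was actually
moved. The names model_inode and model_finish_move() are hypothetical and do
not appear in the patch; this is a sketch of the idea, not what the series
implements.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the two inode timestamps under discussion. */
struct model_inode {
	time_t mtime;	/* time of last write to the file contents */
	time_t ctime;	/* time of last inode (metadata) change */
};

/*
 * Called after an extent-move attempt. moved_len is the number of bytes
 * that were really relocated. Returns 1 if ctime was updated, 0 if the
 * call was a no-op (for example, another node already defragmented).
 */
int model_finish_move(struct model_inode *inode, uint64_t moved_len,
		      time_t now)
{
	assert(inode != NULL);

	if (moved_len == 0)
		return 0;	/* nothing moved: leave both timestamps alone */

	/* The extent tree root lives in the inode, so metadata changed. */
	inode->ctime = now;
	/* File data is byte-for-byte identical, so mtime is not touched. */
	return 1;
}
```

In a real implementation the timestamp update would presumably share the
journal transaction that already dirties di_bh, as suggested above, instead
of starting a second one.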


* [Ocfs2-devel] [PATCH 1/8] Ocfs2: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
  2011-01-13 10:20 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V2 Tristan Ye
@ 2011-01-13 10:20 ` Tristan Ye
  0 siblings, 0 replies; 16+ messages in thread
From: Tristan Ye @ 2011-01-13 10:20 UTC (permalink / raw)
  To: ocfs2-devel

The patch also adds the structure used to pass arguments to this ioctl.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
---
 fs/ocfs2/ocfs2_ioctl.h |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/ocfs2_ioctl.h b/fs/ocfs2/ocfs2_ioctl.h
index b46f39b..25c28ec 100644
--- a/fs/ocfs2/ocfs2_ioctl.h
+++ b/fs/ocfs2/ocfs2_ioctl.h
@@ -171,4 +171,32 @@ enum ocfs2_info_type {
 
 #define OCFS2_IOC_INFO		_IOR('o', 5, struct ocfs2_info)
 
+#define OCFS2_MOVE_EXT_FL_AUTO_DEFRAG	(0x00000001)	/* Kernel manages to
+							   claim new clusters
+							   as the goal place
+							   for extents moving */
+#define OCFS2_MOVE_EXT_FL_COMPLETE	(0x00000002)	/* Move or defragmentation
+							   completely gets done.
+							 */
+struct ocfs2_move_extents {
+/* All values are in bytes */
+	/* in */
+	__u64 me_start;		/* Virtual start in the file to move */
+	__u64 me_len;		/* Length of the extents to be moved */
+	__u64 me_goal;		/* Physical offset of the goal,
+				   it's in block unit */
+	__u64 me_thresh;	/* Maximum distance from goal or threshold
+				   for auto defragmentation */
+	__u64 me_flags;		/* flags for the operation:
+				 * - auto defragmentation.
+				 * - refcount,xattr cases.
+				 */
+	/* out */
+	__u64 me_moved_len;	/* moved/defraged length */
+	__u64 me_new_offset;	/* Resulting physical location */
+	__u32 me_reserved[2];	/* reserved for future */
+};
+
+#define OCFS2_IOC_MOVE_EXT	_IOW('o', 6, struct ocfs2_move_extents)
+
 #endif /* OCFS2_IOCTL_H */
-- 
1.5.5
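
A minimal userspace sketch of how a caller might drive this ioctl for
strategy #1 (auto defragmentation). The structure, flag and ioctl number are
copied from the diff above, on the assumption that no installed header
exports them. The helpers fill_auto_defrag() and request_auto_defrag() are
hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Definitions copied from the patch to fs/ocfs2/ocfs2_ioctl.h above. */
#define OCFS2_MOVE_EXT_FL_AUTO_DEFRAG	(0x00000001)

struct ocfs2_move_extents {
	/* in */
	__u64 me_start;		/* virtual start in the file, in bytes */
	__u64 me_len;		/* length to move, in bytes */
	__u64 me_goal;		/* physical goal, unused for auto defrag */
	__u64 me_thresh;	/* threshold for auto defragmentation */
	__u64 me_flags;
	/* out */
	__u64 me_moved_len;
	__u64 me_new_offset;
	__u32 me_reserved[2];
};

#define OCFS2_IOC_MOVE_EXT	_IOW('o', 6, struct ocfs2_move_extents)

/* Hypothetical helper: prepare an auto-defrag request for a byte range. */
void fill_auto_defrag(struct ocfs2_move_extents *me, __u64 start,
		      __u64 len, __u64 thresh)
{
	assert(me != NULL);
	memset(me, 0, sizeof(*me));	/* also clears out and reserved fields */
	me->me_start = start;
	me->me_len = len;
	me->me_thresh = thresh;
	me->me_flags = OCFS2_MOVE_EXT_FL_AUTO_DEFRAG;
}

/* Hypothetical helper: returns 0 on success, -1 with errno set on failure. */
int request_auto_defrag(int fd, struct ocfs2_move_extents *me)
{
	return ioctl(fd, OCFS2_IOC_MOVE_EXT, me);
}
```

A caller would open the target file, call fill_auto_defrag() with the range
and threshold, pass the descriptor to request_auto_defrag(), and then read
me_moved_len. The "./defrag -s 0 -l 293601280 -t 3145728" run in the cover
letter presumably maps its options onto me_start, me_len and me_thresh in
this way, but that tool's source is not part of this thread.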


end of thread, other threads:[~2011-01-13 10:20 UTC | newest]

Thread overview: 16+ messages
2010-12-28 11:40 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Tristan Ye
2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 1/8] Ocfs2: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2 Tristan Ye
2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 2/8] Ocfs2: Add basic framework and source files for extent moving Tristan Ye
2010-12-28 15:11   ` Tao Ma
2010-12-28 15:38     ` Tristan Ye
2010-12-29  5:45       ` Wengang Wang
2010-12-29  6:30         ` Tristan Ye
2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 3/8] Ocfs2: Duplicate old clusters into new blk_offset by dirty and remap pages Tristan Ye
2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 4/8] Ocfs2: move a range of extent Tristan Ye
2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 5/8] Ocfs2: lock allocators and reserve metadata blocks and data clusters for extents moving Tristan Ye
2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 6/8] Ocfs2: defrag a range of extent Tristan Ye
2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 7/8] Ocfs2: move entire/partial extent Tristan Ye
2010-12-28 11:40 ` [Ocfs2-devel] [PATCH 8/8] Ocfs2: move extents within a certain range Tristan Ye
2010-12-28 12:05 ` [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V1 Wengang Wang
2010-12-28 15:44   ` Tristan Ye
  -- strict thread matches above, loose matches on Subject: below --
2011-01-13 10:20 [Ocfs2-devel] [PATCH 0/8] Ocfs2: Online defragmentaion V2 Tristan Ye
2011-01-13 10:20 ` [Ocfs2-devel] [PATCH 1/8] Ocfs2: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2 Tristan Ye
