[Ocfs2-devel] [PATCH 1/3] Add online resize support for ocfs2, take 2


* [Ocfs2-devel] [PATCH 1/3] Add online resize support for ocfs2, take 2
@ 2007-11-27  0:12 Tao Ma
  2007-11-27  0:20 ` [Ocfs2-devel] [PATCH 1/3] Initalize bitmap_cpg of ocfs2_super to be the maximum Tao Ma
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Tao Ma @ 2007-11-27  0:12 UTC (permalink / raw)
  To: ocfs2-devel

Changes from V1 to V2:
Divide the whole process into 2 steps, like ext3.
1) If the last group isn't full, tunefs.ocfs2 will call
   OCFS2_IOC_GROUP_EXTEND first to extend it. All the main work is
   done in the kernel.
2) For every new group, tunefs.ocfs2 will call OCFS2_IOC_GROUP_ADD
   to add them one by one. The new group descriptor is initialized
   in userspace; we only check it in the kernel and update the
   global_bitmap, super blocks etc.

Users can already do an offline resize using tunefs.ocfs2 when a volume
isn't mounted. This series adds support for online resize to ocfs2.

Please note that the node where the online resize runs must already
have the volume mounted. We don't mount it behind the user's back, and
the operation fails if we find it isn't mounted. As for other
nodes, we don't care whether the volume is mounted or not.

global_bitmap, the super block and all the backups will be updated
in the kernel. If the update of the super block or a backup fails, we
just log an error message in dmesg and continue the work.
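
The two-step flow above can be sketched from the userspace side. This is
only a minimal illustration, not tunefs.ocfs2 code: struct resize_plan and
plan_resize() are hypothetical names, and the sketch assumes that the
argument to OCFS2_IOC_GROUP_EXTEND is the number of clusters to add to the
last group.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type, not part of tunefs.ocfs2 or the kernel. */
struct resize_plan {
	uint32_t extend_clusters; /* step 1: clusters added to the last group */
	uint32_t full_groups;     /* step 2: full groups, one GROUP_ADD call each */
	uint32_t tail_clusters;   /* step 2: size of a final partial group, 0 if none */
};

/*
 * Split a grow request from "total" to "target" clusters into the two
 * steps, given "cpg" clusters per group.
 */
static struct resize_plan plan_resize(uint32_t total, uint32_t target,
				      uint32_t cpg)
{
	struct resize_plan p = { 0, 0, 0 };
	uint32_t grow, room;

	assert(cpg > 0 && target >= total);
	grow = target - total;

	/* Step 1: fill the last group if it is not full yet. */
	room = (total % cpg) ? cpg - (total % cpg) : 0;
	p.extend_clusters = grow < room ? grow : room;
	grow -= p.extend_clusters;

	/* Step 2: everything that is left goes into new groups. */
	p.full_groups = grow / cpg;
	p.tail_clusters = grow % cpg;
	return p;
}
```

With 64 clusters per group, growing a volume from 100 to 250 clusters
would first extend the last group by 28 clusters and then add one full
group and one partial group of 58 clusters.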


* [Ocfs2-devel] [PATCH 1/3] Initalize bitmap_cpg of ocfs2_super to be the maximum.
  2007-11-27  0:12 [Ocfs2-devel] [PATCH 1/3] Add online resize support for ocfs2, take 2 Tao Ma
@ 2007-11-27  0:20 ` Tao Ma
  2007-11-30 11:47   ` Mark Fasheh
  2007-11-27  0:22 ` [Ocfs2-devel] [PATCH 3/3] Implement "GROUP_ADD" for online resize, take 2 Tao Ma
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Tao Ma @ 2007-11-27  0:20 UTC (permalink / raw)
  To: ocfs2-devel

This value is initialized from global_bitmap->id2.i_chain.cl_cpg. If there
is only 1 group, it is equal to the total clusters in the volume. So
for online resize, it would have to change on all the nodes in the cluster.
That isn't easy and there is no corresponding lock for it.

bitmap_cpg is only used in 2 areas:
1. Checking whether the suballoc is too large for us to allocate from the
   global bitmap. Now that the suballoc size is 2048, we rarely hit this
   situation and the check is almost useless.
2. Calculating which group a cluster belongs to. We use it during truncate to
   figure out which cluster group an extent belongs to. We should be OK
   if we increase it, though, as the cluster group calculated shouldn't change
   and we only ever have a small bitmap_cpg on file systems with a single
   cluster group.
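
The argument in point 2 can be checked with a small standalone sketch.
The names max_bitmap_cpg() and group_index() are hypothetical, and the
64-byte header size is an assumption about the layout of the group
descriptor block. Only the arithmetic is meant to match the patch
(bitmap size in bytes times 8).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the bitmap starts 64 bytes into the group descriptor block. */
#define GROUP_DESC_HEADER_BYTES 64u

/* Maximum clusters one group descriptor block can track, one bit each. */
static uint32_t max_bitmap_cpg(uint32_t blocksize)
{
	assert(blocksize > GROUP_DESC_HEADER_BYTES);
	return (blocksize - GROUP_DESC_HEADER_BYTES) * 8;
}

/* Index of the cluster group that holds a given cluster. */
static uint32_t group_index(uint32_t cluster, uint32_t cpg)
{
	assert(cpg > 0);
	return cluster / cpg;
}
```

On a single-group volume with 1000 clusters, cluster 999 maps to group 0
whether the divisor is the on-disk value (1000) or the maximum for a 4K
block (32256), so using the maximum does not change the result there.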

Signed-off-by: Tao Ma <tao.ma@oracle.com>
---
 fs/ocfs2/super.c |   19 +------------------
 1 files changed, 1 insertions(+), 18 deletions(-)

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index be562ac..9f3d73c 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1315,7 +1315,6 @@ static int ocfs2_initialize_super(struct super_block *sb,
 	int i, cbits, bbits;
 	struct ocfs2_dinode *di = (struct ocfs2_dinode *)bh->b_data;
 	struct inode *inode = NULL;
-	struct buffer_head *bitmap_bh = NULL;
 	struct ocfs2_journal *journal;
 	__le32 uuid_net_key;
 	struct ocfs2_super *osb;
@@ -1539,25 +1538,9 @@ static int ocfs2_initialize_super(struct super_block *sb,
 	}
 
 	osb->bitmap_blkno = OCFS2_I(inode)->ip_blkno;
-
-	/* We don't have a cluster lock on the bitmap here because
-	 * we're only interested in static information and the extra
-	 * complexity at mount time isn't worht it. Don't pass the
-	 * inode in to the read function though as we don't want it to
-	 * be put in the cache. */
-	status = ocfs2_read_block(osb, osb->bitmap_blkno, &bitmap_bh, 0,
-				  NULL);
 	iput(inode);
-	if (status < 0) {
-		mlog_errno(status);
-		goto bail;
-	}
 
-	di = (struct ocfs2_dinode *) bitmap_bh->b_data;
-	osb->bitmap_cpg = le16_to_cpu(di->id2.i_chain.cl_cpg);
-	brelse(bitmap_bh);
-	mlog(0, "cluster bitmap inode: %llu, clusters per group: %u\n",
-	     (unsigned long long)osb->bitmap_blkno, osb->bitmap_cpg);
+	osb->bitmap_cpg = ocfs2_group_bitmap_size(sb) * 8;
 
 	status = ocfs2_init_slot_info(osb);
 	if (status < 0) {
-- 
gitgui.0.9.0.gd794


* [Ocfs2-devel] [PATCH 1/3] Initalize bitmap_cpg of ocfs2_super to be the maximum.
  2007-11-27  0:20 ` [Ocfs2-devel] [PATCH 1/3] Initalize bitmap_cpg of ocfs2_super to be the maximum Tao Ma
@ 2007-11-30 11:47   ` Mark Fasheh
  0 siblings, 0 replies; 12+ messages in thread
From: Mark Fasheh @ 2007-11-30 11:47 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Nov 27, 2007 at 04:20:19PM +0800, tao.ma wrote:
> This value is initialized from global_bitmap->id2.i_chain.cl_cpg. If there
> is only 1 group, it will be equal to the total clusters in the volume. So
> as for online resize, it should change for all the nodes in the cluster.
> It isn't easy and there is no corresponding lock for it.
> bitmap_cpg is only used in 2 areas:
> 1. Check whether the suballoc is too large for us to allocate from the global
>    bitmap, so it is little used. And now the suballoc size is 2048, it rarely
>    meet this situation and the check is almost useless.
> 2. Calculate which group a cluster belongs to. We use it during truncate to
>    figure out which cluster group an extent belongs too. But we should be OK
>    if we increase it though as the cluster group calculated shouldn't change
>    and we only ever have a small bitmap_cpg on file systems with a single
>    cluster group.
> 
> Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

Ok, I'm going to start carrying this patch in ocfs2.git as I think it's
pretty straightforward and a win regardless of how the rest of the online
resize stuff goes. You don't have to change how you're sending these patches
though - just keep re-sending this one with my signoff and I'll handle
merging things when the rest of the resize work is done.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


* [Ocfs2-devel] [PATCH 3/3] Implement "GROUP_ADD" for online resize, take 2
  2007-11-27  0:12 [Ocfs2-devel] [PATCH 1/3] Add online resize support for ocfs2, take 2 Tao Ma
  2007-11-27  0:20 ` [Ocfs2-devel] [PATCH 1/3] Initalize bitmap_cpg of ocfs2_super to be the maximum Tao Ma
@ 2007-11-27  0:22 ` Tao Ma
  2007-11-30 15:21   ` Mark Fasheh
  2007-11-27  0:23 ` [Ocfs2-devel] [PATCH 2/3] Add group extend " Tao Ma
  2007-11-30 15:43 ` [Ocfs2-devel] [PATCH 1/3] Add online resize support for ocfs2, " Mark Fasheh
  3 siblings, 1 reply; 12+ messages in thread
From: Tao Ma @ 2007-11-27  0:22 UTC (permalink / raw)
  To: ocfs2-devel

Users can already do an offline resize using tunefs.ocfs2 when a volume
isn't mounted. This patch adds support for online resize to ocfs2.

Please note that the node where the online resize runs must already
have the volume mounted. We don't mount it behind the user's back, and
the operation fails if we find it isn't mounted. As for other
nodes, we don't care whether the volume is mounted or not.

global_bitmap, the super block and all the backups will be updated
in the kernel. If the update of the super block or a backup fails, we
just log an error message in dmesg and continue the work.

The whole process is derived from ext3 and divided into 2 steps:
1. If the last group isn't full, tunefs.ocfs2 will call
   OCFS2_IOC_GROUP_EXTEND first to extend it. All the main work is
   done in the kernel.
2. For every new group, tunefs.ocfs2 will call OCFS2_IOC_GROUP_ADD
   to add them one by one. The new group descriptor is initialized
   in userspace; we only check it in the kernel and update the
   global_bitmap, super blocks etc.

This patch includes the implementation of the 2nd step.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
---
 fs/ocfs2/ioctl.c    |    7 ++
 fs/ocfs2/ocfs2.h    |    1 +
 fs/ocfs2/ocfs2_fs.h |   11 +++
 fs/ocfs2/resize.c   |  231 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 250 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index 60698de..ac30fe5 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -118,6 +118,7 @@ int ocfs2_ioctl(struct inode * inode, struct file * filp,
 	u32 new_clusters;
 	int status;
 	struct ocfs2_space_resv sr;
+	struct ocfs2_new_group_input input;
 
 	switch (cmd) {
 	case OCFS2_IOC_GETFLAGS:
@@ -146,6 +147,11 @@ int ocfs2_ioctl(struct inode * inode, struct file * filp,
 			return -EFAULT;
 
 		return ocfs2_group_extend(inode, new_clusters);
+	case OCFS2_IOC_GROUP_ADD:
+		if (copy_from_user(&input, (int __user *) arg, sizeof(input)))
+			return -EFAULT;
+
+		return ocfs2_group_add(inode, &input);
 	default:
 		return -ENOTTY;
 	}
@@ -169,6 +175,7 @@ long ocfs2_compat_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	case OCFS2_IOC_UNRESVSP:
 	case OCFS2_IOC_UNRESVSP64:
 	case OCFS2_IOC_GROUP_EXTEND:
+	case OCFS2_IOC_GROUP_ADD:
 		break;
 	default:
 		return -ENOIOCTLCMD;
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 8a6925b..805a76e 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -531,6 +531,7 @@ static inline unsigned int ocfs2_pages_per_cluster(struct super_block *sb)
  * and return that block offset. */
 u64 ocfs2_which_cluster_group(struct inode *inode, u32 cluster);
 int ocfs2_group_extend(struct inode * inode, u32 new_clusters);
+int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input);
 
 #define ocfs2_set_bit ext2_set_bit
 #define ocfs2_clear_bit ext2_clear_bit
diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 4b5813d..969c310 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -231,7 +231,18 @@ struct ocfs2_space_resv {
 #define OCFS2_IOC_RESVSP64	_IOW ('X', 42, struct ocfs2_space_resv)
 #define OCFS2_IOC_UNRESVSP64	_IOW ('X', 43, struct ocfs2_space_resv)
 
+/* Used to pass group descriptor data when online resize is done */
+struct ocfs2_new_group_input {
+	__u64 group;		/* Group descriptor's blkno. */
+	__u32 clusters;		/* Total number of clusters in this group */
+	__u32 frees;		/* Total free clusters in this group */
+	__u16 chain;		/* Chain for this group */
+	__u16 reserved1;
+	__u32 reserved2;
+};
+
 #define OCFS2_IOC_GROUP_EXTEND	_IOW('f', 7, unsigned long)
+#define OCFS2_IOC_GROUP_ADD	_IOW('f', 8,struct ocfs2_new_group_input)
 
 /*
  * Journal Flags (ocfs2_dinode.id1.journal1.i_flags)
diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
index 5c863af..efd7805 100644
--- a/fs/ocfs2/resize.c
+++ b/fs/ocfs2/resize.c
@@ -364,3 +364,234 @@ out:
 	mlog_exit_void();
 	return ret;
 }
+
+static int ocfs2_check_new_group(struct inode *inode,
+				 struct ocfs2_dinode *di,
+				 struct ocfs2_new_group_input *input)
+{
+	int ret;
+	struct ocfs2_group_desc *gd;
+	struct buffer_head *group_bh = NULL;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	u16 cl_bpc = le16_to_cpu(di->id2.i_chain.cl_bpc);
+	u64 cr_blkno;
+	unsigned int max_bits = le16_to_cpu(di->id2.i_chain.cl_cpg) *
+				le16_to_cpu(di->id2.i_chain.cl_bpc);
+
+	ret = ocfs2_read_block(osb, input->group, &group_bh, 0, NULL);
+	if (ret < 0)
+		goto out;
+
+	gd = (struct ocfs2_group_desc *)group_bh->b_data;
+	cr_blkno = le64_to_cpu(di->id2.i_chain.cl_recs[input->chain].c_blkno);
+
+	ret = -EIO;
+	if (!OCFS2_IS_VALID_GROUP_DESC(gd))
+		OCFS2_RO_ON_INVALID_GROUP_DESC(inode->i_sb, gd);
+	else if (di->i_blkno != gd->bg_parent_dinode)
+		mlog(0, "Group descriptor # %llu has bad parent "
+		     "pointer (%llu, expected %llu)\n",
+		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
+		     (unsigned long long)le64_to_cpu(gd->bg_parent_dinode),
+		     (unsigned long long)le64_to_cpu(di->i_blkno));
+	else if (le16_to_cpu(gd->bg_bits) > max_bits)
+		mlog(0, "Group descriptor # %llu has bit count of %u\n",
+		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
+		     le16_to_cpu(gd->bg_bits));
+	else if (le16_to_cpu(gd->bg_free_bits_count) > le16_to_cpu(gd->bg_bits))
+		mlog(0, "Group descriptor # %llu has bit count %u but "
+		     "claims that %u are free\n",
+		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
+		     le16_to_cpu(gd->bg_bits),
+		     le16_to_cpu(gd->bg_free_bits_count));
+	else if (le16_to_cpu(gd->bg_bits) > (8 * le16_to_cpu(gd->bg_size)))
+		mlog(0, "Group descriptor # %llu has bit count %u but "
+		     "max bitmap bits of %u\n",
+		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
+		     le16_to_cpu(gd->bg_bits),
+		     8 * le16_to_cpu(gd->bg_size));
+	else if (le16_to_cpu(gd->bg_chain) != input->chain)
+		mlog(0, "Group descriptor # %llu has bad chain %u "
+		     "while input has %u set.\n",
+		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
+		    le16_to_cpu(gd->bg_chain), input->chain);
+	else if (le16_to_cpu(gd->bg_bits) != input->clusters * cl_bpc)
+		mlog(0, "Group descriptor # %llu has bit count %u but "
+		     "input has %u clusters set\n",
+		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
+		     le16_to_cpu(gd->bg_bits), input->clusters);
+	else if (le16_to_cpu(gd->bg_free_bits_count) != input->frees * cl_bpc)
+		mlog(0, "Group descriptor # %llu has free bit count %u but "
+		     "it should have %u set\n",
+		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
+		     le16_to_cpu(gd->bg_bits),
+		     input->frees * cl_bpc);
+	else if (le64_to_cpu(gd->bg_next_group) != cr_blkno)
+		mlog(0, "Group descriptor # %llu has next group set as %llu "
+		     "while the chain head has  %llu set\n",
+		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
+		     le64_to_cpu(gd->bg_next_group), cr_blkno);
+	else
+		ret = 0;
+out:
+	if (group_bh)
+		brelse(group_bh);
+
+	return ret;
+}
+
+static int ocfs2_verify_group_input(struct inode *inode,
+				    struct ocfs2_dinode *di,
+				    struct ocfs2_new_group_input *input)
+{
+	u16 cl_count = le16_to_cpu(di->id2.i_chain.cl_count);
+	u16 cl_cpg = le16_to_cpu(di->id2.i_chain.cl_cpg);
+	u16 next_free = le16_to_cpu(di->id2.i_chain.cl_next_free_rec);
+	u32 cluster = ocfs2_blocks_to_clusters(inode->i_sb, input->group);
+	u32 total_clusters = le32_to_cpu(di->i_clusters);
+	int ret = -EINVAL;
+
+	if (cluster < total_clusters)
+		mlog(0, "add a group which is in the current volume.\n");
+	else if (input->chain >= cl_count)
+		mlog(0, "input chain exceeds the limit.\n");
+	else if (next_free != cl_count && next_free != input->chain)
+		mlog(0, "the add group should be in chain %u\n", next_free);
+	else if (total_clusters + input->clusters < total_clusters)
+		mlog(0, "add group's clusters overflow.\n");
+	else if (input->clusters > cl_cpg)
+		mlog(0, "the cluster exceeds the maximum of a group\n");
+	else if (input->frees > input->clusters)
+		mlog(0, "the free cluster exceeds the total clusters\n");
+	else if (total_clusters % cl_cpg != 0)
+		mlog(0, "the last group isn't full. Use group extend first.\n");
+	else if (input->group != ocfs2_which_cluster_group(inode, cluster))
+		mlog(0, "group blkno is invalid\n");
+	else if ((ret = ocfs2_check_new_group(inode, di, input)))
+		mlog(0, "group descriptor check failed.\n");
+	else
+		ret = 0;
+
+	return ret;
+}
+
+/* Add a new group descriptor to global_bitmap. */
+int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
+{
+	int ret;
+	handle_t *handle;
+	struct buffer_head *main_bm_bh = NULL;
+	struct inode *main_bm_inode = NULL;
+	struct ocfs2_dinode *fe = NULL;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	struct ocfs2_chain_list *cl;
+	struct ocfs2_chain_rec *cr;
+	u16 cl_bpc;
+
+	mlog_entry_void();
+
+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
+		return -EROFS;
+
+	main_bm_inode = ocfs2_get_system_file_inode(osb,
+						    GLOBAL_BITMAP_SYSTEM_INODE,
+						    OCFS2_INVALID_SLOT);
+	if (!main_bm_inode) {
+		ret = -EINVAL;
+		mlog_errno(ret);
+		goto out;
+	}
+
+	mutex_lock(&main_bm_inode->i_mutex);
+
+	ret = ocfs2_meta_lock(main_bm_inode, &main_bm_bh, 1);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out_mutex;
+	}
+
+	fe = (struct ocfs2_dinode *)main_bm_bh->b_data;
+
+	if (le16_to_cpu(fe->id2.i_chain.cl_cpg) !=
+				 ocfs2_group_bitmap_size(osb->sb) * 8) {
+		mlog(0, "The disk is too old and small."
+		     " Force to do offline resize.");
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = ocfs2_verify_group_input(main_bm_inode, fe, input);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	mlog(0, "Add a new group  %llu in chain = %u, length = %u\n",
+	     input->group, input->chain, input->clusters);
+
+	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
+	if (IS_ERR(handle)) {
+		mlog_errno(PTR_ERR(handle));
+		handle = NULL;
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = ocfs2_journal_access(handle, main_bm_inode, main_bm_bh,
+				   OCFS2_JOURNAL_ACCESS_WRITE);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	cl_bpc = le16_to_cpu(fe->id2.i_chain.cl_bpc);
+	cl = &fe->id2.i_chain;
+	cr = &cl->cl_recs[input->chain];
+
+	if (input->chain == le16_to_cpu(cl->cl_next_free_rec)) {
+		le16_add_cpu(&cl->cl_next_free_rec, 1);
+		memset(cr, 0, sizeof(struct ocfs2_chain_rec));
+	}
+
+	cr->c_blkno = le64_to_cpu(input->group);
+	le32_add_cpu(&cr->c_total, input->clusters * cl_bpc);
+	le32_add_cpu(&cr->c_free, input->frees * cl_bpc);
+
+	le32_add_cpu(&fe->id1.bitmap1.i_total, input->clusters *cl_bpc);
+	le32_add_cpu(&fe->id1.bitmap1.i_used,
+		     (input->clusters - input->frees) * cl_bpc);
+	le32_add_cpu(&fe->i_clusters, input->clusters);
+	le64_add_cpu(&fe->i_size, input->clusters << osb->s_clustersize_bits);
+
+	ret = ocfs2_journal_dirty(handle, main_bm_bh);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	ret = ocfs2_commit_trans(osb, handle);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
+	OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
+	spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
+
+	ocfs2_update_super_and_backups(main_bm_inode, input->clusters);
+
+out_unlock:
+	if (main_bm_bh)
+		brelse(main_bm_bh);
+
+	ocfs2_meta_unlock(main_bm_inode, 1);
+
+out_mutex:
+	mutex_unlock(&main_bm_inode->i_mutex);
+	iput(main_bm_inode);
+
+out:
+	mlog_exit_void();
+	return ret;
+}
-- 
gitgui.0.9.0.gd794


* [Ocfs2-devel] [PATCH 3/3] Implement "GROUP_ADD" for online resize, take 2
  2007-11-27  0:22 ` [Ocfs2-devel] [PATCH 3/3] Implement "GROUP_ADD" for online resize, take 2 Tao Ma
@ 2007-11-30 15:21   ` Mark Fasheh
  2007-12-02 18:08     ` tao.ma
  2007-12-12 23:06     ` tao.ma
  0 siblings, 2 replies; 12+ messages in thread
From: Mark Fasheh @ 2007-11-30 15:21 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Nov 27, 2007 at 04:21:49PM +0800, tao.ma wrote:
> diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
> index 8a6925b..805a76e 100644
> --- a/fs/ocfs2/ocfs2.h
> +++ b/fs/ocfs2/ocfs2.h
> @@ -531,6 +531,7 @@ static inline unsigned int ocfs2_pages_per_cluster(struct super_block *sb)
>   * and return that block offset. */
>  u64 ocfs2_which_cluster_group(struct inode *inode, u32 cluster);
>  int ocfs2_group_extend(struct inode * inode, u32 new_clusters);
> +int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input);

resize.h please


>  #define ocfs2_set_bit ext2_set_bit
>  #define ocfs2_clear_bit ext2_clear_bit
> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
> index 4b5813d..969c310 100644
> --- a/fs/ocfs2/ocfs2_fs.h
> +++ b/fs/ocfs2/ocfs2_fs.h
> @@ -231,7 +231,18 @@ struct ocfs2_space_resv {
>  #define OCFS2_IOC_RESVSP64	_IOW ('X', 42, struct ocfs2_space_resv)
>  #define OCFS2_IOC_UNRESVSP64	_IOW ('X', 43, struct ocfs2_space_resv)
>  
> +/* Used to pass group descriptor data when online resize is done */
> +struct ocfs2_new_group_input {
> +	__u64 group;		/* Group descriptor's blkno. */
> +	__u32 clusters;		/* Total number of clusters in this group */
> +	__u32 frees;		/* Total free clusters in this group */
> +	__u16 chain;		/* Chain for this group */
> +	__u16 reserved1;
> +	__u32 reserved2;
> +};
> +
>  #define OCFS2_IOC_GROUP_EXTEND	_IOW('f', 7, unsigned long)
> +#define OCFS2_IOC_GROUP_ADD	_IOW('f', 8,struct ocfs2_new_group_input)
>  
>  /*
>   * Journal Flags (ocfs2_dinode.id1.journal1.i_flags)
> diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
> index 5c863af..efd7805 100644
> --- a/fs/ocfs2/resize.c
> +++ b/fs/ocfs2/resize.c
> @@ -364,3 +364,234 @@ out:
>  	mlog_exit_void();
>  	return ret;
>  }
> +
> +static int ocfs2_check_new_group(struct inode *inode,
> +				 struct ocfs2_dinode *di,
> +				 struct ocfs2_new_group_input *input)
> +{
> +	int ret;
> +	struct ocfs2_group_desc *gd;
> +	struct buffer_head *group_bh = NULL;
> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +	u16 cl_bpc = le16_to_cpu(di->id2.i_chain.cl_bpc);
> +	u64 cr_blkno;
> +	unsigned int max_bits = le16_to_cpu(di->id2.i_chain.cl_cpg) *
> +				le16_to_cpu(di->id2.i_chain.cl_bpc);
> +
> +	ret = ocfs2_read_block(osb, input->group, &group_bh, 0, NULL);
> +	if (ret < 0)
> +		goto out;

You should pass the inode so that the group gets added to the caching info.
We're fine if the checks here fail and userspace tries again since this read
(that doesn't use the caching flag) always goes to disk.


> +	gd = (struct ocfs2_group_desc *)group_bh->b_data;
> +	cr_blkno = le64_to_cpu(di->id2.i_chain.cl_recs[input->chain].c_blkno);
> +
> +	ret = -EIO;
> +	if (!OCFS2_IS_VALID_GROUP_DESC(gd))
> +		OCFS2_RO_ON_INVALID_GROUP_DESC(inode->i_sb, gd);

We can probably just throw an error instead of going read-only since an
invalid group descriptor here isn't referenced by the fs yet...


> +	else if (di->i_blkno != gd->bg_parent_dinode)
> +		mlog(0, "Group descriptor # %llu has bad parent "
> +		     "pointer (%llu, expected %llu)\n",
> +		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
> +		     (unsigned long long)le64_to_cpu(gd->bg_parent_dinode),
> +		     (unsigned long long)le64_to_cpu(di->i_blkno));
> +	else if (le16_to_cpu(gd->bg_bits) > max_bits)
> +		mlog(0, "Group descriptor # %llu has bit count of %u\n",
> +		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
> +		     le16_to_cpu(gd->bg_bits));
> +	else if (le16_to_cpu(gd->bg_free_bits_count) > le16_to_cpu(gd->bg_bits))
> +		mlog(0, "Group descriptor # %llu has bit count %u but "
> +		     "claims that %u are free\n",
> +		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
> +		     le16_to_cpu(gd->bg_bits),
> +		     le16_to_cpu(gd->bg_free_bits_count));
> +	else if (le16_to_cpu(gd->bg_bits) > (8 * le16_to_cpu(gd->bg_size)))
> +		mlog(0, "Group descriptor # %llu has bit count %u but "
> +		     "max bitmap bits of %u\n",
> +		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
> +		     le16_to_cpu(gd->bg_bits),
> +		     8 * le16_to_cpu(gd->bg_size));
> +	else if (le16_to_cpu(gd->bg_chain) != input->chain)
> +		mlog(0, "Group descriptor # %llu has bad chain %u "
> +		     "while input has %u set.\n",
> +		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
> +		    le16_to_cpu(gd->bg_chain), input->chain);
> +	else if (le16_to_cpu(gd->bg_bits) != input->clusters * cl_bpc)
> +		mlog(0, "Group descriptor # %llu has bit count %u but "
> +		     "input has %u clusters set\n",
> +		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
> +		     le16_to_cpu(gd->bg_bits), input->clusters);
> +	else if (le16_to_cpu(gd->bg_free_bits_count) != input->frees * cl_bpc)
> +		mlog(0, "Group descriptor # %llu has free bit count %u but "
> +		     "it should have %u set\n",
> +		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
> +		     le16_to_cpu(gd->bg_bits),
> +		     input->frees * cl_bpc);
> +	else if (le64_to_cpu(gd->bg_next_group) != cr_blkno)
> +		mlog(0, "Group descriptor # %llu has next group set as %llu "
> +		     "while the chain head has  %llu set\n",
> +		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
> +		     le64_to_cpu(gd->bg_next_group), cr_blkno);

The 1st block pointer in a chain can change when we re-link groups during
allocation, so this check is invalid.


> +	else
> +		ret = 0;
> +out:
> +	if (group_bh)
> +		brelse(group_bh);
> +
> +	return ret;
> +}
> +
> +static int ocfs2_verify_group_input(struct inode *inode,
> +				    struct ocfs2_dinode *di,
> +				    struct ocfs2_new_group_input *input)
> +{
> +	u16 cl_count = le16_to_cpu(di->id2.i_chain.cl_count);
> +	u16 cl_cpg = le16_to_cpu(di->id2.i_chain.cl_cpg);
> +	u16 next_free = le16_to_cpu(di->id2.i_chain.cl_next_free_rec);
> +	u32 cluster = ocfs2_blocks_to_clusters(inode->i_sb, input->group);
> +	u32 total_clusters = le32_to_cpu(di->i_clusters);
> +	int ret = -EINVAL;
> +
> +	if (cluster < total_clusters)
> +		mlog(0, "add a group which is in the current volume.\n");
> +	else if (input->chain >= cl_count)
> +		mlog(0, "input chain exceeds the limit.\n");
> +	else if (next_free != cl_count && next_free != input->chain)
> +		mlog(0, "the add group should be in chain %u\n", next_free);

If I read this properly, then it's insisting that new groups always be added
to the rightmost chain... I'm not sure that's correct - we want to fill
chains which have fewer groups than the others first.

Anyway, weren't we trying to avoid enforcing group placement policy in the
kernel, short of making sure that the requested chain is a valid one?
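
One possible userspace placement policy along these lines, as a minimal
sketch only: pick_chain() is a hypothetical helper, using the total bit
count of each chain as a proxy for "fewer groups" is an assumption, and so
is preferring a still-unused chain record when one exists. The kernel side
would then only need to check that the chosen chain is a valid one.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Choose the chain for a new group.
 * chain_total[i] is the total bit count of chain i (c_total),
 * cl_count is the number of chain records, and next_free is the
 * first unused record (cl_next_free_rec).
 */
static uint16_t pick_chain(const uint32_t *chain_total, uint16_t cl_count,
			   uint16_t next_free)
{
	uint16_t i, best = 0;

	assert(chain_total && cl_count > 0 && next_free <= cl_count);

	/* An unused chain record is the emptiest chain there is. */
	if (next_free < cl_count)
		return next_free;

	/* All records are in use: take the chain with the least space. */
	for (i = 1; i < cl_count; i++)
		if (chain_total[i] < chain_total[best])
			best = i;
	return best;
}
```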


> +	else if (total_clusters + input->clusters < total_clusters)
> +		mlog(0, "add group's clusters overflow.\n");
> +	else if (input->clusters > cl_cpg)
> +		mlog(0, "the cluster exceeds the maximum of a group\n");
> +	else if (input->frees > input->clusters)
> +		mlog(0, "the free cluster exceeds the total clusters\n");
> +	else if (total_clusters % cl_cpg != 0)
> +		mlog(0, "the last group isn't full. Use group extend first.\n");
> +	else if (input->group != ocfs2_which_cluster_group(inode, cluster))
> +		mlog(0, "group blkno is invalid\n");
> +	else if ((ret = ocfs2_check_new_group(inode, di, input)))
> +		mlog(0, "group descriptor check failed.\n");
> +	else
> +		ret = 0;
> +
> +	return ret;
> +}
> +
> +/* Add a new group descriptor to global_bitmap. */
> +int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
> +{
> +	int ret;
> +	handle_t *handle;
> +	struct buffer_head *main_bm_bh = NULL;
> +	struct inode *main_bm_inode = NULL;
> +	struct ocfs2_dinode *fe = NULL;
> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +	struct ocfs2_chain_list *cl;
> +	struct ocfs2_chain_rec *cr;
> +	u16 cl_bpc;
> +
> +	mlog_entry_void();
> +
> +	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
> +		return -EROFS;
> +
> +	main_bm_inode = ocfs2_get_system_file_inode(osb,
> +						    GLOBAL_BITMAP_SYSTEM_INODE,
> +						    OCFS2_INVALID_SLOT);
> +	if (!main_bm_inode) {
> +		ret = -EINVAL;
> +		mlog_errno(ret);
> +		goto out;
> +	}
> +
> +	mutex_lock(&main_bm_inode->i_mutex);
> +
> +	ret = ocfs2_meta_lock(main_bm_inode, &main_bm_bh, 1);
> +	if (ret < 0) {
> +		mlog_errno(ret);
> +		goto out_mutex;
> +	}
> +
> +	fe = (struct ocfs2_dinode *)main_bm_bh->b_data;
> +
> +	if (le16_to_cpu(fe->id2.i_chain.cl_cpg) !=
> +				 ocfs2_group_bitmap_size(osb->sb) * 8) {
> +		mlog(0, "The disk is too old and small."
> +		     " Force to do offline resize.");
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = ocfs2_verify_group_input(main_bm_inode, fe, input);
> +	if (ret) {
> +		mlog_errno(ret);
> +		goto out_unlock;
> +	}
> +
> +	mlog(0, "Add a new group  %llu in chain = %u, length = %u\n",
> +	     input->group, input->chain, input->clusters);
> +
> +	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
> +	if (IS_ERR(handle)) {
> +		mlog_errno(PTR_ERR(handle));
> +		handle = NULL;
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = ocfs2_journal_access(handle, main_bm_inode, main_bm_bh,
> +				   OCFS2_JOURNAL_ACCESS_WRITE);
> +	if (ret < 0) {
> +		mlog_errno(ret);
> +		goto out_unlock;
> +	}
> +
> +	cl_bpc = le16_to_cpu(fe->id2.i_chain.cl_bpc);
> +	cl = &fe->id2.i_chain;
> +	cr = &cl->cl_recs[input->chain];
> +
> +	if (input->chain == le16_to_cpu(cl->cl_next_free_rec)) {
> +		le16_add_cpu(&cl->cl_next_free_rec, 1);
> +		memset(cr, 0, sizeof(struct ocfs2_chain_rec));
> +	}
> +
> +	cr->c_blkno = le64_to_cpu(input->group);

Before this, you need to set the new group descriptor's bg_next_group value
to cr->c_blkno.

By the way, this probably means you need to read the group descriptor near
the top of this function instead of within ocfs2_check_new_group().
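
The ordering described here can be sketched with simplified stand-ins for
the on-disk structures. The struct layouts and link_group_at_head() are
hypothetical and use host-endian fields. Real kernel code would work on
struct ocfs2_chain_rec and struct ocfs2_group_desc with
cpu_to_le64()/le64_to_cpu() conversions and journal both blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, host-endian stand-ins for the on-disk structures. */
struct chain_rec {
	uint64_t c_blkno;  /* first group descriptor in the chain */
	uint32_t c_total;  /* total bits in the chain */
	uint32_t c_free;   /* free bits in the chain */
};

struct group_desc {
	uint64_t bg_blkno;       /* block number of this descriptor */
	uint64_t bg_next_group;  /* next descriptor in the chain, 0 if last */
};

/* Insert a new group at the head of a chain. */
static void link_group_at_head(struct chain_rec *cr, struct group_desc *gd,
			       uint32_t total_bits, uint32_t free_bits)
{
	assert(cr && gd && free_bits <= total_bits);

	/* Point the new group at the current head first... */
	gd->bg_next_group = cr->c_blkno;
	/* ...and only then make the new group the head of the chain. */
	cr->c_blkno = gd->bg_blkno;

	cr->c_total += total_bits;
	cr->c_free += free_bits;
}
```

Because the current head is read inside the same locked update, it no
longer matters whether allocation has re-linked the chain since userspace
wrote the descriptor.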


> +	le32_add_cpu(&cr->c_total, input->clusters * cl_bpc);
> +	le32_add_cpu(&cr->c_free, input->frees * cl_bpc);
> +
> +	le32_add_cpu(&fe->id1.bitmap1.i_total, input->clusters *cl_bpc);
> +	le32_add_cpu(&fe->id1.bitmap1.i_used,
> +		     (input->clusters - input->frees) * cl_bpc);
> +	le32_add_cpu(&fe->i_clusters, input->clusters);
> +	le64_add_cpu(&fe->i_size, input->clusters << osb->s_clustersize_bits);
> +
> +	ret = ocfs2_journal_dirty(handle, main_bm_bh);
> +	if (ret < 0) {
> +		mlog_errno(ret);
> +		goto out_unlock;
> +	}
> +
> +	ret = ocfs2_commit_trans(osb, handle);
> +	if (ret < 0) {
> +		mlog_errno(ret);
> +		goto out_unlock;
> +	}
> +
> +	spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
> +	OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> +	spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);

Need to also update i_size. I think the other patch was missing
that too...
	--Mark
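
When the size in bytes is computed for that update, the shift has to be
done in 64 bits. A minimal sketch, with clusters_to_bytes() as a
hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cluster count to bytes without truncating to 32 bits. */
static uint64_t clusters_to_bytes(uint32_t clusters,
				  unsigned int clustersize_bits)
{
	assert(clustersize_bits < 32);
	/* Widen before shifting; a 32-bit shift would drop the high bits. */
	return (uint64_t)clusters << clustersize_bits;
}
```

For example, 32256 clusters of 1MB each come to 33822867456 bytes, which
does not fit in 32 bits.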

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


* [Ocfs2-devel] [PATCH 3/3] Implement "GROUP_ADD" for online resize, take 2
  2007-11-30 15:21   ` Mark Fasheh
@ 2007-12-02 18:08     ` tao.ma
  2007-12-03 19:24       ` Mark Fasheh
  2007-12-12 23:06     ` tao.ma
  1 sibling, 1 reply; 12+ messages in thread
From: tao.ma @ 2007-12-02 18:08 UTC (permalink / raw)
  To: ocfs2-devel

Mark Fasheh wrote:
>> +static int ocfs2_check_new_group(struct inode *inode,
>> +				 struct ocfs2_dinode *di,
>> +				 struct ocfs2_new_group_input *input)
>> +{
>> +	int ret;
>> +	struct ocfs2_group_desc *gd;
>> +	struct buffer_head *group_bh = NULL;
>> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>> +	u16 cl_bpc = le16_to_cpu(di->id2.i_chain.cl_bpc);
>> +	u64 cr_blkno;
>> +	unsigned int max_bits = le16_to_cpu(di->id2.i_chain.cl_cpg) *
>> +				le16_to_cpu(di->id2.i_chain.cl_bpc);
>> +
>> +	ret = ocfs2_read_block(osb, input->group, &group_bh, 0, NULL);
>> +	if (ret < 0)
>> +		goto out;
>>     
>
> You should pass the inode so that the group gets added to the caching info.
> We're fine if the checks here fail and userspace tries again since this read
> (that doesn't use the caching flag) always goes to disk.
>   
Just a question. We can cache the group information even if the block
isn't in this inode, right?
I used to think that we *only* cache blocks which are already in
the inode.

>> +	else if (le64_to_cpu(gd->bg_next_group) != cr_blkno)
>> +		mlog(0, "Group descriptor # %llu has next group set as %llu "
>> +		     "while the chain head has  %llu set\n",
>> +		     (unsigned long long)le64_to_cpu(gd->bg_blkno),
>> +		     le64_to_cpu(gd->bg_next_group), cr_blkno);
>>     
>
> The 1st block pointer in a chain can change when we re-link groups during
> allocation, so this check is invalid.
>   
I didn't realize that until I read suballoc.c today. :)
>
>   
>> +static int ocfs2_verify_group_input(struct inode *inode,
>> +				    struct ocfs2_dinode *di,
>> +				    struct ocfs2_new_group_input *input)
>> +{
>> +	u16 cl_count = le16_to_cpu(di->id2.i_chain.cl_count);
>> +	u16 cl_cpg = le16_to_cpu(di->id2.i_chain.cl_cpg);
>> +	u16 next_free = le16_to_cpu(di->id2.i_chain.cl_next_free_rec);
>> +	u32 cluster = ocfs2_blocks_to_clusters(inode->i_sb, input->group);
>> +	u32 total_clusters = le32_to_cpu(di->i_clusters);
>> +	int ret = -EINVAL;
>> +
>> +	if (cluster < total_clusters)
>> +		mlog(0, "add a group which is in the current volume.\n");
>> +	else if (input->chain >= cl_count)
>> +		mlog(0, "input chain exceeds the limit.\n");
>> +	else if (next_free != cl_count && next_free != input->chain)
>> +		mlog(0, "the add group should be in chain %u\n", next_free);
>>     
>
> If I read this properly, then it's insisting that new groups always be added
> to the rightmost chain... I'm not sure that's correct - we want to fill
> chains which have fewer groups than the others first.
>   
Do we want to fill the chains first, instead of growing a chain? This is 
the mechanism used in offline resize. So is it not suitable for online resize?
> Anyway, weren't we trying to avoid enforcing group placement policy in the
> kernel, short of making sure that the requested chain is a valid one?
>   
Sorry for my poor English ( ;) still improving). Do you mean we should 
enforce group placement policy in user space, or vice versa?
>   
>> +
>> +	if (input->chain == le16_to_cpu(cl->cl_next_free_rec)) {
>> +		le16_add_cpu(&cl->cl_next_free_rec, 1);
>> +		memset(cr, 0, sizeof(struct ocfs2_chain_rec));
>> +	}
>> +
>> +	cr->c_blkno = le64_to_cpu(input->group);
>>     
>
> Before this, you need to set the new group descriptor's bg_next_group value
> to cr->c_blkno.
>
> By the way, this probably means you need to read the group descriptor near
> the top of this function instead of within ocfs2_check_new_group().
>   
>
I wrote the group descriptors in user space to avoid the group write in 
the kernel, so bg_next_group is already set to cr->c_blkno by 
tunefs.ocfs2.
But from all 3 of your comments above, it seems that we *have to* write the 
new group descriptor in the kernel.


* [Ocfs2-devel] [PATCH 3/3] Implement "GROUP_ADD" for online resize, take 2
  2007-12-02 18:08     ` tao.ma
@ 2007-12-03 19:24       ` Mark Fasheh
  0 siblings, 0 replies; 12+ messages in thread
From: Mark Fasheh @ 2007-12-03 19:24 UTC (permalink / raw)
  To: ocfs2-devel

On Mon, Dec 03, 2007 at 10:06:32AM +0800, tao.ma wrote:
> Mark Fasheh wrote:
>>> +static int ocfs2_check_new_group(struct inode *inode,
>>> +				 struct ocfs2_dinode *di,
>>> +				 struct ocfs2_new_group_input *input)
>>> +{
>>> +	int ret;
>>> +	struct ocfs2_group_desc *gd;
>>> +	struct buffer_head *group_bh = NULL;
>>> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>> +	u16 cl_bpc = le16_to_cpu(di->id2.i_chain.cl_bpc);
>>> +	u64 cr_blkno;
>>> +	unsigned int max_bits = le16_to_cpu(di->id2.i_chain.cl_cpg) *
>>> +				le16_to_cpu(di->id2.i_chain.cl_bpc);
>>> +
>>> +	ret = ocfs2_read_block(osb, input->group, &group_bh, 0, NULL);
>>> +	if (ret < 0)
>>> +		goto out;
>>>     
>>
>> You should pass the inode so that the group gets added to the caching 
>> info.
>> We're fine if the checks here fail and userspace tries again since this 
>> read
>> (that doesn't use the caching flag) always goes to disk.
>>   
> Just a question: we can cache the group information even if the block isn't 
> in this inode, right?
> I used to think that we *only* cache blocks which are already in the 
> inode.

Yeah, you're right... Hmm. On one hand, if we don't cache it now but journal
it, things might be a bit inconsistent. On the other hand, if we cache it but
then the operation fails, we'll have the uptodate cache referring to a
block which isn't actually uptodate.

For now, leaving it uncached is fine. Ultimately, it would be nice to add
it to the uptodate cache though. Maybe what you should do is call
"ocfs2_set_new_buffer_uptodate()" on it if the operation succeeds.


>>> +static int ocfs2_verify_group_input(struct inode *inode,
>>> +				    struct ocfs2_dinode *di,
>>> +				    struct ocfs2_new_group_input *input)
>>> +{
>>> +	u16 cl_count = le16_to_cpu(di->id2.i_chain.cl_count);
>>> +	u16 cl_cpg = le16_to_cpu(di->id2.i_chain.cl_cpg);
>>> +	u16 next_free = le16_to_cpu(di->id2.i_chain.cl_next_free_rec);
>>> +	u32 cluster = ocfs2_blocks_to_clusters(inode->i_sb, input->group);
>>> +	u32 total_clusters = le32_to_cpu(di->i_clusters);
>>> +	int ret = -EINVAL;
>>> +
>>> +	if (cluster < total_clusters)
>>> +		mlog(0, "add a group which is in the current volume.\n");
>>> +	else if (input->chain >= cl_count)
>>> +		mlog(0, "input chain exceeds the limit.\n");
>>> +	else if (next_free != cl_count && next_free != input->chain)
>>> +		mlog(0, "the add group should be in chain %u\n", next_free);
>>>     
>>
>> If I read this properly, then it's insisting that new groups always be 
>> added
>> to the rightmost chain... I'm not sure that's correct - we want to fill
>> chains which have fewer groups than the others first.
>>   
> Do we want to fill the chains first, instead of growing a chain? This is the 
> mechanism used in offline resize. So is it not suitable for online resize?
>> Anyway, weren't we trying to avoid enforcing group placement policy in the
>> kernel, short of making sure that the requested chain is a valid one?
>>   
> Sorry for my poor English ( ;) still improving). Do you mean we should 
> enforce group placement policy in user space, or vice versa?

Well, I just mean which chain the group goes to. It's not a big deal to do
this in kernel, but since userspace has all the information it needs, it can
write a chain number which the kernel just honors.
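
As a rough illustration of that split, here is a minimal userspace sketch. It is not taken from the patch or from tunefs.ocfs2: the `chain_list_model` struct, `pick_chain()` and `chain_is_valid()` are hypothetical names, and the "shortest chain first" policy is only one reading of the comment above. The idea is that placement policy lives in the caller, while the kernel-side check only confirms the requested slot is usable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the chain list in the bitmap inode. */
struct chain_list_model {
	uint16_t count;      /* total chain slots, like cl_count */
	uint16_t next_free;  /* first unused slot, like cl_next_free_rec */
	uint32_t groups[8];  /* groups per chain (model only, max 8 chains) */
};

/*
 * Userspace policy (assumption): an unused slot holds zero groups, so it
 * is the "shortest" chain; otherwise pick the chain with the fewest groups.
 */
static uint16_t pick_chain(const struct chain_list_model *cl)
{
	uint16_t i, best = 0;

	assert(cl->count > 0 && cl->count <= 8 && cl->next_free <= cl->count);

	if (cl->next_free < cl->count)
		return cl->next_free;

	for (i = 1; i < cl->count; i++)
		if (cl->groups[i] < cl->groups[best])
			best = i;
	return best;
}

/* Kernel-side check: no policy, only "is this a usable chain slot?" */
static int chain_is_valid(const struct chain_list_model *cl, uint16_t chain)
{
	return chain < cl->count && chain <= cl->next_free;
}
```

With this shape, changing the placement policy later only touches the userspace tool; the validity check stays the same.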


>>> +	if (input->chain == le16_to_cpu(cl->cl_next_free_rec)) {
>>> +		le16_add_cpu(&cl->cl_next_free_rec, 1);
>>> +		memset(cr, 0, sizeof(struct ocfs2_chain_rec));
>>> +	}
>>> +
>>> +	cr->c_blkno = le64_to_cpu(input->group);
>>>     
>>
>> Before this, you need to set the new group descriptor's bg_next_group value
>> to cr->c_blkno.
>>
>> By the way, this probably means you need to read the group descriptor near
>> the top of this function instead of within ocfs2_check_new_group().
>>   
> I wrote the group descriptors in user space to avoid the group write in 
> the kernel, so bg_next_group is already set to cr->c_blkno by 
> tunefs.ocfs2.
> But from all 3 of your comments above, it seems that we *have to* write the 
> new group descriptor in the kernel.

Yeah, there's no way to avoid writing the group descriptor in kernel at
least once. Some information which needs to be written into it requires that
the bitmap lock be held.

The question then becomes "how much of the group descriptor do we fill in
kernel?" I can go either way on that one. You can either fill the entire
block in kernel, or write most of it out in userspace and have kernel only
change the parts required for it to be consistent with the bitmap.

There's already some code for writing block group descriptors in suballoc.c,
you might want to look at that.
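
To make the ordering concrete, here is a small sketch of linking a new group at the head of a chain. It is a simplified model, not the kernel code: the two structs are hypothetical stand-ins that hold CPU-order values (the on-disk fields are little-endian), and `link_group_at_head()` is an invented name. The point is only that the descriptor must pick up the old chain head before the chain record is overwritten.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, CPU-order stand-ins for the on-disk structures. */
struct chain_rec_model {
	uint64_t c_blkno;        /* block number of the first group in the chain */
};

struct group_desc_model {
	uint64_t bg_blkno;       /* block number of this group descriptor */
	uint64_t bg_next_group;  /* next group in the chain, 0 if none */
	uint16_t bg_chain;       /* chain this group belongs to */
};

/* Link a new group at the head of a chain. */
static void link_group_at_head(struct chain_rec_model *cr,
			       struct group_desc_model *gd,
			       uint16_t chain)
{
	assert(gd->bg_blkno != 0);

	gd->bg_chain = chain;
	/* Save the old head first; it is 0 for a chain that was empty. */
	gd->bg_next_group = cr->c_blkno;
	/* Only now may the chain record point at the new group. */
	cr->c_blkno = gd->bg_blkno;
}
```

Because the chain head can change under the bitmap lock, the value written into bg_next_group is only reliable when it is read while that lock is held, which is why this step cannot be left to userspace.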
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


* [Ocfs2-devel] [PATCH 3/3] Implement "GROUP_ADD" for online resize, take 2
  2007-11-30 15:21   ` Mark Fasheh
  2007-12-02 18:08     ` tao.ma
@ 2007-12-12 23:06     ` tao.ma
  1 sibling, 0 replies; 12+ messages in thread
From: tao.ma @ 2007-12-12 23:06 UTC (permalink / raw)
  To: ocfs2-devel

Mark Fasheh wrote:
>
>> +	le32_add_cpu(&cr->c_total, input->clusters * cl_bpc);
>> +	le32_add_cpu(&cr->c_free, input->frees * cl_bpc);
>> +
>> +	le32_add_cpu(&fe->id1.bitmap1.i_total, input->clusters *cl_bpc);
>> +	le32_add_cpu(&fe->id1.bitmap1.i_used,
>> +		     (input->clusters - input->frees) * cl_bpc);
>> +	le32_add_cpu(&fe->i_clusters, input->clusters);
>> +	le64_add_cpu(&fe->i_size, input->clusters << osb->s_clustersize_bits);
>> +
>> +	ret = ocfs2_journal_dirty(handle, main_bm_bh);
>> +	if (ret < 0) {
>> +		mlog_errno(ret);
>> +		goto out_unlock;
>> +	}
>> +
>> +	ret = ocfs2_commit_trans(osb, handle);
>> +	if (ret < 0) {
>> +		mlog_errno(ret);
>> +		goto out_unlock;
>> +	}
>> +
>> +	spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
>> +	OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
>> +	spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
>>     
>
> Need to also update i_size. I think the other patch was missing
> that too...
> 	--Mark
>   
Sorry for the late response. I have been on leave for some days.
Do you mean "fe->i_size", which I have updated above?
If so, do I need to take the spin_lock and update "fe->i_size" and 
"ip_clusters" at the same time before commit_trans?
I saw some similar code in suballoc.c.


* [Ocfs2-devel]  [PATCH 2/3] Add group extend for online resize, take 2
  2007-11-27  0:12 [Ocfs2-devel] [PATCH 1/3] Add online resize support for ocfs2, take 2 Tao Ma
  2007-11-27  0:20 ` [Ocfs2-devel] [PATCH 1/3] Initalize bitmap_cpg of ocfs2_super to be the maximum Tao Ma
  2007-11-27  0:22 ` [Ocfs2-devel] [PATCH 3/3] Implement "GROUP_ADD" for online resize, take 2 Tao Ma
@ 2007-11-27  0:23 ` Tao Ma
  2007-11-27 17:34   ` tao.ma
  2007-11-30 11:42   ` Mark Fasheh
  2007-11-30 15:43 ` [Ocfs2-devel] [PATCH 1/3] Add online resize support for ocfs2, " Mark Fasheh
  3 siblings, 2 replies; 12+ messages in thread
From: Tao Ma @ 2007-11-27  0:23 UTC (permalink / raw)
  To: ocfs2-devel

Users can do offline resize using tunefs.ocfs2 when a volume isn't
mounted. Now support for online resize is added to ocfs2.

Please note that the node where online resize runs must already
have the volume mounted. We don't mount it behind the user's back, and
the operation fails if we find it isn't mounted. As for other
nodes, we don't care whether the volume is mounted or not.

global_bitmap, super block and all the backups will be updated
in the kernel. If the update of the super block or a backup fails, we
just output an error message in dmesg and continue the work.

The whole process is derived from ext3 and divided into 2 steps:
1. If the last group isn't full, tunefs.ocfs2 will call
   OCFS2_IOC_GROUP_EXTEND first to extend it. All the main work is
   done in kernel.
2. For every new group, tunefs.ocfs2 will call OCFS2_IOC_GROUP_ADD
   to add them one by one. The new group descriptor is initialized
   in userspace; we only check it in the kernel and update the
   global_bitmap, super blocks etc.

This patch includes the implementation for the 1st step.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
---
 fs/ocfs2/Makefile         |    3 +-
 fs/ocfs2/buffer_head_io.c |   61 ++++++++
 fs/ocfs2/buffer_head_io.h |    2 +
 fs/ocfs2/ioctl.c          |    7 +
 fs/ocfs2/journal.h        |    3 +
 fs/ocfs2/ocfs2.h          |    5 +
 fs/ocfs2/ocfs2_fs.h       |    2 +
 fs/ocfs2/resize.c         |  366 +++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/suballoc.c       |    5 +-
 9 files changed, 449 insertions(+), 5 deletions(-)
 create mode 100644 fs/ocfs2/resize.c

diff --git a/fs/ocfs2/Makefile b/fs/ocfs2/Makefile
index 9fb8132..ecc58b8 100644
--- a/fs/ocfs2/Makefile
+++ b/fs/ocfs2/Makefile
@@ -28,7 +28,8 @@ ocfs2-objs := \
 	sysfile.o 		\
 	uptodate.o		\
 	ver.o 			\
-	vote.o
+	vote.o			\
+	resize.o
 
 obj-$(CONFIG_OCFS2_FS) += cluster/
 obj-$(CONFIG_OCFS2_FS) += dlm/
diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
index c903741..6eaa67f 100644
--- a/fs/ocfs2/buffer_head_io.c
+++ b/fs/ocfs2/buffer_head_io.c
@@ -280,3 +280,64 @@ bail:
 	mlog_exit(status);
 	return status;
 }
+
+/* Check whether the blkno is the super block or one of the backups. */
+static inline void ocfs2_check_super_or_backup(struct super_block *sb,
+					       sector_t blkno)
+{
+	int i;
+	u64 backup_blkno;
+
+	if (blkno == OCFS2_SUPER_BLOCK_BLKNO)
+		return;
+
+	for (i = 0; i < OCFS2_MAX_BACKUP_SUPERBLOCKS; i++) {
+		backup_blkno = ocfs2_backup_super_blkno(sb, i);
+		if (backup_blkno == blkno)
+			return;
+	}
+
+	BUG();
+}
+
+/*
+ * Writing the super block and its backups doesn't need to coordinate with
+ * the journal, so we don't need to lock ip_io_mutex and the inode doesn't
+ * need to be passed into this function.
+ */
+int ocfs2_write_super_or_backup(struct ocfs2_super *osb,
+				struct buffer_head *bh)
+{
+	int ret = 0;
+
+	mlog_entry_void();
+
+	BUG_ON(buffer_jbd(bh));
+	ocfs2_check_super_or_backup(osb->sb, bh->b_blocknr);
+
+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb)) {
+		ret = -EROFS;
+		goto out;
+	}
+
+	lock_buffer(bh);
+	set_buffer_uptodate(bh);
+
+	/* remove from dirty list before I/O. */
+	clear_buffer_dirty(bh);
+
+	get_bh(bh); /* for end_buffer_write_sync() */
+	bh->b_end_io = end_buffer_write_sync;
+	submit_bh(WRITE, bh);
+
+	wait_on_buffer(bh);
+
+	if (!buffer_uptodate(bh)) {
+		ret = -EIO;
+		brelse(bh);
+	}
+
+out:
+	mlog_exit(ret);
+	return ret;
+}
diff --git a/fs/ocfs2/buffer_head_io.h b/fs/ocfs2/buffer_head_io.h
index 6cc2093..c2e7861 100644
--- a/fs/ocfs2/buffer_head_io.h
+++ b/fs/ocfs2/buffer_head_io.h
@@ -47,6 +47,8 @@ int ocfs2_read_blocks(struct ocfs2_super          *osb,
 		      int                  flags,
 		      struct inode        *inode);
 
+int ocfs2_write_super_or_backup(struct ocfs2_super *osb,
+				struct buffer_head *bh);
 
 #define OCFS2_BH_CACHED            1
 #define OCFS2_BH_READAHEAD         8
diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index 87dcece..60698de 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -115,6 +115,7 @@ int ocfs2_ioctl(struct inode * inode, struct file * filp,
 	unsigned int cmd, unsigned long arg)
 {
 	unsigned int flags;
+	u32 new_clusters;
 	int status;
 	struct ocfs2_space_resv sr;
 
@@ -140,6 +141,11 @@ int ocfs2_ioctl(struct inode * inode, struct file * filp,
 			return -EFAULT;
 
 		return ocfs2_change_file_space(filp, cmd, &sr);
+	case OCFS2_IOC_GROUP_EXTEND:
+		if (get_user(new_clusters, (__u32 __user *)arg))
+			return -EFAULT;
+
+		return ocfs2_group_extend(inode, new_clusters);
 	default:
 		return -ENOTTY;
 	}
@@ -162,6 +168,7 @@ long ocfs2_compat_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	case OCFS2_IOC_RESVSP64:
 	case OCFS2_IOC_UNRESVSP:
 	case OCFS2_IOC_UNRESVSP64:
+	case OCFS2_IOC_GROUP_EXTEND:
 		break;
 	default:
 		return -ENOIOCTLCMD;
diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
index 4b32e09..0ba3a42 100644
--- a/fs/ocfs2/journal.h
+++ b/fs/ocfs2/journal.h
@@ -278,6 +278,9 @@ int                  ocfs2_journal_dirty_data(handle_t *handle,
 /* simple file updates like chmod, etc. */
 #define OCFS2_INODE_UPDATE_CREDITS 1
 
+/* group extend. inode update and last group update. */
+#define OCFS2_GROUP_EXTEND_CREDITS	(OCFS2_INODE_UPDATE_CREDITS + 1)
+
 /* get one bit out of a suballocator: dinode + group descriptor +
  * prev. group desc. if we relink. */
 #define OCFS2_SUBALLOC_ALLOC (3)
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 60a23e1..8a6925b 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -527,6 +527,11 @@ static inline unsigned int ocfs2_pages_per_cluster(struct super_block *sb)
 	return pages_per_cluster;
 }
 
+/* given a cluster offset, calculate which block group it belongs to
+ * and return that block offset. */
+u64 ocfs2_which_cluster_group(struct inode *inode, u32 cluster);
+int ocfs2_group_extend(struct inode * inode, u32 new_clusters);
+
 #define ocfs2_set_bit ext2_set_bit
 #define ocfs2_clear_bit ext2_clear_bit
 #define ocfs2_test_bit ext2_test_bit
diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 6ef8767..4b5813d 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -231,6 +231,8 @@ struct ocfs2_space_resv {
 #define OCFS2_IOC_RESVSP64	_IOW ('X', 42, struct ocfs2_space_resv)
 #define OCFS2_IOC_UNRESVSP64	_IOW ('X', 43, struct ocfs2_space_resv)
 
+#define OCFS2_IOC_GROUP_EXTEND	_IOW('f', 7, unsigned long)
+
 /*
  * Journal Flags (ocfs2_dinode.id1.journal1.i_flags)
  */
diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
new file mode 100644
index 0000000..5c863af
--- /dev/null
+++ b/fs/ocfs2/resize.c
@@ -0,0 +1,366 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * resize.c
+ *
+ * volume resize.
+ * Inspired by ext3/resize.c.
+ *
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/types.h>
+
+#define MLOG_MASK_PREFIX ML_DISK_ALLOC
+#include <cluster/masklog.h>
+
+#include "ocfs2.h"
+
+#include "alloc.h"
+#include "dlmglue.h"
+#include "inode.h"
+#include "journal.h"
+#include "super.h"
+#include "sysfile.h"
+#include "uptodate.h"
+
+#include "buffer_head_io.h"
+
+/*
+ * Check whether any new backup superblocks exist
+ * in the last group. If there are some, mark them and modify
+ * the group information.
+ * Return how many backups we find in the last group.
+ */
+static u16 ocfs2_add_new_backup_super(struct inode *inode,
+				      struct ocfs2_group_desc *gd,
+				      u32 new_clusters,
+				      u32 first_new_cluster,
+				      u16 cl_cpg)
+{
+	int i;
+	u16 backups = 0;
+	u32 cluster;
+	u64 blkno, gd_blkno, lgd_blkno = le64_to_cpu(gd->bg_blkno);
+
+	for (i = 0; i < OCFS2_MAX_BACKUP_SUPERBLOCKS; i++) {
+		blkno = ocfs2_backup_super_blkno(inode->i_sb, i);
+		cluster = ocfs2_blocks_to_clusters(inode->i_sb, blkno);
+
+		gd_blkno = ocfs2_which_cluster_group(inode, cluster);
+		if (gd_blkno < lgd_blkno)
+			continue;
+		else if (gd_blkno > lgd_blkno)
+			break;
+
+		ocfs2_set_bit(cluster % cl_cpg, (unsigned long *)gd->bg_bitmap);
+		le16_add_cpu(&gd->bg_free_bits_count, -1);
+		backups++;
+	}
+
+	mlog_exit_void();
+	return backups;
+}
+
+static int ocfs2_update_last_group_and_inode(handle_t *handle,
+					     struct inode *bm_inode,
+					     struct buffer_head *bm_bh,
+					     struct buffer_head *group_bh,
+					     u32 first_new_cluster,
+					     u32 new_clusters)
+{
+	int ret = 0;
+	struct ocfs2_super *osb = OCFS2_SB(bm_inode->i_sb);
+	struct ocfs2_dinode *fe = (struct ocfs2_dinode *) bm_bh->b_data;
+	struct ocfs2_chain_list *cl = &fe->id2.i_chain;
+	struct ocfs2_chain_rec *cr;
+	struct ocfs2_group_desc *group;
+	u16 chain, num_bits, backups = 0;
+	u16 cl_bpc = le16_to_cpu(cl->cl_bpc);
+	u16 cl_cpg = le16_to_cpu(cl->cl_cpg);
+
+	mlog_entry("(new_clusters=%u, first_new_cluster = %u)\n",
+		   new_clusters, first_new_cluster);
+
+	ret = ocfs2_journal_access(handle, bm_inode, group_bh,
+				   OCFS2_JOURNAL_ACCESS_WRITE);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+	group = (struct ocfs2_group_desc *) group_bh->b_data;
+
+	/* update the group first. */
+	num_bits = new_clusters * cl_bpc;
+	le16_add_cpu(&group->bg_bits, num_bits);
+	le16_add_cpu(&group->bg_free_bits_count, num_bits);
+
+	/*
+	 * check whether there are some new backup superblocks exist in
+	 * this group and update the group bitmap accordingly.
+	 */
+	if (OCFS2_HAS_COMPAT_FEATURE(osb->sb,
+				     OCFS2_FEATURE_COMPAT_BACKUP_SB))
+		backups = ocfs2_add_new_backup_super(bm_inode,
+						     group,
+						     new_clusters,
+						     first_new_cluster,
+						     cl_cpg);
+
+	ret = ocfs2_journal_dirty(handle, group_bh);
+	if (ret < 0)
+		mlog_errno(ret);
+
+	/* update the inode accordingly. */
+	ret = ocfs2_journal_access(handle, bm_inode, bm_bh,
+				   OCFS2_JOURNAL_ACCESS_WRITE);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+	chain = le16_to_cpu(group->bg_chain);
+	cr = (&cl->cl_recs[chain]);
+	le32_add_cpu(&cr->c_total, num_bits);
+	le32_add_cpu(&cr->c_free, num_bits);
+	le32_add_cpu(&fe->id1.bitmap1.i_total, num_bits);
+
+	if (backups) {
+		le32_add_cpu(&cr->c_free, -1 * backups);
+		le32_add_cpu(&fe->id1.bitmap1.i_used, backups);
+	}
+
+	le32_add_cpu(&fe->i_clusters, new_clusters);
+	le64_add_cpu(&fe->i_size, new_clusters << osb->s_clustersize_bits);
+
+	ret = ocfs2_journal_dirty(handle, bm_bh);
+	if (ret < 0)
+		mlog_errno(ret);
+
+out:
+	mlog_exit(ret);
+	return ret;
+}
+
+static void update_backups(struct inode * inode, u32 clusters, char *data)
+{
+	int i, ret = 0;
+	u32 cluster;
+	u64 blkno;
+	struct buffer_head *backup = NULL;
+	struct ocfs2_dinode *backup_di = NULL;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+	/* calculate the real backups we need to update. */
+	for (i = 0; i < OCFS2_MAX_BACKUP_SUPERBLOCKS; i++) {
+		blkno = ocfs2_backup_super_blkno(inode->i_sb, i);
+		cluster = ocfs2_blocks_to_clusters(inode->i_sb, blkno);
+		 if (cluster > clusters)
+			break;
+
+		ret = ocfs2_read_block(osb, blkno, &backup, 0, NULL);
+		if (ret < 0) {
+			mlog_errno(ret);
+			return;
+		}
+
+		memcpy(backup->b_data, data, inode->i_sb->s_blocksize);
+
+		backup_di = (struct ocfs2_dinode *)backup->b_data;
+		backup_di->i_blkno = cpu_to_le64(blkno);
+
+		ret = ocfs2_write_super_or_backup(osb, backup);
+		brelse(backup);
+		backup = NULL;
+		if (ret < 0) {
+			mlog_errno(ret);
+			break;
+		}
+	}
+
+	return;
+}
+
+static void ocfs2_update_super_and_backups(struct inode *inode,
+					   u32 new_clusters)
+{
+	int ret;
+	u32 clusters = 0;
+	struct buffer_head *super_bh = NULL;
+	struct ocfs2_dinode *super_di = NULL;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+	/*
+	 * update the superblock last.
+	 * It doesn't matter if the write failed.
+	 */
+	ret = ocfs2_read_block(osb, OCFS2_SUPER_BLOCK_BLKNO,
+			       &super_bh, 0, NULL);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+	super_di = (struct ocfs2_dinode *)super_bh->b_data;
+	le32_add_cpu(&super_di->i_clusters, new_clusters);
+	clusters = le32_to_cpu(super_di->i_clusters);
+
+	ret = ocfs2_write_super_or_backup(osb, super_bh);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+	if (OCFS2_HAS_COMPAT_FEATURE(osb->sb, OCFS2_FEATURE_COMPAT_BACKUP_SB))
+		update_backups(inode, clusters, super_bh->b_data);
+
+out:
+	if (super_bh)
+		brelse(super_bh);
+	return;
+}
+
+/*
+ * Extend the filesystem to the new number of clusters specified.  This entry
+ * point is only used to extend the current filesystem to the end of the last
+ * existing group.
+ */
+int ocfs2_group_extend(struct inode * inode, u32 new_clusters)
+{
+	int ret;
+	handle_t *handle;
+	struct buffer_head *main_bm_bh = NULL;
+	struct buffer_head *group_bh = NULL;
+	struct inode *main_bm_inode = NULL;
+	struct ocfs2_dinode *fe = NULL;
+	struct ocfs2_group_desc *group = NULL;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	u16 cl_bpc;
+	u32 first_new_cluster;
+	u64 lgd_blkno;
+
+	mlog_entry_void();
+
+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
+		return -EROFS;
+
+	if (new_clusters == 0)
+		return 0;
+
+	main_bm_inode = ocfs2_get_system_file_inode(osb,
+						    GLOBAL_BITMAP_SYSTEM_INODE,
+						    OCFS2_INVALID_SLOT);
+	if (!main_bm_inode) {
+		ret = -EINVAL;
+		mlog_errno(ret);
+		goto out;
+	}
+
+	mutex_lock(&main_bm_inode->i_mutex);
+
+	ret = ocfs2_meta_lock(main_bm_inode, &main_bm_bh, 1);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out_mutex;
+	}
+
+	fe = (struct ocfs2_dinode *)main_bm_bh->b_data;
+
+	if (le16_to_cpu(fe->id2.i_chain.cl_cpg) !=
+				 ocfs2_group_bitmap_size(osb->sb) * 8) {
+		mlog(0, "The disk is too old. Force to do offline resize.");
+		ret = -EINVAL;
+		goto out_mutex;
+	}
+
+	if (!OCFS2_IS_VALID_DINODE(fe)) {
+		OCFS2_RO_ON_INVALID_DINODE(main_bm_inode->i_sb, fe);
+		ret = -EIO;
+		goto out_unlock;
+	}
+
+	first_new_cluster = le32_to_cpu(fe->i_clusters);
+	lgd_blkno = ocfs2_which_cluster_group(main_bm_inode,
+					      first_new_cluster - 1);
+
+	ret = ocfs2_read_block(osb, lgd_blkno, &group_bh, OCFS2_BH_CACHED,
+			       main_bm_inode);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	group = (struct ocfs2_group_desc *) group_bh->b_data;
+	cl_bpc = le16_to_cpu(fe->id2.i_chain.cl_bpc);
+	if (new_clusters + le16_to_cpu(group->bg_bits) / cl_bpc >
+		le16_to_cpu(fe->id2.i_chain.cl_cpg)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	mlog(0, "extend the last group at %llu, new cluster = %u\n",
+	     le64_to_cpu(group->bg_blkno), new_clusters);
+
+	handle = ocfs2_start_trans(osb, OCFS2_GROUP_EXTEND_CREDITS);
+	if (IS_ERR(handle)) {
+		mlog_errno(PTR_ERR(handle));
+		handle = NULL;
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* update the last group descriptor and inode. */
+	ret = ocfs2_update_last_group_and_inode(handle, main_bm_inode,
+						main_bm_bh, group_bh,
+						first_new_cluster,
+						new_clusters);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	ret = ocfs2_commit_trans(osb, handle);
+	if (ret < 0) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
+	OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
+	spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
+
+	ocfs2_update_super_and_backups(inode, new_clusters);
+out_unlock:
+	if (group_bh)
+		brelse(group_bh);
+
+	if (main_bm_bh)
+		brelse(main_bm_bh);
+
+	ocfs2_meta_unlock(main_bm_inode, 1);
+
+out_mutex:
+	mutex_unlock(&main_bm_inode->i_mutex);
+	iput(main_bm_inode);
+
+out:
+	mlog_exit_void();
+	return ret;
+}
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 8f09f52..3d0c988 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -101,8 +101,6 @@ static inline int ocfs2_block_group_reasonably_empty(struct ocfs2_group_desc *bg
 static inline u32 ocfs2_desc_bitmap_to_cluster_off(struct inode *inode,
 						   u64 bg_blkno,
 						   u16 bg_bit_off);
-static inline u64 ocfs2_which_cluster_group(struct inode *inode,
-					    u32 cluster);
 static inline void ocfs2_block_to_cluster_group(struct inode *inode,
 						u64 data_blkno,
 						u64 *bg_blkno,
@@ -1443,8 +1441,7 @@ static inline u32 ocfs2_desc_bitmap_to_cluster_off(struct inode *inode,
 
 /* given a cluster offset, calculate which block group it belongs to
  * and return that block offset. */
-static inline u64 ocfs2_which_cluster_group(struct inode *inode,
-					    u32 cluster)
+u64 ocfs2_which_cluster_group(struct inode *inode, u32 cluster)
 {
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	u32 group_no;
-- 
gitgui.0.9.0.gd794


* [Ocfs2-devel]  [PATCH 2/3] Add group extend for online resize, take 2
  2007-11-27  0:23 ` [Ocfs2-devel] [PATCH 2/3] Add group extend " Tao Ma
@ 2007-11-27 17:34   ` tao.ma
  2007-11-30 11:42   ` Mark Fasheh
  1 sibling, 0 replies; 12+ messages in thread
From: tao.ma @ 2007-11-27 17:34 UTC (permalink / raw)
  To: ocfs2-devel

Tao Ma wrote:
> +/*
> + * Extend the filesystem to the new number of clusters specified.  This entry
> + * point is only used to extend the current filesystem to the end of the last
> + * existing group.
> + */
> +int ocfs2_group_extend(struct inode * inode, u32 new_clusters)
> +{
> +	int ret;
> +	handle_t *handle;
> +	struct buffer_head *main_bm_bh = NULL;
> +	struct buffer_head *group_bh = NULL;
> +	struct inode *main_bm_inode = NULL;
> +	struct ocfs2_dinode *fe = NULL;
> +	struct ocfs2_group_desc *group = NULL;
> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +	u16 cl_bpc;
> +	u32 first_new_cluster;
> +	u64 lgd_blkno;
> +
> +	mlog_entry_void();
> +
> +	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
> +		return -EROFS;
> +
> +	if (new_clusters == 0)
> +		return 0;
> +
> +	main_bm_inode = ocfs2_get_system_file_inode(osb,
> +						    GLOBAL_BITMAP_SYSTEM_INODE,
> +						    OCFS2_INVALID_SLOT);
> +	if (!main_bm_inode) {
> +		ret = -EINVAL;
> +		mlog_errno(ret);
> +		goto out;
> +	}
> +
> +	mutex_lock(&main_bm_inode->i_mutex);
> +
> +	ret = ocfs2_meta_lock(main_bm_inode, &main_bm_bh, 1);
> +	if (ret < 0) {
> +		mlog_errno(ret);
> +		goto out_mutex;
> +	}
> +
> +	fe = (struct ocfs2_dinode *)main_bm_bh->b_data;
> +
> +	if (le16_to_cpu(fe->id2.i_chain.cl_cpg) !=
> +				 ocfs2_group_bitmap_size(osb->sb) * 8) {
> +		mlog(0, "The disk is too old. Force to do offline resize.");
> +		ret = -EINVAL;
> +		goto out_mutex;
> +	}
>   
There is a bug here: it should be "goto out_unlock", since the cluster lock
taken by ocfs2_meta_lock() is already held at this point.
I will fix it after I collect all the feedback.
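
The general rule behind this fix can be sketched with a toy model of the cleanup ladder. Nothing below is kernel code: `resize_model` and `group_extend_model()` are hypothetical, and the two flags merely stand in for i_mutex and the cluster lock. Each error path jumps to the label that undoes exactly what has been acquired so far.

```c
#include <assert.h>
#include <errno.h>

/* Toy state: 1 means the corresponding lock is currently held. */
struct resize_model {
	int mutex_held;
	int cluster_lock_held;
};

static int group_extend_model(struct resize_model *m,
			      int lock_fails, int disk_too_old)
{
	int ret = 0;

	assert(m->mutex_held == 0 && m->cluster_lock_held == 0);

	m->mutex_held = 1;		/* stands in for mutex_lock() */

	if (lock_fails) {		/* stands in for a failed ocfs2_meta_lock() */
		ret = -EIO;
		goto out_mutex;		/* cluster lock was never taken */
	}
	m->cluster_lock_held = 1;

	if (disk_too_old) {
		ret = -EINVAL;
		goto out_unlock;	/* cluster lock is held: must drop it */
	}

	/* ... the actual resize work would go here ... */

out_unlock:
	m->cluster_lock_held = 0;	/* stands in for ocfs2_meta_unlock() */
out_mutex:
	m->mutex_held = 0;		/* stands in for mutex_unlock() */
	return ret;
}
```

Jumping to out_mutex from the "disk too old" branch would skip the first label and leave the cluster lock held, which is the leak described above.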


* [Ocfs2-devel] [PATCH 2/3] Add group extend for online resize, take 2
  2007-11-27  0:23 ` [Ocfs2-devel] [PATCH 2/3] Add group extend " Tao Ma
  2007-11-27 17:34   ` tao.ma
@ 2007-11-30 11:42   ` Mark Fasheh
  1 sibling, 0 replies; 12+ messages in thread
From: Mark Fasheh @ 2007-11-30 11:42 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Nov 27, 2007 at 04:21:35PM +0800, tao.ma wrote:
> Users can do offline resize using tunefs.ocfs2 when a volume isn't
> mounted. Now support for online resize is added to ocfs2.
> 
> Please note that the node where online resize runs must already
> have the volume mounted. We don't mount it behind the user's back, and
> the operation fails if we find it isn't mounted. As for other
> nodes, we don't care whether the volume is mounted or not.
> 
> global_bitmap, super block and all the backups will be updated
> in the kernel. If the update of the super block or a backup fails, we
> just output an error message in dmesg and continue the work.
> 
> The whole process is derived from ext3 and divided into 2 steps:
> 1. If the last group isn't full, tunefs.ocfs2 will call
>    OCFS2_IOC_GROUP_EXTEND first to extend it. All the main work is
>    done in kernel.
> 2. For every new group, tunefs.ocfs2 will call OCFS2_IOC_GROUP_ADD
>    to add them one by one. The new group descriptor is initialized
>    in userspace; we only check it in the kernel and update the
>    global_bitmap, super blocks etc.
> 
> This patch includes the implementation for the 1st step.

Ok, this looks pretty good. Some comments below - the new method of doing
this is much simpler to understand.
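
For readers following along, the two-step flow quoted above amounts to a small piece of arithmetic on the userspace side. The sketch below is only an illustration under stated assumptions: `resize_plan` and `plan_resize()` are hypothetical names (not tunefs.ocfs2 functions), every group except the last is assumed to hold exactly `cpg` clusters, and the current size is assumed to be non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* How a requested growth splits across the two ioctls (model only). */
struct resize_plan {
	uint32_t extend_clusters; /* step 1: added to the last group (GROUP_EXTEND) */
	uint32_t full_groups;     /* step 2: whole new groups (GROUP_ADD each) */
	uint32_t tail_clusters;   /* step 2: clusters in a final partial group, if any */
};

static struct resize_plan plan_resize(uint32_t cur_clusters,
				      uint32_t new_total,
				      uint32_t cpg)
{
	struct resize_plan p = {0, 0, 0};
	uint32_t grow, used_in_last, room;

	assert(cpg > 0 && new_total >= cur_clusters);

	grow = new_total - cur_clusters;

	/* Free room left in the last existing group; 0 if it is already full. */
	used_in_last = cur_clusters % cpg;
	room = used_in_last ? cpg - used_in_last : 0;

	/* Step 1: fill the last group as far as the growth allows. */
	p.extend_clusters = grow < room ? grow : room;
	grow -= p.extend_clusters;

	/* Step 2: everything else goes into new groups, added one by one. */
	p.full_groups = grow / cpg;
	p.tail_clusters = grow % cpg;
	return p;
}
```

For example, growing from 100 to 300 clusters with 64 clusters per group extends the last group by 28 clusters, then adds two full groups and one partial group of 44 clusters.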


> diff --git a/fs/ocfs2/Makefile b/fs/ocfs2/Makefile
> index 9fb8132..ecc58b8 100644
> --- a/fs/ocfs2/Makefile
> +++ b/fs/ocfs2/Makefile
> @@ -28,7 +28,8 @@ ocfs2-objs := \
>  	sysfile.o 		\
>  	uptodate.o		\
>  	ver.o 			\
> -	vote.o
> +	vote.o			\
> +	resize.o

Should stay consistent with the makefile and add this in alphabetical
order...


> diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
> index 87dcece..60698de 100644
> --- a/fs/ocfs2/ioctl.c
> +++ b/fs/ocfs2/ioctl.c
> @@ -115,6 +115,7 @@ int ocfs2_ioctl(struct inode * inode, struct file * filp,
>  	unsigned int cmd, unsigned long arg)
>  {
>  	unsigned int flags;
> +	u32 new_clusters;
>  	int status;
>  	struct ocfs2_space_resv sr;
>  
> @@ -140,6 +141,11 @@ int ocfs2_ioctl(struct inode * inode, struct file * filp,
>  			return -EFAULT;
>  
>  		return ocfs2_change_file_space(filp, cmd, &sr);
> +	case OCFS2_IOC_GROUP_EXTEND:
> +		if (get_user(new_clusters, (__u32 __user *)arg))
> +			return -EFAULT;
> +
> +		return ocfs2_group_extend(inode, new_clusters);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -162,6 +168,7 @@ long ocfs2_compat_ioctl(struct file *file, unsigned cmd, unsigned long arg)
>  	case OCFS2_IOC_RESVSP64:
>  	case OCFS2_IOC_UNRESVSP:
>  	case OCFS2_IOC_UNRESVSP64:
> +	case OCFS2_IOC_GROUP_EXTEND:
>  		break;
>  	default:
>  		return -ENOIOCTLCMD;
> diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
> index 4b32e09..0ba3a42 100644
> --- a/fs/ocfs2/journal.h
> +++ b/fs/ocfs2/journal.h
> @@ -278,6 +278,9 @@ int                  ocfs2_journal_dirty_data(handle_t *handle,
>  /* simple file updates like chmod, etc. */
>  #define OCFS2_INODE_UPDATE_CREDITS 1
>  
> +/* group extend. inode update and last group update. */
> +#define OCFS2_GROUP_EXTEND_CREDITS	(OCFS2_INODE_UPDATE_CREDITS + 1)
> +
>  /* get one bit out of a suballocator: dinode + group descriptor +
>   * prev. group desc. if we relink. */
>  #define OCFS2_SUBALLOC_ALLOC (3)
> diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
> index 60a23e1..8a6925b 100644
> --- a/fs/ocfs2/ocfs2.h
> +++ b/fs/ocfs2/ocfs2.h
> @@ -527,6 +527,11 @@ static inline unsigned int ocfs2_pages_per_cluster(struct super_block *sb)
>  	return pages_per_cluster;
>  }
>  
> +/* given a cluster offset, calculate which block group it belongs to
> + * and return that block offset. */
> +u64 ocfs2_which_cluster_group(struct inode *inode, u32 cluster);
> +int ocfs2_group_extend(struct inode * inode, u32 new_clusters);
> +

You should stick ocfs2_which_cluster_group() in suballoc.c and
ocfs2_group_extend() in resize.h. We've been trying to chip away at ocfs2.h
for years now ;)
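
For anyone following along, here is a rough userspace sketch of the lookup that the quoted comment describes (cluster offset in, group descriptor block out). The struct and the demo_ names are made up for illustration. The layout it assumes (first group descriptor at a fixed block, every later group starting with its own descriptor) is my reading of the on-disk format, not something taken from this patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the few superblock fields the lookup needs. */
struct demo_sb {
	uint32_t cpg;			/* clusters per group */
	uint64_t first_group_blkno;	/* block of the first group descriptor */
	unsigned int c_to_b_bits;	/* log2(blocks per cluster) */
};

/* Map a cluster offset to the block number of its group descriptor. */
uint64_t demo_which_cluster_group(const struct demo_sb *sb, uint32_t cluster)
{
	uint32_t group_no = cluster / sb->cpg;

	/* Assumption: group 0 keeps its descriptor at a fixed block. */
	if (group_no == 0)
		return sb->first_group_blkno;

	/* Assumption: later groups begin with their own descriptor, so the
	 * descriptor block is the first block of the group's first cluster. */
	return ((uint64_t)group_no * sb->cpg) << sb->c_to_b_bits;
}
```

Passing (total clusters - 1), as ocfs2_group_extend() does with first_new_cluster - 1, lands on the last existing group, which is the one to be extended.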


>  #define ocfs2_set_bit ext2_set_bit
>  #define ocfs2_clear_bit ext2_clear_bit
>  #define ocfs2_test_bit ext2_test_bit
> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
> index 6ef8767..4b5813d 100644
> --- a/fs/ocfs2/ocfs2_fs.h
> +++ b/fs/ocfs2/ocfs2_fs.h
> @@ -231,6 +231,8 @@ struct ocfs2_space_resv {
>  #define OCFS2_IOC_RESVSP64	_IOW ('X', 42, struct ocfs2_space_resv)
>  #define OCFS2_IOC_UNRESVSP64	_IOW ('X', 43, struct ocfs2_space_resv)
>  
> +#define OCFS2_IOC_GROUP_EXTEND	_IOW('f', 7, unsigned long)
> +

You also need to add:

#define OCFS2_IOC_GROUP_EXTEND32		_IOW('f', 7, int)

and check for that in ocfs2_compat_ioctl(). Look at how
OCFS2_IOC32_GETFLAGS/OCFS2_IOC32_SETFLAGS is handled.

By the way, this is all for compatibility with 32 bit system calls on 64 bit
kernels, where userspace would be passing the argument as a 32 bit value.


Actually, thinking about this a bit more - why not just make the argument
always an 'int'? That should be 32 bits pretty much everywhere and we
wouldn't need the compat handling for it.
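
To illustrate why the argument type matters: the _IOW() macro folds sizeof(type) into the command number, so the same ('f', 7) pair yields different commands for 'unsigned long' and 'int' wherever those sizes differ. A small userspace sketch, with DEMO_ names that are purely hypothetical and not part of the patch:

```c
#include <linux/ioctl.h>
#include <stddef.h>

/* Hypothetical command numbers mirroring the two candidate definitions. */
#define DEMO_IOC_EXTEND_LONG	_IOW('f', 7, unsigned long)
#define DEMO_IOC_EXTEND_INT	_IOW('f', 7, int)

/* Return the argument size that is encoded in an ioctl command number. */
size_t demo_ioc_arg_size(unsigned int cmd)
{
	return _IOC_SIZE(cmd);
}

/* Nonzero when the two definitions produce different command numbers,
 * i.e. when a 32 bit caller and a 64 bit kernel would disagree. */
int demo_needs_compat_alias(void)
{
	return DEMO_IOC_EXTEND_LONG != DEMO_IOC_EXTEND_INT;
}
```

With an 'int' argument both ABIs encode the same size, which is why the compat alias should become unnecessary.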


> +static void ocfs2_update_super_and_backups(struct inode *inode,
> +					   u32 new_clusters)
> +{
> +	int ret;
> +	u32 clusters = 0;
> +	struct buffer_head *super_bh = NULL;
> +	struct ocfs2_dinode *super_di = NULL;
> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +
> +	/*
> +	 * update the superblock last.
> +	 * It doesn't matter if the write failed.
> +	 */
> +	ret = ocfs2_read_block(osb, OCFS2_SUPER_BLOCK_BLKNO,
> +			       &super_bh, 0, NULL);
> +	if (ret < 0) {
> +		mlog_errno(ret);
> +		goto out;
> +	}
> +
> +	super_di = (struct ocfs2_dinode *)super_bh->b_data;
> +	le32_add_cpu(&super_di->i_clusters, new_clusters);
> +	clusters = le32_to_cpu(super_di->i_clusters);
> +
> +	ret = ocfs2_write_super_or_backup(osb, super_bh);
> +	if (ret < 0) {
> +		mlog_errno(ret);
> +		goto out;
> +	}
> +
> +	if (OCFS2_HAS_COMPAT_FEATURE(osb->sb, OCFS2_FEATURE_COMPAT_BACKUP_SB))
> +		update_backups(inode, clusters, super_bh->b_data);
> +
> +out:
> +	if (super_bh)
> +		brelse(super_bh);

If we exit with an error here, we should probably print to the log so that
the user knows to run fsck.ocfs2. Maybe:

printk(KERN_WARNING "ocfs2: Failed to update super blocks on %s during fs "
"resize. This condition is not fatal, but fsck.ocfs2 should be run to fix "
"it\n", osb->dev_str);


> +int ocfs2_group_extend(struct inode * inode, u32 new_clusters)
> +{
> +	int ret;
> +	handle_t *handle;
> +	struct buffer_head *main_bm_bh = NULL;
> +	struct buffer_head *group_bh = NULL;
> +	struct inode *main_bm_inode = NULL;
> +	struct ocfs2_dinode *fe = NULL;
> +	struct ocfs2_group_desc *group = NULL;
> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +	u16 cl_bpc;
> +	u32 first_new_cluster;
> +	u64 lgd_blkno;
> +
> +	mlog_entry_void();
> +
> +	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
> +		return -EROFS;
> +
> +	if (new_clusters == 0)
> +		return 0;
> +
> +	main_bm_inode = ocfs2_get_system_file_inode(osb,
> +						    GLOBAL_BITMAP_SYSTEM_INODE,
> +						    OCFS2_INVALID_SLOT);
> +	if (!main_bm_inode) {
> +		ret = -EINVAL;
> +		mlog_errno(ret);
> +		goto out;
> +	}
> +
> +	mutex_lock(&main_bm_inode->i_mutex);
> +
> +	ret = ocfs2_meta_lock(main_bm_inode, &main_bm_bh, 1);
> +	if (ret < 0) {
> +		mlog_errno(ret);
> +		goto out_mutex;
> +	}
> +
> +	fe = (struct ocfs2_dinode *)main_bm_bh->b_data;
> +
> +	if (le16_to_cpu(fe->id2.i_chain.cl_cpg) !=
> +				 ocfs2_group_bitmap_size(osb->sb) * 8) {
> +		mlog(0, "The disk is too old. Force to do offline resize.");
> +		ret = -EINVAL;
> +		goto out_mutex;
> +	}
> +
> +	if (!OCFS2_IS_VALID_DINODE(fe)) {
> +		OCFS2_RO_ON_INVALID_DINODE(main_bm_inode->i_sb, fe);
> +		ret = -EIO;
> +		goto out_unlock;
> +	}
> +
> +	first_new_cluster = le32_to_cpu(fe->i_clusters);
> +	lgd_blkno = ocfs2_which_cluster_group(main_bm_inode,
> +					      first_new_cluster - 1);
> +
> +	ret = ocfs2_read_block(osb, lgd_blkno, &group_bh, OCFS2_BH_CACHED,
> +			       main_bm_inode);
> +	if (ret < 0) {
> +		mlog_errno(ret);
> +		goto out_unlock;
> +	}
> +
> +	group = (struct ocfs2_group_desc *) group_bh->b_data;

Right here is probably a good place to call ocfs2_check_group_descriptor().
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


* [Ocfs2-devel] [PATCH 1/3] Add online resize support for ocfs2, take 2
  2007-11-27  0:12 [Ocfs2-devel] [PATCH 1/3] Add online resize support for ocfs2, take 2 Tao Ma
                   ` (2 preceding siblings ...)
  2007-11-27  0:23 ` [Ocfs2-devel] [PATCH 2/3] Add group extend " Tao Ma
@ 2007-11-30 15:43 ` Mark Fasheh
  3 siblings, 0 replies; 12+ messages in thread
From: Mark Fasheh @ 2007-11-30 15:43 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Nov 27, 2007 at 04:11:23PM +0800, tao.ma wrote:
> Modification from V1 to V2:
> Divide the whole processes into 2 steps like ext3.
> 1) If the last group isn't full, tunefs.ocfs2 will call
>    OCFS2_IOC_GROUP_EXTEND first to extend it. All the main work is
>    done in kernel.
> 2) For every new groups, tunefs.ocfs2 will call OCFS2_IOC_GROUP_ADD
>    to add them one by one. The new group descriptor is initialized
>    in userspace, we only check it in the kernel and update the
>    global_bitap, super blocks etc.
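
For reference, the two-step split quoted above can be sketched as plain arithmetic. This is only an illustration of the description, not the tunefs.ocfs2 code. The struct and function names are hypothetical, and it assumes groups are aligned on multiples of the clusters-per-group value.

```c
#include <stdint.h>

/* Hypothetical result: what each of the two steps would have to do. */
struct demo_resize_plan {
	uint32_t extend_clusters;	/* step 1: clusters added to the last group */
	uint32_t new_groups;		/* step 2: number of GROUP_ADD calls */
};

/* Split a resize from cur to target clusters into the two steps. */
struct demo_resize_plan demo_plan_resize(uint32_t cur, uint32_t target,
					 uint32_t cpg)
{
	struct demo_resize_plan plan = { 0, 0 };
	uint32_t delta, used_in_last, room;

	if (cpg == 0 || target <= cur)
		return plan;

	delta = target - cur;
	used_in_last = cur % cpg;
	/* A remainder of 0 means the last group is already full. */
	room = used_in_last ? cpg - used_in_last : 0;

	/* Step 1: fill the last group as far as the request allows. */
	plan.extend_clusters = delta < room ? delta : room;
	delta -= plan.extend_clusters;

	/* Step 2: one new group per cpg clusters, the last possibly partial. */
	plan.new_groups = delta / cpg + (delta % cpg != 0);
	return plan;
}
```

For example, growing from 100 to 200 clusters with 32 clusters per group would extend the last group by 28 clusters and then add 3 groups.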

One more blanket comment about these patches - be sure to run the patched
source through the "sparse" tool. It will help us catch endian bugs when
reading/writing on-disk fields. LWN actually has a good tutorial:

http://lwn.net/Articles/205624/
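
As a reminder of the pattern sparse is checking for: on-disk fields are little-endian, so every read or update has to convert rather than operate on the raw value. Below is a portable userspace sketch of that discipline. The demo_ helpers are hypothetical stand-ins that mimic what le32_to_cpu(), cpu_to_le32() and le32_add_cpu() do in the kernel; they are not the kernel implementations.

```c
#include <stdint.h>

/* Decode a little-endian 32 bit on-disk field into a host integer. */
uint32_t demo_le32_to_cpu(const uint8_t raw[4])
{
	return (uint32_t)raw[0] |
	       ((uint32_t)raw[1] << 8) |
	       ((uint32_t)raw[2] << 16) |
	       ((uint32_t)raw[3] << 24);
}

/* Encode a host integer into a little-endian 32 bit on-disk field. */
void demo_cpu_to_le32(uint8_t raw[4], uint32_t v)
{
	raw[0] = (uint8_t)(v & 0xff);
	raw[1] = (uint8_t)((v >> 8) & 0xff);
	raw[2] = (uint8_t)((v >> 16) & 0xff);
	raw[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Add to an on-disk field: convert, do the arithmetic in host order,
 * convert back. Adding to the raw bytes directly would only happen to
 * work on little-endian hosts. */
void demo_le32_add_cpu(uint8_t raw[4], uint32_t delta)
{
	demo_cpu_to_le32(raw, demo_le32_to_cpu(raw) + delta);
}
```

In the kernel the on-disk fields are typed __le32 and friends, and sparse warns when one is mixed with a plain integer without going through a conversion helper.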
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
