public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Implement shrink of empty AGs
@ 2007-06-10 16:40 Iustin Pop
  2007-06-12  2:40 ` David Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Iustin Pop @ 2007-06-10 16:40 UTC (permalink / raw)
  To: xfs


[-- Attachment #1.1: Type: text/plain, Size: 1412 bytes --]

The attached patch implements shrinking of completely empty allocation
groups. The patch is against current CVS and modifies two files:
  - xfs_trans.c to remove two asserts in which prevent lowering the
    number of AGs or filesystem blocks;
  - xfs_fsops.c where it does:
    - modify xfs_growfs_data() to branch to either
      xfs_growfs_data_private or xfs_shrinkfs_data private depending on
      the new size of the fs
    - abstract the last part of xfs_growfs_data_private (the modify of
      all the superblocks) into a separate function, xfs_update_sb(),
      which is called both from shrink and grow
    - add the new xfs_shrinkfs_data_private function, mostly based on
      the growfs function

There are many printk()'s left in the patch, I left them as they show
where I compute some important values. There are also many FIXMEs in the
comments showing what parts I didn't understand or was not sure about
(not that these are the only ones...). Probably for a real patch,
xfs-specific debug hooks need to be added and the printk()s removed.

The patch works on UML and QEMU virtual machines, both in UP and SMP. I
just tested many shrink/grow operations and verified with xfs_repair
that the fs is not corrupted. The free space counters seem to be correct
after shrink.

Note that you also need to remove the check from xfs_growfs.c of not
allowing to shrink the filesystem.

regards,
iustin

[-- Attachment #1.2: patch-nice-4 --]
[-- Type: text/plain, Size: 11582 bytes --]

diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c
--- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c	2007-06-09 18:56:21.509308225 +0200
+++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c	2007-06-10 18:32:36.074856477 +0200
@@ -112,6 +112,53 @@
 	return 0;
 }
 
+static void xfs_update_sb(
+	xfs_mount_t		*mp,		/* mount point for filesystem */
+	xfs_agnumber_t		nagimax,
+	xfs_agnumber_t		nagcount)       /* new number of a.g. */
+{
+	xfs_agnumber_t		agno;
+	xfs_buf_t		*bp;
+	xfs_sb_t		*sbp;
+	int			error;
+
+	/* New allocation groups fully initialized, so update mount struct */
+	if (nagimax)
+		mp->m_maxagi = nagimax;
+	if (mp->m_sb.sb_imax_pct) {
+		__uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
+		do_div(icount, 100);
+		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
+	} else
+		mp->m_maxicount = 0;
+	for (agno = 1; agno < nagcount; agno++) {
+		error = xfs_read_buf(mp, mp->m_ddev_targp,
+				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
+				  XFS_FSS_TO_BB(mp, 1), 0, &bp);
+		if (error) {
+			xfs_fs_cmn_err(CE_WARN, mp,
+			"error %d reading secondary superblock for ag %d",
+				error, agno);
+			break;
+		}
+		sbp = XFS_BUF_TO_SBP(bp);
+		xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);
+		/*
+		 * If we get an error writing out the alternate superblocks,
+		 * just issue a warning and continue.  The real work is
+		 * already done and committed.
+		 */
+		if (!(error = xfs_bwrite(mp, bp))) {
+			continue;
+		} else {
+			xfs_fs_cmn_err(CE_WARN, mp,
+		"write error %d updating secondary superblock for ag %d",
+				error, agno);
+			break; /* no point in continuing */
+		}
+	}
+}
+
 static int
 xfs_growfs_data_private(
 	xfs_mount_t		*mp,		/* mount point for filesystem */
@@ -135,7 +182,6 @@
 	xfs_rfsblock_t		nfree;
 	xfs_agnumber_t		oagcount;
 	int			pct;
-	xfs_sb_t		*sbp;
 	xfs_trans_t		*tp;
 
 	nb = in->newblocks;
@@ -356,44 +402,228 @@
 	if (error) {
 		return error;
 	}
-	/* New allocation groups fully initialized, so update mount struct */
-	if (nagimax)
-		mp->m_maxagi = nagimax;
-	if (mp->m_sb.sb_imax_pct) {
-		__uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
-		do_div(icount, 100);
-		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
-	} else
-		mp->m_maxicount = 0;
-	for (agno = 1; agno < nagcount; agno++) {
-		error = xfs_read_buf(mp, mp->m_ddev_targp,
-				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
-				  XFS_FSS_TO_BB(mp, 1), 0, &bp);
+	xfs_update_sb(mp, nagimax, nagcount);
+	return 0;
+
+ error0:
+	xfs_trans_cancel(tp, XFS_TRANS_ABORT);
+	return error;
+}
+
+static int
+xfs_shrinkfs_data_private(
+			  xfs_mount_t		*mp,		/* mount point for filesystem */
+			  xfs_growfs_data_t	*in)		/* growfs data input struct */
+{
+	xfs_agf_t		*agf;
+	xfs_agnumber_t		agno;
+	xfs_buf_t		*bp;
+	int			dpct;
+	int			error;
+	xfs_agnumber_t		nagcount;       /* new AG count */
+	xfs_agnumber_t		oagcount;       /* old AG count */
+	xfs_agnumber_t		nagimax = 0;
+	xfs_rfsblock_t		nb, nb_mod;
+	xfs_rfsblock_t		dbdelta;        /* will be used as a
+						   check that we
+						   shrink the fs by
+						   the correct number
+						   of blocks */
+	xfs_rfsblock_t		fdbdelta;       /* will keep track of
+						   how many ag blocks
+						   we need to
+						   remove */
+	int			pct;
+	xfs_trans_t		*tp;
+
+	nb = in->newblocks;
+	pct = in->imaxpct;
+	if (nb >= mp->m_sb.sb_dblocks || pct < 0 || pct > 100)
+		return XFS_ERROR(EINVAL);
+	dpct = pct - mp->m_sb.sb_imax_pct;
+	error = xfs_read_buf(mp, mp->m_ddev_targp,
+			     XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
+			     XFS_FSS_TO_BB(mp, 1), 0, &bp);
+	if (error)
+		return error;
+	ASSERT(bp);
+	/* FIXME: we release the buffer here manually because we are
+	 * outside of a transaction? The other buffers read using the
+	 * functions which take a tp parameter are not released in
+	 * growfs
+	 */
+	xfs_buf_relse(bp);
+
+	/* Do basic checks (at the fs level) */
+	oagcount = mp->m_sb.sb_agcount;
+	nagcount = nb;
+	nb_mod = do_div(nagcount, mp->m_sb.sb_agblocks);
+	if(nb_mod) {
+		printk("not shrinking on an AG boundary (diff=%d)\n", nb_mod);
+		return XFS_ERROR(ENOSPC);
+	}
+	if(nagcount < 2) {
+		printk("refusing to shrink below 2 AGs\n");
+		return XFS_ERROR(ENOSPC);
+	}
+	if(nagcount >= oagcount) {
+		printk("number of AGs will not decrease\n");
+		return XFS_ERROR(EINVAL);
+	}
+	printk("Cur ag=%d, cur blocks=%llu\n",
+	       mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
+	printk("New ag=%d, new blocks=%d\n", nagcount, nb);
+
+	printk("Will resize from %llu to %d, delta is %llu\n",
+               mp->m_sb.sb_dblocks, nb, mp->m_sb.sb_dblocks - nb);
+	/* Check to see if we trip over the log section */
+	printk("logstart=%llu logblocks=%u\n",
+               mp->m_sb.sb_logstart, mp->m_sb.sb_logblocks);
+	if (nb < mp->m_sb.sb_logstart + mp->m_sb.sb_logblocks)
+		return XFS_ERROR(EINVAL);
+	/* dbdelta starts at the diff and must become zero */
+	dbdelta = mp->m_sb.sb_dblocks - nb;
+	tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS);
+	printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta);
+	if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta,
+				       XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) {
+		xfs_trans_cancel(tp, 0);
+		return error;
+	}
+
+	fdbdelta = 0;
+
+	/* Per-AG checks */
+	/* FIXME: do we need to hold m_peraglock while doing this? */
+	/* I think that since we do read and write to the m_perag
+	 * stuff, we should be holding the lock for the entire walk &
+	 * modify of the fs
+	 */
+	/* Note that because we hold the lock, on any error+early
+	 *  return, we must either release manually and return, or
+	 *  jump to error0
+	 */
+	down_write(&mp->m_peraglock);
+	for(agno = oagcount - 1; agno >= nagcount; agno--) {
+		xfs_extlen_t	usedblks; /* total used blocks in this a.g. */
+		xfs_extlen_t	freeblks; /* free blocks in this a.g. */
+		xfs_agblock_t	aglen;    /* this ag's len */
+		struct xfs_perag *pag;  /* the m_perag structure */
+
+		printk("doing agno=%d\n", agno);
+
+		pag = &mp->m_perag[agno];
+
+		error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp);
 		if (error) {
-			xfs_fs_cmn_err(CE_WARN, mp,
-			"error %d reading secondary superblock for ag %d",
-				error, agno);
-			break;
+			goto error0;
 		}
-		sbp = XFS_BUF_TO_SBP(bp);
-		xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);
+		ASSERT(bp);
+		agf = XFS_BUF_TO_AGF(bp);
+		aglen = INT_GET(agf->agf_length, ARCH_CONVERT);
+
+		/* read the pagf/pagi if not already initialized */
+		/* agf should be initialized because of the ablove read_agf */
+		ASSERT(pag->pagf_init);
+		if (!pag->pagi_init) {
+			if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bp)))
+				goto error0;
+			ASSERT(pag->pagi_init);
+		}
+
 		/*
-		 * If we get an error writing out the alternate superblocks,
-		 * just issue a warning and continue.  The real work is
-		 * already done and committed.
+		 * Check the inodes: as long as we have pagi_count ==
+		 * pagi_freecount == 0, then: a) we don't have to
+		 * update any global inode counters, and b) there are
+		 * no extra blocks in inode btrees
 		 */
-		if (!(error = xfs_bwrite(mp, bp))) {
-			continue;
-		} else {
-			xfs_fs_cmn_err(CE_WARN, mp,
-		"write error %d updating secondary superblock for ag %d",
-				error, agno);
-			break; /* no point in continuing */
+		if(pag->pagi_count > 0 ||
+		   pag->pagi_freecount > 0) {
+			printk("agi %d has %d inodes in total and %d free\n",
+			       agno, pag->pagi_count, pag->pagi_freecount);
+			error = XFS_ERROR(ENOSPC);
+			goto error0;
+		}
+
+		/* Check the AGF: if levels[] == 1, then there should
+		 *  be no extra blocks in the btrees beyond the ones
+		 *  at the beggining of the AG
+		 */
+		if(pag->pagf_levels[XFS_BTNUM_BNOi] > 1 ||
+		   pag->pagf_levels[XFS_BTNUM_CNTi] > 1) {
+			printk("agf %d has level %d bt and %d cnt\n",
+			       agno,
+			       pag->pagf_levels[XFS_BTNUM_BNOi],
+			       pag->pagf_levels[XFS_BTNUM_CNTi]);
+			error = XFS_ERROR(ENOSPC);
+			goto error0;
 		}
+
+		freeblks = pag->pagf_freeblks;
+		printk("Usage: %d prealloc, %d flcount\n",
+		       XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount);
+
+		/* Done gathering data, check sizes */
+		usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount;
+		printk("agno=%d agf_length=%d computed used=%d"
+		       " known free=%d\n", agno, aglen, usedblks, freeblks);
+
+		if(usedblks + freeblks != aglen) {
+			printk("agno %d is not free (%d blocks allocated)\n",
+			       agno, aglen-usedblks-freeblks);
+			error = XFS_ERROR(ENOSPC);
+			goto error0;
+		}
+		dbdelta -= aglen;
+		printk("will lower with %d\n",
+		       aglen - XFS_PREALLOC_BLOCKS(mp));
+		fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp);
+	}
+	/*
+	 * Check that we removed all blocks
+	 */
+	ASSERT(!dbdelta);
+	ASSERT(nagcount < oagcount);
+
+	printk("to free: %d, oagcount=%d, nagcount=%d\n",
+	       fdbdelta, oagcount, nagcount);
+
+	xfs_trans_agblocks_delta(tp, -((long)fdbdelta));
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, nb - mp->m_sb.sb_dblocks);
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, -((int64_t)fdbdelta));
+
+	if (dpct)
+		xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
+	error = xfs_trans_commit(tp, 0);
+	if (error) {
+		up_write(&mp->m_peraglock);
+		return error;
 	}
+	/* Free memory as the number of AG has changed */
+	for (agno = nagcount; agno < oagcount; agno++)
+		if (mp->m_perag[agno].pagb_list)
+			kmem_free(mp->m_perag[agno].pagb_list,
+				  sizeof(xfs_perag_busy_t) *
+				  XFS_PAGB_NUM_SLOTS);
+
+	mp->m_perag = kmem_realloc(mp->m_perag,
+				   sizeof(xfs_perag_t) * nagcount,
+				   sizeof(xfs_perag_t) * oagcount,
+				   KM_SLEEP);
+	/* FIXME: here we could instead just lower
+	 * nagimax to nagcount; is it better this way?
+	 */
+	/* FIXME: why is this flag unconditionally set in growfs? */
+	mp->m_flags |= XFS_MOUNT_32BITINODES;
+	nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount);
+	up_write(&mp->m_peraglock);
+
+	xfs_update_sb(mp, nagimax, nagcount);
 	return 0;
 
  error0:
+	up_write(&mp->m_peraglock);
 	xfs_trans_cancel(tp, XFS_TRANS_ABORT);
 	return error;
 }
@@ -435,7 +665,10 @@
 	int error;
 	if (!cpsema(&mp->m_growlock))
 		return XFS_ERROR(EWOULDBLOCK);
-	error = xfs_growfs_data_private(mp, in);
+	if(in->newblocks < mp->m_sb.sb_dblocks)
+		error = xfs_shrinkfs_data_private(mp, in);
+	else
+		error = xfs_growfs_data_private(mp, in);
 	vsema(&mp->m_growlock);
 	return error;
 }
@@ -633,7 +866,7 @@
 			xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
 			thaw_bdev(sb->s_bdev, sb);
 		}
-	
+
 		break;
 	}
 	case XFS_FSOP_GOING_FLAGS_LOGFLUSH:
diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_trans.c linux-2.6-xfs.shrink/fs/xfs/xfs_trans.c
--- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_trans.c	2007-06-05 17:40:51.000000000 +0200
+++ linux-2.6-xfs.shrink/fs/xfs/xfs_trans.c	2007-06-07 23:01:03.000000000 +0200
@@ -503,11 +503,9 @@
 		tp->t_res_frextents_delta += delta;
 		break;
 	case XFS_TRANS_SB_DBLOCKS:
-		ASSERT(delta > 0);
 		tp->t_dblocks_delta += delta;
 		break;
 	case XFS_TRANS_SB_AGCOUNT:
-		ASSERT(delta > 0);
 		tp->t_agcount_delta += delta;
 		break;
 	case XFS_TRANS_SB_IMAXPCT:

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Implement shrink of empty AGs
  2007-06-10 16:40 [PATCH] Implement shrink of empty AGs Iustin Pop
@ 2007-06-12  2:40 ` David Chinner
  2007-06-12  4:25   ` Eric Sandeen
  2007-06-14  6:01   ` Iustin Pop
  0 siblings, 2 replies; 7+ messages in thread
From: David Chinner @ 2007-06-12  2:40 UTC (permalink / raw)
  To: xfs; +Cc: iusty

On Sun, Jun 10, 2007 at 06:40:14PM +0200, Iustin Pop wrote:
> The attached patch implements shrinking of completely empty allocation
> groups. The patch is against current CVS and modifies two files:
>   - xfs_trans.c to remove two asserts in which prevent lowering the
>     number of AGs or filesystem blocks;
>   - xfs_fsops.c where it does:
>     - modify xfs_growfs_data() to branch to either
>       xfs_growfs_data_private or xfs_shrinkfs_data private depending on
>       the new size of the fs
>     - abstract the last part of xfs_growfs_data_private (the modify of
>       all the superblocks) into a separate function, xfs_update_sb(),
>       which is called both from shrink and grow
>     - add the new xfs_shrinkfs_data_private function, mostly based on
>       the growfs function

comments are all inline....

> 
> There are many printk()'s left in the patch, I left them as they show
> where I compute some important values. There are also many FIXMEs in the
> comments showing what parts I didn't understand or was not sure about
> (not that these are the only ones...). Probably for a real patch,
> xfs-specific debug hooks need to be added and the printk()s removed.
> 
> The patch works on UML and QEMU virtual machines, both in UP and SMP. I
> just tested many shrink/grow operations and verified with xfs_repair
> that the fs is not corrupted. The free space counters seem to be correct
> after shrink.
> 
> Note that you also need to remove the check from xfs_growfs.c of not
> allowing to shrink the filesystem.
> 
> regards,
> iustin

> diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c
> --- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c	2007-06-09 18:56:21.509308225 +0200
> +++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c	2007-06-10 18:32:36.074856477 +0200
> @@ -112,6 +112,53 @@
>  	return 0;
>  }
>  
> +static void xfs_update_sb(

STATIC void
xfs_growfs_update_sb(

> +	xfs_mount_t		*mp,		/* mount point for filesystem */
> +	xfs_agnumber_t		nagimax,
> +	xfs_agnumber_t		nagcount)       /* new number of a.g. */

tabs, not spaces (and tabs are 8 spaces).

> +{
> +	xfs_agnumber_t		agno;
> +	xfs_buf_t		*bp;
> +	xfs_sb_t		*sbp;
> +	int			error;
> +
> +	/* New allocation groups fully initialized, so update mount struct */
> +	if (nagimax)
> +		mp->m_maxagi = nagimax;
> +	if (mp->m_sb.sb_imax_pct) {
> +		__uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;

I'd prefer to have long lines like this split.

> +		do_div(icount, 100);
> +		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
> +	} else
> +		mp->m_maxicount = 0;

Insert empty line.

> +	for (agno = 1; agno < nagcount; agno++) {
> +		error = xfs_read_buf(mp, mp->m_ddev_targp,
> +				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
> +				  XFS_FSS_TO_BB(mp, 1), 0, &bp);
> +		if (error) {
> +			xfs_fs_cmn_err(CE_WARN, mp,
> +			"error %d reading secondary superblock for ag %d",
> +				error, agno);
> +			break;
> +		}
> +		sbp = XFS_BUF_TO_SBP(bp);
> +		xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);

Insert empty line

> +		/*
> +		 * If we get an error writing out the alternate superblocks,
> +		 * just issue a warning and continue.  The real work is
> +		 * already done and committed.
> +		 */
> +		if (!(error = xfs_bwrite(mp, bp))) {
> +			continue;
> +		} else {
> +			xfs_fs_cmn_err(CE_WARN, mp,
> +		"write error %d updating secondary superblock for ag %d",
> +				error, agno);
> +			break; /* no point in continuing */
> +		}
> +	}

		error = xfs_bwrite(mp, bp);
		if (error) {
			xfs_fs_cmn_err(...)
			break;
		}
	}
> +}
> +
>  static int
>  xfs_growfs_data_private(
>  	xfs_mount_t		*mp,		/* mount point for filesystem */
> @@ -135,7 +182,6 @@
>  	xfs_rfsblock_t		nfree;
>  	xfs_agnumber_t		oagcount;
>  	int			pct;
> -	xfs_sb_t		*sbp;
>  	xfs_trans_t		*tp;
>  
>  	nb = in->newblocks;
> @@ -356,44 +402,228 @@
>  	if (error) {
>  		return error;
>  	}
> -	/* New allocation groups fully initialized, so update mount struct */
> -	if (nagimax)
> -		mp->m_maxagi = nagimax;
> -	if (mp->m_sb.sb_imax_pct) {
> -		__uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
> -		do_div(icount, 100);
> -		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
> -	} else
> -		mp->m_maxicount = 0;
> -	for (agno = 1; agno < nagcount; agno++) {
> -		error = xfs_read_buf(mp, mp->m_ddev_targp,
> -				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
> -				  XFS_FSS_TO_BB(mp, 1), 0, &bp);
> +	xfs_update_sb(mp, nagimax, nagcount);
> +	return 0;
> +
> + error0:
> +	xfs_trans_cancel(tp, XFS_TRANS_ABORT);
> +	return error;
> +}
> +
> +static int

STATIC int

> +xfs_shrinkfs_data_private(
> +			  xfs_mount_t		*mp,		/* mount point for filesystem */
> +			  xfs_growfs_data_t	*in)		/* growfs data input struct */

whitespace issues

> +{
> +	xfs_agf_t		*agf;
> +	xfs_agnumber_t		agno;
> +	xfs_buf_t		*bp;
> +	int			dpct;
> +	int			error;
> +	xfs_agnumber_t		nagcount;       /* new AG count */
> +	xfs_agnumber_t		oagcount;       /* old AG count */
> +	xfs_agnumber_t		nagimax = 0;
> +	xfs_rfsblock_t		nb, nb_mod;
> +	xfs_rfsblock_t		dbdelta;        /* will be used as a
> +						   check that we
> +						   shrink the fs by
> +						   the correct number
> +						   of blocks */
> +	xfs_rfsblock_t		fdbdelta;       /* will keep track of
> +						   how many ag blocks
> +						   we need to
> +						   remove */

Long comments like this don't go on the declaration. Put them where
the variable is initialised or first used.

> +	int			pct;
> +	xfs_trans_t		*tp;
> +
> +	nb = in->newblocks;
> +	pct = in->imaxpct;
> +	if (nb >= mp->m_sb.sb_dblocks || pct < 0 || pct > 100)
> +		return XFS_ERROR(EINVAL);
> +	dpct = pct - mp->m_sb.sb_imax_pct;

This next bit:

> +	error = xfs_read_buf(mp, mp->m_ddev_targp,
> +			     XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
> +			     XFS_FSS_TO_BB(mp, 1), 0, &bp);
> +	if (error)
> +		return error;
> +	ASSERT(bp);
> +	/* FIXME: we release the buffer here manually because we are
> +	 * outside of a transaction? The other buffers read using the
> +	 * functions which take a tp parameter are not released in
> +	 * growfs
> +	 */
> +	xfs_buf_relse(bp);

Should not be necessary - we don't need to check if the new last
filesystem block is beyond EOF because we are shrinking....

To answer the FIXME - xfs_trans_commit() releases locked buffers and
inodes that have been joined ot the transaction unless they have
also been held. So if you are outside a transaction, you do have to
ensure you release any buffers you read.

> +	/* Do basic checks (at the fs level) */
> +	oagcount = mp->m_sb.sb_agcount;
> +	nagcount = nb;
> +	nb_mod = do_div(nagcount, mp->m_sb.sb_agblocks);
> +	if(nb_mod) {
> +		printk("not shrinking on an AG boundary (diff=%d)\n", nb_mod);
> +		return XFS_ERROR(ENOSPC);

EINVAL, I think.

> +	}
> +	if(nagcount < 2) {
> +		printk("refusing to shrink below 2 AGs\n");
> +		return XFS_ERROR(ENOSPC);

EINVAL.

> +	}
> +	if(nagcount >= oagcount) {
> +		printk("number of AGs will not decrease\n");
> +		return XFS_ERROR(EINVAL);
> +	}
> +	printk("Cur ag=%d, cur blocks=%llu\n",
> +	       mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
> +	printk("New ag=%d, new blocks=%d\n", nagcount, nb);
> +
> +	printk("Will resize from %llu to %d, delta is %llu\n",
> +               mp->m_sb.sb_dblocks, nb, mp->m_sb.sb_dblocks - nb);
> +	/* Check to see if we trip over the log section */
> +	printk("logstart=%llu logblocks=%u\n",
> +               mp->m_sb.sb_logstart, mp->m_sb.sb_logblocks);
> +	if (nb < mp->m_sb.sb_logstart + mp->m_sb.sb_logblocks)
> +		return XFS_ERROR(EINVAL);

Insert empty line

> +	/* dbdelta starts at the diff and must become zero */
> +	dbdelta = mp->m_sb.sb_dblocks - nb;
> +	tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS);
> +	printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta);
> +	if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta,
> +				       XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) {
> +		xfs_trans_cancel(tp, 0);
> +		return error;
> +	}

What's the dbdelta part of the reservation for? That's reserving dbdelta
blocks for *allocations*, so I don't think this is right....

> +
> +	fdbdelta = 0;
> +
> +	/* Per-AG checks */
> +	/* FIXME: do we need to hold m_peraglock while doing this? */

Yes.

> +	/* I think that since we do read and write to the m_perag
> +	 * stuff, we should be holding the lock for the entire walk &
> +	 * modify of the fs
> +	 */

Deadlock warning! holding the m_peraglock in write mode will cause
allocation deadlocks if you are not careful as all allocation/free
operations take the m_peraglock in read mode. (And yes, growing
an active, loaded filesystem can deadlock because of this.)

> +	/* Note that because we hold the lock, on any error+early
> +	 *  return, we must either release manually and return, or
> +	 *  jump to error0
> +	 */

whitespace.

> +	down_write(&mp->m_peraglock);
> +	for(agno = oagcount - 1; agno >= nagcount; agno--) {
> +		xfs_extlen_t	usedblks; /* total used blocks in this a.g. */
> +		xfs_extlen_t	freeblks; /* free blocks in this a.g. */
> +		xfs_agblock_t	aglen;    /* this ag's len */
> +		struct xfs_perag *pag;  /* the m_perag structure */
> +
> +		printk("doing agno=%d\n", agno);
> +
> +		pag = &mp->m_perag[agno];
> +
> +		error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp);
>  		if (error) {
> -			xfs_fs_cmn_err(CE_WARN, mp,
> -			"error %d reading secondary superblock for ag %d",
> -				error, agno);
> -			break;
> +			goto error0;
>  		}
> -		sbp = XFS_BUF_TO_SBP(bp);
> -		xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);
> +		ASSERT(bp);
> +		agf = XFS_BUF_TO_AGF(bp);
> +		aglen = INT_GET(agf->agf_length, ARCH_CONVERT);
> +
> +		/* read the pagf/pagi if not already initialized */
> +		/* agf should be initialized because of the ablove read_agf */
> +		ASSERT(pag->pagf_init);
> +		if (!pag->pagi_init) {
> +			if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bp)))
> +				goto error0;

I don't think you should be overwriting bp here....

> +			ASSERT(pag->pagi_init);
> +		}
> +

Because now you have bp potentially pointing to two different buffers.

>  		/*
> -		 * If we get an error writing out the alternate superblocks,
> -		 * just issue a warning and continue.  The real work is
> -		 * already done and committed.
> +		 * Check the inodes: as long as we have pagi_count ==
> +		 * pagi_freecount == 0, then: a) we don't have to
> +		 * update any global inode counters, and b) there are
> +		 * no extra blocks in inode btrees
>  		 */
> -		if (!(error = xfs_bwrite(mp, bp))) {
> -			continue;
> -		} else {
> -			xfs_fs_cmn_err(CE_WARN, mp,
> -		"write error %d updating secondary superblock for ag %d",
> -				error, agno);
> -			break; /* no point in continuing */
> +		if(pag->pagi_count > 0 ||
> +		   pag->pagi_freecount > 0) {
> +			printk("agi %d has %d inodes in total and %d free\n",
> +			       agno, pag->pagi_count, pag->pagi_freecount);
> +			error = XFS_ERROR(ENOSPC);
> +			goto error0;
> +		}
> +
> +		/* Check the AGF: if levels[] == 1, then there should
> +		 *  be no extra blocks in the btrees beyond the ones
> +		 *  at the beggining of the AG
> +		 */
> +		if(pag->pagf_levels[XFS_BTNUM_BNOi] > 1 ||
> +		   pag->pagf_levels[XFS_BTNUM_CNTi] > 1) {
> +			printk("agf %d has level %d bt and %d cnt\n",
> +			       agno,
> +			       pag->pagf_levels[XFS_BTNUM_BNOi],
> +			       pag->pagf_levels[XFS_BTNUM_CNTi]);
> +			error = XFS_ERROR(ENOSPC);
> +			goto error0;
>  		}

ok, so we have empty ag's here. You might want to check that the
inode btree is empty and that the AGI unlinked list is empty.

> +
> +		freeblks = pag->pagf_freeblks;
> +		printk("Usage: %d prealloc, %d flcount\n",
> +		       XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount);
> +
> +		/* Done gathering data, check sizes */
> +		usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount;
> +		printk("agno=%d agf_length=%d computed used=%d"
> +		       " known free=%d\n", agno, aglen, usedblks, freeblks);
> +
> +		if(usedblks + freeblks != aglen) {
> +			printk("agno %d is not free (%d blocks allocated)\n",
> +			       agno, aglen-usedblks-freeblks);
> +			error = XFS_ERROR(ENOSPC);
> +			goto error0;
> +		}
> +		dbdelta -= aglen;
> +		printk("will lower with %d\n",
> +		       aglen - XFS_PREALLOC_BLOCKS(mp));
> +		fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp);

Ok, so why not just

	fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp);

> +	}
> +	/*
> +	 * Check that we removed all blocks
> +	 */
> +	ASSERT(!dbdelta);
> +	ASSERT(nagcount < oagcount);

Error out, not assert, because at this point we have not changed anything.

> +
> +	printk("to free: %d, oagcount=%d, nagcount=%d\n",
> +	       fdbdelta, oagcount, nagcount);
> +
> +	xfs_trans_agblocks_delta(tp, -((long)fdbdelta));
> +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
> +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, nb - mp->m_sb.sb_dblocks);
> +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, -((int64_t)fdbdelta));
> +
> +	if (dpct)
> +		xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
> +	error = xfs_trans_commit(tp, 0);
> +	if (error) {
> +		up_write(&mp->m_peraglock);
> +		return error;
>  	}
> +	/* Free memory as the number of AG has changed */
> +	for (agno = nagcount; agno < oagcount; agno++)
> +		if (mp->m_perag[agno].pagb_list)
> +			kmem_free(mp->m_perag[agno].pagb_list,
> +				  sizeof(xfs_perag_busy_t) *
> +				  XFS_PAGB_NUM_SLOTS);
> +
> +	mp->m_perag = kmem_realloc(mp->m_perag,
> +				   sizeof(xfs_perag_t) * nagcount,
> +				   sizeof(xfs_perag_t) * oagcount,
> +				   KM_SLEEP);

This is not really safe - how do we know if all the users of the
higher AGs have gone away? I think we should simply just zero out
the unused AGs and don't worry about a realloc().

> +	/* FIXME: here we could instead just lower
> +	 * nagimax to nagcount; is it better this way?
> +	 */

Not really.

> +	/* FIXME: why is this flag unconditionally set in growfs? */
> +	mp->m_flags |= XFS_MOUNT_32BITINODES;

good question. I don't think it should be there but I'll have to
do some digging....

> +	nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount);
> +	up_write(&mp->m_peraglock);
> +
> +	xfs_update_sb(mp, nagimax, nagcount);
>  	return 0;
>  
>   error0:
> +	up_write(&mp->m_peraglock);
>  	xfs_trans_cancel(tp, XFS_TRANS_ABORT);
>  	return error;
>  }
> @@ -435,7 +665,10 @@
>  	int error;
>  	if (!cpsema(&mp->m_growlock))
>  		return XFS_ERROR(EWOULDBLOCK);
> -	error = xfs_growfs_data_private(mp, in);
> +	if(in->newblocks < mp->m_sb.sb_dblocks)
> +		error = xfs_shrinkfs_data_private(mp, in);
> +	else
> +		error = xfs_growfs_data_private(mp, in);

Hmmm - that's using the one ioctl to do both grow and shrink. I'd
prefer a new shrink ioctl rather than changing the behaviour of an
existing ioctl.

Looks like a good start ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Implement shrink of empty AGs
  2007-06-12  2:40 ` David Chinner
@ 2007-06-12  4:25   ` Eric Sandeen
  2007-06-14  6:01   ` Iustin Pop
  1 sibling, 0 replies; 7+ messages in thread
From: Eric Sandeen @ 2007-06-12  4:25 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, iusty

David Chinner wrote:
> On Sun, Jun 10, 2007 at 06:40:14PM +0200, Iustin Pop wrote:

...

>> +	/* FIXME: why is this flag unconditionally set in growfs? */
>> +	mp->m_flags |= XFS_MOUNT_32BITINODES;
>> +	nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount);

> good question. I don't think it should be there but I'll have to
> do some digging....


http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_fsops.c#rev1.72
Thu Dec 6 19:26:09 2001 UTC (5 years, 6 months ago) by lord
Add in the 32 bit inode mount flag before re initializing the perag
structures in growfs.
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_fsops.c.diff?r1=1.71;r2=1.72

but, it seems harmless because it immediately calls xfs_initialize_perag
which does:


        /* Clear the mount flag if no inode can overflow 32 bits
         * on this filesystem, or if specifically requested..
         */
        if ((mp->m_flags & XFS_MOUNT_32BITINOOPT) && ino > max_inum) {
                mp->m_flags |= XFS_MOUNT_32BITINODES;
        } else {
                mp->m_flags &= ~XFS_MOUNT_32BITINODES;
        }

so I think it sets it (or clears it) properly in any case.  I'd probably
remove the setting before initialize_perag though as it's superfluous...

that was added after steve's change...

http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.c#rev1.335
Mon Sep 8 05:46:42 2003 UTC (3 years, 9 months ago) by nathans
Add inode64 mount option; fix case where growfs can push 32 bit inodes
into 64 bit space accidentally - both changes originally from IRIX
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.c.diff?r1=1.334;r2=1.335

(previously it would always clear the flag if max inode was < 32 bits..)

... so yeah, looks like the setting in question can/should go.

-Eric

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Implement shrink of empty AGs
  2007-06-12  2:40 ` David Chinner
  2007-06-12  4:25   ` Eric Sandeen
@ 2007-06-14  6:01   ` Iustin Pop
  2007-06-14  9:00     ` David Chinner
  1 sibling, 1 reply; 7+ messages in thread
From: Iustin Pop @ 2007-06-14  6:01 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 17332 bytes --]

On Tue, Jun 12, 2007 at 12:40:25PM +1000, David Chinner wrote:
> > diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c
> > --- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c	2007-06-09 18:56:21.509308225 +0200
> > +++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c	2007-06-10 18:32:36.074856477 +0200
> > @@ -112,6 +112,53 @@
> >  	return 0;
> >  }
> >  
> > +static void xfs_update_sb(
> 
> STATIC void
> xfs_growfs_update_sb(
this was because xfs_growfs_private is also static and not STATIC.
Should I change the def for it also?

> > +	xfs_mount_t		*mp,		/* mount point for filesystem */
> > +	xfs_agnumber_t		nagimax,
> > +	xfs_agnumber_t		nagcount)       /* new number of a.g. */
> 
> tabs, not spaces (and tabs are 8 spaces).
sorry, I though I got all of these. There are some more in the def of
xfs_reserve_blocks, btw.

> > +{
> > +	xfs_agnumber_t		agno;
> > +	xfs_buf_t		*bp;
> > +	xfs_sb_t		*sbp;
> > +	int			error;
> > +
> > +	/* New allocation groups fully initialized, so update mount struct */
> > +	if (nagimax)
> > +		mp->m_maxagi = nagimax;
> > +	if (mp->m_sb.sb_imax_pct) {
> > +		__uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
> 
> I'd prefer to have long lines like this split.
Ok, I hope I split them ok.

> > +		do_div(icount, 100);
> > +		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
> > +	} else
> > +		mp->m_maxicount = 0;
> 
> Insert empty line.
done.

> > +	for (agno = 1; agno < nagcount; agno++) {
> > +		error = xfs_read_buf(mp, mp->m_ddev_targp,
> > +				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
> > +				  XFS_FSS_TO_BB(mp, 1), 0, &bp);
> > +		if (error) {
> > +			xfs_fs_cmn_err(CE_WARN, mp,
> > +			"error %d reading secondary superblock for ag %d",
> > +				error, agno);
> > +			break;
> > +		}
> > +		sbp = XFS_BUF_TO_SBP(bp);
> > +		xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);
> 
> Insert empty line
done.

> > +		/*
> > +		 * If we get an error writing out the alternate superblocks,
> > +		 * just issue a warning and continue.  The real work is
> > +		 * already done and committed.
> > +		 */
> > +		if (!(error = xfs_bwrite(mp, bp))) {
> > +			continue;
> > +		} else {
> > +			xfs_fs_cmn_err(CE_WARN, mp,
> > +		"write error %d updating secondary superblock for ag %d",
> > +				error, agno);
> > +			break; /* no point in continuing */
> > +		}
> > +	}
> 
> 		error = xfs_bwrite(mp, bp);
> 		if (error) {
> 			xfs_fs_cmn_err(...)
> 			break;
ok. (the original version was purely a move from xfs_growfs_data_private
to a separate function)

> 		}
> 	}
> > +}
> > +
> >  static int
> >  xfs_growfs_data_private(
> >  	xfs_mount_t		*mp,		/* mount point for filesystem */
> > @@ -135,7 +182,6 @@
> >  	xfs_rfsblock_t		nfree;
> >  	xfs_agnumber_t		oagcount;
> >  	int			pct;
> > -	xfs_sb_t		*sbp;
> >  	xfs_trans_t		*tp;
> >  
> >  	nb = in->newblocks;
> > @@ -356,44 +402,228 @@
> >  	if (error) {
> >  		return error;
> >  	}
> > -	/* New allocation groups fully initialized, so update mount struct */
> > -	if (nagimax)
> > -		mp->m_maxagi = nagimax;
> > -	if (mp->m_sb.sb_imax_pct) {
> > -		__uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
> > -		do_div(icount, 100);
> > -		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
> > -	} else
> > -		mp->m_maxicount = 0;
> > -	for (agno = 1; agno < nagcount; agno++) {
> > -		error = xfs_read_buf(mp, mp->m_ddev_targp,
> > -				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
> > -				  XFS_FSS_TO_BB(mp, 1), 0, &bp);
> > +	xfs_update_sb(mp, nagimax, nagcount);
> > +	return 0;
> > +
> > + error0:
> > +	xfs_trans_cancel(tp, XFS_TRANS_ABORT);
> > +	return error;
> > +}
> > +
> > +static int
> 
> STATIC int
done.

> > +xfs_shrinkfs_data_private(
> > +			  xfs_mount_t		*mp,		/* mount point for filesystem */
> > +			  xfs_growfs_data_t	*in)		/* growfs data input struct */
> 
> whitespace issues
fixed (I think you were referring to the alignement of the two argument
lines).

> > +{
> > +	xfs_agf_t		*agf;
> > +	xfs_agnumber_t		agno;
> > +	xfs_buf_t		*bp;
> > +	int			dpct;
> > +	int			error;
> > +	xfs_agnumber_t		nagcount;       /* new AG count */
> > +	xfs_agnumber_t		oagcount;       /* old AG count */
> > +	xfs_agnumber_t		nagimax = 0;
> > +	xfs_rfsblock_t		nb, nb_mod;
> > +	xfs_rfsblock_t		dbdelta;        /* will be used as a
> > +						   check that we
> > +						   shrink the fs by
> > +						   the correct number
> > +						   of blocks */
> > +	xfs_rfsblock_t		fdbdelta;       /* will keep track of
> > +						   how many ag blocks
> > +						   we need to
> > +						   remove */
> 
> Long comments like this don't go on the declaration. Put them where
> the variable is initialised or first used.
ok.

> > +	int			pct;
> > +	xfs_trans_t		*tp;
> > +
> > +	nb = in->newblocks;
> > +	pct = in->imaxpct;
> > +	if (nb >= mp->m_sb.sb_dblocks || pct < 0 || pct > 100)
> > +		return XFS_ERROR(EINVAL);
> > +	dpct = pct - mp->m_sb.sb_imax_pct;
> 
> This next bit:
> 
> > +	error = xfs_read_buf(mp, mp->m_ddev_targp,
> > +			     XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
> > +			     XFS_FSS_TO_BB(mp, 1), 0, &bp);
> > +	if (error)
> > +		return error;
> > +	ASSERT(bp);
> > +	/* FIXME: we release the buffer here manually because we are
> > +	 * outside of a transaction? The other buffers read using the
> > +	 * functions which take a tp parameter are not released in
> > +	 * growfs
> > +	 */
> > +	xfs_buf_relse(bp);
> 
> Should not be necessary - we don't need to check if the new last
> filesystem block is beyond EOF because we are shrinking....
Ah, now I understand what this does. I just copied it from growfs
without realising what it does. Removed.

> To answer the FIXME - xfs_trans_commit() releases locked buffers and
> inodes that have been joined ot the transaction unless they have
> also been held. So if you are outside a transaction, you do have to
> ensure you release any buffers you read.
Thanks for the clarification!

> > +	/* Do basic checks (at the fs level) */
> > +	oagcount = mp->m_sb.sb_agcount;
> > +	nagcount = nb;
> > +	nb_mod = do_div(nagcount, mp->m_sb.sb_agblocks);
> > +	if(nb_mod) {
> > +		printk("not shrinking on an AG boundary (diff=%d)\n", nb_mod);
> > +		return XFS_ERROR(ENOSPC);
> 
> EINVAL, I think.
fixed.

> > +	}
> > +	if(nagcount < 2) {
> > +		printk("refusing to shrink below 2 AGs\n");
> > +		return XFS_ERROR(ENOSPC);
> 
> EINVAL.
fixed.

> > +	}
> > +	if(nagcount >= oagcount) {
> > +		printk("number of AGs will not decrease\n");
> > +		return XFS_ERROR(EINVAL);
> > +	}
> > +	printk("Cur ag=%d, cur blocks=%llu\n",
> > +	       mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
> > +	printk("New ag=%d, new blocks=%d\n", nagcount, nb);
> > +
> > +	printk("Will resize from %llu to %d, delta is %llu\n",
> > +               mp->m_sb.sb_dblocks, nb, mp->m_sb.sb_dblocks - nb);
> > +	/* Check to see if we trip over the log section */
> > +	printk("logstart=%llu logblocks=%u\n",
> > +               mp->m_sb.sb_logstart, mp->m_sb.sb_logblocks);
> > +	if (nb < mp->m_sb.sb_logstart + mp->m_sb.sb_logblocks)
> > +		return XFS_ERROR(EINVAL);
> 
> Insert empty line
Done.

> > +	/* dbdelta starts at the diff and must become zero */
> > +	dbdelta = mp->m_sb.sb_dblocks - nb;
> > +	tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS);
> > +	printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta);
> > +	if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta,
> > +				       XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) {
> > +		xfs_trans_cancel(tp, 0);
> > +		return error;
> > +	}
> 
> What's the dbdelta part of the reservation for? That's reserving dbdelta
> blocks for *allocations*, so I don't think this is right....
Well, we'll shrink the filesystem by dbdelta, so the filesystem needs to
have enough free space to do it. Whether this space is in the correct
place (the AGs we want to shrink) or not is a question answered by the
per-AG checking, but since we will reduce the freespace by this amount,
I thought that it's safer to mark this space in use. Did I misread the
intent of xfs_trans_reserve?

Also, unless I'm mistaken and don't remember correctly, you have to
reserve the amount of space one will pass to xfs_trans_mod_sb(tp,
XFS_TRANS_SB_FDBLOCKS, ...) otherwise the transaction code complains
about exceeding your reservation.

> > +
> > +	fdbdelta = 0;
> > +
> > +	/* Per-AG checks */
> > +	/* FIXME: do we need to hold m_peraglock while doing this? */
> 
> Yes.
> 
> > +	/* I think that since we do read and write to the m_perag
> > +	 * stuff, we should be holding the lock for the entire walk &
> > +	 * modify of the fs
> > +	 */
> 
> Deadlock warning! holding the m_peraglock in write mode will cause
> allocation deadlocks if you are not careful as all allocation/free
> operations take the m_peraglock in read mode. (And yes, growing
> an active, loaded filesystem can deadlock because of this.)
How can we ensure then that no one modifies the AGs while we walk them?
I hoped that we can do it without the per-AG not-avail flag, by just
holding the perag lock.

Or you mean we should keep the read lock, and grab also the write lock
when actually modifying the perag structure? (Or should that be grab
read lock for checking, release read lock - hope nobody modifies AGs -
grab write lock?)

> > +	/* Note that because we hold the lock, on any error+early
> > +	 *  return, we must either release manually and return, or
> > +	 *  jump to error0
> > +	 */
> 
> whitespace.
fixed.

> > +	down_write(&mp->m_peraglock);
> > +	for(agno = oagcount - 1; agno >= nagcount; agno--) {
> > +		xfs_extlen_t	usedblks; /* total used blocks in this a.g. */
> > +		xfs_extlen_t	freeblks; /* free blocks in this a.g. */
> > +		xfs_agblock_t	aglen;    /* this ag's len */
> > +		struct xfs_perag *pag;  /* the m_perag structure */
> > +
> > +		printk("doing agno=%d\n", agno);
> > +
> > +		pag = &mp->m_perag[agno];
> > +
> > +		error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp);
> >  		if (error) {
> > -			xfs_fs_cmn_err(CE_WARN, mp,
> > -			"error %d reading secondary superblock for ag %d",
> > -				error, agno);
> > -			break;
> > +			goto error0;
> >  		}
> > -		sbp = XFS_BUF_TO_SBP(bp);
> > -		xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);
> > +		ASSERT(bp);
> > +		agf = XFS_BUF_TO_AGF(bp);
> > +		aglen = INT_GET(agf->agf_length, ARCH_CONVERT);
> > +
> > +		/* read the pagf/pagi if not already initialized */
> > +		/* agf should be initialized because of the ablove read_agf */
> > +		ASSERT(pag->pagf_init);
> > +		if (!pag->pagi_init) {
> > +			if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bp)))
> > +				goto error0;
> 
> I don't think you should be overwriting bp here....
> 
> > +			ASSERT(pag->pagi_init);
> > +		}
> > +
> 
> Because now you have bp potentially pointing to two different buffers.
Yes, indeed, sloppy programming. Fixed.

> >  		/*
> > -		 * If we get an error writing out the alternate superblocks,
> > -		 * just issue a warning and continue.  The real work is
> > -		 * already done and committed.
> > +		 * Check the inodes: as long as we have pagi_count ==
> > +		 * pagi_freecount == 0, then: a) we don't have to
> > +		 * update any global inode counters, and b) there are
> > +		 * no extra blocks in inode btrees
> >  		 */
> > -		if (!(error = xfs_bwrite(mp, bp))) {
> > -			continue;
> > -		} else {
> > -			xfs_fs_cmn_err(CE_WARN, mp,
> > -		"write error %d updating secondary superblock for ag %d",
> > -				error, agno);
> > -			break; /* no point in continuing */
> > +		if(pag->pagi_count > 0 ||
> > +		   pag->pagi_freecount > 0) {
> > +			printk("agi %d has %d inodes in total and %d free\n",
> > +			       agno, pag->pagi_count, pag->pagi_freecount);
> > +			error = XFS_ERROR(ENOSPC);
> > +			goto error0;
> > +		}
> > +
> > +		/* Check the AGF: if levels[] == 1, then there should
> > +		 *  be no extra blocks in the btrees beyond the ones
> > +		 *  at the beggining of the AG
> > +		 */
> > +		if(pag->pagf_levels[XFS_BTNUM_BNOi] > 1 ||
> > +		   pag->pagf_levels[XFS_BTNUM_CNTi] > 1) {
> > +			printk("agf %d has level %d bt and %d cnt\n",
> > +			       agno,
> > +			       pag->pagf_levels[XFS_BTNUM_BNOi],
> > +			       pag->pagf_levels[XFS_BTNUM_CNTi]);
> > +			error = XFS_ERROR(ENOSPC);
> > +			goto error0;
> >  		}
> 
> ok, so we have empty ag's here. You might want to check that the
> inode btree is empty and that the AGI unlinked list is empty.
I thought that inode btree is empty by virtue of pag->pagi_count == 0.
Is this not always so? Furthermore, also since agi_count == agi_free +
actual used inodes + number of unlinked inodes, I think we don't need to
check the unlinked list.

> > +
> > +		freeblks = pag->pagf_freeblks;
> > +		printk("Usage: %d prealloc, %d flcount\n",
> > +		       XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount);
> > +
> > +		/* Done gathering data, check sizes */
> > +		usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount;
> > +		printk("agno=%d agf_length=%d computed used=%d"
> > +		       " known free=%d\n", agno, aglen, usedblks, freeblks);
> > +
> > +		if(usedblks + freeblks != aglen) {
> > +			printk("agno %d is not free (%d blocks allocated)\n",
> > +			       agno, aglen-usedblks-freeblks);
> > +			error = XFS_ERROR(ENOSPC);
> > +			goto error0;
> > +		}
> > +		dbdelta -= aglen;
> > +		printk("will lower with %d\n",
> > +		       aglen - XFS_PREALLOC_BLOCKS(mp));
> > +		fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp);
> 
> Ok, so why not just
> 
> 	fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp);
Because the last AG can be smaller than the sb_agblocks. It's true
that this holds only for the last one, but having two cases is uglier
than just always reading this size from the AG.

> > +	}
> > +	/*
> > +	 * Check that we removed all blocks
> > +	 */
> > +	ASSERT(!dbdelta);
> > +	ASSERT(nagcount < oagcount);
> 
> Error out, not assert, because at this point we have not changed anything.
Actually here, !dbdelta or nacount < oagacount are programming errors
and not possible correct conditions.  They could be removed, sine the
checks before the per-AG for loop ensure these conditions are not met.
But I used them just to be sure.

> > +
> > +	printk("to free: %d, oagcount=%d, nagcount=%d\n",
> > +	       fdbdelta, oagcount, nagcount);
> > +
> > +	xfs_trans_agblocks_delta(tp, -((long)fdbdelta));
> > +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
> > +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, nb - mp->m_sb.sb_dblocks);
> > +	xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, -((int64_t)fdbdelta));
> > +
> > +	if (dpct)
> > +		xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
> > +	error = xfs_trans_commit(tp, 0);
> > +	if (error) {
> > +		up_write(&mp->m_peraglock);
> > +		return error;
> >  	}
> > +	/* Free memory as the number of AG has changed */
> > +	for (agno = nagcount; agno < oagcount; agno++)
> > +		if (mp->m_perag[agno].pagb_list)
> > +			kmem_free(mp->m_perag[agno].pagb_list,
> > +				  sizeof(xfs_perag_busy_t) *
> > +				  XFS_PAGB_NUM_SLOTS);
> > +
> > +	mp->m_perag = kmem_realloc(mp->m_perag,
> > +				   sizeof(xfs_perag_t) * nagcount,
> > +				   sizeof(xfs_perag_t) * oagcount,
> > +				   KM_SLEEP);
> 
> This is not really safe - how do we know if all the users of the
> higher AGs have gone away? I think we should simply just zero out
> the unused AGs and don't worry about a realloc().
The problem that I saw is that if you do shrink+grow+shrink+grow+... you
will leak a small amount of memory (or corrupt kernel mem allocation?)
since the growfs code does a realloc from what it thinks is the size of
m_perag, which it computes solely from the current number of AGs. Should
we have a size in the mp struct for this and not assume it's the
agcount?

> > +	/* FIXME: here we could instead just lower
> > +	 * nagimax to nagcount; is it better this way?
> > +	 */
> 
> Not really.
Ok, removed comment.

> > +	/* FIXME: why is this flag unconditionally set in growfs? */
> > +	mp->m_flags |= XFS_MOUNT_32BITINODES;
> 
> good question. I don't think it should be there but I'll have to
> do some digging....
Per Eric's mail, I removed both the comment and the flag setting.

> > +	nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount);
> > +	up_write(&mp->m_peraglock);
> > +
> > +	xfs_update_sb(mp, nagimax, nagcount);
> >  	return 0;
> >  
> >   error0:
> > +	up_write(&mp->m_peraglock);
> >  	xfs_trans_cancel(tp, XFS_TRANS_ABORT);
> >  	return error;
> >  }
> > @@ -435,7 +665,10 @@
> >  	int error;
> >  	if (!cpsema(&mp->m_growlock))
> >  		return XFS_ERROR(EWOULDBLOCK);
> > -	error = xfs_growfs_data_private(mp, in);
> > +	if(in->newblocks < mp->m_sb.sb_dblocks)
> > +		error = xfs_shrinkfs_data_private(mp, in);
> > +	else
> > +		error = xfs_growfs_data_private(mp, in);
> 
> Hmmm - that's using the one ioctl to do both grow and shrink. I'd
> prefer a new shrink ioctl rather than changing the behaviour of an
> existing ioctl.
Well, I chose this way because I see it as the ioctl that changes the
data size of the filesystem. It may be lower or higher than the current
size, but that is not so important :), and if another ioctl there would
be the need for another tool.

> Looks like a good start ;)
thanks! and also thanks for taking time to look at this.

updated patch (without the separation of the IOCTL and without the
rework of the perag lock) is attached.

iustin

[-- Attachment #2: patch-nice-5 --]
[-- Type: text/plain, Size: 10514 bytes --]

diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c
--- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c	2007-06-09 18:56:21.509308225 +0200
+++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c	2007-06-14 07:54:50.580420252 +0200
@@ -112,6 +112,55 @@
 	return 0;
 }
 
+STATIC void xfs_growfs_update_sb(
+	xfs_mount_t		*mp,		/* mount point for filesystem */
+	xfs_agnumber_t		nagimax,
+	xfs_agnumber_t		nagcount)       /* new number of a.g. */
+{
+	xfs_agnumber_t		agno;
+	xfs_buf_t		*bp;
+	xfs_sb_t		*sbp;
+	int			error;
+
+	/* New allocation groups fully initialized, so update mount struct */
+	if (nagimax)
+		mp->m_maxagi = nagimax;
+	if (mp->m_sb.sb_imax_pct) {
+		__uint64_t icount = mp->m_sb.sb_dblocks *
+			mp->m_sb.sb_imax_pct;
+		do_div(icount, 100);
+		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
+	} else
+		mp->m_maxicount = 0;
+
+	for (agno = 1; agno < nagcount; agno++) {
+		error = xfs_read_buf(mp, mp->m_ddev_targp,
+				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
+				  XFS_FSS_TO_BB(mp, 1), 0, &bp);
+		if (error) {
+			xfs_fs_cmn_err(CE_WARN, mp,
+			"error %d reading secondary superblock for ag %d",
+				error, agno);
+			break;
+		}
+		sbp = XFS_BUF_TO_SBP(bp);
+		xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);
+
+		/*
+		 * If we get an error writing out the alternate superblocks,
+		 * just issue a warning and continue.  The real work is
+		 * already done and committed.
+		 */
+		error = xfs_bwrite(mp, bp);
+		if(error) {
+			xfs_fs_cmn_err(CE_WARN, mp,
+		"write error %d updating secondary superblock for ag %d",
+				error, agno);
+			break; /* no point in continuing */
+		}
+	}
+}
+
 static int
 xfs_growfs_data_private(
 	xfs_mount_t		*mp,		/* mount point for filesystem */
@@ -135,7 +184,6 @@
 	xfs_rfsblock_t		nfree;
 	xfs_agnumber_t		oagcount;
 	int			pct;
-	xfs_sb_t		*sbp;
 	xfs_trans_t		*tp;
 
 	nb = in->newblocks;
@@ -356,44 +404,210 @@
 	if (error) {
 		return error;
 	}
-	/* New allocation groups fully initialized, so update mount struct */
-	if (nagimax)
-		mp->m_maxagi = nagimax;
-	if (mp->m_sb.sb_imax_pct) {
-		__uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
-		do_div(icount, 100);
-		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
-	} else
-		mp->m_maxicount = 0;
-	for (agno = 1; agno < nagcount; agno++) {
-		error = xfs_read_buf(mp, mp->m_ddev_targp,
-				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
-				  XFS_FSS_TO_BB(mp, 1), 0, &bp);
+	xfs_growfs_update_sb(mp, nagimax, nagcount);
+	return 0;
+
+ error0:
+	xfs_trans_cancel(tp, XFS_TRANS_ABORT);
+	return error;
+}
+
+STATIC int
+xfs_shrinkfs_data_private(
+	xfs_mount_t		*mp,		/* mount point for filesystem */
+	xfs_growfs_data_t	*in)		/* growfs data input struct */
+{
+	xfs_agf_t		*agf;
+	xfs_agnumber_t		agno;
+	int			dpct;
+	int			error;
+	xfs_agnumber_t		nagcount;       /* new AG count */
+	xfs_agnumber_t		oagcount;       /* old AG count */
+	xfs_agnumber_t		nagimax = 0;
+	xfs_rfsblock_t		nb, nb_mod;
+	xfs_rfsblock_t		dbdelta;
+	xfs_rfsblock_t		fdbdelta;
+	int			pct;
+	xfs_trans_t		*tp;
+
+	nb = in->newblocks;
+	pct = in->imaxpct;
+	if (nb >= mp->m_sb.sb_dblocks || pct < 0 || pct > 100)
+		return XFS_ERROR(EINVAL);
+	dpct = pct - mp->m_sb.sb_imax_pct;
+
+	/* Do basic checks (at the fs level) */
+	oagcount = mp->m_sb.sb_agcount;
+	nagcount = nb;
+	nb_mod = do_div(nagcount, mp->m_sb.sb_agblocks);
+	if(nb_mod) {
+		printk("not shrinking on an AG boundary (diff=%d)\n", nb_mod);
+		return XFS_ERROR(EINVAL);
+	}
+	if(nagcount < 2) {
+		printk("refusing to shrink below 2 AGs\n");
+		return XFS_ERROR(EINVAL);
+	}
+	if(nagcount >= oagcount) {
+		printk("number of AGs will not decrease\n");
+		return XFS_ERROR(EINVAL);
+	}
+	printk("Cur ag=%d, cur blocks=%llu\n",
+	       mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
+	printk("New ag=%d, new blocks=%d\n", nagcount, nb);
+
+	printk("Will resize from %llu to %d, delta is %llu\n",
+		mp->m_sb.sb_dblocks, nb, mp->m_sb.sb_dblocks - nb);
+	/* Check to see if we trip over the log section */
+	printk("logstart=%llu logblocks=%u\n",
+		mp->m_sb.sb_logstart, mp->m_sb.sb_logblocks);
+	if (nb < mp->m_sb.sb_logstart + mp->m_sb.sb_logblocks)
+		return XFS_ERROR(EINVAL);
+
+	/* dbdelta starts at the diff and must become zero */
+	dbdelta = mp->m_sb.sb_dblocks - nb;
+	tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS);
+	printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta);
+	if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta,
+				       XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) {
+		xfs_trans_cancel(tp, 0);
+		return error;
+	}
+
+	/* fbdelta keeps track of how many ag blocks we need to remove
+	 * (this is different from the number of filesystem blocks)
+	 */
+	fdbdelta = 0;
+
+	/* Per-AG checks */
+	/* FIXME: do we need to hold m_peraglock while doing this? */
+	/* I think that since we do read and write to the m_perag
+	 * stuff, we should be holding the lock for the entire walk &
+	 * modify of the fs
+	 */
+	/* Note that because we hold the lock, on any error+early
+	 * return, we must either release manually and return, or jump
+	 * to error0
+	 */
+	down_write(&mp->m_peraglock);
+	for(agno = oagcount - 1; agno >= nagcount; agno--) {
+		xfs_extlen_t	usedblks; /* total used blocks in this a.g. */
+		xfs_extlen_t	freeblks; /* free blocks in this a.g. */
+		xfs_agblock_t	aglen;    /* this ag's len */
+		struct xfs_perag *pag;  /* the m_perag structure */
+		xfs_buf_t	*bpf;
+		xfs_buf_t	*bpi;
+
+		printk("doing agno=%d\n", agno);
+
+		pag = &mp->m_perag[agno];
+
+		error = xfs_alloc_read_agf(mp, tp, agno, 0, &bpf);
 		if (error) {
-			xfs_fs_cmn_err(CE_WARN, mp,
-			"error %d reading secondary superblock for ag %d",
-				error, agno);
-			break;
+			goto error0;
 		}
-		sbp = XFS_BUF_TO_SBP(bp);
-		xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);
+		ASSERT(bpf);
+		agf = XFS_BUF_TO_AGF(bpf);
+		aglen = INT_GET(agf->agf_length, ARCH_CONVERT);
+
+		/* read the pagf/pagi if not already initialized */
+		/* agf should be initialized because of the ablove read_agf */
+		ASSERT(pag->pagf_init);
+		if (!pag->pagi_init) {
+			if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bpi)))
+				goto error0;
+			ASSERT(pag->pagi_init);
+		}
+
 		/*
-		 * If we get an error writing out the alternate superblocks,
-		 * just issue a warning and continue.  The real work is
-		 * already done and committed.
+		 * Check the inodes: as long as we have pagi_count ==
+		 * pagi_freecount == 0, then: a) we don't have to
+		 * update any global inode counters, and b) there are
+		 * no extra blocks in inode btrees
 		 */
-		if (!(error = xfs_bwrite(mp, bp))) {
-			continue;
-		} else {
-			xfs_fs_cmn_err(CE_WARN, mp,
-		"write error %d updating secondary superblock for ag %d",
-				error, agno);
-			break; /* no point in continuing */
+		if(pag->pagi_count > 0 ||
+		   pag->pagi_freecount > 0) {
+			printk("agi %d has %d inodes in total and %d free\n",
+			       agno, pag->pagi_count, pag->pagi_freecount);
+			error = XFS_ERROR(ENOSPC);
+			goto error0;
 		}
+
+		/* Check the AGF: if levels[] == 1, then there should
+		 *  be no extra blocks in the btrees beyond the ones
+		 *  at the beggining of the AG
+		 */
+		if(pag->pagf_levels[XFS_BTNUM_BNOi] > 1 ||
+		   pag->pagf_levels[XFS_BTNUM_CNTi] > 1) {
+			printk("agf %d has level %d bt and %d cnt\n",
+			       agno,
+			       pag->pagf_levels[XFS_BTNUM_BNOi],
+			       pag->pagf_levels[XFS_BTNUM_CNTi]);
+			error = XFS_ERROR(ENOSPC);
+			goto error0;
+		}
+
+		freeblks = pag->pagf_freeblks;
+		printk("Usage: %d prealloc, %d flcount\n",
+		       XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount);
+
+		/* Done gathering data, check sizes */
+		usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount;
+		printk("agno=%d agf_length=%d computed used=%d"
+		       " known free=%d\n", agno, aglen, usedblks, freeblks);
+
+		if(usedblks + freeblks != aglen) {
+			printk("agno %d is not free (%d blocks allocated)\n",
+			       agno, aglen-usedblks-freeblks);
+			error = XFS_ERROR(ENOSPC);
+			goto error0;
+		}
+		dbdelta -= aglen;
+		printk("will lower with %d\n",
+		       aglen - XFS_PREALLOC_BLOCKS(mp));
+		fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp);
 	}
+	/*
+	 * Check that we removed all blocks
+	 */
+	ASSERT(!dbdelta);
+	ASSERT(nagcount < oagcount);
+
+	printk("to free: %d, oagcount=%d, nagcount=%d\n",
+	       fdbdelta, oagcount, nagcount);
+
+	xfs_trans_agblocks_delta(tp, -((long)fdbdelta));
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, nb - mp->m_sb.sb_dblocks);
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, -((int64_t)fdbdelta));
+
+	if (dpct)
+		xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
+	error = xfs_trans_commit(tp, 0);
+	if (error) {
+		up_write(&mp->m_peraglock);
+		return error;
+	}
+	/* Free memory as the number of AG has changed */
+	for (agno = nagcount; agno < oagcount; agno++)
+		if (mp->m_perag[agno].pagb_list)
+			kmem_free(mp->m_perag[agno].pagb_list,
+				  sizeof(xfs_perag_busy_t) *
+				  XFS_PAGB_NUM_SLOTS);
+
+	mp->m_perag = kmem_realloc(mp->m_perag,
+				   sizeof(xfs_perag_t) * nagcount,
+				   sizeof(xfs_perag_t) * oagcount,
+				   KM_SLEEP);
+
+	nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount);
+	up_write(&mp->m_peraglock);
+
+	xfs_growfs_update_sb(mp, nagimax, nagcount);
 	return 0;
 
  error0:
+	up_write(&mp->m_peraglock);
 	xfs_trans_cancel(tp, XFS_TRANS_ABORT);
 	return error;
 }
@@ -435,7 +649,10 @@
 	int error;
 	if (!cpsema(&mp->m_growlock))
 		return XFS_ERROR(EWOULDBLOCK);
-	error = xfs_growfs_data_private(mp, in);
+	if(in->newblocks < mp->m_sb.sb_dblocks)
+		error = xfs_shrinkfs_data_private(mp, in);
+	else
+		error = xfs_growfs_data_private(mp, in);
 	vsema(&mp->m_growlock);
 	return error;
 }
@@ -633,7 +850,7 @@
 			xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
 			thaw_bdev(sb->s_bdev, sb);
 		}
-	
+
 		break;
 	}
 	case XFS_FSOP_GOING_FLAGS_LOGFLUSH:
diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_trans.c linux-2.6-xfs.shrink/fs/xfs/xfs_trans.c
--- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_trans.c	2007-06-05 17:40:51.000000000 +0200
+++ linux-2.6-xfs.shrink/fs/xfs/xfs_trans.c	2007-06-07 23:01:03.000000000 +0200
@@ -503,11 +503,9 @@
 		tp->t_res_frextents_delta += delta;
 		break;
 	case XFS_TRANS_SB_DBLOCKS:
-		ASSERT(delta > 0);
 		tp->t_dblocks_delta += delta;
 		break;
 	case XFS_TRANS_SB_AGCOUNT:
-		ASSERT(delta > 0);
 		tp->t_agcount_delta += delta;
 		break;
 	case XFS_TRANS_SB_IMAXPCT:

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Implement shrink of empty AGs
  2007-06-14  6:01   ` Iustin Pop
@ 2007-06-14  9:00     ` David Chinner
  2007-06-14 20:55       ` Iustin Pop
  0 siblings, 1 reply; 7+ messages in thread
From: David Chinner @ 2007-06-14  9:00 UTC (permalink / raw)
  To: xfs; +Cc: Iustin Pop

On Thu, Jun 14, 2007 at 08:01:58AM +0200, Iustin Pop wrote:
> On Tue, Jun 12, 2007 at 12:40:25PM +1000, David Chinner wrote:
> > > diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c
> > > --- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c	2007-06-09 18:56:21.509308225 +0200
> > > +++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c	2007-06-10 18:32:36.074856477 +0200
> > > @@ -112,6 +112,53 @@
> > >  	return 0;
> > >  }
> > >  
> > > +static void xfs_update_sb(
> > 
> > STATIC void
> > xfs_growfs_update_sb(
> this was because xfs_growfs_private is also static and not STATIC.
> Should I change the def for it also?

Yes, probably should.

> > > +	xfs_mount_t		*mp,		/* mount point for filesystem */
> > > +	xfs_agnumber_t		nagimax,
> > > +	xfs_agnumber_t		nagcount)       /* new number of a.g. */
> > 
> > tabs, not spaces (and tabs are 8 spaces).
> sorry, I though I got all of these. There are some more in the def of
> xfs_reserve_blocks, btw.

Ok, I'll clean them up in the current fix-reserve-blocks-yet-again
patch I have.

> > > +xfs_shrinkfs_data_private(
> > > +			  xfs_mount_t		*mp,		/* mount point for filesystem */
> > > +			  xfs_growfs_data_t	*in)		/* growfs data input struct */
> > 
> > whitespace issues
> fixed (I think you were referring to the alignement of the two argument
> lines).

Yes.

> > > +	/* dbdelta starts at the diff and must become zero */
> > > +	dbdelta = mp->m_sb.sb_dblocks - nb;
> > > +	tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS);
> > > +	printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta);
> > > +	if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta,
> > > +				       XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) {
> > > +		xfs_trans_cancel(tp, 0);
> > > +		return error;
> > > +	}
> > 
> > What's the dbdelta part of the reservation for? That's reserving dbdelta
> > blocks for *allocations*, so I don't think this is right....
>
> Well, we'll shrink the filesystem by dbdelta, so the filesystem needs to
> have enough free space to do it.

True, but when you look at how much free space we need to make that
modification, it turns out ot be zero. i.e. we don't need to do any
allocations *anywhere* to remove an empty AG from the end of the
filesystem - just log a change to the superblock(s).

> Whether this space is in the correct
> place (the AGs we want to shrink) or not is a question answered by the
> per-AG checking, but since we will reduce the freespace by this amount,
> I thought that it's safer to mark this space in use. Did I misread the
> intent of xfs_trans_reserve?

The transactions reservation takes into account blocks that may need
to be allocated during a transaction. Normally when we are freeing
an extent, we may still have to allocate some blocks because a
btree block might need to be split (and split all the way up the tree
to the root). The XFS_GROWFS_SPACE_RES(mp) macro reserves the space
needed for the freeing of a new extent which occurs during growing a
partial AG to a full AG. We aren't doing such an operation here;
we don't modify any free space or inode or extent btrees here at
all.

> Also, unless I'm mistaken and don't remember correctly, you have to
> reserve the amount of space one will pass to xfs_trans_mod_sb(tp,
> XFS_TRANS_SB_FDBLOCKS, ...) otherwise the transaction code complains
> about exceeding your reservation.

xfs_trans_mod_sb:

    459         case XFS_TRANS_SB_FDBLOCKS:
    460                 /*
    461                  * Track the number of blocks allocated in the
    462                  * transaction.  Make sure it does not exceed the
    463                  * number reserved.
    464                  */
    465                 if (delta < 0) {
    466                         tp->t_blk_res_used += (uint)-delta;
    467                         ASSERT(tp->t_blk_res_used <= tp->t_blk_res);
    468                 }
    469                 tp->t_fdblocks_delta += delta;
    470                 if (xfs_sb_version_haslazysbcount(&mp->m_sb))
    471                         flags &= ~XFS_TRANS_SB_DIRTY;
    472                 break;

So, delta < 0 (i.e. allocating blocks) requires a transaction reservation,
but freeing blocks (delta > 0) does not.


> > > +	/* I think that since we do read and write to the m_perag
> > > +	 * stuff, we should be holding the lock for the entire walk &
> > > +	 * modify of the fs
> > > +	 */
> > 
> > Deadlock warning! holding the m_peraglock in write mode will cause
> > allocation deadlocks if you are not careful as all allocation/free
> > operations take the m_peraglock in read mode. (And yes, growing
> > an active, loaded filesystem can deadlock because of this.)
>
> How can we ensure then that no one modifies the AGs while we walk them?

It's not even modify - how do you prevent any *reference* to the perag
while we are doing this.

> I hoped that we can do it without the per-AG not-avail flag, by just
> holding the perag lock.

The per-AG not-available flag will prevent new allocations, but it doesn't prevent
access to the AGs as we still need access to the AGs to copy the data and
metadata out. The m_peraglock in write mode will prevent *simultaneous*
access to the per-ag, but the code waiting on a read lock may still try
to access the bit of the perag structure we're about to remove.

See, for example, xfs_bmap_btalloc(). It gets and AG it is supposed to
use from somewhere external, and then before it goes to use the perag
structure for that AG, it waits on the peraglock in read mode. If that
blocks waiting on a shrink and after the shrink ag > sb_agcount, we're
in a world of pain.....

Note that I'm just using this as an example piece of existing code that would
break badly if we simply realloc the perag array. I could point you to the
filestreams code where there is a perag stream refcount that we'll need to
ensure is zero.

What I'm trying to say is that the perag lock may not give us sufficient
exclusion to be able to shrink the perag array safely.

> > ok, so we have empty ag's here. You might want to check that the
> > inode btree is empty and that the AGI unlinked list is empty.
> I thought that inode btree is empty by virtue of pag->pagi_count == 0.
> Is this not always so?

It should be. Call me paranoid, but with an operation like this I want
to make sure we check and double check that it is safe to proceed.

> Furthermore, also since agi_count == agi_free +
> actual used inodes + number of unlinked inodes, I think we don't need to
> check the unlinked list.

Call me paranoid, but....

I think at minimum there should be debug code that caters to my paranoia ;)

> > > +
> > > +		freeblks = pag->pagf_freeblks;
> > > +		printk("Usage: %d prealloc, %d flcount\n",
> > > +		       XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount);
> > > +
> > > +		/* Done gathering data, check sizes */
> > > +		usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount;
> > > +		printk("agno=%d agf_length=%d computed used=%d"
> > > +		       " known free=%d\n", agno, aglen, usedblks, freeblks);
> > > +
> > > +		if(usedblks + freeblks != aglen) {
> > > +			printk("agno %d is not free (%d blocks allocated)\n",
> > > +			       agno, aglen-usedblks-freeblks);
> > > +			error = XFS_ERROR(ENOSPC);
> > > +			goto error0;
> > > +		}
> > > +		dbdelta -= aglen;
> > > +		printk("will lower with %d\n",
> > > +		       aglen - XFS_PREALLOC_BLOCKS(mp));
> > > +		fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp);
> > 
> > Ok, so why not just
> > 
> > 	fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp);
> Because the last AG can be smaller than the sb_agblocks. It's true
> that this holds only for the last one, but having two cases is uglier
> than just always reading this size from the AG.

But you -EINVAL'd earlier when the shrink size would not leave entire
AGs behind. So the partial last AG is something that won't happen
here.....

> > > +	}
> > > +	/*
> > > +	 * Check that we removed all blocks
> > > +	 */
> > > +	ASSERT(!dbdelta);
> > > +	ASSERT(nagcount < oagcount);
> > 
> > Error out, not assert, because at this point we have not changed anything.
> Actually here, !dbdelta or nacount < oagacount are programming errors
> and not possible correct conditions.  They could be removed, sine the
> checks before the per-AG for loop ensure these conditions are not met.
> But I used them just to be sure.

Ok - I misinterpreted what they were catching.


> > > +	/* Free memory as the number of AG has changed */
> > > +	for (agno = nagcount; agno < oagcount; agno++)
> > > +		if (mp->m_perag[agno].pagb_list)
> > > +			kmem_free(mp->m_perag[agno].pagb_list,
> > > +				  sizeof(xfs_perag_busy_t) *
> > > +				  XFS_PAGB_NUM_SLOTS);
> > > +
> > > +	mp->m_perag = kmem_realloc(mp->m_perag,
> > > +				   sizeof(xfs_perag_t) * nagcount,
> > > +				   sizeof(xfs_perag_t) * oagcount,
> > > +				   KM_SLEEP);
> > 
> > This is not really safe - how do we know if all the users of the
> > higher AGs have gone away? I think we should simply just zero out
> > the unused AGs and don't worry about a realloc().
> The problem that I saw is that if you do shrink+grow+shrink+grow+... you
> will leak a small amount of memory (or corrupt kernel mem allocation?)
> since the growfs code does a realloc from what it thinks is the size of
> m_perag, which it computes solely from the current number of AGs. Should
> we have a size in the mp struct for this and not assume it's the
> agcount?

Yup, another size variable in the mount structure is probably the only
way to fix this.

> > > +	if(in->newblocks < mp->m_sb.sb_dblocks)
> > > +		error = xfs_shrinkfs_data_private(mp, in);
> > > +	else
> > > +		error = xfs_growfs_data_private(mp, in);
> > 
> > Hmmm - that's using the one ioctl to do both grow and shrink. I'd
> > prefer a new shrink ioctl rather than changing the behaviour of an
> > existing ioctl.
> Well, I chose this way because I see it as the ioctl that changes the
> data size of the filesystem. It may be lower or higher than the current
> size, but that is not so important :), and if another ioctl there would
> be the need for another tool.

True, put that way i think you're right.

> updated patch (without the separation of the IOCTL and without the
> rework of the perag lock) is attached.

I haven't looked at it yet - i'll try to get to it tomorrow...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Implement shrink of empty AGs
  2007-06-14  9:00     ` David Chinner
@ 2007-06-14 20:55       ` Iustin Pop
  2007-06-14 22:16         ` David Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Iustin Pop @ 2007-06-14 20:55 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs

On Thu, Jun 14, 2007 at 07:00:52PM +1000, David Chinner wrote:
> > > What's the dbdelta part of the reservation for? That's reserving dbdelta
> > > blocks for *allocations*, so I don't think this is right....
> >
> > Well, we'll shrink the filesystem by dbdelta, so the filesystem needs to
> > have enough free space to do it.
> 
> True, but when you look at how much free space we need to make that
> modification, it turns out ot be zero. i.e. we don't need to do any
> allocations *anywhere* to remove an empty AG from the end of the
> filesystem - just log a change to the superblock(s).
> 
> > Whether this space is in the correct
> > place (the AGs we want to shrink) or not is a question answered by the
> > per-AG checking, but since we will reduce the freespace by this amount,
> > I thought that it's safer to mark this space in use. Did I misread the
> > intent of xfs_trans_reserve?
> 
> The transactions reservation takes into account blocks that may need
> to be allocated during a transaction. Normally when we are freeing
> an extent, we may still have to allocate some blocks because a
> btree block might need to be split (and split all the way up the tree
> to the root). The XFS_GROWFS_SPACE_RES(mp) macro reserves the space
> needed for the freeing of a new extent which occurs during growing a
> partial AG to a full AG. We aren't doing such an operation here;
> we don't modify any free space or inode or extent btrees here at
> all.
Ah, finally I understand what the reservation is for. I probably
misunderstood the actual meaning of the transaction also...

Thanks for the clear explanation, will remove this.

> [...]
> Note that I'm just using this as an example piece of existing code that would
> break badly if we simply realloc the perag array. I could point you to the
> filestreams code where there is a perag stream refcount that we'll need to
> ensure is zero.
> 
> What I'm trying to say is that the perag lock may not give us sufficient
> exclusion to be able to shrink the perag array safely.
Interesting conclusion. I'm fail right now to come up with an idea,
except auditing the whole codebase and make sure any down_read is
followed by a check on the perag size. Or something like adding a flag
to the perag structure like 'gone' that is checked (since we won't
k_realloc anymore), after the read lock.

> > > ok, so we have empty ag's here. You might want to check that the
> > > inode btree is empty and that the AGI unlinked list is empty.
> > I thought that inode btree is empty by virtue of pag->pagi_count == 0.
> > Is this not always so?
> 
> It should be. Call me paranoid, but with an operation like this I want
> to make sure we check and double check that it is safe to proceed.
> 
> > Furthermore, also since agi_count == agi_free +
> > actual used inodes + number of unlinked inodes, I think we don't need to
> > check the unlinked list.
> 
> Call me paranoid, but....
> 
> I think at minimum there should be debug code that caters to my paranoia ;)
Ok, I'll look into this then. I understand the paranoia, I just assumed
that logically it's provable that this situation doesn't/can't happen.

> > > Ok, so why not just
> > > 
> > > 	fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp);
> > Because the last AG can be smaller than the sb_agblocks. It's true
> > that this holds only for the last one, but having two cases is uglier
> > than just always reading this size from the AG.
> 
> But you -EINVAL'd earlier when the shrink size would not leave entire
> AGs behind. So the partial last AG is something that won't happen
> here.....
No, what I meant when I said 'last AG' is the "original" last, not the
"new" last. Consider this fs: AGs 1-8 = 4096 blocks, AG 9 = 3000 blocks.
We can't, in the for loop, make fbdelta += mp->m_sb.sb_agblocks ...
since that is 4096, whereas the for 'last' AG (AG 9) has only 3000
blocks.

What I mean is that for a normal filesystem (non-shrink situation), the
size of the last AG is readable only from the AGF structure on-disk,
but for all the others AG the size is the one from the superblock. So if
you touch the last AG, you have to read the AGF.

> > > This is not really safe - how do we know if all the users of the
> > > higher AGs have gone away? I think we should simply just zero out
> > > the unused AGs and don't worry about a realloc().
> > The problem that I saw is that if you do shrink+grow+shrink+grow+... you
> > will leak a small amount of memory (or corrupt kernel mem allocation?)
> > since the growfs code does a realloc from what it thinks is the size of
> > m_perag, which it computes solely from the current number of AGs. Should
> > we have a size in the mp struct for this and not assume it's the
> > agcount?
> 
> Yup, another size variable in the mount structure is probably the only
> way to fix this.
Ok, noted. Also I get the impression that this ties in with the perag
locking above somehow...

> > updated patch (without the separation of the IOCTL and without the
> > rework of the perag lock) is attached.
> 
> I haven't looked at it yet - i'll try to get to it tomorrow...
No hurry, just small changes.

I'll think about the issues you raised.

thanks,
iustin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Implement shrink of empty AGs
  2007-06-14 20:55       ` Iustin Pop
@ 2007-06-14 22:16         ` David Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: David Chinner @ 2007-06-14 22:16 UTC (permalink / raw)
  To: xfs

On Thu, Jun 14, 2007 at 10:55:18PM +0200, Iustin Pop wrote:
> On Thu, Jun 14, 2007 at 07:00:52PM +1000, David Chinner wrote:
> > [...]
> > Note that I'm just using this as an example piece of existing code that would
> > break badly if we simply realloc the perag array. I could point you to the
> > filestreams code where there is a perag stream refcount that we'll need to
> > ensure is zero.
> > 
> > What I'm trying to say is that the perag lock may not give us sufficient
> > exclusion to be able to shrink the perag array safely.
> Interesting conclusion. I'm fail right now to come up with an idea,
> except auditing the whole codebase

Yup.

> and make sure any down_read is followed by a check on the perag size. 

That doesn't help, really, because at some points we won't be
able to back out and select another AG. i.e. once we've selected
an AG and ensured the perag structure is initialised, it is
assumed it will remain that way forever.

More likely we need reference counting on the AGs as we do operations
(e.g. when we select an AG for an operation we take a reference
and then drop it when we are done) and prevent new references
from being taken on an AG at some point (i.e. once it's been emptied).

> Or something like adding a flag
> to the perag structure like 'gone' that is checked (since we won't
> k_realloc anymore), after the read lock.

Clear the pag[if]_init flags and ensure failure to initialise is
treated correctly i.e. fall back to selecting another AG.

> > > > ok, so we have empty ag's here. You might want to check that the
> > > > inode btree is empty and that the AGI unlinked list is empty.
> > > I thought that inode btree is empty by virtue of pag->pagi_count == 0.
> > > Is this not always so?
> > 
> > It should be. Call me paranoid, but with an operation like this I want
> > to make sure we check and double check that it is safe to proceed.
> > 
> > > Furthermore, also since agi_count == agi_free +
> > > actual used inodes + number of unlinked inodes, I think we don't need to
> > > check the unlinked list.
> > 
> > Call me paranoid, but....
> > 
> > I think at minimum there should be debug code that caters to my paranoia ;)
> Ok, I'll look into this then. I understand the paranoia, I just assumed
> that logically it's provable that this situation doesn't/can't happen.

At some point we'll have a bug that breaks that logic - we need code
to catch that....

> > > > Ok, so why not just
> > > > 
> > > > 	fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp);
> > > Because the last AG can be smaller than the sb_agblocks. It's true
> > > that this holds only for the last one, but having two cases is uglier
> > > than just always reading this size from the AG.
> > 
> > But you -EINVAL'd earlier when the shrink size would not leave entire
> > AGs behind. So the partial last AG is something that won't happen
> > here.....
> No, what I meant when I said 'last AG' is the "original" last, not the
> "new" last. Consider this fs: AGs 1-8 = 4096 blocks, AG 9 = 3000 blocks.

Ok, gotcha. Makes sense now.

> We can't, in the for loop, make fbdelta += mp->m_sb.sb_agblocks ...
> since that is 4096, whereas the for 'last' AG (AG 9) has only 3000
> blocks.
> 
> What I mean is that for a normal filesystem (non-shrink situation), the
> size of the last AG is readable only from the AGF structure on-disk,
> but for all the others AG the size is the one from the superblock. So if
> you touch the last AG, you have to read the AGF.

I'd still special case it, like the growing of a partial last AG is
special cased in xfs_growfs_data_private().

> > > > This is not really safe - how do we know if all the users of the
> > > > higher AGs have gone away? I think we should simply just zero out
> > > > the unused AGs and don't worry about a realloc().
> > > The problem that I saw is that if you do shrink+grow+shrink+grow+... you
> > > will leak a small amount of memory (or corrupt kernel mem allocation?)
> > > since the growfs code does a realloc from what it thinks is the size of
> > > m_perag, which it computes solely from the current number of AGs. Should
> > > we have a size in the mp struct for this and not assume it's the
> > > agcount?
> > 
> > Yup, another size variable in the mount structure is probably the only
> > way to fix this.
> Ok, noted. Also I get the impression that this ties in with the perag
> locking above somehow...

*nod*

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2007-06-14 22:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-10 16:40 [PATCH] Implement shrink of empty AGs Iustin Pop
2007-06-12  2:40 ` David Chinner
2007-06-12  4:25   ` Eric Sandeen
2007-06-14  6:01   ` Iustin Pop
2007-06-14  9:00     ` David Chinner
2007-06-14 20:55       ` Iustin Pop
2007-06-14 22:16         ` David Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox