* [PATCH] Implement shrink of empty AGs
@ 2007-06-10 16:40 Iustin Pop
2007-06-12 2:40 ` David Chinner
0 siblings, 1 reply; 7+ messages in thread
From: Iustin Pop @ 2007-06-10 16:40 UTC (permalink / raw)
To: xfs
[-- Attachment #1.1: Type: text/plain, Size: 1412 bytes --]
The attached patch implements shrinking of completely empty allocation
groups. The patch is against current CVS and modifies two files:
- xfs_trans.c to remove two asserts in which prevent lowering the
number of AGs or filesystem blocks;
- xfs_fsops.c where it does:
- modify xfs_growfs_data() to branch to either
xfs_growfs_data_private or xfs_shrinkfs_data private depending on
the new size of the fs
- abstract the last part of xfs_growfs_data_private (the modify of
all the superblocks) into a separate function, xfs_update_sb(),
which is called both from shrink and grow
- add the new xfs_shrinkfs_data_private function, mostly based on
the growfs function
There are many printk()'s left in the patch, I left them as they show
where I compute some important values. There are also many FIXMEs in the
comments showing what parts I didn't understand or was not sure about
(not that these are the only ones...). Probably for a real patch,
xfs-specific debug hooks need to be added and the printk()s removed.
The patch works on UML and QEMU virtual machines, both in UP and SMP. I
just tested many shrink/grow operations and verified with xfs_repair
that the fs is not corrupted. The free space counters seem to be correct
after shrink.
Note that you also need to remove the check from xfs_growfs.c of not
allowing to shrink the filesystem.
regards,
iustin
[-- Attachment #1.2: patch-nice-4 --]
[-- Type: text/plain, Size: 11582 bytes --]
diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c
--- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c 2007-06-09 18:56:21.509308225 +0200
+++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c 2007-06-10 18:32:36.074856477 +0200
@@ -112,6 +112,53 @@
return 0;
}
+static void xfs_update_sb(
+ xfs_mount_t *mp, /* mount point for filesystem */
+ xfs_agnumber_t nagimax,
+ xfs_agnumber_t nagcount) /* new number of a.g. */
+{
+ xfs_agnumber_t agno;
+ xfs_buf_t *bp;
+ xfs_sb_t *sbp;
+ int error;
+
+ /* New allocation groups fully initialized, so update mount struct */
+ if (nagimax)
+ mp->m_maxagi = nagimax;
+ if (mp->m_sb.sb_imax_pct) {
+ __uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
+ do_div(icount, 100);
+ mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
+ } else
+ mp->m_maxicount = 0;
+ for (agno = 1; agno < nagcount; agno++) {
+ error = xfs_read_buf(mp, mp->m_ddev_targp,
+ XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
+ XFS_FSS_TO_BB(mp, 1), 0, &bp);
+ if (error) {
+ xfs_fs_cmn_err(CE_WARN, mp,
+ "error %d reading secondary superblock for ag %d",
+ error, agno);
+ break;
+ }
+ sbp = XFS_BUF_TO_SBP(bp);
+ xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);
+ /*
+ * If we get an error writing out the alternate superblocks,
+ * just issue a warning and continue. The real work is
+ * already done and committed.
+ */
+ if (!(error = xfs_bwrite(mp, bp))) {
+ continue;
+ } else {
+ xfs_fs_cmn_err(CE_WARN, mp,
+ "write error %d updating secondary superblock for ag %d",
+ error, agno);
+ break; /* no point in continuing */
+ }
+ }
+}
+
static int
xfs_growfs_data_private(
xfs_mount_t *mp, /* mount point for filesystem */
@@ -135,7 +182,6 @@
xfs_rfsblock_t nfree;
xfs_agnumber_t oagcount;
int pct;
- xfs_sb_t *sbp;
xfs_trans_t *tp;
nb = in->newblocks;
@@ -356,44 +402,228 @@
if (error) {
return error;
}
- /* New allocation groups fully initialized, so update mount struct */
- if (nagimax)
- mp->m_maxagi = nagimax;
- if (mp->m_sb.sb_imax_pct) {
- __uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
- do_div(icount, 100);
- mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
- } else
- mp->m_maxicount = 0;
- for (agno = 1; agno < nagcount; agno++) {
- error = xfs_read_buf(mp, mp->m_ddev_targp,
- XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
- XFS_FSS_TO_BB(mp, 1), 0, &bp);
+ xfs_update_sb(mp, nagimax, nagcount);
+ return 0;
+
+ error0:
+ xfs_trans_cancel(tp, XFS_TRANS_ABORT);
+ return error;
+}
+
+static int
+xfs_shrinkfs_data_private(
+ xfs_mount_t *mp, /* mount point for filesystem */
+ xfs_growfs_data_t *in) /* growfs data input struct */
+{
+ xfs_agf_t *agf;
+ xfs_agnumber_t agno;
+ xfs_buf_t *bp;
+ int dpct;
+ int error;
+ xfs_agnumber_t nagcount; /* new AG count */
+ xfs_agnumber_t oagcount; /* old AG count */
+ xfs_agnumber_t nagimax = 0;
+ xfs_rfsblock_t nb, nb_mod;
+ xfs_rfsblock_t dbdelta; /* will be used as a
+ check that we
+ shrink the fs by
+ the correct number
+ of blocks */
+ xfs_rfsblock_t fdbdelta; /* will keep track of
+ how many ag blocks
+ we need to
+ remove */
+ int pct;
+ xfs_trans_t *tp;
+
+ nb = in->newblocks;
+ pct = in->imaxpct;
+ if (nb >= mp->m_sb.sb_dblocks || pct < 0 || pct > 100)
+ return XFS_ERROR(EINVAL);
+ dpct = pct - mp->m_sb.sb_imax_pct;
+ error = xfs_read_buf(mp, mp->m_ddev_targp,
+ XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
+ XFS_FSS_TO_BB(mp, 1), 0, &bp);
+ if (error)
+ return error;
+ ASSERT(bp);
+ /* FIXME: we release the buffer here manually because we are
+ * outside of a transaction? The other buffers read using the
+ * functions which take a tp parameter are not released in
+ * growfs
+ */
+ xfs_buf_relse(bp);
+
+ /* Do basic checks (at the fs level) */
+ oagcount = mp->m_sb.sb_agcount;
+ nagcount = nb;
+ nb_mod = do_div(nagcount, mp->m_sb.sb_agblocks);
+ if(nb_mod) {
+ printk("not shrinking on an AG boundary (diff=%d)\n", nb_mod);
+ return XFS_ERROR(ENOSPC);
+ }
+ if(nagcount < 2) {
+ printk("refusing to shrink below 2 AGs\n");
+ return XFS_ERROR(ENOSPC);
+ }
+ if(nagcount >= oagcount) {
+ printk("number of AGs will not decrease\n");
+ return XFS_ERROR(EINVAL);
+ }
+ printk("Cur ag=%d, cur blocks=%llu\n",
+ mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
+ printk("New ag=%d, new blocks=%d\n", nagcount, nb);
+
+ printk("Will resize from %llu to %d, delta is %llu\n",
+ mp->m_sb.sb_dblocks, nb, mp->m_sb.sb_dblocks - nb);
+ /* Check to see if we trip over the log section */
+ printk("logstart=%llu logblocks=%u\n",
+ mp->m_sb.sb_logstart, mp->m_sb.sb_logblocks);
+ if (nb < mp->m_sb.sb_logstart + mp->m_sb.sb_logblocks)
+ return XFS_ERROR(EINVAL);
+ /* dbdelta starts at the diff and must become zero */
+ dbdelta = mp->m_sb.sb_dblocks - nb;
+ tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS);
+ printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta);
+ if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta,
+ XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) {
+ xfs_trans_cancel(tp, 0);
+ return error;
+ }
+
+ fdbdelta = 0;
+
+ /* Per-AG checks */
+ /* FIXME: do we need to hold m_peraglock while doing this? */
+ /* I think that since we do read and write to the m_perag
+ * stuff, we should be holding the lock for the entire walk &
+ * modify of the fs
+ */
+ /* Note that because we hold the lock, on any error+early
+ * return, we must either release manually and return, or
+ * jump to error0
+ */
+ down_write(&mp->m_peraglock);
+ for(agno = oagcount - 1; agno >= nagcount; agno--) {
+ xfs_extlen_t usedblks; /* total used blocks in this a.g. */
+ xfs_extlen_t freeblks; /* free blocks in this a.g. */
+ xfs_agblock_t aglen; /* this ag's len */
+ struct xfs_perag *pag; /* the m_perag structure */
+
+ printk("doing agno=%d\n", agno);
+
+ pag = &mp->m_perag[agno];
+
+ error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp);
if (error) {
- xfs_fs_cmn_err(CE_WARN, mp,
- "error %d reading secondary superblock for ag %d",
- error, agno);
- break;
+ goto error0;
}
- sbp = XFS_BUF_TO_SBP(bp);
- xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS);
+ ASSERT(bp);
+ agf = XFS_BUF_TO_AGF(bp);
+ aglen = INT_GET(agf->agf_length, ARCH_CONVERT);
+
+ /* read the pagf/pagi if not already initialized */
+ /* agf should be initialized because of the ablove read_agf */
+ ASSERT(pag->pagf_init);
+ if (!pag->pagi_init) {
+ if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bp)))
+ goto error0;
+ ASSERT(pag->pagi_init);
+ }
+
/*
- * If we get an error writing out the alternate superblocks,
- * just issue a warning and continue. The real work is
- * already done and committed.
+ * Check the inodes: as long as we have pagi_count ==
+ * pagi_freecount == 0, then: a) we don't have to
+ * update any global inode counters, and b) there are
+ * no extra blocks in inode btrees
*/
- if (!(error = xfs_bwrite(mp, bp))) {
- continue;
- } else {
- xfs_fs_cmn_err(CE_WARN, mp,
- "write error %d updating secondary superblock for ag %d",
- error, agno);
- break; /* no point in continuing */
+ if(pag->pagi_count > 0 ||
+ pag->pagi_freecount > 0) {
+ printk("agi %d has %d inodes in total and %d free\n",
+ agno, pag->pagi_count, pag->pagi_freecount);
+ error = XFS_ERROR(ENOSPC);
+ goto error0;
+ }
+
+ /* Check the AGF: if levels[] == 1, then there should
+ * be no extra blocks in the btrees beyond the ones
+ * at the beggining of the AG
+ */
+ if(pag->pagf_levels[XFS_BTNUM_BNOi] > 1 ||
+ pag->pagf_levels[XFS_BTNUM_CNTi] > 1) {
+ printk("agf %d has level %d bt and %d cnt\n",
+ agno,
+ pag->pagf_levels[XFS_BTNUM_BNOi],
+ pag->pagf_levels[XFS_BTNUM_CNTi]);
+ error = XFS_ERROR(ENOSPC);
+ goto error0;
}
+
+ freeblks = pag->pagf_freeblks;
+ printk("Usage: %d prealloc, %d flcount\n",
+ XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount);
+
+ /* Done gathering data, check sizes */
+ usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount;
+ printk("agno=%d agf_length=%d computed used=%d"
+ " known free=%d\n", agno, aglen, usedblks, freeblks);
+
+ if(usedblks + freeblks != aglen) {
+ printk("agno %d is not free (%d blocks allocated)\n",
+ agno, aglen-usedblks-freeblks);
+ error = XFS_ERROR(ENOSPC);
+ goto error0;
+ }
+ dbdelta -= aglen;
+ printk("will lower with %d\n",
+ aglen - XFS_PREALLOC_BLOCKS(mp));
+ fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp);
+ }
+ /*
+ * Check that we removed all blocks
+ */
+ ASSERT(!dbdelta);
+ ASSERT(nagcount < oagcount);
+
+ printk("to free: %d, oagcount=%d, nagcount=%d\n",
+ fdbdelta, oagcount, nagcount);
+
+ xfs_trans_agblocks_delta(tp, -((long)fdbdelta));
+ xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
+ xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, nb - mp->m_sb.sb_dblocks);
+ xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, -((int64_t)fdbdelta));
+
+ if (dpct)
+ xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
+ error = xfs_trans_commit(tp, 0);
+ if (error) {
+ up_write(&mp->m_peraglock);
+ return error;
}
+ /* Free memory as the number of AG has changed */
+ for (agno = nagcount; agno < oagcount; agno++)
+ if (mp->m_perag[agno].pagb_list)
+ kmem_free(mp->m_perag[agno].pagb_list,
+ sizeof(xfs_perag_busy_t) *
+ XFS_PAGB_NUM_SLOTS);
+
+ mp->m_perag = kmem_realloc(mp->m_perag,
+ sizeof(xfs_perag_t) * nagcount,
+ sizeof(xfs_perag_t) * oagcount,
+ KM_SLEEP);
+ /* FIXME: here we could instead just lower
+ * nagimax to nagcount; is it better this way?
+ */
+ /* FIXME: why is this flag unconditionally set in growfs? */
+ mp->m_flags |= XFS_MOUNT_32BITINODES;
+ nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount);
+ up_write(&mp->m_peraglock);
+
+ xfs_update_sb(mp, nagimax, nagcount);
return 0;
error0:
+ up_write(&mp->m_peraglock);
xfs_trans_cancel(tp, XFS_TRANS_ABORT);
return error;
}
@@ -435,7 +665,10 @@
int error;
if (!cpsema(&mp->m_growlock))
return XFS_ERROR(EWOULDBLOCK);
- error = xfs_growfs_data_private(mp, in);
+ if(in->newblocks < mp->m_sb.sb_dblocks)
+ error = xfs_shrinkfs_data_private(mp, in);
+ else
+ error = xfs_growfs_data_private(mp, in);
vsema(&mp->m_growlock);
return error;
}
@@ -633,7 +866,7 @@
xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
thaw_bdev(sb->s_bdev, sb);
}
-
+
break;
}
case XFS_FSOP_GOING_FLAGS_LOGFLUSH:
diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_trans.c linux-2.6-xfs.shrink/fs/xfs/xfs_trans.c
--- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_trans.c 2007-06-05 17:40:51.000000000 +0200
+++ linux-2.6-xfs.shrink/fs/xfs/xfs_trans.c 2007-06-07 23:01:03.000000000 +0200
@@ -503,11 +503,9 @@
tp->t_res_frextents_delta += delta;
break;
case XFS_TRANS_SB_DBLOCKS:
- ASSERT(delta > 0);
tp->t_dblocks_delta += delta;
break;
case XFS_TRANS_SB_AGCOUNT:
- ASSERT(delta > 0);
tp->t_agcount_delta += delta;
break;
case XFS_TRANS_SB_IMAXPCT:
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [PATCH] Implement shrink of empty AGs 2007-06-10 16:40 [PATCH] Implement shrink of empty AGs Iustin Pop @ 2007-06-12 2:40 ` David Chinner 2007-06-12 4:25 ` Eric Sandeen 2007-06-14 6:01 ` Iustin Pop 0 siblings, 2 replies; 7+ messages in thread From: David Chinner @ 2007-06-12 2:40 UTC (permalink / raw) To: xfs; +Cc: iusty On Sun, Jun 10, 2007 at 06:40:14PM +0200, Iustin Pop wrote: > The attached patch implements shrinking of completely empty allocation > groups. The patch is against current CVS and modifies two files: > - xfs_trans.c to remove two asserts in which prevent lowering the > number of AGs or filesystem blocks; > - xfs_fsops.c where it does: > - modify xfs_growfs_data() to branch to either > xfs_growfs_data_private or xfs_shrinkfs_data private depending on > the new size of the fs > - abstract the last part of xfs_growfs_data_private (the modify of > all the superblocks) into a separate function, xfs_update_sb(), > which is called both from shrink and grow > - add the new xfs_shrinkfs_data_private function, mostly based on > the growfs function comments are all inline.... > > There are many printk()'s left in the patch, I left them as they show > where I compute some important values. There are also many FIXMEs in the > comments showing what parts I didn't understand or was not sure about > (not that these are the only ones...). Probably for a real patch, > xfs-specific debug hooks need to be added and the printk()s removed. > > The patch works on UML and QEMU virtual machines, both in UP and SMP. I > just tested many shrink/grow operations and verified with xfs_repair > that the fs is not corrupted. The free space counters seem to be correct > after shrink. > > Note that you also need to remove the check from xfs_growfs.c of not > allowing to shrink the filesystem. > > regards, > iustin > diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c > --- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c 2007-06-09 18:56:21.509308225 +0200 > +++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c 2007-06-10 18:32:36.074856477 +0200 > @@ -112,6 +112,53 @@ > return 0; > } > > +static void xfs_update_sb( STATIC void xfs_growfs_update_sb( > + xfs_mount_t *mp, /* mount point for filesystem */ > + xfs_agnumber_t nagimax, > + xfs_agnumber_t nagcount) /* new number of a.g. */ tabs, not spaces (and tabs are 8 spaces). > +{ > + xfs_agnumber_t agno; > + xfs_buf_t *bp; > + xfs_sb_t *sbp; > + int error; > + > + /* New allocation groups fully initialized, so update mount struct */ > + if (nagimax) > + mp->m_maxagi = nagimax; > + if (mp->m_sb.sb_imax_pct) { > + __uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct; I'd prefer to have long lines like this split. > + do_div(icount, 100); > + mp->m_maxicount = icount << mp->m_sb.sb_inopblog; > + } else > + mp->m_maxicount = 0; Insert empty line. > + for (agno = 1; agno < nagcount; agno++) { > + error = xfs_read_buf(mp, mp->m_ddev_targp, > + XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)), > + XFS_FSS_TO_BB(mp, 1), 0, &bp); > + if (error) { > + xfs_fs_cmn_err(CE_WARN, mp, > + "error %d reading secondary superblock for ag %d", > + error, agno); > + break; > + } > + sbp = XFS_BUF_TO_SBP(bp); > + xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS); Insert empty line > + /* > + * If we get an error writing out the alternate superblocks, > + * just issue a warning and continue. The real work is > + * already done and committed. > + */ > + if (!(error = xfs_bwrite(mp, bp))) { > + continue; > + } else { > + xfs_fs_cmn_err(CE_WARN, mp, > + "write error %d updating secondary superblock for ag %d", > + error, agno); > + break; /* no point in continuing */ > + } > + } error = xfs_bwrite(mp, bp); if (error) { xfs_fs_cmn_err(...) break; } } > +} > + > static int > xfs_growfs_data_private( > xfs_mount_t *mp, /* mount point for filesystem */ > @@ -135,7 +182,6 @@ > xfs_rfsblock_t nfree; > xfs_agnumber_t oagcount; > int pct; > - xfs_sb_t *sbp; > xfs_trans_t *tp; > > nb = in->newblocks; > @@ -356,44 +402,228 @@ > if (error) { > return error; > } > - /* New allocation groups fully initialized, so update mount struct */ > - if (nagimax) > - mp->m_maxagi = nagimax; > - if (mp->m_sb.sb_imax_pct) { > - __uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct; > - do_div(icount, 100); > - mp->m_maxicount = icount << mp->m_sb.sb_inopblog; > - } else > - mp->m_maxicount = 0; > - for (agno = 1; agno < nagcount; agno++) { > - error = xfs_read_buf(mp, mp->m_ddev_targp, > - XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)), > - XFS_FSS_TO_BB(mp, 1), 0, &bp); > + xfs_update_sb(mp, nagimax, nagcount); > + return 0; > + > + error0: > + xfs_trans_cancel(tp, XFS_TRANS_ABORT); > + return error; > +} > + > +static int STATIC int > +xfs_shrinkfs_data_private( > + xfs_mount_t *mp, /* mount point for filesystem */ > + xfs_growfs_data_t *in) /* growfs data input struct */ whitespace issues > +{ > + xfs_agf_t *agf; > + xfs_agnumber_t agno; > + xfs_buf_t *bp; > + int dpct; > + int error; > + xfs_agnumber_t nagcount; /* new AG count */ > + xfs_agnumber_t oagcount; /* old AG count */ > + xfs_agnumber_t nagimax = 0; > + xfs_rfsblock_t nb, nb_mod; > + xfs_rfsblock_t dbdelta; /* will be used as a > + check that we > + shrink the fs by > + the correct number > + of blocks */ > + xfs_rfsblock_t fdbdelta; /* will keep track of > + how many ag blocks > + we need to > + remove */ Long comments like this don't go on the declaration. Put them where the variable is initialised or first used. > + int pct; > + xfs_trans_t *tp; > + > + nb = in->newblocks; > + pct = in->imaxpct; > + if (nb >= mp->m_sb.sb_dblocks || pct < 0 || pct > 100) > + return XFS_ERROR(EINVAL); > + dpct = pct - mp->m_sb.sb_imax_pct; This next bit: > + error = xfs_read_buf(mp, mp->m_ddev_targp, > + XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1), > + XFS_FSS_TO_BB(mp, 1), 0, &bp); > + if (error) > + return error; > + ASSERT(bp); > + /* FIXME: we release the buffer here manually because we are > + * outside of a transaction? The other buffers read using the > + * functions which take a tp parameter are not released in > + * growfs > + */ > + xfs_buf_relse(bp); Should not be necessary - we don't need to check if the new last filesystem block is beyond EOF because we are shrinking.... To answer the FIXME - xfs_trans_commit() releases locked buffers and inodes that have been joined ot the transaction unless they have also been held. So if you are outside a transaction, you do have to ensure you release any buffers you read. > + /* Do basic checks (at the fs level) */ > + oagcount = mp->m_sb.sb_agcount; > + nagcount = nb; > + nb_mod = do_div(nagcount, mp->m_sb.sb_agblocks); > + if(nb_mod) { > + printk("not shrinking on an AG boundary (diff=%d)\n", nb_mod); > + return XFS_ERROR(ENOSPC); EINVAL, I think. > + } > + if(nagcount < 2) { > + printk("refusing to shrink below 2 AGs\n"); > + return XFS_ERROR(ENOSPC); EINVAL. > + } > + if(nagcount >= oagcount) { > + printk("number of AGs will not decrease\n"); > + return XFS_ERROR(EINVAL); > + } > + printk("Cur ag=%d, cur blocks=%llu\n", > + mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks); > + printk("New ag=%d, new blocks=%d\n", nagcount, nb); > + > + printk("Will resize from %llu to %d, delta is %llu\n", > + mp->m_sb.sb_dblocks, nb, mp->m_sb.sb_dblocks - nb); > + /* Check to see if we trip over the log section */ > + printk("logstart=%llu logblocks=%u\n", > + mp->m_sb.sb_logstart, mp->m_sb.sb_logblocks); > + if (nb < mp->m_sb.sb_logstart + mp->m_sb.sb_logblocks) > + return XFS_ERROR(EINVAL); Insert empty line > + /* dbdelta starts at the diff and must become zero */ > + dbdelta = mp->m_sb.sb_dblocks - nb; > + tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS); > + printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta); > + if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta, > + XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) { > + xfs_trans_cancel(tp, 0); > + return error; > + } What's the dbdelta part of the reservation for? That's reserving dbdelta blocks for *allocations*, so I don't think this is right.... > + > + fdbdelta = 0; > + > + /* Per-AG checks */ > + /* FIXME: do we need to hold m_peraglock while doing this? */ Yes. > + /* I think that since we do read and write to the m_perag > + * stuff, we should be holding the lock for the entire walk & > + * modify of the fs > + */ Deadlock warning! holding the m_peraglock in write mode will cause allocation deadlocks if you are not careful as all allocation/free operations take the m_peraglock in read mode. (And yes, growing an active, loaded filesystem can deadlock because of this.) > + /* Note that because we hold the lock, on any error+early > + * return, we must either release manually and return, or > + * jump to error0 > + */ whitespace. > + down_write(&mp->m_peraglock); > + for(agno = oagcount - 1; agno >= nagcount; agno--) { > + xfs_extlen_t usedblks; /* total used blocks in this a.g. */ > + xfs_extlen_t freeblks; /* free blocks in this a.g. */ > + xfs_agblock_t aglen; /* this ag's len */ > + struct xfs_perag *pag; /* the m_perag structure */ > + > + printk("doing agno=%d\n", agno); > + > + pag = &mp->m_perag[agno]; > + > + error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp); > if (error) { > - xfs_fs_cmn_err(CE_WARN, mp, > - "error %d reading secondary superblock for ag %d", > - error, agno); > - break; > + goto error0; > } > - sbp = XFS_BUF_TO_SBP(bp); > - xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS); > + ASSERT(bp); > + agf = XFS_BUF_TO_AGF(bp); > + aglen = INT_GET(agf->agf_length, ARCH_CONVERT); > + > + /* read the pagf/pagi if not already initialized */ > + /* agf should be initialized because of the ablove read_agf */ > + ASSERT(pag->pagf_init); > + if (!pag->pagi_init) { > + if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bp))) > + goto error0; I don't think you should be overwriting bp here.... > + ASSERT(pag->pagi_init); > + } > + Because now you have bp potentially pointing to two different buffers. > /* > - * If we get an error writing out the alternate superblocks, > - * just issue a warning and continue. The real work is > - * already done and committed. > + * Check the inodes: as long as we have pagi_count == > + * pagi_freecount == 0, then: a) we don't have to > + * update any global inode counters, and b) there are > + * no extra blocks in inode btrees > */ > - if (!(error = xfs_bwrite(mp, bp))) { > - continue; > - } else { > - xfs_fs_cmn_err(CE_WARN, mp, > - "write error %d updating secondary superblock for ag %d", > - error, agno); > - break; /* no point in continuing */ > + if(pag->pagi_count > 0 || > + pag->pagi_freecount > 0) { > + printk("agi %d has %d inodes in total and %d free\n", > + agno, pag->pagi_count, pag->pagi_freecount); > + error = XFS_ERROR(ENOSPC); > + goto error0; > + } > + > + /* Check the AGF: if levels[] == 1, then there should > + * be no extra blocks in the btrees beyond the ones > + * at the beggining of the AG > + */ > + if(pag->pagf_levels[XFS_BTNUM_BNOi] > 1 || > + pag->pagf_levels[XFS_BTNUM_CNTi] > 1) { > + printk("agf %d has level %d bt and %d cnt\n", > + agno, > + pag->pagf_levels[XFS_BTNUM_BNOi], > + pag->pagf_levels[XFS_BTNUM_CNTi]); > + error = XFS_ERROR(ENOSPC); > + goto error0; > } ok, so we have empty ag's here. You might want to check that the inode btree is empty and that the AGI unlinked list is empty. > + > + freeblks = pag->pagf_freeblks; > + printk("Usage: %d prealloc, %d flcount\n", > + XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount); > + > + /* Done gathering data, check sizes */ > + usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount; > + printk("agno=%d agf_length=%d computed used=%d" > + " known free=%d\n", agno, aglen, usedblks, freeblks); > + > + if(usedblks + freeblks != aglen) { > + printk("agno %d is not free (%d blocks allocated)\n", > + agno, aglen-usedblks-freeblks); > + error = XFS_ERROR(ENOSPC); > + goto error0; > + } > + dbdelta -= aglen; > + printk("will lower with %d\n", > + aglen - XFS_PREALLOC_BLOCKS(mp)); > + fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp); Ok, so why not just fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp); > + } > + /* > + * Check that we removed all blocks > + */ > + ASSERT(!dbdelta); > + ASSERT(nagcount < oagcount); Error out, not assert, because at this point we have not changed anything. > + > + printk("to free: %d, oagcount=%d, nagcount=%d\n", > + fdbdelta, oagcount, nagcount); > + > + xfs_trans_agblocks_delta(tp, -((long)fdbdelta)); > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount); > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, nb - mp->m_sb.sb_dblocks); > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, -((int64_t)fdbdelta)); > + > + if (dpct) > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct); > + error = xfs_trans_commit(tp, 0); > + if (error) { > + up_write(&mp->m_peraglock); > + return error; > } > + /* Free memory as the number of AG has changed */ > + for (agno = nagcount; agno < oagcount; agno++) > + if (mp->m_perag[agno].pagb_list) > + kmem_free(mp->m_perag[agno].pagb_list, > + sizeof(xfs_perag_busy_t) * > + XFS_PAGB_NUM_SLOTS); > + > + mp->m_perag = kmem_realloc(mp->m_perag, > + sizeof(xfs_perag_t) * nagcount, > + sizeof(xfs_perag_t) * oagcount, > + KM_SLEEP); This is not really safe - how do we know if all the users of the higher AGs have gone away? I think we should simply just zero out the unused AGs and don't worry about a realloc(). > + /* FIXME: here we could instead just lower > + * nagimax to nagcount; is it better this way? > + */ Not really. > + /* FIXME: why is this flag unconditionally set in growfs? */ > + mp->m_flags |= XFS_MOUNT_32BITINODES; good question. I don't think it should be there but I'll have to do some digging.... > + nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount); > + up_write(&mp->m_peraglock); > + > + xfs_update_sb(mp, nagimax, nagcount); > return 0; > > error0: > + up_write(&mp->m_peraglock); > xfs_trans_cancel(tp, XFS_TRANS_ABORT); > return error; > } > @@ -435,7 +665,10 @@ > int error; > if (!cpsema(&mp->m_growlock)) > return XFS_ERROR(EWOULDBLOCK); > - error = xfs_growfs_data_private(mp, in); > + if(in->newblocks < mp->m_sb.sb_dblocks) > + error = xfs_shrinkfs_data_private(mp, in); > + else > + error = xfs_growfs_data_private(mp, in); Hmmm - that's using the one ioctl to do both grow and shrink. I'd prefer a new shrink ioctl rather than changing the behaviour of an existing ioctl. Looks like a good start ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Implement shrink of empty AGs 2007-06-12 2:40 ` David Chinner @ 2007-06-12 4:25 ` Eric Sandeen 2007-06-14 6:01 ` Iustin Pop 1 sibling, 0 replies; 7+ messages in thread From: Eric Sandeen @ 2007-06-12 4:25 UTC (permalink / raw) To: David Chinner; +Cc: xfs, iusty David Chinner wrote: > On Sun, Jun 10, 2007 at 06:40:14PM +0200, Iustin Pop wrote: ... >> + /* FIXME: why is this flag unconditionally set in growfs? */ >> + mp->m_flags |= XFS_MOUNT_32BITINODES; >> + nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount); > good question. I don't think it should be there but I'll have to > do some digging.... http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_fsops.c#rev1.72 Thu Dec 6 19:26:09 2001 UTC (5 years, 6 months ago) by lord Add in the 32 bit inode mount flag before re initializing the perag structures in growfs. http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_fsops.c.diff?r1=1.71;r2=1.72 but, it seems harmless because it immediately calls xfs_initialize_perag which does: /* Clear the mount flag if no inode can overflow 32 bits * on this filesystem, or if specifically requested.. */ if ((mp->m_flags & XFS_MOUNT_32BITINOOPT) && ino > max_inum) { mp->m_flags |= XFS_MOUNT_32BITINODES; } else { mp->m_flags &= ~XFS_MOUNT_32BITINODES; } so I think it sets it (or clears it) properly in any case. I'd probably remove the setting before initialize_perag though as it's superfluous... that was added after steve's change... http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.c#rev1.335 Mon Sep 8 05:46:42 2003 UTC (3 years, 9 months ago) by nathans Add inode64 mount option; fix case where growfs can push 32 bit inodes into 64 bit space accidentally - both changes originally from IRIX http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.c.diff?r1=1.334;r2=1.335 (previously it would always clear the flag if max inode was < 32 bits..) ... so yeah, looks like the setting in question can/should go. -Eric ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Implement shrink of empty AGs 2007-06-12 2:40 ` David Chinner 2007-06-12 4:25 ` Eric Sandeen @ 2007-06-14 6:01 ` Iustin Pop 2007-06-14 9:00 ` David Chinner 1 sibling, 1 reply; 7+ messages in thread From: Iustin Pop @ 2007-06-14 6:01 UTC (permalink / raw) To: David Chinner; +Cc: xfs [-- Attachment #1: Type: text/plain, Size: 17332 bytes --] On Tue, Jun 12, 2007 at 12:40:25PM +1000, David Chinner wrote: > > diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c > > --- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c 2007-06-09 18:56:21.509308225 +0200 > > +++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c 2007-06-10 18:32:36.074856477 +0200 > > @@ -112,6 +112,53 @@ > > return 0; > > } > > > > +static void xfs_update_sb( > > STATIC void > xfs_growfs_update_sb( this was because xfs_growfs_private is also static and not STATIC. Should I change the def for it also? > > + xfs_mount_t *mp, /* mount point for filesystem */ > > + xfs_agnumber_t nagimax, > > + xfs_agnumber_t nagcount) /* new number of a.g. */ > > tabs, not spaces (and tabs are 8 spaces). sorry, I though I got all of these. There are some more in the def of xfs_reserve_blocks, btw. > > +{ > > + xfs_agnumber_t agno; > > + xfs_buf_t *bp; > > + xfs_sb_t *sbp; > > + int error; > > + > > + /* New allocation groups fully initialized, so update mount struct */ > > + if (nagimax) > > + mp->m_maxagi = nagimax; > > + if (mp->m_sb.sb_imax_pct) { > > + __uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct; > > I'd prefer to have long lines like this split. Ok, I hope I split them ok. > > + do_div(icount, 100); > > + mp->m_maxicount = icount << mp->m_sb.sb_inopblog; > > + } else > > + mp->m_maxicount = 0; > > Insert empty line. done. > > + for (agno = 1; agno < nagcount; agno++) { > > + error = xfs_read_buf(mp, mp->m_ddev_targp, > > + XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)), > > + XFS_FSS_TO_BB(mp, 1), 0, &bp); > > + if (error) { > > + xfs_fs_cmn_err(CE_WARN, mp, > > + "error %d reading secondary superblock for ag %d", > > + error, agno); > > + break; > > + } > > + sbp = XFS_BUF_TO_SBP(bp); > > + xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS); > > Insert empty line done. > > + /* > > + * If we get an error writing out the alternate superblocks, > > + * just issue a warning and continue. The real work is > > + * already done and committed. > > + */ > > + if (!(error = xfs_bwrite(mp, bp))) { > > + continue; > > + } else { > > + xfs_fs_cmn_err(CE_WARN, mp, > > + "write error %d updating secondary superblock for ag %d", > > + error, agno); > > + break; /* no point in continuing */ > > + } > > + } > > error = xfs_bwrite(mp, bp); > if (error) { > xfs_fs_cmn_err(...) > break; ok. (the original version was purely a move from xfs_growfs_data_private to a separate function) > } > } > > +} > > + > > static int > > xfs_growfs_data_private( > > xfs_mount_t *mp, /* mount point for filesystem */ > > @@ -135,7 +182,6 @@ > > xfs_rfsblock_t nfree; > > xfs_agnumber_t oagcount; > > int pct; > > - xfs_sb_t *sbp; > > xfs_trans_t *tp; > > > > nb = in->newblocks; > > @@ -356,44 +402,228 @@ > > if (error) { > > return error; > > } > > - /* New allocation groups fully initialized, so update mount struct */ > > - if (nagimax) > > - mp->m_maxagi = nagimax; > > - if (mp->m_sb.sb_imax_pct) { > > - __uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct; > > - do_div(icount, 100); > > - mp->m_maxicount = icount << mp->m_sb.sb_inopblog; > > - } else > > - mp->m_maxicount = 0; > > - for (agno = 1; agno < nagcount; agno++) { > > - error = xfs_read_buf(mp, mp->m_ddev_targp, > > - XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)), > > - XFS_FSS_TO_BB(mp, 1), 0, &bp); > > + xfs_update_sb(mp, nagimax, nagcount); > > + return 0; > > + > > + error0: > > + xfs_trans_cancel(tp, XFS_TRANS_ABORT); > > + return error; > > +} > > + > > +static int > > STATIC int done. > > +xfs_shrinkfs_data_private( > > + xfs_mount_t *mp, /* mount point for filesystem */ > > + xfs_growfs_data_t *in) /* growfs data input struct */ > > whitespace issues fixed (I think you were referring to the alignement of the two argument lines). > > +{ > > + xfs_agf_t *agf; > > + xfs_agnumber_t agno; > > + xfs_buf_t *bp; > > + int dpct; > > + int error; > > + xfs_agnumber_t nagcount; /* new AG count */ > > + xfs_agnumber_t oagcount; /* old AG count */ > > + xfs_agnumber_t nagimax = 0; > > + xfs_rfsblock_t nb, nb_mod; > > + xfs_rfsblock_t dbdelta; /* will be used as a > > + check that we > > + shrink the fs by > > + the correct number > > + of blocks */ > > + xfs_rfsblock_t fdbdelta; /* will keep track of > > + how many ag blocks > > + we need to > > + remove */ > > Long comments like this don't go on the declaration. Put them where > the variable is initialised or first used. ok. > > + int pct; > > + xfs_trans_t *tp; > > + > > + nb = in->newblocks; > > + pct = in->imaxpct; > > + if (nb >= mp->m_sb.sb_dblocks || pct < 0 || pct > 100) > > + return XFS_ERROR(EINVAL); > > + dpct = pct - mp->m_sb.sb_imax_pct; > > This next bit: > > > + error = xfs_read_buf(mp, mp->m_ddev_targp, > > + XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1), > > + XFS_FSS_TO_BB(mp, 1), 0, &bp); > > + if (error) > > + return error; > > + ASSERT(bp); > > + /* FIXME: we release the buffer here manually because we are > > + * outside of a transaction? The other buffers read using the > > + * functions which take a tp parameter are not released in > > + * growfs > > + */ > > + xfs_buf_relse(bp); > > Should not be necessary - we don't need to check if the new last > filesystem block is beyond EOF because we are shrinking.... Ah, now I understand what this does. I just copied it from growfs without realising what it does. Removed. > To answer the FIXME - xfs_trans_commit() releases locked buffers and > inodes that have been joined ot the transaction unless they have > also been held. So if you are outside a transaction, you do have to > ensure you release any buffers you read. Thanks for the clarification! > > + /* Do basic checks (at the fs level) */ > > + oagcount = mp->m_sb.sb_agcount; > > + nagcount = nb; > > + nb_mod = do_div(nagcount, mp->m_sb.sb_agblocks); > > + if(nb_mod) { > > + printk("not shrinking on an AG boundary (diff=%d)\n", nb_mod); > > + return XFS_ERROR(ENOSPC); > > EINVAL, I think. fixed. > > + } > > + if(nagcount < 2) { > > + printk("refusing to shrink below 2 AGs\n"); > > + return XFS_ERROR(ENOSPC); > > EINVAL. fixed. > > + } > > + if(nagcount >= oagcount) { > > + printk("number of AGs will not decrease\n"); > > + return XFS_ERROR(EINVAL); > > + } > > + printk("Cur ag=%d, cur blocks=%llu\n", > > + mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks); > > + printk("New ag=%d, new blocks=%d\n", nagcount, nb); > > + > > + printk("Will resize from %llu to %d, delta is %llu\n", > > + mp->m_sb.sb_dblocks, nb, mp->m_sb.sb_dblocks - nb); > > + /* Check to see if we trip over the log section */ > > + printk("logstart=%llu logblocks=%u\n", > > + mp->m_sb.sb_logstart, mp->m_sb.sb_logblocks); > > + if (nb < mp->m_sb.sb_logstart + mp->m_sb.sb_logblocks) > > + return XFS_ERROR(EINVAL); > > Insert empty line Done. > > + /* dbdelta starts at the diff and must become zero */ > > + dbdelta = mp->m_sb.sb_dblocks - nb; > > + tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS); > > + printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta); > > + if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta, > > + XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) { > > + xfs_trans_cancel(tp, 0); > > + return error; > > + } > > What's the dbdelta part of the reservation for? That's reserving dbdelta > blocks for *allocations*, so I don't think this is right.... Well, we'll shrink the filesystem by dbdelta, so the filesystem needs to have enough free space to do it. Whether this space is in the correct place (the AGs we want to shrink) or not is a question answered by the per-AG checking, but since we will reduce the freespace by this amount, I thought that it's safer to mark this space in use. Did I misread the intent of xfs_trans_reserve? Also, unless I'm mistaken and don't remember correctly, you have to reserve the amount of space one will pass to xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, ...) otherwise the transaction code complains about exceeding your reservation. > > + > > + fdbdelta = 0; > > + > > + /* Per-AG checks */ > > + /* FIXME: do we need to hold m_peraglock while doing this? */ > > Yes. > > > + /* I think that since we do read and write to the m_perag > > + * stuff, we should be holding the lock for the entire walk & > > + * modify of the fs > > + */ > > Deadlock warning! holding the m_peraglock in write mode will cause > allocation deadlocks if you are not careful as all allocation/free > operations take the m_peraglock in read mode. (And yes, growing > an active, loaded filesystem can deadlock because of this.) How can we ensure then that no one modifies the AGs while we walk them? I hoped that we can do it without the per-AG not-avail flag, by just holding the perag lock. Or you mean we should keep the read lock, and grab also the write lock when actually modifying the perag structure? (Or should that be grab read lock for checking, release read lock - hope nobody modifies AGs - grab write lock?) > > + /* Note that because we hold the lock, on any error+early > > + * return, we must either release manually and return, or > > + * jump to error0 > > + */ > > whitespace. fixed. > > + down_write(&mp->m_peraglock); > > + for(agno = oagcount - 1; agno >= nagcount; agno--) { > > + xfs_extlen_t usedblks; /* total used blocks in this a.g. */ > > + xfs_extlen_t freeblks; /* free blocks in this a.g. */ > > + xfs_agblock_t aglen; /* this ag's len */ > > + struct xfs_perag *pag; /* the m_perag structure */ > > + > > + printk("doing agno=%d\n", agno); > > + > > + pag = &mp->m_perag[agno]; > > + > > + error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp); > > if (error) { > > - xfs_fs_cmn_err(CE_WARN, mp, > > - "error %d reading secondary superblock for ag %d", > > - error, agno); > > - break; > > + goto error0; > > } > > - sbp = XFS_BUF_TO_SBP(bp); > > - xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS); > > + ASSERT(bp); > > + agf = XFS_BUF_TO_AGF(bp); > > + aglen = INT_GET(agf->agf_length, ARCH_CONVERT); > > + > > + /* read the pagf/pagi if not already initialized */ > > + /* agf should be initialized because of the ablove read_agf */ > > + ASSERT(pag->pagf_init); > > + if (!pag->pagi_init) { > > + if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bp))) > > + goto error0; > > I don't think you should be overwriting bp here.... > > > + ASSERT(pag->pagi_init); > > + } > > + > > Because now you have bp potentially pointing to two different buffers. Yes, indeed, sloppy programming. Fixed. > > /* > > - * If we get an error writing out the alternate superblocks, > > - * just issue a warning and continue. The real work is > > - * already done and committed. > > + * Check the inodes: as long as we have pagi_count == > > + * pagi_freecount == 0, then: a) we don't have to > > + * update any global inode counters, and b) there are > > + * no extra blocks in inode btrees > > */ > > - if (!(error = xfs_bwrite(mp, bp))) { > > - continue; > > - } else { > > - xfs_fs_cmn_err(CE_WARN, mp, > > - "write error %d updating secondary superblock for ag %d", > > - error, agno); > > - break; /* no point in continuing */ > > + if(pag->pagi_count > 0 || > > + pag->pagi_freecount > 0) { > > + printk("agi %d has %d inodes in total and %d free\n", > > + agno, pag->pagi_count, pag->pagi_freecount); > > + error = XFS_ERROR(ENOSPC); > > + goto error0; > > + } > > + > > + /* Check the AGF: if levels[] == 1, then there should > > + * be no extra blocks in the btrees beyond the ones > > + * at the beggining of the AG > > + */ > > + if(pag->pagf_levels[XFS_BTNUM_BNOi] > 1 || > > + pag->pagf_levels[XFS_BTNUM_CNTi] > 1) { > > + printk("agf %d has level %d bt and %d cnt\n", > > + agno, > > + pag->pagf_levels[XFS_BTNUM_BNOi], > > + pag->pagf_levels[XFS_BTNUM_CNTi]); > > + error = XFS_ERROR(ENOSPC); > > + goto error0; > > } > > ok, so we have empty ag's here. You might want to check that the > inode btree is empty and that the AGI unlinked list is empty. I thought that inode btree is empty by virtue of pag->pagi_count == 0. Is this not always so? Furthermore, also since agi_count == agi_free + actual used inodes + number of unlinked inodes, I think we don't need to check the unlinked list. > > + > > + freeblks = pag->pagf_freeblks; > > + printk("Usage: %d prealloc, %d flcount\n", > > + XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount); > > + > > + /* Done gathering data, check sizes */ > > + usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount; > > + printk("agno=%d agf_length=%d computed used=%d" > > + " known free=%d\n", agno, aglen, usedblks, freeblks); > > + > > + if(usedblks + freeblks != aglen) { > > + printk("agno %d is not free (%d blocks allocated)\n", > > + agno, aglen-usedblks-freeblks); > > + error = XFS_ERROR(ENOSPC); > > + goto error0; > > + } > > + dbdelta -= aglen; > > + printk("will lower with %d\n", > > + aglen - XFS_PREALLOC_BLOCKS(mp)); > > + fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp); > > Ok, so why not just > > fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp); Because the last AG can be smaller than the sb_agblocks. It's true that this holds only for the last one, but having two cases is uglier than just always reading this size from the AG. > > + } > > + /* > > + * Check that we removed all blocks > > + */ > > + ASSERT(!dbdelta); > > + ASSERT(nagcount < oagcount); > > Error out, not assert, because at this point we have not changed anything. Actually here, !dbdelta or nacount < oagacount are programming errors and not possible correct conditions. They could be removed, sine the checks before the per-AG for loop ensure these conditions are not met. But I used them just to be sure. > > + > > + printk("to free: %d, oagcount=%d, nagcount=%d\n", > > + fdbdelta, oagcount, nagcount); > > + > > + xfs_trans_agblocks_delta(tp, -((long)fdbdelta)); > > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount); > > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, nb - mp->m_sb.sb_dblocks); > > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, -((int64_t)fdbdelta)); > > + > > + if (dpct) > > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct); > > + error = xfs_trans_commit(tp, 0); > > + if (error) { > > + up_write(&mp->m_peraglock); > > + return error; > > } > > + /* Free memory as the number of AG has changed */ > > + for (agno = nagcount; agno < oagcount; agno++) > > + if (mp->m_perag[agno].pagb_list) > > + kmem_free(mp->m_perag[agno].pagb_list, > > + sizeof(xfs_perag_busy_t) * > > + XFS_PAGB_NUM_SLOTS); > > + > > + mp->m_perag = kmem_realloc(mp->m_perag, > > + sizeof(xfs_perag_t) * nagcount, > > + sizeof(xfs_perag_t) * oagcount, > > + KM_SLEEP); > > This is not really safe - how do we know if all the users of the > higher AGs have gone away? I think we should simply just zero out > the unused AGs and don't worry about a realloc(). The problem that I saw is that if you do shrink+grow+shrink+grow+... you will leak a small amount of memory (or corrupt kernel mem allocation?) since the growfs code does a realloc from what it thinks is the size of m_perag, which it computes solely from the current number of AGs. Should we have a size in the mp struct for this and not assume it's the agcount? > > + /* FIXME: here we could instead just lower > > + * nagimax to nagcount; is it better this way? > > + */ > > Not really. Ok, removed comment. > > + /* FIXME: why is this flag unconditionally set in growfs? */ > > + mp->m_flags |= XFS_MOUNT_32BITINODES; > > good question. I don't think it should be there but I'll have to > do some digging.... Per Eric's mail, I removed both the comment and the flag setting. > > + nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount); > > + up_write(&mp->m_peraglock); > > + > > + xfs_update_sb(mp, nagimax, nagcount); > > return 0; > > > > error0: > > + up_write(&mp->m_peraglock); > > xfs_trans_cancel(tp, XFS_TRANS_ABORT); > > return error; > > } > > @@ -435,7 +665,10 @@ > > int error; > > if (!cpsema(&mp->m_growlock)) > > return XFS_ERROR(EWOULDBLOCK); > > - error = xfs_growfs_data_private(mp, in); > > + if(in->newblocks < mp->m_sb.sb_dblocks) > > + error = xfs_shrinkfs_data_private(mp, in); > > + else > > + error = xfs_growfs_data_private(mp, in); > > Hmmm - that's using the one ioctl to do both grow and shrink. I'd > prefer a new shrink ioctl rather than changing the behaviour of an > existing ioctl. Well, I chose this way because I see it as the ioctl that changes the data size of the filesystem. It may be lower or higher than the current size, but that is not so important :), and if another ioctl there would be the need for another tool. > Looks like a good start ;) thanks! and also thanks for taking time to look at this. updated patch (without the separation of the IOCTL and without the rework of the perag lock) is attached. iustin [-- Attachment #2: patch-nice-5 --] [-- Type: text/plain, Size: 10514 bytes --] diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c --- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c 2007-06-09 18:56:21.509308225 +0200 +++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c 2007-06-14 07:54:50.580420252 +0200 @@ -112,6 +112,55 @@ return 0; } +STATIC void xfs_growfs_update_sb( + xfs_mount_t *mp, /* mount point for filesystem */ + xfs_agnumber_t nagimax, + xfs_agnumber_t nagcount) /* new number of a.g. */ +{ + xfs_agnumber_t agno; + xfs_buf_t *bp; + xfs_sb_t *sbp; + int error; + + /* New allocation groups fully initialized, so update mount struct */ + if (nagimax) + mp->m_maxagi = nagimax; + if (mp->m_sb.sb_imax_pct) { + __uint64_t icount = mp->m_sb.sb_dblocks * + mp->m_sb.sb_imax_pct; + do_div(icount, 100); + mp->m_maxicount = icount << mp->m_sb.sb_inopblog; + } else + mp->m_maxicount = 0; + + for (agno = 1; agno < nagcount; agno++) { + error = xfs_read_buf(mp, mp->m_ddev_targp, + XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)), + XFS_FSS_TO_BB(mp, 1), 0, &bp); + if (error) { + xfs_fs_cmn_err(CE_WARN, mp, + "error %d reading secondary superblock for ag %d", + error, agno); + break; + } + sbp = XFS_BUF_TO_SBP(bp); + xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS); + + /* + * If we get an error writing out the alternate superblocks, + * just issue a warning and continue. The real work is + * already done and committed. + */ + error = xfs_bwrite(mp, bp); + if(error) { + xfs_fs_cmn_err(CE_WARN, mp, + "write error %d updating secondary superblock for ag %d", + error, agno); + break; /* no point in continuing */ + } + } +} + static int xfs_growfs_data_private( xfs_mount_t *mp, /* mount point for filesystem */ @@ -135,7 +184,6 @@ xfs_rfsblock_t nfree; xfs_agnumber_t oagcount; int pct; - xfs_sb_t *sbp; xfs_trans_t *tp; nb = in->newblocks; @@ -356,44 +404,210 @@ if (error) { return error; } - /* New allocation groups fully initialized, so update mount struct */ - if (nagimax) - mp->m_maxagi = nagimax; - if (mp->m_sb.sb_imax_pct) { - __uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct; - do_div(icount, 100); - mp->m_maxicount = icount << mp->m_sb.sb_inopblog; - } else - mp->m_maxicount = 0; - for (agno = 1; agno < nagcount; agno++) { - error = xfs_read_buf(mp, mp->m_ddev_targp, - XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)), - XFS_FSS_TO_BB(mp, 1), 0, &bp); + xfs_growfs_update_sb(mp, nagimax, nagcount); + return 0; + + error0: + xfs_trans_cancel(tp, XFS_TRANS_ABORT); + return error; +} + +STATIC int +xfs_shrinkfs_data_private( + xfs_mount_t *mp, /* mount point for filesystem */ + xfs_growfs_data_t *in) /* growfs data input struct */ +{ + xfs_agf_t *agf; + xfs_agnumber_t agno; + int dpct; + int error; + xfs_agnumber_t nagcount; /* new AG count */ + xfs_agnumber_t oagcount; /* old AG count */ + xfs_agnumber_t nagimax = 0; + xfs_rfsblock_t nb, nb_mod; + xfs_rfsblock_t dbdelta; + xfs_rfsblock_t fdbdelta; + int pct; + xfs_trans_t *tp; + + nb = in->newblocks; + pct = in->imaxpct; + if (nb >= mp->m_sb.sb_dblocks || pct < 0 || pct > 100) + return XFS_ERROR(EINVAL); + dpct = pct - mp->m_sb.sb_imax_pct; + + /* Do basic checks (at the fs level) */ + oagcount = mp->m_sb.sb_agcount; + nagcount = nb; + nb_mod = do_div(nagcount, mp->m_sb.sb_agblocks); + if(nb_mod) { + printk("not shrinking on an AG boundary (diff=%d)\n", nb_mod); + return XFS_ERROR(EINVAL); + } + if(nagcount < 2) { + printk("refusing to shrink below 2 AGs\n"); + return XFS_ERROR(EINVAL); + } + if(nagcount >= oagcount) { + printk("number of AGs will not decrease\n"); + return XFS_ERROR(EINVAL); + } + printk("Cur ag=%d, cur blocks=%llu\n", + mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks); + printk("New ag=%d, new blocks=%d\n", nagcount, nb); + + printk("Will resize from %llu to %d, delta is %llu\n", + mp->m_sb.sb_dblocks, nb, mp->m_sb.sb_dblocks - nb); + /* Check to see if we trip over the log section */ + printk("logstart=%llu logblocks=%u\n", + mp->m_sb.sb_logstart, mp->m_sb.sb_logblocks); + if (nb < mp->m_sb.sb_logstart + mp->m_sb.sb_logblocks) + return XFS_ERROR(EINVAL); + + /* dbdelta starts at the diff and must become zero */ + dbdelta = mp->m_sb.sb_dblocks - nb; + tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS); + printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta); + if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta, + XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) { + xfs_trans_cancel(tp, 0); + return error; + } + + /* fbdelta keeps track of how many ag blocks we need to remove + * (this is different from the number of filesystem blocks) + */ + fdbdelta = 0; + + /* Per-AG checks */ + /* FIXME: do we need to hold m_peraglock while doing this? */ + /* I think that since we do read and write to the m_perag + * stuff, we should be holding the lock for the entire walk & + * modify of the fs + */ + /* Note that because we hold the lock, on any error+early + * return, we must either release manually and return, or jump + * to error0 + */ + down_write(&mp->m_peraglock); + for(agno = oagcount - 1; agno >= nagcount; agno--) { + xfs_extlen_t usedblks; /* total used blocks in this a.g. */ + xfs_extlen_t freeblks; /* free blocks in this a.g. */ + xfs_agblock_t aglen; /* this ag's len */ + struct xfs_perag *pag; /* the m_perag structure */ + xfs_buf_t *bpf; + xfs_buf_t *bpi; + + printk("doing agno=%d\n", agno); + + pag = &mp->m_perag[agno]; + + error = xfs_alloc_read_agf(mp, tp, agno, 0, &bpf); if (error) { - xfs_fs_cmn_err(CE_WARN, mp, - "error %d reading secondary superblock for ag %d", - error, agno); - break; + goto error0; } - sbp = XFS_BUF_TO_SBP(bp); - xfs_xlatesb(sbp, &mp->m_sb, -1, XFS_SB_ALL_BITS); + ASSERT(bpf); + agf = XFS_BUF_TO_AGF(bpf); + aglen = INT_GET(agf->agf_length, ARCH_CONVERT); + + /* read the pagf/pagi if not already initialized */ + /* agf should be initialized because of the ablove read_agf */ + ASSERT(pag->pagf_init); + if (!pag->pagi_init) { + if ((error = xfs_ialloc_read_agi(mp, tp, agno, &bpi))) + goto error0; + ASSERT(pag->pagi_init); + } + /* - * If we get an error writing out the alternate superblocks, - * just issue a warning and continue. The real work is - * already done and committed. + * Check the inodes: as long as we have pagi_count == + * pagi_freecount == 0, then: a) we don't have to + * update any global inode counters, and b) there are + * no extra blocks in inode btrees */ - if (!(error = xfs_bwrite(mp, bp))) { - continue; - } else { - xfs_fs_cmn_err(CE_WARN, mp, - "write error %d updating secondary superblock for ag %d", - error, agno); - break; /* no point in continuing */ + if(pag->pagi_count > 0 || + pag->pagi_freecount > 0) { + printk("agi %d has %d inodes in total and %d free\n", + agno, pag->pagi_count, pag->pagi_freecount); + error = XFS_ERROR(ENOSPC); + goto error0; } + + /* Check the AGF: if levels[] == 1, then there should + * be no extra blocks in the btrees beyond the ones + * at the beggining of the AG + */ + if(pag->pagf_levels[XFS_BTNUM_BNOi] > 1 || + pag->pagf_levels[XFS_BTNUM_CNTi] > 1) { + printk("agf %d has level %d bt and %d cnt\n", + agno, + pag->pagf_levels[XFS_BTNUM_BNOi], + pag->pagf_levels[XFS_BTNUM_CNTi]); + error = XFS_ERROR(ENOSPC); + goto error0; + } + + freeblks = pag->pagf_freeblks; + printk("Usage: %d prealloc, %d flcount\n", + XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount); + + /* Done gathering data, check sizes */ + usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount; + printk("agno=%d agf_length=%d computed used=%d" + " known free=%d\n", agno, aglen, usedblks, freeblks); + + if(usedblks + freeblks != aglen) { + printk("agno %d is not free (%d blocks allocated)\n", + agno, aglen-usedblks-freeblks); + error = XFS_ERROR(ENOSPC); + goto error0; + } + dbdelta -= aglen; + printk("will lower with %d\n", + aglen - XFS_PREALLOC_BLOCKS(mp)); + fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp); } + /* + * Check that we removed all blocks + */ + ASSERT(!dbdelta); + ASSERT(nagcount < oagcount); + + printk("to free: %d, oagcount=%d, nagcount=%d\n", + fdbdelta, oagcount, nagcount); + + xfs_trans_agblocks_delta(tp, -((long)fdbdelta)); + xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount); + xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, nb - mp->m_sb.sb_dblocks); + xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, -((int64_t)fdbdelta)); + + if (dpct) + xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct); + error = xfs_trans_commit(tp, 0); + if (error) { + up_write(&mp->m_peraglock); + return error; + } + /* Free memory as the number of AG has changed */ + for (agno = nagcount; agno < oagcount; agno++) + if (mp->m_perag[agno].pagb_list) + kmem_free(mp->m_perag[agno].pagb_list, + sizeof(xfs_perag_busy_t) * + XFS_PAGB_NUM_SLOTS); + + mp->m_perag = kmem_realloc(mp->m_perag, + sizeof(xfs_perag_t) * nagcount, + sizeof(xfs_perag_t) * oagcount, + KM_SLEEP); + + nagimax = xfs_initialize_perag(XFS_MTOVFS(mp), mp, nagcount); + up_write(&mp->m_peraglock); + + xfs_growfs_update_sb(mp, nagimax, nagcount); return 0; error0: + up_write(&mp->m_peraglock); xfs_trans_cancel(tp, XFS_TRANS_ABORT); return error; } @@ -435,7 +649,10 @@ int error; if (!cpsema(&mp->m_growlock)) return XFS_ERROR(EWOULDBLOCK); - error = xfs_growfs_data_private(mp, in); + if(in->newblocks < mp->m_sb.sb_dblocks) + error = xfs_shrinkfs_data_private(mp, in); + else + error = xfs_growfs_data_private(mp, in); vsema(&mp->m_growlock); return error; } @@ -633,7 +850,7 @@ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT); thaw_bdev(sb->s_bdev, sb); } - + break; } case XFS_FSOP_GOING_FLAGS_LOGFLUSH: diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_trans.c linux-2.6-xfs.shrink/fs/xfs/xfs_trans.c --- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_trans.c 2007-06-05 17:40:51.000000000 +0200 +++ linux-2.6-xfs.shrink/fs/xfs/xfs_trans.c 2007-06-07 23:01:03.000000000 +0200 @@ -503,11 +503,9 @@ tp->t_res_frextents_delta += delta; break; case XFS_TRANS_SB_DBLOCKS: - ASSERT(delta > 0); tp->t_dblocks_delta += delta; break; case XFS_TRANS_SB_AGCOUNT: - ASSERT(delta > 0); tp->t_agcount_delta += delta; break; case XFS_TRANS_SB_IMAXPCT: ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Implement shrink of empty AGs 2007-06-14 6:01 ` Iustin Pop @ 2007-06-14 9:00 ` David Chinner 2007-06-14 20:55 ` Iustin Pop 0 siblings, 1 reply; 7+ messages in thread From: David Chinner @ 2007-06-14 9:00 UTC (permalink / raw) To: xfs; +Cc: Iustin Pop On Thu, Jun 14, 2007 at 08:01:58AM +0200, Iustin Pop wrote: > On Tue, Jun 12, 2007 at 12:40:25PM +1000, David Chinner wrote: > > > diff -X ignore -urN linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c > > > --- linux-2.6-xfs.cvs-orig/fs/xfs/xfs_fsops.c 2007-06-09 18:56:21.509308225 +0200 > > > +++ linux-2.6-xfs.shrink/fs/xfs/xfs_fsops.c 2007-06-10 18:32:36.074856477 +0200 > > > @@ -112,6 +112,53 @@ > > > return 0; > > > } > > > > > > +static void xfs_update_sb( > > > > STATIC void > > xfs_growfs_update_sb( > this was because xfs_growfs_private is also static and not STATIC. > Should I change the def for it also? Yes, probably should. > > > + xfs_mount_t *mp, /* mount point for filesystem */ > > > + xfs_agnumber_t nagimax, > > > + xfs_agnumber_t nagcount) /* new number of a.g. */ > > > > tabs, not spaces (and tabs are 8 spaces). > sorry, I though I got all of these. There are some more in the def of > xfs_reserve_blocks, btw. Ok, I'll clean them up in the current fix-reserve-blocks-yet-again patch I have. > > > +xfs_shrinkfs_data_private( > > > + xfs_mount_t *mp, /* mount point for filesystem */ > > > + xfs_growfs_data_t *in) /* growfs data input struct */ > > > > whitespace issues > fixed (I think you were referring to the alignement of the two argument > lines). Yes. > > > + /* dbdelta starts at the diff and must become zero */ > > > + dbdelta = mp->m_sb.sb_dblocks - nb; > > > + tp = xfs_trans_alloc(mp, XFS_TRANS_GROWFS); > > > + printk("reserving %d\n", XFS_GROWFS_SPACE_RES(mp) + dbdelta); > > > + if ((error = xfs_trans_reserve(tp, XFS_GROWFS_SPACE_RES(mp) + dbdelta, > > > + XFS_GROWDATA_LOG_RES(mp), 0, 0, 0))) { > > > + xfs_trans_cancel(tp, 0); > > > + return error; > > > + } > > > > What's the dbdelta part of the reservation for? That's reserving dbdelta > > blocks for *allocations*, so I don't think this is right.... > > Well, we'll shrink the filesystem by dbdelta, so the filesystem needs to > have enough free space to do it. True, but when you look at how much free space we need to make that modification, it turns out ot be zero. i.e. we don't need to do any allocations *anywhere* to remove an empty AG from the end of the filesystem - just log a change to the superblock(s). > Whether this space is in the correct > place (the AGs we want to shrink) or not is a question answered by the > per-AG checking, but since we will reduce the freespace by this amount, > I thought that it's safer to mark this space in use. Did I misread the > intent of xfs_trans_reserve? The transactions reservation takes into account blocks that may need to be allocated during a transaction. Normally when we are freeing an extent, we may still have to allocate some blocks because a btree block might need to be split (and split all the way up the tree to the root). The XFS_GROWFS_SPACE_RES(mp) macro reserves the space needed for the freeing of a new extent which occurs during growing a partial AG to a full AG. We aren't doing such an operation here; we don't modify any free space or inode or extent btrees here at all. > Also, unless I'm mistaken and don't remember correctly, you have to > reserve the amount of space one will pass to xfs_trans_mod_sb(tp, > XFS_TRANS_SB_FDBLOCKS, ...) otherwise the transaction code complains > about exceeding your reservation. xfs_trans_mod_sb: 459 case XFS_TRANS_SB_FDBLOCKS: 460 /* 461 * Track the number of blocks allocated in the 462 * transaction. Make sure it does not exceed the 463 * number reserved. 464 */ 465 if (delta < 0) { 466 tp->t_blk_res_used += (uint)-delta; 467 ASSERT(tp->t_blk_res_used <= tp->t_blk_res); 468 } 469 tp->t_fdblocks_delta += delta; 470 if (xfs_sb_version_haslazysbcount(&mp->m_sb)) 471 flags &= ~XFS_TRANS_SB_DIRTY; 472 break; So, delta < 0 (i.e. allocating blocks) requires a transaction reservation, but freeing blocks (delta > 0) does not. > > > + /* I think that since we do read and write to the m_perag > > > + * stuff, we should be holding the lock for the entire walk & > > > + * modify of the fs > > > + */ > > > > Deadlock warning! holding the m_peraglock in write mode will cause > > allocation deadlocks if you are not careful as all allocation/free > > operations take the m_peraglock in read mode. (And yes, growing > > an active, loaded filesystem can deadlock because of this.) > > How can we ensure then that no one modifies the AGs while we walk them? It's not even modify - how do you prevent any *reference* to the perag while we are doing this. > I hoped that we can do it without the per-AG not-avail flag, by just > holding the perag lock. The per-AG not-available flag will prevent new allocations, but it doesn't prevent access to the AGs as we still need access to the AGs to copy the data and metadata out. The m_peraglock in write mode will prevent *simultaneous* access to the per-ag, but the code waiting on a read lock may still try to access the bit of the perag structure we're about to remove. See, for example, xfs_bmap_btalloc(). It gets and AG it is supposed to use from somewhere external, and then before it goes to use the perag structure for that AG, it waits on the peraglock in read mode. If that blocks waiting on a shrink and after the shrink ag > sb_agcount, we're in a world of pain..... Note that I'm just using this as an example piece of existing code that would break badly if we simply realloc the perag array. I could point you to the filestreams code where there is a perag stream refcount that we'll need to ensure is zero. What I'm trying to say is that the perag lock may not give us sufficient exclusion to be able to shrink the perag array safely. > > ok, so we have empty ag's here. You might want to check that the > > inode btree is empty and that the AGI unlinked list is empty. > I thought that inode btree is empty by virtue of pag->pagi_count == 0. > Is this not always so? It should be. Call me paranoid, but with an operation like this I want to make sure we check and double check that it is safe to proceed. > Furthermore, also since agi_count == agi_free + > actual used inodes + number of unlinked inodes, I think we don't need to > check the unlinked list. Call me paranoid, but.... I think at minimum there should be debug code that caters to my paranoia ;) > > > + > > > + freeblks = pag->pagf_freeblks; > > > + printk("Usage: %d prealloc, %d flcount\n", > > > + XFS_PREALLOC_BLOCKS(mp), pag->pagf_flcount); > > > + > > > + /* Done gathering data, check sizes */ > > > + usedblks = XFS_PREALLOC_BLOCKS(mp) + pag->pagf_flcount; > > > + printk("agno=%d agf_length=%d computed used=%d" > > > + " known free=%d\n", agno, aglen, usedblks, freeblks); > > > + > > > + if(usedblks + freeblks != aglen) { > > > + printk("agno %d is not free (%d blocks allocated)\n", > > > + agno, aglen-usedblks-freeblks); > > > + error = XFS_ERROR(ENOSPC); > > > + goto error0; > > > + } > > > + dbdelta -= aglen; > > > + printk("will lower with %d\n", > > > + aglen - XFS_PREALLOC_BLOCKS(mp)); > > > + fdbdelta += aglen - XFS_PREALLOC_BLOCKS(mp); > > > > Ok, so why not just > > > > fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp); > Because the last AG can be smaller than the sb_agblocks. It's true > that this holds only for the last one, but having two cases is uglier > than just always reading this size from the AG. But you -EINVAL'd earlier when the shrink size would not leave entire AGs behind. So the partial last AG is something that won't happen here..... > > > + } > > > + /* > > > + * Check that we removed all blocks > > > + */ > > > + ASSERT(!dbdelta); > > > + ASSERT(nagcount < oagcount); > > > > Error out, not assert, because at this point we have not changed anything. > Actually here, !dbdelta or nacount < oagacount are programming errors > and not possible correct conditions. They could be removed, sine the > checks before the per-AG for loop ensure these conditions are not met. > But I used them just to be sure. Ok - I misinterpreted what they were catching. > > > + /* Free memory as the number of AG has changed */ > > > + for (agno = nagcount; agno < oagcount; agno++) > > > + if (mp->m_perag[agno].pagb_list) > > > + kmem_free(mp->m_perag[agno].pagb_list, > > > + sizeof(xfs_perag_busy_t) * > > > + XFS_PAGB_NUM_SLOTS); > > > + > > > + mp->m_perag = kmem_realloc(mp->m_perag, > > > + sizeof(xfs_perag_t) * nagcount, > > > + sizeof(xfs_perag_t) * oagcount, > > > + KM_SLEEP); > > > > This is not really safe - how do we know if all the users of the > > higher AGs have gone away? I think we should simply just zero out > > the unused AGs and don't worry about a realloc(). > The problem that I saw is that if you do shrink+grow+shrink+grow+... you > will leak a small amount of memory (or corrupt kernel mem allocation?) > since the growfs code does a realloc from what it thinks is the size of > m_perag, which it computes solely from the current number of AGs. Should > we have a size in the mp struct for this and not assume it's the > agcount? Yup, another size variable in the mount structure is probably the only way to fix this. > > > + if(in->newblocks < mp->m_sb.sb_dblocks) > > > + error = xfs_shrinkfs_data_private(mp, in); > > > + else > > > + error = xfs_growfs_data_private(mp, in); > > > > Hmmm - that's using the one ioctl to do both grow and shrink. I'd > > prefer a new shrink ioctl rather than changing the behaviour of an > > existing ioctl. > Well, I chose this way because I see it as the ioctl that changes the > data size of the filesystem. It may be lower or higher than the current > size, but that is not so important :), and if another ioctl there would > be the need for another tool. True, put that way i think you're right. > updated patch (without the separation of the IOCTL and without the > rework of the perag lock) is attached. I haven't looked at it yet - i'll try to get to it tomorrow... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Implement shrink of empty AGs 2007-06-14 9:00 ` David Chinner @ 2007-06-14 20:55 ` Iustin Pop 2007-06-14 22:16 ` David Chinner 0 siblings, 1 reply; 7+ messages in thread From: Iustin Pop @ 2007-06-14 20:55 UTC (permalink / raw) To: David Chinner; +Cc: xfs On Thu, Jun 14, 2007 at 07:00:52PM +1000, David Chinner wrote: > > > What's the dbdelta part of the reservation for? That's reserving dbdelta > > > blocks for *allocations*, so I don't think this is right.... > > > > Well, we'll shrink the filesystem by dbdelta, so the filesystem needs to > > have enough free space to do it. > > True, but when you look at how much free space we need to make that > modification, it turns out ot be zero. i.e. we don't need to do any > allocations *anywhere* to remove an empty AG from the end of the > filesystem - just log a change to the superblock(s). > > > Whether this space is in the correct > > place (the AGs we want to shrink) or not is a question answered by the > > per-AG checking, but since we will reduce the freespace by this amount, > > I thought that it's safer to mark this space in use. Did I misread the > > intent of xfs_trans_reserve? > > The transactions reservation takes into account blocks that may need > to be allocated during a transaction. Normally when we are freeing > an extent, we may still have to allocate some blocks because a > btree block might need to be split (and split all the way up the tree > to the root). The XFS_GROWFS_SPACE_RES(mp) macro reserves the space > needed for the freeing of a new extent which occurs during growing a > partial AG to a full AG. We aren't doing such an operation here; > we don't modify any free space or inode or extent btrees here at > all. Ah, finally I understand what the reservation is for. I probably misunderstood the actual meaning of the transaction also... Thanks for the clear explanation, will remove this. > [...] > Note that I'm just using this as an example piece of existing code that would > break badly if we simply realloc the perag array. I could point you to the > filestreams code where there is a perag stream refcount that we'll need to > ensure is zero. > > What I'm trying to say is that the perag lock may not give us sufficient > exclusion to be able to shrink the perag array safely. Interesting conclusion. I'm fail right now to come up with an idea, except auditing the whole codebase and make sure any down_read is followed by a check on the perag size. Or something like adding a flag to the perag structure like 'gone' that is checked (since we won't k_realloc anymore), after the read lock. > > > ok, so we have empty ag's here. You might want to check that the > > > inode btree is empty and that the AGI unlinked list is empty. > > I thought that inode btree is empty by virtue of pag->pagi_count == 0. > > Is this not always so? > > It should be. Call me paranoid, but with an operation like this I want > to make sure we check and double check that it is safe to proceed. > > > Furthermore, also since agi_count == agi_free + > > actual used inodes + number of unlinked inodes, I think we don't need to > > check the unlinked list. > > Call me paranoid, but.... > > I think at minimum there should be debug code that caters to my paranoia ;) Ok, I'll look into this then. I understand the paranoia, I just assumed that logically it's provable that this situation doesn't/can't happen. > > > Ok, so why not just > > > > > > fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp); > > Because the last AG can be smaller than the sb_agblocks. It's true > > that this holds only for the last one, but having two cases is uglier > > than just always reading this size from the AG. > > But you -EINVAL'd earlier when the shrink size would not leave entire > AGs behind. So the partial last AG is something that won't happen > here..... No, what I meant when I said 'last AG' is the "original" last, not the "new" last. Consider this fs: AGs 1-8 = 4096 blocks, AG 9 = 3000 blocks. We can't, in the for loop, make fbdelta += mp->m_sb.sb_agblocks ... since that is 4096, whereas the for 'last' AG (AG 9) has only 3000 blocks. What I mean is that for a normal filesystem (non-shrink situation), the size of the last AG is readable only from the AGF structure on-disk, but for all the others AG the size is the one from the superblock. So if you touch the last AG, you have to read the AGF. > > > This is not really safe - how do we know if all the users of the > > > higher AGs have gone away? I think we should simply just zero out > > > the unused AGs and don't worry about a realloc(). > > The problem that I saw is that if you do shrink+grow+shrink+grow+... you > > will leak a small amount of memory (or corrupt kernel mem allocation?) > > since the growfs code does a realloc from what it thinks is the size of > > m_perag, which it computes solely from the current number of AGs. Should > > we have a size in the mp struct for this and not assume it's the > > agcount? > > Yup, another size variable in the mount structure is probably the only > way to fix this. Ok, noted. Also I get the impression that this ties in with the perag locking above somehow... > > updated patch (without the separation of the IOCTL and without the > > rework of the perag lock) is attached. > > I haven't looked at it yet - i'll try to get to it tomorrow... No hurry, just small changes. I'll think about the issues you raised. thanks, iustin ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Implement shrink of empty AGs 2007-06-14 20:55 ` Iustin Pop @ 2007-06-14 22:16 ` David Chinner 0 siblings, 0 replies; 7+ messages in thread From: David Chinner @ 2007-06-14 22:16 UTC (permalink / raw) To: xfs On Thu, Jun 14, 2007 at 10:55:18PM +0200, Iustin Pop wrote: > On Thu, Jun 14, 2007 at 07:00:52PM +1000, David Chinner wrote: > > [...] > > Note that I'm just using this as an example piece of existing code that would > > break badly if we simply realloc the perag array. I could point you to the > > filestreams code where there is a perag stream refcount that we'll need to > > ensure is zero. > > > > What I'm trying to say is that the perag lock may not give us sufficient > > exclusion to be able to shrink the perag array safely. > Interesting conclusion. I'm fail right now to come up with an idea, > except auditing the whole codebase Yup. > and make sure any down_read is followed by a check on the perag size. That doesn't help, really, because at some points we won't be able to back out and select another AG. i.e. once we've selected an AG and ensured the perag structure is initialised, it is assumed it will remain that way forever. More likely we need reference counting on the AGs as we do operations (e.g. when we select an AG for an operation we take a reference and then drop it when we are done) and prevent new references from being taken on an AG at some point (i.e. once it's been emptied). > Or something like adding a flag > to the perag structure like 'gone' that is checked (since we won't > k_realloc anymore), after the read lock. Clear the pag[if]_init flags and ensure failure to initialise is treated correctly i.e. fall back to selecting another AG. > > > > ok, so we have empty ag's here. You might want to check that the > > > > inode btree is empty and that the AGI unlinked list is empty. > > > I thought that inode btree is empty by virtue of pag->pagi_count == 0. > > > Is this not always so? > > > > It should be. Call me paranoid, but with an operation like this I want > > to make sure we check and double check that it is safe to proceed. > > > > > Furthermore, also since agi_count == agi_free + > > > actual used inodes + number of unlinked inodes, I think we don't need to > > > check the unlinked list. > > > > Call me paranoid, but.... > > > > I think at minimum there should be debug code that caters to my paranoia ;) > Ok, I'll look into this then. I understand the paranoia, I just assumed > that logically it's provable that this situation doesn't/can't happen. At some point we'll have a bug that breaks that logic - we need code to catch that.... > > > > Ok, so why not just > > > > > > > > fdbdelta += mp->m_sb.sb_agblocks - XFS_PREALLOC_BLOCKS(mp); > > > Because the last AG can be smaller than the sb_agblocks. It's true > > > that this holds only for the last one, but having two cases is uglier > > > than just always reading this size from the AG. > > > > But you -EINVAL'd earlier when the shrink size would not leave entire > > AGs behind. So the partial last AG is something that won't happen > > here..... > No, what I meant when I said 'last AG' is the "original" last, not the > "new" last. Consider this fs: AGs 1-8 = 4096 blocks, AG 9 = 3000 blocks. Ok, gotcha. Makes sense now. > We can't, in the for loop, make fbdelta += mp->m_sb.sb_agblocks ... > since that is 4096, whereas the for 'last' AG (AG 9) has only 3000 > blocks. > > What I mean is that for a normal filesystem (non-shrink situation), the > size of the last AG is readable only from the AGF structure on-disk, > but for all the others AG the size is the one from the superblock. So if > you touch the last AG, you have to read the AGF. I'd still special case it, like the growing of a partial last AG is special cased in xfs_growfs_data_private(). > > > > This is not really safe - how do we know if all the users of the > > > > higher AGs have gone away? I think we should simply just zero out > > > > the unused AGs and don't worry about a realloc(). > > > The problem that I saw is that if you do shrink+grow+shrink+grow+... you > > > will leak a small amount of memory (or corrupt kernel mem allocation?) > > > since the growfs code does a realloc from what it thinks is the size of > > > m_perag, which it computes solely from the current number of AGs. Should > > > we have a size in the mp struct for this and not assume it's the > > > agcount? > > > > Yup, another size variable in the mount structure is probably the only > > way to fix this. > Ok, noted. Also I get the impression that this ties in with the perag > locking above somehow... *nod* Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2007-06-14 22:16 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-06-10 16:40 [PATCH] Implement shrink of empty AGs Iustin Pop 2007-06-12 2:40 ` David Chinner 2007-06-12 4:25 ` Eric Sandeen 2007-06-14 6:01 ` Iustin Pop 2007-06-14 9:00 ` David Chinner 2007-06-14 20:55 ` Iustin Pop 2007-06-14 22:16 ` David Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox