public inbox for linux-xfs@vger.kernel.org
* [RFC(RAP) 00/14] xfs: add reflink and dedupe support
@ 2015-06-25 23:39 Darrick J. Wong
  2015-06-25 23:39 ` [PATCH 01/14] xfs: create a per-AG btree to track reference counts Darrick J. Wong
                   ` (13 more replies)
  0 siblings, 14 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:39 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Hi all,

This is an RFC-quality pass at kernel support for mapping multiple file
logical blocks to the same physical block, more commonly known as
reflinking.  It adds a single [block, refcount] tree to track the
reference counts of extents of physical blocks.  There's also a bunch
of support code that provides the desired copy-on-write behavior, plus
userland interfaces to reflink files, query their status, and
un-reflink them.
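
To make the record format concrete, here is a rough userspace sketch
(not code from this series) of how a lookup against sorted
[startblock, blockcount, nlinks] records could answer "how many owners
does this block have?".  The names rl_extent and rl_refcount are made
up for illustration, and the rule that a block with no record has
exactly one owner is an assumption based on the nr > 1 asserts in
patch 01.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory mirror of one refcount record. */
struct rl_extent {
	uint32_t startblock;	/* first AG block of the shared extent */
	uint32_t blockcount;	/* number of blocks in the extent */
	uint32_t nlinks;	/* number of owners mapping these blocks */
};

/*
 * Return the reference count of AG block agbno, given records sorted by
 * startblock that do not overlap.  Assumption: a block that is covered
 * by no record has exactly one owner.
 */
static uint32_t
rl_refcount(const struct rl_extent *recs, size_t nr, uint32_t agbno)
{
	size_t lo = 0, hi = nr;

	/* Binary search: lo ends up one past the last startblock <= agbno. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (recs[mid].startblock <= agbno)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == 0)
		return 1;	/* every record starts after agbno */

	/* Same idea as a "less than or equal" btree lookup on startblock. */
	if (agbno - recs[lo - 1].startblock < recs[lo - 1].blockcount)
		return recs[lo - 1].nlinks;
	return 1;		/* agbno falls in a gap between records */
}
```

Keying only on the start block is what lets a single "less than or
equal" lookup find the one record that could cover a given block.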

The patch set is based on the current (4.2) for-next branch plus
Dave's rmap RFC patches.  There are still a lot of bugs in this code;
I'm not sure that I've gotten the locking and the transaction
handling correct.  Deadlocks and hangs due to the log being full are
unfortunately common, but light exercise shows that it works well
enough as a proof of concept.

The ioctl interface to XFS reflink looks surprisingly like the btrfs
ioctl interface <cough> -- you can reflink a file, reflink subranges
of a file, or dedupe subranges of files.  (Dedupe also checks file
blocks, though I have a feeling it's racy.)  To un-reflink a file,
simply chattr +C it to mark it no-cow.  xfs_fsr can be better at
that, though. :)
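
For anyone who hasn't used the btrfs interface being imitated, the
sketch below shows the btrfs calls themselves, BTRFS_IOC_CLONE and
BTRFS_IOC_CLONE_RANGE from <linux/btrfs.h>.  The ioctl numbers and
argument structs that this XFS series actually defines may differ, so
treat this as an illustration of the shape of the interface only;
clone_file and clone_range are hypothetical wrapper names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Share all of src_fd's blocks with dest_fd.  Returns 0 or -errno. */
static int
clone_file(int dest_fd, int src_fd)
{
	if (ioctl(dest_fd, BTRFS_IOC_CLONE, src_fd) < 0)
		return -errno;
	return 0;
}

/*
 * Share len bytes starting at src_off in src_fd with dest_fd at
 * dest_off.  Returns 0 or -errno.
 */
static int
clone_range(int dest_fd, int src_fd, uint64_t src_off, uint64_t len,
	    uint64_t dest_off)
{
	struct btrfs_ioctl_clone_range_args args = {
		.src_fd		= src_fd,
		.src_offset	= src_off,
		.src_length	= len,
		.dest_offset	= dest_off,
	};

	if (ioctl(dest_fd, BTRFS_IOC_CLONE_RANGE, &args) < 0)
		return -errno;
	return 0;
}
```

Note that the ioctl is issued on the destination file descriptor, with
the source passed as an argument.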

If you're going to start using this mess, you're going to want to pull
my xfsprogs dev tree[1], which itself is also based on xfsprogs
for-next and the userland rmap support bits.  I've not had time to get
reflink and rmap to work together.

I've also prepared a bunch of xfstests[2] to exercise the userland
interfaces.

This is an extraordinary way to eat your data.  Enjoy!

Comments and questions are, as always, welcome.

--D

[1] https://github.com/djwong/xfsprogs/commits/for-next
[2] https://github.com/djwong/xfstests/commits/master

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH 01/14] xfs: create a per-AG btree to track reference counts
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
@ 2015-06-25 23:39 ` Darrick J. Wong
  2015-07-01  0:13   ` Dave Chinner
  2015-06-25 23:39 ` [PATCH 02/14] libxfs: adjust refcounts in reflink btree Darrick J. Wong
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:39 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Create a per-AG btree to track the reference counts of physical blocks
to support reflink.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
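As a sanity check on the record capacity that xfs_reflinkbt_maxrecs()
computes in this patch, the arithmetic can be reproduced in userspace.
This is only a sketch: the 56-byte header is an assumed value for
XFS_BTREE_SBLOCK_CRC_LEN, the 12/4/4 byte sizes follow from the __be32
fields added to xfs_format.h, and the SKETCH_*/sketch_* names are made
up.

```c
#include <assert.h>

#define SKETCH_SBLOCK_CRC_LEN	56	/* assumed XFS_BTREE_SBLOCK_CRC_LEN */
#define SKETCH_REC_LEN		12	/* startblock + blockcount + nlinks */
#define SKETCH_KEY_LEN		4	/* startblock only */
#define SKETCH_PTR_LEN		4	/* __be32 AG block pointer */

/*
 * Number of records (leaf) or key/pointer pairs (node) that fit in a
 * block of blocklen bytes once the header is subtracted.
 */
static int
sketch_maxrecs(int blocklen, int leaf)
{
	assert(blocklen > SKETCH_SBLOCK_CRC_LEN);
	blocklen -= SKETCH_SBLOCK_CRC_LEN;

	if (leaf)
		return blocklen / SKETCH_REC_LEN;
	return blocklen / (SKETCH_KEY_LEN + SKETCH_PTR_LEN);
}
```

Under those assumptions a 4096-byte block holds 336 records per leaf
and 505 key/pointer pairs per node; xfs_sb_mount_common() halves the
maximums to get the minimums.
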
 fs/xfs/Makefile                   |    1 
 fs/xfs/libxfs/xfs_alloc.c         |   19 +
 fs/xfs/libxfs/xfs_btree.c         |    8 -
 fs/xfs/libxfs/xfs_btree.h         |    7 
 fs/xfs/libxfs/xfs_format.h        |   59 ++++
 fs/xfs/libxfs/xfs_reflink_btree.c |  531 +++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_reflink_btree.h |   70 +++++
 fs/xfs/libxfs/xfs_sb.c            |    7 
 fs/xfs/libxfs/xfs_shared.h        |    1 
 fs/xfs/libxfs/xfs_trans_resv.c    |    2 
 fs/xfs/libxfs/xfs_types.h         |    2 
 fs/xfs/xfs_mount.h                |    5 
 fs/xfs/xfs_stats.c                |    1 
 fs/xfs/xfs_stats.h                |   18 +
 14 files changed, 722 insertions(+), 9 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_reflink_btree.c
 create mode 100644 fs/xfs/libxfs/xfs_reflink_btree.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index e338595..ba89aee 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -52,6 +52,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_log_rlimit.o \
 				   xfs_rmap.o \
 				   xfs_rmap_btree.o \
+				   xfs_reflink_btree.o \
 				   xfs_sb.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_resv.o \
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c6a1372..fc8a499 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -54,6 +54,8 @@ xfs_extlen_t
 xfs_prealloc_blocks(
 	struct xfs_mount	*mp)
 {
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		return XFS_RL_BLOCK(mp) + 1;
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		return XFS_RMAP_BLOCK(mp) + 1;
 	if (xfs_sb_version_hasfinobt(&mp->m_sb))
@@ -91,9 +93,11 @@ xfs_alloc_set_aside(
 	unsigned int	blocks;
 
 	blocks = 4 + (mp->m_sb.sb_agcount * XFS_ALLOC_AGFL_RESERVE);
-	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
-		return blocks;
-	return blocks + (mp->m_sb.sb_agcount * (2 * mp->m_ag_maxlevels) - 1);
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		blocks += (mp->m_sb.sb_agcount * (2 * mp->m_ag_maxlevels) - 1);
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		blocks += (mp->m_sb.sb_agcount * (2 * mp->m_ag_maxlevels) - 1);
+	return blocks;
 }
 
 /*
@@ -123,6 +127,10 @@ xfs_alloc_ag_max_usable(struct xfs_mount *mp)
 		/* rmap root block + full tree split on full AG */
 		blocks += 1 + (2 * mp->m_ag_maxlevels) - 1;
 	}
+	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		/* reflink root block + full tree split on full AG */
+		blocks += 1 + (2 * mp->m_ag_maxlevels) - 1;
+	}
 
 	return mp->m_sb.sb_agblocks - blocks;
 }
@@ -2378,6 +2386,10 @@ xfs_agf_verify(
 	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length))
 		return false;
 
+	if (xfs_sb_version_hasreflink(&mp->m_sb) &&
+	    be32_to_cpu(agf->agf_reflink_level) > XFS_BTREE_MAXLEVELS)
+		return false;
+
 	return true;;
 
 }
@@ -2497,6 +2509,7 @@ xfs_alloc_read_agf(
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
 		pag->pagf_levels[XFS_BTNUM_RMAPi] =
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
+		pag->pagf_reflink_level = be32_to_cpu(agf->agf_reflink_level);
 		spin_lock_init(&pag->pagb_lock);
 		pag->pagb_count = 0;
 		pag->pagb_tree = RB_ROOT;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 4c9b9b3..8820aad 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -43,9 +43,10 @@ kmem_zone_t	*xfs_btree_cur_zone;
  */
 static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
 	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, 0, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
-	  XFS_FIBT_MAGIC },
+	  XFS_FIBT_MAGIC, 0 },
 	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC, XFS_RMAP_CRC_MAGIC,
-	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
+	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC,
+	  XFS_RLBT_CRC_MAGIC }
 };
 #define xfs_btree_magic(cur) \
 	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
@@ -1117,6 +1118,9 @@ xfs_btree_set_refs(
 	case XFS_BTNUM_RMAP:
 		xfs_buf_set_ref(bp, XFS_RMAP_BTREE_REF);
 		break;
+	case XFS_BTNUM_RL:
+		xfs_buf_set_ref(bp, XFS_REFLINK_BTREE_REF);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 48ab2b1..a3f8661 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -43,6 +43,7 @@ union xfs_btree_key {
 	xfs_alloc_key_t			alloc;
 	struct xfs_inobt_key		inobt;
 	struct xfs_rmap_key		rmap;
+	xfs_reflink_key_t		reflink;
 };
 
 union xfs_btree_rec {
@@ -51,6 +52,7 @@ union xfs_btree_rec {
 	struct xfs_alloc_rec		alloc;
 	struct xfs_inobt_rec		inobt;
 	struct xfs_rmap_rec		rmap;
+	xfs_reflink_rec_t		reflink;
 };
 
 /*
@@ -67,6 +69,8 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
 #define	XFS_BTNUM_RMAP	((xfs_btnum_t)XFS_BTNUM_RMAPi)
 
+#define	XFS_BTNUM_RL	((xfs_btnum_t)XFS_BTNUM_RLi)
+
 /*
  * For logging record fields.
  */
@@ -98,6 +102,7 @@ do {    \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(ibt, stat); break;	\
 	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(fibt, stat); break;	\
 	case XFS_BTNUM_RMAP: __XFS_BTREE_STATS_INC(rmap, stat); break;	\
+	case XFS_BTNUM_RL: __XFS_BTREE_STATS_INC(rlbt, stat); break;	\
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
 	}       \
 } while (0)
@@ -113,6 +118,7 @@ do {    \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_ADD(ibt, stat, val); break; \
 	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_ADD(fibt, stat, val); break; \
 	case XFS_BTNUM_RMAP: __XFS_BTREE_STATS_ADD(rmap, stat, val); break; \
+	case XFS_BTNUM_RL: __XFS_BTREE_STATS_ADD(rlbt, stat, val); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
 	}       \
 } while (0)
@@ -205,6 +211,7 @@ typedef struct xfs_btree_cur
 		xfs_bmbt_irec_t		b;
 		xfs_inobt_rec_incore_t	i;
 		struct xfs_rmap_irec	r;
+		xfs_reflink_rec_incore_t	rl;
 	}		bc_rec;		/* current insert/search record value */
 	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
 	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 9cff517..e4954ab 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -446,9 +446,11 @@ xfs_sb_has_compat_feature(
 
 #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
+#define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflink btree */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
-		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
+		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
+		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
@@ -522,6 +524,12 @@ static inline bool xfs_sb_version_hasrmapbt(struct xfs_sb *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_RMAPBT);
 }
 
+static inline int xfs_sb_version_hasreflink(xfs_sb_t *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
+		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
+}
+
 /*
  * end of superblock version macros
  */
@@ -616,12 +624,15 @@ typedef struct xfs_agf {
 	__be32		agf_btreeblks;	/* # of blocks held in AGF btrees */
 	uuid_t		agf_uuid;	/* uuid of filesystem */
 
+	__be32		agf_reflink_root;	/* reflink tree root block */
+	__be32		agf_reflink_level;	/* reflink btree levels */
+
 	/*
 	 * reserve some contiguous space for future logged fields before we add
 	 * the unlogged fields. This makes the range logging via flags and
 	 * structure offsets much simpler.
 	 */
-	__be64		agf_spare64[16];
+	__be64		agf_spare64[15];
 
 	/* unlogged fields, written during buffer writeback. */
 	__be64		agf_lsn;	/* last write sequence */
@@ -1338,6 +1349,50 @@ typedef __be32 xfs_rmap_ptr_t;
 	 XFS_IBT_BLOCK(mp) + 1)
 
 /*
+ * reflink Btree format definitions
+ *
+ */
+#define	XFS_RLBT_CRC_MAGIC	0x524C4233	/* 'RLB3' */
+
+/*
+ * Data record/key structure
+ */
+typedef struct xfs_reflink_rec {
+	__be32		rr_startblock;	/* starting block number */
+	__be32		rr_blockcount;	/* count of blocks */
+	__be32		rr_nlinks;	/* number of inodes linked here */
+} xfs_reflink_rec_t;
+
+typedef struct xfs_reflink_key {
+	__be32		rr_startblock;	/* starting block number */
+} xfs_reflink_key_t;
+
+typedef struct xfs_reflink_rec_incore {
+	xfs_agblock_t	rr_startblock;	/* starting block number */
+	xfs_extlen_t	rr_blockcount;	/* count of blocks */
+	xfs_nlink_t	rr_nlinks;	/* number of inodes linked here */
+} xfs_reflink_rec_incore_t;
+
+/*
+ * When a block hits MAXRLCOUNT references, it becomes permanently
+ * stuck in CoW mode, because who knows how many times it's really
+ * referenced.
+ */
+#define MAXRLCOUNT	((xfs_nlink_t)~0U)
+#define MAXRLEXTLEN	((xfs_extlen_t)~0U)
+
+/* btree pointer type */
+typedef __be32 xfs_reflink_ptr_t;
+
+#define	XFS_RL_BLOCK(mp) \
+	(xfs_sb_version_hasrmapbt(&((mp)->m_sb)) ? \
+	 XFS_RMAP_BLOCK(mp) + 1 : \
+	 (xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
+	  XFS_FIBT_BLOCK(mp) + 1 : \
+	  XFS_IBT_BLOCK(mp) + 1))
+
+
+/*
  * BMAP Btree format definitions
  *
  * This includes both the root block definition that sits inside an inode fork
diff --git a/fs/xfs/libxfs/xfs_reflink_btree.c b/fs/xfs/libxfs/xfs_reflink_btree.c
new file mode 100644
index 0000000..8a0fa5d
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_reflink_btree.c
@@ -0,0 +1,531 @@
+/*
+ * Copyright (c) 2000-2001,2005 Silicon Graphics, Inc.
+ * Copyright (c) 2015 Oracle.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_reflink_btree.h"
+#include "xfs_alloc.h"
+#include "xfs_extent_busy.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_cksum.h"
+#include "xfs_trans.h"
+#include "xfs_bit.h"
+
+#undef REFLINK_DEBUG
+
+#ifdef REFLINK_DEBUG
+# define dbg_printk(f, a...)  do {printk(KERN_ERR f, ## a); } while (0)
+#else
+# define dbg_printk(f, a...)
+#endif
+
+#define CHECK_AG_NUMBER(mp, agno) \
+	do { \
+		ASSERT((agno) != NULLAGNUMBER); \
+		ASSERT((agno) < (mp)->m_sb.sb_agcount); \
+	} while (0)
+
+#define CHECK_AG_EXTENT(mp, agbno, len) \
+	do { \
+		ASSERT((agbno) != NULLAGBLOCK); \
+		ASSERT((len) > 0); \
+		ASSERT((unsigned long long)(agbno) + (len) <= \
+				(mp)->m_sb.sb_agblocks); \
+	} while (0)
+
+#define XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, have, agbno, len, nr, label) \
+	do { \
+		XFS_WANT_CORRUPTED_GOTO((mp), (have) == 1, label); \
+		XFS_WANT_CORRUPTED_GOTO((mp), (len) > 0, label); \
+		XFS_WANT_CORRUPTED_GOTO((mp), (nr) >= 2, label); \
+		XFS_WANT_CORRUPTED_GOTO((mp), (unsigned long long)(agbno) + \
+				(len) <= (mp)->m_sb.sb_agblocks, label); \
+	} while (0)
+
+STATIC struct xfs_btree_cur *
+xfs_reflinkbt_dup_cursor(
+	struct xfs_btree_cur	*cur)
+{
+	return xfs_reflinkbt_init_cursor(cur->bc_mp, cur->bc_tp,
+			cur->bc_private.a.agbp, cur->bc_private.a.agno);
+}
+
+STATIC void
+xfs_reflinkbt_set_root(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	int			inc)
+{
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	xfs_agnumber_t		seqno = be32_to_cpu(agf->agf_seqno);
+	struct xfs_perag	*pag = xfs_perag_get(cur->bc_mp, seqno);
+
+	ASSERT(ptr->s != 0);
+
+	agf->agf_reflink_root = ptr->s;
+	be32_add_cpu(&agf->agf_reflink_level, inc);
+	pag->pagf_reflink_level += inc;
+	xfs_perag_put(pag);
+
+	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_ROOTS | XFS_AGF_LEVELS);
+}
+
+STATIC int
+xfs_reflinkbt_alloc_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*start,
+	union xfs_btree_ptr	*new,
+	int			*stat)
+{
+	int			error;
+	xfs_agblock_t		bno;
+
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
+
+	/* Allocate the new block from the freelist. If we can't, give up.  */
+	error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp,
+				       &bno, 1);
+	if (error) {
+		XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
+		return error;
+	}
+
+	if (bno == NULLAGBLOCK) {
+		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
+		*stat = 0;
+		return 0;
+	}
+
+	xfs_extent_busy_reuse(cur->bc_mp, cur->bc_private.a.agno, bno, 1, false);
+
+	xfs_trans_agbtree_delta(cur->bc_tp, 1);
+	new->s = cpu_to_be32(bno);
+
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
+	*stat = 1;
+	return 0;
+}
+
+STATIC int
+xfs_reflinkbt_free_block(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	xfs_agblock_t		bno;
+	int			error;
+
+	bno = xfs_daddr_to_agbno(cur->bc_mp, XFS_BUF_ADDR(bp));
+	error = xfs_alloc_put_freelist(cur->bc_tp, agbp, NULL, bno, 1);
+	if (error)
+		return error;
+
+	xfs_extent_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1,
+			      XFS_EXTENT_BUSY_SKIP_DISCARD);
+	xfs_trans_agbtree_delta(cur->bc_tp, -1);
+
+	xfs_trans_binval(cur->bc_tp, bp);
+	return 0;
+}
+
+STATIC int
+xfs_reflinkbt_get_minrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	return cur->bc_mp->m_rlbt_mnr[level != 0];
+}
+
+STATIC int
+xfs_reflinkbt_get_maxrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	return cur->bc_mp->m_rlbt_mxr[level != 0];
+}
+
+STATIC void
+xfs_reflinkbt_init_key_from_rec(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	ASSERT(rec->reflink.rr_startblock != 0);
+
+	key->reflink.rr_startblock = rec->reflink.rr_startblock;
+}
+
+STATIC void
+xfs_reflinkbt_init_rec_from_key(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	ASSERT(key->reflink.rr_startblock != 0);
+
+	rec->reflink.rr_startblock = key->reflink.rr_startblock;
+}
+
+STATIC void
+xfs_reflinkbt_init_rec_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*rec)
+{
+	ASSERT(cur->bc_rec.rl.rr_startblock != 0);
+
+	rec->reflink.rr_startblock = cpu_to_be32(cur->bc_rec.rl.rr_startblock);
+	rec->reflink.rr_blockcount = cpu_to_be32(cur->bc_rec.rl.rr_blockcount);
+	rec->reflink.rr_nlinks = cpu_to_be32(cur->bc_rec.rl.rr_nlinks);
+}
+
+STATIC void
+xfs_reflinkbt_init_ptr_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr)
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(cur->bc_private.a.agbp);
+
+	ASSERT(cur->bc_private.a.agno == be32_to_cpu(agf->agf_seqno));
+	ASSERT(agf->agf_reflink_root != 0);
+
+	ptr->s = agf->agf_reflink_root;
+}
+
+STATIC __int64_t
+xfs_reflinkbt_key_diff(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*key)
+{
+	xfs_reflink_rec_incore_t	*rec = &cur->bc_rec.rl;
+	xfs_reflink_key_t		*kp = &key->reflink;
+
+	return (__int64_t)be32_to_cpu(kp->rr_startblock) - rec->rr_startblock;
+}
+
+static bool
+xfs_reflinkbt_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	struct xfs_perag	*pag = bp->b_pag;
+	unsigned int		level;
+
+	if (block->bb_magic != cpu_to_be32(XFS_RLBT_CRC_MAGIC))
+		return false;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return false;
+	if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_uuid))
+		return false;
+	if (block->bb_u.s.bb_blkno != cpu_to_be64(bp->b_bn))
+		return false;
+	if (pag &&
+	    be32_to_cpu(block->bb_u.s.bb_owner) != pag->pag_agno)
+		return false;
+
+	level = be16_to_cpu(block->bb_level);
+	if (pag && pag->pagf_init) {
+		if (level >= pag->pagf_reflink_level)
+			return false;
+	} else if (level >= mp->m_ag_maxlevels)
+		return false;
+
+	/* numrecs verification */
+	if (be16_to_cpu(block->bb_numrecs) > mp->m_rlbt_mxr[level != 0])
+		return false;
+
+	/* sibling pointer verification */
+	if (!block->bb_u.s.bb_leftsib ||
+	    (be32_to_cpu(block->bb_u.s.bb_leftsib) >= mp->m_sb.sb_agblocks &&
+	     block->bb_u.s.bb_leftsib != cpu_to_be32(NULLAGBLOCK)))
+		return false;
+	if (!block->bb_u.s.bb_rightsib ||
+	    (be32_to_cpu(block->bb_u.s.bb_rightsib) >= mp->m_sb.sb_agblocks &&
+	     block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK)))
+		return false;
+
+	return true;
+}
+
+static void
+xfs_reflinkbt_read_verify(
+	struct xfs_buf	*bp)
+{
+	if (!xfs_btree_sblock_verify_crc(bp))
+		xfs_buf_ioerror(bp, -EFSBADCRC);
+	else if (!xfs_reflinkbt_verify(bp))
+		xfs_buf_ioerror(bp, -EFSCORRUPTED);
+
+	if (bp->b_error) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_verifier_error(bp);
+	}
+}
+
+static void
+xfs_reflinkbt_write_verify(
+	struct xfs_buf	*bp)
+{
+	if (!xfs_reflinkbt_verify(bp)) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_buf_ioerror(bp, -EFSCORRUPTED);
+		xfs_verifier_error(bp);
+		return;
+	}
+	xfs_btree_sblock_calc_crc(bp);
+
+}
+
+const struct xfs_buf_ops xfs_reflinkbt_buf_ops = {
+	.verify_read = xfs_reflinkbt_read_verify,
+	.verify_write = xfs_reflinkbt_write_verify,
+};
+
+
+#if defined(DEBUG) || defined(XFS_WARN)
+STATIC int
+xfs_reflinkbt_keys_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return be32_to_cpu(k1->reflink.rr_startblock) <
+	       be32_to_cpu(k2->reflink.rr_startblock);
+}
+
+STATIC int
+xfs_reflinkbt_recs_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*r1,
+	union xfs_btree_rec	*r2)
+{
+	return be32_to_cpu(r1->reflink.rr_startblock) +
+		be32_to_cpu(r1->reflink.rr_blockcount) <=
+		be32_to_cpu(r2->reflink.rr_startblock);
+}
+#endif	/* DEBUG */
+
+static const struct xfs_btree_ops xfs_reflinkbt_ops = {
+	.rec_len		= sizeof(xfs_reflink_rec_t),
+	.key_len		= sizeof(xfs_reflink_key_t),
+
+	.dup_cursor		= xfs_reflinkbt_dup_cursor,
+	.set_root		= xfs_reflinkbt_set_root,
+	.alloc_block		= xfs_reflinkbt_alloc_block,
+	.free_block		= xfs_reflinkbt_free_block,
+	.get_minrecs		= xfs_reflinkbt_get_minrecs,
+	.get_maxrecs		= xfs_reflinkbt_get_maxrecs,
+	.init_key_from_rec	= xfs_reflinkbt_init_key_from_rec,
+	.init_rec_from_key	= xfs_reflinkbt_init_rec_from_key,
+	.init_rec_from_cur	= xfs_reflinkbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfs_reflinkbt_init_ptr_from_cur,
+	.key_diff		= xfs_reflinkbt_key_diff,
+	.buf_ops		= &xfs_reflinkbt_buf_ops,
+#if defined(DEBUG) || defined(XFS_WARN)
+	.keys_inorder		= xfs_reflinkbt_keys_inorder,
+	.recs_inorder		= xfs_reflinkbt_recs_inorder,
+#endif
+};
+
+/*
+ * Allocate a new reflink btree cursor.
+ */
+struct xfs_btree_cur *			/* new reflink btree cursor */
+xfs_reflinkbt_init_cursor(
+	struct xfs_mount	*mp,		/* file system mount point */
+	struct xfs_trans	*tp,		/* transaction pointer */
+	struct xfs_buf		*agbp,		/* buffer for agf structure */
+	xfs_agnumber_t		agno)		/* allocation group number */
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	struct xfs_btree_cur	*cur;
+
+	CHECK_AG_NUMBER(mp, agno);
+	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
+
+	cur->bc_tp = tp;
+	cur->bc_mp = mp;
+	cur->bc_btnum = XFS_BTNUM_RL;
+	cur->bc_blocklog = mp->m_sb.sb_blocklog;
+	cur->bc_ops = &xfs_reflinkbt_ops;
+
+	cur->bc_nlevels = be32_to_cpu(agf->agf_reflink_level);
+
+	cur->bc_private.a.agbp = agbp;
+	cur->bc_private.a.agno = agno;
+
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
+
+	return cur;
+}
+
+/*
+ * Calculate number of records in a reflink btree block.
+ */
+int
+xfs_reflinkbt_maxrecs(
+	struct xfs_mount	*mp,
+	int			blocklen,
+	int			leaf)
+{
+	blocklen -= XFS_REFLINK_BLOCK_LEN;
+
+	if (leaf)
+		return blocklen / sizeof(xfs_reflink_rec_t);
+	return blocklen / (sizeof(xfs_reflink_key_t) +
+			   sizeof(xfs_reflink_ptr_t));
+}
+
+/*
+ * Lookup the first record less than or equal to bno
+ * in the btree given by cur.
+ */
+int					/* error */
+xfs_reflink_lookup_le(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	xfs_agblock_t		bno,	/* starting block of extent */
+	int			*stat)	/* success/failure */
+{
+	cur->bc_rec.rl.rr_startblock = bno;
+	cur->bc_rec.rl.rr_blockcount = 0;
+	return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
+}
+
+/*
+ * Lookup the first record greater than or equal to bno
+ * in the btree given by cur.
+ */
+int					/* error */
+xfs_reflink_lookup_ge(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	xfs_agblock_t		bno,	/* starting block of extent */
+	int			*stat)	/* success/failure */
+{
+	cur->bc_rec.rl.rr_startblock = bno;
+	cur->bc_rec.rl.rr_blockcount = 0;
+	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
+}
+
+/*
+ * Get the data from the pointed-to record.
+ */
+int					/* error */
+xfs_reflink_get_rec(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	xfs_agblock_t		*bno,	/* output: starting block of extent */
+	xfs_extlen_t		*len,	/* output: length of extent */
+	xfs_nlink_t		*nlink,	/* output: number of links */
+	int			*stat)	/* output: success/failure */
+{
+	union xfs_btree_rec	*rec;
+	int			error;
+
+	error = xfs_btree_get_rec(cur, &rec, stat);
+	if (!error && *stat == 1) {
+		CHECK_AG_EXTENT(cur->bc_mp,
+			be32_to_cpu(rec->reflink.rr_startblock),
+			be32_to_cpu(rec->reflink.rr_blockcount));
+		*bno = be32_to_cpu(rec->reflink.rr_startblock);
+		*len = be32_to_cpu(rec->reflink.rr_blockcount);
+		*nlink = be32_to_cpu(rec->reflink.rr_nlinks);
+	}
+	return error;
+}
+
+/*
+ * Update the record referred to by cur to the value given
+ * by [bno, len, nr].
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int				/* error */
+xfs_reflinkbt_update(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	xfs_agblock_t		bno,	/* starting block of extent */
+	xfs_extlen_t		len,	/* length of extent */
+	xfs_nlink_t		nr)	/* reference count */
+{
+	union xfs_btree_rec	rec;
+
+	CHECK_AG_EXTENT(cur->bc_mp, bno, len);
+	ASSERT(nr > 1);
+
+	rec.reflink.rr_startblock = cpu_to_be32(bno);
+	rec.reflink.rr_blockcount = cpu_to_be32(len);
+	rec.reflink.rr_nlinks = cpu_to_be32(nr);
+	return xfs_btree_update(cur, &rec);
+}
+
+/*
+ * Insert a record with the value given by [bno, len, nr] at the
+ * position referred to by cur.
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int				/* error */
+xfs_reflinkbt_insert(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	xfs_agblock_t		bno,	/* starting block of extent */
+	xfs_extlen_t		len,	/* length of extent */
+	xfs_nlink_t		nr,	/* reference count */
+	int			*i)	/* success? */
+{
+	CHECK_AG_EXTENT(cur->bc_mp, bno, len);
+	ASSERT(nr > 1);
+
+	cur->bc_rec.rl.rr_startblock = bno;
+	cur->bc_rec.rl.rr_blockcount = len;
+	cur->bc_rec.rl.rr_nlinks = nr;
+	return xfs_btree_insert(cur, i);
+}
+
+/*
+ * Remove the record referred to by cur.
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int				/* error */
+xfs_reflinkbt_delete(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	int			*i)	/* success? */
+{
+	xfs_agblock_t		bno;
+	xfs_extlen_t		len;
+	xfs_nlink_t		nr;
+	int			x;
+	int			error;
+
+	error = xfs_reflink_get_rec(cur, &bno, &len, &nr, &x);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, x == 1, error0);
+	error = xfs_btree_delete(cur, i);
+	if (error)
+		return error;
+	error = xfs_reflink_lookup_ge(cur, bno, &x);
+error0:
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_reflink_btree.h b/fs/xfs/libxfs/xfs_reflink_btree.h
new file mode 100644
index 0000000..a27588a
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_reflink_btree.h
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2000,2005 Silicon Graphics, Inc.
+ * Copyright (c) 2015 Oracle.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#ifndef __XFS_REFLINK_BTREE_H__
+#define	__XFS_REFLINK_BTREE_H__
+
+/*
+ * Reflink btree on-disk structures
+ */
+
+struct xfs_buf;
+struct xfs_btree_cur;
+struct xfs_mount;
+
+/*
+ * Btree block header size depends on a superblock flag.
+ */
+#define XFS_REFLINK_BLOCK_LEN	XFS_BTREE_SBLOCK_CRC_LEN
+
+/*
+ * Record, key, and pointer address macros for btree blocks.
+ *
+ * (note that some of these may appear unused, but they are used in userspace)
+ */
+#define XFS_REFLINK_REC_ADDR(block, index) \
+	((xfs_reflink_rec_t *) \
+		((char *)(block) + \
+		 XFS_REFLINK_BLOCK_LEN + \
+		 (((index) - 1) * sizeof(xfs_reflink_rec_t))))
+
+#define XFS_REFLINK_KEY_ADDR(block, index) \
+	((xfs_reflink_key_t *) \
+		((char *)(block) + \
+		 XFS_REFLINK_BLOCK_LEN + \
+		 ((index) - 1) * sizeof(xfs_reflink_key_t)))
+
+#define XFS_REFLINK_PTR_ADDR(block, index, maxrecs) \
+	((xfs_reflink_ptr_t *) \
+		((char *)(block) + \
+		 XFS_REFLINK_BLOCK_LEN + \
+		 (maxrecs) * sizeof(xfs_reflink_key_t) + \
+		 ((index) - 1) * sizeof(xfs_reflink_ptr_t)))
+
+extern struct xfs_btree_cur *xfs_reflinkbt_init_cursor(struct xfs_mount *,
+		struct xfs_trans *, struct xfs_buf *,
+		xfs_agnumber_t);
+extern int xfs_reflinkbt_maxrecs(struct xfs_mount *, int, int);
+extern int xfs_reflink_lookup_le(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		int *stat);
+extern int xfs_reflink_lookup_ge(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		int *stat);
+extern int xfs_reflink_get_rec(struct xfs_btree_cur *cur, xfs_agblock_t *bno,
+		xfs_extlen_t *len, xfs_nlink_t *nlink, int *stat);
+
+#endif	/* __XFS_REFLINK_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index db5a19d3..5f8f7fd 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -36,6 +36,8 @@
 #include "xfs_alloc_btree.h"
 #include "xfs_ialloc_btree.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_reflink_btree.h"
 
 /*
  * Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -717,6 +719,11 @@ xfs_sb_mount_common(
 	mp->m_rmap_mnr[0] = mp->m_rmap_mxr[0] / 2;
 	mp->m_rmap_mnr[1] = mp->m_rmap_mxr[1] / 2;
 
+	mp->m_rlbt_mxr[0] = xfs_reflinkbt_maxrecs(mp, sbp->sb_blocksize, 1);
+	mp->m_rlbt_mxr[1] = xfs_reflinkbt_maxrecs(mp, sbp->sb_blocksize, 0);
+	mp->m_rlbt_mnr[0] = mp->m_rlbt_mxr[0] / 2;
+	mp->m_rlbt_mnr[1] = mp->m_rlbt_mxr[1] / 2;
+
 	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
 	mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
 					sbp->sb_inopblock);
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 88efbb4..d1de74e 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -216,6 +216,7 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
 #define	XFS_INO_REF		2
 #define	XFS_ATTR_BTREE_REF	1
 #define	XFS_DQUOT_REF		1
+#define XFS_REFLINK_BTREE_REF	1
 
 /*
  * Flags for xfs_trans_ichgtime().
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index d495f82..a6d1d3b 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -81,6 +81,8 @@ xfs_allocfree_log_count(
 
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		num_trees++;
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		num_trees++;
 
 	return num_ops * num_trees * (2 * mp->m_ag_maxlevels - 1);
 }
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 3d50364..1a93ac9 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -109,7 +109,7 @@ typedef enum {
 
 typedef enum {
 	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_RMAPi, XFS_BTNUM_BMAPi,
-	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_MAX
+	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_RLi, XFS_BTNUM_MAX
 } xfs_btnum_t;
 
 struct xfs_name {
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index cdced0b..69af7f7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -92,6 +92,8 @@ typedef struct xfs_mount {
 	uint			m_inobt_mnr[2];	/* min inobt btree records */
 	uint			m_rmap_mxr[2];	/* max rmap btree records */
 	uint			m_rmap_mnr[2];	/* min rmap btree records */
+	uint			m_rlbt_mxr[2];	/* max rlbt btree records */
+	uint			m_rlbt_mnr[2];	/* min rlbt btree records */
 	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
 	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
 	uint			m_in_maxlevels;	/* max inobt btree levels. */
@@ -315,6 +317,9 @@ typedef struct xfs_perag {
 	/* for rcu-safe freeing */
 	struct rcu_head	rcu_head;
 	int		pagb_count;	/* pagb slots in use */
+
+	/* reflink */
+	__uint8_t	pagf_reflink_level;
 } xfs_perag_t;
 
 extern int	xfs_log_sbcount(xfs_mount_t *);
diff --git a/fs/xfs/xfs_stats.c b/fs/xfs/xfs_stats.c
index 67bbfa2..57449b8 100644
--- a/fs/xfs/xfs_stats.c
+++ b/fs/xfs/xfs_stats.c
@@ -61,6 +61,7 @@ static int xfs_stat_proc_show(struct seq_file *m, void *v)
 		{ "ibt2",		XFSSTAT_END_IBT_V2		},
 		{ "fibt2",		XFSSTAT_END_FIBT_V2		},
 		{ "rmapbt",		XFSSTAT_END_RMAP_V2		},
+		{ "rlbt2",		XFSSTAT_END_RLBT_V2		},
 		/* we print both series of quota information together */
 		{ "qm",			XFSSTAT_END_QM			},
 	};
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index 8414db2..d943c04 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -215,7 +215,23 @@ struct xfsstats {
 	__uint32_t		xs_rmap_2_alloc;
 	__uint32_t		xs_rmap_2_free;
 	__uint32_t		xs_rmap_2_moves;
-#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_RMAP_V2+6)
+#define XFSSTAT_END_RLBT_V2		(XFSSTAT_END_RMAP_V2+15)
+	__uint32_t		xs_rlbt_2_lookup;
+	__uint32_t		xs_rlbt_2_compare;
+	__uint32_t		xs_rlbt_2_insrec;
+	__uint32_t		xs_rlbt_2_delrec;
+	__uint32_t		xs_rlbt_2_newroot;
+	__uint32_t		xs_rlbt_2_killroot;
+	__uint32_t		xs_rlbt_2_increment;
+	__uint32_t		xs_rlbt_2_decrement;
+	__uint32_t		xs_rlbt_2_lshift;
+	__uint32_t		xs_rlbt_2_rshift;
+	__uint32_t		xs_rlbt_2_split;
+	__uint32_t		xs_rlbt_2_join;
+	__uint32_t		xs_rlbt_2_alloc;
+	__uint32_t		xs_rlbt_2_free;
+	__uint32_t		xs_rlbt_2_moves;
+#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_RLBT_V2+6)
 	__uint32_t		xs_qm_dqreclaims;
 	__uint32_t		xs_qm_dqreclaim_misses;
 	__uint32_t		xs_qm_dquot_dups;

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 02/14] libxfs: adjust refcounts in reflink btree
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
  2015-06-25 23:39 ` [PATCH 01/14] xfs: create a per-AG btree to track reference counts Darrick J. Wong
@ 2015-06-25 23:39 ` Darrick J. Wong
  2015-07-01  1:06   ` Dave Chinner
  2015-06-25 23:39 ` [PATCH 03/14] libxfs: support unmapping reflink blocks Darrick J. Wong
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:39 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Provide a function to adjust the reference counts for a range of
blocks in the reflink btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_reflink_btree.c |  406 +++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_reflink_btree.h |    4 
 2 files changed, 410 insertions(+)
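
As a rough illustration of what "adjusting the reference counts for a range of blocks" amounts to, here is a small userspace sketch. It is not the btree code in this patch: it models one AG as a flat array of per-block counts, where a count of 1 means "single owner, no record", and then collects the runs with a count of 2 or more, which is roughly what the split and merge logic is meant to leave in the tree. TOY_AGBLOCKS, struct toy_rlextent and both toy_* functions are made-up names for this example only.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_AGBLOCKS	16	/* hypothetical toy AG size, not an XFS constant */

/* Hypothetical stand-in for a [startblock, length, refcount] record. */
struct toy_rlextent {
	unsigned int	bno;
	unsigned int	len;
	int		nr;
};

/* Adjust the refcount of blocks [bno, bno + len) by adj, which is +1 or -1. */
static void
toy_adjust_refcount(int *refs, unsigned int bno, unsigned int len, int adj)
{
	unsigned int	i;

	assert(adj == -1 || adj == 1);
	assert(bno + len <= TOY_AGBLOCKS);

	for (i = bno; i < bno + len; i++) {
		/* Single-owner block losing its owner: nothing recorded, skip. */
		if (refs[i] + adj == 0)
			continue;
		refs[i] += adj;
	}
}

/* Collect maximal runs of equal refcount >= 2; returns the record count. */
static size_t
toy_collect_rlextents(const int *refs, struct toy_rlextent *out)
{
	size_t		n = 0;
	unsigned int	i = 0;

	while (i < TOY_AGBLOCKS) {
		unsigned int	start = i;
		int		nr = refs[i];

		while (i < TOY_AGBLOCKS && refs[i] == nr)
			i++;
		if (nr >= 2) {
			out[n].bno = start;
			out[n].len = i - start;
			out[n].nr = nr;
			n++;
		}
	}
	return n;
}
```

Bumping blocks 4-11 and then blocks 6-7 yields three records in this model (the middle one with a count of 3); dropping 6-7 again collapses them back into one, which is the merge case.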


diff --git a/fs/xfs/libxfs/xfs_reflink_btree.c b/fs/xfs/libxfs/xfs_reflink_btree.c
index 8a0fa5d..380ed72 100644
--- a/fs/xfs/libxfs/xfs_reflink_btree.c
+++ b/fs/xfs/libxfs/xfs_reflink_btree.c
@@ -529,3 +529,409 @@ xfs_reflinkbt_delete(
 error0:
 	return error;
 }
+
+#ifdef REFLINK_DEBUG
+static void
+dump_cur_loc(
+	struct xfs_btree_cur	*cur,
+	const char		*str,
+	int			line)
+{
+	xfs_agblock_t		gbno;
+	xfs_extlen_t		glen;
+	xfs_nlink_t		gnr;
+	int			i;
+
+	xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
+	printk(KERN_INFO "%s(%d) cur[%d]:[%u,%u,%u,%d] ", str, line,
+	       cur->bc_ptrs[0], gbno, glen, gnr, i);
+	if (i && cur->bc_ptrs[0]) {
+		cur->bc_ptrs[0]--;
+		xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
+		printk("left[%d]:[%u,%u,%u,%d] ", cur->bc_ptrs[0],
+		       gbno, glen, gnr, i);
+		cur->bc_ptrs[0]++;
+	}
+
+	if (i && cur->bc_ptrs[0] < xfs_reflinkbt_get_maxrecs(cur, 0)) {
+		cur->bc_ptrs[0]++;
+		xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
+		printk("right[%d]:[%u,%u,%u,%d] ", cur->bc_ptrs[0],
+		       gbno, glen, gnr, i);
+		cur->bc_ptrs[0]--;
+	}
+	printk("\n");
+}
+#else
+# define dump_cur_loc(c, s, l)
+#endif
+
+/*
+ * Adjust the ref count of a range of AG blocks.
+ */
+int						/* error */
+xfs_reflinkbt_adjust_refcount(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,		/* transaction pointer */
+	struct xfs_buf		*agbp,		/* buffer for agf structure */
+	xfs_agnumber_t		agno,		/* allocation group number */
+	xfs_agblock_t		agbno,		/* start of range */
+	xfs_extlen_t		aglen,		/* length of range */
+	int			adj)		/* how much to change refcnt */
+{
+	struct xfs_btree_cur	*cur;
+	int			error;
+	int			i, have;
+	bool			real_crl;	/* cbno/clen is on disk? */
+	xfs_agblock_t		lbno, cbno, rbno;	/* rlextent start */
+	xfs_extlen_t		llen, clen, rlen;	/* rlextent length */
+	xfs_nlink_t		lnr, cnr, rnr;		/* rlextent refcount */
+
+	xfs_agblock_t		bno;		/* ag bno in the loop */
+	xfs_agblock_t		agbend;		/* end agbno of the loop */
+	xfs_extlen_t		len;		/* remaining len to add */
+	xfs_nlink_t		new_cnr;	/* new refcount */
+
+	CHECK_AG_NUMBER(mp, agno);
+	CHECK_AG_EXTENT(mp, agbno, aglen);
+	ASSERT(adj == -1 || adj == 1);
+
+	/*
+	 * Allocate/initialize a cursor for the reflink btree.
+	 */
+	cur = xfs_reflinkbt_init_cursor(mp, tp, agbp, agno);
+
+	/*
+	 * Split a left rlextent that crosses agbno.
+	 */
+	error = xfs_reflink_lookup_le(cur, agbno, &have);
+	if (error)
+		goto error0;
+	if (have) {
+		error = xfs_reflink_get_rec(cur, &lbno, &llen, &lnr, &i);
+		if (error)
+			goto error0;
+		XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, lbno, llen, lnr, error0);
+		if (lbno < agbno && lbno + llen > agbno) {
+			dbg_printk("split lext crossing agbno [%u:%u:%u]\n",
+				   lbno, llen, lnr);
+			error = xfs_reflinkbt_update(cur, lbno, agbno - lbno,
+					lnr);
+			if (error)
+				goto error0;
+
+			error = xfs_btree_increment(cur, 0, &i);
+			if (error)
+				goto error0;
+
+			error = xfs_reflinkbt_insert(cur, agbno,
+					llen - (agbno - lbno), lnr, &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(mp, i == 1, error0);
+		}
+	}
+
+	/*
+	 * Split a right rlextent that crosses agbend.
+	 */
+	agbend = agbno + aglen - 1;
+	error = xfs_reflink_lookup_le(cur, agbend, &have);
+	if (error)
+		goto error0;
+	if (have) {
+		error = xfs_reflink_get_rec(cur, &rbno, &rlen, &rnr, &i);
+		if (error)
+			goto error0;
+		XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, rbno, rlen, rnr, error0);
+		if (agbend + 1 != mp->m_sb.sb_agblocks &&
+		    agbend + 1 < rbno + rlen) {
+			dbg_printk("split rext crossing agbend [%u:%u:%u]\n",
+				   rbno, rlen, rnr);
+			error = xfs_reflinkbt_update(cur, agbend + 1,
+					rlen - (agbend - rbno + 1), rnr);
+			if (error)
+				goto error0;
+
+			error = xfs_reflinkbt_insert(cur, rbno,
+					agbend - rbno + 1, rnr, &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(mp, i == 1, error0);
+		}
+	}
+
+	/*
+	 * Start iterating the range we're adjusting.  rlextent boundaries
+	 * should be at agbno and agbend.
+	 */
+	bno = agbno;
+	len = aglen;
+	while (len > 0) {
+		llen = clen = rlen = 0;
+		real_crl = false;
+		/*
+		 * Look up the current and left rlextents.
+		 */
+		error = xfs_reflink_lookup_le(cur, bno, &have);
+		if (error)
+			goto error0;
+		if (have) {
+			error = xfs_reflink_get_rec(cur, &cbno, &clen, &cnr,
+						    &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, cbno, clen, cnr,
+						      error0);
+			if (cbno != bno) {
+				/*
+				 * bno points to a hole; this is the left rlext.
+				 */
+				ASSERT((unsigned long long)lbno + llen <= bno);
+				lbno = cbno;
+				llen = clen;
+				lnr = cnr;
+
+				cbno = bno;
+				clen = len;
+				cnr = 1;
+			} else {
+				real_crl = true;
+				/*
+				 * Go find the left rlext.
+				 */
+				error = xfs_btree_decrement(cur, 0, &have);
+				if (error)
+					goto error0;
+				if (have) {
+					error = xfs_reflink_get_rec(cur, &lbno,
+							&llen, &lnr, &i);
+					if (error)
+						goto error0;
+					XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i,
+							lbno, llen, lnr,
+							error0);
+					ASSERT((unsigned long long)lbno + llen <= bno);
+				}
+				error = xfs_btree_increment(cur, 0, &have);
+				if (error)
+					goto error0;
+			}
+		} else {
+			/*
+			 * No left extent; just invent our current rlextent.
+			 */
+			cbno = bno;
+			clen = len;
+			cnr = 1;
+		}
+
+		/*
+		 * If the left rlext isn't adjacent, forget about it.
+		 */
+		if (llen > 0 && lbno + llen != bno)
+			llen = 0;
+
+		/*
+		 * Look up the right rlextent.
+		 */
+		error = xfs_btree_increment(cur, 0, &have);
+		if (error)
+			goto error0;
+		if (have) {
+			error = xfs_reflink_get_rec(cur, &rbno, &rlen, &rnr,
+						    &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, rbno, rlen, rnr,
+						      error0);
+			if (agbno + aglen < rbno)
+				rlen = 0;
+			if (!real_crl)
+				clen = min(clen, rbno - cbno);
+			ASSERT((unsigned long long)cbno + clen <= rbno);
+		}
+
+		/*
+		 * Point the cursor to cbno (or where it will be inserted).
+		 */
+		if (real_crl) {
+			error = xfs_btree_decrement(cur, 0, &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(mp, i == 1, error0);
+		}
+		ASSERT(clen > 0);
+		ASSERT(cbno == bno);
+		ASSERT(cbno >= agbno);
+		ASSERT((unsigned long long)cbno + clen <=
+		       (unsigned long long)agbno + aglen);
+		if (real_crl)
+			ASSERT(cnr > 1);
+		else
+			ASSERT(cnr == 1);
+		new_cnr = cnr + adj;
+
+#ifdef REFLINK_DEBUG
+		{
+		xfs_agblock_t gbno;
+		xfs_extlen_t glen;
+		xfs_nlink_t gnr;
+		xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
+		printk(KERN_ERR "%s: insert ag=%u [%u:%u:%d] ", __func__,
+		       agno, agbno, agbend, adj);
+		if (llen)
+			printk("l:[%u,%u,%u] ", lbno, llen, lnr);
+		printk("[%u,%u,%u,%d] ", cbno, clen, cnr, real_crl);
+		if (rlen)
+			printk("r:[%u,%u,%u] ", rbno, rlen, rnr);
+		printk("\n");
+		dump_cur_loc(cur, "cur", __LINE__);
+		}
+#endif
+		/*
+		 * Nothing to do when unmapping a range of blocks with
+		 * a single owner.
+		 */
+		if (new_cnr == 0) {
+			dbg_printk("single-owner blocks; ignoring\n");
+			goto advloop;
+		}
+
+		/*
+		 * These blocks have hit MAXRLCOUNT; keep it that way.
+		 */
+		if (cnr == MAXRLCOUNT) {
+			dbg_printk("hit MAXRLCOUNT; moving on\n");
+			goto advloop;
+		}
+
+		/*
+		 * Try to merge with left and right rlexts outside range.
+		 */
+		if (llen > 0 && rlen > 0 &&
+		    lbno + llen == agbno &&
+		    rbno == agbend + 1 &&
+		    lbno + llen + clen == rbno &&
+		    (unsigned long long)llen + clen + rlen < MAXRLEXTLEN &&
+		    lnr == rnr &&
+		    lnr == new_cnr) {
+			dbg_printk("merge l/c/rext\n");
+			error = xfs_reflinkbt_delete(cur, &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(mp, i == 1, error0);
+			if (real_crl) {
+				error = xfs_reflinkbt_delete(cur, &i);
+				if (error)
+					goto error0;
+				XFS_WANT_CORRUPTED_GOTO(mp, i == 1, error0);
+			}
+
+			error = xfs_btree_decrement(cur, 0, &have);
+			if (error)
+				goto error0;
+			error = xfs_reflinkbt_update(cur, lbno,
+					llen + clen + rlen, lnr);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(mp, have == 1, error0);
+			break;
+		}
+
+		/*
+		 * Try to merge with left rlext outside the range.
+		 */
+		if (llen > 0 &&
+		    lbno + llen == agbno &&
+		    lnr == new_cnr &&
+		    (unsigned long long)llen + clen < MAXRLEXTLEN) {
+			dbg_printk("merge l/cext\n");
+			if (real_crl) {
+				error = xfs_reflinkbt_delete(cur, &i);
+				if (error)
+					goto error0;
+				XFS_WANT_CORRUPTED_GOTO(mp, i == 1, error0);
+			}
+
+			error = xfs_btree_decrement(cur, 0, &have);
+			if (error)
+				goto error0;
+			error = xfs_reflinkbt_update(cur, lbno,
+					llen + clen, lnr);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(mp, have == 1, error0);
+			goto advloop;
+		}
+
+		/*
+		 * Try to merge with right rlext outside the range.
+		 */
+		if (rlen > 0 &&
+		    rbno == agbend + 1 &&
+		    rnr == new_cnr &&
+		    cbno + clen == rbno &&
+		    (unsigned long long)clen + rlen < MAXRLEXTLEN) {
+			dbg_printk("merge c/rext\n");
+			if (real_crl) {
+				error = xfs_reflinkbt_delete(cur, &i);
+				if (error)
+					goto error0;
+				XFS_WANT_CORRUPTED_GOTO(mp, i == 1, error0);
+			}
+
+			error = xfs_reflinkbt_update(cur, cbno,
+					clen + rlen, rnr);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(mp, have == 1, error0);
+			break;
+		}
+
+		/*
+		 * rlext is no longer reflinked; remove it from tree.
+		 */
+		if (new_cnr == 1 && adj < 0) {
+			dbg_printk("remove cext\n");
+			ASSERT(real_crl == true);
+			error = xfs_reflinkbt_delete(cur, &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(mp, i == 1, error0);
+			goto advloop;
+		}
+
+		/*
+		 * rlext needs to be added to the tree.
+		 */
+		if (new_cnr == 2 && adj > 0) {
+			dbg_printk("insert cext\n");
+			error = xfs_reflinkbt_insert(cur, cbno, clen,
+					new_cnr, &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(mp, i == 1, error0);
+			goto advloop;
+		}
+
+		/*
+		 * Update rlext.
+		 */
+		dbg_printk("update cext\n");
+		ASSERT(new_cnr >= 2);
+		error = xfs_reflinkbt_update(cur, cbno, clen, new_cnr);
+		if (error)
+			goto error0;
+
+advloop:
+		bno += clen;
+		len -= clen;
+	}
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+error0:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_reflink_btree.h b/fs/xfs/libxfs/xfs_reflink_btree.h
index a27588a..d0785ff 100644
--- a/fs/xfs/libxfs/xfs_reflink_btree.h
+++ b/fs/xfs/libxfs/xfs_reflink_btree.h
@@ -67,4 +67,8 @@ extern int xfs_reflink_lookup_ge(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 extern int xfs_reflink_get_rec(struct xfs_btree_cur *cur, xfs_agblock_t *bno,
 		xfs_extlen_t *len, xfs_nlink_t *nlink, int *stat);
 
+extern int xfs_reflinkbt_adjust_refcount(struct xfs_mount *, struct xfs_trans *,
+		struct xfs_buf *, xfs_agnumber_t, xfs_agblock_t, xfs_extlen_t,
+		int);
+
 #endif	/* __XFS_REFLINK_BTREE_H__ */

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 03/14] libxfs: support unmapping reflink blocks
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
  2015-06-25 23:39 ` [PATCH 01/14] xfs: create a per-AG btree to track reference counts Darrick J. Wong
  2015-06-25 23:39 ` [PATCH 02/14] libxfs: adjust refcounts in reflink btree Darrick J. Wong
@ 2015-06-25 23:39 ` Darrick J. Wong
  2015-07-01  1:26   ` Dave Chinner
  2015-06-25 23:39 ` [PATCH 04/14] libxfs: block-mapper changes to support reflink Darrick J. Wong
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:39 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

When we're unmapping blocks from a file, we need to decrease refcounts
in the btree and only free blocks if their refcount is 1.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c          |    5 +
 fs/xfs/libxfs/xfs_reflink_btree.c |  140 +++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_reflink_btree.h |    4 +
 3 files changed, 147 insertions(+), 2 deletions(-)
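
To make the "only free blocks if their refcount is 1" rule concrete, here is a hedged userspace sketch of the range walk. It assumes the shared extents are handed over as a sorted, non-overlapping array instead of being read through a btree cursor, and it only counts blocks instead of queueing them on a free list. struct toy_shared_ext and toy_blocks_to_free() are hypothetical names, not part of this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: blocks [bno, bno + len) have nr >= 2 owners. */
struct toy_shared_ext {
	unsigned int	bno;
	unsigned int	len;
	unsigned int	nr;
};

/*
 * Walk [bno, bno + len) against sorted, non-overlapping shared extents.
 * Returns the number of blocks with no record (single owner), which are
 * the only ones that may really be freed.  *still_shared receives the
 * number of blocks that other owners still map.
 */
static unsigned int
toy_blocks_to_free(const struct toy_shared_ext *recs, size_t nrecs,
		   unsigned int bno, unsigned int len,
		   unsigned int *still_shared)
{
	unsigned int	end = bno + len;	/* one past the last block */
	unsigned int	pos = bno;
	unsigned int	freed = 0;
	size_t		i;

	assert(still_shared != NULL);
	*still_shared = 0;

	for (i = 0; i < nrecs && pos < end; i++) {
		unsigned int	rstart = recs[i].bno;
		unsigned int	rend = recs[i].bno + recs[i].len;
		unsigned int	ov_end;

		if (rend <= pos)
			continue;	/* record lies left of the range */
		if (rstart >= end)
			break;		/* record lies right of the range */
		if (rstart > pos) {
			freed += rstart - pos;	/* gap: single-owner blocks */
			pos = rstart;
		}
		ov_end = rend < end ? rend : end;
		*still_shared += ov_end - pos;	/* overlap: keep these blocks */
		pos = ov_end;
	}
	freed += end - pos;		/* tail after the last record */
	return freed;
}
```

In this model the shared part of the range is never freed; it only has its count dropped by one afterwards, which is the job of the refcount adjustment from the previous patch.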


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 057fa9a..3f5e8da 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -45,6 +45,7 @@
 #include "xfs_symlink.h"
 #include "xfs_attr_leaf.h"
 #include "xfs_filestream.h"
+#include "xfs_reflink_btree.h"
 
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
@@ -4984,8 +4985,8 @@ xfs_bmap_del_extent(
 	 * If we need to, add to list of extents to delete.
 	 */
 	if (do_fx)
-		xfs_bmap_add_free(mp, flist, del->br_startblock,
-				  del->br_blockcount, ip->i_ino);
+		xfs_reflink_bmap_add_free(mp, flist, del->br_startblock,
+					  del->br_blockcount, ip->i_ino, tp);
 	/*
 	 * Adjust inode # blocks in the file.
 	 */
diff --git a/fs/xfs/libxfs/xfs_reflink_btree.c b/fs/xfs/libxfs/xfs_reflink_btree.c
index 380ed72..f40ba1f 100644
--- a/fs/xfs/libxfs/xfs_reflink_btree.c
+++ b/fs/xfs/libxfs/xfs_reflink_btree.c
@@ -935,3 +935,143 @@ error0:
 	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
 	return error;
 }
+
+/**
+ * xfs_reflink_bmap_add_free() - release a range of blocks
+ *
+ * @mp: XFS mount object
+ * @flist: List of blocks to be freed at the end of the transaction
+ * @fsbno: First fs block of the range to release
+ * @fslen: Length of range
+ * @owner: owner of the extent
+ * @tp: transaction that goes with the free operation
+ */
+int
+xfs_reflink_bmap_add_free(
+	struct xfs_mount	*mp,		/* mount point structure */
+	xfs_bmap_free_t		*flist,		/* list of extents */
+	xfs_fsblock_t		fsbno,		/* fs block number of extent */
+	xfs_filblks_t		fslen,		/* length of extent */
+	uint64_t		owner,		/* extent owner */
+	struct xfs_trans	*tp)		/* transaction */
+{
+	struct xfs_btree_cur	*cur;
+	int			error;
+	struct xfs_buf		*agbp;
+	xfs_agnumber_t		agno;		/* allocation group number */
+	xfs_agblock_t		agbno;		/* ag start of range to free */
+	xfs_agblock_t		agbend;		/* ag end of range to free */
+	xfs_extlen_t		aglen;		/* ag length of range to free */
+	int			i, have;
+	xfs_agblock_t		lbno;		/* rlextent start */
+	xfs_extlen_t		llen;		/* rlextent length */
+	xfs_nlink_t		lnr;		/* rlextent refcount */
+	xfs_agblock_t		bno;		/* rlext block # in loop */
+	xfs_extlen_t		len;		/* rlext length in loop */
+	unsigned long long	blocks_freed;
+	xfs_fsblock_t		range_fsb;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb)) {
+		xfs_bmap_add_free(mp, flist, fsbno, fslen, owner);
+		return 0;
+	}
+
+	agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	CHECK_AG_NUMBER(mp, agno);
+	ASSERT(fslen < mp->m_sb.sb_agblocks);
+	CHECK_AG_EXTENT(mp, agbno, fslen);
+	aglen = fslen;
+
+	/*
+	 * Drop reference counts in the reflink tree.
+	 */
+	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agbp);
+	if (error)
+		return error;
+
+	/*
+	 * Grab a rl btree cursor.
+	 */
+	cur = xfs_reflinkbt_init_cursor(mp, tp, agbp, agno);
+	bno = agbno;
+	len = aglen;
+	agbend = agbno + aglen - 1;
+	blocks_freed = 0;
+
+	/*
+	 * Account for a left extent that partially covers our range.
+	 */
+	error = xfs_reflink_lookup_le(cur, bno, &have);
+	if (error)
+		goto error0;
+	if (have) {
+		error = xfs_reflink_get_rec(cur, &lbno, &llen, &lnr, &i);
+		if (error)
+			goto error0;
+		XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, lbno, llen, lnr, error0);
+		if (lbno + llen > bno) {
+			blocks_freed += min(len, lbno + llen - bno);
+			bno += blocks_freed;
+			len -= blocks_freed;
+		}
+	}
+
+	while (len > 0) {
+		/*
+		 * Go find the next rlext.
+		 */
+		range_fsb = XFS_AGB_TO_FSB(mp, agno, bno);
+		error = xfs_btree_increment(cur, 0, &have);
+		if (error)
+			goto error0;
+		if (!have) {
+			/*
+			 * There's no right rlextent, so free bno to the end.
+			 */
+			lbno = bno + len;
+			llen = 0;
+		} else {
+			/*
+			 * Find the next rlextent.
+			 */
+			error = xfs_reflink_get_rec(cur, &lbno, &llen,
+					&lnr, &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, lbno, llen, lnr,
+						      error0);
+			if (lbno >= bno + len) {
+				lbno = bno + len;
+				llen = 0;
+			}
+		}
+
+		/*
+		 * Free everything up to the start of the rlextent and
+		 * account for still-mapped blocks.
+		 */
+		if (lbno - bno > 0) {
+			xfs_bmap_add_free(mp, flist, range_fsb, lbno - bno,
+					  owner);
+			len -= lbno - bno;
+			bno += lbno - bno;
+		}
+		llen = min(llen, agbend + 1 - lbno);
+		blocks_freed += llen;
+		len -= llen;
+		bno += llen;
+	}
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	error = xfs_reflinkbt_adjust_refcount(mp, tp, agbp, agno, agbno, aglen,
+					      -1);
+	xfs_trans_brelse(tp, agbp);
+
+	return error;
+error0:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	xfs_trans_brelse(tp, agbp);
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_reflink_btree.h b/fs/xfs/libxfs/xfs_reflink_btree.h
index d0785ff..4ea0ac4 100644
--- a/fs/xfs/libxfs/xfs_reflink_btree.h
+++ b/fs/xfs/libxfs/xfs_reflink_btree.h
@@ -71,4 +71,8 @@ extern int xfs_reflinkbt_adjust_refcount(struct xfs_mount *, struct xfs_trans *,
 		struct xfs_buf *, xfs_agnumber_t, xfs_agblock_t, xfs_extlen_t,
 		int);
 
+extern int xfs_reflink_bmap_add_free(struct xfs_mount *mp,
+		xfs_bmap_free_t *flist, xfs_fsblock_t fsbno, xfs_filblks_t len,
+		uint64_t owner, struct xfs_trans *tp);
+
 #endif	/* __XFS_REFLINK_BTREE_H__ */

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 04/14] libxfs: block-mapper changes to support reflink
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2015-06-25 23:39 ` [PATCH 03/14] libxfs: support unmapping reflink blocks Darrick J. Wong
@ 2015-06-25 23:39 ` Darrick J. Wong
  2015-06-25 23:39 ` [PATCH 05/14] xfs: add reflink functions and ioctl Darrick J. Wong
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:39 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Modify the XFS block mapper routine to know how to "allocate" blocks
that already exist, for the purpose of mapping them into a second
file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   19 +++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap.h |    2 ++
 2 files changed, 21 insertions(+)
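
The idea of "allocating" blocks that already exist can be sketched in a few lines of userspace C. This is only a model of the early return added to xfs_bmap_alloc(): the struct, the bump allocator and every toy_* name are assumptions made up for the example, and only the flag value is taken from the patch.

```c
#include <assert.h>

#define TOY_BMAPI_REFLINK	0x080	/* same value as XFS_BMAPI_REFLINK below */

/* Hypothetical, heavily trimmed stand-in for struct xfs_bmalloca. */
struct toy_bmalloca {
	unsigned int		flags;
	unsigned long long	firstblock;	/* caller-chosen physical block */
	unsigned long long	length;		/* blocks to map */
	unsigned long long	blkno;		/* result: start of the mapping */
	unsigned long long	nblocks;	/* stand-in for di_nblocks */
};

/*
 * "Allocate" space for a mapping.  With the reflink flag set, no new
 * space is taken: the mapping simply points at the block the caller
 * passed in.  Otherwise a trivial bump allocator hands out fresh blocks.
 */
static int
toy_bmap_alloc(struct toy_bmalloca *ap, unsigned long long *next_free)
{
	assert(ap->length > 0);

	if (ap->flags & TOY_BMAPI_REFLINK) {
		ap->blkno = ap->firstblock;	/* share the existing blocks */
	} else {
		ap->blkno = *next_free;		/* take brand-new blocks */
		*next_free += ap->length;
	}
	ap->nblocks += ap->length;		/* the file grows either way */
	return 0;
}
```

The point of the model is that the reflink path charges the blocks to the inode's block count but leaves the free-space state untouched.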


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 3f5e8da..05e8346 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3897,6 +3897,13 @@ STATIC int
 xfs_bmap_alloc(
 	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
 {
+	if (ap->flags & XFS_BMAPI_REFLINK) {
+		ap->blkno = *ap->firstblock;
+		ap->ip->i_d.di_nblocks += ap->length;
+		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
+		return 0;
+	}
+
 	if (XFS_IS_REALTIME_INODE(ap->ip) && ap->userdata)
 		return xfs_bmap_rtalloc(ap);
 	return xfs_bmap_btalloc(ap);
@@ -4519,6 +4526,12 @@ xfs_bmapi_write(
 	ASSERT(len > 0);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	if (whichfork == XFS_ATTR_FORK)
+		ASSERT(!(flags & XFS_BMAPI_REFLINK));
+	if (flags & XFS_BMAPI_REFLINK) {
+		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
+		ASSERT(!(flags & XFS_BMAPI_CONVERT));
+	}
 
 	if (unlikely(XFS_TEST_ERROR(
 	    (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
@@ -4568,6 +4581,12 @@ xfs_bmapi_write(
 		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
 
 		/*
+		 * Make sure we only reflink into a hole.
+		 */
+		if (flags & XFS_BMAPI_REFLINK)
+			ASSERT(inhole);
+
+		/*
 		 * First, deal with the hole before the allocated space
 		 * that we found, if any.
 		 */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 674819f..908caaf 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -110,6 +110,8 @@ typedef	struct xfs_bmap_free
  */
 #define XFS_BMAPI_CONVERT	0x040
 
+#define XFS_BMAPI_REFLINK	0x080	/* map the inode to this exact block. */
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 05/14] xfs: add reflink functions and ioctl
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2015-06-25 23:39 ` [PATCH 04/14] libxfs: block-mapper changes to support reflink Darrick J. Wong
@ 2015-06-25 23:39 ` Darrick J. Wong
  2015-06-25 23:39 ` [PATCH 06/14] xfs: implement copy-on-write for reflinked blocks Darrick J. Wong
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:39 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Add to XFS the ability to share arbitrary blocks between one file and
another (reflink).  The userspace ioctl uses the same interface as
the btrfs ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |   10 ++
 fs/xfs/xfs_ioctl.c     |  178 +++++++++++++++++++++++++++++
 fs/xfs/xfs_ioctl32.c   |    2 
 fs/xfs/xfs_reflink.c   |  296 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h   |   24 ++++
 6 files changed, 511 insertions(+)
 create mode 100644 fs/xfs/xfs_reflink.c
 create mode 100644 fs/xfs/xfs_reflink.h
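
For anyone who wants to poke at the interface from userspace, a minimal sketch follows. It copies the struct layout and ioctl numbers from the xfs_fs.h hunk below, using int64_t/uint64_t in place of __s64/__u64 on the assumption that they are layout-compatible. reflink_whole_file() and reflink_range() are hypothetical helper names, not part of xfsprogs or of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Mirrors struct xfs_ioctl_clone_range_args from this patch. */
struct xfs_ioctl_clone_range_args {
	int64_t		src_fd;
	uint64_t	src_offset;
	uint64_t	src_length;
	uint64_t	dest_offset;
};

static_assert(sizeof(struct xfs_ioctl_clone_range_args) == 32,
	      "four 64-bit fields, no padding expected");

#define XFS_IOC_CLONE		_IOW(0x94, 9, int)
#define XFS_IOC_CLONE_RANGE	_IOW(0x94, 13, struct xfs_ioctl_clone_range_args)

/* Reflink all of src_fd into dest_fd; the ioctl is issued on the destination. */
static int
reflink_whole_file(int dest_fd, int src_fd)
{
	return ioctl(dest_fd, XFS_IOC_CLONE, src_fd);
}

/* Reflink len bytes; a len of 0 asks the kernel to go to the end of src_fd. */
static int
reflink_range(int dest_fd, int src_fd, uint64_t src_off, uint64_t len,
	      uint64_t dest_off)
{
	struct xfs_ioctl_clone_range_args args = {
		.src_fd		= src_fd,
		.src_offset	= src_off,
		.src_length	= len,
		.dest_offset	= dest_off,
	};

	return ioctl(dest_fd, XFS_IOC_CLONE_RANGE, &args);
}
```

Going by the checks in xfs_ioctl_reflink(), the source has to be open for reading, the destination open for writing without O_APPEND, both must be regular files on the same mount, and the offsets must be aligned to the filesystem block size; otherwise the call fails with -1 and errno set.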


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index ba89aee..eb9dc8e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -87,6 +87,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_reflink.o \
 				   xfs_super.o \
 				   xfs_symlink.o \
 				   xfs_sysfs.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 9fbdb86..92f21e1 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -560,6 +560,16 @@ typedef struct xfs_swapext
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
+/* reflink ioctls; these should match btrfs */
+struct xfs_ioctl_clone_range_args {
+	__s64 src_fd;
+	__u64 src_offset;
+	__u64 src_length;
+	__u64 dest_offset;
+};
+
+#define XFS_IOC_CLONE		 _IOW (0x94, 9, int)
+#define XFS_IOC_CLONE_RANGE	 _IOW (0x94, 13, struct xfs_ioctl_clone_range_args)
 
 #ifndef HAVE_BBMACROS
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index ea7d85a..efc6e8d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -40,6 +40,7 @@
 #include "xfs_symlink.h"
 #include "xfs_trans.h"
 #include "xfs_pnfs.h"
+#include "xfs_reflink.h"
 
 #include <linux/capability.h>
 #include <linux/dcache.h>
@@ -48,6 +49,8 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/exportfs.h>
+#include <linux/fsnotify.h>
+#include <linux/security.h>
 
 /*
  * xfs_find_handle maps from userspace xfs_fsop_handlereq structure to
@@ -1502,6 +1505,145 @@ xfs_ioc_swapext(
 	return error;
 }
 
+static int
+wait_for_io(
+	struct inode	*inode,
+	loff_t		offset,
+	size_t		len)
+{
+	loff_t		rounding;
+	loff_t		ioffset;
+	loff_t		iendoffset;
+	loff_t		bs;
+	int		ret;
+
+	bs = inode->i_sb->s_blocksize;
+	inode_dio_wait(inode);
+
+	rounding = max_t(xfs_off_t, bs, PAGE_CACHE_SIZE);
+	ioffset = round_down(offset, rounding);
+	iendoffset = round_up(offset + len, rounding) - 1;
+	ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
+					   iendoffset);
+	return ret;
+}
+
+static int
+xfs_ioctl_reflink(
+	struct file	*file_in,
+	loff_t		pos_in,
+	struct file	*file_out,
+	loff_t		pos_out,
+	size_t		len)
+{
+	struct inode	*inode_in;
+	struct inode	*inode_out;
+	ssize_t		ret;
+	loff_t		bs;
+	loff_t		isize;
+	int		same_inode;
+	loff_t		blen;
+
+	if (len == 0)
+		return 0;
+	else if (len != ~0ULL && (ssize_t)len < 0)
+		return -EINVAL;
+
+	/* Do we have the correct permissions? */
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND))
+		return -EPERM;
+	ret = security_file_permission(file_out, MAY_WRITE);
+	if (ret)
+		return ret;
+
+	inode_in = file_inode(file_in);
+	inode_out = file_inode(file_out);
+	bs = inode_out->i_sb->s_blocksize;
+
+	/* Don't touch certain kinds of inodes */
+	if (IS_IMMUTABLE(inode_out))
+		return -EPERM;
+	if (IS_SWAPFILE(inode_in) ||
+	    IS_SWAPFILE(inode_out))
+		return -ETXTBSY;
+
+	/* Reflink only works within this filesystem. */
+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;
+	same_inode = (inode_in->i_ino == inode_out->i_ino);
+
+	/* Don't reflink dirs, pipes, sockets... */
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		return -EISDIR;
+	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
+		return -ESPIPE;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		return -EINVAL;
+
+	/* Are we going all the way to the end? */
+	isize = i_size_read(inode_in);
+	if (isize == 0)
+		return 0;
+	if (len == ~0ULL)
+		len = isize - pos_in;
+
+	/* Ensure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > isize)
+		return -EINVAL;
+
+	/* If we're linking to EOF, continue to the block boundary. */
+	if (pos_in + len == isize)
+		blen = ALIGN(isize, bs) - pos_in;
+	else
+		blen = len;
+
+	/* Only reflink if we're aligned to block boundaries */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
+	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
+		return -EINVAL;
+
+	/* Don't allow overlapped reflink within the same file */
+	if (same_inode && pos_out + blen > pos_in && pos_out < pos_in + blen)
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	/* Wait for the completion of any pending IOs on both files */
+	ret = wait_for_io(inode_in, pos_in, len);
+	if (ret)
+		goto out_unlock;
+	ret = wait_for_io(inode_out, pos_out, len);
+	if (ret)
+		goto out_unlock;
+
+	ret = xfs_reflink(XFS_I(inode_in), pos_in, XFS_I(inode_out), pos_out, len);
+	if (ret < 0)
+		goto out_unlock;
+
+	/* Truncate the page cache so we don't see stale data */
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_CACHE_ALIGN(pos_out + len) - 1);
+
+out_unlock:
+	if (ret == 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, len);
+		fsnotify_modify(file_out);
+		add_wchar(current, len);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+	return ret;
+}
+
 /*
  * Note: some of the ioctl's return positive numbers as a
  * byte count indicating success, such as readlink_by_handle.
@@ -1800,6 +1942,42 @@ xfs_file_ioctl(
 		return xfs_icache_free_eofblocks(mp, &keofb);
 	}
 
+	case XFS_IOC_CLONE: {
+		struct fd src;
+
+		src = fdget(p);
+		if (!src.file)
+			return -EBADF;
+
+		error = xfs_ioctl_reflink(src.file, 0, filp, 0, ~0ULL);
+		fdput(src);
+		if (error > 0)
+			error = 0;
+
+		return error;
+	}
+
+	case XFS_IOC_CLONE_RANGE: {
+		struct fd src;
+		struct xfs_ioctl_clone_range_args args;
+
+		if (copy_from_user(&args, arg, sizeof(args)))
+			return -EFAULT;
+		src = fdget(args.src_fd);
+		if (!src.file)
+			return -EBADF;
+		if (args.src_length == 0)
+			args.src_length = ~0ULL;
+
+		error = xfs_ioctl_reflink(src.file, args.src_offset, filp,
+					  args.dest_offset, args.src_length);
+		fdput(src);
+		if (error > 0)
+			error = 0;
+
+		return error;
+	}
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index b88bdc8..76d8729 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -558,6 +558,8 @@ xfs_file_compat_ioctl(
 	case XFS_IOC_GOINGDOWN:
 	case XFS_IOC_ERROR_INJECTION:
 	case XFS_IOC_ERROR_CLEARALL:
+	case XFS_IOC_CLONE:
+	case XFS_IOC_CLONE_RANGE:
 		return xfs_file_ioctl(filp, cmd, p);
 #ifndef BROKEN_X86_ALIGNMENT
 	/* These are handled fine if no alignment issues */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
new file mode 100644
index 0000000..ce5feeb
--- /dev/null
+++ b/fs/xfs/xfs_reflink.c
@@ -0,0 +1,296 @@
+/*
+ * Copyright (c) 2015 Oracle.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_inode_item.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_error.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_ioctl.h"
+#include "xfs_trace.h"
+#include "xfs_log.h"
+#include "xfs_icache.h"
+#include "xfs_pnfs.h"
+#include "xfs_reflink_btree.h"
+#include "xfs_reflink.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_bit.h"
+#include "xfs_alloc.h"
+#include "xfs_quota_defs.h"
+#include "xfs_quota.h"
+
+/**
+ * xfs_reflink() - link a range of blocks from one inode to another
+ *
+ * @src: Inode to clone from
+ * @srcoff: Offset within source to start clone from
+ * @dest: Inode to clone to
+ * @destoff: Offset within @dest to start clone
+ * @len: Original length, passed by user, of range to clone
+ */
+int					/* error */
+xfs_reflink(
+	struct xfs_inode	*src,	/* XFS inode to copy extents from */
+	xfs_off_t		srcoff, /* offset in source file */
+	struct xfs_inode	*dest,	/* XFS inode to copy extents to */
+	xfs_off_t		destoff,/* offset in destination file */
+	xfs_off_t		len)	/* number of bytes to copy */
+{
+	struct xfs_mount	*mp = src->i_mount;
+	loff_t			uninitialized_var(offset);
+	xfs_fileoff_t		fsbno, dfsbno, fsbnext;
+	xfs_filblks_t		end;
+	int			error;
+	xfs_bmbt_irec_t		imaps[1];
+	int			nimaps = 1;
+	uint			resblks;
+	xfs_bmap_free_t		free_list;
+	xfs_bmbt_irec_t		map, dmap;
+	xfs_trans_t		*tp;
+	int			committed;
+	xfs_fsblock_t		firstfsb;
+	struct xfs_buf		*agbp;
+	xfs_agnumber_t		agno;		/* allocation group number */
+	xfs_agblock_t		agbno;
+	int			done;
+	xfs_off_t		blen = ALIGN(len, mp->m_sb.sb_blocksize);
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	/* For now, we won't reflink realtime inodes */
+	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
+		return -EINVAL;
+
+	/* Lock both files against IO */
+	if (src->i_ino == dest->i_ino) {
+		xfs_ilock(src, XFS_IOLOCK_EXCL);
+		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
+	} else {
+		xfs_lock_two_inodes(src, dest, XFS_IOLOCK_EXCL);
+		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
+	}
+
+	/*
+	 * Try to read extents from the first block indicated
+	 * by fsbno to the end block of the file.
+	 */
+	dfsbno = XFS_B_TO_FSBT(mp, destoff);
+	fsbno = fsbnext = XFS_B_TO_FSBT(mp, srcoff);
+	end = XFS_B_TO_FSB(mp, srcoff + blen);
+
+	/*
+	 * Unmap the destination range until done or until there is an error
+	 */
+	resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
+	error = done = 0;
+	while (!error && !done) {
+		/*
+		 * allocate and setup the transaction. Allow this
+		 * transaction to dip into the reserve blocks to ensure
+		 * the freeing of the space succeeds at ENOSPC.
+		 */
+		tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
+		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write, resblks, 0);
+
+		/*
+		 * check for running out of space
+		 */
+		if (error) {
+			/*
+			 * Free the transaction structure.
+			 */
+			ASSERT(error == -ENOSPC || XFS_FORCED_SHUTDOWN(mp));
+			goto error0;
+		}
+		error = xfs_trans_reserve_quota(tp, mp,
+				dest->i_udquot, dest->i_gdquot, dest->i_pdquot,
+				resblks, 0, XFS_QMOPT_RES_REGBLKS);
+		if (error)
+			goto error0;
+
+		xfs_ilock(dest, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
+
+		/*
+		 * issue the bunmapi() call to free the blocks
+		 */
+		xfs_bmap_init(&free_list, &firstfsb);
+		error = xfs_bunmapi(tp, dest, dfsbno,
+				  XFS_B_TO_FSBT(mp, destoff + blen) - dfsbno,
+				  0, 2, &firstfsb, &free_list, &done);
+		if (error)
+			goto error1;
+
+		/*
+		 * complete the transaction
+		 */
+		error = xfs_bmap_finish(&tp, &free_list, &committed);
+		if (error)
+			goto error0;
+
+		error = xfs_trans_commit(tp);
+	}
+	if (error)
+		goto out_unlock_io;
+
+	while (end - fsbnext > 0) {
+		/* Read extent from the source file */
+		nimaps = 1;
+		xfs_ilock(src, XFS_ILOCK_EXCL);
+		error = xfs_bmapi_read(src, fsbnext, end - fsbnext, &map,
+				       &nimaps, 0);
+		xfs_iunlock(src, XFS_ILOCK_EXCL);
+		if (error)
+			goto out_unlock_io;
+
+		/* No extents at given offset, must be beyond EOF */
+		if (nimaps == 0)
+			break;
+
+		if (map.br_startblock == HOLESTARTBLOCK ||
+		    map.br_startblock == DELAYSTARTBLOCK)
+			goto next;
+
+		/* Shrink the map to whatever we're linking */
+		dmap = map;
+		dmap.br_startoff = dfsbno + dmap.br_startoff - fsbno;
+		nimaps = 1;
+
+		/*
+		 * Allocate and setup the transaction.
+		 */
+		resblks = XFS_DIOSTRAT_SPACE_RES(mp, dmap.br_blockcount * 2);
+		tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
+		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write,
+					  resblks, 0);
+		/*
+		 * Check for running out of space
+		 */
+		if (error) {
+			/*
+			 * Free the transaction structure.
+			 */
+			ASSERT(error == -ENOSPC || XFS_FORCED_SHUTDOWN(mp));
+			goto error0;
+		}
+
+		xfs_ilock(dest, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
+
+		xfs_bmap_init(&free_list, &firstfsb);
+
+		/* Update the refcount tree */
+		agno = XFS_FSB_TO_AGNO(mp, dmap.br_startblock);
+		agbno = XFS_FSB_TO_AGBNO(mp, dmap.br_startblock);
+		error = xfs_alloc_read_agf(mp, tp, agno, 0, &agbp);
+		if (error)
+			goto error1;
+		error = xfs_reflinkbt_adjust_refcount(mp, tp, agbp, agno, agbno,
+					      dmap.br_blockcount, 1);
+		if (error)
+			goto error1;
+		xfs_trans_brelse(tp, agbp);
+
+		/* XXX: should this be a separate transaction? */
+
+		/* Add this extent to the destination file */
+		error = xfs_bmapi_write(tp, dest, dmap.br_startoff,
+					dmap.br_blockcount,
+					XFS_BMAPI_REFLINK, &dmap.br_startblock,
+					0, &imaps[0], &nimaps, &free_list);
+		if (error)
+			goto error1;
+
+		/*
+		 * Complete the transaction
+		 */
+		error = xfs_bmap_finish(&tp, &free_list, &committed);
+		if (error)
+			goto error0;
+
+		error = xfs_trans_commit(tp);
+		if (error)
+			goto out_unlock_io;
+
+		/* Keep going */
+next:
+		fsbnext = map.br_startoff + map.br_blockcount;
+	}
+
+	/* Update inode size */
+	if (destoff + len > i_size_read(VFS_I(dest))) {
+		tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
+		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
+
+		/*
+		 * check for running out of space
+		 */
+		if (error) {
+			/*
+			 * Free the transaction structure.
+			 */
+			ASSERT(error == -ENOSPC || XFS_FORCED_SHUTDOWN(mp));
+			goto error0;
+		}
+
+		xfs_ilock(dest, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
+
+		i_size_write(VFS_I(dest), destoff + len);
+		dest->i_d.di_size = destoff + len;
+		xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
+
+		error = xfs_trans_commit(tp);
+		if (error)
+			goto out_unlock_io;
+	}
+
+	goto out_unlock_io;
+
+error1:
+	/* Cancel bmap, unlock inode, unreserve quota blocks, cancel trans */
+	xfs_bmap_cancel(&free_list);
+error0:
+	xfs_trans_cancel(tp);
+
+out_unlock_io:
+	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
+	xfs_iunlock(src, XFS_IOLOCK_EXCL);
+	if (src->i_ino != dest->i_ino) {
+		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
+		xfs_iunlock(dest, XFS_IOLOCK_EXCL);
+	}
+
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
new file mode 100644
index 0000000..7cccd50
--- /dev/null
+++ b/fs/xfs/xfs_reflink.h
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) 2015 Oracle.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#ifndef __XFS_REFLINK_H
+#define __XFS_REFLINK_H 1
+
+extern int xfs_reflink(struct xfs_inode *src, xfs_off_t srcoff,
+	struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
+
+#endif /* __XFS_REFLINK_H */
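
As a reading aid for the byte-to-block arithmetic in xfs_reflink(), here
is a small userspace sketch.  The helpers b_to_fsbt(), b_to_fsb(),
align_up() and struct clone_range are hypothetical stand-ins for
XFS_B_TO_FSBT, XFS_B_TO_FSB and ALIGN; none of this is kernel code, and it
assumes a power-of-two block size passed as blocklog.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for XFS_B_TO_FSBT: bytes to blocks, rounding down. */
static uint64_t b_to_fsbt(uint64_t bytes, unsigned blocklog)
{
	return bytes >> blocklog;
}

/* Hypothetical stand-in for XFS_B_TO_FSB: bytes to blocks, rounding up. */
static uint64_t b_to_fsb(uint64_t bytes, unsigned blocklog)
{
	return (bytes + (1ULL << blocklog) - 1) >> blocklog;
}

/* Hypothetical stand-in for ALIGN(len, blocksize). */
static uint64_t align_up(uint64_t len, unsigned blocklog)
{
	uint64_t mask = (1ULL << blocklog) - 1;

	return (len + mask) & ~mask;
}

/* Source file-block range [first, end) and first destination block. */
struct clone_range {
	uint64_t first;
	uint64_t end;
	uint64_t dest_first;
};

static struct clone_range clone_range_init(uint64_t srcoff, uint64_t destoff,
					    uint64_t len, unsigned blocklog)
{
	struct clone_range r;
	uint64_t blen = align_up(len, blocklog);

	r.first = b_to_fsbt(srcoff, blocklog);
	r.end = b_to_fsb(srcoff + blen, blocklog);
	r.dest_first = b_to_fsbt(destoff, blocklog);
	return r;
}

/* Destination file block for a source extent starting at src_startoff. */
static uint64_t dest_startoff(const struct clone_range *r,
			      uint64_t src_startoff)
{
	return r->dest_first + src_startoff - r->first;
}
```

With 4k blocks (blocklog 12), cloning 10000 bytes from source offset 8192
to destination offset 40960 covers source file blocks 2 through 4, and a
source extent starting at block 3 lands at destination block 11.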

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 06/14] xfs: implement copy-on-write for reflinked blocks
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2015-06-25 23:39 ` [PATCH 05/14] xfs: add reflink functions and ioctl Darrick J. Wong
@ 2015-06-25 23:39 ` Darrick J. Wong
  2015-06-25 23:39 ` [PATCH 07/14] xfs: handle directio " Darrick J. Wong
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:39 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Implement a copy-on-write handler for the buffered write path.  When
writepages is called on a reflinked block, allocate a new block, log
an intent to free it (an EFI) so that it is released if we crash, and
then write the buffer to the new block.  Upon I/O completion, cancel
the free intent in the log and remap the file so that the changes
appear.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c    |   38 +++++-
 fs/xfs/xfs_aops.h    |    5 +
 fs/xfs/xfs_reflink.c |  340 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |   15 ++
 4 files changed, 393 insertions(+), 5 deletions(-)
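
For readers skimming the series, the fork/complete sequence described
above can be pictured with a toy userspace model.  This is only a sketch
under simplifying assumptions: a one-block file, no transactions, and a
boolean standing in for the logged free intent.  The names toy_file,
toy_cow, toy_fork_block() and toy_end_io() are made up for illustration
and do not exist in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A one-block "file": just the physical block it currently maps to. */
struct toy_file {
	uint32_t mapping;
};

/* Per-write CoW context, loosely analogous to xfs_reflink_end_io_t. */
struct toy_cow {
	uint32_t new_block;	/* freshly allocated block */
	bool intent_logged;	/* "free new_block if we crash" */
};

/*
 * Start forking.  Returns true if the block is shared and the write
 * must be redirected to cow->new_block; false means overwrite in place.
 */
static bool toy_fork_block(uint32_t refcount, uint32_t free_block,
			   struct toy_cow *cow)
{
	if (refcount < 2)
		return false;
	cow->new_block = free_block;
	cow->intent_logged = true;
	return true;
}

/* I/O completed: cancel the intent and point the file at the new block. */
static void toy_end_io(struct toy_file *f, struct toy_cow *cow)
{
	cow->intent_logged = false;
	f->mapping = cow->new_block;
}
```

Until toy_end_io() runs, the file still maps the old shared block, which
is why a crash in between only needs the new block to be freed.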


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index dc52698..be57e5d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -31,6 +31,8 @@
 #include "xfs_bmap.h"
 #include "xfs_bmap_util.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_reflink.h"
+#include <linux/aio.h>
 #include <linux/gfp.h>
 #include <linux/mpage.h>
 #include <linux/pagevec.h>
@@ -190,7 +192,8 @@ xfs_finish_ioend(
 	if (atomic_dec_and_test(&ioend->io_remaining)) {
 		struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
 
-		if (ioend->io_type == XFS_IO_UNWRITTEN)
+		if (ioend->io_type == XFS_IO_UNWRITTEN ||
+		    ioend->io_type == XFS_IO_FORKED)
 			queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
 		else if (ioend->io_append_trans)
 			queue_work(mp->m_data_workqueue, &ioend->io_work);
@@ -218,6 +221,19 @@ xfs_end_io(
 		goto done;
 
 	/*
+	 * If we forked the block, we need to remap the bmbt and possibly
+	 * finish up the i_size transaction too.
+	 */
+	if (ioend->io_type == XFS_IO_FORKED) {
+		error = xfs_reflink_end_io(ip->i_mount, ip, ioend);
+		if (error)
+			goto done;
+		if (ioend->io_append_trans)
+			error = xfs_setfilesize_ioend(ioend);
+		goto done;
+	}
+
+	/*
 	 * For unwritten extents we need to issue transactions to convert a
 	 * range to normal written extens after the data I/O has finished.
 	 */
@@ -268,6 +284,7 @@ xfs_alloc_ioend(
 	ioend->io_append_trans = NULL;
 
 	INIT_WORK(&ioend->io_work, xfs_end_io);
+	INIT_LIST_HEAD(&ioend->io_reflink_endio_list);
 	return ioend;
 }
 
@@ -567,7 +584,8 @@ xfs_add_to_ioend(
 	xfs_off_t		offset,
 	unsigned int		type,
 	xfs_ioend_t		**result,
-	int			need_ioend)
+	int			need_ioend,
+	xfs_reflink_end_io_t	*eio)
 {
 	xfs_ioend_t		*ioend = *result;
 
@@ -588,6 +606,8 @@ xfs_add_to_ioend(
 
 	bh->b_private = NULL;
 	ioend->io_size += bh->b_size;
+	if (eio)
+		list_add_tail(&eio->rlei_list, &ioend->io_reflink_endio_list);
 }
 
 STATIC void
@@ -788,7 +808,7 @@ xfs_convert_page(
 			if (type != XFS_IO_OVERWRITE)
 				xfs_map_at_offset(inode, bh, imap, offset);
 			xfs_add_to_ioend(inode, bh, offset, type,
-					 ioendp, done);
+					 ioendp, done, NULL);
 
 			page_dirty--;
 			count++;
@@ -951,6 +971,7 @@ xfs_vm_writepage(
 	int			err, imap_valid = 0, uptodate = 1;
 	int			count = 0;
 	int			nonblocking = 0;
+	struct xfs_inode	*ip = XFS_I(inode);
 
 	trace_xfs_writepage(inode, page, 0, 0);
 
@@ -1119,11 +1140,17 @@ xfs_vm_writepage(
 			imap_valid = xfs_imap_valid(inode, &imap, offset);
 		}
 		if (imap_valid) {
+			xfs_reflink_end_io_t *eio = NULL;
+
+			err = xfs_reflink_fork_block(ip, &imap, offset,
+						     &type, &eio);
+			if (err)
+				goto error;
 			lock_buffer(bh);
 			if (type != XFS_IO_OVERWRITE)
 				xfs_map_at_offset(inode, bh, &imap, offset);
 			xfs_add_to_ioend(inode, bh, offset, type, &ioend,
-					 new_ioend);
+					 new_ioend, eio);
 			count++;
 		}
 
@@ -1137,6 +1164,9 @@ xfs_vm_writepage(
 
 	xfs_start_page_writeback(page, 1, count);
 
+	if (err)
+		goto error;
+
 	/* if there is no IO to be submitted for this page, we are done */
 	if (!ioend)
 		return 0;
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 86afd1a..9cf206a 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -27,12 +27,14 @@ enum {
 	XFS_IO_DELALLOC,	/* covers delalloc region */
 	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
 	XFS_IO_OVERWRITE,	/* covers already allocated extent */
+	XFS_IO_FORKED,		/* covers copy-on-write region */
 };
 
 #define XFS_IO_TYPES \
 	{ XFS_IO_DELALLOC,		"delalloc" }, \
 	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
-	{ XFS_IO_OVERWRITE,		"overwrite" }
+	{ XFS_IO_OVERWRITE,		"overwrite" }, \
+	{ XFS_IO_FORKED,		"forked" }
 
 /*
  * xfs_ioend struct manages large extent writes for XFS.
@@ -50,6 +52,7 @@ typedef struct xfs_ioend {
 	xfs_off_t		io_offset;	/* offset in the file */
 	struct work_struct	io_work;	/* xfsdatad work queue */
 	struct xfs_trans	*io_append_trans;/* xact. for size update */
+	struct list_head	io_reflink_endio_list;/* remappings for CoW */
 } xfs_ioend_t;
 
 extern const struct address_space_operations xfs_address_space_operations;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index ce5feeb..39b29a4 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -45,6 +45,31 @@
 #include "xfs_alloc.h"
 #include "xfs_quota_defs.h"
 #include "xfs_quota.h"
+#include "xfs_btree.h"
+#include "xfs_bmap_btree.h"
+
+#define CHECK_AG_NUMBER(mp, agno) \
+	do { \
+		ASSERT((agno) != NULLAGNUMBER); \
+		ASSERT((agno) < (mp)->m_sb.sb_agcount); \
+	} while (0)
+
+#define CHECK_AG_EXTENT(mp, agbno, len) \
+	do { \
+		ASSERT((agbno) != NULLAGBLOCK); \
+		ASSERT((len) > 0); \
+		ASSERT((unsigned long long)(agbno) + (len) <= \
+				(mp)->m_sb.sb_agblocks); \
+	} while (0)
+
+#define XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, have, agbno, len, nr, label) \
+	do { \
+		XFS_WANT_CORRUPTED_GOTO((mp), (have) == 1, label); \
+		XFS_WANT_CORRUPTED_GOTO((mp), (len) > 0, label); \
+		XFS_WANT_CORRUPTED_GOTO((mp), (nr) >= 2, label); \
+		XFS_WANT_CORRUPTED_GOTO((mp), (unsigned long long)(agbno) + \
+				(len) <= (mp)->m_sb.sb_agblocks, label); \
+	} while (0)
 
 /**
  * xfs_reflink() - link a range of blocks from one inode to another
@@ -294,3 +319,318 @@ out_unlock_io:
 
 	return error;
 }
+
+/**
+ * xfs_reflink_get_refcount() - get refcount and extent length for a given pblk
+ *
+ * @mp: XFS mount object
+ * @agno: AG number
+ * @agbno: AG block number
+ * @len: length of extent
+ * @nr: refcount
+ */
+int
+xfs_reflink_get_refcount(
+	struct xfs_mount	*mp,		/* xfs mount object */
+	xfs_agnumber_t		agno,		/* allocation group number */
+	xfs_agblock_t		agbno,		/* ag block to look up */
+	xfs_extlen_t		*len,		/* out: length of extent */
+	xfs_nlink_t		*nr)		/* out: refcount */
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_buf		*agbp;
+	xfs_agblock_t		lbno;		/* rlextent start */
+	xfs_extlen_t		llen;		/* rlextent length */
+	xfs_nlink_t		lnr;		/* rlextent refcount */
+	xfs_extlen_t		aglen;
+	int			error;
+	int			i, have;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb)) {
+		*len = 0;
+		*nr = 1;
+		return 0;
+	}
+
+	CHECK_AG_NUMBER(mp, agno);
+	CHECK_AG_EXTENT(mp, agbno, 1);
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+	aglen = be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_length);
+	ASSERT(agbno < aglen);
+
+	/*
+	 * See if there's an extent covering the block we want.
+	 */
+	cur = xfs_reflinkbt_init_cursor(mp, NULL, agbp, agno);
+	error = xfs_reflink_lookup_le(cur, agbno, &have);
+	if (error)
+		goto error0;
+	if (!have)
+		goto hole;
+	error = xfs_reflink_get_rec(cur, &lbno, &llen, &lnr, &i);
+	if (error)
+		goto error0;
+	XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, lbno, llen, lnr, error0);
+	if (lbno + llen <= agbno)
+		goto hole;
+
+	*len = llen - (agbno - lbno);
+	*nr = lnr;
+	goto out;
+hole:
+	/*
+	 * We're in a hole, so pretend that we have a refcount=1 extent
+	 * going to the next rlextent or the end of the AG.
+	 */
+	error = xfs_btree_increment(cur, 0, &have);
+	if (error)
+		goto error0;
+	if (!have)
+		*len = aglen - agbno;
+	else {
+		error = xfs_reflink_get_rec(cur, &lbno, &llen,
+				&lnr, &i);
+		XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, lbno, llen, lnr, error0);
+		ASSERT(lbno + llen >= agbno);
+		*len = lbno - agbno;
+	}
+	*nr = 1;
+out:
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	xfs_buf_relse(agbp);
+	return error;
+error0:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	xfs_buf_relse(agbp);
+	return error;
+}
+
+/**
+ * xfs_reflink_fork_block() - start forking a block, if reflinked
+ *
+ * @ip: XFS inode object
+ * @imap: the fileoff:fsblock mapping that we might fork
+ * @offset: the file offset of the block we're examining
+ * @type: the ioend type
+ */
+int
+xfs_reflink_fork_block(
+	struct xfs_inode	*ip,		/* xfs inode object */
+	xfs_bmbt_irec_t		*imap,		/* in/out: block mapping */
+	xfs_off_t		offset,		/* file offset */
+	unsigned int		*type,		/* in/out: what kind of io is this? */
+	xfs_reflink_end_io_t	**peio)		/* out: reflink context for end_io */
+{
+	xfs_fsblock_t		fsbno;
+	xfs_off_t		iomap_offset;
+	xfs_agnumber_t		agno;		/* allocation group number */
+	xfs_agblock_t		agbno;		/* ag start of range to free */
+	xfs_alloc_arg_t		args;		/* allocation arguments */
+	xfs_extlen_t		len;		/* rlextent length */
+	xfs_nlink_t		nr;		/* rlextent refcount */
+	struct xfs_trans	*tp = NULL;
+	int			error;
+	xfs_reflink_end_io_t	*eio;
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+	if (*type == XFS_IO_DELALLOC || *type == XFS_IO_UNWRITTEN)
+		return 0;
+
+	iomap_offset = XFS_FSB_TO_B(mp, imap->br_startoff);
+	fsbno = imap->br_startblock + XFS_B_TO_FSB(mp, offset - iomap_offset);
+	agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	CHECK_AG_NUMBER(mp, agno);
+	CHECK_AG_EXTENT(mp, agbno, 1);
+	ASSERT(imap->br_state == XFS_EXT_NORM);
+
+	/*
+	 * See if there's an extent covering the block we want.  If so,
+	 * then this block is reflinked and must be forked.
+	 */
+	error = xfs_reflink_get_refcount(mp, agno, agbno, &len, &nr);
+	if (error)
+		return error;
+	ASSERT(len != 0);
+	if (nr < 2)
+		goto out;
+
+	/*
+	 * Ok, we have to fork this block.  First set up a transaction...
+	 */
+	tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE);
+	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write,
+				  XFS_DIOSTRAT_SPACE_RES(mp, 2), 0);
+	if (error)
+		goto error0;
+
+	/*
+	 * Now allocate a block, stash the new mapping, and add an EFI entry
+	 * so the block gets cleared if we crash.
+	 *
+	 * XXX: Ideally we'd scan up and down the incore extent list
+	 * looking for a block, but do this stupid thing for now.
+	 */
+	memset(&args, 0, sizeof(args));
+	args.tp = tp;
+	args.mp = mp;
+	args.type = XFS_ALLOCTYPE_START_BNO;
+	args.firstblock = imap->br_startblock;
+	args.fsbno = imap->br_startblock;
+	args.minlen = args.maxlen = args.prod = 1;
+	args.userdata = XFS_ALLOC_USERDATA;
+	error = xfs_alloc_vextent(&args);
+	if (error)
+		goto error0;
+	ASSERT(args.len == 1);
+
+	imap->br_startblock = args.fsbno;
+	imap->br_startoff = XFS_B_TO_FSB(mp, offset);
+	imap->br_blockcount = args.len;
+	imap->br_state = XFS_EXT_NORM;
+
+	eio = kmem_zalloc(sizeof(*eio), KM_SLEEP | KM_NOFS);
+	eio->rlei_efi = xfs_trans_get_efi(tp, 1);
+	eio->rlei_mapping = *imap;
+	xfs_trans_log_efi_extent(tp, eio->rlei_efi, imap->br_startblock,
+				 imap->br_blockcount);
+	*peio = eio;
+
+	/*
+	 * ...and we're done.
+	 */
+	*type = XFS_IO_FORKED;
+	error = xfs_trans_commit(tp);
+
+	return error;
+out:
+	return 0;
+error0:
+	xfs_trans_cancel(tp);
+	return error;
+}
+
+/**
+ * xfs_reflink_remap_after_io() - remap a range of file blocks after forking
+ *
+ * @mp: XFS mount object
+ * @ip: XFS inode object
+ * @eio: end_io context carrying the new mapping
+ */
+STATIC int
+xfs_reflink_remap_after_io(
+	struct xfs_mount	*mp,		/* XFS mount object */
+	struct xfs_inode	*ip,		/* inode */
+	xfs_reflink_end_io_t	*eio)		/* endio data */
+{
+	struct xfs_trans	*tp = NULL;
+	int			error;
+	xfs_agnumber_t		agno;		/* allocation group number */
+	xfs_agblock_t		agbno;		/* ag start of range to free */
+	xfs_fsblock_t		firstfsb;
+	int			committed;
+	xfs_bmbt_irec_t		imaps[1];
+	int			nimaps = 1;
+	int			done;
+	xfs_bmap_free_t		free_list;
+	xfs_bmbt_irec_t		*imap = &eio->rlei_mapping;
+	struct xfs_efd_log_item	*efd;
+	unsigned int		resblks;
+
+	ASSERT(xfs_sb_version_hasreflink(&mp->m_sb));
+	agno = XFS_FSB_TO_AGNO(mp, imap->br_startblock);
+	agbno = XFS_FSB_TO_AGBNO(mp, imap->br_startblock);
+	CHECK_AG_NUMBER(mp, agno);
+	CHECK_AG_EXTENT(mp, agbno, 1);
+	ASSERT(imap->br_state == XFS_EXT_NORM);
+
+	ASSERT(!XFS_IS_REALTIME_INODE(ip));
+
+	/*
+	 * Set up a transaction -- we're munging the rlbt update, the unmap,
+	 * and the remap operation into one huge transaction.
+	 */
+	resblks = XFS_DIOSTRAT_SPACE_RES(mp, imap->br_blockcount * 3);
+	tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE);
+	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write, resblks, 0);
+	if (error) {
+		xfs_trans_cancel(tp);
+		return error;
+	}
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+
+	/*
+	 * Log an EFD to cancel the EFI.
+	 */
+	efd = xfs_trans_get_efd(tp, eio->rlei_efi, 1);
+	xfs_trans_log_efd_extent(tp, efd, imap->br_startblock,
+				 imap->br_blockcount);
+
+	/*
+	 * Remap the old blocks.
+	 */
+	xfs_bmap_init(&free_list, &firstfsb);
+	error = xfs_bunmapi(tp, ip, imap->br_startoff, imap->br_blockcount, 0,
+			imap->br_blockcount, &firstfsb, &free_list, &done);
+	if (error)
+		goto error2;
+
+	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
+					XFS_BMAPI_REFLINK, &imap->br_startblock,
+					0, &imaps[0], &nimaps, &free_list);
+	if (error)
+		goto error2;
+
+	/*
+	 * Finish transaction.
+	 */
+	error = xfs_bmap_finish(&tp, &free_list, &committed);
+	if (error)
+		goto error1;
+
+
+	error = xfs_trans_commit(tp);
+	return error;
+
+error2:
+	xfs_bmap_cancel(&free_list);
+error1:
+	xfs_trans_cancel(tp);
+	return error;
+}
+
+/**
+ * xfs_reflink_end_io() - remap all blocks after forking
+ *
+ * @mp: XFS mount object
+ * @ip: XFS inode object
+ * @ioend: the io completion object
+ */
+int
+xfs_reflink_end_io(
+	struct xfs_mount	*mp,		/* XFS mount object */
+	struct xfs_inode	*ip,		/* inode */
+	xfs_ioend_t		*ioend)		/* IO completion object */
+{
+	int			error, err2;
+	struct list_head	*pos, *n;
+	xfs_reflink_end_io_t	*eio;
+
+	error = 0;
+	list_for_each_safe(pos, n, &ioend->io_reflink_endio_list) {
+		eio = list_entry(pos, xfs_reflink_end_io_t, rlei_list);
+		err2 = xfs_reflink_remap_after_io(mp, ip, eio);
+		if (error == 0)
+			error = err2;
+		kfree(eio);
+	}
+
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 7cccd50..40a6576 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -18,7 +18,22 @@
 #ifndef __XFS_REFLINK_H
 #define __XFS_REFLINK_H 1
 
+typedef struct xfs_reflink_end_io {
+	struct list_head	rlei_list;
+	xfs_bmbt_irec_t		rlei_mapping;
+	struct xfs_efi_log_item	*rlei_efi;
+} xfs_reflink_end_io_t;
+
 extern int xfs_reflink(struct xfs_inode *src, xfs_off_t srcoff,
 	struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
 
+extern int xfs_reflink_get_refcount(struct xfs_mount *mp, xfs_agnumber_t agno,
+	xfs_agblock_t agbno, xfs_extlen_t *len, xfs_nlink_t *nr);
+
+extern int xfs_reflink_fork_block(struct xfs_inode *ip, xfs_bmbt_irec_t *imap,
+	xfs_off_t offset, unsigned int *type, xfs_reflink_end_io_t **peio);
+
+extern int xfs_reflink_end_io(struct xfs_mount *mp, struct xfs_inode *ip,
+	xfs_ioend_t *ioend);
+
 #endif /* __XFS_REFLINK_H */
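
The lookup done by xfs_reflink_get_refcount() in this patch can be
modelled in userspace roughly as follows.  This is a sketch, not the
kernel code: a sorted array stands in for the per-AG btree, and
struct rl_extent and get_refcount() are hypothetical names.  As in the
patch, a block not covered by any record is treated as a refcount=1
extent running to the next record or to the end of the AG.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One refcount record: blocks [start, start + len) have refcount nr. */
struct rl_extent {
	uint32_t start;
	uint32_t len;
	uint32_t nr;
};

/*
 * Look up agbno in a sorted, non-overlapping record array.  On return,
 * *nr is the refcount of agbno and *len is the number of blocks,
 * starting at agbno, that share that refcount.
 */
static void get_refcount(const struct rl_extent *recs, size_t nrecs,
			 uint32_t aglen, uint32_t agbno,
			 uint32_t *len, uint32_t *nr)
{
	size_t i = 0;

	/* Like lookup_le: stop at the first record starting past agbno. */
	while (i < nrecs && recs[i].start <= agbno)
		i++;

	/* recs[i - 1], if any, is the last record starting at or before agbno. */
	if (i > 0 && recs[i - 1].start + recs[i - 1].len > agbno) {
		*len = recs[i - 1].len - (agbno - recs[i - 1].start);
		*nr = recs[i - 1].nr;
		return;
	}

	/* Hole: unshared up to the next record or the end of the AG. */
	*len = (i < nrecs ? recs[i].start : aglen) - agbno;
	*nr = 1;
}
```

For records {10, 5, 2} and {30, 4, 3} in a 100-block AG, block 12 reports
3 blocks at refcount 2, block 20 reports 10 blocks at refcount 1, and
block 40 reports 60 blocks at refcount 1.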


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 07/14] xfs: handle directio copy-on-write for reflinked blocks
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2015-06-25 23:39 ` [PATCH 06/14] xfs: implement copy-on-write for reflinked blocks Darrick J. Wong
@ 2015-06-25 23:39 ` Darrick J. Wong
  2015-06-25 23:40 ` [PATCH 08/14] xfs: teach fiemap about reflink'd extents Darrick J. Wong
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:39 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

We hope that CoW writes will be rare and that directio CoW writes will
be even more rare.  Therefore, fall back to the buffered path for any
such write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c    |   17 +++++++++++++++++
 fs/xfs/xfs_file.c    |   12 ++++++++++--
 fs/xfs/xfs_reflink.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    3 +++
 4 files changed, 75 insertions(+), 2 deletions(-)
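
The control flow this patch adds to the write path can be sketched in
userspace as below.  It is an illustration only: toy_dio_write(),
toy_buffered_write() and toy_write_iter() are hypothetical names, and
TOY_EREMCHG is a local sentinel standing in for -EREMCHG rather than the
real errno definition.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for EREMCHG; the numeric value is illustrative. */
#define TOY_EREMCHG	78

enum write_path {
	PATH_DIRECT,
	PATH_BUFFERED,
};

/* Hypothetical direct write: refuses to touch a shared block. */
static int toy_dio_write(bool block_is_shared)
{
	return block_is_shared ? -TOY_EREMCHG : 0;
}

/* Hypothetical buffered write: always succeeds in this model. */
static int toy_buffered_write(void)
{
	return 0;
}

/*
 * Try direct I/O first when asked to; fall back to the buffered path
 * only when the direct path signals a reflink CoW with -TOY_EREMCHG.
 * Any other direct I/O result is returned unchanged.
 */
static int toy_write_iter(bool want_direct, bool block_is_shared,
			  enum write_path *used)
{
	if (want_direct) {
		int ret = toy_dio_write(block_is_shared);

		if (ret != -TOY_EREMCHG) {
			*used = PATH_DIRECT;
			return ret;
		}
	}
	*used = PATH_BUFFERED;
	return toy_buffered_write();
}
```

The sentinel never reaches the caller in this model: a direct write to a
shared block is retried through the buffered path and its result is
returned instead.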


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index be57e5d..73986ca 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1487,6 +1487,23 @@ __xfs_get_blocks(
 	if (imap.br_startblock != HOLESTARTBLOCK &&
 	    imap.br_startblock != DELAYSTARTBLOCK &&
 	    (create || !ISUNWRITTEN(&imap))) {
+		/*
+		 * Are we doing a DIO write to a reflinked block?  In the
+		 * ideal world we at least would fork full blocks, but for now
+		 * just fall back to buffered mode.  Yuck.  Use -EREMCHG
+		 * ("remote address changed") to signal this, since in general
+		 * XFS doesn't do this sort of fallback.
+		 */
+		if (create && direct && !ISUNWRITTEN(&imap)) {
+			bool type = false;
+
+			error = xfs_reflink_should_fork_block(ip, &imap,
+							      offset, &type);
+			if (error)
+				return error;
+			if (type)
+				return -EREMCHG;
+		}
 		xfs_map_buffer(inode, bh_result, &imap, offset);
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 97d92c1..898c492 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -858,10 +858,18 @@ xfs_file_write_iter(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
-	if ((iocb->ki_flags & IOCB_DIRECT) || IS_DAX(inode))
+	/*
+	 * Allow DIO to fall back to buffered *only* in the case that we're
+	 * doing a reflink CoW.
+	 */
+	if ((iocb->ki_flags & IOCB_DIRECT) || IS_DAX(inode)) {
 		ret = xfs_file_dio_aio_write(iocb, from);
-	else
+		if (ret == -EREMCHG)
+			goto buffered;
+	} else {
+buffered:
 		ret = xfs_file_buffered_aio_write(iocb, from);
+	}
 
 	if (ret > 0) {
 		ssize_t err;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 39b29a4..3f4d9a3 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -634,3 +634,48 @@ xfs_reflink_end_io(
 
 	return error;
 }
+
+/**
+ * xfs_reflink_should_fork_block() - determine if a block should be forked
+ *
+ * @ip: XFS inode object
+ * @imap: the fileoff:fsblock mapping that we might fork
+ * @offset: the file offset of the block we're examining
+ * @type: set to true if reflinked, false otherwise.
+ */
+int
+xfs_reflink_should_fork_block(
+	struct xfs_inode	*ip,		/* xfs inode object */
+	xfs_bmbt_irec_t		*imap,		/* block mapping */
+	xfs_off_t		offset,		/* file offset */
+	bool			*type)		/* out: is this reflinked? */
+{
+	xfs_fsblock_t		fsbno;
+	xfs_off_t		iomap_offset;
+	xfs_agnumber_t		agno;		/* allocation group number */
+	xfs_agblock_t		agbno;		/* ag start of range to free */
+	xfs_extlen_t		len;
+	xfs_nlink_t		nr;
+	int			error;
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb)) {
+		*type = false;
+		return 0;
+	}
+
+	iomap_offset = XFS_FSB_TO_B(mp, imap->br_startoff);
+	fsbno = imap->br_startblock + XFS_B_TO_FSB(mp, offset - iomap_offset);
+	agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	CHECK_AG_NUMBER(mp, agno);
+	CHECK_AG_EXTENT(mp, agbno, 1);
+	ASSERT(imap->br_state == XFS_EXT_NORM);
+
+	error = xfs_reflink_get_refcount(mp, agno, agbno, &len, &nr);
+	if (error)
+		return error;
+	ASSERT(len != 0);
+	*type = (nr > 1);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 40a6576..295a9c7 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -36,4 +36,7 @@ extern int xfs_reflink_fork_block(struct xfs_inode *ip, xfs_bmbt_irec_t *imap,
 extern int xfs_reflink_end_io(struct xfs_mount *mp, struct xfs_inode *ip,
 	xfs_ioend_t *ioend);
 
+extern int xfs_reflink_should_fork_block(struct xfs_inode *ip,
+	xfs_bmbt_irec_t *imap, xfs_off_t offset, bool *type);
+
 #endif /* __XFS_REFLINK_H */


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 08/14] xfs: teach fiemap about reflink'd extents
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2015-06-25 23:39 ` [PATCH 07/14] xfs: handle directio " Darrick J. Wong
@ 2015-06-25 23:40 ` Darrick J. Wong
  2015-06-25 23:40 ` [PATCH 09/14] xfs: copy-on-write reflinked blocks when zeroing ranges of blocks Darrick J. Wong
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:40 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Teach FIEMAP to report shared (i.e. reflinked) extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |    2 +-
 fs/xfs/xfs_bmap_util.h |    3 ++
 fs/xfs/xfs_ioctl.c     |    4 ++-
 fs/xfs/xfs_iops.c      |   62 +++++++++++++++++++++++++++++++++++++++---------
 4 files changed, 55 insertions(+), 16 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 17975fe..090cf75 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -689,7 +689,7 @@ xfs_getbmap(
 		int full = 0;	/* user array is full */
 
 		/* format results & advance arg */
-		error = formatter(&arg, &out[i], &full);
+		error = formatter(ip, &arg, &out[i], &full);
 		if (error || full)
 			break;
 	}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index af97d9a..9919b9a 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -37,7 +37,8 @@ int	xfs_bmap_punch_delalloc_range(struct xfs_inode *ip,
 		xfs_fileoff_t start_fsb, xfs_fileoff_t length);
 
 /* bmap to userspace formatter - copy to user & advance pointer */
-typedef int (*xfs_bmap_format_t)(void **, struct getbmapx *, int *);
+typedef int (*xfs_bmap_format_t)(xfs_inode_t *ip, void **, struct getbmapx *,
+		int *);
 int	xfs_getbmap(struct xfs_inode *ip, struct getbmapx *bmv,
 		xfs_bmap_format_t formatter, void *arg);
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index efc6e8d..c590786 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1352,7 +1352,7 @@ out_drop_write:
 }
 
 STATIC int
-xfs_getbmap_format(void **ap, struct getbmapx *bmv, int *full)
+xfs_getbmap_format(xfs_inode_t *ip, void **ap, struct getbmapx *bmv, int *full)
 {
 	struct getbmap __user	*base = (struct getbmap __user *)*ap;
 
@@ -1396,7 +1396,7 @@ xfs_ioc_getbmap(
 }
 
 STATIC int
-xfs_getbmapx_format(void **ap, struct getbmapx *bmv, int *full)
+xfs_getbmapx_format(xfs_inode_t *ip, void **ap, struct getbmapx *bmv, int *full)
 {
 	struct getbmapx __user	*base = (struct getbmapx __user *)*ap;
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2923419..0336fed 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -38,6 +38,8 @@
 #include "xfs_dir2.h"
 #include "xfs_trans_space.h"
 #include "xfs_pnfs.h"
+#include "xfs_bit.h"
+#include "xfs_reflink.h"
 
 #include <linux/capability.h>
 #include <linux/xattr.h>
@@ -1017,14 +1019,21 @@ xfs_vn_update_time(
  */
 STATIC int
 xfs_fiemap_format(
+	xfs_inode_t		*ip,
 	void			**arg,
 	struct getbmapx		*bmv,
 	int			*full)
 {
-	int			error;
+	int			error = 0;
 	struct fiemap_extent_info *fieinfo = *arg;
 	u32			fiemap_flags = 0;
-	u64			logical, physical, length;
+	u64			logical, physical, length, loop_len, len;
+	xfs_extlen_t		elen;
+	xfs_nlink_t		nr;
+	xfs_fsblock_t		fsbno;
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_mount_t		*mp = ip->i_mount;
 
 	/* Do nothing for a hole */
 	if (bmv->bmv_block == -1LL)
@@ -1032,7 +1041,7 @@ xfs_fiemap_format(
 
 	logical = BBTOB(bmv->bmv_offset);
 	physical = BBTOB(bmv->bmv_block);
-	length = BBTOB(bmv->bmv_length);
+	length = loop_len = BBTOB(bmv->bmv_length);
 
 	if (bmv->bmv_oflags & BMV_OF_PREALLOC)
 		fiemap_flags |= FIEMAP_EXTENT_UNWRITTEN;
@@ -1041,16 +1050,45 @@ xfs_fiemap_format(
 				 FIEMAP_EXTENT_UNKNOWN);
 		physical = 0;   /* no block yet */
 	}
-	if (bmv->bmv_oflags & BMV_OF_LAST)
-		fiemap_flags |= FIEMAP_EXTENT_LAST;
-
-	error = fiemap_fill_next_extent(fieinfo, logical, physical,
-					length, fiemap_flags);
-	if (error > 0) {
-		error = 0;
-		*full = 1;	/* user array now full */
-	}
 
+	while (loop_len > 0) {
+		u32 ext_flags = 0;
+
+		if (bmv->bmv_oflags & BMV_OF_DELALLOC) {
+			physical = 0;
+			len = loop_len;
+			nr = 1;
+		} else if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+			fsbno = XFS_DADDR_TO_FSB(mp, BTOBB(physical));
+			agno = XFS_FSB_TO_AGNO(mp, fsbno);
+			agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+			error = xfs_reflink_get_refcount(mp, agno, agbno,
+					&elen, &nr);
+			if (error)
+				goto out;
+			len = XFS_FSB_TO_B(mp, elen);
+			if (len == 0 || len > loop_len)
+				len = loop_len;
+			if (nr >= 2)
+				ext_flags |= FIEMAP_EXTENT_SHARED;
+		} else
+			len = loop_len;
+		if ((bmv->bmv_oflags & BMV_OF_LAST) &&
+		    len == loop_len)
+			ext_flags |= FIEMAP_EXTENT_LAST;
+
+		error = fiemap_fill_next_extent(fieinfo, logical, physical,
+						len, fiemap_flags | ext_flags);
+		if (error > 0) {
+			error = 0;
+			*full = 1;	/* user array now full */
+			goto out;
+		}
+		logical += len;
+		physical += len;
+		loop_len -= len;
+	}
+out:
 	return error;
 }
 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 09/14] xfs: copy-on-write reflinked blocks when zeroing ranges of blocks
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2015-06-25 23:40 ` [PATCH 08/14] xfs: teach fiemap about reflink'd extents Darrick J. Wong
@ 2015-06-25 23:40 ` Darrick J. Wong
  2015-06-25 23:40 ` [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag Darrick J. Wong
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:40 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

When we're writing zeroes to a reflinked block (such as when we're
punching a reflinked range), we need to fork the block and write
to that, otherwise we can corrupt the other reflinks.
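
To illustrate why the fork is needed, here is a minimal userspace sketch,
assuming a toy block store with per-block reference counts.  The names
(struct store, alloc_block(), zero_range_cow()) are hypothetical and the
model is much simpler than the patch: it copies the shared block to a fresh
one, repoints the writer's mapping, drops one reference on the old block,
and only then zeroes the range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS 8
#define BLKSZ   16

/* Toy block store: refcount 0 means the block is free. */
struct store {
	uint8_t  data[NBLOCKS][BLKSZ];
	uint32_t refcount[NBLOCKS];
};

/* Grab the first free block, or return -1 when the store is full. */
static int alloc_block(struct store *s)
{
	for (int i = 0; i < NBLOCKS; i++) {
		if (s->refcount[i] == 0) {
			s->refcount[i] = 1;
			return i;
		}
	}
	return -1;
}

/*
 * Zero bytes [off, off + len) of the block that *mapping points at.
 * A shared block (refcount >= 2) is forked first so that the other
 * owners keep their data.  Returns 0 on success, -1 on bad arguments
 * or when no free block is available.
 */
static int zero_range_cow(struct store *s, int *mapping, size_t off,
			  size_t len)
{
	int old = *mapping;

	if (off > BLKSZ || len > BLKSZ - off)
		return -1;

	if (s->refcount[old] >= 2) {
		int fresh = alloc_block(s);

		if (fresh < 0)
			return -1;
		memcpy(s->data[fresh], s->data[old], BLKSZ);
		s->refcount[old]--;	/* this file no longer uses it */
		*mapping = fresh;
	}
	memset(s->data[*mapping] + off, 0, len);
	return 0;
}
```

Without the fork step, the memset() would land in the block that the other
file still maps, which is the corruption described above.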

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   13 +++-
 fs/xfs/xfs_reflink.c   |  172 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h   |    7 ++
 3 files changed, 191 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 090cf75..9c931a7 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -40,6 +40,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_log.h"
+#include "xfs_reflink.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -1087,7 +1088,8 @@ xfs_zero_remaining_bytes(
 	xfs_buf_t		*bp;
 	xfs_mount_t		*mp = ip->i_mount;
 	int			nimap;
-	int			error = 0;
+	int			error = 0, err2;
+	xfs_trans_t		*tp;
 
 	/*
 	 * Avoid doing I/O beyond eof - it's not necessary
@@ -1150,10 +1152,19 @@ xfs_zero_remaining_bytes(
 				(offset - XFS_FSB_TO_B(mp, imap.br_startoff)),
 		       0, lastoffset - offset + 1);
 
+		tp = NULL;
+		error = xfs_reflink_fork_buf(mp, ip, bp, &tp);
+		if (error)
+			return error;
+
 		error = xfs_bwrite(bp);
+		err2 = xfs_reflink_finish_fork_buf(mp, ip, bp, offset_fsb,
+						   tp, error);
 		xfs_buf_relse(bp);
 		if (error)
 			return error;
+		if (err2)
+			return err2;
 	}
 	return error;
 }
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 3f4d9a3..d796280 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -679,3 +679,175 @@ xfs_reflink_should_fork_block(
 	*type = (nr > 1);
 	return error;
 }
+
+/**
+ * xfs_reflink_fork_buf() - start a transaction to fork a buffer (if needed)
+ *
+ * @mp: XFS mount point
+ * @ip: XFS inode
+ * @bp: the buffer that we might need to fork
+ * @ptp: pointer to an XFS transaction
+ */
+int					/* error */
+xfs_reflink_fork_buf(
+	xfs_mount_t	*mp,		/* XFS mount object */
+	xfs_inode_t	*ip,		/* XFS inode */
+	xfs_buf_t	*bp,		/* the buffer that we might fork */
+	xfs_trans_t	**ptp)		/* out: transaction for forking buffer */
+{
+	xfs_trans_t	*tp;
+	xfs_fsblock_t	fsbno;
+	xfs_agnumber_t	agno;
+	xfs_agblock_t	agbno;
+	xfs_extlen_t	len;
+	xfs_nlink_t	nr;
+	xfs_alloc_arg_t	args;		/* allocation arguments */
+	uint		resblks;
+	int		error;
+
+	/*
+	 * Do we need to fork this block?
+	 */
+	if (!xfs_sb_version_hasreflink(&mp->m_sb) ||
+	    XFS_IS_REALTIME_INODE(ip)) {
+		*ptp = NULL;
+		return 0;
+	}
+
+	fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
+	agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	CHECK_AG_NUMBER(mp, agno);
+	CHECK_AG_EXTENT(mp, agno, 1);
+
+	error = xfs_reflink_get_refcount(mp, agno, agbno, &len, &nr);
+	if (error)
+		return error;
+	ASSERT(len != 0);
+	if (nr < 2) {
+		*ptp = NULL;
+		return 0;
+	}
+
+	/*
+	 * Yes we do, so prepare a transaction...
+	 */
+	resblks = XFS_DIOSTRAT_SPACE_RES(mp, 3);
+	tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE);
+	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write, resblks, 0);
+
+	/*
+	 * check for running out of space
+	 */
+	if (error) {
+		/*
+		 * Free the transaction structure.
+		 */
+		ASSERT(error == -ENOSPC || XFS_FORCED_SHUTDOWN(mp));
+		goto error0;
+	}
+	error = xfs_trans_reserve_quota(tp, mp,
+			ip->i_udquot, ip->i_gdquot, ip->i_pdquot,
+			resblks, 0, XFS_QMOPT_RES_REGBLKS);
+	if (error)
+		goto error0;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+
+	/*
+	 * Now allocate a block, and stash the new mapping.
+	 *
+	 * XXX: Ideally we'd scan up and down the incore extent list
+	 * looking for a block, but do this stupid thing for now.
+	 */
+	memset(&args, 0, sizeof(args));
+	args.tp = tp;
+	args.mp = mp;
+	args.type = XFS_ALLOCTYPE_START_BNO;
+	args.firstblock = args.fsbno = fsbno;
+	args.minlen = args.maxlen = args.prod = 1;
+	args.userdata = XFS_ALLOC_USERDATA;
+	error = xfs_alloc_vextent(&args);
+	if (error)
+		goto error0;
+	ASSERT(args.len == 1);
+
+	XFS_BUF_SET_ADDR(bp, XFS_FSB_TO_DADDR(mp, args.fsbno));
+	*ptp = tp;
+	return 0;
+error0:
+	xfs_trans_cancel(tp);
+	return error;
+}
+
+/**
+ * xfs_reflink_finish_fork_buf() - finish forking a file buffer
+ *
+ * @mp: XFS mount object
+ * @ip: XFS inode
+ * @bp: the buffer that was forked
+ * @fileoff: file offset of the buffer
+ * @tp: transaction that was returned from xfs_reflink_fork_buf()
+ * @write_error: status code from writing the block
+ */
+int						/* error */
+xfs_reflink_finish_fork_buf(
+	xfs_mount_t		*mp,		/* XFS mount object */
+	xfs_inode_t		*ip,		/* XFS inode */
+	xfs_buf_t		*bp,		/* block buffer object */
+	xfs_fileoff_t		fileoff,	/* file offset */
+	xfs_trans_t		*tp,		/* transaction object */
+	int			write_error)	/* status code from writing buffer */
+{
+	xfs_bmap_free_t		free_list;
+	xfs_fsblock_t		firstfsb;
+	xfs_fsblock_t		fsbno;
+	xfs_bmbt_irec_t		imaps[1];
+	int			nimaps = 1;
+	int			done;
+	int			error;
+	int			committed;
+
+	if (tp == NULL)
+		return 0;
+
+	fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
+	if (write_error != 0) {
+		error = xfs_free_extent(tp, fsbno, 1, ip->i_ino);
+		goto out;
+	}
+
+	/*
+	 * Remap the old blocks.
+	 */
+	xfs_bmap_init(&free_list, &firstfsb);
+	error = xfs_bunmapi(tp, ip, fileoff, 1, 0, 1, &firstfsb, &free_list,
+			    &done);
+	if (error)
+		goto error2;
+	ASSERT(done == 1);
+
+	error = xfs_bmapi_write(tp, ip, fileoff, 1, XFS_BMAPI_REFLINK, &fsbno,
+					0, &imaps[0], &nimaps, &free_list);
+	if (error)
+		goto error2;
+
+	/*
+	 * complete the transaction
+	 */
+	error = xfs_bmap_finish(&tp, &free_list, &committed);
+	if (error)
+		goto out;
+
+	error = xfs_trans_commit(tp);
+	return error;
+error2:
+	xfs_bmap_finish(&tp, &free_list, &committed);
+	done = xfs_free_extent(tp, fsbno, 1, ip->i_ino);
+	if (error == 0)
+		error = done;
+out:
+	xfs_trans_cancel(tp);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 295a9c7..adfd99c 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -39,4 +39,11 @@ extern int xfs_reflink_end_io(struct xfs_mount *mp, struct xfs_inode *ip,
 extern int xfs_reflink_should_fork_block(struct xfs_inode *ip,
 	xfs_bmbt_irec_t *imap, xfs_off_t offset, bool *type);
 
+extern int xfs_reflink_fork_buf(xfs_mount_t *mp, xfs_inode_t *ip, xfs_buf_t *bp,
+	xfs_trans_t **ptp);
+
+extern int xfs_reflink_finish_fork_buf(xfs_mount_t  *mp, xfs_inode_t *ip,
+	xfs_buf_t *bp, xfs_fileoff_t fileoff, xfs_trans_t *tp,
+	int write_error);
+
 #endif /* __XFS_REFLINK_H */

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (8 preceding siblings ...)
  2015-06-25 23:40 ` [PATCH 09/14] xfs: copy-on-write reflinked blocks when zeroing ranges of blocks Darrick J. Wong
@ 2015-06-25 23:40 ` Darrick J. Wong
  2015-07-01  1:58   ` Dave Chinner
  2015-06-25 23:40 ` [PATCH 11/14] xfs: emulate the btrfs dedupe extent same ioctl Darrick J. Wong
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:40 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Gate all the reflink functions (which generally involve an expensive
trip to the reflink btree) on an inode flag which is applied to both
inodes at reflink time.  This minimizes reflink's impact on non-CoW
files.
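
The gating can be pictured with a small userspace model.  This is a sketch
under assumptions, not the kernel structures: struct mount_model and struct
inode_model are hypothetical, and only the bit position (15) is taken from
the patch.  The cheap flag test comes first, and the flag is dropped again
once the file owns no blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DIFLAG_REFLINK (1u << 15)	/* same bit as XFS_DIFLAG_REFLINK_BIT */

/* Hypothetical stand-ins for the mount and inode structures. */
struct mount_model { bool has_reflink; };
struct inode_model {
	const struct mount_model *mp;
	uint16_t di_flags;
	uint64_t di_nblocks;
};

/* Cheap gate: only flagged inodes on reflink filesystems need a lookup. */
static bool is_reflink_inode(const struct inode_model *ip)
{
	if (!ip->mp->has_reflink)
		return false;
	return (ip->di_flags & DIFLAG_REFLINK) != 0;
}

/* At reflink time both sides are marked. */
static void mark_reflinked(struct inode_model *src, struct inode_model *dest)
{
	src->di_flags |= DIFLAG_REFLINK;
	dest->di_flags |= DIFLAG_REFLINK;
}

/* After freeing blocks: an empty file cannot share anything. */
static void after_free_blocks(struct inode_model *ip)
{
	if (ip->di_nblocks == 0 && is_reflink_inode(ip))
		ip->di_flags &= (uint16_t)~DIFLAG_REFLINK;
}
```

Files that were never reflinked fail the flag test immediately, so in this
model they never reach the (expensive) refcount lookup at all.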

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c          |    2 +
 fs/xfs/libxfs/xfs_format.h        |    5 ++-
 fs/xfs/libxfs/xfs_reflink_btree.c |   24 ++++++++++++++-
 fs/xfs/libxfs/xfs_reflink_btree.h |    5 ++-
 fs/xfs/xfs_bmap_util.c            |    9 ++++++
 fs/xfs/xfs_inode.c                |    7 ++++
 fs/xfs/xfs_iops.c                 |    3 +-
 fs/xfs/xfs_reflink.c              |   60 ++++++++++++++++++++++++++++++++++---
 8 files changed, 105 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 05e8346..737c03a 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5004,7 +5004,7 @@ xfs_bmap_del_extent(
 	 * If we need to, add to list of extents to delete.
 	 */
 	if (do_fx)
-		xfs_reflink_bmap_add_free(mp, flist, del->br_startblock,
+		xfs_reflink_bmap_add_free(mp, flist, ip, del->br_startblock,
 					  del->br_blockcount, ip->i_ino, tp);
 	/*
 	 * Adjust inode # blocks in the file.
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index e4954ab..44e408a 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -995,6 +995,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG_EXTSZINHERIT_BIT 12	/* inherit inode extent size */
 #define XFS_DIFLAG_NODEFRAG_BIT     13	/* do not reorganize/defragment */
 #define XFS_DIFLAG_FILESTREAM_BIT   14  /* use filestream allocator */
+#define XFS_DIFLAG_REFLINK_BIT      15  /* check reflink btree for CoW */
 #define XFS_DIFLAG_REALTIME      (1 << XFS_DIFLAG_REALTIME_BIT)
 #define XFS_DIFLAG_PREALLOC      (1 << XFS_DIFLAG_PREALLOC_BIT)
 #define XFS_DIFLAG_NEWRTBM       (1 << XFS_DIFLAG_NEWRTBM_BIT)
@@ -1010,13 +1011,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG_EXTSZINHERIT  (1 << XFS_DIFLAG_EXTSZINHERIT_BIT)
 #define XFS_DIFLAG_NODEFRAG      (1 << XFS_DIFLAG_NODEFRAG_BIT)
 #define XFS_DIFLAG_FILESTREAM    (1 << XFS_DIFLAG_FILESTREAM_BIT)
+#define XFS_DIFLAG_REFLINK       (1 << XFS_DIFLAG_REFLINK_BIT)
 
 #define XFS_DIFLAG_ANY \
 	(XFS_DIFLAG_REALTIME | XFS_DIFLAG_PREALLOC | XFS_DIFLAG_NEWRTBM | \
 	 XFS_DIFLAG_IMMUTABLE | XFS_DIFLAG_APPEND | XFS_DIFLAG_SYNC | \
 	 XFS_DIFLAG_NOATIME | XFS_DIFLAG_NODUMP | XFS_DIFLAG_RTINHERIT | \
 	 XFS_DIFLAG_PROJINHERIT | XFS_DIFLAG_NOSYMLINKS | XFS_DIFLAG_EXTSIZE | \
-	 XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM)
+	 XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM | \
+	 XFS_DIFLAG_REFLINK)
 
 /*
  * Inode number format:
diff --git a/fs/xfs/libxfs/xfs_reflink_btree.c b/fs/xfs/libxfs/xfs_reflink_btree.c
index f40ba1f..7daba37 100644
--- a/fs/xfs/libxfs/xfs_reflink_btree.c
+++ b/fs/xfs/libxfs/xfs_reflink_btree.c
@@ -25,6 +25,7 @@
 #include "xfs_sb.h"
 #include "xfs_mount.h"
 #include "xfs_btree.h"
+#include "xfs_inode.h"
 #include "xfs_bmap.h"
 #include "xfs_reflink_btree.h"
 #include "xfs_alloc.h"
@@ -936,6 +937,26 @@ error0:
 	return error;
 }
 
+/*
+ * xfs_is_reflink_inode() -- Decide if an inode needs to be checked for CoW.
+ *
+ * @ip: XFS inode
+ */
+bool
+xfs_is_reflink_inode(
+	struct xfs_inode	*ip)		/* XFS inode */
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return false;
+	if (!(ip->i_d.di_flags & XFS_DIFLAG_REFLINK))
+		return false;
+
+	ASSERT(!XFS_IS_REALTIME_INODE(ip));
+	return true;
+}
+
 /**
  * xfs_reflink_bmap_add_free() - release a range of blocks
  *
@@ -950,6 +971,7 @@ int
 xfs_reflink_bmap_add_free(
 	struct xfs_mount	*mp,		/* mount point structure */
 	xfs_bmap_free_t		*flist,		/* list of extents */
+	struct xfs_inode	*ip,		/* xfs inode */
 	xfs_fsblock_t		fsbno,		/* fs block number of extent */
 	xfs_filblks_t		fslen,		/* length of extent */
 	uint64_t		owner,		/* extent owner */
@@ -971,7 +993,7 @@ xfs_reflink_bmap_add_free(
 	unsigned long long	blocks_freed;
 	xfs_fsblock_t		range_fsb;
 
-	if (!xfs_sb_version_hasreflink(&mp->m_sb)) {
+	if (!xfs_is_reflink_inode(ip)) {
 		xfs_bmap_add_free(mp, flist, fsbno, fslen, owner);
 		return 0;
 	}
diff --git a/fs/xfs/libxfs/xfs_reflink_btree.h b/fs/xfs/libxfs/xfs_reflink_btree.h
index 4ea0ac4..46dd0f2 100644
--- a/fs/xfs/libxfs/xfs_reflink_btree.h
+++ b/fs/xfs/libxfs/xfs_reflink_btree.h
@@ -72,7 +72,10 @@ extern int xfs_reflinkbt_adjust_refcount(struct xfs_mount *, struct xfs_trans *,
 		int);
 
 extern int xfs_reflink_bmap_add_free(struct xfs_mount *mp,
-		xfs_bmap_free_t *flist, xfs_fsblock_t fsbno, xfs_filblks_t len,
+		xfs_bmap_free_t *flist, struct xfs_inode *ip,
+		xfs_fsblock_t fsbno, xfs_filblks_t len,
 		uint64_t owner, struct xfs_trans *tp);
 
+extern bool xfs_is_reflink_inode(struct xfs_inode *ip);
+
 #endif	/* __XFS_REFLINK_BTREE_H__ */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 9c931a7..be010c9 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -41,6 +41,7 @@
 #include "xfs_icache.h"
 #include "xfs_log.h"
 #include "xfs_reflink.h"
+#include "xfs_reflink_btree.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -1320,6 +1321,14 @@ xfs_free_file_space(
 		}
 
 		/*
+		 * Clear the reflink flag if we freed everything.
+		 */
+		if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip)) {
+			ip->i_d.di_flags &= ~XFS_DIFLAG_REFLINK;
+			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		}
+
+		/*
 		 * complete the transaction
 		 */
 		error = xfs_bmap_finish(&tp, &free_list, &committed);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index a37a101..e688732 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -48,6 +48,7 @@
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_reflink_btree.h"
 
 kmem_zone_t *xfs_inode_zone;
 
@@ -1566,6 +1567,12 @@ xfs_itruncate_extents(
 	}
 
 	/*
+	 * Clear the reflink flag if we truncated everything.
+	 */
+	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip))
+		ip->i_d.di_flags &= ~XFS_DIFLAG_REFLINK;
+
+	/*
 	 * Always re-log the inode so that our permanent transaction can keep
 	 * on rolling it forward in the log.
 	 */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 0336fed..be17eef 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -40,6 +40,7 @@
 #include "xfs_pnfs.h"
 #include "xfs_bit.h"
 #include "xfs_reflink.h"
+#include "xfs_reflink_btree.h"
 
 #include <linux/capability.h>
 #include <linux/xattr.h>
@@ -1058,7 +1059,7 @@ xfs_fiemap_format(
 			physical = 0;
 			len = loop_len;
 			nr = 1;
-		} else if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		} else if (xfs_is_reflink_inode(ip)) {
 			fsbno = XFS_DADDR_TO_FSB(mp, BTOBB(physical));
 			agno = XFS_FSB_TO_AGNO(mp, fsbno);
 			agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d796280..4f027d3 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -127,6 +127,54 @@ xfs_reflink(
 	}
 
 	/*
+	 * Ensure the reflink bit is set in both inodes.
+	 */
+	if (!(src->i_d.di_flags & XFS_DIFLAG_REFLINK) ||
+	    !(dest->i_d.di_flags & XFS_DIFLAG_REFLINK)) {
+		tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_NOT_SIZE);
+		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ichange, 0, 0);
+
+		/*
+		 * check for running out of space
+		 */
+		if (error) {
+			/*
+			 * Free the transaction structure.
+			 */
+			ASSERT(error == -ENOSPC || XFS_FORCED_SHUTDOWN(mp));
+			goto error0;
+		}
+
+		/* Lock both files against IO */
+		if (src->i_ino == dest->i_ino)
+			xfs_ilock(src, XFS_ILOCK_EXCL);
+		else
+			xfs_lock_two_inodes(src, dest, XFS_ILOCK_EXCL);
+
+		if (!(src->i_d.di_flags & XFS_DIFLAG_REFLINK)) {
+			xfs_trans_ijoin(tp, src, XFS_ILOCK_EXCL);
+			src->i_d.di_flags |= XFS_DIFLAG_REFLINK;
+			xfs_trans_log_inode(tp, src, XFS_ILOG_CORE);
+		} else
+			xfs_iunlock(src, XFS_ILOCK_EXCL);
+
+		if (src->i_ino == dest->i_ino)
+			goto commit_flags;
+
+		if (!(dest->i_d.di_flags & XFS_DIFLAG_REFLINK)) {
+			xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
+			dest->i_d.di_flags |= XFS_DIFLAG_REFLINK;
+			xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
+		} else
+			xfs_iunlock(dest, XFS_ILOCK_EXCL);
+
+commit_flags:
+		error = xfs_trans_commit(tp);
+		if (error)
+			goto out_unlock_io;
+	}
+
+	/*
 	 * Try to read extents from the first block indicated
 	 * by fsbno to the end block of the file.
 	 */
@@ -436,7 +484,7 @@ xfs_reflink_fork_block(
 	xfs_reflink_end_io_t	*eio;
 	struct xfs_mount	*mp = ip->i_mount;
 
-	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+	if (!xfs_is_reflink_inode(ip))
 		return 0;
 	if (*type == XFS_IO_DELALLOC || *type == XFS_IO_UNWRITTEN)
 		return 0;
@@ -548,7 +596,7 @@ xfs_reflink_remap_after_io(
 	CHECK_AG_NUMBER(mp, agno);
 	CHECK_AG_EXTENT(mp, agbno, 1);
 	ASSERT(imap->br_state == XFS_EXT_NORM);
-
+	ASSERT(xfs_is_reflink_inode(ip));
 	ASSERT(!XFS_IS_REALTIME_INODE(ip));
 
 	/*
@@ -623,6 +671,7 @@ xfs_reflink_end_io(
 	struct list_head	*pos, *n;
 	xfs_reflink_end_io_t	*eio;
 
+	ASSERT(xfs_is_reflink_inode(ip));
 	error = 0;
 	list_for_each_safe(pos, n, &ioend->io_reflink_endio_list) {
 		eio = list_entry(pos, xfs_reflink_end_io_t, rlei_list);
@@ -659,7 +708,7 @@ xfs_reflink_should_fork_block(
 	int			error;
 	struct xfs_mount	*mp = ip->i_mount;
 
-	if (!xfs_sb_version_hasreflink(&mp->m_sb)) {
+	if (!xfs_is_reflink_inode(ip)) {
 		*type = false;
 		return 0;
 	}
@@ -708,8 +757,7 @@ xfs_reflink_fork_buf(
 	/*
 	 * Do we need to fork this block?
 	 */
-	if (!xfs_sb_version_hasreflink(&mp->m_sb) ||
-	    XFS_IS_REALTIME_INODE(ip)) {
+	if (!xfs_is_reflink_inode(ip)) {
 		*ptp = NULL;
 		return 0;
 	}
@@ -812,6 +860,8 @@ xfs_reflink_finish_fork_buf(
 	if (tp == NULL)
 		return 0;
 
+	ASSERT(xfs_is_reflink_inode(ip));
+
 	fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
 	if (write_error != 0) {
 		error = xfs_free_extent(tp, fsbno, 1, ip->i_ino);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 11/14] xfs: emulate the btrfs dedupe extent same ioctl
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (9 preceding siblings ...)
  2015-06-25 23:40 ` [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag Darrick J. Wong
@ 2015-06-25 23:40 ` Darrick J. Wong
  2015-06-25 23:40 ` [PATCH 12/14] xfs: support XFS_XFLAG_REFLINK (and FS_NOCOW_FL) on reflink filesystems Darrick J. Wong
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:40 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Emulate the BTRFS_IOC_EXTENT_SAME ioctl.  This operation is similar
to clone_range, but the kernel must confirm that the contents of the
two extents are identical before performing the reflink.
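
The comparison step can be sketched in userspace as follows.  This is an
illustration only: PAGE_SZ, ranges_match() and dedupe_status() are
hypothetical names, and plain memory buffers stand in for page cache pages.
The two ranges may sit at different offsets within a page, so each pass
compares only as many bytes as fit before either side crosses a page
boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ           4096u	/* illustrative page size */
#define SAME_DATA_DIFFERS 1	/* mirrors XFS_SAME_DATA_DIFFERS */

static size_t min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Compare len bytes at src + srcoff against dest + destoff in chunks
 * that never cross a page boundary on either side.
 */
static bool ranges_match(const unsigned char *src, size_t srcoff,
			 const unsigned char *dest, size_t destoff,
			 size_t len)
{
	while (len > 0) {
		size_t src_poff = srcoff & (PAGE_SZ - 1);
		size_t dest_poff = destoff & (PAGE_SZ - 1);
		size_t cmp_len = min_sz(PAGE_SZ - src_poff,
					PAGE_SZ - dest_poff);

		cmp_len = min_sz(cmp_len, len);
		if (memcmp(src + srcoff, dest + destoff, cmp_len) != 0)
			return false;

		srcoff += cmp_len;
		destoff += cmp_len;
		len -= cmp_len;
	}
	return true;
}

/* Status for one destination: 0 if it may be shared, else "differs". */
static int dedupe_status(const unsigned char *src, size_t srcoff,
			 const unsigned char *dest, size_t destoff,
			 size_t len)
{
	return ranges_match(src, srcoff, dest, destoff, len) ?
		0 : SAME_DATA_DIFFERS;
}
```

Only a zero status would be followed by the actual reflink; a mismatch
leaves both files untouched and is reported back per destination.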

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h |   28 +++++++++++
 fs/xfs/xfs_ioctl.c     |  121 ++++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_ioctl32.c   |    1 
 fs/xfs/xfs_reflink.c   |  109 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h   |    6 ++
 5 files changed, 258 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 92f21e1..7f4d886 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -560,7 +560,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
-/* reflink ioctls; these should match btrfs */
+/* reflink ioctls; these MUST match the btrfs ioctl definitions */
 struct xfs_ioctl_clone_range_args {
 	__s64 src_fd;
 	__u64 src_offset;
@@ -568,8 +568,34 @@ struct xfs_ioctl_clone_range_args {
 	__u64 dest_offset;
 };
 
+#define XFS_SAME_DATA_DIFFERS	1
+/* For extent-same ioctl */
+struct xfs_ioctl_file_extent_same_info {
+	__s64 fd;		/* in - destination file */
+	__u64 logical_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file */
+	/* status of this dedupe operation:
+	 * 0 if dedup succeeds
+	 * < 0 for error
+	 * == XFS_SAME_DATA_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;
+};
+
+struct xfs_ioctl_file_extent_same_args {
+	__u64 logical_offset;	/* in - start of extent in source */
+	__u64 length;		/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;
+	__u32 reserved2;
+	struct xfs_ioctl_file_extent_same_info info[0];
+};
+
 #define XFS_IOC_CLONE		 _IOW (0x94, 9, int)
 #define XFS_IOC_CLONE_RANGE	 _IOW (0x94, 13, struct xfs_ioctl_clone_range_args)
+#define XFS_IOC_FILE_EXTENT_SAME _IOWR(0x94, 54, struct xfs_ioctl_file_extent_same_args)
 
 #ifndef HAVE_BBMACROS
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index c590786..da4d7b7 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1534,7 +1534,8 @@ xfs_ioctl_reflink(
 	loff_t		pos_in,
 	struct file	*file_out,
 	loff_t		pos_out,
-	size_t		len)
+	size_t		len,
+	bool		is_dedupe)
 {
 	struct inode	*inode_in;
 	struct inode	*inode_out;
@@ -1543,6 +1544,7 @@ xfs_ioctl_reflink(
 	loff_t		isize;
 	int		same_inode;
 	loff_t		blen;
+	unsigned int	flags;
 
 	if (len == 0)
 		return 0;
@@ -1622,7 +1624,12 @@ xfs_ioctl_reflink(
 	if (ret)
 		goto out_unlock;
 
-	ret = xfs_reflink(XFS_I(inode_in), pos_in, XFS_I(inode_out), pos_out, len);
+	flags = 0;
+	if (is_dedupe)
+		flags |= XFS_REFLINK_DEDUPE;
+
+	ret = xfs_reflink(XFS_I(inode_in), pos_in, XFS_I(inode_out), pos_out,
+			len, flags);
 	if (ret < 0)
 		goto out_unlock;
 
@@ -1644,6 +1651,108 @@ out_unlock:
 	return ret;
 }
 
+#define XFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+
+static long
+xfs_ioctl_file_extent_same(
+	struct file					*file,
+	struct xfs_ioctl_file_extent_same_args __user	*argp)
+{
+	struct xfs_ioctl_file_extent_same_args	*same;
+	struct xfs_ioctl_file_extent_same_info	*info;
+	struct inode 				*src;
+	u64					off;
+	u64					len;
+	int					i;
+	int					ret;
+	unsigned long				size;
+	bool					is_admin;
+	u16					count;
+
+	is_admin = capable(CAP_SYS_ADMIN);
+	src = file_inode(file);
+	if (!(file->f_mode & FMODE_READ))
+		return -EINVAL;
+
+	if (get_user(count, &argp->dest_count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	size = offsetof(struct xfs_ioctl_file_extent_same_args __user,
+			info[count]);
+
+	same = memdup_user(argp, size);
+
+	if (IS_ERR(same)) {
+		ret = PTR_ERR(same);
+		goto out;
+	}
+
+	off = same->logical_offset;
+	len = same->length;
+
+	/*
+	 * Limit the total length we will dedupe for each operation.
+	 * This is intended to bound the total time spent in this
+	 * ioctl to something sane.
+	 */
+	if (len > XFS_MAX_DEDUPE_LEN)
+		len = XFS_MAX_DEDUPE_LEN;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	ret = -EACCES;
+	if (!S_ISREG(src->i_mode))
+		goto out;
+
+	/* pre-format output fields to sane values */
+	for (i = 0; i < count; i++) {
+		same->info[i].bytes_deduped = 0ULL;
+		same->info[i].status = 0;
+	}
+
+	for (i = 0, info = same->info; i < count; i++, info++) {
+		struct inode *dst;
+		struct fd dst_file = fdget(info->fd);
+		if (!dst_file.file) {
+			info->status = -EBADF;
+			continue;
+		}
+		dst = file_inode(dst_file.file);
+
+		info->bytes_deduped = 0;
+		if (!(is_admin || (dst_file.file->f_mode & FMODE_WRITE))) {
+			info->status = -EINVAL;
+		} else if (file->f_path.mnt != dst_file.file->f_path.mnt) {
+			info->status = -EXDEV;
+		} else if (S_ISDIR(dst->i_mode)) {
+			info->status = -EISDIR;
+		} else if (!S_ISREG(dst->i_mode)) {
+			info->status = -EACCES;
+		} else {
+			info->status = xfs_ioctl_reflink(file, off,
+							 dst_file.file,
+							 info->logical_offset,
+							 len, true);
+			if (info->status == -EBADE)
+				info->status = XFS_SAME_DATA_DIFFERS;
+			else if (info->status == 0)
+				info->bytes_deduped = len;
+		}
+		fdput(dst_file);
+	}
+
+	ret = copy_to_user(argp, same, size);
+	if (ret)
+		ret = -EFAULT;
+
+out:
+	return ret;
+}
+
 /*
  * Note: some of the ioctl's return positive numbers as a
  * byte count indicating success, such as readlink_by_handle.
@@ -1949,7 +2058,7 @@ xfs_file_ioctl(
 		if (!src.file)
 			return -EBADF;
 
-		error = xfs_ioctl_reflink(src.file, 0, filp, 0, ~0ULL);
+		error = xfs_ioctl_reflink(src.file, 0, filp, 0, ~0ULL, false);
 		fdput(src);
 		if (error > 0)
 			error = 0;
@@ -1970,7 +2079,8 @@ xfs_file_ioctl(
 			args.src_length = ~0ULL;
 
 		error = xfs_ioctl_reflink(src.file, args.src_offset, filp,
-					  args.dest_offset, args.src_length);
+					  args.dest_offset, args.src_length,
+					  false);
 		fdput(src);
 		if (error > 0)
 			error = 0;
@@ -1978,6 +2088,9 @@ xfs_file_ioctl(
 		return error;
 	}
 
+	case XFS_IOC_FILE_EXTENT_SAME:
+		return xfs_ioctl_file_extent_same(filp, arg);
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 76d8729..575c292 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -560,6 +560,7 @@ xfs_file_compat_ioctl(
 	case XFS_IOC_ERROR_CLEARALL:
 	case XFS_IOC_CLONE:
 	case XFS_IOC_CLONE_RANGE:
+	case XFS_IOC_FILE_EXTENT_SAME:
 		return xfs_file_ioctl(filp, cmd, p);
 #ifndef BROKEN_X86_ALIGNMENT
 	/* These are handled fine if no alignment issues */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 4f027d3..325dd14 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -71,6 +71,94 @@
 				(len) <= (mp)->m_sb.sb_agblocks, label); \
 	} while(0);
 
+/*
+ * Read a page's worth of file data into the page cache.
+ */
+static struct page *
+xfs_get_page(
+	struct inode	*inode,		/* inode */
+	xfs_off_t 	offset)		/* where in the inode to read */
+{
+	struct address_space	*mapping;
+	struct page		*page;
+	pgoff_t			n;
+
+	n = offset >> PAGE_CACHE_SHIFT;
+	mapping = inode->i_mapping;
+	page = read_mapping_page(mapping, n, NULL);
+	if (IS_ERR(page))
+		return page;
+	if (!PageUptodate(page)) {
+		page_cache_release(page);
+		return NULL;
+	}
+	return page;
+}
+
+/*
+ * Compare extents of two files to see if they are the same.
+ */
+static int
+xfs_compare_extents(
+	struct inode	*src,		/* first inode */
+	xfs_off_t	srcoff,		/* offset of first inode */
+	struct inode	*dest,		/* second inode */
+	xfs_off_t	destoff,	/* offset of second inode */
+	xfs_off_t	len,		/* length of data to compare */
+	bool		*is_same)	/* out: true if the contents match */
+{
+	xfs_off_t	src_poff;
+	xfs_off_t	dest_poff;
+	void		*src_addr;
+	void		*dest_addr;
+	struct page	*src_page;
+	struct page	*dest_page;
+	xfs_off_t	cmp_len;
+	bool		same;
+
+	same = true;
+	while (len) {
+		src_poff = srcoff & (PAGE_CACHE_SIZE - 1);
+		dest_poff = destoff & (PAGE_CACHE_SIZE - 1);
+		cmp_len = min(PAGE_CACHE_SIZE - src_poff,
+			      PAGE_CACHE_SIZE - dest_poff);
+		cmp_len = min(cmp_len, len);
+		ASSERT(cmp_len > 0);
+
+		src_page = xfs_get_page(src, srcoff);
+		if (!src_page)
+			return -EINVAL;
+		dest_page = xfs_get_page(dest, destoff);
+		if (!dest_page) {
+			page_cache_release(src_page);
+			return -EINVAL;
+		}
+		src_addr = kmap_atomic(src_page);
+		dest_addr = kmap_atomic(dest_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dest_page);
+
+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+			same = false;
+
+		kunmap_atomic(src_addr);
+		kunmap_atomic(dest_addr);
+		page_cache_release(src_page);
+		page_cache_release(dest_page);
+
+		if (!same)
+			break;
+
+		srcoff += cmp_len;
+		destoff += cmp_len;
+		len -= cmp_len;
+	}
+
+	*is_same = same;
+	return 0;
+}
+
 /**
  * xfs_reflink() - link a range of blocks from one inode to another
  *
@@ -86,7 +174,8 @@ xfs_reflink(
 	xfs_off_t		srcoff, /* offset in source file */
 	struct xfs_inode	*dest,	/* XFS inode to copy extents to */
 	xfs_off_t		destoff,/* offset in destination file */
-	xfs_off_t		len)	/* number of bytes to copy */
+	xfs_off_t		len,	/* number of bytes to copy */
+	unsigned int		flags)	/* reflink flags */
 {
 	struct xfs_mount	*mp = src->i_mount;
 	loff_t			uninitialized_var(offset);
@@ -105,6 +194,7 @@ xfs_reflink(
 	xfs_agnumber_t		agno;		/* allocation group number */
 	xfs_agblock_t		agbno;
 	int			done;
+	bool			is_same;
 	xfs_off_t		blen = ALIGN(len, mp->m_sb.sb_blocksize);
 
 	if (!xfs_sb_version_hasreflink(&mp->m_sb))
@@ -117,6 +207,9 @@ xfs_reflink(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		return -EINVAL;
 
+	if (flags & ~XFS_REFLINK_ALL)
+		return -EINVAL;
+
 	/* Lock both files against IO */
 	if (src->i_ino == dest->i_ino) {
 		xfs_ilock(src, XFS_IOLOCK_EXCL);
@@ -127,6 +220,20 @@ xfs_reflink(
 	}
 
 	/*
+	 * Check that the extents are the same.
+	 */
+	if (flags & XFS_REFLINK_DEDUPE) {
+		error = xfs_compare_extents(VFS_I(src), srcoff, VFS_I(dest),
+				destoff, len, &is_same);
+		if (error)
+			goto out_unlock_io;
+		if (!is_same) {
+			error = -EBADE;
+			goto out_unlock_io;
+		}
+	}
+
+	/*
 	 * Ensure the reflink bit is set in both inodes.
 	 */
 	if (!(src->i_d.di_flags & XFS_DIFLAG_REFLINK) ||
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index adfd99c..7f9660d 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -24,8 +24,12 @@ typedef struct xfs_reflink_end_io {
 	struct xfs_efi_log_item	*rlei_efi;
 } xfs_reflink_end_io_t;
 
+#define XFS_REFLINK_DEDUPE	1	/* only reflink if contents match */
+#define XFS_REFLINK_ALL		(XFS_REFLINK_DEDUPE)
+
 extern int xfs_reflink(struct xfs_inode *src, xfs_off_t srcoff,
-	struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
+	struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
+	unsigned int flags);
 
 extern int xfs_reflink_get_refcount(struct xfs_mount *mp, xfs_agnumber_t agno,
 	xfs_agblock_t agbno, xfs_extlen_t *len, xfs_nlink_t *nr);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 12/14] xfs: support XFS_XFLAG_REFLINK (and FS_NOCOW_FL) on reflink filesystems
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (10 preceding siblings ...)
  2015-06-25 23:40 ` [PATCH 11/14] xfs: emulate the btrfs dedupe extent same ioctl Darrick J. Wong
@ 2015-06-25 23:40 ` Darrick J. Wong
  2015-06-25 23:40 ` [PATCH 13/14] xfs: add reflink btree root when expanding the filesystem Darrick J. Wong
  2015-06-25 23:40 ` [PATCH 14/14] xfs: add reflink btree block detection to log recovery Darrick J. Wong
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:40 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Report the reflink/nocow flags as appropriate in the XFS-specific and
"standard" getattr ioctls.  For now we'll implicitly report all reflink
files as also being nodefrag, to prevent the defragger from corrupting
the extent maps.

Allow the user to clear the reflink flag (or set the nocow flag), which
will try to remap all shared blocks to private blocks on disk.  If this
succeeds, the file will become a non-reflinked file.

Transfer the reflink flag between inodes when swapping extents, and
quietly ignore attempts to set the reflink flag, so that xfs_fsr can
defragment a reflinked file (albeit by breaking the reflink...) unless
of course nodefrag is set.
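
The flag rules above can be modelled in a few lines of standalone C.
This is only a sketch, not the kernel code: merge_nocow() and
check_flag_adjust() are hypothetical names that mirror what
xfs_merge_ioc_xflags() and xfs_reflink_check_flag_adjust() do in this
patch, XFLAG_REFLINK copies the value of XFS_XFLAG_REFLINK, and
NOCOW_FL is assumed to match FS_NOCOW_FL from linux/fs.h.  The
"fs has no reflink support" error path is left out.

```c
#include <stdbool.h>

#define XFLAG_REFLINK	0x00008000u	/* same value as XFS_XFLAG_REFLINK */
#define NOCOW_FL	0x00800000u	/* assumed to match FS_NOCOW_FL */

/* chattr-style flags -> xflags: NOCOW set means "reflink flag clear". */
static unsigned int merge_nocow(unsigned int lxflags, unsigned int xflags)
{
	if (lxflags & NOCOW_FL)
		return xflags & ~XFLAG_REFLINK;
	return xflags | XFLAG_REFLINK;
}

/*
 * The only change that is honoured is clearing the flag.  A request to
 * set it on an inode that does not have it is quietly dropped.
 */
static unsigned int check_flag_adjust(bool inode_reflinked,
				      unsigned int xflags)
{
	if (!inode_reflinked && (xflags & XFLAG_REFLINK))
		xflags &= ~XFLAG_REFLINK;
	return xflags;
}
```

In this model, chattr +C on a reflinked file yields xflags without the
reflink bit, which is the request that triggers the unshare pass.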

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h |    1 
 fs/xfs/xfs_bmap_util.c |    5 +
 fs/xfs/xfs_inode.c     |    2 
 fs/xfs/xfs_ioctl.c     |   42 +++++-
 fs/xfs/xfs_reflink.c   |  321 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h   |   10 +
 6 files changed, 374 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 7f4d886..6b1b71c 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -67,6 +67,7 @@ struct fsxattr {
 #define XFS_XFLAG_EXTSZINHERIT	0x00001000	/* inherit inode extent size */
 #define XFS_XFLAG_NODEFRAG	0x00002000  	/* do not defragment */
 #define XFS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
+#define XFS_XFLAG_REFLINK	0x00008000	/* file is reflinked */
 #define XFS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /*
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index be010c9..e5b4752 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1920,6 +1920,11 @@ xfs_swap_extents(
 		break;
 	}
 
+	if (ip->i_d.di_flags & XFS_DIFLAG_REFLINK) {
+		tip->i_d.di_flags |= XFS_DIFLAG_REFLINK;
+		ip->i_d.di_flags &= ~XFS_DIFLAG_REFLINK;
+	}
+
 	xfs_trans_log_inode(tp, ip,  src_log_flags);
 	xfs_trans_log_inode(tp, tip, target_log_flags);
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e688732..4aa51f4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -592,6 +592,8 @@ _xfs_dic2xflags(
 			flags |= XFS_XFLAG_NODEFRAG;
 		if (di_flags & XFS_DIFLAG_FILESTREAM)
 			flags |= XFS_XFLAG_FILESTREAM;
+		if (di_flags & XFS_DIFLAG_REFLINK)
+			flags |= XFS_XFLAG_REFLINK;
 	}
 
 	return flags;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index da4d7b7..5a9c161 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -870,6 +870,10 @@ xfs_merge_ioc_xflags(
 		xflags |= XFS_XFLAG_NODUMP;
 	else
 		xflags &= ~XFS_XFLAG_NODUMP;
+	if (flags & FS_NOCOW_FL)
+		xflags &= ~XFS_XFLAG_REFLINK;
+	else
+		xflags |= XFS_XFLAG_REFLINK;
 
 	return xflags;
 }
@@ -939,7 +943,8 @@ xfs_set_diflags(
 	unsigned int		di_flags;
 
 	/* can't set PREALLOC this way, just preserve it */
-	di_flags = (ip->i_d.di_flags & XFS_DIFLAG_PREALLOC);
+	di_flags = (ip->i_d.di_flags &
+			(XFS_DIFLAG_PREALLOC | XFS_DIFLAG_REFLINK));
 	if (xflags & XFS_XFLAG_IMMUTABLE)
 		di_flags |= XFS_DIFLAG_IMMUTABLE;
 	if (xflags & XFS_XFLAG_APPEND)
@@ -1002,9 +1007,11 @@ static int
 xfs_ioctl_setattr_xflags(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
-	struct fsxattr		*fa)
+	struct fsxattr		*fa,
+	struct file		*filp)
 {
 	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
 
 	/* Can't change realtime flag if any extents are allocated. */
 	if ((ip->i_d.di_nextents || ip->i_delayed_blks) &&
@@ -1028,6 +1035,9 @@ xfs_ioctl_setattr_xflags(
 		return -EPERM;
 
 	xfs_set_diflags(ip, fa->fsx_xflags);
+	error = xfs_reflink_end_unshare(ip, fa->fsx_xflags);
+	if (error)
+		return error;
 	xfs_diflags_to_linux(ip);
 	xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
@@ -1170,7 +1180,8 @@ xfs_ioctl_setattr_check_projid(
 STATIC int
 xfs_ioctl_setattr(
 	xfs_inode_t		*ip,
-	struct fsxattr		*fa)
+	struct fsxattr		*fa,
+	struct file		*filp)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
@@ -1181,6 +1192,10 @@ xfs_ioctl_setattr(
 
 	trace_xfs_ioctl_setattr(ip);
 
+	code = xfs_reflink_check_flag_adjust(ip, &fa->fsx_xflags);
+	if (code)
+		return code;
+
 	code = xfs_ioctl_setattr_check_projid(ip, fa);
 	if (code)
 		return code;
@@ -1201,6 +1216,10 @@ xfs_ioctl_setattr(
 			return code;
 	}
 
+	code = xfs_reflink_start_unshare(ip, fa->fsx_xflags, filp);
+	if (code)
+		return code;
+
 	tp = xfs_ioctl_setattr_get_trans(ip);
 	if (IS_ERR(tp)) {
 		code = PTR_ERR(tp);
@@ -1220,7 +1239,7 @@ xfs_ioctl_setattr(
 	if (code)
 		goto error_trans_cancel;
 
-	code = xfs_ioctl_setattr_xflags(tp, ip, fa);
+	code = xfs_ioctl_setattr_xflags(tp, ip, fa, filp);
 	if (code)
 		goto error_trans_cancel;
 
@@ -1290,7 +1309,7 @@ xfs_ioc_fssetxattr(
 	error = mnt_want_write_file(filp);
 	if (error)
 		return error;
-	error = xfs_ioctl_setattr(ip, &fa);
+	error = xfs_ioctl_setattr(ip, &fa, filp);
 	mnt_drop_write_file(filp);
 	return error;
 }
@@ -1303,6 +1322,7 @@ xfs_ioc_getxflags(
 	unsigned int		flags;
 
 	flags = xfs_di2lxflags(ip->i_d.di_flags);
+	xfs_reflink_get_lxflags(ip, &flags);
 	if (copy_to_user(arg, &flags, sizeof(flags)))
 		return -EFAULT;
 	return 0;
@@ -1324,22 +1344,30 @@ xfs_ioc_setxflags(
 
 	if (flags & ~(FS_IMMUTABLE_FL | FS_APPEND_FL | \
 		      FS_NOATIME_FL | FS_NODUMP_FL | \
-		      FS_SYNC_FL))
+		      FS_SYNC_FL | FS_NOCOW_FL))
 		return -EOPNOTSUPP;
 
 	fa.fsx_xflags = xfs_merge_ioc_xflags(flags, xfs_ip2xflags(ip));
 
+	error = xfs_reflink_check_flag_adjust(ip, &fa.fsx_xflags);
+	if (error)
+		return error;
+
 	error = mnt_want_write_file(filp);
 	if (error)
 		return error;
 
+	error = xfs_reflink_start_unshare(ip, fa.fsx_xflags, filp);
+	if (error)
+		goto out_drop_write;
+
 	tp = xfs_ioctl_setattr_get_trans(ip);
 	if (IS_ERR(tp)) {
 		error = PTR_ERR(tp);
 		goto out_drop_write;
 	}
 
-	error = xfs_ioctl_setattr_xflags(tp, ip, &fa);
+	error = xfs_ioctl_setattr_xflags(tp, ip, &fa, filp);
 	if (error) {
 		xfs_trans_cancel(tp);
 		goto out_drop_write;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 325dd14..23ce9fc 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1008,3 +1008,324 @@ out:
 	xfs_trans_cancel(tp);
 	return error;
 }
+
+/**
+ * xfs_reflink_get_lxflags() - set reflink-related linux inode flags
+ *
+ * @ip: XFS inode
+ * @flags: Pointer to the user-visible inode flags
+ */
+void
+xfs_reflink_get_lxflags(
+	struct xfs_inode	*ip,		/* XFS inode */
+	unsigned int		*flags)		/* user flags */
+{
+	/*
+	 * If this is a reflink-capable filesystem and there are no shared
+	 * blocks, then this is a "nocow" file.
+	 */
+	if (!xfs_sb_version_hasreflink(&ip->i_mount->m_sb) ||
+	    (ip->i_d.di_flags & XFS_DIFLAG_REFLINK))
+		return;
+	*flags |= FS_NOCOW_FL;
+}
+
+
+/**
+ * xfs_reflink_dirty_range() -- Dirty all the shared blocks in the file so that
+ * they're rewritten elsewhere.  Similar to generic_perform_write().
+ *
+ * @filp: VFS file pointer
+ * @pos: offset to start dirtying
+ * @len: number of bytes to dirty
+ */
+STATIC int
+xfs_reflink_dirty_range(
+	struct file		*filp,
+	xfs_off_t		pos,
+	xfs_off_t		len)
+{
+	struct address_space	*mapping;
+	const struct address_space_operations *a_ops;
+	int			error;
+	unsigned int		flags;
+	struct page		*page;
+	struct page		*rpage;
+	unsigned long		offset;	/* Offset into pagecache page */
+	unsigned long		bytes;	/* Bytes to write to page */
+	void			*fsdata;
+
+	mapping = filp->f_mapping;
+	a_ops = mapping->a_ops;
+	flags = AOP_FLAG_UNINTERRUPTIBLE;
+	do {
+
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
+		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, len);
+		rpage = xfs_get_page(filp->f_inode, pos);
+		if (IS_ERR(rpage)) {
+			error = PTR_ERR(rpage);
+			break;
+		} else if (!rpage) {
+			error = -ENOMEM;
+			break;
+		}
+
+		error = a_ops->write_begin(filp, mapping, pos, bytes, flags,
+					   &page, &fsdata);
+		page_cache_release(rpage);
+		if (error < 0)
+			break;
+
+		if (!PageUptodate(page))
+			printk(KERN_ERR "%s: STALE? ino=%lu pos=%llu\n", __func__, filp->f_inode->i_ino, pos);
+		if (mapping_writably_mapped(mapping))
+			flush_dcache_page(page);
+
+		error = a_ops->write_end(filp, mapping, pos, bytes, bytes,
+					 page, fsdata);
+		if (error < 0)
+			break;
+		else if (error == 0) {
+			error = -EIO;
+			break;
+		} else {
+			bytes = error;
+			error = 0;
+		}
+
+		cond_resched();
+
+		pos += bytes;
+		len -= bytes;
+
+		balance_dirty_pages_ratelimited(mapping);
+		if (fatal_signal_pending(current)) {
+			error = -EINTR;
+			break;
+		}
+	} while (len > 0);
+
+	return error;
+}
+
+/**
+ * xfs_reflink_check_flag_adjust() - the only change we allow to the inode
+ * reflink flag is to clear it when the fs supports reflink.
+ *
+ * @ip: XFS inode
+ * @xflags: XFS in-core inode flags
+ */
+int						/* error */
+xfs_reflink_check_flag_adjust(
+	struct xfs_inode	*ip,		/* XFS inode */
+	unsigned int		*xflags)		/* in-core flags */
+{
+	unsigned int		chg;
+
+	compiletime_assert(XFS_XFLAG_REFLINK == XFS_DIFLAG_REFLINK,
+			"in-core and on-disk inode reflink flags must match");
+	chg = (*xflags & XFS_XFLAG_REFLINK) ^
+	      (ip->i_d.di_flags & XFS_DIFLAG_REFLINK);
+
+	if (!chg)
+		return 0;
+	if (!xfs_sb_version_hasreflink(&ip->i_mount->m_sb))
+		return -EOPNOTSUPP;
+	if (*xflags & XFS_XFLAG_REFLINK) {
+		*xflags &= ~XFS_XFLAG_REFLINK;
+		return 0;
+	}
+	return 0;
+}
+
+/**
+ * xfs_reflink_start_unshare() - dirty all the shared blocks so that they
+ * can be reallocated elsewhere, in preparation for clearing the reflink
+ * hint.
+ *
+ * @ip: XFS inode
+ * @xflags: XFS in-core inode flags
+ * @filp: VFS file structure
+ */
+int						/* error */
+xfs_reflink_start_unshare(
+	struct xfs_inode	*ip,		/* XFS inode */
+	unsigned int		xflags,		/* in-core flags */
+	struct file		*filp)		/* VFS file structure */
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error = 0;
+	xfs_fileoff_t		fbno;
+	xfs_filblks_t		end;
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		len;
+	xfs_nlink_t		nr;
+	xfs_off_t		isize;
+	xfs_off_t		fpos;
+	xfs_off_t		flen;
+	struct xfs_bmbt_irec	map[2];
+	int			nmaps;
+
+	if (!xfs_sb_version_hasreflink(&ip->i_mount->m_sb) ||
+	    (xflags & XFS_XFLAG_REFLINK) ||
+	    !(ip->i_d.di_flags & XFS_DIFLAG_REFLINK))
+		return 0;
+
+	inode_dio_wait(VFS_I(ip));
+
+	/*
+	 * The user wants to preemptively CoW all shared blocks in this file,
+	 * which enables us to turn off the reflink flag.  Iterate all
+	 * extents which are not prealloc/delalloc to see which ranges are
+	 * mentioned in the refcount tree, then read those blocks into the
+	 * pagecache, dirty them, fsync them back out, and then we can update
+	 * the inode flag.  What happens if we run out of memory? :)
+	 */
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	fbno = 0;
+	isize = i_size_read(VFS_I(ip));
+	if (isize == 0) {
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		return 0;
+	}
+	end = XFS_B_TO_FSB(mp, isize);
+	while (end - fbno > 0) {
+		nmaps = 1;
+		/*
+		 * Look for extents in the file.  Skip holes, delalloc, or
+		 * unwritten extents; they can't be reflinked.
+		 */
+		error = xfs_bmapi_read(ip, fbno, end - fbno, map, &nmaps, 0);
+		if (error)
+			goto out_unlock;
+		if (nmaps == 0)
+			break;
+		if (map[0].br_startblock == HOLESTARTBLOCK ||
+		    map[0].br_startblock == DELAYSTARTBLOCK ||
+		    map[0].br_state == XFS_EXT_UNWRITTEN)
+			goto next;
+
+		map[1] = map[0];
+		while (map[1].br_blockcount) {
+			agno = XFS_FSB_TO_AGNO(mp, map[1].br_startblock);
+			agbno = XFS_FSB_TO_AGBNO(mp, map[1].br_startblock);
+			CHECK_AG_NUMBER(mp, agno);
+			CHECK_AG_EXTENT(mp, agbno, 1);
+
+			error = xfs_reflink_get_refcount(mp, agno, agbno,
+							 &len, &nr);
+			if (error)
+				goto out_unlock;
+			XFS_WANT_CORRUPTED_GOTO(mp, len != 0, out_unlock);
+			if (len > map[1].br_blockcount)
+				len = map[1].br_blockcount;
+			if (nr < 2)
+				goto skip_copy;
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
+			fpos = XFS_FSB_TO_B(mp, map[1].br_startoff);
+			flen = XFS_FSB_TO_B(mp, len);
+			if (fpos + flen > isize)
+				flen = isize - fpos;
+			error = xfs_reflink_dirty_range(filp, fpos, flen);
+			xfs_ilock(ip, XFS_ILOCK_EXCL);
+			if (error)
+				goto out_unlock;
+skip_copy:
+			map[1].br_blockcount -= len;
+			map[1].br_startoff += len;
+			map[1].br_startblock += len;
+		}
+
+next:
+		fbno = map[0].br_startoff + map[0].br_blockcount;
+	}
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (error == 0)
+		error = filemap_write_and_wait(filp->f_mapping);
+	return error;
+}
+
+/**
+ * xfs_reflink_end_unshare() - finish removing reflink flag from inode
+ *
+ * @ip: XFS inode
+ * @xflags: XFS in-core inode flags
+ */
+int						/* error */
+xfs_reflink_end_unshare(
+	struct xfs_inode	*ip,		/* XFS inode */
+	unsigned int		xflags)		/* in-core flags */
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error = 0;
+	xfs_fileoff_t		fbno;
+	xfs_filblks_t		end;
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		len;
+	xfs_nlink_t		nr;
+	struct xfs_bmbt_irec	map[2];
+	int			nmaps;
+
+	if (!xfs_sb_version_hasreflink(&ip->i_mount->m_sb) ||
+	    (xflags & XFS_XFLAG_REFLINK) ||
+	    !(ip->i_d.di_flags & XFS_DIFLAG_REFLINK))
+		return 0;
+
+	/*
+	 * Earlier we copied all the shared blocks in this file to new blocks.
+	 * However, we dropped the ilock before getting the transaction, so
+	 * check that nobody wandered in and added more reflinks.
+	 */
+	fbno = 0;
+	end = XFS_B_TO_FSB(mp, i_size_read(VFS_I(ip)));
+	while (end - fbno > 0) {
+		nmaps = 1;
+		/*
+		 * Look for extents in the file.  Skip holes, delalloc, or
+		 * unwritten extents; they can't be reflinked.
+		 */
+		error = xfs_bmapi_read(ip, fbno, end - fbno, map, &nmaps, 0);
+		if (error)
+			goto out_unlock;
+		if (nmaps == 0)
+			break;
+		if (map[0].br_startblock == HOLESTARTBLOCK ||
+		    map[0].br_startblock == DELAYSTARTBLOCK ||
+		    map[0].br_state == XFS_EXT_UNWRITTEN)
+			goto next;
+
+		map[1] = map[0];
+		while (map[1].br_blockcount) {
+			agno = XFS_FSB_TO_AGNO(mp, map[1].br_startblock);
+			agbno = XFS_FSB_TO_AGBNO(mp, map[1].br_startblock);
+			CHECK_AG_NUMBER(mp, agno);
+			CHECK_AG_EXTENT(mp, agbno, 1);
+
+			error = xfs_reflink_get_refcount(mp, agno, agbno,
+							 &len, &nr);
+			if (error)
+				goto out_unlock;
+			XFS_WANT_CORRUPTED_GOTO(mp, len != 0, out_unlock);
+			if (len > map[1].br_blockcount)
+				len = map[1].br_blockcount;
+			if (nr > 1) {
+				error = -EINTR;
+				goto out_unlock;
+			}
+			map[1].br_blockcount -= len;
+			map[1].br_startblock += len;
+		}
+
+next:
+		fbno = map[0].br_startoff + map[0].br_blockcount;
+	}
+
+	ip->i_d.di_flags &= ~XFS_DIFLAG_REFLINK;
+out_unlock:
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 7f9660d..6f1ecf8 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -50,4 +50,14 @@ extern int xfs_reflink_finish_fork_buf(xfs_mount_t  *mp, xfs_inode_t *ip,
 	xfs_buf_t *bp, xfs_fileoff_t fileoff, xfs_trans_t *tp,
 	int write_error);
 
+extern void xfs_reflink_get_lxflags(struct xfs_inode *ip, unsigned int *flags);
+
+extern int xfs_reflink_check_flag_adjust(struct xfs_inode *ip,
+	unsigned int *xflags);
+
+extern int xfs_reflink_start_unshare(struct xfs_inode *ip, unsigned int xflags,
+	struct file *filp);
+
+extern int xfs_reflink_end_unshare(struct xfs_inode *ip, unsigned int xflags);
+
 #endif /* __XFS_REFLINK_H */


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 13/14] xfs: add reflink btree root when expanding the filesystem
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (11 preceding siblings ...)
  2015-06-25 23:40 ` [PATCH 12/14] xfs: support XFS_XFLAG_REFLINK (and FS_NOCOW_FL) on reflink filesystems Darrick J. Wong
@ 2015-06-25 23:40 ` Darrick J. Wong
  2015-06-25 23:40 ` [PATCH 14/14] xfs: add reflink btree block detection to log recovery Darrick J. Wong
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:40 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Create the root of the reflink btree whenever we add an AG to the
filesystem.  Plumb in the bits that enable growfs to ask the kernel
whether or not the fs supports reflink.
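
As a rough illustration of the userland side, a tool that already holds
the geometry flags word could test the new bit as below.  This is a
sketch under assumptions: geom_has_reflink() is a hypothetical helper,
the two constants copy the values of XFS_FSOP_GEOM_FLAGS_RMAPBT and
XFS_FSOP_GEOM_FLAGS_REFLINK from this series, and how the caller
obtains the flags is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define GEOM_FLAGS_RMAPBT	0x80000u	/* XFS_FSOP_GEOM_FLAGS_RMAPBT */
#define GEOM_FLAGS_REFLINK	0x100000u	/* XFS_FSOP_GEOM_FLAGS_REFLINK */

/* Return true if the geometry flags advertise reflink support. */
static bool geom_has_reflink(uint32_t geom_flags)
{
	return (geom_flags & GEOM_FLAGS_REFLINK) != 0;
}
```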

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h     |    1 +
 fs/xfs/libxfs/xfs_shared.h |    1 +
 fs/xfs/xfs_fsops.c         |   30 +++++++++++++++++++++++++++++-
 3 files changed, 31 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 6b1b71c..d7541f7 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -242,6 +242,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
 #define XFS_FSOP_GEOM_FLAGS_SPINODES	0x40000	/* sparse inode chunks	*/
 #define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* Reverse mapping btree */
+#define XFS_FSOP_GEOM_FLAGS_REFLINK	0x100000	/* reflink */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index d1de74e..807c0e3 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -53,6 +53,7 @@ extern const struct xfs_buf_ops xfs_dquot_buf_ops;
 extern const struct xfs_buf_ops xfs_sb_buf_ops;
 extern const struct xfs_buf_ops xfs_sb_quiet_buf_ops;
 extern const struct xfs_buf_ops xfs_symlink_buf_ops;
+extern const struct xfs_buf_ops xfs_reflinkbt_buf_ops;
 
 /*
  * Transaction types.  Used to distinguish types of buffers. These never reach
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 9aabefb..d68a3b5 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -106,7 +106,9 @@ xfs_fs_geometry(
 			(xfs_sb_version_hassparseinodes(&mp->m_sb) ?
 				XFS_FSOP_GEOM_FLAGS_SPINODES : 0) |
 			(xfs_sb_version_hasrmapbt(&mp->m_sb) ?
-				XFS_FSOP_GEOM_FLAGS_RMAPBT : 0);
+				XFS_FSOP_GEOM_FLAGS_RMAPBT : 0) |
+			(xfs_sb_version_hasreflink(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_REFLINK : 0);
 		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
 				mp->m_sb.sb_logsectsize : BBSIZE;
 		geo->rtsectsize = mp->m_sb.sb_blocksize;
@@ -260,6 +262,10 @@ xfs_growfs_data_private(
 		agf->agf_longest = cpu_to_be32(tmpsize);
 		if (xfs_sb_version_hascrc(&mp->m_sb))
 			uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_uuid);
+		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+			agf->agf_reflink_root = cpu_to_be32(XFS_RL_BLOCK(mp));
+			agf->agf_reflink_level = cpu_to_be32(1);
+		}
 
 		error = xfs_bwrite(bp);
 		xfs_buf_relse(bp);
@@ -503,6 +509,28 @@ xfs_growfs_data_private(
 				goto error0;
 		}
 
+		/*
+		 * reflink btree root block
+		 */
+		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+			bp = xfs_growfs_get_hdr_buf(mp,
+				XFS_AGB_TO_DADDR(mp, agno, XFS_RL_BLOCK(mp)),
+				BTOBB(mp->m_sb.sb_blocksize), 0,
+				&xfs_reflinkbt_buf_ops);
+			if (!bp) {
+				error = -ENOMEM;
+				goto error0;
+			}
+
+			xfs_btree_init_block(mp, bp, XFS_RLBT_CRC_MAGIC,
+					     0, 0, agno,
+					     XFS_BTREE_CRC_BLOCKS);
+
+			error = xfs_bwrite(bp);
+			xfs_buf_relse(bp);
+			if (error)
+				goto error0;
+		}
 	}
 	xfs_trans_agblocks_delta(tp, nfree);
 	/*


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 14/14] xfs: add reflink btree block detection to log recovery
  2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (12 preceding siblings ...)
  2015-06-25 23:40 ` [PATCH 13/14] xfs: add reflink btree root when expanding the filesystem Darrick J. Wong
@ 2015-06-25 23:40 ` Darrick J. Wong
  13 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-06-25 23:40 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: xfs

Teach log recovery how to deal with reflink btree blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_log_recover.c |    4 ++++
 1 file changed, 4 insertions(+)


diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 3bbea4f..2175d06 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1849,6 +1849,7 @@ xlog_recover_get_buf_lsn(
 	case XFS_ABTB_MAGIC:
 	case XFS_ABTC_MAGIC:
 	case XFS_RMAP_CRC_MAGIC:
+	case XFS_RLBT_CRC_MAGIC:
 	case XFS_IBT_CRC_MAGIC:
 	case XFS_IBT_MAGIC: {
 		struct xfs_btree_block *btb = blk;
@@ -2005,6 +2006,9 @@ xlog_recover_validate_buf_type(
 		case XFS_RMAP_CRC_MAGIC:
 			bp->b_ops = &xfs_rmapbt_buf_ops;
 			break;
+		case XFS_RLBT_CRC_MAGIC:
+			bp->b_ops = &xfs_reflinkbt_buf_ops;
+			break;
 		default:
 			xfs_warn(mp, "Bad btree block magic!");
 			ASSERT(0);


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 01/14] xfs: create a per-AG btree to track reference counts
  2015-06-25 23:39 ` [PATCH 01/14] xfs: create a per-AG btree to track reference counts Darrick J. Wong
@ 2015-07-01  0:13   ` Dave Chinner
  2015-07-01 22:52     ` Darrick J. Wong
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2015-07-01  0:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Thu, Jun 25, 2015 at 04:39:16PM -0700, Darrick J. Wong wrote:
> Create a per-AG btree to track the reference counts of physical blocks
> to support reflink.

Few things from a quick glance...

> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -54,6 +54,8 @@ xfs_extlen_t
>  xfs_prealloc_blocks(
>  	struct xfs_mount	*mp)
>  {
> +	if (xfs_sb_version_hasreflink(&mp->m_sb))
> +		return XFS_RL_BLOCK(mp) + 1;

Should introduce the sb version stuff as a separate patch perhaps
with the basic infrastructure defines (see how I did the first rmap
btree patch).

> @@ -1117,6 +1118,9 @@ xfs_btree_set_refs(
>  	case XFS_BTNUM_RMAP:
>  		xfs_buf_set_ref(bp, XFS_RMAP_BTREE_REF);
>  		break;
> +	case XFS_BTNUM_RL:

Probably better to call it XFS_BTNUM_REFLINK

> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 48ab2b1..a3f8661 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -43,6 +43,7 @@ union xfs_btree_key {
>  	xfs_alloc_key_t			alloc;
>  	struct xfs_inobt_key		inobt;
>  	struct xfs_rmap_key		rmap;
> +	xfs_reflink_key_t		reflink;

No new typedefs. struct xfs_reflink_key...

(only say this once, but applies many times ;)

> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 9cff517..e4954ab 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -446,9 +446,11 @@ xfs_sb_has_compat_feature(
>  
>  #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
>  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> +#define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflink btree */
>  #define XFS_SB_FEAT_RO_COMPAT_ALL \
>  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> -		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
> +		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> +		 XFS_SB_FEAT_RO_COMPAT_REFLINK)

The XFS_SB_FEAT_RO_COMPAT_REFLINK flag should be added as a separate
patch and put last in the series so it is only enabled once
everything is complete.


>  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
>  static inline bool
>  xfs_sb_has_ro_compat_feature(
> @@ -522,6 +524,12 @@ static inline bool xfs_sb_version_hasrmapbt(struct xfs_sb *sbp)
>  		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_RMAPBT);
>  }
>  
> +static inline int xfs_sb_version_hasreflink(xfs_sb_t *sbp)

bool.

> @@ -1338,6 +1349,50 @@ typedef __be32 xfs_rmap_ptr_t;
>  	 XFS_IBT_BLOCK(mp) + 1)
>  
>  /*
> + * reflink Btree format definitions
> + *
> + */
> +#define	XFS_RLBT_CRC_MAGIC	0x524C4233	/* 'RLB3' */

#define        XFS_RMAP_CRC_MAGIC      0x524d4233      /* 'RMB3' */

Only one bit difference in the magic numbers, which means they are
too similar. "RFL3" might be better or maybe "R3FL"...
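
The bit distance is easy to check with a small standalone program.
The helper below is a sketch (magic_bit_distance() is a made-up name);
the candidate values are just the ASCII bytes of the strings, i.e.
"RFL3" = 0x52464C33 and "R3FL" = 0x5233464C.

```c
#include <stdint.h>

/* Count the bits that differ between two 32-bit magic numbers. */
static int magic_bit_distance(uint32_t a, uint32_t b)
{
	uint32_t diff = a ^ b;
	int bits = 0;

	while (diff) {
		diff &= diff - 1;	/* clear the lowest set bit */
		bits++;
	}
	return bits;
}
```

Against 'RMB3' (0x524d4233) this gives 1 for 'RLB3', 6 for "RFL3" and
14 for "R3FL".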


> +/*
> + * Data record/key structure
> + */
> +typedef struct xfs_reflink_rec {
> +	__be32		rr_startblock;	/* starting block number */
> +	__be32		rr_blockcount;	/* count of blocks */
> +	__be32		rr_nlinks;	/* number of inodes linked here */
> +} xfs_reflink_rec_t;
> +
> +typedef struct xfs_reflink_key {
> +	__be32		rr_startblock;	/* starting block number */
> +} xfs_reflink_key_t;
> +
> +typedef struct xfs_reflink_rec_incore {
> +	xfs_agblock_t	rr_startblock;	/* starting block number */
> +	xfs_extlen_t	rr_blockcount;	/* count of free blocks */
> +	xfs_nlink_t	rr_nlinks;	/* number of inodes linked here */
> +} xfs_reflink_rec_incore_t;

We have been using "irec" as shorthand for "in-core record". i.e:
struct xfs_reflink_irec.

(kill typedefs)

> +
> +/*
> + * When a block hits MAXRLCOUNT references, it becomes permanently
> + * stuck in CoW mode, because who knows how many times it's really
> + * referenced.
> + */
> +#define MAXRLCOUNT	((xfs_nlink_t)~0U)
> +#define MAXRLEXTLEN	((xfs_extlen_t)~0U)

I'd suggest that if we hit the maximum count, we just abort the
reflink operation.
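
Something along these lines, as a standalone sketch rather than real
XFS code.  refcount_inc() and MAX_REFCOUNT are hypothetical stand-ins,
and the choice of -EMLINK as the error is only an assumption:

```c
#include <errno.h>
#include <stdint.h>

#define MAX_REFCOUNT	UINT32_MAX	/* stand-in for MAXRLCOUNT */

/* Bump a reference count, refusing instead of saturating. */
static int refcount_inc(uint32_t *nlinks)
{
	if (*nlinks == MAX_REFCOUNT)
		return -EMLINK;		/* error code is an assumption */
	(*nlinks)++;
	return 0;
}
```

The caller would then fail the reflink request and leave the existing
record untouched.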

> +/* btree pointer type */
> +typedef __be32 xfs_reflink_ptr_t;
> +
> +#define	XFS_RL_BLOCK(mp) \
> +	(xfs_sb_version_hasrmapbt(&((mp)->m_sb)) ? \
> +	 XFS_RMAP_BLOCK(mp) + 1 : \
> +	 (xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
> +	  XFS_FIBT_BLOCK(mp) + 1 : \
> +	  XFS_IBT_BLOCK(mp) + 1))

That's getting unwieldy. It's large enough for a function....
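
For example, the same decision tree reads more easily as a function.
The sketch below is standalone and uses a mock mount structure;
struct mock_mount and the mock_*() helpers are hypothetical, and the
layout assumed for the finobt and rmapbt root blocks (each one block
after the previous tree) is an assumption for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the bits of struct xfs_mount used here. */
struct mock_mount {
	bool		has_finobt;
	bool		has_rmapbt;
	uint32_t	ibt_block;	/* what XFS_IBT_BLOCK(mp) would give */
};

static uint32_t mock_fibt_block(const struct mock_mount *mp)
{
	return mp->ibt_block + 1;
}

/* Assumed layout: rmapbt root follows the last inode btree root. */
static uint32_t mock_rmap_block(const struct mock_mount *mp)
{
	if (mp->has_finobt)
		return mock_fibt_block(mp) + 1;
	return mp->ibt_block + 1;
}

/* Function form of the nested XFS_RL_BLOCK() conditional. */
static uint32_t reflink_block(const struct mock_mount *mp)
{
	if (mp->has_rmapbt)
		return mock_rmap_block(mp) + 1;
	if (mp->has_finobt)
		return mock_fibt_block(mp) + 1;
	return mp->ibt_block + 1;
}
```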

> +#ifdef REFLINK_DEBUG
> +# define dbg_printk(f, a...)  do {printk(KERN_ERR f, ## a); } while (0)
> +#else
> +# define dbg_printk(f, a...)
> +#endif

xfs_debug() is your friend.

> +#define CHECK_AG_NUMBER(mp, agno) \
> +	do { \
> +		ASSERT((agno) != NULLAGNUMBER); \
> +		ASSERT((agno) < (mp)->m_sb.sb_agcount); \
> +	} while(0);

Ugh. Used once, just open code.

> +#define CHECK_AG_EXTENT(mp, agbno, len) \
> +	do { \
> +		ASSERT((agbno) != NULLAGBLOCK); \
> +		ASSERT((len) > 0); \
> +		ASSERT((unsigned long long)(agbno) + (len) <= \
> +				(mp)->m_sb.sb_agblocks); \
> +	} while(0);

These are really used in places where corruption checks are
warranted, or the extent has already been checked....
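
As a sketch of what a corruption check might look like in place of the
ASSERTs (standalone code, not the kernel macros): check_ag_extent(),
NULL_AGBLOCK and ERR_CORRUPTED are hypothetical stand-ins, and the
numeric error value is an arbitrary placeholder for EFSCORRUPTED.

```c
#include <stdint.h>

#define NULL_AGBLOCK	((uint32_t)-1)	/* stand-in for NULLAGBLOCK */
#define ERR_CORRUPTED	117		/* placeholder for EFSCORRUPTED */

/* Validate an AG extent and report corruption instead of asserting. */
static int check_ag_extent(uint32_t agblocks, uint32_t agbno, uint32_t len)
{
	if (agbno == NULL_AGBLOCK || len == 0)
		return -ERR_CORRUPTED;
	/* widen before adding so the sum cannot wrap */
	if ((uint64_t)agbno + len > agblocks)
		return -ERR_CORRUPTED;
	return 0;
}
```

Unlike ASSERT(), this still catches a bad record on a non-debug kernel.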

> +#define XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, have, agbno, len, nr, label) \
> +	do { \
> +		XFS_WANT_CORRUPTED_GOTO((mp), (have) == 1, label); \
> +		XFS_WANT_CORRUPTED_GOTO((mp), (len) > 0, label); \
> +		XFS_WANT_CORRUPTED_GOTO((mp), (nr) >= 2, label); \
> +		XFS_WANT_CORRUPTED_GOTO((mp), (unsigned long long)(agbno) + \
> +				(len) <= (mp)->m_sb.sb_agblocks, label); \
> +	} while(0);

Unused.

> +
> +STATIC int
> +xfs_reflinkbt_alloc_block(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_ptr	*start,
> +	union xfs_btree_ptr	*new,
> +	int			*stat)
> +{
> +	int			error;
> +	xfs_agblock_t		bno;
> +
> +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> +
> +	/* Allocate the new block from the freelist. If we can't, give up.  */
> +	error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp,
> +				       &bno, 1);
> +	if (error) {
> +		XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
> +		return error;
> +	}

Why does the reflink btree use the free list? Why can't it use
block allocation like the BMBT tree?

> +/*
> + * Allocate a new reflink btree cursor.
> + */
> +struct xfs_btree_cur *			/* new reflink btree cursor */
> +xfs_reflinkbt_init_cursor(
> +	struct xfs_mount	*mp,		/* file system mount point */
> +	struct xfs_trans	*tp,		/* transaction pointer */
> +	struct xfs_buf		*agbp,		/* buffer for agf structure */
> +	xfs_agnumber_t		agno)		/* allocation group number */

No real need for these comments on the variables. They are redundant
as the code documents what they are just fine.

> +{
> +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> +	struct xfs_btree_cur	*cur;
> +
> +	CHECK_AG_NUMBER(mp, agno);
> +	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
> +
> +	cur->bc_tp = tp;
> +	cur->bc_mp = mp;
> +	cur->bc_btnum = XFS_BTNUM_RL;
> +	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> +	cur->bc_ops = &xfs_reflinkbt_ops;
> +
> +	cur->bc_nlevels = be32_to_cpu(agf->agf_reflink_level);
> +
> +	cur->bc_private.a.agbp = agbp;
> +	cur->bc_private.a.agno = agno;
> +
> +	if (xfs_sb_version_hascrc(&mp->m_sb))
> +		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;

Can be set unconditionally.

The next set of functions normally go into a different file. i.e the
"xfs_foo_btree.c" file contains the functions required by the
generic btree abstraction to implement the "foo" btree format.  The
file "xfs_foo.c" then contains the code/logic that provides the
external foo API, manages the information in the foo btree, and calls
the generic btree functions to manage the btree. This logic isn't
present in the patch, so really it should be added by the patch that
starts implementing the reflink API....

> +/*
> + * Get the data from the pointed-to record.
> + */
> +int					/* error */
> +xfs_reflink_get_rec(
> +	struct xfs_btree_cur	*cur,	/* btree cursor */
> +	xfs_agblock_t		*bno,	/* output: starting block of extent */
> +	xfs_extlen_t		*len,	/* output: length of extent */
> +	xfs_nlink_t		*nlink,	/* output: number of links */
> +	int			*stat)	/* output: success/failure */
> +{
> +	union xfs_btree_rec	*rec;
> +	int			error;
> +
> +	error = xfs_btree_get_rec(cur, &rec, stat);
> +	if (!error && *stat == 1) {
> +		CHECK_AG_EXTENT(cur->bc_mp,
> +			be32_to_cpu(rec->reflink.rr_startblock),
> +			be32_to_cpu(rec->reflink.rr_blockcount));
> +		*bno = be32_to_cpu(rec->reflink.rr_startblock);
> +		*len = be32_to_cpu(rec->reflink.rr_blockcount);
> +		*nlink = be32_to_cpu(rec->reflink.rr_nlinks);
> +	}
> +	return error;

	if (error || !*stat)
		return error;
	.....
	return 0;
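
Spelled out as a standalone sketch, the early-return shape looks like
this.  struct mock_rec and mock_get_rec() are hypothetical stand-ins
for the btree record and xfs_btree_get_rec(), and the endian
conversion is omitted:

```c
#include <stddef.h>
#include <stdint.h>

struct mock_rec {
	uint32_t	startblock;
	uint32_t	blockcount;
	uint32_t	nlinks;
};

/* Hypothetical lookup: sets *stat to 1 if a record exists. */
static int mock_get_rec(const struct mock_rec *src,
			const struct mock_rec **rec, int *stat)
{
	*rec = src;
	*stat = (src != NULL);
	return 0;
}

static int get_rec(const struct mock_rec *src, uint32_t *bno,
		   uint32_t *len, uint32_t *nlink, int *stat)
{
	const struct mock_rec *rec;
	int error;

	error = mock_get_rec(src, &rec, stat);
	if (error || !*stat)
		return error;

	/* Only reached when a record was actually found. */
	*bno = rec->startblock;
	*len = rec->blockcount;
	*nlink = rec->nlinks;
	return 0;
}
```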

> +	error = xfs_reflink_get_rec(cur, &bno, &len, &nr, &x);
> +	if (error)
> +		return error;
> +	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, x == 1, error0);
> +	error = xfs_btree_delete(cur, i);
> +	if (error)
> +		return error;
> +	error = xfs_reflink_lookup_ge(cur, bno, &x);
> +error0:

New code should use sane jump labels. e.g. "out_error" is a pretty
standard jump label name for this...
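
A minimal userspace sketch of that convention, with every failure
after setup funnelled through one descriptively named label rather
than a mix of "return error" and "goto error0". All demo_* names and
the error values are hypothetical, chosen only for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts cursors that are currently allocated, to show nothing leaks. */
int demo_live_cursors;

/* Hypothetical step that fails on request with an arbitrary error code. */
static int
demo_step(int fail)
{
	return fail ? -5 : 0;
}

int
demo_delete_and_relookup(int fail_first, int fail_second)
{
	int	*cur;
	int	error;

	cur = malloc(sizeof(*cur));
	if (!cur)
		return -12;	/* nothing to clean up yet */
	demo_live_cursors++;

	error = demo_step(fail_first);
	if (error)
		goto out_error;
	error = demo_step(fail_second);
	if (error)
		goto out_error;

	free(cur);
	demo_live_cursors--;
	return 0;

out_error:
	/* Single exit for every failure once the cursor exists. */
	free(cur);
	demo_live_cursors--;
	return error;
}
```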

> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 88efbb4..d1de74e 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -216,6 +216,7 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
>  #define	XFS_INO_REF		2
>  #define	XFS_ATTR_BTREE_REF	1
>  #define	XFS_DQUOT_REF		1
> +#define XFS_REFLINK_BTREE_REF	1

whitespace.

> @@ -315,6 +317,9 @@ typedef struct xfs_perag {
>  	/* for rcu-safe freeing */
>  	struct rcu_head	rcu_head;
>  	int		pagb_count;	/* pagb slots in use */
> +
> +	/* reflink */
> +	__uint8_t	pagf_reflink_level;

May as well just make it the same as what is on disk (i.e.
uint32_t).

> +++ b/fs/xfs/xfs_stats.c
> @@ -61,6 +61,7 @@ static int xfs_stat_proc_show(struct seq_file *m, void *v)
>  		{ "ibt2",		XFSSTAT_END_IBT_V2		},
>  		{ "fibt2",		XFSSTAT_END_FIBT_V2		},
>  		{ "rmapbt",		XFSSTAT_END_RMAP_V2		},
> +		{ "rlbt2",		XFSSTAT_END_RLBT_V2		},

"reflinkbt". No need for the "2", as there is only one set of
reflink btree stats.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 02/14] libxfs: adjust refcounts in reflink btree
  2015-06-25 23:39 ` [PATCH 02/14] libxfs: adjust refcounts in reflink btree Darrick J. Wong
@ 2015-07-01  1:06   ` Dave Chinner
  2015-07-01 23:10     ` Darrick J. Wong
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2015-07-01  1:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Thu, Jun 25, 2015 at 04:39:23PM -0700, Darrick J. Wong wrote:
> Provide a function to adjust the reference counts for a range of
> blocks in the reflink btree.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_reflink_btree.c |  406 +++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_reflink_btree.h |    4 
>  2 files changed, 410 insertions(+)

As per previous comments, this all belongs in
fs/xfs/libxfs/xfs_reflink.c...

> 
> 
> diff --git a/fs/xfs/libxfs/xfs_reflink_btree.c b/fs/xfs/libxfs/xfs_reflink_btree.c
> index 8a0fa5d..380ed72 100644
> --- a/fs/xfs/libxfs/xfs_reflink_btree.c
> +++ b/fs/xfs/libxfs/xfs_reflink_btree.c
> @@ -529,3 +529,409 @@ xfs_reflinkbt_delete(
>  error0:
>  	return error;
>  }
> +
> +#ifdef REFLINK_DEBUG
> +static void
> +dump_cur_loc(
> +	struct xfs_btree_cur	*cur,
> +	const char		*str,
> +	int			line)
> +{
> +	xfs_agblock_t		gbno;
> +	xfs_extlen_t		glen;
> +	xfs_nlink_t		gnr;
> +	int			i;
> +
> +	xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
> +	printk(KERN_INFO "%s(%d) cur[%d]:[%u,%u,%u,%d] ", str, line,
> +	       cur->bc_ptrs[0], gbno, glen, gnr, i);
> +	if (i && cur->bc_ptrs[0]) {
> +		cur->bc_ptrs[0]--;
> +		xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
> +		printk("left[%d]:[%u,%u,%u,%d] ", cur->bc_ptrs[0],
> +		       gbno, glen, gnr, i);
> +		cur->bc_ptrs[0]++;
> +	}
> +
> +	if (i && cur->bc_ptrs[0] < xfs_reflinkbt_get_maxrecs(cur, 0)) {
> +		cur->bc_ptrs[0]++;
> +		xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
> +		printk("right[%d]:[%u,%u,%u,%d] ", cur->bc_ptrs[0],
> +		       gbno, glen, gnr, i);
> +		cur->bc_ptrs[0]--;
> +	}
> +	printk("\n");
> +}
> +#else
> +# define dump_cur_loc(c, s, l)
> +#endif

Use trace points on lookup/update/insert/delete so debug like this
is unnecessary.


> +/*
> + * Adjust the ref count of a range of AG blocks.
> + */
> +int						/* error */
> +xfs_reflinkbt_adjust_refcount(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,		/* transaction pointer */
> +	struct xfs_buf		*agbp,		/* buffer for agf structure */
> +	xfs_agnumber_t		agno,		/* allocation group number */
> +	xfs_agblock_t		agbno,		/* start of range */
> +	xfs_extlen_t		aglen,		/* length of range */
> +	int			adj)		/* how much to change refcnt */

350 line function. Needs factoring. Also needs a comment explaining
the algorithm. 

> +{
> +	struct xfs_btree_cur	*cur;
> +	int			error;
> +	int			i, have;
> +	bool			real_crl;	/* cbno/clen is on disk? */
> +	xfs_agblock_t		lbno, cbno, rbno;	/* rlextent start */
> +	xfs_extlen_t		llen, clen, rlen;	/* rlextent length */
> +	xfs_nlink_t		lnr, cnr, rnr;		/* rlextent refcount */

"num" is the usual shorthand for "number". And in this case, nr is
extremely ambiguous: Number of records, number of reflinks, some
other number? I can't easily tell when I read the code, so the
variable names need to be better. factoring will certainly help
here.

> +	xfs_agblock_t		bno;		/* ag bno in the loop */
> +	xfs_agblock_t		agbend;		/* end agbno of the loop */
> +	xfs_extlen_t		len;		/* remaining len to add */
> +	xfs_nlink_t		new_cnr;	/* new refcount */
> +
> +	CHECK_AG_NUMBER(mp, agno);
> +	CHECK_AG_EXTENT(mp, agbno, aglen);

No real need for these checks - bad agno or extent sizes should have
been validated long before this.

> +
> +	/*
> +	 * Allocate/initialize a cursor for the by-number freespace btree.
> +	 */
> +	cur = xfs_reflinkbt_init_cursor(mp, tp, agbp, agno);

You can kill that incorrect comment.

> +
> +	/*
> +	 * Split a left rlextent that crosses agbno.
> +	 */

These comments need some ascii art displaying the before, current
extent and after states so it's clear what the intent is. As it is,
I'd probably split these into "left/right/middle" helper functions,
as there is no state created by these initial overlap splits
used later in the function. That would get rid of excessive
indentation, make the error handling more obvious, etc.
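
As a rough idea of what one such helper and its comment could look
like, here is a userspace sketch. It works on a plain struct rather
than on btree records, and struct demo_rlextent and demo_split_at()
are hypothetical names, not anything from the patch.

```c
#include <assert.h>
#include <stdint.h>

struct demo_rlextent {
	uint32_t	start;	/* first AG block */
	uint32_t	len;	/* number of blocks */
	uint32_t	refs;	/* reference count */
};

/*
 * Split an extent that crosses @agbno into two pieces.
 *
 *   before:  |<---------- ext ---------->|
 *                        ^ agbno
 *   after:   |<-- left -->|<--- right --->|
 *
 * Both halves keep the original reference count.  Returns 1 if a
 * split happened, 0 if @agbno does not fall strictly inside the
 * extent (in which case *left and *right are left untouched).
 */
int
demo_split_at(
	const struct demo_rlextent	*ext,
	uint32_t			agbno,
	struct demo_rlextent		*left,
	struct demo_rlextent		*right)
{
	if (agbno <= ext->start || agbno - ext->start >= ext->len)
		return 0;

	left->start = ext->start;
	left->len = agbno - ext->start;
	left->refs = ext->refs;

	right->start = agbno;
	right->len = ext->len - left->len;
	right->refs = ext->refs;
	return 1;
}
```

The same helper would serve for both the left and the right edge of
the range, since each is just "split whatever crosses this block
number".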

> +	error = xfs_reflink_lookup_le(cur, agbno, &have);
> +	if (error)
> +		goto error0;

		goto out_error;

> +	if (have) {

If I "have" what?  "found_rec" would be a better name, because then
the code reads clearly...

> +	/*
> +	 * Start iterating the range we're adjusting.  rlextent boundaries
> +	 * should be at agbno and agbend.
> +	 */

Trying to work my way through this loop, but the logic is hard to
follow. It's hurting my head trying to work out what it is supposed
to be doing, so I'm going to wait for more comments, ascii art, and
factoring before really looking at it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 03/14] libxfs: support unmapping reflink blocks
  2015-06-25 23:39 ` [PATCH 03/14] libxfs: support unmapping reflink blocks Darrick J. Wong
@ 2015-07-01  1:26   ` Dave Chinner
  2015-07-02  2:27     ` Darrick J. Wong
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2015-07-01  1:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Thu, Jun 25, 2015 at 04:39:30PM -0700, Darrick J. Wong wrote:
> When we're unmapping blocks from a file, we need to decrease refcounts
> in the btree and only free blocks if their refcount is 1.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c          |    5 +
>  fs/xfs/libxfs/xfs_reflink_btree.c |  140 +++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_reflink_btree.h |    4 +
>  3 files changed, 147 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 057fa9a..3f5e8da 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -45,6 +45,7 @@
>  #include "xfs_symlink.h"
>  #include "xfs_attr_leaf.h"
>  #include "xfs_filestream.h"
> +#include "xfs_reflink_btree.h"
>  
>  
>  kmem_zone_t		*xfs_bmap_free_item_zone;
> @@ -4984,8 +4985,8 @@ xfs_bmap_del_extent(
>  	 * If we need to, add to list of extents to delete.
>  	 */
>  	if (do_fx)
> -		xfs_bmap_add_free(mp, flist, del->br_startblock,
> -				  del->br_blockcount, ip->i_ino);
> +		xfs_reflink_bmap_add_free(mp, flist, del->br_startblock,
> +					  del->br_blockcount, ip->i_ino, tp);

I think this is the wrong abstraction. I think the code should look
like this:

	if (do_fx) {
		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
			error = xfs_reflink_del_extent(mp, tp, flist,
						del->br_startblock,
						del->br_blockcount, ip->i_ino);
			if (error)
				goto done;
		} else
			xfs_bmap_add_free()
	}

Because what we are doing is deleting an extent from the reflink
btree, not adding a freed extent to the "to-be-freed" list.


> diff --git a/fs/xfs/libxfs/xfs_reflink_btree.c b/fs/xfs/libxfs/xfs_reflink_btree.c
> index 380ed72..f40ba1f 100644
> --- a/fs/xfs/libxfs/xfs_reflink_btree.c
> +++ b/fs/xfs/libxfs/xfs_reflink_btree.c

Again, xfs_reflink.c

> @@ -935,3 +935,143 @@ error0:
>  	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
>  	return error;
>  }
> +
> +/**
> + * xfs_reflink_bmap_add_free() - release a range of blocks
> + *
> + * @mp: XFS mount object
> + * @flist: List of blocks to be freed at the end of the transaction
> + * @fsbno: First fs block of the range to release
> + * @len: Length of range
> + * @owner: owner of the extent
> + * @tp: transaction that goes with the free operation
> + */
> +int
> +xfs_reflink_bmap_add_free(
> +	struct xfs_mount	*mp,		/* mount point structure */
> +	xfs_bmap_free_t		*flist,		/* list of extents */
> +	xfs_fsblock_t		fsbno,		/* fs block number of extent */
> +	xfs_filblks_t		fslen,		/* length of extent */
> +	uint64_t		owner,		/* extent owner */
> +	struct xfs_trans	*tp)		/* transaction */
> +{
> +	struct xfs_btree_cur	*cur;
> +	int			error;
> +	struct xfs_buf		*agbp;
> +	xfs_agnumber_t		agno;		/* allocation group number */
> +	xfs_agblock_t		agbno;		/* ag start of range to free */
> +	xfs_agblock_t		agbend;		/* ag end of range to free */
> +	xfs_extlen_t		aglen;		/* ag length of range to free */
> +	int			i, have;
> +	xfs_agblock_t		lbno;		/* rlextent start */
> +	xfs_extlen_t		llen;		/* rlextent length */
> +	xfs_nlink_t		lnr;		/* rlextent refcount */
> +	xfs_agblock_t		bno;		/* rlext block # in loop */
> +	xfs_extlen_t		len;		/* rlext length in loop */
> +	unsigned long long	blocks_freed;
> +	xfs_fsblock_t		range_fsb;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb)) {
> +		xfs_bmap_add_free(mp, flist, fsbno, fslen, owner);
> +		return 0;
> +	}

That can be dropped.
> +
> +	agno = XFS_FSB_TO_AGNO(mp, fsbno);
> +	agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
> +	CHECK_AG_NUMBER(mp, agno);
> +	ASSERT(fslen < mp->m_sb.sb_agblocks);
> +	CHECK_AG_EXTENT(mp, agbno, fslen);

These extent lengths have already been checked. If they are invalid,
then the extent deletion would have errored out with corruption
long before we get here.

> +	aglen = fslen;
> +
> +	/*
> +	 * Drop reference counts in the reflink tree.
> +	 */
> +	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agbp);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Grab a rl btree cursor.
> +	 */
> +	cur = xfs_reflinkbt_init_cursor(mp, tp, agbp, agno);
> +	bno = agbno;
> +	len = aglen;
> +	agbend = agbno + aglen - 1;
> +	blocks_freed = 0;
> +
> +	/*
> +	 * Account for a left extent that partially covers our range.
> +	 */
> +	error = xfs_reflink_lookup_le(cur, bno, &have);
> +	if (error)
> +		goto error0;
> +	if (have) {
> +		error = xfs_reflink_get_rec(cur, &lbno, &llen, &lnr, &i);
> +		if (error)
> +			goto error0;
> +		XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, lbno, llen, lnr, error0);
> +		if (lbno + llen > bno) {
> +			blocks_freed += min(len, lbno + llen - bno);
> +			bno += blocks_freed;
> +			len -= blocks_freed;
> +		}
> +	}

So we unconditionally look up the reflink btree on extent free to
see if we need to free it, even if the inode has not been reflinked?
Doesn't this add a lot of overhead to the extent freeing?

Indeed, why not just mark inodes that have been reflinked (i.e. have
shared extents) with an on-disk flag so that we know if we need to
do reflink btree work or not? That way the code fragment above could
just check an inode flag rather than always calling into this
function for reflink enabled filesystems....

> +	while (len > 0) {
> +		/*
> +		 * Go find the next rlext.
> +		 */
> +		range_fsb = XFS_AGB_TO_FSB(mp, agno, bno);
> +		error = xfs_btree_increment(cur, 0, &have);
> +		if (error)
> +			goto error0;
> +		if (!have) {
> +			/*
> +			 * There's no right rlextent, so free bno to the end.
> +			 */
> +			lbno = bno + len;
> +			llen = 0;
> +		} else {
> +			/*
> +			 * Find the next rlextent.
> +			 */
> +			error = xfs_reflink_get_rec(cur, &lbno, &llen,
> +					&lnr, &i);
> +			if (error)
> +				goto error0;
> +			XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, lbno, llen, lnr,
> +						      error0);
> +			if (lbno >= bno + len) {
> +				lbno = bno + len;
> +				llen = 0;
> +			}
> +		}
> +
> +		/*
> +		 * Free everything up to the start of the rlextent and
> +		 * account for still-mapped blocks.
> +		 */
> +		if (lbno - bno > 0) {
> +			xfs_bmap_add_free(mp, flist, range_fsb, lbno - bno,
> +					  owner);
> +			len -= lbno - bno;
> +			bno += lbno - bno;
> +		}
> +		llen = min(llen, agbend + 1 - lbno);
> +		blocks_freed += llen;
> +		len -= llen;
> +		bno += llen;
> +	}
> +
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	error = xfs_reflinkbt_adjust_refcount(mp, tp, agbp, agno, agbno, aglen,
> +					      -1);

Hmmm - we just walked the btree to determine what extents to
free, and now we are going to walk the btree again to drop the
reference counts on shared extents? So every extent that gets freed
does two walks of the reflink btree regardless of whether it has
shared blocks or not?
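
For the sake of discussion, one way the two walks might be folded
into a single pass is sketched below in userspace C. It is not what
the patch does. It assumes a sorted, non-overlapping array in place
of the btree, assumes any record touching the range has already been
split so that it lies entirely inside it, and leaves records that
drop to one reference in the array where a real tree would presumably
delete them. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct demo_rlextent {
	uint32_t	start;	/* first AG block */
	uint32_t	len;	/* number of blocks */
	uint32_t	refs;	/* reference count, >= 2 while shared */
};

/*
 * Unmap [bno, bno + len) in one pass over the shared-extent records.
 * Blocks not covered by any record have exactly one owner and are
 * counted as freed; blocks covered by a record just lose a reference.
 * Returns the number of blocks that can be freed.
 */
uint32_t
demo_unmap_range(
	struct demo_rlextent	*recs,
	int			nrecs,
	uint32_t		bno,
	uint32_t		len)
{
	uint32_t		end = bno + len;
	uint32_t		freed = 0;
	int			i;

	for (i = 0; i < nrecs && bno < end; i++) {
		struct demo_rlextent	*r = &recs[i];

		if (r->start + r->len <= bno)
			continue;	/* entirely left of the range */
		if (r->start >= end)
			break;		/* entirely right of the range */

		freed += r->start - bno;	/* unshared gap before r */
		r->refs--;			/* shared part: drop one ref */
		bno = r->start + r->len;
	}
	if (bno < end)
		freed += end - bno;	/* unshared tail after last record */
	return freed;
}
```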

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag
  2015-06-25 23:40 ` [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag Darrick J. Wong
@ 2015-07-01  1:58   ` Dave Chinner
  2015-07-01 22:59     ` Darrick J. Wong
  2015-07-02  2:32     ` Darrick J. Wong
  0 siblings, 2 replies; 28+ messages in thread
From: Dave Chinner @ 2015-07-01  1:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Thu, Jun 25, 2015 at 04:40:16PM -0700, Darrick J. Wong wrote:
> Gate all the reflink functions (which generally involve an expensive
> trip to the reflink btree) on an inode flag which is applied to both
> inodes at reflink time.  This minimizes reflink's impact on non-CoW
> files.

Ah, I see you do this reflink inode flag here. This should be one of
the first patches, not the last.  i.e. the patch series should
build up all the supporting infrastructure in individual patches
before adding any of the actual reflink implementation....

Also, the flag needs to go into the di_flags2 field, as the last
flag in the di_flags field is reserved for a "more flags" flag if we
ever need to add more flags to a v2 inode in a v4 filesystem...

> +/*
> + * xfs_is_reflink_inode() -- Decide if an inode needs to be checked for CoW.
> + *
> + * @ip: XFS inode
> + */
> +bool
> +xfs_is_reflink_inode(
> +	struct xfs_inode	*ip)		/* XFS inode */
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> +		return false;
> +	if (!(ip->i_d.di_flags & XFS_DIFLAG_REFLINK))
> +		return false;
> +
> +	ASSERT(!XFS_IS_REALTIME_INODE(ip));
> +	return true;

I would have thought you only need to check the inode flag here
because the only time it will be set is on a reflink enabled
filesystem. i.e. that flag being set implies we've already done
all the "reflink is supported in this filesystem and it's not a
realtime file" checks when setting the flag.
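
In other words the helper could shrink to a single flag test. A
minimal userspace sketch, assuming the flag ends up in a di_flags2
style field as suggested earlier in this mail; struct demo_inode and
DEMO_DIFLAG2_REFLINK (including its bit position) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit in a hypothetical di_flags2 field. */
#define DEMO_DIFLAG2_REFLINK	(1ULL << 1)

struct demo_inode {
	uint64_t	di_flags2;
};

/*
 * The superblock-feature and realtime checks are assumed to have been
 * done once, at the point where the flag was set on the inode.
 */
bool
demo_is_reflink_inode(const struct demo_inode *ip)
{
	return (ip->di_flags2 & DEMO_DIFLAG2_REFLINK) != 0;
}
```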

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 01/14] xfs: create a per-AG btree to track reference counts
  2015-07-01  0:13   ` Dave Chinner
@ 2015-07-01 22:52     ` Darrick J. Wong
  2015-07-01 23:30       ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2015-07-01 22:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Jul 01, 2015 at 10:13:06AM +1000, Dave Chinner wrote:
> On Thu, Jun 25, 2015 at 04:39:16PM -0700, Darrick J. Wong wrote:
> > Create a per-AG btree to track the reference counts of physical blocks
> > to support reflink.
> 
> Few things from a quick glance...
> 
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -54,6 +54,8 @@ xfs_extlen_t
> >  xfs_prealloc_blocks(
> >  	struct xfs_mount	*mp)
> >  {
> > +	if (xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return XFS_RL_BLOCK(mp) + 1;
> 
> Should introduce the sb version stuff as a separate patch perhaps
> with the basic infrastructure defines (see how I did the first rmap
> btree patch).

Ok.

> > @@ -1117,6 +1118,9 @@ xfs_btree_set_refs(
> >  	case XFS_BTNUM_RMAP:
> >  		xfs_buf_set_ref(bp, XFS_RMAP_BTREE_REF);
> >  		break;
> > +	case XFS_BTNUM_RL:
> 
> Probably better to call it XFS_BTNUM_REFLINK

I was thinking about renaming the whole thing to 'refcount', i.e.
XFS_BTNUM_REFCOUNT since it /is/ a btree of reference counts.

> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index 48ab2b1..a3f8661 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -43,6 +43,7 @@ union xfs_btree_key {
> >  	xfs_alloc_key_t			alloc;
> >  	struct xfs_inobt_key		inobt;
> >  	struct xfs_rmap_key		rmap;
> > +	xfs_reflink_key_t		reflink;
> 
> No new typedefs. struct xfs_reflink_key...
> 
> (only say this once, but applies many times ;)

Yeah, sorry about that.

> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 9cff517..e4954ab 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -446,9 +446,11 @@ xfs_sb_has_compat_feature(
> >  
> >  #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
> >  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> > +#define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflink btree */
> >  #define XFS_SB_FEAT_RO_COMPAT_ALL \
> >  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> > -		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
> > +		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> > +		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
> 
> > The XFS_SB_FEAT_RO_COMPAT_REFLINK flag should be added as a separate
> patch and put last in the series so it is only enabled once
> everything is complete.

What if I define XFS_SB_FEAT_RO_COMPAT_REFLINK at the beginning but omit it
from the XFS_SB_FEAT_RO_COMPAT_ALL definition until the final patch?  That
should prohibit anyone from using the half-baked feature during a bisect.
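
A small userspace sketch of why that would work: as long as the bit
is missing from the ALL mask, it falls into the UNKNOWN mask, so a
check against unknown ro-compat features trips on it. The DEMO_*
names mirror the definitions quoted above but are stand-ins, and
demo_has_unknown_ro_compat() is a hypothetical simplification of the
real check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_RO_COMPAT_FINOBT	(1U << 0)
#define DEMO_RO_COMPAT_RMAPBT	(1U << 1)
#define DEMO_RO_COMPAT_REFLINK	(1U << 2)	/* defined early in the series */

/* REFLINK is deliberately left out until the final patch. */
#define DEMO_RO_COMPAT_ALL \
		(DEMO_RO_COMPAT_FINOBT | \
		 DEMO_RO_COMPAT_RMAPBT)
#define DEMO_RO_COMPAT_UNKNOWN	(~DEMO_RO_COMPAT_ALL)

/* True if the feature word carries any bit this build does not accept. */
bool
demo_has_unknown_ro_compat(uint32_t features)
{
	return (features & DEMO_RO_COMPAT_UNKNOWN) != 0;
}
```

The final patch would then only have to add the bit to the ALL mask.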

> >  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
> >  static inline bool
> >  xfs_sb_has_ro_compat_feature(
> > @@ -522,6 +524,12 @@ static inline bool xfs_sb_version_hasrmapbt(struct xfs_sb *sbp)
> >  		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_RMAPBT);
> >  }
> >  
> > +static inline int xfs_sb_version_hasreflink(xfs_sb_t *sbp)
> 
> bool.
> 
> > @@ -1338,6 +1349,50 @@ typedef __be32 xfs_rmap_ptr_t;
> >  	 XFS_IBT_BLOCK(mp) + 1)
> >  
> >  /*
> > + * reflink Btree format definitions
> > + *
> > + */
> > +#define	XFS_RLBT_CRC_MAGIC	0x524C4233	/* 'RLB3' */
> 
> #define        XFS_RMAP_CRC_MAGIC      0x524d4233      /* 'RMB3' */
> 
> Only one bit difference in the magic numbers, which means they are
> too similar. "RFL3" might be better or maybe "R3FL"...

"RFC3" ?

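A quick way to compare candidate magics is to count the bits in
which they differ. The helper below is a hypothetical userspace
sketch, not kernel code; the "RFC3" value in the usage note is simply
the ASCII bytes 'R' 'F' 'C' '3' written out as 0x52464333.

```c
#include <assert.h>
#include <stdint.h>

/* Count the bits that differ between two 32-bit magic numbers. */
int
demo_magic_distance(uint32_t a, uint32_t b)
{
	uint32_t	diff = a ^ b;
	int		bits = 0;

	while (diff) {
		diff &= diff - 1;	/* clear the lowest set bit */
		bits++;
	}
	return bits;
}
```

demo_magic_distance(0x524C4233, 0x524d4233) is 1, which is the
'RLB3' versus 'RMB3' problem pointed out above; feeding 0x52464333
against 0x524d4233 gives a distance larger than one.
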
> > +/*
> > + * Data record/key structure
> > + */
> > +typedef struct xfs_reflink_rec {
> > +	__be32		rr_startblock;	/* starting block number */
> > +	__be32		rr_blockcount;	/* count of blocks */
> > +	__be32		rr_nlinks;	/* number of inodes linked here */
> > +} xfs_reflink_rec_t;
> > +
> > +typedef struct xfs_reflink_key {
> > +	__be32		rr_startblock;	/* starting block number */
> > +} xfs_reflink_key_t;
> > +
> > +typedef struct xfs_reflink_rec_incore {
> > +	xfs_agblock_t	rr_startblock;	/* starting block number */
> > +	xfs_extlen_t	rr_blockcount;	/* count of free blocks */
> > +	xfs_nlink_t	rr_nlinks;	/* number of inodes linked here */
> > +} xfs_reflink_rec_incore_t;
> 
> > We have been using "irec" as shorthand for "in-core record". i.e:
> struct xfs_reflink_irec.

Noted.

> (kill typedefs)
> 
> > +
> > +/*
> > + * When a block hits MAXRLCOUNT references, it becomes permanently
> > + * stuck in CoW mode, because who knows how many times it's really
> > + * referenced.
> > + */
> > +#define MAXRLCOUNT	((xfs_nlink_t)~0U)
> > +#define MAXRLEXTLEN	((xfs_extlen_t)~0U)
> 
> I'd suggest that if we hit the maximum count, we just abort the
> reflink operation.

<nod>
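
A minimal userspace sketch of the "abort at the ceiling" behaviour,
as opposed to pinning the block in CoW mode. DEMO_MAXRLCOUNT mirrors
the MAXRLCOUNT definition quoted above; demo_refcount_inc() and the
choice of -EMLINK as the error are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAXRLCOUNT	((uint32_t)~0U)

/*
 * Take one more reference.  Returns 0 on success, or -EMLINK without
 * touching *refs if the count is already at the maximum, so that the
 * caller can fail the reflink operation.
 */
int
demo_refcount_inc(uint32_t *refs)
{
	if (*refs == DEMO_MAXRLCOUNT)
		return -EMLINK;
	(*refs)++;
	return 0;
}
```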

> > +/* btree pointer type */
> > +typedef __be32 xfs_reflink_ptr_t;
> > +
> > +#define	XFS_RL_BLOCK(mp) \
> > +	(xfs_sb_version_hasrmapbt(&((mp)->m_sb)) ? \
> > +	 XFS_RMAP_BLOCK(mp) + 1 : \
> > +	 (xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
> > +	  XFS_FIBT_BLOCK(mp) + 1 : \
> > +	  XFS_IBT_BLOCK(mp) + 1))
> 
> That's getting unwieldy. It's large enough for a function....

Ok.
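
Something along these lines, sketched in userspace with the same
three-way choice as the macro. struct demo_mount and its fields are
hypothetical stand-ins for the mount structure, the feature tests and
the XFS_IBT_BLOCK/XFS_FIBT_BLOCK/XFS_RMAP_BLOCK values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_mount {
	bool		has_finobt;	/* stand-in for xfs_sb_version_hasfinobt() */
	bool		has_rmapbt;	/* stand-in for xfs_sb_version_hasrmapbt() */
	uint32_t	ibt_block;	/* stand-in for XFS_IBT_BLOCK(mp) */
	uint32_t	fibt_block;	/* stand-in for XFS_FIBT_BLOCK(mp) */
	uint32_t	rmap_block;	/* stand-in for XFS_RMAP_BLOCK(mp) */
};

/* The reflink btree root goes right after the last btree root present. */
uint32_t
demo_rl_block(const struct demo_mount *mp)
{
	if (mp->has_rmapbt)
		return mp->rmap_block + 1;
	if (mp->has_finobt)
		return mp->fibt_block + 1;
	return mp->ibt_block + 1;
}
```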

> > +#ifdef REFLINK_DEBUG
> > +# define dbg_printk(f, a...)  do {printk(KERN_ERR f, ## a); } while (0)
> > +#else
> > +# define dbg_printk(f, a...)
> > +#endif
> 
> xfs_debug() is your friend.
> 
> > +#define CHECK_AG_NUMBER(mp, agno) \
> > +	do { \
> > +		ASSERT((agno) != NULLAGNUMBER); \
> > +		ASSERT((agno) < (mp)->m_sb.sb_agcount); \
> > +	} while(0);
> 
> Ugh. Used once, just open code.
> 
> > +#define CHECK_AG_EXTENT(mp, agbno, len) \
> > +	do { \
> > +		ASSERT((agbno) != NULLAGBLOCK); \
> > +		ASSERT((len) > 0); \
> > +		ASSERT((unsigned long long)(agbno) + (len) <= \
> > +				(mp)->m_sb.sb_agblocks); \
> > +	} while(0);
> 
> These are really used in places where corruption checks are
> warranted, or the extent has already been checked....
> 
> > +#define XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, have, agbno, len, nr, label) \
> > +	do { \
> > +		XFS_WANT_CORRUPTED_GOTO((mp), (have) == 1, label); \
> > +		XFS_WANT_CORRUPTED_GOTO((mp), (len) > 0, label); \
> > +		XFS_WANT_CORRUPTED_GOTO((mp), (nr) >= 2, label); \
> > +		XFS_WANT_CORRUPTED_GOTO((mp), (unsigned long long)(agbno) + \
> > +				(len) <= (mp)->m_sb.sb_agblocks, label); \
> > +	} while(0);
> 
> Unused.
> 
> > +
> > +STATIC int
> > +xfs_reflinkbt_alloc_block(
> > +	struct xfs_btree_cur	*cur,
> > +	union xfs_btree_ptr	*start,
> > +	union xfs_btree_ptr	*new,
> > +	int			*stat)
> > +{
> > +	int			error;
> > +	xfs_agblock_t		bno;
> > +
> > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > +
> > +	/* Allocate the new block from the freelist. If we can't, give up.  */
> > +	error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp,
> > +				       &bno, 1);
> > +	if (error) {
> > +		XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
> > +		return error;
> > +	}
> 
> Why does the reflink btree use the free list? Why can't it use
> block allocation like the BMBT tree?

I'm confused about the intended usage of the AGFL -- the XFS FS structure doc
says that it's for growing the free space btrees and can't be used for anything
else, but the rmap btree uses it.

Originally it /did/ use xfs_alloc_vextent(), so it won't be difficult to
revert.

> 
> > +/*
> > + * Allocate a new reflink btree cursor.
> > + */
> > +struct xfs_btree_cur *			/* new reflink btree cursor */
> > +xfs_reflinkbt_init_cursor(
> > +	struct xfs_mount	*mp,		/* file system mount point */
> > +	struct xfs_trans	*tp,		/* transaction pointer */
> > +	struct xfs_buf		*agbp,		/* buffer for agf structure */
> > +	xfs_agnumber_t		agno)		/* allocation group number */
> 
> No real need for these comments on the variables. They are redundant
> as the code documents what they are just fine.

I was playing monkey-see monkey-do.  Some of the other functions had
commented args. :)

> 
> > +{
> > +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> > +	struct xfs_btree_cur	*cur;
> > +
> > +	CHECK_AG_NUMBER(mp, agno);
> > +	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
> > +
> > +	cur->bc_tp = tp;
> > +	cur->bc_mp = mp;
> > +	cur->bc_btnum = XFS_BTNUM_RL;
> > +	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> > +	cur->bc_ops = &xfs_reflinkbt_ops;
> > +
> > +	cur->bc_nlevels = be32_to_cpu(agf->agf_reflink_level);
> > +
> > +	cur->bc_private.a.agbp = agbp;
> > +	cur->bc_private.a.agno = agno;
> > +
> > +	if (xfs_sb_version_hascrc(&mp->m_sb))
> > +		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
> 
> Can be set unconditionally.
> 
> The next set of functions normally go into a different file, i.e. the
> "xfs_foo_btree.c" file contains the functions required by the
> generic btree abstraction to implement the "foo" btree format.  The
> file "xfs_foo.c" then contains the code/logic that provides the
> external foo API, manages the information in the foo btree, and calls
> the generic btree functions to manage the btree. This logic isn't
> present in the patch, so really it should be added by the patch that
> starts implementing the reflink API....

Ok, I'll split this stuff out into smaller files.

> > +/*
> > + * Get the data from the pointed-to record.
> > + */
> > +int					/* error */
> > +xfs_reflink_get_rec(
> > +	struct xfs_btree_cur	*cur,	/* btree cursor */
> > +	xfs_agblock_t		*bno,	/* output: starting block of extent */
> > +	xfs_extlen_t		*len,	/* output: length of extent */
> > +	xfs_nlink_t		*nlink,	/* output: number of links */
> > +	int			*stat)	/* output: success/failure */
> > +{
> > +	union xfs_btree_rec	*rec;
> > +	int			error;
> > +
> > +	error = xfs_btree_get_rec(cur, &rec, stat);
> > +	if (!error && *stat == 1) {
> > +		CHECK_AG_EXTENT(cur->bc_mp,
> > +			be32_to_cpu(rec->reflink.rr_startblock),
> > +			be32_to_cpu(rec->reflink.rr_blockcount));
> > +		*bno = be32_to_cpu(rec->reflink.rr_startblock);
> > +		*len = be32_to_cpu(rec->reflink.rr_blockcount);
> > +		*nlink = be32_to_cpu(rec->reflink.rr_nlinks);
> > +	}
> > +	return error;
> 
> 	if (error || !*stat)
> 		return error;
> 	.....
> 	return 0;
> 
> > +	error = xfs_reflink_get_rec(cur, &bno, &len, &nr, &x);
> > +	if (error)
> > +		return error;
> > +	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, x == 1, error0);
> > +	error = xfs_btree_delete(cur, i);
> > +	if (error)
> > +		return error;
> > +	error = xfs_reflink_lookup_ge(cur, bno, &x);
> > +error0:
> 
> New code should use sane jump labels. e.g. "out_error" is a pretty
> standard jump label name for this...
> 
> > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> > index 88efbb4..d1de74e 100644
> > --- a/fs/xfs/libxfs/xfs_shared.h
> > +++ b/fs/xfs/libxfs/xfs_shared.h
> > @@ -216,6 +216,7 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
> >  #define	XFS_INO_REF		2
> >  #define	XFS_ATTR_BTREE_REF	1
> >  #define	XFS_DQUOT_REF		1
> > +#define XFS_REFLINK_BTREE_REF	1
> 
> whitespace.
> 
> > @@ -315,6 +317,9 @@ typedef struct xfs_perag {
> >  	/* for rcu-safe freeing */
> >  	struct rcu_head	rcu_head;
> >  	int		pagb_count;	/* pagb slots in use */
> > +
> > +	/* reflink */
> > +	__uint8_t	pagf_reflink_level;
> 
> May as well just make it the same as what is on disk (i.e.
> uint32_t).
> 
> > +++ b/fs/xfs/xfs_stats.c
> > @@ -61,6 +61,7 @@ static int xfs_stat_proc_show(struct seq_file *m, void *v)
> >  		{ "ibt2",		XFSSTAT_END_IBT_V2		},
> >  		{ "fibt2",		XFSSTAT_END_FIBT_V2		},
> >  		{ "rmapbt",		XFSSTAT_END_RMAP_V2		},
> > +		{ "rlbt2",		XFSSTAT_END_RLBT_V2		},
> 
> "reflinkbt". No need for the "2", as there is only one set of
> reflink btree stats.

Ok, thanks for the review!

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag
  2015-07-01  1:58   ` Dave Chinner
@ 2015-07-01 22:59     ` Darrick J. Wong
  2015-07-01 23:49       ` Dave Chinner
  2015-07-02  2:32     ` Darrick J. Wong
  1 sibling, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2015-07-01 22:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Jul 01, 2015 at 11:58:43AM +1000, Dave Chinner wrote:
> On Thu, Jun 25, 2015 at 04:40:16PM -0700, Darrick J. Wong wrote:
> > Gate all the reflink functions (which generally involve an expensive
> > trip to the reflink btree) on an inode flag which is applied to both
> > inodes at reflink time.  This minimizes reflink's impact on non-CoW
> > files.
> 
> Ah, I see you do this reflink inode flag here. This should be one of
> the first patches, not the last.  i.e. the patch series should
> build up all the supporting infrastructure in individual patches
> before adding any of the actual reflink implementation....

Pardon all the dust, I figured that it'd be better to get all the patches
out for earlier review than to make everyone wait until I could get a
reasonable refactoring done once.

> Also, the flag needs to go into the di_flags2 field, as the last
> flag in the di_flags field is reserved for a "more flags" flag if we
> ever need to add more flags to a v2 inode in a v4 filesystem...

Ok.

> > +/*
> > + * xfs_is_reflink_inode() -- Decide if an inode needs to be checked for CoW.
> > + *
> > + * @ip: XFS inode
> > + */
> > +bool
> > +xfs_is_reflink_inode(
> > +	struct xfs_inode	*ip)		/* XFS inode */
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return false;
> > +	if (!(ip->i_d.di_flags & XFS_DIFLAG_REFLINK))
> > +		return false;
> > +
> > +	ASSERT(!XFS_IS_REALTIME_INODE(ip));
> > +	return true;
> 
> I would have thought you only need to check the inode flag here
> because the only time it will be set is on a reflink enabled
> filesystem. i.e. that flag being set implies we've already done
> all the "reflink is supported in this filesystem and it's not a
> realtime file" checks when setting the flag.

Sure.  The reason for so many ASSERTs everywhere is to help me check my
own sanity while cobbling together the first version.  I imagine I could
eliminate a lot of them, but since they all compile out on !XFS_DEBUG &&
!XFS_WARN, I didn't think it was a serious problem. :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 02/14] libxfs: adjust refcounts in reflink btree
  2015-07-01  1:06   ` Dave Chinner
@ 2015-07-01 23:10     ` Darrick J. Wong
  2015-07-01 23:32       ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2015-07-01 23:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Jul 01, 2015 at 11:06:54AM +1000, Dave Chinner wrote:
> On Thu, Jun 25, 2015 at 04:39:23PM -0700, Darrick J. Wong wrote:
> > Provide a function to adjust the reference counts for a range of
> > blocks in the reflink btree.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_reflink_btree.c |  406 +++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_reflink_btree.h |    4 
> >  2 files changed, 410 insertions(+)
> 
> As per previous comments, this all belongs in
> fs/xfs/libxfs/xfs_reflink.c...
> 
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_reflink_btree.c b/fs/xfs/libxfs/xfs_reflink_btree.c
> > index 8a0fa5d..380ed72 100644
> > --- a/fs/xfs/libxfs/xfs_reflink_btree.c
> > +++ b/fs/xfs/libxfs/xfs_reflink_btree.c
> > @@ -529,3 +529,409 @@ xfs_reflinkbt_delete(
> >  error0:
> >  	return error;
> >  }
> > +
> > +#ifdef REFLINK_DEBUG
> > +static void
> > +dump_cur_loc(
> > +	struct xfs_btree_cur	*cur,
> > +	const char		*str,
> > +	int			line)
> > +{
> > +	xfs_agblock_t		gbno;
> > +	xfs_extlen_t		glen;
> > +	xfs_nlink_t		gnr;
> > +	int			i;
> > +
> > +	xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
> > +	printk(KERN_INFO "%s(%d) cur[%d]:[%u,%u,%u,%d] ", str, line,
> > +	       cur->bc_ptrs[0], gbno, glen, gnr, i);
> > +	if (i && cur->bc_ptrs[0]) {
> > +		cur->bc_ptrs[0]--;
> > +		xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
> > +		printk("left[%d]:[%u,%u,%u,%d] ", cur->bc_ptrs[0],
> > +		       gbno, glen, gnr, i);
> > +		cur->bc_ptrs[0]++;
> > +	}
> > +
> > +	if (i && cur->bc_ptrs[0] < xfs_reflinkbt_get_maxrecs(cur, 0)) {
> > +		cur->bc_ptrs[0]++;
> > +		xfs_reflink_get_rec(cur, &gbno, &glen, &gnr, &i);
> > +		printk("right[%d]:[%u,%u,%u,%d] ", cur->bc_ptrs[0],
> > +		       gbno, glen, gnr, i);
> > +		cur->bc_ptrs[0]--;
> > +	}
> > +	printk("\n");
> > +}
> > +#else
> > +# define dump_cur_loc(c, s, l)
> > +#endif
> 
> Use trace points on lookup/update/insert/delete so debug like this
> is unnecessary.
> 
> 
> > +/*
> > + * Adjust the ref count of a range of AG blocks.
> > + */
> > +int						/* error */
> > +xfs_reflinkbt_adjust_refcount(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,		/* transaction pointer */
> > +	struct xfs_buf		*agbp,		/* buffer for agf structure */
> > +	xfs_agnumber_t		agno,		/* allocation group number */
> > +	xfs_agblock_t		agbno,		/* start of range */
> > +	xfs_extlen_t		aglen,		/* length of range */
> > +	int			adj)		/* how much to change refcnt */
> 
> 350 line function. Needs factoring. Also needs a comment explaining
> the algorithm. 
> 
> > +{
> > +	struct xfs_btree_cur	*cur;
> > +	int			error;
> > +	int			i, have;
> > +	bool			real_crl;	/* cbno/clen is on disk? */
> > +	xfs_agblock_t		lbno, cbno, rbno;	/* rlextent start */
> > +	xfs_extlen_t		llen, clen, rlen;	/* rlextent length */
> > +	xfs_nlink_t		lnr, cnr, rnr;		/* rlextent refcount */
> 
> "num" is the usual shorthand for "number". And in this case, nr is
> extremely ambiguous: Number of records, number of reflinks, some
> other number? I can't easily tell when I read the code, so the
> variable names need to be better. factoring will certainly help
> here.

"refc" as shorthand for reference count, perhaps?

> > +	xfs_agblock_t		bno;		/* ag bno in the loop */
> > +	xfs_agblock_t		agbend;		/* end agbno of the loop */
> > +	xfs_extlen_t		len;		/* remaining len to add */
> > +	xfs_nlink_t		new_cnr;	/* new refcount */
> > +
> > +	CHECK_AG_NUMBER(mp, agno);
> > +	CHECK_AG_EXTENT(mp, agbno, aglen);
> 
> No real need for these checks - bad agno or extent sizes should have
> been validated long before this.
> 
> > +
> > +	/*
> > +	 * Allocate/initialize a cursor for the by-number freespace btree.
> > +	 */
> > +	cur = xfs_reflinkbt_init_cursor(mp, tp, agbp, agno);
> 
> You can kill that incorrect comment.
> 
> > +
> > +	/*
> > +	 * Split a left rlextent that crosses agbno.
> > +	 */
> 
> These comments need some ascii art displaying the before, current
> extent and after states so it's clear what the intent is. As it is,
> I'd probably split these into "left/right/middle" helper functions,
> as there is no state created by these initial overlap splits
> used later in the function. That would get rid of excessive
> indentation, make the error handling more obvious, etc.

Ok, I'll draw some pictures. :)
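
For illustration only (this is not code from the patch): a minimal
sketch of what one of the suggested split helpers might look like,
with the before/after picture in the comment.  The rl_extent type and
the rl_split_at() name are hypothetical, and a real helper would
update btree records through the cursor rather than fill in plain
structs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified refcount record covering [start, start + len). */
struct rl_extent {
	uint32_t	start;
	uint32_t	len;
	uint32_t	refcnt;
};

/*
 * Split @ext at @agbno if the boundary falls strictly inside it.
 *
 *   before:   |------------ ext ------------|
 *                          ^ agbno
 *   after:    |--- left ---|----- right ----|
 *
 * Returns true and fills @left and @right if a split happened.
 * Returns false if @agbno is outside the extent or sits on its start,
 * in which case there is nothing to split.
 */
bool
rl_split_at(
	const struct rl_extent	*ext,
	uint32_t		agbno,
	struct rl_extent	*left,
	struct rl_extent	*right)
{
	assert(ext->len > 0);

	if (agbno <= ext->start || agbno >= ext->start + ext->len)
		return false;

	/* Both halves keep the original reference count. */
	left->start = ext->start;
	left->len = agbno - ext->start;
	left->refcnt = ext->refcnt;

	right->start = agbno;
	right->len = ext->start + ext->len - agbno;
	right->refcnt = ext->refcnt;
	return true;
}
```

The same helper would presumably serve both the left edge (agbno) and
the right edge (agbend + 1) of the range being adjusted.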

> > +	error = xfs_reflink_lookup_le(cur, agbno, &have);
> > +	if (error)
> > +		goto error0;
> 
> 		goto out_error;
> 
> > +	if (have) {
> 
> if I "have" what?  "found_rec" would be a better name, because then
> the code reads clearly...
> 
> > +	/*
> > +	 * Start iterating the range we're adjusting.  rlextent boundaries
> > +	 * should be at agbno and agbend.
> > +	 */
> 
> Trying to work my way through this loop, but the logic is hard to
> follow. It's hurting my head trying to work out what it is supposed
> to be doing, so I'm going to wait for more comments, ascii art, and
> factoring before really looking at it.

:)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH 01/14] xfs: create a per-AG btree to track reference counts
  2015-07-01 22:52     ` Darrick J. Wong
@ 2015-07-01 23:30       ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2015-07-01 23:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Jul 01, 2015 at 03:52:13PM -0700, Darrick J. Wong wrote:
> On Wed, Jul 01, 2015 at 10:13:06AM +1000, Dave Chinner wrote:
> > On Thu, Jun 25, 2015 at 04:39:16PM -0700, Darrick J. Wong wrote:
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 9cff517..e4954ab 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -446,9 +446,11 @@ xfs_sb_has_compat_feature(
> > >  
> > >  #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
> > >  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> > > +#define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflink btree */
> > >  #define XFS_SB_FEAT_RO_COMPAT_ALL \
> > >  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> > > -		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
> > > +		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> > > +		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
> > 
> > > The XFS_SB_FEAT_RO_COMPAT_REFLINK flag should be added as a separate
> > patch and put last in the series so it is only enabled once
> > everything is complete.
> 
> What if I define XFS_SB_FEAT_RO_COMPAT_REFLINK at the beginning but omit it
> from the XFS_SB_FEAT_RO_COMPAT_ALL definition until the final patch?  That
> should prohibit anyone from using the half-baked feature during a bisect.

Yup, that's what I meant ;)
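
A rough sketch of why that works, assuming the check is a plain mask
test against the set of known bits.  The bit values come from the
quoted xfs_format.h hunk; the shortened macro names and the
has_unknown_ro_compat() helper are hypothetical stand-ins, not the
kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as in the quoted hunk; names shortened for the sketch. */
#define FEAT_RO_COMPAT_FINOBT	(1u << 0)
#define FEAT_RO_COMPAT_RMAPBT	(1u << 1)
#define FEAT_RO_COMPAT_REFLINK	(1u << 2)	/* defined early in the series */

/* Until the final patch, REFLINK is deliberately left out of the mask. */
#define FEAT_RO_COMPAT_ALL \
	(FEAT_RO_COMPAT_FINOBT | FEAT_RO_COMPAT_RMAPBT)

/* True if the superblock advertises ro-compat bits this build rejects. */
bool
has_unknown_ro_compat(uint32_t sb_features_ro_compat)
{
	return (sb_features_ro_compat & ~FEAT_RO_COMPAT_ALL) != 0;
}
```

With the mask in this state, a filesystem carrying the reflink bit
looks like one with an unknown feature at any bisect point before the
last patch, so the half-built code is never exercised read-write.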

> > > +	int			*stat)
> > > +{
> > > +	int			error;
> > > +	xfs_agblock_t		bno;
> > > +
> > > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > > +
> > > +	/* Allocate the new block from the freelist. If we can't, give up.  */
> > > +	error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp,
> > > +				       &bno, 1);
> > > +	if (error) {
> > > +		XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
> > > +		return error;
> > > +	}
> > 
> > Why does the reflink btree use the free list? Why can't it use
> > block allocation like the BMBT tree?
> 
> I'm confused about the intended usage of the AGFL -- the XFS FS structure doc
> says that it's for growing the free space btrees and can't be used for anything
> else, but the rmap btree uses it.

The rmap btree is a "freespace" btree in that it is modified at the
same time the two freespace btrees are modified. It's tracking used
space rather than free space, but from an architectural POV the rmap
btree sits at the same lowest layer as the freespace btree.

Think of it like this: when an extent is allocated, the freespace
btree removal needs to be atomic with the rmap btree insertion so
they remain coherent at all times. Similarly we have the same
situation with extent freeing - removal from the rmap must be atomic
with addition to the freespace btree.

The reflink btree sits a layer above the freespace btrees,
equivalent to the BMBT. That is, when we remove an extent from the
BMBT, we also need to remove the reflink btree reference. Only if
the reference drops to zero does the extent then become free, and we
pass it off to xfs_free_extent()....

> Originally it /did/ use xfs_alloc_vextent(), though it won't be
> difficult to revert.

The way you use EFIs means that it can't be put inside
xfs_alloc_vextent()/xfs_free_extent() - EFIs track movement of
extents from the BMBT to the freespace tree, and so if we now have a
reflink btree in the way, the EFI tracks movement from the reflink
btree to the freespace trees.  i.e. the reflink btree is modified
atomically with the BMBT, not the freespace trees.

Which, really, is a long way of saying that the allocation/freeing
model of reflink btree blocks should be the same as the BMBT, and
the transactional model integrates with the BMBT modifications, not
the freespace btree modifications...

> > > +/*
> > > + * Allocate a new reflink btree cursor.
> > > + */
> > > +struct xfs_btree_cur *			/* new reflink btree cursor */
> > > +xfs_reflinkbt_init_cursor(
> > > +	struct xfs_mount	*mp,		/* file system mount point */
> > > +	struct xfs_trans	*tp,		/* transaction pointer */
> > > +	struct xfs_buf		*agbp,		/* buffer for agf structure */
> > > +	xfs_agnumber_t		agno)		/* allocation group number */
> > 
> > No real need for these comments on the variables. They are redundant
> > as the code documents what they are just fine.
> 
> I was playing monkey-see monkey-do.  Some of the other functions had
> commented args. :)

Yup, that's the old code. For new code we write it in a way that
doesn't require comments like that ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 02/14] libxfs: adjust refcounts in reflink btree
  2015-07-01 23:10     ` Darrick J. Wong
@ 2015-07-01 23:32       ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2015-07-01 23:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Jul 01, 2015 at 04:10:22PM -0700, Darrick J. Wong wrote:
> On Wed, Jul 01, 2015 at 11:06:54AM +1000, Dave Chinner wrote:
> > On Thu, Jun 25, 2015 at 04:39:23PM -0700, Darrick J. Wong wrote:
> > > Provide a function to adjust the reference counts for a range of
> > > blocks in the reflink btree.
.....
> > > +{
> > > +	struct xfs_btree_cur	*cur;
> > > +	int			error;
> > > +	int			i, have;
> > > +	bool			real_crl;	/* cbno/clen is on disk? */
> > > +	xfs_agblock_t		lbno, cbno, rbno;	/* rlextent start */
> > > +	xfs_extlen_t		llen, clen, rlen;	/* rlextent length */
> > > +	xfs_nlink_t		lnr, cnr, rnr;		/* rlextent refcount */
> > 
> > "num" is the usual shorthand for "number". And in this case, nr is
> > extremely ambiguous: Number of records, number of reflinks, some
> > other number? I can't easily tell when I read the code, so the
> > variable names need to be better. factoring will certainly help
> > here.
> 
> "refc" as shorthand for reference count, perhaps?

refcnt is the usual self-documenting shorthand ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag
  2015-07-01 22:59     ` Darrick J. Wong
@ 2015-07-01 23:49       ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2015-07-01 23:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Jul 01, 2015 at 03:59:44PM -0700, Darrick J. Wong wrote:
> On Wed, Jul 01, 2015 at 11:58:43AM +1000, Dave Chinner wrote:
> > I would have thought you only need to check the inode flag here
> > because the only time it will be set is on a reflink enabled
> > filesystem. i.e. that flag being set implies we've already done
> > all the "reflink is supported in this filesystem and it's not a
> > realtime file" checks when setting the flag.
> 
> Sure.  The reason for so many ASSERTs everywhere is to help me check my
> own sanity while cobbling together the first version.  I imagine I could
> eliminate a lot of them, but since they all compile out on !XFS_DEBUG &&
> !XFS_WARN, I didn't think it was a serious problem. :)

Generally it's not, but we try to keep performance of the debug
kernel within a few percent of a non-debug build, just so that it
behaves roughly the same w.r.t. CPU and memory overhead, scalability
and race conditions.

Hence I'd much prefer to see strong validation of the parameters at
the highest layer possible so that they don't need to be constantly
checked in lower layers that have a single context.  e.g.
AG-specific modification functions shouldn't need to check the agno
is valid, as they wouldn't have been called if someone tried to
perform the operation on an invalid agno.  Same goes for block
numbers, etc.

And for printk debugging to tell you how functions are being called
and what operations they are doing, you should replace all that with
tracepoints.  Addition of trace points at the entry and exit of
functions gives sufficient information for verifying this during
debugging, but has almost no overhead in the code or at runtime.
They can also be switched on dynamically in production machines,
which you can't do with compile option debug code like xfs_debug and
ASSERT statements. i.e. tracepoints = good, debug printk = bad ;)
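
As a userspace analogy only (this is not the kernel tracepoint API),
the difference can be sketched as a hook that is always compiled in
but costs a single flag test when switched off.  Every name below is
hypothetical; real tracepoints are declared with the kernel's tracing
macros and toggled at runtime through the tracing filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical runtime switch standing in for a tracepoint's enable knob. */
static bool		trace_enabled;
static unsigned long	trace_hits;

/*
 * Always present in the build, unlike "#ifdef REFLINK_DEBUG" code.
 * When disabled the cost is one flag test and no formatting work.
 */
void
trace_refcount_adjust(
	unsigned int	agno,
	unsigned int	agbno,
	unsigned int	len,
	int		adj)
{
	if (!trace_enabled)
		return;

	trace_hits++;
	fprintf(stderr, "refcount_adjust: agno %u agbno %u len %u adj %d\n",
		agno, agbno, len, adj);
}

/* Flip the switch on a running system; no rebuild needed. */
void
trace_set_enabled(bool on)
{
	trace_enabled = on;
}

unsigned long
trace_hit_count(void)
{
	return trace_hits;
}
```

The dump_cur_loc() style of debugging, by contrast, only exists in
builds where the debug define is set.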

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 03/14] libxfs: support unmapping reflink blocks
  2015-07-01  1:26   ` Dave Chinner
@ 2015-07-02  2:27     ` Darrick J. Wong
  0 siblings, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2015-07-02  2:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Jul 01, 2015 at 11:26:32AM +1000, Dave Chinner wrote:
> On Thu, Jun 25, 2015 at 04:39:30PM -0700, Darrick J. Wong wrote:
> > When we're unmapping blocks from a file, we need to decrease refcounts
> > in the btree and only free blocks if their refcount is 1.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c          |    5 +
> >  fs/xfs/libxfs/xfs_reflink_btree.c |  140 +++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_reflink_btree.h |    4 +
> >  3 files changed, 147 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 057fa9a..3f5e8da 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -45,6 +45,7 @@
> >  #include "xfs_symlink.h"
> >  #include "xfs_attr_leaf.h"
> >  #include "xfs_filestream.h"
> > +#include "xfs_reflink_btree.h"
> >  
> >  
> >  kmem_zone_t		*xfs_bmap_free_item_zone;
> > @@ -4984,8 +4985,8 @@ xfs_bmap_del_extent(
> >  	 * If we need to, add to list of extents to delete.
> >  	 */
> >  	if (do_fx)
> > -		xfs_bmap_add_free(mp, flist, del->br_startblock,
> > -				  del->br_blockcount, ip->i_ino);
> > +		xfs_reflink_bmap_add_free(mp, flist, del->br_startblock,
> > +					  del->br_blockcount, ip->i_ino, tp);
> 
> I think this is the wrong abstraction. I think the code should look
> like this:
> 
> 	if (do_fx) {
> 		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
> 			error = xfs_reflink_del_extent(mp, tp, flist,
> 						del->br_startblock,
> 						del->br_blockcount, ip->i_ino);
> 			if (error)
> 				goto done;
> 		} else
> 			xfs_bmap_add_free()
> 	}
> 
> Because what we are doing is deleting an extent from the reflink
> btree, not adding a freed extent to the "to-be-freed" list.

<nod> Not a great choice of name, I agree...

> 
> 
> > diff --git a/fs/xfs/libxfs/xfs_reflink_btree.c b/fs/xfs/libxfs/xfs_reflink_btree.c
> > index 380ed72..f40ba1f 100644
> > --- a/fs/xfs/libxfs/xfs_reflink_btree.c
> > +++ b/fs/xfs/libxfs/xfs_reflink_btree.c
> 
> Again, xfs_reflink.c
> 
> > @@ -935,3 +935,143 @@ error0:
> >  	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> >  	return error;
> >  }
> > +
> > +/**
> > + * xfs_reflink_bmap_add_free() - release a range of blocks
> > + *
> > + * @mp: XFS mount object
> > + * @flist: List of blocks to be freed at the end of the transaction
> > + * @fsbno: First fs block of the range to release
> > + * @len: Length of range
> > + * @owner: owner of the extent
> > + * @tp: transaction that goes with the free operation
> > + */
> > +int
> > +xfs_reflink_bmap_add_free(
> > +	struct xfs_mount	*mp,		/* mount point structure */
> > +	xfs_bmap_free_t		*flist,		/* list of extents */
> > +	xfs_fsblock_t		fsbno,		/* fs block number of extent */
> > +	xfs_filblks_t		fslen,		/* length of extent */
> > +	uint64_t		owner,		/* extent owner */
> > +	struct xfs_trans	*tp)		/* transaction */
> > +{
> > +	struct xfs_btree_cur	*cur;
> > +	int			error;
> > +	struct xfs_buf		*agbp;
> > +	xfs_agnumber_t		agno;		/* allocation group number */
> > +	xfs_agblock_t		agbno;		/* ag start of range to free */
> > +	xfs_agblock_t		agbend;		/* ag end of range to free */
> > +	xfs_extlen_t		aglen;		/* ag length of range to free */
> > +	int			i, have;
> > +	xfs_agblock_t		lbno;		/* rlextent start */
> > +	xfs_extlen_t		llen;		/* rlextent length */
> > +	xfs_nlink_t		lnr;		/* rlextent refcount */
> > +	xfs_agblock_t		bno;		/* rlext block # in loop */
> > +	xfs_extlen_t		len;		/* rlext length in loop */
> > +	unsigned long long	blocks_freed;
> > +	xfs_fsblock_t		range_fsb;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb)) {
> > +		xfs_bmap_add_free(mp, flist, fsbno, fslen, owner);
> > +		return 0;
> > +	}
> 
> That can be dropped.
> > +
> > +	agno = XFS_FSB_TO_AGNO(mp, fsbno);
> > +	agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
> > +	CHECK_AG_NUMBER(mp, agno);
> > +	ASSERT(fslen < mp->m_sb.sb_agblocks);
> > +	CHECK_AG_EXTENT(mp, agbno, fslen);
> 
> These extent lengths have already been checked. If they are invalid,
> then the extent deletion would have errored out with corruption
> long before we get here.

Ok.

> > +	aglen = fslen;
> > +
> > +	/*
> > +	 * Drop reference counts in the reflink tree.
> > +	 */
> > +	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agbp);
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Grab a rl btree cursor.
> > +	 */
> > +	cur = xfs_reflinkbt_init_cursor(mp, tp, agbp, agno);
> > +	bno = agbno;
> > +	len = aglen;
> > +	agbend = agbno + aglen - 1;
> > +	blocks_freed = 0;
> > +
> > +	/*
> > +	 * Account for a left extent that partially covers our range.
> > +	 */
> > +	error = xfs_reflink_lookup_le(cur, bno, &have);
> > +	if (error)
> > +		goto error0;
> > +	if (have) {
> > +		error = xfs_reflink_get_rec(cur, &lbno, &llen, &lnr, &i);
> > +		if (error)
> > +			goto error0;
> > +		XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, lbno, llen, lnr, error0);
> > +		if (lbno + llen > bno) {
> > +			blocks_freed += min(len, lbno + llen - bno);
> > +			bno += blocks_freed;
> > +			len -= blocks_freed;
> > +		}
> > +	}
> 
> So we unconditionally look up the reflink btree on extent free to
> see if we need to free it, even if the inode has not been reflinked?
> Doesn't this add a lot of overhead to the extent freeing?
> 
> Indeed, why not just mark inodes that have been reflinked (i.e. have
> shared extents) with an on-disk flag so that we know if we need to
> do reflink btree work or not? That way the code fragment above could
> just check an inode flag rather than always calling into this
> function for reflink enabled filesystems....

Yep, the inode flag comes later, though I'm melding it into an earlier
part of the patch...

> 
> > +	while (len > 0) {
> > +		/*
> > +		 * Go find the next rlext.
> > +		 */
> > +		range_fsb = XFS_AGB_TO_FSB(mp, agno, bno);
> > +		error = xfs_btree_increment(cur, 0, &have);
> > +		if (error)
> > +			goto error0;
> > +		if (!have) {
> > +			/*
> > +			 * There's no right rlextent, so free bno to the end.
> > +			 */
> > +			lbno = bno + len;
> > +			llen = 0;
> > +		} else {
> > +			/*
> > +			 * Find the next rlextent.
> > +			 */
> > +			error = xfs_reflink_get_rec(cur, &lbno, &llen,
> > +					&lnr, &i);
> > +			if (error)
> > +				goto error0;
> > +			XFS_WANT_CORRUPTED_RLEXT_GOTO(mp, i, lbno, llen, lnr,
> > +						      error0);
> > +			if (lbno >= bno + len) {
> > +				lbno = bno + len;
> > +				llen = 0;
> > +			}
> > +		}
> > +
> > +		/*
> > +		 * Free everything up to the start of the rlextent and
> > +		 * account for still-mapped blocks.
> > +		 */
> > +		if (lbno - bno > 0) {
> > +			xfs_bmap_add_free(mp, flist, range_fsb, lbno - bno,
> > +					  owner);
> > +			len -= lbno - bno;
> > +			bno += lbno - bno;
> > +		}
> > +		llen = min(llen, agbend + 1 - lbno);
> > +		blocks_freed += llen;
> > +		len -= llen;
> > +		bno += llen;
> > +	}
> > +
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > +
> > +	error = xfs_reflinkbt_adjust_refcount(mp, tp, agbp, agno, agbno, aglen,
> > +					      -1);
> 
> Hmmm - we just walked the btree to determine what extents to
> free, and now we are going to walk the btree again to drop the
> reference counts on shared extents? So every extent that gets freed
> does two walks of the reflink btree regardless of the whether it has
> shared blocks or not?

Yeah, it would be more efficient to bundle the xfs_bmap_add_free loop
into adjust_refcount() so that we only have to make one pass.
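
A minimal sketch of the single-pass idea, using a sorted array in
place of the btree.  It assumes, as the quoted loop appears to, that
only shared extents have records and that blocks outside any record
are singly owned.  It also assumes the records have already been split
so that none straddles the edge of the range.  The names rl_extent and
unmap_one_pass() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared-extent record covering [start, start + len). */
struct rl_extent {
	uint32_t	start;
	uint32_t	len;
	uint32_t	refcnt;
};

/*
 * Walk @recs (sorted, non-overlapping) once over [bno, bno + len).
 * Gaps between records are unshared and counted as freeable; records
 * inside the range lose one reference.  Returns the number of blocks
 * that can be freed.
 */
uint32_t
unmap_one_pass(
	struct rl_extent	*recs,
	size_t			nrecs,
	uint32_t		bno,
	uint32_t		len)
{
	uint32_t		end = bno + len;
	uint32_t		freed = 0;
	size_t			i;

	for (i = 0; i < nrecs && bno < end; i++) {
		struct rl_extent *r = &recs[i];

		if (r->start + r->len <= bno)
			continue;	/* entirely left of the range */
		if (r->start >= end)
			break;		/* entirely right of the range */

		/* Precondition: edge-straddling records were split already. */
		assert(r->start >= bno && r->start + r->len <= end);

		freed += r->start - bno;	/* unshared gap before record */
		r->refcnt--;			/* shared blocks stay allocated */
		bno = r->start + r->len;
	}

	freed += end - bno;	/* unshared tail after the last record */
	return freed;
}
```

The sketch only counts freeable blocks; the real loop would call
xfs_bmap_add_free() for each gap, and would also have to deal with
records whose count drops far enough that they no longer need to be
tracked, which is left out here.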

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag
  2015-07-01  1:58   ` Dave Chinner
  2015-07-01 22:59     ` Darrick J. Wong
@ 2015-07-02  2:32     ` Darrick J. Wong
  2015-07-02  7:07       ` Dave Chinner
  1 sibling, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2015-07-02  2:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Jul 01, 2015 at 11:58:43AM +1000, Dave Chinner wrote:
> On Thu, Jun 25, 2015 at 04:40:16PM -0700, Darrick J. Wong wrote:
> > Gate all the reflink functions (which generally involve an expensive
> > trip to the reflink btree) on an inode flag which is applied to both
> > inodes at reflink time.  This minimizes reflink's impact on non-CoW
> > files.
> 
> Ah, I see you do this reflink inode flag here. This should be one of
> the first patches, not the last.  i.e. the patch series should
> build up all the supporting infrastructure in individual patches
> before adding any of the actual reflink implementation....
> 
> Also, the flag needs to go into the di_flags2 field, as the last
> flag in the di_flags field is reserved for a "more flags" flag if we
> ever need to add more flags to a v2 inode in a v4 filesystem...

It looks to me like di_flags2 only exists in a v3 inode, and v3 inodes
only exist on v5 filesystems.  I don't really mind using di_flags2 for
reflink (on the off chance you want to use bit 15 of di_flags for a
v2 inode) but I'm wondering how is it possible to have di_flags on a v4 fs?

> 
> > +/*
> > + * xfs_is_reflink_inode() -- Decide if an inode needs to be checked for CoW.
> > + *
> > + * @ip: XFS inode
> > + */
> > +bool
> > +xfs_is_reflink_inode(
> > +	struct xfs_inode	*ip)		/* XFS inode */
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return false;
> > +	if (!(ip->i_d.di_flags & XFS_DIFLAG_REFLINK))
> > +		return false;
> > +
> > +	ASSERT(!XFS_IS_REALTIME_INODE(ip));
> > +	return true;
> 
> I would have thought you only need to check the inode flag here
> because the only time it will be set is on a reflink enabled
> filesystem. i.e. that flag being set implies we've already done
> all the "reflink is supported in this filesystem and it's not a
> realtime file" checks when setting the flag.

Yeah, probably these checks are all unnecessary.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag
  2015-07-02  2:32     ` Darrick J. Wong
@ 2015-07-02  7:07       ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2015-07-02  7:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Jul 01, 2015 at 07:32:08PM -0700, Darrick J. Wong wrote:
> On Wed, Jul 01, 2015 at 11:58:43AM +1000, Dave Chinner wrote:
> > On Thu, Jun 25, 2015 at 04:40:16PM -0700, Darrick J. Wong wrote:
> > > Gate all the reflink functions (which generally involve an expensive
> > > trip to the reflink btree) on an inode flag which is applied to both
> > > inodes at reflink time.  This minimizes reflink's impact on non-CoW
> > > files.
> > 
> > Ah, I see you do this reflink inode flag here. This should be one of
> > the first patches, not the last.  i.e. the patch series should
> > build up all the supporting infrastructure in individual patches
> > before adding any of the actual reflink implementation....
> > 
> > Also, the flag needs to go into the di_flags2 field, as the last
> > flag in the di_flags field is reserved for a "more flags" flag if we
> > ever need to add more flags to a v2 inode in a v4 filesystem...
> 
> It looks to me like di_flags2 only exists in a v3 inode, and v3 inodes
> only exist on v5 filesystems.  I don't really mind using di_flags2 for
> reflink (on the off chance you want to use bit 15 of di_flags for a
> v2 inode) but I'm wondering how is it possible to have di_flags on a v4 fs?

You mean how is it possible to have di_flags2 on a v4 fs?

Internally when inodes are read off disk, they are converted to v3
format in memory. i.e. the struct xfs_icdinode is a v3 format
structure. Hence when reading in v2 inodes, the di_flags2 field is
present in the structure and it gets initialised to zero. When we
format the in-memory inode to disk (in xfs_iflush_int()), we don't
ever write the v3 fields back to the on disk inode structure, and
hence the in-memory value of the di_flags2 field doesn't ever get
written to disk.

So while the various v3 inode fields are always present in the
in-memory inode, if di_version = 2 then the v3 fields will be
initialised to zero on read and will never be written back to
disk...
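
Roughly, and with heavily trimmed stand-in structures (these are not
the real xfs_dinode or xfs_icdinode layouts, and the function names
are hypothetical), the read and write-back paths behave like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed on-disk inode. */
struct disk_inode {
	uint8_t		version;	/* 2 or 3 */
	uint16_t	flags;
	uint64_t	flags2;		/* only meaningful when version == 3 */
};

/* The in-memory inode always carries the v3 fields. */
struct incore_inode {
	uint8_t		version;
	uint16_t	flags;
	uint64_t	flags2;
};

void
inode_from_disk(
	struct incore_inode	*to,
	const struct disk_inode	*from)
{
	assert(from->version == 2 || from->version == 3);

	to->version = from->version;
	to->flags = from->flags;
	/* A v2 inode has no flags2 on disk, so it reads as zero. */
	to->flags2 = (from->version == 3) ? from->flags2 : 0;
}

void
inode_to_disk(
	struct disk_inode		*to,
	const struct incore_inode	*from)
{
	assert(from->version == 2 || from->version == 3);

	to->version = from->version;
	to->flags = from->flags;
	/* v3-only fields are written back only for v3 inodes. */
	if (from->version == 3)
		to->flags2 = from->flags2;
}
```

So code can test a di_flags2 bit on any in-memory inode without
checking the version first; on a v2 inode the bit is simply never set
on disk.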

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2015-07-02  7:07 UTC | newest]

Thread overview: 28+ messages
2015-06-25 23:39 [RFC(RAP) 00/14] xfs: add reflink and dedupe support Darrick J. Wong
2015-06-25 23:39 ` [PATCH 01/14] xfs: create a per-AG btree to track reference counts Darrick J. Wong
2015-07-01  0:13   ` Dave Chinner
2015-07-01 22:52     ` Darrick J. Wong
2015-07-01 23:30       ` Dave Chinner
2015-06-25 23:39 ` [PATCH 02/14] libxfs: adjust refcounts in reflink btree Darrick J. Wong
2015-07-01  1:06   ` Dave Chinner
2015-07-01 23:10     ` Darrick J. Wong
2015-07-01 23:32       ` Dave Chinner
2015-06-25 23:39 ` [PATCH 03/14] libxfs: support unmapping reflink blocks Darrick J. Wong
2015-07-01  1:26   ` Dave Chinner
2015-07-02  2:27     ` Darrick J. Wong
2015-06-25 23:39 ` [PATCH 04/14] libxfs: block-mapper changes to support reflink Darrick J. Wong
2015-06-25 23:39 ` [PATCH 05/14] xfs: add reflink functions and ioctl Darrick J. Wong
2015-06-25 23:39 ` [PATCH 06/14] xfs: implement copy-on-write for reflinked blocks Darrick J. Wong
2015-06-25 23:39 ` [PATCH 07/14] xfs: handle directio " Darrick J. Wong
2015-06-25 23:40 ` [PATCH 08/14] xfs: teach fiemap about reflink'd extents Darrick J. Wong
2015-06-25 23:40 ` [PATCH 09/14] xfs: copy-on-write reflinked blocks when zeroing ranges of blocks Darrick J. Wong
2015-06-25 23:40 ` [PATCH 10/14] xfs: minimize impact to non-reflink files via reflink per-inode flag Darrick J. Wong
2015-07-01  1:58   ` Dave Chinner
2015-07-01 22:59     ` Darrick J. Wong
2015-07-01 23:49       ` Dave Chinner
2015-07-02  2:32     ` Darrick J. Wong
2015-07-02  7:07       ` Dave Chinner
2015-06-25 23:40 ` [PATCH 11/14] xfs: emulate the btrfs dedupe extent same ioctl Darrick J. Wong
2015-06-25 23:40 ` [PATCH 12/14] xfs: support XFS_XFLAG_REFLINK (and FS_NOCOW_FL) on reflink filesystems Darrick J. Wong
2015-06-25 23:40 ` [PATCH 13/14] xfs: add reflink btree root when expanding the filesystem Darrick J. Wong
2015-06-25 23:40 ` [PATCH 14/14] xfs: add reflink btree block detection to log recovery Darrick J. Wong
