* [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues
@ 2010-11-05 16:35 wkendall
2010-11-05 16:35 ` [PATCH v2 1/9] xfsrestore: turn off NODECHK wkendall
` (9 more replies)
0 siblings, 10 replies; 24+ messages in thread
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
The first two patches in this series remove dirent limitations in the
current xfsrestore, allowing restore to handle 4 billion directory
entries. Restore would need to map roughly 200 GB of memory to do so,
so don't go thinking this is a good idea. :) (These two patches were
previously submitted to the list, but I've made some changes to them
as suggested by Alex Elder.)
The remaining patches mostly deal with improving restore performance,
most noticeably on dumps containing upwards of 10 million
inodes/dirents. These changes cut the time required to build restore's
node table (an mmap'd representation of the dump's directory
structure) in half, which is very helpful for interactive restores and
for restoring sub-directories. For full restores with millions of
files the overall restore time is dominated by creating inodes and
laying down the data, so the improvements here will be less
noticeable.
For dumps with lots of hard links, these changes fix a bug that was
causing xfsrestore to constantly map and unmap segments of the node
table, leading to horrible performance.
Several of these patches modify the on-disk state information that
xfsrestore leaves around for resuming restores. The final patch adds
versioning information to the on-disk state to detect cases where the
user tries to resume a restore with an incompatible version of
xfsrestore (or an incompatible system).
Bill
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH v2 1/9] xfsrestore: turn off NODECHK
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
@ 2010-11-05 16:35 ` wkendall
2010-11-12 23:23 ` Alex Elder
2010-11-05 16:35 ` [PATCH v2 2/9] xfsrestore: change nrh_t from 32 to 64 bits wkendall
` (8 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
[-- Attachment #1: no_node_check --]
[-- Type: text/plain, Size: 1652 bytes --]
The NODECHK macro should only be enabled as needed for
development/debugging. Having it on limits xfsrestore to dumps
with 268 million directory entries instead of 4 billion.
Signed-off-by: Bill Kendall <wkendall@sgi.com>
---
restore/Makefile | 2 +-
restore/node.c | 4 +++-
2 files changed, 4 insertions(+), 2 deletions(-)
Index: xfsdump-kernel.org/restore/Makefile
===================================================================
--- xfsdump-kernel.org.orig/restore/Makefile
+++ xfsdump-kernel.org/restore/Makefile
@@ -103,7 +103,7 @@ LLDLIBS = $(LIBUUID) $(LIBHANDLE) $(LIBA
LTDEPENDENCIES = $(LIBRMT)
LCFLAGS = -DRESTORE -DRMT -DBASED -DDOSOCKS -DINVCONVFIX -DPIPEINVFIX \
- -DEOMFIX -DSESSCPLT -DWHITEPARSE -DNODECHK -DDIRENTHDR_CHECKSUM \
+ -DEOMFIX -DSESSCPLT -DWHITEPARSE -DDIRENTHDR_CHECKSUM \
-DF_FSSETDM
default: depend $(LTCOMMAND)
Index: xfsdump-kernel.org/restore/node.c
===================================================================
--- xfsdump-kernel.org.orig/restore/node.c
+++ xfsdump-kernel.org/restore/node.c
@@ -42,7 +42,9 @@ extern size_t pgmask;
/* macros for manipulating node handles when handle consistency
* checking is enabled. the upper bits of a handle will be loaded
- * with the node gen count, described below.
+ * with the node gen count, described below. this should not be
+ * used for production code, it cuts into the number of dirents
+ * that xfsrestore can handle.
*/
#define HDLGENCNT 4
#define HDLGENSHIFT ( NBBY * sizeof ( nh_t ) - HDLGENCNT )
* [PATCH v2 2/9] xfsrestore: change nrh_t from 32 to 64 bits
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
2010-11-05 16:35 ` [PATCH v2 1/9] xfsrestore: turn off NODECHK wkendall
@ 2010-11-05 16:35 ` wkendall
2010-11-12 23:24 ` Alex Elder
2010-11-05 16:35 ` [PATCH v2 3/9] xfsrestore: cache path lookups wkendall
` (7 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
[-- Attachment #1: namreg_64bit --]
[-- Type: text/plain, Size: 4334 bytes --]
An nrh_t refers to a byte offset in a file containing all the pathname
components from a dump. At an average filename length of 20
characters, an nrh_t would overflow on a dump containing ~214 million
directory entries. Removing this limitation allows xfsrestore to
handle 4 billion directory entries.
Signed-off-by: Bill Kendall <wkendall@sgi.com>
---
restore/namreg.c | 20 +++++++-------------
restore/namreg.h | 4 ++--
restore/tree.c | 29 ++++++++++++++---------------
3 files changed, 23 insertions(+), 30 deletions(-)
Index: xfsdump-kernel.org/restore/namreg.h
===================================================================
--- xfsdump-kernel.org.orig/restore/namreg.h
+++ xfsdump-kernel.org/restore/namreg.h
@@ -26,8 +26,8 @@
/* nrh_t - handle to a registered name
*/
-typedef size32_t nrh_t;
-#define NRH_NULL SIZE32MAX
+typedef size64_t nrh_t;
+#define NRH_NULL SIZE64MAX
/* namreg_init - creates the name registry. resync is TRUE if the
Index: xfsdump-kernel.org/restore/namreg.c
===================================================================
--- xfsdump-kernel.org.orig/restore/namreg.c
+++ xfsdump-kernel.org/restore/namreg.c
@@ -72,26 +72,20 @@ typedef struct namreg_tran namreg_tran_t
*/
#define CHKBITCNT 2
#define CHKBITSHIFT ( NBBY * sizeof( nrh_t ) - CHKBITCNT )
-#define CHKBITLOMASK ( ( 1 << CHKBITCNT ) - 1 )
+#define CHKBITLOMASK ( ( 1ULL << CHKBITCNT ) - 1 )
#define CHKBITMASK ( CHKBITLOMASK << CHKBITSHIFT )
#define CHKHDLCNT CHKBITSHIFT
-#define CHKHDLMASK ( ( 1 << CHKHDLCNT ) - 1 )
-#define CHKGETBIT( h ) ( ( h >> CHKBITSHIFT ) & CHKBITLOMASK )
-#define CHKGETHDL( h ) ( h & CHKHDLMASK )
-#define CHKMKHDL( c, h ) ( ( ( c << CHKBITSHIFT ) & CHKBITMASK ) \
+#define CHKHDLMASK ( ( 1ULL << CHKHDLCNT ) - 1 )
+#define CHKGETBIT( h ) ( ( (h) >> CHKBITSHIFT ) & CHKBITLOMASK )
+#define CHKGETHDL( h ) ( (h) & CHKHDLMASK )
+#define CHKMKHDL( c, h ) ( ( ( (c) << CHKBITSHIFT ) & CHKBITMASK ) \
| \
- ( h & CHKHDLMASK ))
+ ( (h) & CHKHDLMASK ))
#define HDLMAX ( ( off64_t )CHKHDLMASK )
#else /* NAMREGCHK */
-#define HDLMAX ( ( ( off64_t )1 \
- << \
- ( ( off64_t )NBBY \
- * \
- ( off64_t )sizeof( nrh_t ))) \
- - \
- ( off64_t )2 ) /* 2 to avoid NRH_NULL */
+#define HDLMAX ( NRH_NULL - 1 )
#endif /* NAMREGCHK */
Index: xfsdump-kernel.org/restore/tree.c
===================================================================
--- xfsdump-kernel.org.orig/restore/tree.c
+++ xfsdump-kernel.org/restore/tree.c
@@ -166,19 +166,18 @@ typedef struct tran tran_t;
#define NODESZ 48
struct node {
- xfs_ino_t n_ino; /* 8 8 ino */
- nrh_t n_nrh; /* 4 12 handle to name in name registry */
- dah_t n_dah; /* 4 16 handle to directory attributes */
- nh_t n_hashh; /* 4 20 hash array */
- nh_t n_parh; /* 4 24 parent */
- nh_t n_sibh; /* 4 28 sibling list */
- nh_t n_sibprevh; /* 4 32 prev sibling list - dbl link list */
- nh_t n_cldh; /* 4 36 children list */
- nh_t n_lnkh; /* 4 40 hard link list */
- gen_t n_gen; /* 2 42 generation count mod 0x10000 */
- u_char_t n_flags; /* 1 43 action and state flags */
- u_char_t n_nodehkbyte; /* 1 44 given to node abstraction */
- int32_t pad; /* 4 48 padding to 8 byte boundary */
+ xfs_ino_t n_ino; /* 8 8 ino */
+ nrh_t n_nrh; /* 8 16 handle to name in name registry */
+ dah_t n_dah; /* 4 20 handle to directory attributes */
+ nh_t n_hashh; /* 4 24 hash array */
+ nh_t n_parh; /* 4 28 parent */
+ nh_t n_sibh; /* 4 32 sibling list */
+ nh_t n_sibprevh; /* 4 36 prev sibling list - dbl link list */
+ nh_t n_cldh; /* 4 40 children list */
+ nh_t n_lnkh; /* 4 44 hard link list */
+ gen_t n_gen; /* 2 46 generation count mod 0x10000 */
+ u_char_t n_flags; /* 1 47 action and state flags */
+ u_char_t n_nodehkbyte; /* 1 48 given to node abstraction */
};
typedef struct node node_t;
@@ -3393,9 +3392,9 @@ Node_free( nh_t *nhp )
namreg_del( np->n_nrh );
np->n_nrh = NRH_NULL;
}
- if ( np->n_dah != NRH_NULL ) {
+ if ( np->n_dah != DAH_NULL ) {
dirattr_del( np->n_dah );
- np->n_dah = NRH_NULL;
+ np->n_dah = DAH_NULL;
}
np->n_flags = 0;
np->n_parh = NH_NULL;
* [PATCH v2 3/9] xfsrestore: cache path lookups
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
2010-11-05 16:35 ` [PATCH v2 1/9] xfsrestore: turn off NODECHK wkendall
2010-11-05 16:35 ` [PATCH v2 2/9] xfsrestore: change nrh_t from 32 to 64 bits wkendall
@ 2010-11-05 16:35 ` wkendall
2010-11-12 23:25 ` Alex Elder
2010-11-05 16:35 ` [PATCH v2 4/9] xfsrestore: mmap dirent names for faster lookups wkendall
` (6 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
[-- Attachment #1: node2path_caching --]
[-- Type: text/plain, Size: 3895 bytes --]
In order to resolve a pathname, xfsrestore must start from an inode
number (from the dump) and recurse up the directory entry tree that it
has constructed. Each level of recursion requires a seek and read to
get the name of the dirent, and possibly an mmap of a section of the
directory entry tree if it is not already mapped (and, in that case,
possibly a munmap of another section). It's quite common to resolve
pathnames in the same directory consecutively, so simply caching the
parent directory's pathname from the previous lookup saves quite a bit
of overhead.
Signed-off-by: Bill Kendall <wkendall@sgi.com>
---
restore/tree.c | 41 ++++++++++++++++++++++++++++++++++++-----
1 file changed, 36 insertions(+), 5 deletions(-)
Index: xfsdump-kernel.org/restore/tree.c
===================================================================
--- xfsdump-kernel.org.orig/restore/tree.c
+++ xfsdump-kernel.org/restore/tree.c
@@ -236,6 +236,14 @@ struct link_iter_context {
};
typedef struct link_iter_context link_iter_context_t;
+/* used for caching parent pathname from previous Node2path result
+ */
+struct path_cache {
+ nh_t nh;
+ intgen_t len;
+ char buf[MAXPATHLEN];
+};
+typedef struct path_cache path_cache_t;
/* declarations of externally defined global symbols *************************/
@@ -254,7 +262,8 @@ static nh_t Node_alloc( xfs_ino_t ino,
static void Node_free( nh_t *nhp );
static node_t * Node_map( nh_t nh );
static void Node_unmap( nh_t nh, node_t **npp );
-static intgen_t Node2path_recurse( nh_t nh, char *buf, intgen_t bufsz );
+static intgen_t Node2path_recurse( nh_t nh, char *buf,
+ intgen_t bufsz, intgen_t level );
static void adopt( nh_t parh, nh_t cldh, nrh_t nrh );
static nrh_t disown( nh_t cldh );
static void selsubtree( nh_t nh, bool_t sensepr );
@@ -3435,7 +3444,7 @@ Node2path( nh_t nh, char *path, char *er
{
intgen_t remainingcnt;
path[ 0 ] = 0; /* in case root node passed in */
- remainingcnt = Node2path_recurse( nh, path, MAXPATHLEN );
+ remainingcnt = Node2path_recurse( nh, path, MAXPATHLEN, 0 );
if ( remainingcnt <= 0 ) {
node_t *np = Node_map( nh );
xfs_ino_t ino = np->n_ino;
@@ -3459,13 +3468,15 @@ Node2path( nh_t nh, char *path, char *er
* works because the buffer size is secretly 2 * MAXPATHLEN.
*/
static intgen_t
-Node2path_recurse( nh_t nh, char *buf, intgen_t bufsz )
+Node2path_recurse( nh_t nh, char *buf, intgen_t bufsz, intgen_t level )
{
+ static path_cache_t cache = { NH_NULL, 0, "" };
node_t *np;
nh_t parh;
xfs_ino_t ino;
gen_t gen;
nrh_t nrh;
+ char *oldbuf;
intgen_t oldbufsz;
intgen_t namelen;
@@ -3475,6 +3486,13 @@ Node2path_recurse( nh_t nh, char *buf, i
return bufsz;
}
+ /* if we have a cache hit, no need to recurse any further
+ */
+ if ( nh == cache.nh && bufsz > cache.len ) {
+ strcpy( buf, cache.buf );
+ return bufsz - cache.len;
+ }
+
/* extract useful node members
*/
np = Node_map( nh );
@@ -3486,8 +3504,9 @@ Node2path_recurse( nh_t nh, char *buf, i
/* build path to parent
*/
+ oldbuf = buf;
oldbufsz = bufsz;
- bufsz = Node2path_recurse( parh, buf, bufsz ); /* RECURSION */
+ bufsz = Node2path_recurse( parh, buf, bufsz, level+1 ); /* RECURSION */
if ( bufsz <= 0 ) {
return bufsz;
}
@@ -3517,10 +3536,22 @@ Node2path_recurse( nh_t nh, char *buf, i
ASSERT( namelen > 0 );
}
- /* return remaining buffer size
+ /* update remaining buffer size
*/
bufsz -= namelen;
ASSERT( bufsz + MAXPATHLEN > 0 );
+
+ /* update the cache if we're the target's parent
+ * (and the pathname is not too long)
+ */
+ if ( level == 1 && bufsz > 0 ) {
+ cache.nh = nh;
+ strcpy( cache.buf, oldbuf );
+ cache.len = oldbufsz - bufsz;
+ }
+
+ /* return remaining buffer size
+ */
return bufsz;
}
* [PATCH v2 4/9] xfsrestore: mmap dirent names for faster lookups
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
` (2 preceding siblings ...)
2010-11-05 16:35 ` [PATCH v2 3/9] xfsrestore: cache path lookups wkendall
@ 2010-11-05 16:35 ` wkendall
2010-11-12 23:25 ` Alex Elder
2010-11-05 16:35 ` [PATCH v2 5/9] xfsrestore: cleanup node allocation wkendall
` (5 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
[-- Attachment #1: namreg_map --]
[-- Type: text/plain, Size: 7130 bytes --]
Pathname resolution in xfsrestore is about 4x faster if the file
containing dirent names ("namreg") is memory mapped. If xfsrestore is
unable to map the file (e.g., due to virtual memory constraints), it
falls back to the existing seek-and-read approach.

The file is mapped after all directory entries have been written to
the "namreg" file. If the caller tries to add additional entries after
the file has been mapped, it will be unmapped and restore will fall
back to seek-and-read lookups.
Signed-off-by: Bill Kendall <wkendall@sgi.com>
---
restore/content.c | 2
restore/namreg.c | 124 ++++++++++++++++++++++++++++++++++++++++--------------
restore/namreg.h | 6 +-
3 files changed, 98 insertions(+), 34 deletions(-)
Index: xfsdump-kernel.org/restore/content.c
===================================================================
--- xfsdump-kernel.org.orig/restore/content.c
+++ xfsdump-kernel.org/restore/content.c
@@ -3026,7 +3026,7 @@ applydirdump( drive_t *drivep,
return rv;
}
- if ((rv = namreg_flush()) != RV_OK) {
+ if ((rv = namreg_flush(BOOL_TRUE)) != RV_OK) {
return rv;
}
Index: xfsdump-kernel.org/restore/namreg.c
===================================================================
--- xfsdump-kernel.org.orig/restore/namreg.c
+++ xfsdump-kernel.org/restore/namreg.c
@@ -57,6 +57,7 @@ typedef struct namreg_pers namreg_pers_t
struct namreg_tran {
char *nt_pathname;
int nt_fd;
+ char *nt_map;
bool_t nt_at_endpr;
size_t nt_off;
char nt_buf[NAMREG_BUFSIZE];
@@ -96,6 +97,9 @@ extern size_t pgsz;
/* forward declarations of locally defined static functions ******************/
+static inline bool_t namreg_is_mapped();
+static void namreg_map();
+static void namreg_unmap();
/* definition of locally defined global variables ****************************/
@@ -263,6 +267,10 @@ namreg_add( char *name, size_t namelen )
ASSERT( ntp );
ASSERT( npp );
+ if ( namreg_is_mapped() ) {
+ namreg_unmap();
+ }
+
/* make sure file pointer is positioned to append
*/
if ( ! ntp->nt_at_endpr ) {
@@ -280,7 +288,7 @@ namreg_add( char *name, size_t namelen )
}
if (ntp->nt_off + namelen + 1 > sizeof(ntp->nt_buf)) {
- if (namreg_flush() != RV_OK) {
+ if (namreg_flush(BOOL_FALSE) != RV_OK) {
return NRH_NULL;
}
}
@@ -327,7 +335,7 @@ namreg_del( nrh_t nrh )
}
rv_t
-namreg_flush( void )
+namreg_flush( bool_t done_adding )
{
ssize_t nwritten;
@@ -335,6 +343,10 @@ namreg_flush( void )
*/
assert( ntp );
+ if ( namreg_is_mapped() ) {
+ namreg_unmap();
+ }
+
if (ntp->nt_off) {
/* write the accumulated name strings.
@@ -356,6 +368,11 @@ namreg_flush( void )
}
ntp->nt_off = 0;
}
+
+ if (done_adding) {
+ namreg_map();
+ }
+
return RV_OK;
}
@@ -367,6 +384,7 @@ namreg_get( nrh_t nrh,
off64_t newoff;
intgen_t nread;
size_t len;
+ char *in_bufp;
static char read_buf[256];
/* long enough for the longest allowed name (255), plus 1 for length */
#ifdef NAMREGCHK
@@ -403,41 +421,51 @@ namreg_get( nrh_t nrh,
lock( );
- if ( ntp->nt_at_endpr && ntp->nt_off ) {
- if (namreg_flush() != RV_OK) {
+ if ( namreg_is_mapped() ) {
+
+ in_bufp = ntp->nt_map + newoff - NAMREG_PERS_SZ;
+
+ } else {
+
+ if ( ntp->nt_at_endpr && ntp->nt_off ) {
+ if (namreg_flush(BOOL_FALSE) != RV_OK) {
+ unlock( );
+ return -3;
+ }
+ }
+
+ /* seek to the name
+ */
+ newoff = lseek64( ntp->nt_fd, newoff, SEEK_SET );
+ if ( newoff == ( off64_t )-1 ) {
unlock( );
+ mlog( MLOG_NORMAL, _(
+ "lseek of namreg failed: %s\n"),
+ strerror( errno ));
return -3;
}
- }
+ ntp->nt_at_endpr = BOOL_FALSE;
- /* seek to the name
- */
- newoff = lseek64( ntp->nt_fd, newoff, SEEK_SET );
- if ( newoff == ( off64_t )-1 ) {
- unlock( );
- mlog( MLOG_NORMAL, _(
- "lseek of namreg failed: %s\n"),
- strerror( errno ));
- return -3;
- }
+ /* read the name length and the name itself in one call
+ * NOTE: assumes read_buf is big enough for the longest
+ * allowed name (255 chars) plus one byte for length.
+ */
+ nread = read( ntp->nt_fd, ( void * )read_buf, sizeof(read_buf) );
+ if ( nread <= 0 ) {
+ unlock( );
+ mlog( MLOG_NORMAL, _(
+ "read of namreg failed: %s (nread = %d)\n"),
+ strerror( errno ),
+ nread );
+ return -3;
+ }
- /* read the name length and the name itself in one call
- * NOTE: assumes read_buf is big enough for the longest
- * allowed name (255 chars) plus one byte for length.
- */
- nread = read( ntp->nt_fd, ( void * )read_buf, sizeof(read_buf) );
- if ( nread <= 0 ) {
- unlock( );
- mlog( MLOG_NORMAL, _(
- "read of namreg failed: %s (nread = %d)\n"),
- strerror( errno ),
- nread );
- return -3;
+ in_bufp = read_buf;
}
/* deal with a short caller-supplied buffer
*/
- len = ( size_t )read_buf[0];
+ len = ( size_t )in_bufp[0];
if ( bufsz < len + 1 ) {
unlock( );
return -1;
@@ -445,7 +473,7 @@ namreg_get( nrh_t nrh,
/* copy the name into the caller-supplied buffer.
*/
- strncpy(bufp, read_buf+1, len);
+ strncpy(bufp, in_bufp+1, len);
#ifdef NAMREGCHK
@@ -460,8 +488,6 @@ namreg_get( nrh_t nrh,
/* null-terminate the string if room
*/
bufp[ len ] = 0;
-
- ntp->nt_at_endpr = BOOL_FALSE;
unlock( );
@@ -470,3 +496,39 @@ namreg_get( nrh_t nrh,
/* definition of locally defined static functions ****************************/
+
+static inline bool_t
+namreg_is_mapped()
+{
+ return ntp->nt_map ? BOOL_TRUE : BOOL_FALSE;
+}
+
+static void
+namreg_map()
+{
+ ntp->nt_map = ( char * ) mmap_autogrow(
+ npp->np_appendoff - NAMREG_PERS_SZ,
+ ntp->nt_fd,
+ NAMREG_PERS_SZ );
+
+ /* it's okay if this fails, just fall back to (the much slower)
+ * seek-and-read lookups.
+ */
+ if ( ntp->nt_map == ( char * )-1 ) {
+ mlog( MLOG_DEBUG, "failed to map namreg: %s\n",
+ strerror( errno ) );
+ ntp->nt_map = NULL;
+ }
+}
+
+static void
+namreg_unmap()
+{
+ /* we really shouldn't be here, we were told that no more names would
+ * be added to the namreg. rather than tip over, just fall back to
+ * seek-and-read lookups.
+ */
+ munmap( ( void * )ntp->nt_map, npp->np_appendoff - NAMREG_PERS_SZ );
+ ntp->nt_map = NULL;
+ mlog( MLOG_DEBUG, "namreg lookups resorting to seek-and-read\n" );
+}
Index: xfsdump-kernel.org/restore/namreg.h
===================================================================
--- xfsdump-kernel.org.orig/restore/namreg.h
+++ xfsdump-kernel.org/restore/namreg.h
@@ -49,9 +49,11 @@ extern nrh_t namreg_add( char *name, siz
*/
extern void namreg_del( nrh_t nrh );
-/* namreg_flush - flush namreg I/O buffer. Returns 0 if successful.
+/* namreg_flush - flush namreg I/O buffer. pass BOOL_TRUE
+ * if done adding names to the registry.
+ * Returns 0 if successful.
*/
-extern rv_t namreg_flush( void );
+extern rv_t namreg_flush( bool_t done_adding );
/* namreg_get - retrieves the name identified by the index.
* fills the buffer with the null-terminated name from the registry.
* [PATCH v2 5/9] xfsrestore: cleanup node allocation
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
` (3 preceding siblings ...)
2010-11-05 16:35 ` [PATCH v2 4/9] xfsrestore: mmap dirent names for faster lookups wkendall
@ 2010-11-05 16:35 ` wkendall
2010-11-15 20:38 ` Alex Elder
2010-11-05 16:35 ` [PATCH v2 6/9] xfsrestore: fix node table setup wkendall
` (4 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
[-- Attachment #1: node_alloc_cleanup --]
[-- Type: text/plain, Size: 6667 bytes --]
Simplify the node allocation code. The current code takes some
number of nodes from a new segment and links them into the
freelist whenever the freelist is depleted. There's no reason
to put the new nodes on the freelist; we can just allocate the
next available new node as needed. This also saves a trip through
win_map/win_unmap when there are no nodes on the freelist (the
common case).
Signed-off-by: Bill Kendall <wkendall@sgi.com>
---
restore/node.c | 144 +++++++++++++++------------------------------------------
1 file changed, 39 insertions(+), 105 deletions(-)
Index: xfsdump-kernel.org/restore/node.c
===================================================================
--- xfsdump-kernel.org.orig/restore/node.c
+++ xfsdump-kernel.org/restore/node.c
@@ -111,10 +111,6 @@ extern size_t pgmask;
*/
#define NODESPERSEGMIN 1000000
-/* how many nodes to place on free list at a time
- */
-#define VIRGSACRMAX 8192 /* fudged: 8192 48 byte nodes (24 or 96 pages) */
-
/* a node is identified internally by its index into the backing store.
* this index is the offset of the node into the segmented portion of
* backing store (follows the abstraction header page) divided by the
@@ -380,72 +376,41 @@ node_alloc( void )
nix_t nix;
u_char_t *p;
nh_t nh;
- register nix_t *linkagep;
#ifdef NODECHK
- register u_char_t *hkpp;
- register u_char_t gen;
- register u_char_t unq;
+ u_char_t *hkpp;
+ u_char_t gen = 0;
+ u_char_t unq;
#endif /* NODECHK */
- /* if free list is depleted, map in a new window at the
- * end of backing store. put all nodes on free list.
- * initialize the gen count to the node index, and the unique
- * pattern to the free pattern.
- */
- if ( node_hdrp->nh_freenix == NIX_NULL ) {
- nix_t virgbegnix; /* abs. nix of first node in virg seg */
- nix_t virgendnix; /* abs. nix of next node after last */
- nix_t sacrcnt; /* how many virgins to put on free list */
- nix_t sacrnix;
-
- ASSERT( node_hdrp->nh_virgrelnix
- <
- ( nix_t )node_hdrp->nh_nodesperseg );
- virgbegnix = OFF2NIX( node_hdrp->nh_virgsegreloff )
- +
- node_hdrp->nh_virgrelnix;
- virgendnix =
- OFF2NIX( ( node_hdrp->nh_virgsegreloff
- +
- ( off64_t )node_hdrp->nh_segsz ) );
-#ifdef TREE_DEBUG
- mlog(MLOG_DEBUG | MLOG_TREE,
- "node_alloc(): create freelist - "
- "virg_begin=%lld virg_end=%lld\n",
- virgbegnix, virgendnix);
-#endif
- ASSERT( virgendnix > virgbegnix );
- sacrcnt = min( VIRGSACRMAX, virgendnix - virgbegnix );
- ASSERT( sacrcnt >= 1 );
- p = 0; /* keep lint happy */
- win_map( NIX2OFF( virgbegnix ), ( void ** )&p );
+ /* if there's a node available on the free list, use it.
+ * otherwise get the next one from the current virgin segment,
+ * or allocate a new virgin segment if the current one is depleted.
+ */
+ if ( node_hdrp->nh_freenix != NIX_NULL ) {
+ nix_t *linkagep;
+ off64_t off;
+
+ nix = node_hdrp->nh_freenix;
+
+ off = NIX2OFF( nix );
+ win_map( off, ( void ** )&p );
if (p == NULL)
- return NH_NULL;
- node_hdrp->nh_freenix = virgbegnix;
- for ( sacrnix = virgbegnix
- ;
- sacrnix < virgbegnix + sacrcnt - 1
- ;
- p += node_hdrp->nh_nodesz, sacrnix++ ) {
- linkagep = ( nix_t * )p;
- *linkagep = sacrnix + 1;
-#ifdef NODECHK
- hkpp = p + node_hdrp->nh_nodehkix;
- gen = ( u_char_t )sacrnix;
- *hkpp = ( u_char_t )HKPMKHKP( ( size_t )gen,
- NODEUNQFREE );
-#endif /* NODECHK */
- }
- linkagep = ( nix_t * )p;
- *linkagep = NIX_NULL;
+ return NH_NULL;
#ifdef NODECHK
hkpp = p + node_hdrp->nh_nodehkix;
- gen = ( u_char_t )sacrnix;
- *hkpp = HKPMKHKP( gen, NODEUNQFREE );
+ gen = ( u_char_t )( HKPGETGEN( *p ) + ( u_char_t )1 );
+ unq = HKPGETUNQ( *hkpp );
+ ASSERT( unq != NODEUNQALCD );
+ ASSERT( unq == NODEUNQFREE );
#endif /* NODECHK */
- node_hdrp->nh_virgrelnix += sacrcnt;
- win_unmap( node_hdrp->nh_virgsegreloff, ( void ** )&p );
+ /* adjust the free list */
+ linkagep = ( nix_t * )p;
+ node_hdrp->nh_freenix = *linkagep;
+
+ win_unmap( off, ( void ** )&p );
+
+ } else {
if ( node_hdrp->nh_virgrelnix
>=
( nix_t )node_hdrp->nh_nodesperseg ) {
@@ -468,14 +433,11 @@ node_alloc( void )
mlog( MLOG_DEBUG,
"pre-growing new node array segment at %lld "
"size %lld\n",
- node_hdrp->nh_firstsegoff
- +
- node_hdrp->nh_virgsegreloff
- +
- ( off64_t )node_hdrp->nh_segsz,
+ node_hdrp->nh_firstsegoff +
+ node_hdrp->nh_virgsegreloff,
( off64_t )node_hdrp->nh_segsz );
rval = ftruncate64( node_fd,
- node_hdrp->nh_firstsegoff
+ node_hdrp->nh_firstsegoff
+
node_hdrp->nh_virgsegreloff
+
@@ -484,60 +446,32 @@ node_alloc( void )
mlog( MLOG_NORMAL | MLOG_WARNING | MLOG_TREE, _(
"unable to autogrow node segment %llu: "
"%s (%d)\n"),
- node_hdrp->nh_virgsegreloff
- /
- ( off64_t )node_hdrp->nh_segsz,
+ OFF2NIX(node_hdrp->nh_virgsegreloff),
strerror( errno ),
errno );
}
}
- }
- /* map in window containing node at top of free list,
- * and adjust free list.
- */
- nix = node_hdrp->nh_freenix;
-#ifdef TREE_DEBUG
- mlog(MLOG_DEBUG | MLOG_TREE,
- "node_alloc(): win_map(%llu) and get head from node freelist\n",
- NIX2OFF(nix));
-#endif
- win_map( NIX2OFF( nix ), ( void ** )&p );
- if (p == NULL)
- return NH_NULL;
-#ifdef NODECHK
- hkpp = p + node_hdrp->nh_nodehkix;
- unq = HKPGETUNQ( *hkpp );
- ASSERT( unq != NODEUNQALCD );
- ASSERT( unq == NODEUNQFREE );
-#endif /* NODECHK */
- linkagep = ( nix_t * )p;
- node_hdrp->nh_freenix = *linkagep;
-
- /* clean the node
- */
- memset( ( void * )p, 0, node_hdrp->nh_nodesz );
+ nix = OFF2NIX( node_hdrp->nh_virgsegreloff )
+ +
+ node_hdrp->nh_virgrelnix++;
+ }
/* build a handle for node
*/
ASSERT( nix <= NIX_MAX );
#ifdef NODECHK
+ win_map( NIX2OFF( nix ), ( void ** )&p );
+ if (p == NULL)
+ abort();
hkpp = p + ( int )node_hdrp->nh_nodehkix;
- gen = ( u_char_t )( HKPGETGEN( *p ) + ( u_char_t )1 );
nh = HDLMKHDL( gen, nix );
*hkpp = HKPMKHKP( gen, NODEUNQALCD );
+ win_unmap( NIX2OFF( nix ), ( void ** )&p );
#else /* NODECHK */
nh = ( nh_t )nix;
#endif /* NODECHK */
- /* unmap window
- */
-#ifdef TREE_DEBUG
- mlog(MLOG_DEBUG | MLOG_TREE,
- "node_alloc(): win_unmap(%llu)\n", NIX2OFF(nix));
-#endif
- win_unmap( NIX2OFF( nix ), ( void ** )&p );
-
return nh;
}
* [PATCH v2 6/9] xfsrestore: fix node table setup
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
` (4 preceding siblings ...)
2010-11-05 16:35 ` [PATCH v2 5/9] xfsrestore: cleanup node allocation wkendall
@ 2010-11-05 16:35 ` wkendall
2010-11-15 20:38 ` Alex Elder
2010-11-05 16:35 ` [PATCH v2 7/9] xfsrestore: make node lookup more efficient wkendall
` (3 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
[-- Attachment #1: winmap_max_fix --]
[-- Type: text/plain, Size: 10314 bytes --]
The node table setup code unintentionally limits the amount of virtual
memory that will be used for the node table. Rather than setting a
limit based on the remaining virtual memory, the node table is limited
to the amount of memory it thinks it will need based on the dump's
inode count. But the node table stores directory entries, not inodes,
and so dumps with lots of hard links end up thrashing, constantly
mapping and unmapping segments of the node table to stay within the
self-imposed virtual memory limit.
This patch also changes the node table to ensure that each segment
holds a power-of-two number of nodes. Profiling revealed that while
building the node table, 50% of the restore CPU time was spent doing
division and modulo to convert a node index into its segment index and
offset within the segment. This change prepares the node table for a
later patch which will change the lookup mechanism to use shift and
bitwise-and.
Also don't bother passing the estimated segment table size to the
window abstraction. It has to deal with resizing the table anyway and
can choose a reasonable starting size.
Signed-off-by: Bill Kendall <wkendall@sgi.com>
---
restore/node.c | 125 ++++++++++++++++++++++++++++++++++-----------------------
restore/win.c | 7 +--
restore/win.h | 3 -
3 files changed, 80 insertions(+), 55 deletions(-)
Index: xfsdump-kernel.org/restore/node.c
===================================================================
--- xfsdump-kernel.org.orig/restore/node.c
+++ xfsdump-kernel.org/restore/node.c
@@ -97,19 +97,14 @@ extern size_t pgmask;
#else /* NODECHK */
-#define NIX_MAX ( ( ( off64_t )1 \
- << \
- ( ( off64_t )NBBY \
- * \
- ( off64_t )sizeof( nh_t ))) \
- - \
- ( off64_t )2 ) /* 2 to avoid NH_NULL */
+#define NIX_MAX ( NH_NULL - 1 )
#endif /* NODECHK */
/* window constraints
*/
-#define NODESPERSEGMIN 1000000
+#define NODESPERSEG_MIN 1048576
+#define WINMAP_MIN 4
/* a node is identified internally by its index into the backing store.
* this index is the offset of the node into the segmented portion of
@@ -135,17 +130,14 @@ struct node_hdr {
/* index of byte in each node the user has reserved
* for use by me
*/
- size_t nh_nodesperseg;
+ nh_t nh_nodesperseg;
/* an integral number of internal nodes must fit into a
* segment
*/
- size_t nh_segsz;
+ size64_t nh_segsz;
/* the backing store is partitoned into segment, which
* can be mapped into VM windows by the win abstraction
*/
- size64_t nh_segtblsz;
- /* the estimated size of the entire segment table.
- */
size_t nh_winmapmax;
/* maximum number of windows which can be mapped
*/
@@ -191,13 +183,13 @@ node_init( intgen_t fd,
size64_t vmsz,
size64_t dirs_nondirs_cnt )
{
- size64_t nodesz;
- size64_t winmap_mem;
+ size_t nodesz;
size64_t segsz;
- size64_t segtablesz;
- size64_t nodesperseg;
- size64_t minsegsz;
- size64_t winmapmax;
+ nh_t nodesperseg;
+ size_t max_segments;
+ size_t winmapmax;
+ size_t segcount;
+ intgen_t segixshift;
intgen_t rval;
/* sanity checks
@@ -218,36 +210,76 @@ node_init( intgen_t fd,
*/
nodesz = ( usrnodesz + nodealignsz - 1 ) & ~( nodealignsz - 1 );
-#define WINMAP_MAX 20 /* maximum number of windows to use */
-#define WINMAP_MIN 4 /* minimum number of windows to use */
-#define HARDLINK_FUDGE 1.2 /* approx 1.2 hard links per file */
-
- /* Calculate the expected size of the segment table using the number
- * of dirs and non-dirs. Since we don't know how many hard-links
- * there will be, scale the size upward using HARDLINK_FUDGE.
+ /* Calculate the node table params based on the number of inodes in the
+ * dump, since that's all we know. Ideally we'd base this on the number
+ * of dirents in the dump instead as there's a node per dirent.
+ *
+ * Due to virtual memory constraints and the fact that we don't know
+ * the final node table size, we can't guarantee that the entire node
+ * table can be mapped at once. Instead segments will be mapped using a
+ * window abstraction. Some operations require WINMAP_MIN nodes to be
+ * referenced at a time, therefore we must ensure that this many
+ * segments fit in virtual memory.
+ *
+ * nodesperseg must be a power of two. Earlier versions did not enforce
+ * this, and profiling showed that nearly 50% of cpu time during the
+ * node table construction was consumed doing division and modulo to
+ * convert an nh_t to a segment index and node offset. By making
+ * nodesperseg a power of two we can use shift and bitwise-and instead.
+ *
+ * Each segment must hold an integral number of nodes and be an integral
+ * number of pages. The power-of-two requirement ensures this except when
+ * the smallest allowed segsz is pgsz * nodesz (i.e., nodesperseg ==
+ * pgsz). However this is of no consequence as we enforce a larger
+ * minimum nodesperseg (NODESPERSEG_MIN) anyway in order to place a
+ * reasonable cap on the max number of segments.
*/
- segtablesz = ( (size64_t)(HARDLINK_FUDGE * (double)dirs_nondirs_cnt) * nodesz);
+ ASSERT( NODESPERSEG_MIN >= pgsz );
+
+ if ( vmsz < WINMAP_MIN * NODESPERSEG_MIN * nodesz ) {
+ mlog( MLOG_NORMAL | MLOG_ERROR, _(
+ "not enough virtual memory for node abstraction: "
+ "remaining-vmsz=%llu need=%llu\n"),
+ vmsz, WINMAP_MIN * NODESPERSEG_MIN * nodesz );
+ return BOOL_FALSE;
+ }
- /* Figure out how much memory is available for use by winmaps, and
- * use that to pick an appropriate winmapmax, segsz, and nodesperseg,
- * the goal being that if at all possible we want the entire segment
- * table to be mapped so that we aren't constantly mapping and
- * unmapping winmaps. There must be at least WINMAP_MIN winmaps
- * because references can be held on more than one winmap at the
- * same time. More winmaps are generally better to reduce the
- * number of nodes that are unmapped if unmapping does occur.
+ /* This is checked as nodes are allocated as well (remember that
+ * dirs_nondirs_cnt may be less than the number of nodes/dirents).
+ * Checking this here prevents potential overflow in the logic below.
*/
+ if ( dirs_nondirs_cnt > NIX_MAX ) {
+ mlog( MLOG_NORMAL | MLOG_ERROR, _(
+ "dump contains %llu inodes, restore can only handle %llu\n"),
+ dirs_nondirs_cnt, NIX_MAX );
+ return BOOL_FALSE;
+ }
+
+ for ( winmapmax = 0, segcount = 1; winmapmax < WINMAP_MIN; segcount <<= 1 ) {
+
+ nodesperseg = max( dirs_nondirs_cnt / segcount, NODESPERSEG_MIN );
+
+ /* nodesperseg must be a power of 2 */
+ for ( segixshift = 0;
+ ( 1ULL << segixshift ) < nodesperseg;
+ segixshift++ );
- minsegsz = pgsz * nodesz; /* must be pgsz and nodesz multiple */
- winmap_mem = min(vmsz, segtablesz);
- segsz = (((winmap_mem / WINMAP_MAX) + minsegsz - 1) / minsegsz) * minsegsz;
- segsz = max(segsz, minsegsz);
+ /* rounding up to a power of 2 may have caused overflow */
+ if ( ( 1ULL << segixshift ) > NIX_MAX )
+ segixshift--;
- nodesperseg = segsz / nodesz;
+ nodesperseg = 1UL << segixshift;
- winmapmax = min(WINMAP_MAX, vmsz / segsz);
- winmapmax = max(winmapmax, WINMAP_MIN);
+ max_segments = 1UL << ( NBBY * sizeof(nh_t) - segixshift );
+
+ segsz = nodesperseg * nodesz;
+
+ /* max number of segments that will fit in virtual memory,
+ * capped at the max possible number of segments
+ */
+ winmapmax = min( vmsz / segsz, max_segments );
+ }
/* map the abstraction header
*/
@@ -272,7 +304,6 @@ node_init( intgen_t fd,
node_hdrp->nh_nodesz = nodesz;
node_hdrp->nh_nodehkix = nodehkix;
node_hdrp->nh_segsz = segsz;
- node_hdrp->nh_segtblsz = segtablesz;
node_hdrp->nh_winmapmax = winmapmax;
node_hdrp->nh_nodesperseg = nodesperseg;
node_hdrp->nh_nodealignsz = nodealignsz;
@@ -309,7 +340,6 @@ node_init( intgen_t fd,
win_init( fd,
node_hdrp->nh_firstsegoff,
segsz,
- segtablesz,
winmapmax );
/* announce the results
@@ -317,14 +347,12 @@ node_init( intgen_t fd,
mlog( MLOG_DEBUG | MLOG_TREE,
"node_init:"
" vmsz = %llu (0x%llx)"
- " segsz = %u (0x%x)"
- " segtblsz = %llu (0x%llx)"
+ " segsz = %llu (0x%llx)"
" nodesperseg = %u (0x%x)"
- " winmapmax = %llu (0x%llx)"
+ " winmapmax = %lu (0x%lx)"
"\n",
vmsz, vmsz,
segsz, segsz,
- segtablesz, segtablesz,
nodesperseg, nodesperseg,
winmapmax, winmapmax );
@@ -364,7 +392,6 @@ node_sync( intgen_t fd, off64_t off )
win_init( fd,
node_hdrp->nh_firstsegoff,
node_hdrp->nh_segsz,
- node_hdrp->nh_segtblsz,
node_hdrp->nh_winmapmax );
return BOOL_TRUE;
Index: xfsdump-kernel.org/restore/win.c
===================================================================
--- xfsdump-kernel.org.orig/restore/win.c
+++ xfsdump-kernel.org/restore/win.c
@@ -80,7 +80,7 @@ struct tran {
off64_t t_firstoff;
/* offset of first seg within backing store (for mmap( ))
*/
- size_t t_segsz;
+ size64_t t_segsz;
/* backing store segment / window size
*/
size_t t_winmax;
@@ -147,8 +147,7 @@ win_getnum_mmaps(void)
void
win_init( intgen_t fd,
off64_t firstoff,
- size_t segsz,
- size64_t segtablesz,
+ size64_t segsz,
size_t winmax )
{
/* validate parameters
@@ -167,7 +166,7 @@ win_init( intgen_t fd,
tranp->t_segsz = segsz;
tranp->t_winmax = winmax;
- tranp->t_segmaplen = (size_t)(segtablesz / segsz) + 1;
+ tranp->t_segmaplen = SEGMAP_INCR;
tranp->t_segmap = (win_t **)
calloc( tranp->t_segmaplen, sizeof(win_t *) );
ASSERT( tranp->t_segmap );
Index: xfsdump-kernel.org/restore/win.h
===================================================================
--- xfsdump-kernel.org.orig/restore/win.h
+++ xfsdump-kernel.org/restore/win.h
@@ -25,8 +25,7 @@
*/
void win_init( intgen_t fd,
off64_t rngoff, /* offset into file of windowing */
- size_t winsz, /* window size */
- size64_t segtablesz, /* estimate of segment table size */
+ size64_t winsz, /* window size */
size_t wincntmax ); /* max number of windows to manage */
/* supply a pointer to the portion of the file identified by off.
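The sizing loop added to node_init() above can be exercised in isolation. A minimal standalone sketch, with the sample inputs (4 GiB of VM, 100M inodes, 280-byte nodes) being assumed values rather than anything from the patch:

```c
#include <stdint.h>

#define NODESPERSEG_MIN 1048576ull
#define WINMAP_MIN      4ull

/* keep halving the target nodes-per-segment (by doubling segcount)
 * until at least WINMAP_MIN segments fit in the available virtual
 * memory; nodesperseg is rounded up to a power of two so lookups
 * can later use shift and bitwise-and.
 */
static uint64_t pick_node_table_params(uint64_t vmsz, uint64_t inode_cnt,
                                       uint64_t nodesz,
                                       uint64_t *nodesperseg_out,
                                       uint64_t *segsz_out)
{
    uint64_t winmapmax = 0, nodesperseg = 0, segsz = 0;

    for (uint64_t segcount = 1; winmapmax < WINMAP_MIN; segcount <<= 1) {
        nodesperseg = inode_cnt / segcount;
        if (nodesperseg < NODESPERSEG_MIN)
            nodesperseg = NODESPERSEG_MIN;

        /* round up to a power of two */
        int shift = 0;
        while ((1ull << shift) < nodesperseg)
            shift++;
        nodesperseg = 1ull << shift;

        segsz = nodesperseg * nodesz;
        winmapmax = vmsz / segsz;   /* segments that fit in VM */
    }
    *nodesperseg_out = nodesperseg;
    *segsz_out = segsz;
    return winmapmax;
}
```

With 4 GiB of VM and 100M inodes the loop settles on 2^21 nodes per segment, trading segment size against the number of simultaneously mappable windows.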
* [PATCH v2 7/9] xfsrestore: make node lookup more efficient
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
2010-11-05 16:35 ` [PATCH v2 6/9] xfsrestore: fix node table setup wkendall
@ 2010-11-05 16:35 ` wkendall
2010-11-15 20:38 ` Alex Elder
2010-11-05 16:35 ` [PATCH v2 8/9] xfsrestore: remove nix_t wkendall
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
When converting an nh_t to a segment index and relative node index,
use shift and bitwise-and instead of division and modulo. Results show
this is about 50% faster than the current method.
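The replacement can be sketched in isolation. A minimal version of the two helpers, with SEGIXSHIFT as an assumed sample value rather than the one node_init() computes:

```c
#include <stdint.h>

/* illustrative stand-ins for the patch's nh2segix()/nh2relnix():
 * with nodesperseg a power of two (1 << SEGIXSHIFT), the segment
 * index is the high bits of the handle and the relative node index
 * the low bits, so a shift and a bitwise-and replace the division
 * and modulo.
 */
typedef uint32_t nh_t;

#define SEGIXSHIFT 20                       /* nodesperseg == 1 << 20 */
#define RELNIXMASK ((1u << SEGIXSHIFT) - 1) /* low bits select the node */

static inline uint32_t nh2segix(nh_t nh)  { return nh >> SEGIXSHIFT; }
static inline nh_t     nh2relnix(nh_t nh) { return nh & RELNIXMASK; }
```

Both forms are arithmetically identical when nodesperseg is a power of two; the win is purely that the compiler emits a shift and an and rather than an integer divide on the node table's hottest path.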
Signed-off-by: Bill Kendall <wkendall@sgi.com>
---
restore/node.c | 150 ++++++++++++++++++++++++++++-----------------------------
restore/win.c | 36 +++----------
restore/win.h | 6 +-
3 files changed, 87 insertions(+), 105 deletions(-)
Index: xfsdump-kernel.org/restore/node.c
===================================================================
--- xfsdump-kernel.org.orig/restore/node.c
+++ xfsdump-kernel.org/restore/node.c
@@ -115,8 +115,6 @@ extern size_t pgmask;
*/
typedef off64_t nix_t;
#define NIX_NULL OFF64MAX
-#define NIX2OFF( nix ) ( nix * ( nix_t )node_hdrp->nh_nodesz )
-#define OFF2NIX( noff ) ( noff / ( nix_t )node_hdrp->nh_nodesz )
/* reserve the firstpage for a header to save persistent context
*/
@@ -151,20 +149,15 @@ struct node_hdr {
off64_t nh_firstsegoff;
/* offset into backing store of the first segment
*/
- off64_t nh_virgsegreloff;
- /* offset (relative to beginning of first segment) into
- * backing store of segment containing one or
- * more virgin nodes. relative to beginning of segmented
- * portion of backing store. bumped only when all of the
- * nodes in the segment have been placed on the free list.
- * when bumped, nh_virginrelnix is simultaneously set back
- * to zero.
- */
- nix_t nh_virgrelnix;
- /* relative node index within the segment identified by
- * nh_virgsegreloff of the next node not yet placed on the
- * free list. never reaches nh_nodesperseg: instead set
- * to zero and bump nh_virgsegreloff by one segment.
+ nh_t nh_virgnix;
+ /* node handle of next virgin node
+ */
+ intgen_t nh_segixshift;
+ /* bitshift used to extract the segment index from an nh_t
+ */
+ nh_t nh_relnixmask;
+ /* bitmask used to extract the node index from an nh_t
+ * (relative to the start of a segment)
*/
};
@@ -173,6 +166,12 @@ typedef struct node_hdr node_hdr_t;
static node_hdr_t *node_hdrp;
static intgen_t node_fd;
+/* forward declarations of locally defined static functions ******************/
+static inline size_t nh2segix( nh_t nh );
+static inline nh_t nh2relnix( nh_t nh );
+static inline void node_map_internal( nh_t nh, void **pp );
+static inline void node_unmap_internal( nh_t nh, void **pp, bool_t freepr );
+
/* ARGSUSED */
bool_t
node_init( intgen_t fd,
@@ -190,7 +189,7 @@ node_init( intgen_t fd,
size_t winmapmax;
size_t segcount;
intgen_t segixshift;
- intgen_t rval;
+ nh_t relnixmask;
/* sanity checks
*/
@@ -281,6 +280,8 @@ node_init( intgen_t fd,
winmapmax = min( vmsz / segsz, max_segments );
}
+ relnixmask = nodesperseg - 1;
+
/* map the abstraction header
*/
ASSERT( ( NODE_HDRSZ & pgmask ) == 0 );
@@ -309,32 +310,14 @@ node_init( intgen_t fd,
node_hdrp->nh_nodealignsz = nodealignsz;
node_hdrp->nh_freenix = NIX_NULL;
node_hdrp->nh_firstsegoff = off + ( off64_t )NODE_HDRSZ;
- node_hdrp->nh_virgsegreloff = 0;
- node_hdrp->nh_virgrelnix = 0;
+ node_hdrp->nh_virgnix = 0;
+ node_hdrp->nh_segixshift = segixshift;
+ node_hdrp->nh_relnixmask = relnixmask;
/* save transient context
*/
node_fd = fd;
- /* autogrow the first segment
- */
- mlog( MLOG_DEBUG,
- "pre-growing new node array segment at %lld "
- "size %lld\n",
- node_hdrp->nh_firstsegoff,
- ( off64_t )node_hdrp->nh_segsz );
- rval = ftruncate64( node_fd,
- node_hdrp->nh_firstsegoff
- +
- ( off64_t )node_hdrp->nh_segsz );
- if ( rval ) {
- mlog( MLOG_NORMAL | MLOG_ERROR | MLOG_TREE, _(
- "unable to autogrow first node segment: %s (%d)\n"),
- strerror( errno ),
- errno );
- return BOOL_FALSE;
- }
-
/* initialize the window abstraction
*/
win_init( fd,
@@ -415,12 +398,10 @@ node_alloc( void )
*/
if ( node_hdrp->nh_freenix != NIX_NULL ) {
nix_t *linkagep;
- off64_t off;
nix = node_hdrp->nh_freenix;
- off = NIX2OFF( nix );
- win_map( off, ( void ** )&p );
+ node_map_internal( nix, ( void ** )&p );
if (p == NULL)
return NH_NULL;
#ifdef NODECHK
@@ -435,66 +416,59 @@ node_alloc( void )
linkagep = ( nix_t * )p;
node_hdrp->nh_freenix = *linkagep;
- win_unmap( off, ( void ** )&p );
+ node_unmap_internal( nix, ( void ** )&p, BOOL_TRUE );
} else {
- if ( node_hdrp->nh_virgrelnix
- >=
- ( nix_t )node_hdrp->nh_nodesperseg ) {
+ if ( nh2relnix( node_hdrp->nh_virgnix ) == 0 ) {
+ /* need to start a new virgin segment */
intgen_t rval;
- ASSERT( node_hdrp->nh_virgrelnix
- ==
- ( nix_t )node_hdrp->nh_nodesperseg );
- ASSERT( node_hdrp->nh_virgsegreloff
+ off64_t new_seg_off =
+ node_hdrp->nh_firstsegoff +
+ ( off64_t )nh2segix( node_hdrp->nh_virgnix ) *
+ ( off64_t )node_hdrp->nh_segsz;
+
+ ASSERT( new_seg_off
<=
OFF64MAX - ( off64_t )node_hdrp->nh_segsz );
-#ifdef TREE_DEBUG
- mlog(MLOG_DEBUG | MLOG_TREE,
- "node_alloc(): runout of nodes for freelist in "
- "this segment - nodes used = %lld\n",
- node_hdrp->nh_virgrelnix);
-#endif
- node_hdrp->nh_virgsegreloff +=
- ( off64_t )node_hdrp->nh_segsz;
- node_hdrp->nh_virgrelnix = 0;
mlog( MLOG_DEBUG,
"pre-growing new node array segment at %lld "
"size %lld\n",
- node_hdrp->nh_firstsegoff +
- node_hdrp->nh_virgsegreloff,
+ new_seg_off,
( off64_t )node_hdrp->nh_segsz );
rval = ftruncate64( node_fd,
- node_hdrp->nh_firstsegoff
- +
- node_hdrp->nh_virgsegreloff
+ new_seg_off
+
( off64_t )node_hdrp->nh_segsz );
if ( rval ) {
mlog( MLOG_NORMAL | MLOG_WARNING | MLOG_TREE, _(
- "unable to autogrow node segment %llu: "
+ "unable to autogrow node segment %lu: "
"%s (%d)\n"),
- OFF2NIX(node_hdrp->nh_virgsegreloff),
+ nh2segix( node_hdrp->nh_virgnix ),
strerror( errno ),
errno );
}
}
- nix = OFF2NIX( node_hdrp->nh_virgsegreloff )
- +
- node_hdrp->nh_virgrelnix++;
+ nix = node_hdrp->nh_virgnix++;
}
/* build a handle for node
*/
- ASSERT( nix <= NIX_MAX );
+ if ( nix > NIX_MAX ) {
+ mlog( MLOG_NORMAL | MLOG_ERROR, _(
+ "dump contains too many dirents, "
+ "restore can only handle %llu\n"),
+ NIX_MAX );
+ return NH_NULL;
+ }
#ifdef NODECHK
- win_map( NIX2OFF( nix ), ( void ** )&p );
+ node_map_internal( nix , ( void ** )&p );
if (p == NULL)
abort();
hkpp = p + ( int )node_hdrp->nh_nodehkix;
nh = HDLMKHDL( gen, nix );
*hkpp = HKPMKHKP( gen, NODEUNQALCD );
- win_unmap( NIX2OFF( nix ), ( void ** )&p );
+ node_unmap_internal( nix, ( void ** )&p, BOOL_FALSE );
#else /* NODECHK */
nh = ( nh_t )nix;
#endif /* NODECHK */
@@ -502,6 +476,30 @@ node_alloc( void )
return nh;
}
+static inline size_t
+nh2segix( nh_t nh )
+{
+ return nh >> node_hdrp->nh_segixshift;
+}
+
+static inline nh_t
+nh2relnix( nh_t nh )
+{
+ return nh & node_hdrp->nh_relnixmask;
+}
+
+static inline void
+node_map_internal( nh_t nh, void **pp )
+{
+ win_map( nh2segix( nh ), pp );
+ if ( *pp != NULL ) {
+ nh_t relnix = nh2relnix( nh );
+ *pp = ( void * )( ( char * )( *pp ) +
+ ( ( off64_t )relnix *
+ node_hdrp->nh_nodesz ) );
+ }
+}
+
void *
node_map( nh_t nh )
{
@@ -530,7 +528,7 @@ node_map( nh_t nh )
/* map in
*/
p = 0; /* keep lint happy */
- win_map( NIX2OFF( nix ), ( void ** )&p );
+ node_map_internal( nix, ( void ** )&p );
if (p == NULL)
return NULL;
@@ -547,8 +545,8 @@ node_map( nh_t nh )
}
/* ARGSUSED */
-static void
-node_unmap_internal( nh_t nh, void **pp, bool_t internalpr )
+static inline void
+node_unmap_internal( nh_t nh, void **pp, bool_t freepr )
{
nix_t nix;
#ifdef NODECHK
@@ -578,7 +576,7 @@ node_unmap_internal( nh_t nh, void **pp,
nodegen = HKPGETGEN( hkp );
ASSERT( nodegen == hdlgen );
nodeunq = HKPGETUNQ( hkp );
- if ( ! internalpr ) {
+ if ( ! freepr ) {
ASSERT( nodeunq != NODEUNQFREE );
ASSERT( nodeunq == NODEUNQALCD );
} else {
@@ -589,7 +587,7 @@ node_unmap_internal( nh_t nh, void **pp,
/* unmap the window containing the node
*/
- win_unmap( NIX2OFF( nix ), pp ); /* zeros *pp */
+ win_unmap( nh2segix( nix ), pp ); /* zeros *pp */
}
void
Index: xfsdump-kernel.org/restore/win.c
===================================================================
--- xfsdump-kernel.org.orig/restore/win.c
+++ xfsdump-kernel.org/restore/win.c
@@ -177,31 +177,16 @@ win_init( intgen_t fd,
}
void
-win_map( off64_t off, void **pp )
+win_map( size_t segix, void **pp )
{
- size_t offwithinseg;
- size_t segix;
off64_t segoff;
win_t *winp;
CRITICAL_BEGIN();
- /* calculate offset within segment
- */
- offwithinseg = ( size_t )( off % ( off64_t )tranp->t_segsz );
-
- /* calculate segment index
- */
- segix = (size_t)( off / ( off64_t )tranp->t_segsz );
-
- /* calculate offset of segment
- */
- segoff = off - ( off64_t )offwithinseg;
-
#ifdef TREE_DEBUG
mlog(MLOG_DEBUG | MLOG_TREE | MLOG_NOLOCK,
- "win_map(off=%lld,addr=%x): off within = %llu, segoff = %lld\n",
- off, pp, offwithinseg, segoff);
+ "win_map(segix=%lu,addr=%p)\n", segix, pp);
#endif
/* resize the array if necessary */
if ( segix >= tranp->t_segmaplen )
@@ -243,7 +228,7 @@ win_map( off64_t off, void **pp )
ASSERT( ! winp->w_nextp );
}
winp->w_refcnt++;
- *pp = ( void * )( ( char * )( winp->w_p ) + offwithinseg );
+ *pp = winp->w_p;
CRITICAL_END();
return;
}
@@ -287,6 +272,10 @@ win_map( off64_t off, void **pp )
return;
}
+ /* calculate offset of segment
+ */
+ segoff = segix * ( off64_t )tranp->t_segsz;
+
/* map the window
*/
ASSERT( tranp->t_segsz >= 1 );
@@ -320,7 +309,7 @@ win_map( off64_t off, void **pp )
if (error == ENOMEM && tranp->t_lruheadp) {
mlog( MLOG_NORMAL | MLOG_ERROR,
_("win_map(): try to select a different win_t\n"));
- win_map(off, pp);
+ win_map(segix, pp);
return;
}
*pp = NULL;
@@ -331,23 +320,18 @@ win_map( off64_t off, void **pp )
winp->w_refcnt++;
tranp->t_segmap[winp->w_segix] = winp;
- *pp = ( void * )( ( char * )( winp->w_p ) + offwithinseg );
+ *pp = winp->w_p;
CRITICAL_END();
}
void
-win_unmap( off64_t off, void **pp )
+win_unmap( size_t segix, void **pp )
{
- size_t segix;
win_t *winp;
CRITICAL_BEGIN();
- /* calculate segment index
- */
- segix = (size_t)( off / ( off64_t )tranp->t_segsz );
-
/* verify window mapped
*/
ASSERT( segix < tranp->t_segmaplen );
Index: xfsdump-kernel.org/restore/win.h
===================================================================
--- xfsdump-kernel.org.orig/restore/win.h
+++ xfsdump-kernel.org/restore/win.h
@@ -28,15 +28,15 @@ void win_init( intgen_t fd,
size64_t winsz, /* window size */
size_t wincntmax ); /* max number of windows to manage */
-/* supply a pointer to the portion of the file identified by off.
+/* supply a pointer to the portion of the file identified by segix.
*/
-void win_map( off64_t off, /* file offset relative to rngoff */
+void win_map( size_t segix, /* segment index to be mapped */
void **pp ); /* returns pointer by reference */
/* invalidate the pointer previously supplied. SIDE-EFFECT: zeros
* by reference the caller's pointer.
*/
-void win_unmap( off64_t off, /* must match win_map param */
+void win_unmap( size_t segix, /* must match win_map param */
void **pp ); /* ptr generated by win_map: ZEROED */
/*
* [PATCH v2 8/9] xfsrestore: remove nix_t
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
2010-11-05 16:35 ` [PATCH v2 7/9] xfsrestore: make node lookup more efficient wkendall
@ 2010-11-05 16:35 ` wkendall
2010-11-12 23:25 ` Alex Elder
2010-11-05 16:35 ` [PATCH v2 9/9] xfsrestore: check for compatible xfsrestore wkendall
2010-11-12 23:25 ` [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues Alex Elder
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
The nix_t type is a 64-bit version of a (32-bit) nh_t, created "to
facilitate conversion to a 64-bit offset." At this point there's only
one case where an nh_t is converted to an off64_t, so just do an
explicit cast in that case and remove the nix_t.
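That one remaining conversion can be sketched as follows; nh_t and the node size are stand-ins for the types in restore/node.c, and nh2off is a hypothetical name for the inline computation the patch does at its single call site:

```c
#include <stdint.h>

/* widen the 32-bit node handle explicitly before the multiply so
 * the byte offset is computed in 64 bits; without the cast the
 * product would wrap at 32 bits for large handles.
 */
typedef uint32_t nh_t;

static int64_t nh2off(nh_t nh, uint32_t nodesz)
{
    return (int64_t)nh * (int64_t)nodesz;
}
```

This is the whole reason nix_t existed; once every other use is a plain handle, a cast at the one offset computation is simpler than carrying a parallel 64-bit type.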
Signed-off-by: Bill Kendall <wkendall@sgi.com>
---
restore/node.c | 112 ++++++++++++++++++++++-----------------------------------
1 file changed, 44 insertions(+), 68 deletions(-)
Index: xfsdump-kernel.org/restore/node.c
===================================================================
--- xfsdump-kernel.org.orig/restore/node.c
+++ xfsdump-kernel.org/restore/node.c
@@ -50,22 +50,21 @@ extern size_t pgmask;
#define HDLGENSHIFT ( NBBY * sizeof ( nh_t ) - HDLGENCNT )
#define HDLGENLOMASK ( ( 1 << HDLGENCNT ) - 1 )
#define HDLGENMASK ( HDLGENLOMASK << HDLGENSHIFT )
-#define HDLNIXCNT HDLGENSHIFT
-#define HDLNIXMASK ( ( 1 << HDLNIXCNT ) - 1 )
+#define HDLMASK ( ( 1 << HDLGENSHIFT ) - 1 )
#define HDLGETGEN( h ) ( ( u_char_t ) \
( ( ( int )h >> HDLGENSHIFT ) \
& \
HDLGENLOMASK ))
-#define HDLGETNIX( h ) ( ( nix_t )( ( int )h & HDLNIXMASK ))
+#define HDLGETNHDL( h ) ( ( nh_t )( ( int )h & HDLMASK ))
#define HDLMKHDL( g, n ) ( ( nh_t )( ( ( ( int )g << HDLGENSHIFT )\
& \
HDLGENMASK ) \
| \
- ( ( int )n & HDLNIXMASK )))
-#define NIX_MAX ( ( off64_t )HDLNIXMASK )
+ ( ( int )n & HDLMASK )))
+#define NH_MAX ( HDLMASK )
/* the housekeeping byte of each node will hold two check fields:
- * a gen count, initialized to the node ix and incremented each time a node
+ * a gen count, initialized to 0 and incremented each time a node
* is allocated, to catch re-use of stale handles; and unique pattern, to
* differentiate a valid node from random memory. two unique patterns will
* be used; one when the node is on the free list, another when it is
@@ -97,7 +96,7 @@ extern size_t pgmask;
#else /* NODECHK */
-#define NIX_MAX ( NH_NULL - 1 )
+#define NH_MAX ( NH_NULL - 1 )
#endif /* NODECHK */
@@ -106,16 +105,6 @@ extern size_t pgmask;
#define NODESPERSEG_MIN 1048576
#define WINMAP_MIN 4
-/* a node is identified internally by its index into the backing store.
- * this index is the offset of the node into the segmented portion of
- * backing store (follows the abstraction header page) divided by the
- * size of a node. a special index is reserved to represent the null
- * index. a type is defined for node index (nix_t). it is a 64 bit
- * unsigned to facilitate conversion from index to 64 bit offset.
- */
-typedef off64_t nix_t;
-#define NIX_NULL OFF64MAX
-
/* reserve the firstpage for a header to save persistent context
*/
#define NODE_HDRSZ pgsz
@@ -142,15 +131,14 @@ struct node_hdr {
size_t nh_nodealignsz;
/* user's constraint on node alignment
*/
- nix_t nh_freenix;
- /* index into backing store of first node of singly-linked
- * list of free nodes
+ nh_t nh_freenh;
+ /* handle of first node of singly-linked list of free nodes
*/
off64_t nh_firstsegoff;
/* offset into backing store of the first segment
*/
- nh_t nh_virgnix;
- /* node handle of next virgin node
+ nh_t nh_virgnh;
+ /* handle of next virgin node
*/
intgen_t nh_segixshift;
/* bitshift used to extract the segment index from an nh_t
@@ -248,10 +236,10 @@ node_init( intgen_t fd,
* dirs_nondirs_cnt may be less than the number of nodes/dirents).
* Checking this here prevents potential overflow in the logic below.
*/
- if ( dirs_nondirs_cnt > NIX_MAX ) {
+ if ( dirs_nondirs_cnt > NH_MAX ) {
mlog( MLOG_NORMAL | MLOG_ERROR, _(
- "dump contains %llu inodes, restore can only handle %llu\n"),
- dirs_nondirs_cnt, NIX_MAX );
+ "dump contains %llu inodes, restore can only handle %u\n"),
+ dirs_nondirs_cnt, NH_MAX );
return BOOL_FALSE;
}
@@ -265,7 +253,7 @@ node_init( intgen_t fd,
segixshift++ );
/* rounding up to a power of 2 may have caused overflow */
- if ( ( 1ULL << segixshift ) > NIX_MAX )
+ if ( ( 1ULL << segixshift ) > NH_MAX )
segixshift--;
nodesperseg = 1UL << segixshift;
@@ -308,9 +296,9 @@ node_init( intgen_t fd,
node_hdrp->nh_winmapmax = winmapmax;
node_hdrp->nh_nodesperseg = nodesperseg;
node_hdrp->nh_nodealignsz = nodealignsz;
- node_hdrp->nh_freenix = NIX_NULL;
+ node_hdrp->nh_freenh = NH_NULL;
node_hdrp->nh_firstsegoff = off + ( off64_t )NODE_HDRSZ;
- node_hdrp->nh_virgnix = 0;
+ node_hdrp->nh_virgnh = 0;
node_hdrp->nh_segixshift = segixshift;
node_hdrp->nh_relnixmask = relnixmask;
@@ -383,7 +371,6 @@ node_sync( intgen_t fd, off64_t off )
nh_t
node_alloc( void )
{
- nix_t nix;
u_char_t *p;
nh_t nh;
#ifdef NODECHK
@@ -396,12 +383,12 @@ node_alloc( void )
* otherwise get the next one from the current virgin segment,
* or allocate a new virgin segment if the current one is depleted.
*/
- if ( node_hdrp->nh_freenix != NIX_NULL ) {
- nix_t *linkagep;
+ if ( node_hdrp->nh_freenh != NH_NULL ) {
+ nh_t *linkagep;
- nix = node_hdrp->nh_freenix;
+ nh = node_hdrp->nh_freenh;
- node_map_internal( nix, ( void ** )&p );
+ node_map_internal( nh, ( void ** )&p );
if (p == NULL)
return NH_NULL;
#ifdef NODECHK
@@ -413,18 +400,18 @@ node_alloc( void )
#endif /* NODECHK */
/* adjust the free list */
- linkagep = ( nix_t * )p;
- node_hdrp->nh_freenix = *linkagep;
+ linkagep = ( nh_t * )p;
+ node_hdrp->nh_freenh = *linkagep;
- node_unmap_internal( nix, ( void ** )&p, BOOL_TRUE );
+ node_unmap_internal( nh, ( void ** )&p, BOOL_TRUE );
} else {
- if ( nh2relnix( node_hdrp->nh_virgnix ) == 0 ) {
+ if ( nh2relnix( node_hdrp->nh_virgnh ) == 0 ) {
/* need to start a new virgin segment */
intgen_t rval;
off64_t new_seg_off =
node_hdrp->nh_firstsegoff +
- ( off64_t )nh2segix( node_hdrp->nh_virgnix ) *
+ ( off64_t )nh2segix( node_hdrp->nh_virgnh ) *
( off64_t )node_hdrp->nh_segsz;
ASSERT( new_seg_off
@@ -443,34 +430,32 @@ node_alloc( void )
mlog( MLOG_NORMAL | MLOG_WARNING | MLOG_TREE, _(
"unable to autogrow node segment %lu: "
"%s (%d)\n"),
- nh2segix( node_hdrp->nh_virgnix ),
+ nh2segix( node_hdrp->nh_virgnh ),
strerror( errno ),
errno );
}
}
- nix = node_hdrp->nh_virgnix++;
+ nh = node_hdrp->nh_virgnh++;
}
/* build a handle for node
*/
- if ( nix > NIX_MAX ) {
+ if ( nh > NH_MAX ) {
mlog( MLOG_NORMAL | MLOG_ERROR, _(
"dump contains too many dirents, "
"restore can only handle %llu\n"),
- NIX_MAX );
+ NH_MAX );
return NH_NULL;
}
#ifdef NODECHK
- node_map_internal( nix , ( void ** )&p );
+ node_map_internal( nh , ( void ** )&p );
if (p == NULL)
abort();
hkpp = p + ( int )node_hdrp->nh_nodehkix;
- nh = HDLMKHDL( gen, nix );
+ nh = HDLMKHDL( gen, nh );
*hkpp = HKPMKHKP( gen, NODEUNQALCD );
- node_unmap_internal( nix, ( void ** )&p, BOOL_FALSE );
-#else /* NODECHK */
- nh = ( nh_t )nix;
+ node_unmap_internal( nh, ( void ** )&p, BOOL_FALSE );
#endif /* NODECHK */
return nh;
@@ -503,7 +488,6 @@ node_map_internal( nh_t nh, void **pp )
void *
node_map( nh_t nh )
{
- nix_t nix;
u_char_t *p;
#ifdef NODECHK
register u_char_t hkp;
@@ -518,17 +502,15 @@ node_map( nh_t nh )
*/
#ifdef NODECHK
hdlgen = HDLGETGEN( nh );
- nix = HDLGETNIX( nh );
-#else /* NODECHK */
- nix = ( nix_t )nh;
+ nh = HDLGETNHDL( nh );
#endif /* NODECHK */
- ASSERT( nix <= NIX_MAX );
+ ASSERT( nh <= NH_MAX );
/* map in
*/
p = 0; /* keep lint happy */
- node_map_internal( nix, ( void ** )&p );
+ node_map_internal( nh, ( void ** )&p );
if (p == NULL)
return NULL;
@@ -548,7 +530,6 @@ node_map( nh_t nh )
static inline void
node_unmap_internal( nh_t nh, void **pp, bool_t freepr )
{
- nix_t nix;
#ifdef NODECHK
register u_char_t hkp;
register u_char_t hdlgen;
@@ -564,12 +545,10 @@ node_unmap_internal( nh_t nh, void **pp,
*/
#ifdef NODECHK
hdlgen = HDLGETGEN( nh );
- nix = HDLGETNIX( nh );
-#else /* NODECHK */
- nix = ( nix_t )nh;
+ nh = HDLGETNHDL( nh );
#endif /* NODECHK */
- ASSERT( nix <= NIX_MAX );
+ ASSERT( nh <= NH_MAX );
#ifdef NODECHK
hkp = *( *( u_char_t ** )pp + node_hdrp->nh_nodehkix );
@@ -587,7 +566,7 @@ node_unmap_internal( nh_t nh, void **pp,
/* unmap the window containing the node
*/
- win_unmap( nh2segix( nix ), pp ); /* zeros *pp */
+ win_unmap( nh2segix( nh ), pp ); /* zeros *pp */
}
void
@@ -600,9 +579,8 @@ void
node_free( nh_t *nhp )
{
nh_t nh;
- nix_t nix;
u_char_t *p;
- register nix_t *linkagep;
+ register nh_t *linkagep;
#ifdef NODECHK
register u_char_t *hkpp;
register u_char_t hdlgen;
@@ -618,12 +596,10 @@ node_free( nh_t *nhp )
*/
#ifdef NODECHK
hdlgen = HDLGETGEN( nh );
- nix = HDLGETNIX( nh );
-#else /* NODECHK */
- nix = ( nix_t )nh;
+ nh = HDLGETNHDL( nh );
#endif /* NODECHK */
- ASSERT( nix <= NIX_MAX );
+ ASSERT( nh <= NH_MAX );
/* map in
*/
@@ -647,9 +623,9 @@ node_free( nh_t *nhp )
/* put node on free list
*/
- linkagep = ( nix_t * )p;
- *linkagep = node_hdrp->nh_freenix;
- node_hdrp->nh_freenix = nix;
+ linkagep = ( nh_t * )p;
+ *linkagep = node_hdrp->nh_freenh;
+ node_hdrp->nh_freenh = nh;
/* map out
*/
* [PATCH v2 9/9] xfsrestore: check for compatible xfsrestore
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
2010-11-05 16:35 ` [PATCH v2 8/9] xfsrestore: remove nix_t wkendall
@ 2010-11-05 16:35 ` wkendall
2010-11-12 23:25 ` Alex Elder
2010-11-12 23:25 ` [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues Alex Elder
From: wkendall @ 2010-11-05 16:35 UTC (permalink / raw)
To: xfs
When resuming a restore or doing a cumulative restore, xfsrestore
reads state information left around by the previous invocation.
This patch adds logic to determine whether or not restore is
able to make sense of the state information.
The xfsrestore man page has also been updated to make the user
aware of the requirement to use a compatible restore and
system when resuming restores.
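The check amounts to comparing three fields written by the first session. A minimal sketch of the idea (struct and names are illustrative, not the actual pers_t layout):

```c
#include <stdint.h>

/* the magic doubles as an endianness probe: it is stored in host
 * byte order, so on a machine of the other endianness it reads
 * back byte-swapped and the comparison fails. version and page
 * size catch format changes and system changes respectively.
 */
#define HK_MAGIC   0x686b6d61u   /* "hkma" */
#define HK_VERSION 1u

struct hk_hdr {
    uint32_t magic;
    uint32_t version;
    uint64_t pagesize;
};

static int hk_hdr_compatible(const struct hk_hdr *h, uint64_t pgsz)
{
    if (h->magic != HK_MAGIC)     return 0; /* corrupt or wrong endianness */
    if (h->version != HK_VERSION) return 0; /* written by other version */
    if (h->pagesize != pgsz)      return 0; /* page size changed */
    return 1;
}
```

Keeping this struct first in the persistent header is what makes the scheme forward compatible: any future version can at least read the magic and version before deciding whether the rest of the file is intelligible.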
Signed-off-by: Bill Kendall <wkendall@sgi.com>
---
man/man8/xfsrestore.8 | 26 +++++++++++---
restore/content.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 108 insertions(+), 10 deletions(-)
Index: xfsdump-kernel.org/man/man8/xfsrestore.8
===================================================================
--- xfsdump-kernel.org.orig/man/man8/xfsrestore.8
+++ xfsdump-kernel.org/man/man8/xfsrestore.8
@@ -61,6 +61,18 @@ The deltas must be applied in the order
Each delta applied must have been produced with the previously applied
delta as its base.
.P
+.I xfsrestore
+keeps state information in the
+.IR xfsrestorehousekeepingdir ,
+to inform subsequent invocations when used in
+cumulative mode, or in the event a restore is interrupted.
+To ensure that the state information can be processed,
+a compatible version of
+.I xfsrestore
+must be used for each subsequent invocation. Additionally,
+each invocation must run on a system of the same endianness
+and page size.
+.P
The options to
.I xfsrestore
are:
@@ -78,11 +90,11 @@ option allows the operator to specify an
in which
.I xfsrestore
creates the
-.I xfsrestorehousekeeping
-directory.
-When performing a cumulative (\f3\-r\f1 option) restore,
-each successive invocation of \f2xfsrestore\f1 must specify the same alternate
+.I xfsrestorehousekeepingdir
directory.
+When performing a cumulative (\f3\-r\f1 option) restore or
+resuming (\f3\-R\f1 option) a restore, each successive invocation
+must specify the same alternate directory.
.TP 5
\f3\-b\f1 \f2blocksize\f1
Specifies the blocksize, in bytes, to be used for the restore.
Source tape drive is a QIC tape. QIC tapes use a fixed
blocksize, for which \f2xfsrestore\f1 must make special allowances.
.TP 5
\f3\-r\f1
-Selects the cumulative mode of operation.
+Selects the cumulative mode of operation. The
+.B \-a
+and
+.I destination
+options must be the same for each invocation.
.TP 5
\f3\-s\f1 \f2subtree\f1
Specifies a subtree to restore.
Index: xfsdump-kernel.org/restore/content.c
===================================================================
--- xfsdump-kernel.org.orig/restore/content.c
+++ xfsdump-kernel.org/restore/content.c
@@ -71,6 +71,13 @@
/* structure definitions used locally ****************************************/
+#define HOUSEKEEPING_MAGIC 0x686b6d61
+ /* "hkma" - see the housekeeping_magic field of pers_t below.
+ */
+#define HOUSEKEEPING_VERSION 1
+ /* see the housekeeping_version field of pers_t below.
+ */
+
#define WRITE_TRIES_MAX 3
/* retry loop tuning for write(2) workaround
*/
@@ -358,12 +365,43 @@ struct stream_context {
typedef struct stream_context stream_context_t;
-/* persistent state file header - two parts: accumulation state
- * which spans several sessions, and session state. each has a valid
- * bit, and no fields are valid until the valid bit is set.
- * all elements defined such that a bzero results in a valid initial state.
+/* persistent state file header - on-disk format information plus
+ * accumulation state (which spans several sessions) and session state.
+ * the latter two have a valid bit, and their fields are not valid until
+ * the valid bit is set. all elements defined such that a bzero results
+ * in a valid initial state.
*/
struct pers {
+ /* on-disk format information used to verify that xfsrestore
+ * can make sense of the data in xfsrestorehousekeepingdir
+ * when running in cumulative mode or when resuming a restore.
+ *
+ * for backwards/forwards compatibility, this struct must be
+ * the first field! also any changes to the struct must address
+ * compatibility with other xfsrestore versions.
+ */
+ struct {
+ size32_t housekeeping_magic;
+ /* used to determine if this struct has been
+ * initialized, and whether the machine's
+ * endianness is the same as the previous
+ * invocation. (data written to xfsrestore's
+ * state directory is not converted to an
+ * endian-neutral format since it only persists
+ * for the life of one or more restore sessions.)
+ */
+ size32_t housekeeping_version;
+ /* version of the data structures used in the
+ * state files in housekeepingdir. this must be
+ * bumped whenever the on-disk format changes.
+ */
+ size64_t pagesize;
+ /* headers in the persistent state files
+ * are aligned on page size boundaries, so
+ * this cannot change between invocations.
+ */
+ } v;
+
/* command line arguments from first session, and session
* history.
*/
@@ -1301,6 +1339,49 @@ content_init( intgen_t argc, char *argv[
strerror( errno ));
return BOOL_FALSE;
}
+
+ /* but first setup or verify the on-disk format information
+ */
+ if ( ! persp->a.valpr ) {
+ /* this is the first restore session
+ */
+ persp->v.housekeeping_magic = HOUSEKEEPING_MAGIC;
+ persp->v.housekeeping_version = HOUSEKEEPING_VERSION;
+ persp->v.pagesize = pgsz;
+
+ } else {
+ /* cumulative or resuming a restore, verify the header
+ */
+ if ( persp->v.housekeeping_magic != HOUSEKEEPING_MAGIC ) {
+ mlog( MLOG_NORMAL | MLOG_ERROR, _(
+ "%s format corrupt or wrong endianness "
+ "(0x%x, expected 0x%x)\n"),
+ hkdirname,
+ persp->v.housekeeping_magic,
+ HOUSEKEEPING_MAGIC );
+ return BOOL_FALSE;
+ }
+ if ( persp->v.housekeeping_version != HOUSEKEEPING_VERSION ) {
+ mlog( MLOG_NORMAL | MLOG_ERROR, _(
+ "%s format version differs from previous "
+ "restore (%u, expected %u)\n"),
+ hkdirname,
+ persp->v.housekeeping_version,
+ HOUSEKEEPING_VERSION );
+ return BOOL_FALSE;
+ }
+ if ( persp->v.pagesize != pgsz ) {
+ mlog( MLOG_NORMAL | MLOG_ERROR, _(
+ "%s format differs from previous "
+ "restore due to page size change "
+ "(was %lu, now %lu)\n"),
+ hkdirname,
+ persp->v.pagesize,
+ pgsz );
+ return BOOL_FALSE;
+ }
+ }
+
if ( ! persp->a.valpr ) {
if ( ! dstdir ) {
mlog( MLOG_NORMAL | MLOG_ERROR, _(
@@ -1565,7 +1646,8 @@ content_init( intgen_t argc, char *argv[
stpgcnt = 0;
newstpgcnt = ( stsz + pgmask ) / pgsz;
descpgcnt = 0;
- memset( ( void * )persp, 0, sizeof( pers_t ));
+ memset( ( void * )&persp->a, 0,
+ sizeof( pers_t ) - offsetofmember( pers_t, a ));
} else if ( ! persp->s.valpr ) {
stpgcnt = persp->a.stpgcnt;
newstpgcnt = stpgcnt;
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/9] xfsrestore: turn off NODECHK
2010-11-05 16:35 ` [PATCH v2 1/9] xfsrestore: turn off NODECHK wkendall
@ 2010-11-12 23:23 ` Alex Elder
0 siblings, 0 replies; 24+ messages in thread
From: Alex Elder @ 2010-11-12 23:23 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
> The NODECHK macro should only be enabled as needed for
> development/debugging. Having it on limits xfsrestore to dumps
> with 268 million directory entries instead of 4 billion.
>
> Signed-off-by: Bill Kendall <wkendall@sgi.com>
Looks good.
Reviewed-by: Alex Elder <aelder@sgi.com>
* Re: [PATCH v2 2/9] xfsrestore: change nrh_t from 32 to 64 bits
2010-11-05 16:35 ` [PATCH v2 2/9] xfsrestore: change nrh_t from 32 to 64 bits wkendall
@ 2010-11-12 23:24 ` Alex Elder
0 siblings, 0 replies; 24+ messages in thread
From: Alex Elder @ 2010-11-12 23:24 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
> plain text document attachment (namreg_64bit)
> An nrh_t refers to a byte offset in a file containing all the pathname
> components from a dump. At an average filename length of 20
> characters, an nrh_t would overflow on a dump containing ~214 million
> directory entries. Removing this limitation allows xfsrestore to
> handle 4 billion directory entries.
>
> Signed-off-by: Bill Kendall <wkendall@sgi.com>
Looks good.
Reviewed-by: Alex Elder <aelder@sgi.com>
* Re: [PATCH v2 3/9] xfsrestore: cache path lookups
2010-11-05 16:35 ` [PATCH v2 3/9] xfsrestore: cache path lookups wkendall
@ 2010-11-12 23:25 ` Alex Elder
0 siblings, 0 replies; 24+ messages in thread
From: Alex Elder @ 2010-11-12 23:25 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
>
> In order to resolve a pathname, xfsrestore must work from an inode
> number (from the dump) and recurse up the directory entry tree that it
> has constructed. Each level of recursion requires a seek and read to
> get the name of the dirent, and possibly a mmap of a section of the
> directory entry tree if it is not already mapped (and in that case,
> possibly a munmap of another section). It's quite common to resolve
> pathnames in the same directory consecutively, so simply caching the
> parent directory pathname from the previous lookup saves quite a bit
> of overhead.
>
> Signed-off-by: Bill Kendall <wkendall@sgi.com>
I think an assertion would do in a place I show below,
but it's nothing to worry about.
Reviewed-by: Alex Elder <aelder@sgi.com>
> ---
> restore/tree.c | 41 ++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 36 insertions(+), 5 deletions(-)
>
> Index: xfsdump-kernel.org/restore/tree.c
> ===================================================================
> --- xfsdump-kernel.org.orig/restore/tree.c
> +++ xfsdump-kernel.org/restore/tree.c
. . .
> @@ -3475,6 +3486,13 @@ Node2path_recurse( nh_t nh, char *buf, i
> return bufsz;
> }
>
> + /* if we have a cache hit, no need to recurse any further
> + */
> + if ( nh == cache.nh && bufsz > cache.len ) {
Since a nh basically encodes the entire path,
the check that the buffer is big enough should
not be necessary, and could probably be inserted
instead:
ASSERT(bufsz > cache.len);
> + strcpy( buf, cache.buf );
> + return bufsz - cache.len;
> + }
> +
> /* extract useful node members
> */
> np = Node_map( nh );
. . .
* Re: [PATCH v2 4/9] xfsrestore: mmap dirent names for faster lookups
2010-11-05 16:35 ` [PATCH v2 4/9] xfsrestore: mmap dirent names for faster lookups wkendall
@ 2010-11-12 23:25 ` Alex Elder
2010-11-15 21:51 ` Bill Kendall
0 siblings, 1 reply; 24+ messages in thread
From: Alex Elder @ 2010-11-12 23:25 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
> plain text document attachment (namreg_map)
> Pathname resolution in xfsrestore is about 4x faster if the file
> containing dirent names ("namreg") is memory mapped. If xfsrestore is
> unable to map the file (e.g., due to virtual memory constraints)
> fall back to the existing seek-and-read approach.
>
> The file is mapped after all directory entries have been written to
> the "namreg" file. If the caller tries to add additional entries after
> the file has been mapped, it will be unmapped and restore will fall
> back to seek-and-read lookups.
>
> Signed-off-by: Bill Kendall <wkendall@sgi.com>
I guess I might have created a simple namreg_flush_final()
routine to encapsulate the namreg_flush() and namreg_map()
calls, rather than adding a flag. Then namreg_flush() could
have been made to have static scope. No big deal though.
Also, you fall back to the non-mmapped method in case a
flush or add happens after you've mapped the file. That
is in some sense comforting, but it would be nice to be
assured it simply won't happen. If you come to a point
someday where you are sure of this it would be nice to
see that code switched to assertions instead.
You could define the new namreg_*map*() functions earlier
in the file and skip the forward declarations.
These aren't all that important though, I think it's a
good change. Let me know if you think you will
re-work it but I think it's OK as-is.
Reviewed-by: Alex Elder <aelder@sgi.com>
* Re: [PATCH v2 8/9] xfsrestore: remove nix_t
2010-11-05 16:35 ` [PATCH v2 8/9] xfsrestore: remove nix_t wkendall
@ 2010-11-12 23:25 ` Alex Elder
0 siblings, 0 replies; 24+ messages in thread
From: Alex Elder @ 2010-11-12 23:25 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
> plain text document attachment (nuke_nix)
> The nix_t type is a 64-bit version of a (32-bit) nh_t, created "to
> facilitate conversion to a 64-bit offset." At this point there's only
> one case where an nh_t is converted to an off64_t, so just do an
> explicit cast in that case and remove the nix_t.
>
> Signed-off-by: Bill Kendall <wkendall@sgi.com>
Looks good.
Reviewed-by: Alex Elder <aelder@sgi.com>
* Re: [PATCH v2 9/9] xfsrestore: check for compatible xfsrestore
2010-11-05 16:35 ` [PATCH v2 9/9] xfsrestore: check for compatible xfsrestore wkendall
@ 2010-11-12 23:25 ` Alex Elder
0 siblings, 0 replies; 24+ messages in thread
From: Alex Elder @ 2010-11-12 23:25 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
> plain text document attachment (check_compatible_restore)
> When resuming a restore or doing a cumulative restore, xfsrestore
> reads state information left around by the previous invocation.
> This patch adds logic to determine whether or not restore is
> able to make sense of the state information.
>
> The xfsrestore man page has also been updated to make the user
> aware of the requirement to use a compatible restore and
> system when resuming restores.
>
> Signed-off-by: Bill Kendall <wkendall@sgi.com>
Looks good.
Reviewed-by: Alex Elder <aelder@sgi.com>
* Re: [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues
2010-11-05 16:35 [PATCH v2 0/9] xfsrestore dirent limitations and scaling issues wkendall
` (8 preceding siblings ...)
2010-11-05 16:35 ` [PATCH v2 9/9] xfsrestore: check for compatible xfsrestore wkendall
@ 2010-11-12 23:25 ` Alex Elder
9 siblings, 0 replies; 24+ messages in thread
From: Alex Elder @ 2010-11-12 23:25 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
> The first two patches in this series remove dirent limitations that
> exist in the current xfsrestore, allowing restore to now handle 4
> billion directory entries. Restore would map 200 GB of memory to do
> so, so don't go thinking this is a good idea. :) (These two patches
> were previously submitted to the list but I've made some changes to
> them as suggested by Alex Elder.)
>
> The remaining patches mostly deal with improving restore
> performance, most noticeably on dumps containing upwards of 10
> million inodes/dirents. This resulted in a 50% improvement in the
> time required to build restore's node table (a mmap'd representation
> of the dump's directory structure), so for interactive restores and
> restoring sub-directories this is very helpful. For full restores
> with millions of files the overall restore time is dominated by
> creating inodes and laying down the data, so the improvements here
> would be less noticeable.
>
> For dumps with lots of hard links, these changes fix a bug that was
> causing xfsrestore to constantly have to map and unmap segments of
> the node table, leading to horrible performance.
>
> Several of these patches modify the on-disk state information that
> xfsrestore leaves around for resuming restores. The final patch adds
> versioning information to the on-disk state to detect cases where
> the user tries to resume a restore with an incompatible version of
> xfsrestore (or an incompatible system).
I've reviewed most of these but I'll have to return
to the rest next week.
-Alex
* Re: [PATCH v2 5/9] xfsrestore: cleanup node allocation
2010-11-05 16:35 ` [PATCH v2 5/9] xfsrestore: cleanup node allocation wkendall
@ 2010-11-15 20:38 ` Alex Elder
2010-11-15 21:36 ` Bill Kendall
0 siblings, 1 reply; 24+ messages in thread
From: Alex Elder @ 2010-11-15 20:38 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
> plain text document attachment (node_alloc_cleanup)
> Simplify the node allocation code. The current code takes some
> number of nodes from a new segment and links them into the
> freelist whenever the freelist is depleted. There's no reason
> to put the new nodes on the freelist, we can just allocate the
> next available new node as needed. This also saves a trip through
> win_map/win_unmap if there are no nodes on the freelist (the
> common case).
Prior to your change, a node allocated off the "virgin"
segment got zeroed before it gets returned for use. Your
change eliminates that. Is that OK?
You also dropped a few TREE_DEBUG messages. Were they
not useful? (Just curious.)
Otherwise looks good.
Reviewed-by: Alex Elder <aelder@sgi.com>
> Signed-off-by: Bill Kendall <wkendall@sgi.com>
* Re: [PATCH v2 6/9] xfsrestore: fix node table setup
2010-11-05 16:35 ` [PATCH v2 6/9] xfsrestore: fix node table setup wkendall
@ 2010-11-15 20:38 ` Alex Elder
2010-11-15 21:30 ` Bill Kendall
0 siblings, 1 reply; 24+ messages in thread
From: Alex Elder @ 2010-11-15 20:38 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
> plain text document attachment (winmap_max_fix)
> The node table setup code unintentionally limits the amount of virtual
> memory that will be used for the node table. Rather than setting a
> limit based on the remaining virtual memory, the node table is limited
> to the amount of memory it thinks it will need based on the dump's
> inode count. But the node table stores directory entries, not inodes,
> and so dumps with lots of hard links end up thrashing mapping and
> unmapping segments of the node table to stay within the self-imposed
> virtual memory limit.
>
> This patch also changes the node table to ensure that there are a
> power of two nodes per segment. Profiling revealed that while building
> the node table, 50% of the restore cpu time was spent doing division
> and modulo to convert a node index into its segment index and offset
> within the segment. This change prepares the node table for another
> patch which will change the lookup mechanism to use shift and
> bitwise-and.
>
> Also don't bother passing the estimated segment table size to the
> window abstraction. It has to deal with resizing the table anyway and
> can choose a reasonable starting size.
>
>
> Signed-off-by: Bill Kendall <wkendall@sgi.com>
This looks good to me.
In your loop determining how many segments to
use, etc., you might as well start with the
minimum number of segments. I.e.:
winmapmax = 0;
for ( segcount = WINMAP_MIN; winmapmax < WINMAP_MIN; segcount <<= 1 ) {
I had a feeling there could be a pathological case
in which the computation of winmapmax, etc. would
get stuck looping indefinitely, but I gave up trying
to see if I could identify such a case...
And although I'm not that confident I understand the way
the windows on the tree file are used, I don't see anything
really wrong with your change.
Reviewed-by: Alex Elder <aelder@sgi.com>
* Re: [PATCH v2 7/9] xfsrestore: make node lookup more efficient
2010-11-05 16:35 ` [PATCH v2 7/9] xfsrestore: make node lookup more efficient wkendall
@ 2010-11-15 20:38 ` Alex Elder
2010-11-15 22:06 ` Bill Kendall
0 siblings, 1 reply; 24+ messages in thread
From: Alex Elder @ 2010-11-15 20:38 UTC (permalink / raw)
To: wkendall; +Cc: xfs
On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
> plain text document attachment (window_lookup_performance)
> When converting an nh_t to a segment index and relative node index,
> use shift and bitwise-and instead of division and modulo. Results show
> this is about 50% faster than the current method.
>
>
> Signed-off-by: Bill Kendall <wkendall@sgi.com>
I have a few comments below. They are suggestions and
don't indicate that I've found any real problems in your
code. If you think that updating things in response to
my suggestions is warranted, but want to do it later (as
a separate change) that's OK with me.
Reviewed-by: Alex Elder <aelder@sgi.com>
> ---
> restore/node.c | 150 ++++++++++++++++++++++++++++-----------------------------
> restore/win.c | 36 +++----------
> restore/win.h | 6 +-
> 3 files changed, 87 insertions(+), 105 deletions(-)
>
> Index: xfsdump-kernel.org/restore/node.c
> ===================================================================
> --- xfsdump-kernel.org.orig/restore/node.c
> +++ xfsdump-kernel.org/restore/node.c
. . .
> @@ -173,6 +166,12 @@ typedef struct node_hdr node_hdr_t;
> static node_hdr_t *node_hdrp;
> static intgen_t node_fd;
The following forward declarations could be eliminated
if you just move their definitions up in the file.
> +/* forward declarations of locally defined static functions ******************/
> +static inline size_t nh2segix( nh_t nh );
> +static inline nh_t nh2relnix( nh_t nh );
> +static inline void node_map_internal( nh_t nh, void **pp );
> +static inline void node_unmap_internal( nh_t nh, void **pp, bool_t freepr );
> +
> /* ARGSUSED */
> bool_t
> node_init( intgen_t fd,
. . .
> @@ -281,6 +280,8 @@ node_init( intgen_t fd,
> winmapmax = min( vmsz / segsz, max_segments );
> }
>
> + relnixmask = nodesperseg - 1;
> +
Is nodesperseg guaranteed to be a power of 2? It may
be, but it's not obvious to me from glancing at this
block of code.
> /* map the abstraction header
> */
> ASSERT( ( NODE_HDRSZ & pgmask ) == 0 );
. . .
> void
> Index: xfsdump-kernel.org/restore/win.c
> ===================================================================
> --- xfsdump-kernel.org.orig/restore/win.c
> +++ xfsdump-kernel.org/restore/win.c
> @@ -177,31 +177,16 @@ win_init( intgen_t fd,
> }
>
> void
> -win_map( off64_t off, void **pp )
> +win_map( size_t segix, void **pp )
It might be nice to have a segix_t or something.
The point being that what you are passing in here
is an index value (not a size), which is computed
from a node handle by figuring out which segment
that node handle resides in.
I was a little puzzled that nh2segix() took a
node handle and returned a size_t.
Similarly, a relnix appears to be an index
value within a segment; an integral typedef
could be used to clear that up. (and make
it clear it's different from just a nh_t).
Again, this suggestion comes out of my
not understanding the return type of
nh2relnix().
> {
> - size_t offwithinseg;
> - size_t segix;
> off64_t segoff;
> win_t *winp;
>
> CRITICAL_BEGIN();
>
> - /* calculate offset within segment
> - */
> - offwithinseg = ( size_t )( off % ( off64_t )tranp->t_segsz );
> -
> - /* calculate segment index
> - */
> - segix = (size_t)( off / ( off64_t )tranp->t_segsz );
> -
> - /* calculate offset of segment
> - */
> - segoff = off - ( off64_t )offwithinseg;
> -
> #ifdef TREE_DEBUG
> mlog(MLOG_DEBUG | MLOG_TREE | MLOG_NOLOCK,
> - "win_map(off=%lld,addr=%x): off within = %llu, segoff = %lld\n",
> - off, pp, offwithinseg, segoff);
> + "win_map(segix=%lu,addr=%p)\n", segix, pp);
> #endif
> /* resize the array if necessary */
> if ( segix >= tranp->t_segmaplen )
. . .
* Re: [PATCH v2 6/9] xfsrestore: fix node table setup
2010-11-15 20:38 ` Alex Elder
@ 2010-11-15 21:30 ` Bill Kendall
0 siblings, 0 replies; 24+ messages in thread
From: Bill Kendall @ 2010-11-15 21:30 UTC (permalink / raw)
To: aelder; +Cc: xfs
On 11/15/2010 02:38 PM, Alex Elder wrote:
> On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
>> plain text document attachment (winmap_max_fix)
>> The node table setup code unintentionally limits the amount of virtual
>> memory that will be used for the node table. Rather than setting a
>> limit based on the remaining virtual memory, the node table is limited
>> to the amount of memory it thinks it will need based on the dump's
>> inode count. But the node table stores directory entries, not inodes,
>> and so dumps with lots of hard links end up thrashing mapping and
>> unmapping segments of the node table to stay within the self-imposed
>> virtual memory limit.
>>
>> This patch also changes the node table to ensure that there are a
>> power of two nodes per segment. Profiling revealed that while building
>> the node table, 50% of the restore cpu time was spent doing division
>> and modulo to convert a node index into its segment index and offset
>> within the segment. This change prepares the node table for another
>> patch which will change the lookup mechanism to use shift and
>> bitwise-and.
>>
>> Also don't bother passing the estimated segment table size to the
>> window abstraction. It has to deal with resizing the table anyway and
>> can choose a reasonable starting size.
>>
>>
>> Signed-off-by: Bill Kendall <wkendall@sgi.com>
>
> This looks good to me.
>
> In your loop determining how many segments to
> use, etc., you might as well start with the
> minimum number of segments. I.e.:
>
> winmapmax = 0;
> for ( segcount = WINMAP_MIN; winmapmax < WINMAP_MIN; segcount <<= 1 ) {
>
> I had a feeling there could be a pathological case
> in which the computation of winmapmax, etc. would
> get stuck looping indefinitely, but I gave up trying
> to see if I could identify such a case...
I had a similar thought while making the change, but then
remembered that we just need enough virtual memory to
map up to WINMAP_MIN segments. It's perfectly okay to
only use one segment assuming there's enough virtual
memory.
I did some unit testing here playing around with the
bounds of the inputs (inode count and virtual memory size)
just to be sure that all output params were sane.
Bill
>
> And although I'm not that confident I understand the way
> the windows on the tree file are used, I don't see anything
> really wrong with your change.
>
> Reviewed-by: Alex Elder <aelder@sgi.com>
>
* Re: [PATCH v2 5/9] xfsrestore: cleanup node allocation
2010-11-15 20:38 ` Alex Elder
@ 2010-11-15 21:36 ` Bill Kendall
0 siblings, 0 replies; 24+ messages in thread
From: Bill Kendall @ 2010-11-15 21:36 UTC (permalink / raw)
To: aelder; +Cc: xfs
On 11/15/2010 02:38 PM, Alex Elder wrote:
> On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
>> plain text document attachment (node_alloc_cleanup)
>> Simplify the node allocation code. The current code takes some
>> number of nodes from a new segment and links them into the
>> freelist whenever the freelist is depleted. There's no reason
>> to put the new nodes on the freelist, we can just allocate the
>> next available new node as needed. This also saves a trip through
>> win_map/win_unmap if there are no nodes on the freelist (the
>> common case).
>
> Prior to your change, a node allocated off the "virgin"
> segment got zeroed before it gets returned for use. Your
> change eliminates that. Is that OK?
There's a wrapper function, Node_alloc, through which
all callers allocate nodes. Node_alloc initializes each
field. So this is okay.
>
> You also dropped a few TREE_DEBUG messages. Were they
> not useful? (Just curious.)
I did not find them useful (though I probably should have
mentioned that in the commit message).
Bill
>
> Otherwise looks good.
>
> Reviewed-by: Alex Elder <aelder@sgi.com>
>
>> Signed-off-by: Bill Kendall <wkendall@sgi.com>
>
* Re: [PATCH v2 4/9] xfsrestore: mmap dirent names for faster lookups
2010-11-12 23:25 ` Alex Elder
@ 2010-11-15 21:51 ` Bill Kendall
0 siblings, 0 replies; 24+ messages in thread
From: Bill Kendall @ 2010-11-15 21:51 UTC (permalink / raw)
To: aelder; +Cc: xfs
On 11/12/2010 05:25 PM, Alex Elder wrote:
> On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
>> plain text document attachment (namreg_map)
>> Pathname resolution in xfsrestore is about 4x faster if the file
>> containing dirent names ("namreg") is memory mapped. If xfsrestore is
>> unable to map the file (e.g., due to virtual memory constraints)
>> fall back to the existing seek-and-read approach.
>>
>> The file is mapped after all directory entries have been written to
>> the "namreg" file. If the caller tries to add additional entries after
>> the file has been mapped, it will be unmapped and restore will fall
>> back to seek-and-read lookups.
>>
>> Signed-off-by: Bill Kendall <wkendall@sgi.com>
>
> I guess I might have created a simple namreg_flush_final()
> routine to encapsulate the namreg_flush() and namreg_map()
> calls, rather than adding a flag. Then namreg_flush() could
> have been made to have static scope. No big deal though.
>
> Also, you fall back to the non-mmapped method in case a
> flush or add happens after you've mapped the file. That
> is in some sense comforting, but it would be nice to be
> assured it simply won't happen. If you come to a point
> someday where you are sure of this it would be nice to
> see that code switched to assertions instead.
>
> You could define the new namreg_*map*() functions earlier
> in the file and skip the forward declarations.
But the comments told me to put the functions there! ;)
xfsdump/restore could benefit from some major refactoring.
>
> These aren't all that important though, I think it's a
> good change. Let me know if you think you will
> re-work it but I think it's OK as-is.
I'll leave this patch as is, and make a note to look
at changing the mmap-checks into asserts.
Bill
>
> Reviewed-by: Alex Elder <aelder@sgi.com>
>
>
* Re: [PATCH v2 7/9] xfsrestore: make node lookup more efficient
2010-11-15 20:38 ` Alex Elder
@ 2010-11-15 22:06 ` Bill Kendall
0 siblings, 0 replies; 24+ messages in thread
From: Bill Kendall @ 2010-11-15 22:06 UTC (permalink / raw)
To: aelder; +Cc: xfs
On 11/15/2010 02:38 PM, Alex Elder wrote:
> On Fri, 2010-11-05 at 11:35 -0500, wkendall@sgi.com wrote:
>> plain text document attachment (window_lookup_performance)
>> When converting an nh_t to a segment index and relative node index,
>> use shift and bitwise-and instead of division and modulo. Results show
>> this is about 50% faster than the current method.
>>
>>
>> Signed-off-by: Bill Kendall <wkendall@sgi.com>
>
> I have a few comments below. They are suggestions and
> don't indicate that I've found any real problems in your
> code. If you think that updating things in response to
> my suggestions is warranted, but want to do it later (as
> a separate change) that's OK with me.
>
> Reviewed-by: Alex Elder <aelder@sgi.com>
>
>> ---
>> restore/node.c | 150 ++++++++++++++++++++++++++++-----------------------------
>> restore/win.c | 36 +++----------
>> restore/win.h | 6 +-
>> 3 files changed, 87 insertions(+), 105 deletions(-)
>>
>> Index: xfsdump-kernel.org/restore/node.c
>> ===================================================================
>> --- xfsdump-kernel.org.orig/restore/node.c
>> +++ xfsdump-kernel.org/restore/node.c
>
> . . .
>
>> @@ -173,6 +166,12 @@ typedef struct node_hdr node_hdr_t;
>> static node_hdr_t *node_hdrp;
>> static intgen_t node_fd;
>
> The following forward declarations could be eliminated
> if you just move their definitions up in the file.
>
>> +/* forward declarations of locally defined static functions ******************/
>> +static inline size_t nh2segix( nh_t nh );
>> +static inline nh_t nh2relnix( nh_t nh );
>> +static inline void node_map_internal( nh_t nh, void **pp );
>> +static inline void node_unmap_internal( nh_t nh, void **pp, bool_t freepr );
>> +
>> /* ARGSUSED */
>> bool_t
>> node_init( intgen_t fd,
>
> . . .
>
>> @@ -281,6 +280,8 @@ node_init( intgen_t fd,
>> winmapmax = min( vmsz / segsz, max_segments );
>> }
>>
>> + relnixmask = nodesperseg - 1;
>> +
>
> Is nodesperseg guaranteed to be a power of 2? It may
> be, but it's not obvious to me from glancing at this
> block of code.
Yes, just above this hunk nodesperseg is set using a bitshift.
>
>> /* map the abstraction header
>> */
>> ASSERT( ( NODE_HDRSZ & pgmask ) == 0 );
>
> . . .
>
>> void
>> Index: xfsdump-kernel.org/restore/win.c
>> ===================================================================
>> --- xfsdump-kernel.org.orig/restore/win.c
>> +++ xfsdump-kernel.org/restore/win.c
>> @@ -177,31 +177,16 @@ win_init( intgen_t fd,
>> }
>>
>> void
>> -win_map( off64_t off, void **pp )
>> +win_map( size_t segix, void **pp )
>
> It might be nice to have a segix_t or something.
> The point being that what you are passing in here
> is an index value (not a size), which is computed
> from a node handle by figuring out which segment
> that node handle resides in.
>
> I was a little puzzled that nh2segix() took a
> node handle and returned a size_t.
>
> Similarly, a relnix appears to be an index
> value within a segment; an integral typedef
> could be used to clear that up. (and make
> it clear it's different from just a nh_t).
> Again, this suggestion comes out of my
> not understanding the return type of
> nh2relnix().
I agree, these were bad choices. I based them off
the fact that the old code unnecessarily passed
a size64_t to the winmap interface for one of
the params, then just changed it to size_t.
I'll rework this one and resubmit the series.
Bill
>
>> {
>> - size_t offwithinseg;
>> - size_t segix;
>> off64_t segoff;
>> win_t *winp;
>>
>> CRITICAL_BEGIN();
>>
>> - /* calculate offset within segment
>> - */
>> - offwithinseg = ( size_t )( off % ( off64_t )tranp->t_segsz );
>> -
>> - /* calculate segment index
>> - */
>> - segix = (size_t)( off / ( off64_t )tranp->t_segsz );
>> -
>> - /* calculate offset of segment
>> - */
>> - segoff = off - ( off64_t )offwithinseg;
>> -
>> #ifdef TREE_DEBUG
>> mlog(MLOG_DEBUG | MLOG_TREE | MLOG_NOLOCK,
>> - "win_map(off=%lld,addr=%x): off within = %llu, segoff = %lld\n",
>> - off, pp, offwithinseg, segoff);
>> + "win_map(segix=%lu,addr=%p)\n", segix, pp);
>> #endif
>> /* resize the array if necessary */
>> if ( segix >= tranp->t_segmaplen )
>
>
> . . .