public inbox for linux-xfs@vger.kernel.org
* [PATCH 0/5] xfs_repair: scalability improvements
@ 2013-12-12  7:22 Dave Chinner
  2013-12-12  7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12  7:22 UTC (permalink / raw)
  To: xfs

Hi folks,

The following 5 patches to xfs_repair improve scalability on high
IOPS devices. They remove contention points that limit the number of
IO threads that we can use, whilst also protecting against using too
much concurrency.

The first patch removes a contention point in a 3rd party
translation library by avoiding translating static data repeatedly.

The second separates the per-ag locks into individual cachelines, so
threads working on different AGs don't contend on cachelines shared
by unrelated locks.

The third parallelises phase 6, which was never done because the
original repair parallelism work didn't show it to be a significant
contributor to runtime. Even serialised it was as fast as the
parallelised phases. However, that's a different story now with SSDs
- it's the only phase that is CPU bound because it doesn't spread
the work across multiple CPUs, and so is by far the slowest phase of
repair on SSDs.

The fourth patch corrects a problem with CPU usage in the buffer
cache - the hash table distributions are awful.

Finally, the last patch fixes a problem reported by Michael Semon
where xfs_repair used too much concurrency and hence failed because
it was unable to allocate threads.

Cheers,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/5] repair: translation lookups limit scalability
  2013-12-12  7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
@ 2013-12-12  7:22 ` Dave Chinner
  2013-12-12 18:26   ` Christoph Hellwig
  2013-12-12 18:58   ` Brian Foster
  2013-12-12  7:22 ` [PATCH 2/5] repair: per AG locks contend for cachelines Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12  7:22 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

A bit of perf magic showed that scalability was limited to 3-4
concurrent threads due to contention on a lock inside something
called __dcigettext(). That's in some library somewhere that repair
is linked against, and it turns out to be inside the translation
infrastructure that supports the _() string mechanism:

# Samples: 34K of event 'cs'
# Event count (approx.): 495567
#
# Overhead        Command      Shared Object          Symbol
# ........  .............  .................  ..............
#
    60.30%     xfs_repair  [kernel.kallsyms]  [k] __schedule
               |
               --- 0x63fffff9c
                   process_bmbt_reclist_int
                  |
                  |--39.95%-- __dcigettext
                  |          __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                  |
                  |--8.91%-- __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                   --51.13%-- [...]

Runtime of an unpatched xfs_repair is roughly:

        XFS_REPAIR Summary    Fri Dec  6 11:15:50 2013

Phase           Start           End             Duration
Phase 1:        12/06 10:56:21  12/06 10:56:21
Phase 2:        12/06 10:56:21  12/06 10:56:23  2 seconds
Phase 3:        12/06 10:56:23  12/06 11:01:31  5 minutes, 8 seconds
Phase 4:        12/06 11:01:31  12/06 11:07:08  5 minutes, 37 seconds
Phase 5:        12/06 11:07:08  12/06 11:07:09  1 second
Phase 6:        12/06 11:07:09  12/06 11:15:49  8 minutes, 40 seconds
Phase 7:        12/06 11:15:49  12/06 11:15:50  1 second

Total run time: 19 minutes, 29 seconds

Patched version:

Phase           Start           End             Duration
Phase 1:        12/06 10:36:29  12/06 10:36:29
Phase 2:        12/06 10:36:29  12/06 10:36:31  2 seconds
Phase 3:        12/06 10:36:31  12/06 10:40:08  3 minutes, 37 seconds
Phase 4:        12/06 10:40:08  12/06 10:43:42  3 minutes, 34 seconds
Phase 5:        12/06 10:43:42  12/06 10:43:42
Phase 6:        12/06 10:43:42  12/06 10:50:28  6 minutes, 46 seconds
Phase 7:        12/06 10:50:28  12/06 10:50:29  1 second

Total run time: 14 minutes

Big win!

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/dinode.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/repair/dinode.c b/repair/dinode.c
index 7469fc8..8f14a9e 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -522,6 +522,11 @@ _("illegal state %d in rt block map %" PRIu64 "\n"),
  * file overlaps with any duplicate extents (in the
  * duplicate extent list).
  */
+static char	*__forkname_data;
+static char	*__forkname_attr;
+static char	*__ftype_real_time;
+static char	*__ftype_regular;
+
 static int
 process_bmbt_reclist_int(
 	xfs_mount_t		*mp,
@@ -552,15 +557,30 @@ process_bmbt_reclist_int(
 	xfs_agnumber_t		locked_agno = -1;
 	int			error = 1;
 
+	/*
+	 * gettext lookups for translations of strings use mutexes internally to
+	 * the library. Hence when we come through here doing parallel scans in
+	 * multiple AGs, then all do concurrent text conversions and serialise
+	 * on the translation string lookups. Let's avoid doing repeated lookups
+	 * by making them static variables and only assigning the translation
+	 * once.
+	 */
+	if (!__forkname_data) {
+		__forkname_data = _("data");
+		__forkname_attr = _("attr");
+		__ftype_real_time = _("real-time");
+		__ftype_regular = _("regular");
+	}
+
 	if (whichfork == XFS_DATA_FORK)
-		forkname = _("data");
+		forkname = __forkname_data;
 	else
-		forkname = _("attr");
+		forkname = __forkname_attr;
 
 	if (type == XR_INO_RTDATA)
-		ftype = _("real-time");
+		ftype = __ftype_real_time;
 	else
-		ftype = _("regular");
+		ftype = __ftype_regular;
 
 	for (i = 0; i < *numrecs; i++) {
 		libxfs_bmbt_disk_get_all(rp + i, &irec);
-- 
1.8.4.rc3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/5] repair: per AG locks contend for cachelines
  2013-12-12  7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
  2013-12-12  7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
@ 2013-12-12  7:22 ` Dave Chinner
  2013-12-12 18:27   ` Christoph Hellwig
  2013-12-12 18:58   ` Brian Foster
  2013-12-12  7:22 ` [PATCH 3/5] repair: phase 6 is trivially parallelisable Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12  7:22 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The per-ag locks used to protect per-ag block lists are located in a
tightly packed array. That means they share cachelines, so separate
them out into separate 64 byte regions in the array.

pahole confirms the padding is correctly applied:

struct aglock {
        pthread_mutex_t            lock;                 /*     0    40 */

        /* size: 64, cachelines: 1, members: 1 */
        /* padding: 24 */
};

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/dino_chunks.c | 24 ++++++++++++------------
 repair/dinode.c      |  6 +++---
 repair/globals.h     |  5 ++++-
 repair/incore.c      |  4 ++--
 repair/scan.c        |  4 ++--
 5 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
index d3c2236..02d32d8 100644
--- a/repair/dino_chunks.c
+++ b/repair/dino_chunks.c
@@ -141,7 +141,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 		if (check_aginode_block(mp, agno, agino) == 0)
 			return 0;
 
-		pthread_mutex_lock(&ag_locks[agno]);
+		pthread_mutex_lock(&ag_locks[agno].lock);
 
 		state = get_bmap(agno, agbno);
 		switch (state) {
@@ -166,7 +166,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 		_("inode block %d/%d multiply claimed, (state %d)\n"),
 				agno, agbno, state);
 			set_bmap(agno, agbno, XR_E_MULT);
-			pthread_mutex_unlock(&ag_locks[agno]);
+			pthread_mutex_unlock(&ag_locks[agno].lock);
 			return(0);
 		default:
 			do_warn(
@@ -176,7 +176,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 			break;
 		}
 
-		pthread_mutex_unlock(&ag_locks[agno]);
+		pthread_mutex_unlock(&ag_locks[agno].lock);
 
 		start_agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
 		*start_ino = XFS_AGINO_TO_INO(mp, agno, start_agino);
@@ -424,7 +424,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 	 * user data -- we're probably here as a result of a directory
 	 * entry or an iunlinked pointer
 	 */
-	pthread_mutex_lock(&ag_locks[agno]);
+	pthread_mutex_lock(&ag_locks[agno].lock);
 	for (cur_agbno = chunk_start_agbno;
 	     cur_agbno < chunk_stop_agbno;
 	     cur_agbno += blen)  {
@@ -438,7 +438,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 	_("inode block %d/%d multiply claimed, (state %d)\n"),
 				agno, cur_agbno, state);
 			set_bmap_ext(agno, cur_agbno, blen, XR_E_MULT);
-			pthread_mutex_unlock(&ag_locks[agno]);
+			pthread_mutex_unlock(&ag_locks[agno].lock);
 			return 0;
 		case XR_E_INO:
 			do_error(
@@ -449,7 +449,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 			break;
 		}
 	}
-	pthread_mutex_unlock(&ag_locks[agno]);
+	pthread_mutex_unlock(&ag_locks[agno].lock);
 
 	/*
 	 * ok, chunk is good.  put the record into the tree if required,
@@ -472,7 +472,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 
 	set_inode_used(irec_p, agino - start_agino);
 
-	pthread_mutex_lock(&ag_locks[agno]);
+	pthread_mutex_lock(&ag_locks[agno].lock);
 
 	for (cur_agbno = chunk_start_agbno;
 	     cur_agbno < chunk_stop_agbno;
@@ -505,7 +505,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 			break;
 		}
 	}
-	pthread_mutex_unlock(&ag_locks[agno]);
+	pthread_mutex_unlock(&ag_locks[agno].lock);
 
 	return(ino_cnt);
 }
@@ -736,7 +736,7 @@ process_inode_chunk(
 	/*
 	 * mark block as an inode block in the incore bitmap
 	 */
-	pthread_mutex_lock(&ag_locks[agno]);
+	pthread_mutex_lock(&ag_locks[agno].lock);
 	state = get_bmap(agno, agbno);
 	switch (state) {
 	case XR_E_INO:	/* already marked */
@@ -755,7 +755,7 @@ process_inode_chunk(
 			XFS_AGB_TO_FSB(mp, agno, agbno), state);
 		break;
 	}
-	pthread_mutex_unlock(&ag_locks[agno]);
+	pthread_mutex_unlock(&ag_locks[agno].lock);
 
 	for (;;) {
 		/*
@@ -914,7 +914,7 @@ process_inode_chunk(
 			ibuf_offset = 0;
 			agbno++;
 
-			pthread_mutex_lock(&ag_locks[agno]);
+			pthread_mutex_lock(&ag_locks[agno].lock);
 			state = get_bmap(agno, agbno);
 			switch (state) {
 			case XR_E_INO:	/* already marked */
@@ -935,7 +935,7 @@ process_inode_chunk(
 					XFS_AGB_TO_FSB(mp, agno, agbno), state);
 				break;
 			}
-			pthread_mutex_unlock(&ag_locks[agno]);
+			pthread_mutex_unlock(&ag_locks[agno].lock);
 
 		} else if (irec_offset == XFS_INODES_PER_CHUNK)  {
 			/*
diff --git a/repair/dinode.c b/repair/dinode.c
index 8f14a9e..77bbe40 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -700,8 +700,8 @@ _("Fatal error: inode %" PRIu64 " - blkmap_set_ext(): %s\n"
 		ebno = agbno + irec.br_blockcount;
 		if (agno != locked_agno) {
 			if (locked_agno != -1)
-				pthread_mutex_unlock(&ag_locks[locked_agno]);
-			pthread_mutex_lock(&ag_locks[agno]);
+				pthread_mutex_unlock(&ag_locks[locked_agno].lock);
+			pthread_mutex_lock(&ag_locks[agno].lock);
 			locked_agno = agno;
 		}
 
@@ -770,7 +770,7 @@ _("illegal state %d in block map %" PRIu64 "\n"),
 	error = 0;
 done:
 	if (locked_agno != -1)
-		pthread_mutex_unlock(&ag_locks[locked_agno]);
+		pthread_mutex_unlock(&ag_locks[locked_agno].lock);
 
 	if (i != *numrecs) {
 		ASSERT(i < *numrecs);
diff --git a/repair/globals.h b/repair/globals.h
index aef8b79..cbb2ce7 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -186,7 +186,10 @@ EXTERN xfs_extlen_t	sb_inoalignmt;
 EXTERN __uint32_t	sb_unit;
 EXTERN __uint32_t	sb_width;
 
-EXTERN pthread_mutex_t	*ag_locks;
+struct aglock {
+	pthread_mutex_t	lock __attribute__((__aligned__(64)));
+};
+EXTERN struct aglock	*ag_locks;
 
 EXTERN int 		report_interval;
 EXTERN __uint64_t 	*prog_rpt_done;
diff --git a/repair/incore.c b/repair/incore.c
index 3590464..a8d497e 100644
--- a/repair/incore.c
+++ b/repair/incore.c
@@ -294,13 +294,13 @@ init_bmaps(xfs_mount_t *mp)
 	if (!ag_bmap)
 		do_error(_("couldn't allocate block map btree roots\n"));
 
-	ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(pthread_mutex_t));
+	ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(struct aglock));
 	if (!ag_locks)
 		do_error(_("couldn't allocate block map locks\n"));
 
 	for (i = 0; i < mp->m_sb.sb_agcount; i++)  {
 		btree_init(&ag_bmap[i]);
-		pthread_mutex_init(&ag_locks[i], NULL);
+		pthread_mutex_init(&ag_locks[i].lock, NULL);
 	}
 
 	init_rt_bmap(mp);
diff --git a/repair/scan.c b/repair/scan.c
index 49ed194..f9afdb1 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -273,7 +273,7 @@ _("bad back (left) sibling pointer (saw %llu should be NULL (0))\n"
 		agno = XFS_FSB_TO_AGNO(mp, bno);
 		agbno = XFS_FSB_TO_AGBNO(mp, bno);
 
-		pthread_mutex_lock(&ag_locks[agno]);
+		pthread_mutex_lock(&ag_locks[agno].lock);
 		state = get_bmap(agno, agbno);
 		switch (state) {
 		case XR_E_UNKNOWN:
@@ -319,7 +319,7 @@ _("bad state %d, inode %" PRIu64 " bmap block 0x%" PRIx64 "\n"),
 				state, ino, bno);
 			break;
 		}
-		pthread_mutex_unlock(&ag_locks[agno]);
+		pthread_mutex_unlock(&ag_locks[agno].lock);
 	} else  {
 		/*
 		 * attribute fork for realtime files is in the regular
-- 
1.8.4.rc3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/5] repair: phase 6 is trivially parallelisable
  2013-12-12  7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
  2013-12-12  7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
  2013-12-12  7:22 ` [PATCH 2/5] repair: per AG locks contend for cachelines Dave Chinner
@ 2013-12-12  7:22 ` Dave Chinner
  2013-12-12 18:43   ` Christoph Hellwig
  2013-12-12 18:59   ` Brian Foster
  2013-12-12  7:22 ` [PATCH 4/5] libxfs: buffer cache hashing is suboptimal Dave Chinner
  2013-12-12  7:22 ` [PATCH 5/5] repair: limit auto-striding concurrency appropriately Dave Chinner
  4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12  7:22 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Phase 6 is currently single threaded, but it iterates AGs one at a
time. When there are hundreds of AGs that need scanning, this takes
a long time. Given that all the objects that the AG traversal works
on are per-ag, we can simply parallelise this into a strided AG
processing like phase 3 and 4.

Unpatched:	8m40s
patched:	1m10s (7 threads)

Big win!

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase6.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 47 insertions(+), 9 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index d2d4a44..d82f900 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -51,6 +51,7 @@ typedef struct dotdot_update {
 
 static dotdot_update_t		*dotdot_update_list;
 static int			dotdot_update;
+static pthread_mutex_t		dotdot_lock;
 
 static void
 add_dotdot_update(
@@ -64,12 +65,14 @@ add_dotdot_update(
 		do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
 			sizeof(dotdot_update_t));
 
+	pthread_mutex_lock(&dotdot_lock);
 	dir->next = dotdot_update_list;
 	dir->irec = irec;
 	dir->agno = agno;
 	dir->ino_offset = ino_offset;
 
 	dotdot_update_list = dir;
+	pthread_mutex_unlock(&dotdot_lock);
 }
 
 /*
@@ -2918,34 +2921,68 @@ update_missing_dotdot_entries(
 	 * these entries parents were updated, rebuild them again
 	 * set dotdot_update flag so processing routines do not count links
 	 */
+	pthread_mutex_lock(&dotdot_lock);
 	dotdot_update = 1;
 	while (dotdot_update_list) {
 		dir = dotdot_update_list;
 		dotdot_update_list = dir->next;
+		dir->next = NULL;
+		pthread_mutex_unlock(&dotdot_lock);
+
 		process_dir_inode(mp, dir->agno, dir->irec, dir->ino_offset);
 		free(dir);
+
+		pthread_mutex_lock(&dotdot_lock);
 	}
+	pthread_mutex_unlock(&dotdot_lock);
 }
 
 static void
 traverse_ags(
-	xfs_mount_t 		*mp)
+	xfs_mount_t		*mp)
 {
-	int			i;
-	work_queue_t		queue;
+	int			i, j;
+	xfs_agnumber_t		agno;
+	work_queue_t		*queues;
 	prefetch_args_t		*pf_args[2];
 
 	/*
 	 * we always do prefetch for phase 6 as it will fill in the gaps
 	 * not read during phase 3 prefetch.
 	 */
-	queue.mp = mp;
-	pf_args[0] = start_inode_prefetch(0, 1, NULL);
-	for (i = 0; i < glob_agcount; i++) {
-		pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
-				pf_args[i & 1]);
-		traverse_function(&queue, i, pf_args[i & 1]);
+	if (!ag_stride) {
+		work_queue_t	queue;
+
+		queue.mp = mp;
+		pf_args[0] = start_inode_prefetch(0, 1, NULL);
+		for (i = 0; i < glob_agcount; i++) {
+			pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
+					pf_args[i & 1]);
+			traverse_function(&queue, i, pf_args[i & 1]);
+		}
+		return;
 	}
+
+	/*
+	 * create one worker thread for each segment of the volume
+	 */
+	queues = malloc(thread_count * sizeof(work_queue_t));
+	for (i = 0, agno = 0; i < thread_count; i++) {
+		create_work_queue(&queues[i], mp, 1);
+		pf_args[0] = NULL;
+		for (j = 0; j < ag_stride && agno < glob_agcount; j++, agno++) {
+			pf_args[0] = start_inode_prefetch(agno, 1, pf_args[0]);
+			queue_work(&queues[i], traverse_function, agno,
+				   pf_args[0]);
+		}
+	}
+
+	/*
+	 * wait for workers to complete
+	 */
+	for (i = 0; i < thread_count; i++)
+		destroy_work_queue(&queues[i]);
+	free(queues);
 }
 
 void
@@ -2957,6 +2994,7 @@ phase6(xfs_mount_t *mp)
 	memset(&zerocr, 0, sizeof(struct cred));
 	memset(&zerofsx, 0, sizeof(struct fsxattr));
 	orphanage_ino = 0;
+	pthread_mutex_init(&dotdot_lock, NULL);
 
 	do_log(_("Phase 6 - check inode connectivity...\n"));
 
-- 
1.8.4.rc3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
  2013-12-12  7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
                   ` (2 preceding siblings ...)
  2013-12-12  7:22 ` [PATCH 3/5] repair: phase 6 is trivially parallelisable Dave Chinner
@ 2013-12-12  7:22 ` Dave Chinner
  2013-12-12 18:28   ` Christoph Hellwig
  2013-12-12 18:59   ` Brian Foster
  2013-12-12  7:22 ` [PATCH 5/5] repair: limit auto-striding concurrency appropriately Dave Chinner
  4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12  7:22 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The hashkey calculation is very simplistic, and throws away an amount
of entropy that should be folded into the hash. The result is
sub-optimal distribution across the hash tables. For example, with a
default 512 entry table, phase 2 results in this:

Max supported entries = 4096
Max utilized entries = 3970
Active entries = 3970
Hash table size = 512
Hits = 0
Misses = 3970
Hit ratio =  0.00
Hash buckets with   0 entries     12 (  0%)
Hash buckets with   1 entries      3 (  0%)
Hash buckets with   2 entries     10 (  0%)
Hash buckets with   3 entries      2 (  0%)
Hash buckets with   4 entries    129 ( 12%)
Hash buckets with   5 entries     20 (  2%)
Hash buckets with   6 entries     54 (  8%)
Hash buckets with   7 entries     22 (  3%)
Hash buckets with   8 entries    150 ( 30%)
Hash buckets with   9 entries     14 (  3%)
Hash buckets with  10 entries     16 (  4%)
Hash buckets with  11 entries      7 (  1%)
Hash buckets with  12 entries     38 ( 11%)
Hash buckets with  13 entries      5 (  1%)
Hash buckets with  14 entries      4 (  1%)
Hash buckets with  17 entries      1 (  0%)
Hash buckets with  19 entries      1 (  0%)
Hash buckets with  23 entries      1 (  0%)
Hash buckets with >24 entries     23 ( 16%)

Now, given a perfect distribution, we should have 8 entries per
chain. What we end up with is nothing like that.

Unfortunately, for phase 3/4 and others, the number of cached
objects results in the cache being expanded to 256k entries, and
so the stats just give this:

Hits = 262276
Misses = 8130393
Hit ratio =  3.13
Hash buckets with >24 entries    512 (100%)

We can't evaluate the efficiency of the hashing algorithm here.
Let's increase the size of the hash table to 32768 entries and go
from there:

Phase 2:

Hash buckets with   0 entries  31884 (  0%)
Hash buckets with   1 entries     35 (  0%)
Hash buckets with   2 entries     78 (  3%)
Hash buckets with   3 entries     30 (  2%)
Hash buckets with   4 entries    649 ( 65%)
Hash buckets with   5 entries     12 (  1%)
Hash buckets with   6 entries     13 (  1%)
Hash buckets with   8 entries     40 (  8%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  15 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with  24 entries     17 ( 10%)
Hash buckets with >24 entries      6 (  4%)

There's a significant number of collisions given the population is
only 15% of the size of the table itself....

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262090
Hash table size = 32768
Hits = 530844
Misses = 7164575
Hit ratio =  6.90
Hash buckets with   0 entries  11898 (  0%)
....
Hash buckets with  12 entries   5513 ( 25%)
Hash buckets with  13 entries   4188 ( 20%)
Hash buckets with  14 entries   2073 ( 11%)
Hash buckets with  15 entries   1811 ( 10%)
Hash buckets with  16 entries   1994 ( 12%)
....
Hash buckets with >24 entries    339 (  4%)

So, a third of the hash table buckets do not even have any entries in
them, despite more than 7.5 million entries being run through the
cache. Median chain lengths are 12-16 entries; the ideal is 8. And
there are lots of collisions on the chains longer than 24 entries...

Phase 6:

Hash buckets with   0 entries  14573 (  0%)
....
Hash buckets with >24 entries   2291 ( 36%)

Ouch. Not a good distribution at all.

Overall runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:35:04  12/06 11:35:04
Phase 2:        12/06 11:35:04  12/06 11:35:07  3 seconds
Phase 3:        12/06 11:35:07  12/06 11:38:27  3 minutes, 20 seconds
Phase 4:        12/06 11:38:27  12/06 11:41:32  3 minutes, 5 seconds
Phase 5:        12/06 11:41:32  12/06 11:41:32
Phase 6:        12/06 11:41:32  12/06 11:42:29  57 seconds
Phase 7:        12/06 11:42:29  12/06 11:42:30  1 second

Total run time: 7 minutes, 26 seconds

Modify the hash to be something more workable - steal the linux
kernel inode hash calculation and try that:

phase 2:

Max supported entries = 262144
Max utilized entries = 3970
Active entries = 3970
Hash table size = 32768
Hits = 0
Misses = 3972
Hit ratio =  0.00
Hash buckets with   0 entries  29055 (  0%)
Hash buckets with   1 entries   3464 ( 87%)
Hash buckets with   2 entries    241 ( 12%)
Hash buckets with   3 entries      8 (  0%)

Close to perfect.

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262118
Hash table size = 32768
Hits = 567900
Misses = 7118749
Hit ratio =  7.39
Hash buckets with   5 entries   1572 (  2%)
Hash buckets with   6 entries   2186 (  5%)
Hash buckets with   7 entries   9217 ( 24%)
Hash buckets with   8 entries   8757 ( 26%)
Hash buckets with   9 entries   6135 ( 21%)
Hash buckets with  10 entries   3166 ( 12%)
Hash buckets with  11 entries   1257 (  5%)
Hash buckets with  12 entries    364 (  1%)
Hash buckets with  13 entries     94 (  0%)
Hash buckets with  14 entries     14 (  0%)
Hash buckets with  15 entries      5 (  0%)

A near-perfect bell curve centered on the optimal distribution
number of 8 entries per chain.

Phase 6:

Hash buckets with   0 entries     24 (  0%)
Hash buckets with   1 entries    190 (  0%)
Hash buckets with   2 entries    571 (  0%)
Hash buckets with   3 entries   1263 (  1%)
Hash buckets with   4 entries   2465 (  3%)
Hash buckets with   5 entries   3399 (  6%)
Hash buckets with   6 entries   4002 (  9%)
Hash buckets with   7 entries   4186 ( 11%)
Hash buckets with   8 entries   3773 ( 11%)
Hash buckets with   9 entries   3240 ( 11%)
Hash buckets with  10 entries   2523 (  9%)
Hash buckets with  11 entries   2074 (  8%)
Hash buckets with  12 entries   1582 (  7%)
Hash buckets with  13 entries   1206 (  5%)
Hash buckets with  14 entries    863 (  4%)
Hash buckets with  15 entries    601 (  3%)
Hash buckets with  16 entries    386 (  2%)
Hash buckets with  17 entries    205 (  1%)
Hash buckets with  18 entries    122 (  0%)
Hash buckets with  19 entries     48 (  0%)
Hash buckets with  20 entries     24 (  0%)
Hash buckets with  21 entries     13 (  0%)
Hash buckets with  22 entries      8 (  0%)

A much wider bell curve than phase 3, but still centered around the
optimal value and far, far better than the distribution of the
current hash calculation. Runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:47:21  12/06 11:47:21
Phase 2:        12/06 11:47:21  12/06 11:47:23  2 seconds
Phase 3:        12/06 11:47:23  12/06 11:50:50  3 minutes, 27 seconds
Phase 4:        12/06 11:50:50  12/06 11:53:57  3 minutes, 7 seconds
Phase 5:        12/06 11:53:57  12/06 11:53:58  1 second
Phase 6:        12/06 11:53:58  12/06 11:54:51  53 seconds
Phase 7:        12/06 11:54:51  12/06 11:54:52  1 second

Total run time: 7 minutes, 31 seconds

Essentially unchanged - this is somewhat of a "swings and
roundabouts" test here because what it is testing is the cache-miss
overhead.

FWIW, the comparison here shows a pretty good case for the existing
hash calculation. On a less populated filesystem (5m inodes rather
than 50m inodes) the typical hash distribution was:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262094
Hash table size = 32768
Hits = 626228
Misses = 800166
Hit ratio = 43.90
Hash buckets with   0 entries  29274 (  0%)
Hash buckets with   3 entries      1 (  0%)
Hash buckets with   4 entries      1 (  0%)
Hash buckets with   7 entries      1 (  0%)
Hash buckets with   8 entries      1 (  0%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  12 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  16 entries      2 (  0%)
Hash buckets with  18 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with >24 entries   3483 ( 99%)

Total and utter crap. Same filesystem, new hash function:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262103
Hash table size = 32768
Hits = 673208
Misses = 838265
Hit ratio = 44.54
Hash buckets with   3 entries    558 (  0%)
Hash buckets with   4 entries   1126 (  1%)
Hash buckets with   5 entries   2440 (  4%)
Hash buckets with   6 entries   4249 (  9%)
Hash buckets with   7 entries   5280 ( 14%)
Hash buckets with   8 entries   5598 ( 17%)
Hash buckets with   9 entries   5446 ( 18%)
Hash buckets with  10 entries   3879 ( 14%)
Hash buckets with  11 entries   2405 ( 10%)
Hash buckets with  12 entries   1187 (  5%)
Hash buckets with  13 entries    447 (  2%)
Hash buckets with  14 entries    125 (  0%)
Hash buckets with  15 entries     25 (  0%)
Hash buckets with  16 entries      3 (  0%)

Kinda says it all, really...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/cache.h |  4 +++-
 libxfs/cache.c  |  7 +++++--
 libxfs/rdwr.c   | 12 ++++++++++--
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/cache.h b/include/cache.h
index 76cb234..0a84c69 100644
--- a/include/cache.h
+++ b/include/cache.h
@@ -66,7 +66,8 @@ typedef void (*cache_walk_t)(struct cache_node *);
 typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
 typedef void (*cache_node_flush_t)(struct cache_node *);
 typedef void (*cache_node_relse_t)(struct cache_node *);
-typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int);
+typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
+					  unsigned int);
 typedef int (*cache_node_compare_t)(struct cache_node *, cache_key_t);
 typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
 
@@ -112,6 +113,7 @@ struct cache {
 	cache_node_compare_t	compare;	/* comparison routine */
 	cache_bulk_relse_t	bulkrelse;	/* bulk release routine */
 	unsigned int		c_hashsize;	/* hash bucket count */
+	unsigned int		c_hashshift;	/* hash key shift */
 	struct cache_hash	*c_hash;	/* hash table buckets */
 	struct cache_mru	c_mrus[CACHE_MAX_PRIORITY + 1];
 	unsigned long long	c_misses;	/* cache misses */
diff --git a/libxfs/cache.c b/libxfs/cache.c
index 84d2860..dc69689 100644
--- a/libxfs/cache.c
+++ b/libxfs/cache.c
@@ -25,6 +25,7 @@
 #include <xfs/platform_defs.h>
 #include <xfs/list.h>
 #include <xfs/cache.h>
+#include <xfs/libxfs.h>
 
 #define CACHE_DEBUG 1
 #undef CACHE_DEBUG
@@ -61,6 +62,7 @@ cache_init(
 	cache->c_misses = 0;
 	cache->c_maxcount = maxcount;
 	cache->c_hashsize = hashsize;
+	cache->c_hashshift = libxfs_highbit32(hashsize);
 	cache->hash = cache_operations->hash;
 	cache->alloc = cache_operations->alloc;
 	cache->flush = cache_operations->flush;
@@ -343,7 +345,7 @@ cache_node_get(
 	int			priority = 0;
 	int			purged = 0;
 
-	hashidx = cache->hash(key, cache->c_hashsize);
+	hashidx = cache->hash(key, cache->c_hashsize, cache->c_hashshift);
 	hash = cache->c_hash + hashidx;
 	head = &hash->ch_list;
 
@@ -515,7 +517,8 @@ cache_node_purge(
 	struct cache_hash *	hash;
 	int			count = -1;
 
-	hash = cache->c_hash + cache->hash(key, cache->c_hashsize);
+	hash = cache->c_hash + cache->hash(key, cache->c_hashsize,
+					   cache->c_hashshift);
 	head = &hash->ch_list;
 	pthread_mutex_lock(&hash->ch_mutex);
 	for (pos = head->next, n = pos->next; pos != head;
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 0219a08..0effb9a 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -311,10 +311,18 @@ struct xfs_bufkey {
 	int			nmaps;
 };
 
+/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME	0x9e37fffffffc0001UL
+#define CACHE_LINE_SIZE		64
 static unsigned int
-libxfs_bhash(cache_key_t key, unsigned int hashsize)
+libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
 {
-	return (((unsigned int)((struct xfs_bufkey *)key)->blkno) >> 5) % hashsize;
+	uint64_t	hashval = ((struct xfs_bufkey *)key)->blkno;
+	uint64_t	tmp;
+
+	tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
+	return tmp % hashsize;
 }
 
 static int
-- 
1.8.4.rc3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/5] repair: limit auto-striding concurrency appropriately
  2013-12-12  7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
                   ` (3 preceding siblings ...)
  2013-12-12  7:22 ` [PATCH 4/5] libxfs: buffer cache hashing is suboptimal Dave Chinner
@ 2013-12-12  7:22 ` Dave Chinner
  2013-12-12 18:29   ` Christoph Hellwig
  2013-12-12 18:59   ` Brian Foster
  4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12  7:22 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

It's possible to have filesystems with hundreds of AGs on systems
with little concurrency and resources. In this case, we can easily
exhaust memory and fail to create threads and have all sorts of
interesting problems.

xfs/250 can cause this to occur, with failures like:

        - agno = 707
        - agno = 692
fatal error -- cannot create worker threads, error = [11] Resource temporarily unavailable

And this:

        - agno = 484
        - agno = 782
failed to create prefetch thread: Resource temporarily unavailable

Because it's trying to create more threads than a poor little 512MB
single CPU ia32 box can handle.

So, limit concurrency to a maximum of numcpus * 8 to prevent this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/libxfs.h    |  1 +
 libxfs/init.h       |  1 -
 repair/xfs_repair.c | 18 +++++++++++++++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/libxfs.h b/include/libxfs.h
index 4bf331c..39e3d85 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -144,6 +144,7 @@ extern void	libxfs_device_close (dev_t);
 extern int	libxfs_device_alignment (void);
 extern void	libxfs_report(FILE *);
 extern void	platform_findsizes(char *path, int fd, long long *sz, int *bsz);
+extern int	platform_nproc(void);
 
 /* check or write log footer: specify device, log size in blocks & uuid */
 typedef xfs_caddr_t (libxfs_get_block_t)(xfs_caddr_t, int, void *);
diff --git a/libxfs/init.h b/libxfs/init.h
index f0b8cb6..112febb 100644
--- a/libxfs/init.h
+++ b/libxfs/init.h
@@ -31,7 +31,6 @@ extern char *platform_findrawpath (char *path);
 extern char *platform_findblockpath (char *path);
 extern int platform_direct_blockdev (void);
 extern int platform_align_blockdev (void);
-extern int platform_nproc(void);
 extern unsigned long platform_physmem(void);	/* in kilobytes */
 extern int platform_has_uuid;
 
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 7beffcb..0d006ae 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -627,13 +627,29 @@ main(int argc, char **argv)
 	 * to target these for an increase in thread count. Hence a stride value
 	 * of 15 is chosen to ensure we get at least 2 AGs being scanned at once
 	 * on such filesystems.
+	 *
+	 * Limit the maximum thread count based on the CPU power that is
+	 * available. If we use too many threads, we might run out of memory
+	 * and CPU power before we run out of IO concurrency.
 	 */
 	if (!ag_stride && glob_agcount >= 16 && do_prefetch)
 		ag_stride = 15;
 
 	if (ag_stride) {
+		int max_threads = platform_nproc() * 8;
+
 		thread_count = (glob_agcount + ag_stride - 1) / ag_stride;
-		thread_init();
+		while (thread_count > max_threads) {
+			ag_stride *= 2;
+			thread_count = (glob_agcount + ag_stride - 1) /
+								ag_stride;
+		}
+		if (thread_count > 0)
+			thread_init();
+		else {
+			thread_count = 1;
+			ag_stride = 0;
+		}
 	}
 
 	if (ag_stride && report_interval) {
-- 
1.8.4.rc3


* Re: [PATCH 1/5] repair: translation lookups limit scalability
  2013-12-12  7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
@ 2013-12-12 18:26   ` Christoph Hellwig
  2013-12-12 18:58   ` Brian Foster
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Dec 12, 2013 at 06:22:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> A bit of perf magic showed that scalability was limited to 3-4
> concurrent threads due to contention on a lock inside something
> called __dcigettext(). That's in some library somewhere that repair is
> linked against, and it turns out to be inside the translation
> infrastructure to support the _() string mechanism:
> 
> # Samples: 34K of event 'cs'
> # Event count (approx.): 495567
> #
> # Overhead        Command      Shared Object          Symbol
> # ........  .............  .................  ..............
> #
>     60.30%     xfs_repair  [kernel.kallsyms]  [k] __schedule
>                |
>                --- 0x63fffff9c
>                    process_bmbt_reclist_int
>                   |
>                   |--39.95%-- __dcigettext
>                   |          __lll_lock_wait
>                   |          system_call_fastpath
>                   |          SyS_futex
>                   |          do_futex
>                   |          futex_wait
>                   |          futex_wait_queue_me
>                   |          schedule
>                   |          __schedule
>                   |
>                   |--8.91%-- __lll_lock_wait
>                   |          system_call_fastpath
>                   |          SyS_futex
>                   |          do_futex
>                   |          futex_wait
>                   |          futex_wait_queue_me
>                   |          schedule
>                   |          __schedule
>                    --51.13%-- [...]
> 
> Runtime of an unpatched xfs_repair is roughly:
> 
>         XFS_REPAIR Summary    Fri Dec  6 11:15:50 2013
> 
> Phase           Start           End             Duration
> Phase 1:        12/06 10:56:21  12/06 10:56:21
> Phase 2:        12/06 10:56:21  12/06 10:56:23  2 seconds
> Phase 3:        12/06 10:56:23  12/06 11:01:31  5 minutes, 8 seconds
> Phase 4:        12/06 11:01:31  12/06 11:07:08  5 minutes, 37 seconds
> Phase 5:        12/06 11:07:08  12/06 11:07:09  1 second
> Phase 6:        12/06 11:07:09  12/06 11:15:49  8 minutes, 40 seconds
> Phase 7:        12/06 11:15:49  12/06 11:15:50  1 second
> 
> Total run time: 19 minutes, 29 seconds
> 
> Patched version:
> 
> Phase           Start           End             Duration
> Phase 1:        12/06 10:36:29  12/06 10:36:29
> Phase 2:        12/06 10:36:29  12/06 10:36:31  2 seconds
> Phase 3:        12/06 10:36:31  12/06 10:40:08  3 minutes, 37 seconds
> Phase 4:        12/06 10:40:08  12/06 10:43:42  3 minutes, 34 seconds
> Phase 5:        12/06 10:43:42  12/06 10:43:42
> Phase 6:        12/06 10:43:42  12/06 10:50:28  6 minutes, 46 seconds
> Phase 7:        12/06 10:50:28  12/06 10:50:29  1 second
> 
> Total run time: 14 minutes
> 
> Big win!
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/dinode.c | 28 ++++++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/repair/dinode.c b/repair/dinode.c
> index 7469fc8..8f14a9e 100644
> --- a/repair/dinode.c
> +++ b/repair/dinode.c
> @@ -522,6 +522,11 @@ _("illegal state %d in rt block map %" PRIu64 "\n"),
>   * file overlaps with any duplicate extents (in the
>   * duplicate extent list).
>   */
> +static char	*__forkname_data;
> +static char	*__forkname_attr;
> +static char	*__ftype_real_time;
> +static char	*__ftype_regular;

Double underscores for symbol names aren't really nice for userspace
programs as they are in the implementation namespace.  I also don't
really see a need for the prefixes here.

> +	if (!__forkname_data) {
> +		__forkname_data = _("data");
> +		__forkname_attr = _("attr");
> +		__ftype_real_time = _("real-time");
> +		__ftype_regular = _("regular");
> +	}

Maybe just do these assignments early on during program initialization?


* Re: [PATCH 2/5] repair: per AG locks contend for cachelines
  2013-12-12  7:22 ` [PATCH 2/5] repair: per AG locks contend for cachelines Dave Chinner
@ 2013-12-12 18:27   ` Christoph Hellwig
  2013-12-12 18:58   ` Brian Foster
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
  2013-12-12  7:22 ` [PATCH 4/5] libxfs: buffer cache hashing is suboptimal Dave Chinner
@ 2013-12-12 18:28   ` Christoph Hellwig
  2013-12-12 18:59   ` Brian Foster
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 5/5] repair: limit auto-striding concurrency appropriately
  2013-12-12  7:22 ` [PATCH 5/5] repair: limit auto-striding concurrency appropriately Dave Chinner
@ 2013-12-12 18:29   ` Christoph Hellwig
  2013-12-12 21:00     ` Dave Chinner
  2013-12-12 18:59   ` Brian Foster
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks reasonable, but it might be worth adding a blurb on how you
arrived at the magic 8 threads per cpu.


* Re: [PATCH 3/5] repair: phase 6 is trivially parallelisable
  2013-12-12  7:22 ` [PATCH 3/5] repair: phase 6 is trivially parallelisable Dave Chinner
@ 2013-12-12 18:43   ` Christoph Hellwig
  2013-12-12 20:53     ` Dave Chinner
  2013-12-12 18:59   ` Brian Foster
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

>  static void
>  add_dotdot_update(
> @@ -64,12 +65,14 @@ add_dotdot_update(
>  		do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
>  			sizeof(dotdot_update_t));
>  
> +	pthread_mutex_lock(&dotdot_lock);
>  	dir->next = dotdot_update_list;
>  	dir->irec = irec;
>  	dir->agno = agno;
>  	dir->ino_offset = ino_offset;
>  
>  	dotdot_update_list = dir;
> +	pthread_mutex_unlock(&dotdot_lock);

Would be nice to make this use a list_head if you touch it anyway.
(As a separate patch)

>  static void
>  traverse_ags(
> -	xfs_mount_t 		*mp)
> +	xfs_mount_t		*mp)

Not quite sure what actually changed here, but if you touch it anyway
you might as well use the struct version..

> +	if (!ag_stride) {
> +		work_queue_t	queue;
> +
> +		queue.mp = mp;
> +		pf_args[0] = start_inode_prefetch(0, 1, NULL);
> +		for (i = 0; i < glob_agcount; i++) {
> +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> +					pf_args[i & 1]);
> +			traverse_function(&queue, i, pf_args[i & 1]);
> +		}
> +		return;
>  	}
> +
> +	/*
> +	 * create one worker thread for each segment of the volume
> +	 */
> +	queues = malloc(thread_count * sizeof(work_queue_t));
> +	for (i = 0, agno = 0; i < thread_count; i++) {
> +		create_work_queue(&queues[i], mp, 1);
> +		pf_args[0] = NULL;
> +		for (j = 0; j < ag_stride && agno < glob_agcount; j++, agno++) {
> +			pf_args[0] = start_inode_prefetch(agno, 1, pf_args[0]);
> +			queue_work(&queues[i], traverse_function, agno,
> +				   pf_args[0]);
> +		}
> +	}
> +
> +	/*
> +	 * wait for workers to complete
> +	 */
> +	for (i = 0; i < thread_count; i++)
> +		destroy_work_queue(&queues[i]);
> +	free(queues);


This is the third copy of this code block, might make sense to
consolidate it.

Btw, does anyone remember why we have the libxfs_bcache_overflowed()
special case in phase4, but not anywhere else?


* Re: [PATCH 1/5] repair: translation lookups limit scalability
  2013-12-12  7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
  2013-12-12 18:26   ` Christoph Hellwig
@ 2013-12-12 18:58   ` Brian Foster
  1 sibling, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:58 UTC (permalink / raw)
  To: Dave Chinner, xfs

On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> A bit of perf magic showed that scalability was limited to 3-4
> concurrent threads due to contention on a lock inside something
> called __dcigettext(). That's in some library somewhere that repair is
> linked against, and it turns out to be inside the translation
> infrastructure to support the _() string mechanism:
> 
> # Samples: 34K of event 'cs'
> # Event count (approx.): 495567
> #
> # Overhead        Command      Shared Object          Symbol
> # ........  .............  .................  ..............
> #
>     60.30%     xfs_repair  [kernel.kallsyms]  [k] __schedule
>                |
>                --- 0x63fffff9c
>                    process_bmbt_reclist_int
>                   |
>                   |--39.95%-- __dcigettext
>                   |          __lll_lock_wait
>                   |          system_call_fastpath
>                   |          SyS_futex
>                   |          do_futex
>                   |          futex_wait
>                   |          futex_wait_queue_me
>                   |          schedule
>                   |          __schedule
>                   |
>                   |--8.91%-- __lll_lock_wait
>                   |          system_call_fastpath
>                   |          SyS_futex
>                   |          do_futex
>                   |          futex_wait
>                   |          futex_wait_queue_me
>                   |          schedule
>                   |          __schedule
>                    --51.13%-- [...]
> 
> Runtime of an unpatched xfs_repair is roughly:
> 
>         XFS_REPAIR Summary    Fri Dec  6 11:15:50 2013
> 
> Phase           Start           End             Duration
> Phase 1:        12/06 10:56:21  12/06 10:56:21
> Phase 2:        12/06 10:56:21  12/06 10:56:23  2 seconds
> Phase 3:        12/06 10:56:23  12/06 11:01:31  5 minutes, 8 seconds
> Phase 4:        12/06 11:01:31  12/06 11:07:08  5 minutes, 37 seconds
> Phase 5:        12/06 11:07:08  12/06 11:07:09  1 second
> Phase 6:        12/06 11:07:09  12/06 11:15:49  8 minutes, 40 seconds
> Phase 7:        12/06 11:15:49  12/06 11:15:50  1 second
> 
> Total run time: 19 minutes, 29 seconds
> 
> Patched version:
> 
> Phase           Start           End             Duration
> Phase 1:        12/06 10:36:29  12/06 10:36:29
> Phase 2:        12/06 10:36:29  12/06 10:36:31  2 seconds
> Phase 3:        12/06 10:36:31  12/06 10:40:08  3 minutes, 37 seconds
> Phase 4:        12/06 10:40:08  12/06 10:43:42  3 minutes, 34 seconds
> Phase 5:        12/06 10:43:42  12/06 10:43:42
> Phase 6:        12/06 10:43:42  12/06 10:50:28  6 minutes, 46 seconds
> Phase 7:        12/06 10:50:28  12/06 10:50:29  1 second
> 
> Total run time: 14 minutes
> 
> Big win!
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Indeed! Neat fix. When looking at the code, I wondered whether the same
type of thing would show up in process_btinode() or scan_bmapbt() (e.g.,
defining 'forkname'). Perhaps with smaller (btree fmt) files? Or perhaps
the ratio of inodes to extent records is such that it simply isn't a
potential bottleneck. Anyways:

Reviewed-by: Brian Foster <bfoster@redhat.com>


>  repair/dinode.c | 28 ++++++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/repair/dinode.c b/repair/dinode.c
> index 7469fc8..8f14a9e 100644
> --- a/repair/dinode.c
> +++ b/repair/dinode.c
> @@ -522,6 +522,11 @@ _("illegal state %d in rt block map %" PRIu64 "\n"),
>   * file overlaps with any duplicate extents (in the
>   * duplicate extent list).
>   */
> +static char	*__forkname_data;
> +static char	*__forkname_attr;
> +static char	*__ftype_real_time;
> +static char	*__ftype_regular;
> +
>  static int
>  process_bmbt_reclist_int(
>  	xfs_mount_t		*mp,
> @@ -552,15 +557,30 @@ process_bmbt_reclist_int(
>  	xfs_agnumber_t		locked_agno = -1;
>  	int			error = 1;
>  
> +	/*
> +	 * gettext lookups for translations of strings use mutexes internally to
> +	 * the library. Hence when we come through here doing parallel scans in
> +	 * multiple AGs, then all do concurrent text conversions and serialise
> +	 * on the translation string lookups. Let's avoid doing repeated lookups
> +	 * by making them static variables and only assigning the translation
> +	 * once.
> +	 */
> +	if (!__forkname_data) {
> +		__forkname_data = _("data");
> +		__forkname_attr = _("attr");
> +		__ftype_real_time = _("real-time");
> +		__ftype_regular = _("regular");
> +	}
> +
>  	if (whichfork == XFS_DATA_FORK)
> -		forkname = _("data");
> +		forkname = __forkname_data;
>  	else
> -		forkname = _("attr");
> +		forkname = __forkname_attr;
>  
>  	if (type == XR_INO_RTDATA)
> -		ftype = _("real-time");
> +		ftype = __ftype_real_time;
>  	else
> -		ftype = _("regular");
> +		ftype = __ftype_regular;
>  
>  	for (i = 0; i < *numrecs; i++) {
>  		libxfs_bmbt_disk_get_all(rp + i, &irec);
> 


* Re: [PATCH 2/5] repair: per AG locks contend for cachelines
  2013-12-12  7:22 ` [PATCH 2/5] repair: per AG locks contend for cachelines Dave Chinner
  2013-12-12 18:27   ` Christoph Hellwig
@ 2013-12-12 18:58   ` Brian Foster
  2013-12-12 20:46     ` Dave Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:58 UTC (permalink / raw)
  To: Dave Chinner, xfs

On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The per-ag locks used to protect per-ag block lists are located in a tightly
> packed array. That means that they share cachelines, so separate them out into
> separate 64 byte regions in the array.
> 
> pahole confirms the padding is correctly applied:
> 
> struct aglock {
>         pthread_mutex_t            lock;                 /*     0    40 */
> 
>         /* size: 64, cachelines: 1, members: 1 */
>         /* padding: 24 */
> };
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Out of curiosity, any data on this one?

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  repair/dino_chunks.c | 24 ++++++++++++------------
>  repair/dinode.c      |  6 +++---
>  repair/globals.h     |  5 ++++-
>  repair/incore.c      |  4 ++--
>  repair/scan.c        |  4 ++--
>  5 files changed, 23 insertions(+), 20 deletions(-)
> 
> diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
> index d3c2236..02d32d8 100644
> --- a/repair/dino_chunks.c
> +++ b/repair/dino_chunks.c
> @@ -141,7 +141,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
>  		if (check_aginode_block(mp, agno, agino) == 0)
>  			return 0;
>  
> -		pthread_mutex_lock(&ag_locks[agno]);
> +		pthread_mutex_lock(&ag_locks[agno].lock);
>  
>  		state = get_bmap(agno, agbno);
>  		switch (state) {
> @@ -166,7 +166,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
>  		_("inode block %d/%d multiply claimed, (state %d)\n"),
>  				agno, agbno, state);
>  			set_bmap(agno, agbno, XR_E_MULT);
> -			pthread_mutex_unlock(&ag_locks[agno]);
> +			pthread_mutex_unlock(&ag_locks[agno].lock);
>  			return(0);
>  		default:
>  			do_warn(
> @@ -176,7 +176,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
>  			break;
>  		}
>  
> -		pthread_mutex_unlock(&ag_locks[agno]);
> +		pthread_mutex_unlock(&ag_locks[agno].lock);
>  
>  		start_agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
>  		*start_ino = XFS_AGINO_TO_INO(mp, agno, start_agino);
> @@ -424,7 +424,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
>  	 * user data -- we're probably here as a result of a directory
>  	 * entry or an iunlinked pointer
>  	 */
> -	pthread_mutex_lock(&ag_locks[agno]);
> +	pthread_mutex_lock(&ag_locks[agno].lock);
>  	for (cur_agbno = chunk_start_agbno;
>  	     cur_agbno < chunk_stop_agbno;
>  	     cur_agbno += blen)  {
> @@ -438,7 +438,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
>  	_("inode block %d/%d multiply claimed, (state %d)\n"),
>  				agno, cur_agbno, state);
>  			set_bmap_ext(agno, cur_agbno, blen, XR_E_MULT);
> -			pthread_mutex_unlock(&ag_locks[agno]);
> +			pthread_mutex_unlock(&ag_locks[agno].lock);
>  			return 0;
>  		case XR_E_INO:
>  			do_error(
> @@ -449,7 +449,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
>  			break;
>  		}
>  	}
> -	pthread_mutex_unlock(&ag_locks[agno]);
> +	pthread_mutex_unlock(&ag_locks[agno].lock);
>  
>  	/*
>  	 * ok, chunk is good.  put the record into the tree if required,
> @@ -472,7 +472,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
>  
>  	set_inode_used(irec_p, agino - start_agino);
>  
> -	pthread_mutex_lock(&ag_locks[agno]);
> +	pthread_mutex_lock(&ag_locks[agno].lock);
>  
>  	for (cur_agbno = chunk_start_agbno;
>  	     cur_agbno < chunk_stop_agbno;
> @@ -505,7 +505,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
>  			break;
>  		}
>  	}
> -	pthread_mutex_unlock(&ag_locks[agno]);
> +	pthread_mutex_unlock(&ag_locks[agno].lock);
>  
>  	return(ino_cnt);
>  }
> @@ -736,7 +736,7 @@ process_inode_chunk(
>  	/*
>  	 * mark block as an inode block in the incore bitmap
>  	 */
> -	pthread_mutex_lock(&ag_locks[agno]);
> +	pthread_mutex_lock(&ag_locks[agno].lock);
>  	state = get_bmap(agno, agbno);
>  	switch (state) {
>  	case XR_E_INO:	/* already marked */
> @@ -755,7 +755,7 @@ process_inode_chunk(
>  			XFS_AGB_TO_FSB(mp, agno, agbno), state);
>  		break;
>  	}
> -	pthread_mutex_unlock(&ag_locks[agno]);
> +	pthread_mutex_unlock(&ag_locks[agno].lock);
>  
>  	for (;;) {
>  		/*
> @@ -914,7 +914,7 @@ process_inode_chunk(
>  			ibuf_offset = 0;
>  			agbno++;
>  
> -			pthread_mutex_lock(&ag_locks[agno]);
> +			pthread_mutex_lock(&ag_locks[agno].lock);
>  			state = get_bmap(agno, agbno);
>  			switch (state) {
>  			case XR_E_INO:	/* already marked */
> @@ -935,7 +935,7 @@ process_inode_chunk(
>  					XFS_AGB_TO_FSB(mp, agno, agbno), state);
>  				break;
>  			}
> -			pthread_mutex_unlock(&ag_locks[agno]);
> +			pthread_mutex_unlock(&ag_locks[agno].lock);
>  
>  		} else if (irec_offset == XFS_INODES_PER_CHUNK)  {
>  			/*
> diff --git a/repair/dinode.c b/repair/dinode.c
> index 8f14a9e..77bbe40 100644
> --- a/repair/dinode.c
> +++ b/repair/dinode.c
> @@ -700,8 +700,8 @@ _("Fatal error: inode %" PRIu64 " - blkmap_set_ext(): %s\n"
>  		ebno = agbno + irec.br_blockcount;
>  		if (agno != locked_agno) {
>  			if (locked_agno != -1)
> -				pthread_mutex_unlock(&ag_locks[locked_agno]);
> -			pthread_mutex_lock(&ag_locks[agno]);
> +				pthread_mutex_unlock(&ag_locks[locked_agno].lock);
> +			pthread_mutex_lock(&ag_locks[agno].lock);
>  			locked_agno = agno;
>  		}
>  
> @@ -770,7 +770,7 @@ _("illegal state %d in block map %" PRIu64 "\n"),
>  	error = 0;
>  done:
>  	if (locked_agno != -1)
> -		pthread_mutex_unlock(&ag_locks[locked_agno]);
> +		pthread_mutex_unlock(&ag_locks[locked_agno].lock);
>  
>  	if (i != *numrecs) {
>  		ASSERT(i < *numrecs);
> diff --git a/repair/globals.h b/repair/globals.h
> index aef8b79..cbb2ce7 100644
> --- a/repair/globals.h
> +++ b/repair/globals.h
> @@ -186,7 +186,10 @@ EXTERN xfs_extlen_t	sb_inoalignmt;
>  EXTERN __uint32_t	sb_unit;
>  EXTERN __uint32_t	sb_width;
>  
> -EXTERN pthread_mutex_t	*ag_locks;
> +struct aglock {
> +	pthread_mutex_t	lock __attribute__((__aligned__(64)));
> +};
> +EXTERN struct aglock	*ag_locks;
>  
>  EXTERN int 		report_interval;
>  EXTERN __uint64_t 	*prog_rpt_done;
> diff --git a/repair/incore.c b/repair/incore.c
> index 3590464..a8d497e 100644
> --- a/repair/incore.c
> +++ b/repair/incore.c
> @@ -294,13 +294,13 @@ init_bmaps(xfs_mount_t *mp)
>  	if (!ag_bmap)
>  		do_error(_("couldn't allocate block map btree roots\n"));
>  
> -	ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(pthread_mutex_t));
> +	ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(struct aglock));
>  	if (!ag_locks)
>  		do_error(_("couldn't allocate block map locks\n"));
>  
>  	for (i = 0; i < mp->m_sb.sb_agcount; i++)  {
>  		btree_init(&ag_bmap[i]);
> -		pthread_mutex_init(&ag_locks[i], NULL);
> +		pthread_mutex_init(&ag_locks[i].lock, NULL);
>  	}
>  
>  	init_rt_bmap(mp);
> diff --git a/repair/scan.c b/repair/scan.c
> index 49ed194..f9afdb1 100644
> --- a/repair/scan.c
> +++ b/repair/scan.c
> @@ -273,7 +273,7 @@ _("bad back (left) sibling pointer (saw %llu should be NULL (0))\n"
>  		agno = XFS_FSB_TO_AGNO(mp, bno);
>  		agbno = XFS_FSB_TO_AGBNO(mp, bno);
>  
> -		pthread_mutex_lock(&ag_locks[agno]);
> +		pthread_mutex_lock(&ag_locks[agno].lock);
>  		state = get_bmap(agno, agbno);
>  		switch (state) {
>  		case XR_E_UNKNOWN:
> @@ -319,7 +319,7 @@ _("bad state %d, inode %" PRIu64 " bmap block 0x%" PRIx64 "\n"),
>  				state, ino, bno);
>  			break;
>  		}
> -		pthread_mutex_unlock(&ag_locks[agno]);
> +		pthread_mutex_unlock(&ag_locks[agno].lock);
>  	} else  {
>  		/*
>  		 * attribute fork for realtime files is in the regular
> 


* Re: [PATCH 3/5] repair: phase 6 is trivially parallelisable
  2013-12-12  7:22 ` [PATCH 3/5] repair: phase 6 is trivially parallelisable Dave Chinner
  2013-12-12 18:43   ` Christoph Hellwig
@ 2013-12-12 18:59   ` Brian Foster
  1 sibling, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:59 UTC (permalink / raw)
  To: Dave Chinner, xfs

On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Phase 6 is currently single threaded, but it iterates AGs one at a
> time. When there are hundreds of AGs that need scanning, this takes
> a long time. Given that all the objects that the AG traversal works
> on are per-ag, we can simply parallelise this into a strided AG
> processing like phase 3 and 4.
> 
> Unpatched:	8m40s
> patched:	1m10s (7 threads)
> 
> Big win!
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/phase6.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 47 insertions(+), 9 deletions(-)
> 
> diff --git a/repair/phase6.c b/repair/phase6.c
> index d2d4a44..d82f900 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -51,6 +51,7 @@ typedef struct dotdot_update {
>  
>  static dotdot_update_t		*dotdot_update_list;
>  static int			dotdot_update;
> +static pthread_mutex_t		dotdot_lock;
>  
>  static void
>  add_dotdot_update(
> @@ -64,12 +65,14 @@ add_dotdot_update(
>  		do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
>  			sizeof(dotdot_update_t));
>  
> +	pthread_mutex_lock(&dotdot_lock);
>  	dir->next = dotdot_update_list;
>  	dir->irec = irec;
>  	dir->agno = agno;
>  	dir->ino_offset = ino_offset;
>  
>  	dotdot_update_list = dir;
> +	pthread_mutex_unlock(&dotdot_lock);
>  }
>  
>  /*
> @@ -2918,34 +2921,68 @@ update_missing_dotdot_entries(
>  	 * these entries parents were updated, rebuild them again
>  	 * set dotdot_update flag so processing routines do not count links
>  	 */
> +	pthread_mutex_lock(&dotdot_lock);
>  	dotdot_update = 1;
>  	while (dotdot_update_list) {
>  		dir = dotdot_update_list;
>  		dotdot_update_list = dir->next;
> +		dir->next = NULL;
> +		pthread_mutex_unlock(&dotdot_lock);
> +
>  		process_dir_inode(mp, dir->agno, dir->irec, dir->ino_offset);
>  		free(dir);
> +
> +		pthread_mutex_lock(&dotdot_lock);
>  	}
> +	pthread_mutex_unlock(&dotdot_lock);
>  }

Technically the locking here is unnecessary, as this appears to remain
single threaded, yes? It doesn't hurt and probably eliminates a
potential landmine, so:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  
>  static void
>  traverse_ags(
> -	xfs_mount_t 		*mp)
> +	xfs_mount_t		*mp)
>  {
> -	int			i;
> -	work_queue_t		queue;
> +	int			i, j;
> +	xfs_agnumber_t		agno;
> +	work_queue_t		*queues;
>  	prefetch_args_t		*pf_args[2];
>  
>  	/*
>  	 * we always do prefetch for phase 6 as it will fill in the gaps
>  	 * not read during phase 3 prefetch.
>  	 */
> -	queue.mp = mp;
> -	pf_args[0] = start_inode_prefetch(0, 1, NULL);
> -	for (i = 0; i < glob_agcount; i++) {
> -		pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> -				pf_args[i & 1]);
> -		traverse_function(&queue, i, pf_args[i & 1]);
> +	if (!ag_stride) {
> +		work_queue_t	queue;
> +
> +		queue.mp = mp;
> +		pf_args[0] = start_inode_prefetch(0, 1, NULL);
> +		for (i = 0; i < glob_agcount; i++) {
> +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> +					pf_args[i & 1]);
> +			traverse_function(&queue, i, pf_args[i & 1]);
> +		}
> +		return;
>  	}
> +
> +	/*
> +	 * create one worker thread for each segment of the volume
> +	 */
> +	queues = malloc(thread_count * sizeof(work_queue_t));
> +	for (i = 0, agno = 0; i < thread_count; i++) {
> +		create_work_queue(&queues[i], mp, 1);
> +		pf_args[0] = NULL;
> +		for (j = 0; j < ag_stride && agno < glob_agcount; j++, agno++) {
> +			pf_args[0] = start_inode_prefetch(agno, 1, pf_args[0]);
> +			queue_work(&queues[i], traverse_function, agno,
> +				   pf_args[0]);
> +		}
> +	}
> +
> +	/*
> +	 * wait for workers to complete
> +	 */
> +	for (i = 0; i < thread_count; i++)
> +		destroy_work_queue(&queues[i]);
> +	free(queues);
>  }
>  
>  void
> @@ -2957,6 +2994,7 @@ phase6(xfs_mount_t *mp)
>  	memset(&zerocr, 0, sizeof(struct cred));
>  	memset(&zerofsx, 0, sizeof(struct fsxattr));
>  	orphanage_ino = 0;
> +	pthread_mutex_init(&dotdot_lock, NULL);
>  
>  	do_log(_("Phase 6 - check inode connectivity...\n"));
>  
> 


* Re: [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
  2013-12-12  7:22 ` [PATCH 4/5] libxfs: buffer cache hashing is suboptimal Dave Chinner
  2013-12-12 18:28   ` Christoph Hellwig
@ 2013-12-12 18:59   ` Brian Foster
  2013-12-12 20:56     ` Dave Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:59 UTC (permalink / raw)
  To: Dave Chinner, xfs

On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The hashkey calculation is very simplistic, and throws away an amount
> of entropy that should be folded into the hash. The result is
> sub-optimal distribution across the hash tables. For example, with a
> default 512 entry table, phase 2 results in this:
> 
...
> Modify the hash to be something more workable - steal the linux
> kernel inode hash calculation and try that:
> 
...
> 
> Kinda says it all, really...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Results look nice and the algorithm seems to match the kernel variant,
but what about the 32-bit alternate prime/cache line values? Safe to
leave out..?

Brian

>  include/cache.h |  4 +++-
>  libxfs/cache.c  |  7 +++++--
>  libxfs/rdwr.c   | 12 ++++++++++--
>  3 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/include/cache.h b/include/cache.h
> index 76cb234..0a84c69 100644
> --- a/include/cache.h
> +++ b/include/cache.h
> @@ -66,7 +66,8 @@ typedef void (*cache_walk_t)(struct cache_node *);
>  typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
>  typedef void (*cache_node_flush_t)(struct cache_node *);
>  typedef void (*cache_node_relse_t)(struct cache_node *);
> -typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int);
> +typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
> +					  unsigned int);
>  typedef int (*cache_node_compare_t)(struct cache_node *, cache_key_t);
>  typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
>  
> @@ -112,6 +113,7 @@ struct cache {
>  	cache_node_compare_t	compare;	/* comparison routine */
>  	cache_bulk_relse_t	bulkrelse;	/* bulk release routine */
>  	unsigned int		c_hashsize;	/* hash bucket count */
> +	unsigned int		c_hashshift;	/* hash key shift */
>  	struct cache_hash	*c_hash;	/* hash table buckets */
>  	struct cache_mru	c_mrus[CACHE_MAX_PRIORITY + 1];
>  	unsigned long long	c_misses;	/* cache misses */
> diff --git a/libxfs/cache.c b/libxfs/cache.c
> index 84d2860..dc69689 100644
> --- a/libxfs/cache.c
> +++ b/libxfs/cache.c
> @@ -25,6 +25,7 @@
>  #include <xfs/platform_defs.h>
>  #include <xfs/list.h>
>  #include <xfs/cache.h>
> +#include <xfs/libxfs.h>
>  
>  #define CACHE_DEBUG 1
>  #undef CACHE_DEBUG
> @@ -61,6 +62,7 @@ cache_init(
>  	cache->c_misses = 0;
>  	cache->c_maxcount = maxcount;
>  	cache->c_hashsize = hashsize;
> +	cache->c_hashshift = libxfs_highbit32(hashsize);
>  	cache->hash = cache_operations->hash;
>  	cache->alloc = cache_operations->alloc;
>  	cache->flush = cache_operations->flush;
> @@ -343,7 +345,7 @@ cache_node_get(
>  	int			priority = 0;
>  	int			purged = 0;
>  
> -	hashidx = cache->hash(key, cache->c_hashsize);
> +	hashidx = cache->hash(key, cache->c_hashsize, cache->c_hashshift);
>  	hash = cache->c_hash + hashidx;
>  	head = &hash->ch_list;
>  
> @@ -515,7 +517,8 @@ cache_node_purge(
>  	struct cache_hash *	hash;
>  	int			count = -1;
>  
> -	hash = cache->c_hash + cache->hash(key, cache->c_hashsize);
> +	hash = cache->c_hash + cache->hash(key, cache->c_hashsize,
> +					   cache->c_hashshift);
>  	head = &hash->ch_list;
>  	pthread_mutex_lock(&hash->ch_mutex);
>  	for (pos = head->next, n = pos->next; pos != head;
> diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
> index 0219a08..0effb9a 100644
> --- a/libxfs/rdwr.c
> +++ b/libxfs/rdwr.c
> @@ -311,10 +311,18 @@ struct xfs_bufkey {
>  	int			nmaps;
>  };
>  
> +/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
> +#define GOLDEN_RATIO_PRIME	0x9e37fffffffc0001UL
> +#define CACHE_LINE_SIZE		64
>  static unsigned int
> -libxfs_bhash(cache_key_t key, unsigned int hashsize)
> +libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
>  {
> -	return (((unsigned int)((struct xfs_bufkey *)key)->blkno) >> 5) % hashsize;
> +	uint64_t	hashval = ((struct xfs_bufkey *)key)->blkno;
> +	uint64_t	tmp;
> +
> +	tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
> +	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
> +	return tmp % hashsize;
>  }
>  
>  static int
> 


* Re: [PATCH 5/5] repair: limit auto-striding concurrency appropriately
  2013-12-12  7:22 ` [PATCH 5/5] repair: limit auto-striding concurrency appropriately Dave Chinner
  2013-12-12 18:29   ` Christoph Hellwig
@ 2013-12-12 18:59   ` Brian Foster
  1 sibling, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:59 UTC (permalink / raw)
  To: Dave Chinner, xfs

On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's possible to have filesystems with hundreds of AGs on systems
> with little concurrency and resources. In this case, we can easily
> exhaust memory and fail to create threads and have all sorts of
> interesting problems.
> 
> xfs/250 can cause this to occur, with failures like:
> 
>         - agno = 707
>         - agno = 692
> fatal error -- cannot create worker threads, error = [11] Resource temporarily unavailable
> 
> And this:
> 
>         - agno = 484
>         - agno = 782
> failed to create prefetch thread: Resource temporarily unavailable
> 
> Because it's trying to create more threads than a poor little 512MB
> single CPU ia32 box can handle.
> 
> So, limit concurrency to a maximum of numcpus * 8 to prevent this.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  include/libxfs.h    |  1 +
>  libxfs/init.h       |  1 -
>  repair/xfs_repair.c | 18 +++++++++++++++++-
>  3 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/include/libxfs.h b/include/libxfs.h
> index 4bf331c..39e3d85 100644
> --- a/include/libxfs.h
> +++ b/include/libxfs.h
> @@ -144,6 +144,7 @@ extern void	libxfs_device_close (dev_t);
>  extern int	libxfs_device_alignment (void);
>  extern void	libxfs_report(FILE *);
>  extern void	platform_findsizes(char *path, int fd, long long *sz, int *bsz);
> +extern int	platform_nproc(void);
>  
>  /* check or write log footer: specify device, log size in blocks & uuid */
>  typedef xfs_caddr_t (libxfs_get_block_t)(xfs_caddr_t, int, void *);
> diff --git a/libxfs/init.h b/libxfs/init.h
> index f0b8cb6..112febb 100644
> --- a/libxfs/init.h
> +++ b/libxfs/init.h
> @@ -31,7 +31,6 @@ extern char *platform_findrawpath (char *path);
>  extern char *platform_findblockpath (char *path);
>  extern int platform_direct_blockdev (void);
>  extern int platform_align_blockdev (void);
> -extern int platform_nproc(void);
>  extern unsigned long platform_physmem(void);	/* in kilobytes */
>  extern int platform_has_uuid;
>  
> diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> index 7beffcb..0d006ae 100644
> --- a/repair/xfs_repair.c
> +++ b/repair/xfs_repair.c
> @@ -627,13 +627,29 @@ main(int argc, char **argv)
>  	 * to target these for an increase in thread count. Hence a stride value
>  	 * of 15 is chosen to ensure we get at least 2 AGs being scanned at once
>  	 * on such filesystems.
> +	 *
> +	 * Limit the maximum thread count based on the available CPU power that
> +	 * is available. If we use too many threads, we might run out of memory
> +	 * and CPU power before we run out of IO concurrency.
>  	 */
>  	if (!ag_stride && glob_agcount >= 16 && do_prefetch)
>  		ag_stride = 15;
>  
>  	if (ag_stride) {
> +		int max_threads = platform_nproc() * 8;
> +
>  		thread_count = (glob_agcount + ag_stride - 1) / ag_stride;
> -		thread_init();
> +		while (thread_count > max_threads) {
> +			ag_stride *= 2;
> +			thread_count = (glob_agcount + ag_stride - 1) /
> +								ag_stride;
> +		}
> +		if (thread_count > 0)
> +			thread_init();
> +		else {
> +			thread_count = 1;
> +			ag_stride = 0;
> +		}
>  	}
>  
>  	if (ag_stride && report_interval) {
> 


* Re: [PATCH 2/5] repair: per AG locks contend for cachelines
  2013-12-12 18:58   ` Brian Foster
@ 2013-12-12 20:46     ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 20:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Dec 12, 2013 at 01:58:59PM -0500, Brian Foster wrote:
> On 12/12/2013 02:22 AM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The per-ag locks used to protect per-ag block lists are located in a tightly
> > packed array. That means that they share cachelines, so separate them out into
> > separate 64 byte regions in the array.
> > 
> > pahole confirms the padding is correctly applied:
> > 
> > struct aglock {
> >         pthread_mutex_t            lock;                 /*     0    40 */
> > 
> >         /* size: 64, cachelines: 1, members: 1 */
> >         /* padding: 24 */
> > };
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Out of curiosity, any data on this one?

There's a small improvement but it's within the margins of error. At
higher levels of concurrency it will make a bigger difference, but
right now we can't get to a thread per AG (where the contention
would really show) because of other scalability limitations.

e.g. mmap_sem contention becomes a limiting factor at 20-25 threads
because of page faults occurring on hash cache lookups (read lock)
occurring in parallel with memory allocation (libc mprotect calls
take a write lock). So more scalability will come from a more
CPU-cache-friendly cache index - I'm tempted to drop in per-AG rb trees and
see where that leads (now where would I have got that idea?)....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/5] repair: phase 6 is trivially parallelisable
  2013-12-12 18:43   ` Christoph Hellwig
@ 2013-12-12 20:53     ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 20:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Dec 12, 2013 at 10:43:46AM -0800, Christoph Hellwig wrote:
> >  static void
> >  add_dotdot_update(
> > @@ -64,12 +65,14 @@ add_dotdot_update(
> >  		do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
> >  			sizeof(dotdot_update_t));
> >  
> > +	pthread_mutex_lock(&dotdot_lock);
> >  	dir->next = dotdot_update_list;
> >  	dir->irec = irec;
> >  	dir->agno = agno;
> >  	dir->ino_offset = ino_offset;
> >  
> >  	dotdot_update_list = dir;
> > +	pthread_mutex_unlock(&dotdot_lock);
> 
> Would be nice to make this use a list_head if you touch it anyway.
> (As a separate patch)
> 
> >  static void
> >  traverse_ags(
> > -	xfs_mount_t 		*mp)
> > +	xfs_mount_t		*mp)
> 
> Not quite sure what actually changed here, but if you touch it anyway
> you might as well use the struct version..

Whitespace after xfs_mount_t, judging by the highlighting I see in
the editor right now.

> > +	if (!ag_stride) {
> > +		work_queue_t	queue;
> > +
> > +		queue.mp = mp;
> > +		pf_args[0] = start_inode_prefetch(0, 1, NULL);
> > +		for (i = 0; i < glob_agcount; i++) {
> > +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> > +					pf_args[i & 1]);
> > +			traverse_function(&queue, i, pf_args[i & 1]);
> > +		}
> > +		return;
> >  	}
> > +
> > +	/*
> > +	 * create one worker thread for each segment of the volume
> > +	 */
> > +	queues = malloc(thread_count * sizeof(work_queue_t));
> > +	for (i = 0, agno = 0; i < thread_count; i++) {
> > +		create_work_queue(&queues[i], mp, 1);
> > +		pf_args[0] = NULL;
> > +		for (j = 0; j < ag_stride && agno < glob_agcount; j++, agno++) {
> > +			pf_args[0] = start_inode_prefetch(agno, 1, pf_args[0]);
> > +			queue_work(&queues[i], traverse_function, agno,
> > +				   pf_args[0]);
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * wait for workers to complete
> > +	 */
> > +	for (i = 0; i < thread_count; i++)
> > +		destroy_work_queue(&queues[i]);
> > +	free(queues);
> 
> 
> This is the third copy of this code block, might make sense to
> consolidate it.

Agreed, just haven't got to it.

> Btw, does anyone remember why we have the libxfs_bcache_overflowed()
> special case in phase4, but not anywhere else?

I recall something about memory consumption, but I doubt that code
can even trigger given that if we get to overflow conditions we
immediately double the cache size and so libxfs_bcache_overflowed()
will never see an overflow condition....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
  2013-12-12 18:59   ` Brian Foster
@ 2013-12-12 20:56     ` Dave Chinner
  2013-12-13 14:23       ` Brian Foster
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 20:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Dec 12, 2013 at 01:59:26PM -0500, Brian Foster wrote:
> On 12/12/2013 02:22 AM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The hashkey calculation is very simplistic, and throws away an amount
> > of entropy that should be folded into the hash. The result is
> > sub-optimal distribution across the hash tables. For example, with a
> > default 512 entry table, phase 2 results in this:
> > 
> ...
> > Modify the hash to be something more workable - steal the linux
> > kernel inode hash calculation and try that:
> > 
> ...
> > 
> > Kinda says it all, really...
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Results look nice and the algorithm seems to match the kernel variant,
> but what about the 32-bit alternate prime/cache line values? Safe to
> leave out..?

The buffer cache uses a 64 bit key, regardless of the platform.
Therefore the 64 bit variant is always needed. The kernel inode hash
uses a 32 bit key on 32 bit systems, which is why there are two
variants for it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/5] repair: limit auto-striding concurrency appropriately
  2013-12-12 18:29   ` Christoph Hellwig
@ 2013-12-12 21:00     ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 21:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Dec 12, 2013 at 10:29:08AM -0800, Christoph Hellwig wrote:
> Looks reasonable, but it might be worth to add a blurb on how you
> arrived at the magic 8 threads per cpu.

OK - that was simply that processing with 8 threads is more than
enough to saturate a single CPU on fast devices so we don't need any
more parallelism in this case. And when you have large slow devices,
8 threads is more than enough to saturate the IO subsystem....

I'll add a comment mentioning that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
  2013-12-12 20:56     ` Dave Chinner
@ 2013-12-13 14:23       ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-12-13 14:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 12/12/2013 03:56 PM, Dave Chinner wrote:
> On Thu, Dec 12, 2013 at 01:59:26PM -0500, Brian Foster wrote:
>> On 12/12/2013 02:22 AM, Dave Chinner wrote:
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> The hashkey calculation is very simplistic, and throws away an amount
>>> of entropy that should be folded into the hash. The result is
>>> sub-optimal distribution across the hash tables. For example, with a
>>> default 512 entry table, phase 2 results in this:
>>>
>> ...
>>> Modify the hash to be something more workable - steal the linux
>>> kernel inode hash calculation and try that:
>>>
>> ...
>>>
>>> Kinda says it all, really...
>>>
>>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>>> ---
>>
>> Results look nice and the algorithm seems to match the kernel variant,
>> but what about the 32-bit alternate prime/cache line values? Safe to
>> leave out..?
> 
> The buffer cache uses a 64 bit key, regardless of the platform.
> Therefore the 64 bit variant is always needed. The kernel inode hash
> uses a 32 bit key on 32 bit systems, which is why there are two
> variants for it.
> 

Ah.. xfs_bufkey->blkno is an xfs_daddr_t, which is an __int64_t. Thanks.

Reviewed-by: Brian Foster <bfoster@redhat.com>

> Cheers,
> 
> Dave.
> 


end of thread, other threads:[~2013-12-13 14:23 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-12  7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
2013-12-12  7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
2013-12-12 18:26   ` Christoph Hellwig
2013-12-12 18:58   ` Brian Foster
2013-12-12  7:22 ` [PATCH 2/5] repair: per AG locks contend for cachelines Dave Chinner
2013-12-12 18:27   ` Christoph Hellwig
2013-12-12 18:58   ` Brian Foster
2013-12-12 20:46     ` Dave Chinner
2013-12-12  7:22 ` [PATCH 3/5] repair: phase 6 is trivially parallelisable Dave Chinner
2013-12-12 18:43   ` Christoph Hellwig
2013-12-12 20:53     ` Dave Chinner
2013-12-12 18:59   ` Brian Foster
2013-12-12  7:22 ` [PATCH 4/5] libxfs: buffer cache hashing is suboptimal Dave Chinner
2013-12-12 18:28   ` Christoph Hellwig
2013-12-12 18:59   ` Brian Foster
2013-12-12 20:56     ` Dave Chinner
2013-12-13 14:23       ` Brian Foster
2013-12-12  7:22 ` [PATCH 5/5] repair: limit auto-striding concurrency appropriately Dave Chinner
2013-12-12 18:29   ` Christoph Hellwig
2013-12-12 21:00     ` Dave Chinner
2013-12-12 18:59   ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox