* [PATCH 1/5] repair: translation lookups limit scalability
2013-12-12 7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
@ 2013-12-12 7:22 ` Dave Chinner
2013-12-12 18:26 ` Christoph Hellwig
2013-12-12 18:58 ` Brian Foster
2013-12-12 7:22 ` [PATCH 2/5] repair: per AG locks contend for cachelines Dave Chinner
` (3 subsequent siblings)
4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 7:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
A bit of perf magic showed that scalability was limited to 3-4
concurrent threads due to contention on a lock inside something
called __dcigettext(). That's in some library somewhere that repair is
linked against, and it turns out to be inside the translation
infrastructure supporting the _() string mechanism:
# Samples: 34K of event 'cs'
# Event count (approx.): 495567
#
# Overhead Command Shared Object Symbol
# ........ ............. ................. ..............
#
60.30% xfs_repair [kernel.kallsyms] [k] __schedule
|
--- 0x63fffff9c
process_bmbt_reclist_int
|
|--39.95%-- __dcigettext
| __lll_lock_wait
| system_call_fastpath
| SyS_futex
| do_futex
| futex_wait
| futex_wait_queue_me
| schedule
| __schedule
|
|--8.91%-- __lll_lock_wait
| system_call_fastpath
| SyS_futex
| do_futex
| futex_wait
| futex_wait_queue_me
| schedule
| __schedule
--51.13%-- [...]
Runtime of an unpatched xfs_repair is roughly:
XFS_REPAIR Summary Fri Dec 6 11:15:50 2013
Phase Start End Duration
Phase 1: 12/06 10:56:21 12/06 10:56:21
Phase 2: 12/06 10:56:21 12/06 10:56:23 2 seconds
Phase 3: 12/06 10:56:23 12/06 11:01:31 5 minutes, 8 seconds
Phase 4: 12/06 11:01:31 12/06 11:07:08 5 minutes, 37 seconds
Phase 5: 12/06 11:07:08 12/06 11:07:09 1 second
Phase 6: 12/06 11:07:09 12/06 11:15:49 8 minutes, 40 seconds
Phase 7: 12/06 11:15:49 12/06 11:15:50 1 second
Total run time: 19 minutes, 29 seconds
Patched version:
Phase Start End Duration
Phase 1: 12/06 10:36:29 12/06 10:36:29
Phase 2: 12/06 10:36:29 12/06 10:36:31 2 seconds
Phase 3: 12/06 10:36:31 12/06 10:40:08 3 minutes, 37 seconds
Phase 4: 12/06 10:40:08 12/06 10:43:42 3 minutes, 34 seconds
Phase 5: 12/06 10:43:42 12/06 10:43:42
Phase 6: 12/06 10:43:42 12/06 10:50:28 6 minutes, 46 seconds
Phase 7: 12/06 10:50:28 12/06 10:50:29 1 second
Total run time: 14 minutes
Big win!
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/dinode.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/repair/dinode.c b/repair/dinode.c
index 7469fc8..8f14a9e 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -522,6 +522,11 @@ _("illegal state %d in rt block map %" PRIu64 "\n"),
* file overlaps with any duplicate extents (in the
* duplicate extent list).
*/
+static char *__forkname_data;
+static char *__forkname_attr;
+static char *__ftype_real_time;
+static char *__ftype_regular;
+
static int
process_bmbt_reclist_int(
xfs_mount_t *mp,
@@ -552,15 +557,30 @@ process_bmbt_reclist_int(
xfs_agnumber_t locked_agno = -1;
int error = 1;
+ /*
+ * gettext lookups for translations of strings use mutexes internally to
+ * the library. Hence when we come through here doing parallel scans in
+ * multiple AGs, then all do concurrent text conversions and serialise
+ * on the translation string lookups. Let's avoid doing repeated lookups
+ * by making them static variables and only assigning the translation
+ * once.
+ */
+ if (!__forkname_data) {
+ __forkname_data = _("data");
+ __forkname_attr = _("attr");
+ __ftype_real_time = _("real-time");
+ __ftype_regular = _("regular");
+ }
+
if (whichfork == XFS_DATA_FORK)
- forkname = _("data");
+ forkname = __forkname_data;
else
- forkname = _("attr");
+ forkname = __forkname_attr;
if (type == XR_INO_RTDATA)
- ftype = _("real-time");
+ ftype = __ftype_real_time;
else
- ftype = _("regular");
+ ftype = __ftype_regular;
for (i = 0; i < *numrecs; i++) {
libxfs_bmbt_disk_get_all(rp + i, &irec);
--
1.8.4.rc3
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH 1/5] repair: translation lookups limit scalability
2013-12-12 7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
@ 2013-12-12 18:26 ` Christoph Hellwig
2013-12-12 18:58 ` Brian Foster
1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:26 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Thu, Dec 12, 2013 at 06:22:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> A bit of perf magic showed that scalability was limited to 3-4
> concurrent threads due to contention on a lock inside something
> called __dcigettext(). That's in some library somewhere that repair is
> linked against, and it turns out to be inside the translation
> infrastructure supporting the _() string mechanism:
>
> [...]
>
> diff --git a/repair/dinode.c b/repair/dinode.c
> index 7469fc8..8f14a9e 100644
> --- a/repair/dinode.c
> +++ b/repair/dinode.c
> @@ -522,6 +522,11 @@ _("illegal state %d in rt block map %" PRIu64 "\n"),
> * file overlaps with any duplicate extents (in the
> * duplicate extent list).
> */
> +static char *__forkname_data;
> +static char *__forkname_attr;
> +static char *__ftype_real_time;
> +static char *__ftype_regular;
Double underscores for symbol names aren't really nice for userspace
programs as they are in the implementation namespace. I also don't
really see a need for the prefixes here.
> + if (!__forkname_data) {
> + __forkname_data = _("data");
> + __forkname_attr = _("attr");
> + __ftype_real_time = _("real-time");
> + __ftype_regular = _("regular");
> + }
Maybe just do these assignments early on during program initialization?
* Re: [PATCH 1/5] repair: translation lookups limit scalability
2013-12-12 7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
2013-12-12 18:26 ` Christoph Hellwig
@ 2013-12-12 18:58 ` Brian Foster
1 sibling, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:58 UTC (permalink / raw)
To: Dave Chinner, xfs
On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> A bit of perf magic showed that scalability was limited to 3-4
> concurrent threads due to contention on a lock inside something
> called __dcigettext(). That's in some library somewhere that repair is
> linked against, and it turns out to be inside the translation
> infrastructure supporting the _() string mechanism:
>
> [...]
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Indeed! Neat fix. When looking at the code, I wondered whether the same
type of thing would show up in process_btinode() or scan_bmapbt() (e.g.,
defining 'forkname'). Perhaps with smaller (btree fmt) files? Or perhaps
the ratio of inodes to extent records is such that it simply isn't a
potential bottleneck. Anyways:
Reviewed-by: Brian Foster <bfoster@redhat.com>
* [PATCH 2/5] repair: per AG locks contend for cachelines
2013-12-12 7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
2013-12-12 7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
@ 2013-12-12 7:22 ` Dave Chinner
2013-12-12 18:27 ` Christoph Hellwig
2013-12-12 18:58 ` Brian Foster
2013-12-12 7:22 ` [PATCH 3/5] repair: phase 6 is trivially parallelisable Dave Chinner
` (2 subsequent siblings)
4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 7:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The per-ag locks used to protect the per-ag block lists are located in a tightly
packed array. That means they share cachelines, so separate them out into
separate 64 byte regions in the array.
pahole confirms the padding is correctly applied:
struct aglock {
pthread_mutex_t lock; /* 0 40 */
/* size: 64, cachelines: 1, members: 1 */
/* padding: 24 */
};
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/dino_chunks.c | 24 ++++++++++++------------
repair/dinode.c | 6 +++---
repair/globals.h | 5 ++++-
repair/incore.c | 4 ++--
repair/scan.c | 4 ++--
5 files changed, 23 insertions(+), 20 deletions(-)
diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
index d3c2236..02d32d8 100644
--- a/repair/dino_chunks.c
+++ b/repair/dino_chunks.c
@@ -141,7 +141,7 @@ verify_inode_chunk(xfs_mount_t *mp,
if (check_aginode_block(mp, agno, agino) == 0)
return 0;
- pthread_mutex_lock(&ag_locks[agno]);
+ pthread_mutex_lock(&ag_locks[agno].lock);
state = get_bmap(agno, agbno);
switch (state) {
@@ -166,7 +166,7 @@ verify_inode_chunk(xfs_mount_t *mp,
_("inode block %d/%d multiply claimed, (state %d)\n"),
agno, agbno, state);
set_bmap(agno, agbno, XR_E_MULT);
- pthread_mutex_unlock(&ag_locks[agno]);
+ pthread_mutex_unlock(&ag_locks[agno].lock);
return(0);
default:
do_warn(
@@ -176,7 +176,7 @@ verify_inode_chunk(xfs_mount_t *mp,
break;
}
- pthread_mutex_unlock(&ag_locks[agno]);
+ pthread_mutex_unlock(&ag_locks[agno].lock);
start_agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
*start_ino = XFS_AGINO_TO_INO(mp, agno, start_agino);
@@ -424,7 +424,7 @@ verify_inode_chunk(xfs_mount_t *mp,
* user data -- we're probably here as a result of a directory
* entry or an iunlinked pointer
*/
- pthread_mutex_lock(&ag_locks[agno]);
+ pthread_mutex_lock(&ag_locks[agno].lock);
for (cur_agbno = chunk_start_agbno;
cur_agbno < chunk_stop_agbno;
cur_agbno += blen) {
@@ -438,7 +438,7 @@ verify_inode_chunk(xfs_mount_t *mp,
_("inode block %d/%d multiply claimed, (state %d)\n"),
agno, cur_agbno, state);
set_bmap_ext(agno, cur_agbno, blen, XR_E_MULT);
- pthread_mutex_unlock(&ag_locks[agno]);
+ pthread_mutex_unlock(&ag_locks[agno].lock);
return 0;
case XR_E_INO:
do_error(
@@ -449,7 +449,7 @@ verify_inode_chunk(xfs_mount_t *mp,
break;
}
}
- pthread_mutex_unlock(&ag_locks[agno]);
+ pthread_mutex_unlock(&ag_locks[agno].lock);
/*
* ok, chunk is good. put the record into the tree if required,
@@ -472,7 +472,7 @@ verify_inode_chunk(xfs_mount_t *mp,
set_inode_used(irec_p, agino - start_agino);
- pthread_mutex_lock(&ag_locks[agno]);
+ pthread_mutex_lock(&ag_locks[agno].lock);
for (cur_agbno = chunk_start_agbno;
cur_agbno < chunk_stop_agbno;
@@ -505,7 +505,7 @@ verify_inode_chunk(xfs_mount_t *mp,
break;
}
}
- pthread_mutex_unlock(&ag_locks[agno]);
+ pthread_mutex_unlock(&ag_locks[agno].lock);
return(ino_cnt);
}
@@ -736,7 +736,7 @@ process_inode_chunk(
/*
* mark block as an inode block in the incore bitmap
*/
- pthread_mutex_lock(&ag_locks[agno]);
+ pthread_mutex_lock(&ag_locks[agno].lock);
state = get_bmap(agno, agbno);
switch (state) {
case XR_E_INO: /* already marked */
@@ -755,7 +755,7 @@ process_inode_chunk(
XFS_AGB_TO_FSB(mp, agno, agbno), state);
break;
}
- pthread_mutex_unlock(&ag_locks[agno]);
+ pthread_mutex_unlock(&ag_locks[agno].lock);
for (;;) {
/*
@@ -914,7 +914,7 @@ process_inode_chunk(
ibuf_offset = 0;
agbno++;
- pthread_mutex_lock(&ag_locks[agno]);
+ pthread_mutex_lock(&ag_locks[agno].lock);
state = get_bmap(agno, agbno);
switch (state) {
case XR_E_INO: /* already marked */
@@ -935,7 +935,7 @@ process_inode_chunk(
XFS_AGB_TO_FSB(mp, agno, agbno), state);
break;
}
- pthread_mutex_unlock(&ag_locks[agno]);
+ pthread_mutex_unlock(&ag_locks[agno].lock);
} else if (irec_offset == XFS_INODES_PER_CHUNK) {
/*
diff --git a/repair/dinode.c b/repair/dinode.c
index 8f14a9e..77bbe40 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -700,8 +700,8 @@ _("Fatal error: inode %" PRIu64 " - blkmap_set_ext(): %s\n"
ebno = agbno + irec.br_blockcount;
if (agno != locked_agno) {
if (locked_agno != -1)
- pthread_mutex_unlock(&ag_locks[locked_agno]);
- pthread_mutex_lock(&ag_locks[agno]);
+ pthread_mutex_unlock(&ag_locks[locked_agno].lock);
+ pthread_mutex_lock(&ag_locks[agno].lock);
locked_agno = agno;
}
@@ -770,7 +770,7 @@ _("illegal state %d in block map %" PRIu64 "\n"),
error = 0;
done:
if (locked_agno != -1)
- pthread_mutex_unlock(&ag_locks[locked_agno]);
+ pthread_mutex_unlock(&ag_locks[locked_agno].lock);
if (i != *numrecs) {
ASSERT(i < *numrecs);
diff --git a/repair/globals.h b/repair/globals.h
index aef8b79..cbb2ce7 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -186,7 +186,10 @@ EXTERN xfs_extlen_t sb_inoalignmt;
EXTERN __uint32_t sb_unit;
EXTERN __uint32_t sb_width;
-EXTERN pthread_mutex_t *ag_locks;
+struct aglock {
+ pthread_mutex_t lock __attribute__((__aligned__(64)));
+};
+EXTERN struct aglock *ag_locks;
EXTERN int report_interval;
EXTERN __uint64_t *prog_rpt_done;
diff --git a/repair/incore.c b/repair/incore.c
index 3590464..a8d497e 100644
--- a/repair/incore.c
+++ b/repair/incore.c
@@ -294,13 +294,13 @@ init_bmaps(xfs_mount_t *mp)
if (!ag_bmap)
do_error(_("couldn't allocate block map btree roots\n"));
- ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(pthread_mutex_t));
+ ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(struct aglock));
if (!ag_locks)
do_error(_("couldn't allocate block map locks\n"));
for (i = 0; i < mp->m_sb.sb_agcount; i++) {
btree_init(&ag_bmap[i]);
- pthread_mutex_init(&ag_locks[i], NULL);
+ pthread_mutex_init(&ag_locks[i].lock, NULL);
}
init_rt_bmap(mp);
diff --git a/repair/scan.c b/repair/scan.c
index 49ed194..f9afdb1 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -273,7 +273,7 @@ _("bad back (left) sibling pointer (saw %llu should be NULL (0))\n"
agno = XFS_FSB_TO_AGNO(mp, bno);
agbno = XFS_FSB_TO_AGBNO(mp, bno);
- pthread_mutex_lock(&ag_locks[agno]);
+ pthread_mutex_lock(&ag_locks[agno].lock);
state = get_bmap(agno, agbno);
switch (state) {
case XR_E_UNKNOWN:
@@ -319,7 +319,7 @@ _("bad state %d, inode %" PRIu64 " bmap block 0x%" PRIx64 "\n"),
state, ino, bno);
break;
}
- pthread_mutex_unlock(&ag_locks[agno]);
+ pthread_mutex_unlock(&ag_locks[agno].lock);
} else {
/*
* attribute fork for realtime files is in the regular
--
1.8.4.rc3
* Re: [PATCH 2/5] repair: per AG locks contend for cachelines
2013-12-12 7:22 ` [PATCH 2/5] repair: per AG locks contend for cachelines Dave Chinner
@ 2013-12-12 18:27 ` Christoph Hellwig
2013-12-12 18:58 ` Brian Foster
1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:27 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 2/5] repair: per AG locks contend for cachelines
2013-12-12 7:22 ` [PATCH 2/5] repair: per AG locks contend for cachelines Dave Chinner
2013-12-12 18:27 ` Christoph Hellwig
@ 2013-12-12 18:58 ` Brian Foster
2013-12-12 20:46 ` Dave Chinner
1 sibling, 1 reply; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:58 UTC (permalink / raw)
To: Dave Chinner, xfs
On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The per-ag locks used to protect the per-ag block lists are located in a tightly
> packed array. That means they share cachelines, so separate them out into
> separate 64 byte regions in the array.
>
> pahole confirms the padding is correctly applied:
>
> struct aglock {
> pthread_mutex_t lock; /* 0 40 */
>
> /* size: 64, cachelines: 1, members: 1 */
> /* padding: 24 */
> };
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Out of curiosity, any data on this one?
Reviewed-by: Brian Foster <bfoster@redhat.com>
* Re: [PATCH 2/5] repair: per AG locks contend for cachelines
2013-12-12 18:58 ` Brian Foster
@ 2013-12-12 20:46 ` Dave Chinner
0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 20:46 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Thu, Dec 12, 2013 at 01:58:59PM -0500, Brian Foster wrote:
> On 12/12/2013 02:22 AM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The per-ag locks used to protect the per-ag block lists are located in a tightly
> > packed array. That means they share cachelines, so separate them out into
> > separate 64 byte regions in the array.
> >
> > pahole confirms the padding is correctly applied:
> >
> > struct aglock {
> > pthread_mutex_t lock; /* 0 40 */
> >
> > /* size: 64, cachelines: 1, members: 1 */
> > /* padding: 24 */
> > };
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
>
> Out of curiosity, any data on this one?
There's a small improvement but it's within the margin of error. At
higher levels of concurrency it will make a bigger difference, but
right now we can't get to a thread per AG (where the contention
would really show) because of other scalability limitations,
e.g. mmap_sem contention becomes a limiting factor at 20-25 threads
as page faults on hash cache lookups (read lock) occur
in parallel with memory allocation (libc mprotect calls
take a write lock). So more scalability will come from a more
CPU-cache-friendly cache index - I'm tempted to drop in per-AG rb
trees and see where that leads (now where would I have got that idea?)....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 3/5] repair: phase 6 is trivially parallelisable
2013-12-12 7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
2013-12-12 7:22 ` [PATCH 1/5] repair: translation lookups limit scalability Dave Chinner
2013-12-12 7:22 ` [PATCH 2/5] repair: per AG locks contend for cachelines Dave Chinner
@ 2013-12-12 7:22 ` Dave Chinner
2013-12-12 18:43 ` Christoph Hellwig
2013-12-12 18:59 ` Brian Foster
2013-12-12 7:22 ` [PATCH 4/5] libxfs: buffer cache hashing is suboptimal Dave Chinner
2013-12-12 7:22 ` [PATCH 5/5] repair: limit auto-striding concurrency appropriately Dave Chinner
4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 7:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Phase 6 is currently single threaded, iterating AGs one at a
time. When there are hundreds of AGs that need scanning, this takes
a long time. Given that all the objects the AG traversal works
on are per-AG, we can simply parallelise this into strided AG
processing like phases 3 and 4.
Unpatched: 8m40s
patched: 1m10s (7 threads)
Big win!
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/phase6.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 47 insertions(+), 9 deletions(-)
diff --git a/repair/phase6.c b/repair/phase6.c
index d2d4a44..d82f900 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -51,6 +51,7 @@ typedef struct dotdot_update {
static dotdot_update_t *dotdot_update_list;
static int dotdot_update;
+static pthread_mutex_t dotdot_lock;
static void
add_dotdot_update(
@@ -64,12 +65,14 @@ add_dotdot_update(
do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
sizeof(dotdot_update_t));
+ pthread_mutex_lock(&dotdot_lock);
dir->next = dotdot_update_list;
dir->irec = irec;
dir->agno = agno;
dir->ino_offset = ino_offset;
dotdot_update_list = dir;
+ pthread_mutex_unlock(&dotdot_lock);
}
/*
@@ -2918,34 +2921,68 @@ update_missing_dotdot_entries(
* these entries parents were updated, rebuild them again
* set dotdot_update flag so processing routines do not count links
*/
+ pthread_mutex_lock(&dotdot_lock);
dotdot_update = 1;
while (dotdot_update_list) {
dir = dotdot_update_list;
dotdot_update_list = dir->next;
+ dir->next = NULL;
+ pthread_mutex_unlock(&dotdot_lock);
+
process_dir_inode(mp, dir->agno, dir->irec, dir->ino_offset);
free(dir);
+
+ pthread_mutex_lock(&dotdot_lock);
}
+ pthread_mutex_unlock(&dotdot_lock);
}
static void
traverse_ags(
- xfs_mount_t *mp)
+ xfs_mount_t *mp)
{
- int i;
- work_queue_t queue;
+ int i, j;
+ xfs_agnumber_t agno;
+ work_queue_t *queues;
prefetch_args_t *pf_args[2];
/*
* we always do prefetch for phase 6 as it will fill in the gaps
* not read during phase 3 prefetch.
*/
- queue.mp = mp;
- pf_args[0] = start_inode_prefetch(0, 1, NULL);
- for (i = 0; i < glob_agcount; i++) {
- pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
- pf_args[i & 1]);
- traverse_function(&queue, i, pf_args[i & 1]);
+ if (!ag_stride) {
+ work_queue_t queue;
+
+ queue.mp = mp;
+ pf_args[0] = start_inode_prefetch(0, 1, NULL);
+ for (i = 0; i < glob_agcount; i++) {
+ pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
+ pf_args[i & 1]);
+ traverse_function(&queue, i, pf_args[i & 1]);
+ }
+ return;
}
+
+ /*
+ * create one worker thread for each segment of the volume
+ */
+ queues = malloc(thread_count * sizeof(work_queue_t));
+ for (i = 0, agno = 0; i < thread_count; i++) {
+ create_work_queue(&queues[i], mp, 1);
+ pf_args[0] = NULL;
+ for (j = 0; j < ag_stride && agno < glob_agcount; j++, agno++) {
+ pf_args[0] = start_inode_prefetch(agno, 1, pf_args[0]);
+ queue_work(&queues[i], traverse_function, agno,
+ pf_args[0]);
+ }
+ }
+
+ /*
+ * wait for workers to complete
+ */
+ for (i = 0; i < thread_count; i++)
+ destroy_work_queue(&queues[i]);
+ free(queues);
}
void
@@ -2957,6 +2994,7 @@ phase6(xfs_mount_t *mp)
memset(&zerocr, 0, sizeof(struct cred));
memset(&zerofsx, 0, sizeof(struct fsxattr));
orphanage_ino = 0;
+ pthread_mutex_init(&dotdot_lock, NULL);
do_log(_("Phase 6 - check inode connectivity...\n"));
--
1.8.4.rc3
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 3/5] repair: phase 6 is trivially parallelisable
2013-12-12 7:22 ` [PATCH 3/5] repair: phase 6 is trivially parallelisable Dave Chinner
@ 2013-12-12 18:43 ` Christoph Hellwig
2013-12-12 20:53 ` Dave Chinner
2013-12-12 18:59 ` Brian Foster
1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:43 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
> static void
> add_dotdot_update(
> @@ -64,12 +65,14 @@ add_dotdot_update(
> do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
> sizeof(dotdot_update_t));
>
> + pthread_mutex_lock(&dotdot_lock);
> dir->next = dotdot_update_list;
> dir->irec = irec;
> dir->agno = agno;
> dir->ino_offset = ino_offset;
>
> dotdot_update_list = dir;
> + pthread_mutex_unlock(&dotdot_lock);
Would be nice to make this use a list_head if you touch it anyway.
(As a separate patch)
> static void
> traverse_ags(
> - xfs_mount_t *mp)
> + xfs_mount_t *mp)
Not quite sure what actually changed here, but if you touch it anyway
you might as well use the struct version..
> + if (!ag_stride) {
> + work_queue_t queue;
> +
> + queue.mp = mp;
> + pf_args[0] = start_inode_prefetch(0, 1, NULL);
> + for (i = 0; i < glob_agcount; i++) {
> + pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> + pf_args[i & 1]);
> + traverse_function(&queue, i, pf_args[i & 1]);
> + }
> + return;
> }
> +
> + /*
> + * create one worker thread for each segment of the volume
> + */
> + queues = malloc(thread_count * sizeof(work_queue_t));
> + for (i = 0, agno = 0; i < thread_count; i++) {
> + create_work_queue(&queues[i], mp, 1);
> + pf_args[0] = NULL;
> + for (j = 0; j < ag_stride && agno < glob_agcount; j++, agno++) {
> + pf_args[0] = start_inode_prefetch(agno, 1, pf_args[0]);
> + queue_work(&queues[i], traverse_function, agno,
> + pf_args[0]);
> + }
> + }
> +
> + /*
> + * wait for workers to complete
> + */
> + for (i = 0; i < thread_count; i++)
> + destroy_work_queue(&queues[i]);
> + free(queues);
This is the third copy of this code block, might make sense to
consolidate it.
Btw, does anyone remember why we have the libxfs_bcache_overflowed()
special case in phase4, but not anywhere else?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/5] repair: phase 6 is trivially parallelisable
2013-12-12 18:43 ` Christoph Hellwig
@ 2013-12-12 20:53 ` Dave Chinner
0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 20:53 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Thu, Dec 12, 2013 at 10:43:46AM -0800, Christoph Hellwig wrote:
> > static void
> > add_dotdot_update(
> > @@ -64,12 +65,14 @@ add_dotdot_update(
> > do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
> > sizeof(dotdot_update_t));
> >
> > + pthread_mutex_lock(&dotdot_lock);
> > dir->next = dotdot_update_list;
> > dir->irec = irec;
> > dir->agno = agno;
> > dir->ino_offset = ino_offset;
> >
> > dotdot_update_list = dir;
> > + pthread_mutex_unlock(&dotdot_lock);
>
> Would be nice to make this use a list_head if you touch it anyway.
> (As a separate patch)
>
> > static void
> > traverse_ags(
> > - xfs_mount_t *mp)
> > + xfs_mount_t *mp)
>
> Not quite sure what actually changed here, but if you touch it anyway
> you might as well use the struct version..
Whitespace after xfs_mount_t, judging by the highlighting I see in
the editor right now.
> > + if (!ag_stride) {
> > + work_queue_t queue;
> > +
> > + queue.mp = mp;
> > + pf_args[0] = start_inode_prefetch(0, 1, NULL);
> > + for (i = 0; i < glob_agcount; i++) {
> > + pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> > + pf_args[i & 1]);
> > + traverse_function(&queue, i, pf_args[i & 1]);
> > + }
> > + return;
> > }
> > +
> > + /*
> > + * create one worker thread for each segment of the volume
> > + */
> > + queues = malloc(thread_count * sizeof(work_queue_t));
> > + for (i = 0, agno = 0; i < thread_count; i++) {
> > + create_work_queue(&queues[i], mp, 1);
> > + pf_args[0] = NULL;
> > + for (j = 0; j < ag_stride && agno < glob_agcount; j++, agno++) {
> > + pf_args[0] = start_inode_prefetch(agno, 1, pf_args[0]);
> > + queue_work(&queues[i], traverse_function, agno,
> > + pf_args[0]);
> > + }
> > + }
> > +
> > + /*
> > + * wait for workers to complete
> > + */
> > + for (i = 0; i < thread_count; i++)
> > + destroy_work_queue(&queues[i]);
> > + free(queues);
>
>
> This is the third copy of this code block, might make sense to
> consolidate it.
Agreed, just haven't got to it.
> Btw, does anyone remember why we have the libxfs_bcache_overflowed()
> special case in phase4, but not anywhere else?
I recall something about memory consumption, but I doubt that code
can even trigger given that if we get to overflow conditions we
immediately double the cache size and so libxfs_bcache_overflowed()
will never see an overflow condition....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/5] repair: phase 6 is trivially parallelisable
2013-12-12 7:22 ` [PATCH 3/5] repair: phase 6 is trivially parallelisable Dave Chinner
2013-12-12 18:43 ` Christoph Hellwig
@ 2013-12-12 18:59 ` Brian Foster
1 sibling, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:59 UTC (permalink / raw)
To: Dave Chinner, xfs
On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Phase 6 is currently single threaded, but it iterates AGs one at a
> time. When there are hundreds of AGs that need scanning, this takes
> a long time. Given that all the objects that the AG traversal works
> on are per-ag, we can simply parallelise this into a strided AG
> processing like phase 3 and 4.
>
> Unpatched: 8m40s
> patched: 1m10s (7 threads)
>
> Big win!
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> repair/phase6.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 47 insertions(+), 9 deletions(-)
>
> diff --git a/repair/phase6.c b/repair/phase6.c
> index d2d4a44..d82f900 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -51,6 +51,7 @@ typedef struct dotdot_update {
>
> static dotdot_update_t *dotdot_update_list;
> static int dotdot_update;
> +static pthread_mutex_t dotdot_lock;
>
> static void
> add_dotdot_update(
> @@ -64,12 +65,14 @@ add_dotdot_update(
> do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
> sizeof(dotdot_update_t));
>
> + pthread_mutex_lock(&dotdot_lock);
> dir->next = dotdot_update_list;
> dir->irec = irec;
> dir->agno = agno;
> dir->ino_offset = ino_offset;
>
> dotdot_update_list = dir;
> + pthread_mutex_unlock(&dotdot_lock);
> }
>
> /*
> @@ -2918,34 +2921,68 @@ update_missing_dotdot_entries(
> * these entries parents were updated, rebuild them again
> * set dotdot_update flag so processing routines do not count links
> */
> + pthread_mutex_lock(&dotdot_lock);
> dotdot_update = 1;
> while (dotdot_update_list) {
> dir = dotdot_update_list;
> dotdot_update_list = dir->next;
> + dir->next = NULL;
> + pthread_mutex_unlock(&dotdot_lock);
> +
> process_dir_inode(mp, dir->agno, dir->irec, dir->ino_offset);
> free(dir);
> +
> + pthread_mutex_lock(&dotdot_lock);
> }
> + pthread_mutex_unlock(&dotdot_lock);
> }
Technically the locking here is unnecessary, as this appears to remain
single threaded, yes? It doesn't hurt and probably eliminates a
potential landmine, so:
Reviewed-by: Brian Foster <bfoster@redhat.com>
>
> static void
> traverse_ags(
> - xfs_mount_t *mp)
> + xfs_mount_t *mp)
> {
> - int i;
> - work_queue_t queue;
> + int i, j;
> + xfs_agnumber_t agno;
> + work_queue_t *queues;
> prefetch_args_t *pf_args[2];
>
> /*
> * we always do prefetch for phase 6 as it will fill in the gaps
> * not read during phase 3 prefetch.
> */
> - queue.mp = mp;
> - pf_args[0] = start_inode_prefetch(0, 1, NULL);
> - for (i = 0; i < glob_agcount; i++) {
> - pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> - pf_args[i & 1]);
> - traverse_function(&queue, i, pf_args[i & 1]);
> + if (!ag_stride) {
> + work_queue_t queue;
> +
> + queue.mp = mp;
> + pf_args[0] = start_inode_prefetch(0, 1, NULL);
> + for (i = 0; i < glob_agcount; i++) {
> + pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> + pf_args[i & 1]);
> + traverse_function(&queue, i, pf_args[i & 1]);
> + }
> + return;
> }
> +
> + /*
> + * create one worker thread for each segment of the volume
> + */
> + queues = malloc(thread_count * sizeof(work_queue_t));
> + for (i = 0, agno = 0; i < thread_count; i++) {
> + create_work_queue(&queues[i], mp, 1);
> + pf_args[0] = NULL;
> + for (j = 0; j < ag_stride && agno < glob_agcount; j++, agno++) {
> + pf_args[0] = start_inode_prefetch(agno, 1, pf_args[0]);
> + queue_work(&queues[i], traverse_function, agno,
> + pf_args[0]);
> + }
> + }
> +
> + /*
> + * wait for workers to complete
> + */
> + for (i = 0; i < thread_count; i++)
> + destroy_work_queue(&queues[i]);
> + free(queues);
> }
>
> void
> @@ -2957,6 +2994,7 @@ phase6(xfs_mount_t *mp)
> memset(&zerocr, 0, sizeof(struct cred));
> memset(&zerofsx, 0, sizeof(struct fsxattr));
> orphanage_ino = 0;
> + pthread_mutex_init(&dotdot_lock, NULL);
>
> do_log(_("Phase 6 - check inode connectivity...\n"));
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
2013-12-12 7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
` (2 preceding siblings ...)
2013-12-12 7:22 ` [PATCH 3/5] repair: phase 6 is trivially parallelisable Dave Chinner
@ 2013-12-12 7:22 ` Dave Chinner
2013-12-12 18:28 ` Christoph Hellwig
2013-12-12 18:59 ` Brian Foster
2013-12-12 7:22 ` [PATCH 5/5] repair: limit auto-striding concurrency appropriately Dave Chinner
4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 7:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The hashkey calculation is very simplistic, and throws away an amount
of entropy that should be folded into the hash. The result is
sub-optimal distribution across the hash tables. For example, with a
default 512 entry table, phase 2 results in this:
Max supported entries = 4096
Max utilized entries = 3970
Active entries = 3970
Hash table size = 512
Hits = 0
Misses = 3970
Hit ratio = 0.00
Hash buckets with 0 entries 12 ( 0%)
Hash buckets with 1 entries 3 ( 0%)
Hash buckets with 2 entries 10 ( 0%)
Hash buckets with 3 entries 2 ( 0%)
Hash buckets with 4 entries 129 ( 12%)
Hash buckets with 5 entries 20 ( 2%)
Hash buckets with 6 entries 54 ( 8%)
Hash buckets with 7 entries 22 ( 3%)
Hash buckets with 8 entries 150 ( 30%)
Hash buckets with 9 entries 14 ( 3%)
Hash buckets with 10 entries 16 ( 4%)
Hash buckets with 11 entries 7 ( 1%)
Hash buckets with 12 entries 38 ( 11%)
Hash buckets with 13 entries 5 ( 1%)
Hash buckets with 14 entries 4 ( 1%)
Hash buckets with 17 entries 1 ( 0%)
Hash buckets with 19 entries 1 ( 0%)
Hash buckets with 23 entries 1 ( 0%)
Hash buckets with >24 entries 23 ( 16%)
Now, given a perfect distribution, we should have 8 entries per
chain. What we end up with is nothing like that.
Unfortunately, for phases 3/4 and others, the number of cached
objects results in the cache being expanded to 256k entries, and
so the stats just give this:
Hits = 262276
Misses = 8130393
Hit ratio = 3.13
Hash buckets with >24 entries 512 (100%)
We can't evaluate the efficiency of the hashing algorithm here.
Let's increase the size of the hash table to 32768 entries and go
from there:
Phase 2:
Hash buckets with 0 entries 31884 ( 0%)
Hash buckets with 1 entries 35 ( 0%)
Hash buckets with 2 entries 78 ( 3%)
Hash buckets with 3 entries 30 ( 2%)
Hash buckets with 4 entries 649 ( 65%)
Hash buckets with 5 entries 12 ( 1%)
Hash buckets with 6 entries 13 ( 1%)
Hash buckets with 8 entries 40 ( 8%)
Hash buckets with 9 entries 1 ( 0%)
Hash buckets with 13 entries 1 ( 0%)
Hash buckets with 15 entries 1 ( 0%)
Hash buckets with 22 entries 1 ( 0%)
Hash buckets with 24 entries 17 ( 10%)
Hash buckets with >24 entries 6 ( 4%)
There's a significant number of collisions given the population is
only 15% of the size of the table itself....
Phase 3:
Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262090
Hash table size = 32768
Hits = 530844
Misses = 7164575
Hit ratio = 6.90
Hash buckets with 0 entries 11898 ( 0%)
....
Hash buckets with 12 entries 5513 ( 25%)
Hash buckets with 13 entries 4188 ( 20%)
Hash buckets with 14 entries 2073 ( 11%)
Hash buckets with 15 entries 1811 ( 10%)
Hash buckets with 16 entries 1994 ( 12%)
....
Hash buckets with >24 entries 339 ( 4%)
So, a third of the hash table buckets do not even have any entries
in them, despite more than 7.5 million entries being run through
the cache. Median chain lengths are 12-16 entries; the ideal is 8.
And there are lots of collisions on the chains longer than 24
entries...
Phase 6:
Hash buckets with 0 entries 14573 ( 0%)
....
Hash buckets with >24 entries 2291 ( 36%)
Ouch. Not a good distribution at all.
Overall runtime:
Phase Start End Duration
Phase 1: 12/06 11:35:04 12/06 11:35:04
Phase 2: 12/06 11:35:04 12/06 11:35:07 3 seconds
Phase 3: 12/06 11:35:07 12/06 11:38:27 3 minutes, 20 seconds
Phase 4: 12/06 11:38:27 12/06 11:41:32 3 minutes, 5 seconds
Phase 5: 12/06 11:41:32 12/06 11:41:32
Phase 6: 12/06 11:41:32 12/06 11:42:29 57 seconds
Phase 7: 12/06 11:42:29 12/06 11:42:30 1 second
Total run time: 7 minutes, 26 seconds
Modify the hash to be something more workable - steal the linux
kernel inode hash calculation and try that:
phase 2:
Max supported entries = 262144
Max utilized entries = 3970
Active entries = 3970
Hash table size = 32768
Hits = 0
Misses = 3972
Hit ratio = 0.00
Hash buckets with 0 entries 29055 ( 0%)
Hash buckets with 1 entries 3464 ( 87%)
Hash buckets with 2 entries 241 ( 12%)
Hash buckets with 3 entries 8 ( 0%)
Close to perfect.
Phase 3:
Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262118
Hash table size = 32768
Hits = 567900
Misses = 7118749
Hit ratio = 7.39
Hash buckets with 5 entries 1572 ( 2%)
Hash buckets with 6 entries 2186 ( 5%)
Hash buckets with 7 entries 9217 ( 24%)
Hash buckets with 8 entries 8757 ( 26%)
Hash buckets with 9 entries 6135 ( 21%)
Hash buckets with 10 entries 3166 ( 12%)
Hash buckets with 11 entries 1257 ( 5%)
Hash buckets with 12 entries 364 ( 1%)
Hash buckets with 13 entries 94 ( 0%)
Hash buckets with 14 entries 14 ( 0%)
Hash buckets with 15 entries 5 ( 0%)
A near-perfect bell curve centered on the optimal distribution
number of 8 entries per chain.
Phase 6:
Hash buckets with 0 entries 24 ( 0%)
Hash buckets with 1 entries 190 ( 0%)
Hash buckets with 2 entries 571 ( 0%)
Hash buckets with 3 entries 1263 ( 1%)
Hash buckets with 4 entries 2465 ( 3%)
Hash buckets with 5 entries 3399 ( 6%)
Hash buckets with 6 entries 4002 ( 9%)
Hash buckets with 7 entries 4186 ( 11%)
Hash buckets with 8 entries 3773 ( 11%)
Hash buckets with 9 entries 3240 ( 11%)
Hash buckets with 10 entries 2523 ( 9%)
Hash buckets with 11 entries 2074 ( 8%)
Hash buckets with 12 entries 1582 ( 7%)
Hash buckets with 13 entries 1206 ( 5%)
Hash buckets with 14 entries 863 ( 4%)
Hash buckets with 15 entries 601 ( 3%)
Hash buckets with 16 entries 386 ( 2%)
Hash buckets with 17 entries 205 ( 1%)
Hash buckets with 18 entries 122 ( 0%)
Hash buckets with 19 entries 48 ( 0%)
Hash buckets with 20 entries 24 ( 0%)
Hash buckets with 21 entries 13 ( 0%)
Hash buckets with 22 entries 8 ( 0%)
A much wider bell curve than phase 3, but still centered around the
optimal value and far, far better than the distribution of the
current hash calculation. Runtime:
Phase Start End Duration
Phase 1: 12/06 11:47:21 12/06 11:47:21
Phase 2: 12/06 11:47:21 12/06 11:47:23 2 seconds
Phase 3: 12/06 11:47:23 12/06 11:50:50 3 minutes, 27 seconds
Phase 4: 12/06 11:50:50 12/06 11:53:57 3 minutes, 7 seconds
Phase 5: 12/06 11:53:57 12/06 11:53:58 1 second
Phase 6: 12/06 11:53:58 12/06 11:54:51 53 seconds
Phase 7: 12/06 11:54:51 12/06 11:54:52 1 second
Total run time: 7 minutes, 31 seconds
Essentially unchanged - this is something of a "swings and
roundabouts" test here because what it is really testing is the
cache-miss overhead.
FWIW, the comparison here shows a pretty good case for the existing
hash calculation. On a less populated filesystem (5m inodes rather
than 50m inodes) the typical hash distribution was:
Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262094
Hash table size = 32768
Hits = 626228
Misses = 800166
Hit ratio = 43.90
Hash buckets with 0 entries 29274 ( 0%)
Hash buckets with 3 entries 1 ( 0%)
Hash buckets with 4 entries 1 ( 0%)
Hash buckets with 7 entries 1 ( 0%)
Hash buckets with 8 entries 1 ( 0%)
Hash buckets with 9 entries 1 ( 0%)
Hash buckets with 12 entries 1 ( 0%)
Hash buckets with 13 entries 1 ( 0%)
Hash buckets with 16 entries 2 ( 0%)
Hash buckets with 18 entries 1 ( 0%)
Hash buckets with 22 entries 1 ( 0%)
Hash buckets with >24 entries 3483 ( 99%)
Total and utter crap. Same filesystem, new hash function:
Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262103
Hash table size = 32768
Hits = 673208
Misses = 838265
Hit ratio = 44.54
Hash buckets with 3 entries 558 ( 0%)
Hash buckets with 4 entries 1126 ( 1%)
Hash buckets with 5 entries 2440 ( 4%)
Hash buckets with 6 entries 4249 ( 9%)
Hash buckets with 7 entries 5280 ( 14%)
Hash buckets with 8 entries 5598 ( 17%)
Hash buckets with 9 entries 5446 ( 18%)
Hash buckets with 10 entries 3879 ( 14%)
Hash buckets with 11 entries 2405 ( 10%)
Hash buckets with 12 entries 1187 ( 5%)
Hash buckets with 13 entries 447 ( 2%)
Hash buckets with 14 entries 125 ( 0%)
Hash buckets with 15 entries 25 ( 0%)
Hash buckets with 16 entries 3 ( 0%)
Kinda says it all, really...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
include/cache.h | 4 +++-
libxfs/cache.c | 7 +++++--
libxfs/rdwr.c | 12 ++++++++++--
3 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/include/cache.h b/include/cache.h
index 76cb234..0a84c69 100644
--- a/include/cache.h
+++ b/include/cache.h
@@ -66,7 +66,8 @@ typedef void (*cache_walk_t)(struct cache_node *);
typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
typedef void (*cache_node_flush_t)(struct cache_node *);
typedef void (*cache_node_relse_t)(struct cache_node *);
-typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int);
+typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
+ unsigned int);
typedef int (*cache_node_compare_t)(struct cache_node *, cache_key_t);
typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
@@ -112,6 +113,7 @@ struct cache {
cache_node_compare_t compare; /* comparison routine */
cache_bulk_relse_t bulkrelse; /* bulk release routine */
unsigned int c_hashsize; /* hash bucket count */
+ unsigned int c_hashshift; /* hash key shift */
struct cache_hash *c_hash; /* hash table buckets */
struct cache_mru c_mrus[CACHE_MAX_PRIORITY + 1];
unsigned long long c_misses; /* cache misses */
diff --git a/libxfs/cache.c b/libxfs/cache.c
index 84d2860..dc69689 100644
--- a/libxfs/cache.c
+++ b/libxfs/cache.c
@@ -25,6 +25,7 @@
#include <xfs/platform_defs.h>
#include <xfs/list.h>
#include <xfs/cache.h>
+#include <xfs/libxfs.h>
#define CACHE_DEBUG 1
#undef CACHE_DEBUG
@@ -61,6 +62,7 @@ cache_init(
cache->c_misses = 0;
cache->c_maxcount = maxcount;
cache->c_hashsize = hashsize;
+ cache->c_hashshift = libxfs_highbit32(hashsize);
cache->hash = cache_operations->hash;
cache->alloc = cache_operations->alloc;
cache->flush = cache_operations->flush;
@@ -343,7 +345,7 @@ cache_node_get(
int priority = 0;
int purged = 0;
- hashidx = cache->hash(key, cache->c_hashsize);
+ hashidx = cache->hash(key, cache->c_hashsize, cache->c_hashshift);
hash = cache->c_hash + hashidx;
head = &hash->ch_list;
@@ -515,7 +517,8 @@ cache_node_purge(
struct cache_hash * hash;
int count = -1;
- hash = cache->c_hash + cache->hash(key, cache->c_hashsize);
+ hash = cache->c_hash + cache->hash(key, cache->c_hashsize,
+ cache->c_hashshift);
head = &hash->ch_list;
pthread_mutex_lock(&hash->ch_mutex);
for (pos = head->next, n = pos->next; pos != head;
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 0219a08..0effb9a 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -311,10 +311,18 @@ struct xfs_bufkey {
int nmaps;
};
+/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#define CACHE_LINE_SIZE 64
static unsigned int
-libxfs_bhash(cache_key_t key, unsigned int hashsize)
+libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
{
- return (((unsigned int)((struct xfs_bufkey *)key)->blkno) >> 5) % hashsize;
+ uint64_t hashval = ((struct xfs_bufkey *)key)->blkno;
+ uint64_t tmp;
+
+ tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
+ tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
+ return tmp % hashsize;
}
static int
--
1.8.4.rc3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
2013-12-12 7:22 ` [PATCH 4/5] libxfs: buffer cache hashing is suboptimal Dave Chinner
@ 2013-12-12 18:28 ` Christoph Hellwig
2013-12-12 18:59 ` Brian Foster
1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:28 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
2013-12-12 7:22 ` [PATCH 4/5] libxfs: buffer cache hashing is suboptimal Dave Chinner
2013-12-12 18:28 ` Christoph Hellwig
@ 2013-12-12 18:59 ` Brian Foster
2013-12-12 20:56 ` Dave Chinner
1 sibling, 1 reply; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:59 UTC (permalink / raw)
To: Dave Chinner, xfs
On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The hashkey calculation is very simplistic, and throws away an amount
> of entropy that should be folded into the hash. The result is
> sub-optimal distribution across the hash tables. For example, with a
> default 512 entry table, phase 2 results in this:
>
...
> Modify the hash to be something more workable - steal the linux
> kernel inode hash calculation and try that:
>
...
>
> Kinda says it all, really...
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Results look nice and the algorithm seems to match the kernel variant,
but what about the 32-bit alternate prime/cache line values? Safe to
leave out..?
Brian
> include/cache.h | 4 +++-
> libxfs/cache.c | 7 +++++--
> libxfs/rdwr.c | 12 ++++++++++--
> 3 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/include/cache.h b/include/cache.h
> index 76cb234..0a84c69 100644
> --- a/include/cache.h
> +++ b/include/cache.h
> @@ -66,7 +66,8 @@ typedef void (*cache_walk_t)(struct cache_node *);
> typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
> typedef void (*cache_node_flush_t)(struct cache_node *);
> typedef void (*cache_node_relse_t)(struct cache_node *);
> -typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int);
> +typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
> + unsigned int);
> typedef int (*cache_node_compare_t)(struct cache_node *, cache_key_t);
> typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
>
> @@ -112,6 +113,7 @@ struct cache {
> cache_node_compare_t compare; /* comparison routine */
> cache_bulk_relse_t bulkrelse; /* bulk release routine */
> unsigned int c_hashsize; /* hash bucket count */
> + unsigned int c_hashshift; /* hash key shift */
> struct cache_hash *c_hash; /* hash table buckets */
> struct cache_mru c_mrus[CACHE_MAX_PRIORITY + 1];
> unsigned long long c_misses; /* cache misses */
> diff --git a/libxfs/cache.c b/libxfs/cache.c
> index 84d2860..dc69689 100644
> --- a/libxfs/cache.c
> +++ b/libxfs/cache.c
> @@ -25,6 +25,7 @@
> #include <xfs/platform_defs.h>
> #include <xfs/list.h>
> #include <xfs/cache.h>
> +#include <xfs/libxfs.h>
>
> #define CACHE_DEBUG 1
> #undef CACHE_DEBUG
> @@ -61,6 +62,7 @@ cache_init(
> cache->c_misses = 0;
> cache->c_maxcount = maxcount;
> cache->c_hashsize = hashsize;
> + cache->c_hashshift = libxfs_highbit32(hashsize);
> cache->hash = cache_operations->hash;
> cache->alloc = cache_operations->alloc;
> cache->flush = cache_operations->flush;
> @@ -343,7 +345,7 @@ cache_node_get(
> int priority = 0;
> int purged = 0;
>
> - hashidx = cache->hash(key, cache->c_hashsize);
> + hashidx = cache->hash(key, cache->c_hashsize, cache->c_hashshift);
> hash = cache->c_hash + hashidx;
> head = &hash->ch_list;
>
> @@ -515,7 +517,8 @@ cache_node_purge(
> struct cache_hash * hash;
> int count = -1;
>
> - hash = cache->c_hash + cache->hash(key, cache->c_hashsize);
> + hash = cache->c_hash + cache->hash(key, cache->c_hashsize,
> + cache->c_hashshift);
> head = &hash->ch_list;
> pthread_mutex_lock(&hash->ch_mutex);
> for (pos = head->next, n = pos->next; pos != head;
> diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
> index 0219a08..0effb9a 100644
> --- a/libxfs/rdwr.c
> +++ b/libxfs/rdwr.c
> @@ -311,10 +311,18 @@ struct xfs_bufkey {
> int nmaps;
> };
>
> +/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
> +#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
> +#define CACHE_LINE_SIZE 64
> static unsigned int
> -libxfs_bhash(cache_key_t key, unsigned int hashsize)
> +libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
> {
> - return (((unsigned int)((struct xfs_bufkey *)key)->blkno) >> 5) % hashsize;
> + uint64_t hashval = ((struct xfs_bufkey *)key)->blkno;
> + uint64_t tmp;
> +
> + tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
> + tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
> + return tmp % hashsize;
> }
>
> static int
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
2013-12-12 18:59 ` Brian Foster
@ 2013-12-12 20:56 ` Dave Chinner
2013-12-13 14:23 ` Brian Foster
0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 20:56 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Thu, Dec 12, 2013 at 01:59:26PM -0500, Brian Foster wrote:
> On 12/12/2013 02:22 AM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The hashkey calculation is very simplistic, and throws away an amount
> > of entropy that should be folded into the hash. The result is
> > sub-optimal distribution across the hash tables. For example, with a
> > default 512 entry table, phase 2 results in this:
> >
> ...
> > Modify the hash to be something more workable - steal the linux
> > kernel inode hash calculation and try that:
> >
> ...
> >
> > Kinda says it all, really...
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
>
> Results look nice and the algorithm seems to match the kernel variant,
> but what about the 32-bit alternate prime/cache line values? Safe to
> leave out..?
The buffer cache uses a 64 bit key, regardless of the platform.
Therefore the 64 bit variant is always needed. The kernel inode hash
uses a 32 bit key on 32 bit systems, which is why there are two
variants for it.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/5] libxfs: buffer cache hashing is suboptimal
2013-12-12 20:56 ` Dave Chinner
@ 2013-12-13 14:23 ` Brian Foster
0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-12-13 14:23 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 12/12/2013 03:56 PM, Dave Chinner wrote:
> On Thu, Dec 12, 2013 at 01:59:26PM -0500, Brian Foster wrote:
>> On 12/12/2013 02:22 AM, Dave Chinner wrote:
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> The hashkey calculation is very simplistic, and throws away an amount
>>> of entropy that should be folded into the hash. The result is
>>> sub-optimal distribution across the hash tables. For example, with a
>>> default 512 entry table, phase 2 results in this:
>>>
>> ...
>>> Modify the hash to be something more workable - steal the linux
>>> kernel inode hash calculation and try that:
>>>
>> ...
>>>
>>> Kinda says it all, really...
>>>
>>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>>> ---
>>
>> Results look nice and the algorithm seems to match the kernel variant,
>> but what about the 32-bit alternate prime/cache line values? Safe to
>> leave out..?
>
> The buffer cache uses a 64 bit key, regardless of the platform.
> Therefore the 64 bit variant is always needed. The kernel inode hash
> uses a 32 bit key on 32 bit systems, which is why there are two
> variants for it.
>
Ah.. xfs_bufkey->blkno is an xfs_daddr_t, which is an __int64_t. Thanks.
Reviewed-by: Brian Foster <bfoster@redhat.com>
> Cheers,
>
> Dave.
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 5/5] repair: limit auto-striding concurrency appropriately
2013-12-12 7:22 [PATCH 0/5] xfs_repair: scalability improvements Dave Chinner
` (3 preceding siblings ...)
2013-12-12 7:22 ` [PATCH 4/5] libxfs: buffer cache hashing is suboptimal Dave Chinner
@ 2013-12-12 7:22 ` Dave Chinner
2013-12-12 18:29 ` Christoph Hellwig
2013-12-12 18:59 ` Brian Foster
4 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 7:22 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
It's possible to have filesystems with hundreds of AGs on systems
with little concurrency and resources. In this case, we can easily
exhaust memory and fail to create threads and have all sorts of
interesting problems.
xfs/250 can cause this to occur, with failures like:
- agno = 707
- agno = 692
fatal error -- cannot create worker threads, error = [11] Resource temporarily unavailable
And this:
- agno = 484
- agno = 782
failed to create prefetch thread: Resource temporarily unavailable
Because it's trying to create more threads than a poor little 512MB
single CPU ia32 box can handle.
So, limit concurrency to a maximum of numcpus * 8 to prevent this.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
include/libxfs.h | 1 +
libxfs/init.h | 1 -
repair/xfs_repair.c | 18 +++++++++++++++++-
3 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/libxfs.h b/include/libxfs.h
index 4bf331c..39e3d85 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -144,6 +144,7 @@ extern void libxfs_device_close (dev_t);
extern int libxfs_device_alignment (void);
extern void libxfs_report(FILE *);
extern void platform_findsizes(char *path, int fd, long long *sz, int *bsz);
+extern int platform_nproc(void);
/* check or write log footer: specify device, log size in blocks & uuid */
typedef xfs_caddr_t (libxfs_get_block_t)(xfs_caddr_t, int, void *);
diff --git a/libxfs/init.h b/libxfs/init.h
index f0b8cb6..112febb 100644
--- a/libxfs/init.h
+++ b/libxfs/init.h
@@ -31,7 +31,6 @@ extern char *platform_findrawpath (char *path);
extern char *platform_findblockpath (char *path);
extern int platform_direct_blockdev (void);
extern int platform_align_blockdev (void);
-extern int platform_nproc(void);
extern unsigned long platform_physmem(void); /* in kilobytes */
extern int platform_has_uuid;
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 7beffcb..0d006ae 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -627,13 +627,29 @@ main(int argc, char **argv)
* to target these for an increase in thread count. Hence a stride value
* of 15 is chosen to ensure we get at least 2 AGs being scanned at once
* on such filesystems.
+ *
+ * Limit the maximum thread count based on the amount of CPU power
+ * available. If we use too many threads, we might run out of memory
+ * and CPU power before we run out of IO concurrency.
*/
if (!ag_stride && glob_agcount >= 16 && do_prefetch)
ag_stride = 15;
if (ag_stride) {
+ int max_threads = platform_nproc() * 8;
+
thread_count = (glob_agcount + ag_stride - 1) / ag_stride;
- thread_init();
+ while (thread_count > max_threads) {
+ ag_stride *= 2;
+ thread_count = (glob_agcount + ag_stride - 1) /
+ ag_stride;
+ }
+ if (thread_count > 0)
+ thread_init();
+ else {
+ thread_count = 1;
+ ag_stride = 0;
+ }
}
if (ag_stride && report_interval) {
--
1.8.4.rc3
* Re: [PATCH 5/5] repair: limit auto-striding concurrency appropriately
2013-12-12 7:22 ` [PATCH 5/5] repair: limit auto-striding concurrency appropriately Dave Chinner
@ 2013-12-12 18:29 ` Christoph Hellwig
2013-12-12 21:00 ` Dave Chinner
2013-12-12 18:59 ` Brian Foster
1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:29 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Looks reasonable, but it might be worth adding a blurb on how you
arrived at the magic 8 threads per cpu.
* Re: [PATCH 5/5] repair: limit auto-striding concurrency appropriately
2013-12-12 18:29 ` Christoph Hellwig
@ 2013-12-12 21:00 ` Dave Chinner
0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2013-12-12 21:00 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Thu, Dec 12, 2013 at 10:29:08AM -0800, Christoph Hellwig wrote:
> Looks reasonable, but it might be worth adding a blurb on how you
> arrived at the magic 8 threads per cpu.
OK - that was simply that processing with 8 threads is more than
enough to saturate a single CPU on fast devices, so we don't need any
more parallelism in that case. And when you have large slow devices,
8 threads is more than enough to saturate the IO subsystem....
I'll add a comment mentioning that.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/5] repair: limit auto-striding concurrency appropriately
2013-12-12 7:22 ` [PATCH 5/5] repair: limit auto-striding concurrency appropriately Dave Chinner
2013-12-12 18:29 ` Christoph Hellwig
@ 2013-12-12 18:59 ` Brian Foster
1 sibling, 0 replies; 21+ messages in thread
From: Brian Foster @ 2013-12-12 18:59 UTC (permalink / raw)
To: Dave Chinner, xfs
On 12/12/2013 02:22 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> It's possible to have filesystems with hundreds of AGs on systems
> with little concurrency and resources. In this case, we can easily
> exhaust memory and fail to create threads and have all sorts of
> interesting problems.
>
> xfs/250 can cause this to occur, with failures like:
>
> - agno = 707
> - agno = 692
> fatal error -- cannot create worker threads, error = [11] Resource temporarily unavailable
>
> And this:
>
> - agno = 484
> - agno = 782
> failed to create prefetch thread: Resource temporarily unavailable
>
> Because it's trying to create more threads than a poor little 512MB
> single CPU ia32 box can handle.
>
> So, limit concurrency to a maximum of numcpus * 8 to prevent this.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
Reviewed-by: Brian Foster <bfoster@redhat.com>
> include/libxfs.h | 1 +
> libxfs/init.h | 1 -
> repair/xfs_repair.c | 18 +++++++++++++++++-
> 3 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/include/libxfs.h b/include/libxfs.h
> index 4bf331c..39e3d85 100644
> --- a/include/libxfs.h
> +++ b/include/libxfs.h
> @@ -144,6 +144,7 @@ extern void libxfs_device_close (dev_t);
> extern int libxfs_device_alignment (void);
> extern void libxfs_report(FILE *);
> extern void platform_findsizes(char *path, int fd, long long *sz, int *bsz);
> +extern int platform_nproc(void);
>
> /* check or write log footer: specify device, log size in blocks & uuid */
> typedef xfs_caddr_t (libxfs_get_block_t)(xfs_caddr_t, int, void *);
> diff --git a/libxfs/init.h b/libxfs/init.h
> index f0b8cb6..112febb 100644
> --- a/libxfs/init.h
> +++ b/libxfs/init.h
> @@ -31,7 +31,6 @@ extern char *platform_findrawpath (char *path);
> extern char *platform_findblockpath (char *path);
> extern int platform_direct_blockdev (void);
> extern int platform_align_blockdev (void);
> -extern int platform_nproc(void);
> extern unsigned long platform_physmem(void); /* in kilobytes */
> extern int platform_has_uuid;
>
> diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> index 7beffcb..0d006ae 100644
> --- a/repair/xfs_repair.c
> +++ b/repair/xfs_repair.c
> @@ -627,13 +627,29 @@ main(int argc, char **argv)
> * to target these for an increase in thread count. Hence a stride value
> * of 15 is chosen to ensure we get at least 2 AGs being scanned at once
> * on such filesystems.
> + *
> + * Limit the maximum thread count based on the amount of CPU power
> + * available. If we use too many threads, we might run out of memory
> + * and CPU power before we run out of IO concurrency.
> */
> if (!ag_stride && glob_agcount >= 16 && do_prefetch)
> ag_stride = 15;
>
> if (ag_stride) {
> + int max_threads = platform_nproc() * 8;
> +
> thread_count = (glob_agcount + ag_stride - 1) / ag_stride;
> - thread_init();
> + while (thread_count > max_threads) {
> + ag_stride *= 2;
> + thread_count = (glob_agcount + ag_stride - 1) /
> + ag_stride;
> + }
> + if (thread_count > 0)
> + thread_init();
> + else {
> + thread_count = 1;
> + ag_stride = 0;
> + }
> }
>
> if (ag_stride && report_interval) {
>