* [PATCH 1/5] repair: handle repair of image files on large sector size filesystems
2011-10-09 23:11 [PATCH 0/5, v2] repair: sector size and blkmap fixes Dave Chinner
@ 2011-10-09 23:11 ` Dave Chinner
2011-10-09 23:52 ` Christoph Hellwig
2011-10-09 23:11 ` [PATCH 2/5] repair: fix some valgrind reported errors on i686 Dave Chinner
` (3 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2011-10-09 23:11 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Because repair uses direct IO, it cannot do IO smaller than a sector
on the underlying device. When repairing a filesystem image, the
filesystem hosting the image may have a sector size larger than the
sector size of the image, and so single image sector reads and
writes will fail.
To avoid this, when checking a file and there is a sector size
mismatch like this, turn off direct IO. While there, fix a compile
bug in the IO_DEBUG option for libxfs which was found during triage.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
libxfs/rdwr.c | 2 +-
repair/sb.c | 9 ++++++++-
repair/xfs_repair.c | 29 +++++++++++++++++++++++++++++
3 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index ec2675e..a656851 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -364,7 +364,7 @@ libxfs_getbufr(dev_t device, xfs_daddr_t blkno, int bblen)
libxfs_initbuf(bp, device, blkno, blen);
#ifdef IO_DEBUG
printf("%lx: %s: allocated %u bytes buffer, key=%llu(%llu), %p\n",
- pthread_self(), __FUNCTION__, BBTOB(len),
+ pthread_self(), __FUNCTION__, blen,
(long long)LIBXFS_BBTOOFF64(blkno), (long long)blkno, bp);
#endif
diff --git a/repair/sb.c b/repair/sb.c
index 0ee2345..8873956 100644
--- a/repair/sb.c
+++ b/repair/sb.c
@@ -689,7 +689,14 @@ verify_set_primary_sb(xfs_sb_t *rsb,
*/
num_sbs = MIN(NUM_SBS, rsb->sb_agcount);
skip = howmany(num_sbs, rsb->sb_agcount);
- size = NUM_AGH_SECTS * rsb->sb_sectsize;
+
+ /*
+ * We haven't been able to validate the sector size yet properly
+ * (e.g. in the case of repairing an image in a file), so we need to
+ * take into account sector mismatches and so use the maximum possible
+ * sector size rather than the sector size in @rsb.
+ */
+ size = NUM_AGH_SECTS * (1 << (XFS_MAX_SECTORSIZE_LOG));
retval = 0;
list = NULL;
num_ok = 0;
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 4707b83..3ecb4b3 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -569,6 +569,35 @@ main(int argc, char **argv)
memset(&xfs_m, 0, sizeof(xfs_mount_t));
libxfs_sb_from_disk(&xfs_m.m_sb, XFS_BUF_TO_SBP(sbp));
+ /*
+ * if the sector size of the filesystem we are trying to repair is
+ * smaller than that of the underlying filesystem (i.e. we are repairing
+ * an image), the we have to turn off direct IO because we cannot do IO
+ * smaller than the host filesystem's sector size.
+ */
+ if (isa_file) {
+ int fd = libxfs_device_to_fd(x.ddev);
+ struct xfs_fsop_geom_v1 geom = { 0 };
+
+ if (ioctl(fd, XFS_IOC_FSGEOMETRY_V1, &geom) < 0) {
+ do_warn(_("Cannot get host filesystem geometry.\n"
+ "Repair may fail if there is a sector size mismatch between\n"
+ "the image and the host filesystem.\n"));
+ geom.sectsize = BBSIZE;
+ }
+
+ if (xfs_m.m_sb.sb_sectsize < geom.sectsize) {
+ long old_flags;
+
+ old_flags = fcntl(fd, F_GETFL, 0);
+ if (fcntl(fd, F_SETFL, old_flags & ~O_DIRECT) < 0) {
+ do_warn(_(
+ "Sector size on host filesystem larger than image sector size.\n"
+ "Cannot turn off direct IO, so exiting.\n"));
+ exit(1);
+ }
+ }
+ }
mp = libxfs_mount(&xfs_m, &xfs_m.m_sb, x.ddev, x.logdev, x.rtdev, 0);
if (!mp) {
--
1.7.5.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 1/5] repair: handle repair of image files on large sector size filesystems
2011-10-09 23:11 ` [PATCH 1/5] repair: handle repair of image files on large sector size filesystems Dave Chinner
@ 2011-10-09 23:52 ` Christoph Hellwig
2011-10-10 0:17 ` Dave Chinner
0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2011-10-09 23:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Oct 10, 2011 at 10:11:46AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Because repair uses direct IO, it cannot do IO smaller than a sector
> on the underlying device. When repairing a filesystem image, the
> filesystem hosting the image may have a sector size larger than the
> sector size of the image, and so single image sector reads and
> writes will fail.
Does it? Given that the loop driver uses buffered I/O that would be
unexpected behaviour. I'll try to reproduce it as soon as my 4k
disk is available again.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/5] repair: handle repair of image files on large sector size filesystems
2011-10-09 23:52 ` Christoph Hellwig
@ 2011-10-10 0:17 ` Dave Chinner
2011-10-10 0:19 ` Christoph Hellwig
0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2011-10-10 0:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Sun, Oct 09, 2011 at 07:52:40PM -0400, Christoph Hellwig wrote:
> On Mon, Oct 10, 2011 at 10:11:46AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Because repair uses direct IO, it cannot do IO smaller than a sector
> > on the underlying device. When repairing a filesystem image, the
> > filesystem hosting the image may have a sector size larger than the
> > sector size of the image, and so single image sector reads and
> > writes will fail.
>
> Does it? Given that the loop driver uses buffered I/O that would be
> unexpected behaviour. I'll try to reproduce it as soon as my 4k
> disk is available again.
My 3TB, RAID6 based image storage filesystem has a 4k sector size,
and I can't run xfs_repair on an image file that has smaller sector
sizes than 4k without this mod. In that case, xfs_repair is doing
direct IO on a file, not a device, and the file is on the 4k sector
sized filesytem. I've been carrying this patch for a while.
So it's definitely a real problem.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/5] repair: handle repair of image files on large sector size filesystems
2011-10-10 0:17 ` Dave Chinner
@ 2011-10-10 0:19 ` Christoph Hellwig
0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-10-10 0:19 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs
On Mon, Oct 10, 2011 at 11:17:08AM +1100, Dave Chinner wrote:
> > Does it? Given that the loop driver uses buffered I/O that would be
> > unexpected behaviour. I'll try to reproduce it as soon as my 4k
> > disk is available again.
>
> My 3TB, RAID6 based image storage filesystem has a 4k sector size,
> and I can't run xfs_repair on an image file that has smaller sector
> sizes than 4k without this mod. In that case, xfs_repair is doing
> direct IO on a file, not a device, and the file is on the 4k sector
> sized filesytem. I've been carrying this patch for a while.
Got it - you repair the image on directly on the device, without the
loop driver.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 2/5] repair: fix some valgrind reported errors on i686
2011-10-09 23:11 [PATCH 0/5, v2] repair: sector size and blkmap fixes Dave Chinner
2011-10-09 23:11 ` [PATCH 1/5] repair: handle repair of image files on large sector size filesystems Dave Chinner
@ 2011-10-09 23:11 ` Dave Chinner
2011-10-09 23:45 ` Christoph Hellwig
2011-10-09 23:11 ` [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow Dave Chinner
` (2 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2011-10-09 23:11 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Fix a potential prefetch read problem due to the first loop
execution of pf_batch_read potentially not initialising the fsbno
varaible:
==10177== Thread 6:
==10177== Conditional jump or move depends on uninitialised value(s)
==10177== at 0x8079CAB: pf_batch_read (prefetch.c:408)
==10177== by 0x6A2996D: clone (clone.S:130)
==10177==
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/prefetch.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/repair/prefetch.c b/repair/prefetch.c
index d2fdf90..da074a8 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -397,7 +397,7 @@ pf_batch_read(
int len, size;
int i;
int inode_bufs;
- unsigned long fsbno;
+ unsigned long fsbno = 0;
unsigned long max_fsbno;
char *pbuf;
--
1.7.5.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 2/5] repair: fix some valgrind reported errors on i686
2011-10-09 23:11 ` [PATCH 2/5] repair: fix some valgrind reported errors on i686 Dave Chinner
@ 2011-10-09 23:45 ` Christoph Hellwig
2011-10-10 0:20 ` Dave Chinner
0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2011-10-09 23:45 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Oct 10, 2011 at 10:11:47AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Fix a potential prefetch read problem due to the first loop
> execution of pf_batch_read potentially not initialising the fsbno
> varaible:
variable.
>
> ==10177== Thread 6:
> ==10177== Conditional jump or move depends on uninitialised value(s)
> ==10177== at 0x8079CAB: pf_batch_read (prefetch.c:408)
> ==10177== by 0x6A2996D: clone (clone.S:130)
> ==10177==
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> repair/prefetch.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/repair/prefetch.c b/repair/prefetch.c
> index d2fdf90..da074a8 100644
> --- a/repair/prefetch.c
> +++ b/repair/prefetch.c
> @@ -397,7 +397,7 @@ pf_batch_read(
> int len, size;
> int i;
> int inode_bufs;
> - unsigned long fsbno;
> + unsigned long fsbno = 0;
Call me stupid, but I can't see how this could actually be a real
life issue. The first thing we do in the loop is to to write
to fsbno in btree_find. I'm fine adding this to shut up warnins,
but I can't see a real issue.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/5] repair: fix some valgrind reported errors on i686
2011-10-09 23:45 ` Christoph Hellwig
@ 2011-10-10 0:20 ` Dave Chinner
2011-10-10 13:14 ` Christoph Hellwig
0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2011-10-10 0:20 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Sun, Oct 09, 2011 at 07:45:29PM -0400, Christoph Hellwig wrote:
> On Mon, Oct 10, 2011 at 10:11:47AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Fix a potential prefetch read problem due to the first loop
> > execution of pf_batch_read potentially not initialising the fsbno
> > varaible:
>
> variable.
>
> >
> > ==10177== Thread 6:
> > ==10177== Conditional jump or move depends on uninitialised value(s)
> > ==10177== at 0x8079CAB: pf_batch_read (prefetch.c:408)
> > ==10177== by 0x6A2996D: clone (clone.S:130)
> > ==10177==
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > repair/prefetch.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/repair/prefetch.c b/repair/prefetch.c
> > index d2fdf90..da074a8 100644
> > --- a/repair/prefetch.c
> > +++ b/repair/prefetch.c
> > @@ -397,7 +397,7 @@ pf_batch_read(
> > int len, size;
> > int i;
> > int inode_bufs;
> > - unsigned long fsbno;
> > + unsigned long fsbno = 0;
>
> Call me stupid, but I can't see how this could actually be a real
> life issue. The first thing we do in the loop is to to write
> to fsbno in btree_find. I'm fine adding this to shut up warnins,
> but I can't see a real issue.
If btree_find() fails to find the key being looked up, it returns
without having initialised fsbno.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/5] repair: fix some valgrind reported errors on i686
2011-10-10 0:20 ` Dave Chinner
@ 2011-10-10 13:14 ` Christoph Hellwig
0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-10-10 13:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs
On Mon, Oct 10, 2011 at 11:20:17AM +1100, Dave Chinner wrote:
> > Call me stupid, but I can't see how this could actually be a real
> > life issue. The first thing we do in the loop is to to write
> > to fsbno in btree_find. I'm fine adding this to shut up warnins,
> > but I can't see a real issue.
>
> If btree_find() fails to find the key being looked up, it returns
> without having initialised fsbno.
Indeed. The normal pattern for btree_find seems to be:
obj = btree_find(tree, key, &key_ptr);
if (!obj)
return;
which makes this fine.
The code in pf_batch_read effectively boils down to that due to
the
while (bplist[num] && num < MAX_BUFS && fsbno < max_fsbno) {
loop which stops executing the first conditional that evaluated to
false. So I can't see how this actually ever had an affect, but I'm
fine with fixing the warnings.
Btw, did I mention that the while loop over the bplist is completely
non-intuitive?
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow
2011-10-09 23:11 [PATCH 0/5, v2] repair: sector size and blkmap fixes Dave Chinner
2011-10-09 23:11 ` [PATCH 1/5] repair: handle repair of image files on large sector size filesystems Dave Chinner
2011-10-09 23:11 ` [PATCH 2/5] repair: fix some valgrind reported errors on i686 Dave Chinner
@ 2011-10-09 23:11 ` Dave Chinner
2011-10-09 23:51 ` Christoph Hellwig
2011-10-09 23:11 ` [PATCH 4/5] repair: don't cache large blkmap allocations Dave Chinner
2011-10-09 23:11 ` [PATCH 5/5] repair: prevent blkmap extent count overflows Dave Chinner
4 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2011-10-09 23:11 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
If blkmap_grow fails to allocate a new chunk of memory, it returns
with a null blkmap. The sole caller of blkmap_grow does not check
for this failure, and so will segfault if this error ever occurs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/bmap.c | 33 ++++++++++++++++++++++-----------
repair/bmap.h | 2 +-
repair/dinode.c | 22 ++++++++++++++++++++--
3 files changed, 43 insertions(+), 14 deletions(-)
diff --git a/repair/bmap.c b/repair/bmap.c
index 79b9f79..3ee5eff 100644
--- a/repair/bmap.c
+++ b/repair/bmap.c
@@ -207,29 +207,34 @@ blkmap_next_off(
*/
static blkmap_t *
blkmap_grow(
- blkmap_t **blkmapp)
+ blkmap_t *blkmap)
{
pthread_key_t key = dblkmap_key;
- blkmap_t *blkmap = *blkmapp;
+ blkmap_t *new_blkmap;
+ int new_naexts = blkmap->naexts + 4;
if (pthread_getspecific(key) != blkmap) {
key = ablkmap_key;
ASSERT(pthread_getspecific(key) == blkmap);
}
- blkmap->naexts += 4;
- blkmap = realloc(blkmap, BLKMAP_SIZE(blkmap->naexts));
- if (blkmap == NULL)
+ new_blkmap = realloc(blkmap, BLKMAP_SIZE(new_naexts));
+ if (!new_blkmap) {
do_error(_("realloc failed in blkmap_grow\n"));
- *blkmapp = blkmap;
- pthread_setspecific(key, blkmap);
- return blkmap;
+ return NULL;
+ }
+ blkmap->naexts = new_naexts;
+ pthread_setspecific(key, new_blkmap);
+ return new_blkmap;
}
/*
* Set an extent into a block map.
+ *
+ * If this function fails, it leaves the blkmapp untouched so the caller can
+ * handle the error and free the blkmap appropriately.
*/
-void
+int
blkmap_set_ext(
blkmap_t **blkmapp,
xfs_dfiloff_t o,
@@ -239,9 +244,14 @@ blkmap_set_ext(
blkmap_t *blkmap = *blkmapp;
xfs_extnum_t i;
- if (blkmap->nexts == blkmap->naexts)
- blkmap = blkmap_grow(blkmapp);
+ if (blkmap->nexts == blkmap->naexts) {
+ blkmap = blkmap_grow(blkmap);
+ if (!blkmap)
+ return ENOMEM;
+ *blkmapp = blkmap;
+ }
+ ASSERT(blkmap->nexts < blkmap->naexts);
for (i = 0; i < blkmap->nexts; i++) {
if (blkmap->exts[i].startoff > o) {
memmove(blkmap->exts + i + 1,
@@ -255,4 +265,5 @@ blkmap_set_ext(
blkmap->exts[i].startblock = b;
blkmap->exts[i].blockcount = c;
blkmap->nexts++;
+ return 0;
}
diff --git a/repair/bmap.h b/repair/bmap.h
index 58abf95..118ae1e 100644
--- a/repair/bmap.h
+++ b/repair/bmap.h
@@ -43,7 +43,7 @@ typedef struct blkmap {
blkmap_t *blkmap_alloc(xfs_extnum_t nex, int whichfork);
void blkmap_free(blkmap_t *blkmap);
-void blkmap_set_ext(blkmap_t **blkmapp, xfs_dfiloff_t o,
+int blkmap_set_ext(blkmap_t **blkmapp, xfs_dfiloff_t o,
xfs_dfsbno_t b, xfs_dfilblks_t c);
xfs_dfsbno_t blkmap_get(blkmap_t *blkmap, xfs_dfiloff_t o);
diff --git a/repair/dinode.c b/repair/dinode.c
index b208c51..0cedc28 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -708,9 +708,27 @@ process_bmbt_reclist_int(
goto done;
}
- if (blkmapp && *blkmapp)
- blkmap_set_ext(blkmapp, irec.br_startoff,
+ if (blkmapp && *blkmapp) {
+ error = blkmap_set_ext(blkmapp, irec.br_startoff,
irec.br_startblock, irec.br_blockcount);
+ if (error) {
+ /*
+ * we don't want to clear the inode due to an
+ * internal bmap tracking error, but if we've
+ * run out of memory then we simply can't
+ * validate that the filesystem is consistent.
+ * Hence just abort at this point with an ENOMEM
+ * error.
+ */
+ do_abort(
+ _("Fatal error: inode %llu - blkmap_set_ext(): %s\n"
+ "\t%s fork, off - %llu, start - %llu, cnt %llu\n"),
+ ino, strerror(error), forkname,
+ irec.br_startoff, irec.br_startblock,
+ irec.br_blockcount);
+ }
+ }
+
/*
* Profiling shows that the following loop takes the
* most time in all of xfs_repair.
--
1.7.5.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow
2011-10-09 23:11 ` [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow Dave Chinner
@ 2011-10-09 23:51 ` Christoph Hellwig
2011-10-10 0:10 ` Dave Chinner
0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2011-10-09 23:51 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
> static blkmap_t *
> blkmap_grow(
> - blkmap_t **blkmapp)
> + blkmap_t *blkmap)
> {
> pthread_key_t key = dblkmap_key;
> - blkmap_t *blkmap = *blkmapp;
> + blkmap_t *new_blkmap;
> + int new_naexts = blkmap->naexts + 4;
>
> if (pthread_getspecific(key) != blkmap) {
> key = ablkmap_key;
> ASSERT(pthread_getspecific(key) == blkmap);
> }
>
> - blkmap->naexts += 4;
> - blkmap = realloc(blkmap, BLKMAP_SIZE(blkmap->naexts));
> - if (blkmap == NULL)
> - do_error(_("realloc failed in blkmap_grow\n"));
> - *blkmapp = blkmap;
> - pthread_setspecific(key, blkmap);
> - return blkmap;
> + new_blkmap = realloc(blkmap, BLKMAP_SIZE(new_naexts));
> + if (!new_blkmap) {
> + do_error(_("realloc failed in blkmap_grow\n"));
> + return NULL;
> + }
> + blkmap->naexts = new_naexts;
Why would we modify naexts in the old blkmap?
Otherwise looks fine.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow
2011-10-09 23:51 ` Christoph Hellwig
@ 2011-10-10 0:10 ` Dave Chinner
0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2011-10-10 0:10 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Sun, Oct 09, 2011 at 07:51:26PM -0400, Christoph Hellwig wrote:
> > static blkmap_t *
> > blkmap_grow(
> > - blkmap_t **blkmapp)
> > + blkmap_t *blkmap)
> > {
> > pthread_key_t key = dblkmap_key;
> > - blkmap_t *blkmap = *blkmapp;
> > + blkmap_t *new_blkmap;
> > + int new_naexts = blkmap->naexts + 4;
> >
> > if (pthread_getspecific(key) != blkmap) {
> > key = ablkmap_key;
> > ASSERT(pthread_getspecific(key) == blkmap);
> > }
> >
> > - blkmap->naexts += 4;
> > - blkmap = realloc(blkmap, BLKMAP_SIZE(blkmap->naexts));
> > - if (blkmap == NULL)
> > - do_error(_("realloc failed in blkmap_grow\n"));
> > - *blkmapp = blkmap;
> > - pthread_setspecific(key, blkmap);
> > - return blkmap;
> > + new_blkmap = realloc(blkmap, BLKMAP_SIZE(new_naexts));
> > + if (!new_blkmap) {
> > + do_error(_("realloc failed in blkmap_grow\n"));
> > + return NULL;
> > + }
> > + blkmap->naexts = new_naexts;
>
> Why would we modify naexts in the old blkmap?
Ooops. Will fix.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 4/5] repair: don't cache large blkmap allocations
2011-10-09 23:11 [PATCH 0/5, v2] repair: sector size and blkmap fixes Dave Chinner
` (2 preceding siblings ...)
2011-10-09 23:11 ` [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow Dave Chinner
@ 2011-10-09 23:11 ` Dave Chinner
2011-10-09 23:48 ` Christoph Hellwig
2011-10-09 23:11 ` [PATCH 5/5] repair: prevent blkmap extent count overflows Dave Chinner
4 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2011-10-09 23:11 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
We currently use thread local storage for storing blkmap allocations
from one inode to another as a way of reducing the number of short
term allocations we do. However, the stored allocations can only
ever grow, so once we've done a large allocation we never free than
memory even if we never need that much memory again. This can occur
if we have corrupted extent counts in inodes, and can greatly
increase the memory footprint of the repair process.
Hence if the cached blkmap array id greater than a reasonable number
of extents (say 100,000), then don't store the blkmap in TLS and
instead free it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/bmap.c | 20 +++++++++++++++++++-
repair/dinode.c | 6 ++----
2 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/repair/bmap.c b/repair/bmap.c
index 3ee5eff..3e53457 100644
--- a/repair/bmap.c
+++ b/repair/bmap.c
@@ -66,12 +66,30 @@ blkmap_alloc(
/*
* Free a block map.
+ *
+ * If the map is a large, uncommon size (say for hundreds of thousands of
+ * extents) then free it to release the memory. This prevents us from pinning
+ * large tracts of memory due to corrupted fork values or one-off fragmented
+ * files. Otherwise we have nothing to do but keep the memory around for the
+ * next inode
*/
void
blkmap_free(
blkmap_t *blkmap)
{
- /* nothing to do! - keep the memory around for the next inode */
+ if (!blkmap)
+ return;
+
+ /* consider more than 100k extents rare */
+ if (blkmap->naexts < 100 * 1024)
+ return;
+
+ if (blkmap == pthread_getspecific(dblkmap_key))
+ pthread_setspecific(dblkmap_key, NULL);
+ else
+ pthread_setspecific(ablkmap_key, NULL);
+
+ free(blkmap);
}
/*
diff --git a/repair/dinode.c b/repair/dinode.c
index 0cedc28..8ad4e94 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -2748,8 +2748,7 @@ process_dinode_int(xfs_mount_t *mp,
break;
}
- if (dblkmap)
- blkmap_free(dblkmap);
+ blkmap_free(dblkmap);
/*
* check nlinks feature, if it's a version 1 inode,
@@ -2768,8 +2767,7 @@ clear_bad_out:
bad_out:
*used = is_free;
*isa_dir = 0;
- if (dblkmap)
- blkmap_free(dblkmap);
+ blkmap_free(dblkmap);
return 1;
}
--
1.7.5.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 4/5] repair: don't cache large blkmap allocations
2011-10-09 23:11 ` [PATCH 4/5] repair: don't cache large blkmap allocations Dave Chinner
@ 2011-10-09 23:48 ` Christoph Hellwig
2011-10-10 0:14 ` Dave Chinner
0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2011-10-09 23:48 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
> + /* consider more than 100k extents rare */
> + if (blkmap->naexts < 100 * 1024)
> + return;
100k extents still seems like a fairly high number. Otherwise looks
okay.
(If only we had a non-sucky threaded memory allocator in userspace..)
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/5] repair: don't cache large blkmap allocations
2011-10-09 23:48 ` Christoph Hellwig
@ 2011-10-10 0:14 ` Dave Chinner
2011-10-10 13:22 ` Christoph Hellwig
0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2011-10-10 0:14 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Sun, Oct 09, 2011 at 07:48:09PM -0400, Christoph Hellwig wrote:
> > + /* consider more than 100k extents rare */
> > + if (blkmap->naexts < 100 * 1024)
> > + return;
>
> 100k extents still seems like a fairly high number. Otherwise looks
> okay.
But not unreasonable for a filesystem full of torrents ;)
> (If only we had a non-sucky threaded memory allocator in userspace..)
Perhaps we should look at the talloc code from ccan?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/5] repair: don't cache large blkmap allocations
2011-10-10 0:14 ` Dave Chinner
@ 2011-10-10 13:22 ` Christoph Hellwig
0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-10-10 13:22 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs
On Mon, Oct 10, 2011 at 11:14:00AM +1100, Dave Chinner wrote:
> > okay.
>
> But not unreasonable for a filesystem full of torrents ;)
I'm just wondering if it's still the right memory / overhead tradeoff
at that point.
>
> > (If only we had a non-sucky threaded memory allocator in userspace..)
>
> Perhaps we should look at the talloc code from ccan?
I'm not sure the interface is compatible enough with the kernel style
allocator we use in libxfs. Except for that it's probably worth
taking a look.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 5/5] repair: prevent blkmap extent count overflows
2011-10-09 23:11 [PATCH 0/5, v2] repair: sector size and blkmap fixes Dave Chinner
` (3 preceding siblings ...)
2011-10-09 23:11 ` [PATCH 4/5] repair: don't cache large blkmap allocations Dave Chinner
@ 2011-10-09 23:11 ` Dave Chinner
4 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2011-10-09 23:11 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Fix a bunch of invalid read/write errors due to excessive blkmap
allocations when inode forks are corrupted. These show up some time
after making a blkmap allocation for 536870913 extents on i686,
which is followed some time later by a crash caused bymemory
corruption.
This blkmap allocation size overflows 32 bits in such a
way that it results in a 32 byte allocation and so access to the
second extent results in access beyond the allocated memory and
corrupts random memory.
==5419== Invalid write of size 4
==5419== at 0x80507DA: blkmap_set_ext (bmap.c:260)
==5419== by 0x8055CF4: process_bmbt_reclist_int (dinode.c:712)
==5419== by 0x8056206: process_bmbt_reclist (dinode.c:813)
==5419== by 0x80579DA: process_exinode (dinode.c:1324)
==5419== by 0x8059B77: process_dinode_int (dinode.c:2036)
==5419== by 0x805ABE6: process_dinode (dinode.c:2823)
==5419== by 0x8052493: process_inode_chunk.isra.4 (dino_chunks.c:777)
==5419== by 0x8054012: process_aginodes (dino_chunks.c:1024)
==5419== by 0xFFF: ???
==5419== Address 0x944cfb8 is 0 bytes after a block of size 32 alloc'd
==5419== at 0x48E1102: realloc (in
/usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==5419== by 0x80501F3: blkmap_alloc (bmap.c:56)
==5419== by 0x80599F5: process_dinode_int (dinode.c:2027)
==5419== by 0x805ABE6: process_dinode (dinode.c:2823)
==5419== by 0x8052493: process_inode_chunk.isra.4 (dino_chunks.c:777)
==5419== by 0x8054012: process_aginodes (dino_chunks.c:1024)
==5419== by 0xFFF: ???
Add overflow detection code into the blkmap allocation code to avoid
this problem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/bmap.c | 28 ++++++++++++++++++++++++++++
repair/bmap.h | 13 +++++++++++++
2 files changed, 41 insertions(+), 0 deletions(-)
diff --git a/repair/bmap.c b/repair/bmap.c
index 3e53457..a6a041a 100644
--- a/repair/bmap.c
+++ b/repair/bmap.c
@@ -47,6 +47,17 @@ blkmap_alloc(
if (nex < 1)
nex = 1;
+ if (nex > BLKMAP_NEXTS_MAX) {
+#if (BITS_PER_LONG == 32)
+ do_warn(
+ _("Number of extents requested in blkmap_alloc (%d) overflows 32 bits.\n"
+ "If this is not a corruption, then you will need a 64 bit system\n"
+ "to repair this filesystem.\n"),
+ nex);
+#endif
+ return NULL;
+ }
+
key = whichfork ? ablkmap_key : dblkmap_key;
blkmap = pthread_getspecific(key);
if (!blkmap || blkmap->naexts < nex) {
@@ -236,6 +247,23 @@ blkmap_grow(
ASSERT(pthread_getspecific(key) == blkmap);
}
+ if (new_naexts > BLKMAP_NEXTS_MAX) {
+#if (BITS_PER_LONG == 32)
+ do_error(
+ _("Number of extents requested in blkmap_grow (%d) overflows 32 bits.\n"
+ "You need a 64 bit system to repair this filesystem.\n"),
+ new_naexts);
+#endif
+ return NULL;
+ }
+ if (new_naexts <= 0) {
+ do_error(
+ _("Number of extents requested in blkmap_grow (%d) overflowed the\n"
+ "maximum number of supported extents (%d).\n"),
+ new_naexts, BLKMAP_NEXTS_MAX);
+ return NULL;
+ }
+
new_blkmap = realloc(blkmap, BLKMAP_SIZE(new_naexts));
if (!new_blkmap) {
do_error(_("realloc failed in blkmap_grow\n"));
diff --git a/repair/bmap.h b/repair/bmap.h
index 118ae1e..19720b1 100644
--- a/repair/bmap.h
+++ b/repair/bmap.h
@@ -40,6 +40,19 @@ typedef struct blkmap {
#define BLKMAP_SIZE(n) \
(offsetof(blkmap_t, exts) + (sizeof(bmap_ext_t) * (n)))
+/*
+ * For 32 bit platforms, we are limited to extent arrays of 2^31 bytes, which
+ * limits the number of extents in an inode we can check. If we don't limit the
+ * valid range, we can overflow the BLKMAP_SIZE() calculation and allocate less
+ * memory than we think we needed, and hence walk off the end of the array and
+ * corrupt memory.
+ */
+#if BITS_PER_LONG == 32
+#define BLKMAP_NEXTS_MAX ((INT_MAX / sizeof(bmap_ext_t)) - 1)
+#else
+#define BLKMAP_NEXTS_MAX INT_MAX
+#endif
+
blkmap_t *blkmap_alloc(xfs_extnum_t nex, int whichfork);
void blkmap_free(blkmap_t *blkmap);
--
1.7.5.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow
2011-10-10 1:08 [PATCH 0/5, v3] repair: sector size and blkmap fixes Dave Chinner
@ 2011-10-10 1:08 ` Dave Chinner
2011-10-10 14:14 ` Christoph Hellwig
2011-10-13 9:57 ` Alex Elder
0 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2011-10-10 1:08 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
If blkmap_grow fails to allocate a new chunk of memory, it returns
with a null blkmap. The sole caller of blkmap_grow does not check
for this failure, and so will segfault if this error ever occurs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/bmap.c | 33 ++++++++++++++++++++++-----------
repair/bmap.h | 2 +-
repair/dinode.c | 22 ++++++++++++++++++++--
3 files changed, 43 insertions(+), 14 deletions(-)
diff --git a/repair/bmap.c b/repair/bmap.c
index 0ff9315..e290515 100644
--- a/repair/bmap.c
+++ b/repair/bmap.c
@@ -207,29 +207,34 @@ blkmap_next_off(
*/
static blkmap_t *
blkmap_grow(
- blkmap_t **blkmapp)
+ blkmap_t *blkmap)
{
pthread_key_t key = dblkmap_key;
- blkmap_t *blkmap = *blkmapp;
+ blkmap_t *new_blkmap;
+ int new_naexts = blkmap->naexts + 4;
if (pthread_getspecific(key) != blkmap) {
key = ablkmap_key;
ASSERT(pthread_getspecific(key) == blkmap);
}
- blkmap->naexts += 4;
- blkmap = realloc(blkmap, BLKMAP_SIZE(blkmap->naexts));
- if (blkmap == NULL)
+ new_blkmap = realloc(blkmap, BLKMAP_SIZE(new_naexts));
+ if (!new_blkmap) {
do_error(_("realloc failed in blkmap_grow\n"));
- *blkmapp = blkmap;
- pthread_setspecific(key, blkmap);
- return blkmap;
+ return NULL;
+ }
+ new_blkmap->naexts = new_naexts;
+ pthread_setspecific(key, new_blkmap);
+ return new_blkmap;
}
/*
* Set an extent into a block map.
+ *
+ * If this function fails, it leaves the blkmapp untouched so the caller can
+ * handle the error and free the blkmap appropriately.
*/
-void
+int
blkmap_set_ext(
blkmap_t **blkmapp,
xfs_dfiloff_t o,
@@ -239,9 +244,14 @@ blkmap_set_ext(
blkmap_t *blkmap = *blkmapp;
xfs_extnum_t i;
- if (blkmap->nexts == blkmap->naexts)
- blkmap = blkmap_grow(blkmapp);
+ if (blkmap->nexts == blkmap->naexts) {
+ blkmap = blkmap_grow(blkmap);
+ if (!blkmap)
+ return ENOMEM;
+ *blkmapp = blkmap;
+ }
+ ASSERT(blkmap->nexts < blkmap->naexts);
for (i = 0; i < blkmap->nexts; i++) {
if (blkmap->exts[i].startoff > o) {
memmove(blkmap->exts + i + 1,
@@ -255,4 +265,5 @@ blkmap_set_ext(
blkmap->exts[i].startblock = b;
blkmap->exts[i].blockcount = c;
blkmap->nexts++;
+ return 0;
}
diff --git a/repair/bmap.h b/repair/bmap.h
index 58abf95..118ae1e 100644
--- a/repair/bmap.h
+++ b/repair/bmap.h
@@ -43,7 +43,7 @@ typedef struct blkmap {
blkmap_t *blkmap_alloc(xfs_extnum_t nex, int whichfork);
void blkmap_free(blkmap_t *blkmap);
-void blkmap_set_ext(blkmap_t **blkmapp, xfs_dfiloff_t o,
+int blkmap_set_ext(blkmap_t **blkmapp, xfs_dfiloff_t o,
xfs_dfsbno_t b, xfs_dfilblks_t c);
xfs_dfsbno_t blkmap_get(blkmap_t *blkmap, xfs_dfiloff_t o);
diff --git a/repair/dinode.c b/repair/dinode.c
index 5a74538..39a0cb1 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -730,9 +730,27 @@ _("inode %" PRIu64 " - extent offset too large - start %" PRIu64 ", "
goto done;
}
- if (blkmapp && *blkmapp)
- blkmap_set_ext(blkmapp, irec.br_startoff,
+ if (blkmapp && *blkmapp) {
+ error = blkmap_set_ext(blkmapp, irec.br_startoff,
irec.br_startblock, irec.br_blockcount);
+ if (error) {
+ /*
+ * we don't want to clear the inode due to an
+ * internal bmap tracking error, but if we've
+ * run out of memory then we simply can't
+ * validate that the filesystem is consistent.
+ * Hence just abort at this point with an ENOMEM
+ * error.
+ */
+ do_abort(
+_("Fatal error: inode %" PRIu64 " - blkmap_set_ext(): %s\n"
+ "\t%s fork, off - %" PRIu64 ", start - %" PRIu64 ", cnt %" PRIu64 "\n"),
+ ino, strerror(error), forkname,
+ irec.br_startoff, irec.br_startblock,
+ irec.br_blockcount);
+ }
+ }
+
/*
* Profiling shows that the following loop takes the
* most time in all of xfs_repair.
--
1.7.5.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow
2011-10-10 1:08 ` [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow Dave Chinner
@ 2011-10-10 14:14 ` Christoph Hellwig
2011-10-13 9:57 ` Alex Elder
1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-10-10 14:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Oct 10, 2011 at 12:08:33PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> If blkmap_grow fails to allocate a new chunk of memory, it returns
> with a null blkmap. The sole caller of blkmap_grow does not check
> for this failure, and so will segfault if this error ever occurs.
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
Two comments on the code not directly related to your change:
- it seems like xfs_db has another copy of these blkmap routines,
which even missed the last round of updates during the repair
scalability work. It seems like it should be switched to reuse
the repair code
- growing the map by four seems to be incredibly inefficient for
large files. Given that the only caller actually knows how many
entries it processes in that batch we should grow it at least by
the number, reducing the allocations to one per call to
process_bmbt_reclist_int.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow
2011-10-10 1:08 ` [PATCH 3/5] repair: handle memory allocation failure from blkmap_grow Dave Chinner
2011-10-10 14:14 ` Christoph Hellwig
@ 2011-10-13 9:57 ` Alex Elder
1 sibling, 0 replies; 20+ messages in thread
From: Alex Elder @ 2011-10-13 9:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, 2011-10-10 at 12:08 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> If blkmap_grow fails to allocate a new chunk of memory, it returns
> with a null blkmap. The sole caller of blkmap_grow does not check
> for this failure, and so will segfault if this error ever occurs.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Looks good.
Reviewed-by: Alex Elder <aelder@sgi.com>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 20+ messages in thread