* [PATCH 0/2] xfs: handle mount failures more cleanly @ 2018-05-11 2:09 Dave Chinner 2018-05-11 2:09 ` [PATCH 1/2] xfs: add mount delay debug option Dave Chinner 2018-05-11 2:09 ` [PATCH 2/2] xfs: clear sb->s_fs_info on mount failure Dave Chinner 0 siblings, 2 replies; 5+ messages in thread From: Dave Chinner @ 2018-05-11 2:09 UTC (permalink / raw) To: linux-xfs Hi folks, This is the XFS side of the "fs: don't scan the inode cache before SB_BORN is set" changes to handle mount failures more cleanly. This addresses teh use-after-free problems that can occur when xfs_fs_fill_super() fails, frees the xfs_mount structure but then leaves the VFS superblock pointing to it. Th efirst patch is the mount delay sysfs option I've been using to test this. I'll post the fstest that uses it in a minute. Cheers, Dave. ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/2] xfs: add mount delay debug option 2018-05-11 2:09 [PATCH 0/2] xfs: handle mount failures more cleanly Dave Chinner @ 2018-05-11 2:09 ` Dave Chinner 2018-05-12 0:28 ` Darrick J. Wong 2018-05-11 2:09 ` [PATCH 2/2] xfs: clear sb->s_fs_info on mount failure Dave Chinner 1 sibling, 1 reply; 5+ messages in thread From: Dave Chinner @ 2018-05-11 2:09 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> Similar to log_recovery_delay, this delay occurs between the VFS superblock being initialised and the xfs_mount being fully initialised. It also poisons the per-ag radix tree node so that it can be used for triggering shrinker races during mount such as the following: <run memory pressure workload in background> $ cat dirty-mount.sh #! /bin/bash umount -f /dev/pmem0 mkfs.xfs -f /dev/pmem0 mount /dev/pmem0 /mnt/test rm -f /mnt/test/foo xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo umount /dev/pmem0 # let's crash it now! echo 30 > /sys/fs/xfs/debug/mount_delay mount /dev/pmem0 /mnt/test echo 0 > /sys/fs/xfs/debug/mount_delay umount /dev/pmem0 $ sudo ./dirty-mount.sh ..... [ 60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G D W 4.16.0-rc5-dgc #440 [ 60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320 [ 60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282 [ 60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a [ 60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000 [ 60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5 [ 60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598 [ 60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458 [ 60.390827] FS: 00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000 [ 60.392253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0 [ 60.394519] Call Trace: [ 60.395252] radix_tree_gang_lookup_tag+0xc4/0x130 [ 60.395948] xfs_perag_get_tag+0x37/0xf0 [ 60.396522] xfs_reclaim_inodes_count+0x32/0x40 [ 60.397178] xfs_fs_nr_cached_objects+0x11/0x20 [ 60.397837] super_cache_count+0x35/0xc0 [ 60.399159] shrink_slab.part.66+0xb1/0x370 [ 60.400194] shrink_node+0x7e/0x1a0 [ 60.401058] try_to_free_pages+0x199/0x470 [ 60.402081] __alloc_pages_slowpath+0x3a1/0xd20 [ 60.403729] __alloc_pages_nodemask+0x1c3/0x200 [ 60.404941] cache_grow_begin+0x20b/0x2e0 [ 60.406164] fallback_alloc+0x160/0x200 [ 60.407088] kmem_cache_alloc+0x111/0x4e0 [ 60.408038] ? xfs_buf_rele+0x61/0x430 [ 60.408925] kmem_zone_alloc+0x61/0xe0 [ 60.409965] xfs_inode_alloc+0x24/0x1d0 ..... Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> --- fs/xfs/xfs_globals.c | 1 + fs/xfs/xfs_super.c | 11 +++++++++++ fs/xfs/xfs_sysctl.h | 1 + fs/xfs/xfs_sysfs.c | 31 +++++++++++++++++++++++++++++++ 4 files changed, 44 insertions(+) diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c index 3e1cc3001bcb..fdde17a2333c 100644 --- a/fs/xfs/xfs_globals.c +++ b/fs/xfs/xfs_globals.c @@ -47,6 +47,7 @@ xfs_param_t xfs_params = { struct xfs_globals xfs_globals = { .log_recovery_delay = 0, /* no delay by default */ + .mount_delay = 0, /* no delay by default */ #ifdef XFS_ASSERT_FATAL .bug_on_assert = true, /* assert failures BUG() */ #else diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 5726ef496980..a523eaeb3f29 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1635,6 +1635,17 @@ xfs_fs_fill_super( #endif sb->s_op = &xfs_super_operations; + /* + * Delay mount work if the debug hook is set. This is debug + * instrumention to coordinate simulation of xfs mount failures with + * VFS superblock operations + */ + if (xfs_globals.mount_delay) { + xfs_notice(mp, "Delaying mount for %d seconds.", + xfs_globals.mount_delay); + msleep(xfs_globals.mount_delay * 1000); + } + if (silent) flags |= XFS_MFSI_QUIET; diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h index 82afee005140..b53a33e69932 100644 --- a/fs/xfs/xfs_sysctl.h +++ b/fs/xfs/xfs_sysctl.h @@ -95,6 +95,7 @@ extern xfs_param_t xfs_params; struct xfs_globals { int log_recovery_delay; /* log recovery delay (secs) */ + int mount_delay; /* mount setup delay (secs) */ bool bug_on_assert; /* BUG() the kernel on assert failure */ }; extern struct xfs_globals xfs_globals; diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c index 8b2ccc234f36..2d5cd2529f8e 100644 --- a/fs/xfs/xfs_sysfs.c +++ b/fs/xfs/xfs_sysfs.c @@ -165,9 +165,40 @@ log_recovery_delay_show( } XFS_SYSFS_ATTR_RW(log_recovery_delay); +STATIC ssize_t +mount_delay_store( + struct kobject *kobject, + const char *buf, + size_t count) +{ + int ret; + int val; + + ret = kstrtoint(buf, 0, &val); + if (ret) + return ret; + + if (val < 0 || val > 60) + return -EINVAL; + + xfs_globals.mount_delay = val; + + return count; +} + +STATIC ssize_t +mount_delay_show( + struct kobject *kobject, + char *buf) +{ + return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.mount_delay); +} +XFS_SYSFS_ATTR_RW(mount_delay); + static struct attribute *xfs_dbg_attrs[] = { ATTR_LIST(bug_on_assert), ATTR_LIST(log_recovery_delay), + ATTR_LIST(mount_delay), NULL, }; -- 2.17.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH 1/2] xfs: add mount delay debug option 2018-05-11 2:09 ` [PATCH 1/2] xfs: add mount delay debug option Dave Chinner @ 2018-05-12 0:28 ` Darrick J. Wong 0 siblings, 0 replies; 5+ messages in thread From: Darrick J. Wong @ 2018-05-12 0:28 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Fri, May 11, 2018 at 12:09:42PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Similar to log_recovery_delay, this delay occurs between the VFS > superblock being initialised and the xfs_mount being fully > initialised. It also poisons the per-ag radix tree node so that it > can be used for triggering shrinker races during mount > such as the following: > > <run memory pressure workload in background> > > $ cat dirty-mount.sh > #! /bin/bash > > umount -f /dev/pmem0 > mkfs.xfs -f /dev/pmem0 > mount /dev/pmem0 /mnt/test > rm -f /mnt/test/foo > xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo > umount /dev/pmem0 > > # let's crash it now! > echo 30 > /sys/fs/xfs/debug/mount_delay > mount /dev/pmem0 /mnt/test > echo 0 > /sys/fs/xfs/debug/mount_delay > umount /dev/pmem0 > $ sudo ./dirty-mount.sh > ..... > [ 60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G D W 4.16.0-rc5-dgc #440 > [ 60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 > [ 60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320 > [ 60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282 > [ 60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a > [ 60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000 > [ 60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5 > [ 60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598 > [ 60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458 > [ 60.390827] FS: 00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000 > [ 60.392253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0 > [ 60.394519] Call Trace: > [ 60.395252] radix_tree_gang_lookup_tag+0xc4/0x130 > [ 60.395948] xfs_perag_get_tag+0x37/0xf0 > [ 60.396522] xfs_reclaim_inodes_count+0x32/0x40 > [ 60.397178] xfs_fs_nr_cached_objects+0x11/0x20 > [ 60.397837] super_cache_count+0x35/0xc0 > [ 60.399159] shrink_slab.part.66+0xb1/0x370 > [ 60.400194] shrink_node+0x7e/0x1a0 > [ 60.401058] try_to_free_pages+0x199/0x470 > [ 60.402081] __alloc_pages_slowpath+0x3a1/0xd20 > [ 60.403729] __alloc_pages_nodemask+0x1c3/0x200 > [ 60.404941] cache_grow_begin+0x20b/0x2e0 > [ 60.406164] fallback_alloc+0x160/0x200 > [ 60.407088] kmem_cache_alloc+0x111/0x4e0 > [ 60.408038] ? xfs_buf_rele+0x61/0x430 > [ 60.408925] kmem_zone_alloc+0x61/0xe0 > [ 60.409965] xfs_inode_alloc+0x24/0x1d0 > ..... > > > Signed-Off-By: Dave Chinner <dchinner@redhat.com> > Reviewed-by: Brian Foster <bfoster@redhat.com> Looks ok, will test... Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > --- > fs/xfs/xfs_globals.c | 1 + > fs/xfs/xfs_super.c | 11 +++++++++++ > fs/xfs/xfs_sysctl.h | 1 + > fs/xfs/xfs_sysfs.c | 31 +++++++++++++++++++++++++++++++ > 4 files changed, 44 insertions(+) > > diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c > index 3e1cc3001bcb..fdde17a2333c 100644 > --- a/fs/xfs/xfs_globals.c > +++ b/fs/xfs/xfs_globals.c > @@ -47,6 +47,7 @@ xfs_param_t xfs_params = { > > struct xfs_globals xfs_globals = { > .log_recovery_delay = 0, /* no delay by default */ > + .mount_delay = 0, /* no delay by default */ > #ifdef XFS_ASSERT_FATAL > .bug_on_assert = true, /* assert failures BUG() */ > #else > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index 5726ef496980..a523eaeb3f29 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1635,6 +1635,17 @@ xfs_fs_fill_super( > #endif > sb->s_op = &xfs_super_operations; > > + /* > + * Delay mount work if the debug hook is set. This is debug > + * instrumention to coordinate simulation of xfs mount failures with > + * VFS superblock operations > + */ > + if (xfs_globals.mount_delay) { > + xfs_notice(mp, "Delaying mount for %d seconds.", > + xfs_globals.mount_delay); > + msleep(xfs_globals.mount_delay * 1000); > + } > + > if (silent) > flags |= XFS_MFSI_QUIET; > > diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h > index 82afee005140..b53a33e69932 100644 > --- a/fs/xfs/xfs_sysctl.h > +++ b/fs/xfs/xfs_sysctl.h > @@ -95,6 +95,7 @@ extern xfs_param_t xfs_params; > > struct xfs_globals { > int log_recovery_delay; /* log recovery delay (secs) */ > + int mount_delay; /* mount setup delay (secs) */ > bool bug_on_assert; /* BUG() the kernel on assert failure */ > }; > extern struct xfs_globals xfs_globals; > diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c > index 8b2ccc234f36..2d5cd2529f8e 100644 > --- a/fs/xfs/xfs_sysfs.c > +++ b/fs/xfs/xfs_sysfs.c > @@ -165,9 +165,40 @@ log_recovery_delay_show( > } > XFS_SYSFS_ATTR_RW(log_recovery_delay); > > +STATIC ssize_t > +mount_delay_store( > + struct kobject *kobject, > + const char *buf, > + size_t count) > +{ > + int ret; > + int val; > + > + ret = kstrtoint(buf, 0, &val); > + if (ret) > + return ret; > + > + if (val < 0 || val > 60) > + return -EINVAL; > + > + xfs_globals.mount_delay = val; > + > + return count; > +} > + > +STATIC ssize_t > +mount_delay_show( > + struct kobject *kobject, > + char *buf) > +{ > + return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.mount_delay); > +} > +XFS_SYSFS_ATTR_RW(mount_delay); > + > static struct attribute *xfs_dbg_attrs[] = { > ATTR_LIST(bug_on_assert), > ATTR_LIST(log_recovery_delay), > + ATTR_LIST(mount_delay), > NULL, > }; > > -- > 2.17.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 2/2] xfs: clear sb->s_fs_info on mount failure 2018-05-11 2:09 [PATCH 0/2] xfs: handle mount failures more cleanly Dave Chinner 2018-05-11 2:09 ` [PATCH 1/2] xfs: add mount delay debug option Dave Chinner @ 2018-05-11 2:09 ` Dave Chinner 2018-05-11 13:46 ` Brian Foster 1 sibling, 1 reply; 5+ messages in thread From: Dave Chinner @ 2018-05-11 2:09 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> We recently had an oops reported on a 4.14 kernel in xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage and so the m_perag_tree lookup walked into lala land. Essentially, the machine was under memory pressure when the mount was being run, xfs_fs_fill_super() failed after allocating the xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and freed the xfs_mount, but the sb->s_fs_info field still pointed to the freed memory. Hence when the superblock shrinker then ran it fell off the bad pointer. With the superblock shrinker problem fixed at teh VFS level, this stale s_fs_info pointer is still a problem - we use it unconditionally in ->put_super when the superblock is being torn down, and hence we can still trip over it after a ->fill_super call failure. Hence we need to clear s_fs_info if xfs-fs_fill_super() fails, and we need to check if it's valid in the places it can potentially be dereferenced after a ->fill_super failure. Signed-Off-By: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_super.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index a523eaeb3f29..d7e51b08919c 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1772,6 +1772,7 @@ xfs_fs_fill_super( out_close_devices: xfs_close_devices(mp); out_free_fsname: + sb->s_fs_info = NULL; xfs_free_fsname(mp); kfree(mp); out: @@ -1789,6 +1790,10 @@ xfs_fs_put_super( { struct xfs_mount *mp = XFS_M(sb); + /* if ->fill_super failed, we have no mount to tear down */ + if (!sb->s_fs_info) + return; + xfs_notice(mp, "Unmounting Filesystem"); xfs_filestream_unmount(mp); xfs_unmountfs(mp); @@ -1798,6 +1803,8 @@ xfs_fs_put_super( xfs_destroy_percpu_counters(mp); xfs_destroy_mount_workqueues(mp); xfs_close_devices(mp); + + sb->s_fs_info = NULL; xfs_free_fsname(mp); kfree(mp); } @@ -1817,6 +1824,9 @@ xfs_fs_nr_cached_objects( struct super_block *sb, struct shrink_control *sc) { + /* Paranoia: catch incorrect calls during mount setup or teardown */ + if (WARN_ON_ONCE(!sb->s_fs_info)) + return 0; return xfs_reclaim_inodes_count(XFS_M(sb)); } -- 2.17.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH 2/2] xfs: clear sb->s_fs_info on mount failure 2018-05-11 2:09 ` [PATCH 2/2] xfs: clear sb->s_fs_info on mount failure Dave Chinner @ 2018-05-11 13:46 ` Brian Foster 0 siblings, 0 replies; 5+ messages in thread From: Brian Foster @ 2018-05-11 13:46 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Fri, May 11, 2018 at 12:09:43PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > We recently had an oops reported on a 4.14 kernel in > xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage > and so the m_perag_tree lookup walked into lala land. > > Essentially, the machine was under memory pressure when the mount > was being run, xfs_fs_fill_super() failed after allocating the > xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and > freed the xfs_mount, but the sb->s_fs_info field still pointed to > the freed memory. Hence when the superblock shrinker then ran > it fell off the bad pointer. > > With the superblock shrinker problem fixed at teh VFS level, this > stale s_fs_info pointer is still a problem - we use it > unconditionally in ->put_super when the superblock is being torn > down, and hence we can still trip over it after a ->fill_super > call failure. Hence we need to clear s_fs_info if > xfs-fs_fill_super() fails, and we need to check if it's valid in > the places it can potentially be dereferenced after a ->fill_super > failure. > > Signed-Off-By: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_super.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index a523eaeb3f29..d7e51b08919c 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1772,6 +1772,7 @@ xfs_fs_fill_super( > out_close_devices: > xfs_close_devices(mp); > out_free_fsname: > + sb->s_fs_info = NULL; > xfs_free_fsname(mp); > kfree(mp); > out: > @@ -1789,6 +1790,10 @@ xfs_fs_put_super( > { > struct xfs_mount *mp = XFS_M(sb); > > + /* if ->fill_super failed, we have no mount to tear down */ > + if (!sb->s_fs_info) > + return; > + I'd still prefer we use either XFS_M() or ->s_fs_info consistently in this function, but that's a nit: Reviewed-by: Brian Foster <bfoster@redhat.com> > xfs_notice(mp, "Unmounting Filesystem"); > xfs_filestream_unmount(mp); > xfs_unmountfs(mp); > @@ -1798,6 +1803,8 @@ xfs_fs_put_super( > xfs_destroy_percpu_counters(mp); > xfs_destroy_mount_workqueues(mp); > xfs_close_devices(mp); > + > + sb->s_fs_info = NULL; > xfs_free_fsname(mp); > kfree(mp); > } > @@ -1817,6 +1824,9 @@ xfs_fs_nr_cached_objects( > struct super_block *sb, > struct shrink_control *sc) > { > + /* Paranoia: catch incorrect calls during mount setup or teardown */ > + if (WARN_ON_ONCE(!sb->s_fs_info)) > + return 0; > return xfs_reclaim_inodes_count(XFS_M(sb)); > } > > -- > 2.17.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2018-05-12 0:28 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-05-11 2:09 [PATCH 0/2] xfs: handle mount failures more cleanly Dave Chinner 2018-05-11 2:09 ` [PATCH 1/2] xfs: add mount delay debug option Dave Chinner 2018-05-12 0:28 ` Darrick J. Wong 2018-05-11 2:09 ` [PATCH 2/2] xfs: clear sb->s_fs_info on mount failure Dave Chinner 2018-05-11 13:46 ` Brian Foster
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).