* [PATCH linux-next] mqueue: fix IPC namespace use-after-free
@ 2017-12-19 10:14 Giuseppe Scrivano
2017-12-19 11:48 ` Al Viro
0 siblings, 1 reply; 13+ messages in thread
From: Giuseppe Scrivano @ 2017-12-19 10:14 UTC (permalink / raw)
To: Andrew Morton
Cc: LKML, gscrivan, alexander.deucher, broonie, chris, David Miller,
deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar,
Neil Horman, syzkaller-bugs, Al Viro, Vladislav Yasevich
mqueue_evict_inode() doesn't access the ipc namespace if it was
already freed. It can happen if in a new IPC namespace the inode was
created without a prior mq_open() which creates the vfsmount used to
access the superblock from mq_clear_sbinfo().
Keep a direct pointer to the superblock used by the inodes so we can
correctly reset the reference to the IPC namespace being destroyed.
Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call
kern_mount_data in new namespaces")
==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:183
[inline]
BUG: KASAN: use-after-free in atomic_read arch/x86/include/asm/atomic.h:27
[inline]
BUG: KASAN: use-after-free in refcount_inc_not_zero+0x16e/0x180
lib/refcount.c:120
Read of size 4 at addr ffff8801c51bb200 by task syzkaller711981/3156
CPU: 1 PID: 3156 Comm: syzkaller711981 Not tainted 4.15.0-rc2-mm1+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
print_address_description+0x73/0x250 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x25b/0x340 mm/kasan/report.c:409
__asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
__read_once_size include/linux/compiler.h:183 [inline]
atomic_read arch/x86/include/asm/atomic.h:27 [inline]
refcount_inc_not_zero+0x16e/0x180 lib/refcount.c:120
refcount_inc+0x15/0x50 lib/refcount.c:153
get_ipc_ns include/linux/ipc_namespace.h:129 [inline]
__get_ns_from_inode ipc/mqueue.c:110 [inline]
get_ns_from_inode ipc/mqueue.c:118 [inline]
mqueue_evict_inode+0x137/0x9c0 ipc/mqueue.c:402
evict+0x481/0x920 fs/inode.c:552
iput_final fs/inode.c:1514 [inline]
iput+0x7b9/0xaf0 fs/inode.c:1541
dentry_unlink_inode+0x4b0/0x5e0 fs/dcache.c:376
__dentry_kill+0x3b7/0x6d0 fs/dcache.c:573
shrink_dentry_list+0x3c5/0xcf0 fs/dcache.c:1020
shrink_dcache_parent+0xba/0x230 fs/dcache.c:1454
do_one_tree+0x15/0x50 fs/dcache.c:1485
shrink_dcache_for_umount+0xbb/0x290 fs/dcache.c:1502
generic_shutdown_super+0xcd/0x540 fs/super.c:424
kill_anon_super fs/super.c:987 [inline]
kill_litter_super+0x72/0x90 fs/super.c:997
deactivate_locked_super+0x88/0xd0 fs/super.c:312
deactivate_super+0x141/0x1b0 fs/super.c:343
cleanup_mnt+0xb2/0x150 fs/namespace.c:1173
__cleanup_mnt+0x16/0x20 fs/namespace.c:1180
task_work_run+0x199/0x270 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x9bb/0x1ae0 kernel/exit.c:869
do_group_exit+0x149/0x400 kernel/exit.c:972
SYSC_exit_group kernel/exit.c:983 [inline]
SyS_exit_group+0x1d/0x20 kernel/exit.c:981
entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x440729
RSP: 002b:00007ffd090ef228 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0030656c69662f2e RCX: 0000000000440729
RDX: 0000000000440729 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00000000006cb018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000401bf0
R13: 0000000000401c80 R14: 0000000000000000 R15: 0000000000000000
Allocated by task 3156:
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3614
kmalloc include/linux/slab.h:516 [inline]
create_ipc_ns ipc/namespace.c:45 [inline]
copy_ipcs+0x1b3/0x520 ipc/namespace.c:96
create_new_namespaces+0x278/0x880 kernel/nsproxy.c:87
unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:206
SYSC_unshare kernel/fork.c:2421 [inline]
SyS_unshare+0x653/0xfa0 kernel/fork.c:2371
entry_SYSCALL_64_fastpath+0x1f/0x96
Freed by task 3156:
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
__cache_free mm/slab.c:3492 [inline]
kfree+0xca/0x250 mm/slab.c:3807
free_ipc_ns ipc/namespace.c:139 [inline]
put_ipc_ns+0x112/0x150 ipc/namespace.c:164
free_nsproxy+0xc0/0x1f0 kernel/nsproxy.c:180
switch_task_namespaces+0x9d/0xc0 kernel/nsproxy.c:229
exit_task_namespaces+0x17/0x20 kernel/nsproxy.c:234
do_exit+0x9b6/0x1ae0 kernel/exit.c:868
do_group_exit+0x149/0x400 kernel/exit.c:972
SYSC_exit_group kernel/exit.c:983 [inline]
SyS_exit_group+0x1d/0x20 kernel/exit.c:981
entry_SYSCALL_64_fastpath+0x1f/0x96
The buggy address belongs to the object at ffff8801c51bb200
which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 0 bytes inside of
2048-byte region [ffff8801c51bb200, ffff8801c51bba00)
The buggy address belongs to the page:
page:000000007764ba6d count:1 mapcount:0 mapping:000000002c36623f index:0x0
compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffff8801c51ba100 0000000000000000 0000000100000003
raw: ffffea000715d320 ffff8801dac01950 ffff8801dac00c40 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8801c51bb100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8801c51bb180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff8801c51bb200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801c51bb280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c51bb300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---
include/linux/ipc_namespace.h | 3 ++-
ipc/mqueue.c | 6 ++++--
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 554e31494f69..29ae2ede7602 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -52,7 +52,8 @@ struct ipc_namespace {
struct notifier_block ipcns_nb;
/* The kern_mount of the mqueuefs sb. We take a ref on it */
- struct vfsmount *mq_mnt;
+ struct vfsmount *mq_mnt;
+ struct super_block *mq_sb;
/* # queues in this ns, protected by mq_lock */
unsigned int mq_queues_count;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 36f177dcb39a..d664c0b0f075 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -341,6 +341,7 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
sb->s_root = d_make_root(inode);
if (!sb->s_root)
return -ENOMEM;
+ ns->mq_sb = sb;
return 0;
}
@@ -1554,6 +1555,7 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount)
ns->mq_msg_max = DFLT_MSGMAX;
ns->mq_msgsize_max = DFLT_MSGSIZEMAX;
ns->mq_msg_default = DFLT_MSG;
+ ns->mq_sb = NULL;
ns->mq_msgsize_default = DFLT_MSGSIZE;
if (!mount)
@@ -1573,8 +1575,8 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount)
void mq_clear_sbinfo(struct ipc_namespace *ns)
{
- if (ns->mq_mnt)
- ns->mq_mnt->mnt_sb->s_fs_info = NULL;
+ if (ns->mq_sb)
+ ns->mq_sb->s_fs_info = NULL;
}
void mq_put_mnt(struct ipc_namespace *ns)
--
2.14.3
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 10:14 [PATCH linux-next] mqueue: fix IPC namespace use-after-free Giuseppe Scrivano @ 2017-12-19 11:48 ` Al Viro 2017-12-19 15:32 ` Al Viro 0 siblings, 1 reply; 13+ messages in thread From: Al Viro @ 2017-12-19 11:48 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich On Tue, Dec 19, 2017 at 11:14:40AM +0100, Giuseppe Scrivano wrote: > mqueue_evict_inode() doesn't access the ipc namespace if it was > already freed. It can happen if in a new IPC namespace the inode was > created without a prior mq_open() which creates the vfsmount used to > access the superblock from mq_clear_sbinfo(). > > Keep a direct pointer to the superblock used by the inodes so we can > correctly reset the reference to the IPC namespace being destroyed. > > Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call > kern_mount_data in new namespaces") And just what will happen in the same scenario if you mount the damn thing in userland without ever calling mq_open(), touch a file there, then unmount and then leave the ipc namespace? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 11:48 ` Al Viro @ 2017-12-19 15:32 ` Al Viro 2017-12-19 15:44 ` Al Viro 2017-12-19 16:59 ` Giuseppe Scrivano 0 siblings, 2 replies; 13+ messages in thread From: Al Viro @ 2017-12-19 15:32 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich On Tue, Dec 19, 2017 at 11:48:19AM +0000, Al Viro wrote: > On Tue, Dec 19, 2017 at 11:14:40AM +0100, Giuseppe Scrivano wrote: > > mqueue_evict_inode() doesn't access the ipc namespace if it was > > already freed. It can happen if in a new IPC namespace the inode was > > created without a prior mq_open() which creates the vfsmount used to > > access the superblock from mq_clear_sbinfo(). > > > > Keep a direct pointer to the superblock used by the inodes so we can > > correctly reset the reference to the IPC namespace being destroyed. > > > > Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call > > kern_mount_data in new namespaces") > > And just what will happen in the same scenario if you mount the damn > thing in userland without ever calling mq_open(), touch a file there, > then unmount and then leave the ipc namespace? FWIW, the real solution would be to have userland mounts trigger the creation of internal one, same as mq_open() would. Something along these lines (completely untested, on top of vfs.git#for-next). Care to give it some beating? diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 10b82338415b..30327e201571 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -343,18 +343,46 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent) return 0; } +static struct file_system_type mqueue_fs_type; +/* + * Return value is pinned only by reference in ->mq_mnt; it will + * live until ipcns dies. Caller does not need to drop it. + */ +static struct vfsmount *mq_internal_mount(void) +{ + struct ipc_namespace *ns = current->nsproxy->ipc_ns; + struct vfsmount *m = ns->mq_mnt; + if (m) + return m; + m = kern_mount_data(&mqueue_fs_type, ns); + spin_lock(&mq_lock); + if (unlikely(ns->mq_mnt)) { + spin_unlock(&mq_lock); + if (!IS_ERR(m)) + kern_unmount(m); + return ns->mq_mnt; + } + if (!IS_ERR(m)) + ns->mq_mnt = m; + spin_unlock(&mq_lock); + return m; +} + static struct dentry *mqueue_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { - struct ipc_namespace *ns; - if (flags & MS_KERNMOUNT) { - ns = data; - data = NULL; - } else { - ns = current->nsproxy->ipc_ns; - } - return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super); + struct ipc_namespace *ns = data; + struct vfsmount *m; + if (flags & MS_KERNMOUNT) + return mount_ns(fs_type, flags, NULL, ns, ns->user_ns, + mqueue_fill_super); + m = mq_internal_mount(); + if (IS_ERR(m)) + return ERR_CAST(m); + atomic_inc(&m->mnt_sb->s_active); + down_write(&m->mnt_sb->s_umount); + return dget(m->mnt_root); } static void init_once(void *foo) @@ -743,13 +771,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro, static int do_mq_open(const char __user *u_name, int oflag, umode_t mode, struct mq_attr *attr) { - struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt; - struct dentry *root = mnt->mnt_root; + struct vfsmount *mnt = mq_internal_mount(); + struct dentry *root; struct filename *name; struct path path; int fd, error; int ro; + if (IS_ERR(mnt)) + return PTR_ERR(mnt); + audit_mq_open(oflag, mode, attr); if (IS_ERR(name = getname(u_name))) @@ -760,6 +791,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode, goto out_putname; ro = mnt_want_write(mnt); /* we'll drop it in any case */ + root = mnt->mnt_root; inode_lock(d_inode(root)); path.dentry = lookup_one_len(name->name, root, strlen(name->name)); if (IS_ERR(path.dentry)) { @@ -1535,27 +1567,24 @@ int mq_init_ns(struct ipc_namespace *ns) ns->mq_msg_default = DFLT_MSG; ns->mq_msgsize_default = DFLT_MSGSIZE; - ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns); - if (IS_ERR(ns->mq_mnt)) { - int err = PTR_ERR(ns->mq_mnt); - ns->mq_mnt = NULL; - return err; - } return 0; } void mq_clear_sbinfo(struct ipc_namespace *ns) { - ns->mq_mnt->mnt_sb->s_fs_info = NULL; + if (ns->mq_mnt) + ns->mq_mnt->mnt_sb->s_fs_info = NULL; } void mq_put_mnt(struct ipc_namespace *ns) { - kern_unmount(ns->mq_mnt); + if (ns->mq_mnt) + kern_unmount(ns->mq_mnt); } static int __init init_mqueue_fs(void) { + struct vfsmount *m; int error; mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache", @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void) if (error) goto out_filesystem; + m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns); + if (IS_ERR(m)) + goto out_filesystem; + init_ipc_ns.mq_mnt = m; return 0; out_filesystem: ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 15:32 ` Al Viro @ 2017-12-19 15:44 ` Al Viro 2017-12-19 16:31 ` Dmitry Vyukov 2017-12-19 16:59 ` Giuseppe Scrivano 1 sibling, 1 reply; 13+ messages in thread From: Al Viro @ 2017-12-19 15:44 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich On Tue, Dec 19, 2017 at 03:32:25PM +0000, Al Viro wrote: > + m = mq_internal_mount(); > + if (IS_ERR(m)) > + return ERR_CAST(m); > + atomic_inc(&m->mnt_sb->s_active); > + down_write(&m->mnt_sb->s_umount); > + return dget(m->mnt_root); Note: this is stripped down mount_subtree(m, ""), of course; it might make sense to recognize that case and bypass the create_mnt_ns/vfs_path_lookup/put_mnt_ns business in mount_subtree() when the relative pathname is empty, replacing it with path.mnt = mntget(mnt); path.dentry = dget(mnt->mnt_root); in such case. That'd allow to simply call mount_subtree() here. It would work as-is, but it's ridiculously heavy for such use right now. > static int __init init_mqueue_fs(void) > { > + struct vfsmount *m; > int error; > > mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache", > @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void) > if (error) > goto out_filesystem; > > + m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns); > + if (IS_ERR(m)) > + goto out_filesystem; > + init_ipc_ns.mq_mnt = m; > return 0; > > out_filesystem: Unrelated issue, but register_filesystem() should be the last thing module_init() of a filesystem driver does. It's a separate story, in any case... ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 15:44 ` Al Viro @ 2017-12-19 16:31 ` Dmitry Vyukov 2017-12-19 17:02 ` Giuseppe Scrivano 0 siblings, 1 reply; 13+ messages in thread From: Dmitry Vyukov @ 2017-12-19 16:31 UTC (permalink / raw) To: Al Viro Cc: Giuseppe Scrivano, Andrew Morton, LKML, alexander.deucher, Mark Brown, Chris Wilson, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich On Tue, Dec 19, 2017 at 4:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Tue, Dec 19, 2017 at 03:32:25PM +0000, Al Viro wrote: >> + m = mq_internal_mount(); >> + if (IS_ERR(m)) >> + return ERR_CAST(m); >> + atomic_inc(&m->mnt_sb->s_active); >> + down_write(&m->mnt_sb->s_umount); >> + return dget(m->mnt_root); > > Note: this is stripped down mount_subtree(m, ""), of course; > it might make sense to recognize that case and bypass the > create_mnt_ns/vfs_path_lookup/put_mnt_ns business in > mount_subtree() when the relative pathname is empty, replacing > it with path.mnt = mntget(mnt); path.dentry = dget(mnt->mnt_root); > in such case. That'd allow to simply call mount_subtree() here. > It would work as-is, but it's ridiculously heavy for such use > right now. > >> static int __init init_mqueue_fs(void) >> { >> + struct vfsmount *m; >> int error; >> >> mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache", >> @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void) >> if (error) >> goto out_filesystem; >> >> + m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns); >> + if (IS_ERR(m)) >> + goto out_filesystem; >> + init_ipc_ns.mq_mnt = m; >> return 0; >> >> out_filesystem: > > Unrelated issue, but register_filesystem() should be the last thing > module_init() of a filesystem driver does. It's a separate story, > in any case... Giuseppe, what report is this? If there is a reproducer, you can ask syzbot to test a patch. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 16:31 ` Dmitry Vyukov @ 2017-12-19 17:02 ` Giuseppe Scrivano 0 siblings, 0 replies; 13+ messages in thread From: Giuseppe Scrivano @ 2017-12-19 17:02 UTC (permalink / raw) To: Dmitry Vyukov Cc: Al Viro, Andrew Morton, LKML, alexander.deucher, Mark Brown, Chris Wilson, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich Dmitry Vyukov <dvyukov@google.com> writes: >> Unrelated issue, but register_filesystem() should be the last thing >> module_init() of a filesystem driver does. It's a separate story, >> in any case... > > Giuseppe, what report is this? > If there is a reproducer, you can ask syzbot to test a patch. I have tried locally the reproducer and the issue seems fixed both in Al's patch and in my version. In any case, the original issue was: https://groups.google.com/forum/#!msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ Thanks, Giuseppe ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 15:32 ` Al Viro 2017-12-19 15:44 ` Al Viro @ 2017-12-19 16:59 ` Giuseppe Scrivano 2017-12-19 18:40 ` Giuseppe Scrivano 1 sibling, 1 reply; 13+ messages in thread From: Giuseppe Scrivano @ 2017-12-19 16:59 UTC (permalink / raw) To: Al Viro Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich Al Viro <viro@ZenIV.linux.org.uk> writes: > On Tue, Dec 19, 2017 at 11:48:19AM +0000, Al Viro wrote: >> On Tue, Dec 19, 2017 at 11:14:40AM +0100, Giuseppe Scrivano wrote: >> > mqueue_evict_inode() doesn't access the ipc namespace if it was >> > already freed. It can happen if in a new IPC namespace the inode was >> > created without a prior mq_open() which creates the vfsmount used to >> > access the superblock from mq_clear_sbinfo(). >> > >> > Keep a direct pointer to the superblock used by the inodes so we can >> > correctly reset the reference to the IPC namespace being destroyed. >> > >> > Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call >> > kern_mount_data in new namespaces") >> >> And just what will happen in the same scenario if you mount the damn >> thing in userland without ever calling mq_open(), touch a file there, >> then unmount and then leave the ipc namespace? > > FWIW, the real solution would be to have userland mounts trigger the creation > of internal one, same as mq_open() would. Something along these lines > (completely untested, on top of vfs.git#for-next). Care to give it some > beating? thanks for the patch. It seems to work after this minor fixup on top of it: diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 30327e201571..636989a44fae 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -360,7 +360,7 @@ static struct vfsmount *mq_internal_mount(void) spin_unlock(&mq_lock); if (!IS_ERR(m)) kern_unmount(m); - return ns->mq_mnt; + return ns->mq_mnt; } if (!IS_ERR(m)) ns->mq_mnt = m; @@ -1560,6 +1560,7 @@ static struct file_system_type mqueue_fs_type = { int mq_init_ns(struct ipc_namespace *ns) { + ns->mq_mnt = NULL; ns->mq_queues_count = 0; ns->mq_queues_max = DFLT_QUEUESMAX; ns->mq_msg_max = DFLT_MSGMAX; The only issue I've seen with my version is that if I do: # unshare -im /bin/sh # mount -t mqueue mqueue /dev/mqueue # touch /dev/mqueue/foo # umount /dev/mqueue # mount -t mqueue mqueue /dev/mqueue then /dev/mqueue/foo doesn't exist at this point. Your patch does not have this problem and /dev/mqueue/foo is again accessible after the second mount. Regards, Giuseppe ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 16:59 ` Giuseppe Scrivano @ 2017-12-19 18:40 ` Giuseppe Scrivano 2017-12-19 20:14 ` Al Viro 0 siblings, 1 reply; 13+ messages in thread From: Giuseppe Scrivano @ 2017-12-19 18:40 UTC (permalink / raw) To: Al Viro Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich Giuseppe Scrivano <gscrivan@redhat.com> writes: > The only issue I've seen with my version is that if I do: > > # unshare -im /bin/sh > # mount -t mqueue mqueue /dev/mqueue > # touch /dev/mqueue/foo > # umount /dev/mqueue > # mount -t mqueue mqueue /dev/mqueue > > then /dev/mqueue/foo doesn't exist at this point. Your patch does not > have this problem and /dev/mqueue/foo is again accessible after the > second mount. although, how much is that of an issue? Is there any other way to delay the cost of kern_mount_data()? Most containers have /dev/mqueue mounted but it is not really going to be used. Would it be possible somehow to postpone it until the first inode is created? Thanks, Giuseppe ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 18:40 ` Giuseppe Scrivano @ 2017-12-19 20:14 ` Al Viro 2017-12-19 21:49 ` Eric W. Biederman 2017-12-21 19:19 ` Giuseppe Scrivano 0 siblings, 2 replies; 13+ messages in thread From: Al Viro @ 2017-12-19 20:14 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote: > Giuseppe Scrivano <gscrivan@redhat.com> writes: > > > The only issue I've seen with my version is that if I do: > > > > # unshare -im /bin/sh > > # mount -t mqueue mqueue /dev/mqueue > > # touch /dev/mqueue/foo > > # umount /dev/mqueue > > # mount -t mqueue mqueue /dev/mqueue > > > > then /dev/mqueue/foo doesn't exist at this point. Your patch does not > > have this problem and /dev/mqueue/foo is again accessible after the > > second mount. > > although, how much is that of an issue? Is there any other way to delay You do realize that with your patch you would end up with worse than that - mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed super_block. With really unpleasant effects when you quit that ipcns... > the cost of kern_mount_data()? Most containers have /dev/mqueue mounted > but it is not really going to be used. _What_ cost? At mount(2) time you are setting the superblock up anyway, so what would you be delaying? kmem_cache_alloc() for struct mount and assignments to its fields? That's noise; if anything, I would expect the main cost with a plenty of containers to be in sget() scanning the list of mqueue superblocks. And we can get rid of that, while we are at it - to hell with mount_ns(), with that approach we can just use mount_nodev() instead. The logics in mq_internal_mount() will deal with multiple instances - if somebody has already triggered creation of internal mount, all subsequent calls in that ipcns will end up avoiding kern_mount_data() entirely. And if you have two callers racing - sure, you will get two superblocks. Not for long, though - the first one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses and prompty destroys his vfsmount and superblock. I seriously suspect that variant below would cut down on the cost a whole lot more - as it is, we have the total of O(N^2) spent in the loop inside of sget_userns() when we create N ipcns and mount in each of those; this patch should cut that to O(N)... diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 10b82338415b..e9870b0cda29 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -325,8 +325,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb, static int mqueue_fill_super(struct super_block *sb, void *data, int silent) { struct inode *inode; - struct ipc_namespace *ns = sb->s_fs_info; + struct ipc_namespace *ns = data; + sb->s_fs_info = ns; sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV; sb->s_blocksize = PAGE_SIZE; sb->s_blocksize_bits = PAGE_SHIFT; @@ -343,18 +344,44 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent) return 0; } +static struct file_system_type mqueue_fs_type; +/* + * Return value is pinned only by reference in ->mq_mnt; it will + * live until ipcns dies. Caller does not need to drop it. + */ +static struct vfsmount *mq_internal_mount(void) +{ + struct ipc_namespace *ns = current->nsproxy->ipc_ns; + struct vfsmount *m = ns->mq_mnt; + if (m) + return m; + m = kern_mount_data(&mqueue_fs_type, ns); + spin_lock(&mq_lock); + if (unlikely(ns->mq_mnt)) { + spin_unlock(&mq_lock); + if (!IS_ERR(m)) + kern_unmount(m); + return ns->mq_mnt; + } + if (!IS_ERR(m)) + ns->mq_mnt = m; + spin_unlock(&mq_lock); + return m; +} + static struct dentry *mqueue_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { - struct ipc_namespace *ns; - if (flags & MS_KERNMOUNT) { - ns = data; - data = NULL; - } else { - ns = current->nsproxy->ipc_ns; - } - return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super); + struct vfsmount *m; + if (flags & MS_KERNMOUNT) + return mount_nodev(fs_type, flags, data, mqueue_fill_super); + m = mq_internal_mount(); + if (IS_ERR(m)) + return ERR_CAST(m); + atomic_inc(&m->mnt_sb->s_active); + down_write(&m->mnt_sb->s_umount); + return dget(m->mnt_root); } static void init_once(void *foo) @@ -743,13 +770,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro, static int do_mq_open(const char __user *u_name, int oflag, umode_t mode, struct mq_attr *attr) { - struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt; - struct dentry *root = mnt->mnt_root; + struct vfsmount *mnt = mq_internal_mount(); + struct dentry *root; struct filename *name; struct path path; int fd, error; int ro; + if (IS_ERR(mnt)) + return PTR_ERR(mnt); + audit_mq_open(oflag, mode, attr); if (IS_ERR(name = getname(u_name))) @@ -760,6 +790,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode, goto out_putname; ro = mnt_want_write(mnt); /* we'll drop it in any case */ + root = mnt->mnt_root; inode_lock(d_inode(root)); path.dentry = lookup_one_len(name->name, root, strlen(name->name)); if (IS_ERR(path.dentry)) { @@ -1534,28 +1565,26 @@ int mq_init_ns(struct ipc_namespace *ns) ns->mq_msgsize_max = DFLT_MSGSIZEMAX; ns->mq_msg_default = DFLT_MSG; ns->mq_msgsize_default = DFLT_MSGSIZE; + ns->mq_mnt = NULL; - ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns); - if (IS_ERR(ns->mq_mnt)) { - int err = PTR_ERR(ns->mq_mnt); - ns->mq_mnt = NULL; - return err; - } return 0; } void mq_clear_sbinfo(struct ipc_namespace *ns) { - ns->mq_mnt->mnt_sb->s_fs_info = NULL; + if (ns->mq_mnt) + ns->mq_mnt->mnt_sb->s_fs_info = NULL; } void mq_put_mnt(struct ipc_namespace *ns) { - kern_unmount(ns->mq_mnt); + if (ns->mq_mnt) + kern_unmount(ns->mq_mnt); } static int __init init_mqueue_fs(void) { + struct vfsmount *m; int error; mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache", @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void) if (error) goto out_filesystem; + m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns); + if (IS_ERR(m)) + goto out_filesystem; + init_ipc_ns.mq_mnt = m; return 0; out_filesystem: ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 20:14 ` Al Viro @ 2017-12-19 21:49 ` Eric W. Biederman 2017-12-19 22:40 ` Al Viro 2017-12-21 19:19 ` Giuseppe Scrivano 1 sibling, 1 reply; 13+ messages in thread From: Eric W. Biederman @ 2017-12-19 21:49 UTC (permalink / raw) To: Al Viro Cc: Giuseppe Scrivano, Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich Al Viro <viro@ZenIV.linux.org.uk> writes: > On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote: >> Giuseppe Scrivano <gscrivan@redhat.com> writes: >> >> > The only issue I've seen with my version is that if I do: >> > >> > # unshare -im /bin/sh >> > # mount -t mqueue mqueue /dev/mqueue >> > # touch /dev/mqueue/foo >> > # umount /dev/mqueue >> > # mount -t mqueue mqueue /dev/mqueue >> > >> > then /dev/mqueue/foo doesn't exist at this point. Your patch does not >> > have this problem and /dev/mqueue/foo is again accessible after the >> > second mount. >> >> although, how much is that of an issue? Is there any other way to delay > > You do realize that with your patch you would end up with worse than that - > mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed > super_block. With really unpleasant effects when you quit that ipcns... > >> the cost of kern_mount_data()? Most containers have /dev/mqueue mounted >> but it is not really going to be used. > > _What_ cost? At mount(2) time you are setting the superblock up anyway, so > what would you be delaying? kmem_cache_alloc() for struct mount and assignments > to its fields? That's noise; if anything, I would expect the main cost with > a plenty of containers to be in sget() scanning the list of mqueue superblocks. > And we can get rid of that, while we are at it - to hell with mount_ns(), with > that approach we can just use mount_nodev() instead. The logics in > mq_internal_mount() will deal with multiple instances - if somebody has already > triggered creation of internal mount, all subsequent calls in that ipcns will > end up avoiding kern_mount_data() entirely. And if you have two callers > racing - sure, you will get two superblocks. Not for long, though - the first > one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses > and prompty destroys his vfsmount and superblock. I seriously suspect that > variant below would cut down on the cost a whole lot more - as it is, we have > the total of O(N^2) spent in the loop inside of sget_userns() when we create > N ipcns and mount in each of those; this patch should cut that to > O(N)... If that is where the cost is, is there any point in delaying creating the internal mount at all? Eric > > diff --git a/ipc/mqueue.c b/ipc/mqueue.c > index 10b82338415b..e9870b0cda29 100644 > --- a/ipc/mqueue.c > +++ b/ipc/mqueue.c > @@ -325,8 +325,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb, > static int mqueue_fill_super(struct super_block *sb, void *data, int silent) > { > struct inode *inode; > - struct ipc_namespace *ns = sb->s_fs_info; > + struct ipc_namespace *ns = data; > > + sb->s_fs_info = ns; > sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV; > sb->s_blocksize = PAGE_SIZE; > sb->s_blocksize_bits = PAGE_SHIFT; > @@ -343,18 +344,44 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent) > return 0; > } > > +static struct file_system_type mqueue_fs_type; > +/* > + * Return value is pinned only by reference in ->mq_mnt; it will > + * live until ipcns dies. Caller does not need to drop it. > + */ > +static struct vfsmount *mq_internal_mount(void) > +{ > + struct ipc_namespace *ns = current->nsproxy->ipc_ns; > + struct vfsmount *m = ns->mq_mnt; > + if (m) > + return m; > + m = kern_mount_data(&mqueue_fs_type, ns); > + spin_lock(&mq_lock); > + if (unlikely(ns->mq_mnt)) { > + spin_unlock(&mq_lock); > + if (!IS_ERR(m)) > + kern_unmount(m); > + return ns->mq_mnt; > + } > + if (!IS_ERR(m)) > + ns->mq_mnt = m; > + spin_unlock(&mq_lock); > + return m; > +} > + > static struct dentry *mqueue_mount(struct file_system_type *fs_type, > int flags, const char *dev_name, > void *data) > { > - struct ipc_namespace *ns; > - if (flags & MS_KERNMOUNT) { > - ns = data; > - data = NULL; > - } else { > - ns = current->nsproxy->ipc_ns; > - } > - return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super); > + struct vfsmount *m; > + if (flags & MS_KERNMOUNT) > + return mount_nodev(fs_type, flags, data, mqueue_fill_super); > + m = mq_internal_mount(); > + if (IS_ERR(m)) > + return ERR_CAST(m); > + atomic_inc(&m->mnt_sb->s_active); > + down_write(&m->mnt_sb->s_umount); > + return dget(m->mnt_root); > } > > static void init_once(void *foo) > @@ -743,13 +770,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro, > static int do_mq_open(const char __user *u_name, int oflag, umode_t mode, > struct mq_attr *attr) > { > - struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt; > - struct dentry *root = mnt->mnt_root; > + struct vfsmount *mnt = mq_internal_mount(); > + struct dentry *root; > struct filename *name; > struct path path; > int fd, error; > int ro; > > + if (IS_ERR(mnt)) > + return PTR_ERR(mnt); > + > audit_mq_open(oflag, mode, attr); > > if (IS_ERR(name = getname(u_name))) > @@ -760,6 +790,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode, > goto out_putname; > > ro = mnt_want_write(mnt); /* we'll drop it in any case */ > + root = mnt->mnt_root; > inode_lock(d_inode(root)); > path.dentry = lookup_one_len(name->name, root, strlen(name->name)); > if (IS_ERR(path.dentry)) { > @@ -1534,28 +1565,26 @@ int mq_init_ns(struct ipc_namespace *ns) > ns->mq_msgsize_max = DFLT_MSGSIZEMAX; > ns->mq_msg_default = DFLT_MSG; > ns->mq_msgsize_default = DFLT_MSGSIZE; > + ns->mq_mnt = NULL; > > - ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns); > - if (IS_ERR(ns->mq_mnt)) { > - int err = PTR_ERR(ns->mq_mnt); > - ns->mq_mnt = NULL; > - return err; > - } > return 0; > } > > void mq_clear_sbinfo(struct ipc_namespace *ns) > { > - ns->mq_mnt->mnt_sb->s_fs_info = NULL; > + if (ns->mq_mnt) > + ns->mq_mnt->mnt_sb->s_fs_info = NULL; > } > > void mq_put_mnt(struct ipc_namespace *ns) > { > - kern_unmount(ns->mq_mnt); > + if (ns->mq_mnt) > + kern_unmount(ns->mq_mnt); > } > > static int __init init_mqueue_fs(void) > { > + struct vfsmount *m; > int error; > > mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache", > @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void) > if (error) > goto out_filesystem; > > + m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns); > + if (IS_ERR(m)) > + goto out_filesystem; > + init_ipc_ns.mq_mnt = m; > return 0; > > out_filesystem: ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 21:49 ` Eric W. Biederman @ 2017-12-19 22:40 ` Al Viro 2017-12-19 23:36 ` Eric W. Biederman 0 siblings, 1 reply; 13+ messages in thread From: Al Viro @ 2017-12-19 22:40 UTC (permalink / raw) To: Eric W. Biederman Cc: Giuseppe Scrivano, Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich On Tue, Dec 19, 2017 at 03:49:24PM -0600, Eric W. Biederman wrote: > > what would you be delaying? kmem_cache_alloc() for struct mount and assignments > > to its fields? That's noise; if anything, I would expect the main cost with > > a plenty of containers to be in sget() scanning the list of mqueue superblocks. > > And we can get rid of that, while we are at it - to hell with mount_ns(), with > > that approach we can just use mount_nodev() instead. The logics in > > mq_internal_mount() will deal with multiple instances - if somebody has already > > triggered creation of internal mount, all subsequent calls in that ipcns will > > end up avoiding kern_mount_data() entirely. And if you have two callers > > racing - sure, you will get two superblocks. Not for long, though - the first > > one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses > > and prompty destroys his vfsmount and superblock. I seriously suspect that > > variant below would cut down on the cost a whole lot more - as it is, we have > > the total of O(N^2) spent in the loop inside of sget_userns() when we create > > N ipcns and mount in each of those; this patch should cut that to > > O(N)... > > If that is where the cost is, is there any point in delaying creating > the internal mount at all? We won't know without the profiles... Incidentally, is there any point in using mount_ns() for procfs? Similar scheme (with ->proc_mnt instead of ->mq_mnt, of course) would live with mount_nodev() just fine, and it's definitely less costly - we don't bother with the loop in sget_userns() at all that way. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 22:40 ` Al Viro @ 2017-12-19 23:36 ` Eric W. Biederman 0 siblings, 0 replies; 13+ messages in thread From: Eric W. Biederman @ 2017-12-19 23:36 UTC (permalink / raw) To: Al Viro Cc: Giuseppe Scrivano, Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich Al Viro <viro@ZenIV.linux.org.uk> writes: > On Tue, Dec 19, 2017 at 03:49:24PM -0600, Eric W. Biederman wrote: >> > what would you be delaying? kmem_cache_alloc() for struct mount and assignments >> > to its fields? That's noise; if anything, I would expect the main cost with >> > a plenty of containers to be in sget() scanning the list of mqueue superblocks. >> > And we can get rid of that, while we are at it - to hell with mount_ns(), with >> > that approach we can just use mount_nodev() instead. The logics in >> > mq_internal_mount() will deal with multiple instances - if somebody has already >> > triggered creation of internal mount, all subsequent calls in that ipcns will >> > end up avoiding kern_mount_data() entirely. And if you have two callers >> > racing - sure, you will get two superblocks. Not for long, though - the first >> > one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses >> > and prompty destroys his vfsmount and superblock. I seriously suspect that >> > variant below would cut down on the cost a whole lot more - as it is, we have >> > the total of O(N^2) spent in the loop inside of sget_userns() when we create >> > N ipcns and mount in each of those; this patch should cut that to >> > O(N)... >> >> If that is where the cost is, is there any point in delaying creating >> the internal mount at all? > > We won't know without the profiles... Incidentally, is there any point in > using mount_ns() for procfs? Similar scheme (with ->proc_mnt instead of > ->mq_mnt, of course) would live with mount_nodev() just fine, and it's > definitely less costly - we don't bother with the loop in sget_userns() > at all that way. The mechanism of mqueuefs and proc are different for dealing with a filesystem that continues to be mounted/referenced after the namespace exists. Proc actually takes a reference on the pid namespace so it is easier to work with. pid_ns_prepare_proc and and pid_ns_release_proc are the namespace side of that dependency. So yes we could look at a local cache in the namespace and all would be well for proc. I don't know what we would use for locking when we start allowing more that one path to set it. atmoic_cmpxchg(&proc_mnt, NULL)? That makes me suspect we could have a common helper that does the work. I do know that the reason I moved proc to mount_ns is that it had simply been open coding that function. Eric ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free 2017-12-19 20:14 ` Al Viro 2017-12-19 21:49 ` Eric W. Biederman @ 2017-12-21 19:19 ` Giuseppe Scrivano 1 sibling, 0 replies; 13+ messages in thread From: Giuseppe Scrivano @ 2017-12-21 19:19 UTC (permalink / raw) To: Al Viro Cc: Andrew Morton, LKML, alexander.deucher, broonie, chris, David Miller, deepa.kernel, Greg KH, luc.vanoostenryck, lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs, Vladislav Yasevich Al Viro <viro@ZenIV.linux.org.uk> writes: > On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote: >> Giuseppe Scrivano <gscrivan@redhat.com> writes: >> >> > The only issue I've seen with my version is that if I do: >> > >> > # unshare -im /bin/sh >> > # mount -t mqueue mqueue /dev/mqueue >> > # touch /dev/mqueue/foo >> > # umount /dev/mqueue >> > # mount -t mqueue mqueue /dev/mqueue >> > >> > then /dev/mqueue/foo doesn't exist at this point. Your patch does not >> > have this problem and /dev/mqueue/foo is again accessible after the >> > second mount. >> >> although, how much is that of an issue? Is there any other way to delay > > You do realize that with your patch you would end up with worse than that - > mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed > super_block. With really unpleasant effects when you quit that ipcns... > >> the cost of kern_mount_data()? Most containers have /dev/mqueue mounted >> but it is not really going to be used. > > _What_ cost? At mount(2) time you are setting the superblock up anyway, so > what would you be delaying? kmem_cache_alloc() for struct mount and assignments > to its fields? That's noise; if anything, I would expect the main cost with > a plenty of containers to be in sget() scanning the list of mqueue superblocks. > And we can get rid of that, while we are at it - to hell with mount_ns(), with > that approach we can just use mount_nodev() instead. The logics in > mq_internal_mount() will deal with multiple instances - if somebody has already > triggered creation of internal mount, all subsequent calls in that ipcns will > end up avoiding kern_mount_data() entirely. And if you have two callers > racing - sure, you will get two superblocks. Not for long, though - the first > one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses > and prompty destroys his vfsmount and superblock. I seriously suspect that > variant below would cut down on the cost a whole lot more - as it is, we have > the total of O(N^2) spent in the loop inside of sget_userns() when we create > N ipcns and mount in each of those; this patch should cut that to O(N)... Thanks for the explanation. I see what you mean now and how this cost is inevitable as anyway we need to setup the superblock. Giuseppe ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2017-12-21 19:19 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-12-19 10:14 [PATCH linux-next] mqueue: fix IPC namespace use-after-free Giuseppe Scrivano 2017-12-19 11:48 ` Al Viro 2017-12-19 15:32 ` Al Viro 2017-12-19 15:44 ` Al Viro 2017-12-19 16:31 ` Dmitry Vyukov 2017-12-19 17:02 ` Giuseppe Scrivano 2017-12-19 16:59 ` Giuseppe Scrivano 2017-12-19 18:40 ` Giuseppe Scrivano 2017-12-19 20:14 ` Al Viro 2017-12-19 21:49 ` Eric W. Biederman 2017-12-19 22:40 ` Al Viro 2017-12-19 23:36 ` Eric W. Biederman 2017-12-21 19:19 ` Giuseppe Scrivano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).