* [PATCH] exit: move exit_task_namespaces() after exit_task_work() @ 2017-12-14 20:17 Cong Wang 2017-12-14 21:08 ` Al Viro 2017-12-15 6:56 ` Eric W. Biederman 0 siblings, 2 replies; 9+ messages in thread From: Cong Wang @ 2017-12-14 20:17 UTC (permalink / raw) To: linux-kernel Cc: Cong Wang, Ingo Molnar, Al Viro, Andrew Morton, Linus Torvalds, stable syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free. Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> --- kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/exit.c b/kernel/exit.c index 6b4298a41167..909e43c45158 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -861,8 +861,8 @@ void __noreturn do_exit(long code) exit_fs(tsk); if (group_dead) disassociate_ctty(1); - exit_task_namespaces(tsk); exit_task_work(tsk); + exit_task_namespaces(tsk); exit_thread(tsk); /* -- 2.13.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work() 2017-12-14 20:17 [PATCH] exit: move exit_task_namespaces() after exit_task_work() Cong Wang @ 2017-12-14 21:08 ` Al Viro 2017-12-15 23:59 ` Cong Wang 2017-12-15 6:56 ` Eric W. Biederman 1 sibling, 1 reply; 9+ messages in thread From: Al Viro @ 2017-12-14 21:08 UTC (permalink / raw) To: Cong Wang Cc: linux-kernel, Ingo Molnar, Andrew Morton, Linus Torvalds, stable On Thu, Dec 14, 2017 at 12:17:57PM -0800, Cong Wang wrote: > syzbot reported we have a use-after-free when mqueue_evict_inode() > is called on __cleanup_mnt() path, where the ipc ns is already > freed by the previous exit_task_namespaces(). We can just move > it after after exit_task_work() to avoid this use-after-free. What's to prevent somebody else holding a reference to the same inode past the exit(2)? IOW, I don't believe that this is fixing anything - in the best case, your patch papers over a specific reproducer. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work() 2017-12-14 21:08 ` Al Viro @ 2017-12-15 23:59 ` Cong Wang 2017-12-16 22:04 ` Giuseppe Scrivano 0 siblings, 1 reply; 9+ messages in thread From: Cong Wang @ 2017-12-15 23:59 UTC (permalink / raw) To: Al Viro; +Cc: LKML, Ingo Molnar, Andrew Morton, Linus Torvalds, stable, gscrivan On Thu, Dec 14, 2017 at 1:08 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Dec 14, 2017 at 12:17:57PM -0800, Cong Wang wrote: >> syzbot reported we have a use-after-free when mqueue_evict_inode() >> is called on __cleanup_mnt() path, where the ipc ns is already >> freed by the previous exit_task_namespaces(). We can just move >> it after after exit_task_work() to avoid this use-after-free. > > What's to prevent somebody else holding a reference to the same > inode past the exit(2)? IOW, I don't believe that this is fixing > anything - in the best case, your patch papers over a specific > reproducer. You are right, I missed mq_clear_sbinfo(). And the offending commit is: commit 9c583773d036336176e9e50441890659bc4eeae8 Author: Giuseppe Scrivano <gscrivan@redhat.com> Date: Fri Dec 15 01:06:28 2017 +0000 ipc, mqueue: lazy call kern_mount_data in new namespaces kern_mount_data is a relatively expensive operation when creating a new IPC namespace, so delay the mount until its first usage when not creating the the global namespace. This is a net saving for new IPC namespaces that don't use mq_open(). In this case there won't be any kern_mount_data() cost at all. On my machine, the time for creating 1000 new IPC namespaces dropped from ~8s to ~2s. Link: http://lkml.kernel.org/r/20171206151422.9660-1-gscrivan@redhat.com Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work() 2017-12-15 23:59 ` Cong Wang @ 2017-12-16 22:04 ` Giuseppe Scrivano 0 siblings, 0 replies; 9+ messages in thread From: Giuseppe Scrivano @ 2017-12-16 22:04 UTC (permalink / raw) To: Cong Wang Cc: Al Viro, LKML, Ingo Molnar, Andrew Morton, Linus Torvalds, stable Cong Wang <xiyou.wangcong@gmail.com> writes: > On Thu, Dec 14, 2017 at 1:08 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> On Thu, Dec 14, 2017 at 12:17:57PM -0800, Cong Wang wrote: >>> syzbot reported we have a use-after-free when mqueue_evict_inode() >>> is called on __cleanup_mnt() path, where the ipc ns is already >>> freed by the previous exit_task_namespaces(). We can just move >>> it after after exit_task_work() to avoid this use-after-free. >> >> What's to prevent somebody else holding a reference to the same >> inode past the exit(2)? IOW, I don't believe that this is fixing >> anything - in the best case, your patch papers over a specific >> reproducer. > > You are right, I missed mq_clear_sbinfo(). yes, the patch 9c583773d036336176e9e50441890659bc4eeae8 introduced an issue since it doesn't reset s_fs_info when ns->mq_mnt == NULL. The following patch solves the issue for me. I store the superblock in "struct ipc_namespace" since we cannot assume "mq_mnt" exists. Does the patch look fine? I'll clean it up and submit it separately if this approach is correct. Regards, Giuseppe diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h index 554e31494f69..66d4a5740a60 100644 --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -53,6 +53,7 @@ struct ipc_namespace { /* The kern_mount of the mqueuefs sb. We take a ref on it */ struct vfsmount *mq_mnt; + struct super_block *mq_sb; /* # queues in this ns, protected by mq_lock */ unsigned int mq_queues_count; diff --git a/ipc/mqueue.c b/ipc/mqueue.c index f78d6687a61b..16347cf1de2d 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -341,6 +341,7 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent) sb->s_root = d_make_root(inode); if (!sb->s_root) return -ENOMEM; + ns->mq_sb = sb; return 0; } @@ -1554,6 +1555,7 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount) ns->mq_msg_max = DFLT_MSGMAX; ns->mq_msgsize_max = DFLT_MSGSIZEMAX; ns->mq_msg_default = DFLT_MSG; + ns->mq_sb = NULL; ns->mq_msgsize_default = DFLT_MSGSIZE; if (!mount) @@ -1573,8 +1575,8 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount) void mq_clear_sbinfo(struct ipc_namespace *ns) { - if (ns->mq_mnt) - ns->mq_mnt->mnt_sb->s_fs_info = NULL; + if (ns->mq_sb) + ns->mq_sb->s_fs_info = NULL; } void mq_put_mnt(struct ipc_namespace *ns) ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work() 2017-12-14 20:17 [PATCH] exit: move exit_task_namespaces() after exit_task_work() Cong Wang 2017-12-14 21:08 ` Al Viro @ 2017-12-15 6:56 ` Eric W. Biederman 2017-12-15 7:35 ` Dmitry Vyukov 1 sibling, 1 reply; 9+ messages in thread From: Eric W. Biederman @ 2017-12-15 6:56 UTC (permalink / raw) To: Cong Wang Cc: linux-kernel, Ingo Molnar, Al Viro, Andrew Morton, Linus Torvalds, stable Cong Wang <xiyou.wangcong@gmail.com> writes: > syzbot reported we have a use-after-free when mqueue_evict_inode() > is called on __cleanup_mnt() path, where the ipc ns is already > freed by the previous exit_task_namespaces(). We can just move > it after after exit_task_work() to avoid this use-after-free. How does that possibly work. (I haven't seen this syzbot report). Looking at the code we have get_ns_from_inode. Which takes the mq_lock, sees if the pointer is NULL and takes a reference if it is non-NULL. Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held when the count drops to zero. Where is the race in that? The rest of mqueue_evict_inode uses the returned pointer and tests that the pointer is non-NULL before user it. So either szbot is giving you a bad report or there is a subtle race there I am not seeing. The change below is not at all the proper way to fix a subtle race. Eric > > Reported-by: syzbot <syzkaller@googlegroups.com> > Cc: Ingo Molnar <mingo@kernel.org> > Cc: Al Viro <viro@zeniv.linux.org.uk> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Cc: stable@vger.kernel.org > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> > --- > kernel/exit.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/exit.c b/kernel/exit.c > index 6b4298a41167..909e43c45158 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -861,8 +861,8 @@ void __noreturn do_exit(long code) > exit_fs(tsk); > if (group_dead) > disassociate_ctty(1); > - exit_task_namespaces(tsk); > exit_task_work(tsk); > + exit_task_namespaces(tsk); > exit_thread(tsk); > > /* ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work() 2017-12-15 6:56 ` Eric W. Biederman @ 2017-12-15 7:35 ` Dmitry Vyukov 2017-12-15 8:00 ` Dmitry Vyukov 0 siblings, 1 reply; 9+ messages in thread From: Dmitry Vyukov @ 2017-12-15 7:35 UTC (permalink / raw) To: Eric W. Biederman Cc: Cong Wang, LKML, Ingo Molnar, Al Viro, Andrew Morton, Linus Torvalds, stable On Fri, Dec 15, 2017 at 7:56 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > Cong Wang <xiyou.wangcong@gmail.com> writes: > >> syzbot reported we have a use-after-free when mqueue_evict_inode() >> is called on __cleanup_mnt() path, where the ipc ns is already >> freed by the previous exit_task_namespaces(). We can just move >> it after after exit_task_work() to avoid this use-after-free. > > How does that possibly work. (I haven't seen this syzbot report). > > Looking at the code we have get_ns_from_inode. Which takes the mq_lock, > sees if the pointer is NULL and takes a reference if it is non-NULL. > > Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held > when the count drops to zero. > > Where is the race in that? > > The rest of mqueue_evict_inode uses the returned pointer and > tests that the pointer is non-NULL before user it. > > So either szbot is giving you a bad report or there is a subtle race > there I am not seeing. The change below is not at all the proper way to > fix a subtle race. > > Eric Cong, what was that report? Searching by "exit_task_work|exit_task_namespaces" there are too many of them: https://groups.google.com/forum/#!searchin/syzkaller-bugs/%22exit_task_work$7Cexit_task_namespaces%22%7Csort:date I can only say that syzbot does not make up reports. That's something that actually happened and was provoked by userspace. >> Reported-by: syzbot <syzkaller@googlegroups.com> >> Cc: Ingo Molnar <mingo@kernel.org> >> Cc: Al Viro <viro@zeniv.linux.org.uk> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: Linus Torvalds <torvalds@linux-foundation.org> >> Cc: stable@vger.kernel.org >> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> >> --- >> kernel/exit.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/kernel/exit.c b/kernel/exit.c >> index 6b4298a41167..909e43c45158 100644 >> --- a/kernel/exit.c >> +++ b/kernel/exit.c >> @@ -861,8 +861,8 @@ void __noreturn do_exit(long code) >> exit_fs(tsk); >> if (group_dead) >> disassociate_ctty(1); >> - exit_task_namespaces(tsk); >> exit_task_work(tsk); >> + exit_task_namespaces(tsk); >> exit_thread(tsk); >> >> /* ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work() 2017-12-15 7:35 ` Dmitry Vyukov @ 2017-12-15 8:00 ` Dmitry Vyukov 2017-12-16 0:00 ` Cong Wang 0 siblings, 1 reply; 9+ messages in thread From: Dmitry Vyukov @ 2017-12-15 8:00 UTC (permalink / raw) To: Eric W. Biederman Cc: Cong Wang, LKML, Ingo Molnar, Al Viro, Andrew Morton, Linus Torvalds, stable On Fri, Dec 15, 2017 at 8:35 AM, Dmitry Vyukov <dvyukov@google.com> wrote: > On Fri, Dec 15, 2017 at 7:56 AM, Eric W. Biederman > <ebiederm@xmission.com> wrote: >> Cong Wang <xiyou.wangcong@gmail.com> writes: >> >>> syzbot reported we have a use-after-free when mqueue_evict_inode() >>> is called on __cleanup_mnt() path, where the ipc ns is already >>> freed by the previous exit_task_namespaces(). We can just move >>> it after after exit_task_work() to avoid this use-after-free. >> >> How does that possibly work. (I haven't seen this syzbot report). >> >> Looking at the code we have get_ns_from_inode. Which takes the mq_lock, >> sees if the pointer is NULL and takes a reference if it is non-NULL. >> >> Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held >> when the count drops to zero. >> >> Where is the race in that? >> >> The rest of mqueue_evict_inode uses the returned pointer and >> tests that the pointer is non-NULL before user it. >> >> So either szbot is giving you a bad report or there is a subtle race >> there I am not seeing. The change below is not at all the proper way to >> fix a subtle race. >> >> Eric > > Cong, what was that report? Searching by > "exit_task_work|exit_task_namespaces" there are too many of them: > https://groups.google.com/forum/#!searchin/syzkaller-bugs/%22exit_task_work$7Cexit_task_namespaces%22%7Csort:date > > I can only say that syzbot does not make up reports. That's something > that actually happened and was provoked by userspace. Ah, found that bug: https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ >>> Reported-by: syzbot <syzkaller@googlegroups.com> >>> Cc: Ingo Molnar <mingo@kernel.org> >>> Cc: Al Viro <viro@zeniv.linux.org.uk> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: Linus Torvalds <torvalds@linux-foundation.org> >>> Cc: stable@vger.kernel.org >>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> >>> --- >>> kernel/exit.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/kernel/exit.c b/kernel/exit.c >>> index 6b4298a41167..909e43c45158 100644 >>> --- a/kernel/exit.c >>> +++ b/kernel/exit.c >>> @@ -861,8 +861,8 @@ void __noreturn do_exit(long code) >>> exit_fs(tsk); >>> if (group_dead) >>> disassociate_ctty(1); >>> - exit_task_namespaces(tsk); >>> exit_task_work(tsk); >>> + exit_task_namespaces(tsk); >>> exit_thread(tsk); >>> >>> /* ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work() 2017-12-15 8:00 ` Dmitry Vyukov @ 2017-12-16 0:00 ` Cong Wang 2017-12-16 7:40 ` Dmitry Vyukov 0 siblings, 1 reply; 9+ messages in thread From: Cong Wang @ 2017-12-16 0:00 UTC (permalink / raw) To: Dmitry Vyukov Cc: Eric W. Biederman, LKML, Ingo Molnar, Al Viro, Andrew Morton, Linus Torvalds, stable On Fri, Dec 15, 2017 at 12:00 AM, Dmitry Vyukov <dvyukov@google.com> wrote: > On Fri, Dec 15, 2017 at 8:35 AM, Dmitry Vyukov <dvyukov@google.com> wrote: >> On Fri, Dec 15, 2017 at 7:56 AM, Eric W. Biederman >> <ebiederm@xmission.com> wrote: >>> Cong Wang <xiyou.wangcong@gmail.com> writes: >>> >>>> syzbot reported we have a use-after-free when mqueue_evict_inode() >>>> is called on __cleanup_mnt() path, where the ipc ns is already >>>> freed by the previous exit_task_namespaces(). We can just move >>>> it after after exit_task_work() to avoid this use-after-free. >>> >>> How does that possibly work. (I haven't seen this syzbot report). >>> >>> Looking at the code we have get_ns_from_inode. Which takes the mq_lock, >>> sees if the pointer is NULL and takes a reference if it is non-NULL. >>> >>> Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held >>> when the count drops to zero. >>> >>> Where is the race in that? >>> >>> The rest of mqueue_evict_inode uses the returned pointer and >>> tests that the pointer is non-NULL before user it. >>> >>> So either szbot is giving you a bad report or there is a subtle race >>> there I am not seeing. The change below is not at all the proper way to >>> fix a subtle race. >>> >>> Eric >> >> Cong, what was that report? Searching by >> "exit_task_work|exit_task_namespaces" there are too many of them: >> https://groups.google.com/forum/#!searchin/syzkaller-bugs/%22exit_task_work$7Cexit_task_namespaces%22%7Csort:date >> >> I can only say that syzbot does not make up reports. That's something >> that actually happened and was provoked by userspace. > > > Ah, found that bug: > https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ Yeah, and it is introduced by: http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/commit/?id=9c583773d036336176e9e50441890659bc4eeae8 ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work() 2017-12-16 0:00 ` Cong Wang @ 2017-12-16 7:40 ` Dmitry Vyukov 0 siblings, 0 replies; 9+ messages in thread From: Dmitry Vyukov @ 2017-12-16 7:40 UTC (permalink / raw) To: Cong Wang Cc: Eric W. Biederman, LKML, Ingo Molnar, Al Viro, Andrew Morton, Linus Torvalds, stable On Sat, Dec 16, 2017 at 1:00 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote: >>>>> syzbot reported we have a use-after-free when mqueue_evict_inode() >>>>> is called on __cleanup_mnt() path, where the ipc ns is already >>>>> freed by the previous exit_task_namespaces(). We can just move >>>>> it after after exit_task_work() to avoid this use-after-free. >>>> >>>> How does that possibly work. (I haven't seen this syzbot report). >>>> >>>> Looking at the code we have get_ns_from_inode. Which takes the mq_lock, >>>> sees if the pointer is NULL and takes a reference if it is non-NULL. >>>> >>>> Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held >>>> when the count drops to zero. >>>> >>>> Where is the race in that? >>>> >>>> The rest of mqueue_evict_inode uses the returned pointer and >>>> tests that the pointer is non-NULL before user it. >>>> >>>> So either szbot is giving you a bad report or there is a subtle race >>>> there I am not seeing. The change below is not at all the proper way to >>>> fix a subtle race. >>>> >>>> Eric >>> >>> Cong, what was that report? Searching by >>> "exit_task_work|exit_task_namespaces" there are too many of them: >>> https://groups.google.com/forum/#!searchin/syzkaller-bugs/%22exit_task_work$7Cexit_task_namespaces%22%7Csort:date >>> >>> I can only say that syzbot does not make up reports. That's something >>> that actually happened and was provoked by userspace. >> >> >> Ah, found that bug: >> https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ > > Yeah, and it is introduced by: > > http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/commit/?id=9c583773d036336176e9e50441890659bc4eeae8 If it's going to be discussed here, note that it was reported by syzbot and requires "#syz fix" tag after fixing here: https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/jGDCMoz_AQAJ ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2017-12-16 22:04 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-12-14 20:17 [PATCH] exit: move exit_task_namespaces() after exit_task_work() Cong Wang 2017-12-14 21:08 ` Al Viro 2017-12-15 23:59 ` Cong Wang 2017-12-16 22:04 ` Giuseppe Scrivano 2017-12-15 6:56 ` Eric W. Biederman 2017-12-15 7:35 ` Dmitry Vyukov 2017-12-15 8:00 ` Dmitry Vyukov 2017-12-16 0:00 ` Cong Wang 2017-12-16 7:40 ` Dmitry Vyukov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).