* List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
@ 2015-09-13 19:53 Mathias Krause
  2015-09-14  2:39 ` Eric Wong
  2015-09-15 17:07 ` Rainer Weikusat
  0 siblings, 2 replies; 18+ messages in thread
From: Mathias Krause @ 2015-09-13 19:53 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
      Davidlohr Bueso, Olivier Mauras, PaX Team

[-- Attachment #1: Type: text/plain, Size: 11177 bytes --]

Hi,

this is an attempt to resurrect the thread initially started here:

  http://thread.gmane.org/gmane.linux.network/353003

While that patch fixed the issue for the mentioned reproducer, it did
not fix the bug for the production code Olivier is using. :(

Changing the reproducer only slightly allows me to trigger the following
list debug splat (CONFIG_DEBUG_LIST=y) reliably within seconds -- even
with the above linked patch applied:

[   50.264249] ------------[ cut here ]------------
[   50.264249] WARNING: CPU: 0 PID: 214 at lib/list_debug.c:59 __list_del_entry+0xa4/0xd0()
[   50.264249] list_del corruption. prev->next should be ffff88003c2c1bb8, but was ffff88003f07bbb8
[   50.264249] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
[   50.264249] CPU: 0 PID: 214 Comm: epoll_bug Not tainted 4.2.0 #75
[   50.264249] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[   50.264249]  ffffffff817e902e ffff880000087d08 ffffffff8155593c 0000000000000007
[   50.264249]  ffff880000087d58 ffff880000087d48 ffffffff8105202a 0000000000000001
[   50.264249]  ffff88003c2c1bb8 ffff88003f07bb80 0000000000000286 ffff88003f736640
[   50.264249] Call Trace:
[   50.264249]  [<ffffffff8155593c>] dump_stack+0x4c/0x65
[   50.264249]  [<ffffffff8105202a>] warn_slowpath_common+0x8a/0xc0
[   50.264249]  [<ffffffff810520a6>] warn_slowpath_fmt+0x46/0x50
[   50.264249]  [<ffffffff81322354>] __list_del_entry+0xa4/0xd0
[   50.264249]  [<ffffffff81322391>] list_del+0x11/0x40
[   50.264249]  [<ffffffff81094d39>] remove_wait_queue+0x29/0x40
[   50.264249]  [<ffffffff811bc898>] ep_unregister_pollwait.isra.6+0x58/0x1a0
[   50.264249]  [<ffffffff811bc8e9>] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
[   50.264249]  [<ffffffff811bca02>] ep_remove+0x22/0x110
[   50.264249]  [<ffffffff811be28b>] SyS_epoll_ctl+0x62b/0xf70
[   50.264249]  [<ffffffff81000f44>] ? lockdep_sys_exit_thunk+0x12/0x14
[   50.264249]  [<ffffffff8155cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[   50.264249] ---[ end trace d9af9b915df9667e ]---
[   50.572100] ------------[ cut here ]------------
[   50.572100] WARNING: CPU: 1 PID: 212 at lib/list_debug.c:62 __list_del_entry+0xc3/0xd0()
[   50.584263] list_del corruption. next->prev should be ffff88003f664c90, but was ffff88003f0cb5b8
[   50.584263] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
[   50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: G W 4.2.0 #75
[   50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[   50.584263]  ffffffff817e902e ffff88003d37fce8 ffffffff8155593c 0000000000000007
[   50.584263]  ffff88003d37fd38 ffff88003d37fd28 ffffffff8105202a 0000000000000001
[   50.584263]  ffff88003f664c90 ffff88003f0cb580 0000000000000282 ffff88003f731740
[   50.584263] Call Trace:
[   50.584263]  [<ffffffff8155593c>] dump_stack+0x4c/0x65
[   50.584263]  [<ffffffff8105202a>] warn_slowpath_common+0x8a/0xc0
[   50.584263]  [<ffffffff810520a6>] warn_slowpath_fmt+0x46/0x50
[   50.584263]  [<ffffffff81322373>] __list_del_entry+0xc3/0xd0
[   50.584263]  [<ffffffff81322391>] list_del+0x11/0x40
[   50.584263]  [<ffffffff81094d39>] remove_wait_queue+0x29/0x40
[   50.584263]  [<ffffffff811bc898>] ep_unregister_pollwait.isra.6+0x58/0x1a0
[   50.584263]  [<ffffffff811bc8e9>] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
[   50.584263]  [<ffffffff811bca02>] ep_remove+0x22/0x110
[   50.584263]  [<ffffffff811bda62>] eventpoll_release_file+0x62/0xa0
[   50.584263]  [<ffffffff8117704f>] __fput+0x1af/0x200
[   50.584263]  [<ffffffff8155cf20>] ? int_very_careful+0x5/0x3f
[   50.584263]  [<ffffffff811770ee>] ____fput+0xe/0x10
[   50.584263]  [<ffffffff8107271d>] task_work_run+0x8d/0xc0
[   50.584263]  [<ffffffff8100390f>] do_notify_resume+0x4f/0x60
[   50.584263]  [<ffffffff8155cf6c>] int_signal+0x12/0x17
[   50.584263] ---[ end trace d9af9b915df9667f ]---
[   50.584263] BUG: spinlock already unlocked on CPU#1, epoll_bug/212
[   50.584263]  lock: 0xffff88003f0cb580, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
[   50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: G W 4.2.0 #75
[   50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[   50.584263]  ffff88003f0cb580 ffff88003d37fd38 ffffffff8155593c 0000000000000007
[   50.584263]  ffffffffffffffff ffff88003d37fd58 ffffffff810a3375 ffff88003f0cb580
[   50.584263]  ffffffff817b9cc8 ffff88003d37fd78 ffffffff810a33f6 ffff88003f0cb580
[   50.584263] Call Trace:
[   50.584263]  [<ffffffff8155593c>] dump_stack+0x4c/0x65
[   50.584263]  [<ffffffff810a3375>] spin_dump+0x85/0xe0
[   50.584263]  [<ffffffff810a33f6>] spin_bug+0x26/0x30
[   50.584263]  [<ffffffff810a3645>] do_raw_spin_unlock+0x75/0xa0
[   50.584263]  [<ffffffff8155c4ec>] _raw_spin_unlock_irqrestore+0x2c/0x50
[   50.584263]  [<ffffffff81094d44>] remove_wait_queue+0x34/0x40
[   50.584263]  [<ffffffff811bc898>] ep_unregister_pollwait.isra.6+0x58/0x1a0
[   50.584263]  [<ffffffff811bc8e9>] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
[   50.584263]  [<ffffffff811bca02>] ep_remove+0x22/0x110
[   50.584263]  [<ffffffff811bda62>] eventpoll_release_file+0x62/0xa0
[   50.584263]  [<ffffffff8117704f>] __fput+0x1af/0x200
[   50.584263]  [<ffffffff8155cf20>] ? int_very_careful+0x5/0x3f
[   50.584263]  [<ffffffff811770ee>] ____fput+0xe/0x10
[   50.584263]  [<ffffffff8107271d>] task_work_run+0x8d/0xc0
[   50.584263]  [<ffffffff8100390f>] do_notify_resume+0x4f/0x60
[   50.584263]  [<ffffffff8155cf6c>] int_signal+0x12/0x17
[...]

That 'spinlock already unlocked' message is also interesting.
But even better, enabling slab debugging (CONFIG_SLUB_DEBUG_ON=y) turns
that list_del corruption warning into a GPF:

[   21.124241] general protection fault: 0000 [#1] SMP
[   21.128193] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
[   21.144249] CPU: 1 PID: 221 Comm: epoll_bug Not tainted 4.2.0 #75
[   21.144249] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[   21.144249] task: ffff88001fa82b80 ti: ffff880018894000 task.ti: ffff880018894000
[   21.144249] RIP: 0010:[<ffffffff8109def0>]  [<ffffffff8109def0>] __lock_acquire+0x240/0x1800
[   21.144249] RSP: 0018:ffff880018897c98  EFLAGS: 00010002
[   21.144249] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
[   21.144249] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88001f8b1c18
[   21.144249] RBP: ffff880018897d28 R08: 0000000000000001 R09: 0000000000000001
[   21.144249] R10: 0000000000000000 R11: ffff88001f8b1c18 R12: 0000000000000000
[   21.144249] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88001fa82b80
[   21.144249] FS:  00007f0c87e5d700(0000) GS:ffff88001eb00000(0000) knlGS:0000000000000000
[   21.144249] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   21.144249] CR2: 00007f0c87e5cff8 CR3: 000000001899c000 CR4: 00000000001406e0
[   21.144249] Stack:
[   21.144249]  ffff880018897cb8 ffffffff8109d94b ffff880018b6fdb0 ffff88001e3d3b80
[   21.144249]  ffff880018897cc8 ffff88001fa83250 0000000000000002 0000000000000000
[   21.144249]  0000000000000001 ffff88001fa82b80 ffff880018897d88 ffffffff8109e0f7
[   21.144249] Call Trace:
[   21.144249]  [<ffffffff8109d94b>] ? trace_hardirqs_on_caller+0x14b/0x1e0
[   21.144249]  [<ffffffff8109e0f7>] ? __lock_acquire+0x447/0x1800
[   21.144249]  [<ffffffff8109fdd7>] lock_acquire+0xc7/0x260
[   21.144249]  [<ffffffff81094d2d>] ? remove_wait_queue+0x1d/0x40
[   21.144249]  [<ffffffff8155c373>] _raw_spin_lock_irqsave+0x43/0x60
[   21.144249]  [<ffffffff81094d2d>] ? remove_wait_queue+0x1d/0x40
[   21.144249]  [<ffffffff81094d2d>] remove_wait_queue+0x1d/0x40
[   21.144249]  [<ffffffff811bc898>] ep_unregister_pollwait.isra.6+0x58/0x1a0
[   21.144249]  [<ffffffff811bc8e9>] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
[   21.144249]  [<ffffffff811bca02>] ep_remove+0x22/0x110
[   21.144249]  [<ffffffff811be28b>] SyS_epoll_ctl+0x62b/0xf70
[   21.144249]  [<ffffffff81000f44>] ? lockdep_sys_exit_thunk+0x12/0x14
[   21.144249]  [<ffffffff8155cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[   21.144249] Code: 49 81 3b c0 9e c4 81 b8 00 00 00 00 44 0f 44 c0 41 83 fe 01 0f 87 39 fe ff ff 44 89 f0 49 8b 5c c3 08 48 85 db 0f 84 28 fe ff ff <f0> ff 83 98 01 00 00 45 8b b7 a0 06 00 00 41 83 fe 2f 76 10 44
[   21.144249] RIP  [<ffffffff8109def0>] __lock_acquire+0x240/0x1800
[   21.144249]  RSP <ffff880018897c98>
[   21.144249] ---[ end trace 7136cfe3b6480f34 ]---

The slab poisoning hits, as can be seen by the poison pattern in RBX
which the faulting instruction is using as a memory operand. So this is
probably a use-after-free bug -- or, more likely, an object that should
have been freed via RCU semantics.

I looked at the epoll code really hard and concluded I don't understand
it at all. I've added a few calls to synchronize_rcu() and changed direct
pointer assignments to rcu_assign_pointer() in places I thought would
need them. But that only made the race happen less often; it did not
cure it.

This is what I did:
- add a call to synchronize_rcu() in eventpoll_release_file() after
  taking the epmutex -- it's traversing an RCU list, after all
  (sketched below)
- make the NULL pointer assignment of whead in ep_poll_callback() an
  rcu_assign_pointer() assignment and call synchronize_rcu() afterwards.
  It's also an RCU pointer that should be assigned that way, no?
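The first of those changes corresponds roughly to the sketch below
(function body abridged from fs/eventpoll.c of this kernel era; details
may differ):

/* Sketch only -- NOT a fix: eventpoll_release_file() with an added
 * synchronize_rcu() after taking epmutex, so that concurrent RCU
 * readers drain before the per-file epitem list is torn down.
 */
void eventpoll_release_file(struct file *file)
{
        struct eventpoll *ep;
        struct epitem *epi, *next;

        mutex_lock(&epmutex);
        synchronize_rcu();      /* added: wait for RCU read-side sections */

        list_for_each_entry_safe(epi, next, &file->f_ep_links, fllink) {
                ep = epi->ep;
                mutex_lock_nested(&ep->mtx, 1);
                ep_remove(ep, epi);
                mutex_unlock(&ep->mtx);
        }
        mutex_unlock(&epmutex);
}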
But, apparently, all irrelevant.

I had the same luck while staring at the af_unix code. I've added a few
unix_state_lock() / -unlock() calls to places I thought would need them
to be able to reliably test / set socket flags and the peer member of
struct unix_sock. But again, that only made the bug happen less often.

What I did:
- take other's unix_state_lock() in unix_dgram_disconnected() for
  testing the flags and signaling the error
- move the 'unix_peer(sk) = NULL' assignment to the section that holds
  the unix_state_lock() in unix_release_sock(). This ensures others will
  see changes to the peer pointer atomically -- when themselves making
  use of the lock only, of course.
- in unix_dgram_poll() avoid calling sock_poll_wait() if the peer is
  already SOCK_DEAD or RCV_SHUTDOWN. For testing the flags, other's
  unix_state_lock() is taken.

The last one is, in fact, the old patch, extended by the
'(other->sk_shutdown & RCV_SHUTDOWN)' test. But Eric already noted back
then that it might be an expensive lock to take here.

Anyways, none of the above changes fixed the issue. I suspect it's
related to the double usage of the peer_wq wait queue in
unix_dgram_sendmsg() (via unix_wait_for_peer()) and unix_dgram_poll()
(via sock_poll_wait()). But I might be totally wrong here.

However, it's definitely the second sock_poll_wait() call in
unix_dgram_poll() that triggers the issue (see the excerpt below).
Commenting out the call -- thereby obviously breaking its functionality
-- gets rid of the list debug splat and the GPF for me. But that's not
a fix either.
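For reference, here is the relevant part of unix_dgram_poll() (v4.2,
abridged -- the elided parts compute the mask bits):

/* abridged: only the two wait queue registrations are shown */
static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
                                    poll_table *wait)
{
        struct sock *sk = sock->sk, *other;
        unsigned int mask, writable;

        /* first registration: our own sleep queue */
        sock_poll_wait(file, sk_sleep(sk), wait);
        ...
        writable = unix_writable(sk);
        other = unix_peer_get(sk);
        if (other) {
                if (unix_peer(other) != sk) {
                        /* second registration: the *peer's* peer_wait queue */
                        sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
                        if (unix_recvq_full(other))
                                writable = 0;
                }
                sock_put(other);
        }
        ...
        return mask;
}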
So I'm asking for help.

Regards,
Mathias

[-- Attachment #2: epoll_bug.c --]
[-- Type: text/x-csrc, Size: 2922 bytes --]

/* use-after-free in poll routine of AF_UNIX sockets, triggerable using epoll
 *
 * ..introduced in 3c73419c09 "af_unix: fix 'poll for write'/ connected DGRAM
 * sockets" (v2.6.26-rc7)
 *
 * $ gcc -pthread -o epoll_bug epoll_bug.c
 *
 * - minipli
 */
#include <sys/socket.h>
#include <sys/epoll.h>
#include <sys/un.h>
#include <pthread.h>
#include <unistd.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>


static long fd_max;
static int ep = -1;


static int get_fd(void) {
        int fd;

        for (;;) {
                fd = rand() % fd_max;

                if (fd > 2 && fd != ep)
                        break;
        }

        return fd;
}


static void *opener(void *ptr) {
        sleep(1);

        for (;;) {
                if (rand() % 2) {
                        struct sockaddr_un sa = {
                                .sun_family = AF_UNIX,
                                .sun_path = "\0epool_bug-",
                        };
                        int sock = socket(AF_UNIX, SOCK_DGRAM, 0);
                        int err;

                        /* take a short nap when there are no more fds left so closer() can
                         * catch up */
                        if (sock < 0) {
                                usleep(1);

                                continue;
                        }

                        /* ensure the write won't block */
                        fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);

                        sa.sun_path[11] = rand() % 26 + 'A';
                        if (rand() % 2)
                                err = connect(sock, (struct sockaddr *) &sa, sizeof(sa));
                        else
                                err = bind(sock, (struct sockaddr *) &sa, sizeof(sa));

                        if (err)
                                close(sock);
                } else {
                        static const char dot[] = { [0 ... 1023] = '.' };

                        write(get_fd(), dot, rand() % 2 ? 1 : sizeof(dot));
                }
        }

        return ptr;
}


static void *closer(void *ptr) {
        int miss = 0;

        sleep(1);

        for (;;) {
                errno = 0;
                close(get_fd());

                /* take a short nap when we're hitting invalid fds 5 times in a row so
                 * opener() can catch up */
                if (errno == EBADF && ++miss >= 5) {
                        usleep(10);
                        miss = 0;
                } else if (errno == 0) {
                        miss = 0;
                }
        }

        return ptr;
}


static void *ep_add(void *ptr) {
        sleep(1);

        for (;;) {
                int fd = get_fd();
                struct epoll_event ev = {
                        .events = EPOLLIN | EPOLLOUT,
                        .data.fd = fd,
                };

                if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0 && errno == ENOSPC)
                        usleep(1);
        }

        return ptr;
}


static void *ep_del(void *ptr) {
        sleep(1);

        for (;;)
                epoll_ctl(ep, EPOLL_CTL_DEL, get_fd(), NULL);

        return ptr;
}


int main(void) {
        pthread_t thread[4];
        int i;

        signal(SIGPIPE, SIG_IGN);

        ep = epoll_create(42);  /* use epoll_create() for older kernels */
        if (ep < 0) {
                fprintf(stderr, "err: epoll_create1() failed (%s)\n", strerror(errno));

                return 1;
        }

        fd_max = sysconf(_SC_OPEN_MAX);
        if (pthread_create(&thread[0], NULL, opener, NULL) ||
            pthread_create(&thread[1], NULL, closer, NULL) ||
            pthread_create(&thread[2], NULL, ep_add, NULL) ||
            pthread_create(&thread[3], NULL, ep_del, NULL))
        {
                fprintf(stderr, "err: failed to start all threads!\n");

                return 1;
        }

        /* XXX: pthread_cancel() all threads on termination request */

        for (i = 0; i < 4; i++)
                pthread_join(thread[i], NULL);

        return 0;
}

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-13 19:53 List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket Mathias Krause
@ 2015-09-14  2:39 ` Eric Wong
  2015-09-29 18:09   ` Mathias Krause
  2015-09-15 17:07 ` Rainer Weikusat
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Wong @ 2015-09-14 2:39 UTC (permalink / raw)
  To: Mathias Krause
  Cc: netdev, linux-kernel, Eric Dumazet, Rainer Weikusat, Alexander Viro,
      Davide Libenzi, Davidlohr Bueso, Olivier Mauras, PaX Team, Jason Baron

+cc Jason Baron since he might be able to provide more insight into
epoll.

Mathias Krause <minipli@googlemail.com> wrote:
> [...]

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-14  2:39 ` Eric Wong
@ 2015-09-29 18:09   ` Mathias Krause
  2015-09-29 19:09     ` Jason Baron
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Krause @ 2015-09-29 18:09 UTC (permalink / raw)
  To: netdev, linux-kernel@vger.kernel.org
  Cc: Eric Wong, Eric Dumazet, Rainer Weikusat, Alexander Viro,
      Davide Libenzi, Davidlohr Bueso, Olivier Mauras, PaX Team, Jason Baron

On 14 September 2015 at 04:39, Eric Wong <normalperson@yhbt.net> wrote:
> +cc Jason Baron since he might be able to provide more insight into
> epoll.
>
> Mathias Krause <minipli@googlemail.com> wrote:
>> [...]

Ping. This still triggers on v4.3-rc3.
Any epoll expert around to have a look at this?

Regards,
Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-29 18:09   ` Mathias Krause
@ 2015-09-29 19:09     ` Jason Baron
  2015-09-30  5:54       ` Mathias Krause
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Baron @ 2015-09-29 19:09 UTC (permalink / raw)
  To: Mathias Krause, netdev, linux-kernel@vger.kernel.org
  Cc: Eric Wong, Eric Dumazet, Rainer Weikusat, Alexander Viro,
      Davide Libenzi, Davidlohr Bueso, Olivier Mauras, PaX Team,
      Linus Torvalds, eric.dumazet, peterz@infradead.org, davem@davemloft.net

On 09/29/2015 02:09 PM, Mathias Krause wrote:
> On 14 September 2015 at 04:39, Eric Wong <normalperson@yhbt.net> wrote:
>> +cc Jason Baron since he might be able to provide more insight into
>> epoll.
>>
>> Mathias Krause <minipli@googlemail.com> wrote:
>>> [...]
>
> Ping. This still triggers on v4.3-rc3.
> Any epoll expert around to have a look at this?
>
> Regards,
> Mathias

Hi,

So the issue is that unix_dgram_poll() registers not only the wait
queue of the socket that we are polling against, but also the wait
queue of the socket 'other' with epoll(). So if we have socket 's' and
we add it to an epoll set 'ep', then normally, when either the 'ep' set
closes or the socket 's' closes, we remove all references to the wait
queue.

Now, in this case we can call connect() on socket 's' to associate it
with socket 'o'. Thus, epoll now has a reference to both the wait queue
for socket 's' *and* the one for socket 'o'. Now, this works most of
the time on close() b/c socket 's' takes a reference on socket 'o' when
it does a connect. Thus, if socket 's', 'o' or 'ep' closes, we properly
unregister things.

However, if we call connect on socket 's' to connect to a new socket
'o2', we drop the reference on the original socket 'o'. Thus, we can
now close socket 'o' without unregistering from epoll. Then, when we
either close the ep or unregister 'o', we end up with this list
corruption. Thus, this is not a race per se, but can be triggered
sequentially.

Linus explains the general case in the context of the signalfd stuff
here: https://lkml.org/lkml/2013/10/14/634

So this may be the case that we've been chasing here for a while...

In any case, we could fix this with that same POLLFREE mechanism; the
simplest would be something like:

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..d499f81 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -392,6 +392,9 @@ static void unix_sock_destructor(struct sock *sk)
 	pr_debug("UNIX %p is destroyed, %ld are still alive.\n", sk,
 		atomic_long_read(&unix_nr_socks));
 #endif
+	/* make sure we remove from epoll */
+	wake_up_poll(&u->peer_wait, POLLFREE);
+	synchronize_sched();
 }
 
 static void unix_release_sock(struct sock *sk, int embrion)

I'm not suggesting we apply that, but that fixed the supplied test
case. We could enhance the above to avoid the freeing latency there by
doing SLAB_DESTROY_BY_RCU for unix sockets. But I'm not convinced that
this wouldn't still be broken for select()/poll() as well. I think we
can be in a select() call for socket 's', and if we remove socket 'o'
from it in the meantime (by doing a connect() on 's' to somewhere else
and a close on 'o'), I think we can still crash there. So POLLFREE
would have to be extended. I tried to hit this with select() but could
not, but I think if I tried harder I could.

Instead of going further down that route, perhaps something like below
might be better. The basic idea would be to do away with the 'other'
poll call in unix_dgram_poll(), and instead revert back to registering
on a single wait queue. We add a new wait queue to unix sockets such
that we can register it with a remote other on connect(). Then we can
use the wakeup from the remote to wake up the registered unix socket.
Probably better explained with the patch below.
Note I didn't add to the remote for SOCK_STREAM, since the poll()
routine there doesn't do the double wait queue registering:

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		wait;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..9e0692a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	skpair = unix_peer(sk);
 
 	if (skpair != NULL) {
+		if (sk->sk_type != SOCK_STREAM)
+			remove_wait_queue(&unix_sk(skpair)->peer_wait, &u->wait);
 		if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
 			unix_state_lock(skpair);
 			/* No more writes */
@@ -636,6 +638,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;
 
+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct unix_sock *u;
+
+	u = container_of(wait, struct unix_sock, wait);
+	wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
+
+	return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 {
 	struct sock *sk = NULL;
@@ -664,6 +676,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->wait, peer_wake);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
 	if (sk == NULL)
@@ -1030,7 +1043,10 @@ restart:
 	 */
 	if (unix_peer(sk)) {
 		struct sock *old_peer = unix_peer(sk);
+
+		remove_wait_queue(&unix_sk(old_peer)->peer_wait, &unix_sk(sk)->wait);
 		unix_peer(sk) = other;
+		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 		unix_state_double_unlock(sk, other);
 
 		if (other != old_peer)
@@ -1038,8 +1054,12 @@ restart:
 		sock_put(old_peer);
 	} else {
 		unix_peer(sk) = other;
+		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 		unix_state_double_unlock(sk, other);
 	}
+	/* New remote may have created write space for us */
+	wake_up_interruptible_sync_poll(sk_sleep(sk),
+					POLLOUT | POLLWRNORM | POLLWRBAND);
 	return 0;
 
 out_unlock:
@@ -1194,6 +1214,8 @@ restart:
 
 	sock_hold(sk);
 	unix_peer(newsk) = sk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
 	newsk->sk_state = TCP_ESTABLISHED;
 	newsk->sk_type = sk->sk_type;
 	init_peercred(newsk);
@@ -1220,6 +1242,8 @@ restart:
 
 	smp_mb__after_atomic();	/* sock_hold() does an atomic_inc() */
 	unix_peer(sk) = newsk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(newsk)->peer_wait, &unix_sk(sk)->wait);
 
 	unix_state_unlock(sk);
 
@@ -1254,6 +1278,10 @@ static int unix_socketpair(struct socket *socka, struct socket *sockb)
 	sock_hold(skb);
 	unix_peer(ska) = skb;
 	unix_peer(skb) = ska;
+	if (ska->sk_type != SOCK_STREAM) {
+		add_wait_queue(&unix_sk(ska)->peer_wait, &unix_sk(skb)->wait);
+		add_wait_queue(&unix_sk(skb)->peer_wait, &unix_sk(ska)->wait);
+	}
 	init_peercred(ska);
 	init_peercred(skb);
 
@@ -1565,6 +1593,7 @@ restart:
 	unix_state_lock(sk);
 	if (unix_peer(sk) == other) {
 		unix_peer(sk) = NULL;
+		remove_wait_queue(&unix_sk(other)->peer_wait, &u->wait);
 		unix_state_unlock(sk);
 
 		unix_dgram_disconnected(sk, other);
@@ -2441,7 +2470,6 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	other = unix_peer_get(sk);
 	if (other) {
 		if (unix_peer(other) != sk) {
-			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
 			if (unix_recvq_full(other))
 				writable = 0;
 		}
-- 
1.8.2.rc2

I think this makes more sense too, b/c it can also get the epoll() case
correct where the socket 's' is connected to a new remote. In the old
code we would continue to get wakeups from the wrong remote then. Here,
we can fix that as well.

There is perhaps a performance issue with this approach, since I'm
adding to the poll list on connect(). So even if we are not in a
poll(), we are still walking the list to do wakeups. That may or may
not be a big deal, and I think it can be fixed by setting something
like the SOCK_NOSPACE bit in unix_dgram_poll() when it's not writable,
and conditioning the wakeup on that. The alternative would be to
register with the remote on poll(), but then the unregister is more
complex...

Perhaps somebody more familiar with the unix socket code can comment on
this. Thanks for the test case!

Thanks,

-Jason

^ permalink raw reply related	[flat|nested] 18+ messages in thread
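(A minimal sketch of the sequential trigger described above -- untested,
with made-up abstract socket names; error checking omitted:)

#include <sys/socket.h>
#include <sys/epoll.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_un ao  = { .sun_family = AF_UNIX, .sun_path = "\0peer-o"  };
    struct sockaddr_un ao2 = { .sun_family = AF_UNIX, .sun_path = "\0peer-o2" };
    struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT };
    int ep = epoll_create1(0);
    int s  = socket(AF_UNIX, SOCK_DGRAM, 0);
    int o  = socket(AF_UNIX, SOCK_DGRAM, 0);
    int o2 = socket(AF_UNIX, SOCK_DGRAM, 0);

    bind(o,  (struct sockaddr *) &ao,  sizeof(ao));
    bind(o2, (struct sockaddr *) &ao2, sizeof(ao2));

    connect(s, (struct sockaddr *) &ao, sizeof(ao));   /* s takes a ref on o */
    epoll_ctl(ep, EPOLL_CTL_ADD, s, &ev);   /* unix_dgram_poll() also queues
                                               an epoll entry on o's peer_wait */
    connect(s, (struct sockaddr *) &ao2, sizeof(ao2)); /* drops the ref on o */
    close(o);                               /* o can now be freed; the epoll
                                               entry still points into its
                                               peer_wait queue */
    epoll_ctl(ep, EPOLL_CTL_DEL, s, NULL);  /* list_del() on freed memory */

    return 0;
}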
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-29 19:09 ` Jason Baron
@ 2015-09-30  5:54   ` Mathias Krause
  2015-09-30  7:34     ` Michal Kubecek
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Mathias Krause @ 2015-09-30 5:54 UTC (permalink / raw)
  To: Jason Baron
  Cc: netdev, linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
      Rainer Weikusat, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
      Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
      davem@davemloft.net

On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
> However, if we call connect on socket 's' to connect to a new socket
> 'o2', we drop the reference on the original socket 'o'. Thus, we can
> now close socket 'o' without unregistering from epoll. Then, when we
> either close the ep or unregister 'o', we end up with this list
> corruption. Thus, this is not a race per se, but can be triggered
> sequentially.

Sounds profound, but the reproducer calls connect only once per socket.
So there is no "connect to a new socket", no?
But w/e, see below.

> Linus explains the general case in the context of the signalfd stuff
> here: https://lkml.org/lkml/2013/10/14/634

I also found that posting while looking for similar bug reports. Also
found that one: https://lkml.org/lkml/2014/5/15/532

> So this may be the case that we've been chasing here for a while...

That bug triggers since commit 3c73419c09 "af_unix: fix 'poll for
write'/ connected DGRAM sockets". That's v2.6.26-rc7, as noted in the
reproducer.

> In any case, we could fix this with that same POLLFREE mechanism; the
> simplest would be something like:
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 03ee4d3..d499f81 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -392,6 +392,9 @@ static void unix_sock_destructor(struct sock *sk)
>  	pr_debug("UNIX %p is destroyed, %ld are still alive.\n", sk,
>  		atomic_long_read(&unix_nr_socks));
>  #endif
> +	/* make sure we remove from epoll */
> +	wake_up_poll(&u->peer_wait, POLLFREE);
> +	synchronize_sched();
>  }
>
>  static void unix_release_sock(struct sock *sk, int embrion)
>
> I'm not suggesting we apply that, but that fixed the supplied test
> case. We could enhance the above to avoid the freeing latency there by
> doing SLAB_DESTROY_BY_RCU for unix sockets. But I'm not convinced that
> this wouldn't still be broken for select()/poll() as well. I think we
> can be in a select() call for socket 's', and if we remove socket 'o'
> from it in the meantime (by doing a connect() on 's' to somewhere else
> and a close on 'o'), I think we can still crash there. So POLLFREE
> would have to be extended. I tried to hit this with select() but could
> not, but I think if I tried harder I could.
>
> Instead of going further down that route, perhaps something like below
> might be better. The basic idea would be to do away with the 'other'
> poll call in unix_dgram_poll(), and instead revert back to registering
> on a single wait queue. We add a new wait queue to unix sockets such
> that we can register it with a remote other on connect(). Then we can
> use the wakeup from the remote to wake up the registered unix socket.
> Probably better explained with the patch below. Note I didn't add to
> the remote for SOCK_STREAM, since the poll() routine there doesn't do
> the double wait queue registering:
>
> diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> index 4a167b3..9698aff 100644
> --- a/include/net/af_unix.h
> +++ b/include/net/af_unix.h
> @@ -62,6 +62,7 @@ struct unix_sock {
>  #define UNIX_GC_CANDIDATE	0
>  #define UNIX_GC_MAYBE_CYCLE	1
>  	struct socket_wq	peer_wq;
> +	wait_queue_t		wait;
>  };
>  #define unix_sk(__sk) ((struct unix_sock *)__sk)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 03ee4d3..9e0692a 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -420,6 +420,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
>  	skpair = unix_peer(sk);
>
>  	if (skpair != NULL) {
> +		if (sk->sk_type != SOCK_STREAM)
> +			remove_wait_queue(&unix_sk(skpair)->peer_wait, &u->wait);
>  		if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
>  			unix_state_lock(skpair);
>  			/* No more writes */
> @@ -636,6 +638,16 @@ static struct proto unix_proto = {
>   */
>  static struct lock_class_key af_unix_sk_receive_queue_lock_key;
>
> +static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
> +{
> +	struct unix_sock *u;
> +
> +	u = container_of(wait, struct unix_sock, wait);
> +	wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
> +
> +	return 0;
> +}
> +
>  static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
>  {
>  	struct sock *sk = NULL;
> @@ -664,6 +676,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
>  	INIT_LIST_HEAD(&u->link);
>  	mutex_init(&u->readlock); /* single task reading lock */
>  	init_waitqueue_head(&u->peer_wait);
> +	init_waitqueue_func_entry(&u->wait, peer_wake);
>  	unix_insert_socket(unix_sockets_unbound(sk), sk);
>  out:
>  	if (sk == NULL)
> @@ -1030,7 +1043,10 @@ restart:
>  	 */
>  	if (unix_peer(sk)) {
>  		struct sock *old_peer = unix_peer(sk);
> +
> +		remove_wait_queue(&unix_sk(old_peer)->peer_wait, &unix_sk(sk)->wait);
>  		unix_peer(sk) = other;
> +		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
>  		unix_state_double_unlock(sk, other);
>
>  		if (other != old_peer)
> @@ -1038,8 +1054,12 @@ restart:
>  		sock_put(old_peer);
>  	} else {
>  		unix_peer(sk) = other;
> +		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
>  		unix_state_double_unlock(sk, other);
>  	}
> +	/* New remote may have created write space for us */
> +	wake_up_interruptible_sync_poll(sk_sleep(sk),
> +					POLLOUT | POLLWRNORM | POLLWRBAND);
>  	return 0;
>
>  out_unlock:
> @@ -1194,6 +1214,8 @@ restart:
>
>  	sock_hold(sk);
>  	unix_peer(newsk) = sk;
> +	if (sk->sk_type == SOCK_SEQPACKET)
> +		add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
>  	newsk->sk_state = TCP_ESTABLISHED;
>  	newsk->sk_type = sk->sk_type;
>  	init_peercred(newsk);
> @@ -1220,6 +1242,8 @@ restart:
>
>  	smp_mb__after_atomic();	/* sock_hold() does an atomic_inc() */
>  	unix_peer(sk) = newsk;
> +	if (sk->sk_type == SOCK_SEQPACKET)
> +		add_wait_queue(&unix_sk(newsk)->peer_wait, &unix_sk(sk)->wait);
>
>  	unix_state_unlock(sk);
>
> @@ -1254,6 +1278,10 @@ static int unix_socketpair(struct socket *socka, struct socket *sockb)
>  	sock_hold(skb);
>  	unix_peer(ska) = skb;
>  	unix_peer(skb) = ska;
> +	if (ska->sk_type != SOCK_STREAM) {
> +		add_wait_queue(&unix_sk(ska)->peer_wait, &unix_sk(skb)->wait);
> +		add_wait_queue(&unix_sk(skb)->peer_wait, &unix_sk(ska)->wait);
> +	}
>  	init_peercred(ska);
>  	init_peercred(skb);
>
> @@ -1565,6 +1593,7 @@ restart:
>  	unix_state_lock(sk);
>  	if (unix_peer(sk) == other) {
>  		unix_peer(sk) = NULL;
> +		remove_wait_queue(&unix_sk(other)->peer_wait, &u->wait);
>  		unix_state_unlock(sk);
>
>  		unix_dgram_disconnected(sk, other);
> @@ -2441,7 +2470,6 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  	other = unix_peer_get(sk);
>  	if (other) {
>  		if (unix_peer(other) != sk) {
> -			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
>  			if (unix_recvq_full(other))
>  				writable = 0;
>  		}
> --
> 1.8.2.rc2
>
> I think this makes more sense too, b/c it can also get the epoll() case
> correct where the socket 's' is connected to a new remote. In the old
> code we would continue to get wakeups from the wrong remote then. Here,
> we can fix that as well.
>
> There is perhaps a performance issue with this approach, since I'm
> adding to the poll list on connect(). So even if we are not in a
> poll(), we are still walking the list to do wakeups. That may or may
> not be a big deal, and I think it can be fixed by setting something
> like the SOCK_NOSPACE bit in unix_dgram_poll() when it's not writable,
> and conditioning the wakeup on that. The alternative would be to
> register with the remote on poll(), but then the unregister is more
> complex...

This one looks good to me. It survived almost 10 hours on a 2 CPU
system where the reproducer was able to trigger the bug within seconds.
Thanks for fixing this!

> Perhaps somebody more familiar with the unix socket code can comment on
> this. Thanks for the test case!

Eric Dumazet may have opinions concerning the performance impact. At
least he had in the past...

Regards,
Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-30  5:54 ` Mathias Krause
@ 2015-09-30  7:34   ` Michal Kubecek
  2015-10-01  2:55     ` Jason Baron
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Kubecek @ 2015-09-30 7:34 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Jason Baron, netdev, linux-kernel@vger.kernel.org, Eric Wong,
      Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
      Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
      peterz@infradead.org, davem@davemloft.net

On Wed, Sep 30, 2015 at 07:54:29AM +0200, Mathias Krause wrote:
> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
> > However, if we call connect on socket 's' to connect to a new socket
> > 'o2', we drop the reference on the original socket 'o'. Thus, we can
> > now close socket 'o' without unregistering from epoll. Then, when we
> > either close the ep or unregister 'o', we end up with this list
> > corruption. Thus, this is not a race per se, but can be triggered
> > sequentially.
>
> Sounds profound, but the reproducer calls connect only once per socket.
> So there is no "connect to a new socket", no?

I believe there is another scenario: 'o' becomes SOCK_DEAD while 's' is
still connected to it. This is detected by 's' in unix_dgram_sendmsg()
so that 's' releases its reference on 'o' and 'o' can be freed. If this
happens before 's' is unregistered, we get a use-after-free as 'o' has
never been unregistered. And as the interval between freeing 'o' and
unregistering 's' can be quite long, there is a chance for the memory
to be reused. This is what one of our customers has seen:

    [exception RIP: _raw_spin_lock_irqsave+156]
    RIP: ffffffff8040f5bc  RSP: ffff8800e929de78  RFLAGS: 00010082
    RAX: 000000000000a32c  RBX: ffff88003954ab80  RCX: 0000000000001000
    RDX: 00000000f2320000  RSI: 000000000000f232  RDI: ffff88003954ab80
    RBP: 0000000000005220   R8: dead000000100100   R9: 0000000000000000
    R10: 00007fff1a284960  R11: 0000000000000246  R12: 0000000000000000
    R13: ffff8800e929de8c  R14: 000000000000000e  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 10000e030  SS: e02b
 #8 [ffff8800e929de70] _raw_spin_lock_irqsave at ffffffff8040f5a9
 #9 [ffff8800e929deb0] remove_wait_queue at ffffffff8006ad09
#10 [ffff8800e929ded0] ep_unregister_pollwait at ffffffff80170043
#11 [ffff8800e929def0] ep_remove at ffffffff80170073
#12 [ffff8800e929df10] sys_epoll_ctl at ffffffff80171453
#13 [ffff8800e929df80] system_call_fastpath at ffffffff80417553

In this case, the crash happened on unregistering 's', which had a null
peer (i.e. not reconnected but rather disconnected), but there were
still two items in the list, the other pointing to an unallocated page
which had apparently been modified in between.

IMHO unix_dgram_disconnected() could be the place to handle this issue:
it is called from both places where we disconnect from a peer (dead
peer detection in unix_dgram_sendmsg() and reconnect in
unix_dgram_connect()) just before the reference to the peer is
released. I'm not familiar with the epoll implementation, so I'm still
trying to find out what exactly needs to be done to unregister the peer
at this moment.

> That bug triggers since commit 3c73419c09 "af_unix: fix 'poll for
> write'/ connected DGRAM sockets". That's v2.6.26-rc7, as noted in the
> reproducer.

Sounds likely, as this is the commit that introduced unix_dgram_poll()
with the code which adds the "asymmetric peer" to monitor its queue
state.
More precisely, the asymmetry check was added by ec0d215f9420
("af_unix: fix 'poll for write'/connected DGRAM sockets") shortly after
that.

                                                         Michal Kubecek

^ permalink raw reply	[flat|nested] 18+ messages in thread
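(An untested sketch of the dead-peer sequence Michal describes; the
abstract socket name is made up, error checking omitted:)

#include <sys/socket.h>
#include <sys/epoll.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_un a = { .sun_family = AF_UNIX, .sun_path = "\0dead-peer" };
    struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT };
    int ep = epoll_create1(0);
    int s = socket(AF_UNIX, SOCK_DGRAM, 0);
    int o = socket(AF_UNIX, SOCK_DGRAM, 0);

    bind(o, (struct sockaddr *) &a, sizeof(a));
    connect(s, (struct sockaddr *) &a, sizeof(a)); /* s takes a ref on o */
    epoll_ctl(ep, EPOLL_CTL_ADD, s, &ev);   /* registers on o's peer_wait */
    close(o);               /* o is SOCK_DEAD now; s holds the last ref */
    write(s, "x", 1);       /* unix_dgram_sendmsg() detects the dead peer
                               and drops the ref -- o and its peer_wait
                               are freed */
    epoll_ctl(ep, EPOLL_CTL_DEL, s, NULL);  /* remove_wait_queue() on
                                               freed memory */
    return 0;
}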
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-30  7:34 ` Michal Kubecek
@ 2015-10-01  2:55   ` Jason Baron
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Baron @ 2015-10-01 2:55 UTC (permalink / raw)
  To: Michal Kubecek, Mathias Krause
  Cc: netdev, linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
      Rainer Weikusat, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
      Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
      davem@davemloft.net

On 09/30/2015 03:34 AM, Michal Kubecek wrote:
> On Wed, Sep 30, 2015 at 07:54:29AM +0200, Mathias Krause wrote:
>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>> However, if we call connect on socket 's' to connect to a new socket
>>> 'o2', we drop the reference on the original socket 'o'. Thus, we can
>>> now close socket 'o' without unregistering from epoll. Then, when we
>>> either close the ep or unregister 'o', we end up with this list
>>> corruption. Thus, this is not a race per se, but can be triggered
>>> sequentially.
>>
>> Sounds profound, but the reproducer calls connect only once per socket.
>> So there is no "connect to a new socket", no?
>
> I believe there is another scenario: 'o' becomes SOCK_DEAD while 's' is
> still connected to it. This is detected by 's' in unix_dgram_sendmsg()
> so that 's' releases its reference on 'o' and 'o' can be freed. If this
> happens before 's' is unregistered, we get a use-after-free as 'o' has
> never been unregistered. And as the interval between freeing 'o' and
> unregistering 's' can be quite long, there is a chance for the memory
> to be reused. This is what one of our customers has seen:
>
>     [exception RIP: _raw_spin_lock_irqsave+156]
>     RIP: ffffffff8040f5bc  RSP: ffff8800e929de78  RFLAGS: 00010082
>     RAX: 000000000000a32c  RBX: ffff88003954ab80  RCX: 0000000000001000
>     RDX: 00000000f2320000  RSI: 000000000000f232  RDI: ffff88003954ab80
>     RBP: 0000000000005220   R8: dead000000100100   R9: 0000000000000000
>     R10: 00007fff1a284960  R11: 0000000000000246  R12: 0000000000000000
>     R13: ffff8800e929de8c  R14: 000000000000000e  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 10000e030  SS: e02b
>  #8 [ffff8800e929de70] _raw_spin_lock_irqsave at ffffffff8040f5a9
>  #9 [ffff8800e929deb0] remove_wait_queue at ffffffff8006ad09
> #10 [ffff8800e929ded0] ep_unregister_pollwait at ffffffff80170043
> #11 [ffff8800e929def0] ep_remove at ffffffff80170073
> #12 [ffff8800e929df10] sys_epoll_ctl at ffffffff80171453
> #13 [ffff8800e929df80] system_call_fastpath at ffffffff80417553
>
> In this case, the crash happened on unregistering 's', which had a null
> peer (i.e. not reconnected but rather disconnected), but there were
> still two items in the list, the other pointing to an unallocated page
> which had apparently been modified in between.
>
> IMHO unix_dgram_disconnected() could be the place to handle this issue:
> it is called from both places where we disconnect from a peer (dead
> peer detection in unix_dgram_sendmsg() and reconnect in
> unix_dgram_connect()) just before the reference to the peer is
> released. I'm not familiar with the epoll implementation, so I'm still
> trying to find out what exactly needs to be done to unregister the peer
> at this moment.

Indeed, that is a path as well, and the patch I posted here deals with
it: unix_dgram_disconnected() gets called, as you point out, when we
are removing the remote peer, and I do a remove_wait_queue() in that
case.
The patch I posted converts us back to polling against a single wait
queue. The wait structure that epoll()/select()/poll() adds to the peer
wait queue is really opaque to the unix code. The normal pattern is for
epoll()/select()/poll() to do the unregister, not the socket/fd that we
are waiting on. Further, we could not just release all of the wait
queues in unix_dgram_disconnected() b/c there could be multiple waiters
there. So the POLLFREE thing really has to be done from the
unix_sock_destructor() path, since that is going to free the entire
queue.

In addition, I think that removing the wait queue from
unix_dgram_disconnected() would still be broken, b/c we would need to
re-add to the remote peer via a subsequent poll() to get events if the
socket was re-connected to a new peer.

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 18+ messages in thread
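(For context, the POLLFREE mechanism referenced above was introduced
for signalfd by commit d80e731ecab4: ep_poll_callback() recognizes the
key and unhooks the epoll entry from the dying wait queue itself --
roughly the following, with the surrounding locking omitted:)

    if ((unsigned long)key & POLLFREE) {
        /* forget the (about to be freed) wait queue head ... */
        ep_pwq_from_wait(wait)->whead = NULL;
        /* ... and drop this entry from it, so that a later
         * ep_unregister_pollwait() has nothing left to remove */
        list_del_init(&wait->task_list);
    }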
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-30  5:54 ` Mathias Krause
  2015-09-30  7:34   ` Michal Kubecek
@ 2015-09-30 10:56   ` Rainer Weikusat
  2015-09-30 11:55     ` Mathias Krause
  2015-10-01  2:39   ` Jason Baron
  2 siblings, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-09-30 10:56 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Jason Baron, netdev, linux-kernel@vger.kernel.org, Eric Wong,
      Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
      Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
      peterz@infradead.org, davem@davemloft.net

Mathias Krause <minipli@googlemail.com> writes:
> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>> However, if we call connect on socket 's' to connect to a new socket
>> 'o2', we drop the reference on the original socket 'o'. Thus, we can
>> now close socket 'o' without unregistering from epoll. Then, when we
>> either close the ep or unregister 'o', we end up with this list
>> corruption. Thus, this is not a race per se, but can be triggered
>> sequentially.
>
> Sounds profound, but the reproducer calls connect only once per socket.
> So there is no "connect to a new socket", no?
> But w/e, see below.

In case you want some information on this: This is a kernel warning I
could trigger (more than once) on the single day I could so far spend
looking into this (3.2.54 kernel):

Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
Sep 15 19:37:19 doppelsaurus kernel: Modules linked in: snd_hrtimer binfmt_misc af_packet nf_conntrack loop snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm sg snd_page_alloc snd_seq_dummy sr_mod snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_timer ath9k snd cdrom ath9k_common ath9k_hw r8169 mii ath usb_storage unix
Sep 15 19:37:19 doppelsaurus kernel: Pid: 3340, comm: a.out Tainted: G        W    3.2.54-saurus-vesa #9
Sep 15 19:37:19 doppelsaurus kernel: Call Trace:
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff811c1e00>] ? __list_del_entry+0x80/0xc0
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff81036ad9>] ? warn_slowpath_common+0x79/0xc0
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff81036bd5>] ? warn_slowpath_fmt+0x45/0x50
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff811c1e49>] ? list_del+0x9/0x30
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff81051509>] ? remove_wait_queue+0x29/0x50
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810fde62>] ? ep_unregister_pollwait.isra.9+0x32/0x50
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810fdeaa>] ? ep_remove+0x2a/0xc0
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810fe9ae>] ? eventpoll_release_file+0x5e/0x90
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810c76f6>] ? fput+0x1c6/0x220
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810c3b7f>] ? filp_close+0x5f/0x90
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810c3c36>] ? sys_close+0x86/0xd0
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff8141f43b>] ? system_call_fastpath+0x16/0x1b

The dead000000100100 is one of the list poison values a linkage pointer
is set to during/after removal from a list. The particular warning
means that entry->prev (of the item being removed) pointed to another
entry whose next pointer was not the address of entry but
dead000000100100.
Most likely, this means a list insert is racing with a list remove
somewhere here: the insert picks up the pointer to the previous item
while that item is still on the list and uses it while the delete
removes it, with the delete having the last word and thus setting
prev->next to dead000000100100 after the insert set it to the address
of the item being inserted.

^ permalink raw reply	[flat|nested] 18+ messages in thread
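(The poison value can be read straight out of list_del(); with
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000, as on x86-64,
LIST_POISON1 is exactly the dead000000100100 quoted above:)

/* include/linux/list.h, debug checks omitted */
static inline void list_del(struct list_head *entry)
{
    __list_del(entry->prev, entry->next);
    entry->next = LIST_POISON1; /* 0x00100100 + 0xdead000000000000 */
    entry->prev = LIST_POISON2; /* 0x00200200 + 0xdead000000000000 */
}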
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-30 10:56 ` Rainer Weikusat
@ 2015-09-30 11:55   ` Mathias Krause
  2015-09-30 13:25     ` Rainer Weikusat
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Krause @ 2015-09-30 11:55 UTC (permalink / raw)
  To: Rainer Weikusat
  Cc: Jason Baron, netdev, linux-kernel@vger.kernel.org, Eric Wong,
      Eric Dumazet, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
      Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
      davem@davemloft.net

On 30 September 2015 at 12:56, Rainer Weikusat
<rweikusat@mobileactivedefense.com> wrote:
> Mathias Krause <minipli@googlemail.com> writes:
>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>> However, if we call connect on socket 's' to connect to a new socket
>>> 'o2', we drop the reference on the original socket 'o'. Thus, we can
>>> now close socket 'o' without unregistering from epoll. Then, when we
>>> either close the ep or unregister 'o', we end up with this list
>>> corruption. Thus, this is not a race per se, but can be triggered
>>> sequentially.
>>
>> Sounds profound, but the reproducer calls connect only once per socket.
>> So there is no "connect to a new socket", no?
>> But w/e, see below.
>
> In case you want some information on this: This is a kernel warning I
> could trigger (more than once) on the single day I could so far spend
> looking into this (3.2.54 kernel):
>
> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
> [snip]

Is that with Jason's patch or a vanilla v3.2.54?

Regards,
Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-30 11:55 ` Mathias Krause
@ 2015-09-30 13:25   ` Rainer Weikusat
  2015-09-30 13:38     ` Mathias Krause
  0 siblings, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-09-30 13:25 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Rainer Weikusat, Jason Baron, netdev,
      linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
      Alexander Viro, Davide Libenzi, Davidlohr Bueso, Olivier Mauras,
      PaX Team, Linus Torvalds, peterz@infradead.org,
      davem@davemloft.net

Mathias Krause <minipli@googlemail.com> writes:
> On 30 September 2015 at 12:56, Rainer Weikusat
> <rweikusat@mobileactivedefense.com> wrote:
>> Mathias Krause <minipli@googlemail.com> writes:
>>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>>> However, if we call connect on socket 's' to connect to a new socket
>>>> 'o2', we drop the reference on the original socket 'o'. Thus, we can
>>>> now close socket 'o' without unregistering from epoll. Then, when we
>>>> either close the ep or unregister 'o', we end up with this list
>>>> corruption. Thus, this is not a race per se, but can be triggered
>>>> sequentially.
>>>
>>> Sounds profound, but the reproducer calls connect only once per socket.
>>> So there is no "connect to a new socket", no?
>>> But w/e, see below.
>>
>> In case you want some information on this: This is a kernel warning I
>> could trigger (more than once) on the single day I could so far spend
>> looking into this (3.2.54 kernel):
>>
>> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
>> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
>> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
>> [snip]
>
> Is that with Jason's patch or a vanilla v3.2.54?

That's a kernel warning which occurred repeatedly (among other "link
pointer disorganization" warnings) when I tested the "program with
unknown behaviour" you wrote against the kernel I'm currently
supporting a while ago (as I already wrote in the original mail).

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-30 13:25 ` Rainer Weikusat
@ 2015-09-30 13:38   ` Mathias Krause
  2015-09-30 13:51     ` Rainer Weikusat
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Krause @ 2015-09-30 13:38 UTC (permalink / raw)
  To: Rainer Weikusat
  Cc: Jason Baron, netdev, linux-kernel@vger.kernel.org, Eric Wong,
      Eric Dumazet, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
      Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
      davem@davemloft.net

On 30 September 2015 at 15:25, Rainer Weikusat
<rweikusat@mobileactivedefense.com> wrote:
> Mathias Krause <minipli@googlemail.com> writes:
>> On 30 September 2015 at 12:56, Rainer Weikusat
>> <rweikusat@mobileactivedefense.com> wrote:
>>> In case you want some information on this: This is a kernel warning I
>>> could trigger (more than once) on the single day I could so far spend
>>> looking into this (3.2.54 kernel):
>>>
>>> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
>>> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
>>> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
>>> [snip]
>>
>> Is that with Jason's patch or a vanilla v3.2.54?
>
> That's a kernel warning which occurred repeatedly (among other "link
> pointer disorganization" warnings) when I tested the "program with
> unknown behaviour" you wrote against the kernel I'm currently
> supporting a while ago (as I already wrote in the original mail).

So I assume Jason's patch is not included in your kernel. Then those
messages are expected; expected even on kernels as old as v2.6.27.
Can you re-try with Jason's patch applied?

Thanks,
Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-30 13:38 ` Mathias Krause
@ 2015-09-30 13:51   ` Rainer Weikusat
  0 siblings, 0 replies; 18+ messages in thread
From: Rainer Weikusat @ 2015-09-30 13:51 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Rainer Weikusat, Jason Baron, netdev,
      linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
      Alexander Viro, Davide Libenzi, Davidlohr Bueso, Olivier Mauras,
      PaX Team, Linus Torvalds, peterz@infradead.org,
      davem@davemloft.net

Mathias Krause <minipli@googlemail.com> writes:
> On 30 September 2015 at 15:25, Rainer Weikusat
> <rweikusat@mobileactivedefense.com> wrote:
>> Mathias Krause <minipli@googlemail.com> writes:
>>> On 30 September 2015 at 12:56, Rainer Weikusat
>>> <rweikusat@mobileactivedefense.com> wrote:
>>>> In case you want some information on this: This is a kernel warning I
>>>> could trigger (more than once) on the single day I could so far spend
>>>> looking into this (3.2.54 kernel):
>>>>
>>>> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
>>>> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
>>>> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
>>>> [snip]
>>>
>>> Is that with Jason's patch or a vanilla v3.2.54?
>>
>> That's a kernel warning which occurred repeatedly (among other "link
>> pointer disorganization" warnings) when I tested the "program with
>> unknown behaviour" you wrote against the kernel I'm currently
>> supporting a while ago (as I already wrote in the original mail).
>
> So I assume Jason's patch is not included in your kernel. Then those
> messages are expected; expected even on kernels as old as v2.6.27.
> Can you re-try with Jason's patch applied?

If something needs to be unregistered here (which I don't claim to know
but someone should), the corresponding code would need to be added to
the unix_dgram_disconnected function (as was already pointed out).

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-30  5:54 ` Mathias Krause
  2015-09-30  7:34   ` Michal Kubecek
  2015-09-30 10:56   ` Rainer Weikusat
@ 2015-10-01  2:39   ` Jason Baron
  2015-10-01 10:33     ` Rainer Weikusat
  2 siblings, 1 reply; 18+ messages in thread
From: Jason Baron @ 2015-10-01 2:39 UTC (permalink / raw)
  To: Mathias Krause
  Cc: netdev, linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
      Rainer Weikusat, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
      Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
      davem@davemloft.net

On 09/30/2015 01:54 AM, Mathias Krause wrote:
> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>> However, if we call connect on socket 's' to connect to a new socket
>> 'o2', we drop the reference on the original socket 'o'. Thus, we can
>> now close socket 'o' without unregistering from epoll. Then, when we
>> either close the ep or unregister 'o', we end up with this list
>> corruption. Thus, this is not a race per se, but can be triggered
>> sequentially.
>
> Sounds profound, but the reproducer calls connect only once per socket.
> So there is no "connect to a new socket", no?
> But w/e, see below.

Yes, but it can be reproduced this way too. It can also happen with a
close() on the remote peer 'o' and a send to 'o' from 's', which the
reproducer can do, as pointed out by Michal. The patch I sent deals
with both cases.

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-10-01  2:39 ` Jason Baron
@ 2015-10-01 10:33   ` Rainer Weikusat
  2015-10-01 12:10     ` Rainer Weikusat
  0 siblings, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-10-01 10:33 UTC (permalink / raw)
  To: Jason Baron
  Cc: Mathias Krause, netdev, linux-kernel@vger.kernel.org, Eric Wong,
      Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
      Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
      peterz@infradead.org, davem@davemloft.net

Jason Baron <jbaron@akamai.com> writes:
> On 09/30/2015 01:54 AM, Mathias Krause wrote:
>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>> However, if we call connect on socket 's' to connect to a new socket
>>> 'o2', we drop the reference on the original socket 'o'. Thus, we can
>>> now close socket 'o' without unregistering from epoll. Then, when we
>>> either close the ep or unregister 'o', we end up with this list
>>> corruption. Thus, this is not a race per se, but can be triggered
>>> sequentially.
>>
>> Sounds profound, but the reproducer calls connect only once per socket.
>> So there is no "connect to a new socket", no?
>> But w/e, see below.
>
> Yes, but it can be reproduced this way too. It can also happen with a
> close() on the remote peer 'o' and a send to 'o' from 's', which the
> reproducer can do, as pointed out by Michal. The patch I sent deals
> with both cases.

As Michal also pointed out, there's a unix_dgram_disconnected routine
being called in both cases, and insofar as "deregistering" anything
beyond what unix_dgram_disconnected (and -- as far as I can tell --
unix_release_sock) already does is actually required, this would be the
obvious place to add it. A good step on the way to that would be to
write (and post) some test code which actually reproduces the problem
in a predictable way.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-10-01 10:33 ` Rainer Weikusat
@ 2015-10-01 12:10   ` Rainer Weikusat
  2015-10-01 12:58     ` Rainer Weikusat
  0 siblings, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-10-01 12:10 UTC (permalink / raw)
  To: Jason Baron
  Cc: Mathias Krause, netdev, linux-kernel@vger.kernel.org, Eric Wong,
      Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
      Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
      peterz@infradead.org, davem@davemloft.net

Rainer Weikusat <rw@doppelsaurus.mobileactivedefense.com> writes:
> Jason Baron <jbaron@akamai.com> writes:
>> On 09/30/2015 01:54 AM, Mathias Krause wrote:
>>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>>> However, if we call connect on socket 's' to connect to a new socket
>>>> 'o2', we drop the reference on the original socket 'o'. Thus, we can
>>>> now close socket 'o' without unregistering from epoll. Then, when we
>>>> either close the ep or unregister 'o', we end up with this list
>>>> corruption. Thus, this is not a race per se, but can be triggered
>>>> sequentially.
>>>
>>> Sounds profound, but the reproducer calls connect only once per socket.
>>> So there is no "connect to a new socket", no?
>>> But w/e, see below.
>>
>> Yes, but it can be reproduced this way too. It can also happen with a
>> close() on the remote peer 'o' and a send to 'o' from 's', which the
>> reproducer can do, as pointed out by Michal. The patch I sent deals
>> with both cases.
>
> As Michal also pointed out, there's a unix_dgram_disconnected routine
> being called in both cases, and insofar as "deregistering" anything
> beyond what unix_dgram_disconnected (and -- as far as I can tell --
> unix_release_sock) already does is actually required, this would be the
> obvious place to add it. A good step on the way to that would be to
> write (and post) some test code which actually reproduces the problem
> in a predictable way.
Test program (assumes that it can execute itself as ./a.out):

-------------
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/epoll.h>
#include <signal.h>
#include <unistd.h>

static int sk;

static void *epoller(void *unused)
{
    struct epoll_event epev;
    int epfd;

    epfd = epoll_create(1);

    epev.events = EPOLLOUT;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sk, &epev);
    epoll_wait(epfd, &epev, 1, 5000);

    execl("./a.out", "./a.out", (void *)0);

    return NULL;
}

int main(void)
{
    struct sockaddr_un sun;
    pthread_t tid;
    int tg0, tg1, rc;

    sun.sun_family = AF_UNIX;

    tg0 = socket(AF_UNIX, SOCK_DGRAM, 0);
    strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
    unlink(sun.sun_path);
    bind(tg0, (struct sockaddr *)&sun, sizeof(sun));

    tg1 = socket(AF_UNIX, SOCK_DGRAM, 0);
    strncpy(sun.sun_path, "/tmp/tg1", sizeof(sun.sun_path));
    unlink(sun.sun_path);
    bind(tg1, (struct sockaddr *)&sun, sizeof(sun));

    sk = socket(AF_UNIX, SOCK_DGRAM, 0);
    connect(sk, (struct sockaddr *)&sun, sizeof(sun));
    fcntl(sk, F_SETFL, fcntl(sk, F_GETFL) | O_NONBLOCK);
    while ((rc = write(sk, "bla", 3)) != -1);

    pthread_create(&tid, NULL, epoller, NULL);
    usleep(5);

    strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
    connect(sk, (struct sockaddr *)&sun, sizeof(sun));

    close(tg1);

    pause();

    return 0;
}
-----------------

Inserting a

    struct list_head *p;

    p = &u->peer_wait.task_list;
    WARN_ON(p->next != p || p->prev != p);

into unix_sock_destructor triggers a warning, and after running for a
while, one also gets the list corruption warnings, for example:

Oct  1 12:53:38 doppelsaurus kernel: WARNING: at lib/list_debug.c:54 list_del+0x9/0x30()
Oct  1 12:53:38 doppelsaurus kernel: Hardware name: 500-330nam
Oct  1 12:53:38 doppelsaurus kernel: list_del corruption. prev->next should be ffff880232634420, but was ffff880224f878c8
Oct  1 12:53:38 doppelsaurus kernel: Modules linked in: snd_hrtimer binfmt_misc af_packet nf_conntrack loop snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm ath9k ath9k_common ath9k_hw snd_seq_dummy snd_page_alloc snd_seq_oss snd_seq_midi_event snd_seq sg ath sr_mod snd_seq_device cdrom snd_timer snd r8169 mii usb_storage unix
Oct  1 12:53:38 doppelsaurus kernel: Pid: 4619, comm: a.out Tainted: G        W    3.2.54-saurus-vesa #18
Oct  1 12:53:38 doppelsaurus kernel: Call Trace:
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff811c1e00>] ? __list_del_entry+0x80/0x100
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff81036ad9>] ? warn_slowpath_common+0x79/0xc0
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff81036bd5>] ? warn_slowpath_fmt+0x45/0x50
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff811c1e89>] ? list_del+0x9/0x30
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff81051509>] ? remove_wait_queue+0x29/0x50
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff810fde62>] ? ep_unregister_pollwait.isra.9+0x32/0x50
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff810fdeaa>] ? ep_remove+0x2a/0xc0
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff810fe9ae>] ? eventpoll_release_file+0x5e/0x90
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff810c76f6>] ? fput+0x1c6/0x220
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff810c3b7f>] ? filp_close+0x5f/0x90
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff81039e35>] ? put_files_struct+0xb5/0x110
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff8103a4ef>] ? do_exit+0x59f/0x750
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff8103a9b8>] ? do_group_exit+0x38/0xa0
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff8103aa32>] ? sys_exit_group+0x12/0x20
Oct  1 12:53:38 doppelsaurus kernel: [<ffffffff8141f5fe>] ? tracesys+0xd0/0xd5
Oct  1 12:53:38 doppelsaurus kernel: ---[ end trace ba5c51cbfeb664d8 ]---

Doing "the signalfd_cleanup" thing should prevent this (didn't test
this so far). I still think the correct solution would be to clean this
up in unix_dgram_disconnected.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-10-01 12:10 ` Rainer Weikusat
@ 2015-10-01 12:58   ` Rainer Weikusat
  0 siblings, 0 replies; 18+ messages in thread
From: Rainer Weikusat @ 2015-10-01 12:58 UTC (permalink / raw)
  To: Jason Baron
  Cc: Mathias Krause, netdev, linux-kernel@vger.kernel.org, Eric Wong,
      Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
      Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
      peterz@infradead.org, davem@davemloft.net

Rainer Weikusat <rw@doppelsaurus.mobileactivedefense.com> writes:
> Rainer Weikusat <rw@doppelsaurus.mobileactivedefense.com> writes:
>> Jason Baron <jbaron@akamai.com> writes:
>>> On 09/30/2015 01:54 AM, Mathias Krause wrote:
>>>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>>>> However, if we call connect on socket 's' to connect to a new socket
>>>>> 'o2', we drop the reference on the original socket 'o'. Thus, we can
>>>>> now close socket 'o' without unregistering from epoll. Then, when we
>>>>> either close the ep or unregister 'o', we end up with this list
>>>>> corruption. Thus, this is not a race per se, but can be triggered
>>>>> sequentially.

[...]

> Test program (assumes that it can execute itself as ./a.out):
>
> -------------
> #include <fcntl.h>
> #include <pthread.h>
> #include <string.h>
> #include <sys/socket.h>
> #include <sys/un.h>
> #include <sys/epoll.h>
> #include <signal.h>
> #include <unistd.h>
>
> static int sk;
>
> static void *epoller(void *unused)
> {
>     struct epoll_event epev;
>     int epfd;
>
>     epfd = epoll_create(1);
>
>     epev.events = EPOLLOUT;
>     epoll_ctl(epfd, EPOLL_CTL_ADD, sk, &epev);
>     epoll_wait(epfd, &epev, 1, 5000);
>
>     execl("./a.out", "./a.out", (void *)0);
>
>     return NULL;
> }

[...]

Possibly interesting additional bit of information: The list corruption
warnings appear only if the 2nd connect is there and both the sk and
epfd file descriptors are left open across the exec. Closing either one
still triggers the _destructor warnings but nothing else (until the
process runs out of file descriptors).

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-13 19:53 List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket Mathias Krause
  2015-09-14  2:39 ` Eric Wong
@ 2015-09-15 17:07 ` Rainer Weikusat
  2015-09-15 18:15   ` Mathias Krause
  1 sibling, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-09-15 17:07 UTC (permalink / raw)
  To: Mathias Krause
  Cc: netdev, linux-kernel, Eric Dumazet, Rainer Weikusat,
      Alexander Viro, Davide Libenzi, Davidlohr Bueso, Olivier Mauras,
      PaX Team

Mathias Krause <minipli@googlemail.com> writes:
> this is an attempt to resurrect the thread initially started here:
>
> http://thread.gmane.org/gmane.linux.network/353003
>
> As that patch fixed the issue for the mentioned reproducer, it did not
> fix the bug for the production code Olivier is using. :(
>
> Changing the reproducer only slightly allows me to trigger the
> following list debug splat (CONFIG_DEBUG_LIST=y) reliably within
> seconds -- even with the above linked patch applied:

The patch was

--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2233,10 +2233,14 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	writable = unix_writable(sk);
 	other = unix_peer_get(sk);
 	if (other) {
-		if (unix_peer(other) != sk) {
+		unix_state_lock(other);
+		if (!sock_flag(other, SOCK_DEAD) && unix_peer(other) != sk) {
+			unix_state_unlock(other);
 			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
 			if (unix_recvq_full(other))
 				writable = 0;
+		} else {
+			unix_state_unlock(other);
 		}
 		sock_put(other);
 	}

That's obviously not going to help you when 'racing with
unix_release_sock', as the socket might be released immediately after
the unix_state_unlock, i.e. before sock_poll_wait is called. Provided I
understand this correctly, the problem is that the socket reference
count may have become 1 by the time sock_put is called, but the
sock_poll_wait has created a new reference to it which isn't accounted
for.

A simple way to fix that could be to do something like

	unix_state_lock(other);
	if (!sock_flag(other, SOCK_DEAD))
		sock_poll_wait(...);
	unix_state_unlock(other);

This would imply that unix_release_sock either marked the socket as
dead before the sock_poll_wait was executed, or that the
wake_up_interruptible call in there will run after ->peer_wait was used
(and will thus 'unpollwait' it again).

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
  2015-09-15 17:07 ` Rainer Weikusat
@ 2015-09-15 18:15   ` Mathias Krause
  0 siblings, 0 replies; 18+ messages in thread
From: Mathias Krause @ 2015-09-15 18:15 UTC (permalink / raw)
  To: Rainer Weikusat
  Cc: netdev, linux-kernel, Eric Dumazet, Alexander Viro,
      Davide Libenzi, Davidlohr Bueso, Olivier Mauras, PaX Team

On Tue, Sep 15, 2015 at 06:07:05PM +0100, Rainer Weikusat wrote:
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2233,10 +2233,14 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  	writable = unix_writable(sk);
>  	other = unix_peer_get(sk);
>  	if (other) {
> -		if (unix_peer(other) != sk) {
> +		unix_state_lock(other);
> +		if (!sock_flag(other, SOCK_DEAD) && unix_peer(other) != sk) {
> +			unix_state_unlock(other);
>  			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
>  			if (unix_recvq_full(other))
>  				writable = 0;
> +		} else {
> +			unix_state_unlock(other);
>  		}
>  		sock_put(other);
>  	}
>
> That's obviously not going to help you when 'racing with
> unix_release_sock', as the socket might be released immediately after
> the unix_state_unlock, i.e. before sock_poll_wait is called. Provided I
> understand this correctly, the problem is that the socket reference
> count may have become 1 by the time sock_put is called, but the
> sock_poll_wait has created a new reference to it which isn't accounted
> for.
>
> A simple way to fix that could be to do something like
>
> 	unix_state_lock(other);
> 	if (!sock_flag(other, SOCK_DEAD))
> 		sock_poll_wait(...);
> 	unix_state_unlock(other);

Sorry, but that does not fix the bug. I get the same GPF as before.

> This would imply that unix_release_sock either marked the socket as
> dead before the sock_poll_wait was executed, or that the
> wake_up_interruptible call in there will run after ->peer_wait was used
> (and will thus 'unpollwait' it again).

It must be something else, as I also tried the following patch, which
moves the wake_up_interruptible_all() call into the locked code
section:

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d359f6a..58570a7680ce 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -360,10 +360,12 @@ static void unix_dgram_disconnected(struct sock *sk, struct sock *other)
 		 * we signal error. Messages are lost. Do not make this,
 		 * when peer was not connected to us.
 		 */
+		unix_state_lock(other);
 		if (!sock_flag(other, SOCK_DEAD) && unix_peer(other) == sk) {
 			other->sk_err = ECONNRESET;
 			other->sk_error_report(other);
 		}
+		unix_state_unlock(other);
 	}
 }
 
@@ -413,11 +415,13 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	u->path.mnt = NULL;
 	state = sk->sk_state;
 	sk->sk_state = TCP_CLOSE;
-	unix_state_unlock(sk);
-
-	wake_up_interruptible_all(&u->peer_wait);
 
 	skpair = unix_peer(sk);
+	unix_peer(sk) = NULL;
+
+	wake_up_interruptible_all(&u->peer_wait);
+
+	unix_state_unlock(sk);
 
 	if (skpair != NULL) {
 		if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
@@ -431,7 +435,6 @@ static void unix_release_sock(struct sock *sk, int embrion)
 			sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
 		}
 		sock_put(skpair); /* It may now die */
-		unix_peer(sk) = NULL;
 	}
 
 	/* Try to flush out this socket. Throw out buffers at least */
@@ -2440,10 +2443,18 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	writable = unix_writable(sk);
 	other = unix_peer_get(sk);
 	if (other) {
-		if (unix_peer(other) != sk) {
+		unix_state_lock(other);
+
+		if (sock_flag(other, SOCK_DEAD) || (other->sk_shutdown & RCV_SHUTDOWN)) {
+			unix_state_unlock(other);
+			writable = 0;
+		} else if (unix_peer(other) != sk) {
 			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
+			unix_state_unlock(other);
 			if (unix_recvq_full(other))
 				writable = 0;
+		} else {
+			unix_state_unlock(other);
 		}
 		sock_put(other);
 	}

^ permalink raw reply related	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2015-10-01 12:58 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
2015-09-13 19:53 List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket Mathias Krause
2015-09-14  2:39 ` Eric Wong
2015-09-29 18:09   ` Mathias Krause
2015-09-29 19:09     ` Jason Baron
2015-09-30  5:54       ` Mathias Krause
2015-09-30  7:34         ` Michal Kubecek
2015-10-01  2:55           ` Jason Baron
2015-09-30 10:56         ` Rainer Weikusat
2015-09-30 11:55           ` Mathias Krause
2015-09-30 13:25             ` Rainer Weikusat
2015-09-30 13:38               ` Mathias Krause
2015-09-30 13:51                 ` Rainer Weikusat
2015-10-01  2:39         ` Jason Baron
2015-10-01 10:33           ` Rainer Weikusat
2015-10-01 12:10             ` Rainer Weikusat
2015-10-01 12:58               ` Rainer Weikusat
2015-09-15 17:07 ` Rainer Weikusat
2015-09-15 18:15   ` Mathias Krause