* List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
@ 2015-09-13 19:53 Mathias Krause
2015-09-14 2:39 ` Eric Wong
2015-09-15 17:07 ` Rainer Weikusat
0 siblings, 2 replies; 18+ messages in thread
From: Mathias Krause @ 2015-09-13 19:53 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
Davidlohr Bueso, Olivier Mauras, PaX Team
[-- Attachment #1: Type: text/plain, Size: 11177 bytes --]
Hi,
this is an attempt to resurrect the thread initially started here:
http://thread.gmane.org/gmane.linux.network/353003
While that patch fixed the issue for the mentioned reproducer, it did not
fix the bug for the production code Olivier is using. :(

Changing the reproducer only slightly allows me to trigger the following
list debug splat (CONFIG_DEBUG_LIST=y) reliably within seconds -- even
with the above linked patch applied:
[ 50.264249] ------------[ cut here ]------------
[ 50.264249] WARNING: CPU: 0 PID: 214 at lib/list_debug.c:59 __list_del_entry+0xa4/0xd0()
[ 50.264249] list_del corruption. prev->next should be ffff88003c2c1bb8, but was ffff88003f07bbb8
[ 50.264249] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
[ 50.264249] CPU: 0 PID: 214 Comm: epoll_bug Not tainted 4.2.0 #75
[ 50.264249] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 50.264249] ffffffff817e902e ffff880000087d08 ffffffff8155593c 0000000000000007
[ 50.264249] ffff880000087d58 ffff880000087d48 ffffffff8105202a 0000000000000001
[ 50.264249] ffff88003c2c1bb8 ffff88003f07bb80 0000000000000286 ffff88003f736640
[ 50.264249] Call Trace:
[ 50.264249] [<ffffffff8155593c>] dump_stack+0x4c/0x65
[ 50.264249] [<ffffffff8105202a>] warn_slowpath_common+0x8a/0xc0
[ 50.264249] [<ffffffff810520a6>] warn_slowpath_fmt+0x46/0x50
[ 50.264249] [<ffffffff81322354>] __list_del_entry+0xa4/0xd0
[ 50.264249] [<ffffffff81322391>] list_del+0x11/0x40
[ 50.264249] [<ffffffff81094d39>] remove_wait_queue+0x29/0x40
[ 50.264249] [<ffffffff811bc898>] ep_unregister_pollwait.isra.6+0x58/0x1a0
[ 50.264249] [<ffffffff811bc8e9>] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
[ 50.264249] [<ffffffff811bca02>] ep_remove+0x22/0x110
[ 50.264249] [<ffffffff811be28b>] SyS_epoll_ctl+0x62b/0xf70
[ 50.264249] [<ffffffff81000f44>] ? lockdep_sys_exit_thunk+0x12/0x14
[ 50.264249] [<ffffffff8155cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 50.264249] ---[ end trace d9af9b915df9667e ]---
[ 50.572100] ------------[ cut here ]------------
[ 50.572100] WARNING: CPU: 1 PID: 212 at lib/list_debug.c:62 __list_del_entry+0xc3/0xd0()
[ 50.584263] list_del corruption. next->prev should be ffff88003f664c90, but was ffff88003f0cb5b8
[ 50.584263] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
[ 50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: G W 4.2.0 #75
[ 50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 50.584263] ffffffff817e902e ffff88003d37fce8 ffffffff8155593c 0000000000000007
[ 50.584263] ffff88003d37fd38 ffff88003d37fd28 ffffffff8105202a 0000000000000001
[ 50.584263] ffff88003f664c90 ffff88003f0cb580 0000000000000282 ffff88003f731740
[ 50.584263] Call Trace:
[ 50.584263] [<ffffffff8155593c>] dump_stack+0x4c/0x65
[ 50.584263] [<ffffffff8105202a>] warn_slowpath_common+0x8a/0xc0
[ 50.584263] [<ffffffff810520a6>] warn_slowpath_fmt+0x46/0x50
[ 50.584263] [<ffffffff81322373>] __list_del_entry+0xc3/0xd0
[ 50.584263] [<ffffffff81322391>] list_del+0x11/0x40
[ 50.584263] [<ffffffff81094d39>] remove_wait_queue+0x29/0x40
[ 50.584263] [<ffffffff811bc898>] ep_unregister_pollwait.isra.6+0x58/0x1a0
[ 50.584263] [<ffffffff811bc8e9>] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
[ 50.584263] [<ffffffff811bca02>] ep_remove+0x22/0x110
[ 50.584263] [<ffffffff811bda62>] eventpoll_release_file+0x62/0xa0
[ 50.584263] [<ffffffff8117704f>] __fput+0x1af/0x200
[ 50.584263] [<ffffffff8155cf20>] ? int_very_careful+0x5/0x3f
[ 50.584263] [<ffffffff811770ee>] ____fput+0xe/0x10
[ 50.584263] [<ffffffff8107271d>] task_work_run+0x8d/0xc0
[ 50.584263] [<ffffffff8100390f>] do_notify_resume+0x4f/0x60
[ 50.584263] [<ffffffff8155cf6c>] int_signal+0x12/0x17
[ 50.584263] ---[ end trace d9af9b915df9667f ]---
[ 50.584263] BUG: spinlock already unlocked on CPU#1, epoll_bug/212
[ 50.584263] lock: 0xffff88003f0cb580, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
[ 50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: G W 4.2.0 #75
[ 50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 50.584263] ffff88003f0cb580 ffff88003d37fd38 ffffffff8155593c 0000000000000007
[ 50.584263] ffffffffffffffff ffff88003d37fd58 ffffffff810a3375 ffff88003f0cb580
[ 50.584263] ffffffff817b9cc8 ffff88003d37fd78 ffffffff810a33f6 ffff88003f0cb580
[ 50.584263] Call Trace:
[ 50.584263] [<ffffffff8155593c>] dump_stack+0x4c/0x65
[ 50.584263] [<ffffffff810a3375>] spin_dump+0x85/0xe0
[ 50.584263] [<ffffffff810a33f6>] spin_bug+0x26/0x30
[ 50.584263] [<ffffffff810a3645>] do_raw_spin_unlock+0x75/0xa0
[ 50.584263] [<ffffffff8155c4ec>] _raw_spin_unlock_irqrestore+0x2c/0x50
[ 50.584263] [<ffffffff81094d44>] remove_wait_queue+0x34/0x40
[ 50.584263] [<ffffffff811bc898>] ep_unregister_pollwait.isra.6+0x58/0x1a0
[ 50.584263] [<ffffffff811bc8e9>] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
[ 50.584263] [<ffffffff811bca02>] ep_remove+0x22/0x110
[ 50.584263] [<ffffffff811bda62>] eventpoll_release_file+0x62/0xa0
[ 50.584263] [<ffffffff8117704f>] __fput+0x1af/0x200
[ 50.584263] [<ffffffff8155cf20>] ? int_very_careful+0x5/0x3f
[ 50.584263] [<ffffffff811770ee>] ____fput+0xe/0x10
[ 50.584263] [<ffffffff8107271d>] task_work_run+0x8d/0xc0
[ 50.584263] [<ffffffff8100390f>] do_notify_resume+0x4f/0x60
[ 50.584263] [<ffffffff8155cf6c>] int_signal+0x12/0x17
[...]
That 'spinlock already unlocked' message is also interesting. But even
better, enabling slab debugging (CONFIG_SLUB_DEBUG_ON=y) turns that
list_del corruption warning into a general protection fault:
[ 21.124241] general protection fault: 0000 [#1] SMP
[ 21.128193] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net virtio_pci virtio_ring virtio sr_mod cdrom
[ 21.144249] CPU: 1 PID: 221 Comm: epoll_bug Not tainted 4.2.0 #75
[ 21.144249] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 21.144249] task: ffff88001fa82b80 ti: ffff880018894000 task.ti: ffff880018894000
[ 21.144249] RIP: 0010:[<ffffffff8109def0>] [<ffffffff8109def0>] __lock_acquire+0x240/0x1800
[ 21.144249] RSP: 0018:ffff880018897c98 EFLAGS: 00010002
[ 21.144249] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
[ 21.144249] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88001f8b1c18
[ 21.144249] RBP: ffff880018897d28 R08: 0000000000000001 R09: 0000000000000001
[ 21.144249] R10: 0000000000000000 R11: ffff88001f8b1c18 R12: 0000000000000000
[ 21.144249] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88001fa82b80
[ 21.144249] FS: 00007f0c87e5d700(0000) GS:ffff88001eb00000(0000) knlGS:0000000000000000
[ 21.144249] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 21.144249] CR2: 00007f0c87e5cff8 CR3: 000000001899c000 CR4: 00000000001406e0
[ 21.144249] Stack:
[ 21.144249] ffff880018897cb8 ffffffff8109d94b ffff880018b6fdb0 ffff88001e3d3b80
[ 21.144249] ffff880018897cc8 ffff88001fa83250 0000000000000002 0000000000000000
[ 21.144249] 0000000000000001 ffff88001fa82b80 ffff880018897d88 ffffffff8109e0f7
[ 21.144249] Call Trace:
[ 21.144249] [<ffffffff8109d94b>] ? trace_hardirqs_on_caller+0x14b/0x1e0
[ 21.144249] [<ffffffff8109e0f7>] ? __lock_acquire+0x447/0x1800
[ 21.144249] [<ffffffff8109fdd7>] lock_acquire+0xc7/0x260
[ 21.144249] [<ffffffff81094d2d>] ? remove_wait_queue+0x1d/0x40
[ 21.144249] [<ffffffff8155c373>] _raw_spin_lock_irqsave+0x43/0x60
[ 21.144249] [<ffffffff81094d2d>] ? remove_wait_queue+0x1d/0x40
[ 21.144249] [<ffffffff81094d2d>] remove_wait_queue+0x1d/0x40
[ 21.144249] [<ffffffff811bc898>] ep_unregister_pollwait.isra.6+0x58/0x1a0
[ 21.144249] [<ffffffff811bc8e9>] ? ep_unregister_pollwait.isra.6+0xa9/0x1a0
[ 21.144249] [<ffffffff811bca02>] ep_remove+0x22/0x110
[ 21.144249] [<ffffffff811be28b>] SyS_epoll_ctl+0x62b/0xf70
[ 21.144249] [<ffffffff81000f44>] ? lockdep_sys_exit_thunk+0x12/0x14
[ 21.144249] [<ffffffff8155cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 21.144249] Code: 49 81 3b c0 9e c4 81 b8 00 00 00 00 44 0f 44 c0 41 83 fe 01 0f 87 39 fe ff ff 44 89 f0 49 8b 5c c3 08 48 85 db 0f 84 28 fe ff ff <f0> ff 83 98 01 00 00 45 8b b7 a0 06 00 00 41 83 fe 2f 76 10 44
[ 21.144249] RIP [<ffffffff8109def0>] __lock_acquire+0x240/0x1800
[ 21.144249] RSP <ffff880018897c98>
[ 21.144249] ---[ end trace 7136cfe3b6480f34 ]---
The slab poisoning hits, as can be seen from the 0x6b pattern in RBX,
which the faulting instruction uses as a memory operand. So this is
probably a use-after-free bug -- or, more likely, an object that should
have been freed via RCU semantics.
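(For reference, 0x6b is the slab allocator's POISON_FREE pattern, per
include/linux/poison.h:

    #define POISON_INUSE    0x5a    /* for use-uninitialised poisoning */
    #define POISON_FREE     0x6b    /* for use-after-free poisoning */
    #define POISON_END      0xa5    /* end-byte of poisoning */

i.e. the faulting load dereferenced memory the allocator had already
poisoned on free.)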
I looked at the epoll code really hard and concluded I don't understand
it at all. I've added a few calls to synchronize_rcu() and changed direct
pointer assignments to rcu_assign_pointer() in places I thought would
need them. But that only made the race happen less often; it did not
cure it.
This is what I did:
- add a call to synchronize_rcu() in eventpoll_release_file() after
  taking the epmutex. It's traversing an RCU list, after all.
- make the NULL pointer assignment of whead in ep_poll_callback() an
  rcu_assign_pointer() assignment and call synchronize_rcu() afterwards.
  It's also an RCU pointer that should be assigned that way, no?
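For concreteness, the first of those experiments amounted to roughly the
following (a hypothetical sketch against ~v4.2 fs/eventpoll.c, body
abridged; shown for illustration, not as a fix):

    void eventpoll_release_file(struct file *file)
    {
        mutex_lock(&epmutex);
        synchronize_rcu();  /* experiment: drain concurrent RCU readers */

        /* ... original body: walk file->f_ep_links and ep_remove()
         * each epitem, unchanged ... */

        mutex_unlock(&epmutex);
    }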
But, apparently, all irrelevant.
I had the same luck while staring at the af_unix code. I've added a few
unix_state_lock() / -unlock() calls to places I thought would need them
to be able to reliably test / set socket flags and the peer member of
struct unix_sock. But again, that only made the bug happen less often.
What I did:
- take the other socket's unix_state_lock() in unix_dgram_disconnected()
  for testing the flags and signaling the error
- move the 'unix_peer(sk) = NULL' assignment into the section that holds
  the unix_state_lock() in unix_release_sock(). This ensures others will
  see changes to the peer pointer atomically -- provided they themselves
  take the lock, of course.
- in unix_dgram_poll(), avoid calling sock_poll_wait() if the peer is
  already SOCK_DEAD or RCV_SHUTDOWN. For testing the flags, the other
  socket's unix_state_lock() is taken.
The last one is, in fact, the old patch, extended by the
'(other->sk_shutdown & RCV_SHUTDOWN)' test. But Eric already noted back
then that it might be an expensive lock to take here.
Anyway, none of the above changes fixed the issue. I suspect it's
related to the double usage of the peer_wait wait queue in
unix_dgram_sendmsg() (via unix_wait_for_peer()) and unix_dgram_poll()
(via sock_poll_wait()). But I might be totally wrong here.
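For context, the sender side of that double usage looks roughly like this
(quoted from memory from net/unix/af_unix.c around v4.2, so treat it as a
sketch):

    /* called from unix_dgram_sendmsg() with other's unix_state_lock held;
     * the sender sleeps on the *peer's* wait queue */
    static long unix_wait_for_peer(struct sock *other, long timeo)
    {
        struct unix_sock *u = unix_sk(other);
        int sched;
        DEFINE_WAIT(wait);

        prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);

        sched = !sock_flag(other, SOCK_DEAD) &&
                !(other->sk_shutdown & RCV_SHUTDOWN) &&
                unix_recvq_full(other);

        unix_state_unlock(other);

        if (sched)
            timeo = schedule_timeout(timeo);

        finish_wait(&u->peer_wait, &wait);
        return timeo;
    }

while unix_dgram_poll() parks poll/epoll waiters on that very same queue
via sock_poll_wait(file, &unix_sk(other)->peer_wait, wait) -- i.e. entries
from other tasks' poll tables end up on a foreign socket's wait queue.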
However, it's definitely the second sock_poll_wait() call in
unix_dgram_poll() that triggers the issue. Commenting out the call --
thereby obviously breaking its functionality -- gets rid of the list
debug splat and the GPF. But that's not a fix either. So I'm asking for
help.
Regards,
Mathias
[-- Attachment #2: epoll_bug.c --]
[-- Type: text/x-csrc, Size: 2922 bytes --]
/* use-after-free in poll routine of AF_UNIX sockets, triggerable using epoll
 *
 * ..introduced in 3c73419c09 "af_unix: fix 'poll for write'/ connected DGRAM
 * sockets" (v2.6.26-rc7)
 *
 * $ gcc -pthread -o epoll_bug epoll_bug.c
 *
 * - minipli
 */
#include <sys/socket.h>
#include <sys/epoll.h>
#include <sys/un.h>
#include <pthread.h>
#include <unistd.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
static long fd_max;
static int ep = -1;


static int get_fd(void) {
    int fd;

    for (;;) {
        fd = rand() % fd_max;

        if (fd > 2 && fd != ep)
            break;
    }

    return fd;
}


static void *opener(void *ptr) {
    sleep(1);

    for (;;) {
        if (rand() % 2) {
            struct sockaddr_un sa = {
                .sun_family = AF_UNIX,
                .sun_path = "\0epool_bug-",
            };
            int sock = socket(AF_UNIX, SOCK_DGRAM, 0);
            int err;

            /* take a short nap when there are no more fds left so closer()
             * can catch up */
            if (sock < 0) {
                usleep(1);

                continue;
            }

            /* ensure the write won't block */
            fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);

            sa.sun_path[11] = rand() % 26 + 'A';
            if (rand() % 2)
                err = connect(sock, (struct sockaddr *) &sa, sizeof(sa));
            else
                err = bind(sock, (struct sockaddr *) &sa, sizeof(sa));

            if (err)
                close(sock);
        } else {
            static const char dot[] = { [0 ... 1023] = '.' };

            write(get_fd(), dot, rand() % 2 ? 1 : sizeof(dot));
        }
    }

    return ptr;
}


static void *closer(void *ptr) {
    int miss = 0;

    sleep(1);

    for (;;) {
        errno = 0;
        close(get_fd());

        /* take a short nap when we're hitting invalid fds 5 times in a row
         * so opener() can catch up */
        if (errno == EBADF && ++miss >= 5) {
            usleep(10);
            miss = 0;
        } else if (errno == 0) {
            miss = 0;
        }
    }

    return ptr;
}


static void *ep_add(void *ptr) {
    sleep(1);

    for (;;) {
        int fd = get_fd();
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLOUT,
            .data.fd = fd,
        };

        if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0 && errno == ENOSPC)
            usleep(1);
    }

    return ptr;
}


static void *ep_del(void *ptr) {
    sleep(1);

    for (;;)
        epoll_ctl(ep, EPOLL_CTL_DEL, get_fd(), NULL);

    return ptr;
}
int main(void) {
    pthread_t thread[4];
    int i;

    signal(SIGPIPE, SIG_IGN);

    ep = epoll_create(42); /* use epoll_create() for older kernels */
    if (ep < 0) {
        fprintf(stderr, "err: epoll_create() failed (%s)\n", strerror(errno));

        return 1;
    }

    fd_max = sysconf(_SC_OPEN_MAX);
    if (pthread_create(&thread[0], NULL, opener, NULL) ||
        pthread_create(&thread[1], NULL, closer, NULL) ||
        pthread_create(&thread[2], NULL, ep_add, NULL) ||
        pthread_create(&thread[3], NULL, ep_del, NULL))
    {
        fprintf(stderr, "err: failed to start all threads!\n");

        return 1;
    }

    /* XXX: pthread_cancel() all threads on termination request */

    for (i = 0; i < 4; i++)
        pthread_join(thread[i], NULL);

    return 0;
}
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-13 19:53 List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket Mathias Krause
@ 2015-09-14 2:39 ` Eric Wong
2015-09-29 18:09 ` Mathias Krause
2015-09-15 17:07 ` Rainer Weikusat
1 sibling, 1 reply; 18+ messages in thread
From: Eric Wong @ 2015-09-14 2:39 UTC (permalink / raw)
To: Mathias Krause
Cc: netdev, linux-kernel, Eric Dumazet, Rainer Weikusat,
Alexander Viro, Davide Libenzi, Davidlohr Bueso, Olivier Mauras,
PaX Team, Jason Baron
+cc Jason Baron since he might be able to provide more insight into
epoll.
Mathias Krause <minipli@googlemail.com> wrote:
> [... full quote of the original report and reproducer snipped ...]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-13 19:53 List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket Mathias Krause
2015-09-14 2:39 ` Eric Wong
@ 2015-09-15 17:07 ` Rainer Weikusat
2015-09-15 18:15 ` Mathias Krause
1 sibling, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-09-15 17:07 UTC (permalink / raw)
To: Mathias Krause
Cc: netdev, linux-kernel, Eric Dumazet, Rainer Weikusat,
Alexander Viro, Davide Libenzi, Davidlohr Bueso, Olivier Mauras,
PaX Team
Mathias Krause <minipli@googlemail.com> writes:
> this is an attempt to resurrect the thread initially started here:
>
> http://thread.gmane.org/gmane.linux.network/353003
>
> While that patch fixed the issue for the mentioned reproducer, it did not
> fix the bug for the production code Olivier is using. :(
>
> Changing the reproducer only slightly allows me to trigger the following
> list debug splat (CONFIG_DEBUG_LIST=y) reliably within seconds -- even
> with the above linked patch applied:
The patch was
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2233,10 +2233,14 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
         writable = unix_writable(sk);
         other = unix_peer_get(sk);
         if (other) {
-                if (unix_peer(other) != sk) {
+                unix_state_lock(other);
+                if (!sock_flag(other, SOCK_DEAD) && unix_peer(other) != sk) {
+                        unix_state_unlock(other);
                         sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
                         if (unix_recvq_full(other))
                                 writable = 0;
+                } else {
+                        unix_state_unlock(other);
                 }
                 sock_put(other);
         }
That's obviously not going to help you when 'racing with
unix_release_sock', as the socket might be released immediately after the
unix_state_unlock, i.e., before sock_poll_wait is called. Provided I
understand this correctly, the problem is that the socket reference
count may have become 1 by the time sock_put is called, but the
sock_poll_wait has created a new reference to it which isn't accounted
for.
A simple way to fix that could be to do something like

    unix_state_lock(other);
    if (!sock_flag(other, SOCK_DEAD))
        sock_poll_wait(...);
    unix_state_unlock(other);
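Spelled out with the arguments from the patch quoted above, that sketch
would read (untested):

    unix_state_lock(other);
    if (!sock_flag(other, SOCK_DEAD))
        sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
    unix_state_unlock(other);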
This would imply that unix_release_sock either marked the socket as dead
before the sock_poll_wait was executed or that the wake_up_interruptible
call in there will run after ->peer_wait was used (and it will thus
'unpollwait' it again).
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-15 17:07 ` Rainer Weikusat
@ 2015-09-15 18:15 ` Mathias Krause
0 siblings, 0 replies; 18+ messages in thread
From: Mathias Krause @ 2015-09-15 18:15 UTC (permalink / raw)
To: Rainer Weikusat
Cc: netdev, linux-kernel, Eric Dumazet, Alexander Viro,
Davide Libenzi, Davidlohr Bueso, Olivier Mauras, PaX Team
On Tue, Sep 15, 2015 at 06:07:05PM +0100, Rainer Weikusat wrote:
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2233,10 +2233,14 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>          writable = unix_writable(sk);
>          other = unix_peer_get(sk);
>          if (other) {
> -                if (unix_peer(other) != sk) {
> +                unix_state_lock(other);
> +                if (!sock_flag(other, SOCK_DEAD) && unix_peer(other) != sk) {
> +                        unix_state_unlock(other);
>                          sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
>                          if (unix_recvq_full(other))
>                                  writable = 0;
> +                } else {
> +                        unix_state_unlock(other);
>                  }
>                  sock_put(other);
>          }
>
> That's obviously not going to help you when 'racing with
> unix_release_sock', as the socket might be released immediately after the
> unix_state_unlock, i.e., before sock_poll_wait is called. Provided I
> understand this correctly, the problem is that the socket reference
> count may have become 1 by the time sock_put is called, but the
> sock_poll_wait has created a new reference to it which isn't accounted
> for.
>
> A simple way to fix that could be to do something like
>
>     unix_state_lock(other);
>     if (!sock_flag(other, SOCK_DEAD))
>         sock_poll_wait(...);
>     unix_state_unlock(other);
Sorry, but that does not fix the bug. I get the same GPF as before.
> This would imply that unix_release_sock either marked the socket as dead
> before the sock_poll_wait was executed or that the wake_up_interruptible
> call in there will run after ->peer_wait was used (and it will thus
> 'unpollwait' it again).
It must be something else, as I also tried the following patch, which
moves the wake_up_interruptible_all() call into the locked code section:
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d359f6a..58570a7680ce 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -360,10 +360,12 @@ static void unix_dgram_disconnected(struct sock *sk, struct sock *other)
                  * we signal error. Messages are lost. Do not make this,
                  * when peer was not connected to us.
                  */
+                unix_state_lock(other);
                 if (!sock_flag(other, SOCK_DEAD) && unix_peer(other) == sk) {
                         other->sk_err = ECONNRESET;
                         other->sk_error_report(other);
                 }
+                unix_state_unlock(other);
         }
 }

@@ -413,11 +415,13 @@ static void unix_release_sock(struct sock *sk, int embrion)
         u->path.mnt = NULL;
         state = sk->sk_state;
         sk->sk_state = TCP_CLOSE;
-        unix_state_unlock(sk);
-
-        wake_up_interruptible_all(&u->peer_wait);

         skpair = unix_peer(sk);
+        unix_peer(sk) = NULL;
+
+        wake_up_interruptible_all(&u->peer_wait);
+
+        unix_state_unlock(sk);

         if (skpair != NULL) {
                 if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
@@ -431,7 +435,6 @@ static void unix_release_sock(struct sock *sk, int embrion)
                         sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
                 }
                 sock_put(skpair); /* It may now die */
-                unix_peer(sk) = NULL;
         }

         /* Try to flush out this socket. Throw out buffers at least */
@@ -2440,10 +2443,18 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
         writable = unix_writable(sk);
         other = unix_peer_get(sk);
         if (other) {
-                if (unix_peer(other) != sk) {
+                unix_state_lock(other);
+
+                if (sock_flag(other, SOCK_DEAD) || (other->sk_shutdown & RCV_SHUTDOWN)) {
+                        unix_state_unlock(other);
+                        writable = 0;
+                } else if (unix_peer(other) != sk) {
                         sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
+                        unix_state_unlock(other);
                         if (unix_recvq_full(other))
                                 writable = 0;
+                } else {
+                        unix_state_unlock(other);
                 }
                 sock_put(other);
         }
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-14 2:39 ` Eric Wong
@ 2015-09-29 18:09 ` Mathias Krause
2015-09-29 19:09 ` Jason Baron
0 siblings, 1 reply; 18+ messages in thread
From: Mathias Krause @ 2015-09-29 18:09 UTC (permalink / raw)
To: netdev, linux-kernel@vger.kernel.org
Cc: Eric Wong, Eric Dumazet, Rainer Weikusat, Alexander Viro,
Davide Libenzi, Davidlohr Bueso, Olivier Mauras, PaX Team,
Jason Baron
On 14 September 2015 at 04:39, Eric Wong <normalperson@yhbt.net> wrote:
> +cc Jason Baron since he might be able to provide more insight into
> epoll.
>
> Mathias Krause <minipli@googlemail.com> wrote:
>>> [... full quote of the original report and reproducer snipped ...]
Ping. This still triggers on v4.3-rc3.
Any epoll expert around to have a look at this?
Regards,
Mathias
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-29 18:09 ` Mathias Krause
@ 2015-09-29 19:09 ` Jason Baron
2015-09-30 5:54 ` Mathias Krause
0 siblings, 1 reply; 18+ messages in thread
From: Jason Baron @ 2015-09-29 19:09 UTC (permalink / raw)
To: Mathias Krause, netdev, linux-kernel@vger.kernel.org
Cc: Eric Wong, Eric Dumazet, Rainer Weikusat, Alexander Viro,
Davide Libenzi, Davidlohr Bueso, Olivier Mauras, PaX Team,
Linus Torvalds, eric.dumazet, peterz@infradead.org,
davem@davemloft.net
On 09/29/2015 02:09 PM, Mathias Krause wrote:
> On 14 September 2015 at 04:39, Eric Wong <normalperson@yhbt.net> wrote:
>> +cc Jason Baron since he might be able to provide more insight into
>> epoll.
>>
>> Mathias Krause <minipli@googlemail.com> wrote:
>>> [... full quote of the original report and reproducer snipped ...]
>>> use-after-free bug -- or, more likely, an object that had better been
>>> freed via RCU semantics.
>>>
>>> I looked at the epoll code really hard and concluded I don't understand
>>> it at all. I've added a few calls to synchronize_rcu() and changed direct
>>> pointer assignments to rcu_assign_pointer() in places I thought would
>>> need them. But that only made the race happen less often; it didn't cure it.
>>>
>>> This is what I did:
>>> - add a call to synchronize_rcu() in eventpoll_release_file() after
>>> taking the epmutex. It's traversing an RCU list, after all.
>>> - make the NULL pointer assignment of whead in ep_poll_callback() an
>>> rcu_assign_pointer() assignment and call synchronize_rcu() afterwards.
>>> It's also an RCU pointer that should be assigned that way, no?
>>>
>>> But, apparently, all irrelevant.
>>>
>>> I had the same luck while staring at the af_unix code. I've added a few
>>> unix_state_lock() / -unlock() calls to places I thought would need them
>>> to be able to reliably test / set socket flags and the peer member of
>>> struct unix_sock. But again, that only made the bug happen less often.
>>>
>>> What I did:
>>> - take other's unix_state_lock() in unix_dgram_disconnected() for
>>> testing the flags and signaling the error
>>> - move the 'unix_peer(sk) = NULL' assignment to the section that holds
>>> the unix_state_lock() in unix_release_sock(). This ensures others will
>>> see changes to the peer pointer atomically -- provided they themselves
>>> take the lock, of course.
>>> - in unix_dgram_poll() avoid calling sock_poll_wait() if the peer is
>>> already SOCK_DEAD or RCV_SHUTDOWN. For testing the flags, other's
>>> unix_state_lock() is taken.
>>>
>>> The last one is, in fact, the old patch, extended by the
>>> '(other->sk_shutdown & RCV_SHUTDOWN)' test. But Eric already noted back
>>> then that it might be an expensive lock to take here.
>>>
>>> Anyway, none of the above changes fixed the issue. I suspect it's
>>> related to the double usage of the peer_wq wait queue in unix_dgram_sendmsg()
>>> (via unix_wait_for_peer()) and unix_dgram_poll() (via sock_poll_wait()).
>>> But I might be totally wrong here.
>>>
>>> However, it's definitely the second sock_poll_wait() call in
>>> unix_dgram_poll() that triggers the issue. Commenting out the call --
>>> thereby obviously breaking its functionality -- gets rid of the list
>>> debug splat and the GPF. But that's not a fix either. So I'm asking for
>>> help.
>>>
>>>
>>> Regards,
>>> Mathias
>>
>>> /* use-after-free in poll routine of AF_UNIX sockets, triggerable using epoll
>>> *
>>> * ..introduced in 3c73419c09 "af_unix: fix 'poll for write'/ connected DGRAM
>>> * sockets" (v2.6.26-rc7)
>>> *
>>> * $ gcc -pthread -o epoll_bug epoll_bug.c
>>> *
>>> * - minipli
>>> */
>>> #include <sys/socket.h>
>>> #include <sys/epoll.h>
>>> #include <sys/un.h>
>>> #include <pthread.h>
>>> #include <unistd.h>
>>> #include <signal.h>
>>> #include <stdlib.h>
>>> #include <string.h>
>>> #include <stdio.h>
>>> #include <fcntl.h>
>>> #include <errno.h>
>>>
>>>
>>> static long fd_max;
>>> static int ep = -1;
>>>
>>>
>>> static int get_fd(void) {
>>> int fd;
>>>
>>> for (;;) {
>>> fd = rand() % fd_max;
>>>
>>> if (fd > 2 && fd != ep)
>>> break;
>>> }
>>>
>>> return fd;
>>> }
>>>
>>>
>>> static void *opener(void *ptr) {
>>> sleep(1);
>>>
>>> for (;;) {
>>> if (rand() % 2) {
>>> struct sockaddr_un sa = {
>>> .sun_family = AF_UNIX,
>>> .sun_path = "\0epool_bug-",
>>> };
>>> int sock = socket(AF_UNIX, SOCK_DGRAM, 0);
>>> int err;
>>>
>>> /* take a short nap when there are no more fds left so closer() can
>>> * catch up */
>>> if (sock < 0) {
>>> usleep(1);
>>>
>>> continue;
>>> }
>>>
>>> /* ensure the write won't block */
>>> fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);
>>>
>>> sa.sun_path[11] = rand() % 26 + 'A';
>>> if (rand() % 2)
>>> err = connect(sock, (struct sockaddr *) &sa, sizeof(sa));
>>> else
>>> err = bind(sock, (struct sockaddr *) &sa, sizeof(sa));
>>>
>>> if (err)
>>> close(sock);
>>> } else {
>>> static const char dot[] = { [0 ... 1023] = '.' };
>>>
>>> write(get_fd(), dot, rand() % 2 ? 1 : sizeof(dot));
>>> }
>>> }
>>>
>>> return ptr;
>>> }
>>>
>>>
>>> static void *closer(void *ptr) {
>>> int miss = 0;
>>>
>>> sleep(1);
>>>
>>> for (;;) {
>>> errno = 0;
>>> close(get_fd());
>>>
>>> /* take a short nap when we're hitting invalid fds 5 times in a row so
>>> * opener() can catch up */
>>> if (errno == EBADF && ++miss >= 5) {
>>> usleep(10);
>>> miss = 0;
>>> } else if (errno == 0) {
>>> miss = 0;
>>> }
>>> }
>>>
>>> return ptr;
>>> }
>>>
>>>
>>> static void *ep_add(void *ptr) {
>>> sleep(1);
>>>
>>> for (;;) {
>>> int fd = get_fd();
>>> struct epoll_event ev = {
>>> .events = EPOLLIN | EPOLLOUT,
>>> .data.fd = fd,
>>> };
>>>
>>> if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0 && errno == ENOSPC)
>>> usleep(1);
>>> }
>>>
>>> return ptr;
>>> }
>>>
>>>
>>> static void *ep_del(void *ptr) {
>>> sleep(1);
>>>
>>> for (;;)
>>> epoll_ctl(ep, EPOLL_CTL_DEL, get_fd(), NULL);
>>>
>>> return ptr;
>>> }
>>>
>>>
>>> int main(void) {
>>> pthread_t thread[4];
>>> int i;
>>>
>>> signal(SIGPIPE, SIG_IGN);
>>>
>>> ep = epoll_create(42); /* use epoll_create() for older kernels */
>>> if (ep < 0) {
>>> fprintf(stderr, "err: epoll_create() failed (%s)\n", strerror(errno));
>>>
>>> return 1;
>>> }
>>>
>>> fd_max = sysconf(_SC_OPEN_MAX);
>>> if (pthread_create(&thread[0], NULL, opener, NULL) ||
>>> pthread_create(&thread[1], NULL, closer, NULL) ||
>>> pthread_create(&thread[2], NULL, ep_add, NULL) ||
>>> pthread_create(&thread[3], NULL, ep_del, NULL))
>>> {
>>> fprintf(stderr, "err: failed to start all threads!\n");
>>>
>>> return 1;
>>> }
>>>
>>> /* XXX: pthread_cancel() all threads on termination request */
>>>
>>> for (i = 0; i < 4; i++)
>>> pthread_join(thread[i], NULL);
>>>
>>> return 0;
>>> }
>
> Ping. This still triggers on v4.3-rc3.
> Any epoll expert around to have a look at this?
>
> Regards,
> Mathias
>
Hi,
So the issue is that unix_dgram_poll() registers not only the wait queue for
the socket that we are polling against, but also the wait queue for the
socket 'other' with epoll(). So if we have socket 's', and we add socket
's' to an epoll set 'ep', normally, when either the 'ep' set closes or
socket 's' closes, we remove all references to the wait queue. Now, in this
case we can call connect() to associate a socket 'o' with socket 's'. Thus,
epoll now has references to both the wait queue for socket 's' *and*
the one for socket 'o'. Now, this works most of the time on close() b/c socket
's' takes a reference on socket 'o' when it does a connect. Thus, if socket
's', 'o' or 'ep' closes, we properly unregister things.
However, if we call connect on socket 's', to connect to a new socket 'o2', we
drop the reference on the original socket 'o'. Thus, we can now close socket
'o' without unregistering from epoll. Then, when we either close the ep
or unregister 'o', we end up with this list corruption. Thus, this is not a
race per se, but can be triggered sequentially.
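Spelled out as a user-space sequence, the above would look something like
this (a hypothetical sketch, using abstract-namespace addresses and
omitting all error handling):

#include <sys/socket.h>
#include <sys/epoll.h>
#include <sys/un.h>
#include <string.h>
#include <unistd.h>

static int bound(const char *name)
{
	struct sockaddr_un sa = { .sun_family = AF_UNIX };
	int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

	/* abstract namespace: sun_path[0] stays '\0' */
	strncpy(sa.sun_path + 1, name, sizeof(sa.sun_path) - 2);
	bind(fd, (struct sockaddr *)&sa, sizeof(sa));
	return fd;
}

int main(void)
{
	struct sockaddr_un sa = { .sun_family = AF_UNIX };
	struct epoll_event ev = { .events = EPOLLOUT };
	int o = bound("o"), o2 = bound("o2");
	int s = socket(AF_UNIX, SOCK_DGRAM, 0);
	int ep = epoll_create1(0);

	strncpy(sa.sun_path + 1, "o", sizeof(sa.sun_path) - 2);
	connect(s, (struct sockaddr *)&sa, sizeof(sa));
	epoll_ctl(ep, EPOLL_CTL_ADD, s, &ev);	/* registers s *and* o's peer_wait */

	strncpy(sa.sun_path + 1, "o2", sizeof(sa.sun_path) - 2);
	connect(s, (struct sockaddr *)&sa, sizeof(sa));	/* drops the ref on o */
	close(o);	/* o can now be freed, epoll is still linked to it */

	epoll_ctl(ep, EPOLL_CTL_DEL, s, NULL);	/* walks the freed wait queue */
	return 0;
}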
Linus explains the general case in the context of the signalfd stuff here:
https://lkml.org/lkml/2013/10/14/634
So this may be the case that we've been chasing here for a while...
In any case, we could fix this with that same POLLFREE mechanism, the simplest
would be something like:
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..d499f81 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -392,6 +392,9 @@ static void unix_sock_destructor(struct sock *sk)
pr_debug("UNIX %p is destroyed, %ld are still alive.\n", sk,
atomic_long_read(&unix_nr_socks));
#endif
+ /* make sure we remove from epoll */
+ wake_up_poll(&u->peer_wait, POLLFREE);
+ synchronize_sched();
}
static void unix_release_sock(struct sock *sk, int embrion)
I'm not suggesting we apply that, but that fixed the supplied test case.
We could enhance the above, to avoid the freeing latency there by doing
SLAB_DESTROY_BY_RCU for unix sockets. But I'm not convinced
that this wouldn't still be broken for select()/poll() as well. I think
we can be in a select() call for socket 's', and if we remove socket
'o' from it in the meantime (by doing a connect() on s to somewhere else
and a close on 'o'), I think we can still crash there. So POLLFREE would
have to be extended. I tried to hit this with select() but could not,
but I think if I tried harder I could.
Instead of going further down that route, perhaps something like below
might be better. The basic idea would be to do away with the 'other'
poll call in unix_dgram_poll(), and instead revert back to registering
on a single wait queue. We add a new wait queue to unix sockets such
that we can register it with a remote 'other' on connect(). Then we can
use the wakeup from the remote to wake up the registered unix socket.
Probably better explained with the patch below. Note I didn't add to
the remote for SOCK_STREAM, since the poll() routine there doesn't do
the double wait queue registering:
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t wait;
};
#define unix_sk(__sk) ((struct unix_sock *)__sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..9e0692a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair = unix_peer(sk);
if (skpair != NULL) {
+ if (sk->sk_type != SOCK_STREAM)
+ remove_wait_queue(&unix_sk(skpair)->peer_wait, &u->wait);
if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
unix_state_lock(skpair);
/* No more writes */
@@ -636,6 +638,16 @@ static struct proto unix_proto = {
*/
static struct lock_class_key af_unix_sk_receive_queue_lock_key;
+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+ struct unix_sock *u;
+
+ u = container_of(wait, struct unix_sock, wait);
+ wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
+
+ return 0;
+}
+
static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
{
struct sock *sk = NULL;
@@ -664,6 +676,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->wait, peer_wake);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1030,7 +1043,10 @@ restart:
*/
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
+
+ remove_wait_queue(&unix_sk(old_peer)->peer_wait, &unix_sk(sk)->wait);
unix_peer(sk) = other;
+ add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
unix_state_double_unlock(sk, other);
if (other != old_peer)
@@ -1038,8 +1054,12 @@ restart:
sock_put(old_peer);
} else {
unix_peer(sk) = other;
+ add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
unix_state_double_unlock(sk, other);
}
+ /* New remote may have created write space for us */
+ wake_up_interruptible_sync_poll(sk_sleep(sk),
+ POLLOUT | POLLWRNORM | POLLWRBAND);
return 0;
out_unlock:
@@ -1194,6 +1214,8 @@ restart:
sock_hold(sk);
unix_peer(newsk) = sk;
+ if (sk->sk_type == SOCK_SEQPACKET)
+ add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
newsk->sk_state = TCP_ESTABLISHED;
newsk->sk_type = sk->sk_type;
init_peercred(newsk);
@@ -1220,6 +1242,8 @@ restart:
smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
unix_peer(sk) = newsk;
+ if (sk->sk_type == SOCK_SEQPACKET)
+ add_wait_queue(&unix_sk(newsk)->peer_wait, &unix_sk(sk)->wait);
unix_state_unlock(sk);
@@ -1254,6 +1278,10 @@ static int unix_socketpair(struct socket *socka, struct socket *sockb)
sock_hold(skb);
unix_peer(ska) = skb;
unix_peer(skb) = ska;
+ if (ska->sk_type != SOCK_STREAM) {
+ add_wait_queue(&unix_sk(ska)->peer_wait, &unix_sk(skb)->wait);
+ add_wait_queue(&unix_sk(skb)->peer_wait, &unix_sk(ska)->wait);
+ }
init_peercred(ska);
init_peercred(skb);
@@ -1565,6 +1593,7 @@ restart:
unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+ remove_wait_queue(&unix_sk(other)->peer_wait, &u->wait);
unix_state_unlock(sk);
unix_dgram_disconnected(sk, other);
@@ -2441,7 +2470,6 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
other = unix_peer_get(sk);
if (other) {
if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
if (unix_recvq_full(other))
writable = 0;
}
--
1.8.2.rc2
I think this makes more sense too, b/c it can also get the epoll() case
correct where the socket s is connected to a new remote. In the old code
we would continue to get wakeups then from the wrong remote. Here, we
can fix that as well.
There is perhaps a performance issue with this approach, since I'm adding
to the poll list on connect(). So even if we are not in a poll(), we are
still walking the list to do wakeups. That may or may not be a big deal,
and I think it can be fixed by setting something like the SOCK_NOSPACE bit
in unix_dgram_poll() when it's not writable, and conditioning the wakeup
on that. The alternative would be to register with the remote on poll(),
but then the unregister is more complex...
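A minimal sketch of that SOCK_NOSPACE conditioning (untested, and it
assumes unix_dgram_poll() sets the bit whenever it reports the socket as
not writable) would replace peer_wake() from the patch above with:

static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	struct unix_sock *u = container_of(wait, struct unix_sock, wait);
	struct socket *sock = u->sk.sk_socket;

	/* only relay the wakeup if somebody actually polled for write
	 * space; sk_socket may be NULL while the socket is going away */
	if (!sock || !test_and_clear_bit(SOCK_NOSPACE, &sock->flags))
		return 0;

	wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
	return 0;
}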
Perhaps, somebody more familiar with the unix socket code can comment
on this. Thanks for the test case!
Thanks,
-Jason
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-29 19:09 ` Jason Baron
@ 2015-09-30 5:54 ` Mathias Krause
2015-09-30 7:34 ` Michal Kubecek
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Mathias Krause @ 2015-09-30 5:54 UTC (permalink / raw)
To: Jason Baron
Cc: netdev, linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
Rainer Weikusat, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
davem@davemloft.net
On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
> However, if we call connect on socket 's', to connect to a new socket 'o2', we
> drop the reference on the original socket 'o'. Thus, we can now close socket
> 'o' without unregistering from epoll. Then, when we either close the ep
> or unregister 'o', we end up with this list corruption. Thus, this is not a
> race per se, but can be triggered sequentially.
Sounds profound, but the reproducer calls connect() only once per
socket. So there is no "connect to a new socket", no?
But w/e, see below.
> Linus explains the general case in the context of the signalfd stuff here:
> https://lkml.org/lkml/2013/10/14/634
I also found that posting while looking for similar bug reports. Also
found that one: https://lkml.org/lkml/2014/5/15/532
> So this may be the case that we've been chasing here for a while...
That bug triggers since commit 3c73419c09 "af_unix: fix 'poll for
write'/ connected DGRAM sockets". That's v2.6.26-rc7, as noted in the
reproducer.
>
> In any case, we could fix this with that same POLLFREE mechanism, the simplest
> would be something like:
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 03ee4d3..d499f81 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -392,6 +392,9 @@ static void unix_sock_destructor(struct sock *sk)
> pr_debug("UNIX %p is destroyed, %ld are still alive.\n", sk,
> atomic_long_read(&unix_nr_socks));
> #endif
> + /* make sure we remove from epoll */
> + wake_up_poll(&u->peer_wait, POLLFREE);
> + synchronize_sched();
> }
>
> static void unix_release_sock(struct sock *sk, int embrion)
>
> I'm not suggesting we apply that, but that fixed the supplied test case.
> We could enhance the above, to avoid the freeing latency there by doing
> SLAB_DESTROY_BY_RCU for unix sockets. But I'm not convinced
> that this wouldn't still be broken for select()/poll() as well. I think
> we can be in a select() call for socket 's', and if we remove socket
> 'o' from it in the meantime (by doing a connect() on s to somewhere else
> and a close on 'o'), I think we can still crash there. So POLLFREE would
> have to be extended. I tried to hit this with select() but could not,
> but I think if I tried harder I could.
>
> Instead of going further down that route, perhaps something like below
> might be better. The basic idea would be to do away with the 'other'
> poll call in unix_dgram_poll(), and instead revert back to registering
> on a single wait queue. We add a new wait queue to unix sockets such
> that we can register it with a remote 'other' on connect(). Then we can
> use the wakeup from the remote to wake up the registered unix socket.
> Probably better explained with the patch below. Note I didn't add to
> the remote for SOCK_STREAM, since the poll() routine there doesn't do
> the double wait queue registering:
>
> diff --git a/include/net/af_unix.h b/include/net/af_unix.h
> index 4a167b3..9698aff 100644
> --- a/include/net/af_unix.h
> +++ b/include/net/af_unix.h
> @@ -62,6 +62,7 @@ struct unix_sock {
> #define UNIX_GC_CANDIDATE 0
> #define UNIX_GC_MAYBE_CYCLE 1
> struct socket_wq peer_wq;
> + wait_queue_t wait;
> };
> #define unix_sk(__sk) ((struct unix_sock *)__sk)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 03ee4d3..9e0692a 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -420,6 +420,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
> skpair = unix_peer(sk);
>
> if (skpair != NULL) {
> + if (sk->sk_type != SOCK_STREAM)
> + remove_wait_queue(&unix_sk(skpair)->peer_wait, &u->wait);
> if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
> unix_state_lock(skpair);
> /* No more writes */
> @@ -636,6 +638,16 @@ static struct proto unix_proto = {
> */
> static struct lock_class_key af_unix_sk_receive_queue_lock_key;
>
> +static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
> +{
> + struct unix_sock *u;
> +
> + u = container_of(wait, struct unix_sock, wait);
> + wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
> +
> + return 0;
> +}
> +
> static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
> {
> struct sock *sk = NULL;
> @@ -664,6 +676,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
> INIT_LIST_HEAD(&u->link);
> mutex_init(&u->readlock); /* single task reading lock */
> init_waitqueue_head(&u->peer_wait);
> + init_waitqueue_func_entry(&u->wait, peer_wake);
> unix_insert_socket(unix_sockets_unbound(sk), sk);
> out:
> if (sk == NULL)
> @@ -1030,7 +1043,10 @@ restart:
> */
> if (unix_peer(sk)) {
> struct sock *old_peer = unix_peer(sk);
> +
> + remove_wait_queue(&unix_sk(old_peer)->peer_wait, &unix_sk(sk)->wait);
> unix_peer(sk) = other;
> + add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
> unix_state_double_unlock(sk, other);
>
> if (other != old_peer)
> @@ -1038,8 +1054,12 @@ restart:
> sock_put(old_peer);
> } else {
> unix_peer(sk) = other;
> + add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
> unix_state_double_unlock(sk, other);
> }
> + /* New remote may have created write space for us */
> + wake_up_interruptible_sync_poll(sk_sleep(sk),
> + POLLOUT | POLLWRNORM | POLLWRBAND);
> return 0;
>
> out_unlock:
> @@ -1194,6 +1214,8 @@ restart:
>
> sock_hold(sk);
> unix_peer(newsk) = sk;
> + if (sk->sk_type == SOCK_SEQPACKET)
> + add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
> newsk->sk_state = TCP_ESTABLISHED;
> newsk->sk_type = sk->sk_type;
> init_peercred(newsk);
> @@ -1220,6 +1242,8 @@ restart:
>
> smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
> unix_peer(sk) = newsk;
> + if (sk->sk_type == SOCK_SEQPACKET)
> + add_wait_queue(&unix_sk(newsk)->peer_wait, &unix_sk(sk)->wait);
>
> unix_state_unlock(sk);
>
> @@ -1254,6 +1278,10 @@ static int unix_socketpair(struct socket *socka, struct socket *sockb)
> sock_hold(skb);
> unix_peer(ska) = skb;
> unix_peer(skb) = ska;
> + if (ska->sk_type != SOCK_STREAM) {
> + add_wait_queue(&unix_sk(ska)->peer_wait, &unix_sk(skb)->wait);
> + add_wait_queue(&unix_sk(skb)->peer_wait, &unix_sk(ska)->wait);
> + }
> init_peercred(ska);
> init_peercred(skb);
>
> @@ -1565,6 +1593,7 @@ restart:
> unix_state_lock(sk);
> if (unix_peer(sk) == other) {
> unix_peer(sk) = NULL;
> + remove_wait_queue(&unix_sk(other)->peer_wait, &u->wait);
> unix_state_unlock(sk);
>
> unix_dgram_disconnected(sk, other);
> @@ -2441,7 +2470,6 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
> other = unix_peer_get(sk);
> if (other) {
> if (unix_peer(other) != sk) {
> - sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
> if (unix_recvq_full(other))
> writable = 0;
> }
> --
> 1.8.2.rc2
>
>
>
> I think this makes more sense too, b/c it can also get the epoll() case
> correct where the socket s is connected to a new remote. In the old code
> we would continue to get wakeups then from the wrong remote. Here, we
> can fix that as well.
>
> There is perhaps a performance issue with this approach, since I'm adding
> to the poll list on connect(). So even if we are not in a poll(), we are
> still walking the list to do wakeups. That may or may not be a big deal,
> and I think it can be fixed by setting something like the SOCK_NOSPACE bit
> in unix_dgram_poll() when it's not writable, and conditioning the wakeup
> on that. The alternative would be to register with the remote on poll(),
> but then the unregister is more complex...
This one looks good to me. It survived almost 10 hours on a 2 CPU
system where the reproducer was able to trigger the bug within
seconds.
Thanks for fixing this!
> Perhaps, somebody more familiar with the unix socket code can comment
> on this. Thanks for the test case!
Eric Dumazet may have opinions concerning the performance impact. At
least he had in the past...
Regards,
Mathias
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-30 5:54 ` Mathias Krause
@ 2015-09-30 7:34 ` Michal Kubecek
2015-10-01 2:55 ` Jason Baron
2015-09-30 10:56 ` Rainer Weikusat
2015-10-01 2:39 ` Jason Baron
2 siblings, 1 reply; 18+ messages in thread
From: Michal Kubecek @ 2015-09-30 7:34 UTC (permalink / raw)
To: Mathias Krause
Cc: Jason Baron, netdev, linux-kernel@vger.kernel.org, Eric Wong,
Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
peterz@infradead.org, davem@davemloft.net
On Wed, Sep 30, 2015 at 07:54:29AM +0200, Mathias Krause wrote:
> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
> > However, if we call connect on socket 's', to connect to a new socket 'o2', we
> > drop the reference on the original socket 'o'. Thus, we can now close socket
> > 'o' without unregistering from epoll. Then, when we either close the ep
> > or unregister 'o', we end up with this list corruption. Thus, this is not a
> > race per se, but can be triggered sequentially.
>
> Sounds profound, but the reproducer calls connect() only once per
> socket. So there is no "connect to a new socket", no?
I believe there is another scenario: 'o' becomes SOCK_DEAD while 's' is
still connected to it. This is detected by 's' in unix_dgram_sendmsg()
so that 's' releases its reference on 'o' and 'o' can be freed. If this
happens before 's' is unregistered, we get use-after-free as 'o' has
never been unregistered. And as the interval between freeing 'o' and
unregistering 's' can be quite long, there is a chance for the memory to
be reused. This is what one of our customers has seen:
[exception RIP: _raw_spin_lock_irqsave+156]
RIP: ffffffff8040f5bc RSP: ffff8800e929de78 RFLAGS: 00010082
RAX: 000000000000a32c RBX: ffff88003954ab80 RCX: 0000000000001000
RDX: 00000000f2320000 RSI: 000000000000f232 RDI: ffff88003954ab80
RBP: 0000000000005220 R8: dead000000100100 R9: 0000000000000000
R10: 00007fff1a284960 R11: 0000000000000246 R12: 0000000000000000
R13: ffff8800e929de8c R14: 000000000000000e R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 10000e030 SS: e02b
#8 [ffff8800e929de70] _raw_spin_lock_irqsave at ffffffff8040f5a9
#9 [ffff8800e929deb0] remove_wait_queue at ffffffff8006ad09
#10 [ffff8800e929ded0] ep_unregister_pollwait at ffffffff80170043
#11 [ffff8800e929def0] ep_remove at ffffffff80170073
#12 [ffff8800e929df10] sys_epoll_ctl at ffffffff80171453
#13 [ffff8800e929df80] system_call_fastpath at ffffffff80417553
In this case, the crash happened on unregistering 's', which had a null peer
(i.e. not reconnected but rather disconnected) but there were still two
items in the list, the other pointing to an unallocated page which has
apparently been modified in between.
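For reference, the dead-peer detection path, abbreviated from
unix_dgram_sendmsg() in net/unix/af_unix.c (note that it ends up in
unix_dgram_disconnected() just before the final sock_put() on the peer):

	if (sock_flag(other, SOCK_DEAD)) {
		/* peer died: drop our reference and signal the error */
		unix_state_unlock(other);
		sock_put(other);

		err = 0;
		unix_state_lock(sk);
		if (unix_peer(sk) == other) {
			unix_peer(sk) = NULL;
			unix_state_unlock(sk);

			unix_dgram_disconnected(sk, other);
			sock_put(other);
			err = -ECONNREFUSED;
		}
		...
	}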
IMHO unix_dgram_disconnected() could be the place to handle this issue:
it is called from both places where we disconnect from a peer (dead peer
detection in unix_dgram_sendmsg() and reconnect in unix_dgram_connect())
just before the reference to the peer is released. I'm not familiar with the
epoll implementation so I'm still trying to find what exactly needs to
be done to unregister the peer at this moment.
> That bug triggers since commit 3c73419c09 "af_unix: fix 'poll for
> write'/ connected DGRAM sockets". That's v2.6.26-rc7, as noted in the
> reproducer.
Sounds likely as this is the commit that introduced unix_dgram_poll()
with the code which adds the "asymmetric peer" to monitor its queue
state. More precisely, the asymmetricity check has been added by
ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
shortly after that.
Michal Kubecek
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-30 5:54 ` Mathias Krause
2015-09-30 7:34 ` Michal Kubecek
@ 2015-09-30 10:56 ` Rainer Weikusat
2015-09-30 11:55 ` Mathias Krause
2015-10-01 2:39 ` Jason Baron
2 siblings, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-09-30 10:56 UTC (permalink / raw)
To: Mathias Krause
Cc: Jason Baron, netdev, linux-kernel@vger.kernel.org, Eric Wong,
Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
peterz@infradead.org, davem@davemloft.net
Mathias Krause <minipli@googlemail.com> writes:
> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>> However, if we call connect on socket 's', to connect to a new socket 'o2', we
>> drop the reference on the original socket 'o'. Thus, we can now close socket
>> 'o' without unregistering from epoll. Then, when we either close the ep
>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>> race per se, but can be triggered sequentially.
>
> Sounds profound, but the reproducer calls connect() only once per
> socket. So there is no "connect to a new socket", no?
> But w/e, see below.
In case you want some information on this: This is a kernel warning I
could trigger (more than once) on the single day I could so far spend
looking into this (3.2.54 kernel):
Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
Sep 15 19:37:19 doppelsaurus kernel: Modules linked in: snd_hrtimer binfmt_misc af_packet nf_conntrack loop snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm sg snd_page_alloc snd_seq_du
mmy sr_mod snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_timer ath9k snd cdrom ath9k_common ath9k_hw r8169 mii ath usb_storage unix
Sep 15 19:37:19 doppelsaurus kernel: Pid: 3340, comm: a.out Tainted: G W 3.2.54-saurus-vesa #9
Sep 15 19:37:19 doppelsaurus kernel: Call Trace:
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff811c1e00>] ? __list_del_entry+0x80/0xc0
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff81036ad9>] ? warn_slowpath_common+0x79/0xc0
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff81036bd5>] ? warn_slowpath_fmt+0x45/0x50
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff811c1e49>] ? list_del+0x9/0x30
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff81051509>] ? remove_wait_queue+0x29/0x50
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810fde62>] ? ep_unregister_pollwait.isra.9+0x32/0x50
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810fdeaa>] ? ep_remove+0x2a/0xc0
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810fe9ae>] ? eventpoll_release_file+0x5e/0x90
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810c76f6>] ? fput+0x1c6/0x220
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810c3b7f>] ? filp_close+0x5f/0x90
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff810c3c36>] ? sys_close+0x86/0xd0
Sep 15 19:37:19 doppelsaurus kernel: [<ffffffff8141f43b>] ? system_call_fastpath+0x16/0x1b
The dead000000100100 is one of the list poison values a linkage pointer
is set to during/after removal from a list. The particular warning
means that entry->prev (of the item being removed) pointed to another
entry whose next pointer was not the address of entry but
dead000000100100. Most likely, this means there's a list insert racing
with a list remove somewhere here: the insert picks up the pointer to
the previous item while it is still on the list and uses it while the
delete removes it, with the delete having the last word and thus setting
prev->next to dead000000100100 after the insert set it to the address of
the item to be inserted.
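For reference, those poison values come from include/linux/poison.h; with
CONFIG_ILLEGAL_POINTER_VALUE set to 0xdead000000000000, as on x86-64, the
first one is exactly the dead000000100100 seen above:

#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)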
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-30 10:56 ` Rainer Weikusat
@ 2015-09-30 11:55 ` Mathias Krause
2015-09-30 13:25 ` Rainer Weikusat
0 siblings, 1 reply; 18+ messages in thread
From: Mathias Krause @ 2015-09-30 11:55 UTC (permalink / raw)
To: Rainer Weikusat
Cc: Jason Baron, netdev, linux-kernel@vger.kernel.org, Eric Wong,
Eric Dumazet, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
davem@davemloft.net
On 30 September 2015 at 12:56, Rainer Weikusat
<rweikusat@mobileactivedefense.com> wrote:
> Mathias Krause <minipli@googlemail.com> writes:
>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>> However, if we call connect on socket 's', to connect to a new socket 'o2', we
>>> drop the reference on the original socket 'o'. Thus, we can now close socket
>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>> race per se, but can be triggered sequentially.
>>
>> Sounds profound, but the reproducer calls connect() only once per
>> socket. So there is no "connect to a new socket", no?
>> But w/e, see below.
>
> In case you want some information on this: This is a kernel warning I
> could trigger (more than once) on the single day I could so far spend
> looking into this (3.2.54 kernel):
>
> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
> [snip]
Is that with Jason's patch or a vanilla v3.2.54?
Regards,
Mathias
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-30 11:55 ` Mathias Krause
@ 2015-09-30 13:25 ` Rainer Weikusat
2015-09-30 13:38 ` Mathias Krause
0 siblings, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-09-30 13:25 UTC (permalink / raw)
To: Mathias Krause
Cc: Rainer Weikusat, Jason Baron, netdev,
linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
Alexander Viro, Davide Libenzi, Davidlohr Bueso, Olivier Mauras,
PaX Team, Linus Torvalds, peterz@infradead.org,
davem@davemloft.net
Mathias Krause <minipli@googlemail.com> writes:
> On 30 September 2015 at 12:56, Rainer Weikusat <rweikusat@mobileactivedefense.com> wrote:
>> Mathias Krause <minipli@googlemail.com> writes:
>>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>>> However, if we call connect on socket 's', to connect to a new socket 'o2', we
>>>> drop the reference on the original socket 'o'. Thus, we can now close socket
>>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>>> race per se, but can be triggered sequentially.
>>>
>>> Sounds profound, but the reproducer calls connect() only once per
>>> socket. So there is no "connect to a new socket", no?
>>> But w/e, see below.
>>
>> In case you want some information on this: This is a kernel warning I
>> could trigger (more than once) on the single day I could so far spend
>> looking into this (3.2.54 kernel):
>>
>> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
>> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
>> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
>> [snip]
>
> Is that with Jason's patch or a vanilla v3.2.54?
That's a kernel warning which occurred repeatedly (among other "link
pointer disorganization" warnings) when I tested the "program with
unknown behaviour" you wrote with the kernel I'm currently supporting a
while ago (as I already wrote in the original mail).
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-30 13:25 ` Rainer Weikusat
@ 2015-09-30 13:38 ` Mathias Krause
2015-09-30 13:51 ` Rainer Weikusat
0 siblings, 1 reply; 18+ messages in thread
From: Mathias Krause @ 2015-09-30 13:38 UTC (permalink / raw)
To: Rainer Weikusat
Cc: Jason Baron, netdev, linux-kernel@vger.kernel.org, Eric Wong,
Eric Dumazet, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
davem@davemloft.net
On 30 September 2015 at 15:25, Rainer Weikusat
<rweikusat@mobileactivedefense.com> wrote:
> Mathias Krause <minipli@googlemail.com> writes:
>> On 30 September 2015 at 12:56, Rainer Weikusat <rweikusat@mobileactivedefense.com> wrote:
>>> In case you want some information on this: This is a kernel warning I
>>> could trigger (more than once) on the single day I could so far spend
>>> looking into this (3.2.54 kernel):
>>>
>>> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
>>> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
>>> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
>>> [snip]
>>
>> Is that with Jason's patch or a vanilla v3.2.54?
>
> That's a kernel warning which occurred repeatedly (among other "link
> pointer disorganization" warnings) when I tested the "program with
> unknown behaviour" you wrote with the kernel I'm currently supporting a
> while ago (as I already wrote in the original mail).
So I assume Jason's patch is not included in your kernel. Then those
messages are expected; expected even on kernels as old as v2.6.27.
Can you re-try with Jason's patch applied?
Thanks,
Mathias
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-30 13:38 ` Mathias Krause
@ 2015-09-30 13:51 ` Rainer Weikusat
0 siblings, 0 replies; 18+ messages in thread
From: Rainer Weikusat @ 2015-09-30 13:51 UTC (permalink / raw)
To: Mathias Krause
Cc: Rainer Weikusat, Jason Baron, netdev,
linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
Alexander Viro, Davide Libenzi, Davidlohr Bueso, Olivier Mauras,
PaX Team, Linus Torvalds, peterz@infradead.org,
davem@davemloft.net
Mathias Krause <minipli@googlemail.com> writes:
> On 30 September 2015 at 15:25, Rainer Weikusat
> <rweikusat@mobileactivedefense.com> wrote:
>> Mathias Krause <minipli@googlemail.com> writes:
>>> On 30 September 2015 at 12:56, Rainer Weikusat <rweikusat@mobileactivedefense.com> wrote:
>>>> In case you want some information on this: This is a kernel warning I
>>>> could trigger (more than once) on the single day I could so far spend
>>>> looking into this (3.2.54 kernel):
>>>>
>>>> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 list_del+0x9/0x30()
>>>> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
>>>> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be ffff88022c38f078, but was dead000000100100
>>>> [snip]
>>>
>>> Is that with Jason's patch or a vanilla v3.2.54?
>>
>> That's a kernel warning which occurred repeatedly (among other "link
>> pointer disorganization" warnings) when I tested the "program with
>> unknown behaviour" you wrote with the kernel I'm currently supporting a
>> while ago (as I already wrote in the original mail).
>
> So I assume Jason's patch is not included in your kernel. Then those
> messages are expected; expected even on kernels as old as v2.6.27.
> Can you re-try with Jason's patch applied?
If something needs to be unregistered here (which I don't claim to
know but someone should), the corresponding code would need to be added
to the unix_dgram_disconnected function (as was already pointed
out).
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-30 5:54 ` Mathias Krause
2015-09-30 7:34 ` Michal Kubecek
2015-09-30 10:56 ` Rainer Weikusat
@ 2015-10-01 2:39 ` Jason Baron
2015-10-01 10:33 ` Rainer Weikusat
2 siblings, 1 reply; 18+ messages in thread
From: Jason Baron @ 2015-10-01 2:39 UTC (permalink / raw)
To: Mathias Krause
Cc: netdev, linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
Rainer Weikusat, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
davem@davemloft.net
On 09/30/2015 01:54 AM, Mathias Krause wrote:
> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>> However, if we call connect on socket 's', to connect to a new socket 'o2', we
>> drop the reference on the original socket 'o'. Thus, we can now close socket
>> 'o' without unregistering from epoll. Then, when we either close the ep
>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>> race per se, but can be triggered sequentially.
>
> Sounds profound, but the reproducer calls connect() only once per
> socket. So there is no "connect to a new socket", no?
> But w/e, see below.
Yes, but it can be reproduced this way too. It can also happen with a
close() on the remote peer 'o', and a send to 'o' from 's', which the
reproducer can do as pointed out by Michal. The patch I sent deals with
both cases.
Thanks,
-Jason
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-09-30 7:34 ` Michal Kubecek
@ 2015-10-01 2:55 ` Jason Baron
0 siblings, 0 replies; 18+ messages in thread
From: Jason Baron @ 2015-10-01 2:55 UTC (permalink / raw)
To: Michal Kubecek, Mathias Krause
Cc: netdev, linux-kernel@vger.kernel.org, Eric Wong, Eric Dumazet,
Rainer Weikusat, Alexander Viro, Davide Libenzi, Davidlohr Bueso,
Olivier Mauras, PaX Team, Linus Torvalds, peterz@infradead.org,
davem@davemloft.net
On 09/30/2015 03:34 AM, Michal Kubecek wrote:
> On Wed, Sep 30, 2015 at 07:54:29AM +0200, Mathias Krause wrote:
>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>> However, if we call connect on socket 's', to connect to a new socket 'o2', we
>>> drop the reference on the original socket 'o'. Thus, we can now close socket
>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>> race per se, but can be triggered sequentially.
>>
>> Sounds profound, but the reproducer calls connect() only once per
>> socket. So there is no "connect to a new socket", no?
>
> I believe there is another scenario: 'o' becomes SOCK_DEAD while 's' is
> still connected to it. This is detected by 's' in unix_dgram_sendmsg()
> so that 's' releases its reference on 'o' and 'o' can be freed. If this
> happens before 's' is unregistered, we get use-after-free as 'o' has
> never been unregistered. And as the interval between freeing 'o' and
> unregistering 's' can be quite long, there is a chance for the memory to
> be reused. This is what one of our customers has seen:
>
> [exception RIP: _raw_spin_lock_irqsave+156]
> RIP: ffffffff8040f5bc RSP: ffff8800e929de78 RFLAGS: 00010082
> RAX: 000000000000a32c RBX: ffff88003954ab80 RCX: 0000000000001000
> RDX: 00000000f2320000 RSI: 000000000000f232 RDI: ffff88003954ab80
> RBP: 0000000000005220 R8: dead000000100100 R9: 0000000000000000
> R10: 00007fff1a284960 R11: 0000000000000246 R12: 0000000000000000
> R13: ffff8800e929de8c R14: 000000000000000e R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 10000e030 SS: e02b
> #8 [ffff8800e929de70] _raw_spin_lock_irqsave at ffffffff8040f5a9
> #9 [ffff8800e929deb0] remove_wait_queue at ffffffff8006ad09
> #10 [ffff8800e929ded0] ep_unregister_pollwait at ffffffff80170043
> #11 [ffff8800e929def0] ep_remove at ffffffff80170073
> #12 [ffff8800e929df10] sys_epoll_ctl at ffffffff80171453
> #13 [ffff8800e929df80] system_call_fastpath at ffffffff80417553
>
> In this case, crash happened on unregistering 's' which had null peer
> (i.e. not reconnected but rather disconnected) but there were still two
> items in the list, the other pointing to an unallocated page which has
> apparently been modified in between.
>
> IMHO unix_dgram_disconnected() could be the place to handle this issue:
> it is called from both places where we disconnect from a peer (dead peer
> detection in unix_dgram_sendmsg() and reconnect in unix_dgram_connect())
> just before the reference to the peer is released. I'm not familiar with the
> epoll implementation so I'm still trying to find what exactly needs to
> be done to unregister the peer at this moment.
>
Indeed, that is a path as well, and the patch I posted here deals with
that case too: unix_dgram_disconnected() gets called, as you point out,
when we are removing the remote peer, and I do a remove_wait_queue() in
that case. The patch I posted converts us back to polling against a
single wait queue.
The wait structure that epoll()/select()/poll() adds to the peer wait
queue is really opaque to the unix code. The normal pattern is for
epoll()/select()/poll() to do the unregister, not the socket/fd that
we are waiting on. Further, we could not just release all of the wait
queues in unix_dgram_disconnected() b/c there could be multiple waiters
there. So the POLLFREE thing really has to be done from the
unix_sock_destructor() path, since it is going to free the entire queue.
In addition, I think that removing the wait queue from
unix_dgram_disconnected() will still be broken, b/c we would need to
re-add to the remote peer via subsequent poll(), to get events if the
socket was re-connected to a new peer.
Thanks,
-Jason
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-10-01 2:39 ` Jason Baron
@ 2015-10-01 10:33 ` Rainer Weikusat
2015-10-01 12:10 ` Rainer Weikusat
0 siblings, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-10-01 10:33 UTC (permalink / raw)
To: Jason Baron
Cc: Mathias Krause, netdev, linux-kernel@vger.kernel.org, Eric Wong,
Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
peterz@infradead.org, davem@davemloft.net
Jason Baron <jbaron@akamai.com> writes:
> On 09/30/2015 01:54 AM, Mathias Krause wrote:
>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>> However, if we call connect on socket 's', to connect to a new socket 'o2', we
>>> drop the reference on the original socket 'o'. Thus, we can now close socket
>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>> race per se, but can be triggered sequentially.
>>
>> Sounds profound, but the reproducer calls connect() only once per
>> socket. So there is no "connect to a new socket", no?
>> But w/e, see below.
>
> Yes, but it can be reproduced this way too. It can also happen with a
> close() on the remote peer 'o', and a send to 'o' from 's', which the
> reproducer can do as pointed out by Michal. The patch I sent deals with
> both cases.
As Michal also pointed out, there's a unix_dgram_disconnected routine
being called in both cases, and insofar as "deregistering" anything beyond
what unix_dgram_disconnected (and - as far as I can tell -
unix_release_sock) already do is actually required, this would be the
obvious place to add it. A good step on the way to that would be to
write (and post) some test code which actually reproduces the problem in
a predictable way.
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-10-01 10:33 ` Rainer Weikusat
@ 2015-10-01 12:10 ` Rainer Weikusat
2015-10-01 12:58 ` Rainer Weikusat
0 siblings, 1 reply; 18+ messages in thread
From: Rainer Weikusat @ 2015-10-01 12:10 UTC (permalink / raw)
To: Jason Baron
Cc: Mathias Krause, netdev, linux-kernel@vger.kernel.org, Eric Wong,
Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
peterz@infradead.org, davem@davemloft.net
Rainer Weikusat <rw@doppelsaurus.mobileactivedefense.com> writes:
> Jason Baron <jbaron@akamai.com> writes:
>> On 09/30/2015 01:54 AM, Mathias Krause wrote:
>>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>>> However, if we call connect on socket 's', to connect to a new socket 'o2', we
>>>> drop the reference on the original socket 'o'. Thus, we can now close socket
>>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>>> race per se, but can be triggered sequentially.
>>>
>>> Sounds profound, but the reproducer calls connect() only once per
>>> socket. So there is no "connect to a new socket", no?
>>> But w/e, see below.
>>
>> Yes, but it can be reproduced this way too. It can also happen with a
>> close() on the remote peer 'o', and a send to 'o' from 's', which the
>> reproducer can do as pointed out by Michal. The patch I sent deals with
>> both cases.
>
> As Michal also pointed out, there's a unix_dgram_disconnected routine
> being called in both cases, and insofar as "deregistering" anything beyond
> what unix_dgram_disconnected (and - as far as I can tell -
> unix_release_sock) already do is actually required, this would be the
> obvious place to add it. A good step on the way to that would be to
> write (and post) some test code which actually reproduces the problem in
> a predictable way.
Test program (assumes that it can execute itself as ./a.out):
-------------
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/epoll.h>
#include <signal.h>
#include <unistd.h>
static int sk;
static void *epoller(void *unused)
{
struct epoll_event epev;
int epfd;
epfd = epoll_create(1);
epev.events = EPOLLOUT;
epoll_ctl(epfd, EPOLL_CTL_ADD, sk, &epev);
epoll_wait(epfd, &epev, 1, 5000);
execl("./a.out", "./a.out", (void *)0);
return NULL;
}
int main(void)
{
struct sockaddr_un sun;
pthread_t tid;
int tg0, tg1, rc;
sun.sun_family = AF_UNIX;
tg0 = socket(AF_UNIX, SOCK_DGRAM, 0);
strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
unlink(sun.sun_path);
bind(tg0, (struct sockaddr *)&sun, sizeof(sun));
tg1 = socket(AF_UNIX, SOCK_DGRAM, 0);
strncpy(sun.sun_path, "/tmp/tg1", sizeof(sun.sun_path));
unlink(sun.sun_path);
bind(tg1, (struct sockaddr *)&sun, sizeof(sun));
sk = socket(AF_UNIX, SOCK_DGRAM, 0);
connect(sk, (struct sockaddr *)&sun, sizeof(sun));
fcntl(sk, F_SETFL, fcntl(sk, F_GETFL) | O_NONBLOCK);
while ((rc = write(sk, "bla", 3)) != -1);
pthread_create(&tid, NULL, epoller, NULL);
usleep(5);
strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
connect(sk, (struct sockaddr *)&sun, sizeof(sun));
close(tg1);
pause();
return 0;
}
-----------------
Inserting a
struct list_head *p;
p = &u->peer_wait.task_list;
WARN_ON(p->next != p || p->prev != p);
into unix_sock_destructor triggers a warning and, after running for a
while, one also gets the list corruption warnings, for example:
Oct 1 12:53:38 doppelsaurus kernel: WARNING: at lib/list_debug.c:54 list_del+0x9/0x30()
Oct 1 12:53:38 doppelsaurus kernel: Hardware name: 500-330nam
Oct 1 12:53:38 doppelsaurus kernel: list_del corruption. prev->next should be ffff880232634420, but was ffff880224f878c8
Oct 1 12:53:38 doppelsaurus kernel: Modules linked in: snd_hrtimer binfmt_misc af_packet nf_conntrack loop snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm ath9k ath9k_common ath9k_hw snd_seq_dummy snd_page_alloc snd_seq_oss snd_seq_midi_event snd_seq sg ath sr_mod snd_seq_device cdrom snd_timer snd r8169 mii usb_storage unix
Oct 1 12:53:38 doppelsaurus kernel: Pid: 4619, comm: a.out Tainted: G W 3.2.54-saurus-vesa #18
Oct 1 12:53:38 doppelsaurus kernel: Call Trace:
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff811c1e00>] ? __list_del_entry+0x80/0x100
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff81036ad9>] ? warn_slowpath_common+0x79/0xc0
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff81036bd5>] ? warn_slowpath_fmt+0x45/0x50
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff811c1e89>] ? list_del+0x9/0x30
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff81051509>] ? remove_wait_queue+0x29/0x50
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff810fde62>] ? ep_unregister_pollwait.isra.9+0x32/0x50
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff810fdeaa>] ? ep_remove+0x2a/0xc0
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff810fe9ae>] ? eventpoll_release_file+0x5e/0x90
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff810c76f6>] ? fput+0x1c6/0x220
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff810c3b7f>] ? filp_close+0x5f/0x90
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff81039e35>] ? put_files_struct+0xb5/0x110
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff8103a4ef>] ? do_exit+0x59f/0x750
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff8103a9b8>] ? do_group_exit+0x38/0xa0
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff8103aa32>] ? sys_exit_group+0x12/0x20
Oct 1 12:53:38 doppelsaurus kernel: [<ffffffff8141f5fe>] ? tracesys+0xd0/0xd5
Oct 1 12:53:38 doppelsaurus kernel: ---[ end trace ba5c51cbfeb664d8 ]---
Doing "the signalfd_cleanup" thing should prevent this (didn't test this
so far). I still think the correct solution would be to clean this up in
unix_dgram_disconnected.
* Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket
2015-10-01 12:10 ` Rainer Weikusat
@ 2015-10-01 12:58 ` Rainer Weikusat
0 siblings, 0 replies; 18+ messages in thread
From: Rainer Weikusat @ 2015-10-01 12:58 UTC (permalink / raw)
To: Jason Baron
Cc: Mathias Krause, netdev, linux-kernel@vger.kernel.org, Eric Wong,
Eric Dumazet, Rainer Weikusat, Alexander Viro, Davide Libenzi,
Davidlohr Bueso, Olivier Mauras, PaX Team, Linus Torvalds,
peterz@infradead.org, davem@davemloft.net
Rainer Weikusat <rw@doppelsaurus.mobileactivedefense.com> writes:
> Rainer Weikusat <rw@doppelsaurus.mobileactivedefense.com> writes:
>> Jason Baron <jbaron@akamai.com> writes:
>>> On 09/30/2015 01:54 AM, Mathias Krause wrote:
>>>> On 29 September 2015 at 21:09, Jason Baron <jbaron@akamai.com> wrote:
>>>>> However, if we call connect on socket 's', to connect to a new socket 'o2', we
>>>>> drop the reference on the original socket 'o'. Thus, we can now close socket
>>>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>>>> race per se, but can be triggered sequentially.
[...]
> Test program (assumes that it can execute itself as ./a.out):
>
> -------------
> #include <fcntl.h>
> #include <pthread.h>
> #include <string.h>
> #include <sys/socket.h>
> #include <sys/un.h>
> #include <sys/epoll.h>
> #include <signal.h>
> #include <unistd.h>
>
> static int sk;
>
> static void *epoller(void *unused)
> {
> struct epoll_event epev;
> int epfd;
>
> epfd = epoll_create(1);
>
> epev.events = EPOLLOUT;
> epoll_ctl(epfd, EPOLL_CTL_ADD, sk, &epev);
> epoll_wait(epfd, &epev, 1, 5000);
>
> execl("./a.out", "./a.out", (void *)0);
>
> return NULL;
> }
[...]
Possibly interesting additional bit of information: The list corruption
warnings appear only if the 2nd connect is there and both the sk and
epfd file descriptors are left open across the exec. Closing either of
the two still triggers the _destructor warnings but nothing else (until the
process runs out of file descriptors).
Thread overview: 18+ messages
2015-09-13 19:53 List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket Mathias Krause
2015-09-14 2:39 ` Eric Wong
2015-09-29 18:09 ` Mathias Krause
2015-09-29 19:09 ` Jason Baron
2015-09-30 5:54 ` Mathias Krause
2015-09-30 7:34 ` Michal Kubecek
2015-10-01 2:55 ` Jason Baron
2015-09-30 10:56 ` Rainer Weikusat
2015-09-30 11:55 ` Mathias Krause
2015-09-30 13:25 ` Rainer Weikusat
2015-09-30 13:38 ` Mathias Krause
2015-09-30 13:51 ` Rainer Weikusat
2015-10-01 2:39 ` Jason Baron
2015-10-01 10:33 ` Rainer Weikusat
2015-10-01 12:10 ` Rainer Weikusat
2015-10-01 12:58 ` Rainer Weikusat
2015-09-15 17:07 ` Rainer Weikusat
2015-09-15 18:15 ` Mathias Krause