Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 2/6] VSOCK: vsock address implementaion.
From: George Zhang @ 2012-11-21 20:39 UTC (permalink / raw)
  To: netdev, linux-kernel, georgezhang, virtualization
  Cc: pv-drivers, gregkh, davem
In-Reply-To: <20121121203715.14395.27632.stgit@promb-2n-dhcp175.eng.vmware.com>

VSOCK linux address code implementation.

Signed-off-by: George Zhang <georgezhang@vmware.com>
Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
Signed-off-by: Andy King <acking@vmware.com>

---
 net/vmw_vsock/vsock_addr.c |  246 ++++++++++++++++++++++++++++++++++++++++++++
 net/vmw_vsock/vsock_addr.h |   40 +++++++
 2 files changed, 286 insertions(+), 0 deletions(-)
 create mode 100644 net/vmw_vsock/vsock_addr.c
 create mode 100644 net/vmw_vsock/vsock_addr.h

diff --git a/net/vmw_vsock/vsock_addr.c b/net/vmw_vsock/vsock_addr.c
new file mode 100644
index 0000000..35eeb14
--- /dev/null
+++ b/net/vmw_vsock/vsock_addr.c
@@ -0,0 +1,246 @@
+/*
+ * VMware vSockets Driver
+ *
+ * Copyright (C) 2007-2012 VMware, Inc. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation version 2 and no later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+/*
+ * vsockAddr.c --
+ *
+ * VSockets address implementation.
+ */
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/stddef.h>	/* for NULL */
+#include <net/sock.h>
+
+#include "vsock_common.h"
+
+/*
+ *
+ * vsock_addr_init --
+ *
+ * Initialize the given address with the given context id and port. This will
+ * clear the address, set the correct family, and add the given values.
+ *
+ * Results: None.
+ *
+ * Side effects: None.
+ */
+
+void vsock_addr_init(struct sockaddr_vm *addr, u32 cid, u32 port)
+{
+	memset(addr, 0, sizeof *addr);
+
+	addr->svm_family = AF_VSOCK;
+	addr->svm_cid = cid;
+	addr->svm_port = port;
+}
+
+/*
+ *
+ * vsock_addr_validate --
+ *
+ * Try to validate the given address.  The address must not be null and must
+ * have the correct address family.  Any reserved fields must be zero.
+ *
+ * Results: 0 on success, EFAULT if the address is null, EAFNOSUPPORT if the
+ * address is of the wrong family, and EINVAL if the reserved fields are not
+ * zero.
+ *
+ * Side effects: None.
+ */
+
+int vsock_addr_validate(const struct sockaddr_vm *addr)
+{
+	if (!addr)
+		return -EFAULT;
+
+	if (addr->svm_family != AF_VSOCK)
+		return -EAFNOSUPPORT;
+
+	if (addr->svm_zero[0] != 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
+ *
+ * vsock_addr_bound --
+ *
+ * Determines whether the provided address is bound.
+ *
+ * Results: TRUE if the address structure is bound, FALSE otherwise.
+ *
+ * Side effects: None.
+ */
+
+bool vsock_addr_bound(const struct sockaddr_vm *addr)
+{
+	return addr->svm_port != VMADDR_PORT_ANY;
+}
+
+/*
+ *
+ * vsock_addr_unbind --
+ *
+ * Unbind the given addresss.
+ *
+ * Results: None.
+ *
+ * Side effects: None.
+ */
+
+void vsock_addr_unbind(struct sockaddr_vm *addr)
+{
+	vsock_addr_init(addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+}
+
+/*
+ *
+ * vsock_addr_equals_addr --
+ *
+ * Determine if the given addresses are equal.
+ *
+ * Results: TRUE if the addresses are equal, FALSE otherwise.
+ *
+ * Side effects: None.
+ */
+
+bool vsock_addr_equals_addr(const struct sockaddr_vm *addr,
+			    const struct sockaddr_vm *other)
+{
+	return addr->svm_cid == other->svm_cid &&
+		addr->svm_port == other->svm_port;
+}
+
+/*
+ *
+ * vsock_addr_equals_addr_any --
+ *
+ * Determine if the given addresses are equal. Will accept either an exact
+ * match or one where the rids match and that either the cids match or are set
+ * to VMADDR_CID_ANY.
+ *
+ * Results: TRUE if the addresses are equal, FALSE otherwise.
+ *
+ * Side effects: None.
+ */
+
+bool vsock_addr_equals_addr_any(const struct sockaddr_vm *addr,
+				const struct sockaddr_vm *other)
+{
+	return (addr->svm_cid == VMADDR_CID_ANY ||
+		other->svm_cid == VMADDR_CID_ANY ||
+		addr->svm_cid == other->svm_cid) &&
+	       addr->svm_port == other->svm_port;
+}
+
+/*
+ *
+ * vsock_addr_equals_handle_port --
+ *
+ * Determines if the given address matches the given handle and port.
+ *
+ * Results: TRUE if the address matches the handle and port, FALSE otherwise.
+ *
+ * Side effects: None.
+ */
+
+bool vsock_addr_equals_handle_port(const struct sockaddr_vm *addr,
+				   struct vmci_handle handle, u32 port)
+{
+	return addr->svm_cid == VMCI_HANDLE_TO_CONTEXT_ID(handle) &&
+		addr->svm_port == port;
+}
+
+/*
+ *
+ * vsock_addr_cast --
+ *
+ * Try to cast the given generic address to a VM address.  The given length
+ * must match that of a VM address and the address must be valid. The
+ * "out_addr" parameter contains the address if successful.
+ *
+ * Results: 0 on success, EFAULT if the length is too small.  See
+ * vsock_addr_validate() for other possible return codes.
+ *
+ * Side effects: None.
+ */
+
+int vsock_addr_cast(const struct sockaddr *addr,
+		    size_t len, struct sockaddr_vm **out_addr)
+{
+	if (len < sizeof **out_addr)
+		return -EFAULT;
+
+	*out_addr = (struct sockaddr_vm *)addr;
+	return vsock_addr_validate(*out_addr);
+}
+
+/*
+ *
+ * vsock_addr_socket_context_stream --
+ *
+ * Determines whether the provided context id represents a context that
+ * contains a stream socket endpoints.
+ *
+ * Results: TRUE if the context does have socket endpoints, FALSE otherwise.
+ *
+ * Side effects: None.
+ */
+
+bool vsock_addr_socket_context_stream(u32 cid)
+{
+	static const vmci_id non_socket_contexts[] = {
+		VMCI_HYPERVISOR_CONTEXT_ID,
+		VMCI_WELL_KNOWN_CONTEXT_ID,
+	};
+	int i;
+
+	BUILD_BUG_ON(sizeof cid != sizeof *non_socket_contexts);
+
+	for (i = 0; i < ARRAY_SIZE(non_socket_contexts); i++) {
+		if (cid == non_socket_contexts[i])
+			return false;
+
+	}
+
+	return true;
+}
+
+/*
+ *
+ * vsock_addr_socket_context_dgram --
+ *
+ * Determines whether the provided <context id, resource id> represent a
+ * protected datagram endpoint.
+ *
+ * Results: TRUE if the context does have socket endpoints, FALSE otherwise.
+ *
+ * Side effects: None.
+ */
+
+bool vsock_addr_socket_context_dgram(u32 cid, u32 rid)
+{
+	if (cid == VMCI_HYPERVISOR_CONTEXT_ID) {
+		/*
+		 * Registrations of PBRPC Servers do not modify VMX/Hypervisor
+		 * state and are allowed.
+		 */
+		return rid == VMCI_UNITY_PBRPC_REGISTER;
+	}
+
+	return true;
+}
diff --git a/net/vmw_vsock/vsock_addr.h b/net/vmw_vsock/vsock_addr.h
new file mode 100644
index 0000000..18f023d
--- /dev/null
+++ b/net/vmw_vsock/vsock_addr.h
@@ -0,0 +1,40 @@
+/*
+ * VMware vSockets Driver
+ *
+ * Copyright (C) 2007-2012 VMware, Inc. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation version 2 and no later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+/*
+ * vsockAddr.h --
+ *
+ * VSockets address constants, types and functions.
+ */
+
+#ifndef _VSOCK_ADDR_H_
+#define _VSOCK_ADDR_H_
+
+void vsock_addr_init(struct sockaddr_vm *addr, u32 cid, u32 port);
+int vsock_addr_validate(const struct sockaddr_vm *addr);
+bool vsock_addr_bound(const struct sockaddr_vm *addr);
+void vsock_addr_unbind(struct sockaddr_vm *addr);
+bool vsock_addr_equals_addr(const struct sockaddr_vm *addr,
+			    const struct sockaddr_vm *other);
+bool vsock_addr_equals_addr_any(const struct sockaddr_vm *addr,
+				const struct sockaddr_vm *other);
+bool vsock_addr_equals_handle_port(const struct sockaddr_vm *addr,
+				   struct vmci_handle handle, u32 port);
+int vsock_addr_cast(const struct sockaddr *addr, size_t len,
+		    struct sockaddr_vm **out_addr);
+bool vsock_addr_socket_context_stream(u32 cid);
+bool vsock_addr_socket_context_dgram(u32 cid, u32 rid);
+
+#endif

^ permalink raw reply related

* [PATCH 1/6] VSOCK: vsock protocol implementation.
From: George Zhang @ 2012-11-21 20:39 UTC (permalink / raw)
  To: netdev, linux-kernel, georgezhang, virtualization
  Cc: pv-drivers, gregkh, davem
In-Reply-To: <20121121203715.14395.27632.stgit@promb-2n-dhcp175.eng.vmware.com>

VSOCK linux socket module for VMCI Sockets protocol family.

Signed-off-by: George Zhang <georgezhang@vmware.com>
Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
Signed-off-by: Andy King <acking@vmware.com>

---
 net/vmw_vsock/af_vsock.c | 4054 ++++++++++++++++++++++++++++++++++++++++++++++
 net/vmw_vsock/af_vsock.h |  180 ++
 2 files changed, 4234 insertions(+), 0 deletions(-)
 create mode 100644 net/vmw_vsock/af_vsock.c
 create mode 100644 net/vmw_vsock/af_vsock.h

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
new file mode 100644
index 0000000..981d24c
--- /dev/null
+++ b/net/vmw_vsock/af_vsock.c
@@ -0,0 +1,4054 @@
+/*
+ * VMware vSockets Driver
+ *
+ * Copyright (C) 2007-2012 VMware, Inc. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation version 2 and no later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+/*
+ * af_vsock.c --
+ *
+ * Linux socket module for the VMCI Sockets protocol family.
+ */
+
+/*
+ * Implementation notes:
+ *
+ * - There are two kinds of sockets: those created by user action (such as
+ * calling socket(2)) and those created by incoming connection request packets.
+ *
+ * - There are two "global" tables, one for bound sockets (sockets that have
+ * specified an address that they are responsible for) and one for connected
+ * sockets (sockets that have established a connection with another socket).
+ * These tables are "global" in that all sockets on the system are placed
+ * within them. - Note, though, that the bound table contains an extra entry
+ * for a list of unbound sockets and SOCK_DGRAM sockets will always remain in
+ * that list. The bound table is used solely for lookup of sockets when packets
+ * are received and that's not necessary for SOCK_DGRAM sockets since we create
+ * a datagram handle for each and need not perform a lookup.  Keeping SOCK_DGRAM
+ * sockets out of the bound hash buckets will reduce the chance of collisions
+ * when looking for SOCK_STREAM sockets and prevents us from having to check the
+ * socket type in the hash table lookups.
+ *
+ * - Sockets created by user action will either be "client" sockets that
+ * initiate a connection or "server" sockets that listen for connections; we do
+ * not support simultaneous connects (two "client" sockets connecting).
+ *
+ * - "Server" sockets are referred to as listener sockets throughout this
+ * implementation because they are in the SS_LISTEN state.  When a connection
+ * request is received (the second kind of socket mentioned above), we create a
+ * new socket and refer to it as a pending socket.  These pending sockets are
+ * placed on the pending connection list of the listener socket.  When future
+ * packets are received for the address the listener socket is bound to, we
+ * check if the source of the packet is from one that has an existing pending
+ * connection.  If it does, we process the packet for the pending socket.  When
+ * that socket reaches the connected state, it is removed from the listener
+ * socket's pending list and enqueued in the listener socket's accept queue.
+ * Callers of accept(2) will accept connected sockets from the listener socket's
+ * accept queue.  If the socket cannot be accepted for some reason then it is
+ * marked rejected.  Once the connection is accepted, it is owned by the user
+ * process and the responsibility for cleanup falls with that user process.
+ *
+ * - It is possible that these pending sockets will never reach the connected
+ * state; in fact, we may never receive another packet after the connection
+ * request.  Because of this, we must schedule a cleanup function to run in the
+ * future, after some amount of time passes where a connection should have been
+ * established.  This function ensures that the socket is off all lists so it
+ * cannot be retrieved, then drops all references to the socket so it is cleaned
+ * up (sock_put() -> sk_free() -> our sk_destruct implementation).  Note this
+ * function will also cleanup rejected sockets, those that reach the connected
+ * state but leave it before they have been accepted.
+ *
+ * - Sockets created by user action will be cleaned up when the user process
+ * calls close(2), causing our release implementation to be called. Our release
+ * implementation will perform some cleanup then drop the last reference so our
+ * sk_destruct implementation is invoked.  Our sk_destruct implementation will
+ * perform additional cleanup that's common for both types of sockets.
+ *
+ * - A socket's reference count is what ensures that the structure won't be
+ * freed.  Each entry in a list (such as the "global" bound and connected tables
+ * and the listener socket's pending list and connected queue) ensures a
+ * reference.  When we defer work until process context and pass a socket as our
+ * argument, we must ensure the reference count is increased to ensure the
+ * socket isn't freed before the function is run; the deferred function will
+ * then drop the reference.
+ */
+
+#include <linux/types.h>
+
+#define EXPORT_SYMTAB
+#include <linux/kmod.h>
+#include <linux/socket.h>
+#include <linux/net.h>
+#include <linux/skbuff.h>
+#include <linux/miscdevice.h>
+#include <linux/poll.h>
+#include <linux/smp.h>
+#include <linux/bitops.h>
+#include <linux/list.h>
+#include <linux/wait.h>
+#include <linux/init.h>
+#include <linux/io.h>
+
+#include <linux/module.h>
+#include <linux/unistd.h>
+#include <linux/stddef.h>	/* for NULL */
+#include <net/sock.h>
+#include <linux/kernel.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+/*
+ * Include linux/cred.h via linux/sched.h - it is not nice, but as cpp does not
+ * have #ifexist...
+ */
+#include <linux/sched.h>
+
+#include "af_vsock.h"
+#include "stats.h"
+#include "util.h"
+#include "vsock_version.h"
+
+/*
+ * Prototypes
+ */
+
+/* Internal functions. */
+static bool vsock_vmci_proto_to_notify_struct(struct sock *sk,
+					      vsock_proto_version * proto,
+					      bool old_pkt_proto);
+static int vsock_vmci_recv_dgram_cb(void *data, struct vmci_datagram *dg);
+static int vsock_vmci_recv_stream_cb(void *data, struct vmci_datagram *dg);
+static void vsock_vmci_peer_attach_cb(vmci_id sub_id,
+				      const struct vmci_event_data *ed,
+				      void *client_data);
+static void vsock_vmci_peer_detach_cb(vmci_id sub_id,
+				      const struct vmci_event_data *ed,
+				      void *client_data);
+static void vsock_vmci_recv_pkt_work(struct work_struct *work);
+static int vsock_vmci_recv_listen(struct sock *sk, struct vsock_packet *pkt);
+static int vsock_vmci_recv_connecting_server(struct sock *sk,
+					     struct sock *pending,
+					     struct vsock_packet *pkt);
+static int vsock_vmci_recv_connecting_client(struct sock *sk,
+					     struct vsock_packet *pkt);
+static int vsock_vmci_recv_connecting_client_negotiate(struct sock *sk,
+						struct vsock_packet *pkt);
+static int vsock_vmci_recv_connecting_client_invalid(struct sock *sk,
+						     struct vsock_packet *pkt);
+static int vsock_vmci_recv_connected(struct sock *sk,
+				     struct vsock_packet *pkt);
+static int __vsock_vmci_bind(struct sock *sk, struct sockaddr_vm *addr);
+static struct sock *__vsock_vmci_create(struct net *net,
+					struct socket *sock,
+					struct sock *parent, gfp_t priority,
+					unsigned short type);
+static int vsock_vmci_register_with_vmci(void);
+static void vsock_vmci_unregister_with_vmci(void);
+
+/* Socket operations. */
+static void vsock_vmci_sk_destruct(struct sock *sk);
+static int vsock_vmci_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+static int vsock_vmci_release(struct socket *sock);
+static int vsock_vmci_bind(struct socket *sock,
+			   struct sockaddr *addr, int addr_len);
+static int vsock_vmci_dgram_connect(struct socket *sock,
+				    struct sockaddr *addr, int addr_len,
+				    int flags);
+static int vsock_vmci_stream_connect(struct socket *sock, struct sockaddr *addr,
+				     int addr_len, int flags);
+static int vsock_vmci_accept(struct socket *sock, struct socket *newsock,
+			     int flags);
+static int vsock_vmci_getname(struct socket *sock, struct sockaddr *addr,
+			      int *addr_len, int peer);
+static unsigned int vsock_vmci_poll(struct file *file, struct socket *sock,
+				    poll_table *wait);
+static int vsock_vmci_listen(struct socket *sock, int backlog);
+static int vsock_vmci_shutdown(struct socket *sock, int mode);
+
+static int vsock_vmci_stream_setsockopt(struct socket *sock, int level,
+					int optname, char __user *optval,
+					unsigned int optlen);
+
+static int vsock_vmci_stream_getsockopt(struct socket *sock, int level,
+					int optname, char __user *optval,
+					int __user *optlen);
+
+static int vsock_vmci_dgram_sendmsg(struct kiocb *kiocb,
+				    struct socket *sock, struct msghdr *msg,
+				    size_t len);
+static int vsock_vmci_dgram_recvmsg(struct kiocb *kiocb, struct socket *sock,
+				    struct msghdr *msg, size_t len, int flags);
+static int vsock_vmci_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
+				     struct msghdr *msg, size_t len);
+static int vsock_vmci_stream_recvmsg(struct kiocb *kiocb, struct socket *sock,
+				     struct msghdr *msg, size_t len, int flags);
+
+static int vsock_vmci_create(struct net *net,
+			     struct socket *sock, int protocol, int kern);
+
+/*
+ * Variables.
+ */
+
+/* Protocol family. */
+static struct proto vsock_vmci_proto = {
+	.name = "AF_VMCI",
+	.owner = THIS_MODULE,
+	.obj_size = sizeof(struct vsock_vmci_sock),
+};
+
+static const struct net_proto_family vsock_vmci_family_ops = {
+	.family = AF_VSOCK,
+	.create = vsock_vmci_create,
+	.owner = THIS_MODULE,
+};
+
+/* Socket operations, split for DGRAM and STREAM sockets. */
+static const struct proto_ops vsock_vmci_dgram_ops = {
+	.family = PF_VSOCK,
+	.owner = THIS_MODULE,
+	.release = vsock_vmci_release,
+	.bind = vsock_vmci_bind,
+	.connect = vsock_vmci_dgram_connect,
+	.socketpair = sock_no_socketpair,
+	.accept = sock_no_accept,
+	.getname = vsock_vmci_getname,
+	.poll = vsock_vmci_poll,
+	.ioctl = sock_no_ioctl,
+	.listen = sock_no_listen,
+	.shutdown = vsock_vmci_shutdown,
+	.setsockopt = sock_no_setsockopt,
+	.getsockopt = sock_no_getsockopt,
+	.sendmsg = vsock_vmci_dgram_sendmsg,
+	.recvmsg = vsock_vmci_dgram_recvmsg,
+	.mmap = sock_no_mmap,
+	.sendpage = sock_no_sendpage,
+};
+
+static const struct proto_ops vsock_vmci_stream_ops = {
+	.family = PF_VSOCK,
+	.owner = THIS_MODULE,
+	.release = vsock_vmci_release,
+	.bind = vsock_vmci_bind,
+	.connect = vsock_vmci_stream_connect,
+	.socketpair = sock_no_socketpair,
+	.accept = vsock_vmci_accept,
+	.getname = vsock_vmci_getname,
+	.poll = vsock_vmci_poll,
+	.ioctl = sock_no_ioctl,
+	.listen = vsock_vmci_listen,
+	.shutdown = vsock_vmci_shutdown,
+	.setsockopt = vsock_vmci_stream_setsockopt,
+	.getsockopt = vsock_vmci_stream_getsockopt,
+	.sendmsg = vsock_vmci_stream_sendmsg,
+	.recvmsg = vsock_vmci_stream_recvmsg,
+	.mmap = sock_no_mmap,
+	.sendpage = sock_no_sendpage,
+};
+
+struct vsock_recv_pkt_info {
+	struct work_struct work;
+	struct sock *sk;
+	struct vsock_packet pkt;
+};
+
+static struct vmci_handle vmci_stream_handle = { VMCI_INVALID_ID,
+						 VMCI_INVALID_ID };
+
+static vmci_id qp_resumed_sub_id = VMCI_INVALID_ID;
+
+static int PROTOCOL_OVERRIDE = -1;
+
+/*
+ * Netperf benchmarks have shown significant throughput improvements when the
+ * QP size is bumped from 64k to 256k. These measurements were taken during the
+ * K/L.next timeframe. Give users better performance by default.
+ */
+#define VSOCK_DEFAULT_QP_SIZE_MIN   128
+#define VSOCK_DEFAULT_QP_SIZE       262144
+#define VSOCK_DEFAULT_QP_SIZE_MAX   262144
+
+/*
+ * The default peer timeout indicates how long we will wait for a peer response
+ * to a control message.
+ */
+#define VSOCK_DEFAULT_CONNECT_TIMEOUT (2 * HZ)
+
+#define LOG_PACKET(_pkt)
+
+/**
+ * vsock_vmci_old_proto_override --
+ *
+ * Check to see if the user has asked us to override all sockets to use the
+ * vsock notify protocol.
+ *
+ * Results: true if there is a protocol override in effect. - old_pkt_proto is
+ * true the original protocol should be used. False if there is no override in
+ * effect.
+ *
+ * Side effects: None.
+ */
+
+static bool vsock_vmci_old_proto_override(bool *old_pkt_proto)
+{
+	if (PROTOCOL_OVERRIDE != -1) {
+		if (PROTOCOL_OVERRIDE == 0)
+			*old_pkt_proto = true;
+		else
+			*old_pkt_proto = false;
+
+		pr_info("Proto override in use.\n");
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * vsock_vmci_proto_to_notify_struct --
+ *
+ * Given a particular notify protocol version, setup the socket's notify struct
+ * correctly.
+ *
+ * Results: true on success, false otherwise.
+ *
+ * Side effects: None.
+ */
+
+static bool
+vsock_vmci_proto_to_notify_struct(struct sock *sk,
+				  vsock_proto_version *proto,
+				  bool old_pkt_proto)
+{
+	struct vsock_vmci_sock *vsk = vsock_sk(sk);
+
+	if (old_pkt_proto) {
+		if (*proto != VSOCK_PROTO_INVALID) {
+			pr_err("Can't set both an old and new protocol\n");
+			return false;
+		}
+		vsk->notify_ops = &vsock_vmci_notify_pkt_ops;
+		goto exit;
+	}
+
+	switch (*proto) {
+	case VSOCK_PROTO_PKT_ON_NOTIFY:
+		vsk->notify_ops = &vsock_vmci_notify_pkt_q_state_ops;
+		break;
+	default:
+		pr_err("Unknown notify protocol version\n");
+		return false;
+	}
+
+exit:
+	NOTIFYCALL(vsk, socket_init, sk);
+	return true;
+}
+
+/*
+ * vsock_vmci_new_proto_supported_versions
+ *
+ * Gets the supported REQUEST2/NEGOTIATE2 vsock protocol versions.
+ *
+ * Results: Either 1 specific protocol version (override mode) or
+ * VSOCK_PROTO_ALL_SUPPORTED.
+ *
+ * Side effects: None.
+ */
+
+static vsock_proto_version vsock_vmci_new_proto_supported_versions(void)
+{
+	if (PROTOCOL_OVERRIDE != -1)
+		return PROTOCOL_OVERRIDE;
+
+	return VSOCK_PROTO_ALL_SUPPORTED;
+}
+
+/*
+ * VSockSocket_Trusted --
+ *
+ * We allow two kinds of sockets to communicate with a restricted VM: 1)
+ * trusted sockets 2) sockets from applications running as the same user as the
+ * VM (this is only true for the host side and only when using hosted products)
+ *
+ * Results: true if trusted communication is allowed to peer_cid, false
+ * otherwise.
+ *
+ * Side effects: None.
+ */
+
+static bool vsock_vmci_trusted(struct vsock_vmci_sock *vsock, vmci_id peer_cid)
+{
+	return vsock->trusted ||
+	       vmci_is_context_owner(peer_cid, vsock->owner->uid);
+}
+
+/*
+ * VSockSocket_AllowDgram --
+ *
+ * We allow sending datagrams to and receiving datagrams from a restricted VM
+ * only if it is trusted as described in vsock_vmci_trusted.
+ *
+ * Results: true if datagram communication is allowed to peer_cid, false
+ * otherwise.
+ *
+ * Side effects: None.
+ */
+
+static bool vsock_vmci_allow_dgram(struct vsock_vmci_sock *vsock,
+				   vmci_id peer_cid)
+{
+	if (vsock->cached_peer != peer_cid) {
+		vsock->cached_peer = peer_cid;
+		if (!vsock_vmci_trusted(vsock, peer_cid) &&
+		    (vmci_context_get_priv_flags(peer_cid) &
+		     VMCI_PRIVILEGE_FLAG_RESTRICTED)) {
+			vsock->cached_peer_allow_dgram = false;
+		} else {
+			vsock->cached_peer_allow_dgram = true;
+		}
+	}
+
+	return vsock->cached_peer_allow_dgram;
+}
+
+/*
+ * vmci_sock_get_local_c_id --
+ *
+ * Kernel interface that allows external kernel modules to get the current VMCI
+ * context id. This version of the function is exported to kernel clients and
+ * should not change.
+ *
+ * Results: The context id on success, a negative error on failure.
+ *
+ * Side effects: None.
+ */
+
+int vmci_sock_get_local_c_id(void)
+{
+	/* FIXME: needed? */
+	return vmci_get_context_id();
+}
+EXPORT_SYMBOL_GPL(vmci_sock_get_local_c_id);
+
+/*
+ * Helper functions.
+ */
+
+/*
+ * vsock_vmci_queue_pair_alloc --
+ *
+ * Allocates or attaches to a queue pair. Tries to register with trusted status
+ * if requested but does not fail if the queuepair could not be allocate as
+ * trusted (running in the guest)
+ *
+ * Results: 0 on success. A VSock error on error.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_queue_pair_alloc(struct vmci_qp **qpair,
+			    struct vmci_handle *handle,
+			    u64 produce_size,
+			    u64 consume_size,
+			    vmci_id peer, u32 flags, bool trusted)
+{
+	int err = 0;
+
+	if (trusted) {
+		/*
+		 * Try to allocate our queue pair as trusted. This will only
+		 * work if vsock is running in the host.
+		 */
+
+		err = vmci_qpair_alloc(qpair, handle, produce_size,
+				       consume_size,
+				       peer, flags,
+				       VMCI_PRIVILEGE_FLAG_TRUSTED);
+		if (err != VMCI_ERROR_NO_ACCESS)
+			goto out;
+
+	}
+
+	err = vmci_qpair_alloc(qpair, handle, produce_size, consume_size,
+			       peer, flags, VMCI_NO_PRIVILEGE_FLAGS);
+out:
+	if (err < 0) {
+		pr_err("Could not attach to queue pair with %d\n",
+		       err);
+		err = vsock_vmci_error_to_vsock_error(err);
+	}
+
+	return err;
+}
+
+/*
+ * vsock_vmci_datagram_create_hnd --
+ *
+ * Creates a datagram handle. Tries to register with trusted status but does
+ * not fail if the handler could not be allocated as trusted (running in the
+ * guest).
+ *
+ * Results: 0 on success. A VMCI error on error.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_datagram_create_hnd(vmci_id resource_id,
+			       u32 flags,
+			       vmci_datagram_recv_cb recv_cb,
+			       void *client_data,
+			       struct vmci_handle *out_handle)
+{
+	int err = 0;
+
+	/*
+	 * Try to allocate our datagram handler as trusted. This will only work
+	 * if vsock is running in the host.
+	 */
+
+	err = vmci_datagram_create_handle_priv(resource_id, flags,
+					       VMCI_PRIVILEGE_FLAG_TRUSTED,
+					       recv_cb,
+					       client_data, out_handle);
+
+	if (err == VMCI_ERROR_NO_ACCESS)
+		err = vmci_datagram_create_handle(resource_id, flags,
+						  recv_cb, client_data,
+						  out_handle);
+
+	return err;
+}
+
+/*
+ * vsock_vmci_recv_dgram_cb --
+ *
+ * VMCI Datagram receive callback.  This function is used specifically for
+ * SOCK_DGRAM sockets.
+ *
+ * This is invoked as part of a tasklet that's scheduled when the VMCI
+ * interrupt fires.  This is run in bottom-half context and if it ever needs to
+ * sleep it should defer that work to a work queue.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: An sk_buff is created and queued with this socket.
+ */
+
+static int vsock_vmci_recv_dgram_cb(void *data, struct vmci_datagram *dg)
+{
+	struct sock *sk;
+	size_t size;
+	struct sk_buff *skb;
+	struct vsock_vmci_sock *vsk;
+
+	sk = (struct sock *)data;
+
+	/*
+	 * This handler is privileged when this module is running on the host.
+	 * We will get datagrams from all endpoints (even VMs that are in a
+	 * restricted context). If we get one from a restricted context then
+	 * the destination socket must be trusted.
+	 *
+	 * NOTE: We access the socket struct without holding the lock here.
+	 * This is ok because the field we are interested is never modified
+	 * outside of the create and destruct socket functions.
+	 */
+	vsk = vsock_sk(sk);
+	if (!vsock_vmci_allow_dgram(vsk, VMCI_HANDLE_TO_CONTEXT_ID(dg->src)))
+		return VMCI_ERROR_NO_ACCESS;
+
+	size = VMCI_DG_SIZE(dg);
+
+	/*
+	 * Attach the packet to the socket's receive queue as an sk_buff.
+	 */
+	skb = alloc_skb(size, GFP_ATOMIC);
+	if (skb) {
+		/* sk_receive_skb() will do a sock_put(), so hold here. */
+		sock_hold(sk);
+		skb_put(skb, size);
+		memcpy(skb->data, dg, size);
+		sk_receive_skb(sk, skb, 0);
+	}
+
+	return VMCI_SUCCESS;
+}
+
+/*
+ * vsock_vmci_recv_stream_cb --
+ *
+ * VMCI stream receive callback for control datagrams.  This function is used
+ * specifically for SOCK_STREAM sockets.
+ *
+ * This is invoked as part of a tasklet that's scheduled when the VMCI
+ * interrupt fires.  This is run in bottom-half context but it defers most of
+ * its work to the packet handling work queue.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int vsock_vmci_recv_stream_cb(void *data, struct vmci_datagram *dg)
+{
+	struct sock *sk;
+	struct sockaddr_vm dst;
+	struct sockaddr_vm src;
+	struct vsock_packet *pkt;
+	struct vsock_vmci_sock *vsk;
+	bool bh_process_pkt;
+	int err;
+
+	sk = NULL;
+	err = VMCI_SUCCESS;
+	bh_process_pkt = false;
+
+	/*
+	 * Ignore incoming packets from contexts without sockets, or resources
+	 * that aren't vsock implementations.
+	 */
+
+	if (!vsock_addr_socket_context_stream
+	    (VMCI_HANDLE_TO_CONTEXT_ID(dg->src))
+	    || VSOCK_PACKET_RID != VMCI_HANDLE_TO_RESOURCE_ID(dg->src)) {
+		return VMCI_ERROR_NO_ACCESS;
+	}
+
+	if (VMCI_DG_SIZE(dg) < sizeof *pkt)
+		/* Drop datagrams that do not contain full VSock packets. */
+		return VMCI_ERROR_INVALID_ARGS;
+
+	pkt = (struct vsock_packet *) dg;
+
+	LOG_PACKET(pkt);
+
+	/*
+	 * Find the socket that should handle this packet.  First we look for a
+	 * connected socket and if there is none we look for a socket bound to
+	 * the destintation address.
+	 */
+	vsock_addr_init(&src, VMCI_HANDLE_TO_CONTEXT_ID(pkt->dg.src),
+			pkt->src_port);
+
+	vsock_addr_init(&dst, VMCI_HANDLE_TO_CONTEXT_ID(pkt->dg.dst),
+			pkt->dst_port);
+
+	sk = vsock_vmci_find_connected_socket(&src, &dst);
+	if (!sk) {
+		sk = vsock_vmci_find_bound_socket(&dst);
+		if (!sk) {
+			/*
+			 * We could not find a socket for this specified
+			 * address.  If this packet is a RST, we just drop it.
+			 * If it is another packet, we send a RST.  Note that
+			 * we do not send a RST reply to RSTs so that we do not
+			 * continually send RSTs between two endpoints.
+			 *
+			 * Note that since this is a reply, dst is src and src
+			 * is dst.
+			 */
+			if (VSOCK_SEND_RESET_BH(&dst, &src, pkt) < 0)
+				pr_err("unable to send reset.\n");
+
+			err = VMCI_ERROR_NOT_FOUND;
+			goto out;
+		}
+	}
+
+	/*
+	 * If the received packet type is beyond all types known to this
+	 * implementation, reply with an invalid message.  Hopefully this will
+	 * help when implementing backwards compatibility in the future.
+	 */
+	if (pkt->type >= VSOCK_PACKET_TYPE_MAX) {
+		VSOCK_SEND_INVALID_BH(&dst, &src);
+		err = VMCI_ERROR_INVALID_ARGS;
+		goto out;
+	}
+
+	/*
+	 * This handler is privileged when this module is running on the host.
+	 * We will get datagram connect requests from all endpoints (even VMs
+	 * that are in a restricted context). If we get one from a restricted
+	 * context then the destination socket must be trusted.
+	 *
+	 * NOTE: We access the socket struct without holding the lock here.
+	 * This is ok because the field we are interested is never modified
+	 * outside of the create and destruct socket functions.
+	 */
+	vsk = vsock_sk(sk);
+	if (!vsock_vmci_allow_dgram
+	    (vsk, VMCI_HANDLE_TO_CONTEXT_ID(pkt->dg.src))) {
+		err = VMCI_ERROR_NO_ACCESS;
+		goto out;
+	}
+
+	/*
+	 * We do most everything in a work queue, but let's fast path the
+	 * notification of reads and writes to help data transfer performance.
+	 * We can only do this if there is no process context code executing
+	 * for this socket since that may change the state.
+	 */
+	bh_lock_sock(sk);
+
+	if (!sock_owned_by_user(sk) && sk->sk_state == SS_CONNECTED)
+		NOTIFYCALL(vsk, handle_notify_pkt, sk, pkt, true, &dst, &src,
+			   &bh_process_pkt);
+
+	bh_unlock_sock(sk);
+
+	if (!bh_process_pkt) {
+		struct vsock_recv_pkt_info *recv_pkt_info;
+
+		recv_pkt_info = kmalloc(sizeof *recv_pkt_info, GFP_ATOMIC);
+		if (!recv_pkt_info) {
+			if (VSOCK_SEND_RESET_BH(&dst, &src, pkt) < 0)
+				pr_err("unable to send reset\n");
+
+			err = VMCI_ERROR_NO_MEM;
+			goto out;
+		}
+
+		recv_pkt_info->sk = sk;
+		memcpy(&recv_pkt_info->pkt, pkt, sizeof recv_pkt_info->pkt);
+		INIT_WORK(&recv_pkt_info->work, vsock_vmci_recv_pkt_work);
+
+		schedule_work(&recv_pkt_info->work);
+		/*
+		 * Clear sk so that the reference count incremented by one of
+		 * the Find functions above is not decremented below.  We need
+		 * that reference count for the packet handler we've scheduled
+		 * to run.
+		 */
+		sk = NULL;
+	}
+
+out:
+	if (sk)
+		sock_put(sk);
+
+	return err;
+}
+
+/*
+ * vsock_vmci_peer_attach_cb --
+ *
+ * Invoked when a peer attaches to a queue pair.
+ *
+ * Right now this does not do anything.
+ *
+ * Results: None.
+ *
+ * Side effects: May modify socket state and signal socket.
+ */
+
+static void vsock_vmci_peer_attach_cb(vmci_id sub_id,
+				      const struct vmci_event_data *e_data,
+				      void *client_data)
+{
+	struct sock *sk = client_data;
+	const struct vmci_event_payload_qp *e_payload;
+	struct vsock_vmci_sock *vsk;
+
+	e_payload = vmci_event_data_const_payload(e_data);
+
+	vsk = vsock_sk(sk);
+
+	/*
+	 * We don't ask for delayed CBs when we subscribe to this event (we
+	 * pass 0 as flags to vmci_event_subscribe()).  VMCI makes no
+	 * guarantees in that case about what context we might be running in,
+	 * so it could be BH or process, blockable or non-blockable.  So we
+	 * need to account for all possible contexts here.
+	 */
+	local_bh_disable();
+	bh_lock_sock(sk);
+
+	/*
+	 * XXX This is lame, we should provide a way to lookup sockets by
+	 * qp_handle.
+	 */
+	if (VMCI_HANDLE_EQUAL(vsk->qp_handle, e_payload->handle)) {
+		/*
+		 * XXX This doesn't do anything, but in the future we may want
+		 * to set a flag here to verify the attach really did occur and
+		 * we weren't just sent a datagram claiming it was.
+		 */
+		goto out;
+	}
+
+out:
+	bh_unlock_sock(sk);
+	local_bh_enable();
+}
+
+/*
+ *
+ * vsock_vmci_handle_detach --
+ *
+ * Perform the work necessary when the peer has detached.
+ *
+ * Note that this assumes the socket lock is held.
+ *
+ * Results: None.
+ *
+ * Side effects: The socket's and its peer's shutdown mask will be set
+ * appropriately, and any callers waiting on this socket will be awoken.
+ */
+
+static void vsock_vmci_handle_detach(struct sock *sk)
+{
+	struct vsock_vmci_sock *vsk;
+
+	vsk = vsock_sk(sk);
+	if (!VMCI_HANDLE_INVALID(vsk->qp_handle)) {
+		sock_set_flag(sk, SOCK_DONE);
+
+		/*
+		 * On a detach the peer will not be sending or receiving
+		 * anymore.
+		 */
+		vsk->peer_shutdown = SHUTDOWN_MASK;
+
+		/*
+		 * We should not be sending anymore since the peer won't be
+		 * there to receive, but we can still receive if there is data
+		 * left in our consume queue.
+		 */
+		if (vsock_vmci_stream_has_data(vsk) <= 0) {
+			if (sk->sk_state == SS_CONNECTING) {
+				/*
+				 * The peer may detach from a queue pair while
+				 * we are still in the connecting state, i.e.,
+				 * if the peer VM is killed after attaching to
+				 * a queue pair, but before we complete the
+				 * handshake. In that case, we treat the detach
+				 * event like a reset.
+				 */
+
+				sk->sk_state = SS_UNCONNECTED;
+				sk->sk_err = ECONNRESET;
+				sk->sk_error_report(sk);
+				return;
+			}
+			sk->sk_state = SS_UNCONNECTED;
+		}
+		sk->sk_state_change(sk);
+	}
+}
+
+/*
+ * vsock_vmci_peer_detach_cb --
+ *
+ * Invoked when a peer detaches from a queue pair.
+ *
+ * Results: None.
+ *
+ * Side effects: May modify socket state and signal socket.
+ */
+
+static void vsock_vmci_peer_detach_cb(vmci_id sub_id,
+				      const struct vmci_event_data *e_data,
+				      void *client_data)
+{
+	struct sock *sk = client_data;
+	const struct vmci_event_payload_qp *e_payload;
+	struct vsock_vmci_sock *vsk;
+
+	e_payload = vmci_event_data_const_payload(e_data);
+	vsk = vsock_sk(sk);
+	if (VMCI_HANDLE_INVALID(e_payload->handle))
+		return;
+
+	/* Same rules for locking as for peer_attach_cb(). */
+	local_bh_disable();
+	bh_lock_sock(sk);
+
+	/*
+	 * XXX This is lame, we should provide a way to lookup sockets by
+	 * qp_handle.
+	 */
+	if (VMCI_HANDLE_EQUAL(vsk->qp_handle, e_payload->handle))
+		vsock_vmci_handle_detach(sk);
+
+	bh_unlock_sock(sk);
+	local_bh_enable();
+}
+
+/*
+ * vsock_vmci_qp_resumed_cb --
+ *
+ * Invoked when a VM is resumed.  We must mark all connected stream sockets as
+ * detached.
+ *
+ * Results: None.
+ *
+ * Side effects: May modify socket state and signal socket.
+ */
+
+static void vsock_vmci_qp_resumed_cb(vmci_id sub_id,
+				     const struct vmci_event_data *e_data,
+				     void *client_data)
+{
+	int i;
+
+	spin_lock_bh(&vsock_table_lock);
+
+	/*
+	 * XXX This loop should probably be provided by util.{h,c}, but that's
+	 * for another day.
+	 */
+	for (i = 0; i < ARRAY_SIZE(vsock_connected_table); i++) {
+		struct vsock_vmci_sock *vsk;
+
+		list_for_each_entry(vsk, &vsock_connected_table[i],
+				    connected_table) {
+			struct sock *sk = sk_vsock(vsk);
+
+			/*
+			 * XXX Technically this is racy but the resulting
+			 * outcome from such a race is relatively harmless.  My
+			 * next change will be a fix to this.
+			 */
+			vsock_vmci_handle_detach(sk);
+		}
+	}
+
+	spin_unlock_bh(&vsock_table_lock);
+}
+
+/*
+ * vsock_vmci_pending_work --
+ *
+ * Releases the resources for a pending socket if it has not reached the
+ * connected state and been accepted by a user process.
+ *
+ * Results: None.
+ *
+ * Side effects: The socket may be removed from the connected list and all its
+ * resources freed.
+ */
+
+static void vsock_vmci_pending_work(struct work_struct *work)
+{
+	struct sock *sk;
+	struct sock *listener;
+	struct vsock_vmci_sock *vsk;
+	bool cleanup;
+
+	vsk = container_of(work, struct vsock_vmci_sock, dwork.work);
+	sk = sk_vsock(vsk);
+	listener = vsk->listener;
+	cleanup = true;
+
+	lock_sock(listener);
+	lock_sock(sk);
+
+	if (vsock_vmci_is_pending(sk)) {
+		vsock_vmci_remove_pending(listener, sk);
+	} else if (!vsk->rejected) {
+		/*
+		 * We are not on the pending list and accept() did not reject
+		 * us, so we must have been accepted by our user process.  We
+		 * just need to drop our references to the sockets and be on
+		 * our way.
+		 */
+		cleanup = false;
+		goto out;
+	}
+
+	listener->sk_ack_backlog--;
+
+	/*
+	 * We need to remove ourself from the global connected sockets list so
+	 * incoming packets can't find this socket, and to reduce the reference
+	 * count.
+	 */
+	if (vsock_vmci_in_connected_table(sk))
+		vsock_vmci_remove_connected(sk);
+
+	sk->sk_state = SS_FREE;
+
+out:
+	release_sock(sk);
+	release_sock(listener);
+	if (cleanup)
+		sock_put(sk);
+
+	sock_put(sk);
+	sock_put(listener);
+}
+
+/*
+ * vsock_vmci_recv_pkt_work --
+ *
+ * Handles an incoming control packet for the provided socket.  This is the
+ * state machine for our stream sockets.
+ *
+ * Results: None.
+ *
+ * Side effects: May set state and wakeup threads waiting for socket state to
+ * change.
+ */
+
+static void vsock_vmci_recv_pkt_work(struct work_struct *work)
+{
+	struct vsock_recv_pkt_info *recv_pkt_info;
+	struct vsock_packet *pkt;
+	struct sock *sk;
+
+	recv_pkt_info = container_of(work, struct vsock_recv_pkt_info, work);
+	sk = recv_pkt_info->sk;
+	pkt = &recv_pkt_info->pkt;
+
+	lock_sock(sk);
+
+	switch (sk->sk_state) {
+	case SS_LISTEN:
+		vsock_vmci_recv_listen(sk, pkt);
+		break;
+	case SS_CONNECTING:
+		/*
+		 * Processing of pending connections for servers goes through
+		 * the listening socket, so see vsock_vmci_recv_listen() for
+		 * that path.
+		 */
+		vsock_vmci_recv_connecting_client(sk, pkt);
+		break;
+	case SS_CONNECTED:
+		vsock_vmci_recv_connected(sk, pkt);
+		break;
+	default:
+		/*
+		 * Because this function does not run in the same context as
+		 * vsock_vmci_recv_stream_cb it is possible that the socket has
+		 * closed. We need to let the other side know or it could be
+		 * sitting in a connect and hang forever. Send a reset to
+		 * prevent that.
+		 */
+		VSOCK_SEND_RESET(sk, pkt);
+		goto out;
+	}
+
+out:
+	release_sock(sk);
+	kfree(recv_pkt_info);
+	/*
+	 * Release reference obtained in the stream callback when we fetched
+	 * this socket out of the bound or connected list.
+	 */
+	sock_put(sk);
+}
+
+/*
+ * vsock_vmci_recv_listen --
+ *
+ * Receives packets for sockets in the listen state.
+ *
+ * Note that this assumes the socket lock is held.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: A new socket may be created and a negotiate control packet is
+ * sent.
+ */
+
+static int vsock_vmci_recv_listen(struct sock *sk,
+				  struct vsock_packet *pkt)
+{
+	struct sock *pending;
+	struct vsock_vmci_sock *vpending;
+	int err;
+	u64 qp_size;
+	bool old_request = false;
+	bool old_pkt_proto = false;
+
+	err = 0;
+
+	/*
+	 * Because we are in the listen state, we could be receiving a packet
+	 * for ourself or any previous connection requests that we received.
+	 * If it's the latter, we try to find a socket in our list of pending
+	 * connections and, if we do, call the appropriate handler for the
+	 * state that that socket is in.  Otherwise we try to service the
+	 * connection request.
+	 */
+	pending = vsock_vmci_get_pending(sk, pkt);
+	if (pending) {
+		lock_sock(pending);
+		switch (pending->sk_state) {
+		case SS_CONNECTING:
+			err =
+			    vsock_vmci_recv_connecting_server(sk, pending, pkt);
+			break;
+		default:
+			VSOCK_SEND_RESET(pending, pkt);
+			err = -EINVAL;
+		}
+
+		if (err < 0)
+			vsock_vmci_remove_pending(sk, pending);
+
+		release_sock(pending);
+		vsock_vmci_release_pending(pending);
+
+		return err;
+	}
+
+	/*
+	 * The listen state only accepts connection requests.  Reply with a
+	 * reset unless we received a reset.
+	 */
+
+	if (!(pkt->type == VSOCK_PACKET_TYPE_REQUEST ||
+	      pkt->type == VSOCK_PACKET_TYPE_REQUEST2)) {
+		VSOCK_REPLY_RESET(pkt);
+		return -EINVAL;
+	}
+
+	if (pkt->u.size == 0) {
+		VSOCK_REPLY_RESET(pkt);
+		return -EINVAL;
+	}
+
+	/*
+	 * If this socket can't accommodate this connection request, we send a
+	 * reset.  Otherwise we create and initialize a child socket and reply
+	 * with a connection negotiation.
+	 */
+	if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
+		VSOCK_REPLY_RESET(pkt);
+		return -ECONNREFUSED;
+	}
+
+	pending = __vsock_vmci_create(sock_net(sk), NULL, sk, GFP_KERNEL,
+				      sk->sk_type);
+	if (!pending) {
+		VSOCK_SEND_RESET(sk, pkt);
+		return -ENOMEM;
+	}
+
+	vpending = vsock_sk(pending);
+
+	vsock_addr_init(&vpending->local_addr,
+			VMCI_HANDLE_TO_CONTEXT_ID(pkt->dg.dst), pkt->dst_port);
+	vsock_addr_init(&vpending->remote_addr,
+			VMCI_HANDLE_TO_CONTEXT_ID(pkt->dg.src), pkt->src_port);
+
+	/*
+	 * If the proposed size fits within our min/max, accept it. Otherwise
+	 * propose our own size.
+	 */
+	if (pkt->u.size >= vpending->queue_pair_min_size &&
+	    pkt->u.size <= vpending->queue_pair_max_size) {
+		qp_size = pkt->u.size;
+	} else {
+		qp_size = vpending->queue_pair_size;
+	}
+
+	/*
+	 * Figure out if we are using old or new requests based on the
+	 * overrides pkt types sent by our peer.
+	 */
+	if (vsock_vmci_old_proto_override(&old_pkt_proto)) {
+		old_request = old_pkt_proto;
+	} else {
+		if (pkt->type == VSOCK_PACKET_TYPE_REQUEST)
+			old_request = true;
+		else if (pkt->type == VSOCK_PACKET_TYPE_REQUEST2)
+			old_request = false;
+
+	}
+
+	if (old_request) {
+		/* Handle a REQUEST (or override) */
+		vsock_proto_version version = VSOCK_PROTO_INVALID;
+		if (vsock_vmci_proto_to_notify_struct(pending, &version, true))
+			err = VSOCK_SEND_NEGOTIATE(pending, qp_size);
+		else
+			err = -EINVAL;
+
+	} else {
+		/* Handle a REQUEST2 (or override) */
+		int proto_int = pkt->proto;
+		int pos;
+		u16 active_proto_version = 0;
+
+		/*
+		 * The list of possible protocols is the intersection of all
+		 * protocols the client supports ... plus all the protocols we
+		 * support.
+		 */
+		proto_int &= vsock_vmci_new_proto_supported_versions();
+
+		/* We choose the highest possible protocol version and use that
+		 * one. */
+		pos = fls(proto_int);
+		if (pos) {
+			active_proto_version = (1 << (pos - 1));
+			if (vsock_vmci_proto_to_notify_struct
+			    (pending, &active_proto_version, false))
+				err =
+				    VSOCK_SEND_NEGOTIATE2(pending, qp_size,
+							  active_proto_version);
+			else
+				err = -EINVAL;
+
+		} else {
+			err = -EINVAL;
+		}
+	}
+
+	if (err < 0) {
+		VSOCK_SEND_RESET(sk, pkt);
+		sock_put(pending);
+		err = vsock_vmci_error_to_vsock_error(err);
+		goto out;
+	}
+
+	vsock_vmci_add_pending(sk, pending);
+	sk->sk_ack_backlog++;
+
+	pending->sk_state = SS_CONNECTING;
+	vpending->produce_size = vpending->consume_size = qp_size;
+	vpending->queue_pair_size = qp_size;
+
+	NOTIFYCALL(vpending, process_request, pending);
+
+	/*
+	 * We might never receive another message for this socket and it's not
+	 * connected to any process, so we have to ensure it gets cleaned up
+	 * ourself.  Our delayed work function will take care of that.  Note
+	 * that we do not ever cancel this function since we have few
+	 * guarantees about its state when calling cancel_delayed_work().
+	 * Instead we hold a reference on the socket for that function and make
+	 * it capable of handling cases where it needs to do nothing but
+	 * release that reference.
+	 */
+	vpending->listener = sk;
+	sock_hold(sk);
+	sock_hold(pending);
+	INIT_DELAYED_WORK(&vpending->dwork, vsock_vmci_pending_work);
+	schedule_delayed_work(&vpending->dwork, HZ);
+
+out:
+	return err;
+}
+
+/*
+ * vsock_vmci_recv_connecting_server --
+ *
+ * Receives packets for sockets in the connecting state on the server side.
+ *
+ * Connecting sockets on the server side can only receive queue pair offer
+ * packets.  All others should be treated as cause for closing the connection.
+ *
+ * Note that this assumes the socket lock is held for both sk and pending.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: A queue pair may be created, an attach control packet may be
+ * sent, the socket may transition to the connected state, and a pending caller
+ * in accept() may be woken up.
+ */
+
+static int
+vsock_vmci_recv_connecting_server(struct sock *listener,
+				  struct sock *pending,
+				  struct vsock_packet *pkt)
+{
+	struct vsock_vmci_sock *vpending;
+	struct vmci_handle handle;
+	struct vmci_qp *qpair;
+	bool is_local;
+	u32 flags;
+	vmci_id detach_sub_id;
+	int err;
+	int skerr;
+
+	vpending = vsock_sk(pending);
+	detach_sub_id = VMCI_INVALID_ID;
+
+	switch (pkt->type) {
+	case VSOCK_PACKET_TYPE_OFFER:
+		if (VMCI_HANDLE_INVALID(pkt->u.handle)) {
+			VSOCK_SEND_RESET(pending, pkt);
+			skerr = EPROTO;
+			err = -EINVAL;
+			goto destroy;
+		}
+		break;
+	default:
+		/* Close and cleanup the connection. */
+		VSOCK_SEND_RESET(pending, pkt);
+		skerr = EPROTO;
+		err = pkt->type == VSOCK_PACKET_TYPE_RST ? 0 : -EINVAL;
+		goto destroy;
+	}
+
+	/*
+	 * In order to complete the connection we need to attach to the offered
+	 * queue pair and send an attach notification.  We also subscribe to the
+	 * detach event so we know when our peer goes away, and we do that
+	 * before attaching so we don't miss an event.  If all this succeeds,
+	 * we update our state and wakeup anything waiting in accept() for a
+	 * connection.
+	 */
+
+	/*
+	 * We don't care about attach since we ensure the other side has
+	 * attached by specifying the ATTACH_ONLY flag below.
+	 */
+	err = vmci_event_subscribe(VMCI_EVENT_QP_PEER_DETACH,
+				   vsock_vmci_peer_detach_cb,
+				   pending, &detach_sub_id);
+	if (err < VMCI_SUCCESS) {
+		VSOCK_SEND_RESET(pending, pkt);
+		err = vsock_vmci_error_to_vsock_error(err);
+		skerr = -err;
+		goto destroy;
+	}
+
+	vpending->detach_sub_id = detach_sub_id;
+
+	/* Now attach to the queue pair the client created. */
+	handle = pkt->u.handle;
+
+	/*
+	 * vpending->local_addr always has a context id so we do not need to
+	 * worry about VMADDR_CID_ANY in this case.
+	 */
+	is_local =
+	    vpending->remote_addr.svm_cid == vpending->local_addr.svm_cid;
+	flags = VMCI_QPFLAG_ATTACH_ONLY;
+	flags |= is_local ? VMCI_QPFLAG_LOCAL : 0;
+
+	err = vsock_vmci_queue_pair_alloc(&qpair,
+					  &handle,
+					  vpending->produce_size,
+					  vpending->consume_size,
+					  VMCI_HANDLE_TO_CONTEXT_ID(pkt->
+								    dg.src),
+					  flags,
+					  vsock_vmci_trusted(
+						vpending,
+						vpending->remote_addr.svm_cid));
+	if (err < 0) {
+		VSOCK_SEND_RESET(pending, pkt);
+		skerr = -err;
+		goto destroy;
+	}
+
+	vpending->qp_handle = handle;
+	vpending->qpair = qpair;
+
+	/*
+	 * When we send the attach message, we must be ready to handle incoming
+	 * control messages on the newly connected socket. So we move the
+	 * pending socket to the connected state before sending the attach
+	 * message. Otherwise, an incoming packet triggered by the attach being
+	 * received by the peer may be processed concurrently with what happens
+	 * below after sending the attach message, and that incoming packet
+	 * will find the listening socket instead of the (currently) pending
+	 * socket. Note that enqueueing the socket increments the reference
+	 * count, so even if a reset comes before the connection is accepted,
+	 * the socket will be valid until it is removed from the queue.
+	 *
+	 * If we fail sending the attach below, we remove the socket from the
+	 * connected list and move the socket to SS_UNCONNECTED before
+	 * releasing the lock, so a pending slow path processing of an incoming
+	 * packet will not see the socket in the connected state in that case.
+	 */
+	pending->sk_state = SS_CONNECTED;
+
+	vsock_vmci_insert_connected(vsock_connected_sockets_vsk(vpending),
+				    pending);
+
+	/* Notify our peer of our attach. */
+	err = VSOCK_SEND_ATTACH(pending, handle);
+	if (err < 0) {
+		vsock_vmci_remove_connected(pending);
+		pr_err("Could not send attach\n");
+		VSOCK_SEND_RESET(pending, pkt);
+		err = vsock_vmci_error_to_vsock_error(err);
+		skerr = -err;
+		goto destroy;
+	}
+
+	/*
+	 * We have a connection. Move the now connected socket from the
+	 * listener's pending list to the accept queue so callers of accept()
+	 * can find it.
+	 */
+	vsock_vmci_remove_pending(listener, pending);
+	vsock_vmci_enqueue_accept(listener, pending);
+
+	/*
+	 * Callers of accept() will be be waiting on the listening socket, not
+	 * the pending socket.
+	 */
+	listener->sk_state_change(listener);
+
+	return 0;
+
+destroy:
+	pending->sk_err = skerr;
+	pending->sk_state = SS_UNCONNECTED;
+	/*
+	 * As long as we drop our reference, all necessary cleanup will handle
+	 * when the cleanup function drops its reference and our destruct
+	 * implementation is called.  Note that since the listen handler will
+	 * remove pending from the pending list upon our failure, the cleanup
+	 * function won't drop the additional reference, which is why we do it
+	 * here.
+	 */
+	sock_put(pending);
+
+	return err;
+}
+
+/*
+ * vsock_vmci_recv_connecting_client --
+ *
+ * Receives packets for sockets in the connecting state on the client side.
+ *
+ * Connecting sockets on the client side should only receive attach packets.
+ * All others should be treated as cause for closing the connection.
+ *
+ * Note that this assumes the socket lock is held for both sk and pending.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: The socket may transition to the connected state and wakeup
+ * the pending caller of connect().
+ */
+
+static int
+vsock_vmci_recv_connecting_client(struct sock *sk,
+				  struct vsock_packet *pkt)
+{
+	struct vsock_vmci_sock *vsk;
+	int err;
+	int skerr;
+
+	vsk = vsock_sk(sk);
+
+	switch (pkt->type) {
+	case VSOCK_PACKET_TYPE_ATTACH:
+		if (VMCI_HANDLE_INVALID(pkt->u.handle) ||
+		    !VMCI_HANDLE_EQUAL(pkt->u.handle, vsk->qp_handle)) {
+			skerr = EPROTO;
+			err = -EINVAL;
+			goto destroy;
+		}
+
+		/*
+		 * Signify the socket is connected and wakeup the waiter in
+		 * connect(). Also place the socket in the connected table for
+		 * accounting (it can already be found since it's in the bound
+		 * table).
+		 */
+		sk->sk_state = SS_CONNECTED;
+		sk->sk_socket->state = SS_CONNECTED;
+		vsock_vmci_insert_connected(vsock_connected_sockets_vsk(vsk),
+					    sk);
+		sk->sk_state_change(sk);
+
+		break;
+	case VSOCK_PACKET_TYPE_NEGOTIATE:
+	case VSOCK_PACKET_TYPE_NEGOTIATE2:
+		if (pkt->u.size == 0 ||
+		    VMCI_HANDLE_TO_CONTEXT_ID(pkt->dg.src) !=
+		    vsk->remote_addr.svm_cid
+		    || pkt->src_port != vsk->remote_addr.svm_port
+		    || !VMCI_HANDLE_INVALID(vsk->qp_handle) || vsk->qpair
+		    || vsk->produce_size != 0 || vsk->consume_size != 0
+		    || vsk->attach_sub_id != VMCI_INVALID_ID
+		    || vsk->detach_sub_id != VMCI_INVALID_ID) {
+			skerr = EPROTO;
+			err = -EINVAL;
+
+			goto destroy;
+		}
+
+		err = vsock_vmci_recv_connecting_client_negotiate(sk, pkt);
+		if (err) {
+			skerr = -err;
+			goto destroy;
+		}
+
+		break;
+	case VSOCK_PACKET_TYPE_INVALID:
+		err = vsock_vmci_recv_connecting_client_invalid(sk, pkt);
+		if (err) {
+			skerr = -err;
+			goto destroy;
+		}
+
+		break;
+	case VSOCK_PACKET_TYPE_RST:
+		/*
+		 * Older versions of the linux code (WS 6.5 / ESX 4.0) used to
+		 * continue processing here after they sent an INVALID packet.
+		 * This meant that we got a RST after the INVALID. We ignore a
+		 * RST after an INVALID. The common code doesn't send the RST
+		 * ... so we can hang if an old version of the common code
+		 * fails between getting a REQUEST and sending an OFFER back.
+		 * Not much we can do about it... except hope that it doesn't
+		 * happen.
+		 */
+		if (vsk->ignore_connecting_rst) {
+			vsk->ignore_connecting_rst = false;
+		} else {
+			skerr = ECONNRESET;
+			err = 0;
+			goto destroy;
+		}
+
+		break;
+	default:
+		/* Close and cleanup the connection. */
+		skerr = EPROTO;
+		err = -EINVAL;
+		goto destroy;
+	}
+
+	return 0;
+
+destroy:
+	VSOCK_SEND_RESET(sk, pkt);
+
+	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_err = skerr;
+	sk->sk_error_report(sk);
+	return err;
+}
+
+/*
+ * vsock_vmci_recv_connecting_client_negotiate --
+ *
+ * Handles a negotiate packet for a client in the connecting state.
+ *
+ * Note that this assumes the socket lock is held for both sk and pending.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: The socket may transition to the connected state and wakeup
+ * the pending caller of connect().
+ */
+
+static int
+vsock_vmci_recv_connecting_client_negotiate(struct sock *sk,
+					    struct vsock_packet *pkt)
+{
+	int err;
+	struct vsock_vmci_sock *vsk;
+	struct vmci_handle handle;
+	struct vmci_qp *qpair;
+	vmci_id attach_sub_id;
+	vmci_id detach_sub_id;
+	bool is_local;
+	u32 flags;
+	bool old_proto = true;
+	bool old_pkt_proto;
+	vsock_proto_version version;
+
+	vsk = vsock_sk(sk);
+	handle = VMCI_INVALID_HANDLE;
+	attach_sub_id = VMCI_INVALID_ID;
+	detach_sub_id = VMCI_INVALID_ID;
+
+	/*
+	 * If we have gotten here then we should be past the point where old
+	 * linux vsock could have sent the bogus rst.
+	 */
+	vsk->sent_request = false;
+	vsk->ignore_connecting_rst = false;
+
+	/* Verify that we're OK with the proposed queue pair size */
+	if (pkt->u.size < vsk->queue_pair_min_size ||
+	    pkt->u.size > vsk->queue_pair_max_size) {
+		err = -EINVAL;
+		goto destroy;
+	}
+
+	/*
+	 * At this point we know the CID the peer is using to talk to us.
+	 */
+
+	if (vsk->local_addr.svm_cid == VMADDR_CID_ANY)
+		vsk->local_addr.svm_cid =
+		    VMCI_HANDLE_TO_CONTEXT_ID(pkt->dg.dst);
+
+	/*
+	 * Setup the notify ops to be the highest supported version that both
+	 * the server and the client support.
+	 */
+
+	if (vsock_vmci_old_proto_override(&old_pkt_proto)) {
+		old_proto = old_pkt_proto;
+	} else {
+		if (pkt->type == VSOCK_PACKET_TYPE_NEGOTIATE)
+			old_proto = true;
+		else if (pkt->type == VSOCK_PACKET_TYPE_NEGOTIATE2)
+			old_proto = false;
+
+	}
+
+	if (old_proto)
+		version = VSOCK_PROTO_INVALID;
+	else
+		version = pkt->proto;
+
+	if (!vsock_vmci_proto_to_notify_struct(sk, &version, old_proto)) {
+		err = -EINVAL;
+		goto destroy;
+	}
+
+	/*
+	 * Subscribe to attach and detach events first.
+	 *
+	 * XXX We attach once for each queue pair created for now so it is easy
+	 * to find the socket (it's provided), but later we should only
+	 * subscribe once and add a way to lookup sockets by queue pair handle.
+	 */
+	err = vmci_event_subscribe(VMCI_EVENT_QP_PEER_ATTACH,
+				   vsock_vmci_peer_attach_cb,
+				   sk, &attach_sub_id);
+	if (err < VMCI_SUCCESS) {
+		err = vsock_vmci_error_to_vsock_error(err);
+		goto destroy;
+	}
+
+	err = vmci_event_subscribe(VMCI_EVENT_QP_PEER_DETACH,
+				   vsock_vmci_peer_detach_cb,
+				   sk, &detach_sub_id);
+	if (err < VMCI_SUCCESS) {
+		err = vsock_vmci_error_to_vsock_error(err);
+		goto destroy;
+	}
+
+	/* Make VMCI select the handle for us. */
+	handle = VMCI_INVALID_HANDLE;
+	is_local = vsk->remote_addr.svm_cid == vsk->local_addr.svm_cid;
+	flags = is_local ? VMCI_QPFLAG_LOCAL : 0;
+
+	err = vsock_vmci_queue_pair_alloc(&qpair,
+					  &handle,
+					  pkt->u.size,
+					  pkt->u.size,
+					  vsk->remote_addr.svm_cid,
+					  flags,
+					  vsock_vmci_trusted(
+						  vsk,
+						  vsk->
+						  remote_addr.svm_cid));
+	if (err < 0)
+		goto destroy;
+
+	err = VSOCK_SEND_QP_OFFER(sk, handle);
+	if (err < 0) {
+		err = vsock_vmci_error_to_vsock_error(err);
+		goto destroy;
+	}
+
+	vsk->qp_handle = handle;
+	vsk->qpair = qpair;
+
+	vsk->produce_size = vsk->consume_size = pkt->u.size;
+
+	vsk->attach_sub_id = attach_sub_id;
+	vsk->detach_sub_id = detach_sub_id;
+
+	NOTIFYCALL(vsk, process_negotiate, sk);
+
+	return 0;
+
+destroy:
+	if (attach_sub_id != VMCI_INVALID_ID)
+		vmci_event_unsubscribe(attach_sub_id);
+
+	if (detach_sub_id != VMCI_INVALID_ID)
+		vmci_event_unsubscribe(detach_sub_id);
+
+	if (!VMCI_HANDLE_INVALID(handle))
+		vmci_qpair_detach(&qpair);
+
+	return err;
+}
+
+/*
+ * vsock_vmci_recv_connecting_client_invalid --
+ *
+ * Handles an invalid packet for a client in the connecting state.
+ *
+ * Note that this assumes the socket lock is held for both sk and pending.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_recv_connecting_client_invalid(struct sock *sk,
+					  struct vsock_packet *pkt)
+{
+	int err = 0;
+	struct vsock_vmci_sock *vsk = vsock_sk(sk);
+
+	if (vsk->sent_request) {
+		vsk->sent_request = false;
+		vsk->ignore_connecting_rst = true;
+
+		err = VSOCK_SEND_CONN_REQUEST(sk, vsk->queue_pair_size);
+		if (err < 0)
+			err = vsock_vmci_error_to_vsock_error(err);
+		else
+			err = 0;
+
+	}
+
+	return err;
+}
+
+/*
+ * vsock_vmci_recv_connected --
+ *
+ * Receives packets for sockets in the connected state.
+ *
+ * Connected sockets should only ever receive detach, wrote, read, or reset
+ * control messages.  Others are treated as errors that are ignored.
+ *
+ * Wrote and read signify that the peer has produced or consumed, respectively.
+ *
+ * Detach messages signify that the connection is being closed cleanly and
+ * reset messages signify that the connection is being closed in error.
+ *
+ * Note that this assumes the socket lock is held.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: A queue pair may be created, an offer control packet sent, and
+ * the socket may transition to the connecting state.
+ *
+ */
+
+static int vsock_vmci_recv_connected(struct sock *sk,
+				     struct vsock_packet *pkt)
+{
+	struct vsock_vmci_sock *vsk;
+	bool pkt_processed = false;
+
+	/*
+	 * In cases where we are closing the connection, it's sufficient to
+	 * mark the state change (and maybe error) and wake up any waiting
+	 * threads. Since this is a connected socket, it's owned by a user
+	 * process and will be cleaned up when the failure is passed back on
+	 * the current or next system call.  Our system call implementations
+	 * must therefore check for error and state changes on entry and when
+	 * being awoken.
+	 */
+	switch (pkt->type) {
+	case VSOCK_PACKET_TYPE_SHUTDOWN:
+		if (pkt->u.mode) {
+			vsk = vsock_sk(sk);
+
+			vsk->peer_shutdown |= pkt->u.mode;
+			sk->sk_state_change(sk);
+		}
+		break;
+
+	case VSOCK_PACKET_TYPE_RST:
+		vsk = vsock_sk(sk);
+		/*
+		 * It is possible that we sent our peer a message (e.g a
+		 * WAITING_READ) right before we got notified that the peer had
+		 * detached. If that happens then we can get a RST pkt back
+		 * from our peer even though there is data available for us to
+		 * read. In that case, don't shutdown the socket completely but
+		 * instead allow the local client to finish reading data off
+		 * the queuepair. Always treat a RST pkt in connected mode like
+		 * a clean shutdown.
+		 */
+		sock_set_flag(sk, SOCK_DONE);
+		vsk->peer_shutdown = SHUTDOWN_MASK;
+		if (vsock_vmci_stream_has_data(vsk) <= 0)
+			sk->sk_state = SS_DISCONNECTING;
+
+		sk->sk_state_change(sk);
+		break;
+
+	default:
+		vsk = vsock_sk(sk);
+		NOTIFYCALL(vsk, handle_notify_pkt, sk, pkt, false, NULL, NULL,
+			   &pkt_processed);
+		if (!pkt_processed)
+			return -EINVAL;
+
+		break;
+	}
+
+	return 0;
+}
+
+/*
+ * __vsock_vmci_send_control_pkt --
+ *
+ * Common code to send a control packet.
+ *
+ * Results: Size of datagram sent on success, negative error code otherwise. If
+ * convert_error is true, error code is a vsock error, otherwise, result is a
+ * VMCI error code.
+ *
+ * Side effects: None.
+ */
+
+static int
+__vsock_vmci_send_control_pkt(struct vsock_packet *pkt,
+			      struct sockaddr_vm *src,
+			      struct sockaddr_vm *dst,
+			      enum vsock_packet_type type,
+			      u64 size,
+			      u64 mode,
+			      struct vsock_waiting_info *wait,
+			      vsock_proto_version proto,
+			      struct vmci_handle handle,
+			      bool convert_error)
+{
+	int err;
+
+	vsock_packet_init(pkt, src, dst, type, size, mode, wait, proto, handle);
+	LOG_PACKET(pkt);
+	VSOCK_STATS_CTLPKT_LOG(pkt->type);
+	err = vmci_datagram_send(&pkt->dg);
+	if (convert_error && (err < 0))
+		return vsock_vmci_error_to_vsock_error(err);
+
+	return err;
+}
+
+/*
+ * vsock_vmci_reply_control_pkt_fast --
+ *
+ * Sends a control packet back to the source of an incoming packet. The control
+ * packet is allocated in the stack.
+ *
+ * Results: Size of datagram sent on success, negative error code otherwise.
+ *
+ * Side effects: None.
+ */
+
+int
+vsock_vmci_reply_control_pkt_fast(struct vsock_packet *pkt,
+				  enum vsock_packet_type type,
+				  u64 size,
+				  u64 mode,
+				  struct vsock_waiting_info *wait,
+				  struct vmci_handle handle)
+{
+	struct vsock_packet reply;
+	struct sockaddr_vm src, dst;
+
+	if (pkt->type == VSOCK_PACKET_TYPE_RST) {
+		return 0;
+	} else {
+		vsock_packet_get_addresses(pkt, &src, &dst);
+		return __vsock_vmci_send_control_pkt(&reply, &src, &dst, type,
+						     size, mode, wait,
+						     VSOCK_PROTO_INVALID,
+						     handle, true);
+	}
+}
+
+/*
+ * vsock_vmci_send_control_pkt_bh --
+ *
+ * Sends a control packet from bottom-half context. The control packet is
+ * static data to minimize the resource cost.
+ *
+ * Results: Size of datagram sent on success, negative error code otherwise.
+ * Note that we return a VMCI error message since that's what callers will need
+ * to provide.
+ *
+ * Side effects: None.
+ */
+
+int
+vsock_vmci_send_control_pkt_bh(struct sockaddr_vm *src,
+			       struct sockaddr_vm *dst,
+			       enum vsock_packet_type type,
+			       u64 size,
+			       u64 mode,
+			       struct vsock_waiting_info *wait,
+			       struct vmci_handle handle)
+{
+	/*
+	 * Note that it is safe to use a single packet across all CPUs since
+	 * two tasklets of the same type are guaranteed to not ever run
+	 * simultaneously. If that ever changes, or VMCI stops using tasklets,
+	 * we can use per-cpu packets.
+	 */
+	static struct vsock_packet pkt;
+
+	return __vsock_vmci_send_control_pkt(&pkt, src, dst, type,
+					     size, mode, wait,
+					     VSOCK_PROTO_INVALID, handle,
+					     false);
+}
+
+/*
+ * vsock_vmci_send_control_pkt --
+ *
+ * Sends a control packet.
+ *
+ * Results: Size of datagram sent on success, negative error on failure.
+ *
+ * Side effects: None.
+ */
+
+int
+vsock_vmci_send_control_pkt(struct sock *sk,
+			    enum vsock_packet_type type,
+			    u64 size,
+			    u64 mode,
+			    struct vsock_waiting_info *wait,
+			    vsock_proto_version proto,
+			    struct vmci_handle handle)
+{
+	struct vsock_packet *pkt;
+	struct vsock_vmci_sock *vsk;
+	int err;
+
+	vsk = vsock_sk(sk);
+
+	if (!vsock_addr_bound(&vsk->local_addr))
+		return -EINVAL;
+
+	if (!vsock_addr_bound(&vsk->remote_addr))
+		return -EINVAL;
+
+	pkt = kmalloc(sizeof *pkt, GFP_KERNEL);
+	if (!pkt)
+		return -ENOMEM;
+
+	err =
+	    __vsock_vmci_send_control_pkt(pkt, &vsk->local_addr,
+					  &vsk->remote_addr, type, size, mode,
+					  wait, proto, handle, true);
+	kfree(pkt);
+
+	return err;
+}
+
+static int __vsock_vmci_bind_stream(struct vsock_vmci_sock *vsk,
+				    struct sockaddr_vm *addr)
+{
+	static u32 port = LAST_RESERVED_PORT + 1;
+	struct sockaddr_vm new_addr;
+
+	vsock_addr_init(&new_addr, addr->svm_cid, addr->svm_port);
+
+	if (addr->svm_port == VMADDR_PORT_ANY) {
+		bool found = false;
+		unsigned int i;
+
+		for (i = 0; i < MAX_PORT_RETRIES; i++) {
+			if (port <= LAST_RESERVED_PORT)
+				port = LAST_RESERVED_PORT + 1;
+
+			new_addr.svm_port = port++;
+
+			if (!__vsock_vmci_find_bound_socket(&new_addr)) {
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return -EADDRNOTAVAIL;
+	} else {
+		/*
+		 * If port is in reserved range, ensure caller
+		 * has necessary privileges.
+		 */
+		if (addr->svm_port <= LAST_RESERVED_PORT &&
+		    !capable(CAP_NET_BIND_SERVICE)) {
+			return -EACCES;
+		}
+
+		if (__vsock_vmci_find_bound_socket(&new_addr))
+			return -EADDRINUSE;
+	}
+
+	vsock_addr_init(&vsk->local_addr, new_addr.svm_cid, new_addr.svm_port);
+
+	/*
+	 * Remove stream sockets from the unbound list and add them to the hash
+	 * table for easy lookup by its address.  The unbound list is simply an
+	 * extra entry at the end of the hash table, a trick used by AF_UNIX.
+	 */
+	__vsock_vmci_remove_bound(&vsk->sk);
+	__vsock_vmci_insert_bound(vsock_bound_sockets(&vsk->local_addr),
+				  &vsk->sk);
+
+	return 0;
+}
+
+static int __vsock_vmci_bind_dgram(struct vsock_vmci_sock *vsk,
+				   struct sockaddr_vm *addr)
+{
+	u32 port;
+	u32 flags;
+	int err;
+
+	/*
+	 * VMCI will select a resource ID for us if we provide
+	 * VMCI_INVALID_ID.
+	 */
+	port = addr->svm_port == VMADDR_PORT_ANY ?
+			VMCI_INVALID_ID : addr->svm_port;
+
+	if (port <= LAST_RESERVED_PORT && !capable(CAP_NET_BIND_SERVICE))
+		return -EACCES;
+
+	flags = addr->svm_cid == VMADDR_CID_ANY ?
+				VMCI_FLAG_ANYCID_DG_HND : 0;
+
+	err = vsock_vmci_datagram_create_hnd(port, flags,
+					     vsock_vmci_recv_dgram_cb,
+					     &vsk->sk, &vsk->dg_handle);
+	if (err < VMCI_SUCCESS)
+		return vsock_vmci_error_to_vsock_error(err);
+
+	vsock_addr_init(&vsk->local_addr, addr->svm_cid,
+			VMCI_HANDLE_TO_RESOURCE_ID(vsk->dg_handle));
+
+	return 0;
+}
+
+/*
+ * __vsock_vmci_bind --
+ *
+ * Common functionality needed to bind the specified address to the VSocket.
+ * If VMADDR_CID_ANY or VMADDR_PORT_ANY are specified, the context ID or port
+ * are selected automatically.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: On success, a new datagram handle is created.
+ */
+
+static int __vsock_vmci_bind(struct sock *sk, struct sockaddr_vm *addr)
+{
+	struct vsock_vmci_sock *vsk = vsock_sk(sk);
+	vmci_id cid;
+	int retval;
+
+	/* First ensure this socket isn't already bound. */
+	if (vsock_addr_bound(&vsk->local_addr))
+		return -EINVAL;
+
+	/*
+	 * Now bind to the provided address or select appropriate values if
+	 * none are provided (VMADDR_CID_ANY and VMADDR_PORT_ANY).  Note that
+	 * like AF_INET prevents binding to a non-local IP address (in most
+	 * cases), we only allow binding to the local CID.
+	 */
+	cid = vmci_get_context_id();
+	if (addr->svm_cid != cid && addr->svm_cid != VMADDR_CID_ANY)
+		return -EADDRNOTAVAIL;
+
+	switch (sk->sk_socket->type) {
+	case SOCK_STREAM:
+		spin_lock_bh(&vsock_table_lock);
+		retval = __vsock_vmci_bind_stream(vsk, addr);
+		spin_unlock_bh(&vsock_table_lock);
+		break;
+
+	case SOCK_DGRAM:
+		retval = __vsock_vmci_bind_dgram(vsk, addr);
+		break;
+
+	default:
+		retval = -EINVAL;
+		break;
+	}
+
+	return retval;
+}
+
+/*
+ * __vsock_vmci_create --
+ *
+ * Does the work to create the sock structure. Note: If sock is NULL then the
+ * type field must be non-zero. Otherwise, sock is non-NULL and the type of
+ * sock is used in the newly created socket.
+ *
+ * Results: sock structure on success, NULL on failure.
+ *
+ * Side effects: Allocated sk is added to the unbound sockets list iff it is
+ * owned by a struct socket.
+ */
+
+static struct sock *__vsock_vmci_create(struct net *net,
+					struct socket *sock,
+					struct sock *parent,
+					gfp_t priority, unsigned short type)
+{
+	struct sock *sk;
+	struct vsock_vmci_sock *psk;
+	struct vsock_vmci_sock *vsk;
+
+	vsk = NULL;
+
+	sk = sk_alloc(net, vsock_vmci_family_ops.family, priority,
+		      &vsock_vmci_proto);
+	if (!sk)
+		return NULL;
+
+	sock_init_data(sock, sk);
+
+	/*
+	 * sk->sk_type is normally set in sock_init_data, but only if sock is
+	 * non-NULL. We make sure that our sockets always have a type by
+	 * setting it here if needed.
+	 */
+	if (!sock)
+		sk->sk_type = type;
+
+	vsk = vsock_sk(sk);
+	vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+	vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+
+	sk->sk_destruct = vsock_vmci_sk_destruct;
+	sk->sk_backlog_rcv = vsock_vmci_queue_rcv_skb;
+	sk->sk_state = 0;
+	sock_reset_flag(sk, SOCK_DONE);
+
+	INIT_LIST_HEAD(&vsk->bound_table);
+	INIT_LIST_HEAD(&vsk->connected_table);
+	vsk->dg_handle = VMCI_INVALID_HANDLE;
+	vsk->qp_handle = VMCI_INVALID_HANDLE;
+	vsk->qpair = NULL;
+	vsk->produce_size = vsk->consume_size = 0;
+	vsk->listener = NULL;
+	INIT_LIST_HEAD(&vsk->pending_links);
+	INIT_LIST_HEAD(&vsk->accept_queue);
+	vsk->rejected = false;
+	vsk->sent_request = false;
+	vsk->ignore_connecting_rst = false;
+	vsk->attach_sub_id = vsk->detach_sub_id = VMCI_INVALID_ID;
+	vsk->peer_shutdown = 0;
+
+	if (parent) {
+		psk = vsock_sk(parent);
+		vsk->trusted = psk->trusted;
+		vsk->owner = get_cred(psk->owner);
+		vsk->queue_pair_size = psk->queue_pair_size;
+		vsk->queue_pair_min_size = psk->queue_pair_min_size;
+		vsk->queue_pair_max_size = psk->queue_pair_max_size;
+		vsk->connect_timeout = psk->connect_timeout;
+	} else {
+		vsk->trusted = capable(CAP_NET_ADMIN);
+		vsk->owner = get_current_cred();
+		vsk->queue_pair_size = VSOCK_DEFAULT_QP_SIZE;
+		vsk->queue_pair_min_size = VSOCK_DEFAULT_QP_SIZE_MIN;
+		vsk->queue_pair_max_size = VSOCK_DEFAULT_QP_SIZE_MAX;
+		vsk->connect_timeout = VSOCK_DEFAULT_CONNECT_TIMEOUT;
+	}
+
+	vsk->notify_ops = NULL;
+
+	if (sock)
+		vsock_vmci_insert_bound(vsock_unbound_sockets, sk);
+
+	return sk;
+}
+
+/*
+ * __vsock_vmci_release --
+ *
+ * Releases the provided socket.
+ *
+ * Results: None.
+ *
+ * Side effects: Any pending sockets are also released.
+ */
+
+static void __vsock_vmci_release(struct sock *sk)
+{
+	if (sk) {
+		struct sk_buff *skb;
+		struct sock *pending;
+		struct vsock_vmci_sock *vsk;
+
+		vsk = vsock_sk(sk);
+		pending = NULL;	/* Compiler warning. */
+
+		if (vsock_vmci_in_bound_table(sk))
+			vsock_vmci_remove_bound(sk);
+
+		if (vsock_vmci_in_connected_table(sk))
+			vsock_vmci_remove_connected(sk);
+
+		if (!VMCI_HANDLE_INVALID(vsk->dg_handle)) {
+			vmci_datagram_destroy_handle(vsk->dg_handle);
+			vsk->dg_handle = VMCI_INVALID_HANDLE;
+		}
+
+		lock_sock(sk);
+		sock_orphan(sk);
+		sk->sk_shutdown = SHUTDOWN_MASK;
+
+		while ((skb = skb_dequeue(&sk->sk_receive_queue)))
+			kfree_skb(skb);
+
+		/* Clean up any sockets that never were accepted. */
+		while ((pending = vsock_vmci_dequeue_accept(sk)) != NULL) {
+			__vsock_vmci_release(pending);
+			sock_put(pending);
+		}
+
+		release_sock(sk);
+		sock_put(sk);
+	}
+}
+
+/*
+ * Sock operations.
+ */
+
+/*
+ * vsock_vmci_sk_destruct --
+ *
+ * Destroys the provided socket.  This is called by sk_free(), which is invoke
+ * when the reference count of the socket drops to zero.
+ *
+ * Results: None.
+ *
+ * Side effects: Socket count is decremented.
+ */
+
+static void vsock_vmci_sk_destruct(struct sock *sk)
+{
+	struct vsock_vmci_sock *vsk = vsock_sk(sk);
+
+	if (vsk->attach_sub_id != VMCI_INVALID_ID) {
+		vmci_event_unsubscribe(vsk->attach_sub_id);
+		vsk->attach_sub_id = VMCI_INVALID_ID;
+	}
+
+	if (vsk->detach_sub_id != VMCI_INVALID_ID) {
+		vmci_event_unsubscribe(vsk->detach_sub_id);
+		vsk->detach_sub_id = VMCI_INVALID_ID;
+	}
+
+	if (!VMCI_HANDLE_INVALID(vsk->qp_handle)) {
+		vmci_qpair_detach(&vsk->qpair);
+		vsk->qp_handle = VMCI_INVALID_HANDLE;
+		vsk->produce_size = vsk->consume_size = 0;
+	}
+
+	/*
+	 * When clearing these addresses, there's no need to set the family and
+	 * possibly register the address family with the kernel.
+	 */
+	vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+	vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+
+	NOTIFYCALL(vsk, socket_destruct, sk);
+
+	put_cred(vsk->owner);
+
+	VSOCK_STATS_CTLPKT_DUMP_ALL();
+	VSOCK_STATS_HIST_DUMP_ALL();
+	VSOCK_STATS_TOTALS_DUMP_ALL();
+}
+
+/*
+ * vsock_vmci_queue_rcv_skb --
+ *
+ * Receives skb on the socket's receive queue.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int vsock_vmci_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+	int err;
+
+	err = sock_queue_rcv_skb(sk, skb);
+	if (err)
+		kfree_skb(skb);
+
+	return err;
+}
+
+/*
+ * vsock_vmci_register_with_vmci --
+ *
+ * Registers with the VMCI device, and creates control message and event
+ * handlers.
+ *
+ * Results: Zero on success, error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int vsock_vmci_register_with_vmci(void)
+{
+	int err = 0;
+
+	/*
+	 * Create the datagram handle that we will use to send and receive all
+	 * VSocket control messages for this context.
+	 */
+	err = vsock_vmci_datagram_create_hnd(VSOCK_PACKET_RID,
+					     VMCI_FLAG_ANYCID_DG_HND,
+					     vsock_vmci_recv_stream_cb, NULL,
+					     &vmci_stream_handle);
+	if (err < VMCI_SUCCESS) {
+		pr_err("Unable to create datagram handle. (%d)\n",
+		       err);
+		err = vsock_vmci_error_to_vsock_error(err);
+		goto out;
+	}
+
+	err = vmci_event_subscribe(VMCI_EVENT_QP_RESUMED,
+				   vsock_vmci_qp_resumed_cb,
+				   NULL, &qp_resumed_sub_id);
+	if (err < VMCI_SUCCESS) {
+		pr_err("Unable to subscribe to resumed event. (%d)\n",
+		       err);
+		err = vsock_vmci_error_to_vsock_error(err);
+		qp_resumed_sub_id = VMCI_INVALID_ID;
+		goto out;
+	}
+
+out:
+	if (err != 0)
+		vsock_vmci_unregister_with_vmci();
+
+	return err;
+}
+
+/*
+ * vsock_vmci_unregister_with_vmci --
+ *
+ * Destroys control message and event handlers, and unregisters with the VMCI
+ * device
+ *
+ * Results: None.
+ *
+ * Side effects: Our socket implementation is no longer accessible.
+ */
+
+static void vsock_vmci_unregister_with_vmci(void)
+{
+	if (!VMCI_HANDLE_INVALID(vmci_stream_handle)) {
+		if (vmci_datagram_destroy_handle(vmci_stream_handle) !=
+		    VMCI_SUCCESS)
+			pr_err("Couldn't destroy datagram handle.\n");
+
+		vmci_stream_handle = VMCI_INVALID_HANDLE;
+	}
+
+	if (qp_resumed_sub_id != VMCI_INVALID_ID) {
+		vmci_event_unsubscribe(qp_resumed_sub_id);
+		qp_resumed_sub_id = VMCI_INVALID_ID;
+	}
+}
+
+/*
+ * vsock_vmci_stream_has_data --
+ *
+ * Gets the amount of data available for a given stream socket's consume queue.
+ *
+ * Note that this assumes the socket lock is held.
+ *
+ * Results: The amount of data available or a VMCI error code on failure.
+ *
+ * Side effects: None.
+ */
+
+s64 vsock_vmci_stream_has_data(struct vsock_vmci_sock *vsk)
+{
+	return vmci_qpair_consume_buf_ready(vsk->qpair);
+}
+
+/*
+ * vsock_vmci_stream_has_space --
+ *
+ * Gets the amount of space available for a give stream socket's produce queue.
+ *
+ * Note that this assumes the socket lock is held.
+ *
+ * Results: The amount of space available or a VMCI error code on failure.
+ *
+ * Side effects: None.
+ */
+
+s64 vsock_vmci_stream_has_space(struct vsock_vmci_sock *vsk)
+{
+	return vmci_qpair_produce_free_space(vsk->qpair);
+}
+
+/*
+ * Socket operations.
+ */
+
+/*
+ *
+ * vsock_vmci_release --
+ *
+ * Releases the provided socket by freeing the contents of its queue.  This is
+ * called when a user process calls close(2) on the socket.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int vsock_vmci_release(struct socket *sock)
+{
+	__vsock_vmci_release(sock->sk);
+	sock->sk = NULL;
+	sock->state = SS_FREE;
+
+	return 0;
+}
+
+/*
+ * vsock_vmci_bind --
+ *
+ * Binds the provided address to the provided socket.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+	int err;
+	struct sock *sk;
+	struct sockaddr_vm *vmci_addr;
+
+	sk = sock->sk;
+
+	if (vsock_addr_cast(addr, addr_len, &vmci_addr) != 0)
+		return -EINVAL;
+
+	lock_sock(sk);
+	err = __vsock_vmci_bind(sk, vmci_addr);
+	release_sock(sk);
+
+	return err;
+}
+
+/*
+ * vsock_vmci_dgram_connect --
+ *
+ * Connects a datagram socket.  This can be called multiple times to change the
+ * socket's association and can be called with a sockaddr whose family is set
+ * to AF_UNSPEC to dissolve any existing association.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_dgram_connect(struct socket *sock,
+			 struct sockaddr *addr, int addr_len, int flags)
+{
+	int err;
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+	struct sockaddr_vm *remote_addr;
+
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+
+	err = vsock_addr_cast(addr, addr_len, &remote_addr);
+	if (err == -EAFNOSUPPORT && remote_addr->svm_family == AF_UNSPEC) {
+		lock_sock(sk);
+		vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY,
+				VMADDR_PORT_ANY);
+		sock->state = SS_UNCONNECTED;
+		release_sock(sk);
+		return 0;
+	} else if (err != 0)
+		return -EINVAL;
+
+	lock_sock(sk);
+
+	if (!vsock_addr_bound(&vsk->local_addr)) {
+		struct sockaddr_vm local_addr;
+
+		vsock_addr_init(&local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+		err = __vsock_vmci_bind(sk, &local_addr);
+		if (err != 0)
+			goto out;
+
+	}
+
+	if (!vsock_addr_socket_context_dgram(remote_addr->svm_cid,
+					     remote_addr->svm_port)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	memcpy(&vsk->remote_addr, remote_addr, sizeof vsk->remote_addr);
+	sock->state = SS_CONNECTED;
+
+out:
+	release_sock(sk);
+	return err;
+}
+
+/*
+ * vsock_vmci_connect_timeout --
+ *
+ * Asynchronous connection attempts schedule this timeout function to notify
+ * the connector of an unsuccessfull connection attempt. If the socket is still
+ * in the connecting state and hasn't been closed, we mark the socket as timed
+ * out. Otherwise, we do nothing.
+ *
+ * Results: None.
+ *
+ * Side effects: May destroy the socket.
+ */
+
+static void vsock_vmci_connect_timeout(struct work_struct *work)
+{
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+
+	vsk = container_of(work, struct vsock_vmci_sock, dwork.work);
+	sk = sk_vsock(vsk);
+
+	lock_sock(sk);
+	if (sk->sk_state == SS_CONNECTING &&
+	    (sk->sk_shutdown != SHUTDOWN_MASK)) {
+		sk->sk_state = SS_UNCONNECTED;
+		sk->sk_err = ETIMEDOUT;
+		sk->sk_error_report(sk);
+	}
+	release_sock(sk);
+
+	sock_put(sk);
+}
+
+/*
+ * vsock_vmci_stream_connect --
+ *
+ * Connects a stream socket.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_stream_connect(struct socket *sock,
+			  struct sockaddr *addr, int addr_len, int flags)
+{
+	int err;
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+	struct sockaddr_vm *remote_addr;
+	long timeout;
+	bool old_pkt_proto = false;
+	DEFINE_WAIT(wait);
+
+	err = 0;
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+
+	lock_sock(sk);
+
+	/* XXX AF_UNSPEC should make us disconnect like AF_INET. */
+	switch (sock->state) {
+	case SS_CONNECTED:
+		err = -EISCONN;
+		goto out;
+	case SS_DISCONNECTING:
+		err = -EINVAL;
+		goto out;
+	case SS_CONNECTING:
+		/*
+		 * This continues on so we can move sock into the SS_CONNECTED
+		 * state once the connection has completed (at which point err
+		 * will be set to zero also).  Otherwise, we will either wait
+		 * for the connection or return -EALREADY should this be a
+		 * non-blocking call.
+		 */
+		err = -EALREADY;
+		break;
+	default:
+		if ((sk->sk_state == SS_LISTEN) ||
+		    vsock_addr_cast(addr, addr_len, &remote_addr) != 0) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		/*
+		 * The hypervisor and well-known contexts do not have socket
+		 * endpoints.
+		 */
+		if (!vsock_addr_socket_context_stream(remote_addr->svm_cid)) {
+			err = -ENETUNREACH;
+			goto out;
+		}
+
+		/* Set the remote address that we are connecting to. */
+		memcpy(&vsk->remote_addr, remote_addr, sizeof vsk->remote_addr);
+
+		/* Autobind this socket to the local address if necessary. */
+		if (!vsock_addr_bound(&vsk->local_addr)) {
+			struct sockaddr_vm local_addr;
+
+			vsock_addr_init(&local_addr, VMADDR_CID_ANY,
+					VMADDR_PORT_ANY);
+			err = __vsock_vmci_bind(sk, &local_addr);
+			if (err != 0)
+				goto out;
+
+		}
+
+		sk->sk_state = SS_CONNECTING;
+
+		if (vsock_vmci_old_proto_override(&old_pkt_proto)
+		    && old_pkt_proto) {
+			err = VSOCK_SEND_CONN_REQUEST(sk, vsk->queue_pair_size);
+			if (err < 0) {
+				sk->sk_state = SS_UNCONNECTED;
+				goto out;
+			}
+		} else {
+			int supported_proto_versions =
+			    vsock_vmci_new_proto_supported_versions();
+			err =
+			    VSOCK_SEND_CONN_REQUEST2(sk, vsk->queue_pair_size,
+						     supported_proto_versions);
+			if (err < 0) {
+				sk->sk_state = SS_UNCONNECTED;
+				goto out;
+			}
+
+			vsk->sent_request = true;
+		}
+
+		/*
+		 * Mark sock as connecting and set the error code to in
+		 * progress in case this is a non-blocking connect.
+		 */
+		sock->state = SS_CONNECTING;
+		err = -EINPROGRESS;
+	}
+
+	/*
+	 * The receive path will handle all communication until we are able to
+	 * enter the connected state.  Here we wait for the connection to be
+	 * completed or a notification of an error.
+	 */
+	timeout = vsk->connect_timeout;
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+
+	while (sk->sk_state != SS_CONNECTED && sk->sk_err == 0) {
+		if (flags & O_NONBLOCK) {
+			/*
+			 * If we're not going to block, we schedule a timeout
+			 * function to generate a timeout on the connection
+			 * attempt, in case the peer doesn't respond in a
+			 * timely manner. We hold on to the socket until the
+			 * timeout fires.
+			 */
+			sock_hold(sk);
+			INIT_DELAYED_WORK(&vsk->dwork,
+					  vsock_vmci_connect_timeout);
+			schedule_delayed_work(&vsk->dwork, timeout);
+
+			/* Skip ahead to preserve error code set above. */
+			goto out_wait;
+		}
+
+		release_sock(sk);
+		timeout = schedule_timeout(timeout);
+		lock_sock(sk);
+
+		if (signal_pending(current)) {
+			err = sock_intr_errno(timeout);
+			goto out_wait_error;
+		} else if (timeout == 0) {
+			err = -ETIMEDOUT;
+			goto out_wait_error;
+		}
+
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+	}
+
+	if (sk->sk_err) {
+		err = -sk->sk_err;
+		goto out_wait_error;
+	} else
+		err = 0;
+
+out_wait:
+	finish_wait(sk_sleep(sk), &wait);
+out:
+	release_sock(sk);
+	return err;
+
+out_wait_error:
+	sk->sk_state = SS_UNCONNECTED;
+	sock->state = SS_UNCONNECTED;
+	goto out_wait;
+}
+
+/*
+ * vsock_vmci_accept --
+ *
+ * Accepts next available connection request for this socket.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_accept(struct socket *sock, struct socket *newsock, int flags)
+{
+	struct sock *listener;
+	int err;
+	struct sock *connected;
+	struct vsock_vmci_sock *vconnected;
+	long timeout;
+	DEFINE_WAIT(wait);
+
+	err = 0;
+	listener = sock->sk;
+
+	lock_sock(listener);
+
+	if (sock->type != SOCK_STREAM) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (listener->sk_state != SS_LISTEN) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Wait for children sockets to appear; these are the new sockets
+	 * created upon connection establishment.
+	 */
+	timeout = sock_sndtimeo(listener, flags & O_NONBLOCK);
+	prepare_to_wait(sk_sleep(listener), &wait, TASK_INTERRUPTIBLE);
+
+	while ((connected = vsock_vmci_dequeue_accept(listener)) == NULL &&
+	       listener->sk_err == 0) {
+		release_sock(listener);
+		timeout = schedule_timeout(timeout);
+		lock_sock(listener);
+
+		if (signal_pending(current)) {
+			err = sock_intr_errno(timeout);
+			goto out_wait;
+		} else if (timeout == 0) {
+			err = -EAGAIN;
+			goto out_wait;
+		}
+
+		prepare_to_wait(sk_sleep(listener), &wait, TASK_INTERRUPTIBLE);
+	}
+
+	if (listener->sk_err)
+		err = -listener->sk_err;
+
+	if (connected) {
+		listener->sk_ack_backlog--;
+
+		lock_sock(connected);
+		vconnected = vsock_sk(connected);
+
+		/*
+		 * If the listener socket has received an error, then we should
+		 * reject this socket and return.  Note that we simply mark the
+		 * socket rejected, drop our reference, and let the cleanup
+		 * function handle the cleanup; the fact that we found it in
+		 * the listener's accept queue guarantees that the cleanup
+		 * function hasn't run yet.
+		 */
+		if (err) {
+			vconnected->rejected = true;
+			release_sock(connected);
+			sock_put(connected);
+			goto out_wait;
+		}
+
+		newsock->state = SS_CONNECTED;
+		sock_graft(connected, newsock);
+		release_sock(connected);
+		sock_put(connected);
+	}
+
+out_wait:
+	finish_wait(sk_sleep(listener), &wait);
+out:
+	release_sock(listener);
+	return err;
+}
+
+/*
+ * vsock_vmci_getname --
+ *
+ * Provides the local or remote address for the socket.
+ *
+ * Results: Zero on success, negative error code otherwise.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_getname(struct socket *sock,
+		   struct sockaddr *addr, int *addr_len, int peer)
+{
+	int err;
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+	struct sockaddr_vm *vmci_addr;
+
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+	err = 0;
+
+	lock_sock(sk);
+
+	if (peer) {
+		if (sock->state != SS_CONNECTED) {
+			err = -ENOTCONN;
+			goto out;
+		}
+		vmci_addr = &vsk->remote_addr;
+	} else {
+		vmci_addr = &vsk->local_addr;
+	}
+
+	if (!vmci_addr) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * sys_getsockname() and sys_getpeername() pass us a
+	 * MAX_SOCK_ADDR-sized buffer and don't set addr_len.  Unfortunately
+	 * that macro is defined in socket.c instead of .h, so we hardcode its
+	 * value here.
+	 */
+	BUILD_BUG_ON(sizeof *vmci_addr > 128);
+	memcpy(addr, vmci_addr, sizeof *vmci_addr);
+	*addr_len = sizeof *vmci_addr;
+
+out:
+	release_sock(sk);
+	return err;
+}
+
+/*
+ * vsock_vmci_poll --
+ *
+ * Waits on file for activity then provides mask indicating state of socket.
+ *
+ * Results: Mask of flags containing socket state.
+ *
+ * Side effects: None.
+ */
+
+static unsigned int
+vsock_vmci_poll(struct file *file, struct socket *sock, poll_table *wait)
+{
+	struct sock *sk;
+	unsigned int mask;
+	struct vsock_vmci_sock *vsk;
+
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+
+	poll_wait(file, sk_sleep(sk), wait);
+	mask = 0;
+
+	if (sk->sk_err)
+		/* Signify that there has been an error on this socket. */
+		mask |= POLLERR;
+
+	/*
+	 * INET sockets treat local write shutdown and peer write shutdown as a
+	 * case of POLLHUP set.
+	 */
+	if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
+	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
+	     (vsk->peer_shutdown & SEND_SHUTDOWN))) {
+		mask |= POLLHUP;
+	}
+
+	if (sk->sk_shutdown & RCV_SHUTDOWN ||
+	    vsk->peer_shutdown & SEND_SHUTDOWN) {
+		mask |= POLLRDHUP;
+	}
+
+	if (sock->type == SOCK_DGRAM) {
+		/*
+		 * For datagram sockets we can read if there is something in
+		 * the queue and write as long as the socket isn't shutdown for
+		 * sending.
+		 */
+		if (!skb_queue_empty(&sk->sk_receive_queue) ||
+		    (sk->sk_shutdown & RCV_SHUTDOWN)) {
+			mask |= POLLIN | POLLRDNORM;
+		}
+
+		if (!(sk->sk_shutdown & SEND_SHUTDOWN))
+			mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
+
+	} else if (sock->type == SOCK_STREAM) {
+		lock_sock(sk);
+
+		/*
+		 * Listening sockets that have connections in their accept
+		 * queue can be read.
+		 */
+		if (sk->sk_state == SS_LISTEN
+		    && !vsock_vmci_is_accept_queue_empty(sk))
+			mask |= POLLIN | POLLRDNORM;
+
+		/*
+		 * If there is something in the queue then we can read.
+		 */
+		if (!VMCI_HANDLE_INVALID(vsk->qp_handle) &&
+		    !(sk->sk_shutdown & RCV_SHUTDOWN)) {
+			bool data_ready_now = false;
+			int ret = 0;
+			NOTIFYCALLRET(vsk, ret, poll_in, sk, 1,
+				      &data_ready_now);
+			if (ret < 0) {
+				mask |= POLLERR;
+			} else {
+				if (data_ready_now)
+					mask |= POLLIN | POLLRDNORM;
+
+			}
+		}
+
+		/*
+		 * Sockets whose connections have been closed, reset, or
+		 * terminated should also be considered read, and we check the
+		 * shutdown flag for that.
+		 */
+		if (sk->sk_shutdown & RCV_SHUTDOWN ||
+		    vsk->peer_shutdown & SEND_SHUTDOWN) {
+			mask |= POLLIN | POLLRDNORM;
+		}
+
+		/*
+		 * Connected sockets that can produce data can be written.
+		 */
+		if (sk->sk_state == SS_CONNECTED) {
+			if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
+				bool space_avail_now = false;
+				int ret = 0;
+
+				NOTIFYCALLRET(vsk, ret, poll_out, sk, 1,
+					      &space_avail_now);
+				if (ret < 0) {
+					mask |= POLLERR;
+				} else {
+					if (space_avail_now)
+						/*
+						 * Remove POLLWRBAND since INET
+						 * sockets are not setting it.
+						 */
+						mask |= POLLOUT | POLLWRNORM;
+
+				}
+			}
+		}
+
+		/*
+		 * Simulate INET socket poll behaviors, which sets
+		 * POLLOUT|POLLWRNORM when peer is closed and nothing to read,
+		 * but local send is not shutdown.
+		 */
+		if (sk->sk_state == SS_UNCONNECTED) {
+			if (!(sk->sk_shutdown & SEND_SHUTDOWN))
+				mask |= POLLOUT | POLLWRNORM;
+
+		}
+
+		release_sock(sk);
+	}
+
+	return mask;
+}
+
+/*
+ * vsock_vmci_listen --
+ *
+ * Signify that this socket is listening for connection requests.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int vsock_vmci_listen(struct socket *sock, int backlog)
+{
+	int err;
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+
+	sk = sock->sk;
+
+	lock_sock(sk);
+
+	if (sock->type != SOCK_STREAM) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (sock->state != SS_UNCONNECTED) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	vsk = vsock_sk(sk);
+
+	if (!vsock_addr_bound(&vsk->local_addr)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	sk->sk_max_ack_backlog = backlog;
+	sk->sk_state = SS_LISTEN;
+
+	err = 0;
+
+out:
+	release_sock(sk);
+	return err;
+}
+
+/*
+ * vsock_vmci_shutdown --
+ *
+ * Shuts down the provided socket in the provided method.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int vsock_vmci_shutdown(struct socket *sock, int mode)
+{
+	int err;
+	struct sock *sk;
+
+	/*
+	 * User level uses SHUT_RD (0) and SHUT_WR (1), but the kernel uses
+	 * RCV_SHUTDOWN (1) and SEND_SHUTDOWN (2), so we must increment mode
+	 * here like the other address families do.  Note also that the
+	 * increment makes SHUT_RDWR (2) into RCV_SHUTDOWN | SEND_SHUTDOWN (3),
+	 * which is what we want.
+	 */
+	mode++;
+
+	if ((mode & ~SHUTDOWN_MASK) || !mode)
+		return -EINVAL;
+
+	/*
+	 * If this is a STREAM socket and it is not connected then bail out
+	 * immediately.  If it is a DGRAM socket then we must first kick the
+	 * socket so that it wakes up from any sleeping calls, for example
+	 * recv(), and then afterwards return the error.
+	 */
+
+	sk = sock->sk;
+	if (sock->state == SS_UNCONNECTED) {
+		err = -ENOTCONN;
+		if (sk->sk_type == SOCK_STREAM)
+			return err;
+	} else {
+		sock->state = SS_DISCONNECTING;
+		err = 0;
+	}
+
+	/* Receive and send shutdowns are treated alike. */
+	mode = mode & (RCV_SHUTDOWN | SEND_SHUTDOWN);
+	if (mode) {
+		lock_sock(sk);
+		sk->sk_shutdown |= mode;
+		sk->sk_state_change(sk);
+		release_sock(sk);
+
+		if (sk->sk_type == SOCK_STREAM) {
+			sock_reset_flag(sk, SOCK_DONE);
+			VSOCK_SEND_SHUTDOWN(sk, mode);
+		}
+	}
+
+	return err;
+}
+
+/*
+ * vsock_vmci_dgram_sendmsg --
+ *
+ * Sends a datagram.
+ *
+ * Results: Number of bytes sent on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_dgram_sendmsg(struct kiocb *kiocb,
+			 struct socket *sock, struct msghdr *msg, size_t len)
+{
+	int err;
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+	struct sockaddr_vm *remote_addr;
+	struct vmci_datagram *dg;
+
+	if (msg->msg_flags & MSG_OOB)
+		return -EOPNOTSUPP;
+
+	if (len > VMCI_MAX_DG_PAYLOAD_SIZE)
+		return -EMSGSIZE;
+
+	/* For now, MSG_DONTWAIT is always assumed... */
+	err = 0;
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+
+	lock_sock(sk);
+
+	if (!vsock_addr_bound(&vsk->local_addr)) {
+		struct sockaddr_vm local_addr;
+
+		vsock_addr_init(&local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+		err = __vsock_vmci_bind(sk, &local_addr);
+		if (err != 0)
+			goto out;
+
+	}
+
+	/*
+	 * If the provided message contains an address, use that.  Otherwise
+	 * fall back on the socket's remote handle (if it has been connected).
+	 */
+	if (msg->msg_name &&
+	    vsock_addr_cast(msg->msg_name, msg->msg_namelen,
+			    &remote_addr) == 0) {
+		/* Ensure this address is of the right type and is a valid
+		 * destination. */
+
+		if (remote_addr->svm_cid == VMADDR_CID_ANY)
+			remote_addr->svm_cid = vmci_get_context_id();
+
+		if (!vsock_addr_bound(remote_addr)) {
+			err = -EINVAL;
+			goto out;
+		}
+	} else if (sock->state == SS_CONNECTED) {
+		remote_addr = &vsk->remote_addr;
+
+		if (remote_addr->svm_cid == VMADDR_CID_ANY)
+			remote_addr->svm_cid = vmci_get_context_id();
+
+		/* XXX Should connect() or this function ensure remote_addr is
+		 * bound? */
+		if (!vsock_addr_bound(&vsk->remote_addr)) {
+			err = -EINVAL;
+			goto out;
+		}
+	} else {
+		err = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Make sure that we don't allow a userlevel app to send datagrams to
+	 * the hypervisor that modify VMCI device state.
+	 */
+	if (!vsock_addr_socket_context_dgram(remote_addr->svm_cid,
+					     remote_addr->svm_port)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (!vsock_vmci_allow_dgram(vsk, remote_addr->svm_cid)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	/*
+	 * Allocate a buffer for the user's message and our packet header.
+	 */
+	dg = kmalloc(len + sizeof *dg, GFP_KERNEL);
+	if (!dg) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	memcpy_fromiovec(VMCI_DG_PAYLOAD(dg), msg->msg_iov, len);
+
+	dg->dst = VMCI_MAKE_HANDLE(remote_addr->svm_cid, remote_addr->svm_port);
+	dg->src =
+	    VMCI_MAKE_HANDLE(vsk->local_addr.svm_cid, vsk->local_addr.svm_port);
+
+	dg->payload_size = len;
+
+	err = vmci_datagram_send(dg);
+	kfree(dg);
+	if (err < 0) {
+		err = vsock_vmci_error_to_vsock_error(err);
+		goto out;
+	}
+
+	err -= sizeof *dg;
+
+out:
+	release_sock(sk);
+	return err;
+}
+
+/*
+ * vsock_vmci_stream_setsockopt --
+ *
+ * Set a socket option on a stream socket
+ *
+ * Results: 0 on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int vsock_vmci_stream_setsockopt(struct socket *sock,
+					int level,
+					int optname,
+					char __user *optval,
+					unsigned int optlen)
+{
+	int err;
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+	u64 val;
+
+	if (level != AF_VSOCK)
+		return -ENOPROTOOPT;
+
+#define COPY_IN(_v)                                       \
+	do {						  \
+		if (optlen < sizeof _v) {		  \
+			err = -EINVAL;			  \
+			goto exit;			  \
+		}					  \
+		if (copy_from_user(&_v, optval, sizeof _v) != 0) {	\
+			err = -EFAULT;					\
+			goto exit;					\
+		}							\
+	} while (0)
+
+	err = 0;
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+
+	lock_sock(sk);
+
+	switch (optname) {
+	case SO_VMCI_BUFFER_SIZE:
+		COPY_IN(val);
+		if (val < vsk->queue_pair_min_size)
+			vsk->queue_pair_min_size = val;
+
+		if (val > vsk->queue_pair_max_size)
+			vsk->queue_pair_max_size = val;
+
+		vsk->queue_pair_size = val;
+		break;
+
+	case SO_VMCI_BUFFER_MAX_SIZE:
+		COPY_IN(val);
+		if (val < vsk->queue_pair_size)
+			vsk->queue_pair_size = val;
+
+		vsk->queue_pair_max_size = val;
+		break;
+
+	case SO_VMCI_BUFFER_MIN_SIZE:
+		COPY_IN(val);
+		if (val > vsk->queue_pair_size)
+			vsk->queue_pair_size = val;
+
+		vsk->queue_pair_min_size = val;
+		break;
+
+	case SO_VMCI_CONNECT_TIMEOUT: {
+		struct timeval tv;
+		COPY_IN(tv);
+		if (tv.tv_sec >= 0 && tv.tv_usec < USEC_PER_SEC &&
+		    tv.tv_sec < (MAX_SCHEDULE_TIMEOUT / HZ - 1)) {
+			vsk->connect_timeout = tv.tv_sec * HZ +
+			    DIV_ROUND_UP(tv.tv_usec, (1000000 / HZ));
+			if (vsk->connect_timeout == 0)
+				vsk->connect_timeout =
+				    VSOCK_DEFAULT_CONNECT_TIMEOUT;
+
+		} else {
+			err = -ERANGE;
+		}
+		break;
+	}
+
+	default:
+		err = -ENOPROTOOPT;
+		break;
+	}
+
+#undef COPY_IN
+
+exit:
+	release_sock(sk);
+	return err;
+}
+
+/*
+ * vsock_vmci_stream_getsockopt --
+ *
+ * Get a socket option for a stream socket
+ *
+ * Results: 0 on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int vsock_vmci_stream_getsockopt(struct socket *sock,
+					int level, int optname,
+					char __user *optval,
+					int __user *optlen)
+{
+	int err;
+	int len;
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+
+	if (level != AF_VSOCK)
+		return -ENOPROTOOPT;
+
+	err = get_user(len, optlen);
+	if (err != 0)
+		return err;
+
+#define COPY_OUT(_v)                            \
+	do {					\
+		if (len < sizeof _v)		\
+			return -EINVAL;		\
+						\
+		len = sizeof _v;		\
+		if (copy_to_user(optval, &_v, len) != 0)	\
+			return -EFAULT;				\
+								\
+	} while (0)
+
+	err = 0;
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+
+	switch (optname) {
+	case SO_VMCI_BUFFER_SIZE:
+		COPY_OUT(vsk->queue_pair_size);
+		break;
+
+	case SO_VMCI_BUFFER_MAX_SIZE:
+		COPY_OUT(vsk->queue_pair_max_size);
+		break;
+
+	case SO_VMCI_BUFFER_MIN_SIZE:
+		COPY_OUT(vsk->queue_pair_min_size);
+		break;
+
+	case SO_VMCI_CONNECT_TIMEOUT: {
+		struct timeval tv;
+		tv.tv_sec = vsk->connect_timeout / HZ;
+		tv.tv_usec =
+		    (vsk->connect_timeout -
+		     tv.tv_sec * HZ) * (1000000 / HZ);
+		COPY_OUT(tv);
+		break;
+	}
+	default:
+		return -ENOPROTOOPT;
+	}
+
+	err = put_user(len, optlen);
+	if (err != 0)
+		return -EFAULT;
+
+#undef COPY_OUT
+
+	return 0;
+}
+
+/*
+ * vsock_vmci_stream_sendmsg --
+ *
+ * Sends a message on the socket.
+ *
+ * Results: Number of bytes sent on success, negative error code on failure.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_stream_sendmsg(struct kiocb *kiocb,
+			  struct socket *sock, struct msghdr *msg, size_t len)
+{
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+	ssize_t total_written;
+	long timeout;
+	int err;
+	struct vsock_vmci_send_notify_data send_data;
+
+	DEFINE_WAIT(wait);
+
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+	total_written = 0;
+	err = 0;
+
+	if (msg->msg_flags & MSG_OOB)
+		return -EOPNOTSUPP;
+
+	lock_sock(sk);
+
+	/* Callers should not provide a destination with stream sockets. */
+	if (msg->msg_namelen) {
+		err = sk->sk_state == SS_CONNECTED ? -EISCONN : -EOPNOTSUPP;
+		goto out;
+	}
+
+	/* Send data only if both sides are not shutdown in the direction. */
+	if (sk->sk_shutdown & SEND_SHUTDOWN ||
+	    vsk->peer_shutdown & RCV_SHUTDOWN) {
+		err = -EPIPE;
+		goto out;
+	}
+
+	if (sk->sk_state != SS_CONNECTED ||
+	    !vsock_addr_bound(&vsk->local_addr)) {
+		err = -ENOTCONN;
+		goto out;
+	}
+
+	if (!vsock_addr_bound(&vsk->remote_addr)) {
+		err = -EDESTADDRREQ;
+		goto out;
+	}
+
+	/*
+	 * Wait for room in the produce queue to enqueue our user's data.
+	 */
+	timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+	NOTIFYCALLRET(vsk, err, send_init, sk, &send_data);
+	if (err < 0)
+		goto out;
+
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+
+	while (total_written < len) {
+		ssize_t written;
+
+		while (vsock_vmci_stream_has_space(vsk) == 0 &&
+		       sk->sk_err == 0 &&
+		       !(sk->sk_shutdown & SEND_SHUTDOWN) &&
+		       !(vsk->peer_shutdown & RCV_SHUTDOWN)) {
+
+			/* Don't wait for non-blocking sockets. */
+			if (timeout == 0) {
+				err = -EAGAIN;
+				goto out_wait;
+			}
+
+			NOTIFYCALLRET(vsk, err, send_pre_block, sk, &send_data);
+
+			if (err < 0)
+				goto out_wait;
+
+			release_sock(sk);
+			timeout = schedule_timeout(timeout);
+			lock_sock(sk);
+			if (signal_pending(current)) {
+				err = sock_intr_errno(timeout);
+				goto out_wait;
+			} else if (timeout == 0) {
+				err = -EAGAIN;
+				goto out_wait;
+			}
+
+			prepare_to_wait(sk_sleep(sk), &wait,
+					TASK_INTERRUPTIBLE);
+		}
+
+		/*
+		 * These checks occur both as part of and after the loop
+		 * conditional since we need to check before and after
+		 * sleeping.
+		 */
+		if (sk->sk_err) {
+			err = -sk->sk_err;
+			goto out_wait;
+		} else if ((sk->sk_shutdown & SEND_SHUTDOWN) ||
+			   (vsk->peer_shutdown & RCV_SHUTDOWN)) {
+			err = -EPIPE;
+			goto out_wait;
+		}
+
+		VSOCK_STATS_STREAM_PRODUCE_HIST(vsk);
+
+		NOTIFYCALLRET(vsk, err, send_pre_enqueue, sk, &send_data);
+		if (err < 0)
+			goto out_wait;
+
+		/*
+		 * Note that enqueue will only write as many bytes as are free
+		 * in the produce queue, so we don't need to ensure len is
+		 * smaller than the queue size.  It is the caller's
+		 * responsibility to check how many bytes we were able to send.
+		 */
+
+		written = vmci_qpair_enquev(vsk->qpair, msg->msg_iov,
+					    len - total_written, 0);
+		if (written < 0) {
+			err = -ENOMEM;
+			goto out_wait;
+		}
+
+		total_written += written;
+
+		NOTIFYCALLRET(vsk, err, send_post_enqueue, sk, written,
+			      &send_data);
+		if (err < 0)
+			goto out_wait;
+
+	}
+
+out_wait:
+	if (total_written > 0) {
+		VSOCK_STATS_STREAM_PRODUCE(total_written);
+		err = total_written;
+	}
+	finish_wait(sk_sleep(sk), &wait);
+out:
+	release_sock(sk);
+	return err;
+}
+
+/*
+ * vsock_vmci_dgram_recvmsg --
+ *
+ * Receives a datagram and places it in the caller's msg.
+ *
+ * Results: The size of the payload on success, negative value on failure.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_dgram_recvmsg(struct kiocb *kiocb,
+			 struct socket *sock,
+			 struct msghdr *msg, size_t len, int flags)
+{
+	int err;
+	int noblock;
+	struct sock *sk;
+	struct vmci_datagram *dg;
+	size_t payload_len;
+	struct sk_buff *skb;
+
+	sk = sock->sk;
+	noblock = flags & MSG_DONTWAIT;
+
+	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
+		return -EOPNOTSUPP;
+
+	/* Retrieve the head sk_buff from the socket's receive queue. */
+	err = 0;
+	skb = skb_recv_datagram(sk, flags, noblock, &err);
+	if (err)
+		return err;
+
+	if (!skb)
+		return -EAGAIN;
+
+	dg = (struct vmci_datagram *) skb->data;
+	if (!dg)
+		/* err is 0, meaning we read zero bytes. */
+		goto out;
+
+	payload_len = dg->payload_size;
+	/* Ensure the sk_buff matches the payload size claimed in the packet. */
+	if (payload_len != skb->len - sizeof *dg) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (payload_len > len) {
+		payload_len = len;
+		msg->msg_flags |= MSG_TRUNC;
+	}
+
+	/* Place the datagram payload in the user's iovec. */
+	err =
+	    skb_copy_datagram_iovec(skb, sizeof *dg, msg->msg_iov, payload_len);
+	if (err)
+		goto out;
+
+	msg->msg_namelen = 0;
+	if (msg->msg_name) {
+		struct sockaddr_vm *vmci_addr;
+
+		/* Provide the address of the sender. */
+		vmci_addr = (struct sockaddr_vm *)msg->msg_name;
+		vsock_addr_init(vmci_addr,
+				VMCI_HANDLE_TO_CONTEXT_ID(dg->src),
+				VMCI_HANDLE_TO_RESOURCE_ID(dg->src));
+		msg->msg_namelen = sizeof *vmci_addr;
+	}
+	err = payload_len;
+
+out:
+	skb_free_datagram(sk, skb);
+	return err;
+}
+
+/*
+ * vsock_vmci_stream_recvmsg --
+ *
+ * Receives a datagram and places it in the caller's msg.
+ *
+ * Results: The size of the payload on success, negative value on failure.
+ *
+ * Side effects: None.
+ */
+
+static int
+vsock_vmci_stream_recvmsg(struct kiocb *kiocb,
+			  struct socket *sock,
+			  struct msghdr *msg, size_t len, int flags)
+{
+	struct sock *sk;
+	struct vsock_vmci_sock *vsk;
+	int err;
+	size_t target;
+	ssize_t copied;
+	long timeout;
+	struct vsock_vmci_recv_notify_data recv_data;
+
+	DEFINE_WAIT(wait);
+
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+	err = 0;
+
+	lock_sock(sk);
+
+	if (sk->sk_state != SS_CONNECTED) {
+		/*
+		 * Recvmsg is supposed to return 0 if a peer performs an
+		 * orderly shutdown. Differentiate between that case and when a
+		 * peer has not connected or a local shutdown occured with the
+		 * SOCK_DONE flag.
+		 */
+		if (sock_flag(sk, SOCK_DONE))
+			err = 0;
+		else
+			err = -ENOTCONN;
+
+		goto out;
+	}
+
+	if (flags & MSG_OOB) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+
+	/*
+	 * We don't check peer_shutdown flag here since peer may actually shut
+	 * down, but there can be data in the VMCI queue that local socket can
+	 * receive.
+	 */
+	if (sk->sk_shutdown & RCV_SHUTDOWN) {
+		err = 0;
+		goto out;
+	}
+
+	/*
+	 * It is valid on Linux to pass in a zero-length receive buffer.  This
+	 * is not an error.  We may as well bail out now.
+	 */
+	if (!len) {
+		err = 0;
+		goto out;
+	}
+
+	/*
+	 * We must not copy less than target bytes into the user's buffer
+	 * before returning successfully, so we wait for the consume queue to
+	 * have that much data to consume before dequeueing.  Note that this
+	 * makes it impossible to handle cases where target is greater than the
+	 * queue size.
+	 */
+	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
+	if (target >= vsk->consume_size) {
+		err = -ENOMEM;
+		goto out;
+	}
+	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+	copied = 0;
+
+	NOTIFYCALLRET(vsk, err, recv_init, sk, target, &recv_data);
+	if (err < 0)
+		goto out;
+
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+
+	while (1) {
+		s64 ready = vsock_vmci_stream_has_data(vsk);
+
+		if (ready < 0) {
+			/*
+			 * Invalid queue pair content. XXX This should be
+			 * changed to a connection reset in a later change.
+			 */
+
+			err = -ENOMEM;
+			goto out_wait;
+		} else if (ready > 0) {
+			ssize_t read;
+
+			VSOCK_STATS_STREAM_CONSUME_HIST(vsk);
+
+			NOTIFYCALLRET(vsk, err, recv_pre_dequeue, sk, target,
+				      &recv_data);
+			if (err < 0)
+				break;
+
+			if (flags & MSG_PEEK)
+				read =
+				    vmci_qpair_peekv(vsk->qpair, msg->msg_iov,
+						     len - copied, 0);
+			else
+				read =
+				    vmci_qpair_dequev(vsk->qpair, msg->msg_iov,
+						      len - copied, 0);
+
+			if (read < 0) {
+				err = -ENOMEM;
+				break;
+			}
+
+			copied += read;
+
+			NOTIFYCALLRET(vsk, err, recv_post_dequeue, sk, target,
+				      read, !(flags & MSG_PEEK), &recv_data);
+			if (err < 0)
+				goto out_wait;
+
+			if (read >= target || flags & MSG_PEEK)
+				break;
+
+			target -= read;
+		} else {
+			if (sk->sk_err != 0 || (sk->sk_shutdown & RCV_SHUTDOWN)
+			    || (vsk->peer_shutdown & SEND_SHUTDOWN)) {
+				break;
+			}
+			/* Don't wait for non-blocking sockets. */
+			if (timeout == 0) {
+				err = -EAGAIN;
+				break;
+			}
+
+			NOTIFYCALLRET(vsk, err, recv_pre_block, sk, target,
+				      &recv_data);
+			if (err < 0)
+				break;
+
+			release_sock(sk);
+			timeout = schedule_timeout(timeout);
+			lock_sock(sk);
+
+			if (signal_pending(current)) {
+				err = sock_intr_errno(timeout);
+				break;
+			} else if (timeout == 0) {
+				err = -EAGAIN;
+				break;
+			}
+
+			prepare_to_wait(sk_sleep(sk), &wait,
+					TASK_INTERRUPTIBLE);
+		}
+	}
+
+	if (sk->sk_err)
+		err = -sk->sk_err;
+	else if (sk->sk_shutdown & RCV_SHUTDOWN)
+		err = 0;
+
+	if (copied > 0) {
+		/*
+		 * We only do these additional bookkeeping/notification steps
+		 * if we actually copied something out of the queue pair
+		 * instead of just peeking ahead.
+		 */
+
+		if (!(flags & MSG_PEEK)) {
+			VSOCK_STATS_STREAM_CONSUME(copied);
+
+			/*
+			 * If the other side has shutdown for sending and there
+			 * is nothing more to read, then modify the socket
+			 * state.
+			 */
+			if (vsk->peer_shutdown & SEND_SHUTDOWN) {
+				if (vsock_vmci_stream_has_data(vsk) <= 0) {
+					sk->sk_state = SS_UNCONNECTED;
+					sock_set_flag(sk, SOCK_DONE);
+					sk->sk_state_change(sk);
+				}
+			}
+		}
+		err = copied;
+	}
+
+out_wait:
+	finish_wait(sk_sleep(sk), &wait);
+out:
+	release_sock(sk);
+	return err;
+}
+
+/*
+ * Protocol operation.
+ */
+
+/*
+ * vsock_vmci_create --
+ *
+ * Creates a VSocket socket.
+ *
+ * Results: Zero on success, negative error code on failure.
+ *
+ * Side effects: Socket count is incremented.
+ */
+
+static int
+vsock_vmci_create(struct net *net, struct socket *sock, int protocol, int kern)
+{
+	if (!sock)
+		return -EINVAL;
+
+	if (protocol)
+		return -EPROTONOSUPPORT;
+
+	switch (sock->type) {
+	case SOCK_DGRAM:
+		sock->ops = &vsock_vmci_dgram_ops;
+		break;
+	case SOCK_STREAM:
+		sock->ops = &vsock_vmci_stream_ops;
+		break;
+	default:
+		return -ESOCKTNOSUPPORT;
+	}
+
+	sock->state = SS_UNCONNECTED;
+
+	return __vsock_vmci_create(net, sock, NULL, GFP_KERNEL,
+				   0) ? 0 : -ENOMEM;
+}
+
+/*
+ * Device operations.
+ */
+
+static long vsock_vmci_dev_do_ioctl(struct file *filp,
+				    unsigned int cmd, void __user *ptr)
+{
+	static const u16 parts[4] = VSOCK_DRIVER_VERSION_PARTS;
+	u32 __user *p = ptr;
+	int retval = 0;
+	u32 version;
+
+	switch (cmd) {
+	case IOCTL_VMCI_SOCKETS_VERSION:
+		version = VMCI_SOCKETS_MAKE_VERSION(parts);
+		if (put_user(version, p) != 0)
+			retval = -EFAULT;
+		break;
+
+	case IOCTL_VMCI_SOCKETS_GET_AF_VALUE:
+		if (put_user(AF_VSOCK, p) != 0)
+			retval = -EFAULT;
+
+		break;
+
+	case IOCTL_VMCI_SOCKETS_GET_LOCAL_CID:
+		if (put_user(vmci_get_context_id(), p) != 0)
+			retval = -EFAULT;
+
+		break;
+
+	default:
+		pr_err("Unknown ioctl %d\n", cmd);
+		retval = -EINVAL;
+	}
+
+	return retval;
+}
+
+static long vsock_vmci_dev_ioctl(struct file *filp,
+				 unsigned int cmd, unsigned long arg)
+{
+	return vsock_vmci_dev_do_ioctl(filp, cmd, (void __user *)arg);
+}
+
+#ifdef CONFIG_COMPAT
+static long vsock_vmci_dev_compat_ioctl(struct file *filp,
+					unsigned int cmd, unsigned long arg)
+{
+	return vsock_vmci_dev_do_ioctl(filp, cmd, compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations vsock_vmci_device_ops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= vsock_vmci_dev_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vsock_vmci_dev_compat_ioctl,
+#endif
+	.open		= nonseekable_open,
+};
+
+static struct miscdevice vsock_vmci_device = {
+	.name		= "vsock",
+	.minor		= MISC_DYNAMIC_MINOR,
+	.fops		= &vsock_vmci_device_ops,
+};
+
+
+/*
+ * Module operations.
+ */
+
+/*
+ * vsock_vmci_init --
+ *
+ * Initialization routine for the VSockets module.
+ *
+ * Results: Zero on success, error code on failure.
+ *
+ * Side effects: The VSocket protocol family and socket operations are
+ * registered.
+ */
+
+static int __init vsock_vmci_init(void)
+{
+	int err;
+
+	request_module("vmci");
+
+	err = misc_register(&vsock_vmci_device);
+	if (err) {
+		pr_err("Failed to register misc device\n");
+		return -ENOENT;
+	}
+
+	err = vsock_vmci_register_with_vmci();
+	if (err) {
+		pr_err("Cannot register with VMCI device.\n");
+		goto err_misc_deregister;
+	}
+
+	err = proto_register(&vsock_vmci_proto, 1);	/* we want our slab */
+	if (err) {
+		pr_err("Cannot register vsock protocol.\n");
+		goto err_unregister_with_vmci;
+	}
+
+	err = sock_register(&vsock_vmci_family_ops);
+	if (err) {
+		pr_err("could not register af_vsock (%d) address family: %d\n",
+		       AF_VSOCK, err);
+		goto err_unregister_proto;
+	}
+
+	vsock_vmci_init_tables();
+	return 0;
+
+err_unregister_proto:
+	proto_unregister(&vsock_vmci_proto);
+err_unregister_with_vmci:
+	vsock_vmci_unregister_with_vmci();
+err_misc_deregister:
+	misc_deregister(&vsock_vmci_device);
+	return err;
+}
+
+/*
+ * VSocketVmciExit --
+ *
+ * VSockets module exit routine.
+ *
+ * Results: None.
+ *
+ * Side effects: Unregisters VSocket protocol family and socket operations.
+ */
+
+static void __exit vsock_vmci_exit(void)
+{
+	misc_deregister(&vsock_vmci_device);
+	sock_unregister(AF_VSOCK);
+	proto_unregister(&vsock_vmci_proto);
+	/* Need reset ? */
+	VSOCK_STATS_RESET();
+	vsock_vmci_unregister_with_vmci();
+}
+
+module_init(vsock_vmci_init);
+module_exit(vsock_vmci_exit);
+
+MODULE_AUTHOR("VMware, Inc.");
+MODULE_DESCRIPTION("VMware Virtual Socket Family");
+MODULE_VERSION(VSOCK_DRIVER_VERSION_STRING);
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS("vmware_vsock");
diff --git a/net/vmw_vsock/af_vsock.h b/net/vmw_vsock/af_vsock.h
new file mode 100644
index 0000000..ecaf101
--- /dev/null
+++ b/net/vmw_vsock/af_vsock.h
@@ -0,0 +1,180 @@
+/*
+ * VMware vSockets Driver
+ *
+ * Copyright (C) 2007-2012 VMware, Inc. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation version 2 and no later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+/*
+ * af_vsock.h --
+ *
+ * Definitions for Linux VSockets module.
+ */
+
+#ifndef __AF_VSOCK_H__
+#define __AF_VSOCK_H__
+
+#include <linux/kernel.h>
+#include <linux/workqueue.h>
+#include <linux/vmw_vmci_defs.h>
+#include <linux/vmw_vmci_api.h>
+
+#include "vsock_common.h"
+#include "vsock_packet.h"
+#include "notify.h"
+
+#define vsock_sk(__sk)    ((struct vsock_vmci_sock *)__sk)
+#define sk_vsock(__vsk)   (&(__vsk)->sk)
+
+struct vsock_vmci_sock {
+	/* sk must be the first member. */
+	struct sock sk;
+	struct sockaddr_vm local_addr;
+	struct sockaddr_vm remote_addr;
+	/* Links for the global tables of bound and connected sockets. */
+	struct list_head bound_table;
+	struct list_head connected_table;
+	/*
+	 * Accessed without the socket lock held. This means it can never be
+	 * modified outsided of socket create or destruct.
+	 */
+	bool trusted;
+	bool cached_peer_allow_dgram;	/* Dgram communication allowed to
+					 * cached peer? */
+	vmci_id cached_peer;  /* Context ID of last dgram destination check. */
+	const struct cred *owner;
+	struct vmci_handle dg_handle;	/* For SOCK_DGRAM only. */
+	/* Rest are SOCK_STREAM only. */
+	struct vmci_handle qp_handle;
+	struct vmci_qp *qpair;
+	u64 produce_size;
+	u64 consume_size;
+	u64 queue_pair_size;
+	u64 queue_pair_min_size;
+	u64 queue_pair_max_size;
+	long connect_timeout;
+	union vsock_vmci_notify notify;
+	struct vsock_vmci_notify_ops *notify_ops;
+	vmci_id attach_sub_id;
+	vmci_id detach_sub_id;
+	/* Listening socket that this came from. */
+	struct sock *listener;
+	/*
+	 * Used for pending list and accept queue during connection handshake.
+	 * The listening socket is the head for both lists.  Sockets created
+	 * for connection requests are placed in the pending list until they
+	 * are connected, at which point they are put in the accept queue list
+	 * so they can be accepted in accept().  If accept() cannot accept the
+	 * connection, it is marked as rejected so the cleanup function knows
+	 * to clean up the socket.
+	 */
+	struct list_head pending_links;
+	struct list_head accept_queue;
+	bool rejected;
+	struct delayed_work dwork;
+	u32 peer_shutdown;
+	bool sent_request;
+	bool ignore_connecting_rst;
+};
+
+int vsock_vmci_send_control_pkt_bh(struct sockaddr_vm *src,
+				   struct sockaddr_vm *dst,
+				   enum vsock_packet_type type,
+				   u64 size,
+				   u64 mode,
+				   struct vsock_waiting_info *wait,
+				   struct vmci_handle handle);
+int vsock_vmci_reply_control_pkt_fast(struct vsock_packet *pkt,
+				      enum vsock_packet_type type, u64 size,
+				      u64 mode,
+				      struct vsock_waiting_info *wait,
+				      struct vmci_handle handle);
+int vsock_vmci_send_control_pkt(struct sock *sk, enum vsock_packet_type type,
+				u64 size, u64 mode,
+				struct vsock_waiting_info *wait,
+				vsock_proto_version version,
+				struct vmci_handle handle);
+
+s64 vsock_vmci_stream_has_data(struct vsock_vmci_sock *vsk);
+s64 vsock_vmci_stream_has_space(struct vsock_vmci_sock *vsk);
+
+#define VSOCK_SEND_RESET_BH(_dst, _src, _pkt)				\
+	(((_pkt)->type == VSOCK_PACKET_TYPE_RST) ?			\
+		0 :							\
+		vsock_vmci_send_control_pkt_bh(				\
+			_dst, _src,					\
+			VSOCK_PACKET_TYPE_RST, 0,			\
+			0, NULL, VMCI_INVALID_HANDLE))
+#define VSOCK_SEND_INVALID_BH(_dst, _src)				\
+	vsock_vmci_send_control_pkt_bh(_dst, _src,			\
+				       VSOCK_PACKET_TYPE_INVALID, 0,	\
+				       0, NULL, VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_WROTE_BH(_dst, _src)					\
+	vsock_vmci_send_control_pkt_bh(_dst, _src, VSOCK_PACKET_TYPE_WROTE, 0, \
+				       0, NULL, VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_READ_BH(_dst, _src)					\
+	vsock_vmci_send_control_pkt_bh((_dst), (_src),			\
+				       VSOCK_PACKET_TYPE_READ, 0,	\
+				       0, NULL, VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_RESET(_sk, _pkt)					\
+	(((_pkt)->type == VSOCK_PACKET_TYPE_RST) ?			\
+		0 :							\
+		vsock_vmci_send_control_pkt(				\
+			_sk, VSOCK_PACKET_TYPE_RST,			\
+			0, 0, NULL, VSOCK_PROTO_INVALID,		\
+			VMCI_INVALID_HANDLE))
+#define VSOCK_SEND_NEGOTIATE(_sk, _size)				\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_NEGOTIATE,	\
+				    _size, 0, NULL, VSOCK_PROTO_INVALID, \
+				    VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_NEGOTIATE2(_sk, _size, signal_proto)			\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_NEGOTIATE2,	\
+				    _size, 0, NULL, signal_proto,	\
+				    VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_QP_OFFER(_sk, _handle)				\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_OFFER,	\
+				    0, 0, NULL, VSOCK_PROTO_INVALID, _handle)
+#define VSOCK_SEND_CONN_REQUEST(_sk, _size)				\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_REQUEST,	\
+				    _size, 0, NULL, VSOCK_PROTO_INVALID, \
+				    VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_CONN_REQUEST2(_sk, _size, signal_proto)		\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_REQUEST2,	\
+				    _size, 0, NULL, signal_proto,	\
+				    VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_ATTACH(_sk, _handle)					\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_ATTACH,	\
+				    0, 0, NULL, VSOCK_PROTO_INVALID, _handle)
+#define VSOCK_SEND_WROTE(_sk)						\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_WROTE,	\
+				    0, 0, NULL, VSOCK_PROTO_INVALID,	\
+				    VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_READ(_sk)						\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_READ,	\
+				    0, 0, NULL, VSOCK_PROTO_INVALID,	\
+				    VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_SHUTDOWN(_sk, _mode)					\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_SHUTDOWN,	\
+				    0, _mode, NULL, VSOCK_PROTO_INVALID, \
+				    VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_WAITING_WRITE(_sk, _wait_info)			\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_WAITING_WRITE, \
+				    0, 0, _wait_info, VSOCK_PROTO_INVALID, \
+				    VMCI_INVALID_HANDLE)
+#define VSOCK_SEND_WAITING_READ(_sk, _wait_info)			\
+	vsock_vmci_send_control_pkt(_sk, VSOCK_PACKET_TYPE_WAITING_READ, \
+				    0, 0, _wait_info, VSOCK_PROTO_INVALID, \
+				    VMCI_INVALID_HANDLE)
+#define VSOCK_REPLY_RESET(_pkt)						\
+	vsock_vmci_reply_control_pkt_fast(_pkt, VSOCK_PACKET_TYPE_RST,	\
+					  0, 0, NULL, VMCI_INVALID_HANDLE)
+
+#endif /* __AF_VSOCK_H__ */

^ permalink raw reply related

* [PATCH 0/6] VSOCK for Linux upstreaming
From: George Zhang @ 2012-11-21 20:39 UTC (permalink / raw)
  To: netdev, linux-kernel, georgezhang, virtualization
  Cc: pv-drivers, gregkh, davem

* * *
This series of VSOCK linux upstreaming patches include latest udpate from
VMware.

Summary of changes:
        - Sparse clean.
        - Checkpatch clean with one exception, a "complex macro" in
          which we can't add parentheses.
        - Remove all runtime assertions.
        - Fix device name, so that existing user clients work.
        - Fix VMCI handle lookup.

* * *

In an effort to improve the out-of-the-box experience with Linux
kernels for VMware users, VMware is working on readying the Virtual
Machine Communication Interface (vmw_vmci) and VMCI Sockets (VSOCK)
(vmw_vsock) kernel modules for inclusion in the Linux kernel. The
purpose of this post is to acquire feedback on the vmw_vsock kernel
module. The vmw_vmci kernel module has been presented in an early post.

* * *

VMCI Sockets allows virtual machines to communicate with host kernel
modules and the VMware hypervisors. VMCI Sockets kernel module has
dependency on VMCI kernel module. User level applications both in
a virtual machine and on the host can use vmw_vmci through VMCI
Sockets API which facilitates fast and efficient communication
between guest virtual machines and their host. A socket
address family designed to be compatible with UDP and TCP at the
interface level. Today, VMCI and VMCI Sockets are used by the VMware
shared folders (HGFS) and various VMware Tools components inside the
guest for zero-config, network-less access to VMware host services. In
addition to this, VMware's users are using VMCI Sockets for various
applications, where network access of the virtual machine is
restricted or non-existent. Examples of this are VMs communicating
with device proxies for proprietary hardware running as host
applications and automated testing of applications running within
virtual machines.

The VMware VMCI Sockets are similar to other socket types, like
Berkeley UNIX socket interface. The VMCI sockets module supports
both connection-oriented stream sockets like TCP, and connectionless
datagram sockets like UDP. The VSOCK protocol family is defined as
"AF_VSOCK" and the socket operations split for SOCK_DGRAM and
SOCK_STREAM.

For additional information about the use of VMCI and in particular
VMCI Sockets, please refer to the VMCI Socket Programming Guide
available at https://www.vmware.com/support/developer/vmci-sdk/.

---

George Zhang (6):
      VSOCK: vsock protocol implementation.
      VSOCK: vsock address implementaion.
      VSOCK: notification implementation.
      VSOCK: statistics implementation.
      VSOCK: utility functions.
      VSOCK: header and config files.

 include/linux/socket.h              |    4 
 net/Kconfig                         |    1 
 net/Makefile                        |    1 
 net/vmw_vsock/Kconfig               |   14 
 net/vmw_vsock/Makefile              |    4 
 net/vmw_vsock/af_vsock.c            | 4054 +++++++++++++++++++++++++++++++++++
 net/vmw_vsock/af_vsock.h            |  180 ++
 net/vmw_vsock/notify.c              |  983 ++++++++
 net/vmw_vsock/notify.h              |  130 +
 net/vmw_vsock/notify_qstate.c       |  625 +++++
 net/vmw_vsock/stats.c               |   37 
 net/vmw_vsock/stats.h               |  217 ++
 net/vmw_vsock/util.c                |  620 +++++
 net/vmw_vsock/util.h                |  314 +++
 net/vmw_vsock/vmci_sockets.h        |  517 ++++
 net/vmw_vsock/vmci_sockets_packet.h |   90 +
 net/vmw_vsock/vsock_addr.c          |  246 ++
 net/vmw_vsock/vsock_addr.h          |   40 
 net/vmw_vsock/vsock_common.h        |  127 +
 net/vmw_vsock/vsock_packet.h        |  124 +
 net/vmw_vsock/vsock_version.h       |   28 
 21 files changed, 8355 insertions(+), 1 deletions(-)
 create mode 100644 net/vmw_vsock/Kconfig
 create mode 100644 net/vmw_vsock/Makefile
 create mode 100644 net/vmw_vsock/af_vsock.c
 create mode 100644 net/vmw_vsock/af_vsock.h
 create mode 100644 net/vmw_vsock/notify.c
 create mode 100644 net/vmw_vsock/notify.h
 create mode 100644 net/vmw_vsock/notify_qstate.c
 create mode 100644 net/vmw_vsock/stats.c
 create mode 100644 net/vmw_vsock/stats.h
 create mode 100644 net/vmw_vsock/util.c
 create mode 100644 net/vmw_vsock/util.h
 create mode 100644 net/vmw_vsock/vmci_sockets.h
 create mode 100644 net/vmw_vsock/vmci_sockets_packet.h
 create mode 100644 net/vmw_vsock/vsock_addr.c
 create mode 100644 net/vmw_vsock/vsock_addr.h
 create mode 100644 net/vmw_vsock/vsock_common.h
 create mode 100644 net/vmw_vsock/vsock_packet.h
 create mode 100644 net/vmw_vsock/vsock_version.h

-- 
Signature

^ permalink raw reply

* [PATCH net 1/1] 8139cp: revert "set ring address before enabling receiver"
From: Francois Romieu @ 2012-11-21 20:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, David Woodhouse, Jason Wang, Jeff Garzik, gilboad

This patch reverts b01af4579ec41f48e9b9c774e70bd6474ad210db.

The original patch was tested with emulated hardware. Real
hardware chokes.

Fixes https://bugzilla.kernel.org/show_bug.cgi?id=47041

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/realtek/8139cp.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index 1c81825..b01f83a 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -979,17 +979,6 @@ static void cp_init_hw (struct cp_private *cp)
 	cpw32_f (MAC0 + 0, le32_to_cpu (*(__le32 *) (dev->dev_addr + 0)));
 	cpw32_f (MAC0 + 4, le32_to_cpu (*(__le32 *) (dev->dev_addr + 4)));
 
-	cpw32_f(HiTxRingAddr, 0);
-	cpw32_f(HiTxRingAddr + 4, 0);
-
-	ring_dma = cp->ring_dma;
-	cpw32_f(RxRingAddr, ring_dma & 0xffffffff);
-	cpw32_f(RxRingAddr + 4, (ring_dma >> 16) >> 16);
-
-	ring_dma += sizeof(struct cp_desc) * CP_RX_RING_SIZE;
-	cpw32_f(TxRingAddr, ring_dma & 0xffffffff);
-	cpw32_f(TxRingAddr + 4, (ring_dma >> 16) >> 16);
-
 	cp_start_hw(cp);
 	cpw8(TxThresh, 0x06); /* XXX convert magic num to a constant */
 
@@ -1003,6 +992,17 @@ static void cp_init_hw (struct cp_private *cp)
 
 	cpw8(Config5, cpr8(Config5) & PMEStatus);
 
+	cpw32_f(HiTxRingAddr, 0);
+	cpw32_f(HiTxRingAddr + 4, 0);
+
+	ring_dma = cp->ring_dma;
+	cpw32_f(RxRingAddr, ring_dma & 0xffffffff);
+	cpw32_f(RxRingAddr + 4, (ring_dma >> 16) >> 16);
+
+	ring_dma += sizeof(struct cp_desc) * CP_RX_RING_SIZE;
+	cpw32_f(TxRingAddr, ring_dma & 0xffffffff);
+	cpw32_f(TxRingAddr + 4, (ring_dma >> 16) >> 16);
+
 	cpw16(MultiIntr, 0);
 
 	cpw8_f(Cfg9346, Cfg9346_Lock);
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH 9/9] batman-adv: Use packing of 2 for all headers before an ethernet header
From: David Miller @ 2012-11-21 20:31 UTC (permalink / raw)
  To: sven
  Cc: ordex, netdev, b.a.t.m.a.n, lindner_marek, kevin.curtis,
	linux-driver, ron.mercer, jitendra.kalsaria, rmody, andy, fubar
In-Reply-To: <8108710.8oNxxR0rRd@sven-laptop.home.narfation.org>

From: Sven Eckelmann <sven@narfation.org>
Date: Wed, 21 Nov 2012 19:20:21 +0100

> Agree. But we should get this message also to the other guys
> 
> $ git grep pragma -- drivers/net

Good point, I've pulled your tree, thanks!

^ permalink raw reply

* [PATCH] 8139cp: set ring address after enabling C+ mode
From: David Woodhouse @ 2012-11-21 20:27 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Jason Wang, David S. Miller, netdev
In-Reply-To: <50AD1972.5080403@pobox.com>

[-- Attachment #1: Type: text/plain, Size: 3796 bytes --]

This fixes (for me) a regression introduced by commit b01af457 ("8139cp:
set ring address before enabling receiver"). That commit configured the
descriptor ring addresses earlier in the initialisation sequence, in
order to avoid the possibility of triggering stray DMA before the
correct address had been set up.

Unfortunately, it seems that the hardware will scribble garbage into the
TxRingAddr registers when we enable "plus mode" Tx in the CpCmd
register. Observed on a Traverse Geos router board.

To deal with this, while not reintroducing the problem which led to the
original commit, we augment cp_start_hw() to write to the CpCmd register
*first*, then set the descriptor ring addresses, and then finally to
enable Rx and Tx in the original 8139 Cmd register. The datasheet
actually indicates that we should enable Tx/Rx in the Cmd register
*before* configuring the descriptor addresses, but that would appear to
re-introduce the problem that the offending commit b01af457 was trying
to solve. And this variant appears to work fine on real hardware.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org [3.5+]

---
How about this? I'm still somewhat confused about when it actually
*does* start doing DMA, given what the datasheet says.

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index 1c81825..5166d94 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -957,7 +957,35 @@ static void cp_reset_hw (struct cp_private *cp)
 
 static inline void cp_start_hw (struct cp_private *cp)
 {
+	dma_addr_t ring_dma;
+
 	cpw16(CpCmd, cp->cpcmd);
+
+	/*
+	 * These (at least TxRingAddr) need to be configured after the
+	 * corresponding bits in CpCmd are enabled. Datasheet v1.6 §6.33
+	 * (C+ Command Register) recommends that these and more be configured
+	 * *after* the [RT]xEnable bits in CpCmd are set. And on some hardware
+	 * it's been observed that the TxRingAddr is actually reset to garbage
+	 * when C+ mode Tx is enabled in CpCmd.
+	 */
+	cpw32_f(HiTxRingAddr, 0);
+	cpw32_f(HiTxRingAddr + 4, 0);
+
+	ring_dma = cp->ring_dma;
+	cpw32_f(RxRingAddr, ring_dma & 0xffffffff);
+	cpw32_f(RxRingAddr + 4, (ring_dma >> 16) >> 16);
+
+	ring_dma += sizeof(struct cp_desc) * CP_RX_RING_SIZE;
+	cpw32_f(TxRingAddr, ring_dma & 0xffffffff);
+	cpw32_f(TxRingAddr + 4, (ring_dma >> 16) >> 16);
+
+	/*
+	 * Strictly speaking, the datasheet says this should be enabled
+	 * *before* setting the descriptor addresses. But what, then, would
+	 * prevent it from doing DMA to random unconfigured addresses?
+	 * This variant appears to work fine.
+	 */
 	cpw8(Cmd, RxOn | TxOn);
 }
 
@@ -969,7 +997,6 @@ static void cp_enable_irq(struct cp_private *cp)
 static void cp_init_hw (struct cp_private *cp)
 {
 	struct net_device *dev = cp->dev;
-	dma_addr_t ring_dma;
 
 	cp_reset_hw(cp);
 
@@ -979,17 +1006,6 @@ static void cp_init_hw (struct cp_private *cp)
 	cpw32_f (MAC0 + 0, le32_to_cpu (*(__le32 *) (dev->dev_addr + 0)));
 	cpw32_f (MAC0 + 4, le32_to_cpu (*(__le32 *) (dev->dev_addr + 4)));
 
-	cpw32_f(HiTxRingAddr, 0);
-	cpw32_f(HiTxRingAddr + 4, 0);
-
-	ring_dma = cp->ring_dma;
-	cpw32_f(RxRingAddr, ring_dma & 0xffffffff);
-	cpw32_f(RxRingAddr + 4, (ring_dma >> 16) >> 16);
-
-	ring_dma += sizeof(struct cp_desc) * CP_RX_RING_SIZE;
-	cpw32_f(TxRingAddr, ring_dma & 0xffffffff);
-	cpw32_f(TxRingAddr + 4, (ring_dma >> 16) >> 16);
-
 	cp_start_hw(cp);
 	cpw8(TxThresh, 0x06); /* XXX convert magic num to a constant */
 


-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation




[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply related

* Re: 8139cp: set ring address before enabling receiver
From: Ben Hutchings @ 2012-11-21 20:18 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jeff Garzik, Jason Wang, David S. Miller, netdev
In-Reply-To: <1353527481.26346.150.camel@shinybook.infradead.org>

On Wed, 2012-11-21 at 19:51 +0000, David Woodhouse wrote:
> On Wed, 2012-11-21 at 13:12 -0500, Jeff Garzik wrote:
> > 
> > What sticks out at me from the commit message?
> > 
> > It was not tested on the famously quirky 8139 hardware at all.
> > 
> > While I have not looked at the 8139C+ data sheet in a while, sometimes 
> > the hardware _did_ have a strange init order.
> > 
> > As this works in a simulator but fails on real hardware, it seems like 
> > an obvious regression caused by an untested [on read hardware] patch.
> 
> The data sheet (v1.6, from http://realtek.info/pdf/rtl8139cp.pdf ) says
> in §6.33 (C+ Command Register):
>  "Enable C+ mode functions in C+CR register first,
>  => Enable transmit/receive in Command register (offset 37h),
>  => Configure other related registers (ex. Descriptor start address,
>     TCR, RCR, ...)."
>
> I understand the concern expressed in the offending commit message about
> DMA happening to invalid addresses, and I'll look at the data sheet
> harder to see when the DMA actually starts happening. But it definitely
> seems that our current code isn't doing what the data sheet says.
> 
> I wonder if I can find one of these lying around and stick it in a
> machine with an IOMMU...

You might be able to avoid disaster by doing:

1. Set MAC filter to drop everything
2. Enable RX DMA
3. Set RX DMA ring address
4. Set MAC filter according to current flags & multicast list

I'm assuming, knowing nothing about this particular hardware, that the
MAC filter register(s) will accept writes before RX DMA is enabled.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 1/1] asix: use ramdom hw addr if the one read is not valid
From: Sergei Shtylyov @ 2012-11-21 20:16 UTC (permalink / raw)
  To: Jean-Christophe PLAGNIOL-VILLARD
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1353493362-30418-1-git-send-email-plagnioj-sclMFOaUSTBWk0Htik3J/w@public.gmane.org>

Hello.

On 11/21/2012 01:22 PM, Jean-Christophe PLAGNIOL-VILLARD wrote:

> Signed-off-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj-sclMFOaUSTBWk0Htik3J/w@public.gmane.org>
> Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
>  drivers/net/usb/asix_devices.c |   24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)

> diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
> index 33ab824..7ebec5b 100644
> --- a/drivers/net/usb/asix_devices.c
> +++ b/drivers/net/usb/asix_devices.c
> @@ -225,7 +225,13 @@ static int ax88172_bind(struct usbnet *dev, struct usb_interface *intf)
>  			   ret);
>  		goto out;
>  	}
> -	memcpy(dev->net->dev_addr, buf, ETH_ALEN);
> +
> +	if (is_valid_ether_addr(buf)) {
> +		memcpy(dev->net->dev_addr, buf, ETH_ALEN);
> +	} else {
> +		netdev_info(dev->net, "invalid hw address, using random\n");
> +		eth_hw_addr_random(dev->net);
> +	}
>  
>  	/* Initialize MII structure */
>  	dev->mii.dev = dev->net;
> @@ -423,7 +429,13 @@ static int ax88772_bind(struct usbnet *dev, struct usb_interface *intf)
>  		netdev_dbg(dev->net, "Failed to read MAC address: %d\n", ret);
>  		return ret;
>  	}
> -	memcpy(dev->net->dev_addr, buf, ETH_ALEN);
> +
> +	if (is_valid_ether_addr(buf)) {
> +		memcpy(dev->net->dev_addr, buf, ETH_ALEN);
> +	} else {
> +		netdev_info(dev->net, "invalid hw address, using random\n");
> +		eth_hw_addr_random(dev->net);
> +	}
>  
>  	/* Initialize MII structure */
>  	dev->mii.dev = dev->net;
> @@ -777,7 +789,13 @@ static int ax88178_bind(struct usbnet *dev, struct usb_interface *intf)
>  		netdev_dbg(dev->net, "Failed to read MAC address: %d\n", ret);
>  		return ret;
>  	}
> -	memcpy(dev->net->dev_addr, buf, ETH_ALEN);
> +
> +	if (is_valid_ether_addr(buf)) {
> +		memcpy(dev->net->dev_addr, buf, ETH_ALEN);
> +	} else {
> +		netdev_info(dev->net, "invalid hw address, using random\n");
> +		eth_hw_addr_random(dev->net);
> +	}
>  
>  	/* Initialize MII structure */
>  	dev->mii.dev = dev->net;

   Repeated thrice, this asks to be put into subroutine...

WBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: 8139cp: set ring address before enabling receiver
From: David Woodhouse @ 2012-11-21 19:51 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Jason Wang, David S. Miller, netdev
In-Reply-To: <50AD1972.5080403@pobox.com>

[-- Attachment #1: Type: text/plain, Size: 1172 bytes --]

On Wed, 2012-11-21 at 13:12 -0500, Jeff Garzik wrote:
> 
> What sticks out at me from the commit message?
> 
> It was not tested on the famously quirky 8139 hardware at all.
> 
> While I have not looked at the 8139C+ data sheet in a while, sometimes 
> the hardware _did_ have a strange init order.
> 
> As this works in a simulator but fails on real hardware, it seems like 
> an obvious regression caused by an untested [on read hardware] patch.

The data sheet (v1.6, from http://realtek.info/pdf/rtl8139cp.pdf ) says
in §6.33 (C+ Command Register):
 "Enable C+ mode functions in C+CR register first,
 => Enable transmit/receive in Command register (offset 37h),
 => Configure other related registers (ex. Descriptor start address,
    TCR, RCR, ...)."

I understand the concern expressed in the offending commit message about
DMA happening to invalid addresses, and I'll look at the data sheet
harder to see when the DMA actually starts happening. But it definitely
seems that our current code isn't doing what the data sheet says.

I wonder if I can find one of these lying around and stick it in a
machine with an IOMMU...

-- 
dwmw2

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply

* Re: Packet Corruption with Atheros Communications Inc. AR8121/AR8113/AR8114 Gigabit or Fast Ethernet (rev b0) Interface
From: Martin Tessun @ 2012-11-21 19:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1353518049.2590.41.camel@edumazet-glaptop>

Am 21.11.2012 18:14, schrieb Eric Dumazet:
> On Wed, 2012-11-21 at 15:54 +0100, Martin Tessun wrote:
[snip bug description]
>
> Is it only happening for IPv6 traffic, or is it also buggy for IPv4 ?
>
> If you disable tso , does the bug disappear ?
>
>
>

It also happens with IPv4 (usually not so fast as with IPv6):

SCP from client:

$ scp -4 server:/export/no_backup/burn/openSUSE-12.2-DVD-x86_64.iso .
openSUSE-12.2-DVD-x86_64.iso 
 
                  4%  184MB   9.3MB/s   07:37 ETA
Corrupted MAC on input.
Disconnecting: Packet corrupt
openSUSE-12.2-DVD-x86_64.iso 
 
                  4%  193MB   9.2MB/s   07:40 ETA
lost connection

SCP initiated from the server:
$ scp -4 openSUSE-12.2-DVD-x86_64.iso uhura:/tmp
openSUSE-12.2-DVD-x86_64.iso 
 
                  3%  149MB   8.8MB/s   08:10 ETA
Received disconnect from 10.100.14.100: 2: Packet corrupt
lost connection

With TSO disabled:
$ sudo ethtool -K eth0 tso off
$ sudo ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: off

SCP (IPv4) initiated from client:
$ scp -4 server:/export/no_backup/burn/openSUSE-12.2-DVD-x86_64.iso .
openSUSE-12.2-DVD-x86_64.iso 
 
                  2%   93MB  10.4MB/s   06:59 ETA
Corrupted MAC on input.
Disconnecting: Packet corrupt
lost connection

SCP (IPv4) initated from server:
$ scp -4 openSUSE-12.2-DVD-x86_64.iso uhura:/tmp
openSUSE-12.2-DVD-x86_64.iso 
 
                  4%  179MB   9.2MB/s   07:46 ETA
Received disconnect from 10.100.14.100: 2: Packet corrupt
lost connection


Regards,
Martin

^ permalink raw reply

* Re: [PATCH 1/1 v2] net: add micrel KSZ8873MLL switch support
From: Joe Perches @ 2012-11-21 19:03 UTC (permalink / raw)
  To: Jean-Christophe PLAGNIOL-VILLARD; +Cc: linux-arm-kernel, netdev
In-Reply-To: <1353512287-24048-1-git-send-email-plagnioj@jcrosoft.com>

On Wed, 2012-11-21 at 16:38 +0100, Jean-Christophe PLAGNIOL-VILLARD
wrote:
> this will allow to detect the link between the switch and the soc
[]
> diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
[]
> @@ -127,6 +127,39 @@ static int ks8051_config_init(struct phy_device *phydev)
[]
> +int ksz8873mll_read_status(struct phy_device *phydev)
> +{
[]
> +	regval = phy_read(phydev, KSZ8873MLL_GLOBAL_CONTROL_4);
> +
> +	if (regval & KSZ8873MLL_GLOBAL_CONTROL_4_DUPLEX)
> +		phydev->duplex = DUPLEX_HALF;
> +	else
> +		phydev->duplex = DUPLEX_FULL;

This doesn't check for phy_read errors.
Shouldn't it?

^ permalink raw reply

* Re: [PATCH net-next 09/17] net: Allow userns root control of the core of the network stack.
From: Ben Hutchings @ 2012-11-21 18:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers, David Miller
In-Reply-To: <87lie13q18.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

On Fri, 2012-11-16 at 18:46 -0800, Eric W. Biederman wrote:
> Ben Hutchings <bhutchings-s/n/eUQHGBpZroRs9YW3xA@public.gmane.org> writes:
> 
> > On Fri, 2012-11-16 at 06:32 -0800, Eric W. Biederman wrote:
> >> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> >> 
> >> > On 11/16/2012 05:03 PM, Eric W. Biederman wrote:
> >> >> +	if (!capable(CAP_NET_ADMIN))
> >> >> +		return -EPERM;
> >> >> +
> >> >>  	return netdev_store(dev, attr, buf, len, change_tx_queue_len);
> >> >
> >> > You mean ns_capable here?
> >> 
> >> No.  There I meant capable.
> >> 
> >> I deliberately call capable here because I don't understand what
> >> the tx_queue_len well enough to be certain it is safe to relax
> >> that check to be just ns_capable.
> >> 
> >> My get feel is that allowing an unprivileged user to be able to
> >> arbitrarily change the tx_queue_len on a networking device would be a
> >> nice way to allow queuing as many network packets as you would like with
> >> kernel memory and DOSing the machine.
> >> 
> >> So since with a quick read of the code I could not convince myself it
> >> was safe to allow unprivilged users to change tx_queue_len I left it
> >> protected by capable.  While at the same time I relaxed the check in
> >> netdev_store to be ns_capable.
> >
> > Tor the same reason you had better be very selective about which ethtool
> > commands are allowed based on per-user_ns CAP_NET_ADMIN.  Consider for a
> > start:
> >
> > ETHTOOL_SEEPROM => brick the NIC
> > ETHTOOL_FLASHDEV => brick the NIC; own the system if it's not using an IOMMU
> 
> These are prevented by not having access to real hardware by default. A
> physical network interface must be moved into a network namespace for
> you to have access to it.

Yes, I realise that.  The question is whether you would expect anything
in a container to be able to do those things, even with a physical net
device assigned to it.

Actually we have the same issue without considering containers - should
CAP_NET_ADMIN really give you low-level control over hardware just
because it's networking hardware?  I think some of these ethtool
operations, and access to non-standard MDIO registers, should perhaps
require an additional capability (CAP_SYS_ADMIN or CAP_SYS_RAWIO?).

> There are a handful of software network devices that are generally safe
> macvlan, veth, tun, ipip tunnels, etc.  Using those network devices is
> very interesting and about as performant as you can get while still
> being safe.
> 
> A buffer overflow in an ethtool command looks as likely to me as being
> able to own the system by reflashing the NIC.

Sure, if you can find one.  But on many NICs the firmware can perform
more or less arbitrary DMA *by design* (one reason for using IOMMUs),
and the ability to update the firmware is not a bug to be fixed!

> Access to a real physical NIC is an act of trust.  Given the general
> linux policy that drivers are merged when they mostly work I don't
> currently know of any trust models between "I trust you with full access
> to this device" and "I don't trust you with direct access to this
> device" that I would feel confident giving to an untrusted user.

At the moment it's 'I trust you with full access to *all* network
devices' (init ns CAP_NET_ADMIN), 'I trust you with some reconfiguration
of these network devices' (other ns CAP_NET_ADMIN) and 'I don't trust
you...'

You're expanding what other-ns-CAP_NET_ADMIN means, to 'I trust you with
full access to these network devices'.

> Which is a convoluted way of saying "ip link set eth0 netns bob" is the
> moral equivalent of "chown bob.bob /dev/eth0; chmod u+rwx /dev/eth0"
[...]

And it's previously been decided that ownership of a block device still
should *not* mean full control over it (see responses to CVE-2011-4127).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: Re: [PATCH 9/9] batman-adv: Use packing of 2 for all headers before an ethernet header
From: Sven Eckelmann @ 2012-11-21 18:20 UTC (permalink / raw)
  To: David Miller
  Cc: ordex, netdev, b.a.t.m.a.n, lindner_marek, kevin.curtis,
	linux-driver, ron.mercer, jitendra.kalsaria, rmody, andy, fubar
In-Reply-To: <20121121.125759.322762083309845086.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1961 bytes --]

On Wednesday 21 November 2012 12:57:59 David Miller wrote:
> From: Antonio Quartulli <ordex@autistici.org>
> Date: Wed, 21 Nov 2012 13:11:59 +0100
> 
> > +#pragma pack(2)
> 
>  ...
> 
> > -} __packed;
> 
> The __packed attribute is an abstraction of the actual syntax
> the compiler uses, if it is supported at all.
> 
> Therefore, you can't just unconditionally use the #pragma, and
> you would need to use some kind of similar compiler abstraction
> for it.
> 
> But to be honest this is really ugly and for very little, if any,
> gain.

Agree. But we should get this message also to the other guys

$ git grep pragma -- drivers/net
drivers/net/bonding/bond_3ad.h:#pragma pack(1)
drivers/net/bonding/bond_3ad.h:#pragma pack()
drivers/net/bonding/bond_3ad.h:#pragma pack(8)
drivers/net/bonding/bond_3ad.h:#pragma pack()
drivers/net/bonding/bond_alb.c:#pragma pack(1)
drivers/net/bonding/bond_alb.c:#pragma pack()
drivers/net/ethernet/brocade/bna/bfa_defs.h:#pragma pack(1)
drivers/net/ethernet/brocade/bna/bfa_defs.h:#pragma pack()
drivers/net/ethernet/brocade/bna/bfa_defs_cna.h:#pragma pack(1)
drivers/net/ethernet/brocade/bna/bfa_defs_cna.h:#pragma pack()
drivers/net/ethernet/brocade/bna/bfa_defs_mfg_comm.h:#pragma pack(1)
drivers/net/ethernet/brocade/bna/bfa_defs_mfg_comm.h:#pragma pack()
drivers/net/ethernet/brocade/bna/bfi.h:#pragma pack(1)
drivers/net/ethernet/brocade/bna/bfi.h:#pragma pack()
drivers/net/ethernet/brocade/bna/bfi_cna.h:#pragma pack(1)
drivers/net/ethernet/brocade/bna/bfi_cna.h:#pragma pack()
drivers/net/ethernet/brocade/bna/bfi_enet.h:#pragma pack(1)
drivers/net/ethernet/brocade/bna/bfi_enet.h:#pragma pack()
drivers/net/ethernet/brocade/bna/cna.h:#pragma pack(1)
drivers/net/ethernet/brocade/bna/cna.h:#pragma pack()
drivers/net/ethernet/qlogic/qla3xxx.h:#pragma pack(1)
drivers/net/ethernet/qlogic/qla3xxx.h:#pragma pack()
drivers/net/wan/farsync.c:#pragma pack(1)
drivers/net/wan/farsync.c:#pragma pack()

Kind regards,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* [PATCH] brcmsmac: Add __printf verification to logging prototypes
From: Joe Perches @ 2012-11-21 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Brett Rudley, Roland Vossen, Arend van Spriel,
	Franky (Zhenhui) Lin, Kan Yan, John W. Linville, linux-wireless,
	brcm80211-dev-list, netdev

Adding __printf helps spot format and argument mismatches.

Signed-off-by: Joe Perches <joe@perches.com>
---
 drivers/net/wireless/brcm80211/brcmsmac/debug.h |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/brcm80211/brcmsmac/debug.h b/drivers/net/wireless/brcm80211/brcmsmac/debug.h
index c0d2cf7..f77066b 100644
--- a/drivers/net/wireless/brcm80211/brcmsmac/debug.h
+++ b/drivers/net/wireless/brcm80211/brcmsmac/debug.h
@@ -8,17 +8,23 @@
 #include "main.h"
 #include "mac80211_if.h"
 
+__printf(2, 3)
 void __brcms_info(struct device *dev, const char *fmt, ...);
+__printf(2, 3)
 void __brcms_warn(struct device *dev, const char *fmt, ...);
+__printf(2, 3)
 void __brcms_err(struct device *dev, const char *fmt, ...);
+__printf(2, 3)
 void __brcms_crit(struct device *dev, const char *fmt, ...);
 
 #if defined(CONFIG_BRCMDBG) || defined(CONFIG_BRCM_TRACING)
+__printf(4, 5)
 void __brcms_dbg(struct device *dev, u32 level, const char *func,
 		 const char *fmt, ...);
 #else
-static inline void __brcms_dbg(struct device *dev, u32 level,
-			       const char *func, const char *fmt, ...)
+static inline __printf(4, 5)
+void __brcms_dbg(struct device *dev, u32 level, const char *func,
+		 const char *fmt, ...)
 {
 }
 #endif
-- 
1.7.8.112.g3fd21

^ permalink raw reply related

* Re: 8139cp: set ring address before enabling receiver
From: Jeff Garzik @ 2012-11-21 18:12 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jason Wang, David S. Miller, netdev
In-Reply-To: <1353517042.26346.130.camel@shinybook.infradead.org>

On 11/21/2012 11:57 AM, David Woodhouse wrote:
> On Sat, 2012-06-02 at 23:50 +0000, Linux Kernel Mailing List wrote:
>> Gitweb:     http://git.kernel.org/linus/;a=commit;h=b01af4579ec41f48e9b9c774e70bd6474ad210db
>> Commit:     b01af4579ec41f48e9b9c774e70bd6474ad210db
>> Parent:     20e2a86485967c385d7c7befc1646e4d1d39362e
>> Author:     Jason Wang <jasowang@redhat.com>
>> AuthorDate: Thu May 31 18:19:39 2012 +0000
>> Committer:  David S. Miller <davem@davemloft.net>
>> CommitDate: Fri Jun 1 14:22:11 2012 -0400
>>
>>      8139cp: set ring address before enabling receiver
>>
>>      Currently, we enable the receiver before setting the ring address which could
>>      lead the card DMA into unexpected areas. Solving this by set the ring address
>>      before enabling the receiver.
>>
>>      btw. I find and test this in qemu as I didn't have a 8139cp card in hand. please
>>      review it carefully.

What sticks out at me from the commit message?

It was not tested on the famously quirky 8139 hardware at all.

While I have not looked at the 8139C+ data sheet in a while, sometimes 
the hardware _did_ have a strange init order.

As this works in a simulator but fails on real hardware, it seems like 
an obvious regression caused by an untested [on read hardware] patch.

	Jeff

^ permalink raw reply

* Re: [PATCH 9/9] batman-adv: Use packing of 2 for all headers before an ethernet header
From: David Miller @ 2012-11-21 17:57 UTC (permalink / raw)
  To: ordex; +Cc: netdev, b.a.t.m.a.n, sven, lindner_marek
In-Reply-To: <1353499919-28596-10-git-send-email-ordex@autistici.org>

From: Antonio Quartulli <ordex@autistici.org>
Date: Wed, 21 Nov 2012 13:11:59 +0100

> +#pragma pack(2)
 ...
> -} __packed;

The __packed attribute is an abstraction of the actual syntax
the compiler uses, if it is supported at all.

Therefore, you can't just unconditionally use the #pragma, and
you would need to use some kind of similar compiler abstraction
for it.

But to be honest this is really ugly and for very little, if any,
gain.

^ permalink raw reply

* Re: [PATCH RFC 3/5] printk: modify printk interface for syslog_namespace
From: Serge E. Hallyn @ 2012-11-21 17:49 UTC (permalink / raw)
  To: Rui Xiang
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <50A9EAF0.4000902-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

I notice that you haven't made any changes to the struct cont.  I
suspect this means that to-be-continued msgs from one ns can be
erroneously mixed with another ns.

You said you don't mind putting the syslogns into the userns.  If
there's no reason not to do that, then we should do so as it will
remove a bunch of code (plus the use of a new CLONE flag) from your
patch, and the new syslog(NEW_NS) command from mine.

Now IMO the ideal place for syslog_ns would be in the devices ns,
but that does not yet exist, and may never.  The bonus to that would
be that the consoles sort of belong there.  I avoid this by not
having consoles in child syslog namespaces.  You put the console in
the ns.  I haven't looked closely enough to see if what you do is
ok (will do so soon).

WOuld you mind looking through my patch to see if it suffices for
your needs?  Where it does not, patches would be greatly appreciated
if simple enough.

Note I'm not at all wedded to my patchset.  I'm happy to go with
something else entirely.  My set was just a proof of concept.

thanks,
-serge

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2012-11-21 17:36 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) inet6_csk_update_pmtu() must return NULL or non-NULL, so translate
   ERR_PTR to NULL, as needed.  Fix from Eric Dumazet.

2) Fix copy&paste error in IRDA sir_dev ->set_speed method invocation,
   it was testing the NULL'ness of a different method to guard the call.
   Fix from Alexander Shiyan.

3) Fix build regression of xilinx driver, from Jeff Mahoney.

4) Make XEN netfront (like XEN netback) handle compound pages
   in SKBs properly.  From Ian Campbell.

5) Fix inverted logic of team_dev_queue_xmit() return value checks,
   from Jiri Pirko and Dan Carpenter.

6) dma_poll_create() no longer allows a NULL device argument, breaking
   both ixp4xx drivers.  Fix from Xi Wang.

7) ne2000 driver doesn't hook up the parent device properly, breaking
   udev matching.  Fix from Alan Cox.

8) Locking and memory leak fixes in Near Field Communications layer.
   From Thierry Escande, Szymon Janc, and Waldemar Rymarkiewicz.

9) sis900 resume regression, sis900_set_mode() is being called with
   the iomem pointer instead of the expected device private.  Fix
   from Francois Romieu.

10) Fix IBSS regression caused by uninitializing the ibss-internals
    before performing an emptyness check, from Simon WUnderlich.

11) Fix SNIFFER mode regression in iwlwifi driver, from Johannes Berg.

12) Fix task wedges in mwifiex_cmd_timeout_func(), from Bing Zhao.

13) Add back wireless sysfs directory, too much stuff depends upon it
    being there.  (actually I'd say it never should have been removed
    to begin with)  From Johannes Berg.

14) Fix hang introduced by suspend/resume changes in ath9k.  Fix from
    Sujith Manoharan.

Please pull, thanks a lot!

The following changes since commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5:

  fanotify: fix FAN_Q_OVERFLOW case of fanotify_read() (2012-11-18 09:30:00 -1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net master

for you to fetch changes up to 403f43c937d24832b18524f65415c0bbba6b5064:

  team: bcast: convert return value of team_dev_queue_xmit() to bool correctly (2012-11-21 11:55:07 -0500)

----------------------------------------------------------------
Alan Cox (1):
      ne2000: add the right platform device

Albert Pool (1):
      rtlwifi: rtl8192cu: Add new USB ID

Alexander Shiyan (1):
      irda: sir_dev: Fix copy/paste typo

Bing Zhao (2):
      mwifiex: fix system hang issue in cmd timeout error case
      mwifiex: report error to MMC core if we cannot suspend

Emmanuel Grumbach (1):
      iwlwifi: don't WARN when a non empty queue is disabled

Eric Dumazet (1):
      ipv6: fix inet6_csk_update_pmtu() return value

Francois Romieu (1):
      sis900: fix sis900_set_mode call parameters.

Ian Campbell (1):
      xen/netfront: handle compound page fragments on transmit

Jeff Mahoney (1):
      net: fix build failure in xilinx

Jiri Pirko (1):
      team: bcast: convert return value of team_dev_queue_xmit() to bool correctly

Johannes Berg (2):
      iwlwifi: fix monitor mode FCS flag
      wireless: add back sysfs directory

John W. Linville (4):
      Merge branch 'for-john' of git://git.kernel.org/.../jberg/mac80211
      Merge branch 'for-john' of git://git.kernel.org/.../iwlwifi/iwlwifi-fixes
      Merge tag 'nfc-fixes-3.7-1' of git://git.kernel.org/.../sameo/nfc-3.0
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless into for-davem

Sarveshwar Bandi (1):
      bonding: Bonding driver does not consider the gso_max_size/gso_max_segs setting of slave devices.

Simon Wunderlich (1):
      mac80211: deinitialize ibss-internals after emptiness check

Srinivas Kandagatla (1):
      of/net/mdio-gpio: Fix pdev->id issue when using devicetrees.

Sujith Manoharan (1):
      ath9k_hw: Fix regression in device reset

Szymon Janc (2):
      NFC: pn533: Fix missing lock while operating on commands list
      NFC: pn533: Fix use after free

Thierry Escande (2):
      NFC: Fix nfc_llcp_local chained list insertion
      NFC: Fix pn533 target mode memory leak

Waldemar Rymarkiewicz (1):
      NFC: pn533: Fix mem leak in pn533_in_dep_link_up

Xi Wang (2):
      ixp4xx_eth: avoid calling dma_pool_create() with NULL dev
      ixp4xx_hss: avoid calling dma_pool_create() with NULL dev

 Documentation/devicetree/bindings/net/mdio-gpio.txt |  9 +++++-
 drivers/net/bonding/bond_main.c                     |  7 +++++
 drivers/net/ethernet/8390/ne.c                      |  1 +
 drivers/net/ethernet/sis/sis900.c                   |  2 +-
 drivers/net/ethernet/xilinx/xilinx_axienet_main.c   |  2 ++
 drivers/net/ethernet/xscale/ixp4xx_eth.c            |  8 +++--
 drivers/net/irda/sir_dev.c                          |  2 +-
 drivers/net/phy/mdio-gpio.c                         | 11 ++++---
 drivers/net/team/team_mode_broadcast.c              |  6 ++--
 drivers/net/wan/ixp4xx_hss.c                        |  8 +++--
 drivers/net/wireless/ath/ath9k/hw.c                 |  2 +-
 drivers/net/wireless/iwlwifi/dvm/mac80211.c         | 14 +++++++++
 drivers/net/wireless/iwlwifi/pcie/tx.c              |  8 -----
 drivers/net/wireless/mwifiex/cmdevt.c               | 11 +++++--
 drivers/net/wireless/mwifiex/sdio.c                 | 11 ++++---
 drivers/net/wireless/rtlwifi/rtl8192cu/sw.c         |  1 +
 drivers/net/xen-netfront.c                          | 98 +++++++++++++++++++++++++++++++++++++++++++++-------------
 drivers/nfc/pn533.c                                 | 25 ++++++++-------
 net/core/net-sysfs.c                                | 20 ++++++++++++
 net/ipv6/inet6_connection_sock.c                    |  3 +-
 net/mac80211/ibss.c                                 |  8 ++---
 net/nfc/llcp/llcp.c                                 |  2 +-
 22 files changed, 188 insertions(+), 71 deletions(-)

^ permalink raw reply

* Re: net, bluetooth: object debug warning in bt_host_release()
From: Gustavo Padovan @ 2012-11-21 17:27 UTC (permalink / raw)
  To: Sasha Levin
  Cc: marcel-kz+m5ild9QBg9hUCZPvPmw, Johan Hedberg, David S. Miller,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Dave Jones
In-Reply-To: <50A272E7.9080103-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 5691 bytes --]

* Sasha Levin <sasha.levin-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> [2012-11-13 11:18:47 -0500]:

> Hi all,
> 
> While fuzzing with trinity on a KVM tools (lkvm) guest running latest -next kernel I've
> stumbled on the following:
> 
> [ 1434.201149] ------------[ cut here ]------------
> [ 1434.204998] WARNING: at lib/debugobjects.c:261 debug_print_object+0x8e/0xb0()
> [ 1434.208324] ODEBUG: free active (active state 0) object type: work_struct hint: hci_power_on+0x0/0x90
> [ 1434.210386] Pid: 8564, comm: trinity-child25 Tainted: G        W    3.7.0-rc5-next-20121112-sasha-00018-g2f4ce0e #127
> [ 1434.210760] Call Trace:
> [ 1434.210760]  [<ffffffff819f3d6e>] ? debug_print_object+0x8e/0xb0
> [ 1434.210760]  [<ffffffff8110b887>] warn_slowpath_common+0x87/0xb0
> [ 1434.210760]  [<ffffffff8110b911>] warn_slowpath_fmt+0x41/0x50
> [ 1434.210760]  [<ffffffff819f3d6e>] debug_print_object+0x8e/0xb0
> [ 1434.210760]  [<ffffffff8376b750>] ? hci_dev_open+0x310/0x310
> [ 1434.210760]  [<ffffffff83bf94e5>] ? _raw_spin_unlock_irqrestore+0x55/0xa0
> [ 1434.210760]  [<ffffffff819f3ee5>] __debug_check_no_obj_freed+0xa5/0x230
> [ 1434.210760]  [<ffffffff83785db0>] ? bt_host_release+0x10/0x20
> [ 1434.210760]  [<ffffffff819f4d15>] debug_check_no_obj_freed+0x15/0x20
> [ 1434.210760]  [<ffffffff8125eee7>] kfree+0x227/0x330
> [ 1434.210760]  [<ffffffff83785db0>] bt_host_release+0x10/0x20
> [ 1434.210760]  [<ffffffff81e539e5>] device_release+0x65/0xc0
> [ 1434.210760]  [<ffffffff819d3975>] kobject_cleanup+0x145/0x190
> [ 1434.210760]  [<ffffffff819d39cd>] kobject_release+0xd/0x10
> [ 1434.210760]  [<ffffffff819d33cc>] kobject_put+0x4c/0x60
> [ 1434.210760]  [<ffffffff81e548b2>] put_device+0x12/0x20
> [ 1434.210760]  [<ffffffff8376a334>] hci_free_dev+0x24/0x30
> [ 1434.210760]  [<ffffffff82fd8fe1>] vhci_release+0x31/0x60
> [ 1434.210760]  [<ffffffff8127be12>] __fput+0x122/0x250
> [ 1434.210760]  [<ffffffff811cab0d>] ? rcu_user_exit+0x9d/0xd0
> [ 1434.210760]  [<ffffffff8127bf49>] ____fput+0x9/0x10
> [ 1434.210760]  [<ffffffff81133402>] task_work_run+0xb2/0xf0
> [ 1434.210760]  [<ffffffff8106cfa7>] do_notify_resume+0x77/0xa0
> [ 1434.210760]  [<ffffffff83bfb0ea>] int_signal+0x12/0x17
> [ 1434.210760] ---[ end trace a6d57fefbc8a8cc7 ]---
> 
> Not that the guest doesn't emulate anything that looks like a bluetooth device or
> has bluetooth capabilities.

You have a virtual bluetooth device (vhci). That is why you get a bluetooth
crash. I think the following patch will fix this issue.

	Gustavo

---
Author: Gustavo Padovan <gustavo.padovan-ZGY8ohtN/8pPYcu2f3hruQ@public.gmane.org>
Date:   Wed Nov 21 00:50:21 2012 -0200

    Bluetooth: cancel power_on work when unregistering the device
    
    We need to cancel the hci_power_on work in order to avoid it run when we
    try to free the hdev.
    
    [ 1434.201149] ------------[ cut here ]------------
    [ 1434.204998] WARNING: at lib/debugobjects.c:261 debug_print_object+0x8e/0xb0()
    [ 1434.208324] ODEBUG: free active (active state 0) object type: work_struct hint:
    _power_on+0x0/0x90
    [ 1434.210386] Pid: 8564, comm: trinity-child25 Tainted: G        W    3.7.0-rc5-n
    20121112-sasha-00018-g2f4ce0e #127
    [ 1434.210760] Call Trace:
    [ 1434.210760]  [<ffffffff819f3d6e>] ? debug_print_object+0x8e/0xb0
    [ 1434.210760]  [<ffffffff8110b887>] warn_slowpath_common+0x87/0xb0
    [ 1434.210760]  [<ffffffff8110b911>] warn_slowpath_fmt+0x41/0x50
    [ 1434.210760]  [<ffffffff819f3d6e>] debug_print_object+0x8e/0xb0
    [ 1434.210760]  [<ffffffff8376b750>] ? hci_dev_open+0x310/0x310
    [ 1434.210760]  [<ffffffff83bf94e5>] ? _raw_spin_unlock_irqrestore+0x55/0xa0
    [ 1434.210760]  [<ffffffff819f3ee5>] __debug_check_no_obj_freed+0xa5/0x230
    [ 1434.210760]  [<ffffffff83785db0>] ? bt_host_release+0x10/0x20
    [ 1434.210760]  [<ffffffff819f4d15>] debug_check_no_obj_freed+0x15/0x20
    [ 1434.210760]  [<ffffffff8125eee7>] kfree+0x227/0x330
    [ 1434.210760]  [<ffffffff83785db0>] bt_host_release+0x10/0x20
    [ 1434.210760]  [<ffffffff81e539e5>] device_release+0x65/0xc0
    [ 1434.210760]  [<ffffffff819d3975>] kobject_cleanup+0x145/0x190
    [ 1434.210760]  [<ffffffff819d39cd>] kobject_release+0xd/0x10
    [ 1434.210760]  [<ffffffff819d33cc>] kobject_put+0x4c/0x60
    [ 1434.210760]  [<ffffffff81e548b2>] put_device+0x12/0x20
    [ 1434.210760]  [<ffffffff8376a334>] hci_free_dev+0x24/0x30
    [ 1434.210760]  [<ffffffff82fd8fe1>] vhci_release+0x31/0x60
    [ 1434.210760]  [<ffffffff8127be12>] __fput+0x122/0x250
    [ 1434.210760]  [<ffffffff811cab0d>] ? rcu_user_exit+0x9d/0xd0
    [ 1434.210760]  [<ffffffff8127bf49>] ____fput+0x9/0x10
    [ 1434.210760]  [<ffffffff81133402>] task_work_run+0xb2/0xf0
    [ 1434.210760]  [<ffffffff8106cfa7>] do_notify_resume+0x77/0xa0
    [ 1434.210760]  [<ffffffff83bfb0ea>] int_signal+0x12/0x17
    [ 1434.210760] ---[ end trace a6d57fefbc8a8cc7 ]---
    
    Reported-by: Sasha Levin <sasha.levin-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
    Signed-off-by: Gustavo Padovan <gustavo.padovan-ZGY8ohtN/8pPYcu2f3hruQ@public.gmane.org>

diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 81f4bac..69eb644 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -1854,6 +1854,8 @@ void hci_unregister_dev(struct hci_dev *hdev)
        for (i = 0; i < NUM_REASSEMBLY; i++)
                kfree_skb(hdev->reassembly[i]);
 
+       cancel_work_sync(&hdev->power_on);
+
        if (!test_bit(HCI_INIT, &hdev->flags) &&
            !test_bit(HCI_SETUP, &hdev->dev_flags)) {
                hci_dev_lock(hdev);


[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related

* Re: 3.6 routing cache regression, multicast loopback broken
From: David Miller @ 2012-11-21 17:25 UTC (permalink / raw)
  To: ja; +Cc: mbizon, netdev
In-Reply-To: <alpine.LFD.2.00.1211210055040.1751@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Wed, 21 Nov 2012 02:35:34 +0200 (EET)

> [PATCH net-next] ipv4: do not cache looped multicasts
> 
> 	We have two cases for outgoing multicasts
> depending on the membership: with or without RTCF_LOCAL.
> Currently, we use caching only if route to 224/4 is used.
> As we can not cache for both cases, optimize the caching
> for senders only, do not cache routes with RTCF_LOCAL
> flag set.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

The sad part is that you warned me about this issue nearly
2 years ago Julian, sorry :-/

^ permalink raw reply

* Re: Packet Corruption with Atheros Communications Inc. AR8121/AR8113/AR8114 Gigabit or Fast Ethernet (rev b0) Interface
From: Eric Dumazet @ 2012-11-21 17:14 UTC (permalink / raw)
  To: Martin Tessun; +Cc: netdev
In-Reply-To: <50ACEB1D.2020303@gmx.de>

On Wed, 2012-11-21 at 15:54 +0100, Martin Tessun wrote:
> Hi @all,
> 
> unfortunately I don't have the old postings any more, so I start a new 
> thread.
> 
> Following situation:
> 
> I have a Server (NFS) and a Client with the Atheros network card. If I 
> transfer big files, the md5sum of these files differ.
> If I try to scp the file I get "MAC corrupted on inpu".
> 
> Everything works fine, if running on any other OS (or with other NICs 
> than the Atheros one).
> 
> So here is the data:
> 
> lspci:
> 02:00.0 Ethernet controller: Atheros Communications Inc. 
> AR8121/AR8113/AR8114 Gigabit or Fast Ethernet (rev b0)
>          Subsystem: ASUSTeK Computer Inc. Device 831c
>          Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
> ParErr- Stepping- SERR+ FastB2B- DisINTx+
>          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>          Latency: 0, Cache Line Size: 64 bytes
>          Interrupt: pin A routed to IRQ 44
>          Region 0: Memory at fbec0000 (64-bit, non-prefetchable) [size=256K]
>          Region 2: I/O ports at dc00 [size=128]
>          Capabilities: [40] Power Management version 2
>                  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
> PME(D0-,D1-,D2-,D3hot+,D3cold+)
>                  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>          Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                  Address: 00000000fee0f00c  Data: 4181
>          Capabilities: [58] Express (v1) Endpoint, MSI 00
>                  DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s 
> <4us, L1 unlimited
>                          ExtTag- AttnBtn+ AttnInd+ PwrInd+ RBE- FLReset-
>                  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
> Unsupported-
>                          RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>                          MaxPayload 128 bytes, MaxReadReq 512 bytes
>                  DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ 
> AuxPwr+ TransPend-
>                  LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, 
> Latency L0 unlimited, L1 unlimited
>                          ClockPM- Surprise- LLActRep- BwNot-
>                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- 
> CommClk+
>                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
> DLActive- BWMgmt- ABWMgmt-
>          Capabilities: [100 v1] Advanced Error Reporting
>                  UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ 
> UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>                  UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
> UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                  UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- 
> UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                  CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
> NonFatalErr-
>                  CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
> NonFatalErr-
>                  AERCap: First Error Pointer: 14, GenCap+ CGenEn- 
> ChkCap+ ChkEn-
>          Capabilities: [180 v1] Device Serial Number ff-76-f8-79-00-26-18-ff
>          Kernel driver in use: ATL1E
> 
> 
> $ ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: off
> tx-checksumming: on
> scatter-gather: on
> tcp-segmentation-offload: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: off
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: off
> receive-hashing: off
> 
> 
> SCP initiated on client:
> $ scp server:/export/no_backup/burn/openSUSE-12.2-DVD-x86_64.iso .
> openSUSE-12.2-DVD-x86_64.iso 
>  
>                                                                1%   88MB 
>    9.8MB/s   07:26 ETA
> Corrupted MAC on input.
> Disconnecting: Packet corrupt
> lost connection
> 
> SCP initiated on server:
> $ scp /export/no_backup/burn/openSUSE-12.2-DVD-x86_64.iso client:/tmp
> Enter passphrase for key '/home/chewie/tessun/.ssh/id_rsa':
> openSUSE-12.2-DVD-x86_64.iso 
>  
>                                                                5%  229MB 
>    7.0MB/s   10:00 ETA
> Received disconnect from 2a01:198:366:100::fba5: 2: Packet corrupt
> lost connection
> 
> 
> ifconfig-output (Client):
>            RX packets:773481 errors:0 dropped:7620 overruns:0 frame:0
>            TX packets:589594 errors:0 dropped:0 overruns:0 carrier:1
>            collisions:0 Sendewarteschlangenlänge:1000
>            RX bytes:596932088 (569.2 Mb)  TX bytes:90583584 (86.3 Mb)
>            Interrupt:44
> 
> ifconfig-Output (Server):
>            RX packets:11579837 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:19734057 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:1000
>            RX bytes:2717120721 (2591.2 Mb)  TX bytes:22047372376 
> (21026.0 Mb)
>            Interrupt:42 Base address:0xc000
> 
> 
> As said: Other NICs than the Atheros work fine.
> 
> If you need additional Infos (tcpdumps, etc.) please let me know.
> 
> Regards,
> Martin

Is it only happening for IPv6 traffic, or is it also buggy for IPv4 ?

If you disable tso , does the bug disappear ?

^ permalink raw reply

* Re: [PATCH 2/2] smsc95xx: support PHY wakeup source
From: David Miller @ 2012-11-21 17:07 UTC (permalink / raw)
  To: steve.glendinning; +Cc: netdev
In-Reply-To: <1353504577-5719-2-git-send-email-steve.glendinning@shawell.net>

From: Steve Glendinning <steve.glendinning@shawell.net>
Date: Wed, 21 Nov 2012 13:29:37 +0000

> +{
> +	struct mii_if_info *mii = &dev->mii;
> +
> +        /* first, a dummy read, needed to latch some MII phys */
> +	int ret = smsc95xx_mdio_read_nopm(dev->net, mii->phy_id, MII_BMSR);

Please keep the local variable declarations together at the beginning
of the basic block, don't intermix empty lines and comments as you
are doing here.

^ permalink raw reply

* Re: [PATCH 1/2] smsc95xx: detect chip revision specific features
From: David Miller @ 2012-11-21 17:06 UTC (permalink / raw)
  To: steve.glendinning; +Cc: netdev
In-Reply-To: <1353504577-5719-1-git-send-email-steve.glendinning@shawell.net>

From: Steve Glendinning <steve.glendinning@shawell.net>
Date: Wed, 21 Nov 2012 13:29:36 +0000

> +	if ((val == ID_REV_CHIP_ID_9500A_) || (val == ID_REV_CHIP_ID_9530_) ||
> +		(val == ID_REV_CHIP_ID_89530_) || (val == ID_REV_CHIP_ID_9730_))
> +		pdata->features = FEATURE_8_WAKEUP_FILTERS
> +			| FEATURE_PHY_NLP_CROSSOVER | FEATURE_AUTOSUSPEND;

Please style this properly.

	if ((val == ID_REV_CHIP_ID_9500A_) || (val == ID_REV_CHIP_ID_9530_) ||
	    (val == ID_REV_CHIP_ID_89530_) || (val == ID_REV_CHIP_ID_9730_))
		pdata->features = (FEATURE_8_WAKEUP_FILTERS |
				   FEATURE_PHY_NLP_CROSSOVER |
				   FEATURE_AUTOSUSPEND);

^ permalink raw reply

* Re: [PATCH] pkt_sched: QFQ Plus: fair-queueing service at DRR cost
From: David Miller @ 2012-11-21 17:04 UTC (permalink / raw)
  To: paolo.valente; +Cc: jhs, shemminger, linux-kernel, netdev, rizzo, fchecconi
In-Reply-To: <50ACA2AA.1020206@unimore.it>

From: Paolo Valente <paolo.valente@unimore.it>
Date: Wed, 21 Nov 2012 10:45:14 +0100

> Got it. Actually, if the first qfq_peek_skb returns NULL, then the
> example version that you are proposing apparently may behave in a
> different way than the original one: in your proposal the scheduler
> tries to switch to a new aggregate and may return a non-NULL value,
> whereas the original version would immediately return NULL. I guess
> that this slightly different behavior is fine as well, and I am
> preparing a new patch that integrates these changes.

Thanks.

^ permalink raw reply

* Re: [net-next 06/10] ixgbe: eliminate Smatch warnings in ixgbe_debugfs.c
From: David Miller @ 2012-11-21 17:04 UTC (permalink / raw)
  To: dan.carpenter; +Cc: jeffrey.t.kirsher, joshua.a.hay, netdev, gospo, sassmann
In-Reply-To: <20121121110409.GG6186@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Wed, 21 Nov 2012 14:04:09 +0300

> On Wed, Nov 21, 2012 at 02:47:32AM -0800, Jeff Kirsher wrote:
>> +	len = simple_write_to_buffer(ixgbe_dbg_reg_ops_buf,
>> +				     sizeof(ixgbe_dbg_reg_ops_buf)-1,
>> +				     ppos,
>> +				     buffer,
>> +				     count);
>> +	if (len < 0)
>> +		return -EFAULT;
> 
> Any negative return is bad.
> 
> 	if (len)
> 		return len;
 ...
>> @@ -187,15 +196,15 @@ static ssize_t ixgbe_dbg_netdev_ops_write(struct file *filp,
 ...
>> +	if (len < 0)
>> +		return -EFAULT;
> 
> Same.

Agreed.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox