netdev.vger.kernel.org archive mirror
* [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)
@ 2015-09-20 22:29 Tom Herbert
  2015-09-20 22:29 ` [PATCH RFC 1/3] rcu: Add list_next_or_null_rcu Tom Herbert
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Tom Herbert @ 2015-09-20 22:29 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
The motivation for this is the observation that although TCP is a
byte stream transport protocol with no concept of message boundaries,
a common use case is to implement a framed application layer protocol
running over TCP. To date, most TCP stacks offer a byte stream API to
applications, which places the burden of message delineation, message
I/O operation atomicity, and load balancing in the application. With
KCM an application can efficiently send and receive application
protocol messages over TCP using a datagram interface.

In order to delineate messages in a TCP stream for receive in KCM, the
kernel implements a message parser. For this we chose to employ BPF,
which is applied to the TCP stream. BPF code parses application layer
messages and returns a message length. Nearly all binary application
protocols are parsable in this manner, so KCM should be applicable
across a wide range of applications. Other than message length
determination in receive, KCM does not require any other application
specific awareness. KCM does not implement any other application
protocol semantics-- these are provided in userspace or could be
implemented in a kernel module layered above KCM.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                 |               |                |
      +-----------+     |               |     +----------+
                  |     |               |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                 |   |           |           |  |
       +---------+   |           |           |  ------------+
       |             |           |           |              |
+----------+  +----------+  +----------+  +----------+ +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
+----------+  +----------+  +----------+  +----------+ +----------+
      |              |           |            |             |
+----------+  +----------+  +----------+  +----------+ +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
+----------+  +----------+  +----------+  +----------+ +----------+

The KCM sockets provide the datagram interface to applications;
Psocks hold the state for each attached TCP connection (i.e. where
message delineation is performed on receive).

A description of the APIs and design can be found in the included
Documentation/networking/kcm.txt.

Testing:

For testing I have been developing kcmperf and super_kcmperf which
should allow functional verification and some baseline comparisons
with netperf using TCP. For the test results listed below, one
instance of kcmperf is run which creates a trivial MUX with one KCM
socket and one attached TCP connection.

netperf TCP_RR
   - 1 instance, 1 byte RR size
     34219 tps

   - 1 instance, 1000000 byte RR size
     464 tps

   - 200 instances, 1 byte RR size
     1721552 tps
     86.86% CPU utilization

   - 200 instances, 1000000 byte RR size
     1165 tps
     7.16% CPU utilization

kcmperf
   - 1 instance, 1 byte RR size
     32679 tps

   - 1 instance, 1000000 byte RR size
     412 tps

   - 200 instances, 1 byte RR size
     1420454 tps
     80.02% CPU utilization 

   - 200 instances, 1000000 byte RR size
     1130 tps
     10.08% CPU utilization

Future support:

The implementation provided here should be thought of as a first cut
whose goal is to establish a robust base. There are many avenues for
extending and improving upon this basic implementation:

 - Sample application support
 - SOCK_SEQPACKET support
 - Integration with TLS (TLS-in-kernel is a separate initiative).
 - Page operations/splice support
 - sendmmsg, recvmmsg support
 - Unconnected KCM sockets. It will be possible to attach sockets to
   different destinations; AF_KCM addresses will be used in sendmsg and
   recvmsg to indicate the destination
 - Explore more utility in performing BPF inline with a TCP data stream
   (setting SO_MARK, rxhash for messages being sent or received on
   KCM sockets).
 - Performance work
   - Reduce locking (MUX lock is currently a bottleneck).
   - KCM socket to TCP socket affinity
   - Small message coalescing, direct calls to TCP send functions

Tom Herbert (3):
  rcu: Add list_next_or_null_rcu
  kcm: Kernel Connection Multiplexor module
  kcm: Add statistics and proc interfaces

 Documentation/networking/kcm.txt |  173 ++++
 include/linux/rculist.h          |   21 +
 include/linux/socket.h           |    6 +-
 include/net/kcm.h                |  211 +++++
 include/uapi/linux/errqueue.h    |    1 +
 include/uapi/linux/kcm.h         |   26 +
 net/Kconfig                      |    1 +
 net/Makefile                     |    1 +
 net/kcm/Kconfig                  |   10 +
 net/kcm/Makefile                 |    3 +
 net/kcm/kcmproc.c                |  415 ++++++++++
 net/kcm/kcmsock.c                | 1629 ++++++++++++++++++++++++++++++++++++++
 12 files changed, 2496 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/networking/kcm.txt
 create mode 100644 include/net/kcm.h
 create mode 100644 include/uapi/linux/kcm.h
 create mode 100644 net/kcm/Kconfig
 create mode 100644 net/kcm/Makefile
 create mode 100644 net/kcm/kcmproc.c
 create mode 100644 net/kcm/kcmsock.c

-- 
1.8.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH RFC 1/3] rcu: Add list_next_or_null_rcu
  2015-09-20 22:29 [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
@ 2015-09-20 22:29 ` Tom Herbert
  2015-09-20 22:29 ` [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module Tom Herbert
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Tom Herbert @ 2015-09-20 22:29 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

This is a convenience function that returns the next entry in an RCU
list or NULL if at the end of the list.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/rculist.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 17c6b1f..3e8e2aa 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -293,6 +293,27 @@ static inline void list_splice_init_rcu(struct list_head *list,
 })
 
 /**
+ * list_next_or_null_rcu - get the next element from a list
+ * @head:	the head for the list.
+ * @ptr:        the list head to take the next element from.
+ * @type:       the type of the struct this is embedded in.
+ * @member:     the name of the list_head within the struct.
+ *
+ * Note that if the ptr is at the end of the list, NULL is returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ */
+#define list_next_or_null_rcu(head, ptr, type, member) \
+({ \
+	struct list_head *__head = (head); \
+	struct list_head *__ptr = (ptr); \
+	struct list_head *__next = READ_ONCE(__ptr->next); \
+	likely(__next != __head) ? list_entry_rcu(__next, type, \
+						  member) : NULL; \
+})
+
+/**
  * list_for_each_entry_rcu	-	iterate over rcu list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
-- 
1.8.1


* [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module
  2015-09-20 22:29 [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
  2015-09-20 22:29 ` [PATCH RFC 1/3] rcu: Add list_next_or_null_rcu Tom Herbert
@ 2015-09-20 22:29 ` Tom Herbert
  2015-09-22 16:26   ` Alexei Starovoitov
  2015-09-23  9:36   ` Thomas Graf
  2015-09-20 22:29 ` [PATCH RFC 3/3] kcm: Add statistics and proc interfaces Tom Herbert
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Tom Herbert @ 2015-09-20 22:29 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

This module implements the Kernel Connection Multiplexor.

Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.

For more information see the included Documentation/networking/kcm.txt

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 Documentation/networking/kcm.txt |  173 +++++
 include/linux/socket.h           |    6 +-
 include/net/kcm.h                |  109 +++
 include/uapi/linux/errqueue.h    |    1 +
 include/uapi/linux/kcm.h         |   26 +
 net/Kconfig                      |    1 +
 net/Makefile                     |    1 +
 net/kcm/Kconfig                  |   10 +
 net/kcm/Makefile                 |    3 +
 net/kcm/kcmsock.c                | 1566 ++++++++++++++++++++++++++++++++++++++
 10 files changed, 1895 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/networking/kcm.txt
 create mode 100644 include/net/kcm.h
 create mode 100644 include/uapi/linux/kcm.h
 create mode 100644 net/kcm/Kconfig
 create mode 100644 net/kcm/Makefile
 create mode 100644 net/kcm/kcmsock.c

diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt
new file mode 100644
index 0000000..2e0968d
--- /dev/null
+++ b/Documentation/networking/kcm.txt
@@ -0,0 +1,173 @@
+Kernel Connection Multiplexor
+-----------------------------
+
+Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
+interface over TCP for generic application protocols. With KCM an application
+can efficiently send and receive application protocol messages over TCP using
+datagram sockets.
+
+KCM implements an NxM multiplexor in the kernel as diagrammed below:
+
++------------+   +------------+   +------------+   +------------+
+| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
++------------+   +------------+   +------------+   +------------+
+      |                 |               |                |
+      +-----------+     |               |     +----------+
+                  |     |               |     |
+               +----------------------------------+
+               |           Multiplexor            |
+               +----------------------------------+
+                 |   |           |           |  |
+       +---------+   |           |           |  ------------+
+       |             |           |           |              |
++----------+  +----------+  +----------+  +----------+ +----------+
+|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
++----------+  +----------+  +----------+  +----------+ +----------+
+      |              |           |            |             |
++----------+  +----------+  +----------+  +----------+ +----------+
+| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
++----------+  +----------+  +----------+  +----------+ +----------+
+
+KCM sockets
+-----------
+
+The KCM sockets provide the user interface to the multiplexor. All KCM sockets
+bound to a multiplexor are considered to have equivalent function, and I/O
+operations on different sockets may be done in parallel without the need for
+synchronization between threads in userspace.
+
+Multiplexor
+-----------
+
+The multiplexor provides the message steering. In the transmit path, messages
+written on a KCM socket are sent atomically on an appropriate TCP socket.
+Similarly, in the receive path, messages are constructed on each TCP socket
+(Psock) and complete messages are steered to a KCM socket that is blocked in
+a receive or poll call.
+
+TCP sockets & Psocks
+--------------------
+
+TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
+for each bound TCP socket; this structure holds the state for constructing
+messages on receive as well as other connection specific information for KCM.
+
+Connected mode semantics
+------------------------
+
+Each multiplexor assumes that all attached TCP connections are to the same
+destination and can use the different connections for load balancing when
+transmitting. The normal send and recv calls can be used to send and receive
+messages from the KCM socket.
+
+SOCK_DGRAM
+----------
+
+KCM currently supports socket type SOCK_DGRAM only.
+
+Message delineation
+-------------------
+
+Messages are sent over a TCP stream with some application protocol message
+format that typically includes a header which encapsulates messages. The length
+of a received message can be deduced from the application protocol header
+(often just a simple length field).
+
+A TCP stream must be parsed to determine message boundaries. Berkeley Packet
+Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
+BPF program must be specified. The program is called at the start of receiving
+a new message and is given an skbuff that contains the bytes received so far.
+It parses the message header and returns the length of the message. Given this
+information, KCM will construct the message of the stated length and deliver it
+to a KCM socket.
+
+TCP socket management
+---------------------
+
+When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
+write space available (POLLOUT) events are handled by the multiplexor. If there
+is a state change (disconnection) or other error on a TCP socket, an error is
+posted on the socket so that a POLLERR event happens and KCM discontinues using
+the socket. In the case of an error affecting the receive side of the
+connection, any partial message under construction is also placed on the error
+queue for the TCP socket. When the application gets the error notification for a
+TCP socket it should unattach the socket from KCM and then handle the error
+condition (the typical response is to close the socket and create a new
+connection if necessary).
+
+User interface
+==============
+
+Creating a multiplexor
+----------------------
+
+A new multiplexor and initial KCM socket are created by a socket call:
+
+  socket(AF_KCM, type, protocol)
+
+  - type is SOCK_DGRAM
+  - protocol is KCMPROTO_CONNECTED
+
+Cloning KCM sockets
+-------------------
+
+After the first KCM socket is created using the socket call as described
+above, additional sockets for the multiplexor can be created by cloning
+a KCM socket. This is accomplished by calling accept on the KCM socket:
+
+   newkcmfd = accept(kcmfd, NULL, 0)
+
+Attach transport sockets
+------------------------
+
+Attaching a transport socket to a multiplexor is performed by calling an
+ioctl on a KCM socket for the multiplexor, e.g.:
+
+  /* From linux/kcm.h */
+  struct kcm_attach {
+        int fd;
+        int bpf_type;
+        union {
+                int bpf_fd;
+                struct sock_fprog bpf_fprog;
+        };
+  };
+
+  struct kcm_attach info;
+
+  memset(&info, 0, sizeof(info));
+
+  info.fd = tcpfd;
+  info.bpf_type = KCM_BPF_TYPE_PROG;
+  info.bpf_fprog = bpf_prog;
+
+  ioctl(kcmfd, SIOCKCMATTACH, &info);
+
+The kcm_attach structure contains:
+  fd: file descriptor for the TCP socket being attached
+  bpf_type: type of BPF program to be loaded; this is either:
+    KCM_BPF_TYPE_PROG: program is loaded directly from user space
+    KCM_BPF_TYPE_FD: compiled program is loaded from the specified file
+                     descriptor (see BPF LLVM and Clang)
+  bpf_fprog: the user space BPF program to load
+  bpf_fd: file descriptor for the compiled program to load
+
+Unattach transport sockets
+--------------------------
+
+Unattaching a transport socket from a multiplexor is straightforward. An
+"unattach" ioctl is done with the kcm_unattach structure as the argument:
+
+  /* From linux/kcm.h */
+  struct kcm_unattach {
+        int fd;
+  };
+
+  struct kcm_unattach info;
+
+  memset(&info, 0, sizeof(info));
+
+  info.fd = cfd;
+
+  ioctl(fd, SIOCKCMUNATTACH, &info);
+
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5bf59c8..4e1ea53 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -200,7 +200,9 @@ struct ucred {
 #define AF_ALG		38	/* Algorithm sockets		*/
 #define AF_NFC		39	/* NFC sockets			*/
 #define AF_VSOCK	40	/* vSockets			*/
-#define AF_MAX		41	/* For now.. */
+#define AF_KCM		41	/* Kernel Connection Multiplexor*/
+
+#define AF_MAX		42	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -246,6 +248,7 @@ struct ucred {
 #define PF_ALG		AF_ALG
 #define PF_NFC		AF_NFC
 #define PF_VSOCK	AF_VSOCK
+#define PF_KCM		AF_KCM
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -322,6 +325,7 @@ struct ucred {
 #define SOL_CAIF	278
 #define SOL_ALG		279
 #define SOL_NFC		280
+#define SOL_KCM		281
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/include/net/kcm.h b/include/net/kcm.h
new file mode 100644
index 0000000..55ef56b
--- /dev/null
+++ b/include/net/kcm.h
@@ -0,0 +1,109 @@
+/* Kernel Connection Multiplexor */
+
+#ifndef __NET_KCM_H_
+#define __NET_KCM_H_
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <uapi/linux/kcm.h>
+
+#ifdef __KERNEL__
+
+extern unsigned int kcm_net_id;
+
+struct kcm_tx_msg {
+	unsigned int sent;
+	unsigned int fragidx;
+	unsigned int frag_offset;
+	unsigned int msg_flags;
+	struct sk_buff *frag_skb;
+};
+
+struct kcm_rx_msg {
+	int full_len;
+	int accum_len;
+};
+
+/* Socket structure for KCM client sockets */
+struct kcm_sock {
+	struct sock sk;
+	struct kcm_mux *mux;
+	struct list_head kcm_sock_list;
+
+	/* Transmit */
+	struct kcm_psock *tx_psock;
+	struct work_struct tx_work;
+	struct list_head wait_psock_list;
+	u32 tx_wait : 1;
+
+	/* Receive */
+	struct kcm_psock *rx_psock;
+	struct list_head wait_rx_list; /* KCMs waiting for receiving */
+	u32 rx_wait : 1;
+};
+
+struct bpf_prog;
+
+/* Structure for an attached lower socket */
+struct kcm_psock {
+	struct sock *sk;
+	struct kcm_mux *mux;
+
+	u32 tx_stopped : 1;
+	u32 rx_stopped : 1;
+	u32 done : 1;
+	u32 unattaching : 1;
+	u32 bpf_prog_fd : 1;
+
+	void (*save_state_change)(struct sock *sk);
+	void (*save_data_ready)(struct sock *sk);
+	void (*save_write_space)(struct sock *sk);
+
+	struct list_head psock_list;
+
+	/* Receive */
+	struct sk_buff *rx_skb_head;
+	struct sk_buff **rx_skb_nextp;
+	struct sk_buff *ready_rx_msg;
+	struct list_head psock_ready_list;
+	struct work_struct rx_work;
+	struct delayed_work rx_delayed_work;
+	struct bpf_prog *bpf_prog;
+
+	/* Transmit */
+	struct kcm_sock *tx_kcm;
+	struct list_head psock_avail_list;
+};
+
+/* Per net MUX list */
+struct kcm_net {
+	struct mutex mutex;
+	struct list_head mux_list;
+	int count;
+};
+
+/* Structure for a MUX */
+struct kcm_mux {
+	struct list_head kcm_mux_list;
+	struct rcu_head rcu;
+	struct kcm_net *knet;
+
+	spinlock_t  lock;		/* RX and TX locking */
+	struct list_head kcm_socks;	/* All KCM sockets on MUX */
+	int kcm_socks_cnt;		/* Total KCM socket count for MUX */
+	struct list_head psocks;	/* List of all psocks on MUX */
+	int psocks_cnt;		/* Total attached sockets */
+
+	/* Receive */
+	struct list_head kcm_rx_waiters; /* KCMs waiting for receiving */
+	struct kcm_sock *last_rx_kcm;
+	struct list_head psocks_ready;	/* List of psocks with a msg ready */
+
+	/* Transmit */
+	struct list_head psocks_avail;	/* List of available psocks */
+	struct list_head kcm_tx_waiters; /* KCMs waiting for a TX psock */
+};
+
+#endif /* __KERNEL__ */
+
+#endif /* __NET_KCM_H_ */
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1..6a8e2f2 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -18,6 +18,7 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP	2
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
+#define SO_EE_ORIGIN_KCM	5
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
diff --git a/include/uapi/linux/kcm.h b/include/uapi/linux/kcm.h
new file mode 100644
index 0000000..196622e
--- /dev/null
+++ b/include/uapi/linux/kcm.h
@@ -0,0 +1,26 @@
+#ifndef KCM_KERNEL_H
+#define KCM_KERNEL_H
+
+struct kcm_attach {
+	int fd;
+	int bpf_type;
+	union {
+		struct sock_fprog bpf_fprog;
+		int bpf_fd;
+	};
+};
+
+#define KCM_BPF_TYPE_PROG	0x1
+#define KCM_BPF_TYPE_FD		0x2
+
+struct kcm_unattach {
+	int fd;
+};
+
+#define SIOCKCMATTACH	(SIOCPROTOPRIVATE + 0)
+#define SIOCKCMUNATTACH	(SIOCPROTOPRIVATE + 1)
+
+#define KCMPROTO_CONNECTED	0
+
+#endif
+
diff --git a/net/Kconfig b/net/Kconfig
index 7021c1b..cc5a020 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -350,6 +350,7 @@ source "net/can/Kconfig"
 source "net/irda/Kconfig"
 source "net/bluetooth/Kconfig"
 source "net/rxrpc/Kconfig"
+source "net/kcm/Kconfig"
 
 config FIB_RULES
 	bool
diff --git a/net/Makefile b/net/Makefile
index 3995613..2b5d24d 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_IRDA)		+= irda/
 obj-$(CONFIG_BT)		+= bluetooth/
 obj-$(CONFIG_SUNRPC)		+= sunrpc/
 obj-$(CONFIG_AF_RXRPC)		+= rxrpc/
+obj-$(CONFIG_AF_KCM)		+= kcm/
 obj-$(CONFIG_ATM)		+= atm/
 obj-$(CONFIG_L2TP)		+= l2tp/
 obj-$(CONFIG_DECNET)		+= decnet/
diff --git a/net/kcm/Kconfig b/net/kcm/Kconfig
new file mode 100644
index 0000000..5ee5294
--- /dev/null
+++ b/net/kcm/Kconfig
@@ -0,0 +1,10 @@
+
+config AF_KCM
+	tristate "KCM sockets"
+	depends on INET
+	select BPF_SYSCALL
+	---help---
+	  KCM (Kernel Connection Multiplexer) sockets provide a method
+	  for multiplexing a message based user space protocol over
+	  kernel connections (e.g. TCP connections).
+
diff --git a/net/kcm/Makefile b/net/kcm/Makefile
new file mode 100644
index 0000000..cb525f7
--- /dev/null
+++ b/net/kcm/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_AF_KCM) += kcm.o
+
+kcm-y := kcmsock.o
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
new file mode 100644
index 0000000..0240ce3
--- /dev/null
+++ b/net/kcm/kcmsock.c
@@ -0,0 +1,1566 @@
+#include <linux/bpf.h>
+#include <linux/errno.h>
+#include <linux/errqueue.h>
+#include <linux/file.h>
+#include <linux/in.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/netdevice.h>
+#include <linux/poll.h>
+#include <linux/rculist.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/uaccess.h>
+#include <linux/workqueue.h>
+#include <net/kcm.h>
+#include <net/netns/generic.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <uapi/linux/kcm.h>
+
+unsigned int kcm_net_id;
+
+static struct kmem_cache *kcm_psockp __read_mostly;
+static struct kmem_cache *kcm_muxp __read_mostly;
+static struct workqueue_struct *kcm_wq;
+
+static inline struct kcm_sock *kcm_sk(const struct sock *sk)
+{
+	return (struct kcm_sock *)sk;
+}
+
+static inline struct kcm_tx_msg *kcm_tx_msg(struct sk_buff *skb)
+{
+	return (struct kcm_tx_msg *)skb->cb;
+}
+
+static inline struct kcm_rx_msg *kcm_rx_msg(struct sk_buff *skb)
+{
+	return (struct kcm_rx_msg *)skb->cb;
+}
+
+/* Lower socket locked */
+static void kcm_abort_rx_psock(struct kcm_psock *psock, int err,
+			       struct sk_buff *skb)
+{
+	struct sock_exterr_skb *serr;
+	struct sock *csk = psock->sk;
+
+	/* Unrecoverable error in receive */
+
+	if (psock->rx_stopped)
+		return;
+
+	psock->rx_stopped = 1;
+
+	if (!skb)
+		return;
+
+	/* Put the offending skb data on transport socket's error queue with
+	 * the error number.
+	 */
+	serr = SKB_EXT_ERR(skb);
+
+	memset(serr, 0, sizeof(*serr));
+	serr->ee.ee_errno = err;
+	serr->ee.ee_origin = SO_EE_ORIGIN_KCM;
+
+	/* Note sock_queue_err_skb calls sk_data_ready, which still points to
+	 * psock_tcp_data_ready; this should not be a problem since we've
+	 * set rx_stopped so that psock_tcp_data_ready will just return.
+	 */
+
+	sock_queue_err_skb(csk, skb);
+
+	/* Wake up transport socket */
+	psock->save_data_ready(csk);
+}
+
+static void kcm_abort_tx_psock(struct kcm_psock *psock, int err,
+			       bool wakeup_kcm, bool report_transport)
+{
+	struct sock *csk = psock->sk;
+	struct kcm_mux *mux = psock->mux;
+
+	/* Unrecoverable error in transmit */
+
+	spin_lock_bh(&mux->lock);
+
+	if (psock->tx_stopped) {
+		spin_unlock_bh(&mux->lock);
+		return;
+	}
+
+	psock->tx_stopped = 1;
+
+	if (!psock->tx_kcm) {
+		/* Take off psocks_avail list */
+		list_del(&psock->psock_avail_list);
+	} else if (wakeup_kcm) {
+		/* In this case psock is being aborted while outside of
+		 * write_msgs and psock is reserved. Schedule tx_work
+		 * to handle the failure there.
+		 */
+		queue_work(kcm_wq, &psock->tx_kcm->tx_work);
+	}
+
+	spin_unlock_bh(&mux->lock);
+
+	if (report_transport) {
+		/* Report error on lower socket */
+
+		lock_sock(csk);
+		csk->sk_err = err;
+		csk->sk_error_report(csk);
+		release_sock(csk);
+	}
+}
+
+static void kcm_abort_psock(struct kcm_psock *psock, int err,
+			    bool wakeup_kcm, bool report_transport)
+{
+	/* Lower socket pretty much dead */
+
+	kcm_abort_rx_psock(psock, err, NULL);
+	kcm_abort_tx_psock(psock, err, wakeup_kcm, report_transport);
+}
+
+/* Process a new message. If there is no KCM socket waiting for a message
+ * hold it in the psock. Returns true if message is held this way, false
+ * otherwise.
+ */
+static bool new_rx_msg(struct kcm_psock *psock, struct sk_buff *head)
+{
+	struct kcm_mux *mux = psock->mux;
+	struct kcm_sock *kcm = NULL;
+	struct sock *sk;
+
+	spin_lock_bh(&mux->lock);
+
+	if (WARN_ON(psock->ready_rx_msg)) {
+		spin_unlock_bh(&mux->lock);
+		kfree_skb(head);
+		return false;
+	}
+
+	if (list_empty(&mux->kcm_rx_waiters)) {
+		psock->ready_rx_msg = head;
+
+		list_add_tail(&psock->psock_ready_list,
+			      &mux->psocks_ready);
+
+		spin_unlock_bh(&mux->lock);
+		return true;
+	}
+
+	kcm = list_first_entry(&mux->kcm_rx_waiters,
+			       struct kcm_sock, wait_rx_list);
+	WARN_ON(!kcm->rx_wait);
+
+	list_del(&kcm->wait_rx_list);
+	kcm->rx_wait = 0;
+
+	spin_unlock_bh(&mux->lock);
+
+	sk = &kcm->sk;
+
+	skb_queue_tail(&sk->sk_receive_queue, head);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		sk->sk_data_ready(sk);
+
+	return false;
+}
+
+/* Receive a message to a kcm socket from a psock that is on the ready list
+ * Mux lock is held here.
+ */
+static bool queue_ready_msg_to_kcm(struct kcm_mux *mux, struct kcm_sock *kcm)
+{
+	struct sock *sk = &kcm->sk;
+	struct kcm_psock *psock;
+	struct sk_buff *skb;
+
+	if (list_empty(&mux->psocks_ready))
+		return false;
+
+	psock = list_first_entry(&mux->psocks_ready,
+				 struct kcm_psock, psock_ready_list);
+
+	skb = psock->ready_rx_msg;
+	list_del(&psock->psock_ready_list);
+
+	psock->ready_rx_msg = NULL;
+
+	/* Read more */
+	queue_work(kcm_wq, &psock->rx_work);
+
+	if (WARN_ON(!skb))
+		return false;
+
+	skb_set_owner_r(skb, sk);
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+
+	return true;
+}
+
+/* A state change on a connected socket means it's dying or dead. */
+static void
+psock_tcp_state_change(struct sock *sk)
+{
+	struct kcm_psock *psock = (struct kcm_psock *)sk->sk_user_data;
+	int err = sk->sk_err ? : EPIPE;
+
+	if (WARN_ON(!psock))
+		return;
+
+	/* Abort the psock, report as an EPIPE error on original socket */
+	kcm_abort_psock(psock, err, true, false);
+
+	/* Report an EPIPE error on the socket. We do this here instead of in
+	 * kcm_tx_abort_sock since we already hold the socket lock.
+	 */
+	sk->sk_err = err;
+	sk->sk_error_report(sk);
+}
+
+/* Macro to invoke filter function. */
+#define KCM_RUN_FILTER(prog, ctx) \
+	(*prog->bpf_func)(ctx, prog->insnsi)
+
+static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
+			unsigned int offset, size_t orig_len)
+{
+	struct kcm_psock *psock = (struct kcm_psock *)desc->arg.data;
+	struct kcm_rx_msg *rxm;
+	struct sk_buff *head, *skb;
+	size_t eaten = 0;
+	ssize_t extra;
+	int err;
+
+	if (WARN_ON(psock->rx_stopped))
+		return 0;
+
+	if (psock->ready_rx_msg)
+		return 0;
+
+	head = psock->rx_skb_head;
+	if (head && !psock->rx_skb_nextp) {
+		int err;
+
+		/* We are going to append to the frags_list of head. Need to
+		 * unshare the skbuff data as well as all the skbs on the
+		 * frag_list (if there are any). We deferred this work in hopes
+		 * that the original skbuff was consumed by the stack so that
+		 * there is less work needed here.
+		 */
+		if (unlikely(skb_shinfo(head)->frag_list)) {
+			if (WARN_ON(head->next)) {
+				desc->error = -EINVAL;
+				return 0;
+			}
+
+			skb = alloc_skb(0, GFP_ATOMIC);
+			if (!skb) {
+				desc->error = -ENOMEM;
+				return 0;
+			}
+			skb->len = head->len;
+			skb->data_len = head->len;
+			skb->truesize = head->truesize;
+			*kcm_rx_msg(skb) = *kcm_rx_msg(head);
+			psock->rx_skb_nextp = &head->next;
+			skb_shinfo(skb)->frag_list = head;
+			psock->rx_skb_head = skb;
+			head = skb;
+		} else {
+			err = skb_unclone(head, GFP_ATOMIC);
+			if (err) {
+				desc->error = err;
+				return 0;
+			}
+			psock->rx_skb_nextp =
+			    &skb_shinfo(head)->frag_list;
+		}
+	}
+
+	while (eaten < orig_len) {
+		/* Always clone since we will consume something */
+		skb = skb_clone(orig_skb, GFP_ATOMIC);
+		if (!skb) {
+			desc->error = -ENOMEM;
+			break;
+		}
+
+		if (!pskb_pull(skb, offset + eaten)) {
+			kfree_skb(skb);
+			desc->error = -ENOMEM;
+			break;
+		}
+
+		if (WARN_ON(skb->len < orig_len - eaten)) {
+			kfree_skb(skb);
+			desc->error = -EINVAL;
+			break;
+		}
+
+		/* Need to trim should be rare */
+		err = pskb_trim(skb, orig_len - eaten);
+		if (err) {
+			kfree_skb(skb);
+			desc->error = err;
+			break;
+		}
+
+		/* Preliminary */
+		eaten += skb->len;
+
+		head = psock->rx_skb_head;
+		if (!head) {
+			head = skb;
+			psock->rx_skb_head = head;
+			/* Will set rx_skb_nextp on next packet if needed */
+			psock->rx_skb_nextp = NULL;
+			rxm = kcm_rx_msg(head);
+			memset(rxm, 0, sizeof(*rxm));
+			rxm->accum_len = head->len;
+		} else {
+			rxm = kcm_rx_msg(head);
+			*psock->rx_skb_nextp = skb;
+			psock->rx_skb_nextp = &skb->next;
+			rxm->accum_len += skb->len;
+			head->data_len += skb->len;
+			head->len += skb->len;
+			head->truesize += skb->truesize;
+		}
+
+		if (!rxm->full_len) {
+			ssize_t len;
+
+			len = KCM_RUN_FILTER(psock->bpf_prog, head);
+
+			if (!len) {
+				/* Need more header to determine length */
+				break;
+			} else if (len <= head->len - skb->len) {
+				/* Length must be into new skb (and also
+				 * greater than zero)
+				 */
+				desc->error = -EPROTO;
+				psock->rx_skb_head = NULL;
+				kcm_abort_rx_psock(psock, EPROTO, head);
+				break;
+			}
+
+			rxm->full_len = len;
+		}
+
+		extra = (ssize_t)rxm->accum_len - rxm->full_len;
+
+		if (extra < 0) {
+			/* Message not complete yet. */
+			break;
+		} else if (extra > 0) {
+			/* More bytes than needed for the message */
+
+			WARN_ON(extra > skb->len);
+
+			/* We don't bother calling pskb_trim here. The skbuff
+			 * holds the full message size which is used to
+			 * copy data out.
+			 */
+
+			eaten -= extra;
+		}
+
+		/* Hurray, we have a new message! */
+		psock->rx_skb_head = NULL;
+
+		if (new_rx_msg(psock, head)) {
+			/* Message was held at psock */
+			break;
+		}
+	}
+
+	return eaten;
+}
+
+static int psock_tcp_read_sock(struct kcm_psock *psock)
+{
+	read_descriptor_t desc;
+
+	desc.arg.data = psock;
+	desc.error = 0;
+	desc.count = 1; /* give more than one skb per call */
+
+	/* sk should be locked here, so okay to do tcp_read_sock */
+	tcp_read_sock(psock->sk, &desc, kcm_tcp_recv);
+
+	return desc.error;
+}
+
+static void psock_tcp_data_ready(struct sock *sk)
+{
+	struct kcm_psock *psock = (struct kcm_psock *)sk->sk_user_data;
+
+	if (unlikely(psock->rx_stopped))
+		return;
+
+	read_lock_bh(&sk->sk_callback_lock);
+
+	if (psock->ready_rx_msg)
+		goto out;
+
+	if (psock_tcp_read_sock(psock) == -ENOMEM)
+		queue_delayed_work(kcm_wq, &psock->rx_delayed_work, 0);
+
+out:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void do_psock_rx_work(struct kcm_psock *psock)
+{
+	struct sock *csk = psock->sk;
+
+	/* Lock sock */
+	lock_sock(csk);
+
+	if (unlikely(psock->rx_stopped)) {
+		release_sock(csk);
+		return;
+	}
+
+	read_lock_bh(&csk->sk_callback_lock);
+
+	if (psock->ready_rx_msg) {
+		/* Already have a message pending, no work to do */
+		read_unlock_bh(&csk->sk_callback_lock);
+		release_sock(csk);
+		return;
+	}
+
+	if (psock_tcp_read_sock(psock) == -ENOMEM)
+		queue_delayed_work(kcm_wq, &psock->rx_delayed_work, 0);
+
+	read_unlock_bh(&csk->sk_callback_lock);
+	release_sock(csk);
+}
+
+static void psock_rx_work(struct work_struct *w)
+{
+	do_psock_rx_work(container_of(w, struct kcm_psock, rx_work));
+}
+
+static void psock_rx_delayed_work(struct work_struct *w)
+{
+	do_psock_rx_work(container_of(w, struct kcm_psock,
+				      rx_delayed_work.work));
+}
+
+static void psock_tcp_write_space(struct sock *sk)
+{
+	struct kcm_psock *psock = (struct kcm_psock *)sk->sk_user_data;
+	struct kcm_mux *mux;
+	struct kcm_sock *kcm;
+
+	if (WARN_ON(!psock))
+		return;
+
+	mux = psock->mux;
+
+	spin_lock_bh(&mux->lock);
+
+	/* Check if the psock is reserved; if so, a KCM socket may be
+	 * waiting to send on it.
+	 */
+	kcm = psock->tx_kcm;
+	if (kcm)
+		queue_work(kcm_wq, &kcm->tx_work);
+
+	spin_unlock_bh(&mux->lock);
+}
+
+/* Assumes kcm sock is locked. */
+static struct kcm_psock *reserve_psock(struct kcm_sock *kcm,
+				       bool wait, int *ret_err)
+{
+	struct kcm_mux *mux = kcm->mux;
+	struct kcm_psock *psock = NULL;
+	int err = 0;
+
+	if (kcm->tx_psock)
+		return kcm->tx_psock;
+
+	spin_lock_bh(&mux->lock);
+
+	/* Check again under lock to see if a psock was reserved for this
+	 * kcm via psock_now_avail (e.g. called from unreserve_psock).
+	 */
+	if (kcm->tx_psock) {
+		spin_unlock_bh(&mux->lock);
+		return kcm->tx_psock;
+	}
+
+	if (!list_empty(&mux->psocks_avail)) {
+		psock = list_first_entry(&mux->psocks_avail,
+					 struct kcm_psock,
+					 psock_avail_list);
+		list_del(&psock->psock_avail_list);
+		if (kcm->tx_wait) {
+			list_del(&kcm->wait_psock_list);
+			kcm->tx_wait = 0;
+		}
+		kcm->tx_psock = psock;
+		psock->tx_kcm = kcm;
+	} else if (kcm->tx_wait) {
+		err = -EAGAIN;
+	} else {
+		if (mux->psocks_cnt && wait) {
+			list_add_tail(&kcm->wait_psock_list,
+				      &mux->kcm_tx_waiters);
+			kcm->tx_wait = 1;
+			err = -EAGAIN;
+		} else {
+			err = -EPIPE;
+		}
+	}
+
+	spin_unlock_bh(&mux->lock);
+
+	*ret_err = err;
+	return psock;
+}
+
+/* mux lock held */
+static void psock_now_avail(struct kcm_psock *psock)
+{
+	struct kcm_mux *mux = psock->mux;
+	struct kcm_sock *kcm;
+
+	if (list_empty(&mux->kcm_tx_waiters)) {
+		list_add_tail(&psock->psock_avail_list,
+			      &mux->psocks_avail);
+	} else {
+		kcm = list_first_entry(&mux->kcm_tx_waiters,
+				       struct kcm_sock,
+				       wait_psock_list);
+		list_del(&kcm->wait_psock_list);
+		kcm->tx_wait = 0;
+		kcm->tx_psock = psock;
+		psock->tx_kcm = kcm;
+		queue_work(kcm_wq, &kcm->tx_work);
+	}
+}
+
+/* Assumes kcm sock is locked. */
+static void unreserve_psock(struct kcm_sock *kcm)
+{
+	struct kcm_psock *psock = kcm->tx_psock;
+	struct kcm_mux *mux = kcm->mux;
+
+	if (WARN_ON(!psock))
+		return;
+
+	spin_lock_bh(&mux->lock);
+
+	WARN_ON(kcm->tx_wait);
+
+	kcm->tx_psock = NULL;
+	psock->tx_kcm = NULL;
+
+	if (unlikely(psock->done || psock->tx_stopped)) {
+		if (psock->done) {
+			/* Deferred free */
+			list_del(&psock->psock_list);
+			mux->psocks_cnt--;
+			sock_put(psock->sk);
+			fput(psock->sk->sk_socket->file);
+			kmem_cache_free(kcm_psockp, psock);
+		}
+
+		/* Don't put back on available list */
+
+		spin_unlock_bh(&mux->lock);
+
+		return;
+	}
+
+	psock_now_avail(psock);
+
+	spin_unlock_bh(&mux->lock);
+}
+
+/* Write any messages ready on the kcm socket.  Called with kcm sock lock
+ * held.  Return bytes actually sent or error.
+ */
+static int kcm_write_msgs(struct kcm_sock *kcm)
+{
+	struct sock *sk = &kcm->sk;
+	struct kcm_psock *psock;
+	struct sk_buff *skb, *head;
+	struct kcm_tx_msg *txm;
+	unsigned short fragidx, frag_offset;
+	unsigned int sent, total_sent = 0;
+	int ret = 0;
+
+	if (WARN_ON(skb_queue_empty(&sk->sk_write_queue)))
+		return 0;
+
+	head = skb_peek(&sk->sk_write_queue);
+
+	psock = kcm->tx_psock;
+	txm = kcm_tx_msg(head);
+
+	if (unlikely(psock && psock->tx_stopped)) {
+		/* A reserved psock was aborted asynchronously. Unreserve
+		 * it and we'll retry the message.
+		 */
+		unreserve_psock(kcm);
+		txm->sent = 0;
+	} else if (txm->sent) {
+		/* Send of first skbuff in queue already in progress */
+		if (WARN_ON(!psock)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		sent = txm->sent;
+		frag_offset = txm->frag_offset;
+		fragidx = txm->fragidx;
+		skb = txm->frag_skb;
+
+		goto do_frag;
+	}
+
+try_again:
+	psock = reserve_psock(kcm, true, &ret);
+	if (!psock) {
+		if (ret == -EAGAIN)
+			ret = 0;
+		goto out;
+	}
+
+	do {
+		skb = head;
+		txm = kcm_tx_msg(head);
+		sent = 0;
+
+do_frag_list:
+		if (WARN_ON(!skb_shinfo(skb)->nr_frags)) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		for (fragidx = 0; fragidx < skb_shinfo(skb)->nr_frags;
+		     fragidx++) {
+			skb_frag_t *frag;
+
+			frag_offset = 0;
+do_frag:
+			frag = &skb_shinfo(skb)->frags[fragidx];
+			if (WARN_ON(!frag->size)) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			ret = kernel_sendpage(psock->sk->sk_socket,
+					      frag->page.p,
+					      frag->page_offset + frag_offset,
+					      frag->size - frag_offset,
+					      MSG_DONTWAIT);
+			if (ret <= 0) {
+				if (ret == -EAGAIN) {
+					/* Save state to try again when there's
+					 * write space on the socket
+					 */
+					txm->sent = sent;
+					txm->frag_offset = frag_offset;
+					txm->fragidx = fragidx;
+					txm->frag_skb = skb;
+
+					ret = 0;
+					goto out;
+				}
+
+				/* Hard failure in sending message, abort this
+				 * psock since it has lost framing
+				 * synchronization and retry sending the
+				 * message from the beginning.
+				 */
+				kcm_abort_tx_psock(psock, -ret, true, false);
+				unreserve_psock(kcm);
+
+				txm->sent = 0;
+
+				goto try_again;
+			}
+
+			sent += ret;
+			frag_offset += ret;
+			if (frag_offset < frag->size) {
+				/* Not finished with this frag */
+				goto do_frag;
+			}
+		}
+
+		if (skb == head) {
+			if (skb_has_frag_list(skb)) {
+				skb = skb_shinfo(skb)->frag_list;
+				goto do_frag_list;
+			}
+		} else if (skb->next) {
+			skb = skb->next;
+			goto do_frag_list;
+		}
+
+		/* Successfully sent the whole packet, account for it. */
+		skb_dequeue(&sk->sk_write_queue);
+		kfree_skb(head);
+		sk->sk_wmem_queued -= sent;
+		total_sent += sent;
+	} while ((head = skb_peek(&sk->sk_write_queue)));
+out:
+	if (!head) {
+		/* Done with all queue messages. */
+		WARN_ON(!skb_queue_empty(&sk->sk_write_queue));
+		WARN_ON(sk->sk_wmem_queued);
+		unreserve_psock(kcm);
+	}
+
+	/* Check if write space is available */
+	sk->sk_write_space(sk);
+
+	return total_sent ? : ret;
+}
+
+static void kcm_tx_work(struct work_struct *w)
+{
+	struct kcm_sock *kcm = container_of(w, struct kcm_sock, tx_work);
+	struct sock *sk = &kcm->sk;
+	int ret;
+
+	lock_sock(sk);
+
+	/* For SOCK_DGRAM socket */
+	if (!skb_queue_empty(&sk->sk_write_queue)) {
+		ret = kcm_write_msgs(kcm);
+		if (ret < 0 && ret != -EAGAIN) {
+			/* Hard failure in write, report error on KCM socket */
+			struct sk_buff *skb = skb_peek(&sk->sk_write_queue);
+
+			sk_stream_error(sk, kcm_tx_msg(skb)->msg_flags, ret);
+		}
+	}
+
+	release_sock(sk);
+}
+
+static unsigned int kcm_poll(struct file *file, struct socket *sock,
+			     poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	struct kcm_sock *kcm = kcm_sk(sk);
+	unsigned int mask = 0;
+
+	sock_poll_wait(file, sk_sleep(sk), wait);
+
+	/* Note that we don't need to lock the socket, as the upper poll layers
+	 * take care of normal races (between the test and the event) and we
+	 * don't go look at any of the socket buffers directly.
+	 */
+	if (sk->sk_err)
+		mask = POLLERR;
+
+	if (!skb_queue_empty(&sk->sk_receive_queue)) {
+		mask |= POLLIN | POLLRDNORM;
+	} else {
+		/* Assume the caller is interested in receiving. */
+		lock_sock(sk);
+		spin_lock_bh(&kcm->mux->lock);
+		if (queue_ready_msg_to_kcm(kcm->mux, kcm)) {
+			/* Found a message waiting on a psock */
+			mask |= POLLIN | POLLRDNORM;
+		} else if (!kcm->rx_wait) {
+			list_add_tail(&kcm->wait_rx_list,
+				      &kcm->mux->kcm_rx_waiters);
+			kcm->rx_wait = 1;
+		}
+		spin_unlock_bh(&kcm->mux->lock);
+		release_sock(sk);
+	}
+
+	if (sk_stream_memory_free(sk)) {
+		mask |= POLLOUT | POLLWRNORM;
+	} else {
+		set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		set_bit(SOCK_NOSPACE, &sock->flags);
+
+		/* Race breaker. If space is freed after wmem test
+		 * but before the flags are set, IO signal will be lost.
+		 */
+		smp_mb__after_atomic();
+		if (sk_stream_is_writeable(sk))
+			mask |= POLLOUT | POLLWRNORM;
+	}
+
+	return mask;
+}
+
+static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
+{
+	struct sock *sk = sock->sk;
+	struct kcm_sock *kcm = kcm_sk(sk);
+	struct sk_buff *skb, **nextp;
+	int err;
+	bool not_busy;
+	long timeo;
+	size_t tlen, total_len = len;
+
+	lock_sock(sk);
+
+	/* Per tcp_sendmsg this should be in poll */
+	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+
+	err = -EPIPE;
+	if (sk->sk_err)
+		goto out_error;
+
+	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+	/* Call the sk_stream functions to manage the sndbuf mem. */
+	if (!sk_stream_memory_free(sk)) {
+		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+		err = sk_stream_wait_memory(sk, &timeo);
+		if (err)
+			goto out_error;
+	}
+
+	/* Put all data into frags since we will be calling kernel_sendpage
+	 * for everything.
+	 */
+	tlen = min_t(size_t, len, MAX_SKB_FRAGS * PAGE_SIZE);
+
+	skb = alloc_skb_with_frags(0, tlen, PAGE_ALLOC_COSTLY_ORDER, &err,
+				   GFP_KERNEL);
+
+	if (!skb)
+		goto out_error;
+
+	skb->data_len = tlen;
+	skb->len = tlen;
+
+	kcm_tx_msg(skb)->msg_flags = msg->msg_flags;
+
+	len -= tlen;
+
+	nextp = &skb_shinfo(skb)->frag_list;
+	while (len) {
+		struct sk_buff *tskb;
+		size_t tlen;
+
+		/* Need a frag_list */
+
+		tlen = min_t(size_t, len, MAX_SKB_FRAGS * PAGE_SIZE);
+
+		tskb = alloc_skb_with_frags(0, tlen, PAGE_ALLOC_COSTLY_ORDER,
+					    &err, GFP_KERNEL);
+		if (!tskb) {
+			kfree_skb(skb);
+			goto out_error;
+		}
+
+		tskb->data_len = tlen;
+		tskb->len = tlen;
+		skb->data_len += tlen;
+		skb->len += tlen;
+		*nextp = tskb;
+		nextp = &tskb->next;
+
+		len -= tlen;
+	}
+
+	err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, total_len);
+	if (err) {
+		kfree_skb(skb);
+		goto out_error;
+	}
+
+	not_busy = !skb_peek(&sk->sk_write_queue);
+
+	__skb_queue_tail(&sk->sk_write_queue, skb);
+	sk->sk_wmem_queued += total_len;
+
+	if (not_busy) {
+		err = kcm_write_msgs(kcm);
+		if (err < 0 && err != -EAGAIN) {
+			/* We got a hard error in write_msgs but have already
+			 * queued this message. Report an error in the socket,
+			 * but return success here.
+			 */
+			sk_stream_error(sk, msg->msg_flags, err);
+		}
+	}
+
+	/* Everything seems okay */
+
+	release_sock(sk);
+	return total_len;
+
+out_error:
+	err = sk_stream_error(sk, msg->msg_flags, err);
+	/* make sure we wake any epoll edge trigger waiter */
+	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 && err == -EAGAIN))
+		sk->sk_write_space(sk);
+
+	release_sock(sk);
+	return err;
+}
+
+static int kcm_recvmsg(struct socket *sock,
+		       struct msghdr *msg, size_t len, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct kcm_sock *kcm = kcm_sk(sk);
+	int err = 0;
+	long timeo;
+	struct kcm_rx_msg *rxm;
+	int copied = 0;
+	struct sk_buff *skb;
+
+	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+
+	lock_sock(sk);
+
+	while (!(skb = skb_peek(&sk->sk_receive_queue))) {
+		spin_lock_bh(&kcm->mux->lock);
+		if (queue_ready_msg_to_kcm(kcm->mux, kcm)) {
+			spin_unlock_bh(&kcm->mux->lock);
+
+			/* Found a message waiting on a psock */
+			skb = skb_peek(&sk->sk_receive_queue);
+
+			if (WARN_ON(!skb)) {
+				err = -EINVAL;
+				goto out;
+			}
+			break;
+		}
+
+		if (!kcm->rx_wait) {
+			list_add_tail(&kcm->wait_rx_list,
+				      &kcm->mux->kcm_rx_waiters);
+			kcm->rx_wait = 1;
+		}
+		spin_unlock_bh(&kcm->mux->lock);
+
+		if (sk->sk_err) {
+			err = sock_error(sk);
+			goto out;
+		}
+
+		if (sock_flag(sk, SOCK_DONE))
+			goto out;
+
+		if ((flags & MSG_DONTWAIT) || !timeo) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		sk_wait_data(sk, &timeo, NULL);
+
+		/* Handle signals */
+		if (signal_pending(current)) {
+			err = sock_intr_errno(timeo);
+			goto out;
+		}
+	}
+
+	/* Okay, have a message on the receive queue */
+
+	rxm = kcm_rx_msg(skb);
+
+	if (len > rxm->full_len)
+		len = rxm->full_len;
+
+	err = skb_copy_datagram_msg(skb, 0, msg, len);
+	if (err < 0)
+		goto out;
+
+	copied = len;
+	if (likely(!(flags & MSG_PEEK))) {
+		/* Finished with message */
+		skb_unlink(skb, &sk->sk_receive_queue);
+		kfree_skb(skb);
+	}
+
+out:
+	release_sock(sk);
+
+	if (copied > 0)
+		return copied;
+	else
+		return err;
+}
+
+static inline void init_kcm_sock(struct kcm_sock *kcm, struct kcm_mux *mux)
+{
+	/* Add to mux's kcm sockets list */
+	kcm->mux = mux;
+	spin_lock_bh(&mux->lock);
+	list_add_tail(&kcm->kcm_sock_list, &mux->kcm_socks);
+	mux->kcm_socks_cnt++;
+	spin_unlock_bh(&mux->lock);
+
+	INIT_WORK(&kcm->tx_work, kcm_tx_work);
+	kcm->tx_wait = 0;
+}
+
+static int kcm_attach(struct socket *sock, struct socket *csock,
+		      struct bpf_prog *prog, int bpf_type)
+{
+	struct kcm_sock *kcm = kcm_sk(sock->sk);
+	struct kcm_mux *mux = kcm->mux;
+	struct sock *csk;
+	struct kcm_psock *psock = NULL;
+
+	if (csock->ops->family != PF_INET)
+		return -EINVAL;
+
+	csk = csock->sk;
+	if (!csk)
+		return -EINVAL;
+
+	/* Only support TCP for now */
+	if (csk->sk_protocol != IPPROTO_TCP)
+		return -EINVAL;
+
+	psock = kmem_cache_zalloc(kcm_psockp, GFP_KERNEL);
+	if (!psock)
+		return -ENOMEM;
+
+	psock->mux = mux;
+	psock->sk = csk;
+	psock->bpf_prog = prog;
+	psock->bpf_prog_fd = !!(bpf_type == KCM_BPF_TYPE_FD);
+	INIT_WORK(&psock->rx_work, psock_rx_work);
+	INIT_DELAYED_WORK(&psock->rx_delayed_work, psock_rx_delayed_work);
+
+	sock_hold(csk);
+
+	write_lock_bh(&csk->sk_callback_lock);
+	psock->save_state_change = csk->sk_state_change;
+	psock->save_data_ready = csk->sk_data_ready;
+	psock->save_write_space = csk->sk_write_space;
+
+	csk->sk_user_data = psock;
+	csk->sk_state_change = psock_tcp_state_change;
+	csk->sk_data_ready = psock_tcp_data_ready;
+	csk->sk_write_space = psock_tcp_write_space;
+	write_unlock_bh(&csk->sk_callback_lock);
+
+	/* Finished initialization, now add the psock to the MUX. */
+	spin_lock_bh(&mux->lock);
+	list_add_tail(&psock->psock_list, &mux->psocks);
+	mux->psocks_cnt++;
+	psock_now_avail(psock);
+	spin_unlock_bh(&mux->lock);
+
+	/* Schedule RX work in case there are already bytes queued */
+	queue_work(kcm_wq, &psock->rx_work);
+
+	return 0;
+}
+
+static int kcm_attach_ioctl(struct socket *sock, struct kcm_attach *info)
+{
+	struct socket *csock;
+	struct bpf_prog *prog;
+	int err;
+
+	csock = sockfd_lookup(info->fd, &err);
+	if (!csock)
+		return -ENOENT;
+
+	switch (info->bpf_type) {
+	case KCM_BPF_TYPE_FD:
+		prog = bpf_prog_get(info->bpf_fd);
+		if (IS_ERR(prog)) {
+			err = PTR_ERR(prog);
+			goto out;
+		}
+
+		if (prog->type != BPF_PROG_TYPE_SOCKET_FILTER) {
+			bpf_prog_put(prog);
+			err = -EINVAL;
+			goto out;
+		}
+		break;
+	case KCM_BPF_TYPE_PROG:
+		err = bpf_prog_create_from_user(&prog, &info->bpf_fprog, NULL);
+		if (err)
+			goto out;
+		break;
+	default:
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = kcm_attach(sock, csock, prog, info->bpf_type);
+	if (err) {
+		if (info->bpf_type == KCM_BPF_TYPE_PROG)
+			__bpf_prog_free(prog);
+		else
+			bpf_prog_put(prog);
+		goto out;
+	}
+
+	/* Keep the reference on the csock file; it is released when the
+	 * psock is unattached.
+	 */
+
+	return 0;
+out:
+	fput(csock->file);
+	return err;
+}
+
+/* Under csk sock lock. */
+static void kcm_unattach(struct kcm_psock *psock)
+{
+	struct sock *csk = psock->sk;
+	struct kcm_mux *mux = psock->mux;
+
+	if (psock->done)
+		return;
+
+	/* Stop getting callbacks from TCP socket. */
+	write_lock_bh(&csk->sk_callback_lock);
+	csk->sk_user_data = NULL;
+	csk->sk_state_change = psock->save_state_change;
+	csk->sk_data_ready = psock->save_data_ready;
+	csk->sk_write_space = psock->save_write_space;
+	write_unlock_bh(&csk->sk_callback_lock);
+
+	cancel_work_sync(&psock->rx_work);
+	cancel_delayed_work_sync(&psock->rx_delayed_work);
+
+	if (psock->bpf_prog_fd)
+		bpf_prog_put(psock->bpf_prog);
+	else
+		__bpf_prog_free(psock->bpf_prog);
+
+	kfree_skb(psock->rx_skb_head);
+	psock->rx_skb_head = NULL;
+
+	spin_lock_bh(&mux->lock);
+
+	if (psock->ready_rx_msg) {
+		list_del(&psock->psock_ready_list);
+		kfree_skb(psock->ready_rx_msg);
+		psock->ready_rx_msg = NULL;
+	}
+
+	if (psock->tx_kcm) {
+		/* psock was reserved. Just mark it finished and we will
+		 * clean up in the kcm paths; we need the kcm lock, which
+		 * cannot be acquired here.
+		 */
+		psock->done = 1;
+		spin_unlock_bh(&mux->lock);
+		kcm_abort_tx_psock(psock, EPIPE, true, true);
+	} else {
+		if (!psock->tx_stopped)
+			list_del(&psock->psock_avail_list);
+		list_del(&psock->psock_list);
+		mux->psocks_cnt--;
+		spin_unlock_bh(&mux->lock);
+
+		sock_put(csk);
+		fput(csk->sk_socket->file);
+		kmem_cache_free(kcm_psockp, psock);
+	}
+}
+
+static int kcm_unattach_ioctl(struct socket *sock, struct kcm_unattach *info)
+{
+	struct kcm_sock *kcm = kcm_sk(sock->sk);
+	struct kcm_mux *mux = kcm->mux;
+	struct kcm_psock *psock;
+	struct socket *csock;
+	struct sock *csk;
+	int err;
+
+	csock = sockfd_lookup(info->fd, &err);
+	if (!csock)
+		return -ENOENT;
+
+	csk = csock->sk;
+	if (!csk) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = -ENOENT;
+
+	spin_lock_bh(&mux->lock);
+
+	list_for_each_entry(psock, &mux->psocks, psock_list) {
+		if (psock->sk != csk)
+			continue;
+
+		/* Found the matching psock */
+
+		if (psock->unattaching || WARN_ON(psock->done)) {
+			err = -EALREADY;
+			break;
+		}
+
+		psock->unattaching = 1;
+
+		spin_unlock_bh(&mux->lock);
+
+		kcm_unattach(psock);
+
+		err = 0;
+		goto out;
+	}
+
+	spin_unlock_bh(&mux->lock);
+
+out:
+	fput(csock->file);
+	return err;
+}
+
+static int kcm_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
+{
+	int err;
+
+	switch (cmd) {
+	case SIOCKCMATTACH: {
+		struct kcm_attach info;
+
+		if (copy_from_user(&info, (void __user *)arg, sizeof(info))) {
+			err = -EFAULT;
+			break;
+		}
+
+		err = kcm_attach_ioctl(sock, &info);
+
+		break;
+	}
+	case SIOCKCMUNATTACH: {
+		struct kcm_unattach info;
+
+		if (copy_from_user(&info, (void __user *)arg, sizeof(info))) {
+			err = -EFAULT;
+			break;
+		}
+
+		err = kcm_unattach_ioctl(sock, &info);
+
+		break;
+	}
+	default:
+		err = -ENOIOCTLCMD;
+		break;
+	}
+
+	return err;
+}
+
+static void free_mux(struct rcu_head *rcu)
+{
+	struct kcm_mux *mux = container_of(rcu,
+	    struct kcm_mux, rcu);
+
+	kmem_cache_free(kcm_muxp, mux);
+}
+
+static void release_mux(struct kcm_mux *mux)
+{
+	struct kcm_net *knet = mux->knet;
+	struct kcm_psock *psock, *tmp_psock;
+
+	/* Release psocks */
+	list_for_each_entry_safe(psock, tmp_psock,
+				 &mux->psocks, psock_list)
+		kcm_unattach(psock);
+
+	if (WARN_ON(mux->psocks_cnt))
+		return;
+
+	mutex_lock(&knet->mutex);
+	list_del_rcu(&mux->kcm_mux_list);
+	knet->count--;
+	mutex_unlock(&knet->mutex);
+
+	call_rcu(&mux->rcu, free_mux);
+}
+
+/* Called by kcm_release to close a KCM socket.
+ * If this is the last KCM socket on the MUX, destroy the MUX.
+ */
+static int kcm_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct kcm_sock *kcm = kcm_sk(sk);
+	struct kcm_mux *mux = kcm->mux;
+	struct kcm_psock *psock;
+	int socks_cnt;
+
+	sock_orphan(sk);
+
+	lock_sock(sk);
+	/* Purge queue under lock to avoid race condition with tx_work trying
+	 * to act when queue is nonempty. If tx_work runs after this point
+	 * it will just return.
+	 */
+	__skb_queue_purge(&sk->sk_write_queue);
+	release_sock(sk);
+
+	spin_lock_bh(&mux->lock);
+	if (kcm->tx_wait) {
+		/* Take off the tx_wait list; after this point there should
+		 * be no way that a psock will be assigned to this kcm.
+		 */
+		list_del(&kcm->wait_psock_list);
+		kcm->tx_wait = 0;
+	}
+	spin_unlock_bh(&mux->lock);
+
+	psock = kcm->tx_psock;
+	if (psock) {
+		/* A psock was reserved, so we need to kill it since it
+		 * may already have some bytes queued from a message. We
+		 * need to do this after removing kcm from tx_wait list.
+		 */
+		kcm_abort_tx_psock(psock, EINTR, false, true);
+		unreserve_psock(kcm);
+	}
+
+	/* Cancel work. After this point there should be no outside references
+	 * to the kcm socket.
+	 */
+	cancel_work_sync(&kcm->tx_work);
+
+	/* Detach from MUX */
+	spin_lock_bh(&mux->lock);
+
+	list_del(&kcm->kcm_sock_list);
+	mux->kcm_socks_cnt--;
+	if (kcm->rx_wait) {
+		list_del(&kcm->wait_rx_list);
+		kcm->rx_wait = 0;
+	}
+	if (mux->last_rx_kcm == kcm)
+		mux->last_rx_kcm = NULL;
+	socks_cnt = mux->kcm_socks_cnt;
+
+	spin_unlock_bh(&mux->lock);
+
+	if (!socks_cnt) {
+		/* We are done with the mux now. */
+		release_mux(mux);
+	}
+
+	/* Free anything in recv queue */
+	__skb_queue_purge(&sk->sk_receive_queue);
+
+	sock->sk = NULL;
+	sock_put(&kcm->sk);
+
+	return 0;
+}
+
+static struct proto kcm_proto = {
+	.name	= "KCM",
+	.owner	= THIS_MODULE,
+	.obj_size = sizeof(struct kcm_sock),
+};
+
+/* Clone a kcm socket. Overloads accept proto operation */
+static int kcm_accept(struct socket *osock, struct socket *newsock, int flags)
+{
+	struct sock *newsk;
+
+	newsk = sk_alloc(sock_net(osock->sk), PF_KCM, GFP_KERNEL,
+			 &kcm_proto, true);
+	if (!newsk)
+		return -ENOMEM;
+
+	newsock->ops = osock->ops;
+
+	sock_init_data(newsock, newsk);
+	init_kcm_sock(kcm_sk(newsk), kcm_sk(osock->sk)->mux);
+
+	return 0;
+}
+
+static const struct proto_ops kcm_dgram_ops = {
+	.family =	PF_KCM,
+	.owner =	THIS_MODULE,
+	.release =	kcm_release,
+	.bind =		sock_no_bind,
+	.connect =	sock_no_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	kcm_accept,
+	.getname =	sock_no_getname,
+	.poll =		kcm_poll,
+	.ioctl =	kcm_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	sock_no_setsockopt,
+	.getsockopt =	sock_no_getsockopt,
+	.sendmsg =	kcm_sendmsg,
+	.recvmsg =	kcm_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+/* Create proto operation for kcm sockets */
+static int kcm_create(struct net *net, struct socket *sock,
+		      int protocol, int kern)
+{
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+	struct sock *sk;
+	struct kcm_mux *mux;
+
+	switch (sock->type) {
+	case SOCK_DGRAM:
+		sock->ops = &kcm_dgram_ops;
+		break;
+	default:
+		return -ESOCKTNOSUPPORT;
+	}
+
+	if (protocol != KCMPROTO_CONNECTED)
+		return -EPROTONOSUPPORT;
+
+	sk = sk_alloc(net, PF_KCM, GFP_KERNEL, &kcm_proto, kern);
+	if (!sk)
+		return -ENOMEM;
+
+	/* Allocate a kcm mux, shared between KCM sockets */
+	mux = kmem_cache_zalloc(kcm_muxp, GFP_KERNEL);
+	if (!mux) {
+		sk_free(sk);
+		return -ENOMEM;
+	}
+
+	spin_lock_init(&mux->lock);
+	INIT_LIST_HEAD(&mux->kcm_socks);
+	INIT_LIST_HEAD(&mux->kcm_rx_waiters);
+	INIT_LIST_HEAD(&mux->kcm_tx_waiters);
+
+	INIT_LIST_HEAD(&mux->psocks);
+	INIT_LIST_HEAD(&mux->psocks_ready);
+	INIT_LIST_HEAD(&mux->psocks_avail);
+
+	mux->knet = knet;
+
+	/* Add new MUX to list */
+	mutex_lock(&knet->mutex);
+	list_add_rcu(&mux->kcm_mux_list, &knet->mux_list);
+	knet->count++;
+	mutex_unlock(&knet->mutex);
+
+	/* Init KCM socket */
+	sock_init_data(sock, sk);
+	init_kcm_sock(kcm_sk(sk), mux);
+
+	return 0;
+}
+
+static struct net_proto_family kcm_family_ops = {
+	.family = PF_KCM,
+	.create = kcm_create,
+	.owner  = THIS_MODULE,
+};
+
+static __net_init int kcm_init_net(struct net *net)
+{
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	INIT_LIST_HEAD_RCU(&knet->mux_list);
+	mutex_init(&knet->mutex);
+
+	return 0;
+}
+
+static __net_exit void kcm_exit_net(struct net *net)
+{
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	/* All KCM sockets should be closed at this point, which should mean
+	 * that all multiplexors and psocks have been destroyed.
+	 */
+	WARN_ON(!list_empty(&knet->mux_list));
+}
+
+static struct pernet_operations kcm_net_ops = {
+	.init = kcm_init_net,
+	.exit = kcm_exit_net,
+	.id   = &kcm_net_id,
+	.size = sizeof(struct kcm_net),
+};
+
+static int __init kcm_init(void)
+{
+	int err = -ENOMEM;
+
+	kcm_muxp = kmem_cache_create("kcm_mux_cache",
+				     sizeof(struct kcm_mux), 0,
+				     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+	if (!kcm_muxp)
+		goto fail;
+
+	kcm_psockp = kmem_cache_create("kcm_psock_cache",
+				       sizeof(struct kcm_psock), 0,
+					SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+	if (!kcm_psockp)
+		goto fail;
+
+	kcm_wq = create_singlethread_workqueue("kkcmd");
+	if (!kcm_wq)
+		goto fail;
+
+	err = proto_register(&kcm_proto, 1);
+	if (err)
+		goto fail;
+
+	err = sock_register(&kcm_family_ops);
+	if (err)
+		goto sock_register_fail;
+
+	err = register_pernet_device(&kcm_net_ops);
+	if (err)
+		goto net_ops_fail;
+
+	return 0;
+
+net_ops_fail:
+	sock_unregister(PF_KCM);
+
+sock_register_fail:
+	proto_unregister(&kcm_proto);
+
+fail:
+	kmem_cache_destroy(kcm_muxp);
+	kmem_cache_destroy(kcm_psockp);
+
+	if (kcm_wq)
+		destroy_workqueue(kcm_wq);
+
+	return err;
+}
+
+static void __exit kcm_exit(void)
+{
+	unregister_pernet_device(&kcm_net_ops);
+	sock_unregister(PF_KCM);
+	proto_unregister(&kcm_proto);
+	destroy_workqueue(kcm_wq);
+
+	kmem_cache_destroy(kcm_muxp);
+	kmem_cache_destroy(kcm_psockp);
+}
+
+module_init(kcm_init);
+module_exit(kcm_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(PF_KCM);
+
-- 
1.8.1


* [PATCH RFC 3/3] kcm: Add statistics and proc interfaces
  2015-09-20 22:29 [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
  2015-09-20 22:29 ` [PATCH RFC 1/3] rcu: Add list_next_or_null_rcu Tom Herbert
  2015-09-20 22:29 ` [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module Tom Herbert
@ 2015-09-20 22:29 ` Tom Herbert
  2015-09-21 12:24 ` [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Sowmini Varadhan
  2015-09-22  9:14 ` Thomas Martitz
  4 siblings, 0 replies; 15+ messages in thread
From: Tom Herbert @ 2015-09-20 22:29 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team

This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, as well as counters for the number
of attached/unattached TCP sockets and other error or edge events.

The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/net/kcm.h | 102 ++++++++++++++
 net/kcm/Makefile  |   2 +-
 net/kcm/kcmproc.c | 415 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/kcm/kcmsock.c |  63 +++++++++
 4 files changed, 581 insertions(+), 1 deletion(-)
 create mode 100644 net/kcm/kcmproc.c

diff --git a/include/net/kcm.h b/include/net/kcm.h
index 55ef56b..5076247 100644
--- a/include/net/kcm.h
+++ b/include/net/kcm.h
@@ -11,6 +11,46 @@
 
 extern unsigned int kcm_net_id;
 
+#define KCM_STATS_ADD(stat, count)			\
+	((stat) += (count))
+
+#define KCM_STATS_INCR(stat)				\
+	((stat)++)
+
+struct kcm_psock_stats {
+	unsigned long long rx_msgs;
+	unsigned long long rx_bytes;
+	unsigned long long tx_msgs;
+	unsigned long long tx_bytes;
+	unsigned int rx_aborts;
+	unsigned int rx_mem_fail;
+	unsigned int rx_need_more_hdr;
+	unsigned int rx_bad_hdr_len;
+	unsigned long long reserved;
+	unsigned long long unreserved;
+	unsigned int tx_aborts;
+};
+
+struct kcm_mux_stats {
+	unsigned long long rx_msgs;
+	unsigned long long rx_bytes;
+	unsigned long long tx_msgs;
+	unsigned long long tx_bytes;
+	unsigned int rx_ready_drops;
+	unsigned int tx_no_psock;
+	unsigned int tx_retries;
+	unsigned int psock_attach;
+	unsigned int psock_unattach_rsvd;
+	unsigned int psock_unattach;
+};
+
+struct kcm_stats {
+	unsigned long long rx_msgs;
+	unsigned long long rx_bytes;
+	unsigned long long tx_msgs;
+	unsigned long long tx_bytes;
+};
+
 struct kcm_tx_msg {
 	unsigned int sent;
 	unsigned int fragidx;
@@ -30,6 +70,8 @@ struct kcm_sock {
 	struct kcm_mux *mux;
 	struct list_head kcm_sock_list;
 
+	struct kcm_stats stats;
+
 	/* Transmit */
 	struct kcm_psock *tx_psock;
 	struct work_struct tx_work;
@@ -61,6 +103,10 @@ struct kcm_psock {
 
 	struct list_head psock_list;
 
+	struct kcm_psock_stats stats;
+	unsigned long long saved_tx_bytes;
+	unsigned long long saved_tx_msgs;
+
 	/* Receive */
 	struct sk_buff *rx_skb_head;
 	struct sk_buff **rx_skb_nextp;
@@ -78,6 +124,8 @@ struct kcm_psock {
 /* Per net MUX list */
 struct kcm_net {
 	struct mutex mutex;
+	struct kcm_psock_stats aggregate_psock_stats;
+	struct kcm_mux_stats aggregate_mux_stats;
 	struct list_head mux_list;
 	int count;
 };
@@ -94,6 +142,9 @@ struct kcm_mux {
 	struct list_head psocks;	/* List of all psocks on MUX */
 	int psocks_cnt;		/* Total attached sockets */
 
+	struct kcm_mux_stats stats;
+	struct kcm_psock_stats aggregate_psock_stats;
+
 	/* Receive */
 	struct list_head kcm_rx_waiters; /* KCMs waiting for receiving */
 	struct kcm_sock *last_rx_kcm;
@@ -104,6 +155,57 @@ struct kcm_mux {
 	struct list_head kcm_tx_waiters; /* KCMs waiting for a TX psock */
 };
 
+#ifdef CONFIG_PROC_FS
+int kcm_proc_init(void);
+void kcm_proc_exit(void);
+#else
+static inline int kcm_proc_init(void) { return 0; }
+static inline void kcm_proc_exit(void) { }
+#endif
+
+static inline void aggregate_psock_stats(struct kcm_psock_stats *stats,
+					 struct kcm_psock_stats *agg_stats)
+{
+	/* Save psock statistics in the mux when psock is being unattached. */
+
+#define SAVE_PSOCK_STATS(_stat) (agg_stats->_stat += stats->_stat)
+
+	SAVE_PSOCK_STATS(rx_msgs);
+	SAVE_PSOCK_STATS(rx_bytes);
+	SAVE_PSOCK_STATS(rx_aborts);
+	SAVE_PSOCK_STATS(rx_mem_fail);
+	SAVE_PSOCK_STATS(rx_need_more_hdr);
+	SAVE_PSOCK_STATS(rx_bad_hdr_len);
+	SAVE_PSOCK_STATS(tx_msgs);
+	SAVE_PSOCK_STATS(tx_bytes);
+	SAVE_PSOCK_STATS(reserved);
+	SAVE_PSOCK_STATS(unreserved);
+	SAVE_PSOCK_STATS(tx_aborts);
+
+#undef SAVE_PSOCK_STATS
+}
+
+static inline void aggregate_mux_stats(struct kcm_mux_stats *stats,
+				       struct kcm_mux_stats *agg_stats)
+{
+	/* Save MUX statistics in the aggregate when the MUX is destroyed. */
+
+#define SAVE_MUX_STATS(_stat) (agg_stats->_stat += stats->_stat)
+
+	SAVE_MUX_STATS(rx_msgs);
+	SAVE_MUX_STATS(rx_bytes);
+	SAVE_MUX_STATS(tx_msgs);
+	SAVE_MUX_STATS(tx_bytes);
+	SAVE_MUX_STATS(tx_no_psock);
+	SAVE_MUX_STATS(rx_ready_drops);
+	SAVE_MUX_STATS(psock_attach);
+	SAVE_MUX_STATS(psock_unattach_rsvd);
+	SAVE_MUX_STATS(psock_unattach);
+
+#undef SAVE_MUX_STATS
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* __NET_KCM_H_ */
diff --git a/net/kcm/Makefile b/net/kcm/Makefile
index cb525f7..7125613 100644
--- a/net/kcm/Makefile
+++ b/net/kcm/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_AF_KCM) += kcm.o
 
-kcm-y := kcmsock.o
+kcm-y := kcmsock.o kcmproc.o
diff --git a/net/kcm/kcmproc.c b/net/kcm/kcmproc.c
new file mode 100644
index 0000000..6334973
--- /dev/null
+++ b/net/kcm/kcmproc.c
@@ -0,0 +1,415 @@
+#include <linux/in.h>
+#include <linux/inet.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/proc_fs.h>
+#include <linux/rculist.h>
+#include <linux/seq_file.h>
+#include <linux/socket.h>
+#include <net/inet_sock.h>
+#include <net/kcm.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/tcp.h>
+
+#ifdef CONFIG_PROC_FS
+struct kcm_seq_muxinfo {
+	char				*name;
+	const struct file_operations	*seq_fops;
+	struct seq_operations		seq_ops;
+};
+
+static struct kcm_mux *kcm_get_first(struct seq_file *seq)
+{
+	struct net *net = seq_file_net(seq);
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	return list_first_or_null_rcu(&knet->mux_list,
+				      struct kcm_mux, kcm_mux_list);
+}
+
+static struct kcm_mux *kcm_get_next(struct kcm_mux *mux)
+{
+	struct kcm_net *knet = mux->knet;
+
+	return list_next_or_null_rcu(&knet->mux_list, &mux->kcm_mux_list,
+				     struct kcm_mux, kcm_mux_list);
+}
+
+static struct kcm_mux *kcm_get_idx(struct seq_file *seq, loff_t pos)
+{
+	struct net *net = seq_file_net(seq);
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+	struct kcm_mux *m;
+
+	list_for_each_entry(m, &knet->mux_list, kcm_mux_list) {
+		if (!pos)
+			return m;
+		--pos;
+	}
+	return NULL;
+}
+
+static void *kcm_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	void *p;
+
+	if (v == SEQ_START_TOKEN)
+		p = kcm_get_first(seq);
+	else
+		p = kcm_get_next(v);
+	++*pos;
+	return p;
+}
+
+static void *kcm_seq_start(struct seq_file *seq, loff_t *pos)
+	__acquires(rcu)
+{
+	rcu_read_lock();
+
+	if (!*pos)
+		return SEQ_START_TOKEN;
+	else
+		return kcm_get_idx(seq, *pos - 1);
+}
+
+static void kcm_seq_stop(struct seq_file *seq, void *v)
+	__releases(rcu)
+{
+	rcu_read_unlock();
+}
+
+struct kcm_proc_mux_state {
+	struct seq_net_private p;
+	int idx;
+};
+
+static int kcm_seq_open(struct inode *inode, struct file *file)
+{
+	struct kcm_seq_muxinfo *muxinfo = PDE_DATA(inode);
+	int err;
+
+	err = seq_open_net(inode, file, &muxinfo->seq_ops,
+			   sizeof(struct kcm_proc_mux_state));
+	if (err < 0)
+		return err;
+	return 0;
+}
+
+static void kcm_format_mux_header(struct seq_file *seq)
+{
+	struct net *net = seq_file_net(seq);
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	seq_printf(seq,
+		   "*** KCM statistics (%d MUX) ****\n",
+		   knet->count);
+
+	seq_printf(seq,
+		   "%-8s %-10s %-16s %-10s %-16s %-8s %-8s %s",
+		   "Object",
+		   "RX-Msgs",
+		   "RX-Bytes",
+		   "TX-Msgs",
+		   "TX-Bytes",
+		   "Recv-Q",
+		   "Send-Q",
+		   "Status");
+
+	/* XXX: pdsts header stuff here */
+	seq_puts(seq, "\n");
+}
+
+static void kcm_format_sock(struct kcm_sock *kcm, struct seq_file *seq,
+			    int i, int *len)
+{
+	seq_printf(seq,
+		   "%-6s%-6s %-10llu %-16llu %-10llu %-16llu %-8d %-8d",
+		   "", "kcm",
+		   kcm->stats.rx_msgs,
+		   kcm->stats.rx_bytes,
+		   kcm->stats.tx_msgs,
+		   kcm->stats.tx_bytes,
+		   kcm->sk.sk_receive_queue.qlen,
+		   kcm->sk.sk_write_queue.qlen);
+
+	if (kcm->tx_psock)
+		seq_puts(seq, "Psck ");
+
+	if (kcm->tx_wait)
+		seq_puts(seq, "TxWait ");
+
+	if (kcm->rx_wait)
+		seq_puts(seq, "RxWait ");
+
+	seq_puts(seq, "\n");
+}
+
+static void kcm_format_psock(struct kcm_psock *psock, struct seq_file *seq,
+			     int i, int *len)
+{
+	seq_printf(seq,
+		   "%-6s%-6s %-10llu %-16llu %-10llu %-16llu %-8d %-8d",
+		   "", "psock",
+		   psock->stats.rx_msgs,
+		   psock->stats.rx_bytes,
+		   psock->stats.tx_msgs,
+		   psock->stats.tx_bytes,
+		   psock->sk->sk_receive_queue.qlen,
+		   psock->sk->sk_write_queue.qlen);
+
+	if (psock->done)
+		seq_puts(seq, "Done ");
+
+	if (psock->tx_stopped)
+		seq_puts(seq, "TxStop ");
+
+	if (psock->rx_stopped)
+		seq_puts(seq, "RxStop ");
+
+	if (psock->tx_kcm)
+		seq_puts(seq, "Rsvd ");
+
+	if (psock->ready_rx_msg)
+		seq_puts(seq, "RdyRx ");
+
+	seq_puts(seq, "\n");
+}
+
+static void
+kcm_format_mux(struct kcm_mux *mux, loff_t idx, struct seq_file *seq)
+{
+	int i, len;
+	struct kcm_sock *kcm;
+	struct kcm_psock *psock;
+
+	/* mux information */
+	seq_printf(seq,
+		   "%-6s%-6s %-10llu %-16llu %-10llu %-16llu %-8s %-8s",
+		   "mux", "",
+		   mux->stats.rx_msgs,
+		   mux->stats.rx_bytes,
+		   mux->stats.tx_msgs,
+		   mux->stats.tx_bytes,
+		   "-", "-");
+
+	seq_printf(seq, "KCMs: %d, Psocks %d\n",
+		   mux->kcm_socks_cnt, mux->psocks_cnt);
+
+	/* kcm sock information */
+	i = 0;
+	spin_lock_bh(&mux->lock);
+	list_for_each_entry(kcm, &mux->kcm_socks, kcm_sock_list) {
+		kcm_format_sock(kcm, seq, i, &len);
+		i++;
+	}
+	i = 0;
+	list_for_each_entry(psock, &mux->psocks, psock_list) {
+		kcm_format_psock(psock, seq, i, &len);
+		i++;
+	}
+	spin_unlock_bh(&mux->lock);
+}
+
+static int kcm_seq_show(struct seq_file *seq, void *v)
+{
+	struct kcm_proc_mux_state *mux_state;
+
+	mux_state = seq->private;
+	if (v == SEQ_START_TOKEN) {
+		mux_state->idx = 0;
+		kcm_format_mux_header(seq);
+	} else {
+		kcm_format_mux(v, mux_state->idx, seq);
+		mux_state->idx++;
+	}
+	return 0;
+}
+
+static const struct file_operations kcm_seq_fops = {
+	.owner		= THIS_MODULE,
+	.open		= kcm_seq_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release_net,
+};
+
+static struct kcm_seq_muxinfo kcm_seq_muxinfo = {
+	.name		= "kcm",
+	.seq_fops	= &kcm_seq_fops,
+	.seq_ops	= {
+		.show	= kcm_seq_show,
+		.start	= kcm_seq_start,
+		.next	= kcm_seq_next,
+		.stop	= kcm_seq_stop,
+	}
+};
+
+static int kcm_proc_register(struct net *net, struct kcm_seq_muxinfo *muxinfo)
+{
+	struct proc_dir_entry *p;
+	int rc = 0;
+
+	p = proc_create_data(muxinfo->name, S_IRUGO, net->proc_net,
+			     muxinfo->seq_fops, muxinfo);
+	if (!p)
+		rc = -ENOMEM;
+	return rc;
+}
+
+static void kcm_proc_unregister(struct net *net,
+				struct kcm_seq_muxinfo *muxinfo)
+{
+	remove_proc_entry(muxinfo->name, net->proc_net);
+}
+
+static int kcm_stats_seq_show(struct seq_file *seq, void *v)
+{
+	struct kcm_psock_stats psock_stats;
+	struct kcm_mux_stats mux_stats;
+	struct kcm_mux *mux;
+	struct kcm_psock *psock;
+	struct net *net = seq->private;
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	memset(&mux_stats, 0, sizeof(mux_stats));
+	memset(&psock_stats, 0, sizeof(psock_stats));
+
+	mutex_lock(&knet->mutex);
+
+	aggregate_mux_stats(&knet->aggregate_mux_stats, &mux_stats);
+	aggregate_psock_stats(&knet->aggregate_psock_stats,
+			      &psock_stats);
+
+	list_for_each_entry_rcu(mux, &knet->mux_list, kcm_mux_list) {
+		spin_lock_bh(&mux->lock);
+		aggregate_mux_stats(&mux->stats, &mux_stats);
+		aggregate_psock_stats(&mux->aggregate_psock_stats,
+				      &psock_stats);
+		list_for_each_entry(psock, &mux->psocks, psock_list)
+			aggregate_psock_stats(&psock->stats, &psock_stats);
+		spin_unlock_bh(&mux->lock);
+	}
+
+	mutex_unlock(&knet->mutex);
+
+	seq_printf(seq,
+		   "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s %-10s\n",
+		   "MUX",
+		   "RX-Msgs",
+		   "RX-Bytes",
+		   "TX-Msgs",
+		   "TX-Bytes",
+		   "TX-NoPsock",
+		   "TX-Retries",
+		   "Attach",
+		   "Unattach",
+		   "UnattchRsvd",
+		   "RX-RdyDrops");
+
+	seq_printf(seq,
+		   "%-8s %-10llu %-16llu %-10llu %-16llu %-10u %-10u %-10u %-10u %-10u %-10u\n",
+		   "",
+		   mux_stats.rx_msgs,
+		   mux_stats.rx_bytes,
+		   mux_stats.tx_msgs,
+		   mux_stats.tx_bytes,
+		   mux_stats.tx_no_psock,
+		   mux_stats.tx_retries,
+		   mux_stats.psock_attach,
+		   mux_stats.psock_unattach_rsvd,
+		   mux_stats.psock_unattach,
+		   mux_stats.rx_ready_drops);
+
+	seq_printf(seq,
+		   "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s %-10s %-10s\n",
+		   "Psock",
+		   "RX-Msgs",
+		   "RX-Bytes",
+		   "TX-Msgs",
+		   "TX-Bytes",
+		   "Reserved",
+		   "Unreserved",
+		   "RX-Aborts",
+		   "RX-MemFail",
+		   "RX-NeedMor",
+		   "RX-BadLen",
+		   "TX-Aborts");
+
+	seq_printf(seq,
+		   "%-8s %-10llu %-16llu %-10llu %-16llu %-10llu %-10llu %-10u %-10u %-10u %-10u %-10u\n",
+		   "",
+		   psock_stats.rx_msgs,
+		   psock_stats.rx_bytes,
+		   psock_stats.tx_msgs,
+		   psock_stats.tx_bytes,
+		   psock_stats.reserved,
+		   psock_stats.unreserved,
+		   psock_stats.rx_aborts,
+		   psock_stats.rx_mem_fail,
+		   psock_stats.rx_need_more_hdr,
+		   psock_stats.rx_bad_hdr_len,
+		   psock_stats.tx_aborts);
+
+	return 0;
+}
+
+static int kcm_stats_seq_open(struct inode *inode, struct file *file)
+{
+	return single_open_net(inode, file, kcm_stats_seq_show);
+}
+
+static const struct file_operations kcm_stats_seq_fops = {
+	.owner   = THIS_MODULE,
+	.open    = kcm_stats_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = single_release_net,
+};
+
+static int kcm_proc_init_net(struct net *net)
+{
+	int err;
+
+	if (!proc_create("kcm_stats", S_IRUGO, net->proc_net,
+			 &kcm_stats_seq_fops)) {
+		err = -ENOMEM;
+		goto out_kcm_stats;
+	}
+
+	err = kcm_proc_register(net, &kcm_seq_muxinfo);
+	if (err)
+		goto out_kcm;
+
+	return 0;
+
+out_kcm:
+	remove_proc_entry("kcm_stats", net->proc_net);
+out_kcm_stats:
+	return err;
+}
+
+static void kcm_proc_exit_net(struct net *net)
+{
+	kcm_proc_unregister(net, &kcm_seq_muxinfo);
+	remove_proc_entry("kcm_stats", net->proc_net);
+}
+
+static struct pernet_operations kcm_net_ops = {
+	.init = kcm_proc_init_net,
+	.exit = kcm_proc_exit_net,
+};
+
+int __init kcm_proc_init(void)
+{
+	return register_pernet_subsys(&kcm_net_ops);
+}
+
+void __exit kcm_proc_exit(void)
+{
+	unregister_pernet_subsys(&kcm_net_ops);
+}
+
+#endif /* CONFIG_PROC_FS */
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 0240ce3..5508274 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -53,6 +53,7 @@ static void kcm_abort_rx_psock(struct kcm_psock *psock, int err,
 		return;
 
 	psock->rx_stopped = 1;
+	KCM_STATS_INCR(psock->stats.rx_aborts);
 
 	if (!skb)
 		return;
@@ -93,6 +94,7 @@ static void kcm_abort_tx_psock(struct kcm_psock *psock, int err,
 	}
 
 	psock->tx_stopped = 1;
+	KCM_STATS_INCR(psock->stats.tx_aborts);
 
 	if (!psock->tx_kcm) {
 		/* Take off psocks_avail list */
@@ -144,6 +146,9 @@ static bool new_rx_msg(struct kcm_psock *psock, struct sk_buff *head)
 		return false;
 	}
 
+	KCM_STATS_ADD(mux->stats.rx_bytes, kcm_rx_msg(head)->full_len);
+	KCM_STATS_INCR(mux->stats.rx_msgs);
+
 	if (list_empty(&mux->kcm_rx_waiters)) {
 		psock->ready_rx_msg = head;
 
@@ -263,6 +268,7 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 
 			skb = alloc_skb(0, GFP_ATOMIC);
 			if (!skb) {
+				KCM_STATS_INCR(psock->stats.rx_mem_fail);
 				desc->error = -ENOMEM;
 				return 0;
 			}
@@ -277,6 +283,7 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 		} else {
 			err = skb_unclone(head, GFP_ATOMIC);
 			if (err) {
+				KCM_STATS_INCR(psock->stats.rx_mem_fail);
 				desc->error = err;
 				return 0;
 			}
@@ -289,11 +296,13 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 		/* Always clone since we will consume something */
 		skb = skb_clone(orig_skb, GFP_ATOMIC);
 		if (!skb) {
+			KCM_STATS_INCR(psock->stats.rx_mem_fail);
 			desc->error = -ENOMEM;
 			break;
 		}
 
 		if (!pskb_pull(skb, offset + eaten)) {
+			KCM_STATS_INCR(psock->stats.rx_mem_fail);
 			kfree_skb(skb);
 			desc->error = -ENOMEM;
 			break;
@@ -308,6 +317,7 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 		/* Need to trim should be rare */
 		err = pskb_trim(skb, orig_len - eaten);
 		if (err) {
+			KCM_STATS_INCR(psock->stats.rx_mem_fail);
 			kfree_skb(skb);
 			desc->error = err;
 			break;
@@ -342,11 +352,13 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 
 			if (!len) {
 				/* Need more header to determine length */
+				KCM_STATS_INCR(psock->stats.rx_need_more_hdr);
 				break;
 			} else if (len <= head->len - skb->len) {
 				/* Length must be into new skb (and also
 				 * greater than zero)
 				 */
+				KCM_STATS_INCR(psock->stats.rx_bad_hdr_len);
 				desc->error = -EPROTO;
 				psock->rx_skb_head = NULL;
 				kcm_abort_rx_psock(psock, EPROTO, head);
@@ -376,6 +388,7 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 
 		/* Hurray, we have a new message! */
 		psock->rx_skb_head = NULL;
+		KCM_STATS_INCR(psock->stats.rx_msgs);
 
 		if (new_rx_msg(psock, head)) {
 			/* Message was held at psock */
@@ -383,6 +396,8 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 		}
 	}
 
+	KCM_STATS_ADD(psock->stats.rx_bytes, eaten);
+
 	return eaten;
 }
 
@@ -512,6 +527,9 @@ static struct kcm_psock *reserve_psock(struct kcm_sock *kcm,
 		}
 		kcm->tx_psock = psock;
 		psock->tx_kcm = kcm;
+		KCM_STATS_INCR(psock->stats.reserved);
+		psock->saved_tx_bytes = psock->stats.tx_bytes;
+		psock->saved_tx_msgs = psock->stats.tx_msgs;
 	} else if (kcm->tx_wait) {
 		err = -EAGAIN;
 	} else {
@@ -521,6 +539,7 @@ static struct kcm_psock *reserve_psock(struct kcm_sock *kcm,
 			kcm->tx_wait = 1;
 			err = -EAGAIN;
 		} else {
+			KCM_STATS_INCR(mux->stats.tx_no_psock);
 			err = -EPIPE;
 		}
 	}
@@ -548,6 +567,9 @@ static void psock_now_avail(struct kcm_psock *psock)
 		kcm->tx_wait = 0;
 		kcm->tx_psock = psock;
 		psock->tx_kcm = kcm;
+		KCM_STATS_INCR(psock->stats.reserved);
+		psock->saved_tx_bytes = psock->stats.tx_bytes;
+		psock->saved_tx_msgs = psock->stats.tx_msgs;
 		queue_work(kcm_wq, &kcm->tx_work);
 	}
 }
@@ -563,10 +585,16 @@ static void unreserve_psock(struct kcm_sock *kcm)
 
 	spin_lock_bh(&mux->lock);
 
+	KCM_STATS_ADD(mux->stats.tx_bytes,
+		      psock->stats.tx_bytes - psock->saved_tx_bytes);
+	mux->stats.tx_msgs +=
+		psock->stats.tx_msgs - psock->saved_tx_msgs;
+
 	WARN_ON(kcm->tx_wait);
 
 	kcm->tx_psock = NULL;
 	psock->tx_kcm = NULL;
+	KCM_STATS_INCR(psock->stats.unreserved);
 
 	if (unlikely(psock->done || psock->tx_stopped)) {
 		if (psock->done) {
@@ -590,6 +618,15 @@ static void unreserve_psock(struct kcm_sock *kcm)
 	spin_unlock_bh(&mux->lock);
 }
 
+static void kcm_report_tx_retry(struct kcm_sock *kcm)
+{
+	struct kcm_mux *mux = kcm->mux;
+
+	spin_lock_bh(&mux->lock);
+	KCM_STATS_INCR(mux->stats.tx_retries);
+	spin_unlock_bh(&mux->lock);
+}
+
 /* Write any messages ready on the kcm socket.  Called with kcm sock lock
  * held.  Return bytes actually sent or error.
  */
@@ -616,6 +653,7 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
 		 * it and we'll retry the message.
 		 */
 		unreserve_psock(kcm);
+		kcm_report_tx_retry(kcm);
 		txm->sent = 0;
 	} else if (txm->sent) {
 		/* Send of first skbuff in queue already in progress */
@@ -690,12 +728,14 @@ do_frag:
 				unreserve_psock(kcm);
 
 				txm->sent = 0;
+				kcm_report_tx_retry(kcm);
 
 				goto try_again;
 			}
 
 			sent += ret;
 			frag_offset += ret;
+			KCM_STATS_ADD(psock->stats.tx_bytes, ret);
 			if (frag_offset < frag->size) {
 				/* Not finished with this frag */
 				goto do_frag;
@@ -717,6 +757,7 @@ do_frag:
 		kfree_skb(head);
 		sk->sk_wmem_queued -= sent;
 		total_sent += sent;
+		KCM_STATS_INCR(psock->stats.tx_msgs);
 	} while ((head = skb_peek(&sk->sk_write_queue)));
 out:
 	if (!head) {
@@ -901,6 +942,9 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 
 	/* Everything seems okay */
 
+	KCM_STATS_ADD(kcm->stats.tx_bytes, total_len);
+	KCM_STATS_INCR(kcm->stats.tx_msgs);
+
 	release_sock(sk);
 	return total_len;
 
@@ -987,6 +1031,8 @@ static int kcm_recvmsg(struct socket *sock,
 	copied = len;
 	if (likely(!(flags & MSG_PEEK))) {
 		/* Finished with message */
+		KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
+		KCM_STATS_INCR(kcm->stats.rx_msgs);
 		skb_unlink(skb, &sk->sk_receive_queue);
 		kfree_skb(skb);
 	}
@@ -1058,6 +1104,7 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
 
 	/* Finished initialization, now add the psock to the MUX. */
 	spin_lock_bh(&mux->lock);
+	KCM_STATS_INCR(mux->stats.psock_attach);
 	list_add_tail(&psock->psock_list, &mux->psocks);
 	mux->psocks_cnt++;
 	psock_now_avail(psock);
@@ -1154,14 +1201,19 @@ static void kcm_unattach(struct kcm_psock *psock)
 		list_del(&psock->psock_ready_list);
 		kfree_skb(psock->ready_rx_msg);
 		psock->ready_rx_msg = NULL;
+		KCM_STATS_INCR(mux->stats.rx_ready_drops);
 	}
 
+	KCM_STATS_INCR(mux->stats.psock_unattach);
+	aggregate_psock_stats(&psock->stats, &mux->aggregate_psock_stats);
+
 	if (psock->tx_kcm) {
 		/* psock was reserved.  Just mark it finished and we will clean
 		 * up in the kcm paths, we need kcm lock which can not be
 		 * acquired here.
 		 */
 		psock->done = 1;
+		KCM_STATS_INCR(mux->stats.psock_unattach_rsvd);
 		spin_unlock_bh(&mux->lock);
 		kcm_abort_tx_psock(psock, EPIPE, true, true);
 	} else {
@@ -1283,6 +1335,9 @@ static void release_mux(struct kcm_mux *mux)
 		return;
 
 	mutex_lock(&knet->mutex);
+	aggregate_mux_stats(&mux->stats, &knet->aggregate_mux_stats);
+	aggregate_psock_stats(&mux->aggregate_psock_stats,
+			      &knet->aggregate_psock_stats);
 	list_del_rcu(&mux->kcm_mux_list);
 	knet->count--;
 	mutex_unlock(&knet->mutex);
@@ -1529,8 +1584,15 @@ static int __init kcm_init(void)
 	if (err)
 		goto net_ops_fail;
 
+	err = kcm_proc_init();
+	if (err)
+		goto proc_init_fail;
+
 	return 0;
 
+proc_init_fail:
+	unregister_pernet_device(&kcm_net_ops);
+
 net_ops_fail:
 	sock_unregister(PF_KCM);
 
@@ -1549,6 +1611,7 @@ fail:
 
 static void __exit kcm_exit(void)
 {
+	kcm_proc_exit();
 	unregister_pernet_device(&kcm_net_ops);
 	sock_unregister(PF_KCM);
 	proto_unregister(&kcm_proto);
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)
  2015-09-20 22:29 [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
                   ` (2 preceding siblings ...)
  2015-09-20 22:29 ` [PATCH RFC 3/3] kcm: Add statistics and proc interfaces Tom Herbert
@ 2015-09-21 12:24 ` Sowmini Varadhan
  2015-09-21 17:33   ` Tom Herbert
  2015-09-22  9:14 ` Thomas Martitz
  4 siblings, 1 reply; 15+ messages in thread
From: Sowmini Varadhan @ 2015-09-21 12:24 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, kernel-team

On (09/20/15 15:29), Tom Herbert wrote:
> 
> Kernel Connection Multiplexor (KCM) is a facility that provides a
> message based interface over TCP for generic application protocols.
> The motivation for this is based on the observation that although
> TCP is byte stream transport protocol with no concept of message
> boundaries, a common use case is to implement a framed application
> layer protocol running over TCP. To date, most TCP stacks offer
> byte stream API for applications, which places the burden of message
> delineation, message I/O operation atomicity, and load balancing
> in the application. With KCM an application can efficiently send
> and receive application protocol messages over TCP using a
> datagram interface.

A lot of this design is very similar to the PF_RDS/RDS-TCP
design. There too, we have a PF_RDS dgram socket (that already 
supports SEQPACKET semantics today) that can be tunneled over TCP.

The biggest design difference that I see in your proposal is 
that you are using BPF so presumably the demux has more flexibility
than RDS, which does the demux based on RDS port numbers?

Would it make sense to build your solution on top of RDS,
rather than re-invent solutions for many of the challenges
that one encounters when building a dgram-over-stream hybrid
socket (see "lessons learned" list below)?

Some things that were not clear to me from the patch-set:

The doc states that we re-assemble packets to the "stated length" -
but how will the receiver know the "stated length"? 
(fwiw, RDS figures that out from the len field in the RDS header,
and elsewhere I think you allude to some similar encaps
header - is that a correct understanding?)

not clear from the diagram: Is there one TCP socket per kcm-socket? 
what is the relation (one-one, many-one etc.)  between a kcm-socket and
a psock?  How does the ksock-psock-tcp-sock association get set up? 

the notes say one can "accept()" over a kcm socket- but "accept()"
is itself a connection-oriented concept- one does not accept() on
a dgram socket. So what exactly does this mean, and why not just
use the well-defined TCP socket semantics at that point (with something
like XDR for message boundary marking)?

In the "fwiw" bucket of lessons learned from RDS..  please ignore if
you were already aware of these- 

In the case of RDS, since multiple rds/dgram sockets share a single TCP
socket, some issues that have to be dealt with are

- congestion/starvation: we don't want TCP to start advertising
  zero-window because one dgram socket pair has flooded the pipe
  and the peer is not reading. So the RDS protocol has port-congestion
  RDS control plane messages that track congestion at the RDS port.

- imposes some constraints on the TCP send side- if sock1 and sock2
  are sharing a tcp socket, and both are sending dgrams over the 
  stream, dgrams from sock1 may get interleaved  (see comments above
  rds_send_xmit() for a note on how rds deals with this). There are ways
  to fan this out over multiple tcp sockets (and I'm working on those,
  to improve the scaling), but just a note that there is some complexity
  to be dealt with here. Not sure if this was considered in the "KCM
  sockets" section in patch2..

- in general the "dgram-over-stream" hybrid has some peculiar issues. E.g.,
  dgram APIs like BINDTODEVICE and IP_PKTINFO cannot be applied
  to the underlying stream. In the typical use case for RDS (database 
  clusters) there's a reasonable workaround for this using network
  namespaces to define bundles of outgoing interfaces, but that solution
  may not always be workable for other use-cases. Thus it might actually
  be more obvious to simply use tcp sockets (and use something like XDR
  for message boundary markers on the stream).

--Sowmini
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)
  2015-09-21 12:24 ` [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Sowmini Varadhan
@ 2015-09-21 17:33   ` Tom Herbert
  2015-09-21 21:26     ` Sowmini Varadhan
  0 siblings, 1 reply; 15+ messages in thread
From: Tom Herbert @ 2015-09-21 17:33 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team

Hi Sowmini,

Thanks for your comments, some replies are in line.

> A lot of this design is very similar to the PF_RDS/RDS-TCP
> design. There too, we have a PF_RDS dgram socket (that already
> supports SEQPACKET semantics today) that can be tunneled over TCP.
>
> The biggest design difference that I see in your proposal is
> that you are using BPF so presumably the demux has more flexibility
> than RDS, which does the demux based on RDS port numbers?
>
I did look a bit at RDS. Major differences with KCM are:

- KCM does not implement any specific protocol in the kernel. Parsing
in receive is accomplished using BPF which allows protocol parsing to
be programmed from userspace.
- Connection management is done in userspace. This is particularly
important when connections need to switch into a protocol mode, like
when doing HTTP/2, Web sockets, SPDY, etc. over port 80.

> Would it make sense to build your solution on top of RDS,
> rather than re-invent solutions for many of the challenges
> that one encounters when building a dgram-over-stream hybrid
> socket (see "lessons learned" list below)?

There might be some points of leverage, but as I pointed out the
primary goal of KCM is the multiplexing and datagram interface over
TCP not application protocol implementation in the kernel. It might be
interesting if there were a common protocol generic library to handle
the user interface.

>
> Some things that were not clear to me from the patch-set:
>
> The doc states that we re-assemble packets to the "stated length" -
> but how will the receiver know the "stated length"?

BPF program returns the length of the next message. In my testing so
far I've been using HTTP/2, which defines a frame format whose first 3
bytes are the payload length field. The BPF program (using LLVM/Clang--
thanks Alexei!) is just:

int bpf_prog1(struct __sk_buff *skb)
{
     return (load_word(skb, 0) >> 8) + 9;
}

> (fwiw, RDS figures that out from the header len in RDS,
> and elsewhere I think you allude to some similar encaps
> header - is that a correct understanding?)
>
KCM does not define any encaps header, it is intended to support
existing ones. For instance, BPF code to get length from an RDS
message would be:

int bpf_prog1(struct __sk_buff *skb)
{
     return load_word(skb, 16) + 40;
}

> not clear from the diagram: Is there one TCP socket per kcm-socket?
> what is the relation (one-one, many-one etc.)  between a kcm-socket and
> a psock?  How does the ksock-psock-tcp-sock association get set up?
>
Each multiplexor is logically one destination. At the top multiple KCM
sockets allow concurrent operations in userspace, at the bottom
multiple TCP connections allow for load balancing. An application
controls construction of the multiplexor and would presumably create a
multiplexor for each peer. See Documentation/net/kcm.txt for the
details on interfaces for plumbing.

> the notes say one can "accept()" over a kcm socket- but "accept()"
> is itself a connection-oriented concept- one does not accept() on
> a dgram socket. So what exactly does this mean, and why not just
> use the well-defined TCP socket semantics at that point (with something
> like XDR for message boundary marking)?
>
The accept method is overloaded on KCM sockets to do the socket
cloning operation. This is unrelated to TCP semantics; connection
management is performed on TCP sockets (i.e. before being attached to
a KCM multiplexor).

> In the "fwiw" bucket of lessons learned from RDS..  please ignore if
> you were already aware of these-
>
> In the case of RDS, since multiple rds/dgram sockets share a single TCP
> socket, some issues that have to be dealt with are
>
> - congestion/starvation: we don't want TCP to start advertising
>   zero-window because one dgram socket pair has flooded the pipe
>   and the peer is not reading. So the RDS protocol has port-congestion
>   RDS control plane messages that track congestion at the RDS port.
>
In KCM all upper sockets are equivalent, so there is no HOL blocking
on receive or transmit. A message received on a multiplexor can be
steered to any socket that is receiving. Conceivably, we could
implement some message affinity, for instance sending an RPC reply to
same socket that made the request, but even that I think should only
be best effort to avoid having to deal with blocking.

> - imposes some constraints on the TCP send side- if sock1 and sock2
>   are sharing a tcp socket, and both are sending dgrams over the
>   stream, dgrams from sock1 may get interleaved  (see comments above
>   rds_send_xmit() for a note on how rds deals with this). There are ways
>   to fan this out over multiple tcp sockets (and I'm working on those,
>   to improve the scaling), but just a note that there is some complexity
>   to be dealt with here. Not sure if this was considered in the "KCM
>   sockets" section in patch2..
>
Writing and reading messages atomically is a critical operation of the
multiplexor. This is implemented using a reservation model (see
reserve_psock, unreserve_psock).

> - in general the "dgram-over-stream" hybrid has some peculiar issues. E.g.,
>   dgram APIs like BINDTODEVICE and IP_PKTINFO cannot be applied
>   to the underlying stream. In the typical use case for RDS (database
>   clusters) there's a reasonable workaround for this using network
>   namespaces to define bundles of outgoing interfaces, but that solution
>   may not always be workable for other use-cases. Thus it might actually
>   be more obvious to simply use tcp sockets (and use something like XDR
>   for message boundary markers on the stream).
>
My intent is to add an "unconnected" mode to KCM which would allow
connections to different destinations (represented by connection
groups) to be attached to the same MUX. Destinations would be
specified by some sort of AF_KCM sockaddr.

Thanks,
Tom

> --Sowmini
>
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)
  2015-09-21 17:33   ` Tom Herbert
@ 2015-09-21 21:26     ` Sowmini Varadhan
  2015-09-21 22:36       ` Tom Herbert
  0 siblings, 1 reply; 15+ messages in thread
From: Sowmini Varadhan @ 2015-09-21 21:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team

On (09/21/15 10:33), Tom Herbert wrote:
> >
> > Some things that were not clear to me from the patch-set:
> >
> > The doc states that we re-assemble packets to the "stated length" -
> > but how will the receiver know the "stated length"?
> 
> BPF program returns the length of the next message. In my testing so
> far I've been using HTTP/2, which defines a frame format whose first 3
> bytes are the payload length field. The BPF program (using LLVM/Clang--
> thanks Alexei!) is just:

Maybe I don't see something about the mux/demux here (I have to
take a closer look at reserve_psock/unreserve_psock), but
will every TCP segment have a 3-byte length in the payload?

Not every TCP segment in the RDS-TCP case will have an RDS header
(hence the comments before rds_send_xmit()), so applying the BPF filter
to a TCP segment holding some "from-the-middle" piece of the RDS dgram
may not be possible.

> > the notes say one can "accept()" over a kcm socket- but "accept()"
> > is itself a connection-oriented concept- one does not accept() on
    :
> The accept method is overloaded on KCM sockets to do the socket
> cloning operation. This is unrelated to TCP semantics, connection
> management is performed on TCP sockets (i.e. before being attached to
> a KCM multiplexor).

If possible, it might be better to use some other
glibc-func/name/syscall/sockopt/whatever
for this, rather than overloading accept().. feels like that would
keep the semantics cleaner, and probably less likely to trip 
up on accept code in the kernel..

--Sowmini

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)
  2015-09-21 21:26     ` Sowmini Varadhan
@ 2015-09-21 22:36       ` Tom Herbert
  2015-09-21 22:53         ` Sowmini Varadhan
  0 siblings, 1 reply; 15+ messages in thread
From: Tom Herbert @ 2015-09-21 22:36 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team

On Mon, Sep 21, 2015 at 2:26 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (09/21/15 10:33), Tom Herbert wrote:
>> >
>> > Some things that were not clear to me from the patch-set:
>> >
>> > The doc states that we re-assemble packets to the "stated length" -
>> > but how will the receiver know the "stated length"?
>>
>> BPF program returns the length of the next message. In my testing so
>> far I've been using HTTP/2, which defines a frame format whose first 3
>> bytes are the payload length field. The BPF program (using LLVM/Clang--
>> thanks Alexei!) is just:
>
> Maybe I don't see something about the mux/demux here (I have to
> take a closer look at reserve_psock/unreserve_psock), but
> will every TCP segment have a 3-byte length in the payload?
>
No, there is no provision in TCP that application layer headers align
with TCP segments or that message boundaries are respected with TCP
segments. What we need to do, which you're probably doing for RDS, is
to do message delineation on the stream as a sequence of:

1) Read protocol header to determine message length (BPF used here)
2) Read data up to the length of the message
3) Deliver message
4) Goto #1 (i.e. process next message in the stream).

> Not every TCP segment in the RDS-TCP case will have an RDS header
> (hence the comments before rds_send_xmit()), so applying the BPF filter
> to a TCP segment holding some "from-the-middle" piece of the RDS dgram
> may not be possible.
>
>> > the notes say one can "accept()" over a kcm socket- but "accept()"
>> > is itself a connection-oriented concept- one does not accept() on
>     :
>> The accept method is overloaded on KCM sockets to do the socket
>> cloning operation. This is unrelated to TCP semantics, connection
>> management is performed on TCP sockets (i.e. before being attached to
>> a KCM multiplexor).
>
> If possible, it might be better to use some other
> glibc-func/name/syscall/sockopt/whatever
> for this, rather than overloading accept().. feels like that would
> keep the semantics cleaner, and probably less likely to trip
> up on accept code in the kernel..
>
I'll take a look at alternatives, but I sort of think this is okay since
the semantics of accept are defined per protocol (in this case the
"protocol" is KCM).

Thanks,
Tom

> --Sowmini


* Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)
  2015-09-21 22:36       ` Tom Herbert
@ 2015-09-21 22:53         ` Sowmini Varadhan
  0 siblings, 0 replies; 15+ messages in thread
From: Sowmini Varadhan @ 2015-09-21 22:53 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team

On (09/21/15 15:36), Tom Herbert wrote:
> segments. What we need to do, which you're probably doing for RDS, is
> do message delineation on the stream as a sequence of:
> 
> 1) Read protocol header to determine message length (BPF used here)

right, that's what rds does - it first reads sizeof(rds_header) bytes,
and from that figures out the payload len, to stitch each rds dgram
together from intermediate tcp segments.

> 2) Read data up to the length of the message
> 3) Deliver message
> 4) Goto #1 (i.e. process next message in the stream).

Thanks for the rest of the responses.

--Sowmini


* Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)
  2015-09-20 22:29 [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
                   ` (3 preceding siblings ...)
  2015-09-21 12:24 ` [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Sowmini Varadhan
@ 2015-09-22  9:14 ` Thomas Martitz
  2015-09-22 16:46   ` Tom Herbert
  4 siblings, 1 reply; 15+ messages in thread
From: Thomas Martitz @ 2015-09-22  9:14 UTC (permalink / raw)
  To: Tom Herbert, davem, netdev; +Cc: kernel-team

Am 21.09.2015 um 00:29 schrieb Tom Herbert:
> Kernel Connection Multiplexor (KCM) is a facility that provides a
> message based interface over TCP for generic application protocols.
> The motivation for this is based on the observation that although
> TCP is byte stream transport protocol with no concept of message
> boundaries, a common use case is to implement a framed application
> layer protocol running over TCP. To date, most TCP stacks offer
> byte stream API for applications, which places the burden of message
> delineation, message I/O operation atomicity, and load balancing
> in the application. With KCM an application can efficiently send
> and receive application protocol messages over TCP using a
> datagram interface.
>


Hello Tom,

on a general note I'm wondering what this offers over the existing SCTP 
support. SCTP offers reliable message exchange and multihoming already.

On a specific note, I'm wondering if the dup(2) family of system calls
isn't a lot more suitable/natural instead of overloading accept.

Best regards.


* Re: [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module
  2015-09-20 22:29 ` [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module Tom Herbert
@ 2015-09-22 16:26   ` Alexei Starovoitov
  2015-09-22 17:26     ` Tom Herbert
  2015-09-23  9:36   ` Thomas Graf
  1 sibling, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2015-09-22 16:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, kernel-team

On Sun, Sep 20, 2015 at 03:29:20PM -0700, Tom Herbert wrote:
> +Attaching of transport sockets to a multiplexor is performed by calling
> +ioctl on a KCM socket for the multiplexor. e.g.:
> +
> +  /* From linux/kcm.h */
> +  struct kcm_attach {
> +        int fd;
> +        int bpf_type;
> +        union {
> +                int bpf_fd;
> +                struct sock_fprog fprog;
> +        };
> +  };
> +
> +  struct kcm_attach info;
> +
> +  memset(&info, 0, sizeof(info));
> +
> +  info.fd = tcpfd;
> +  info.bpf_type = KCM_BPF_TYPE_PROG;
> +  info.fprog = bpf_prog;
> +
> +  ioctl(kcmfd, SIOCKCMATTACH, &info);
> +
> +The kcm_attach structure contains:
> +  fd: file descriptor for the TCP socket being attached
> +  bpf_type: type of BPF program to be loaded; this is either:
> +    KCM_BPF_TYPE_PROG: program loaded directly from user space
> +    KCM_BPF_TYPE_FD: compiled program to be loaded from the specified
> +                     file descriptor (see BPF LLVM and Clang)
> +  fprog: pointer to the user space program to load
> +  bpf_fd: file descriptor for compiled program download

Interesting approach!
I would only suggest to drop support for classic BPF.
It's usable to return frame length of http2, but it won't be
able to parse protocols where fields are little endian.
Also it doesn't scale, since new cBPF program would be created
for every KCM socket, whereas with eBPF we can use single program
for all KCM sockets via single FD.

btw, did you consider to use BPF not only for frame length, but
also to select KCM socket ? For example for http2 it can pick
a socket based on stream id, providing affinity and
further improving performance ?
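A hypothetical sketch of that idea (names and framing are mine, not from the patch set): pull the 31-bit stream identifier out of the 9-byte HTTP/2 frame header and reduce it to a socket index, so every frame of a given stream lands on the same KCM socket.

```c
#include <stdint.h>

/* The HTTP/2 frame header is 9 bytes: 3 bytes length, 1 byte type,
 * 1 byte flags, then a 4-byte stream identifier whose top bit is
 * reserved, leaving a 31-bit stream id. */
static uint32_t h2_stream_id(const uint8_t *hdr)
{
	return ((uint32_t)(hdr[5] & 0x7f) << 24) |
	       ((uint32_t)hdr[6] << 16) |
	       ((uint32_t)hdr[7] << 8) |
	       (uint32_t)hdr[8];
}

/* Map a frame to one of nsocks KCM sockets; the mapping is stable
 * per stream id, giving the affinity described above. */
static unsigned int pick_socket(const uint8_t *hdr, unsigned int nsocks)
{
	return h2_stream_id(hdr) % nsocks;
}
```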


* Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)
  2015-09-22  9:14 ` Thomas Martitz
@ 2015-09-22 16:46   ` Tom Herbert
  0 siblings, 0 replies; 15+ messages in thread
From: Tom Herbert @ 2015-09-22 16:46 UTC (permalink / raw)
  To: Thomas Martitz
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team

On Tue, Sep 22, 2015 at 2:14 AM, Thomas Martitz <kugel@rockbox.org> wrote:
> Am 21.09.2015 um 00:29 schrieb Tom Herbert:
>>
>> Kernel Connection Multiplexor (KCM) is a facility that provides a
>> message based interface over TCP for generic application protocols.
>> The motivation for this is based on the observation that although
>> TCP is byte stream transport protocol with no concept of message
>> boundaries, a common use case is to implement a framed application
>> layer protocol running over TCP. To date, most TCP stacks offer
>> byte stream API for applications, which places the burden of message
>> delineation, message I/O operation atomicity, and load balancing
>> in the application. With KCM an application can efficiently send
>> and receive application protocol messages over TCP using a
>> datagram interface.
>>
>
>
> Hello Tom,
>
> on a general note I'm wondering what this offers over the existing SCTP
> support. SCTP offers reliable message exchange and multihoming already.
>
Hi Thomas,

The idea of KCM is to provide an improved interface that is usable
with currently deployed protocols without invoking any change on the
wire whatsoever. Deploying SCTP to replace TCP use cases would entail
a major operations change for us, AFAIK hardware support is scant, and
we probably still can't reliably use it on the Internet (you can
Google why SCTP is not widely deployed). I would point out that two of
the four limitations of TCP listed in RFC4960 should be addressed by
KCM :-)

> On a specific note, I'm wondering if the dup(2) familiy of system calls
> isn't a lot more suitable/natural instead of overloading accept.
>
dup duplicates a file descriptor, not a socket. Socket operations
on duplicated file descriptors operate on the same socket, need to
contend for the socket lock, etc. This is a similar rationale as to
why we need SO_REUSEPORT instead of just dup'ing listener fds.

Thanks,
Tom

> Best regards.


* Re: [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module
  2015-09-22 16:26   ` Alexei Starovoitov
@ 2015-09-22 17:26     ` Tom Herbert
  2015-09-22 18:41       ` Alexei Starovoitov
  0 siblings, 1 reply; 15+ messages in thread
From: Tom Herbert @ 2015-09-22 17:26 UTC (permalink / raw)
  To: Alexei Starovoitov, Alex Gartrell
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team

On Tue, Sep 22, 2015 at 9:26 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sun, Sep 20, 2015 at 03:29:20PM -0700, Tom Herbert wrote:
>> +Attaching of transport sockets to a multiplexor is performed by calling
>> +ioctl on a KCM socket for the multiplexor. e.g.:
>> +
>> +  /* From linux/kcm.h */
>> +  struct kcm_attach {
>> +        int fd;
>> +        int bpf_type;
>> +        union {
>> +                int bpf_fd;
>> +                struct sock_fprog fprog;
>> +        };
>> +  };
>> +
>> +  struct kcm_attach info;
>> +
>> +  memset(&info, 0, sizeof(info));
>> +
>> +  info.fd = tcpfd;
>> +  info.bpf_type = KCM_BPF_TYPE_PROG;
>> +  info.fprog = bpf_prog;
>> +
>> +  ioctl(kcmfd, SIOCKCMATTACH, &info);
>> +
>> +The kcm_attach structure contains:
>> +  fd: file descriptor for the TCP socket being attached
>> +  bpf_type: type of BPF program to be loaded; this is either:
>> +    KCM_BPF_TYPE_PROG: program loaded directly from user space
>> +    KCM_BPF_TYPE_FD: compiled program to be loaded from the specified
>> +                     file descriptor (see BPF LLVM and Clang)
>> +  fprog: pointer to the user space program to load
>> +  bpf_fd: file descriptor for compiled program download
>
> Interesting approach!
> I would only suggest to drop support for classic BPF.
> It's usable to return frame length of http2, but it won't be
> able to parse protocols where fields are little endian.
> Also it doesn't scale, since new cBPF program would be created
> for every KCM socket, whereas with eBPF we can use single program
> for all KCM sockets via single FD.
>
Hi Alexei,

That makes sense, but I think there may be some use cases where we'd
like lightweight methods to program filters. Writing C code for BPF is
extremely cool, but integrating LLVM/Clang into our development
environment may be a pain. We also might want to create a different
program for every socket anyway (like from some template with
parameterization). An eBPF assembler and jit support could be useful
(looks like there might be some work started on both of these
already).

> btw, did you consider to use BPF not only for frame length, but
> also to select KCM socket ? For example for http2 it can pick
> a socket based on stream id, providing affinity and
> further improving performance ?
>
Yes. I am thinking that eBPF can set the stream ID/transactional
identifiers in so_mark and then MUX steers to KCM sockets based on
that. As Sowmini pointed out we need to be wary of HOL blocking in
this...

Tom


* Re: [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module
  2015-09-22 17:26     ` Tom Herbert
@ 2015-09-22 18:41       ` Alexei Starovoitov
  0 siblings, 0 replies; 15+ messages in thread
From: Alexei Starovoitov @ 2015-09-22 18:41 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alex Gartrell, David S. Miller, Linux Kernel Network Developers,
	Kernel Team

On Tue, Sep 22, 2015 at 10:26:10AM -0700, Tom Herbert wrote:
> > I would only suggest to drop support for classic BPF.
> > It's usable to return frame length of http2, but it won't be
> > able to parse protocols where fields are little endian.
> > Also it doesn't scale, since new cBPF program would be created
> > for every KCM socket, whereas with eBPF we can use single program
> > for all KCM sockets via single FD.
> >
> Hi Alexei,
> 
> That makes sense, but I think there may be some use cases where we'd
> like lightweight methods to program filters. Writing C code for BPF is
> extremely cool, but integrating LLVM/Clang into our development
> environment may be a pain. We also might want to create a different
> program for every socket anyway (like from some template with
> parameterization). An eBPF assembler and jit support could be useful
> (looks like there might be some work started on both of these
> already).

Coding eBPF with macros, as sock_example.c, test_verifier.c, and
test_bpf.c do, should be enough to start. A user space assembler
to convert text to eBPF binary obviously would be even better.
But the lack of a tiny and standalone assembler today shouldn't be
the reason to include classic BPF as a permanent kernel ABI.
Not sure what you mean by 'jit support'. There is the in-kernel jit,
and there is the jit mode of llvm, where bpf can be taken from
memory without going to an elf file.


* Re: [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module
  2015-09-20 22:29 ` [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module Tom Herbert
  2015-09-22 16:26   ` Alexei Starovoitov
@ 2015-09-23  9:36   ` Thomas Graf
  1 sibling, 0 replies; 15+ messages in thread
From: Thomas Graf @ 2015-09-23  9:36 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, kernel-team

On 09/20/15 at 03:29pm, Tom Herbert wrote:
> This module implement the Kernel Connection Multiplexor.
> 
> Kernel Connection Multiplexor (KCM) is a facility that provides a
> message based interface over TCP for generic application protocols.
> With KCM an application can efficiently send and receive application
> protocol messages over TCP using datagram sockets.
> 
> For more information see the included Documentation/networking/kcm.txt
> 
> Signed-off-by: Tom Herbert <tom@herbertland.com>

This looks great!

> +Cloning KCM sockets
> +-------------------
> +
> +After the first KCM socket is created using the socket call as described
> +above, additional sockets for the multiplexor can be created by cloning
> +a KCM socket. This is accomplished by calling accept on the KCM socket:
> +
> +   newkcmfd = accept(kcmfd, NULL, 0)

This looks a bit ugly.

> +  ioctl(kcmfd, SIOCKCMATTACH, &info);

Use setsockopt() instead?

> +/* Process a new message. If there is no KCM socket waiting for a message
> + * hold it in the psock. Returns true if message is held this way, false
> + * otherwise.
> + */
> +static bool new_rx_msg(struct kcm_psock *psock, struct sk_buff *head)
> +{
> +	struct kcm_mux *mux = psock->mux;
> +	struct kcm_sock *kcm = NULL;
> +	struct sock *sk;
> +
> +	spin_lock_bh(&mux->lock);
> +
> +	if (WARN_ON(psock->ready_rx_msg)) {
> +		spin_unlock_bh(&mux->lock);
> +		kfree_skb(head);
> +		return false;
> +	}
> +
> +	if (list_empty(&mux->kcm_rx_waiters)) {
> +		psock->ready_rx_msg = head;
> +
> +		list_add_tail(&psock->psock_ready_list,
> +			      &mux->psocks_ready);
> +
> +		spin_unlock_bh(&mux->lock);
> +		return true;
> +	}
> +
> +	kcm = list_first_entry(&mux->kcm_rx_waiters,
> +			       struct kcm_sock, wait_rx_list);

Per CPU list of waiting sockets?



Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-20 22:29 [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
2015-09-20 22:29 ` [PATCH RFC 1/3] rcu: Add list_next_or_null_rcu Tom Herbert
2015-09-20 22:29 ` [PATCH RFC 2/3] kcm: Kernel Connection Multiplexor module Tom Herbert
2015-09-22 16:26   ` Alexei Starovoitov
2015-09-22 17:26     ` Tom Herbert
2015-09-22 18:41       ` Alexei Starovoitov
2015-09-23  9:36   ` Thomas Graf
2015-09-20 22:29 ` [PATCH RFC 3/3] kcm: Add statistics and proc interfaces Tom Herbert
2015-09-21 12:24 ` [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM) Sowmini Varadhan
2015-09-21 17:33   ` Tom Herbert
2015-09-21 21:26     ` Sowmini Varadhan
2015-09-21 22:36       ` Tom Herbert
2015-09-21 22:53         ` Sowmini Varadhan
2015-09-22  9:14 ` Thomas Martitz
2015-09-22 16:46   ` Tom Herbert
