Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 1/6] fsl/fman: enable FMan Keygen
From: David Miller @ 2017-08-23  4:46 UTC (permalink / raw)
  To: madalin.bucur; +Cc: netdev, linuxppc-dev, linux-kernel
In-Reply-To: <AM5PR0402MB2691CF2AC9555829EC89C59CEC850@AM5PR0402MB2691.eurprd04.prod.outlook.com>

From: Madalin-cristian Bucur <madalin.bucur@nxp.com>
Date: Wed, 23 Aug 2017 04:36:56 +0000

> The struct fman is only visible in the fman file, the fman port
> module uses struct fman as an opaque pointer, thus this export.

Don't use that programming model.

Export the datastructure properly to it's users.

This abstraction scheme is so wasteful and costly.

^ permalink raw reply

* [PATCH v2 net-next 1/1] hv_sock: implements Hyper-V transport for Virtual Sockets (AF_VSOCK)
From: Dexuan Cui @ 2017-08-23  4:52 UTC (permalink / raw)
  To: 'Jorgen S. Hansen', 'Stefan Hajnoczi',
	'davem@davemloft.net', 'netdev@vger.kernel.org'
  Cc: 'Michal Kubecek', 'joe@perches.com',
	'olaf@aepfle.de', Stephen Hemminger,
	'jasowang@redhat.com', KY Srinivasan, Haiyang Zhang,
	'Dave Scott', 'linux-kernel@vger.kernel.org',
	'apw@canonical.com', 'Rolf Neugebauer',
	'gregkh@linuxfoundation.org', 'Marcelo Cerri',
	'devel@linuxdriverproject.org',
	'Vitaly Kuznetsov', 'George Zhang',
	'Dan Carpenter'


Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It uses VMBus ringbuffer as the
transportation layer.

With hv_sock, applications between the host (Windows 10, Windows Server
2016 or newer) and the guest can talk with each other using the traditional
socket APIs.

More info about Hyper-V Sockets is available here:

"Make your own integration services":
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-guide/make-integration-service

The patch implements the necessary support in Linux guest by introducing a new
vsock transport for AF_VSOCK.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Andy King <acking@vmware.com>
Cc: Dmitry Torokhov <dtor@vmware.com>
Cc: George Zhang <georgezhang@vmware.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: Reilly Grant <grantr@vmware.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Rolf Neugebauer <rolf.neugebauer@docker.com>
Cc: Marcelo Cerri <marcelo.cerri@canonical.com>
---

Changes in v2:
	Fixed hvs_stream_allow() for cid and the comments
		Thanks Stefan Hajnoczi!

	Added proper locking when using vsock_enqueue_accept()
		Thanks Stefan Hajnoczi and Jorgen Hansen!

	The previous v1 patch is not needed any more:
 	[PATCH net-next 2/3] vsock: fix vsock_dequeue/enqueue_accept race

	Another previous v1 patch is being discussed in another thread:
	    vsock: only load vmci transport on VMware hypervisor by default


 MAINTAINERS                      |   1 +
 net/vmw_vsock/Kconfig            |  12 +
 net/vmw_vsock/Makefile           |   3 +
 net/vmw_vsock/hyperv_transport.c | 888 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 904 insertions(+)
 create mode 100644 net/vmw_vsock/hyperv_transport.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 2db0f8c..dae0573 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6279,6 +6279,7 @@ F:	drivers/net/hyperv/
 F:	drivers/scsi/storvsc_drv.c
 F:	drivers/uio/uio_hv_generic.c
 F:	drivers/video/fbdev/hyperv_fb.c
+F:	net/vmw_vsock/hyperv_transport.c
 F:	include/linux/hyperv.h
 F:	tools/hv/
 F:	Documentation/ABI/stable/sysfs-bus-vmbus
diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
index a7ae09d..3f52929 100644
--- a/net/vmw_vsock/Kconfig
+++ b/net/vmw_vsock/Kconfig
@@ -46,3 +46,15 @@ config VIRTIO_VSOCKETS_COMMON
 	  This option is selected by any driver which needs to access
 	  the virtio_vsock.  The module will be called
 	  vmw_vsock_virtio_transport_common.
+
+config HYPERV_VSOCKETS
+	tristate "Hyper-V transport for Virtual Sockets"
+	depends on VSOCKETS && HYPERV
+	help
+	  This module implements a Hyper-V transport for Virtual Sockets.
+
+	  Enable this transport if your Virtual Machine host supports Virtual
+	  Sockets over Hyper-V VMBus.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called hv_sock. If unsure, say N.
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index 09fc2eb..e63d574 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -2,6 +2,7 @@ obj-$(CONFIG_VSOCKETS) += vsock.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
+obj-$(CONFIG_HYPERV_VSOCKETS) += hv_sock.o
 
 vsock-y += af_vsock.o af_vsock_tap.o vsock_addr.o
 
@@ -11,3 +12,5 @@ vmw_vsock_vmci_transport-y += vmci_transport.o vmci_transport_notify.o \
 vmw_vsock_virtio_transport-y += virtio_transport.o
 
 vmw_vsock_virtio_transport_common-y += virtio_transport_common.o
+
+hv_sock-y += hyperv_transport.o
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
new file mode 100644
index 0000000..597fb25
--- /dev/null
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -0,0 +1,888 @@
+/*
+ * Hyper-V transport for vsock
+ *
+ * Hyper-V Sockets supplies a byte-stream based communication mechanism
+ * between the host and the VM. This driver implements the necessary
+ * support in the VM by introducing the new vsock transport.
+ *
+ * Copyright (c) 2017, Microsoft Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+#include <linux/module.h>
+#include <linux/vmalloc.h>
+#include <linux/hyperv.h>
+#include <net/sock.h>
+#include <net/af_vsock.h>
+
+/* The host side's design of the feature requires 6 exact 4KB pages for
+ * recv/send rings respectively -- this is suboptimal considering memory
+ * consumption, however unluckily we have to live with it, before the
+ * host comes up with a better design in the future.
+ */
+#define PAGE_SIZE_4K		4096
+#define RINGBUFFER_HVS_RCV_SIZE (PAGE_SIZE_4K * 6)
+#define RINGBUFFER_HVS_SND_SIZE (PAGE_SIZE_4K * 6)
+
+/* The MTU is 16KB per the host side's design */
+#define HVS_MTU_SIZE		(1024 * 16)
+
+struct vmpipe_proto_header {
+	u32 pkt_type;
+	u32 data_size;
+};
+
+/* For recv, we use the VMBus in-place packet iterator APIs to directly copy
+ * data from the ringbuffer into the userspace buffer.
+ */
+struct hvs_recv_buf {
+	/* The header before the payload data */
+	struct vmpipe_proto_header hdr;
+
+	/* The payload */
+	u8 data[HVS_MTU_SIZE];
+};
+
+/* We can send up to HVS_MTU_SIZE bytes of payload to the host, but let's use
+ * a small size, i.e. HVS_SEND_BUF_SIZE, to minimize the dynamically-allocated
+ * buffer, because tests show there is no significant performance difference.
+ *
+ * Note: the buffer can be eliminated in the future when we add new VMBus
+ * ringbuffer APIs that allow us to directly copy data from userspace buffer
+ * to VMBus ringbuffer.
+ */
+#define HVS_SEND_BUF_SIZE (PAGE_SIZE_4K - sizeof(struct vmpipe_proto_header))
+
+struct hvs_send_buf {
+	/* The header before the payload data */
+	struct vmpipe_proto_header hdr;
+
+	/* The payload */
+	u8 data[HVS_SEND_BUF_SIZE];
+};
+
+#define HVS_HEADER_LEN	(sizeof(struct vmpacket_descriptor) + \
+			 sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write(), and
+ * __hv_pkt_iter_next().
+ */
+#define VMBUS_PKT_TRAILER	(sizeof(u64))
+
+#define HVS_PKT_LEN(payload_len)	(HVS_HEADER_LEN + \
+					 ALIGN((payload_len), 8) + \
+					 VMBUS_PKT_TRAILER)
+
+/* Per-socket state (accessed via vsk->trans) */
+struct hvsock {
+	struct vsock_sock *vsk;
+
+	uuid_le	vm_srv_id;
+	uuid_le	host_srv_id;
+
+	struct vmbus_channel *chan;
+	struct vmpacket_descriptor *recv_desc;
+
+	/* The length of the payload not delivered to userland yet */
+	u32 recv_data_len;
+	/* The offset of the payload */
+	u32 recv_data_off;
+
+	/* Have we sent the zero-length packet (FIN)? */
+	unsigned long fin_sent;
+};
+
+/* In the VM, we support Hyper-V Sockets with AF_VSOCK, and the endpoint is
+ * <cid, port> (see struct sockaddr_vm). Note: cid is not really used here:
+ * when we write apps to connect to the host, we can only use VMADDR_CID_ANY
+ * or VMADDR_CID_HOST (both are equivalent) as the remote cid, and when we
+ * write apps to bind() & listen() in the VM, we can only use VMADDR_CID_ANY
+ * as the local cid.
+ *
+ * On the host, Hyper-V Sockets are supported by Winsock AF_HYPERV:
+ * https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-
+ * guide/make-integration-service, and the endpoint is <VmID, ServiceId> with
+ * the below sockaddr:
+ *
+ * struct SOCKADDR_HV
+ * {
+ *    ADDRESS_FAMILY Family;
+ *    USHORT Reserved;
+ *    GUID VmId;
+ *    GUID ServiceId;
+ * };
+ * Note: VmID is not used by Linux VM and actually it isn't transmitted via
+ * VMBus, because here it's obvious the host and the VM can easily identify
+ * each other. Though the VmID is useful on the host, especially in the case
+ * of Windows container, Linux VM doesn't need it at all.
+ *
+ * To make use of the AF_VSOCK infrastructure in Linux VM, we have to limit
+ * the available GUID space of SOCKADDR_HV so that we can create a mapping
+ * between AF_VSOCK port and SOCKADDR_HV Service GUID. The rule of writing
+ * Hyper-V Sockets apps on the host and in Linux VM is:
+ *
+ ****************************************************************************
+ * The only valid Service GUIDs, from the perspectives of both the host and *
+ * Linux VM, that can be connected by the other end, must conform to this   *
+ * format: <port>-facb-11e6-bd58-64006a7986d3, and the "port" must be in    *
+ * this range [0, 0x7FFFFFFF].                                              *
+ ****************************************************************************
+ *
+ * When we write apps on the host to connect(), the GUID ServiceID is used.
+ * When we write apps in Linux VM to connect(), we only need to specify the
+ * port and the driver will form the GUID and use that to request the host.
+ *
+ * From the perspective of Linux VM:
+ * 1. the local ephemeral port (i.e. the local auto-bound port when we call
+ * connect() without explicit bind()) is generated by __vsock_bind_stream(),
+ * and the range is [1024, 0xFFFFFFFF).
+ * 2. the remote ephemeral port (i.e. the auto-generated remote port for
+ * a connect request initiated by the host's connect()) is generated by
+ * hvs_remote_addr_init() and the range is [0x80000000, 0xFFFFFFFF).
+ */
+
+#define MAX_LISTEN_PORT			((u32)0x7FFFFFFF)
+#define MAX_VM_LISTEN_PORT		MAX_LISTEN_PORT
+#define MAX_HOST_LISTEN_PORT		MAX_LISTEN_PORT
+#define MIN_HOST_EPHEMERAL_PORT		(MAX_HOST_LISTEN_PORT + 1)
+
+/* 00000000-facb-11e6-bd58-64006a7986d3 */
+static const uuid_le srv_id_template =
+	UUID_LE(0x00000000, 0xfacb, 0x11e6, 0xbd, 0x58,
+		0x64, 0x00, 0x6a, 0x79, 0x86, 0xd3);
+
+static inline bool is_valid_srv_id(const uuid_le *id)
+{
+	return !memcmp(&id->b[4], &srv_id_template.b[4], sizeof(uuid_le) - 4);
+}
+
+static inline unsigned int get_port_by_srv_id(const uuid_le *svr_id)
+{
+	return *((unsigned int *)svr_id);
+}
+
+static inline void hvs_addr_init(struct sockaddr_vm *addr,
+				 const uuid_le *svr_id)
+{
+	unsigned int port = get_port_by_srv_id(svr_id);
+
+	vsock_addr_init(addr, VMADDR_CID_ANY, port);
+}
+
+static inline void hvs_remote_addr_init(struct sockaddr_vm *remote,
+					struct sockaddr_vm *local)
+{
+	static u32 host_ephemeral_port = MIN_HOST_EPHEMERAL_PORT;
+	struct sock *sk;
+
+	vsock_addr_init(remote, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+
+	while (1) {
+		/* Wrap around ? */
+		if (host_ephemeral_port < MIN_HOST_EPHEMERAL_PORT ||
+		    host_ephemeral_port == VMADDR_PORT_ANY)
+			host_ephemeral_port = MIN_HOST_EPHEMERAL_PORT;
+
+		remote->svm_port = host_ephemeral_port++;
+
+		sk = vsock_find_connected_socket(remote, local);
+		if (!sk) {
+			/* Found an available ephemeral port */
+			return;
+		}
+
+		/* Release refcnt got in vsock_find_connected_socket */
+		sock_put(sk);
+	}
+}
+
+static void hvs_set_channel_pending_send_size(struct vmbus_channel *chan)
+{
+	set_channel_pending_send_size(chan,
+				      HVS_PKT_LEN(HVS_SEND_BUF_SIZE));
+
+	/* See hvs_stream_has_space(): we must make sure the host has seen
+	 * the new pending send size, before we can re-check the writable
+	 * bytes.
+	 */
+	virt_mb();
+}
+
+static void hvs_clear_channel_pending_send_size(struct vmbus_channel *chan)
+{
+	set_channel_pending_send_size(chan, 0);
+
+	/* Ditto */
+	virt_mb();
+}
+
+static bool hvs_channel_readable(struct vmbus_channel *chan)
+{
+	u32 readable = hv_get_bytes_to_read(&chan->inbound);
+
+	/* 0-size payload means FIN */
+	return readable >= HVS_PKT_LEN(0);
+}
+
+static int hvs_channel_readable_payload(struct vmbus_channel *chan)
+{
+	u32 readable = hv_get_bytes_to_read(&chan->inbound);
+
+	if (readable > HVS_PKT_LEN(0)) {
+		/* At least we have 1 byte to read. We don't need to return
+		 * the exact readable bytes: see vsock_stream_recvmsg() ->
+		 * vsock_stream_has_data().
+		 */
+		return 1;
+	}
+
+	if (readable == HVS_PKT_LEN(0)) {
+		/* 0-size payload means FIN */
+		return 0;
+	}
+
+	/* No payload or FIN */
+	return -1;
+}
+
+static inline size_t hvs_channel_writable_bytes(struct vmbus_channel *chan)
+{
+	u32 writeable = hv_get_bytes_to_write(&chan->outbound);
+	size_t ret;
+
+	/* The ringbuffer mustn't be 100% full, and we should reserve a
+	 * zero-length-payload packet for the FIN: see hv_ringbuffer_write()
+	 * and hvs_shutdown().
+	 */
+	if (writeable <= HVS_PKT_LEN(1) + HVS_PKT_LEN(0))
+		return 0;
+
+	ret = writeable - HVS_PKT_LEN(1) - HVS_PKT_LEN(0);
+
+	return round_down(ret, 8);
+}
+
+static int hvs_send_data(struct vmbus_channel *chan,
+			 struct hvs_send_buf *send_buf, size_t to_write)
+{
+	send_buf->hdr.pkt_type = 1;
+	send_buf->hdr.data_size = to_write;
+	return vmbus_sendpacket(chan, &send_buf->hdr,
+				sizeof(send_buf->hdr) + to_write,
+				0, VM_PKT_DATA_INBAND, 0);
+}
+
+static void hvs_channel_cb(void *ctx)
+{
+	struct sock *sk = (struct sock *)ctx;
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct hvsock *hvs = vsk->trans;
+	struct vmbus_channel *chan = hvs->chan;
+
+	if (hvs_channel_readable(chan))
+		sk->sk_data_ready(sk);
+
+	/* See hvs_stream_has_space(): when we reach here, the writable bytes
+	 * may be already less than HVS_PKT_LEN(HVS_SEND_BUF_SIZE).
+	 */
+	if (hv_get_bytes_to_write(&chan->outbound) > 0)
+		sk->sk_write_space(sk);
+}
+
+static void hvs_close_connection(struct vmbus_channel *chan)
+{
+	struct sock *sk = get_per_channel_state(chan);
+	struct vsock_sock *vsk = vsock_sk(sk);
+
+	sk->sk_state = SS_UNCONNECTED;
+	sock_set_flag(sk, SOCK_DONE);
+	vsk->peer_shutdown |= SEND_SHUTDOWN | RCV_SHUTDOWN;
+
+	sk->sk_state_change(sk);
+}
+
+static void hvs_open_connection(struct vmbus_channel *chan)
+{
+	uuid_le *if_instance, *if_type;
+	unsigned char conn_from_host;
+
+	struct sockaddr_vm addr;
+	struct sock *sk, *new = NULL;
+	struct vsock_sock *vnew;
+	struct hvsock *hvs, *hvs_new;
+	int ret;
+
+	if_type = &chan->offermsg.offer.if_type;
+	if_instance = &chan->offermsg.offer.if_instance;
+	conn_from_host = chan->offermsg.offer.u.pipe.user_def[0];
+
+	/* The host or the VM should only listen on a port in
+	 * [0, MAX_LISTEN_PORT]
+	 */
+	if (!is_valid_srv_id(if_type) ||
+	    get_port_by_srv_id(if_type) > MAX_LISTEN_PORT)
+		return;
+
+	hvs_addr_init(&addr, conn_from_host ? if_type : if_instance);
+	sk = vsock_find_bound_socket(&addr);
+	if (!sk)
+		return;
+
+	if ((conn_from_host && sk->sk_state != VSOCK_SS_LISTEN) ||
+	    (!conn_from_host && sk->sk_state != SS_CONNECTING))
+		goto out;
+
+	if (conn_from_host) {
+		if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog)
+			goto out;
+
+		new = __vsock_create(sock_net(sk), NULL, sk, GFP_KERNEL,
+				     sk->sk_type, 0);
+		if (!new)
+			goto out;
+
+		new->sk_state = SS_CONNECTING;
+		vnew = vsock_sk(new);
+		hvs_new = vnew->trans;
+		hvs_new->chan = chan;
+	} else {
+		hvs = vsock_sk(sk)->trans;
+		hvs->chan = chan;
+	}
+
+	set_channel_read_mode(chan, HV_CALL_DIRECT);
+	ret = vmbus_open(chan, RINGBUFFER_HVS_SND_SIZE,
+			 RINGBUFFER_HVS_RCV_SIZE, NULL, 0,
+			 hvs_channel_cb, conn_from_host ? new : sk);
+	if (ret != 0) {
+		if (conn_from_host) {
+			hvs_new->chan = NULL;
+			sock_put(new);
+		} else {
+			hvs->chan = NULL;
+		}
+		goto out;
+	}
+
+	set_per_channel_state(chan, conn_from_host ? new : sk);
+	vmbus_set_chn_rescind_callback(chan, hvs_close_connection);
+
+	if (conn_from_host) {
+		new->sk_state = SS_CONNECTED;
+		sk->sk_ack_backlog++;
+
+		hvs_addr_init(&vnew->local_addr, if_type);
+		hvs_remote_addr_init(&vnew->remote_addr, &vnew->local_addr);
+
+		hvs_new->vm_srv_id = *if_type;
+		hvs_new->host_srv_id = *if_instance;
+
+		vsock_insert_connected(vnew);
+
+		lock_sock(sk);
+		vsock_enqueue_accept(sk, new);
+		release_sock(sk);
+	} else {
+		sk->sk_state = SS_CONNECTED;
+		sk->sk_socket->state = SS_CONNECTED;
+
+		vsock_insert_connected(vsock_sk(sk));
+	}
+
+	sk->sk_state_change(sk);
+
+out:
+	/* Release refcnt obtained when we called vsock_find_bound_socket() */
+	sock_put(sk);
+}
+
+static u32 hvs_get_local_cid(void)
+{
+	return VMADDR_CID_ANY;
+}
+
+static int hvs_sock_init(struct vsock_sock *vsk, struct vsock_sock *psk)
+{
+	struct hvsock *hvs;
+
+	hvs = kzalloc(sizeof(*hvs), GFP_KERNEL);
+	if (!hvs)
+		return -ENOMEM;
+
+	vsk->trans = hvs;
+	hvs->vsk = vsk;
+
+	return 0;
+}
+
+static int hvs_connect(struct vsock_sock *vsk)
+{
+	struct hvsock *h = vsk->trans;
+
+	h->vm_srv_id = srv_id_template;
+	h->host_srv_id = srv_id_template;
+
+	*((u32 *)&h->vm_srv_id) = vsk->local_addr.svm_port;
+	*((u32 *)&h->host_srv_id) = vsk->remote_addr.svm_port;
+
+	return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
+}
+
+static int hvs_shutdown(struct vsock_sock *vsk, int mode)
+{
+	struct vmpipe_proto_header hdr;
+	struct hvs_send_buf *send_buf;
+	struct hvsock *hvs;
+
+	if (!(mode & SEND_SHUTDOWN))
+		return 0;
+
+	hvs = vsk->trans;
+
+	if (test_and_set_bit(0, &hvs->fin_sent))
+		return 0;
+
+	send_buf = (struct hvs_send_buf *)&hdr;
+
+	/* It can't fail: see hvs_channel_writable_bytes(). */
+	(void)hvs_send_data(hvs->chan, send_buf, 0);
+
+	return 0;
+}
+
+static void hvs_release(struct vsock_sock *vsk)
+{
+	struct hvsock *hvs = vsk->trans;
+	struct vmbus_channel *chan = hvs->chan;
+
+	if (chan)
+		hvs_shutdown(vsk, RCV_SHUTDOWN | SEND_SHUTDOWN);
+
+	vsock_remove_sock(vsk);
+}
+
+static void hvs_destruct(struct vsock_sock *vsk)
+{
+	struct hvsock *hvs = vsk->trans;
+	struct vmbus_channel *chan = hvs->chan;
+
+	if (chan)
+		vmbus_hvsock_device_unregister(chan);
+
+	kfree(hvs);
+}
+
+static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
+{
+	return -EOPNOTSUPP;
+}
+
+static int hvs_dgram_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
+			     size_t len, int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static int hvs_dgram_enqueue(struct vsock_sock *vsk,
+			     struct sockaddr_vm *remote, struct msghdr *msg,
+			     size_t dgram_len)
+{
+	return -EOPNOTSUPP;
+}
+
+static bool hvs_dgram_allow(u32 cid, u32 port)
+{
+	return false;
+}
+
+static int hvs_update_recv_data(struct hvsock *hvs)
+{
+	struct hvs_recv_buf *recv_buf;
+	u32 payload_len;
+
+	recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
+	payload_len = recv_buf->hdr.data_size;
+
+	if (payload_len > HVS_MTU_SIZE)
+		return -EIO;
+
+	if (payload_len == 0)
+		hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
+
+	hvs->recv_data_len = payload_len;
+	hvs->recv_data_off = 0;
+
+	return 0;
+}
+
+static ssize_t hvs_stream_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
+				  size_t len, int flags)
+{
+	struct hvsock *hvs = vsk->trans;
+	bool need_refill = !hvs->recv_desc;
+	struct hvs_recv_buf *recv_buf;
+	u32 to_read;
+	int ret;
+
+	if (flags & MSG_PEEK)
+		return -EOPNOTSUPP;
+
+	if (need_refill) {
+		hvs->recv_desc = hv_pkt_iter_first(hvs->chan);
+		ret = hvs_update_recv_data(hvs);
+		if (ret)
+			return ret;
+	}
+
+	recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
+	to_read = min_t(u32, len, hvs->recv_data_len);
+	ret = memcpy_to_msg(msg, recv_buf->data + hvs->recv_data_off, to_read);
+	if (ret != 0)
+		return ret;
+
+	hvs->recv_data_len -= to_read;
+	if (hvs->recv_data_len == 0) {
+		hvs->recv_desc = hv_pkt_iter_next(hvs->chan, hvs->recv_desc);
+		if (hvs->recv_desc) {
+			ret = hvs_update_recv_data(hvs);
+			if (ret)
+				return ret;
+		}
+	} else {
+		hvs->recv_data_off += to_read;
+	}
+
+	return to_read;
+}
+
+static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
+				  size_t len)
+{
+	struct hvsock *hvs = vsk->trans;
+	struct vmbus_channel *chan = hvs->chan;
+	struct hvs_send_buf *send_buf;
+	ssize_t to_write, max_writable, ret;
+
+	BUILD_BUG_ON(sizeof(*send_buf) != PAGE_SIZE_4K);
+
+	send_buf = kmalloc(sizeof(*send_buf), GFP_KERNEL);
+	if (!send_buf)
+		return -ENOMEM;
+
+	max_writable = hvs_channel_writable_bytes(chan);
+	to_write = min_t(ssize_t, len, max_writable);
+	to_write = min_t(ssize_t, to_write, HVS_SEND_BUF_SIZE);
+
+	ret = memcpy_from_msg(send_buf->data, msg, to_write);
+	if (ret < 0)
+		goto out;
+
+	ret = hvs_send_data(hvs->chan, send_buf, to_write);
+	if (ret < 0)
+		goto out;
+
+	ret = to_write;
+out:
+	kfree(send_buf);
+	return ret;
+}
+
+static s64 hvs_stream_has_data(struct vsock_sock *vsk)
+{
+	struct hvsock *hvs = vsk->trans;
+	s64 ret;
+
+	if (hvs->recv_data_len > 0)
+		return 1;
+
+	switch (hvs_channel_readable_payload(hvs->chan)) {
+	case 1:
+		ret = 1;
+		break;
+	case 0:
+		vsk->peer_shutdown |= SEND_SHUTDOWN;
+		ret = 0;
+		break;
+	default: /* -1 */
+		ret = 0;
+		break;
+	}
+
+	return ret;
+}
+
+static s64 hvs_stream_has_space(struct vsock_sock *vsk)
+{
+	struct hvsock *hvs = vsk->trans;
+	struct vmbus_channel *chan = hvs->chan;
+	s64 ret;
+
+	ret = hvs_channel_writable_bytes(chan);
+	if (ret > 0)  {
+		hvs_clear_channel_pending_send_size(chan);
+	} else {
+		/* See hvs_channel_cb() */
+		hvs_set_channel_pending_send_size(chan);
+
+		/* Re-check the writable bytes to avoid race */
+		ret = hvs_channel_writable_bytes(chan);
+		if (ret > 0)
+			hvs_clear_channel_pending_send_size(chan);
+	}
+
+	return ret;
+}
+
+static u64 hvs_stream_rcvhiwat(struct vsock_sock *vsk)
+{
+	return HVS_MTU_SIZE + 1;
+}
+
+static bool hvs_stream_is_active(struct vsock_sock *vsk)
+{
+	struct hvsock *hvs = vsk->trans;
+
+	return hvs->chan != NULL;
+}
+
+static bool hvs_stream_allow(u32 cid, u32 port)
+{
+	/* The host's port range [MIN_HOST_EPHEMERAL_PORT, 0xFFFFFFFF) is
+	 * reserved as ephemeral ports, which are used as the host's ports
+	 * when the host initiates connections.
+	 *
+	 * Perform this check in the guest so an immediate error is produced
+	 * instead of a timeout.
+	 */
+	if (port > MAX_HOST_LISTEN_PORT)
+		return false;
+
+	if (cid == VMADDR_CID_HOST)
+		return true;
+
+	return false;
+}
+
+static
+int hvs_notify_poll_in(struct vsock_sock *vsk, size_t target, bool *readable)
+{
+	struct hvsock *hvs = vsk->trans;
+
+	*readable = hvs_channel_readable(hvs->chan);
+	return 0;
+}
+
+static
+int hvs_notify_poll_out(struct vsock_sock *vsk, size_t target, bool *writable)
+{
+	*writable = hvs_stream_has_space(vsk) > 0;
+
+	return 0;
+}
+
+static
+int hvs_notify_recv_init(struct vsock_sock *vsk, size_t target,
+			 struct vsock_transport_recv_notify_data *d)
+{
+	return 0;
+}
+
+static
+int hvs_notify_recv_pre_block(struct vsock_sock *vsk, size_t target,
+			      struct vsock_transport_recv_notify_data *d)
+{
+	return 0;
+}
+
+static
+int hvs_notify_recv_pre_dequeue(struct vsock_sock *vsk, size_t target,
+				struct vsock_transport_recv_notify_data *d)
+{
+	return 0;
+}
+
+static
+int hvs_notify_recv_post_dequeue(struct vsock_sock *vsk, size_t target,
+				 ssize_t copied, bool data_read,
+				 struct vsock_transport_recv_notify_data *d)
+{
+	return 0;
+}
+
+static
+int hvs_notify_send_init(struct vsock_sock *vsk,
+			 struct vsock_transport_send_notify_data *d)
+{
+	return 0;
+}
+
+static
+int hvs_notify_send_pre_block(struct vsock_sock *vsk,
+			      struct vsock_transport_send_notify_data *d)
+{
+	return 0;
+}
+
+static
+int hvs_notify_send_pre_enqueue(struct vsock_sock *vsk,
+				struct vsock_transport_send_notify_data *d)
+{
+	return 0;
+}
+
+static
+int hvs_notify_send_post_enqueue(struct vsock_sock *vsk, ssize_t written,
+				 struct vsock_transport_send_notify_data *d)
+{
+	return 0;
+}
+
+static void hvs_set_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	/* Ignored. */
+}
+
+static void hvs_set_min_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	/* Ignored. */
+}
+
+static void hvs_set_max_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	/* Ignored. */
+}
+
+static u64 hvs_get_buffer_size(struct vsock_sock *vsk)
+{
+	return -ENOPROTOOPT;
+}
+
+static u64 hvs_get_min_buffer_size(struct vsock_sock *vsk)
+{
+	return -ENOPROTOOPT;
+}
+
+static u64 hvs_get_max_buffer_size(struct vsock_sock *vsk)
+{
+	return -ENOPROTOOPT;
+}
+
+static struct vsock_transport hvs_transport = {
+	.get_local_cid            = hvs_get_local_cid,
+
+	.init                     = hvs_sock_init,
+	.destruct                 = hvs_destruct,
+	.release                  = hvs_release,
+	.connect                  = hvs_connect,
+	.shutdown                 = hvs_shutdown,
+
+	.dgram_bind               = hvs_dgram_bind,
+	.dgram_dequeue            = hvs_dgram_dequeue,
+	.dgram_enqueue            = hvs_dgram_enqueue,
+	.dgram_allow              = hvs_dgram_allow,
+
+	.stream_dequeue           = hvs_stream_dequeue,
+	.stream_enqueue           = hvs_stream_enqueue,
+	.stream_has_data          = hvs_stream_has_data,
+	.stream_has_space         = hvs_stream_has_space,
+	.stream_rcvhiwat          = hvs_stream_rcvhiwat,
+	.stream_is_active         = hvs_stream_is_active,
+	.stream_allow             = hvs_stream_allow,
+
+	.notify_poll_in           = hvs_notify_poll_in,
+	.notify_poll_out          = hvs_notify_poll_out,
+	.notify_recv_init         = hvs_notify_recv_init,
+	.notify_recv_pre_block    = hvs_notify_recv_pre_block,
+	.notify_recv_pre_dequeue  = hvs_notify_recv_pre_dequeue,
+	.notify_recv_post_dequeue = hvs_notify_recv_post_dequeue,
+	.notify_send_init         = hvs_notify_send_init,
+	.notify_send_pre_block    = hvs_notify_send_pre_block,
+	.notify_send_pre_enqueue  = hvs_notify_send_pre_enqueue,
+	.notify_send_post_enqueue = hvs_notify_send_post_enqueue,
+
+	.set_buffer_size          = hvs_set_buffer_size,
+	.set_min_buffer_size      = hvs_set_min_buffer_size,
+	.set_max_buffer_size      = hvs_set_max_buffer_size,
+	.get_buffer_size          = hvs_get_buffer_size,
+	.get_min_buffer_size      = hvs_get_min_buffer_size,
+	.get_max_buffer_size      = hvs_get_max_buffer_size,
+};
+
+static int hvs_probe(struct hv_device *hdev,
+		     const struct hv_vmbus_device_id *dev_id)
+{
+	struct vmbus_channel *chan = hdev->channel;
+
+	hvs_open_connection(chan);
+
+	/* Always return success to suppress the unnecessary error message
+	 * in vmbus_probe(): on error the host will rescind the device in
+	 * 30 seconds and we can do cleanup at that time in
+	 * vmbus_onoffer_rescind().
+	 */
+	return 0;
+}
+
+static int hvs_remove(struct hv_device *hdev)
+{
+	struct vmbus_channel *chan = hdev->channel;
+
+	vmbus_close(chan);
+
+	return 0;
+}
+
+/* This isn't really used. See vmbus_match() and vmbus_probe() */
+static const struct hv_vmbus_device_id id_table[] = {
+	{},
+};
+
+static struct hv_driver hvs_drv = {
+	.name		= "hv_sock",
+	.hvsock		= true,
+	.id_table	= id_table,
+	.probe		= hvs_probe,
+	.remove		= hvs_remove,
+};
+
+static int __init hvs_init(void)
+{
+	int ret;
+
+	if (vmbus_proto_version < VERSION_WIN10)
+		return -ENODEV;
+
+	ret = vmbus_driver_register(&hvs_drv);
+	if (ret != 0)
+		return ret;
+
+	ret = vsock_core_init(&hvs_transport);
+	if (ret) {
+		vmbus_driver_unregister(&hvs_drv);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit hvs_exit(void)
+{
+	vsock_core_exit();
+	vmbus_driver_unregister(&hvs_drv);
+}
+
+module_init(hvs_init);
+module_exit(hvs_exit);
+
+MODULE_DESCRIPTION("Hyper-V Sockets");
+MODULE_VERSION("1.0.0");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(PF_VSOCK);
-- 
2.7.4

^ permalink raw reply related

* Re: Something hitting my total number of connections to the server
From: Akshat Kakkar @ 2017-08-23  5:08 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Eric Dumazet, David Laight, netdev
In-Reply-To: <CADVnQy=57_rc8A93zAcqr9hz1WtWniSoZdtx+-aTiZ6AyjZmQg@mail.gmail.com>

On Tue, Aug 22, 2017 at 5:58 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Aug 22, 2017 at 1:42 AM, Akshat Kakkar <akshat.1984@gmail.com> wrote:
>> There are multiple hosts/clients. All are mainly windows based.
>>
>> Timestamp is not used as my clients mainly are windows based and in
>> that it tcp timestamp is by defauly disabled.
> ...
>> net.ipv4.tcp_tw_reuse=1
>> net.ipv4.tcp_tw_recycle=1
>
> I suspect the problem is there. The net.ipv4.tcp_tw_recycle setting
> should be 0. Running with the value 1 is known to cause buggy behavior
> related to TCP timestamps, and that feature has been removed in kernel
> v4.12.
>
> Can you please re-run your tests with net.ipv4.tcp_tw_recycle=0 or a
> newer kernel?
>
> neal

Thanks for your reply.

I understand that.

But my point is, though tcp timestamp is enabled on the server, but as
client is not using it ... so how come this _bug_ (if any) is
triggered in first place.

^ permalink raw reply

* RE: [PATCH v2 1/6] fsl/fman: enable FMan Keygen
From: Madalin-cristian Bucur @ 2017-08-23  5:16 UTC (permalink / raw)
  To: David Miller
  Cc: netdev@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org, Florinel Iordache
In-Reply-To: <20170822.214645.953115103494976385.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, August 23, 2017 7:47 AM
> To: Madalin-cristian Bucur <madalin.bucur@nxp.com>
> Cc: netdev@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH v2 1/6] fsl/fman: enable FMan Keygen
> 
> From: Madalin-cristian Bucur <madalin.bucur@nxp.com>
> Date: Wed, 23 Aug 2017 04:36:56 +0000
> 
> > The struct fman is only visible in the fman file, the fman port
> > module uses struct fman as an opaque pointer, thus this export.
> 
> Don't use that programming model.
> 
> Export the datastructure properly to it's users.
> 
> This abstraction scheme is so wasteful and costly.

Normally does not come with this cost, it's this case where one of the
sub-modules needs to call into another that gets things complicated.
I'll move struct fman to the header file.

Thanks,
Madalin

^ permalink raw reply

* Re: linux-next: manual merge of the net-next tree with the net tree
From: Ido Schimmel @ 2017-08-23  5:41 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, Networking, Linux-Next Mailing List,
	Linux Kernel Mailing List, Wei Wang, Ido Schimmel, iri Pirko
In-Reply-To: <20170823113105.0bd692ea@canb.auug.org.au>

On Wed, Aug 23, 2017 at 11:31:05AM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the net-next tree got a conflict in:
> 
>   net/ipv6/ip6_fib.c
> 
> between commit:
> 
>   c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
> 
> from the net tree and commit:
> 
>   a460aa83963b ("ipv6: fib: Add helpers to hold / drop a reference on rt6_info")
> 
> from the net-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Looks good to me.

Thanks!

^ permalink raw reply

* [PATCH] dt-binding: net/phy: fix interrupts description
From: Baruch Siach @ 2017-08-23  6:11 UTC (permalink / raw)
  To: Rob Herring, Mark Rutland, David S . Miller
  Cc: devicetree, netdev, Baruch Siach

Commit b053dc5a722ea (powerpc: Refactor device tree binding) split the
Ethernet PHY binding documentation out of the big booting-without-of.txt
file, leaving a dangling reference to "section 2" in the 'interrupts'
property description. Drop that reference, and make the description look
more like the rest.

While at it, make the example interrupt-parent phandle look more like a
real world phandle, and use an IRQ_TYPE_ macro for the 'interrupts'
type.

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
---
 Documentation/devicetree/bindings/net/phy.txt | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt
index b55857696fc3..86ba72af6163 100644
--- a/Documentation/devicetree/bindings/net/phy.txt
+++ b/Documentation/devicetree/bindings/net/phy.txt
@@ -2,11 +2,7 @@ PHY nodes
 
 Required properties:
 
- - interrupts : <a b> where a is the interrupt number and b is a
-   field that represents an encoding of the sense and level
-   information for the interrupt.  This should be encoded based on
-   the information in section 2) depending on the type of interrupt
-   controller you have.
+ - interrupts : interrupt specifier for the sole interrupt.
  - interrupt-parent : the phandle for the interrupt controller that
    services interrupts for this device.
  - reg : The ID number for the phy, usually a small integer
@@ -56,7 +52,7 @@ Example:
 
 ethernet-phy@0 {
 	compatible = "ethernet-phy-id0141.0e90", "ethernet-phy-ieee802.3-c22";
-	interrupt-parent = <40000>;
-	interrupts = <35 1>;
+	interrupt-parent = <&PIC>;
+	interrupts = <35 IRQ_TYPE_EDGE_RISING>;
 	reg = <0>;
 };
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 0/3] nfp: fix SR-IOV deadlock and representor bugs
From: Jakub Kicinski @ 2017-08-23  6:22 UTC (permalink / raw)
  To: netdev; +Cc: oss-drivers, Jakub Kicinski

This series tackles the bug I've already tried to fix in commit 
6d48ceb27af1 ("nfp: allocate a private workqueue for driver work").
I created a separate workqueue to avoid possible deadlock, and
the lockdep error disappeared, coincidentally.  The way workqueues
are operating, separate workqueue doesn't necessarily mean separate
thread of execution.  Luckily we can safely forego the lock.

Second fix changes the order in which vNIC netdevs and representors
are created/destroyed.  The fix is kept small and should be sufficient
for net because of how flower uses representors, a more thorough fix 
will be targeted at net-next.

Third fix avoids leaking mapped frame buffers if FW sent a frame with
unknown portid.

Jakub Kicinski (3):
  nfp: don't hold PF lock while enabling SR-IOV
  nfp: make sure representors are destroyed before their lower netdev
  nfp: avoid buffer leak when representor is missing

 drivers/net/ethernet/netronome/nfp/nfp_main.c      | 16 +++++++---------
 .../net/ethernet/netronome/nfp/nfp_net_common.c    | 10 +++++-----
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c  | 22 ++++++++++++++--------
 3 files changed, 26 insertions(+), 22 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net 1/3] nfp: don't hold PF lock while enabling SR-IOV
From: Jakub Kicinski @ 2017-08-23  6:22 UTC (permalink / raw)
  To: netdev; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20170823062244.26580-1-jakub.kicinski@netronome.com>

Enabling SR-IOV VFs will cause the PCI subsystem to schedule a
work and flush its workqueue.  Since the nfp driver schedules its
own work we can't enable VFs while holding driver load.  Commit
6d48ceb27af1 ("nfp: allocate a private workqueue for driver work")
tried to avoid this deadlock by creating a separate workqueue.
Unfortunately, due to the architecture of workqueue subsystem this
does not guarantee a separate thread of execution.  Luckily
we can simply take pci_enable_sriov() from under the driver lock.

Take pci_disable_sriov() from under the lock too for symmetry.

Fixes: 6d48ceb27af1 ("nfp: allocate a private workqueue for driver work")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index d67969d3e484..3f199db2002e 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -98,21 +98,20 @@ static int nfp_pcie_sriov_enable(struct pci_dev *pdev, int num_vfs)
 	struct nfp_pf *pf = pci_get_drvdata(pdev);
 	int err;
 
-	mutex_lock(&pf->lock);
-
 	if (num_vfs > pf->limit_vfs) {
 		nfp_info(pf->cpp, "Firmware limits number of VFs to %u\n",
 			 pf->limit_vfs);
-		err = -EINVAL;
-		goto err_unlock;
+		return -EINVAL;
 	}
 
 	err = pci_enable_sriov(pdev, num_vfs);
 	if (err) {
 		dev_warn(&pdev->dev, "Failed to enable PCI SR-IOV: %d\n", err);
-		goto err_unlock;
+		return err;
 	}
 
+	mutex_lock(&pf->lock);
+
 	err = nfp_app_sriov_enable(pf->app, num_vfs);
 	if (err) {
 		dev_warn(&pdev->dev,
@@ -129,9 +128,8 @@ static int nfp_pcie_sriov_enable(struct pci_dev *pdev, int num_vfs)
 	return num_vfs;
 
 err_sriov_disable:
-	pci_disable_sriov(pdev);
-err_unlock:
 	mutex_unlock(&pf->lock);
+	pci_disable_sriov(pdev);
 	return err;
 #endif
 	return 0;
@@ -158,10 +156,10 @@ static int nfp_pcie_sriov_disable(struct pci_dev *pdev)
 
 	pf->num_vfs = 0;
 
+	mutex_unlock(&pf->lock);
+
 	pci_disable_sriov(pdev);
 	dev_dbg(&pdev->dev, "Removed VFs.\n");
-
-	mutex_unlock(&pf->lock);
 #endif
 	return 0;
 }
-- 
2.11.0

^ permalink raw reply related

* [PATCH net 2/3] nfp: make sure representors are destroyed before their lower netdev
From: Jakub Kicinski @ 2017-08-23  6:22 UTC (permalink / raw)
  To: netdev; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20170823062244.26580-1-jakub.kicinski@netronome.com>

App start/stop callbacks can perform application initialization.
Unfortunately, flower app started using them for creating and
destroying representors.  This can lead to a situation where
lower vNIC netdev is destroyed while representors still try
to pass traffic.  This will most likely lead to a NULL-dereference
on the lower netdev TX path.

Move the start/stop callbacks, so that representors are created/
destroyed when vNICs are fully initialized.

Fixes: 5de73ee46704 ("nfp: general representor implementation")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
index 5797dbf2b507..1aca4e57bf41 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
@@ -456,10 +456,6 @@ static int nfp_net_pf_app_start(struct nfp_pf *pf)
 {
 	int err;
 
-	err = nfp_net_pf_app_start_ctrl(pf);
-	if (err)
-		return err;
-
 	err = nfp_app_start(pf->app, pf->ctrl_vnic);
 	if (err)
 		goto err_ctrl_stop;
@@ -484,7 +480,6 @@ static void nfp_net_pf_app_stop(struct nfp_pf *pf)
 	if (pf->num_vfs)
 		nfp_app_sriov_disable(pf->app);
 	nfp_app_stop(pf->app);
-	nfp_net_pf_app_stop_ctrl(pf);
 }
 
 static void nfp_net_pci_unmap_mem(struct nfp_pf *pf)
@@ -559,7 +554,7 @@ static int nfp_net_pci_map_mem(struct nfp_pf *pf)
 
 static void nfp_net_pci_remove_finish(struct nfp_pf *pf)
 {
-	nfp_net_pf_app_stop(pf);
+	nfp_net_pf_app_stop_ctrl(pf);
 	/* stop app first, to avoid double free of ctrl vNIC's ddir */
 	nfp_net_debugfs_dir_clean(&pf->ddir);
 
@@ -690,6 +685,7 @@ int nfp_net_pci_probe(struct nfp_pf *pf)
 {
 	struct nfp_net_fw_version fw_ver;
 	u8 __iomem *ctrl_bar, *qc_bar;
+	struct nfp_net *nn;
 	int stride;
 	int err;
 
@@ -766,7 +762,7 @@ int nfp_net_pci_probe(struct nfp_pf *pf)
 	if (err)
 		goto err_free_vnics;
 
-	err = nfp_net_pf_app_start(pf);
+	err = nfp_net_pf_app_start_ctrl(pf);
 	if (err)
 		goto err_free_irqs;
 
@@ -774,12 +770,20 @@ int nfp_net_pci_probe(struct nfp_pf *pf)
 	if (err)
 		goto err_stop_app;
 
+	err = nfp_net_pf_app_start(pf);
+	if (err)
+		goto err_clean_vnics;
+
 	mutex_unlock(&pf->lock);
 
 	return 0;
 
+err_clean_vnics:
+	list_for_each_entry(nn, &pf->vnics, vnic_list)
+		if (nfp_net_is_data_vnic(nn))
+			nfp_net_pf_clean_vnic(pf, nn);
 err_stop_app:
-	nfp_net_pf_app_stop(pf);
+	nfp_net_pf_app_stop_ctrl(pf);
 err_free_irqs:
 	nfp_net_pf_free_irqs(pf);
 err_free_vnics:
@@ -803,6 +807,8 @@ void nfp_net_pci_remove(struct nfp_pf *pf)
 	if (list_empty(&pf->vnics))
 		goto out;
 
+	nfp_net_pf_app_stop(pf);
+
 	list_for_each_entry(nn, &pf->vnics, vnic_list)
 		if (nfp_net_is_data_vnic(nn))
 			nfp_net_pf_clean_vnic(pf, nn);
-- 
2.11.0

^ permalink raw reply related

* [PATCH net 3/3] nfp: avoid buffer leak when representor is missing
From: Jakub Kicinski @ 2017-08-23  6:22 UTC (permalink / raw)
  To: netdev; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20170823062244.26580-1-jakub.kicinski@netronome.com>

When driver receives a muxed frame, but it can't find the representor
netdev it is destined to it will try to "drop" that frame, i.e. reuse
the buffer.  The issue is that the replacement buffer has already been
allocated at this point, and reusing the buffer from received frame
will leak it.  Change the code to put the new buffer on the ring
earlier and not reuse the old buffer (make the buffer parameter
to nfp_net_rx_drop() a NULL).

Fixes: 91bf82ca9eed ("nfp: add support for tx/rx with metadata portid")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 9f77ce038a4a..1ff0c577819e 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1751,6 +1751,10 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 			continue;
 		}
 
+		nfp_net_dma_unmap_rx(dp, rxbuf->dma_addr);
+
+		nfp_net_rx_give_one(dp, rx_ring, new_frag, new_dma_addr);
+
 		if (likely(!meta.portid)) {
 			netdev = dp->netdev;
 		} else {
@@ -1759,16 +1763,12 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 			nn = netdev_priv(dp->netdev);
 			netdev = nfp_app_repr_get(nn->app, meta.portid);
 			if (unlikely(!netdev)) {
-				nfp_net_rx_drop(dp, r_vec, rx_ring, rxbuf, skb);
+				nfp_net_rx_drop(dp, r_vec, rx_ring, NULL, skb);
 				continue;
 			}
 			nfp_repr_inc_rx_stats(netdev, pkt_len);
 		}
 
-		nfp_net_dma_unmap_rx(dp, rxbuf->dma_addr);
-
-		nfp_net_rx_give_one(dp, rx_ring, new_frag, new_dma_addr);
-
 		skb_reserve(skb, pkt_off);
 		skb_put(skb, pkt_len);
 
-- 
2.11.0

^ permalink raw reply related

* Re: XDP redirect measurements, gotchas and tracepoints
From: Michael Chan @ 2017-08-23  6:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, john.fastabend@gmail.com, brouer@redhat.com,
	pstaszewski@itcare.pl, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org, andy@greyhouse.net,
	borkmann@iogearbox.net
In-Reply-To: <CAKgT0Uf=ZfVeK-wtvNxSyGEVZ3UseUOHiP3ZOg-SrzmqsR=LtQ@mail.gmail.com>

On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>>
>> Right, but it's conceivable to add an API to "return" the buffer to
>> the input device, right?
>
> You could, it is just added complexity. "just free the buffer" in
> ixgbe usually just amounts to one atomic operation to decrement the
> total page count since page recycling is already implemented in the
> driver. You still would have to unmap the buffer regardless of if you
> were recycling it or not so all you would save is 1.000015259 atomic
> operations per packet. The fraction is because once every 64K uses we
> have to bulk update the count on the page.
>

If the buffer is returned to the input device, the input device can
keep the DMA mapping.  All it needs to do is to dma_sync it back to
the input device when the buffer is returned.

^ permalink raw reply

* [PATCH RESEND v12 7/8] wireless: ipw2200: Replace PCI pool old API
From: Romain Perier @ 2017-08-23  7:16 UTC (permalink / raw)
  To: Stanislav Yakovlev, Kalle Valo
  Cc: linux-wireless, netdev, linux-kernel, Romain Perier

The PCI pool API is deprecated. This commit replaces the PCI pool old
API by the appropriate function with the DMA pool API.

Signed-off-by: Romain Perier <romain.perier@collabora.com>
Reviewed-by: Peter Senna Tschudin <peter.senna@collabora.com>
Acked-by: Stanislav Yakovlev <stas.yakovlev@gmail.com>
---

Note: As requested by Kalle Valo, resend and add linux-wireless to Cc:

 drivers/net/wireless/intel/ipw2x00/ipw2200.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/intel/ipw2x00/ipw2200.c b/drivers/net/wireless/intel/ipw2x00/ipw2200.c
index c311b1a994c1..c57c567add4d 100644
--- a/drivers/net/wireless/intel/ipw2x00/ipw2200.c
+++ b/drivers/net/wireless/intel/ipw2x00/ipw2200.c
@@ -3209,7 +3209,7 @@ static int ipw_load_firmware(struct ipw_priv *priv, u8 * data, size_t len)
 	struct fw_chunk *chunk;
 	int total_nr = 0;
 	int i;
-	struct pci_pool *pool;
+	struct dma_pool *pool;
 	void **virts;
 	dma_addr_t *phys;
 
@@ -3226,9 +3226,10 @@ static int ipw_load_firmware(struct ipw_priv *priv, u8 * data, size_t len)
 		kfree(virts);
 		return -ENOMEM;
 	}
-	pool = pci_pool_create("ipw2200", priv->pci_dev, CB_MAX_LENGTH, 0, 0);
+	pool = dma_pool_create("ipw2200", &priv->pci_dev->dev, CB_MAX_LENGTH, 0,
+			       0);
 	if (!pool) {
-		IPW_ERROR("pci_pool_create failed\n");
+		IPW_ERROR("dma_pool_create failed\n");
 		kfree(phys);
 		kfree(virts);
 		return -ENOMEM;
@@ -3253,7 +3254,7 @@ static int ipw_load_firmware(struct ipw_priv *priv, u8 * data, size_t len)
 
 		nr = (chunk_len + CB_MAX_LENGTH - 1) / CB_MAX_LENGTH;
 		for (i = 0; i < nr; i++) {
-			virts[total_nr] = pci_pool_alloc(pool, GFP_KERNEL,
+			virts[total_nr] = dma_pool_alloc(pool, GFP_KERNEL,
 							 &phys[total_nr]);
 			if (!virts[total_nr]) {
 				ret = -ENOMEM;
@@ -3297,9 +3298,9 @@ static int ipw_load_firmware(struct ipw_priv *priv, u8 * data, size_t len)
 	}
  out:
 	for (i = 0; i < total_nr; i++)
-		pci_pool_free(pool, virts[i], phys[i]);
+		dma_pool_free(pool, virts[i], phys[i]);
 
-	pci_pool_destroy(pool);
+	dma_pool_destroy(pool);
 	kfree(phys);
 	kfree(virts);
 
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net-next v4] openvswitch: enable NSH support
From: Jiri Benc @ 2017-08-23  7:26 UTC (permalink / raw)
  To: Yang, Yi
  Cc: netdev@vger.kernel.org, dev@openvswitch.org, blp@ovn.org,
	e@erig.me, jan.scheurich@ericsson.com
In-Reply-To: <20170822093825.GA94865@cran64.bj.intel.com>

On Tue, 22 Aug 2017 17:38:26 +0800, Yang, Yi wrote:
> I checked GSO support code, we have two cases to handle, one case is to
> output NSH packet to VxLAN-gpe port, the other case is to output NSH packet
> to Ethernet port (tap, veth, vhost, physical eth, etc.), I don't think
> it is a good way to do GSO support in OVS module, because we can't know
> the final output port at this point, if the final output port is
> VxLAN-gpe port, it will still face GSO issue, but I didn't see vxlan
> module handles this, maybe udp tunnel did this. I think a NSH packet can
> be split into two packets, one has NSH header, another one doesn't,
> so I'm not sure how vxlan module can avoid this situation.

As I said before, by either implementing GSO support for NSH or by
segmenting in software before pushing the NSH header. In the first
case, the packet will be software segmented before being pushed to the
hardware, in the second case, it will be software segmented before
pushing the NSH header.

> For the second case, we can add a nsh GSO module in net/nsh/nsh_gso.c
> (as MPLS did in net/mpls/mpls_gso.c), that is the best way to handle
> this, what do you think about it?

Yes, that's one of the two options that we've been talking about. It's
the better and the more efficient one.

 Jiri

^ permalink raw reply

* Re: [PATCH net-next v2 10/10] arm64: dts: marvell: mcbin: enable more networking ports
From: Baruch Siach @ 2017-08-23  7:28 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, jason, andrew, gregory.clement, sebastian.hesselbarth,
	thomas.petazzoni, netdev, linux, nadavh, stefanc, mw,
	linux-arm-kernel
In-Reply-To: <20170822170830.32413-11-antoine.tenart@free-electrons.com>

Hi Antoine,

On Tue, Aug 22, 2017 at 07:08:30PM +0200, Antoine Tenart wrote:
> This patch enables the two GE/SFP ports. They are configured in 10GKR
> mode by default. To do this the cpm_xdmio is enabled as well, and two
> phy descriptions are added.
> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> Tested-by: Marcin Wojtas <mw@semihalf.com>
> ---
>  arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 30 +++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> index abd39d1c1739..6cb4b000e1ac 100644
> --- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> +++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> @@ -127,6 +127,30 @@
>  	};
>  };
>  
> +&cpm_xmdio {
> +	status = "okay";
> +
> +	phy0: ethernet-phy@0 {
> +		compatible = "ethernet-phy-ieee802.3-c45";
> +		reg = <0>;
> +	};
> +
> +	phy1: ethernet-phy@1 {

Should be named 'ethernet-phy@8' to match the 'reg' property.

> +		compatible = "ethernet-phy-ieee802.3-c45";
> +		reg = <8>;
> +	};
> +};

baruch

-- 
     http://baruch.siach.name/blog/                  ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.2.679.5364, http://www.tkos.co.il -

^ permalink raw reply

* Stable apply request [was: Bluetooth: bnep: fix possible might sleep error in bnep_session]
From: Jiri Slaby @ 2017-08-23  7:29 UTC (permalink / raw)
  To: Marcel Holtmann, Jeffy Chen, stable
  Cc: LKML, open list:BLUETOOTH DRIVERS, rvaswani, Brian Norris,
	dmitry.torokhov, Douglas Anderson, acho, Johan Hedberg,
	Network Development, David S. Miller, Gustavo F. Padovan
In-Reply-To: <0815411E-62D7-42DC-A0BF-78FD150C0949@holtmann.org>

On 06/27/2017, 07:32 PM, Marcel Holtmann wrote:
>> It looks like bnep_session has same pattern as the issue reported in
>> old rfcomm:
>>
>> 	while (1) {
>> 		set_current_state(TASK_INTERRUPTIBLE);
>> 		if (condition)
>> 			break;
>> 		// may call might_sleep here
>> 		schedule();
>> 	}
>> 	__set_current_state(TASK_RUNNING);
>>
>> Which fixed at:
>> 	dfb2fae Bluetooth: Fix nested sleeps
>>
>> So let's fix it at the same way, also follow the suggestion of:
>> https://lwn.net/Articles/628628/

...

> all 3 patches have been applied to bluetooth-next tree.

Hi,

given users are hitting it in at least 4.4 and 4.12, can we have all
three in all stables where this applies?

5da8e47d849d Bluetooth: hidp: fix possible might sleep error in
hidp_session_thread
f06d977309d0 Bluetooth: cmtp: fix possible might sleep error in cmtp_session
25717382c1dd Bluetooth: bnep: fix possible might sleep error in bnep_session

I am not sure: to stable directly or via net stable?

thanks,
-- 
js
suse labs

^ permalink raw reply

* Re: [PATCH V8 net-next 00/22] Huawei HiNIC Ethernet Driver
From: Arnd Bergmann @ 2017-08-23  7:37 UTC (permalink / raw)
  To: Aviad Krawczyk
  Cc: David Miller, Linux Kernel Mailing List, Networking, bc.y,
	victor.gissin, zhaochen6, tony.qu
In-Reply-To: <cover.1503330613.git.aviad.krawczyk@huawei.com>

On Mon, Aug 21, 2017 at 5:55 PM, Aviad Krawczyk
<aviad.krawczyk@huawei.com> wrote:
> The patch-set contains the support of the HiNIC Ethernet driver for
> hinic family of PCIE Network Interface Cards.
>
> The Huawei's PCIE HiNIC card is a new Ethernet card and hence there was
> a need of a new driver.
>
> The current driver is meant to be used for the Physical Function and there
> would soon be a support for Virtual Function and more features once the
> basic PF driver has been accepted.

Sorry I didn't comment before it got merged, but once it appeared in
linux-next I saw it and wondered why this is grouped under huawei
unlike the network drivers for the parts integrated into the hip0x
SoCs from hisilicon. Both appear to be made by Huawei's HiSilicon
subsidiary.

I did not check whether the two devices are related at all or could
share some of the source code as well, but it might be good to
move this one into the existing directory to spare confusion later.

         Arnd

^ permalink raw reply

* Re: [PATCH net-next 5/5] xdp: get tracepoints xdp_exception and xdp_redirect in sync
From: Jesper Dangaard Brouer @ 2017-08-23  7:41 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, John Fastabend, brouer
In-Reply-To: <599CA26D.7060003@iogearbox.net>

On Tue, 22 Aug 2017 23:30:21 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 08/22/2017 10:47 PM, Jesper Dangaard Brouer wrote:
> > Remove the net_device string name from the xdp_exception tracepoint,
> > like the xdp_redirect tracepoint.
> >
> > Align the TP_STRUCT to have common entries between these two
> > tracepoint.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >   include/trace/events/xdp.h |   24 ++++++++++++------------
> >   1 file changed, 12 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> > index 7511bed80558..6495b0d9d5c7 100644
> > --- a/include/trace/events/xdp.h
> > +++ b/include/trace/events/xdp.h
> > @@ -31,22 +31,22 @@ TRACE_EVENT(xdp_exception,
> >   	TP_ARGS(dev, xdp, act),
> >
> >   	TP_STRUCT__entry(
> > -		__string(name, dev->name)
> >   		__array(u8, prog_tag, 8)
> >   		__field(u32, act)
> > +		__field(int, ifindex)
> >   	),
> >
> >   	TP_fast_assign(
> >   		BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(xdp->tag));
> >   		memcpy(__entry->prog_tag, xdp->tag, sizeof(xdp->tag));
> > -		__assign_str(name, dev->name);
> > -		__entry->act = act;
> > +		__entry->act		= act;
> > +		__entry->ifindex	= dev->ifindex;
> >   	),
> >  
> [...]
> >   TRACE_EVENT(xdp_redirect,
> > @@ -57,26 +57,26 @@ TRACE_EVENT(xdp_redirect,
> >   	TP_ARGS(from_index, to_index, xdp, act, err),
> >
> >   	TP_STRUCT__entry(
> > -		__field(int, from_index)
> > -		__field(int, to_index)
> >   		__array(u8, prog_tag, 8)
> >   		__field(u32, act)
> > +		__field(int, from_index)
> > +		__field(int, to_index)
> >   		__field(int, err)
> >   	),
> >
> >   	TP_fast_assign(
> >   		BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(xdp->tag));
> >   		memcpy(__entry->prog_tag, xdp->tag, sizeof(xdp->tag));
> > +		__entry->act		= act;
> >   		__entry->from_index	= from_index;
> >   		__entry->to_index	= to_index;
> > -		__entry->act = act;  
> 
> If you already get them in sync, could you also make it consistent
> that for both tracepoints in TP_ARGS() we use either ifindex
> directly or device pointer and extract it from TP_fast_assign().
> Right now, it's mixed.

I did this because, in the (bpf_redirect)_map case, I was hoping that
"from_index" could be come the index from the map.  Looking closer at
the devmap design, I cannot see how we can ever know what (the bpf_prog
thinks) is the corresponding map-ingress/from_index.

Based on that, I think we can/should use the device pointer as arg (as
you suggested). And then rename "from_index" to "ifindex", where
ifindex is the XDP ifindex the xdp_buff arrived on.

I guess we can keep the "to_index".  We also need to extend the
tracepoint args with a "map", to let the trace user tell whether the
"to_index" is an ifindex or a map-index.

I'll code this up and submit at V2.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next v7 02/10] bpf: Add eBPF program subtype and is_valid_subtype() verifier
From: Mickaël Salaün @ 2017-08-23  7:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf
In-Reply-To: <20170823024452.zvizovwfd7xjucsx@ast-mbp>


[-- Attachment #1.1: Type: text/plain, Size: 6655 bytes --]


On 23/08/2017 04:44, Alexei Starovoitov wrote:
> On Mon, Aug 21, 2017 at 02:09:25AM +0200, Mickaël Salaün wrote:
>> The goal of the program subtype is to be able to have different static
>> fine-grained verifications for a unique program type.
>>
>> The struct bpf_verifier_ops gets a new optional function:
>> is_valid_subtype(). This new verifier is called at the beginning of the
>> eBPF program verification to check if the (optional) program subtype is
>> valid.
>>
>> For now, only Landlock eBPF programs are using a program subtype (see
>> next commit) but this could be used by other program types in the future.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: David S. Miller <davem@davemloft.net>
>> Link: https://lkml.kernel.org/r/20160827205559.GA43880@ast-mbp.thefacebook.com
>> ---
>>
>> Changes since v6:
>> * rename Landlock version to ABI to better reflect its purpose
>> * fix unsigned integer checks
>> * fix pointer cast
>> * constify pointers
>> * rebase
>>
>> Changes since v5:
>> * use a prog_subtype pointer and make it future-proof
>> * add subtype test
>> * constify bpf_load_program()'s subtype argument
>> * cleanup subtype initialization
>> * rebase
>>
>> Changes since v4:
>> * replace the "status" field with "version" (more generic)
>> * replace the "access" field with "ability" (less confusing)
>>
>> Changes since v3:
>> * remove the "origin" field
>> * add an "option" field
>> * cleanup comments
>> ---
>>  include/linux/bpf.h                         |  7 ++-
>>  include/linux/filter.h                      |  2 +
>>  include/uapi/linux/bpf.h                    | 11 +++++
>>  kernel/bpf/syscall.c                        | 22 ++++++++-
>>  kernel/bpf/verifier.c                       | 17 +++++--
>>  kernel/trace/bpf_trace.c                    | 15 ++++--
>>  net/core/filter.c                           | 71 ++++++++++++++++++-----------
>>  samples/bpf/bpf_load.c                      |  3 +-
>>  samples/bpf/cookie_uid_helper_example.c     |  2 +-
>>  samples/bpf/fds_example.c                   |  2 +-
>>  samples/bpf/sock_example.c                  |  3 +-
>>  samples/bpf/test_cgrp2_attach.c             |  2 +-
>>  samples/bpf/test_cgrp2_attach2.c            |  2 +-
>>  samples/bpf/test_cgrp2_sock.c               |  2 +-
>>  tools/include/uapi/linux/bpf.h              | 11 +++++
>>  tools/lib/bpf/bpf.c                         | 10 +++-
>>  tools/lib/bpf/bpf.h                         |  5 +-
>>  tools/lib/bpf/libbpf.c                      |  4 +-
>>  tools/perf/tests/bpf.c                      |  2 +-
>>  tools/testing/selftests/bpf/test_align.c    |  2 +-
>>  tools/testing/selftests/bpf/test_tag.c      |  2 +-
>>  tools/testing/selftests/bpf/test_verifier.c | 17 ++++++-
>>  22 files changed, 158 insertions(+), 56 deletions(-)
> 
> ...
> 
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 7015116331af..0c3fadbb5a58 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -464,6 +464,8 @@ struct bpf_prog {
>>  	u32			len;		/* Number of filter blocks */
>>  	u32			jited_len;	/* Size of jited insns in bytes */
>>  	u8			tag[BPF_TAG_SIZE];
>> +	u8			has_subtype;
>> +	union bpf_prog_subtype	subtype;	/* Fine-grained verifications */
> 
> these burn a hole in very performance sensitive structure.
> Also there are bits rigth above. use them instead of u8 has_subtype?
> or can these two fields be part of bpf_prog_aux ?

OK, I'll create one bit variable and a bpf_prog_subtype field in the
bpf_prog_aux struct then.


> 
>>  	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
>>  	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
>>  	unsigned int		(*bpf_func)(const void *ctx,
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 843818dff96d..8541ab85e432 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -177,6 +177,15 @@ enum bpf_attach_type {
>>  /* Specify numa node during map creation */
>>  #define BPF_F_NUMA_NODE		(1U << 2)
>>  
>> +union bpf_prog_subtype {
>> +	struct {
>> +		__u32		abi; /* minimal ABI version, cf. user doc */
> 
> the concept of abi (version) sounds a bit weird to me.
> Why bother with it at all?
> Once the first set of patches lands the kernel as whole will have landlock feature
> with a set of helpers, actions, event types.
> Some future patches will extend the landlock feature step by step.
> This abi concept assumes that anyone who adds new helper would need
> to keep incrementing this 'abi'. What value does it give to user or to kernel?
> The users will already know that landlock is present in kernel 4.14 or whatever
> and the kernel 4.18 has more landlock features. Why bother with extra abi number?

That's right for helpers and context fields, but we can't check the use
of one field's content. The status field is intended to be a bitfield
extendable in the future. For example, one use case is to set a flag to
inform the eBPF program that it was already called with the same context
and can skip most of its check (if not related to maps). Same goes for
the FS action bitfield, one may want to add more of them. Another
example may be the check for abilities. We may want to relax/remove the
capability require to set one of them. With an ABI version, the user can
easily check if the current kernel support that.

> 
>> +		__u32		event; /* enum landlock_subtype_event */
>> +		__aligned_u64	ability; /* LANDLOCK_SUBTYPE_ABILITY_* */
>> +		__aligned_u64	option; /* LANDLOCK_SUBTYPE_OPTION_* */
>> +	} landlock_rule;
>> +} __attribute__((aligned(8)));
>> +
>>  union bpf_attr {
>>  	struct { /* anonymous struct used by BPF_MAP_CREATE command */
>>  		__u32	map_type;	/* one of enum bpf_map_type */
>> @@ -212,6 +221,8 @@ union bpf_attr {
>>  		__aligned_u64	log_buf;	/* user supplied buffer */
>>  		__u32		kern_version;	/* checked when prog_type=kprobe */
>>  		__u32		prog_flags;
>> +		__aligned_u64	prog_subtype;	/* bpf_prog_subtype address */
>> +		__u32		prog_subtype_size;
>>  	};
> 
> more general question: what is the status of security/ bits?
> I'm assuming they still need to be reviewed and explicitly acked by James, right?

Right, the review process is ongoing. :)
BTW, I'll be at Linux Security Summit (co-located with Plumbers) next
month. We'll be able to clarify some points there too.

Regards,
 Mickaël


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH net-next 2/3] net: mvpp2: unify the txq size define use
From: Antoine Tenart @ 2017-08-23  7:46 UTC (permalink / raw)
  To: davem
  Cc: Antoine Tenart, andrew, gregory.clement, thomas.petazzoni, nadavh,
	linux, mw, stefanc, netdev
In-Reply-To: <20170823074656.14595-1-antoine.tenart@free-electrons.com>

The txq size is defined by MVPP2_AGGR_TXQ_SIZE, which is sometime not
used directly but through variables. As it is a fixed value use the
define everywhere in the driver.

Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvpp2.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index 4a2ed5c66bb4..0f5a2a4f9ddf 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -5413,15 +5413,14 @@ static unsigned int mvpp2_tx_done(struct mvpp2_port *port, u32 cause,
 
 /* Allocate and initialize descriptors for aggr TXQ */
 static int mvpp2_aggr_txq_init(struct platform_device *pdev,
-			       struct mvpp2_tx_queue *aggr_txq,
-			       int desc_num, int cpu,
+			       struct mvpp2_tx_queue *aggr_txq, int cpu,
 			       struct mvpp2 *priv)
 {
 	u32 txq_dma;
 
 	/* Allocate memory for TX descriptors */
 	aggr_txq->descs = dma_alloc_coherent(&pdev->dev,
-				desc_num * MVPP2_DESC_ALIGNED_SIZE,
+				MVPP2_AGGR_TXQ_SIZE * MVPP2_DESC_ALIGNED_SIZE,
 				&aggr_txq->descs_dma, GFP_KERNEL);
 	if (!aggr_txq->descs)
 		return -ENOMEM;
@@ -5442,7 +5441,8 @@ static int mvpp2_aggr_txq_init(struct platform_device *pdev,
 			MVPP22_AGGR_TXQ_DESC_ADDR_OFFS;
 
 	mvpp2_write(priv, MVPP2_AGGR_TXQ_DESC_ADDR_REG(cpu), txq_dma);
-	mvpp2_write(priv, MVPP2_AGGR_TXQ_DESC_SIZE_REG(cpu), desc_num);
+	mvpp2_write(priv, MVPP2_AGGR_TXQ_DESC_SIZE_REG(cpu),
+		    MVPP2_AGGR_TXQ_SIZE);
 
 	return 0;
 }
@@ -7691,8 +7691,7 @@ static int mvpp2_init(struct platform_device *pdev, struct mvpp2 *priv)
 	for_each_present_cpu(i) {
 		priv->aggr_txqs[i].id = i;
 		priv->aggr_txqs[i].size = MVPP2_AGGR_TXQ_SIZE;
-		err = mvpp2_aggr_txq_init(pdev, &priv->aggr_txqs[i],
-					  MVPP2_AGGR_TXQ_SIZE, i, priv);
+		err = mvpp2_aggr_txq_init(pdev, &priv->aggr_txqs[i], i, priv);
 		if (err < 0)
 			return err;
 	}
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 1/3] net: define the TSO header size in net/tso.h
From: Antoine Tenart @ 2017-08-23  7:46 UTC (permalink / raw)
  To: davem
  Cc: Antoine Tenart, andrew, gregory.clement, thomas.petazzoni, nadavh,
	linux, mw, stefanc, netdev
In-Reply-To: <20170823074656.14595-1-antoine.tenart@free-electrons.com>

The TSO header size was defined in many drivers. Factorize the code and
define its size in net/tso.h.

Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h | 1 -
 drivers/net/ethernet/freescale/fec_main.c          | 1 -
 drivers/net/ethernet/marvell/mv643xx_eth.c         | 2 --
 drivers/net/ethernet/marvell/mvneta.c              | 3 ---
 include/net/tso.h                                  | 2 ++
 5 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
index 57858522c33c..67d1a3230773 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
@@ -277,7 +277,6 @@ struct snd_queue {
 	u16		xdp_free_cnt;
 	bool		is_xdp;
 
-#define	TSO_HEADER_SIZE	128
 	/* For TSO segment's header */
 	char		*tso_hdrs;
 	dma_addr_t	tso_hdrs_phys;
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index df09b254553d..56f56d6ada9c 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -226,7 +226,6 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 
 #define COPYBREAK_DEFAULT	256
 
-#define TSO_HEADER_SIZE		128
 /* Max number of allowed TCP segments for software TSO */
 #define FEC_MAX_TSO_SEGS	100
 #define FEC_MAX_SKB_DESCS	(FEC_MAX_TSO_SEGS * 2 + MAX_SKB_FRAGS)
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
index 9c94ea9b2b80..fb2d533ae4ef 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -183,8 +183,6 @@ static char mv643xx_eth_driver_version[] = "1.4";
 #define DEFAULT_TX_QUEUE_SIZE	512
 #define SKB_DMA_REALIGN		((PAGE_SIZE - NET_SKB_PAD) % SMP_CACHE_BYTES)
 
-#define TSO_HEADER_SIZE		128
-
 /* Max number of allowed TCP segments for software TSO */
 #define MV643XX_MAX_TSO_SEGS 100
 #define MV643XX_MAX_SKB_DESCS (MV643XX_MAX_TSO_SEGS * 2 + MAX_SKB_FRAGS)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 0aab74c2a209..35ff1ecfcff0 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -281,9 +281,6 @@
  */
 #define MVNETA_RSS_LU_TABLE_SIZE	1
 
-/* TSO header size */
-#define TSO_HEADER_SIZE 128
-
 /* Max number of Rx descriptors */
 #define MVNETA_MAX_RXD 128
 
diff --git a/include/net/tso.h b/include/net/tso.h
index b7be852bfe9d..9a56c39e6d0a 100644
--- a/include/net/tso.h
+++ b/include/net/tso.h
@@ -3,6 +3,8 @@
 
 #include <net/ip.h>
 
+#define TSO_HEADER_SIZE		128
+
 struct tso_t {
 	int next_frag_idx;
 	void *data;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 0/3] net: mvpp2: software TSO support
From: Antoine Tenart @ 2017-08-23  7:46 UTC (permalink / raw)
  To: davem
  Cc: Antoine Tenart, andrew, gregory.clement, thomas.petazzoni, nadavh,
	linux, mw, stefanc, netdev

Hi all,

This series adds the s/w TSO support in the PPv2 driver, in addition to
two cosmetic commits. As stated in patch 3/3:

Using iperf and 10G ports, using TSO shows a significant performance
improvement by a factor 2 to reach around 9.5Gbps in TX; as well as a
significant CPU usage drop (from 25% to 15%).

Thanks!
Antoine

Antoine Tenart (3):
  net: define the TSO header size in net/tso.h
  net: mvpp2: unify the txq size define use
  net: mvpp2: software tso support

 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |   1 -
 drivers/net/ethernet/freescale/fec_main.c          |   1 -
 drivers/net/ethernet/marvell/mv643xx_eth.c         |   2 -
 drivers/net/ethernet/marvell/mvneta.c              |   3 -
 drivers/net/ethernet/marvell/mvpp2.c               | 182 ++++++++++++++++++---
 include/net/tso.h                                  |   2 +
 6 files changed, 164 insertions(+), 27 deletions(-)

-- 
2.13.5

^ permalink raw reply

* [PATCH net-next 3/3] net: mvpp2: software tso support
From: Antoine Tenart @ 2017-08-23  7:46 UTC (permalink / raw)
  To: davem
  Cc: Antoine Tenart, andrew, gregory.clement, thomas.petazzoni, nadavh,
	linux, mw, stefanc, netdev
In-Reply-To: <20170823074656.14595-1-antoine.tenart@free-electrons.com>

The patch uses the tso API to implement the tso functionality in Marvell
PPv2 driver.

Using iperf and 10G ports, using TSO shows a significant performance
improvement by a factor 2 to reach around 9.5Gbps in TX; as well as a
significant CPU usage drop (from 25% to 15%).

Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvpp2.c | 171 ++++++++++++++++++++++++++++++++---
 1 file changed, 157 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index 0f5a2a4f9ddf..90103bce6c66 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -37,6 +37,7 @@
 #include <uapi/linux/ppp_defs.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
+#include <net/tso.h>
 
 /* RX Fifo Registers */
 #define MVPP2_RX_DATA_FIFO_SIZE_REG(port)	(0x00 + 4 * (port))
@@ -1033,6 +1034,10 @@ struct mvpp2_txq_pcpu {
 
 	/* Index of the TX DMA descriptor to be cleaned up */
 	int txq_get_index;
+
+	/* DMA buffer for TSO headers */
+	char *tso_headers;
+	dma_addr_t tso_headers_dma;
 };
 
 struct mvpp2_tx_queue {
@@ -5623,6 +5628,14 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
 		txq_pcpu->reserved_num = 0;
 		txq_pcpu->txq_put_index = 0;
 		txq_pcpu->txq_get_index = 0;
+
+		txq_pcpu->tso_headers =
+			dma_alloc_coherent(port->dev->dev.parent,
+					   MVPP2_AGGR_TXQ_SIZE * TSO_HEADER_SIZE,
+					   &txq_pcpu->tso_headers_dma,
+					   GFP_KERNEL);
+		if (!txq_pcpu->tso_headers)
+			goto cleanup;
 	}
 
 	return 0;
@@ -5630,6 +5643,11 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
 	for_each_present_cpu(cpu) {
 		txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
 		kfree(txq_pcpu->buffs);
+
+		dma_free_coherent(port->dev->dev.parent,
+				  MVPP2_AGGR_TXQ_SIZE * MVPP2_DESC_ALIGNED_SIZE,
+				  txq_pcpu->tso_headers,
+				  txq_pcpu->tso_headers_dma);
 	}
 
 	dma_free_coherent(port->dev->dev.parent,
@@ -5649,6 +5667,11 @@ static void mvpp2_txq_deinit(struct mvpp2_port *port,
 	for_each_present_cpu(cpu) {
 		txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
 		kfree(txq_pcpu->buffs);
+
+		dma_free_coherent(port->dev->dev.parent,
+				  MVPP2_AGGR_TXQ_SIZE * MVPP2_DESC_ALIGNED_SIZE,
+				  txq_pcpu->tso_headers,
+				  txq_pcpu->tso_headers_dma);
 	}
 
 	if (txq->descs)
@@ -6238,6 +6261,123 @@ static int mvpp2_tx_frag_process(struct mvpp2_port *port, struct sk_buff *skb,
 	return -ENOMEM;
 }
 
+static inline void mvpp2_tso_put_hdr(struct sk_buff *skb,
+				     struct net_device *dev,
+				     struct mvpp2_tx_queue *txq,
+				     struct mvpp2_tx_queue *aggr_txq,
+				     struct mvpp2_txq_pcpu *txq_pcpu,
+				     int hdr_sz)
+{
+	struct mvpp2_port *port = netdev_priv(dev);
+	struct mvpp2_tx_desc *tx_desc = mvpp2_txq_next_desc_get(aggr_txq);
+	dma_addr_t addr;
+
+	mvpp2_txdesc_txq_set(port, tx_desc, txq->id);
+	mvpp2_txdesc_size_set(port, tx_desc, hdr_sz);
+
+	addr = txq_pcpu->tso_headers_dma +
+	       txq_pcpu->txq_put_index * TSO_HEADER_SIZE;
+	mvpp2_txdesc_offset_set(port, tx_desc, addr & MVPP2_TX_DESC_ALIGN);
+	mvpp2_txdesc_dma_addr_set(port, tx_desc, addr & ~MVPP2_TX_DESC_ALIGN);
+
+	mvpp2_txdesc_cmd_set(port, tx_desc, mvpp2_skb_tx_csum(port, skb) |
+					    MVPP2_TXD_F_DESC |
+					    MVPP2_TXD_PADDING_DISABLE);
+	mvpp2_txq_inc_put(port, txq_pcpu, NULL, tx_desc);
+}
+
+static inline int mvpp2_tso_put_data(struct sk_buff *skb,
+				     struct net_device *dev, struct tso_t *tso,
+				     struct mvpp2_tx_queue *txq,
+				     struct mvpp2_tx_queue *aggr_txq,
+				     struct mvpp2_txq_pcpu *txq_pcpu,
+				     int sz, bool left, bool last)
+{
+	struct mvpp2_port *port = netdev_priv(dev);
+	struct mvpp2_tx_desc *tx_desc = mvpp2_txq_next_desc_get(aggr_txq);
+	dma_addr_t buf_dma_addr;
+
+	mvpp2_txdesc_txq_set(port, tx_desc, txq->id);
+	mvpp2_txdesc_size_set(port, tx_desc, sz);
+
+	buf_dma_addr = dma_map_single(dev->dev.parent, tso->data, sz,
+				      DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(dev->dev.parent, buf_dma_addr))) {
+		mvpp2_txq_desc_put(txq);
+		return -ENOMEM;
+	}
+
+	mvpp2_txdesc_offset_set(port, tx_desc,
+				buf_dma_addr & MVPP2_TX_DESC_ALIGN);
+	mvpp2_txdesc_dma_addr_set(port, tx_desc,
+				  buf_dma_addr & ~MVPP2_TX_DESC_ALIGN);
+
+	if (!left) {
+		mvpp2_txdesc_cmd_set(port, tx_desc, MVPP2_TXD_L_DESC);
+		if (last) {
+			mvpp2_txq_inc_put(port, txq_pcpu, skb, tx_desc);
+			return 0;
+		}
+	} else {
+		mvpp2_txdesc_cmd_set(port, tx_desc, 0);
+	}
+
+	mvpp2_txq_inc_put(port, txq_pcpu, NULL, tx_desc);
+	return 0;
+}
+
+static int mvpp2_tx_tso(struct sk_buff *skb, struct net_device *dev,
+			struct mvpp2_tx_queue *txq,
+			struct mvpp2_tx_queue *aggr_txq,
+			struct mvpp2_txq_pcpu *txq_pcpu)
+{
+	struct mvpp2_port *port = netdev_priv(dev);
+	struct tso_t tso;
+	int hdr_sz = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	int i, len, descs = 0;
+
+	/* Check number of available descriptors */
+	if (mvpp2_aggr_desc_num_check(port->priv, aggr_txq,
+				      tso_count_descs(skb)) ||
+	    mvpp2_txq_reserved_desc_num_proc(port->priv, txq, txq_pcpu,
+					     tso_count_descs(skb)))
+		return 0;
+
+	tso_start(skb, &tso);
+	len = skb->len - hdr_sz;
+	while (len > 0) {
+		int left = min_t(int, skb_shinfo(skb)->gso_size, len);
+		char *hdr = txq_pcpu->tso_headers +
+			    txq_pcpu->txq_put_index * TSO_HEADER_SIZE;
+
+		len -= left;
+		descs++;
+
+		tso_build_hdr(skb, hdr, &tso, left, len == 0);
+		mvpp2_tso_put_hdr(skb, dev, txq, aggr_txq, txq_pcpu, hdr_sz);
+
+		while (left > 0) {
+			int sz = min_t(int, tso.size, left);
+			left -= sz;
+			descs++;
+
+			if (mvpp2_tso_put_data(skb, dev, &tso, txq, aggr_txq,
+					       txq_pcpu, sz, left, len == 0))
+				goto release;
+			tso_build_data(skb, &tso, sz);
+		}
+	}
+
+	return descs;
+
+release:
+	for (i = descs - 1; i >= 0; i--) {
+		struct mvpp2_tx_desc *tx_desc = txq->descs + i;
+		tx_desc_unmap_put(port, txq, tx_desc);
+	}
+	return 0;
+}
+
 /* Main tx processing */
 static int mvpp2_tx(struct sk_buff *skb, struct net_device *dev)
 {
@@ -6255,6 +6395,10 @@ static int mvpp2_tx(struct sk_buff *skb, struct net_device *dev)
 	txq_pcpu = this_cpu_ptr(txq->pcpu);
 	aggr_txq = &port->priv->aggr_txqs[smp_processor_id()];
 
+	if (skb_is_gso(skb)) {
+		frags = mvpp2_tx_tso(skb, dev, txq, aggr_txq, txq_pcpu);
+		goto out;
+	}
 	frags = skb_shinfo(skb)->nr_frags + 1;
 
 	/* Check number of available descriptors */
@@ -6304,22 +6448,21 @@ static int mvpp2_tx(struct sk_buff *skb, struct net_device *dev)
 		}
 	}
 
-	txq_pcpu->reserved_num -= frags;
-	txq_pcpu->count += frags;
-	aggr_txq->count += frags;
-
-	/* Enable transmit */
-	wmb();
-	mvpp2_aggr_txq_pend_desc_add(port, frags);
-
-	if (txq_pcpu->size - txq_pcpu->count < MAX_SKB_FRAGS + 1) {
-		struct netdev_queue *nq = netdev_get_tx_queue(dev, txq_id);
-
-		netif_tx_stop_queue(nq);
-	}
 out:
 	if (frags > 0) {
 		struct mvpp2_pcpu_stats *stats = this_cpu_ptr(port->stats);
+		struct netdev_queue *nq = netdev_get_tx_queue(dev, txq_id);
+
+		txq_pcpu->reserved_num -= frags;
+		txq_pcpu->count += frags;
+		aggr_txq->count += frags;
+
+		/* Enable transmit */
+		wmb();
+		mvpp2_aggr_txq_pend_desc_add(port, frags);
+
+		if (txq_pcpu->size - txq_pcpu->count < MAX_SKB_FRAGS + 1)
+			netif_tx_stop_queue(nq);
 
 		u64_stats_update_begin(&stats->syncp);
 		stats->tx_packets++;
@@ -7496,7 +7639,7 @@ static int mvpp2_port_probe(struct platform_device *pdev,
 		}
 	}
 
-	features = NETIF_F_SG | NETIF_F_IP_CSUM;
+	features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
 	dev->features = features | NETIF_F_RXCSUM;
 	dev->hw_features |= features | NETIF_F_RXCSUM | NETIF_F_GRO;
 	dev->vlan_features |= features;
-- 
2.13.5

^ permalink raw reply related

* RE: [PATCH v2 net-next 1/3] ipv6: Prevent unexpected sk->sk_prot changes
From: Ilya Lesokhin @ 2017-08-23  7:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev@vger.kernel.org, davem@davemloft.net, davejwatson@fb.com,
	Aviad Yehezkel, Boris Pismenny
In-Reply-To: <1502808336.4936.78.camel@edumazet-glaptop3.roam.corp.google.com>

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Tuesday, August 15, 2017 5:46 PM
> To: Boris Pismenny <borisp@mellanox.com>
> Cc: Ilya Lesokhin <ilyal@mellanox.com>; netdev@vger.kernel.org;
> davem@davemloft.net; davejwatson@fb.com; Aviad Yehezkel
> <aviadye@mellanox.com>
> Subject: Re: [PATCH v2 net-next 1/3] ipv6: Prevent unexpected sk->sk_prot
> changes
> 
> I am just saying IPV6_ADDRFORM is becoming spaghetti code, and maybe
> this is time to make it modular. 
> 

Hi Eric,
I understand your concern, but I would like to postpone implementing this
feature until the ulp infrastructure stabilizes.
I don't even know how many users want to use IPV6_ADDRFORM on sockets with ulp.
I think that simply returning an error for ulp sockets in the meantime is a reasonable 
compromise.

what do you think about the patch below?

Subject: [PATCH 1/1] ipv6: Disable IPV6_ADDRFORM on sockets with ulp attached

The IPV6_ADDRFORM setsockopt works by overriding sk->sk_prot.
The current KTLS ulp also overrides sk->sk_prot.
Consequently, using IPV6_ADDRFORM when there is a ulp attached is
unsafe and this patch disables it.

Fixes: 3c4d7559159b ('tls: kernel TLS support')
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
---
 net/ipv6/ipv6_sockglue.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 02d795f..d935948 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -185,8 +185,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 					retv = -EBUSY;
 					break;
 				}
-			} else if (sk->sk_protocol != IPPROTO_TCP)
+			} else if (sk->sk_protocol == IPPROTO_TCP) {
+				if (inet_csk(sk)->icsk_ulp_ops)
+					break;
+			} else {
 				break;
+			}
 
 			if (sk->sk_state != TCP_ESTABLISHED) {
 				retv = -ENOTCONN;


Thanks,
Ilya

^ permalink raw reply related

* Re: [PATCH v3 3/4] net: stmmac: register parent MDIO node for sun8i-h3-emac
From: Maxime Ripard @ 2017-08-23  7:49 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Corentin Labbe, Chen-Yu Tsai, Andrew Lunn, Rob Herring,
	Mark Rutland, Russell King, Giuseppe Cavallaro, Alexandre Torgue,
	devicetree, linux-arm-kernel, linux-kernel, netdev
In-Reply-To: <352ae66b-78a4-e88b-4544-a8edd9390b0c@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1874 bytes --]

Hi Florian,

On Tue, Aug 22, 2017 at 11:35:01AM -0700, Florian Fainelli wrote:
> >>> So I think what you are saying is either impossible or engineering-wise
> >>> a very stupid design, like using an external MAC with a discrete PHY
> >>> connected to the internal MAC's MDIO bus, while using the internal MAC
> >>> with the internal PHY.
> >>>
> >>> Now can we please decide on something? We're a week and a half from
> >>> the 4.13 release. If mdio-mux is wrong, then we could have two mdio
> >>> nodes (internal-mdio & external-mdio).
> >>
> >> I really don't see a need for a mdio-mux in the first place, just have
> >> one MDIO controller (current state) sub-node which describes the
> >> built-in STMMAC MDIO controller and declare the internal PHY as a child
> >> node (along with 'phy-is-integrated'). If a different configuration is
> >> used, then just put the external PHY as a child node there.
> >>
> >> If fixed-link is required, the mdio node becomes unused anyway.
> >>
> >> Works for everyone?
> > 
> > If we put an external PHY with reg=1 as a child of internal MDIO,
> > il will be merged with internal PHY node and get
> > phy-is-integrated.
> 
> Then have the .dtsi file contain just the mdio node, but no internal or
> external PHY and push all the internal and external PHY node definition
> (in its entirety) to the per-board DTS file, does not that work?

If possible, I'd really like to have the internal PHY in the
DTSI. It's always there in hardware anyway, and duplicating the PHY,
with its clock, reset line, and whatever info we might need in the
future in each and every board DTS that uses it will be very error
prone and we will have the usual bunch of issues that come up with
duplication.

Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* [PATCH net-next] ipv6: Add sysctl for per namespace flow label reflection
From: Jakub Sitnicki @ 2017-08-23  7:55 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Nikolay Aleksandrov,
	Tom Herbert

Reflecting IPv6 Flow Label at server nodes is useful in environments
that employ multipath routing to load balance the requests. As "IPv6
Flow Label Reflection" standard draft [1] points out - ICMPv6 PTB error
messages generated in response to a downstream packets from the server
can be routed by a load balancer back to the original server without
looking at transport headers, if the server applies the flow label
reflection. This enables the Path MTU Discovery past the ECMP router in
load-balance or anycast environments where each server node is reachable
by only one path.

Introduce a sysctl to enable flow label reflection per net namespace for
all newly created sockets. Same could be earlier achieved only per
socket by setting the IPV6_FL_F_REFLECT flag for the IPV6_FLOWLABEL_MGR
socket option.

[1] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01

Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
---
 Documentation/networking/ip-sysctl.txt | 9 +++++++++
 include/net/netns/ipv6.h               | 1 +
 net/ipv6/af_inet6.c                    | 1 +
 net/ipv6/sysctl_net_ipv6.c             | 8 ++++++++
 4 files changed, 19 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 84c9b8c..6b0bc0f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1350,6 +1350,15 @@ flowlabel_state_ranges - BOOLEAN
 	FALSE: disabled
 	Default: true
 
+flowlabel_reflect - BOOLEAN
+	Automatically reflect the flow label. Needed for Path MTU
+	Discovery to work with Equal Cost Multipath Routing in anycast
+	environments. See RFC 7690 and:
+	https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
+	TRUE: enabled
+	FALSE: disabled
+	Default: FALSE
+
 anycast_src_echo_reply - BOOLEAN
 	Controls the use of anycast addresses as source addresses for ICMPv6
 	echo reply
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 0e50bf3..2544f97 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -36,6 +36,7 @@ struct netns_sysctl_ipv6 {
 	int idgen_retries;
 	int idgen_delay;
 	int flowlabel_state_ranges;
+	int flowlabel_reflect;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3b58ee7..fe5262f 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -211,6 +211,7 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 	np->mc_loop	= 1;
 	np->pmtudisc	= IPV6_PMTUDISC_WANT;
 	np->autoflowlabel = ip6_default_np_autolabel(net);
+	np->repflow	= net->ipv6.sysctl.flowlabel_reflect;
 	sk->sk_ipv6only	= net->ipv6.sysctl.bindv6only;
 
 	/* Init the ipv4 part of the socket since we can have sockets
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 69c50e7..6fbf8ae 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -90,6 +90,13 @@ static struct ctl_table ipv6_table_template[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "flowlabel_reflect",
+		.data		= &init_net.ipv6.sysctl.flowlabel_reflect,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
@@ -149,6 +156,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
 	ipv6_table[6].data = &net->ipv6.sysctl.idgen_delay;
 	ipv6_table[7].data = &net->ipv6.sysctl.flowlabel_state_ranges;
 	ipv6_table[8].data = &net->ipv6.sysctl.ip_nonlocal_bind;
+	ipv6_table[9].data = &net->ipv6.sysctl.flowlabel_reflect;
 
 	ipv6_route_table = ipv6_route_sysctl_init(net);
 	if (!ipv6_route_table)
-- 
2.9.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox