Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH V6 net-next 02/15] net: Rename and export copy_skb_header
From: Boris Pismenny @ 2018-04-22 15:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin
In-Reply-To: <1524410397-108425-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function give more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c      | 9 +++++----
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d274059..b4fff17 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1032,6 +1032,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t priority);
 struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 				   gfp_t gfp_mask, bool fclone);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ff49e35..3f5dba1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1305,7 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -1313,6 +1313,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(skb_copy_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
@@ -1355,7 +1356,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 
 	BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 	return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1419,7 +1420,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 		skb_clone_fraglist(n);
 	}
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 out:
 	return n;
 }
@@ -1599,7 +1600,7 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	BUG_ON(skb_copy_bits(skb, -head_copy_len, n->head + head_copy_off,
 			     skb->len + head_copy_len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 
 	skb_headers_offset_update(n, newheadroom - oldheadroom);
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V6 net-next 01/15] tcp: Add clean acked data hook
From: Boris Pismenny @ 2018-04-22 15:19 UTC (permalink / raw)
  To: davem
  Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <1524410397-108425-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Called when a TCP segment is acknowledged.
Could be used by application protocols who hold additional
metadata associated with the stream data.

This is required by TLS device offload to release
metadata associated with acknowledged TLS records.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 include/net/inet_connection_sock.h |  2 ++
 include/net/tcp.h                  |  8 ++++++++
 net/ipv4/tcp_input.c               | 25 +++++++++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index b68fea0..2ab6667 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
  * @icsk_af_ops		   Operations which are AF_INET{4,6} specific
  * @icsk_ulp_ops	   Pluggable ULP control hook
  * @icsk_ulp_data	   ULP private data
+ * @icsk_clean_acked	   Clean acked data hook
  * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
  * @icsk_ca_state:	   Congestion control state
  * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
@@ -102,6 +103,7 @@ struct inet_connection_sock {
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
 	void			  *icsk_ulp_data;
+	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
 	struct hlist_node         icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
 	__u8			  icsk_ca_state:6,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e..cf803fe 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2105,4 +2105,12 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+void clean_acked_data_enable(struct inet_connection_sock *icsk,
+			     void (*cad)(struct sock *sk, u32 ack_seq));
+void clean_acked_data_disable(struct inet_connection_sock *icsk);
+
+#endif
+
 #endif	/* _TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0396fb9..0b8b1c9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -111,6 +111,25 @@
 #define REXMIT_LOST	1 /* retransmit packets marked lost */
 #define REXMIT_NEW	2 /* FRTO-style transmit of unsent/new packets */
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+static DEFINE_STATIC_KEY_FALSE(clean_acked_data_enabled);
+
+void clean_acked_data_enable(struct inet_connection_sock *icsk,
+			     void (*cad)(struct sock *sk, u32 ack_seq))
+{
+	icsk->icsk_clean_acked = cad;
+	static_branch_inc(&clean_acked_data_enabled);
+}
+EXPORT_SYMBOL_GPL(clean_acked_data_enable);
+
+void clean_acked_data_disable(struct inet_connection_sock *icsk)
+{
+	static_branch_dec(&clean_acked_data_enabled);
+	icsk->icsk_clean_acked = NULL;
+}
+EXPORT_SYMBOL_GPL(clean_acked_data_disable);
+#endif
+
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
 			     unsigned int len)
 {
@@ -3558,6 +3577,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, prior_snd_una)) {
 		flag |= FLAG_SND_UNA_ADVANCED;
 		icsk->icsk_retransmits = 0;
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+		if (static_branch_unlikely(&clean_acked_data_enabled))
+			if (icsk->icsk_clean_acked)
+				icsk->icsk_clean_acked(sk, ack);
+#endif
 	}
 
 	prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V6 net-next 03/15] net: Add Software fallback infrastructure for socket dependent offloads
From: Boris Pismenny @ 2018-04-22 15:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin
In-Reply-To: <1524410397-108425-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
---
 include/net/sock.h | 21 +++++++++++++++++++++
 net/Kconfig        |  4 ++++
 net/core/dev.c     |  4 ++++
 3 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 74d725f..3c568b3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -481,6 +481,11 @@ struct sock {
 	void			(*sk_error_report)(struct sock *sk);
 	int			(*sk_backlog_rcv)(struct sock *sk,
 						  struct sk_buff *skb);
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+	struct sk_buff*		(*sk_validate_xmit_skb)(struct sock *sk,
+							struct net_device *dev,
+							struct sk_buff *skb);
+#endif
 	void                    (*sk_destruct)(struct sock *sk);
 	struct sock_reuseport __rcu	*sk_reuseport_cb;
 	struct rcu_head		sk_rcu;
@@ -2332,6 +2337,22 @@ static inline bool sk_fullsock(const struct sock *sk)
 	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
 }
 
+/* Checks if this SKB belongs to an HW offloaded socket
+ * and whether any SW fallbacks are required based on dev.
+ */
+static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
+						   struct net_device *dev)
+{
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+	struct sock *sk = skb->sk;
+
+	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+		skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#endif
+
+	return skb;
+}
+
 /* This helper checks if a socket is a LISTEN or NEW_SYN_RECV
  * SYNACK messages can be attached to either ones (depending on SYNCOOKIE)
  */
diff --git a/net/Kconfig b/net/Kconfig
index 6fa1a44..c6a3f14 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -407,6 +407,10 @@ config GRO_CELLS
 	bool
 	default n
 
+config SOCK_VALIDATE_XMIT
+	bool
+	default n
+
 config NET_DEVLINK
 	tristate "Network physical/parent device Netlink interface"
 	help
diff --git a/net/core/dev.c b/net/core/dev.c
index c624a04..9fcf993 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3114,6 +3114,10 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device
 	if (unlikely(!skb))
 		goto out_null;
 
+	skb = sk_validate_xmit_skb(skb, dev);
+	if (unlikely(!skb))
+		goto out_null;
+
 	if (netif_needs_gso(skb, features)) {
 		struct sk_buff *segs;
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V6 net-next 07/15] net/tls: Add generic NIC offload infrastructure
From: Boris Pismenny @ 2018-04-22 15:19 UTC (permalink / raw)
  To: davem
  Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <1524410397-108425-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

This patch adds a generic infrastructure to offload TLS crypto to a
network device. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path.
Leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to
the TCP layer just like the SW KTLS implementation and using the same
API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS record", those
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN. The
driver calls the TLS layer to obtain the TLS record that includes the
TCP of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmits the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 include/net/tls.h             |  69 +++-
 net/tls/Kconfig               |  10 +
 net/tls/Makefile              |   2 +
 net/tls/tls_device.c          | 759 ++++++++++++++++++++++++++++++++++++++++++
 net/tls/tls_device_fallback.c | 450 +++++++++++++++++++++++++
 net/tls/tls_main.c            |  42 ++-
 6 files changed, 1327 insertions(+), 5 deletions(-)
 create mode 100644 net/tls/tls_device.c
 create mode 100644 net/tls/tls_device_fallback.c

diff --git a/include/net/tls.h b/include/net/tls.h
index 95a8c60..8c56809 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -116,6 +116,37 @@ struct tls_sw_context_rx {
 	bool decrypted;
 };
 
+struct tls_record_info {
+	struct list_head list;
+	u32 end_seq;
+	int len;
+	int num_frags;
+	skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+struct tls_offload_context {
+	struct crypto_aead *aead_send;
+	spinlock_t lock;	/* protects records list */
+	struct list_head records_list;
+	struct tls_record_info *open_record;
+	struct tls_record_info *retransmit_hint;
+	u64 hint_record_sn;
+	u64 unacked_record_sn;
+
+	struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
+	void (*sk_destruct)(struct sock *sk);
+	u8 driver_state[];
+	/* The TLS layer reserves room for driver specific state
+	 * Currently the belief is that there is not enough
+	 * driver specific state to justify another layer of indirection
+	 */
+#define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *)))
+};
+
+#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
+	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
+	 TLS_DRIVER_STATE_SIZE)
+
 enum {
 	TLS_PENDING_CLOSED_RECORD
 };
@@ -195,9 +226,28 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
 			   struct pipe_inode_info *pipe,
 			   size_t len, unsigned int flags);
 
-void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
-void tls_icsk_clean_acked(struct sock *sk);
+int tls_set_device_offload(struct sock *sk, struct tls_context *ctx);
+int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
+int tls_device_sendpage(struct sock *sk, struct page *page,
+			int offset, size_t size, int flags);
+void tls_device_sk_destruct(struct sock *sk);
+void tls_device_init(void);
+void tls_device_cleanup(void);
+
+struct tls_record_info *tls_get_record(struct tls_offload_context *context,
+				       u32 seq, u64 *p_record_sn);
+
+static inline bool tls_record_is_start_marker(struct tls_record_info *rec)
+{
+	return rec->len == 0;
+}
+
+static inline u32 tls_record_start_seq(struct tls_record_info *rec)
+{
+	return rec->end_seq - rec->len;
+}
 
+void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
 int tls_push_sg(struct sock *sk, struct tls_context *ctx,
 		struct scatterlist *sg, u16 first_offset,
 		int flags);
@@ -234,6 +284,13 @@ static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx)
 	return tls_ctx->pending_open_record_frags;
 }
 
+static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk)
+{
+	return sk_fullsock(sk) &&
+	       /* matches smp_store_release in tls_set_device_offload */
+	       smp_load_acquire(&sk->sk_destruct) == &tls_device_sk_destruct;
+}
+
 static inline void tls_err_abort(struct sock *sk, int err)
 {
 	sk->sk_err = err;
@@ -329,4 +386,12 @@ int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
 void tls_register_device(struct tls_device *device);
 void tls_unregister_device(struct tls_device *device);
 
+struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
+				      struct net_device *dev,
+				      struct sk_buff *skb);
+
+int tls_sw_fallback_init(struct sock *sk,
+			 struct tls_offload_context *offload_ctx,
+			 struct tls_crypto_info *crypto_info);
+
 #endif /* _TLS_OFFLOAD_H */
diff --git a/net/tls/Kconfig b/net/tls/Kconfig
index 89b8745a..73f05ec 100644
--- a/net/tls/Kconfig
+++ b/net/tls/Kconfig
@@ -14,3 +14,13 @@ config TLS
 	encryption handling of the TLS protocol to be done in-kernel.
 
 	If unsure, say N.
+
+config TLS_DEVICE
+	bool "Transport Layer Security HW offload"
+	depends on TLS
+	select SOCK_VALIDATE_XMIT
+	default n
+	help
+	Enable kernel support for HW offload of the TLS protocol.
+
+	If unsure, say N.
diff --git a/net/tls/Makefile b/net/tls/Makefile
index a930fd1..4d6b728 100644
--- a/net/tls/Makefile
+++ b/net/tls/Makefile
@@ -5,3 +5,5 @@
 obj-$(CONFIG_TLS) += tls.o
 
 tls-y := tls_main.o tls_sw.o
+
+tls-$(CONFIG_TLS_DEVICE) += tls_device.o tls_device_fallback.o
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
new file mode 100644
index 0000000..c363cec
--- /dev/null
+++ b/net/tls/tls_device.c
@@ -0,0 +1,759 @@
+/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/module.h>
+#include <net/tcp.h>
+#include <net/inet_common.h>
+#include <linux/highmem.h>
+#include <linux/netdevice.h>
+
+#include <net/tls.h>
+#include <crypto/aead.h>
+
+/* device_offload_lock is used to synchronize tls_dev_add
+ * against NETDEV_DOWN notifications.
+ */
+static DECLARE_RWSEM(device_offload_lock);
+
+static void tls_device_gc_task(struct work_struct *work);
+
+static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
+static LIST_HEAD(tls_device_gc_list);
+static LIST_HEAD(tls_device_list);
+static DEFINE_SPINLOCK(tls_device_lock);
+
+static void tls_device_free_ctx(struct tls_context *ctx)
+{
+	struct tls_offload_context *offload_ctx = tls_offload_ctx(ctx);
+
+	kfree(offload_ctx);
+	kfree(ctx);
+}
+
+static void tls_device_gc_task(struct work_struct *work)
+{
+	struct tls_context *ctx, *tmp;
+	unsigned long flags;
+	LIST_HEAD(gc_list);
+
+	spin_lock_irqsave(&tls_device_lock, flags);
+	list_splice_init(&tls_device_gc_list, &gc_list);
+	spin_unlock_irqrestore(&tls_device_lock, flags);
+
+	list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
+		struct net_device *netdev = ctx->netdev;
+
+		if (netdev) {
+			netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
+							TLS_OFFLOAD_CTX_DIR_TX);
+			dev_put(netdev);
+		}
+
+		list_del(&ctx->list);
+		tls_device_free_ctx(ctx);
+	}
+}
+
+static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&tls_device_lock, flags);
+	list_move_tail(&ctx->list, &tls_device_gc_list);
+
+	/* schedule_work inside the spinlock
+	 * to make sure tls_device_down waits for that work.
+	 */
+	schedule_work(&tls_device_gc_work);
+
+	spin_unlock_irqrestore(&tls_device_lock, flags);
+}
+
+/* We assume that the socket is already connected */
+static struct net_device *get_netdev_for_sock(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct net_device *netdev = NULL;
+
+	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
+
+	return netdev;
+}
+
+static void destroy_record(struct tls_record_info *record)
+{
+	int nr_frags = record->num_frags;
+	skb_frag_t *frag;
+
+	while (nr_frags-- > 0) {
+		frag = &record->frags[nr_frags];
+		__skb_frag_unref(frag);
+	}
+	kfree(record);
+}
+
+static void delete_all_records(struct tls_offload_context *offload_ctx)
+{
+	struct tls_record_info *info, *temp;
+
+	list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) {
+		list_del(&info->list);
+		destroy_record(info);
+	}
+
+	offload_ctx->retransmit_hint = NULL;
+}
+
+static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_record_info *info, *temp;
+	struct tls_offload_context *ctx;
+	u64 deleted_records = 0;
+	unsigned long flags;
+
+	if (!tls_ctx)
+		return;
+
+	ctx = tls_offload_ctx(tls_ctx);
+
+	spin_lock_irqsave(&ctx->lock, flags);
+	info = ctx->retransmit_hint;
+	if (info && !before(acked_seq, info->end_seq)) {
+		ctx->retransmit_hint = NULL;
+		list_del(&info->list);
+		destroy_record(info);
+		deleted_records++;
+	}
+
+	list_for_each_entry_safe(info, temp, &ctx->records_list, list) {
+		if (before(acked_seq, info->end_seq))
+			break;
+		list_del(&info->list);
+
+		destroy_record(info);
+		deleted_records++;
+	}
+
+	ctx->unacked_record_sn += deleted_records;
+	spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+/* At this point, there should be no references on this
+ * socket and no in-flight SKBs associated with this
+ * socket, so it is safe to free all the resources.
+ */
+void tls_device_sk_destruct(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+
+	if (ctx->open_record)
+		destroy_record(ctx->open_record);
+
+	delete_all_records(ctx);
+	crypto_free_aead(ctx->aead_send);
+	ctx->sk_destruct(sk);
+	clean_acked_data_disable(inet_csk(sk));
+
+	if (refcount_dec_and_test(&tls_ctx->refcount))
+		tls_device_queue_ctx_destruction(tls_ctx);
+}
+EXPORT_SYMBOL(tls_device_sk_destruct);
+
+static void tls_append_frag(struct tls_record_info *record,
+			    struct page_frag *pfrag,
+			    int size)
+{
+	skb_frag_t *frag;
+
+	frag = &record->frags[record->num_frags - 1];
+	if (frag->page.p == pfrag->page &&
+	    frag->page_offset + frag->size == pfrag->offset) {
+		frag->size += size;
+	} else {
+		++frag;
+		frag->page.p = pfrag->page;
+		frag->page_offset = pfrag->offset;
+		frag->size = size;
+		++record->num_frags;
+		get_page(pfrag->page);
+	}
+
+	pfrag->offset += size;
+	record->len += size;
+}
+
+static int tls_push_record(struct sock *sk,
+			   struct tls_context *ctx,
+			   struct tls_offload_context *offload_ctx,
+			   struct tls_record_info *record,
+			   struct page_frag *pfrag,
+			   int flags,
+			   unsigned char record_type)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct page_frag dummy_tag_frag;
+	skb_frag_t *frag;
+	int i;
+
+	/* fill prepend */
+	frag = &record->frags[0];
+	tls_fill_prepend(ctx,
+			 skb_frag_address(frag),
+			 record->len - ctx->tx.prepend_size,
+			 record_type);
+
+	/* HW doesn't care about the data in the tag, because it fills it. */
+	dummy_tag_frag.page = skb_frag_page(frag);
+	dummy_tag_frag.offset = 0;
+
+	tls_append_frag(record, &dummy_tag_frag, ctx->tx.tag_size);
+	record->end_seq = tp->write_seq + record->len;
+	spin_lock_irq(&offload_ctx->lock);
+	list_add_tail(&record->list, &offload_ctx->records_list);
+	spin_unlock_irq(&offload_ctx->lock);
+	offload_ctx->open_record = NULL;
+	set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
+	tls_advance_record_sn(sk, &ctx->tx);
+
+	for (i = 0; i < record->num_frags; i++) {
+		frag = &record->frags[i];
+		sg_unmark_end(&offload_ctx->sg_tx_data[i]);
+		sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
+			    frag->size, frag->page_offset);
+		sk_mem_charge(sk, frag->size);
+		get_page(skb_frag_page(frag));
+	}
+	sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]);
+
+	/* all ready, send */
+	return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
+}
+
+static int tls_create_new_record(struct tls_offload_context *offload_ctx,
+				 struct page_frag *pfrag,
+				 size_t prepend_size)
+{
+	struct tls_record_info *record;
+	skb_frag_t *frag;
+
+	record = kmalloc(sizeof(*record), GFP_KERNEL);
+	if (!record)
+		return -ENOMEM;
+
+	frag = &record->frags[0];
+	__skb_frag_set_page(frag, pfrag->page);
+	frag->page_offset = pfrag->offset;
+	skb_frag_size_set(frag, prepend_size);
+
+	get_page(pfrag->page);
+	pfrag->offset += prepend_size;
+
+	record->num_frags = 1;
+	record->len = prepend_size;
+	offload_ctx->open_record = record;
+	return 0;
+}
+
+static int tls_do_allocation(struct sock *sk,
+			     struct tls_offload_context *offload_ctx,
+			     struct page_frag *pfrag,
+			     size_t prepend_size)
+{
+	int ret;
+
+	if (!offload_ctx->open_record) {
+		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
+						   sk->sk_allocation))) {
+			sk->sk_prot->enter_memory_pressure(sk);
+			sk_stream_moderate_sndbuf(sk);
+			return -ENOMEM;
+		}
+
+		ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
+		if (ret)
+			return ret;
+
+		if (pfrag->size > pfrag->offset)
+			return 0;
+	}
+
+	if (!sk_page_frag_refill(sk, pfrag))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int tls_push_data(struct sock *sk,
+			 struct iov_iter *msg_iter,
+			 size_t size, int flags,
+			 unsigned char record_type)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+	int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
+	int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
+	struct tls_record_info *record = ctx->open_record;
+	struct page_frag *pfrag;
+	size_t orig_size = size;
+	u32 max_open_record_len;
+	int copy, rc = 0;
+	bool done = false;
+	long timeo;
+
+	if (flags &
+	    ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST))
+		return -ENOTSUPP;
+
+	if (sk->sk_err)
+		return -sk->sk_err;
+
+	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
+	rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo);
+	if (rc < 0)
+		return rc;
+
+	pfrag = sk_page_frag(sk);
+
+	/* TLS_HEADER_SIZE is not counted as part of the TLS record, and
+	 * we need to leave room for an authentication tag.
+	 */
+	max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
+			      tls_ctx->tx.prepend_size;
+	do {
+		rc = tls_do_allocation(sk, ctx, pfrag,
+				       tls_ctx->tx.prepend_size);
+		if (rc) {
+			rc = sk_stream_wait_memory(sk, &timeo);
+			if (!rc)
+				continue;
+
+			record = ctx->open_record;
+			if (!record)
+				break;
+handle_error:
+			if (record_type != TLS_RECORD_TYPE_DATA) {
+				/* avoid sending partial
+				 * record with type !=
+				 * application_data
+				 */
+				size = orig_size;
+				destroy_record(record);
+				ctx->open_record = NULL;
+			} else if (record->len > tls_ctx->tx.prepend_size) {
+				goto last_record;
+			}
+
+			break;
+		}
+
+		record = ctx->open_record;
+		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
+		copy = min_t(size_t, copy, (max_open_record_len - record->len));
+
+		if (copy_from_iter_nocache(page_address(pfrag->page) +
+					       pfrag->offset,
+					   copy, msg_iter) != copy) {
+			rc = -EFAULT;
+			goto handle_error;
+		}
+		tls_append_frag(record, pfrag, copy);
+
+		size -= copy;
+		if (!size) {
+last_record:
+			tls_push_record_flags = flags;
+			if (more) {
+				tls_ctx->pending_open_record_frags =
+						record->num_frags;
+				break;
+			}
+
+			done = true;
+		}
+
+		if (done || record->len >= max_open_record_len ||
+		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
+			rc = tls_push_record(sk,
+					     tls_ctx,
+					     ctx,
+					     record,
+					     pfrag,
+					     tls_push_record_flags,
+					     record_type);
+			if (rc < 0)
+				break;
+		}
+	} while (!done);
+
+	if (orig_size - size > 0)
+		rc = orig_size - size;
+
+	return rc;
+}
+
+int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+	unsigned char record_type = TLS_RECORD_TYPE_DATA;
+	int rc;
+
+	lock_sock(sk);
+
+	if (unlikely(msg->msg_controllen)) {
+		rc = tls_proccess_cmsg(sk, msg, &record_type);
+		if (rc)
+			goto out;
+	}
+
+	rc = tls_push_data(sk, &msg->msg_iter, size,
+			   msg->msg_flags, record_type);
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+int tls_device_sendpage(struct sock *sk, struct page *page,
+			int offset, size_t size, int flags)
+{
+	struct iov_iter	msg_iter;
+	char *kaddr = kmap(page);
+	struct kvec iov;
+	int rc;
+
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		flags |= MSG_MORE;
+
+	lock_sock(sk);
+
+	if (flags & MSG_OOB) {
+		rc = -ENOTSUPP;
+		goto out;
+	}
+
+	iov.iov_base = kaddr + offset;
+	iov.iov_len = size;
+	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size);
+	rc = tls_push_data(sk, &msg_iter, size,
+			   flags, TLS_RECORD_TYPE_DATA);
+	kunmap(page);
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+struct tls_record_info *tls_get_record(struct tls_offload_context *context,
+				       u32 seq, u64 *p_record_sn)
+{
+	u64 record_sn = context->hint_record_sn;
+	struct tls_record_info *info;
+
+	info = context->retransmit_hint;
+	if (!info ||
+	    before(seq, info->end_seq - info->len)) {
+		/* if retransmit_hint is irrelevant start
+		 * from the beggining of the list
+		 */
+		info = list_first_entry(&context->records_list,
+					struct tls_record_info, list);
+		record_sn = context->unacked_record_sn;
+	}
+
+	list_for_each_entry_from(info, &context->records_list, list) {
+		if (before(seq, info->end_seq)) {
+			if (!context->retransmit_hint ||
+			    after(info->end_seq,
+				  context->retransmit_hint->end_seq)) {
+				context->hint_record_sn = record_sn;
+				context->retransmit_hint = info;
+			}
+			*p_record_sn = record_sn;
+			return info;
+		}
+		record_sn++;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL(tls_get_record);
+
+static int tls_device_push_pending_record(struct sock *sk, int flags)
+{
+	struct iov_iter	msg_iter;
+
+	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
+	return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
+}
+
+int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
+{
+	u16 nonce_size, tag_size, iv_size, rec_seq_size;
+	struct tls_record_info *start_marker_record;
+	struct tls_offload_context *offload_ctx;
+	struct tls_crypto_info *crypto_info;
+	struct net_device *netdev;
+	char *iv, *rec_seq;
+	struct sk_buff *skb;
+	int rc = -EINVAL;
+	__be64 rcd_sn;
+
+	if (!ctx)
+		goto out;
+
+	if (ctx->priv_ctx_tx) {
+		rc = -EEXIST;
+		goto out;
+	}
+
+	start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
+	if (!start_marker_record) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE, GFP_KERNEL);
+	if (!offload_ctx) {
+		rc = -ENOMEM;
+		goto free_marker_record;
+	}
+
+	crypto_info = &ctx->crypto_send;
+	switch (crypto_info->cipher_type) {
+	case TLS_CIPHER_AES_GCM_128:
+		nonce_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
+		tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+		iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
+		iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv;
+		rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
+		rec_seq =
+		 ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq;
+		break;
+	default:
+		rc = -EINVAL;
+		goto free_offload_ctx;
+	}
+
+	ctx->tx.prepend_size = TLS_HEADER_SIZE + nonce_size;
+	ctx->tx.tag_size = tag_size;
+	ctx->tx.overhead_size = ctx->tx.prepend_size + ctx->tx.tag_size;
+	ctx->tx.iv_size = iv_size;
+	ctx->tx.iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
+			     GFP_KERNEL);
+	if (!ctx->tx.iv) {
+		rc = -ENOMEM;
+		goto free_offload_ctx;
+	}
+
+	memcpy(ctx->tx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
+
+	ctx->tx.rec_seq_size = rec_seq_size;
+	ctx->tx.rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
+	if (!ctx->tx.rec_seq) {
+		rc = -ENOMEM;
+		goto free_iv;
+	}
+	memcpy(ctx->tx.rec_seq, rec_seq, rec_seq_size);
+
+	rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
+	if (rc)
+		goto free_rec_seq;
+
+	/* start at rec_seq - 1 to account for the start marker record */
+	memcpy(&rcd_sn, ctx->tx.rec_seq, sizeof(rcd_sn));
+	offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
+
+	start_marker_record->end_seq = tcp_sk(sk)->write_seq;
+	start_marker_record->len = 0;
+	start_marker_record->num_frags = 0;
+
+	INIT_LIST_HEAD(&offload_ctx->records_list);
+	list_add_tail(&start_marker_record->list, &offload_ctx->records_list);
+	spin_lock_init(&offload_ctx->lock);
+
+	clean_acked_data_enable(inet_csk(sk), &tls_icsk_clean_acked);
+	ctx->push_pending_record = tls_device_push_pending_record;
+	offload_ctx->sk_destruct = sk->sk_destruct;
+
+	/* TLS offload is greatly simplified if we don't send
+	 * SKBs where only part of the payload needs to be encrypted.
+	 * So mark the last skb in the write queue as end of record.
+	 */
+	skb = tcp_write_queue_tail(sk);
+	if (skb)
+		TCP_SKB_CB(skb)->eor = 1;
+
+	refcount_set(&ctx->refcount, 1);
+
+	/* We support starting offload on multiple sockets
+	 * concurrently, so we only need a read lock here.
+	 * This lock must precede get_netdev_for_sock to prevent races between
+	 * NETDEV_DOWN and setsockopt.
+	 */
+	down_read(&device_offload_lock);
+	netdev = get_netdev_for_sock(sk);
+	if (!netdev) {
+		pr_err_ratelimited("%s: netdev not found\n", __func__);
+		rc = -EINVAL;
+		goto release_lock;
+	}
+
+	if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
+		rc = -ENOTSUPP;
+		goto release_netdev;
+	}
+
+	/* Avoid offloading if the device is down
+	 * We don't want to offload new flows after
+	 * the NETDEV_DOWN event
+	 */
+	if (!(netdev->flags & IFF_UP)) {
+		rc = -EINVAL;
+		goto release_netdev;
+	}
+
+	ctx->priv_ctx_tx = offload_ctx;
+	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
+					     &ctx->crypto_send,
+					     tcp_sk(sk)->write_seq);
+	if (rc)
+		goto release_netdev;
+
+	ctx->netdev = netdev;
+
+	spin_lock_irq(&tls_device_lock);
+	list_add_tail(&ctx->list, &tls_device_list);
+	spin_unlock_irq(&tls_device_lock);
+
+	sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
+	/* following this assignment tls_is_sk_tx_device_offloaded
+	 * will return true and the context might be accessed
+	 * by the netdev's xmit function.
+	 */
+	smp_store_release(&sk->sk_destruct,
+			  &tls_device_sk_destruct);
+	up_read(&device_offload_lock);
+	goto out;
+
+release_netdev:
+	dev_put(netdev);
+release_lock:
+	up_read(&device_offload_lock);
+	clean_acked_data_disable(inet_csk(sk));
+	crypto_free_aead(offload_ctx->aead_send);
+free_rec_seq:
+	kfree(ctx->tx.rec_seq);
+free_iv:
+	kfree(ctx->tx.iv);
+free_offload_ctx:
+	kfree(offload_ctx);
+	ctx->priv_ctx_tx = NULL;
+free_marker_record:
+	kfree(start_marker_record);
+out:
+	return rc;
+}
+
+static int tls_device_down(struct net_device *netdev)
+{
+	struct tls_context *ctx, *tmp;
+	unsigned long flags;
+	LIST_HEAD(list);
+
+	/* Request a write lock to block new offload attempts */
+	down_write(&device_offload_lock);
+
+	spin_lock_irqsave(&tls_device_lock, flags);
+	list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) {
+		if (ctx->netdev != netdev ||
+		    !refcount_inc_not_zero(&ctx->refcount))
+			continue;
+
+		list_move(&ctx->list, &list);
+	}
+	spin_unlock_irqrestore(&tls_device_lock, flags);
+
+	list_for_each_entry_safe(ctx, tmp, &list, list)	{
+		netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
+						TLS_OFFLOAD_CTX_DIR_TX);
+		ctx->netdev = NULL;
+		dev_put(netdev);
+		list_del_init(&ctx->list);
+
+		if (refcount_dec_and_test(&ctx->refcount))
+			tls_device_free_ctx(ctx);
+	}
+
+	up_write(&device_offload_lock);
+
+	flush_work(&tls_device_gc_work);
+
+	return NOTIFY_DONE;
+}
+
+static int tls_dev_event(struct notifier_block *this, unsigned long event,
+			 void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+	if (!(dev->features & NETIF_F_HW_TLS_TX))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+	case NETDEV_FEAT_CHANGE:
+		if  (dev->tlsdev_ops &&
+		     dev->tlsdev_ops->tls_dev_add &&
+		     dev->tlsdev_ops->tls_dev_del)
+			return NOTIFY_DONE;
+		else
+			return NOTIFY_BAD;
+	case NETDEV_DOWN:
+		return tls_device_down(dev);
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block tls_dev_notifier = {
+	.notifier_call	= tls_dev_event,
+};
+
+void __init tls_device_init(void)
+{
+	register_netdevice_notifier(&tls_dev_notifier);
+}
+
+void __exit tls_device_cleanup(void)
+{
+	unregister_netdevice_notifier(&tls_dev_notifier);
+	flush_work(&tls_device_gc_work);
+}
diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
new file mode 100644
index 0000000..748914a
--- /dev/null
+++ b/net/tls/tls_device_fallback.c
@@ -0,0 +1,450 @@
+/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <net/tls.h>
+#include <crypto/aead.h>
+#include <crypto/scatterwalk.h>
+#include <net/ip6_checksum.h>
+
+static void chain_to_walk(struct scatterlist *sg, struct scatter_walk *walk)
+{
+	struct scatterlist *src = walk->sg;
+	int diff = walk->offset - src->offset;
+
+	sg_set_page(sg, sg_page(src),
+		    src->length - diff, walk->offset);
+
+	scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
+}
+
+static int tls_enc_record(struct aead_request *aead_req,
+			  struct crypto_aead *aead, char *aad,
+			  char *iv, __be64 rcd_sn,
+			  struct scatter_walk *in,
+			  struct scatter_walk *out, int *in_len)
+{
+	unsigned char buf[TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE];
+	struct scatterlist sg_in[3];
+	struct scatterlist sg_out[3];
+	u16 len;
+	int rc;
+
+	len = min_t(int, *in_len, ARRAY_SIZE(buf));
+
+	scatterwalk_copychunks(buf, in, len, 0);
+	scatterwalk_copychunks(buf, out, len, 1);
+
+	*in_len -= len;
+	if (!*in_len)
+		return 0;
+
+	scatterwalk_pagedone(in, 0, 1);
+	scatterwalk_pagedone(out, 1, 1);
+
+	len = buf[4] | (buf[3] << 8);
+	len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
+
+	tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
+		     (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
+
+	memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf + TLS_HEADER_SIZE,
+	       TLS_CIPHER_AES_GCM_128_IV_SIZE);
+
+	sg_init_table(sg_in, ARRAY_SIZE(sg_in));
+	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
+	sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
+	sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
+	chain_to_walk(sg_in + 1, in);
+	chain_to_walk(sg_out + 1, out);
+
+	*in_len -= len;
+	if (*in_len < 0) {
+		*in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+		/* the input buffer doesn't contain the entire record.
+		 * trim len accordingly. The resulting authentication tag
+		 * will contain garbage, but we don't care, so we won't
+		 * include any of it in the output skb
+		 * Note that we assume the output buffer length
+		 * is larger then input buffer length + tag size
+		 */
+		if (*in_len < 0)
+			len += *in_len;
+
+		*in_len = 0;
+	}
+
+	if (*in_len) {
+		scatterwalk_copychunks(NULL, in, len, 2);
+		scatterwalk_pagedone(in, 0, 1);
+		scatterwalk_copychunks(NULL, out, len, 2);
+		scatterwalk_pagedone(out, 1, 1);
+	}
+
+	len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+	aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
+
+	rc = crypto_aead_encrypt(aead_req);
+
+	return rc;
+}
+
+static void tls_init_aead_request(struct aead_request *aead_req,
+				  struct crypto_aead *aead)
+{
+	aead_request_set_tfm(aead_req, aead);
+	aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
+}
+
+static struct aead_request *tls_alloc_aead_request(struct crypto_aead *aead,
+						   gfp_t flags)
+{
+	unsigned int req_size = sizeof(struct aead_request) +
+		crypto_aead_reqsize(aead);
+	struct aead_request *aead_req;
+
+	aead_req = kzalloc(req_size, flags);
+	if (aead_req)
+		tls_init_aead_request(aead_req, aead);
+	return aead_req;
+}
+
+static int tls_enc_records(struct aead_request *aead_req,
+			   struct crypto_aead *aead, struct scatterlist *sg_in,
+			   struct scatterlist *sg_out, char *aad, char *iv,
+			   u64 rcd_sn, int len)
+{
+	struct scatter_walk out, in;
+	int rc;
+
+	scatterwalk_start(&in, sg_in);
+	scatterwalk_start(&out, sg_out);
+
+	do {
+		rc = tls_enc_record(aead_req, aead, aad, iv,
+				    cpu_to_be64(rcd_sn), &in, &out, &len);
+		rcd_sn++;
+
+	} while (rc == 0 && len);
+
+	scatterwalk_done(&in, 0, 0);
+	scatterwalk_done(&out, 1, 0);
+
+	return rc;
+}
+
+/* Can't use icsk->icsk_af_ops->send_check here because the ip addresses
+ * might have been changed by NAT.
+ */
+static void update_chksum(struct sk_buff *skb, int headln)
+{
+	struct tcphdr *th = tcp_hdr(skb);
+	int datalen = skb->len - headln;
+	const struct ipv6hdr *ipv6h;
+	const struct iphdr *iph;
+
+	/* We only changed the payload so if we are using partial we don't
+	 * need to update anything.
+	 */
+	if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
+		return;
+
+	skb->ip_summed = CHECKSUM_PARTIAL;
+	skb->csum_start = skb_transport_header(skb) - skb->head;
+	skb->csum_offset = offsetof(struct tcphdr, check);
+
+	if (skb->sk->sk_family == AF_INET6) {
+		ipv6h = ipv6_hdr(skb);
+		th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
+					     datalen, IPPROTO_TCP, 0);
+	} else {
+		iph = ip_hdr(skb);
+		th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
+					       IPPROTO_TCP, 0);
+	}
+}
+
+static void complete_skb(struct sk_buff *nskb, struct sk_buff *skb, int headln)
+{
+	skb_copy_header(nskb, skb);
+
+	skb_put(nskb, skb->len);
+	memcpy(nskb->data, skb->data, headln);
+	update_chksum(nskb, headln);
+
+	nskb->destructor = skb->destructor;
+	nskb->sk = skb->sk;
+	skb->destructor = NULL;
+	skb->sk = NULL;
+	refcount_add(nskb->truesize - skb->truesize,
+		     &nskb->sk->sk_wmem_alloc);
+}
+
+/* This function may be called after the user socket is already
+ * closed so make sure we don't use anything freed during
+ * tls_sk_proto_close here
+ */
+
+static int fill_sg_in(struct scatterlist *sg_in,
+		      struct sk_buff *skb,
+		      struct tls_offload_context *ctx,
+		      u64 *rcd_sn,
+		      s32 *sync_size,
+		      int *resync_sgs)
+{
+	int tcp_payload_offset = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	int payload_len = skb->len - tcp_payload_offset;
+	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
+	struct tls_record_info *record;
+	unsigned long flags;
+	int remaining;
+	int i;
+
+	spin_lock_irqsave(&ctx->lock, flags);
+	record = tls_get_record(ctx, tcp_seq, rcd_sn);
+	if (!record) {
+		spin_unlock_irqrestore(&ctx->lock, flags);
+		WARN(1, "Record not found for seq %u\n", tcp_seq);
+		return -EINVAL;
+	}
+
+	*sync_size = tcp_seq - tls_record_start_seq(record);
+	if (*sync_size < 0) {
+		int is_start_marker = tls_record_is_start_marker(record);
+
+		spin_unlock_irqrestore(&ctx->lock, flags);
+		/* This should only occur if the relevant record was
+		 * already acked. In that case it should be ok
+		 * to drop the packet and avoid retransmission.
+		 *
+		 * There is a corner case where the packet contains
+		 * both an acked and a non-acked record.
+		 * We currently don't handle that case and rely
+		 * on TCP to retranmit a packet that doesn't contain
+		 * already acked payload.
+		 */
+		if (!is_start_marker)
+			*sync_size = 0;
+		return -EINVAL;
+	}
+
+	remaining = *sync_size;
+	for (i = 0; remaining > 0; i++) {
+		skb_frag_t *frag = &record->frags[i];
+
+		__skb_frag_ref(frag);
+		sg_set_page(sg_in + i, skb_frag_page(frag),
+			    skb_frag_size(frag), frag->page_offset);
+
+		remaining -= skb_frag_size(frag);
+
+		if (remaining < 0)
+			sg_in[i].length += remaining;
+	}
+	*resync_sgs = i;
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+	if (skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset, payload_len) < 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void fill_sg_out(struct scatterlist sg_out[3], void *buf,
+			struct tls_context *tls_ctx,
+			struct sk_buff *nskb,
+			int tcp_payload_offset,
+			int payload_len,
+			int sync_size,
+			void *dummy_buf)
+{
+	sg_set_buf(&sg_out[0], dummy_buf, sync_size);
+	sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset, payload_len);
+	/* Add room for authentication tag produced by crypto */
+	dummy_buf += sync_size;
+	sg_set_buf(&sg_out[2], dummy_buf, TLS_CIPHER_AES_GCM_128_TAG_SIZE);
+}
+
+static struct sk_buff *tls_enc_skb(struct tls_context *tls_ctx,
+				   struct scatterlist sg_out[3],
+				   struct scatterlist *sg_in,
+				   struct sk_buff *skb,
+				   s32 sync_size, u64 rcd_sn)
+{
+	int tcp_payload_offset = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+	int payload_len = skb->len - tcp_payload_offset;
+	void *buf, *iv, *aad, *dummy_buf;
+	struct aead_request *aead_req;
+	struct sk_buff *nskb = NULL;
+	int buf_len;
+
+	aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC);
+	if (!aead_req)
+		return NULL;
+
+	buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
+		  TLS_CIPHER_AES_GCM_128_IV_SIZE +
+		  TLS_AAD_SPACE_SIZE +
+		  sync_size +
+		  TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+	buf = kmalloc(buf_len, GFP_ATOMIC);
+	if (!buf)
+		goto free_req;
+
+	iv = buf;
+	memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
+	       TLS_CIPHER_AES_GCM_128_SALT_SIZE);
+	aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
+	      TLS_CIPHER_AES_GCM_128_IV_SIZE;
+	dummy_buf = aad + TLS_AAD_SPACE_SIZE;
+
+	nskb = alloc_skb(skb_headroom(skb) + skb->len, GFP_ATOMIC);
+	if (!nskb)
+		goto free_buf;
+
+	skb_reserve(nskb, skb_headroom(skb));
+
+	fill_sg_out(sg_out, buf, tls_ctx, nskb, tcp_payload_offset,
+		    payload_len, sync_size, dummy_buf);
+
+	if (tls_enc_records(aead_req, ctx->aead_send, sg_in, sg_out, aad, iv,
+			    rcd_sn, sync_size + payload_len) < 0)
+		goto free_nskb;
+
+	complete_skb(nskb, skb, tcp_payload_offset);
+
+	/* validate_xmit_skb_list assumes that if the skb wasn't segmented
+	 * nskb->prev will point to the skb itself
+	 */
+	nskb->prev = nskb;
+
+free_buf:
+	kfree(buf);
+free_req:
+	kfree(aead_req);
+	return nskb;
+free_nskb:
+	kfree_skb(nskb);
+	nskb = NULL;
+	goto free_buf;
+}
+
+static struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb)
+{
+	int tcp_payload_offset = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+	int payload_len = skb->len - tcp_payload_offset;
+	struct scatterlist *sg_in, sg_out[3];
+	struct sk_buff *nskb = NULL;
+	int sg_in_max_elements;
+	int resync_sgs = 0;
+	s32 sync_size = 0;
+	u64 rcd_sn;
+
+	/* worst case is:
+	 * MAX_SKB_FRAGS in tls_record_info
+	 * MAX_SKB_FRAGS + 1 in SKB head and frags.
+	 */
+	sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
+
+	if (!payload_len)
+		return skb;
+
+	sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in), GFP_ATOMIC);
+	if (!sg_in)
+		goto free_orig;
+
+	sg_init_table(sg_in, sg_in_max_elements);
+	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
+
+	if (fill_sg_in(sg_in, skb, ctx, &rcd_sn, &sync_size, &resync_sgs)) {
+		/* bypass packets before kernel TLS socket option was set */
+		if (sync_size < 0 && payload_len <= -sync_size)
+			nskb = skb_get(skb);
+		goto put_sg;
+	}
+
+	nskb = tls_enc_skb(tls_ctx, sg_out, sg_in, skb, sync_size, rcd_sn);
+
+put_sg:
+	while (resync_sgs)
+		put_page(sg_page(&sg_in[--resync_sgs]));
+	kfree(sg_in);
+free_orig:
+	kfree_skb(skb);
+	return nskb;
+}
+
+struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
+				      struct net_device *dev,
+				      struct sk_buff *skb)
+{
+	if (dev == tls_get_ctx(sk)->netdev)
+		return skb;
+
+	return tls_sw_fallback(sk, skb);
+}
+
+int tls_sw_fallback_init(struct sock *sk,
+			 struct tls_offload_context *offload_ctx,
+			 struct tls_crypto_info *crypto_info)
+{
+	const u8 *key;
+	int rc;
+
+	offload_ctx->aead_send =
+	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(offload_ctx->aead_send)) {
+		rc = PTR_ERR(offload_ctx->aead_send);
+		pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n", rc);
+		offload_ctx->aead_send = NULL;
+		goto err_out;
+	}
+
+	key = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->key;
+
+	rc = crypto_aead_setkey(offload_ctx->aead_send, key,
+				TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+	if (rc)
+		goto free_aead;
+
+	rc = crypto_aead_setauthsize(offload_ctx->aead_send,
+				     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
+	if (rc)
+		goto free_aead;
+
+	return 0;
+free_aead:
+	crypto_free_aead(offload_ctx->aead_send);
+err_out:
+	return rc;
+}
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 545bf34..3aafb87 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -54,6 +54,9 @@ enum {
 enum {
 	TLS_BASE,
 	TLS_SW,
+#ifdef CONFIG_TLS_DEVICE
+	TLS_HW,
+#endif
 	TLS_HW_RECORD,
 	TLS_NUM_CONFIG,
 };
@@ -280,6 +283,15 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 		tls_sw_free_resources_rx(sk);
 	}
 
+#ifdef CONFIG_TLS_DEVICE
+	if (ctx->tx_conf != TLS_HW) {
+#else
+	{
+#endif
+		kfree(ctx);
+		ctx = NULL;
+	}
+
 skip_tx_cleanup:
 	release_sock(sk);
 	sk_proto_close(sk, timeout);
@@ -442,8 +454,16 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 	}
 
 	if (tx) {
-		rc = tls_set_sw_offload(sk, ctx, 1);
-		conf = TLS_SW;
+#ifdef CONFIG_TLS_DEVICE
+		rc = tls_set_device_offload(sk, ctx);
+		conf = TLS_HW;
+		if (rc) {
+#else
+		{
+#endif
+			rc = tls_set_sw_offload(sk, ctx, 1);
+			conf = TLS_SW;
+		}
 	} else {
 		rc = tls_set_sw_offload(sk, ctx, 0);
 		conf = TLS_SW;
@@ -596,6 +616,16 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
 	prot[TLS_SW][TLS_SW].recvmsg	= tls_sw_recvmsg;
 	prot[TLS_SW][TLS_SW].close	= tls_sk_proto_close;
 
+#ifdef CONFIG_TLS_DEVICE
+	prot[TLS_HW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
+	prot[TLS_HW][TLS_BASE].sendmsg		= tls_device_sendmsg;
+	prot[TLS_HW][TLS_BASE].sendpage		= tls_device_sendpage;
+
+	prot[TLS_HW][TLS_SW] = prot[TLS_BASE][TLS_SW];
+	prot[TLS_HW][TLS_SW].sendmsg		= tls_device_sendmsg;
+	prot[TLS_HW][TLS_SW].sendpage		= tls_device_sendpage;
+#endif
+
 	prot[TLS_HW_RECORD][TLS_HW_RECORD] = *base;
 	prot[TLS_HW_RECORD][TLS_HW_RECORD].hash		= tls_hw_hash;
 	prot[TLS_HW_RECORD][TLS_HW_RECORD].unhash	= tls_hw_unhash;
@@ -630,7 +660,7 @@ static int tls_init(struct sock *sk)
 	ctx->getsockopt = sk->sk_prot->getsockopt;
 	ctx->sk_proto_close = sk->sk_prot->close;
 
-	/* Build IPv6 TLS whenever the address of tcpv6_prot changes */
+	/* Build IPv6 TLS whenever the address of tcpv6	_prot changes */
 	if (ip_ver == TLSV6 &&
 	    unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) {
 		mutex_lock(&tcpv6_prot_mutex);
@@ -680,6 +710,9 @@ static int __init tls_register(void)
 	tls_sw_proto_ops.poll = tls_sw_poll;
 	tls_sw_proto_ops.splice_read = tls_sw_splice_read;
 
+#ifdef CONFIG_TLS_DEVICE
+	tls_device_init();
+#endif
 	tcp_register_ulp(&tcp_tls_ulp_ops);
 
 	return 0;
@@ -688,6 +721,9 @@ static int __init tls_register(void)
 static void __exit tls_unregister(void)
 {
 	tcp_unregister_ulp(&tcp_tls_ulp_ops);
+#ifdef CONFIG_TLS_DEVICE
+	tls_device_cleanup();
+#endif
 }
 
 module_init(tls_register);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V6 net-next 00/15] TLS offload, netdev & MLX5 support
From: Boris Pismenny @ 2018-04-22 15:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai

Hi Dave,

The following series provides TLS TX inline crypto offload.

v1->v2:
   - Added IS_ENABLED(CONFIG_TLS_DEVICE) and a STATIC_KEY for icsk_clean_acked
   - File license fix
   - Fix spelling, comment by DaveW
   - Move memory allocations out of tls_set_device_offload and other misc fixes,
	comments by Kiril.

v2->v3:
   - Reversed xmas tree where needed and style fixes
   - Removed the need for skb_page_frag_refill, per Eric's comment
   - IPv6 dependency fixes

v3->v4:
   - Remove "inline" from functions in C files
   - Make clean_acked_data_enabled a static variable and add enable/disable functions to control it.
   - Remove unnecessary variable initialization mentioned by ShannonN
   - Rebase over TLS RX
   - Refactor the tls_software_fallback to reduce the number of variables mentioned by KirilT

v4->v5:
   - Add missing CONFIG_TLS_DEVICE

v5->v6:
   - Move changes to the software implementation into a seperate patch
   - Fix some checkpatch warnings 
   - GPL export the enable/disable clean_acked_data functions

This series adds a generic infrastructure to offload TLS crypto to a
network devices. It enables the kernel TLS socket to skip encryption and
authentication operations on the transmit side of the data path. Leaving
those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the
TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS record", those
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN. The
driver calls the TLS layer to obtain the TLS record that includes the
TCP of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmit the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

Expected TCP SN is accessed without a lock, under the assumption that
TCP doesn't transmit SKBs from different TX queue concurrently.

If packets are rerouted to a different netdevice, then a software
fallback routine handles encryption.

Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf

Boris Pismenny (3):
  net/tls: Split conf to rx + tx
  MAINTAINERS: Update mlx5 innova driver maintainers
  MAINTAINERS: Update TLS maintainers

Ilya Lesokhin (12):
  tcp: Add clean acked data hook
  net: Rename and export copy_skb_header
  net: Add Software fallback infrastructure for socket dependent
    offloads
  net: Add TLS offload netdev ops
  net: Add TLS TX offload features
  net/tls: Add generic NIC offload infrastructure
  net/tls: Support TLS device offload with IPv6
  net/mlx5e: Move defines out of ipsec code
  net/mlx5: Accel, Add TLS tx offload interface
  net/mlx5e: TLS, Add Innova TLS TX support
  net/mlx5e: TLS, Add Innova TLS TX offload data path
  net/mlx5e: TLS, Add error statistics

 MAINTAINERS                                        |  19 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |  11 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c    |  71 ++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h    |  86 +++
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  21 +
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  72 ++
 .../ethernet/mellanox/mlx5/core/en_accel/ipsec.h   |   3 -
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 197 +++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  87 +++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 278 +++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h         |  50 ++
 .../mellanox/mlx5/core/en_accel/tls_stats.c        |  89 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  32 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  37 +-
 .../net/ethernet/mellanox/mlx5/core/fpga/core.h    |   1 +
 .../net/ethernet/mellanox/mlx5/core/fpga/ipsec.c   |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 563 ++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  68 ++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  11 +
 include/linux/mlx5/mlx5_ifc.h                      |  16 -
 include/linux/mlx5/mlx5_ifc_fpga.h                 |  77 ++
 include/linux/netdev_features.h                    |   2 +
 include/linux/netdevice.h                          |  24 +
 include/linux/skbuff.h                             |   1 +
 include/net/inet_connection_sock.h                 |   2 +
 include/net/sock.h                                 |  21 +
 include/net/tcp.h                                  |   8 +
 include/net/tls.h                                  | 120 ++-
 net/Kconfig                                        |   4 +
 net/core/dev.c                                     |   4 +
 net/core/ethtool.c                                 |   1 +
 net/core/skbuff.c                                  |   9 +-
 net/ipv4/tcp_input.c                               |  25 +
 net/tls/Kconfig                                    |  10 +
 net/tls/Makefile                                   |   2 +
 net/tls/tls_device.c                               | 808 +++++++++++++++++++++
 net/tls/tls_device_fallback.c                      | 450 ++++++++++++
 net/tls/tls_main.c                                 | 143 ++--
 net/tls/tls_sw.c                                   | 134 ++--
 43 files changed, 3399 insertions(+), 189 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
 create mode 100644 net/tls/tls_device.c
 create mode 100644 net/tls/tls_device_fallback.c

-- 
1.8.3.1

^ permalink raw reply

* [PATCH] r8169: don't use netif_info et al before net_device has been registered
From: Heiner Kallweit @ 2018-04-22 15:15 UTC (permalink / raw)
  To: David Miller, Realtek linux nic maintainers; +Cc: netdev@vger.kernel.org

There's no benefit in using netif_info et al before the net_device has
been registered. We get messages like
r8169 0000:03:00.0 (unnamed net_device) (uninitialized): [message]
Therefore use dev_info/dev_err instead.

As a side effect we don't need parameter dev for function
rtl8169_get_mac_version() any longer.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index fcd42d03..a5d00ee9 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -2483,7 +2483,7 @@ static const struct ethtool_ops rtl8169_ethtool_ops = {
 };
 
 static void rtl8169_get_mac_version(struct rtl8169_private *tp,
-				    struct net_device *dev, u8 default_version)
+				    u8 default_version)
 {
 	/*
 	 * The driver currently handles the 8168Bf and the 8168Be identically
@@ -2588,8 +2588,8 @@ static void rtl8169_get_mac_version(struct rtl8169_private *tp,
 	tp->mac_version = p->mac_version;
 
 	if (tp->mac_version == RTL_GIGA_MAC_NONE) {
-		netif_notice(tp, probe, dev,
-			     "unknown MAC, using family default\n");
+		dev_notice(tp_to_dev(tp),
+			   "unknown MAC, using family default\n");
 		tp->mac_version = default_version;
 	} else if (tp->mac_version == RTL_GIGA_MAC_VER_42) {
 		tp->mac_version = tp->mii.supports_gmii ?
@@ -8107,40 +8107,39 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	/* enable device (incl. PCI PM wakeup and hotplug setup) */
 	rc = pcim_enable_device(pdev);
 	if (rc < 0) {
-		netif_err(tp, probe, dev, "enable failure\n");
+		dev_err(&pdev->dev, "enable failure\n");
 		return rc;
 	}
 
 	if (pcim_set_mwi(pdev) < 0)
-		netif_info(tp, probe, dev, "Mem-Wr-Inval unavailable\n");
+		dev_info(&pdev->dev, "Mem-Wr-Inval unavailable\n");
 
 	/* use first MMIO region */
 	region = ffs(pci_select_bars(pdev, IORESOURCE_MEM)) - 1;
 	if (region < 0) {
-		netif_err(tp, probe, dev, "no MMIO resource found\n");
+		dev_err(&pdev->dev, "no MMIO resource found\n");
 		return -ENODEV;
 	}
 
 	/* check for weird/broken PCI region reporting */
 	if (pci_resource_len(pdev, region) < R8169_REGS_SIZE) {
-		netif_err(tp, probe, dev,
-			  "Invalid PCI region size(s), aborting\n");
+		dev_err(&pdev->dev, "Invalid PCI region size(s), aborting\n");
 		return -ENODEV;
 	}
 
 	rc = pcim_iomap_regions(pdev, BIT(region), MODULENAME);
 	if (rc < 0) {
-		netif_err(tp, probe, dev, "cannot remap MMIO, aborting\n");
+		dev_err(&pdev->dev, "cannot remap MMIO, aborting\n");
 		return rc;
 	}
 
 	tp->mmio_addr = pcim_iomap_table(pdev)[region];
 
 	if (!pci_is_pcie(pdev))
-		netif_info(tp, probe, dev, "not PCI Express\n");
+		dev_info(&pdev->dev, "not PCI Express\n");
 
 	/* Identify chip attached to board */
-	rtl8169_get_mac_version(tp, dev, cfg->default_ver);
+	rtl8169_get_mac_version(tp, cfg->default_ver);
 
 	tp->cp_cmd = 0;
 
@@ -8157,7 +8156,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	} else {
 		rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
 		if (rc < 0) {
-			netif_err(tp, probe, dev, "DMA configuration failed\n");
+			dev_err(&pdev->dev, "DMA configuration failed\n");
 			return rc;
 		}
 	}
@@ -8185,7 +8184,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	rc = rtl_alloc_irq(tp);
 	if (rc < 0) {
-		netif_err(tp, probe, dev, "Can't allocate interrupt\n");
+		dev_err(&pdev->dev, "Can't allocate interrupt\n");
 		return rc;
 	}
 
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH/RFC net-next 5/5] ravb: remove tx buffer addr 4byte alilgnment restriction for R-Car Gen3
From: Sergei Shtylyov @ 2018-04-22 15:11 UTC (permalink / raw)
  To: Simon Horman
  Cc: Magnus Damm, netdev, linux-renesas-soc, Wolfram Sang,
	Kazuya Mizuguchi
In-Reply-To: <20180417085030.32650-6-horms+renesas@verge.net.au>

Hello!

On 4/17/2018 11:50 AM, Simon Horman wrote:

> From: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> 
> This patch sets from two descriptor to one descriptor because R-Car Gen3
> does not have the 4 bytes alignment restriction of the transmission buffer.
> 
> Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
[...]
> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> index fcd04dbc7dde..3d0985305c26 100644
> --- a/drivers/net/ethernet/renesas/ravb.h
> +++ b/drivers/net/ethernet/renesas/ravb.h
> @@ -964,7 +964,10 @@ enum RAVB_QUEUE {
>  #define RX_QUEUE_OFFSET	4
>  #define NUM_RX_QUEUE	2
>  #define NUM_TX_QUEUE	2
> -#define NUM_TX_DESC	2	/* TX descriptors per packet */
> +
> +/* TX descriptors per packet */
> +#define NUM_TX_DESC_GEN2	2
> +#define NUM_TX_DESC_GEN3	1

    We hanrdly need these...

>  
>  struct ravb_tstamp_skb {
>  	struct list_head list;
[...]
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index 88056dd912ed..f137b62d5b52 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -189,12 +189,13 @@ static int ravb_tx_free(struct net_device *ndev, int q, bool free_txed_only)
>  	int free_num = 0;
>  	int entry;
>  	u32 size;
> +	int num_tx_desc = priv->num_tx_desc;

    Please keep sorting the declarations by length -- that's DaveM's pet peeve 
(and indeed looks prettier. :-)

[...]
> @@ -229,6 +230,7 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>  	struct ravb_private *priv = netdev_priv(ndev);
>  	int ring_size;
>  	int i;
> +	int num_tx_desc = priv->num_tx_desc;

    Here as well.

>  
>  	if (priv->rx_ring[q]) {
>  	
>  		for (i = 0; i < priv->num_rx_ring[q]; i++) {
[...]
> @@ -321,8 +324,10 @@ static void ravb_ring_format(struct net_device *ndev, int q)
>   	for (i = 0, tx_desc = priv->tx_ring[q]; i < priv->num_tx_ring[q];
>   	     i++, tx_desc++) {
>   		tx_desc->die_dt = DT_EEMPTY;
> -		tx_desc++;
> -		tx_desc->die_dt = DT_EEMPTY;
> +		if (num_tx_desc >= 2) {

    > 1, please.
    Strictly speaking, this only works when num_tx_desc == 2, however that's 
my fault...

> +			tx_desc++;
> +			tx_desc->die_dt = DT_EEMPTY;
> +		}
>   	}
>   	tx_desc->dptr = cpu_to_le32((u32)priv->tx_desc_dma[q]);
>   	tx_desc->die_dt = DT_LINKFIX; /* type */
> @@ -345,6 +350,7 @@ static int ravb_ring_init(struct net_device *ndev, int q)
>   	struct sk_buff *skb;
>   	int ring_size;
>   	int i;
> +	int num_tx_desc = priv->num_tx_desc;

    Again, please keep the descarations sorted by length.

>   
>   	priv->rx_buf_sz = (ndev->mtu <= 1492 ? PKT_BUF_SZ : ndev->mtu) +
>   		ETH_HLEN + VLAN_HLEN;
[...]
> @@ -1533,10 +1539,11 @@ static netdev_tx_t ravb_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>   	void *buffer;
>   	u32 entry;
>   	u32 len;
> +	int num_tx_desc = priv->num_tx_desc;

    Here as well...

[...]
> @@ -1547,41 +1554,55 @@ static netdev_tx_t ravb_start_xmit(struct sk_buff *skb, struct net_device *ndev)
[...]
> +	entry = priv->cur_tx[q] % (priv->num_tx_ring[q] * num_tx_desc);
> +	priv->tx_skb[q][entry / num_tx_desc] = skb;
> +
> +	if (num_tx_desc >= 2) {

    > 1, please.

[...]
> @@ -1589,9 +1610,11 @@ static netdev_tx_t ravb_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>   	if (q == RAVB_NC) {
>   		ts_skb = kmalloc(sizeof(*ts_skb), GFP_ATOMIC);
>   		if (!ts_skb) {
> -			desc--;
> -			dma_unmap_single(ndev->dev.parent, dma_addr, len,
> -					 DMA_TO_DEVICE);
> +			if (num_tx_desc >= 2) {

    Likewise.

> +				desc--;
> +				dma_unmap_single(ndev->dev.parent, dma_addr,
> +						 len, DMA_TO_DEVICE);
> +			}
>   			goto unmap;
>   		}
>   		ts_skb->skb = skb;
[...]
> @@ -2106,6 +2132,9 @@ static int ravb_probe(struct platform_device *pdev)
>   	ndev->max_mtu = 2048 - (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN);
>   	ndev->min_mtu = ETH_MIN_MTU;
>   
> +	priv->num_tx_desc = (chip_id == RCAR_GEN2) ?

    Parens not needed.

> +		NUM_TX_DESC_GEN2 : NUM_TX_DESC_GEN3;

    Just 2 : 1;

[...]

    However... I'm not seeing this patch disabling memory allocation for 
priv->tx_align[] and reducing the memory pressure is one of 2 reasons for this 
patch, isn't it?

MBR, Sergei

^ permalink raw reply

* [PATCH net-next] net: init sk_cookie for inet socket
From: Yafang Shao @ 2018-04-22 13:50 UTC (permalink / raw)
  To: eric.dumazet, alexei.starovoitov, davem; +Cc: netdev, linux-kernel, Yafang Shao

With sk_cookie we can identify a socket, that is very helpful for
traceing and statistic, i.e. tcp tracepiont and ebpf.
So we'd better init it by default for inet socket.
When using it, we just need call atomic64_read(&sk->sk_cookie).

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/sock_diag.h | 9 +++++++++
 net/ipv4/tcp_input.c      | 8 +++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h
index 15fe980..5c916e6 100644
--- a/include/linux/sock_diag.h
+++ b/include/linux/sock_diag.h
@@ -25,6 +25,15 @@ struct sock_diag_handler {
 void sock_diag_register_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh));
 void sock_diag_unregister_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh));
 
+static inline
+void sock_init_cookie(struct sock *sk)
+{
+	u64 res;
+
+	res = atomic64_inc_return(&sock_net(sk)->cookie_gen);
+	atomic64_set(&sk->sk_cookie, res);
+}
+
 u64 sock_gen_cookie(struct sock *sk);
 int sock_diag_check_cookie(struct sock *sk, const __u32 *cookie);
 void sock_diag_save_cookie(struct sock *sk, __u32 *cookie);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0396fb9..e6a64a3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
 #include <linux/errqueue.h>
 #include <trace/events/tcp.h>
 #include <linux/static_key.h>
+#include <linux/sock_diag.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -6188,10 +6189,15 @@ struct request_sock *inet_reqsk_alloc(const struct request_sock_ops *ops,
 #if IS_ENABLED(CONFIG_IPV6)
 		ireq->pktopts = NULL;
 #endif
-		atomic64_set(&ireq->ir_cookie, 0);
 		ireq->ireq_state = TCP_NEW_SYN_RECV;
 		write_pnet(&ireq->ireq_net, sock_net(sk_listener));
 		ireq->ireq_family = sk_listener->sk_family;
+
+		BUILD_BUG_ON(offsetof(struct inet_request_sock, ir_cookie) !=
+					offsetof(struct sock, sk_cookie));
+		BUILD_BUG_ON(offsetof(struct inet_request_sock, ireq_net) !=
+					offsetof(struct sock, sk_net));
+		sock_init_cookie((struct sock *)ireq);
 	}
 
 	return req;
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-22 13:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mikulas Patocka, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, bhutchings, netdev, linux-kernel, mst,
	jasowang, virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180421144757.GC14610@bombadil.infradead.org>

On Sat 21-04-18 07:47:57, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 05:21:26PM -0400, Mikulas Patocka wrote:
> > On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> > > On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > > > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > > > No way. This is just wrong! First of all, you will explode most likely
> > > > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > > > enabled quite often.
> > > > 
> > > > You're an evil person who doesn't want to fix bugs.
> > > 
> > > Steady on.  There's no need for that.  Michal isn't evil.  Please
> > > apologise.
> > 
> > I see this attitude from Michal again and again.
> 
> Fine; then *say that*.  I also see Michal saying "No" a lot.  Sometimes
> I agree with him, sometimes I don't.  I think he genuinely wants the best
> code in the kernel, and saying "No" is part of it.

Exactly! We have a lot of legacy herritage and random semantic because
we were too eager to merge stuff. I think it is time to stop that and
think twice before merging someothing. If you call that evil then I am
fine to be evil.

> > He didn't want to fix vmalloc(GFP_NOIO)
> 
> I don't remember that conversation, so I don't know whether I agree with
> his reasoning or not.  But we are supposed to be moving away from GFP_NOIO
> towards marking regions with memalloc_noio_save() / restore.  If you do
> that, you won't need vmalloc(GFP_NOIO).

It was basically to detect GFP_NOIO context _inside_ vmalloc and use the
scope API to enforce it there. Does it solve potential problems? Yes it
does. Does it solve any existing report, no I am not aware of any. Is
it a good fix longterm? Absolutely no, because the scope API should be
used _at the place_ where the scope starts rather than a random utility
function. If we are going the easier way now, we will never teach users
to use the API properly. And I am willing to risk to keep a broken
code which we have for years rather than allow a random hack that will
seemingly fix it.

Btw. I was pretty much explicit with this reasoning when rejecting the
patch. Do you still call that evil?

> > he didn't want to fix alloc_pages sleeping when __GFP_NORETRY is used.
> 
> The GFP flags are a mess, still.

I do not remember that one but __GFP_NORETRY is _allowed_ to sleep. And
yes I do _agree_ gfp flags are a mess which is really hard to get fixed
because they are lacking a good design from the very beginning. Fixing
some of those issues today is a completely PITA.

> > So what should I say? Fix them and 
> > you won't be evil :-)
> 
> No, you should reserve calling somebody evil for truly evil things.
> 
> > (he could also fix the oom killer, so that it is triggered when 
> > free_memory+cache+free_swap goes beyond a threshold and not when you loop 
> > too long in the allocator)
> 
> ... that also doesn't make somebody evil.

And again, it is way much more easier to claim that something will get
fixed when the reality is much more complicated. I've tried to explain
those risks as well.

> > I already said that we can change it from CONFIG_DEBUG_VM to 
> > CONFIG_DEBUG_SG - or to whatever other option you may want, just to make 
> > sure that it is enabled in distro debug kernels by default.
> 
> Yes, and I think that's the right idea.  So send a v2 and ignore the
> replies that are clearly relating to an earlier version of the patch.
> Not everybody reads every mail in the thread before responding to one they
> find interesting.  Yes, ideally, one would, but sometimes one doesn't.

And look here. This is yet another ad-hoc idea. We have many users of
kvmalloc which have no relation to SG, yet you are going to control
their behavior by CONFIG_DEBUG_SG? No way! (yeah evil again)

Really, we have a fault injection framework and this sounds like
something to hook in there.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: linux-next on x60: network manager often complains "network is disabled" after resume
From: Woody Suwalski @ 2018-04-22 12:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, kernel list, Linux-pm mailing list,
	Netdev list
In-Reply-To: <20180415161551.GF10802@amd>

Pavel Machek wrote:
> On Tue 2018-03-20 21:11:54, Woody Suwalski wrote:
>> Woody Suwalski wrote:
>>> Pavel Machek wrote:
>>>> On Mon 2018-03-19 05:17:45, Woody Suwalski wrote:
>>>>> Pavel Machek wrote:
>>>>>> Hi!
>>>>>>
>>>>>> With recent linux-next, after resume networkmanager often claims that
>>>>>> "network is disabled". Sometimes suspend/resume clears that.
>>>>>>
>>>>>> Any ideas? Does it work for you?
>>>>>>                                      Pavel
>>>>> Tried the 4.16-rc6 with nm 1.4.4 - I do not see the issue.
>>>> Thanks for testing... but yes, 4.16 should be ok. If not fixed,
>>>> problem will appear in 4.17-rc1.
>>>>
>>> Works here OK. Tried ~10 suspends, all restarted OK.
>>> kernel next-20180320
>>> nmcli shows that Wifi always connects OK
>>>
>>> Woody
>>>
>> Contrary, it just happened to me on a 64-bit build 4.16-rc5 on T440.
>> I think that Dan's suspicion is correct - it is a snafu in the PM: trying to
>> hibernate results in a message:
>> Failed to hibernate system via logind: There's already a shutdown or sleep
>> operation in progress.
>>
>> And ps shows "Ds /lib/systemd/systemd-sleep suspend"...
> Problem now seems to be in the mainline.
>
> But no, I don't see systemd-sleep in my process list :-(.
>
> I guess you can't reproduce it easily? I tried bisecting, but while it
> happens often enough to make v4.17 hard to use, it does not permit
> reliable bisect.
>
> These should be bad according to my notes
>
> b04240a33b99b32cf6fbdf5c943c04e505a0cb07
>   ed80dc19e4dd395c951f745acd1484d61c4cfb20
>   52113a0d3889d6e2738cf09bf79bc9cac7b5e1c6
>   4fc97ef94bbfa185d16b3e44199b7559d0668747
>   14ebdb2c814f508936fe178a2abc906a16a3ab48
>   639adbeef5ae1bb8eeebbb0cde0b885397bde192
>
> bisection claimed
>
> c16add24522547bf52c189b3c0d1ab6f5c2b4375
>
> is first bad commit, but I'm not sure if I trust that.
> 									Pavel
It has not happen on any of my systems in the last month. Good, but bad 
for getting more info :-(

Woody

^ permalink raw reply

* [net-next:master 211/214] net/ipv6/route.c:3357:41: sparse: incorrect type in argument 2 (different address spaces)
From: kbuild test robot @ 2018-04-22 11:57 UTC (permalink / raw)
  To: David Ahern; +Cc: kbuild-all, netdev

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   e0ada51db907ed2db5d46ad7ff86a8b5df68e59b
commit: a68886a691804d3f6d479ebf6825480fbafb6a00 [211/214] net/ipv6: Make from in rt6_info rcu protected
reproduce:
        # apt-get install sparse
        git checkout a68886a691804d3f6d479ebf6825480fbafb6a00
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   net/ipv6/route.c:2584:15: sparse: expression using sizeof(void)
>> net/ipv6/route.c:3357:41: sparse: incorrect type in argument 2 (different address spaces) @@    expected struct fib6_info *ort @@    got struct fib6_info struct fib6_info *ort @@
   net/ipv6/route.c:3357:41:    expected struct fib6_info *ort
   net/ipv6/route.c:3357:41:    got struct fib6_info [noderef] <asn:4>*from
   net/ipv6/route.c:1391:15: sparse: expression using sizeof(void)
>> net/ipv6/route.c:1544:36: sparse: incorrect type in initializer (different address spaces) @@    expected struct fib6_info *from @@    got struct fib6_info struct fib6_info *from @@
   net/ipv6/route.c:1544:36:    expected struct fib6_info *from
   net/ipv6/route.c:1544:36:    got struct fib6_info [noderef] <asn:4>*from
   net/ipv6/route.c:1589:36: sparse: incorrect type in initializer (different address spaces) @@    expected struct fib6_info *from @@    got struct fib6_info struct fib6_info *from @@
   net/ipv6/route.c:1589:36:    expected struct fib6_info *from
   net/ipv6/route.c:1589:36:    got struct fib6_info [noderef] <asn:4>*from
   net/ipv6/route.c:1962:39: sparse: incorrect type in assignment (different base types) @@    expected unsigned int [unsigned] [usertype] flow_label @@    got gned] [usertype] flow_label @@
   net/ipv6/route.c:1962:39:    expected unsigned int [unsigned] [usertype] flow_label
   net/ipv6/route.c:1962:39:    got restricted __be32
   net/ipv6/route.c:2304:15: sparse: expression using sizeof(void)
   net/ipv6/route.c:3189:12: sparse: context imbalance in 'ip6_route_del' - wrong count at exit

vim +3357 net/ipv6/route.c

^1da177e4 Linus Torvalds               2005-04-16  3250  
6700c2709 David S. Miller              2012-07-17  3251  static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buff *skb)
a6279458c YOSHIFUJI Hideaki            2006-08-23  3252  {
a6279458c YOSHIFUJI Hideaki            2006-08-23  3253  	struct netevent_redirect netevent;
e8599ff4b David S. Miller              2012-07-11  3254  	struct rt6_info *rt, *nrt = NULL;
e8599ff4b David S. Miller              2012-07-11  3255  	struct ndisc_options ndopts;
e8599ff4b David S. Miller              2012-07-11  3256  	struct inet6_dev *in6_dev;
e8599ff4b David S. Miller              2012-07-11  3257  	struct neighbour *neigh;
a68886a69 David Ahern                  2018-04-20  3258  	struct fib6_info *from;
71bcdba06 YOSHIFUJI Hideaki / 吉藤英明 2013-01-05  3259  	struct rd_msg *msg;
6e157b6ac David S. Miller              2012-07-12  3260  	int optlen, on_link;
6e157b6ac David S. Miller              2012-07-12  3261  	u8 *lladdr;
e8599ff4b David S. Miller              2012-07-11  3262  
29a3cad5c Simon Horman                 2013-05-28  3263  	optlen = skb_tail_pointer(skb) - skb_transport_header(skb);
71bcdba06 YOSHIFUJI Hideaki / 吉藤英明 2013-01-05  3264  	optlen -= sizeof(*msg);
e8599ff4b David S. Miller              2012-07-11  3265  
e8599ff4b David S. Miller              2012-07-11  3266  	if (optlen < 0) {
6e157b6ac David S. Miller              2012-07-12  3267  		net_dbg_ratelimited("rt6_do_redirect: packet too short\n");
e8599ff4b David S. Miller              2012-07-11  3268  		return;
e8599ff4b David S. Miller              2012-07-11  3269  	}
e8599ff4b David S. Miller              2012-07-11  3270  
71bcdba06 YOSHIFUJI Hideaki / 吉藤英明 2013-01-05  3271  	msg = (struct rd_msg *)icmp6_hdr(skb);
e8599ff4b David S. Miller              2012-07-11  3272  
71bcdba06 YOSHIFUJI Hideaki / 吉藤英明 2013-01-05  3273  	if (ipv6_addr_is_multicast(&msg->dest)) {
6e157b6ac David S. Miller              2012-07-12  3274  		net_dbg_ratelimited("rt6_do_redirect: destination address is multicast\n");
e8599ff4b David S. Miller              2012-07-11  3275  		return;
e8599ff4b David S. Miller              2012-07-11  3276  	}
e8599ff4b David S. Miller              2012-07-11  3277  
6e157b6ac David S. Miller              2012-07-12  3278  	on_link = 0;
71bcdba06 YOSHIFUJI Hideaki / 吉藤英明 2013-01-05  3279  	if (ipv6_addr_equal(&msg->dest, &msg->target)) {
e8599ff4b David S. Miller              2012-07-11  3280  		on_link = 1;
71bcdba06 YOSHIFUJI Hideaki / 吉藤英明 2013-01-05  3281  	} else if (ipv6_addr_type(&msg->target) !=
e8599ff4b David S. Miller              2012-07-11  3282  		   (IPV6_ADDR_UNICAST|IPV6_ADDR_LINKLOCAL)) {
6e157b6ac David S. Miller              2012-07-12  3283  		net_dbg_ratelimited("rt6_do_redirect: target address is not link-local unicast\n");
e8599ff4b David S. Miller              2012-07-11  3284  		return;
e8599ff4b David S. Miller              2012-07-11  3285  	}
e8599ff4b David S. Miller              2012-07-11  3286  
e8599ff4b David S. Miller              2012-07-11  3287  	in6_dev = __in6_dev_get(skb->dev);
e8599ff4b David S. Miller              2012-07-11  3288  	if (!in6_dev)
e8599ff4b David S. Miller              2012-07-11  3289  		return;
e8599ff4b David S. Miller              2012-07-11  3290  	if (in6_dev->cnf.forwarding || !in6_dev->cnf.accept_redirects)
e8599ff4b David S. Miller              2012-07-11  3291  		return;
e8599ff4b David S. Miller              2012-07-11  3292  
e8599ff4b David S. Miller              2012-07-11  3293  	/* RFC2461 8.1:
e8599ff4b David S. Miller              2012-07-11  3294  	 *	The IP source address of the Redirect MUST be the same as the current
e8599ff4b David S. Miller              2012-07-11  3295  	 *	first-hop router for the specified ICMP Destination Address.
e8599ff4b David S. Miller              2012-07-11  3296  	 */
e8599ff4b David S. Miller              2012-07-11  3297  
f997c55c1 Alexander Aring              2016-06-15  3298  	if (!ndisc_parse_options(skb->dev, msg->opt, optlen, &ndopts)) {
e8599ff4b David S. Miller              2012-07-11  3299  		net_dbg_ratelimited("rt6_redirect: invalid ND options\n");
e8599ff4b David S. Miller              2012-07-11  3300  		return;
e8599ff4b David S. Miller              2012-07-11  3301  	}
6e157b6ac David S. Miller              2012-07-12  3302  
6e157b6ac David S. Miller              2012-07-12  3303  	lladdr = NULL;
e8599ff4b David S. Miller              2012-07-11  3304  	if (ndopts.nd_opts_tgt_lladdr) {
e8599ff4b David S. Miller              2012-07-11  3305  		lladdr = ndisc_opt_addr_data(ndopts.nd_opts_tgt_lladdr,
e8599ff4b David S. Miller              2012-07-11  3306  					     skb->dev);
e8599ff4b David S. Miller              2012-07-11  3307  		if (!lladdr) {
e8599ff4b David S. Miller              2012-07-11  3308  			net_dbg_ratelimited("rt6_redirect: invalid link-layer address length\n");
e8599ff4b David S. Miller              2012-07-11  3309  			return;
e8599ff4b David S. Miller              2012-07-11  3310  		}
e8599ff4b David S. Miller              2012-07-11  3311  	}
e8599ff4b David S. Miller              2012-07-11  3312  
6e157b6ac David S. Miller              2012-07-12  3313  	rt = (struct rt6_info *) dst;
ec13ad1d7 Matthias Schiffer            2015-11-02  3314  	if (rt->rt6i_flags & RTF_REJECT) {
6e157b6ac David S. Miller              2012-07-12  3315  		net_dbg_ratelimited("rt6_redirect: source isn't a valid nexthop for redirect target\n");
e8599ff4b David S. Miller              2012-07-11  3316  		return;
6e157b6ac David S. Miller              2012-07-12  3317  	}
e8599ff4b David S. Miller              2012-07-11  3318  
6e157b6ac David S. Miller              2012-07-12  3319  	/* Redirect received -> path was valid.
6e157b6ac David S. Miller              2012-07-12  3320  	 * Look, redirects are sent only in response to data packets,
6e157b6ac David S. Miller              2012-07-12  3321  	 * so that this nexthop apparently is reachable. --ANK
6e157b6ac David S. Miller              2012-07-12  3322  	 */
0dec879f6 Julian Anastasov             2017-02-06  3323  	dst_confirm_neigh(&rt->dst, &ipv6_hdr(skb)->saddr);
a6279458c YOSHIFUJI Hideaki            2006-08-23  3324  
71bcdba06 YOSHIFUJI Hideaki / 吉藤英明 2013-01-05  3325  	neigh = __neigh_lookup(&nd_tbl, &msg->target, skb->dev, 1);
6e157b6ac David S. Miller              2012-07-12  3326  	if (!neigh)
6e157b6ac David S. Miller              2012-07-12  3327  		return;
a6279458c YOSHIFUJI Hideaki            2006-08-23  3328  
^1da177e4 Linus Torvalds               2005-04-16  3329  	/*
^1da177e4 Linus Torvalds               2005-04-16  3330  	 *	We have finally decided to accept it.
^1da177e4 Linus Torvalds               2005-04-16  3331  	 */
^1da177e4 Linus Torvalds               2005-04-16  3332  
f997c55c1 Alexander Aring              2016-06-15  3333  	ndisc_update(skb->dev, neigh, lladdr, NUD_STALE,
^1da177e4 Linus Torvalds               2005-04-16  3334  		     NEIGH_UPDATE_F_WEAK_OVERRIDE|
^1da177e4 Linus Torvalds               2005-04-16  3335  		     NEIGH_UPDATE_F_OVERRIDE|
^1da177e4 Linus Torvalds               2005-04-16  3336  		     (on_link ? 0 : (NEIGH_UPDATE_F_OVERRIDE_ISROUTER|
f997c55c1 Alexander Aring              2016-06-15  3337  				     NEIGH_UPDATE_F_ISROUTER)),
f997c55c1 Alexander Aring              2016-06-15  3338  		     NDISC_REDIRECT, &ndopts);
^1da177e4 Linus Torvalds               2005-04-16  3339  
4d85cd0c2 David Ahern                  2018-04-20  3340  	rcu_read_lock();
a68886a69 David Ahern                  2018-04-20  3341  	from = rcu_dereference(rt->from);
a68886a69 David Ahern                  2018-04-20  3342  	nrt = ip6_rt_cache_alloc(from, &msg->dest, NULL);
4d85cd0c2 David Ahern                  2018-04-20  3343  	rcu_read_unlock();
383084739 David S. Miller              2011-12-03  3344  	if (!nrt)
^1da177e4 Linus Torvalds               2005-04-16  3345  		goto out;
^1da177e4 Linus Torvalds               2005-04-16  3346  
^1da177e4 Linus Torvalds               2005-04-16  3347  	nrt->rt6i_flags = RTF_GATEWAY|RTF_UP|RTF_DYNAMIC|RTF_CACHE;
^1da177e4 Linus Torvalds               2005-04-16  3348  	if (on_link)
^1da177e4 Linus Torvalds               2005-04-16  3349  		nrt->rt6i_flags &= ~RTF_GATEWAY;
^1da177e4 Linus Torvalds               2005-04-16  3350  
4e3fd7a06 Alexey Dobriyan              2011-11-21  3351  	nrt->rt6i_gateway = *(struct in6_addr *)neigh->primary_key;
^1da177e4 Linus Torvalds               2005-04-16  3352  
2b760fcf5 Wei Wang                     2017-10-06  3353  	/* No need to remove rt from the exception table if rt is
2b760fcf5 Wei Wang                     2017-10-06  3354  	 * a cached route because rt6_insert_exception() will
2b760fcf5 Wei Wang                     2017-10-06  3355  	 * takes care of it
2b760fcf5 Wei Wang                     2017-10-06  3356  	 */
d4ead6b34 David Ahern                  2018-04-17 @3357  	if (rt6_insert_exception(nrt, rt->from)) {
2b760fcf5 Wei Wang                     2017-10-06  3358  		dst_release_immediate(&nrt->dst);
2b760fcf5 Wei Wang                     2017-10-06  3359  		goto out;
2b760fcf5 Wei Wang                     2017-10-06  3360  	}
^1da177e4 Linus Torvalds               2005-04-16  3361  
d8d1f30b9 Changli Gao                  2010-06-10  3362  	netevent.old = &rt->dst;
d8d1f30b9 Changli Gao                  2010-06-10  3363  	netevent.new = &nrt->dst;
71bcdba06 YOSHIFUJI Hideaki / 吉藤英明 2013-01-05  3364  	netevent.daddr = &msg->dest;
605928337 YOSHIFUJI Hideaki / 吉藤英明 2013-01-14  3365  	netevent.neigh = neigh;
8d71740c5 Tom Tucker                   2006-07-30  3366  	call_netevent_notifiers(NETEVENT_REDIRECT, &netevent);
8d71740c5 Tom Tucker                   2006-07-30  3367  
^1da177e4 Linus Torvalds               2005-04-16  3368  out:
e8599ff4b David S. Miller              2012-07-11  3369  	neigh_release(neigh);
6e157b6ac David S. Miller              2012-07-12  3370  }
6e157b6ac David S. Miller              2012-07-12  3371  

:::::: The code at line 3357 was first introduced by commit
:::::: d4ead6b34b67fd711639324b6465a050bcb197d4 net/ipv6: move metrics from dst to rt6_info

:::::: TO: David Ahern <dsahern@gmail.com>
:::::: CC: David S. Miller <davem@davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Paolo Abeni @ 2018-04-22 11:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet, Willem de Bruijn
  Cc: netdev, David S. Miller, Tariq Toukan
In-Reply-To: <20180420154836.3690a39e@redhat.com>

On Fri, 2018-04-20 at 15:48 +0200, Jesper Dangaard Brouer wrote:
> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On 04/19/2018 12:40 AM, Paolo Abeni wrote:
> > > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:  
> > > > On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> 
> [...]
> > > 
> > > Any suggestions for better results are more than welcome!  
> > 
> > Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
> > Seoul (netdev conference). Not tied to UDP, but a generic solution.
> 
> Yes, I remember.  I think... was it the idea, where you basically
> wanted to queue back SKBs to the CPU that allocated them, right?
> 
> Freeing an SKB on the same CPU that allocated it, have multiple
> advantages. (1) the SLUB allocator can use a non-atomic
> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.

By the time the skb is returned to the ingress cpu, isn't that skb most
probably out of the cache?

> We just have to avoid that queue back SKB's mechanism, doesn't cost
> more than the operations we expect to save.  Bulk transfer is an
> obvious approach.  For storing SKBs until they are returned, we already
> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> which SLUB/SLAB-bulk free to amortize cost (1).
> 
> I guess, the missing information is that we don't know what CPU the SKB
> were created on...
> 
> Where to store this CPU info?
> 
> (a) In struct sk_buff, in a cache-line that is already read on remote
> CPU in UDP code?
> 
> (b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
> basis, we could have SLUB store a hint about the CPU it was allocated
> on, and bet on returning to that CPU ? (might be bad to read the
> struct-page cache-line)

Bulking would be doable only for connected sockets, elsewhere would be
difficult to assemble a burst long enough to amortize the handshake
with the remote CPU (spinlock + ipi needed ?!?)

Would be good enough for unconnected sockets sending a whole skb burst
back to one of the (several) ingress CPU? e.g. peeking the CPU
associated with the first skb inside the burst, we would somewhat
balance the load between the ingress CPUs.

Cheers,

Paolo

^ permalink raw reply

* [PATCH net] bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave
From: Xin Long @ 2018-04-22 11:11 UTC (permalink / raw)
  To: network dev; +Cc: davem, Andy Gospodarek, Jiri Pirko, Cong Wang

After Commit 8a8efa22f51b ("bonding: sync netpoll code with bridge"), it
would set slave_dev npinfo in slave_enable_netpoll when enslaving a dev
if bond->dev->npinfo was set.

However now slave_dev npinfo is set with bond->dev->npinfo before calling
slave_enable_netpoll. With slave_dev npinfo set, __netpoll_setup called
in slave_enable_netpoll will not call slave dev's .ndo_netpoll_setup().
It causes that the lower dev of this slave dev can't set its npinfo.

One way to reproduce it:

  # modprobe bonding
  # brctl addbr br0
  # brctl addif br0 eth1
  # ifconfig bond0 192.168.122.1/24 up
  # ifenslave bond0 eth2
  # systemctl restart netconsole
  # ifenslave bond0 br0
  # ifconfig eth2 down
  # systemctl restart netconsole

The netpoll won't really work.

This patch is to remove that slave_dev npinfo setting in bond_enslave().

Fixes: 8a8efa22f51b ("bonding: sync netpoll code with bridge")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 drivers/net/bonding/bond_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b7b1130..718e491 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1660,8 +1660,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 	} /* switch(bond_mode) */

 #ifdef CONFIG_NET_POLL_CONTROLLER
-	slave_dev->npinfo = bond->dev->npinfo;
-	if (slave_dev->npinfo) {
+	if (bond->dev->npinfo) {
 		if (slave_enable_netpoll(new_slave)) {
 			netdev_info(bond_dev, "master_dev is using netpoll, but new slave device does not support netpoll\n");
 			res = -EBUSY;
-- 
2.1.0

^ permalink raw reply related

* [PATCH net] team: check team dev npinfo when adding a port only
From: Xin Long @ 2018-04-22 11:08 UTC (permalink / raw)
  To: network dev; +Cc: davem, Jiri Pirko, stephen hemminger

Now when netconsole sets up netpoll on a team dev, __netpoll_setup
will invoke team dev's .ndo_netpoll_setup first, then set
team->dev->npinfo.

However Commit 0fb52a27a04a ("team: cleanup netpoll clode") added
!team->dev->npinfo check in team_port_enable_netpoll(), which is
also invoked by team dev's .ndo_netpoll_setup. It will cause that
port dev npinfo can't be set due to team->dev->npinfo is not yet
set before invoking team dev's .ndo_netpoll_setup.

Team dev only needs to check team->dev->npinfo for setting a port
dev npinfo when adding the port, like before that cleanup.

Fixes: 0fb52a27a04a ("team: cleanup netpoll clode")
Reported-by: João Avelino Bellomo Filho <jbellomo@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 drivers/net/team/team.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index acbe849..f4a0346 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1077,9 +1077,6 @@ static int team_port_enable_netpoll(struct team *team, struct team_port *port)
 	struct netpoll *np;
 	int err;
 
-	if (!team->dev->npinfo)
-		return 0;
-
 	np = kzalloc(sizeof(*np), GFP_KERNEL);
 	if (!np)
 		return -ENOMEM;
@@ -1221,11 +1218,13 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
 		goto err_vids_add;
 	}
 
-	err = team_port_enable_netpoll(team, port);
-	if (err) {
-		netdev_err(dev, "Failed to enable netpoll on device %s\n",
-			   portname);
-		goto err_enable_netpoll;
+	if (team->dev->npinfo) {
+		err = team_port_enable_netpoll(team, port);
+		if (err) {
+			netdev_err(dev, "Failed to enable netpoll on device %s\n",
+				   portname);
+			goto err_enable_netpoll;
+		}
 	}
 
 	if (!(dev->features & NETIF_F_LRO))
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH 04/15] media: pxa_camera: remove the dmaengine compat need
From: Hans Verkuil @ 2018-04-22 11:06 UTC (permalink / raw)
  To: Robert Jarzmik, Daniel Mack, Haojian Zhuang,
	Bartlomiej Zolnierkiewicz, Tejun Heo, Vinod Koul,
	Mauro Carvalho Chehab, Ulf Hansson, Ezequiel Garcia,
	Boris Brezillon, David Woodhouse, Brian Norris, Marek Vasut,
	Richard Weinberger, Cyrille Pitchen, Nicolas Pitre, Samuel Ortiz,
	Greg Kroah-Hartman, Jaroslav Kysela, Takashi Iwai, Liam Girdwood,
	Mark Brown, A
  Cc: devel, alsa-devel, netdev, linux-mmc, linux-kernel, linux-ide,
	linux-mtd, dmaengine, Niklas Söderlund, linux-arm-kernel,
	linux-media
In-Reply-To: <20180402142656.26815-5-robert.jarzmik@free.fr>

On 04/02/2018 04:26 PM, Robert Jarzmik wrote:
> From: Robert Jarzmik <robert.jarzmik@renault.com>
> 
> As the pxa architecture switched towards the dmaengine slave map, the
> old compatibility mechanism to acquire the dma requestor line number and
> priority are not needed anymore.
> 
> This patch simplifies the dma resource acquisition, using the more
> generic function dma_request_slave_channel().
> 
> Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>

Acked-by: Hans Verkuil <hans.verkuil@cisco.com>

Regards,

	Hans

> ---
>  drivers/media/platform/pxa_camera.c | 22 +++-------------------
>  1 file changed, 3 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/media/platform/pxa_camera.c b/drivers/media/platform/pxa_camera.c
> index c71a00736541..4c82d1880753 100644
> --- a/drivers/media/platform/pxa_camera.c
> +++ b/drivers/media/platform/pxa_camera.c
> @@ -2357,8 +2357,6 @@ static int pxa_camera_probe(struct platform_device *pdev)
>  		.src_maxburst = 8,
>  		.direction = DMA_DEV_TO_MEM,
>  	};
> -	dma_cap_mask_t mask;
> -	struct pxad_param params;
>  	char clk_name[V4L2_CLK_NAME_SIZE];
>  	int irq;
>  	int err = 0, i;
> @@ -2432,34 +2430,20 @@ static int pxa_camera_probe(struct platform_device *pdev)
>  	pcdev->base = base;
>  
>  	/* request dma */
> -	dma_cap_zero(mask);
> -	dma_cap_set(DMA_SLAVE, mask);
> -	dma_cap_set(DMA_PRIVATE, mask);
> -
> -	params.prio = 0;
> -	params.drcmr = 68;
> -	pcdev->dma_chans[0] =
> -		dma_request_slave_channel_compat(mask, pxad_filter_fn,
> -						 &params, &pdev->dev, "CI_Y");
> +	pcdev->dma_chans[0] = dma_request_slave_channel(&pdev->dev, "CI_Y");
>  	if (!pcdev->dma_chans[0]) {
>  		dev_err(&pdev->dev, "Can't request DMA for Y\n");
>  		return -ENODEV;
>  	}
>  
> -	params.drcmr = 69;
> -	pcdev->dma_chans[1] =
> -		dma_request_slave_channel_compat(mask, pxad_filter_fn,
> -						 &params, &pdev->dev, "CI_U");
> +	pcdev->dma_chans[1] = dma_request_slave_channel(&pdev->dev, "CI_U");
>  	if (!pcdev->dma_chans[1]) {
>  		dev_err(&pdev->dev, "Can't request DMA for Y\n");
>  		err = -ENODEV;
>  		goto exit_free_dma_y;
>  	}
>  
> -	params.drcmr = 70;
> -	pcdev->dma_chans[2] =
> -		dma_request_slave_channel_compat(mask, pxad_filter_fn,
> -						 &params, &pdev->dev, "CI_V");
> +	pcdev->dma_chans[2] = dma_request_slave_channel(&pdev->dev, "CI_V");
>  	if (!pcdev->dma_chans[2]) {
>  		dev_err(&pdev->dev, "Can't request DMA for V\n");
>  		err = -ENODEV;
> 

^ permalink raw reply

* Re: KMSAN: uninit-value in __udp4_lib_rcv
From: Dmitry Vyukov @ 2018-04-22 10:57 UTC (permalink / raw)
  To: syzbot
  Cc: David Miller, Alexey Kuznetsov, LKML, netdev, syzkaller-bugs,
	Hideaki YOSHIFUJI
In-Reply-To: <000000000000116485056a369609@google.com>

On Thu, Apr 19, 2018 at 7:06 PM, syzbot
<syzbot+493bccc5b8cfe9d5035e@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot hit the following crash on https://github.com/google/kmsan.git/master
> commit
> 35ff515e4bda2646f6c881d33951c306ea9c282a (Tue Apr 10 08:59:43 2018 +0000)
> Merge pull request #11 from parkerduckworth/readme
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=493bccc5b8cfe9d5035e
>
> So far this crash happened 11 times on
> https://github.com/google/kmsan.git/master.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4935004320694272
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=5133260011077632
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=5329144879513600
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=6627248707860932248
> compiler: clang version 7.0.0 (trunk 329391)
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+493bccc5b8cfe9d5035e@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.


This does not reproduce after:
https://github.com/google/kmsan/commit/a7f95e9c8a95e9fbb388c3999b61a17667cd3bbe
so let's consider this as false positive related to checksums:

#syz invalid



> ==================================================================
> BUG: KMSAN: uninit-value in __udp4_lib_rcv+0x628/0x4740 net/ipv4/udp.c:2066
> CPU: 1 PID: 3573 Comm: syzkaller192717 Not tainted 4.16.0+ #83
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
>  __udp4_lib_rcv+0x628/0x4740 net/ipv4/udp.c:2066
>  udp_rcv+0x5c/0x70 net/ipv4/udp.c:2287
>  ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
>  dst_input include/net/dst.h:449 [inline]
>  ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
>  __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
>  __netif_receive_skb net/core/dev.c:4627 [inline]
>  process_backlog+0x62d/0xe20 net/core/dev.c:5307
>  napi_poll net/core/dev.c:5705 [inline]
>  net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
>  __do_softirq+0x56d/0x93d kernel/softirq.c:285
>  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1040
>  </IRQ>
>  do_softirq kernel/softirq.c:329 [inline]
>  __local_bh_enable_ip+0x114/0x140 kernel/softirq.c:182
>  local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
>  rcu_read_unlock_bh include/linux/rcupdate.h:726 [inline]
>  ip_finish_output2+0x124e/0x1380 net/ipv4/ip_output.c:231
>  ip_finish_output+0xcb0/0xff0 net/ipv4/ip_output.c:317
>  NF_HOOK_COND include/linux/netfilter.h:277 [inline]
>  ip_output+0x502/0x5c0 net/ipv4/ip_output.c:405
>  dst_output include/net/dst.h:443 [inline]
>  ip_local_out net/ipv4/ip_output.c:124 [inline]
>  ip_send_skb+0x5f3/0x820 net/ipv4/ip_output.c:1414
>  ip_push_pending_frames+0x105/0x170 net/ipv4/ip_output.c:1434
>  raw_sendmsg+0x2960/0x3ed0 net/ipv4/raw.c:684
>  inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:630 [inline]
>  sock_sendmsg net/socket.c:640 [inline]
>  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
>  __sys_sendmsg net/socket.c:2080 [inline]
>  SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
>  SyS_sendmsg+0x54/0x80 net/socket.c:2087
>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x43fe99
> RSP: 002b:00007ffca5bf5be8 EFLAGS: 00000217 ORIG_RAX: 000000000000002e
> RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fe99
> RDX: 0000000000000000 RSI: 00000000200002c0 RDI: 0000000000000003
> RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
> R10: 00000000004002c8 R11: 0000000000000217 R12: 00000000004017c0
> R13: 0000000000401850 R14: 0000000000000000 R15: 0000000000000000
>
> Uninit was stored to memory at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>  kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
>  kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
>  kmsan_memcpy_origins+0x11d/0x170 mm/kmsan/kmsan.c:526
>  __msan_memcpy+0x19f/0x1f0 mm/kmsan/kmsan_instr.c:470
>  skb_copy_bits+0x63a/0xdb0 net/core/skbuff.c:2046
>  __pskb_pull_tail+0x483/0x22e0 net/core/skbuff.c:1883
>  pskb_may_pull include/linux/skbuff.h:2112 [inline]
>  __udp4_lib_rcv+0x55f/0x4740 net/ipv4/udp.c:2058
>  udp_rcv+0x5c/0x70 net/ipv4/udp.c:2287
>  ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
>  dst_input include/net/dst.h:449 [inline]
>  ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
>  __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
>  __netif_receive_skb net/core/dev.c:4627 [inline]
>  process_backlog+0x62d/0xe20 net/core/dev.c:5307
>  napi_poll net/core/dev.c:5705 [inline]
>  net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
>  __do_softirq+0x56d/0x93d kernel/softirq.c:285
> Uninit was created at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>  kmsan_alloc_meta_for_pages+0x161/0x3a0 mm/kmsan/kmsan.c:814
>  kmsan_alloc_page+0x82/0xe0 mm/kmsan/kmsan.c:868
>  __alloc_pages_nodemask+0xf5b/0x5dc0 mm/page_alloc.c:4283
>  alloc_pages_current+0x6b5/0x970 mm/mempolicy.c:2055
>  alloc_pages include/linux/gfp.h:494 [inline]
>  skb_page_frag_refill+0x3ba/0x5e0 net/core/sock.c:2208
>  sk_page_frag_refill+0xa4/0x340 net/core/sock.c:2228
>  __ip_append_data+0x107e/0x3d10 net/ipv4/ip_output.c:1057
>  ip_append_data+0x2fb/0x440 net/ipv4/ip_output.c:1170
>  raw_sendmsg+0x287b/0x3ed0 net/ipv4/raw.c:678
>  inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:630 [inline]
>  sock_sendmsg net/socket.c:640 [inline]
>  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
>  __sys_sendmsg net/socket.c:2080 [inline]
>  SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
>  SyS_sendmsg+0x54/0x80 net/socket.c:2087
>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> ==================================================================
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/000000000000116485056a369609%40google.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH 15/39] acpi/battery: simplify procfs code
From: Rafael J. Wysocki @ 2018-04-22  9:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rtc, Alessandro Zummo, Alexandre Belloni, devel,
	Linux Kernel Mailing List, open list:TARGET SUBSYSTEM,
	Corey Minyard, linux-ide, Greg Kroah-Hartman, jfs-discussion,
	linux-afs, ACPI Devel Maling List, netdev, netfilter-devel,
	Alexander Viro, Jiri Slaby, Andrew Morton, Ext4 Developers List,
	Alexey Dobriyan, megaraidlinux.pdl, drbd-dev
In-Reply-To: <20180419124140.9309-16-hch@lst.de>

On Thu, Apr 19, 2018 at 2:41 PM, Christoph Hellwig <hch@lst.de> wrote:
> Use remove_proc_subtree to remove the whole subtree on cleanup, and
> unwind the registration loop into individual calls.  Switch to use
> proc_create_seq where applicable.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

It is OK AFAICS.

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

^ permalink raw reply

* Re: [PATCH net] net: ethtool: Add missing kernel doc for FEC parameters
From: Sergei Shtylyov @ 2018-04-22  8:51 UTC (permalink / raw)
  To: Florian Fainelli, netdev; +Cc: davem, vidya.chowdary, dustin, roopa
In-Reply-To: <20180421231848.17450-1-f.fainelli@gmail.com>

Hello!

On 4/22/2018 2:18 AM, Florian Fainelli wrote:

> While adding support for ethtool::get_fecparam and set_param, kernel doc for

    set_fecparam.

> these functions was missed, add those.
> 
> Fixes: 1a5f3da20bd9 ("net: ethtool: add support for forward error correction modes")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
[...]

MBR, Sergei

^ permalink raw reply

* Re: [PATCH 3/8] [media] v4l: rcar_fdp1: Change platform dependency to ARCH_RENESAS
From: Laurent Pinchart @ 2018-04-22  8:46 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: alsa-devel, Kuninori Morimoto, Catalin Marinas, Will Deacon,
	Liam Girdwood, devel, Mauro Carvalho Chehab, Geert Uytterhoeven,
	Vinod Koul, Magnus Damm, Russell King, linux-media, Arnd Bergmann,
	Mark Brown, Dan Williams, linux-arm-kernel, Sergei Shtylyov,
	Greg Kroah-Hartman, Takashi Iwai, linux-kernel, linux-renesas-soc,
	Simon Horman, netdev, dmaengine
In-Reply-To: <3039853.rivznOVBTv@avalon>

Hi Geert,

On Saturday, 21 April 2018 11:07:11 EEST Laurent Pinchart wrote:
> On Friday, 20 April 2018 16:28:29 EEST Geert Uytterhoeven wrote:
> > The Renesas Fine Display Processor driver is used on Renesas R-Car SoCs
> > only.  Since commit 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce
> > ARCH_RENESAS") is ARCH_RENESAS a more appropriate platform dependency
> > than the legacy ARCH_SHMOBILE, hence use the former.
> > 
> > This will allow to drop ARCH_SHMOBILE on ARM and ARM64 in the near
> > future.
> > 
> > Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> 
> Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
> 
> How would you like to get this merged ?

Unless you would like to merge the whole series in one go, I'll take this in 
my tree as I have a conflicting patch I would like to submit for v4.18.

> > ---
> > 
> >  drivers/media/platform/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/media/platform/Kconfig
> > b/drivers/media/platform/Kconfig index f9235e8f8e962d2e..7ad4725f9d1f9627
> > 100644
> > --- a/drivers/media/platform/Kconfig
> > +++ b/drivers/media/platform/Kconfig
> > @@ -396,7 +396,7 @@ config VIDEO_SH_VEU
> >  config VIDEO_RENESAS_FDP1
> >  	tristate "Renesas Fine Display Processor"
> >  	depends on VIDEO_DEV && VIDEO_V4L2 && HAS_DMA
> > -	depends on ARCH_SHMOBILE || COMPILE_TEST
> > +	depends on ARCH_RENESAS || COMPILE_TEST
> >  	depends on (!ARCH_RENESAS && !VIDEO_RENESAS_FCP) || VIDEO_RENESAS_FCP
> >  	select VIDEOBUF2_DMA_CONTIG
> >  	select V4L2_MEM2MEM_DEV

-- 
Regards,

Laurent Pinchart

^ permalink raw reply

* How to detect libbfd when building tools/bpf?
From: Wang Sheng-Hui @ 2018-04-22  7:51 UTC (permalink / raw)
  To: ast, daniel, kstewart, gregkh, tglx, pombredanne; +Cc: netdev

Sorry to trouble you !

I run debian and installed binutils-dev beforehand.
Then I copied tools/build/feature/test-libbfd.c to t.c and run:
------------------------------------------------------------------------
root@lazyfintech:~# cat t.c 
#include <bfd.h>

extern int printf(const char *format, ...);

int main(void)
{
	char symbol[4096] = "FieldName__9ClassNameFd";
	char *tmp;

	tmp = bfd_demangle(0, symbol, 0);

	printf("demangled symbol: {%s}\n", tmp);

	return 0;
}
root@lazyfintech:~# gcc t.c -lbfd
root@lazyfintech:~# ./a.out 
demangled symbol: {ClassName::FieldName}


I thought libbfd can be reached from above. 
But when I built tools/bpf, libbfd cannot be detected: 
-------------------------------------------------------------------------
/linux# make O=../buildkernel/ tools/bpf
make[1]: Entering directory '/root/buildkernel'
  DESCEND  bpf

Auto-detecting system features:
...                        libbfd: [ OFF ]
...        disassembler-four-args: [ on  ]

  DESCEND  bpftool

Auto-detecting system features:
...                        libbfd: [ OFF ]
...        disassembler-four-args: [ on  ]

make[1]: Leaving directory '/root/buildkernel'


I wonder how should I do to get libbfd auto detected when  building bpf tool?


Regards,
shenghui

^ permalink raw reply

* Bluetooth/lock_sock: false positive "WARNING: possible recursive locking detected"
From: Jiri Slaby @ 2018-04-22  7:57 UTC (permalink / raw)
  To: Marcel Holtmann, Johan Hedberg; +Cc: BlueZ development, ML netdev

Hi,

I have just got this lockdep warning during suspend:
> [ 2891.586061] ============================================
> [ 2891.586063] WARNING: possible recursive locking detected
> [ 2891.586065] 4.16.2-10.ge881e16-default #1 Not tainted
> [ 2891.586067] --------------------------------------------
> [ 2891.586068] kworker/u9:3/873 is trying to acquire lock:
> [ 2891.586070]  (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.}, at: [<000000007b85e829>] bt_accept_enqueue+0x29/0x90 [bluetooth]
> [ 2891.586086]
>                but task is already holding lock:
> [ 2891.586088]  (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.}, at: [<0000000042f0b4a5>] l2cap_sock_new_connection_cb+0x18/0xa0 [bluetooth]
> [ 2891.586109]
>                other info that might help us debug this:
> [ 2891.586111]  Possible unsafe locking scenario:
> 
> [ 2891.586115]        CPU0
> [ 2891.586116]        ----
> [ 2891.586117]   lock(sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP);
> [ 2891.586120]   lock(sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP);
> [ 2891.586122]
>                 *** DEADLOCK ***
> 
> [ 2891.586125]  May be due to missing lock nesting notation
> 
> [ 2891.586127] 5 locks held by kworker/u9:3/873:
> [ 2891.586128]  #0:  ((wq_completion)"%s"hdev->name#2){+.+.}, at: [<000000004aa1a273>] process_one_work+0x1e3/0x6a0
> [ 2891.586135]  #1:  ((work_completion)(&hdev->rx_work)){+.+.}, at: [<000000004aa1a273>] process_one_work+0x1e3/0x6a0
> [ 2891.586140]  #2:  (&conn->chan_lock){+.+.}, at: [<00000000fbad6c82>] l2cap_connect+0x88/0x540 [bluetooth]
> [ 2891.586155]  #3:  (&chan->lock/2){+.+.}, at: [<000000007c38e27e>] l2cap_connect+0xa0/0x540 [bluetooth]
> [ 2891.586170]  #4:  (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.}, at: [<0000000042f0b4a5>] l2cap_sock_new_connection_cb+0x18/0xa0 [bluetooth]
> [ 2891.586183]
>                stack backtrace:
> [ 2891.586187] CPU: 2 PID: 873 Comm: kworker/u9:3 Not tainted 4.16.2-10.ge881e16-default #1 openSUSE Tumbleweed (unreleased)
> [ 2891.586189] Hardware name: Dell Inc. Latitude 7280/0KK5D1, BIOS 1.9.3 03/09/2018
> [ 2891.586200] Workqueue: hci0 hci_rx_work [bluetooth]
> [ 2891.586202] Call Trace:
> [ 2891.586207]  dump_stack+0x85/0xc5
> [ 2891.586211]  __lock_acquire+0x6b4/0x1370
> [ 2891.586221]  lock_acquire+0x9f/0x210
> [ 2891.586237]  lock_sock_nested+0x5a/0x80
> [ 2891.586256]  bt_accept_enqueue+0x29/0x90 [bluetooth]
> [ 2891.586268]  l2cap_sock_new_connection_cb+0x5d/0xa0 [bluetooth]
> [ 2891.586280]  l2cap_connect+0x126/0x540 [bluetooth]
> [ 2891.586315]  l2cap_sig_channel+0x443/0x13b0 [bluetooth]
> [ 2891.586330]  l2cap_recv_frame+0x1a4/0x300 [bluetooth]
> [ 2891.586341]  hci_rx_work+0x1c8/0x5c0 [bluetooth]
> [ 2891.586345]  process_one_work+0x269/0x6a0
> [ 2891.586350]  worker_thread+0x2b/0x3d0
> [ 2891.586356]  kthread+0x113/0x130
> [ 2891.586363]  ret_from_fork+0x24/0x50
> [ 4954.622809] e1000e: eth0 NIC Link is Down
> [ 4955.299532] PM: suspend entry (deep)
> [ 4955.299538] PM: Syncing filesystems ... done.

This is:
  lock_sock(sk);       in bt_accept_enqueue
nested in
  lock_sock(parent);   in l2cap_sock_new_connection_cb

So this looks like a false positive to me. So I believe this is a fix:

--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1232,7 +1232,7 @@ static struct l2cap_chan
*l2cap_sock_new_connection_cb(struct l2cap_chan *chan)
 {
        struct sock *sk, *parent = chan->data;

-       lock_sock(parent);
+       lock_sock_nested(parent, L2CAP_NESTING_PARENT);

        /* Check for backlog size */
        if (sk_acceptq_is_full(parent)) {


?

thanks,
-- 
js
suse labs

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: btf: Clean up btf.h in uapi
From: Alexei Starovoitov @ 2018-04-22  4:57 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180421164823.1059397-1-kafai@fb.com>

On Sat, Apr 21, 2018 at 09:48:23AM -0700, Martin KaFai Lau wrote:
> This patch cleans up btf.h in uapi:
> 1) Rename "name" to "name_off" to better reflect it is an offset to the
>    string section instead of a char array.
> 2) Remove unused value BTF_FLAGS_COMPR and BTF_MAGIC_SWAP
> 
> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* [PATCH] netfilter: fix nf_tables filter chain type build
From: Randy Dunlap @ 2018-04-22  4:10 UTC (permalink / raw)
  To: netdev@vger.kernel.org, LKML
  Cc: coreteam, netfilter-devel, Florian Westphal, Jozsef Kadlecsik,
	Pablo Neira Ayuso, kbuild test robot

From: Randy Dunlap <rdunlap@infradead.org>

Fix build errors due to a missing Kconfig dependency term.
Fixes these build errors:

net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_nat_do_chain':
net/ipv6/netfilter/nft_chain_nat_ipv6.c:37: undefined reference to `nft_do_chain'
net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_exit':
net/ipv6/netfilter/nft_chain_nat_ipv6.c:94: undefined reference to `nft_unregister_chain_type'
net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_init':
net/ipv6/netfilter/nft_chain_nat_ipv6.c:87: undefined reference to `nft_register_chain_type'

Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type")

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
---
 net/ipv6/netfilter/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next-20180419.orig/net/ipv6/netfilter/Kconfig
+++ linux-next-20180419/net/ipv6/netfilter/Kconfig
@@ -108,7 +108,7 @@ config NF_NAT_IPV6
 if NF_NAT_IPV6
 
 config NFT_CHAIN_NAT_IPV6
-	depends on NF_TABLES_IPV6
+	depends on NF_TABLES_IPV6 && NF_TABLES
 	tristate "IPv6 nf_tables nat chain support"
 	help
 	  This option enables the "nat" chain for IPv6 in nf_tables. This

^ permalink raw reply

* Re: Charity Venture
From: Mik. M. Fridman @ 2018-04-22  3:59 UTC (permalink / raw)




-- 
I Mikhail Fridman. has selected you specially as one of my beneficiaries
for my Charitable Donation, Just as I have declared on May 23, 2016 to 
give
my fortune as charity.

Check the link below for confirmation:

http://www.ibtimes.co.uk/russias-second-wealthiest-man-mikhail-fridman-plans-leaving-14-2bn-fortune-charity-1561604

Reply as soon as possible with further directives.

Best Regards,
Mikhail Fridman.

^ permalink raw reply

* Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV
From: Alexander Duyck @ 2018-04-22  0:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Bjorn Helgaas, linux-pci, virtio-dev, kvm,
	Netdev, Daly, Dan, LKML, linux-nvme, Keith Busch, netanel,
	Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <20180421203437.GW28657@bhelgaas-glaptop.roam.corp.google.com>

On Sat, Apr 21, 2018 at 1:34 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Fri, Apr 20, 2018 at 12:28:08PM -0400, Alexander Duyck wrote:
>> This series is meant to add support for SR-IOV on devices when the VFs are
>> not managed by the kernel. Examples of recent patches attempting to do this
>> include:
>> virto - https://patchwork.kernel.org/patch/10241225/
>> pci-stub - https://patchwork.kernel.org/patch/10109935/
>> vfio - https://patchwork.kernel.org/patch/10103353/
>> uio - https://patchwork.kernel.org/patch/9974031/
>>
>> Since this is quickly blowing up into a multi-driver problem it is probably
>> best to implement this solution as generically as possible.
>>
>> This series is an attempt to do that. What we do with this patch set is
>> provide a generic framework to enable SR-IOV in the case that the PF driver
>> doesn't support managing the VFs itself.
>>
>> I based my patch set originally on the patch by Mark Rustad but there isn't
>> much left after going through and cleaning out the bits that were no longer
>> needed, and after incorporating the feedback from David Miller. At this point
>> the only items to be fully reused was his patch description which is now
>> present in patch 3 of the set.
>>
>> This solution is limited in scope to just adding support for devices that
>> provide no functionality for SR-IOV other than allocating the VFs by
>> calling pci_enable_sriov. Previous sets had included patches for VFIO, but
>> for now I am dropping that as the scope of that work is larger then I
>> think I can take on at this time.
>>
>> v2: Reduced scope back to just virtio_pci and vfio-pci
>>     Broke into 3 patch set from single patch
>>     Changed autoprobe behavior to always set when num_vfs is set non-zero
>> v3: Updated Documentation to clarify when sriov_unmanaged_autoprobe is used
>>     Wrapped vfio_pci_sriov_configure to fix build errors w/o SR-IOV in kernel
>> v4: Dropped vfio-pci patch
>>     Added ena and nvme to drivers now using pci_sriov_configure_unmanaged
>>     Dropped pci_disable_sriov call in virtio_pci to be consistent with ena
>> v5: Dropped sriov_unmanaged_autoprobe and pci_sriov_conifgure_unmanaged
>>     Added new patch that enables pci_sriov_configure_simple
>>     Updated drivers to use pci_sriov_configure_simple
>> v6: Defined pci_sriov_configure_simple as NULL when SR-IOV is not enabled
>>     Updated drivers to drop "#ifdef" checks for IOV
>>     Added pci-pf-stub as place for PF-only drivers to add support
>> v7: Dropped pci_id table explanation from pci-pf-stub driver
>>     Updated pci_sriov_configure_simple to drop need for err value
>>     Fixed comment explaining why pci_sriov_configure_simple is NULL
>> v8: Dropped virtio from the set, support to be added later after TC approval
>>
>> Cc: Mark Rustad <mark.d.rustad@intel.com>
>> Cc: Maximilian Heyne <mheyne@amazon.de>
>> Cc: Liang-Min Wang <liang-min.wang@intel.com>
>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>>
>> ---
>>
>> Alexander Duyck (4):
>>       pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
>>       ena: Migrate over to unmanaged SR-IOV support
>>       nvme: Migrate over to unmanaged SR-IOV support
>>       pci-pf-stub: Add PF driver stub for PFs that function only to enable VFs
>>
>>
>>  drivers/net/ethernet/amazon/ena/ena_netdev.c |   28 -------------
>>  drivers/nvme/host/pci.c                      |   20 ----------
>>  drivers/pci/Kconfig                          |   12 ++++++
>>  drivers/pci/Makefile                         |    2 +
>>  drivers/pci/iov.c                            |   31 +++++++++++++++
>>  drivers/pci/pci-pf-stub.c                    |   54 ++++++++++++++++++++++++++
>>  include/linux/pci.h                          |    3 +
>>  include/linux/pci_ids.h                      |    2 +
>>  8 files changed, 106 insertions(+), 46 deletions(-)
>>  create mode 100644 drivers/pci/pci-pf-stub.c
>
> I tentatively applied these to pci/virtualization-review.
>
> The code changes look fine, but I want to flesh out the changelogs a
> little bit before merging them.

Thanks.

> For example, I'm not sure what you mean by "devices where the PF is
> not capable of managing VF resources."
>
> It *sounds* like you're saying the hardware works differently on some
> devices, but I don't think that's what you mean.  I think you're
> saying something about which drivers are used for the PF and the VF.

That is sort of what I am saying.

So for example with ixgbe there is functionality which is controlled
in the MMIO space of the PF that affects the functionality of the VFs
that are generated on the device. The PF has to rearrange the
resources such as queues and interrupts on the device before it can
enable SR-IOV, and it could alter those later to limit what the VF is
capable of doing.

The model I am dealing with via this patch set has a PF that is not
much different than the VFs other than the fact that it has some
extended configuration space bits in place for SR-IOV, ARI, ACS, and
whatever other bits are needed in order to support spawning isolated
VFs.

> I think a trivial example of how this will be used might help.  I
> assume this involves a virtualization scenario where the host uses the
> PF to enable several VFs, but the host doesn't use the PF for much
> else.  Then you assign the VFs to guests, and drivers in the guest
> OSes use the VFs.

So this description would work for the pci-pf-stub driver. Basically
the idea here is that you have a device that is nothing more than a
function there to spawn VFs.

The other scenario that I am supporting are the ena and nvme models.
They match what I have described above where the PF is really just
another VF with some extra configuration space bits that add support
for spawning what amount to peer VFs to the already present PF.

> Since .sriov_configure() is only used by sriov_numvfs_store(), I
> assume the usage model involves writing to the sysfs sriov_numvfs
> attribute to enable the VFs, then assigning them to guests?
>
> Bjorn

Yes. The assumption for this function is that we are only supporting
the sysfs approach for spawning VFs.

- Alex

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox