Netdev List
* [net-next  1/3] tipc: modify socket iterator for sock_diag
From: GhantaKrishnamurthy MohanKrishna @ 2018-03-21 13:37 UTC (permalink / raw)
  To: tipc-discussion, jon.maloy, maloy, ying.xue,
	mohan.krishna.ghanta.krishnamurthy, netdev, davem
  Cc: Parthasarathy Bhuvaragan
In-Reply-To: <1521639465-3169-1-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>

The current socket iterator function, tipc_nl_sk_dump, handles socket
locks and calls __tipc_nl_add_sk for each socket.
To reuse this logic in the sock_diag implementation, we make minor
modifications that turn these functions into generic ones, as described below.

In this commit, we add two new functions, __tipc_nl_sk_walk and
__tipc_nl_add_sk_info, and modify tipc_nl_sk_dump and __tipc_nl_add_sk
accordingly.

In __tipc_nl_sk_walk we:
1. acquire and release socket locks
2. for each socket, execute the specified callback function

In __tipc_nl_add_sk we:
- Move the netlink attribute insertion to __tipc_nl_add_sk_info.

tipc_nl_sk_dump calls __tipc_nl_sk_walk with __tipc_nl_add_sk as argument.
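
The walker-plus-callback split described above can be sketched in plain
userspace C. This is only an illustration under stated assumptions: every
demo_* name is hypothetical, the "lock" is a flag rather than a spinlock,
and the array stands in for the rhashtable the kernel code iterates over.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct tipc_sock; only the port id matters here. */
struct demo_sock {
	unsigned int portid;
	int locked;		/* models the per-socket lock */
};

/* Per-socket callback; a non-zero return stops the walk. */
typedef int (*demo_handler_t)(void *ctx, struct demo_sock *sk);

/*
 * Walk all sockets, taking and releasing each "lock" around the callback.
 * On error, *resume is set to the port id to restart from, which mirrors
 * the role of prev_portid in the patch.
 */
static int demo_sk_walk(struct demo_sock *socks, size_t n, void *ctx,
			demo_handler_t handler, unsigned int *resume)
{
	size_t i;
	int err;

	for (i = 0; i < n; i++) {
		socks[i].locked = 1;
		err = handler(ctx, &socks[i]);
		socks[i].locked = 0;
		if (err) {
			*resume = socks[i].portid;
			return err;
		}
	}
	*resume = 0;
	return 0;
}

/* Example handler: count sockets and fail once a budget is used up. */
struct demo_budget {
	int left;
	int seen;
};

static int demo_count(void *ctx, struct demo_sock *sk)
{
	struct demo_budget *b = ctx;

	(void)sk;
	if (b->left == 0)
		return -1;	/* plays the role of -EMSGSIZE */
	b->left--;
	b->seen++;
	return 0;
}
```

Because the walker owns locking and resume bookkeeping, a second consumer
only has to supply a different handler.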

sock_diag will use these generic functions in a later commit.

There is no functional change in this commit.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
---
 net/tipc/socket.c | 65 +++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 24 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index f93477187a90..8e83b0501d4e 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -3154,16 +3154,33 @@ static int __tipc_nl_add_sk_con(struct sk_buff *skb, struct tipc_sock *tsk)
 	return -EMSGSIZE;
 }
 
+static int __tipc_nl_add_sk_info(struct sk_buff *skb, struct tipc_sock
+			  *tsk)
+{
+	struct net *net = sock_net(skb->sk);
+	struct tipc_net *tn = tipc_net(net);
+	struct sock *sk = &tsk->sk;
+
+	if (nla_put_u32(skb, TIPC_NLA_SOCK_REF, tsk->portid) ||
+	    nla_put_u32(skb, TIPC_NLA_SOCK_ADDR, tn->own_addr))
+		return -EMSGSIZE;
+
+	if (tipc_sk_connected(sk)) {
+		if (__tipc_nl_add_sk_con(skb, tsk))
+			return -EMSGSIZE;
+	} else if (!list_empty(&tsk->publications)) {
+		if (nla_put_flag(skb, TIPC_NLA_SOCK_HAS_PUBL))
+			return -EMSGSIZE;
+	}
+	return 0;
+}
+
 /* Caller should hold socket lock for the passed tipc socket. */
 static int __tipc_nl_add_sk(struct sk_buff *skb, struct netlink_callback *cb,
 			    struct tipc_sock *tsk)
 {
-	int err;
-	void *hdr;
 	struct nlattr *attrs;
-	struct net *net = sock_net(skb->sk);
-	struct tipc_net *tn = net_generic(net, tipc_net_id);
-	struct sock *sk = &tsk->sk;
+	void *hdr;
 
 	hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
 			  &tipc_genl_family, NLM_F_MULTI, TIPC_NL_SOCK_GET);
@@ -3173,19 +3190,10 @@ static int __tipc_nl_add_sk(struct sk_buff *skb, struct netlink_callback *cb,
 	attrs = nla_nest_start(skb, TIPC_NLA_SOCK);
 	if (!attrs)
 		goto genlmsg_cancel;
-	if (nla_put_u32(skb, TIPC_NLA_SOCK_REF, tsk->portid))
-		goto attr_msg_cancel;
-	if (nla_put_u32(skb, TIPC_NLA_SOCK_ADDR, tn->own_addr))
+
+	if (__tipc_nl_add_sk_info(skb, tsk))
 		goto attr_msg_cancel;
 
-	if (tipc_sk_connected(sk)) {
-		err = __tipc_nl_add_sk_con(skb, tsk);
-		if (err)
-			goto attr_msg_cancel;
-	} else if (!list_empty(&tsk->publications)) {
-		if (nla_put_flag(skb, TIPC_NLA_SOCK_HAS_PUBL))
-			goto attr_msg_cancel;
-	}
 	nla_nest_end(skb, attrs);
 	genlmsg_end(skb, hdr);
 
@@ -3199,16 +3207,19 @@ static int __tipc_nl_add_sk(struct sk_buff *skb, struct netlink_callback *cb,
 	return -EMSGSIZE;
 }
 
-int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
+static int __tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
+			     int (*skb_handler)(struct sk_buff *skb,
+						struct netlink_callback *cb,
+						struct tipc_sock *tsk))
 {
-	int err;
-	struct tipc_sock *tsk;
-	const struct bucket_table *tbl;
-	struct rhash_head *pos;
 	struct net *net = sock_net(skb->sk);
-	struct tipc_net *tn = net_generic(net, tipc_net_id);
-	u32 tbl_id = cb->args[0];
+	struct tipc_net *tn = tipc_net(net);
+	const struct bucket_table *tbl;
 	u32 prev_portid = cb->args[1];
+	u32 tbl_id = cb->args[0];
+	struct rhash_head *pos;
+	struct tipc_sock *tsk;
+	int err;
 
 	rcu_read_lock();
 	tbl = rht_dereference_rcu((&tn->sk_rht)->tbl, &tn->sk_rht);
@@ -3220,12 +3231,13 @@ int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
 				continue;
 			}
 
-			err = __tipc_nl_add_sk(skb, cb, tsk);
+			err = skb_handler(skb, cb, tsk);
 			if (err) {
 				prev_portid = tsk->portid;
 				spin_unlock_bh(&tsk->sk.sk_lock.slock);
 				goto out;
 			}
+
 			prev_portid = 0;
 			spin_unlock_bh(&tsk->sk.sk_lock.slock);
 		}
@@ -3238,6 +3250,11 @@ int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	return __tipc_nl_sk_walk(skb, cb, __tipc_nl_add_sk);
+}
+
 /* Caller should hold socket lock for the passed tipc socket. */
 static int __tipc_nl_add_sk_publ(struct sk_buff *skb,
 				 struct netlink_callback *cb,
-- 
2.1.4


* [net-next  2/3] tipc: implement socket diagnostics for AF_TIPC
From: GhantaKrishnamurthy MohanKrishna @ 2018-03-21 13:37 UTC (permalink / raw)
  To: tipc-discussion, jon.maloy, maloy, ying.xue,
	mohan.krishna.ghanta.krishnamurthy, netdev, davem
  Cc: Parthasarathy Bhuvaragan
In-Reply-To: <1521639465-3169-1-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>

This commit adds socket diagnostics capability for AF_TIPC in netlink
family NETLINK_SOCK_DIAG in a new kernel module (diag.ko).

The following are key design considerations:
- config TIPC_DIAG defaults to y, like INET_DIAG.
- only requests with the NLM_F_DUMP flag are supported (dump all).
- a tipc_sock_diag_req message is introduced to send filter parameters.
- the response attributes are TLVs, some of them nested.
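
As a rough illustration of the filter parameter, the sketch below mirrors
the request layout in userspace and applies the same bit test that
tipc_sk_fill_sock_diag uses on sk_state. The demo_* names are hypothetical
and the struct is a copy for illustration, not the uapi header itself.

```c
#include <stdint.h>

/* Userspace mirror of the request layout from this patch (illustration only). */
struct demo_tipc_sock_diag_req {
	uint8_t  sdiag_family;
	uint8_t  sdiag_protocol;
	uint16_t pad;
	uint32_t tidiag_states;	/* bitmask: bit N selects socket state N */
};

/* Build a mask selecting a single socket state; valid for state < 32. */
static uint32_t demo_state_bit(unsigned int state)
{
	return UINT32_C(1) << state;
}

/* Report the socket only if the bit for its state is set in the request. */
static int demo_state_selected(const struct demo_tipc_sock_diag_req *req,
			       unsigned int sk_state)
{
	return (req->tidiag_states & demo_state_bit(sk_state)) != 0;
}
```

A request with tidiag_states set to all ones would select every state,
while a zero mask would select none.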

To avoid exposing data structures between diag and tipc modules and
avoid code duplication, the following additions are required:
- export tipc_nl_sk_walk function to reuse socket iterator.
- export tipc_sk_fill_sock_diag to fill the tipc diag attributes.
- create a sock_diag response message in __tipc_add_sock_diag defined
  in diag.c and use the above exported tipc_sk_fill_sock_diag
  to fill response.
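
The nested-attribute handling that tipc_sk_fill_sock_diag relies on
(open a nest, add attributes, then either close it or cancel it) can be
sketched with a flat userspace buffer. This is a simplified model under
stated assumptions: the demo_* helpers are hypothetical, there is no
alignment padding, and the header layout is only loosely modelled on
struct nlattr.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_msg {
	uint8_t buf[64];
	size_t  len;
};

/* Attribute header: total length (header + payload), then type. */
struct demo_attr {
	uint16_t len;
	uint16_t type;
};

/* Append one u32 attribute; returns -1 if it does not fit. */
static int demo_put_u32(struct demo_msg *m, uint16_t type, uint32_t val)
{
	struct demo_attr a;

	a.len = (uint16_t)(sizeof(a) + sizeof(val));
	a.type = type;
	if (m->len + a.len > sizeof(m->buf))
		return -1;
	memcpy(m->buf + m->len, &a, sizeof(a));
	memcpy(m->buf + m->len + sizeof(a), &val, sizeof(val));
	m->len += a.len;
	return 0;
}

/* Open a nest: reserve a header and return its offset, or -1 if full. */
static long demo_nest_start(struct demo_msg *m, uint16_t type)
{
	struct demo_attr a = { 0, type };
	long off = (long)m->len;

	if (m->len + sizeof(a) > sizeof(m->buf))
		return -1;
	memcpy(m->buf + m->len, &a, sizeof(a));
	m->len += sizeof(a);
	return off;
}

/* Close a nest: patch the reserved header with the final length. */
static void demo_nest_end(struct demo_msg *m, long off)
{
	uint16_t len = (uint16_t)(m->len - (size_t)off);

	memcpy(m->buf + off, &len, sizeof(len));
}

/* Cancel a nest: trim the message back to where the nest began. */
static void demo_nest_cancel(struct demo_msg *m, long off)
{
	m->len = (size_t)off;
}
```

Cancelling by trimming is what makes the goto-based error paths cheap:
a half-written nest disappears without touching earlier attributes.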

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
---
 include/uapi/linux/tipc_netlink.h      |  18 ++++++
 include/uapi/linux/tipc_sockets_diag.h |  17 +++++
 net/tipc/Kconfig                       |   8 +++
 net/tipc/Makefile                      |   5 ++
 net/tipc/diag.c                        | 114 +++++++++++++++++++++++++++++++++
 net/tipc/socket.c                      |  72 +++++++++++++++++++--
 net/tipc/socket.h                      |  10 ++-
 7 files changed, 238 insertions(+), 6 deletions(-)
 create mode 100644 include/uapi/linux/tipc_sockets_diag.h
 create mode 100644 net/tipc/diag.c

diff --git a/include/uapi/linux/tipc_netlink.h b/include/uapi/linux/tipc_netlink.h
index 469aa67a5ecb..d7cec0480d70 100644
--- a/include/uapi/linux/tipc_netlink.h
+++ b/include/uapi/linux/tipc_netlink.h
@@ -114,6 +114,13 @@ enum {
 	TIPC_NLA_SOCK_REF,		/* u32 */
 	TIPC_NLA_SOCK_CON,		/* nest */
 	TIPC_NLA_SOCK_HAS_PUBL,		/* flag */
+	TIPC_NLA_SOCK_STAT,		/* nest */
+	TIPC_NLA_SOCK_TYPE,		/* u32 */
+	TIPC_NLA_SOCK_INO,		/* u32 */
+	TIPC_NLA_SOCK_UID,		/* u32 */
+	TIPC_NLA_SOCK_TIPC_STATE,	/* u32 */
+	TIPC_NLA_SOCK_COOKIE,		/* u64 */
+	TIPC_NLA_SOCK_PAD,		/* flag */
 
 	__TIPC_NLA_SOCK_MAX,
 	TIPC_NLA_SOCK_MAX = __TIPC_NLA_SOCK_MAX - 1
@@ -238,6 +245,17 @@ enum {
 	TIPC_NLA_CON_MAX = __TIPC_NLA_CON_MAX - 1
 };
 
+/* Nest, socket statistics info */
+enum {
+	TIPC_NLA_SOCK_STAT_RCVQ,	/* u32 */
+	TIPC_NLA_SOCK_STAT_SENDQ,	/* u32 */
+	TIPC_NLA_SOCK_STAT_LINK_CONG,	/* flag */
+	TIPC_NLA_SOCK_STAT_CONN_CONG,	/* flag */
+
+	__TIPC_NLA_SOCK_STAT_MAX,
+	TIPC_NLA_SOCK_STAT_MAX = __TIPC_NLA_SOCK_STAT_MAX - 1
+};
+
 /* Nest, link propreties. Valid for link, media and bearer */
 enum {
 	TIPC_NLA_PROP_UNSPEC,
diff --git a/include/uapi/linux/tipc_sockets_diag.h b/include/uapi/linux/tipc_sockets_diag.h
new file mode 100644
index 000000000000..7678cf2f0dcc
--- /dev/null
+++ b/include/uapi/linux/tipc_sockets_diag.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* AF_TIPC sock_diag interface for querying open sockets */
+
+#ifndef _UAPI__TIPC_SOCKETS_DIAG_H__
+#define _UAPI__TIPC_SOCKETS_DIAG_H__
+
+#include <linux/types.h>
+#include <linux/sock_diag.h>
+
+/* Request */
+struct tipc_sock_diag_req {
+	__u8	sdiag_family;	/* must be AF_TIPC */
+	__u8	sdiag_protocol;	/* must be 0 */
+	__u16	pad;		/* must be 0 */
+	__u32	tidiag_states;	/* query*/
+};
+#endif /* _UAPI__TIPC_SOCKETS_DIAG_H__ */
diff --git a/net/tipc/Kconfig b/net/tipc/Kconfig
index c25a3a149dc4..e450212121d2 100644
--- a/net/tipc/Kconfig
+++ b/net/tipc/Kconfig
@@ -34,3 +34,11 @@ config TIPC_MEDIA_UDP
 	  Saying Y here will enable support for running TIPC over IP/UDP
 	bool
 	default y
+
+config TIPC_DIAG
+	tristate "TIPC: socket monitoring interface"
+	depends on TIPC
+	default y
+	---help---
+	Support for TIPC socket monitoring interface used by ss tool.
+	If unsure, say Y.
diff --git a/net/tipc/Makefile b/net/tipc/Makefile
index 1edb7192aa2f..aca168f2abb1 100644
--- a/net/tipc/Makefile
+++ b/net/tipc/Makefile
@@ -14,3 +14,8 @@ tipc-y	+= addr.o bcast.o bearer.o \
 tipc-$(CONFIG_TIPC_MEDIA_UDP)	+= udp_media.o
 tipc-$(CONFIG_TIPC_MEDIA_IB)	+= ib_media.o
 tipc-$(CONFIG_SYSCTL)		+= sysctl.o
+
+
+obj-$(CONFIG_TIPC_DIAG)	+= diag.o
+
+tipc_diag-y	:= diag.o
diff --git a/net/tipc/diag.c b/net/tipc/diag.c
new file mode 100644
index 000000000000..46d9cd62f781
--- /dev/null
+++ b/net/tipc/diag.c
@@ -0,0 +1,114 @@
+/*
+ * net/tipc/diag.c: TIPC socket diag
+ *
+ * Copyright (c) 2018, Ericsson AB
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *    contributors may be used to endorse or promote products derived from
+ *    this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "ASIS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "core.h"
+#include "socket.h"
+#include <linux/sock_diag.h>
+#include <linux/tipc_sockets_diag.h>
+
+static u64 __tipc_diag_gen_cookie(struct sock *sk)
+{
+	u32 res[2];
+
+	sock_diag_save_cookie(sk, res);
+	return *((u64 *)res);
+}
+
+static int __tipc_add_sock_diag(struct sk_buff *skb,
+				struct netlink_callback *cb,
+				struct tipc_sock *tsk)
+{
+	struct tipc_sock_diag_req *req = nlmsg_data(cb->nlh);
+	struct nlmsghdr *nlh;
+	int err;
+
+	nlh = nlmsg_put_answer(skb, cb, SOCK_DIAG_BY_FAMILY, 0,
+			       NLM_F_MULTI);
+	if (!nlh)
+		return -EMSGSIZE;
+
+	err = tipc_sk_fill_sock_diag(skb, tsk, req->tidiag_states,
+				     __tipc_diag_gen_cookie);
+	if (err)
+		return err;
+
+	nlmsg_end(skb, nlh);
+	return 0;
+}
+
+static int tipc_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	return tipc_nl_sk_walk(skb, cb, __tipc_add_sock_diag);
+}
+
+static int tipc_sock_diag_handler_dump(struct sk_buff *skb,
+				       struct nlmsghdr *h)
+{
+	int hdrlen = sizeof(struct tipc_sock_diag_req);
+	struct net *net = sock_net(skb->sk);
+
+	if (nlmsg_len(h) < hdrlen)
+		return -EINVAL;
+
+	if (h->nlmsg_flags & NLM_F_DUMP) {
+		struct netlink_dump_control c = {
+			.dump = tipc_diag_dump,
+		};
+		netlink_dump_start(net->diag_nlsk, skb, h, &c);
+		return 0;
+	}
+	return -EOPNOTSUPP;
+}
+
+static const struct sock_diag_handler tipc_sock_diag_handler = {
+	.family = AF_TIPC,
+	.dump = tipc_sock_diag_handler_dump,
+};
+
+static int __init tipc_diag_init(void)
+{
+	return sock_diag_register(&tipc_sock_diag_handler);
+}
+
+static void __exit tipc_diag_exit(void)
+{
+	sock_diag_unregister(&tipc_sock_diag_handler);
+}
+
+module_init(tipc_diag_init);
+module_exit(tipc_diag_exit);
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG, AF_TIPC);
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 8e83b0501d4e..31fdd13d444e 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -3207,10 +3207,10 @@ static int __tipc_nl_add_sk(struct sk_buff *skb, struct netlink_callback *cb,
 	return -EMSGSIZE;
 }
 
-static int __tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
-			     int (*skb_handler)(struct sk_buff *skb,
-						struct netlink_callback *cb,
-						struct tipc_sock *tsk))
+int tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
+		    int (*skb_handler)(struct sk_buff *skb,
+				       struct netlink_callback *cb,
+				       struct tipc_sock *tsk))
 {
 	struct net *net = sock_net(skb->sk);
 	struct tipc_net *tn = tipc_net(net);
@@ -3249,10 +3249,72 @@ static int __tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
 
 	return skb->len;
 }
+EXPORT_SYMBOL(tipc_nl_sk_walk);
+
+int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct tipc_sock *tsk,
+			   u32 sk_filter_state,
+			   u64 (*tipc_diag_gen_cookie)(struct sock *sk))
+{
+	struct sock *sk = &tsk->sk;
+	struct nlattr *attrs;
+	struct nlattr *stat;
+
+	/*filter response w.r.t sk_state*/
+	if (!(sk_filter_state & (1 << sk->sk_state)))
+		return 0;
+
+	attrs = nla_nest_start(skb, TIPC_NLA_SOCK);
+	if (!attrs)
+		goto msg_cancel;
+
+	if (__tipc_nl_add_sk_info(skb, tsk))
+		goto attr_msg_cancel;
+
+	if (nla_put_u32(skb, TIPC_NLA_SOCK_TYPE, (u32)sk->sk_type) ||
+	    nla_put_u32(skb, TIPC_NLA_SOCK_TIPC_STATE, (u32)sk->sk_state) ||
+	    nla_put_u32(skb, TIPC_NLA_SOCK_INO, sock_i_ino(sk)) ||
+	    nla_put_u32(skb, TIPC_NLA_SOCK_UID,
+			from_kuid_munged(sk_user_ns(sk), sock_i_uid(sk))) ||
+	    nla_put_u64_64bit(skb, TIPC_NLA_SOCK_COOKIE,
+			      tipc_diag_gen_cookie(sk),
+			      TIPC_NLA_SOCK_PAD))
+		goto attr_msg_cancel;
+
+	stat = nla_nest_start(skb, TIPC_NLA_SOCK_STAT);
+	if (!stat)
+		goto attr_msg_cancel;
+
+	if (nla_put_u32(skb, TIPC_NLA_SOCK_STAT_RCVQ,
+			skb_queue_len(&sk->sk_receive_queue)) ||
+	    nla_put_u32(skb, TIPC_NLA_SOCK_STAT_SENDQ,
+			skb_queue_len(&sk->sk_write_queue)))
+		goto stat_msg_cancel;
+
+	if (tsk->cong_link_cnt &&
+	    nla_put_flag(skb, TIPC_NLA_SOCK_STAT_LINK_CONG))
+		goto stat_msg_cancel;
+
+	if (tsk_conn_cong(tsk) &&
+	    nla_put_flag(skb, TIPC_NLA_SOCK_STAT_CONN_CONG))
+		goto stat_msg_cancel;
+
+	nla_nest_end(skb, stat);
+	nla_nest_end(skb, attrs);
+
+	return 0;
+
+stat_msg_cancel:
+	nla_nest_cancel(skb, stat);
+attr_msg_cancel:
+	nla_nest_cancel(skb, attrs);
+msg_cancel:
+	return -EMSGSIZE;
+}
+EXPORT_SYMBOL(tipc_sk_fill_sock_diag);
 
 int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	return __tipc_nl_sk_walk(skb, cb, __tipc_nl_add_sk);
+	return tipc_nl_sk_walk(skb, cb, __tipc_nl_add_sk);
 }
 
 /* Caller should hold socket lock for the passed tipc socket. */
diff --git a/net/tipc/socket.h b/net/tipc/socket.h
index 06fb5944cf76..aae3fd4cd06c 100644
--- a/net/tipc/socket.h
+++ b/net/tipc/socket.h
@@ -49,6 +49,8 @@
 #define RCVBUF_DEF  (FLOWCTL_BLK_SZ * 1024 * 2)
 #define RCVBUF_MAX  (FLOWCTL_BLK_SZ * 1024 * 16)
 
+struct tipc_sock;
+
 int tipc_socket_init(void);
 void tipc_socket_stop(void);
 void tipc_sk_rcv(struct net *net, struct sk_buff_head *inputq);
@@ -59,5 +61,11 @@ int tipc_sk_rht_init(struct net *net);
 void tipc_sk_rht_destroy(struct net *net);
 int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb);
 int tipc_nl_publ_dump(struct sk_buff *skb, struct netlink_callback *cb);
-
+int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct tipc_sock *tsk,
+			   u32 sk_filter_state,
+			   u64 (*tipc_diag_gen_cookie)(struct sock *sk));
+int tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
+		    int (*skb_handler)(struct sk_buff *skb,
+				       struct netlink_callback *cb,
+				       struct tipc_sock *tsk));
 #endif
-- 
2.1.4


* [net-next  0/3] tipc: socket diagnostics additions for AF_TIPC
From: GhantaKrishnamurthy MohanKrishna @ 2018-03-21 13:37 UTC (permalink / raw)
  To: tipc-discussion, jon.maloy, maloy, ying.xue,
	mohan.krishna.ghanta.krishnamurthy, netdev, davem

The following patchset adds socket diagnostics support for AF_TIPC by
using the sock_diag framework. The patchset was created on top of
commit id: fb66cb0.

A new iproute2 package is needed to use this functionality; it will
be sent for review in a separate mail.

The commit series improves diagnosis of tipc sockets by exporting the
configuration, states and statistics of sockets.

The series has been co-authored by Parthasarathy Bhuvaragan and
consists of two parts:

1-2:  Adaptations of existing code to support the sock_diag framework.
We modify existing functions to support socket diagnostics. The required
information about the sockets is exported.

3:  Step sk_drops during packet drop.
This occurs if the packet cannot be queued due to queue length exceeding
configured thresholds.
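
The drop accounting in part 3 can be pictured with a small userspace
sketch. Patch 3 itself is not shown here, so this is only an assumption
about the general shape: the demo_* names are hypothetical and the
counter merely plays the role of sk->sk_drops.

```c
#include <stddef.h>

struct demo_rcvq {
	size_t qlen;		/* amount currently queued */
	size_t limit;		/* configured threshold */
	unsigned long drops;	/* plays the role of sk->sk_drops */
};

/* Queue a packet of the given size, or count a drop if it would exceed the limit. */
static int demo_enqueue(struct demo_rcvq *q, size_t size)
{
	if (q->qlen + size > q->limit) {
		q->drops++;
		return -1;
	}
	q->qlen += size;
	return 0;
}
```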

The diag module is optional, and if enabled it will be loaded on demand
when needed.

GhantaKrishnamurthy MohanKrishna (3):
  tipc: modify socket iterator for sock_diag
  tipc: implement socket diagnostics for AF_TIPC
  tipc: step sk->sk_drops when rcv buffer is full

 include/uapi/linux/tipc_netlink.h      |  19 +++++
 include/uapi/linux/tipc_sockets_diag.h |  17 +++++
 net/tipc/Kconfig                       |   8 ++
 net/tipc/Makefile                      |   5 ++
 net/tipc/diag.c                        | 114 ++++++++++++++++++++++++++++
 net/tipc/socket.c                      | 134 +++++++++++++++++++++++++++------
 net/tipc/socket.h                      |  10 ++-
 7 files changed, 281 insertions(+), 26 deletions(-)
 create mode 100644 include/uapi/linux/tipc_sockets_diag.h
 create mode 100644 net/tipc/diag.c

-- 
2.1.4


* Re: [PATCH] netfilter: nfnetlink_cthelper: Remove VLA usage
From: Gustavo A. R. Silva @ 2018-03-21 13:51 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	netfilter-devel, coreteam, netdev, linux-kernel, Kernel Hardening,
	Kees Cook, Gustavo A. R. Silva
In-Reply-To: <20180320123655.kugtpftt7nhzvqc5@salvia>



On 03/20/2018 07:36 AM, Pablo Neira Ayuso wrote:
> On Mon, Mar 12, 2018 at 07:21:38PM -0500, Gustavo A. R. Silva wrote:
>> In preparation to enabling -Wvla, remove VLA and replace it
>> with dynamic memory allocation.
>>
>> From a security viewpoint, the use of Variable Length Arrays can be
>> a vector for stack overflow attacks. Also, in general, as the code
>> evolves it is easy to lose track of how big a VLA can get. Thus, we
>> can end up having segfaults that are hard to debug.
>>
>> Also, fixed as part of the directive to remove all VLAs from
>> the kernel: https://lkml.org/lkml/2018/3/7/621
> 
> also applied, thanks.
> 

Awesome.

Thanks, Pablo.
--
Gustavo


* Re: [PATCH v2] Bluetooth: Remove VLA usage in aes_cmac
From: Gustavo A. R. Silva @ 2018-03-21 13:49 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Johan Hedberg, David S. Miller, linux-bluetooth,
	Network Development, linux-kernel
In-Reply-To: <8448C1B0-A2A9-4376-9CC3-7643F0FFC0FB@holtmann.org>



On 03/21/2018 08:45 AM, Marcel Holtmann wrote:
> Hi Gustavo,
> 
>> In preparation to enabling -Wvla, remove VLA and replace it
>> with dynamic memory allocation instead.
>>
>> The use of stack Variable Length Arrays needs to be avoided, as they
>> can be a vector for stack exhaustion, which can be both a runtime bug
>> or a security flaw. Also, in general, as code evolves it is easy to
>> lose track of how big a VLA can get. Thus, we can end up having runtime
>> failures that are hard to debug.
>>
>> Also, fixed as part of the directive to remove all VLAs from
>> the kernel: https://lkml.org/lkml/2018/3/7/621
>>
>> Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
>> ---
>> Changes in v2:
>> - Fix memory leak in previous patch.
>>
>> net/bluetooth/smp.c | 17 ++++++++++++-----
>> 1 file changed, 12 insertions(+), 5 deletions(-)
> 
> patch has been applied to bluetooth-next tree.
> 
> Regards
> 
> Marcel
> 

Awesome.

Thanks, Marcel.
--
Gustavo


* Re: [PATCH] mac80211: aes-cmac: remove VLA usage
From: Johannes Berg @ 2018-03-21 13:48 UTC (permalink / raw)
  To: Gustavo A. R. Silva, David S. Miller; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20180321134247.GA1275@embeddedgus>

On Wed, 2018-03-21 at 08:42 -0500, Gustavo A. R. Silva wrote:
> In preparation to enabling -Wvla, remove VLAs and replace them
> with dynamic memory allocation instead.
> 
> The use of stack Variable Length Arrays needs to be avoided, as they
> can be a vector for stack exhaustion, which can be both a runtime bug
> or a security flaw. Also, in general, as code evolves it is easy to
> lose track of how big a VLA can get. Thus, we can end up having runtime
> failures that are hard to debug.
> 
> Also, fixed as part of the directive to remove all VLAs from
> the kernel: https://lkml.org/lkml/2018/3/7/621
> 
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> ---
>  net/mac80211/aes_cmac.c | 36 ++++++++++++++++++++++++------------
>  1 file changed, 24 insertions(+), 12 deletions(-)
> 
> diff --git a/net/mac80211/aes_cmac.c b/net/mac80211/aes_cmac.c
> index 2fb6558..c9444bf 100644
> --- a/net/mac80211/aes_cmac.c
> +++ b/net/mac80211/aes_cmac.c
> @@ -27,30 +27,42 @@ static const u8 zero[CMAC_TLEN_256];
>  void ieee80211_aes_cmac(struct crypto_shash *tfm, const u8 *aad,
>  			const u8 *data, size_t data_len, u8 *mic)
>  {
> -	SHASH_DESC_ON_STACK(desc, tfm);
> +	struct shash_desc *shash;
>  	u8 out[AES_BLOCK_SIZE];
>  
> -	desc->tfm = tfm;
> +	shash = kmalloc(sizeof(*shash) + crypto_shash_descsize(tfm),
> +			GFP_KERNEL);
> +	if (!shash)
> +		return;

Honestly, this seems like a really bad idea - you're now hitting
kmalloc for every TX/RX frame here.

SHASH_DESC_ON_STACK() should just be fixed to not need a VLA, but take
some sort of maximum, I guess?
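
The "take some sort of maximum" idea can be sketched in userspace C as a
fixed-size stack buffer with a runtime bound check. This is an
illustration only: DEMO_MAX_DESCSIZE and the demo_* names are
hypothetical, not kernel constants or crypto API calls, and the "hash"
is a toy XOR.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_DESCSIZE 128	/* hypothetical upper bound, not a kernel constant */

/* Hypothetical transform handle that reports how much state it needs. */
struct demo_tfm {
	size_t descsize;
};

/*
 * Use a stack buffer sized for the worst case instead of a VLA or a
 * per-call heap allocation. Returns -1 if the transform needs no state
 * at all or more state than the bound allows.
 */
static int demo_digest(const struct demo_tfm *tfm, const unsigned char *data,
		       size_t len, unsigned char *out)
{
	unsigned char state[DEMO_MAX_DESCSIZE];
	size_t i;

	if (tfm->descsize == 0 || tfm->descsize > sizeof(state))
		return -1;

	memset(state, 0, tfm->descsize);
	for (i = 0; i < len; i++)	/* toy "hash": XOR bytes into the state */
		state[i % tfm->descsize] ^= data[i];
	*out = state[0];
	return 0;
}
```

The trade-off is stack usage that is always the worst case, in exchange
for no allocation on the per-frame path and a size known at compile time.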

johannes


* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Thomas Gleixner @ 2018-03-21 13:46 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, john.stultz, levi.pearson,
	edumazet, willemb, mlichvar
In-Reply-To: <20180307011230.24001-14-jesus.sanchez-palencia@intel.com>

On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> +struct tbs_sched_data {
> +	bool sorting;
> +	int clockid;
> +	int queue;
> +	s32 delta; /* in ns */
> +	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
> +	struct rb_root head;

Hmm. You are reimplementing timerqueue open coded. Have you checked whether
you could reuse the timerqueue implementation?

That requires adding a timerqueue node to struct sk_buff:

@@ -671,7 +671,8 @@ struct sk_buff {
 				unsigned long		dev_scratch;
 			};
 		};
-		struct rb_node	rbnode; /* used in netem & tcp stack */
+		struct rb_node		rbnode; /* used in netem & tcp stack */
+		struct timerqueue_node	tqnode;
 	};
 	struct sock		*sk;

Then you can use timerqueue_head in your scheduler data and all the open
coded rbtree handling goes away.
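
To show what the queue interface buys, here is a userspace stand-in with
the same three operations (add, get next, delete). It is a sketch under
stated assumptions: the demo_tq_* names are hypothetical, and a sorted
singly linked list replaces the kernel's rbtree, so insertion is O(n)
here.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_tq_node {
	int64_t expires;
	struct demo_tq_node *next;
};

struct demo_tq_head {
	struct demo_tq_node *first;	/* always the earliest-expiring entry */
};

/* Insert in expiry order; entries with equal expiry keep FIFO order. */
static void demo_tq_add(struct demo_tq_head *h, struct demo_tq_node *n)
{
	struct demo_tq_node **p = &h->first;

	while (*p && (*p)->expires <= n->expires)
		p = &(*p)->next;
	n->next = *p;
	*p = n;
}

/* Earliest entry without any walk, or NULL if the queue is empty. */
static struct demo_tq_node *demo_tq_getnext(const struct demo_tq_head *h)
{
	return h->first;
}

/* Unlink a node if it is queued. */
static void demo_tq_del(struct demo_tq_head *h, struct demo_tq_node *n)
{
	struct demo_tq_node **p = &h->first;

	while (*p && *p != n)
		p = &(*p)->next;
	if (*p)
		*p = n->next;
}
```

With such an interface the scheduler code only sets the expiry and calls
add, get next, or delete; the ordering logic lives in one place.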

> +static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	ktime_t txtime = nskb->tstamp;
> +	struct sock *sk = nskb->sk;
> +	ktime_t now;
> +
> +	if (sk && !sock_flag(sk, SOCK_TXTIME))
> +		return false;
> +
> +	/* We don't perform crosstimestamping.
> +	 * Drop if packet's clockid differs from qdisc's.
> +	 */
> +	if (nskb->txtime_clockid != q->clockid)
> +		return false;
> +
> +	now = get_time_by_clockid(q->clockid);

If you store the time getter function pointer in tbs_sched_data then you
avoid the lookup and just can do

       now = q->get_time();

That applies to lots of other places.
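
A minimal sketch of caching the getter at setup time follows. It assumes
hypothetical demo_* names and uses stub clocks returning fixed values so
the example stays deterministic; a real qdisc would point at the kernel's
clock accessors instead.

```c
#include <stdint.h>

enum demo_clockid {
	DEMO_CLOCK_REALTIME,
	DEMO_CLOCK_MONOTONIC,
	DEMO_CLOCK_TAI
};

/* Stub clocks with fixed values (assumption for illustration). */
static int64_t demo_get_real(void) { return 1000; }
static int64_t demo_get_mono(void) { return 2000; }
static int64_t demo_get_tai(void)  { return 3000; }

struct demo_sched_data {
	enum demo_clockid clockid;
	int64_t (*get_time)(void);	/* resolved once at init */
};

/* Resolve the clock id to a getter once; reject unknown ids. */
static int demo_init(struct demo_sched_data *q, int id)
{
	switch (id) {
	case DEMO_CLOCK_REALTIME:
		q->get_time = demo_get_real;
		break;
	case DEMO_CLOCK_MONOTONIC:
		q->get_time = demo_get_mono;
		break;
	case DEMO_CLOCK_TAI:
		q->get_time = demo_get_tai;
		break;
	default:
		return -1;
	}
	q->clockid = (enum demo_clockid)id;
	return 0;
}

/* Hot path: no lookup by clock id, just one indirect call. */
static int demo_is_late(const struct demo_sched_data *q, int64_t txtime)
{
	return txtime < q->get_time();
}
```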

> +	if (ktime_before(txtime, now) || ktime_before(txtime, q->last))
> +		return false;
> +
> +	return true;
> +}
> +
> +static struct sk_buff *tbs_peek(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +
> +	return q->peek(sch);
> +}
> +
> +static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct rb_node *p;
> +
> +	p = rb_first(&q->head);

timerqueue gives you direct access to the first expiring entry w/o walking
the rbtree. So that would become:

	p = timerqueue_getnext(&q->tqhead);
	return p ? rb_to_skb(p) : NULL;

> +	if (!p)
> +		return NULL;
> +
> +	return rb_to_skb(p);
> +}

> +static int tbs_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc *sch,
> +				      struct sk_buff **to_free)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct rb_node **p = &q->head.rb_node, *parent = NULL;
> +	ktime_t txtime = nskb->tstamp;
> +
> +	if (!is_packet_valid(sch, nskb))
> +		return qdisc_drop(nskb, sch, to_free);
> +
> +	while (*p) {
> +		struct sk_buff *skb;
> +
> +		parent = *p;
> +		skb = rb_to_skb(parent);
> +		if (ktime_after(txtime, skb->tstamp))
> +			p = &parent->rb_right;
> +		else
> +			p = &parent->rb_left;
> +	}
> +	rb_link_node(&nskb->rbnode, parent, p);
> +	rb_insert_color(&nskb->rbnode, &q->head);

That'd become:

       nskb->tqnode.expires = txtime;
       timerqueue_add(&q->tqhead, &nskb->tqnode);

> +	qdisc_qstats_backlog_inc(sch, nskb);
> +	sch->q.qlen++;
> +
> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
> +	reset_watchdog(sch);
> +
> +	return NET_XMIT_SUCCESS;
> +}
> +
> +static void timesortedlist_erase(struct Qdisc *sch, struct sk_buff *skb,
> +				 bool drop)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +
> +	rb_erase(&skb->rbnode, &q->head);
> +
> +	qdisc_qstats_backlog_dec(sch, skb);
> +
> +	if (drop) {
> +		struct sk_buff *to_free = NULL;
> +
> +		qdisc_drop(skb, sch, &to_free);
> +		kfree_skb_list(to_free);
> +		qdisc_qstats_overlimit(sch);
> +	} else {
> +		qdisc_bstats_update(sch, skb);
> +
> +		q->last = skb->tstamp;
> +	}
> +
> +	sch->q.qlen--;
> +
> +	/* The rbnode field in the skb re-uses these fields, now that
> +	 * we are done with the rbnode, reset them.
> +	 */
> +	skb->next = NULL;
> +	skb->prev = NULL;
> +	skb->dev = qdisc_dev(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +
> +	return q->dequeue(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct sk_buff *skb = tbs_peek(sch);
> +	ktime_t now, next;
> +
> +	if (!skb)
> +		return NULL;
> +
> +	now = get_time_by_clockid(q->clockid);
> +
> +	/* Drop if packet has expired while in queue and the drop_if_late
> +	 * flag is set.
> +	 */
> +	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
> +		struct sk_buff *to_free = NULL;
> +
> +		qdisc_queue_drop_head(sch, &to_free);
> +		kfree_skb_list(to_free);
> +		qdisc_qstats_overlimit(sch);
> +
> +		skb = NULL;
> +		goto out;

Instead of going out immediately you should check whether the next skb is
already due for sending.
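
One way to picture that suggestion is a loop that keeps dropping expired
heads until it reaches a packet that is either due or not yet due. The
sketch below is a userspace model with hypothetical demo_* names and
plain integers for time; it is not the qdisc code itself.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_pkt {
	int64_t txtime;
	int drop_if_late;
	struct demo_pkt *next;
};

struct demo_queue {
	struct demo_pkt *head;
	unsigned int dropped;
};

/*
 * Drop expired head packets in a loop, then dequeue the first remaining
 * packet if now has passed txtime - delta. Returns NULL if the queue is
 * empty or the head is not due yet.
 */
static struct demo_pkt *demo_dequeue(struct demo_queue *q, int64_t now,
				     int64_t delta)
{
	struct demo_pkt *p;

	while ((p = q->head) != NULL) {
		if (p->drop_if_late && p->txtime < now) {
			q->head = p->next;	/* expired: drop, look at the next one */
			p->next = NULL;
			q->dropped++;
			continue;
		}
		if (now > p->txtime - delta) {	/* transmission window reached */
			q->head = p->next;
			p->next = NULL;
			return p;
		}
		return NULL;			/* head is not due yet */
	}
	return NULL;
}
```

Compared with bailing out after the first drop, the loop avoids waiting
for another watchdog wakeup when the packet behind an expired one is
already inside its window.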

> +	}
> +
> +	next = ktime_sub_ns(skb->tstamp, q->delta);
> +
> +	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
> +	if (ktime_after(now, next))
> +		skb = qdisc_dequeue_head(sch);
> +	else
> +		skb = NULL;
> +
> +out:
> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
> +	reset_watchdog(sch);
> +
> +	return skb;
> +}
> +
> +static struct sk_buff *tbs_dequeue_timesortedlist(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct sk_buff *skb;
> +	ktime_t now, next;
> +
> +	skb = tbs_peek(sch);
> +	if (!skb)
> +		return NULL;
> +
> +	now = get_time_by_clockid(q->clockid);
> +
> +	/* Drop if packet has expired while in queue and the drop_if_late
> +	 * flag is set.
> +	 */
> +	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
> +		timesortedlist_erase(sch, skb, true);
> +		skb = NULL;
> +		goto out;

Same as above.

> +	}
> +
> +	next = ktime_sub_ns(skb->tstamp, q->delta);
> +
> +	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
> +	if (ktime_after(now, next))
> +		timesortedlist_erase(sch, skb, false);
> +	else
> +		skb = NULL;
> +
> +out:
> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
> +	reset_watchdog(sch);
> +
> +	return skb;
> +}
> +
> +static inline void setup_queueing_mode(struct tbs_sched_data *q)
> +{
> +	if (q->sorting) {
> +		q->enqueue = tbs_enqueue_timesortedlist;
> +		q->dequeue = tbs_dequeue_timesortedlist;
> +		q->peek = tbs_peek_timesortedlist;
> +	} else {
> +		q->enqueue = tbs_enqueue_scheduledfifo;
> +		q->dequeue = tbs_dequeue_scheduledfifo;
> +		q->peek = qdisc_peek_head;

I don't see the point of these two modes and all the duplicated code it
involves.

FIFO mode limits usage to a single thread which has to guarantee that the
packets are queued in time order.

If you look at the use cases of TDM in various fields then FIFO mode is
pretty much useless. In industrial/automotive fieldbus applications the
various time slices are filled by different threads or even processes.

Sure, the rbtree queue/dequeue has overhead compared to a simple linked
list, but you pay for that with more indirections and lots of mostly
duplicated code. And in the worst case one of these code paths is going to
be rarely used and prone to bitrot.

Thanks,

	tglx


* Re: [PATCH v2] Bluetooth: Remove VLA usage in aes_cmac
From: Marcel Holtmann @ 2018-03-21 13:45 UTC (permalink / raw)
  To: Gustavo A. R. Silva
  Cc: Johan Hedberg, David S. Miller, linux-bluetooth,
	Network Development, linux-kernel
In-Reply-To: <20180321010527.GA16616@embeddedor.com>

Hi Gustavo,

> In preparation to enabling -Wvla, remove VLA and replace it
> with dynamic memory allocation instead.
> 
> The use of stack Variable Length Arrays needs to be avoided, as they
> can be a vector for stack exhaustion, which can be both a runtime bug
> or a security flaw. Also, in general, as code evolves it is easy to
> lose track of how big a VLA can get. Thus, we can end up having runtime
> failures that are hard to debug.
> 
> Also, fixed as part of the directive to remove all VLAs from
> the kernel: https://lkml.org/lkml/2018/3/7/621
> 
> Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
> ---
> Changes in v2:
> - Fix memory leak in previous patch.
> 
> net/bluetooth/smp.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)

patch has been applied to bluetooth-next tree.

Regards

Marcel

^ permalink raw reply

* [PATCH] mac80211: aes-cmac: remove VLA usage
From: Gustavo A. R. Silva @ 2018-03-21 13:42 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: linux-wireless, netdev, linux-kernel, Gustavo A. R. Silva

In preparation to enabling -Wvla, remove VLAs and replace them
with dynamic memory allocation instead.

The use of stack Variable Length Arrays needs to be avoided, as they
can be a vector for stack exhaustion, which can be both a runtime bug
or a security flaw. Also, in general, as code evolves it is easy to
lose track of how big a VLA can get. Thus, we can end up having runtime
failures that are hard to debug.

Also, fixed as part of the directive to remove all VLAs from
the kernel: https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 net/mac80211/aes_cmac.c | 36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/net/mac80211/aes_cmac.c b/net/mac80211/aes_cmac.c
index 2fb6558..c9444bf 100644
--- a/net/mac80211/aes_cmac.c
+++ b/net/mac80211/aes_cmac.c
@@ -27,30 +27,42 @@ static const u8 zero[CMAC_TLEN_256];
 void ieee80211_aes_cmac(struct crypto_shash *tfm, const u8 *aad,
 			const u8 *data, size_t data_len, u8 *mic)
 {
-	SHASH_DESC_ON_STACK(desc, tfm);
+	struct shash_desc *shash;
 	u8 out[AES_BLOCK_SIZE];
 
-	desc->tfm = tfm;
+	shash = kmalloc(sizeof(*shash) + crypto_shash_descsize(tfm),
+			GFP_KERNEL);
+	if (!shash)
+		return;
 
-	crypto_shash_init(desc);
-	crypto_shash_update(desc, aad, AAD_LEN);
-	crypto_shash_update(desc, data, data_len - CMAC_TLEN);
-	crypto_shash_finup(desc, zero, CMAC_TLEN, out);
+	shash->tfm = tfm;
+
+	crypto_shash_init(shash);
+	crypto_shash_update(shash, aad, AAD_LEN);
+	crypto_shash_update(shash, data, data_len - CMAC_TLEN);
+	crypto_shash_finup(shash, zero, CMAC_TLEN, out);
 
 	memcpy(mic, out, CMAC_TLEN);
+	kfree(shash);
 }
 
 void ieee80211_aes_cmac_256(struct crypto_shash *tfm, const u8 *aad,
 			    const u8 *data, size_t data_len, u8 *mic)
 {
-	SHASH_DESC_ON_STACK(desc, tfm);
+	struct shash_desc *shash;
+
+	shash = kmalloc(sizeof(*shash) + crypto_shash_descsize(tfm),
+			GFP_KERNEL);
+	if (!shash)
+		return;
 
-	desc->tfm = tfm;
+	shash->tfm = tfm;
 
-	crypto_shash_init(desc);
-	crypto_shash_update(desc, aad, AAD_LEN);
-	crypto_shash_update(desc, data, data_len - CMAC_TLEN_256);
-	crypto_shash_finup(desc, zero, CMAC_TLEN_256, mic);
+	crypto_shash_init(shash);
+	crypto_shash_update(shash, aad, AAD_LEN);
+	crypto_shash_update(shash, data, data_len - CMAC_TLEN_256);
+	crypto_shash_finup(shash, zero, CMAC_TLEN_256, mic);
+	kfree(shash);
 }
 
 struct crypto_shash *ieee80211_aes_cmac_key_setup(const u8 key[],
-- 
2.7.4
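
For readers unfamiliar with the pattern, a minimal user-space sketch of the
same transformation follows. It is an illustration only: struct desc_sketch
and toy_digest() are made-up names, malloc()/free() stand in for
kmalloc()/kfree(), and the XOR loop is a placeholder for the
init/update/finup sequence, not a MAC. Unlike the void functions in the
patch, the sketch returns an error code so that a caller can notice an
allocation failure; that is a choice made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor with a trailing, runtime-sized context,
 * loosely modelled on struct shash_desc. Not the kernel type.
 */
struct desc_sketch {
	size_t ctx_len;
	unsigned char ctx[];	/* flexible array member, sized at allocation */
};

/* Toy "digest": XOR all input bytes into a ctx_len-byte state and copy
 * the first out_len bytes out. Returns 0 on success, -1 on error.
 */
static int toy_digest(size_t ctx_len, const uint8_t *data, size_t data_len,
		      uint8_t *out, size_t out_len)
{
	struct desc_sketch *d;
	size_t i;

	assert(data != NULL && out != NULL);
	if (ctx_len == 0 || out_len > ctx_len)
		return -1;

	/* Heap allocation replaces `unsigned char ctx[ctx_len]` on the stack. */
	d = malloc(sizeof(*d) + ctx_len);
	if (!d)
		return -1;	/* report the failure to the caller */

	d->ctx_len = ctx_len;
	memset(d->ctx, 0, ctx_len);
	for (i = 0; i < data_len; i++)
		d->ctx[i % ctx_len] ^= data[i];

	memcpy(out, d->ctx, out_len);
	free(d);		/* only exit path after the allocation: no leak */
	return 0;
}
```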

^ permalink raw reply related

* RE: [PATCH net 1/1] qede: Fix barrier usage after tx doorbell write.
From: Elior, Ariel @ 2018-03-21 13:39 UTC (permalink / raw)
  To: davem@davemloft.net
  Cc: netdev@vger.kernel.org, Kalderon, Michal, Chopra, Manish
In-Reply-To: <20180316175844.18693-1-manish.chopra@cavium.com>

> Subject: [PATCH net 1/1] qede: Fix barrier usage after tx doorbell write.
> 
> Since commit c5ad119fb6c09b0297446be05bd66602fa564758
> ("net: sched: pfifo_fast use skb_array") driver is exposed
> to an issue where it is hitting NULL skbs while handling TX
> completions. Driver uses mmiowb() to flush the writes to the
> doorbell bar which is a write-combined bar, however on x86
> mmiowb() does not flush the write combined buffer.
> 
> This patch fixes this problem by replacing mmiowb() with wmb()
> after the write combined doorbell write so that writes are
> flushed and synchronized from more than one processor.
> 
> Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
> Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
> ---
>  drivers/net/ethernet/qlogic/qede/qede_fp.c |   10 ++++------
>  1 files changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c
> b/drivers/net/ethernet/qlogic/qede/qede_fp.c
> index dafc079..2e921ca 100644
> --- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
> +++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
> @@ -320,13 +320,11 @@ static inline void qede_update_tx_producer(struct
> qede_tx_queue *txq)
>  	barrier();
>  	writel(txq->tx_db.raw, txq->doorbell_addr);
> 
> -	/* mmiowb is needed to synchronize doorbell writes from more than one
> -	 * processor. It guarantees that the write arrives to the device before
> -	 * the queue lock is released and another start_xmit is called (possibly
> -	 * on another CPU). Without this barrier, the next doorbell can bypass
> -	 * this doorbell. This is applicable to IA64/Altix systems.
> +	/* Fence required to flush the write combined buffer, since another
> +	 * CPU may write to the same doorbell address and data may be lost
> +	 * due to relaxed order nature of write combined bar.
>  	 */
> -	mmiowb();
> +	wmb();
>  }
> 
>  static int qede_xdp_xmit(struct qede_dev *edev, struct qede_fastpath *fp,
> --
> 1.7.1

Hi Dave,
This patch appears as "superseded" in patchwork. I am not really sure why
that is - I noticed some other barrier work is going on, but none of it will
solve this issue. This patch solves an important bug in the driver - please
consider applying it.
Thanks,
Ariel

^ permalink raw reply

* Re: HW question: i210 vs. BCM5461S over SGMII: no response from PHY to MDIO requests?
From: Andrew Lunn @ 2018-03-21 13:08 UTC (permalink / raw)
  To: Frantisek Rysanek; +Cc: netdev
In-Reply-To: <5AB23844.22609.48A1E06C@Frantisek.Rysanek.post.cz>

> Looking at the i2c dumps, and some past dumps from the igb driver, 
> it's dawning on me that the igb driver, without much hacking, 
> would try to read the PHY ID from the DMI/DDM block - a case which 
> the drivers/net/phy/mdio-i2c.c specifically avoids :-)

It avoids it for a very good reason. This driver exports a standard
Linux MDIO bus. The core phylib code will then probe the bus, read the
IDs, find the correct PHY driver and load it.

It is probably good that you spend some time looking at a driver other
than igb. Take one I know a little, say the Freescale FEC. It has
functions to perform MDIO read and MDIO write. These are then exported
as an MDIO bus to the Linux common code. And there are a few calls to
phy_connect(), phy_start(), phy_stop(). That is how you build a driver
which uses the code in drivers/net/phy. mvneta shows how you can use
phylink. Study these two for a while.
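
As a rough picture of the split described above, here is a user-space
sketch. Everything in it is hypothetical: struct mdio_bus_sketch is a much
simplified stand-in for the kernel's struct mii_bus, the signatures differ
from the real ones, and the fake PHY at address 5 has a made-up ID. The
point is only that the MAC driver supplies read/write accessors, and generic
code does the probing by reading the two ID registers at each address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MII_PHYSID1	2	/* PHY identifier, high 16 bits */
#define SKETCH_MII_PHYSID2	3	/* PHY identifier, low 16 bits */
#define SKETCH_PHY_MAX_ADDR	32

/* Hypothetical, simplified bus abstraction supplied by the MAC driver. */
struct mdio_bus_sketch {
	int (*read)(void *priv, int addr, int reg);	/* value, or <0 on error */
	int (*write)(void *priv, int addr, int reg, uint16_t val);
	void *priv;
};

/* Generic code: read the 32-bit ID at addr. Returns 0 and fills *id,
 * or -1 if nothing answers. The all-ones test is a simplification.
 */
static int read_phy_id(const struct mdio_bus_sketch *bus, int addr, uint32_t *id)
{
	uint32_t val;
	int hi, lo;

	assert(bus && bus->read && id);
	hi = bus->read(bus->priv, addr, SKETCH_MII_PHYSID1);
	if (hi < 0)
		return -1;
	lo = bus->read(bus->priv, addr, SKETCH_MII_PHYSID2);
	if (lo < 0)
		return -1;

	val = ((uint32_t)(hi & 0xffff) << 16) | (uint32_t)(lo & 0xffff);
	if (val == 0xffffffffu)
		return -1;	/* undriven bus reads back as all ones */
	*id = val;
	return 0;
}

/* Generic code: scan every address, return the first one that answers. */
static int scan_bus(const struct mdio_bus_sketch *bus, uint32_t *id)
{
	int addr;

	for (addr = 0; addr < SKETCH_PHY_MAX_ADDR; addr++)
		if (read_phy_id(bus, addr, id) == 0)
			return addr;
	return -1;
}

/* Fake hardware for the example: one PHY at address 5 with a made-up ID. */
static int fake_read(void *priv, int addr, int reg)
{
	(void)priv;
	if (addr != 5)
		return 0xffff;
	if (reg == SKETCH_MII_PHYSID1)
		return 0x1234;
	if (reg == SKETCH_MII_PHYSID2)
		return 0x5678;
	return 0;
}
```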

      Andrew

^ permalink raw reply

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
From: Thomas Gleixner @ 2018-03-21 12:58 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Eric Dumazet, Jesus Sanchez-Palencia, netdev, jhs, xiyou.wangcong,
	jiri, vinicius.gomes, intel-wired-lan, anna-maria, henrik,
	john.stultz, levi.pearson, edumazet, willemb, mlichvar
In-Reply-To: <20180307052410.m2yqmokrivjlwcjz@localhost>

On Tue, 6 Mar 2018, Richard Cochran wrote:

> On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
> > This is adding 32+1 bits to sk_buff, and possibly holes in this very
> > very hot (and already too fat) structure.
> > 
> > Do we really need 32 bits for a clockid_t ?
> 
> Probably we can live with fewer bits.
> 
> For clock IDs with a positive sign, the max possible clock value is 16.
> 
> For clock IDs with a negative sign, IIRC, three bits are for the type
> code (we have also posix timers packed like this) and the rest are for the
> file descriptor.  So maybe we could use 16 bits, allowing 12 bits or
> so for encoding the FD.
> 
> The downside would be that this forces the application to make sure
> and open the dynamic posix clock early enough before the FD count gets
> too high.

Errm. No. There is no way to support fd based clocks or one of the CPU
time/process time based clocks for this.

CLOCK_REALTIME and CLOCK_MONOTONIC are probably the only interesting
ones. BOOTTIME is hopefully soon irrelevant as we make MONOTONIC and
BOOTTIME the same unless this unexpectedly causes major issues. I don't
think that CLOCK_TAI makes sense in that context, but I might be wrong.

The rest of the CLOCK_* space cannot be used at all.

So you need at max 2 bits for this, but I think 1 is good enough.
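
A small sketch of what a 2-bit encoding could look like, purely as an
illustration. The struct, the selector values and the function names are
invented for this example and are not taken from the patch series; the
clock ID constants are copied from the Linux UAPI values so that the
snippet builds on its own. Every clock that cannot be used here (CPU-time
clocks, fd-based dynamic clocks) is rejected.

```c
#include <assert.h>

/* Linux clock IDs, copied as plain constants to keep the sketch standalone. */
enum {
	SK_CLOCK_REALTIME  = 0,
	SK_CLOCK_MONOTONIC = 1,
	SK_CLOCK_BOOTTIME  = 7,
	SK_CLOCK_TAI       = 11,
};

/* Hypothetical per-packet metadata: 2 bits of clock selector plus the
 * drop_if_late flag, instead of a full 32-bit clockid_t.
 */
struct txtime_meta_sketch {
	unsigned int clock_sel:2;
	unsigned int drop_if_late:1;
};

/* Map a clock ID to a 2-bit selector, or -1 if the clock is unusable. */
static int clockid_to_sel(int clockid)
{
	switch (clockid) {
	case SK_CLOCK_REALTIME:
		return 0;
	case SK_CLOCK_MONOTONIC:
		return 1;
	case SK_CLOCK_BOOTTIME:
		return 2;
	case SK_CLOCK_TAI:
		return 3;
	default:
		return -1;	/* CPU-time, fd-based and unknown clocks */
	}
}

/* Inverse mapping, for the consumer of the metadata. */
static int sel_to_clockid(unsigned int sel)
{
	static const int map[4] = {
		SK_CLOCK_REALTIME, SK_CLOCK_MONOTONIC,
		SK_CLOCK_BOOTTIME, SK_CLOCK_TAI,
	};

	assert(sel < 4);
	return map[sel];
}
```

If only REALTIME and MONOTONIC are supported, selectors 0 and 1 are the only
ones in use and the field shrinks to a single bit.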

Thanks,

	tglx

^ permalink raw reply

* Re: HW question: i210 vs. BCM5461S over SGMII: no response from PHY to MDIO requests?
From: Andrew Lunn @ 2018-03-21 12:58 UTC (permalink / raw)
  To: Frantisek Rysanek; +Cc: netdev
In-Reply-To: <5AB2470E.24728.48DBA909@Frantisek.Rysanek.post.cz>

> I was also wondering if someone has written any kernel-space support 
> for the SFP's. Sure enough, I've found lots of code by Russell King 
> under drivers/net/phy. I started reading from sfp.c, went on to 
> sfp-bus.c, next the phylink stuff... Answers lots of my questions. 
> Clearly someone has "been there and done that" - I mean how to 
> interpret SFP EEPROM bits and act upon them in the PHY 
> initialization. 

That is all quite new code. It has not spread too far yet. The Marvell
mvneta ethernet driver is using it, and the Clearfog is probably the
first in kernel board to make use of it. A few of us are working on
converting DSA over to using PHYLINK, since we have boards with
Ethernet switches and SFP connected to switch ports.

> There are notes in the phylib drivers that this is "platform" stuff
> - a keyword which speaks to me of stuff hardwired onboard in
> embedded motherboards (is Russell King the father of Linux on ARM
> ?), rather than general-purpose PnP and addon boards.

At the moment, PHYLIB pretty much relies on device tree to glue all
the parts together. There is no support for not using device tree at
the moment. It would need somebody to contribute that code.

The other issue is that the igb driver, like most of the Intel
Ethernet drivers, ignores much of the Linux common MDIO/PHY and I2C
infrastructure, and does it all itself. This is probably because
they share code with the Windoze driver.

> I'd love to use existing code of the phylib to talk to the SFP
> PHY's, maybe extend the phylib a bit (with the phy's I have), rather
> than cobble together something crude and private on my own, inside
> the igb driver.

So this is quite a big job, to do it cleanly. You probably need to
retain the Intel code for MDIO/PHY/I2C, but add an option to make use
of the Linux common MDIO/PHY/I2C infrastructure. Then you need to
extend PHYLINK with a non device tree way to configure it, and glue
all the parts together.

You can make it a bit easier by just throwing away all the Intel
MDIO/PHY/I2C code, replacing it with Linux common code. But I expect
the Intel maintainers would then reject your changes. There is too
high a chance of introducing regressions.

     Andrew

^ permalink raw reply

* Re: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write
From: Rahul Lakkireddy @ 2018-03-21 12:28 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Thomas Gleixner, x86@kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, mingo@redhat.com, hpa@zytor.com,
	davem@davemloft.net, akpm@linux-foundation.org,
	torvalds@linux-foundation.org, Ganesh GR, Nirranjan Kirubaharan,
	Indranil Choudhury
In-Reply-To: <CAKgT0UeTHPVrxEUbEszxm1MyJsQDSLfyM8txHq=r7GVkPD37nQ@mail.gmail.com>

On Tuesday, March 03/20/18, 2018 at 20:12:15 +0530, Alexander Duyck wrote:
> On Tue, Mar 20, 2018 at 6:32 AM, Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com> wrote:
> > On Monday, March 03/19/18, 2018 at 20:13:10 +0530, Thomas Gleixner wrote:
> >> On Mon, 19 Mar 2018, Rahul Lakkireddy wrote:
> >>
> >> > Use VMOVDQU AVX CPU instruction when available to do 256-bit
> >> > IO read and write.
> >>
> >> That's not what the patch does. See below.
> >>
> >> > Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
> >> > Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
> >>
> >> That Signed-off-by chain is wrong....
> >>
> >> > +#ifdef CONFIG_AS_AVX
> >> > +#include <asm/fpu/api.h>
> >> > +
> >> > +static inline u256 __readqq(const volatile void __iomem *addr)
> >> > +{
> >> > +   u256 ret;
> >> > +
> >> > +   kernel_fpu_begin();
> >> > +   asm volatile("vmovdqu %0, %%ymm0" :
> >> > +                : "m" (*(volatile u256 __force *)addr));
> >> > +   asm volatile("vmovdqu %%ymm0, %0" : "=m" (ret));
> >> > +   kernel_fpu_end();
> >> > +   return ret;
> >>
> >> You _cannot_ assume that the instruction is available just because
> >> CONFIG_AS_AVX is set. The availability is determined by the runtime
> >> evaluated CPU feature flags, i.e. X86_FEATURE_AVX.
> >>
> >
> > Ok.  Will add boot_cpu_has(X86_FEATURE_AVX) check as well.
> >
> >> Aside of that I very much doubt that this is faster than 4 consecutive
> >> 64bit reads/writes as you have the full overhead of
> >> kernel_fpu_begin()/end() for each access.
> >>
> >> You did not provide any numbers for this so its even harder to
> >> determine.
> >>
> >
> > Sorry about that.  Here are the numbers with and without this series.
> >
> > When reading up to 2 GB on-chip memory via MMIO, the time taken:
> >
> > Without Series        With Series
> > (64-bit read)         (256-bit read)
> >
> > 52 seconds            26 seconds
> >
> > As can be seen, we see good improvement with doing 256-bits at a
> > time.
> 
> Instead of framing this as an enhanced version of the read/write ops
> why not look at replacing or extending something like the
> memcpy_fromio or memcpy_toio operations? It would probably be more
> comparable to what you are doing if you are wanting to move large
> chunks of memory from one region to another, and it should translate
> into something like AVX instructions once the CPU optimizations kick
> in for a memcpy.
> 

Ok. Will look into this approach.
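
For reference, the shape of such a bulk helper, as a user-space sketch under
stated assumptions: copy_from_window() is a made-up name and this is not
memcpy_fromio(), whose implementation is architecture specific. The source
is treated as a volatile, 8-byte-aligned window, the body uses 64-bit
reads, and the tail is copied bytewise. The access width stays an internal
detail of the helper, so a wider path could be added later without touching
callers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes from a (pretend) device window into normal memory.
 * Assumes src is 8-byte aligned, as the start of a BAR mapping would be.
 */
static void copy_from_window(void *dst, const volatile void *src, size_t n)
{
	const volatile uint64_t *s64 = src;
	const volatile uint8_t *s8;
	uint8_t *d = dst;

	assert(dst != NULL && src != NULL);
	assert(((uintptr_t)src & 7) == 0);

	while (n >= sizeof(uint64_t)) {
		uint64_t v = *s64++;		/* one 64-bit read per iteration */

		memcpy(d, &v, sizeof(v));	/* dst may be unaligned */
		d += sizeof(v);
		n -= sizeof(v);
	}

	/* Remaining 0..7 bytes, one byte at a time. */
	s8 = (const volatile uint8_t *)s64;
	while (n--)
		*d++ = *s8++;
}
```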

Thanks,
Rahul

^ permalink raw reply

* Re: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write
From: Rahul Lakkireddy @ 2018-03-21 12:28 UTC (permalink / raw)
  To: David Laight
  Cc: Thomas Gleixner, x86@kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, mingo@redhat.com, hpa@zytor.com,
	davem@davemloft.net, akpm@linux-foundation.org,
	torvalds@linux-foundation.org, Ganesh GR, Nirranjan Kirubaharan,
	Indranil Choudhury
In-Reply-To: <5f43882155104f50bbd2e5cf63d432f2@AcuMS.aculab.com>

On Tuesday, March 03/20/18, 2018 at 20:10:19 +0530, David Laight wrote:
> From: Rahul Lakkireddy
> > Sent: 20 March 2018 13:32
> ...
> > On High Availability Server, the logs of the failing system must be
> > collected as quickly as possible.  So, we're concerned with the amount
> > of time taken to collect our large on-chip memory.  We see improvement
> > in doing 256-bit reads at a time.
> 
> Two other options:
> 
> 1) Get the device to DMA into host memory.
> 

Unfortunately, our device doesn't support doing DMA of on-chip memory.

> 2) Use mmap() (and vm_iomap_memory() in your driver) to get direct
>    userspace access to the (I assume) PCIe memory space.
>    You can then use whatever copy instructions the cpu has.
>    (Just don't use memcpy().)
> 

We also need to collect this in kernel space i.e. from crash recovery
kernel.

Thanks,
Rahul

^ permalink raw reply

* Re: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write
From: Rahul Lakkireddy @ 2018-03-21 12:27 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Thomas Gleixner, x86@kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, mingo@redhat.com, hpa@zytor.com,
	davem@davemloft.net, akpm@linux-foundation.org,
	torvalds@linux-foundation.org, Ganesh GR, Nirranjan Kirubaharan,
	Indranil Choudhury
In-Reply-To: <CAHp75VdTPzTK-MgrqU0N43YaNrwpN4pq8PwKAkcFQ-y_n6ez2A@mail.gmail.com>

On Tuesday, March 03/20/18, 2018 at 19:14:46 +0530, Andy Shevchenko wrote:
> On Tue, Mar 20, 2018 at 3:32 PM, Rahul Lakkireddy
> <rahul.lakkireddy@chelsio.com> wrote:
> > On Monday, March 03/19/18, 2018 at 20:13:10 +0530, Thomas Gleixner wrote:
> >> On Mon, 19 Mar 2018, Rahul Lakkireddy wrote:
> 
> >> Aside of that I very much doubt that this is faster than 4 consecutive
> >> 64bit reads/writes as you have the full overhead of
> >> kernel_fpu_begin()/end() for each access.
> >>
> >> You did not provide any numbers for this so its even harder to
> >> determine.
> >>
> >
> > Sorry about that.  Here are the numbers with and without this series.
> >
> > When reading up to 2 GB on-chip memory via MMIO, the time taken:
> >
> > Without Series        With Series
> > (64-bit read)         (256-bit read)
> >
> > 52 seconds            26 seconds
> >
> > As can be seen, we see good improvement with doing 256-bits at a
> > time.
> 
> But this is kinda synthetic test, right?
> If you run in a normal use case where kernel not only collecting logs,
> but doing something else, especially with frequent userspace
> interaction, would the trend be the same?
> 

We see the same improvement when collecting logs while running
heavy IO with iozone.

Thanks,
Rahul

^ permalink raw reply

* pull-request: mac80211 2018-03-21
From: Johannes Berg @ 2018-03-21 12:06 UTC (permalink / raw)
  To: David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA

Hi Dave,

Another few fixes - one for hwsim, so not really all that interesting,
and two patches to work around an ath9k_htc problem.

Note that I pulled your net tree today, so you may need to be careful
to not fast-forward if you don't merge anything else before this.

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 5f2fb802eee1df0810b47ea251942fe3fd36589a:

  ipv6: old_dport should be a __be16 in __ip6_datagram_connect() (2018-03-20 12:43:43 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2018-03-21

for you to fetch changes up to 60b01bcce97191f473fa869df2713143936d6ef4:

  ath9k_htc: use non-QoS NDP for AP probing (2018-03-21 13:01:55 +0100)

----------------------------------------------------------------
Two more fixes (in three patches):
 * ath9k_htc doesn't like QoS NDP frames, use regular ones
 * hwsim: set up wmediumd for radios created later

----------------------------------------------------------------
Andrew Zaborowski (1):
      mac80211_hwsim: Set wmediumd for new radios

Ben Caradoc-Davies (1):
      mac80211: add ieee80211_hw flag for QoS NDP support

Johannes Berg (1):
      ath9k_htc: use non-QoS NDP for AP probing

 drivers/net/wireless/ath/ath9k/htc_drv_init.c | 1 +
 drivers/net/wireless/mac80211_hwsim.c         | 1 +
 include/net/mac80211.h                        | 4 ++++
 net/mac80211/debugfs.c                        | 1 +
 net/mac80211/mlme.c                           | 3 ++-
 5 files changed, 9 insertions(+), 1 deletion(-)

^ permalink raw reply

* Re: HW question: i210 vs. BCM5461S over SGMII: no response from PHY to MDIO requests?
From: Frantisek Rysanek @ 2018-03-21 11:50 UTC (permalink / raw)
  To: Andrew Lunn, netdev

On 21 Mar 2018 at 11:47, Andrew Lunn , netdev@vger.ker wrote:
> Another question is, how to write the driver's initialization 
> sequence, for it to correctly decide if the SFP is SERDES or SGMII or 
> what. I could just follow the config obtained from the i210 EEPROM.
> Alternatively, or somehow combined to that, I could try checking if 
> something responds to me over i2c, and do a quick check for SPD 
> EEPROM. If I find one, do a check for "MII regs over i2c" - at all 
> i2c slave addresses between 0x41..0x59  EXCEPT 0x50/0x51.
> If I find MII-i2c regs, it's clear that the transceiver contains a 
> PHY, and wants to run in SGMII mode - otherwise it's a SERDES thing.
> If nothing is found in i2c mode, and inherited i210 config indicates 
> SGMII, try external MDIO... if that fails, revert to SERDES.
> 
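
The decision sequence quoted above can be written down as a small function.
This is a sketch only: the enum, the ops structure and detect_mode() are
hypothetical names, the probe hooks are placeholders for real I2C and MDIO
accesses, and the fake modules exist just to exercise the logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sfp_mode_sketch { MODE_SERDES, MODE_SGMII_I2C, MODE_SGMII_MDIO };

/* Hypothetical probe hooks; a real driver would back these with hardware. */
struct sfp_probe_ops {
	bool (*i2c_responds)(void *priv, int addr);	/* device ACKs at addr */
	bool (*mii_over_i2c)(void *priv, int addr);	/* MII regs readable */
	bool (*mdio_responds)(void *priv);		/* PHY on external MDIO */
	void *priv;
};

#define SFP_EEPROM_ADDR	0x50	/* module EEPROM */
#define SFP_DIAG_ADDR	0x51	/* diagnostics block, skipped in the scan */

static enum sfp_mode_sketch detect_mode(const struct sfp_probe_ops *ops,
					bool nic_cfg_sgmii)
{
	int addr;

	assert(ops && ops->i2c_responds && ops->mii_over_i2c &&
	       ops->mdio_responds);

	if (ops->i2c_responds(ops->priv, SFP_EEPROM_ADDR)) {
		/* Module EEPROM found: look for MII registers over I2C. */
		for (addr = 0x41; addr <= 0x59; addr++) {
			if (addr == SFP_EEPROM_ADDR || addr == SFP_DIAG_ADDR)
				continue;
			if (ops->mii_over_i2c(ops->priv, addr))
				return MODE_SGMII_I2C;
		}
		return MODE_SERDES;	/* module without a PHY */
	}

	/* Nothing on I2C: follow the inherited NIC config, try MDIO. */
	if (nic_cfg_sgmii && ops->mdio_responds(ops->priv))
		return MODE_SGMII_MDIO;
	return MODE_SERDES;
}

/* Fake hardware for the example. */
static bool i2c_copper(void *p, int a) { (void)p; return a == 0x50 || a == 0x56; }
static bool mii_at_56(void *p, int a) { (void)p; return a == 0x56; }
static bool no_i2c(void *p, int a) { (void)p; (void)a; return false; }
static bool no_mii(void *p, int a) { (void)p; (void)a; return false; }
static bool mdio_yes(void *p) { (void)p; return true; }
static bool mdio_no(void *p) { (void)p; return false; }
```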

I was also wondering if someone has written any kernel-space support 
for the SFP's. Sure enough, I've found lots of code by Russell King 
under drivers/net/phy. I started reading from sfp.c, went on to 
sfp-bus.c, next the phylink stuff... Answers lots of my questions. 
Clearly someone has "been there and done that" - I mean how to 
interpret SFP EEPROM bits and act upon them in the PHY 
initialization. 
And I keep wondering where to start :-)
If I grep phylink recursively in drivers/net/ethernet, I get no 
hits... Any chance of hooking this up to the igb driver in a clean 
way? There are notes in the phylib drivers that this is "platform" 
stuff - a keyword which speaks to me of stuff hardwired onboard in 
embedded motherboards (is Russell King the father of Linux on ARM ?), 
rather than general-purpose PnP and addon boards. I'd love to use 
existing code of the phylib to talk to the SFP PHY's, maybe extend 
the phylib a bit (with the phy's I have), rather than cobble together 
something crude and private on my own, inside the igb driver. 
If you have any pointers, let me know.

Frank

^ permalink raw reply

* Re: [PATCH net-next 01/14] tcp: Add clean acked data hook
From: Boris Pismenny @ 2018-03-21 11:21 UTC (permalink / raw)
  To: Rao Shoaib, Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Ilya Lesokhin, Aviad Yehezkel
In-Reply-To: <1dc8963f-a279-b03c-c0b8-805f10e1c41a@oracle.com>



On 3/20/2018 10:36 PM, Rao Shoaib wrote:
> 
> 
> On 03/19/2018 07:44 PM, Saeed Mahameed wrote:
>> From: Ilya Lesokhin <ilyal@mellanox.com>
>>
>> Called when a TCP segment is acknowledged.
>> Could be used by application protocols who hold additional
>> metadata associated with the stream data.
>>
>> This is required by TLS device offload to release
>> metadata associated with acknowledged TLS records.
>>
>> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
>> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
>> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>   include/net/inet_connection_sock.h | 2 ++
>>   net/ipv4/tcp_input.c               | 2 ++
>>   2 files changed, 4 insertions(+)
>>
>> diff --git a/include/net/inet_connection_sock.h 
>> b/include/net/inet_connection_sock.h
>> index b68fea022a82..2ab6667275df 100644
>> --- a/include/net/inet_connection_sock.h
>> +++ b/include/net/inet_connection_sock.h
>> @@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
>>    * @icsk_af_ops           Operations which are AF_INET{4,6} specific
>>    * @icsk_ulp_ops       Pluggable ULP control hook
>>    * @icsk_ulp_data       ULP private data
>> + * @icsk_clean_acked       Clean acked data hook
>>    * @icsk_listen_portaddr_node    hash to the portaddr listener 
>> hashtable
>>    * @icsk_ca_state:       Congestion control state
>>    * @icsk_retransmits:       Number of unrecovered [RTO] timeouts
>> @@ -102,6 +103,7 @@ struct inet_connection_sock {
>>       const struct inet_connection_sock_af_ops *icsk_af_ops;
>>       const struct tcp_ulp_ops  *icsk_ulp_ops;
>>       void              *icsk_ulp_data;
>> +    void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
>>       struct hlist_node         icsk_listen_portaddr_node;
>>       unsigned int          (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
>>       __u8              icsk_ca_state:6,
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 451ef3012636..9854ecae7245 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -3542,6 +3542,8 @@ static int tcp_ack(struct sock *sk, const struct 
>> sk_buff *skb, int flag)
>>       if (after(ack, prior_snd_una)) {
>>           flag |= FLAG_SND_UNA_ADVANCED;
>>           icsk->icsk_retransmits = 0;
>> +        if (icsk->icsk_clean_acked)
>> +            icsk->icsk_clean_acked(sk, ack);
>>       }
>>       prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : 
>> tp->snd_una;
> Per Dave we are not allowed to use function pointers any more, so why 
> extend their use. I implemented a similar callback for my changes but in 
> my use case I need to call the meta data update function even when the 
> packet does not ack any new data or has no payload. Is it possible to 
> move this to say tcp_data_queue() ?

Sometimes function pointers are unavoidable. For example, when a module 
must change the functionality of a function. I think it is preferable to 
advance the kernel

This function is used to free memory based on new acknowledged data. It 
is unrelated to whether data was received or not. So it is not possible 
to move this call to tcp_data_queue.

Just in case, I'll add a static key here to reduce the impact on the 
fast-path as once suggested by EricD on netdev2.2.

> 
> Thanks,
> 
> Shoaib
> 
> 

Best,
Boris.

^ permalink raw reply

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
From: Kirill Tkhai @ 2018-03-21 11:15 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <20180320024510.7408-7-saeedm@mellanox.com>

On 20.03.2018 05:45, Saeed Mahameed wrote:
> From: Ilya Lesokhin <ilyal@mellanox.com>
> 
> This patch adds a generic infrastructure to offload TLS crypto to
> network devices. It enables the kernel TLS socket to skip encryption
> and authentication operations on the transmit side of the data path,
> leaving those computationally expensive operations to the NIC.
> 
> The NIC offload infrastructure builds TLS records and pushes them to
> the TCP layer just like the SW KTLS implementation and using the same API.
> TCP segmentation is mostly unaffected. Currently the only exception is
> that we prevent mixed SKBs where only part of the payload requires
> offload. In the future we are likely to add a similar restriction
> following a change cipher spec record.
> 
> The notable differences between SW KTLS and NIC offloaded TLS
> implementations are as follows:
> 1. The offloaded implementation builds "plaintext TLS records"; those
> records contain plaintext instead of ciphertext and placeholder bytes
> instead of authentication tags.
> 2. The offloaded implementation maintains a mapping from TCP sequence
> number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
> TLS socket, we can use the tls NIC offload infrastructure to obtain
> enough context to encrypt the payload of the SKB.
> A TLS record is released when the last byte of the record is ack'ed,
> this is done through the new icsk_clean_acked callback.
> 
> The infrastructure should be extendable to support various NIC offload
> implementations.  However it is currently written with the
> implementation below in mind:
> The NIC assumes that packets from each offloaded stream are sent as
> plaintext and in-order. It keeps track of the TLS records in the TCP
> stream. When a packet marked for offload is transmitted, the NIC
> encrypts the payload in-place and puts authentication tags in the
> relevant place holders.
> 
> The responsibility for handling out-of-order packets (i.e. TCP
> retransmission, qdisc drops) falls on the netdev driver.
> 
> The netdev driver keeps track of the expected TCP SN from the NIC's
> perspective.  If the next packet to transmit matches the expected TCP
> SN, the driver advances the expected TCP SN, and transmits the packet
> with TLS offload indication.
> 
> If the next packet to transmit does not match the expected TCP SN, the
> driver calls the TLS layer to obtain the TLS record that includes the
> TCP SN of the packet for transmission. Using this TLS record, the driver
> posts a work entry on the transmit queue to reconstruct the NIC TLS
> state required for the offload of the out-of-order packet. It updates
> the expected TCP SN accordingly and transmits the now in-order packet.
> The same queue is used for packet transmission and TLS context
> reconstruction to avoid the need for flushing the transmit queue before
> issuing the context reconstruction request.
> 
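
A condensed sketch of the driver-side bookkeeping described in the commit
message, for orientation only. All names are hypothetical, an array scan
replaces the record list, the resync counter stands in for posting a work
entry on the transmit queue, and refusing the packet when no record covers
it is an assumption of this example, not something the patch states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record bookkeeping, modelled on end_seq/len in the patch. */
struct rec_sketch {
	uint32_t end_seq;	/* TCP sequence number just past the record */
	uint32_t len;		/* record length in bytes */
};

static uint32_t rec_start_seq(const struct rec_sketch *r)
{
	return r->end_seq - r->len;
}

/* Find the record that contains seq (wraparound-safe), or NULL. */
static const struct rec_sketch *find_record(const struct rec_sketch *recs,
					    size_t n, uint32_t seq)
{
	size_t i;

	assert(recs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		uint32_t off = (uint32_t)(seq - rec_start_seq(&recs[i]));

		if (off < recs[i].len)
			return &recs[i];
	}
	return NULL;
}

struct drv_sketch {
	uint32_t expected_seq;	/* next TCP SN the NIC expects */
	unsigned int resyncs;	/* context reconstructions requested */
};

/* Returns true if the packet can be sent with the offload indication. */
static bool drv_xmit(struct drv_sketch *d, const struct rec_sketch *recs,
		     size_t n, uint32_t seq, uint32_t payload_len)
{
	assert(d != NULL);

	if (seq != d->expected_seq) {
		/* Out of order: need the record covering this sequence. */
		if (!find_record(recs, n, seq))
			return false;
		d->resyncs++;	/* rebuild NIC state before this packet */
	}
	d->expected_seq = seq + payload_len;
	return true;
}
```
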
> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  include/net/tls.h             |  70 +++-
>  net/tls/Kconfig               |  10 +
>  net/tls/Makefile              |   2 +
>  net/tls/tls_device.c          | 804 ++++++++++++++++++++++++++++++++++++++++++
>  net/tls/tls_device_fallback.c | 419 ++++++++++++++++++++++
>  net/tls/tls_main.c            |  33 +-
>  6 files changed, 1331 insertions(+), 7 deletions(-)
>  create mode 100644 net/tls/tls_device.c
>  create mode 100644 net/tls/tls_device_fallback.c
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 4913430ab807..ab98a6dc4929 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -77,6 +77,37 @@ struct tls_sw_context {
>  	struct scatterlist sg_aead_out[2];
>  };
>  
> +struct tls_record_info {
> +	struct list_head list;
> +	u32 end_seq;
> +	int len;
> +	int num_frags;
> +	skb_frag_t frags[MAX_SKB_FRAGS];
> +};
> +
> +struct tls_offload_context {
> +	struct crypto_aead *aead_send;
> +	spinlock_t lock;	/* protects records list */
> +	struct list_head records_list;
> +	struct tls_record_info *open_record;
> +	struct tls_record_info *retransmit_hint;
> +	u64 hint_record_sn;
> +	u64 unacked_record_sn;
> +
> +	struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
> +	void (*sk_destruct)(struct sock *sk);
> +	u8 driver_state[];
> +	/* The TLS layer reserves room for driver specific state
> +	 * Currently the belief is that there is not enough
> +	 * driver specific state to justify another layer of indirection
> +	 */
> +#define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *)))
> +};
> +
> +#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
> +	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
> +	 TLS_DRIVER_STATE_SIZE)
> +
>  enum {
>  	TLS_PENDING_CLOSED_RECORD
>  };
> @@ -87,6 +118,10 @@ struct tls_context {
>  		struct tls12_crypto_info_aes_gcm_128 crypto_send_aes_gcm_128;
>  	};
>  
> +	struct list_head list;
> +	struct net_device *netdev;
> +	refcount_t refcount;
> +
>  	void *priv_ctx;
>  
>  	u8 tx_conf:2;
> @@ -131,9 +166,29 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
>  void tls_sw_close(struct sock *sk, long timeout);
>  void tls_sw_free_tx_resources(struct sock *sk);
>  
> -void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
> -void tls_icsk_clean_acked(struct sock *sk);
> +void tls_clear_device_offload(struct sock *sk, struct tls_context *ctx);
> +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx);
> +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
> +int tls_device_sendpage(struct sock *sk, struct page *page,
> +			int offset, size_t size, int flags);
> +void tls_device_sk_destruct(struct sock *sk);
> +void tls_device_init(void);
> +void tls_device_cleanup(void);
>  
> +struct tls_record_info *tls_get_record(struct tls_offload_context *context,
> +				       u32 seq, u64 *p_record_sn);
> +
> +static inline bool tls_record_is_start_marker(struct tls_record_info *rec)
> +{
> +	return rec->len == 0;
> +}
> +
> +static inline u32 tls_record_start_seq(struct tls_record_info *rec)
> +{
> +	return rec->end_seq - rec->len;
> +}
> +
> +void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
>  int tls_push_sg(struct sock *sk, struct tls_context *ctx,
>  		struct scatterlist *sg, u16 first_offset,
>  		int flags);
> @@ -170,6 +225,13 @@ static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx)
>  	return tls_ctx->pending_open_record_frags;
>  }
>  
> +static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk)
> +{
> +	return sk_fullsock(sk) &&
> +	       /* matches smp_store_release in tls_set_device_offload */
> +	       smp_load_acquire(&sk->sk_destruct) == &tls_device_sk_destruct;
> +}
> +
>  static inline void tls_err_abort(struct sock *sk)
>  {
>  	sk->sk_err = EBADMSG;
> @@ -257,4 +319,8 @@ static inline struct tls_offload_context *tls_offload_ctx(
>  int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
>  		      unsigned char *record_type);
>  
> +int tls_sw_fallback_init(struct sock *sk,
> +			 struct tls_offload_context *offload_ctx,
> +			 struct tls_crypto_info *crypto_info);
> +
>  #endif /* _TLS_OFFLOAD_H */
> diff --git a/net/tls/Kconfig b/net/tls/Kconfig
> index eb583038c67e..9d3ef820bb16 100644
> --- a/net/tls/Kconfig
> +++ b/net/tls/Kconfig
> @@ -13,3 +13,13 @@ config TLS
>  	encryption handling of the TLS protocol to be done in-kernel.
>  
>  	If unsure, say N.
> +
> +config TLS_DEVICE
> +	bool "Transport Layer Security HW offload"
> +	depends on TLS
> +	select SOCK_VALIDATE_XMIT
> +	default n
> +	---help---
> +	Enable kernel support for HW offload of the TLS protocol.
> +
> +	If unsure, say N.
> diff --git a/net/tls/Makefile b/net/tls/Makefile
> index a930fd1c4f7b..4d6b728a67d0 100644
> --- a/net/tls/Makefile
> +++ b/net/tls/Makefile
> @@ -5,3 +5,5 @@
>  obj-$(CONFIG_TLS) += tls.o
>  
>  tls-y := tls_main.o tls_sw.o
> +
> +tls-$(CONFIG_TLS_DEVICE) += tls_device.o tls_device_fallback.o
> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> new file mode 100644
> index 000000000000..c0d4e11a4286
> --- /dev/null
> +++ b/net/tls/tls_device.c
> @@ -0,0 +1,804 @@
> +/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + *      - Neither the name of the Mellanox Technologies nor the
> + *        names of its contributors may be used to endorse or promote
> + *        products derived from this software without specific prior written
> + *        permission.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED.
> + * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
> + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
> + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
> + * POSSIBILITY OF SUCH DAMAGE
> + */

Other patches have two licenses in their headers. Can this file be distributed under GPL license terms?

> +#include <linux/module.h>
> +#include <net/tcp.h>
> +#include <net/inet_common.h>
> +#include <linux/highmem.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/tls.h>
> +#include <crypto/aead.h>
> +
> +/* device_offload_lock is used to synchronize tls_dev_add
> + * against NETDEV_DOWN notifications.
> + */
> +DEFINE_STATIC_PERCPU_RWSEM(device_offload_lock);
> +
> +static void tls_device_gc_task(struct work_struct *work);
> +
> +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
> +static LIST_HEAD(tls_device_gc_list);
> +static LIST_HEAD(tls_device_list);
> +static DEFINE_SPINLOCK(tls_device_lock);
> +
> +static void tls_device_free_ctx(struct tls_context *ctx)
> +{
> +	struct tls_offload_context *offlad_ctx = tls_offload_ctx(ctx);
> +
> +	kfree(offlad_ctx);
> +	kfree(ctx);
> +}
> +
> +static void tls_device_gc_task(struct work_struct *work)
> +{
> +	struct tls_context *ctx, *tmp;
> +	struct list_head gc_list;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tls_device_lock, flags);
> +	INIT_LIST_HEAD(&gc_list);

This is a stack variable, so it should be initialized outside of the global spinlock.
The kernel has the LIST_HEAD() primitive for that.
There is one more similar place below.
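
Something along these lines is what I mean. This is only a userspace sketch, not kernel code: the list helpers are minimal stand-ins for <linux/list.h>, gc_lock()/gc_unlock() are hypothetical placeholders for the tls_device_lock spinlock, and gc_pending models tls_device_gc_list. The point is just that LIST_HEAD() initializes the local head at declaration, so nothing but the splice remains under the lock.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's <linux/list.h> helpers. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Move every entry of @from to the front of @to and reinitialize @from. */
static void list_splice_init(struct list_head *from, struct list_head *to)
{
	if (list_empty(from))
		return;
	from->next->prev = to;
	from->prev->next = to->next;
	to->next->prev = from->prev;
	to->next = from->next;
	from->next = from->prev = from;
}

/* Hypothetical placeholder for tls_device_lock; only tracks "held". */
static int gc_lock_held;
static void gc_lock(void)   { assert(!gc_lock_held); gc_lock_held = 1; }
static void gc_unlock(void) { assert(gc_lock_held);  gc_lock_held = 0; }

static LIST_HEAD(gc_pending);	/* models tls_device_gc_list */

/* Detach all pending entries and return how many were detached. */
static int gc_collect(void)
{
	LIST_HEAD(gc_list);	/* initialized here, before the lock is taken */
	struct list_head *pos;
	int n = 0;

	gc_lock();
	list_splice_init(&gc_pending, &gc_list);
	gc_unlock();

	for (pos = gc_list.next; pos != &gc_list; pos = pos->next)
		n++;
	return n;
}
```

The critical section then shrinks to the single list_splice_init() call.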

> +	list_splice_init(&tls_device_gc_list, &gc_list);
> +	spin_unlock_irqrestore(&tls_device_lock, flags);
> +
> +	list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
> +		struct net_device *netdev = ctx->netdev;
> +
> +		if (netdev) {
> +			netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> +							TLS_OFFLOAD_CTX_DIR_TX);
> +			dev_put(netdev);
> +		}

How is it possible to meet a NULL netdev here?

> +
> +		list_del(&ctx->list);
> +		tls_device_free_ctx(ctx);
> +	}
> +}
> +
> +static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tls_device_lock, flags);
> +	list_move_tail(&ctx->list, &tls_device_gc_list);
> +
> +	/* schedule_work inside the spinlock
> +	 * to make sure tls_device_down waits for that work.
> +	 */
> +	schedule_work(&tls_device_gc_work);
> +
> +	spin_unlock_irqrestore(&tls_device_lock, flags);
> +}
> +
> +/* We assume that the socket is already connected */
> +static struct net_device *get_netdev_for_sock(struct sock *sk)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct net_device *netdev = NULL;
> +
> +	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> +
> +	return netdev;
> +}
> +
> +static int attach_sock_to_netdev(struct sock *sk, struct net_device *netdev,
> +				 struct tls_context *ctx)
> +{
> +	int rc;
> +
> +	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
> +					     &ctx->crypto_send,
> +					     tcp_sk(sk)->write_seq);
> +	if (rc) {
> +		pr_err_ratelimited("The netdev has refused to offload this socket\n");
> +		goto out;
> +	}
> +
> +	rc = 0;
> +out:
> +	return rc;
> +}
> +
> +static void destroy_record(struct tls_record_info *record)
> +{
> +	skb_frag_t *frag;
> +	int nr_frags = record->num_frags;
> +
> +	while (nr_frags > 0) {
> +		frag = &record->frags[nr_frags - 1];
> +		__skb_frag_unref(frag);
> +		--nr_frags;
> +	}
> +	kfree(record);
> +}
> +
> +static void delete_all_records(struct tls_offload_context *offload_ctx)
> +{
> +	struct tls_record_info *info, *temp;
> +
> +	list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) {
> +		list_del(&info->list);
> +		destroy_record(info);
> +	}
> +
> +	offload_ctx->retransmit_hint = NULL;
> +}
> +
> +static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx;
> +	struct tls_record_info *info, *temp;
> +	unsigned long flags;
> +	u64 deleted_records = 0;
> +
> +	if (!tls_ctx)
> +		return;
> +
> +	ctx = tls_offload_ctx(tls_ctx);
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	info = ctx->retransmit_hint;
> +	if (info && !before(acked_seq, info->end_seq)) {
> +		ctx->retransmit_hint = NULL;
> +		list_del(&info->list);
> +		destroy_record(info);
> +		deleted_records++;
> +	}
> +
> +	list_for_each_entry_safe(info, temp, &ctx->records_list, list) {
> +		if (before(acked_seq, info->end_seq))
> +			break;
> +		list_del(&info->list);
> +
> +		destroy_record(info);
> +		deleted_records++;
> +	}
> +
> +	ctx->unacked_record_sn += deleted_records;
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +}
> +
> +/* At this point, there should be no references on this
> + * socket and no in-flight SKBs associated with this
> + * socket, so it is safe to free all the resources.
> + */
> +void tls_device_sk_destruct(struct sock *sk)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
> +
> +	if (ctx->open_record)
> +		destroy_record(ctx->open_record);
> +
> +	delete_all_records(ctx);
> +	crypto_free_aead(ctx->aead_send);
> +	ctx->sk_destruct(sk);
> +
> +	if (refcount_dec_and_test(&tls_ctx->refcount))
> +		tls_device_queue_ctx_destruction(tls_ctx);
> +}
> +EXPORT_SYMBOL(tls_device_sk_destruct);
> +
> +static inline void tls_append_frag(struct tls_record_info *record,
> +				   struct page_frag *pfrag,
> +				   int size)
> +{
> +	skb_frag_t *frag;
> +
> +	frag = &record->frags[record->num_frags - 1];
> +	if (frag->page.p == pfrag->page &&
> +	    frag->page_offset + frag->size == pfrag->offset) {
> +		frag->size += size;
> +	} else {
> +		++frag;
> +		frag->page.p = pfrag->page;
> +		frag->page_offset = pfrag->offset;
> +		frag->size = size;
> +		++record->num_frags;
> +		get_page(pfrag->page);
> +	}
> +
> +	pfrag->offset += size;
> +	record->len += size;
> +}
> +
> +static inline int tls_push_record(struct sock *sk,
> +				  struct tls_context *ctx,
> +				  struct tls_offload_context *offload_ctx,
> +				  struct tls_record_info *record,
> +				  struct page_frag *pfrag,
> +				  int flags,
> +				  unsigned char record_type)
> +{
> +	skb_frag_t *frag;
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	struct page_frag fallback_frag;
> +	struct page_frag  *tag_pfrag = pfrag;
> +	int i;
> +
> +	/* fill prepand */
> +	frag = &record->frags[0];
> +	tls_fill_prepend(ctx,
> +			 skb_frag_address(frag),
> +			 record->len - ctx->prepend_size,
> +			 record_type);
> +
> +	if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag, GFP_KERNEL))) {
> +		/* HW doesn't care about the data in the tag
> +		 * so in case pfrag has no room
> +		 * for a tag and we can't allocate a new pfrag
> +		 * just use the page in the first frag
> +		 * rather then write a complicated fall back code.
> +		 */
> +		tag_pfrag = &fallback_frag;
> +		tag_pfrag->page = skb_frag_page(frag);
> +		tag_pfrag->offset = 0;
> +	}
> +
> +	tls_append_frag(record, tag_pfrag, ctx->tag_size);
> +	record->end_seq = tp->write_seq + record->len;
> +	spin_lock_irq(&offload_ctx->lock);
> +	list_add_tail(&record->list, &offload_ctx->records_list);
> +	spin_unlock_irq(&offload_ctx->lock);
> +	offload_ctx->open_record = NULL;
> +	set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
> +	tls_advance_record_sn(sk, ctx);
> +
> +	for (i = 0; i < record->num_frags; i++) {
> +		frag = &record->frags[i];
> +		sg_unmark_end(&offload_ctx->sg_tx_data[i]);
> +		sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
> +			    frag->size, frag->page_offset);
> +		sk_mem_charge(sk, frag->size);
> +		get_page(skb_frag_page(frag));
> +	}
> +	sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]);
> +
> +	/* all ready, send */
> +	return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
> +}
> +
> +static inline int tls_create_new_record(struct tls_offload_context *offload_ctx,
> +					struct page_frag *pfrag,
> +					size_t prepend_size)
> +{
> +	skb_frag_t *frag;
> +	struct tls_record_info *record;
> +
> +	record = kmalloc(sizeof(*record), GFP_KERNEL);
> +	if (!record)
> +		return -ENOMEM;
> +
> +	frag = &record->frags[0];
> +	__skb_frag_set_page(frag, pfrag->page);
> +	frag->page_offset = pfrag->offset;
> +	skb_frag_size_set(frag, prepend_size);
> +
> +	get_page(pfrag->page);
> +	pfrag->offset += prepend_size;
> +
> +	record->num_frags = 1;
> +	record->len = prepend_size;
> +	offload_ctx->open_record = record;
> +	return 0;
> +}
> +
> +static inline int tls_do_allocation(struct sock *sk,
> +				    struct tls_offload_context *offload_ctx,
> +				    struct page_frag *pfrag,
> +				    size_t prepend_size)
> +{
> +	int ret;
> +
> +	if (!offload_ctx->open_record) {
> +		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
> +						   sk->sk_allocation))) {
> +			sk->sk_prot->enter_memory_pressure(sk);
> +			sk_stream_moderate_sndbuf(sk);
> +			return -ENOMEM;
> +		}
> +
> +		ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
> +		if (ret)
> +			return ret;
> +
> +		if (pfrag->size > pfrag->offset)
> +			return 0;
> +	}
> +
> +	if (!sk_page_frag_refill(sk, pfrag))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static int tls_push_data(struct sock *sk,
> +			 struct iov_iter *msg_iter,
> +			 size_t size, int flags,
> +			 unsigned char record_type)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
> +	struct tls_record_info *record = ctx->open_record;
> +	struct page_frag *pfrag;
> +	int copy, rc = 0;
> +	size_t orig_size = size;
> +	u32 max_open_record_len;
> +	long timeo;
> +	int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
> +	int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
> +	bool done = false;
> +
> +	if (flags &
> +	    ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST))
> +		return -ENOTSUPP;
> +
> +	if (sk->sk_err)
> +		return -sk->sk_err;
> +
> +	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
> +	rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo);
> +	if (rc < 0)
> +		return rc;
> +
> +	pfrag = sk_page_frag(sk);
> +
> +	/* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and
> +	 * we need to leave room for an authentication tag.
> +	 */
> +	max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
> +			      tls_ctx->prepend_size;
> +	do {
> +		if (tls_do_allocation(sk, ctx, pfrag,
> +				      tls_ctx->prepend_size)) {
> +			rc = sk_stream_wait_memory(sk, &timeo);
> +			if (!rc)
> +				continue;
> +
> +			record = ctx->open_record;
> +			if (!record)
> +				break;
> +handle_error:
> +			if (record_type != TLS_RECORD_TYPE_DATA) {
> +				/* avoid sending partial
> +				 * record with type !=
> +				 * application_data
> +				 */
> +				size = orig_size;
> +				destroy_record(record);
> +				ctx->open_record = NULL;
> +			} else if (record->len > tls_ctx->prepend_size) {
> +				goto last_record;
> +			}
> +
> +			break;
> +		}
> +
> +		record = ctx->open_record;
> +		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
> +		copy = min_t(size_t, copy, (max_open_record_len - record->len));
> +
> +		if (copy_from_iter_nocache(page_address(pfrag->page) +
> +					       pfrag->offset,
> +					   copy, msg_iter) != copy) {
> +			rc = -EFAULT;
> +			goto handle_error;
> +		}
> +		tls_append_frag(record, pfrag, copy);
> +
> +		size -= copy;
> +		if (!size) {
> +last_record:
> +			tls_push_record_flags = flags;
> +			if (more) {
> +				tls_ctx->pending_open_record_frags =
> +						record->num_frags;
> +				break;
> +			}
> +
> +			done = true;
> +		}
> +
> +		if ((done) || record->len >= max_open_record_len ||
> +		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
> +			rc = tls_push_record(sk,
> +					     tls_ctx,
> +					     ctx,
> +					     record,
> +					     pfrag,
> +					     tls_push_record_flags,
> +					     record_type);
> +			if (rc < 0)
> +				break;
> +		}
> +	} while (!done);
> +
> +	if (orig_size - size > 0)
> +		rc = orig_size - size;
> +
> +	return rc;
> +}
> +
> +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> +{
> +	unsigned char record_type = TLS_RECORD_TYPE_DATA;
> +	int rc = 0;
> +
> +	lock_sock(sk);
> +
> +	if (unlikely(msg->msg_controllen)) {
> +		rc = tls_proccess_cmsg(sk, msg, &record_type);
> +		if (rc)
> +			goto out;
> +	}
> +
> +	rc = tls_push_data(sk, &msg->msg_iter, size,
> +			   msg->msg_flags, record_type);
> +
> +out:
> +	release_sock(sk);
> +	return rc;
> +}
> +
> +int tls_device_sendpage(struct sock *sk, struct page *page,
> +			int offset, size_t size, int flags)
> +{
> +	struct iov_iter	msg_iter;
> +	struct kvec iov;
> +	char *kaddr = kmap(page);
> +	int rc = 0;
> +
> +	if (flags & MSG_SENDPAGE_NOTLAST)
> +		flags |= MSG_MORE;
> +
> +	lock_sock(sk);
> +
> +	if (flags & MSG_OOB) {
> +		rc = -ENOTSUPP;
> +		goto out;
> +	}
> +
> +	iov.iov_base = kaddr + offset;
> +	iov.iov_len = size;
> +	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size);
> +	rc = tls_push_data(sk, &msg_iter, size,
> +			   flags, TLS_RECORD_TYPE_DATA);
> +	kunmap(page);
> +
> +out:
> +	release_sock(sk);
> +	return rc;
> +}
> +
> +struct tls_record_info *tls_get_record(struct tls_offload_context *context,
> +				       u32 seq, u64 *p_record_sn)
> +{
> +	struct tls_record_info *info;
> +	u64 record_sn = context->hint_record_sn;
> +
> +	info = context->retransmit_hint;
> +	if (!info ||
> +	    before(seq, info->end_seq - info->len)) {
> +		/* if retransmit_hint is irrelevant start
> +		 * from the begging of the list
> +		 */
> +		info = list_first_entry(&context->records_list,
> +					struct tls_record_info, list);
> +		record_sn = context->unacked_record_sn;
> +	}
> +
> +	list_for_each_entry_from(info, &context->records_list, list) {
> +		if (before(seq, info->end_seq)) {
> +			if (!context->retransmit_hint ||
> +			    after(info->end_seq,
> +				  context->retransmit_hint->end_seq)) {
> +				context->hint_record_sn = record_sn;
> +				context->retransmit_hint = info;
> +			}
> +			*p_record_sn = record_sn;
> +			return info;
> +		}
> +		record_sn++;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(tls_get_record);
> +
> +static int tls_device_push_pending_record(struct sock *sk, int flags)
> +{
> +	struct iov_iter	msg_iter;
> +
> +	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
> +	return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
> +}
> +
> +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
> +{
> +	u16 nonece_size, tag_size, iv_size, rec_seq_size;
> +	struct tls_record_info *start_marker_record;
> +	struct tls_offload_context *offload_ctx;
> +	struct tls_crypto_info *crypto_info;
> +	struct net_device *netdev;
> +	char *iv, *rec_seq;
> +	struct sk_buff *skb;
> +	int rc = -EINVAL;
> +	__be64 rcd_sn;
> +
> +	if (!ctx)
> +		goto out;
> +
> +	if (ctx->priv_ctx) {
> +		rc = -EEXIST;
> +		goto out;
> +	}
> +
> +	/* We support starting offload on multiple sockets
> +	 * concurrently, So we only need a read lock here.
> +	 */
> +	percpu_down_read(&device_offload_lock);
> +	netdev = get_netdev_for_sock(sk);
> +	if (!netdev) {
> +		pr_err_ratelimited("%s: netdev not found\n", __func__);
> +		rc = -EINVAL;
> +		goto release_lock;
> +	}
> +
> +	if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
> +		rc = -ENOTSUPP;
> +		goto release_netdev;
> +	}
> +
> +	/* Avoid offloading if the device is down
> +	 * We don't want to offload new flows after
> +	 * the NETDEV_DOWN event
> +	 */
> +	if (!(netdev->flags & IFF_UP)) {
> +		rc = -EINVAL;
> +		goto release_lock;
> +	}
> +
> +	crypto_info = &ctx->crypto_send;
> +	switch (crypto_info->cipher_type) {
> +	case TLS_CIPHER_AES_GCM_128: {
> +		nonece_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +		tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> +		iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +		iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv;
> +		rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
> +		rec_seq =
> +		 ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq;
> +		break;
> +	}
> +	default:
> +		rc = -EINVAL;
> +		goto release_netdev;
> +	}
> +
> +	start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);

Can we do the memory allocations and simple memory initializations outside the global rwsem?
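
Roughly like the sketch below. It is a userspace model under stated assumptions, not a drop-in replacement: offload_lock_read()/offload_unlock_read() are hypothetical stand-ins for percpu_down_read()/percpu_up_read() on device_offload_lock, the device_ok argument stands for the netdev lookup and feature checks, and struct marker and struct offload_ctx are reduced to the fields needed here. I left the end_seq assignment under the lock, where the patch has it, since I have not checked whether it can move.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for device_offload_lock (reader side only). */
static int offload_lock_readers;
static void offload_lock_read(void)   { offload_lock_readers++; }
static void offload_unlock_read(void) { offload_lock_readers--; }

/* Reduced models of tls_record_info and tls_offload_context. */
struct marker {
	unsigned int end_seq;
	int len;
	int num_frags;
};

struct offload_ctx {
	struct marker *start_marker;
};

/* device_ok models the netdev lookup and NETIF_F_HW_TLS_TX/IFF_UP checks. */
static int set_offload(int device_ok, unsigned int write_seq,
		       struct offload_ctx **out)
{
	struct offload_ctx *ctx;
	struct marker *m;
	int rc = 0;

	/* Allocations and plain initialization: no lock is held yet. */
	assert(offload_lock_readers == 0);
	m = malloc(sizeof(*m));
	if (!m)
		return -ENOMEM;
	ctx = calloc(1, sizeof(*ctx));
	if (!ctx) {
		free(m);
		return -ENOMEM;
	}
	m->len = 0;
	m->num_frags = 0;

	/* Only the device-dependent part runs under the lock. */
	offload_lock_read();
	if (!device_ok) {
		rc = -EINVAL;
		goto unlock;
	}
	m->end_seq = write_seq;
	ctx->start_marker = m;
	*out = ctx;
unlock:
	offload_unlock_read();
	if (rc) {
		free(ctx);
		free(m);
	}
	return rc;
}
```

This also simplifies the error path: the two kfree() labels no longer have to sit between release_netdev and the device teardown.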

> +	if (!start_marker_record) {
> +		rc = -ENOMEM;
> +		goto release_netdev;
> +	}
> +
> +	offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE, GFP_KERNEL);
> +	if (!offload_ctx)
> +		goto free_marker_record;
> +
> +	ctx->priv_ctx = offload_ctx;
> +	rc = attach_sock_to_netdev(sk, netdev, ctx);
> +	if (rc)
> +		goto free_offload_context;
> +
> +	ctx->netdev = netdev;
> +	ctx->prepend_size = TLS_HEADER_SIZE + nonece_size;
> +	ctx->tag_size = tag_size;
> +	ctx->iv_size = iv_size;
> +	ctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
> +			  GFP_KERNEL);
> +	if (!ctx->iv) {
> +		rc = -ENOMEM;
> +		goto detach_sock;
> +	}
> +
> +	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
> +
> +	ctx->rec_seq_size = rec_seq_size;
> +	ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
> +	if (!ctx->rec_seq) {
> +		rc = -ENOMEM;
> +		goto free_iv;
> +	}
> +	memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
> +
> +	/* start at rec_seq - 1 to account for the start marker record */
> +	memcpy(&rcd_sn, ctx->rec_seq, sizeof(rcd_sn));
> +	offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
> +
> +	rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
> +	if (rc)
> +		goto free_rec_seq;
> +
> +	start_marker_record->end_seq = tcp_sk(sk)->write_seq;
> +	start_marker_record->len = 0;
> +	start_marker_record->num_frags = 0;
> +
> +	INIT_LIST_HEAD(&offload_ctx->records_list);
> +	list_add_tail(&start_marker_record->list, &offload_ctx->records_list);
> +	spin_lock_init(&offload_ctx->lock);
> +
> +	inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
> +	ctx->push_pending_record = tls_device_push_pending_record;
> +	offload_ctx->sk_destruct = sk->sk_destruct;
> +
> +	/* TLS offload is greatly simplified if we don't send
> +	 * SKBs where only part of the payload needs to be encrypted.
> +	 * So mark the last skb in the write queue as end of record.
> +	 */
> +	skb = tcp_write_queue_tail(sk);
> +	if (skb)
> +		TCP_SKB_CB(skb)->eor = 1;
> +
> +	refcount_set(&ctx->refcount, 1);
> +	spin_lock_irq(&tls_device_lock);
> +	list_add_tail(&ctx->list, &tls_device_list);
> +	spin_unlock_irq(&tls_device_lock);
> +
> +	/* following this assignment tls_is_sk_tx_device_offloaded
> +	 * will return true and the context might be accessed
> +	 * by the netdev's xmit function.
> +	 */
> +	smp_store_release(&sk->sk_destruct,
> +			  &tls_device_sk_destruct);
> +	goto release_lock;
> +
> +free_rec_seq:
> +	kfree(ctx->rec_seq);
> +free_iv:
> +	kfree(ctx->iv);
> +detach_sock:
> +	netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_TX);
> +free_offload_context:
> +	kfree(offload_ctx);
> +	ctx->priv_ctx = NULL;
> +free_marker_record:
> +	kfree(start_marker_record);
> +release_netdev:
> +	dev_put(netdev);
> +release_lock:
> +	percpu_up_read(&device_offload_lock);
> +out:
> +	return rc;
> +}
> +
> +static int tls_device_register(struct net_device *dev)
> +{
> +	if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
> +		return NOTIFY_BAD;
> +
> +	return NOTIFY_DONE;
> +}

This function is the same as tls_device_feat_change(). Can't we merge
them and avoid duplicating code?

> +static int tls_device_unregister(struct net_device *dev)
> +{
> +	return NOTIFY_DONE;
> +}

This function does nothing, and the next patches do not change it.
Can't we remove it, then?

> +static int tls_device_feat_change(struct net_device *dev)
> +{
> +	if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
> +		return NOTIFY_BAD;
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static int tls_device_down(struct net_device *netdev)
> +{
> +	struct tls_context *ctx, *tmp;
> +	struct list_head list;
> +	unsigned long flags;
> +
> +	if (!(netdev->features & NETIF_F_HW_TLS_TX))
> +		return NOTIFY_DONE;

Can't we move this check into tls_dev_event() and use it for all types of events?
Then we avoid duplicate code.
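
For illustration, the dispatcher could look roughly like this. It is a userspace model, so the constants and struct net_device below are simplified stand-ins (the numeric values are not the kernel's), and tls_device_down() is reduced to a stub. The sketch also folds the identical NETDEV_REGISTER and NETDEV_FEAT_CHANGE handlers into one case and drops the empty NETDEV_UNREGISTER handler.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the values are illustrative, not the kernel's. */
enum { NOTIFY_DONE, NOTIFY_BAD };
enum { NETDEV_DOWN, NETDEV_REGISTER, NETDEV_UNREGISTER, NETDEV_FEAT_CHANGE };
#define NETIF_F_HW_TLS_TX 0x1UL

struct net_device {
	unsigned long features;
	const void *tlsdev_ops;
};

/* Stub: the real function tears down the offloaded contexts. */
static int tls_device_down(struct net_device *dev)
{
	/* The feature check has already been done by the caller. */
	assert(dev->features & NETIF_F_HW_TLS_TX);
	return NOTIFY_DONE;
}

static int tls_dev_event(unsigned long event, struct net_device *dev)
{
	/* One check for every event type. */
	if (!(dev->features & NETIF_F_HW_TLS_TX))
		return NOTIFY_DONE;

	switch (event) {
	case NETDEV_REGISTER:
	case NETDEV_FEAT_CHANGE:
		/* A device advertising TLS TX offload must provide ops. */
		return dev->tlsdev_ops ? NOTIFY_DONE : NOTIFY_BAD;
	case NETDEV_DOWN:
		return tls_device_down(dev);
	}
	return NOTIFY_DONE;
}
```

The behaviour for devices without NETIF_F_HW_TLS_TX stays the same as in the patch: every event returns NOTIFY_DONE.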

> +
> +	/* Request a write lock to block new offload attempts
> +	 */
> +	percpu_down_write(&device_offload_lock);

What is the reason percpu_rwsem is chosen here? It looks like this primitive
gives more advantages to readers than a plain rwsem does, but it also
penalizes writers. That would be fine if tls_device_down() were not called
with rtnl_lock() held from the netdevice notifier. But since netdevice notifiers
are called with rtnl_lock() held, percpu_rwsem will increase the time rtnl_lock()
is held.

Can't we use plain rwsem here instead?

> +
> +	spin_lock_irqsave(&tls_device_lock, flags);
> +	INIT_LIST_HEAD(&list);

This may go outside the global spinlock.

> +	list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) {
> +		if (ctx->netdev != netdev ||
> +		    !refcount_inc_not_zero(&ctx->refcount))
> +			continue;
> +
> +		list_move(&ctx->list, &list);
> +	}
> +	spin_unlock_irqrestore(&tls_device_lock, flags);
> +
> +	list_for_each_entry_safe(ctx, tmp, &list, list)	{
> +		netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> +						TLS_OFFLOAD_CTX_DIR_TX);
> +		ctx->netdev = NULL;
> +		dev_put(netdev);
> +		list_del_init(&ctx->list);
> +
> +		if (refcount_dec_and_test(&ctx->refcount))
> +			tls_device_free_ctx(ctx);
> +	}
> +
> +	percpu_up_write(&device_offload_lock);
> +
> +	flush_work(&tls_device_gc_work);
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static int tls_dev_event(struct notifier_block *this, unsigned long event,
> +			 void *ptr)
> +{
> +	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		return tls_device_register(dev);
> +
> +	case NETDEV_UNREGISTER:
> +		return tls_device_unregister(dev);
> +
> +	case NETDEV_FEAT_CHANGE:
> +		return tls_device_feat_change(dev);
> +
> +	case NETDEV_DOWN:
> +		return tls_device_down(dev);
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block tls_dev_notifier = {
> +	.notifier_call	= tls_dev_event,
> +};
> +
> +void __init tls_device_init(void)
> +{
> +	register_netdevice_notifier(&tls_dev_notifier);
> +}
> +
> +void __exit tls_device_cleanup(void)
> +{
> +	unregister_netdevice_notifier(&tls_dev_notifier);
> +	flush_work(&tls_device_gc_work);
> +}
> diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
> new file mode 100644
> index 000000000000..14d31a36885c
> --- /dev/null
> +++ b/net/tls/tls_device_fallback.c
> @@ -0,0 +1,419 @@
> +/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + *      - Neither the name of the Mellanox Technologies nor the
> + *        names of its contributors may be used to endorse or promote
> + *        products derived from this software without specific prior written
> + *        permission.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED.
> + * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
> + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
> + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
> + * POSSIBILITY OF SUCH DAMAGE
> + */
> +
> +#include <net/tls.h>
> +#include <crypto/aead.h>
> +#include <crypto/scatterwalk.h>
> +#include <net/ip6_checksum.h>
> +
> +static void chain_to_walk(struct scatterlist *sg, struct scatter_walk *walk)
> +{
> +	struct scatterlist *src = walk->sg;
> +	int diff = walk->offset - src->offset;
> +
> +	sg_set_page(sg, sg_page(src),
> +		    src->length - diff, walk->offset);
> +
> +	scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
> +}
> +
> +static int tls_enc_record(struct aead_request *aead_req,
> +			  struct crypto_aead *aead, char *aad, char *iv,
> +			  __be64 rcd_sn, struct scatter_walk *in,
> +			  struct scatter_walk *out, int *in_len)
> +{
> +	struct scatterlist sg_in[3];
> +	struct scatterlist sg_out[3];
> +	unsigned char buf[TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE];
> +	u16 len;
> +	int rc;
> +
> +	len = min_t(int, *in_len, ARRAY_SIZE(buf));
> +
> +	scatterwalk_copychunks(buf, in, len, 0);
> +	scatterwalk_copychunks(buf, out, len, 1);
> +
> +	*in_len -= len;
> +	if (!*in_len)
> +		return 0;
> +
> +	scatterwalk_pagedone(in, 0, 1);
> +	scatterwalk_pagedone(out, 1, 1);
> +
> +	len = buf[4] | (buf[3] << 8);
> +	len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +
> +	tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
> +		     (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
> +
> +	memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf + TLS_HEADER_SIZE,
> +	       TLS_CIPHER_AES_GCM_128_IV_SIZE);
> +
> +	sg_init_table(sg_in, ARRAY_SIZE(sg_in));
> +	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> +	sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
> +	sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
> +	chain_to_walk(sg_in + 1, in);
> +	chain_to_walk(sg_out + 1, out);
> +
> +	*in_len -= len;
> +	if (*in_len < 0) {
> +		*in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> +		if (*in_len < 0)
> +		/* the input buffer doesn't contain the entire record.
> +		 * trim len accordingly. The resulting authentication tag
> +		 * will contain garbage. but we don't care as we won't
> +		 * include any of it in the output skb
> +		 * Note that we assume the output buffer length
> +		 * is larger then input buffer length + tag size
> +		 */
> +			len += *in_len;
> +
> +		*in_len = 0;
> +	}
> +
> +	if (*in_len) {
> +		scatterwalk_copychunks(NULL, in, len, 2);
> +		scatterwalk_pagedone(in, 0, 1);
> +		scatterwalk_copychunks(NULL, out, len, 2);
> +		scatterwalk_pagedone(out, 1, 1);
> +	}
> +
> +	len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> +	aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
> +
> +	rc = crypto_aead_encrypt(aead_req);
> +
> +	return rc;
> +}
> +
> +static void tls_init_aead_request(struct aead_request *aead_req,
> +				  struct crypto_aead *aead)
> +{
> +	aead_request_set_tfm(aead_req, aead);
> +	aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
> +}
> +
> +static struct aead_request *tls_alloc_aead_request(struct crypto_aead *aead,
> +						   gfp_t flags)
> +{
> +	unsigned int req_size = sizeof(struct aead_request) +
> +		crypto_aead_reqsize(aead);
> +	struct aead_request *aead_req;
> +
> +	aead_req = kzalloc(req_size, flags);
> +	if (!aead_req)
> +		return NULL;
> +
> +	tls_init_aead_request(aead_req, aead);
> +	return aead_req;
> +}
> +
> +static int tls_enc_records(struct aead_request *aead_req,
> +			   struct crypto_aead *aead, struct scatterlist *sg_in,
> +			   struct scatterlist *sg_out, char *aad, char *iv,
> +			   u64 rcd_sn, int len)
> +{
> +	struct scatter_walk in;
> +	struct scatter_walk out;
> +	int rc;
> +
> +	scatterwalk_start(&in, sg_in);
> +	scatterwalk_start(&out, sg_out);
> +
> +	do {
> +		rc = tls_enc_record(aead_req, aead, aad, iv,
> +				    cpu_to_be64(rcd_sn), &in, &out, &len);
> +		rcd_sn++;
> +
> +	} while (rc == 0 && len);
> +
> +	scatterwalk_done(&in, 0, 0);
> +	scatterwalk_done(&out, 1, 0);
> +
> +	return rc;
> +}
> +
> +static inline void update_chksum(struct sk_buff *skb, int headln)
> +{
> +	/* Can't use icsk->icsk_af_ops->send_check here because the ip addresses
> +	 * might have been changed by NAT.
> +	 */
> +
> +	const struct ipv6hdr *ipv6h;
> +	const struct iphdr *iph;
> +	struct tcphdr *th = tcp_hdr(skb);
> +	int datalen = skb->len - headln;
> +
> +	/* We only changed the payload so if we are using partial we don't
> +	 * need to update anything.
> +	 */
> +	if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
> +		return;
> +
> +	skb->ip_summed = CHECKSUM_PARTIAL;
> +	skb->csum_start = skb_transport_header(skb) - skb->head;
> +	skb->csum_offset = offsetof(struct tcphdr, check);
> +
> +	if (skb->sk->sk_family == AF_INET6) {
> +		ipv6h = ipv6_hdr(skb);
> +		th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
> +					     datalen, IPPROTO_TCP, 0);
> +	} else {
> +		iph = ip_hdr(skb);
> +		th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
> +					       IPPROTO_TCP, 0);
> +	}
> +}
> +
> +static void complete_skb(struct sk_buff *nskb, struct sk_buff *skb, int headln)
> +{
> +	skb_copy_header(nskb, skb);
> +
> +	skb_put(nskb, skb->len);
> +	memcpy(nskb->data, skb->data, headln);
> +	update_chksum(nskb, headln);
> +
> +	nskb->destructor = skb->destructor;
> +	nskb->sk = skb->sk;
> +	skb->destructor = NULL;
> +	skb->sk = NULL;
> +	refcount_add(nskb->truesize - skb->truesize,
> +		     &nskb->sk->sk_wmem_alloc);
> +}
> +
> +/* This function may be called after the user socket is already
> + * closed so make sure we don't use anything freed during
> + * tls_sk_proto_close here
> + */
> +static struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb)
> +{
> +	int tcp_header_size = tcp_hdrlen(skb);
> +	int tcp_payload_offset = skb_transport_offset(skb) + tcp_header_size;
> +	int payload_len = skb->len - tcp_payload_offset;
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
> +	int remaining, buf_len, resync_sgs, rc, i = 0;
> +	void *buf, *dummy_buf, *iv, *aad;
> +	struct scatterlist *sg_in;
> +	struct scatterlist sg_out[3];
> +	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> +	struct aead_request *aead_req;
> +	struct sk_buff *nskb = NULL;
> +	struct tls_record_info *record;
> +	unsigned long flags;
> +	s32 sync_size;
> +	u64 rcd_sn;
> +
> +	/* worst case is:
> +	 * MAX_SKB_FRAGS in tls_record_info
> +	 * MAX_SKB_FRAGS + 1 in SKB head and frags.
> +	 */
> +	int sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
> +
> +	if (!payload_len)
> +		return skb;
> +
> +	sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in), GFP_ATOMIC);
> +	if (!sg_in)
> +		goto free_orig;
> +
> +	sg_init_table(sg_in, sg_in_max_elements);
> +	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	record = tls_get_record(ctx, tcp_seq, &rcd_sn);
> +	if (!record) {
> +		spin_unlock_irqrestore(&ctx->lock, flags);
> +		WARN(1, "Record not found for seq %u\n", tcp_seq);
> +		goto free_sg;
> +	}
> +
> +	sync_size = tcp_seq - tls_record_start_seq(record);
> +	if (sync_size < 0) {
> +		int is_start_marker = tls_record_is_start_marker(record);
> +
> +		spin_unlock_irqrestore(&ctx->lock, flags);
> +		if (!is_start_marker)
> +		/* This should only occur if the relevant record was
> +		 * already acked. In that case it should be ok
> +		 * to drop the packet and avoid retransmission.
> +		 *
> +		 * There is a corner case where the packet contains
> +		 * both an acked and a non-acked record.
> +		 * We currently don't handle that case and rely
> +		 * on TCP to retransmit a packet that doesn't contain
> +		 * already acked payload.
> +		 */
> +			goto free_orig;
> +
> +		if (payload_len > -sync_size) {
> +			WARN(1, "Fallback of partially offloaded packets is not supported\n");
> +			goto free_sg;
> +		} else {
> +			return skb;
> +		}
> +	}
> +
> +	remaining = sync_size;
> +	while (remaining > 0) {
> +		skb_frag_t *frag = &record->frags[i];
> +
> +		__skb_frag_ref(frag);
> +		sg_set_page(sg_in + i, skb_frag_page(frag),
> +			    skb_frag_size(frag), frag->page_offset);
> +
> +		remaining -= skb_frag_size(frag);
> +
> +		if (remaining < 0)
> +			sg_in[i].length += remaining;
> +
> +		i++;
> +	}
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +	resync_sgs = i;
> +
> +	aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC);
> +	if (!aead_req)
> +		goto put_sg;
> +
> +	buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> +		  TLS_CIPHER_AES_GCM_128_IV_SIZE +
> +		  TLS_AAD_SPACE_SIZE +
> +		  sync_size +
> +		  tls_ctx->tag_size;
> +	buf = kmalloc(buf_len, GFP_ATOMIC);
> +	if (!buf)
> +		goto free_req;
> +
> +	nskb = alloc_skb(skb_headroom(skb) + skb->len, GFP_ATOMIC);
> +	if (!nskb)
> +		goto free_buf;
> +
> +	skb_reserve(nskb, skb_headroom(skb));
> +
> +	iv = buf;
> +
> +	memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
> +	       TLS_CIPHER_AES_GCM_128_SALT_SIZE);
> +	aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> +	      TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +	dummy_buf = aad + TLS_AAD_SPACE_SIZE;
> +
> +	sg_set_buf(&sg_out[0], dummy_buf, sync_size);
> +	sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset,
> +		   payload_len);
> +	/* Add room for authentication tag produced by crypto */
> +	dummy_buf += sync_size;
> +	sg_set_buf(&sg_out[2], dummy_buf, tls_ctx->tag_size);
> +	rc = skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset,
> +			  payload_len);
> +	if (rc < 0)
> +		goto free_nskb;
> +
> +	rc = tls_enc_records(aead_req, ctx->aead_send, sg_in, sg_out, aad, iv,
> +			     rcd_sn, sync_size + payload_len);
> +	if (rc < 0)
> +		goto free_nskb;
> +
> +	complete_skb(nskb, skb, tcp_payload_offset);
> +
> +	/* validate_xmit_skb_list assumes that if the skb wasn't segmented
> +	 * nskb->prev will point to the skb itself
> +	 */
> +	nskb->prev = nskb;
> +free_buf:
> +	kfree(buf);
> +free_req:
> +	kfree(aead_req);
> +put_sg:
> +	for (i = 0; i < resync_sgs; i++)
> +		put_page(sg_page(&sg_in[i]));
> +free_sg:
> +	kfree(sg_in);
> +free_orig:
> +	kfree_skb(skb);
> +	return nskb;
> +
> +free_nskb:
> +	kfree_skb(nskb);
> +	nskb = NULL;
> +	goto free_buf;
> +}
> +
> +static struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
> +					     struct net_device *dev,
> +					     struct sk_buff *skb)
> +{
> +	if (dev == tls_get_ctx(sk)->netdev)
> +		return skb;
> +
> +	return tls_sw_fallback(sk, skb);
> +}
> +
> +int tls_sw_fallback_init(struct sock *sk,
> +			 struct tls_offload_context *offload_ctx,
> +			 struct tls_crypto_info *crypto_info)
> +{
> +	int rc;
> +	const u8 *key;
> +
> +	offload_ctx->aead_send =
> +	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
> +	if (IS_ERR(offload_ctx->aead_send)) {
> +		rc = PTR_ERR(offload_ctx->aead_send);
> +		pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n", rc);
> +		offload_ctx->aead_send = NULL;
> +		goto err_out;
> +	}
> +
> +	key = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->key;
> +
> +	rc = crypto_aead_setkey(offload_ctx->aead_send, key,
> +				TLS_CIPHER_AES_GCM_128_KEY_SIZE);
> +	if (rc)
> +		goto free_aead;
> +
> +	rc = crypto_aead_setauthsize(offload_ctx->aead_send,
> +				     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
> +	if (rc)
> +		goto free_aead;
> +
> +	sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
> +	return 0;
> +free_aead:
> +	crypto_free_aead(offload_ctx->aead_send);
> +err_out:
> +	return rc;
> +}
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index d824d548447e..e0dface33017 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -54,6 +54,9 @@ enum {
>  enum {
>  	TLS_BASE_TX,
>  	TLS_SW_TX,
> +#ifdef CONFIG_TLS_DEVICE
> +	TLS_HW_TX,
> +#endif
>  	TLS_NUM_CONFIG,
>  };
>  
> @@ -416,11 +419,19 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval,
>  		goto err_crypto_info;
>  	}
>  
> -	/* currently SW is default, we will have ethtool in future */
> -	rc = tls_set_sw_offload(sk, ctx);
> -	tx_conf = TLS_SW_TX;
> -	if (rc)
> -		goto err_crypto_info;
> +#ifdef CONFIG_TLS_DEVICE
> +	rc = tls_set_device_offload(sk, ctx);
> +	tx_conf = TLS_HW_TX;
> +	if (rc) {
> +#else
> +	{
> +#endif
> +		/* if HW offload fails fallback to SW */
> +		rc = tls_set_sw_offload(sk, ctx);
> +		tx_conf = TLS_SW_TX;
> +		if (rc)
> +			goto err_crypto_info;
> +	}
>  
>  	ctx->tx_conf = tx_conf;
>  	update_sk_prot(sk, ctx);
> @@ -473,6 +484,12 @@ static void build_protos(struct proto *prot, struct proto *base)
>  	prot[TLS_SW_TX] = prot[TLS_BASE_TX];
>  	prot[TLS_SW_TX].sendmsg		= tls_sw_sendmsg;
>  	prot[TLS_SW_TX].sendpage	= tls_sw_sendpage;
> +
> +#ifdef CONFIG_TLS_DEVICE
> +	prot[TLS_HW_TX] = prot[TLS_SW_TX];
> +	prot[TLS_HW_TX].sendmsg		= tls_device_sendmsg;
> +	prot[TLS_HW_TX].sendpage	= tls_device_sendpage;
> +#endif
>  }
>  
>  static int tls_init(struct sock *sk)
> @@ -531,6 +548,9 @@ static int __init tls_register(void)
>  {
>  	build_protos(tls_prots[TLSV4], &tcp_prot);
>  
> +#ifdef CONFIG_TLS_DEVICE
> +	tls_device_init();
> +#endif
>  	tcp_register_ulp(&tcp_tls_ulp_ops);
>  
>  	return 0;
> @@ -539,6 +559,9 @@ static int __init tls_register(void)
>  static void __exit tls_unregister(void)
>  {
>  	tcp_unregister_ulp(&tcp_tls_ulp_ops);
> +#ifdef CONFIG_TLS_DEVICE
> +	tls_device_cleanup();
> +#endif
>  }
>  
>  module_init(tls_register);

Thanks,
Kirill

^ permalink raw reply

* [PATCH] bridge: netfilter: ebt_stp: Use generic functions for comparisons
From: Joe Perches @ 2018-03-21 11:03 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: Stephen Hemminger, David S. Miller, netfilter-devel, coreteam,
	bridge, netdev, linux-kernel

Instead of unnecessary const declarations, use the generic functions to
save a little object space.

$ size net/bridge/netfilter/ebt_stp.o*
   text	   data	    bss	    dec	    hex	filename
   1250	    144	      0	   1394	    572	net/bridge/netfilter/ebt_stp.o.new
   1344	    144	      0	   1488	    5d0	net/bridge/netfilter/ebt_stp.o.old

Signed-off-by: Joe Perches <joe@perches.com>
---
 net/bridge/netfilter/ebt_stp.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/bridge/netfilter/ebt_stp.c b/net/bridge/netfilter/ebt_stp.c
index 3140eb912d7e..47ba98db145d 100644
--- a/net/bridge/netfilter/ebt_stp.c
+++ b/net/bridge/netfilter/ebt_stp.c
@@ -153,8 +153,6 @@ ebt_stp_mt(const struct sk_buff *skb, struct xt_action_param *par)
 static int ebt_stp_mt_check(const struct xt_mtchk_param *par)
 {
 	const struct ebt_stp_info *info = par->matchinfo;
-	const u8 bridge_ula[6] = {0x01, 0x80, 0xc2, 0x00, 0x00, 0x00};
-	const u8 msk[6] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
 	const struct ebt_entry *e = par->entryinfo;
 
 	if (info->bitmask & ~EBT_STP_MASK || info->invflags & ~EBT_STP_MASK ||
@@ -162,8 +160,8 @@ static int ebt_stp_mt_check(const struct xt_mtchk_param *par)
 		return -EINVAL;
 	/* Make sure the match only receives stp frames */
 	if (!par->nft_compat &&
-	    (!ether_addr_equal(e->destmac, bridge_ula) ||
-	     !ether_addr_equal(e->destmsk, msk) ||
+	    (!ether_addr_equal(e->destmac, eth_stp_addr) ||
+	     !is_broadcast_ether_addr(e->destmsk) ||
 	     !(e->bitmask & EBT_DESTMAC)))
 		return -EINVAL;
 
-- 
2.15.0

^ permalink raw reply related

* Re: HW question: i210 vs. BCM5461S over SGMII: no response from PHY to MDIO requests?
From: Frantisek Rysanek @ 2018-03-21 10:47 UTC (permalink / raw)
  To: Andrew Lunn, netdev
In-Reply-To: <20180320120936.GA19142@lunn.ch>

Just another follow-up:

With specs on SFP MSA, DDM/DMI and MII in hand, I have determined:
0x50 (a.k.a. 0xA0 in SFP MSA spec) = the module's SPD "EEPROM"
0x51 (a.k.a. 0xA2 in SFP MSA spec) = diagnostics (DMI/DDM)
0x56 = MII management access over i2C

Using eeprog (reading each offset twice to get 16bit words, which 
works for both the SFP's), I've been able to get the MII PHY ID's of 
the chips inside both modules:

SFP#1 has PHY ID == 002060C1 == BCM5461S
SFP#2 has PHY ID == 01410c97 == Marvell MV88E111x series.
    I don't know the precise model of the Marvell chip.
    Could be the mythical MV88E1113.
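
To make the decoding above explicit, here is a minimal C sketch of how two byte reads form one 16-bit MII register and how registers 2 and 3 (PHYSID1/PHYSID2) combine into the 32-bit PHY ID. The function names are hypothetical, and the byte order (most significant byte first) is an assumption about how the modules answer over i2c; it is not taken from either datasheet.

```c
#include <stdint.h>

/* One MII register assembled from two consecutive byte reads over i2c.
 * Assumption: the module returns the most significant byte first.
 */
static uint16_t mii_word(uint8_t msb, uint8_t lsb)
{
	return (uint16_t)((msb << 8) | lsb);
}

/* 32-bit PHY ID: register 2 (PHYSID1) in the upper half,
 * register 3 (PHYSID2) in the lower half.
 */
static uint32_t phy_id(uint16_t physid1, uint16_t physid2)
{
	return ((uint32_t)physid1 << 16) | physid2;
}

/* Bits 9:4 of the ID carry the vendor's model number. */
static unsigned int phy_model(uint32_t id)
{
	return (id >> 4) & 0x3f;
}

/* Bits 3:0 of the ID carry the revision number. */
static unsigned int phy_revision(uint32_t id)
{
	return id & 0xf;
}
```

With the IDs listed above, 002060C1 splits into model 0x0c, revision 1, and 01410c97 into model 0x09, revision 7.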

The SPD EEPROM contents are interesting too.

The module ID == 0 is weird, I would expect 0x03 = SFP,
but then again, the SFP MSA doesn't consider SGMII in the socket at 
all, it appears to only consider SERDES, am I right? So skipping the 
module ID makes some limited sense to me.

I haven't calculated the checksums yet, but multiple fields in there 
are filled in according to the SFP MSA spec.

Interestingly, SFP#1 has "Encoding" (byte 11 / 0x0b) set to 0x05
= "SONET scrambled". But has other fields indicating 100Base-FX.
For 100Base-FX I would expect encoding "4b/5b".
Which is what SFP#2 has encoded in its otherwise sparse EEPROM.

SFP#1 has DMI/DDM implemented. Alarm thresholds seem cranked to the 
limit, but some measured values are returned.

SFP#2 does not have DMI/DDM active, yet it decodes the i2c device 
0x51 (a.k.a. 0xA2) and about three bytes are non-empty... but holding 
nonsensical values (some thresholds).

Looking at the i2c dumps, and some past dumps from the igb driver, 
it's dawning on me that the igb driver, without much hacking, 
would try to read the PHY ID from the DMI/DDM block - a case which 
the drivers/net/phy/mdio-i2c.c specifically avoids :-)

My current opinion on the matter is that I don't really need a 
valid SPD EEPROM to initialize the PHY in the SFP.
The question is whether I can make the i210 properly handshake with the 
PHY on the SGMII payload lane.

Another question is, how to write the driver's initialization 
sequence, for it to correctly decide if the SFP is SERDES or SGMII or 
what. I could just follow the config obtained from the i210 EEPROM.
Alternatively, or somehow combined with that, I could try checking if 
something responds to me over i2c, and do a quick check for SPD 
EEPROM. If I find one, do a check for "MII regs over i2c" - at all 
i2c slave addresses between 0x41..0x59  EXCEPT 0x50/0x51.
If I find MII-i2c regs, it's clear that the transceiver contains a 
PHY, and wants to run in SGMII mode - otherwise it's a SERDES thing.
If nothing is found in i2c mode, and inherited i210 config indicates 
SGMII, try external MDIO... if that fails, revert to SERDES.
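
The decision sequence above can be sketched in C roughly as follows. This is only an illustration of the logic as described, not igb code: the enum, the struct and both function names are hypothetical, and the probe results are assumed to have been collected beforehand by whatever i2c/MDIO access the driver ends up using.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of the detection, not an igb type. */
enum sfp_mode {
	SFP_MODE_SERDES,	/* no PHY found, plain SERDES */
	SFP_MODE_SGMII_I2C,	/* PHY reachable via MII-over-i2c */
	SFP_MODE_SGMII_MDIO,	/* PHY reachable via external MDIO */
};

/* Hypothetical summary of what the probing found. */
struct sfp_probe {
	bool spd_eeprom_found;	/* something like an SPD EEPROM answers on i2c */
	bool mii_i2c_found;	/* MII registers found at a candidate address */
	bool nvm_sgmii;		/* config inherited from the i210 EEPROM says SGMII */
	bool ext_mdio_found;	/* a PHY answers on external MDIO */
};

/* i2c slave addresses worth scanning for MII registers:
 * 0x41..0x59, except the SPD EEPROM (0x50) and DMI/DDM (0x51).
 */
static bool mii_i2c_candidate(uint8_t addr)
{
	return addr >= 0x41 && addr <= 0x59 && addr != 0x50 && addr != 0x51;
}

static enum sfp_mode sfp_pick_mode(const struct sfp_probe *p)
{
	/* i2c is alive: MII-over-i2c regs decide between SGMII and SERDES. */
	if (p->spd_eeprom_found)
		return p->mii_i2c_found ? SFP_MODE_SGMII_I2C : SFP_MODE_SERDES;

	/* Nothing on i2c: trust the i210 config, but only if MDIO answers. */
	if (p->nvm_sgmii && p->ext_mdio_found)
		return SFP_MODE_SGMII_MDIO;

	/* Everything else reverts to SERDES. */
	return SFP_MODE_SERDES;
}
```

Keeping the decision separate from the bus accesses would at least let the logic be exercised without the hardware at hand.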

And then there's the MII PHY internal config, which can be quite 
proprietary...

And if I write all that, there's no one to re-test this after me, as 
my setup is such a special case :-/

No end of fun.

Unfortunately, some other work is interfering at this very moment...

Frank

^ permalink raw reply

* [PATCH] ath: Remove unnecessary ath_bcast_mac and use eth_broadcast_addr
From: Joe Perches @ 2018-03-21 10:32 UTC (permalink / raw)
  To: Luis R. Rodriguez, Jiri Slaby, Nick Kossifidis,
	QCA ath9k Development
  Cc: Kalle Valo, linux-wireless, netdev, linux-kernel

Remove the static array and use the generic routine to set the
Ethernet broadcast address.

Signed-off-by: Joe Perches <joe@perches.com>
---
 drivers/net/wireless/ath/ath.h                | 2 --
 drivers/net/wireless/ath/ath5k/attach.c       | 2 +-
 drivers/net/wireless/ath/ath9k/htc_drv_init.c | 2 +-
 drivers/net/wireless/ath/ath9k/init.c         | 2 +-
 4 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/wireless/ath/ath.h b/drivers/net/wireless/ath/ath.h
index f3f2784f6ebd..7a364eca46d6 100644
--- a/drivers/net/wireless/ath/ath.h
+++ b/drivers/net/wireless/ath/ath.h
@@ -33,8 +33,6 @@
  */
 #define	ATH_KEYMAX	        128     /* max key cache size we handle */
 
-static const u8 ath_bcast_mac[ETH_ALEN] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
-
 struct ath_ani {
 	bool caldone;
 	unsigned int longcal_timer;
diff --git a/drivers/net/wireless/ath/ath5k/attach.c b/drivers/net/wireless/ath/ath5k/attach.c
index 233054bd6b52..12d3a6c92ba4 100644
--- a/drivers/net/wireless/ath/ath5k/attach.c
+++ b/drivers/net/wireless/ath/ath5k/attach.c
@@ -327,7 +327,7 @@ int ath5k_hw_init(struct ath5k_hw *ah)
 	ath5k_hw_set_lladdr(ah, zero_mac);
 
 	/* Set BSSID to bcast address: ff:ff:ff:ff:ff:ff for now */
-	memcpy(common->curbssid, ath_bcast_mac, ETH_ALEN);
+	eth_broadcast_addr(common->curbssid);
 	ath5k_hw_set_bssid(ah);
 	ath5k_hw_set_opmode(ah, ah->opmode);
 
diff --git a/drivers/net/wireless/ath/ath9k/htc_drv_init.c b/drivers/net/wireless/ath/ath9k/htc_drv_init.c
index e89e5ef2c2a4..bf36bb2821b4 100644
--- a/drivers/net/wireless/ath/ath9k/htc_drv_init.c
+++ b/drivers/net/wireless/ath/ath9k/htc_drv_init.c
@@ -591,7 +591,7 @@ static void ath9k_init_misc(struct ath9k_htc_priv *priv)
 {
 	struct ath_common *common = ath9k_hw_common(priv->ah);
 
-	memcpy(common->bssidmask, ath_bcast_mac, ETH_ALEN);
+	eth_broadcast_addr(common->bssidmask);
 
 	common->last_rssi = ATH_RSSI_DUMMY_MARKER;
 	priv->ah->opmode = NL80211_IFTYPE_STATION;
diff --git a/drivers/net/wireless/ath/ath9k/init.c b/drivers/net/wireless/ath/ath9k/init.c
index e479fae5aab9..748becee3ecf 100644
--- a/drivers/net/wireless/ath/ath9k/init.c
+++ b/drivers/net/wireless/ath/ath9k/init.c
@@ -427,7 +427,7 @@ static void ath9k_init_misc(struct ath_softc *sc)
 	timer_setup(&common->ani.timer, ath_ani_calibrate, 0);
 
 	common->last_rssi = ATH_RSSI_DUMMY_MARKER;
-	memcpy(common->bssidmask, ath_bcast_mac, ETH_ALEN);
+	eth_broadcast_addr(common->bssidmask);
 	sc->beacon.slottime = 9;
 
 	for (i = 0; i < ARRAY_SIZE(sc->beacon.bslot); i++)
-- 
2.15.0

^ permalink raw reply related

* Re: [bug, bisected] pfifo_fast causes packet reordering
From: Jakob Unterwurzacher @ 2018-03-21 10:01 UTC (permalink / raw)
  To: John Fastabend, Dave Taht
  Cc: netdev, linux-kernel, David S. Miller, linux-can@vger.kernel.org,
	Martin Elshuber
In-Reply-To: <340a6c54-6031-5522-98f5-eafdd3a37a38@theobroma-systems.com>

On 16.03.18 11:26, Jakob Unterwurzacher wrote:
> On 15.03.18 23:30, John Fastabend wrote:
>>> I have reproduced it using two USB network cards connected to each 
>>> other. The test tool sends UDP packets containing a counter and 
>>> listens on the other interface, it is available at
>>> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
>>
>> Great thanks, can you also run this with taskset to bind to
>> a single CPU,
>>
>>   # taskset 0x1 ./pfifo_stress.py
>>
>> And let me know if you still see the OOO.
> 
> Interesting. Looks like it depends on which core it runs on. CPU0 is 
> clean, CPU1 is not.

So we are at v4.16-rc6 now - have you managed to reproduce this, or 
should I try to get the revert correct?
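
For anyone without the script at hand, the kind of check involved can be sketched in C as below. This is not the code from pfifo_stress.py, only an illustration of detecting out-of-order delivery from a per-packet counter; the function name is hypothetical and counter wraparound is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Count received packets whose counter is lower than the highest
 * counter seen so far, i.e. packets that arrived late.
 * Assumption: the sender increments the counter once per packet and
 * the counter does not wrap during the test.
 */
static size_t count_reordered(const uint32_t *seq, size_t n)
{
	size_t reordered = 0;
	uint32_t highest = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (i > 0 && seq[i] < highest)
			reordered++;
		else
			highest = seq[i];
	}

	return reordered;
}
```

A result of zero means the counters arrived in non-decreasing order; lost packets alone do not count as reordering here.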

Best regards,
Jakob

^ permalink raw reply

* Re: [net-next] intel: add SPDX identifiers to all the Intel drivers
From: Greg Kroah-Hartman @ 2018-03-21  9:38 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: Philippe Ombredanne, Allan, Bruce W, Joe Perches,
	davem@davemloft.net, netdev@vger.kernel.org, nhorman@redhat.com,
	sassmann@redhat.com, jogreene@redhat.com, Kate Stewart,
	Thomas Gleixner
In-Reply-To: <1521594560.15055.10.camel@intel.com>

On Tue, Mar 20, 2018 at 06:09:20PM -0700, Jeff Kirsher wrote:
> > When the kernel maintainers decide to switch to V3.0 of the SPDX list,
> > the doc will be updated and then Joe's script could be applied at once
> > to update the past.
> 
> I am fine with changing my patch back to v2.6 SPDX ids, as long as Joe's
> script in the future won't touch the Intel wired LAN drivers, since we need
> to retain copyright on several files throughout our drivers.

Changing SPDX comment lines should never change the "copyright" of a
file.  If you have questions about stuff like that, please talk to your
corporate lawyers, they will be glad to talk to you all about what
defines "copyright" :)

thanks,

greg "I talk to too many lawyers..." k-h

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox