* [PATCH net-next 0/5] tls: Add generic NIC offload infrastructure
From: Ilya Lesokhin @ 2017-09-14 10:46 UTC
  To: netdev, davem; +Cc: davejwatson, tom, hannes, borisp, ilyal, aviadye, liranl

This series adds a generic infrastructure to offload TLS crypto to
network devices. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path,
leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the
TCP layer just like the SW KTLS implementation, using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between the SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records"; those
records contain plaintext instead of ciphertext and placeholder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
numbers to TLS records. Thus, given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the TLS NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is acked;
this is done through the new icsk_clean_acked callback.
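
To make the mapping concrete: a record with end_seq E and length L
covers the TCP sequence range [E - L, E). A minimal sketch of the
containment check, using the tls_record_info fields introduced in
patch 5/5 (record_contains_seq itself is illustrative and not part of
the series; before() is the TCP sequence comparison from <net/tcp.h>):

static bool record_contains_seq(const struct tls_record_info *info, u32 seq)
{
	/* a record covers the range [end_seq - len, end_seq) */
	return !before(seq, info->end_seq - info->len) &&
	       before(seq, info->end_seq);
}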

The infrastructure should be extensible to support various NIC offload
implementations. However, it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant placeholders.

The responsibility for handling out-of-order packets (e.g. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective. If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN and transmits the packet
with a TLS offload indication.

If the next packet to transmit does not match the expected TCP SN, the
driver calls the TLS layer to obtain the TLS record that includes the
TCP SN of the packet for transmission. Using this TLS record, the
driver posts a work entry on the transmit queue to reconstruct the NIC
TLS state required for the offload of the out-of-order packet. It
updates the expected TCP SN accordingly and transmits the now in-order
packet. The same queue is used for packet transmission and TLS context
reconstruction to avoid the need to flush the transmit queue before
issuing the context reconstruction request.

The expected TCP SN is accessed without a lock, under the assumption
that TCP doesn't transmit SKBs from different TX queues concurrently.
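
To illustrate the scheme, a driver TX path could be structured roughly
as follows. All foo_* names and the per-queue expected_tls_seq field
are hypothetical; tls_is_sk_tx_device_offloaded() and tls_get_record()
are the helpers added in patch 5/5, and a real driver must also hold
the record-list lock around the lookup (elided here):

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct foo_txq *txq = foo_select_txq(dev, skb);
	u32 seq;

	if (!skb->sk || !tls_is_sk_tx_device_offloaded(skb->sk))
		return foo_xmit_plain(txq, skb);	/* non-TLS path */

	seq = ntohl(tcp_hdr(skb)->seq);
	if (unlikely(seq != txq->expected_tls_seq)) {
		/* Out-of-order (e.g. a retransmission): find the TLS
		 * record containing this SN and post a resync work entry
		 * on the same TX queue, ahead of the packet itself.
		 */
		struct tls_record_info *record =
			tls_get_record(foo_tls_offload_ctx(skb->sk), seq);

		foo_post_tls_resync(txq, record, seq);
	}
	txq->expected_tls_seq = seq + foo_tcp_payload_len(skb);

	return foo_xmit_tls(txq, skb);	/* with TLS offload indication */
}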

We assume that packets are not rerouted to a different network device.

Github with mlx5e TLS offload support:
https://github.com/Mellanox/tls-offload/tree/tls_device_v1

Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf

Ilya Lesokhin (5):
  tls: Move release of tls_ctx into tls_sw_free_resources
  tcp: Add clean acked data hook
  net: Add TLS offload netdev ops
  net: Add TLS TX offload features
  tls: Add generic NIC offload infrastructure.

 include/linux/netdev_features.h    |   2 +
 include/linux/netdevice.h          |  21 ++
 include/net/inet_connection_sock.h |   2 +
 include/net/tls.h                  |  41 ++-
 net/core/ethtool.c                 |   1 +
 net/ipv4/tcp_input.c               |   3 +
 net/tls/Kconfig                    |   9 +
 net/tls/Makefile                   |   3 +
 net/tls/tls_device.c               | 673 +++++++++++++++++++++++++++++++++++++
 net/tls/tls_main.c                 |  68 ++--
 net/tls/tls_sw.c                   |   1 +
 11 files changed, 803 insertions(+), 21 deletions(-)
 create mode 100644 net/tls/tls_device.c

-- 
1.8.3.1

* [PATCH net-next 1/5] tls: Move release of tls_ctx into tls_sw_free_resources
From: Ilya Lesokhin @ 2017-09-14 10:46 UTC
  To: netdev, davem; +Cc: davejwatson, tom, hannes, borisp, ilyal, aviadye, liranl

Move the release of tls_ctx into SW-specific code.
This is required because the device offload implementation
requires this context to remain alive until there are
no more in-flight SKBs.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 net/tls/tls_main.c | 5 ++---
 net/tls/tls_sw.c   | 1 +
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 60aff60..ae20ee3 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -232,12 +232,11 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 			sg++;
 		}
 	}
-	ctx->free_resources(sk);
+
 	kfree(ctx->rec_seq);
 	kfree(ctx->iv);
-
 	sk_proto_close = ctx->sk_proto_close;
-	kfree(ctx);
+	ctx->free_resources(sk);
 
 	release_sock(sk);
 	sk_proto_close(sk, timeout);
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index fa596fa..db1e566 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -650,6 +650,7 @@ void tls_sw_free_resources(struct sock *sk)
 	tls_free_both_sg(sk);
 
 	kfree(ctx);
+	kfree(tls_ctx);
 }
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx)
-- 
1.8.3.1

* [PATCH net-next 2/5] tcp: Add clean acked data hook
From: Ilya Lesokhin @ 2017-09-14 10:46 UTC
  To: netdev, davem; +Cc: davejwatson, tom, hannes, borisp, ilyal, aviadye, liranl

Called when a TCP segment is acknowledged.
Could be used by application protocols that hold additional
metadata associated with the stream data.
This is required by TLS device offload to release
metadata associated with acknowledged TLS records.
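
As a usage sketch, a ULP that wants per-ACK cleanup would install the
hook along these lines (the my_ulp_* names are illustrative; patch 5/5
does the equivalent with tls_icsk_clean_acked):

static void my_ulp_clean_acked(struct sock *sk)
{
	/* Runs from tcp_ack(): everything up to tcp_sk(sk)->snd_una has
	 * been acknowledged, so per-record metadata covering that range
	 * can be released here.
	 */
}

static void my_ulp_init(struct sock *sk)
{
	inet_csk(sk)->icsk_clean_acked = &my_ulp_clean_acked;
}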

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 include/net/inet_connection_sock.h | 2 ++
 net/ipv4/tcp_input.c               | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 13e4c89..2ea1026 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
  * @icsk_af_ops		   Operations which are AF_INET{4,6} specific
  * @icsk_ulp_ops	   Pluggable ULP control hook
  * @icsk_ulp_data	   ULP private data
+ * @icsk_clean_acked	   Clean acked data hook
  * @icsk_ca_state:	   Congestion control state
  * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
  * @icsk_pending:	   Scheduled timer event
@@ -101,6 +102,7 @@ struct inet_connection_sock {
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
 	void			  *icsk_ulp_data;
+	void			  (*icsk_clean_acked)(struct sock *sk);
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
 	__u8			  icsk_ca_state:6,
 				  icsk_ca_setsockopt:1,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c5d7656..37ebdbb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3645,6 +3645,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (!prior_packets)
 		goto no_queue;
 
+	if (icsk->icsk_clean_acked)
+		icsk->icsk_clean_acked(sk);
+
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, &acked,
 				    &sack_state);
-- 
1.8.3.1

* [PATCH net-next 3/5] net: Add TLS offload netdev ops
From: Ilya Lesokhin @ 2017-09-14 10:46 UTC
  To: netdev, davem; +Cc: davejwatson, tom, hannes, borisp, ilyal, aviadye, liranl

Add new netdev ops to add and delete a TLS context.
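
For illustration, a driver exposing these ops would wire them up along
the following lines (the foo_* callbacks are hypothetical):

static int foo_tls_dev_add(struct net_device *netdev, struct sock *sk,
			   enum tls_offload_ctx_dir direction,
			   struct tls_crypto_info *crypto_info)
{
	/* Program the HW with the key material for this flow;
	 * return non-zero to refuse the offload.
	 */
	return 0;
}

static void foo_tls_dev_del(struct net_device *netdev, struct sock *sk,
			    enum tls_offload_ctx_dir direction)
{
	/* Release the HW TLS context associated with this flow. */
}

static const struct tlsdev_ops foo_tlsdev_ops = {
	.tls_dev_add = foo_tls_dev_add,
	.tls_dev_del = foo_tls_dev_del,
};

/* at probe time: netdev->tlsdev_ops = &foo_tlsdev_ops; */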

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 include/linux/netdevice.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cdfd9ad..4ea81bab 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -826,6 +826,23 @@ struct xfrmdev_ops {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+enum tls_offload_ctx_dir {
+	TLS_OFFLOAD_CTX_DIR_RX,
+	TLS_OFFLOAD_CTX_DIR_TX,
+};
+
+struct tls_crypto_info;
+
+struct tlsdev_ops {
+	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
+			   enum tls_offload_ctx_dir direction,
+			   struct tls_crypto_info *crypto_info);
+	void (*tls_dev_del)(struct net_device *netdev, struct sock *sk,
+			    enum tls_offload_ctx_dir direction);
+};
+#endif
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1713,6 +1730,10 @@ struct net_device {
 	const struct xfrmdev_ops *xfrmdev_ops;
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+	const struct tlsdev_ops *tlsdev_ops;
+#endif
+
 	const struct header_ops *header_ops;
 
 	unsigned int		flags;
-- 
1.8.3.1

* [PATCH net-next 4/5] net: Add TLS TX offload features
From: Ilya Lesokhin @ 2017-09-14 10:46 UTC
  To: netdev, davem; +Cc: davejwatson, tom, hannes, borisp, ilyal, aviadye, liranl

This patch adds a netdev feature to configure TLS TX offloads.
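
Once a driver advertises the feature it can be toggled like any other
offload; e.g. (the interface name is just an example):

  ethtool -K eth0 tls-hw-tx-offload off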

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 include/linux/netdev_features.h | 2 ++
 net/core/ethtool.c              | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index dc8b489..ed0648a 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -76,6 +76,7 @@ enum {
 	NETIF_F_HW_ESP_BIT,		/* Hardware ESP transformation offload */
 	NETIF_F_HW_ESP_TX_CSUM_BIT,	/* ESP with TX checksum offload */
 	NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */
+	NETIF_F_HW_TLS_TX_BIT,		/* Hardware TLS TX offload */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -140,6 +141,7 @@ enum {
 #define NETIF_F_HW_ESP		__NETIF_F(HW_ESP)
 #define NETIF_F_HW_ESP_TX_CSUM	__NETIF_F(HW_ESP_TX_CSUM)
 #define	NETIF_F_RX_UDP_TUNNEL_PORT  __NETIF_F(RX_UDP_TUNNEL_PORT)
+#define NETIF_F_HW_TLS_TX	__NETIF_F(HW_TLS_TX)
 
 #define for_each_netdev_feature(mask_addr, bit)	\
 	for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 6a582ae..2ae1fc4 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -106,6 +106,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info)
 	[NETIF_F_HW_ESP_BIT] =		 "esp-hw-offload",
 	[NETIF_F_HW_ESP_TX_CSUM_BIT] =	 "esp-tx-csum-hw-offload",
 	[NETIF_F_RX_UDP_TUNNEL_PORT_BIT] =	 "rx-udp_tunnel-port-offload",
+	[NETIF_F_HW_TLS_TX_BIT] =	 "tls-hw-tx-offload",
 };
 
 static const char
-- 
1.8.3.1

* [PATCH net-next 5/5] tls: Add generic NIC offload infrastructure.
From: Ilya Lesokhin @ 2017-09-14 10:46 UTC
  To: netdev, davem; +Cc: davejwatson, tom, hannes, borisp, ilyal, aviadye, liranl

This patch adds a generic infrastructure to offload TLS crypto to
network devices. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path,
leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the
TCP layer just like the SW KTLS implementation, using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between the SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records"; those
records contain plaintext instead of ciphertext and placeholder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
numbers to TLS records. Thus, given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the TLS NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is acked;
this is done through the new icsk_clean_acked callback.

The infrastructure should be extensible to support various NIC offload
implementations. However, it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant placeholders.

The responsibility for handling out-of-order packets (e.g. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective. If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN and transmits the packet
with a TLS offload indication.

If the next packet to transmit does not match the expected TCP SN, the
driver calls the TLS layer to obtain the TLS record that includes the
TCP SN of the packet for transmission. Using this TLS record, the
driver posts a work entry on the transmit queue to reconstruct the NIC
TLS state required for the offload of the out-of-order packet. It
updates the expected TCP SN accordingly and transmits the now in-order
packet. The same queue is used for packet transmission and TLS context
reconstruction to avoid the need to flush the transmit queue before
issuing the context reconstruction request.
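
Note that the userspace-facing API is unchanged by this series: after
the handshake the application hands the TX keys to the kernel exactly
as with SW KTLS, and the kernel transparently picks HW offload when the
device supports it. A minimal sketch (error handling omitted; the TLS_*
constants and struct come from the kTLS uapi <linux/tls.h>, and SOL_TLS
may need to be defined as 282 with older userspace headers):

	struct tls12_crypto_info_aes_gcm_128 ci = {
		.info.version = TLS_1_2_VERSION,
		.info.cipher_type = TLS_CIPHER_AES_GCM_128,
		/* .iv, .rec_seq, .key and .salt come from the handshake */
	};

	setsockopt(fd, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
	setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));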

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 include/net/tls.h    |  41 +++-
 net/tls/Kconfig      |   9 +
 net/tls/Makefile     |   3 +
 net/tls/tls_device.c | 673 +++++++++++++++++++++++++++++++++++++++++++++++++++
 net/tls/tls_main.c   |  63 +++--
 5 files changed, 771 insertions(+), 18 deletions(-)
 create mode 100644 net/tls/tls_device.c

diff --git a/include/net/tls.h b/include/net/tls.h
index b89d397..1f83c8e 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -71,6 +71,24 @@ struct tls_sw_context {
 	struct scatterlist sg_aead_out[2];
 };
 
+struct tls_record_info {
+	struct list_head list;
+	u32 end_seq;
+	int len;
+	int num_frags;
+	skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+struct tls_offload_context {
+	struct list_head records_list;
+	struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
+	void (*sk_destruct)(struct sock *sk);
+	struct tls_record_info *open_record;
+	struct tls_record_info *retransmit_hint;
+	u32 expected_seq;
+	spinlock_t lock;	/* protects records list */
+};
+
 enum {
 	TLS_PENDING_CLOSED_RECORD
 };
@@ -81,6 +99,9 @@ struct tls_context {
 		struct tls12_crypto_info_aes_gcm_128 crypto_send_aes_gcm_128;
 	};
 
+	struct list_head gclist;
+	struct sock *sk;
+	struct net_device *netdev;
 	void *priv_ctx;
 
 	u16 prepend_size;
@@ -123,9 +144,18 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
 void tls_sw_close(struct sock *sk, long timeout);
 
-void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
-void tls_icsk_clean_acked(struct sock *sk);
+void tls_clear_device_offload(struct sock *sk, struct tls_context *ctx);
+int tls_set_device_offload(struct sock *sk, struct tls_context *ctx);
+int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
+int tls_device_sendpage(struct sock *sk, struct page *page,
+			int offset, size_t size, int flags);
+void tls_device_sk_destruct(struct sock *sk);
+void tls_device_cleanup(void);
 
+struct tls_record_info *tls_get_record(struct tls_offload_context *context,
+				       u32 seq);
+
+void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
 int tls_push_sg(struct sock *sk, struct tls_context *ctx,
 		struct scatterlist *sg, u16 first_offset,
 		int flags);
@@ -162,6 +192,13 @@ static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx)
 	return tls_ctx->pending_open_record_frags;
 }
 
+static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk)
+{
+	/* matches smp_store_release in tls_set_device_offload */
+	return	smp_load_acquire(&sk->sk_destruct) ==
+			&tls_device_sk_destruct;
+}
+
 static inline void tls_err_abort(struct sock *sk)
 {
 	sk->sk_err = -EBADMSG;
diff --git a/net/tls/Kconfig b/net/tls/Kconfig
index eb58303..1a4ea55c 100644
--- a/net/tls/Kconfig
+++ b/net/tls/Kconfig
@@ -13,3 +13,12 @@ config TLS
 	encryption handling of the TLS protocol to be done in-kernel.
 
 	If unsure, say N.
+
+config TLS_DEVICE
+	bool "Transport Layer Security HW offload"
+	depends on TLS
+	default n
+	---help---
+	Enable kernel support for HW offload of the TLS protocol.
+
+	If unsure, say N.
diff --git a/net/tls/Makefile b/net/tls/Makefile
index a930fd1..9de5055 100644
--- a/net/tls/Makefile
+++ b/net/tls/Makefile
@@ -5,3 +5,6 @@
 obj-$(CONFIG_TLS) += tls.o
 
 tls-y := tls_main.o tls_sw.o
+
+tls-$(CONFIG_TLS_DEVICE) += tls_device.o
+
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
new file mode 100644
index 0000000..94a25c2
--- /dev/null
+++ b/net/tls/tls_device.c
@@ -0,0 +1,673 @@
+/* Copyright (c) 2016-2017, Mellanox Technologies All rights reserved.
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ *      - Neither the name of the Mellanox Technologies nor the
+ *        names of its contributors may be used to endorse or promote
+ *        products derived from this software without specific prior written
+ *        permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE
+ */
+
+#include <linux/module.h>
+#include <net/tcp.h>
+#include <net/inet_common.h>
+#include <linux/highmem.h>
+#include <linux/netdevice.h>
+
+#include <net/tls.h>
+
+static void tls_device_gc_task(struct work_struct *work);
+
+static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
+static LIST_HEAD(tls_device_gc_list);
+static DEFINE_SPINLOCK(tls_device_gc_lock);
+
+static void tls_device_gc_task(struct work_struct *work)
+{
+	struct tls_context *ctx, *tmp;
+	struct list_head gc_list;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tls_device_gc_lock, flags);
+	INIT_LIST_HEAD(&gc_list);
+	list_splice_init(&tls_device_gc_list, &gc_list);
+	spin_unlock_irqrestore(&tls_device_gc_lock, flags);
+
+	list_for_each_entry_safe(ctx, tmp, &gc_list, gclist) {
+		struct tls_offload_context *offload_ctx = tls_offload_ctx(ctx);
+		void (*sk_destruct)(struct sock *sk) = offload_ctx->sk_destruct;
+		struct net_device *netdev = ctx->netdev;
+		struct sock *sk = ctx->sk;
+
+		netdev->tlsdev_ops->tls_dev_del(netdev, sk,
+						TLS_OFFLOAD_CTX_DIR_TX);
+
+		list_del(&ctx->gclist);
+		kfree(offload_ctx);
+		kfree(ctx);
+		sk_destruct(sk);
+	}
+}
+
+static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&tls_device_gc_lock, flags);
+	list_add_tail(&ctx->gclist, &tls_device_gc_list);
+	spin_unlock_irqrestore(&tls_device_gc_lock, flags);
+
+	schedule_work(&tls_device_gc_work);
+}
+
+/* We assume that the socket is already connected */
+static struct net_device *get_netdev_for_sock(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct net_device *netdev = NULL;
+
+	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
+
+	return netdev;
+}
+
+static void detach_sock_from_netdev(struct sock *sk, struct tls_context *ctx)
+{
+	struct net_device *netdev;
+
+	netdev = get_netdev_for_sock(sk);
+	if (!netdev) {
+		pr_err("got offloaded socket with no netdev\n");
+		return;
+	}
+
+	if (!netdev->tlsdev_ops) {
+		pr_err("attach_sock_to_netdev: netdev %s with no TLS offload\n",
+		       netdev->name);
+		return;
+	}
+
+	netdev->tlsdev_ops->tls_dev_del(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX);
+	dev_put(netdev);
+}
+
+static int attach_sock_to_netdev(struct sock *sk, struct net_device *netdev,
+				 struct tls_context *ctx)
+{
+	int rc;
+
+	rc = netdev->tlsdev_ops->tls_dev_add(
+			netdev,
+			sk,
+			TLS_OFFLOAD_CTX_DIR_TX,
+			&ctx->crypto_send);
+	if (rc) {
+		pr_err("The netdev has refused to offload this socket\n");
+		goto out;
+	}
+
+	sk->sk_bound_dev_if = netdev->ifindex;
+	sk_dst_reset(sk);
+
+	rc = 0;
+out:
+	return rc;
+}
+
+static void destroy_record(struct tls_record_info *record)
+{
+	skb_frag_t *frag;
+	int nr_frags = record->num_frags;
+
+	while (nr_frags > 0) {
+		frag = &record->frags[nr_frags - 1];
+		__skb_frag_unref(frag);
+		--nr_frags;
+	}
+	kfree(record);
+}
+
+static void delete_all_records(struct tls_offload_context *offload_ctx)
+{
+	struct tls_record_info *info, *temp;
+
+	list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) {
+		list_del(&info->list);
+		destroy_record(info);
+	}
+
+	offload_ctx->retransmit_hint = NULL;
+}
+
+static void tls_icsk_clean_acked(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx;
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tls_record_info *info, *temp;
+	unsigned long flags;
+
+	if (!tls_ctx)
+		return;
+
+	ctx = tls_offload_ctx(tls_ctx);
+
+	spin_lock_irqsave(&ctx->lock, flags);
+	info = ctx->retransmit_hint;
+	if (info && !before(tp->snd_una, info->end_seq)) {
+		ctx->retransmit_hint = NULL;
+		list_del(&info->list);
+		destroy_record(info);
+	}
+
+	list_for_each_entry_safe(info, temp, &ctx->records_list, list) {
+		if (before(tp->snd_una, info->end_seq))
+			break;
+		list_del(&info->list);
+
+		destroy_record(info);
+	}
+
+	spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+static void tls_device_free_resources(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+
+	if (ctx->open_record)
+		destroy_record(ctx->open_record);
+}
+
+/* At this point, there should be no references on this
+ * socket and no in-flight SKBs associated with this
+ * socket, so it is safe to free all the resources.
+ */
+void tls_device_sk_destruct(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+
+	delete_all_records(ctx);
+
+	tls_device_queue_ctx_destruction(tls_ctx);
+}
+EXPORT_SYMBOL(tls_device_sk_destruct);
+
+static inline void tls_append_frag(struct tls_record_info *record,
+				   struct page_frag *pfrag,
+				   int size)
+{
+	skb_frag_t *frag;
+
+	frag = &record->frags[record->num_frags - 1];
+	if (frag->page.p == pfrag->page &&
+	    frag->page_offset + frag->size == pfrag->offset) {
+		frag->size += size;
+	} else {
+		++frag;
+		frag->page.p = pfrag->page;
+		frag->page_offset = pfrag->offset;
+		frag->size = size;
+		++record->num_frags;
+		get_page(pfrag->page);
+	}
+
+	pfrag->offset += size;
+	record->len += size;
+}
+
+static inline int tls_push_record(struct sock *sk,
+				  struct tls_context *ctx,
+				  struct tls_offload_context *offload_ctx,
+				  struct tls_record_info *record,
+				  struct page_frag *pfrag,
+				  int flags,
+				  unsigned char record_type)
+{
+	skb_frag_t *frag;
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct page_frag fallback_frag;
+	struct page_frag  *tag_pfrag = pfrag;
+	int i;
+
+	/* fill prepend */
+	frag = &record->frags[0];
+	tls_fill_prepend(ctx,
+			 skb_frag_address(frag),
+			 record->len - ctx->prepend_size,
+			 record_type);
+
+	if (unlikely(!skb_page_frag_refill(
+				ctx->tag_size,
+				pfrag, GFP_KERNEL))) {
+		/* HW doesn't care about the data in the tag,
+		 * so if pfrag has no room for a tag and we
+		 * can't allocate a new pfrag, just use the
+		 * page in the first frag rather than writing
+		 * complicated fallback code.
+		 */
+		tag_pfrag = &fallback_frag;
+		tag_pfrag->page = skb_frag_page(frag);
+		tag_pfrag->offset = 0;
+	}
+
+	tls_append_frag(record, tag_pfrag, ctx->tag_size);
+	record->end_seq = tp->write_seq + record->len;
+	spin_lock_irq(&offload_ctx->lock);
+	list_add_tail(&record->list, &offload_ctx->records_list);
+	spin_unlock_irq(&offload_ctx->lock);
+	offload_ctx->open_record = NULL;
+	set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
+	tls_advance_record_sn(sk, ctx);
+
+	for (i = 0; i < record->num_frags; i++) {
+		frag = &record->frags[i];
+		sg_unmark_end(&offload_ctx->sg_tx_data[i]);
+		sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
+			    frag->size, frag->page_offset);
+		sk_mem_charge(sk, frag->size);
+		get_page(skb_frag_page(frag));
+	}
+	sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]);
+
+	/* all ready, send */
+	return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
+}
+
+static inline int tls_create_new_record(
+		struct tls_offload_context *offload_ctx,
+		struct page_frag *pfrag,
+		size_t prepend_size)
+{
+	skb_frag_t *frag;
+	struct tls_record_info *record;
+
+	record = kmalloc(sizeof(*record), GFP_KERNEL);
+	if (!record)
+		return -ENOMEM;
+
+	frag = &record->frags[0];
+	__skb_frag_set_page(frag, pfrag->page);
+	frag->page_offset = pfrag->offset;
+	skb_frag_size_set(frag, prepend_size);
+
+	get_page(pfrag->page);
+	pfrag->offset += prepend_size;
+
+	record->num_frags = 1;
+	record->len = prepend_size;
+	offload_ctx->open_record = record;
+	return 0;
+}
+
+static inline int tls_do_allocation(
+		struct sock *sk,
+		struct tls_offload_context *offload_ctx,
+		struct page_frag *pfrag,
+		size_t prepend_size)
+{
+	int ret;
+
+	if (!offload_ctx->open_record) {
+		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
+						   sk->sk_allocation))) {
+			sk->sk_prot->enter_memory_pressure(sk);
+			sk_stream_moderate_sndbuf(sk);
+			return -ENOMEM;
+		}
+
+		ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
+		if (ret)
+			return ret;
+
+		if (pfrag->size > pfrag->offset)
+			return 0;
+	}
+
+	if (!sk_page_frag_refill(sk, pfrag))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int tls_push_data(struct sock *sk,
+			 struct iov_iter *msg_iter,
+			 size_t size, int flags,
+			 unsigned char record_type)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+	struct tls_record_info *record = ctx->open_record;
+	struct page_frag *pfrag;
+	int copy, rc = 0;
+	size_t orig_size = size;
+	u32 max_open_record_len;
+	long timeo;
+	int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
+	int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
+	bool done = false;
+
+	if (sk->sk_err)
+		return -sk->sk_err;
+
+	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
+	rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo);
+	if (rc < 0)
+		return rc;
+
+	pfrag = sk_page_frag(sk);
+
+	/* TLS_HEADER_SIZE is not counted as part of the TLS record, and
+	 * we need to leave room for an authentication tag.
+	 */
+	max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
+			      tls_ctx->prepend_size;
+	do {
+		if (tls_do_allocation(sk, ctx, pfrag,
+				      tls_ctx->prepend_size)) {
+			rc = sk_stream_wait_memory(sk, &timeo);
+			if (!rc)
+				continue;
+
+			record = ctx->open_record;
+			if (!record)
+				break;
+handle_error:
+			if (record_type != TLS_RECORD_TYPE_DATA) {
+				/* avoid sending partial
+				 * record with type !=
+				 * application_data
+				 */
+				size = orig_size;
+				destroy_record(record);
+				ctx->open_record = NULL;
+			} else if (record->len > tls_ctx->prepend_size) {
+				goto last_record;
+			}
+
+			break;
+		}
+
+		record = ctx->open_record;
+		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
+		copy = min_t(size_t, copy, (max_open_record_len - record->len));
+
+		if (copy_from_iter_nocache(
+				page_address(pfrag->page) + pfrag->offset,
+				copy, msg_iter) != copy) {
+			rc = -EFAULT;
+			goto handle_error;
+		}
+		tls_append_frag(record, pfrag, copy);
+
+		size -= copy;
+		if (!size) {
+last_record:
+			tls_push_record_flags = flags;
+			if (more) {
+				tls_ctx->pending_open_record_frags =
+						record->num_frags;
+				break;
+			}
+
+			done = true;
+		}
+
+		if ((done) ||
+		    (record->len >= max_open_record_len) ||
+		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
+			rc = tls_push_record(sk,
+					     tls_ctx,
+					     ctx,
+					     record,
+					     pfrag,
+					     tls_push_record_flags,
+					     record_type);
+			if (rc < 0)
+				break;
+		}
+	} while (!done);
+
+	if (orig_size - size > 0)
+		rc = orig_size - size;
+
+	return rc;
+}
+
+int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+	unsigned char record_type = TLS_RECORD_TYPE_DATA;
+	int rc = 0;
+
+	lock_sock(sk);
+
+	if (unlikely(msg->msg_controllen)) {
+		rc = tls_proccess_cmsg(sk, msg, &record_type);
+		if (rc)
+			goto out;
+	}
+
+	rc = tls_push_data(sk, &msg->msg_iter, size,
+			   msg->msg_flags, record_type);
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+int tls_device_sendpage(struct sock *sk, struct page *page,
+			int offset, size_t size, int flags)
+{
+	struct iov_iter	msg_iter;
+	struct kvec iov;
+	char *kaddr = kmap(page);
+	int rc = 0;
+
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		flags |= MSG_MORE;
+
+	lock_sock(sk);
+
+	if (flags & MSG_OOB) {
+		rc = -ENOTSUPP;
+		goto out;
+	}
+
+	iov.iov_base = kaddr + offset;
+	iov.iov_len = size;
+	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size);
+	rc = tls_push_data(sk, &msg_iter, size,
+			   flags, TLS_RECORD_TYPE_DATA);
+	kunmap(page);
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+struct tls_record_info *tls_get_record(struct tls_offload_context *context,
+				       u32 seq)
+{
+	struct tls_record_info *info;
+
+	info = context->retransmit_hint;
+	if (!info ||
+	    before(seq, info->end_seq - info->len))
+		info = list_first_entry(&context->records_list,
+					struct tls_record_info, list);
+
+	list_for_each_entry_from(info, &context->records_list, list) {
+		if (before(seq, info->end_seq)) {
+			if (!context->retransmit_hint ||
+			    after(info->end_seq,
+				  context->retransmit_hint->end_seq))
+				context->retransmit_hint = info;
+			return info;
+		}
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL(tls_get_record);
+
+static int tls_device_push_pending_record(struct sock *sk, int flags)
+{
+	struct iov_iter	msg_iter;
+
+	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
+	return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
+}
+
+int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
+{
+	struct tls_crypto_info *crypto_info;
+	struct tls_offload_context *offload_ctx;
+	struct tls_record_info *start_marker_record;
+	u16 nonce_size, tag_size, iv_size, rec_seq_size;
+	char *iv, *rec_seq;
+	int rc;
+	struct net_device *netdev;
+	struct sk_buff *skb;
+
+	if (!ctx) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (ctx->priv_ctx) {
+		rc = -EEXIST;
+		goto out;
+	}
+
+	netdev = get_netdev_for_sock(sk);
+	if (!netdev) {
+		pr_err("%s: netdev not found\n", __func__);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
+		rc = -ENOTSUPP;
+		goto release_netdev;
+	}
+
+	crypto_info = &ctx->crypto_send;
+	switch (crypto_info->cipher_type) {
+	case TLS_CIPHER_AES_GCM_128: {
+		nonce_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
+		tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+		iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
+		iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv;
+		rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
+		rec_seq =
+		 ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq;
+		break;
+	}
+	default:
+		rc = -EINVAL;
+		goto release_netdev;
+	}
+
+	start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
+	if (!start_marker_record) {
+		rc = -ENOMEM;
+		goto release_netdev;
+	}
+
+	rc = attach_sock_to_netdev(sk, netdev, ctx);
+	if (rc)
+		goto free_marker_record;
+
+	ctx->netdev = netdev;
+	ctx->sk = sk;
+
+	ctx->prepend_size = TLS_HEADER_SIZE + nonce_size;
+	ctx->tag_size = tag_size;
+	ctx->iv_size = iv_size;
+	ctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
+			  GFP_KERNEL);
+	if (!ctx->iv) {
+		rc = -ENOMEM;
+		goto detach_sock;
+	}
+	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
+	ctx->rec_seq_size = rec_seq_size;
+	ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
+	if (!ctx->rec_seq) {
+		rc = -ENOMEM;
+		goto err_iv;
+	}
+	memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
+
+	offload_ctx = ctx->priv_ctx;
+	start_marker_record->end_seq = tcp_sk(sk)->write_seq;
+	start_marker_record->len = 0;
+	start_marker_record->num_frags = 0;
+
+	INIT_LIST_HEAD(&offload_ctx->records_list);
+	list_add_tail(&start_marker_record->list, &offload_ctx->records_list);
+	spin_lock_init(&offload_ctx->lock);
+
+	inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
+	ctx->push_pending_record = tls_device_push_pending_record;
+	ctx->free_resources = tls_device_free_resources;
+	offload_ctx->sk_destruct = sk->sk_destruct;
+
+	/* TLS offload is greatly simplified if we don't send
+	 * SKBs where only part of the payload needs to be encrypted.
+	 * So mark the last skb in the write queue as end of record.
+	 */
+	skb = tcp_write_queue_tail(sk);
+	if (skb)
+		TCP_SKB_CB(skb)->eor = 1;
+
+	/* After the next line tls_is_sk_tx_device_offloaded
+	 * will return true and ndo_start_xmit might access the
+	 * offload context
+	 */
+	smp_store_release(&sk->sk_destruct,
+			  &tls_device_sk_destruct);
+	goto release_netdev;
+
+err_iv:
+	kfree(ctx->iv);
+detach_sock:
+	detach_sock_from_netdev(sk, ctx);
+free_marker_record:
+	kfree(start_marker_record);
+release_netdev:
+	dev_put(netdev);
+out:
+	return rc;
+}
+
+void __exit tls_device_cleanup(void)
+{
+	flush_work(&tls_device_gc_work);
+}
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index ae20ee3..a93a712 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -45,8 +45,16 @@
 MODULE_DESCRIPTION("Transport Layer Security Support");
 MODULE_LICENSE("Dual BSD/GPL");
 
-static struct proto tls_base_prot;
-static struct proto tls_sw_prot;
+enum {
+	TLS_BASE_TX,
+	TLS_SW_TX,
+#ifdef CONFIG_TLS_DEVICE
+	TLS_HW_TX,
+#endif
+	TLS_NUM_CONFIG,
+};
+
+static struct proto tls_prots[TLS_NUM_CONFIG];
 
 int wait_on_pending_writer(struct sock *sk, long *timeo)
 {
@@ -393,11 +401,19 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval,
 
 	ctx->sk_proto_close = sk->sk_prot->close;
 
-	/* currently SW is default, we will have ethtool in future */
-	rc = tls_set_sw_offload(sk, ctx);
-	prot = &tls_sw_prot;
-	if (rc)
-		goto err_crypto_info;
+#ifdef CONFIG_TLS_DEVICE
+	rc = tls_set_device_offload(sk, ctx);
+	prot = &tls_prots[TLS_HW_TX];
+	if (rc) {
+#else
+	{
+#endif
+		/* if HW offload fails fallback to SW */
+		rc = tls_set_sw_offload(sk, ctx);
+		prot = &tls_prots[TLS_SW_TX];
+		if (rc)
+			goto err_crypto_info;
+	}
 
 	sk->sk_prot = prot;
 	goto out;
@@ -452,7 +468,8 @@ static int tls_init(struct sock *sk)
 	icsk->icsk_ulp_data = ctx;
 	ctx->setsockopt = sk->sk_prot->setsockopt;
 	ctx->getsockopt = sk->sk_prot->getsockopt;
-	sk->sk_prot = &tls_base_prot;
+
+	sk->sk_prot = &tls_prots[TLS_BASE_TX];
 out:
 	return rc;
 }
@@ -463,16 +480,27 @@ static int tls_init(struct sock *sk)
 	.init			= tls_init,
 };
 
-static int __init tls_register(void)
+static void build_protos(struct proto *prot, struct proto *base)
 {
-	tls_base_prot			= tcp_prot;
-	tls_base_prot.setsockopt	= tls_setsockopt;
-	tls_base_prot.getsockopt	= tls_getsockopt;
+	prot[TLS_BASE_TX] = *base;
+	prot[TLS_BASE_TX].setsockopt = tls_setsockopt;
+	prot[TLS_BASE_TX].getsockopt = tls_getsockopt;
+
+	prot[TLS_SW_TX] = prot[TLS_BASE_TX];
+	prot[TLS_SW_TX].close		= tls_sk_proto_close;
+	prot[TLS_SW_TX].sendmsg		= tls_sw_sendmsg;
+	prot[TLS_SW_TX].sendpage	= tls_sw_sendpage;
+
+#ifdef CONFIG_TLS_DEVICE
+	prot[TLS_HW_TX] = prot[TLS_SW_TX];
+	prot[TLS_HW_TX].sendmsg		= tls_device_sendmsg;
+	prot[TLS_HW_TX].sendpage	= tls_device_sendpage;
+#endif
+}
 
-	tls_sw_prot			= tls_base_prot;
-	tls_sw_prot.sendmsg		= tls_sw_sendmsg;
-	tls_sw_prot.sendpage            = tls_sw_sendpage;
-	tls_sw_prot.close               = tls_sk_proto_close;
+static int __init tls_register(void)
+{
+	build_protos(tls_prots, &tcp_prot);
 
 	tcp_register_ulp(&tcp_tls_ulp_ops);
 
@@ -482,6 +510,9 @@ static int __init tls_register(void)
 static void __exit tls_unregister(void)
 {
 	tcp_unregister_ulp(&tcp_tls_ulp_ops);
+#ifdef CONFIG_TLS_DEVICE
+	tls_device_cleanup();
+#endif
 }
 
 module_init(tls_register);
-- 
1.8.3.1

* Re: [PATCH net-next 5/5] tls: Add generic NIC offload infrastructure.
From: Hannes Frederic Sowa @ 2017-09-18 14:52 UTC
  To: Ilya Lesokhin; +Cc: netdev, davem, davejwatson, tom, borisp, aviadye, liranl

Ilya Lesokhin <ilyal@mellanox.com> writes:

> +/* We assume that the socket is already connected */
> +static struct net_device *get_netdev_for_sock(struct sock *sk)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct net_device *netdev = NULL;
> +
> +	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> +
> +	return netdev;
> +}

The user should be aware that they can't migrate the socket to
another interface if it got hw offloaded. This is not the case for
software offload. Thus I think the user has to opt in and it shouldn't
be a heuristic until we can switch back to the sw offload path.

Maybe change flowi_oif to sk_bound_dev_if and somehow lock it against
further changes if hw tls is in use?

Thanks,
Hannes

* RE: [PATCH net-next 5/5] tls: Add generic NIC offload infrastructure.
From: Ilya Lesokhin @ 2017-09-19  7:36 UTC
  To: Hannes Frederic Sowa
  Cc: netdev@vger.kernel.org, davem@davemloft.net, davejwatson@fb.com,
	tom@herbertland.com, Boris Pismenny, Aviad Yehezkel, Liran Liss

Hannes Frederic Sowa <hannes@stressinduktion.org> writes:

> The user should be aware that they can't migrate the socket to another
> interface if it got hw offloaded. This is not the case for software offload.
> Thus I think the user has to opt in and it shouldn't be a heuristic until we can
> switch back to the sw offload path.
> 
> Maybe change flowi_oif to sk_bound_dev_if and somehow lock it against
> further changes if hw tls is in use?
> 

I'm not sure I follow.
We do set sk->sk_bound_dev_if to prevent further changes.

Do you recommend we enable TLS offload only if SO_BINDTODEVICE	
was previously used on that socket?
and prevent even users with CAP_NET_RAW from unbinding it?

I would rather avoid requiring CAP_NET_RAW to use TLS offload. 
But admittedly I'm not sure setting sk->sk_bound_dev_if 
without CAP_NET_RAW like we do is legit either.

Finally, the reason we made HW offload the default is that the user
can use sudo ethtool -K enp0s4 tls-hw-tx-offload off to opt out of HW offload
and we currently don't have anything equivalent for opting out of SW KTLS.

Thanks,
Ilya

* Re: [PATCH net-next 5/5] tls: Add generic NIC offload infrastructure.
From: Hannes Frederic Sowa @ 2017-09-19  9:40 UTC
  To: Ilya Lesokhin
  Cc: netdev@vger.kernel.org, davem@davemloft.net, davejwatson@fb.com,
	tom@herbertland.com, Boris Pismenny, Aviad Yehezkel, Liran Liss

Hello,

Ilya Lesokhin <ilyal@mellanox.com> writes:

> Hannes Frederic Sowa <hannes@stressinduktion.org> writes:
>
>> The user should be aware that they can't migrate the socket to another
>> interface if it got hw offloaded. This is not the case for software offload.
>> Thus I think the user has to opt in and it shouldn't be a heuristic until we can
>> switch back to the sw offload path.
>> 
>> Maybe change flowi_oif to sk_bound_dev_if and somehow lock it against
>> further changes if hw tls is in use?
>> 
>
> I'm not sure I follow.
> We do set sk->sk_bound_dev_if to prevent further changes.
>
> Do you recommend we enable TLS offload only if SO_BINDTODEVICE	
> was previously used on that socket?
> and prevent even users with CAP_NET_RAW from unbinding it?
>
> I would rather avoid requiring CAP_NET_RAW to use TLS offload. 
> But admittedly I'm not sure setting sk->sk_bound_dev_if 
> without CAP_NET_RAW like we do is legit either.
>
> Finally, the reason we made HW offload the default is that the user
> can use sudo ethtool -K enp0s4 tls-hw-tx-offload off to opt out of HW offload
> and we currently don't have anything equivalent for opting out of SW KTLS.

IMHO the decision whether a TCP flow should be bound to hw, and thus
never push traffic to another interface, should be a decision the
administrator and the application opt in to. You might have your
management application which is accessible over multiple interfaces and
your production application which might want to use hw offloaded tls.
Thus I don't think a single ethtool knob will do it.

I agree that SO_BINDTODEVICE is bad for this use case. First, the
CAP_NET_RAW limitation seems annoying and we don't want to enforce TLS
apps to have this capability. Second, the user space application doesn't
care which interface it should talk to (maybe?) but leave the routing
decision to the kernel and just opt in to TLS. SO_BINDTODEVICE doesn't
allow this.

sk_bound_dev_if can be rebound later with CAP_NET_RAW privileges, will
this be a problem?

Have you thought how the user space will configure the various
offloading features (sw, hw, none)? Will it in e.g. OpenSSL be part of
the Cipher Spec or will there be new functions around SSL_CTX to do so?

Maybe an enhancement of the TLS_TX setsockopt with a boolean for hw
offload is a solution?

Another question:

How is the dependency management done between socket layer and driver
layer? It seems a bit cyclic but judging from this code you don't hold
references to the device (dev_hold) (which is good, you don't want to
have users creating refs to devices). OTOH you somehow need to match
sockets from the device layer up to the socket. Will those be reference
counted or does that work without?

Thanks,
Hannes

* RE: [PATCH net-next 5/5] tls: Add generic NIC offload infrastructure.
From: Boris Pismenny @ 2017-09-19 14:02 UTC
  To: Hannes Frederic Sowa, Ilya Lesokhin
  Cc: netdev@vger.kernel.org, davem@davemloft.net, davejwatson@fb.com,
	tom@herbertland.com, Aviad Yehezkel, Liran Liss

Hello,

Hannes Frederic Sowa <hannes@stressinduktion.org> writes:
> Hello,
> 
> Ilya Lesokhin <ilyal@mellanox.com> writes:
> 
> > Hannes Frederic Sowa <hannes@stressinduktion.org> writes:
> >
> >> The user should be aware that they can't migrate the socket to
> >> another interface if it got hw offloaded. This is not the case for
> >> software offload.
> >> Thus I think the user has to opt in and it shouldn't be a heuristic
> >> until we can switch back to the sw offload path.
> >>
> >> Maybe change flowi_oif to sk_bound_dev_if and somehow lock it against
> >> further changes if hw tls is in use?
> >>
> >
> > I'm not sure I follow.
> > We do set sk->sk_bound_dev_if to prevent further changes.
> >
> > Do you recommend we enable TLS offload only if SO_BINDTODEVICE
> > was previously used on that socket?
> > and prevent even users with CAP_NET_RAW from unbinding it?
> >
> > I would rather avoid requiring CAP_NET_RAW to use TLS offload.
> > But admittedly I'm not sure setting sk->sk_bound_dev_if without
> > CAP_NET_RAW like we do is legit either.
> >
> > Finally, the reason we made HW offload the default is that the user
> > can use sudo ethtool -K enp0s4 tls-hw-tx-offload off to opt out of HW
> > offload and we currently don't have anything equivalent for opting out of
> SW KTLS.
> 
> IMHO the decision whether a TCP flow should be bound to hw, and thus
> never push traffic to another interface, should be a decision the
> administrator and the application opt in to. You might have your
> management application which is accessible over multiple interfaces and
> your production application which might want to use hw offloaded tls.
> Thus I don't think a single ethtool knob will do it.

IMO the configuration knob should be at the kTLS level and not at the
HW vs. SW level. The management application shouldn't be using kTLS.
I'd like to view TLS offload similarly to LSO. The default is opt-in if
possible, and the Kernel decides that based on device capabilities.

> 
> I agree that SO_BINDTODEVICE is bad for this use case. First, the
> CAP_NET_RAW limitation seems annoying and we don't want to enforce TLS
> apps to have this capability. Second, the user space application doesn't care
> which interface it should talk to (maybe?) but leave the routing decision to
> the kernel and just opt in to TLS. SO_BINDTODEVICE doesn't allow this.
> 
> sk_bound_dev_if can be rebound later with CAP_NET_RAW privileges, will
> this be a problem?

Yes it is a problem and we have some ideas for a software fallback that should
catch this. 

Is the software fallback a prerequisite for kTLS offload in Kernel?

> 
> Have you thought how the user space will configure the various offloading
> features (sw, hw, none)? Will it in e.g. OpenSSL be part of the Cipher Spec or
> will there be new functions around SSL_CTX to do so?
> 
> Maybe an enhancement of the TLS_TX setsockopt with a boolean for hw
> offload is a solution?

Yes, we think that OpenSSL should first configure whether it complies with
kTLS support. Next, we thought of using an environment variable to control
kTLS globally in OpenSSL as follows:
1. only software kTLS
2. only hardware kTLS - no fallback to software.
3. Try to use hardware kTLS and if it isn't supported fallback to software kTLS.

The above is something we plan for the future, assuming that kTLS wouldn't fit for
all use-cases. What do you think?

If you'd like to have more fine-grained control of kTLS, e.g. per socket,
then the application would need to be modified to configure that,
which is something we try to avoid.

> 
> Another question:
> 
> How is the dependency management done between socket layer and driver
> layer? It seems a bit cyclic but judging from this code you don't hold
> references to the device (dev_hold) (which is good, you don't want to have
> users creating refs to devices). OTOH you somehow need to match sockets
> from the device layer up to the socket. Will those be reference counted or
> does that work without?

Not sure I follow your question.
We use the socket from the device layer through the SKB that carries it,
so I think it should work without.
We don't attempt to perform a socket lookup or anything of this sort.

* Re: [PATCH net-next 5/5] tls: Add generic NIC offload infrastructure.
From: Hannes Frederic Sowa @ 2017-09-20 15:16 UTC
  To: Boris Pismenny
  Cc: Ilya Lesokhin, netdev@vger.kernel.org, davem@davemloft.net,
	davejwatson@fb.com, tom@herbertland.com, Aviad Yehezkel,
	Liran Liss

Hello,

Boris Pismenny <borisp@mellanox.com> writes:

> Hello,
>
> Hannes Frederic Sowa <hannes@stressinduktion.org> writes:
>> Hello,
>> 
>> Ilya Lesokhin <ilyal@mellanox.com> writes:
>> 
>> > Hannes Frederic Sowa <hannes@stressinduktion.org> writes:
>> >
>> >> The user should be aware that they can't migrate the socket to
>> >> another interface if it got hw offloaded. This is not the case for
>> >> software offload.
>> >> Thus I think the user has to opt in and it shouldn't be a heuristic
>> >> until we can switch back to the sw offload path.
>> >>
>> >> Maybe change flowi_oif to sk_bound_dev_if and somehow lock it against
>> >> further changes if hw tls is in use?
>> >>
>> >
>> > I'm not sure I follow.
>> > We do set sk->sk_bound_dev_if to prevent further changes.
>> >
>> > Do you recommend we enable TLS offload only if SO_BINDTODEVICE
>> > was previously used on that socket?
>> > and prevent even users with CAP_NET_RAW from unbinding it?
>> >
>> > I would rather avoid requiring CAP_NET_RAW to use TLS offload.
>> > But admittedly I'm not sure setting sk->sk_bound_dev_if without
>> > CAP_NET_RAW like we do is legit either.
>> >
>> > Finally, the reason we made HW offload the default is that the user
>> > can use sudo ethtool -K enp0s4 tls-hw-tx-offload off to opt out of HW
>> > offload and we currently don't have anything equivalent for opting out of
>> SW KTLS.
>> 
>> IMHO the decision whether a TCP flow should be bound to hw, and thus
>> never push traffic to another interface, should be a decision the
>> administrator and the application opt in to. You might have your
>> management application which is accessible over multiple interfaces and
>> your production application which might want to use hw offloaded tls.
>> Thus I don't think a single ethtool knob will do it.
>
> IMO the configuration knob should be at the kTLS level and not at the
> HW vs. SW level. The management application shouldn't be using kTLS.
> I'd like to view TLS offload similarly to LSO. The default is opt-in if
> possible, and the Kernel decides that based on device capabilities.
>
>> 
>> I agree that SO_BINDTODEVICE is bad for this use case. First, the
>> CAP_NET_RAW limitation seems annoying and we don't want to enforce TLS
>> apps to have this capability. Second, the user space application doesn't care
>> which interface it should talk to (maybe?) but leave the routing decision to
>> the kernel and just opt in to TLS. SO_BINDTODEVICE doesn't allow this.
>> 
>> sk_bound_dev_if can be rebound later with CAP_NET_RAW privileges, will
>> this be a problem?
>
> Yes it is a problem and we have some ideas for a software fallback that should
> catch this. 

Ok.

> Is the software fallback a prerequisite for kTLS offload in Kernel?

I don't know. I would assume yes, because it will change how the uAPI
will look?

>> 
>> Have you thought how the user space will configure the various offloading
>> features (sw, hw, none)? Will it in e.g. OpenSSL be part of the Cipher Spec or
>> will there be new functions around SSL_CTX to do so?
>> 
>> Maybe an enhancement of the TLS_TX setsockopt with a boolean for hw
>> offload is a solution?
>
> Yes, we think that OpenSSL should first configure whether it complies with
> kTLS support. Next, we thought of using an environment variable to control
> kTLS globally in OpenSSL as follows:

0. no kernel tls at all but use e.g. OpenSSL crypto code.

> 1. only software kTLS
> 2. only hardware kTLS - no fallback to software.
> 3. Try to use hardware kTLS and if it isn't supported fallback to
> software kTLS.

Hmm, an environment variable and global control contradict each other. ;)

In some form or another there is a need to have all options for
debugging. I also wonder if it makes sense to disable ktls based on
reordering and fast path vs. slow path hit ratio. But that is something
to think about later.

> The above is something we plan for the future, assuming that kTLS
> wouldn't fit for all use-cases. What do you think?
>
> If you'd like to have more fine-grained control of kTLS, e.g. per socket,
> then the application would need to be modified to configure that,
> which is something we try to avoid.

That is why I proposed signaling over ciphers(1) for openssl. If you
e.g. look at apache/mod_ssl, they loop the cipher list from the
configuration file directly to OpenSSL. Same for a lot of other web
servers, nginx etc. Thus you just need to modify openssl and don't need
to touch the users of the library.

E.g. in Fedora/RHEL the crypto libs load a default cipher list from
/etc/crypto-policies/, which you can update centrally with
update-crypto-policies. Maybe the kTLS switches fit nicely in there?

To do that, OpenSSL still needs more fine-grained control over
which kTLS (sw/hw) to use, right?

>> 
>> Another question:
>> 
>> How is the dependency management done between socket layer and driver
>> layer? It seems a bit cyclic but judging from this code you don't hold
>> references to the device (dev_hold) (which is good, you don't want to have
>> users creating refs to devices). OTOH you somehow need to match sockets
>> from the device layer up to the socket. Will those be reference counted or
>> does that work without?
>
> Not sure I follow your question.
> We use the socket from the device layer through the SKB that carries it,
> so I think it should work without.
> We don't attempt to perform a socket lookup or anything of this sort.

The socket from the skb is only valid as long as you have the skb.
Basically the question is: do you ever increase the ref counter of
sockets from the device drivers?

Thanks,
Hannes
