Netdev List
* Re: [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb
From: Yonghong Song @ 2018-03-21 20:10 UTC
  To: Alexander Duyck
  Cc: Eric Dumazet, ast, Daniel Borkmann, diptanu, Netdev, Kernel Team
In-Reply-To: <CAKgT0UftYHNnmkVX64NfuvHDmdjghVnN7bQCoenAOD3dK61gOg@mail.gmail.com>



On 3/21/18 7:59 AM, Alexander Duyck wrote:
> On Tue, Mar 20, 2018 at 10:02 PM, Yonghong Song <yhs@fb.com> wrote:
>>
>>
>> On 3/20/18 4:50 PM, Alexander Duyck wrote:
>>>
>>> On Tue, Mar 20, 2018 at 4:21 PM, Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>>>> function skb_segment(), line 3667. The bpf program attaches to
>>>> clsact ingress, calls bpf_skb_change_proto to change protocol
>>>> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
>>>> to send the changed packet out.
>>>>
>>>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>> 3473                             netdev_features_t features)
>>>> 3474 {
>>>> 3475         struct sk_buff *segs = NULL;
>>>> 3476         struct sk_buff *tail = NULL;
>>>> ...
>>>> 3665                 while (pos < offset + len) {
>>>> 3666                         if (i >= nfrags) {
>>>> 3667                                 BUG_ON(skb_headlen(list_skb));
>>>> 3668
>>>> 3669                                 i = 0;
>>>> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
>>>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>>>> 3672                                 frag_skb = list_skb;
>>>> ...
>>>>
>>>> call stack:
>>>> ...
>>>>    #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>>>    #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>>>    #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>>>    #4 [ffff883ffef03668] die at ffffffff8101deb2
>>>>    #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>>>    #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>>>    #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>>>    #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>>>       [exception RIP: skb_segment+3044]
>>>>       RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>>>       RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>>>       RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>>>       RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>>>       R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>>>       R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>>>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>>    #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>>>> --- <IRQ stack> ---
>>>> ...
>>>>
>>>> The triggering input skb has the following properties:
>>>>       list_skb = skb->frag_list;
>>>>       skb->nfrags != NULL && skb_headlen(list_skb) != 0
>>>> and skb_segment() is not able to handle a frag_list skb
>>>> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>>>>
>>>> This patch addresses the issue by handling skb_headlen(list_skb) != 0
>>>> case properly if list_skb->head_frag is true, which is expected in
>>>> most cases. The head frag is processed before list_skb->frags
>>>> are processed.
>>>>
>>>> Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>> ---
>>>>    net/core/skbuff.c | 51 +++++++++++++++++++++++++++++++++++++--------------
>>>>    1 file changed, 37 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 715c134..59bbc06 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>>           struct sk_buff *segs = NULL;
>>>>           struct sk_buff *tail = NULL;
>>>>           struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>>>> -       skb_frag_t *frag = skb_shinfo(head_skb)->frags;
>>>> +       skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag = NULL;
>>>
>>>
>>> I think you misunderstood me. I wasn't saying you allocate head_frag.
>>> I was saying you could move the declaration down.
>>
>>
>> Sorry for the misunderstanding. I did understand that your intention in
>> moving the declaration down was to save stack space. I thought we could
>> not really move the declaration down (it works in C, but semantically it is
>> not quite right; more on that later), so I moved on to runtime allocation.
>> But indeed skb_frag_t is not big (16 bytes), so it can live on the stack.
>>
>>>
>>>>           unsigned int mss = skb_shinfo(head_skb)->gso_size;
>>>>           unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
>>>>           struct sk_buff *frag_skb = head_skb;
>>>> @@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>>
>>>>                   while (pos < offset + len) {
>>>
>>>
>>> So right here in the loop you could add a "skb_frag_t head_frag;" just
>>> so we declare it here and save ourselves the stack space.
>>
>>
>> I actually tried moving "skb_frag_t head_frag" down. The stack size remains
>> the same, 0xc0. This is related to how the C compiler allocates stack space:
>> where a variable is declared does not decide the stack size, as long as the
>> declaration covers its uses. The stack size is really determined by
>> liveness analysis.
>>
>> Further, we have code like:
>>          do {
>>             ....
>>             while (pos < offset + len) {
>>                 if (i >= nfrags) {
>>                     ...
>>                     head_frag = ...
>>                 }
>>                 ... = head_frag; // head_frag access guaranteed after
>>                                  // above definition, but it may not
>>                                  // be in the same outer do-while loop.
>>             }
>>             ...
>>           } while ((offset += len) < head_skb->len);
>>
>> So the definition and the use of head_frag may be in different outer loop iterations.
>> So I feel the definition of head_frag should be outside the
>> outer do-while loop, which is the main function scope. I will add some
>> comments here.
> 
> So the point I had is that head_frag doesn't need to live that long.
> All you are doing is arranging the data so that you can essentially
> just dump it into *nskb_frag.
> 
> One alternative you could look at would be to rearrange the code so
> that the MAX_SKB_FRAGS check occurs first, then perform the check for
> (i >= nfrags). Doing it that way we end up bypassing the need for
> head_frag entirely as you could just populate *nskb_frag directly. You
> could probably just add an inline function that would convert the
> head frag into a skb_frag_t structure and return that. Something along
> the lines of:
>          *nskb_frag = (i < 0) ? skb_head_frag_to_page_desc(frag_skb) : *frag;

This mechanism (inline function skb_head_frag_to_page_desc) works great!
I removed the local variable and the total stack size remains unchanged
with the patch.
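
For readers following along, here is a rough userspace sketch of the idea, not the code from the patch. The mock_* types and helpers are hypothetical stand-ins for struct page, skb_frag_t and struct sk_buff, and the assert() only mirrors the BUG_ON(!list_skb->head_frag) shown in the quoted hunk. The point is that the head area is described as a fragment and returned by value, so no long-lived local or allocation is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for struct page, skb_frag_t and sk_buff. */
struct mock_page {
	unsigned char *addr;         /* start of the page memory, like page_address() */
};

struct mock_frag {
	struct mock_page *page;
	unsigned int page_offset;    /* byte offset of the data within the page */
	unsigned int size;           /* number of data bytes */
};

struct mock_skb {
	struct mock_page *head_page; /* page backing the linear (head) area */
	unsigned char *data;         /* start of the linear data */
	unsigned int len;            /* total length */
	unsigned int data_len;       /* length held in paged frags */
	int head_frag;               /* non-zero if the head is page-backed */
};

/* Same arithmetic as skb_headlen(): len - data_len. */
static inline unsigned int mock_headlen(const struct mock_skb *skb)
{
	return skb->len - skb->data_len;
}

/* Describe the head area as a fragment and return it by value. */
static inline struct mock_frag
mock_head_frag_to_page_desc(const struct mock_skb *skb)
{
	struct mock_frag f;

	assert(skb->head_frag);      /* stands in for BUG_ON(!head_frag) */
	f.page = skb->head_page;
	f.page_offset = (unsigned int)(skb->data - skb->head_page->addr);
	f.size = mock_headlen(skb);
	return f;
}

/* Pick the source fragment: i < 0 means "use the head", else frags[i]. */
static inline struct mock_frag
mock_pick_frag(const struct mock_skb *skb, const struct mock_frag *frags, int i)
{
	return (i < 0) ? mock_head_frag_to_page_desc(skb) : frags[i];
}
```

With i rewound to -1 for a frag_list skb that has linear data, the first pick yields the head descriptor and later picks fall through to the regular frags array.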

> 
> Doing it that way you could pull out the bits below that were
> populating head frag and just have it all handled as an inline
> function. An added advantage is that it should make the line wrapping
> easier to deal with. :-)
> 
>>>
>>>>                           if (i >= nfrags) {
>>>> -                               BUG_ON(skb_headlen(list_skb));
>>>> -
>>>>                                   i = 0;
>>>> +                               if (skb_headlen(list_skb)) {
>>>> +                                       struct page *page;
>>>> +
>>>> +                                       BUG_ON(!list_skb->head_frag);
>>>> +
>>>> +                                       page = virt_to_head_page(list_skb->head);
>>>> +                                       if (!head_frag) {
>>>> +                                               head_frag = kmalloc(sizeof(skb_frag_t),
>>>> +                                                                   GFP_KERNEL);
>>>> +                                               if (!head_frag)
>>>> +                                                       goto err;
>>>> +                                       }
>>>
>>>
>>> Please no memory allocation. I just meant you could allocate it on the
>>> stack later.
>>>
>>>> +                                       head_frag->page.p = page;
>>>> +                                       head_frag->page_offset = list_skb->data -
>>>> +                                               (unsigned char *)page_address(page);
>>>> +                                       head_frag->size = skb_headlen(list_skb);
>>>> +                                       /* set i = -1 so we will pick head_frag
>>>> +                                        * instead of skb_shinfo(list_skb)->frags
>>>> +                                        * when i == -1.
>>>> +                                        */
>>>> +                                       i = -1;
>>>> +                               }
>>>
>>>
>>> So it took me a bit to pick up on the fact that line below wasn't
>>> removed. So we are basically trying to do this all in one pass now. Do
>>> I have that right?
>>>
>>> One thing you could look at doing to save yourself the extra "if"
>>> later would be to pull frag pointer before you go through skb_headlen
>>> check above. Then if you are going to use a head_frag you could just
>>> do a "i--; frag--;" combination just to rewind and make the room for
>>> the increment to come later. That way you don't have an invalid frag
>>> pointer floating around. That way you only have to do this once
>>> instead of having to do a conditional check per fragment.
>>
>>
>> Right. This indeed makes the code cleaner.
>>
>>>
>>>>                                   nfrags = skb_shinfo(list_skb)->nr_frags;
>>>> -                               frag = skb_shinfo(list_skb)->frags;
>>>
>>>
>>> This patch might be more readable if you were to just insert the
>>> skb_headlen() bits down here and left the i=0 through frag = .. in one
>>> piece.
>>
>>
>> Right. Will implement as suggested.
>>
>>>
>>>> -                               frag_skb = list_skb;
>>>> -
>>>> -                               BUG_ON(!nfrags);
>>>> -
>>>> -                               if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>>>> -                                   skb_zerocopy_clone(nskb, frag_skb,
>>>> -                                                      GFP_ATOMIC))
>>>> -                                       goto err;
>>>> +                               if (nfrags) {
>>>> +                                       frag = skb_shinfo(list_skb)->frags;
>>>> +                                       frag_skb = list_skb;
>>>> +
>>>> +                                       if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>>>> +                                           skb_zerocopy_clone(nskb, frag_skb,
>>>> +                                                              GFP_ATOMIC))
>>>> +                                               goto err;
>>>> +                               }
>>>>
>>>>                                   list_skb = list_skb->next;
>>>>                           }
>>>> @@ -3689,7 +3709,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>>                                   goto err;
>>>>                           }
>>>>
>>>> -                       *nskb_frag = *frag;
>>>> +                       *nskb_frag = (i == -1) ? *head_frag : *frag;
>>>
>>>
>>> So this would be better as "*nskb_frag = (i < 0) ? head_frag : *frag;".
>>
>>
>> Good suggestion. Will implement as suggested.
>>
>>
>>>
>>>>                           __skb_frag_ref(nskb_frag);
>>>>                           size = skb_frag_size(nskb_frag);
>>>>
>>>> @@ -3702,7 +3722,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>>
>>>>                           if (pos + size <= offset + len) {
>>>>                                   i++;
>>>> -                               frag++;
>>>> +                               if (i != 0)
>>>> +                                       frag++;
>>>>                                   pos += size;
>>>>                           } else {
>>>>                                   skb_frag_size_sub(nskb_frag, pos + size - (offset + len));
>>>> @@ -3774,10 +3795,12 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>>                   swap(tail->destructor, head_skb->destructor);
>>>>                   swap(tail->sk, head_skb->sk);
>>>>           }
>>>> +       kfree(head_frag);
>>>>           return segs;
>>>>
>>>>    err:
>>>>           kfree_skb_list(segs);
>>>> +       kfree(head_frag);
>>>>           return ERR_PTR(err);
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(skb_segment);
>>>> --
>>>> 2.9.5
>>>>
>>


* Re: [PATCH net-next v4 2/2] net: bpf: add a test for skb_segment in test_bpf module
From: Yonghong Song @ 2018-03-21 20:15 UTC
  To: Eric Dumazet, edumazet, ast, daniel, diptanu, netdev; +Cc: kernel-team
In-Reply-To: <435477a4-bf2d-f710-a1a0-f1957eb01f7a@gmail.com>



On 3/21/18 8:26 AM, Eric Dumazet wrote:
> 
> 
> On 03/20/2018 11:47 PM, Yonghong Song wrote:
>> +static __init int test_skb_segment(void)
>> +{
>> +	netdev_features_t features;
>> +	struct sk_buff *skb;
>> +	int ret = -1;
>> +
>> +	features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
>> +		   NETIF_F_IPV6_CSUM;
>> +	features |= NETIF_F_RXCSUM;
>> +	skb = build_test_skb();
>> +	if (!skb) {
>> +		pr_info("%s: failed to build_test_skb", __func__);
>> +		goto done;
>> +	}
>> +
>> +	if (skb_segment(skb, features)) {
>> +		ret = 0;
>> +		pr_info("%s: success in skb_segment!", __func__);
>> +	} else {
>> +		pr_info("%s: failed in skb_segment!", __func__);
>> +	}
>> +	kfree_skb(skb);
> 
> If skb_segment() was successful, the (original) skb was already freed.
> 
> kfree_skb(old_skb) should thus panic the box, if you run this code
> on a kernel having some debugging features like KASAN

I tried with KASAN. It does not panic.
Looking at the code in net/core/dev.c: validate_xmit_skb:

static struct sk_buff *validate_xmit_skb(struct sk_buff *skb,
                                         struct net_device *dev, bool *again)
...

         if (netif_needs_gso(skb, features)) {
                 struct sk_buff *segs;

                 segs = skb_gso_segment(skb, features);
                 if (IS_ERR(segs)) {
                         goto out_kfree_skb;
                 } else if (segs) {
                         consume_skb(skb);
                         skb = segs;
                 }
...
out_kfree_skb:
         kfree_skb(skb);

This also indicates that kfree_skb()/consume_skb() is probably the right way
to free the original skb after skb_gso_segment()/skb_segment().

This probably explains why my kfree_skb(skb) above does not crash.

> 
> So you must store in a variable the return of skb_segment(),
> to be able to free skb(s), using kfree_skb_list()

Totally agree. Will make the change. Thanks!
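
To make the ownership rule concrete, here is a small userspace sketch under stated assumptions. The mock_* names are hypothetical and only model the list handling: mock_segment() stands in for skb_segment() and leaves its input alone, and mock_free_list() plays the role of kfree_skb_list(). One deliberate simplification is that the mock signals failure with NULL, whereas the validate_xmit_skb() excerpt above checks the result with IS_ERR().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff: only the list link matters here. */
struct mock_skb {
	struct mock_skb *next;
};

static int mock_live;   /* number of mock_skb objects currently allocated */

static struct mock_skb *mock_alloc(void)
{
	struct mock_skb *skb = calloc(1, sizeof(*skb));

	if (skb)
		mock_live++;
	return skb;
}

static void mock_free(struct mock_skb *skb)
{
	if (skb) {
		mock_live--;
		free(skb);
	}
}

/* Free a whole chain of segments, in the spirit of kfree_skb_list(). */
static void mock_free_list(struct mock_skb *segs)
{
	while (segs) {
		struct mock_skb *next = segs->next;

		mock_free(segs);
		segs = next;
	}
}

/* Stand-in for skb_segment(): builds n new segments, input is untouched. */
static struct mock_skb *mock_segment(const struct mock_skb *skb, int n)
{
	struct mock_skb *head = NULL;

	(void)skb;
	while (n-- > 0) {
		struct mock_skb *seg = mock_alloc();

		if (!seg) {
			mock_free_list(head);
			return NULL;
		}
		seg->next = head;
		head = seg;
	}
	return head;
}

/* Test-style caller: keep the returned list so it can be freed too. */
static int mock_test_segment(int n)
{
	struct mock_skb *skb = mock_alloc();
	struct mock_skb *segs;
	int ret = -1;

	if (!skb)
		return ret;
	segs = mock_segment(skb, n);
	if (segs) {
		mock_free_list(segs);   /* the segments are ours to release */
		ret = 0;
	}
	mock_free(skb);                 /* and so is the original */
	return ret;
}
```

Dropping the stored pointer, as the version under review does by testing the return value directly in the if, would leave every segment allocated.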

> 
> 
>> +done:
>> +	return ret;
>> +}
>> +


* Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation
From: Marcelo Ricardo Leitner @ 2018-03-21 19:23 UTC
  To: Tal Gilboa; +Cc: David S. Miller, netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <1521657225-65392-1-git-send-email-talgi@mellanox.com>

On Wed, Mar 21, 2018 at 08:33:45PM +0200, Tal Gilboa wrote:
> Net DIM is a generic algorithm, purposed for dynamically
> optimizing network devices interrupt moderation. This
> document describes how it works and how to use it.
> 
> Signed-off-by: Tal Gilboa <talgi@mellanox.com>
> ---
>  Documentation/networking/net_dim.txt | 174 +++++++++++++++++++++++++++++++++++
>  1 file changed, 174 insertions(+)
>  create mode 100644 Documentation/networking/net_dim.txt
> 
> diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt
> new file mode 100644
> index 0000000..9cb31c5
> --- /dev/null
> +++ b/Documentation/networking/net_dim.txt
> @@ -0,0 +1,174 @@
> +Net DIM - Generic Network Dynamic Interrupt Moderation
> +======================================================
> +
> +Author:
> +	Tal Gilboa <talgi@mellanox.com>
> +
> +
> +Contents
> +=========
> +
> +- Assumptions
> +- Introduction
> +- The Net DIM Algorithm
> +- Registering a Network Device to DIM
> +- Example
> +
> +Part 0: Assumptions
> +======================
> +
> +This document assumes the reader has basic knowledge in network drivers
> +and in general interrupt moderation.
> +
> +
> +Part I: Introduction
> +======================
> +
> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
> +interrupt moderation configuration of a channel in order to optimize packet
> +processing. The mechanism includes an algorithm which decides if and how to
> +change moderation parameters for a channel, usually by performing an analysis on
> +runtime data sampled from the system. Net DIM is such a mechanism. In each
> +iteration of the algorithm, it analyses a given sample of the data, compares it
> +to the previous sample and if required, it can decide to change some of the
> +interrupt moderation configuration fields. The data sample is composed of data
> +bandwidth, the number of packets and the number of events. The time between
> +samples is also measured. Net DIM compares the current and the previous data and
> +returns an adjusted interrupt moderation configuration object. In some cases,
> +the algorithm might decide not to change anything. The configuration fields are
> +the minimum duration (microseconds) allowed between events and the maximum
> +number of wanted packets per event. The Net DIM algorithm ascribes importance to
> +increase bandwidth over reducing interrupt rate.
> +
> +
> +Part II: The Net DIM Algorithm
> +===============================
> +
> +Each iteration of the Net DIM algorithm follows these steps:
> +1. Calculates new data sample.
> +2. Compares it to previous sample.
> +3. Makes a decision - suggests interrupt moderation configuration fields.
> +4. Applies a schedule work function, which applies suggested configuration.
> +
> +The first two steps are straightforward, both the new and the previous data are
> +supplied by the driver registered to Net DIM. The previous data is the new data
> +supplied to the previous iteration. The comparison step checks the difference
> +between the new and previous data and decides on the result of the last step.
> +A step would result as "better" if bandwidth increases and as "worse" if
> +bandwidth reduces. If there is no change in bandwidth, the packet rate is
> +compared in a similar fashion - increase == "better" and decrease == "worse".
> +In case there is no change in the packet rate as well, the interrupt rate is
> +compared. Here the algorithm tries to optimize for lower interrupt rate so an
> +increase in the interrupt rate is considered "worse" and a decrease is
> +considered "better". Step #2 has an optimization for avoiding false results: it
> +only considers a difference between samples as valid if it is greater than a
> +certain percentage. Also, since Net DIM does not measure anything by itself, it
> +assumes the data provided by the driver is valid.
> +
> +Step #3 decides on the suggested configuration based on the result from step #2
> +and the internal state of the algorithm. The states reflect the "direction" of
> +the algorithm: is it going left (reducing moderation), right (increasing
> +moderation) or standing still. Another optimization is that if a decision
> +to stay still is made multiple times, the interval between iterations of the
> +algorithm would increase in order to reduce calculation overhead. Also, after

I wonder if this increased interval can lead to packet drops due to
some impulse? Like, the card is receiving a low volume of packets and
suddenly a new flow starts at line rate, for example. If the max
interval is not too aggressive, this wouldn't be a problem.

(sorry, I didn't read much of the implementation nor the drivers
already using it)
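
As an aside, the comparison step described in the quoted text further up (bandwidth first, then packet rate, then interrupt rate, each only if the change exceeds some percentage) could be sketched roughly as below. This is a reading of the document text, not the in-tree implementation: the type and function names are made up, the threshold is passed in because the document only says "a certain percentage", and treating a zero previous value as significant whenever the new value is non-zero is an assumption of this sketch.

```c
#include <assert.h>

enum cmp_result { CMP_SAME, CMP_BETTER, CMP_WORSE };

/* Hypothetical per-sample rates; field names are illustrative only. */
struct rate_sample {
	unsigned long bytes_rate;    /* bandwidth */
	unsigned long packets_rate;  /* packet rate */
	unsigned long events_rate;   /* interrupt rate */
};

/* A change counts only if it exceeds pct percent of the previous value. */
static int significant_diff(unsigned long cur, unsigned long prev,
			    unsigned long pct)
{
	unsigned long diff = cur > prev ? cur - prev : prev - cur;

	if (prev == 0)
		return cur != 0;     /* assumption: any change from zero counts */
	return (100UL * diff) / prev > pct;
}

static enum cmp_result compare_samples(const struct rate_sample *cur,
				       const struct rate_sample *prev,
				       unsigned long pct)
{
	/* Bandwidth has priority: more is better. */
	if (significant_diff(cur->bytes_rate, prev->bytes_rate, pct))
		return cur->bytes_rate > prev->bytes_rate ? CMP_BETTER : CMP_WORSE;

	/* Then packet rate: more is better. */
	if (significant_diff(cur->packets_rate, prev->packets_rate, pct))
		return cur->packets_rate > prev->packets_rate ? CMP_BETTER : CMP_WORSE;

	/* Finally interrupt rate: fewer is better. */
	if (significant_diff(cur->events_rate, prev->events_rate, pct))
		return cur->events_rate < prev->events_rate ? CMP_BETTER : CMP_WORSE;

	return CMP_SAME;
}
```

Note the inverted sense of the last check: a lower interrupt rate at unchanged bandwidth and packet rate is the "better" outcome.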

> +"parking" on one of the most left or most right decisions, the algorithm may
> +decide to verify this decision by taking a step in the other direction. This is
> +done in order to avoid getting stuck in a "deep sleep" scenario. Once a
> +decision is made, an interrupt moderation configuration is selected from
> +the predefined profiles.
> +
> +The last step is to notify the registered driver that it should apply the
> +suggested configuration. This is done by scheduling a work function, defined by
> +the Net DIM API and provided by the registered driver.
> +
> +As you can see, Net DIM itself does not actively interact with the system. It
> +would have trouble making the correct decisions if the wrong data is supplied to
> +it and it would be useless if the work function would not apply the suggested
> +configuration. This does, however, allow the registered driver some room for
> +manoeuvre as it may provide partial data or ignore the algorithm suggestion
> +under some conditions.
> +
> +
> +Part III: Registering a Network Device to DIM
> +==============================================
> +
> +Net DIM API exposes the main function net_dim(struct net_dim *dim,
> +struct net_dim_sample end_sample). This function is the entry point to the Net
> +DIM algorithm and has to be called every time the driver would like to check if
> +it should change interrupt moderation parameters. The driver should provide two
> +data structures: struct net_dim and struct net_dim_sample. Struct net_dim
> +describes the state of DIM for a specific object (RX queue, TX queue,
> +other queues, etc.). This includes the current selected profile, previous data
> +samples, the callback function provided by the driver and more.
> +Struct net_dim_sample describes a data sample, which will be compared to the
> +data sample stored in struct net_dim in order to decide on the algorithm's next
> +step. The sample should include bytes, packets and interrupts, measured by
> +the driver.
> +
> +In order to use Net DIM from a networking driver, the driver needs to call the
> +main net_dim() function. The recommended method is to call net_dim() on each
> +interrupt. Since Net DIM has a built-in moderation and it might decide to skip
> +iterations under certain conditions, there is no need to moderate the net_dim()
> +calls as well. As mentioned above, the driver needs to provide an object of type
> +struct net_dim to the net_dim() function call. It is advised for each entity
> +using Net DIM to hold a struct net_dim as part of its data structure and use it
> +as the main Net DIM API object. The struct net_dim_sample should hold the latest
> +bytes, packets and interrupts count. No need to perform any calculations, just
> +include the raw data.
> +
> +The net_dim() call itself does not return anything. Instead Net DIM relies on
> +the driver to provide a callback function, which is called when the algorithm
> +decides to make a change in the interrupt moderation parameters. This callback
> +will be scheduled and run in a separate thread in order not to add overhead to
> +the data flow. After the work is done, Net DIM algorithm needs to be set to
> +the proper state in order to move to the next iteration.
> +
> +
> +Part IV: Example
> +=================
> +
> +The following code demonstrates how to register a driver to Net DIM. The actual
> +usage is not complete but it should make the outline of the usage clear.
> +
> +my_driver.c:
> +
> +#include <linux/net_dim.h>
> +
> +/* Callback for net DIM to schedule on a decision to change moderation */
> +void my_driver_do_dim_work(struct work_struct *work)
> +{
> +	/* Get struct net_dim from struct work_struct */
> +	struct net_dim *dim = container_of(work, struct net_dim,
> +					   work);
> +	/* Do interrupt moderation related stuff */
> +	...
> +
> +	/* Signal net DIM work is done and it should move to next iteration */
> +	dim->state = NET_DIM_START_MEASURE;

Doesn't this need some sort of order guarantee? This assignment cannot
happen before the complete usage of 'dim' struct, otherwise the
interrupt handler may overwrite stuff that is still being read.

Or maybe the two-step measurement (START_MEASURE ->
MEASURE_IN_PROGRESS) handles it? Since it will only overwrite something
on the next irq, it might not race with the work item. But as work items
can be preempted, this doesn't seem enough.
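
To illustrate the kind of guarantee I mean, here is a userspace C11 sketch. It is not how net DIM is implemented: the mock_* names and the simplified state machine are invented for the example. The work item finishes all its reads of the shared struct and only then publishes the state with a release store; the interrupt side uses an acquire load and leaves the fields alone until it sees START_MEASURE. In the kernel the analogous primitives would be smp_store_release() and smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplified, hypothetical state machine; names do not match net_dim.h. */
enum mock_dim_state {
	MOCK_START_MEASURE,
	MOCK_APPLY_NEW_PROFILE,
};

struct mock_dim {
	_Atomic int state;
	int profile_ix;        /* written by the irq side, read by the work item */
	int applied_profile;   /* what the work item last programmed */
};

/* Work item: finish every read of *dim first, then publish the new state. */
static void mock_dim_work(struct mock_dim *dim)
{
	dim->applied_profile = dim->profile_ix;
	atomic_store_explicit(&dim->state, MOCK_START_MEASURE,
			      memory_order_release);
}

/* Interrupt side: touch shared fields only after observing START_MEASURE. */
static int mock_dim_irq(struct mock_dim *dim, int new_profile)
{
	if (atomic_load_explicit(&dim->state, memory_order_acquire) !=
	    MOCK_START_MEASURE)
		return 0;      /* work still pending: leave the fields alone */

	dim->profile_ix = new_profile;
	atomic_store_explicit(&dim->state, MOCK_APPLY_NEW_PROFILE,
			      memory_order_release);
	return 1;
}
```

With a plain assignment to the state field, the compiler or CPU would be free to make the state change visible before the reads in the work function have completed, which is the window I am worried about.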

> +}
> +
> +/* My driver's interrupt handler */
> +int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* A struct to hold current measured data */
> +	struct net_dim_sample dim_sample;
> +	...
> +	/* Initiate data sample struct with current data */
> +	net_dim_sample(my_entity->events,
> +		       my_entity->packets,
> +		       my_entity->bytes,
> +		       &dim_sample);
> +	/* Call net DIM */
> +	net_dim(&my_entity->dim, dim_sample);
> +	...
> +}
> +
> +/* My entity's initialization function (my_entity was already allocated) */
> +int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* Initiate struct work_struct with my driver's callback function */
> +	INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
> +	...
> +}
> -- 
> 1.8.3.1
> 


* Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation
From: Marcelo Ricardo Leitner @ 2018-03-21 20:30 UTC
  To: Florian Fainelli
  Cc: Randy Dunlap, Tal Gilboa, David S. Miller, netdev@vger.kernel.org,
	Tariq Toukan
In-Reply-To: <f2e4e187-0123-7840-902b-a0af045c0f4a@gmail.com>

On Wed, Mar 21, 2018 at 12:44:29PM -0700, Florian Fainelli wrote:
> On 03/21/2018 12:37 PM, Randy Dunlap wrote:
> > On 03/21/2018 11:33 AM, Tal Gilboa wrote:
> >> Net DIM is a generic algorithm, purposed for dynamically
> >> optimizing network devices interrupt moderation. This
> >> document describes how it works and how to use it.
> >>
> >> Signed-off-by: Tal Gilboa <talgi@mellanox.com>
> >> ---
> >>  Documentation/networking/net_dim.txt | 174 +++++++++++++++++++++++++++++++++++
> >>  1 file changed, 174 insertions(+)
> >>  create mode 100644 Documentation/networking/net_dim.txt
> >>
> >> diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt
> >> new file mode 100644
> >> index 0000000..9cb31c5
> >> --- /dev/null
> >> +++ b/Documentation/networking/net_dim.txt
> >> @@ -0,0 +1,174 @@
> >> +Net DIM - Generic Network Dynamic Interrupt Moderation
> >> +======================================================
> >> +
> >> +Author:
> >> +	Tal Gilboa <talgi@mellanox.com>
> >> +
> >> +
> >> +Contents
> >> +=========
> >> +
> >> +- Assumptions
> >> +- Introduction
> >> +- The Net DIM Algorithm
> >> +- Registering a Network Device to DIM
> >> +- Example
> >> +
> >> +Part 0: Assumptions
> >> +======================
> >> +
> >> +This document assumes the reader has basic knowledge in network drivers
> >> +and in general interrupt moderation.
> >> +
> >> +
> >> +Part I: Introduction
> >> +======================
> >> +
> >> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
> >> +interrupt moderation configuration of a channel in order to optimize packet
> >> +processing. The mechanism includes an algorithm which decides if and how to
> >> +change moderation parameters for a channel, usually by performing an analysis on
> >> +runtime data sampled from the system. Net DIM is such a mechanism. In each
> >> +iteration of the algorithm, it analyses a given sample of the data, compares it
> >> +to the previous sample and if required, it can decide to change some of the
> >> +interrupt moderation configuration fields. The data sample is composed of data
> >> +bandwidth, the number of packets and the number of events. The time between
> >> +samples is also measured. Net DIM compares the current and the previous data and
> >> +returns an adjusted interrupt moderation configuration object. In some cases,
> >> +the algorithm might decide not to change anything. The configuration fields are
> >> +the minimum duration (microseconds) allowed between events and the maximum
> >> +number of wanted packets per event. The Net DIM algorithm ascribes importance to
> >> +increase bandwidth over reducing interrupt rate.
> >> +
> >> +
> >> +Part II: The Net DIM Algorithm
> >> +===============================
> >> +
> >> +Each iteration of the Net DIM algorithm follows these steps:
> >> +1. Calculates new data sample.
> >> +2. Compares it to previous sample.
> >> +3. Makes a decision - suggests interrupt moderation configuration fields.
> >> +4. Applies a schedule work function, which applies suggested configuration.
> >> +
> >> +The first two steps are straightforward, both the new and the previous data are
> >> +supplied by the driver registered to Net DIM. The previous data is the new data
> >> +supplied to the previous iteration. The comparison step checks the difference
> >> +between the new and previous data and decides on the result of the last step.
> >> +A step would result as "better" if bandwidth increases and as "worse" if
> >> +bandwidth reduces. If there is no change in bandwidth, the packet rate is
> >> +compared in a similar fashion - increase == "better" and decrease == "worse".
> >> +In case there is no change in the packet rate as well, the interrupt rate is
> >> +compared. Here the algorithm tries to optimize for lower interrupt rate so an
> >> +increase in the interrupt rate is considered "worse" and a decrease is
> >> +considered "better". Step #2 has an optimization for avoiding false results: it
> >> +only considers a difference between samples as valid if it is greater than a
> >> +certain percentage. Also, since Net DIM does not measure anything by itself, it
> >> +assumes the data provided by the driver is valid.
> >> +
> >> +Step #3 decides on the suggested configuration based on the result from step #2
> >> +and the internal state of the algorithm. The states reflect the "direction" of
> >> +the algorithm: is it going left (reducing moderation), right (increasing
> >> +moderation) or standing still. Another optimization is that if a decision
> >> +to stay still is made multiple times, the interval between iterations of the
> >> +algorithm would increase in order to reduce calculation overhead. Also, after
> >> +"parking" on one of the most left or most right decisions, the algorithm may
> >> +decide to verify this decision by taking a step in the other direction. This is
> >> +done in order to avoid getting stuck in a "deep sleep" scenario. Once a
> >> +decision is made, an interrupt moderation configuration is selected from
> >> +the predefined profiles.
> > 
> > I think a short description of the predefined profiles could help.
> 
> Agreed it would help if the different modes
> (NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE,
> NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE) were expanded a bit further. The

Speaking of these, I just had to edit the email and adjust the
alignment to notice the single-letter difference between the two, out of
37 chars.

NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE
NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE

Would be nice if the readability could be improved somehow.

> whole term QE sounds very much Ethernet converged adapter to me...
> 


* [PATCH net-next v5 1/2] net: permit skb_segment on head_frag frag_list skb
From: Yonghong Song @ 2018-03-21 20:36 UTC
  To: edumazet, ast, daniel, diptanu, netdev; +Cc: kernel-team
In-Reply-To: <20180321203650.1404106-1-yhs@fb.com>

One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473                             netdev_features_t features)
3474 {
3475         struct sk_buff *segs = NULL;
3476         struct sk_buff *tail = NULL;
...
3665                 while (pos < offset + len) {
3666                         if (i >= nfrags) {
3667                                 BUG_ON(skb_headlen(list_skb));
3668
3669                                 i = 0;
3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
3671                                 frag = skb_shinfo(list_skb)->frags;
3672                                 frag_skb = list_skb;
...

call stack:
...
 #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
 #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
 #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
 #4 [ffff883ffef03668] die at ffffffff8101deb2
 #5 [ffff883ffef03698] do_trap at ffffffff8101a700
 #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
 #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
 #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
    [exception RIP: skb_segment+3044]
    RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
    RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
    RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
    RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
    R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
    R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
--- <IRQ stack> ---
...

The triggering input skb has the following properties:
    list_skb = skb->frag_list;
    skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.

This patch addresses the issue by handling the skb_headlen(list_skb) != 0
case properly if list_skb->head_frag is true, which is expected in
most cases. The head frag is processed before list_skb->frags
are processed.
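
For readers following along, here is a minimal user-space sketch of the
"start at index -1" idea, not the kernel code itself. The types buf_model
and frag_desc and the helpers head_to_desc() and total_bytes() are
hypothetical stand-ins invented for illustration; only the idea of
treating the head as a pseudo-fragment visited before frags[0] is taken
from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are not the kernel structures. */
struct frag_desc {
	size_t offset;	/* start of the data within its page */
	size_t size;	/* number of bytes described */
};

struct buf_model {
	size_t headlen;			/* bytes in the linear (head) area */
	size_t head_offset;		/* offset of the head data in its page */
	int nr_frags;			/* number of paged fragments */
	const struct frag_desc *frags;	/* paged fragments */
};

/* Build a descriptor for the head area on demand. */
static struct frag_desc head_to_desc(const struct buf_model *b)
{
	struct frag_desc d = { .offset = b->head_offset, .size = b->headlen };

	return d;
}

/*
 * Sum the bytes of every piece, visiting the head first when it holds data.
 * Starting at i = -1 lets one loop cover the head and the paged fragments.
 */
static size_t total_bytes(const struct buf_model *b)
{
	int i;
	size_t total = 0;

	assert(b != NULL);
	i = b->headlen ? -1 : 0;

	for (; i < b->nr_frags; i++) {
		/* frags[] is only indexed once i has reached 0 */
		struct frag_desc d = (i < 0) ? head_to_desc(b) : b->frags[i];

		total += d.size;
	}
	return total;
}
```

In this model the extra cost of the head case is a single comparison per
piece, and no descriptor is built unless the head actually holds data.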

Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 net/core/skbuff.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c134..23b317a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3460,6 +3460,19 @@ void *skb_pull_rcsum(struct sk_buff *skb, unsigned int len)
 }
 EXPORT_SYMBOL_GPL(skb_pull_rcsum);
 
+static inline skb_frag_t skb_head_frag_to_page_desc(struct sk_buff *frag_skb)
+{
+	skb_frag_t head_frag;
+	struct page *page;
+
+	page = virt_to_head_page(frag_skb->head);
+	head_frag.page.p = page;
+	head_frag.page_offset = frag_skb->data -
+		(unsigned char *)page_address(page);
+	head_frag.size = skb_headlen(frag_skb);
+	return head_frag;
+}
+
 /**
  *	skb_segment - Perform protocol segmentation on skb.
  *	@head_skb: buffer to segment
@@ -3664,15 +3677,16 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		while (pos < offset + len) {
 			if (i >= nfrags) {
-				BUG_ON(skb_headlen(list_skb));
-
 				i = 0;
 				nfrags = skb_shinfo(list_skb)->nr_frags;
 				frag = skb_shinfo(list_skb)->frags;
-				frag_skb = list_skb;
-
-				BUG_ON(!nfrags);
+				if (skb_headlen(list_skb)) {
+					BUG_ON(!list_skb->head_frag);
 
+					/* to make room for head_frag. */
+					i--; frag--;
+				}
+				frag_skb = list_skb;
 				if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
 				    skb_zerocopy_clone(nskb, frag_skb,
 						       GFP_ATOMIC))
@@ -3689,7 +3703,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 				goto err;
 			}
 
-			*nskb_frag = *frag;
+			*nskb_frag = (i < 0) ? skb_head_frag_to_page_desc(frag_skb) : *frag;
 			__skb_frag_ref(nskb_frag);
 			size = skb_frag_size(nskb_frag);
 
-- 
2.9.5


* [PATCH net-next v5 0/2] net: permit skb_segment on head_frag frag_list skb
From: Yonghong Song @ 2018-03-21 20:36 UTC (permalink / raw)
  To: edumazet, ast, daniel, diptanu, netdev; +Cc: kernel-team

One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.
 ...
    3665                 while (pos < offset + len) {
    3666                         if (i >= nfrags) {
    3667                                 BUG_ON(skb_headlen(list_skb));
 ...

The triggering input skb has the following properties:
    list_skb = skb->frag_list;
    skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.
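
As a rough illustration of the condition above, the sketch below models
only the two length fields involved. struct skb_model and both helpers
are hypothetical names made up for this example; the one thing taken
from the text is that headlen is len minus data_len.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model holding only the two length fields the text refers to. */
struct skb_model {
	unsigned int len;	/* total bytes: linear area plus paged data */
	unsigned int data_len;	/* bytes held in paged fragments only */
};

/* Linear (head) length, computed as len - data_len. */
static unsigned int model_headlen(const struct skb_model *skb)
{
	assert(skb->len >= skb->data_len);
	return skb->len - skb->data_len;
}

/* True when a frag_list member would have tripped the original BUG_ON. */
static bool trips_old_bug_on(const struct skb_model *list_skb)
{
	return model_headlen(list_skb) != 0;
}
```

With 8 bytes in the linear area and one 64-byte page fragment, len is 72
and data_len is 64, so the modelled headlen is 8 and the check fires.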

Patch #1 provides a simple solution to avoid BUG_ON. If
list_skb->head_frag is true, its page-backed frag will
be processed before the list_skb->frags.
Patch #2 provides a test case in test_bpf module which
constructs a skb and calls skb_segment() directly. The test
case is able to trigger the BUG_ON without Patch #1.

The patch has been tested in the following setup:
  ipv6_host <-> nat_server <-> ipv4_host
where nat_server has a bpf program doing ipv4<->ipv6
translation and forwarding through clsact hook
bpf_skb_change_proto.

Changelog:
v4 -> v5:
  . Replace local variable head_frag with
    a static inline function skb_head_frag_to_page_desc
    which gets the head_frag on-demand. This makes
    code more readable and also does not increase
    the stack size, from Alexander.
  . Remove the "if(nfrags)" guard for skb_orphan_frags
    and skb_zerocopy_clone as I found that they can
    handle zero-frag skb (with non-zero skb_headlen(skb))
    properly.
  . Properly release segment list from skb_segment()
    in the test, from Eric.
v3 -> v4:
  . Remove dynamic memory allocation and use rewinding
    for both index and frag to remove one branch in fast path,
    from Alexander.
  . Fix a bunch of issues in test_bpf skb_segment() test,
    including proper way to allocate skb, proper function
    argument for skb_add_rx_frag and not freeing skb, etc.,
    from Eric.
v2 -> v3:
  . Use starting frag index -1 (instead of 0) to
    special process head_frag before other frags in the skb,
    from Alexander Duyck.
v1 -> v2:
  . Removed never-hit BUG_ON, spotted by Linyu Yuan.

Yonghong Song (2):
  net: permit skb_segment on head_frag frag_list skb
  net: bpf: add a test for skb_segment in test_bpf module

 lib/test_bpf.c    | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 net/core/skbuff.c | 26 ++++++++++++----
 2 files changed, 111 insertions(+), 8 deletions(-)

-- 
2.9.5


* [PATCH net-next v5 2/2] net: bpf: add a test for skb_segment in test_bpf module
From: Yonghong Song @ 2018-03-21 20:36 UTC (permalink / raw)
  To: edumazet, ast, daniel, diptanu, netdev; +Cc: kernel-team
In-Reply-To: <20180321203650.1404106-1-yhs@fb.com>

Without the previous commit,
"modprobe test_bpf" will have the following errors:
...
[   98.149165] ------------[ cut here ]------------
[   98.159362] kernel BUG at net/core/skbuff.c:3667!
[   98.169756] invalid opcode: 0000 [#1] SMP PTI
[   98.179370] Modules linked in:
[   98.179371]  test_bpf(+)
...
which triggers the bug the previous commit intends to fix.

The skbs are constructed to mimic what mlx5 may generate.
The packet size/header may not mimic real cases in production. But
the processing flow is similar.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 lib/test_bpf.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 91 insertions(+), 2 deletions(-)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 2efb213..a468b5c 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6574,6 +6574,93 @@ static bool exclude_test(int test_id)
 	return test_id < test_range[0] || test_id > test_range[1];
 }
 
+static __init struct sk_buff *build_test_skb(void)
+{
+	u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
+	struct sk_buff *skb[2];
+	struct page *page[2];
+	int i, data_size = 8;
+
+	for (i = 0; i < 2; i++) {
+		page[i] = alloc_page(GFP_KERNEL);
+		if (!page[i]) {
+			if (i == 0)
+				goto err_page0;
+			else
+				goto err_page1;
+		}
+
+		/* this will set skb[i]->head_frag */
+		skb[i] = dev_alloc_skb(headroom + data_size);
+		if (!skb[i]) {
+			if (i == 0)
+				goto err_skb0;
+			else
+				goto err_skb1;
+		}
+
+		skb_reserve(skb[i], headroom);
+		skb_put(skb[i], data_size);
+		skb[i]->protocol = htons(ETH_P_IP);
+		skb_reset_network_header(skb[i]);
+		skb_set_mac_header(skb[i], -ETH_HLEN);
+
+		skb_add_rx_frag(skb[i], 0, page[i], 0, 64, 64);
+		// skb_headlen(skb[i]): 8, skb[i]->head_frag = 1
+	}
+
+	/* setup shinfo */
+	skb_shinfo(skb[0])->gso_size = 1448;
+	skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
+	skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
+	skb_shinfo(skb[0])->gso_segs = 0;
+	skb_shinfo(skb[0])->frag_list = skb[1];
+
+	/* adjust skb[0]'s len */
+	skb[0]->len += skb[1]->len;
+	skb[0]->data_len += skb[1]->data_len;
+	skb[0]->truesize += skb[1]->truesize;
+
+	return skb[0];
+
+err_skb1:
+	__free_page(page[1]);
+err_page1:
+	kfree_skb(skb[0]);
+err_skb0:
+	__free_page(page[0]);
+err_page0:
+	return NULL;
+}
+
+static __init int test_skb_segment(void)
+{
+	netdev_features_t features;
+	struct sk_buff *skb, *segs;
+	int ret = -1;
+
+	features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
+		   NETIF_F_IPV6_CSUM;
+	features |= NETIF_F_RXCSUM;
+	skb = build_test_skb();
+	if (!skb) {
+		pr_info("%s: failed to build_test_skb", __func__);
+		goto done;
+	}
+
+	segs = skb_segment(skb, features);
+	if (segs) {
+		kfree_skb_list(segs);
+		ret = 0;
+		pr_info("%s: success in skb_segment!", __func__);
+	} else {
+		pr_info("%s: failed in skb_segment!", __func__);
+	}
+	kfree_skb(skb);
+done:
+	return ret;
+}
+
 static __init int test_bpf(void)
 {
 	int i, err_cnt = 0, pass_cnt = 0;
@@ -6632,9 +6719,11 @@ static int __init test_bpf_init(void)
 		return ret;
 
 	ret = test_bpf();
-
 	destroy_bpf_tests();
-	return ret;
+	if (ret)
+		return ret;
+
+	return test_skb_segment();
 }
 
 static void __exit test_bpf_exit(void)
-- 
2.9.5


* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
From: Saeed Mahameed @ 2018-03-21 20:50 UTC (permalink / raw)
  To: ktkhai@virtuozzo.com, davem@davemloft.net, Boris Pismenny
  Cc: netdev@vger.kernel.org, davejwatson@fb.com, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <372a4ca2-9ecf-95fe-7233-4937da82003a@virtuozzo.com>

On Wed, 2018-03-21 at 19:31 +0300, Kirill Tkhai wrote:
> On 21.03.2018 18:53, Boris Pismenny wrote:
> > ...
> > > 
> > > Other patches have two licenses in header. Can I distribute this
> > > file under GPL license terms?
> > > 
> > 
> > Sure, I'll update the license to match other files under net/tls.
> > 
> > > > +#include <linux/module.h>
> > > > +#include <net/tcp.h>
> > > > +#include <net/inet_common.h>
> > > > +#include <linux/highmem.h>
> > > > +#include <linux/netdevice.h>
> > > > +
> > > > +#include <net/tls.h>
> > > > +#include <crypto/aead.h>
> > > > +
> > > > +/* device_offload_lock is used to synchronize tls_dev_add
> > > > + * against NETDEV_DOWN notifications.
> > > > + */
> > > > +DEFINE_STATIC_PERCPU_RWSEM(device_offload_lock);
> > > > +
> > > > +static void tls_device_gc_task(struct work_struct *work);
> > > > +
> > > > +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
> > > > +static LIST_HEAD(tls_device_gc_list);
> > > > +static LIST_HEAD(tls_device_list);
> > > > +static DEFINE_SPINLOCK(tls_device_lock);
> > > > +
> > > > +static void tls_device_free_ctx(struct tls_context *ctx)
> > > > +{
> > > > +    struct tls_offload_context *offlad_ctx =
> > > > tls_offload_ctx(ctx);
> > > > +
> > > > +    kfree(offlad_ctx);
> > > > +    kfree(ctx);
> > > > +}
> > > > +
> > > > +static void tls_device_gc_task(struct work_struct *work)
> > > > +{
> > > > +    struct tls_context *ctx, *tmp;
> > > > +    struct list_head gc_list;
> > > > +    unsigned long flags;
> > > > +
> > > > +    spin_lock_irqsave(&tls_device_lock, flags);
> > > > +    INIT_LIST_HEAD(&gc_list);
> > > 
> > > > This is a stack variable, and it should be initialized outside of
> > > > the global spinlock.
> > > > There is a LIST_HEAD() primitive for that in the kernel.
> > > > There is one more similar place below.
> > > 
> > 
> > Sure.
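
(Illustration only, not part of the patch under review: a user-space
sketch of the pattern suggested above. The list type, the atomic_flag
spinlock and the gc_queue()/gc_run() names are assumptions made for this
example. The local list head is initialized at its declaration, so the
locked region contains nothing but the splice.)

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on list_head. */
struct list_node {
	struct list_node *prev, *next;
};
#define LIST_NODE_INIT(name) { &(name), &(name) }

static atomic_flag gc_lock = ATOMIC_FLAG_INIT;	/* stand-in for the spinlock */
static struct list_node gc_pending = LIST_NODE_INIT(gc_pending);

static void gc_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&gc_lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void gc_lock_release(void)
{
	atomic_flag_clear_explicit(&gc_lock, memory_order_release);
}

/* Queue one entry at the tail of the shared list. */
static void gc_queue(struct list_node *n)
{
	assert(n != NULL);
	gc_lock_acquire();
	n->prev = gc_pending.prev;
	n->next = &gc_pending;
	gc_pending.prev->next = n;
	gc_pending.prev = n;
	gc_lock_release();
}

/* Move every entry of src onto the empty list dst and leave src empty. */
static void list_splice_all(struct list_node *src, struct list_node *dst)
{
	if (src->next == src)
		return;
	dst->next = src->next;
	dst->prev = src->prev;
	dst->next->prev = dst;
	dst->prev->next = dst;
	src->next = src->prev = src;
}

/* Drain the shared list; returns the number of entries handled. */
static size_t gc_run(void)
{
	/* Stack-local head, initialized before the lock is taken. */
	struct list_node local = LIST_NODE_INIT(local);
	struct list_node *n, *next;
	size_t handled = 0;

	gc_lock_acquire();
	list_splice_all(&gc_pending, &local);
	gc_lock_release();

	/* The slow per-entry work runs without holding the lock. */
	for (n = local.next; n != &local; n = next) {
		next = n->next;
		n->next = n->prev = n;
		handled++;
	}
	return handled;
}
```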
> > 
> > > > +    list_splice_init(&tls_device_gc_list, &gc_list);
> > > > +    spin_unlock_irqrestore(&tls_device_lock, flags);
> > > > +
> > > > +    list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
> > > > +        struct net_device *netdev = ctx->netdev;
> > > > +
> > > > +        if (netdev) {
> > > > +            netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> > > > +                            TLS_OFFLOAD_CTX_DIR_TX);
> > > > +            dev_put(netdev);
> > > > +        }
> > > 
> > > > How is it possible that we meet a NULL netdev here?
> > 
> > This can happen in tls_device_down. tls_device_down is called
> > whenever a netdev that is used for TLS inline crypto offload goes
> > down. It gets called via the NETDEV_DOWN event of the netdevice
> > notifier.
> > 
> > This flow is somewhat similar to the xfrm_device netdev notifier.
> > However, we do not destroy the socket (as in destroying the
> > xfrm_state in xfrm_device). Instead, we cleanup the netdev state
> > and allow software fallback to handle the rest of the traffic.
> > 
> > > > +
> > > > +        list_del(&ctx->list);
> > > > +        tls_device_free_ctx(ctx);
> > > > +    }
> > > > +}
> > > > +
> > > > +static void tls_device_queue_ctx_destruction(struct
> > > > tls_context *ctx)
> > > > +{
> > > > +    unsigned long flags;
> > > > +
> > > > +    spin_lock_irqsave(&tls_device_lock, flags);
> > > > +    list_move_tail(&ctx->list, &tls_device_gc_list);
> > > > +
> > > > +    /* schedule_work inside the spinlock
> > > > +     * to make sure tls_device_down waits for that work.
> > > > +     */
> > > > +    schedule_work(&tls_device_gc_work);
> > > > +
> > > > +    spin_unlock_irqrestore(&tls_device_lock, flags);
> > > > +}
> > > > +
> > > > +/* We assume that the socket is already connected */
> > > > +static struct net_device *get_netdev_for_sock(struct sock *sk)
> > > > +{
> > > > +    struct inet_sock *inet = inet_sk(sk);
> > > > +    struct net_device *netdev = NULL;
> > > > +
> > > > +    netdev = dev_get_by_index(sock_net(sk), inet-
> > > > >cork.fl.flowi_oif);
> > > > +
> > > > +    return netdev;
> > > > +}
> > > > +
> > > > +static int attach_sock_to_netdev(struct sock *sk, struct
> > > > net_device *netdev,
> > > > +                 struct tls_context *ctx)
> > > > +{
> > > > +    int rc;
> > > > +
> > > > +    rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk,
> > > > TLS_OFFLOAD_CTX_DIR_TX,
> > > > +                         &ctx->crypto_send,
> > > > +                         tcp_sk(sk)->write_seq);
> > > > +    if (rc) {
> > > > +        pr_err_ratelimited("The netdev has refused to offload
> > > > this socket\n");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    rc = 0;
> > > > +out:
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +static void destroy_record(struct tls_record_info *record)
> > > > +{
> > > > +    skb_frag_t *frag;
> > > > +    int nr_frags = record->num_frags;
> > > > +
> > > > +    while (nr_frags > 0) {
> > > > +        frag = &record->frags[nr_frags - 1];
> > > > +        __skb_frag_unref(frag);
> > > > +        --nr_frags;
> > > > +    }
> > > > +    kfree(record);
> > > > +}
> > > > +
> > > > +static void delete_all_records(struct tls_offload_context
> > > > *offload_ctx)
> > > > +{
> > > > +    struct tls_record_info *info, *temp;
> > > > +
> > > > +    list_for_each_entry_safe(info, temp, &offload_ctx-
> > > > >records_list, list) {
> > > > +        list_del(&info->list);
> > > > +        destroy_record(info);
> > > > +    }
> > > > +
> > > > +    offload_ctx->retransmit_hint = NULL;
> > > > +}
> > > > +
> > > > +static void tls_icsk_clean_acked(struct sock *sk, u32
> > > > acked_seq)
> > > > +{
> > > > +    struct tls_context *tls_ctx = tls_get_ctx(sk);
> > > > +    struct tls_offload_context *ctx;
> > > > +    struct tls_record_info *info, *temp;
> > > > +    unsigned long flags;
> > > > +    u64 deleted_records = 0;
> > > > +
> > > > +    if (!tls_ctx)
> > > > +        return;
> > > > +
> > > > +    ctx = tls_offload_ctx(tls_ctx);
> > > > +
> > > > +    spin_lock_irqsave(&ctx->lock, flags);
> > > > +    info = ctx->retransmit_hint;
> > > > +    if (info && !before(acked_seq, info->end_seq)) {
> > > > +        ctx->retransmit_hint = NULL;
> > > > +        list_del(&info->list);
> > > > +        destroy_record(info);
> > > > +        deleted_records++;
> > > > +    }
> > > > +
> > > > +    list_for_each_entry_safe(info, temp, &ctx->records_list,
> > > > list) {
> > > > +        if (before(acked_seq, info->end_seq))
> > > > +            break;
> > > > +        list_del(&info->list);
> > > > +
> > > > +        destroy_record(info);
> > > > +        deleted_records++;
> > > > +    }
> > > > +
> > > > +    ctx->unacked_record_sn += deleted_records;
> > > > +    spin_unlock_irqrestore(&ctx->lock, flags);
> > > > +}
> > > > +
> > > > +/* At this point, there should be no references on this
> > > > + * socket and no in-flight SKBs associated with this
> > > > + * socket, so it is safe to free all the resources.
> > > > + */
> > > > +void tls_device_sk_destruct(struct sock *sk)
> > > > +{
> > > > +    struct tls_context *tls_ctx = tls_get_ctx(sk);
> > > > +    struct tls_offload_context *ctx =
> > > > tls_offload_ctx(tls_ctx);
> > > > +
> > > > +    if (ctx->open_record)
> > > > +        destroy_record(ctx->open_record);
> > > > +
> > > > +    delete_all_records(ctx);
> > > > +    crypto_free_aead(ctx->aead_send);
> > > > +    ctx->sk_destruct(sk);
> > > > +
> > > > +    if (refcount_dec_and_test(&tls_ctx->refcount))
> > > > +        tls_device_queue_ctx_destruction(tls_ctx);
> > > > +}
> > > > +EXPORT_SYMBOL(tls_device_sk_destruct);
> > > > +
> > > > +static inline void tls_append_frag(struct tls_record_info
> > > > *record,
> > > > +                   struct page_frag *pfrag,
> > > > +                   int size)
> > > > +{
> > > > +    skb_frag_t *frag;
> > > > +
> > > > +    frag = &record->frags[record->num_frags - 1];
> > > > +    if (frag->page.p == pfrag->page &&
> > > > +        frag->page_offset + frag->size == pfrag->offset) {
> > > > +        frag->size += size;
> > > > +    } else {
> > > > +        ++frag;
> > > > +        frag->page.p = pfrag->page;
> > > > +        frag->page_offset = pfrag->offset;
> > > > +        frag->size = size;
> > > > +        ++record->num_frags;
> > > > +        get_page(pfrag->page);
> > > > +    }
> > > > +
> > > > +    pfrag->offset += size;
> > > > +    record->len += size;
> > > > +}
> > > > +
> > > > +static inline int tls_push_record(struct sock *sk,
> > > > +                  struct tls_context *ctx,
> > > > +                  struct tls_offload_context *offload_ctx,
> > > > +                  struct tls_record_info *record,
> > > > +                  struct page_frag *pfrag,
> > > > +                  int flags,
> > > > +                  unsigned char record_type)
> > > > +{
> > > > +    skb_frag_t *frag;
> > > > +    struct tcp_sock *tp = tcp_sk(sk);
> > > > +    struct page_frag fallback_frag;
> > > > +    struct page_frag  *tag_pfrag = pfrag;
> > > > +    int i;
> > > > +
> > > > +    /* fill prepend */
> > > > +    frag = &record->frags[0];
> > > > +    tls_fill_prepend(ctx,
> > > > +             skb_frag_address(frag),
> > > > +             record->len - ctx->prepend_size,
> > > > +             record_type);
> > > > +
> > > > +    if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag,
> > > > GFP_KERNEL))) {
> > > > +        /* HW doesn't care about the data in the tag
> > > > +         * so in case pfrag has no room
> > > > +         * for a tag and we can't allocate a new pfrag
> > > > +         * just use the page in the first frag
> > > > +         * rather than write complicated fallback code.
> > > > +         */
> > > > +        tag_pfrag = &fallback_frag;
> > > > +        tag_pfrag->page = skb_frag_page(frag);
> > > > +        tag_pfrag->offset = 0;
> > > > +    }
> > > > +
> > > > +    tls_append_frag(record, tag_pfrag, ctx->tag_size);
> > > > +    record->end_seq = tp->write_seq + record->len;
> > > > +    spin_lock_irq(&offload_ctx->lock);
> > > > +    list_add_tail(&record->list, &offload_ctx->records_list);
> > > > +    spin_unlock_irq(&offload_ctx->lock);
> > > > +    offload_ctx->open_record = NULL;
> > > > +    set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
> > > > +    tls_advance_record_sn(sk, ctx);
> > > > +
> > > > +    for (i = 0; i < record->num_frags; i++) {
> > > > +        frag = &record->frags[i];
> > > > +        sg_unmark_end(&offload_ctx->sg_tx_data[i]);
> > > > +        sg_set_page(&offload_ctx->sg_tx_data[i],
> > > > skb_frag_page(frag),
> > > > +                frag->size, frag->page_offset);
> > > > +        sk_mem_charge(sk, frag->size);
> > > > +        get_page(skb_frag_page(frag));
> > > > +    }
> > > > +    sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags -
> > > > 1]);
> > > > +
> > > > +    /* all ready, send */
> > > > +    return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0,
> > > > flags);
> > > > +}
> > > > +
> > > > +static inline int tls_create_new_record(struct
> > > > tls_offload_context *offload_ctx,
> > > > +                    struct page_frag *pfrag,
> > > > +                    size_t prepend_size)
> > > > +{
> > > > +    skb_frag_t *frag;
> > > > +    struct tls_record_info *record;
> > > > +
> > > > +    record = kmalloc(sizeof(*record), GFP_KERNEL);
> > > > +    if (!record)
> > > > +        return -ENOMEM;
> > > > +
> > > > +    frag = &record->frags[0];
> > > > +    __skb_frag_set_page(frag, pfrag->page);
> > > > +    frag->page_offset = pfrag->offset;
> > > > +    skb_frag_size_set(frag, prepend_size);
> > > > +
> > > > +    get_page(pfrag->page);
> > > > +    pfrag->offset += prepend_size;
> > > > +
> > > > +    record->num_frags = 1;
> > > > +    record->len = prepend_size;
> > > > +    offload_ctx->open_record = record;
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static inline int tls_do_allocation(struct sock *sk,
> > > > +                    struct tls_offload_context *offload_ctx,
> > > > +                    struct page_frag *pfrag,
> > > > +                    size_t prepend_size)
> > > > +{
> > > > +    int ret;
> > > > +
> > > > +    if (!offload_ctx->open_record) {
> > > > +        if (unlikely(!skb_page_frag_refill(prepend_size,
> > > > pfrag,
> > > > +                           sk->sk_allocation))) {
> > > > +            sk->sk_prot->enter_memory_pressure(sk);
> > > > +            sk_stream_moderate_sndbuf(sk);
> > > > +            return -ENOMEM;
> > > > +        }
> > > > +
> > > > +        ret = tls_create_new_record(offload_ctx, pfrag,
> > > > prepend_size);
> > > > +        if (ret)
> > > > +            return ret;
> > > > +
> > > > +        if (pfrag->size > pfrag->offset)
> > > > +            return 0;
> > > > +    }
> > > > +
> > > > +    if (!sk_page_frag_refill(sk, pfrag))
> > > > +        return -ENOMEM;
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static int tls_push_data(struct sock *sk,
> > > > +             struct iov_iter *msg_iter,
> > > > +             size_t size, int flags,
> > > > +             unsigned char record_type)
> > > > +{
> > > > +    struct tls_context *tls_ctx = tls_get_ctx(sk);
> > > > +    struct tls_offload_context *ctx =
> > > > tls_offload_ctx(tls_ctx);
> > > > +    struct tls_record_info *record = ctx->open_record;
> > > > +    struct page_frag *pfrag;
> > > > +    int copy, rc = 0;
> > > > +    size_t orig_size = size;
> > > > +    u32 max_open_record_len;
> > > > +    long timeo;
> > > > +    int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
> > > > +    int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
> > > > +    bool done = false;
> > > > +
> > > > +    if (flags &
> > > > +        ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
> > > > MSG_SENDPAGE_NOTLAST))
> > > > +        return -ENOTSUPP;
> > > > +
> > > > +    if (sk->sk_err)
> > > > +        return -sk->sk_err;
> > > > +
> > > > +    timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
> > > > +    rc = tls_complete_pending_work(sk, tls_ctx, flags,
> > > > &timeo);
> > > > +    if (rc < 0)
> > > > +        return rc;
> > > > +
> > > > +    pfrag = sk_page_frag(sk);
> > > > +
> > > > +    /* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS
> > > > record, and
> > > > +     * we need to leave room for an authentication tag.
> > > > +     */
> > > > +    max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
> > > > +                  tls_ctx->prepend_size;
> > > > +    do {
> > > > +        if (tls_do_allocation(sk, ctx, pfrag,
> > > > +                      tls_ctx->prepend_size)) {
> > > > +            rc = sk_stream_wait_memory(sk, &timeo);
> > > > +            if (!rc)
> > > > +                continue;
> > > > +
> > > > +            record = ctx->open_record;
> > > > +            if (!record)
> > > > +                break;
> > > > +handle_error:
> > > > +            if (record_type != TLS_RECORD_TYPE_DATA) {
> > > > +                /* avoid sending partial
> > > > +                 * record with type !=
> > > > +                 * application_data
> > > > +                 */
> > > > +                size = orig_size;
> > > > +                destroy_record(record);
> > > > +                ctx->open_record = NULL;
> > > > +            } else if (record->len > tls_ctx->prepend_size) {
> > > > +                goto last_record;
> > > > +            }
> > > > +
> > > > +            break;
> > > > +        }
> > > > +
> > > > +        record = ctx->open_record;
> > > > +        copy = min_t(size_t, size, (pfrag->size - pfrag-
> > > > >offset));
> > > > +        copy = min_t(size_t, copy, (max_open_record_len -
> > > > record->len));
> > > > +
> > > > +        if (copy_from_iter_nocache(page_address(pfrag->page) +
> > > > +                           pfrag->offset,
> > > > +                       copy, msg_iter) != copy) {
> > > > +            rc = -EFAULT;
> > > > +            goto handle_error;
> > > > +        }
> > > > +        tls_append_frag(record, pfrag, copy);
> > > > +
> > > > +        size -= copy;
> > > > +        if (!size) {
> > > > +last_record:
> > > > +            tls_push_record_flags = flags;
> > > > +            if (more) {
> > > > +                tls_ctx->pending_open_record_frags =
> > > > +                        record->num_frags;
> > > > +                break;
> > > > +            }
> > > > +
> > > > +            done = true;
> > > > +        }
> > > > +
> > > > +        if ((done) || record->len >= max_open_record_len ||
> > > > +            (record->num_frags >= MAX_SKB_FRAGS - 1)) {
> > > > +            rc = tls_push_record(sk,
> > > > +                         tls_ctx,
> > > > +                         ctx,
> > > > +                         record,
> > > > +                         pfrag,
> > > > +                         tls_push_record_flags,
> > > > +                         record_type);
> > > > +            if (rc < 0)
> > > > +                break;
> > > > +        }
> > > > +    } while (!done);
> > > > +
> > > > +    if (orig_size - size > 0)
> > > > +        rc = orig_size - size;
> > > > +
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg,
> > > > size_t size)
> > > > +{
> > > > +    unsigned char record_type = TLS_RECORD_TYPE_DATA;
> > > > +    int rc = 0;
> > > > +
> > > > +    lock_sock(sk);
> > > > +
> > > > +    if (unlikely(msg->msg_controllen)) {
> > > > +        rc = tls_proccess_cmsg(sk, msg, &record_type);
> > > > +        if (rc)
> > > > +            goto out;
> > > > +    }
> > > > +
> > > > +    rc = tls_push_data(sk, &msg->msg_iter, size,
> > > > +               msg->msg_flags, record_type);
> > > > +
> > > > +out:
> > > > +    release_sock(sk);
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +int tls_device_sendpage(struct sock *sk, struct page *page,
> > > > +            int offset, size_t size, int flags)
> > > > +{
> > > > +    struct iov_iter    msg_iter;
> > > > +    struct kvec iov;
> > > > +    char *kaddr = kmap(page);
> > > > +    int rc = 0;
> > > > +
> > > > +    if (flags & MSG_SENDPAGE_NOTLAST)
> > > > +        flags |= MSG_MORE;
> > > > +
> > > > +    lock_sock(sk);
> > > > +
> > > > +    if (flags & MSG_OOB) {
> > > > +        rc = -ENOTSUPP;
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    iov.iov_base = kaddr + offset;
> > > > +    iov.iov_len = size;
> > > > +    iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1,
> > > > size);
> > > > +    rc = tls_push_data(sk, &msg_iter, size,
> > > > +               flags, TLS_RECORD_TYPE_DATA);
> > > > +    kunmap(page);
> > > > +
> > > > +out:
> > > > +    release_sock(sk);
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +struct tls_record_info *tls_get_record(struct
> > > > tls_offload_context *context,
> > > > +                       u32 seq, u64 *p_record_sn)
> > > > +{
> > > > +    struct tls_record_info *info;
> > > > +    u64 record_sn = context->hint_record_sn;
> > > > +
> > > > +    info = context->retransmit_hint;
> > > > +    if (!info ||
> > > > +        before(seq, info->end_seq - info->len)) {
> > > > +        /* if retransmit_hint is irrelevant start
> > > > +         * from the beginning of the list
> > > > +         */
> > > > +        info = list_first_entry(&context->records_list,
> > > > +                    struct tls_record_info, list);
> > > > +        record_sn = context->unacked_record_sn;
> > > > +    }
> > > > +
> > > > +    list_for_each_entry_from(info, &context->records_list,
> > > > list) {
> > > > +        if (before(seq, info->end_seq)) {
> > > > +            if (!context->retransmit_hint ||
> > > > +                after(info->end_seq,
> > > > +                  context->retransmit_hint->end_seq)) {
> > > > +                context->hint_record_sn = record_sn;
> > > > +                context->retransmit_hint = info;
> > > > +            }
> > > > +            *p_record_sn = record_sn;
> > > > +            return info;
> > > > +        }
> > > > +        record_sn++;
> > > > +    }
> > > > +
> > > > +    return NULL;
> > > > +}
> > > > +EXPORT_SYMBOL(tls_get_record);
> > > > +
> > > > +static int tls_device_push_pending_record(struct sock *sk, int
> > > > flags)
> > > > +{
> > > > +    struct iov_iter    msg_iter;
> > > > +
> > > > +    iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
> > > > +    return tls_push_data(sk, &msg_iter, 0, flags,
> > > > TLS_RECORD_TYPE_DATA);
> > > > +}
> > > > +
> > > > +int tls_set_device_offload(struct sock *sk, struct tls_context
> > > > *ctx)
> > > > +{
> > > > +    u16 nonece_size, tag_size, iv_size, rec_seq_size;
> > > > +    struct tls_record_info *start_marker_record;
> > > > +    struct tls_offload_context *offload_ctx;
> > > > +    struct tls_crypto_info *crypto_info;
> > > > +    struct net_device *netdev;
> > > > +    char *iv, *rec_seq;
> > > > +    struct sk_buff *skb;
> > > > +    int rc = -EINVAL;
> > > > +    __be64 rcd_sn;
> > > > +
> > > > +    if (!ctx)
> > > > +        goto out;
> > > > +
> > > > +    if (ctx->priv_ctx) {
> > > > +        rc = -EEXIST;
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    /* We support starting offload on multiple sockets
> > > > +     * concurrently, so we only need a read lock here.
> > > > +     */
> > > > +    percpu_down_read(&device_offload_lock);
> > > > +    netdev = get_netdev_for_sock(sk);
> > > > +    if (!netdev) {
> > > > +        pr_err_ratelimited("%s: netdev not found\n",
> > > > __func__);
> > > > +        rc = -EINVAL;
> > > > +        goto release_lock;
> > > > +    }
> > > > +
> > > > +    if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
> > > > +        rc = -ENOTSUPP;
> > > > +        goto release_netdev;
> > > > +    }
> > > > +
> > > > +    /* Avoid offloading if the device is down
> > > > +     * We don't want to offload new flows after
> > > > +     * the NETDEV_DOWN event
> > > > +     */
> > > > +    if (!(netdev->flags & IFF_UP)) {
> > > > +        rc = -EINVAL;
> > > > +        goto release_lock;
> > > > +    }
> > > > +
> > > > +    crypto_info = &ctx->crypto_send;
> > > > +    switch (crypto_info->cipher_type) {
> > > > +    case TLS_CIPHER_AES_GCM_128: {
> > > > +        nonece_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> > > > +        tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> > > > +        iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> > > > +        iv = ((struct tls12_crypto_info_aes_gcm_128
> > > > *)crypto_info)->iv;
> > > > +        rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
> > > > +        rec_seq =
> > > > +         ((struct tls12_crypto_info_aes_gcm_128
> > > > *)crypto_info)->rec_seq;
> > > > +        break;
> > > > +    }
> > > > +    default:
> > > > +        rc = -EINVAL;
> > > > +        goto release_netdev;
> > > > +    }
> > > > +
> > > > +    start_marker_record =
> > > > kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
> > > 
> > > Can we do memory allocations and simple memory initializations
> > > outside the global rwsem?
> > > 
> > 
> > Sure, we can move all memory allocations outside the lock.
> > 
> > > > +    if (!start_marker_record) {
> > > > +        rc = -ENOMEM;
> > > > +        goto release_netdev;
> > > > +    }
> > > > +
> > > > +    offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE,
> > > > GFP_KERNEL);
> > > > +    if (!offload_ctx)
> > > > +        goto free_marker_record;
> > > > +
> > > > +    ctx->priv_ctx = offload_ctx;
> > > > +    rc = attach_sock_to_netdev(sk, netdev, ctx);
> > > > +    if (rc)
> > > > +        goto free_offload_context;
> > > > +
> > > > +    ctx->netdev = netdev;
> > > > +    ctx->prepend_size = TLS_HEADER_SIZE + nonece_size;
> > > > +    ctx->tag_size = tag_size;
> > > > +    ctx->iv_size = iv_size;
> > > > +    ctx->iv = kmalloc(iv_size +
> > > > TLS_CIPHER_AES_GCM_128_SALT_SIZE,
> > > > +              GFP_KERNEL);
> > > > +    if (!ctx->iv) {
> > > > +        rc = -ENOMEM;
> > > > +        goto detach_sock;
> > > > +    }
> > > > +
> > > > +    memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv,
> > > > iv_size);
> > > > +
> > > > +    ctx->rec_seq_size = rec_seq_size;
> > > > +    ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
> > > > +    if (!ctx->rec_seq) {
> > > > +        rc = -ENOMEM;
> > > > +        goto free_iv;
> > > > +    }
> > > > +    memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
> > > > +
> > > > +    /* start at rec_seq - 1 to account for the start marker
> > > > record */
> > > > +    memcpy(&rcd_sn, ctx->rec_seq, sizeof(rcd_sn));
> > > > +    offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
> > > > +
> > > > +    rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
> > > > +    if (rc)
> > > > +        goto free_rec_seq;
> > > > +
> > > > +    start_marker_record->end_seq = tcp_sk(sk)->write_seq;
> > > > +    start_marker_record->len = 0;
> > > > +    start_marker_record->num_frags = 0;
> > > > +
> > > > +    INIT_LIST_HEAD(&offload_ctx->records_list);
> > > > +    list_add_tail(&start_marker_record->list, &offload_ctx-
> > > > >records_list);
> > > > +    spin_lock_init(&offload_ctx->lock);
> > > > +
> > > > +    inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
> > > > +    ctx->push_pending_record = tls_device_push_pending_record;
> > > > +    offload_ctx->sk_destruct = sk->sk_destruct;
> > > > +
> > > > +    /* TLS offload is greatly simplified if we don't send
> > > > +     * SKBs where only part of the payload needs to be
> > > > encrypted.
> > > > +     * So mark the last skb in the write queue as end of
> > > > record.
> > > > +     */
> > > > +    skb = tcp_write_queue_tail(sk);
> > > > +    if (skb)
> > > > +        TCP_SKB_CB(skb)->eor = 1;
> > > > +
> > > > +    refcount_set(&ctx->refcount, 1);
> > > > +    spin_lock_irq(&tls_device_lock);
> > > > +    list_add_tail(&ctx->list, &tls_device_list);
> > > > +    spin_unlock_irq(&tls_device_lock);
> > > > +
> > > > +    /* following this assignment tls_is_sk_tx_device_offloaded
> > > > +     * will return true and the context might be accessed
> > > > +     * by the netdev's xmit function.
> > > > +     */
> > > > +    smp_store_release(&sk->sk_destruct,
> > > > +              &tls_device_sk_destruct);
> > > > +    goto release_lock;
> > > > +
> > > > +free_rec_seq:
> > > > +    kfree(ctx->rec_seq);
> > > > +free_iv:
> > > > +    kfree(ctx->iv);
> > > > +detach_sock:
> > > > +    netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> > > > TLS_OFFLOAD_CTX_DIR_TX);
> > > > +free_offload_context:
> > > > +    kfree(offload_ctx);
> > > > +    ctx->priv_ctx = NULL;
> > > > +free_marker_record:
> > > > +    kfree(start_marker_record);
> > > > +release_netdev:
> > > > +    dev_put(netdev);
> > > > +release_lock:
> > > > +    percpu_up_read(&device_offload_lock);
> > > > +out:
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +static int tls_device_register(struct net_device *dev)
> > > > +{
> > > > +    if ((dev->features & NETIF_F_HW_TLS_TX) && !dev-
> > > > >tlsdev_ops)
> > > > +        return NOTIFY_BAD;
> > > > +
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > 
> > > This function is the same as tls_device_feat_change(). Can't we
> > > merge them and avoid duplicating code?
> > > 
> > 
> > Sure.
> > 
> > > > +static int tls_device_unregister(struct net_device *dev)
> > > > +{
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > 
> > > This function does nothing, and the next patches do not change it.
> > > Can't we remove it, then?
> > > 
> > 
> > Sure.
> > 
> > > > +static int tls_device_feat_change(struct net_device *dev)
> > > > +{
> > > > +    if ((dev->features & NETIF_F_HW_TLS_TX) && !dev-
> > > > >tlsdev_ops)
> > > > +        return NOTIFY_BAD;
> > > > +
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > > +
> > > > +static int tls_device_down(struct net_device *netdev)
> > > > +{
> > > > +    struct tls_context *ctx, *tmp;
> > > > +    struct list_head list;
> > > > +    unsigned long flags;
> > > > +
> > > > +    if (!(netdev->features & NETIF_F_HW_TLS_TX))
> > > > +        return NOTIFY_DONE;
> > > 
> > > Can't we move this check into tls_dev_event() and use it for all
> > > types of events? Then we avoid duplicate code.
> > > 
> > 
> > No. Not all events require this check. Also, the result is
> > different for different events.
> 
> No. You always return NOTIFY_DONE when !(netdev->features &
> NETIF_F_HW_TLS_TX) is true.
> See below:
> 
> static int tls_check_dev_ops(struct net_device *dev) 
> {
> 	if (!dev->tlsdev_ops)
> 		return NOTIFY_BAD; 
> 
> 	return NOTIFY_DONE; 
> }
> 
> static int tls_device_down(struct net_device *netdev) 
> {
> 	struct tls_context *ctx, *tmp; 
> 	struct list_head list; 
> 	unsigned long flags; 
> 
> 	...
> 	return NOTIFY_DONE;
> }
> 
> static int tls_dev_event(struct notifier_block *this, unsigned long
> event, 
>         		 void *ptr) 
> { 
>         struct net_device *dev = netdev_notifier_info_to_dev(ptr); 
> 
> 	if (!(dev->features & NETIF_F_HW_TLS_TX))
> 		return NOTIFY_DONE; 
>  
>         switch (event) { 
>         case NETDEV_REGISTER:
>         case NETDEV_FEAT_CHANGE: 
>         	return tls_check_dev_ops(dev); 
>  
>         case NETDEV_DOWN: 
>         	return tls_device_down(dev); 
>         } 
>         return NOTIFY_DONE; 
> } 
>  

Will fix in V2.
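
For reference, a minimal userspace sketch of the layout suggested above:
the feature check is done once before the switch, and REGISTER and
FEAT_CHANGE share one helper. Everything here is a stand-in, not kernel
code: struct fake_netdev, HW_TLS_TX_FEATURE, the numeric values of the
NOTIFY_* and NETDEV_* constants, and the down_calls counter are all
assumptions made so that the dispatch logic can be compiled and checked
in isolation.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel definitions; the values are illustrative only. */
enum { NOTIFY_DONE = 0, NOTIFY_BAD = 1 };
enum { NETDEV_DOWN = 1, NETDEV_REGISTER = 2, NETDEV_FEAT_CHANGE = 3 };
#define HW_TLS_TX_FEATURE (1UL << 0) /* hypothetical NETIF_F_HW_TLS_TX bit */

/* Hypothetical reduced model of struct net_device. */
struct fake_netdev {
	unsigned long features;
	const void *tlsdev_ops;	/* NULL when the driver provides no TLS ops */
	int down_calls;		/* counts how often the DOWN path ran */
};

/* Shared by REGISTER and FEAT_CHANGE: the feature bit requires ops. */
static int check_dev_ops(const struct fake_netdev *dev)
{
	return dev->tlsdev_ops ? NOTIFY_DONE : NOTIFY_BAD;
}

/* Placeholder for the real teardown work done on NETDEV_DOWN. */
static int device_down(struct fake_netdev *dev)
{
	dev->down_calls++;
	return NOTIFY_DONE;
}

/* Single feature check up front, then per-event dispatch. */
static int dev_event(struct fake_netdev *dev, unsigned long event)
{
	assert(dev != NULL);

	if (!(dev->features & HW_TLS_TX_FEATURE))
		return NOTIFY_DONE;

	switch (event) {
	case NETDEV_REGISTER:
	case NETDEV_FEAT_CHANGE:
		return check_dev_ops(dev);
	case NETDEV_DOWN:
		return device_down(dev);
	}
	return NOTIFY_DONE;
}
```

With this shape, a device without the feature bit never reaches the DOWN
path, which is the behaviour the per-handler checks had before.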

> > > > +
> > > > +    /* Request a write lock to block new offload attempts
> > > > +     */
> > > > +    percpu_down_write(&device_offload_lock);
> > > 
> > > Why is percpu_rwsem chosen here? It looks like this primitive
> > > favors readers more than a plain rwsem does, but it also
> > > penalizes writers. That would be fine if tls_device_down() were
> > > not called with rtnl_lock() held from the netdevice notifier. But
> > > since netdevice notifiers are called with rtnl_lock() held,
> > > percpu_rwsem will increase the time rtnl_lock() is held.
> > 
> > We use a rwsem to allow multiple concurrent (reader) invocations of
> > tls_set_device_offload, which is triggered by the user (presumably)
> > during the TLS handshake. This might be considered a fast path.
> > 
> > However, we must block all calls to tls_set_device_offload while we
> > are processing NETDEV_DOWN events (writer).
> > 
> > As you've mentioned, the percpu rwsem is more efficient for
> > readers, especially on NUMA systems, where cache-line bouncing
> > occurs during reader acquire and reduces performance.
> 
> Hm, and who are the readers? It's used from do_tls_setsockopt_tx(),
> but it doesn't seem to be performance critical. Who else?
> 

It is performance critical since it is done in the socket handshake
phase. Anyway, I tend to agree with you that a per-CPU rwsem is
overkill; I will change it to a regular rwsem in V2.

> > > 
> > > Can't we use plain rwsem here instead?
> > > 
> > 
> > It's a performance tradeoff. I'm not certain that the percpu rwsem
> > write side acquire is significantly worse than using the global
> > rwsem.
> > 
> > For now, while all of this is experimental, can we agree to focus
> > on the performance of readers? We can change it later if it becomes
> > a problem.
> 
> Same as above.
>  
> > > > +
> > > > +    spin_lock_irqsave(&tls_device_lock, flags);
> > > > +    INIT_LIST_HEAD(&list);
> > > 
> > > This may go outside the global spinlock.
> > > 
> > 
> > Sure.
> > 
> > > > +    list_for_each_entry_safe(ctx, tmp, &tls_device_list, list)
> > > > {
> > > > +        if (ctx->netdev != netdev ||
> > > > +            !refcount_inc_not_zero(&ctx->refcount))
> > > > +            continue;
> > > > +
> > > > +        list_move(&ctx->list, &list);
> > > > +    }
> > > > +    spin_unlock_irqrestore(&tls_device_lock, flags);
> > > > +
> > > > +    list_for_each_entry_safe(ctx, tmp, &list, list)    {
> > > > +        netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> > > > +                        TLS_OFFLOAD_CTX_DIR_TX);
> > > > +        ctx->netdev = NULL;
> > > > +        dev_put(netdev);
> > > > +        list_del_init(&ctx->list);
> > > > +
> > > > +        if (refcount_dec_and_test(&ctx->refcount))
> > > > +            tls_device_free_ctx(ctx);
> > > > +    }
> > > > +
> > > > +    percpu_up_write(&device_offload_lock);
> > > > +
> > > > +    flush_work(&tls_device_gc_work);
> > > > +
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > > +
> > > > +static int tls_dev_event(struct notifier_block *this, unsigned
> > > > long event,
> > > > +             void *ptr)
> > > > +{
> > > > +    struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> > > > +
> > > > +    switch (event) {
> > > > +    case NETDEV_REGISTER:
> > > > +        return tls_device_register(dev);
> > > > +
> > > > +    case NETDEV_UNREGISTER:
> > > > +        return tls_device_unregister(dev);
> > > > +
> > > > +    case NETDEV_FEAT_CHANGE:
> > > > +        return tls_device_feat_change(dev);
> > > > +
> > > > +    case NETDEV_DOWN:
> > > > +        return tls_device_down(dev);
> > > > +    }
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > > +
> > > > +static struct notifier_block tls_dev_notifier = {
> > > > +    .notifier_call    = tls_dev_event,
> > > > +};
> > > > +
> > > > +void __init tls_device_init(void)
> > > > +{
> > > > +    register_netdevice_notifier(&tls_dev_notifier);
> > > > +}
> > > > +
> > > > +void __exit tls_device_cleanup(void)
> > > > +{
> > > > +    unregister_netdevice_notifier(&tls_dev_notifier);
> > > > +    flush_work(&tls_device_gc_work);
> > > > +}
> > > > diff --git a/net/tls/tls_device_fallback.c
> > > > b/net/tls/tls_device_fallback.c
> > > > new file mode 100644
> > > > index 000000000000..14d31a36885c
> > > > --- /dev/null
> > > > +++ b/net/tls/tls_device_fallback.c
> > > > @@ -0,0 +1,419 @@
> > > > +/* Copyright (c) 2018, Mellanox Technologies All rights
> > > > reserved.
> > > > + *
> > > > + *     Redistribution and use in source and binary forms, with
> > > > or
> > > > + *     without modification, are permitted provided that the
> > > > following
> > > > + *     conditions are met:
> > > > + *
> > > > + *      - Redistributions of source code must retain the above
> > > > + *        copyright notice, this list of conditions and the
> > > > following
> > > > + *        disclaimer.
> > > > + *
> > > > + *      - Redistributions in binary form must reproduce the
> > > > above
> > > > + *        copyright notice, this list of conditions and the
> > > > following
> > > > + *        disclaimer in the documentation and/or other
> > > > materials
> > > > + *        provided with the distribution.
> > > > + *
> > > > + *      - Neither the name of the Mellanox Technologies nor
> > > > the
> > > > + *        names of its contributors may be used to endorse or
> > > > promote
> > > > + *        products derived from this software without specific
> > > > prior written
> > > > + *        permission.
> > > > + *
> > > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> > > > CONTRIBUTORS "AS IS"
> > > > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > LIMITED TO,
> > > > + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > + * A PARTICULAR PURPOSE ARE DISCLAIMED.
> > > > + * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
> > > > LIABLE FOR
> > > > + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > > > CONSEQUENTIAL
> > > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> > > > SUBSTITUTE GOODS OR
> > > > + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> > > > INTERRUPTION)
> > > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
> > > > CONTRACT,
> > > > + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
> > > > OTHERWISE) ARISING
> > > > + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
> > > > OF THE
> > > > + * POSSIBILITY OF SUCH DAMAGE
> > > > + */
> > > > +
> > > > +#include <net/tls.h>
> > > > +#include <crypto/aead.h>
> > > > +#include <crypto/scatterwalk.h>
> > > > +#include <net/ip6_checksum.h>
> > > > +
> > > > +static void chain_to_walk(struct scatterlist *sg, struct
> > > > scatter_walk *walk)
> > > > +{
> > > > +    struct scatterlist *src = walk->sg;
> > > > +    int diff = walk->offset - src->offset;
> > > > +
> > > > +    sg_set_page(sg, sg_page(src),
> > > > +            src->length - diff, walk->offset);
> > > > +
> > > > +    scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
> > > > +}
> > > > +
> > > > +static int tls_enc_record(struct aead_request *aead_req,
> > > > +              struct crypto_aead *aead, char *aad, char *iv,
> > > > +              __be64 rcd_sn, struct scatter_walk *in,
> > > > +              struct scatter_walk *out, int *in_len)
> > > > +{
> > > > +    struct scatterlist sg_in[3];
> > > > +    struct scatterlist sg_out[3];
> > > > +    unsigned char buf[TLS_HEADER_SIZE +
> > > > TLS_CIPHER_AES_GCM_128_IV_SIZE];
> > > > +    u16 len;
> > > > +    int rc;
> > > > +
> > > > +    len = min_t(int, *in_len, ARRAY_SIZE(buf));
> > > > +
> > > > +    scatterwalk_copychunks(buf, in, len, 0);
> > > > +    scatterwalk_copychunks(buf, out, len, 1);
> > > > +
> > > > +    *in_len -= len;
> > > > +    if (!*in_len)
> > > > +        return 0;
> > > > +
> > > > +    scatterwalk_pagedone(in, 0, 1);
> > > > +    scatterwalk_pagedone(out, 1, 1);
> > > > +
> > > > +    len = buf[4] | (buf[3] << 8);
> > > > +    len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
> > > > +
> > > > +    tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
> > > > +             (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
> > > > +
> > > > +    memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf +
> > > > TLS_HEADER_SIZE,
> > > > +           TLS_CIPHER_AES_GCM_128_IV_SIZE);
> > > > +
> > > > +    sg_init_table(sg_in, ARRAY_SIZE(sg_in));
> > > > +    sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> > > > +    sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
> > > > +    sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
> > > > +    chain_to_walk(sg_in + 1, in);
> > > > +    chain_to_walk(sg_out + 1, out);
> > > > +
> > > > +    *in_len -= len;
> > > > +    if (*in_len < 0) {
> > > > +        *in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> > > > +        if (*in_len < 0)
> > > > +        /* the input buffer doesn't contain the entire record.
> > > > +         * trim len accordingly. The resulting authentication
> > > > tag
> > > > +         * will contain garbage. but we don't care as we won't
> > > > +         * include any of it in the output skb
> > > > +         * Note that we assume the output buffer length
> > > > +         * is larger then input buffer length + tag size
> > > > +         */
> > > > +            len += *in_len;
> > > > +
> > > > +        *in_len = 0;
> > > > +    }
> > > > +
> > > > +    if (*in_len) {
> > > > +        scatterwalk_copychunks(NULL, in, len, 2);
> > > > +        scatterwalk_pagedone(in, 0, 1);
> > > > +        scatterwalk_copychunks(NULL, out, len, 2);
> > > > +        scatterwalk_pagedone(out, 1, 1);
> > > > +    }
> > > > +
> > > > +    len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> > > > +    aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
> > > > +
> > > > +    rc = crypto_aead_encrypt(aead_req);
> > > > +
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +static void tls_init_aead_request(struct aead_request
> > > > *aead_req,
> > > > +                  struct crypto_aead *aead)
> > > > +{
> > > > +    aead_request_set_tfm(aead_req, aead);
> > > > +    aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
> > > > +}
> > > > +
> > > > +static struct aead_request *tls_alloc_aead_request(struct
> > > > crypto_aead *aead,
> > > > +                           gfp_t flags)
> > > > +{
> > > > +    unsigned int req_size = sizeof(struct aead_request) +
> > > > +        crypto_aead_reqsize(aead);
> > > > +    struct aead_request *aead_req;
> > > > +
> > > > +    aead_req = kzalloc(req_size, flags);
> > > > +    if (!aead_req)
> > > > +        return NULL;
> > > > +
> > > > +    tls_init_aead_request(aead_req, aead);
> > > > +    return aead_req;
> > > > +}
> > > > +
> > > > +static int tls_enc_records(struct aead_request *aead_req,
> > > > +               struct crypto_aead *aead, struct scatterlist
> > > > *sg_in,
> > > > +               struct scatterlist *sg_out, char *aad, char
> > > > *iv,
> > > > +               u64 rcd_sn, int len)
> > > > +{
> > > > +    struct scatter_walk in;
> > > > +    struct scatter_walk out;
> > > > +    int rc;
> > > > +
> > > > +    scatterwalk_start(&in, sg_in);
> > > > +    scatterwalk_start(&out, sg_out);
> > > > +
> > > > +    do {
> > > > +        rc = tls_enc_record(aead_req, aead, aad, iv,
> > > > +                    cpu_to_be64(rcd_sn), &in, &out, &len);
> > > > +        rcd_sn++;
> > > > +
> > > > +    } while (rc == 0 && len);
> > > > +
> > > > +    scatterwalk_done(&in, 0, 0);
> > > > +    scatterwalk_done(&out, 1, 0);
> > > > +
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +static inline void update_chksum(struct sk_buff *skb, int
> > > > headln)
> > > > +{
> > > > +    /* Can't use icsk->icsk_af_ops->send_check here because
> > > > the ip addresses
> > > > +     * might have been changed by NAT.
> > > > +     */
> > > > +
> > > > +    const struct ipv6hdr *ipv6h;
> > > > +    const struct iphdr *iph;
> > > > +    struct tcphdr *th = tcp_hdr(skb);
> > > > +    int datalen = skb->len - headln;
> > > > +
> > > > +    /* We only changed the payload so if we are using partial
> > > > we don't
> > > > +     * need to update anything.
> > > > +     */
> > > > +    if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
> > > > +        return;
> > > > +
> > > > +    skb->ip_summed = CHECKSUM_PARTIAL;
> > > > +    skb->csum_start = skb_transport_header(skb) - skb->head;
> > > > +    skb->csum_offset = offsetof(struct tcphdr, check);
> > > > +
> > > > +    if (skb->sk->sk_family == AF_INET6) {
> > > > +        ipv6h = ipv6_hdr(skb);
> > > > +        th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h-
> > > > >daddr,
> > > > +                         datalen, IPPROTO_TCP, 0);
> > > > +    } else {
> > > > +        iph = ip_hdr(skb);
> > > > +        th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, 
> > > > datalen,
> > > > +                           IPPROTO_TCP, 0);
> > > > +    }
> > > > +}
> > > > +
> > > > +static void complete_skb(struct sk_buff *nskb, struct sk_buff
> > > > *skb, int headln)
> > > > +{
> > > > +    skb_copy_header(nskb, skb);
> > > > +
> > > > +    skb_put(nskb, skb->len);
> > > > +    memcpy(nskb->data, skb->data, headln);
> > > > +    update_chksum(nskb, headln);
> > > > +
> > > > +    nskb->destructor = skb->destructor;
> > > > +    nskb->sk = skb->sk;
> > > > +    skb->destructor = NULL;
> > > > +    skb->sk = NULL;
> > > > +    refcount_add(nskb->truesize - skb->truesize,
> > > > +             &nskb->sk->sk_wmem_alloc);
> > > > +}
> > > > +
> > > > +/* This function may be called after the user socket is
> > > > already
> > > > + * closed so make sure we don't use anything freed during
> > > > + * tls_sk_proto_close here
> > > > + */
> > > > +static struct sk_buff *tls_sw_fallback(struct sock *sk, struct
> > > > sk_buff *skb)
> > > > +{
> > > > +    int tcp_header_size = tcp_hdrlen(skb);
> > > > +    int tcp_payload_offset = skb_transport_offset(skb) +
> > > > tcp_header_size;
> > > > +    int payload_len = skb->len - tcp_payload_offset;
> > > > +    struct tls_context *tls_ctx = tls_get_ctx(sk);
> > > > +    struct tls_offload_context *ctx =
> > > > tls_offload_ctx(tls_ctx);
> > > > +    int remaining, buf_len, resync_sgs, rc, i = 0;
> > > > +    void *buf, *dummy_buf, *iv, *aad;
> > > > +    struct scatterlist *sg_in;
> > > > +    struct scatterlist sg_out[3];
> > > > +    u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> > > > +    struct aead_request *aead_req;
> > > > +    struct sk_buff *nskb = NULL;
> > > > +    struct tls_record_info *record;
> > > > +    unsigned long flags;
> > > > +    s32 sync_size;
> > > > +    u64 rcd_sn;
> > > > +
> > > > +    /* worst case is:
> > > > +     * MAX_SKB_FRAGS in tls_record_info
> > > > +     * MAX_SKB_FRAGS + 1 in SKB head an frags.
> > > > +     */
> > > > +    int sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
> > > > +
> > > > +    if (!payload_len)
> > > > +        return skb;
> > > > +
> > > > +    sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in),
> > > > GFP_ATOMIC);
> > > > +    if (!sg_in)
> > > > +        goto free_orig;
> > > > +
> > > > +    sg_init_table(sg_in, sg_in_max_elements);
> > > > +    sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> > > > +
> > > > +    spin_lock_irqsave(&ctx->lock, flags);
> > > > +    record = tls_get_record(ctx, tcp_seq, &rcd_sn);
> > > > +    if (!record) {
> > > > +        spin_unlock_irqrestore(&ctx->lock, flags);
> > > > +        WARN(1, "Record not found for seq %u\n", tcp_seq);
> > > > +        goto free_sg;
> > > > +    }
> > > > +
> > > > +    sync_size = tcp_seq - tls_record_start_seq(record);
> > > > +    if (sync_size < 0) {
> > > > +        int is_start_marker =
> > > > tls_record_is_start_marker(record);
> > > > +
> > > > +        spin_unlock_irqrestore(&ctx->lock, flags);
> > > > +        if (!is_start_marker)
> > > > +        /* This should only occur if the relevant record was
> > > > +         * already acked. In that case it should be ok
> > > > +         * to drop the packet and avoid retransmission.
> > > > +         *
> > > > +         * There is a corner case where the packet contains
> > > > +         * both an acked and a non-acked record.
> > > > +         * We currently don't handle that case and rely
> > > > +         * on TCP to retranmit a packet that doesn't contain
> > > > +         * already acked payload.
> > > > +         */
> > > > +            goto free_orig;
> > > > +
> > > > +        if (payload_len > -sync_size) {
> > > > +            WARN(1, "Fallback of partially offloaded packets
> > > > is not supported\n");
> > > > +            goto free_sg;
> > > > +        } else {
> > > > +            return skb;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    remaining = sync_size;
> > > > +    while (remaining > 0) {
> > > > +        skb_frag_t *frag = &record->frags[i];
> > > > +
> > > > +        __skb_frag_ref(frag);
> > > > +        sg_set_page(sg_in + i, skb_frag_page(frag),
> > > > +                skb_frag_size(frag), frag->page_offset);
> > > > +
> > > > +        remaining -= skb_frag_size(frag);
> > > > +
> > > > +        if (remaining < 0)
> > > > +            sg_in[i].length += remaining;
> > > > +
> > > > +        i++;
> > > > +    }
> > > > +    spin_unlock_irqrestore(&ctx->lock, flags);
> > > > +    resync_sgs = i;
> > > > +
> > > > +    aead_req = tls_alloc_aead_request(ctx->aead_send,
> > > > GFP_ATOMIC);
> > > > +    if (!aead_req)
> > > > +        goto put_sg;
> > > > +
> > > > +    buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> > > > +          TLS_CIPHER_AES_GCM_128_IV_SIZE +
> > > > +          TLS_AAD_SPACE_SIZE +
> > > > +          sync_size +
> > > > +          tls_ctx->tag_size;
> > > > +    buf = kmalloc(buf_len, GFP_ATOMIC);
> > > > +    if (!buf)
> > > > +        goto free_req;
> > > > +
> > > > +    nskb = alloc_skb(skb_headroom(skb) + skb->len,
> > > > GFP_ATOMIC);
> > > > +    if (!nskb)
> > > > +        goto free_buf;
> > > > +
> > > > +    skb_reserve(nskb, skb_headroom(skb));
> > > > +
> > > > +    iv = buf;
> > > > +
> > > > +    memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
> > > > +           TLS_CIPHER_AES_GCM_128_SALT_SIZE);
> > > > +    aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> > > > +          TLS_CIPHER_AES_GCM_128_IV_SIZE;
> > > > +    dummy_buf = aad + TLS_AAD_SPACE_SIZE;
> > > > +
> > > > +    sg_set_buf(&sg_out[0], dummy_buf, sync_size);
> > > > +    sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset,
> > > > +           payload_len);
> > > > +    /* Add room for authentication tag produced by crypto */
> > > > +    dummy_buf += sync_size;
> > > > +    sg_set_buf(&sg_out[2], dummy_buf, tls_ctx->tag_size);
> > > > +    rc = skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset,
> > > > +              payload_len);
> > > > +    if (rc < 0)
> > > > +        goto free_nskb;
> > > > +
> > > > +    rc = tls_enc_records(aead_req, ctx->aead_send, sg_in,
> > > > sg_out, aad, iv,
> > > > +                 rcd_sn, sync_size + payload_len);
> > > > +    if (rc < 0)
> > > > +        goto free_nskb;
> > > > +
> > > > +    complete_skb(nskb, skb, tcp_payload_offset);
> > > > +
> > > > +    /* validate_xmit_skb_list assumes that if the skb wasn't
> > > > segmented
> > > > +     * nskb->prev will point to the skb itself
> > > > +     */
> > > > +    nskb->prev = nskb;
> > > > +free_buf:
> > > > +    kfree(buf);
> > > > +free_req:
> > > > +    kfree(aead_req);
> > > > +put_sg:
> > > > +    for (i = 0; i < resync_sgs; i++)
> > > > +        put_page(sg_page(&sg_in[i]));
> > > > +free_sg:
> > > > +    kfree(sg_in);
> > > > +free_orig:
> > > > +    kfree_skb(skb);
> > > > +    return nskb;
> > > > +
> > > > +free_nskb:
> > > > +    kfree_skb(nskb);
> > > > +    nskb = NULL;
> > > > +    goto free_buf;
> > > > +}
> > > > +
> > > > +static struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
> > > > +                         struct net_device *dev,
> > > > +                         struct sk_buff *skb)
> > > > +{
> > > > +    if (dev == tls_get_ctx(sk)->netdev)
> > > > +        return skb;
> > > > +
> > > > +    return tls_sw_fallback(sk, skb);
> > > > +}
> > > > +
> > > > +int tls_sw_fallback_init(struct sock *sk,
> > > > +             struct tls_offload_context *offload_ctx,
> > > > +             struct tls_crypto_info *crypto_info)
> > > > +{
> > > > +    int rc;
> > > > +    const u8 *key;
> > > > +
> > > > +    offload_ctx->aead_send =
> > > > +        crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
> > > > +    if (IS_ERR(offload_ctx->aead_send)) {
> > > > +        rc = PTR_ERR(offload_ctx->aead_send);
> > > > +        pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n",
> > > > rc);
> > > > +        offload_ctx->aead_send = NULL;
> > > > +        goto err_out;
> > > > +    }
> > > > +
> > > > +    key = ((struct tls12_crypto_info_aes_gcm_128
> > > > *)crypto_info)->key;
> > > > +
> > > > +    rc = crypto_aead_setkey(offload_ctx->aead_send, key,
> > > > +                TLS_CIPHER_AES_GCM_128_KEY_SIZE);
> > > > +    if (rc)
> > > > +        goto free_aead;
> > > > +
> > > > +    rc = crypto_aead_setauthsize(offload_ctx->aead_send,
> > > > +                     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
> > > > +    if (rc)
> > > > +        goto free_aead;
> > > > +
> > > > +    sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
> > > > +    return 0;
> > > > +free_aead:
> > > > +    crypto_free_aead(offload_ctx->aead_send);
> > > > +err_out:
> > > > +    return rc;
> > > > +}
> > > > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> > > > index d824d548447e..e0dface33017 100644
> > > > --- a/net/tls/tls_main.c
> > > > +++ b/net/tls/tls_main.c
> > > > @@ -54,6 +54,9 @@ enum {
> > > >   enum {
> > > >       TLS_BASE_TX,
> > > >       TLS_SW_TX,
> > > > +#ifdef CONFIG_TLS_DEVICE
> > > > +    TLS_HW_TX,
> > > > +#endif
> > > >       TLS_NUM_CONFIG,
> > > >   };
> > > >   @@ -416,11 +419,19 @@ static int do_tls_setsockopt_tx(struct
> > > > sock *sk, char __user *optval,
> > > >           goto err_crypto_info;
> > > >       }
> > > >   -    /* currently SW is default, we will have ethtool in
> > > > future */
> > > > -    rc = tls_set_sw_offload(sk, ctx);
> > > > -    tx_conf = TLS_SW_TX;
> > > > -    if (rc)
> > > > -        goto err_crypto_info;
> > > > +#ifdef CONFIG_TLS_DEVICE
> > > > +    rc = tls_set_device_offload(sk, ctx);
> > > > +    tx_conf = TLS_HW_TX;
> > > > +    if (rc) {
> > > > +#else
> > > > +    {
> > > > +#endif
> > > > +        /* if HW offload fails fallback to SW */
> > > > +        rc = tls_set_sw_offload(sk, ctx);
> > > > +        tx_conf = TLS_SW_TX;
> > > > +        if (rc)
> > > > +            goto err_crypto_info;
> > > > +    }
> > > >         ctx->tx_conf = tx_conf;
> > > >       update_sk_prot(sk, ctx);
> > > > @@ -473,6 +484,12 @@ static void build_protos(struct proto
> > > > *prot, struct proto *base)
> > > >       prot[TLS_SW_TX] = prot[TLS_BASE_TX];
> > > >       prot[TLS_SW_TX].sendmsg        = tls_sw_sendmsg;
> > > >       prot[TLS_SW_TX].sendpage    = tls_sw_sendpage;
> > > > +
> > > > +#ifdef CONFIG_TLS_DEVICE
> > > > +    prot[TLS_HW_TX] = prot[TLS_SW_TX];
> > > > +    prot[TLS_HW_TX].sendmsg        = tls_device_sendmsg;
> > > > +    prot[TLS_HW_TX].sendpage    = tls_device_sendpage;
> > > > +#endif
> > > >   }
> > > >     static int tls_init(struct sock *sk)
> > > > @@ -531,6 +548,9 @@ static int __init tls_register(void)
> > > >   {
> > > >       build_protos(tls_prots[TLSV4], &tcp_prot);
> > > >   +#ifdef CONFIG_TLS_DEVICE
> > > > +    tls_device_init();
> > > > +#endif
> > > >       tcp_register_ulp(&tcp_tls_ulp_ops);
> > > >         return 0;
> > > > @@ -539,6 +559,9 @@ static int __init tls_register(void)
> > > >   static void __exit tls_unregister(void)
> > > >   {
> > > >       tcp_unregister_ulp(&tcp_tls_ulp_ops);
> > > > +#ifdef CONFIG_TLS_DEVICE
> > > > +    tls_device_cleanup();
> > > > +#endif
> > > >   }
> > > >     module_init(tls_register);
> 
> Thanks,
> Kirill


* Re: [bug, bisected] pfifo_fast causes packet reordering
From: John Fastabend @ 2018-03-21 20:52 UTC (permalink / raw)
  To: Jakob Unterwurzacher, Dave Taht
  Cc: netdev, linux-kernel, David S. Miller, linux-can@vger.kernel.org,
	Martin Elshuber
In-Reply-To: <983427eb-2e25-f201-c953-4cff22569deb@theobroma-systems.com>

On 03/21/2018 12:44 PM, Jakob Unterwurzacher wrote:
> On 21.03.18 19:43, John Fastabend wrote:
>> That's my theory at least. Are you able to test a patch if I generate
>> one to fix this?
> 
> Yes, no problem.

Can you try this,

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d4907b5..1e596bd 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -30,6 +30,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
        __QDISC_STATE_SCHED,
        __QDISC_STATE_DEACTIVATED,
+       __QDISC_STATE_RUNNING,
 };
 
 struct qdisc_size_table {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 190570f..cf7c37d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -377,20 +377,26 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
        struct netdev_queue *txq;
        struct net_device *dev;
        struct sk_buff *skb;
-       bool validate;
+       bool more, validate;
 
        /* Dequeue packet */
+       if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+               return false;
+
        skb = dequeue_skb(q, &validate, packets);
-       if (unlikely(!skb))
+       if (unlikely(!skb)) {
+               clear_bit(__QDISC_STATE_RUNNING, &q->state);
                return false;
+       }
 
        if (!(q->flags & TCQ_F_NOLOCK))
                root_lock = qdisc_lock(q);
 
        dev = qdisc_dev(q);
        txq = skb_get_tx_queue(dev, skb);
-
-       return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+       more = sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+       clear_bit(__QDISC_STATE_RUNNING, &q->state);
+       return more;
 }


> 
> I just tested with the flag change you suggested (see below, I had to keep TCQ_F_CPUSTATS to prevent a crash) and I have NOT seen OOO so far.
> 

Right, because the code expects per-cpu stats; if the CPUSTATS flag is
removed it will crash.

> Thanks,
> Jakob
> 
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 190570f21b20..51b68ef4977b 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -792,7 +792,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
>         .dump           =       pfifo_fast_dump,
>         .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
>         .owner          =       THIS_MODULE,
> -       .static_flags   =       TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
> +       .static_flags   =       TCQ_F_CPUSTATS,
>  };
>  EXPORT_SYMBOL(pfifo_fast_ops);

^ permalink raw reply related

* [PATCH V2 net-next 00/14] TLS offload, netdev & MLX5 support
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Dave Watson, Boris Pismenny, Saeed Mahameed

Hi Dave,

The following series from Ilya and Boris provides TLS TX inline crypto
offload.

v1->v2:
   - Added IS_ENABLED(CONFIG_TLS_DEVICE) and a STATIC_KEY for icsk_clean_acked
   - File license fix
   - Fix spelling, comment by DaveW
   - Move memory allocations out of tls_set_device_offload and other misc fixes,
	comments by Kirill.

Boris says:
===================
This series adds a generic infrastructure to offload TLS crypto to a
network device. It enables the kernel TLS socket to skip encryption and
authentication operations on the transmit side of the data path, leaving
those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the
TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records". These
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the TLS NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed;
this is done through the new icsk_clean_acked callback.
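
For illustration only (this is not code from the series), the mapping
described in point 2 and the release-on-ack rule can be sketched in
user-space C. The names rec_info, rec_lookup and rec_count_acked are
hypothetical, the records live in a plain array rather than a list, and
the lookup is a linear scan; only the end_seq/len representation is
taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record descriptor: only the two fields the
 * lookup needs. */
struct rec_info {
	uint32_t end_seq;	/* TCP sequence number just past the record */
	uint32_t len;		/* record length in bytes */
};

/* Wrap-safe "a is before b" comparison on 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* First TCP sequence number covered by the record. */
static uint32_t rec_start_seq(const struct rec_info *rec)
{
	return rec->end_seq - rec->len;
}

/* Return the record containing seq, or NULL if none does.
 * recs[] must be ordered by end_seq, i.e. in transmit order. */
static const struct rec_info *rec_lookup(const struct rec_info *recs,
					 size_t n, uint32_t seq)
{
	assert(recs || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (seq_before(seq, recs[i].end_seq))
			return seq_before(seq, rec_start_seq(&recs[i])) ?
				NULL : &recs[i];
	}
	return NULL;
}

/* Count the leading records whose last byte has been acknowledged and
 * which could therefore be released. */
static size_t rec_count_acked(const struct rec_info *recs, size_t n,
			      uint32_t acked_seq)
{
	size_t i = 0;

	while (i < n && !seq_before(acked_seq, recs[i].end_seq))
		i++;
	return i;
}
```

With two records covering sequence numbers [1000, 1100) and
[1100, 1300), a lookup of 1099 returns the first record, a lookup of
1100 returns the second, and an ACK of 1100 makes exactly one record
releasable.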

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (e.g. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN, the
driver calls the TLS layer to obtain the TLS record that includes the
TCP SN of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmits the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.
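
As a rough sketch of the driver-side decision described above (not the
mlx5 code; tx_offload_state and tx_offload_packet are hypothetical
names, and posting the work entry is reduced to a counter):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-stream driver state. */
struct tx_offload_state {
	uint32_t expected_seq;	/* next TCP SN the NIC expects to see */
	unsigned int resyncs;	/* context reconstructions posted so far */
};

/* Handle one packet about to be transmitted. Returns true when the
 * packet did not match the expected TCP SN, in which case a context
 * reconstruction entry would be posted on the same TX queue ahead of
 * it. Either way the expected SN ends up just past this packet. */
static bool tx_offload_packet(struct tx_offload_state *st,
			      uint32_t seq, uint32_t payload_len)
{
	bool resync;

	assert(st);

	resync = (seq != st->expected_seq);
	if (resync)
		st->resyncs++;	/* stand-in for posting the work entry */

	st->expected_seq = seq + payload_len;
	return resync;
}
```

Starting from an expected SN of 1000, a 500-byte packet at SN 1000 goes
out directly and moves the expected SN to 1500; a retransmission of the
same packet then triggers one reconstruction and again leaves the
expected SN at 1500.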

Expected TCP SN is accessed without a lock, under the assumption that
TCP doesn't transmit SKBs from different TX queues concurrently.

We assume that packets are not rerouted to a different network device.

Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf

===================

The series is based on latest net-next:
0466080c751e ("Merge branch 'dsa-mv88e6xxx-some-fixes'")

Thanks,
Saeed.

---

Boris Pismenny (2):
  MAINTAINERS: Update mlx5 innova driver maintainers
  MAINTAINERS: Update TLS maintainers

Ilya Lesokhin (12):
  tcp: Add clean acked data hook
  net: Rename and export copy_skb_header
  net: Add Software fallback infrastructure for socket dependent
    offloads
  net: Add TLS offload netdev ops
  net: Add TLS TX offload features
  net/tls: Add generic NIC offload infrastructure
  net/tls: Support TLS device offload with IPv6
  net/mlx5e: Move defines out of ipsec code
  net/mlx5: Accel, Add TLS tx offload interface
  net/mlx5e: TLS, Add Innova TLS TX support
  net/mlx5e: TLS, Add Innova TLS TX offload data path
  net/mlx5e: TLS, Add error statistics

 MAINTAINERS                                        |  19 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |  11 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c    |  71 ++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h    |  86 +++
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  21 +
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  72 ++
 .../ethernet/mellanox/mlx5/core/en_accel/ipsec.h   |   3 -
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 197 +++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  87 +++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 278 +++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h         |  50 ++
 .../mellanox/mlx5/core/en_accel/tls_stats.c        |  89 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  32 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  37 +-
 .../net/ethernet/mellanox/mlx5/core/fpga/core.h    |   1 +
 .../net/ethernet/mellanox/mlx5/core/fpga/ipsec.c   |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 563 ++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  68 ++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  11 +
 include/linux/mlx5/mlx5_ifc.h                      |  16 -
 include/linux/mlx5/mlx5_ifc_fpga.h                 |  77 ++
 include/linux/netdev_features.h                    |   2 +
 include/linux/netdevice.h                          |  24 +
 include/linux/skbuff.h                             |   1 +
 include/net/inet_connection_sock.h                 |   2 +
 include/net/sock.h                                 |  21 +
 include/net/tcp.h                                  |   5 +
 include/net/tls.h                                  |  74 +-
 net/Kconfig                                        |   4 +
 net/core/dev.c                                     |   4 +
 net/core/ethtool.c                                 |   1 +
 net/core/skbuff.c                                  |   9 +-
 net/ipv4/tcp.c                                     |   5 +
 net/ipv4/tcp_input.c                               |   6 +
 net/tls/Kconfig                                    |  10 +
 net/tls/Makefile                                   |   2 +
 net/tls/tls_device.c                               | 840 +++++++++++++++++++++
 net/tls/tls_device_fallback.c                      | 415 ++++++++++
 net/tls/tls_main.c                                 |  33 +-
 43 files changed, 3213 insertions(+), 65 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
 create mode 100644 net/tls/tls_device.c
 create mode 100644 net/tls/tls_device_fallback.c

-- 
2.14.3

^ permalink raw reply

* [PATCH V2 net-next 01/14] tcp: Add clean acked data hook
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel, Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Called when a TCP segment is acknowledged.
Could be used by application protocols that hold additional
metadata associated with the stream data.

This is required by TLS device offload to release
metadata associated with acknowledged TLS records.
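
A minimal user-space sketch of the hook pattern, for illustration only:
struct conn, conn_handle_ack and record_cleanup are hypothetical names,
the static key is left out, and snd_una is updated inline here, which
is a simplification of what tcp_ack() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct conn;
typedef void (*clean_acked_fn)(struct conn *c, uint32_t acked_seq);

/* Hypothetical stand-in for the connection socket. */
struct conn {
	uint32_t snd_una;		/* oldest unacknowledged sequence number */
	clean_acked_fn clean_acked;	/* optional hook, NULL if unused */
	uint32_t last_cleaned;		/* written by the example hook below */
};

/* Wrap-safe "a is after b" comparison on 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Process an incoming ACK: the hook runs only when the ACK advances
 * snd_una and a hook has been installed. */
static void conn_handle_ack(struct conn *c, uint32_t ack)
{
	assert(c);

	if (!seq_after(ack, c->snd_una))
		return;

	c->snd_una = ack;
	if (c->clean_acked)
		c->clean_acked(c, ack);
}

/* Example hook: remember up to where metadata may be released. */
static void record_cleanup(struct conn *c, uint32_t acked_seq)
{
	c->last_cleaned = acked_seq;
}
```

A duplicate or stale ACK leaves last_cleaned untouched; only an ACK
that moves snd_una forward reaches the hook.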

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/net/inet_connection_sock.h | 2 ++
 include/net/tcp.h                  | 5 +++++
 net/ipv4/tcp.c                     | 5 +++++
 net/ipv4/tcp_input.c               | 6 ++++++
 4 files changed, 18 insertions(+)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index b68fea022a82..2ab6667275df 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
  * @icsk_af_ops		   Operations which are AF_INET{4,6} specific
  * @icsk_ulp_ops	   Pluggable ULP control hook
  * @icsk_ulp_data	   ULP private data
+ * @icsk_clean_acked	   Clean acked data hook
  * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
  * @icsk_ca_state:	   Congestion control state
  * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
@@ -102,6 +103,7 @@ struct inet_connection_sock {
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
 	void			  *icsk_ulp_data;
+	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
 	struct hlist_node         icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
 	__u8			  icsk_ca_state:6,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9c9b3768b350..dba03b205680 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2101,4 +2101,9 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+extern struct static_key_false clean_acked_data_enabled;
+#endif
+
 #endif	/* _TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e553f84bde83..70056bb760d2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -297,6 +297,11 @@ DEFINE_STATIC_KEY_FALSE(tcp_have_smc);
 EXPORT_SYMBOL(tcp_have_smc);
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+DEFINE_STATIC_KEY_FALSE(clean_acked_data_enabled);
+EXPORT_SYMBOL_GPL(clean_acked_data_enabled);
+#endif
+
 /*
  * Current number of TCP sockets.
  */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 451ef3012636..21f5c647f4be 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3542,6 +3542,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, prior_snd_una)) {
 		flag |= FLAG_SND_UNA_ADVANCED;
 		icsk->icsk_retransmits = 0;
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+		if (static_branch_unlikely(&clean_acked_data_enabled))
+			if (icsk->icsk_clean_acked)
+				icsk->icsk_clean_acked(sk, ack);
+#endif
 	}
 
 	prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
-- 
2.14.3

^ permalink raw reply related

* [PATCH V2 net-next 03/14] net: Add Software fallback infrastructure for socket dependent offloads
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.
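
To illustrate the intent (a toy model, not the kernel code: toy_dev,
toy_pkt and toy_validate_xmit are hypothetical, and the software
transformation is reduced to setting a flag):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct toy_dev {
	int ifindex;
};

/* Hypothetical stand-in for struct sk_buff plus its socket state. */
struct toy_pkt {
	const struct toy_dev *offload_dev; /* device holding the offload
					    * state, NULL if not offloaded */
	bool transformed_in_sw;		/* set when the fallback ran */
};

/* Called on the transmit path with the device the packet is actually
 * leaving through. A packet from an offloaded socket that is headed
 * for a different device gets the transformation done in software. */
static struct toy_pkt *toy_validate_xmit(struct toy_pkt *pkt,
					 const struct toy_dev *dev)
{
	assert(pkt && dev);

	if (pkt->offload_dev && pkt->offload_dev != dev)
		pkt->transformed_in_sw = true;	/* stand-in for SW crypto */

	return pkt;
}
```

Packets leaving through the offloading device, and packets from sockets
that are not offloaded at all, pass through unchanged.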

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/net/sock.h | 21 +++++++++++++++++++++
 net/Kconfig        |  4 ++++
 net/core/dev.c     |  4 ++++
 3 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index b9624581d639..92a0e0c54ac1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -481,6 +481,11 @@ struct sock {
 	void			(*sk_error_report)(struct sock *sk);
 	int			(*sk_backlog_rcv)(struct sock *sk,
 						  struct sk_buff *skb);
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+	struct sk_buff*		(*sk_validate_xmit_skb)(struct sock *sk,
+							struct net_device *dev,
+							struct sk_buff *skb);
+#endif
 	void                    (*sk_destruct)(struct sock *sk);
 	struct sock_reuseport __rcu	*sk_reuseport_cb;
 	struct rcu_head		sk_rcu;
@@ -2323,6 +2328,22 @@ static inline bool sk_fullsock(const struct sock *sk)
 	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
 }
 
+/* Checks if this SKB belongs to an HW offloaded socket
+ * and whether any SW fallbacks are required based on dev.
+ */
+static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
+						   struct net_device *dev)
+{
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+	struct sock *sk = skb->sk;
+
+	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+		skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#endif
+
+	return skb;
+}
+
 /* This helper checks if a socket is a LISTEN or NEW_SYN_RECV
  * SYNACK messages can be attached to either ones (depending on SYNCOOKIE)
  */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..fe84cfe3260e 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -407,6 +407,10 @@ config GRO_CELLS
 	bool
 	default n
 
+config SOCK_VALIDATE_XMIT
+	bool
+	default n
+
 config NET_DEVLINK
 	tristate "Network physical/parent device Netlink interface"
 	help
diff --git a/net/core/dev.c b/net/core/dev.c
index d8887cc38e7b..244a4c7ab266 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3086,6 +3086,10 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device
 	if (unlikely(!skb))
 		goto out_null;
 
+	skb = sk_validate_xmit_skb(skb, dev);
+	if (unlikely(!skb))
+		goto out_null;
+
 	if (netif_needs_gso(skb, features)) {
 		struct sk_buff *segs;
 
-- 
2.14.3

^ permalink raw reply related

* [PATCH V2 net-next 02/14] net: Rename and export copy_skb_header
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function gives more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c      | 9 +++++----
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d8340e6e8814..dc0f81277723 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1031,6 +1031,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t priority);
 struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 				   gfp_t gfp_mask, bool fclone);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c13495ba6..9ae1812fb705 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1304,7 +1304,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -1312,6 +1312,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(skb_copy_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
@@ -1354,7 +1355,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 
 	BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 	return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1418,7 +1419,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 		skb_clone_fraglist(n);
 	}
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 out:
 	return n;
 }
@@ -1598,7 +1599,7 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	BUG_ON(skb_copy_bits(skb, -head_copy_len, n->head + head_copy_off,
 			     skb->len + head_copy_len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 
 	skb_headers_offset_update(n, newheadroom - oldheadroom);
 
-- 
2.14.3

^ permalink raw reply related

* [PATCH V2 net-next 04/14] net: Add TLS offload netdev ops
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel, Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Add new netdev ops to add and delete a TLS context.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 913b1cc882cf..e1fef7bb6ed4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -864,6 +864,26 @@ struct xfrmdev_ops {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+enum tls_offload_ctx_dir {
+	TLS_OFFLOAD_CTX_DIR_RX,
+	TLS_OFFLOAD_CTX_DIR_TX,
+};
+
+struct tls_crypto_info;
+struct tls_context;
+
+struct tlsdev_ops {
+	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
+			   enum tls_offload_ctx_dir direction,
+			   struct tls_crypto_info *crypto_info,
+			   u32 start_offload_tcp_sn);
+	void (*tls_dev_del)(struct net_device *netdev,
+			    struct tls_context *ctx,
+			    enum tls_offload_ctx_dir direction);
+};
+#endif
+
 struct dev_ifalias {
 	struct rcu_head rcuhead;
 	char ifalias[];
@@ -1748,6 +1768,10 @@ struct net_device {
 	const struct xfrmdev_ops *xfrmdev_ops;
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+	const struct tlsdev_ops *tlsdev_ops;
+#endif
+
 	const struct header_ops *header_ops;
 
 	unsigned int		flags;
-- 
2.14.3

^ permalink raw reply related

* [PATCH V2 net-next 05/14] net: Add TLS TX offload features
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel, Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

This patch adds a netdev feature to configure TLS TX offloads.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdev_features.h | 2 ++
 net/core/ethtool.c              | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index db84c516bcfb..18dc34202080 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -77,6 +77,7 @@ enum {
 	NETIF_F_HW_ESP_BIT,		/* Hardware ESP transformation offload */
 	NETIF_F_HW_ESP_TX_CSUM_BIT,	/* ESP with TX checksum offload */
 	NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */
+	NETIF_F_HW_TLS_TX_BIT,		/* Hardware TLS TX offload */
 
 	NETIF_F_GRO_HW_BIT,		/* Hardware Generic receive offload */
 
@@ -145,6 +146,7 @@ enum {
 #define NETIF_F_HW_ESP		__NETIF_F(HW_ESP)
 #define NETIF_F_HW_ESP_TX_CSUM	__NETIF_F(HW_ESP_TX_CSUM)
 #define	NETIF_F_RX_UDP_TUNNEL_PORT  __NETIF_F(RX_UDP_TUNNEL_PORT)
+#define NETIF_F_HW_TLS_TX	__NETIF_F(HW_TLS_TX)
 
 #define for_each_netdev_feature(mask_addr, bit)	\
 	for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 157cd9efa4be..9f07f9fe39ca 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -107,6 +107,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_HW_ESP_BIT] =		 "esp-hw-offload",
 	[NETIF_F_HW_ESP_TX_CSUM_BIT] =	 "esp-tx-csum-hw-offload",
 	[NETIF_F_RX_UDP_TUNNEL_PORT_BIT] =	 "rx-udp_tunnel-port-offload",
+	[NETIF_F_HW_TLS_TX_BIT] =	 "tls-hw-tx-offload",
 };
 
 static const char
-- 
2.14.3

^ permalink raw reply related

* [PATCH V2 net-next 07/14] net/tls: Support TLS device offload with IPv6
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Previously get_netdev_for_sock worked only with IPv4.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 net/tls/tls_device.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index e623280ea019..c35fc107d9c5 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -34,6 +34,11 @@
 #include <net/inet_common.h>
 #include <linux/highmem.h>
 #include <linux/netdevice.h>
+#include <net/addrconf.h>
+#include <net/flow.h>
+#include <linux/ipv6.h>
+#include <net/dst.h>
+#include <linux/security.h>
 
 #include <net/tls.h>
 #include <crypto/aead.h>
@@ -99,13 +104,55 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
 	spin_unlock_irqrestore(&tls_device_lock, flags);
 }
 
+static inline struct net_device *ipv6_get_netdev(struct sock *sk)
+{
+	struct net_device *dev = NULL;
+#if IS_ENABLED(CONFIG_IPV6)
+	struct inet_sock *inet = inet_sk(sk);
+	struct ipv6_pinfo *np = inet6_sk(sk);
+	struct flowi6 _fl6, *fl6 = &_fl6;
+	struct dst_entry *dst;
+
+	memset(fl6, 0, sizeof(*fl6));
+	fl6->flowi6_proto = sk->sk_protocol;
+	fl6->daddr = sk->sk_v6_daddr;
+	fl6->saddr = np->saddr;
+	fl6->flowlabel = np->flow_label;
+	IP6_ECN_flow_xmit(sk, fl6->flowlabel);
+	fl6->flowi6_oif = sk->sk_bound_dev_if;
+	fl6->flowi6_mark = sk->sk_mark;
+	fl6->fl6_sport = inet->inet_sport;
+	fl6->fl6_dport = inet->inet_dport;
+	fl6->flowi6_uid = sk->sk_uid;
+	security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
+
+	if (ipv6_stub->ipv6_dst_lookup(sock_net(sk), sk, &dst, fl6) < 0)
+		return NULL;
+
+	dev = dst->dev;
+	dev_hold(dev);
+	dst_release(dst);
+
+#endif
+	return dev;
+}
+
 /* We assume that the socket is already connected */
 static struct net_device *get_netdev_for_sock(struct sock *sk)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct net_device *netdev = NULL;
 
-	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
+	if (sk->sk_family == AF_INET)
+		netdev = dev_get_by_index(sock_net(sk),
+					  inet->cork.fl.flowi_oif);
+	else if (sk->sk_family == AF_INET6) {
+		netdev = ipv6_get_netdev(sk);
+		if (!netdev && !sk->sk_ipv6only &&
+		    ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED)
+			netdev = dev_get_by_index(sock_net(sk),
+						  inet->cork.fl.flowi_oif);
+	}
 
 	return netdev;
 }
-- 
2.14.3

^ permalink raw reply related

* [PATCH V2 net-next 06/14] net/tls: Add generic NIC offload infrastructure
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel, Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

This patch adds a generic infrastructure to offload TLS crypto to a
network device. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path,
leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to
the TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records". These
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the TLS NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed;
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (e.g. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN, the
driver calls the TLS layer to obtain the TLS record that includes the
TCP SN of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmits the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/net/tls.h             |  74 +++-
 net/tls/Kconfig               |  10 +
 net/tls/Makefile              |   2 +
 net/tls/tls_device.c          | 793 ++++++++++++++++++++++++++++++++++++++++++
 net/tls/tls_device_fallback.c | 415 ++++++++++++++++++++++
 net/tls/tls_main.c            |  33 +-
 6 files changed, 1320 insertions(+), 7 deletions(-)
 create mode 100644 net/tls/tls_device.c
 create mode 100644 net/tls/tls_device_fallback.c

diff --git a/include/net/tls.h b/include/net/tls.h
index 4913430ab807..0bfb1b0a156a 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -77,6 +77,37 @@ struct tls_sw_context {
 	struct scatterlist sg_aead_out[2];
 };
 
+struct tls_record_info {
+	struct list_head list;
+	u32 end_seq;
+	int len;
+	int num_frags;
+	skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+struct tls_offload_context {
+	struct crypto_aead *aead_send;
+	spinlock_t lock;	/* protects records list */
+	struct list_head records_list;
+	struct tls_record_info *open_record;
+	struct tls_record_info *retransmit_hint;
+	u64 hint_record_sn;
+	u64 unacked_record_sn;
+
+	struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
+	void (*sk_destruct)(struct sock *sk);
+	u8 driver_state[];
+	/* The TLS layer reserves room for driver specific state
+	 * Currently the belief is that there is not enough
+	 * driver specific state to justify another layer of indirection
+	 */
+#define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *)))
+};
+
+#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
+	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
+	 TLS_DRIVER_STATE_SIZE)
+
 enum {
 	TLS_PENDING_CLOSED_RECORD
 };
@@ -87,6 +118,10 @@ struct tls_context {
 		struct tls12_crypto_info_aes_gcm_128 crypto_send_aes_gcm_128;
 	};
 
+	struct list_head list;
+	struct net_device *netdev;
+	refcount_t refcount;
+
 	void *priv_ctx;
 
 	u8 tx_conf:2;
@@ -131,9 +166,29 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
 void tls_sw_close(struct sock *sk, long timeout);
 void tls_sw_free_tx_resources(struct sock *sk);
 
-void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
-void tls_icsk_clean_acked(struct sock *sk);
+void tls_clear_device_offload(struct sock *sk, struct tls_context *ctx);
+int tls_set_device_offload(struct sock *sk, struct tls_context *ctx);
+int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
+int tls_device_sendpage(struct sock *sk, struct page *page,
+			int offset, size_t size, int flags);
+void tls_device_sk_destruct(struct sock *sk);
+void tls_device_init(void);
+void tls_device_cleanup(void);
+
+struct tls_record_info *tls_get_record(struct tls_offload_context *context,
+				       u32 seq, u64 *p_record_sn);
+
+static inline bool tls_record_is_start_marker(struct tls_record_info *rec)
+{
+	return rec->len == 0;
+}
+
+static inline u32 tls_record_start_seq(struct tls_record_info *rec)
+{
+	return rec->end_seq - rec->len;
+}
 
+void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
 int tls_push_sg(struct sock *sk, struct tls_context *ctx,
 		struct scatterlist *sg, u16 first_offset,
 		int flags);
@@ -170,6 +225,13 @@ static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx)
 	return tls_ctx->pending_open_record_frags;
 }
 
+static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk)
+{
+	return sk_fullsock(sk) &&
+	       /* matches smp_store_release in tls_set_device_offload */
+	       smp_load_acquire(&sk->sk_destruct) == &tls_device_sk_destruct;
+}
+
 static inline void tls_err_abort(struct sock *sk)
 {
 	sk->sk_err = EBADMSG;
@@ -257,4 +319,12 @@ static inline struct tls_offload_context *tls_offload_ctx(
 int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
 		      unsigned char *record_type);
 
+struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
+				      struct net_device *dev,
+				      struct sk_buff *skb);
+
+int tls_sw_fallback_init(struct sock *sk,
+			 struct tls_offload_context *offload_ctx,
+			 struct tls_crypto_info *crypto_info);
+
 #endif /* _TLS_OFFLOAD_H */
diff --git a/net/tls/Kconfig b/net/tls/Kconfig
index eb583038c67e..9d3ef820bb16 100644
--- a/net/tls/Kconfig
+++ b/net/tls/Kconfig
@@ -13,3 +13,13 @@ config TLS
 	encryption handling of the TLS protocol to be done in-kernel.
 
 	If unsure, say N.
+
+config TLS_DEVICE
+	bool "Transport Layer Security HW offload"
+	depends on TLS
+	select SOCK_VALIDATE_XMIT
+	default n
+	---help---
+	Enable kernel support for HW offload of the TLS protocol.
+
+	If unsure, say N.
diff --git a/net/tls/Makefile b/net/tls/Makefile
index a930fd1c4f7b..4d6b728a67d0 100644
--- a/net/tls/Makefile
+++ b/net/tls/Makefile
@@ -5,3 +5,5 @@
 obj-$(CONFIG_TLS) += tls.o
 
 tls-y := tls_main.o tls_sw.o
+
+tls-$(CONFIG_TLS_DEVICE) += tls_device.o tls_device_fallback.o
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
new file mode 100644
index 000000000000..e623280ea019
--- /dev/null
+++ b/net/tls/tls_device.c
@@ -0,0 +1,793 @@
+/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/module.h>
+#include <net/tcp.h>
+#include <net/inet_common.h>
+#include <linux/highmem.h>
+#include <linux/netdevice.h>
+
+#include <net/tls.h>
+#include <crypto/aead.h>
+
+/* device_offload_lock is used to synchronize tls_dev_add
+ * against NETDEV_DOWN notifications.
+ */
+static DECLARE_RWSEM(device_offload_lock);
+
+static void tls_device_gc_task(struct work_struct *work);
+
+static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
+static LIST_HEAD(tls_device_gc_list);
+static LIST_HEAD(tls_device_list);
+static DEFINE_SPINLOCK(tls_device_lock);
+
+static void tls_device_free_ctx(struct tls_context *ctx)
+{
+	struct tls_offload_context *offload_ctx = tls_offload_ctx(ctx);
+
+	kfree(offload_ctx);
+	kfree(ctx);
+}
+
+static void tls_device_gc_task(struct work_struct *work)
+{
+	struct tls_context *ctx, *tmp;
+	struct list_head gc_list;
+	unsigned long flags;
+
+	INIT_LIST_HEAD(&gc_list);
+
+	spin_lock_irqsave(&tls_device_lock, flags);
+	list_splice_init(&tls_device_gc_list, &gc_list);
+	spin_unlock_irqrestore(&tls_device_lock, flags);
+
+	list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
+		struct net_device *netdev = ctx->netdev;
+
+		if (netdev) {
+			netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
+							TLS_OFFLOAD_CTX_DIR_TX);
+			dev_put(netdev);
+		}
+
+		list_del(&ctx->list);
+		tls_device_free_ctx(ctx);
+	}
+}
+
+static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&tls_device_lock, flags);
+	list_move_tail(&ctx->list, &tls_device_gc_list);
+
+	/* schedule_work inside the spinlock
+	 * to make sure tls_device_down waits for that work.
+	 */
+	schedule_work(&tls_device_gc_work);
+
+	spin_unlock_irqrestore(&tls_device_lock, flags);
+}
+
+/* We assume that the socket is already connected */
+static struct net_device *get_netdev_for_sock(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct net_device *netdev = NULL;
+
+	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
+
+	return netdev;
+}
+
+static int attach_sock_to_netdev(struct sock *sk, struct net_device *netdev,
+				 struct tls_context *ctx)
+{
+	int rc;
+
+	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
+					     &ctx->crypto_send,
+					     tcp_sk(sk)->write_seq);
+	if (rc) {
+		pr_err_ratelimited("The netdev has refused to offload this socket\n");
+		goto out;
+	}
+
+	rc = 0;
+out:
+	return rc;
+}
+
+static void destroy_record(struct tls_record_info *record)
+{
+	skb_frag_t *frag;
+	int nr_frags = record->num_frags;
+
+	while (nr_frags > 0) {
+		frag = &record->frags[nr_frags - 1];
+		__skb_frag_unref(frag);
+		--nr_frags;
+	}
+	kfree(record);
+}
+
+static void delete_all_records(struct tls_offload_context *offload_ctx)
+{
+	struct tls_record_info *info, *temp;
+
+	list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) {
+		list_del(&info->list);
+		destroy_record(info);
+	}
+
+	offload_ctx->retransmit_hint = NULL;
+}
+
+static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx;
+	struct tls_record_info *info, *temp;
+	unsigned long flags;
+	u64 deleted_records = 0;
+
+	if (!tls_ctx)
+		return;
+
+	ctx = tls_offload_ctx(tls_ctx);
+
+	spin_lock_irqsave(&ctx->lock, flags);
+	info = ctx->retransmit_hint;
+	if (info && !before(acked_seq, info->end_seq)) {
+		ctx->retransmit_hint = NULL;
+		list_del(&info->list);
+		destroy_record(info);
+		deleted_records++;
+	}
+
+	list_for_each_entry_safe(info, temp, &ctx->records_list, list) {
+		if (before(acked_seq, info->end_seq))
+			break;
+		list_del(&info->list);
+
+		destroy_record(info);
+		deleted_records++;
+	}
+
+	ctx->unacked_record_sn += deleted_records;
+	spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+/* At this point, there should be no references on this
+ * socket and no in-flight SKBs associated with this
+ * socket, so it is safe to free all the resources.
+ */
+void tls_device_sk_destruct(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+
+	if (ctx->open_record)
+		destroy_record(ctx->open_record);
+
+	delete_all_records(ctx);
+	crypto_free_aead(ctx->aead_send);
+	ctx->sk_destruct(sk);
+	static_branch_dec(&clean_acked_data_enabled);
+
+	if (refcount_dec_and_test(&tls_ctx->refcount))
+		tls_device_queue_ctx_destruction(tls_ctx);
+}
+EXPORT_SYMBOL(tls_device_sk_destruct);
+
+static inline void tls_append_frag(struct tls_record_info *record,
+				   struct page_frag *pfrag,
+				   int size)
+{
+	skb_frag_t *frag;
+
+	frag = &record->frags[record->num_frags - 1];
+	if (frag->page.p == pfrag->page &&
+	    frag->page_offset + frag->size == pfrag->offset) {
+		frag->size += size;
+	} else {
+		++frag;
+		frag->page.p = pfrag->page;
+		frag->page_offset = pfrag->offset;
+		frag->size = size;
+		++record->num_frags;
+		get_page(pfrag->page);
+	}
+
+	pfrag->offset += size;
+	record->len += size;
+}
+
+static inline int tls_push_record(struct sock *sk,
+				  struct tls_context *ctx,
+				  struct tls_offload_context *offload_ctx,
+				  struct tls_record_info *record,
+				  struct page_frag *pfrag,
+				  int flags,
+				  unsigned char record_type)
+{
+	skb_frag_t *frag;
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct page_frag fallback_frag;
+	struct page_frag  *tag_pfrag = pfrag;
+	int i;
+
+	/* fill prepend */
+	frag = &record->frags[0];
+	tls_fill_prepend(ctx,
+			 skb_frag_address(frag),
+			 record->len - ctx->prepend_size,
+			 record_type);
+
+	if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag, GFP_KERNEL))) {
+		/* HW doesn't care about the data in the tag,
+		 * so if pfrag has no room for a tag and we
+		 * can't allocate a new pfrag, just use the
+		 * page in the first frag rather than write
+		 * complicated fallback code.
+		 */
+		tag_pfrag = &fallback_frag;
+		tag_pfrag->page = skb_frag_page(frag);
+		tag_pfrag->offset = 0;
+	}
+
+	tls_append_frag(record, tag_pfrag, ctx->tag_size);
+	record->end_seq = tp->write_seq + record->len;
+	spin_lock_irq(&offload_ctx->lock);
+	list_add_tail(&record->list, &offload_ctx->records_list);
+	spin_unlock_irq(&offload_ctx->lock);
+	offload_ctx->open_record = NULL;
+	set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
+	tls_advance_record_sn(sk, ctx);
+
+	for (i = 0; i < record->num_frags; i++) {
+		frag = &record->frags[i];
+		sg_unmark_end(&offload_ctx->sg_tx_data[i]);
+		sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
+			    frag->size, frag->page_offset);
+		sk_mem_charge(sk, frag->size);
+		get_page(skb_frag_page(frag));
+	}
+	sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]);
+
+	/* all ready, send */
+	return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
+}
+
+static inline int tls_create_new_record(struct tls_offload_context *offload_ctx,
+					struct page_frag *pfrag,
+					size_t prepend_size)
+{
+	skb_frag_t *frag;
+	struct tls_record_info *record;
+
+	record = kmalloc(sizeof(*record), GFP_KERNEL);
+	if (!record)
+		return -ENOMEM;
+
+	frag = &record->frags[0];
+	__skb_frag_set_page(frag, pfrag->page);
+	frag->page_offset = pfrag->offset;
+	skb_frag_size_set(frag, prepend_size);
+
+	get_page(pfrag->page);
+	pfrag->offset += prepend_size;
+
+	record->num_frags = 1;
+	record->len = prepend_size;
+	offload_ctx->open_record = record;
+	return 0;
+}
+
+static inline int tls_do_allocation(struct sock *sk,
+				    struct tls_offload_context *offload_ctx,
+				    struct page_frag *pfrag,
+				    size_t prepend_size)
+{
+	int ret;
+
+	if (!offload_ctx->open_record) {
+		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
+						   sk->sk_allocation))) {
+			sk->sk_prot->enter_memory_pressure(sk);
+			sk_stream_moderate_sndbuf(sk);
+			return -ENOMEM;
+		}
+
+		ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
+		if (ret)
+			return ret;
+
+		if (pfrag->size > pfrag->offset)
+			return 0;
+	}
+
+	if (!sk_page_frag_refill(sk, pfrag))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int tls_push_data(struct sock *sk,
+			 struct iov_iter *msg_iter,
+			 size_t size, int flags,
+			 unsigned char record_type)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+	struct tls_record_info *record = ctx->open_record;
+	struct page_frag *pfrag;
+	int copy, rc = 0;
+	size_t orig_size = size;
+	u32 max_open_record_len;
+	long timeo;
+	int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
+	int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
+	bool done = false;
+
+	if (flags &
+	    ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST))
+		return -ENOTSUPP;
+
+	if (sk->sk_err)
+		return -sk->sk_err;
+
+	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
+	rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo);
+	if (rc < 0)
+		return rc;
+
+	pfrag = sk_page_frag(sk);
+
+	/* TLS_HEADER_SIZE is not counted as part of the TLS record, and
+	 * we need to leave room for an authentication tag.
+	 */
+	max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
+			      tls_ctx->prepend_size;
+	do {
+		if (tls_do_allocation(sk, ctx, pfrag,
+				      tls_ctx->prepend_size)) {
+			rc = sk_stream_wait_memory(sk, &timeo);
+			if (!rc)
+				continue;
+
+			record = ctx->open_record;
+			if (!record)
+				break;
+handle_error:
+			if (record_type != TLS_RECORD_TYPE_DATA) {
+				/* avoid sending partial
+				 * record with type !=
+				 * application_data
+				 */
+				size = orig_size;
+				destroy_record(record);
+				ctx->open_record = NULL;
+			} else if (record->len > tls_ctx->prepend_size) {
+				goto last_record;
+			}
+
+			break;
+		}
+
+		record = ctx->open_record;
+		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
+		copy = min_t(size_t, copy, (max_open_record_len - record->len));
+
+		if (copy_from_iter_nocache(page_address(pfrag->page) +
+					       pfrag->offset,
+					   copy, msg_iter) != copy) {
+			rc = -EFAULT;
+			goto handle_error;
+		}
+		tls_append_frag(record, pfrag, copy);
+
+		size -= copy;
+		if (!size) {
+last_record:
+			tls_push_record_flags = flags;
+			if (more) {
+				tls_ctx->pending_open_record_frags =
+						record->num_frags;
+				break;
+			}
+
+			done = true;
+		}
+
+		if ((done) || record->len >= max_open_record_len ||
+		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
+			rc = tls_push_record(sk,
+					     tls_ctx,
+					     ctx,
+					     record,
+					     pfrag,
+					     tls_push_record_flags,
+					     record_type);
+			if (rc < 0)
+				break;
+		}
+	} while (!done);
+
+	if (orig_size - size > 0)
+		rc = orig_size - size;
+
+	return rc;
+}
+
+int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+	unsigned char record_type = TLS_RECORD_TYPE_DATA;
+	int rc = 0;
+
+	lock_sock(sk);
+
+	if (unlikely(msg->msg_controllen)) {
+		rc = tls_proccess_cmsg(sk, msg, &record_type);
+		if (rc)
+			goto out;
+	}
+
+	rc = tls_push_data(sk, &msg->msg_iter, size,
+			   msg->msg_flags, record_type);
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+int tls_device_sendpage(struct sock *sk, struct page *page,
+			int offset, size_t size, int flags)
+{
+	struct iov_iter	msg_iter;
+	struct kvec iov;
+	char *kaddr;
+	int rc = 0;
+
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		flags |= MSG_MORE;
+
+	lock_sock(sk);
+
+	if (flags & MSG_OOB) {
+		rc = -ENOTSUPP;
+		goto out;
+	}
+	kaddr = kmap(page);	/* map after the MSG_OOB check so the error path leaks no mapping */
+	iov.iov_base = kaddr + offset;
+	iov.iov_len = size;
+	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size);
+	rc = tls_push_data(sk, &msg_iter, size,
+			   flags, TLS_RECORD_TYPE_DATA);
+	kunmap(page);
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+struct tls_record_info *tls_get_record(struct tls_offload_context *context,
+				       u32 seq, u64 *p_record_sn)
+{
+	struct tls_record_info *info;
+	u64 record_sn = context->hint_record_sn;
+
+	info = context->retransmit_hint;
+	if (!info ||
+	    before(seq, info->end_seq - info->len)) {
+		/* if retransmit_hint is irrelevant, start
+		 * from the beginning of the list
+		 */
+		info = list_first_entry(&context->records_list,
+					struct tls_record_info, list);
+		record_sn = context->unacked_record_sn;
+	}
+
+	list_for_each_entry_from(info, &context->records_list, list) {
+		if (before(seq, info->end_seq)) {
+			if (!context->retransmit_hint ||
+			    after(info->end_seq,
+				  context->retransmit_hint->end_seq)) {
+				context->hint_record_sn = record_sn;
+				context->retransmit_hint = info;
+			}
+			*p_record_sn = record_sn;
+			return info;
+		}
+		record_sn++;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL(tls_get_record);
+
+static int tls_device_push_pending_record(struct sock *sk, int flags)
+{
+	struct iov_iter	msg_iter;
+
+	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
+	return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
+}
+
+int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
+{
+	u16 nonece_size, tag_size, iv_size, rec_seq_size;
+	struct tls_record_info *start_marker_record;
+	struct tls_offload_context *offload_ctx;
+	struct tls_crypto_info *crypto_info;
+	struct net_device *netdev;
+	char *iv, *rec_seq;
+	struct sk_buff *skb;
+	int rc = -EINVAL;
+	__be64 rcd_sn;
+
+	if (!ctx)
+		goto out;
+
+	if (ctx->priv_ctx) {
+		rc = -EEXIST;
+		goto out;
+	}
+
+	start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
+	if (!start_marker_record) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE, GFP_KERNEL);
+	if (!offload_ctx) {
+		rc = -ENOMEM;
+		goto free_marker_record;
+	}
+
+	crypto_info = &ctx->crypto_send;
+	switch (crypto_info->cipher_type) {
+	case TLS_CIPHER_AES_GCM_128: {
+		nonece_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
+		tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+		iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
+		iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv;
+		rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
+		rec_seq =
+		 ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq;
+		break;
+	}
+	default:
+		rc = -EINVAL;
+		goto free_offload_ctx;
+	}
+
+	ctx->prepend_size = TLS_HEADER_SIZE + nonece_size;
+	ctx->tag_size = tag_size;
+	ctx->iv_size = iv_size;
+	ctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
+			  GFP_KERNEL);
+	if (!ctx->iv) {
+		rc = -ENOMEM;
+		goto free_offload_ctx;
+	}
+
+	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
+
+	ctx->rec_seq_size = rec_seq_size;
+	ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
+	if (!ctx->rec_seq) {
+		rc = -ENOMEM;
+		goto free_iv;
+	}
+	memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
+
+	rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
+	if (rc)
+		goto free_rec_seq;
+
+	/* start at rec_seq - 1 to account for the start marker record */
+	memcpy(&rcd_sn, ctx->rec_seq, sizeof(rcd_sn));
+	offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
+
+	start_marker_record->end_seq = tcp_sk(sk)->write_seq;
+	start_marker_record->len = 0;
+	start_marker_record->num_frags = 0;
+
+	INIT_LIST_HEAD(&offload_ctx->records_list);
+	list_add_tail(&start_marker_record->list, &offload_ctx->records_list);
+	spin_lock_init(&offload_ctx->lock);
+
+	static_branch_inc(&clean_acked_data_enabled);
+	inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
+	ctx->push_pending_record = tls_device_push_pending_record;
+	offload_ctx->sk_destruct = sk->sk_destruct;
+
+	/* TLS offload is greatly simplified if we don't send
+	 * SKBs where only part of the payload needs to be encrypted.
+	 * So mark the last skb in the write queue as end of record.
+	 */
+	skb = tcp_write_queue_tail(sk);
+	if (skb)
+		TCP_SKB_CB(skb)->eor = 1;
+
+	refcount_set(&ctx->refcount, 1);
+
+	/* We support starting offload on multiple sockets
+	 * concurrently, so we only need a read lock here.
+	 * This lock must precede get_netdev_for_sock to prevent races between
+	 * NETDEV_DOWN and setsockopt.
+	 */
+	down_read(&device_offload_lock);
+	netdev = get_netdev_for_sock(sk);
+	if (!netdev) {
+		pr_err_ratelimited("%s: netdev not found\n", __func__);
+		rc = -EINVAL;
+		goto release_lock;
+	}
+
+	if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
+		rc = -ENOTSUPP;
+		goto release_netdev;
+	}
+
+	/* Avoid offloading if the device is down
+	 * We don't want to offload new flows after
+	 * the NETDEV_DOWN event
+	 */
+	if (!(netdev->flags & IFF_UP)) {
+		rc = -EINVAL;
+		goto release_netdev;
+	}
+
+	ctx->priv_ctx = offload_ctx;
+	rc = attach_sock_to_netdev(sk, netdev, ctx);
+	if (rc)
+		goto release_netdev;
+
+	ctx->netdev = netdev;
+
+	spin_lock_irq(&tls_device_lock);
+	list_add_tail(&ctx->list, &tls_device_list);
+	spin_unlock_irq(&tls_device_lock);
+
+	sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
+	/* following this assignment tls_is_sk_tx_device_offloaded
+	 * will return true and the context might be accessed
+	 * by the netdev's xmit function.
+	 */
+	smp_store_release(&sk->sk_destruct,
+			  &tls_device_sk_destruct);
+	up_read(&device_offload_lock);
+	goto out;
+
+release_netdev:
+	dev_put(netdev);
+release_lock:
+	up_read(&device_offload_lock);
+	static_branch_dec(&clean_acked_data_enabled);
+	crypto_free_aead(offload_ctx->aead_send);
+free_rec_seq:
+	kfree(ctx->rec_seq);
+free_iv:
+	kfree(ctx->iv);
+free_offload_ctx:
+	kfree(offload_ctx);
+	ctx->priv_ctx = NULL;
+free_marker_record:
+	kfree(start_marker_record);
+out:
+	return rc;
+}
+
+static int tls_device_api_check(struct net_device *dev)
+{
+	if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
+		return NOTIFY_BAD;
+
+	return NOTIFY_DONE;
+}
+
+static int tls_device_down(struct net_device *netdev)
+{
+	struct tls_context *ctx, *tmp;
+	struct list_head list;
+	unsigned long flags;
+
+	if (!(netdev->features & NETIF_F_HW_TLS_TX))
+		return NOTIFY_DONE;
+
+	INIT_LIST_HEAD(&list);
+
+	/* Request a write lock to block new offload attempts
+	 */
+	down_write(&device_offload_lock);
+
+	spin_lock_irqsave(&tls_device_lock, flags);
+	list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) {
+		if (ctx->netdev != netdev ||
+		    !refcount_inc_not_zero(&ctx->refcount))
+			continue;
+
+		list_move(&ctx->list, &list);
+	}
+	spin_unlock_irqrestore(&tls_device_lock, flags);
+
+	list_for_each_entry_safe(ctx, tmp, &list, list)	{
+		netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
+						TLS_OFFLOAD_CTX_DIR_TX);
+		ctx->netdev = NULL;
+		dev_put(netdev);
+		list_del_init(&ctx->list);
+
+		if (refcount_dec_and_test(&ctx->refcount))
+			tls_device_free_ctx(ctx);
+	}
+
+	up_write(&device_offload_lock);
+
+	flush_work(&tls_device_gc_work);
+
+	return NOTIFY_DONE;
+}
+
+static int tls_dev_event(struct notifier_block *this, unsigned long event,
+			 void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+	switch (event) {
+	case NETDEV_REGISTER:
+	case NETDEV_FEAT_CHANGE:
+		return tls_device_api_check(dev);
+	case NETDEV_DOWN:
+		return tls_device_down(dev);
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block tls_dev_notifier = {
+	.notifier_call	= tls_dev_event,
+};
+
+void __init tls_device_init(void)
+{
+	register_netdevice_notifier(&tls_dev_notifier);
+}
+
+void __exit tls_device_cleanup(void)
+{
+	unregister_netdevice_notifier(&tls_dev_notifier);
+	flush_work(&tls_device_gc_work);
+}
diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
new file mode 100644
index 000000000000..843c7331cfc4
--- /dev/null
+++ b/net/tls/tls_device_fallback.c
@@ -0,0 +1,415 @@
+/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <net/tls.h>
+#include <crypto/aead.h>
+#include <crypto/scatterwalk.h>
+#include <net/ip6_checksum.h>
+
+static void chain_to_walk(struct scatterlist *sg, struct scatter_walk *walk)
+{
+	struct scatterlist *src = walk->sg;
+	int diff = walk->offset - src->offset;
+
+	sg_set_page(sg, sg_page(src),
+		    src->length - diff, walk->offset);
+
+	scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
+}
+
+static int tls_enc_record(struct aead_request *aead_req,
+			  struct crypto_aead *aead, char *aad, char *iv,
+			  __be64 rcd_sn, struct scatter_walk *in,
+			  struct scatter_walk *out, int *in_len)
+{
+	struct scatterlist sg_in[3];
+	struct scatterlist sg_out[3];
+	unsigned char buf[TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE];
+	u16 len;
+	int rc;
+
+	len = min_t(int, *in_len, ARRAY_SIZE(buf));
+
+	scatterwalk_copychunks(buf, in, len, 0);
+	scatterwalk_copychunks(buf, out, len, 1);
+
+	*in_len -= len;
+	if (!*in_len)
+		return 0;
+
+	scatterwalk_pagedone(in, 0, 1);
+	scatterwalk_pagedone(out, 1, 1);
+
+	len = buf[4] | (buf[3] << 8);
+	len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
+
+	tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
+		     (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
+
+	memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf + TLS_HEADER_SIZE,
+	       TLS_CIPHER_AES_GCM_128_IV_SIZE);
+
+	sg_init_table(sg_in, ARRAY_SIZE(sg_in));
+	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
+	sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
+	sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
+	chain_to_walk(sg_in + 1, in);
+	chain_to_walk(sg_out + 1, out);
+
+	*in_len -= len;
+	if (*in_len < 0) {
+		*in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+		if (*in_len < 0)
+		/* The input buffer doesn't contain the entire record.
+		 * Trim len accordingly. The resulting authentication tag
+		 * will contain garbage, but we don't care as we won't
+		 * include any of it in the output skb.
+		 * Note that we assume the output buffer length
+		 * is larger than input buffer length + tag size.
+		 */
+			len += *in_len;
+
+		*in_len = 0;
+	}
+
+	if (*in_len) {
+		scatterwalk_copychunks(NULL, in, len, 2);
+		scatterwalk_pagedone(in, 0, 1);
+		scatterwalk_copychunks(NULL, out, len, 2);
+		scatterwalk_pagedone(out, 1, 1);
+	}
+
+	len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+	aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
+
+	rc = crypto_aead_encrypt(aead_req);
+
+	return rc;
+}
+
+static void tls_init_aead_request(struct aead_request *aead_req,
+				  struct crypto_aead *aead)
+{
+	aead_request_set_tfm(aead_req, aead);
+	aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
+}
+
+static struct aead_request *tls_alloc_aead_request(struct crypto_aead *aead,
+						   gfp_t flags)
+{
+	unsigned int req_size = sizeof(struct aead_request) +
+		crypto_aead_reqsize(aead);
+	struct aead_request *aead_req;
+
+	aead_req = kzalloc(req_size, flags);
+	if (!aead_req)
+		return NULL;
+
+	tls_init_aead_request(aead_req, aead);
+	return aead_req;
+}
+
+static int tls_enc_records(struct aead_request *aead_req,
+			   struct crypto_aead *aead, struct scatterlist *sg_in,
+			   struct scatterlist *sg_out, char *aad, char *iv,
+			   u64 rcd_sn, int len)
+{
+	struct scatter_walk in;
+	struct scatter_walk out;
+	int rc;
+
+	scatterwalk_start(&in, sg_in);
+	scatterwalk_start(&out, sg_out);
+
+	do {
+		rc = tls_enc_record(aead_req, aead, aad, iv,
+				    cpu_to_be64(rcd_sn), &in, &out, &len);
+		rcd_sn++;
+
+	} while (rc == 0 && len);
+
+	scatterwalk_done(&in, 0, 0);
+	scatterwalk_done(&out, 1, 0);
+
+	return rc;
+}
+
+static inline void update_chksum(struct sk_buff *skb, int headln)
+{
+	/* Can't use icsk->icsk_af_ops->send_check here because the ip addresses
+	 * might have been changed by NAT.
+	 */
+
+	const struct ipv6hdr *ipv6h;
+	const struct iphdr *iph;
+	struct tcphdr *th = tcp_hdr(skb);
+	int datalen = skb->len - headln;
+
+	/* We only changed the payload so if we are using partial we don't
+	 * need to update anything.
+	 */
+	if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
+		return;
+
+	skb->ip_summed = CHECKSUM_PARTIAL;
+	skb->csum_start = skb_transport_header(skb) - skb->head;
+	skb->csum_offset = offsetof(struct tcphdr, check);
+
+	if (skb->sk->sk_family == AF_INET6) {
+		ipv6h = ipv6_hdr(skb);
+		th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
+					     datalen, IPPROTO_TCP, 0);
+	} else {
+		iph = ip_hdr(skb);
+		th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
+					       IPPROTO_TCP, 0);
+	}
+}
+
+static void complete_skb(struct sk_buff *nskb, struct sk_buff *skb, int headln)
+{
+	skb_copy_header(nskb, skb);
+
+	skb_put(nskb, skb->len);
+	memcpy(nskb->data, skb->data, headln);
+	update_chksum(nskb, headln);
+
+	nskb->destructor = skb->destructor;
+	nskb->sk = skb->sk;
+	skb->destructor = NULL;
+	skb->sk = NULL;
+	refcount_add(nskb->truesize - skb->truesize,
+		     &nskb->sk->sk_wmem_alloc);
+}
+
+/* This function may be called after the user socket is already
+ * closed so make sure we don't use anything freed during
+ * tls_sk_proto_close here
+ */
+static struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb)
+{
+	int tcp_header_size = tcp_hdrlen(skb);
+	int tcp_payload_offset = skb_transport_offset(skb) + tcp_header_size;
+	int payload_len = skb->len - tcp_payload_offset;
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+	int remaining, buf_len, resync_sgs, rc, i = 0;
+	void *buf, *dummy_buf, *iv, *aad;
+	struct scatterlist *sg_in;
+	struct scatterlist sg_out[3];
+	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
+	struct aead_request *aead_req;
+	struct sk_buff *nskb = NULL;
+	struct tls_record_info *record;
+	unsigned long flags;
+	s32 sync_size;
+	u64 rcd_sn;
+
+	/* worst case is:
+	 * MAX_SKB_FRAGS in tls_record_info
+	 * MAX_SKB_FRAGS + 1 in SKB head and frags.
+	 */
+	int sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
+
+	if (!payload_len)
+		return skb;
+
+	sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in), GFP_ATOMIC);
+	if (!sg_in)
+		goto free_orig;
+
+	sg_init_table(sg_in, sg_in_max_elements);
+	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
+
+	spin_lock_irqsave(&ctx->lock, flags);
+	record = tls_get_record(ctx, tcp_seq, &rcd_sn);
+	if (!record) {
+		spin_unlock_irqrestore(&ctx->lock, flags);
+		WARN(1, "Record not found for seq %u\n", tcp_seq);
+		goto free_sg;
+	}
+
+	sync_size = tcp_seq - tls_record_start_seq(record);
+	if (sync_size < 0) {
+		int is_start_marker = tls_record_is_start_marker(record);
+
+		spin_unlock_irqrestore(&ctx->lock, flags);
+		if (!is_start_marker)
+		/* This should only occur if the relevant record was
+		 * already acked. In that case it should be ok
+		 * to drop the packet and avoid retransmission.
+		 *
+		 * There is a corner case where the packet contains
+		 * both an acked and a non-acked record.
+		 * We currently don't handle that case and rely
+		 * on TCP to retransmit a packet that doesn't contain
+		 * already acked payload.
+		 */
+			goto free_orig;
+
+		if (payload_len > -sync_size) {
+			WARN(1, "Fallback of partially offloaded packets is not supported\n");
+			goto free_sg;
+		} else {
+			return skb;
+		}
+	}
+
+	remaining = sync_size;
+	while (remaining > 0) {
+		skb_frag_t *frag = &record->frags[i];
+
+		__skb_frag_ref(frag);
+		sg_set_page(sg_in + i, skb_frag_page(frag),
+			    skb_frag_size(frag), frag->page_offset);
+
+		remaining -= skb_frag_size(frag);
+
+		if (remaining < 0)
+			sg_in[i].length += remaining;
+
+		i++;
+	}
+	spin_unlock_irqrestore(&ctx->lock, flags);
+	resync_sgs = i;
+
+	aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC);
+	if (!aead_req)
+		goto put_sg;
+
+	buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
+		  TLS_CIPHER_AES_GCM_128_IV_SIZE +
+		  TLS_AAD_SPACE_SIZE +
+		  sync_size +
+		  tls_ctx->tag_size;
+	buf = kmalloc(buf_len, GFP_ATOMIC);
+	if (!buf)
+		goto free_req;
+
+	nskb = alloc_skb(skb_headroom(skb) + skb->len, GFP_ATOMIC);
+	if (!nskb)
+		goto free_buf;
+
+	skb_reserve(nskb, skb_headroom(skb));
+
+	iv = buf;
+
+	memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
+	       TLS_CIPHER_AES_GCM_128_SALT_SIZE);
+	aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
+	      TLS_CIPHER_AES_GCM_128_IV_SIZE;
+	dummy_buf = aad + TLS_AAD_SPACE_SIZE;
+
+	sg_set_buf(&sg_out[0], dummy_buf, sync_size);
+	sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset,
+		   payload_len);
+	/* Add room for authentication tag produced by crypto */
+	dummy_buf += sync_size;
+	sg_set_buf(&sg_out[2], dummy_buf, tls_ctx->tag_size);
+	rc = skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset,
+			  payload_len);
+	if (rc < 0)
+		goto free_nskb;
+
+	rc = tls_enc_records(aead_req, ctx->aead_send, sg_in, sg_out, aad, iv,
+			     rcd_sn, sync_size + payload_len);
+	if (rc < 0)
+		goto free_nskb;
+
+	complete_skb(nskb, skb, tcp_payload_offset);
+
+	/* validate_xmit_skb_list assumes that if the skb wasn't segmented
+	 * nskb->prev will point to the skb itself
+	 */
+	nskb->prev = nskb;
+free_buf:
+	kfree(buf);
+free_req:
+	kfree(aead_req);
+put_sg:
+	for (i = 0; i < resync_sgs; i++)
+		put_page(sg_page(&sg_in[i]));
+free_sg:
+	kfree(sg_in);
+free_orig:
+	kfree_skb(skb);
+	return nskb;
+
+free_nskb:
+	kfree_skb(nskb);
+	nskb = NULL;
+	goto free_buf;
+}
+
+struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
+				      struct net_device *dev,
+				      struct sk_buff *skb)
+{
+	if (dev == tls_get_ctx(sk)->netdev)
+		return skb;
+
+	return tls_sw_fallback(sk, skb);
+}
+
+int tls_sw_fallback_init(struct sock *sk,
+			 struct tls_offload_context *offload_ctx,
+			 struct tls_crypto_info *crypto_info)
+{
+	int rc;
+	const u8 *key;
+
+	offload_ctx->aead_send =
+	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(offload_ctx->aead_send)) {
+		rc = PTR_ERR(offload_ctx->aead_send);
+		pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n", rc);
+		offload_ctx->aead_send = NULL;
+		goto err_out;
+	}
+
+	key = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->key;
+
+	rc = crypto_aead_setkey(offload_ctx->aead_send, key,
+				TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+	if (rc)
+		goto free_aead;
+
+	rc = crypto_aead_setauthsize(offload_ctx->aead_send,
+				     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
+	if (rc)
+		goto free_aead;
+
+	return 0;
+free_aead:
+	crypto_free_aead(offload_ctx->aead_send);
+err_out:
+	return rc;
+}
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index d824d548447e..e0dface33017 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -54,6 +54,9 @@ enum {
 enum {
 	TLS_BASE_TX,
 	TLS_SW_TX,
+#ifdef CONFIG_TLS_DEVICE
+	TLS_HW_TX,
+#endif
 	TLS_NUM_CONFIG,
 };
 
@@ -416,11 +419,19 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval,
 		goto err_crypto_info;
 	}
 
-	/* currently SW is default, we will have ethtool in future */
-	rc = tls_set_sw_offload(sk, ctx);
-	tx_conf = TLS_SW_TX;
-	if (rc)
-		goto err_crypto_info;
+#ifdef CONFIG_TLS_DEVICE
+	rc = tls_set_device_offload(sk, ctx);
+	tx_conf = TLS_HW_TX;
+	if (rc) {
+#else
+	{
+#endif
+		/* if HW offload fails fallback to SW */
+		rc = tls_set_sw_offload(sk, ctx);
+		tx_conf = TLS_SW_TX;
+		if (rc)
+			goto err_crypto_info;
+	}
 
 	ctx->tx_conf = tx_conf;
 	update_sk_prot(sk, ctx);
@@ -473,6 +484,12 @@ static void build_protos(struct proto *prot, struct proto *base)
 	prot[TLS_SW_TX] = prot[TLS_BASE_TX];
 	prot[TLS_SW_TX].sendmsg		= tls_sw_sendmsg;
 	prot[TLS_SW_TX].sendpage	= tls_sw_sendpage;
+
+#ifdef CONFIG_TLS_DEVICE
+	prot[TLS_HW_TX] = prot[TLS_SW_TX];
+	prot[TLS_HW_TX].sendmsg		= tls_device_sendmsg;
+	prot[TLS_HW_TX].sendpage	= tls_device_sendpage;
+#endif
 }
 
 static int tls_init(struct sock *sk)
@@ -531,6 +548,9 @@ static int __init tls_register(void)
 {
 	build_protos(tls_prots[TLSV4], &tcp_prot);
 
+#ifdef CONFIG_TLS_DEVICE
+	tls_device_init();
+#endif
 	tcp_register_ulp(&tcp_tls_ulp_ops);
 
 	return 0;
@@ -539,6 +559,9 @@ static int __init tls_register(void)
 static void __exit tls_unregister(void)
 {
 	tcp_unregister_ulp(&tcp_tls_ulp_ops);
+#ifdef CONFIG_TLS_DEVICE
+	tls_device_cleanup();
+#endif
 }
 
 module_init(tls_register);
-- 
2.14.3
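
Note on the resync loop in tls_sw_fallback() above: it walks the record's
fragments until sync_size bytes are covered and shortens the last
scatterlist entry if it overshoots. A minimal user-space sketch of that
trimming step follows. The names struct seg and cover_prefix() are
hypothetical stand-ins for illustration only, not kernel APIs, and the
sketch leaves out the page refcounting and locking that the patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one scatterlist entry / skb fragment. */
struct seg {
	size_t offset;
	size_t length;
};

/*
 * Copy entries from 'frags' into 'out' until 'want' bytes are covered,
 * shortening the last copied entry if it would overshoot.
 * Returns the number of entries written to 'out'.
 */
static size_t cover_prefix(const struct seg *frags, size_t nfrags,
			   size_t want, struct seg *out)
{
	size_t i = 0;

	while (want > 0) {
		assert(i < nfrags);	/* caller must supply enough data */
		out[i] = frags[i];
		if (out[i].length > want)
			out[i].length = want;	/* trim the overshoot */
		want -= out[i].length;
		i++;
	}
	return i;
}
```

With fragments of 100, 200 and 300 bytes and want = 250, the sketch uses
two entries and trims the second one to 150 bytes. The patch reaches the
same result by letting a signed counter go negative and adding it back to
the last entry's length.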

* [PATCH V2 net-next 08/14] net/mlx5e: Move defines out of ipsec code
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

The defines are not IPSEC specific.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h             | 3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h | 3 ---
 drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c     | 5 +----
 drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h       | 2 ++
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 4c9360b25532..6660986285bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,9 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
+#define MLX5E_METADATA_ETHER_LEN 8
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
index 1198fc1eba4c..93bf10e6508c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
@@ -45,9 +45,6 @@
 #define MLX5E_IPSEC_SADB_RX_BITS 10
 #define MLX5E_IPSEC_ESN_SCOPE_MID 0x80000000L
 
-#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
-#define MLX5E_METADATA_ETHER_LEN 8
-
 struct mlx5e_priv;
 
 struct mlx5e_ipsec_sw_stats {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
index 4f1568528738..a6b672840e34 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
@@ -43,9 +43,6 @@
 #include "fpga/sdk.h"
 #include "fpga/core.h"
 
-#define SBU_QP_QUEUE_SIZE 8
-#define MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC	(60 * 1000)
-
 enum mlx5_fpga_ipsec_cmd_status {
 	MLX5_FPGA_IPSEC_CMD_PENDING,
 	MLX5_FPGA_IPSEC_CMD_SEND_FAIL,
@@ -258,7 +255,7 @@ static int mlx5_fpga_ipsec_cmd_wait(void *ctx)
 {
 	struct mlx5_fpga_ipsec_cmd_context *context = ctx;
 	unsigned long timeout =
-		msecs_to_jiffies(MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC);
+		msecs_to_jiffies(MLX5_FPGA_CMD_TIMEOUT_MSEC);
 	int res;
 
 	res = wait_for_completion_timeout(&context->complete, timeout);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
index baa537e54a49..a0573cc2fc9b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
@@ -41,6 +41,8 @@
  * DOC: Innova SDK
  * This header defines the in-kernel API for Innova FPGA client drivers.
  */
+#define SBU_QP_QUEUE_SIZE 8
+#define MLX5_FPGA_CMD_TIMEOUT_MSEC (60 * 1000)
 
 enum mlx5_fpga_access_type {
 	MLX5_FPGA_ACCESS_TYPE_I2C = 0x0,
-- 
2.14.3

* [PATCH V2 net-next 10/14] net/mlx5e: TLS, Add Innova TLS TX support
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Expose tlsdev_ops to work with the TLS generic NIC offload
infrastructure.
The NETIF_F_HW_TLS_TX capability will be added in the next patch.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |  11 ++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 173 +++++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  65 ++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 +
 5 files changed, 254 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 25deaa5a534c..6befd2c381b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -85,3 +85,14 @@ config MLX5_EN_IPSEC
 	  Build support for IPsec cryptography-offload accelaration in the NIC.
 	  Note: Support for hardware with this capability needs to be selected
 	  for this option to become available.
+
+config MLX5_EN_TLS
+	bool "TLS cryptography-offload acceleration"
+	depends on MLX5_CORE_EN
+	depends on TLS_DEVICE
+	depends on MLX5_ACCEL
+	default n
+	---help---
+	  Build support for TLS cryptography-offload acceleration in the NIC.
+	  Note: Support for hardware with this capability needs to be selected
+	  for this option to become available.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9989e5265a45..50872ed30c0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,4 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
 		en_accel/ipsec_stats.o
 
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
new file mode 100644
index 000000000000..38d88108a55a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -0,0 +1,173 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/netdevice.h>
+#include <net/ipv6.h>
+#include "en_accel/tls.h"
+#include "accel/tls.h"
+
+static void mlx5e_tls_set_ipv4_flow(void *flow, struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	MLX5_SET(tls_flow, flow, ipv6, 0);
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
+	       &inet->inet_daddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, src_ipv4_src_ipv6.ipv4_layout.ipv4),
+	       &inet->inet_rcv_saddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static void mlx5e_tls_set_ipv6_flow(void *flow, struct sock *sk)
+{
+	struct ipv6_pinfo *np = inet6_sk(sk);
+
+	MLX5_SET(tls_flow, flow, ipv6, 1);
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
+	       &sk->sk_v6_daddr, MLX5_FLD_SZ_BYTES(ipv6_layout, ipv6));
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, src_ipv4_src_ipv6.ipv6_layout.ipv6),
+	       &np->saddr, MLX5_FLD_SZ_BYTES(ipv6_layout, ipv6));
+}
+#endif
+
+static void mlx5e_tls_set_flow_tcp_ports(void *flow, struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, src_port), &inet->inet_sport,
+	       MLX5_FLD_SZ_BYTES(tls_flow, src_port));
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_port), &inet->inet_dport,
+	       MLX5_FLD_SZ_BYTES(tls_flow, dst_port));
+}
+
+static int mlx5e_tls_set_flow(void *flow, struct sock *sk, u32 caps)
+{
+	switch (sk->sk_family) {
+	case AF_INET:
+		mlx5e_tls_set_ipv4_flow(flow, sk);
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		if (!sk->sk_ipv6only &&
+		    ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED) {
+			mlx5e_tls_set_ipv4_flow(flow, sk);
+			break;
+		}
+		if (!(caps & MLX5_ACCEL_TLS_IPV6))
+			goto error_out;
+
+		mlx5e_tls_set_ipv6_flow(flow, sk);
+		break;
+#endif
+	default:
+		goto error_out;
+	}
+
+	mlx5e_tls_set_flow_tcp_ports(flow, sk);
+	return 0;
+error_out:
+	return -EINVAL;
+}
+
+static int mlx5e_tls_add(struct net_device *netdev, struct sock *sk,
+			 enum tls_offload_ctx_dir direction,
+			 struct tls_crypto_info *crypto_info,
+			 u32 start_offload_tcp_sn)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct mlx5_core_dev *mdev = priv->mdev;
+	u32 caps = mlx5_accel_tls_device_caps(mdev);
+	int ret = -ENOMEM;
+	void *flow;
+
+	if (direction != TLS_OFFLOAD_CTX_DIR_TX)
+		return -EINVAL;
+
+	flow = kzalloc(MLX5_ST_SZ_BYTES(tls_flow), GFP_KERNEL);
+	if (!flow)
+		return ret;
+
+	ret = mlx5e_tls_set_flow(flow, sk, caps);
+	if (ret)
+		goto free_flow;
+
+	if (direction == TLS_OFFLOAD_CTX_DIR_TX) {
+		struct mlx5e_tls_offload_context *tx_ctx =
+		    mlx5e_get_tls_tx_context(tls_ctx);
+		u32 swid;
+
+		ret = mlx5_accel_tls_add_tx_flow(mdev, flow, crypto_info,
+						 start_offload_tcp_sn, &swid);
+		if (ret < 0)
+			goto free_flow;
+
+		tx_ctx->swid = htonl(swid);
+		tx_ctx->expected_seq = start_offload_tcp_sn;
+	}
+
+	return 0;
+free_flow:
+	kfree(flow);
+	return ret;
+}
+
+static void mlx5e_tls_del(struct net_device *netdev,
+			  struct tls_context *tls_ctx,
+			  enum tls_offload_ctx_dir direction)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+
+	if (direction == TLS_OFFLOAD_CTX_DIR_TX) {
+		u32 swid = ntohl(mlx5e_get_tls_tx_context(tls_ctx)->swid);
+
+		mlx5_accel_tls_del_tx_flow(priv->mdev, swid);
+	} else {
+		netdev_err(netdev, "unsupported direction %d\n", direction);
+	}
+}
+
+static const struct tlsdev_ops mlx5e_tls_ops = {
+	.tls_dev_add = mlx5e_tls_add,
+	.tls_dev_del = mlx5e_tls_del,
+};
+
+void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
+{
+	struct net_device *netdev = priv->netdev;
+
+	if (!mlx5_accel_is_tls_device(priv->mdev))
+		return;
+
+	netdev->tlsdev_ops = &mlx5e_tls_ops;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
new file mode 100644
index 000000000000..f7216b9b98e2
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -0,0 +1,65 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#ifndef __MLX5E_TLS_H__
+#define __MLX5E_TLS_H__
+
+#ifdef CONFIG_MLX5_EN_TLS
+
+#include <net/tls.h>
+#include "en.h"
+
+struct mlx5e_tls_offload_context {
+	struct tls_offload_context base;
+	u32 expected_seq;
+	__be32 swid;
+};
+
+static inline struct mlx5e_tls_offload_context *
+mlx5e_get_tls_tx_context(struct tls_context *tls_ctx)
+{
+	BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context) >
+		     TLS_OFFLOAD_CONTEXT_SIZE);
+	return container_of(tls_offload_ctx(tls_ctx),
+			    struct mlx5e_tls_offload_context,
+			    base);
+}
+
+void mlx5e_tls_build_netdev(struct mlx5e_priv *priv);
+
+#else
+
+static inline void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { }
+
+#endif
+
+#endif /* __MLX5E_TLS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index da94c8cba5ee..8dbe058da178 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -41,7 +41,9 @@
 #include "en_rep.h"
 #include "en_accel/ipsec.h"
 #include "en_accel/ipsec_rxtx.h"
+#include "en_accel/tls.h"
 #include "accel/ipsec.h"
+#include "accel/tls.h"
 #include "vxlan.h"
 
 struct mlx5e_rq_param {
@@ -4181,6 +4183,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 #endif
 
 	mlx5e_ipsec_build_netdev(priv);
+	mlx5e_tls_build_netdev(priv);
 }
 
 static void mlx5e_create_q_counter(struct mlx5e_priv *priv)
-- 
2.14.3
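
Note on mlx5e_get_tls_tx_context() above: the driver embeds the generic
offload context as the first member of its own structure, checks at build
time that the result fits in the reserved area, and recovers the outer
structure with container_of(). A minimal user-space sketch of that pattern
follows. All names here (CTX_AREA_SIZE, struct base_ctx, struct drv_ctx,
to_drv_ctx) are hypothetical stand-ins for illustration, and the size
limit of 64 is an assumed value, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size of the reserved area; stands in for TLS_OFFLOAD_CONTEXT_SIZE. */
#define CTX_AREA_SIZE 64

/* Stands in for the generic struct tls_offload_context. */
struct base_ctx {
	int generic_state;
};

/* Stands in for the driver's struct mlx5e_tls_offload_context. */
struct drv_ctx {
	struct base_ctx base;
	unsigned int expected_seq;
	unsigned int swid;
};

/* Compile-time size check, playing the role of BUILD_BUG_ON in the patch. */
_Static_assert(sizeof(struct drv_ctx) <= CTX_AREA_SIZE,
	       "driver context must fit in the reserved area");

/* Recover the driver context from a pointer to its embedded base member. */
static struct drv_ctx *to_drv_ctx(struct base_ctx *base)
{
	assert(base != NULL);
	return (struct drv_ctx *)((char *)base - offsetof(struct drv_ctx, base));
}
```

Passing the address of a struct drv_ctx's base member to to_drv_ctx()
returns a pointer to the enclosing struct drv_ctx, so driver-private
fields such as expected_seq and swid stay reachable from the generic
pointer without a separate allocation.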

* [PATCH V2 net-next 09/14] net/mlx5: Accel, Add TLS tx offload interface
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Add routines for manipulating TLS TX offload contexts.

In Innova TLS, TLS contexts are added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.

Add implementation for Innova TLS (FPGA-based) hardware.

These routines will be used by the TLS offload support in a later patch.

mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs
to work directly with mlx5_core rather than Innova FPGA or other mlx5
acceleration providers.

In the future, when IPSec/TLS or any other acceleration gets integrated
into the ConnectX chip, the mlx5/accel layer will provide the integrated
acceleration, rather than the Innova one.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   4 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c    |  71 +++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h    |  86 ++++
 .../net/ethernet/mellanox/mlx5/core/fpga/core.h    |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 563 +++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  68 +++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  11 +
 include/linux/mlx5/mlx5_ifc.h                      |  16 -
 include/linux/mlx5/mlx5_ifc_fpga.h                 |  77 +++
 9 files changed, 879 insertions(+), 18 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index c805769d92a9..9989e5265a45 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -8,10 +8,10 @@ mlx5_core-y :=	main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 		fs_counters.o rl.o lag.o dev.o wq.o lib/gid.o lib/clock.o \
 		diag/fs_tracepoint.o
 
-mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o
+mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o accel/tls.o
 
 mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o fpga/conn.o fpga/sdk.o \
-		fpga/ipsec.o
+		fpga/ipsec.o fpga/tls.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
 		en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o \
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
new file mode 100644
index 000000000000..77ac19f38cbe
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/mlx5/device.h>
+
+#include "accel/tls.h"
+#include "mlx5_core.h"
+#include "fpga/tls.h"
+
+int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			       struct tls_crypto_info *crypto_info,
+			       u32 start_offload_tcp_sn, u32 *p_swid)
+{
+	return mlx5_fpga_tls_add_tx_flow(mdev, flow, crypto_info,
+					 start_offload_tcp_sn, p_swid);
+}
+
+void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid)
+{
+	mlx5_fpga_tls_del_tx_flow(mdev, swid, GFP_KERNEL);
+}
+
+bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev)
+{
+	return mlx5_fpga_is_tls_device(mdev);
+}
+
+u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev)
+{
+	return mlx5_fpga_tls_device_caps(mdev);
+}
+
+int mlx5_accel_tls_init(struct mlx5_core_dev *mdev)
+{
+	return mlx5_fpga_tls_init(mdev);
+}
+
+void mlx5_accel_tls_cleanup(struct mlx5_core_dev *mdev)
+{
+	mlx5_fpga_tls_cleanup(mdev);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
new file mode 100644
index 000000000000..6f9c9f446ecc
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
@@ -0,0 +1,86 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5_ACCEL_TLS_H__
+#define __MLX5_ACCEL_TLS_H__
+
+#include <linux/mlx5/driver.h>
+#include <linux/tls.h>
+
+#ifdef CONFIG_MLX5_ACCEL
+
+enum {
+	MLX5_ACCEL_TLS_TX = BIT(0),
+	MLX5_ACCEL_TLS_RX = BIT(1),
+	MLX5_ACCEL_TLS_V12 = BIT(2),
+	MLX5_ACCEL_TLS_V13 = BIT(3),
+	MLX5_ACCEL_TLS_LRO = BIT(4),
+	MLX5_ACCEL_TLS_IPV6 = BIT(5),
+	MLX5_ACCEL_TLS_AES_GCM128 = BIT(30),
+	MLX5_ACCEL_TLS_AES_GCM256 = BIT(31),
+};
+
+struct mlx5_ifc_tls_flow_bits {
+	u8         src_port[0x10];
+	u8         dst_port[0x10];
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits src_ipv4_src_ipv6;
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits dst_ipv4_dst_ipv6;
+	u8         ipv6[0x1];
+	u8         direction_sx[0x1];
+	u8         reserved_at_2[0x1e];
+};
+
+int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			       struct tls_crypto_info *crypto_info,
+			       u32 start_offload_tcp_sn, u32 *p_swid);
+void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid);
+bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev);
+u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev);
+int mlx5_accel_tls_init(struct mlx5_core_dev *mdev);
+void mlx5_accel_tls_cleanup(struct mlx5_core_dev *mdev);
+
+#else
+
+static inline int
+mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			   struct tls_crypto_info *crypto_info,
+			   u32 start_offload_tcp_sn, u32 *p_swid) { return 0; }
+static inline void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid) { }
+static inline bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev) { return false; }
+static inline u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev) { return 0; }
+static inline int mlx5_accel_tls_init(struct mlx5_core_dev *mdev) { return 0; }
+static inline void mlx5_accel_tls_cleanup(struct mlx5_core_dev *mdev) { }
+
+#endif
+
+#endif	/* __MLX5_ACCEL_TLS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
index 82405ed84725..3e2355c8df3f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
@@ -53,6 +53,7 @@ struct mlx5_fpga_device {
 	} conn_res;
 
 	struct mlx5_fpga_ipsec *ipsec;
+	struct mlx5_fpga_tls *tls;
 };
 
 #define mlx5_fpga_dbg(__adev, format, ...) \
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
new file mode 100644
index 000000000000..47f8b0d579e2
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
@@ -0,0 +1,563 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/mlx5/device.h>
+#include "fpga/tls.h"
+#include "fpga/cmd.h"
+#include "fpga/sdk.h"
+#include "fpga/core.h"
+#include "accel/tls.h"
+
+struct mlx5_fpga_tls_command_context;
+
+typedef void (*mlx5_fpga_tls_command_complete)
+	(struct mlx5_fpga_conn *conn, struct mlx5_fpga_device *fdev,
+	 struct mlx5_fpga_tls_command_context *ctx,
+	 struct mlx5_fpga_dma_buf *resp);
+
+struct mlx5_fpga_tls_command_context {
+	struct list_head list;
+	/* There is no guarantee on the order between the TX completion
+	 * and the command response.
+	 * The TX completion is going to touch cmd->buf even in
+	 * the case of successful transmission.
+	 * So instead of requiring separate allocations for cmd
+	 * and cmd->buf we've decided to use a reference counter
+	 */
+	refcount_t ref;
+	struct mlx5_fpga_dma_buf buf;
+	mlx5_fpga_tls_command_complete complete;
+};
+
+static inline void
+mlx5_fpga_tls_put_command_ctx(struct mlx5_fpga_tls_command_context *ctx)
+{
+	if (refcount_dec_and_test(&ctx->ref))
+		kfree(ctx);
+}
+
+static void mlx5_fpga_tls_cmd_complete(struct mlx5_fpga_device *fdev,
+				       struct mlx5_fpga_dma_buf *resp)
+{
+	struct mlx5_fpga_conn *conn = fdev->tls->conn;
+	struct mlx5_fpga_tls_command_context *ctx;
+	struct mlx5_fpga_tls *tls = fdev->tls;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tls->pending_cmds_lock, flags);
+	ctx = list_first_entry(&tls->pending_cmds,
+			       struct mlx5_fpga_tls_command_context, list);
+	list_del(&ctx->list);
+	spin_unlock_irqrestore(&tls->pending_cmds_lock, flags);
+	ctx->complete(conn, fdev, ctx, resp);
+}
+
+static void mlx5_fpga_cmd_send_complete(struct mlx5_fpga_conn *conn,
+					struct mlx5_fpga_device *fdev,
+					struct mlx5_fpga_dma_buf *buf,
+					u8 status)
+{
+	struct mlx5_fpga_tls_command_context *ctx =
+	    container_of(buf, struct mlx5_fpga_tls_command_context, buf);
+
+	mlx5_fpga_tls_put_command_ctx(ctx);
+
+	if (unlikely(status))
+		mlx5_fpga_tls_cmd_complete(fdev, NULL);
+}
+
+static void mlx5_fpga_tls_cmd_send(struct mlx5_fpga_device *fdev,
+				   struct mlx5_fpga_tls_command_context *cmd,
+				   mlx5_fpga_tls_command_complete complete)
+{
+	struct mlx5_fpga_tls *tls = fdev->tls;
+	unsigned long flags;
+	int ret;
+
+	refcount_set(&cmd->ref, 2);
+	cmd->complete = complete;
+	cmd->buf.complete = mlx5_fpga_cmd_send_complete;
+
+	spin_lock_irqsave(&tls->pending_cmds_lock, flags);
+	/* mlx5_fpga_sbu_conn_sendmsg is called under pending_cmds_lock
+	 * to make sure commands are inserted into the tls->pending_cmds list
+	 * and the command QP in the same order.
+	 */
+	ret = mlx5_fpga_sbu_conn_sendmsg(tls->conn, &cmd->buf);
+	if (likely(!ret))
+		list_add_tail(&cmd->list, &tls->pending_cmds);
+	else
+		complete(tls->conn, fdev, cmd, NULL);
+	spin_unlock_irqrestore(&tls->pending_cmds_lock, flags);
+}
+
+/* Start of context identifiers range (inclusive) */
+#define SWID_START	0
+/* End of context identifiers range (exclusive) */
+#define SWID_END	BIT(24)
+
+static int mlx5_fpga_tls_alloc_swid(struct idr *idr, spinlock_t *idr_spinlock,
+				    void *ptr)
+{
+	int ret;
+
+	/* TLS metadata format is 1 byte for syndrome followed
+	 * by 3 bytes of swid (software ID)
+	 * swid must not exceed 3 bytes.
+	 * See tls_rxtx.c:insert_pet() for details
+	 */
+	BUILD_BUG_ON((SWID_END - 1) & 0xFF000000);
+
+	idr_preload(GFP_KERNEL);
+	spin_lock_irq(idr_spinlock);
+	ret = idr_alloc(idr, ptr, SWID_START, SWID_END, GFP_ATOMIC);
+	spin_unlock_irq(idr_spinlock);
+	idr_preload_end();
+
+	return ret;
+}
+
+static void mlx5_fpga_tls_release_swid(struct idr *idr,
+				       spinlock_t *idr_spinlock, u32 swid)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(idr_spinlock, flags);
+	idr_remove(idr, swid);
+	spin_unlock_irqrestore(idr_spinlock, flags);
+}
+
+struct mlx5_teardown_stream_context {
+	struct mlx5_fpga_tls_command_context cmd;
+	u32 swid;
+};
+
+static void
+mlx5_fpga_tls_teardown_completion(struct mlx5_fpga_conn *conn,
+				  struct mlx5_fpga_device *fdev,
+				  struct mlx5_fpga_tls_command_context *cmd,
+				  struct mlx5_fpga_dma_buf *resp)
+{
+	struct mlx5_teardown_stream_context *ctx =
+		    container_of(cmd, struct mlx5_teardown_stream_context, cmd);
+
+	if (resp) {
+		u32 syndrome = MLX5_GET(tls_resp, resp->sg[0].data, syndrome);
+
+		if (syndrome)
+			mlx5_fpga_err(fdev,
+				      "Teardown stream failed with syndrome = %u\n",
+				      syndrome);
+		else
+			mlx5_fpga_tls_release_swid(&fdev->tls->tx_idr,
+						   &fdev->tls->idr_spinlock,
+						   ctx->swid);
+	}
+	mlx5_fpga_tls_put_command_ctx(cmd);
+}
+
+static void mlx5_fpga_tls_flow_to_cmd(void *flow, void *cmd)
+{
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, src_port), flow,
+	       MLX5_BYTE_OFF(tls_flow, ipv6));
+
+	MLX5_SET(tls_cmd, cmd, ipv6, MLX5_GET(tls_flow, flow, ipv6));
+	MLX5_SET(tls_cmd, cmd, direction_sx,
+		 MLX5_GET(tls_flow, flow, direction_sx));
+}
+
+void mlx5_fpga_tls_send_teardown_cmd(struct mlx5_core_dev *mdev, void *flow,
+				     u32 swid, gfp_t flags)
+{
+	struct mlx5_teardown_stream_context *ctx;
+	struct mlx5_fpga_dma_buf *buf;
+	void *cmd;
+
+	ctx = kzalloc(sizeof(*ctx) + MLX5_TLS_COMMAND_SIZE, flags);
+	if (!ctx)
+		return;
+
+	buf = &ctx->cmd.buf;
+	cmd = (ctx + 1);
+	MLX5_SET(tls_cmd, cmd, command_type, CMD_TEARDOWN_STREAM);
+	MLX5_SET(tls_cmd, cmd, swid, swid);
+
+	mlx5_fpga_tls_flow_to_cmd(flow, cmd);
+	kfree(flow);
+
+	buf->sg[0].data = cmd;
+	buf->sg[0].size = MLX5_TLS_COMMAND_SIZE;
+
+	ctx->swid = swid;
+	mlx5_fpga_tls_cmd_send(mdev->fpga, &ctx->cmd,
+			       mlx5_fpga_tls_teardown_completion);
+}
+
+void mlx5_fpga_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid,
+			       gfp_t flags)
+{
+	struct mlx5_fpga_tls *tls = mdev->fpga->tls;
+	void *flow;
+
+	rcu_read_lock();
+	flow = idr_find(&tls->tx_idr, swid);
+	rcu_read_unlock();
+
+	if (!flow) {
+		mlx5_fpga_err(mdev->fpga, "No flow information for swid %u\n",
+			      swid);
+		return;
+	}
+
+	mlx5_fpga_tls_send_teardown_cmd(mdev, flow, swid, flags);
+}
+
+enum mlx5_fpga_setup_stream_status {
+	MLX5_FPGA_CMD_PENDING,
+	MLX5_FPGA_CMD_SEND_FAILED,
+	MLX5_FPGA_CMD_RESPONSE_RECEIVED,
+	MLX5_FPGA_CMD_ABANDONED,
+};
+
+struct mlx5_setup_stream_context {
+	struct mlx5_fpga_tls_command_context cmd;
+	atomic_t status;
+	u32 syndrome;
+	struct completion comp;
+};
+
+static void
+mlx5_fpga_tls_setup_completion(struct mlx5_fpga_conn *conn,
+			       struct mlx5_fpga_device *fdev,
+			       struct mlx5_fpga_tls_command_context *cmd,
+			       struct mlx5_fpga_dma_buf *resp)
+{
+	struct mlx5_setup_stream_context *ctx =
+	    container_of(cmd, struct mlx5_setup_stream_context, cmd);
+	int status = MLX5_FPGA_CMD_SEND_FAILED;
+	void *tls_cmd = ctx + 1;
+
+	/* If we failed to send the command, resp == NULL */
+	if (resp) {
+		ctx->syndrome = MLX5_GET(tls_resp, resp->sg[0].data, syndrome);
+		status = MLX5_FPGA_CMD_RESPONSE_RECEIVED;
+	}
+
+	status = atomic_xchg_release(&ctx->status, status);
+	if (likely(status != MLX5_FPGA_CMD_ABANDONED)) {
+		complete(&ctx->comp);
+		return;
+	}
+
+	mlx5_fpga_err(fdev, "Command was abandoned, syndrome = %u\n",
+		      ctx->syndrome);
+
+	if (!ctx->syndrome) {
+		/* The process was killed while waiting for the context to be
+		 * added, and the add completed successfully.
+		 * We need to destroy the HW context, and we can't reuse
+		 * the command context because we might not have received
+		 * the tx completion yet.
+		 */
+		mlx5_fpga_tls_del_tx_flow(fdev->mdev,
+					  MLX5_GET(tls_cmd, tls_cmd, swid),
+					  GFP_ATOMIC);
+	}
+
+	mlx5_fpga_tls_put_command_ctx(cmd);
+}
+
+static int mlx5_fpga_tls_setup_stream_cmd(struct mlx5_core_dev *mdev,
+					  struct mlx5_setup_stream_context *ctx)
+{
+	struct mlx5_fpga_dma_buf *buf;
+	void *cmd = ctx + 1;
+	int status, ret = 0;
+
+	buf = &ctx->cmd.buf;
+	buf->sg[0].data = cmd;
+	buf->sg[0].size = MLX5_TLS_COMMAND_SIZE;
+	MLX5_SET(tls_cmd, cmd, command_type, CMD_SETUP_STREAM);
+
+	init_completion(&ctx->comp);
+	atomic_set(&ctx->status, MLX5_FPGA_CMD_PENDING);
+	ctx->syndrome = -1;
+
+	mlx5_fpga_tls_cmd_send(mdev->fpga, &ctx->cmd,
+			       mlx5_fpga_tls_setup_completion);
+	wait_for_completion_killable(&ctx->comp);
+
+	status = atomic_xchg_acquire(&ctx->status, MLX5_FPGA_CMD_ABANDONED);
+	if (unlikely(status == MLX5_FPGA_CMD_PENDING))
+	/* ctx is going to be released in mlx5_fpga_tls_setup_completion */
+		return -EINTR;
+
+	if (unlikely(ctx->syndrome))
+		ret = -ENOMEM;
+
+	mlx5_fpga_tls_put_command_ctx(&ctx->cmd);
+	return ret;
+}
+
+static void mlx5_fpga_tls_hw_qp_recv_cb(void *cb_arg,
+					struct mlx5_fpga_dma_buf *buf)
+{
+	struct mlx5_fpga_device *fdev = (struct mlx5_fpga_device *)cb_arg;
+
+	mlx5_fpga_tls_cmd_complete(fdev, buf);
+}
+
+bool mlx5_fpga_is_tls_device(struct mlx5_core_dev *mdev)
+{
+	if (!mdev->fpga || !MLX5_CAP_GEN(mdev, fpga))
+		return false;
+
+	if (MLX5_CAP_FPGA(mdev, ieee_vendor_id) !=
+	    MLX5_FPGA_CAP_SANDBOX_VENDOR_ID_MLNX)
+		return false;
+
+	if (MLX5_CAP_FPGA(mdev, sandbox_product_id) !=
+	    MLX5_FPGA_CAP_SANDBOX_PRODUCT_ID_TLS)
+		return false;
+
+	if (MLX5_CAP_FPGA(mdev, sandbox_product_version) != 0)
+		return false;
+
+	return true;
+}
+
+static inline int mlx5_fpga_tls_get_caps(struct mlx5_fpga_device *fdev,
+					 u32 *p_caps)
+{
+	int err, cap_size = MLX5_ST_SZ_BYTES(tls_extended_cap);
+	u32 caps = 0;
+	void *buf;
+
+	buf = kzalloc(cap_size, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	err = mlx5_fpga_get_sbu_caps(fdev, cap_size, buf);
+	if (err)
+		goto out;
+
+	if (MLX5_GET(tls_extended_cap, buf, tx))
+		caps |= MLX5_ACCEL_TLS_TX;
+	if (MLX5_GET(tls_extended_cap, buf, rx))
+		caps |= MLX5_ACCEL_TLS_RX;
+	if (MLX5_GET(tls_extended_cap, buf, tls_v12))
+		caps |= MLX5_ACCEL_TLS_V12;
+	if (MLX5_GET(tls_extended_cap, buf, tls_v13))
+		caps |= MLX5_ACCEL_TLS_V13;
+	if (MLX5_GET(tls_extended_cap, buf, lro))
+		caps |= MLX5_ACCEL_TLS_LRO;
+	if (MLX5_GET(tls_extended_cap, buf, ipv6))
+		caps |= MLX5_ACCEL_TLS_IPV6;
+
+	if (MLX5_GET(tls_extended_cap, buf, aes_gcm_128))
+		caps |= MLX5_ACCEL_TLS_AES_GCM128;
+	if (MLX5_GET(tls_extended_cap, buf, aes_gcm_256))
+		caps |= MLX5_ACCEL_TLS_AES_GCM256;
+
+	*p_caps = caps;
+	err = 0;
+out:
+	kfree(buf);
+	return err;
+}
+
+int mlx5_fpga_tls_init(struct mlx5_core_dev *mdev)
+{
+	struct mlx5_fpga_device *fdev = mdev->fpga;
+	struct mlx5_fpga_conn_attr init_attr = {0};
+	struct mlx5_fpga_conn *conn;
+	struct mlx5_fpga_tls *tls;
+	int err = 0;
+
+	if (!mlx5_fpga_is_tls_device(mdev))
+		return 0;
+
+	tls = kzalloc(sizeof(*tls), GFP_KERNEL);
+	if (!tls)
+		return -ENOMEM;
+
+	err = mlx5_fpga_tls_get_caps(fdev, &tls->caps);
+	if (err)
+		goto error;
+
+	if (!(tls->caps & (MLX5_ACCEL_TLS_TX | MLX5_ACCEL_TLS_V12 |
+				 MLX5_ACCEL_TLS_AES_GCM128))) {
+		err = -ENOTSUPP;
+		goto error;
+	}
+
+	init_attr.rx_size = SBU_QP_QUEUE_SIZE;
+	init_attr.tx_size = SBU_QP_QUEUE_SIZE;
+	init_attr.recv_cb = mlx5_fpga_tls_hw_qp_recv_cb;
+	init_attr.cb_arg = fdev;
+	conn = mlx5_fpga_sbu_conn_create(fdev, &init_attr);
+	if (IS_ERR(conn)) {
+		err = PTR_ERR(conn);
+		mlx5_fpga_err(fdev, "Error creating TLS command connection %d\n",
+			      err);
+		goto error;
+	}
+
+	tls->conn = conn;
+	spin_lock_init(&tls->pending_cmds_lock);
+	INIT_LIST_HEAD(&tls->pending_cmds);
+
+	idr_init(&tls->tx_idr);
+	spin_lock_init(&tls->idr_spinlock);
+	fdev->tls = tls;
+	return 0;
+
+error:
+	kfree(tls);
+	return err;
+}
+
+void mlx5_fpga_tls_cleanup(struct mlx5_core_dev *mdev)
+{
+	struct mlx5_fpga_device *fdev = mdev->fpga;
+
+	if (!fdev->tls)
+		return;
+
+	mlx5_fpga_sbu_conn_destroy(fdev->tls->conn);
+	kfree(fdev->tls);
+	fdev->tls = NULL;
+}
+
+static void mlx5_fpga_tls_set_aes_gcm128_ctx(void *cmd,
+					     struct tls_crypto_info *info,
+					     __be64 *rcd_sn)
+{
+	struct tls12_crypto_info_aes_gcm_128 *crypto_info =
+	    (struct tls12_crypto_info_aes_gcm_128 *)info;
+
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, tls_rcd_sn), crypto_info->rec_seq,
+	       TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
+
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, tls_implicit_iv),
+	       crypto_info->salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, encryption_key),
+	       crypto_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+
+	/* in AES-GCM 128 we need to write the key twice */
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, encryption_key) +
+		   TLS_CIPHER_AES_GCM_128_KEY_SIZE,
+	       crypto_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+
+	MLX5_SET(tls_cmd, cmd, alg, MLX5_TLS_ALG_AES_GCM_128);
+}
+
+static int mlx5_fpga_tls_set_key_material(void *cmd, u32 caps,
+					  struct tls_crypto_info *crypto_info)
+{
+	__be64 rcd_sn;
+
+	switch (crypto_info->cipher_type) {
+	case TLS_CIPHER_AES_GCM_128:
+		if (!(caps & MLX5_ACCEL_TLS_AES_GCM128))
+			return -EINVAL;
+		mlx5_fpga_tls_set_aes_gcm128_ctx(cmd, crypto_info, &rcd_sn);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int mlx5_fpga_tls_add_flow(struct mlx5_core_dev *mdev, void *flow,
+				  struct tls_crypto_info *crypto_info, u32 swid,
+				  u32 tcp_sn)
+{
+	u32 caps = mlx5_fpga_tls_device_caps(mdev);
+	struct mlx5_setup_stream_context *ctx;
+	int ret = -ENOMEM;
+	size_t cmd_size;
+	void *cmd;
+
+	cmd_size = MLX5_TLS_COMMAND_SIZE + sizeof(*ctx);
+	ctx = kzalloc(cmd_size, GFP_KERNEL);
+	if (!ctx)
+		goto out;
+
+	cmd = ctx + 1;
+	ret = mlx5_fpga_tls_set_key_material(cmd, caps, crypto_info);
+	if (ret)
+		goto free_ctx;
+
+	mlx5_fpga_tls_flow_to_cmd(flow, cmd);
+
+	MLX5_SET(tls_cmd, cmd, swid, swid);
+	MLX5_SET(tls_cmd, cmd, tcp_sn, tcp_sn);
+
+	return mlx5_fpga_tls_setup_stream_cmd(mdev, ctx);
+
+free_ctx:
+	kfree(ctx);
+out:
+	return ret;
+}
+
+int mlx5_fpga_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			      struct tls_crypto_info *crypto_info,
+			      u32 start_offload_tcp_sn, u32 *p_swid)
+{
+	struct mlx5_fpga_tls *tls = mdev->fpga->tls;
+	int ret = -ENOMEM;
+	u32 swid;
+
+	ret = mlx5_fpga_tls_alloc_swid(&tls->tx_idr, &tls->idr_spinlock, flow);
+	if (ret < 0)
+		return ret;
+
+	swid = ret;
+	MLX5_SET(tls_flow, flow, direction_sx, 1);
+
+	ret = mlx5_fpga_tls_add_flow(mdev, flow, crypto_info, swid,
+				     start_offload_tcp_sn);
+	if (ret && ret != -EINTR)
+		goto free_swid;
+
+	*p_swid = swid;
+	return 0;
+free_swid:
+	mlx5_fpga_tls_release_swid(&tls->tx_idr, &tls->idr_spinlock, swid);
+
+	return ret;
+}
+
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
new file mode 100644
index 000000000000..800a214e4e49
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5_FPGA_TLS_H__
+#define __MLX5_FPGA_TLS_H__
+
+#include <linux/mlx5/driver.h>
+
+#include <net/tls.h>
+#include "fpga/core.h"
+
+struct mlx5_fpga_tls {
+	struct list_head pending_cmds;
+	spinlock_t pending_cmds_lock; /* Protects pending_cmds */
+	u32 caps;
+	struct mlx5_fpga_conn *conn;
+
+	struct idr tx_idr;
+	spinlock_t idr_spinlock; /* protects the IDR */
+};
+
+int mlx5_fpga_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			      struct tls_crypto_info *crypto_info,
+			      u32 start_offload_tcp_sn, u32 *p_swid);
+
+void mlx5_fpga_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid,
+			       gfp_t flags);
+
+bool mlx5_fpga_is_tls_device(struct mlx5_core_dev *mdev);
+int mlx5_fpga_tls_init(struct mlx5_core_dev *mdev);
+void mlx5_fpga_tls_cleanup(struct mlx5_core_dev *mdev);
+
+static inline u32 mlx5_fpga_tls_device_caps(struct mlx5_core_dev *mdev)
+{
+	return mdev->fpga->tls->caps;
+}
+
+#endif /* __MLX5_FPGA_TLS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 13b6f66310c9..808091df84ee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -60,6 +60,7 @@
 #include "fpga/core.h"
 #include "fpga/ipsec.h"
 #include "accel/ipsec.h"
+#include "accel/tls.h"
 #include "lib/clock.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
@@ -1186,6 +1187,12 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 		goto err_ipsec_start;
 	}
 
+	err = mlx5_accel_tls_init(dev);
+	if (err) {
+		dev_err(&pdev->dev, "TLS device start failed %d\n", err);
+		goto err_tls_start;
+	}
+
 	err = mlx5_init_fs(dev);
 	if (err) {
 		dev_err(&pdev->dev, "Failed to init flow steering\n");
@@ -1227,6 +1234,9 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_cleanup_fs(dev);
 
 err_fs:
+	mlx5_accel_tls_cleanup(dev);
+
+err_tls_start:
 	mlx5_accel_ipsec_cleanup(dev);
 
 err_ipsec_start:
@@ -1302,6 +1312,7 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_sriov_detach(dev);
 	mlx5_cleanup_fs(dev);
 	mlx5_accel_ipsec_cleanup(dev);
+	mlx5_accel_tls_cleanup(dev);
 	mlx5_fpga_device_stop(dev);
 	mlx5_irq_clear_affinity_hints(dev);
 	free_comp_eqs(dev);
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 14ad84afe8ba..24092a871c3d 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -350,22 +350,6 @@ struct mlx5_ifc_odp_per_transport_service_cap_bits {
 	u8         reserved_at_6[0x1a];
 };
 
-struct mlx5_ifc_ipv4_layout_bits {
-	u8         reserved_at_0[0x60];
-
-	u8         ipv4[0x20];
-};
-
-struct mlx5_ifc_ipv6_layout_bits {
-	u8         ipv6[16][0x8];
-};
-
-union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits {
-	struct mlx5_ifc_ipv6_layout_bits ipv6_layout;
-	struct mlx5_ifc_ipv4_layout_bits ipv4_layout;
-	u8         reserved_at_0[0x80];
-};
-
 struct mlx5_ifc_fte_match_set_lyr_2_4_bits {
 	u8         smac_47_16[0x20];
 
diff --git a/include/linux/mlx5/mlx5_ifc_fpga.h b/include/linux/mlx5/mlx5_ifc_fpga.h
index ec052491ba3d..193091537cb6 100644
--- a/include/linux/mlx5/mlx5_ifc_fpga.h
+++ b/include/linux/mlx5/mlx5_ifc_fpga.h
@@ -32,12 +32,29 @@
 #ifndef MLX5_IFC_FPGA_H
 #define MLX5_IFC_FPGA_H
 
+struct mlx5_ifc_ipv4_layout_bits {
+	u8         reserved_at_0[0x60];
+
+	u8         ipv4[0x20];
+};
+
+struct mlx5_ifc_ipv6_layout_bits {
+	u8         ipv6[16][0x8];
+};
+
+union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits {
+	struct mlx5_ifc_ipv6_layout_bits ipv6_layout;
+	struct mlx5_ifc_ipv4_layout_bits ipv4_layout;
+	u8         reserved_at_0[0x80];
+};
+
 enum {
 	MLX5_FPGA_CAP_SANDBOX_VENDOR_ID_MLNX = 0x2c9,
 };
 
 enum {
 	MLX5_FPGA_CAP_SANDBOX_PRODUCT_ID_IPSEC    = 0x2,
+	MLX5_FPGA_CAP_SANDBOX_PRODUCT_ID_TLS      = 0x3,
 };
 
 struct mlx5_ifc_fpga_shell_caps_bits {
@@ -370,6 +387,27 @@ struct mlx5_ifc_fpga_destroy_qp_out_bits {
 	u8         reserved_at_40[0x40];
 };
 
+struct mlx5_ifc_tls_extended_cap_bits {
+	u8         aes_gcm_128[0x1];
+	u8         aes_gcm_256[0x1];
+	u8         reserved_at_2[0x1e];
+	u8         reserved_at_20[0x20];
+	u8         context_capacity_total[0x20];
+	u8         context_capacity_rx[0x20];
+	u8         context_capacity_tx[0x20];
+	u8         reserved_at_a0[0x10];
+	u8         tls_counter_size[0x10];
+	u8         tls_counters_addr_low[0x20];
+	u8         tls_counters_addr_high[0x20];
+	u8         rx[0x1];
+	u8         tx[0x1];
+	u8         tls_v12[0x1];
+	u8         tls_v13[0x1];
+	u8         lro[0x1];
+	u8         ipv6[0x1];
+	u8         reserved_at_106[0x1a];
+};
+
 struct mlx5_ifc_ipsec_extended_cap_bits {
 	u8         encapsulation[0x20];
 
@@ -519,4 +557,43 @@ struct mlx5_ifc_fpga_ipsec_sa {
 	__be16 reserved2;
 } __packed;
 
+enum fpga_tls_cmds {
+	CMD_SETUP_STREAM		= 0x1001,
+	CMD_TEARDOWN_STREAM		= 0x1002,
+};
+
+#define MLX5_TLS_1_2 (0)
+
+#define MLX5_TLS_ALG_AES_GCM_128 (0)
+#define MLX5_TLS_ALG_AES_GCM_256 (1)
+
+struct mlx5_ifc_tls_cmd_bits {
+	u8         command_type[0x20];
+	u8         ipv6[0x1];
+	u8         direction_sx[0x1];
+	u8         tls_version[0x2];
+	u8         reserved[0x1c];
+	u8         swid[0x20];
+	u8         src_port[0x10];
+	u8         dst_port[0x10];
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits src_ipv4_src_ipv6;
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits dst_ipv4_dst_ipv6;
+	u8         tls_rcd_sn[0x40];
+	u8         tcp_sn[0x20];
+	u8         tls_implicit_iv[0x20];
+	u8         tls_xor_iv[0x40];
+	u8         encryption_key[0x100];
+	u8         alg[4];
+	u8         reserved2[0x1c];
+	u8         reserved3[0x4a0];
+};
+
+struct mlx5_ifc_tls_resp_bits {
+	u8         syndrome[0x20];
+	u8         stream_id[0x20];
+	u8         reserverd[0x40];
+};
+
+#define MLX5_TLS_COMMAND_SIZE (0x100)
+
 #endif /* MLX5_IFC_FPGA_H */
-- 
2.14.3

^ permalink raw reply related

* [PATCH V2 net-next 11/14] net/mlx5e: TLS, Add Innova TLS TX offload data path
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Implement the TLS TX offload data path according to the
requirements of the TLS generic NIC offload infrastructure.

A special metadata ethertype is used to pass information to
the hardware.
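
As a rough illustration only (not part of this patch), the first
metadata word, one byte of syndrome followed by three bytes of swid,
could be packed as sketched below. This is user-space C and the helper
name pack_syndrome_swid() is hypothetical; the driver itself fills the
field in mlx5e_tls_add_metadata().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#define SYNDROME_OFFLOAD_REQUIRED 32 /* value used in tls_rxtx.c */

/* Hypothetical helper (an assumption for illustration): build the
 * 32-bit metadata word in network byte order, with the syndrome in
 * the most significant byte and the swid in the low 24 bits.
 */
static uint32_t pack_syndrome_swid(uint8_t syndrome, uint32_t swid)
{
	assert(swid < (1u << 24)); /* swid must fit in 3 bytes */
	return htonl(((uint32_t)syndrome << 24) | swid);
}
```

The 24-bit limit on swid is the same constraint that SWID_END and the
BUILD_BUG_ON in fpga/tls.c enforce at allocation time.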

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  15 ++
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  72 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c |   2 +
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 272 +++++++++++++++++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h         |  50 ++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  10 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  37 +--
 10 files changed, 455 insertions(+), 16 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 50872ed30c0b..ec785f589666 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
 		en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 6660986285bf..7d8696fca826 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -340,6 +340,7 @@ struct mlx5e_sq_dma {
 enum {
 	MLX5E_SQ_STATE_ENABLED,
 	MLX5E_SQ_STATE_IPSEC,
+	MLX5E_SQ_STATE_TLS,
 };
 
 struct mlx5e_sq_wqe_info {
@@ -824,6 +825,8 @@ void mlx5e_build_ptys2ethtool_map(void);
 u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
 		       void *accel_priv, select_queue_fallback_t fallback);
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+			  struct mlx5e_tx_wqe *wqe, u16 pi);
 
 void mlx5e_completion_event(struct mlx5_core_cq *mcq);
 void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
@@ -929,6 +932,18 @@ static inline bool mlx5e_tunnel_inner_ft_supported(struct mlx5_core_dev *mdev)
 		MLX5_CAP_FLOWTABLE_NIC_RX(mdev, ft_field_support.inner_ip_version));
 }
 
+static inline void mlx5e_sq_fetch_wqe(struct mlx5e_txqsq *sq,
+				      struct mlx5e_tx_wqe **wqe,
+				      u16 *pi)
+{
+	struct mlx5_wq_cyc *wq;
+
+	wq = &sq->wq;
+	*pi = sq->pc & wq->sz_m1;
+	*wqe = mlx5_wq_cyc_get_wqe(wq, *pi);
+	memset(*wqe, 0, sizeof(**wqe));
+}
+
 static inline
 struct mlx5e_tx_wqe *mlx5e_post_nop(struct mlx5_wq_cyc *wq, u32 sqn, u16 *pc)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
new file mode 100644
index 000000000000..68fcb40a2847
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5E_EN_ACCEL_H__
+#define __MLX5E_EN_ACCEL_H__
+
+#ifdef CONFIG_MLX5_ACCEL
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include "en_accel/ipsec_rxtx.h"
+#include "en_accel/tls_rxtx.h"
+#include "en.h"
+
+static inline struct sk_buff *mlx5e_accel_handle_tx(struct sk_buff *skb,
+						    struct mlx5e_txqsq *sq,
+						    struct net_device *dev,
+						    struct mlx5e_tx_wqe **wqe,
+						    u16 *pi)
+{
+#ifdef CONFIG_MLX5_EN_TLS
+	if (sq->state & BIT(MLX5E_SQ_STATE_TLS)) {
+		skb = mlx5e_tls_handle_tx_skb(dev, sq, skb, wqe, pi);
+		if (unlikely(!skb))
+			return NULL;
+	}
+#endif
+
+#ifdef CONFIG_MLX5_EN_IPSEC
+	if (sq->state & BIT(MLX5E_SQ_STATE_IPSEC)) {
+		skb = mlx5e_ipsec_handle_tx_skb(dev, *wqe, skb);
+		if (unlikely(!skb))
+			return NULL;
+	}
+#endif
+
+	return skb;
+}
+
+#endif /* CONFIG_MLX5_ACCEL */
+
+#endif /* __MLX5E_EN_ACCEL_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index 38d88108a55a..aa6981c98bdc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -169,5 +169,7 @@ void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
 	if (!mlx5_accel_is_tls_device(priv->mdev))
 		return;
 
+	netdev->features |= NETIF_F_HW_TLS_TX;
+	netdev->hw_features |= NETIF_F_HW_TLS_TX;
 	netdev->tlsdev_ops = &mlx5e_tls_ops;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
new file mode 100644
index 000000000000..49e8d455ebc3
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -0,0 +1,272 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include "en_accel/tls.h"
+#include "en_accel/tls_rxtx.h"
+
+#define SYNDROME_OFFLOAD_REQUIRED 32
+#define SYNDROME_SYNC 33
+
+struct sync_info {
+	u64 rcd_sn;
+	s32 sync_len;
+	int nr_frags;
+	skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+struct mlx5e_tls_metadata {
+	/* One byte of syndrome followed by 3 bytes of swid */
+	__be32 syndrome_swid;
+	__be16 first_seq;
+	/* packet type ID field */
+	__be16 ethertype;
+} __packed;
+
+static int mlx5e_tls_add_metadata(struct sk_buff *skb, __be32 swid)
+{
+	struct mlx5e_tls_metadata *pet;
+	struct ethhdr *eth;
+
+	if (skb_cow_head(skb, sizeof(struct mlx5e_tls_metadata)))
+		return -ENOMEM;
+
+	eth = (struct ethhdr *)skb_push(skb, sizeof(struct mlx5e_tls_metadata));
+	skb->mac_header -= sizeof(struct mlx5e_tls_metadata);
+	pet = (struct mlx5e_tls_metadata *)(eth + 1);
+
+	memmove(skb->data, skb->data + sizeof(struct mlx5e_tls_metadata),
+		2 * ETH_ALEN);
+
+	eth->h_proto = cpu_to_be16(MLX5E_METADATA_ETHER_TYPE);
+	pet->syndrome_swid = htonl(SYNDROME_OFFLOAD_REQUIRED << 24) | swid;
+
+	return 0;
+}
+
+static int mlx5e_tls_get_sync_data(struct mlx5e_tls_offload_context *context,
+				   u32 tcp_seq, struct sync_info *info)
+{
+	int remaining, i = 0, ret = -EINVAL;
+	struct tls_record_info *record;
+	unsigned long flags;
+	s32 sync_size;
+
+	spin_lock_irqsave(&context->base.lock, flags);
+	record = tls_get_record(&context->base, tcp_seq, &info->rcd_sn);
+
+	if (unlikely(!record))
+		goto out;
+
+	sync_size = tcp_seq - tls_record_start_seq(record);
+	info->sync_len = sync_size;
+	if (unlikely(sync_size < 0)) {
+		if (tls_record_is_start_marker(record))
+			goto done;
+
+		goto out;
+	}
+
+	remaining = sync_size;
+	while (remaining > 0) {
+		info->frags[i] = record->frags[i];
+		__skb_frag_ref(&info->frags[i]);
+		remaining -= skb_frag_size(&info->frags[i]);
+
+		if (remaining < 0)
+			skb_frag_size_add(&info->frags[i], remaining);
+
+		i++;
+	}
+	info->nr_frags = i;
+done:
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&context->base.lock, flags);
+	return ret;
+}
+
+static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb,
+					struct sk_buff *nskb, u32 tcp_seq,
+					int headln, __be64 rcd_sn)
+{
+	struct mlx5e_tls_metadata *pet;
+	u8 syndrome = SYNDROME_SYNC;
+	struct iphdr *iph;
+	struct tcphdr *th;
+	int data_len, mss;
+
+	nskb->dev = skb->dev;
+	skb_reset_mac_header(nskb);
+	skb_set_network_header(nskb, skb_network_offset(skb));
+	skb_set_transport_header(nskb, skb_transport_offset(skb));
+	memcpy(nskb->data, skb->data, headln);
+	memcpy(nskb->data + headln, &rcd_sn, sizeof(rcd_sn));
+
+	iph = ip_hdr(nskb);
+	iph->tot_len = htons(nskb->len - skb_network_offset(nskb));
+	th = tcp_hdr(nskb);
+	data_len = nskb->len - headln;
+	tcp_seq -= data_len;
+	th->seq = htonl(tcp_seq);
+
+	mss = nskb->dev->mtu - (headln - skb_network_offset(nskb));
+	skb_shinfo(nskb)->gso_size = 0;
+	if (data_len > mss) {
+		skb_shinfo(nskb)->gso_size = mss;
+		skb_shinfo(nskb)->gso_segs = DIV_ROUND_UP(data_len, mss);
+	}
+	skb_shinfo(nskb)->gso_type = skb_shinfo(skb)->gso_type;
+
+	pet = (struct mlx5e_tls_metadata *)(nskb->data + sizeof(struct ethhdr));
+	memcpy(pet, &syndrome, sizeof(syndrome));
+	pet->first_seq = htons(tcp_seq);
+
+	/* MLX5 devices don't care about the checksum partial start, offset
+	 * and pseudo header
+	 */
+	nskb->ip_summed = CHECKSUM_PARTIAL;
+
+	nskb->xmit_more = 1;
+	nskb->queue_mapping = skb->queue_mapping;
+}
+
+static struct sk_buff *
+mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
+		     struct mlx5e_txqsq *sq, struct sk_buff *skb,
+		     struct mlx5e_tx_wqe **wqe,
+		     u16 *pi)
+{
+	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
+	struct sync_info info;
+	struct sk_buff *nskb;
+	int linear_len = 0;
+	int headln;
+	int i;
+
+	sq->stats.tls_ooo++;
+
+	if (mlx5e_tls_get_sync_data(context, tcp_seq, &info))
+		/* We might get here if a retransmission reaches the driver
+		 * after the relevant record is acked.
+		 * It should be safe to drop the packet in this case
+		 */
+		goto err_out;
+
+	if (unlikely(info.sync_len < 0)) {
+		u32 payload;
+
+		headln = skb_transport_offset(skb) + tcp_hdrlen(skb);
+		payload = skb->len - headln;
+		if (likely(payload <= -info.sync_len))
+			/* SKB payload doesn't require offload
+			 */
+			return skb;
+
+		netdev_err(skb->dev,
+			   "Can't offload from the middle of an SKB [seq: %X, offload_seq: %X, end_seq: %X]\n",
+			   tcp_seq, tcp_seq + payload + info.sync_len,
+			   tcp_seq + payload);
+		goto err_out;
+	}
+
+	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid)))
+		goto err_out;
+
+	headln = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	linear_len += headln + sizeof(info.rcd_sn);
+	nskb = alloc_skb(linear_len, GFP_ATOMIC);
+	if (unlikely(!nskb))
+		goto err_out;
+
+	context->expected_seq = tcp_seq + skb->len - headln;
+	skb_put(nskb, linear_len);
+	for (i = 0; i < info.nr_frags; i++)
+		skb_shinfo(nskb)->frags[i] = info.frags[i];
+
+	skb_shinfo(nskb)->nr_frags = info.nr_frags;
+	nskb->data_len = info.sync_len;
+	nskb->len += info.sync_len;
+	sq->stats.tls_resync_bytes += nskb->len;
+	mlx5e_tls_complete_sync_skb(skb, nskb, tcp_seq, headln,
+				    cpu_to_be64(info.rcd_sn));
+	mlx5e_sq_xmit(sq, nskb, *wqe, *pi);
+	mlx5e_sq_fetch_wqe(sq, wqe, pi);
+	return skb;
+
+err_out:
+	dev_kfree_skb_any(skb);
+	return NULL;
+}
+
+struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
+					struct mlx5e_txqsq *sq,
+					struct sk_buff *skb,
+					struct mlx5e_tx_wqe **wqe,
+					u16 *pi)
+{
+	struct mlx5e_tls_offload_context *context;
+	struct tls_context *tls_ctx;
+	u32 expected_seq;
+	int datalen;
+	u32 skb_seq;
+
+	if (!skb->sk || !tls_is_sk_tx_device_offloaded(skb->sk))
+		goto out;
+
+	datalen = skb->len - (skb_transport_offset(skb) + tcp_hdrlen(skb));
+	if (!datalen)
+		goto out;
+
+	tls_ctx = tls_get_ctx(skb->sk);
+	if (unlikely(tls_ctx->netdev != netdev))
+		goto out;
+
+	skb_seq = ntohl(tcp_hdr(skb)->seq);
+	context = mlx5e_get_tls_tx_context(tls_ctx);
+	expected_seq = context->expected_seq;
+
+	if (unlikely(expected_seq != skb_seq)) {
+		skb = mlx5e_tls_handle_ooo(context, sq, skb, wqe, pi);
+		goto out;
+	}
+
+	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid))) {
+		dev_kfree_skb_any(skb);
+		skb = NULL;
+		goto out;
+	}
+
+	context->expected_seq = skb_seq + datalen;
+out:
+	return skb;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
new file mode 100644
index 000000000000..405dfd302225
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
@@ -0,0 +1,50 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5E_TLS_RXTX_H__
+#define __MLX5E_TLS_RXTX_H__
+
+#ifdef CONFIG_MLX5_EN_TLS
+
+#include <linux/skbuff.h>
+#include "en.h"
+
+struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
+					struct mlx5e_txqsq *sq,
+					struct sk_buff *skb,
+					struct mlx5e_tx_wqe **wqe,
+					u16 *pi);
+
+#endif /* CONFIG_MLX5_EN_TLS */
+
+#endif /* __MLX5E_TLS_RXTX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8dbe058da178..d4c397aec2ee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -976,6 +976,8 @@ static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
 	sq->min_inline_mode = params->tx_min_inline_mode;
 	if (MLX5_IPSEC_DEV(c->priv->mdev))
 		set_bit(MLX5E_SQ_STATE_IPSEC, &sq->state);
+	if (mlx5_accel_is_tls_device(c->priv->mdev))
+		set_bit(MLX5E_SQ_STATE_TLS, &sq->state);
 
 	param->wq.db_numa_node = cpu_to_node(c->cpu);
 	err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, &sq->wq, &sq->wq_ctrl);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 5f0f3493d747..81c1f383d682 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -43,6 +43,12 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tso_inner_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tso_inner_bytes) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_added_vlan_packets) },
+
+#ifdef CONFIG_MLX5_EN_TLS
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tls_ooo) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tls_resync_bytes) },
+#endif
+
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_bytes) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_removed_vlan_packets) },
@@ -157,6 +163,10 @@ static void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
 			s->tx_csum_partial_inner += sq_stats->csum_partial_inner;
 			s->tx_csum_none		+= sq_stats->csum_none;
 			s->tx_csum_partial	+= sq_stats->csum_partial;
+#ifdef CONFIG_MLX5_EN_TLS
+			s->tx_tls_ooo		+= sq_stats->tls_ooo;
+			s->tx_tls_resync_bytes	+= sq_stats->tls_resync_bytes;
+#endif
 		}
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 0b3320a2b072..f956ee1704ef 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -91,6 +91,11 @@ struct mlx5e_sw_stats {
 	u64 rx_cache_waive;
 	u64 ch_eq_rearm;
 
+#ifdef CONFIG_MLX5_EN_TLS
+	u64 tx_tls_ooo;
+	u64 tx_tls_resync_bytes;
+#endif
+
 	/* Special handling counters */
 	u64 link_down_events_phy;
 };
@@ -187,6 +192,10 @@ struct mlx5e_sq_stats {
 	u64 csum_partial_inner;
 	u64 added_vlan_packets;
 	u64 nop;
+#ifdef CONFIG_MLX5_EN_TLS
+	u64 tls_ooo;
+	u64 tls_resync_bytes;
+#endif
 	/* less likely accessed in data path */
 	u64 csum_none;
 	u64 stopped;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 11b4f1089d1c..af3c318610e6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -35,12 +35,21 @@
 #include <net/dsfield.h>
 #include "en.h"
 #include "ipoib/ipoib.h"
-#include "en_accel/ipsec_rxtx.h"
+#include "en_accel/en_accel.h"
 #include "lib/clock.h"
 
 #define MLX5E_SQ_NOPS_ROOM  MLX5_SEND_WQE_MAX_WQEBBS
+
+#ifndef CONFIG_MLX5_EN_TLS
 #define MLX5E_SQ_STOP_ROOM (MLX5_SEND_WQE_MAX_WQEBBS +\
 			    MLX5E_SQ_NOPS_ROOM)
+#else
+/* TLS offload requires MLX5E_SQ_STOP_ROOM to have
+ * enough room for a resync SKB, a normal SKB and a NOP
+ */
+#define MLX5E_SQ_STOP_ROOM (2 * MLX5_SEND_WQE_MAX_WQEBBS +\
+			    MLX5E_SQ_NOPS_ROOM)
+#endif
 
 static inline void mlx5e_tx_dma_unmap(struct device *pdev,
 				      struct mlx5e_sq_dma *dma)
@@ -325,8 +334,8 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	}
 }
 
-static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
-				 struct mlx5e_tx_wqe *wqe, u16 pi)
+netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+			  struct mlx5e_tx_wqe *wqe, u16 pi)
 {
 	struct mlx5e_tx_wqe_info *wi   = &sq->db.wqe_info[pi];
 
@@ -399,21 +408,19 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
-	struct mlx5e_txqsq *sq = priv->txq2sq[skb_get_queue_mapping(skb)];
-	struct mlx5_wq_cyc *wq = &sq->wq;
-	u16 pi = sq->pc & wq->sz_m1;
-	struct mlx5e_tx_wqe *wqe = mlx5_wq_cyc_get_wqe(wq, pi);
+	struct mlx5e_tx_wqe *wqe;
+	struct mlx5e_txqsq *sq;
+	u16 pi;
 
-	memset(wqe, 0, sizeof(*wqe));
+	sq = priv->txq2sq[skb_get_queue_mapping(skb)];
+	mlx5e_sq_fetch_wqe(sq, &wqe, &pi);
 
-#ifdef CONFIG_MLX5_EN_IPSEC
-	if (sq->state & BIT(MLX5E_SQ_STATE_IPSEC)) {
-		skb = mlx5e_ipsec_handle_tx_skb(dev, wqe, skb);
-		if (unlikely(!skb))
-			return NETDEV_TX_OK;
-	}
+#ifdef CONFIG_MLX5_ACCEL
+	/* might send skbs and update wqe and pi */
+	skb = mlx5e_accel_handle_tx(skb, sq, dev, &wqe, &pi);
+	if (unlikely(!skb))
+		return NETDEV_TX_OK;
 #endif
-
 	return mlx5e_sq_xmit(sq, skb, wqe, pi);
 }
 
-- 
2.14.3
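
The ordering check in mlx5e_tls_handle_tx_skb() from this patch can be
modelled in user space roughly as below. This is a minimal sketch, not
driver code: struct tls_tx_model, enum tx_action and tls_tx_classify()
are hypothetical names, and the sketch assumes the resync always
succeeds, so the drop paths and the "payload precedes the offload start"
bypass in mlx5e_tls_handle_ooo() are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-connection TX offload state. */
struct tls_tx_model {
	uint32_t expected_seq;	/* next TCP sequence number the device expects */
	uint64_t ooo_events;	/* packets that did not match expected_seq */
};

enum tx_action {
	TX_PASSTHROUGH,	/* no payload, nothing to offload */
	TX_IN_ORDER,	/* metadata header only */
	TX_RESYNC,	/* a resync skb must be sent first */
};

/* Classify one outgoing segment and advance the expected sequence. */
static enum tx_action tls_tx_classify(struct tls_tx_model *m,
				      uint32_t skb_seq, uint32_t datalen)
{
	assert(m != NULL);

	if (datalen == 0)
		return TX_PASSTHROUGH;

	if (skb_seq != m->expected_seq) {
		m->ooo_events++;
		/* u32 arithmetic wraps like TCP sequence space */
		m->expected_seq = skb_seq + datalen;
		return TX_RESYNC;
	}

	m->expected_seq = skb_seq + datalen;
	return TX_IN_ORDER;
}
```

A retransmission shows up here as a segment whose sequence number is
lower than expected_seq, which takes the TX_RESYNC branch and corresponds
to the tls_ooo counter being incremented in the patch.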


* [PATCH V2 net-next 13/14] MAINTAINERS: Update mlx5 innova driver maintainers
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Dave Watson, Boris Pismenny, Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Boris Pismenny <borisp@mellanox.com>

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 MAINTAINERS | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 214c9bca232a..cd4067ccf959 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8913,26 +8913,17 @@ W:	http://www.mellanox.com
 Q:	http://patchwork.ozlabs.org/project/netdev/list/
 F:	drivers/net/ethernet/mellanox/mlx5/core/en_*
 
-MELLANOX ETHERNET INNOVA DRIVER
-M:	Ilan Tayari <ilant@mellanox.com>
-R:	Boris Pismenny <borisp@mellanox.com>
+MELLANOX ETHERNET INNOVA DRIVERS
+M:	Boris Pismenny <borisp@mellanox.com>
 L:	netdev@vger.kernel.org
 S:	Supported
 W:	http://www.mellanox.com
 Q:	http://patchwork.ozlabs.org/project/netdev/list/
+F:	drivers/net/ethernet/mellanox/mlx5/core/en_accel/*
+F:	drivers/net/ethernet/mellanox/mlx5/core/accel/*
 F:	drivers/net/ethernet/mellanox/mlx5/core/fpga/*
 F:	include/linux/mlx5/mlx5_ifc_fpga.h
 
-MELLANOX ETHERNET INNOVA IPSEC DRIVER
-M:	Ilan Tayari <ilant@mellanox.com>
-R:	Boris Pismenny <borisp@mellanox.com>
-L:	netdev@vger.kernel.org
-S:	Supported
-W:	http://www.mellanox.com
-Q:	http://patchwork.ozlabs.org/project/netdev/list/
-F:	drivers/net/ethernet/mellanox/mlx5/core/en_ipsec/*
-F:	drivers/net/ethernet/mellanox/mlx5/core/ipsec*
-
 MELLANOX ETHERNET SWITCH DRIVERS
 M:	Jiri Pirko <jiri@mellanox.com>
 M:	Ido Schimmel <idosch@mellanox.com>
-- 
2.14.3


* [PATCH V2 net-next 12/14] net/mlx5e: TLS, Add error statistics
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Add statistics for rare TLS-related errors.
Since the errors are rare, we keep one counter per netdev
rather than one per SQ.
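
The counter scheme used by this patch (a struct of atomic counters read
back through a table of field offsets) can be sketched in user space as
below. This is only an illustration under stated assumptions: C11
<stdatomic.h> stands in for the kernel's atomic64_t, and
tls_sw_stats_model, counter_desc_model, DECLARE_STAT_MODEL,
tls_stats_init() and tls_stats_fill() are hypothetical names, not the
driver's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mlx5e_tls_sw_stats. */
struct tls_sw_stats_model {
	atomic_uint_fast64_t drop_metadata;
	atomic_uint_fast64_t drop_resync_alloc;
	atomic_uint_fast64_t drop_no_sync_data;
	atomic_uint_fast64_t drop_bypass_required;
};

/* One descriptor per counter: its name and its offset in the struct. */
struct counter_desc_model {
	const char *name;
	size_t offset;
};

#define DECLARE_STAT_MODEL(type, fld) { #fld, offsetof(type, fld) }

static const struct counter_desc_model tls_stats_desc[] = {
	DECLARE_STAT_MODEL(struct tls_sw_stats_model, drop_metadata),
	DECLARE_STAT_MODEL(struct tls_sw_stats_model, drop_resync_alloc),
	DECLARE_STAT_MODEL(struct tls_sw_stats_model, drop_no_sync_data),
	DECLARE_STAT_MODEL(struct tls_sw_stats_model, drop_bypass_required),
};

#define NUM_TLS_COUNTERS (sizeof(tls_stats_desc) / sizeof(tls_stats_desc[0]))

/* Zero every counter through the descriptor table. */
static void tls_stats_init(struct tls_sw_stats_model *s)
{
	size_t i;

	assert(s != NULL);
	for (i = 0; i < NUM_TLS_COUNTERS; i++)
		atomic_init((atomic_uint_fast64_t *)
			    ((char *)s + tls_stats_desc[i].offset), 0);
}

/* Copy all counters into data[] in table order; returns the count. */
static size_t tls_stats_fill(struct tls_sw_stats_model *s, uint64_t *data)
{
	size_t i;

	assert(s != NULL && data != NULL);
	for (i = 0; i < NUM_TLS_COUNTERS; i++)
		data[i] = atomic_load((atomic_uint_fast64_t *)
				      ((char *)s + tls_stats_desc[i].offset));

	return NUM_TLS_COUNTERS;
}
```

A single shared set of atomic counters costs one atomic increment on each
error path, which is acceptable when the events are rare; the hot-path
counters such as tls_ooo stay per SQ and are summed when stats are read.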

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  3 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 22 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 22 ++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 24 +++---
 .../mellanox/mlx5/core/en_accel/tls_stats.c        | 89 ++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 22 ++++++
 8 files changed, 178 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index ec785f589666..a7135f5d5cf6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
 		en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o en_accel/tls_stats.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7d8696fca826..d397be0b5885 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -795,6 +795,9 @@ struct mlx5e_priv {
 #ifdef CONFIG_MLX5_EN_IPSEC
 	struct mlx5e_ipsec        *ipsec;
 #endif
+#ifdef CONFIG_MLX5_EN_TLS
+	struct mlx5e_tls          *tls;
+#endif
 };
 
 struct mlx5e_profile {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index aa6981c98bdc..d167845271c3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -173,3 +173,25 @@ void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
 	netdev->hw_features |= NETIF_F_HW_TLS_TX;
 	netdev->tlsdev_ops = &mlx5e_tls_ops;
 }
+
+int mlx5e_tls_init(struct mlx5e_priv *priv)
+{
+	struct mlx5e_tls *tls = kzalloc(sizeof(*tls), GFP_KERNEL);
+
+	if (!tls)
+		return -ENOMEM;
+
+	priv->tls = tls;
+	return 0;
+}
+
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv)
+{
+	struct mlx5e_tls *tls = priv->tls;
+
+	if (!tls)
+		return;
+
+	kfree(tls);
+	priv->tls = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
index f7216b9b98e2..b6162178f621 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -38,6 +38,17 @@
 #include <net/tls.h>
 #include "en.h"
 
+struct mlx5e_tls_sw_stats {
+	atomic64_t tx_tls_drop_metadata;
+	atomic64_t tx_tls_drop_resync_alloc;
+	atomic64_t tx_tls_drop_no_sync_data;
+	atomic64_t tx_tls_drop_bypass_required;
+};
+
+struct mlx5e_tls {
+	struct mlx5e_tls_sw_stats sw_stats;
+};
+
 struct mlx5e_tls_offload_context {
 	struct tls_offload_context base;
 	u32 expected_seq;
@@ -55,10 +66,21 @@ mlx5e_get_tls_tx_context(struct tls_context *tls_ctx)
 }
 
 void mlx5e_tls_build_netdev(struct mlx5e_priv *priv);
+int mlx5e_tls_init(struct mlx5e_priv *priv);
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv);
+
+int mlx5e_tls_get_count(struct mlx5e_priv *priv);
+int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data);
+int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data);
 
 #else
 
 static inline void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_init(struct mlx5e_priv *priv) { return 0; }
+static inline void mlx5e_tls_cleanup(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_get_count(struct mlx5e_priv *priv) { return 0; }
+static inline int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data) { return 0; }
+static inline int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data) { return 0; }
 
 #endif
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index 49e8d455ebc3..ad2790fb5966 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -164,7 +164,8 @@ static struct sk_buff *
 mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
 		     struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		     struct mlx5e_tx_wqe **wqe,
-		     u16 *pi)
+		     u16 *pi,
+		     struct mlx5e_tls *tls)
 {
 	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
 	struct sync_info info;
@@ -175,12 +176,14 @@ mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
 
 	sq->stats.tls_ooo++;
 
-	if (mlx5e_tls_get_sync_data(context, tcp_seq, &info))
+	if (mlx5e_tls_get_sync_data(context, tcp_seq, &info)) {
 		/* We might get here if a retransmission reaches the driver
 		 * after the relevant record is acked.
 		 * It should be safe to drop the packet in this case
 		 */
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_no_sync_data);
 		goto err_out;
+	}
 
 	if (unlikely(info.sync_len < 0)) {
 		u32 payload;
@@ -192,21 +195,22 @@ mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
 			 */
 			return skb;
 
-		netdev_err(skb->dev,
-			   "Can't offload from the middle of an SKB [seq: %X, offload_seq: %X, end_seq: %X]\n",
-			   tcp_seq, tcp_seq + payload + info.sync_len,
-			   tcp_seq + payload);
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_bypass_required);
 		goto err_out;
 	}
 
-	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid)))
+	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid))) {
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_metadata);
 		goto err_out;
+	}
 
 	headln = skb_transport_offset(skb) + tcp_hdrlen(skb);
 	linear_len += headln + sizeof(info.rcd_sn);
 	nskb = alloc_skb(linear_len, GFP_ATOMIC);
-	if (unlikely(!nskb))
+	if (unlikely(!nskb)) {
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_resync_alloc);
 		goto err_out;
+	}
 
 	context->expected_seq = tcp_seq + skb->len - headln;
 	skb_put(nskb, linear_len);
@@ -234,6 +238,7 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
 					struct mlx5e_tx_wqe **wqe,
 					u16 *pi)
 {
+	struct mlx5e_priv *priv = netdev_priv(netdev);
 	struct mlx5e_tls_offload_context *context;
 	struct tls_context *tls_ctx;
 	u32 expected_seq;
@@ -256,11 +261,12 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
 	expected_seq = context->expected_seq;
 
 	if (unlikely(expected_seq != skb_seq)) {
-		skb = mlx5e_tls_handle_ooo(context, sq, skb, wqe, pi);
+		skb = mlx5e_tls_handle_ooo(context, sq, skb, wqe, pi, priv->tls);
 		goto out;
 	}
 
 	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid))) {
+		atomic64_inc(&priv->tls->sw_stats.tx_tls_drop_metadata);
 		dev_kfree_skb_any(skb);
 		skb = NULL;
 		goto out;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
new file mode 100644
index 000000000000..01468ec27446
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
@@ -0,0 +1,89 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/ethtool.h>
+#include <net/sock.h>
+
+#include "en.h"
+#include "accel/tls.h"
+#include "fpga/sdk.h"
+#include "en_accel/tls.h"
+
+static const struct counter_desc mlx5e_tls_sw_stats_desc[] = {
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_metadata) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_resync_alloc) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_no_sync_data) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_bypass_required) },
+};
+
+#define MLX5E_READ_CTR_ATOMIC64(ptr, dsc, i) \
+	atomic64_read((atomic64_t *)((char *)(ptr) + (dsc)[i].offset))
+
+#define NUM_TLS_SW_COUNTERS ARRAY_SIZE(mlx5e_tls_sw_stats_desc)
+
+int mlx5e_tls_get_count(struct mlx5e_priv *priv)
+{
+	if (!priv->tls)
+		return 0;
+
+	return NUM_TLS_SW_COUNTERS;
+}
+
+int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data)
+{
+	unsigned int i, idx = 0;
+
+	if (!priv->tls)
+		return 0;
+
+	for (i = 0; i < NUM_TLS_SW_COUNTERS; i++)
+		strcpy(data + (idx++) * ETH_GSTRING_LEN,
+		       mlx5e_tls_sw_stats_desc[i].format);
+
+	return NUM_TLS_SW_COUNTERS;
+}
+
+int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data)
+{
+	int i, idx = 0;
+
+	if (!priv->tls)
+		return 0;
+
+	for (i = 0; i < NUM_TLS_SW_COUNTERS; i++)
+		data[idx++] =
+		    MLX5E_READ_CTR_ATOMIC64(&priv->tls->sw_stats,
+					    mlx5e_tls_sw_stats_desc, i);
+
+	return NUM_TLS_SW_COUNTERS;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index d4c397aec2ee..44cf09574926 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4220,12 +4220,16 @@ static void mlx5e_nic_init(struct mlx5_core_dev *mdev,
 	err = mlx5e_ipsec_init(priv);
 	if (err)
 		mlx5_core_err(mdev, "IPSec initialization failed, %d\n", err);
+	err = mlx5e_tls_init(priv);
+	if (err)
+		mlx5_core_err(mdev, "TLS initialization failed, %d\n", err);
 	mlx5e_build_nic_netdev(netdev);
 	mlx5e_vxlan_init(priv);
 }
 
 static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
 {
+	mlx5e_tls_cleanup(priv);
 	mlx5e_ipsec_cleanup(priv);
 	mlx5e_vxlan_cleanup(priv);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 81c1f383d682..a9800586b8d7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -32,6 +32,7 @@
 
 #include "en.h"
 #include "en_accel/ipsec.h"
+#include "en_accel/tls.h"
 
 static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_packets) },
@@ -971,6 +972,22 @@ static void mlx5e_grp_ipsec_update_stats(struct mlx5e_priv *priv)
 	mlx5e_ipsec_update_stats(priv);
 }
 
+static int mlx5e_grp_tls_get_num_stats(struct mlx5e_priv *priv)
+{
+	return mlx5e_tls_get_count(priv);
+}
+
+static int mlx5e_grp_tls_fill_strings(struct mlx5e_priv *priv, u8 *data,
+				      int idx)
+{
+	return idx + mlx5e_tls_get_strings(priv, data + idx * ETH_GSTRING_LEN);
+}
+
+static int mlx5e_grp_tls_fill_stats(struct mlx5e_priv *priv, u64 *data, int idx)
+{
+	return idx + mlx5e_tls_get_stats(priv, data + idx);
+}
+
 static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, bytes) },
@@ -1165,6 +1182,11 @@ const struct mlx5e_stats_grp mlx5e_stats_grps[] = {
 		.fill_stats = mlx5e_grp_ipsec_fill_stats,
 		.update_stats = mlx5e_grp_ipsec_update_stats,
 	},
+	{
+		.get_num_stats = mlx5e_grp_tls_get_num_stats,
+		.fill_strings = mlx5e_grp_tls_fill_strings,
+		.fill_stats = mlx5e_grp_tls_fill_stats,
+	},
 	{
 		.get_num_stats = mlx5e_grp_channels_get_num_stats,
 		.fill_strings = mlx5e_grp_channels_fill_strings,
-- 
2.14.3


* [PATCH V2 net-next 14/14] MAINTAINERS: Update TLS maintainers
From: Saeed Mahameed @ 2018-03-21 21:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Dave Watson, Boris Pismenny, Saeed Mahameed
In-Reply-To: <20180321210146.22537-1-saeedm@mellanox.com>

From: Boris Pismenny <borisp@mellanox.com>

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index cd4067ccf959..285ea4e6c580 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9711,7 +9711,7 @@ F:	net/netfilter/xt_CONNSECMARK.c
 F:	net/netfilter/xt_SECMARK.c
 
 NETWORKING [TLS]
-M:	Ilya Lesokhin <ilyal@mellanox.com>
+M:	Boris Pismenny <borisp@mellanox.com>
 M:	Aviad Yehezkel <aviadye@mellanox.com>
 M:	Dave Watson <davejwatson@fb.com>
 L:	netdev@vger.kernel.org
-- 
2.14.3


* Re: [PATCH V2 net-next 06/14] net/tls: Add generic NIC offload infrastructure
From: Eric Dumazet @ 2018-03-21 21:10 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <20180321210146.22537-7-saeedm@mellanox.com>



On 03/21/2018 02:01 PM, Saeed Mahameed wrote:
> From: Ilya Lesokhin <ilyal@mellanox.com>
> 
> This patch adds a generic infrastructure to offload TLS crypto to a

...

> +
> +static inline int tls_push_record(struct sock *sk,
> +				  struct tls_context *ctx,
> +				  struct tls_offload_context *offload_ctx,
> +				  struct tls_record_info *record,
> +				  struct page_frag *pfrag,
> +				  int flags,
> +				  unsigned char record_type)
> +{
> +	skb_frag_t *frag;
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	struct page_frag fallback_frag;
> +	struct page_frag  *tag_pfrag = pfrag;
> +	int i;
> +
> +	/* fill prepend */
> +	frag = &record->frags[0];
> +	tls_fill_prepend(ctx,
> +			 skb_frag_address(frag),
> +			 record->len - ctx->prepend_size,
> +			 record_type);
> +
> +	if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag, GFP_KERNEL))) {
> +		/* HW doesn't care about the data in the tag
> +		 * so in case pfrag has no room
> +		 * for a tag and we can't allocate a new pfrag
> +		 * just use the page in the first frag
> +		 * rather than write complicated fallback code.
> +		 */
> +		tag_pfrag = &fallback_frag;
> +		tag_pfrag->page = skb_frag_page(frag);
> +		tag_pfrag->offset = 0;
> +	}
> +

If HW does not care, why even try to call skb_page_frag_refill()?

If you remove it, we remove one seldom-used path and might uncover bugs.

This part looks very suspect to me, to be honest.
