Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Horia Geantă" <horia.geanta@freescale.com>
To: Kim Phillips <kim.phillips@freescale.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Martin Hicks <mort@bork.org>,
	Scott Wood <scottwood@freescale.com>,
	Kumar Gala <galak@kernel.crashing.org>,
	<linuxppc-dev@lists.ozlabs.org>, <linux-crypto@vger.kernel.org>
Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling
Date: Thu, 19 Mar 2015 17:56:57 +0200	[thread overview]
Message-ID: <550AF1C9.9090500@freescale.com> (raw)
In-Reply-To: <20150317170319.1e4cd2b1d1450b442a3b3b36@freescale.com>

On 3/18/2015 12:03 AM, Kim Phillips wrote:
> On Tue, 17 Mar 2015 19:58:55 +0200
> Horia Geantă <horia.geanta@freescale.com> wrote:
> 
>> On 3/17/2015 2:19 AM, Kim Phillips wrote:
>>> On Mon, 16 Mar 2015 12:02:51 +0200
>>> Horia Geantă <horia.geanta@freescale.com> wrote:
>>>
>>>> On 3/4/2015 2:23 AM, Kim Phillips wrote:
>>>>> Only potential problem is getting the crypto API to set the GFP_DMA
>>>>> flag in the allocation request, but presumably a
>>>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>>>>
>>>> Seems there are quite a few places that do not use the
>>>> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
>>>> Among them, IPsec and dm-crypt.
>>>> I've looked at the code and I don't think it can be converted to use
>>>> crypto API.
>>>
>>> why not?
>>
>> It would imply having 2 memory allocations, one for crypto request and
>> the other for the rest of the data bundled with the request (for IPsec
>> that would be ESN + space for IV + sg entries for authenticated-only
>> data and sk_buff extension, if needed).
>>
>> Trying to have a single allocation by making ESN, IV etc. part of the
>> request private context requires modifying tfm.reqsize on the fly.
>> This won't work without adding some kind of locking for the tfm.
> 
> can't a common minimum tfm.reqsize be co-established up front, at
> least for the fast path?

Indeed, for IPsec at tfm allocation time - esp_init_state() -
tfm.reqsize could be increased to account for what is known for a given
flow: ESN, IV and asg (S/G entries for authenticated-only data).
The layout would be:
aead request (fixed part)
private ctx of backend algorithm
seq_no_hi (if ESN)
IV
asg
sg <-- S/G table for skb_to_sgvec; how many entries is the question

Do you have a suggestion for how many S/G entries to preallocate for
representing the sk_buff data to be encrypted?
An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.
Btw, currently maximum number of fragments supported by the net stack
(MAX_SKB_FRAGS) is 16 or more.

>>>> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
>>>> places. Some of the maintainers do not agree, as you've seen.
>>>
>>> would modifying the crypto API to either have a different
>>> *_request_alloc() API, and/or adding calls to negotiate the GFP mask
>>> between crypto users and drivers, e.g., get/set_gfp_mask, work?
>>
>> I think what DaveM asked for was the change to be transparent.
>>
>> Besides converting to *_request_alloc(), seems that all other options
>> require some extra awareness from the user.
>> Could you elaborate on the idea above?
> 
> was merely suggesting communicating GFP flags anonymously across the
> API, i.e., GFP_DMA wouldn't appear in user code.

Meaning user would have to get_gfp_mask before allocating a crypto
request - i.e. instead of kmalloc(..., GFP_ATOMIC) to have
kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?

>>>> An alternative would be for talitos to use the page allocator to get 1 /
>>>> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
>>>> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
>>>> hw descriptors.
>>>> What do you think?
>>>
>>> There's a comment in esp_alloc_tmp(): "Use spare space in skb for
>>> this where possible," which is ideally where we'd want to be (esp.
>>
>> Ok, I'll check that. But note the "where possible" - finding room in the
>> skb to avoid the allocation won't always be the case, and then we're
>> back to square one.

So the skb cb is out of the question, being too small (48B).
Any idea what was the intention of the "TODO" - maybe to use the
tailroom in the skb data area?

>>> because that memory could already be DMA-able).  Your above
>>> suggestion would be in the opposite direction of that.
>>
>> The proposal:
>> -removes dma (un)mapping on the fast path
> 
> sure, but at the expense of additional complexity.

Right, there's no free lunch. But it's cheaper.

>> -avoids requesting dma mappable memory for more than it's actually
>> needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
>> only its private context)
> 
> compared to the payload?  Plus, we have plenty of DMA space these
> days.
> 
>> -for caam it has the added benefit of speeding the below search for the
>> offending descriptor in the SW ring from O(n) to O(1):
>> for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
>> 	sw_idx = (tail + i) & (JOBR_DEPTH - 1);
>>
>> 	if (jrp->outring[hw_idx].desc ==
>> 	    jrp->entinfo[sw_idx].desc_addr_dma)
>> 		break; /* found */
>> }
>> (drivers/crypto/caam/jr.c - caam_dequeue)
> 
> how?  The job ring h/w will still be spitting things out
> out-of-order.

jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
O(1):

dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
[...]
sw_idx = (desc_base - jrp->outring[hw_idx].desc) / JD_SIZE;

JD_SIZE would be 16 words (64B) - 13 words used for the h/w job
descriptor, 3 words can be used for smth. else.
Basically all JDs would be filled at a 64B-aligned offset in the memory
page.

> Plus, like I said, it's taking the problem in the wrong direction:
> we need to strive to merge the allocation and mapping with the upper
> layers as much as possible.

IMHO propagating the GFP_DMA from backend crypto implementations to
crypto API users doesn't seem feasable.
It's error-prone to audit all places that allocate crypto requests w/out
using *_request_alloc API.
And even if all these places would be identified:
-in some cases there's some heavy rework involved
-more places might show up in the future and there's no way to detect them

Horia

WARNING: multiple messages have this Message-ID (diff)

From: "Horia Geantă" <horia.geanta@freescale.com>
To: Kim Phillips <kim.phillips@freescale.com>
Cc: Martin Hicks <mort@bork.org>,
	linux-crypto@vger.kernel.org,
	Scott Wood <scottwood@freescale.com>,
	linuxppc-dev@lists.ozlabs.org,
	"David S. Miller" <davem@davemloft.net>
Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling
Date: Thu, 19 Mar 2015 17:56:57 +0200	[thread overview]
Message-ID: <550AF1C9.9090500@freescale.com> (raw)
In-Reply-To: <20150317170319.1e4cd2b1d1450b442a3b3b36@freescale.com>

On 3/18/2015 12:03 AM, Kim Phillips wrote:
> On Tue, 17 Mar 2015 19:58:55 +0200
> Horia Geantă <horia.geanta@freescale.com> wrote:
> 
>> On 3/17/2015 2:19 AM, Kim Phillips wrote:
>>> On Mon, 16 Mar 2015 12:02:51 +0200
>>> Horia Geantă <horia.geanta@freescale.com> wrote:
>>>
>>>> On 3/4/2015 2:23 AM, Kim Phillips wrote:
>>>>> Only potential problem is getting the crypto API to set the GFP_DMA
>>>>> flag in the allocation request, but presumably a
>>>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>>>>
>>>> Seems there are quite a few places that do not use the
>>>> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
>>>> Among them, IPsec and dm-crypt.
>>>> I've looked at the code and I don't think it can be converted to use
>>>> crypto API.
>>>
>>> why not?
>>
>> It would imply having 2 memory allocations, one for crypto request and
>> the other for the rest of the data bundled with the request (for IPsec
>> that would be ESN + space for IV + sg entries for authenticated-only
>> data and sk_buff extension, if needed).
>>
>> Trying to have a single allocation by making ESN, IV etc. part of the
>> request private context requires modifying tfm.reqsize on the fly.
>> This won't work without adding some kind of locking for the tfm.
> 
> can't a common minimum tfm.reqsize be co-established up front, at
> least for the fast path?

Indeed, for IPsec at tfm allocation time - esp_init_state() -
tfm.reqsize could be increased to account for what is known for a given
flow: ESN, IV and asg (S/G entries for authenticated-only data).
The layout would be:
aead request (fixed part)
private ctx of backend algorithm
seq_no_hi (if ESN)
IV
asg
sg <-- S/G table for skb_to_sgvec; how many entries is the question

Do you have a suggestion for how many S/G entries to preallocate for
representing the sk_buff data to be encrypted?
An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.
Btw, currently maximum number of fragments supported by the net stack
(MAX_SKB_FRAGS) is 16 or more.

>>>> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
>>>> places. Some of the maintainers do not agree, as you've seen.
>>>
>>> would modifying the crypto API to either have a different
>>> *_request_alloc() API, and/or adding calls to negotiate the GFP mask
>>> between crypto users and drivers, e.g., get/set_gfp_mask, work?
>>
>> I think what DaveM asked for was the change to be transparent.
>>
>> Besides converting to *_request_alloc(), seems that all other options
>> require some extra awareness from the user.
>> Could you elaborate on the idea above?
> 
> was merely suggesting communicating GFP flags anonymously across the
> API, i.e., GFP_DMA wouldn't appear in user code.

Meaning user would have to get_gfp_mask before allocating a crypto
request - i.e. instead of kmalloc(..., GFP_ATOMIC) to have
kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?

>>>> An alternative would be for talitos to use the page allocator to get 1 /
>>>> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
>>>> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
>>>> hw descriptors.
>>>> What do you think?
>>>
>>> There's a comment in esp_alloc_tmp(): "Use spare space in skb for
>>> this where possible," which is ideally where we'd want to be (esp.
>>
>> Ok, I'll check that. But note the "where possible" - finding room in the
>> skb to avoid the allocation won't always be the case, and then we're
>> back to square one.

So the skb cb is out of the question, being too small (48B).
Any idea what was the intention of the "TODO" - maybe to use the
tailroom in the skb data area?

>>> because that memory could already be DMA-able).  Your above
>>> suggestion would be in the opposite direction of that.
>>
>> The proposal:
>> -removes dma (un)mapping on the fast path
> 
> sure, but at the expense of additional complexity.

Right, there's no free lunch. But it's cheaper.

>> -avoids requesting dma mappable memory for more than it's actually
>> needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
>> only its private context)
> 
> compared to the payload?  Plus, we have plenty of DMA space these
> days.
> 
>> -for caam it has the added benefit of speeding the below search for the
>> offending descriptor in the SW ring from O(n) to O(1):
>> for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
>> 	sw_idx = (tail + i) & (JOBR_DEPTH - 1);
>>
>> 	if (jrp->outring[hw_idx].desc ==
>> 	    jrp->entinfo[sw_idx].desc_addr_dma)
>> 		break; /* found */
>> }
>> (drivers/crypto/caam/jr.c - caam_dequeue)
> 
> how?  The job ring h/w will still be spitting things out
> out-of-order.

jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
O(1):

dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
[...]
sw_idx = (desc_base - jrp->outring[hw_idx].desc) / JD_SIZE;

JD_SIZE would be 16 words (64B) - 13 words used for the h/w job
descriptor, 3 words can be used for smth. else.
Basically all JDs would be filled at a 64B-aligned offset in the memory
page.

> Plus, like I said, it's taking the problem in the wrong direction:
> we need to strive to merge the allocation and mapping with the upper
> layers as much as possible.

IMHO propagating the GFP_DMA from backend crypto implementations to
crypto API users doesn't seem feasable.
It's error-prone to audit all places that allocate crypto requests w/out
using *_request_alloc API.
And even if all these places would be identified:
-in some cases there's some heavy rework involved
-more places might show up in the future and there's no way to detect them

Horia

next prev parent reply	other threads:[~2015-03-19 15:57 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-03 13:21 [PATCH v2 0/5] crypto: talitos: Add crypto async queue handling Martin Hicks
2015-03-03 13:21 ` Martin Hicks
2015-03-03 13:21 ` [PATCH v2 1/5] crypto: talitos: Simplify per-channel initialization Martin Hicks
2015-03-03 13:21   ` Martin Hicks
2015-03-06  0:06   ` Kim Phillips
2015-03-06  0:06     ` Kim Phillips
2015-03-03 13:21 ` [PATCH v2 2/5] crypto: talitos: Remove MD5_BLOCK_SIZE Martin Hicks
2015-03-03 13:21   ` Martin Hicks
2015-03-06  0:07   ` Kim Phillips
2015-03-06  0:07     ` Kim Phillips
2015-03-06 12:02     ` Herbert Xu
2015-03-06 12:02       ` Herbert Xu
2015-03-03 13:21 ` [PATCH v2 3/5] crypto: talitos: Fix off-by-one and use all hardware slots Martin Hicks
2015-03-03 13:21   ` Martin Hicks
2015-03-04  0:35   ` Kim Phillips
2015-03-04  0:35     ` Kim Phillips
2015-03-04 14:46     ` Martin Hicks
2015-03-04 14:46       ` Martin Hicks
2015-03-03 13:21 ` [PATCH v2 4/5] crypto: talitos: Reorganize request submission data structures Martin Hicks
2015-03-03 13:21   ` Martin Hicks
2015-03-03 13:21 ` [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling Martin Hicks
2015-03-03 13:21   ` Martin Hicks
2015-03-04  0:23   ` Kim Phillips
2015-03-04  0:23     ` Kim Phillips
2015-03-05  9:35     ` Horia Geantă
2015-03-05  9:35       ` Horia Geantă
2015-03-06  0:34       ` Kim Phillips
2015-03-06  0:34         ` Kim Phillips
2015-03-06  4:48       ` Herbert Xu
2015-03-06  4:48         ` Herbert Xu
2015-03-09 12:08         ` Horia Geantă
2015-03-09 12:08           ` Horia Geantă
2015-03-16 10:02     ` Horia Geantă
2015-03-16 10:02       ` Horia Geantă
2015-03-17  0:19       ` Kim Phillips
2015-03-17  0:19         ` Kim Phillips
2015-03-17 17:58         ` Horia Geantă
2015-03-17 17:58           ` Horia Geantă
2015-03-17 22:03           ` Kim Phillips
2015-03-17 22:03             ` Kim Phillips
2015-03-19 15:56             ` Horia Geantă [this message]
2015-03-19 15:56               ` Horia Geantă
2015-03-19 18:38               ` Kim Phillips
2015-03-19 18:38                 ` Kim Phillips

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=550AF1C9.9090500@freescale.com \
    --to=horia.geanta@freescale.com \
    --cc=davem@davemloft.net \
    --cc=galak@kernel.crashing.org \
    --cc=kim.phillips@freescale.com \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mort@bork.org \
    --cc=scottwood@freescale.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.