* Looking for non-NIC hardware-offload for wpa2 decrypt.
From: Ben Greear @ 2014-03-31 4:40 UTC
To: linux-wireless@vger.kernel.org

Hello!

Due to hardware/firmware limitations, it does not appear possible to
have a wifi NIC do hardware decrypt when using multiple stations on a single
NIC (and have both stations connected to the same AP).

This just happens to be one of my favourite things to do, and it kills
performance compared to normal 'Open' throughput.

I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
using a specialized hardware board or maybe a feature of certain CPUs?

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
From: Christian Lamparter @ 2014-03-31 18:09 UTC
To: Ben Greear
Cc: linux-wireless@vger.kernel.org

Hello,

On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
> Due to hardware/firmware limitations, it does not appear possible to
> have a wifi NIC do hardware decrypt when using multiple stations on a single
> NIC (and have both stations connected to the same AP).
>
> This just happens to be one of my favourite things to do, and it kills
> performance compared to normal 'Open' throughput.
>
> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
> using a specialized hardware board or maybe a feature of certain CPUs?

You could check if your CPU (bios and kernel) have support for AES-NI [0].
AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
the proper crypto bindings can be used to accelerate the encryption and
decryption process to some degree. And it just happens that thanks to
AES-NI parts of math can be efficiently calculated by the CPU.

Regards,
Chr

[0] <http://en.wikipedia.org/wiki/AES_instruction_set>

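On Linux/x86, one quick way to check the CPU side of this is to look for the `aes` entry in the `flags` line of /proc/cpuinfo. The following is only a minimal sketch of that check: `cpu_flags_have()` is a hypothetical helper written for illustration (not a kernel or libc function), and it only tells you the CPU advertises the instructions, not that the kernel's accelerated driver is actually in use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if `name` appears as a whole, whitespace-separated token in
 * `flags` (the text after "flags :" in /proc/cpuinfo), otherwise 0.
 * Hypothetical helper for illustration only.
 */
int cpu_flags_have(const char *flags, const char *name)
{
    size_t n = strlen(name);
    const char *p = flags;

    assert(n > 0); /* an empty flag name makes no sense */

    while ((p = strstr(p, name)) != NULL) {
        /* Accept the match only if it is delimited on both sides. */
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' ||
                   p[n] == '\n';
        if (starts && ends)
            return 1;
        p += 1; /* keep searching past this partial match */
    }
    return 0;
}
```

Matching whole tokens matters here, since a plain substring search for `aes` would also hit unrelated flag names that merely contain those letters.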
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
From: Ben Greear @ 2014-07-28 20:50 UTC
To: Christian Lamparter
Cc: linux-wireless@vger.kernel.org

On 03/31/2014 11:09 AM, Christian Lamparter wrote:
> Hello,
>
> On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
>> Due to hardware/firmware limitations, it does not appear possible to
>> have a wifi NIC do hardware decrypt when using multiple stations on a single
>> NIC (and have both stations connected to the same AP).
>>
>> This just happens to be one of my favourite things to do, and it kills
>> performance compared to normal 'Open' throughput.
>>
>> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
>> using a specialized hardware board or maybe a feature of certain CPUs?
>
> You could check if your CPU (bios and kernel) have support for AES-NI [0].
> AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
> the proper crypto bindings can be used to accelerate the encryption and
> decryption process to some degree. And it just happens that thanks to
> AES-NI parts of math can be efficiently calculated by the CPU.

I recently took a look at this again, and the Intel E5 I'm using
does use the aesni instructions/driver as far as I can tell.

Throughput is still around 500Mbps where open is around 800Mbps.

perf top shows this:

Samples: 37K of event 'cycles', Event count (approx.): 19360716192
 12.01%  [kernel]  [k] math_state_restore
 11.64%  [kernel]  [k] _aesni_enc1
  8.25%  [kernel]  [k] __save_init_fpu
  2.44%  [kernel]  [k] crypto_xor
  1.87%  [kernel]  [k] irq_fpu_usable
  1.30%  [kernel]  [k] aes_encrypt
  0.76%  [kernel]  [k] __kernel_fpu_end
....

Any other magic add-in cards that would somehow just make this all faster w/out
having to do any real programming work? :)

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
From: Christian Lamparter @ 2014-07-29 22:29 UTC
To: Ben Greear
Cc: linux-wireless@vger.kernel.org

On Monday, July 28, 2014 01:50:22 PM Ben Greear wrote:
> On 03/31/2014 11:09 AM, Christian Lamparter wrote:
> > Hello,
> >
> > On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
> >> Due to hardware/firmware limitations, it does not appear possible to
> >> have a wifi NIC do hardware decrypt when using multiple stations on a single
> >> NIC (and have both stations connected to the same AP).
> >>
> >> This just happens to be one of my favourite things to do, and it kills
> >> performance compared to normal 'Open' throughput.
> >>
> >> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
> >> using a specialized hardware board or maybe a feature of certain CPUs?
> >
> > You could check if your CPU (bios and kernel) have support for AES-NI [0].
> > AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
> > the proper crypto bindings can be used to accelerate the encryption and
> > decryption process to some degree. And it just happens that thanks to
> > AES-NI parts of math can be efficiently calculated by the CPU.
>
> I recently took a look at this again, and the Intel E5 I'm using
> does use the aesni instructions/driver as far as I can tell.
Which E5 exactly? There are many different E5.

> Throughput is still around 500Mbps where open is around 800Mbps.
I can't test ath10k or your multiple station on a single NIC thing. But
can you run a test for a "simple" single station - single AP wpa2 setup?
I want to know how close to the 800Mbps it actually goes.

> perf top shows this:
>
> Samples: 37K of event 'cycles', Event count (approx.): 19360716192
>  12.01%  [kernel]  [k] math_state_restore
>  11.64%  [kernel]  [k] _aesni_enc1
>   8.25%  [kernel]  [k] __save_init_fpu
>   2.44%  [kernel]  [k] crypto_xor
>   1.87%  [kernel]  [k] irq_fpu_usable
>   1.30%  [kernel]  [k] aes_encrypt
>   0.76%  [kernel]  [k] __kernel_fpu_end
> ....
Yes, aesni is doing some of the heavy lifting! But in your original post,
you said you are interested in accelerating rx-decrypt... Now it's about
encryption offload?! [please make up your mind :-D]

That being said 12.01% (math_state_restore -
called by kernel_fpu_end) and 8.25% (__save_init_fpu - called
by kernel_fpu_begin) cycles are wasted due to fpu save and
restore overhead. [You have noticed that before, didn't you ;-) ]

I think part of the poor performance is due to the design of
aes_encrypt in arch/x86/crypto/aesni-intel_glue.c:

> static void aes_encrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src)
> {
> 	struct crypto_aes_ctx *ctx = aes_ctx(crypto_tfm_ctx(tfm));
> 	[...]
> 		kernel_fpu_begin();
> 		aesni_enc(ctx, dst, src);
> 		kernel_fpu_end();
> 	[...]
> }

Ideally you would want something like:

> 	kernel_fpu_begin();
> 	aesni_enc(ctx, dst_frame1, src_frame1);
> 	aesni_enc(ctx, dst_frame2, src_frame2);
> 	...
> 	aesni_enc(ctx, dst_frameN, src_frameN);
> 	kernel_fpu_end();

But getting there might not be easy and involve more than a bit
of "real programming".

In theory, it should be enough to test if there is some potential
in this approach by "enhancing" the tx-path in the following way:

1. the fpu_begin and fpu_end calls should be added to
   ieee80211_crypto_ccmp_encrypt in net/mac80211/wpa.c.

>+	kernel_fpu_begin();
> 	skb_queue_walk(&tx->skbs, skb) {
> 		if (ccmp_encrypt_skb(tx, skb) < 0)
> 			return TX_DROP;
> 	}
>+	kernel_fpu_end();
>
> 	return TX_CONTINUE;

2. ieee80211_aes_ccm_encrypt in net/mac80211/aes_ccm.c
   has to call __aes_encrypt instead of aes_encrypt in crypto_aead_encrypt.
   [I can't think of a sane way to make this work. Of course, it's possible to
   make a copy of ccm(aes) crypto_alg* and overwrite aes_encrypt with
   __aes_encrypt. But that's not very nice... (It should work though) ]

> Any other magic add-in cards that would somehow just make this all faster w/out
> having to do any real programming work? :)
I doubt there is a magic add-in card for such a use-case. I think most of
them target directly applications/libraries and not the crypto-kernel
interface mac80211 is using.

[It would be really nice to know what E5 you actually have]

Regards
Christian

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
From: Ben Greear @ 2014-07-29 22:50 UTC
To: Christian Lamparter
Cc: linux-wireless@vger.kernel.org

On 07/29/2014 03:29 PM, Christian Lamparter wrote:
> On Monday, July 28, 2014 01:50:22 PM Ben Greear wrote:
>> On 03/31/2014 11:09 AM, Christian Lamparter wrote:
>>> Hello,
>>>
>>> On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
>>>> Due to hardware/firmware limitations, it does not appear possible to
>>>> have a wifi NIC do hardware decrypt when using multiple stations on a single
>>>> NIC (and have both stations connected to the same AP).
>>>>
>>>> This just happens to be one of my favourite things to do, and it kills
>>>> performance compared to normal 'Open' throughput.
>>>>
>>>> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
>>>> using a specialized hardware board or maybe a feature of certain CPUs?
>>>
>>> You could check if your CPU (bios and kernel) have support for AES-NI [0].
>>> AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
>>> the proper crypto bindings can be used to accelerate the encryption and
>>> decryption process to some degree. And it just happens that thanks to
>>> AES-NI parts of math can be efficiently calculated by the CPU.
>>
>> I recently took a look at this again, and the Intel E5 I'm using
>> does use the aesni instructions/driver as far as I can tell.
> Which E5 exactly? There are many different E5.
>
>> Throughput is still around 500Mbps where open is around 800Mbps.
> I can't test ath10k or your multiple station on a single NIC thing. But
> can you run a test for a "simple" single station - single AP wpa2 setup?
> I want to know how close to the 800Mbps it actually goes.
>
>> perf top shows this:
>>
>> Samples: 37K of event 'cycles', Event count (approx.): 19360716192
>>  12.01%  [kernel]  [k] math_state_restore
>>  11.64%  [kernel]  [k] _aesni_enc1
>>   8.25%  [kernel]  [k] __save_init_fpu
>>   2.44%  [kernel]  [k] crypto_xor
>>   1.87%  [kernel]  [k] irq_fpu_usable
>>   1.30%  [kernel]  [k] aes_encrypt
>>   0.76%  [kernel]  [k] __kernel_fpu_end
>> ....
> Yes, aesni is doing some of the heavy lifting! But in your original post,
> you said you are interested in accelerating rx-decrypt... Now it's about
> encryption offload?! [please make up your mind :-D]

The perf top results above are from receiving (and decoding) wpa2 wifi
frames that were not decoded by the NIC because NIC rx-decrypt logic was
disabled. I think this means I want to accelerate the rx-decrypt.

Transmit is not a problem for me because I can make the NIC do the
encryption in its hardware.

My E5 is:

[root@ct525-2u-3ac-3n]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-1660 v2 @ 3.70GHz
stepping        : 4
microcode       : 0x427
cpu MHz         : 2163.054
cache size      : 15360 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 7400.31
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

.... 11 more entries.

Thanks for the suggestions below. I have managed to find yet another
way to crash my firmware so I have to pay attention to that for a bit,
but will look into that decrypt code in more detail when I get a chance.

Thanks,
Ben

> That being said 12.01% (math_state_restore -
> called by kernel_fpu_end) and 8.25% (__save_init_fpu - called
> by kernel_fpu_begin) cycles are wasted due to fpu save and
> restore overhead. [You have noticed that before, didn't you ;-) ]
>
> I think part of the poor performance is due to the design of
> aes_encrypt in arch/x86/crypto/aesni-intel_glue.c:
>
>> static void aes_encrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src)
>> {
>> 	struct crypto_aes_ctx *ctx = aes_ctx(crypto_tfm_ctx(tfm));
>> 	[...]
>> 		kernel_fpu_begin();
>> 		aesni_enc(ctx, dst, src);
>> 		kernel_fpu_end();
>> 	[...]
>> }
>
> Ideally you would want something like:
>
>> 	kernel_fpu_begin();
>> 	aesni_enc(ctx, dst_frame1, src_frame1);
>> 	aesni_enc(ctx, dst_frame2, src_frame2);
>> 	...
>> 	aesni_enc(ctx, dst_frameN, src_frameN);
>> 	kernel_fpu_end();
>
> But getting there might not be easy and involve more than a bit
> of "real programming".
>
> In theory, it should be enough to test if there is some potential
> in this approach by "enhancing" the tx-path in the following way:
>
> 1. the fpu_begin and fpu_end calls should be added to
>    ieee80211_crypto_ccmp_encrypt in net/mac80211/wpa.c.
>
>> +	kernel_fpu_begin();
>> 	skb_queue_walk(&tx->skbs, skb) {
>> 		if (ccmp_encrypt_skb(tx, skb) < 0)
>> 			return TX_DROP;
>> 	}
>> +	kernel_fpu_end();
>>
>> 	return TX_CONTINUE;
>
> 2. ieee80211_aes_ccm_encrypt in net/mac80211/aes_ccm.c
>    has to call __aes_encrypt instead of aes_encrypt in crypto_aead_encrypt.
>    [I can't think of a sane way to make this work. Of course, it's possible to
>    make a copy of ccm(aes) crypto_alg* and overwrite aes_encrypt with
>    __aes_encrypt. But that's not very nice... (It should work though) ]
>
>> Any other magic add-in cards that would somehow just make this all faster w/out
>> having to do any real programming work? :)
> I doubt there is a magic add-in card for such a use-case. I think most of
> them target directly applications/libraries and not the crypto-kernel
> interface mac80211 is using.
>
> [It would be really nice to know what E5 you actually have]
>
> Regards
> Christian
>

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
From: Christian Lamparter @ 2014-07-30 18:59 UTC
To: Ben Greear
Cc: linux-wireless@vger.kernel.org, Johannes Berg

On Tuesday, July 29, 2014 03:50:40 PM Ben Greear wrote:
> On 07/29/2014 03:29 PM, Christian Lamparter wrote:
> > On Monday, July 28, 2014 01:50:22 PM Ben Greear wrote:
> >> On 03/31/2014 11:09 AM, Christian Lamparter wrote:
> >>> Hello,
> >>>
> >>> On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
> >>>> Due to hardware/firmware limitations, it does not appear possible to
> >>>> have a wifi NIC do hardware decrypt when using multiple stations on a single
> >>>> NIC (and have both stations connected to the same AP).
> >>>>
> >>>> This just happens to be one of my favourite things to do, and it kills
> >>>> performance compared to normal 'Open' throughput.
> >>>>
> >>>> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
> >>>> using a specialized hardware board or maybe a feature of certain CPUs?
> >>>
> >>> You could check if your CPU (bios and kernel) have support for AES-NI [0].
> >>> AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
> >>> the proper crypto bindings can be used to accelerate the encryption and
> >>> decryption process to some degree. And it just happens that thanks to
> >>> AES-NI parts of math can be efficiently calculated by the CPU.
> >>
> >> I recently took a look at this again, and the Intel E5 I'm using
> >> does use the aesni instructions/driver as far as I can tell.
> > Which E5 exactly? There are many different E5.

> model name : Intel(R) Xeon(R) CPU E5-1660 v2 @ 3.70GHz
> stepping : 4
> microcode : 0x427
> cpu MHz : 2163.054
Thanks. 500Mbps should not be an issue though. At 3.70GHz one single
core should be able to encrypt/decrypt several Gbps.

> >> Throughput is still around 500Mbps where open is around 800Mbps.
> > I can't test ath10k or your multiple station on a single NIC thing. But
> > can you run a test for a "simple" single station - single AP wpa2 setup?
> > I want to know how close to the 800Mbps it actually goes.
Any data for the single station, single AP, wpa2 setup? I would like to know
what ath10k is able to achieve in this case.

> >> perf top shows this:
> >>
> >> Samples: 37K of event 'cycles', Event count (approx.): 19360716192
> >>  12.01%  [kernel]  [k] math_state_restore
> >>  11.64%  [kernel]  [k] _aesni_enc1
> >>   8.25%  [kernel]  [k] __save_init_fpu
> >>   2.44%  [kernel]  [k] crypto_xor
> >>   1.87%  [kernel]  [k] irq_fpu_usable
> >>   1.30%  [kernel]  [k] aes_encrypt
> >>   0.76%  [kernel]  [k] __kernel_fpu_end
> >> ....
> > Yes, aesni is doing some of the heavy lifting! But in your original post,
> > you said you are interested in accelerating rx-decrypt... Now it's about
> > encryption offload?! [please make up your mind :-D]
>
> The perf top results above are from receiving (and decoding) wpa2 wifi
> frames that were not decoded by the NIC because NIC rx-decrypt logic was
> disabled. I think this means I want to accelerate the rx-decrypt.
Wait.

If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
or aes_decrypt listed in the perf top result? I think they should be. Have you
removed them from the "perf top results" or are they really absent
altogether?

Because, from this perf result, it looks like your CPU is not burdened by the
incoming RX at all?! Instead it is busy with the encryption of frames
it will be transmitting (in case of tcp, this could be tcp acks).

It could be that I missed something important about the setup.
For example, I assumed that you have a dedicated 802.11ac AP
and the perf results are coming from the E5 machine with the ath10k
in multi-station mode. The AP would be transmitting, whereas
the E5 would be receiving. Is this assumption correct or not?

> Transmit is not a problem for me because I can make the NIC do the
> encryption in its hardware.

> Thanks for the suggestions below. I have managed to find yet another
> way to crash my firmware so I have to pay attention to that for a bit,
> but will look into that decrypt code in more detail when I get a chance.
Yeah, but don't bother with the suggestions. Johannes pointed out
"that this would be mostly useless afaict as the list is only iterated
if you have software fragmentation." Furthermore, they only covered
the ENcryption process of the TX path and not the DEcryption
part of the RX path (which is what you are interested in).

Regards
Christian

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
From: Ben Greear @ 2014-07-30 19:08 UTC
To: Christian Lamparter
Cc: linux-wireless@vger.kernel.org, Johannes Berg

On 07/30/2014 11:59 AM, Christian Lamparter wrote:
> On Tuesday, July 29, 2014 03:50:40 PM Ben Greear wrote:
>> On 07/29/2014 03:29 PM, Christian Lamparter wrote:
>>> On Monday, July 28, 2014 01:50:22 PM Ben Greear wrote:
>>>> On 03/31/2014 11:09 AM, Christian Lamparter wrote:
>>>>> Hello,
>>>>>
>>>>> On Sunday, March 30, 2014 09:40:24 PM Ben Greear wrote:
>>>>>> Due to hardware/firmware limitations, it does not appear possible to
>>>>>> have a wifi NIC do hardware decrypt when using multiple stations on a single
>>>>>> NIC (and have both stations connected to the same AP).
>>>>>>
>>>>>> This just happens to be one of my favourite things to do, and it kills
>>>>>> performance compared to normal 'Open' throughput.
>>>>>>
>>>>>> I am curious if anyone knows of any way to accelerate rx-decrypt, perhaps by
>>>>>> using a specialized hardware board or maybe a feature of certain CPUs?
>>>>>
>>>>> You could check if your CPU (bios and kernel) have support for AES-NI [0].
>>>>> AFAICT mac80211 utilizes the cryptoapi. Therefore anything that supports
>>>>> the proper crypto bindings can be used to accelerate the encryption and
>>>>> decryption process to some degree. And it just happens that thanks to
>>>>> AES-NI parts of math can be efficiently calculated by the CPU.
>>>>
>>>> I recently took a look at this again, and the Intel E5 I'm using
>>>> does use the aesni instructions/driver as far as I can tell.
>>> Which E5 exactly? There are many different E5.
>
>> model name : Intel(R) Xeon(R) CPU E5-1660 v2 @ 3.70GHz
>> stepping : 4
>> microcode : 0x427
>> cpu MHz : 2163.054
> Thanks. 500Mbps should not be an issue though. At 3.70GHz one single
> core should be able to encrypt/decrypt several Gbps.
>
>>>> Throughput is still around 500Mbps where open is around 800Mbps.
>>> I can't test ath10k or your multiple station on a single NIC thing. But
>>> can you run a test for a "simple" single station - single AP wpa2 setup?
>>> I want to know how close to the 800Mbps it actually goes.
> Any data for the single station, single AP, wpa2 setup? I would like to know
> what ath10k is able to achieve in this case.

I will run this when I get a chance and let you know. But, exact same
setup (same number of stations, etc), but just with open authentication,
runs 800+Mbps.

>>>> perf top shows this:
>>>>
>>>> Samples: 37K of event 'cycles', Event count (approx.): 19360716192
>>>>  12.01%  [kernel]  [k] math_state_restore
>>>>  11.64%  [kernel]  [k] _aesni_enc1
>>>>   8.25%  [kernel]  [k] __save_init_fpu
>>>>   2.44%  [kernel]  [k] crypto_xor
>>>>   1.87%  [kernel]  [k] irq_fpu_usable
>>>>   1.30%  [kernel]  [k] aes_encrypt
>>>>   0.76%  [kernel]  [k] __kernel_fpu_end
>>>> ....
>>> Yes, aesni is doing some of the heavy lifting! But in your original post,
>>> you said you are interested in accelerating rx-decrypt... Now it's about
>>> encryption offload?! [please make up your mind :-D]
>>
>> The perf top results above are from receiving (and decoding) wpa2 wifi
>> frames that were not decoded by the NIC because NIC rx-decrypt logic was
>> disabled. I think this means I want to accelerate the rx-decrypt.
> Wait.
>
> If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
> or aes_decrypt listed in the perf top result? I think they should be. Have you
> removed them from the "perf top results" or are they really absent
> altogether?
>
> Because, from this perf result, it looks like your CPU is not burdened by the
> incoming RX at all?! Instead it is busy with the encryption of frames
> it will be transmitting (in case of tcp, this could be tcp acks).
>
> It could be that I missed something important about the setup.
> For example, I assumed that you have a dedicated 802.11ac AP
> and the perf results are coming from the E5 machine with the ath10k
> in multi-station mode. The AP would be transmitting, whereas
> the E5 would be receiving. Is this assumption correct or not?

My setup is where AP is transmitting and E5 is receiving. Test case is
UDP, so very little upstream traffic.

I did not trim anything off the top of perf top, and did not notice any
other aesni calls listed.

I do not particularly know why it is doing _aesni_enc1, I had assumed that
was how it decoded. I will double-check all of this and try to figure out
why it is calling the enc1 instead of dec1 logic. Possibly I have something
that is actually configured differently than I think it is.

Also, good to hear my E5 should be able to handle higher speeds, gives me
something to hope for :)

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
From: Jouni Malinen @ 2014-07-31 20:05 UTC
To: Christian Lamparter
Cc: Ben Greear, linux-wireless@vger.kernel.org, Johannes Berg

On Wed, Jul 30, 2014 at 08:59:33PM +0200, Christian Lamparter wrote:
> If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
> or aes_decrypt listed in the perf top result? I think they should be. Have you
> removed them from the "perf top results" or are they really absent
> altogether?
>
> Because, from this perf result, it looks like your CPU is not burdened by the
> incoming RX at all?! Instead it is busy with the encryption of frames
> it will be transmitting (in case of tcp, this could be tcp acks).

Keep in mind that this is CCMP, i.e., AES in CCM (Counter with CBC-MAC)
mode. The CCM mode uses only the block cipher encryption function, i.e.,
you won't be seeing aes_decrypt or _aesni_dec1 for this even on the RX
path (AES encryption operations are used to generate the key stream
blocks for CCM decryption).

--
Jouni Malinen                                            PGP id EFC895FA

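The point about CCM can be illustrated with the counter-mode half of the construction. The sketch below is a simplified model, not the mac80211 or cryptoapi code: `toy_block_encrypt()` is a made-up, insecure stand-in for the AES block encryption function, the counter block layout is simplified rather than the real CCM layout, and the CBC-MAC part of CCM is left out. What it shows is that one routine serves for both directions, and that routine only ever calls the block cipher's encrypt function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Toy stand-in for the AES block *encryption* function. NOT AES and NOT
 * secure; counter mode only needs some keyed block function here. */
static void toy_block_encrypt(const uint8_t key[BLOCK],
                              const uint8_t in[BLOCK], uint8_t out[BLOCK])
{
    for (int i = 0; i < BLOCK; i++) {
        uint8_t v = (uint8_t)(in[i] ^ key[i]);
        out[i] = (uint8_t)(((v << 3) | (v >> 5)) ^ key[(i + 5) % BLOCK]);
    }
}

/* Counter mode: XOR the data with E(key, nonce || counter).
 * Calling it once encrypts, calling it again with the same key and nonce
 * decrypts. No block *decrypt* function exists anywhere in this file. */
void ctr_crypt(const uint8_t key[BLOCK], const uint8_t nonce[BLOCK - 4],
               uint8_t *data, size_t len)
{
    uint8_t ctr[BLOCK], ks[BLOCK];
    uint32_t counter = 1;

    assert(data != NULL || len == 0);

    for (size_t off = 0; off < len; off += BLOCK, counter++) {
        /* Build the counter block: 12-byte nonce + 32-bit big-endian counter. */
        memcpy(ctr, nonce, BLOCK - 4);
        ctr[12] = (uint8_t)(counter >> 24);
        ctr[13] = (uint8_t)(counter >> 16);
        ctr[14] = (uint8_t)(counter >> 8);
        ctr[15] = (uint8_t)counter;

        /* Key stream block comes from the *encrypt* direction only. */
        toy_block_encrypt(key, ctr, ks);

        for (size_t i = 0; i < BLOCK && off + i < len; i++)
            data[off + i] ^= ks[i];
    }
}
```

Under this model, a receive-side profile dominated by an encrypt primitive plus XOR and counter-mode helpers is what one would expect, since the receiver regenerates the same key stream the sender used.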
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
From: Christian Lamparter @ 2014-07-31 20:45 UTC
To: Jouni Malinen
Cc: Ben Greear, linux-wireless@vger.kernel.org, Johannes Berg

On Thursday, July 31, 2014 11:05:22 PM Jouni Malinen wrote:
> On Wed, Jul 30, 2014 at 08:59:33PM +0200, Christian Lamparter wrote:
> > If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
> > or aes_decrypt listed in the perf top result? I think they should be. Have you
> > removed them from the "perf top results" or are they really absent
> > altogether?
> >
> > Because, from this perf result, it looks like your CPU is not burdened by the
> > incoming RX at all?! Instead it is busy with the encryption of frames
> > it will be transmitting (in case of tcp, this could be tcp acks).
>
> Keep in mind that this is CCMP, i.e., AES in CCM (Counter with CBC-MAC)
> mode. The CCM mode uses only the block cipher encryption function, i.e.,
> you won't be seeing aes_decrypt or _aesni_dec1 for this even on the RX
> path (AES encryption operations are used to generate the key stream
> blocks for CCM decryption).
Yes, I remember this detail/the old days (before 3.12/3.13?). Back then
ieee80211_aes_ccm_decrypt did exactly that. But these semantic pitfalls
were taken care of by the following commit:

commit 7ec7c4a9a686c608315739ab6a2b0527a240883c (from wireless-testing.git)
Author: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Date:   Thu Oct 10 09:55:20 2013 +0200

    mac80211: port CCMP to cryptoapi's CCM driver

    Use the generic CCM aead chaining mode driver rather than a local
    implementation that sits right on top of the core AES cipher.

    This allows the use of accelerated implementations of either
    CCM as a whole or the CTR mode which it encapsulates.

    [...]

Regards
Christian

* Re: Looking for non-NIC hardware-offload for wpa2 decrypt. 2014-07-31 20:45 ` Christian Lamparter @ 2014-08-05 23:09 ` Ben Greear 2014-08-07 14:05 ` Christian Lamparter 0 siblings, 1 reply; 21+ messages in thread From: Ben Greear @ 2014-08-05 23:09 UTC (permalink / raw) To: Christian Lamparter Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg On 07/31/2014 01:45 PM, Christian Lamparter wrote: > On Thursday, July 31, 2014 11:05:22 PM Jouni Malinen wrote: >> On Wed, Jul 30, 2014 at 08:59:33PM +0200, Christian Lamparter wrote: >>> If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1 >>> or aes_decrypt listed in the perf top result? I think they should be. Have you >>> removed them from the "perf top results" or are they really absent >>> altogether? >>> >>> Because, from this perf result, it looks like your CPU is not burden by the >>> incoming RX at all?! Instead it is busy with the encryption of frames >>> it will be transmitting (in case of tcp, this could be tcp acks). >> >> Keep in mind that this is CCMP, i.e., AES in CCM (Counter with CBC-MAC) >> mode. The CCM mode uses only the block cipher encryption function, i.e., >> you won't be seeing aes_decrypt or _aesni_dec1 for this even on the RX >> path (AES encryption operations are used to generate the key stream >> blocks for CCM decryption). > Yes, I remember this detail/the old days (before 3.12/3.13?). Back then > ieee80211_aes_ccm_decrypt did exactly that. But these semantic pitfalls > were taken care of by the following commit: > > commit 7ec7c4a9a686c608315739ab6a2b0527a240883c (from wireless-testing.git) > Author: Ard Biesheuvel <ard.biesheuvel@linaro.org> > Date: Thu Oct 10 09:55:20 2013 +0200 This patch is in my tree (I'm using 3.14.14 kernel currently). Here is a perf top from a different machine, with single wlan interface running UDP download (btserver is user-space app that is generating/receiving the traffic). 
I can do about 200Mbps download with WPA2 encryption enabled on this machine, and ksoftirqd is using about 76% of a core according to top. Samples: 154K of event 'cycles', Event count (approx.): 45228404083 9.92% [kernel] [k] __lock_acquire 7.79% btserver [.] 0x0000000000349d73 6.44% [kernel] [k] math_state_restore 5.47% [kernel] [k] _aesni_enc1 4.36% [kernel] [k] fpu_save_init 2.88% [kernel] [k] arch_local_save_flags 2.68% [kernel] [k] arch_local_irq_restore 2.29% [kernel] [k] lock_release 1.80% [kernel] [k] mark_lock 1.58% [kernel] [k] lock_acquire 1.50% [kernel] [k] irq_fpu_usable 1.43% [kernel] [k] crypto_xor 1.35% [kernel] [k] mark_held_locks 1.30% [kernel] [k] fib_rules_lookup 1.25% [kernel] [k] hlock_class 1.20% [kernel] [k] trace_hardirqs_on_caller 0.99% [kernel] [k] copy_user_generic_string 0.92% [kernel] [k] __netif_receive_skb_core 0.88% [kernel] [k] trace_hardirqs_off_caller 0.87% [kernel] [k] arch_local_irq_save 0.85% [kernel] [k] dev_queue_xmit_nit 0.84% [kernel] [k] aes_encrypt 0.59% [kernel] [k] do_raw_spin_lock 0.55% [kernel] [k] get_data_to_compute 0.53% [kernel] [k] __rcu_read_unlock 0.52% [kernel] [k] crypto_ctr_crypt A second test where the station machine was not generating to itself (ie, tx on Ethernet, to AP, receive back on wlan), but only receiving traffic from the AP, shows this perf top: Samples: 126K of event 'cycles', Event count (approx.): 29019221373 10.74% [kernel] [k] math_state_restore 10.50% btserver [.] 
0x000000000033260d 9.00% [kernel] [k] _aesni_enc1 7.33% [kernel] [k] fpu_save_init 6.70% [kernel] [k] __lock_acquire 2.46% [kernel] [k] irq_fpu_usable 2.34% [kernel] [k] crypto_xor 1.88% [kernel] [k] arch_local_save_flags 1.83% [kernel] [k] arch_local_irq_restore 1.58% [kernel] [k] lock_release 1.48% [kernel] [k] aes_encrypt 1.27% [kernel] [k] mark_lock 1.12% [kernel] [k] lock_acquire 1.02% [kernel] [k] mark_held_locks 0.96% [kernel] [k] trace_hardirqs_on_caller 0.93% [kernel] [k] get_data_to_compute 0.83% [kernel] [k] hlock_class 0.81% [kernel] [k] __kernel_fpu_begin 0.81% [kernel] [k] crypto_ctr_crypt 0.80% [kernel] [k] crypto_inc [greearb@ben-dt2 linux-3.14.x64]$ grep CCM .config CONFIG_LIB80211_CRYPT_CCMP=m # CONFIG_RTLLIB_CRYPTO_CCMP is not set CONFIG_CRYPTO_CCM=y [greearb@ben-dt2 linux-3.14.x64]$ [root@ct523-9292 lanforge]# cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2655LE CPU @ 2.20GHz stepping : 7 microcode : 0x28 cpu MHz : 2200.000 cache size : 4096 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4390.31 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: Out of curiosity, might it help to prefetch the entire skb when getting it from the NIC, since we are about to have to read it all to do the decrypt? Any idea how to prefetch the skb? 
Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-05 23:09 ` Ben Greear
@ 2014-08-07 14:05 ` Christian Lamparter
  2014-08-07 17:45 ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-08-07 14:05 UTC
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On Tuesday, August 05, 2014 04:09:42 PM Ben Greear wrote:
> On 07/31/2014 01:45 PM, Christian Lamparter wrote:
> > On Thursday, July 31, 2014 11:05:22 PM Jouni Malinen wrote:
> >> On Wed, Jul 30, 2014 at 08:59:33PM +0200, Christian Lamparter wrote:
> >>> If you have disabled rx-decrypt logic of ath10k, then why isn't _aesni_dec1
> >>> or aes_decrypt listed in the perf top result? I think they should be. Have you
> >>> removed them from the "perf top results" or are they really absent
> >>> altogether?
> >>>
> >>> Because, from this perf result, it looks like your CPU is not burden by the
> >>> incoming RX at all?! Instead it is busy with the encryption of frames
> >>> it will be transmitting (in case of tcp, this could be tcp acks).
> >>
> >> Keep in mind that this is CCMP, i.e., AES in CCM (Counter with CBC-MAC)
> >> mode. The CCM mode uses only the block cipher encryption function, i.e.,
> >> you won't be seeing aes_decrypt or _aesni_dec1 for this even on the RX
> >> path (AES encryption operations are used to generate the key stream
> >> blocks for CCM decryption).
> > Yes, I remember this detail/the old days (before 3.12/3.13?). Back then
> > ieee80211_aes_ccm_decrypt did exactly that. But these semantic pitfalls
> > were taken care of by the following commit:
> >
> > commit 7ec7c4a9a686c608315739ab6a2b0527a240883c (from wireless-testing.git)
> > Author: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Date: Thu Oct 10 09:55:20 2013 +0200
>
> This patch is in my tree (I'm using 3.14.14 kernel currently).
>
> Here is a perf top from a different machine, with single wlan interface
> running UDP download (btserver is user-space app that is generating/receiving
> the traffic). I can do about 200Mbps download with WPA2 encryption enabled
> on this machine, and ksoftirqd is using about 76% of a core according to top.

Thanks. I looked into the specification of AES in CCM (Counter with CBC-MAC)
mode instead of ccm.c and guess what: "Both the CCM encryption and CCM
decryption operations require only the block cipher encryption function." [0].
(Yes, same as Jouni said in his mail).

Now to the perf:

> Samples: 126K of event 'cycles', Event count (approx.): 29019221373
>  10.74%  [kernel]  [k] math_state_restore
>  10.50%  btserver  [.] 0x000000000033260d
>   9.00%  [kernel]  [k] _aesni_enc1
>   7.33%  [kernel]  [k] fpu_save_init
>   6.70%  [kernel]  [k] __lock_acquire
>   2.46%  [kernel]  [k] irq_fpu_usable
>   2.34%  [kernel]  [k] crypto_xor
>   1.88%  [kernel]  [k] arch_local_save_flags
>   1.83%  [kernel]  [k] arch_local_irq_restore
>   1.58%  [kernel]  [k] lock_release
>   1.48%  [kernel]  [k] aes_encrypt
>   1.27%  [kernel]  [k] mark_lock
>   1.12%  [kernel]  [k] lock_acquire
>   1.02%  [kernel]  [k] mark_held_locks
>   0.96%  [kernel]  [k] trace_hardirqs_on_caller
>   0.93%  [kernel]  [k] get_data_to_compute
>   0.83%  [kernel]  [k] hlock_class
>   0.81%  [kernel]  [k] __kernel_fpu_begin
>   0.81%  [kernel]  [k] crypto_ctr_crypt
>   0.80%  [kernel]  [k] crypto_inc

The high overhead (math_state_restore and fpu_save_init) is caused by
the way ccm.c interacts with the aesni implementation when calculating
the MAC [1] (in compute_mac).

> [ ... ]
>	/* now encrypt rest of data */
>	while (datalen >= 16) {
>		crypto_xor(odata, data, bs);
>		crypto_cipher_encrypt_one(tfm, odata, odata);
>
>		datalen -= 16;
>		data += 16;
>	}
> [...]

crypto_cipher_encrypt_one is a wrapper which in your case calls
aesni's aes_encrypt [2].

And aes_encrypt looks like this:

> [...]
>	kernel_fpu_begin();
>	aesni_enc(ctx, dst, src); <-- this is where it goes to _aesni_enc1
>	kernel_fpu_end();
> [...]

Or: for every 16 bytes of payload there is one FPU context save and
restore... ouch!

[0] http://tools.ietf.org/html/rfc3610
[1] http://lxr.free-electrons.com/source/crypto/ccm.c#L164
[2] http://lxr.free-electrons.com/source/arch/x86/crypto/aesni-intel_glue.c#L323

Regards
Christian
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-07 14:05 ` Christian Lamparter
@ 2014-08-07 17:45 ` Ben Greear
  2014-08-10 13:44 ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-07 17:45 UTC
  To: Christian Lamparter
  Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On 08/07/2014 07:05 AM, Christian Lamparter wrote:

> The high overhead (math_state_restore and fpu_save_init) are caused by
> the way ccm.c interacts with the aesni implementation when calculating
> the MAC [1] (in compute_mac).
>
>> [ ... ]
>>	/* now encrypt rest of data */
>>	while (datalen >= 16) {
>>		crypto_xor(odata, data, bs);
>>		crypto_cipher_encrypt_one(tfm, odata, odata);
>>
>>		datalen -= 16;
>>		data += 16;
>>	}
>> [...]
>
> crypto_cipher_encrypt_one is a wrapper which in your case calls
> aesni's aes_encrypt [2].
>
> And aes_encrypt looks like this:
>
>> [...]
>>	kernel_fpu_begin();
>>	aesni_enc(ctx, dst, src); <-- this is where it goes to _aesni_enc1
>>	kernel_fpu_end();
>> [...]
>
> Or: for every 16 Bytes of payload there is one fpu context save and
> restore... ouch!

I have never messed with this kind of stuff...

Any idea if it would work to put the fpu_begin/end a bit higher
and do all those 16 byte chunks in a batch without messing with
the FPU for each chunk?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-07 17:45 ` Ben Greear
@ 2014-08-10 13:44 ` Christian Lamparter
  2014-08-12 18:34 ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-08-10 13:44 UTC
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
> > Or: for every 16 Bytes of payload there is one fpu context save and
> > restore... ouch!
>
> Any idea if it would work to put the fpu_begin/end a bit higher
> and do all those 16 byte chunks in a batch without messing with
> the FPU for each chunk?

It sort of works - see the sample feature patch for aesni-intel-glue
(taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
"crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).

The FPU save/restore overhead should be gone. Also, if the aesni
instructions can't be used, the implementation will fall back
to the original ccm(aes) code. Calculating the MAC is still much
more expensive than the payload encryption or decryption. However,
I can't see a way of making this more efficient without rewriting
and combining the parts I took from crypto/ccm.c into several
dedicated assembler functions.

Regards
Christian
---
 arch/x86/crypto/aesni-intel_glue.c | 484 +++++++++++++++++++++++++++++++++++++
 1 file changed, 484 insertions(+)

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 948ad0e..beab823 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -36,6 +36,7 @@
 #include <asm/crypto/aes.h>
 #include <crypto/ablk_helper.h>
 #include <crypto/scatterwalk.h>
+#include <crypto/aead.h>
 #include <crypto/internal/aead.h>
 #include <linux/workqueue.h>
 #include <linux/spinlock.h>
@@ -499,6 +500,448 @@ static int ctr_crypt(struct blkcipher_desc *desc,
 
 	return err;
 }
+
+static int __ccm_setkey(struct crypto_aead *tfm, const u8 *in_key,
+			unsigned int key_len)
+{
+	struct crypto_aes_ctx *ctx = crypto_aead_ctx(tfm);
+
+	return aes_set_key_common(crypto_aead_tfm(tfm), ctx, in_key, key_len);
+}
+
+static int __ccm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
+{
+	if ((authsize & 1) || authsize < 4)
+		return -EINVAL;
+	return 0;
+}
+
+static int set_msg_len(u8 *block, unsigned int msglen, int csize)
+{
+	__be32 data;
+
+	memset(block, 0, csize);
+	block += csize;
+
+	if (csize >= 4)
+		csize = 4;
+	else if (msglen > (1 << (8 * csize)))
+		return -EOVERFLOW;
+
+	data = cpu_to_be32(msglen);
+	memcpy(block - csize, (u8 *)&data + 4 - csize, csize);
+
+	return 0;
+}
+
+static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	__be32 *n = (__be32 *)&maciv[AES_BLOCK_SIZE - 8];
+	u32 l = req->iv[0] + 1;
+
+	/* verify that CCM dimension 'L' is set correctly in the IV */
+	if (l < 2 || l > 8)
+		return -EINVAL;
+
+	/* verify that msglen can in fact be represented in L bytes */
+	if (l < 4 && msglen >> (8 * l))
+		return -EOVERFLOW;
+
+	/*
+	 * Even if the CCM spec allows L values of up to 8, the Linux cryptoapi
+	 * uses a u32 type to represent msglen so the top 4 bytes are always 0.
+	 */
+	n[0] = 0;
+	n[1] = cpu_to_be32(msglen);
+
+	memcpy(maciv, req->iv, AES_BLOCK_SIZE - l);
+
+	/*
+	 * Meaning of byte 0 according to CCM spec (RFC 3610/NIST 800-38C)
+	 * - bits 0..2 : max # of bytes required to represent msglen, minus 1
+	 *               (already set by caller)
+	 * - bits 3..5 : size of auth tag (1 => 4 bytes, 2 => 6 bytes, etc)
+	 * - bit 6     : indicates presence of authenticate-only data
+	 */
+	maciv[0] |= (crypto_aead_authsize(aead) - 2) << 2;
+	if (req->assoclen)
+		maciv[0] |= 0x40;
+
+	memset(&req->iv[AES_BLOCK_SIZE - l], 0, l);
+	return set_msg_len(maciv + AES_BLOCK_SIZE - l, msglen, l);
+}
+
+static int compute_mac(struct crypto_aes_ctx *ctx, u8 mac[], u8 *data, int n,
+		       unsigned int ilen, u8 *idata)
+{
+	unsigned int bs = AES_BLOCK_SIZE;
+	u8 *odata = mac;
+	int datalen, getlen;
+
+	datalen = n;
+
+	/* first time in here, block may be partially filled. */
+	getlen = bs - ilen;
+	if (datalen >= getlen) {
+		memcpy(idata + ilen, data, getlen);
+		crypto_xor(odata, idata, bs);
+
+		aesni_enc(ctx, odata, odata);
+		datalen -= getlen;
+		data += getlen;
+		ilen = 0;
+	}
+
+	/* now encrypt rest of data */
+	while (datalen >= bs) {
+		crypto_xor(odata, data, bs);
+
+		aesni_enc(ctx, odata, odata);
+
+		datalen -= bs;
+		data += bs;
+	}
+
+	/* check and see if there's leftover data that wasn't
+	 * enough to fill a block.
+	 */
+	if (datalen) {
+		memcpy(idata + ilen, data, datalen);
+		ilen += datalen;
+	}
+	return ilen;
+}
+
+static unsigned int get_data_to_compute(struct crypto_aes_ctx *ctx, u8 mac[],
+					u8 *idata, struct scatterlist *sg,
+					unsigned int len, unsigned int ilen)
+{
+	struct scatter_walk walk;
+	u8 *data_src;
+	int n;
+
+	scatterwalk_start(&walk, sg);
+
+	while (len) {
+		n = scatterwalk_clamp(&walk, len);
+		if (!n) {
+			scatterwalk_start(&walk, sg_next(walk.sg));
+			n = scatterwalk_clamp(&walk, len);
+		}
+		data_src = scatterwalk_map(&walk);
+
+		ilen = compute_mac(ctx, mac, data_src, n, ilen, idata);
+		len -= n;
+
+		scatterwalk_unmap(data_src);
+		scatterwalk_advance(&walk, n);
+		scatterwalk_done(&walk, 0, len);
+	}
+
+	/* any leftover needs padding and then encrypted */
+	if (ilen) {
+		int padlen;
+		u8 *odata = mac;
+
+		padlen = AES_BLOCK_SIZE - ilen;
+		memset(idata + ilen, 0, padlen);
+		crypto_xor(odata, idata, AES_BLOCK_SIZE);
+
+		aesni_enc(ctx, odata, odata);
+		ilen = 0;
+	}
+	return ilen;
+}
+
+static void ccm_calculate_auth_mac(struct aead_request *req,
+				   struct crypto_aes_ctx *ctx, u8 mac[],
+				   struct scatterlist *src,
+				   unsigned int cryptlen)
+{
+	unsigned int ilen;
+	u8 idata[AES_BLOCK_SIZE];
+	u32 len = req->assoclen;
+
+	aesni_enc(ctx, mac, mac);
+
+	if (len) {
+		struct __packed {
+			__be16 l;
+			__be32 h;
+		} *ltag = (void *)idata;
+
+		/* prepend the AAD with a length tag */
+		if (len < 0xff00) {
+			ltag->l = cpu_to_be16(len);
+			ilen = 2;
+		} else {
+			ltag->l = cpu_to_be16(0xfffe);
+			ltag->h = cpu_to_be32(len);
+			ilen = 6;
+		}
+
+		ilen = get_data_to_compute(ctx, mac, idata,
+					   req->assoc, req->assoclen,
+					   ilen);
+	} else {
+		ilen = 0;
+	}
+
+	/* compute plaintext into mac */
+	if (cryptlen) {
+		ilen = get_data_to_compute(ctx, mac, idata,
+					   src, cryptlen, ilen);
+	}
+}
+
+static int __ccm_encrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct crypto_aes_ctx *ctx = aes_ctx(crypto_aead_ctx(aead));
+	struct blkcipher_desc desc = { .info = req->iv };
+	struct blkcipher_walk walk;
+	struct scatterlist src[2], dst[2], *pdst;
+	u8 __aligned(8) mac[AES_BLOCK_SIZE];
+	u32 len = req->cryptlen;
+	int err;
+
+	err = ccm_init_mac(req, mac, len);
+	if (err)
+		return err;
+
+	ccm_calculate_auth_mac(req, ctx, mac, req->src, len);
+
+	sg_init_table(src, 2);
+	sg_set_buf(src, mac, sizeof(mac));
+	scatterwalk_sg_chain(src, 2, req->src);
+
+	pdst = src;
+	if (req->src != req->dst) {
+		sg_init_table(dst, 2);
+		sg_set_buf(dst, mac, sizeof(mac));
+		scatterwalk_sg_chain(dst, 2, req->dst);
+		pdst = dst;
+	}
+
+	len += sizeof(mac);
+	blkcipher_walk_init(&walk, pdst, src, len);
+	err = blkcipher_aead_walk_virt_block(&desc, &walk, aead,
+					     AES_BLOCK_SIZE);
+
+	while ((len = walk.nbytes) >= AES_BLOCK_SIZE) {
+		aesni_ctr_enc(ctx, walk.dst.virt.addr, walk.src.virt.addr,
+			      len & AES_BLOCK_MASK, walk.iv);
+		len &= AES_BLOCK_SIZE - 1;
+		err = blkcipher_walk_done(&desc, &walk, len);
+	}
+	if (walk.nbytes) {
+		ctr_crypt_final(ctx, &walk);
+		err = blkcipher_walk_done(&desc, &walk, 0);
+	}
+
+	if (err)
+		return err;
+
+	/* copy authtag to end of dst */
+	scatterwalk_map_and_copy(mac, req->dst, req->cryptlen,
+				 crypto_aead_authsize(aead), 1);
+	return 0;
+}
+
+static int __ccm_decrypt(struct aead_request *req)
+{
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct crypto_aes_ctx *ctx = aes_ctx(crypto_aead_ctx(aead));
+	unsigned int authsize = crypto_aead_authsize(aead);
+	struct blkcipher_desc desc = { .info = req->iv };
+	struct blkcipher_walk walk;
+	struct scatterlist src[2], dst[2], *pdst;
+	u8 __aligned(8) authtag[AES_BLOCK_SIZE], mac[AES_BLOCK_SIZE];
+	u32 len;
+	int err;
+
+	if (req->cryptlen < authsize)
+		return -EINVAL;
+
+	scatterwalk_map_and_copy(authtag, req->src,
+				 req->cryptlen - authsize, authsize, 0);
+
+	err = ccm_init_mac(req, mac, req->cryptlen - authsize);
+	if (err)
+		return err;
+
+	sg_init_table(src, 2);
+	sg_set_buf(src, authtag, sizeof(authtag));
+	scatterwalk_sg_chain(src, 2, req->src);
+
+	pdst = src;
+	if (req->src != req->dst) {
+		sg_init_table(dst, 2);
+		sg_set_buf(dst, authtag, sizeof(authtag));
+		scatterwalk_sg_chain(dst, 2, req->dst);
+		pdst = dst;
+	}
+
+	blkcipher_walk_init(&walk, pdst, src,
+			    req->cryptlen - authsize + sizeof(mac));
+	err = blkcipher_aead_walk_virt_block(&desc, &walk, aead,
+					     AES_BLOCK_SIZE);
+
+	while ((len = walk.nbytes) >= AES_BLOCK_SIZE) {
+		aesni_ctr_enc(ctx, walk.dst.virt.addr, walk.src.virt.addr,
+			      len & AES_BLOCK_MASK, walk.iv);
+		len &= AES_BLOCK_SIZE - 1;
+		err = blkcipher_walk_done(&desc, &walk, len);
+	}
+	if (walk.nbytes) {
+		ctr_crypt_final(ctx, &walk);
+		err = blkcipher_walk_done(&desc, &walk, 0);
+	}
+
+	ccm_calculate_auth_mac(req, ctx, mac, req->dst,
+			       req->cryptlen - authsize);
+	if (err)
+		return err;
+
+	/* compare calculated auth tag with the stored one */
+	if (crypto_memneq(mac, authtag, authsize))
+		return -EBADMSG;
+	return 0;
+}
+
+struct ccm_async_ctx {
+	struct crypto_aes_ctx ctx;
+	struct crypto_aead *fallback;
+};
+
+static inline struct
+ccm_async_ctx *get_ccm_ctx(struct crypto_aead *aead)
+{
+	return (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)
+		crypto_tfm_ctx(crypto_aead_tfm(aead)), AESNI_ALIGN);
+}
+
+static int ccm_init(struct crypto_tfm *tfm)
+{
+	struct crypto_aead *crypto_tfm;
+	struct ccm_async_ctx *ctx = (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)crypto_tfm_ctx(tfm), AESNI_ALIGN);
+
+	crypto_tfm = crypto_alloc_aead("ccm(aes)", 0,
+		CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK);
+	if (IS_ERR(crypto_tfm))
+		return PTR_ERR(crypto_tfm);
+
+	ctx->fallback = crypto_tfm;
+	return 0;
+}
+
+static void ccm_exit(struct crypto_tfm *tfm)
+{
+	struct ccm_async_ctx *ctx = (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)crypto_tfm_ctx(tfm), AESNI_ALIGN);
+
+	if (!IS_ERR_OR_NULL(ctx->fallback))
+		crypto_free_aead(ctx->fallback);
+}
+
+static int ccm_setkey(struct crypto_aead *aead, const u8 *in_key,
+		      unsigned int key_len)
+{
+	struct crypto_tfm *tfm = crypto_aead_tfm(aead);
+	struct ccm_async_ctx *ctx = (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)crypto_tfm_ctx(tfm), AESNI_ALIGN);
+	int err;
+
+	err = __ccm_setkey(aead, in_key, key_len);
+	if (err)
+		return err;
+
+	/*
+	 * Set the fallback transform to use the same request flags as
+	 * the hardware transform.
+	 */
+	ctx->fallback->base.crt_flags &= ~CRYPTO_TFM_REQ_MASK;
+	ctx->fallback->base.crt_flags |=
+			tfm->crt_flags & CRYPTO_TFM_REQ_MASK;
+	return crypto_aead_setkey(ctx->fallback, in_key, key_len);
+}
+
+static int ccm_setauthsize(struct crypto_aead *aead, unsigned int authsize)
+{
+	struct crypto_tfm *tfm = crypto_aead_tfm(aead);
+	struct ccm_async_ctx *ctx = (struct ccm_async_ctx *)
+		PTR_ALIGN((u8 *)crypto_tfm_ctx(tfm), AESNI_ALIGN);
+	int err;
+
+	err = __ccm_setauthsize(aead, authsize);
+	if (err)
+		return err;
+
+	return crypto_aead_setauthsize(ctx->fallback, authsize);
+}
+
+static int ccm_encrypt(struct aead_request *req)
+{
+	int ret;
+
+	if (!irq_fpu_usable()) {
+		struct crypto_aead *aead = crypto_aead_reqtfm(req);
+		struct ccm_async_ctx *ctx = get_ccm_ctx(aead);
+		struct crypto_aead *fallback = ctx->fallback;
+
+		char aead_req_data[sizeof(struct aead_request) +
+				   crypto_aead_reqsize(fallback)]
+		__aligned(__alignof__(struct aead_request));
+		struct aead_request *aead_req = (void *) aead_req_data;
+
+		memset(aead_req, 0, sizeof(aead_req_data));
+		aead_request_set_tfm(aead_req, fallback);
+		aead_request_set_assoc(aead_req, req->assoc, req->assoclen);
+		aead_request_set_crypt(aead_req, req->src, req->dst,
+				       req->cryptlen, req->iv);
+		aead_request_set_callback(aead_req, req->base.flags,
+					  req->base.complete, req->base.data);
+		ret = crypto_aead_encrypt(aead_req);
+	} else {
+		kernel_fpu_begin();
+		ret = __ccm_encrypt(req);
+		kernel_fpu_end();
+	}
+	return ret;
+}
+
+static int ccm_decrypt(struct aead_request *req)
+{
+	int ret;
+
+	if (!irq_fpu_usable()) {
+		struct crypto_aead *aead = crypto_aead_reqtfm(req);
+		struct ccm_async_ctx *ctx = get_ccm_ctx(aead);
+		struct crypto_aead *fallback = ctx->fallback;
+
+		char aead_req_data[sizeof(struct aead_request) +
+				   crypto_aead_reqsize(fallback)]
+		__aligned(__alignof__(struct aead_request));
+		struct aead_request *aead_req = (void *) aead_req_data;
+
+		memset(aead_req, 0, sizeof(aead_req_data));
+		aead_request_set_tfm(aead_req, fallback);
+		aead_request_set_assoc(aead_req, req->assoc, req->assoclen);
+		aead_request_set_crypt(aead_req, req->src, req->dst,
+				       req->cryptlen, req->iv);
+		aead_request_set_callback(aead_req, req->base.flags,
+					  req->base.complete, req->base.data);
+		ret = crypto_aead_decrypt(aead_req);
+	} else {
+		kernel_fpu_begin();
+		ret = __ccm_decrypt(req);
+		kernel_fpu_end();
+	}
+	return ret;
+}
 #endif
 
 static int ablk_ecb_init(struct crypto_tfm *tfm)
@@ -1308,6 +1751,47 @@ static struct crypto_alg aesni_algs[] = { {
 		},
 	},
 }, {
+	.cra_name = "__ccm-aes-aesni",
+	.cra_driver_name = "__driver-ccm-aes-aesni",
+	.cra_priority = 0,
+	.cra_flags = CRYPTO_ALG_TYPE_AEAD,
+	.cra_blocksize = 1,
+	.cra_ctxsize = sizeof(struct crypto_aes_ctx) +
+			AESNI_ALIGN - 1,
+	.cra_alignmask = 0,
+	.cra_type = &crypto_aead_type,
+	.cra_module = THIS_MODULE,
+	.cra_aead = {
+		.ivsize = AES_BLOCK_SIZE,
+		.maxauthsize = AES_BLOCK_SIZE,
+		.setkey = __ccm_setkey,
+		.setauthsize = __ccm_setauthsize,
+		.encrypt = __ccm_encrypt,
+		.decrypt = __ccm_decrypt,
+	},
+}, {
+	.cra_name = "ccm(aes)",
+	.cra_driver_name = "ccm-aes-aesni",
+	.cra_priority = 700,
+	.cra_flags = CRYPTO_ALG_TYPE_AEAD |
+			CRYPTO_ALG_NEED_FALLBACK,
+	.cra_blocksize = 1,
+	.cra_ctxsize = AESNI_ALIGN - 1 +
+			sizeof(struct ccm_async_ctx),
+	.cra_alignmask = 0,
+	.cra_type = &crypto_aead_type,
+	.cra_module = THIS_MODULE,
+	.cra_init = ccm_init,
+	.cra_exit = ccm_exit,
+	.cra_aead = {
+		.ivsize = AES_BLOCK_SIZE,
+		.maxauthsize = AES_BLOCK_SIZE,
+		.setkey = ccm_setkey,
+		.setauthsize = ccm_setauthsize,
+		.encrypt = ccm_encrypt,
+		.decrypt = ccm_decrypt,
+	},
+}, {
 	.cra_name = "__gcm-aes-aesni",
 	.cra_driver_name = "__driver-gcm-aes-aesni",
 	.cra_priority = 0,
-- 
2.0.1

[0] <https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4f7f1d7cff8f2c170ce0319eb4c01a82c328d34f>
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-10 13:44 ` Christian Lamparter
@ 2014-08-12 18:34 ` Ben Greear
  2014-08-14 12:39 ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-12 18:34 UTC
  To: Christian Lamparter
  Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On 08/10/2014 06:44 AM, Christian Lamparter wrote:
> On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
>> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
>>> Or: for every 16 Bytes of payload there is one fpu context save and
>>> restore... ouch!
>>
>> Any idea if it would work to put the fpu_begin/end a bit higher
>> and do all those 16 byte chunks in a batch without messing with
>> the FPU for each chunk?
>
> It sort of works - see sample feature patch for aesni-intel-glue
> (taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
> "crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).
>
> The FPU save/restore overhead should be gone. Also, if the aesni
> instructions can't be used, the implementation will fall back
> to the original ccm(aes) code. Calculating the MAC is still much
> more expensive than the payload encryption or decryption. However,
> I can't see a way of making this more efficient without rewriting
> and combining the parts I took from crypto/ccm.c into an several,
> dedicated assembler functions.

I tried this patch on my i7 machine, on the 3.16+ kernel.

Without your patch, I see about 260Mbps download. With it,
performance improves to around 350Mbps - 375Mbps.

Without encryption, I see download rate of around 400 - 420Mbps.

So, your patch looks like a good improvement to me, and I'll be
happy to test further patches if you happen to do those assembler
optimizations you talk about above.

Let me know if you would like more/different performance
stats.

Here is perf top of open authentication, download, UDP:

Samples: 64K of event 'cycles', Event count (approx.): 8792558478
 30.78%  btserver       [.] 0x0000000000100501
  2.73%  [kernel]       [k] copy_user_generic_string
  2.02%  [kernel]       [k] swiotlb_tbl_unmap_single
  1.43%  [kernel]       [k] ioread32
  1.40%  [ath10k_core]  [k] ath10k_htt_txrx_compl_task
  1.38%  [kernel]       [k] csum_partial
  1.22%  [kernel]       [k] _raw_spin_lock_irqsave
  0.97%  [cfg80211]     [k] ftrace_define_fields_rdev_return_int_survey_info
  0.97%  [kernel]       [k] pskb_expand_head
  0.95%  [kernel]       [k] do_raw_spin_lock
  0.82%  [kernel]       [k] __slab_free
  0.78%  [kernel]       [k] __sk_run_filter
  0.71%  [kernel]       [k] __rcu_read_unlock
  0.67%  [kernel]       [k] __netif_receive_skb_core
  0.65%  [kernel]       [k] __rcu_read_lock
  0.62%  [kernel]       [k] build_skb
  0.59%  [mac80211]     [k] ieee80211_rx_handlers
  0.55%  [kernel]       [k] nf_iterate
  0.52%  [kernel]       [k] arch_local_irq_restore

Using WPA2, sw-crypt, download, UDP:

Samples: 52K of event 'cycles', Event count (approx.): 13162827574
 24.78%  btserver       [.] 0x00000000000c598c
 10.97%  [kernel]       [k] _aesni_enc1
  2.75%  [kernel]       [k] _aesni_enc4
  2.26%  [kernel]       [k] crypto_xor
  1.69%  [kernel]       [k] aesni_enc
  1.29%  [kernel]       [k] swiotlb_tbl_unmap_single
  1.21%  [kernel]       [k] copy_user_generic_string
  1.17%  [kernel]       [k] ioread32
  1.13%  [kernel]       [k] get_data_to_compute
  0.99%  [kernel]       [k] _raw_spin_lock_irqsave
  0.91%  [ath10k_core]  [k] ath10k_htt_txrx_compl_task
  0.70%  [kernel]       [k] __schedule
  0.70%  [kernel]       [k] native_write_msr_safe
  0.69%  [kernel]       [k] csum_partial
  0.62%  [kernel]       [k] pskb_expand_head
  0.62%  [kernel]       [k] __switch_to
  0.58%  [kernel]       [k] do_raw_spin_lock
  0.53%  [kernel]       [k] menu_select
  0.51%  [kernel]       [k] __rcu_read_unlock
  0.47%  [cfg80211]     [k] ftrace_define_fields_rdev_return_int_survey_info
  0.47%  [kernel]       [k] _aesni_inc
  0.47%  [kernel]       [k] __rcu_read_lock
  0.47%  [kernel]       [k] __sk_run_filter
  0.44%  [kernel]       [k] aesni_ctr_enc
  0.43%  [kernel]       [k] arch_local_irq_restore
  0.43%  [kernel]       [k] do_sys_poll
  0.42%  [kernel]       [k] __netif_receive_skb_core
  0.41%  [mac80211]     [k] ieee80211_rx_handlers
  0.38%  [kernel]       [k] update_cfs_shares

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-12 18:34 ` Ben Greear
@ 2014-08-14 12:39 ` Christian Lamparter
  2014-08-14 17:09 ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-08-14 12:39 UTC
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
> On 08/10/2014 06:44 AM, Christian Lamparter wrote:
> > On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
> >> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
> >>> Or: for every 16 Bytes of payload there is one fpu context save and
> >>> restore... ouch!
> >>
> >> Any idea if it would work to put the fpu_begin/end a bit higher
> >> and do all those 16 byte chunks in a batch without messing with
> >> the FPU for each chunk?
> >
> > It sort of works - see sample feature patch for aesni-intel-glue
> > (taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
> > "crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).
> >
> > The FPU save/restore overhead should be gone. Also, if the aesni
> > instructions can't be used, the implementation will fall back
> > to the original ccm(aes) code. Calculating the MAC is still much
> > more expensive than the payload encryption or decryption. However,
> > I can't see a way of making this more efficient without rewriting
> > and combining the parts I took from crypto/ccm.c into an several,
> > dedicated assembler functions.
>
> Without encryption, I see download rate of around 400 - 420Mbps.
>
> So, your patch looks like a good improvement to me, and I'll be
> happy to test further patches if you happen to do those assembler
> optimizations you talk about above.

Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
download, udp" are.

> Let me know if you would like more/different performance
> stats.

There's a test bench tool (tcrypt) to measure the performance
of any cipher. It would be interesting to know what
performance/throughput it can produce without the overhead
of any application. [Yep, I'm making a small patch to test that,
but not before Saturday next week].

> Here is perf top of open authentication, download, UDP:
>
> Using WPA2, sw-crypt, download, UDP:
>
> Samples: 52K of event 'cycles', Event count (approx.): 13162827574
>  24.78%  btserver  [.] 0x00000000000c598c
Is btserver your "udp download" test application? What does it do, as
it is accounting for nearly 25%?

Regards
Christian
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-14 12:39 ` Christian Lamparter
@ 2014-08-14 17:09 ` Ben Greear
  2014-08-19 18:18 ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-14 17:09 UTC
  To: Christian Lamparter
  Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On 08/14/2014 05:39 AM, Christian Lamparter wrote:
> On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
>> On 08/10/2014 06:44 AM, Christian Lamparter wrote:
>>> On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
>>>> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
>>>>> Or: for every 16 Bytes of payload there is one fpu context save and
>>>>> restore... ouch!
>>>>
>>>> Any idea if it would work to put the fpu_begin/end a bit higher
>>>> and do all those 16 byte chunks in a batch without messing with
>>>> the FPU for each chunk?
>>>
>>> It sort of works - see sample feature patch for aesni-intel-glue
>>> (taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
>>> "crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).
>>>
>>> The FPU save/restore overhead should be gone. Also, if the aesni
>>> instructions can't be used, the implementation will fall back
>>> to the original ccm(aes) code. Calculating the MAC is still much
>>> more expensive than the payload encryption or decryption. However,
>>> I can't see a way of making this more efficient without rewriting
>>> and combining the parts I took from crypto/ccm.c into an several,
>>> dedicated assembler functions.
>>
>> Without encryption, I see download rate of around 400 - 420Mbps.
>>
>> So, your patch looks like a good improvement to me, and I'll be
>> happy to test further patches if you happen to do those assembler
>> optimizations you talk about above.
>
> Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
> download, udp" are.

I'll do that test sometime soon and post the results.

>> Let me know if you would like more/different performance
>> stats.
>
> There's a test bench tool (tcrypt) to measure the performance
> of any cipher. It would be interesting to know what the
> performance/throughput it can produce without the overhead
> of any application. [Yep, I'm making a small patch to test that,
> but not before Saturday next week].
>
>> Here is perf top of open authentication, download, UDP:
>>
>> Using WPA2, sw-crypt, download, UDP:
>>
>> Samples: 52K of event 'cycles', Event count (approx.): 13162827574
>>  24.78%  btserver  [.] 0x00000000000c598c
> Is btserver your "udp download" test application? What does it do, as
> it is accounting for nearly 25%?

btserver is our traffic generator. In this case, it is mostly just
receiving UDP frames using non-blocking IO (using recvmmsg, in this case),
but it does a fair bit of stats gathering and such. It typically
compares well with iperf as far as throughput goes, but I'm sure it
uses at least a bit more CPU as compared to iperf.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-14 17:09 ` Ben Greear
@ 2014-08-19 18:18 ` Ben Greear
  2014-08-20 20:47 ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-19 18:18 UTC (permalink / raw)
  To: Christian Lamparter
  Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On 08/14/2014 10:09 AM, Ben Greear wrote:
> On 08/14/2014 05:39 AM, Christian Lamparter wrote:
>> On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
>>> On 08/10/2014 06:44 AM, Christian Lamparter wrote:
>>>> On Thursday, August 07, 2014 10:45:01 AM Ben Greear wrote:
>>>>> On 08/07/2014 07:05 AM, Christian Lamparter wrote:
>>>>>> Or: for every 16 Bytes of payload there is one fpu context save and
>>>>>> restore... ouch!
>>>>>
>>>>> Any idea if it would work to put the fpu_begin/end a bit higher
>>>>> and do all those 16 byte chunks in a batch without messing with
>>>>> the FPU for each chunk?
>>>>
>>>> It sort of works - see sample feature patch for aesni-intel-glue
>>>> (taken from 3.16-wl). Older kernels (like 3.15, 3.14) need:
>>>> "crypto: allow blkcipher walks over AEAD data" [0] (and maybe more).
>>>>
>>>> The FPU save/restore overhead should be gone. Also, if the aesni
>>>> instructions can't be used, the implementation will fall back
>>>> to the original ccm(aes) code. Calculating the MAC is still much
>>>> more expensive than the payload encryption or decryption. However,
>>>> I can't see a way of making this more efficient without rewriting
>>>> and combining the parts I took from crypto/ccm.c into an several,
>>>> dedicated assembler functions.
>>>
>>> Without encryption, I see download rate of around 400 - 420Mbps.
>>>
>>> So, your patch looks like a good improvement to me, and I'll be
>>> happy to test further patches if you happen to do those assembler
>>> optimizations you talk about above.
>>
>> Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
>> download, udp" are.
>
> I'll do that test sometime soon and post the results.

I ran that today, and I get about the same throughput with hw-crypt or
sw-crypt (350-355Mbps UDP download goodput).

I still see 400+Mbps with Open authentication.

So, maybe the bottleneck now is elsewhere...

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com

^ permalink raw reply [flat|nested] 21+ messages in thread
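The batching idea quoted in the message above (one FPU save/restore per frame instead of one per 16-byte block) can be illustrated with a small userspace simulation. This is only a sketch and not the aesni-intel-glue patch: sim_fpu_begin(), sim_fpu_end() and sim_aes_block() are hypothetical stand-ins for kernel_fpu_begin(), kernel_fpu_end() and an AES-NI block operation, and the counter merely tallies how often the expensive section is entered.

```c
#include <stddef.h>

#define AES_BLOCK_SIZE 16

/* Counts simulated FPU save/restore pairs (stand-in for the real cost). */
static unsigned long fpu_sections;

/* Hypothetical userspace stand-ins for kernel_fpu_begin()/kernel_fpu_end(). */
static void sim_fpu_begin(void) { fpu_sections++; }
static void sim_fpu_end(void) { }

/* Placeholder for one AES-NI block operation; this is not real encryption. */
static void sim_aes_block(unsigned char *block) { block[0] ^= 0xff; }

/* Original shape: one FPU section for every 16-byte block. */
static void crypt_per_block(unsigned char *buf, size_t len)
{
	size_t off;

	for (off = 0; off + AES_BLOCK_SIZE <= len; off += AES_BLOCK_SIZE) {
		sim_fpu_begin();
		sim_aes_block(buf + off);
		sim_fpu_end();
	}
}

/* Batched shape: one FPU section around all blocks of a frame. */
static void crypt_batched(unsigned char *buf, size_t len)
{
	size_t off;

	sim_fpu_begin();
	for (off = 0; off + AES_BLOCK_SIZE <= len; off += AES_BLOCK_SIZE)
		sim_aes_block(buf + off);
	sim_fpu_end();
}
```

For a 1504-byte buffer the first variant enters the FPU section 94 times and the second once. In the kernel the batched form also keeps preemption disabled for the whole frame, which is one reason to batch per frame rather than across many frames.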
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-19 18:18 ` Ben Greear
@ 2014-08-20 20:47 ` Christian Lamparter
  2014-08-20 21:04 ` Ben Greear
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Lamparter @ 2014-08-20 20:47 UTC (permalink / raw)
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On Tuesday, August 19, 2014 11:18:39 AM Ben Greear wrote:
> On 08/14/2014 10:09 AM, Ben Greear wrote:
> > On 08/14/2014 05:39 AM, Christian Lamparter wrote:
> >> On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
> >>>
> >>> Without encryption, I see download rate of around 400 - 420Mbps.
> >>>
> >>> So, your patch looks like a good improvement to me, and I'll be
> >>> happy to test further patches if you happen to do those assembler
> >>> optimizations you talk about above.
> >>
> >> Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
> >> download, udp" are.
> >
> > I'll do that test sometime soon and post the results.
>
> I ran that today, and I get about the same throughput with hw-crypt or
> sw-crypt (350-355Mbps UDP download goodput).
>
> I still see 400+Mbps with Open authentication.
>
> So, maybe the bottleneck now is elsewhere...

Can you rule out that the "udp generator" (either the application
or the hardware) is now the bottleneck for this test? [Does the
datasheet mention the throughput of the hw-crypto? Or do you know
someone at QCA which can tell you if the hardware is filling up
the aggregates with additional padding to meet the MPDU start
spacing]

I'll look into the assembler implementation of aes-ccm. But I'm
afraid that this won't increase the throughput (and only decrease
the load on the CPU a bit).

Also, just for fun: what goodput can you achieve over gbit ethernet?
[Because ethernet is also affected by filtering, bridging,
pcie-throughput... if it is setup in the same way so you could
rule out that iptables, its friends or the pcie-port is a
bottleneck].

Regards
Christian

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-20 20:47 ` Christian Lamparter
@ 2014-08-20 21:04 ` Ben Greear
  2014-08-22 22:55 ` Christian Lamparter
  0 siblings, 1 reply; 21+ messages in thread
From: Ben Greear @ 2014-08-20 21:04 UTC (permalink / raw)
  To: Christian Lamparter
  Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On 08/20/2014 01:47 PM, Christian Lamparter wrote:
> On Tuesday, August 19, 2014 11:18:39 AM Ben Greear wrote:
>> On 08/14/2014 10:09 AM, Ben Greear wrote:
>>> On 08/14/2014 05:39 AM, Christian Lamparter wrote:
>>>> On Tuesday, August 12, 2014 11:34:59 AM Ben Greear wrote:
>>>>>
>>>>> Without encryption, I see download rate of around 400 - 420Mbps.
>>>>>
>>>>> So, your patch looks like a good improvement to me, and I'll be
>>>>> happy to test further patches if you happen to do those assembler
>>>>> optimizations you talk about above.
>>>>
>>>> Maybe, that will depend on what the results for: "wpa2, *HW*-crypt,
>>>> download, udp" are.
>>>
>>> I'll do that test sometime soon and post the results.
>>
>> I ran that today, and I get about the same throughput with hw-crypt or
>> sw-crypt (350-355Mbps UDP download goodput).
>>
>> I still see 400+Mbps with Open authentication.
>>
>> So, maybe the bottleneck now is elsewhere...
> Can you rule out that the "udp generator" (either the application
> or the hardware) is now the bottleneck for this test? [Does the
> datasheet mention the throughput of the hw-crypto? Or do you know
> someone at QCA which can tell you if the hardware is filling up
> the aggregates with additional padding to meet the MPDU start
> spacing]

It is unlikely the UDP generator acts differently for encrypted v/s
open traffic, and since the NIC is supposed to do offload in hw-crypt
mode, the rest of the stack should be similar as well.

Other ath10k users report similar open & wpa2 throughput, so it may be
something in my kernel or firmware or configs.

I will run some additional tests when I get a chance...

> I'll look into the assembler implementation of aes-ccm. But I'm
> afraid that this won't increase the throughput (and only decrease
> the load on the CPU a bit).

I think you are right, and probably it is not worth much effort at
this point, at least as far as my setup is concerned.

> Also, just for fun: what goodput can you achieve over gbit ethernet?
> [Because ethernet is also affected by filtering, bridging,
> pcie-throughput... if it is setup in the same way so you could
> rule out that iptables, its friends or the pcie-port is a
> bottleneck].

Since Open runs faster, it shouldn't be pci-e bus or CPU bottleneck.

This class of system can generally sustain near 1 Gbps throughput on
wired Ethernet.

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-08-20 21:04 ` Ben Greear
@ 2014-08-22 22:55 ` Christian Lamparter
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Lamparter @ 2014-08-22 22:55 UTC (permalink / raw)
  To: Ben Greear; +Cc: Jouni Malinen, linux-wireless@vger.kernel.org, Johannes Berg

On Wednesday, August 20, 2014 02:04:35 PM Ben Greear wrote:
> On 08/20/2014 01:47 PM, Christian Lamparter wrote:
>
> > I'll look into the assembler implementation of aes-ccm. But I'm
> > afraid that this won't increase the throughput (and only decrease
> > the load on the CPU a bit).
>
> I think you are right, and probably it is not worth much effort at
> this point, at least as far as my setup is concerned.

"There's a test bench tool (tcrypt) to measure the performance
of any cipher. It would be interesting to know what the
performance/throughput it can produce without the overhead
of any application. ..."

here it is:

the module is located in crypto/tcrypt
module parameters:
 - mode=212 (original ccm)
 - mode=213 (ccm-aesni)
 (sec=1 - Length in seconds of speed tests)

This will test the speed of the ccm implementation at different
block sizes for one second.

BTW: any luck with figuring out if there are any other obvious
bottlenecks? (Other than: btserver, checksumming, ...)?

Regards
Christian
---
diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 890449e..7675a13 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -354,8 +354,10 @@ static void test_aead_speed(const char *algo, int enc, unsigned int secs,
 		ret = crypto_aead_setauthsize(tfm, authsize);
 
 		iv_len = crypto_aead_ivsize(tfm);
-		if (iv_len)
-			memset(&iv, 0xff, iv_len);
+		if (iv_len) {
+			for (j = 0; j < iv_len; j++)
+				iv[j] = j + 1;
+		}
 
 		crypto_aead_clear_flags(tfm, ~0);
 		printk(KERN_INFO "test %u (%d bit key, %d byte blocks): ",
@@ -1751,6 +1753,15 @@ static int do_test(int m)
 				NULL, 0, 16, 8, aead_speed_template_20);
 		break;
 
+	case 212:
+		test_aead_speed("ccm_base(ctr(aes-aesni),aes-aesni)", ENCRYPT, sec,
+				NULL, 0, 16, 8, aead_speed_template_16);
+		break;
+	case 213:
+		test_aead_speed("ccm-aes-aesni", ENCRYPT, sec,
+				NULL, 0, 16, 8, aead_speed_template_16);
+		break;
+
+
 	case 300:
 		/* fall through */
 
diff --git a/crypto/tcrypt.h b/crypto/tcrypt.h
index 6c7e21a..88f152d 100644
--- a/crypto/tcrypt.h
+++ b/crypto/tcrypt.h
@@ -66,6 +66,7 @@ static u8 speed_template_32_64[] = {32, 64, 0};
  * AEAD speed tests
  */
 static u8 aead_speed_template_20[] = {20, 0};
+static u8 aead_speed_template_16[] = {16, 0};
 
 /*
  * Digest speed tests

^ permalink raw reply related [flat|nested] 21+ messages in thread
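The patch above replaces the all-0xff test IV with counting bytes but does not say why. One plausible reason, stated here as an assumption rather than as the author's explanation, is that CCM interprets iv[0] as a flags byte encoding L' = L - 1 and only accepts values 1 through 7, so an IV starting with 0xff would be rejected before any speed measurement runs. The userspace sketch below mirrors that range check; ccm_iv_flags_valid(), fill_iv_old() and fill_iv_new() are hypothetical helper names, not kernel functions.

```c
#include <stddef.h>
#include <string.h>

#define CCM_IV_LEN 16

/*
 * Mirror of the CCM range check on the first IV byte:
 * 2 <= L <= 8, hence 1 <= iv[0] <= 7. Returns nonzero if acceptable.
 */
static int ccm_iv_flags_valid(const unsigned char *iv)
{
	return iv[0] >= 1 && iv[0] <= 7;
}

/* Old test IV from tcrypt: every byte set to 0xff. */
static void fill_iv_old(unsigned char *iv, size_t len)
{
	memset(iv, 0xff, len);
}

/* New test IV from the patch: bytes 1, 2, 3, ... so iv[0] == 1. */
static void fill_iv_new(unsigned char *iv, size_t len)
{
	size_t j;

	for (j = 0; j < len; j++)
		iv[j] = (unsigned char)(j + 1);
}
```

With the old fill the check fails because iv[0] is 0xff; with the new fill iv[0] is 1, which corresponds to a 2-byte length field and passes.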
* Re: Looking for non-NIC hardware-offload for wpa2 decrypt.
  2014-07-29 22:29 ` Christian Lamparter
  2014-07-29 22:50 ` Ben Greear
@ 2014-07-30 7:06 ` Johannes Berg
  1 sibling, 0 replies; 21+ messages in thread
From: Johannes Berg @ 2014-07-30 7:06 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: Ben Greear, linux-wireless@vger.kernel.org

On Wed, 2014-07-30 at 00:29 +0200, Christian Lamparter wrote:

> 1. the fpu_begin and fpu_end calls should be added to
> ieee80211_crypto_ccmp_encrypt in net/mac80211/wpa.c.
>
> >+	kernel_fpu_begin();
> > 	skb_queue_walk(&tx->skbs, skb) {
> > 		if (ccmp_encrypt_skb(tx, skb) < 0)
> > 			return TX_DROP;
> > 	}
> >+	kernel_fpu_end();
> >
> > 	return TX_CONTINUE;

I don't really want to jump in here but I'll point out that this would
be mostly useless afaict as the list is only iterated if you have
software fragmentation.

johannes

^ permalink raw reply [flat|nested] 21+ messages in thread