Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed
* How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)
From: Andy Lutomirski @ 2017-01-02 16:52 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20170102094907.GC30735@node.shutemov.name>

On Mon, Jan 2, 2017 at 1:49 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Fri, Dec 30, 2016 at 06:11:05PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov <dsafonov@virtuozzo.com> wrote:
>> > Keep task's virtual address space size as mm_struct field which
>> > exists for a long time - it's initialized in setup_new_exec()
>> > depending on the new task's personality.
>> > This way TASK_SIZE will always be the same as current->mm->task_size.
>> > Previously, there could be an issue about different values of
>> > TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can unset
>> > ADDR_LIMIT_3GB personality (with personality syscall) and
>> > so TASK_SIZE will be 4Gb, which is larger than mm->task_size = 3Gb.
>> > As TASK_SIZE *and* current->mm->task_size are used both in code
>> > frequently, this difference creates a subtle situations, for example:
>> > one can mmap addresses > 3Gb, but they will be hidden in
>> > /proc/pid/pagemap as it checks mm->task_size.
>> > I've moved initialization of mm->task_size earlier in setup_new_exec()
>> > as arch_pick_mmap_layout() initializes mmap_legacy_base with
>> > TASK_UNMAPPED_BASE, which depends on TASK_SIZE.
>>
>> I don't like this patch so much because I think that we should figure
>> out how this will all work in the long run first.  I've added some
>> more people to the thread because other arches have similar issues and
>> because x86 is about to get considerably more complicated (choices
>> include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).
>>
>> Here are a few of my thoughts on the matter.  This isn't all that well
>> thought out:
>>
>> The address space limit, especially if CRIU is in play, isn't really a
>> hard limit.  For example, you could allocate high memory then lower
>> the limit.  Similarly, I see no reason that an x32 program should be
>> forbidden from mapping some high addresses or, similarly, that an i386
>> program can't (if it really wanted to) do a 64-bit mmap() and get a
>> high address.
>>
>> On that note, can we just *delete* the task_size check from pagemap?
>> It's been there since the very beginning:
>>
>> commit 85863e475e59afb027b0113290e3796ee6020b7d
>> Author: Matt Mackall <mpm@selenic.com>
>> Date:   Mon Feb 4 22:29:04 2008 -0800
>>
>>     maps4: add /proc/pid/pagemap interface
>>
>> and there's no explanation for why it's needed.
>>
>> So maybe we should have a *number* (not a bit) that indicates the
>> maximum address that mmap() will return unless an override is in use.
>> Since common practice seems to be to stick this in the personality
>> field, we may need some fancy encoding.  Executing a setuid binary
>> needs to reset to the default, and personality handles that.
>
> If we want to be able to specify arbitrary address as maximum, a fancy
> encoding would need to claim 51 bits (63 VA - 12 in-page address) on x86
> from the persona flag.
> To me, it's stretching personality interface too far.
>
> Maybe it's easier to reset the rlimit for suid binaries?

I guess I don't see why rlimit makes any sense, though.  It's not a
resource utilization control, hard vs soft limits make very little
sense, requiring capabilities to exceed the hard limit doesn't help
anything, and it's only useful to preserve it across execve() to work
around bugs.

So if it's going to be a number, let's just make it be a new number
with a new API to control it.

--Andy

^ permalink raw reply

* [PATCH 0/3] crypto: picoxcell - Cleanups removing non-DT code
From: Javier Martinez Canillas @ 2017-01-02 17:06 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

This small series contains a couple of cleanups that removes some driver's code
that isn't needed due the driver being for a DT-only platform.

The changes were suggested by Arnd Bergmann as a response to a previous patch:
https://lkml.org/lkml/2017/1/2/342

Patch #1 allows the driver to be built when the COMPILE_TEST option is enabled.
Patch #2 removes the platform ID table since isn't needed for DT-only drivers.
Patch #3 removes a wrapper function that's also not needed if driver is DT-only.

Best regards,


Javier Martinez Canillas (3):
  crypto: picoxcell - Allow driver to build COMPILE_TEST is enabled
  crypto: picoxcell - Remove platform device ID table
  crypto: picoxcell - Remove spacc_is_compatible() wrapper function

 drivers/crypto/Kconfig            |  2 +-
 drivers/crypto/picoxcell_crypto.c | 28 +++-------------------------
 2 files changed, 4 insertions(+), 26 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH 2/3] crypto: picoxcell - Remove platform device ID table
From: Javier Martinez Canillas @ 2017-01-02 17:06 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1483376819-26726-1-git-send-email-javier@osg.samsung.com>

This driver is only used in the picoxcell platform and this is DT-only.

So only a OF device ID table is needed and there's no need to have a
platform device ID table. This patch removes the unneeded table.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
---

 drivers/crypto/picoxcell_crypto.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/crypto/picoxcell_crypto.c b/drivers/crypto/picoxcell_crypto.c
index 47576098831f..539effbbfc7a 100644
--- a/drivers/crypto/picoxcell_crypto.c
+++ b/drivers/crypto/picoxcell_crypto.c
@@ -1803,12 +1803,6 @@ static int spacc_remove(struct platform_device *pdev)
 	return 0;
 }
 
-static const struct platform_device_id spacc_id_table[] = {
-	{ "picochip,spacc-ipsec", },
-	{ "picochip,spacc-l2", },
-	{ }
-};
-
 static struct platform_driver spacc_driver = {
 	.probe		= spacc_probe,
 	.remove		= spacc_remove,
@@ -1819,7 +1813,6 @@ static struct platform_driver spacc_driver = {
 #endif /* CONFIG_PM */
 		.of_match_table	= of_match_ptr(spacc_of_id_table),
 	},
-	.id_table	= spacc_id_table,
 };
 
 module_platform_driver(spacc_driver);
-- 
2.7.4

^ permalink raw reply related

* [PATCH 3/3] crypto: picoxcell - Remove spacc_is_compatible() wrapper function
From: Javier Martinez Canillas @ 2017-01-02 17:06 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1483376819-26726-1-git-send-email-javier@osg.samsung.com>

The function is used to check either the platform device ID name or the OF
node's compatible (depending how the device was registered) to know which
device type was registered.

But the driver is for a DT-only platform and so there's no need for this
level of indirection since the devices can only be registered via OF.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>

---

 drivers/crypto/picoxcell_crypto.c | 21 +++------------------
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/drivers/crypto/picoxcell_crypto.c b/drivers/crypto/picoxcell_crypto.c
index 539effbbfc7a..b6f14844702e 100644
--- a/drivers/crypto/picoxcell_crypto.c
+++ b/drivers/crypto/picoxcell_crypto.c
@@ -1616,32 +1616,17 @@ static const struct of_device_id spacc_of_id_table[] = {
 MODULE_DEVICE_TABLE(of, spacc_of_id_table);
 #endif /* CONFIG_OF */
 
-static bool spacc_is_compatible(struct platform_device *pdev,
-				const char *spacc_type)
-{
-	const struct platform_device_id *platid = platform_get_device_id(pdev);
-
-	if (platid && !strcmp(platid->name, spacc_type))
-		return true;
-
-#ifdef CONFIG_OF
-	if (of_device_is_compatible(pdev->dev.of_node, spacc_type))
-		return true;
-#endif /* CONFIG_OF */
-
-	return false;
-}
-
 static int spacc_probe(struct platform_device *pdev)
 {
 	int i, err, ret = -EINVAL;
 	struct resource *mem, *irq;
+	struct device_node *np = pdev->dev.of_node;
 	struct spacc_engine *engine = devm_kzalloc(&pdev->dev, sizeof(*engine),
 						   GFP_KERNEL);
 	if (!engine)
 		return -ENOMEM;
 
-	if (spacc_is_compatible(pdev, "picochip,spacc-ipsec")) {
+	if (of_device_is_compatible(np, "picochip,spacc-ipsec")) {
 		engine->max_ctxs	= SPACC_CRYPTO_IPSEC_MAX_CTXS;
 		engine->cipher_pg_sz	= SPACC_CRYPTO_IPSEC_CIPHER_PG_SZ;
 		engine->hash_pg_sz	= SPACC_CRYPTO_IPSEC_HASH_PG_SZ;
@@ -1650,7 +1635,7 @@ static int spacc_probe(struct platform_device *pdev)
 		engine->num_algs	= ARRAY_SIZE(ipsec_engine_algs);
 		engine->aeads		= ipsec_engine_aeads;
 		engine->num_aeads	= ARRAY_SIZE(ipsec_engine_aeads);
-	} else if (spacc_is_compatible(pdev, "picochip,spacc-l2")) {
+	} else if (of_device_is_compatible(np, "picochip,spacc-l2")) {
 		engine->max_ctxs	= SPACC_CRYPTO_L2_MAX_CTXS;
 		engine->cipher_pg_sz	= SPACC_CRYPTO_L2_CIPHER_PG_SZ;
 		engine->hash_pg_sz	= SPACC_CRYPTO_L2_HASH_PG_SZ;
-- 
2.7.4

^ permalink raw reply related

* [PATCH 0/3] crypto: picoxcell - Cleanups removing non-DT code
From: Arnd Bergmann @ 2017-01-02 17:10 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1483376819-26726-1-git-send-email-javier@osg.samsung.com>

On Monday, January 2, 2017 2:06:56 PM CET Javier Martinez Canillas wrote:
> 
> This small series contains a couple of cleanups that removes some driver's code
> that isn't needed due the driver being for a DT-only platform.
> 
> The changes were suggested by Arnd Bergmann as a response to a previous patch:
> https://lkml.org/lkml/2017/1/2/342
> 
> Patch #1 allows the driver to be built when the COMPILE_TEST option is enabled.
> Patch #2 removes the platform ID table since isn't needed for DT-only drivers.
> Patch #3 removes a wrapper function that's also not needed if driver is DT-only.
> 
> 

Looks good, but I don't know if the first patch causes some build warnings
on non-ARM platforms, better wait at least for the 0-day build results,
and maybe build-test on x86-32 and x86-64.

	Arnd

^ permalink raw reply

* [GIT PULL] ARM: exynos: Late mach/soc for v4.10
From: Sylwester Nawrocki @ 2017-01-02 17:25 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20170102155654.oey6c542vjcqyit2@kozik-lap>

On 01/02/2017 04:56 PM, Krzysztof Kozlowski wrote:
> On Mon, Jan 02, 2017 at 10:20:21AM +0100, Sylwester Nawrocki wrote:
>> On 12/30/2016 04:53 PM, Krzysztof Kozlowski wrote:
>>> Any comments on this? I guess it won't come as late-late-4.10, so can
>>> you pull it for v4.11?
>>  
>>>> Sylwester Nawrocki (1):
>>>>       ARM: S3C24XX: Add DMA slave maps for remaining s3c24xx SoCs
>> We need this patch in v4.10 to avoid possible I2S and MMC regressions
>> on selected s3c24xx SoC, since the DMA clients are already modified.
>> If the patch goes in only for v4.11 it would be good to mark it for 
>> inclusion in v4.10 stable kernels.  
>
> You didn't mention any strict dependencies when sending this patch...
> What do you mean by "needing patch in v4.10"? Is the code already
> not bisectable? Already broken?

Yes, unfortunately on s3c2410, s3c2412 and s3c2443.  I didn't notice
when sending patches for the s3c24xx-iis and s3c2440-sdi drivers 
the dma_slave maps were only added for s3c2440 (commit 34681d84a0f7cc2
dmaengine: s3c24xx: Add dma_slave_map for s3c2440 devices) and not for
remaining SoCs covered by these DMA client drivers.

-- 
Thanks,
Sylwester

^ permalink raw reply

* [GIT PULL] ARM: exynos: Late mach/soc for v4.10
From: Krzysztof Kozlowski @ 2017-01-02 17:35 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <c727069c-c21e-be4d-89f9-f27e1c6b9af2@samsung.com>

On Mon, Jan 02, 2017 at 06:25:44PM +0100, Sylwester Nawrocki wrote:
> On 01/02/2017 04:56 PM, Krzysztof Kozlowski wrote:
> > On Mon, Jan 02, 2017 at 10:20:21AM +0100, Sylwester Nawrocki wrote:
> >> On 12/30/2016 04:53 PM, Krzysztof Kozlowski wrote:
> >>> Any comments on this? I guess it won't come as late-late-4.10, so can
> >>> you pull it for v4.11?
> >>  
> >>>> Sylwester Nawrocki (1):
> >>>>       ARM: S3C24XX: Add DMA slave maps for remaining s3c24xx SoCs
> >> We need this patch in v4.10 to avoid possible I2S and MMC regressions
> >> on selected s3c24xx SoC, since the DMA clients are already modified.
> >> If the patch goes in only for v4.11 it would be good to mark it for 
> >> inclusion in v4.10 stable kernels.  
> >
> > You didn't mention any strict dependencies when sending this patch...
> > What do you mean by "needing patch in v4.10"? Is the code already
> > not bisectable? Already broken?
> 
> Yes, unfortunately on s3c2410, s3c2412 and s3c2443.  I didn't notice
> when sending patches for the s3c24xx-iis and s3c2440-sdi drivers 
> the dma_slave maps were only added for s3c2440 (commit 34681d84a0f7cc2
> dmaengine: s3c24xx: Add dma_slave_map for s3c2440 devices) and not for
> remaining SoCs covered by these DMA client drivers.

Okay, so either entire pull goes into v4.10 or I will need a resend of
this one patch with a little bit extended message why this is a v4.10
fix.

BR,
Krzysztof

^ permalink raw reply

* [PATCH] cpufreq: s3c64xx: remove incorrect __init annotation
From: Krzysztof Kozlowski @ 2017-01-02 17:36 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20170102070903.GC8588@vireshk-i7>

On Mon, Jan 02, 2017 at 12:39:03PM +0530, Viresh Kumar wrote:
> On 16-12-16, 10:06, Arnd Bergmann wrote:
> > s3c64xx_cpufreq_config_regulator is incorrectly annotated
> > as __init, since the caller is also not init:
> > 
> > WARNING: vmlinux.o(.text+0x92fe1c): Section mismatch in reference from the function s3c64xx_cpufreq_driver_init() to the function .init.text:s3c64xx_cpufreq_config_regulator()
> > 
> > With modern gcc versions, the function gets inline, so we don't
> > see the warning, this only happens with gcc-4.6 and older.
> > 
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > ---
> >  drivers/cpufreq/s3c64xx-cpufreq.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cpufreq/s3c64xx-cpufreq.c b/drivers/cpufreq/s3c64xx-cpufreq.c
> > index 176e84cc3991..0cb9040eca49 100644
> > --- a/drivers/cpufreq/s3c64xx-cpufreq.c
> > +++ b/drivers/cpufreq/s3c64xx-cpufreq.c
> > @@ -107,7 +107,7 @@ static int s3c64xx_cpufreq_set_target(struct cpufreq_policy *policy,
> >  }
> >  
> >  #ifdef CONFIG_REGULATOR
> > -static void __init s3c64xx_cpufreq_config_regulator(void)
> > +static void s3c64xx_cpufreq_config_regulator(void)
> >  {
> >  	int count, v, i, found;
> >  	struct cpufreq_frequency_table *freq;
> 
> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Rafael,
Are you going to pick it up?

Best regards,
Krzysztof

^ permalink raw reply

* [PATCH 0/3] crypto: picoxcell - Cleanups removing non-DT code
From: Javier Martinez Canillas @ 2017-01-02 17:49 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <2309314.lMWYhcWsTB@wuerfel>

Hello Arnd,

On 01/02/2017 02:10 PM, Arnd Bergmann wrote:
> On Monday, January 2, 2017 2:06:56 PM CET Javier Martinez Canillas wrote:
>>
>> This small series contains a couple of cleanups that removes some driver's code
>> that isn't needed due the driver being for a DT-only platform.
>>
>> The changes were suggested by Arnd Bergmann as a response to a previous patch:
>> https://lkml.org/lkml/2017/1/2/342
>>
>> Patch #1 allows the driver to be built when the COMPILE_TEST option is enabled.
>> Patch #2 removes the platform ID table since isn't needed for DT-only drivers.
>> Patch #3 removes a wrapper function that's also not needed if driver is DT-only.
>>
>>
> 
> Looks good, but I don't know if the first patch causes some build warnings

Thanks for looking at the patches.

> on non-ARM platforms, better wait at least for the 0-day build results,
> and maybe build-test on x86-32 and x86-64.
> 

I should had mentioned that I built tested for arm, arm64, x86-32 and x86-64,
and saw now issues. But I agree with you that it's better to wait for the
0-day builder in case it reports issues on some platforms.

> 	Arnd
> 

Best regards,
-- 
Javier Martinez Canillas
Open Source Group
Samsung Research America

^ permalink raw reply

* [PATCH 2/9] ARM: dts: omap3: Add an empty chosen node to top level DTSI
From: Pali Rohár @ 2017-01-02 18:01 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1482158681-4530-3-git-send-email-javier@osg.samsung.com>

On Monday 19 December 2016 15:44:34 Javier Martinez Canillas wrote:
> Commit 008a2ebcd677 ("ARM: dts: omap3: Remove skeleton.dtsi usage")
> removed the skeleton.dtsi usage since we want to get rid of it.
> 
> But this can cause issues when booting a kernel with a boot-loader
> that doesn't create a chosen node if this isn't present in the DTB
> since the decompressor relies on a pre-existing chosen node to be
> available to insert the command line and merge other ATAGS info.
> 
> Fixes: 008a2ebcd677 ("ARM: dts: omap3: Remove skeleton.dtsi usage")
> Reported-by: Pali Rohar <pali.rohar@gmail.com>
> Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>

Including empty chosen node fixes (or rather workaround) this problem. 
You can add my Tested-By.

-- 
Pali Roh?r
pali.rohar at gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20170102/310e22a8/attachment.sig>

^ permalink raw reply

* [v2,3/3] CLK: add more managed APIs
From: Guenter Roeck @ 2017-01-02 18:09 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1353562482-12422-4-git-send-email-dmitry.torokhov@gmail.com>

On Wed, Nov 21, 2012 at 09:34:42PM -0800, Dmitry Torokhov wrote:
> When converting a driver to managed resources it is desirable to be able to
> manage all resources in the same fashion. This change allows managing clocks
> in the same way we manage all other resources.
> 
> This adds the following managed APIs:
> 
> - devm_clk_prepare()/devm_clk_unprepare();
> - devm_clk_enable()/devm_clk_disable();
> - devm_clk_preapre_enable()/devm_clk_diable_unprepare().

s/devm_clk_preapre_enable/devm_clk_prepare_enable/
s//devm_clk_diable_unprepare//devm_clk_disable_unprepare/
> 
> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

What happened with this patch ? I find it highly inconventient having to add
devm_add_action_or_reset() for pretty much every call to clk_prepare_enable().

Another odd one is that there is a devm_clk_get(), but no devm_of_clk_get()
or devm_of_clk_get_by_name().

Thanks,
Guenter

> ---
>  drivers/clk/clk-devres.c |  90 +++++++++++++++++++++++++++++++---------
>  include/linux/clk.h      | 105 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 176 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/clk/clk-devres.c b/drivers/clk/clk-devres.c
> index 8f57154..3a2286b 100644
> --- a/drivers/clk/clk-devres.c
> +++ b/drivers/clk/clk-devres.c
> @@ -9,6 +9,32 @@
>  #include <linux/export.h>
>  #include <linux/gfp.h>
>  
> +static int devm_clk_match(struct device *dev, void *res, void *data)
> +{
> +	struct clk **c = res;
> +
> +	if (WARN_ON(!c || !*c))
> +		return 0;
> +
> +	return *c == data;
> +}
> +
> +
> +static int devm_clk_create_devres(struct device *dev, struct clk *clk,
> +				  void (*release)(struct device *, void *))
> +{
> +	struct clk **ptr;
> +
> +	ptr = devres_alloc(release, sizeof(*ptr), GFP_KERNEL);
> +	if (!ptr)
> +		return -ENOMEM;
> +
> +	*ptr = clk;
> +	devres_add(dev, ptr);
> +
> +	return 0;
> +}
> +
>  static void devm_clk_release(struct device *dev, void *res)
>  {
>  	clk_put(*(struct clk **)res);
> @@ -16,34 +42,22 @@ static void devm_clk_release(struct device *dev, void *res)
>  
>  struct clk *devm_clk_get(struct device *dev, const char *id)
>  {
> -	struct clk **ptr, *clk;
> -
> -	ptr = devres_alloc(devm_clk_release, sizeof(*ptr), GFP_KERNEL);
> -	if (!ptr)
> -		return ERR_PTR(-ENOMEM);
> +	struct clk *clk;
> +	int error;
>  
>  	clk = clk_get(dev, id);
>  	if (!IS_ERR(clk)) {
> -		*ptr = clk;
> -		devres_add(dev, ptr);
> -	} else {
> -		devres_free(ptr);
> +		error = devm_clk_create_devres(dev, clk, devm_clk_release);
> +		if (error) {
> +			clk_put(clk);
> +			return ERR_PTR(error);
> +		}
>  	}
>  
>  	return clk;
>  }
>  EXPORT_SYMBOL(devm_clk_get);
>  
> -static int devm_clk_match(struct device *dev, void *res, void *data)
> -{
> -	struct clk **c = res;
> -	if (!c || !*c) {
> -		WARN_ON(!c || !*c);
> -		return 0;
> -	}
> -	return *c == data;
> -}
> -
>  void devm_clk_put(struct device *dev, struct clk *clk)
>  {
>  	int ret;
> @@ -53,3 +67,41 @@ void devm_clk_put(struct device *dev, struct clk *clk)
>  	WARN_ON(ret);
>  }
>  EXPORT_SYMBOL(devm_clk_put);
> +
> +#define DEFINE_DEVM_CLK_OP(create_op, destroy_op)			\
> +static void devm_##destroy_op##_release(struct device *devm, void *res)	\
> +{									\
> +	destroy_op(*(struct clk **)res);				\
> +}									\
> +									\
> +int devm_##create_op(struct device *dev, struct clk *clk)		\
> +{									\
> +	int error;							\
> +									\
> +	error = devm_clk_create_devres(dev, clk,			\
> +					devm_##destroy_op##_release);	\
> +	if (error)							\
> +		return error;						\
> +									\
> +	error = create_op(clk);						\
> +	if (error) {							\
> +		WARN_ON(devres_destroy(dev,				\
> +					devm_##destroy_op##_release,	\
> +					devm_clk_match, clk));		\
> +		return error;						\
> +	}								\
> +									\
> +	return 0;							\
> +}									\
> +EXPORT_SYMBOL(devm_##create_op);					\
> +									\
> +void devm_##destroy_op(struct device *dev, struct clk *clk)		\
> +{									\
> +	WARN_ON(devres_release(dev, devm_##destroy_op##_release,	\
> +				devm_clk_match, clk));			\
> +}									\
> +EXPORT_SYMBOL(devm_##destroy_op)
> +
> +DEFINE_DEVM_CLK_OP(clk_prepare, clk_unprepare);
> +DEFINE_DEVM_CLK_OP(clk_prepare_enable, clk_disable_unprepare);
> +DEFINE_DEVM_CLK_OP(clk_enable, clk_disable);
> diff --git a/include/linux/clk.h b/include/linux/clk.h
> index 8bf149e..04b6300 100644
> --- a/include/linux/clk.h
> +++ b/include/linux/clk.h
> @@ -133,6 +133,17 @@ struct clk *devm_clk_get(struct device *dev, const char *id);
>  int clk_prepare(struct clk *clk);
>  
>  /**
> + * devm_clk_prepare - prepare a clock source as managed resource
> + * @dev: device owning the resource
> + * @clk: clock source
> + *
> + * This prepares the clock source for use.
> + *
> + * Must not be called from within atomic context.
> + */
> +int devm_clk_prepare(struct device *dev, struct clk *clk);
> +
> +/**
>   * clk_unprepare - undo preparation of a clock source
>   * @clk: clock source
>   *
> @@ -144,6 +155,18 @@ int clk_prepare(struct clk *clk);
>  void clk_unprepare(struct clk *clk);
>  
>  /**
> + * devm_clk_unprepare - undo preparation of a managed clock source.
> + * @dev: device used to prepare the clock
> + * @clk: clock source
> + *
> + * This undoes preparation of a clock previously prepared with call
> + * to devm_clk_pepare().
> + *
> + * Must not be called from within atomic context.
> + */
> +void devm_clk_unprepare(struct device *dev, struct clk *clk);
> +
> +/**
>   * clk_enable - inform the system when the clock source should be running.
>   * @clk: clock source
>   *
> @@ -156,6 +179,19 @@ void clk_unprepare(struct clk *clk);
>  int clk_enable(struct clk *clk);
>  
>  /**
> + * devm_clk_enable - enable the clock source as managed resource
> + * @dev: device owning the resource
> + * @clk: clock source
> + *
> + * If the clock can not be enabled/disabled, this should return success.
> + *
> + * May be not called from atomic contexts.
> + *
> + * Returns success (0) or negative errno.
> + */
> +int devm_clk_enable(struct device *dev, struct clk *clk);
> +
> +/**
>   * clk_disable - inform the system when the clock source is no longer required.
>   * @clk: clock source
>   *
> @@ -172,6 +208,18 @@ int clk_enable(struct clk *clk);
>  void clk_disable(struct clk *clk);
>  
>  /**
> + * devm_clk_disable - disable managed clock source resource
> + * @dev: device used to enable the clock
> + * @clk: clock source
> + *
> + * Inform the system that a clock source is no longer required by
> + * a driver and may be shut down.
> + *
> + * Must not be called from atomic contexts.
> + */
> +void devm_clk_disable(struct device *dev, struct clk *clk);
> +
> +/**
>   * clk_prepare_enable - prepare and enable a clock source
>   * @clk: clock source
>   *
> @@ -182,6 +230,17 @@ void clk_disable(struct clk *clk);
>  int clk_prepare_enable(struct clk *clk);
>  
>  /**
> + * devm_clk_prepare_enable - prepare and enable a managed clock source
> + * @dev: device owning the clock source
> + * @clk: clock source
> + *
> + * This prepares the clock source for use and enables it.
> + *
> + * Must not be called from within atomic context.
> + */
> +int devm_clk_prepare_enable(struct device *dev, struct clk *clk);
> +
> +/**
>   * clk_disable_unprepare - disable and undo preparation of a clock source
>   * @clk: clock source
>   *
> @@ -192,6 +251,17 @@ int clk_prepare_enable(struct clk *clk);
>  void clk_disable_unprepare(struct clk *clk);
>  
>  /**
> + * clk_disable_unprepare - disable and undo preparation of a managed clock source
> + * @dev: device used to prepare and enable the clock
> + * @clk: clock source
> + *
> + * This disables and undoes a previously prepared clock.
> + *
> + * Must not be called from within atomic context.
> + */
> +void devm_clk_disable_unprepare(struct device *dev, struct clk *clk);
> +
> +/**
>   * clk_get_rate - obtain the current clock rate (in Hz) for a clock source.
>   *		  This is only valid once the clock source has been enabled.
>   * @clk: clock source
> @@ -303,29 +373,64 @@ static inline int clk_prepare(struct clk *clk)
>  	return 0;
>  }
>  
> +static inline int devm_clk_prepare(struct device *dev, struct clk *clk)
> +{
> +	might_sleep();
> +	return 0;
> +}
> +
>  static inline void clk_unprepare(struct clk *clk)
>  {
>  	might_sleep();
>  }
>  
> +static inline void devm_clk_unprepare(struct device *dev, struct clk *clk)
> +{
> +	might_sleep();
> +}
> +
>  static inline int clk_enable(struct clk *clk)
>  {
>  	return 0;
>  }
>  
> +static inline int devm_clk_enable(struct device *dev, struct clk *clk)
> +{
> +	might_sleep();
> +	return 0;
> +}
> +
>  static inline void clk_disable(struct clk *clk) {}
>  
> +static inline void devm_clk_disable(struct device *dev, struct clk *clk)
> +{
> +	might_sleep();
> +	return 0;
> +}
> +
>  static inline int clk_prepare_enable(struct clk *clk)
>  {
>  	might_sleep();
>  	return 0;
>  }
>  
> +static inline int devm_clk_prepare_enable(struct device *dev, struct clk *clk)
> +{
> +	might_sleep();
> +	return 0;
> +}
> +
>  static inline void clk_disable_unprepare(struct clk *clk)
>  {
>  	might_sleep();
>  }
>  
> +static inline void devm_clk_disable_unprepare(struct device *dev,
> +					      struct clk *clk)
> +{
> +	might_sleep();
> +}
> +
>  static inline unsigned long clk_get_rate(struct clk *clk)
>  {
>  	return 0;

^ permalink raw reply

* 4.10-rc1 on Nokia N900: regression, WARN_ON() omap_l3_smx.c
From: Tony Lindgren @ 2017-01-02 18:10 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161229231411.GA6865@amd>

* Pavel Machek <pavel@ucw.cz> [161229 15:14]:
> Hi!
> 
> I forgot I had v4.10-rc1 running, and now I got warning on all the
> consoles (hand-copied).
> 
> 
> Unhandled fault: external abort on non-linefetch (0x1028) at
> 0xfa0ab060
> ...
> Comm: kworker/0:0 Not tainted.
> Workqueue: events musb_irq_work
> ...
> PC is at musb_default_readb().
> ...

This means the clocks are not enabled at that point.

> WARNING: CPU: 0 ... at drivers/bus/omap_l3_smx.c:166
> omap3_l3_app_irq+0xcc/...
> Tainted: GDW.

If you comment out postcore_initcall_sync(omap3_l3_init);
in drivers/bus/omap_l3_smx.c you'll see the proper stack
trace instead of the l3 interrupt trace. The system will
hang at that point most likely.

> I do have patches to allow nfsroot over usb. But they worked ok in
> v4.9... Does anyone see it, too?

Hmm not much has changed since v4.9. Are you sure you
had v4.9 or some earlier v4.9-rc version?

Regards,

Tony

^ permalink raw reply

* [PATCH] pinctrl: qcom: msm8660: rename some SDC1->SDC4
From: Bjorn Andersson @ 2017-01-02 18:19 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20170102084228.7890-1-linus.walleij@linaro.org>

On Mon 02 Jan 00:42 PST 2017, Linus Walleij wrote:

> These four pins are for SDC4, not SDC1. They are grouped for
> SDC4 later in the file so this must be a typo.
> 
> Cc: Bj?rn Andersson <bjorn.andersson@linaro.org>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Regards,
Bjorn

> ---
>  drivers/pinctrl/qcom/pinctrl-msm8660.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pinctrl/qcom/pinctrl-msm8660.c b/drivers/pinctrl/qcom/pinctrl-msm8660.c
> index 5591d093bf78..bb71dd1e6279 100644
> --- a/drivers/pinctrl/qcom/pinctrl-msm8660.c
> +++ b/drivers/pinctrl/qcom/pinctrl-msm8660.c
> @@ -193,9 +193,9 @@ static const struct pinctrl_pin_desc msm8660_pins[] = {
>  	PINCTRL_PIN(171, "GPIO_171"),
>  	PINCTRL_PIN(172, "GPIO_172"),
>  
> -	PINCTRL_PIN(173, "SDC1_CLK"),
> -	PINCTRL_PIN(174, "SDC1_CMD"),
> -	PINCTRL_PIN(175, "SDC1_DATA"),
> +	PINCTRL_PIN(173, "SDC4_CLK"),
> +	PINCTRL_PIN(174, "SDC4_CMD"),
> +	PINCTRL_PIN(175, "SDC4_DATA"),
>  	PINCTRL_PIN(176, "SDC3_CLK"),
>  	PINCTRL_PIN(177, "SDC3_CMD"),
>  	PINCTRL_PIN(178, "SDC3_DATA"),
> -- 
> 2.9.3
> 

^ permalink raw reply

* [PATCH 0/6] crypto: ARM/arm64 - AES and ChaCha20 updates for v4.11
From: Ard Biesheuvel @ 2017-01-02 18:21 UTC (permalink / raw)
  To: linux-arm-kernel

This series adds SIMD implementations for arm64 and ARM of ChaCha20 (*),
and a port of the ARM bit-sliced AES algorithm to arm64, and 

Patch #1 is a prerequisite for the AES-XTS implementation in #6, which needs
a secondary AES transform to generate the initial tweak.

Patch #2 optimizes the bit-sliced AES glue code for ARM to iterate over the
input in the most efficient manner possible.

Patch #3 adds a NEON implementation of ChaCha20 for ARM.

Patch #4 adds a NEON implementation of ChaCha20 for arm64.

Patch #5 modifies the existing NEON and ARMv8 Crypto Extensions implementations
of AES-CTR to be available as a synchronous skcipher as well. This is intended
for the mac80211 code, which uses synchronous encapsulations of ctr(aes)
[ccm, gcm] in softirq context, which supports SIMD algorithms on arm64.

Patch #6 adds a port of the ARM bit-sliced AES code to arm64, in ECB, CTR
and XTS modes.

Ard Biesheuvel (6):
  crypto: generic/aes - export encrypt and decrypt entry points
  crypto: arm/aes-neonbs - process 8 blocks in parallel if we can
  crypto: arm/chacha20 - implement NEON version based on SSE3 code
  crypto: arm64/chacha20 - implement NEON version based on SSE3 code
  crypto: arm64/aes-blk - expose AES-CTR as synchronous cipher as well
  crypto: arm64/aes - reimplement bit-sliced ARM/NEON implementation for
    arm64

 arch/arm/crypto/Kconfig                |   6 +
 arch/arm/crypto/Makefile               |   2 +
 arch/arm/crypto/aesbs-glue.c           |  67 +-
 arch/arm/crypto/chacha20-neon-core.S   | 524 ++++++++++++
 arch/arm/crypto/chacha20-neon-glue.c   | 128 +++
 arch/arm64/crypto/Kconfig              |  13 +
 arch/arm64/crypto/Makefile             |   6 +
 arch/arm64/crypto/aes-glue.c           |  25 +-
 arch/arm64/crypto/aes-neonbs-core.S    | 879 ++++++++++++++++++++
 arch/arm64/crypto/aes-neonbs-glue.c    | 344 ++++++++
 arch/arm64/crypto/chacha20-neon-core.S | 450 ++++++++++
 arch/arm64/crypto/chacha20-neon-glue.c | 127 +++
 crypto/aes_generic.c                   |  10 +-
 include/crypto/aes.h                   |   3 +
 14 files changed, 2549 insertions(+), 35 deletions(-)
 create mode 100644 arch/arm/crypto/chacha20-neon-core.S
 create mode 100644 arch/arm/crypto/chacha20-neon-glue.c
 create mode 100644 arch/arm64/crypto/aes-neonbs-core.S
 create mode 100644 arch/arm64/crypto/aes-neonbs-glue.c
 create mode 100644 arch/arm64/crypto/chacha20-neon-core.S
 create mode 100644 arch/arm64/crypto/chacha20-neon-glue.c

-- 
2.7.4

^ permalink raw reply

* [PATCH 1/6] crypto: generic/aes - export encrypt and decrypt entry points
From: Ard Biesheuvel @ 2017-01-02 18:21 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1483381268-12987-1-git-send-email-ard.biesheuvel@linaro.org>

The generic AES code already shares its key schedule generation
routines (and its S-boxes) with other implementations via external
linkage. In the same way, export the core encrypt/decrypt routines
so they may be reused by other drivers as well.

This facility will be used by the bit slicing implementation of AES
in XTS mode for arm64, where using the 8-way cipher (and its ~2 KB
expanded key schedule) to generate the initial tweak is suboptimal.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/aes_generic.c | 10 ++++++----
 include/crypto/aes.h |  3 +++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/crypto/aes_generic.c b/crypto/aes_generic.c
index 3dd101144a58..26fd7b8c2e5f 100644
--- a/crypto/aes_generic.c
+++ b/crypto/aes_generic.c
@@ -1326,7 +1326,7 @@ EXPORT_SYMBOL_GPL(crypto_aes_set_key);
 	f_rl(bo, bi, 3, k);	\
 } while (0)
 
-static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+void crypto_aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 {
 	const struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
 	const __le32 *src = (const __le32 *)in;
@@ -1366,6 +1366,7 @@ static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 	dst[2] = cpu_to_le32(b0[2]);
 	dst[3] = cpu_to_le32(b0[3]);
 }
+EXPORT_SYMBOL_GPL(crypto_aes_encrypt);
 
 /* decrypt a block of text */
 
@@ -1398,7 +1399,7 @@ static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 	i_rl(bo, bi, 3, k);	\
 } while (0)
 
-static void aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+void crypto_aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 {
 	const struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
 	const __le32 *src = (const __le32 *)in;
@@ -1438,6 +1439,7 @@ static void aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 	dst[2] = cpu_to_le32(b0[2]);
 	dst[3] = cpu_to_le32(b0[3]);
 }
+EXPORT_SYMBOL_GPL(crypto_aes_decrypt);
 
 static struct crypto_alg aes_alg = {
 	.cra_name		=	"aes",
@@ -1453,8 +1455,8 @@ static struct crypto_alg aes_alg = {
 			.cia_min_keysize	=	AES_MIN_KEY_SIZE,
 			.cia_max_keysize	=	AES_MAX_KEY_SIZE,
 			.cia_setkey		=	crypto_aes_set_key,
-			.cia_encrypt		=	aes_encrypt,
-			.cia_decrypt		=	aes_decrypt
+			.cia_encrypt		=	crypto_aes_encrypt,
+			.cia_decrypt		=	crypto_aes_decrypt
 		}
 	}
 };
diff --git a/include/crypto/aes.h b/include/crypto/aes.h
index 7524ba3b6f3c..297fbba5d27b 100644
--- a/include/crypto/aes.h
+++ b/include/crypto/aes.h
@@ -32,6 +32,9 @@ extern const u32 crypto_fl_tab[4][256];
 extern const u32 crypto_it_tab[4][256];
 extern const u32 crypto_il_tab[4][256];
 
+void crypto_aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in);
+void crypto_aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in);
+
 int crypto_aes_set_key(struct crypto_tfm *tfm, const u8 *in_key,
 		unsigned int key_len);
 int crypto_aes_expand_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
-- 
2.7.4

^ permalink raw reply related

* [PATCH 2/6] crypto: arm/aes-neonbs - process 8 blocks in parallel if we can
From: Ard Biesheuvel @ 2017-01-02 18:21 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1483381268-12987-1-git-send-email-ard.biesheuvel@linaro.org>

The bit-sliced NEON implementation of AES only performs optimally if
it can process 8 blocks of input in parallel. This is due to the nature
of bit slicing, where the n-th bit of each byte of AES state of each input
block is collected into NEON register 'n', for registers q0 - q7.

This implies that the amount of work for the transform is fixed,
regardless of whether we are handling just one block or 8 in parallel.

So let's try a bit harder to iterate over the input in suitably sized
chunks, by setting the newly introduced walksize attribute to 8x the value
of AES_BLOCK_SIZE, and tweaking the loops to only process multiples of the
walk size, unless we are handling the last chunk in the input stream.

Note that the skcipher walk API guarantees that a step in the walk never
returns less than 'walksize' bytes if there are at least that many bytes
of input still available. However, it does *not* guarantee that those steps
produce an exact multiple of the walk size.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/aesbs-glue.c | 67 +++++++++++---------
 1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/arch/arm/crypto/aesbs-glue.c b/arch/arm/crypto/aesbs-glue.c
index d8e06de72ef3..f3019333c2eb 100644
--- a/arch/arm/crypto/aesbs-glue.c
+++ b/arch/arm/crypto/aesbs-glue.c
@@ -121,39 +121,26 @@ static int aesbs_cbc_encrypt(struct skcipher_request *req)
 	return crypto_cbc_encrypt_walk(req, aesbs_encrypt_one);
 }
 
-static inline void aesbs_decrypt_one(struct crypto_skcipher *tfm,
-				     const u8 *src, u8 *dst)
-{
-	struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
-
-	AES_decrypt(src, dst, &ctx->dec.rk);
-}
-
 static int aesbs_cbc_decrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
 	struct skcipher_walk walk;
-	unsigned int nbytes;
 	int err;
 
-	for (err = skcipher_walk_virt(&walk, req, false);
-	     (nbytes = walk.nbytes); err = skcipher_walk_done(&walk, nbytes)) {
-		u32 blocks = nbytes / AES_BLOCK_SIZE;
-		u8 *dst = walk.dst.virt.addr;
-		u8 *src = walk.src.virt.addr;
-		u8 *iv = walk.iv;
-
-		if (blocks >= 8) {
-			kernel_neon_begin();
-			bsaes_cbc_encrypt(src, dst, nbytes, &ctx->dec, iv);
-			kernel_neon_end();
-			nbytes %= AES_BLOCK_SIZE;
-			continue;
-		}
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while (walk.nbytes) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
 
-		nbytes = crypto_cbc_decrypt_blocks(&walk, tfm,
-						   aesbs_decrypt_one);
+		kernel_neon_begin();
+		bsaes_cbc_encrypt(walk.src.virt.addr, walk.dst.virt.addr,
+				  nbytes, &ctx->dec, walk.iv);
+		kernel_neon_end();
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 	return err;
 }
@@ -186,6 +173,12 @@ static int aesbs_ctr_encrypt(struct skcipher_request *req)
 		__be32 *ctr = (__be32 *)walk.iv;
 		u32 headroom = UINT_MAX - be32_to_cpu(ctr[3]);
 
+		if (walk.nbytes < walk.total) {
+			blocks = round_down(blocks,
+					    walk.stride / AES_BLOCK_SIZE);
+			tail = walk.nbytes - blocks * AES_BLOCK_SIZE;
+		}
+
 		/* avoid 32 bit counter overflow in the NEON code */
 		if (unlikely(headroom < blocks)) {
 			blocks = headroom + 1;
@@ -198,6 +191,9 @@ static int aesbs_ctr_encrypt(struct skcipher_request *req)
 		kernel_neon_end();
 		inc_be128_ctr(ctr, blocks);
 
+		if (tail > 0 && tail < AES_BLOCK_SIZE)
+			break;
+
 		err = skcipher_walk_done(&walk, tail);
 	}
 	if (walk.nbytes) {
@@ -227,11 +223,16 @@ static int aesbs_xts_encrypt(struct skcipher_request *req)
 	AES_encrypt(walk.iv, walk.iv, &ctx->twkey);
 
 	while (walk.nbytes) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
+
 		kernel_neon_begin();
 		bsaes_xts_encrypt(walk.src.virt.addr, walk.dst.virt.addr,
-				  walk.nbytes, &ctx->enc, walk.iv);
+				  nbytes, &ctx->enc, walk.iv);
 		kernel_neon_end();
-		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 	return err;
 }
@@ -249,11 +250,16 @@ static int aesbs_xts_decrypt(struct skcipher_request *req)
 	AES_encrypt(walk.iv, walk.iv, &ctx->twkey);
 
 	while (walk.nbytes) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
+
 		kernel_neon_begin();
 		bsaes_xts_decrypt(walk.src.virt.addr, walk.dst.virt.addr,
-				  walk.nbytes, &ctx->dec, walk.iv);
+				  nbytes, &ctx->dec, walk.iv);
 		kernel_neon_end();
-		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 	return err;
 }
@@ -272,6 +278,7 @@ static struct skcipher_alg aesbs_algs[] = { {
 	.min_keysize	= AES_MIN_KEY_SIZE,
 	.max_keysize	= AES_MAX_KEY_SIZE,
 	.ivsize		= AES_BLOCK_SIZE,
+	.walksize	= 8 * AES_BLOCK_SIZE,
 	.setkey		= aesbs_cbc_set_key,
 	.encrypt	= aesbs_cbc_encrypt,
 	.decrypt	= aesbs_cbc_decrypt,
@@ -290,6 +297,7 @@ static struct skcipher_alg aesbs_algs[] = { {
 	.max_keysize	= AES_MAX_KEY_SIZE,
 	.ivsize		= AES_BLOCK_SIZE,
 	.chunksize	= AES_BLOCK_SIZE,
+	.walksize	= 8 * AES_BLOCK_SIZE,
 	.setkey		= aesbs_ctr_set_key,
 	.encrypt	= aesbs_ctr_encrypt,
 	.decrypt	= aesbs_ctr_encrypt,
@@ -307,6 +315,7 @@ static struct skcipher_alg aesbs_algs[] = { {
 	.min_keysize	= 2 * AES_MIN_KEY_SIZE,
 	.max_keysize	= 2 * AES_MAX_KEY_SIZE,
 	.ivsize		= AES_BLOCK_SIZE,
+	.walksize	= 8 * AES_BLOCK_SIZE,
 	.setkey		= aesbs_xts_set_key,
 	.encrypt	= aesbs_xts_encrypt,
 	.decrypt	= aesbs_xts_decrypt,
-- 
2.7.4

^ permalink raw reply related

* [PATCH 3/6] crypto: arm/chacha20 - implement NEON version based on SSE3 code
From: Ard Biesheuvel @ 2017-01-02 18:21 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1483381268-12987-1-git-send-email-ard.biesheuvel@linaro.org>

This is a straight port to ARM/NEON of the x86 SSE3 implementation
of the ChaCha20 stream cipher. It uses the new skcipher walksize
attribute to process the input in strides of 4x the block size.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/Kconfig              |   6 +
 arch/arm/crypto/Makefile             |   2 +
 arch/arm/crypto/chacha20-neon-core.S | 524 ++++++++++++++++++++
 arch/arm/crypto/chacha20-neon-glue.c | 128 +++++
 4 files changed, 660 insertions(+)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 13f1b4c289d4..2f3339f015d3 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -130,4 +130,10 @@ config CRYPTO_CRC32_ARM_CE
 	depends on KERNEL_MODE_NEON && CRC32
 	select CRYPTO_HASH
 
+config CRYPTO_CHACHA20_NEON
+	tristate "NEON accelerated ChaCha20 symmetric cipher"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_BLKCIPHER
+	select CRYPTO_CHACHA20
+
 endif
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index b578a1820ab1..8d74e55eacd4 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
 obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
+obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
 
 ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
@@ -40,6 +41,7 @@ aes-arm-ce-y	:= aes-ce-core.o aes-ce-glue.o
 ghash-arm-ce-y	:= ghash-ce-core.o ghash-ce-glue.o
 crct10dif-arm-ce-y	:= crct10dif-ce-core.o crct10dif-ce-glue.o
 crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
+chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
 
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
new file mode 100644
index 000000000000..ff1d337bdb4a
--- /dev/null
+++ b/arch/arm/crypto/chacha20-neon-core.S
@@ -0,0 +1,524 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, ARM NEON functions
+ *
+ * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Based on:
+ * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSE3 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+	.text
+	.fpu		neon
+	.align		5
+
+ENTRY(chacha20_block_xor_neon)
+	// r0: Input state matrix, s
+	// r1: 1 data block output, o
+	// r2: 1 data block input, i
+
+	//
+	// This function encrypts one ChaCha20 block by loading the state matrix
+	// in four NEON registers. It performs matrix operation on four words in
+	// parallel, but requireds shuffling to rearrange the words after each
+	// round.
+	//
+
+	// x0..3 = s0..3
+	add		ip, r0, #0x20
+	vld1.32		{q0-q1}, [r0]
+	vld1.32		{q2-q3}, [ip]
+
+	vmov		q8, q0
+	vmov		q9, q1
+	vmov		q10, q2
+	vmov		q11, q3
+
+	mov		r3, #10
+
+.Ldoubleround:
+	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	vadd.i32	q0, q0, q1
+	veor		q4, q3, q0
+	vshl.u32	q3, q4, #16
+	vsri.u32	q3, q4, #16
+
+	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	vadd.i32	q2, q2, q3
+	veor		q4, q1, q2
+	vshl.u32	q1, q4, #12
+	vsri.u32	q1, q4, #20
+
+	// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	vadd.i32	q0, q0, q1
+	veor		q4, q3, q0
+	vshl.u32	q3, q4, #8
+	vsri.u32	q3, q4, #24
+
+	// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+	vadd.i32	q2, q2, q3
+	veor		q4, q1, q2
+	vshl.u32	q1, q4, #7
+	vsri.u32	q1, q4, #25
+
+	// x1 = shuffle32(x1, MASK(0, 3, 2, 1))
+	vext.8		q1, q1, q1, #4
+	// x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+	vext.8		q2, q2, q2, #8
+	// x3 = shuffle32(x3, MASK(2, 1, 0, 3))
+	vext.8		q3, q3, q3, #12
+
+	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	vadd.i32	q0, q0, q1
+	veor		q4, q3, q0
+	vshl.u32	q3, q4, #16
+	vsri.u32	q3, q4, #16
+
+	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	vadd.i32	q2, q2, q3
+	veor		q4, q1, q2
+	vshl.u32	q1, q4, #12
+	vsri.u32	q1, q4, #20
+
+	// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	vadd.i32	q0, q0, q1
+	veor		q4, q3, q0
+	vshl.u32	q3, q4, #8
+	vsri.u32	q3, q4, #24
+
+	// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+	vadd.i32	q2, q2, q3
+	veor		q4, q1, q2
+	vshl.u32	q1, q4, #7
+	vsri.u32	q1, q4, #25
+
+	// x1 = shuffle32(x1, MASK(2, 1, 0, 3))
+	vext.8		q1, q1, q1, #12
+	// x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+	vext.8		q2, q2, q2, #8
+	// x3 = shuffle32(x3, MASK(0, 3, 2, 1))
+	vext.8		q3, q3, q3, #4
+
+	subs		r3, r3, #1
+	bne		.Ldoubleround
+
+	add		ip, r2, #0x20
+	vld1.8		{q4-q5}, [r2]
+	vld1.8		{q6-q7}, [ip]
+
+	// o0 = i0 ^ (x0 + s0)
+	vadd.i32	q0, q0, q8
+	veor		q0, q0, q4
+
+	// o1 = i1 ^ (x1 + s1)
+	vadd.i32	q1, q1, q9
+	veor		q1, q1, q5
+
+	// o2 = i2 ^ (x2 + s2)
+	vadd.i32	q2, q2, q10
+	veor		q2, q2, q6
+
+	// o3 = i3 ^ (x3 + s3)
+	vadd.i32	q3, q3, q11
+	veor		q3, q3, q7
+
+	add		ip, r1, #0x20
+	vst1.8		{q0-q1}, [r1]
+	vst1.8		{q2-q3}, [ip]
+
+	bx		lr
+ENDPROC(chacha20_block_xor_neon)
+
+	.align		5
+ENTRY(chacha20_4block_xor_neon)
+	push		{r4-r6, lr}
+	mov		ip, sp			// preserve the stack pointer
+	sub		r3, sp, #0x20		// allocate a 32 byte buffer
+	bic		r3, r3, #0x1f		// aligned to 32 bytes
+	mov		sp, r3
+
+	// r0: Input state matrix, s
+	// r1: 4 data blocks output, o
+	// r2: 4 data blocks input, i
+
+	//
+	// This function encrypts four consecutive ChaCha20 blocks by loading
+	// the state matrix in NEON registers four times. The algorithm performs
+	// each operation on the corresponding word of each state matrix, hence
+	// requires no word shuffling. For final XORing step we transpose the
+	// matrix by interleaving 32- and then 64-bit words, which allows us to
+	// do XOR in NEON registers.
+	//
+
+	// x0..15[0-3] = s0..3[0..3]
+	add		r3, r0, #0x20
+	vld1.32		{q0-q1}, [r0]
+	vld1.32		{q2-q3}, [r3]
+
+	adr		r3, CTRINC
+	vdup.32		q15, d7[1]
+	vdup.32		q14, d7[0]
+	vld1.32		{q11}, [r3, :128]
+	vdup.32		q13, d6[1]
+	vdup.32		q12, d6[0]
+	vadd.i32	q12, q12, q11		// x12 += counter values 0-3
+	vdup.32		q11, d5[1]
+	vdup.32		q10, d5[0]
+	vdup.32		q9, d4[1]
+	vdup.32		q8, d4[0]
+	vdup.32		q7, d3[1]
+	vdup.32		q6, d3[0]
+	vdup.32		q5, d2[1]
+	vdup.32		q4, d2[0]
+	vdup.32		q3, d1[1]
+	vdup.32		q2, d1[0]
+	vdup.32		q1, d0[1]
+	vdup.32		q0, d0[0]
+
+	mov		r3, #10
+
+.Ldoubleround4:
+	// x0 += x4, x12 = rotl32(x12 ^ x0, 16)
+	// x1 += x5, x13 = rotl32(x13 ^ x1, 16)
+	// x2 += x6, x14 = rotl32(x14 ^ x2, 16)
+	// x3 += x7, x15 = rotl32(x15 ^ x3, 16)
+	vadd.i32	q0, q0, q4
+	vadd.i32	q1, q1, q5
+	vadd.i32	q2, q2, q6
+	vadd.i32	q3, q3, q7
+
+	veor		q12, q12, q0
+	veor		q13, q13, q1
+	veor		q14, q14, q2
+	veor		q15, q15, q3
+
+	vrev32.16	q12, q12
+	vrev32.16	q13, q13
+	vrev32.16	q14, q14
+	vrev32.16	q15, q15
+
+	// x8 += x12, x4 = rotl32(x4 ^ x8, 12)
+	// x9 += x13, x5 = rotl32(x5 ^ x9, 12)
+	// x10 += x14, x6 = rotl32(x6 ^ x10, 12)
+	// x11 += x15, x7 = rotl32(x7 ^ x11, 12)
+	vadd.i32	q8, q8, q12
+	vadd.i32	q9, q9, q13
+	vadd.i32	q10, q10, q14
+	vadd.i32	q11, q11, q15
+
+	vst1.32		{q8-q9}, [sp, :256]
+
+	veor		q8, q4, q8
+	veor		q9, q5, q9
+	vshl.u32	q4, q8, #12
+	vshl.u32	q5, q9, #12
+	vsri.u32	q4, q8, #20
+	vsri.u32	q5, q9, #20
+
+	veor		q8, q6, q10
+	veor		q9, q7, q11
+	vshl.u32	q6, q8, #12
+	vshl.u32	q7, q9, #12
+	vsri.u32	q6, q8, #20
+	vsri.u32	q7, q9, #20
+
+	// x0 += x4, x12 = rotl32(x12 ^ x0, 8)
+	// x1 += x5, x13 = rotl32(x13 ^ x1, 8)
+	// x2 += x6, x14 = rotl32(x14 ^ x2, 8)
+	// x3 += x7, x15 = rotl32(x15 ^ x3, 8)
+	vadd.i32	q0, q0, q4
+	vadd.i32	q1, q1, q5
+	vadd.i32	q2, q2, q6
+	vadd.i32	q3, q3, q7
+
+	veor		q8, q12, q0
+	veor		q9, q13, q1
+	vshl.u32	q12, q8, #8
+	vshl.u32	q13, q9, #8
+	vsri.u32	q12, q8, #24
+	vsri.u32	q13, q9, #24
+
+	veor		q8, q14, q2
+	veor		q9, q15, q3
+	vshl.u32	q14, q8, #8
+	vshl.u32	q15, q9, #8
+	vsri.u32	q14, q8, #24
+	vsri.u32	q15, q9, #24
+
+	vld1.32		{q8-q9}, [sp, :256]
+
+	// x8 += x12, x4 = rotl32(x4 ^ x8, 7)
+	// x9 += x13, x5 = rotl32(x5 ^ x9, 7)
+	// x10 += x14, x6 = rotl32(x6 ^ x10, 7)
+	// x11 += x15, x7 = rotl32(x7 ^ x11, 7)
+	vadd.i32	q8, q8, q12
+	vadd.i32	q9, q9, q13
+	vadd.i32	q10, q10, q14
+	vadd.i32	q11, q11, q15
+
+	vst1.32		{q8-q9}, [sp, :256]
+
+	veor		q8, q4, q8
+	veor		q9, q5, q9
+	vshl.u32	q4, q8, #7
+	vshl.u32	q5, q9, #7
+	vsri.u32	q4, q8, #25
+	vsri.u32	q5, q9, #25
+
+	veor		q8, q6, q10
+	veor		q9, q7, q11
+	vshl.u32	q6, q8, #7
+	vshl.u32	q7, q9, #7
+	vsri.u32	q6, q8, #25
+	vsri.u32	q7, q9, #25
+
+	vld1.32		{q8-q9}, [sp, :256]
+
+	// x0 += x5, x15 = rotl32(x15 ^ x0, 16)
+	// x1 += x6, x12 = rotl32(x12 ^ x1, 16)
+	// x2 += x7, x13 = rotl32(x13 ^ x2, 16)
+	// x3 += x4, x14 = rotl32(x14 ^ x3, 16)
+	vadd.i32	q0, q0, q5
+	vadd.i32	q1, q1, q6
+	vadd.i32	q2, q2, q7
+	vadd.i32	q3, q3, q4
+
+	veor		q15, q15, q0
+	veor		q12, q12, q1
+	veor		q13, q13, q2
+	veor		q14, q14, q3
+
+	vrev32.16	q15, q15
+	vrev32.16	q12, q12
+	vrev32.16	q13, q13
+	vrev32.16	q14, q14
+
+	// x10 += x15, x5 = rotl32(x5 ^ x10, 12)
+	// x11 += x12, x6 = rotl32(x6 ^ x11, 12)
+	// x8 += x13, x7 = rotl32(x7 ^ x8, 12)
+	// x9 += x14, x4 = rotl32(x4 ^ x9, 12)
+	vadd.i32	q10, q10, q15
+	vadd.i32	q11, q11, q12
+	vadd.i32	q8, q8, q13
+	vadd.i32	q9, q9, q14
+
+	vst1.32		{q8-q9}, [sp, :256]
+
+	veor		q8, q7, q8
+	veor		q9, q4, q9
+	vshl.u32	q7, q8, #12
+	vshl.u32	q4, q9, #12
+	vsri.u32	q7, q8, #20
+	vsri.u32	q4, q9, #20
+
+	veor		q8, q5, q10
+	veor		q9, q6, q11
+	vshl.u32	q5, q8, #12
+	vshl.u32	q6, q9, #12
+	vsri.u32	q5, q8, #20
+	vsri.u32	q6, q9, #20
+
+	// x0 += x5, x15 = rotl32(x15 ^ x0, 8)
+	// x1 += x6, x12 = rotl32(x12 ^ x1, 8)
+	// x2 += x7, x13 = rotl32(x13 ^ x2, 8)
+	// x3 += x4, x14 = rotl32(x14 ^ x3, 8)
+	vadd.i32	q0, q0, q5
+	vadd.i32	q1, q1, q6
+	vadd.i32	q2, q2, q7
+	vadd.i32	q3, q3, q4
+
+	veor		q8, q15, q0
+	veor		q9, q12, q1
+	vshl.u32	q15, q8, #8
+	vshl.u32	q12, q9, #8
+	vsri.u32	q15, q8, #24
+	vsri.u32	q12, q9, #24
+
+	veor		q8, q13, q2
+	veor		q9, q14, q3
+	vshl.u32	q13, q8, #8
+	vshl.u32	q14, q9, #8
+	vsri.u32	q13, q8, #24
+	vsri.u32	q14, q9, #24
+
+	vld1.32		{q8-q9}, [sp, :256]
+
+	// x10 += x15, x5 = rotl32(x5 ^ x10, 7)
+	// x11 += x12, x6 = rotl32(x6 ^ x11, 7)
+	// x8 += x13, x7 = rotl32(x7 ^ x8, 7)
+	// x9 += x14, x4 = rotl32(x4 ^ x9, 7)
+	vadd.i32	q10, q10, q15
+	vadd.i32	q11, q11, q12
+	vadd.i32	q8, q8, q13
+	vadd.i32	q9, q9, q14
+
+	vst1.32		{q8-q9}, [sp, :256]
+
+	veor		q8, q7, q8
+	veor		q9, q4, q9
+	vshl.u32	q7, q8, #7
+	vshl.u32	q4, q9, #7
+	vsri.u32	q7, q8, #25
+	vsri.u32	q4, q9, #25
+
+	veor		q8, q5, q10
+	veor		q9, q6, q11
+	vshl.u32	q5, q8, #7
+	vshl.u32	q6, q9, #7
+	vsri.u32	q5, q8, #25
+	vsri.u32	q6, q9, #25
+
+	subs		r3, r3, #1
+	beq		0f
+
+	vld1.32		{q8-q9}, [sp, :256]
+	b		.Ldoubleround4
+
+	// x0[0-3] += s0[0]
+	// x1[0-3] += s0[1]
+	// x2[0-3] += s0[2]
+	// x3[0-3] += s0[3]
+0:	ldmia		r0!, {r3-r6}
+	vdup.32		q8, r3
+	vdup.32		q9, r4
+	vadd.i32	q0, q0, q8
+	vadd.i32	q1, q1, q9
+	vdup.32		q8, r5
+	vdup.32		q9, r6
+	vadd.i32	q2, q2, q8
+	vadd.i32	q3, q3, q9
+
+	// x4[0-3] += s1[0]
+	// x5[0-3] += s1[1]
+	// x6[0-3] += s1[2]
+	// x7[0-3] += s1[3]
+	ldmia		r0!, {r3-r6}
+	vdup.32		q8, r3
+	vdup.32		q9, r4
+	vadd.i32	q4, q4, q8
+	vadd.i32	q5, q5, q9
+	vdup.32		q8, r5
+	vdup.32		q9, r6
+	vadd.i32	q6, q6, q8
+	vadd.i32	q7, q7, q9
+
+	// interleave 32-bit words in state n, n+1
+	vzip.32		q0, q1
+	vzip.32		q2, q3
+	vzip.32		q4, q5
+	vzip.32		q6, q7
+
+	// interleave 64-bit words in state n, n+2
+	vswp		d1, d4
+	vswp		d3, d6
+	vswp		d9, d12
+	vswp		d11, d14
+
+	// xor with corresponding input, write to output
+	vld1.8		{q8-q9}, [r2]!
+	veor		q8, q8, q0
+	veor		q9, q9, q4
+	vst1.8		{q8-q9}, [r1]!
+
+	vld1.32		{q8-q9}, [sp, :256]
+
+	// x8[0-3] += s2[0]
+	// x9[0-3] += s2[1]
+	// x10[0-3] += s2[2]
+	// x11[0-3] += s2[3]
+	ldmia		r0!, {r3-r6}
+	vdup.32		q0, r3
+	vdup.32		q4, r4
+	vadd.i32	q8, q8, q0
+	vadd.i32	q9, q9, q4
+	vdup.32		q0, r5
+	vdup.32		q4, r6
+	vadd.i32	q10, q10, q0
+	vadd.i32	q11, q11, q4
+
+	// x12[0-3] += s3[0]
+	// x13[0-3] += s3[1]
+	// x14[0-3] += s3[2]
+	// x15[0-3] += s3[3]
+	ldmia		r0!, {r3-r6}
+	vdup.32		q0, r3
+	vdup.32		q4, r4
+	adr		r3, CTRINC
+	vadd.i32	q12, q12, q0
+	vld1.32		{q0}, [r3, :128]
+	vadd.i32	q13, q13, q4
+	vadd.i32	q12, q12, q0		// x12 += counter values 0-3
+
+	vdup.32		q0, r5
+	vdup.32		q4, r6
+	vadd.i32	q14, q14, q0
+	vadd.i32	q15, q15, q4
+
+	// interleave 32-bit words in state n, n+1
+	vzip.32		q8, q9
+	vzip.32		q10, q11
+	vzip.32		q12, q13
+	vzip.32		q14, q15
+
+	// interleave 64-bit words in state n, n+2
+	vswp		d17, d20
+	vswp		d19, d22
+	vswp		d25, d28
+	vswp		d27, d30
+
+	vmov		q4, q1
+
+	vld1.8		{q0-q1}, [r2]!
+	veor		q0, q0, q8
+	veor		q1, q1, q12
+	vst1.8		{q0-q1}, [r1]!
+
+	vld1.8		{q0-q1}, [r2]!
+	veor		q0, q0, q2
+	veor		q1, q1, q6
+	vst1.8		{q0-q1}, [r1]!
+
+	vld1.8		{q0-q1}, [r2]!
+	veor		q0, q0, q10
+	veor		q1, q1, q14
+	vst1.8		{q0-q1}, [r1]!
+
+	vld1.8		{q0-q1}, [r2]!
+	veor		q0, q0, q4
+	veor		q1, q1, q5
+	vst1.8		{q0-q1}, [r1]!
+
+	vld1.8		{q0-q1}, [r2]!
+	veor		q0, q0, q9
+	veor		q1, q1, q13
+	vst1.8		{q0-q1}, [r1]!
+
+	vld1.8		{q0-q1}, [r2]!
+	veor		q0, q0, q3
+	veor		q1, q1, q7
+	vst1.8		{q0-q1}, [r1]!
+
+	vld1.8		{q0-q1}, [r2]
+	veor		q0, q0, q11
+	veor		q1, q1, q15
+	vst1.8		{q0-q1}, [r1]
+
+	mov		sp, ip
+	pop		{r4-r6, pc}
+ENDPROC(chacha20_4block_xor_neon)
+
+	.align		4
+CTRINC:	.word		0, 1, 2, 3
+
diff --git a/arch/arm/crypto/chacha20-neon-glue.c b/arch/arm/crypto/chacha20-neon-glue.c
new file mode 100644
index 000000000000..592f75ae4fa1
--- /dev/null
+++ b/arch/arm/crypto/chacha20-neon-glue.c
@@ -0,0 +1,128 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, ARM NEON functions
+ *
+ * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Based on:
+ * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/chacha20.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+
+asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
+asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
+
+static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
+			    unsigned int bytes)
+{
+	u8 buf[CHACHA20_BLOCK_SIZE];
+
+	while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+		chacha20_4block_xor_neon(state, dst, src);
+		bytes -= CHACHA20_BLOCK_SIZE * 4;
+		src += CHACHA20_BLOCK_SIZE * 4;
+		dst += CHACHA20_BLOCK_SIZE * 4;
+		state[12] += 4;
+	}
+	while (bytes >= CHACHA20_BLOCK_SIZE) {
+		chacha20_block_xor_neon(state, dst, src);
+		bytes -= CHACHA20_BLOCK_SIZE;
+		src += CHACHA20_BLOCK_SIZE;
+		dst += CHACHA20_BLOCK_SIZE;
+		state[12]++;
+	}
+	if (bytes) {
+		memcpy(buf, src, bytes);
+		chacha20_block_xor_neon(state, buf, buf);
+		memcpy(dst, buf, bytes);
+	}
+}
+
+static int chacha20_neon(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	u32 state[16];
+	int err;
+
+	if (req->cryptlen <= CHACHA20_BLOCK_SIZE || !may_use_simd())
+		return crypto_chacha20_crypt(req);
+
+	err = skcipher_walk_virt(&walk, req, true);
+
+	crypto_chacha20_init(state, ctx, walk.iv);
+
+	kernel_neon_begin();
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
+
+		chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
+				nbytes);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+	kernel_neon_end();
+
+	return err;
+}
+
+static struct skcipher_alg alg = {
+	.base.cra_name		= "chacha20",
+	.base.cra_driver_name	= "chacha20-neon",
+	.base.cra_priority	= 300,
+	.base.cra_blocksize	= 1,
+	.base.cra_ctxsize	= sizeof(struct chacha20_ctx),
+	.base.cra_alignmask	= 1,
+	.base.cra_module	= THIS_MODULE,
+
+	.min_keysize		= CHACHA20_KEY_SIZE,
+	.max_keysize		= CHACHA20_KEY_SIZE,
+	.ivsize			= CHACHA20_IV_SIZE,
+	.chunksize		= CHACHA20_BLOCK_SIZE,
+	.walksize		= 4 * CHACHA20_BLOCK_SIZE,
+	.setkey			= crypto_chacha20_setkey,
+	.encrypt		= chacha20_neon,
+	.decrypt		= chacha20_neon,
+};
+
+static int __init chacha20_simd_mod_init(void)
+{
+	if (!(elf_hwcap & HWCAP_NEON))
+		return -ENODEV;
+
+	return crypto_register_skcipher(&alg);
+}
+
+static void __exit chacha20_simd_mod_fini(void)
+{
+	crypto_unregister_skcipher(&alg);
+}
+
+module_init(chacha20_simd_mod_init);
+module_exit(chacha20_simd_mod_fini);
+
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("chacha20");
-- 
2.7.4

^ permalink raw reply related

* [PATCH 4/6] crypto: arm64/chacha20 - implement NEON version based on SSE3 code
From: Ard Biesheuvel @ 2017-01-02 18:21 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1483381268-12987-1-git-send-email-ard.biesheuvel@linaro.org>

This is a straight port to arm64/NEON of the x86 SSE3 implementation
of the ChaCha20 stream cipher. It uses the new skcipher walksize
attribute to process the input in strides of 4x the block size.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/Kconfig              |   6 +
 arch/arm64/crypto/Makefile             |   3 +
 arch/arm64/crypto/chacha20-neon-core.S | 450 ++++++++++++++++++++
 arch/arm64/crypto/chacha20-neon-glue.c | 127 ++++++
 4 files changed, 586 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 450a85df041a..0bf0f531f539 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -72,4 +72,10 @@ config CRYPTO_CRC32_ARM64
 	depends on ARM64
 	select CRYPTO_HASH
 
+config CRYPTO_CHACHA20_NEON
+	tristate "NEON accelerated ChaCha20 symmetric cipher"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_BLKCIPHER
+	select CRYPTO_CHACHA20
+
 endif
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index aa8888d7b744..9d2826c5fccf 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -41,6 +41,9 @@ sha256-arm64-y := sha256-glue.o sha256-core.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM64) += sha512-arm64.o
 sha512-arm64-y := sha512-glue.o sha512-core.o
 
+obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
+chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
+
 AFLAGS_aes-ce.o		:= -DINTERLEAVE=4
 AFLAGS_aes-neon.o	:= -DINTERLEAVE=4
 
diff --git a/arch/arm64/crypto/chacha20-neon-core.S b/arch/arm64/crypto/chacha20-neon-core.S
new file mode 100644
index 000000000000..13c85e272c2a
--- /dev/null
+++ b/arch/arm64/crypto/chacha20-neon-core.S
@@ -0,0 +1,450 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
+ *
+ * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Based on:
+ * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+	.text
+	.align		6
+
+ENTRY(chacha20_block_xor_neon)
+	// x0: Input state matrix, s
+	// x1: 1 data block output, o
+	// x2: 1 data block input, i
+
+	//
+	// This function encrypts one ChaCha20 block by loading the state matrix
+	// in four NEON registers. It performs matrix operation on four words in
+	// parallel, but requires shuffling to rearrange the words after each
+	// round.
+	//
+
+	// x0..3 = s0..3
+	adr		x3, ROT8
+	ld1		{v0.4s-v3.4s}, [x0]
+	ld1		{v8.4s-v11.4s}, [x0]
+	ld1		{v12.4s}, [x3]
+
+	mov		x3, #10
+
+.Ldoubleround:
+	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	add		v0.4s, v0.4s, v1.4s
+	eor		v3.16b, v3.16b, v0.16b
+	rev32		v3.8h, v3.8h
+
+	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	add		v2.4s, v2.4s, v3.4s
+	eor		v4.16b, v1.16b, v2.16b
+	shl		v1.4s, v4.4s, #12
+	sri		v1.4s, v4.4s, #20
+
+	// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	add		v0.4s, v0.4s, v1.4s
+	eor		v3.16b, v3.16b, v0.16b
+	tbl		v3.16b, {v3.16b}, v12.16b
+
+	// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+	add		v2.4s, v2.4s, v3.4s
+	eor		v4.16b, v1.16b, v2.16b
+	shl		v1.4s, v4.4s, #7
+	sri		v1.4s, v4.4s, #25
+
+	// x1 = shuffle32(x1, MASK(0, 3, 2, 1))
+	ext		v1.16b, v1.16b, v1.16b, #4
+	// x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+	ext		v2.16b, v2.16b, v2.16b, #8
+	// x3 = shuffle32(x3, MASK(2, 1, 0, 3))
+	ext		v3.16b, v3.16b, v3.16b, #12
+
+	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+	add		v0.4s, v0.4s, v1.4s
+	eor		v3.16b, v3.16b, v0.16b
+	rev32		v3.8h, v3.8h
+
+	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+	add		v2.4s, v2.4s, v3.4s
+	eor		v4.16b, v1.16b, v2.16b
+	shl		v1.4s, v4.4s, #12
+	sri		v1.4s, v4.4s, #20
+
+	// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+	add		v0.4s, v0.4s, v1.4s
+	eor		v3.16b, v3.16b, v0.16b
+	tbl		v3.16b, {v3.16b}, v12.16b
+
+	// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+	add		v2.4s, v2.4s, v3.4s
+	eor		v4.16b, v1.16b, v2.16b
+	shl		v1.4s, v4.4s, #7
+	sri		v1.4s, v4.4s, #25
+
+	// x1 = shuffle32(x1, MASK(2, 1, 0, 3))
+	ext		v1.16b, v1.16b, v1.16b, #12
+	// x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+	ext		v2.16b, v2.16b, v2.16b, #8
+	// x3 = shuffle32(x3, MASK(0, 3, 2, 1))
+	ext		v3.16b, v3.16b, v3.16b, #4
+
+	subs		x3, x3, #1
+	b.ne		.Ldoubleround
+
+	ld1		{v4.16b-v7.16b}, [x2]
+
+	// o0 = i0 ^ (x0 + s0)
+	add		v0.4s, v0.4s, v8.4s
+	eor		v0.16b, v0.16b, v4.16b
+
+	// o1 = i1 ^ (x1 + s1)
+	add		v1.4s, v1.4s, v9.4s
+	eor		v1.16b, v1.16b, v5.16b
+
+	// o2 = i2 ^ (x2 + s2)
+	add		v2.4s, v2.4s, v10.4s
+	eor		v2.16b, v2.16b, v6.16b
+
+	// o3 = i3 ^ (x3 + s3)
+	add		v3.4s, v3.4s, v11.4s
+	eor		v3.16b, v3.16b, v7.16b
+
+	st1		{v0.16b-v3.16b}, [x1]
+
+	ret
+ENDPROC(chacha20_block_xor_neon)
+
+	.align		6
+ENTRY(chacha20_4block_xor_neon)
+	// x0: Input state matrix, s
+	// x1: 4 data blocks output, o
+	// x2: 4 data blocks input, i
+
+	//
+	// This function encrypts four consecutive ChaCha20 blocks by loading
+	// the state matrix in NEON registers four times. The algorithm performs
+	// each operation on the corresponding word of each state matrix, hence
+	// requires no word shuffling. For final XORing step we transpose the
+	// matrix by interleaving 32- and then 64-bit words, which allows us to
+	// do XOR in NEON registers.
+	//
+	adr		x3, CTRINC		// ... and ROT8
+	ld1		{v30.4s-v31.4s}, [x3]
+
+	// x0..15[0-3] = s0..3[0..3]
+	mov		x4, x0
+	ld4r		{ v0.4s- v3.4s}, [x4], #16
+	ld4r		{ v4.4s- v7.4s}, [x4], #16
+	ld4r		{ v8.4s-v11.4s}, [x4], #16
+	ld4r		{v12.4s-v15.4s}, [x4]
+
+	// x12 += counter values 0-3
+	add		v12.4s, v12.4s, v30.4s
+
+	mov		x3, #10
+
+.Ldoubleround4:
+	// x0 += x4, x12 = rotl32(x12 ^ x0, 16)
+	// x1 += x5, x13 = rotl32(x13 ^ x1, 16)
+	// x2 += x6, x14 = rotl32(x14 ^ x2, 16)
+	// x3 += x7, x15 = rotl32(x15 ^ x3, 16)
+	add		v0.4s, v0.4s, v4.4s
+	add		v1.4s, v1.4s, v5.4s
+	add		v2.4s, v2.4s, v6.4s
+	add		v3.4s, v3.4s, v7.4s
+
+	eor		v12.16b, v12.16b, v0.16b
+	eor		v13.16b, v13.16b, v1.16b
+	eor		v14.16b, v14.16b, v2.16b
+	eor		v15.16b, v15.16b, v3.16b
+
+	rev32		v12.8h, v12.8h
+	rev32		v13.8h, v13.8h
+	rev32		v14.8h, v14.8h
+	rev32		v15.8h, v15.8h
+
+	// x8 += x12, x4 = rotl32(x4 ^ x8, 12)
+	// x9 += x13, x5 = rotl32(x5 ^ x9, 12)
+	// x10 += x14, x6 = rotl32(x6 ^ x10, 12)
+	// x11 += x15, x7 = rotl32(x7 ^ x11, 12)
+	add		v8.4s, v8.4s, v12.4s
+	add		v9.4s, v9.4s, v13.4s
+	add		v10.4s, v10.4s, v14.4s
+	add		v11.4s, v11.4s, v15.4s
+
+	eor		v16.16b, v4.16b, v8.16b
+	eor		v17.16b, v5.16b, v9.16b
+	eor		v18.16b, v6.16b, v10.16b
+	eor		v19.16b, v7.16b, v11.16b
+
+	shl		v4.4s, v16.4s, #12
+	shl		v5.4s, v17.4s, #12
+	shl		v6.4s, v18.4s, #12
+	shl		v7.4s, v19.4s, #12
+
+	sri		v4.4s, v16.4s, #20
+	sri		v5.4s, v17.4s, #20
+	sri		v6.4s, v18.4s, #20
+	sri		v7.4s, v19.4s, #20
+
+	// x0 += x4, x12 = rotl32(x12 ^ x0, 8)
+	// x1 += x5, x13 = rotl32(x13 ^ x1, 8)
+	// x2 += x6, x14 = rotl32(x14 ^ x2, 8)
+	// x3 += x7, x15 = rotl32(x15 ^ x3, 8)
+	add		v0.4s, v0.4s, v4.4s
+	add		v1.4s, v1.4s, v5.4s
+	add		v2.4s, v2.4s, v6.4s
+	add		v3.4s, v3.4s, v7.4s
+
+	eor		v12.16b, v12.16b, v0.16b
+	eor		v13.16b, v13.16b, v1.16b
+	eor		v14.16b, v14.16b, v2.16b
+	eor		v15.16b, v15.16b, v3.16b
+
+	tbl		v12.16b, {v12.16b}, v31.16b
+	tbl		v13.16b, {v13.16b}, v31.16b
+	tbl		v14.16b, {v14.16b}, v31.16b
+	tbl		v15.16b, {v15.16b}, v31.16b
+
+	// x8 += x12, x4 = rotl32(x4 ^ x8, 7)
+	// x9 += x13, x5 = rotl32(x5 ^ x9, 7)
+	// x10 += x14, x6 = rotl32(x6 ^ x10, 7)
+	// x11 += x15, x7 = rotl32(x7 ^ x11, 7)
+	add		v8.4s, v8.4s, v12.4s
+	add		v9.4s, v9.4s, v13.4s
+	add		v10.4s, v10.4s, v14.4s
+	add		v11.4s, v11.4s, v15.4s
+
+	eor		v16.16b, v4.16b, v8.16b
+	eor		v17.16b, v5.16b, v9.16b
+	eor		v18.16b, v6.16b, v10.16b
+	eor		v19.16b, v7.16b, v11.16b
+
+	shl		v4.4s, v16.4s, #7
+	shl		v5.4s, v17.4s, #7
+	shl		v6.4s, v18.4s, #7
+	shl		v7.4s, v19.4s, #7
+
+	sri		v4.4s, v16.4s, #25
+	sri		v5.4s, v17.4s, #25
+	sri		v6.4s, v18.4s, #25
+	sri		v7.4s, v19.4s, #25
+
+	// x0 += x5, x15 = rotl32(x15 ^ x0, 16)
+	// x1 += x6, x12 = rotl32(x12 ^ x1, 16)
+	// x2 += x7, x13 = rotl32(x13 ^ x2, 16)
+	// x3 += x4, x14 = rotl32(x14 ^ x3, 16)
+	add		v0.4s, v0.4s, v5.4s
+	add		v1.4s, v1.4s, v6.4s
+	add		v2.4s, v2.4s, v7.4s
+	add		v3.4s, v3.4s, v4.4s
+
+	eor		v15.16b, v15.16b, v0.16b
+	eor		v12.16b, v12.16b, v1.16b
+	eor		v13.16b, v13.16b, v2.16b
+	eor		v14.16b, v14.16b, v3.16b
+
+	rev32		v15.8h, v15.8h
+	rev32		v12.8h, v12.8h
+	rev32		v13.8h, v13.8h
+	rev32		v14.8h, v14.8h
+
+	// x10 += x15, x5 = rotl32(x5 ^ x10, 12)
+	// x11 += x12, x6 = rotl32(x6 ^ x11, 12)
+	// x8 += x13, x7 = rotl32(x7 ^ x8, 12)
+	// x9 += x14, x4 = rotl32(x4 ^ x9, 12)
+	add		v10.4s, v10.4s, v15.4s
+	add		v11.4s, v11.4s, v12.4s
+	add		v8.4s, v8.4s, v13.4s
+	add		v9.4s, v9.4s, v14.4s
+
+	eor		v16.16b, v5.16b, v10.16b
+	eor		v17.16b, v6.16b, v11.16b
+	eor		v18.16b, v7.16b, v8.16b
+	eor		v19.16b, v4.16b, v9.16b
+
+	shl		v5.4s, v16.4s, #12
+	shl		v6.4s, v17.4s, #12
+	shl		v7.4s, v18.4s, #12
+	shl		v4.4s, v19.4s, #12
+
+	sri		v5.4s, v16.4s, #20
+	sri		v6.4s, v17.4s, #20
+	sri		v7.4s, v18.4s, #20
+	sri		v4.4s, v19.4s, #20
+
+	// x0 += x5, x15 = rotl32(x15 ^ x0, 8)
+	// x1 += x6, x12 = rotl32(x12 ^ x1, 8)
+	// x2 += x7, x13 = rotl32(x13 ^ x2, 8)
+	// x3 += x4, x14 = rotl32(x14 ^ x3, 8)
+	add		v0.4s, v0.4s, v5.4s
+	add		v1.4s, v1.4s, v6.4s
+	add		v2.4s, v2.4s, v7.4s
+	add		v3.4s, v3.4s, v4.4s
+
+	eor		v15.16b, v15.16b, v0.16b
+	eor		v12.16b, v12.16b, v1.16b
+	eor		v13.16b, v13.16b, v2.16b
+	eor		v14.16b, v14.16b, v3.16b
+
+	tbl		v15.16b, {v15.16b}, v31.16b
+	tbl		v12.16b, {v12.16b}, v31.16b
+	tbl		v13.16b, {v13.16b}, v31.16b
+	tbl		v14.16b, {v14.16b}, v31.16b
+
+	// x10 += x15, x5 = rotl32(x5 ^ x10, 7)
+	// x11 += x12, x6 = rotl32(x6 ^ x11, 7)
+	// x8 += x13, x7 = rotl32(x7 ^ x8, 7)
+	// x9 += x14, x4 = rotl32(x4 ^ x9, 7)
+	add		v10.4s, v10.4s, v15.4s
+	add		v11.4s, v11.4s, v12.4s
+	add		v8.4s, v8.4s, v13.4s
+	add		v9.4s, v9.4s, v14.4s
+
+	eor		v16.16b, v5.16b, v10.16b
+	eor		v17.16b, v6.16b, v11.16b
+	eor		v18.16b, v7.16b, v8.16b
+	eor		v19.16b, v4.16b, v9.16b
+
+	shl		v5.4s, v16.4s, #7
+	shl		v6.4s, v17.4s, #7
+	shl		v7.4s, v18.4s, #7
+	shl		v4.4s, v19.4s, #7
+
+	sri		v5.4s, v16.4s, #25
+	sri		v6.4s, v17.4s, #25
+	sri		v7.4s, v18.4s, #25
+	sri		v4.4s, v19.4s, #25
+
+	subs		x3, x3, #1
+	b.ne		.Ldoubleround4
+
+	ld4r		{v16.4s-v19.4s}, [x0], #16
+	ld4r		{v20.4s-v23.4s}, [x0], #16
+
+	// x12 += counter values 0-3
+	add		v12.4s, v12.4s, v30.4s
+
+	// x0[0-3] += s0[0]
+	// x1[0-3] += s0[1]
+	// x2[0-3] += s0[2]
+	// x3[0-3] += s0[3]
+	add		v0.4s, v0.4s, v16.4s
+	add		v1.4s, v1.4s, v17.4s
+	add		v2.4s, v2.4s, v18.4s
+	add		v3.4s, v3.4s, v19.4s
+
+	ld4r		{v24.4s-v27.4s}, [x0], #16
+	ld4r		{v28.4s-v31.4s}, [x0]
+
+	// x4[0-3] += s1[0]
+	// x5[0-3] += s1[1]
+	// x6[0-3] += s1[2]
+	// x7[0-3] += s1[3]
+	add		v4.4s, v4.4s, v20.4s
+	add		v5.4s, v5.4s, v21.4s
+	add		v6.4s, v6.4s, v22.4s
+	add		v7.4s, v7.4s, v23.4s
+
+	// x8[0-3] += s2[0]
+	// x9[0-3] += s2[1]
+	// x10[0-3] += s2[2]
+	// x11[0-3] += s2[3]
+	add		v8.4s, v8.4s, v24.4s
+	add		v9.4s, v9.4s, v25.4s
+	add		v10.4s, v10.4s, v26.4s
+	add		v11.4s, v11.4s, v27.4s
+
+	// x12[0-3] += s3[0]
+	// x13[0-3] += s3[1]
+	// x14[0-3] += s3[2]
+	// x15[0-3] += s3[3]
+	add		v12.4s, v12.4s, v28.4s
+	add		v13.4s, v13.4s, v29.4s
+	add		v14.4s, v14.4s, v30.4s
+	add		v15.4s, v15.4s, v31.4s
+
+	// interleave 32-bit words in state n, n+1
+	zip1		v16.4s, v0.4s, v1.4s
+	zip2		v17.4s, v0.4s, v1.4s
+	zip1		v18.4s, v2.4s, v3.4s
+	zip2		v19.4s, v2.4s, v3.4s
+	zip1		v20.4s, v4.4s, v5.4s
+	zip2		v21.4s, v4.4s, v5.4s
+	zip1		v22.4s, v6.4s, v7.4s
+	zip2		v23.4s, v6.4s, v7.4s
+	zip1		v24.4s, v8.4s, v9.4s
+	zip2		v25.4s, v8.4s, v9.4s
+	zip1		v26.4s, v10.4s, v11.4s
+	zip2		v27.4s, v10.4s, v11.4s
+	zip1		v28.4s, v12.4s, v13.4s
+	zip2		v29.4s, v12.4s, v13.4s
+	zip1		v30.4s, v14.4s, v15.4s
+	zip2		v31.4s, v14.4s, v15.4s
+
+	// interleave 64-bit words in state n, n+2
+	zip1		v0.2d, v16.2d, v18.2d
+	zip2		v4.2d, v16.2d, v18.2d
+	zip1		v8.2d, v17.2d, v19.2d
+	zip2		v12.2d, v17.2d, v19.2d
+	ld1		{v16.16b-v19.16b}, [x2], #64
+
+	zip1		v1.2d, v20.2d, v22.2d
+	zip2		v5.2d, v20.2d, v22.2d
+	zip1		v9.2d, v21.2d, v23.2d
+	zip2		v13.2d, v21.2d, v23.2d
+	ld1		{v20.16b-v23.16b}, [x2], #64
+
+	zip1		v2.2d, v24.2d, v26.2d
+	zip2		v6.2d, v24.2d, v26.2d
+	zip1		v10.2d, v25.2d, v27.2d
+	zip2		v14.2d, v25.2d, v27.2d
+	ld1		{v24.16b-v27.16b}, [x2], #64
+
+	zip1		v3.2d, v28.2d, v30.2d
+	zip2		v7.2d, v28.2d, v30.2d
+	zip1		v11.2d, v29.2d, v31.2d
+	zip2		v15.2d, v29.2d, v31.2d
+	ld1		{v28.16b-v31.16b}, [x2]
+
+	// xor with corresponding input, write to output
+	eor		v16.16b, v16.16b, v0.16b
+	eor		v17.16b, v17.16b, v1.16b
+	eor		v18.16b, v18.16b, v2.16b
+	eor		v19.16b, v19.16b, v3.16b
+	eor		v20.16b, v20.16b, v4.16b
+	eor		v21.16b, v21.16b, v5.16b
+	st1		{v16.16b-v19.16b}, [x1], #64
+	eor		v22.16b, v22.16b, v6.16b
+	eor		v23.16b, v23.16b, v7.16b
+	eor		v24.16b, v24.16b, v8.16b
+	eor		v25.16b, v25.16b, v9.16b
+	st1		{v20.16b-v23.16b}, [x1], #64
+	eor		v26.16b, v26.16b, v10.16b
+	eor		v27.16b, v27.16b, v11.16b
+	eor		v28.16b, v28.16b, v12.16b
+	st1		{v24.16b-v27.16b}, [x1], #64
+	eor		v29.16b, v29.16b, v13.16b
+	eor		v30.16b, v30.16b, v14.16b
+	eor		v31.16b, v31.16b, v15.16b
+	st1		{v28.16b-v31.16b}, [x1]
+
+	ret
+ENDPROC(chacha20_4block_xor_neon)
+
+CTRINC:	.word		0, 1, 2, 3
+ROT8:	.word		0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
new file mode 100644
index 000000000000..a7f2337d46cf
--- /dev/null
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -0,0 +1,127 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
+ *
+ * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Based on:
+ * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/chacha20.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+
+asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
+asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
+
+static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
+			    unsigned int bytes)
+{
+	u8 buf[CHACHA20_BLOCK_SIZE];
+
+	while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+		chacha20_4block_xor_neon(state, dst, src);
+		bytes -= CHACHA20_BLOCK_SIZE * 4;
+		src += CHACHA20_BLOCK_SIZE * 4;
+		dst += CHACHA20_BLOCK_SIZE * 4;
+		state[12] += 4;
+	}
+	while (bytes >= CHACHA20_BLOCK_SIZE) {
+		chacha20_block_xor_neon(state, dst, src);
+		bytes -= CHACHA20_BLOCK_SIZE;
+		src += CHACHA20_BLOCK_SIZE;
+		dst += CHACHA20_BLOCK_SIZE;
+		state[12]++;
+	}
+	if (bytes) {
+		memcpy(buf, src, bytes);
+		chacha20_block_xor_neon(state, buf, buf);
+		memcpy(dst, buf, bytes);
+	}
+}
+
+static int chacha20_neon(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	u32 state[16];
+	int err;
+
+	if (req->cryptlen <= CHACHA20_BLOCK_SIZE)
+		return crypto_chacha20_crypt(req);
+
+	err = skcipher_walk_virt(&walk, req, true);
+
+	crypto_chacha20_init(state, ctx, walk.iv);
+
+	kernel_neon_begin();
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
+
+		chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
+				nbytes);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+	kernel_neon_end();
+
+	return err;
+}
+
+static struct skcipher_alg alg = {
+	.base.cra_name		= "chacha20",
+	.base.cra_driver_name	= "chacha20-neon",
+	.base.cra_priority	= 300,
+	.base.cra_blocksize	= 1,
+	.base.cra_ctxsize	= sizeof(struct chacha20_ctx),
+	.base.cra_alignmask	= 1,
+	.base.cra_module	= THIS_MODULE,
+
+	.min_keysize		= CHACHA20_KEY_SIZE,
+	.max_keysize		= CHACHA20_KEY_SIZE,
+	.ivsize			= CHACHA20_IV_SIZE,
+	.chunksize		= CHACHA20_BLOCK_SIZE,
+	.walksize		= 4 * CHACHA20_BLOCK_SIZE,
+	.setkey			= crypto_chacha20_setkey,
+	.encrypt		= chacha20_neon,
+	.decrypt		= chacha20_neon,
+};
+
+static int __init chacha20_simd_mod_init(void)
+{
+	if (!(elf_hwcap & HWCAP_ASIMD))
+		return -ENODEV;
+
+	return crypto_register_skcipher(&alg);
+}
+
+static void __exit chacha20_simd_mod_fini(void)
+{
+	crypto_unregister_skcipher(&alg);
+}
+
+module_init(chacha20_simd_mod_init);
+module_exit(chacha20_simd_mod_fini);
+
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("chacha20");
-- 
2.7.4

^ permalink raw reply related

* [PATCH 5/6] crypto: arm64/aes-blk - expose AES-CTR as synchronous cipher as well
From: Ard Biesheuvel @ 2017-01-02 18:21 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1483381268-12987-1-git-send-email-ard.biesheuvel@linaro.org>

In addition to wrapping the AES-CTR cipher into the async SIMD wrapper,
which exposes it as an async skcipher that defers processing to process
context, expose our AES-CTR implementation directly as a synchronous cipher
as well, but with a lower priority.

This makes the AES-CTR transform usable in places where synchronous
transforms are required, such as the MAC802.11 encryption code, which
executes in sotfirq context, where SIMD processing is allowed on arm64.
Users of the async transform will keep the existing behavior.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-glue.c | 25 ++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 4e3f8adb1793..5164aaf82c6a 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -327,6 +327,23 @@ static struct skcipher_alg aes_algs[] = { {
 	.decrypt	= ctr_encrypt,
 }, {
 	.base = {
+		.cra_name		= "ctr(aes)",
+		.cra_driver_name	= "ctr-aes-" MODE,
+		.cra_priority		= PRIO - 1,
+		.cra_blocksize		= 1,
+		.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
+		.cra_alignmask		= 7,
+		.cra_module		= THIS_MODULE,
+	},
+	.min_keysize	= AES_MIN_KEY_SIZE,
+	.max_keysize	= AES_MAX_KEY_SIZE,
+	.ivsize		= AES_BLOCK_SIZE,
+	.chunksize	= AES_BLOCK_SIZE,
+	.setkey		= skcipher_aes_setkey,
+	.encrypt	= ctr_encrypt,
+	.decrypt	= ctr_encrypt,
+}, {
+	.base = {
 		.cra_name		= "__xts(aes)",
 		.cra_driver_name	= "__xts-aes-" MODE,
 		.cra_priority		= PRIO,
@@ -350,8 +367,9 @@ static void aes_exit(void)
 {
 	int i;
 
-	for (i = 0; i < ARRAY_SIZE(aes_simd_algs) && aes_simd_algs[i]; i++)
-		simd_skcipher_free(aes_simd_algs[i]);
+	for (i = 0; i < ARRAY_SIZE(aes_simd_algs); i++)
+		if (aes_simd_algs[i])
+			simd_skcipher_free(aes_simd_algs[i]);
 
 	crypto_unregister_skciphers(aes_algs, ARRAY_SIZE(aes_algs));
 }
@@ -370,6 +388,9 @@ static int __init aes_init(void)
 		return err;
 
 	for (i = 0; i < ARRAY_SIZE(aes_algs); i++) {
+		if (!(aes_algs[i].base.cra_flags & CRYPTO_ALG_INTERNAL))
+			continue;
+
 		algname = aes_algs[i].base.cra_name + 2;
 		drvname = aes_algs[i].base.cra_driver_name + 2;
 		basename = aes_algs[i].base.cra_driver_name;
-- 
2.7.4

^ permalink raw reply related

* [PATCH 6/6] crypto: arm64/aes - reimplement bit-sliced ARM/NEON implementation for arm64
From: Ard Biesheuvel @ 2017-01-02 18:21 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1483381268-12987-1-git-send-email-ard.biesheuvel@linaro.org>

This is a reimplementation of the NEON version of the bit-sliced AES
algorithm. This code is heavily based on Andy Polyakov's OpenSSL version
for ARM, which is also available in the kernel. This is an alternative for
the existing NEON implementation for arm64 authored by me, which suffers
from poor performance due to its reliance on the pathologically slow four
register variant of the tbl/tbx NEON instruction.

This version is about ~30% (*) faster than the generic C code, but only in
cases where the input can be 8x interleaved (this is a fundamental property
of bit slicing). For this reason, only the chaining modes ECB, XTS and CTR
are implemented. (The significance of ECB is that it could potentially be
used by other chaining modes)

* Measured on Cortex-A57. Note that this is still an order of magnitude
  slower than the implementations that use the dedicated AES instructions
  introduced in ARMv8, but those are part of an optional extension, and so
  it is good to have a fallback.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/Kconfig           |   7 +
 arch/arm64/crypto/Makefile          |   3 +
 arch/arm64/crypto/aes-neonbs-core.S | 879 ++++++++++++++++++++
 arch/arm64/crypto/aes-neonbs-glue.c | 344 ++++++++
 4 files changed, 1233 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 0bf0f531f539..7c4249ad4935 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -78,4 +78,11 @@ config CRYPTO_CHACHA20_NEON
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_CHACHA20
 
+config CRYPTO_AES_NEON_BS
+	tristate "AES in ECB/CTR/XTS modes using bit-sliced NEON algorithm"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_BLKCIPHER
+	select CRYPTO_AES
+	select CRYPTO_SIMD
+
 endif
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 9d2826c5fccf..df3c0584b05c 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -44,6 +44,9 @@ sha512-arm64-y := sha512-glue.o sha512-core.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
 chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
 
+obj-$(CONFIG_CRYPTO_AES_NEON_BS) += aes-neon-bs.o
+aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
+
 AFLAGS_aes-ce.o		:= -DINTERLEAVE=4
 AFLAGS_aes-neon.o	:= -DINTERLEAVE=4
 
diff --git a/arch/arm64/crypto/aes-neonbs-core.S b/arch/arm64/crypto/aes-neonbs-core.S
new file mode 100644
index 000000000000..f5e1f76e8ee8
--- /dev/null
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -0,0 +1,879 @@
+/*
+ * Bit sliced AES using NEON instructions
+ *
+ * Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+/*
+ * The algorithm implemented here is described in detail by the paper
+ * 'Faster and Timing-Attack Resistant AES-GCM' by Emilia Kaesper and
+ * Peter Schwabe (https://eprint.iacr.org/2009/129.pdf)
+ *
+ * This implementation is based primarily on the OpenSSL implementation
+ * for 32-bit ARM written by Andy Polyakov <appro@openssl.org>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+	.text
+
+	rounds		.req	x11
+	bskey		.req	x12
+
+	.macro		in_bs_ch, b0, b1, b2, b3, b4, b5, b6, b7
+	eor		\b2, \b2, \b1
+	eor		\b5, \b5, \b6
+	eor		\b3, \b3, \b0
+	eor		\b6, \b6, \b2
+	eor		\b5, \b5, \b0
+	eor		\b6, \b6, \b3
+	eor		\b3, \b3, \b7
+	eor		\b7, \b7, \b5
+	eor		\b3, \b3, \b4
+	eor		\b4, \b4, \b5
+	eor		\b2, \b2, \b7
+	eor		\b3, \b3, \b1
+	eor		\b1, \b1, \b5
+	.endm
+
+	.macro		out_bs_ch, b0, b1, b2, b3, b4, b5, b6, b7
+	eor		\b0, \b0, \b6
+	eor		\b1, \b1, \b4
+	eor		\b4, \b4, \b6
+	eor		\b2, \b2, \b0
+	eor		\b6, \b6, \b1
+	eor		\b1, \b1, \b5
+	eor		\b5, \b5, \b3
+	eor		\b3, \b3, \b7
+	eor		\b7, \b7, \b5
+	eor		\b2, \b2, \b5
+	eor		\b4, \b4, \b7
+	.endm
+
+	.macro		inv_in_bs_ch, b6, b1, b2, b4, b7, b0, b3, b5
+	eor		\b1, \b1, \b7
+	eor		\b4, \b4, \b7
+	eor		\b7, \b7, \b5
+	eor		\b1, \b1, \b3
+	eor		\b2, \b2, \b5
+	eor		\b3, \b3, \b7
+	eor		\b6, \b6, \b1
+	eor		\b2, \b2, \b0
+	eor		\b5, \b5, \b3
+	eor		\b4, \b4, \b6
+	eor		\b0, \b0, \b6
+	eor		\b1, \b1, \b4
+	.endm
+
+	.macro		inv_out_bs_ch, b6, b5, b0, b3, b7, b1, b4, b2
+	eor		\b1, \b1, \b5
+	eor		\b2, \b2, \b7
+	eor		\b3, \b3, \b1
+	eor		\b4, \b4, \b5
+	eor		\b7, \b7, \b5
+	eor		\b3, \b3, \b4
+	eor 		\b5, \b5, \b0
+	eor		\b3, \b3, \b7
+	eor		\b6, \b6, \b2
+	eor		\b2, \b2, \b1
+	eor		\b6, \b6, \b3
+	eor		\b3, \b3, \b0
+	eor		\b5, \b5, \b6
+	.endm
+
+	.macro		mul_gf4, x0, x1, y0, y1, t0, t1
+	eor 		\t0, \y0, \y1
+	and		\t0, \t0, \x0
+	eor		\x0, \x0, \x1
+	and		\t1, \x1, \y0
+	and		\x0, \x0, \y1
+	eor		\x1, \t1, \t0
+	eor		\x0, \x0, \t1
+	.endm
+
+	.macro		mul_gf4_n_gf4, x0, x1, y0, y1, t0, x2, x3, y2, y3, t1
+	eor		\t0, \y0, \y1
+	eor 		\t1, \y2, \y3
+	and		\t0, \t0, \x0
+	and		\t1, \t1, \x2
+	eor		\x0, \x0, \x1
+	eor		\x2, \x2, \x3
+	and		\x1, \x1, \y0
+	and		\x3, \x3, \y2
+	and		\x0, \x0, \y1
+	and		\x2, \x2, \y3
+	eor		\x1, \x1, \x0
+	eor		\x2, \x2, \x3
+	eor		\x0, \x0, \t0
+	eor		\x3, \x3, \t1
+	.endm
+
+	.macro		mul_gf16_2, x0, x1, x2, x3, x4, x5, x6, x7, \
+				    y0, y1, y2, y3, t0, t1, t2, t3
+	eor		\t0, \x0, \x2
+	eor		\t1, \x1, \x3
+	mul_gf4  	\x0, \x1, \y0, \y1, \t2, \t3
+	eor		\y0, \y0, \y2
+	eor		\y1, \y1, \y3
+	mul_gf4_n_gf4	\t0, \t1, \y0, \y1, \t3, \x2, \x3, \y2, \y3, \t2
+	eor		\x0, \x0, \t0
+	eor		\x2, \x2, \t0
+	eor		\x1, \x1, \t1
+	eor		\x3, \x3, \t1
+	eor		\t0, \x4, \x6
+	eor		\t1, \x5, \x7
+	mul_gf4_n_gf4	\t0, \t1, \y0, \y1, \t3, \x6, \x7, \y2, \y3, \t2
+	eor		\y0, \y0, \y2
+	eor		\y1, \y1, \y3
+	mul_gf4  	\x4, \x5, \y0, \y1, \t2, \t3
+	eor		\x4, \x4, \t0
+	eor		\x6, \x6, \t0
+	eor		\x5, \x5, \t1
+	eor		\x7, \x7, \t1
+	.endm
+
+	.macro		inv_gf256, x0, x1, x2, x3, x4, x5, x6, x7, \
+				   t0, t1, t2, t3, s0, s1, s2, s3
+	eor		\t3, \x4, \x6
+	eor		\t0, \x5, \x7
+	eor		\t1, \x1, \x3
+	eor		\s1, \x7, \x6
+	eor		\s0, \x0, \x2
+	eor		\s3, \t3, \t0
+	orr		\t2, \t0, \t1
+	and		\s2, \t3, \s0
+	orr		\t3, \t3, \s0
+	eor		\s0, \s0, \t1
+	and		\t0, \t0, \t1
+	eor		\t1, \x3, \x2
+	and		\s3, \s3, \s0
+	and		\s1, \s1, \t1
+	eor		\t1, \x4, \x5
+	eor		\s0, \x1, \x0
+	eor		\t3, \t3, \s1
+	eor		\t2, \t2, \s1
+	and		\s1, \t1, \s0
+	orr		\t1, \t1, \s0
+	eor		\t3, \t3, \s3
+	eor		\t0, \t0, \s1
+	eor		\t2, \t2, \s2
+	eor		\t1, \t1, \s3
+	eor		\t0, \t0, \s2
+	and		\s0, \x7, \x3
+	eor		\t1, \t1, \s2
+	and		\s1, \x6, \x2
+	and		\s2, \x5, \x1
+	orr		\s3, \x4, \x0
+	eor		\t3, \t3, \s0
+	eor		\t1, \t1, \s2
+	eor		\s0, \t0, \s3
+	eor		\t2, \t2, \s1
+	and		\s2, \t3, \t1
+	eor		\s1, \t2, \s2
+	eor		\s3, \s0, \s2
+	bsl		\s1, \t1, \s0
+	not		\t0, \s0
+	bsl		\s0, \s1, \s3
+	bsl		\t0, \s1, \s3
+	bsl		\s3, \t3, \t2
+	eor		\t3, \t3, \t2
+	and		\s2, \s0, \s3
+	eor		\t1, \t1, \t0
+	eor		\s2, \s2, \t3
+	mul_gf16_2	\x0, \x1, \x2, \x3, \x4, \x5, \x6, \x7, \
+			\s3, \s2, \s1, \t1, \s0, \t0, \t2, \t3
+	.endm
+
+	.macro		sbox, b0, b1, b2, b3, b4, b5, b6, b7, \
+			      t0, t1, t2, t3, s0, s1, s2, s3
+	in_bs_ch	\b0\().16b, \b1\().16b, \b2\().16b, \b3\().16b, \
+			\b4\().16b, \b5\().16b, \b6\().16b, \b7\().16b
+	inv_gf256	\b6\().16b, \b5\().16b, \b0\().16b, \b3\().16b, \
+			\b7\().16b, \b1\().16b, \b4\().16b, \b2\().16b, \
+			\t0\().16b, \t1\().16b, \t2\().16b, \t3\().16b, \
+			\s0\().16b, \s1\().16b, \s2\().16b, \s3\().16b
+	out_bs_ch	\b7\().16b, \b1\().16b, \b4\().16b, \b2\().16b, \
+			\b6\().16b, \b5\().16b, \b0\().16b, \b3\().16b
+	.endm
+
+	.macro		inv_sbox, b0, b1, b2, b3, b4, b5, b6, b7, \
+				  t0, t1, t2, t3, s0, s1, s2, s3
+	inv_in_bs_ch	\b0\().16b, \b1\().16b, \b2\().16b, \b3\().16b, \
+			\b4\().16b, \b5\().16b, \b6\().16b, \b7\().16b
+	inv_gf256	\b5\().16b, \b1\().16b, \b2\().16b, \b6\().16b, \
+			\b3\().16b, \b7\().16b, \b0\().16b, \b4\().16b, \
+			\t0\().16b, \t1\().16b, \t2\().16b, \t3\().16b, \
+			\s0\().16b, \s1\().16b, \s2\().16b, \s3\().16b
+	inv_out_bs_ch	\b3\().16b, \b7\().16b, \b0\().16b, \b4\().16b, \
+			\b5\().16b, \b1\().16b, \b2\().16b, \b6\().16b
+	.endm
+
+	.macro		enc_next_rk
+	ldp		q16, q17, [bskey], #128
+	ldp		q18, q19, [bskey, #-96]
+	ldp		q20, q21, [bskey, #-64]
+	ldp		q22, q23, [bskey, #-32]
+	.endm
+
+	.macro		dec_next_rk
+	ldp		q16, q17, [bskey, #-128]!
+	ldp		q18, q19, [bskey, #32]
+	ldp		q20, q21, [bskey, #64]
+	ldp		q22, q23, [bskey, #96]
+	.endm
+
+	.macro		add_round_key, x0, x1, x2, x3, x4, x5, x6, x7
+	eor		\x0\().16b, \x0\().16b, v16.16b
+	eor		\x1\().16b, \x1\().16b, v17.16b
+	eor		\x2\().16b, \x2\().16b, v18.16b
+	eor		\x3\().16b, \x3\().16b, v19.16b
+	eor		\x4\().16b, \x4\().16b, v20.16b
+	eor		\x5\().16b, \x5\().16b, v21.16b
+	eor		\x6\().16b, \x6\().16b, v22.16b
+	eor		\x7\().16b, \x7\().16b, v23.16b
+	.endm
+
+	.macro		shift_rows, x0, x1, x2, x3, x4, x5, x6, x7, mask
+	tbl		\x0\().16b, {\x0\().16b}, \mask\().16b
+	tbl		\x1\().16b, {\x1\().16b}, \mask\().16b
+	tbl		\x2\().16b, {\x2\().16b}, \mask\().16b
+	tbl		\x3\().16b, {\x3\().16b}, \mask\().16b
+	tbl		\x4\().16b, {\x4\().16b}, \mask\().16b
+	tbl		\x5\().16b, {\x5\().16b}, \mask\().16b
+	tbl		\x6\().16b, {\x6\().16b}, \mask\().16b
+	tbl		\x7\().16b, {\x7\().16b}, \mask\().16b
+	.endm
+
+	.macro		mix_cols, x0, x1, x2, x3, x4, x5, x6, x7, \
+				  t0, t1, t2, t3, t4, t5, t6, t7, inv
+	ext		\t0\().16b, \x0\().16b, \x0\().16b, #12
+	ext		\t1\().16b, \x1\().16b, \x1\().16b, #12
+	eor		\x0\().16b, \x0\().16b, \t0\().16b
+	ext		\t2\().16b, \x2\().16b, \x2\().16b, #12
+	eor		\x1\().16b, \x1\().16b, \t1\().16b
+	ext		\t3\().16b, \x3\().16b, \x3\().16b, #12
+	eor		\x2\().16b, \x2\().16b, \t2\().16b
+	ext		\t4\().16b, \x4\().16b, \x4\().16b, #12
+	eor		\x3\().16b, \x3\().16b, \t3\().16b
+	ext		\t5\().16b, \x5\().16b, \x5\().16b, #12
+	eor		\x4\().16b, \x4\().16b, \t4\().16b
+	ext		\t6\().16b, \x6\().16b, \x6\().16b, #12
+	eor		\x5\().16b, \x5\().16b, \t5\().16b
+	ext		\t7\().16b, \x7\().16b, \x7\().16b, #12
+	eor		\x6\().16b, \x6\().16b, \t6\().16b
+	eor		\t1\().16b, \t1\().16b, \x0\().16b
+	eor		\x7\().16b, \x7\().16b, \t7\().16b
+	ext		\x0\().16b, \x0\().16b, \x0\().16b, #8
+	eor		\t2\().16b, \t2\().16b, \x1\().16b
+	eor		\t0\().16b, \t0\().16b, \x7\().16b
+	eor		\t1\().16b, \t1\().16b, \x7\().16b
+	ext		\x1\().16b, \x1\().16b, \x1\().16b, #8
+	eor		\t5\().16b, \t5\().16b, \x4\().16b
+	eor		\x0\().16b, \x0\().16b, \t0\().16b
+	eor		\t6\().16b, \t6\().16b, \x5\().16b
+	eor		\x1\().16b, \x1\().16b, \t1\().16b
+	ext		\t0\().16b, \x4\().16b, \x4\().16b, #8
+	eor		\t4\().16b, \t4\().16b, \x3\().16b
+	ext		\t1\().16b, \x5\().16b, \x5\().16b, #8
+	eor		\t7\().16b, \t7\().16b, \x6\().16b
+	ext		\x4\().16b, \x3\().16b, \x3\().16b, #8
+	eor		\t3\().16b, \t3\().16b, \x2\().16b
+	ext		\x5\().16b, \x7\().16b, \x7\().16b, #8
+	eor		\t4\().16b, \t4\().16b, \x7\().16b
+	ext		\x3\().16b, \x6\().16b, \x6\().16b, #8
+	eor		\t3\().16b, \t3\().16b, \x7\().16b
+	ext		\x6\().16b, \x2\().16b, \x2\().16b, #8
+	eor		\x7\().16b, \t1\().16b, \t5\().16b
+	.ifb		\inv
+	eor		\x2\().16b, \t0\().16b, \t4\().16b
+	eor		\x4\().16b, \x4\().16b, \t3\().16b
+	eor		\x5\().16b, \x5\().16b, \t7\().16b
+	eor		\x3\().16b, \x3\().16b, \t6\().16b
+	eor		\x6\().16b, \x6\().16b, \t2\().16b
+	.else
+	eor		\t3\().16b, \t3\().16b, \x4\().16b
+	eor		\x5\().16b, \x5\().16b, \t7\().16b
+	eor		\x2\().16b, \x3\().16b, \t6\().16b
+	eor		\x3\().16b, \t0\().16b, \t4\().16b
+	eor		\x4\().16b, \x6\().16b, \t2\().16b
+	mov		\x6\().16b, \t3\().16b
+	.endif
+	.endm
+
+	.macro		inv_mix_cols, x0, x1, x2, x3, x4, x5, x6, x7, \
+				      t0, t1, t2, t3, t4, t5, t6, t7
+	ext		\t0\().16b, \x0\().16b, \x0\().16b, #8
+	ext		\t6\().16b, \x6\().16b, \x6\().16b, #8
+	ext		\t7\().16b, \x7\().16b, \x7\().16b, #8
+	eor		\t0\().16b, \t0\().16b, \x0\().16b
+	ext		\t1\().16b, \x1\().16b, \x1\().16b, #8
+	eor		\t6\().16b, \t6\().16b, \x6\().16b
+	ext		\t2\().16b, \x2\().16b, \x2\().16b, #8
+	eor		\t7\().16b, \t7\().16b, \x7\().16b
+	ext		\t3\().16b, \x3\().16b, \x3\().16b, #8
+	eor		\t1\().16b, \t1\().16b, \x1\().16b
+	ext		\t4\().16b, \x4\().16b, \x4\().16b, #8
+	eor		\t2\().16b, \t2\().16b, \x2\().16b
+	ext		\t5\().16b, \x5\().16b, \x5\().16b, #8
+	eor		\t3\().16b, \t3\().16b, \x3\().16b
+	eor		\t4\().16b, \t4\().16b, \x4\().16b
+	eor		\t5\().16b, \t5\().16b, \x5\().16b
+	eor		\x0\().16b, \x0\().16b, \t6\().16b
+	eor		\x1\().16b, \x1\().16b, \t6\().16b
+	eor		\x2\().16b, \x2\().16b, \t0\().16b
+	eor		\x4\().16b, \x4\().16b, \t2\().16b
+	eor		\x3\().16b, \x3\().16b, \t1\().16b
+	eor		\x1\().16b, \x1\().16b, \t7\().16b
+	eor		\x2\().16b, \x2\().16b, \t7\().16b
+	eor		\x4\().16b, \x4\().16b, \t6\().16b
+	eor		\x5\().16b, \x5\().16b, \t3\().16b
+	eor		\x3\().16b, \x3\().16b, \t6\().16b
+	eor		\x6\().16b, \x6\().16b, \t4\().16b
+	eor		\x4\().16b, \x4\().16b, \t7\().16b
+	eor		\x5\().16b, \x5\().16b, \t7\().16b
+	eor		\x7\().16b, \x7\().16b, \t5\().16b
+	mix_cols	\x0, \x1, \x2, \x3, \x4, \x5, \x6, \x7, \
+			\t0, \t1, \t2, \t3, \t4, \t5, \t6, \t7, 1
+	.endm
+
+	.macro		swapmove_2x, a0, b0, a1, b1, n, mask, t0, t1
+	ushr		\t0\().2d, \b0\().2d, #\n
+	ushr		\t1\().2d, \b1\().2d, #\n
+	eor		\t0\().16b, \t0\().16b, \a0\().16b
+	eor		\t1\().16b, \t1\().16b, \a1\().16b
+	and		\t0\().16b, \t0\().16b, \mask\().16b
+	and		\t1\().16b, \t1\().16b, \mask\().16b
+	eor		\a0\().16b, \a0\().16b, \t0\().16b
+	shl		\t0\().2d, \t0\().2d, #\n
+	eor		\a1\().16b, \a1\().16b, \t1\().16b
+	shl		\t1\().2d, \t1\().2d, #\n
+	eor		\b0\().16b, \b0\().16b, \t0\().16b
+	eor		\b1\().16b, \b1\().16b, \t1\().16b
+	.endm
+
+	.macro		bitslice, x7, x6, x5, x4, x3, x2, x1, x0, t0, t1, t2, t3
+	movi		\t0\().16b, #0x55
+	movi		\t1\().16b, #0x33
+	swapmove_2x	\x0, \x1, \x2, \x3, 1, \t0, \t2, \t3
+	swapmove_2x	\x4, \x5, \x6, \x7, 1, \t0, \t2, \t3
+	movi		\t0\().16b, #0x0f
+	swapmove_2x	\x0, \x2, \x1, \x3, 2, \t1, \t2, \t3
+	swapmove_2x	\x4, \x6, \x5, \x7, 2, \t1, \t2, \t3
+	swapmove_2x	\x0, \x4, \x1, \x5, 4, \t0, \t2, \t3
+	swapmove_2x	\x2, \x6, \x3, \x7, 4, \t0, \t2, \t3
+	.endm
+
+
+	.align		6
+M0:	.octa		0x0004080c0105090d02060a0e03070b0f
+
+M0SR:	.octa		0x0004080c05090d010a0e02060f03070b
+SR:	.octa		0x0f0e0d0c0a09080b0504070600030201
+SRM0:	.octa		0x01060b0c0207080d0304090e00050a0f
+
+M0ISR:	.octa		0x0004080c0d0105090a0e0206070b0f03
+ISR:	.octa		0x0f0e0d0c080b0a090504070602010003
+ISRM0:	.octa		0x0306090c00070a0d01040b0e0205080f
+
+	/*
+	 * void aesbs_convert_key(u8 out[], u32 const rk[], int rounds)
+	 */
+ENTRY(aesbs_convert_key)
+	ld1		{v7.4s}, [x1], #16		// load round 0 key
+	ld1		{v17.4s}, [x1], #16		// load round 1 key
+
+	movi		v8.16b,  #0x01			// bit masks
+	movi		v9.16b,  #0x02
+	movi		v10.16b, #0x04
+	movi		v11.16b, #0x08
+	movi		v12.16b, #0x10
+	movi		v13.16b, #0x20
+	movi		v14.16b, #0x40
+	movi		v15.16b, #0x80
+	ldr		q16, M0
+
+	sub		x2, x2, #1
+	str		q7, [x0], #16		// save round 0 key
+
+.Lkey_loop:
+	tbl		v7.16b ,{v17.16b}, v16.16b
+	ld1		{v17.4s}, [x1], #16		// load next round key
+
+	cmtst		v0.16b, v7.16b, v8.16b
+	cmtst		v1.16b, v7.16b, v9.16b
+	cmtst		v2.16b, v7.16b, v10.16b
+	cmtst		v3.16b, v7.16b, v11.16b
+	cmtst		v4.16b, v7.16b, v12.16b
+	cmtst		v5.16b, v7.16b, v13.16b
+	cmtst		v6.16b, v7.16b, v14.16b
+	cmtst		v7.16b, v7.16b, v15.16b
+	not		v0.16b, v0.16b
+	not		v1.16b, v1.16b
+	not		v5.16b, v5.16b
+	not		v6.16b, v6.16b
+
+	subs		x2, x2, #1
+	stp		q0, q1, [x0], #128
+	stp		q2, q3, [x0, #-96]
+	stp		q4, q5, [x0, #-64]
+	stp		q6, q7, [x0, #-32]
+	b.ne		.Lkey_loop
+
+	movi		v7.16b, #0x63			// compose .L63
+	eor		v17.16b, v17.16b, v7.16b
+	str		q17, [x0]
+	ret
+ENDPROC(aesbs_convert_key)
+
+	.align		4
+aesbs_encrypt8:
+	ldr		q9, [bskey], #16		// round 0 key
+	ldr		q8, M0SR
+	ldr		q24, SR
+
+	eor		v10.16b, v0.16b, v9.16b		// xor with round0 key
+	eor		v11.16b, v1.16b, v9.16b
+	tbl		v0.16b, {v10.16b}, v8.16b
+	eor		v12.16b, v2.16b, v9.16b
+	tbl		v1.16b, {v11.16b}, v8.16b
+	eor		v13.16b, v3.16b, v9.16b
+	tbl		v2.16b, {v12.16b}, v8.16b
+	eor		v14.16b, v4.16b, v9.16b
+	tbl		v3.16b, {v13.16b}, v8.16b
+	eor		v15.16b, v5.16b, v9.16b
+	tbl		v4.16b, {v14.16b}, v8.16b
+	eor		v10.16b, v6.16b, v9.16b
+	tbl		v5.16b, {v15.16b}, v8.16b
+	eor		v11.16b, v7.16b, v9.16b
+	tbl		v6.16b, {v10.16b}, v8.16b
+	tbl		v7.16b, {v11.16b}, v8.16b
+
+	bitslice	v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11
+
+	sub		rounds, rounds, #1
+	b		.Lenc_sbox
+
+.Lenc_loop:
+	shift_rows	v0, v1, v2, v3, v4, v5, v6, v7, v24
+.Lenc_sbox:
+	sbox		v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, \
+								v13, v14, v15
+	subs		rounds, rounds, #1
+	b.cc		.Lenc_done
+
+	enc_next_rk
+
+	mix_cols	v0, v1, v4, v6, v3, v7, v2, v5, v8, v9, v10, v11, v12, \
+								v13, v14, v15
+
+	add_round_key	v0, v1, v2, v3, v4, v5, v6, v7
+
+	b.ne		.Lenc_loop
+	ldr		q24, SRM0
+	b		.Lenc_loop
+
+.Lenc_done:
+	ldr		q12, [bskey]			// last round key
+
+	bitslice	v0, v1, v4, v6, v3, v7, v2, v5, v8, v9, v10, v11
+
+	eor		v0.16b, v0.16b, v12.16b
+	eor		v1.16b, v1.16b, v12.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v6.16b, v6.16b, v12.16b
+	eor		v3.16b, v3.16b, v12.16b
+	eor		v7.16b, v7.16b, v12.16b
+	eor		v2.16b, v2.16b, v12.16b
+	eor		v5.16b, v5.16b, v12.16b
+	ret
+ENDPROC(aesbs_encrypt8)
+
+	.align		4
+aesbs_decrypt8:
+	lsl		x9, rounds, #7
+	add		bskey, bskey, x9
+
+	ldr		q9, [bskey, #-112]!		// round 0 key
+	ldr		q8, M0ISR
+	ldr		q24, ISR
+
+	eor		v10.16b, v0.16b, v9.16b		// xor with round0 key
+	eor		v11.16b, v1.16b, v9.16b
+	tbl		v0.16b, {v10.16b}, v8.16b
+	eor		v12.16b, v2.16b, v9.16b
+	tbl		v1.16b, {v11.16b}, v8.16b
+	eor		v13.16b, v3.16b, v9.16b
+	tbl		v2.16b, {v12.16b}, v8.16b
+	eor		v14.16b, v4.16b, v9.16b
+	tbl		v3.16b, {v13.16b}, v8.16b
+	eor		v15.16b, v5.16b, v9.16b
+	tbl		v4.16b, {v14.16b}, v8.16b
+	eor		v10.16b, v6.16b, v9.16b
+	tbl		v5.16b, {v15.16b}, v8.16b
+	eor		v11.16b, v7.16b, v9.16b
+	tbl		v6.16b, {v10.16b}, v8.16b
+	tbl		v7.16b, {v11.16b}, v8.16b
+
+	bitslice	v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11
+
+	sub		rounds, rounds, #1
+	b		.Ldec_sbox
+
+.Ldec_loop:
+	shift_rows	v0, v1, v2, v3, v4, v5, v6, v7, v24
+.Ldec_sbox:
+	inv_sbox	v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, \
+								v13, v14, v15
+	subs		rounds, rounds, #1
+	b.cc		.Ldec_done
+
+	dec_next_rk
+
+	add_round_key	v0, v1, v6, v4, v2, v7, v3, v5
+
+	inv_mix_cols	v0, v1, v6, v4, v2, v7, v3, v5, v8, v9, v10, v11, v12, \
+								v13, v14, v15
+
+	b.ne		.Ldec_loop
+	ldr		q24, ISRM0
+	b		.Ldec_loop
+.Ldec_done:
+	ldr		q12, [bskey, #-16]		// last round key
+
+	bitslice	v0, v1, v6, v4, v2, v7, v3, v5, v8, v9, v10, v11
+
+	eor		v0.16b, v0.16b, v12.16b
+	eor		v1.16b, v1.16b, v12.16b
+	eor		v6.16b, v6.16b, v12.16b
+	eor		v4.16b, v4.16b, v12.16b
+	eor		v2.16b, v2.16b, v12.16b
+	eor		v7.16b, v7.16b, v12.16b
+	eor		v3.16b, v3.16b, v12.16b
+	eor		v5.16b, v5.16b, v12.16b
+	ret
+ENDPROC(aesbs_decrypt8)
+
+	/*
+	 * aesbs_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
+	 *		     int blocks)
+	 * aesbs_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
+	 *		     int blocks)
+	 */
+	.macro		__ecb_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
+	stp		x29, x30, [sp, #-16]!
+	mov		x29, sp
+
+99:	mov		x5, #1
+	lsl		x5, x5, x4
+	subs		w4, w4, #8
+	csel		x4, x4, xzr, pl
+	csel		x5, x5, xzr, mi
+
+	ld1		{v0.16b}, [x1], #16
+	tbnz		x5, #1, 0f
+	ld1		{v1.16b}, [x1], #16
+	tbnz		x5, #2, 0f
+	ld1		{v2.16b}, [x1], #16
+	tbnz		x5, #3, 0f
+	ld1		{v3.16b}, [x1], #16
+	tbnz		x5, #4, 0f
+	ld1		{v4.16b}, [x1], #16
+	tbnz		x5, #5, 0f
+	ld1		{v5.16b}, [x1], #16
+	tbnz		x5, #6, 0f
+	ld1		{v6.16b}, [x1], #16
+	tbnz		x5, #7, 0f
+	ld1		{v7.16b}, [x1], #16
+
+0:	mov		bskey, x2
+	mov		rounds, x3
+	bl		\do8
+
+	st1		{\o0\().16b}, [x0], #16
+	tbnz		x5, #1, 1f
+	st1		{\o1\().16b}, [x0], #16
+	tbnz		x5, #2, 1f
+	st1		{\o2\().16b}, [x0], #16
+	tbnz		x5, #3, 1f
+	st1		{\o3\().16b}, [x0], #16
+	tbnz		x5, #4, 1f
+	st1		{\o4\().16b}, [x0], #16
+	tbnz		x5, #5, 1f
+	st1		{\o5\().16b}, [x0], #16
+	tbnz		x5, #6, 1f
+	st1		{\o6\().16b}, [x0], #16
+	tbnz		x5, #7, 1f
+	st1		{\o7\().16b}, [x0], #16
+
+	cbnz		x4, 99b
+
+1:	ldp		x29, x30, [sp], #16
+	ret
+	.endm
+
+	.align		4
+ENTRY(aesbs_ecb_encrypt)
+	__ecb_crypt	aesbs_encrypt8, v0, v1, v4, v6, v3, v7, v2, v5
+ENDPROC(aesbs_ecb_encrypt)
+
+	.align		4
+ENTRY(aesbs_ecb_decrypt)
+	__ecb_crypt	aesbs_decrypt8, v0, v1, v6, v4, v2, v7, v3, v5
+ENDPROC(aesbs_ecb_decrypt)
+
+	.macro		next_tweak, out, in, const, tmp
+	sshr		\tmp\().2d,  \in\().2d,   #63
+	and		\tmp\().16b, \tmp\().16b, \const\().16b
+	add		\out\().2d,  \in\().2d,   \in\().2d
+	ext		\tmp\().16b, \tmp\().16b, \tmp\().16b, #8
+	eor		\out\().16b, \out\().16b, \tmp\().16b
+	.endm
+
+	.align		4
+.Lxts_mul_x:
+CPU_LE(	.quad		1, 0x87		)
+CPU_BE(	.quad		0x87, 1		)
+
+	/*
+	 * aesbs_xts_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
+	 *		     int blocks, u8 iv[])
+	 * aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
+	 *		     int blocks, u8 iv[])
+	 */
+__xts_crypt8:
+	mov		x6, #1
+	lsl		x6, x6, x4
+	subs		w4, w4, #8
+	csel		x4, x4, xzr, pl
+	csel		x6, x6, xzr, mi
+
+	ld1		{v0.16b}, [x1], #16
+	next_tweak	v26, v25, v30, v31
+	eor		v0.16b, v0.16b, v25.16b
+	tbnz		x6, #1, 0f
+
+	ld1		{v1.16b}, [x1], #16
+	next_tweak	v27, v26, v30, v31
+	eor		v1.16b, v1.16b, v26.16b
+	tbnz		x6, #2, 0f
+
+	ld1		{v2.16b}, [x1], #16
+	next_tweak	v28, v27, v30, v31
+	eor		v2.16b, v2.16b, v27.16b
+	tbnz		x6, #3, 0f
+
+	ld1		{v3.16b}, [x1], #16
+	next_tweak	v29, v28, v30, v31
+	eor		v3.16b, v3.16b, v28.16b
+	tbnz		x6, #4, 0f
+
+	ld1		{v4.16b}, [x1], #16
+	str		q29, [sp, #16]
+	eor		v4.16b, v4.16b, v29.16b
+	next_tweak	v29, v29, v30, v31
+	tbnz		x6, #5, 0f
+
+	ld1		{v5.16b}, [x1], #16
+	str		q29, [sp, #32]
+	eor		v5.16b, v5.16b, v29.16b
+	next_tweak	v29, v29, v30, v31
+	tbnz		x6, #6, 0f
+
+	ld1		{v6.16b}, [x1], #16
+	str		q29, [sp, #48]
+	eor		v6.16b, v6.16b, v29.16b
+	next_tweak	v29, v29, v30, v31
+	tbnz		x6, #7, 0f
+
+	ld1		{v7.16b}, [x1], #16
+	str		q29, [sp, #64]
+	eor		v7.16b, v7.16b, v29.16b
+	next_tweak	v29, v29, v30, v31
+
+0:	mov		bskey, x2
+	mov		rounds, x3
+	br		x7
+ENDPROC(__xts_crypt8)
+
+	.macro		__xts_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
+	stp		x29, x30, [sp, #-80]!
+	mov		x29, sp
+
+	ldr		q30, .Lxts_mul_x
+	ld1		{v25.16b}, [x5]
+
+99:	adr		x7, \do8
+	bl		__xts_crypt8
+
+	ldp		q16, q17, [sp, #16]
+	ldp		q18, q19, [sp, #48]
+
+	eor		\o0\().16b, \o0\().16b, v25.16b
+	eor		\o1\().16b, \o1\().16b, v26.16b
+	eor		\o2\().16b, \o2\().16b, v27.16b
+	eor		\o3\().16b, \o3\().16b, v28.16b
+
+	st1		{\o0\().16b}, [x0], #16
+	mov		v25.16b, v26.16b
+	tbnz		x6, #1, 1f
+	st1		{\o1\().16b}, [x0], #16
+	mov		v25.16b, v27.16b
+	tbnz		x6, #2, 1f
+	st1		{\o2\().16b}, [x0], #16
+	mov		v25.16b, v28.16b
+	tbnz		x6, #3, 1f
+	st1		{\o3\().16b}, [x0], #16
+	mov		v25.16b, v29.16b
+	tbnz		x6, #4, 1f
+
+	eor		\o4\().16b, \o4\().16b, v16.16b
+	eor		\o5\().16b, \o5\().16b, v17.16b
+	eor		\o6\().16b, \o6\().16b, v18.16b
+	eor		\o7\().16b, \o7\().16b, v19.16b
+
+	st1		{\o4\().16b}, [x0], #16
+	tbnz		x6, #5, 1f
+	st1		{\o5\().16b}, [x0], #16
+	tbnz		x6, #6, 1f
+	st1		{\o6\().16b}, [x0], #16
+	tbnz		x6, #7, 1f
+	st1		{\o7\().16b}, [x0], #16
+
+	cbnz		x4, 99b
+
+1:	st1		{v25.16b}, [x5]
+	ldp		x29, x30, [sp], #80
+	ret
+	.endm
+
+ENTRY(aesbs_xts_encrypt)
+	__xts_crypt	aesbs_encrypt8, v0, v1, v4, v6, v3, v7, v2, v5
+ENDPROC(aesbs_xts_encrypt)
+
+ENTRY(aesbs_xts_decrypt)
+	__xts_crypt	aesbs_decrypt8, v0, v1, v6, v4, v2, v7, v3, v5
+ENDPROC(aesbs_xts_decrypt)
+
+	.macro		next_ctr, v
+	mov		\v\().d[1], x8
+	adds		x8, x8, #1
+	mov		\v\().d[0], x7
+	adc		x7, x7, xzr
+	rev64		\v\().16b, \v\().16b
+	.endm
+
+	/*
+	 * aesbs_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
+	 *		     int rounds, int blocks, u8 iv[], bool final)
+	 */
+ENTRY(aesbs_ctr_encrypt)
+	stp		x29, x30, [sp, #-16]!
+	mov		x29, sp
+
+	add		x4, x4, x6		// do one extra block if final
+
+	ldp		x7, x8, [x5]
+	ld1		{v0.16b}, [x5]
+CPU_LE(	rev		x7, x7		)
+CPU_LE(	rev		x8, x8		)
+	adds		x8, x8, #1
+	adc		x7, x7, xzr
+
+99:	mov		x9, #1
+	lsl		x9, x9, x4
+	subs		w4, w4, #8
+	csel		x4, x4, xzr, pl
+	csel		x9, x9, xzr, le
+
+	next_ctr	v1
+	next_ctr	v2
+	next_ctr	v3
+	next_ctr	v4
+	next_ctr	v5
+	next_ctr	v6
+	next_ctr	v7
+
+0:	mov		bskey, x2
+	mov		rounds, x3
+	bl		aesbs_encrypt8
+
+	lsr		x9, x9, x6		// disregard the extra block
+	tbnz		x9, #0, 0f
+
+	ld1		{v8.16b}, [x1], #16
+	eor		v0.16b, v0.16b, v8.16b
+	st1		{v0.16b}, [x0], #16
+	tbnz		x9, #1, 1f
+
+	ld1		{v9.16b}, [x1], #16
+	eor		v1.16b, v1.16b, v9.16b
+	st1		{v1.16b}, [x0], #16
+	tbnz		x9, #2, 2f
+
+	ld1		{v10.16b}, [x1], #16
+	eor		v4.16b, v4.16b, v10.16b
+	st1		{v4.16b}, [x0], #16
+	tbnz		x9, #3, 3f
+
+	ld1		{v11.16b}, [x1], #16
+	eor		v6.16b, v6.16b, v11.16b
+	st1		{v6.16b}, [x0], #16
+	tbnz		x9, #4, 4f
+
+	ld1		{v12.16b}, [x1], #16
+	eor		v3.16b, v3.16b, v12.16b
+	st1		{v3.16b}, [x0], #16
+	tbnz		x9, #5, 5f
+
+	ld1		{v13.16b}, [x1], #16
+	eor		v7.16b, v7.16b, v13.16b
+	st1		{v7.16b}, [x0], #16
+	tbnz		x9, #6, 6f
+
+	ld1		{v14.16b}, [x1], #16
+	eor		v2.16b, v2.16b, v14.16b
+	st1		{v2.16b}, [x0], #16
+	tbnz		x9, #7, 7f
+
+	ld1		{v15.16b}, [x1], #16
+	eor		v5.16b, v5.16b, v15.16b
+	st1		{v5.16b}, [x0], #16
+
+	next_ctr	v0
+	cbnz		x4, 99b
+
+0:	st1		{v0.16b}, [x5]
+8:	ldp		x29, x30, [sp], #16
+	ret
+
+	/*
+	 * If we are handling the tail of the input (x6 == 1), return the
+	 * final keystream block back to the caller via the IV buffer.
+	 */
+1:	cbz		x6, 8b
+	st1		{v1.16b}, [x5]
+	b		8b
+2:	cbz		x6, 8b
+	st1		{v4.16b}, [x5]
+	b		8b
+3:	cbz		x6, 8b
+	st1		{v6.16b}, [x5]
+	b		8b
+4:	cbz		x6, 8b
+	st1		{v3.16b}, [x5]
+	b		8b
+5:	cbz		x6, 8b
+	st1		{v7.16b}, [x5]
+	b		8b
+6:	cbz		x6, 8b
+	st1		{v2.16b}, [x5]
+	b		8b
+7:	cbz		x6, 8b
+	st1		{v5.16b}, [x5]
+	b		8b
+ENDPROC(aesbs_ctr_encrypt)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
new file mode 100644
index 000000000000..45c1862f86a7
--- /dev/null
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -0,0 +1,344 @@
+/*
+ * Bit sliced AES using NEON instructions
+ *
+ * Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <asm/neon.h>
+#include <crypto/aes.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/xts.h>
+#include <linux/module.h>
+
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+
+MODULE_ALIAS_CRYPTO("ecb(aes)");
+MODULE_ALIAS_CRYPTO("ctr(aes)");
+MODULE_ALIAS_CRYPTO("xts(aes)");
+
+asmlinkage void aesbs_convert_key(u8 out[], u32 const rk[], int rounds);
+
+asmlinkage void aesbs_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
+				  int rounds, int blocks);
+asmlinkage void aesbs_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
+				  int rounds, int blocks);
+
+asmlinkage void aesbs_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
+				  int rounds, int blocks, u8 iv[], bool final);
+
+asmlinkage void aesbs_xts_encrypt(u8 out[], u8 const in[], u8 const rk[],
+				  int rounds, int blocks, u8 iv[]);
+asmlinkage void aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[],
+				  int rounds, int blocks, u8 iv[]);
+
+struct aesbs_key {
+	u8			key[13 * (8 * AES_BLOCK_SIZE) + 32];
+} __aligned(AES_BLOCK_SIZE);
+
+struct aesbs_ctx {
+	struct aesbs_key	bskey;
+	int			rounds;
+};
+
+struct aesbs_xts_ctx {
+	struct crypto_aes_ctx	tweak;		/* keep at the beginning */
+	struct aesbs_key	bskey;
+	int			rounds;
+};
+
+static int aesbs_setkey(struct crypto_skcipher *tfm, const u8 *in_key,
+			unsigned int key_len)
+{
+	struct aesbs_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct crypto_aes_ctx rk;
+	int err;
+
+	err = crypto_aes_expand_key(&rk, in_key, key_len);
+	if (err)
+		return err;
+
+	ctx->rounds = 6 + key_len / 4;
+
+	kernel_neon_begin();
+	aesbs_convert_key(ctx->bskey.key, rk.key_enc, ctx->rounds);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int __ecb_crypt(struct skcipher_request *req,
+		       void (*fn)(u8 out[], u8 const in[], u8 const rk[],
+				  int rounds, int blocks))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct aesbs_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, true);
+
+	kernel_neon_begin();
+	while (walk.nbytes >= AES_BLOCK_SIZE) {
+		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
+
+		if (walk.nbytes < walk.total)
+			blocks = round_down(blocks,
+					    walk.stride / AES_BLOCK_SIZE);
+
+		fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->bskey.key,
+		   ctx->rounds, blocks);
+		err = skcipher_walk_done(&walk,
+					 walk.nbytes - blocks * AES_BLOCK_SIZE);
+	}
+	kernel_neon_end();
+
+	return err;
+}
+
+static int ecb_encrypt(struct skcipher_request *req)
+{
+	return __ecb_crypt(req, aesbs_ecb_encrypt);
+}
+
+static int ecb_decrypt(struct skcipher_request *req)
+{
+	return __ecb_crypt(req, aesbs_ecb_decrypt);
+}
+
+static int ctr_encrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct aesbs_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, true);
+
+	kernel_neon_begin();
+	while (walk.nbytes > 0) {
+		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
+		bool final = (walk.total % AES_BLOCK_SIZE) != 0;
+
+		if (walk.nbytes < walk.total) {
+			blocks = round_down(blocks,
+					    walk.stride / AES_BLOCK_SIZE);
+			final = false;
+		}
+
+		aesbs_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+				  ctx->bskey.key, ctx->rounds, blocks, walk.iv,
+				  final);
+
+		if (final) {
+			u8 *dst = walk.dst.virt.addr + blocks * AES_BLOCK_SIZE;
+			u8 *src = walk.src.virt.addr + blocks * AES_BLOCK_SIZE;
+
+			if (dst != src)
+				memcpy(dst, src, walk.total % AES_BLOCK_SIZE);
+			crypto_xor(dst, walk.iv, walk.total % AES_BLOCK_SIZE);
+
+			err = skcipher_walk_done(&walk, 0);
+			break;
+		}
+		err = skcipher_walk_done(&walk,
+					 walk.nbytes - blocks * AES_BLOCK_SIZE);
+	}
+	kernel_neon_end();
+
+	return err;
+}
+
+static int aesbs_xts_setkey(struct crypto_skcipher *tfm, const u8 *in_key,
+			    unsigned int key_len)
+{
+	struct aesbs_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct crypto_aes_ctx rk;
+	int err;
+
+	err = xts_verify_key(tfm, in_key, key_len);
+	if (err)
+		return err;
+
+	key_len /= 2;
+	err = crypto_aes_expand_key(&ctx->tweak, in_key + key_len, key_len);
+	if (err)
+		return err;
+
+	err = crypto_aes_expand_key(&rk, in_key, key_len);
+	if (err)
+		return err;
+
+	ctx->rounds = 6 + key_len / 4;
+
+	kernel_neon_begin();
+	aesbs_convert_key(ctx->bskey.key, rk.key_enc, ctx->rounds);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int __xts_crypt(struct skcipher_request *req,
+		       void (*fn)(u8 out[], u8 const in[], u8 const rk[],
+				  int rounds, int blocks, u8 iv[]))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct aesbs_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, true);
+
+	crypto_aes_encrypt(crypto_skcipher_tfm(tfm), walk.iv, walk.iv);
+
+	kernel_neon_begin();
+	while (walk.nbytes >= AES_BLOCK_SIZE) {
+		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
+
+		if (walk.nbytes < walk.total)
+			blocks = round_down(blocks,
+					    walk.stride / AES_BLOCK_SIZE);
+
+		fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->bskey.key,
+		   ctx->rounds, blocks, walk.iv);
+		err = skcipher_walk_done(&walk,
+					 walk.nbytes - blocks * AES_BLOCK_SIZE);
+	}
+	kernel_neon_end();
+
+	return err;
+}
+
+static int xts_encrypt(struct skcipher_request *req)
+{
+	return __xts_crypt(req, aesbs_xts_encrypt);
+}
+
+static int xts_decrypt(struct skcipher_request *req)
+{
+	return __xts_crypt(req, aesbs_xts_decrypt);
+}
+
+static struct skcipher_alg aes_algs[] = { {
+	.base.cra_name		= "__ecb(aes)",
+	.base.cra_driver_name	= "__ecb-aes-neonbs",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= AES_BLOCK_SIZE,
+	.base.cra_ctxsize	= sizeof(struct aesbs_ctx),
+	.base.cra_module	= THIS_MODULE,
+	.base.cra_flags		= CRYPTO_ALG_INTERNAL,
+
+	.min_keysize		= AES_MIN_KEY_SIZE,
+	.max_keysize		= AES_MAX_KEY_SIZE,
+	.walksize		= 8 * AES_BLOCK_SIZE,
+	.setkey			= aesbs_setkey,
+	.encrypt		= ecb_encrypt,
+	.decrypt		= ecb_decrypt,
+}, {
+	.base.cra_name		= "__ctr(aes)",
+	.base.cra_driver_name	= "__ctr-aes-neonbs",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= 1,
+	.base.cra_ctxsize	= sizeof(struct aesbs_ctx),
+	.base.cra_module	= THIS_MODULE,
+	.base.cra_flags		= CRYPTO_ALG_INTERNAL,
+
+	.min_keysize		= AES_MIN_KEY_SIZE,
+	.max_keysize		= AES_MAX_KEY_SIZE,
+	.chunksize		= AES_BLOCK_SIZE,
+	.walksize		= 8 * AES_BLOCK_SIZE,
+	.ivsize			= AES_BLOCK_SIZE,
+	.setkey			= aesbs_setkey,
+	.encrypt		= ctr_encrypt,
+	.decrypt		= ctr_encrypt,
+}, {
+	.base.cra_name		= "ctr(aes)",
+	.base.cra_driver_name	= "ctr-aes-neonbs",
+	.base.cra_priority	= 200 - 1,
+	.base.cra_blocksize	= 1,
+	.base.cra_ctxsize	= sizeof(struct aesbs_ctx),
+	.base.cra_module	= THIS_MODULE,
+
+	.min_keysize		= AES_MIN_KEY_SIZE,
+	.max_keysize		= AES_MAX_KEY_SIZE,
+	.chunksize		= AES_BLOCK_SIZE,
+	.walksize		= 8 * AES_BLOCK_SIZE,
+	.ivsize			= AES_BLOCK_SIZE,
+	.setkey			= aesbs_setkey,
+	.encrypt		= ctr_encrypt,
+	.decrypt		= ctr_encrypt,
+}, {
+	.base.cra_name		= "__xts(aes)",
+	.base.cra_driver_name	= "__xts-aes-neonbs",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= AES_BLOCK_SIZE,
+	.base.cra_ctxsize	= sizeof(struct aesbs_xts_ctx),
+	.base.cra_module	= THIS_MODULE,
+	.base.cra_flags		= CRYPTO_ALG_INTERNAL,
+
+	.min_keysize		= 2 * AES_MIN_KEY_SIZE,
+	.max_keysize		= 2 * AES_MAX_KEY_SIZE,
+	.walksize		= 8 * AES_BLOCK_SIZE,
+	.ivsize			= AES_BLOCK_SIZE,
+	.setkey			= aesbs_xts_setkey,
+	.encrypt		= xts_encrypt,
+	.decrypt		= xts_decrypt,
+} };
+
+static struct simd_skcipher_alg *aes_simd_algs[ARRAY_SIZE(aes_algs)];
+
+static void aes_exit(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(aes_simd_algs); i++)
+		if (aes_simd_algs[i])
+			simd_skcipher_free(aes_simd_algs[i]);
+
+	crypto_unregister_skciphers(aes_algs, ARRAY_SIZE(aes_algs));
+}
+
+static int __init aes_init(void)
+{
+	struct simd_skcipher_alg *simd;
+	const char *basename;
+	const char *algname;
+	const char *drvname;
+	int err;
+	int i;
+
+	if (!(elf_hwcap & HWCAP_ASIMD))
+		return -ENODEV;
+
+	err = crypto_register_skciphers(aes_algs, ARRAY_SIZE(aes_algs));
+	if (err)
+		return err;
+
+	for (i = 0; i < ARRAY_SIZE(aes_algs); i++) {
+		if (!(aes_algs[i].base.cra_flags & CRYPTO_ALG_INTERNAL))
+			continue;
+
+		algname = aes_algs[i].base.cra_name + 2;
+		drvname = aes_algs[i].base.cra_driver_name + 2;
+		basename = aes_algs[i].base.cra_driver_name;
+		simd = simd_skcipher_create_compat(algname, drvname, basename);
+		err = PTR_ERR(simd);
+		if (IS_ERR(simd))
+			goto unregister_simds;
+
+		aes_simd_algs[i] = simd;
+	}
+	return 0;
+
+unregister_simds:
+	aes_exit();
+	return err;
+}
+
+module_init(aes_init);
+module_exit(aes_exit);
-- 
2.7.4

^ permalink raw reply related

* [PATCH v6 6/8] IIO: add STM32 timer trigger driver
From: Jonathan Cameron @ 2017-01-02 18:22 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <CA+M3ks7O_6cWZLAAjEvPD-DQBmU--qrsuh1HfcTa9eLqaT-+fQ@mail.gmail.com>

On 02/01/17 08:46, Benjamin Gaignard wrote:
> 2016-12-30 22:12 GMT+01:00 Jonathan Cameron <jic23@kernel.org>:
>> On 09/12/16 14:15, Benjamin Gaignard wrote:
>>> Timers IPs can be used to generate triggers for other IPs like
>>> DAC, ADC or other timers.
>>> Each trigger may result of timer internals signals like counter enable,
>>> reset or edge, this configuration could be done through "master_mode"
>>> device attribute.
>>>
>>> A timer device could be triggered by other timers, we use the trigger
>>> name and is_stm32_iio_timer_trigger() function to distinguish them
>>> and configure IP input switch.
>>>
>>> Timer may also decide on which event (edge, level) they could
>>> be activated by a trigger, this configuration is done by writing in
>>> "slave_mode" device attribute.
>>>
>>> Since triggers could also be used by DAC or ADC their names are defined
>>> in include/ nux/iio/timer/stm32-timer-trigger.h so those IPs will be able
>>> to configure themselves in valid_trigger function
>>>
>>> Trigger have a "sampling_frequency" attribute which allow to configure
>>> timer sampling frequency without using PWM interface
>>>
>>> version 5:
>>> - simplify tables of triggers
>>> - only create an IIO device when needed
>>>
>>> version 4:
>>> - get triggers configuration from "reg" in DT
>>> - add tables of triggers
>>> - sampling frequency is enable/disable when writing in trigger
>>>   sampling_frequency attribute
>>> - no more use of interruptions
>>>
>>> version 3:
>>> - change compatible to "st,stm32-timer-trigger"
>>> - fix attributes access right
>>> - use string instead of int for master_mode and slave_mode
>>> - document device attributes in sysfs-bus-iio-timer-stm32
>>>
>>> version 2:
>>> - keep only one compatible
>>> - use st,input-triggers-names and st,output-triggers-names
>>>   to know which triggers are accepted and/or create by the device
>> Firstly, sorry it has taken me so long to get back to this.
>>
>> I'm still not keen on this use of iio_device elements just to act as
>> glue between triggers.  I think we need to work out a more light weight
>> way to do this.  As you are only using them for validation and to provide
>> somewhere to hang the control attibutes off, there is nothing stopping us
>> moving that over to the iio_trigger instead which would avoid the messy
>> duality going on here.
> 
> I have add an iio_device because each hardware can generate multiple
> triggers (up to 5: trgo, ch 1...4) and slave_mode attribute will impact all the
> triggers of a device. For me it was making sense to centralize that in an
> iio_device rather than having an attribute "shared" (from hardware
> point of view)
> on multiple triggers.
> Since master_mode attribute is only used by trgo and not impact ch1...4
> triggers I will move it to trigger instead of the iio_device.
> 
> I also wanted to be able to connect triggers on a iio_device as I
> could do for an
> ADC with a command like 'echo "tim1_trgo" > iio_deviceX/trigger/current_trigger'
This is interesting, but with a bit of refactoring I would think it would
be possible to share some of that code thus allowing non IIO devices to
bind to triggers.  Ultimately I want to be able to bind a trigger to
a trigger - I appreciate here the topology is more limited than that
so some complexity comes in.

My gut feeling is that representing that topology explicitly is hard
to do in a remotely general way, but lets try it and see.
We run into this sort of interdependency issue between different bits of
the hardware all the time.  Setting a value somewhere effects the configuration
elsewhere - often the best plan is to just let that happen and leave it up to
userspace to check for changes if it cares.

> If I change that to parent_trigger attribute it change this behavior
> and I will have to
> duplicated what is done in iio_trigger_write_current() to find and
> validate triggers.
I get the reasoning, but we still end up with something represented
by an IIO device that isn't providing any channels at all. It's simply
using some of the infrastructure.  To my mind it is 'something else'
and should be represented as such.  I have no problem at all with
you registering additional elements in /sysfs/bus/iio/ to represent
these shared elements - we already have drivers that do that to
provide some centralized infrastructure (e.g. the sysfs-trigger)

I'm worried about the scope spread we get for an IIO device otherwise.
They serve a well defined purpose at the moment, and that isn't what
is happening here.

So my gut feeling is we are better deliberately not representing the
inter dependence and claiming all triggers we are creating are
independent.  That way we can have a nice generic infrastructure
that will work in all cases (be it pushing the sanity checking to
userspace).

So each trigger has direct access to what controls it.  Changing anything
can effect other triggers in weird ways.

I'm finding it hard to see anything else generalizing sufficiently
as we'll always get cases where we can't represent the topology without
diving into the complexity of something like the media controller
framework.

Jonathan
> 
>> I might still be missing something though!
>>
>> You would only I think need 3 attributes
>>
>> parrent_trigger
>> and something like your master_mode and slave_mode attributes.
>>
>> The parrent_trigger would need some validation etc, but if we keep it
>> within this driver initially that won't be hard to do. Checking the device
>> parent matches will do most of it.
>>
>> Jonathan
>>>
>>> Signed-off-by: Benjamin Gaignard <benjamin.gaignard@st.com>
>>> ---
>>>  .../ABI/testing/sysfs-bus-iio-timer-stm32          |  55 +++
>>>  drivers/iio/Kconfig                                |   2 +-
>>>  drivers/iio/Makefile                               |   1 +
>>>  drivers/iio/timer/Kconfig                          |  13 +
>>>  drivers/iio/timer/Makefile                         |   1 +
>>>  drivers/iio/timer/stm32-timer-trigger.c            | 466 +++++++++++++++++++++
>>>  drivers/iio/trigger/Kconfig                        |   1 -
>>>  include/linux/iio/timer/stm32-timer-trigger.h      |  62 +++
>>>  8 files changed, 599 insertions(+), 2 deletions(-)
>>>  create mode 100644 Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
>>>  create mode 100644 drivers/iio/timer/Kconfig
>>>  create mode 100644 drivers/iio/timer/Makefile
>>>  create mode 100644 drivers/iio/timer/stm32-timer-trigger.c
>>>  create mode 100644 include/linux/iio/timer/stm32-timer-trigger.h
>>>
>>> diff --git a/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32 b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
>>> new file mode 100644
>>> index 0000000..26583dd
>>> --- /dev/null
>>> +++ b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
>>> @@ -0,0 +1,55 @@
>>> +What:                /sys/bus/iio/devices/iio:deviceX/master_mode_available
>>> +KernelVersion:       4.10
>>> +Contact:     benjamin.gaignard at st.com
>>> +Description:
>>> +             Reading returns the list possible master modes which are:
>>> +             - "reset"     : The UG bit from the TIMx_EGR register is used as trigger output (TRGO).
>>> +             - "enable"    : The Counter Enable signal CNT_EN is used as trigger output.
>>> +             - "update"    : The update event is selected as trigger output.
>>> +                             For instance a master timer can then be used as a prescaler for a slave timer.
>>> +             - "compare_pulse" : The trigger output send a positive pulse when the CC1IF flag is to be set.
>>> +             - "OC1REF"    : OC1REF signal is used as trigger output.
>>> +             - "OC2REF"    : OC2REF signal is used as trigger output.
>>> +             - "OC3REF"    : OC3REF signal is used as trigger output.
>>> +             - "OC4REF"    : OC4REF signal is used as trigger output.
>>> +
>>> +What:                /sys/bus/iio/devices/iio:deviceX/master_mode
>>> +KernelVersion:       4.10
>>> +Contact:     benjamin.gaignard at st.com
>>> +Description:
>>> +             Reading returns the current master modes.
>>> +             Writing set the master mode
>>> +
>>> +What:                /sys/bus/iio/devices/iio:deviceX/slave_mode_available
>>> +KernelVersion:       4.10
>>> +Contact:     benjamin.gaignard at st.com
>>> +Description:
>>> +             Reading returns the list possible slave modes which are:
>>> +             - "disabled"  : The prescaler is clocked directly by the internal clock.
>>> +             - "encoder_1" : Counter counts up/down on TI2FP1 edge depending on TI1FP2 level.
>>> +             - "encoder_2" : Counter counts up/down on TI1FP2 edge depending on TI2FP1 level.
>>> +             - "encoder_3" : Counter counts up/down on both TI1FP1 and TI2FP2 edges depending
>>> +                             on the level of the other input.
>>> +             - "reset"     : Rising edge of the selected trigger input reinitializes the counter
>>> +                             and generates an update of the registers.
>>> +             - "gated"     : The counter clock is enabled when the trigger input is high.
>>> +                             The counter stops (but is not reset) as soon as the trigger becomes low.
>>> +                             Both start and stop of the counter are controlled.
>>> +             - "trigger"   : The counter starts at a rising edge of the trigger TRGI (but it is not
>>> +                             reset). Only the start of the counter is controlled.
>>> +             - "external_clock": Rising edges of the selected trigger (TRGI) clock the counter.
>>> +
>>> +What:                /sys/bus/iio/devices/iio:deviceX/slave_mode
>>> +KernelVersion:       4.10
>>> +Contact:     benjamin.gaignard at st.com
>>> +Description:
>>> +             Reading returns the current slave mode.
>>> +             Writing set the slave mode
>>> +
>>> +What:                /sys/bus/iio/devices/triggerX/sampling_frequency
>>> +KernelVersion:       4.10
>>> +Contact:     benjamin.gaignard at st.com
>>> +Description:
>>> +             Reading returns the current sampling frequency.
>>> +             Writing an value different of 0 set and start sampling.
>>> +             Writing 0 stop sampling.
>>> diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
>>> index 6743b18..2de2a80 100644
>>> --- a/drivers/iio/Kconfig
>>> +++ b/drivers/iio/Kconfig
>>> @@ -90,5 +90,5 @@ source "drivers/iio/potentiometer/Kconfig"
>>>  source "drivers/iio/pressure/Kconfig"
>>>  source "drivers/iio/proximity/Kconfig"
>>>  source "drivers/iio/temperature/Kconfig"
>>> -
>>> +source "drivers/iio/timer/Kconfig"
>>>  endif # IIO
>>> diff --git a/drivers/iio/Makefile b/drivers/iio/Makefile
>>> index 87e4c43..b797c08 100644
>>> --- a/drivers/iio/Makefile
>>> +++ b/drivers/iio/Makefile
>>> @@ -32,4 +32,5 @@ obj-y += potentiometer/
>>>  obj-y += pressure/
>>>  obj-y += proximity/
>>>  obj-y += temperature/
>>> +obj-y += timer/
>>>  obj-y += trigger/
>>> diff --git a/drivers/iio/timer/Kconfig b/drivers/iio/timer/Kconfig
>>> new file mode 100644
>>> index 0000000..e3c21f2
>>> --- /dev/null
>>> +++ b/drivers/iio/timer/Kconfig
>>> @@ -0,0 +1,13 @@
>>> +#
>>> +# Timers drivers
>>> +
>>> +menu "Timers"
>>> +
>>> +config IIO_STM32_TIMER_TRIGGER
>>> +     tristate "STM32 Timer Trigger"
>>> +     depends on (ARCH_STM32 && OF && MFD_STM32_TIMERS) || COMPILE_TEST
>>> +     select IIO_TRIGGERED_EVENT
>>> +     help
>>> +       Select this option to enable STM32 Timer Trigger
>>> +
>>> +endmenu
>>> diff --git a/drivers/iio/timer/Makefile b/drivers/iio/timer/Makefile
>>> new file mode 100644
>>> index 0000000..4ad95ec9
>>> --- /dev/null
>>> +++ b/drivers/iio/timer/Makefile
>>> @@ -0,0 +1 @@
>>> +obj-$(CONFIG_IIO_STM32_TIMER_TRIGGER) += stm32-timer-trigger.o
>>> diff --git a/drivers/iio/timer/stm32-timer-trigger.c b/drivers/iio/timer/stm32-timer-trigger.c
>>> new file mode 100644
>>> index 0000000..8d16e8f
>>> --- /dev/null
>>> +++ b/drivers/iio/timer/stm32-timer-trigger.c
>>> @@ -0,0 +1,466 @@
>>> +/*
>>> + * Copyright (C) STMicroelectronics 2016
>>> + *
>>> + * Author: Benjamin Gaignard <benjamin.gaignard@st.com>
>>> + *
>>> + * License terms:  GNU General Public License (GPL), version 2
>>> + */
>>> +
>>> +#include <linux/iio/iio.h>
>>> +#include <linux/iio/sysfs.h>
>>> +#include <linux/iio/timer/stm32-timer-trigger.h>
>>> +#include <linux/iio/trigger.h>
>>> +#include <linux/iio/triggered_event.h>
>>> +#include <linux/interrupt.h>
>>> +#include <linux/mfd/stm32-timers.h>
>>> +#include <linux/module.h>
>>> +#include <linux/platform_device.h>
>>> +
>>> +#define MAX_TRIGGERS 6
>>> +#define MAX_VALIDS 5
>>> +
>>> +/* List the triggers created by each timer */
>>> +static const void *triggers_table[][MAX_TRIGGERS] = {
>>> +     { TIM1_TRGO, TIM1_CH1, TIM1_CH2, TIM1_CH3, TIM1_CH4,},
>>> +     { TIM2_TRGO, TIM2_CH1, TIM2_CH2, TIM2_CH3, TIM2_CH4,},
>>> +     { TIM3_TRGO, TIM3_CH1, TIM3_CH2, TIM3_CH3, TIM3_CH4,},
>>> +     { TIM4_TRGO, TIM4_CH1, TIM4_CH2, TIM4_CH3, TIM4_CH4,},
>>> +     { TIM5_TRGO, TIM5_CH1, TIM5_CH2, TIM5_CH3, TIM5_CH4,},
>>> +     { TIM6_TRGO,},
>>> +     { TIM7_TRGO,},
>>> +     { TIM8_TRGO, TIM8_CH1, TIM8_CH2, TIM8_CH3, TIM8_CH4,},
>>> +     { TIM9_TRGO, TIM9_CH1, TIM9_CH2,},
>>> +     { TIM12_TRGO, TIM12_CH1, TIM12_CH2,},
>>> +};
>>> +
>>> +/* List the triggers accepted by each timer */
>>> +static const void *valids_table[][MAX_VALIDS] = {
>>> +     { TIM5_TRGO, TIM2_TRGO, TIM4_TRGO, TIM3_TRGO,},
>>> +     { TIM1_TRGO, TIM8_TRGO, TIM3_TRGO, TIM4_TRGO,},
>>> +     { TIM1_TRGO, TIM8_TRGO, TIM5_TRGO, TIM4_TRGO,},
>>> +     { TIM1_TRGO, TIM2_TRGO, TIM3_TRGO, TIM8_TRGO,},
>>> +     { TIM2_TRGO, TIM3_TRGO, TIM4_TRGO, TIM8_TRGO,},
>>> +     { }, /* timer 6 */
>>> +     { }, /* timer 7 */
>>> +     { TIM1_TRGO, TIM2_TRGO, TIM4_TRGO, TIM5_TRGO,},
>>> +     { TIM2_TRGO, TIM3_TRGO,},
>>> +     { TIM4_TRGO, TIM5_TRGO,},
>>> +};
>>> +
>>> +struct stm32_timer_trigger {
>>> +     struct device *dev;
>>> +     struct regmap *regmap;
>>> +     struct clk *clk;
>>> +     u32 max_arr;
>>> +     const void *triggers;
>>> +     const void *valids;
>>> +};
>>> +
>>> +static int stm32_timer_start(struct stm32_timer_trigger *priv,
>>> +                          unsigned int frequency)
>>> +{
>>> +     unsigned long long prd, div;
>>> +     int prescaler = 0;
>>> +     u32 ccer, cr1;
>>> +
>>> +     /* Period and prescaler values depends of clock rate */
>>> +     div = (unsigned long long)clk_get_rate(priv->clk);
>>> +
>>> +     do_div(div, frequency);
>>> +
>>> +     prd = div;
>>> +
>>> +     /*
>>> +      * Increase prescaler value until we get a result that fit
>>> +      * with auto reload register maximum value.
>>> +      */
>>> +     while (div > priv->max_arr) {
>>> +             prescaler++;
>>> +             div = prd;
>>> +             do_div(div, (prescaler + 1));
>>> +     }
>>> +     prd = div;
>>> +
>>> +     if (prescaler > MAX_TIM_PSC) {
>>> +             dev_err(priv->dev, "prescaler exceeds the maximum value\n");
>>> +             return -EINVAL;
>>> +     }
>>> +
>>> +     /* Check if nobody else use the timer */
>>> +     regmap_read(priv->regmap, TIM_CCER, &ccer);
>>> +     if (ccer & TIM_CCER_CCXE)
>>> +             return -EBUSY;
>>> +
>>> +     regmap_read(priv->regmap, TIM_CR1, &cr1);
>>> +     if (!(cr1 & TIM_CR1_CEN))
>>> +             clk_enable(priv->clk);
>>> +
>>> +     regmap_write(priv->regmap, TIM_PSC, prescaler);
>>> +     regmap_write(priv->regmap, TIM_ARR, prd - 1);
>>> +     regmap_update_bits(priv->regmap, TIM_CR1, TIM_CR1_ARPE, TIM_CR1_ARPE);
>>> +
>>> +     /* Force master mode to update mode */
>>> +     regmap_update_bits(priv->regmap, TIM_CR2, TIM_CR2_MMS, 0x20);
>>> +
>>> +     /* Make sure that registers are updated */
>>> +     regmap_update_bits(priv->regmap, TIM_EGR, TIM_EGR_UG, TIM_EGR_UG);
>>> +
>>> +     /* Enable controller */
>>> +     regmap_update_bits(priv->regmap, TIM_CR1, TIM_CR1_CEN, TIM_CR1_CEN);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void stm32_timer_stop(struct stm32_timer_trigger *priv)
>>> +{
>>> +     u32 ccer, cr1;
>>> +
>>> +     regmap_read(priv->regmap, TIM_CCER, &ccer);
>>> +     if (ccer & TIM_CCER_CCXE)
>>> +             return;
>>> +
>>> +     regmap_read(priv->regmap, TIM_CR1, &cr1);
>>> +     if (cr1 & TIM_CR1_CEN)
>>> +             clk_disable(priv->clk);
>>> +
>>> +     /* Stop timer */
>>> +     regmap_update_bits(priv->regmap, TIM_CR1, TIM_CR1_CEN, 0);
>>> +     regmap_write(priv->regmap, TIM_PSC, 0);
>>> +     regmap_write(priv->regmap, TIM_ARR, 0);
>>> +}
>>> +
>>> +static ssize_t stm32_tt_store_frequency(struct device *dev,
>>> +                                     struct device_attribute *attr,
>>> +                                     const char *buf, size_t len)
>>> +{
>>> +     struct iio_trigger *trig = to_iio_trigger(dev);
>>> +     struct stm32_timer_trigger *priv = iio_trigger_get_drvdata(trig);
>>> +     unsigned int freq;
>>> +     int ret;
>>> +
>>> +     ret = kstrtouint(buf, 10, &freq);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     if (freq == 0) {
>>> +             stm32_timer_stop(priv);
>>> +     } else {
>>> +             ret = stm32_timer_start(priv, freq);
>>> +             if (ret)
>>> +                     return ret;
>>> +     }
>>> +
>>> +     return len;
>>> +}
>>> +
>>> +static ssize_t stm32_tt_read_frequency(struct device *dev,
>>> +                                    struct device_attribute *attr, char *buf)
>>> +{
>>> +     struct iio_trigger *trig = to_iio_trigger(dev);
>>> +     struct stm32_timer_trigger *priv = iio_trigger_get_drvdata(trig);
>>> +     u32 psc, arr, cr1;
>>> +     unsigned long long freq = 0;
>>> +
>>> +     regmap_read(priv->regmap, TIM_CR1, &cr1);
>>> +     regmap_read(priv->regmap, TIM_PSC, &psc);
>>> +     regmap_read(priv->regmap, TIM_ARR, &arr);
>>> +
>>> +     if (psc && arr && (cr1 & TIM_CR1_CEN)) {
>>> +             freq = (unsigned long long)clk_get_rate(priv->clk);
>>> +             do_div(freq, psc);
>>> +             do_div(freq, arr);
>>> +     }
>>> +
>>> +     return sprintf(buf, "%d\n", (unsigned int)freq);
>>> +}
>>> +
>>> +static IIO_DEV_ATTR_SAMP_FREQ(0660,
>>> +                           stm32_tt_read_frequency,
>>> +                           stm32_tt_store_frequency);
>>> +
>>> +static struct attribute *stm32_trigger_attrs[] = {
>>> +     &iio_dev_attr_sampling_frequency.dev_attr.attr,
>>> +     NULL,
>>> +};
>>> +
>>> +static const struct attribute_group stm32_trigger_attr_group = {
>>> +     .attrs = stm32_trigger_attrs,
>>> +};
>>> +
>>> +static const struct attribute_group *stm32_trigger_attr_groups[] = {
>>> +     &stm32_trigger_attr_group,
>>> +     NULL,
>>> +};
>>> +
>>> +static char *master_mode_table[] = {
>>> +     "reset",
>>> +     "enable",
>>> +     "update",
>>> +     "compare_pulse",
>>> +     "OC1REF",
>>> +     "OC2REF",
>>> +     "OC3REF",
>>> +     "OC4REF"
>>> +};
>>> +
>>> +static ssize_t stm32_tt_show_master_mode(struct device *dev,
>>> +                                      struct device_attribute *attr,
>>> +                                      char *buf)
>>> +{
>>> +     struct iio_dev *indio_dev = dev_to_iio_dev(dev);
>>> +     struct stm32_timer_trigger *priv = iio_priv(indio_dev);
>>> +     u32 cr2;
>>> +
>>> +     regmap_read(priv->regmap, TIM_CR2, &cr2);
>>> +     cr2 = (cr2 & TIM_CR2_MMS) >> TIM_CR2_MMS_SHIFT;
>>> +
>>> +     return snprintf(buf, PAGE_SIZE, "%s\n", master_mode_table[cr2]);
>>> +}
>>> +
>>> +static ssize_t stm32_tt_store_master_mode(struct device *dev,
>>> +                                       struct device_attribute *attr,
>>> +                                       const char *buf, size_t len)
>>> +{
>>> +     struct iio_dev *indio_dev = dev_to_iio_dev(dev);
>>> +     struct stm32_timer_trigger *priv = iio_priv(indio_dev);
>>> +     int i;
>>> +
>>> +     for (i = 0; i < ARRAY_SIZE(master_mode_table); i++) {
>>> +             if (!strncmp(master_mode_table[i], buf,
>>> +                          strlen(master_mode_table[i]))) {
>>> +                     regmap_update_bits(priv->regmap, TIM_CR2,
>>> +                                        TIM_CR2_MMS, i << TIM_CR2_MMS_SHIFT);
>>> +                     return len;
>>> +             }
>>> +     }
>>> +
>>> +     return -EINVAL;
>>> +}
>>> +
>>> +static IIO_CONST_ATTR(master_mode_available,
>>> +     "reset enable update compare_pulse OC1REF OC2REF OC3REF OC4REF");
>>> +
>>> +static IIO_DEVICE_ATTR(master_mode, 0660,
>>> +                    stm32_tt_show_master_mode,
>>> +                    stm32_tt_store_master_mode,
>>> +                    0);
>>> +
>>> +static char *slave_mode_table[] = {
>>> +     "disabled",
>>> +     "encoder_1",
>>> +     "encoder_2",
>>> +     "encoder_3",
>>> +     "reset",
>>> +     "gated",
>>> +     "trigger",
>>> +     "external_clock",
>>> +};
>>> +
>>> +static ssize_t stm32_tt_show_slave_mode(struct device *dev,
>>> +                                     struct device_attribute *attr,
>>> +                                     char *buf)
>>> +{
>>> +     struct iio_dev *indio_dev = dev_to_iio_dev(dev);
>>> +     struct stm32_timer_trigger *priv = iio_priv(indio_dev);
>>> +     u32 smcr;
>>> +
>>> +     regmap_read(priv->regmap, TIM_SMCR, &smcr);
>>> +     smcr &= TIM_SMCR_SMS;
>>> +
>>> +     return snprintf(buf, PAGE_SIZE, "%s\n", slave_mode_table[smcr]);
>>> +}
>>> +
>>> +static ssize_t stm32_tt_store_slave_mode(struct device *dev,
>>> +                                      struct device_attribute *attr,
>>> +                                      const char *buf, size_t len)
>>> +{
>>> +     struct iio_dev *indio_dev = dev_to_iio_dev(dev);
>>> +     struct stm32_timer_trigger *priv = iio_priv(indio_dev);
>>> +     int i;
>>> +
>>> +     for (i = 0; i < ARRAY_SIZE(slave_mode_table); i++) {
>>> +             if (!strncmp(slave_mode_table[i], buf,
>>> +                          strlen(slave_mode_table[i]))) {
>>> +                     regmap_update_bits(priv->regmap,
>>> +                                        TIM_SMCR, TIM_SMCR_SMS, i);
>>> +                     return len;
>>> +             }
>>> +     }
>>> +
>>> +     return -EINVAL;
>>> +}
>>> +
>>> +static IIO_CONST_ATTR(slave_mode_available,
>>> +"disabled encoder_1 encoder_2 encoder_3 reset gated trigger external_clock");
>>> +
>>> +static IIO_DEVICE_ATTR(slave_mode, 0660,
>>> +                    stm32_tt_show_slave_mode,
>>> +                    stm32_tt_store_slave_mode,
>>> +                    0);
>>> +
>>> +static struct attribute *stm32_timer_attrs[] = {
>>> +     &iio_dev_attr_master_mode.dev_attr.attr,
>>> +     &iio_const_attr_master_mode_available.dev_attr.attr,
>>> +     &iio_dev_attr_slave_mode.dev_attr.attr,
>>> +     &iio_const_attr_slave_mode_available.dev_attr.attr,
>>> +     NULL,
>>> +};
>>> +
>>> +static const struct attribute_group stm32_timer_attr_group = {
>>> +     .attrs = stm32_timer_attrs,
>>> +};
>>> +
>>> +static const struct iio_trigger_ops timer_trigger_ops = {
>>> +     .owner = THIS_MODULE,
>>> +};
>>> +
>>> +static int stm32_setup_iio_triggers(struct stm32_timer_trigger *priv)
>>> +{
>>> +     int ret;
>>> +     const char * const *cur = priv->triggers;
>>> +
>>> +     while (cur && *cur) {
>>> +             struct iio_trigger *trig;
>>> +
>>> +             trig = devm_iio_trigger_alloc(priv->dev, "%s", *cur);
>>> +             if  (!trig)
>>> +                     return -ENOMEM;
>>> +
>>> +             trig->dev.parent = priv->dev->parent;
>>> +             trig->ops = &timer_trigger_ops;
>>> +             trig->dev.groups = stm32_trigger_attr_groups;
>>> +             iio_trigger_set_drvdata(trig, priv);
>>> +
>>> +             ret = devm_iio_trigger_register(priv->dev, trig);
>>> +             if (ret)
>>> +                     return ret;
>>> +             cur++;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +/**
>>> + * is_stm32_timer_trigger
>>> + * @trig: trigger to be checked
>>> + *
>>> + * return true if the trigger is a valid stm32 iio timer trigger
>>> + * either return false
>>> + */
>>> +bool is_stm32_timer_trigger(struct iio_trigger *trig)
>>> +{
>>> +     return (trig->ops == &timer_trigger_ops);
>>> +}
>>> +EXPORT_SYMBOL(is_stm32_timer_trigger);
>>> +
>>> +static int stm32_validate_trigger(struct iio_dev *indio_dev,
>>> +                               struct iio_trigger *trig)
>>> +{
>>> +     struct stm32_timer_trigger *priv = iio_priv(indio_dev);
>>> +     const char * const *cur = priv->valids;
>>> +     unsigned int i = 0;
>>> +
>>> +     if (!is_stm32_timer_trigger(trig))
>>> +             return -EINVAL;
>>> +
>>> +     while (cur && *cur) {
>>> +             if (!strncmp(trig->name, *cur, strlen(trig->name))) {
>>> +                     regmap_update_bits(priv->regmap,
>>> +                                        TIM_SMCR, TIM_SMCR_TS,
>>> +                                        i << TIM_SMCR_TS_SHIFT);
>>> +                     return 0;
>>> +             }
>>> +             cur++;
>>> +             i++;
>>> +     }
>>> +
>>> +     return -EINVAL;
>>> +}
>>> +
>>> +static const struct iio_info stm32_trigger_info = {
>>> +     .driver_module = THIS_MODULE,
>>> +     .validate_trigger = stm32_validate_trigger,
>>> +     .attrs = &stm32_timer_attr_group,
>>> +};
>>> +
>>> +static struct stm32_timer_trigger *stm32_setup_iio_device(struct device *dev)
>>> +{
>>> +     struct iio_dev *indio_dev;
>>> +     int ret;
>>> +
>>> +     indio_dev = devm_iio_device_alloc(dev,
>>> +                                       sizeof(struct stm32_timer_trigger));
>>> +     if (!indio_dev)
>>> +             return NULL;
>>> +
>>> +     indio_dev->name = dev_name(dev);
>>> +     indio_dev->dev.parent = dev;
>>> +     indio_dev->info = &stm32_trigger_info;
>>> +     indio_dev->modes = INDIO_EVENT_TRIGGERED;
>>> +     indio_dev->num_channels = 0;
>>> +     indio_dev->dev.of_node = dev->of_node;
>>> +
>>> +     ret = devm_iio_device_register(dev, indio_dev);
>>> +     if (ret)
>>> +             return NULL;
>>> +
>>> +     return iio_priv(indio_dev);
>>> +}
>>> +
>>> +static int stm32_timer_trigger_probe(struct platform_device *pdev)
>>> +{
>>> +     struct device *dev = &pdev->dev;
>>> +     struct stm32_timer_trigger *priv;
>>> +     struct stm32_timers *ddata = dev_get_drvdata(pdev->dev.parent);
>>> +     unsigned int index;
>>> +     int ret;
>>> +
>>> +     if (of_property_read_u32(dev->of_node, "reg", &index))
>>> +             return -EINVAL;
>>> +
>>> +     if (index >= ARRAY_SIZE(triggers_table))
>>> +             return -EINVAL;
>>> +
>>> +     /* Create an IIO device only if we have triggers to be validated */
>>> +     if (*valids_table[index])
>>> +             priv = stm32_setup_iio_device(dev);
>>
>> I still don't like this. Really feels like we shouldn't be creating an
>> iio device with all the bagage that carries just to allow us to do the
>> trigger trees.  We ought to have a much more light weight solution for this
>> functionality - we aren't typically even using the interrupt tree stuff
>> that the triggers for devices are all really about.
>>
>> A simpler approach of allowing each trigger the option of a parent seems like
>> it would be cleaner.  Could be done entirely within this driver in the first
>> instance.  Basically it would just look like your master and slave attributes
>> but have those under triggerX not iio:deviceX.
>>
>> We can work out how to make it more generic later - including perhaps the
>> option to trigger from triggers outside this driver, using some parallel
>> infrastructure to the device triggering.
>>
>>
>>> +     else
>>> +             priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>>> +
>>> +     if (!priv)
>>> +             return -ENOMEM;
>>> +
>>> +     priv->dev = dev;
>>> +     priv->regmap = ddata->regmap;
>>> +     priv->clk = ddata->clk;
>>> +     priv->max_arr = ddata->max_arr;
>>> +     priv->triggers = triggers_table[index];
>>> +     priv->valids = valids_table[index];
>>> +
>>> +     ret = stm32_setup_iio_triggers(priv);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     platform_set_drvdata(pdev, priv);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct of_device_id stm32_trig_of_match[] = {
>>> +     { .compatible = "st,stm32-timer-trigger", },
>>> +     { /* end node */ },
>>> +};
>>> +MODULE_DEVICE_TABLE(of, stm32_trig_of_match);
>>> +
>>> +static struct platform_driver stm32_timer_trigger_driver = {
>>> +     .probe = stm32_timer_trigger_probe,
>>> +     .driver = {
>>> +             .name = "stm32-timer-trigger",
>>> +             .of_match_table = stm32_trig_of_match,
>>> +     },
>>> +};
>>> +module_platform_driver(stm32_timer_trigger_driver);
>>> +
>>> +MODULE_ALIAS("platform: stm32-timer-trigger");
>>> +MODULE_DESCRIPTION("STMicroelectronics STM32 Timer Trigger driver");
>>> +MODULE_LICENSE("GPL v2");
>>> diff --git a/drivers/iio/trigger/Kconfig b/drivers/iio/trigger/Kconfig
>>> index 809b2e7..f2af4fe 100644
>>> --- a/drivers/iio/trigger/Kconfig
>>> +++ b/drivers/iio/trigger/Kconfig
>>> @@ -46,5 +46,4 @@ config IIO_SYSFS_TRIGGER
>>>
>>>         To compile this driver as a module, choose M here: the
>>>         module will be called iio-trig-sysfs.
>>> -
>> Clean this up.
> 
> ok
> 
>>>  endmenu
>>> diff --git a/include/linux/iio/timer/stm32-timer-trigger.h b/include/linux/iio/timer/stm32-timer-trigger.h
>>> new file mode 100644
>>> index 0000000..55535ae
>>> --- /dev/null
>>> +++ b/include/linux/iio/timer/stm32-timer-trigger.h
>>> @@ -0,0 +1,62 @@
>>> +/*
>>> + * Copyright (C) STMicroelectronics 2016
>>> + *
>>> + * Author: Benjamin Gaignard <benjamin.gaignard@st.com>
>>> + *
>>> + * License terms:  GNU General Public License (GPL), version 2
>>> + */
>>> +
>>> +#ifndef _STM32_TIMER_TRIGGER_H_
>>> +#define _STM32_TIMER_TRIGGER_H_
>>> +
>>> +#define TIM1_TRGO    "tim1_trgo"
>>> +#define TIM1_CH1     "tim1_ch1"
>>> +#define TIM1_CH2     "tim1_ch2"
>>> +#define TIM1_CH3     "tim1_ch3"
>>> +#define TIM1_CH4     "tim1_ch4"
>>> +
>>> +#define TIM2_TRGO    "tim2_trgo"
>>> +#define TIM2_CH1     "tim2_ch1"
>>> +#define TIM2_CH2     "tim2_ch2"
>>> +#define TIM2_CH3     "tim2_ch3"
>>> +#define TIM2_CH4     "tim2_ch4"
>>> +
>>> +#define TIM3_TRGO    "tim3_trgo"
>>> +#define TIM3_CH1     "tim3_ch1"
>>> +#define TIM3_CH2     "tim3_ch2"
>>> +#define TIM3_CH3     "tim3_ch3"
>>> +#define TIM3_CH4     "tim3_ch4"
>>> +
>>> +#define TIM4_TRGO    "tim4_trgo"
>>> +#define TIM4_CH1     "tim4_ch1"
>>> +#define TIM4_CH2     "tim4_ch2"
>>> +#define TIM4_CH3     "tim4_ch3"
>>> +#define TIM4_CH4     "tim4_ch4"
>>> +
>>> +#define TIM5_TRGO    "tim5_trgo"
>>> +#define TIM5_CH1     "tim5_ch1"
>>> +#define TIM5_CH2     "tim5_ch2"
>>> +#define TIM5_CH3     "tim5_ch3"
>>> +#define TIM5_CH4     "tim5_ch4"
>>> +
>>> +#define TIM6_TRGO    "tim6_trgo"
>>> +
>>> +#define TIM7_TRGO    "tim7_trgo"
>>> +
>>> +#define TIM8_TRGO    "tim8_trgo"
>>> +#define TIM8_CH1     "tim8_ch1"
>>> +#define TIM8_CH2     "tim8_ch2"
>>> +#define TIM8_CH3     "tim8_ch3"
>>> +#define TIM8_CH4     "tim8_ch4"
>>> +
>>> +#define TIM9_TRGO    "tim9_trgo"
>>> +#define TIM9_CH1     "tim9_ch1"
>>> +#define TIM9_CH2     "tim9_ch2"
>>> +
>>> +#define TIM12_TRGO   "tim12_trgo"
>>> +#define TIM12_CH1    "tim12_ch1"
>>> +#define TIM12_CH2    "tim12_ch2"
>>> +
>>> +bool is_stm32_timer_trigger(struct iio_trigger *trig);
>>> +
>>> +#endif
>>>
>>
> 
> 
> 

^ permalink raw reply

* [PATCH v2 3/5] arm64: dts: exynos5433: Add PPMU dt node
From: Krzysztof Kozlowski @ 2017-01-02 18:33 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1481173091-9728-4-git-send-email-cw00.choi@samsung.com>

On Thu, Dec 08, 2016 at 01:58:09PM +0900, Chanwoo Choi wrote:
> This patch adds PPMU (Platform Performance Monitoring Unit) Device-tree node
> to measure the utilization of each IP in Exynos SoC.
> 
> - PPMU_D{0|1}_CPU are used to measure the utilization of MIF (Memory Interface)
>   block with VDD_MIF power source.
> - PPMU_D{0|1}_GENERAL are used to measure the utilization of INT(Internal)
>   block with VDD_INT power source.
> 
> Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
> Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
> ---
>  arch/arm64/boot/dts/exynos/exynos5433.dtsi | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
Thanks, applied.

Best regards,
Krzysztof

^ permalink raw reply

* [PATCH v2 4/5] arm64: dts: exynos5433: Add bus dt node using VDD_INT for Exynos5433
From: Krzysztof Kozlowski @ 2017-01-02 18:35 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1481173091-9728-5-git-send-email-cw00.choi@samsung.com>

On Thu, Dec 08, 2016 at 01:58:10PM +0900, Chanwoo Choi wrote:
> This patch adds the bus nodes using VDD_INT for Exynos5433 SoC.
> Exynos5433 has the following AMBA AXI buses to translate data
> between DRAM and sub-blocks.
> 
> Following list specify the detailed correlation between sub-block and clock:
> - CLK_ACLK_G2D_{400|266}  : Bus clock for G2D (2D graphic engine)
> - CLK_ACLK_MSCL_400       : Bus clock for MSCL (Memory to memory Scaler)
> - CLK_ACLK_GSCL_333       : Bus clock for GSCL (General Scaler)
> - CLK_SCLK_JPEG_MSCL      : Bus clock for JPEG
> - CLK_ACLK_MFC_400        : Bus clock for MFC (Multi Format Codec)
> - CLK_ACLK_HEVC_400       : Bus clock for HEVC (High Efficient Video Codec)
> - CLK_ACLK_BUS0_400       : NoC(Network On Chip)'s bus clock for PERIC/PERIS/FSYS/MSCL
> - CLK_ACLK_BUS1_400       : NoC's bus clock for MFC/HEVC/G3D
> - CLK_ACLK_BUS2_400       : NoC's bus clock for GSCL/DISP/G2D/CAM0/CAM1/ISP
> 
> Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
> ---
>  arch/arm64/boot/dts/exynos/exynos5433-bus.dtsi | 197 +++++++++++++++++++++++++
>  arch/arm64/boot/dts/exynos/exynos5433.dtsi     |   1 +
>  2 files changed, 198 insertions(+)
>  create mode 100644 arch/arm64/boot/dts/exynos/exynos5433-bus.dtsi
> 
Thanks, applied with changes:
1. Subject prefix,
2. Minor adjustments in commit msg,
3. Fixed missing space in 'status = "disabled"'.

Best regards,
Krzysztof

^ permalink raw reply

* [PATCH v2 5/5] arm64: dts: exynos5433: Add support of bus frequency using VDD_INT on TM2
From: Krzysztof Kozlowski @ 2017-01-02 18:37 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1481173091-9728-6-git-send-email-cw00.choi@samsung.com>

On Thu, Dec 08, 2016 at 01:58:11PM +0900, Chanwoo Choi wrote:
> This patch adds the bus Device-tree nodes for INT (Internal) block
> to enable the bus frequency scaling.
> 
> Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
> Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
> ---
>  arch/arm64/boot/dts/exynos/exynos5433-tm2.dts | 70 +++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
> 
Thanks, applied.

Best regards,
Krzysztof

^ permalink raw reply

* [PATCH] DTS: MCCMON6: IMX: Provide support for iMX6Q based Liebherr mccmon6 board
From: Vladimir Zapolskiy @ 2017-01-02 19:12 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20170102154437.63406b95@jawa>

Hi Lukasz,

please find some comments below as usual.

On 01/02/2017 04:44 PM, Lukasz Majewski wrote:
> Hi Vladimir,
> 
> Thank you for review. Comments without my remarks have been applied
> already.
> 
>> Hello Lukasz,
>>
>> On 12/27/2016 01:19 AM, Lukasz Majewski wrote:
>>> Signed-off-by: Lukasz Majewski <l.majewski@majess.pl>
>>
>> please add a commit message with a short description of the change.
>>
>> Also change subject line to "ARM: dts: imx6q: Add mccmon6 board
>> support".
>>
>>> ---

[snip]

>>> +/ {
>>> +	model = "Monitor6 i.MX6 Quad Board";
>>
>> Missing hardware vendor name.
>>
>>> +	compatible = "mccmon6", "fsl,imx6q";
>>
>> Missing hardware vendor prefix before "mccmon6".
> 
> "lwn,mccmon6" ?
> 

Something like that, but please ensure that you add "lwn" vendor in a separate
preceding change to Documentation/devicetree/bindings/vendor-prefixes.txt

>>
>>> +
>>> +	memory {
>>> +		reg = <0x10000000 0x80000000>;
>>> +	};
>>> +
>>> +	ethernet0 {
>>> +		status = "okay";
>>> +	};
>>
>> It looks like a useless device node, you have a description of &fec
>> already.
>>
>>> +
>>> +	backlight_lvds: backlight {
>>> +		compatible = "pwm-backlight";
>>> +		pinctrl-names = "default";
>>> +		pinctrl-0 = <&pinctrl_display>;
>>
>> I would recommend to rename "pinctrl_display" to "pinctrl_backlight".
>>
>>> +		pwms = <&pwm2 0 5000000 PWM_POLARITY_INVERTED>;
>>
>> This should work when extension to the i.MX PWM driver is merged.
> 
> Yes. The PWM -> apply is an ongoing work. But without the PMW patch the
> board is also fully operational (with reversed PWM :-) )
> 

Right, I believe that the current PWM driver igonores the value passed
in the third cell, so it should be okay.

>>
>>> +		brightness-levels = <  0   1   2   3   4   5   6
>>> 7   8   9
>>> +				      10  11  12  13  14  15  16
>>> 17  18  19
>>> +				      20  21  22  23  24  25  26
>>> 27  28  29
>>> +				      30  31  32  33  34  35  36
>>> 37  38  39
>>> +				      40  41  42  43  44  45  46
>>> 47  48  49
>>> +				      50  51  52  53  54  55  56
>>> 57  58  59
>>> +				      60  61  62  63  64  65  66
>>> 67  68  69
>>> +				      70  71  72  73  74  75  76
>>> 77  78  79
>>> +				      80  81  82  83  84  85  86
>>> 87  88  89
>>> +				      90  91  92  93  94  95  96
>>> 97  98  99
>>> +				     100 101 102 103 104 105 106
>>> 107 108 109
>>> +				     110 111 112 113 114 115 116
>>> 117 118 119
>>> +				     120 121 122 123 124 125 126
>>> 127 128 129
>>> +				     130 131 132 133 134 135 136
>>> 137 138 139
>>> +				     140 141 142 143 144 145 146
>>> 147 148 149
>>> +				     150 151 152 153 154 155 156
>>> 157 158 159
>>> +				     160 161 162 163 164 165 166
>>> 167 168 169
>>> +				     170 171 172 173 174 175 176
>>> 177 178 179
>>> +				     180 181 182 183 184 185 186
>>> 187 188 189
>>> +				     190 191 192 193 194 195 196
>>> 197 198 199
>>> +				     200 201 202 203 204 205 206
>>> 207 208 209
>>> +				     210 211 212 213 214 215 216
>>> 217 218 219
>>> +				     220 221 222 223 224 225 226
>>> 227 228 229
>>> +				     230 231 232 233 234 235 236
>>> 237 238 239
>>> +				     240 241 242 243 244 245 246
>>> 247 248 249
>>> +				     250 251 252 253 254 255>;
>>
>> I'm not sure that actually need such a long list of brightness levels.
> 
> Such brightness-level property is so verbose on purpose - in this board
> we need fine brightness adjustment (harsh environment operation).

Okay.

>>
>>> +		default-brightness-level = <50>;
>>> +		enable-gpios = <&gpio1 2 GPIO_ACTIVE_LOW>;
>>> +	};
>>> +

[snip]

>>> +		pinctrl_display: dispgrp {
>>> +			fsl,pins = <
>>> +				/* BLEN_OUT */
>>> +				MX6QDL_PAD_GPIO_2__GPIO1_IO02
>>> 0x1b0b0
>>> +				/* LVDS_PPEN_OUT */
>>> +				MX6QDL_PAD_SD1_DAT2__GPIO1_IO19
>>> 0x1b0b0
>>
>> This GPIO should be moved to a pinctrl group of regulator-lvds device
>> node.
> 
> You mean to provide separate:
> 
> pinctrl_reg_lvds: req_lvds_grp {
> 		fsl,pins = <
> 		/* LVDS_PPEN_OUT */
> 		MX6QDL_PAD_SD1_DAT2__GPIO1_IO19
> 		>;
> 
> and then
> 
> 	reg_lvds: regulator-lvds {
> 		compatible = "regulator-fixed";
> 		regulator-name = "lvds_ppen";
> 		regulator-min-microvolt = <3300000>;
> 		regulator-max-microvolt = <3300000>;
> 		regulator-boot-on;
> 
> 		pinctrl-names = "default";
> 		pinctrl-0 = <&pinctrl_reg_lvds>;
> 
> 		gpio = <&gpio1 19 GPIO_ACTIVE_HIGH>;
> 		enable-active-high;
> 	};
> 

This looks correct.

[snip]

>>> +
>>> +&uart1 {
>>> +	pinctrl-names = "default";
>>> +	pinctrl-0 = <&pinctrl_uart1>;
>>
>> Should you add "uart-has-rtscts" property?
> 
> This is a simple "console" uart without rts/cts, so this property is
> not needed.
> 

You are right, my review comment is valid for UART4 only.

[snip]

--
With best wishes,
Vladimir

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox