Linux-ARM-Kernel Archive on lore.kernel.org
* [PATCH RFC] drm/sun4i: rgb: Add 5% tolerance to dot clock frequency check
From: Icenowy Zheng @ 2016-11-24 15:15 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124112231.4297-1-wens@csie.org>



24.11.2016, 19:27, "Chen-Yu Tsai" <wens@csie.org>:
> The panels shipped with Allwinner devices are very "generic", i.e.
> they do not have model numbers or reliable sources of information
> for the timings (that we know of) other than the fex files shipped
> on them. The dot clock frequencies provided in the fex files have all
> been rounded to the nearest MHz, as that is the unit used in them.
>
> We were using the simple panel "urt,umsh-8596md-t" as a substitute
> for the A13 Q8 tablets in the absence of a specific model for what
> may be many different but otherwise timing compatible panels. This
> was usable without any visual artifacts or side effects, until the
> dot clock rate check was added in commit bb43d40d7c83 ("drm/sun4i:
> rgb: Validate the clock rate").
>
> This check fails because the dot clock frequency for this model,
> 33.26 MHz, is not achievable with our dot clock hardware; the rate
> returned by clk_round_rate deviates slightly, causing the driver to
> reject the display mode.
>
> The LCD panels have some tolerance on the dot clock frequency, even
> if it's not specified in their datasheets.
>
> This patch adds a 5% tolerance to the dot clock check.

Tested by me on an A33 Q8 tablet with an 800x480 LCD and the
"urt,umsh-8596md-t" compatible.

The tablet is Aoson M751S.

Works properly with sun4i-drm, with my pll-mipi patch applied.

>
> Signed-off-by: Chen-Yu Tsai <wens@csie.org>
> ---
>
> The few LCD panel datasheets I found did not list minimums or maximums
> for the dot clock rate. The 5% tolerance is just something I made up.
> The point is to be able to use our dot clock, which doesn't have the
> resolution needed to generate the exact clock rate requested. AFAIK
> the sun4i driver is one of the strictest ones with regards to the dot
> clock frequency. Some drivers don't even check it.
>
> The clock rates given in vendor fex files are already rounded down to
> MHz resolution. I doubt not using the exact rate as specified in simple
> panels would cause any issues. But my experience is limited here.
> Feedback on this is requested.
>
> ---
>  drivers/gpu/drm/sun4i/sun4i_rgb.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/sun4i/sun4i_rgb.c b/drivers/gpu/drm/sun4i/sun4i_rgb.c
> index d198ad7e5323..66ad86afa561 100644
> --- a/drivers/gpu/drm/sun4i/sun4i_rgb.c
> +++ b/drivers/gpu/drm/sun4i/sun4i_rgb.c
> @@ -93,11 +93,12 @@ static int sun4i_rgb_mode_valid(struct drm_connector *connector,
>
>  	DRM_DEBUG_DRIVER("Vertical parameters OK\n");
>
> +	/* Check against a 5% tolerance for the dot clock */
>  	rounded_rate = clk_round_rate(tcon->dclk, rate);
> -	if (rounded_rate < rate)
> +	if (rounded_rate < rate * 19 / 20)
>  		return MODE_CLOCK_LOW;
>
> -	if (rounded_rate > rate)
> +	if (rounded_rate > rate * 21 / 20)
>  		return MODE_CLOCK_HIGH;
>
>  	DRM_DEBUG_DRIVER("Clock rate OK\n");
> --
> 2.10.2
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
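
The effect of the 19/20 and 21/20 bounds can be checked outside the kernel with a small sketch. The enum and function names are hypothetical stand-ins for the DRM MODE_* results, and the sketch uses 64-bit arithmetic so that rate * 21 cannot overflow regardless of the width of the driver's own variables.

```c
#include <stdint.h>

/* Hypothetical stand-ins for MODE_OK / MODE_CLOCK_LOW / MODE_CLOCK_HIGH. */
enum dclk_status { DCLK_OK, DCLK_LOW, DCLK_HIGH };

/*
 * rate:    requested dot clock in Hz (from the display mode)
 * rounded: the rate the clock framework says it can actually produce
 * Accepts anything within [rate * 19/20, rate * 21/20], i.e. +/-5%.
 */
static enum dclk_status dclk_check(uint64_t rate, uint64_t rounded)
{
	if (rounded < rate * 19 / 20)
		return DCLK_LOW;
	if (rounded > rate * 21 / 20)
		return DCLK_HIGH;
	return DCLK_OK;
}
```

For the 33.26 MHz panel the accepted window is 31.597 MHz to 34.923 MHz, so a hypothetical achievable rate of 33 MHz passes, whereas the old strict comparison (rounded_rate < rate) would have returned MODE_CLOCK_LOW.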


* [PATCH 1/2] i2c: designware: report short transfers
From: Wolfram Sang @ 2016-11-24 15:18 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <E1c7p12-0000R0-S8@rmk-PC.armlinux.org.uk>

On Fri, Nov 18, 2016 at 07:40:04PM +0000, Russell King wrote:
> Rather than reporting success for a short transfer due to interrupt
> latency, report an error both to the caller, as well as to the kernel
> log.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

Applied to for-current, thanks!
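
A minimal sketch of the behaviour the patch describes: a transfer that ends early is reported to both the caller and the log instead of being treated as success. It follows the usual I2C master_xfer convention of returning the number of completed messages on success; the function name, the msgs_done counter, and the choice of -EIO are assumptions, not taken from the patch.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical completion check.
 * msgs_done: messages the controller actually finished
 * num:       messages the caller asked for
 */
static int report_xfer_result(int msgs_done, int num)
{
	if (msgs_done < num) {
		/* Log it, so interrupt-latency problems are visible... */
		fprintf(stderr, "i2c: short transfer: %d of %d messages\n",
			msgs_done, num);
		/* ...and fail the call instead of reporting success. */
		return -EIO;
	}
	return num;
}
```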



* [PATCH 2/2] i2c: designware: fix rx fifo depth tracking
From: Wolfram Sang @ 2016-11-24 15:18 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <E1c7p18-0000R6-4w@rmk-PC.armlinux.org.uk>

On Fri, Nov 18, 2016 at 07:40:10PM +0000, Russell King wrote:
> When loading the TX fifo to receive bytes on the I2C bus, we incorrectly
> count the number of bytes:
> 
> 	rx_limit = dev->rx_fifo_depth - dw_readl(dev, DW_IC_RXFLR);
> 
> 	while (buf_len > 0 && tx_limit > 0 && rx_limit > 0) {
> 		if (rx_limit - dev->rx_outstanding <= 0)
> 			break;
> 		rx_limit--;
> 		dev->rx_outstanding++;
> 	}
> 
> DW_IC_RXFLR indicates how many bytes are available to be read in the
> FIFO, dev->rx_fifo_depth is the FIFO size, and dev->rx_outstanding is
> the number of bytes that we've requested to be read so far, but which
> have not been read.
> 
> Firstly, increasing dev->rx_outstanding and decreasing rx_limit and then
> comparing them results in each byte consuming "two" bytes in this
> tracking, so this is obviously wrong.
> 
> Secondly, the number of bytes that _could_ be received into the FIFO at
> any time is the number of bytes we have so far requested but not yet
> read from the FIFO - in other words dev->rx_outstanding.
> 
> So, in order to request enough bytes to fill the RX FIFO, we need to
> request dev->rx_fifo_depth - dev->rx_outstanding bytes.
> 
> Modifying the code thusly results in us reaching the maximum number of
> bytes outstanding each time we queue more "receive" operations, provided
> the transfer allows that to happen.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

Applied to for-current, thanks!
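
The double counting Russell describes can be reproduced with a small userspace model of the loop. The struct and function names are hypothetical; rxflr stands in for the DW_IC_RXFLR register read, and buf_len/tx_limit are decremented per queued read as the surrounding driver loop would do (an assumption, since the quoted excerpt omits those lines).

```c
/* Hypothetical model of the controller state used by the loop. */
struct dw_model {
	unsigned int rx_fifo_depth;  /* RX FIFO size in bytes */
	unsigned int rx_outstanding; /* reads requested but not yet drained */
};

/* Original accounting: each queued read is counted twice. */
static unsigned int queue_reads_old(struct dw_model *dev, unsigned int rxflr,
				    unsigned int buf_len, unsigned int tx_limit)
{
	int rx_limit = (int)dev->rx_fifo_depth - (int)rxflr;
	unsigned int queued = 0;

	while (buf_len > 0 && tx_limit > 0 && rx_limit > 0) {
		if (rx_limit - (int)dev->rx_outstanding <= 0)
			break;
		rx_limit--;
		dev->rx_outstanding++;
		buf_len--;
		tx_limit--;
		queued++;
	}
	return queued;
}

/* Fixed accounting: free space is depth minus reads still outstanding. */
static unsigned int queue_reads(struct dw_model *dev,
				unsigned int buf_len, unsigned int tx_limit)
{
	int rx_limit = (int)dev->rx_fifo_depth - (int)dev->rx_outstanding;
	unsigned int queued = 0;

	while (buf_len > 0 && tx_limit > 0 && rx_limit > 0) {
		rx_limit--;
		dev->rx_outstanding++;
		buf_len--;
		tx_limit--;
		queued++;
	}
	return queued;
}

/* Model draining n bytes from the RX FIFO. */
static void drain(struct dw_model *dev, unsigned int n)
{
	dev->rx_outstanding -= n;
}
```

With an 8-byte FIFO, an empty RXFLR and nothing outstanding, the old loop stops after 4 reads while the fixed loop fills all 8; after 3 bytes are drained, the fixed loop queues exactly 3 more.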



* Tearing down DMA transfer setup after DMA client has finished
From: Mason @ 2016-11-24 15:20 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <yw1xr3619d3l.fsf@unicorn.mansr.com>

On 24/11/2016 15:17, Måns Rullgård wrote:

> Mason wrote:
> 
>> [   35.085854] SETUP DMA
>> [   35.088272] START NAND TRANSFER
>> [   35.091670] tangox_dma_pchan_start from tangox_dma_irq
>> [   35.096882] tango_dma_callback from vchan_complete
>> [   45.102513] DONE FAKE SPINNING
>>
>> So the IRQ rolls in, the ISR calls tangox_dma_pchan_start,
>> which calls tangox_dma_pchan_detach to tear down the sbox
>> setup; and only sometime later does the DMA framework call
>> my callback function.
> 
> Yes, I realised this soon after I said it.  The dma driver could be
> rearranged to make it work though.

Is there a way to make the tasklet run and invoke the callback
before the interrupt service routine proceeds? Can you say more
about this?


>> So far, the work-arounds I've tested are:
>>
>> 1) delay sbox tear-down by 10 µs in tangox_dma_pchan_detach.
>> 2) statically setup sbox in probe, and never touch it henceforth.
>>
>> WA1 is fragile, it might break for devices other than NFC.
>> WA2 is what I used when I wrote the NFC driver.
>>
>> Can tangox_dma_irq() be changed to have the framework call
>> the client's callback *before* tangox_dma_pchan_start?
>>
>> (Thinking out loud) The DMA_PREP_INTERRUPT requests that the
>> DMA framework invoke the callback from tasklet context,
>> maybe a different flag DMA_PREP_INTERRUPT_EX can request
>> calling the call-back directly from within the ISR?
>>
>> (Looking at existing flags) Could I use DMA_CTRL_ACK?
>> Description sounds like some kind of handshake between
>> client and dmaengine.
>>
>> Grepping for DMA_PREP_INTERRUPT, I don't see where the framework
>> checks that flag to spawn the tasklet? Or is that up to each
>> driver individually?
> 
> Those flags all have defined meanings and abusing them for other things
> is a bad idea.  As far as possible, device drivers should work with any
> dma driver.

I was asking about introducing a new flag, not abusing existing
flags. (I don't understand the semantics of DMA_CTRL_ACK.)

(FWIW, both the NFC and the MBUS agent are custom designs,
not third-party IP blocks.)

Regards.
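
The ordering problem in the trace can be reduced to a small userspace model. It assumes, as described above, that tangox_dma_pchan_start() tears down the sbox setup (via tangox_dma_pchan_detach()) from the ISR, while the client callback is deferred to the vchan tasklet. The model_* names are hypothetical and do not reflect the driver's real code or any dmaengine API.

```c
/* Hypothetical event log recording the order in which things happen. */
#define MAX_EVENTS 8

static const char *events[MAX_EVENTS];
static int n_events;

static void record(const char *what)
{
	if (n_events < MAX_EVENTS)
		events[n_events++] = what;
}

/* Stand-in for the ISR path: detaches the previous sbox setup
 * (and would start any queued descriptor). */
static void model_pchan_start(void)
{
	record("sbox teardown");
}

/* Stand-in for the client's completion callback. */
static void model_client_callback(void)
{
	record("client callback");
}

/* Order seen in the trace: ISR first, callback later from the tasklet. */
static void model_observed_order(void)
{
	n_events = 0;
	model_pchan_start();      /* from tangox_dma_irq() */
	model_client_callback();  /* later, from vchan_complete() */
}

/* Order being asked for: the client learns of completion before
 * the routing is torn down. */
static void model_requested_order(void)
{
	n_events = 0;
	model_client_callback();
	model_pchan_start();
}
```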


* [PATCH 0/2] OF phandle nexus support + GPIO nexus
From: Linus Walleij @ 2016-11-24 15:27 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124102529.20212-1-stephen.boyd@linaro.org>

On Thu, Nov 24, 2016 at 11:25 AM, Stephen Boyd <stephen.boyd@linaro.org> wrote:

> This is one small chunk of work related to DT overlays for expansion
> boards. It would be good to have a way to expose #<list>-cells types of
> providers through a connector in a standard way. So we introduce a way
> to make "nexus" nodes for these types of properties to remap the consumer
> number space to the other side of the connector's number space. It's
> basically a copy of the interrupt nexus implementation, but without
> the address space matching design and interrupt-parent walking.
>
> The first patch implements a generic method to do this, and the second patch
> adds a unit test for it. The third patch is more of an example than anything
> else. It shows how we would modify frameworks to use the new API.
>
> Stephen Boyd (3):
>   of: Support parsing phandle argument lists through a nexus node
>   of: unittest: Add phandle remapping test
>   gpio: Support gpio nexus dt bindings

Looks perfectly reasonable to me. But it's mainly for the DT people to review
I guess.

I have no idea about the eventual merge path though, I guess it needs to
go through the OF tree with my ACK.

Yours,
Linus Walleij
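
The remapping described in the cover letter can be pictured as a table lookup in the style of interrupt-map: the consumer's specifier, after masking, selects a row that names a parent provider and a specifier in that provider's number space. This is a simplified single-cell model; the struct, function, and table contents are hypothetical and do not reproduce the property names or cell layout of the proposed binding.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a hypothetical nexus map. */
struct nexus_map_entry {
	uint32_t child_spec;  /* specifier on the connector side (already masked) */
	int parent;           /* stand-in for the parent provider's phandle */
	uint32_t parent_spec; /* specifier in the parent's number space */
};

/* Translate child_spec through the map; returns 0 on a match, -1 otherwise. */
static int nexus_remap(const struct nexus_map_entry *map, size_t n,
		       uint32_t mask, uint32_t child_spec,
		       int *parent, uint32_t *parent_spec)
{
	for (size_t i = 0; i < n; i++) {
		if ((child_spec & mask) == map[i].child_spec) {
			*parent = map[i].parent;
			*parent_spec = map[i].parent_spec;
			return 0;
		}
	}
	return -1;
}
```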


* [PATCH v2 0/3] ARM: davinci: OHCI: Use a regulator instead of callbacks
From: Axel Haslam @ 2016-11-24 15:28 UTC (permalink / raw)
  To: linux-arm-kernel

Convert users of OHCI pdata to use a regulator instead of
passing platform function pointers. This will help to remove the
ohci platform callbacks in a future series.

Changes from v1->v2
* Add function name and error number in error path print (Sekhar)

Dependencies:
1. [PATCH v7 0/5] USB: ohci-da8xx: Add device tree support
https://lkml.org/lkml/2016/11/23/557

2. [PATCH v3 0/2] regulator: handling of error conditions for usb drivers
https://lkml.org/lkml/2016/11/4/465

Axel Haslam (3):
  ARM: davinci: da830: Handle vbus with a regulator
  ARM: davinci: hawk: Remove vbus and over current gpios
  ARM: davinci: remove ohci platform usage

 arch/arm/mach-davinci/board-da830-evm.c     | 109 ++++++++++------------------
 arch/arm/mach-davinci/board-omapl138-hawk.c |  99 +------------------------
 arch/arm/mach-davinci/include/mach/da8xx.h  |   2 +-
 arch/arm/mach-davinci/usb-da8xx.c           |   3 +-
 4 files changed, 45 insertions(+), 168 deletions(-)

-- 
2.9.3


* [PATCH v2 1/3] ARM: davinci: da830: Handle vbus with a regulator
From: Axel Haslam @ 2016-11-24 15:28 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124152831.26678-1-ahaslam@baylibre.com>

The USB driver can now take a regulator instead of the platform
callbacks for VBUS handling. Let's use a regulator so we can remove
the callbacks in a later patch.

Signed-off-by: Axel Haslam <ahaslam@baylibre.com>
---
 arch/arm/mach-davinci/board-da830-evm.c | 109 ++++++++++++--------------------
 1 file changed, 39 insertions(+), 70 deletions(-)

diff --git a/arch/arm/mach-davinci/board-da830-evm.c b/arch/arm/mach-davinci/board-da830-evm.c
index 5db0901..253d9af 100644
--- a/arch/arm/mach-davinci/board-da830-evm.c
+++ b/arch/arm/mach-davinci/board-da830-evm.c
@@ -14,6 +14,7 @@
 #include <linux/console.h>
 #include <linux/interrupt.h>
 #include <linux/gpio.h>
+#include <linux/gpio/machine.h>
 #include <linux/platform_device.h>
 #include <linux/i2c.h>
 #include <linux/i2c/pcf857x.h>
@@ -28,6 +29,7 @@
 #include <linux/platform_data/spi-davinci.h>
 #include <linux/platform_data/usb-davinci.h>
 #include <linux/regulator/machine.h>
+#include <linux/regulator/fixed.h>
 
 #include <asm/mach-types.h>
 #include <asm/mach/arch.h>
@@ -38,72 +40,48 @@
 #include <mach/da8xx.h>
 
 #define DA830_EVM_PHY_ID		""
-/*
- * USB1 VBUS is controlled by GPIO1[15], over-current is reported on GPIO2[4].
- */
-#define ON_BD_USB_DRV	GPIO_TO_PIN(1, 15)
-#define ON_BD_USB_OVC	GPIO_TO_PIN(2, 4)
 
 static const short da830_evm_usb11_pins[] = {
 	DA830_GPIO1_15, DA830_GPIO2_4,
 	-1
 };
 
-static da8xx_ocic_handler_t da830_evm_usb_ocic_handler;
-
-static int da830_evm_usb_set_power(unsigned port, int on)
-{
-	gpio_set_value(ON_BD_USB_DRV, on);
-	return 0;
-}
-
-static int da830_evm_usb_get_power(unsigned port)
-{
-	return gpio_get_value(ON_BD_USB_DRV);
-}
-
-static int da830_evm_usb_get_oci(unsigned port)
-{
-	return !gpio_get_value(ON_BD_USB_OVC);
-}
-
-static irqreturn_t da830_evm_usb_ocic_irq(int, void *);
+static struct regulator_consumer_supply usb_ohci_consumer_supply =
+	REGULATOR_SUPPLY("vbus", "ohci-da8xx");
 
-static int da830_evm_usb_ocic_notify(da8xx_ocic_handler_t handler)
-{
-	int irq 	= gpio_to_irq(ON_BD_USB_OVC);
-	int error	= 0;
-
-	if (handler != NULL) {
-		da830_evm_usb_ocic_handler = handler;
-
-		error = request_irq(irq, da830_evm_usb_ocic_irq,
-				    IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING,
-				    "OHCI over-current indicator", NULL);
-		if (error)
-			pr_err("%s: could not request IRQ to watch over-current indicator changes\n",
-			       __func__);
-	} else
-		free_irq(irq, NULL);
-
-	return error;
-}
+static struct regulator_init_data usb_ohci_initdata = {
+	.consumer_supplies = &usb_ohci_consumer_supply,
+	.num_consumer_supplies = 1,
+	.constraints = {
+		.valid_ops_mask = REGULATOR_CHANGE_STATUS,
+	},
+};
 
-static struct da8xx_ohci_root_hub da830_evm_usb11_pdata = {
-	.set_power	= da830_evm_usb_set_power,
-	.get_power	= da830_evm_usb_get_power,
-	.get_oci	= da830_evm_usb_get_oci,
-	.ocic_notify	= da830_evm_usb_ocic_notify,
+static struct fixed_voltage_config usb_ohci_config = {
+	.supply_name		= "vbus",
+	.microvolts		= 5000000,
+	.gpio			= GPIO_TO_PIN(1, 15),
+	.enable_high		= 1,
+	.enabled_at_boot	= 0,
+	.init_data		= &usb_ohci_initdata,
+};
 
-	/* TPS2065 switch @ 5V */
-	.potpgt		= (3 + 1) / 2,	/* 3 ms max */
+static struct platform_device da8xx_usb11_regulator = {
+	.name	= "reg-fixed-voltage",
+	.id	= 0,
+	.dev	= {
+		.platform_data = &usb_ohci_config,
+	},
 };
 
-static irqreturn_t da830_evm_usb_ocic_irq(int irq, void *dev_id)
-{
-	da830_evm_usb_ocic_handler(&da830_evm_usb11_pdata, 1);
-	return IRQ_HANDLED;
-}
+static struct gpiod_lookup_table usb11_gpios_table = {
+	.dev_id = "reg-fixed-voltage.0",
+	.table = {
+		/* gpio chip 1 contains gpio range 32-63 */
+		GPIO_LOOKUP("davinci_gpio.1", 4, "over-current",
+				GPIO_ACTIVE_LOW),
+	},
+};
 
 static __init void da830_evm_usb_init(void)
 {
@@ -145,23 +123,14 @@ static __init void da830_evm_usb_init(void)
 		return;
 	}
 
-	ret = gpio_request(ON_BD_USB_DRV, "ON_BD_USB_DRV");
-	if (ret) {
-		pr_err("%s: failed to request GPIO for USB 1.1 port power control: %d\n",
-		       __func__, ret);
-		return;
-	}
-	gpio_direction_output(ON_BD_USB_DRV, 0);
+	gpiod_add_lookup_table(&usb11_gpios_table);
 
-	ret = gpio_request(ON_BD_USB_OVC, "ON_BD_USB_OVC");
-	if (ret) {
-		pr_err("%s: failed to request GPIO for USB 1.1 port over-current indicator: %d\n",
-		       __func__, ret);
-		return;
-	}
-	gpio_direction_input(ON_BD_USB_OVC);
+	ret = platform_device_register(&da8xx_usb11_regulator);
+	if (ret)
+		pr_warn("%s: USB 1.1 regulator register failed: %d\n",
+			__func__, ret);
 
-	ret = da8xx_register_usb11(&da830_evm_usb11_pdata);
+	ret = da8xx_register_usb11(NULL);
 	if (ret)
 		pr_warn("%s: USB 1.1 registration failed: %d\n", __func__, ret);
 }
-- 
2.9.3
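
The lookup-table entry above uses offset 4 on "davinci_gpio.1" for the over-current line that the removed code called GPIO2[4]. The arithmetic can be checked with a small sketch. It assumes the DaVinci definition GPIO_TO_PIN(bank, gpio) = 16 * bank + gpio and, per the comment in the patch, that each gpio_chip covers 32 lines; the helper names are hypothetical.

```c
/* Assumed DaVinci numbering: 16 GPIOs per bank. */
#define GPIO_TO_PIN(bank, gpio) (16 * (bank) + (gpio))
/* Per the patch comment, gpio chip 1 covers pins 32-63. */
#define LINES_PER_CHIP 32

/* Which "davinci_gpio.N" chip a global pin number belongs to. */
static unsigned int chip_index(unsigned int pin)
{
	return pin / LINES_PER_CHIP;
}

/* Offset of the pin within that chip (what GPIO_LOOKUP() expects). */
static unsigned int chip_offset(unsigned int pin)
{
	return pin % LINES_PER_CHIP;
}
```

Under these assumptions GPIO2[4] is global pin 36, i.e. chip 1, offset 4, matching the lookup entry; the VBUS pin GPIO1[15] is pin 31, which falls in chip 0 at offset 31.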


* [PATCH v2 2/3] ARM: davinci: hawk: Remove vbus and over current gpios
From: Axel Haslam @ 2016-11-24 15:28 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124152831.26678-1-ahaslam@baylibre.com>

The hawk board VBUS is fixed to a 5V source, and the over-current
pin is actually not connected to the SoC.

Do not reserve these GPIOs for OHCI, as they are not related
to USB.

Signed-off-by: Axel Haslam <ahaslam@baylibre.com>
---
 arch/arm/mach-davinci/board-omapl138-hawk.c | 99 ++---------------------------
 1 file changed, 4 insertions(+), 95 deletions(-)

diff --git a/arch/arm/mach-davinci/board-omapl138-hawk.c b/arch/arm/mach-davinci/board-omapl138-hawk.c
index a4e8726..a252404 100644
--- a/arch/arm/mach-davinci/board-omapl138-hawk.c
+++ b/arch/arm/mach-davinci/board-omapl138-hawk.c
@@ -28,9 +28,6 @@
 #define DA850_HAWK_MMCSD_CD_PIN		GPIO_TO_PIN(3, 12)
 #define DA850_HAWK_MMCSD_WP_PIN		GPIO_TO_PIN(3, 13)
 
-#define DA850_USB1_VBUS_PIN		GPIO_TO_PIN(2, 4)
-#define DA850_USB1_OC_PIN		GPIO_TO_PIN(6, 13)
-
 static short omapl138_hawk_mii_pins[] __initdata = {
 	DA850_MII_TXEN, DA850_MII_TXCLK, DA850_MII_COL, DA850_MII_TXD_3,
 	DA850_MII_TXD_2, DA850_MII_TXD_1, DA850_MII_TXD_0, DA850_MII_RXER,
@@ -181,76 +178,10 @@ static __init void omapl138_hawk_mmc_init(void)
 	gpio_free(DA850_HAWK_MMCSD_CD_PIN);
 }
 
-static irqreturn_t omapl138_hawk_usb_ocic_irq(int irq, void *dev_id);
-static da8xx_ocic_handler_t hawk_usb_ocic_handler;
-
-static const short da850_hawk_usb11_pins[] = {
-	DA850_GPIO2_4, DA850_GPIO6_13,
-	-1
-};
-
-static int hawk_usb_set_power(unsigned port, int on)
-{
-	gpio_set_value(DA850_USB1_VBUS_PIN, on);
-	return 0;
-}
-
-static int hawk_usb_get_power(unsigned port)
-{
-	return gpio_get_value(DA850_USB1_VBUS_PIN);
-}
-
-static int hawk_usb_get_oci(unsigned port)
-{
-	return !gpio_get_value(DA850_USB1_OC_PIN);
-}
-
-static int hawk_usb_ocic_notify(da8xx_ocic_handler_t handler)
-{
-	int irq         = gpio_to_irq(DA850_USB1_OC_PIN);
-	int error       = 0;
-
-	if (handler != NULL) {
-		hawk_usb_ocic_handler = handler;
-
-		error = request_irq(irq, omapl138_hawk_usb_ocic_irq,
-					IRQF_TRIGGER_RISING |
-					IRQF_TRIGGER_FALLING,
-					"OHCI over-current indicator", NULL);
-		if (error)
-			pr_err("%s: could not request IRQ to watch "
-				"over-current indicator changes\n", __func__);
-	} else {
-		free_irq(irq, NULL);
-	}
-	return error;
-}
-
-static struct da8xx_ohci_root_hub omapl138_hawk_usb11_pdata = {
-	.set_power      = hawk_usb_set_power,
-	.get_power      = hawk_usb_get_power,
-	.get_oci        = hawk_usb_get_oci,
-	.ocic_notify    = hawk_usb_ocic_notify,
-	/* TPS2087 switch @ 5V */
-	.potpgt         = (3 + 1) / 2,  /* 3 ms max */
-};
-
-static irqreturn_t omapl138_hawk_usb_ocic_irq(int irq, void *dev_id)
-{
-	hawk_usb_ocic_handler(&omapl138_hawk_usb11_pdata, 1);
-	return IRQ_HANDLED;
-}
-
 static __init void omapl138_hawk_usb_init(void)
 {
 	int ret;
 
-	ret = davinci_cfg_reg_list(da850_hawk_usb11_pins);
-	if (ret) {
-		pr_warn("%s: USB 1.1 PinMux setup failed: %d\n", __func__, ret);
-		return;
-	}
-
 	ret = da8xx_register_usb20_phy_clk(false);
 	if (ret)
 		pr_warn("%s: USB 2.0 PHY CLK registration failed: %d\n",
@@ -266,34 +197,12 @@ static __init void omapl138_hawk_usb_init(void)
 		pr_warn("%s: USB PHY registration failed: %d\n",
 			__func__, ret);
 
-	ret = gpio_request_one(DA850_USB1_VBUS_PIN,
-			GPIOF_DIR_OUT, "USB1 VBUS");
-	if (ret < 0) {
-		pr_err("%s: failed to request GPIO for USB 1.1 port "
-			"power control: %d\n", __func__, ret);
-		return;
-	}
-
-	ret = gpio_request_one(DA850_USB1_OC_PIN,
-			GPIOF_DIR_IN, "USB1 OC");
-	if (ret < 0) {
-		pr_err("%s: failed to request GPIO for USB 1.1 port "
-			"over-current indicator: %d\n", __func__, ret);
-		goto usb11_setup_oc_fail;
-	}
-
-	ret = da8xx_register_usb11(&omapl138_hawk_usb11_pdata);
-	if (ret) {
-		pr_warn("%s: USB 1.1 registration failed: %d\n", __func__, ret);
-		goto usb11_setup_fail;
-	}
+	ret = da8xx_register_usb11(NULL);
+	if (ret)
+		pr_warn("%s: USB 1.1 registration failed: %d\n",
+			__func__, ret);
 
 	return;
-
-usb11_setup_fail:
-	gpio_free(DA850_USB1_OC_PIN);
-usb11_setup_oc_fail:
-	gpio_free(DA850_USB1_VBUS_PIN);
 }
 
 static __init void omapl138_hawk_init(void)
-- 
2.9.3


* [PATCH v2 3/3] ARM: davinci: remove ohci platform usage
From: Axel Haslam @ 2016-11-24 15:28 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124152831.26678-1-ahaslam@baylibre.com>

As all users of OHCI platform data have been converted
to use a regulator, we don't need to pass platform
data to register the OHCI device anymore.

Signed-off-by: Axel Haslam <ahaslam@baylibre.com>
---
 arch/arm/mach-davinci/board-da830-evm.c     | 2 +-
 arch/arm/mach-davinci/board-omapl138-hawk.c | 2 +-
 arch/arm/mach-davinci/include/mach/da8xx.h  | 2 +-
 arch/arm/mach-davinci/usb-da8xx.c           | 3 +--
 4 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/arm/mach-davinci/board-da830-evm.c b/arch/arm/mach-davinci/board-da830-evm.c
index 253d9af..223de79 100644
--- a/arch/arm/mach-davinci/board-da830-evm.c
+++ b/arch/arm/mach-davinci/board-da830-evm.c
@@ -130,7 +130,7 @@ static __init void da830_evm_usb_init(void)
 		pr_warn("%s: USB 1.1 regulator register failed: %d\n",
 			__func__, ret);
 
-	ret = da8xx_register_usb11(NULL);
+	ret = da8xx_register_usb11();
 	if (ret)
 		pr_warn("%s: USB 1.1 registration failed: %d\n", __func__, ret);
 }
diff --git a/arch/arm/mach-davinci/board-omapl138-hawk.c b/arch/arm/mach-davinci/board-omapl138-hawk.c
index a252404..cbe7324 100644
--- a/arch/arm/mach-davinci/board-omapl138-hawk.c
+++ b/arch/arm/mach-davinci/board-omapl138-hawk.c
@@ -197,7 +197,7 @@ static __init void omapl138_hawk_usb_init(void)
 		pr_warn("%s: USB PHY registration failed: %d\n",
 			__func__, ret);
 
-	ret = da8xx_register_usb11(NULL);
+	ret = da8xx_register_usb11();
 	if (ret)
 		pr_warn("%s: USB 1.1 registration failed: %d\n",
 			__func__, ret);
diff --git a/arch/arm/mach-davinci/include/mach/da8xx.h b/arch/arm/mach-davinci/include/mach/da8xx.h
index 85ff218..b21ef07 100644
--- a/arch/arm/mach-davinci/include/mach/da8xx.h
+++ b/arch/arm/mach-davinci/include/mach/da8xx.h
@@ -91,7 +91,7 @@ int da8xx_register_spi_bus(int instance, unsigned num_chipselect);
 int da8xx_register_watchdog(void);
 int da8xx_register_usb_phy(void);
 int da8xx_register_usb20(unsigned mA, unsigned potpgt);
-int da8xx_register_usb11(struct da8xx_ohci_root_hub *pdata);
+int da8xx_register_usb11(void);
 int da8xx_register_usb_refclkin(int rate);
 int da8xx_register_usb20_phy_clk(bool use_usb_refclkin);
 int da8xx_register_usb11_phy_clk(bool use_usb_refclkin);
diff --git a/arch/arm/mach-davinci/usb-da8xx.c b/arch/arm/mach-davinci/usb-da8xx.c
index b010e5f..a438e2b 100644
--- a/arch/arm/mach-davinci/usb-da8xx.c
+++ b/arch/arm/mach-davinci/usb-da8xx.c
@@ -119,9 +119,8 @@ static struct platform_device da8xx_usb11_device = {
 	.resource	= da8xx_usb11_resources,
 };
 
-int __init da8xx_register_usb11(struct da8xx_ohci_root_hub *pdata)
+int __init da8xx_register_usb11(void)
 {
-	da8xx_usb11_device.dev.platform_data = pdata;
 	return platform_device_register(&da8xx_usb11_device);
 }
 
-- 
2.9.3


* [PATCH 7/10] mmc: sdhci-xenon: Add support to PHYs of Marvell Xenon SDHC
From: Ziji Hu @ 2016-11-24 15:37 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <CAPDyKFrskyZvMwhmj8ioyaowU8GL5=d-6ipGQcmmig_a9E9AQQ@mail.gmail.com>

Hi Ulf,

On 2016/11/24 22:33, Ulf Hansson wrote:
> [...]
> 
>>>
>>>>
>>>> +
>>>> +static int __xenon_emmc_delay_adj_test(struct mmc_card *card)
>>>> +{
>>>> +       int err;
>>>> +       u8 *ext_csd = NULL;
>>>> +
>>>> +       err = mmc_get_ext_csd(card, &ext_csd);
>>>> +       kfree(ext_csd);
>>>
>>> Why do you read the ext csd here?
>>>
>>    I would like to simply introduce the PHY setting of our SDHC.
>>    The target of the PHY setting is to achieve a perfect sampling
>>    point for transfers, during card initialization.
> 
> Okay, so the phy is involved when running the tuning sequence.
> 
    Actually, all transfers pass through our host PHY.

>>
>>    For HS200/HS400/SDR104 whose SDCLK is more than 50MHz, SDHC HW
>>    will search for this sampling point with DLL's help.
> 
> Apologies for my ignorance, but what is a "DLL" in this case?
> 
   DLL stands for delay-locked loop, a HW module similar to a PLL.

>>
>>    For other speed modes whose SDCLK is less than or equal to 50MHz,
>>    SW has to scan the PHY delay line to find this perfect sampling
>>    point. Our driver sends a command to verify a sampling point
>>    in the current environment.
> 
> Ahh, okay! I guess the important part here is to not only send a
> command, but also to make sure data becomes transferred on the DAT
> lines, as to confirm your tuning sequence!?

   Yes.
   It is best if the test command transfers data on the DAT lines.

> 
> In cases of HS200/HS400/SDR104 you should be able to use the
> mmc_send_tuning() API, don't you think?

   For HS200/HS400/SDR104, we ultimately call sdhci_execute_tuning() to
   execute tuning. Those test commands are not used.
   In HS200/HS400/SDR104, the HW provides our host driver with a suitable
   tuning step. Our host driver sets the tuning step in an SDHCI register and
   then starts the standard tuning sequence. The tuning step value provided
   by our host HW improves tuning.

> 
> For the other cases (lower speed modes) in which cards don't support
> the tuning command, perhaps you can just assume the PHY scan succeeded
> and then allow the core to continue with the card initialization
> sequence? Or do you foresee any issues with that? My point is that, if
> it will fail - it will fail anyway.

  Usually, our host driver always successfully scans and selects a
  perfect sampling point.
  If the driver cannot find any suitable sampling point, transfers will
  likely also fail after init. But usually that is an issue caused by
  incorrect settings on the board/SoC/other PHY parameters, especially during development.
  We will fix the issue, and the scan will then succeed in the final product.

> 
>>
>>    As result, our SDHC driver has to implement the functionality to
>>    send commands and check the results, in host layer.
>>    If directly calling mmc_wait_for_cmd() is improper, could you please
>>    give us some suggestions?
>>
>>    For eMMC, CMD8 is used to test current sampling point set in PHY.
> 
> Try to use mmc_send_tuning().
> 

    Could you please tell me the requirements for the "op_code" parameter of
    mmc_send_tuning()?
    According to mmc_send_tuning(), it seems that a tuning command (CMD19/CMD21)
    is required. Thus the device will not respond to mmc_send_tuning() if the
    current speed mode doesn't support the tuning command.
    Please correct me if I am wrong.

>>
>>>> +
>>>> +       return err;
>>>> +}
>>>> +
>>>> +static int __xenon_sdio_delay_adj_test(struct mmc_card *card)
>>>> +{
>>>> +       struct mmc_command cmd = {0};
>>>> +       int err;
>>>> +
>>>> +       cmd.opcode = SD_IO_RW_DIRECT;
>>>> +       cmd.flags = MMC_RSP_R5 | MMC_CMD_AC;
>>>> +
>>>> +       err = mmc_wait_for_cmd(card->host, &cmd, 0);
>>>> +       if (err)
>>>> +               return err;
>>>> +
>>>> +       if (cmd.resp[0] & R5_ERROR)
>>>> +               return -EIO;
>>>> +       if (cmd.resp[0] & R5_FUNCTION_NUMBER)
>>>> +               return -EINVAL;
>>>> +       if (cmd.resp[0] & R5_OUT_OF_RANGE)
>>>> +               return -ERANGE;
>>>> +       return 0;
>>>
>>> No thanks! MMC/SD/SDIO protocol code belongs in the core.
>>>
>>    For SDIO, SD_IO_RW_DIRECT command is sent to test current sampling point
>>    in PHY.
>>    Please help provide some suggestion to implement the command transfer.
> 
> Again, I think mmc_send_tuning() should be possible for you to use.
> 
> [...]
> 
>>>> +       if (mmc->card)
>>>> +               card = mmc->card;
>>>> +       else
>>>> +               /*
>>>> +                * Only valid during initialization
>>>> +                * before mmc->card is set
>>>> +                */
>>>> +               card = priv->card_candidate;
>>>> +       if (unlikely(!card)) {
>>>> +               dev_warn(mmc_dev(mmc), "card is not present\n");
>>>> +               return -EINVAL;
>>>> +       }
>>>
>>> That your host need to hold a copy of the card pointer, tells me that
>>> something is not really correct.
>>>
>>> I might be wrong, if this turns out to be a special case, but I doubt
>>> it. Although, if it *is* such a special case, we shall most likely try
>>> to extend the mmc core layer instead of adding all these hacks in
>>> your host driver.
>>>
>>     This card pointer copies the temporary structure mmc_card
>>     used in mmc_init_card(), mmc_sd_init_card() and mmc_sdio_init_card().
>>     Since we call mmc_wait_for_cmd() to send test commands, we need a copy
>>     of that temporary mmc_card here in our host driver.
> 
> I see, thanks for clarifying.
> 
>>
>>     During PHY setting in card initialization, mmc_host->card is not updated
>>     yet with that temporary mmc_card. Thus we are not able to directly use
>>     mmc_host->card. Instead, this card pointer is introduced to enable
>>     mmc_wait_for_cmd().
>>
>>     If we can improve our host driver to send test commands without mmc_card,
>>     this card pointer can be removed.
>>     Could you please share your opinion?
> 
> The mmc_send_tuning() API takes the mmc_host as parameter. If you
> convert to that, perhaps you would be able to remove the need to hold
> the card pointer.
> 
> BTW, the reason why mmc_send_tuning() doesn't take the card as a
> parameter, is exactly those you just described above.
> 
   Got it.
   Thanks a lot for the information.

   Thank you for the great help.

Best regards,
Hu Ziji

> [...]
> 
> Kind regards
> Uffe
> 
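
For the speed modes at or below 50MHz, the thread describes software scanning the PHY delay line and testing each candidate sampling point with a command. A common way to choose a point from such a scan is to take the middle of the widest run of passing taps; the sketch below assumes that policy, which is not confirmed to be what the Xenon driver does, and the function name and pass[] array are hypothetical.

```c
#include <stdbool.h>

/*
 * pass[i] is true if the test command (ideally with data on the DAT
 * lines) succeeded with delay tap i. Returns the centre of the longest
 * run of passing taps, or -1 if no tap passed.
 */
static int pick_sampling_point(const bool *pass, int ntaps)
{
	int best_start = -1, best_len = 0;
	int start = -1;

	/* Iterate one past the end so a run reaching the last tap is closed. */
	for (int i = 0; i <= ntaps; i++) {
		bool ok = (i < ntaps) && pass[i];

		if (ok && start < 0)
			start = i;              /* a passing window opens */
		if (!ok && start >= 0) {
			int len = i - start;    /* a passing window closes */

			if (len > best_len) {
				best_len = len;
				best_start = start;
			}
			start = -1;
		}
	}
	if (best_len == 0)
		return -1;                      /* nothing passed: scan failed */
	return best_start + best_len / 2;
}
```

Picking the centre of the window rather than its first passing tap leaves the most margin against drift in either direction; if the scan returns -1, transfers would most likely fail anyway, as noted in the reply above.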


* [GIT PULL 1/3] Rockchip defconfig64 changes for 4.10
From: Heiko Stuebner @ 2016-11-24 15:41 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Arnd, Kevin, Olof,

please find below (and in the replies) some more pull requests
for 4.10.

If that is too late, feel free to ignore them and I'll resubmit
after 4.10-rc.


Thanks
Heiko

The following changes since commit 1001354ca34179f3db924eb66672442a173147dc:

  Linux 4.9-rc1 (2016-10-15 12:17:50 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip.git tags/v4.10-rockchip-defconfig64

for you to fetch changes up to 92d22d77b14c988bce0293502a1b3ef25b267ed9:

  arm64: defconfig: allow rk3399-based boards to boot from mmc and usb (2016-11-17 17:15:59 +0100)

----------------------------------------------------------------
64bit defconfig changes to allow arm64 Rockchip socs
to basically boot.

----------------------------------------------------------------
Andy Yan (2):
      arm64: defconfig: enable I2C and DW MMC controller on rockchip platform
      arm64: defconfig: enable RK808 components

Heiko Stuebner (1):
      arm64: defconfig: allow rk3399-based boards to boot from mmc and usb

 arch/arm64/configs/defconfig | 11 +++++++++++
 1 file changed, 11 insertions(+)


* [GIT PULL 2/3] Rockchip dts64 changes for 4.10
From: Heiko Stuebner @ 2016-11-24 15:42 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <1594090.z775ZFR2cz@phil>

The following changes since commit c49590691f3819bb6be3f77938ef39038eb76643:

  arm64: dts: rockchip: replace to "max-frequency" instead of "clock-freq-min-max" (2016-11-09 15:08:55 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip.git tags/v4.10-rockchip-dts64-2

for you to fetch changes up to c6f049db0a783a35685104e83214c9cc02fb74b6:

  dt-bindings: add rockchip RK1108 Evaluation board (2016-11-21 21:47:02 +0100)

----------------------------------------------------------------
Some more powerdomains and usb2-otg support for the rk3399 as well
as the binding doc for the 32bit rk1108 eval board to prevent it
from conflicting with the recently added 64bit px5 board.

----------------------------------------------------------------
Andy Yan (1):
      dt-bindings: add rockchip RK1108 Evaluation board

Elaine Zhang (2):
      arm64: dts: rockchip: add eMMC's power domain support for rk3399
      arm64: dts: rockchip: add pd_sd power-domain node for rk3399

Jeffy Chen (1):
      arm64: dts: rockchip: add gmac needed pclk for rk3399 pd

William Wu (1):
      arm64: dts: rockchip: add usb2-phy otg-port support for rk3399

Yakir Yang (1):
      arm64: dts: rockchip: add backlight support for rk3399 evb board

 Documentation/devicetree/bindings/arm/rockchip.txt |  4 ++
 arch/arm64/boot/dts/rockchip/rk3399-evb.dts        | 40 +++++++++++++++++++
 arch/arm64/boot/dts/rockchip/rk3399.dtsi           | 46 +++++++++++++++++++++-
 3 files changed, 89 insertions(+), 1 deletion(-)

* [GIT PULL 3/3] Rockchip dts32 changes for 4.10
From: Heiko Stuebner @ 2016-11-24 15:43 UTC
  To: linux-arm-kernel
In-Reply-To: <1594090.z775ZFR2cz@phil>

The following changes since commit 6a8883d614c7bede1075a4850139daa9723c291e:

  ARM: dts: rockchip: replace to "max-frequency" instead of "clock-freq-min-max" (2016-11-09 14:46:04 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip.git tags/v4.10-rockchip-dts32-2

for you to fetch changes up to c458e1b504a6a8c817fc56b4d4704273f2c459bb:

  ARM: dts: rockchip: add the sdmmc pinctrl for rk1108 (2016-11-16 13:37:45 +0100)

----------------------------------------------------------------
A bit of attention for the rk3066: a fixed tsadc reset node,
as well as DMA enabled for the uart and mmc controllers.

And one new SoC, the rk1108, combining a single-core Cortex-A7
with a separate DSP core.

----------------------------------------------------------------
Andy Yan (2):
      ARM: dts: rockchip: add basic support for RK1108 SOC
      ARM: dts: rockchip: add rockchip RK1108 Evaluation board

Heiko Stuebner (1):
      Merge branch 'v4.10-shared/clkids' into v4.10-armsoc/dts32

Jacob Chen (1):
      ARM: dts: rockchip: add the sdmmc pinctrl for rk1108

Paweł Jarosz (2):
      ARM: dts: rockchip: fix TSADC reset node for rk3066a
      ARM: dts: rockchip: enable dma for uart and mmc on rk3066a

Shawn Lin (2):
      dt-bindings: rockchip-dw-mshc: add RK1108 dw-mshc description
      clk: rockchip: add dt-binding header for rk1108

 .../devicetree/bindings/mmc/rockchip-dw-mshc.txt   |   1 +
 arch/arm/boot/dts/Makefile                         |   1 +
 arch/arm/boot/dts/rk1108-evb.dts                   |  69 ++++
 arch/arm/boot/dts/rk1108.dtsi                      | 452 +++++++++++++++++++++
 arch/arm/boot/dts/rk3066a.dtsi                     |  19 +-
 include/dt-bindings/clock/rk1108-cru.h             | 269 ++++++++++++
 6 files changed, 810 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm/boot/dts/rk1108-evb.dts
 create mode 100644 arch/arm/boot/dts/rk1108.dtsi
 create mode 100644 include/dt-bindings/clock/rk1108-cru.h

* [PATCH 0/4] crypto: CRCT10DIF support for ARM and arm64
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC
  To: linux-arm-kernel

First of all, apologies to Yue Haibing for stealing his thunder, to some
extent. But after reviewing (and replying to) his patch, I noticed that his
code is not original code, but simply a transliteration of the existing Intel
code that resides in arch/x86/crypto/crct10dif-pcl-asm_64.S, but with the
license and copyright statement removed. 

So, if we are going to transliterate code, let's credit the original authors,
even if the resulting code does not look like the code you started out with.

Then, I noticed that we could stay *much* closer to the original, and that
there is no need for jump tables or computed gotos at all. So I got a bit
carried away, and ended up reimplementing the whole thing, for both arm64
and ARM.

Patch #1 fixes an issue in testmgr that results in spurious false negatives
in the chunking tests if the third chunk exceeds 31 bytes.

Patch #2 expands the existing CRCT10DIF test cases, to ensure that all
code paths are actually covered.

Patch #3 is a straight transliteration of the Intel code to arm64.

Patch #4 is a straight transliteration of the Intel code to ARM. This patch
is generated against patch #3 (using --find-copies-harder) so that it is easy
to see how the ARM code deviates from the arm64 code.

NOTE: this code uses the 64x64->128 bit polynomial multiply instruction,
which is only available on cores that implement the v8 Crypto Extensions.

Ard Biesheuvel (4):
  crypto: testmgr - avoid overlap in chunked tests
  crypto: testmgr - add/enhance test cases for CRC-T10DIF
  crypto: arm64/crct10dif - port x86 SSE implementation to arm64
  crypto: arm/crct10dif - port x86 SSE implementation to ARM

 arch/arm/crypto/Kconfig               |   5 +
 arch/arm/crypto/Makefile              |   2 +
 arch/arm/crypto/crct10dif-ce-core.S   | 569 ++++++++++++++++++++
 arch/arm/crypto/crct10dif-ce-glue.c   |  89 +++
 arch/arm64/crypto/Kconfig             |   5 +
 arch/arm64/crypto/Makefile            |   3 +
 arch/arm64/crypto/crct10dif-ce-core.S | 518 ++++++++++++++++++
 arch/arm64/crypto/crct10dif-ce-glue.c |  80 +++
 crypto/testmgr.c                      |   2 +-
 crypto/testmgr.h                      |  70 ++-
 10 files changed, 1314 insertions(+), 29 deletions(-)
 create mode 100644 arch/arm/crypto/crct10dif-ce-core.S
 create mode 100644 arch/arm/crypto/crct10dif-ce-glue.c
 create mode 100644 arch/arm64/crypto/crct10dif-ce-core.S
 create mode 100644 arch/arm64/crypto/crct10dif-ce-glue.c

-- 
2.7.4

* [PATCH 1/4] crypto: testmgr - avoid overlap in chunked tests
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC
  To: linux-arm-kernel
In-Reply-To: <1480002201-1427-1-git-send-email-ard.biesheuvel@linaro.org>

The IDXn offsets are spaced such that chunks of up to 255 bytes (the
largest tap value) can end up overlapping in the xbuf allocation. In
particular, IDX1 and IDX3 are too close together, so update IDX3 to
avoid this issue.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 62dffa0028ac..15650597dcc9 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -62,7 +62,7 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
  */
 #define IDX1		32
 #define IDX2		32400
-#define IDX3		1
+#define IDX3		511
 #define IDX4		8193
 #define IDX5		22222
 #define IDX6		17101
-- 
2.7.4

* [PATCH 2/4] crypto: testmgr - add/enhance test cases for CRC-T10DIF
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC
  To: linux-arm-kernel
In-Reply-To: <1480002201-1427-1-git-send-email-ard.biesheuvel@linaro.org>

The existing test cases only exercise a small slice of the various
possible code paths through the x86 SSE/PCLMULQDQ implementation,
and the upcoming ports of it to arm64 and ARM. So add a vector that
exceeds 256 bytes in size (in both one-shot and chunked form), and
convert another to a chunked test.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.h | 70 ++++++++++++--------
 1 file changed, 42 insertions(+), 28 deletions(-)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index e64a4ef9d8ca..b7cd41b25a2a 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -1334,36 +1334,50 @@ static struct hash_testvec rmd320_tv_template[] = {
 	}
 };
 
-#define CRCT10DIF_TEST_VECTORS	3
+#define CRCT10DIF_TEST_VECTORS	ARRAY_SIZE(crct10dif_tv_template)
 static struct hash_testvec crct10dif_tv_template[] = {
 	{
-		.plaintext = "abc",
-		.psize  = 3,
-#ifdef __LITTLE_ENDIAN
-		.digest = "\x3b\x44",
-#else
-		.digest = "\x44\x3b",
-#endif
-	}, {
-		.plaintext = "1234567890123456789012345678901234567890"
-			     "123456789012345678901234567890123456789",
-		.psize	= 79,
-#ifdef __LITTLE_ENDIAN
-		.digest	= "\x70\x4b",
-#else
-		.digest	= "\x4b\x70",
-#endif
-	}, {
-		.plaintext =
-		"abcddddddddddddddddddddddddddddddddddddddddddddddddddddd",
-		.psize  = 56,
-#ifdef __LITTLE_ENDIAN
-		.digest = "\xe3\x9c",
-#else
-		.digest = "\x9c\xe3",
-#endif
-		.np     = 2,
-		.tap    = { 28, 28 }
+		.plaintext	= "abc",
+		.psize		= 3,
+		.digest		= (u8 *)(u16 []){ 0x443b },
+	}, {
+		.plaintext 	= "1234567890123456789012345678901234567890"
+				  "123456789012345678901234567890123456789",
+		.psize		= 79,
+		.digest 	= (u8 *)(u16 []){ 0x4b70 },
+		.np		= 2,
+		.tap		= { 63, 16 },
+	}, {
+		.plaintext	= "abcdddddddddddddddddddddddddddddddddddddddd"
+				  "ddddddddddddd",
+		.psize		= 56,
+		.digest		= (u8 *)(u16 []){ 0x9ce3 },
+		.np		= 8,
+		.tap		= { 1, 2, 28, 7, 6, 5, 4, 3 },
+	}, {
+		.plaintext 	= "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "123456789012345678901234567890123456789",
+		.psize		= 319,
+		.digest		= (u8 *)((u16 []){ 0x44c6 }),
+	}, {
+		.plaintext 	= "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "123456789012345678901234567890123456789",
+		.psize		= 319,
+		.digest		= (u8 *)((u16 []){ 0x44c6 }),
+		.np		= 4,
+		.tap		= { 1, 255, 57, 6 },
 	}
 };
 
-- 
2.7.4

* [PATCH 3/4] crypto: arm64/crct10dif - port x86 SSE implementation to arm64
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC
  To: linux-arm-kernel
In-Reply-To: <1480002201-1427-1-git-send-email-ard.biesheuvel@linaro.org>

This is a straight transliteration of the Intel algorithm implemented
using SSE and PCLMULQDQ instructions that resides in the file
arch/x86/crypto/crct10dif-pcl-asm_64.S.

Suggested-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/Kconfig             |   5 +
 arch/arm64/crypto/Makefile            |   3 +
 arch/arm64/crypto/crct10dif-ce-core.S | 518 ++++++++++++++++++++
 arch/arm64/crypto/crct10dif-ce-glue.c |  80 +++
 4 files changed, 606 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 2cf32e9887e1..1b50671ffec3 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -23,6 +23,11 @@ config CRYPTO_GHASH_ARM64_CE
 	depends on ARM64 && KERNEL_MODE_NEON
 	select CRYPTO_HASH
 
+config CRYPTO_CRCT10DIF_ARM64_CE
+	tristate "CRCT10DIF digest algorithm using PMULL instructions"
+	depends on ARM64 && KERNEL_MODE_NEON
+	select CRYPTO_HASH
+
 config CRYPTO_AES_ARM64_CE
 	tristate "AES core cipher using ARMv8 Crypto Extensions"
 	depends on ARM64 && KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index abb79b3cfcfe..36fd3eb4201b 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -17,6 +17,9 @@ sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
+obj-$(CONFIG_CRYPTO_CRCT10DIF_ARM64_CE) += crct10dif-ce.o
+crct10dif-ce-y := crct10dif-ce-core.o crct10dif-ce-glue.o
+
 obj-$(CONFIG_CRYPTO_AES_ARM64_CE) += aes-ce-cipher.o
 CFLAGS_aes-ce-cipher.o += -march=armv8-a+crypto
 
diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
new file mode 100644
index 000000000000..9148ebd3470a
--- /dev/null
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -0,0 +1,518 @@
+//
+// Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
+//
+// Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License version 2 as
+// published by the Free Software Foundation.
+//
+
+//
+// Implement fast CRC-T10DIF computation with SSE and PCLMULQDQ instructions
+//
+// Copyright (c) 2013, Intel Corporation
+//
+// Authors:
+//     Erdinc Ozturk <erdinc.ozturk@intel.com>
+//     Vinodh Gopal <vinodh.gopal@intel.com>
+//     James Guilford <james.guilford@intel.com>
+//     Tim Chen <tim.c.chen@linux.intel.com>
+//
+// This software is available to you under a choice of one of two
+// licenses.  You may choose to be licensed under the terms of the GNU
+// General Public License (GPL) Version 2, available from the file
+// COPYING in the main directory of this source tree, or the
+// OpenIB.org BSD license below:
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// * Redistributions of source code must retain the above copyright
+//   notice, this list of conditions and the following disclaimer.
+//
+// * Redistributions in binary form must reproduce the above copyright
+//   notice, this list of conditions and the following disclaimer in the
+//   documentation and/or other materials provided with the
+//   distribution.
+//
+// * Neither the name of the Intel Corporation nor the names of its
+//   contributors may be used to endorse or promote products derived from
+//   this software without specific prior written permission.
+//
+//
+// THIS SOFTWARE IS PROVIDED BY INTEL CORPORATION ""AS IS"" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION OR
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+//       Function API:
+//       UINT16 crc_t10dif_pcl(
+//               UINT16 init_crc, //initial CRC value, 16 bits
+//               const unsigned char *buf, //buffer pointer to calculate CRC on
+//               UINT64 len //buffer length in bytes (64-bit data)
+//       );
+//
+//       Reference paper titled "Fast CRC Computation for Generic
+//	Polynomials Using PCLMULQDQ Instruction"
+//       URL: http://www.intel.com/content/dam/www/public/us/en/documents
+//  /white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
+//
+//
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+	.text
+	.cpu		generic+crypto
+
+	arg1_low32	.req	w0
+	arg2		.req	x1
+	arg3		.req	x2
+
+	vzr		.req	v13
+
+ENTRY(crc_t10dif_pmull)
+	stp		x29, x30, [sp, #-32]!
+	mov		x29, sp
+
+	movi		vzr.16b, #0		// init zero register
+
+	// adjust the 16-bit initial_crc value, scale it to 32 bits
+	lsl		arg1_low32, arg1_low32, #16
+
+	// check if smaller than 256
+	cmp		arg3, #256
+
+	// for sizes less than 128, we can't fold 64B at a time...
+	b.lt		_less_than_128
+
+	// load the initial crc value
+	// crc value does not need to be byte-reflected, but it needs
+	// to be moved to the high part of the register.
+	// because data will be byte-reflected and will align with
+	// initial crc at correct place.
+	movi		v10.16b, #0
+	mov		v10.s[3], arg1_low32		// initial crc
+
+	// receive the initial 64B data, xor the initial crc value
+	ld1		{v0.2d-v3.2d}, [arg2], #0x40
+	ld1		{v4.2d-v7.2d}, [arg2], #0x40
+CPU_LE(	rev64		v0.16b, v0.16b		)
+CPU_LE(	rev64		v1.16b, v1.16b		)
+CPU_LE(	rev64		v2.16b, v2.16b		)
+CPU_LE(	rev64		v3.16b, v3.16b		)
+CPU_LE(	rev64		v4.16b, v4.16b		)
+CPU_LE(	rev64		v5.16b, v5.16b		)
+CPU_LE(	rev64		v6.16b, v6.16b		)
+CPU_LE(	rev64		v7.16b, v7.16b		)
+
+	ext		v0.16b, v0.16b, v0.16b, #8
+	ext		v1.16b, v1.16b, v1.16b, #8
+	ext		v2.16b, v2.16b, v2.16b, #8
+	ext		v3.16b, v3.16b, v3.16b, #8
+	ext		v4.16b, v4.16b, v4.16b, #8
+	ext		v5.16b, v5.16b, v5.16b, #8
+	ext		v6.16b, v6.16b, v6.16b, #8
+	ext		v7.16b, v7.16b, v7.16b, #8
+
+	// XOR the initial_crc value
+	eor		v0.16b, v0.16b, v10.16b
+
+	ldr		q10, rk3	// xmm10 has rk3 and rk4
+					// type of pmull instruction
+					// will determine which constant to use
+
+	//
+	// we subtract 256 instead of 128 to save one instruction from the loop
+	//
+	sub		arg3, arg3, #256
+
+	// at this section of the code, there is 64*x+y (0<=y<64) bytes of
+	// buffer. The _fold_64_B_loop will fold 64B at a time
+	// until we have 64+y Bytes of buffer
+
+
+	// fold 64B at a time. This section of the code folds 4 vector
+	// registers in parallel
+_fold_64_B_loop:
+
+	.macro		fold64, reg1, reg2
+	ld1		{v11.2d-v12.2d}, [arg2], #0x20
+CPU_LE(	rev64		v11.16b, v11.16b	)
+CPU_LE(	rev64		v12.16b, v12.16b	)
+	ext		v11.16b, v11.16b, v11.16b, #8
+	ext		v12.16b, v12.16b, v12.16b, #8
+
+	pmull2		v8.1q, \reg1\().2d, v10.2d
+	pmull		\reg1\().1q, \reg1\().1d, v10.1d
+	pmull2		v9.1q, \reg2\().2d, v10.2d
+	pmull		\reg2\().1q, \reg2\().1d, v10.1d
+
+	eor		\reg1\().16b, \reg1\().16b, v11.16b
+	eor		\reg2\().16b, \reg2\().16b, v12.16b
+	eor		\reg1\().16b, \reg1\().16b, v8.16b
+	eor		\reg2\().16b, \reg2\().16b, v9.16b
+	.endm
+
+	fold64		v0, v1
+	fold64		v2, v3
+	fold64		v4, v5
+	fold64		v6, v7
+
+	subs		arg3, arg3, #128
+
+	// check if there is another 64B in the buffer to be able to fold
+	b.ge		_fold_64_B_loop
+
+	// at this point, the buffer pointer is pointing at the last y Bytes
+	// of the buffer; the 128B of folded data is in 8 of the vector
+	// registers: v0 - v7
+
+	// fold the 8 vector registers to 1 vector register with different
+	// constants
+
+	.macro		fold16, rk, reg
+	ldr		q10, \rk
+	pmull		v8.1q, \reg\().1d, v10.1d
+	pmull2		\reg\().1q, \reg\().2d, v10.2d
+	eor		v7.16b, v7.16b, v8.16b
+	eor		v7.16b, v7.16b, \reg\().16b
+	.endm
+
+	fold16		rk9, v0
+	fold16		rk11, v1
+	fold16		rk13, v2
+	fold16		rk15, v3
+	fold16		rk17, v4
+	fold16		rk19, v5
+	fold16		rk1, v6
+
+	// instead of 128, we add 112 to the loop counter to save 1 instruction
+	// from the loop; instead of a cmp instruction, we use the negative
+	// flag with the b.lt instruction
+	adds		arg3, arg3, #(128-16)
+	b.lt		_final_reduction_for_128
+
+	// now we have 16+y bytes left to reduce. 16 Bytes is in register v7
+	// and the rest is in memory. We can fold 16 bytes@a time if y>=16
+	// continue folding 16B at a time
+
+_16B_reduction_loop:
+	pmull		v8.1q, v7.1d, v10.1d
+	pmull2		v7.1q, v7.2d, v10.2d
+	eor		v7.16b, v7.16b, v8.16b
+
+	ld1		{v0.2d}, [arg2], #16
+CPU_LE(	rev64		v0.16b, v0.16b		)
+	ext		v0.16b, v0.16b, v0.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+	subs		arg3, arg3, #16
+
+	// instead of a cmp instruction, we utilize the flags with the
+	// jge instruction equivalent of: cmp arg3, 16-16
+	// check if there is any more 16B in the buffer to be able to fold
+	b.ge		_16B_reduction_loop
+
+	// now we have 16+z bytes left to reduce, where 0<= z < 16.
+	// first, we reduce the data in the xmm7 register
+
+_final_reduction_for_128:
+	// check if any more data to fold. If not, compute the CRC of
+	// the final 128 bits
+	adds		arg3, arg3, #16
+	b.eq		_128_done
+
+	// here we are getting data that is less than 16 bytes.
+	// since we know that there was data before the pointer, we can
+	// offset the input pointer before the actual point, to receive
+	// exactly 16 bytes. after that the registers need to be adjusted.
+_get_last_two_regs:
+	mov		v2.16b, v7.16b
+
+	add		arg2, arg2, arg3
+	sub		arg2, arg2, #16
+	ld1		{v1.2d}, [arg2]
+CPU_LE(	rev64		v1.16b, v1.16b		)
+	ext		v1.16b, v1.16b, v1.16b, #8
+
+	// get rid of the extra data that was loaded before
+	// load the shift constant
+	adr		x4, tbl_shf_table + 16
+	sub		x4, x4, arg3
+	ld1		{v0.16b}, [x4]
+
+	// shift v2 to the left by arg3 bytes
+	tbl		v2.16b, {v2.16b}, v0.16b
+
+	// shift v7 to the right by 16-arg3 bytes
+	movi		v9.16b, #0x80
+	eor		v0.16b, v0.16b, v9.16b
+	tbl		v7.16b, {v7.16b}, v0.16b
+
+	// blend
+	sshr		v0.16b, v0.16b, #7	// convert to 8-bit mask
+	bsl		v0.16b, v2.16b, v1.16b
+
+	// fold 16 Bytes
+	pmull		v8.1q, v7.1d, v10.1d
+	pmull2		v7.1q, v7.2d, v10.2d
+	eor		v7.16b, v7.16b, v8.16b
+	eor		v7.16b, v7.16b, v0.16b
+
+_128_done:
+	// compute crc of a 128-bit value
+	ldr		q10, rk5		// rk5 and rk6 in xmm10
+
+	// 64b fold
+	mov		v0.16b, v7.16b
+	ext		v7.16b, v7.16b, v7.16b, #8
+	pmull		v7.1q, v7.1d, v10.1d
+	ext		v0.16b, vzr.16b, v0.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	// 32b fold
+	mov		v0.16b, v7.16b
+	mov		v0.s[3], vzr.s[0]
+	ext		v7.16b, v7.16b, vzr.16b, #12
+	ext		v9.16b, v10.16b, v10.16b, #8
+	pmull		v7.1q, v7.1d, v9.1d
+	eor		v7.16b, v7.16b, v0.16b
+
+	// barrett reduction
+_barrett:
+	ldr		q10, rk7
+	mov		v0.16b, v7.16b
+	ext		v7.16b, v7.16b, v7.16b, #8
+
+	pmull		v7.1q, v7.1d, v10.1d
+	ext		v7.16b, vzr.16b, v7.16b, #12
+	pmull2		v7.1q, v7.2d, v10.2d
+	ext		v7.16b, vzr.16b, v7.16b, #12
+	eor		v7.16b, v7.16b, v0.16b
+	mov		w0, v7.s[1]
+
+_cleanup:
+	// scale the result back to 16 bits
+	lsr		x0, x0, #16
+	ldp		x29, x30, [sp], #32
+	ret
+
+	.align		4
+_less_than_128:
+
+	// check if there is enough buffer to be able to fold 16B at a time
+	cmp		arg3, #32
+	b.lt		_less_than_32
+
+	// now if there is, load the constants
+	ldr		q10, rk1		// rk1 and rk2 in xmm10
+
+	movi		v0.16b, #0
+	mov		v0.s[3], arg1_low32	// get the initial crc value
+	ld1		{v7.2d}, [arg2], #0x10
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	// update the counter. subtract 32 instead of 16 to save one
+	// instruction from the loop
+	sub		arg3, arg3, #32
+
+	b		_16B_reduction_loop
+
+	.align		4
+_less_than_32:
+	cbz		arg3, _cleanup
+
+	movi		v0.16b, #0
+	mov		v0.s[3], arg1_low32	// get the initial crc value
+
+	cmp		arg3, #16
+	b.eq		_exact_16_left
+	b.lt		_less_than_16_left
+
+	ld1		{v7.2d}, [arg2], #0x10
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+	sub		arg3, arg3, #16
+	ldr		q10, rk1		// rk1 and rk2 in xmm10
+	b		_get_last_two_regs
+
+	.align		4
+_less_than_16_left:
+	// use stack space to load data less than 16 bytes, zero-out
+	// the 16B in memory first.
+
+	add		x11, sp, #0x10
+	stp		xzr, xzr, [x11]
+
+	cmp		arg3, #4
+	b.lt		_only_less_than_4
+
+	// backup the counter value
+	mov		x9, arg3
+	tbz		arg3, #3, _less_than_8_left
+
+	// load 8 Bytes
+	ldr		x0, [arg2], #8
+	str		x0, [x11], #8
+	sub		arg3, arg3, #8
+
+_less_than_8_left:
+	tbz		arg3, #2, _less_than_4_left
+
+	// load 4 Bytes
+	ldr		w0, [arg2], #4
+	str		w0, [x11], #4
+	sub		arg3, arg3, #4
+
+_less_than_4_left:
+	tbz		arg3, #1, _less_than_2_left
+
+	// load 2 Bytes
+	ldrh		w0, [arg2], #2
+	strh		w0, [x11], #2
+	sub		arg3, arg3, #2
+
+_less_than_2_left:
+	cbz		arg3, _zero_left
+
+	// load 1 Byte
+	ldrb		w0, [arg2]
+	strb		w0, [x11]
+
+_zero_left:
+	add		x11, sp, #0x10
+	ld1		{v7.2d}, [x11]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	// shl r9, 4
+	adr		x0, tbl_shf_table + 16
+	sub		x0, x0, x9
+	ld1		{v0.16b}, [x0]
+	movi		v9.16b, #0x80
+	eor		v0.16b, v0.16b, v9.16b
+	tbl		v7.16b, {v7.16b}, v0.16b
+
+	b		_128_done
+
+	.align		4
+_exact_16_left:
+	ld1		{v7.2d}, [arg2]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b	// xor the initial crc value
+
+	b		_128_done
+
+_only_less_than_4:
+	cmp		arg3, #3
+	b.lt		_only_less_than_3
+
+	// load 3 Bytes
+	ldrh		w0, [arg2]
+	strh		w0, [x11]
+
+	ldrb		w0, [arg2, #2]
+	strb		w0, [x11, #2]
+
+	ld1		{v7.2d}, [x11]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	ext		v7.16b, v7.16b, vzr.16b, #5
+	b		_barrett
+
+_only_less_than_3:
+	cmp		arg3, #2
+	b.lt		_only_less_than_2
+
+	// load 2 Bytes
+	ldrh		w0, [arg2]
+	strh		w0, [x11]
+
+	ld1		{v7.2d}, [x11]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	ext		v7.16b, v7.16b, vzr.16b, #6
+	b		_barrett
+
+_only_less_than_2:
+
+	// load 1 Byte
+	ldrb		w0, [arg2]
+	strb		w0, [x11]
+
+	ld1		{v7.2d}, [x11]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	ext		v7.16b, v7.16b, vzr.16b, #7
+	b		_barrett
+
+ENDPROC(crc_t10dif_pmull)
+
+// precomputed constants
+// these constants are precomputed from the poly:
+// 0x8bb70000 (0x8bb7 scaled to 32 bits)
+	.align		4
+// Q = 0x18BB70000
+// rk1 = 2^(32*3) mod Q << 32
+// rk2 = 2^(32*5) mod Q << 32
+// rk3 = 2^(32*15) mod Q << 32
+// rk4 = 2^(32*17) mod Q << 32
+// rk5 = 2^(32*3) mod Q << 32
+// rk6 = 2^(32*2) mod Q << 32
+// rk7 = floor(2^64/Q)
+// rk8 = Q
+
+rk1:	.octa		0x06df0000000000002d56000000000000
+rk3:	.octa		0x7cf50000000000009d9d000000000000
+rk5:	.octa		0x13680000000000002d56000000000000
+rk7:	.octa		0x000000018bb7000000000001f65a57f8
+rk9:	.octa		0xbfd6000000000000ceae000000000000
+rk11:	.octa		0x713c0000000000001e16000000000000
+rk13:	.octa		0x80a6000000000000f7f9000000000000
+rk15:	.octa		0xe658000000000000044c000000000000
+rk17:	.octa		0xa497000000000000ad18000000000000
+rk19:	.octa		0xe7b50000000000006ee3000000000000
+
+tbl_shf_table:
+// use these values for shift constants for the tbl/tbx instruction
+// different alignments result in values as shown:
+//	DDQ 0x008f8e8d8c8b8a898887868584838281 # shl 15 (16-1) / shr1
+//	DDQ 0x01008f8e8d8c8b8a8988878685848382 # shl 14 (16-2) / shr2
+//	DDQ 0x0201008f8e8d8c8b8a89888786858483 # shl 13 (16-3) / shr3
+//	DDQ 0x030201008f8e8d8c8b8a898887868584 # shl 12 (16-4) / shr4
+//	DDQ 0x04030201008f8e8d8c8b8a8988878685 # shl 11 (16-5) / shr5
+//	DDQ 0x0504030201008f8e8d8c8b8a89888786 # shl 10 (16-6) / shr6
+//	DDQ 0x060504030201008f8e8d8c8b8a898887 # shl 9  (16-7) / shr7
+//	DDQ 0x07060504030201008f8e8d8c8b8a8988 # shl 8  (16-8) / shr8
+//	DDQ 0x0807060504030201008f8e8d8c8b8a89 # shl 7  (16-9) / shr9
+//	DDQ 0x090807060504030201008f8e8d8c8b8a # shl 6  (16-10) / shr10
+//	DDQ 0x0a090807060504030201008f8e8d8c8b # shl 5  (16-11) / shr11
+//	DDQ 0x0b0a090807060504030201008f8e8d8c # shl 4  (16-12) / shr12
+//	DDQ 0x0c0b0a090807060504030201008f8e8d # shl 3  (16-13) / shr13
+//	DDQ 0x0d0c0b0a090807060504030201008f8e # shl 2  (16-14) / shr14
+//	DDQ 0x0e0d0c0b0a090807060504030201008f # shl 1  (16-15) / shr15
+
+	.byte		 0x0, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87
+	.byte		0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f
+	.byte		 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7
+	.byte		 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe , 0x0
diff --git a/arch/arm64/crypto/crct10dif-ce-glue.c b/arch/arm64/crypto/crct10dif-ce-glue.c
new file mode 100644
index 000000000000..d11f33dae79c
--- /dev/null
+++ b/arch/arm64/crypto/crct10dif-ce-glue.c
@@ -0,0 +1,80 @@
+/*
+ * Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
+ *
+ * Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufeature.h>
+#include <linux/crc-t10dif.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <crypto/internal/hash.h>
+
+#include <asm/neon.h>
+
+asmlinkage u16 crc_t10dif_pmull(u16 init_crc, const u8 buf[], u64 len);
+
+static int crct10dif_init(struct shash_desc *desc)
+{
+	u16 *crc = shash_desc_ctx(desc);
+
+	*crc = 0;
+	return 0;
+}
+
+static int crct10dif_update(struct shash_desc *desc, const u8 *data,
+			 unsigned int length)
+{
+	u16 *crc = shash_desc_ctx(desc);
+
+	kernel_neon_begin_partial(14);
+	*crc = crc_t10dif_pmull(*crc, data, length);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int crct10dif_final(struct shash_desc *desc, u8 *out)
+{
+	u16 *crc = shash_desc_ctx(desc);
+
+	*(u16 *)out = *crc;
+	return 0;
+}
+
+static struct shash_alg crc_t10dif_alg = {
+	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
+	.init			= crct10dif_init,
+	.update			= crct10dif_update,
+	.final			= crct10dif_final,
+
+	.descsize		= CRC_T10DIF_DIGEST_SIZE,
+	.base.cra_name		= "crct10dif",
+	.base.cra_driver_name	= "crct10dif-arm64-ce",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+};
+
+static int __init crc_t10dif_mod_init(void)
+{
+	return crypto_register_shash(&crc_t10dif_alg);
+}
+
+static void __exit crc_t10dif_mod_exit(void)
+{
+	crypto_unregister_shash(&crc_t10dif_alg);
+}
+
+module_cpu_feature_match(PMULL, crc_t10dif_mod_init);
+module_exit(crc_t10dif_mod_exit);
+
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
-- 
2.7.4

* [PATCH 4/4] crypto: arm/crct10dif - port x86 SSE implementation to ARM
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC
  To: linux-arm-kernel
In-Reply-To: <1480002201-1427-1-git-send-email-ard.biesheuvel@linaro.org>

This is a straight transliteration of the Intel algorithm implemented
using SSE and PCLMULQDQ instructions that resides in the file
arch/x86/crypto/crct10dif-pcl-asm_64.S.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/Kconfig                        |   5 +
 arch/arm/crypto/Makefile                       |   2 +
 arch/{arm64 => arm}/crypto/crct10dif-ce-core.S | 457 +++++++++++---------
 arch/{arm64 => arm}/crypto/crct10dif-ce-glue.c |  23 +-
 4 files changed, 277 insertions(+), 210 deletions(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 27ed1b1cd1d7..fce801fa52a1 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -120,4 +120,9 @@ config CRYPTO_GHASH_ARM_CE
 	  that uses the 64x64 to 128 bit polynomial multiplication (vmull.p64)
 	  that is part of the ARMv8 Crypto Extensions
 
+config CRYPTO_CRCT10DIF_ARM_CE
+	tristate "CRCT10DIF digest algorithm using PMULL instructions"
+	depends on KERNEL_MODE_NEON && CRC_T10DIF
+	select CRYPTO_HASH
+
 endif
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index fc5150702b64..fc77265014b7 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -13,6 +13,7 @@ ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_SHA2_ARM_CE) += sha2-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_GHASH_ARM_CE) += ghash-arm-ce.o
+ce-obj-$(CONFIG_CRYPTO_CRCT10DIF_ARM_CE) += crct10dif-arm-ce.o
 
 ifneq ($(ce-obj-y)$(ce-obj-m),)
 ifeq ($(call as-instr,.fpu crypto-neon-fp-armv8,y,n),y)
@@ -36,6 +37,7 @@ sha1-arm-ce-y	:= sha1-ce-core.o sha1-ce-glue.o
 sha2-arm-ce-y	:= sha2-ce-core.o sha2-ce-glue.o
 aes-arm-ce-y	:= aes-ce-core.o aes-ce-glue.o
 ghash-arm-ce-y	:= ghash-ce-core.o ghash-ce-glue.o
+crct10dif-arm-ce-y	:= crct10dif-ce-core.o crct10dif-ce-glue.o
 
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
similarity index 60%
copy from arch/arm64/crypto/crct10dif-ce-core.S
copy to arch/arm/crypto/crct10dif-ce-core.S
index 9148ebd3470a..30168b0f8581 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm/crypto/crct10dif-ce-core.S
@@ -1,5 +1,5 @@
 //
-// Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
+// Accelerated CRC-T10DIF using ARM NEON and Crypto Extensions instructions
 //
 // Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
 //
@@ -71,20 +71,43 @@
 #include <linux/linkage.h>
 #include <asm/assembler.h>
 
-	.text
-	.cpu		generic+crypto
-
-	arg1_low32	.req	w0
-	arg2		.req	x1
-	arg3		.req	x2
+#ifdef CONFIG_CPU_ENDIAN_BE8
+#define CPU_LE(code...)
+#else
+#define CPU_LE(code...)		code
+#endif
 
-	vzr		.req	v13
+	.text
+	.fpu		crypto-neon-fp-armv8
+
+	arg1_low32	.req	r0
+	arg2		.req	r1
+	arg3		.req	r2
+
+	qzr		.req	q13
+
+	q0l		.req	d0
+	q0h		.req	d1
+	q1l		.req	d2
+	q1h		.req	d3
+	q2l		.req	d4
+	q2h		.req	d5
+	q3l		.req	d6
+	q3h		.req	d7
+	q4l		.req	d8
+	q4h		.req	d9
+	q5l		.req	d10
+	q5h		.req	d11
+	q6l		.req	d12
+	q6h		.req	d13
+	q7l		.req	d14
+	q7h		.req	d15
 
 ENTRY(crc_t10dif_pmull)
-	stp		x29, x30, [sp, #-32]!
-	mov		x29, sp
+	push		{r4, lr}
+	sub		sp, sp, #0x10
 
-	movi		vzr.16b, #0		// init zero register
+	vmov.i8		qzr, #0			// init zero register
 
 	// adjust the 16-bit initial_crc value, scale it to 32 bits
 	lsl		arg1_low32, arg1_low32, #16
@@ -93,41 +116,44 @@ ENTRY(crc_t10dif_pmull)
 	cmp		arg3, #256
 
 	// for sizes less than 128, we can't fold 64B at a time...
-	b.lt		_less_than_128
+	blt		_less_than_128
 
 	// load the initial crc value
 	// crc value does not need to be byte-reflected, but it needs
 	// to be moved to the high part of the register.
 	// because data will be byte-reflected and will align with
 	// initial crc at correct place.
-	movi		v10.16b, #0
-	mov		v10.s[3], arg1_low32		// initial crc
+	vmov		s0, arg1_low32		// initial crc
+	vext.8		q10, qzr, q0, #4
 
 	// receive the initial 64B data, xor the initial crc value
-	ld1		{v0.2d-v3.2d}, [arg2], #0x40
-	ld1		{v4.2d-v7.2d}, [arg2], #0x40
-CPU_LE(	rev64		v0.16b, v0.16b		)
-CPU_LE(	rev64		v1.16b, v1.16b		)
-CPU_LE(	rev64		v2.16b, v2.16b		)
-CPU_LE(	rev64		v3.16b, v3.16b		)
-CPU_LE(	rev64		v4.16b, v4.16b		)
-CPU_LE(	rev64		v5.16b, v5.16b		)
-CPU_LE(	rev64		v6.16b, v6.16b		)
-CPU_LE(	rev64		v7.16b, v7.16b		)
-
-	ext		v0.16b, v0.16b, v0.16b, #8
-	ext		v1.16b, v1.16b, v1.16b, #8
-	ext		v2.16b, v2.16b, v2.16b, #8
-	ext		v3.16b, v3.16b, v3.16b, #8
-	ext		v4.16b, v4.16b, v4.16b, #8
-	ext		v5.16b, v5.16b, v5.16b, #8
-	ext		v6.16b, v6.16b, v6.16b, #8
-	ext		v7.16b, v7.16b, v7.16b, #8
+	vld1.64		{q0-q1}, [arg2]!
+	vld1.64		{q2-q3}, [arg2]!
+	vld1.64		{q4-q5}, [arg2]!
+	vld1.64		{q6-q7}, [arg2]!
+CPU_LE(	vrev64.8	q0, q0			)
+CPU_LE(	vrev64.8	q1, q1			)
+CPU_LE(	vrev64.8	q2, q2			)
+CPU_LE(	vrev64.8	q3, q3			)
+CPU_LE(	vrev64.8	q4, q4			)
+CPU_LE(	vrev64.8	q5, q5			)
+CPU_LE(	vrev64.8	q6, q6			)
+CPU_LE(	vrev64.8	q7, q7			)
+
+	vext.8		q0, q0, q0, #8
+	vext.8		q1, q1, q1, #8
+	vext.8		q2, q2, q2, #8
+	vext.8		q3, q3, q3, #8
+	vext.8		q4, q4, q4, #8
+	vext.8		q5, q5, q5, #8
+	vext.8		q6, q6, q6, #8
+	vext.8		q7, q7, q7, #8
 
 	// XOR the initial_crc value
-	eor		v0.16b, v0.16b, v10.16b
+	veor.8		q0, q0, q10
 
-	ldr		q10, rk3	// xmm10 has rk3 and rk4
+	adrl		ip, rk3
+	vld1.64		{q10}, [ip]	// xmm10 has rk3 and rk4
 					// type of pmull instruction
 					// will determine which constant to use
 
@@ -146,32 +172,32 @@ CPU_LE(	rev64		v7.16b, v7.16b		)
 _fold_64_B_loop:
 
 	.macro		fold64, reg1, reg2
-	ld1		{v11.2d-v12.2d}, [arg2], #0x20
-CPU_LE(	rev64		v11.16b, v11.16b	)
-CPU_LE(	rev64		v12.16b, v12.16b	)
-	ext		v11.16b, v11.16b, v11.16b, #8
-	ext		v12.16b, v12.16b, v12.16b, #8
-
-	pmull2		v8.1q, \reg1\().2d, v10.2d
-	pmull		\reg1\().1q, \reg1\().1d, v10.1d
-	pmull2		v9.1q, \reg2\().2d, v10.2d
-	pmull		\reg2\().1q, \reg2\().1d, v10.1d
-
-	eor		\reg1\().16b, \reg1\().16b, v11.16b
-	eor		\reg2\().16b, \reg2\().16b, v12.16b
-	eor		\reg1\().16b, \reg1\().16b, v8.16b
-	eor		\reg2\().16b, \reg2\().16b, v9.16b
+	vld1.64		{q11-q12}, [arg2]!
+CPU_LE(	vrev64.8	q11, q11		)
+CPU_LE(	vrev64.8	q12, q12		)
+	vext.8		q11, q11, q11, #8
+	vext.8		q12, q12, q12, #8
+
+	vmull.p64	q8, \reg1\()h, d21
+	vmull.p64	\reg1\(), \reg1\()l, d20
+	vmull.p64	q9, \reg2\()h, d21
+	vmull.p64	\reg2\(), \reg2\()l, d20
+
+	veor.8		\reg1, \reg1, q11
+	veor.8		\reg2, \reg2, q12
+	veor.8		\reg1, \reg1, q8
+	veor.8		\reg2, \reg2, q9
 	.endm
 
-	fold64		v0, v1
-	fold64		v2, v3
-	fold64		v4, v5
-	fold64		v6, v7
+	fold64		q0, q1
+	fold64		q2, q3
+	fold64		q4, q5
+	fold64		q6, q7
 
 	subs		arg3, arg3, #128
 
 	// check if there is another 64B in the buffer to be able to fold
-	b.ge		_fold_64_B_loop
+	bge		_fold_64_B_loop
 
 	// at this point, the buffer pointer is pointing at the last y Bytes
 	// of the buffer the 64B of folded data is in 4 of the vector
@@ -181,46 +207,47 @@ CPU_LE(	rev64		v12.16b, v12.16b	)
 	// constants
 
 	.macro		fold16, rk, reg
-	ldr		q10, \rk
-	pmull		v8.1q, \reg\().1d, v10.1d
-	pmull2		\reg\().1q, \reg\().2d, v10.2d
-	eor		v7.16b, v7.16b, v8.16b
-	eor		v7.16b, v7.16b, \reg\().16b
+	vldr		d20, \rk
+	vldr		d21, \rk + 8
+	vmull.p64	q8, \reg\()l, d20
+	vmull.p64	\reg\(), \reg\()h, d21
+	veor.8		q7, q7, q8
+	veor.8		q7, q7, \reg
 	.endm
 
-	fold16		rk9, v0
-	fold16		rk11, v1
-	fold16		rk13, v2
-	fold16		rk15, v3
-	fold16		rk17, v4
-	fold16		rk19, v5
-	fold16		rk1, v6
+	fold16		rk9, q0
+	fold16		rk11, q1
+	fold16		rk13, q2
+	fold16		rk15, q3
+	fold16		rk17, q4
+	fold16		rk19, q5
+	fold16		rk1, q6
 
 	// instead of 64, we add 48 to the loop counter to save 1 instruction
 	// from the loop instead of a cmp instruction, we use the negative
 	// flag with the jl instruction
 	adds		arg3, arg3, #(128-16)
-	b.lt		_final_reduction_for_128
+	blt		_final_reduction_for_128
 
 	// now we have 16+y bytes left to reduce. 16 Bytes is in register v7
 	// and the rest is in memory. We can fold 16 bytes at a time if y>=16
 	// continue folding 16B at a time
 
 _16B_reduction_loop:
-	pmull		v8.1q, v7.1d, v10.1d
-	pmull2		v7.1q, v7.2d, v10.2d
-	eor		v7.16b, v7.16b, v8.16b
-
-	ld1		{v0.2d}, [arg2], #16
-CPU_LE(	rev64		v0.16b, v0.16b		)
-	ext		v0.16b, v0.16b, v0.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vmull.p64	q8, d14, d20
+	vmull.p64	q7, d15, d21
+	veor.8		q7, q7, q8
+
+	vld1.64		{q0}, [arg2]!
+CPU_LE(	vrev64.8	q0, q0		)
+	vext.8		q0, q0, q0, #8
+	veor.8		q7, q7, q0
 	subs		arg3, arg3, #16
 
 	// instead of a cmp instruction, we utilize the flags with the
 	// jge instruction equivalent of: cmp arg3, 16-16
 	// check if there is any more 16B in the buffer to be able to fold
-	b.ge		_16B_reduction_loop
+	bge		_16B_reduction_loop
 
 	// now we have 16+z bytes left to reduce, where 0<= z < 16.
 	// first, we reduce the data in the xmm7 register
@@ -229,99 +256,104 @@ _final_reduction_for_128:
 	// check if any more data to fold. If not, compute the CRC of
 	// the final 128 bits
 	adds		arg3, arg3, #16
-	b.eq		_128_done
+	beq		_128_done
 
 	// here we are getting data that is less than 16 bytes.
 	// since we know that there was data before the pointer, we can
 	// offset the input pointer before the actual point, to receive
 	// exactly 16 bytes. after that the registers need to be adjusted.
 _get_last_two_regs:
-	mov		v2.16b, v7.16b
+	vmov		q2, q7
 
 	add		arg2, arg2, arg3
 	sub		arg2, arg2, #16
-	ld1		{v1.2d}, [arg2]
-CPU_LE(	rev64		v1.16b, v1.16b		)
-	ext		v1.16b, v1.16b, v1.16b, #8
+	vld1.64		{q1}, [arg2]
+CPU_LE(	vrev64.8	q1, q1			)
+	vext.8		q1, q1, q1, #8
 
 	// get rid of the extra data that was loaded before
 	// load the shift constant
-	adr		x4, tbl_shf_table + 16
-	sub		x4, x4, arg3
-	ld1		{v0.16b}, [x4]
+	adr		lr, tbl_shf_table + 16
+	sub		lr, lr, arg3
+	vld1.8		{q0}, [lr]
 
 	// shift v2 to the left by arg3 bytes
-	tbl		v2.16b, {v2.16b}, v0.16b
+	vmov		q9, q2
+	vtbl.8		d4, {d18-d19}, d0
+	vtbl.8		d5, {d18-d19}, d1
 
 	// shift v7 to the right by 16-arg3 bytes
-	movi		v9.16b, #0x80
-	eor		v0.16b, v0.16b, v9.16b
-	tbl		v7.16b, {v7.16b}, v0.16b
+	vmov.i8		q9, #0x80
+	veor.8		q0, q0, q9
+	vmov		q9, q7
+	vtbl.8		d14, {d18-d19}, d0
+	vtbl.8		d15, {d18-d19}, d1
 
 	// blend
-	sshr		v0.16b, v0.16b, #7	// convert to 8-bit mask
-	bsl		v0.16b, v2.16b, v1.16b
+	vshr.s8		q0, q0, #7		// convert to 8-bit mask
+	vbsl.8		q0, q2, q1
 
 	// fold 16 Bytes
-	pmull		v8.1q, v7.1d, v10.1d
-	pmull2		v7.1q, v7.2d, v10.2d
-	eor		v7.16b, v7.16b, v8.16b
-	eor		v7.16b, v7.16b, v0.16b
+	vmull.p64	q8, d14, d20
+	vmull.p64	q7, d15, d21
+	veor.8		q7, q7, q8
+	veor.8		q7, q7, q0
 
 _128_done:
 	// compute crc of a 128-bit value
-	ldr		q10, rk5		// rk5 and rk6 in xmm10
+	vldr		d20, rk5
+	vldr		d21, rk6		// rk5 and rk6 in xmm10
 
 	// 64b fold
-	mov		v0.16b, v7.16b
-	ext		v7.16b, v7.16b, v7.16b, #8
-	pmull		v7.1q, v7.1d, v10.1d
-	ext		v0.16b, vzr.16b, v0.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vmov		q0, q7
+	vmull.p64	q7, d15, d20
+	vext.8		q0, qzr, q0, #8
+	veor.8		q7, q7, q0
 
 	// 32b fold
-	mov		v0.16b, v7.16b
-	mov		v0.s[3], vzr.s[0]
-	ext		v7.16b, v7.16b, vzr.16b, #12
-	ext		v9.16b, v10.16b, v10.16b, #8
-	pmull		v7.1q, v7.1d, v9.1d
-	eor		v7.16b, v7.16b, v0.16b
+	veor.8		d1, d1, d1
+	vmov		d0, d14
+	vmov		s2, s30
+	vext.8		q7, q7, qzr, #12
+	vmull.p64	q7, d14, d21
+	veor.8		q7, q7, q0
 
 	// barrett reduction
 _barrett:
-	ldr		q10, rk7
-	mov		v0.16b, v7.16b
-	ext		v7.16b, v7.16b, v7.16b, #8
+	vldr		d20, rk7
+	vldr		d21, rk8
+	vmov.8		q0, q7
 
-	pmull		v7.1q, v7.1d, v10.1d
-	ext		v7.16b, vzr.16b, v7.16b, #12
-	pmull2		v7.1q, v7.2d, v10.2d
-	ext		v7.16b, vzr.16b, v7.16b, #12
-	eor		v7.16b, v7.16b, v0.16b
-	mov		w0, v7.s[1]
+	vmull.p64	q7, d15, d20
+	vext.8		q7, qzr, q7, #12
+	vmull.p64	q7, d15, d21
+	vext.8		q7, qzr, q7, #12
+	veor.8		q7, q7, q0
+	vmov		r0, s29
 
 _cleanup:
 	// scale the result back to 16 bits
-	lsr		x0, x0, #16
-	ldp		x29, x30, [sp], #32
-	ret
+	lsr		r0, r0, #16
+	add		sp, sp, #0x10
+	pop		{r4, pc}
 
 	.align		4
 _less_than_128:
 
 	// check if there is enough buffer to be able to fold 16B at a time
 	cmp		arg3, #32
-	b.lt		_less_than_32
+	blt		_less_than_32
 
 	// now if there is, load the constants
-	ldr		q10, rk1		// rk1 and rk2 in xmm10
+	vldr		d20, rk1
+	vldr		d21, rk2		// rk1 and rk2 in xmm10
 
-	movi		v0.16b, #0
-	mov		v0.s[3], arg1_low32	// get the initial crc value
-	ld1		{v7.2d}, [arg2], #0x10
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vmov.i8		q0, #0
+	vmov		s3, arg1_low32		// get the initial crc value
+	vld1.64		{q7}, [arg2]!
+CPU_LE(	vrev64.8	q7, q7		)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
 	// update the counter. subtract 32 instead of 16 to save one
 	// instruction from the loop
@@ -331,21 +363,23 @@ CPU_LE(	rev64		v7.16b, v7.16b		)
 
 	.align		4
 _less_than_32:
-	cbz		arg3, _cleanup
+	teq		arg3, #0
+	beq		_cleanup
 
-	movi		v0.16b, #0
-	mov		v0.s[3], arg1_low32	// get the initial crc value
+	vmov.i8		q0, #0
+	vmov		s3, arg1_low32		// get the initial crc value
 
 	cmp		arg3, #16
-	b.eq		_exact_16_left
-	b.lt		_less_than_16_left
+	beq		_exact_16_left
+	blt		_less_than_16_left
 
-	ld1		{v7.2d}, [arg2], #0x10
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [arg2]!
+CPU_LE(	vrev64.8	q7, q7		)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 	sub		arg3, arg3, #16
-	ldr		q10, rk1		// rk1 and rk2 in xmm10
+	vldr		d20, rk1
+	vldr		d21, rk2		// rk1 and rk2 in xmm10
 	b		_get_last_two_regs
 
 	.align		4
@@ -353,117 +387,124 @@ _less_than_16_left:
 	// use stack space to load data less than 16 bytes, zero-out
 	// the 16B in memory first.
 
-	add		x11, sp, #0x10
-	stp		xzr, xzr, [x11]
+	vst1.8		{qzr}, [sp]
+	mov		ip, sp
 
 	cmp		arg3, #4
-	b.lt		_only_less_than_4
+	blt		_only_less_than_4
 
 	// backup the counter value
-	mov		x9, arg3
-	tbz		arg3, #3, _less_than_8_left
+	mov		lr, arg3
+	cmp		arg3, #8
+	blt		_less_than_8_left
 
 	// load 8 Bytes
-	ldr		x0, [arg2], #8
-	str		x0, [x11], #8
+	ldr		r0, [arg2], #4
+	ldr		r3, [arg2], #4
+	str		r0, [ip], #4
+	str		r3, [ip], #4
 	sub		arg3, arg3, #8
 
 _less_than_8_left:
-	tbz		arg3, #2, _less_than_4_left
+	cmp		arg3, #4
+	blt		_less_than_4_left
 
 	// load 4 Bytes
-	ldr		w0, [arg2], #4
-	str		w0, [x11], #4
+	ldr		r0, [arg2], #4
+	str		r0, [ip], #4
 	sub		arg3, arg3, #4
 
 _less_than_4_left:
-	tbz		arg3, #1, _less_than_2_left
+	cmp		arg3, #2
+	blt		_less_than_2_left
 
 	// load 2 Bytes
-	ldrh		w0, [arg2], #2
-	strh		w0, [x11], #2
+	ldrh		r0, [arg2], #2
+	strh		r0, [ip], #2
 	sub		arg3, arg3, #2
 
 _less_than_2_left:
-	cbz		arg3, _zero_left
+	cmp		arg3, #1
+	blt		_zero_left
 
 	// load 1 Byte
-	ldrb		w0, [arg2]
-	strb		w0, [x11]
+	ldrb		r0, [arg2]
+	strb		r0, [ip]
 
 _zero_left:
-	add		x11, sp, #0x10
-	ld1		{v7.2d}, [x11]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [sp]
+CPU_LE(	vrev64.8	q7, q7		)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
 	// shl r9, 4
-	adr		x0, tbl_shf_table + 16
-	sub		x0, x0, x9
-	ld1		{v0.16b}, [x0]
-	movi		v9.16b, #0x80
-	eor		v0.16b, v0.16b, v9.16b
-	tbl		v7.16b, {v7.16b}, v0.16b
+	adr		ip, tbl_shf_table + 16
+	sub		ip, ip, lr
+	vld1.8		{q0}, [ip]
+	vmov.i8		q9, #0x80
+	veor.8		q0, q0, q9
+	vmov		q9, q7
+	vtbl.8		d14, {d18-d19}, d0
+	vtbl.8		d15, {d18-d19}, d1
 
 	b		_128_done
 
 	.align		4
 _exact_16_left:
-	ld1		{v7.2d}, [arg2]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b	// xor the initial crc value
+	vld1.64		{q7}, [arg2]
+CPU_LE(	vrev64.8	q7, q7			)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0		// xor the initial crc value
 
 	b		_128_done
 
 _only_less_than_4:
 	cmp		arg3, #3
-	b.lt		_only_less_than_3
+	blt		_only_less_than_3
 
 	// load 3 Bytes
-	ldrh		w0, [arg2]
-	strh		w0, [x11]
+	ldrh		r0, [arg2]
+	strh		r0, [ip]
 
-	ldrb		w0, [arg2, #2]
-	strb		w0, [x11, #2]
+	ldrb		r0, [arg2, #2]
+	strb		r0, [ip, #2]
 
-	ld1		{v7.2d}, [x11]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [ip]
+CPU_LE(	vrev64.8	q7, q7			)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
-	ext		v7.16b, v7.16b, vzr.16b, #5
+	vext.8		q7, q7, qzr, #5
 	b		_barrett
 
 _only_less_than_3:
 	cmp		arg3, #2
-	b.lt		_only_less_than_2
+	blt		_only_less_than_2
 
 	// load 2 Bytes
-	ldrh		w0, [arg2]
-	strh		w0, [x11]
+	ldrh		r0, [arg2]
+	strh		r0, [ip]
 
-	ld1		{v7.2d}, [x11]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [ip]
+CPU_LE(	vrev64.8	q7, q7			)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
-	ext		v7.16b, v7.16b, vzr.16b, #6
+	vext.8		q7, q7, qzr, #6
 	b		_barrett
 
 _only_less_than_2:
 
 	// load 1 Byte
-	ldrb		w0, [arg2]
-	strb		w0, [x11]
+	ldrb		r0, [arg2]
+	strb		r0, [ip]
 
-	ld1		{v7.2d}, [x11]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [ip]
+CPU_LE(	vrev64.8	q7, q7			)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
-	ext		v7.16b, v7.16b, vzr.16b, #7
+	vext.8		q7, q7, qzr, #7
 	b		_barrett
 
 ENDPROC(crc_t10dif_pmull)
@@ -482,16 +523,26 @@ ENDPROC(crc_t10dif_pmull)
 // rk7 = floor(2^64/Q)
 // rk8 = Q
 
-rk1:	.octa		0x06df0000000000002d56000000000000
-rk3:	.octa		0x7cf50000000000009d9d000000000000
-rk5:	.octa		0x13680000000000002d56000000000000
-rk7:	.octa		0x000000018bb7000000000001f65a57f8
-rk9:	.octa		0xbfd6000000000000ceae000000000000
-rk11:	.octa		0x713c0000000000001e16000000000000
-rk13:	.octa		0x80a6000000000000f7f9000000000000
-rk15:	.octa		0xe658000000000000044c000000000000
-rk17:	.octa		0xa497000000000000ad18000000000000
-rk19:	.octa		0xe7b50000000000006ee3000000000000
+rk1:	.quad		0x2d56000000000000
+rk2:	.quad		0x06df000000000000
+rk3:	.quad		0x9d9d000000000000
+rk4:	.quad		0x7cf5000000000000
+rk5:	.quad		0x2d56000000000000
+rk6:	.quad		0x1368000000000000
+rk7:	.quad		0x00000001f65a57f8
+rk8:	.quad		0x000000018bb70000
+rk9:	.quad		0xceae000000000000
+rk10:	.quad		0xbfd6000000000000
+rk11:	.quad		0x1e16000000000000
+rk12:	.quad		0x713c000000000000
+rk13:	.quad		0xf7f9000000000000
+rk14:	.quad		0x80a6000000000000
+rk15:	.quad		0x044c000000000000
+rk16:	.quad		0xe658000000000000
+rk17:	.quad		0xad18000000000000
+rk18:	.quad		0xa497000000000000
+rk19:	.quad		0x6ee3000000000000
+rk20:	.quad		0xe7b5000000000000
 
 tbl_shf_table:
 // use these values for shift constants for the tbl/tbx instruction
diff --git a/arch/arm64/crypto/crct10dif-ce-glue.c b/arch/arm/crypto/crct10dif-ce-glue.c
similarity index 76%
copy from arch/arm64/crypto/crct10dif-ce-glue.c
copy to arch/arm/crypto/crct10dif-ce-glue.c
index d11f33dae79c..e717538d902c 100644
--- a/arch/arm64/crypto/crct10dif-ce-glue.c
+++ b/arch/arm/crypto/crct10dif-ce-glue.c
@@ -1,5 +1,5 @@
 /*
- * Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
+ * Accelerated CRC-T10DIF using ARM NEON and Crypto Extensions instructions
  *
  * Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
  *
@@ -8,7 +8,6 @@
  * published by the Free Software Foundation.
  */
 
-#include <linux/cpufeature.h>
 #include <linux/crc-t10dif.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -18,6 +17,7 @@
 #include <crypto/internal/hash.h>
 
 #include <asm/neon.h>
+#include <asm/simd.h>
 
 asmlinkage u16 crc_t10dif_pmull(u16 init_crc, const u8 buf[], u64 len);
 
@@ -34,9 +34,13 @@ static int crct10dif_update(struct shash_desc *desc, const u8 *data,
 {
 	u16 *crc = shash_desc_ctx(desc);
 
-	kernel_neon_begin_partial(14);
-	*crc = crc_t10dif_pmull(*crc, data, length);
-	kernel_neon_end();
+	if (may_use_simd()) {
+		kernel_neon_begin();
+		*crc = crc_t10dif_pmull(*crc, data, length);
+		kernel_neon_end();
+	} else {
+		*crc = crc_t10dif_generic(*crc, data, length);
+	}
 
 	return 0;
 }
@@ -57,7 +61,7 @@ static struct shash_alg crc_t10dif_alg = {
 
 	.descsize		= CRC_T10DIF_DIGEST_SIZE,
 	.base.cra_name		= "crct10dif",
-	.base.cra_driver_name	= "crct10dif-arm64-ce",
+	.base.cra_driver_name	= "crct10dif-arm-ce",
 	.base.cra_priority	= 200,
 	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
 	.base.cra_module	= THIS_MODULE,
@@ -65,6 +69,9 @@ static struct shash_alg crc_t10dif_alg = {
 
 static int __init crc_t10dif_mod_init(void)
 {
+	if (!(elf_hwcap2 & HWCAP2_PMULL))
+		return -ENODEV;
+
 	return crypto_register_shash(&crc_t10dif_alg);
 }
 
@@ -73,8 +80,10 @@ static void __exit crc_t10dif_mod_exit(void)
 	crypto_unregister_shash(&crc_t10dif_alg);
 }
 
-module_cpu_feature_match(PMULL, crc_t10dif_mod_init);
+module_init(crc_t10dif_mod_init);
 module_exit(crc_t10dif_mod_exit);
 
 MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
 MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("crct10dif");
+MODULE_ALIAS_CRYPTO("crct10dif-arm-ce");
-- 
2.7.4
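A simple way to sanity-check this port on real hardware is to compare crc_t10dif_pmull() against a slow reference. Below is a sketch of a plain bit-at-a-time CRC-T10DIF. It assumes the standard parameters: generator 0x18BB7 (the Q stored in rk8 above), initial value 0, no bit reflection and no final XOR. crc_t10dif_ref() is a hypothetical name, not a kernel function. It takes the same (crc, buf, len) arguments as crc_t10dif_pmull(), so partial results chain the same way they do in crct10dif_update().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit-at-a-time CRC-T10DIF reference (poly 0x8BB7, MSB first,
 * no reflection, no final XOR). Slow, but easy to audit. */
static uint16_t crc_t10dif_ref(uint16_t crc, const uint8_t *buf, size_t len)
{
	assert(buf != NULL || len == 0);

	for (size_t i = 0; i < len; i++) {
		/* feed the next message byte into the top 8 bits of the CRC */
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++) {
			/* shift out the MSB; if it was set, subtract (XOR) the polynomial */
			if (crc & 0x8000)
				crc = (uint16_t)((crc << 1) ^ 0x8BB7);
			else
				crc = (uint16_t)(crc << 1);
		}
	}
	return crc;
}
```

The standard check value for the ASCII string "123456789" with these parameters is 0xD0DB. Comparing both implementations on random lengths is useful, in particular lengths below 16 and 128 bytes, which take the _less_than_* paths.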

^ permalink raw reply related

* [PATCH 1/4] serial: core: Add LED trigger support
From: Mathieu Poirier @ 2016-11-24 15:45 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124064137.l3i3p3brbyz74cd5@pengutronix.de>

On Thu, Nov 24, 2016 at 07:41:37AM +0100, Sascha Hauer wrote:
> On Wed, Nov 23, 2016 at 10:13:02AM -0700, Mathieu Poirier wrote:
> > On Wed, Nov 23, 2016 at 11:01:03AM +0100, Sascha Hauer wrote:
> > > With this patch the serial core provides LED triggers for RX and TX.
> > > 
> > > As the serial core layer does not know when the hardware actually sends
> > > or receives characters, this needs help from the UART drivers. The
> > > LED triggers are registered in uart_add_led_triggers() called from
> > > the UART drivers which want to support LED triggers. All the driver
> > > has to do then is to call uart_led_trigger_[tx|rx] to indicate
> > > activity.
> > 
> > Hello Sascha,
> > 
> > > 
> > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> > > ---
> > >  drivers/tty/serial/serial_core.c | 73 ++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/serial_core.h      | 10 ++++++
> > >  2 files changed, 83 insertions(+)
> > > 
> > > diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
> > > index f2303f3..3e8afb7 100644
> > > --- a/drivers/tty/serial/serial_core.c
> > > +++ b/drivers/tty/serial/serial_core.c
> > > @@ -34,6 +34,7 @@
> > >  #include <linux/serial_core.h>
> > >  #include <linux/delay.h>
> > >  #include <linux/mutex.h>
> > > +#include <linux/leds.h>
> > >  
> > >  #include <asm/irq.h>
> > >  #include <asm/uaccess.h>
> > > @@ -2703,6 +2704,77 @@ static const struct attribute_group tty_dev_attr_group = {
> > >  	.attrs = tty_dev_attrs,
> > >  	};
> > >  
> > > +void uart_led_trigger_tx(struct uart_port *uport)
> > > +{
> > > +	unsigned long delay = 50;
> > > +
> > > +	led_trigger_blink_oneshot(uport->led_trigger_tx, &delay, &delay, 0);
> > > +}
> > > +
> > > +void uart_led_trigger_rx(struct uart_port *uport)
> > > +{
> > > +	unsigned long delay = 50;
> > > +
> > > +	led_trigger_blink_oneshot(uport->led_trigger_rx, &delay, &delay, 0);
> > 
> > For both rx/tx the core constrains the delay_on/off along with the inversion.
> > Instead of adding the led_trigger_rx/tx and led_trigger_rx/tx_name to the 
> > struct uart_port, wouldn't it be better to add a new struct led_trigger that
> > would encapsulate those along with the on/off delay and the inversion?
> > 
> > That way those values could be communicated to the core at registration time
> > instead of hard-coding things.
> 
> Not sure this goes into the right direction. Looking at the other
> callers of led_trigger_blink_oneshot() most of them use an arbitrary
> blink time of 30 or 50ms and the users seem to be happy with it. There
> doesn't seem to be the desire to make that value configurable. If we
> want to make it configurable, it's probably better to move the option
> to the led device rather than putting the burden on every user of the
> led triggers.

So you did find instances where the blink time wasn't 50ms.  To me that's
a valid reason to not hard code the blink time and proceed differently.

> 
> I don't think the inversion flag is meant for being configurable. It's
> rather used for the default state of the LED. The CAN layer for example
> turns the LED on when the interface is up and then blinks (turns off and
> on again) on activity. For this it uses the inversion flag.
> 
> >  
> > > +}
> > > +
> > > +/**
> > > + *	uart_add_led_triggers - register LED triggers for a UART
> > > + *	@drv: pointer to the uart low level driver structure for this port
> > > + *	@uport: uart port structure to use for this port.
> > > + *
> > > + *	Called by the driver to register LED triggers for a UART. It's the
> > > + *	driver's responsibility to call uart_led_trigger_rx/tx on received
> > > + *	and transmitted chars then.
> > > + */
> > > +int uart_add_led_triggers(struct uart_driver *drv, struct uart_port *uport)
> > > +{
> > > +	int ret;
> > > +
> > > +	if (!IS_ENABLED(CONFIG_LEDS_TRIGGERS))
> > > +		return 0;
> > 
> > Since this is a public interface, checking for valid led_trigger_tx/rx before
> > moving on with the rest of the initialisation is probably a good idea.
> 
> What do you mean? Should we return an error when CONFIG_LEDS_TRIGGERS is
> disabled?

        if (!uport->led_trigger_rx || !uport->led_trigger_tx)
                return -EINVAL;


> 
> Sascha
> 
> 
> -- 
> Pengutronix e.K.                           |                             |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply

* [net-next PATCH v1 1/2] net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac
From: Andrew Lunn @ 2016-11-24 15:48 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124143417.10178-2-martin.blumenstingl@googlemail.com>

> The configuration values are provided as preprocessor macros to make the
> devicetree files easier to read.

Hi Martin

If I'm reading the code/comments correctly, you can set the delay to
0, 2, 4 or 6ns? So calling this property amlogic,tx-delay-ns would be
even easier to read.

     Andrew

^ permalink raw reply

* [net-next PATCH v1 0/2] stmmac: dwmac-meson8b: configurable RGMII TX delay
From: Jerome Brunet @ 2016-11-24 15:56 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124143417.10178-1-martin.blumenstingl@googlemail.com>

On Thu, 2016-11-24 at 15:34 +0100, Martin Blumenstingl wrote:
> Currently the dwmac-meson8b stmmac glue driver uses a hardcoded 1/4
> cycle TX clock delay. This seems to work fine for many boards (for
> example Odroid-C2 or Amlogic's reference boards) but there are some
> others where TX traffic is simply broken.
> There are probably multiple reasons why it's working on some boards
> while it's broken on others:
> - some of Amlogic's reference boards are using a Micrel PHY
> - hardware circuit design
> - maybe more...
> 
> This raises a question though:
> Which device is supposed to enable the TX delay when both MAC and PHY
> support it? And should we implement it for each PHY / MAC separately
> or should we think about a more generic solution (currently it's not
> possible to disable the TX delay generated by the RTL8211F PHY via
> devicetree when using phy-mode "rgmii")?

Actually, you can skip the part which activates the TX delay on the PHY
by setting phy-mode = "rgmii-id" instead of "rgmii".

phy->interface will then no longer be PHY_INTERFACE_MODE_RGMII
but PHY_INTERFACE_MODE_RGMII_ID.

> 
> iperf3 results on my Mecool BB2 board (Meson GXM, RTL8211F PHY) with
> TX clock delay disabled on the MAC (as it's enabled in the PHY driver).
> TX throughput was virtually zero before:
> $ iperf3 -c 192.168.1.100 -R
> Connecting to host 192.168.1.100, port 5201
> Reverse mode, remote host 192.168.1.100 is sending
> [  4] local 192.168.1.206 port 52828 connected to 192.168.1.100 port 5201
> [ ID] Interval           Transfer     Bandwidth
> [  4]   0.00-1.00   sec   108 MBytes   901 Mbits/sec
> [  4]   1.00-2.00   sec  94.2 MBytes   791 Mbits/sec
> [  4]   2.00-3.00   sec  96.5 MBytes   810 Mbits/sec
> [  4]   3.00-4.00   sec  96.2 MBytes   808 Mbits/sec
> [  4]   4.00-5.00   sec  96.6 MBytes   810 Mbits/sec
> [  4]   5.00-6.00   sec  96.5 MBytes   810 Mbits/sec
> [  4]   6.00-7.00   sec  96.6 MBytes   810 Mbits/sec
> [  4]   7.00-8.00   sec  96.5 MBytes   809 Mbits/sec
> [  4]   8.00-9.00   sec   105 MBytes   884 Mbits/sec
> [  4]   9.00-10.00  sec   111 MBytes   934 Mbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-10.00  sec  1000 MBytes   839 Mbits/sec    0             sender
> [  4]   0.00-10.00  sec   998 MBytes   837 Mbits/sec                  receiver
> 
> iperf Done.
> $ iperf3 -c 192.168.1.100
> Connecting to host 192.168.1.100, port 5201
> [  4] local 192.168.1.206 port 52832 connected to 192.168.1.100 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.01   sec  99.5 MBytes   829 Mbits/sec  117    139 KBytes
> [  4]   1.01-2.00   sec   105 MBytes   884 Mbits/sec  129   70.7 KBytes
> [  4]   2.00-3.01   sec   107 MBytes   889 Mbits/sec  106    187 KBytes
> [  4]   3.01-4.01   sec   105 MBytes   878 Mbits/sec   92    143 KBytes
> [  4]   4.01-5.00   sec   105 MBytes   882 Mbits/sec  140    129 KBytes
> [  4]   5.00-6.01   sec   106 MBytes   883 Mbits/sec  115    195 KBytes
> [  4]   6.01-7.00   sec   102 MBytes   863 Mbits/sec  133   70.7 KBytes
> [  4]   7.00-8.01   sec   106 MBytes   884 Mbits/sec  143   97.6 KBytes
> [  4]   8.01-9.01   sec   104 MBytes   875 Mbits/sec  124    107 KBytes
> [  4]   9.01-10.01  sec   105 MBytes   876 Mbits/sec   90    139 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-10.01  sec  1.02 GBytes   874 Mbits/sec  1189             sender
> [  4]   0.00-10.01  sec  1.02 GBytes   873 Mbits/sec                  receiver
> 
> iperf Done.

> 
> 
> Martin Blumenstingl (2):
>   net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac
>   net: stmmac: dwmac-meson8b: make the RGMII TX delay configurable
> 
>  Documentation/devicetree/bindings/net/meson-dwmac.txt | 11 +++++++++++
>  drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c   | 16 +++++++++++-----
>  include/dt-bindings/net/dwmac-meson8b.h               | 18 ++++++++++++++++++
>  3 files changed, 40 insertions(+), 5 deletions(-)
>  create mode 100644 include/dt-bindings/net/dwmac-meson8b.h
> 

^ permalink raw reply

* [RFC PATCH net v2 0/3] Fix OdroidC2 Gigabit Tx link issue
From: Jerome Brunet @ 2016-11-24 16:01 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <CAFBinCCexmS_z9FCX-ud5NgGhhP7xJ_cLxpC7TEc=mLAdafosg@mail.gmail.com>

On Thu, 2016-11-24 at 15:40 +0100, Martin Blumenstingl wrote:
> Hi Jerome,
> 
> On Mon, Nov 21, 2016 at 4:35 PM, Jerome Brunet <jbrunet@baylibre.com> wrote:
> > 
> > This patchset fixes an issue with the OdroidC2 board (DWMAC +
> > RTL8211F).
> > Initially reported as a low Tx throughput issue at gigabit speed, the
> > platform enters LPI too often. This eventually breaks the link (both Tx
> > and Rx), and requires bringing the interface down and up again to get
> > the Rx path working again.
> > 
> > The root cause of this issue is not fully understood yet, but disabling
> > EEE advertisement on the PHY prevents this feature from being negotiated.
> > With this change, the link is stable and reliable, with the expected
> > throughput performance.
> I have just sent a series which allows configuring the TX delay on the
> MAC (dwmac-meson8b glue) side: [0]
> Disabling the TX delay generated by the MAC fixes TX throughput for
> me, even when leaving EEE enabled in the RTL8211F PHY driver!
> 
> Unfortunately the RTL8211F PHY is a black-box for the community
> because there is no public datasheet available.
> *maybe* (pure speculation!) they're enabling the TX delay based on
> some internal magic only when EEE is enabled.

I already tried acting on the register setting the TX delay. I also
tried on the PHY. I have never been able to improve the situation on the
Odroid-C2. Only disabling EEE improved it.

To make sure, I tried again with your patch, but the result is
unchanged. With the TX delay disabled (on either the MAC or the PHY), the
situation is even worse: it seems that nothing gets through.

> 
> Jerome, could you please re-test the behavior on your Odroid-C2 when
> you have EEE still enabled but the TX-delay disabled?
> In my case throughput is fine, and "$ ethtool -S eth0 | grep lpi"
> gives:
>     irq_tx_path_in_lpi_mode_n: 0
>     irq_tx_path_exit_lpi_mode_n: 0
>     irq_rx_path_in_lpi_mode_n: 0
>     irq_rx_path_exit_lpi_mode_n: 0
> 

I still get LPI interrupts on my side. I don't see how a properly
configured TX delay would disable EEE. I must be missing something
here.

> 
> Regards,
> Martin
> 
> 
> [0] http://lists.infradead.org/pipermail/linux-amlogic/2016-November/001674.html

^ permalink raw reply

* [PATCH 1/3] of: Support parsing phandle argument lists through a nexus node
From: Pantelis Antoniou @ 2016-11-24 16:05 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124102529.20212-2-stephen.boyd@linaro.org>

Hi Stephen,

> On Nov 24, 2016, at 12:25 , Stephen Boyd <stephen.boyd@linaro.org> wrote:
> 
> Platforms like 96boards have a standardized connector/expansion
> slot that exposes signals like GPIOs to expansion boards in an
> SoC agnostic way. We'd like the DT overlays for the expansion
> boards to be written once without knowledge of the SoC on the
> other side of the connector. This avoids the unscalable
> combinatorial explosion of a different DT overlay for each
> expansion board and SoC pair.
> 
> We need a way to describe the GPIOs routed through the connector
> in an SoC agnostic way. Let's introduce nexus property parsing
> into the OF core to do this. This is largely based on the
> interrupt nexus support we already have. This allows us to remap
> a phandle list in a consumer node (e.g. reset-gpios) through a
> connector in a generic way (e.g. via gpio-map). Do this in a
> generic routine so that we can remap any sort of variable length
> phandle list.
> 
> Taking GPIOs as an example, the connector would be a GPIO nexus,
> supporting the remapping of a GPIO specifier space to multiple
> GPIO providers on the SoC. DT would look as shown below, where
> 'soc_gpio1' and 'soc_gpio2' are inside the SoC, 'connector' is an
> expansion port where boards can be plugged in, and
> 'expansion_device' is a device on the expansion board.
> 
> 	soc {
> 		soc_gpio1: gpio-controller1 {
> 			#gpio-cells = <2>;
> 		};
> 
> 		soc_gpio2: gpio-controller2 {
> 			#gpio-cells = <2>;
> 		};
> 	};
> 
> 	connector: connector {
> 		#gpio-cells = <2>;
> 		gpio-map = <0 GPIO_ACTIVE_LOW &soc_gpio1 1 GPIO_ACTIVE_LOW>,
> 			   <1 GPIO_ACTIVE_LOW &soc_gpio2 4 GPIO_ACTIVE_LOW>,
> 			   <2 GPIO_ACTIVE_LOW &soc_gpio1 3 GPIO_ACTIVE_LOW>,
> 			   <3 GPIO_ACTIVE_LOW &soc_gpio2 2 GPIO_ACTIVE_LOW>;
> 		gpio-map-mask = <0xf 0x1>;
> 	};
> 
> 	expansion_device {
> 		reset-gpios = <&connector 2 GPIO_ACTIVE_LOW>;
> 	};
> 
> The GPIO core would use of_parse_phandle_with_args_map() instead
> of of_parse_phandle_with_args() and arrive at the same type of
> result, a phandle and argument list. The difference is that the
> phandle and arguments will be remapped through the nexus node to
> the underlying SoC GPIO controller node. In the example above,
> we would remap 'reset-gpios' from <&connector 2 GPIO_ACTIVE_LOW>
> to <&soc_gpio1 3 GPIO_ACTIVE_LOW>.
> 

Very good. My only suggestion would be to elaborate a bit more in the
documentation on how the phandles pointed at may have different #list-cells
values, and on how the lookup is performed in steps.
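A single lookup step through the connector from the commit message can be modelled in userspace C. This is a sketch, not the OF core code: phandles are replaced by strings, and the struct and function names are hypothetical. The match test mirrors the expression used in the patch, `!((match_array[i] ^ *map) & mask[i])`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GPIO_ACTIVE_LOW 1	/* value from include/dt-bindings/gpio/gpio.h */

/* One gpio-map row: <child specifier> <parent phandle> <parent specifier> */
struct gpio_map_row {
	uint32_t child[2];		/* connector pin, flags */
	const char *parent;		/* stands in for the parent phandle */
	uint32_t parent_spec[2];	/* SoC controller pin, flags */
};

/* The connector node from the commit message, as plain data */
static const struct gpio_map_row connector_map[] = {
	{ { 0, GPIO_ACTIVE_LOW }, "soc_gpio1", { 1, GPIO_ACTIVE_LOW } },
	{ { 1, GPIO_ACTIVE_LOW }, "soc_gpio2", { 4, GPIO_ACTIVE_LOW } },
	{ { 2, GPIO_ACTIVE_LOW }, "soc_gpio1", { 3, GPIO_ACTIVE_LOW } },
	{ { 3, GPIO_ACTIVE_LOW }, "soc_gpio2", { 2, GPIO_ACTIVE_LOW } },
};
static const uint32_t connector_mask[2] = { 0xf, 0x1 };	/* gpio-map-mask */

/*
 * One translation step: a row matches when every cell of the consumer's
 * specifier equals the row's child cell in the bits selected by the mask.
 * Returns the first matching row, or NULL if the lookup fails.
 */
static const struct gpio_map_row *
nexus_lookup(const struct gpio_map_row *map, size_t rows,
	     const uint32_t mask[2], const uint32_t spec[2])
{
	assert(map != NULL && spec != NULL);
	for (size_t r = 0; r < rows; r++) {
		int match = 1;

		for (size_t c = 0; c < 2; c++)
			match &= !((spec[c] ^ map[r].child[c]) & mask[c]);
		if (match)
			return &map[r];
	}
	return NULL;
}
```

Looking up `reset-gpios = <&connector 2 GPIO_ACTIVE_LOW>` returns the row for `&soc_gpio1 3 GPIO_ACTIVE_LOW`, as in the commit message. Because the mask keeps bit 0 of the flags cell, `<&connector 2 0>` (active high) matches no row in this table.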

> Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
> Cc: Linus Walleij <linus.walleij@linaro.org>
> Cc: Mark Brown <broonie@kernel.org>
> Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org>
> ---
> drivers/of/base.c  | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/of.h |  14 +++++
> 2 files changed, 160 insertions(+)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index d687e6de24a0..693b73f33675 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -1772,6 +1772,152 @@ int of_parse_phandle_with_args(const struct device_node *np, const char *list_na
> EXPORT_SYMBOL(of_parse_phandle_with_args);
> 
> /**
> + * of_parse_phandle_with_args_map() - Find a node pointed by phandle in a list and remap it
> + * @np:		pointer to a device tree node containing a list
> + * @list_name:	property name that contains a list
> + * @cells_name:	property name that specifies phandles' arguments count
> + * @index:	index of a phandle to parse out
> + * @out_args:	optional pointer to output arguments structure (will be filled)
> + *
> + * This function is useful to parse lists of phandles and their arguments.
> + * Returns 0 on success and fills out_args, on error returns appropriate
> + * errno value.
> + *
> + * Caller is responsible to call of_node_put() on the returned out_args->np
> + * pointer.
> + *
> + * Example:
> + *
> + * phandle1: node1 {
> + *	#list-cells = <2>;
> + * }
> + *
> + * phandle2: node2 {
> + *	#list-cells = <1>;
> + * }
> + *
> + * phandle3: node3 {
> + * 	#list-cells = <1>;
> + * 	list-map = <0 &phandle2 3>,
> + * 		   <1 &phandle2 2>,
> + * 		   <2 &phandle1 5 1>;
> + *	list-map-mask = <0x3>;
> + * };
> + *
> + * node4 {
> + *	list = <&phandle1 1 2 &phandle3 0>;
> + * }
> + *
> + * To get a device_node of the `node2' node you may call this:
> + * of_parse_phandle_with_args_map(node4, "list", "#list-cells", "list-map",
> + * 			      "list-map-mask", 1, &args);
> + */
> +int of_parse_phandle_with_args_map(const struct device_node *np,
> +				   const char *list_name,
> +				   const char *cells_name,
> +				   const char *map_name,
> +				   const char *mask_name,
> +				   int index, struct of_phandle_args *out_args)
> +{
> +	struct device_node *cur, *new = NULL;
> +	const __be32 *map, *mask, *tmp;
> +	const __be32 dummy_mask[] = { [0 ... MAX_PHANDLE_ARGS] = ~0 };
> +	__be32 initial_match_array[MAX_PHANDLE_ARGS];
> +	const __be32 *match_array = initial_match_array;
> +	int i, ret, map_len, match;
> +	u32 list_size, new_size;
> +
> +	if (index < 0)
> +		return -EINVAL;
> +
> +	ret = __of_parse_phandle_with_args(np, list_name, cells_name, 0, index,
> +					   out_args);
> +	if (ret)
> +		return ret;
> +
> +	/* Get the #<list>-cells property */
> +	cur = out_args->np;
> +	ret = of_property_read_u32(cur, cells_name, &list_size);
> +	if (ret < 0)
> +		goto fail;
> +
> +	/* Precalculate the match array - this simplifies match loop */
> +	for (i = 0; i < list_size; i++)
> +		initial_match_array[i] = cpu_to_be32(out_args->args[i]);
> +
> +	while (cur) {
> +		/* Get the <list>-map property */
> +		map = of_get_property(cur, map_name, &map_len);
> +		if (!map)
> +			return 0;
> +		map_len /= sizeof(u32);
> +
> +		/* Get the <list>-map-mask property (optional) */
> +		mask = of_get_property(cur, mask_name, NULL);
> +		if (!mask)
> +			mask = dummy_mask;
> +
> +		/* Iterate through <list>-map property */
> +		match = 0;
> +		while (map_len > (list_size + 1) && !match) {
> +			/* Compare specifiers */
> +			match = 1;
> +			for (i = 0; i < list_size; i++, map_len--)
> +				match &= !((match_array[i] ^ *map++) & mask[i]);
> +
> +			of_node_put(new);
> +			new = of_find_node_by_phandle(be32_to_cpup(map));
> +			map++;
> +			map_len--;
> +
> +			/* Check if not found */
> +			if (!new)
> +				goto fail;
> +
> +			if (!of_device_is_available(new))
> +				match = 0;
> +
> +			tmp = of_get_property(new, cells_name, NULL);
> +			if (!tmp)
> +				goto fail;
> +
> +			new_size = be32_to_cpu(*tmp);
> +
> +			/* Check for malformed properties */
> +			if (WARN_ON(new_size > MAX_PHANDLE_ARGS))
> +				goto fail;
> +			if (map_len < new_size)
> +				goto fail;
> +
> +			/* Move forward by new node's #<list>-cells amount */
> +			map += new_size;
> +			map_len -= new_size;
> +		}
> +		if (!match)
> +			goto fail;
> +
> +		/*
> +		 * Successfully parsed a <list>-map translation; copy new
> +		 * specifier into the out_args structure.
> +		 */
> +		match_array = map - new_size;
> +		for (i = 0; i < new_size; i++)
> +			out_args->args[i] = be32_to_cpup(map - new_size + i);
> +		out_args->args_count = list_size = new_size;
> +		/* Iterate again with new provider */
> +		out_args->np = new;
> +		of_node_put(cur);
> +		cur = new;
> +	}
> +fail:
> +	of_node_put(cur);
> +	of_node_put(new);
> +
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL(of_parse_phandle_with_args_map);
> +
> +/**
>  * of_parse_phandle_with_fixed_args() - Find a node pointed by phandle in a list
>  * @np:		pointer to a device tree node containing a list
>  * @list_name:	property name that contains a list
> diff --git a/include/linux/of.h b/include/linux/of.h
> index d3a9c2e69001..65ff306403a2 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -344,6 +344,9 @@ extern struct device_node *of_parse_phandle(const struct device_node *np,
> extern int of_parse_phandle_with_args(const struct device_node *np,
> 	const char *list_name, const char *cells_name, int index,
> 	struct of_phandle_args *out_args);
> +extern int of_parse_phandle_with_args_map(const struct device_node *np,
> +	const char *list_name, const char *cells_name, const char *map_name,
> +	const char *mask_name, int index, struct of_phandle_args *out_args);
> extern int of_parse_phandle_with_fixed_args(const struct device_node *np,
> 	const char *list_name, int cells_count, int index,
> 	struct of_phandle_args *out_args);
> @@ -738,6 +741,17 @@ static inline int of_parse_phandle_with_args(const struct device_node *np,
> 	return -ENOSYS;
> }
> 
> +static inline int of_parse_phandle_with_args_map(const struct device_node *np,
> +						 const char *list_name,
> +						 const char *cells_name,
> +						 const char *map_name,
> +						 const char *mask_name,
> +						 int index,
> +						 struct of_phandle_args *out_args)
> +{
> +	return -ENOSYS;
> +}
> +
> static inline int of_parse_phandle_with_fixed_args(const struct device_node *np,
> 	const char *list_name, int cells_count, int index,
> 	struct of_phandle_args *out_args)
> -- 
> 2.10.0.297.gf6727b0
> 
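The docstring example can be traced with a small userspace model that keeps translating until it reaches a node without a `list-map`, copying as many cells as the new provider's `#list-cells` says. Entry 1 of node4's `list` is `&phandle3 0`, which ends at node2 with argument 3; argument 2 would instead go through `<2 &phandle1 5 1>` and come out with node1's two cells. Everything here (struct layouts, `resolve()`, `MAX_ARGS`) is hypothetical, and the model omits reference counting, the status check, the malformed-property checks, and any guard against a map that points back to itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ARGS 16	/* stands in for MAX_PHANDLE_ARGS */

struct node;

/* One list-map row: <child specifier> <parent> <parent specifier> */
struct map_row {
	uint32_t child[MAX_ARGS];
	const struct node *parent;
	uint32_t parent_spec[MAX_ARGS];
};

struct node {
	unsigned int cells;		/* #list-cells */
	const struct map_row *map;	/* list-map, NULL if not a nexus */
	size_t rows;
	uint32_t mask[MAX_ARGS];	/* list-map-mask */
};

/* A resolved reference: provider node plus its argument cells */
struct spec {
	const struct node *np;
	unsigned int count;
	uint32_t args[MAX_ARGS];
};

/* The nodes from the docstring example */
static const struct node node1 = { .cells = 2 };
static const struct node node2 = { .cells = 1 };
static const struct map_row node3_map[] = {
	{ { 0 }, &node2, { 3 } },
	{ { 1 }, &node2, { 2 } },
	{ { 2 }, &node1, { 5, 1 } },
};
static const struct node node3 = {
	.cells = 1, .map = node3_map, .rows = 3, .mask = { 0x3 },
};

/*
 * Keep translating while the current provider has a list-map. The number
 * of cells copied comes from the new provider's own #list-cells.
 * Returns 0 on success or -1 when a nexus has no matching row.
 */
static int resolve(struct spec *s)
{
	while (s->np->map) {
		const struct node *cur = s->np;
		const struct map_row *hit = NULL;

		for (size_t r = 0; r < cur->rows && !hit; r++) {
			int match = 1;

			for (unsigned int i = 0; i < cur->cells; i++)
				match &= !((s->args[i] ^ cur->map[r].child[i]) &
					   cur->mask[i]);
			if (match)
				hit = &cur->map[r];
		}
		if (!hit)
			return -1;

		s->np = hit->parent;
		s->count = hit->parent->cells;
		assert(s->count <= MAX_ARGS);
		for (unsigned int i = 0; i < s->count; i++)
			s->args[i] = hit->parent_spec[i];
	}
	return 0;
}
```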

^ permalink raw reply

* [net-next PATCH v1 0/2] stmmac: dwmac-meson8b: configurable RGMII TX delay
From: Jerome Brunet @ 2016-11-24 16:08 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161124143417.10178-1-martin.blumenstingl@googlemail.com>

On Thu, 2016-11-24 at 15:34 +0100, Martin Blumenstingl wrote:
> Currently the dwmac-meson8b stmmac glue driver uses a hardcoded 1/4
> cycle TX clock delay. This seems to work fine for many boards (for
> example Odroid-C2 or Amlogic's reference boards) but there are some
> others where TX traffic is simply broken.
> There are probably multiple reasons why it's working on some boards
> while it's broken on others:
> - some of Amlogic's reference boards are using a Micrel PHY
> - hardware circuit design
> - maybe more...
> 
> This raises a question though:
> Which device is supposed to enable the TX delay when both MAC and PHY
> support it? And should we implement it for each PHY / MAC separately
> or should we think about a more generic solution (currently it's not
> possible to disable the TX delay generated by the RTL8211F PHY via
> devicetree when using phy-mode "rgmii")?
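One way to reason about who should add the delay: to first order the delays inserted on the TX clock path add up, so enabling both is not equivalent to enabling either one. A sketch of that arithmetic at 1000 Mbit/s; the quarter-cycle MAC field (0..3), the ~2 ns PHY delay used in the example, and the function name are assumptions for illustration, not values taken from these patches:

```c
#include <assert.h>

/* At 1000 Mbit/s the RGMII clock is 125 MHz, i.e. 8000 ps per cycle */
#define RGMII_1G_PERIOD_PS 8000u

/*
 * Simplified additive model of the TX clock-to-data skew at the PHY:
 * MAC delay line (assumed: 0..3 quarter cycles) + PHY internal delay
 * (0 if disabled) + any extra clock trace delay on the board.
 */
static unsigned int rgmii_tx_skew_ps(unsigned int mac_quarters,
				     unsigned int phy_delay_ps,
				     unsigned int board_ps)
{
	assert(mac_quarters <= 3);	/* assumed 2-bit register field */
	return mac_quarters * (RGMII_1G_PERIOD_PS / 4) + phy_delay_ps + board_ps;
}
```

With the hardcoded quarter cycle on the MAC plus an assumed 2 ns from the PHY, the model gives 4000 ps. Since RGMII transfers data on both clock edges, each nibble is valid for at most 4 ns at 1000 Mbit/s, so a 4 ns shift would put the sampling edge near a data transition instead of near the middle of the valid window.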
> 
> iperf3 results on my Mecool BB2 board (Meson GXM, RTL8211F PHY) with
> TX clock delay disabled on the MAC (as it's enabled in the PHY
> driver).
> TX throughput was virtually zero before:
> $ iperf3 -c 192.168.1.100 -R
> Connecting to host 192.168.1.100, port 5201
> Reverse mode, remote host 192.168.1.100 is sending
> [  4] local 192.168.1.206 port 52828 connected to 192.168.1.100 port 5201
> [ ID] Interval           Transfer     Bandwidth
> [  4]   0.00-1.00   sec   108 MBytes   901 Mbits/sec
> [  4]   1.00-2.00   sec  94.2 MBytes   791 Mbits/sec
> [  4]   2.00-3.00   sec  96.5 MBytes   810 Mbits/sec
> [  4]   3.00-4.00   sec  96.2 MBytes   808 Mbits/sec
> [  4]   4.00-5.00   sec  96.6 MBytes   810 Mbits/sec
> [  4]   5.00-6.00   sec  96.5 MBytes   810 Mbits/sec
> [  4]   6.00-7.00   sec  96.6 MBytes   810 Mbits/sec
> [  4]   7.00-8.00   sec  96.5 MBytes   809 Mbits/sec
> [  4]   8.00-9.00   sec   105 MBytes   884 Mbits/sec
> [  4]   9.00-10.00  sec   111 MBytes   934 Mbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-10.00  sec  1000 MBytes   839 Mbits/sec    0             sender
> [  4]   0.00-10.00  sec   998 MBytes   837 Mbits/sec                  receiver
> 
> iperf Done.
> $ iperf3 -c 192.168.1.100
> Connecting to host 192.168.1.100, port 5201
> [  4] local 192.168.1.206 port 52832 connected to 192.168.1.100 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.01   sec  99.5 MBytes   829 Mbits/sec  117    139 KBytes
> [  4]   1.01-2.00   sec   105 MBytes   884 Mbits/sec  129   70.7 KBytes
> [  4]   2.00-3.01   sec   107 MBytes   889 Mbits/sec  106    187 KBytes
> [  4]   3.01-4.01   sec   105 MBytes   878 Mbits/sec   92    143 KBytes
> [  4]   4.01-5.00   sec   105 MBytes   882 Mbits/sec  140    129 KBytes
> [  4]   5.00-6.01   sec   106 MBytes   883 Mbits/sec  115    195 KBytes
> [  4]   6.01-7.00   sec   102 MBytes   863 Mbits/sec  133   70.7 KBytes
> [  4]   7.00-8.01   sec   106 MBytes   884 Mbits/sec  143   97.6 KBytes
> [  4]   8.01-9.01   sec   104 MBytes   875 Mbits/sec  124    107 KBytes
> [  4]   9.01-10.01  sec   105 MBytes   876 Mbits/sec   90    139 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-10.01  sec  1.02 GBytes   874 Mbits/sec  1189             sender
> [  4]   0.00-10.01  sec  1.02 GBytes   873 Mbits/sec                  receiver
> 

Cool, one more board working ;)
I tried your patch on another board (MXQ_V2.1, with sheep printed on
the PCB). Unfortunately it does not improve the situation on this one.
Actually, I had already tried playing with the TX delay on the MAC and
PHY, but I could not get any good results with the boards I have.

It is strange that we can adjust the delay in 2ns steps, when the delay
seen by the PHY should be 2ns ...
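The 2 ns granularity follows from the clock rate if the MAC delay is expressed in quarter cycles of the TX clock, as the hardcoded 1/4-cycle default suggests: each step is 2 ns only at 1000 Mbit/s, where the RGMII clock runs at 125 MHz. A small C sketch of that arithmetic; the quarter-cycle interpretation is an assumption and the function name is hypothetical:

```c
#include <assert.h>

/*
 * Convert a delay given in quarter cycles of the RGMII TX clock into
 * picoseconds. The RGMII clock is 125 MHz at 1000 Mbit/s, 25 MHz at
 * 100 Mbit/s and 2.5 MHz at 10 Mbit/s.
 */
static unsigned long quarter_cycles_to_ps(unsigned int quarters,
					  unsigned int speed_mbps)
{
	unsigned long clk_khz;

	switch (speed_mbps) {
	case 1000: clk_khz = 125000; break;
	case 100:  clk_khz = 25000;  break;
	case 10:   clk_khz = 2500;   break;
	default:   assert(!"unsupported RGMII speed"); return 0;
	}
	/* One period is 1e9 / clk_khz picoseconds; each step is a quarter */
	return quarters * (1000000000UL / clk_khz) / 4;
}
```

At 1000 Mbit/s one step is 2000 ps, so a single step already equals the ~2 ns the PHY expects, and the larger settings overshoot it.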

> iperf Done.
> 
> 
> Martin Blumenstingl (2):
>   net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac
>   net: stmmac: dwmac-meson8b: make the RGMII TX delay configurable
> 
>  Documentation/devicetree/bindings/net/meson-dwmac.txt | 11 +++++++++++
>  drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c   | 16 +++++++++++-----
>  include/dt-bindings/net/dwmac-meson8b.h               | 18 ++++++++++++++++++
>  3 files changed, 40 insertions(+), 5 deletions(-)
>  create mode 100644 include/dt-bindings/net/dwmac-meson8b.h
> 

^ permalink raw reply

* [PATCH RESEND 2/2] gpio: axp209: add pinctrl support
From: Chen-Yu Tsai @ 2016-11-24 16:08 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161123141151.25315-3-quentin.schulz@free-electrons.com>

On Wed, Nov 23, 2016 at 10:11 PM, Quentin Schulz
<quentin.schulz@free-electrons.com> wrote:
> The GPIOs present in the AXP209 PMIC have multiple functions. They
> typically allow a pin to be used as GPIO input or output and can also be
> used as ADC or regulator for example.[1]
>
> This adds the possibility to use all functions of the GPIOs present in
> the AXP209 PMIC thanks to pinctrl subsystem.
>
> [1] see registers 90H, 92H and 93H at
>     http://dl.linux-sunxi.org/AXP/AXP209_Datasheet_v1.0en.pdf
>
> Signed-off-by: Quentin Schulz <quentin.schulz@free-electrons.com>
> ---
>  .../devicetree/bindings/gpio/gpio-axp209.txt       |  28 +-
>  drivers/gpio/gpio-axp209.c                         | 551 ++++++++++++++++++---
>  2 files changed, 503 insertions(+), 76 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/gpio/gpio-axp209.txt b/Documentation/devicetree/bindings/gpio/gpio-axp209.txt
> index a661130..a5bfe87 100644
> --- a/Documentation/devicetree/bindings/gpio/gpio-axp209.txt
> +++ b/Documentation/devicetree/bindings/gpio/gpio-axp209.txt
> @@ -1,4 +1,4 @@
> -AXP209 GPIO controller
> +AXP209 GPIO & pinctrl controller
>
>  This driver follows the usual GPIO bindings found in
>  Documentation/devicetree/bindings/gpio/gpio.txt
> @@ -28,3 +28,29 @@ axp209: pmic@34 {
>                 #gpio-cells = <2>;
>         };
>  };
> +
> +The GPIOs can be muxed to other functions and therefore, must be a subnode of
> +axp_gpio.
> +
> +Example:
> +
> +&axp_gpio {
> +       gpio0_adc: gpio0_adc {
> +               pin = "GPIO0";
> +               function = "adc";
> +       };
> +};
> +
> +&example_node {
> +       pinctrl-names = "default";
> +       pinctrl-0 = <&gpio0_adc>;
> +};
> +
> +GPIOs and their functions
> +-------------------------
> +
> +GPIO   |       Functions
> +------------------------
> +GPIO0  |       gpio_in, gpio_out, ldo, adc
> +GPIO1  |       gpio_in, gpio_out, ldo, adc
> +GPIO2  |       gpio_in, gpio_out
> diff --git a/drivers/gpio/gpio-axp209.c b/drivers/gpio/gpio-axp209.c
> index 4a346b7..0a64cfc 100644
> --- a/drivers/gpio/gpio-axp209.c
> +++ b/drivers/gpio/gpio-axp209.c
> @@ -1,7 +1,8 @@
>  /*
> - * AXP20x GPIO driver
> + * AXP20x Pin control driver
>   *
>   * Copyright (C) 2016 Maxime Ripard <maxime.ripard@free-electrons.com>
> + * Copyright (C) 2016 Quentin Schulz <quentin.schulz@free-electrons.com>
>   *
>   * This program is free software; you can redistribute it and/or modify it
>   * under  the terms of the GNU General  Public License as published by the
> @@ -21,52 +22,103 @@
>  #include <linux/platform_device.h>
>  #include <linux/regmap.h>
>  #include <linux/slab.h>
> +#include <linux/pinctrl/pinctrl.h>
> +#include <linux/pinctrl/pinmux.h>
> +#include <linux/pinctrl/pinconf-generic.h>
>
>  #define AXP20X_GPIO_FUNCTIONS          0x7
>  #define AXP20X_GPIO_FUNCTION_OUT_LOW   0
>  #define AXP20X_GPIO_FUNCTION_OUT_HIGH  1
>  #define AXP20X_GPIO_FUNCTION_INPUT     2
>
> -struct axp20x_gpio {
> -       struct gpio_chip        chip;
> -       struct regmap           *regmap;
> -};
> +#define AXP20X_PINCTRL_PIN(_pin_num, _pin, _regs)              \
> +       {                                                       \
> +               .number = _pin_num,                             \
> +               .name = _pin,                                   \
> +               .drv_data = _regs,                              \
> +       }
>
> -static int axp20x_gpio_get_reg(unsigned offset)
> -{
> -       switch (offset) {
> -       case 0:
> -               return AXP20X_GPIO0_CTRL;
> -       case 1:
> -               return AXP20X_GPIO1_CTRL;
> -       case 2:
> -               return AXP20X_GPIO2_CTRL;
> +#define AXP20X_PIN(_pin, ...)                                  \
> +       {                                                       \
> +               .pin = _pin,                                    \
> +               .functions = (struct axp20x_desc_function[]) {  \
> +                             __VA_ARGS__, { } },               \
>         }
>
> -       return -EINVAL;
> -}
> +#define AXP20X_FUNCTION(_val, _name)                           \
> +       {                                                       \
> +               .name = _name,                                  \
> +               .muxval = _val,                                 \
> +       }
>
> -static int axp20x_gpio_input(struct gpio_chip *chip, unsigned offset)
> -{
> -       struct axp20x_gpio *gpio = gpiochip_get_data(chip);
> -       int reg;
> +struct axp20x_desc_function {
> +       const char      *name;
> +       u8              muxval;
> +};
>
> -       reg = axp20x_gpio_get_reg(offset);
> -       if (reg < 0)
> -               return reg;
> +struct axp20x_desc_pin {
> +       struct pinctrl_pin_desc         pin;
> +       struct axp20x_desc_function     *functions;
> +};
>
> -       return regmap_update_bits(gpio->regmap, reg,
> -                                 AXP20X_GPIO_FUNCTIONS,
> -                                 AXP20X_GPIO_FUNCTION_INPUT);
> -}
> +struct axp20x_pinctrl_desc {
> +       const struct axp20x_desc_pin    *pins;
> +       int                             npins;
> +       unsigned int                    pin_base;

You do not need pin_base.

> +};
> +
> +struct axp20x_pinctrl_function {
> +       const char      *name;
> +       const char      **groups;
> +       unsigned int    ngroups;
> +};
> +
> +struct axp20x_pinctrl_group {
> +       const char      *name;
> +       unsigned long   config;
> +       unsigned int    pin;
> +};
> +
> +struct axp20x_pctl {
> +       struct pinctrl_dev                      *pctl_dev;
> +       struct device                           *dev;
> +       struct gpio_chip                        chip;
> +       struct regmap                           *regmap;
> +       const struct axp20x_pinctrl_desc        *desc;
> +       struct axp20x_pinctrl_group             *groups;
> +       unsigned int                            ngroups;
> +       struct axp20x_pinctrl_function          *functions;
> +       unsigned int                            nfunctions;
> +};
> +
> +static const struct axp20x_desc_pin axp209_pins[] = {
> +       AXP20X_PIN(AXP20X_PINCTRL_PIN(0, "GPIO0", (void *)AXP20X_GPIO0_CTRL),
> +                  AXP20X_FUNCTION(0x0, "gpio_out"),
> +                  AXP20X_FUNCTION(0x2, "gpio_in"),
> +                  AXP20X_FUNCTION(0x3, "ldo"),
> +                  AXP20X_FUNCTION(0x4, "adc")),
> +       AXP20X_PIN(AXP20X_PINCTRL_PIN(1, "GPIO1", (void *)AXP20X_GPIO1_CTRL),
> +                  AXP20X_FUNCTION(0x0, "gpio_out"),
> +                  AXP20X_FUNCTION(0x2, "gpio_in"),
> +                  AXP20X_FUNCTION(0x3, "ldo"),
> +                  AXP20X_FUNCTION(0x4, "adc")),
> +       AXP20X_PIN(AXP20X_PINCTRL_PIN(2, "GPIO2", (void *)AXP20X_GPIO2_CTRL),
> +                  AXP20X_FUNCTION(0x0, "gpio_out"),
> +                  AXP20X_FUNCTION(0x2, "gpio_in")),
> +};
> +
> +static const struct axp20x_pinctrl_desc axp20x_pinctrl_data = {
> +       .pins   = axp209_pins,
> +       .npins  = ARRAY_SIZE(axp209_pins),
> +};
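The `AXP20X_PIN()` macro relies on a variadic macro that expands into a compound literal whose last element is an empty entry, so lookups can walk `functions` until `name` is NULL. A standalone C sketch of the same pattern and of the name-based lookup; the simplified structs and `find_muxval()` are hypothetical, and the mux values are copied from `axp209_pins` above. The sentinel is spelled `{ NULL, 0 }` because an empty `{ }` initializer is a GNU extension before C23:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct desc_function {
	const char *name;
	unsigned char muxval;
};

struct desc_pin {
	const char *name;
	const struct desc_function *functions;	/* ends with a NULL name */
};

#define FUNCTION(_val, _name)	{ .name = (_name), .muxval = (_val) }
/* __VA_ARGS__ expands to the pin's functions, followed by the sentinel */
#define PIN(_name, ...)							\
	{ .name = (_name),						\
	  .functions = (const struct desc_function[]) { __VA_ARGS__, { NULL, 0 } } }

static const struct desc_pin pins[] = {
	PIN("GPIO0", FUNCTION(0x0, "gpio_out"), FUNCTION(0x2, "gpio_in"),
		     FUNCTION(0x3, "ldo"), FUNCTION(0x4, "adc")),
	PIN("GPIO1", FUNCTION(0x0, "gpio_out"), FUNCTION(0x2, "gpio_in"),
		     FUNCTION(0x3, "ldo"), FUNCTION(0x4, "adc")),
	PIN("GPIO2", FUNCTION(0x0, "gpio_out"), FUNCTION(0x2, "gpio_in")),
};

/* Return the mux value for (pin, function), or -1 if the pin lacks it. */
static int find_muxval(const char *pin, const char *func)
{
	assert(pin != NULL && func != NULL);
	for (size_t i = 0; i < sizeof(pins) / sizeof(pins[0]); i++) {
		if (strcmp(pins[i].name, pin))
			continue;
		for (const struct desc_function *f = pins[i].functions; f->name; f++)
			if (!strcmp(f->name, func))
				return f->muxval;
		return -1;	/* pin names are unique: stop at the first hit */
	}
	return -1;
}
```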
>
>  static int axp20x_gpio_get(struct gpio_chip *chip, unsigned offset)
>  {
> -       struct axp20x_gpio *gpio = gpiochip_get_data(chip);
> +       struct axp20x_pctl *pctl = gpiochip_get_data(chip);
>         unsigned int val;
>         int ret;
>
> -       ret = regmap_read(gpio->regmap, AXP20X_GPIO20_SS, &val);
> +       ret = regmap_read(pctl->regmap, AXP20X_GPIO20_SS, &val);
>         if (ret)
>                 return ret;
>
> @@ -75,15 +127,12 @@ static int axp20x_gpio_get(struct gpio_chip *chip, unsigned offset)
>
>  static int axp20x_gpio_get_direction(struct gpio_chip *chip, unsigned offset)
>  {
> -       struct axp20x_gpio *gpio = gpiochip_get_data(chip);
> +       struct axp20x_pctl *pctl = gpiochip_get_data(chip);
> +       int pin_reg = (int)pctl->desc->pins[offset].pin.drv_data;
>         unsigned int val;
> -       int reg, ret;
> -
> -       reg = axp20x_gpio_get_reg(offset);
> -       if (reg < 0)
> -               return reg;
> +       int ret;
>
> -       ret = regmap_read(gpio->regmap, reg, &val);
> +       ret = regmap_read(pctl->regmap, pin_reg, &val);
>         if (ret)
>                 return ret;
>
> @@ -102,33 +151,335 @@ static int axp20x_gpio_get_direction(struct gpio_chip *chip, unsigned offset)
>         return val & 2;
>  }
>
> -static int axp20x_gpio_output(struct gpio_chip *chip, unsigned offset,
> +static void axp20x_gpio_set(struct gpio_chip *chip, unsigned int offset,
> +                           int value)
> +{
> +       struct axp20x_pctl *pctl = gpiochip_get_data(chip);
> +       int pin_reg = (int)pctl->desc->pins[offset].pin.drv_data;
> +
> +       regmap_update_bits(pctl->regmap, pin_reg,
> +                          AXP20X_GPIO_FUNCTIONS,
> +                          value ? AXP20X_GPIO_FUNCTION_OUT_HIGH
> +                                : AXP20X_GPIO_FUNCTION_OUT_LOW);
> +}
> +
> +static int axp20x_gpio_input(struct gpio_chip *chip, unsigned int offset)
> +{
> +       return pinctrl_gpio_direction_input(chip->base + offset);
> +}
> +
> +static int axp20x_gpio_output(struct gpio_chip *chip, unsigned int offset,
>                               int value)
>  {
> -       struct axp20x_gpio *gpio = gpiochip_get_data(chip);
> -       int reg;
> +       chip->set(chip, offset, value);
>
> -       reg = axp20x_gpio_get_reg(offset);
> -       if (reg < 0)
> -               return reg;
> +       return 0;
> +}
>
> -       return regmap_update_bits(gpio->regmap, reg,
> -                                 AXP20X_GPIO_FUNCTIONS,
> -                                 value ? AXP20X_GPIO_FUNCTION_OUT_HIGH
> -                                 : AXP20X_GPIO_FUNCTION_OUT_LOW);
> +static int axp20x_pmx_set(struct pinctrl_dev *pctldev, unsigned int offset,
> +                         u8 config)
> +{
> +       struct axp20x_pctl *pctl = pinctrl_dev_get_drvdata(pctldev);
> +       int pin_reg = (int)pctl->desc->pins[offset].pin.drv_data;
> +
> +       return regmap_update_bits(pctl->regmap, pin_reg, AXP20X_GPIO_FUNCTIONS,
> +                                 config);
>  }
>
> -static void axp20x_gpio_set(struct gpio_chip *chip, unsigned offset,
> -                           int value)
> +static int axp20x_pmx_func_cnt(struct pinctrl_dev *pctldev)
> +{
> +       struct axp20x_pctl *pctl = pinctrl_dev_get_drvdata(pctldev);
> +
> +       return pctl->nfunctions;
> +}
> +
> +static const char *axp20x_pmx_func_name(struct pinctrl_dev *pctldev,
> +                                       unsigned int selector)
> +{
> +       struct axp20x_pctl *pctl = pinctrl_dev_get_drvdata(pctldev);
> +
> +       return pctl->functions[selector].name;
> +}
> +
> +static int axp20x_pmx_func_groups(struct pinctrl_dev *pctldev,
> +                                 unsigned int selector,
> +                                 const char * const **groups,
> +                                 unsigned int *num_groups)
> +{
> +       struct axp20x_pctl *pctl = pinctrl_dev_get_drvdata(pctldev);
> +
> +       *groups = pctl->functions[selector].groups;
> +       *num_groups = pctl->functions[selector].ngroups;
> +
> +       return 0;
> +}
> +
> +static struct axp20x_desc_function *
> +axp20x_pinctrl_desc_find_func_by_name(struct axp20x_pctl *pctl,
> +                                     const char *group, const char *func)
> +{
> +       const struct axp20x_desc_pin *pin;
> +       struct axp20x_desc_function *desc_func;
> +       int i;
> +
> +       for (i = 0; i < pctl->desc->npins; i++) {
> +               pin = &pctl->desc->pins[i];
> +
> +               if (!strcmp(pin->pin.name, group)) {
> +                       desc_func = pin->functions;
> +
> +                       while (desc_func->name) {
> +                               if (!strcmp(desc_func->name, func))
> +                                       return desc_func;
> +                               desc_func++;
> +                       }
> +
> +                       /*
> +                        * Pins are uniquely named. Groups are named after one
> +                        * pin name. If one pin matches group name but its
> +                        * function cannot be found, no other pin will match
> +                        * group name.
> +                        */
> +                       return NULL;
> +               }
> +       }
> +
> +       return NULL;
> +}
> +
> +static int axp20x_pmx_set_mux(struct pinctrl_dev *pctldev,
> +                             unsigned int function, unsigned int group)
> +{
> +       struct axp20x_pctl *pctl = pinctrl_dev_get_drvdata(pctldev);
> +       struct axp20x_pinctrl_group *g = pctl->groups + group;
> +       struct axp20x_pinctrl_function *func = pctl->functions + function;
> +       struct axp20x_desc_function *desc_func =
> +               axp20x_pinctrl_desc_find_func_by_name(pctl, g->name,
> +                                                     func->name);
> +       if (!desc_func)
> +               return -EINVAL;
> +
> +       return axp20x_pmx_set(pctldev, g->pin, desc_func->muxval);
> +}
> +
> +static struct axp20x_desc_function *
> +axp20x_pctl_desc_find_func_by_pin(struct axp20x_pctl *pctl, unsigned int offset,
> +                                 const char *func)
> +{
> +       const struct axp20x_desc_pin *pin;
> +       struct axp20x_desc_function *desc_func;
> +       int i;
> +
> +       for (i = 0; i < pctl->desc->npins; i++) {
> +               pin = &pctl->desc->pins[i];
> +
> +               if (pin->pin.number == offset) {
> +                       desc_func = pin->functions;
> +
> +                       while (desc_func->name) {
> +                               if (!strcmp(desc_func->name, func))
> +                                       return desc_func;
> +
> +                               desc_func++;
> +                       }
> +               }
> +       }
> +
> +       return NULL;
> +}
> +
> +static int axp20x_pmx_gpio_set_direction(struct pinctrl_dev *pctldev,
> +                                        struct pinctrl_gpio_range *range,
> +                                        unsigned int offset, bool input)
> +{
> +       struct axp20x_pctl *pctl = pinctrl_dev_get_drvdata(pctldev);
> +       struct axp20x_desc_function *desc_func;
> +       const char *func;
> +
> +       if (input)
> +               func = "gpio_in";
> +       else
> +               func = "gpio_out";
> +
> +       desc_func = axp20x_pctl_desc_find_func_by_pin(pctl, offset, func);
> +       if (!desc_func)
> +               return -EINVAL;
> +
> +       return axp20x_pmx_set(pctldev, offset, desc_func->muxval);
> +}
> +
> +static const struct pinmux_ops axp20x_pmx_ops = {
> +       .get_functions_count    = axp20x_pmx_func_cnt,
> +       .get_function_name      = axp20x_pmx_func_name,
> +       .get_function_groups    = axp20x_pmx_func_groups,
> +       .set_mux                = axp20x_pmx_set_mux,
> +       .gpio_set_direction     = axp20x_pmx_gpio_set_direction,
> +       .strict                 = true,
> +};
> +
> +static int axp20x_groups_cnt(struct pinctrl_dev *pctldev)
> +{
> +       struct axp20x_pctl *pctl = pinctrl_dev_get_drvdata(pctldev);
> +
> +       return pctl->ngroups;
> +}
> +
> +static int axp20x_group_pins(struct pinctrl_dev *pctldev, unsigned int selector,
> +                            const unsigned int **pins, unsigned int *num_pins)
> +{
> +       struct axp20x_pctl *pctl = pinctrl_dev_get_drvdata(pctldev);
> +       struct axp20x_pinctrl_group *g = pctl->groups + selector;
> +
> +       *pins = (unsigned int *)&g->pin;
> +       *num_pins = 1;
> +
> +       return 0;
> +}
> +
> +static const char *axp20x_group_name(struct pinctrl_dev *pctldev,
> +                                    unsigned int selector)
> +{
> +       struct axp20x_pctl *pctl = pinctrl_dev_get_drvdata(pctldev);
> +
> +       return pctl->groups[selector].name;
> +}
> +
> +static const struct pinctrl_ops axp20x_pctrl_ops = {
> +       .dt_node_to_map         = pinconf_generic_dt_node_to_map_group,
> +       .dt_free_map            = pinconf_generic_dt_free_map,
> +       .get_groups_count       = axp20x_groups_cnt,
> +       .get_group_name         = axp20x_group_name,
> +       .get_group_pins         = axp20x_group_pins,
> +};
> +
> +static struct axp20x_pinctrl_function *
> +axp20x_pinctrl_function_by_name(struct axp20x_pctl *pctl, const char *name)
> +{
> +       struct axp20x_pinctrl_function *func = pctl->functions;
> +
> +       while (func->name) {
> +               if (!strcmp(func->name, name))
> +                       return func;
> +               func++;
> +       }
> +
> +       return NULL;
> +}
> +
> +static int axp20x_pinctrl_add_function(struct axp20x_pctl *pctl,
> +                                      const char *name)
>  {
> -       axp20x_gpio_output(chip, offset, value);
> +       struct axp20x_pinctrl_function *func = pctl->functions;
> +
> +       while (func->name) {
> +               if (!strcmp(func->name, name)) {
> +                       func->ngroups++;
> +                       return -EEXIST;
> +               }
> +
> +               func++;
> +       }
> +
> +       func->name = name;
> +       func->ngroups = 1;
> +
> +       pctl->nfunctions++;
> +
> +       return 0;
>  }
>
> -static int axp20x_gpio_probe(struct platform_device *pdev)
> +static int axp20x_attach_group_function(struct platform_device *pdev,
> +                                       const struct axp20x_desc_pin *pin)
> +{
> +       struct axp20x_pctl *pctl = platform_get_drvdata(pdev);
> +       struct axp20x_desc_function *desc_func = pin->functions;
> +       struct axp20x_pinctrl_function *func;
> +       const char **func_grp;
> +
> +       while (desc_func->name) {
> +               func = axp20x_pinctrl_function_by_name(pctl, desc_func->name);
> +               if (!func)
> +                       return -EINVAL;
> +
> +               if (!func->groups) {
> +                       func->groups = devm_kzalloc(&pdev->dev,
> +                                                   func->ngroups * sizeof(const char *),
> +                                                   GFP_KERNEL);
> +                       if (!func->groups)
> +                               return -ENOMEM;
> +               }
> +
> +               func_grp = func->groups;
> +               while (*func_grp)
> +                       func_grp++;
> +
> +               *func_grp = pin->pin.name;
> +               desc_func++;
> +       }
> +
> +       return 0;
> +}
> +
> +static int axp20x_build_state(struct platform_device *pdev)
> +{
> +       struct axp20x_pctl *pctl = platform_get_drvdata(pdev);
> +       unsigned int npins = pctl->desc->npins;
> +       const struct axp20x_desc_pin *pin;
> +       struct axp20x_desc_function *func;
> +       int i, ret;
> +
> +       pctl->ngroups = npins;
> +       pctl->groups = devm_kzalloc(&pdev->dev,
> +                                   pctl->ngroups * sizeof(*pctl->groups),
> +                                   GFP_KERNEL);
> +       if (!pctl->groups)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < npins; i++) {
> +               pctl->groups[i].name = pctl->desc->pins[i].pin.name;
> +               pctl->groups[i].pin = pctl->desc->pins[i].pin.number;
> +       }
> +
> +       /* We assume 4 functions per pin should be enough as a default max */
> +       pctl->functions = devm_kzalloc(&pdev->dev,
> +                                      npins * 4 * sizeof(*pctl->functions),
> +                                      GFP_KERNEL);
> +       if (!pctl->functions)
> +               return -ENOMEM;
> +
> +       /* Create a list of uniquely named functions */
> +       for (i = 0; i < npins; i++) {
> +               pin = &pctl->desc->pins[i];
> +               func = pin->functions;
> +
> +               while (func->name) {
> +                       axp20x_pinctrl_add_function(pctl, func->name);
> +                       func++;
> +               }
> +       }
> +
> +       pctl->functions = krealloc(pctl->functions,
> +                                  pctl->nfunctions * sizeof(*pctl->functions),
> +                                  GFP_KERNEL);
> +
> +       for (i = 0; i < npins; i++) {
> +               pin = &pctl->desc->pins[i];
> +               ret = axp20x_attach_group_function(pdev, pin);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       return 0;
> +}
> +
> +static int axp20x_pctl_probe(struct platform_device *pdev)
>  {
>         struct axp20x_dev *axp20x = dev_get_drvdata(pdev->dev.parent);
> -       struct axp20x_gpio *gpio;
> -       int ret;
> +       const struct axp20x_desc_pin *pin;
> +       struct axp20x_pctl *pctl;
> +       struct pinctrl_desc *pctrl_desc;
> +       struct pinctrl_pin_desc *pins;
> +       int ret, i;
>
>         if (!of_device_is_available(pdev->dev.of_node))
>                 return -ENODEV;
> @@ -138,51 +489,101 @@ static int axp20x_gpio_probe(struct platform_device *pdev)
>                 return -EINVAL;
>         }
>
> -       gpio = devm_kzalloc(&pdev->dev, sizeof(*gpio), GFP_KERNEL);
> -       if (!gpio)
> +       pctl = devm_kzalloc(&pdev->dev, sizeof(*pctl), GFP_KERNEL);
> +       if (!pctl)
> +               return -ENOMEM;
> +
> +       pctl->chip.base                 = -1;
> +       pctl->chip.can_sleep            = true;
> +       pctl->chip.request              = gpiochip_generic_request;
> +       pctl->chip.free                 = gpiochip_generic_free;
> +       pctl->chip.parent               = &pdev->dev;
> +       pctl->chip.label                = dev_name(&pdev->dev);
> +       pctl->chip.owner                = THIS_MODULE;
> +       pctl->chip.get                  = axp20x_gpio_get;
> +       pctl->chip.get_direction        = axp20x_gpio_get_direction;
> +       pctl->chip.set                  = axp20x_gpio_set;
> +       pctl->chip.direction_input      = axp20x_gpio_input;
> +       pctl->chip.direction_output     = axp20x_gpio_output;
> +       pctl->chip.ngpio                = 3;
> +       pctl->chip.can_sleep            = true;
> +
> +       pctl->regmap = axp20x->regmap;
> +
> +       pctl->desc = &axp20x_pinctrl_data;
> +       pctl->dev = &pdev->dev;
> +
> +       platform_set_drvdata(pdev, pctl);
> +
> +       ret = axp20x_build_state(pdev);
> +       if (ret)
> +               return ret;
> +
> +       pins = devm_kzalloc(&pdev->dev, pctl->desc->npins * sizeof(*pins),
> +                           GFP_KERNEL);
> +       if (!pins)
>                 return -ENOMEM;
>
> -       gpio->chip.base                 = -1;
> -       gpio->chip.can_sleep            = true;
> -       gpio->chip.parent               = &pdev->dev;
> -       gpio->chip.label                = dev_name(&pdev->dev);
> -       gpio->chip.owner                = THIS_MODULE;
> -       gpio->chip.get                  = axp20x_gpio_get;
> -       gpio->chip.get_direction        = axp20x_gpio_get_direction;
> -       gpio->chip.set                  = axp20x_gpio_set;
> -       gpio->chip.direction_input      = axp20x_gpio_input;
> -       gpio->chip.direction_output     = axp20x_gpio_output;
> -       gpio->chip.ngpio                = 3;
> -
> -       gpio->regmap = axp20x->regmap;
> -
> -       ret = devm_gpiochip_add_data(&pdev->dev, &gpio->chip, gpio);
> +       for (i = 0; i < pctl->desc->npins; i++)
> +               pins[i] = pctl->desc->pins[i].pin;
> +
> +       pctrl_desc = devm_kzalloc(&pdev->dev, sizeof(*pctrl_desc), GFP_KERNEL);
> +       if (!pctrl_desc)
> +               return -ENOMEM;
> +
> +       pctrl_desc->name = dev_name(&pdev->dev);
> +       pctrl_desc->owner = THIS_MODULE;
> +       pctrl_desc->pins = pins;
> +       pctrl_desc->npins = pctl->desc->npins;
> +       pctrl_desc->pctlops = &axp20x_pctrl_ops;
> +       pctrl_desc->pmxops = &axp20x_pmx_ops;
> +
> +       pctl->pctl_dev = devm_pinctrl_register(&pdev->dev, pctrl_desc, pctl);
> +       if (IS_ERR(pctl->pctl_dev)) {
> +               dev_err(&pdev->dev, "couldn't register pinctrl driver\n");
> +               return PTR_ERR(pctl->pctl_dev);
> +       }
> +
> +       ret = devm_gpiochip_add_data(&pdev->dev, &pctl->chip, pctl);
>         if (ret) {
>                 dev_err(&pdev->dev, "Failed to register GPIO chip\n");
>                 return ret;
>         }
>
> +       for (i = 0; i < pctl->desc->npins; i++) {
> +               pin = pctl->desc->pins + i;
> +
> +               ret = gpiochip_add_pin_range(&pctl->chip, dev_name(&pdev->dev),
> +                                            pin->pin.number, pin->pin.number,
> +                                            1);

Unlike on sunxi, these pins are sequential and contiguous, so there's no need
for the loop. Just add them all in one go with a single gpiochip_add_pin_range()
call.
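
A rough userspace sketch of the reasoning behind this suggestion. The struct
and function names below are made up for illustration and are not part of the
driver. If every pin number equals the first one plus its index, the per-pin
calls above (which use the pin number as both the GPIO offset and the pin
offset, with npins = 1) collapse into one call along the lines of
gpiochip_add_pin_range(&pctl->chip, dev_name(&pdev->dev), first, first,
pctl->desc->npins), where first would be pctl->desc->pins[0].pin.number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the patch's pin descriptors: only the number matters. */
struct demo_pin {
	unsigned int number;
};

/* Hypothetical record of the arguments a single range registration would take. */
struct demo_pin_range {
	unsigned int gpio_offset;
	unsigned int pin_offset;
	unsigned int npins;
};

/*
 * Returns true and fills @out when pins[0..n) are strictly sequential,
 * i.e. pins[i].number == pins[0].number + i for every i. In that case the
 * per-pin loop can be replaced by one range covering all n pins.
 */
static bool demo_collapse_pin_range(const struct demo_pin *pins, size_t n,
				    struct demo_pin_range *out)
{
	size_t i;

	assert(out != NULL);

	if (n == 0)
		return false;

	for (i = 1; i < n; i++)
		if (pins[i].number != pins[0].number + i)
			return false;

	/*
	 * The loop in the patch uses the pin number as both the GPIO offset
	 * and the pin offset, so one range starting at the first pin number
	 * with n entries registers exactly the same mappings.
	 */
	out->gpio_offset = pins[0].number;
	out->pin_offset = pins[0].number;
	out->npins = (unsigned int)n;
	return true;
}
```

In the driver itself no runtime check should be needed, since the pin table is
static; the sketch only spells out the contiguity assumption that makes the
single call equivalent to the loop.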

> +               if (ret) {
> +                       dev_err(&pdev->dev, "failed to add pin range\n");
> +                       return ret;
> +               }
> +       }
> +
>         dev_info(&pdev->dev, "AXP209 GPIO driver loaded\n");
>
>         return 0;
>  }
>
> -static const struct of_device_id axp20x_gpio_match[] = {
> +static const struct of_device_id axp20x_pctl_match[] = {
>         { .compatible = "x-powers,axp209-gpio" },
>         { }
>  };
> -MODULE_DEVICE_TABLE(of, axp20x_gpio_match);
> +MODULE_DEVICE_TABLE(of, axp20x_pctl_match);
>
> -static struct platform_driver axp20x_gpio_driver = {
> -       .probe          = axp20x_gpio_probe,
> +static struct platform_driver axp20x_pctl_driver = {
> +       .probe          = axp20x_pctl_probe,
>         .driver = {
>                 .name           = "axp20x-gpio",
> -               .of_match_table = axp20x_gpio_match,
> +               .of_match_table = axp20x_pctl_match,
>         },
>  };
>
> -module_platform_driver(axp20x_gpio_driver);
> +module_platform_driver(axp20x_pctl_driver);
>
>  MODULE_AUTHOR("Maxime Ripard <maxime.ripard@free-electrons.com>");
> +MODULE_AUTHOR("Quentin Schulz <quentin.schulz@free-electrons.com>");
>  MODULE_DESCRIPTION("AXP20x PMIC GPIO driver");
>  MODULE_LICENSE("GPL");
> --
> 2.9.3
>

Apart from the minor comments above, and Thomas' earlier comments,
this patch looks good to me.

ChenYu
