* Re: [PATCH V4 0/8] net: ethernet: stmmac: add support for stm32mp1
From: Alexandre Torgue @ 2018-05-24 7:22 UTC (permalink / raw)
To: David Miller, christophe.roullier
Cc: mark.rutland, mcoquelin.stm32, peppe.cavallaro, devicetree,
linux-arm-kernel, netdev, andrew
In-Reply-To: <20180523.160811.1425159248399846750.davem@davemloft.net>
On 05/23/2018 10:08 PM, David Miller wrote:
> From: Christophe Roullier <christophe.roullier@st.com>
> Date: Wed, 23 May 2018 17:47:51 +0200
>
>> Patches to have Ethernet support on stm32mp1
>> Changelog:
>> Remark from Rob Herring
>> Move Documentation/devicetree/bindings/arm/stm32.txt in
>> Documentation/devicetree/bindings/arm/stm32/stm32.txt and create
>> Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
>>
>> Replace also in arch/arm/boot/dts/stm32mp157c.dtsi, syscfg: system-config@50020000
>> with syscfg: syscon@50020000
>
> Probably the DTS file updates need to go in via the ARM tree, not
> mine.
Yes, I will take them in my tree.
>
> Can you respin a net-next targeted series that has just the driver
> code and device tree binding updates?
>
> Thank you!
* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Gi-Oh Kim @ 2018-05-24 7:23 UTC (permalink / raw)
To: Qing Huang
Cc: Tariq Toukan, davem, haakon.bugge, yanjun.zhu, netdev, linux-rdma,
linux-kernel
In-Reply-To: <20180523232246.20445-1-qing.huang@oracle.com>
On Thu, May 24, 2018 at 1:22 AM, Qing Huang <qing.huang@oracle.com> wrote:
> When a system is under memory pressure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
>
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
>
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
>
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.
Hi,
Could you please explain why it first tries to allocate contiguous pages?
I think a comment is needed explaining why kvzalloc is used instead of vzalloc.
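For reference, the sizes quoted in the commit message can be reproduced
with a small userspace calculation. This is only a sketch:
icm_table_ptr_bytes() is a hypothetical helper, not a driver function,
and the pointer size is passed in explicitly (8 bytes is assumed for
x86_64).

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of mlx4): size in bytes of the
 * table->icm pointer array for nobj objects of obj_size bytes,
 * split into chunks of chunk_size bytes. Returns 0 when a chunk
 * cannot hold even one object, the case the patch rejects with
 * WARN_ON(!obj_per_chunk).
 */
static uint64_t icm_table_ptr_bytes(uint64_t nobj, uint64_t obj_size,
				    uint64_t chunk_size, uint64_t ptr_size)
{
	uint64_t obj_per_chunk, num_icm;

	if (obj_size == 0)
		return 0;

	obj_per_chunk = chunk_size / obj_size;
	if (obj_per_chunk == 0)
		return 0;

	/* Round up, as mlx4_init_icm_table() does. */
	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

	return num_icm * ptr_size;
}
```

With nobj = 1 << 30, obj_size = 8 and chunk_size = 4096 this gives 2M
pointers, i.e. 16MB, matching the commit message. With the old 256KB
chunks the same table needs 32K pointers, i.e. 256KB.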
>
> Signed-off-by: Qing Huang <qing.huang@oracle.com>
> Acked-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
+Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com>
> ---
> v4: use kvzalloc instead of vzalloc
> add one err condition check
> don't include vmalloc.h any more
>
> v3: use PAGE_SIZE instead of PAGE_SHIFT
> add comma to the end of enum variables
> include vmalloc.h header file to avoid build issues on Sparc
>
> v2: adjusted chunk size to reflect different architectures
>
> drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index a822f7a..685337d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,12 @@
> #include "fw.h"
>
> /*
> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
> - * per chunk.
> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
> + * order memory allocations in fragmented/high usage memory situation.
> */
> enum {
> - MLX4_ICM_ALLOC_SIZE = 1 << 18,
> - MLX4_TABLE_CHUNK_SIZE = 1 << 18
> + MLX4_ICM_ALLOC_SIZE = PAGE_SIZE,
> + MLX4_TABLE_CHUNK_SIZE = PAGE_SIZE,
> };
>
> static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
> @@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
> u64 size;
>
> obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
> + if (WARN_ON(!obj_per_chunk))
> + return -EINVAL;
> num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>
> - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
> + table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
> if (!table->icm)
> return -ENOMEM;
> table->virt = virt;
> @@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
> mlx4_free_icm(dev, table->icm[i], use_coherent);
> }
>
> - kfree(table->icm);
> + kvfree(table->icm);
>
> return -ENOMEM;
> }
> @@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
> mlx4_free_icm(dev, table->icm[i], table->coherent);
> }
>
> - kfree(table->icm);
> + kvfree(table->icm);
> }
> --
> 2.9.3
>
--
GIOH KIM
Linux Kernel Entwickler
ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 176 2697 8962
Fax: +49 30 577 008 299
Email: gi-oh.kim@profitbricks.com
URL: https://www.profitbricks.de
Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss, Matthias Steinberg, Christoph Steffens
* Re: [PATCH V4 1/8] net: ethernet: stmmac: add adaptation for stm32mp157c.
From: Alexandre Torgue @ 2018-05-24 7:24 UTC (permalink / raw)
To: Christophe Roullier, mark.rutland, mcoquelin.stm32,
peppe.cavallaro
Cc: devicetree, linux-arm-kernel, netdev, andrew
In-Reply-To: <1527090479-5263-2-git-send-email-christophe.roullier@st.com>
Hi,
On 05/23/2018 05:47 PM, Christophe Roullier wrote:
> Glue codes to support stm32mp157c device and stay
> compatible with stm32 mcu family
>
> Signed-off-by: Christophe Roullier <christophe.roullier@st.com>
> ---
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
> drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c | 270 ++++++++++++++++++++--
> 1 file changed, 255 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c
> index 9e6db16..f51e327 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c
> @@ -16,49 +16,183 @@
> #include <linux/of_net.h>
> #include <linux/phy.h>
> #include <linux/platform_device.h>
> +#include <linux/pm_wakeirq.h>
> #include <linux/regmap.h>
> #include <linux/slab.h>
> #include <linux/stmmac.h>
>
> #include "stmmac_platform.h"
>
> -#define MII_PHY_SEL_MASK BIT(23)
> +#define SYSCFG_MCU_ETH_MASK BIT(23)
> +#define SYSCFG_MP1_ETH_MASK GENMASK(23, 16)
> +
> +#define SYSCFG_PMCR_ETH_CLK_SEL BIT(16)
> +#define SYSCFG_PMCR_ETH_REF_CLK_SEL BIT(17)
> +#define SYSCFG_PMCR_ETH_SEL_MII BIT(20)
> +#define SYSCFG_PMCR_ETH_SEL_RGMII BIT(21)
> +#define SYSCFG_PMCR_ETH_SEL_RMII BIT(23)
> +#define SYSCFG_PMCR_ETH_SEL_GMII 0
> +#define SYSCFG_MCU_ETH_SEL_MII 0
> +#define SYSCFG_MCU_ETH_SEL_RMII 1
>
> struct stm32_dwmac {
> struct clk *clk_tx;
> struct clk *clk_rx;
> + struct clk *clk_eth_ck;
> + struct clk *clk_ethstp;
> + struct clk *syscfg_clk;
> + bool int_phyclk; /* Clock from RCC to drive PHY */
> u32 mode_reg; /* MAC glue-logic mode register */
> struct regmap *regmap;
> u32 speed;
> + const struct stm32_ops *ops;
> + struct device *dev;
> +};
> +
> +struct stm32_ops {
> + int (*set_mode)(struct plat_stmmacenet_data *plat_dat);
> + int (*clk_prepare)(struct stm32_dwmac *dwmac, bool prepare);
> + int (*suspend)(struct stm32_dwmac *dwmac);
> + void (*resume)(struct stm32_dwmac *dwmac);
> + int (*parse_data)(struct stm32_dwmac *dwmac,
> + struct device *dev);
> + u32 syscfg_eth_mask;
> };
>
> static int stm32_dwmac_init(struct plat_stmmacenet_data *plat_dat)
> {
> struct stm32_dwmac *dwmac = plat_dat->bsp_priv;
> - u32 reg = dwmac->mode_reg;
> - u32 val;
> int ret;
>
> - val = (plat_dat->interface == PHY_INTERFACE_MODE_MII) ? 0 : 1;
> - ret = regmap_update_bits(dwmac->regmap, reg, MII_PHY_SEL_MASK, val);
> - if (ret)
> - return ret;
> + if (dwmac->ops->set_mode) {
> + ret = dwmac->ops->set_mode(plat_dat);
> + if (ret)
> + return ret;
> + }
>
> ret = clk_prepare_enable(dwmac->clk_tx);
> if (ret)
> return ret;
>
> - ret = clk_prepare_enable(dwmac->clk_rx);
> - if (ret)
> - clk_disable_unprepare(dwmac->clk_tx);
> + if (!dwmac->dev->power.is_suspended) {
> + ret = clk_prepare_enable(dwmac->clk_rx);
> + if (ret) {
> + clk_disable_unprepare(dwmac->clk_tx);
> + return ret;
> + }
> + }
> +
> + if (dwmac->ops->clk_prepare) {
> + ret = dwmac->ops->clk_prepare(dwmac, true);
> + if (ret) {
> + clk_disable_unprepare(dwmac->clk_rx);
> + clk_disable_unprepare(dwmac->clk_tx);
> + }
> + }
>
> return ret;
> }
>
> +static int stm32mp1_clk_prepare(struct stm32_dwmac *dwmac, bool prepare)
> +{
> + int ret = 0;
> +
> + if (prepare) {
> + ret = clk_prepare_enable(dwmac->syscfg_clk);
> + if (ret)
> + return ret;
> +
> + if (dwmac->int_phyclk) {
> + ret = clk_prepare_enable(dwmac->clk_eth_ck);
> + if (ret) {
> + clk_disable_unprepare(dwmac->syscfg_clk);
> + return ret;
> + }
> + }
> + } else {
> + clk_disable_unprepare(dwmac->syscfg_clk);
> + if (dwmac->int_phyclk)
> + clk_disable_unprepare(dwmac->clk_eth_ck);
> + }
> + return ret;
> +}
> +
> +static int stm32mp1_set_mode(struct plat_stmmacenet_data *plat_dat)
> +{
> + struct stm32_dwmac *dwmac = plat_dat->bsp_priv;
> + u32 reg = dwmac->mode_reg;
> + int val;
> +
> + switch (plat_dat->interface) {
> + case PHY_INTERFACE_MODE_MII:
> + val = SYSCFG_PMCR_ETH_SEL_MII;
> + pr_debug("SYSCFG init : PHY_INTERFACE_MODE_MII\n");
> + break;
> + case PHY_INTERFACE_MODE_GMII:
> + val = SYSCFG_PMCR_ETH_SEL_GMII;
> + if (dwmac->int_phyclk)
> + val |= SYSCFG_PMCR_ETH_CLK_SEL;
> + pr_debug("SYSCFG init : PHY_INTERFACE_MODE_GMII\n");
> + break;
> + case PHY_INTERFACE_MODE_RMII:
> + val = SYSCFG_PMCR_ETH_SEL_RMII;
> + if (dwmac->int_phyclk)
> + val |= SYSCFG_PMCR_ETH_REF_CLK_SEL;
> + pr_debug("SYSCFG init : PHY_INTERFACE_MODE_RMII\n");
> + break;
> + case PHY_INTERFACE_MODE_RGMII:
> + case PHY_INTERFACE_MODE_RGMII_ID:
> + case PHY_INTERFACE_MODE_RGMII_RXID:
> + case PHY_INTERFACE_MODE_RGMII_TXID:
> + val = SYSCFG_PMCR_ETH_SEL_RGMII;
> + if (dwmac->int_phyclk)
> + val |= SYSCFG_PMCR_ETH_CLK_SEL;
> + pr_debug("SYSCFG init : PHY_INTERFACE_MODE_RGMII\n");
> + break;
> + default:
> + pr_debug("SYSCFG init : Do not manage %d interface\n",
> + plat_dat->interface);
> + /* Do not manage others interfaces */
> + return -EINVAL;
> + }
> +
> + return regmap_update_bits(dwmac->regmap, reg,
> + dwmac->ops->syscfg_eth_mask, val);
> +}
> +
> +static int stm32mcu_set_mode(struct plat_stmmacenet_data *plat_dat)
> +{
> + struct stm32_dwmac *dwmac = plat_dat->bsp_priv;
> + u32 reg = dwmac->mode_reg;
> + int val;
> +
> + switch (plat_dat->interface) {
> + case PHY_INTERFACE_MODE_MII:
> + val = SYSCFG_MCU_ETH_SEL_MII;
> + pr_debug("SYSCFG init : PHY_INTERFACE_MODE_MII\n");
> + break;
> + case PHY_INTERFACE_MODE_RMII:
> + val = SYSCFG_MCU_ETH_SEL_RMII;
> + pr_debug("SYSCFG init : PHY_INTERFACE_MODE_RMII\n");
> + break;
> + default:
> + pr_debug("SYSCFG init : Do not manage %d interface\n",
> + plat_dat->interface);
> + /* Do not manage others interfaces */
> + return -EINVAL;
> + }
> +
> + return regmap_update_bits(dwmac->regmap, reg,
> + dwmac->ops->syscfg_eth_mask, val);
> +}
> +
> static void stm32_dwmac_clk_disable(struct stm32_dwmac *dwmac)
> {
> clk_disable_unprepare(dwmac->clk_tx);
> clk_disable_unprepare(dwmac->clk_rx);
> +
> + if (dwmac->ops->clk_prepare)
> + dwmac->ops->clk_prepare(dwmac, false);
> }
>
> static int stm32_dwmac_parse_data(struct stm32_dwmac *dwmac,
> @@ -70,15 +204,22 @@ static int stm32_dwmac_parse_data(struct stm32_dwmac *dwmac,
> /* Get TX/RX clocks */
> dwmac->clk_tx = devm_clk_get(dev, "mac-clk-tx");
> if (IS_ERR(dwmac->clk_tx)) {
> - dev_err(dev, "No tx clock provided...\n");
> + dev_err(dev, "No ETH Tx clock provided...\n");
> return PTR_ERR(dwmac->clk_tx);
> }
> +
> dwmac->clk_rx = devm_clk_get(dev, "mac-clk-rx");
> if (IS_ERR(dwmac->clk_rx)) {
> - dev_err(dev, "No rx clock provided...\n");
> + dev_err(dev, "No ETH Rx clock provided...\n");
> return PTR_ERR(dwmac->clk_rx);
> }
>
> + if (dwmac->ops->parse_data) {
> + err = dwmac->ops->parse_data(dwmac, dev);
> + if (err)
> + return err;
> + }
> +
> /* Get mode register */
> dwmac->regmap = syscon_regmap_lookup_by_phandle(np, "st,syscon");
> if (IS_ERR(dwmac->regmap))
> @@ -91,11 +232,46 @@ static int stm32_dwmac_parse_data(struct stm32_dwmac *dwmac,
> return err;
> }
>
> +static int stm32mp1_parse_data(struct stm32_dwmac *dwmac,
> + struct device *dev)
> +{
> + struct device_node *np = dev->of_node;
> +
> + dwmac->int_phyclk = of_property_read_bool(np, "st,int-phyclk");
> +
> + /* Check if internal clk from RCC selected */
> + if (dwmac->int_phyclk) {
> + /* Get ETH_CLK clocks */
> + dwmac->clk_eth_ck = devm_clk_get(dev, "eth-ck");
> + if (IS_ERR(dwmac->clk_eth_ck)) {
> + dev_err(dev, "No ETH CK clock provided...\n");
> + return PTR_ERR(dwmac->clk_eth_ck);
> + }
> + }
> +
> + /* Clock used for low power mode */
> + dwmac->clk_ethstp = devm_clk_get(dev, "ethstp");
> + if (IS_ERR(dwmac->clk_ethstp)) {
> + dev_err(dev, "No ETH peripheral clock provided for CStop mode ...\n");
> + return PTR_ERR(dwmac->clk_ethstp);
> + }
> +
> + /* Clock for sysconfig */
> + dwmac->syscfg_clk = devm_clk_get(dev, "syscfg-clk");
> + if (IS_ERR(dwmac->syscfg_clk)) {
> + dev_err(dev, "No syscfg clock provided...\n");
> + return PTR_ERR(dwmac->syscfg_clk);
> + }
> +
> + return 0;
> +}
> +
> static int stm32_dwmac_probe(struct platform_device *pdev)
> {
> struct plat_stmmacenet_data *plat_dat;
> struct stmmac_resources stmmac_res;
> struct stm32_dwmac *dwmac;
> + const struct stm32_ops *data;
> int ret;
>
> ret = stmmac_get_platform_resources(pdev, &stmmac_res);
> @@ -112,6 +288,16 @@ static int stm32_dwmac_probe(struct platform_device *pdev)
> goto err_remove_config_dt;
> }
>
> + data = of_device_get_match_data(&pdev->dev);
> + if (!data) {
> + dev_err(&pdev->dev, "no of match data provided\n");
> + ret = -EINVAL;
> + goto err_remove_config_dt;
> + }
> +
> + dwmac->ops = data;
> + dwmac->dev = &pdev->dev;
> +
> ret = stm32_dwmac_parse_data(dwmac, &pdev->dev);
> if (ret) {
> dev_err(&pdev->dev, "Unable to parse OF data\n");
> @@ -149,15 +335,48 @@ static int stm32_dwmac_remove(struct platform_device *pdev)
> return ret;
> }
>
> +static int stm32mp1_suspend(struct stm32_dwmac *dwmac)
> +{
> + int ret = 0;
> +
> + ret = clk_prepare_enable(dwmac->clk_ethstp);
> + if (ret)
> + return ret;
> +
> + clk_disable_unprepare(dwmac->clk_tx);
> + clk_disable_unprepare(dwmac->syscfg_clk);
> + if (dwmac->int_phyclk)
> + clk_disable_unprepare(dwmac->clk_eth_ck);
> +
> + return ret;
> +}
> +
> +static void stm32mp1_resume(struct stm32_dwmac *dwmac)
> +{
> + clk_disable_unprepare(dwmac->clk_ethstp);
> +}
> +
> +static int stm32mcu_suspend(struct stm32_dwmac *dwmac)
> +{
> + clk_disable_unprepare(dwmac->clk_tx);
> + clk_disable_unprepare(dwmac->clk_rx);
> +
> + return 0;
> +}
> +
> #ifdef CONFIG_PM_SLEEP
> static int stm32_dwmac_suspend(struct device *dev)
> {
> struct net_device *ndev = dev_get_drvdata(dev);
> struct stmmac_priv *priv = netdev_priv(ndev);
> + struct stm32_dwmac *dwmac = priv->plat->bsp_priv;
> +
> int ret;
>
> ret = stmmac_suspend(dev);
> - stm32_dwmac_clk_disable(priv->plat->bsp_priv);
> +
> + if (dwmac->ops->suspend)
> + ret = dwmac->ops->suspend(dwmac);
>
> return ret;
> }
> @@ -166,8 +385,12 @@ static int stm32_dwmac_resume(struct device *dev)
> {
> struct net_device *ndev = dev_get_drvdata(dev);
> struct stmmac_priv *priv = netdev_priv(ndev);
> + struct stm32_dwmac *dwmac = priv->plat->bsp_priv;
> int ret;
>
> + if (dwmac->ops->resume)
> + dwmac->ops->resume(dwmac);
> +
> ret = stm32_dwmac_init(priv->plat);
> if (ret)
> return ret;
> @@ -181,8 +404,24 @@ static int stm32_dwmac_resume(struct device *dev)
> static SIMPLE_DEV_PM_OPS(stm32_dwmac_pm_ops,
> stm32_dwmac_suspend, stm32_dwmac_resume);
>
> +static struct stm32_ops stm32mcu_dwmac_data = {
> + .set_mode = stm32mcu_set_mode,
> + .suspend = stm32mcu_suspend,
> + .syscfg_eth_mask = SYSCFG_MCU_ETH_MASK
> +};
> +
> +static struct stm32_ops stm32mp1_dwmac_data = {
> + .set_mode = stm32mp1_set_mode,
> + .clk_prepare = stm32mp1_clk_prepare,
> + .suspend = stm32mp1_suspend,
> + .resume = stm32mp1_resume,
> + .parse_data = stm32mp1_parse_data,
> + .syscfg_eth_mask = SYSCFG_MP1_ETH_MASK
> +};
> +
> static const struct of_device_id stm32_dwmac_match[] = {
> - { .compatible = "st,stm32-dwmac"},
> + { .compatible = "st,stm32-dwmac", .data = &stm32mcu_dwmac_data},
> + { .compatible = "st,stm32mp1-dwmac", .data = &stm32mp1_dwmac_data},
> { }
> };
> MODULE_DEVICE_TABLE(of, stm32_dwmac_match);
> @@ -199,5 +438,6 @@ static SIMPLE_DEV_PM_OPS(stm32_dwmac_pm_ops,
> module_platform_driver(stm32_dwmac_driver);
>
> MODULE_AUTHOR("Alexandre Torgue <alexandre.torgue@gmail.com>");
> -MODULE_DESCRIPTION("STMicroelectronics MCU DWMAC Specific Glue layer");
> +MODULE_AUTHOR("Christophe Roullier <christophe.roullier@st.com>");
> +MODULE_DESCRIPTION("STMicroelectronics STM32 DWMAC Specific Glue layer");
> MODULE_LICENSE("GPL v2");
>
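Both set_mode callbacks in the patch above end in regmap_update_bits(),
which only changes the bits selected by the mask. A minimal userspace
sketch of that read-modify-write step follows. update_bits() is a
hypothetical stand-in operating on a plain 32-bit value, and BIT() and
GENMASK() are redefined locally for 32-bit operands; only the SYSCFG_*
values are taken from the patch.

```c
#include <stdint.h>

/* Local 32-bit stand-ins for the kernel's BIT() and GENMASK(). */
#define BIT(n)		(UINT32_C(1) << (n))
#define GENMASK(h, l)	((~UINT32_C(0) << (l)) & (~UINT32_C(0) >> (31 - (h))))

/* Values as defined in the patch. */
#define SYSCFG_MCU_ETH_MASK		BIT(23)
#define SYSCFG_MP1_ETH_MASK		GENMASK(23, 16)
#define SYSCFG_PMCR_ETH_CLK_SEL		BIT(16)
#define SYSCFG_PMCR_ETH_SEL_RGMII	BIT(21)

/*
 * Hypothetical stand-in for regmap_update_bits() on a single value:
 * bits outside mask keep their old state, bits inside mask are taken
 * from val. Bits of val outside mask are ignored.
 */
static uint32_t update_bits(uint32_t old, uint32_t mask, uint32_t val)
{
	return (old & ~mask) | (val & mask);
}
```

One consequence that may be worth checking in review: the value has to
be expressed in the same bit positions as the mask. A value of 1, as
used for SYSCFG_MCU_ETH_SEL_RMII, lies outside a BIT(23) mask, so under
this rule update_bits(old, SYSCFG_MCU_ETH_MASK, 1) clears bit 23 rather
than setting it.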
* Re: [PATCH bpf-next v4 00/10] bpf: enhancements for multi-function programs
From: Daniel Borkmann @ 2018-05-24 7:31 UTC (permalink / raw)
To: Sandipan Das, ast; +Cc: netdev, linuxppc-dev, mpe, naveen.n.rao, jakub.kicinski
In-Reply-To: <cover.1527143877.git.sandipan@linux.vnet.ibm.com>
On 05/24/2018 08:56 AM, Sandipan Das wrote:
> [1] Support for bpf-to-bpf function calls in the powerpc64 JIT compiler.
>
> [2] Provide a way for resolving function calls because of the way JITed
> images are allocated in powerpc64.
>
> [3] Fix to get JITed instruction dumps for multi-function programs from
> the bpf system call.
>
> [4] Fix for bpftool to show delimited multi-function JITed image dumps.
>
> v4:
> - Incorporate review comments from Jakub.
> - Fix JSON output for bpftool.
>
> v3:
> - Change base tree tag to bpf-next.
> - Incorporate review comments from Alexei, Daniel and Jakub.
> - Make sure that the JITed image does not grow or shrink after
> the last pass due to the way the instruction sequence used
> to load a callee's address may be optimized.
> - Make additional changes to the bpf system call and bpftool to
> make multi-function JITed dumps easier to correlate.
>
> v2:
> - Incorporate review comments from Jakub.
>
> Sandipan Das (10):
> bpf: support 64-bit offsets for bpf function calls
> bpf: powerpc64: pad function address loads with NOPs
> bpf: powerpc64: add JIT support for multi-function programs
> bpf: get kernel symbol addresses via syscall
> tools: bpf: sync bpf uapi header
> tools: bpftool: resolve calls without using imm field
> bpf: fix multi-function JITed dump obtained via syscall
> bpf: get JITed image lengths of functions via syscall
> tools: bpf: sync bpf uapi header
> tools: bpftool: add delimiters to multi-function JITed dumps
>
> arch/powerpc/net/bpf_jit_comp64.c | 110 ++++++++++++++++++++++++++++++--------
> include/uapi/linux/bpf.h | 4 ++
> kernel/bpf/syscall.c | 82 ++++++++++++++++++++++++++--
> kernel/bpf/verifier.c | 22 +++++---
> tools/bpf/bpftool/prog.c | 97 ++++++++++++++++++++++++++++++++-
> tools/bpf/bpftool/xlated_dumper.c | 14 +++--
> tools/bpf/bpftool/xlated_dumper.h | 3 ++
> tools/include/uapi/linux/bpf.h | 4 ++
> 8 files changed, 301 insertions(+), 35 deletions(-)
Applied to bpf-next, thanks a lot Sandipan!
* Re: [PATCH bpf-next v4 02/10] bpf: powerpc64: pad function address loads with NOPs
From: Daniel Borkmann @ 2018-05-24 7:34 UTC (permalink / raw)
To: Sandipan Das, ast; +Cc: netdev, linuxppc-dev, mpe, naveen.n.rao, jakub.kicinski
In-Reply-To: <d0db970711596827ba88209e545c686d77c22b7d.1527143877.git.sandipan@linux.vnet.ibm.com>
On 05/24/2018 08:56 AM, Sandipan Das wrote:
> For multi-function programs, loading the address of a callee
> function to a register requires emitting instructions whose
> count varies from one to five depending on the nature of the
> address.
>
> Since we come to know of the callee's address only before the
> extra pass, the number of instructions required to load this
> address may vary from what was previously generated. This can
> make the JITed image grow or shrink.
>
> To avoid this, we should generate a constant five-instruction
> sequence when loading function addresses by padding the optimized
> load sequence with NOPs.
>
> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
> ---
> arch/powerpc/net/bpf_jit_comp64.c | 34 +++++++++++++++++++++++-----------
> 1 file changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
> index 1bdb1aff0619..e4582744a31d 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -167,25 +167,37 @@ static void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)
>
> static void bpf_jit_emit_func_call(u32 *image, struct codegen_context *ctx, u64 func)
> {
> + unsigned int i, ctx_idx = ctx->idx;
> +
> + /* Load function address into r12 */
> + PPC_LI64(12, func);
> +
> + /* For bpf-to-bpf function calls, the callee's address is unknown
> + * until the last extra pass. As seen above, we use PPC_LI64() to
> + * load the callee's address, but this may optimize the number of
> + * instructions required based on the nature of the address.
> + *
> + * Since we don't want the number of instructions emitted to change,
> + * we pad the optimized PPC_LI64() call with NOPs to guarantee that
> + * we always have a five-instruction sequence, which is the maximum
> + * that PPC_LI64() can emit.
> + */
> + for (i = ctx->idx - ctx_idx; i < 5; i++)
> + PPC_NOP();
By the way, I think you can still optimize this. The nops are not really
needed in case of insn->src_reg != BPF_PSEUDO_CALL since the address of
a normal BPF helper call will always be at a fixed location and known a
priori.
> #ifdef PPC64_ELF_ABI_v1
> - /* func points to the function descriptor */
> - PPC_LI64(b2p[TMP_REG_2], func);
> - /* Load actual entry point from function descriptor */
> - PPC_BPF_LL(b2p[TMP_REG_1], b2p[TMP_REG_2], 0);
> - /* ... and move it to LR */
> - PPC_MTLR(b2p[TMP_REG_1]);
> /*
> * Load TOC from function descriptor at offset 8.
> * We can clobber r2 since we get called through a
> * function pointer (so caller will save/restore r2)
> * and since we don't use a TOC ourself.
> */
> - PPC_BPF_LL(2, b2p[TMP_REG_2], 8);
> -#else
> - /* We can clobber r12 */
> - PPC_FUNC_ADDR(12, func);
> - PPC_MTLR(12);
> + PPC_BPF_LL(2, 12, 8);
> + /* Load actual entry point from function descriptor */
> + PPC_BPF_LL(12, 12, 0);
> #endif
> +
> + PPC_MTLR(12);
> PPC_BLRL();
> }
>
>
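The padding loop in bpf_jit_emit_func_call() above can be illustrated
outside the JIT. The sketch below is an assumption-laden model, not the
powerpc code: struct emit_ctx, emit() and emit_padded_load() are
hypothetical names, and the variable-length load is passed in as a
ready-made instruction array instead of being produced by PPC_LI64().
Only the NOP encoding (ori 0,0,0) and the five-instruction maximum come
from the architecture and the patch.

```c
#include <stdint.h>

#define PPC_INST_NOP	0x60000000u	/* ori r0,r0,0 */
#define LOAD_SEQ_LEN	5		/* longest sequence PPC_LI64() can emit */

/* Minimal stand-in for the JIT's codegen context. */
struct emit_ctx {
	uint32_t *image;	/* output buffer */
	unsigned int idx;	/* next free instruction slot */
};

static void emit(struct emit_ctx *ctx, uint32_t insn)
{
	ctx->image[ctx->idx++] = insn;
}

/*
 * Hypothetical helper: copy a load sequence of n instructions
 * (n <= LOAD_SEQ_LEN) and pad it with NOPs so that the caller always
 * advances by exactly LOAD_SEQ_LEN slots.
 */
static void emit_padded_load(struct emit_ctx *ctx, const uint32_t *seq,
			     unsigned int n)
{
	unsigned int i, start = ctx->idx;

	for (i = 0; i < n; i++)
		emit(ctx, seq[i]);

	/* Same loop shape as in the patch: pad up to five instructions. */
	for (i = ctx->idx - start; i < LOAD_SEQ_LEN; i++)
		emit(ctx, PPC_INST_NOP);
}
```

Because every call site then occupies the same number of slots, a later
pass can rewrite the address in place without moving any following
instruction. Skipping the padding for helper calls, as suggested above,
would amount to calling emit() directly when the address is already
final.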
* Re: [PATCH v2 bpf-next] bpf: btf: Avoid variable length array
From: Daniel Borkmann @ 2018-05-24 7:35 UTC (permalink / raw)
To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, kernel-team
In-Reply-To: <20180523183236.519795-1-kafai@fb.com>
On 05/23/2018 08:32 PM, Martin KaFai Lau wrote:
> Sparse warning:
> kernel/bpf/btf.c:1985:34: warning: Variable length array is used.
>
> This patch directly uses ARRAY_SIZE().
>
> Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Applied to bpf-next, thanks Martin!
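For readers unfamiliar with the warning: the btf.c code itself is not
quoted in this message, so the following is only a sketch of the general
pattern with hypothetical names (sec_offsets, sum_offsets). It shows why
sizing an array through a const local yields a variable length array in
C, while using ARRAY_SIZE() directly in the declarator does not.

```c
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical table, standing in for per-section offsets. */
static const size_t sec_offsets[] = { 0, 8, 16 };

static size_t sum_offsets(void)
{
	/*
	 * A const-qualified local is not a constant expression in C, so
	 *
	 *	const unsigned int nr = ARRAY_SIZE(sec_offsets);
	 *	size_t secs[nr];
	 *
	 * would declare a variable length array. Using ARRAY_SIZE()
	 * directly keeps the bound a compile-time constant.
	 */
	size_t secs[ARRAY_SIZE(sec_offsets)];
	size_t i, sum = 0;

	for (i = 0; i < ARRAY_SIZE(secs); i++) {
		secs[i] = sec_offsets[i];
		sum += secs[i];
	}

	return sum;
}
```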
* Re: [PATCH bpf-next v2 0/3] bpf: add boot parameters for sysctl knobs
From: Jesper Dangaard Brouer @ 2018-05-24 7:41 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Eugene Syromiatnikov, netdev, linux-kernel, linux-doc, Kees Cook,
Kai-Heng Feng, Daniel Borkmann, Alexei Starovoitov,
Jonathan Corbet, Jiri Olsa, brouer
In-Reply-To: <20180523220244.a4u25kapqbjnmpr4@ast-mbp>
On Wed, 23 May 2018 15:02:45 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Wed, May 23, 2018 at 02:18:19PM +0200, Eugene Syromiatnikov wrote:
> > Some BPF sysctl knobs affect the loading of BPF programs, and during
> > system boot/init stages these sysctls are not yet configured.
> > A concrete example is systemd, that has implemented loading of BPF
> > programs.
> >
> > Thus, to allow controlling these settings at early boot, this patch set
> > adds the ability to change the default setting of these sysctl knobs
> > as well as option to override them via a boot-time kernel parameter
> > (in order to avoid rebuilding kernel each time a need of changing these
> > defaults arises).
> >
> > The sysctl knobs in question are kernel.unprivileged_bpf_disable,
> > net.core.bpf_jit_harden, and net.core.bpf_jit_kallsyms.
>
> - systemd is root. today it only uses cgroup-bpf progs which require root,
> so disabling unpriv during boot time makes no difference to systemd.
> what is the actual reason to present time?
>
> - say in the future systemd wants to use so_reuseport+bpf for faster
> networking. With unpriv disable during boot, it will force systemd
> to do such networking from root, which will lower its security barrier.
> How does that make sense?
>
> - bpf_jit_kallsyms sysctl has immediate effect on loaded programs.
> Flipping it during the boot or right after or any time after
> is the same thing. Why add such boot flag then?
>
> - jit_harden can be turned on by systemd. so turning it during the boot
> will make systemd progs to be constant blinded.
> Constant blinding protects kernel from unprivileged JIT spraying.
> Are you worried that systemd will attack the kernel with JIT spraying?
I think you are missing that we want the ability to change these
defaults in order to avoid depending on /etc/sysctl.conf settings, and
that these sysctl.conf settings are applied too late.
For example, with jit_harden there will be a difference between a BPF
program that systemd loaded at boot time (no constant blinding) and the
same program after someone reloads that systemd service once
/etc/sysctl.conf has been evaluated and bpf_jit_harden is set (now
slower due to constant blinding). This is inconsistent behavior.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* [PATCH net-next] bpfilter: don't pass O_CREAT when opening console for debug
From: Jakub Kicinski @ 2018-05-24 7:41 UTC (permalink / raw)
To: davem; +Cc: alexei.starovoitov, netdev, oss-drivers, Jakub Kicinski
Passing O_CREAT (00000100) to open means we should also pass file
mode as the third parameter. Creating /dev/console as a regular
file may not be helpful anyway, so simply drop the flag when
opening debug_fd.
Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
net/bpfilter/main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
index 81bbc1684896..1317f108df8a 100644
--- a/net/bpfilter/main.c
+++ b/net/bpfilter/main.c
@@ -55,7 +55,7 @@ static void loop(void)
int main(void)
{
- debug_fd = open("/dev/console", 00000002 | 00000100);
+ debug_fd = open("/dev/console", 00000002);
dprintf(debug_fd, "Started bpfilter\n");
loop();
close(debug_fd);
--
2.17.0
* Re: WARNING in ip_recv_error
From: Paolo Abeni @ 2018-05-24 8:00 UTC (permalink / raw)
To: Willem de Bruijn, David Miller
Cc: Eric Dumazet, DaeLyong Jeong, Alexey Kuznetsov, Hideaki YOSHIFUJI,
Network Development, LKML, Byoungyoung Lee, Kyungtae Kim,
bammanag, Willem de Bruijn
In-Reply-To: <CAF=yD-KTfUbXGvU7qQy4=eHbuUB88=g_tQ8sp8TEebhW=rzKVQ@mail.gmail.com>
On Wed, 2018-05-23 at 11:40 -0400, Willem de Bruijn wrote:
> On Sun, May 20, 2018 at 7:13 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> > On Fri, May 18, 2018 at 2:59 PM, Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > > On Fri, May 18, 2018 at 2:46 PM, Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > On Fri, May 18, 2018 at 2:44 PM, Willem de Bruijn
> > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > On Fri, May 18, 2018 at 1:09 PM, Willem de Bruijn
> > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > On Fri, May 18, 2018 at 11:44 AM, David Miller <davem@davemloft.net> wrote:
> > > > > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > > Date: Fri, 18 May 2018 08:30:43 -0700
> > > > > > >
> > > > > > > > We probably need to revert Willem patch (7ce875e5ecb8562fd44040f69bda96c999e38bbc)
> > > > > > >
> > > > > > > Is it really valid to reach ip_recv_err with an ipv6 socket?
> > > > > >
> > > > > > I guess the issue is that setsockopt IPV6_ADDRFORM is not an
> > > > > > atomic operation, so that the socket is neither fully ipv4 nor fully
> > > > > > ipv6 by the time it reaches ip_recv_error.
> > > > > >
> > > > > > sk->sk_socket->ops = &inet_dgram_ops;
> > > > > > < HERE >
> > > > > > sk->sk_family = PF_INET;
> > > > > >
> > > > > > Even calling inet_recv_error to demux would not necessarily help.
> > > > > >
> > > > > > Safest would be to look up by skb->protocol, similar to what
> > > > > > ipv6_recv_error does to handle v4-mapped-v6.
> > > > > >
> > > > > > Or to make that function safe with PF_INET and swap the order
> > > > > > of the above two operations.
> > > > > >
> > > > > > All sound needlessly complicated for this rare socket option, but
> > > > > > I don't have a better idea yet. Dropping on the floor is not nice,
> > > > > > either.
> > > > >
> > > > > Ensuring that ip_recv_error correctly handles packets from either
> > > > > socket and removing the warning should indeed be good.
> > > > >
> > > > > It is robust against v4-mapped packets from an AF_INET6 socket,
> > > > > but see caveat on reconnect below.
> > > > >
> > > > > The code between ipv6_recv_error for v4-mapped addresses and
> > > > > ip_recv_error is essentially the same, the main difference being
> > > > > whether to return network headers as sockaddr_in with SOL_IP
> > > > > or sockaddr_in6 with SOL_IPV6.
> > > > >
> > > > > There are very few other locations in the stack that explicitly test
> > > > > sk_family in this way and thus would be vulnerable to races with
> > > > > IPV6_ADDRFORM.
> > > > >
> > > > > I'm not sure whether it is possible for a udpv6 socket to queue a
> > > > > real ipv6 packet on the error queue, disconnect, connect to an
> > > > > ipv4 address, call IPV6_ADDRFORM and then call ip_recv_error
> > > > > on a true ipv6 packet. That would return buggy data, e.g., in
> > > > > msg_name.
> > > >
> > > > In do_ipv6_setsockopt IPV6_ADDRFORM we can test that the
> > > > error queue is empty, and then take its lock for the duration of the
> > > > operation.
> > >
> > > Actually, no reason to hold the lock. This setsockopt holds the socket
> > > lock, which connect would need, too. So testing that the queue
> > > is empty after testing that it is connected to a v4 address is
> > > sufficient to ensure that no ipv6 packets are queued for reception.
> > >
> > > diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
> > > index 4d780c7f0130..a975d6311341 100644
> > > --- a/net/ipv6/ipv6_sockglue.c
> > > +++ b/net/ipv6/ipv6_sockglue.c
> > > @@ -199,6 +199,11 @@ static int do_ipv6_setsockopt(struct sock *sk,
> > > int level, int optname,
> > >
> > > if (ipv6_only_sock(sk) ||
> > > !ipv6_addr_v4mapped(&sk->sk_v6_daddr)) {
> > > retv = -EADDRNOTAVAIL;
> > > break;
> > > }
> > >
> > > + if (!skb_queue_empty(&sk->sk_error_queue)) {
> > > + retv = -EBUSY;
> > > + break;
> > > + }
> > > +
> > > fl6_free_socklist(sk);
> > > __ipv6_sock_mc_close(sk);
> > >
> > > After this it should be safe to remove the warning in ip_recv_error.
> >
> > Hmm.. nope.
> >
> > This ensures that the socket cannot produce any new true v6 packets.
> > But it does not guarantee that they are not already in the system, e.g.
> > queued in tc, and will find their way to the error queue later.
> >
> > We'll have to just be able to handle ipv6 packets in ip_recv_error.
> > Since IPV6_ADDRFORM is used to pass to legacy v4-only
> > processes and those likely are only confused by SOL_IPV6
> > error messages, it is probably best to just drop them and perhaps
> > WARN_ONCE.
>
> Even more fun, this is not limited to the error queue.
>
> I can queue a v6 packet for reception on a socket, connect to a v4
> address, call IPV6_ADDRFORM and then a regular recvfrom will
> return a partial v6 address as AF_INET.
>
> We definitely do not want to have to add a check
>
> if (skb->protocol == htons(ETH_P_IPV6)) {
> kfree_skb(skb);
> goto try_again;
> }
>
> to the normal recvmsg path.
>
> An alternative may be to tighten the check on when to allow
> IPV6_ADDRFORM. Not only return EBUSY if a packet is pending,
> but also if any sk_{rmem, omem, wmem}_alloc is non-zero. Only,
> these tightened constraints could break a legacy application.
I fear that condition will be very restrictive: for UDP sockets sk_rmem
can be zero only occasionally, after the first packet has been
received, due to the peculiar memory accounting - commit 6b229cf77d68
("udp: add batching to udp_rmem_release()").
Cheers,
Paolo
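
For illustration only, the tightened IPV6_ADDRFORM precondition discussed
above can be modelled outside the kernel. This is a minimal sketch, not the
proposed patch: struct toy_sock and toy_addrform_check() are hypothetical
names, and the fields merely stand in for the error queue length and the
sk_{rmem,wmem,omem}_alloc counters mentioned in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few socket fields under discussion. */
struct toy_sock {
	bool daddr_v4mapped;          /* connected to a v4-mapped address */
	unsigned int error_queue_len; /* packets pending on the error queue */
	unsigned int rmem_alloc;      /* receive memory still accounted */
	unsigned int wmem_alloc;      /* transmit memory still accounted */
	unsigned int omem_alloc;      /* option memory still accounted */
};

/* Return 0 if the conversion would be allowed, a negative errno otherwise. */
int toy_addrform_check(const struct toy_sock *sk)
{
	assert(sk != NULL);

	if (!sk->daddr_v4mapped)
		return -EADDRNOTAVAIL;
	if (sk->error_queue_len)
		return -EBUSY;
	/* The tightened variant: refuse while any memory is still charged. */
	if (sk->rmem_alloc || sk->wmem_alloc || sk->omem_alloc)
		return -EBUSY;
	return 0;
}
```

In this model a single byte left in rmem_alloc is enough to get -EBUSY, which
is the restrictiveness concern raised above for UDP sockets.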
^ permalink raw reply
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Jiri Pirko @ 2018-05-24 8:05 UTC (permalink / raw)
To: Ilias Apalodimas
Cc: netdev, grygorii.strashko, ivan.khoronzhuk, nsekhar, ivecera,
francois.ozog, yogeshs, spatton
In-Reply-To: <1527144984-31236-1-git-send-email-ilias.apalodimas@linaro.org>
Thu, May 24, 2018 at 08:56:20AM CEST, ilias.apalodimas@linaro.org wrote:
>Hello,
>
>This is adding a new mode on the cpsw driver based around switchdev.
>In order to enable this you need to enable CONFIG_NET_SWITCHDEV,
>CONFIG_BRIDGE_VLAN_FILTERING, CONFIG_TI_CPSW_SWITCHDEV
>and add to udev config:
>
>SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="0f011900", \
> ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}"
>Since the phys_switch_id is based on cpsw version, users with different
>version will need to do 'ip -d link show dev sw0p0 | grep switchid' and
>replace with the correct value.
>
>This patch creates 3 ports, sw0p0, sw0p1 and sw0p2.
>sw0p1 and sw0p2 are the netdev interfaces connected to PHY devices
>while sw0p0 is the switch 'cpu facing port'.
Any reason you need cpu port? We don't need it in mlxsw and also in dsa.
What is this device? Could you give me some pointer to description?
^ permalink raw reply
* Re: [PATCH bpf-next v4 02/10] bpf: powerpc64: pad function address loads with NOPs
From: Sandipan Das @ 2018-05-24 8:25 UTC (permalink / raw)
To: Daniel Borkmann
Cc: ast, netdev, linuxppc-dev, mpe, naveen.n.rao, jakub.kicinski
In-Reply-To: <43081c8c-d8a8-254a-69f0-7941acab90a3@iogearbox.net>
On 05/24/2018 01:04 PM, Daniel Borkmann wrote:
> On 05/24/2018 08:56 AM, Sandipan Das wrote:
>> For multi-function programs, loading the address of a callee
>> function to a register requires emitting instructions whose
>> count varies from one to five depending on the nature of the
>> address.
>>
>> Since we come to know of the callee's address only before the
>> extra pass, the number of instructions required to load this
>> address may vary from what was previously generated. This can
>> make the JITed image grow or shrink.
>>
>> To avoid this, we should generate a constant five-instruction
>> sequence when loading function addresses by padding the optimized
>> load sequence with NOPs.
>>
>> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/net/bpf_jit_comp64.c | 34 +++++++++++++++++++++++-----------
>> 1 file changed, 23 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
>> index 1bdb1aff0619..e4582744a31d 100644
>> --- a/arch/powerpc/net/bpf_jit_comp64.c
>> +++ b/arch/powerpc/net/bpf_jit_comp64.c
>> @@ -167,25 +167,37 @@ static void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)
>>
>> static void bpf_jit_emit_func_call(u32 *image, struct codegen_context *ctx, u64 func)
>> {
>> + unsigned int i, ctx_idx = ctx->idx;
>> +
>> + /* Load function address into r12 */
>> + PPC_LI64(12, func);
>> +
>> + /* For bpf-to-bpf function calls, the callee's address is unknown
>> + * until the last extra pass. As seen above, we use PPC_LI64() to
>> + * load the callee's address, but this may optimize the number of
>> + * instructions required based on the nature of the address.
>> + *
>> + * Since we don't want the number of instructions emitted to change,
>> + * we pad the optimized PPC_LI64() call with NOPs to guarantee that
>> + * we always have a five-instruction sequence, which is the maximum
>> + * that PPC_LI64() can emit.
>> + */
>> + for (i = ctx->idx - ctx_idx; i < 5; i++)
>> + PPC_NOP();
>
> By the way, I think you can still optimize this. The nops are not really
> needed in case of insn->src_reg != BPF_PSEUDO_CALL since the address of
> a normal BPF helper call will always be at a fixed location and known a
> priori.
>
Ah, true. Thanks for pointing this out. There are a few other things that
we are planning to do for the ppc64 JIT compiler. Will put out a patch for
this with that series.
- Sandipan
>> #ifdef PPC64_ELF_ABI_v1
>> - /* func points to the function descriptor */
>> - PPC_LI64(b2p[TMP_REG_2], func);
>> - /* Load actual entry point from function descriptor */
>> - PPC_BPF_LL(b2p[TMP_REG_1], b2p[TMP_REG_2], 0);
>> - /* ... and move it to LR */
>> - PPC_MTLR(b2p[TMP_REG_1]);
>> /*
>> * Load TOC from function descriptor at offset 8.
>> * We can clobber r2 since we get called through a
>> * function pointer (so caller will save/restore r2)
>> * and since we don't use a TOC ourself.
>> */
>> - PPC_BPF_LL(2, b2p[TMP_REG_2], 8);
>> -#else
>> - /* We can clobber r12 */
>> - PPC_FUNC_ADDR(12, func);
>> - PPC_MTLR(12);
>> + PPC_BPF_LL(2, 12, 8);
>> + /* Load actual entry point from function descriptor */
>> + PPC_BPF_LL(12, 12, 0);
>> #endif
>> +
>> + PPC_MTLR(12);
>> PPC_BLRL();
>> }
>>
>>
>
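
As a rough illustration of the padding idea in the quoted patch, here is a
standalone sketch that can be built in userspace. It is not the JIT code:
toy_ctx, toy_emit(), toy_load_addr() and toy_load_addr_padded() are
hypothetical names, and toy_load_addr() only pretends to be PPC_LI64() by
emitting a caller-chosen number of placeholder words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NOP         0x60000000u /* PowerPC "ori 0,0,0" */
#define TOY_PLACEHOLDER 0x38000000u /* arbitrary word, not a real load */
#define TOY_LOAD_MAX    5           /* longest load sequence, per the patch */

struct toy_ctx {
	unsigned int idx; /* next free instruction slot in the image */
};

/* Emit one instruction word; image may be NULL during a sizing pass. */
void toy_emit(uint32_t *image, struct toy_ctx *ctx, uint32_t insn)
{
	if (image)
		image[ctx->idx] = insn;
	ctx->idx++;
}

/* Stand-in for PPC_LI64(): emits n_insns placeholder words (1..5). */
void toy_load_addr(uint32_t *image, struct toy_ctx *ctx, unsigned int n_insns)
{
	assert(n_insns >= 1 && n_insns <= TOY_LOAD_MAX);
	for (unsigned int i = 0; i < n_insns; i++)
		toy_emit(image, ctx, TOY_PLACEHOLDER);
}

/* Same load, padded with NOPs so it always occupies TOY_LOAD_MAX slots. */
void toy_load_addr_padded(uint32_t *image, struct toy_ctx *ctx,
			  unsigned int n_insns)
{
	unsigned int start = ctx->idx;

	toy_load_addr(image, ctx, n_insns);
	for (unsigned int i = ctx->idx - start; i < TOY_LOAD_MAX; i++)
		toy_emit(image, ctx, TOY_NOP);
}
```

Whatever n_insns is, ctx->idx advances by exactly five, so a later pass that
learns a different address cannot change the image size in this model.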
^ permalink raw reply
* suspicious csum initialization in vmxnet3_rx_csum
From: Paolo Abeni @ 2018-05-24 8:47 UTC (permalink / raw)
To: Ronak Doshi, Shrikrishna Khare, pv-drivers; +Cc: netdev, Neil Horman
Hi all,
we are hitting the BUG() condition in skb_checksum_help() -
skb_checksum_start_offset(skb) >= skb_headlen(skb) for skb received
from vmxnet3 and queued from OVS to user-space.
I think that the root cause is in vmxnet3_rx_csum():
if (gdesc->rcd.csum) {
skb->csum = htons(gdesc->rcd.csum);
skb->ip_summed = CHECKSUM_PARTIAL;
CHECKSUM_PARTIAL looks suspicious here, as the csum field is
initialized, instead of csum_offset/csum_start. To be honest, I also find
it strange that the csum value is converted from host byte order.
My wild guess is that something like the patch below should fix the issue,
but I'm not familiar with the vmxnet3 code. Can you please have a look?
Thank you,
Paolo
---
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 9ebe2a689966..06ade074c32c 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -1172,7 +1172,7 @@ vmxnet3_rx_csum(struct vmxnet3_adapter *adapter,
} else {
if (gdesc->rcd.csum) {
skb->csum = htons(gdesc->rcd.csum);
- skb->ip_summed = CHECKSUM_PARTIAL;
+ skb->ip_summed = CHECKSUM_COMPLETE;
} else {
skb_checksum_none_assert(skb);
}
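
Background for the suspicion above, as a userspace sketch rather than driver
code: in struct sk_buff the csum field shares storage with
csum_start/csum_offset, so a value written to csum is what a
CHECKSUM_PARTIAL consumer will read back as start/offset. The struct and
function names below are hypothetical, and the byte-order conversion from
the driver is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of the checksum fields in struct sk_buff. */
struct toy_skb {
	union {
		uint32_t csum;                /* read for CHECKSUM_COMPLETE */
		struct {
			uint16_t csum_start;  /* read for CHECKSUM_PARTIAL */
			uint16_t csum_offset;
		};
	};
};

/* Store a device-reported checksum, as the rx path quoted above does. */
void toy_store_hw_csum(struct toy_skb *skb, uint16_t hw_csum)
{
	assert(skb != NULL);
	skb->csum = hw_csum;
}

/* True if the PARTIAL view of the same storage now holds nonzero values. */
bool toy_partial_view_is_dirty(const struct toy_skb *skb)
{
	assert(skb != NULL);
	return skb->csum_start != 0 || skb->csum_offset != 0;
}
```

Any nonzero checksum makes the start/offset view nonzero too, which is at
least consistent with the skb_checksum_start_offset() check tripping later.
Whether CHECKSUM_COMPLETE is the right replacement is a separate question for
the vmxnet3 maintainers.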
^ permalink raw reply related
* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Ilias Apalodimas @ 2018-05-24 8:48 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, grygorii.strashko, ivan.khoronzhuk, nsekhar, ivecera,
francois.ozog, yogeshs, spatton
In-Reply-To: <20180524080528.GD2295@nanopsycho>
On Thu, May 24, 2018 at 10:05:28AM +0200, Jiri Pirko wrote:
> Thu, May 24, 2018 at 08:56:20AM CEST, ilias.apalodimas@linaro.org wrote:
> Any reason you need cpu port? We don't need it in mlxsw and also in dsa.
Yes, I've seen that in the mlxsw/rocker drivers and I was reluctant to add one here.
The reason is that TI wants this configured differently from customer-facing
ports. Apparently there are existing customers already using the "feature".
So OR'ing and adding the cpu port on every operation (add/del vlans, add
ucast/mcast entries, etc.) was less favoured.
>
> What is this device? Could you give me some pointer to description?
This is the switch used on TI's AM5728 and BBB boards. I am pretty sure there
are other platforms I am not aware of.
http://www.ti.com/lit/ug/spruhz6j/spruhz6j.pdf is the technical reference
manual. Section 24.11.5.4 "Initialization and Configuration of CPSW" is the
switch part.
Thanks,
Ilias
^ permalink raw reply
* [PATCH net-next] net: bridge: add support for port isolation
From: Nikolay Aleksandrov @ 2018-05-24 8:56 UTC (permalink / raw)
To: netdev; +Cc: roopa, davem, stephen, bridge, Nikolay Aleksandrov
This patch adds support for a new port flag - BR_ISOLATED. If it is set
then isolated ports cannot communicate between each other, but they can
still communicate with non-isolated ports. The same can be achieved via
ACLs but they can't scale with large number of ports and also the
complexity of the rules grows. This feature can be used to achieve
isolated vlan functionality (similar to pvlan) as well, though currently
it will be port-wide (for all vlans on the port). The new test in
should_deliver uses data that is already cache hot and the new boolean
is used to avoid an additional source port test in should_deliver.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
include/linux/if_bridge.h | 1 +
include/uapi/linux/if_link.h | 1 +
net/bridge/br_forward.c | 3 ++-
net/bridge/br_input.c | 1 +
net/bridge/br_netlink.c | 9 ++++++++-
net/bridge/br_private.h | 9 +++++++++
net/bridge/br_sysfs_if.c | 2 ++
7 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 585d27182425..7843b98e1c6e 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -50,6 +50,7 @@ struct br_ip_list {
#define BR_VLAN_TUNNEL BIT(13)
#define BR_BCAST_FLOOD BIT(14)
#define BR_NEIGH_SUPPRESS BIT(15)
+#define BR_ISOLATED BIT(16)
#define BR_DEFAULT_AGEING_TIME (300 * HZ)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index b85266420bfb..cf01b6824244 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -333,6 +333,7 @@ enum {
IFLA_BRPORT_BCAST_FLOOD,
IFLA_BRPORT_GROUP_FWD_MASK,
IFLA_BRPORT_NEIGH_SUPPRESS,
+ IFLA_BRPORT_ISOLATED,
__IFLA_BRPORT_MAX
};
#define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 7a7fd672ccf2..9019f326fe81 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -30,7 +30,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
vg = nbp_vlan_group_rcu(p);
return ((p->flags & BR_HAIRPIN_MODE) || skb->dev != p->dev) &&
br_allowed_egress(vg, skb) && p->state == BR_STATE_FORWARDING &&
- nbp_switchdev_allowed_egress(p, skb);
+ nbp_switchdev_allowed_egress(p, skb) &&
+ !br_skb_isolated(p, skb);
}
int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 7f98a7d25866..72074276c088 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -114,6 +114,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
goto drop;
BR_INPUT_SKB_CB(skb)->brdev = br->dev;
+ BR_INPUT_SKB_CB(skb)->src_port_isolated = !!(p->flags & BR_ISOLATED);
if (IS_ENABLED(CONFIG_INET) &&
(skb->protocol == htons(ETH_P_ARP) ||
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 015f465c514b..9f5eb05b0373 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -139,6 +139,7 @@ static inline size_t br_port_info_size(void)
+ nla_total_size(1) /* IFLA_BRPORT_PROXYARP_WIFI */
+ nla_total_size(1) /* IFLA_BRPORT_VLAN_TUNNEL */
+ nla_total_size(1) /* IFLA_BRPORT_NEIGH_SUPPRESS */
+ + nla_total_size(1) /* IFLA_BRPORT_ISOLATED */
+ nla_total_size(sizeof(struct ifla_bridge_id)) /* IFLA_BRPORT_ROOT_ID */
+ nla_total_size(sizeof(struct ifla_bridge_id)) /* IFLA_BRPORT_BRIDGE_ID */
+ nla_total_size(sizeof(u16)) /* IFLA_BRPORT_DESIGNATED_PORT */
@@ -213,7 +214,8 @@ static int br_port_fill_attrs(struct sk_buff *skb,
BR_VLAN_TUNNEL)) ||
nla_put_u16(skb, IFLA_BRPORT_GROUP_FWD_MASK, p->group_fwd_mask) ||
nla_put_u8(skb, IFLA_BRPORT_NEIGH_SUPPRESS,
- !!(p->flags & BR_NEIGH_SUPPRESS)))
+ !!(p->flags & BR_NEIGH_SUPPRESS)) ||
+ nla_put_u8(skb, IFLA_BRPORT_ISOLATED, !!(p->flags & BR_ISOLATED)))
return -EMSGSIZE;
timerval = br_timer_value(&p->message_age_timer);
@@ -660,6 +662,7 @@ static const struct nla_policy br_port_policy[IFLA_BRPORT_MAX + 1] = {
[IFLA_BRPORT_VLAN_TUNNEL] = { .type = NLA_U8 },
[IFLA_BRPORT_GROUP_FWD_MASK] = { .type = NLA_U16 },
[IFLA_BRPORT_NEIGH_SUPPRESS] = { .type = NLA_U8 },
+ [IFLA_BRPORT_ISOLATED] = { .type = NLA_U8 },
};
/* Change the state of the port and notify spanning tree */
@@ -810,6 +813,10 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
if (err)
return err;
+ err = br_set_port_flag(p, tb, IFLA_BRPORT_ISOLATED, BR_ISOLATED);
+ if (err)
+ return err;
+
br_port_flags_change(p, old_flags ^ p->flags);
return 0;
}
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 742f40aefdaf..11520ed528b0 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -423,6 +423,7 @@ struct br_input_skb_cb {
#endif
bool proxyarp_replied;
+ bool src_port_isolated;
#ifdef CONFIG_BRIDGE_VLAN_FILTERING
bool vlan_filtered;
@@ -574,6 +575,14 @@ int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb);
void br_flood(struct net_bridge *br, struct sk_buff *skb,
enum br_pkt_type pkt_type, bool local_rcv, bool local_orig);
+/* return true if both source port and dest port are isolated */
+static inline bool br_skb_isolated(const struct net_bridge_port *to,
+ const struct sk_buff *skb)
+{
+ return BR_INPUT_SKB_CB(skb)->src_port_isolated &&
+ (to->flags & BR_ISOLATED);
+}
+
/* br_if.c */
void br_port_carrier_check(struct net_bridge_port *p, bool *notified);
int br_add_bridge(struct net *net, const char *name);
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index fd31ad83ec7b..f99c5bf5c906 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -192,6 +192,7 @@ BRPORT_ATTR_FLAG(proxyarp_wifi, BR_PROXYARP_WIFI);
BRPORT_ATTR_FLAG(multicast_flood, BR_MCAST_FLOOD);
BRPORT_ATTR_FLAG(broadcast_flood, BR_BCAST_FLOOD);
BRPORT_ATTR_FLAG(neigh_suppress, BR_NEIGH_SUPPRESS);
+BRPORT_ATTR_FLAG(isolated, BR_ISOLATED);
#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
static ssize_t show_multicast_router(struct net_bridge_port *p, char *buf)
@@ -243,6 +244,7 @@ static const struct brport_attribute *brport_attrs[] = {
&brport_attr_broadcast_flood,
&brport_attr_group_fwd_mask,
&brport_attr_neigh_suppress,
+ &brport_attr_isolated,
NULL
};
--
2.11.0
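
For readers skimming the patch, the forwarding rule it adds reduces to a
small truth table: a frame is blocked only when both the source and the
destination port are isolated. A minimal userspace sketch of that rule
follows. toy_port and toy_skb_isolated() are hypothetical names; the bit
position mirrors BR_ISOLATED from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BR_ISOLATED (1UL << 16) /* same bit as BR_ISOLATED in the patch */

/* Hypothetical stand-in for the flags word of a bridge port. */
struct toy_port {
	unsigned long flags;
};

/* True if a frame from 'from' must not be delivered to 'to'. */
bool toy_skb_isolated(const struct toy_port *from, const struct toy_port *to)
{
	assert(from != NULL && to != NULL);

	/* Source state is sampled once on input, then tested per egress port. */
	bool src_port_isolated = !!(from->flags & TOY_BR_ISOLATED);

	return src_port_isolated && (to->flags & TOY_BR_ISOLATED);
}
```

Isolated to non-isolated and non-isolated to isolated both return false, so
only traffic between two isolated ports is dropped.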
^ permalink raw reply related
* Re: INFO: rcu detected stall in corrupted
From: Xin Long @ 2018-05-24 9:02 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Eric Dumazet, David Miller, syzbot+f116bc1994efe725d51b, kuznet,
LKML, network dev, syzkaller-bugs, yoshfuji, dsahern,
Roopa Prabhu, linux-sctp
In-Reply-To: <20180523231340.GN5488@localhost.localdomain>
On Thu, May 24, 2018 at 7:13 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Mon, May 21, 2018 at 11:13:46AM -0700, Eric Dumazet wrote:
>>
>>
>> On 05/21/2018 11:09 AM, David Miller wrote:
>> > From: syzbot <syzbot+f116bc1994efe725d51b@syzkaller.appspotmail.com>
>> > Date: Mon, 21 May 2018 11:05:02 -0700
>> >
>> >> find_match+0x244/0x13a0 net/ipv6/route.c:691
>> >> find_rr_leaf net/ipv6/route.c:729 [inline]
>> >> rt6_select net/ipv6/route.c:779 [inline]
>> >
>> > Hmmm, endless loop in find_rr_leaf or similar?
>> >
>>
>>
>> I do not think so, this really looks like SCTP specific
>> , we now have dozens of traces all sharing :
>>
>> sctp_transport_route+0xad/0x450 net/sctp/transport.c:293
>> sctp_packet_config+0xb89/0xfd0 net/sctp/output.c:123
>> sctp_outq_flush+0x79c/0x4370 net/sctp/outqueue.c:894
>> sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
>> sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
>> sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>> sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
>> sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
>> call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
>>
>>
>> Some kind of infinite loop.
>>
>> When the hrtimer fires, it can point to any code that sits below but does not necessarily have a bug.
>
> Agreed. Xin Long identified the root cause. syzkaller is setting too
> aggressive parameters to SCTP RTO, leading to issues with the
> heartbeat timer.
Right, I will prepare a fix soon with your suggestion rto_min value "HZ/5"
Thanks.
^ permalink raw reply
* [PATCH v2 net-next] sfc: stop the TX queue before pushing new buffers
From: Martin Habets @ 2018-05-24 9:14 UTC (permalink / raw)
To: linux-net-drivers, davem; +Cc: netdev, jarod
In-Reply-To: <651409eb-6faa-0224-e521-5b14d6913c9a@solarflare.com>
efx_enqueue_skb() can push new buffers for the xmit_more functionality.
We must stop the TX queue before this or else the TX queue does not get
restarted and we get a netdev watchdog.
In the error handling we may now need to unwind more than 1 packet, and
we may need to push the new buffers onto the partner queue.
v2: In the error leg also push this queue if xmit_more is set
Fixes: e9117e5099ea ("sfc: Firmware-Assisted TSO version 2")
Reported-by: Jarod Wilson <jarod@redhat.com>
Tested-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Martin Habets <mhabets@solarflare.com>
---
Dave, could you please also queue this patch up for stable?
drivers/net/ethernet/sfc/tx.c | 33 +++++++++++++++++++++++++--------
1 file changed, 25 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index cece961f2e82..c3ad564ac4c0 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -435,17 +435,18 @@ static int efx_tx_map_data(struct efx_tx_queue *tx_queue, struct sk_buff *skb,
} while (1);
}
-/* Remove buffers put into a tx_queue. None of the buffers must have
- * an skb attached.
+/* Remove buffers put into a tx_queue for the current packet.
+ * None of the buffers must have an skb attached.
*/
-static void efx_enqueue_unwind(struct efx_tx_queue *tx_queue)
+static void efx_enqueue_unwind(struct efx_tx_queue *tx_queue,
+ unsigned int insert_count)
{
struct efx_tx_buffer *buffer;
unsigned int bytes_compl = 0;
unsigned int pkts_compl = 0;
/* Work backwards until we hit the original insert pointer value */
- while (tx_queue->insert_count != tx_queue->write_count) {
+ while (tx_queue->insert_count != insert_count) {
--tx_queue->insert_count;
buffer = __efx_tx_queue_get_insert_buffer(tx_queue);
efx_dequeue_buffer(tx_queue, buffer, &pkts_compl, &bytes_compl);
@@ -504,6 +505,8 @@ static int efx_tx_tso_fallback(struct efx_tx_queue *tx_queue,
*/
netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
{
+ unsigned int old_insert_count = tx_queue->insert_count;
+ bool xmit_more = skb->xmit_more;
bool data_mapped = false;
unsigned int segments;
unsigned int skb_len;
@@ -553,8 +556,10 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
/* Update BQL */
netdev_tx_sent_queue(tx_queue->core_txq, skb_len);
+ efx_tx_maybe_stop_queue(tx_queue);
+
/* Pass off to hardware */
- if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq)) {
+ if (!xmit_more || netif_xmit_stopped(tx_queue->core_txq)) {
struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
/* There could be packets left on the partner queue if those
@@ -577,14 +582,26 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
tx_queue->tx_packets++;
}
- efx_tx_maybe_stop_queue(tx_queue);
-
return NETDEV_TX_OK;
err:
- efx_enqueue_unwind(tx_queue);
+ efx_enqueue_unwind(tx_queue, old_insert_count);
dev_kfree_skb_any(skb);
+
+ /* If we're not expecting another transmit and we had something to push
+ * on this queue or a partner queue then we need to push here to get the
+ * previous packets out.
+ */
+ if (!xmit_more) {
+ struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
+
+ if (txq2->xmit_more_available)
+ efx_nic_push_buffers(txq2);
+
+ efx_nic_push_buffers(tx_queue);
+ }
+
return NETDEV_TX_OK;
}
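
The change to efx_enqueue_unwind() moves the unwind target from write_count
to the insert_count saved on entry. A toy model of the difference, with
hypothetical names (toy_txq, toy_unwind) and plain counters standing in for
the real descriptor ring:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters the unwind loop looks at. */
struct toy_txq {
	unsigned int insert_count; /* next slot to fill */
	unsigned int write_count;  /* slots already pushed to hardware */
};

/* Walk insert_count back to 'target'; return how many buffers were released. */
unsigned int toy_unwind(struct toy_txq *q, unsigned int target)
{
	unsigned int released = 0;

	assert(q != NULL);
	while (q->insert_count != target) {
		--q->insert_count;
		released++;
	}
	return released;
}
```

With write_count at 10, an earlier xmit_more packet occupying slots 10..12
and a failing packet occupying 13..14, unwinding to the saved value 13
releases two buffers, while unwinding to write_count would release five and
take the earlier packet with it.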
^ permalink raw reply related
* Re: [PATCH bpf-next v4 02/10] bpf: powerpc64: pad function address loads with NOPs
From: Daniel Borkmann @ 2018-05-24 9:25 UTC (permalink / raw)
To: Sandipan Das; +Cc: ast, netdev, linuxppc-dev, mpe, naveen.n.rao, jakub.kicinski
In-Reply-To: <8826a2e1-71c5-06fc-7a66-3c33c1a54c78@linux.vnet.ibm.com>
On 05/24/2018 10:25 AM, Sandipan Das wrote:
> On 05/24/2018 01:04 PM, Daniel Borkmann wrote:
>> On 05/24/2018 08:56 AM, Sandipan Das wrote:
>>> For multi-function programs, loading the address of a callee
>>> function to a register requires emitting instructions whose
>>> count varies from one to five depending on the nature of the
>>> address.
>>>
>>> Since we come to know of the callee's address only before the
>>> extra pass, the number of instructions required to load this
>>> address may vary from what was previously generated. This can
>>> make the JITed image grow or shrink.
>>>
>>> To avoid this, we should generate a constant five-instruction
>>> sequence when loading function addresses by padding the optimized
>>> load sequence with NOPs.
>>>
>>> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
>>> ---
>>> arch/powerpc/net/bpf_jit_comp64.c | 34 +++++++++++++++++++++++-----------
>>> 1 file changed, 23 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
>>> index 1bdb1aff0619..e4582744a31d 100644
>>> --- a/arch/powerpc/net/bpf_jit_comp64.c
>>> +++ b/arch/powerpc/net/bpf_jit_comp64.c
>>> @@ -167,25 +167,37 @@ static void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)
>>>
>>> static void bpf_jit_emit_func_call(u32 *image, struct codegen_context *ctx, u64 func)
>>> {
>>> + unsigned int i, ctx_idx = ctx->idx;
>>> +
>>> + /* Load function address into r12 */
>>> + PPC_LI64(12, func);
>>> +
>>> + /* For bpf-to-bpf function calls, the callee's address is unknown
>>> + * until the last extra pass. As seen above, we use PPC_LI64() to
>>> + * load the callee's address, but this may optimize the number of
>>> + * instructions required based on the nature of the address.
>>> + *
>>> + * Since we don't want the number of instructions emitted to change,
>>> + * we pad the optimized PPC_LI64() call with NOPs to guarantee that
>>> + * we always have a five-instruction sequence, which is the maximum
>>> + * that PPC_LI64() can emit.
>>> + */
>>> + for (i = ctx->idx - ctx_idx; i < 5; i++)
>>> + PPC_NOP();
>>
>> By the way, I think you can still optimize this. The nops are not really
>> needed in case of insn->src_reg != BPF_PSEUDO_CALL since the address of
>> a normal BPF helper call will always be at a fixed location and known a
>> priori.
>
> Ah, true. Thanks for pointing this out. There are a few other things that
> we are planning to do for the ppc64 JIT compiler. Will put out a patch for
> this with that series.
Awesome, thanks Sandipan!
^ permalink raw reply
* Re: [PATCH v2 05/13] mtd: rawnand: marvell: remove the dmaengine compat need
From: Miquel Raynal @ 2018-05-24 9:30 UTC (permalink / raw)
To: Robert Jarzmik
Cc: Daniel Mack, Haojian Zhuang, Ezequiel Garcia, Boris Brezillon,
David Woodhouse, Brian Norris, Marek Vasut, Richard Weinberger,
Liam Girdwood, Mark Brown, Arnd Bergmann, alsa-devel, netdev,
linux-mmc, linux-kernel, linux-ide, linux-mtd, dmaengine,
linux-arm-kernel, linux-media
In-Reply-To: <20180524070703.11901-6-robert.jarzmik@free.fr>
Hi Robert,
On Thu, 24 May 2018 09:06:55 +0200, Robert Jarzmik
<robert.jarzmik@free.fr> wrote:
> As the pxa architecture switched towards the dmaengine slave map, the
> old compatibility mechanism to acquire the dma requestor line number and
> priority are not needed anymore.
>
> This patch simplifies the dma resource acquisition, using the more
> generic function dma_request_slave_channel().
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
> ---
> drivers/mtd/nand/raw/marvell_nand.c | 17 +----------------
> 1 file changed, 1 insertion(+), 16 deletions(-)
>
Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
Thanks,
Miquèl
^ permalink raw reply
* Re: STMMAC driver with TSO enabled issue
From: Jose Abreu @ 2018-05-24 9:31 UTC (permalink / raw)
To: Bhadram Varka, Jose Abreu, netdev@vger.kernel.org, Joao Pinto
In-Reply-To: <c1c9025e-f75b-9ae1-4513-24615308e0af@nvidia.com>
Hi Bhadram,
On 24-05-2018 06:58, Bhadram Varka wrote:
>
> After some time if check Tx descriptor status - then I see only
> below
>
> [..]
> [85788.286730] 027 [0x827951b0]: 0xf854f000 0x0 0x16d8 0x90000000
>
> index 025 and 026 descriptors processed but not index 027.
>
> At this stage Tx DMA is always in below state -
>
> ■ 3'b011: Running (Reading Data from system memory
> buffer and queuing it to the Tx buffer (Tx FIFO))
That's strange, I think the descriptors look okay though. I will
need the registers values (before the lock) and, if possible, the
git bisect output.
Thanks and Best Regards,
Jose Miguel Abreu
>
> Thanks,
> Bhadram.
^ permalink raw reply
* Re: [PATCH net-next v3 0/7] Add support for QCA8334 switch
From: Michal Vokáč @ 2018-05-24 9:34 UTC (permalink / raw)
To: Florian Fainelli, andrew
Cc: netdev, linux-kernel, devicetree, vivien.didelot, mark.rutland,
robh+dt, davem, michal.vokac
In-Reply-To: <29ea7cdd-5690-e39a-a2ef-4f48fbcb7659@gmail.com>
On 23.5.2018 17:39, Florian Fainelli wrote:
>
>
> On 05/22/2018 11:20 PM, Michal Vokáč wrote:
>> This series basically adds support for a QCA8334 ethernet switch to the
>> qca8k driver. It is a four-port variant of the already supported seven
>> port QCA8337. Register map is the same for the whole family and all chips
>> have the same device ID.
>>
>> Major part of this series enhances the CPU port setting. Currently the CPU
>> port is not set to any sensible defaults compatible with the xGMII
>> interface. This series forces the CPU port to its maximum bandwidth and
>> also allows to adjust the new defaults using fixed-link device tree
>> sub-node.
>>
>> Alongside these changes I fixed two checkpatch warnings regarding SPDX and
>> redundant parentheses.
>
> Looks great, thanks Michal! Do you have any features or things you are
> working on that would be added later to the driver?
Thank you too, Florian. And also a big thank you to Andrew. You helped me
a lot to debug the RGMII issue. I had been stuck on that for more than
a month and would not have resolved it without your help.
As I have done this in a process of upgrading our BSP to a more recent
kernel, and hopefully mainline, I now need to move on to other parts of
the board. So unfortunately no, I do not have any other enhancements
planned to this driver for now. But as we are probably one of the few
with access to the NDA covered Qualcomm documentation I see a great
opportunity to work on that later. I am afraid "later" means something
like next year in this case as I am basically the only kernel developer
in our company and not yet very experienced.
Thank you all for your time,
Michal
^ permalink raw reply
* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Tariq Toukan @ 2018-05-24 9:45 UTC (permalink / raw)
To: Qing Huang, tariqt, davem, haakon.bugge, yanjun.zhu
Cc: netdev, linux-rdma, linux-kernel, gi-oh.kim
In-Reply-To: <20180523232246.20445-1-qing.huang@oracle.com>
On 24/05/2018 2:22 AM, Qing Huang wrote:
> When a system is under memory pressure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
>
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
>
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
>
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.
>
> Signed-off-by: Qing Huang <qing.huang@oracle.com>
> Acked-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
> ---
> v4: use kvzalloc instead of vzalloc
> add one err condition check
> don't include vmalloc.h any more
>
> v3: use PAGE_SIZE instead of PAGE_SHIFT
> add comma to the end of enum variables
> include vmalloc.h header file to avoid build issues on Sparc
>
> v2: adjusted chunk size to reflect different architectures
>
> drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index a822f7a..685337d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,12 @@
> #include "fw.h"
>
> /*
> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
> - * per chunk.
> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
> + * order memory allocations in fragmented/high usage memory situation.
> */
> enum {
> - MLX4_ICM_ALLOC_SIZE = 1 << 18,
> - MLX4_TABLE_CHUNK_SIZE = 1 << 18
> + MLX4_ICM_ALLOC_SIZE = PAGE_SIZE,
> + MLX4_TABLE_CHUNK_SIZE = PAGE_SIZE,
> };
>
> static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
> @@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
> u64 size;
>
> obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
> + if (WARN_ON(!obj_per_chunk))
> + return -EINVAL;
> num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>
> - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
> + table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
> if (!table->icm)
> return -ENOMEM;
> table->virt = virt;
> @@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
> mlx4_free_icm(dev, table->icm[i], use_coherent);
> }
>
> - kfree(table->icm);
> + kvfree(table->icm);
>
> return -ENOMEM;
> }
> @@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
> mlx4_free_icm(dev, table->icm[i], table->coherent);
> }
>
> - kfree(table->icm);
> + kvfree(table->icm);
> }
>
Thanks Qing.
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
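
The sizing argument in the commit message can be checked with a few lines of
arithmetic. The sketch below is standalone and only mirrors the division in
mlx4_init_icm_table(); toy_num_icm() and toy_icm_array_bytes() are
hypothetical helper names, and the 8-byte pointer size is an assumption for
a 64-bit build.

```c
#include <assert.h>
#include <stdint.h>

/* Number of ICM chunks needed for nobj objects, rounding up. */
uint64_t toy_num_icm(uint64_t nobj, uint64_t chunk_size, uint64_t obj_size)
{
	uint64_t obj_per_chunk = chunk_size / obj_size;

	assert(obj_per_chunk != 0); /* the patch turns this case into -EINVAL */
	return (nobj + obj_per_chunk - 1) / obj_per_chunk;
}

/* Bytes needed for the table->icm pointer array. */
uint64_t toy_icm_array_bytes(uint64_t nobj, uint64_t chunk_size,
			     uint64_t obj_size, uint64_t ptr_size)
{
	return toy_num_icm(nobj, chunk_size, obj_size) * ptr_size;
}
```

For log_num_mtt=30 with 8-byte entries, 4KB chunks give 2^21 chunks and a
16MB pointer array, while the previous 256KB chunks give 2^15 chunks and a
256KB array, which is why the smaller chunk size needs the kvzalloc fallback.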
^ permalink raw reply
* Re: [PATCH 3/4] cpsw_switchdev: add switchdev support files
From: Maxim Uvarov @ 2018-05-24 10:00 UTC (permalink / raw)
To: Ilias Apalodimas
Cc: netdev, grygorii.strashko, ivan.khoronzhuk, nsekhar, jiri,
ivecera, francois.ozog, yogeshs, spatton
In-Reply-To: <1527144984-31236-4-git-send-email-ilias.apalodimas@linaro.org>
2018-05-24 9:56 GMT+03:00 Ilias Apalodimas <ilias.apalodimas@linaro.org>:
> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> ---
> drivers/net/ethernet/ti/Kconfig | 9 +
> drivers/net/ethernet/ti/Makefile | 1 +
> drivers/net/ethernet/ti/cpsw_switchdev.c | 299 +++++++++++++++++++++++++++++++
> drivers/net/ethernet/ti/cpsw_switchdev.h | 4 +
> 4 files changed, 313 insertions(+)
> create mode 100644 drivers/net/ethernet/ti/cpsw_switchdev.c
> create mode 100644 drivers/net/ethernet/ti/cpsw_switchdev.h
>
> diff --git a/drivers/net/ethernet/ti/Kconfig b/drivers/net/ethernet/ti/Kconfig
> index 48a541e..b22ae7d 100644
> --- a/drivers/net/ethernet/ti/Kconfig
> +++ b/drivers/net/ethernet/ti/Kconfig
> @@ -73,6 +73,15 @@ config TI_CPSW
> To compile this driver as a module, choose M here: the module
> will be called cpsw.
>
> +config TI_CPSW_SWITCHDEV
> + bool "TI CPSW switchdev support"
> + depends on TI_CPSW
> + depends on NET_SWITCHDEV
> + help
> + Enable switchdev support on TI's CPSW Ethernet Switch.
> +
> + This will allow you to configure the switch using standard tools.
> +
> config TI_CPTS
> bool "TI Common Platform Time Sync (CPTS) Support"
> depends on TI_CPSW || TI_KEYSTONE_NETCP
> diff --git a/drivers/net/ethernet/ti/Makefile b/drivers/net/ethernet/ti/Makefile
> index 0be551d..3926c6a 100644
> --- a/drivers/net/ethernet/ti/Makefile
> +++ b/drivers/net/ethernet/ti/Makefile
> @@ -15,6 +15,7 @@ obj-$(CONFIG_TI_CPSW_PHY_SEL) += cpsw-phy-sel.o
> obj-$(CONFIG_TI_CPSW_ALE) += cpsw_ale.o
> obj-$(CONFIG_TI_CPTS_MOD) += cpts.o
> obj-$(CONFIG_TI_CPSW) += ti_cpsw.o
> +ti_cpsw-objs:= cpsw_switchdev.o
> ti_cpsw-y := cpsw.o
>
> obj-$(CONFIG_TI_KEYSTONE_NETCP) += keystone_netcp.o
> diff --git a/drivers/net/ethernet/ti/cpsw_switchdev.c b/drivers/net/ethernet/ti/cpsw_switchdev.c
> new file mode 100644
> index 0000000..bf8c1bf
> --- /dev/null
> +++ b/drivers/net/ethernet/ti/cpsw_switchdev.c
> @@ -0,0 +1,299 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Texas Instruments switchdev Driver
> + *
> + * Copyright (C) 2018 Texas Instruments
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation version 2.
> + *
> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> + * kind, whether express or implied; without even the implied warranty
> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/etherdevice.h>
> +#include <linux/if_bridge.h>
> +#include <net/switchdev.h>
> +#include "cpsw.h"
> +#include "cpsw_priv.h"
> +#include "cpsw_ale.h"
> +
> +static u32 cpsw_switchdev_get_ver(struct net_device *ndev)
> +{
> + struct cpsw_priv *priv = netdev_priv(ndev);
> + struct cpsw_common *cpsw = priv->cpsw;
> +
> + return cpsw->version;
> +}
> +
> +static int cpsw_port_attr_set(struct net_device *dev,
> + const struct switchdev_attr *attr,
> + struct switchdev_trans *trans)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static int cpsw_port_attr_get(struct net_device *dev,
> + struct switchdev_attr *attr)
> +{
> + u32 cpsw_ver;
> + int err = 0;
> +
> + switch (attr->id) {
> + case SWITCHDEV_ATTR_ID_PORT_PARENT_ID:
> + cpsw_ver = cpsw_switchdev_get_ver(dev);
> + attr->u.ppid.id_len = sizeof(cpsw_ver);
> + memcpy(&attr->u.ppid.id, &cpsw_ver, attr->u.ppid.id_len);
> + break;
> + default:
> + return -EOPNOTSUPP;
> + }
> +
> + return err;
err is always 0 here.
> +}
> +
> +static u16 cpsw_get_pvid(struct cpsw_priv *priv)
> +{
> + struct cpsw_common *cpsw = priv->cpsw;
> + u32 __iomem *port_vlan_reg;
> + u32 pvid;
> +
> + if (priv->emac_port) {
> + int reg = CPSW2_PORT_VLAN;
> +
> + if (cpsw->version == CPSW_VERSION_1)
> + reg = CPSW1_PORT_VLAN;
> + pvid = slave_read(cpsw->slaves + (priv->emac_port - 1), reg);
> + } else {
> + port_vlan_reg = &cpsw->host_port_regs->port_vlan;
> + pvid = readl(port_vlan_reg);
> + }
> +
> + pvid = pvid & 0xfff;
> +
> + return pvid;
> +}
> +
> +static void cpsw_set_pvid(struct cpsw_priv *priv, u16 vid, bool cfi, u32 cos)
> +{
> + struct cpsw_common *cpsw = priv->cpsw;
> + void __iomem *port_vlan_reg;
> + u32 pvid;
> +
> + pvid = vid;
> + pvid |= cfi ? BIT(12) : 0;
> + pvid |= (cos & 0x7) << 13;
> +
> + if (priv->emac_port) {
> + int reg = CPSW2_PORT_VLAN;
> +
> + if (cpsw->version == CPSW_VERSION_1)
> + reg = CPSW1_PORT_VLAN;
> + /* no barrier */
> + slave_write(cpsw->slaves + (priv->emac_port - 1), pvid, reg);
> + } else {
> + /* CPU port */
> + port_vlan_reg = &cpsw->host_port_regs->port_vlan;
> + writel(pvid, port_vlan_reg);
> + }
> +}
> +
> +static int cpsw_port_vlan_add(struct cpsw_priv *priv, bool untag, bool pvid,
> + u16 vid)
> +{
> + struct cpsw_common *cpsw = priv->cpsw;
> + int port_mask = BIT(priv->emac_port);
> + int unreg_mcast_mask = 0;
> + int reg_mcast_mask = 0;
> + int untag_mask = 0;
> + int ret = 0;
> +
> + if (priv->ndev->flags & IFF_ALLMULTI)
> + unreg_mcast_mask = port_mask;
> +
> + if (priv->ndev->flags & IFF_MULTICAST)
> + reg_mcast_mask = port_mask;
> +
> + if (untag)
> + untag_mask = port_mask;
> +
> + ret = cpsw_ale_vlan_add_modify(cpsw->ale, vid, port_mask, untag_mask,
> + reg_mcast_mask, unreg_mcast_mask);
> + if (ret) {
> + dev_err(priv->dev, "Unable to add vlan\n");
> + return ret;
> + }
> +
> + if (!pvid)
> + return ret;
> +
> + cpsw_set_pvid(priv, vid, 0, 0);
> +
> + dev_dbg(priv->dev, "VID: %u dev: %s port: %u\n", vid,
> + priv->ndev->name, priv->emac_port);
> +
> + return ret;
> +}
> +
> +static int cpsw_port_vlan_del(struct cpsw_priv *priv, u16 vid)
> +{
> + struct cpsw_common *cpsw = priv->cpsw;
> + int port_mask = BIT(priv->emac_port);
> + int ret = 0;
no need to set it to 0 here.
> +
> + ret = cpsw_ale_vlan_del_modify(cpsw->ale, vid, port_mask);
> + if (ret != 0)
> + return ret;
> +
> + ret = cpsw_ale_del_ucast(cpsw->ale, priv->mac_addr,
> + HOST_PORT_NUM, ALE_VLAN, vid);
> +
> + if (vid == cpsw_get_pvid(priv))
> + cpsw_set_pvid(priv, 0, 0, 0);
> +
> + if (ret != 0) {
> + dev_dbg(priv->dev, "Failed to delete unicast entry\n");
> + ret = 0;
No need to set it to 0; ret is overwritten by the next call.
> + }
> +
> + ret = cpsw_ale_del_mcast(cpsw->ale, priv->ndev->broadcast,
> + 0, ALE_VLAN, vid);
> + if (ret != 0) {
> + dev_dbg(priv->dev, "Failed to delete multicast entry\n");
> + ret = 0;
> + }
> +
Just return 0 here; ret is always 0 at this point.
> + return ret;
> +}
> +
> +static int cpsw_port_vlans_add(struct cpsw_priv *priv,
> + const struct switchdev_obj_port_vlan *vlan,
> + struct switchdev_trans *trans)
> +{
> + bool untagged = vlan->flags & BRIDGE_VLAN_INFO_UNTAGGED;
> + bool pvid = vlan->flags & BRIDGE_VLAN_INFO_PVID;
> + u16 vid;
> +
> + if (switchdev_trans_ph_prepare(trans))
> + return 0;
> +
> + for (vid = vlan->vid_begin; vid <= vlan->vid_end; vid++) {
> + int err;
> +
> + err = cpsw_port_vlan_add(priv, untagged, pvid, vid);
> + if (err)
> + return err;
> + }
> +
> + return 0;
> +}
> +
> +static int cpsw_port_vlans_del(struct cpsw_priv *priv,
> + const struct switchdev_obj_port_vlan *vlan)
> +
> +{
> + u16 vid;
> +
> + for (vid = vlan->vid_begin; vid <= vlan->vid_end; vid++) {
> + int err;
> +
> + err = cpsw_port_vlan_del(priv, vid);
> + if (err)
> + return err;
> + }
> +
> + return 0;
> +}
> +
> +static int cpsw_port_mdb_add(struct cpsw_priv *priv,
> + struct switchdev_obj_port_mdb *mdb,
> + struct switchdev_trans *trans)
> +{
> + struct cpsw_common *cpsw = priv->cpsw;
> + int port_mask;
> + int err;
> +
> + if (switchdev_trans_ph_prepare(trans))
> + return 0;
> +
> + port_mask = BIT(priv->emac_port);
> + err = cpsw_ale_mcast_add_modify(cpsw->ale, mdb->addr, port_mask,
> + ALE_VLAN, mdb->vid, 0);
> +
> + return err;
> +}
> +
> +static int cpsw_port_mdb_del(struct cpsw_priv *priv,
> + struct switchdev_obj_port_mdb *mdb)
> +
> +{
> + struct cpsw_common *cpsw = priv->cpsw;
> + int del_mask;
> + int err;
> +
> + del_mask = BIT(priv->emac_port);
> + err = cpsw_ale_mcast_del_modify(cpsw->ale, mdb->addr, del_mask,
> + ALE_VLAN, mdb->vid);
> +
> + return err;
> +}
> +
> +static int cpsw_port_obj_add(struct net_device *ndev,
> + const struct switchdev_obj *obj,
> + struct switchdev_trans *trans)
> +{
> + struct switchdev_obj_port_vlan *vlan = SWITCHDEV_OBJ_PORT_VLAN(obj);
> + struct switchdev_obj_port_mdb *mdb = SWITCHDEV_OBJ_PORT_MDB(obj);
> + struct cpsw_priv *priv = netdev_priv(ndev);
> + int err = 0;
> +
> + switch (obj->id) {
> + case SWITCHDEV_OBJ_ID_PORT_VLAN:
> + err = cpsw_port_vlans_add(priv, vlan, trans);
> + break;
> + case SWITCHDEV_OBJ_ID_PORT_MDB:
> + err = cpsw_port_mdb_add(priv, mdb, trans);
> + break;
> + default:
> + err = -EOPNOTSUPP;
> + break;
> + }
> +
> + return err;
> +}
> +
> +static int cpsw_port_obj_del(struct net_device *ndev,
> + const struct switchdev_obj *obj)
> +{
> + struct switchdev_obj_port_vlan *vlan = SWITCHDEV_OBJ_PORT_VLAN(obj);
> + struct cpsw_priv *priv = netdev_priv(ndev);
> + int err = 0;
> +
> + switch (obj->id) {
> + case SWITCHDEV_OBJ_ID_PORT_VLAN:
> + err = cpsw_port_vlans_del(priv, vlan);
> + break;
> + case SWITCHDEV_OBJ_ID_PORT_MDB:
> + err = cpsw_port_mdb_del(priv, SWITCHDEV_OBJ_PORT_MDB(obj));
> + break;
> + default:
> + err = -EOPNOTSUPP;
> + break;
> + }
> +
> + return err;
> +}
> +
> +static const struct switchdev_ops cpsw_port_switchdev_ops = {
> + .switchdev_port_attr_set = cpsw_port_attr_set,
> + .switchdev_port_attr_get = cpsw_port_attr_get,
> + .switchdev_port_obj_add = cpsw_port_obj_add,
> + .switchdev_port_obj_del = cpsw_port_obj_del,
> +};
> +
> +void cpsw_port_switchdev_init(struct net_device *ndev)
> +{
> + ndev->switchdev_ops = &cpsw_port_switchdev_ops;
> +}
> diff --git a/drivers/net/ethernet/ti/cpsw_switchdev.h b/drivers/net/ethernet/ti/cpsw_switchdev.h
> new file mode 100644
> index 0000000..4940462
> --- /dev/null
> +++ b/drivers/net/ethernet/ti/cpsw_switchdev.h
> @@ -0,0 +1,4 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <net/switchdev.h>
> +
> +void cpsw_port_switchdev_init(struct net_device *ndev);
> --
> 2.7.4
>
--
Best regards,
Maxim Uvarov
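As a side note on the register layout used by cpsw_set_pvid() and cpsw_get_pvid() in the quoted patch, the bit packing can be sketched in isolation. This is only an illustration under the assumption that the field positions are exactly those visible in the patch (VID in bits 0-11, CFI in bit 12, CoS in bits 13-15); the sketch_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pack the fields the same way the quoted cpsw_set_pvid() does. */
static uint32_t sketch_pack_port_vlan(uint16_t vid, bool cfi, uint32_t cos)
{
	uint32_t reg = vid;		/* VID occupies bits 0-11 */

	reg |= cfi ? (1u << 12) : 0;	/* CFI is bit 12 */
	reg |= (cos & 0x7) << 13;	/* CoS is bits 13-15 */
	return reg;
}

/* Extract the VID the same way the quoted cpsw_get_pvid() does. */
static uint16_t sketch_unpack_vid(uint32_t reg)
{
	return reg & 0xfff;
}
```

For example, VID 5 with CFI set and CoS 3 packs to 0x7005, and masking with 0xfff recovers 5. Like the quoted code, the sketch does not mask vid on the way in, so a value above 0xfff would spill into the CFI and CoS bits.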
^ permalink raw reply
* [PATCH] net/9p: fix error path of p9_virtio_probe
From: Jean-Philippe Brucker @ 2018-05-24 10:10 UTC (permalink / raw)
To: v9fs-developer, ericvh, rminnich, lucho; +Cc: netdev, davem
Currently when virtio_find_single_vq fails, we go through del_vqs which
throws a warning (Trying to free already-free IRQ). Skip del_vqs if vq
allocation failed.
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
net/9p/trans_virtio.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 4d0372263e5d..1c87eee522b7 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -562,7 +562,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
chan->vq = virtio_find_single_vq(vdev, req_done, "requests");
if (IS_ERR(chan->vq)) {
err = PTR_ERR(chan->vq);
- goto out_free_vq;
+ goto out_free_chan;
}
chan->vq->vdev->priv = chan;
spin_lock_init(&chan->lock);
@@ -615,6 +615,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
kfree(tag);
out_free_vq:
vdev->config->del_vqs(vdev);
+out_free_chan:
kfree(chan);
fail:
return err;
--
2.17.0
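The label ordering this patch relies on can be sketched outside the kernel. The following is a simplified model, not the 9p or virtio API: every sketch_* name is hypothetical, and a counter stands in for the del_vqs teardown so the skipped step is observable.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the channel; not the real 9p structure. */
struct sketch_chan { bool has_vq; };

static int sketch_del_vqs_calls;	/* counts vq teardown calls */

static int sketch_find_vq(struct sketch_chan *chan, bool fail)
{
	if (fail)
		return -ENOMEM;
	chan->has_vq = true;
	return 0;
}

static void sketch_del_vqs(struct sketch_chan *chan)
{
	sketch_del_vqs_calls++;
	chan->has_vq = false;
}

static int sketch_probe(bool vq_fails, bool later_step_fails)
{
	struct sketch_chan *chan = calloc(1, sizeof(*chan));
	int err;

	if (!chan)
		return -ENOMEM;

	err = sketch_find_vq(chan, vq_fails);
	if (err)
		goto out_free_chan;	/* no vq exists yet: skip teardown */

	if (later_step_fails) {
		err = -EINVAL;
		goto out_free_vq;	/* vq exists: tear it down first */
	}

	/* Success: a real driver keeps chan; the sketch just releases it. */
	free(chan);
	return 0;

out_free_vq:
	sketch_del_vqs(chan);
out_free_chan:
	free(chan);
	return err;
}
```

Each label undoes only what had been set up before the failing step, so a failed vq allocation falls through to freeing the channel without touching the vq teardown.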
^ permalink raw reply related
* Re: [PATCH bpf-next v7 0/6] ipv6: sr: introduce seg6local End.BPF action
From: Daniel Borkmann @ 2018-05-24 10:12 UTC (permalink / raw)
To: Mathieu Xhonneux, netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526824042.git.m.xhonneux@gmail.com>
On 05/20/2018 03:58 PM, Mathieu Xhonneux wrote:
> As of Linux 4.14, it is possible to define advanced local processing for
> IPv6 packets with a Segment Routing Header through the seg6local LWT
> infrastructure. This LWT implements the network programming principles
> defined in the IETF “SRv6 Network Programming” draft.
>
> The implemented operations are generic, and it would be very interesting to
> be able to implement user-specific seg6local actions, without having to
> modify the kernel directly. To do so, this patchset adds an End.BPF action
> to seg6local, powered by some specific Segment Routing-related helpers,
> which provide SR functionalities that can be applied on the packet. This
> BPF hook would then allow to implement specific actions at native kernel
> speed such as OAM features, advanced SR SDN policies, SRv6 actions like
> Segment Routing Header (SRH) encapsulation depending on the content of
> the packet, etc.
>
> This patchset is divided in 6 patches, whose main features are :
>
> - A new seg6local action End.BPF with the corresponding new BPF program
> type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
> passed to the LWT seg6local through netlink, the same way as the LWT
> BPF hook operates.
> - 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/
> shrink a SRH and apply on a packet some of the generic SRv6 actions.
> - 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through
> encapsulation (via IPv6 encapsulation or inlining if the packet contains
> already an IPv6 header).
>
> As this patchset adds a new LWT BPF hook, I took into account the result of
> the discussions when the LWT BPF infrastructure got merged. Hence, the
> seg6local BPF hook doesn’t allow write access to skb->data directly, only
> the SRH can be modified through specific helpers, which ensures that the
> integrity of the packet is maintained.
> More details are available in the related patches messages.
>
> The performances of this BPF hook have been assessed with the BPF JIT
> enabled on an Intel Xeon X3440 processor with 4 cores and 8 threads
> clocked at 2.53 GHz. No throughput losses are noted with the seg6local
> BPF hook when the BPF program does nothing (440kpps). Adding an 8-byte
> TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
> drops the throughput to 410kpps, and inlining a SRH via
> bpf_lwt_seg6_action drops the throughput to 420kpps.
> All throughputs are stable.
>
> -------
> v2: move the SRH integrity state from skb->cb to a per-cpu buffer
> v3: - document helpers in man-page style
> - fix kbuild bugs
> - un-break BPF LWT out hook
> - bpf_push_seg6_encap is now static
> - preempt_enable is now called when the packet is dropped in
> input_action_end_bpf
> v4: fix kbuild bugs when CONFIG_IPV6=m
> v5: fix kbuild sparse warnings when CONFIG_IPV6=m
> v6: fix skb pointers-related bugs in helpers
> v7: - fix memory leak in error path of End.BPF setup
> - add freeing of BPF data in seg6_local_destroy_state
> - new enums SEG6_LOCAL_BPF_* instead of re-using ones of lwt bpf for
> netlink nested bpf attributes
> - SEG6_LOCAL_BPF_PROG attr now contains prog->aux->id when dumping
> state
>
> Thanks.
>
> Mathieu Xhonneux (6):
> ipv6: sr: make seg6.h includable without IPv6
> ipv6: sr: export function lookup_nexthop
> bpf: Add IPv6 Segment Routing helpers
> bpf: Split lwt inout verifier structures
> ipv6: sr: Add seg6local action End.BPF
> selftests/bpf: test for seg6local End.BPF action
>
> include/linux/bpf_types.h | 5 +-
> include/net/seg6.h | 7 +-
> include/net/seg6_local.h | 32 ++
> include/uapi/linux/bpf.h | 97 ++++-
> include/uapi/linux/seg6_local.h | 12 +
> kernel/bpf/verifier.c | 1 +
> net/core/filter.c | 393 ++++++++++++++++---
> net/ipv6/Kconfig | 5 +
> net/ipv6/seg6_local.c | 190 +++++++++-
> tools/include/uapi/linux/bpf.h | 97 ++++-
> tools/lib/bpf/libbpf.c | 1 +
> tools/testing/selftests/bpf/Makefile | 6 +-
> tools/testing/selftests/bpf/bpf_helpers.h | 12 +
> tools/testing/selftests/bpf/test_lwt_seg6local.c | 437 ++++++++++++++++++++++
> tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++++++
> 15 files changed, 1363 insertions(+), 72 deletions(-)
> create mode 100644 include/net/seg6_local.h
> create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
> create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh
Applied to bpf-next, thanks Mathieu!
^ permalink raw reply
* Re: [PATCH bpf-next v7 3/6] bpf: Add IPv6 Segment Routing helpers
From: Daniel Borkmann @ 2018-05-24 10:18 UTC (permalink / raw)
To: Mathieu Xhonneux, netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <98333ad8233f546c91e8f51c8c8ae457ce19980f.1526824042.git.m.xhonneux@gmail.com>
On 05/20/2018 03:58 PM, Mathieu Xhonneux wrote:
> The BPF seg6local hook should be powerful enough to enable users to
> implement most of the use-cases one could think of. After some thinking,
> we figured out that the following actions should be possible on a SRv6
> packet, requiring 3 specific helpers :
> - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
> - bpf_lwt_seg6_adjust_srh: Allow to grow or shrink a SRH
> (to add/delete TLVs)
> - bpf_lwt_seg6_action: Apply some SRv6 network programming actions
> (specifically End.X, End.T, End.B6 and
> End.B6.Encap)
>
> The specifications of these helpers are provided in the patch (see
> include/uapi/linux/bpf.h).
>
> The non-sensitive fields of the SRH are the following : flags, tag and
> TLVs. The other fields can not be modified, to maintain the SRH
> integrity. Flags, tag and TLVs can easily be modified as their validity
> can be checked afterwards via seg6_validate_srh. It is not allowed to
> modify the segments directly. If one wants to add segments on the path,
> he should stack a new SRH using the End.B6 action via
> bpf_lwt_seg6_action.
>
> Growing, shrinking or editing TLVs via the helpers will flag the SRH as
> invalid, and it will have to be re-validated before re-entering the IPv6
> layer. This flag is stored in a per-CPU buffer, along with the current
> header length in bytes.
>
> Storing the SRH len in bytes in the control block is mandatory when using
> bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
> len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-bytes
> boundary). When adding/deleting TLVs within the BPF program, the SRH may
> temporarily be in an invalid state where its length cannot be rounded to 8
> bytes without remainder, hence the need to store the length in bytes
> separately. The caller of the BPF program can then ensure that the SRH's
> final length is valid using this value. Again, a final SRH modified by a
> BPF program which doesn’t respect the 8-bytes boundary will be discarded
> as it will be considered as invalid.
>
> Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
> available from the LWT BPF IN hook, but not from the seg6local BPF one.
> This helper allows to encapsulate a Segment Routing Header (either with
> a new outer IPv6 header, or by inlining it directly in the existing IPv6
> header) into a non-SRv6 packet. This helper is required if we want to
> offer the possibility to dynamically encapsulate a SRH for non-SRv6 packet,
> as the BPF seg6local hook only works on traffic already containing a SRH.
> This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
> the same purpose but with a static SRH per route.
>
> These helpers require CONFIG_IPV6=y (and not =m).
>
> Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
> Acked-by: David Lebrun <dlebrun@google.com>
One minor comment for a follow-up below.
> +BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
> + const void *, from, u32, len)
> +{
> +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
> + struct seg6_bpf_srh_state *srh_state =
> + this_cpu_ptr(&seg6_bpf_srh_states);
> + void *srh_tlvs, *srh_end, *ptr;
> + struct ipv6_sr_hdr *srh;
> + int srhoff = 0;
> +
> + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
> + return -EINVAL;
> +
> + srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
> + srh_tlvs = (void *)((char *)srh + ((srh->first_segment + 1) << 4));
> + srh_end = (void *)((char *)srh + sizeof(*srh) + srh_state->hdrlen);
> +
> + ptr = skb->data + offset;
> + if (ptr >= srh_tlvs && ptr + len <= srh_end)
> + srh_state->valid = 0;
> + else if (ptr < (void *)&srh->flags ||
> + ptr + len > (void *)&srh->segments)
> + return -EFAULT;
> +
> + if (unlikely(bpf_try_make_writable(skb, offset + len)))
> + return -EFAULT;
> +
> + memcpy(skb->data + offset, from, len);
> + return 0;
> +#else /* CONFIG_IPV6_SEG6_BPF */
> + return -EOPNOTSUPP;
> +#endif
> +}
Instead of doing this inside the helper you can reject the program already
in the lwt_*_func_proto() by returning NULL when !CONFIG_IPV6_SEG6_BPF. That
way programs get rejected at verification time instead of runtime, so the
user can probe availability more easily.
^ permalink raw reply