Netdev List
 help / color / mirror / Atom feed
* Re: [v3,4/5] net: phy: at803x: Disable phy delay for RGMII mode
From: Marc Gonzalez @ 2019-02-12 13:34 UTC (permalink / raw)
  To: Peter Ujfalusi, Vinod Koul, Roger Quadros, Florian Fainelli
  Cc: David S Miller, netdev, MSM, Niklas Cassel, Bjorn Andersson,
	Andrew Lunn, Nori Sekhar
In-Reply-To: <147151c4-d162-4ebe-189a-564492d84d18@ti.com>

On 12/02/2019 11:55, Peter Ujfalusi wrote:

> On 21/01/2019 11.13, Vinod Koul wrote:
> 
>> For RGMII mode, phy delay should be disabled. Add this case along 
>> with disable delay routines.
> 
> In next-20190211 I need to revert this patch to get cpsw networking
> to work on am335x-evmsk. The board uses AR8031_AL1A PHY, which is
> handled by the phy/at803x.c

I'm having flashbacks:

[PATCH 1/2] net: phy: at803x: Fix RGMII RX and TX clock delays setup
https://www.spinics.net/lists/netdev/msg445053.html

Quirks of the Atheros 8035 PHY
https://www.spinics.net/lists/netdev/msg444527.html

See also
http://patchwork.ozlabs.org/project/netdev/list/?submitter=67482&state=*

Regards.

^ permalink raw reply

* Re: [RFC, PATCH] net: page_pool: Don't use page->private to store dma_addr_t
From: Jesper Dangaard Brouer @ 2019-02-12 13:49 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, Ilias Apalodimas, Matthew Wilcox, David Miller,
	toke@redhat.com, netdev@vger.kernel.org,
	mgorman@techsingularity.net, linux-mm@kvack.org, brouer
In-Reply-To: <27e97aac-f25b-d46c-3e70-7d0d44f784b5@mellanox.com>

On Tue, 12 Feb 2019 12:39:59 +0000
Tariq Toukan <tariqt@mellanox.com> wrote:

> On 2/11/2019 7:14 PM, Eric Dumazet wrote:
> > 
> > On 02/11/2019 12:53 AM, Tariq Toukan wrote:  
> >>  
> >   
> >> Hi,
> >>
> >> It's great to use the struct page to store its dma mapping, but I am
> >> worried about extensibility.
> >> page_pool is evolving, and it would need several more per-page fields.
> >> One of them would be pageref_bias, a planned optimization to reduce the
> >> number of the costly atomic pageref operations (and replace existing
> >> code in several drivers).
> >>  
> > 
> > But the point about pageref_bias is to place it in a different
> > cache line than "struct page"

Yes, exactly.


> > The major cost is having a cache line bouncing between producer and
> > consumer. 
> 
> pageref_bias is meant to be dirtied only by the page requester, i.e. the 
> NIC driver / page_pool.
> All other components (basically, SKB release flow / put_page) should 
> continue working with the atomic page_refcnt, and not dirty the 
> pageref_bias.
> 
> However, what bothers me more is another issue.
> The optimization doesn't cleanly combine with the new page_pool 
> direction for maintaining a queue for "available" pages, as the put_page 
> flow would need to read pageref_bias, asynchronously, and act accordingly.
> 
> The suggested hook in put_page (to catch the 2 -> 1 "biased refcnt" 
> transition) causes a problem to the traditional pageref_bias idea, as it 
> implies a new point in which the pageref_bias field is read 
> *asynchronously*. This would risk missing the this critical 2 -> 1 
> transition! Unless pageref_bias is atomic...

I want to stop you here...

It seems to me that you are trying to shoehorn in a refcount
optimization into page_pool.  The page_pool is optimized for the XDP
case of one-frame-per-page, where we can avoid changing the refcount,
and tradeoff memory usage for speed.  It is compatible with the elevated
refcount usage, but that is not the optimization target.

If the case you are optimizing for is "packing" more frames in a page,
then the page_pool might be the wrong choice.  To me it would make more
sense to create another enum xdp_mem_type, that generalize the
pageref_bias tricks also used by some drivers.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next v4 2/9] switchdev: Add SWITCHDEV_PORT_ATTR_SET, SWITCHDEV_PORT_ATTR_GET
From: Ido Schimmel @ 2019-02-12 13:55 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev@vger.kernel.org, David S. Miller, open list,
	open list:STAGING SUBSYSTEM, moderated list:ETHERNET BRIDGE,
	Jiri Pirko, andrew@lunn.ch, vivien.didelot@gmail.com
In-Reply-To: <20190211191001.8623-3-f.fainelli@gmail.com>

On Mon, Feb 11, 2019 at 11:09:54AM -0800, Florian Fainelli wrote:
> In preparation for allowing switchdev enabled drivers to veto specific
> attribute settings from within the context of the caller, introduce a
> new switchdev notifier type for port attributes.
> 
> Suggested-by: Ido Schimmel <idosch@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
>  include/net/switchdev.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> index 5e87b54c5dc5..b8becabbef38 100644
> --- a/include/net/switchdev.h
> +++ b/include/net/switchdev.h
> @@ -143,6 +143,9 @@ enum switchdev_notifier_type {
>  	SWITCHDEV_VXLAN_FDB_ADD_TO_DEVICE,
>  	SWITCHDEV_VXLAN_FDB_DEL_TO_DEVICE,
>  	SWITCHDEV_VXLAN_FDB_OFFLOADED,
> +
> +	SWITCHDEV_PORT_ATTR_SET, /* Blocking. */
> +	SWITCHDEV_PORT_ATTR_GET, /* Blocking. */

As I wrote in the cover letter, I don't believe GET is needed.

>  };
>  
>  struct switchdev_notifier_info {
> @@ -165,6 +168,13 @@ struct switchdev_notifier_port_obj_info {
>  	bool handled;
>  };
>  
> +struct switchdev_notifier_port_attr_info {
> +	struct switchdev_notifier_info info; /* must be first */
> +	struct switchdev_attr *attr;
> +	struct switchdev_trans *trans;
> +	bool handled;
> +};
> +
>  static inline struct net_device *
>  switchdev_notifier_info_to_dev(const struct switchdev_notifier_info *info)
>  {
> -- 
> 2.17.1
> 

^ permalink raw reply

* [REGRESSION 4.20] mvneta - DMA-API: device driver tries to sync DMA memory it has not allocated
From: Russell King - ARM Linux admin @ 2019-02-12 13:59 UTC (permalink / raw)
  To: Thomas Petazzoni, netdev

Hi,

Booting 4.20 on SolidRun Clearfog reliably provokes the following
warning - this is with mvneta built in, but DSA as modules:

WARNING: CPU: 0 PID: 555 at kernel/dma/debug.c:1230 check_sync+0x514/0x5bc
mvneta f1070000.ethernet: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x000000002dd7dc00] [size=240 bytes]
Modules linked in: ahci mv88e6xxx dsa_core xhci_plat_hcd xhci_hcd devlink armada_thermal marvell_cesa des_generic ehci_orion phy_armada38x_comphy mcp3021 spi_orion evbug sfp mdio_i2c ip_tables x_tables
CPU: 0 PID: 555 Comm: bridge-network- Not tainted 4.20.0+ #291
Hardware name: Marvell Armada 380/385 (Device Tree)
[<c0019638>] (unwind_backtrace) from [<c0014888>] (show_stack+0x10/0x14)
[<c0014888>] (show_stack) from [<c07f54e0>] (dump_stack+0x9c/0xd4)
[<c07f54e0>] (dump_stack) from [<c00312bc>] (__warn+0xf8/0x124)
[<c00312bc>] (__warn) from [<c00313b0>] (warn_slowpath_fmt+0x38/0x48)
[<c00313b0>] (warn_slowpath_fmt) from [<c00b0370>] (check_sync+0x514/0x5bc)
[<c00b0370>] (check_sync) from [<c00b04f8>] (debug_dma_sync_single_range_for_cpu+0x6c/0x74)
[<c00b04f8>] (debug_dma_sync_single_range_for_cpu) from [<c051bd14>] (mvneta_poll+0x298/0xf58)
[<c051bd14>] (mvneta_poll) from [<c0656194>] (net_rx_action+0x128/0x424)
[<c0656194>] (net_rx_action) from [<c000a230>] (__do_softirq+0xf0/0x540)
[<c000a230>] (__do_softirq) from [<c00386e0>] (irq_exit+0x124/0x144)
[<c00386e0>] (irq_exit) from [<c009b5e0>] (__handle_domain_irq+0x58/0xb0)
[<c009b5e0>] (__handle_domain_irq) from [<c03a63c4>] (gic_handle_irq+0x48/0x98)
[<c03a63c4>] (gic_handle_irq) from [<c0009a10>] (__irq_svc+0x70/0x98)
...

This appears to be from:

                if (rx_bytes <= rx_copybreak) {
                        /* better copy a small frame and not unmap the DMA region */
                        skb = netdev_alloc_skb_ip_align(dev, rx_bytes);
                        if (unlikely(!skb))
                                goto err_drop_frame_ret_pool;

                        dma_sync_single_range_for_cpu(dev->dev.parent,
                                                      rx_desc->buf_phys_addr,
                                                      MVNETA_MH_SIZE + NET_SKB_PAD,
                                                      rx_bytes,
                                                      DMA_FROM_DEVICE);

which suggests that rx_desc->buf_phys_addr is not something that should
be passed to dma_sync_single_range_for_cpu().  I've not been able to
track down why that is, nor which interface is provoking that.

As I don't have the details of how the buffer management hardware works
on Armada 388, I'm unable to debug this myself.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply

* Re: [PATCH net-next v4 0/9] net: Remove switchdev_ops
From: Jiri Pirko @ 2019-02-12 13:53 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Florian Fainelli, netdev@vger.kernel.org, David S. Miller,
	open list, open list:STAGING SUBSYSTEM,
	moderated list:ETHERNET BRIDGE, Jiri Pirko, andrew@lunn.ch,
	vivien.didelot@gmail.com
In-Reply-To: <20190212131443.GA13819@splinter>

Tue, Feb 12, 2019 at 02:14:47PM CET, idosch@mellanox.com wrote:
>On Mon, Feb 11, 2019 at 11:09:52AM -0800, Florian Fainelli wrote:
>> Hi all,
>> 
>> This patch series finishes by the removal of switchdev_ops. To get there
>> we convert the existing switchdev_port_attr_{set,get} switchdev_ops to
>> use a blocking notifier, thus making it consistent with how the objects
>> are pushed to the switchdev enabled devices.
>> 
>> Please review and let me know what you think!
>> 
>> David, I would like to get Ido's feedback on this to make sure I did not
>> miss something, thank you!
>
>Hi Florian,
>
>Why do you still keep switchdev_port_attr_get()? I believe we can remove
>it and simplify things.
>
>After your recent patchset to remove 'PORT_BRIDGE_FLAGS', the only
>remaining user of get() is 'PORT_BRIDGE_FLAGS_SUPPORT'. It can be
>converted to a blocking set() with 'PORT_PRE_BRIDGE_FLAGS' (or a similar
>name).

Let's do that in a follow-up.


>
>I would like to make sure we're in sync with regards to future changes.
>After this patchset to get rid of switchdev_ops we can continue to
>completely removing switchdev (I believe Jiri approves). The

Yes.


>prepare-commit model is not really needed and the two switchdev
>notification chains can be split into bridge and vxlan specific chains.
>
>Notifications sent in an atomic context can be handled by drivers
>directly in this context. Similar to how FDB/route/neighbour are
>handled. It will really simplify things. No need for the defer flag
>anymore and tricks like 'PORT_BRIDGE_FLAGS_SUPPORT' and
>'PORT_PRE_BRIDGE_FLAGS'. In the atomic context the driver can veto the
>requested bridge flags, but program the device from a blocking context
>(using a workqueue).

Sounds good to me.

^ permalink raw reply

* Re: [PATCH net-next v4 4/9] mlxsw: spectrum_switchdev: Handle SWITCHDEV_PORT_ATTR_GET/SET
From: Ido Schimmel @ 2019-02-12 14:07 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev@vger.kernel.org, David S. Miller, open list,
	open list:STAGING SUBSYSTEM, moderated list:ETHERNET BRIDGE,
	Jiri Pirko, andrew@lunn.ch, vivien.didelot@gmail.com
In-Reply-To: <20190211191001.8623-5-f.fainelli@gmail.com>

On Mon, Feb 11, 2019 at 11:09:56AM -0800, Florian Fainelli wrote:
> Following patches will change the way we communicate getting or setting
> a port's attribute and use a blocking notifier to perform those tasks.
> 
> Prepare mlxsw to support receiving notifier events targeting
> SWITCHDEV_PORT_ATTR_GET/SET and simply translate that into the existing
> mlxsw_sp_port_attr_{set,get} calls.
> 
> Acked-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
>  .../mellanox/mlxsw/spectrum_switchdev.c       | 23 +++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
> index 95e37de3e48f..88d4994309a7 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
> @@ -3443,6 +3443,26 @@ mlxsw_sp_switchdev_handle_vxlan_obj_del(struct net_device *vxlan_dev,
>  	}
>  }
>  
> +static int
> +mlxsw_sp_switchdev_port_attr_event(unsigned long event, struct net_device *dev,
> +		struct switchdev_notifier_port_attr_info *port_attr_info)
> +{
> +	int err = -EOPNOTSUPP;
> +
> +	switch (event) {
> +	case SWITCHDEV_PORT_ATTR_SET:
> +		err = mlxsw_sp_port_attr_set(dev, port_attr_info->attr,
> +					     port_attr_info->trans);

It is not that simple. These functions expect 'dev' to be an mlxsw
netdev since the operation is propagated using the device chain. This is
not the case with notification chains. 'dev' can be any netdev in the
system and then this line in mlxsw_sp_port_attr_set() is not correct:

struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);

You can check commit f30f0601eb93 ("switchdev: Add helpers to aid
traversal through lower devices") for reference.

> +		break;
> +	case SWITCHDEV_PORT_ATTR_GET:
> +		err = mlxsw_sp_port_attr_get(dev, port_attr_info->attr);
> +		break;
> +	}
> +
> +	port_attr_info->handled = true;

I believe this should only be set in case the driver actually handled
the notification. That's how it works for objects.

I suggest looking at the series merged in commit 06d212900ea9 ("Merge
branch 'switchdev-blocking-notifiers'").

> +	return notifier_from_errno(err);
> +}
> +
>  static int mlxsw_sp_switchdev_blocking_event(struct notifier_block *unused,
>  					     unsigned long event, void *ptr)
>  {
> @@ -3466,6 +3486,9 @@ static int mlxsw_sp_switchdev_blocking_event(struct notifier_block *unused,
>  							mlxsw_sp_port_dev_check,
>  							mlxsw_sp_port_obj_del);
>  		return notifier_from_errno(err);
> +	case SWITCHDEV_PORT_ATTR_SET: /* fall through */
> +	case SWITCHDEV_PORT_ATTR_GET:
> +		return mlxsw_sp_switchdev_port_attr_event(event, dev, ptr);
>  	}
>  
>  	return NOTIFY_DONE;
> -- 
> 2.17.1
> 

^ permalink raw reply

* Re: [PATCH net-next v4 9/9] net: Remove switchdev_ops
From: Ido Schimmel @ 2019-02-12 14:10 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev@vger.kernel.org, David S. Miller, open list,
	open list:STAGING SUBSYSTEM, moderated list:ETHERNET BRIDGE,
	Jiri Pirko, andrew@lunn.ch, vivien.didelot@gmail.com
In-Reply-To: <20190211191001.8623-10-f.fainelli@gmail.com>

On Mon, Feb 11, 2019 at 11:10:01AM -0800, Florian Fainelli wrote:
> Now that we have converted all possible callers to using a switchdev
> notifier for attributes we do not have a need for implementing
> switchdev_ops anymore, and this can be removed from all drivers the
> net_device structure.
> 
> Acked-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
>  drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 12 ------------
>  drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  2 --
>  .../mellanox/mlxsw/spectrum_switchdev.c        | 13 -------------
>  drivers/net/ethernet/mscc/ocelot.c             |  5 -----
>  drivers/net/ethernet/rocker/rocker_main.c      |  6 ------
>  drivers/staging/fsl-dpaa2/ethsw/ethsw.c        |  6 ------
>  include/linux/netdevice.h                      |  3 ---
>  include/net/switchdev.h                        | 18 ------------------
>  net/dsa/slave.c                                |  6 ------

There's also this line that can be removed from the 8021q driver:

net/8021q/vlan_dev.c:34:#include <net/switchdev.h>

^ permalink raw reply

* Re: [PATCH] ipv6: fix icmp6_send() route lookup
From: Alin Năstac @ 2019-02-12 14:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20190211.123818.1763509059512954986.davem@davemloft.net>

On Mon, Feb 11, 2019 at 9:38 PM David Miller <davem@davemloft.net> wrote:
>
> From: Alin Nastac <alin.nastac@gmail.com>
> Date: Thu,  7 Feb 2019 16:05:31 +0100
>
> > Original packet destination address must be used as saddr for the
> > route lookup performed by icmp6_send() even when this address is
> > not local. This fixes the IPv6 router ability to send back
> > destination unreachable ICMPv6 errors for forwarded packets when
> > the route toward the saddr of the original packet is source
> > filtered (e.g. a default route with a "from PD" attribute, where
> > PD is the delegated prefix).
> >
> > Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
>
> Yes, but however this will change behavior for a lot of situations
> not just the one you are interested in.
>
> The base ipv6_chk_addr() test has been there for more than a decade
> and I'm not comfortable with changing this logic until I see you
> write up a full audit of all of the use cases of icmp6_send() and
> how they are impacted by your changes.

Please consider these:
 - saddr variable is used only to perform route lookup towards the skb
originator and is explicitly reset to NULL when skb is multicast (see
the if statement below the change).
 - In order for icmp6_send() to perform its duty, this route lookup
must succeed.
 - As long as your IPv6 routes don't use source filtering or
source-based routing, this change will have absolutely no effect on
the kernel behavior because route lookup will succeed regardless of
the fl6.saddr value.

I don't contest the usefulness of ipv6_chk_addr(), it clearly has its
purpose. However when it comes to generating ICMPv6 errors, both
routers and hosts are entitled to generate them (see RFC 4443), hence
this test shouldn't be called here.

^ permalink raw reply

* [PATCH] net: phy: at803x: disable delay only for RGMII mode
From: Vinod Koul @ 2019-02-12 14:19 UTC (permalink / raw)
  To: David S Miller
  Cc: linux-arm-msm, Bjorn Andersson, Vinod Koul, netdev, Niklas Cassel,
	Andrew Lunn, Florian Fainelli, Nori, Sekhar, Peter Ujfalusi,
	Marc Gonzalez

Per "Documentation/devicetree/bindings/net/ethernet.txt" RGMII mode
should not have delay in PHY whereas RGMII_ID and RGMII_RXID/RGMII_TXID
can have delay in phy.

So disable the delay only for RGMII mode and disable for other modes

Fixes: cd28d1d6e52e: ("net: phy: at803x: Disable phy delay for RGMII mode")
Reported-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
---
 drivers/net/phy/at803x.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
index 8ff12938ab47..7b54b54e3316 100644
--- a/drivers/net/phy/at803x.c
+++ b/drivers/net/phy/at803x.c
@@ -110,6 +110,18 @@ static int at803x_debug_reg_mask(struct phy_device *phydev, u16 reg,
 	return phy_write(phydev, AT803X_DEBUG_DATA, val);
 }
 
+static inline int at803x_enable_rx_delay(struct phy_device *phydev)
+{
+	return at803x_debug_reg_mask(phydev, AT803X_DEBUG_REG_0, 0,
+				     AT803X_DEBUG_RX_CLK_DLY_EN);
+}
+
+static inline int at803x_enable_tx_delay(struct phy_device *phydev)
+{
+	return at803x_debug_reg_mask(phydev, AT803X_DEBUG_REG_5, 0,
+				     AT803X_DEBUG_TX_CLK_DLY_EN);
+}
+
 static inline int at803x_disable_rx_delay(struct phy_device *phydev)
 {
 	return at803x_debug_reg_mask(phydev, AT803X_DEBUG_REG_0,
@@ -255,18 +267,25 @@ static int at803x_config_init(struct phy_device *phydev)
 	if (ret < 0)
 		return ret;
 
-	if (phydev->interface == PHY_INTERFACE_MODE_RGMII_RXID ||
-			phydev->interface == PHY_INTERFACE_MODE_RGMII_ID ||
-			phydev->interface == PHY_INTERFACE_MODE_RGMII) {
+	if (phydev->interface == PHY_INTERFACE_MODE_RGMII) {
 		ret = at803x_disable_rx_delay(phydev);
 		if (ret < 0)
 			return ret;
+		ret = at803x_disable_tx_delay(phydev);
+		if (ret < 0)
+			return ret;
+	};
+
+	if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID ||
+			phydev->interface == PHY_INTERFACE_MODE_RGMII_RXID) {
+		ret = at803x_enable_rx_delay(phydev);
+		if (ret < 0)
+			return ret;
 	}
 
-	if (phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID ||
-			phydev->interface == PHY_INTERFACE_MODE_RGMII_ID ||
-			phydev->interface == PHY_INTERFACE_MODE_RGMII) {
-		ret = at803x_disable_tx_delay(phydev);
+	if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID ||
+			phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID) {
+		ret = at803x_enable_tx_delay(phydev);
 		if (ret < 0)
 			return ret;
 	}
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net-next v2 06/16] net: bridge: Stop calling switchdev_port_attr_get()
From: Ido Schimmel @ 2019-02-12 14:20 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	devel@driverdev.osuosl.org, bridge@lists.linux-foundation.org,
	Jiri Pirko, andrew@lunn.ch, vivien.didelot@gmail.com
In-Reply-To: <73a10fbb-7689-74a7-3fbf-076f8dc9d76f@gmail.com>

On Sun, Feb 10, 2019 at 11:34:14AM -0800, Florian Fainelli wrote:
> Le 2/10/19 à 11:05 AM, Ido Schimmel a écrit :
> > On Sun, Feb 10, 2019 at 09:50:55AM -0800, Florian Fainelli wrote:
> >> Now that all switchdev drivers have been converted to checking the
> >> bridge port flags during the prepare phase of the
> >> switchdev_port_attr_set(), we can move straight to trying to set the
> >> desired flag through SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS.
> >>
> >> Acked-by: Jiri Pirko <jiri@mellanox.com>
> >> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> >> ---
> >>  net/bridge/br_switchdev.c | 20 +++-----------------
> >>  1 file changed, 3 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
> >> index db9e8ab96d48..939f300522c5 100644
> >> --- a/net/bridge/br_switchdev.c
> >> +++ b/net/bridge/br_switchdev.c
> >> @@ -64,29 +64,15 @@ int br_switchdev_set_port_flag(struct net_bridge_port *p,
> >>  {
> >>  	struct switchdev_attr attr = {
> >>  		.orig_dev = p->dev,
> >> -		.id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT,
> >> +		.id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS,
> >> +		.flags = SWITCHDEV_F_DEFER,
> > 
> > How does this work? You defer the operation, so the driver cannot veto
> > it. This is why we have SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT
> > which is not deferred.
> 
> I missed that indeed, how would you feel about splitting the attribute
> setting into different phases:
> 
> - checking that the attribute setting is supported (caller context, so
> possibly atomic)
> - allocating and committing resources (deferred context)

Yes, this is what I suggested in the other thread. Lets continue
discussion there.

We are doing that when processing route notifications (for example), but
we are missing a check in caller context that resources can actually be
allocated. That's part of a bigger task to track resources in mlxsw.

> 
> For pretty much any DSA driver, we will have to be in sleepable context
> anyway because of MDIO, SPI, I2C, whatever transport layer.
> 
> Not sure how to best approach this now...
> -- 
> Florian

^ permalink raw reply

* [net-next PATCH V2 0/3] Fix page_pool API and dma address storage
From: Jesper Dangaard Brouer @ 2019-02-12 14:48 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
	Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
	Andrew Morton, mgorman, David S. Miller, Tariq Toukan

As pointed out by David Miller in [1] the current page_pool implementation
stores dma_addr_t in page->private. This won't work on 32-bit platforms with
64-bit DMA addresses since the page->private is an unsigned long and the
dma_addr_t a u64.

Since no driver is yet using the DMA mapping capabilities of the API let's
fix this by storing the information in 'struct page' and use that to store
and retrieve DMA addresses from network drivers.

As long as the addresses returned from dma_map_page() are aligned the first
bit, used by the compound pages code should not be set.

Ilias tested this on Espressobin driver mvneta, for which we have patches
for using the DMA API of page_pool.

[1]: https://lore.kernel.org/netdev/20181207.230655.1261252486319967024.davem@davemloft.net/

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
---

Ilias Apalodimas (1):
      net: page_pool: don't use page->private to store dma_addr_t

Jesper Dangaard Brouer (2):
      mm: add dma_addr_t to struct page
      page_pool: use DMA_ATTR_SKIP_CPU_SYNC for DMA mappings


 include/linux/mm_types.h |    7 +++++++
 net/core/page_pool.c     |   22 ++++++++++++++--------
 2 files changed, 21 insertions(+), 8 deletions(-)

--

^ permalink raw reply

* [net-next PATCH V2 1/3] mm: add dma_addr_t to struct page
From: Jesper Dangaard Brouer @ 2019-02-12 14:49 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
	Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
	Andrew Morton, mgorman, David S. Miller, Tariq Toukan
In-Reply-To: <154998290571.8783.11827147914798438839.stgit@firesoul>

The page_pool API is using page->private to store DMA addresses.
As pointed out by David Miller we can't use that on 32-bit architectures
with 64-bit DMA

This patch adds a new dma_addr_t struct to allow storing DMA addresses

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mm_types.h |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..581737bd0878 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -95,6 +95,13 @@ struct page {
 			 */
 			unsigned long private;
 		};
+		struct {	/* page_pool used by netstack */
+			/**
+			 * @dma_addr: page_pool requires a 64-bit value even on
+			 * 32-bit architectures.
+			 */
+			dma_addr_t dma_addr;
+		};
 		struct {	/* slab, slob and slub */
 			union {
 				struct list_head slab_list;	/* uses lru */


^ permalink raw reply related

* [net-next PATCH V2 2/3] net: page_pool: don't use page->private to store dma_addr_t
From: Jesper Dangaard Brouer @ 2019-02-12 14:49 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
	Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
	Andrew Morton, mgorman, David S. Miller, Tariq Toukan
In-Reply-To: <154998290571.8783.11827147914798438839.stgit@firesoul>

From: Ilias Apalodimas <ilias.apalodimas@linaro.org>

As pointed out by David Miller the current page_pool implementation
stores dma_addr_t in page->private.
This won't work on 32-bit platforms with 64-bit DMA addresses since the
page->private is an unsigned long and the dma_addr_t a u64.

A previous patch is adding dma_addr_t on struct page to accommodate this.
This patch adapts the page_pool related functions to use the newly added
struct for storing and retrieving DMA addresses from network drivers.

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 43a932cb609b..897a69a1477e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -136,7 +136,9 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
 		goto skip_dma_map;
 
-	/* Setup DMA mapping: use page->private for DMA-addr
+	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
+	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
+	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
 	 * This mapping is kept for lifetime of page, until leaving pool.
 	 */
 	dma = dma_map_page(pool->p.dev, page, 0,
@@ -146,7 +148,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 		put_page(page);
 		return NULL;
 	}
-	set_page_private(page, dma); /* page->private = dma; */
+	page->dma_addr = dma;
 
 skip_dma_map:
 	/* When page just alloc'ed is should/must have refcnt 1. */
@@ -175,13 +177,16 @@ EXPORT_SYMBOL(page_pool_alloc_pages);
 static void __page_pool_clean_page(struct page_pool *pool,
 				   struct page *page)
 {
+	dma_addr_t dma;
+
 	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
 		return;
 
+	dma = page->dma_addr;
 	/* DMA unmap */
-	dma_unmap_page(pool->p.dev, page_private(page),
+	dma_unmap_page(pool->p.dev, dma,
 		       PAGE_SIZE << pool->p.order, pool->p.dma_dir);
-	set_page_private(page, 0);
+	page->dma_addr = 0;
 }
 
 /* Return a page to the page allocator, cleaning up our state */


^ permalink raw reply related

* [net-next PATCH V2 3/3] page_pool: use DMA_ATTR_SKIP_CPU_SYNC for DMA mappings
From: Jesper Dangaard Brouer @ 2019-02-12 14:49 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
	Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
	Andrew Morton, mgorman, David S. Miller, Tariq Toukan
In-Reply-To: <154998290571.8783.11827147914798438839.stgit@firesoul>

As pointed out by Alexander Duyck, the DMA mapping done in page_pool needs
to use the DMA attribute DMA_ATTR_SKIP_CPU_SYNC.

As the principle behind page_pool keeping the pages mapped is that the
driver takes over the DMA-sync steps.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
---
 net/core/page_pool.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 897a69a1477e..7e624c2cd709 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -141,9 +141,9 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
 	 * This mapping is kept for lifetime of page, until leaving pool.
 	 */
-	dma = dma_map_page(pool->p.dev, page, 0,
-			   (PAGE_SIZE << pool->p.order),
-			   pool->p.dma_dir);
+	dma = dma_map_page_attrs(pool->p.dev, page, 0,
+				 (PAGE_SIZE << pool->p.order),
+				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
 	if (dma_mapping_error(pool->p.dev, dma)) {
 		put_page(page);
 		return NULL;
@@ -184,8 +184,9 @@ static void __page_pool_clean_page(struct page_pool *pool,
 
 	dma = page->dma_addr;
 	/* DMA unmap */
-	dma_unmap_page(pool->p.dev, dma,
-		       PAGE_SIZE << pool->p.order, pool->p.dma_dir);
+	dma_unmap_page_attr(pool->p.dev, dma,
+			    PAGE_SIZE << pool->p.order, pool->p.dma_dir,
+			    DMA_ATTR_SKIP_CPU_SYNC);
 	page->dma_addr = 0;
 }
 


^ permalink raw reply related

* Re: [PATCH] ser_gigaset: mark expected switch fall-through
From: Gustavo A. R. Silva @ 2019-02-12 14:58 UTC (permalink / raw)
  To: Paul Bolle
  Cc: gigaset307x-common, netdev, linux-kernel, Kees Cook, Karsten Keil
In-Reply-To: <3dbf6ecb46ad65b3b4bddd4154e31fde2c79d13d.camel@tiscali.nl>



On 2/12/19 2:45 AM, Paul Bolle wrote:
> Gustavo A. R. Silva schreef op ma 11-02-2019 om 16:34 [-0600]:
>> In preparation to enabling -Wimplicit-fallthrough, mark switch
>> cases where we are expecting to fall through.
>>
>> This patch fixes the following warning:
>>
>> drivers/isdn/gigaset/ser-gigaset.c: In function ‘gigaset_tty_ioctl’:
>> drivers/isdn/gigaset/ser-gigaset.c:627:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
>>    switch (arg) {
>>    ^~~~~~
>> drivers/isdn/gigaset/ser-gigaset.c:638:2: note: here
>>   default:
>>   ^~~~~~~
>>
>> Warning level 3 was used: -Wimplicit-fallthrough=3
>>
>> Notice that, in this particular case, the code comment is modified
>> in accordance with what GCC is expecting to find.
>>
>> This patch is part of the ongoing efforts to enable
>> -Wimplicit-fallthrough.
>>
>> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> 
> Acked-by: Paul Bolle <pebolle@tiscali.nl>
> 

Thanks, Paul.

--
Gustavo

^ permalink raw reply

* Re: [RFC, PATCH] net: page_pool: Don't use page->private to store dma_addr_t
From: Tariq Toukan @ 2019-02-12 14:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Ilias Apalodimas, Matthew Wilcox, David Miller,
	toke@redhat.com, netdev@vger.kernel.org,
	mgorman@techsingularity.net, linux-mm@kvack.org
In-Reply-To: <20190212144938.36dd45b4@carbon>



On 2/12/2019 3:49 PM, Jesper Dangaard Brouer wrote:
> On Tue, 12 Feb 2019 12:39:59 +0000
> Tariq Toukan <tariqt@mellanox.com> wrote:
> 
>> On 2/11/2019 7:14 PM, Eric Dumazet wrote:
>>>
>>> On 02/11/2019 12:53 AM, Tariq Toukan wrote:
>>>>   
>>>    
>>>> Hi,
>>>>
>>>> It's great to use the struct page to store its dma mapping, but I am
>>>> worried about extensibility.
>>>> page_pool is evolving, and it would need several more per-page fields.
>>>> One of them would be pageref_bias, a planned optimization to reduce the
>>>> number of the costly atomic pageref operations (and replace existing
>>>> code in several drivers).
>>>>   
>>>
>>> But the point about pageref_bias is to place it in a different
>>> cache line than "struct page"
> 
> Yes, exactly.
> 
> 
>>> The major cost is having a cache line bouncing between producer and
>>> consumer.
>>
>> pageref_bias is meant to be dirtied only by the page requester, i.e. the
>> NIC driver / page_pool.
>> All other components (basically, SKB release flow / put_page) should
>> continue working with the atomic page_refcnt, and not dirty the
>> pageref_bias.
>>
>> However, what bothers me more is another issue.
>> The optimization doesn't cleanly combine with the new page_pool
>> direction for maintaining a queue for "available" pages, as the put_page
>> flow would need to read pageref_bias, asynchronously, and act accordingly.
>>
>> The suggested hook in put_page (to catch the 2 -> 1 "biased refcnt"
>> transition) causes a problem to the traditional pageref_bias idea, as it
>> implies a new point in which the pageref_bias field is read
>> *asynchronously*. This would risk missing the this critical 2 -> 1
>> transition! Unless pageref_bias is atomic...
> 
> I want to stop you here...
> 
> It seems to me that you are trying to shoehorn in a refcount
> optimization into page_pool.  The page_pool is optimized for the XDP
> case of one-frame-per-page, where we can avoid changing the refcount,
> and tradeoff memory usage for speed.  It is compatible with the elevated
> refcount usage, but that is not the optimization target.
> 
> If the case you are optimizing for is "packing" more frames in a page,
> then the page_pool might be the wrong choice.  To me it would make more
> sense to create another enum xdp_mem_type, that generalize the
> pageref_bias tricks also used by some drivers.
> 

Hi Jesper,

We share the same interest, I tried to combine the pageref_bias 
optimization on top of the put_page hook, but turns out it doesn't fit. 
That's all.

Of course, I am aware of the fact that page_pool is optimized for XDP 
use cases. But, as drivers prefer a single flow for their 
page-allocation management, rather than having several allocation/free 
methods depending on whether XDP program is loaded or not, the 
performance for non-XDP flow also matters.
I know you're not ignoring this, the fact that you're adding 
compatibility for the elevated refcount usage is a key step in this 
direction.

Another key benefit of page_pool is providing a netdev-optimized API 
that can replace the page allocation / dma mapping logic of the 
different drivers, and take it into one common shared unit.
This helps remove many LOCs from drivers, significantly improves 
modularity, and eases the support of new optimizations.
By improving the general non-XDP flow (packing several packets in a 
page) you encourage more and more drivers to do the transition.

We all look to further improve the page-pool performance. The 
pageref_bias idea does not fit, that's fine.
We can still introduce an API for bulk page-allocation, it will improve 
both XDP and non-XDP flows.

Regards,
Tariq

^ permalink raw reply

* Re: [bpf-next 1/2] tcp: replace SOCK_DEBUG() with tcp_stats()
From: Eric Dumazet @ 2019-02-12 15:07 UTC (permalink / raw)
  To: Yafang Shao, daniel, ast
  Cc: yhs, brakmo, edumazet, davem, netdev, linux-kernel, shaoyafang
In-Reply-To: <1549971097-12627-2-git-send-email-laoar.shao@gmail.com>



On 02/12/2019 03:31 AM, Yafang Shao wrote:
> SOCK_DEBUG is a very ancient debugging interface, and it's not very useful
> for debugging.
> So this patch removes the SOCK_DEBUG() and introduce a new function
> tcp_stats() to trace this kind of events.
> Some MIBs are added for these events.
> 
> Regarding the SO_DEBUG in sock_{s,g}etsockopt, I think it is better to
> keep as-is, because if we return an errno to tell the application that
> this optname isn't supported for TCP, it may break the application.
> The application still can use this option but don't take any effect for
> TCP.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/uapi/linux/snmp.h |  3 +++
>  net/ipv4/proc.c           |  3 +++
>  net/ipv4/tcp_input.c      | 26 +++++++++++---------------
>  net/ipv6/tcp_ipv6.c       |  2 --
>  4 files changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index 86dc24a..fd5c09c 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -283,6 +283,9 @@ enum
>  	LINUX_MIB_TCPACKCOMPRESSED,		/* TCPAckCompressed */
>  	LINUX_MIB_TCPZEROWINDOWDROP,		/* TCPZeroWindowDrop */
>  	LINUX_MIB_TCPRCVQDROP,			/* TCPRcvQDrop */
> +	LINUX_MIB_TCPINVALIDACK,		/* TCPInvalidAck */
> +	LINUX_MIB_TCPOLDACK,			/* TCPOldAck */
> +	LINUX_MIB_TCPPARTIALPACKET,		/* TCPPartialPacket */
>  	__LINUX_MIB_MAX
>  };
>  
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index c3610b3..1b0320a 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -291,6 +291,9 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
>  	SNMP_MIB_ITEM("TCPAckCompressed", LINUX_MIB_TCPACKCOMPRESSED),
>  	SNMP_MIB_ITEM("TCPZeroWindowDrop", LINUX_MIB_TCPZEROWINDOWDROP),
>  	SNMP_MIB_ITEM("TCPRcvQDrop", LINUX_MIB_TCPRCVQDROP),
> +	SNMP_MIB_ITEM("TCPInvalidAck", LINUX_MIB_TCPINVALIDACK),
> +	SNMP_MIB_ITEM("TCPOldAck", LINUX_MIB_TCPOLDACK),
> +	SNMP_MIB_ITEM("TCPPartialPacket", LINUX_MIB_TCPPARTIALPACKET),
>  	SNMP_MIB_SENTINEL
>  };
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 7a027dec..88deb1f 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3554,6 +3554,11 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
>  	return delivered;
>  }
>  
> +static void tcp_stats(struct sock *sk, int mib_idx)
> +{
> +	NET_INC_STATS(sock_net(sk), mib_idx);
> +}

This is not a very descriptive name.

Why is it static, and in net/ipv4/tcp_input.c ???

> +
>  /* This routine deals with incoming acks, but not outgoing ones. */
>  static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  {
> @@ -3715,7 +3720,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  	return 1;
>  
>  invalid_ack:
> -	SOCK_DEBUG(sk, "Ack %u after %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
> +	tcp_stats(sk, LINUX_MIB_TCPINVALIDACK);
>  	return -1;
>  
>  old_ack:
> @@ -3731,7 +3736,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  		tcp_xmit_recovery(sk, rexmit);
>  	}
>  
> -	SOCK_DEBUG(sk, "Ack %u before %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
> +	tcp_stats(sk, LINUX_MIB_TCPOLDACK);
>  	return 0;
>  }
>


These counters will add noise to an already crowded MIB space.

What bug do you expect to track and fix with these ?

I see many TCP patches coming adding icache pressure, enabling companies to build their own modified
TCP stack, but no real meat.


^ permalink raw reply

* [PATCH] NETWORKING: avoid use IPCB in cipso_v4_error
From: Nazarov Sergey @ 2019-02-12 15:10 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: linux-security-module@vger.kernel.org, davem, kuznet, yoshfuji,
	Paul Moore
In-Reply-To: <CAHC9VhSDot+uNDi7ULoT7dry4FW9Jgp=NcqDh6TmA=z--kmqkA@mail.gmail.com>

Since cipso_v4_error might be called from different network stack layers, we can't safely use icmp_send there.
icmp_send copies IP options with ip_option_echo, which uses IPCB to take access to IP header compiled data.
But after commit 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses"), IPCB can't be used
above IP layer.
This patch fixes the problem by creating in cipso_v4_error a local copy of compiled IP options and using it with
introduced __icmp_send function. This looks some overloaded, but in quite rare error conditions only.

The original discussion is here:
https://lore.kernel.org/linux-security-module/16659801547571984@sas1-890ba5c2334a.qloud-c.yandex.net/

Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
---
 include/net/icmp.h    |    9 ++++++++-
 net/ipv4/cipso_ipv4.c |   18 ++++++++++++++++--
 net/ipv4/icmp.c       |    7 ++++---
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/include/net/icmp.h b/include/net/icmp.h
index 6ac3a5b..e0f709d 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -22,6 +22,7 @@
 
 #include <net/inet_sock.h>
 #include <net/snmp.h>
+#include <net/ip.h>
 
 struct icmp_err {
   int		errno;
@@ -39,7 +40,13 @@ struct icmp_err {
 struct sk_buff;
 struct net;
 
-void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
+void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
+		 const struct ip_options *opt);
+static inline void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
+{
+	__icmp_send(skb_in, type, code, info, &IPCB(skb_in)->opt);
+}
+
 int icmp_rcv(struct sk_buff *skb);
 int icmp_err(struct sk_buff *skb, u32 info);
 int icmp_init(void);
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 777fa3b..234d12e 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1735,13 +1735,27 @@ int cipso_v4_validate(const struct sk_buff *skb, unsigned char **option)
  */
 void cipso_v4_error(struct sk_buff *skb, int error, u32 gateway)
 {
+	unsigned char optbuf[sizeof(struct ip_options) + 40];
+	struct ip_options *opt = (struct ip_options *)optbuf;
+
 	if (ip_hdr(skb)->protocol == IPPROTO_ICMP || error != -EACCES)
 		return;
 
+	/*
+	 * We might be called above the IP layer,
+	 * so we can not use icmp_send and IPCB here.
+	 */
+
+	memset(opt, 0, sizeof(struct ip_options));
+	opt->optlen = ip_hdr(skb)->ihl*4 - sizeof(struct iphdr);
+	memcpy(opt->__data, (unsigned char *)&(ip_hdr(skb)[1]), opt->optlen);
+	if (ip_options_compile(dev_net(skb->dev), opt, NULL))
+		return;
+
 	if (gateway)
-		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_NET_ANO, 0);
+		__icmp_send(skb, ICMP_DEST_UNREACH, ICMP_NET_ANO, 0, opt);
 	else
-		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_ANO, 0);
+		__icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_ANO, 0, opt);
 }
 
 /**
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 065997f..3f24414 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -570,7 +570,8 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
  *			MUST reply to only the first fragment.
  */
 
-void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
+void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
+		 const struct ip_options *opt)
 {
 	struct iphdr *iph;
 	int room;
@@ -691,7 +692,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 					  iph->tos;
 	mark = IP4_REPLY_MARK(net, skb_in->mark);
 
-	if (ip_options_echo(net, &icmp_param.replyopts.opt.opt, skb_in))
+	if (__ip_options_echo(net, &icmp_param.replyopts.opt.opt, skb_in, opt))
 		goto out_unlock;
 
 
@@ -742,7 +743,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	local_bh_enable();
 out:;
 }
-EXPORT_SYMBOL(icmp_send);
+EXPORT_SYMBOL(__icmp_send);
 
 
 static void icmp_socket_deliver(struct sk_buff *skb, u32 info)
--


^ permalink raw reply related

* Re: [bpf-next 2/2] bpf: add BPF_SOCK_OPS_STATS_CB for tcp_stats()
From: Eric Dumazet @ 2019-02-12 15:11 UTC (permalink / raw)
  To: Yafang Shao, daniel, ast
  Cc: yhs, brakmo, edumazet, davem, netdev, linux-kernel, shaoyafang
In-Reply-To: <1549971097-12627-3-git-send-email-laoar.shao@gmail.com>



On 02/12/2019 03:31 AM, Yafang Shao wrote:
> Introuce this new op BPF_SOCK_OPS_STATS_CB for tcp_stats() such that it
> can be traced via BPF on a per socket basis.
> There's one argument in BPF_SOCK_OPS_STATS_CB, which is Linux MIB index
> LINUX_MIB_* to indicate the TCP event.
> All these Linux MIBs are defined in include/uapi/linux/snmp.h.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/uapi/linux/bpf.h | 5 +++++
>  net/ipv4/tcp_input.c     | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1777fa0..0314ddd 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2894,6 +2894,11 @@ enum {
>  	BPF_SOCK_OPS_TCP_LISTEN_CB,	/* Called on listen(2), right after
>  					 * socket transition to LISTEN state.
>  					 */
> +	BPF_SOCK_OPS_STATS_CB,		/*
> +					 * Called on tcp_stats().
> +					 * Arg1: Linux MIB index
> +					 * 	 LINUX_MIB_*
> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 88deb1f..4acf458 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3557,6 +3557,7 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
>  static void tcp_stats(struct sock *sk, int mib_idx)
>  {
>  	NET_INC_STATS(sock_net(sk), mib_idx);
> +	tcp_call_bpf(sk, BPF_SOCK_OPS_STATS_CB, 1, &mib_idx);
>  }
>  
>  /* This routine deals with incoming acks, but not outgoing ones. */
> 

If the plan is to add to all NET_INC_STATS() calls in TCP an additional tcp_call_bpf()
I will say no.

So far, tcp_call_bpf() has not been used in fast path.


^ permalink raw reply

* Re: [RFC, PATCH] net: page_pool: Don't use page->private to store dma_addr_t
From: Eric Dumazet @ 2019-02-12 15:15 UTC (permalink / raw)
  To: Tariq Toukan, Eric Dumazet, Ilias Apalodimas, Matthew Wilcox,
	brouer@redhat.com
  Cc: David Miller, toke@redhat.com, netdev@vger.kernel.org,
	mgorman@techsingularity.net, linux-mm@kvack.org
In-Reply-To: <27e97aac-f25b-d46c-3e70-7d0d44f784b5@mellanox.com>



On 02/12/2019 04:39 AM, Tariq Toukan wrote:
> 
> 
> On 2/11/2019 7:14 PM, Eric Dumazet wrote:
>>
>>
>> On 02/11/2019 12:53 AM, Tariq Toukan wrote:
>>>
>>
>>> Hi,
>>>
>>> It's great to use the struct page to store its dma mapping, but I am
>>> worried about extensibility.
>>> page_pool is evolving, and it would need several more per-page fields.
>>> One of them would be pageref_bias, a planned optimization to reduce the
>>> number of the costly atomic pageref operations (and replace existing
>>> code in several drivers).
>>>
>>
>> But the point about pageref_bias is to place it in a different cache line than "struct page"
>>
>> The major cost is having a cache line bouncing between producer and consumer.
>>
> 
> pageref_bias is meant to be dirtied only by the page requester, i.e. the 
> NIC driver / page_pool.
> All other components (basically, SKB release flow / put_page) should 
> continue working with the atomic page_refcnt, and not dirty the 
> pageref_bias.

This is exactly my point.

You suggested to put pageref_bias in struct page, which breaks this completely.

pageref_bias is better kept in a driver structure, with appropriate prefetching
since most NIC use a ring buffer for their queues.

The dma address _can_ be put in the struct page, since the driver does not dirty it
and does not even read it when page can be recycled.


^ permalink raw reply

* [PATCH net-next 1/7] net/smc: reset cursor update required flag
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>

From: Karsten Graul <kgraul@linux.ibm.com>

When an updated rx_cursor_confirmed field was sent to the peer then
reset the cons_curs_upd_req flag. And remove the duplicate reset and
cursor update in smc_tx_consumer_update().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 net/smc/smc_cdc.c | 5 ++++-
 net/smc/smc_tx.c  | 3 ---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index a712c9f8699b..99d9d6e85dfb 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -105,8 +105,10 @@ int smc_cdc_msg_send(struct smc_connection *conn,
 			    &conn->local_tx_ctrl, conn);
 	smc_curs_copy(&cfed, &((struct smc_host_cdc_msg *)wr_buf)->cons, conn);
 	rc = smc_wr_tx_send(link, (struct smc_wr_tx_pend_priv *)pend);
-	if (!rc)
+	if (!rc) {
 		smc_curs_copy(&conn->rx_curs_confirmed, &cfed, conn);
+		conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
+	}
 
 	return rc;
 }
@@ -194,6 +196,7 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
 	if (rc)
 		return rc;
 	smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
+	conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
 	/* Calculate transmitted data and increment free send buffer space */
 	diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
 			     &conn->tx_curs_sent);
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index f93f3580c100..ce9586bce364 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -610,9 +610,6 @@ void smc_tx_consumer_update(struct smc_connection *conn, bool force)
 					      SMC_TX_WORK_DELAY);
 			return;
 		}
-		smc_curs_copy(&conn->rx_curs_confirmed,
-			      &conn->local_tx_ctrl.cons, conn);
-		conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
 	}
 	if (conn->local_rx_ctrl.prod_flags.write_blocked &&
 	    !atomic_read(&conn->bytes_to_rcv))
-- 
2.16.4


^ permalink raw reply related

* [PATCH net-next 2/7] net/smc: move wake up of close waiter
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>

From: Karsten Graul <kgraul@linux.ibm.com>

Move the call to smc_close_wake_tx_prepared() (which wakes up a possibly
waiting close processing that might wait for 'all data sent') to
smc_tx_sndbuf_nonempty() (which is the main function to send data).

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 net/smc/smc_cdc.c | 2 --
 net/smc/smc_tx.c  | 7 +++++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 99d9d6e85dfb..780b36c69292 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -290,8 +290,6 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
 	/* trigger sndbuf consumer: RDMA write into peer RMBE and CDC */
 	if (diff_cons && smc_tx_prepared_sends(conn)) {
 		smc_tx_sndbuf_nonempty(conn);
-		/* trigger socket release if connection closed */
-		smc_close_wake_tx_prepared(smc);
 	}
 	if (diff_cons && conn->urg_tx_pend &&
 	    atomic_read(&conn->peer_rmbe_space) == conn->peer_rmbe_size) {
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index ce9586bce364..dd10a913b38e 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -24,6 +24,7 @@
 #include "smc.h"
 #include "smc_wr.h"
 #include "smc_cdc.h"
+#include "smc_close.h"
 #include "smc_ism.h"
 #include "smc_tx.h"
 
@@ -554,6 +555,12 @@ int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
 	else
 		rc = smcr_tx_sndbuf_nonempty(conn);
 
+	if (!rc) {
+		/* trigger socket release if connection is closing */
+		struct smc_sock *smc = container_of(conn, struct smc_sock,
+						    conn);
+		smc_close_wake_tx_prepared(smc);
+	}
 	return rc;
 }
 
-- 
2.16.4


^ permalink raw reply related

* [PATCH net-next 3/7] net/smc: no delay for free tx buffer wait
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>

From: Karsten Graul <kgraul@linux.ibm.com>

When no free transfer buffers are available then a work to call
smc_tx_work() is scheduled. Set the schedule delay to zero, because for
the out-of-buffers condition the work can start immediately and will
block in the called function smc_wr_tx_get_free_slot(), waiting for free
buffers.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 net/smc/smc_tx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index dd10a913b38e..a3bff08ff8c8 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -28,7 +28,7 @@
 #include "smc_ism.h"
 #include "smc_tx.h"
 
-#define SMC_TX_WORK_DELAY	HZ
+#define SMC_TX_WORK_DELAY	0
 #define SMC_TX_CORK_DELAY	(HZ >> 2)	/* 250 ms */
 
 /***************************** sndbuf producer *******************************/
-- 
2.16.4


^ permalink raw reply related

* [PATCH net-next 4/7] net/smc: reduce amount of status updates to peer
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>

From: Karsten Graul <kgraul@linux.ibm.com>

In smc_cdc_msg_recv_action() the received cdc message is evaluated.
To reduce the number of messaged triggered by this evaluation the logic
is streamlined. For the write_blocked condition we do not need to send
a response immediately. The remaining conditions can be put together
into one if clause.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 net/smc/smc_cdc.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 780b36c69292..28bbdb04bc35 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -273,24 +273,18 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
 		smp_mb__after_atomic();
 		smc->sk.sk_data_ready(&smc->sk);
 	} else {
-		if (conn->local_rx_ctrl.prod_flags.write_blocked ||
-		    conn->local_rx_ctrl.prod_flags.cons_curs_upd_req ||
-		    conn->local_rx_ctrl.prod_flags.urg_data_pending) {
-			if (conn->local_rx_ctrl.prod_flags.urg_data_pending)
-				conn->urg_state = SMC_URG_NOTYET;
-			/* force immediate tx of current consumer cursor, but
-			 * under send_lock to guarantee arrival in seqno-order
-			 */
-			if (smc->sk.sk_state != SMC_INIT)
-				smc_tx_sndbuf_nonempty(conn);
-		}
+		if (conn->local_rx_ctrl.prod_flags.write_blocked)
+			smc->sk.sk_data_ready(&smc->sk);
+		if (conn->local_rx_ctrl.prod_flags.urg_data_pending)
+			conn->urg_state = SMC_URG_NOTYET;
 	}
 
-	/* piggy backed tx info */
 	/* trigger sndbuf consumer: RDMA write into peer RMBE and CDC */
-	if (diff_cons && smc_tx_prepared_sends(conn)) {
+	if ((diff_cons && smc_tx_prepared_sends(conn)) ||
+	    conn->local_rx_ctrl.prod_flags.cons_curs_upd_req ||
+	    conn->local_rx_ctrl.prod_flags.urg_data_pending)
 		smc_tx_sndbuf_nonempty(conn);
-	}
+
 	if (diff_cons && conn->urg_tx_pend &&
 	    atomic_read(&conn->peer_rmbe_space) == conn->peer_rmbe_size) {
 		/* urg data confirmed by peer, indicate we're ready for more */
-- 
2.16.4


^ permalink raw reply related

* [PATCH net-next 6/7] net/smc: check port_idx of ib event
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>

From: Karsten Graul <kgraul@linux.ibm.com>

For robustness protect of higher port numbers than expected to avoid
setting bits behind our port_event_mask. In case of an DEVICE_FATAL
event all ports must be checked. The IB_EVENT_GID_CHANGE event is
provided in the global event handler, so handle it there. And handle a
QP_FATAL event instead of an DEVICE_FATAL event in the qp handler.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 net/smc/smc_ib.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 76487a16934e..0b244be24fe0 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -257,12 +257,20 @@ static void smc_ib_global_event_handler(struct ib_event_handler *handler,
 	smcibdev = container_of(handler, struct smc_ib_device, event_handler);
 
 	switch (ibevent->event) {
-	case IB_EVENT_PORT_ERR:
 	case IB_EVENT_DEVICE_FATAL:
+		/* terminate all ports on device */
+		for (port_idx = 0; port_idx < SMC_MAX_PORTS; port_idx++)
+			set_bit(port_idx, &smcibdev->port_event_mask);
+		schedule_work(&smcibdev->port_event_work);
+		break;
+	case IB_EVENT_PORT_ERR:
 	case IB_EVENT_PORT_ACTIVE:
+	case IB_EVENT_GID_CHANGE:
 		port_idx = ibevent->element.port_num - 1;
-		set_bit(port_idx, &smcibdev->port_event_mask);
-		schedule_work(&smcibdev->port_event_work);
+		if (port_idx < SMC_MAX_PORTS) {
+			set_bit(port_idx, &smcibdev->port_event_mask);
+			schedule_work(&smcibdev->port_event_work);
+		}
 		break;
 	default:
 		break;
@@ -294,13 +302,13 @@ static void smc_ib_qp_event_handler(struct ib_event *ibevent, void *priv)
 	u8 port_idx;
 
 	switch (ibevent->event) {
-	case IB_EVENT_DEVICE_FATAL:
-	case IB_EVENT_GID_CHANGE:
-	case IB_EVENT_PORT_ERR:
+	case IB_EVENT_QP_FATAL:
 	case IB_EVENT_QP_ACCESS_ERR:
 		port_idx = ibevent->element.qp->port - 1;
-		set_bit(port_idx, &smcibdev->port_event_mask);
-		schedule_work(&smcibdev->port_event_work);
+		if (port_idx < SMC_MAX_PORTS) {
+			set_bit(port_idx, &smcibdev->port_event_mask);
+			schedule_work(&smcibdev->port_event_work);
+		}
 		break;
 	default:
 		break;
-- 
2.16.4


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox