Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [patch net-next v2 00/10] introduce rocker switch driver with hardware accelerated datapath api - phase 1: bridge fdb offload
From: John Fastabend @ 2014-11-10 22:23 UTC (permalink / raw)
  To: Simon Horman
  Cc: Jamal Hadi Salim, Jiri Pirko, netdev, davem, nhorman, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, ben, stephen,
	jeffrey.t.kirsher, vyasevic, xiyou.wangcong, john.r.fastabend,
	edumazet, sfeldma, f.fainelli, roopa, linville, jasowang,
	ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye, alexander.h.duyck,
	john.ronciak, mleitner, shrijeet, gospo
In-Reply-To: <20141110045830.GC17246@vergenet.net>

On 11/09/2014 08:58 PM, Simon Horman wrote:
> On Sun, Nov 09, 2014 at 11:03:40PM -0500, Jamal Hadi Salim wrote:
>> Hi Simon,
>>
>> On 11/09/14 22:46, Simon Horman wrote:
>>> Hi Jamal, Hi Jiri,
>>>
>>> On a somewhat related note I am also wondering what if any progress has
>>> been made regarding discussions of (and code for) the following:
>>>
>>> 1. Exposing flow tables to user-space
>>>     - I realise that this is Open vSwitch specific to some extent
>>>       but I am in no way implying that it should be done instead of
>>>       non-Open vSwitch specific work.
>>>     - Jiri, IIRC this was part ~v2 of your earlier offload patchset
>>>
>>
>> I dont know what Rocker crowd is doing; however, I know
>> John F. has been doing some work which i have stared at
>> and I was hoping to join in with Ben's effort and show tc flow
>> offload on the realtek chip in my infinite spare time unles.
>> (for both Linux bridge and ports).
>> The priority is to merge the obvious bits first.
>
> Merging the obvious bits first is quite fine my me.
>

+1

>>> 2. Describing Switch Hardware
>>>     - I see John Fastabend moving forwards on this in his git repository
>>>       https://github.com/jrfastab/flow-net-next
>>>
>>> The way that I see things is that both of the above could be exposed via
>>> netlink. And that the first at least could be backed by NDOs.  As such I
>>> see this work as complementary and perhaps applying on top of this
>>> patchset. If I am mistaken in this regards it would be good to know :)
>>>
>>
>> You are correct - I will let John speak on his work, but
>> that is the intent.
>> The challenge is there are many schools of thoughts and i am hoping
>> it is not an arms race.
>
> That is also my hope.

My intent is to submit the Flow API bits once the base rocker switch
gets committed. I've implemented the Flow API against ixgbe and a
sadly a proprietary SDK. I'll implement it against the rocker switch
as well.

>
>>> I am of course also interested to know if the above are moving forwards.
>>> To be clear I am very interested in being able to use these APIs to
>>> perform Open vSwitch offloads and I am very happy to help.
>>> (Jamal: I'm also interested in non-Open vSwitch offloads :)
>>>
>>

I think they are moving forward. I have some code cleanup to do on
the flow API, but its mostly in place. Then I want to implement
an example on Rocker switch so we could experiment with something
why we wait for a real hardware driver. From my side assuming I at
least got it close to correct should be doable in the next few week.

After that I want to work with Jesse/Jamal and look at integrating
with OVS and other stacks. I thought a bit about the OVS integration
path but I'll hold that discussion for the moment.

Simon, if your feeling adventurous any feedback on the repo link
would be great. I still need to smash the commit log into something
coherent though at the moment you can see all the errors and rewrites,
etc as I made them.

>> Hey, OVS should be able to use these APIs; i am just interested in making
>> sure they are not just for OVS or OF. Then we are all happy;->
>
> I think we are all happy :)

I'm happy :) Also my intent is the flow API is more general then
just OVS. My view is OVS should be one user of the API.

.John

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [patch net-next v2 03/10] rtnl: expose physical switch id for particular device
From: Jiri Pirko @ 2014-11-10 22:14 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma,
	f.fainelli, linville, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck, john.ronciak,
	mleitner, shrijeet, gospo, bcrl
In-Reply-To: <5460FCB0.1040409@cumulusnetworks.com>

Mon, Nov 10, 2014 at 06:58:08PM CET, roopa@cumulusnetworks.com wrote:
>On 11/9/14, 2:51 AM, Jiri Pirko wrote:
>>The netdevice represents a port in a switch, it will expose
>>IFLA_PHYS_SWITCH_ID value via rtnl. Two netdevices with the same value
>>belong to one physical switch.
>>
>>Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>>---
>>  include/uapi/linux/if_link.h |  1 +
>>  net/core/rtnetlink.c         | 26 +++++++++++++++++++++++++-
>>  2 files changed, 26 insertions(+), 1 deletion(-)
>>
>>diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>>index 7072d83..4163753 100644
>>--- a/include/uapi/linux/if_link.h
>>+++ b/include/uapi/linux/if_link.h
>>@@ -145,6 +145,7 @@ enum {
>>  	IFLA_CARRIER,
>>  	IFLA_PHYS_PORT_ID,
>>  	IFLA_CARRIER_CHANGES,
>>+	IFLA_PHYS_SWITCH_ID,
>
>Jiri, since we have not really converged on the switchdev class or having a
>separate switchdev instance,
>am thinking it is better if we dont expose any such switch_id to userspace
>yet until absolutely needed. Do you need it today ?
>There is no real in kernel hw switch driver that will use it today. And quite
>likely this will need to change when we introduce real hw switch drivers.

When and if the switchdev class is introduced, switch id can happily
live on. It is nothing against it. Userspace should use this id to group
the ports of physical switch.

>
>
>>  	__IFLA_MAX
>>  };
>>diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>>index 1087c6d..f839354 100644
>>--- a/net/core/rtnetlink.c
>>+++ b/net/core/rtnetlink.c
>>@@ -43,6 +43,7 @@
>>  #include <linux/inet.h>
>>  #include <linux/netdevice.h>
>>+#include <net/switchdev.h>
>>  #include <net/ip.h>
>>  #include <net/protocol.h>
>>  #include <net/arp.h>
>>@@ -868,7 +869,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
>>  	       + rtnl_port_size(dev, ext_filter_mask) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
>>  	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
>>  	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
>>-	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_PORT_ID */
>>+	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
>>+	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_SWITCH_ID */
>>  }
>>  static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
>>@@ -967,6 +969,24 @@ static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
>>  	return 0;
>>  }
>>+static int rtnl_phys_switch_id_fill(struct sk_buff *skb, struct net_device *dev)
>>+{
>>+	int err;
>>+	struct netdev_phys_item_id psid;
>>+
>>+	err = netdev_sw_parent_id_get(dev, &psid);
>>+	if (err) {
>>+		if (err == -EOPNOTSUPP)
>>+			return 0;
>>+		return err;
>>+	}
>>+
>>+	if (nla_put(skb, IFLA_PHYS_SWITCH_ID, psid.id_len, psid.id))
>>+		return -EMSGSIZE;
>>+
>>+	return 0;
>>+}
>>+
>>  static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
>>  			    int type, u32 pid, u32 seq, u32 change,
>>  			    unsigned int flags, u32 ext_filter_mask)
>>@@ -1039,6 +1059,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
>>  	if (rtnl_phys_port_id_fill(skb, dev))
>>  		goto nla_put_failure;
>>+	if (rtnl_phys_switch_id_fill(skb, dev))
>>+		goto nla_put_failure;
>>+
>>  	attr = nla_reserve(skb, IFLA_STATS,
>>  			sizeof(struct rtnl_link_stats));
>>  	if (attr == NULL)
>>@@ -1198,6 +1221,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
>>  	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
>>  	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
>>  	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
>>+	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
>>  };
>>  static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
>

^ permalink raw reply

* [PATCH net-next] mlx4: restore conditional call to napi_complete_done()
From: Eric Dumazet @ 2014-11-10 22:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, amirv, ogerlitz, willemb
In-Reply-To: <1415642488.9613.5.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <edumazet@google.com>

After commit 1a28817282 ("mlx4: use napi_complete_done()") we ended up
calling napi_complete_done() in the case NAPI poll consumed all its
budget.

This added extra interrupt pressure, this patch restores proper
behavior.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 1a28817282 ("mlx4: use napi_complete_done()")
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 46ee78326f1fecb14f42d1afcc113fa18087cfa6..5a193f40a14c2a89fecfbac5dce0771e15e69861 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -910,13 +910,14 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 		cpu_curr = smp_processor_id();
 		aff = irq_desc_get_irq_data(cq->irq_desc)->affinity;
 
-		if (unlikely(!cpumask_test_cpu(cpu_curr, aff))) {
-			/* Current cpu is not according to smp_irq_affinity -
-			 * probably affinity changed. need to stop this NAPI
-			 * poll, and restart it on the right CPU
-			 */
-			done = 0;
-		}
+		if (likely(cpumask_test_cpu(cpu_curr, aff)))
+			return budget;
+
+		/* Current cpu is not according to smp_irq_affinity -
+		 * probably affinity changed. need to stop this NAPI
+		 * poll, and restart it on the right CPU
+		 */
+		done = 0;
 	}
 	/* Done for now */
 	napi_complete_done(napi, done);

^ permalink raw reply related

* Re: [patch net-next v2 05/10] rocker: introduce rocker switch driver
From: John Fastabend @ 2014-11-10 22:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma,
	f.fainelli, roopa, linville, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck, john.ronciak,
	mleitner, shrijeet, gospo, bcrl
In-Reply-To: <1415530280-9190-6-git-send-email-jiri@resnulli.us>

On 11/09/2014 02:51 AM, Jiri Pirko wrote:
> This patch introduces the first driver to benefit from the switchdev
> infrastructure and to implement newly introduced switch ndos. This is a
> driver for emulated switch chip implemented in qemu:
> https://github.com/sfeldma/qemu-rocker/
>
> This patch is a result of joint work with Scott Feldman.
>
> Signed-off-by: Scott Feldman <sfeldma@gmail.com>
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---
>   MAINTAINERS                          |    7 +
>   drivers/net/ethernet/Kconfig         |    1 +
>   drivers/net/ethernet/Makefile        |    1 +
>   drivers/net/ethernet/rocker/Kconfig  |   27 +
>   drivers/net/ethernet/rocker/Makefile |    5 +
>   drivers/net/ethernet/rocker/rocker.c | 2060 ++++++++++++++++++++++++++++++++++
>   drivers/net/ethernet/rocker/rocker.h |  427 +++++++
>   7 files changed, 2528 insertions(+)
>   create mode 100644 drivers/net/ethernet/rocker/Kconfig
>   create mode 100644 drivers/net/ethernet/rocker/Makefile
>   create mode 100644 drivers/net/ethernet/rocker/rocker.c
>   create mode 100644 drivers/net/ethernet/rocker/rocker.h
>

[...]

> +
> +static netdev_tx_t rocker_port_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct rocker_port *rocker_port = netdev_priv(dev);
> +	struct rocker *rocker = rocker_port->rocker;
> +	struct rocker_desc_info *desc_info;
> +	struct rocker_tlv *frags;
> +	int i;
> +	int err;
> +
> +	desc_info = rocker_desc_head_get(&rocker_port->tx_ring);
> +	if (unlikely(!desc_info)) {
> +		if (net_ratelimit())

Could you have a netif_stop_queue() here as well same as below? Not
that optimizing the xmit routine is the interesting part of this patch.
But I guess this is just some strange error path because I see you
check this case below.

> +			netdev_err(dev, "tx ring full when queue awake\n");
> +		return NETDEV_TX_BUSY;
> +	}
> +
> +	rocker_desc_cookie_ptr_set(desc_info, skb);
> +
> +	frags = rocker_tlv_nest_start(desc_info, ROCKER_TLV_TX_FRAGS);
> +	if (!frags)
> +		goto out;
> +	err = rocker_tx_desc_frag_map_put(rocker_port, desc_info,
> +					  skb->data, skb_headlen(skb));
> +	if (err)
> +		goto nest_cancel;
> +	if (skb_shinfo(skb)->nr_frags > ROCKER_TX_FRAGS_MAX)
> +		goto nest_cancel;
> +
> +	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> +		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> +
> +		err = rocker_tx_desc_frag_map_put(rocker_port, desc_info,
> +						  skb_frag_address(frag),
> +						  skb_frag_size(frag));
> +		if (err)
> +			goto unmap_frags;
> +	}
> +	rocker_tlv_nest_end(desc_info, frags);
> +
> +	rocker_desc_gen_clear(desc_info);
> +	rocker_desc_head_set(rocker, &rocker_port->tx_ring, desc_info);
> +
> +	desc_info = rocker_desc_head_get(&rocker_port->tx_ring);
> +	if (!desc_info)
> +		netif_stop_queue(dev);

I'm not entirely sure I followed the TLV usage here but OK. If it
works...

> +
> +	return NETDEV_TX_OK;
> +
> +unmap_frags:
> +	rocker_tx_desc_frags_unmap(rocker_port, desc_info);
> +nest_cancel:
> +	rocker_tlv_nest_cancel(desc_info, frags);
> +out:
> +	dev_kfree_skb(skb);
> +	return NETDEV_TX_OK;
> +}
> +
> +static int rocker_port_set_mac_address(struct net_device *dev, void *p)
> +{
> +	struct sockaddr *addr = p;
> +	struct rocker_port *rocker_port = netdev_priv(dev);
> +	int err;
> +
> +	if (!is_valid_ether_addr(addr->sa_data))
> +		return -EADDRNOTAVAIL;
> +
> +	err = rocker_cmd_set_port_settings_macaddr(rocker_port, addr->sa_data);
> +	if (err)
> +		return err;
> +	memcpy(dev->dev_addr, addr->sa_data, dev->addr_len);
> +	return 0;
> +}
> +
> +static int rocker_port_sw_parent_id_get(struct net_device *dev,
> +					struct netdev_phys_item_id *psid)
> +{
> +	struct rocker_port *rocker_port = netdev_priv(dev);
> +	struct rocker *rocker = rocker_port->rocker;
> +

hmm looks like you read this out of a magic switch register :) but
my switch doesn't have this magic reg. I suposse the switch MAC address
should work.

> +	psid->id_len = sizeof(rocker->hw.id);
> +	memcpy(&psid->id, &rocker->hw.id, psid->id_len);
> +	return 0;
> +}
> +
> +static const struct net_device_ops rocker_port_netdev_ops = {
> +	.ndo_open			= rocker_port_open,
> +	.ndo_stop			= rocker_port_stop,
> +	.ndo_start_xmit			= rocker_port_xmit,
> +	.ndo_set_mac_address		= rocker_port_set_mac_address,
> +	.ndo_sw_parent_id_get		= rocker_port_sw_parent_id_get,
> +};
> +
> +/********************
> + * ethtool interface
> + ********************/
> +
> +static int rocker_port_get_settings(struct net_device *dev,
> +				    struct ethtool_cmd *ecmd)
> +{
> +	struct rocker_port *rocker_port = netdev_priv(dev);
> +
> +	return rocker_cmd_get_port_settings_ethtool(rocker_port, ecmd);
> +}
> +
> +static int rocker_port_set_settings(struct net_device *dev,
> +				    struct ethtool_cmd *ecmd)
> +{
> +	struct rocker_port *rocker_port = netdev_priv(dev);
> +
> +	return rocker_cmd_set_port_settings_ethtool(rocker_port, ecmd);
> +}
> +
> +static void rocker_port_get_drvinfo(struct net_device *dev,
> +				    struct ethtool_drvinfo *drvinfo)
> +{
> +	strlcpy(drvinfo->driver, rocker_driver_name, sizeof(drvinfo->driver));
> +	strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
> +}
> +
> +static const struct ethtool_ops rocker_port_ethtool_ops = {
> +	.get_settings		= rocker_port_get_settings,
> +	.set_settings		= rocker_port_set_settings,
> +	.get_drvinfo		= rocker_port_get_drvinfo,
> +	.get_link		= ethtool_op_get_link,
> +};
> +

[...]

Looks reasonable to me, although I mostly scanned it and looked
over the interface parts. My comments are just observations no
need to change anything for them.

Reviewed-by: John Fastabend <john.r.fastabend@intel.com>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [patch net-next v2 04/10] net-sysfs: expose physical switch id for particular device
From: John Fastabend @ 2014-11-10 22:01 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma,
	f.fainelli, roopa, linville, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck, john.ronciak,
	mleitner, shrijeet, gospo, bcrl
In-Reply-To: <1415530280-9190-5-git-send-email-jiri@resnulli.us>

On 11/09/2014 02:51 AM, Jiri Pirko wrote:
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---

Acked-by: John Fastabend <john.r.fastabend@intel.com>



-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [patch net-next v2 03/10] rtnl: expose physical switch id for particular device
From: John Fastabend @ 2014-11-10 22:01 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma,
	f.fainelli, roopa, linville, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck, john.ronciak,
	mleitner, shrijeet, gospo, bcrl
In-Reply-To: <1415530280-9190-4-git-send-email-jiri@resnulli.us>

On 11/09/2014 02:51 AM, Jiri Pirko wrote:
> The netdevice represents a port in a switch, it will expose
> IFLA_PHYS_SWITCH_ID value via rtnl. Two netdevices with the same value
> belong to one physical switch.
>
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---

Yep, I need something like this for my management app to
learn how the switch is laid out. This becomes more relevant
if I have switch silicon mixed with NICs that are not connected
to the switch object in the same platform.

Acked-by: John Fastabend <john.r.fastabend@intel.com>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [patch net-next v2 02/10] net: introduce generic switch devices support
From: John Fastabend @ 2014-11-10 21:59 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma,
	f.fainelli, roopa, linville, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck, john.ronciak,
	mleitner, shrijeet, gospo, bcrl
In-Reply-To: <1415530280-9190-3-git-send-email-jiri@resnulli.us>

On 11/09/2014 02:51 AM, Jiri Pirko wrote:
> The goal of this is to provide a possibility to support various switch
> chips. Drivers should implement relevant ndos to do so. Now there is
> only one ndo defined:
> - for getting physical switch id is in place.
>
> Note that user can use random port netdevice to access the switch.
>
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---
>   Documentation/networking/switchdev.txt | 59 ++++++++++++++++++++++++++++++++++
>   MAINTAINERS                            |  7 ++++
>   include/linux/netdevice.h              | 10 ++++++
>   include/net/switchdev.h                | 30 +++++++++++++++++
>   net/Kconfig                            |  1 +
>   net/Makefile                           |  3 ++
>   net/switchdev/Kconfig                  | 13 ++++++++
>   net/switchdev/Makefile                 |  5 +++
>   net/switchdev/switchdev.c              | 33 +++++++++++++++++++
>   9 files changed, 161 insertions(+)
>   create mode 100644 Documentation/networking/switchdev.txt
>   create mode 100644 include/net/switchdev.h
>   create mode 100644 net/switchdev/Kconfig
>   create mode 100644 net/switchdev/Makefile
>   create mode 100644 net/switchdev/switchdev.c
>
> diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
> new file mode 100644
> index 0000000..98be76c
> --- /dev/null
> +++ b/Documentation/networking/switchdev.txt
> @@ -0,0 +1,59 @@
> +Switch (and switch-ish) device drivers HOWTO
> +===========================
> +
> +Please note that the word "switch" is here used in very generic meaning.
> +This include devices supporting L2/L3 but also various flow offloading chips,
> +including switches embedded into SR-IOV NICs.
> +
> +Lets describe a topology a bit. Imagine the following example:
> +
> +       +----------------------------+    +---------------+
> +       |     SOME switch chip       |    |      CPU      |
> +       +----------------------------+    +---------------+
> +       port1 port2 port3 port4 MNGMNT    |     PCI-E     |
> +         |     |     |     |     |       +---------------+
> +        PHY   PHY    |     |     |         |  NIC0 NIC1
> +                     |     |     |         |   |    |
> +                     |     |     +- PCI-E -+   |    |
> +                     |     +------- MII -------+    |
> +                     +------------- MII ------------+
> +
> +In this example, there are two independent lines between the switch silicon
> +and CPU. NIC0 and NIC1 drivers are not aware of a switch presence. They are
> +separate from the switch driver. SOME switch chip is by managed by a driver
> +via PCI-E device MNGMNT. Note that MNGMNT device, NIC0 and NIC1 may be
> +connected to some other type of bus.
> +
> +Now, for the previous example show the representation in kernel:
> +
> +       +----------------------------+    +---------------+
> +       |     SOME switch chip       |    |      CPU      |
> +       +----------------------------+    +---------------+
> +       sw0p0 sw0p1 sw0p2 sw0p3 MNGMNT    |     PCI-E     |
> +         |     |     |     |     |       +---------------+
> +        PHY   PHY    |     |     |         |  eth0 eth1
> +                     |     |     |         |   |    |
> +                     |     |     +- PCI-E -+   |    |
> +                     |     +------- MII -------+    |
> +                     +------------- MII ------------+
> +
> +Lets call the example switch driver for SOME switch chip "SOMEswitch". This
> +driver takes care of PCI-E device MNGMNT. There is a netdevice instance sw0pX
> +created for each port of a switch. These netdevices are instances
> +of "SOMEswitch" driver. sw0pX netdevices serve as a "representation"
> +of the switch chip. eth0 and eth1 are instances of some other existing driver.
> +
> +The only difference of the switch-port netdevice from the ordinary netdevice
> +is that is implements couple more NDOs:
> +
> +	ndo_sw_parent_get_id - This returns the same ID for two port netdevices
> +			       of the same physical switch chip. This is
> +			       mandatory to be implemented by all switch drivers
> +			       and serves the caller for recognition of a port
> +			       netdevice.

What is the connection between ndo_sw_parent_get_id and
ndo_get_phys_port_id(). I'm having a bit of trouble teasing
this out.

For example here is my ascii art for a SR-IOV NIC,

        eth0     eth1     eth2
         |         |        |
         |         |        |
         PF        VF       VF
    +----+---------+--------+----+
    |       embedded bridge      |
    +-------------+--------------+
                  |
                 port

that can do switching between the various uplinks and downlinks.
In IEEE 802.1Q language the embedded bridge acts like an edge
relay. At least that seems to be the current state of the art
for SR-IOV. Edge relay just means it has a single uplink port
to the network and multiple downlinks and also isn't required
to do learning and run loop detection protocols STP, et. al.

Also there are multi-function devices that look the same except
replace the VFs with PFs. It seems to be a common mode for NICs
that do the iSCSI offloads with storage functions.

When something is an embedded bridge vs a SOME switch chip is
not entirely clear.

My understanding is use ndo_sw_parent_get_id() when you have
multiple physical ports all connected to a single switch object.
When you have a single port connected to multiple PCIE functions
or queues representing a netdev (e.g. macvlan offload) use the
ndo_get_phys_port_id(). Just want to be sure we are on the
same page here.

Otherwise patch looks good. I think we can clear the above up
with an addition to the documentation. Could go in after the
initial set and be OK with me.

IMO this patch is needed otherwise user space is at a complete
loss on trying to figure out how netdevs map to switch silicon.
You could have reused ndo_get_phys_port_id() perhaps but then
I think user space may get confused by SR-IOV/VMDQ/etc ports
attached to a switch silicon. For .02$ having a new distinct
identifier is cleaner.


> +	ndo_sw_parent_* - Functions that serve for a manipulation of the switch
> +			  chip itself (it can be though of as a "parent" of the
> +			  port, therefore the name). They are not port-specific.
> +			  Caller might use arbitrary port netdevice of the same
> +			  switch and it will make no difference.
> +	ndo_sw_port_* - Functions that serve for a port-specific manipulation.

[...]

Thanks,
John


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [patch net-next v2 01/10] net: rename netdev_phys_port_id to more generic name
From: John Fastabend @ 2014-11-10 21:57 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma,
	f.fainelli, roopa, linville, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck, john.ronciak,
	mleitner, shrijeet, gospo, bcrl
In-Reply-To: <1415530280-9190-2-git-send-email-jiri@resnulli.us>

On 11/09/2014 02:51 AM, Jiri Pirko wrote:
> So this can be reused for identification of other "items" as well.
>
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---

Looks good to me. Just a simple code refactoring and doesn't
touch any user visible uapi files and makes the next patches a
bit cleaner.

Acked-by: John Fastabend <john.r.fastabend@intel.com>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [patch net-next v2 10/10] rocker: implement L2 bridge offloading
From: Jamal Hadi Salim @ 2014-11-10 21:14 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Roopa Prabhu, Jiri Pirko, Netdev, David S. Miller, nhorman,
	Andy Gospodarek, Thomas Graf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, Kirsher, Jeffrey T, vyasevic, Cong Wang,
	Fastabend, John R, Eric Dumazet, Florian Fainelli, John Linville,
	jasowang, ebiederm, Nicolas Dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, Alexei Starovoitov
In-Reply-To: <CAE4R7bDQesnq8kDFsUqiY0+H1SnUHPSJACLMgFc9eX8chjoxKA@mail.gmail.com>

On 11/10/14 14:47, Scott Feldman wrote:
> On Mon, Nov 10, 2014 at 9:27 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

>
> For swdev, I don't care for the model where each port has an fdb and
> the bridge has an fdb.  The bridge's fdb lookup/learning/fwding is
> what we're offloading to HW, so it makes more sense from the driver
> and to the user to use one fdb, the bridge's fdb.  So user types
> "bridge fdb show" and static fdbs installed on the bridge and learned
> fdbs synced from HW are represented.  One table.
>

side note:
I hope we'd be able to tell apart what is in hardware vs software
and what has been synced up from hardware (assuming policy says we
are allowed to sync things).
Yes, there is one fdb table per bridge - but each entry would say
which brport is involved. So the reference point being a bridge point
sounds reasonable to me. i.e
Any fdb entry whether in h/w or s/w would point to an egress
brport *always*. Are you thinking there is only one possible bridge?
Caveat: a piece of hardware could have multiple virtual bridges.
In what Ben showed on the $5 realtek, after boot up we just
know which brports exist and nothing more.
We can then create a bridge and attach brport to each. This gets
reflected into hardware on a per brport level (I think there was
a field called FID?).
Constraint: Each brport connects only to one bridge.

> I view the existing ndo_fdb_add/del ops useful for devices working
> standalone without the bridge driver that have some HW fwding
> capabilities and need to manage their own fdb.

As in offload said fdb?

>For devices under
> bridge, let's use the bridge's fdb, at least for swdev.
>

If the hardware can only do one bridge sure.

> Does this make sense?

not quiet for me but i may be missing something.

I hate to use a lot of "I"s in my sentences,
> but looks like I did exactly that in above, so take this as an
> opinion, within the scope of swdev.
>

"I" is useful for expressing an opinion or an expectation of course ;->
As an example:
_I_ believe we should be able to define how {learning, flooding etc} and 
where {hw vs sw} things are to be installed or learnt-from.
I have use cases where the controller makes all the decisions.
And i tried to provide my motivation in one of the meetings here:
https://linux.cumulusnetworks.com/offload-discussion-1/jamal-NFstatecaching.pdf

cheers,
jamal

^ permalink raw reply

* question about ethtool bits?
From: Jesse Brandeburg @ 2014-11-10 20:47 UTC (permalink / raw)
  To: netdev; +Cc: jesse.brandeburg, ben

Ben et al,

So, I was just looking at adding some more types/speeds to
ethtool.h and noticed we are about out of bits in ecmd->advertising
(and others), which are declared u32.

We currently have room for one more type (bit 31) and then all 32 bits
will be full.

Previous solutions have involved adding a new struct member to the end
of ecmd and friends and calling it u32 advertising2 or something.
Another option could be adding a u64, but there may be not fun results
of doing so.

...
#define SUPPORTED_56000baseSR4_Full     (1 << 29)
#define SUPPORTED_56000baseLR4_Full     (1 << 30)

...
#define ADVERTISED_56000baseSR4_Full    (1 << 29)
#define ADVERTISED_56000baseLR4_Full    (1 << 30)

in the case of speed there is already a speed_hi, so that gets us 32
bits of speed expressed in Mb/s

These fast networks create so many problems ;-)

There are two u32's reserved at the end of ethtool_cmd.

Suggestions?

^ permalink raw reply

* Re: [PATCH] drivers: atm: eni: Add pci_dma_mapping_error() call
From: David Miller @ 2014-11-10 20:32 UTC (permalink / raw)
  To: tinajohnson.1234
  Cc: chas, linux-atm-general, netdev, linux-kernel, julia.lawall
In-Reply-To: <1415463482-8013-1-git-send-email-tinajohnson.1234@gmail.com>

From: Tina Johnson <tinajohnson.1234@gmail.com>
Date: Sat,  8 Nov 2014 21:48:02 +0530

> Added a pci_dma_mapping_error() call to check for mapping errors before
> further using the dma handle. Unchecked dma handles were found using
> Coccinelle:
> 
> @rule1@
> expression e1;
> identifier x;
> @@
> 
> *x = pci_map_single(...);
>  ... when != pci_dma_mapping_error(e1,x)
> 
> Signed-off-by: Tina Johnson <tinajohnson.1234@gmail.com>
> Acked-by: Julia Lawall <julia.lawall@lip6.fr>

I really don't like quick fixes like this, exactly because of the
kind of new bug this change is introducing.

The 'trouble' label assumes that it is recovering and unwinding state
when an error occurs after the DMA buffer is successfully mapped.

It unconditionally does pci_unmap_single() if 'paddr' is non-zero
which it might be in the error case depending upon how DMA errors
are represented on a given platform.

^ permalink raw reply

* Re: [PATCH 2/2] 8139too: Allow using the largest possible MTU
From: David Miller @ 2014-11-10 20:30 UTC (permalink / raw)
  To: albeu; +Cc: linux-kernel, netdev, bhelgaas, benoit.taine, ebiederm
In-Reply-To: <1415447316-12424-2-git-send-email-albeu@free.fr>

From: Alban Bedel <albeu@free.fr>
Date: Sat,  8 Nov 2014 12:48:36 +0100

> This driver allows MTU up to 1518 bytes which is not enought to run
> batman-adv. Simply raise the maximum packet size up to the maximum
> allowed by the transmit descriptor, 1792 bytes, giving a maximum MTU
> of 1774 bytes.
> 
> Signed-off-by: Alban Bedel <albeu@free.fr>

Also applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH 1/2] 8139too: Allow setting MTU larger than 1500
From: David Miller @ 2014-11-10 20:30 UTC (permalink / raw)
  To: albeu; +Cc: linux-kernel, netdev, bhelgaas, benoit.taine, ebiederm
In-Reply-To: <1415447316-12424-1-git-send-email-albeu@free.fr>

From: Alban Bedel <albeu@free.fr>
Date: Sat,  8 Nov 2014 12:48:35 +0100

> Replace the default ndo_change_mtu callback with one that allow
> setting MTU that the driver can handle.
> 
> Signed-off-by: Alban Bedel <albeu@free.fr>

Applied to net-next

^ permalink raw reply

* Re: linux-next: Tree for Nov 10 (net/ipv4/ip_tunnel.c)
From: Tom Herbert @ 2014-11-10 20:29 UTC (permalink / raw)
  To: David Miller; +Cc: rdunlap, sfr, Linux-Next, LKML, Linux Netdev List
In-Reply-To: <20141110.142439.2292080248990157625.davem@davemloft.net>

I am looking at it.

On Mon, Nov 10, 2014 at 11:24 AM, David Miller <davem@davemloft.net> wrote:
> From: Randy Dunlap <rdunlap@infradead.org>
> Date: Mon, 10 Nov 2014 10:15:11 -0800
>
>> On 11/10/14 01:59, Stephen Rothwell wrote:
>>> Hi all,
>>>
>>> Changes since 20141106:
>>>
>>
>> on x86_64:
>> when CONFIG_NET_IP_TUNNEL=y and CONFIG_NET_FOU=m:
>>
>> net/built-in.o: In function `ip_tunnel_encap':
>> (.text+0xb04d8): undefined reference to `gue_build_header'
>> net/built-in.o: In function `ip_tunnel_encap':
>> (.text+0xb04ea): undefined reference to `fou_build_header'
>
> Tom, this has now been reported twice, please fix this.
>
> Thanks.

^ permalink raw reply

* Re: [PATCH net] cxgb4 : Fix bug in DCB app deletion
From: David Miller @ 2014-11-10 20:14 UTC (permalink / raw)
  To: anish; +Cc: netdev, hariprasad, leedom
In-Reply-To: <1415403681-4120-1-git-send-email-anish@chelsio.com>

From: Anish Bhatt <anish@chelsio.com>
Date: Fri,  7 Nov 2014 15:41:21 -0800

> Unlike CEE, IEEE has a bespoke app delete call and does not rely on priority
> for app deletion
> 
> Fixes : 2376c879b80c ('cxgb4 : Improve handling of DCB negotiation or loss
>  thereof')
> 
> Signed-off-by: Anish Bhatt <anish@chelsio.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v2 net] udptunnel: Add SKB_GSO_UDP_TUNNEL during gro_complete.
From: David Miller @ 2014-11-10 20:11 UTC (permalink / raw)
  To: jesse; +Cc: netdev
In-Reply-To: <1415648713-110483-1-git-send-email-jesse@nicira.com>

From: Jesse Gross <jesse@nicira.com>
Date: Mon, 10 Nov 2014 11:45:13 -0800

> When doing GRO processing for UDP tunnels, we never add
> SKB_GSO_UDP_TUNNEL to gso_type - only the type of the inner protocol
> is added (such as SKB_GSO_TCPV4). The result is that if the packet is
> later resegmented we will do GSO but not treat it as a tunnel. This
> results in UDP fragmentation of the outer header instead of (i.e.) TCP
> segmentation of the inner header as was originally on the wire.
> 
> Signed-off-by: Jesse Gross <jesse@nicira.com>
> ---
> v2: Update commit message to reflect result of problem.

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: David Miller @ 2014-11-10 20:08 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ycai, willemb, ncardwell
In-Reply-To: <1415393472.13896.119.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 07 Nov 2014 12:51:12 -0800

> @@ -1455,6 +1455,7 @@ process:
>  		goto discard_and_relse;
>  
>  	sk_mark_napi_id(sk, skb);
> +	sk_incoming_cpu_update(sk);

Just make sk_mark_napi_id() call sk_incoming_cpu_update().

You've matched up the calls precisely in this patch, and I can't think
of any situation where we'd add a sk_mark_napi_call() not not want to
do an sk_incoming_cpu_update().

If, alternatively, you want to have a helper function that does both
things and has a special name, that's fine too.

^ permalink raw reply

* Re: [patch net-next v2 03/10] rtnl: expose physical switch id for particular device
From: Scott Feldman @ 2014-11-10 20:02 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jiri Pirko, Netdev, David S. Miller, nhorman, Andy Gospodarek,
	Thomas Graf, dborkman, ogerlitz, jesse, pshelar, azhou, ben,
	stephen, Kirsher, Jeffrey T, vyasevic, Cong Wang,
	Fastabend, John R, Eric Dumazet, Jamal Hadi Salim,
	Florian Fainelli, John Linville, jasowang, ebiederm,
	Nicolas Dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	Alexei Starovoitov
In-Reply-To: <5460FCB0.1040409@cumulusnetworks.com>

On Mon, Nov 10, 2014 at 7:58 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> On 11/9/14, 2:51 AM, Jiri Pirko wrote:
>>
>> The netdevice represents a port in a switch, it will expose
>> IFLA_PHYS_SWITCH_ID value via rtnl. Two netdevices with the same value
>> belong to one physical switch.
>>
>> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>> ---
>>   include/uapi/linux/if_link.h |  1 +
>>   net/core/rtnetlink.c         | 26 +++++++++++++++++++++++++-
>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>> index 7072d83..4163753 100644
>> --- a/include/uapi/linux/if_link.h
>> +++ b/include/uapi/linux/if_link.h
>> @@ -145,6 +145,7 @@ enum {
>>         IFLA_CARRIER,
>>         IFLA_PHYS_PORT_ID,
>>         IFLA_CARRIER_CHANGES,
>> +       IFLA_PHYS_SWITCH_ID,
>
>
> Jiri, since we have not really converged on the switchdev class or having a
> separate switchdev instance,
> am thinking it is better if we dont expose any such switch_id to userspace
> yet until absolutely needed. Do you need it today ?
> There is no real in kernel hw switch driver that will use it today. And
> quite likely this will need to change when we introduce real hw switch
> drivers.

How will it change when real hw switch drivers are introduced?  Will
the real sw driver not be able to give a up unique ID for the switch?

^ permalink raw reply

* [net-next PATCH 3/5] phonet: Replace calls to __skb_alloc_page with __netdev_alloc_page
From: Alexander Duyck @ 2014-11-10 19:52 UTC (permalink / raw)
  To: netdev, linux-usb
  Cc: leedom, hariprasad, donald.c.skidmore, oliver, balbi,
	matthew.vick, mgorman, davem, jeffrey.t.kirsher
In-Reply-To: <20141110195104.16182.98174.stgit@ahduyck-vm-fedora20>

Replace the calls to __skb_alloc_page that are passed NULL with calls to
__netdev_alloc_page.

Cc: Oliver Neukum <oliver@neukum.org>
Cc: Felipe Balbi <balbi@ti.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 drivers/net/usb/cdc-phonet.c           |    2 +-
 drivers/usb/gadget/function/f_phonet.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/cdc-phonet.c b/drivers/net/usb/cdc-phonet.c
index 2ec1500..fd88da8 100644
--- a/drivers/net/usb/cdc-phonet.c
+++ b/drivers/net/usb/cdc-phonet.c
@@ -130,7 +130,7 @@ static int rx_submit(struct usbpn_dev *pnd, struct urb *req, gfp_t gfp_flags)
 	struct page *page;
 	int err;
 
-	page = __skb_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
+	page = __netdev_alloc_page(gfp_flags | __GFP_NOMEMALLOC);
 	if (!page)
 		return -ENOMEM;
 
diff --git a/drivers/usb/gadget/function/f_phonet.c b/drivers/usb/gadget/function/f_phonet.c
index b9cfc15..1a5be3a 100644
--- a/drivers/usb/gadget/function/f_phonet.c
+++ b/drivers/usb/gadget/function/f_phonet.c
@@ -303,7 +303,7 @@ pn_rx_submit(struct f_phonet *fp, struct usb_request *req, gfp_t gfp_flags)
 	struct page *page;
 	int err;
 
-	page = __skb_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
+	page = __netdev_alloc_page(gfp_flags | __GFP_NOMEMALLOC);
 	if (!page)
 		return -ENOMEM;
 

^ permalink raw reply related

* [net-next PATCH 5/5] net: Remove __skb_alloc_page and __skb_alloc_pages
From: Alexander Duyck @ 2014-11-10 19:52 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA
  Cc: leedom-ut6Up61K2wZBDgjK7y7TUQ, hariprasad-ut6Up61K2wZBDgjK7y7TUQ,
	donald.c.skidmore-ral2JQCrhuEAvxtiuMwx3w,
	oliver-GvhC2dPhHPQdnm+yROfE0A, balbi-l0cyMroinI0,
	matthew.vick-ral2JQCrhuEAvxtiuMwx3w, mgorman-l3A5Bk7waGM,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <20141110195104.16182.98174.stgit@ahduyck-vm-fedora20>

Remove the two functions which are now dead code.

Signed-off-by: Alexander Duyck <alexander.h.duyck-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/linux/skbuff.h |   43 -------------------------------------------
 1 file changed, 43 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f196628..e4acbca 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2233,49 +2233,6 @@ static inline struct page *netdev_alloc_page(void)
 }
 
 /**
- *	__skb_alloc_pages - allocate pages for ps-rx on a skb and preserve pfmemalloc data
- *	@gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
- *	@skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used
- *	@order: size of the allocation
- *
- * 	Allocate a new page.
- *
- * 	%NULL is returned if there is no free memory.
-*/
-static inline struct page *__skb_alloc_pages(gfp_t gfp_mask,
-					      struct sk_buff *skb,
-					      unsigned int order)
-{
-	struct page *page;
-
-	gfp_mask |= __GFP_COLD;
-
-	if (!(gfp_mask & __GFP_NOMEMALLOC))
-		gfp_mask |= __GFP_MEMALLOC;
-
-	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
-	if (skb && page && page->pfmemalloc)
-		skb->pfmemalloc = true;
-
-	return page;
-}
-
-/**
- *	__skb_alloc_page - allocate a page for ps-rx for a given skb and preserve pfmemalloc data
- *	@gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
- *	@skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used
- *
- * 	Allocate a new page.
- *
- * 	%NULL is returned if there is no free memory.
- */
-static inline struct page *__skb_alloc_page(gfp_t gfp_mask,
-					     struct sk_buff *skb)
-{
-	return __skb_alloc_pages(gfp_mask, skb, 0);
-}

^ permalink raw reply related

* [net-next PATCH 4/5] fm10k/igb/ixgbe: Replace __skb_alloc_page with netdev_alloc_page
From: Alexander Duyck @ 2014-11-10 19:52 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA
  Cc: leedom-ut6Up61K2wZBDgjK7y7TUQ, hariprasad-ut6Up61K2wZBDgjK7y7TUQ,
	donald.c.skidmore-ral2JQCrhuEAvxtiuMwx3w,
	oliver-GvhC2dPhHPQdnm+yROfE0A, balbi-l0cyMroinI0,
	matthew.vick-ral2JQCrhuEAvxtiuMwx3w, mgorman-l3A5Bk7waGM,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <20141110195104.16182.98174.stgit@ahduyck-vm-fedora20>

The Intel drivers were pretty much just using the plain vanilla GFP flags
in their calls to __skb_alloc_page so this change makes it so that they use
netdev_alloc_page which just uses GFP_ATOMIC for the gfp_flags value.

Cc: Jeff Kirsher <jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Matthew Vick <matthew.vick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Don Skidmore <donald.c.skidmore-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Alexander Duyck <alexander.h.duyck-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |    2 +-
 drivers/net/ethernet/intel/igb/igb_main.c     |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    3 +--
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index e645af4..64d82c5 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -83,7 +83,7 @@ static bool fm10k_alloc_mapped_page(struct fm10k_ring *rx_ring,
 		return true;
 
 	/* alloc new page for storage */
-	page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+	page = netdev_alloc_page();
 	if (unlikely(!page)) {
 		rx_ring->rx_stats.alloc_failed++;
 		return false;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a2d72a8..1245a86 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6988,7 +6988,7 @@ static bool igb_alloc_mapped_page(struct igb_ring *rx_ring,
 		return true;
 
 	/* alloc new page for storage */
-	page = __skb_alloc_page(GFP_ATOMIC | __GFP_COLD, NULL);
+	page = netdev_alloc_page();
 	if (unlikely(!page)) {
 		rx_ring->rx_stats.alloc_failed++;
 		return false;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d2df4e3..8e4ba8c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1440,8 +1440,7 @@ static bool ixgbe_alloc_mapped_page(struct ixgbe_ring *rx_ring,
 
 	/* alloc new page for storage */
 	if (likely(!page)) {
-		page = __skb_alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
-					 bi->skb, ixgbe_rx_pg_order(rx_ring));
+		page = netdev_alloc_pages(ixgbe_rx_pg_order(rx_ring));
 		if (unlikely(!page)) {
 			rx_ring->rx_stats.alloc_rx_page_failed++;
 			return false;

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [net-next PATCH 1/5] net: Add netdev Rx page allocation function
From: Alexander Duyck @ 2014-11-10 19:51 UTC (permalink / raw)
  To: netdev, linux-usb
  Cc: leedom, hariprasad, donald.c.skidmore, oliver, balbi,
	matthew.vick, mgorman, davem, jeffrey.t.kirsher
In-Reply-To: <20141110195104.16182.98174.stgit@ahduyck-vm-fedora20>

This patch implements __netdev_alloc_pages and __netdev_alloc_page.  These
are meant to replace the __skb_alloc_pages and __skb_alloc_page functions.
The reason for doing this is that it occurred to me that __skb_alloc_page is
supposed to be passed an sk_buff pointer, but it is NULL in all cases where
it is used.  Worse is that in the case of ixgbe it is passed NULL via the
sk_buff pointer in the rx_buffer info structure which means the compiler is
not correctly stripping it out.

In the case of anything greater than order 0 it is assumed that we want a
compound page so __GFP_COMP is set for all allocations as we expect a
compound page when assigning a page frag.

The other change in this patch is to exploit the behaviors of the page
allocator in how it handles flags.  So for example we can always set
__GFP_COMP and __GFP_MEMALLOC since they are ignored if they are not
applicable or are overridden by another flag.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 include/linux/skbuff.h |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 103fbe8..f196628 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2185,6 +2185,54 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 }
 
 /**
+ * __netdev_alloc_pages - allocate page for ps-rx
+ * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
+ * @order: size of the allocation
+ *
+ * Allocate a new page.
+ *
+ * %NULL is returned if there is no free memory.
+*/
+static inline struct page *__netdev_alloc_pages(gfp_t gfp_mask,
+						unsigned int order)
+{
+	/* This piece of code contains several assumptions.
+	 * 1.  This is for device Rx, therefor a cold page is preferred.
+	 * 2.  The expectation is the user wants a compound page.
+	 * 3.  If requesting a order 0 page it will not be compound
+	 *     due to the check to see if order has a value in prep_new_page
+	 * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is due to
+	 *     code in gfp_to_alloc_flags that should be enforcing this.
+	 */
+	gfp_mask |= __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;
+
+	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
+}
+
+static inline struct page *netdev_alloc_pages(unsigned int order)
+{
+	return __netdev_alloc_pages(GFP_ATOMIC, order);
+}
+
+/**
+ * __netdev_alloc_page - allocate a page for ps-rx
+ * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
+ *
+ * Allocate a new page.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+static inline struct page *__netdev_alloc_page(gfp_t gfp_mask)
+{
+	return __netdev_alloc_pages(gfp_mask, 0);
+}
+
+static inline struct page *netdev_alloc_page(void)
+{
+	return __netdev_alloc_page(GFP_ATOMIC);
+}
+
+/**
  *	__skb_alloc_pages - allocate pages for ps-rx on a skb and preserve pfmemalloc data
  *	@gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
  *	@skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used

^ permalink raw reply related

* [net-next PATCH 0/5] Replace __skb_alloc_pages with much simpler function
From: Alexander Duyck @ 2014-11-10 19:51 UTC (permalink / raw)
  To: netdev, linux-usb
  Cc: leedom, hariprasad, donald.c.skidmore, oliver, balbi,
	matthew.vick, mgorman, davem, jeffrey.t.kirsher

This patch series replaces __skb_alloc_pages with a much simpler function,
__netdev_alloc_pages.  The main difference between the two is that
__skb_alloc_pages had an sk_buff pointer that was being passed as NULL in
call places where it was called.  In a couple of cases the NULL was passed
by variable and this led to unnecessary code being run.

As such in order to simplify things the __netdev_alloc_pages call only
takes a mask and the page order being requested.  In addition it takes
advantage of several behaviors already built into the page allocator so
that it can just set GFP flags unconditionally.

---

Alexander Duyck (5):
      net: Add netdev Rx page allocation function
      cxgb4/cxgb4vf: Replace __skb_alloc_page with __netdev_alloc_page
      phonet: Replace calls to __skb_alloc_page with __netdev_alloc_page
      fm10k/igb/ixgbe: Replace __skb_alloc_page with netdev_alloc_page
      net: Remove __skb_alloc_page and __skb_alloc_pages

 drivers/net/ethernet/chelsio/cxgb4/sge.c      |    6 +-
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c    |    7 ++-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |    2 -
 drivers/net/ethernet/intel/igb/igb_main.c     |    2 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    3 -
 drivers/net/usb/cdc-phonet.c                  |    2 -
 drivers/usb/gadget/function/f_phonet.c        |    2 -
 include/linux/skbuff.h                        |   61 ++++++++++++++-----------
 8 files changed, 45 insertions(+), 40 deletions(-)

--

^ permalink raw reply

* [net-next PATCH 2/5] cxgb4/cxgb4vf: Replace __skb_alloc_page with __netdev_alloc_page
From: Alexander Duyck @ 2014-11-10 19:52 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA
  Cc: leedom-ut6Up61K2wZBDgjK7y7TUQ, hariprasad-ut6Up61K2wZBDgjK7y7TUQ,
	donald.c.skidmore-ral2JQCrhuEAvxtiuMwx3w,
	oliver-GvhC2dPhHPQdnm+yROfE0A, balbi-l0cyMroinI0,
	matthew.vick-ral2JQCrhuEAvxtiuMwx3w, mgorman-l3A5Bk7waGM,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <20141110195104.16182.98174.stgit@ahduyck-vm-fedora20>

Drop the bloated use of __skb_alloc_page and replace it with
__netdev_alloc_page.  In addition update the one other spot that is
allocating a page so that it allocates with the correct flags.

Cc: Hariprasad S <hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Cc: Casey Leedom <leedom-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Signed-off-by: Alexander Duyck <alexander.h.duyck-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/net/ethernet/chelsio/cxgb4/sge.c   |    6 +++---
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c |    7 ++++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 5e1b314..f7dd6e3 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -576,7 +576,7 @@ static unsigned int refill_fl(struct adapter *adap, struct sge_fl *q, int n,
 	__be64 *d = &q->desc[q->pidx];
 	struct rx_sw_desc *sd = &q->sdesc[q->pidx];
 
-	gfp |= __GFP_NOWARN | __GFP_COLD;
+	gfp |= __GFP_NOWARN;
 
 	if (s->fl_pg_order == 0)
 		goto alloc_small_pages;
@@ -585,7 +585,7 @@ static unsigned int refill_fl(struct adapter *adap, struct sge_fl *q, int n,
 	 * Prefer large buffers
 	 */
 	while (n) {
-		pg = alloc_pages(gfp | __GFP_COMP, s->fl_pg_order);
+		pg = __netdev_alloc_pages(gfp, s->fl_pg_order);
 		if (unlikely(!pg)) {
 			q->large_alloc_failed++;
 			break;       /* fall back to single pages */
@@ -615,7 +615,7 @@ static unsigned int refill_fl(struct adapter *adap, struct sge_fl *q, int n,
 
 alloc_small_pages:
 	while (n--) {
-		pg = __skb_alloc_page(gfp, NULL);
+		pg = __netdev_alloc_page(gfp);
 		if (unlikely(!pg)) {
 			q->alloc_failed++;
 			break;
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
index 85036e6..ad5df49 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -602,6 +602,8 @@ static unsigned int refill_fl(struct adapter *adapter, struct sge_fl *fl,
 	 */
 	BUG_ON(fl->avail + n > fl->size - FL_PER_EQ_UNIT);
 
+	gfp |= __GFP_NOWARN;
+
 	/*
 	 * If we support large pages, prefer large buffers and fail over to
 	 * small pages if we can't allocate large pages to satisfy the refill.
@@ -612,8 +614,7 @@ static unsigned int refill_fl(struct adapter *adapter, struct sge_fl *fl,
 		goto alloc_small_pages;
 
 	while (n) {
-		page = alloc_pages(gfp | __GFP_COMP | __GFP_NOWARN,
-				   FL_PG_ORDER);
+		page = __netdev_alloc_pages(gfp, FL_PG_ORDER);
 		if (unlikely(!page)) {
 			/*
 			 * We've failed inour attempt to allocate a "large
@@ -657,7 +658,7 @@ static unsigned int refill_fl(struct adapter *adapter, struct sge_fl *fl,
 
 alloc_small_pages:
 	while (n--) {
-		page = __skb_alloc_page(gfp | __GFP_NOWARN, NULL);
+		page = __netdev_alloc_page(gfp);
 		if (unlikely(!page)) {
 			fl->alloc_failed++;
 			break;

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2 net] udptunnel: Add SKB_GSO_UDP_TUNNEL during gro_complete.
From: Jesse Gross @ 2014-11-10 19:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

When doing GRO processing for UDP tunnels, we never add
SKB_GSO_UDP_TUNNEL to gso_type - only the type of the inner protocol
is added (such as SKB_GSO_TCPV4). The result is that if the packet is
later resegmented we will do GSO but not treat it as a tunnel. This
results in UDP fragmentation of the outer header instead of (i.e.) TCP
segmentation of the inner header as was originally on the wire.

Signed-off-by: Jesse Gross <jesse@nicira.com>
---
v2: Update commit message to reflect result of problem.
---
 drivers/net/vxlan.c      | 2 ++
 include/net/udp_tunnel.h | 9 +++++++++
 net/ipv4/fou.c           | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ca30982..cfb892b 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -621,6 +621,8 @@ static int vxlan_gro_complete(struct sk_buff *skb, int nhoff)
 	int vxlan_len  = sizeof(struct vxlanhdr) + sizeof(struct ethhdr);
 	int err = -ENOSYS;
 
+	udp_tunnel_gro_complete(skb, nhoff);
+
 	eh = (struct ethhdr *)(skb->data + nhoff + sizeof(struct vxlanhdr));
 	type = eh->h_proto;
 
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index a47790b..2a50a70 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -100,6 +100,15 @@ static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
 	return iptunnel_handle_offloads(skb, udp_csum, type);
 }
 
+static inline void udp_tunnel_gro_complete(struct sk_buff *skb, int nhoff)
+{
+	struct udphdr *uh;
+
+	uh = (struct udphdr *)(skb->data + nhoff - sizeof(struct udphdr));
+	skb_shinfo(skb)->gso_type |= uh->check ?
+				SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+}
+
 static inline void udp_tunnel_encap_enable(struct socket *sock)
 {
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 32e7892..606c520 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -133,6 +133,8 @@ static int fou_gro_complete(struct sk_buff *skb, int nhoff)
 	int err = -ENOSYS;
 	const struct net_offload **offloads;
 
+	udp_tunnel_gro_complete(skb, nhoff);
+
 	rcu_read_lock();
 	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[proto]);
-- 
1.9.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox