Netdev List


* [PATCH iproute2 v2] ip link: Fix crash on older kernels when show VF dev
From: Vadim Kochan @ 2015-01-09 19:22 UTC (permalink / raw)
  To: netdev; +Cc: william, Vadim Kochan, Vadim Kochan

From: Vadim Kochan <vadim4j@gmail.com>

The crash happened because IFLA_VF_RATE does not exist on
older kernels; check that the nested attribute is present before
reading struct ifla_vf_rate from it.

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Reported-by: William Dauchy <william@gandi.net>
Tested-by: William Dauchy <william@gandi.net>
---
 ip/ipaddress.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 28dfe8c..830b166 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -259,11 +259,10 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 {
 	struct ifla_vf_mac *vf_mac;
 	struct ifla_vf_vlan *vf_vlan;
-	struct ifla_vf_rate *vf_rate;
 	struct ifla_vf_tx_rate *vf_tx_rate;
 	struct ifla_vf_spoofchk *vf_spoofchk;
 	struct ifla_vf_link_state *vf_linkstate;
-	struct rtattr *vf[IFLA_VF_MAX+1];
+	struct rtattr *vf[IFLA_VF_MAX + 1] = {};
 	struct rtattr *tmp;
 	SPRINT_BUF(b1);
 
@@ -277,7 +276,6 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 	vf_mac = RTA_DATA(vf[IFLA_VF_MAC]);
 	vf_vlan = RTA_DATA(vf[IFLA_VF_VLAN]);
 	vf_tx_rate = RTA_DATA(vf[IFLA_VF_TX_RATE]);
-	vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
 
 	/* Check if the spoof checking vf info type is supported by
 	 * this kernel.
@@ -313,10 +311,16 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 		fprintf(fp, ", qos %d", vf_vlan->qos);
 	if (vf_tx_rate->rate)
 		fprintf(fp, ", tx rate %d (Mbps)", vf_tx_rate->rate);
-	if (vf_rate->max_tx_rate)
-		fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
-	if (vf_rate->min_tx_rate)
-		fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+
+	if (vf[IFLA_VF_RATE]) {
+		struct ifla_vf_rate *vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
+
+		if (vf_rate->max_tx_rate)
+			fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
+		if (vf_rate->min_tx_rate)
+			fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+	}
+
 	if (vf_spoofchk && vf_spoofchk->setting != -1) {
 		if (vf_spoofchk->setting)
 			fprintf(fp, ", spoof checking on");
-- 
2.1.3
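
The pattern the patch relies on (zero the attribute table, then test a slot before dereferencing it) can be sketched outside iproute2. This is a minimal userspace illustration, not the iproute2 code: struct demo_attr, the DEMO_VF_* ids, demo_parse() and demo_max_tx_rate() are hypothetical stand-ins for struct rtattr, the IFLA_VF_* ids and the real parsing helpers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute ids standing in for IFLA_VF_MAC / IFLA_VF_RATE. */
enum { DEMO_VF_UNSPEC, DEMO_VF_MAC, DEMO_VF_RATE, DEMO_VF_MAX = DEMO_VF_RATE };

/* Simplified attribute header standing in for struct rtattr. */
struct demo_attr {
	uint16_t len;	/* header plus payload, in bytes */
	uint16_t type;
};

struct demo_vf_rate {
	uint32_t min_tx_rate;
	uint32_t max_tx_rate;
};

/* Record where each known attribute starts. Slots of absent attributes are
 * left untouched, so the caller has to hand in a zeroed table.
 */
static void demo_parse(const unsigned char *tb[], unsigned int max,
		       const unsigned char *buf, size_t buflen)
{
	size_t off = 0;

	while (off < buflen && buflen - off >= sizeof(struct demo_attr)) {
		struct demo_attr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > buflen - off)
			break;
		if (hdr.type <= max)
			tb[hdr.type] = buf + off;
		off += ((size_t)hdr.len + 3u) & ~(size_t)3u;	/* 4-byte alignment */
	}
}

/* Return max_tx_rate, or 0 when the message carries no rate attribute. */
static uint32_t demo_max_tx_rate(const unsigned char *buf, size_t buflen)
{
	const unsigned char *tb[DEMO_VF_MAX + 1] = { 0 };	/* zeroed table */
	struct demo_attr hdr;
	struct demo_vf_rate rate;

	demo_parse(tb, DEMO_VF_MAX, buf, buflen);
	if (!tb[DEMO_VF_RATE])
		return 0;	/* attribute absent, as on an older kernel */

	memcpy(&hdr, tb[DEMO_VF_RATE], sizeof(hdr));
	if (hdr.len < sizeof(hdr) + sizeof(rate))
		return 0;	/* payload too short to hold the struct */

	memcpy(&rate, tb[DEMO_VF_RATE] + sizeof(hdr), sizeof(rate));
	return rate.max_tx_rate;
}
```

Without the zeroed table, the slot for a missing attribute would hold stack garbage and the presence check would be meaningless, which is why the patch changes both the declaration and the access.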


* Re: [PATCH iproute2 0/3] ip netns: Run over all netns
From: Vadim Kochan @ 2015-01-09 19:17 UTC (permalink / raw)
  To: Cong Wang; +Cc: Vadim Kochan, netdev
In-Reply-To: <CAHA+R7NPrBw_tN20adU6m0A0Toggo9SO9LpVV7WSaVa51Fyfsg@mail.gmail.com>

On Fri, Jan 09, 2015 at 10:49:50AM -0800, Cong Wang wrote:
> On Wed, Jan 7, 2015 at 4:52 PM, Vadim Kochan <vadim4j@gmail.com> wrote:
> > On Wed, Jan 07, 2015 at 04:04:14PM -0800, Cong Wang wrote:
> >> On Wed, Jan 7, 2015 at 3:04 AM, Vadim Kochan <vadim4j@gmail.com> wrote:
> >> > From: Vadim Kochan <vadim4j@gmail.com>
> >> >
> >> > Allow 'ip netns del' and 'ip netns exec' to run over all network namespace names.
> >> >
> >> > 'ip netns exec' forcibly executes the command on each nsname.
> >> >
> >>
> >> Why does this have to be done in the iproute command?
> >> That is, why not just offload this to a shell script like the one below?
> >>
> >> for ns in `ip netns show`;
> >> do
> >>      ip netns exec $ns ip link show.....
> >> done
> >
> > Hm, but wouldn't it be better to have it in iproute instead of collecting
> > scripts? Scripts allow you to do a lot of things, but in this case it seems
> > like a feature which is related to iproute.
> 
> iproute2 should keep a minimum set of features especially when
> a one-liner shell script can do that.

BTW, this script should also print the netns name
before each 'ip netns exec $ns ...' invocation.


* Re: [PATCH iproute2] ip link: Fix crash on older kernels when show VF dev
From: Vadim Kochan @ 2015-01-09 19:06 UTC (permalink / raw)
  To: William Dauchy; +Cc: Vadim Kochan, netdev
In-Reply-To: <20150109175557.GF31810@gandi.net>

On Fri, Jan 09, 2015 at 06:55:57PM +0100, William Dauchy wrote:
> On Jan09 19:25, Vadim Kochan wrote:
> > From: Vadim Kochan <vadim4j@gmail.com>
> > 
> > The issue was caused that ifla_vf_rate does not exist on
> > older kernels and should be checked if it exists as nested attr.
> > 
> > Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
> > Reported-by: William Dauchy <william@gandi.net>
> > Tested-by: William Dauchy <william@gandi.com>
> 
> gandi.net actually ;)
> 
> Thanks,
> 

Sorry, I will re-send.


* Re: [PATCH v2 net-next] bridge: Add ability to enable TSO
From: Pankaj Gupta @ 2015-01-09 19:10 UTC (permalink / raw)
  To: Toshiaki Makita; +Cc: David S . Miller, Stephen Hemminger, netdev, bridge
In-Reply-To: <54AF73AD.4000008@lab.ntt.co.jp>


> > 
> >>
> >> Currently a bridge device turns off TSO feature if no bridge ports
> >> support it. We can always enable it, since packets can be segmented on
> >> ports by software as well as on the bridge device.
> >> This will reduce the number of packets processed in the bridge.
> >>
> >> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> >> ---
> >> v2: Use an existing helper function.
> >>
> >>  net/bridge/br_if.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> >> index ed307db..81e49fb 100644
> >> --- a/net/bridge/br_if.c
> >> +++ b/net/bridge/br_if.c
> >> @@ -424,6 +424,7 @@ netdev_features_t br_features_recompute(struct net_bridge *br,
> >>  		features = netdev_increment_features(features,
> >>  						     p->dev->features, mask);
> >>  	}
> >> +	features = netdev_add_tso_features(features, mask);
> > 
> > Just a question: are we adding latency if the source sends traffic at a
> > very low rate?
> > Do we need this by default?
> 
> Is your concern tcp_tso_should_defer() in tcp_write_xmit()?
yes.

> If so, since TSO packet is created by GSO even without this patch, this
> should not increase latency there.
> This patch just delays the point of software segmentation from the
> bridge device to its port.

I think I get your point now.

Thanks,
Pankaj
> 
> Thanks,
> Toshiaki Makita
> 


* Re: route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-09 19:08 UTC (permalink / raw)
  To: Cong Wang; +Cc: Eric Dumazet, David Miller, netdev@vger.kernel.org
In-Reply-To: <CAHA+R7Ps2_+5NgjE8HNbrTxbjhtt4nJyK9y5HrbHez2PhZEiPQ@mail.gmail.com>

On Fri, Jan 9, 2015 at 10:47 AM, Cong Wang <cwang@twopensource.com> wrote:
> On Thu, Jan 8, 2015 at 12:13 PM, Ani Sinha <ani@arista.com> wrote:
>> On Thu, Jan 8, 2015 at 11:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>
>>> If you want to use network namespaces, you have to adapt your scripts.
>>>
>>> Nobody claimed network namespaces were totally transparent.
>>>
>>
>> I see. I am going back to an old thread here where Linus says that the
>> #1 rule is:
>>
>> "We don't regress user space"
>>
>> https://lkml.org/lkml/2013/7/16/565
>>
>> Breaking scripts seems to me to fall into the category of regressing
>> userspace. Or maybe we can treat these sysctls more softly since they
>> are not, strictly speaking, Linux ABIs.
>
> As Eric said, it has been like this since day 0,

I beg to differ. It has not been like this for that particular sysctl
from day 0. That sysctl was available from a child namespace and now
it isn't.

> why you still think
> we break something? It is you who misunderstands the interface
> not us who break your script.

Perhaps. What I am truly confused about is:

- We are keeping a sysctl interface that does absolutely nothing in
the kernel and is completely useless, in case some userland
scripts/tools would be broken by its removal.

- Surprisingly, we contradict ourselves when we let scripts break when
running from a child namespace because the same sysctl is no longer
available!

When the source is available for a script or tool, it's easy to change
the code to conform to the new semantics. However, for old binaries
for which we do not have any source, it's not easy or is impossible to
fix them.

I rest my case. We will of course find a way to fix our code if that
is what netdev thinks is the way to go.


* Re: [PATCH iproute2 0/3] ip netns: Run over all netns
From: Cong Wang @ 2015-01-09 18:49 UTC (permalink / raw)
  To: Vadim Kochan; +Cc: netdev
In-Reply-To: <20150108005212.GA10487@angus-think.lan>

On Wed, Jan 7, 2015 at 4:52 PM, Vadim Kochan <vadim4j@gmail.com> wrote:
> On Wed, Jan 07, 2015 at 04:04:14PM -0800, Cong Wang wrote:
>> On Wed, Jan 7, 2015 at 3:04 AM, Vadim Kochan <vadim4j@gmail.com> wrote:
>> > From: Vadim Kochan <vadim4j@gmail.com>
>> >
>> > Allow 'ip netns del' and 'ip netns exec' to run over all network namespace names.
>> >
>> > 'ip netns exec' forcibly executes the command on each nsname.
>> >
>>
>> Why does this have to be done in the iproute command?
>> That is, why not just offload this to a shell script like the one below?
>>
>> for ns in `ip netns show`;
>> do
>>      ip netns exec $ns ip link show.....
>> done
>
> Hm, but wouldn't it be better to have it in iproute instead of collecting
> scripts? Scripts allow you to do a lot of things, but in this case it seems
> like a feature which is related to iproute.

iproute2 should keep a minimum set of features especially when
a one-liner shell script can do that.


* Re: [PATCH] bonding: cleanup bond_opts array
From: Andy Gospodarek @ 2015-01-09 18:48 UTC (permalink / raw)
  To: Jonathan Toppins; +Cc: netdev, shm, Nikolay Aleksandrov
In-Reply-To: <1420828268-10360-1-git-send-email-jtoppins@cumulusnetworks.com>

On Fri, Jan 09, 2015 at 01:31:08PM -0500, Jonathan Toppins wrote:
> Remove the empty array element initializer and size the array with
> BOND_OPT_LAST so the compiler will complain if more elements are in
> there than should be.
> 
> An interesting unwanted side effect of this initializer is that if one
> inserts new options into the middle of the array then this initializer
> will zero out the option that equals BOND_OPT_TLB_DYNAMIC_LB+1.
> 
> Example:
> Extend the OPTS enum:
> enum {
>    ...
>    BOND_OPT_TLB_DYNAMIC_LB,
>    BOND_OPT_LACP_NEW1,
>    BOND_OPT_LAST
> };
> 
> Now insert into bond_opts array:
> static const struct bond_option bond_opts[] = {
>       ...
>       [BOND_OPT_LACP_RATE] = { .... unchanged stuff .... },
>       [BOND_OPT_LACP_NEW1] = { ... new stuff ... },
>       ...
>       [BOND_OPT_TLB_DYNAMIC_LB] = { .... unchanged stuff ....},
>       { } // MARK A
> };
> 
> Since BOND_OPT_LACP_NEW1 = BOND_OPT_TLB_DYNAMIC_LB+1, the last
> initializer (MARK A) will overwrite the contents of BOND_OPT_LACP_NEW1
> and can be easily viewed with the crash utility.
> 
> Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
> Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
> Cc: Nikolay Aleksandrov <nikolay@redhat.com>

I do not recall if there was a specific reason that Nik added this, so
presuming there was not....

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>

> ---
>  drivers/net/bonding/bond_options.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
> index 1a61cc9..9bd538d4 100644
> --- a/drivers/net/bonding/bond_options.c
> +++ b/drivers/net/bonding/bond_options.c
> @@ -186,7 +186,7 @@ static const struct bond_opt_value bond_tlb_dynamic_lb_tbl[] = {
>  	{ NULL,  -1, 0}
>  };
>  
> -static const struct bond_option bond_opts[] = {
> +static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>  	[BOND_OPT_MODE] = {
>  		.id = BOND_OPT_MODE,
>  		.name = "mode",
> @@ -379,8 +379,7 @@ static const struct bond_option bond_opts[] = {
>  		.values = bond_tlb_dynamic_lb_tbl,
>  		.flags = BOND_OPTFLAG_IFDOWN,
>  		.set = bond_option_tlb_dynamic_lb_set,
> -	},
> -	{ }
> +	}
>  };
>  
>  /* Searches for an option by name */
> -- 
> 1.7.10.4
> 
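
The overwrite described in the commit message can be reproduced in a few lines of plain C. This is a sketch under assumed names: the DEMO_OPT_* enum and struct demo_option are hypothetical and much smaller than the real bond_opts table. A positional initializer that follows a designated one applies to the next index, and a later initializer for the same element replaces the earlier one (compilers may warn about the override).

```c
#include <stddef.h>

/* Hypothetical option ids; DEMO_OPT_NEW was appended right before LAST. */
enum {
	DEMO_OPT_MODE,
	DEMO_OPT_RATE,
	DEMO_OPT_TLB,
	DEMO_OPT_NEW,
	DEMO_OPT_LAST
};

struct demo_option {
	int id;
	const char *name;
};

/* Unsized array that still carries the trailing empty element. */
static const struct demo_option demo_opts_unsized[] = {
	[DEMO_OPT_MODE] = { DEMO_OPT_MODE, "mode" },
	[DEMO_OPT_RATE] = { DEMO_OPT_RATE, "rate" },
	[DEMO_OPT_NEW]  = { DEMO_OPT_NEW,  "new"  },	/* inserted in the middle */
	[DEMO_OPT_TLB]  = { DEMO_OPT_TLB,  "tlb"  },
	{ 0 }	/* positional: lands on DEMO_OPT_TLB + 1 == DEMO_OPT_NEW */
};

/* Sized array without the trailing element: every entry survives. */
static const struct demo_option demo_opts_sized[DEMO_OPT_LAST] = {
	[DEMO_OPT_MODE] = { DEMO_OPT_MODE, "mode" },
	[DEMO_OPT_RATE] = { DEMO_OPT_RATE, "rate" },
	[DEMO_OPT_NEW]  = { DEMO_OPT_NEW,  "new"  },
	[DEMO_OPT_TLB]  = { DEMO_OPT_TLB,  "tlb"  }
};

/* Look up an option name; NULL means the slot was zeroed. */
static const char *demo_opt_name(const struct demo_option *tbl, int id)
{
	return tbl[id].name;
}
```

In the unsized table the DEMO_OPT_NEW slot ends up zeroed even though it was initialized explicitly, while the sized table keeps it intact.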


* Re: Clarification regarding IFLA_BRPORT_LEARNING_SYNC and aging of fdb entries learnt via br_fdb_external_learn_add()
From: Scott Feldman @ 2015-01-09 18:47 UTC (permalink / raw)
  To: Arad, Ronen; +Cc: Jiri Pirko, Siva Mannem, Netdev
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DDA643@ORSMSX101.amr.corp.intel.com>

On Fri, Jan 9, 2015 at 8:15 AM, Arad, Ronen <ronen.arad@intel.com> wrote:
>
>
>>-----Original Message-----
>>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>>Behalf Of Scott Feldman
>>Sent: Friday, January 09, 2015 3:47 AM
>>To: Jiri Pirko
>>Cc: Siva Mannem; Netdev
>>Subject: Re: Clarification regarding IFLA_BRPORT_LEARNING_SYNC and aging of
>>fdb entries learnt via br_fdb_external_learn_add()
>>
>>On Wed, Jan 7, 2015 at 4:53 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Tue, Dec 30, 2014 at 07:20:21PM CET, siva.mannem.lnx@gmail.com wrote:
>>>>Hi,
>>>>
>>>>I am trying to understand the ongoing switch device offload effort and
>>>>am following the discussions. I have a question regarding
>>>>IFLA_BRPORT_LEARNING_SYNC flag and how aging happens when this flag is
>>>>enabled on a port that is attached to a bridge that has vlan filtering
>>>>enabled.
>>>>
>>>>If I understand correctly, when IFLA_BRPORT_LEARNING_SYNC is set on a
>>>>bridge port, fdb entries that are learnt externally (maybe learnt by
>>>>hardware, with the driver notified) are synced to the bridge's fdb using
>>>>br_fdb_external_learn_add(). The fdb
>>>>entries (fdb->added_by_external_learn set to true) that are learnt via
>>>>this method are also deleted by the aging logic after the aging time
>>>>even though L2 data forwarding happens in hardware.
>>
>>This is correct...
>>
>>>> Is there a way
>>>>where aging can be disabled for these entries? and let the entries be
>>>>removed only via br_fdb_external_learn_delete()? or am I missing
>>>>something?
>>>
>>> Currently externally learned fdb entries are indeed removed during aging
>>> cleanup. I believe that br_fdb_cleanup should check added_by_external_learn
>>> and not remove those fdbs. What do you think, Scott?
>>
>>Something like that would work, if we added another brport flag to
>>control that.  With the current arrangement, using bridge for aging
>>out entries, we want br_fdb_cleanup removing the external_learned
>>fdbs, but if there was another brport flag we could fine tune that.
>>Say new flag is IFLA_BRPORT_AGING_OFFLOAD or something like that.  I'm
>>not sure how aging settings for the bridge are pushed down to offload
>>hw, or if there is a different set for hw.
>>
>>But, isn't it nice to let Linux bridge control aging?  That way,
>>bridge -s fdb dump shows nice statistics on fdb entries.  Hardware
>>isn't involved in the aging processes (other than being told to remove
>>an entry).  Simple hardware equals simple driver.  Linux remains
>>control point.
>>
> It is indeed simpler. However, if the overhead of reading hit bits from the HW
> and updating freshness of entries using br_fdb_external_learn_add() is too
> expensive, it would force such platforms to disable learning sync altogether.
> Therefore, I believe an aging offload flag (could be sufficient at bridge level)
> and an external aging interval (possibly longer than the software aging interval)
> will encourage drivers to use learning sync.
>>-scott

I'm not sure I follow that last part.

Can we list out the use-cases to see what's missing?

Case 1: bridge ages out learned_sync'ed entries

bridge port learning: off
offload port learning: on
offload port learning_sync: on

Driver calls br_fdb_external_learn_add() periodically to refresh bridge
fdb entry to keep it from going stale.
Bridge removes aged out fdb entries (and indirectly tells offload
device to do the same).

Case 2: offload device/bridge age out entries independently

bridge port learning: on
offload port learning: on
offload port learning_sync: off

Bridge ages out its stale learned entries, independent of offload device.
Offload device ages out its stale learned entries, independent of bridge.

Case 3: ?

Please help me finish the use-case list so we can see what's missing.

-scott
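
To make the cases easier to compare, here is a small model of the aging decision. It is only a sketch of the idea discussed in this thread, not the bridge code: struct demo_fdb_entry, demo_should_age_out() and the aging_offload parameter are hypothetical, the latter standing in for the flag proposed above.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of an fdb entry. */
struct demo_fdb_entry {
	unsigned long updated;		/* time of last refresh, in ticks */
	bool added_by_external_learn;	/* synced from the offload device */
};

/* Decide whether the bridge's aging pass should drop an entry.
 * aging_offload models the proposed flag: when set, externally learned
 * entries are left alone and only an explicit delete removes them.
 */
static bool demo_should_age_out(const struct demo_fdb_entry *e,
				unsigned long now, unsigned long ageing_time,
				bool aging_offload)
{
	if (aging_offload && e->added_by_external_learn)
		return false;
	return now - e->updated >= ageing_time;
}

/* What a driver does in case 1 to keep an entry from going stale. */
static void demo_refresh(struct demo_fdb_entry *e, unsigned long now)
{
	e->updated = now;
}
```

With aging_offload false this matches case 1: an externally learned entry survives only as long as the driver keeps refreshing it. With aging_offload true the bridge never ages such an entry, which would be a candidate for case 3.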


* Re: route/max_size sysctl in ipv4
From: Cong Wang @ 2015-01-09 18:47 UTC (permalink / raw)
  To: Ani Sinha; +Cc: Eric Dumazet, David Miller, netdev@vger.kernel.org
In-Reply-To: <CAOxq_8M=Ros56U_VbUVumjS226jHs1DK0A-K9aRksTMu88H_sg@mail.gmail.com>

On Thu, Jan 8, 2015 at 12:13 PM, Ani Sinha <ani@arista.com> wrote:
> On Thu, Jan 8, 2015 at 11:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> If you want to use network namespaces, you have to adapt your scripts.
>>
>> Nobody claimed network namespaces were totally transparent.
>>
>
> I see. I am going back to an old thread here where Linus says that the
> #1 rule is:
>
> "We don't regress user space"
>
> https://lkml.org/lkml/2013/7/16/565
>
> Breaking scripts seems to me to fall into the category of regressing
> userspace. Or maybe we can treat these sysctls more softly since they
> are not, strictly speaking, Linux ABIs.

As Eric said, it has been like this since day 0, so why do you still think
we broke something? It is you who misunderstands the interface,
not us who broke your script.


* [PATCH RFC net-next] ip_tunnel: create percpu gro_cell
From: Martin KaFai Lau @ 2015-01-09 18:42 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team

In the ipip tunnel, skb->queue_mapping is lost in ipip_rcv(), so
all skbs are queued to the same cell->napi_skbs and
gro_cell_poll is pinned to one core under load. In production traffic,
we also see severe rx_dropped on the tunl iface, probably due to
this limit: skb_queue_len(&cell->napi_skbs) > netdev_max_backlog

This patch allocates the cells with alloc_percpu(struct gro_cell) and
schedules gro_cell_poll to process each skb on the core that received it.

Setup:
VIP_PREFIX=9.9.9.9/32
REMOTE_REAL_IP=10.228.95.75

if [ "$1" = "encap" ]
then
    sudo ip tunnel add mode ipip remote ${REMOTE_REAL_IP}
    sudo ip link set dev ipip0 up
    sudo ip route add dev ipip0 ${VIP_PREFIX}
else
    # Decapsulating host

    sudo ip tunnel add mode ipip
    sudo ip link set dev tunl0 up
    sudo ip addr add dev lo ${VIP_PREFIX}
    sudo sysctl -a  | grep '\.rp_filter' | awk '{print $1;}' | \
        xargs -n1 -I{} sudo sysctl {}=0
fi

Before:
[root@DECAP ~]# netserver -p 8888
[root@ENCAP ~]# super_netperf 200 -t TCP_RR -H 9.9.9.9 -p 8888 \
-l 30 -- -d 0x6 -m 8k,64k -s 1M -S 1M
332215
[root@DECAP ~]# perf probe -a gro_cell_poll
[root@DECAP ~]# perf stat -I 1000 -a -A -e probe:gro_cell_poll
   117.258518273 CPU0                     0      probe:gro_cell_poll
   117.258518273 CPU1                     0      probe:gro_cell_poll
   117.258518273 CPU2                     0      probe:gro_cell_poll
   117.258518273 CPU3                     0      probe:gro_cell_poll
   117.258518273 CPU4                     0      probe:gro_cell_poll
   117.258518273 CPU5                     0      probe:gro_cell_poll
   117.258518273 CPU6                     0      probe:gro_cell_poll
   117.258518273 CPU7                     0      probe:gro_cell_poll
   117.258518273 CPU8                     0      probe:gro_cell_poll
   117.258518273 CPU9                     0      probe:gro_cell_poll
   117.258518273 CPU10                    0      probe:gro_cell_poll
   117.258518273 CPU11                    0      probe:gro_cell_poll
   117.258518273 CPU12                    0      probe:gro_cell_poll
   117.258518273 CPU13                    0      probe:gro_cell_poll
   117.258518273 CPU14                    0      probe:gro_cell_poll
   117.258518273 CPU15                4,882      probe:gro_cell_poll
   117.258518273 CPU16                    0      probe:gro_cell_poll
   117.258518273 CPU17                    0      probe:gro_cell_poll
   117.258518273 CPU18                    0      probe:gro_cell_poll
   117.258518273 CPU19                    0      probe:gro_cell_poll
   117.258518273 CPU20                    0      probe:gro_cell_poll
   117.258518273 CPU21                    0      probe:gro_cell_poll
   117.258518273 CPU22                    0      probe:gro_cell_poll
   117.258518273 CPU23                    0      probe:gro_cell_poll
   117.258518273 CPU24                    0      probe:gro_cell_poll
   117.258518273 CPU25                    0      probe:gro_cell_poll
   117.258518273 CPU26                    0      probe:gro_cell_poll
   117.258518273 CPU27                    0      probe:gro_cell_poll
   117.258518273 CPU28                    0      probe:gro_cell_poll
   117.258518273 CPU29                    0      probe:gro_cell_poll
   117.258518273 CPU30                    0      probe:gro_cell_poll
   117.258518273 CPU31                    0      probe:gro_cell_poll
   117.258518273 CPU32                    0      probe:gro_cell_poll
   117.258518273 CPU33                    0      probe:gro_cell_poll
   117.258518273 CPU34                    0      probe:gro_cell_poll
   117.258518273 CPU35                    0      probe:gro_cell_poll
   117.258518273 CPU36                    0      probe:gro_cell_poll
   117.258518273 CPU37                    0      probe:gro_cell_poll
   117.258518273 CPU38                    0      probe:gro_cell_poll
   117.258518273 CPU39                    0      probe:gro_cell_poll

After:
[root@DECAP ~]# netserver -p 8888
[root@ENCAP ~]# super_netperf 200 -t TCP_RR -H 9.9.9.9 -p 8888 \
-l 30 -- -d 0x6 -m 8k,64k -s 1M -S 1M
877530
[root@DECAP ~]# perf probe -a gro_cell_poll
[root@DECAP ~]# perf stat -I 1000 -a -A -e probe:gro_cell_poll
    40.085714389 CPU0                13,607      probe:gro_cell_poll
    40.085714389 CPU1                13,188      probe:gro_cell_poll
    40.085714389 CPU2                12,913      probe:gro_cell_poll
    40.085714389 CPU3                12,790      probe:gro_cell_poll
    40.085714389 CPU4                13,395      probe:gro_cell_poll
    40.085714389 CPU5                13,121      probe:gro_cell_poll
    40.085714389 CPU6                11,083      probe:gro_cell_poll
    40.085714389 CPU7                12,945      probe:gro_cell_poll
    40.085714389 CPU8                13,704      probe:gro_cell_poll
    40.085714389 CPU9                13,514      probe:gro_cell_poll
    40.085714389 CPU10                    0      probe:gro_cell_poll
    40.085714389 CPU11                    0      probe:gro_cell_poll
    40.085714389 CPU12                    0      probe:gro_cell_poll
    40.085714389 CPU13                    0      probe:gro_cell_poll
    40.085714389 CPU14                    0      probe:gro_cell_poll
    40.085714389 CPU15                    0      probe:gro_cell_poll
    40.085714389 CPU16                    0      probe:gro_cell_poll
    40.085714389 CPU17                    0      probe:gro_cell_poll
    40.085714389 CPU18                    0      probe:gro_cell_poll
    40.085714389 CPU19                    0      probe:gro_cell_poll
    40.085714389 CPU20               10,402      probe:gro_cell_poll
    40.085714389 CPU21               12,312      probe:gro_cell_poll
    40.085714389 CPU22               11,913      probe:gro_cell_poll
    40.085714389 CPU23               12,964      probe:gro_cell_poll
    40.085714389 CPU24               13,727      probe:gro_cell_poll
    40.085714389 CPU25               12,943      probe:gro_cell_poll
    40.085714389 CPU26               13,558      probe:gro_cell_poll
    40.085714389 CPU27               12,676      probe:gro_cell_poll
    40.085714389 CPU28               13,754      probe:gro_cell_poll
    40.085714389 CPU29               13,379      probe:gro_cell_poll
    40.085714389 CPU30                    0      probe:gro_cell_poll
    40.085714389 CPU31                    0      probe:gro_cell_poll
    40.085714389 CPU32                    0      probe:gro_cell_poll
    40.085714389 CPU33                    0      probe:gro_cell_poll
    40.085714389 CPU34                    0      probe:gro_cell_poll
    40.085714389 CPU35                    0      probe:gro_cell_poll
    40.085714389 CPU36                    0      probe:gro_cell_poll
    40.085714389 CPU37                    0      probe:gro_cell_poll
    40.085714389 CPU38                    0      probe:gro_cell_poll
    40.085714389 CPU39                    0      probe:gro_cell_poll

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/gro_cells.h | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
index 734d9b5..cdac448 100644
--- a/include/net/gro_cells.h
+++ b/include/net/gro_cells.h
@@ -11,22 +11,20 @@ struct gro_cell {
 } ____cacheline_aligned_in_smp;
 
 struct gro_cells {
-	unsigned int		gro_cells_mask;
-	struct gro_cell		*cells;
+	struct gro_cell __percpu	*cells;
 };
 
 static inline void gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
 {
-	struct gro_cell *cell = gcells->cells;
+	struct gro_cell *cell;
 	struct net_device *dev = skb->dev;
 
-	if (!cell || skb_cloned(skb) || !(dev->features & NETIF_F_GRO)) {
+	if (!gcells->cells || skb_cloned(skb) || !(dev->features & NETIF_F_GRO)) {
 		netif_rx(skb);
 		return;
 	}
 
-	if (skb_rx_queue_recorded(skb))
-		cell += skb_get_rx_queue(skb) & gcells->gro_cells_mask;
+	cell = this_cpu_ptr(gcells->cells);
 
 	if (skb_queue_len(&cell->napi_skbs) > netdev_max_backlog) {
 		atomic_long_inc(&dev->rx_dropped);
@@ -72,15 +70,12 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
 {
 	int i;
 
-	gcells->gro_cells_mask = roundup_pow_of_two(netif_get_num_default_rss_queues()) - 1;
-	gcells->cells = kcalloc(gcells->gro_cells_mask + 1,
-				sizeof(struct gro_cell),
-				GFP_KERNEL);
+	gcells->cells = alloc_percpu(struct gro_cell);
 	if (!gcells->cells)
 		return -ENOMEM;
 
-	for (i = 0; i <= gcells->gro_cells_mask; i++) {
-		struct gro_cell *cell = gcells->cells + i;
+	for_each_possible_cpu(i) {
+		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
 
 		skb_queue_head_init(&cell->napi_skbs);
 		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
@@ -91,16 +86,16 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
 
 static inline void gro_cells_destroy(struct gro_cells *gcells)
 {
-	struct gro_cell *cell = gcells->cells;
 	int i;
 
-	if (!cell)
+	if (!gcells->cells)
 		return;
-	for (i = 0; i <= gcells->gro_cells_mask; i++,cell++) {
+	for_each_possible_cpu(i) {
+		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
 		netif_napi_del(&cell->napi);
 		skb_queue_purge(&cell->napi_skbs);
 	}
-	kfree(gcells->cells);
+	free_percpu(gcells->cells);
 	gcells->cells = NULL;
 }
 
-- 
1.8.1
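
The effect on cell selection can be modelled in userspace. This is a sketch, not kernel code: demo_cell_by_queue(), demo_cell_by_cpu() and demo_busy_cells() are hypothetical helpers that mirror the index computation before and after the patch, and the round-robin CPU assignment is an assumption made for the illustration.

```c
#include <stdbool.h>

#define DEMO_MAX_CELLS 64

/* Old scheme: recorded rx queue masked to the table size, or cell 0 when
 * no queue was recorded (the ipip case described above).
 */
static unsigned int demo_cell_by_queue(bool queue_recorded,
				       unsigned int rx_queue,
				       unsigned int mask)
{
	return queue_recorded ? (rx_queue & mask) : 0;
}

/* New scheme: the cell that belongs to the CPU handling the packet. */
static unsigned int demo_cell_by_cpu(unsigned int cpu)
{
	return cpu;
}

/* Spread npkts packets round-robin over ncpus CPUs with no rx queue
 * recorded, and return how many cells received at least one packet.
 */
static unsigned int demo_busy_cells(unsigned int npkts, unsigned int ncpus,
				    bool percpu)
{
	unsigned int hits[DEMO_MAX_CELLS] = { 0 };
	unsigned int busy = 0;
	unsigned int i;

	if (ncpus == 0 || ncpus > DEMO_MAX_CELLS)
		return 0;

	for (i = 0; i < npkts; i++) {
		unsigned int cpu = i % ncpus;
		unsigned int cell = percpu ? demo_cell_by_cpu(cpu)
					   : demo_cell_by_queue(false, 0, ncpus - 1);

		hits[cell]++;
	}
	for (i = 0; i < ncpus; i++)
		if (hits[i])
			busy++;
	return busy;
}
```

In this model the old scheme funnels every packet into a single cell regardless of how many CPUs receive traffic, which is the shape of the "Before" perf output; the per-CPU scheme uses one cell per receiving CPU.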


* Re: [PATCH RFC] pci: Control whether VFs are probed on pci_enable_sriov
From: Don Dutile @ 2015-01-09 18:38 UTC (permalink / raw)
  To: Bjorn Helgaas, Eli Cohen
  Cc: davem, linux-pci, netdev, ogerlitz, yevgenyp, Eli Cohen
In-Reply-To: <20150109182546.GG6575@google.com>

On 01/09/2015 01:25 PM, Bjorn Helgaas wrote:
> On Sun, Dec 07, 2014 at 03:08:13PM +0200, Eli Cohen wrote:
>> Sometimes it is not desirable to probe the virtual functions right away,
>> but rather leave the decision to the host's administrator.
>>
>> This can save host side resource usage by VF instances which would be
>> eventually probed to VMs.
>>
>> Use a parameter to pci_enable_sriov to control that policy, and modify
>> all current callers such that they retain the same functionality.
>>
>> Use a one shot flag on struct pci_device which is cleared after the
>> first probe is ignored so subsequent attempts go through.
>>
>> Cc: Donald Dutile <ddutile@redhat.com>
>> Signed-off-by: Eli Cohen <eli@mellanox.com>
>
> Seems like we never really reached a consensus here.  Please repost if you
> want to continue down this path.
>
> Bjorn
>
hmmm, seems like I have to increase the power on my nack-phaser.
-dd

>> ---
>> This approach is used by the mlx5 driver SRIOV implementation, so
>> sending this to get feedback from the PCI and networking folks.
>>
>>   drivers/misc/genwqe/card_base.c                      |  2 +-
>>   drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c    |  2 +-
>>   drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c      |  2 +-
>>   drivers/net/ethernet/cisco/enic/enic_main.c          |  2 +-
>>   drivers/net/ethernet/emulex/benet/be_main.c          |  2 +-
>>   drivers/net/ethernet/intel/fm10k/fm10k_iov.c         |  2 +-
>>   drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c   |  2 +-
>>   drivers/net/ethernet/intel/igb/igb_main.c            |  2 +-
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c       |  4 ++--
>>   drivers/net/ethernet/mellanox/mlx4/main.c            |  2 +-
>>   drivers/net/ethernet/neterion/vxge/vxge-main.c       |  2 +-
>>   drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c |  2 +-
>>   drivers/net/ethernet/sfc/siena_sriov.c               |  2 +-
>>   drivers/pci/iov.c                                    | 12 +++++++-----
>>   drivers/pci/pci-driver.c                             | 11 ++++++++---
>>   drivers/scsi/lpfc/lpfc_init.c                        |  2 +-
>>   include/linux/pci.h                                  |  5 +++--
>>   17 files changed, 33 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/misc/genwqe/card_base.c b/drivers/misc/genwqe/card_base.c
>> index 4cf8f82cfca2..69253ca17506 100644
>> --- a/drivers/misc/genwqe/card_base.c
>> +++ b/drivers/misc/genwqe/card_base.c
>> @@ -1325,7 +1325,7 @@ static int genwqe_sriov_configure(struct pci_dev *dev, int numvfs)
>>
>>   	if (numvfs > 0) {
>>   		genwqe_setup_vf_jtimer(cd);
>> -		rc = pci_enable_sriov(dev, numvfs);
>> +		rc = pci_enable_sriov(dev, numvfs, 1);
>>   		if (rc < 0)
>>   			return rc;
>>   		return numvfs;
>> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
>> index c88b20af87df..773b20224a47 100644
>> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
>> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
>> @@ -2570,7 +2570,7 @@ int bnx2x_enable_sriov(struct bnx2x *bp)
>>   	if (rc)
>>   		return rc;
>>
>> -	rc = pci_enable_sriov(bp->pdev, req_vfs);
>> +	rc = pci_enable_sriov(bp->pdev, req_vfs, 1);
>>   	if (rc) {
>>   		BNX2X_ERR("pci_enable_sriov failed with %d\n", rc);
>>   		return rc;
>> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> index 3aea82bb9039..6e8afbfd3eba 100644
>> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
>> @@ -6597,7 +6597,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>   sriov:
>>   #ifdef CONFIG_PCI_IOV
>>   	if (func < ARRAY_SIZE(num_vf) && num_vf[func] > 0)
>> -		if (pci_enable_sriov(pdev, num_vf[func]) == 0)
>> +		if (pci_enable_sriov(pdev, num_vf[func], 1) == 0)
>>   			dev_info(&pdev->dev,
>>   				 "instantiated %u virtual functions\n",
>>   				 num_vf[func]);
>> diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
>> index 86ee350e57f0..8a8b1d86f18a 100644
>> --- a/drivers/net/ethernet/cisco/enic/enic_main.c
>> +++ b/drivers/net/ethernet/cisco/enic/enic_main.c
>> @@ -2421,7 +2421,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>>   		pci_read_config_word(pdev, pos + PCI_SRIOV_TOTAL_VF,
>>   			&enic->num_vfs);
>>   		if (enic->num_vfs) {
>> -			err = pci_enable_sriov(pdev, enic->num_vfs);
>> +			err = pci_enable_sriov(pdev, enic->num_vfs, 1);
>>   			if (err) {
>>   				dev_err(dev, "SRIOV enable failed, aborting."
>>   					" pci_enable_sriov() returned %d\n",
>> diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
>> index dc77ec2bdafd..a96491777ac4 100644
>> --- a/drivers/net/ethernet/emulex/benet/be_main.c
>> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
>> @@ -3274,7 +3274,7 @@ static int be_vf_setup(struct be_adapter *adapter)
>>   	}
>>
>>   	if (!old_vfs) {
>> -		status = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
>> +		status = pci_enable_sriov(adapter->pdev, adapter->num_vfs, 1);
>>   		if (status) {
>>   			dev_err(dev, "SRIOV enable failed\n");
>>   			adapter->num_vfs = 0;
>> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
>> index 060190864238..04a3dc5acc28 100644
>> --- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
>> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
>> @@ -408,7 +408,7 @@ int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs)
>>   		 */
>>   		fm10k_disable_aer_comp_abort(pdev);
>>
>> -		err = pci_enable_sriov(pdev, num_vfs);
>> +		err = pci_enable_sriov(pdev, num_vfs, 1);
>>   		if (err) {
>>   			dev_err(&pdev->dev,
>>   				"Enable PCI SR-IOV failed: %d\n", err);
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> index 668d860275d6..fe56e09725f2 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> @@ -852,7 +852,7 @@ int i40e_alloc_vfs(struct i40e_pf *pf, u16 num_alloc_vfs)
>>
>>   	/* Check to see if we're just allocating resources for extant VFs */
>>   	if (pci_num_vf(pf->pdev) != num_alloc_vfs) {
>> -		ret = pci_enable_sriov(pf->pdev, num_alloc_vfs);
>> +		ret = pci_enable_sriov(pf->pdev, num_alloc_vfs, 1);
>>   		if (ret) {
>>   			dev_err(&pf->pdev->dev,
>>   				"Failed to enable SR-IOV, error %d.\n", ret);
>> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
>> index 3c0221620c9d..da01326ef550 100644
>> --- a/drivers/net/ethernet/intel/igb/igb_main.c
>> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
>> @@ -2742,7 +2742,7 @@ static int igb_enable_sriov(struct pci_dev *pdev, int num_vfs)
>>
>>   	/* only call pci_enable_sriov() if no VFs are allocated already */
>>   	if (!old_vfs) {
>> -		err = pci_enable_sriov(pdev, adapter->vfs_allocated_count);
>> +		err = pci_enable_sriov(pdev, adapter->vfs_allocated_count, 1);
>>   		if (err)
>>   			goto err_out;
>>   	}
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
>> index 04eee7c7b653..74b33483a0d1 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
>> @@ -149,7 +149,7 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
>>   		 */
>>   		adapter->num_vfs = min_t(unsigned int, adapter->num_vfs, IXGBE_MAX_VFS_DRV_LIMIT);
>>
>> -		err = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
>> +		err = pci_enable_sriov(adapter->pdev, adapter->num_vfs, 1);
>>   		if (err) {
>>   			e_err(probe, "Failed to enable PCI sriov: %d\n", err);
>>   			adapter->num_vfs = 0;
>> @@ -270,7 +270,7 @@ static int ixgbe_pci_sriov_enable(struct pci_dev *dev, int num_vfs)
>>   	for (i = 0; i < adapter->num_vfs; i++)
>>   		ixgbe_vf_configuration(dev, (i | 0x10000000));
>>
>> -	err = pci_enable_sriov(dev, num_vfs);
>> +	err = pci_enable_sriov(dev, num_vfs, 1);
>>   	if (err) {
>>   		e_dev_warn("Failed to enable PCI sriov: %d\n", err);
>>   		return err;
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>> index 3044f9e623cb..ae38b556ec13 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>> @@ -2350,7 +2350,7 @@ static u64 mlx4_enable_sriov(struct mlx4_dev *dev, struct pci_dev *pdev,
>>   					 existing_vfs, total_vfs);
>>   		} else {
>>   			mlx4_warn(dev, "Enabling SR-IOV with %d VFs\n", total_vfs);
>> -			err = pci_enable_sriov(pdev, total_vfs);
>> +			err = pci_enable_sriov(pdev, total_vfs, 1);
>>   		}
>>   		if (err) {
>>   			mlx4_err(dev, "Failed to enable SR-IOV, continuing without SR-IOV (err = %d)\n",
>> diff --git a/drivers/net/ethernet/neterion/vxge/vxge-main.c b/drivers/net/ethernet/neterion/vxge/vxge-main.c
>> index cc0485e3c621..c341e73fc68c 100644
>> --- a/drivers/net/ethernet/neterion/vxge/vxge-main.c
>> +++ b/drivers/net/ethernet/neterion/vxge/vxge-main.c
>> @@ -4495,7 +4495,7 @@ vxge_probe(struct pci_dev *pdev, const struct pci_device_id *pre)
>>   	/* Enable SRIOV mode, if firmware has SRIOV support and if it is a PF */
>>   	if (is_sriov(function_mode) && !is_sriov_initialized(pdev) &&
>>   	   (ll_config->intr_type != INTA)) {
>> -		ret = pci_enable_sriov(pdev, num_vfs);
>> +		ret = pci_enable_sriov(pdev, num_vfs, 1);
>>   		if (ret)
>>   			vxge_debug_ll_config(VXGE_ERR,
>>   				"Failed in enabling SRIOV mode: %d\n", ret);
>> diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
>> index a29538b86edf..b483705a1ef1 100644
>> --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
>> +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
>> @@ -570,7 +570,7 @@ static int qlcnic_sriov_pf_enable(struct qlcnic_adapter *adapter, int num_vfs)
>>   	if (!qlcnic_sriov_enable_check(adapter))
>>   		return 0;
>>
>> -	err = pci_enable_sriov(adapter->pdev, num_vfs);
>> +	err = pci_enable_sriov(adapter->pdev, num_vfs, 1);
>>   	if (err)
>>   		qlcnic_sriov_pf_cleanup(adapter);
>>
>> diff --git a/drivers/net/ethernet/sfc/siena_sriov.c b/drivers/net/ethernet/sfc/siena_sriov.c
>> index a8bbbad68a88..6804ed04cfcd 100644
>> --- a/drivers/net/ethernet/sfc/siena_sriov.c
>> +++ b/drivers/net/ethernet/sfc/siena_sriov.c
>> @@ -1332,7 +1332,7 @@ int efx_siena_sriov_init(struct efx_nic *efx)
>>
>>   	/* At this point we must be ready to accept VFDI requests */
>>
>> -	rc = pci_enable_sriov(efx->pci_dev, efx->vf_count);
>> +	rc = pci_enable_sriov(efx->pci_dev, efx->vf_count, 1);
>>   	if (rc)
>>   		goto fail_pci;
>>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index 4d109c07294a..f6aba5beea78 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -57,7 +57,7 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
>>   		pci_remove_bus(virtbus);
>>   }
>>
>> -static int virtfn_add(struct pci_dev *dev, int id, int reset)
>> +static int virtfn_add(struct pci_dev *dev, int id, int reset, int probe)
>>   {
>>   	int i;
>>   	int rc = -ENOMEM;
>> @@ -85,6 +85,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>>   	virtfn->physfn = pci_dev_get(dev);
>>   	virtfn->is_virtfn = 1;
>>   	virtfn->multifunction = 0;
>> +	virtfn->probe_vf = probe;
>>
>>   	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>>   		res = dev->resource + PCI_IOV_RESOURCES + i;
>> @@ -170,7 +171,7 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
>>   	pci_dev_put(dev);
>>   }
>>
>> -static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>> +static int sriov_enable(struct pci_dev *dev, int nr_virtfn, int probe_vfs)
>>   {
>>   	int rc;
>>   	int i, j;
>> @@ -255,7 +256,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>>   		initial = nr_virtfn;
>>
>>   	for (i = 0; i < initial; i++) {
>> -		rc = virtfn_add(dev, i, 0);
>> +		rc = virtfn_add(dev, i, 0, probe_vfs);
>>   		if (rc)
>>   			goto failed;
>>   	}
>> @@ -558,17 +559,18 @@ int pci_iov_bus_range(struct pci_bus *bus)
>>    * pci_enable_sriov - enable the SR-IOV capability
>>    * @dev: the PCI device
>>    * @nr_virtfn: number of virtual functions to enable
>> + * @probe_vfs: if zero, don't probe new VFs, otherwise probe if suitable driver available
>>    *
>>    * Returns 0 on success, or negative on failure.
>>    */
>> -int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>> +int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int probe_vfs)
>>   {
>>   	might_sleep();
>>
>>   	if (!dev->is_physfn)
>>   		return -ENOSYS;
>>
>> -	return sriov_enable(dev, nr_virtfn);
>> +	return sriov_enable(dev, nr_virtfn, probe_vfs);
>>   }
>>   EXPORT_SYMBOL_GPL(pci_enable_sriov);
>>
>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>> index 2b3c89425bb5..d5b93339b8a4 100644
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -397,9 +397,14 @@ static int pci_device_probe(struct device *dev)
>>   	drv = to_pci_driver(dev->driver);
>>   	pci_dev = to_pci_dev(dev);
>>   	pci_dev_get(pci_dev);
>> -	error = __pci_device_probe(drv, pci_dev);
>> -	if (error)
>> -		pci_dev_put(pci_dev);
>> +	if (!pci_dev->is_virtfn || pci_dev->probe_vf) {
>> +		error = __pci_device_probe(drv, pci_dev);
>> +		if (error)
>> +			pci_dev_put(pci_dev);
>> +	}
>> +	/* one shot blocking of probe */
>> +	if (pci_dev->is_virtfn && !pci_dev->probe_vf)
>> +		pci_dev->probe_vf = 1;
>>
>>   	return error;
>>   }
>> diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
>> index 0b2c53af85c7..2f81f471b8f3 100644
>> --- a/drivers/scsi/lpfc/lpfc_init.c
>> +++ b/drivers/scsi/lpfc/lpfc_init.c
>> @@ -4797,7 +4797,7 @@ lpfc_sli_probe_sriov_nr_virtfn(struct lpfc_hba *phba, int nr_vfn)
>>   		return -EINVAL;
>>   	}
>>
>> -	rc = pci_enable_sriov(pdev, nr_vfn);
>> +	rc = pci_enable_sriov(pdev, nr_vfn, 1);
>>   	if (rc) {
>>   		lpfc_printf_log(phba, KERN_WARNING, LOG_INIT,
>>   				"2806 Failed to enable sriov on this device "
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 4c8ac5fcc224..beb2640ba18d 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -373,6 +373,7 @@ struct pci_dev {
>>   	phys_addr_t rom; /* Physical address of ROM if it's not from the BAR */
>>   	size_t romlen; /* Length of ROM if it's not from the BAR */
>>   	char *driver_override; /* Driver name to force a match */
>> +	int probe_vf; /* probe this device */
>>   };
>>
>>   static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
>> @@ -1655,14 +1656,14 @@ int pci_ext_cfg_avail(void);
>>   void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
>>
>>   #ifdef CONFIG_PCI_IOV
>> -int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>> +int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int probe_vfs);
>>   void pci_disable_sriov(struct pci_dev *dev);
>>   int pci_num_vf(struct pci_dev *dev);
>>   int pci_vfs_assigned(struct pci_dev *dev);
>>   int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>>   int pci_sriov_get_totalvfs(struct pci_dev *dev);
>>   #else
>> -static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>> +static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int nr_virt_probe)
>>   { return -ENODEV; }
>>   static inline void pci_disable_sriov(struct pci_dev *dev) { }
>>   static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
>> --
>> 2.1.3
>>
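
The one-shot gating in the quoted pci-driver.c hunk can be modelled in
user space. The sketch below is only an illustration under stated
assumptions: struct fake_pci_dev, fake_driver_probe() and
fake_device_probe() are hypothetical stand-ins, not kernel code, and
only the two if-conditions are taken from the hunk itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few struct pci_dev fields the hunk uses. */
struct fake_pci_dev {
	bool is_virtfn;   /* device is a virtual function */
	int probe_vf;     /* non-zero: driver probe is allowed */
	int probe_calls;  /* how often the stand-in driver probe ran */
};

/* Stand-in for __pci_device_probe(): counts calls and reports success. */
static int fake_driver_probe(struct fake_pci_dev *dev)
{
	dev->probe_calls++;
	return 0;
}

/* Same two conditions as the quoted pci_device_probe() change. */
static int fake_device_probe(struct fake_pci_dev *dev)
{
	int error = 0;

	if (!dev->is_virtfn || dev->probe_vf)
		error = fake_driver_probe(dev);

	/* one shot blocking of probe */
	if (dev->is_virtfn && !dev->probe_vf)
		dev->probe_vf = 1;

	return error;
}

/* A VF created with probe_vf == 0 skips exactly one probe attempt. */
static void one_shot_probe_demo(void)
{
	struct fake_pci_dev vf = { .is_virtfn = true, .probe_vf = 0 };

	fake_device_probe(&vf);		/* first attempt is swallowed */
	assert(vf.probe_calls == 0 && vf.probe_vf == 1);

	fake_device_probe(&vf);		/* later attempts go through */
	assert(vf.probe_calls == 1);
}
```

In this model a physical function (is_virtfn == false) is always
probed and its probe_vf field is never touched, which matches the
"retain the same functionality" goal for existing callers passing 1.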

* [PATCH] bonding: cleanup bond_opts array
From: Jonathan Toppins @ 2015-01-09 18:31 UTC (permalink / raw)
  To: netdev; +Cc: shm, Andy Gospodarek, Nikolay Aleksandrov

Remove the empty array element initializer and size the array with
BOND_OPT_LAST so the compiler will complain if more elements are in
there than should be.

An interesting unwanted side effect of this initializer is that if one
inserts new options into the middle of the array then this initializer
will zero out the option that equals BOND_OPT_TLB_DYNAMIC_LB+1.

Example:
Extend the OPTS enum:
enum {
   ...
   BOND_OPT_TLB_DYNAMIC_LB,
   BOND_OPT_LACP_NEW1,
   BOND_OPT_LAST
};

Now insert into bond_opts array:
static const struct bond_option bond_opts[] = {
      ...
      [BOND_OPT_LACP_RATE] = { .... unchanged stuff .... },
      [BOND_OPT_LACP_NEW1] = { ... new stuff ... },
      ...
      [BOND_OPT_TLB_DYNAMIC_LB] = { .... unchanged stuff ....},
      { } // MARK A
};

Since BOND_OPT_LACP_NEW1 = BOND_OPT_TLB_DYNAMIC_LB+1, the last
initializer (MARK A) will overwrite the contents of BOND_OPT_LACP_NEW1
and can be easily viewed with the crash utility.
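
The overwrite can also be reproduced outside the kernel. The following
is a minimal sketch, assuming a cut-down enum and struct; the OPT_*
names, struct opt and bond_opts_demo() are hypothetical and only mirror
the shape of bond_opts. It uses { 0 } instead of { } so that it builds
as ISO C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the option enum. */
enum {
	OPT_MODE,
	OPT_LACP_RATE,
	OPT_TLB_DYNAMIC_LB,
	OPT_LACP_NEW1,		/* appended to the enum ... */
	OPT_LAST
};

struct opt {
	int id;
	const char *name;
};

/* Old layout: unsized array with a trailing positional initializer. */
static const struct opt opts_old[] = {
	[OPT_MODE]           = { OPT_MODE, "mode" },
	[OPT_LACP_RATE]      = { OPT_LACP_RATE, "lacp_rate" },
	[OPT_LACP_NEW1]      = { OPT_LACP_NEW1, "lacp_new1" }, /* ... but placed mid-array */
	[OPT_TLB_DYNAMIC_LB] = { OPT_TLB_DYNAMIC_LB, "tlb_dynamic_lb" },
	{ 0 }	/* positional: index OPT_TLB_DYNAMIC_LB + 1 == OPT_LACP_NEW1 */
};

/* New layout: sized by OPT_LAST, no trailing element. */
static const struct opt opts_new[OPT_LAST] = {
	[OPT_MODE]           = { OPT_MODE, "mode" },
	[OPT_LACP_RATE]      = { OPT_LACP_RATE, "lacp_rate" },
	[OPT_LACP_NEW1]      = { OPT_LACP_NEW1, "lacp_new1" },
	[OPT_TLB_DYNAMIC_LB] = { OPT_TLB_DYNAMIC_LB, "tlb_dynamic_lb" }
};

static void bond_opts_demo(void)
{
	/* The trailing initializer silently zeroed the new option. */
	assert(opts_old[OPT_LACP_NEW1].name == NULL);
	/* Without it the new option survives. */
	assert(opts_new[OPT_LACP_NEW1].name != NULL);
}
```

A positional initializer continues at the index after the most recent
designator, and a later initializer for the same element replaces the
earlier one; gcc reports this only when -Woverride-init (part of
-Wextra) is enabled.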

Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
---
 drivers/net/bonding/bond_options.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 1a61cc9..9bd538d4 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -186,7 +186,7 @@ static const struct bond_opt_value bond_tlb_dynamic_lb_tbl[] = {
 	{ NULL,  -1, 0}
 };

-static const struct bond_option bond_opts[] = {
+static const struct bond_option bond_opts[BOND_OPT_LAST] = {
 	[BOND_OPT_MODE] = {
 		.id = BOND_OPT_MODE,
 		.name = "mode",
@@ -379,8 +379,7 @@ static const struct bond_option bond_opts[] = {
 		.values = bond_tlb_dynamic_lb_tbl,
 		.flags = BOND_OPTFLAG_IFDOWN,
 		.set = bond_option_tlb_dynamic_lb_set,
-	},
-	{ }
+	}
 };

 /* Searches for an option by name */
-- 
1.7.10.4

* Re: [PATCH 1/1] update ip-sysctl.txt documentation (v2)
From: Ani Sinha @ 2015-01-09 18:34 UTC (permalink / raw)
  To: Jonathan Corbet, David Miller, Eric Dumazet, linux-doc,
	linux-kernel, Ani Sinha, Pádraig Brady,
	netdev@vger.kernel.org, fruggeri
In-Reply-To: <1420674356-32210-1-git-send-email-ani@arista.com>

On Wed, Jan 7, 2015 at 3:45 PM, Ani Sinha <ani@arista.com> wrote:
> Update documentation to reflect the fact that
> /proc/sys/net/ipv4/route/max_size is no longer used for ipv4.

Any more feedback on this?

* Re: [net-next PATCH v1 00/11] A flow API
From: John Fastabend @ 2015-01-09 18:27 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: tgraf, sfeldma, jiri, simon.horman, netdev, davem, andy,
	Shrijeet Mukherjee
In-Reply-To: <54ABD3D1.6020608@mojatatu.com>

On 01/06/2015 04:23 AM, Jamal Hadi Salim wrote:
> John,
>
> There are a lot of things to digest in your posting - I am interested
> in commenting on many things but feel need to pay attention to details
> in general given the importance of this interface (and conference is
> chewing my netdev time at the moment). I need to actually sit down
> and stare at code and documentation.

Any additional feedback would be great. Sorry, I tried to be concise
but this email got fairly long regardless. I also delayed the response
a few days as I mulled over some of it.

>
> I do think we need to have this discussion as part of the BOF
> Shrijeet is running at netdev01.

Maybe I was a bit ambitious thinking we could get this merged
by then? Maybe I can resolve concerns via email ;) What I wanted
to discuss at netdev01 was specifically the mapping between
software models and hardware model as exposed by this series.
I see value in doing this in user space for some consumers (OVS,
for example), which is why the UAPI is there to support this.

Also I think in-kernel users are interesting as well and 'tc'
is a reasonable candidate to try and offload from in the kernel IMO.

>
> General comments:
> 1) one of the things that i have learnt over time is that not
> everything that sits or is abstracted from hardware is a table.
> You could have structs or simple scalars for config or runtime
> control. How does what you are proposing here allow to express that?
> I dont think you'd need it for simple things but if you dont allow
> for it you run into the square-hole-round-peg syndrome of "yeah
> i can express that u32 variable as a single table with a single row
> and a single column" ;-> or "you need another infrastructure for
> that single scalar u32"

The interface (both UAPI and kernel API) deals exclusively with the
flow table pipeline at the moment. I've allowed for table attributes
which allow you to give tables characteristics. Right now it only
supports basic attribs like ingress_root and egress_root but I have
some work not in this series to allow tables to be dynamic
(allocated/freed) at runtime. More attributes could be added as needed
here. But this still only covers tables.

I agree there are other things besides tables, of course. First thing
that comes to mind for me is queues and QOS. How do we model these?
My take is you add another object type, call it QUEUE, and use a
'struct net_flow_queue' to model queues. Queues then have attributes
as well, like length, QOS policies, etc. I would call this extending the
infrastructure, not creating another one :). Maybe my naming it
'net_flow' is not ideal. With a queue structure I can connect queues
and tables together with an enqueue action. That would be one example; I
can generate more: encrypt operations, etc.

FWIW queues and QOS to me fit nicely into the existing infrastructure
and it may be easier to utilize the existing 'tc' UAPI for this.

In this series I just want to get the flow table piece down though.

>
> 2) So i understood the sense of replacing ethtool for classifier
> access with a direct interface mostly because thats what it was
> already doing - but i am not sure why you need
> it for a generic interface. Am i mistaken you are providing direct
> access to hardware from user space? Would this make essentially
> the Linux infrastructure a bypass (which vendors and their SDKs
> love)? IMHO, a good example is to pick something like netfilter
> or tc-filters and show how that is offloaded. This keeps it in
> the same spirit as what we are shooting for in L2/3 at the moment.
>

I'll try to knock these off one by one:

Yes we are providing an interface for userspace to interrogate the
hardware and program it. My take on this is even if you embed this
into another netlink family (OVS, NFT, TCA) you end up with the same
operations w.r.t. table support: (a) query hardware for
resources/constraints/etc and (b) an API to add/del rules in those
tables. It seems the intersection of these features with existing
netlink families is fairly small so I opted to create a new family.
The underlying hardware offload mechanisms in flow_table.c here could
be used by in-kernel consumers as well as user space. For some
consumers ('tc', perhaps) this makes good sense; for others ('OVS')
it does not, IMO.

Direct access to the hardware? Hmm, not so sure about that; it's an
abstraction layer so I can talk to _any_ hardware device using the
same semantics. But yes at the bottom of the interface there is
hardware. Although this provides a "raw" interface for userspace to
inspect and program the hardware, it equally provides an API for
in-kernel consumers to use the hardware offload APIs. For
example if you want 'tc' to offload a queueing discipline with some
filters. For what it's worth I did some experimental work here and for
some basic cases it's possible to do this offload. I'll explore
this more as Jiri/you suggest.

Would this make essentially the Linux infrastructure a bypass? Hmm,
I'm not sure exactly what you mean here. If switching is done in
the ASIC then the dataplane is being bypassed. And I don't want
to couple management of the software dataplane with management of the
hardware dataplane. It would be valid to have these dataplanes
running two completely different pipelines/network functions. So I
assume you mean does this API bypass the existing Linux control plane
infrastructure for software dataplanes. I'll say tentatively
yes it does. But in many cases my goal is to unify them in userspace
where it is easier to make policy decisions. For OVS, NFT it
seems to me that user space libraries can handle the unification
of hardware/software dataplanes. Further I think it is the correct
place to unify the dataplanes. I don't want to encode complex
policies into the kernel. Even if you embed the netlink UAPI into
another netlink family the semantics look the same.

To address how to offload existing infrastructures, I'll try to
explain my ideas for each subsystem.

I looked into using netfilter but really didn't get much traction
in the existing infrastructure. The trouble being nft wants to use
expressions like payload that have registers, base, offset, len in
the kernel but the hardware (again at least all the hardware I'm
working with) doesn't work with these semantics; it needs a field-id,
possibly the logical operation to use and the value to match. Yes I can
map base/offset/len to a field_id but what do I do with register? And
this sort of complication continues with most of the other expressions.
I could write a new expression that was primarily used by hardware
but could have a software user as well but I'm not convinced we would
ever use it in software when we already have the functionally more
generic expressions. To me this looks like a somewhat arbitrary
embedding into netfilter uapi where the gain of doing this is not
entirely clear to me.
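
To make the base/offset/len point concrete, here is a rough sketch of
such a mapping. Everything in it is an assumption made for
illustration: the sketch_* names, the field ids and the table are
hypothetical and come neither from nft nor from this series, and the
network base is assumed to be an IPv4 header on untagged Ethernet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical header bases and hardware field ids. */
enum sketch_base { SKETCH_BASE_LL, SKETCH_BASE_NET };

enum sketch_field {
	SKETCH_FIELD_UNKNOWN = 0,
	SKETCH_FIELD_ETH_DST,
	SKETCH_FIELD_ETH_SRC,
	SKETCH_FIELD_ETH_TYPE,
	SKETCH_FIELD_IPV4_SRC,
	SKETCH_FIELD_IPV4_DST
};

/* A payload load described the way the text describes it. */
struct sketch_payload {
	enum sketch_base base;
	unsigned int offset;	/* bytes from base */
	unsigned int len;	/* bytes */
};

struct sketch_field_map {
	struct sketch_payload payload;
	enum sketch_field field;
};

/* Only loads that line up exactly with a known field are listed. */
static const struct sketch_field_map sketch_map[] = {
	{ { SKETCH_BASE_LL,   0, 6 }, SKETCH_FIELD_ETH_DST },
	{ { SKETCH_BASE_LL,   6, 6 }, SKETCH_FIELD_ETH_SRC },
	{ { SKETCH_BASE_LL,  12, 2 }, SKETCH_FIELD_ETH_TYPE },
	{ { SKETCH_BASE_NET, 12, 4 }, SKETCH_FIELD_IPV4_SRC },
	{ { SKETCH_BASE_NET, 16, 4 }, SKETCH_FIELD_IPV4_DST }
};

/* Return the field id for an exact match, SKETCH_FIELD_UNKNOWN otherwise. */
static enum sketch_field sketch_payload_to_field(const struct sketch_payload *p)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_map) / sizeof(sketch_map[0]); i++) {
		const struct sketch_payload *m = &sketch_map[i].payload;

		if (m->base == p->base && m->offset == p->offset &&
		    m->len == p->len)
			return sketch_map[i].field;
	}
	return SKETCH_FIELD_UNKNOWN;
}
```

The lookup has no slot for the destination register, and any load that
does not line up with a listed field falls through to
SKETCH_FIELD_UNKNOWN, which is the mismatch described above.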

OVS would seem to have similar trouble: all the policy is in user
space. And the netlink UAPI is tuned for OVS; we don't want to start
adding/removing bits to support a hardware API where very little of it
would be used in the software only case and vice versa very little of
the OVS uapi messages as they exist today would be sufficient for the
hardware API. My point of view is the intersection is small enough here
that it's easier to write a clean API that stands on its own than try
to sync these hardware offload operations into the OVS UAPI. Further
OVS is very specific about what fields/tables it supports in its current
version and I don't want to force hardware into this model.

And finally 'tc'. Filters today can only be attached to qdiscs which
are bound to net_devices. So the model is netdevs have queues, queues
have a qdisc association and qdiscs have filters. Here we are
modelling a pipeline in hardware associated with a set of ports.
The model is slightly different: we have queues that dequeue into
an ingress table and an egress table that enqueues packets into queues.
Queues may or may not be bound to the same port. Yes I know 'tc' can
forward to ports but it has no notion of a global table space.

We could build a new 'tc' filter that loaded the hardware tables and
then added rules or deleted rules via hardware api but we would need
some new mechanics to get out the capabilities/resources. Basically
the same set of operations supported in the UAPI of this series. This
would end up IMO to be basically this series only embedded in the TCA_
family with a new filter kind. But then what do we attach it to? Not
a specific qdisc because it is associated with a set of qdiscs. And
additionally why would we use this qdisc*/hw-filter in software when
we already have u32 and bpf? IMO 'tc' is about per port(queues) QOS
and filters/actions to support this. That said I actually see offloading
'tc' qdisc/filters on the ports into the hardware as being useful
and using the operations added in this series to flow_table.c. See
my response to Jiri noting I'll go ahead and try to get this working.
OTOH I still think you need the UAPI proposed in this series for other
consumers.

Maybe I need to be enlightened but I thought for a bit about some grand
unification of ovs, bridge, tc, netlink, et al. but that seems like
an entirely different scope of project. (side note: filters/actions
are no longer locked by qdisc and could stand on their own) My thoughts
on this are not yet organized.

> Anyways I apologize i havent spent as much time (holiday period
> wasnt good for me and netdev01 is picking up and consuming my time
> but i will try my best to respond and comment with some latency)
>

Great, thanks. Maybe this will give you more to mull over. If it's
clear as mud let me know and I'll draw up some pictures. Likely
need to do that regardless. Bottom line I think the proposed API
here solves a real need.

Thanks!
John

> cheers,
> jamal
>
> On 12/31/14 14:45, John Fastabend wrote:
>> So... I could continue to mull over this and tweak bits and pieces
>> here and there but I decided its best to get a wider group of folks
>> looking at it and hopefully with any luck using it so here it is.
>>
>> This set creates a new netlink family and set of messages to configure
>> flow tables in hardware. I tried to make the commit messages
>> reasonably verbose at least in the flow_table patches.
>>
>> What we get at the end of this series is a working API to get device
>> capabilities and program flows using the rocker switch.
>>
>> I created a user space tool 'flow' that I use to configure and query
>> the devices it is posted here,
>>
>>     https://github.com/jrfastab/iprotue2-flow-tool
>>
>> For now it is a stand-alone tool but once the kernel bits get sorted
>> out (I'm guessing there will need to be a few versions of this series
>> to get it right) I would like to port it into the iproute2 package.
>> This way we can keep all of our tooling in one package see 'bridge'
>> for example.
>>
>> As far as testing, I've tested various combinations of tables and
>> rules on the rocker switch and it seems to work. I have not tested
>> 100% of the rocker code paths though. It would be great to get some
>> sort of automated framework around the API to do this. I don't
>> think this should gate the inclusion of the API though.
>>
>> I could use some help reviewing,
>>
>>    (a) error paths and netlink validation code paths
>>
>>    (b) Break down of structures vs netlink attributes. I
>>        am trying to balance flexibility given by having
>>        netlink TLV attributes vs conciseness. So some
>>        things are passed as structures.
>>
>>    (c) are there any devices that have pipelines that we
>>        can't represent with this API? It would be good to
>>        know about these so we can design it in probably
>>        in a future series.
>>
>> For some examples and maybe a bit more illustrative description I
>> posted a quickly typed up set of notes on github io pages. Here we
>> can show the description along with images produced by the flow tool
>> showing the pipeline. Once we settle a bit more on the API we should
>> probably do a clean up of this and other threads happening and commit
>> something to the Documentation directory.
>>
>>   http://jrfastab.github.io/jekyll/update/2014/12/21/flow-api.html
>>
>> Finally I have more patches to add support for creating and destroying
>> tables. This allows users to define the pipeline at runtime rather
>> than statically as rocker does now. After this set gets some traction
>> I'll look at pushing them in a next round. However it likely requires
>> adding another "world" to rocker. Another piece that I want to add is
>> a description of the actions and metadata. This way user space can
>> "learn" what an action is and how metadata interacts with the system.
>> This work is under development.
>>
>> Thanks! Any comments/feedback always welcome.
>>
>> And also thanks to everyone who helped with this flow API so far. All
>> the folks at Dusseldorf LPC, OVS summit Santa Clara, P4 authors for
>> some inspiration, the collection of IETF FoRCES documents I mulled
>> over, Netfilter workshop where I started to realize fixing ethtool
>> was most likely not going to work, etc.
>>
>> ---
>>
>> John Fastabend (11):
>>        net: flow_table: create interface for hw match/action tables
>>        net: flow_table: add flow, delete flow
>>        net: flow_table: add apply action argument to tables
>>        rocker: add pipeline model for rocker switch
>>        net: rocker: add set flow rules
>>        net: rocker: add group_id slices and drop explicit goto
>>        net: rocker: add multicast path to bridging
>>        net: rocker: add get flow API operation
>>        net: rocker: add cookie to group acls and use flow_id to set cookie
>>        net: rocker: have flow api calls set cookie value
>>        net: rocker: implement delete flow routine
>>
>>
>>   drivers/net/ethernet/rocker/rocker.c          | 1641 +++++++++++++++++++++++++
>>   drivers/net/ethernet/rocker/rocker_pipeline.h |  793 ++++++++++++
>>   include/linux/if_flow.h                       |  115 ++
>>   include/linux/netdevice.h                     |   20
>>   include/uapi/linux/if_flow.h                  |  413 ++++++
>>   net/Kconfig                                   |    7
>>   net/core/Makefile                             |    1
>>   net/core/flow_table.c                         | 1339 ++++++++++++++++++++
>>   8 files changed, 4312 insertions(+), 17 deletions(-)
>>   create mode 100644 drivers/net/ethernet/rocker/rocker_pipeline.h
>>   create mode 100644 include/linux/if_flow.h
>>   create mode 100644 include/uapi/linux/if_flow.h
>>   create mode 100644 net/core/flow_table.c
>>
>

-- 
John Fastabend         Intel Corporation

* Re: [PATCH RFC] pci: Control whether VFs are probed on pci_enable_sriov
From: Bjorn Helgaas @ 2015-01-09 18:25 UTC (permalink / raw)
  To: Eli Cohen
  Cc: davem, linux-pci, netdev, ogerlitz, yevgenyp, Eli Cohen,
	Donald Dutile
In-Reply-To: <1417957693-24979-1-git-send-email-eli@mellanox.com>

On Sun, Dec 07, 2014 at 03:08:13PM +0200, Eli Cohen wrote:
> Sometimes it is not desirable to probe the virtual functions right away,
> but rather leave the decision to the host's administrator.
> 
> This can save host side resource usage by VF instances which would be
> eventually probed to VMs.
> 
> Use a parameter to pci_enable_sriov to control that policy, and modify
> all current callers such that they retain the same functionality.
> 
> Use a one shot flag on struct pci_dev which is cleared after the
> first probe is ignored so subsequent attempts go through.
> 
> Cc: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Eli Cohen <eli@mellanox.com>

Seems like we never really reached a consensus here.  Please repost if you
want to continue down this path.

Bjorn

> ---
> This approach is used by the mlx5 driver SRIOV implementation, so
> sending this to get feedback from the PCI and networking folks.
> 
>  drivers/misc/genwqe/card_base.c                      |  2 +-
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c    |  2 +-
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c      |  2 +-
>  drivers/net/ethernet/cisco/enic/enic_main.c          |  2 +-
>  drivers/net/ethernet/emulex/benet/be_main.c          |  2 +-
>  drivers/net/ethernet/intel/fm10k/fm10k_iov.c         |  2 +-
>  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c   |  2 +-
>  drivers/net/ethernet/intel/igb/igb_main.c            |  2 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c       |  4 ++--
>  drivers/net/ethernet/mellanox/mlx4/main.c            |  2 +-
>  drivers/net/ethernet/neterion/vxge/vxge-main.c       |  2 +-
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c |  2 +-
>  drivers/net/ethernet/sfc/siena_sriov.c               |  2 +-
>  drivers/pci/iov.c                                    | 12 +++++++-----
>  drivers/pci/pci-driver.c                             | 11 ++++++++---
>  drivers/scsi/lpfc/lpfc_init.c                        |  2 +-
>  include/linux/pci.h                                  |  5 +++--
>  17 files changed, 33 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/misc/genwqe/card_base.c b/drivers/misc/genwqe/card_base.c
> index 4cf8f82cfca2..69253ca17506 100644
> --- a/drivers/misc/genwqe/card_base.c
> +++ b/drivers/misc/genwqe/card_base.c
> @@ -1325,7 +1325,7 @@ static int genwqe_sriov_configure(struct pci_dev *dev, int numvfs)
>  
>  	if (numvfs > 0) {
>  		genwqe_setup_vf_jtimer(cd);
> -		rc = pci_enable_sriov(dev, numvfs);
> +		rc = pci_enable_sriov(dev, numvfs, 1);
>  		if (rc < 0)
>  			return rc;
>  		return numvfs;
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> index c88b20af87df..773b20224a47 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> @@ -2570,7 +2570,7 @@ int bnx2x_enable_sriov(struct bnx2x *bp)
>  	if (rc)
>  		return rc;
>  
> -	rc = pci_enable_sriov(bp->pdev, req_vfs);
> +	rc = pci_enable_sriov(bp->pdev, req_vfs, 1);
>  	if (rc) {
>  		BNX2X_ERR("pci_enable_sriov failed with %d\n", rc);
>  		return rc;
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> index 3aea82bb9039..6e8afbfd3eba 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> @@ -6597,7 +6597,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  sriov:
>  #ifdef CONFIG_PCI_IOV
>  	if (func < ARRAY_SIZE(num_vf) && num_vf[func] > 0)
> -		if (pci_enable_sriov(pdev, num_vf[func]) == 0)
> +		if (pci_enable_sriov(pdev, num_vf[func], 1) == 0)
>  			dev_info(&pdev->dev,
>  				 "instantiated %u virtual functions\n",
>  				 num_vf[func]);
> diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
> index 86ee350e57f0..8a8b1d86f18a 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_main.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_main.c
> @@ -2421,7 +2421,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		pci_read_config_word(pdev, pos + PCI_SRIOV_TOTAL_VF,
>  			&enic->num_vfs);
>  		if (enic->num_vfs) {
> -			err = pci_enable_sriov(pdev, enic->num_vfs);
> +			err = pci_enable_sriov(pdev, enic->num_vfs, 1);
>  			if (err) {
>  				dev_err(dev, "SRIOV enable failed, aborting."
>  					" pci_enable_sriov() returned %d\n",
> diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
> index dc77ec2bdafd..a96491777ac4 100644
> --- a/drivers/net/ethernet/emulex/benet/be_main.c
> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -3274,7 +3274,7 @@ static int be_vf_setup(struct be_adapter *adapter)
>  	}
>  
>  	if (!old_vfs) {
> -		status = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
> +		status = pci_enable_sriov(adapter->pdev, adapter->num_vfs, 1);
>  		if (status) {
>  			dev_err(dev, "SRIOV enable failed\n");
>  			adapter->num_vfs = 0;
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
> index 060190864238..04a3dc5acc28 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
> @@ -408,7 +408,7 @@ int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs)
>  		 */
>  		fm10k_disable_aer_comp_abort(pdev);
>  
> -		err = pci_enable_sriov(pdev, num_vfs);
> +		err = pci_enable_sriov(pdev, num_vfs, 1);
>  		if (err) {
>  			dev_err(&pdev->dev,
>  				"Enable PCI SR-IOV failed: %d\n", err);
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> index 668d860275d6..fe56e09725f2 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> @@ -852,7 +852,7 @@ int i40e_alloc_vfs(struct i40e_pf *pf, u16 num_alloc_vfs)
>  
>  	/* Check to see if we're just allocating resources for extant VFs */
>  	if (pci_num_vf(pf->pdev) != num_alloc_vfs) {
> -		ret = pci_enable_sriov(pf->pdev, num_alloc_vfs);
> +		ret = pci_enable_sriov(pf->pdev, num_alloc_vfs, 1);
>  		if (ret) {
>  			dev_err(&pf->pdev->dev,
>  				"Failed to enable SR-IOV, error %d.\n", ret);
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index 3c0221620c9d..da01326ef550 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -2742,7 +2742,7 @@ static int igb_enable_sriov(struct pci_dev *pdev, int num_vfs)
>  
>  	/* only call pci_enable_sriov() if no VFs are allocated already */
>  	if (!old_vfs) {
> -		err = pci_enable_sriov(pdev, adapter->vfs_allocated_count);
> +		err = pci_enable_sriov(pdev, adapter->vfs_allocated_count, 1);
>  		if (err)
>  			goto err_out;
>  	}
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> index 04eee7c7b653..74b33483a0d1 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> @@ -149,7 +149,7 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
>  		 */
>  		adapter->num_vfs = min_t(unsigned int, adapter->num_vfs, IXGBE_MAX_VFS_DRV_LIMIT);
>  
> -		err = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
> +		err = pci_enable_sriov(adapter->pdev, adapter->num_vfs, 1);
>  		if (err) {
>  			e_err(probe, "Failed to enable PCI sriov: %d\n", err);
>  			adapter->num_vfs = 0;
> @@ -270,7 +270,7 @@ static int ixgbe_pci_sriov_enable(struct pci_dev *dev, int num_vfs)
>  	for (i = 0; i < adapter->num_vfs; i++)
>  		ixgbe_vf_configuration(dev, (i | 0x10000000));
>  
> -	err = pci_enable_sriov(dev, num_vfs);
> +	err = pci_enable_sriov(dev, num_vfs, 1);
>  	if (err) {
>  		e_dev_warn("Failed to enable PCI sriov: %d\n", err);
>  		return err;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 3044f9e623cb..ae38b556ec13 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2350,7 +2350,7 @@ static u64 mlx4_enable_sriov(struct mlx4_dev *dev, struct pci_dev *pdev,
>  					 existing_vfs, total_vfs);
>  		} else {
>  			mlx4_warn(dev, "Enabling SR-IOV with %d VFs\n", total_vfs);
> -			err = pci_enable_sriov(pdev, total_vfs);
> +			err = pci_enable_sriov(pdev, total_vfs, 1);
>  		}
>  		if (err) {
>  			mlx4_err(dev, "Failed to enable SR-IOV, continuing without SR-IOV (err = %d)\n",
> diff --git a/drivers/net/ethernet/neterion/vxge/vxge-main.c b/drivers/net/ethernet/neterion/vxge/vxge-main.c
> index cc0485e3c621..c341e73fc68c 100644
> --- a/drivers/net/ethernet/neterion/vxge/vxge-main.c
> +++ b/drivers/net/ethernet/neterion/vxge/vxge-main.c
> @@ -4495,7 +4495,7 @@ vxge_probe(struct pci_dev *pdev, const struct pci_device_id *pre)
>  	/* Enable SRIOV mode, if firmware has SRIOV support and if it is a PF */
>  	if (is_sriov(function_mode) && !is_sriov_initialized(pdev) &&
>  	   (ll_config->intr_type != INTA)) {
> -		ret = pci_enable_sriov(pdev, num_vfs);
> +		ret = pci_enable_sriov(pdev, num_vfs, 1);
>  		if (ret)
>  			vxge_debug_ll_config(VXGE_ERR,
>  				"Failed in enabling SRIOV mode: %d\n", ret);
> diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> index a29538b86edf..b483705a1ef1 100644
> --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> @@ -570,7 +570,7 @@ static int qlcnic_sriov_pf_enable(struct qlcnic_adapter *adapter, int num_vfs)
>  	if (!qlcnic_sriov_enable_check(adapter))
>  		return 0;
>  
> -	err = pci_enable_sriov(adapter->pdev, num_vfs);
> +	err = pci_enable_sriov(adapter->pdev, num_vfs, 1);
>  	if (err)
>  		qlcnic_sriov_pf_cleanup(adapter);
>  
> diff --git a/drivers/net/ethernet/sfc/siena_sriov.c b/drivers/net/ethernet/sfc/siena_sriov.c
> index a8bbbad68a88..6804ed04cfcd 100644
> --- a/drivers/net/ethernet/sfc/siena_sriov.c
> +++ b/drivers/net/ethernet/sfc/siena_sriov.c
> @@ -1332,7 +1332,7 @@ int efx_siena_sriov_init(struct efx_nic *efx)
>  
>  	/* At this point we must be ready to accept VFDI requests */
>  
> -	rc = pci_enable_sriov(efx->pci_dev, efx->vf_count);
> +	rc = pci_enable_sriov(efx->pci_dev, efx->vf_count, 1);
>  	if (rc)
>  		goto fail_pci;
>  
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4d109c07294a..f6aba5beea78 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -57,7 +57,7 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
>  		pci_remove_bus(virtbus);
>  }
>  
> -static int virtfn_add(struct pci_dev *dev, int id, int reset)
> +static int virtfn_add(struct pci_dev *dev, int id, int reset, int probe)
>  {
>  	int i;
>  	int rc = -ENOMEM;
> @@ -85,6 +85,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>  	virtfn->physfn = pci_dev_get(dev);
>  	virtfn->is_virtfn = 1;
>  	virtfn->multifunction = 0;
> +	virtfn->probe_vf = probe;
>  
>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>  		res = dev->resource + PCI_IOV_RESOURCES + i;
> @@ -170,7 +171,7 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
>  	pci_dev_put(dev);
>  }
>  
> -static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> +static int sriov_enable(struct pci_dev *dev, int nr_virtfn, int probe_vfs)
>  {
>  	int rc;
>  	int i, j;
> @@ -255,7 +256,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  		initial = nr_virtfn;
>  
>  	for (i = 0; i < initial; i++) {
> -		rc = virtfn_add(dev, i, 0);
> +		rc = virtfn_add(dev, i, 0, probe_vfs);
>  		if (rc)
>  			goto failed;
>  	}
> @@ -558,17 +559,18 @@ int pci_iov_bus_range(struct pci_bus *bus)
>   * pci_enable_sriov - enable the SR-IOV capability
>   * @dev: the PCI device
>   * @nr_virtfn: number of virtual functions to enable
> + * @probe_vfs: if zero, don't probe new VFs, otherwise probe if suitable driver available
>   *
>   * Returns 0 on success, or negative on failure.
>   */
> -int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
> +int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int probe_vfs)
>  {
>  	might_sleep();
>  
>  	if (!dev->is_physfn)
>  		return -ENOSYS;
>  
> -	return sriov_enable(dev, nr_virtfn);
> +	return sriov_enable(dev, nr_virtfn, probe_vfs);
>  }
>  EXPORT_SYMBOL_GPL(pci_enable_sriov);
>  
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 2b3c89425bb5..d5b93339b8a4 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -397,9 +397,14 @@ static int pci_device_probe(struct device *dev)
>  	drv = to_pci_driver(dev->driver);
>  	pci_dev = to_pci_dev(dev);
>  	pci_dev_get(pci_dev);
> -	error = __pci_device_probe(drv, pci_dev);
> -	if (error)
> -		pci_dev_put(pci_dev);
> +	if (!pci_dev->is_virtfn || pci_dev->probe_vf) {
> +		error = __pci_device_probe(drv, pci_dev);
> +		if (error)
> +			pci_dev_put(pci_dev);
> +	}
> +	/* one shot blocking of probe */
> +	if (pci_dev->is_virtfn && !pci_dev->probe_vf)
> +		pci_dev->probe_vf = 1;
>  
>  	return error;
>  }
> diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
> index 0b2c53af85c7..2f81f471b8f3 100644
> --- a/drivers/scsi/lpfc/lpfc_init.c
> +++ b/drivers/scsi/lpfc/lpfc_init.c
> @@ -4797,7 +4797,7 @@ lpfc_sli_probe_sriov_nr_virtfn(struct lpfc_hba *phba, int nr_vfn)
>  		return -EINVAL;
>  	}
>  
> -	rc = pci_enable_sriov(pdev, nr_vfn);
> +	rc = pci_enable_sriov(pdev, nr_vfn, 1);
>  	if (rc) {
>  		lpfc_printf_log(phba, KERN_WARNING, LOG_INIT,
>  				"2806 Failed to enable sriov on this device "
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 4c8ac5fcc224..beb2640ba18d 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -373,6 +373,7 @@ struct pci_dev {
>  	phys_addr_t rom; /* Physical address of ROM if it's not from the BAR */
>  	size_t romlen; /* Length of ROM if it's not from the BAR */
>  	char *driver_override; /* Driver name to force a match */
> +	int probe_vf; /* probe this device */
>  };
>  
>  static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
> @@ -1655,14 +1656,14 @@ int pci_ext_cfg_avail(void);
>  void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
>  
>  #ifdef CONFIG_PCI_IOV
> -int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> +int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int probe_vfs);
>  void pci_disable_sriov(struct pci_dev *dev);
>  int pci_num_vf(struct pci_dev *dev);
>  int pci_vfs_assigned(struct pci_dev *dev);
>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
>  #else
> -static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
> +static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int nr_virt_probe)
>  { return -ENODEV; }
>  static inline void pci_disable_sriov(struct pci_dev *dev) { }
>  static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
> -- 
> 2.1.3
> 
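The one-shot blocking added to pci_device_probe() in the hunk above can be modelled in user space. This is only a sketch under stated assumptions: struct fake_pci_dev, fake_driver_probe(), fake_device_probe() and one_shot_demo() are hypothetical stand-ins, not kernel APIs, and only the control flow mirrors the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the struct pci_dev fields the patch touches. */
struct fake_pci_dev {
	bool is_virtfn;   /* device is a virtual function */
	bool probe_vf;    /* false: skip the next probe attempt only */
	int  probe_calls; /* how often the "driver probe" actually ran */
};

/* Stand-in for __pci_device_probe(): it only counts invocations. */
int fake_driver_probe(struct fake_pci_dev *dev)
{
	dev->probe_calls++;
	return 0;
}

/* Mirrors the control flow the patch adds to pci_device_probe(). */
int fake_device_probe(struct fake_pci_dev *dev)
{
	int error = 0;

	if (!dev->is_virtfn || dev->probe_vf)
		error = fake_driver_probe(dev);

	/* one-shot blocking: re-arm probing for any later attempt */
	if (dev->is_virtfn && !dev->probe_vf)
		dev->probe_vf = true;

	return error;
}

/* Small self-check a caller can run from main(). */
void one_shot_demo(void)
{
	struct fake_pci_dev vf = { .is_virtfn = true, .probe_vf = false };

	fake_device_probe(&vf);        /* first attempt is suppressed */
	assert(vf.probe_calls == 0);
	fake_device_probe(&vf);        /* second attempt reaches the driver */
	assert(vf.probe_calls == 1);
}
```

In this model a VF created with probe_vf cleared skips exactly one probe attempt, while a device that is not a VF is always probed.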

^ permalink raw reply

* Re: [PATCH] i40e: don't enable and init FCOE by default when do PF reset
From: Dev, Vasu @ 2015-01-09 18:18 UTC (permalink / raw)
  To: Ronciak, John, Ethan Zhao, Kirsher, Jeffrey T, Brandeburg, Jesse,
	Allan, Bruce W, Wyborny, Carolyn, Skidmore, Donald C,
	Rose, Gregory V, Vick, Matthew, Williams, Mitch A, Parikh, Neerav
  Cc: e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org,
	brian.maly@oracle.com, linux-kernel@vger.kernel.org, Linux NICS,
	ethan.kernel@gmail.com
In-Reply-To: <D38E8E86660E514AB505863C19C9287C741097AF@ORSMSX102.amr.corp.intel.com>

> -----Original Message-----
> From: Ronciak, John
> Sent: Friday, January 09, 2015 8:42 AM
> To: Ethan Zhao; Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W;
> Wyborny, Carolyn; Skidmore, Donald C; Rose, Gregory V; Vick, Matthew;
> Williams, Mitch A; Dev, Vasu; Parikh, Neerav
> Cc: Linux NICS; e1000-devel@lists.sourceforge.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; ethan.kernel@gmail.com;
> brian.maly@oracle.com
> Subject: RE: [PATCH] i40e: don't enable and init FCOE by default when do PF
> reset
> 
> Adding Vasu and Neerav
> 
> Cheers,
> John
> 
> > -----Original Message-----
> > From: Ethan Zhao [mailto:ethan.zhao@oracle.com]
> > Sent: Friday, January 9, 2015 8:38 AM
> > To: Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W; Wyborny,
> > Carolyn; Skidmore, Donald C; Rose, Gregory V; Vick, Matthew; Ronciak,
> > John; Williams, Mitch A
> > Cc: Linux NICS; e1000-devel@lists.sourceforge.net;
> > netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > ethan.kernel@gmail.com; brian.maly@oracle.com; Ethan Zhao
> > Subject: [PATCH] i40e: don't enable and init FCOE by default when do
> > PF reset
> >
> > While do PF reset with function i40e_reset_and_rebuild(), it will call
> > i40e_init_pf_fcoe() by default if FCOE is defined, thus if the PF is
> > resetted, FCOE will be enabled whatever it was - enabled or not.
> >
> > Such bug might be hit when PF resumes from suspend, run diagnostic
> > test with ethtool, setup VLAN etc.
> >
> > Passed building with v3.19-rc3.
> >
> > Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_main.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
> > b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > index a5f2660..a2572cc 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > @@ -6180,9 +6180,12 @@ static void i40e_reset_and_rebuild(struct
> > i40e_pf *pf, bool reinit)
> >  	}
> >  #endif /* CONFIG_I40E_DCB */
> >  #ifdef I40E_FCOE
> > -	ret = i40e_init_pf_fcoe(pf);
> > -	if (ret)
> > -		dev_info(&pf->pdev->dev, "init_pf_fcoe failed: %d\n", ret);
> > +	if (pf->flags & I40E_FLAG_FCOE_ENABLED) {
> > +		ret = i40e_init_pf_fcoe(pf);

Gating the call on I40E_FLAG_FCOE_ENABLED here conflicts with how that flag gets set: I40E_FLAG_FCOE_ENABLED is set by i40e_init_pf_fcoe() itself, so with this check i40e_init_pf_fcoe() will never get called.

Jeff Kirsher should be sending out a patch I have queued that adds an I40E_FCOE Kconfig option. With it, FCoE is disabled by default and the user can enable FCoE only if needed. In the default config that patch has the same effect of skipping i40e_init_pf_fcoe(), whether or not the FCoE capability is enabled in the device.
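The circular dependency can be illustrated with a small user-space model. This is a sketch, assuming (as this reply states) that the flag is only ever set inside the init function. struct fake_pf, FAKE_FLAG_FCOE_ENABLED and all three functions are hypothetical stand-ins, not driver code.

```c
#include <assert.h>

#define FAKE_FLAG_FCOE_ENABLED 0x1u /* stand-in for I40E_FLAG_FCOE_ENABLED */

/* Hypothetical stand-in for the bits of struct i40e_pf that matter here. */
struct fake_pf {
	unsigned int flags;
	int init_calls;
};

/* Stand-in for i40e_init_pf_fcoe(): assumed to be the only flag setter. */
int fake_init_pf_fcoe(struct fake_pf *pf)
{
	pf->init_calls++;
	pf->flags |= FAKE_FLAG_FCOE_ENABLED;
	return 0;
}

/* Reset path as in the proposed patch: init is gated on the flag. */
void fake_reset_and_rebuild(struct fake_pf *pf)
{
	if (pf->flags & FAKE_FLAG_FCOE_ENABLED)
		fake_init_pf_fcoe(pf);
}

/* Small self-check a caller can run from main(). */
void fake_reset_demo(void)
{
	struct fake_pf pf = { .flags = 0, .init_calls = 0 };

	/* No matter how many resets happen, init is never reached. */
	fake_reset_and_rebuild(&pf);
	fake_reset_and_rebuild(&pf);
	assert(pf.init_calls == 0);
	assert((pf.flags & FAKE_FLAG_FCOE_ENABLED) == 0);
}
```

Under that assumption the guarded call is dead code unless something else has already run the init function once.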

From patchwork Wed Oct  2 23:26:08 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: [net] i40e: adds FCoE configure option
Date: Thu, 03 Oct 2013 07:26:08 -0000
From: Vasu Dev <vasu.dev@intel.com>
X-Patchwork-Id: 11797

Adds FCoE config option I40E_FCOE, so that FCoE can be enabled
as needed but otherwise have it disabled by default.

This also eliminates multiple FCoE config checks; there is now just
one config check, for CONFIG_I40E_FCOE.

I40E FCoE support was added in the 3.17 kernel and therefore this patch
should also be applied to the stable 3.17 kernel.

CC: <stable@vger.kernel.org>
Signed-off-by: Vasu Dev <vasu.dev@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>

---
 drivers/net/ethernet/intel/Kconfig           |   11 +++++++++++
 drivers/net/ethernet/intel/i40e/Makefile     |    2 +-
 drivers/net/ethernet/intel/i40e/i40e_osdep.h |    4 ++--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 5b8300a..4d61ef5 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -281,6 +281,17 @@ config I40E_DCB
 
 	  If unsure, say N.
 
+config I40E_FCOE
+	bool "Fibre Channel over Ethernet (FCoE)"
+	default n
+	depends on I40E && DCB && FCOE
+	---help---
+	  Say Y here if you want to use Fibre Channel over Ethernet (FCoE)
+	  in the driver. This will create new netdev for exclusive FCoE
+	  use with XL710 FCoE offloads enabled.
+
+	  If unsure, say N.
+
 config I40EVF
 	tristate "Intel(R) XL710 X710 Virtual Function Ethernet support"
 	depends on PCI_MSI
diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 4b94ddb..c405819 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -44,4 +44,4 @@ i40e-objs := i40e_main.o \
 	i40e_virtchnl_pf.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
-i40e-$(CONFIG_FCOE:m=y) += i40e_fcoe.o
+i40e-$(CONFIG_I40E_FCOE) += i40e_fcoe.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e_osdep.h b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
index 045b5c4..ad802dd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_osdep.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
@@ -78,7 +78,7 @@ do {                                                            \
 } while (0)
 
 typedef enum i40e_status_code i40e_status;
-#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
+#ifdef CONFIG_I40E_FCOE
 #define I40E_FCOE
-#endif /* CONFIG_FCOE or CONFIG_FCOE_MODULE */
+#endif
 #endif /* _I40E_OSDEP_H_ */

> > +		if (ret)
> > +			dev_info(&pf->pdev->dev,
> > +				 "init_pf_fcoe failed: %d\n", ret);
> > +	}
> >
> >  #endif
> >  	/* do basic switch setup */
> > --
> > 1.8.3.1


_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply related

* Re: [net-next PATCH v1 00/11] A flow API
From: John Fastabend @ 2015-01-09 18:10 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: tgraf, sfeldma, jhs, simon.horman, netdev, davem, andy
In-Reply-To: <20150108180320.GF1898@nanopsycho.orion>

On 01/08/2015 10:03 AM, Jiri Pirko wrote:
> Wed, Dec 31, 2014 at 08:45:19PM CET, john.fastabend@gmail.com wrote:
>> So... I could continue to mull over this and tweak bits and pieces
>> here and there but I decided it's best to get a wider group of folks
>> looking at it and hopefully with any luck using it so here it is.
>>
>> This set creates a new netlink family and set of messages to configure
>> flow tables in hardware. I tried to make the commit messages
>> reasonably verbose at least in the flow_table patches.
>>
>> What we get at the end of this series is a working API to get device
>> capabilities and program flows using the rocker switch.
>>
>> I created a user space tool 'flow' that I use to configure and query
>> the devices; it is posted here,
>>
>> 	https://github.com/jrfastab/iprotue2-flow-tool
>>
>> For now it is a stand-alone tool but once the kernel bits get sorted
>> out (I'm guessing there will need to be a few versions of this series
>> to get it right) I would like to port it into the iproute2 package.
>> This way we can keep all of our tooling in one package see 'bridge'
>> for example.
>>
>> As far as testing, I've tested various combinations of tables and
>> rules on the rocker switch and it seems to work. I have not tested
>> 100% of the rocker code paths though. It would be great to get some
>> sort of automated framework around the API to do this. I don't
>> think this should gate the inclusion of the API though.
>>
>> I could use some help reviewing,
>>
>>   (a) error paths and netlink validation code paths
>>
>>   (b) Break down of structures vs netlink attributes. I
>>       am trying to balance flexibility given by having
>>       netlink TLV attributes vs conciseness. So some
>>       things are passed as structures.
>>
>>   (c) are there any devices that have pipelines that we
>>       can't represent with this API? It would be good to
>>       know about these so we can design them in, probably
>>       in a future series.
>>
>> For some examples and maybe a bit more illustrative description I
>> posted a quickly typed up set of notes on github io pages. Here we
>> can show the description along with images produced by the flow tool
>> showing the pipeline. Once we settle a bit more on the API we should
>> probably do a clean up of this and other threads happening and commit
>> something to the Documentation directory.
>>
>> http://jrfastab.github.io/jekyll/update/2014/12/21/flow-api.html
>>
>> Finally I have more patches to add support for creating and destroying
>> tables. This allows users to define the pipeline at runtime rather
>> than statically as rocker does now. After this set gets some traction
>> I'll look at pushing them in a next round. However it likely requires
>> adding another "world" to rocker. Another piece that I want to add is
>> a description of the actions and metadata. This way user space can
>> "learn" what an action is and how metadata interacts with the system.
>> This work is under development.
>>
>> Thanks! Any comments/feedback always welcome.
>>
>> And also thanks to everyone who helped with this flow API so far. All
>> the folks at Dusseldorf LPC, OVS summit Santa Clara, P4 authors for
>> some inspiration, the collection of IETF FoRCES documents I mulled
>> over, Netfilter workshop where I started to realize fixing ethtool
>> was most likely not going to work, etc.
>>
>> ---
>>
>> John Fastabend (11):
>>       net: flow_table: create interface for hw match/action tables
>>       net: flow_table: add flow, delete flow
>>       net: flow_table: add apply action argument to tables
>>       rocker: add pipeline model for rocker switch
>>       net: rocker: add set flow rules
>>       net: rocker: add group_id slices and drop explicit goto
>>       net: rocker: add multicast path to bridging
>>       net: rocker: add get flow API operation
>>       net: rocker: add cookie to group acls and use flow_id to set cookie
>>       net: rocker: have flow api calls set cookie value
>>       net: rocker: implement delete flow routine
>
> Truly impressive work John (including the "flow" tool, documentation).
> Hats off.
>
> Currently, all is very userspace oriented and I understand the reason.
> I also understand why Jamal is a bit nervous about that fact. I am as well.
> Correct me if I'm wrong but this amount of "direct hw access" is
> unprecedented. The kernel has been here to cover the hw differences;
> I wonder if there is any way to continue in this direction with flows...
>

As it is currently written the API allows for abstracting the hardware
programming and low level interface by using a common model and API that
can represent a large array of devices.

I'm not sure what covering the hw differences would mean beyond the
above model/API. I intentionally didn't want to force _all_
hardware to expose a specific pipeline, for example the OVS pipeline.

> What I would love to see in this initial patchset is "the internal user".
> For example tc. The tc code could query the capabilities and decide what
> "flows" to put into hw tables.

Sure, the biggest gap for me on this is that 'tc' is actually about
ports/queues and currently filters/tables are part of qdiscs. The
model in this series is a pipeline that has a set of egress endpoints
that can be reached by actions. The endpoints would be ports or tunnel
engines or could be other network function blocks.

That said I can imagine pushing the configuration into a per port table
in the hardware or most likely just requiring any matches on egress
qdiscs to use an implied egress_port match. On ingress similarly use
an ingress_port match.

I'll look at doing this next week but I think the series is useful even
without any "internal users" ;) I'll send out a v2 with all the feedback
I've received so far shortly then think some more about this. Doing the
mapping from software filters/actions/tables onto the hardware tables
exposed by the API in this series is actually what I wanted to present
@ netdev conference so I think we are heading in the same direction.

.John


>
> Jiri
>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH iproute2] ip link: Fix crash on older kernels when show VF dev
From: William Dauchy @ 2015-01-09 17:55 UTC (permalink / raw)
  To: Vadim Kochan; +Cc: netdev, william
In-Reply-To: <1420824351-23450-1-git-send-email-vadim4j@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2330 bytes --]

On Jan09 19:25, Vadim Kochan wrote:
> From: Vadim Kochan <vadim4j@gmail.com>
> 
> The issue was caused that ifla_vf_rate does not exist on
> older kernels and should be checked if it exists as nested attr.
> 
> Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
> Reported-by: William Dauchy <william@gandi.net>
> Tested-by: William Dauchy <william@gandi.com>

gandi.net actually ;)

Thanks,

> ---
>  ip/ipaddress.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> index 28dfe8c..830b166 100644
> --- a/ip/ipaddress.c
> +++ b/ip/ipaddress.c
> @@ -259,11 +259,10 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
>  {
>  	struct ifla_vf_mac *vf_mac;
>  	struct ifla_vf_vlan *vf_vlan;
> -	struct ifla_vf_rate *vf_rate;
>  	struct ifla_vf_tx_rate *vf_tx_rate;
>  	struct ifla_vf_spoofchk *vf_spoofchk;
>  	struct ifla_vf_link_state *vf_linkstate;
> -	struct rtattr *vf[IFLA_VF_MAX+1];
> +	struct rtattr *vf[IFLA_VF_MAX + 1] = {};
>  	struct rtattr *tmp;
>  	SPRINT_BUF(b1);
>  
> @@ -277,7 +276,6 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
>  	vf_mac = RTA_DATA(vf[IFLA_VF_MAC]);
>  	vf_vlan = RTA_DATA(vf[IFLA_VF_VLAN]);
>  	vf_tx_rate = RTA_DATA(vf[IFLA_VF_TX_RATE]);
> -	vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
>  
>  	/* Check if the spoof checking vf info type is supported by
>  	 * this kernel.
> @@ -313,10 +311,16 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
>  		fprintf(fp, ", qos %d", vf_vlan->qos);
>  	if (vf_tx_rate->rate)
>  		fprintf(fp, ", tx rate %d (Mbps)", vf_tx_rate->rate);
> -	if (vf_rate->max_tx_rate)
> -		fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
> -	if (vf_rate->min_tx_rate)
> -		fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
> +
> +	if (vf[IFLA_VF_RATE]) {
> +		struct ifla_vf_rate *vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
> +
> +		if (vf_rate->max_tx_rate)
> +			fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
> +		if (vf_rate->min_tx_rate)
> +			fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
> +	}
> +
>  	if (vf_spoofchk && vf_spoofchk->setting != -1) {
>  		if (vf_spoofchk->setting)
>  			fprintf(fp, ", spoof checking on");
> -- 
> 2.1.3
> 
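The pattern the quoted patch relies on (a zero-initialised attribute table, then a NULL check before dereferencing an optional attribute) can be sketched in plain C. This is a hedged illustration only: fake_attr, fake_vf_rate, fake_parse(), fake_print_vf_rate() and the FAKE_VF_* constants are hypothetical stand-ins for the rtattr machinery, not iproute2 or kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical attribute types; the real IFLA_VF_* values differ. */
enum { FAKE_VF_MAC, FAKE_VF_TX_RATE, FAKE_VF_RATE, FAKE_VF_MAX = FAKE_VF_RATE };

struct fake_attr {
	int type;
	const void *data;
};

struct fake_vf_rate {
	unsigned int min_tx_rate;
	unsigned int max_tx_rate;
};

/* Record each attribute that is present; absent types keep their NULL slot. */
void fake_parse(const struct fake_attr *tb[], int max,
		const struct fake_attr *attrs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (attrs[i].type >= 0 && attrs[i].type <= max)
			tb[attrs[i].type] = &attrs[i];
}

/* Print the rate fields if the attribute exists; return fields printed. */
int fake_print_vf_rate(FILE *fp, const struct fake_attr *tb[])
{
	const struct fake_vf_rate *rate;
	int printed = 0;

	if (!tb[FAKE_VF_RATE])
		return 0; /* sender (e.g. an older kernel) omitted it */

	rate = tb[FAKE_VF_RATE]->data;
	if (rate->max_tx_rate) {
		fprintf(fp, ", max_tx_rate %uMbps", rate->max_tx_rate);
		printed++;
	}
	if (rate->min_tx_rate) {
		fprintf(fp, ", min_tx_rate %uMbps", rate->min_tx_rate);
		printed++;
	}
	return printed;
}

/* Small self-check a caller can run from main(). */
void fake_vf_rate_demo(void)
{
	/* Zero-initialised table: nothing parsed, nothing dereferenced. */
	const struct fake_attr *tb[FAKE_VF_MAX + 1] = { 0 };

	assert(fake_print_vf_rate(stdout, tb) == 0);
}
```

The zero initialisation matters as much as the check: without it, a slot for an attribute the sender never included would hold an indeterminate pointer rather than NULL.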

-- 
William

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH 1/1] openvswitch: Remove unnecessary version.h inclusion
From: Pravin Shelar @ 2015-01-09 18:00 UTC (permalink / raw)
  To: Syam Sidhardhan; +Cc: netdev, syamsidhardh, David Miller, dev@openvswitch.org
In-Reply-To: <1420815381-29973-1-git-send-email-s.syam@samsung.com>

On Fri, Jan 9, 2015 at 6:56 AM, Syam Sidhardhan <s.syam@samsung.com> wrote:
> version.h inclusion is not necessary as detected by versioncheck.
>
> Signed-off-by: Syam Sidhardhan <s.syam@samsung.com>

You should use the net-next subject prefix for net-next tree patches.
Otherwise looks good.
Acked-by: Pravin B Shelar <pshelar@nicira.com>

> ---
>  net/openvswitch/vport-geneve.c |    2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
> index 347fa23..70e2aae 100644
> --- a/net/openvswitch/vport-geneve.c
> +++ b/net/openvswitch/vport-geneve.c
> @@ -9,8 +9,6 @@
>
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> -#include <linux/version.h>
> -
>  #include <linux/in.h>
>  #include <linux/ip.h>
>  #include <linux/net.h>
> --
> 1.7.9.5
>

^ permalink raw reply

* Re: [PATCH] brcm80211: brcmsmac: dma: Remove some unused functions
From: Rickard Strandqvist @ 2015-01-09 17:58 UTC (permalink / raw)
  To: Arend van Spriel
  Cc: Julia Lawall, Kalle Valo, Larry Finger, Brett Rudley,
	Hante Meuleman, Fabian Frederick,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	brcm80211-dev-list-dY08KVG/lbpWk0Htik3J/w, Network Development,
	Linux Kernel Mailing List
In-Reply-To: <54ACF523.2030706-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

2015-01-07 9:58 GMT+01:00 Arend van Spriel <arend-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>:
> On 01/07/15 07:29, Julia Lawall wrote:
>>
>>
>>
>> On Wed, 7 Jan 2015, Rickard Strandqvist wrote:
>>
>>> 2015-01-05 12:06 GMT+01:00 Arend van Spriel<arend-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>:
>>>>
>>>> On 01/05/15 11:49, Kalle Valo wrote:
>>>>>
>>>>>
>>>>> Rickard Strandqvist<rickard_strandqvist-IW2WV5XWFqGZkjO+N0TKoMugMpMbD5Xr@public.gmane.org>   writes:
>>>>>
>>>>>> As I hope you can see I have made some changes regarding the
>>>>>> subject-line. Thought it was an advantage to be able to see which file
>>>>>> I actually removed something from. There seems to have been a big focus
>>>>>> on getting the subject line right in recent weeks.
>>>>>>
>>>>>> I wonder why there is a script that takes a file name, and respond
>>>>>> with an appropriate subject line?
>>>>
>>>>
>>>>
>>>> Is there a script for this? Anyway, I would say driver name is enough.
>>>> Enough about the subject line ;-) I would like to give some general
>>>> remarks as you seem to touch a lot of kernel code. First off, I think it
>>>> is good to remove unused stuff. However, I would like some more
>>>> explanation of your methodology apart from "partially found by using a
>>>> static code analysis program". So a cover letter explaining that would
>>>> have been nice (maybe still is). Things like Kconfig options can affect
>>>> whether functions are used or not, so how did you cover that?
>>>>
>>>> Regards,
>>>> Arend
>>>>
>>>>
>>>>> I don't think you can really automate this as some drivers do this a
>>>>> bit differently. You always need to manually check the commit log.
>>>>>
>>>>>> But ok, I change my script accordingly. Should I submit the patch
>>>>>> again?
>>>>>
>>>>>
>>>>>
>>>>> Yes, please resubmit.
>>>>>
>>>>
>>>
>>> Hi Arend
>>>
>>> Yes, a script for that would have been excellent, I think!
>>> I have one as part of my git send-email script. Until a week ago, it
>>> was enough to remove the "drivers/" prefix and change every "/" to ": ".
>>> I have now expanded my sed pipe a lot (tell me if anyone is
>>> interested).
>>> But now I've seen everything from uppercase to [DIR], etc.
>>> So I cannot understand how anyone is supposed to get the name right
>>> without good help.
>>>
>>> Sure, I would like to share how I use cppcheck, but I am very hesitant
>>> to write this in each patch mail I send!
>>>
>>> I run:
>>> cppcheck --force --quiet --enable=all .
>>>
>>> Or a specific file instead of .
>>>
>>> Among other things, this produces a lot of messages like the following,
>>> 4000+ for the kernel:
>>> (style) The function 'xxx' is never used
>>>
>>> For these I made a script that searched through all the files for
>>> the function name (cppcheck missed a few). I save the rest so I can go
>>> through them and possibly send patches.
>>
>>
>> I think that the question was about what methodology is cppcheck using to
>> find the given issue.  But probably cppcheck is a black box that does
>> whatever it does, so the user doesn't know what the rationale is.
>
>
> That would have been nice, but I also wanted to know what his subsequent
> steps were to validate the output from cppcheck. I went through some
> cppcheck web pages, but they only elaborate on what it can do and not on
> how. But hey, it is an open-source tool, so there is always the code to
> check.
>
>> However, I think you mentioned that cppcheck found only some of the
>> issues.  You could thus describe what was the methodology for finding the
>> other ones.
>
>
> Maybe removing an unused function had a ripple effect, making others
> unused as well? Still, this is speculation, and with this kind of
> cleanup effort all over the place it is better to review the methodology.
>
> Regards,
> Arend
>
>> julia


Hi all

Julia, cppcheck is a GPL project:
http://sourceforge.net/projects/cppcheck/


Arend,
I ran cppcheck with all options in the Linux root, and then used grep
to pick out what I was interested in.  I agree that there is a lack of
documentation, unfortunately.

More exactly, this is how I did it: I searched with grep for the
4000 functions and put the results in a lot of files. These were input to
a script that opened a file editor, where I did a visual overview of every
place the function was found. Several of them turned out to be used, for
example directly in assembler code. Recently I have also started running
git blame on the file to see how old the code is.
Then I made the choice to remove or not.

Hope this was clear enough :)


Kind regards
Rickard Strandqvist

^ permalink raw reply

* Re: [PATCH 2/6] vxlan: Group Policy extension
From: Alexei Starovoitov @ 2015-01-09 17:37 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Jesse Gross, Stephen Hemminger, Pravin Shelar,
	Tom Herbert, netdev@vger.kernel.org, dev@openvswitch.org
In-Reply-To: <7339e3bff124cecaf65cd04ea9bdc973c730ba34.1420756324.git.tgraf@suug.ch>

On Thu, Jan 8, 2015 at 2:47 PM, Thomas Graf <tgraf@suug.ch> wrote:
> +
> +struct vxlan_gbp {
> +} __packed;

empty struct ? seems unused.
looks good to me otherwise.

^ permalink raw reply

* Re: [PATCH v2 1/3] dtb: xgene: fix: Backward compatibility with older firmware
From: Ian Campbell @ 2015-01-09 17:36 UTC (permalink / raw)
  To: Iyappan Subramanian
  Cc: devicetree, netdev, patches, kchudgar, davem, linux-arm-kernel
In-Reply-To: <1420819166.21186.38.camel@citrix.com>

On Fri, 2015-01-09 at 15:59 +0000, Ian Campbell wrote:
> I'm running mustang_sw_1.13.29-beta, using the mustang_tianocore_ubt.fd
> method to launch from u-boot.

I realised shortly after that I had downloaded 1.14.14 too, but it
behaves the same.

I rebuilt mustang_tianocore_ubt.fd with the first hunk of this patch and
things now work.

Ian.

^ permalink raw reply

* [PATCH iproute2] ip link: Fix crash on older kernels when show VF dev
From: Vadim Kochan @ 2015-01-09 17:25 UTC (permalink / raw)
  To: netdev; +Cc: william, Vadim Kochan

From: Vadim Kochan <vadim4j@gmail.com>

The crash occurred because the IFLA_VF_RATE attribute (struct ifla_vf_rate)
is not sent by older kernels. Check that the nested attr exists before
reading it.

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Reported-by: William Dauchy <william@gandi.net>
Tested-by: William Dauchy <william@gandi.net>
---
 ip/ipaddress.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 28dfe8c..830b166 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -259,11 +259,10 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 {
 	struct ifla_vf_mac *vf_mac;
 	struct ifla_vf_vlan *vf_vlan;
-	struct ifla_vf_rate *vf_rate;
 	struct ifla_vf_tx_rate *vf_tx_rate;
 	struct ifla_vf_spoofchk *vf_spoofchk;
 	struct ifla_vf_link_state *vf_linkstate;
-	struct rtattr *vf[IFLA_VF_MAX+1];
+	struct rtattr *vf[IFLA_VF_MAX + 1] = {};
 	struct rtattr *tmp;
 	SPRINT_BUF(b1);
 
@@ -277,7 +276,6 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 	vf_mac = RTA_DATA(vf[IFLA_VF_MAC]);
 	vf_vlan = RTA_DATA(vf[IFLA_VF_VLAN]);
 	vf_tx_rate = RTA_DATA(vf[IFLA_VF_TX_RATE]);
-	vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
 
 	/* Check if the spoof checking vf info type is supported by
 	 * this kernel.
@@ -313,10 +311,16 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 		fprintf(fp, ", qos %d", vf_vlan->qos);
 	if (vf_tx_rate->rate)
 		fprintf(fp, ", tx rate %d (Mbps)", vf_tx_rate->rate);
-	if (vf_rate->max_tx_rate)
-		fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
-	if (vf_rate->min_tx_rate)
-		fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+
+	if (vf[IFLA_VF_RATE]) {
+		struct ifla_vf_rate *vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
+
+		if (vf_rate->max_tx_rate)
+			fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
+		if (vf_rate->min_tx_rate)
+			fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+	}
+
 	if (vf_spoofchk && vf_spoofchk->setting != -1) {
 		if (vf_spoofchk->setting)
 			fprintf(fp, ", spoof checking on");
-- 
2.1.3

^ permalink raw reply related
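
The fix in the patch above follows a common pattern for optional attributes:
zero-initialise the lookup table, then test the slot before dereferencing it.
A minimal, self-contained sketch of that pattern follows. It does not use
the real rtnetlink helpers; every demo_* and DEMO_* name, and the TLV layout
itself, is a hypothetical stand-in chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types, standing in for IFLA_VF_*. */
enum { DEMO_ATTR_UNSPEC, DEMO_ATTR_TX_RATE, DEMO_ATTR_RATE,
       DEMO_ATTR_MAX = DEMO_ATTR_RATE };

/* Hypothetical TLV header: total length (header + payload), then type. */
struct demo_attr { uint16_t len; uint16_t type; };
struct demo_rate { uint32_t min_tx_rate; uint32_t max_tx_rate; };

#define DEMO_ALIGN(n) (((n) + 3u) & ~3u)
#define DEMO_HDRLEN   DEMO_ALIGN(sizeof(struct demo_attr))

/* Append one attribute to buf; returns the aligned number of bytes written. */
static size_t demo_put(unsigned char *buf, uint16_t type,
		       const void *data, uint16_t dlen)
{
	struct demo_attr hdr = { (uint16_t)(DEMO_HDRLEN + dlen), type };

	assert(buf != NULL && (data != NULL || dlen == 0));
	memset(buf, 0, DEMO_ALIGN(hdr.len));
	memcpy(buf, &hdr, sizeof(hdr));
	if (dlen)
		memcpy(buf + DEMO_HDRLEN, data, dlen);
	return DEMO_ALIGN(hdr.len);
}

/* Record a pointer to each attribute found; slots for absent types are left
 * untouched, so the caller must have zero-initialised tb[]. */
static void demo_parse(const unsigned char **tb, int max,
		       const unsigned char *buf, size_t len)
{
	while (len >= sizeof(struct demo_attr)) {
		struct demo_attr hdr;
		size_t step;

		memcpy(&hdr, buf, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > len)
			break;			/* malformed or truncated */
		if (hdr.type <= max)
			tb[hdr.type] = buf;
		step = DEMO_ALIGN(hdr.len);
		if (step > len)
			break;
		buf += step;
		len -= step;
	}
}

/* Returns max_tx_rate, or 0 when the rate attribute is absent, which is
 * what a message from an "older kernel" looks like in this sketch. */
static uint32_t demo_max_tx_rate(const unsigned char *buf, size_t len)
{
	const unsigned char *tb[DEMO_ATTR_MAX + 1] = { NULL };
	struct demo_attr hdr;
	struct demo_rate rate;

	demo_parse(tb, DEMO_ATTR_MAX, buf, len);
	if (!tb[DEMO_ATTR_RATE])		/* the presence check */
		return 0;
	memcpy(&hdr, tb[DEMO_ATTR_RATE], sizeof(hdr));
	if (hdr.len < DEMO_HDRLEN + sizeof(rate))
		return 0;			/* payload too short */
	memcpy(&rate, tb[DEMO_ATTR_RATE] + DEMO_HDRLEN, sizeof(rate));
	return rate.max_tx_rate;
}
```

Building a buffer with only DEMO_ATTR_TX_RATE via demo_put() and passing it
to demo_max_tx_rate() takes the early-return path instead of reading through
a NULL pointer.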

* [PATCH net-next RFC 5/5] net-timestamp: tx timestamping default mode flag
From: Willem de Bruijn @ 2015-01-09 17:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, richardcochran, eric.dumazet, luto, Willem de Bruijn
In-Reply-To: <1420824719-28848-1-git-send-email-willemb@google.com>

From: Willem de Bruijn <willemb@google.com>

The number of timestamping points along the transmit path has grown,
as have the options. Preferred behavior is to request timestamps with
ID, without data (which enables batching) and for all supported
timestamp points. Define a short option that enables all these
defaults.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/uapi/linux/net_tstamp.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 6d1abea..c371ce2 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -27,6 +27,14 @@ enum {
 	SOF_TIMESTAMPING_OPT_TSONLY = (1<<11),
 
 	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_TSONLY,
+
+	SOF_TIMESTAMPING_TX = SOF_TIMESTAMPING_TX_HARDWARE |
+			      SOF_TIMESTAMPING_TX_SOFTWARE |
+			      SOF_TIMESTAMPING_TX_SCHED |
+			      SOF_TIMESTAMPING_TX_ACK |
+			      SOF_TIMESTAMPING_OPT_ID |
+			      SOF_TIMESTAMPING_OPT_TSONLY,
+
 	SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
 				 SOF_TIMESTAMPING_LAST
 };
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related
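
To see what the proposed composite flag amounts to numerically, here is a
small sketch that mirrors the enum from the patch above. Only the value of
SOF_TIMESTAMPING_OPT_TSONLY (1<<11) appears in the patch context; the other
bit positions are an assumption based on the uapi header of that era, and
all DEMO_* names are local stand-ins so the sketch builds without kernel
headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Local mirror of the SOF_TIMESTAMPING_* bits. Bit positions other than
 * OPT_TSONLY are assumed, not taken from the patch. */
enum {
	DEMO_TX_HARDWARE = 1 << 0,
	DEMO_TX_SOFTWARE = 1 << 1,
	DEMO_OPT_ID      = 1 << 7,
	DEMO_TX_SCHED    = 1 << 8,
	DEMO_TX_ACK      = 1 << 9,
	DEMO_OPT_TSONLY  = 1 << 11,

	DEMO_LAST = DEMO_OPT_TSONLY,

	/* Composite "default mode" flag, as proposed in the RFC. */
	DEMO_TX = DEMO_TX_HARDWARE | DEMO_TX_SOFTWARE |
		  DEMO_TX_SCHED | DEMO_TX_ACK |
		  DEMO_OPT_ID | DEMO_OPT_TSONLY,

	/* Same construction as SOF_TIMESTAMPING_MASK: every bit up to LAST. */
	DEMO_MASK = (DEMO_LAST - 1) | DEMO_LAST
};

/* True when flags contain no bits outside the known mask. */
static bool demo_flags_valid(unsigned int flags)
{
	return (flags & ~(unsigned int)DEMO_MASK) == 0;
}

/* True when every bit of the composite flag is set. */
static bool demo_wants_all_tx(unsigned int flags)
{
	return (flags & DEMO_TX) == DEMO_TX;
}
```

Because the composite only combines existing bits, it stays inside the mask
and does not change what LAST or MASK evaluate to. Testing for it needs an
equality check as in demo_wants_all_tx(), since a plain bitwise AND is
nonzero as soon as any single member bit is set.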

* [PATCH net-next RFC 4/5] net-timestamp: tx timestamp cookies
From: Willem de Bruijn @ 2015-01-09 17:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, richardcochran, eric.dumazet, luto, Willem de Bruijn
In-Reply-To: <1420824719-28848-1-git-send-email-willemb@google.com>

From: Willem de Bruijn <willemb@google.com>

Support looping multiple timestamps on top of a single skb on the
error queue.

Tx timestamps are returned on top of an skb. TCP timestamping and
other timestamp points enabled multiple timestamps for each buffer
passed in send. Due to retransmissions, this number may be high,
using lots of SO_RCVBUF space and kernel mode switches.

When returning without payload (SOF_TIMESTAMPING_OPT_TSONLY), the
total truesize is smaller, but still O(n). Without payload, the
constraint that a timestamp belongs to a specific skb also goes
away.

Instead of queuing multiple skbs onto the error queue, queue
successive timestamps onto the skb on top of the error queue.
For this purpose, introduce a timestamp cookie and use a list
of cookies instead of skb->tstamp.

The number of batched cookies is limited by having sends fail
with EAGAIN or ENOMSG as soon as a single packet is waiting on
the receive queue. If merging this functionality, a TODO is to
add a hard cap, so that processes can estimate the maximum
msg_controllen needed to read all timestamps.

The implementation returns the same structures as before, that is,
one struct sock_extended_err and one struct scm_timestamping for
each timestamp. The list is returned in reverse chronological
order: newest first. This choice is partially determined by the
callers (e.g., ip_recv_error) generating the final sock_extended_err.

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h        |  12 +++++
 include/net/sock.h            |   3 +-
 include/uapi/linux/errqueue.h |   1 +
 net/core/skbuff.c             | 104 ++++++++++++++++++++++++++++++++++++------
 net/socket.c                  |  64 ++++++++++++++++++++++++--
 5 files changed, 167 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 85ab7d7..6d77b51 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -298,6 +298,13 @@ struct ubuf_info {
 	unsigned long desc;
 };
 
+struct skb_tstamp_cookie {
+	u32 tskey;
+	u32 tstype;
+	ktime_t tstamp;
+	struct skb_tstamp_cookie *next;
+};
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -442,6 +449,8 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
  *	@next: Next buffer in list
  *	@prev: Previous buffer in list
  *	@tstamp: Time we arrived/left
+ *	@skb_mstamp: tstamp variant used only within the TCP stack
+ *	@tscookies: tstamp variant used only with no-payload errqueue packets
  *	@rbnode: RB tree node, alternative to next/prev for netem/tcp
  *	@sk: Socket we are owned by
  *	@dev: Device we arrived on/are leaving by
@@ -516,6 +525,7 @@ struct sk_buff {
 			union {
 				ktime_t		tstamp;
 				struct skb_mstamp skb_mstamp;
+				struct skb_tstamp_cookie *tscookies;
 			};
 		};
 		struct rb_node	rbnode; /* used in netem & tcp stack */
@@ -2861,6 +2871,8 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		     struct skb_shared_hwtstamps *hwtstamps,
 		     struct sock *sk, int tstype);
 
+bool skb_has_tscookies(struct sk_buff *skb);
+
 /**
  * skb_tstamp_tx - queue clone of skb with send time stamps
  * @orig_skb:	the original outgoing packet
diff --git a/include/net/sock.h b/include/net/sock.h
index 9729171..de190d8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2149,7 +2149,8 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
 	 */
 	if (sock_flag(sk, SOCK_RCVTSTAMP) ||
 	    (sk->sk_tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) ||
-	    (kt.tv64 && sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
+	    ((kt.tv64 || skb_has_tscookies(skb)) &&
+	     sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
 	    (hwtstamps->hwtstamp.tv64 &&
 	     (sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE)))
 		__sock_recv_timestamp(msg, sk, skb);
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1..ab67bf0 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -41,6 +41,7 @@ enum {
 	SCM_TSTAMP_SND,		/* driver passed skb to NIC, or HW */
 	SCM_TSTAMP_SCHED,	/* data entered the packet scheduler */
 	SCM_TSTAMP_ACK,		/* data acknowledged by peer */
+	SCM_TSTAMP_HW,		/* internal use: HW generated */
 };
 
 #endif /* _UAPI_LINUX_ERRQUEUE_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e5f4c06..c41597f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3581,6 +3581,19 @@ int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer)
 }
 EXPORT_SYMBOL_GPL(skb_cow_data);
 
+static void skb_destructor_tscookies(struct sk_buff *skb)
+{
+	struct skb_tstamp_cookie *prev, *cur = skb->tscookies;
+
+	while (cur) {
+		prev = cur;
+		cur = cur->next;
+		kfree(prev);
+	}
+	skb->tscookies = NULL;
+	skb->destructor = NULL;
+}
+
 static void sock_rmem_free(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
@@ -3588,6 +3601,12 @@ static void sock_rmem_free(struct sk_buff *skb)
 	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
 }
 
+static void sock_rmem_free_tscookies(struct sk_buff *skb)
+{
+	skb_destructor_tscookies(skb);
+	sock_rmem_free(skb);
+}
+
 /*
  * Note: We dont mem charge error packets (no sk_forward_alloc changes)
  */
@@ -3597,9 +3616,13 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
 	    (unsigned int)sk->sk_rcvbuf)
 		return -ENOMEM;
 
-	skb_orphan(skb);
+	if (skb_has_tscookies(skb)) {
+		skb->destructor = sock_rmem_free_tscookies;
+	} else {
+		skb_orphan(skb);
+		skb->destructor = sock_rmem_free;
+	}
 	skb->sk = sk;
-	skb->destructor = sock_rmem_free;
 	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
 
 	/* before exiting rcu section, make sure dst is refcounted */
@@ -3666,23 +3689,78 @@ struct sk_buff *skb_clone_sk(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(skb_clone_sk);
 
-static void __skb_complete_tx_timestamp(struct sk_buff *skb,
-					struct sock *sk,
-					int tstype)
+bool skb_has_tscookies(struct sk_buff *skb)
+{
+	return skb->destructor == skb_destructor_tscookies ||
+	       skb->destructor == sock_rmem_free_tscookies;
+}
+EXPORT_SYMBOL(skb_has_tscookies);
+
+static bool __skb_queue_tstamp_cookie(struct sk_buff *skb, struct sock *sk,
+				      int tstype, u32 tskey, bool is_hw)
+{
+	struct sk_buff_head *q = &sk->sk_error_queue;
+	struct skb_tstamp_cookie *new;
+	struct sk_buff *qskb;
+	unsigned long flags;
+	bool queued = false;
+
+	if (skb->destructor)
+		return false;
+
+	new = kzalloc(sizeof(*new), GFP_ATOMIC);
+	if (!new)
+		return false;
+
+	new->tskey = tskey;
+	if (unlikely(is_hw)) {
+		new->tstype = SCM_TSTAMP_HW;
+		new->tstamp = skb_hwtstamps(skb)->hwtstamp;
+	} else {
+		new->tstype = tstype;
+		new->tstamp = skb->tstamp;
+	}
+
+	spin_lock_irqsave(&q->lock, flags);
+	qskb = skb_peek(&sk->sk_error_queue);
+	if (qskb && skb_has_tscookies(qskb)) {
+		new->next = qskb->tscookies;
+		qskb->tscookies = new;
+		queued = true;
+	}
+	spin_unlock_irqrestore(&q->lock, flags);
+	if (queued) {
+		consume_skb(skb);
+		return true;
+	}
+
+	skb->tscookies = new;
+	skb->destructor = skb_destructor_tscookies;
+	return false;
+}
+
+static void __skb_complete_tx_timestamp(struct sk_buff *skb, struct sock *sk,
+					int tstype, bool is_hw)
 {
 	struct sock_exterr_skb *serr;
-	int err;
+	int err, tskey = 0;
+
+	if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) {
+		tskey = skb_shinfo(skb)->tskey;
+		if (sk->sk_protocol == IPPROTO_TCP)
+			tskey -= sk->sk_tskey;
+	}
+
+	if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_TSONLY &&
+	    __skb_queue_tstamp_cookie(skb, sk, tstype, tskey, is_hw))
+		return;
 
 	serr = SKB_EXT_ERR(skb);
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = ENOMSG;
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
 	serr->ee.ee_info = tstype;
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) {
-		serr->ee.ee_data = skb_shinfo(skb)->tskey;
-		if (sk->sk_protocol == IPPROTO_TCP)
-			serr->ee.ee_data -= sk->sk_tskey;
-	}
+	serr->ee.ee_data = tskey;
 
 	err = sock_queue_err_skb(sk, skb);
 
@@ -3708,7 +3786,7 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
 	sock_hold(sk);
 
 	*skb_hwtstamps(skb) = *hwtstamps;
-	__skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND);
+	__skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND, true);
 
 	sock_put(sk);
 }
@@ -3741,7 +3819,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	else
 		skb->tstamp = ktime_get_real();
 
-	__skb_complete_tx_timestamp(skb, sk, tstype);
+	__skb_complete_tx_timestamp(skb, sk, tstype, hwtstamps);
 }
 EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
 
diff --git a/net/socket.c b/net/socket.c
index a2c33a4..6595108 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -676,9 +676,63 @@ int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(kernel_sendmsg);
 
-/*
- * called from sock_recv_timestamp() if sock_flag(sk, SOCK_RCVTSTAMP)
- */
+static bool __ts_allow_report(struct sock *sk, int tstype)
+{
+	if (tstype == SCM_TSTAMP_HW)
+		return sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE;
+	else
+		return sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE;
+}
+
+static void __ts_generate_serr(struct msghdr *msg, struct sock *sk,
+			       struct skb_tstamp_cookie *cur)
+{
+	struct sock_extended_err serr;
+
+	memset(&serr, 0, sizeof(serr));
+
+	serr.ee_errno = ENOMSG;
+	serr.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
+	serr.ee_data = cur->tskey;
+	serr.ee_info = cur->tstype;
+
+	/* work around legacy interface: HW reports SND with data in tss[2] */
+	if (serr.ee_info == SCM_TSTAMP_HW)
+		serr.ee_info = SCM_TSTAMP_SND;
+
+	if (sk->sk_family == AF_INET)
+		put_cmsg(msg, SOL_IP, IP_RECVERR, sizeof(serr), &serr);
+	else if (sk->sk_family == AF_INET6)
+		put_cmsg(msg, SOL_IPV6, IPV6_RECVERR, sizeof(serr), &serr);
+	else
+		net_warn_ratelimited("tscookie: unknown proto %x",
+				     sk->sk_family);
+}
+
+static void __ts_generate_tss(struct msghdr *msg, struct skb_tstamp_cookie *cur)
+{
+	struct scm_timestamping tss;
+	int idx = cur->tstype == SCM_TSTAMP_HW ? 2 : 0;
+
+	memset(&tss, 0, sizeof(tss));
+	tss.ts[idx] = ktime_to_timespec(cur->tstamp);
+	put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING, sizeof(tss), &tss);
+}
+
+static void __sock_recv_timestamp_cookies(struct msghdr *msg, struct sock *sk,
+					  struct skb_tstamp_cookie *cookie)
+{
+	while (cookie) {
+		if (__ts_allow_report(sk, cookie->tstype)) {
+			__ts_generate_tss(msg, cookie);
+			/* caller (e.g., ip_recv_error) generates last serr */
+			if (cookie->next)
+				__ts_generate_serr(msg, sk, cookie);
+		}
+		cookie = cookie->next;
+	}
+}
+
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	struct sk_buff *skb)
 {
@@ -688,6 +742,10 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	struct skb_shared_hwtstamps *shhwtstamps =
 		skb_hwtstamps(skb);
 
+	if (skb_has_tscookies(skb)) {
+		__sock_recv_timestamp_cookies(msg, sk, skb->tscookies);
+		return;
+	}
 	/* Race occurred between timestamp enabling and packet
 	   receiving.  Fill in the current time for now. */
 	if (need_software_tstamp && skb->tstamp.tv64 == 0)
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related
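
The "newest first" ordering described in the commit message above falls out
of prepending each new cookie to a singly linked list. A userspace sketch
of that behaviour, with no locking and no skb involved, is below. The
demo_* names are hypothetical, and the struct only loosely mirrors
skb_tstamp_cookie, with a plain nanosecond counter in place of ktime_t.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_cookie {
	uint32_t tskey;
	uint32_t tstype;
	int64_t tstamp_ns;		/* stand-in for ktime_t */
	struct demo_cookie *next;
};

/* Prepend a cookie to the list; returns 0 on success, -1 if out of memory. */
static int demo_cookie_push(struct demo_cookie **head, uint32_t tskey,
			    uint32_t tstype, int64_t tstamp_ns)
{
	struct demo_cookie *c = calloc(1, sizeof(*c));

	if (!c)
		return -1;
	c->tskey = tskey;
	c->tstype = tstype;
	c->tstamp_ns = tstamp_ns;
	c->next = *head;		/* new entry becomes the head */
	*head = c;
	return 0;
}

/* Copy up to max keys in list order (newest first); returns the count. */
static size_t demo_cookie_keys(const struct demo_cookie *head,
			       uint32_t *out, size_t max)
{
	size_t n = 0;

	for (; head && n < max; head = head->next)
		out[n++] = head->tskey;
	return n;
}

/* Free every cookie and reset the head, like the destructor in the patch. */
static void demo_cookie_free(struct demo_cookie **head)
{
	struct demo_cookie *cur = *head;

	while (cur) {
		struct demo_cookie *prev = cur;

		cur = cur->next;
		free(prev);
	}
	*head = NULL;
}
```

Pushing keys 1, 2, 3 in that order and then walking the list yields 3, 2, 1,
so a reader that wants chronological order has to reverse what it receives.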
