Netdev List


* Re: [net-next PATCH v1 00/11] A flow API
From: John Fastabend @ 2015-01-09 18:27 UTC
  To: Jamal Hadi Salim
  Cc: tgraf, sfeldma, jiri, simon.horman, netdev, davem, andy,
	Shrijeet Mukherjee
In-Reply-To: <54ABD3D1.6020608@mojatatu.com>

On 01/06/2015 04:23 AM, Jamal Hadi Salim wrote:
> John,
>
> There are a lot of things to digest in your posting - I am interested
> in commenting on many things but feel the need to pay attention to details
> in general given the importance of this interface (and conference is
> chewing my netdev time at the moment). I need to actually sit down
> and stare at code and documentation.

Any additional feedback would be great. Sorry, I tried to be concise
but this email got fairly long regardless. Also, I delayed the response
a few days as I mulled over some of it.

>
> I do think we need to have this discussion as part of the BOF
> Shrijeet is running at netdev01.

Maybe I was a bit ambitious thinking we could get this merged
by then? Maybe I can resolve concerns via email ;) What I wanted
to discuss at netdev01 was specifically the mapping between
software models and hardware model as exposed by this series.
I see value in doing this in user space for some consumers, OVS for
example, which is why the UAPI is there to support this.

Also I think in-kernel users are interesting as well and 'tc'
is a reasonable candidate to try and offload from in the kernel IMO.

>
> General comments:
> 1) one of the things that i have learnt over time is that not
> everything that sits or is abstracted from hardware is a table.
> You could have structs or simple scalars for config or runtime
> control. How does what you are proposing here allow to express that?
> I dont think you'd need it for simple things but if you dont allow
> for it you run into the square-hole-round-peg syndrome of "yeah
> i can express that u32 variable as a single table with a single row
> and a single column" ;-> or "you need another infrastructure for
> that single scalar u32"

The interface (both UAPI and kernel API) deals exclusively with the
flow table pipeline at the moment. I've allowed for table attributes
which allows you to give tables characteristics. Right now it only
supports basic attribs like ingress_root and egress_root but I have
some work not in this series to allow tables to be dynamic
(allocated/freed) at runtime. More attributes could be added as needed
here. But this still only covers tables.

I agree there are other things besides tables, of course. First thing
that comes to mind for me is queues and QOS. How do we model these?
My take is you add another object type, call it QUEUE, and use a
'struct net_flow_queue' to model queues. Queues then have attributes
as well like length, QOS policies, etc. I would call this extending the
infrastructure, not creating another one :). Maybe my naming it
'net_flow' is not ideal. With a queue structure I can connect queues
and tables together with an enqueue action. That would be one example;
I can generate more: encrypt operations, etc.
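
A minimal sketch of that idea in userspace C. Everything in it is
hypothetical: the series defines no queue object, so the layout of
'struct net_flow_queue', the attribute names and the enqueue action
encoding are illustrative assumptions only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue object; not part of the posted series. */
struct net_flow_queue {
	uint32_t uid;       /* queue identifier */
	uint32_t length;    /* queue depth in packets */
	uint32_t qos_class; /* placeholder for a QOS policy reference */
};

/* Hypothetical action: a table rule hands matching packets to a queue. */
struct nf_enqueue_action {
	uint32_t table_uid; /* table the rule lives in */
	uint32_t queue_uid; /* destination queue */
};

/* Resolve the queue an enqueue action points at; NULL if it is unknown. */
const struct net_flow_queue *
nf_resolve_queue(const struct net_flow_queue *queues, size_t n,
		 const struct nf_enqueue_action *act)
{
	for (size_t i = 0; i < n; i++)
		if (queues[i].uid == act->queue_uid)
			return &queues[i];
	return NULL;
}
```

The point is only that a queue becomes one more object with a uid and
attributes, and an action references it by uid the same way a goto
action would reference a table.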

FWIW queues and QOS to me fit nicely into the existing infrastructure
and it may be easier to utilize the existing 'tc' UAPI for this.

In this series I just want to get the flow table piece down though.

>
> 2) So i understood the sense of replacing ethtool for classifier
> access with a direct interface mostly because thats what it was
> already doing - but i am not sure why you need
> it for a generic interface. Am i mistaken you are providing direct
> access to hardware from user space? Would this make essentially
> the Linux infrastructure a bypass (which vendors and their SDKs
> love)? IMHO, a good example is to pick something like netfilter
> or tc-filters and show how that is offloaded. This keeps it in
> the same spirit as what we are shooting for in L2/3 at the moment.
>

I'll try to knock these off one by one:

Yes we are providing an interface for userspace to interrogate the
hardware and program it. My take on this is even if you embed this
into another netlink family (OVS, NFT, TCA) you end up with the same
operations w.r.t. table support (a) query hardware for
resources/constraints/etc and (b) an API to add/del rules in those
tables. It seems the intersection of these features with existing
netlink families is fairly small so I opted to create a new family.
The underlying hardware offload mechanisms in flow_table.c here could
be used by in-kernel consumers as well as user space. For some
consumers ('tc' perhaps) this makes good sense; for others (OVS)
it does not IMO.

Direct access to the hardware? Hmm, not so sure about that. It's an
abstraction layer so I can talk to _any_ hardware device using the
same semantics. But yes, at the bottom of the interface there is
hardware. Although this provides a "raw" interface for userspace to
inspect and program the hardware, it equally provides an API for
in-kernel consumers to use the hardware offload APIs. For
example, if you want 'tc' to offload a queueing discipline with some
filters. For what it's worth I did some experimental work here and for
some basic cases it's possible to do this offload. I'll explore
this more as Jiri/you suggest.

Would this make essentially the Linux infrastructure a bypass? Hmm,
I'm not sure exactly what you mean here. If switching is done in
the ASIC then the dataplane is being bypassed. And I don't want
to couple management of the software dataplane with management of the
hardware dataplane. It would be valid to have these dataplanes
running two completely different pipelines/network functions. So I
assume you mean does this API bypass the existing Linux control plane
infrastructure for software dataplanes. I'll say tentatively
yes it does. But in many cases my goal is to unify them in userspace
where it is easier to make policy decisions. For OVS, NFT it
seems to me that user space libraries can handle the unification
of hardware/software dataplanes. Further I think it is the correct
place to unify the dataplanes. I don't want to encode complex
policies into the kernel. Even if you embed the netlink UAPI into
another netlink family the semantics look the same.

To address how to offload existing infrastructures, I'll try to
explain my ideas for each subsystem.

I looked into using netfilter but really didn't get much traction
in the existing infrastructure. The trouble is that nft wants to use
expressions like payload that have registers, base, offset, len in
the kernel, but the hardware (again, at least all the hardware I'm
working with) doesn't work with these semantics; it needs a field-id,
possibly the logical operation to use, and the value to match. Yes, I can
map base/offset/len to a field_id but what do I do with register? And
this sort of complication continues with most of the other expressions.
I could write a new expression that was primarily used by hardware
but could have a software user as well but I'm not convinced we would
ever use it in software when we already have the functionally more
generic expressions. To me this looks like a somewhat arbitrary
embedding into netfilter uapi where the gain of doing this is not
entirely clear to me.
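
To make the base/offset/len mismatch concrete, here is a rough sketch
of the mapping step. The enum names and the table are hypothetical,
not taken from nft or from this series; only the byte offsets are the
standard Ethernet, IPv4 and TCP header layouts. Note that the register
operand has no input here at all, because there is nothing on the
hardware side to map it to.

```c
#include <stddef.h>
#include <stdint.h>

/* Header bases, loosely modelled on a payload expression's base. */
enum payload_base { BASE_LL, BASE_NETWORK, BASE_TRANSPORT };

/* Hypothetical hardware field identifiers. */
enum hw_field_id {
	HW_FIELD_NONE = 0,
	HW_FIELD_ETH_DST,
	HW_FIELD_IPV4_SRC,
	HW_FIELD_IPV4_DST,
	HW_FIELD_TCP_DPORT,
};

struct field_map {
	enum payload_base base;
	uint32_t offset; /* bytes from the start of the header */
	uint32_t len;    /* bytes */
	enum hw_field_id id;
};

static const struct field_map field_table[] = {
	{ BASE_LL,         0, 6, HW_FIELD_ETH_DST },
	{ BASE_NETWORK,   12, 4, HW_FIELD_IPV4_SRC },
	{ BASE_NETWORK,   16, 4, HW_FIELD_IPV4_DST },
	{ BASE_TRANSPORT,  2, 2, HW_FIELD_TCP_DPORT },
};

/* Map (base, offset, len) to a field id; only exact whole fields match. */
enum hw_field_id payload_to_field(enum payload_base base,
				  uint32_t offset, uint32_t len)
{
	size_t n = sizeof(field_table) / sizeof(field_table[0]);

	for (size_t i = 0; i < n; i++)
		if (field_table[i].base == base &&
		    field_table[i].offset == offset &&
		    field_table[i].len == len)
			return field_table[i].id;
	return HW_FIELD_NONE;
}
```

Anything that does not line up with a whole known field, say two bytes
in the middle of an address, comes back as HW_FIELD_NONE, which is
where a generic payload expression and a field-based device part ways.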

OVS would seem to have similar trouble: all the policy is in user
space. And the netlink UAPI is tuned for OVS; we don't want to start
adding/removing bits to support a hardware API where very little of it
would be used in the software only case and, vice versa, very little of
the OVS uapi messages as they exist today would be sufficient for the
hardware API. My point of view is the intersection is small enough here
that it's easier to write a clean API that stands on its own than try
to sync these hardware offload operations into the OVS UAPI. Further
OVS is very specific about what fields/tables it supports in its current
version and I don't want to force hardware into this model.

And finally 'tc'. Filters today can only be attached to qdiscs, which
are bound to net_devices. So the model is netdevs have queues, queues
have a qdisc association and qdiscs have filters. Here we are
modelling a pipeline in hardware associated with a set of ports.
The model is slightly different: we have queues that dequeue into
an ingress table and an egress table that enqueues packets into queues.
Queues may or may not be bound to the same port. Yes I know 'tc' can
forward to ports but it has no notion of a global table space.

We could build a new 'tc' filter that loaded the hardware tables and
then added rules or deleted rules via hardware api but we would need
some new mechanics to get out the capabilities/resources. Basically
the same set of operations supported in the UAPI of this series. This
would end up IMO to be basically this series only embedded in the TCA_
family with a new filter kind. But then what do we attach it to? Not
a specific qdisc because it is associated with a set of qdiscs. And
additionally why would we use this qdisc*/hw-filter in software when
we already have u32 and bpf? IMO 'tc' is about per port(queues) QOS
and filters/actions to support this. That said, I actually see offloading
'tc' qdisc/filters on the ports into the hardware as being useful,
using the operations added in this series to flow_table.c. See
my response to Jiri noting I'll go ahead and try to get this working.
OTOH I still think you need the UAPI proposed in this series for other
consumers.

Maybe I need to be enlightened, but I thought for a bit about some grand
unification of ovs, bridge, tc, netlink, et al., but that seems like
an entirely different scope of project. (side note: filters/actions
are no longer locked by qdisc and could stand on their own) My thoughts
on this are not yet organized.

> Anyways I apologize i havent spent as much time (holiday period
> wasnt good for me and netdev01 is picking up and consuming my time
> but i will try my best to respond and comment with some latency)
>

Great, thanks. Maybe this will give you more to mull over. If it's
clear as mud let me know and I'll draw up some pictures. Likely
need to do that regardless. Bottom line: I think the proposed API
here solves a real need.

Thanks!
John

> cheers,
> jamal
>
> On 12/31/14 14:45, John Fastabend wrote:
>> So... I could continue to mull over this and tweak bits and pieces
>> here and there but I decided it's best to get a wider group of folks
>> looking at it and hopefully, with any luck, using it, so here it is.
>>
>> This set creates a new netlink family and set of messages to configure
>> flow tables in hardware. I tried to make the commit messages
>> reasonably verbose at least in the flow_table patches.
>>
>> What we get at the end of this series is a working API to get device
>> capabilities and program flows using the rocker switch.
>>
>> I created a user space tool 'flow' that I use to configure and query
>> the devices; it is posted here,
>>
>>     https://github.com/jrfastab/iprotue2-flow-tool
>>
>> For now it is a stand-alone tool but once the kernel bits get sorted
>> out (I'm guessing there will need to be a few versions of this series
>> to get it right) I would like to port it into the iproute2 package.
>> This way we can keep all of our tooling in one package; see 'bridge'
>> for example.
>>
>> As far as testing, I've tested various combinations of tables and
>> rules on the rocker switch and it seems to work. I have not tested
>> 100% of the rocker code paths though. It would be great to get some
>> sort of automated framework around the API to do this. I don't
>> think this should gate the inclusion of the API though.
>>
>> I could use some help reviewing,
>>
>>    (a) error paths and netlink validation code paths
>>
>>    (b) Break down of structures vs netlink attributes. I
>>        am trying to balance flexibility given by having
>>        netlink TLV attributes vs conciseness. So some
>>        things are passed as structures.
>>
>>    (c) Are there any devices that have pipelines that we
>>        can't represent with this API? It would be good to
>>        know about these so we can design them in, probably
>>        in a future series.
>>
>> For some examples and maybe a bit more illustrative description I
>> posted a quickly typed up set of notes on github io pages. Here we
>> can show the description along with images produced by the flow tool
>> showing the pipeline. Once we settle a bit more on the API we should
>> probably do a clean up of this and other threads happening and commit
>> something to the Documentation directory.
>>
>>   http://jrfastab.github.io/jekyll/update/2014/12/21/flow-api.html
>>
>> Finally I have more patches to add support for creating and destroying
>> tables. This allows users to define the pipeline at runtime rather
>> than statically as rocker does now. After this set gets some traction
>> I'll look at pushing them in a next round. However it likely requires
>> adding another "world" to rocker. Another piece that I want to add is
>> a description of the actions and metadata. This way user space can
>> "learn" what an action is and how metadata interacts with the system.
>> This work is under development.
>>
>> Thanks! Any comments/feedback always welcome.
>>
>> And also thanks to everyone who helped with this flow API so far. All
>> the folks at Dusseldorf LPC, OVS summit Santa Clara, P4 authors for
>> some inspiration, the collection of IETF FoRCES documents I mulled
>> over, Netfilter workshop where I started to realize fixing ethtool
>> was most likely not going to work, etc.
>>
>> ---
>>
>> John Fastabend (11):
>>        net: flow_table: create interface for hw match/action tables
>>        net: flow_table: add flow, delete flow
>>        net: flow_table: add apply action argument to tables
>>        rocker: add pipeline model for rocker switch
>>        net: rocker: add set flow rules
>>        net: rocker: add group_id slices and drop explicit goto
>>        net: rocker: add multicast path to bridging
>>        net: rocker: add get flow API operation
>>        net: rocker: add cookie to group acls and use flow_id to set cookie
>>        net: rocker: have flow api calls set cookie value
>>        net: rocker: implement delete flow routine
>>
>>
>>   drivers/net/ethernet/rocker/rocker.c          | 1641 +++++++++++++++++++++++++
>>   drivers/net/ethernet/rocker/rocker_pipeline.h |  793 ++++++++++++
>>   include/linux/if_flow.h                       |  115 ++
>>   include/linux/netdevice.h                     |   20
>>   include/uapi/linux/if_flow.h                  |  413 ++++++
>>   net/Kconfig                                   |    7
>>   net/core/Makefile                             |    1
>>   net/core/flow_table.c                         | 1339 ++++++++++++++++++++
>>   8 files changed, 4312 insertions(+), 17 deletions(-)
>>   create mode 100644 drivers/net/ethernet/rocker/rocker_pipeline.h
>>   create mode 100644 include/linux/if_flow.h
>>   create mode 100644 include/uapi/linux/if_flow.h
>>   create mode 100644 net/core/flow_table.c
>>
>

-- 
John Fastabend         Intel Corporation


* Re: [PATCH RFC] pci: Control whether VFs are probed on pci_enable_sriov
From: Bjorn Helgaas @ 2015-01-09 18:25 UTC
  To: Eli Cohen
  Cc: davem, linux-pci, netdev, ogerlitz, yevgenyp, Eli Cohen,
	Donald Dutile
In-Reply-To: <1417957693-24979-1-git-send-email-eli@mellanox.com>

On Sun, Dec 07, 2014 at 03:08:13PM +0200, Eli Cohen wrote:
> Sometimes it is not desirable to probe the virtual functions right away,
> but rather leave the decision to the host's administrator.
> 
> This can save host side resource usage by VF instances which would be
> eventually probed to VMs.
> 
> Use a parameter to pci_enable_sriov to control that policy, and modify
> all current callers such that they retain the same functionality.
> 
> Use a one shot flag on struct pci_dev which is cleared after the
> first probe is ignored so subsequent attempts go through.
> 
> Cc: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Eli Cohen <eli@mellanox.com>

Seems like we never really reached a consensus here.  Please repost if you
want to continue down this path.

Bjorn

> ---
> This approach is used by the mlx5 driver SRIOV implementation, so
> sending this to get feedback from the PCI and networking folks.
> 
>  drivers/misc/genwqe/card_base.c                      |  2 +-
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c    |  2 +-
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c      |  2 +-
>  drivers/net/ethernet/cisco/enic/enic_main.c          |  2 +-
>  drivers/net/ethernet/emulex/benet/be_main.c          |  2 +-
>  drivers/net/ethernet/intel/fm10k/fm10k_iov.c         |  2 +-
>  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c   |  2 +-
>  drivers/net/ethernet/intel/igb/igb_main.c            |  2 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c       |  4 ++--
>  drivers/net/ethernet/mellanox/mlx4/main.c            |  2 +-
>  drivers/net/ethernet/neterion/vxge/vxge-main.c       |  2 +-
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c |  2 +-
>  drivers/net/ethernet/sfc/siena_sriov.c               |  2 +-
>  drivers/pci/iov.c                                    | 12 +++++++-----
>  drivers/pci/pci-driver.c                             | 11 ++++++++---
>  drivers/scsi/lpfc/lpfc_init.c                        |  2 +-
>  include/linux/pci.h                                  |  5 +++--
>  17 files changed, 33 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/misc/genwqe/card_base.c b/drivers/misc/genwqe/card_base.c
> index 4cf8f82cfca2..69253ca17506 100644
> --- a/drivers/misc/genwqe/card_base.c
> +++ b/drivers/misc/genwqe/card_base.c
> @@ -1325,7 +1325,7 @@ static int genwqe_sriov_configure(struct pci_dev *dev, int numvfs)
>  
>  	if (numvfs > 0) {
>  		genwqe_setup_vf_jtimer(cd);
> -		rc = pci_enable_sriov(dev, numvfs);
> +		rc = pci_enable_sriov(dev, numvfs, 1);
>  		if (rc < 0)
>  			return rc;
>  		return numvfs;
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> index c88b20af87df..773b20224a47 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> @@ -2570,7 +2570,7 @@ int bnx2x_enable_sriov(struct bnx2x *bp)
>  	if (rc)
>  		return rc;
>  
> -	rc = pci_enable_sriov(bp->pdev, req_vfs);
> +	rc = pci_enable_sriov(bp->pdev, req_vfs, 1);
>  	if (rc) {
>  		BNX2X_ERR("pci_enable_sriov failed with %d\n", rc);
>  		return rc;
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> index 3aea82bb9039..6e8afbfd3eba 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> @@ -6597,7 +6597,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  sriov:
>  #ifdef CONFIG_PCI_IOV
>  	if (func < ARRAY_SIZE(num_vf) && num_vf[func] > 0)
> -		if (pci_enable_sriov(pdev, num_vf[func]) == 0)
> +		if (pci_enable_sriov(pdev, num_vf[func], 1) == 0)
>  			dev_info(&pdev->dev,
>  				 "instantiated %u virtual functions\n",
>  				 num_vf[func]);
> diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
> index 86ee350e57f0..8a8b1d86f18a 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_main.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_main.c
> @@ -2421,7 +2421,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		pci_read_config_word(pdev, pos + PCI_SRIOV_TOTAL_VF,
>  			&enic->num_vfs);
>  		if (enic->num_vfs) {
> -			err = pci_enable_sriov(pdev, enic->num_vfs);
> +			err = pci_enable_sriov(pdev, enic->num_vfs, 1);
>  			if (err) {
>  				dev_err(dev, "SRIOV enable failed, aborting."
>  					" pci_enable_sriov() returned %d\n",
> diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
> index dc77ec2bdafd..a96491777ac4 100644
> --- a/drivers/net/ethernet/emulex/benet/be_main.c
> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -3274,7 +3274,7 @@ static int be_vf_setup(struct be_adapter *adapter)
>  	}
>  
>  	if (!old_vfs) {
> -		status = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
> +		status = pci_enable_sriov(adapter->pdev, adapter->num_vfs, 1);
>  		if (status) {
>  			dev_err(dev, "SRIOV enable failed\n");
>  			adapter->num_vfs = 0;
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
> index 060190864238..04a3dc5acc28 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
> @@ -408,7 +408,7 @@ int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs)
>  		 */
>  		fm10k_disable_aer_comp_abort(pdev);
>  
> -		err = pci_enable_sriov(pdev, num_vfs);
> +		err = pci_enable_sriov(pdev, num_vfs, 1);
>  		if (err) {
>  			dev_err(&pdev->dev,
>  				"Enable PCI SR-IOV failed: %d\n", err);
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> index 668d860275d6..fe56e09725f2 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> @@ -852,7 +852,7 @@ int i40e_alloc_vfs(struct i40e_pf *pf, u16 num_alloc_vfs)
>  
>  	/* Check to see if we're just allocating resources for extant VFs */
>  	if (pci_num_vf(pf->pdev) != num_alloc_vfs) {
> -		ret = pci_enable_sriov(pf->pdev, num_alloc_vfs);
> +		ret = pci_enable_sriov(pf->pdev, num_alloc_vfs, 1);
>  		if (ret) {
>  			dev_err(&pf->pdev->dev,
>  				"Failed to enable SR-IOV, error %d.\n", ret);
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index 3c0221620c9d..da01326ef550 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -2742,7 +2742,7 @@ static int igb_enable_sriov(struct pci_dev *pdev, int num_vfs)
>  
>  	/* only call pci_enable_sriov() if no VFs are allocated already */
>  	if (!old_vfs) {
> -		err = pci_enable_sriov(pdev, adapter->vfs_allocated_count);
> +		err = pci_enable_sriov(pdev, adapter->vfs_allocated_count, 1);
>  		if (err)
>  			goto err_out;
>  	}
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> index 04eee7c7b653..74b33483a0d1 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> @@ -149,7 +149,7 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
>  		 */
>  		adapter->num_vfs = min_t(unsigned int, adapter->num_vfs, IXGBE_MAX_VFS_DRV_LIMIT);
>  
> -		err = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
> +		err = pci_enable_sriov(adapter->pdev, adapter->num_vfs, 1);
>  		if (err) {
>  			e_err(probe, "Failed to enable PCI sriov: %d\n", err);
>  			adapter->num_vfs = 0;
> @@ -270,7 +270,7 @@ static int ixgbe_pci_sriov_enable(struct pci_dev *dev, int num_vfs)
>  	for (i = 0; i < adapter->num_vfs; i++)
>  		ixgbe_vf_configuration(dev, (i | 0x10000000));
>  
> -	err = pci_enable_sriov(dev, num_vfs);
> +	err = pci_enable_sriov(dev, num_vfs, 1);
>  	if (err) {
>  		e_dev_warn("Failed to enable PCI sriov: %d\n", err);
>  		return err;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 3044f9e623cb..ae38b556ec13 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2350,7 +2350,7 @@ static u64 mlx4_enable_sriov(struct mlx4_dev *dev, struct pci_dev *pdev,
>  					 existing_vfs, total_vfs);
>  		} else {
>  			mlx4_warn(dev, "Enabling SR-IOV with %d VFs\n", total_vfs);
> -			err = pci_enable_sriov(pdev, total_vfs);
> +			err = pci_enable_sriov(pdev, total_vfs, 1);
>  		}
>  		if (err) {
>  			mlx4_err(dev, "Failed to enable SR-IOV, continuing without SR-IOV (err = %d)\n",
> diff --git a/drivers/net/ethernet/neterion/vxge/vxge-main.c b/drivers/net/ethernet/neterion/vxge/vxge-main.c
> index cc0485e3c621..c341e73fc68c 100644
> --- a/drivers/net/ethernet/neterion/vxge/vxge-main.c
> +++ b/drivers/net/ethernet/neterion/vxge/vxge-main.c
> @@ -4495,7 +4495,7 @@ vxge_probe(struct pci_dev *pdev, const struct pci_device_id *pre)
>  	/* Enable SRIOV mode, if firmware has SRIOV support and if it is a PF */
>  	if (is_sriov(function_mode) && !is_sriov_initialized(pdev) &&
>  	   (ll_config->intr_type != INTA)) {
> -		ret = pci_enable_sriov(pdev, num_vfs);
> +		ret = pci_enable_sriov(pdev, num_vfs, 1);
>  		if (ret)
>  			vxge_debug_ll_config(VXGE_ERR,
>  				"Failed in enabling SRIOV mode: %d\n", ret);
> diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> index a29538b86edf..b483705a1ef1 100644
> --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> @@ -570,7 +570,7 @@ static int qlcnic_sriov_pf_enable(struct qlcnic_adapter *adapter, int num_vfs)
>  	if (!qlcnic_sriov_enable_check(adapter))
>  		return 0;
>  
> -	err = pci_enable_sriov(adapter->pdev, num_vfs);
> +	err = pci_enable_sriov(adapter->pdev, num_vfs, 1);
>  	if (err)
>  		qlcnic_sriov_pf_cleanup(adapter);
>  
> diff --git a/drivers/net/ethernet/sfc/siena_sriov.c b/drivers/net/ethernet/sfc/siena_sriov.c
> index a8bbbad68a88..6804ed04cfcd 100644
> --- a/drivers/net/ethernet/sfc/siena_sriov.c
> +++ b/drivers/net/ethernet/sfc/siena_sriov.c
> @@ -1332,7 +1332,7 @@ int efx_siena_sriov_init(struct efx_nic *efx)
>  
>  	/* At this point we must be ready to accept VFDI requests */
>  
> -	rc = pci_enable_sriov(efx->pci_dev, efx->vf_count);
> +	rc = pci_enable_sriov(efx->pci_dev, efx->vf_count, 1);
>  	if (rc)
>  		goto fail_pci;
>  
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4d109c07294a..f6aba5beea78 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -57,7 +57,7 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
>  		pci_remove_bus(virtbus);
>  }
>  
> -static int virtfn_add(struct pci_dev *dev, int id, int reset)
> +static int virtfn_add(struct pci_dev *dev, int id, int reset, int probe)
>  {
>  	int i;
>  	int rc = -ENOMEM;
> @@ -85,6 +85,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>  	virtfn->physfn = pci_dev_get(dev);
>  	virtfn->is_virtfn = 1;
>  	virtfn->multifunction = 0;
> +	virtfn->probe_vf = probe;
>  
>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>  		res = dev->resource + PCI_IOV_RESOURCES + i;
> @@ -170,7 +171,7 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
>  	pci_dev_put(dev);
>  }
>  
> -static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> +static int sriov_enable(struct pci_dev *dev, int nr_virtfn, int probe_vfs)
>  {
>  	int rc;
>  	int i, j;
> @@ -255,7 +256,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  		initial = nr_virtfn;
>  
>  	for (i = 0; i < initial; i++) {
> -		rc = virtfn_add(dev, i, 0);
> +		rc = virtfn_add(dev, i, 0, probe_vfs);
>  		if (rc)
>  			goto failed;
>  	}
> @@ -558,17 +559,18 @@ int pci_iov_bus_range(struct pci_bus *bus)
>   * pci_enable_sriov - enable the SR-IOV capability
>   * @dev: the PCI device
>   * @nr_virtfn: number of virtual functions to enable
> + * @probe_vfs: if zero, don't probe new VFs, otherwise probe if suitable driver available
>   *
>   * Returns 0 on success, or negative on failure.
>   */
> -int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
> +int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int probe_vfs)
>  {
>  	might_sleep();
>  
>  	if (!dev->is_physfn)
>  		return -ENOSYS;
>  
> -	return sriov_enable(dev, nr_virtfn);
> +	return sriov_enable(dev, nr_virtfn, probe_vfs);
>  }
>  EXPORT_SYMBOL_GPL(pci_enable_sriov);
>  
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 2b3c89425bb5..d5b93339b8a4 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -397,9 +397,14 @@ static int pci_device_probe(struct device *dev)
>  	drv = to_pci_driver(dev->driver);
>  	pci_dev = to_pci_dev(dev);
>  	pci_dev_get(pci_dev);
> -	error = __pci_device_probe(drv, pci_dev);
> -	if (error)
> -		pci_dev_put(pci_dev);
> +	if (!pci_dev->is_virtfn || pci_dev->probe_vf) {
> +		error = __pci_device_probe(drv, pci_dev);
> +		if (error)
> +			pci_dev_put(pci_dev);
> +	}
> +	/* one shot blocking of probe */
> +	if (pci_dev->is_virtfn && !pci_dev->probe_vf)
> +		pci_dev->probe_vf = 1;
>  
>  	return error;
>  }
> diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
> index 0b2c53af85c7..2f81f471b8f3 100644
> --- a/drivers/scsi/lpfc/lpfc_init.c
> +++ b/drivers/scsi/lpfc/lpfc_init.c
> @@ -4797,7 +4797,7 @@ lpfc_sli_probe_sriov_nr_virtfn(struct lpfc_hba *phba, int nr_vfn)
>  		return -EINVAL;
>  	}
>  
> -	rc = pci_enable_sriov(pdev, nr_vfn);
> +	rc = pci_enable_sriov(pdev, nr_vfn, 1);
>  	if (rc) {
>  		lpfc_printf_log(phba, KERN_WARNING, LOG_INIT,
>  				"2806 Failed to enable sriov on this device "
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 4c8ac5fcc224..beb2640ba18d 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -373,6 +373,7 @@ struct pci_dev {
>  	phys_addr_t rom; /* Physical address of ROM if it's not from the BAR */
>  	size_t romlen; /* Length of ROM if it's not from the BAR */
>  	char *driver_override; /* Driver name to force a match */
> +	int probe_vf; /* probe this device */
>  };
>  
>  static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
> @@ -1655,14 +1656,14 @@ int pci_ext_cfg_avail(void);
>  void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
>  
>  #ifdef CONFIG_PCI_IOV
> -int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> +int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int probe_vfs);
>  void pci_disable_sriov(struct pci_dev *dev);
>  int pci_num_vf(struct pci_dev *dev);
>  int pci_vfs_assigned(struct pci_dev *dev);
>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
>  #else
> -static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
> +static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn, int nr_virt_probe)
>  { return -ENODEV; }
>  static inline void pci_disable_sriov(struct pci_dev *dev) { }
>  static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
> -- 
> 2.1.3
> 
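
For readers following the pci-driver.c hunk, the one-shot behaviour can
be modelled in a few lines of userspace C. This is a sketch only:
'struct fake_pci_dev' and device_probe_attempt() are stand-ins made up
for illustration, and the real pci_device_probe() also takes a device
reference and returns an error code, neither of which is modelled here.

```c
#include <stdbool.h>

/* Stand-in for the two struct pci_dev fields the patch looks at. */
struct fake_pci_dev {
	bool is_virtfn;
	int probe_vf; /* 0: skip the next probe once, nonzero: probe */
};

/* Returns true if the driver probe would actually run on this attempt. */
bool device_probe_attempt(struct fake_pci_dev *dev)
{
	bool run = !dev->is_virtfn || dev->probe_vf;

	/* one shot blocking of probe: allow every attempt after the first */
	if (dev->is_virtfn && !dev->probe_vf)
		dev->probe_vf = 1;

	return run;
}
```

With a VF created with probe_vf == 0, the first attempt is skipped and
the flag flips to 1, so a later bind from the administrator goes
through. A physical function is never affected.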


* Re: [PATCH] i40e: don't enable and init FCOE by default when do PF reset
From: Dev, Vasu @ 2015-01-09 18:18 UTC
  To: Ronciak, John, Ethan Zhao, Kirsher, Jeffrey T, Brandeburg, Jesse,
	Allan, Bruce W, Wyborny, Carolyn, Skidmore, Donald C,
	Rose, Gregory V, Vick, Matthew, Williams, Mitch A, Parikh, Neerav
  Cc: e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org,
	brian.maly@oracle.com, linux-kernel@vger.kernel.org, Linux NICS,
	ethan.kernel@gmail.com
In-Reply-To: <D38E8E86660E514AB505863C19C9287C741097AF@ORSMSX102.amr.corp.intel.com>

> -----Original Message-----
> From: Ronciak, John
> Sent: Friday, January 09, 2015 8:42 AM
> To: Ethan Zhao; Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W;
> Wyborny, Carolyn; Skidmore, Donald C; Rose, Gregory V; Vick, Matthew;
> Williams, Mitch A; Dev, Vasu; Parikh, Neerav
> Cc: Linux NICS; e1000-devel@lists.sourceforge.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; ethan.kernel@gmail.com;
> brian.maly@oracle.com
> Subject: RE: [PATCH] i40e: don't enable and init FCOE by default when do PF
> reset
> 
> Adding Vasu and Neerav
> 
> Cheers,
> John
> 
> > -----Original Message-----
> > From: Ethan Zhao [mailto:ethan.zhao@oracle.com]
> > Sent: Friday, January 9, 2015 8:38 AM
> > To: Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W; Wyborny,
> > Carolyn; Skidmore, Donald C; Rose, Gregory V; Vick, Matthew; Ronciak,
> > John; Williams, Mitch A
> > Cc: Linux NICS; e1000-devel@lists.sourceforge.net;
> > netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > ethan.kernel@gmail.com; brian.maly@oracle.com; Ethan Zhao
> > Subject: [PATCH] i40e: don't enable and init FCOE by default when do
> > PF reset
> >
> > While do PF reset with function i40e_reset_and_rebuild(), it will call
> > i40e_init_pf_fcoe() by default if FCOE is defined, thus if the PF is
> > resetted, FCOE will be enabled whatever it was - enabled or not.
> >
> > Such bug might be hit when PF resumes from suspend, run diagnostic
> > test with ethtool, setup VLAN etc.
> >
> > Passed building with v3.19-rc3.
> >
> > Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_main.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
> > b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > index a5f2660..a2572cc 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > @@ -6180,9 +6180,12 @@ static void i40e_reset_and_rebuild(struct
> > i40e_pf *pf, bool reinit)
> >  	}
> >  #endif /* CONFIG_I40E_DCB */
> >  #ifdef I40E_FCOE
> > -	ret = i40e_init_pf_fcoe(pf);
> > -	if (ret)
> > -		dev_info(&pf->pdev->dev, "init_pf_fcoe failed: %d\n", ret);
> > +	if (pf->flags & I40E_FLAG_FCOE_ENABLED) {
> > +		ret = i40e_init_pf_fcoe(pf);

Calling i40e_init_pf_fcoe() here conflicts with the I40E_FLAG_FCOE_ENABLED pre-condition, since I40E_FLAG_FCOE_ENABLED is set by the very same i40e_init_pf_fcoe(); as a result i40e_init_pf_fcoe() will never get called.

Jeff Kirsher should be sending out a patch I queued that adds an I40E_FCOE Kconfig option. With it, FCoE is disabled by default and the user can enable it only if needed, so in the default config that patch has the same effect of skipping i40e_init_pf_fcoe(), whether or not the FCoE capability is enabled in the device.
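
To illustrate the circular pre-condition, here is a minimal user-space sketch. It assumes, as described above, that the enabled flag is only ever set inside the init routine. All demo_* names and DEMO_FLAG_FCOE_ENABLED are hypothetical stand-ins, not the real i40e definitions.

```c
/* Hypothetical stand-in for I40E_FLAG_FCOE_ENABLED; not the driver's value. */
#define DEMO_FLAG_FCOE_ENABLED (1u << 0)

/* Hypothetical stand-in for struct i40e_pf, reduced to what the sketch needs. */
struct demo_pf {
	unsigned int flags;
	int fcoe_init_calls;	/* counts how often the init routine ran */
};

/* Assumption: the init routine is the only place that sets the flag. */
static int demo_init_pf_fcoe(struct demo_pf *pf)
{
	pf->fcoe_init_calls++;
	pf->flags |= DEMO_FLAG_FCOE_ENABLED;
	return 0;
}

/* Rebuild path guarded the same way as the proposed hunk. */
static void demo_reset_and_rebuild(struct demo_pf *pf)
{
	if (pf->flags & DEMO_FLAG_FCOE_ENABLED)
		demo_init_pf_fcoe(pf);
}
```

Starting from a zeroed demo_pf, no number of demo_reset_and_rebuild() calls ever reaches demo_init_pf_fcoe(), because the guard tests a flag that only the guarded call can set.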

From patchwork Wed Oct  2 23:26:08 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: [net] i40e: adds FCoE configure option
Date: Thu, 03 Oct 2013 07:26:08 -0000
From: Vasu Dev <vasu.dev@intel.com>
X-Patchwork-Id: 11797

Adds FCoE config option I40E_FCOE, so that FCoE can be enabled
as needed but otherwise have it disabled by default.

This also eliminates the multiple FCoE config checks; there is now
just one config check, for CONFIG_I40E_FCOE.

I40E FCoE was added in the 3.17 kernel, so this patch should
also be applied to the stable 3.17 kernel.

CC: <stable@vger.kernel.org>
Signed-off-by: Vasu Dev <vasu.dev@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>

---
drivers/net/ethernet/intel/Kconfig           |   11 +++++++++++
 drivers/net/ethernet/intel/i40e/Makefile     |    2 +-
 drivers/net/ethernet/intel/i40e/i40e_osdep.h |    4 ++--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 5b8300a..4d61ef5 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -281,6 +281,17 @@ config I40E_DCB
 
 	  If unsure, say N.
 
+config I40E_FCOE
+	bool "Fibre Channel over Ethernet (FCoE)"
+	default n
+	depends on I40E && DCB && FCOE
+	---help---
+	  Say Y here if you want to use Fibre Channel over Ethernet (FCoE)
+	  in the driver. This will create new netdev for exclusive FCoE
+	  use with XL710 FCoE offloads enabled.
+
+	  If unsure, say N.
+
 config I40EVF
 	tristate "Intel(R) XL710 X710 Virtual Function Ethernet support"
 	depends on PCI_MSI
diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 4b94ddb..c405819 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -44,4 +44,4 @@ i40e-objs := i40e_main.o \
 	i40e_virtchnl_pf.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
-i40e-$(CONFIG_FCOE:m=y) += i40e_fcoe.o
+i40e-$(CONFIG_I40E_FCOE) += i40e_fcoe.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e_osdep.h b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
index 045b5c4..ad802dd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_osdep.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
@@ -78,7 +78,7 @@ do {                                                            \
 } while (0)
 
 typedef enum i40e_status_code i40e_status;
-#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
+#ifdef CONFIG_I40E_FCOE
 #define I40E_FCOE
-#endif /* CONFIG_FCOE or CONFIG_FCOE_MODULE */
+#endif
 #endif /* _I40E_OSDEP_H_ */

> > +		if (ret)
> > +			dev_info(&pf->pdev->dev,
> > +				 "init_pf_fcoe failed: %d\n", ret);
> > +	}
> >
> >  #endif
> >  	/* do basic switch setup */
> > --
> > 1.8.3.1



^ permalink raw reply related

* Re: [net-next PATCH v1 00/11] A flow API
From: John Fastabend @ 2015-01-09 18:10 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: tgraf, sfeldma, jhs, simon.horman, netdev, davem, andy
In-Reply-To: <20150108180320.GF1898@nanopsycho.orion>

On 01/08/2015 10:03 AM, Jiri Pirko wrote:
> Wed, Dec 31, 2014 at 08:45:19PM CET, john.fastabend@gmail.com wrote:
>> So... I could continue to mull over this and tweak bits and pieces
>> here and there but I decided its best to get a wider group of folks
>> looking at it and hopefully with any luck using it so here it is.
>>
>> This set creates a new netlink family and set of messages to configure
>> flow tables in hardware. I tried to make the commit messages
>> reasonably verbose at least in the flow_table patches.
>>
>> What we get at the end of this series is a working API to get device
>> capabilities and program flows using the rocker switch.
>>
>> I created a user space tool 'flow' that I use to configure and query
>> the devices it is posted here,
>>
>> 	https://github.com/jrfastab/iprotue2-flow-tool
>>
>> For now it is a stand-alone tool but once the kernel bits get sorted
>> out (I'm guessing there will need to be a few versions of this series
>> to get it right) I would like to port it into the iproute2 package.
>> This way we can keep all of our tooling in one package see 'bridge'
>> for example.
>>
>> As far as testing, I've tested various combinations of tables and
>> rules on the rocker switch and it seems to work. I have not tested
>> 100% of the rocker code paths though. It would be great to get some
>> sort of automated framework around the API to do this. I don't
>> think this should gate the inclusion of the API though.
>>
>> I could use some help reviewing,
>>
>>   (a) error paths and netlink validation code paths
>>
>>   (b) Break down of structures vs netlink attributes. I
>>       am trying to balance flexibility given by having
>>       netlink TLV attributes vs conciseness. So some
>>       things are passed as structures.
>>
>>   (c) are there any devices that have pipelines that we
>>       can't represent with this API? It would be good to
>>       know about these so we can design it in probably
>>       in a future series.
>>
>> For some examples and maybe a bit more illustrative description I
>> posted a quickly typed up set of notes on github io pages. Here we
>> can show the description along with images produced by the flow tool
>> showing the pipeline. Once we settle a bit more on the API we should
>> probably do a clean up of this and other threads happening and commit
>> something to the Documentation directory.
>>
>> http://jrfastab.github.io/jekyll/update/2014/12/21/flow-api.html
>>
>> Finally I have more patches to add support for creating and destroying
>> tables. This allows users to define the pipeline at runtime rather
>> than statically as rocker does now. After this set gets some traction
>> I'll look at pushing them in a next round. However it likely requires
>> adding another "world" to rocker. Another piece that I want to add is
>> a description of the actions and metadata. This way user space can
>> "learn" what an action is and how metadata interacts with the system.
>> This work is under development.
>>
>> Thanks! Any comments/feedback always welcome.
>>
>> And also thanks to everyone who helped with this flow API so far. All
>> the folks at Dusseldorf LPC, OVS summit Santa Clara, P4 authors for
>> some inspiration, the collection of IETF FoRCES documents I mulled
>> over, Netfilter workshop where I started to realize fixing ethtool
>> was most likely not going to work, etc.
>>
>> ---
>>
>> John Fastabend (11):
>>       net: flow_table: create interface for hw match/action tables
>>       net: flow_table: add flow, delete flow
>>       net: flow_table: add apply action argument to tables
>>       rocker: add pipeline model for rocker switch
>>       net: rocker: add set flow rules
>>       net: rocker: add group_id slices and drop explicit goto
>>       net: rocker: add multicast path to bridging
>>       net: rocker: add get flow API operation
>>       net: rocker: add cookie to group acls and use flow_id to set cookie
>>       net: rocker: have flow api calls set cookie value
>>       net: rocker: implement delete flow routine
>
> Truly impressive work John (including the "flow" tool, documentation).
> Hats off.
>
> Currently, all is very userspace oriented and I understand the reason.
> I also understand why Jamal is a bit nervous about that fact. I am as well.
> Correct me if I'm wrong but this amount of "direct hw access" is
> unprecedented. The kernel has been here to cover the hw differences,
> I wonder if there is any way to continue in this direction with flows...
>

As it is currently written the API allows for abstracting the hardware
programming and low level interface by using a common model and API that
can represent a large array of devices.

I'm not sure what abstracting the hw differences would mean beyond
the above model/API. I intentionally didn't want to force _all_
hardware to expose a specific pipeline, for example the OVS pipeline.

> What I would love to see in this initial patchset is "the internal user".
> For example tc. The tc code could query the capabilities and decide what
> "flows" to put into hw tables.

Sure, the biggest gap for me on this is that 'tc' is actually about
ports/queues and currently filters/tables are part of qdiscs. The
model in this series is a pipeline that has a set of egress endpoints
that can be reached by actions. The endpoints would be ports or tunnel
engines or could be other network function blocks.

That said, I can imagine pushing the configuration into a per port table
in the hardware or, most likely, just requiring any matches on egress
qdiscs to use an implied egress_port match. On ingress, similarly use
an ingress_port match.
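
As a rough sketch of the implied port match idea, and not code from this series, something like the following could add the port match when a filter attached to a qdisc is translated into a hardware flow. Every demo_* name, the field ids and the fixed-size match array are assumptions made up for illustration.

```c
#include <stddef.h>

/* Hypothetical field ids; the series defines its own headers and fields. */
enum demo_field {
	DEMO_FIELD_INGRESS_PORT,
	DEMO_FIELD_EGRESS_PORT,
	DEMO_FIELD_ETH_TYPE
};

/* Direction of the qdisc the software filter is attached to. */
enum demo_dir { DEMO_INGRESS, DEMO_EGRESS };

struct demo_match {
	enum demo_field field;
	unsigned int value;
};

#define DEMO_MAX_MATCHES 8

struct demo_flow {
	struct demo_match matches[DEMO_MAX_MATCHES];
	size_t n_matches;
};

/*
 * Append the port match implied by the qdisc direction.
 * Returns 0 on success, -1 if the flow has no room left.
 */
static int demo_add_implied_port_match(struct demo_flow *flow,
				       enum demo_dir dir, unsigned int port)
{
	struct demo_match *m;

	if (flow->n_matches >= DEMO_MAX_MATCHES)
		return -1;

	m = &flow->matches[flow->n_matches++];
	m->field = (dir == DEMO_EGRESS) ? DEMO_FIELD_EGRESS_PORT
					: DEMO_FIELD_INGRESS_PORT;
	m->value = port;
	return 0;
}
```

The point of the sketch is only that the user never writes the port match; it is derived from where the filter hangs.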

I'll look at doing this next week but I think the series is useful even
without any "internal users" ;) I'll send out a v2 with all the feedback
I've received so far shortly then think some more about this. Doing the
mapping from software filters/actions/tables onto the hardware tables
exposed by the API in this series is actually what I wanted to present
@ netdev conference so I think we are heading in the same direction.

.John


>
> Jiri
>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH iproute2] ip link: Fix crash on older kernels when show VF dev
From: William Dauchy @ 2015-01-09 17:55 UTC (permalink / raw)
  To: Vadim Kochan; +Cc: netdev, william
In-Reply-To: <1420824351-23450-1-git-send-email-vadim4j@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2330 bytes --]

On Jan09 19:25, Vadim Kochan wrote:
> From: Vadim Kochan <vadim4j@gmail.com>
> 
> The issue was caused that ifla_vf_rate does not exist on
> older kernels and should be checked if it exists as nested attr.
> 
> Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
> Reported-by: William Dauchy <william@gandi.net>
> Tested-by: William Dauchy <william@gandi.com>

gandi.net actually ;)

Thanks,

> ---
>  ip/ipaddress.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> index 28dfe8c..830b166 100644
> --- a/ip/ipaddress.c
> +++ b/ip/ipaddress.c
> @@ -259,11 +259,10 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
>  {
>  	struct ifla_vf_mac *vf_mac;
>  	struct ifla_vf_vlan *vf_vlan;
> -	struct ifla_vf_rate *vf_rate;
>  	struct ifla_vf_tx_rate *vf_tx_rate;
>  	struct ifla_vf_spoofchk *vf_spoofchk;
>  	struct ifla_vf_link_state *vf_linkstate;
> -	struct rtattr *vf[IFLA_VF_MAX+1];
> +	struct rtattr *vf[IFLA_VF_MAX + 1] = {};
>  	struct rtattr *tmp;
>  	SPRINT_BUF(b1);
>  
> @@ -277,7 +276,6 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
>  	vf_mac = RTA_DATA(vf[IFLA_VF_MAC]);
>  	vf_vlan = RTA_DATA(vf[IFLA_VF_VLAN]);
>  	vf_tx_rate = RTA_DATA(vf[IFLA_VF_TX_RATE]);
> -	vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
>  
>  	/* Check if the spoof checking vf info type is supported by
>  	 * this kernel.
> @@ -313,10 +311,16 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
>  		fprintf(fp, ", qos %d", vf_vlan->qos);
>  	if (vf_tx_rate->rate)
>  		fprintf(fp, ", tx rate %d (Mbps)", vf_tx_rate->rate);
> -	if (vf_rate->max_tx_rate)
> -		fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
> -	if (vf_rate->min_tx_rate)
> -		fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
> +
> +	if (vf[IFLA_VF_RATE]) {
> +		struct ifla_vf_rate *vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
> +
> +		if (vf_rate->max_tx_rate)
> +			fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
> +		if (vf_rate->min_tx_rate)
> +			fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
> +	}
> +
>  	if (vf_spoofchk && vf_spoofchk->setting != -1) {
>  		if (vf_spoofchk->setting)
>  			fprintf(fp, ", spoof checking on");
> -- 
> 2.1.3
> 

-- 
William

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH 1/1] openvswitch: Remove unnecessary version.h inclusion
From: Pravin Shelar @ 2015-01-09 18:00 UTC (permalink / raw)
  To: Syam Sidhardhan; +Cc: netdev, syamsidhardh, David Miller, dev@openvswitch.org
In-Reply-To: <1420815381-29973-1-git-send-email-s.syam@samsung.com>

On Fri, Jan 9, 2015 at 6:56 AM, Syam Sidhardhan <s.syam@samsung.com> wrote:
> version.h inclusion is not necessary as detected by versioncheck.
>
> Signed-off-by: Syam Sidhardhan <s.syam@samsung.com>

You should use net-next prefix flag for net-next tree patches.
Otherwise looks good.
Acked-by: Pravin B Shelar <pshelar@nicira.com>

> ---
>  net/openvswitch/vport-geneve.c |    2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
> index 347fa23..70e2aae 100644
> --- a/net/openvswitch/vport-geneve.c
> +++ b/net/openvswitch/vport-geneve.c
> @@ -9,8 +9,6 @@
>
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> -#include <linux/version.h>
> -
>  #include <linux/in.h>
>  #include <linux/ip.h>
>  #include <linux/net.h>
> --
> 1.7.9.5
>

^ permalink raw reply

* Re: [PATCH] brcm80211: brcmsmac: dma: Remove some unused functions
From: Rickard Strandqvist @ 2015-01-09 17:58 UTC (permalink / raw)
  To: Arend van Spriel
  Cc: Julia Lawall, Kalle Valo, Larry Finger, Brett Rudley,
	Hante Meuleman, Fabian Frederick,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	brcm80211-dev-list-dY08KVG/lbpWk0Htik3J/w, Network Development,
	Linux Kernel Mailing List
In-Reply-To: <54ACF523.2030706-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

2015-01-07 9:58 GMT+01:00 Arend van Spriel <arend-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>:
> On 01/07/15 07:29, Julia Lawall wrote:
>>
>>
>>
>> On Wed, 7 Jan 2015, Rickard Strandqvist wrote:
>>
>>> 2015-01-05 12:06 GMT+01:00 Arend van Spriel<arend-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>:
>>>>
>>>> On 01/05/15 11:49, Kalle Valo wrote:
>>>>>
>>>>>
>>>>> Rickard Strandqvist<rickard_strandqvist-IW2WV5XWFqGZkjO+N0TKoMugMpMbD5Xr@public.gmane.org>   writes:
>>>>>
>>>>>> As I hope you can see I have made some changes regarding the
>>>>>> subject-line. Thought it was an advantage to be able to see which file
>>>>>> I actually removed something from. There seems to be a big focus on
>>>>>> getting right on subject-line right in recent weeks.
>>>>>>
>>>>>> I wonder why there is a script that takes a file name, and respond
>>>>>> with an appropriate subject line?
>>>>
>>>>
>>>>
>>>> Is there a script for this? Anyway, I would say driver name is enough.
>>>> Enough about the subject line ;-) I would like to give some general
>>>> remarks
>>>> as you seem to touch a lot of kernel code. First off, I think it is good
>>>> to
>>>> remove unused stuff. However, I would like some more explanation on your
>>>> methodology apart from "partially found by using a static code analysis
>>>> program". So a cover-letter explaining that would have been nice (maybe
>>>> still is). Things like Kconfig option can affect whether function are
>>>> used
>>>> or not so how did you cover that.
>>>>
>>>> Regards,
>>>> Arend
>>>>
>>>>
>>>>> I don't think you can really automate this as some drivers do this a
>>>>> bit
>>>>> differently. You always need to manually check the commit log.
>>>>>
>>>>>> But ok, I change my script accordingly. Should I submit the patch
>>>>>> again?
>>>>>
>>>>>
>>>>>
>>>>> Yes, please resubmit.
>>>>>
>>>>
>>>
>>> Hi Arend
>>>
>>> Yes, a script that had been excellent, I think!
>>> I have one as part of my git send-email script, until a week ago, it
>>> was enough that I removed the "drivers/" and changed all "/" to ": "
>>> I have now been expanded my sed pipe a lot (tell me if anyone is
>>> interested)
>>> But now I've seen everything from uppercase and [DIR], etc.
>>> So I can not understand how anyone should be able to get the right
>>> name without a good help.
>>>
>>> Sure i like to share how I use cppcheck, but is very hesitant to write
>>> this with each patch mails I send though!
>>>
>>> I run:
>>> cppcheck --force --quiet --enable=all .
>>>
>>> Or a specific file instead of .
>>>
>>> This will include, among other things get a lot of error message such,
>>> +4000 for the kernel.
>>> (style) The function 'xxx' is never used
>>>
>>> For these I made a script that searched through all the files after
>>> the function name (cppcheck missed a few). And save the rest so I go
>>> through them and possibly send patches.
>>
>>
>> I think that the question was about what methodology is cppcheck using to
>> find the given issue.  But probably cppcheck is a black box that does
>> whatever it does, so the user doesn't know what the rationale is.
>
>
> That would have been nice, but I also wanted to know what his subsequent
> steps were to validate the output from cppcheck. I went through some
> cppcheck web pages, but they only elaborate on what it can do and not the
> how. But hey, it is an open-source tool so there is always the code to
> check.
>
>> However, I think you mentioned that cppcheck found only some of the
>> issues.  You could thus describe what was the methodology for finding the
>> other ones.
>
>
> Maybe upon removing an unused function it had a ripple effect on others
> becoming unused as well? Still this is speculating and with this kind of
> cleanup effort all over the place it is better to review the methodology.
>
> Regards,
> Arend
>
>> julia


Hi all

Julia, cppcheck is a GPL project:
http://sourceforge.net/projects/cppcheck/


Arend,
I ran cppcheck with all options enabled in the linux root, and then
used grep to pick out what I was interested in.  I agree that there is
a lack of documentation, unfortunately.

More exactly, this is how I did it: I searched with grep for the 4000
functions and put the results in a lot of files. These were input to a
script that opened a file editor, where I did a visual overview of
every place the function was found. Several of them were in fact used,
for example directly in assembler code. More recently I have also
started doing git blame on the file to see how old the code is.
Then I made the choice to remove or not.

Hope this was clear enough :)


Kind regards
Rickard Strandqvist

^ permalink raw reply

* Re: [PATCH 2/6] vxlan: Group Policy extension
From: Alexei Starovoitov @ 2015-01-09 17:37 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Jesse Gross, Stephen Hemminger, Pravin Shelar,
	Tom Herbert, netdev@vger.kernel.org, dev@openvswitch.org
In-Reply-To: <7339e3bff124cecaf65cd04ea9bdc973c730ba34.1420756324.git.tgraf@suug.ch>

On Thu, Jan 8, 2015 at 2:47 PM, Thomas Graf <tgraf@suug.ch> wrote:
> +
> +struct vxlan_gbp {
> +} __packed;

empty struct ? seems unused.
looks good to me otherwise.

^ permalink raw reply

* Re: [PATCH v2 1/3] dtb: xgene: fix: Backward compatibility with older firmware
From: Ian Campbell @ 2015-01-09 17:36 UTC (permalink / raw)
  To: Iyappan Subramanian
  Cc: devicetree, netdev, patches, kchudgar, davem, linux-arm-kernel
In-Reply-To: <1420819166.21186.38.camel@citrix.com>

On Fri, 2015-01-09 at 15:59 +0000, Ian Campbell wrote:
> I'm running mustang_sw_1.13.29-beta, using the mustang_tianocore_ubt.fd
> method to launch from u-boot.

I realised shortly after that I had downloaded 1.14.14 too, but it
behaves the same.

I rebuilt mustang_tianocore_ubt.fd with the first hunk of this patch and
things now work.

Ian.

^ permalink raw reply

* [PATCH iproute2] ip link: Fix crash on older kernels when show VF dev
From: Vadim Kochan @ 2015-01-09 17:25 UTC (permalink / raw)
  To: netdev; +Cc: william, Vadim Kochan

From: Vadim Kochan <vadim4j@gmail.com>

The crash happens because the IFLA_VF_RATE attribute does not exist on
older kernels; check that the nested attr is present before using it.
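
The pattern can be sketched in isolation as below. This is a simplified user-space illustration, not the iproute2 code: the demo_* types and ids are hypothetical stand-ins for struct rtattr, IFLA_VF_* and struct ifla_vf_rate.

```c
#include <stddef.h>

/* Hypothetical attribute ids standing in for IFLA_VF_*. */
enum { DEMO_VF_MAC = 1, DEMO_VF_RATE = 2, DEMO_VF_MAX = DEMO_VF_RATE };

/* Hypothetical, simplified attribute: a type plus a pointer to its payload. */
struct demo_attr {
	int type;
	const void *data;
};

/* Hypothetical payload standing in for struct ifla_vf_rate. */
struct demo_vf_rate {
	unsigned int min_tx_rate;
	unsigned int max_tx_rate;
};

/* Fill tb[] from the attributes actually present; absent ids stay NULL. */
static void demo_parse(const struct demo_attr *tb[],
		       const struct demo_attr *attrs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (attrs[i].type >= 1 && attrs[i].type <= DEMO_VF_MAX)
			tb[attrs[i].type] = &attrs[i];
}

/* Return max_tx_rate, or 0 when the rate attribute was not sent. */
static unsigned int demo_max_tx_rate(const struct demo_attr *attrs, size_t n)
{
	/* Zero-initialised so a missing attribute is a NULL slot. */
	const struct demo_attr *tb[DEMO_VF_MAX + 1] = { NULL };
	const struct demo_vf_rate *rate;

	demo_parse(tb, attrs, n);
	if (!tb[DEMO_VF_RATE])
		return 0;	/* older kernel: nothing to dereference */

	rate = tb[DEMO_VF_RATE]->data;
	return rate->max_tx_rate;
}
```

With an attribute list that only carries DEMO_VF_MAC, demo_max_tx_rate() returns 0 instead of dereferencing a missing attribute.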

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Reported-by: William Dauchy <william@gandi.net>
Tested-by: William Dauchy <william@gandi.com>
---
 ip/ipaddress.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 28dfe8c..830b166 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -259,11 +259,10 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 {
 	struct ifla_vf_mac *vf_mac;
 	struct ifla_vf_vlan *vf_vlan;
-	struct ifla_vf_rate *vf_rate;
 	struct ifla_vf_tx_rate *vf_tx_rate;
 	struct ifla_vf_spoofchk *vf_spoofchk;
 	struct ifla_vf_link_state *vf_linkstate;
-	struct rtattr *vf[IFLA_VF_MAX+1];
+	struct rtattr *vf[IFLA_VF_MAX + 1] = {};
 	struct rtattr *tmp;
 	SPRINT_BUF(b1);
 
@@ -277,7 +276,6 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 	vf_mac = RTA_DATA(vf[IFLA_VF_MAC]);
 	vf_vlan = RTA_DATA(vf[IFLA_VF_VLAN]);
 	vf_tx_rate = RTA_DATA(vf[IFLA_VF_TX_RATE]);
-	vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
 
 	/* Check if the spoof checking vf info type is supported by
 	 * this kernel.
@@ -313,10 +311,16 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
 		fprintf(fp, ", qos %d", vf_vlan->qos);
 	if (vf_tx_rate->rate)
 		fprintf(fp, ", tx rate %d (Mbps)", vf_tx_rate->rate);
-	if (vf_rate->max_tx_rate)
-		fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
-	if (vf_rate->min_tx_rate)
-		fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+
+	if (vf[IFLA_VF_RATE]) {
+		struct ifla_vf_rate *vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
+
+		if (vf_rate->max_tx_rate)
+			fprintf(fp, ", max_tx_rate %dMbps", vf_rate->max_tx_rate);
+		if (vf_rate->min_tx_rate)
+			fprintf(fp, ", min_tx_rate %dMbps", vf_rate->min_tx_rate);
+	}
+
 	if (vf_spoofchk && vf_spoofchk->setting != -1) {
 		if (vf_spoofchk->setting)
 			fprintf(fp, ", spoof checking on");
-- 
2.1.3

^ permalink raw reply related

* [PATCH net-next RFC 5/5] net-timestamp: tx timestamping default mode flag
From: Willem de Bruijn @ 2015-01-09 17:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, richardcochran, eric.dumazet, luto, Willem de Bruijn
In-Reply-To: <1420824719-28848-1-git-send-email-willemb@google.com>

From: Willem de Bruijn <willemb@google.com>

The number of timestamping points along the transmit path has grown,
as have the options. Preferred behavior is to request timestamps with
ID, without data (which enables batching) and for all supported
timestamp points. Define a short option that enables all these
defaults.
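
As a small illustration of how the combined flag is composed and tested, here is a user-space sketch. The DEMO_* constants are local copies that are assumed to mirror the bit positions in include/uapi/linux/net_tstamp.h, and demo_has_all_tx_defaults() is a hypothetical helper, not part of this patch.

```c
/* Local copies; assumed to mirror the SOF_TIMESTAMPING_* bit positions. */
enum {
	DEMO_TX_HARDWARE = (1 << 0),
	DEMO_TX_SOFTWARE = (1 << 1),
	DEMO_OPT_ID      = (1 << 7),
	DEMO_TX_SCHED    = (1 << 8),
	DEMO_TX_ACK      = (1 << 9),
	DEMO_OPT_TSONLY  = (1 << 11),

	/* Same composition as SOF_TIMESTAMPING_TX in the diff below. */
	DEMO_TX = DEMO_TX_HARDWARE |
		  DEMO_TX_SOFTWARE |
		  DEMO_TX_SCHED |
		  DEMO_TX_ACK |
		  DEMO_OPT_ID |
		  DEMO_OPT_TSONLY
};

/* Hypothetical helper: true when every default tx bit is set in flags. */
static int demo_has_all_tx_defaults(unsigned int flags)
{
	const unsigned int mask = (unsigned int)DEMO_TX;

	return (flags & mask) == mask;
}
```

A caller would pass the single combined value where it passes the individual bits today, for example in the SO_TIMESTAMPING socket option.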

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/uapi/linux/net_tstamp.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 6d1abea..c371ce2 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -27,6 +27,14 @@ enum {
 	SOF_TIMESTAMPING_OPT_TSONLY = (1<<11),
 
 	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_TSONLY,
+
+	SOF_TIMESTAMPING_TX = SOF_TIMESTAMPING_TX_HARDWARE |
+			      SOF_TIMESTAMPING_TX_SOFTWARE |
+			      SOF_TIMESTAMPING_TX_SCHED |
+			      SOF_TIMESTAMPING_TX_ACK |
+			      SOF_TIMESTAMPING_OPT_ID |
+			      SOF_TIMESTAMPING_OPT_TSONLY,
+
 	SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
 				 SOF_TIMESTAMPING_LAST
 };
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next RFC 4/5] net-timestamp: tx timestamp cookies
From: Willem de Bruijn @ 2015-01-09 17:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, richardcochran, eric.dumazet, luto, Willem de Bruijn
In-Reply-To: <1420824719-28848-1-git-send-email-willemb@google.com>

From: Willem de Bruijn <willemb@google.com>

Support looping multiple timestamps on top of a single skb on the
error queue.

Tx timestamps are returned on top of an skb. TCP timestamping and
other timestamp points enable multiple timestamps for each buffer
passed in send. Due to retransmissions, this number may be high,
using lots of SO_RCVBUF space and kernel mode switches.

When returning without payload (SOF_TIMESTAMPING_OPT_TSONLY), the
total truesize is smaller, but still O(n). Without payload, the
constraint that a timestamp belongs to a specific skb also goes
away.

Instead of queuing multiple skbs onto the error queue, queue
successive timestamps onto the skb on top of the error queue.
For this purpose, introduce a timestamp cookie and use a list
of cookies instead of skb->tstamp.

The number of batched cookies is limited by having sends fail
with EAGAIN or ENOMSG as soon as a single packet is waiting on
the receive queue. If merging this functionality, a TODO is to
add a hard cap, so that processes can estimate the maximum
msg_controllen needed to read all timestamps.

The implementation returns the same structures as before, that is,
one struct sock_extended_err and one struct scm_timestamping for
each timestamp. The list is returned in reverse chronological
order: newest first. This choice is partially determined by the
callers (e.g., ip_recv_error) generating the final sock_extended_err.
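
The newest-first order follows from prepending each cookie to the list. A user-space sketch of that behaviour is below; struct demo_cookie and the demo_* helpers are hypothetical analogues for illustration, not the kernel structures added by this patch.

```c
#include <stdlib.h>

/* Hypothetical user-space analogue of a timestamp cookie. */
struct demo_cookie {
	unsigned int tskey;
	long long tstamp_ns;
	struct demo_cookie *next;
};

/*
 * Prepend a cookie to the list. Returns the new head, or the
 * unchanged old head if the allocation fails.
 */
static struct demo_cookie *demo_cookie_push(struct demo_cookie *head,
					    unsigned int tskey,
					    long long tstamp_ns)
{
	struct demo_cookie *c = calloc(1, sizeof(*c));

	if (!c)
		return head;

	c->tskey = tskey;
	c->tstamp_ns = tstamp_ns;
	c->next = head;		/* newest entry becomes the head */
	return c;
}

/* Free the whole list, walking from newest to oldest. */
static void demo_cookie_free(struct demo_cookie *head)
{
	while (head) {
		struct demo_cookie *prev = head;

		head = head->next;
		free(prev);
	}
}
```

Pushing keys 1, 2, 3 in that order and then walking from the head visits 3, 2, 1, which is the reverse chronological order described above.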

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h        |  12 +++++
 include/net/sock.h            |   3 +-
 include/uapi/linux/errqueue.h |   1 +
 net/core/skbuff.c             | 104 ++++++++++++++++++++++++++++++++++++------
 net/socket.c                  |  64 ++++++++++++++++++++++++--
 5 files changed, 167 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 85ab7d7..6d77b51 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -298,6 +298,13 @@ struct ubuf_info {
 	unsigned long desc;
 };
 
+struct skb_tstamp_cookie {
+	u32 tskey;
+	u32 tstype;
+	ktime_t tstamp;
+	struct skb_tstamp_cookie *next;
+};
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -442,6 +449,8 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
  *	@next: Next buffer in list
  *	@prev: Previous buffer in list
  *	@tstamp: Time we arrived/left
+ *	@skb_mstamp: tstamp variant used only within the TCP stack
+ *	@tscookies: tstamp variant used only with no-payload errqueue packets
  *	@rbnode: RB tree node, alternative to next/prev for netem/tcp
  *	@sk: Socket we are owned by
  *	@dev: Device we arrived on/are leaving by
@@ -516,6 +525,7 @@ struct sk_buff {
 			union {
 				ktime_t		tstamp;
 				struct skb_mstamp skb_mstamp;
+				struct skb_tstamp_cookie *tscookies;
 			};
 		};
 		struct rb_node	rbnode; /* used in netem & tcp stack */
@@ -2861,6 +2871,8 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		     struct skb_shared_hwtstamps *hwtstamps,
 		     struct sock *sk, int tstype);
 
+bool skb_has_tscookies(struct sk_buff *skb);
+
 /**
  * skb_tstamp_tx - queue clone of skb with send time stamps
  * @orig_skb:	the original outgoing packet
diff --git a/include/net/sock.h b/include/net/sock.h
index 9729171..de190d8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2149,7 +2149,8 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
 	 */
 	if (sock_flag(sk, SOCK_RCVTSTAMP) ||
 	    (sk->sk_tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) ||
-	    (kt.tv64 && sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
+	    ((kt.tv64 || skb_has_tscookies(skb)) &&
+	     sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
 	    (hwtstamps->hwtstamp.tv64 &&
 	     (sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE)))
 		__sock_recv_timestamp(msg, sk, skb);
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1..ab67bf0 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -41,6 +41,7 @@ enum {
 	SCM_TSTAMP_SND,		/* driver passed skb to NIC, or HW */
 	SCM_TSTAMP_SCHED,	/* data entered the packet scheduler */
 	SCM_TSTAMP_ACK,		/* data acknowledged by peer */
+	SCM_TSTAMP_HW,		/* internal use: HW generated */
 };
 
 #endif /* _UAPI_LINUX_ERRQUEUE_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e5f4c06..c41597f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3581,6 +3581,19 @@ int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer)
 }
 EXPORT_SYMBOL_GPL(skb_cow_data);
 
+static void skb_destructor_tscookies(struct sk_buff *skb)
+{
+	struct skb_tstamp_cookie *prev, *cur = skb->tscookies;
+
+	while (cur) {
+		prev = cur;
+		cur = cur->next;
+		kfree(prev);
+	}
+	skb->tscookies = NULL;
+	skb->destructor = NULL;
+}
+
 static void sock_rmem_free(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
@@ -3588,6 +3601,12 @@ static void sock_rmem_free(struct sk_buff *skb)
 	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
 }
 
+static void sock_rmem_free_tscookies(struct sk_buff *skb)
+{
+	skb_destructor_tscookies(skb);
+	sock_rmem_free(skb);
+}
+
 /*
  * Note: We dont mem charge error packets (no sk_forward_alloc changes)
  */
@@ -3597,9 +3616,13 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
 	    (unsigned int)sk->sk_rcvbuf)
 		return -ENOMEM;
 
-	skb_orphan(skb);
+	if (skb_has_tscookies(skb)) {
+		skb->destructor = sock_rmem_free_tscookies;
+	} else {
+		skb_orphan(skb);
+		skb->destructor = sock_rmem_free;
+	}
 	skb->sk = sk;
-	skb->destructor = sock_rmem_free;
 	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
 
 	/* before exiting rcu section, make sure dst is refcounted */
@@ -3666,23 +3689,78 @@ struct sk_buff *skb_clone_sk(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(skb_clone_sk);
 
-static void __skb_complete_tx_timestamp(struct sk_buff *skb,
-					struct sock *sk,
-					int tstype)
+bool skb_has_tscookies(struct sk_buff *skb)
+{
+	return skb->destructor == skb_destructor_tscookies ||
+	       skb->destructor == sock_rmem_free_tscookies;
+}
+EXPORT_SYMBOL(skb_has_tscookies);
+
+static bool __skb_queue_tstamp_cookie(struct sk_buff *skb, struct sock *sk,
+				      int tstype, u32 tskey, bool is_hw)
+{
+	struct sk_buff_head *q = &sk->sk_error_queue;
+	struct skb_tstamp_cookie *new;
+	struct sk_buff *qskb;
+	unsigned long flags;
+	bool queued = false;
+
+	if (skb->destructor)
+		return false;
+
+	new = kzalloc(sizeof(*new), GFP_ATOMIC);
+	if (!new)
+		return false;
+
+	new->tskey = tskey;
+	if (unlikely(is_hw)) {
+		new->tstype = SCM_TSTAMP_HW;
+		new->tstamp = skb_hwtstamps(skb)->hwtstamp;
+	} else {
+		new->tstype = tstype;
+		new->tstamp = skb->tstamp;
+	}
+
+	spin_lock_irqsave(&q->lock, flags);
+	qskb = skb_peek(&sk->sk_error_queue);
+	if (qskb && skb_has_tscookies(qskb)) {
+		new->next = qskb->tscookies;
+		qskb->tscookies = new;
+		queued = true;
+	}
+	spin_unlock_irqrestore(&q->lock, flags);
+	if (queued) {
+		consume_skb(skb);
+		return true;
+	}
+
+	skb->tscookies = new;
+	skb->destructor = skb_destructor_tscookies;
+	return false;
+}
+
+static void __skb_complete_tx_timestamp(struct sk_buff *skb, struct sock *sk,
+					int tstype, bool is_hw)
 {
 	struct sock_exterr_skb *serr;
-	int err;
+	int err, tskey = 0;
+
+	if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) {
+		tskey = skb_shinfo(skb)->tskey;
+		if (sk->sk_protocol == IPPROTO_TCP)
+			tskey -= sk->sk_tskey;
+	}
+
+	if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_TSONLY &&
+	    __skb_queue_tstamp_cookie(skb, sk, tstype, tskey, is_hw))
+		return;
 
 	serr = SKB_EXT_ERR(skb);
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = ENOMSG;
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
 	serr->ee.ee_info = tstype;
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) {
-		serr->ee.ee_data = skb_shinfo(skb)->tskey;
-		if (sk->sk_protocol == IPPROTO_TCP)
-			serr->ee.ee_data -= sk->sk_tskey;
-	}
+	serr->ee.ee_data = tskey;
 
 	err = sock_queue_err_skb(sk, skb);
 
@@ -3708,7 +3786,7 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
 	sock_hold(sk);
 
 	*skb_hwtstamps(skb) = *hwtstamps;
-	__skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND);
+	__skb_complete_tx_timestamp(skb, sk, SCM_TSTAMP_SND, true);
 
 	sock_put(sk);
 }
@@ -3741,7 +3819,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	else
 		skb->tstamp = ktime_get_real();
 
-	__skb_complete_tx_timestamp(skb, sk, tstype);
+	__skb_complete_tx_timestamp(skb, sk, tstype, hwtstamps);
 }
 EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
 
diff --git a/net/socket.c b/net/socket.c
index a2c33a4..6595108 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -676,9 +676,63 @@ int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(kernel_sendmsg);
 
-/*
- * called from sock_recv_timestamp() if sock_flag(sk, SOCK_RCVTSTAMP)
- */
+static bool __ts_allow_report(struct sock *sk, int tstype)
+{
+	if (tstype == SCM_TSTAMP_HW)
+		return sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE;
+	else
+		return sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE;
+}
+
+static void __ts_generate_serr(struct msghdr *msg, struct sock *sk,
+			       struct skb_tstamp_cookie *cur)
+{
+	struct sock_extended_err serr;
+
+	memset(&serr, 0, sizeof(serr));
+
+	serr.ee_errno = ENOMSG;
+	serr.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
+	serr.ee_data = cur->tskey;
+	serr.ee_info = cur->tstype;
+
+	/* work around legacy interface: HW reports SND with data in tss[2] */
+	if (serr.ee_info == SCM_TSTAMP_HW)
+		serr.ee_info = SCM_TSTAMP_SND;
+
+	if (sk->sk_family == AF_INET)
+		put_cmsg(msg, SOL_IP, IP_RECVERR, sizeof(serr), &serr);
+	else if (sk->sk_family == AF_INET6)
+		put_cmsg(msg, SOL_IPV6, IPV6_RECVERR, sizeof(serr), &serr);
+	else
+		net_warn_ratelimited("tscookie: unknown proto %x",
+				     sk->sk_family);
+}
+
+static void __ts_generate_tss(struct msghdr *msg, struct skb_tstamp_cookie *cur)
+{
+	struct scm_timestamping tss;
+	int idx = cur->tstype == SCM_TSTAMP_HW ? 2 : 0;
+
+	memset(&tss, 0, sizeof(tss));
+	tss.ts[idx] = ktime_to_timespec(cur->tstamp);
+	put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING, sizeof(tss), &tss);
+}
+
+static void __sock_recv_timestamp_cookies(struct msghdr *msg, struct sock *sk,
+					  struct skb_tstamp_cookie *cookie)
+{
+	while (cookie) {
+		if (__ts_allow_report(sk, cookie->tstype)) {
+			__ts_generate_tss(msg, cookie);
+			/* caller (e.g., ip_recv_error) generates last serr */
+			if (cookie->next)
+				__ts_generate_serr(msg, sk, cookie);
+		}
+		cookie = cookie->next;
+	}
+}
+
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	struct sk_buff *skb)
 {
@@ -688,6 +742,10 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	struct skb_shared_hwtstamps *shhwtstamps =
 		skb_hwtstamps(skb);
 
+	if (skb_has_tscookies(skb)) {
+		__sock_recv_timestamp_cookies(msg, sk, skb->tscookies);
+		return;
+	}
 	/* Race occurred between timestamp enabling and packet
 	   receiving.  Fill in the current time for now. */
 	if (need_software_tstamp && skb->tstamp.tv64 == 0)
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related
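
The cookie handling in the patch above comes down to a singly linked
list: each new timestamp is prepended to the chain hanging off the skb
at the head of the error queue, and the destructor walks the chain to
free it. A minimal user-space sketch of that pattern follows. The names
ts_cookie, ts_cookie_push and ts_cookie_free_all are hypothetical
stand-ins, not kernel symbols, and calloc()/free() replace
kzalloc()/kfree(); locking is left out entirely.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for struct skb_tstamp_cookie. */
struct ts_cookie {
	struct ts_cookie *next;
	uint32_t tskey;
	int tstype;
};

/* Prepend a cookie to the chain, as the patch does with qskb->tscookies.
 * Returns 0 on success, -1 if allocation fails (chain left untouched). */
int ts_cookie_push(struct ts_cookie **head, uint32_t tskey, int tstype)
{
	struct ts_cookie *new = calloc(1, sizeof(*new));

	if (!new)
		return -1;
	new->tskey = tskey;
	new->tstype = tstype;
	new->next = *head;
	*head = new;
	return 0;
}

/* Walk and free the chain, mirroring skb_destructor_tscookies().
 * Returns the number of cookies freed. */
int ts_cookie_free_all(struct ts_cookie **head)
{
	struct ts_cookie *cur = *head;
	int freed = 0;

	while (cur) {
		struct ts_cookie *prev = cur;

		cur = cur->next;
		free(prev);
		freed++;
	}
	*head = NULL;
	return freed;
}
```

Because entries are prepended, a reader walking the chain sees the most
recently queued timestamp first.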

* [PATCH net-next RFC 3/5] net-timestamp: no-payload option in txtimestamp test
From: Willem de Bruijn @ 2015-01-09 17:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, richardcochran, eric.dumazet, luto, Willem de Bruijn
In-Reply-To: <1420824719-28848-1-git-send-email-willemb@google.com>

From: Willem de Bruijn <willemb@google.com>

Demonstrate how SOF_TIMESTAMPING_OPT_TSONLY can be used and
test the implementation.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 .../networking/timestamping/txtimestamp.c          | 28 ++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/timestamping/txtimestamp.c b/Documentation/networking/timestamping/txtimestamp.c
index 8778e68..1171ebf 100644
--- a/Documentation/networking/timestamping/txtimestamp.c
+++ b/Documentation/networking/timestamping/txtimestamp.c
@@ -68,6 +68,7 @@ static int do_ipv6 = 1;
 static int cfg_payload_len = 10;
 static bool cfg_show_payload;
 static bool cfg_do_pktinfo;
+static bool cfg_loop_nodata;
 static uint16_t dest_port = 9000;
 
 static struct sockaddr_in daddr;
@@ -139,6 +140,9 @@ static void print_payload(char *data, int len)
 {
 	int i;
 
+	if (!len)
+		return;
+
 	if (len > 70)
 		len = 70;
 
@@ -175,6 +179,7 @@ static void __recv_errmsg_cmsg(struct msghdr *msg, int payload_len)
 	struct sock_extended_err *serr = NULL;
 	struct scm_timestamping *tss = NULL;
 	struct cmsghdr *cm;
+	int batch = 0;
 
 	for (cm = CMSG_FIRSTHDR(msg);
 	     cm && cm->cmsg_len;
@@ -207,10 +212,18 @@ static void __recv_errmsg_cmsg(struct msghdr *msg, int payload_len)
 		} else
 			fprintf(stderr, "unknown cmsg %d,%d\n",
 					cm->cmsg_level, cm->cmsg_type);
+
+		if (serr && tss) {
+			print_timestamp(tss, serr->ee_info, serr->ee_data,
+					payload_len);
+			serr = NULL;
+			tss = NULL;
+			batch++;
+		}
 	}
 
-	if (serr && tss)
-		print_timestamp(tss, serr->ee_info, serr->ee_data, payload_len);
+	if (batch > 1)
+		fprintf(stderr, "batched %d timestamps\n", batch);
 }
 
 static int recv_errmsg(int fd)
@@ -242,7 +255,7 @@ static int recv_errmsg(int fd)
 	if (ret == -1 && errno != EAGAIN)
 		error(1, errno, "recvmsg");
 
-	if (ret > 0) {
+	if (ret >= 0) {
 		__recv_errmsg_cmsg(&msg, ret);
 		if (cfg_show_payload)
 			print_payload(data, cfg_payload_len);
@@ -307,6 +320,9 @@ static void do_test(int family, unsigned int opt)
 	opt |= SOF_TIMESTAMPING_SOFTWARE |
 	       SOF_TIMESTAMPING_OPT_CMSG |
 	       SOF_TIMESTAMPING_OPT_ID;
+	if (cfg_loop_nodata)
+		opt |= SOF_TIMESTAMPING_OPT_TSONLY;
+
 	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
 		       (char *) &opt, sizeof(opt)))
 		error(1, 0, "setsockopt timestamping");
@@ -376,6 +392,7 @@ static void __attribute__((noreturn)) usage(const char *filepath)
 			"  -h:   show this message\n"
 			"  -I:   request PKTINFO\n"
 			"  -l N: send N bytes at a time\n"
+			"  -n:   set no-payload option\n"
 			"  -r:   use raw\n"
 			"  -R:   use raw (IP_HDRINCL)\n"
 			"  -p N: connect to port N\n"
@@ -390,7 +407,7 @@ static void parse_opt(int argc, char **argv)
 	int proto_count = 0;
 	char c;
 
-	while ((c = getopt(argc, argv, "46hIl:p:rRux")) != -1) {
+	while ((c = getopt(argc, argv, "46hIl:np:rRux")) != -1) {
 		switch (c) {
 		case '4':
 			do_ipv6 = 0;
@@ -401,6 +418,9 @@ static void parse_opt(int argc, char **argv)
 		case 'I':
 			cfg_do_pktinfo = true;
 			break;
+		case 'n':
+			cfg_loop_nodata = true;
+			break;
 		case 'r':
 			proto_count++;
 			cfg_proto = SOCK_RAW;
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related
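
With SOF_TIMESTAMPING_OPT_TSONLY the looped skb carries no data, so
recvmsg() on the error queue returns 0 for a valid timestamp message;
that is why the test above changes "ret > 0" to "ret >= 0". Below is a
small sketch of that return-value handling, assuming a non-blocking
read. errq_classify and errq_read_once are hypothetical helper names,
not part of txtimestamp.c.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

enum errq_status { ERRQ_EMPTY, ERRQ_MESSAGE, ERRQ_FAILURE };

/* Interpret a recvmsg() result: a zero-length read is still a message. */
enum errq_status errq_classify(ssize_t ret, int err)
{
	if (ret >= 0)
		return ERRQ_MESSAGE;
	if (err == EAGAIN || err == EWOULDBLOCK)
		return ERRQ_EMPTY;
	return ERRQ_FAILURE;
}

/* Read one message from the socket error queue without blocking.
 * Control data (timestamps, extended error) lands in ctrl; *len
 * receives the payload length, which may be 0 in no-payload mode. */
enum errq_status errq_read_once(int fd, void *ctrl, size_t ctrl_len,
				ssize_t *len)
{
	char data[256];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg;
	ssize_t ret;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl;
	msg.msg_controllen = ctrl_len;

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT);
	if (ret >= 0)
		*len = ret;
	return errq_classify(ret, errno);
}
```

A caller would loop on errq_read_once() until it reports ERRQ_EMPTY and
parse the control buffer for each ERRQ_MESSAGE, whatever the payload
length.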

* [PATCH net-next RFC 2/5] net-timestamp: no-payload only sysctl
From: Willem de Bruijn @ 2015-01-09 17:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, richardcochran, eric.dumazet, luto, Willem de Bruijn
In-Reply-To: <1420824719-28848-1-git-send-email-willemb@google.com>

From: Willem de Bruijn <willemb@google.com>

Tx timestamps are looped onto the error queue on top of an skb. This
mechanism leaks packet headers to processes unless the no-payload
option SOF_TIMESTAMPING_OPT_TSONLY is set.

Add a sysctl that optionally drops looped timestamps with data for
unprivileged users.

The policy is checked when timestamps are generated in the stack.
It is possible for timestamps with data to be reported after the
sysctl is set, if these were queued internally earlier.

No vulnerability is immediately known that exploits knowledge
gleaned from packet headers, but it may still be preferable to allow
administrators to lock down this path at the cost of possible
breakage of legacy applications.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/sock.h         |  1 +
 net/core/skbuff.c          | 11 ++++++++++-
 net/core/sock.c            |  3 +++
 net/core/sysctl_net_core.c |  9 +++++++++
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..9729171 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2262,6 +2262,7 @@ bool sk_net_capable(const struct sock *sk, int cap);
 extern __u32 sysctl_wmem_max;
 extern __u32 sysctl_rmem_max;
 
+extern int sysctl_tstamp_allow_data;
 extern int sysctl_optmem_max;
 
 extern __u32 sysctl_wmem_default;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ece2bb8..e5f4c06 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3690,11 +3690,20 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb,
 		kfree_skb(skb);
 }
 
+static bool skb_may_tx_timestamp(struct sock *sk)
+{
+	return sysctl_tstamp_allow_data || capable(CAP_NET_RAW) ||
+	       sk->sk_tsflags & SOF_TIMESTAMPING_OPT_TSONLY;
+}
+
 void skb_complete_tx_timestamp(struct sk_buff *skb,
 			       struct skb_shared_hwtstamps *hwtstamps)
 {
 	struct sock *sk = skb->sk;
 
+	if (!skb_may_tx_timestamp(sk))
+		return;
+
 	/* take a reference to prevent skb_orphan() from freeing the socket */
 	sock_hold(sk);
 
@@ -3712,7 +3721,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	struct sk_buff *skb;
 	bool tsonly = sk->sk_tsflags & SOF_TIMESTAMPING_OPT_TSONLY;
 
-	if (!sk)
+	if (!sk || !skb_may_tx_timestamp(sk))
 		return;
 
 	if (tsonly)
diff --git a/net/core/sock.c b/net/core/sock.c
index 1c7a33d..93c8b20 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -325,6 +325,8 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+int sysctl_tstamp_allow_data __read_mostly = 1;
+
 struct static_key memalloc_socks = STATIC_KEY_INIT_FALSE;
 EXPORT_SYMBOL_GPL(memalloc_socks);
 
@@ -840,6 +842,7 @@ set_rcvbuf:
 			ret = -EINVAL;
 			break;
 		}
+
 		if (val & SOF_TIMESTAMPING_OPT_ID &&
 		    !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
 			if (sk->sk_protocol == IPPROTO_TCP) {
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 31baba2..fde21d1 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -321,6 +321,15 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "tstamp_allow_data",
+		.data		= &sysctl_tstamp_allow_data,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one
+	},
 #ifdef CONFIG_RPS
 	{
 		.procname	= "rps_sock_flow_entries",
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related
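
The policy check in the patch above is a three-way OR: looping is
allowed if the sysctl permits data, if the caller has CAP_NET_RAW, or
if the socket asked for timestamps only. A user-space model of that
predicate, with the kernel state passed in as plain booleans, might
look as follows. The function name may_tx_timestamp and its parameters
are illustrative only, not kernel API.

```c
#include <stdbool.h>

/* Model of skb_may_tx_timestamp(): true when a tx timestamp may be
 * looped onto the error queue. */
bool may_tx_timestamp(bool sysctl_allow_data, bool has_cap_net_raw,
		      bool sock_tsonly)
{
	return sysctl_allow_data || has_cap_net_raw || sock_tsonly;
}
```

With the sysctl at its default of 1 nothing changes. Once it is set to
0, only callers with CAP_NET_RAW and sockets using
SOF_TIMESTAMPING_OPT_TSONLY still pass the check.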

* [PATCH net-next RFC 1/5] net-timestamp: no-payload option
From: Willem de Bruijn @ 2015-01-09 17:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, richardcochran, eric.dumazet, luto, Willem de Bruijn
In-Reply-To: <1420824719-28848-1-git-send-email-willemb@google.com>

From: Willem de Bruijn <willemb@google.com>

Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
timestamps, this loops timestamps on top of empty packets.

Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
cmsg reception (aside from timestamps) are no longer possible. This
works together with a follow-on patch that lets administrators
allow tx timestamping only if it does not loop payload or metadata.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/uapi/linux/net_tstamp.h |  3 ++-
 net/core/skbuff.c               | 19 ++++++++++++++-----
 net/ipv4/ip_sockglue.c          |  9 +++++----
 net/ipv6/datagram.c             |  4 ++--
 net/rxrpc/ar-error.c            |  5 +++++
 5 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index edbc888..6d1abea 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -24,8 +24,9 @@ enum {
 	SOF_TIMESTAMPING_TX_SCHED = (1<<8),
 	SOF_TIMESTAMPING_TX_ACK = (1<<9),
 	SOF_TIMESTAMPING_OPT_CMSG = (1<<10),
+	SOF_TIMESTAMPING_OPT_TSONLY = (1<<11),
 
-	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_CMSG,
+	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_TSONLY,
 	SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
 				 SOF_TIMESTAMPING_LAST
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5a2a2e8..ece2bb8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3710,19 +3710,28 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		     struct sock *sk, int tstype)
 {
 	struct sk_buff *skb;
+	bool tsonly = sk->sk_tsflags & SOF_TIMESTAMPING_OPT_TSONLY;
 
 	if (!sk)
 		return;
 
-	if (hwtstamps)
-		*skb_hwtstamps(orig_skb) = *hwtstamps;
+	if (tsonly)
+		skb = alloc_skb(0, GFP_ATOMIC);
 	else
-		orig_skb->tstamp = ktime_get_real();
-
-	skb = skb_clone(orig_skb, GFP_ATOMIC);
+		skb = skb_clone(orig_skb, GFP_ATOMIC);
 	if (!skb)
 		return;
 
+	if (tsonly) {
+		skb_shinfo(skb)->tx_flags = skb_shinfo(orig_skb)->tx_flags;
+		skb_shinfo(skb)->tskey = skb_shinfo(orig_skb)->tskey;
+	}
+
+	if (hwtstamps)
+		*skb_hwtstamps(skb) = *hwtstamps;
+	else
+		skb->tstamp = ktime_get_real();
+
 	__skb_complete_tx_timestamp(skb, sk, tstype);
 }
 EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index a317797..d81ef70 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -440,7 +440,7 @@ static bool ipv4_pktinfo_prepare_errqueue(const struct sock *sk,
 
 	if ((ee_origin != SO_EE_ORIGIN_TIMESTAMPING) ||
 	    (!(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_CMSG)) ||
-	    (!skb->dev))
+	    (!skb->dev) || (!skb->len))
 		return false;
 
 	info->ipi_spec_dst.s_addr = ip_hdr(skb)->saddr;
@@ -483,7 +483,7 @@ int ip_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 
 	serr = SKB_EXT_ERR(skb);
 
-	if (sin) {
+	if (sin && skb->len) {
 		sin->sin_family = AF_INET;
 		sin->sin_addr.s_addr = *(__be32 *)(skb_network_header(skb) +
 						   serr->addr_offset);
@@ -496,8 +496,9 @@ int ip_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 	sin = &errhdr.offender;
 	sin->sin_family = AF_UNSPEC;
 
-	if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
-	    ipv4_pktinfo_prepare_errqueue(sk, skb, serr->ee.ee_origin)) {
+	if (skb->len &&
+	    (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
+	     ipv4_pktinfo_prepare_errqueue(sk, skb, serr->ee.ee_origin))) {
 		struct inet_sock *inet = inet_sk(sk);
 
 		sin->sin_family = AF_INET;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 100c589..91a31ea 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -369,7 +369,7 @@ int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 
 	serr = SKB_EXT_ERR(skb);
 
-	if (sin) {
+	if (sin && skb->len) {
 		const unsigned char *nh = skb_network_header(skb);
 		sin->sin6_family = AF_INET6;
 		sin->sin6_flowinfo = 0;
@@ -394,7 +394,7 @@ int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 	memcpy(&errhdr.ee, &serr->ee, sizeof(struct sock_extended_err));
 	sin = &errhdr.offender;
 	sin->sin6_family = AF_UNSPEC;
-	if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
+	if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL && skb->len) {
 		sin->sin6_family = AF_INET6;
 		sin->sin6_flowinfo = 0;
 		sin->sin6_port = 0;
diff --git a/net/rxrpc/ar-error.c b/net/rxrpc/ar-error.c
index 74c0fcd..5394b6b 100644
--- a/net/rxrpc/ar-error.c
+++ b/net/rxrpc/ar-error.c
@@ -42,6 +42,11 @@ void rxrpc_UDP_error_report(struct sock *sk)
 		_leave("UDP socket errqueue empty");
 		return;
 	}
+	if (!skb->len) {
+		_leave("UDP empty message");
+		kfree_skb(skb);
+		return;
+	}
 
 	rxrpc_new_skb(skb);
 
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related
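
From user space, the option added by the patch above would be requested
through SO_TIMESTAMPING like any other flag. The sketch below assumes
kernel headers recent enough to define SOF_TIMESTAMPING_OPT_TSONLY (the
patch assigns it 1<<11). The helper names tx_tstamp_flags and
enable_tx_timestamping are hypothetical, and the choice of software tx
timestamps with OPT_ID is just one plausible combination.

```c
#include <stdbool.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Build the SO_TIMESTAMPING flag word for software tx timestamps,
 * optionally asking the kernel to loop timestamps without payload. */
unsigned int tx_tstamp_flags(bool no_payload)
{
	unsigned int flags = SOF_TIMESTAMPING_TX_SOFTWARE |
			     SOF_TIMESTAMPING_SOFTWARE |
			     SOF_TIMESTAMPING_OPT_ID;

	if (no_payload)
		flags |= SOF_TIMESTAMPING_OPT_TSONLY;
	return flags;
}

/* Apply the flags to a socket; returns 0 on success, -1 with errno set. */
int enable_tx_timestamping(int fd, bool no_payload)
{
	unsigned int flags = tx_tstamp_flags(no_payload);

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```

An application that only needs the timestamps, not the looped packet,
would call enable_tx_timestamping(fd, true) and expect zero-length
messages on the error queue.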

* [PATCH net-next RFC 0/5] net-timestamp: address blinding and batching
From: Willem de Bruijn @ 2015-01-09 17:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, richardcochran, eric.dumazet, luto, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Two issues were raised during recent timestamping discussions:
1. looping full packets on the error queue exposes packet headers
2. TCP timestamping with retransmissions generates many timestamps

This RFC patchset is an attempt at addressing both without breaking
legacy behavior.

Patch 1 reintroduces the "no payload" timestamp option, which loops
timestamps onto an empty skb. Patch 2 then gives administrators the
power to block all timestamp requests by unprivileged users that
contain data. I proposed this earlier as a backward compatible
workaround in the discussion of

  net-timestamp: pull headers for SOCK_STREAM
  http://patchwork.ozlabs.org/patch/414810/

Patch 3 only updates the txtimestamp example to test this option.

When looping timestamps without data, there is no need to associate
a timestamp with a specific packet. Patch 4 loops multiple timestamps
onto a single outstanding packet if this has no payload. It is a
variant of the cookies approach that David proposed in

  net-timestamp: TCP timestamping
  http://patchwork.ozlabs.org/patch/376513/

That patch turns out to introduce quite a bit of code to save
relatively few bytes because
1. no-payload already limits the per-skb size that is queued and
2. batching is limited by send() failing as soon as there is an
   outstanding packet on the error queue.
Therefore, I'm fine with dropping this. By now, it is at least
recorded in patchwork.

Patch 5, finally, creates a new short SO_TIMESTAMPING option,
SOF_TIMESTAMPING_TX, that combines all recent options, as a push
to get future applications to use the new ID and no-payload based
API by default.

Willem de Bruijn (5):
  net-timestamp: no-payload option
  net-timestamp: no-payload only sysctl
  net-timestamp: no-payload option in txtimestamp test
  net-timestamp: tx timestamp cookies
  net-timestamp: tx timestamping default mode flag

 .../networking/timestamping/txtimestamp.c          |  28 ++++-
 include/linux/skbuff.h                             |  12 ++
 include/net/sock.h                                 |   4 +-
 include/uapi/linux/errqueue.h                      |   1 +
 include/uapi/linux/net_tstamp.h                    |  11 +-
 net/core/skbuff.c                                  | 134 ++++++++++++++++++---
 net/core/sock.c                                    |   3 +
 net/core/sysctl_net_core.c                         |   9 ++
 net/ipv4/ip_sockglue.c                             |   9 +-
 net/ipv6/datagram.c                                |   4 +-
 net/rxrpc/ar-error.c                               |   5 +
 net/socket.c                                       |  64 +++++++++-
 12 files changed, 250 insertions(+), 34 deletions(-)

-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply

* Re: [net-next PATCH v1 00/11] A flow API
From: John Fastabend @ 2015-01-09 17:26 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Thomas Graf, Scott Feldman, Jiří Pírko,
	Jamal Hadi Salim, simon.horman, Linux Netdev List, David Miller,
	Andy Gospodarek
In-Reply-To: <CAJ3xEMjb+pG3TR+r0huNVSGkuPE9i7EZeHO4SQkcfm=WA6zMeA@mail.gmail.com>

On 01/08/2015 07:14 AM, Or Gerlitz wrote:
> On Wed, Dec 31, 2014 at 9:45 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> For some examples and maybe a bit more illustrative description I
>> posted a quickly typed up set of notes on github io pages. Here we
>> can show the description along with images produced by the flow tool
>> showing the pipeline. Once we settle a bit more on the API we should
>> probably do a clean up of this and other threads happening and commit
>> something to the Documentation directory.
>>
>>   http://jrfastab.github.io/jekyll/update/2014/12/21/flow-api.html
>
> John, Going over your excellent tutorial, the distinction between
> action and operation isn’t clear... specifically, the paragraph
> “Although this gives us a list of actions we can perform on a packet
> and a set of argument to give the action so we can use them it does
> not supply the operations performed on the packet” is a bit vague.
>
> Or.
>

Agreed, that is a bit confusing. What I was trying to show is that if
two hardware devices give you the same action under different names,
showing that they are equivalent is not possible with the current API.
So either (a) you need to enforce that every device names its actions
correctly or (b) provide a mechanism to describe the actions so we
can evaluate their equivalence.

It's actually worse than this. What I want to eventually show is the
case where device A supports one set of actions and device B supports
another. I want to be able to say things about the devices, like:
device A can support any action B can do, but it may require
applying two actions from A's collection of actions. (Clear as mud?)
I'll try to clear it up in the documentation.

Thanks for looking it over.
.John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH] net: wireless: rtlwifi: btcoexist: halbtc8821a2ant: Remove some unused functions
From: Rickard Strandqvist @ 2015-01-09 17:23 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Larry Finger, Chaoming Li, Greg Kroah-Hartman, Fengguang Wu,
	linux-wireless@vger.kernel.org, Network Development,
	Linux Kernel Mailing List
In-Reply-To: <871tn7zsie.fsf@kamboji.qca.qualcomm.com>

Hi All

Yes, I use sed as Julian said. I have it as part of my git send-email
script; until a few weeks ago, it was enough that I removed the
"drivers/" and changed all "/" to ": ".
I have now expanded my sed pipe a lot (tell me if anyone is interested).
But a real script to do this would have been very good, I think!

It may be that there were some mail problems during the holiday season :-(
There is no total consensus on how patches should be organized.
I have previously received complaints that my patches were too large
and touched too many files.
I have since tried to make small patches, which also makes it easier
to have more specific paths in the subject line.

But I'll try to gather everything into one patchset instead this weekend.


Kind regards
Rickard Strandqvist



2015-01-06 19:14 GMT+01:00 Kalle Valo <kvalo@codeaurora.org>:
> Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> writes:
>
>> Removes some functions that are not used anywhere:
>> ex_halbtc8821a2ant_periodical() ex_halbtc8821a2ant_halt_notify()
>> ex_halbtc8821a2ant_bt_info_notify()
>> ex_halbtc8821a2ant_special_packet_notify()
>> ex_halbtc8821a2ant_connect_notify() ex_halbtc8821a2ant_scan_notify()
>> ex_halbtc8821a2ant_lps_notify() ex_halbtc8821a2ant_ips_notify()
>> ex_halbtc8821a2ant_display_coex_info() ex_halbtc8821a2ant_init_coex_dm()
>> ex_halbtc8821a2ant_init_hwconfig()
>>
>> This was partially found by using a static code analysis program called cppcheck.
>>
>> Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
>
> Rickard, I have dropped all your patches because I lost track which I
> should apply and which not and I do not want to waste half an hour
> figuring out. Please resend the ones which are still valid.
>
> And to make it easier for everyone please group these wireless-drivers
> cleanup patches into a larger patchset (like 10-15 patches max per
> patchset). That way it's a lot easier for me to manage them. Sending one
> patch a day is not recommended, especially when we are talking trivial
> cleanup patches like this.
>
> --
> Kalle Valo

^ permalink raw reply

* Re: Fw: iproute2: segfault with ip link show dev
From: Vadim Kochan @ 2015-01-09 16:59 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, vadim4j
In-Reply-To: <20150107114217.42bf2392@urahara>

On Wed, Jan 07, 2015 at 11:42:17AM -0800, Stephen Hemminger wrote:
> Looks like one of the VF info changes broke old code
> 
> Begin forwarded message:
> 
> Date: Wed, 7 Jan 2015 04:06:53 -0800
> From: William Dauchy <william@gandi.net>
> To: "stephen@networkplumber.org" <stephen@networkplumber.org>
> Subject: iproute2: segfault with ip link show dev
> 
> 
> Hi,
> 
> I was using iproute2 3.15.
> My network card is using the igb driver with VFs enabled, e.g. igb.max_vfs=2
> After upgrading to 3.16, I have now a segfault while doing a
> usual `ip link show dev eth1`.
> Disabling VFs makes the segfault disappear.
> 
> Here is the gdb trace even if it does not contain much info.
> The segfault occurs at the VF step.
> 
> (gdb) break print_linkinfo
> Breakpoint 1 at 0x40782d
> (gdb) set args link show dev eth1
> (gdb) r
> Starting program: /sbin/ip link show dev eth1
> 
> Breakpoint 1, 0x000000000040782d in print_linkinfo ()
> (gdb) n
> Single stepping until exit from function print_linkinfo,
> which has no line number information.
> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP
> mode DEFAULT group default qlen 10000
>     link/ether 00:26:6c:ff:b5:c1 brd ff:ff:ff:ff:ff:ff
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x00000000004070eb in print_vfinfo ()
> (gdb) bt
> #0  0x00000000004070eb in print_vfinfo ()
> #1  0x0000000000407f9f in print_linkinfo ()
> #2  0x000000000041f266 in iplink_get ()
> #3  0x0000000000409c69 in ipaddr_list_flush_or_save ()
> #4  0x000000000040a113 in ipaddr_list_link ()
> #5  0x00000000004203f6 in do_iplink ()
> #6  0x0000000000405a07 in do_cmd ()
> #7  0x000000000040621e in main ()
> 
> The expected output is for example:
> 
> # ip link show dev eth1
> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 10000
>     link/ether 00:26:6c:ff:b3:8d brd ff:ff:ff:ff:ff:ff
>     vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
>     vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
> 
> 
> I'm using my own kernel build, a stable v3.14.x
> 
> Regards,
> -- 
> William

I have a working patch which was tested by William and will send it
soon.

Regards,
Vadim Kochan

^ permalink raw reply

* Re: [PATCH RESEND 2/2] wlcore: align member-assigns in a structure-copy block
From: Kalle Valo @ 2015-01-09 17:03 UTC (permalink / raw)
  To: Giel van Schijndel
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eliad Peller,
	John W. Linville, Arik Nemtsov, open list:TI WILINK WIRELES...,
	open list:NETWORKING DRIVERS
In-Reply-To: <1420659525-22975-2-git-send-email-me-sZ9Uef1cvPWHXe+LvDLADg@public.gmane.org>

Giel van Schijndel <me-sZ9Uef1cvPWHXe+LvDLADg@public.gmane.org> writes:

> This highlights the differences (e.g. the bug fixed in the previous
> commit).
>
> Signed-off-by: Giel van Schijndel <me-sZ9Uef1cvPWHXe+LvDLADg@public.gmane.org>
> ---
>  drivers/net/wireless/ti/wlcore/acx.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/wireless/ti/wlcore/acx.c b/drivers/net/wireless/ti/wlcore/acx.c
> index f28fa3b..93a2fa8 100644
> --- a/drivers/net/wireless/ti/wlcore/acx.c
> +++ b/drivers/net/wireless/ti/wlcore/acx.c
> @@ -1715,17 +1715,17 @@ int wl12xx_acx_config_hangover(struct wl1271 *wl)
>  		goto out;
>  	}
>  
> -	acx->recover_time = cpu_to_le32(conf->recover_time);
> -	acx->hangover_period = conf->hangover_period;
> -	acx->dynamic_mode = conf->dynamic_mode;
> -	acx->early_termination_mode = conf->early_termination_mode;
> -	acx->max_period = conf->max_period;
> -	acx->min_period = conf->min_period;
> -	acx->increase_delta = conf->increase_delta;
> -	acx->decrease_delta = conf->decrease_delta;
> -	acx->quiet_time = conf->quiet_time;
> -	acx->increase_time = conf->increase_time;
> -	acx->window_size = conf->window_size;
> +	acx->recover_time               = cpu_to_le32(conf->recover_time);
> +	acx->hangover_period            = conf->hangover_period;
> +	acx->dynamic_mode               = conf->dynamic_mode;
> +	acx->early_termination_mode     = conf->early_termination_mode;
> +	acx->max_period                 = conf->max_period;
> +	acx->min_period                 = conf->min_period;
> +	acx->increase_delta             = conf->increase_delta;
> +	acx->decrease_delta             = conf->decrease_delta;
> +	acx->quiet_time                 = conf->quiet_time;
> +	acx->increase_time              = conf->increase_time;
> +	acx->window_size                = conf->window_size;

I would like to get an ACK from one of the wlcore developers if I should
apply this (or not).

-- 
Kalle Valo

^ permalink raw reply

* RE: [PATCH] i40e: don't enable and init FCOE by default when do PF reset
From: Ronciak, John @ 2015-01-09 16:41 UTC (permalink / raw)
  To: Ethan Zhao, Kirsher, Jeffrey T, Brandeburg, Jesse, Allan, Bruce W,
	Wyborny, Carolyn, Skidmore, Donald C, Rose, Gregory V,
	Vick, Matthew, Williams, Mitch A, Dev, Vasu, Parikh, Neerav
  Cc: Linux NICS, e1000-devel@lists.sourceforge.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	ethan.kernel@gmail.com, brian.maly@oracle.com
In-Reply-To: <1420821466-5747-1-git-send-email-ethan.zhao@oracle.com>

Adding Vasu and Neerav

Cheers,
John

> -----Original Message-----
> From: Ethan Zhao [mailto:ethan.zhao@oracle.com]
> Sent: Friday, January 9, 2015 8:38 AM
> To: Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W; Wyborny, Carolyn;
> Skidmore, Donald C; Rose, Gregory V; Vick, Matthew; Ronciak, John;
> Williams, Mitch A
> Cc: Linux NICS; e1000-devel@lists.sourceforge.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; ethan.kernel@gmail.com;
> brian.maly@oracle.com; Ethan Zhao
> Subject: [PATCH] i40e: don't enable and init FCOE by default when do PF
> reset
> 
> When a PF reset is done through i40e_reset_and_rebuild(), it calls
> i40e_init_pf_fcoe() unconditionally if I40E_FCOE is defined, so after a
> PF reset FCoE is enabled regardless of whether it was enabled before.
> 
> This bug can be hit when the PF resumes from suspend, when running a
> diagnostic test with ethtool, when setting up a VLAN, etc.
> 
> Build-tested against v3.19-rc3.
> 
> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index a5f2660..a2572cc 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -6180,9 +6180,12 @@ static void i40e_reset_and_rebuild(struct i40e_pf *pf, bool reinit)
>  	}
>  #endif /* CONFIG_I40E_DCB */
>  #ifdef I40E_FCOE
> -	ret = i40e_init_pf_fcoe(pf);
> -	if (ret)
> -		dev_info(&pf->pdev->dev, "init_pf_fcoe failed: %d\n", ret);
> +	if (pf->flags & I40E_FLAG_FCOE_ENABLED) {
> +		ret = i40e_init_pf_fcoe(pf);
> +		if (ret)
> +			dev_info(&pf->pdev->dev,
> +				 "init_pf_fcoe failed: %d\n", ret);
> +	}
> 
>  #endif
>  	/* do basic switch setup */
> --
> 1.8.3.1

^ permalink raw reply

* [PATCH] i40e: don't enable and init FCOE by default when do PF reset
From: Ethan Zhao @ 2015-01-09 16:37 UTC (permalink / raw)
  To: jeffrey.t.kirsher, jesse.brandeburg, bruce.w.allan,
	carolyn.wyborny, donald.c.skidmore, gregory.v.rose, matthew.vick,
	john.ronciak, mitch.a.williams
  Cc: Ethan Zhao, linux.nics, e1000-devel, netdev, brian.maly,
	linux-kernel, ethan.kernel

When a PF reset goes through i40e_reset_and_rebuild(), it calls
i40e_init_pf_fcoe() unconditionally if I40E_FCOE is defined, so after
the reset FCoE is enabled regardless of whether it was enabled
before.

This bug can be hit when the PF resumes from suspend, when running a
diagnostic test with ethtool, when setting up a VLAN, etc.

Passed building with v3.19-rc3.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index a5f2660..a2572cc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6180,9 +6180,12 @@ static void i40e_reset_and_rebuild(struct i40e_pf *pf, bool reinit)
 	}
 #endif /* CONFIG_I40E_DCB */
 #ifdef I40E_FCOE
-	ret = i40e_init_pf_fcoe(pf);
-	if (ret)
-		dev_info(&pf->pdev->dev, "init_pf_fcoe failed: %d\n", ret);
+	if (pf->flags & I40E_FLAG_FCOE_ENABLED) {
+		ret = i40e_init_pf_fcoe(pf);
+		if (ret)
+			dev_info(&pf->pdev->dev,
+				 "init_pf_fcoe failed: %d\n", ret);
+	}
 
 #endif
 	/* do basic switch setup */
-- 
1.8.3.1
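
The behavior change in the hunk above can be sketched outside the driver.
This is a minimal illustration, not the i40e code: struct demo_pf,
DEMO_FLAG_FCOE_ENABLED, demo_init_pf_fcoe() and demo_reset_and_rebuild()
are hypothetical stand-ins for pf, I40E_FLAG_FCOE_ENABLED,
i40e_init_pf_fcoe() and i40e_reset_and_rebuild(). It only shows that the
FCoE init step is skipped on rebuild unless the flag was set before the
reset.

```c
#include <stdio.h>

/* Hypothetical stand-in for I40E_FLAG_FCOE_ENABLED; the value is arbitrary. */
#define DEMO_FLAG_FCOE_ENABLED (1u << 0)

/* Hypothetical stand-in for struct i40e_pf, reduced to what the sketch needs. */
struct demo_pf {
	unsigned int flags;   /* feature flags, like pf->flags in the patch */
	int fcoe_init_calls;  /* counts how often the FCoE init step ran */
};

/* Stand-in for i40e_init_pf_fcoe(): records the call and reports success. */
static int demo_init_pf_fcoe(struct demo_pf *pf)
{
	pf->fcoe_init_calls++;
	return 0;
}

/* Rebuild path as patched: FCoE is re-initialized only if it was enabled. */
static void demo_reset_and_rebuild(struct demo_pf *pf)
{
	int ret;

	if (pf->flags & DEMO_FLAG_FCOE_ENABLED) {
		ret = demo_init_pf_fcoe(pf);
		if (ret)
			fprintf(stderr, "init_pf_fcoe failed: %d\n", ret);
	}
}
```

With the flag clear, a reset leaves fcoe_init_calls at zero. With the flag
set, each reset runs the init step once.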



^ permalink raw reply related

* pull-request: wireless-drivers 2015-01-09
From: Kalle Valo @ 2015-01-09 16:31 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, linux-kernel

Hi Dave,

here are a few more fixes for 3.19. Please let me know if there are any
problems.

Kalle

The following changes since commit 7ce67a38f799d1fb332f672b117efbadedaa5352:

  net: ethernet: cpsw: fix hangs with interrupts (2015-01-04 22:18:34 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git tags/wireless-drivers-for-davem-2015-01-09

for you to fetch changes up to c702674f99e612a22cf6e8f9b4bec341257970c0:

  Merge tag 'iwlwifi-for-kalle-2015-01-05' of https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes (2015-01-05 12:07:04 +0200)

----------------------------------------------------------------

* rtlwifi: fix a regression in large skb allocation failure

iwlwifi:

* fix for 7265D NVM check
* fixes for scan: fix long scanning times and network discovery
* new firmware API for iwlmvm supported devices
* fixes in rate control

----------------------------------------------------------------
Arik Nemtsov (1):
      iwlwifi: pcie: correctly define 7265-D cfg

David Spinadel (2):
      iwlwifi: mvm: add a flag to enable match found notification
      iwlwifi: mvm: scan dwell time corrections

Emmanuel Grumbach (2):
      iwlwifi: 7000: fix reported firmware name for 7265D
      iwlwifi: bump firmware API for mvm devices to 12

Eyal Shapira (2):
      iwlwifi: mvm: fix Rx with both chains
      iwlwifi: mvm: fix out of bounds access to tid_to_mac80211_ac

Kalle Valo (1):
      Merge tag 'iwlwifi-for-kalle-2015-01-05' of https://git.kernel.org/.../iwlwifi/iwlwifi-fixes

Larry Finger (1):
      rtlwifi: Fix error when accessing unmapped memory in skb

 drivers/net/wireless/iwlwifi/iwl-7000.c        |    6 ++---
 drivers/net/wireless/iwlwifi/iwl-8000.c        |    2 +-
 drivers/net/wireless/iwlwifi/iwl-fw-file.h     |    4 +++
 drivers/net/wireless/iwlwifi/mvm/fw-api-scan.h |    2 ++
 drivers/net/wireless/iwlwifi/mvm/scan.c        |   19 ++++++++++----
 drivers/net/wireless/iwlwifi/mvm/tx.c          |    8 ++++--
 drivers/net/wireless/iwlwifi/mvm/utils.c       |    2 +-
 drivers/net/wireless/iwlwifi/pcie/drv.c        |    4 ++-
 drivers/net/wireless/rtlwifi/pci.c             |   32 ++++++++++++++++++------
 9 files changed, 58 insertions(+), 21 deletions(-)

-- 
Kalle Valo

^ permalink raw reply

* Re: [Patch net-next] ipv6: fix redefinition of in6_pktinfo and ip6_mtuinfo
From: Carlos O'Donell @ 2015-01-09 16:27 UTC (permalink / raw)
  To: David Miller, xiyou.wangcong; +Cc: netdev, vlee
In-Reply-To: <20150108.193819.258816805892789860.davem@davemloft.net>

On 01/08/2015 10:38 PM, David Miller wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Tue,  6 Jan 2015 15:45:31 -0800
> 
>> Both netinet/in.h and linux/ipv6.h define these two structs,
>> if we include both of them, we got:
>>
>> 	/usr/include/linux/ipv6.h:19:8: error: redefinition of ‘struct in6_pktinfo’
>> 	 struct in6_pktinfo {
>> 		^
>> 	In file included from /usr/include/arpa/inet.h:22:0,
>> 			 from txtimestamp.c:33:
>> 	/usr/include/netinet/in.h:524:8: note: originally defined here
>> 	 struct in6_pktinfo
>> 		^
>> 	In file included from txtimestamp.c:40:0:
>> 	/usr/include/linux/ipv6.h:24:8: error: redefinition of ‘struct ip6_mtuinfo’
>> 	 struct ip6_mtuinfo {
>> 		^
>> 	In file included from /usr/include/arpa/inet.h:22:0,
>> 			 from txtimestamp.c:33:
>> 	/usr/include/netinet/in.h:531:8: note: originally defined here
>> 	 struct ip6_mtuinfo
>> 		^
>> So similarly to what we did for in6_addr, we need to sync with
>> libc header on their definitions.
>>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> 
> Applied.
> 

I'm reviewing the glibc side and will apply there if nobody has done so yet.

Cheers,
Carlos.
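
The redefinition error quoted above, and the general shape of a fix that
syncs two headers on one definition, can be sketched in a single file.
This is only an illustration under stated assumptions: the demo_ struct
names and the DEMO_ macros are hypothetical and are not the macros used by
the kernel or glibc headers. The idea is that whichever header is included
first announces its definition through a macro, and the other header
checks that macro before defining the same struct.

```c
#include <stdint.h>

/* --- plays the role of the libc header (hypothetical names) --- */
struct demo_in6_addr {
	uint8_t s6_addr[16];
};

struct demo_in6_pktinfo {
	struct demo_in6_addr ipi6_addr; /* address of the packet */
	int ipi6_ifindex;               /* interface index */
};
/* Announce that this struct is already defined. */
#define DEMO_LIBC_DEFINES_IN6_PKTINFO 1

/* --- plays the role of the kernel UAPI header (hypothetical names) --- */
#if defined(DEMO_LIBC_DEFINES_IN6_PKTINFO)
#define DEMO_UAPI_DEF_IN6_PKTINFO 0  /* libc got there first: skip ours */
#else
#define DEMO_UAPI_DEF_IN6_PKTINFO 1  /* nobody defined it yet: define ours */
#endif

#if DEMO_UAPI_DEF_IN6_PKTINFO
struct demo_in6_pktinfo {
	struct demo_in6_addr ipi6_addr;
	int ipi6_ifindex;
};
#endif

/* Reports whether the "kernel header" part emitted its own definition. */
static int demo_kernel_header_defined_pktinfo(void)
{
	return DEMO_UAPI_DEF_IN6_PKTINFO;
}
```

Without the guard, the second struct definition would trigger the same
"redefinition of struct" error as in the compiler output quoted above.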

^ permalink raw reply

* RE: Clarification regarding IFLA_BRPORT_LEARNING_SYNC and aging of fdb entries learnt via br_fdb_external_learn_add()
From: Arad, Ronen @ 2015-01-09 16:15 UTC (permalink / raw)
  To: Scott Feldman, Jiri Pirko; +Cc: Siva Mannem, Netdev
In-Reply-To: <CAE4R7bBRboCfhRKFKgVVAGt1f-m6wkXNfaej4RjuaCKPypABeA@mail.gmail.com>



>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of Scott Feldman
>Sent: Friday, January 09, 2015 3:47 AM
>To: Jiri Pirko
>Cc: Siva Mannem; Netdev
>Subject: Re: Clarification regarding IFLA_BRPORT_LEARNING_SYNC and aging of
>fdb entries learnt via br_fdb_external_learn_add()
>
>On Wed, Jan 7, 2015 at 4:53 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Tue, Dec 30, 2014 at 07:20:21PM CET, siva.mannem.lnx@gmail.com wrote:
>>>Hi,
>>>
>>>I am trying to understand the ongoing switch device offload effort and
>>>am following the discussions. I have a question regarding
>>>IFLA_BRPORT_LEARNING_SYNC flag and how aging happens when this flag is
>>>enabled on a port that is attached to a bridge that has vlan filtering
>>>enabled.
>>>
>>>If I understand correctly, when  IFLA_BRPORT_LEARNING_SYNC is set on a
>>>bridge port, fdb entries that are learnt externally(may be learnt by
>>>hardware and driver is notified) are synced to bridges fdb using
>>>br_fdb_external_learn_add(). The fdb
>>>entries(fdb->added_by_external_learn set to true) that are learnt via
>>>this method are also deleted by the aging logic after the aging time
>>>even though L2 data forwarding happens in hardware.
>
>This is correct...
>
>>> Is there a way
>>>where aging can be disabled for these entries? and let the entries be
>>>removed only via br_fdb_external_learn_delete()? or am I missing
>>>something?
>>
>> Currently externally learned fdb entries are indeed removed during aging
>> cleanup. I believe that br_fdb_cleanup should check added_by_external_learn
>> and not remove those fdbs. What do you think Scott?
>
>Something like that would work, if we added another brport flag to
>control that.  With the current arrangement, using bridge for aging
>out entries, we want br_fdb_cleanup removing the external_learned
>fdbs, but if there was another brport flag we could fine tune that.
>Say new flag is IFLA_BRPORT_AGING_OFFLOAD or something like that.  I'm
>not sure how aging settings for the bridge are pushed down to offload
>hw, or if there is a different set for hw.
>
>But, isn't it nice to let Linux bridge control aging?  That way,
>bridge -s fdb dump shows nice statistics on fdb entries.  Hardware
>isn't involved in the aging processes (other than being told to remove
>an entry).  Simple hardware equals simple driver.  Linux remains
>control point.
>
It is indeed simpler. However, if the overhead of reading hit bits from the HW
and updating freshness of entries using br_fdb_external_learn_add() is too
expensive, it would force such platforms to disable learning sync altogether.
Therefore, I believe an aging offload flag (could be sufficient at bridge level)
and an external aging interval (possibly longer than the software aging interval)
will encourage drivers to use learning sync.
>-scott
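
The aging-offload idea discussed in this thread can be sketched as a
cleanup pass that leaves externally learned entries alone when offload is
requested. This is a rough illustration only, not the bridge code:
struct demo_fdb_entry, demo_fdb_cleanup() and the aging_offload parameter
are hypothetical. Only the added_by_external_learn field name is taken
from the discussion, and no such flag is implied to exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced fdb entry; not the bridge's real structure. */
struct demo_fdb_entry {
	unsigned long updated;         /* last time the entry was refreshed */
	bool added_by_external_learn;  /* learned by hardware, synced by the driver */
	bool in_use;                   /* false once the entry has been aged out */
};

/*
 * One software aging pass. When aging_offload is true, externally learned
 * entries are skipped and left for the driver to remove. Returns the number
 * of entries aged out in this pass.
 */
static size_t demo_fdb_cleanup(struct demo_fdb_entry *tbl, size_t n,
			       unsigned long now, unsigned long ageing_time,
			       bool aging_offload)
{
	size_t removed = 0;

	for (size_t i = 0; i < n; i++) {
		struct demo_fdb_entry *f = &tbl[i];

		if (!f->in_use)
			continue;
		if (aging_offload && f->added_by_external_learn)
			continue; /* hardware or driver owns aging for this entry */
		if (now - f->updated >= ageing_time) {
			f->in_use = false;
			removed++;
		}
	}
	return removed;
}
```

With aging_offload false, the pass behaves as described earlier in the
thread: externally learned entries age out like any other entry.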

^ permalink raw reply
