Netdev List

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Tom Herbert @ 2016-04-09 13:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, Or Gerlitz, Daniel Borkmann, Eric Dumazet,
	Edward Cree, john fastabend, Thomas Graf, Johannes Berg,
	eranlinuxmellanox, Lorenzo Colitti
In-Reply-To: <20160409142759.25d8464a@redhat.com>

On Sat, Apr 9, 2016 at 9:27 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Sat, 9 Apr 2016 08:17:04 -0300
> Tom Herbert <tom@herbertland.com> wrote:
>
>> One other API issue is how to deal with encapsulation. In this case a
>> header may be prepended to the packet, I assume there are BPF helper
>> functions and we don't need to return a new length or start?
>
> That reminds me.  Does the BPF program need to know the head-room, then?
>
Right, that is basically my question. Can we have a helper function in
BPF that will prepend n bytes to the buffer? (I don't think we want to
expose a notion of headroom.)
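
A minimal userspace sketch of what such a helper could look like. The
struct pkt_buf and pkt_prepend() names are hypothetical and are not an
existing kernel or BPF API; the point is only that the caller learns
whether the prepend succeeded, never how much room is left in front of
the packet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet buffer; field names are illustrative only. */
struct pkt_buf {
	unsigned char *head;	/* start of the underlying allocation */
	unsigned char *data;	/* first byte of the packet */
	size_t len;		/* packet length in bytes */
};

/*
 * Prepend n bytes to the packet. Returns a pointer to the new first
 * byte, or NULL (leaving the buffer untouched) when the space in front
 * of the packet is smaller than n.
 */
static void *pkt_prepend(struct pkt_buf *p, size_t n)
{
	assert(p && p->data >= p->head);

	if ((size_t)(p->data - p->head) < n)
		return NULL;

	p->data -= n;
	p->len += n;
	return p->data;
}
```

A program would call pkt_prepend(&p, sizeof(hdr)), write the
encapsulation header at the returned pointer, and treat NULL as "cannot
encapsulate this packet".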

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCHv3 net-next 0/6] bridge: support sending rntl info when we set attributes through sysfs/ioctl
From: Nikolay Aleksandrov via Bridge @ 2016-04-09 12:55 UTC (permalink / raw)
  To: Xin Long, network dev, bridge; +Cc: davem
In-Reply-To: <cover.1460131308.git.lucien.xin@gmail.com>

On 04/08/2016 06:03 PM, Xin Long wrote:
> This patchset is used to support sending rntl info to user in some places,
> and ensure that whenever those attributes change internally or from sysfs,
> that a netlink notification is sent out to listeners.
> 
> It also make some adjustment in bridge sysfs so that we can implement this
> easily.
> 
> I've done some tests on this patchset, like:
> [br_sysfs]
>   1. change all the attribute values of br or brif:
>   $ echo $value > /sys/class/net/br0/bridge/{*}
>   $ echo $value > /sys/class/net/br0/brif/eth1/{*}
> 
>   2. meanwhile, on another terminal to observe the msg:
>   $ bridge monitor
> 
> [br_ioctl]
>   1. in bridge-utils package, do some changes in br_set, let brctl command
>   use ioctl to set attribute:
>          if ((ret = set_sysfs(path, value)) < 0) { -->
>          if (1) {
> 
>   $ brctl set*
> 
>   2. meanwhile, on another terminal to observe the msg:
>   $ bridge monitor
> 
> This test covers all the attributes that brctl and sysfs support to set.
> 

Overall the set looks good to me, just one comment for future posts - please
include the changes between versions of the set in your cover letter and 
individual patches. I had to go back to your previous postings and read my
own comments and compare them with this set.

Thank you,
 Nik

^ permalink raw reply

* Re: [PATCHv3 net-next 6/6] bridge: a netlink notification should be sent when those attributes are changed by ioctl
From: Nikolay Aleksandrov via Bridge @ 2016-04-09 12:49 UTC (permalink / raw)
  To: Xin Long, network dev, bridge; +Cc: davem
In-Reply-To: <412ca1bb5690dfd7a73a37d7e821784f24edd059.1460131308.git.lucien.xin@gmail.com>

On 04/08/2016 06:03 PM, Xin Long wrote:
> Now when we change the attributes of bridge or br_port by netlink,
> a relevant netlink notification will be sent, but if we change them
> by ioctl or sysfs, no notification will be sent.
> 
> We should ensure that whenever those attributes change internally or from
> sysfs/ioctl, that a netlink notification is sent out to listeners.
> 
> Also, NetworkManager will use this in the future to listen for out-of-band
> bridge master attribute updates and incorporate them into the runtime
> configuration.
> 
> This patch is used for ioctl.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_ioctl.c | 40 ++++++++++++++++++++++++----------------
>  1 file changed, 24 insertions(+), 16 deletions(-)
> 

LGTM,

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCHv3 net-next 5/6] bridge: a netlink notification should be sent when those attributes are changed by br_sysfs_if
From: Nikolay Aleksandrov via Bridge @ 2016-04-09 12:45 UTC (permalink / raw)
  To: Xin Long, network dev, bridge; +Cc: davem
In-Reply-To: <8c87a70b3c214f90b7345edb68536429fdaad096.1460131308.git.lucien.xin@gmail.com>

On 04/08/2016 06:03 PM, Xin Long wrote:
> Now when we change the attributes of bridge or br_port by netlink,
> a relevant netlink notification will be sent, but if we change them
> by ioctl or sysfs, no notification will be sent.
> 
> We should ensure that whenever those attributes change internally or from
> sysfs/ioctl, that a netlink notification is sent out to listeners.
> 
> Also, NetworkManager will use this in the future to listen for out-of-band
> bridge master attribute updates and incorporate them into the runtime
> configuration.
> 
> This patch is used for br_sysfs_if, and we also move br_ifinfo_notify out
> of store_flag.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_sysfs_if.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 

Note: there's a slight behaviour change: previously, if the flags were
unchanged, no notification was sent; now one is. I don't see a problem,
as the same already holds for other attributes that are set to their
current value.

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCHv3 net-next 4/6] bridge: a netlink notification should be sent when those attributes are changed by br_sysfs_br
From: Nikolay Aleksandrov via Bridge @ 2016-04-09 12:41 UTC (permalink / raw)
  To: Xin Long, network dev, bridge; +Cc: davem
In-Reply-To: <1dd5e14c7f042753a4d70b585de407a4e388262a.1460131308.git.lucien.xin@gmail.com>

On 04/08/2016 06:03 PM, Xin Long wrote:
> Now when we change the attributes of bridge or br_port by netlink,
> a relevant netlink notification will be sent, but if we change them
> by ioctl or sysfs, no notification will be sent.
> 
> We should ensure that whenever those attributes change internally or from
> sysfs/ioctl, that a netlink notification is sent out to listeners.
> 
> Also, NetworkManager will use this in the future to listen for out-of-band
> bridge master attribute updates and incorporate them into the runtime
> configuration.
> 
> This patch is used for br_sysfs_br. and we also need to remove some
> rtnl_trylock in old functions so that we can call it in a common one.
> 
> For group_addr_store, we cannot make it use store_bridge_parm, because
> it's not a string-to-long convert, we will add notification on it
> individually.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_sysfs_br.c | 21 +++++++++------------
>  net/bridge/br_vlan.c     | 30 +++++-------------------------
>  2 files changed, 14 insertions(+), 37 deletions(-)
> 

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Eric Dumazet @ 2016-04-09 12:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm,
	netdev@vger.kernel.org, lsf-pc, Alexei Starovoitov
In-Reply-To: <20160409111132.781a11b6@redhat.com>

On Sat, 2016-04-09 at 11:11 +0200, Jesper Dangaard Brouer wrote:
> Hi Eric,


> Above code is okay.  But do you think we also can get away with the same
> trick we do with the SKB refcnf?  Where we avoid an atomic operation if
> refcnt==1.
> 
> void kfree_skb(struct sk_buff *skb)
> {
> 	if (unlikely(!skb))
> 		return;
> 	if (likely(atomic_read(&skb->users) == 1))
> 		smp_rmb();
> 	else if (likely(!atomic_dec_and_test(&skb->users)))
> 		return;
> 	trace_kfree_skb(skb, __builtin_return_address(0));
> 	__kfree_skb(skb);
> }
> EXPORT_SYMBOL(kfree_skb);

No, we cannot use this trick for pages:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ec91698360b3818ff426488a1529811f7a7ab87f
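
For reference, the shortcut in the quoted kfree_skb() can be sketched
in userspace with C11 atomics. struct obj and obj_put() are
hypothetical names, and the acquire load stands in for the kernel's
atomic_read() + smp_rmb() pair. The fast path is only safe when
observing a count of 1 proves the caller is the sole owner; see the
commit linked above for the page case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object. */
struct obj {
	atomic_int users;
};

/*
 * Drop one reference. Returns true when the caller held the last
 * reference and must free the object.
 */
static bool obj_put(struct obj *o)
{
	assert(o);

	/* Fast path: a count of 1 is taken to mean "sole owner", so the
	 * atomic read-modify-write is skipped and the count stays at 1.
	 */
	if (atomic_load_explicit(&o->users, memory_order_acquire) == 1)
		return true;

	/* Slow path: atomic decrement; true if the old value was 1. */
	return atomic_fetch_sub_explicit(&o->users, 1,
					 memory_order_acq_rel) == 1;
}
```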






--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCHv3 net-next 3/6] bridge: simplify the stp_state_store by calling store_bridge_parm
From: Nikolay Aleksandrov via Bridge @ 2016-04-09 12:33 UTC (permalink / raw)
  To: Xin Long, network dev, bridge; +Cc: davem
In-Reply-To: <8c14a891d0a8bcea071d4e5305776a5c5cd9fd17.1460131308.git.lucien.xin@gmail.com>

On 04/08/2016 06:03 PM, Xin Long wrote:
> There are some repetitive codes in stp_state_store, we can remove
> them by calling store_bridge_parm.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_sysfs_br.c | 24 +++++++++---------------
>  1 file changed, 9 insertions(+), 15 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCHv3 net-next 2/6] bridge: simplify the forward_delay_store by calling store_bridge_parm
From: Nikolay Aleksandrov via Bridge @ 2016-04-09 12:31 UTC (permalink / raw)
  To: Xin Long, network dev, bridge; +Cc: davem
In-Reply-To: <6197a35a2eb6df2caf90edcbf1b49da12077f659.1460131308.git.lucien.xin@gmail.com>

On 04/08/2016 06:03 PM, Xin Long wrote:
> There are some repetitive codes in forward_delay_store, we can remove
> them by calling store_bridge_parm.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_sysfs_br.c | 27 ++++++++++-----------------
>  1 file changed, 10 insertions(+), 17 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Jesper Dangaard Brouer @ 2016-04-09 12:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, Or Gerlitz, Daniel Borkmann, Eric Dumazet,
	Edward Cree, john fastabend, Thomas Graf, Johannes Berg,
	eranlinuxmellanox, Lorenzo Colitti, brouer
In-Reply-To: <CALx6S34m8cVNgvuGp845bicixodfavH9cj-rARSwwEAvFCjd7g@mail.gmail.com>

On Sat, 9 Apr 2016 08:17:04 -0300
Tom Herbert <tom@herbertland.com> wrote:

> One other API issue is how to deal with encapsulation. In this case a
> header may be prepended to the packet, I assume there are BPF helper
> functions and we don't need to return a new length or start?

That reminds me.  Does the BPF program need to know the head-room, then?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCHv3 net-next 1/6] bridge: simplify the flush_store by calling store_bridge_parm
From: Nikolay Aleksandrov via Bridge @ 2016-04-09 12:27 UTC (permalink / raw)
  To: Xin Long, network dev, bridge; +Cc: davem
In-Reply-To: <6e2cf6821542a253904dfc7d8ec431d6bbda2b4e.1460131308.git.lucien.xin@gmail.com>

On 04/08/2016 06:03 PM, Xin Long wrote:
> There are some repetitive codes in flush_store, we can remove
> them by calling store_bridge_parm, also, it would send rtnl notification
> after we add it in store_bridge_parm in the following patches.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_sysfs_br.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* [PATCH] qlge: Replace create_singlethread_workqueue with alloc_ordered_workqueue
From: Amitoj Kaur Chawla @ 2016-04-09 11:57 UTC (permalink / raw)
  To: harish.patil, sudarsana.kalluru, Dept-GELinuxNICDev, linux-driver,
	netdev, linux-kernel
  Cc: tj

Replace deprecated create_singlethread_workqueue with
alloc_ordered_workqueue.

Work items include getting tx/rx frame sizes, resetting the MPI
processor, and setting the asic recovery bit, so ordering seems
necessary: only one work item should be in queue/executing at any given
time, hence the use of alloc_ordered_workqueue.

The WQ_MEM_RECLAIM flag has been set because ethernet devices seem to
sit in the memory reclaim path; the flag guarantees forward progress
regardless of memory pressure.

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
Not sure if the assumption of requiring ordering of work items
is correct.

 drivers/net/ethernet/qlogic/qlge/qlge_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index b28e73e..83d7210 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -4687,7 +4687,7 @@ static int ql_init_device(struct pci_dev *pdev, struct net_device *ndev,
 	/*
 	 * Set up the operating parameters.
 	 */
-	qdev->workqueue = create_singlethread_workqueue(ndev->name);
+	qdev->workqueue = alloc_ordered_workqueue(ndev->name, WQ_MEM_RECLAIM);
 	INIT_DELAYED_WORK(&qdev->asic_reset_work, ql_asic_reset_work);
 	INIT_DELAYED_WORK(&qdev->mpi_reset_work, ql_mpi_reset_work);
 	INIT_DELAYED_WORK(&qdev->mpi_work, ql_mpi_work);
-- 
1.9.1

^ permalink raw reply related

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Tom Herbert @ 2016-04-09 11:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann,
	Eric Dumazet, Edward Cree, john fastabend, Thomas Graf,
	Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm
In-Reply-To: <20160408213414.GA43408@ast-mbp.thefacebook.com>

On Fri, Apr 8, 2016 at 6:34 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Apr 08, 2016 at 10:08:08PM +0200, Jesper Dangaard Brouer wrote:
>> On Fri, 8 Apr 2016 10:26:53 -0700
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>
>> > On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote:
>> > >
>> > > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>> > >
>> > > > > +/* user return codes for PHYS_DEV prog type */
>> > > > > +enum bpf_phys_dev_action {
>> > > > > +     BPF_PHYS_DEV_DROP,
>> > > > > +     BPF_PHYS_DEV_OK,
>> > > > > +};
>> > > >
>> > > > I can imagine these extra return codes:
>> > > >
>> > > >  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
>> > > >  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
>> > > >  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
>> > > >
>> > > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
>> > > > which we can look at when we get that far...
>> > >
>> > > I want to point out something which is quite FUNDAMENTAL, for
>> > > understanding these return codes (and network stack).
>> > >
>> > >
>> > > At driver RX time, the network stack basically have two ways of
>> > > building an SKB, which is send up the stack.
>> > >
>> > > Option-A (fastest): The packet page is writable. The SKB can be
>> > > allocated and skb->data/head can point directly to the page.  And
>> > > we place/write skb_shared_info in the end/tail-room. (This is done by
>> > > calling build_skb()).
>> > >
>> > > Option-B (slower): The packet page is read-only.  The SKB cannot point
>> > > skb->data/head directly to the page, because skb_shared_info need to be
>> > > written into skb->end (slightly hidden via skb_shinfo() casting).  To
>> > > get around this, a separate piece of memory is allocated (speedup by
>> > > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
>> > > be written. (This is done when calling netdev/napi_alloc_skb()).
>> > >   Drivers then need to copy over packet headers, and assign + adjust
>> > > skb_shinfo(skb)->frags[0] offset to skip copied headers.
>> > >
>> > >
>> > > Unfortunately most drivers use option-B.  Due to cost of calling the
>> > > page allocator.  It is only slightly most expensive to get a larger
>> > > compound page from the page allocator, which then can be partitioned into
>> > > page-fragments, thus amortizing the page alloc cost.  Unfortunately the
>> > > cost is added later, when constructing the SKB.
>> > >  Another reason for option-B, is that archs with expensive IOMMU
>> > > requirements (like PowerPC), don't need to dma_unmap on every packet,
>> > > but only on the compound page level.
>> > >
>> > > Side-note: Most drivers have a "copy-break" optimization.  Especially
>> > > for option-B, when copying header data anyhow. For small packet, one
>> > > might as well free (or recycle) the RX page, if header size fits into
>> > > the newly allocated memory (for skb_shared_info).
>> >
>> > I think you guys are going into overdesign territory, so
>> > . nack on read-only pages
>>
>> Unfortunately you cannot just ignore or nack read-only pages. They are
>> a fact in the current drivers.
>>
>> Most drivers today (at-least the ones we care about) only deliver
>> read-only pages.  If you don't accept read-only pages day-1, then you
>> first have to rewrite a lot of drivers... and that will stall the
>> project!  How will you deal with this fact?
>>
>> The early drop filter use-case in this patchset, can ignore read-only
>> pages.  But ABI wise we need to deal with the future case where we do
>> need/require writeable pages.  A simple need-writable pages in the API
>> could help us move forward.
>
> the program should never need to worry about whether dma buffer is
> writeable or not. Complicating drivers, api, abi, usability
> for the single use case of fast packet drop is not acceptable.
> XDP is not going to be a fit for all drivers and all architectures.
> That is cruicial 'performance vs generality' aspect of the design.
> All kernel-bypasses are taking advantage of specific architecture.
> We have to take advantage of it as well. If it doesn't fit
> powerpc with iommu, so be it. XDP will return -enotsupp.
> That is fundamental point. We have to cut such corners and avoid
> all cases where unnecessary generality hurts performance.
> Read-only pages is clearly such thing.
>
+1. Forwarding, which will be a common application, almost always
requires modification (decrementing the TTL), and header/data split has
always been a weak feature, since the device needs some arbitrary rule
about which headers get split out (either it implements
protocol-specific parsing or it splits at some fixed length).

>> > The whole thing must be dead simple to use. Above is not simple by any means.
>>
>> Maybe you missed that the above was a description of how the current
>> network stack handles this, which is not simple... which is root of the
>> hole performance issue.
>
> Disagree. The stack has copy-break, gro, gso and everything else because
> it's serving _host_ use case. XDP is packet forwarder use case.
> The requirements are completely different. Ex. the host needs gso
> in the core and drivers. It needs to deliver data all the way
> to user space and back. That is hard and that's where complexity
> comes from. For packet forwarder none of it is needed. So saying,
> look we have this complexity, so XDP needs it too, is flawed argument.
> The kernel is serving host and applications.
> XDP is pure packet-in/packet-out framework to achieve better
> performance than kernel-bypass, since kernel is the right
> place to do it. It has clean access to interrupts, per-cpu,
> scheduler, device registers and so on.
> Though there are only two broad use cases packet drop and forward,
> they cover a ton of real cases: firewalls, dos prevention,
> load balancer, nat, etc. In other words mostly stateless.
> As soon as packet needs to be queued somewhere we have to
> instantiate skb and pass it to the stack.
> So no queues in XDP and no 'stolen' and 'shared' return codes.
> The program always runs to completion with single packet.
> There is no header vs payload split. There is no header
> from program point of view. It's raw bytes in dma buffer.
>
Exactly. We are rethinking the low-level data path for performance. An
all-encompassing solution that covers every existing driver model only
results in complexity, which is what makes things "slow" in the first
place. Drivers need to change to implement XDP, but the model is as
simple as we can make it; for instance, we are placing very few
requirements on device features.

Tom


>> I do like the idea of rejecting XDP eBPF programs based on the DMA
>> setup is not compatible, or if the driver does not implement e.g.
>> writable DMA pages.
>
> exactly.
>
>> Customers wanting this feature will then go buy the NIC which support
>> this feature.  There is nothing more motivating for NIC vendors seeing
>> customers buying the competitors hardware. And it only require a driver
>> change to get this market...
>
> exactly.
>


^ permalink raw reply

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Tom Herbert @ 2016-04-09 11:17 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, Or Gerlitz, Daniel Borkmann,
	Jesper Dangaard Brouer, Eric Dumazet, Edward Cree, john fastabend,
	Thomas Graf, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti
In-Reply-To: <1460090930-11219-1-git-send-email-bblanco@plumgrid.com>

On Fri, Apr 8, 2016 at 1:48 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct bpf_phys_dev_md, is exposed to userspace. So
> far only expose the readable packet length, and only in read mode.
>
> The PHYS_DEV name is chosen to represent that the program is meant only
> for physical adapters, rather than all netdevs.
>
> While the user visible struct is new, the underlying context must be
> implemented as a minimal skb in order for the packet load_* instructions
> to work. The skb filled in by the driver must have skb->len, skb->head,
> and skb->data set, and skb->data_len == 0.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/uapi/linux/bpf.h | 14 ++++++++++
>  kernel/bpf/verifier.c    |  1 +
>  net/core/filter.c        | 68 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 83 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 70eda5a..3018d83 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -93,6 +93,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_SCHED_CLS,
>         BPF_PROG_TYPE_SCHED_ACT,
>         BPF_PROG_TYPE_TRACEPOINT,
> +       BPF_PROG_TYPE_PHYS_DEV,
>  };
>
>  #define BPF_PSEUDO_MAP_FD      1
> @@ -368,6 +369,19 @@ struct __sk_buff {
>         __u32 tc_classid;
>  };
>
> +/* user return codes for PHYS_DEV prog type */
> +enum bpf_phys_dev_action {
> +       BPF_PHYS_DEV_DROP,
> +       BPF_PHYS_DEV_OK,

I don't like OK. Maybe this means LOCAL. We also need FORWARD (not sure
how to handle GRO yet).

I would suggest that we format the return code as code:subcode, where
the above are codes. A subcode is interpreted relative to its major
code. For instance, in forwarding the subcode carries a forwarding
instruction (maybe a queue). DROP subcodes might say why.
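
One way the code:subcode format could be packed into a single 32-bit
return value, as a sketch only. The RX_CODE_* names, the 8/24 bit split
and the helper functions are assumptions made for illustration; nothing
here is part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical major codes. */
enum rx_code {
	RX_CODE_DROP	= 0,
	RX_CODE_LOCAL	= 1,
	RX_CODE_FORWARD	= 2,
};

/* Assumed layout: top 8 bits are the code, low 24 bits the subcode. */
#define RX_SUBCODE_BITS	24u
#define RX_SUBCODE_MASK	((1u << RX_SUBCODE_BITS) - 1u)

/* Pack a major code and its subcode into one return value. */
static inline uint32_t rx_ret(uint32_t code, uint32_t subcode)
{
	assert(code <= 0xffu && subcode <= RX_SUBCODE_MASK);
	return (code << RX_SUBCODE_BITS) | subcode;
}

/* Extract the major code. */
static inline uint32_t rx_ret_code(uint32_t ret)
{
	return ret >> RX_SUBCODE_BITS;
}

/* Extract the subcode; its meaning depends on the major code. */
static inline uint32_t rx_ret_subcode(uint32_t ret)
{
	return ret & RX_SUBCODE_MASK;
}
```

With this layout, rx_ret(RX_CODE_FORWARD, 3) would mean "forward, queue
3", and a plain drop with no reason is still the value 0.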

> +};
> +
> +/* user accessible metadata for PHYS_DEV packet hook
> + * new fields must be added to the end of this structure
> + */
> +struct bpf_phys_dev_md {
> +       __u32 len;
> +};
> +

The naming is verbose and too tied to a specific implementation. The
metadata structure is just a generic abstraction of a receive
descriptor; it is not specific to BPF or to physical devices. For
instance, the BPF programs for this should be exportable to a device,
userspace, other OSes, etc. Also, inevitably someone will have a reason
to use a filtering mechanism other than BPF on top of the same
interface abstraction.

So for the above I suggest we simply name the structure xdp_md.
Similarly, drop codes can be XDP_DROP, etc. Also, these should be in a
new header, say xdp.h, which will also be in uapi.

One other API issue is how to deal with encapsulation. In this case a
header may be prepended to the packet, I assume there are BPF helper
functions and we don't need to return a new length or start?

Tom

>  struct bpf_tunnel_key {
>         __u32 tunnel_id;
>         union {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 58792fe..877542d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1344,6 +1344,7 @@ static bool may_access_skb(enum bpf_prog_type type)
>         case BPF_PROG_TYPE_SOCKET_FILTER:
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
> +       case BPF_PROG_TYPE_PHYS_DEV:
>                 return true;
>         default:
>                 return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index e8486ba..4f73fb9 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2021,6 +2021,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +phys_dev_func_proto(enum bpf_func_id func_id)
> +{
> +       return sk_filter_func_proto(func_id);
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
>         /* check bounds */
> @@ -2076,6 +2082,36 @@ static bool tc_cls_act_is_valid_access(int off, int size,
>         return __is_valid_access(off, size, type);
>  }
>
> +static bool __is_valid_phys_dev_access(int off, int size,
> +                                      enum bpf_access_type type)
> +{
> +       if (off < 0 || off >= sizeof(struct bpf_phys_dev_md))
> +               return false;
> +
> +       if (off % size != 0)
> +               return false;
> +
> +       if (size != 4)
> +               return false;
> +
> +       return true;
> +}
> +
> +static bool phys_dev_is_valid_access(int off, int size,
> +                                    enum bpf_access_type type)
> +{
> +       if (type == BPF_WRITE)
> +               return false;
> +
> +       switch (off) {
> +       case offsetof(struct bpf_phys_dev_md, len):
> +               break;
> +       default:
> +               return false;
> +       }
> +       return __is_valid_phys_dev_access(off, size, type);
> +}
> +
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>                                       int src_reg, int ctx_off,
>                                       struct bpf_insn *insn_buf,
> @@ -2213,6 +2249,26 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>         return insn - insn_buf;
>  }
>
> +static u32 bpf_phys_dev_convert_ctx_access(enum bpf_access_type type,
> +                                          int dst_reg, int src_reg,
> +                                          int ctx_off,
> +                                          struct bpf_insn *insn_buf,
> +                                          struct bpf_prog *prog)
> +{
> +       struct bpf_insn *insn = insn_buf;
> +
> +       switch (ctx_off) {
> +       case offsetof(struct bpf_phys_dev_md, len):
> +               BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
> +
> +               *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
> +                                     offsetof(struct sk_buff, len));
> +               break;
> +       }
> +
> +       return insn - insn_buf;
> +}
> +
>  static const struct bpf_verifier_ops sk_filter_ops = {
>         .get_func_proto = sk_filter_func_proto,
>         .is_valid_access = sk_filter_is_valid_access,
> @@ -2225,6 +2281,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
>         .convert_ctx_access = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops phys_dev_ops = {
> +       .get_func_proto = phys_dev_func_proto,
> +       .is_valid_access = phys_dev_is_valid_access,
> +       .convert_ctx_access = bpf_phys_dev_convert_ctx_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
>         .ops = &sk_filter_ops,
>         .type = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -2240,11 +2302,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
>         .type = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list phys_dev_type __read_mostly = {
> +       .ops = &phys_dev_ops,
> +       .type = BPF_PROG_TYPE_PHYS_DEV,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
>         bpf_register_prog_type(&sk_filter_type);
>         bpf_register_prog_type(&sched_cls_type);
>         bpf_register_prog_type(&sched_act_type);
> +       bpf_register_prog_type(&phys_dev_type);
>
>         return 0;
>  }
> --
> 2.8.0
>

^ permalink raw reply

* [PATCH net-next] vxlan: reduce usage of synchronize_net in ndo_stop
From: Hannes Frederic Sowa @ 2016-04-09 10:46 UTC (permalink / raw)
  To: netdev

We only need to do the synchronize_net dance once for both ipv4 and
ipv6 sockets, thus removing one synchronize_net in case both sockets get
dismantled.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
Based on patch: [net-next,v2] vxlan: synchronously and race-free destruction of vxlan sockets

 drivers/net/vxlan.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 77ba31a0e44f97..c417bbfa3ab49f 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1037,14 +1037,14 @@ static bool vxlan_group_used(struct vxlan_net *vn, struct vxlan_dev *dev)
 	return false;
 }
 
-static void __vxlan_sock_release(struct vxlan_sock *vs)
+static bool __vxlan_sock_release_prep(struct vxlan_sock *vs)
 {
 	struct vxlan_net *vn;
 
 	if (!vs)
-		return;
+		return false;
 	if (!atomic_dec_and_test(&vs->refcnt))
-		return;
+		return false;
 
 	vn = net_generic(sock_net(vs->sock->sk), vxlan_net_id);
 	spin_lock(&vn->sock_lock);
@@ -1052,16 +1052,28 @@ static void __vxlan_sock_release(struct vxlan_sock *vs)
 	vxlan_notify_del_rx_port(vs);
 	spin_unlock(&vn->sock_lock);
 
-	synchronize_net();
-	udp_tunnel_sock_release(vs->sock);
-	kfree(vs);
+	return true;
 }
 
 static void vxlan_sock_release(struct vxlan_dev *vxlan)
 {
-	__vxlan_sock_release(vxlan->vn4_sock);
+	bool ipv4 = __vxlan_sock_release_prep(vxlan->vn4_sock);
 #if IS_ENABLED(CONFIG_IPV6)
-	__vxlan_sock_release(vxlan->vn6_sock);
+	bool ipv6 = __vxlan_sock_release_prep(vxlan->vn6_sock);
+#endif
+
+	synchronize_net();
+
+	if (ipv4) {
+		udp_tunnel_sock_release(vxlan->vn4_sock->sock);
+		kfree(vxlan->vn4_sock);
+	}
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (ipv6) {
+		udp_tunnel_sock_release(vxlan->vn6_sock->sock);
+		kfree(vxlan->vn6_sock);
+	}
 #endif
 }
 
-- 
2.5.5
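
The pattern in this patch can be sketched in user space: unpublish each
object first, pay for one grace period, then free whatever actually
dropped its last reference. This is only a sketch under assumptions. The
names fake_sock, release_prep, wait_grace_period and release_pair are
hypothetical, the refcount is a plain int (single-threaded), and
wait_grace_period() merely counts calls where the kernel would block in
synchronize_net().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vxlan_sock: only a refcount. */
struct fake_sock {
	int refcnt;
};

static int grace_periods;	/* counts simulated synchronize_net() calls */

/* Stand-in for synchronize_net(); the real one blocks for a grace period. */
static void wait_grace_period(void)
{
	grace_periods++;
}

/* Drop one reference; return true only if the caller must free the object. */
static bool release_prep(struct fake_sock *s)
{
	if (!s)
		return false;
	if (--s->refcnt != 0)
		return false;
	/* The real code unhashes the socket here so no new readers find it. */
	return true;
}

/* Release both sockets while paying for a single grace period. */
static int release_pair(struct fake_sock *v4, struct fake_sock *v6)
{
	bool free4 = release_prep(v4);
	bool free6 = release_prep(v6);
	int freed = 0;

	wait_grace_period();	/* one wait covers both objects */

	if (free4) {
		free(v4);
		freed++;
	}
	if (free6) {
		free(v6);
		freed++;
	}
	return freed;
}
```

Calling release_pair() on two sockets that both hit zero frees both after
one simulated wait, where the old per-socket release would have waited
twice.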

^ permalink raw reply related

* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Jesper Dangaard Brouer @ 2016-04-09  9:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm,
	netdev@vger.kernel.org, lsf-pc, Alexei Starovoitov, brouer
In-Reply-To: <1460042309.6473.414.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On Thu, 07 Apr 2016 08:18:29 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> > (Topic proposal for MM-summit)
> > 
> > Network Interface Card (NIC) drivers and increasing link speeds stress
> > the page allocator (and DMA APIs).  A number of driver-specific
> > open-coded approaches exist that work around these bottlenecks in the
> > page allocator and DMA APIs, e.g. open-coded recycle mechanisms, and
> > allocating larger pages and handing out page "fragments".
> > 
> > I'm proposing a generic page-pool recycle facility, that can cover the
> > driver use-cases, increase performance and open up for zero-copy RX.
> > 
> > 
> > The basic performance problem is that pages (containing packets at RX)
> > are cycled through the page allocator (freed at TX DMA completion
> > time).  A system in a steady state could avoid calling the page
> > allocator if it had a pool of pages equal to the size of the RX
> > ring plus the number of outstanding frames in the TX ring (waiting for
> > DMA completion).
> 
> 
> We certainly used this at Google for quite a while.
> 
> The thing is : in steady state, the number of pages being 'in tx queues'
> is lower than number of pages that were allocated for RX queues.

That was also my expectation; thanks for confirming it.

> The page allocator is hardly hit, once you have big enough RX ring
> buffers. (Nothing fancy, simply the default number of slots)
> 
> The 'hard coded' code is quite small actually
> 
> if (page_count(page) != 1) {
>     free the page and allocate another one, 
>     since we are not the exclusive owner.
>     Prefer __GFP_COLD pages btw.
> }
> page_ref_inc(page);

The above code is okay.  But do you think we can also get away with the
same trick we use for the SKB refcnt, where we avoid an atomic operation
if refcnt==1?

void kfree_skb(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return;
	if (likely(atomic_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!atomic_dec_and_test(&skb->users)))
		return;
	trace_kfree_skb(skb, __builtin_return_address(0));
	__kfree_skb(skb);
}
EXPORT_SYMBOL(kfree_skb);
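
For readers outside the kernel, the same fast path can be sketched with
C11 atomics. This is only an analogue under assumptions: struct obj and
obj_put() are hypothetical names, and the acquire load stands in loosely
for atomic_read() followed by smp_rmb(). Whether the trick is safe for
page refcounts is exactly the open question above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object with a reference count, standing in for an skb. */
struct obj {
	atomic_int users;
};

/*
 * Return true if the caller holds the last reference and must free the
 * object. If we observe users == 1 we are the only owner, so nobody else
 * can race with us and the atomic read-modify-write can be skipped.
 */
static bool obj_put(struct obj *o)
{
	if (atomic_load_explicit(&o->users, memory_order_acquire) == 1)
		return true;	/* fast path: plain load, count left at 1 */

	/* Slow path: shared object, so a real atomic decrement is required. */
	return atomic_fetch_sub_explicit(&o->users, 1,
					 memory_order_acq_rel) == 1;
}
```

As in kfree_skb(), the fast path leaves the counter at 1 and goes straight
to freeing; only the shared case pays for the atomic decrement.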


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply

* [PATCH] qdisc: constify meta_type_ops structures
From: Julia Lawall @ 2016-04-09  8:49 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: kernel-janitors, David S. Miller, netdev, linux-kernel

The meta_type_ops structures are never modified, so declare them as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>

---
 net/sched/em_meta.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index f2aabc0..a309a07 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -796,7 +796,7 @@ struct meta_type_ops {
 	int	(*dump)(struct sk_buff *, struct meta_value *, int);
 };
 
-static struct meta_type_ops __meta_type_ops[TCF_META_TYPE_MAX + 1] = {
+static const struct meta_type_ops __meta_type_ops[TCF_META_TYPE_MAX + 1] = {
 	[TCF_META_TYPE_VAR] = {
 		.destroy = meta_var_destroy,
 		.compare = meta_var_compare,
@@ -812,7 +812,7 @@ static struct meta_type_ops __meta_type_ops[TCF_META_TYPE_MAX + 1] = {
 	}
 };
 
-static inline struct meta_type_ops *meta_type_ops(struct meta_value *v)
+static inline const struct meta_type_ops *meta_type_ops(struct meta_value *v)
 {
 	return &__meta_type_ops[meta_type(v)];
 }
@@ -870,7 +870,7 @@ static int em_meta_match(struct sk_buff *skb, struct tcf_ematch *m,
 static void meta_delete(struct meta_match *meta)
 {
 	if (meta) {
-		struct meta_type_ops *ops = meta_type_ops(&meta->lvalue);
+		const struct meta_type_ops *ops = meta_type_ops(&meta->lvalue);
 
 		if (ops && ops->destroy) {
 			ops->destroy(&meta->lvalue);
@@ -964,7 +964,7 @@ static int em_meta_dump(struct sk_buff *skb, struct tcf_ematch *em)
 {
 	struct meta_match *meta = (struct meta_match *) em->data;
 	struct tcf_meta_hdr hdr;
-	struct meta_type_ops *ops;
+	const struct meta_type_ops *ops;
 
 	memset(&hdr, 0, sizeof(hdr));
 	memcpy(&hdr.left, &meta->lvalue.hdr, sizeof(hdr.left));
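
A reduced illustration of the change, outside the kernel: once the ops
table is const, every accessor and local pointer has to be const-qualified
as well, which is what the hunks above do. The type and function names
below (val_ops, val_ops_for, cmp_int, cmp_eq) are hypothetical and only
loosely modelled on em_meta.c.

```c
#include <assert.h>

/* Hypothetical value types, loosely modelled on TCF_META_TYPE_*. */
enum val_type { VAL_TYPE_INT, VAL_TYPE_STR, VAL_TYPE_MAX = VAL_TYPE_STR };

struct val_ops {
	int (*compare)(int a, int b);
};

/* Three-way comparison: returns -1, 0 or 1. */
static int cmp_int(int a, int b)
{
	return (a > b) - (a < b);
}

/* Equality-only comparison: returns 0 when equal, 1 otherwise. */
static int cmp_eq(int a, int b)
{
	return a == b ? 0 : 1;
}

/* const: the table can live in read-only data and cannot be rewritten. */
static const struct val_ops val_ops_table[VAL_TYPE_MAX + 1] = {
	[VAL_TYPE_INT] = { .compare = cmp_int },
	[VAL_TYPE_STR] = { .compare = cmp_eq },
};

/* The accessor must return a pointer-to-const to match the table. */
static const struct val_ops *val_ops_for(enum val_type t)
{
	return &val_ops_table[t];
}
```

Dropping the const from the return type of val_ops_for() would make the
compiler complain about a discarded qualifier, which is why the patch
touches the callers too.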

^ permalink raw reply related

* [PATCH net-next v2] net: bcmgenet: add BQL support
From: Petri Gynther @ 2016-04-09  7:20 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, f.fainelli, opendmb, jaedon.shin,
	Petri Gynther

Add Byte Queue Limits (BQL) support to bcmgenet driver.

Signed-off-by: Petri Gynther <pgynther@google.com>
---
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index f7b42b9..c286ae8 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -1221,8 +1221,10 @@ static unsigned int __bcmgenet_tx_reclaim(struct net_device *dev,
 	dev->stats.tx_packets += pkts_compl;
 	dev->stats.tx_bytes += bytes_compl;
 
+	txq = netdev_get_tx_queue(dev, ring->queue);
+	netdev_tx_completed_queue(txq, pkts_compl, bytes_compl);
+
 	if (ring->free_bds > (MAX_SKB_FRAGS + 1)) {
-		txq = netdev_get_tx_queue(dev, ring->queue);
 		if (netif_tx_queue_stopped(txq))
 			netif_tx_wake_queue(txq);
 	}
@@ -1516,6 +1518,8 @@ static netdev_tx_t bcmgenet_xmit(struct sk_buff *skb, struct net_device *dev)
 	ring->prod_index += nr_frags + 1;
 	ring->prod_index &= DMA_P_INDEX_MASK;
 
+	netdev_tx_sent_queue(txq, GENET_CB(skb)->bytes_sent);
+
 	if (ring->free_bds <= (MAX_SKB_FRAGS + 1))
 		netif_tx_stop_queue(txq);
 
@@ -2364,6 +2368,7 @@ static int bcmgenet_dma_teardown(struct bcmgenet_priv *priv)
 static void bcmgenet_fini_dma(struct bcmgenet_priv *priv)
 {
 	int i;
+	struct netdev_queue *txq;
 
 	bcmgenet_fini_rx_napi(priv);
 	bcmgenet_fini_tx_napi(priv);
@@ -2378,6 +2383,14 @@ static void bcmgenet_fini_dma(struct bcmgenet_priv *priv)
 		}
 	}
 
+	for (i = 0; i < priv->hw_params->tx_queues; i++) {
+		txq = netdev_get_tx_queue(priv->dev, priv->tx_rings[i].queue);
+		netdev_tx_reset_queue(txq);
+	}
+
+	txq = netdev_get_tx_queue(priv->dev, priv->tx_rings[DESC_INDEX].queue);
+	netdev_tx_reset_queue(txq);
+
 	bcmgenet_free_rx_buffers(priv);
 	kfree(priv->rx_cbs);
 	kfree(priv->tx_cbs);
-- 
2.8.0.rc3.226.g39d4020
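
The three hooks added above have to stay balanced: bytes reported at xmit
must later be reported at reclaim, or cleared by a reset on teardown. A
toy model of that accounting follows. It is a sketch only: struct
txq_model and the model_* functions are hypothetical, and the limit is
fixed here whereas the kernel's BQL adjusts it dynamically.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one TX queue's byte accounting; not the kernel's. */
struct txq_model {
	unsigned int inflight;	/* bytes handed to hardware, not yet completed */
	unsigned int limit;	/* fixed here; real BQL adapts this at runtime */
	bool stopped;
};

/* Counterpart of netdev_tx_sent_queue(): called from the xmit path. */
static void model_sent(struct txq_model *q, unsigned int bytes)
{
	q->inflight += bytes;
	if (q->inflight >= q->limit)
		q->stopped = true;
}

/* Counterpart of netdev_tx_completed_queue(): called from TX reclaim. */
static void model_completed(struct txq_model *q, unsigned int bytes)
{
	q->inflight -= bytes;
	if (q->inflight < q->limit)
		q->stopped = false;
}

/* Counterpart of netdev_tx_reset_queue(): called when DMA is torn down. */
static void model_reset(struct txq_model *q)
{
	q->inflight = 0;
	q->stopped = false;
}
```

If model_reset() were skipped on teardown, stale in-flight bytes would
carry over to the next open and the queue could stay throttled, which is
the role of the loop added to bcmgenet_fini_dma().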

^ permalink raw reply related

* Re: [next-queue PATCH 0/3] Add support for GSO partial to Intel NIC drivers
From: Jeff Kirsher @ 2016-04-09  6:59 UTC (permalink / raw)
  To: Alexander Duyck, herbert, tom, jesse, alexander.duyck, edumazet,
	intel-wired-lan, netdev, davem
In-Reply-To: <20160408210103.13096.77973.stgit@ahduyck-xeon-server>

[-- Attachment #1: Type: text/plain, Size: 1711 bytes --]

On Fri, 2016-04-08 at 17:06 -0400, Alexander Duyck wrote:
> So these are the patches needed to enable tunnel segmentation
> offloads on
> the igb, igbvf, ixgbe, and ixgbevf drivers.  In addition this patch
> extends
> the i40e and i40evf drivers to include segmentation support for
> tunnels
> with outer checksums.
> 
> The net performance gain from these patches is pretty significant. 
> In the
> case of i40e a tunnel with outer checksums showed the following
> improvement:
> Throughput Throughput  Local Local   Result 
>            Units       CPU   Service Tag    
>                        Util  Demand         
>                        %                    
> 14066.29   10^6bits/s  3.49  0.651   "before" 
> 20618.16   10^6bits/s  3.09  0.393   "after"
> 
> For ixgbe similar results were seen:
> Throughput Throughput  Local  Local   Result 
>            Units       CPU    Service Tag    
>                        Util   Demand         
>                        %               
> 12879.89   10^6bits/s  10.00  0.763   "before"
> 14286.77   10^6bits/s  5.74   0.395   "after" 
> 
> These patches all rely on the TSO_MANGLEID and GSO_PARTIAL patches so
> I
> would not recommend applying them until those patches have first been
> applied.

Sorry I did not see this until after I tried applying your series. :-(

Maybe the two dependent patches should have been in the series, so that I
and others do not waste our time.  Or this series should not have been
sent until the two patches were accepted.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [next-queue PATCH 2/3] ixgbe/ixgbevf: Add support for GSO partial
From: Jeff Kirsher @ 2016-04-09  6:53 UTC (permalink / raw)
  To: Alexander Duyck, herbert, tom, jesse, alexander.duyck, edumazet,
	intel-wired-lan, netdev, davem
In-Reply-To: <20160408210647.13096.55323.stgit@ahduyck-xeon-server>

[-- Attachment #1: Type: text/plain, Size: 4597 bytes --]

On Fri, 2016-04-08 at 17:06 -0400, Alexander Duyck wrote:
> This patch adds support for partial GSO segmentation in the case of
> tunnels.  Specifically with this change the driver can perform
> segmentation
> as long as the frame either has IPv6 inner headers, or we are allowed
> to
> mangle the IP IDs on the inner header.  This is needed because we
> will not
> be modifying any fields from the start of the outer
> transport
> header to the start of the inner transport header as we are treating
> them
> like they are just a block of IP options.
> 
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  105
> +++++++++++++-----
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  123
> ++++++++++++++++-----
>  2 files changed, 172 insertions(+), 56 deletions(-)

Dropping this patch (and now the series) because this does not
compile...

[23:51:49 @jtkirshe-linux:next-queue]$ make -j 77 -s
Makefile:679: Cannot use CONFIG_KCOV: -fsanitize-coverage=trace-pc is
not supported by compiler
  DESCEND  objtool
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c: In function
‘ixgbevf_set_features’:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:3892:17: error:
‘NETIF_F_TSO_MANGLEID’ undeclared (first use in this function)
  if (features & NETIF_F_TSO_MANGLEID)
                 ^
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:3892:17: note: each
undeclared identifier is reported only once for each function it
appears in
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c: In function
‘ixgbevf_probe’:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:4062:8: error:
‘struct net_device’ has no member named ‘gso_partial_features’
  netdev->gso_partial_features = IXGBEVF_GSO_PARTIAL_FEATURES;
        ^
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:4063:25: error:
‘NETIF_F_GSO_PARTIAL’ undeclared (first use in this function)
  netdev->hw_features |= NETIF_F_GSO_PARTIAL |
                         ^
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:4073:6: error:
‘NETIF_F_TSO_MANGLEID’ undeclared (first use in this function)
      NETIF_F_TSO_MANGLEID |
      ^
scripts/Makefile.build:291: recipe for target
'drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.o' failed
make[5]: *** [drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.o] Error
1
scripts/Makefile.build:440: recipe for target
'drivers/net/ethernet/intel/ixgbevf' failed
make[4]: *** [drivers/net/ethernet/intel/ixgbevf] Error 2
make[4]: *** Waiting for unfinished jobs....
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c: In function
‘ixgbe_set_features’:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8635:17: error:
‘NETIF_F_TSO_MANGLEID’ undeclared (first use in this function)
  if (features & NETIF_F_TSO_MANGLEID)
                 ^
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8635:17: note: each
undeclared identifier is reported only once for each function it
appears in
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c: In function
‘ixgbe_probe’:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:9350:8: error: ‘struct
net_device’ has no member named ‘gso_partial_features’
  netdev->gso_partial_features = IXGBE_GSO_PARTIAL_FEATURES;
        ^
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:9351:22: error:
‘NETIF_F_GSO_PARTIAL’ undeclared (first use in this function)
  netdev->features |= NETIF_F_GSO_PARTIAL |
                      ^
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:9368:6: error:
‘NETIF_F_TSO_MANGLEID’ undeclared (first use in this function)
      NETIF_F_TSO_MANGLEID |
      ^
scripts/Makefile.build:291: recipe for target
'drivers/net/ethernet/intel/ixgbe/ixgbe_main.o' failed
make[5]: *** [drivers/net/ethernet/intel/ixgbe/ixgbe_main.o] Error 1
scripts/Makefile.build:440: recipe for target
'drivers/net/ethernet/intel/ixgbe' failed
make[4]: *** [drivers/net/ethernet/intel/ixgbe] Error 2
scripts/Makefile.build:440: recipe for target
'drivers/net/ethernet/intel' failed
make[3]: *** [drivers/net/ethernet/intel] Error 2
scripts/Makefile.build:440: recipe for target 'drivers/net/ethernet'
failed
make[2]: *** [drivers/net/ethernet] Error 2
scripts/Makefile.build:440: recipe for target 'drivers/net' failed
make[1]: *** [drivers/net] Error 2
Makefile:962: recipe for target 'drivers' failed
make: *** [drivers] Error 2

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [next-queue PATCH 1/3] i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM
From: Jeff Kirsher @ 2016-04-09  6:51 UTC (permalink / raw)
  To: Alexander Duyck, herbert, tom, jesse, alexander.duyck, edumazet,
	intel-wired-lan, netdev, davem
In-Reply-To: <20160408210641.13096.8972.stgit@ahduyck-xeon-server>

[-- Attachment #1: Type: text/plain, Size: 6698 bytes --]

On Fri, 2016-04-08 at 17:06 -0400, Alexander Duyck wrote:
> This patch makes it so that i40e and i40evf can use GSO_PARTIAL to
> support
> segmentation for frames with checksums enabled in outer headers.  As
> a
> result we can now send data over these types of tunnels at over
> 20Gb/s
> versus the 12Gb/s that was previously possible on my system.
> 
> The advantage with the i40e parts is that this offload is mostly
> transparent as the hardware still deals with the inner and/or outer
> IPv4
> headers so the IP ID is still incrementing for both when this offload
> is
> performed.
> 
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c     |   10 ++++++++--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c     |    7 ++++++-
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c   |    7 ++++++-
>  drivers/net/ethernet/intel/i40evf/i40evf_main.c |   10 ++++++++--
>  4 files changed, 28 insertions(+), 6 deletions(-)

Dropping this patch because it does not even compile...

[23:49:07 @jtkirshe-linux:next-queue]$ make -j 77 -s
Makefile:679: Cannot use CONFIG_KCOV: -fsanitize-coverage=trace-pc is
not supported by compiler
  DESCEND  objtool
drivers/net/ethernet/intel/i40evf/i40evf_main.c: In function
‘i40evf_process_config’:
drivers/net/ethernet/intel/i40evf/i40evf_main.c:2354:8: error:
‘NETIF_F_GSO_PARTIAL’ undeclared (first use in this function)
        NETIF_F_GSO_PARTIAL  |
        ^
drivers/net/ethernet/intel/i40evf/i40evf_main.c:2354:8: note: each
undeclared identifier is reported only once for each function it
appears in
drivers/net/ethernet/intel/i40evf/i40evf_main.c:2361:9: error: ‘struct
net_device’ has no member named ‘gso_partial_features’
   netdev->gso_partial_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
         ^
drivers/net/ethernet/intel/i40evf/i40evf_main.c:2363:8: error: ‘struct
net_device’ has no member named ‘gso_partial_features’
  netdev->gso_partial_features |= NETIF_F_GSO_GRE_CSUM;
        ^
drivers/net/ethernet/intel/i40evf/i40evf_main.c:2367:6: error:
‘NETIF_F_TSO_MANGLEID’ undeclared (first use in this function)
      NETIF_F_TSO_MANGLEID;
      ^
drivers/net/ethernet/intel/i40evf/i40e_txrx.c: In function ‘i40e_tso’:
drivers/net/ethernet/intel/i40evf/i40e_txrx.c:1573:37: error:
‘SKB_GSO_PARTIAL’ undeclared (first use in this function)
   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
                                     ^
drivers/net/ethernet/intel/i40evf/i40e_txrx.c:1573:37: note: each
undeclared identifier is reported only once for each function it
appears in
drivers/net/ethernet/intel/i40evf/i40e_txrx.c: In function
‘i40e_tx_enable_csum’:
drivers/net/ethernet/intel/i40evf/i40e_txrx.c:1711:37: error:
‘SKB_GSO_PARTIAL’ undeclared (first use in this function)
       !(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
                                     ^
scripts/Makefile.build:291: recipe for target
'drivers/net/ethernet/intel/i40evf/i40evf_main.o' failed
make[5]: *** [drivers/net/ethernet/intel/i40evf/i40evf_main.o] Error 1
make[5]: *** Waiting for unfinished jobs....
scripts/Makefile.build:291: recipe for target
'drivers/net/ethernet/intel/i40evf/i40e_txrx.o' failed
make[5]: *** [drivers/net/ethernet/intel/i40evf/i40e_txrx.o] Error 1
scripts/Makefile.build:440: recipe for target
'drivers/net/ethernet/intel/i40evf' failed
make[4]: *** [drivers/net/ethernet/intel/i40evf] Error 2
make[4]: *** Waiting for unfinished jobs....
drivers/net/ethernet/intel/i40e/i40e_main.c: In function
‘i40e_config_netdev’:
drivers/net/ethernet/intel/i40e/i40e_main.c:9127:8: error:
‘NETIF_F_GSO_PARTIAL’ undeclared (first use in this function)
        NETIF_F_GSO_PARTIAL  |
        ^
drivers/net/ethernet/intel/i40e/i40e_main.c:9127:8: note: each
undeclared identifier is reported only once for each function it
appears in
drivers/net/ethernet/intel/i40e/i40e_main.c:9134:9: error: ‘struct
net_device’ has no member named ‘gso_partial_features’
   netdev->gso_partial_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
         ^
drivers/net/ethernet/intel/i40e/i40e_main.c:9136:8: error: ‘struct
net_device’ has no member named ‘gso_partial_features’
  netdev->gso_partial_features |= NETIF_F_GSO_GRE_CSUM;
        ^
drivers/net/ethernet/intel/i40e/i40e_main.c:9140:6: error:
‘NETIF_F_TSO_MANGLEID’ undeclared (first use in this function)
      NETIF_F_TSO_MANGLEID;
      ^
scripts/Makefile.build:291: recipe for target
'drivers/net/ethernet/intel/i40e/i40e_main.o' failed
make[5]: *** [drivers/net/ethernet/intel/i40e/i40e_main.o] Error 1
make[5]: *** Waiting for unfinished jobs....
drivers/net/ethernet/intel/i40e/i40e_txrx.c: In function ‘i40e_tso’:
drivers/net/ethernet/intel/i40e/i40e_txrx.c:2308:37: error:
‘SKB_GSO_PARTIAL’ undeclared (first use in this function)
   if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
                                     ^
drivers/net/ethernet/intel/i40e/i40e_txrx.c:2308:37: note: each
undeclared identifier is reported only once for each function it
appears in
drivers/net/ethernet/intel/i40e/i40e_txrx.c: In function
‘i40e_tx_enable_csum’:
drivers/net/ethernet/intel/i40e/i40e_txrx.c:2488:37: error:
‘SKB_GSO_PARTIAL’ undeclared (first use in this function)
       !(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) &&
                                     ^
scripts/Makefile.build:291: recipe for target
'drivers/net/ethernet/intel/i40e/i40e_txrx.o' failed
make[5]: *** [drivers/net/ethernet/intel/i40e/i40e_txrx.o] Error 1
scripts/Makefile.build:440: recipe for target
'drivers/net/ethernet/intel/i40e' failed
make[4]: *** [drivers/net/ethernet/intel/i40e] Error 2
scripts/Makefile.build:440: recipe for target
'drivers/net/ethernet/intel' failed
make[3]: *** [drivers/net/ethernet/intel] Error 2
make[3]: *** Waiting for unfinished jobs....
scripts/Makefile.build:440: recipe for target 'drivers/net/ethernet'
failed
make[2]: *** [drivers/net/ethernet] Error 2
scripts/Makefile.build:440: recipe for target 'drivers/net' failed
make[1]: *** [drivers/net] Error 2
make[1]: *** Waiting for unfinished jobs....
Makefile:962: recipe for target 'drivers' failed
make: *** [drivers] Error 2
make: *** Waiting for unfinished jobs....

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* [PATCH net-next RFC v2 2/2] ipv6: add support for stats via RTM_GETSTATS
From: Roopa Prabhu @ 2016-04-09  6:38 UTC (permalink / raw)
  To: netdev; +Cc: jhs, davem

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch is an example of adding af stats in
RTM_GETSTATS. It adds a new nested IFLA_STATS_INET6
attribute for ipv6 af stats. stats attributes inside
IFLA_STATS_INET6 nested attribute use the existing ipv6 stats
attributes from ipv6 IFLA_PROTINFO (I can certainly declare
new attributes if required)

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
This patch is an example of af stats hooked into the new stats
infrastructure. I have tested it to work. My real intent is to have
IFLA_STATS_MPLS implemented in the same way for mpls.
I am not sure how popular the current ipv6 stats are.
Instead of carrying over the old ipv6 stats, we could rethink ipv6
stats in a new way when people see the need for it.

 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         |  1 +
 net/ipv6/addrconf.c          | 77 +++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 71 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4cfd029..e0b51c8 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -791,6 +791,7 @@ struct if_stats_msg {
 enum {
 	IFLA_STATS_UNSPEC,
 	IFLA_STATS_LINK64,
+	IFLA_STATS_INET6,
 	__IFLA_STATS_MAX,
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d1fba58..9c6cff1 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3505,6 +3505,7 @@ nla_put_failure:
 
 static const struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
 	[IFLA_STATS_LINK64]	= { .len = sizeof(struct rtnl_link_stats64) },
+	[IFLA_STATS_INET6]	= { .type = NLA_NESTED },
 };
 
 static size_t rtnl_link_get_af_stats_size(const struct net_device *dev,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 27aed1a..445f21a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4917,6 +4917,29 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 	}
 }
 
+static int inet6_fill_ifla6_stats(struct sk_buff *skb,
+				  struct inet6_dev *idev)
+{
+	struct nlattr *nla;
+
+	nla = nla_reserve(skb, IFLA_INET6_STATS, IPSTATS_MIB_MAX * sizeof(u64));
+	if (!nla)
+		goto nla_put_failure;
+	snmp6_fill_stats(nla_data(nla), idev, IFLA_INET6_STATS, nla_len(nla));
+
+	nla = nla_reserve(skb, IFLA_INET6_ICMP6STATS,
+			  ICMP6_MIB_MAX * sizeof(u64));
+	if (!nla)
+		goto nla_put_failure;
+	snmp6_fill_stats(nla_data(nla), idev, IFLA_INET6_ICMP6STATS,
+			 nla_len(nla));
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
 static int inet6_fill_ifla6_attrs(struct sk_buff *skb, struct inet6_dev *idev,
 				  u32 ext_filter_mask)
 {
@@ -4941,15 +4964,8 @@ static int inet6_fill_ifla6_attrs(struct sk_buff *skb, struct inet6_dev *idev,
 	if (ext_filter_mask & RTEXT_FILTER_SKIP_STATS)
 		return 0;
 
-	nla = nla_reserve(skb, IFLA_INET6_STATS, IPSTATS_MIB_MAX * sizeof(u64));
-	if (!nla)
-		goto nla_put_failure;
-	snmp6_fill_stats(nla_data(nla), idev, IFLA_INET6_STATS, nla_len(nla));
-
-	nla = nla_reserve(skb, IFLA_INET6_ICMP6STATS, ICMP6_MIB_MAX * sizeof(u64));
-	if (!nla)
+	if (inet6_fill_ifla6_stats(skb, idev))
 		goto nla_put_failure;
-	snmp6_fill_stats(nla_data(nla), idev, IFLA_INET6_ICMP6STATS, nla_len(nla));
 
 	nla = nla_reserve(skb, IFLA_INET6_TOKEN, sizeof(struct in6_addr));
 	if (!nla)
@@ -4991,6 +5007,49 @@ static int inet6_fill_link_af(struct sk_buff *skb, const struct net_device *dev,
 	return 0;
 }
 
+static size_t inet6_get_link_af_stats_size(const struct net_device *dev,
+					   u32 filter_mask)
+{
+	if (!(filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_INET6)))
+		return 0;
+
+	if (!__in6_dev_get(dev))
+		return 0;
+
+	return nla_total_size(sizeof(struct nlattr)) /* IFLA_STATS_INET6 */
+		+ nla_total_size(IPSTATS_MIB_MAX * 8) /* IFLA_INET6_STATS */
+		+ nla_total_size(ICMP6_MIB_MAX * sizeof(u64));/* IFLA_INET6_ICMP6STATS */
+}
+
+static int inet6_fill_link_af_stats(struct sk_buff *skb,
+				    const struct net_device *dev,
+				    u32 filter_mask)
+{
+	struct inet6_dev *idev = __in6_dev_get(dev);
+	struct nlattr *inet6_stats;
+
+	if (!(filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_INET6)))
+		return 0;
+
+	if (!idev)
+		return -ENODATA;
+
+	inet6_stats = nla_nest_start(skb, IFLA_STATS_INET6);
+	if (!inet6_stats)
+		return -EMSGSIZE;
+
+	if (inet6_fill_ifla6_stats(skb, idev) < 0)
+		goto errout;
+
+	nla_nest_end(skb, inet6_stats);
+
+	return 0;
+errout:
+	nla_nest_cancel(skb, inet6_stats);
+
+	return -EMSGSIZE;
+}
+
 static int inet6_set_iftoken(struct inet6_dev *idev, struct in6_addr *token)
 {
 	struct inet6_ifaddr *ifp;
@@ -6085,6 +6144,8 @@ static struct rtnl_af_ops inet6_ops __read_mostly = {
 	.get_link_af_size = inet6_get_link_af_size,
 	.validate_link_af = inet6_validate_link_af,
 	.set_link_af	  = inet6_set_link_af,
+	.get_link_af_stats_size = inet6_get_link_af_stats_size,
+	.fill_link_af_stats = inet6_fill_link_af_stats,
 };
 
 /*
-- 
1.9.1
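
The size estimate in inet6_get_link_af_stats_size() follows the usual
netlink rule: each attribute costs a 4-byte header plus its payload,
padded to a 4-byte boundary. A user-space sketch of that arithmetic is
below. The macro and function names are hypothetical, and the counter
counts passed in are made-up values rather than the real IPSTATS_MIB_MAX
or ICMP6_MIB_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATTR_ALIGNTO	((size_t)4)	/* attributes are 4-byte aligned */
#define ATTR_ALIGN(len)	(((len) + ATTR_ALIGNTO - 1) & ~(ATTR_ALIGNTO - 1))
#define ATTR_HDRLEN	((size_t)4)	/* 16-bit length + 16-bit type */

/* Space one attribute occupies: header plus payload, padded to 4 bytes. */
static size_t attr_total_size(size_t payload)
{
	return ATTR_ALIGN(ATTR_HDRLEN + payload);
}

/* Space for a nest holding two u64 counter arrays (counts are made up). */
static size_t nested_stats_size(size_t n_ip, size_t n_icmp)
{
	return attr_total_size(0)			   /* nest header */
	     + attr_total_size(n_ip * sizeof(uint64_t))   /* first array */
	     + attr_total_size(n_icmp * sizeof(uint64_t)); /* second array */
}
```

The sketch charges a bare header for the nest, while the patch reserves
nla_total_size(sizeof(struct nlattr)); that is slightly larger, which only
overestimates the buffer.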

^ permalink raw reply related

* [PATCH net-next v2 1/2] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Roopa Prabhu @ 2016-04-09  6:38 UTC (permalink / raw)
  To: netdev; +Cc: jhs, davem

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch adds a new RTM_GETSTATS message to query link stats via netlink
from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
returns a lot more than just stats and is expensive in some cases when
frequent polling for stats from userspace is a common operation.

RTM_GETSTATS is an attempt to provide a lightweight netlink message
to explicitly query only link stats from the kernel on an interface.
The idea is to also keep it extensible so that new kinds of stats can be
added to it in the future.

This patch adds the following attribute for NETDEV stats:
struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
        [IFLA_STATS_LINK64]  = { .len = sizeof(struct rtnl_link_stats64) },
};

This patch also allows for address family (af) stats (an example of af
stats for IPv6 is available in the second patch of the series).

Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
a single interface or all interfaces with NLM_F_DUMP.

Future possible new types of stat attributes:
- IFLA_MPLS_STATS  (nested. for mpls/mdev stats)
- IFLA_EXTENDED_STATS (nested. extended software netdev stats like bridge,
  vlan, vxlan etc)
- IFLA_EXTENDED_HW_STATS (nested. extended hardware stats which are
  available via ethtool today)

This patch also declares a filter mask for all stat attributes.
User has to provide a mask of stats attributes to query. This will be
specified in a new hdr 'struct if_stats_msg' for stats messages.

Without any attributes in the filter_mask, no stats will be returned.
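
A small user-space sketch of how the filter mask is meant to be built and
tested. The bit layout follows IFLA_STATS_FILTER_BIT from this patch, but
the enum and function names here are hypothetical, STATS_INET6 only exists
once the second patch is applied, and the -1 return is a placeholder for
whatever error the kernel side chooses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Attribute numbering as in the series; names shortened for the sketch. */
enum { STATS_UNSPEC, STATS_LINK64, STATS_INET6 };

/* Unsigned constant so the shift never overflows a signed int. */
#define STATS_FILTER_BIT(attr)	(UINT32_C(1) << (attr))

/* True if the request asked for the given stats attribute. */
static bool stats_requested(uint32_t filter_mask, unsigned int attr)
{
	return (filter_mask & STATS_FILTER_BIT(attr)) != 0;
}

/* An empty mask is rejected rather than treated as "everything". */
static int validate_filter_mask(uint32_t filter_mask)
{
	return filter_mask ? 0 : -1;
}
```

A caller that wants both link and IPv6 stats would OR the two bits
together before placing the mask in struct if_stats_msg.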

This patch has been tested with a modified iproute2 ifstat.

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 include/net/rtnetlink.h        |   5 ++
 include/uapi/linux/if_link.h   |  18 ++++
 include/uapi/linux/rtnetlink.h |   5 ++
 net/core/rtnetlink.c           | 200 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 228 insertions(+)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 2f87c1b..fa68158 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -131,6 +131,11 @@ struct rtnl_af_ops {
 						    const struct nlattr *attr);
 	int			(*set_link_af)(struct net_device *dev,
 					       const struct nlattr *attr);
+	size_t			(*get_link_af_stats_size)(const struct net_device *dev,
+							  u32 filter_mask);
+	int			(*fill_link_af_stats)(struct sk_buff *skb,
+						      const struct net_device *dev,
+						      u32 filter_mask);
 };
 
 void __rtnl_af_unregister(struct rtnl_af_ops *ops);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9427f17..4cfd029 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -780,4 +780,22 @@ enum {
 
 #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1)
 
+/* STATS section */
+
+struct if_stats_msg {
+	__u8  family;
+	__u32 ifindex;
+	__u32 filter_mask;
+};
+
+enum {
+	IFLA_STATS_UNSPEC,
+	IFLA_STATS_LINK64,
+	__IFLA_STATS_MAX,
+};
+
+#define IFLA_STATS_MAX (__IFLA_STATS_MAX - 1)
+
+#define IFLA_STATS_FILTER_BIT(ATTR)	(1 << (ATTR))
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5..cc885c4 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -139,6 +139,11 @@ enum {
 	RTM_GETNSID = 90,
 #define RTM_GETNSID RTM_GETNSID
 
+	RTM_NEWSTATS = 92,
+#define RTM_NEWSTATS RTM_NEWSTATS
+	RTM_GETSTATS = 94,
+#define RTM_GETSTATS RTM_GETSTATS
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a75f7e9..d1fba58 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3451,6 +3451,203 @@ out:
 	return err;
 }
 
+static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
+			       int type, u32 pid, u32 seq, u32 change,
+			       unsigned int flags, unsigned int filter_mask)
+{
+	const struct rtnl_link_stats64 *stats;
+	struct rtnl_link_stats64 temp;
+	struct if_stats_msg *ifsm;
+	struct nlmsghdr *nlh;
+	struct rtnl_af_ops *af_ops;
+	struct nlattr *attr;
+
+	ASSERT_RTNL();
+
+	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifsm), flags);
+	if (!nlh)
+		return -EMSGSIZE;
+
+	ifsm = nlmsg_data(nlh);
+	ifsm->ifindex = dev->ifindex;
+	ifsm->filter_mask = filter_mask;
+
+	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK64)) {
+		attr = nla_reserve(skb, IFLA_STATS_LINK64,
+				   sizeof(struct rtnl_link_stats64));
+		if (!attr)
+			goto nla_put_failure;
+
+		stats = dev_get_stats(dev, &temp);
+
+		copy_rtnl_link_stats64(nla_data(attr), stats);
+	}
+
+	list_for_each_entry(af_ops, &rtnl_af_ops, list) {
+		if (af_ops->fill_link_af_stats) {
+			int err;
+
+			err = af_ops->fill_link_af_stats(skb, dev, filter_mask);
+			if (err < 0)
+				goto nla_put_failure;
+		}
+	}
+
+	nlmsg_end(skb, nlh);
+
+	return 0;
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+
+	return -EMSGSIZE;
+}
+
+static const struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
+	[IFLA_STATS_LINK64]	= { .len = sizeof(struct rtnl_link_stats64) },
+};
+
+static size_t rtnl_link_get_af_stats_size(const struct net_device *dev,
+					  u32 filter_mask)
+{
+	struct rtnl_af_ops *af_ops;
+	size_t size = 0;
+
+	list_for_each_entry(af_ops, &rtnl_af_ops, list) {
+		if (af_ops->get_link_af_stats_size)
+			size += af_ops->get_link_af_stats_size(dev,
+							       filter_mask);
+	}
+
+	return size;
+}
+
+static size_t if_nlmsg_stats_size(const struct net_device *dev,
+				  u32 filter_mask)
+{
+	size_t size = 0;
+
+	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK64))
+		size += nla_total_size(sizeof(struct rtnl_link_stats64));
+
+	size += rtnl_link_get_af_stats_size(dev, filter_mask);
+
+	return size;
+}
+
+static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh)
+{
+	struct net *net = sock_net(skb->sk);
+	struct if_stats_msg *ifsm;
+	struct net_device *dev = NULL;
+	struct sk_buff *nskb;
+	u32 filter_mask;
+	int err;
+
+	ifsm = nlmsg_data(nlh);
+	if (ifsm->ifindex > 0)
+		dev = __dev_get_by_index(net, ifsm->ifindex);
+	else
+		return -EINVAL;
+
+	if (!dev)
+		return -ENODEV;
+
+	filter_mask = ifsm->filter_mask;
+	if (!filter_mask)
+		return -EINVAL;
+
+	nskb = nlmsg_new(if_nlmsg_stats_size(dev, filter_mask), GFP_KERNEL);
+	if (!nskb)
+		return -ENOBUFS;
+
+	err = rtnl_fill_statsinfo(nskb, dev, RTM_NEWSTATS,
+				  NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0,
+				  0, filter_mask);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in if_nlmsg_stats_size */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(nskb);
+	} else {
+		err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid);
+	}
+
+	return err;
+}
+
+static u16 rtnl_stats_calcit(struct sk_buff *skb, struct nlmsghdr *nlh)
+{
+	struct net *net = sock_net(skb->sk);
+	struct net_device *dev;
+	u16 min_ifinfo_dump_size = 0;
+	struct if_stats_msg *ifsm;
+	u32 filter_mask;
+
+	ifsm = nlmsg_data(nlh);
+	filter_mask = ifsm->filter_mask;
+
+	/* traverse the list of net devices and compute the minimum
+	 * buffer size based upon the filter mask.
+	 */
+	list_for_each_entry(dev, &net->dev_base_head, dev_list) {
+		min_ifinfo_dump_size = max_t(u16, min_ifinfo_dump_size,
+					     if_nlmsg_stats_size(dev,
+								 filter_mask));
+	}
+
+	return min_ifinfo_dump_size;
+}
+
+static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net *net = sock_net(skb->sk);
+	struct if_stats_msg *ifsm;
+	int h, s_h;
+	int idx = 0, s_idx;
+	struct net_device *dev;
+	struct hlist_head *head;
+	unsigned int flags = NLM_F_MULTI;
+	u32 filter_mask = 0;
+	int err;
+
+	s_h = cb->args[0];
+	s_idx = cb->args[1];
+
+	cb->seq = net->dev_base_seq;
+
+	ifsm = nlmsg_data(cb->nlh);
+	filter_mask = ifsm->filter_mask;
+
+	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
+		idx = 0;
+		head = &net->dev_index_head[h];
+		hlist_for_each_entry(dev, head, index_hlist) {
+			if (idx < s_idx)
+				goto cont;
+			err = rtnl_fill_statsinfo(skb, dev, RTM_NEWSTATS,
+						  NETLINK_CB(cb->skb).portid,
+						  cb->nlh->nlmsg_seq, 0,
+						  flags, filter_mask);
+			/* If we ran out of room on the first message,
+			 * we're in trouble
+			 */
+			WARN_ON((err == -EMSGSIZE) && (skb->len == 0));
+
+			if (err < 0)
+				goto out;
+
+			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+cont:
+			idx++;
+		}
+	}
+out:
+	cb->args[1] = idx;
+	cb->args[0] = h;
+
+	return skb->len;
+}
+
 /* Process one rtnetlink message. */
 
 static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
@@ -3600,4 +3797,7 @@ void __init rtnetlink_init(void)
 	rtnl_register(PF_BRIDGE, RTM_GETLINK, NULL, rtnl_bridge_getlink, NULL);
 	rtnl_register(PF_BRIDGE, RTM_DELLINK, rtnl_bridge_dellink, NULL, NULL);
 	rtnl_register(PF_BRIDGE, RTM_SETLINK, rtnl_bridge_setlink, NULL, NULL);
+
+	rtnl_register(PF_UNSPEC, RTM_GETSTATS, rtnl_stats_get, rtnl_stats_dump,
+		      rtnl_stats_calcit);
 }
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next v2 0/2] rtnetlink: new message for stats
From: Roopa Prabhu @ 2016-04-09  6:38 UTC (permalink / raw)
  To: netdev; +Cc: jhs, davem

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This series adds a new RTM_GETSTATS message to query link stats from the
kernel via netlink. RTM_NEWLINK also dumps stats today, but it returns a
lot more than just stats, which makes it expensive when userspace polls
for stats frequently.

RTM_GETSTATS is an attempt to provide a lightweight netlink message
to explicitly query only link stats from the kernel on an interface.
The idea is to also keep it extensible so that new kinds of stats can be
added to it in the future.
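
For illustration only, a userspace request for this message might be laid
out roughly as in the sketch below. It is not part of the series. All
sketch_/SKETCH_ names are hypothetical. The message type value 94 comes from
the rtnetlink.h hunk in patch 1; the header layout, the attribute value 1 for
IFLA_STATS_LINK64, and the filter-bit formula are assumptions, because the
if_link.h definitions are not reproduced in this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>	/* struct nlmsghdr, NLMSG_* macros, NLM_F_REQUEST */

/* Value taken from the rtnetlink.h hunk of patch 1. */
#define SKETCH_RTM_GETSTATS 94

/* Assumptions: attribute number and bit formula are not shown in this mail. */
#define SKETCH_IFLA_STATS_LINK64 1
#define SKETCH_STATS_FILTER_BIT(attr) (1U << ((attr) - 1))

/* Assumed layout of the request header; only ifindex and filter_mask are
 * confirmed by the code in patch 1.
 */
struct sketch_if_stats_msg {
	uint8_t  family;
	uint8_t  pad1;
	uint16_t pad2;
	uint32_t ifindex;
	uint32_t filter_mask;
};

/* Build a single-interface RTM_GETSTATS request into buf.
 * Returns the netlink message length, or 0 if buf is too small.
 * buf must be suitably aligned for struct nlmsghdr.
 */
size_t sketch_build_getstats(void *buf, size_t buflen, uint32_t ifindex,
			     uint32_t filter_mask, uint32_t seq)
{
	struct nlmsghdr *nlh = buf;
	struct sketch_if_stats_msg *ifsm;
	size_t len = NLMSG_LENGTH(sizeof(*ifsm));

	assert(buf != NULL);
	if (buflen < NLMSG_ALIGN(len))
		return 0;

	memset(buf, 0, NLMSG_ALIGN(len));
	nlh->nlmsg_len = (uint32_t)len;
	nlh->nlmsg_type = SKETCH_RTM_GETSTATS;
	nlh->nlmsg_flags = NLM_F_REQUEST;
	nlh->nlmsg_seq = seq;

	ifsm = NLMSG_DATA(nlh);
	ifsm->ifindex = ifindex;
	ifsm->filter_mask = filter_mask;

	return len;
}
```

With the code in patch 1, a non-dump request like this is rejected with
-EINVAL if ifindex is not positive or if filter_mask is zero, so a caller
would set at least one filter bit, for example
SKETCH_STATS_FILTER_BIT(SKETCH_IFLA_STATS_LINK64).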


Roopa Prabhu (2):
  rtnetlink: add new RTM_GETSTATS to dump link stats
  ipv6: add support for stats via RTM_GETSTATS

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>

RFC to v1 (apologies for the delay in sending this version out. busy days):
        - Addressed feedback from Dave
                - removed rtnl_link_stats
                - Added hdr struct if_stats_msg to carry ifindex and
                  filter mask
                - new macro IFLA_STATS_FILTER_BIT(ATTR) for filter mask
        - split the ipv6 patch into a separate patch, need some more eyes on it
        - prefix attributes with IFLA_STATS instead of IFLA_LINK_STATS for shorter
          attribute names

v1 - v2:
        - move IFLA_STATS_INET6 declaration to the inet6 patch
        - get rid of RTM_DELSTATS
        - mark ipv6 patch RFC. It can be used as an example for
          other AF stats
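
A rough userspace model of how per-AF hooks contribute to the reply size,
in the spirit of rtnl_link_get_af_stats_size() in patch 1, could look like
the sketch below. Everything in it is hypothetical: the sketch_ names, the
attribute numbers, the bit formula, and the 64-byte placeholder size are
assumptions chosen for illustration, not values from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed attribute numbers and bit formula (not shown in this mail). */
enum { SKETCH_STATS_LINK64 = 1, SKETCH_STATS_INET6 = 2 };
#define SKETCH_FILTER_BIT(attr) (1U << ((attr) - 1))

/* Hypothetical stand-in for the per-address-family ops structure. */
struct sketch_af_ops {
	const char *name;
	/* Optional hook: bytes this family adds for the given filter mask. */
	size_t (*get_stats_size)(uint32_t filter_mask);
};

/* Placeholder inet6 hook: contributes only when its filter bit is set. */
size_t sketch_inet6_stats_size(uint32_t filter_mask)
{
	if (filter_mask & SKETCH_FILTER_BIT(SKETCH_STATS_INET6))
		return 64;	/* placeholder size, not a real value */
	return 0;
}

/* One family with a hook, one without (hooks are optional). */
const struct sketch_af_ops sketch_ops[] = {
	{ "inet6", sketch_inet6_stats_size },
	{ "other", NULL },
};

/* Sum the contribution of every family that provides the hook. */
size_t sketch_total_af_size(const struct sketch_af_ops *ops, size_t n,
			    uint32_t filter_mask)
{
	size_t size = 0;
	size_t i;

	assert(ops != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (ops[i].get_stats_size)
			size += ops[i].get_stats_size(filter_mask);
	}

	return size;
}
```

A new address family would then only need to supply its own size hook (and
a matching fill hook) keyed on its own filter bit.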
        

 include/net/rtnetlink.h        |   5 +
 include/uapi/linux/if_link.h   |  19 ++++
 include/uapi/linux/rtnetlink.h |   7 ++
 net/core/rtnetlink.c           | 201 +++++++++++++++++++++++++++++++++++++++++
 net/ipv6/addrconf.c            |  77 ++++++++++++++--
 5 files changed, 301 insertions(+), 8 deletions(-)

-- 
1.9.1

^ permalink raw reply
