* Re: [PATCH net] ethtool: do not print warning for applications using legacy API
From: David Decotigny @ 2017-12-31 3:56 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David S. Miller, netdev, linux-kernel
In-Reply-To: <20171229180252.6981-1-sthemmin@microsoft.com>
Signed-off-by: David Decotigny <decot@googlers.com>
On Fri, Dec 29, 2017 at 10:02 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> From: Stephen Hemminger <stephen@networkplumber.org>
>
> In kernel log ths message appears on every boot:
> "warning: `NetworkChangeNo' uses legacy ethtool link settings API,
> link modes are only partially reported"
>
> When ethtool link settings API changed, it started complaining about
> usages of old API. Ironically, the original patch was from google but
> the application using the legacy API is chrome.
>
> Linux ABI is fixed as much as possible. The kernel must not break it
> and should not complain about applications using legacy API's.
> This patch just removes the warning since using legacy API's
> in Linux is perfectly acceptable.
>
> Fixes: 3f1ac7a700d0 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
> net/core/ethtool.c | 15 ++-------------
> 1 file changed, 2 insertions(+), 13 deletions(-)
>
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index f8fcf450a36e..8225416911ae 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -770,15 +770,6 @@ static int ethtool_set_link_ksettings(struct net_device *dev,
> return dev->ethtool_ops->set_link_ksettings(dev, &link_ksettings);
> }
>
> -static void
> -warn_incomplete_ethtool_legacy_settings_conversion(const char *details)
> -{
> - char name[sizeof(current->comm)];
> -
> - pr_info_once("warning: `%s' uses legacy ethtool link settings API, %s\n",
> - get_task_comm(name, current), details);
> -}
> -
> /* Query device for its ethtool_cmd settings.
> *
> * Backward compatibility note: for compatibility with legacy ethtool,
> @@ -805,10 +796,8 @@ static int ethtool_get_settings(struct net_device *dev, void __user *useraddr)
> &link_ksettings);
> if (err < 0)
> return err;
> - if (!convert_link_ksettings_to_legacy_settings(&cmd,
> - &link_ksettings))
> - warn_incomplete_ethtool_legacy_settings_conversion(
> - "link modes are only partially reported");
> + convert_link_ksettings_to_legacy_settings(&cmd,
> + &link_ksettings);
>
> /* send a sensible cmd tag back to user */
> cmd.cmd = ETHTOOL_GSET;
> --
> 2.11.0
>
^ permalink raw reply
* Re: [PATCH] rds: fix use-after-free read in rds_find_bound
From: santosh.shilimkar @ 2017-12-31 5:09 UTC (permalink / raw)
To: Sowmini Varadhan; +Cc: netdev, davem
In-Reply-To: <20171230223238.GC27855@oracle.com>
On 12/30/17 2:32 PM, Sowmini Varadhan wrote:
> On (12/30/17 13:37), santosh.shilimkar@oracle.com wrote:
[...]
>> Thats what I thought as well initially but since the reported case,
>> the rs seems to be valid where as sk seems to be freed up as part of
>> sock_release callback.
>
> I dont understand the statement above- how can "rs be valid, and sk
> be freed"?
>
> rs_sk is embedded in the struct rds_sock, it is not a pointer.
>
I was going with order of evaluation of if () but you made good point.
rs_sk isn't a pointer so sk can't be null.
> let's find and fix the refcount bug. See stack trace in commit comment.
> The socket release is happening prematurely and existing WARN_ONs
> are not catching it.
>
Right. This was loop transport in action so xmit will just flip
the direction with receive. And rds_recv_incoming() can race with
socket_release. rds_find_bound() is suppose to add ref count on
socket for rds_recv_incoming() but by that time socket is DEAD &
freed by socket release callback.
And rds_release is suppose to be synced with rs_recv_lock. But
release callback is marking the sk orphan before syncing
up with receive path and updating the bind table. Probably it
can pushed down further after the socket clean up buut need
to think bit more.
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b405f77..11e1426 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -65,7 +65,6 @@ static int rds_release(struct socket *sock)
rs = rds_sk_to_rs(sk);
- sock_orphan(sk);
/* Note - rds_clear_recv_queue grabs rs_recv_lock, so
* that ensures the recv path has completed messing
* with the socket. */
@@ -85,6 +84,7 @@ static int rds_release(struct socket *sock)
rds_trans_put(rs->rs_transport);
+ sock_orphan(sk);
sock->sk = NULL;
sock_put(sk);
out:
^ permalink raw reply related
* GREETINGS FROM MOHAMMED AHMED .
From: Mr Mohamad Ahmed @ 2017-12-31 5:22 UTC (permalink / raw)
Hi Friend.
I am Mr. Mohammed Ahmed a banker in Bank of Africa Burkina Faso West
Africa, Please i want to transfer an abandoned sum of 13.5 millions
USD to your account.50% will be for you and 50% for me.
No risk involved. Contact me for more details along with your personal
information needed below.
1. Full name:.........
2. Current Address:.........
3. Phone.............
4. Occupation:.............
5. Age:............
6. Country:........
7. Sex........
8. Your Passport or ID card or Driving License
Thanks.
Mr. Mohammed Ahmed.
^ permalink raw reply
* Re: xfrm: Forbid state updates from changing encap type
From: Steffen Klassert @ 2017-12-31 7:49 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
In-Reply-To: <20171226063444.GA26819@gondor.apana.org.au>
On Tue, Dec 26, 2017 at 05:34:44PM +1100, Herbert Xu wrote:
> Currently we allow state updates to competely replace the contents
> of x->encap. This is bad because on the user side ESP only sets up
> header lengths depending on encap_type once when the state is first
> created. This could result in the header lengths getting out of
> sync with the actual state configuration.
>
> In practice key managers will never do a state update to change the
> encapsulation type. Only the port numbers need to be changed as the
> peer NAT entry is updated.
>
> Therefore this patch adds a check in xfrm_state_update to forbid
> any changes to the encap_type.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied to the ipsec tree, thanks Herbert!
^ permalink raw reply
* Re: [PATCH ipsec] xfrm: skip policies marked as dead while rehashing
From: Steffen Klassert @ 2017-12-31 7:50 UTC (permalink / raw)
To: Florian Westphal
Cc: netdev, syzkaller-bugs, Herbert Xu, Timo Teras,
Christophe Gouault
In-Reply-To: <20171227222545.22219-1-fw@strlen.de>
On Wed, Dec 27, 2017 at 11:25:45PM +0100, Florian Westphal wrote:
> syzkaller triggered following KASAN splat:
>
> BUG: KASAN: slab-out-of-bounds in xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618
> read of size 2 at addr ffff8801c8e92fe4 by task kworker/1:1/23 [..]
> Workqueue: events xfrm_hash_rebuild [..]
> __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428
> xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618
> process_one_work+0xbbf/0x1b10 kernel/workqueue.c:2112
> worker_thread+0x223/0x1990 kernel/workqueue.c:2246 [..]
>
> The reproducer triggers:
> 1016 if (error) {
> 1017 list_move_tail(&walk->walk.all, &x->all);
> 1018 goto out;
> 1019 }
>
> in xfrm_policy_walk() via pfkey (it sets tiny rcv space, dump
> callback returns -ENOBUFS).
>
> In this case, *walk is located the pfkey socket struct, so this socket
> becomes visible in the global policy list.
>
> It looks like this is intentional -- phony walker has walk.dead set to 1
> and all other places skip such "policies".
>
> Ccing original authors of the two commits that seem to expose this
> issue (first patch missed ->dead check, second patch adds pfkey
> sockets to policies dumper list).
>
> Fixes: 880a6fab8f6ba5b ("xfrm: configure policy hash table thresholds by netlink")
> Fixes: 12a169e7d8f4b1c ("ipsec: Put dumpers on the dump list")
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Timo Teras <timo.teras@iki.fi>
> Cc: Christophe Gouault <christophe.gouault@6wind.com>
> Reported-by: syzbot <bot+c028095236fcb6f4348811565b75084c754dc729@syzkaller.appspotmail.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
Applied, thanks a lot!
^ permalink raw reply
* Re: [PATCH] af_key: fix buffer overread in verify_address_len()
From: Steffen Klassert @ 2017-12-31 7:51 UTC (permalink / raw)
To: Eric Biggers
Cc: netdev, Herbert Xu, David S . Miller, Alexander Potapenko,
Dmitry Vyukov, Kostya Serebryany, syzkaller, Eric Biggers, stable
In-Reply-To: <20171230001305.15553-1-ebiggers3@gmail.com>
On Fri, Dec 29, 2017 at 06:13:05PM -0600, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
>
> If a message sent to a PF_KEY socket ended with one of the extensions
> that takes a 'struct sadb_address' but there were not enough bytes
> remaining in the message for the ->sa_family member of the 'struct
> sockaddr' which is supposed to follow, then verify_address_len() read
> past the end of the message, into uninitialized memory. Fix it by
> returning -EINVAL in this case.
>
> This bug was found using syzkaller with KMSAN.
>
> Reproducer:
>
> #include <linux/pfkeyv2.h>
> #include <sys/socket.h>
> #include <unistd.h>
>
> int main()
> {
> int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2);
> char buf[24] = { 0 };
> struct sadb_msg *msg = (void *)buf;
> struct sadb_address *addr = (void *)(msg + 1);
>
> msg->sadb_msg_version = PF_KEY_V2;
> msg->sadb_msg_type = SADB_DELETE;
> msg->sadb_msg_len = 3;
> addr->sadb_address_len = 1;
> addr->sadb_address_exttype = SADB_EXT_ADDRESS_SRC;
>
> write(sock, buf, 24);
> }
>
> Reported-by: Alexander Potapenko <glider@google.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Eric Biggers <ebiggers@google.com>
Applied to the ipsec tree, thanks Eric!
^ permalink raw reply
* Re: [PATCH] af_key: fix buffer overread in parse_exthdrs()
From: Steffen Klassert @ 2017-12-31 7:52 UTC (permalink / raw)
To: Eric Biggers
Cc: netdev, Herbert Xu, David S . Miller, Alexander Potapenko,
Dmitry Vyukov, Kostya Serebryany, syzkaller, Eric Biggers, stable
In-Reply-To: <20171230001523.15761-1-ebiggers3@gmail.com>
On Fri, Dec 29, 2017 at 06:15:23PM -0600, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
>
> If a message sent to a PF_KEY socket ended with an incomplete extension
> header (fewer than 4 bytes remaining), then parse_exthdrs() read past
> the end of the message, into uninitialized memory. Fix it by returning
> -EINVAL in this case.
>
> Reproducer:
>
> #include <linux/pfkeyv2.h>
> #include <sys/socket.h>
> #include <unistd.h>
>
> int main()
> {
> int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2);
> char buf[17] = { 0 };
> struct sadb_msg *msg = (void *)buf;
>
> msg->sadb_msg_version = PF_KEY_V2;
> msg->sadb_msg_type = SADB_DELETE;
> msg->sadb_msg_len = 2;
>
> write(sock, buf, 17);
> }
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Eric Biggers <ebiggers@google.com>
Also applied to the ipsec tree, thanks a lot!
^ permalink raw reply
* Re: [PATCH net-next 4/7] net: phy: add paged phy register accessors
From: Andrew Lunn @ 2017-12-31 8:10 UTC (permalink / raw)
To: Russell King; +Cc: Florian Fainelli, netdev
In-Reply-To: <E1eUtoW-0002A1-TM@rmk-PC.armlinux.org.uk>
> +/**
> + * phy_save_page() - take the bus lock and save the current page
> + * @phydev: a pointer to a &struct phy_device
> + *
> + * Take the MDIO bus lock, and return the current page number. On error,
> + * returns a negative errno. phy_restore_page() must be called after this
> + * to release the lock even on failure.
> + */
> +int phy_save_page(struct phy_device *phydev)
> +{
> + mutex_lock(&phydev->mdio.bus->mdio_lock);
> + return __phy_read_page(phydev);
> +}
> +EXPORT_SYMBOL_GPL(phy_save_page);
> +
> +/**
> + * phy_select_page() - take the bus lock, save the current page, and set a page
> + * @phydev: a pointer to a &struct phy_device
> + * @page: desired page
> + *
> + * Take the MDIO bus lock to protect against concurrent access, save the
> + * current PHY page, and set the current page. On error, returns a
> + * negative errno, otherwise returns the previous page number.
> + * phy_restore_page() must be called after this to restore the page
> + * number (if this call was successful) and release the lock.
Hi Russell
This comment seems wrong. It looks like you need to call
phy_restore_page() on error as well. I think the text in () should be
removed, and add the "even on failure" which the previous function
states.
Andrew
^ permalink raw reply
* Re: [PATCH net-next 0/4] mlx4 misc for 4.16
From: Tariq Toukan @ 2017-12-31 8:29 UTC (permalink / raw)
To: David Miller, tariqt; +Cc: netdev, eranbe
In-Reply-To: <20171228.122841.875412773096063390.davem@davemloft.net>
On 28/12/2017 7:28 PM, David Miller wrote:
> From: Tariq Toukan <tariqt@mellanox.com>
> Date: Thu, 28 Dec 2017 16:26:07 +0200
>
>> This patchset contains misc cleanups and improvements
>> to the mlx4 Core and Eth drivers.
>>
>> In patches 1 and 2 I reduce and reorder the branches in the RX csum flow.
>> In patch 3 I align the FMR unmapping flow with the device spec, to allow
>> a remapping afterwards.
>> Patch 4 by Moni changes the default QoS settings so that a pause
>> frame stops all traffic regardless of its prio.
>>
>> Series generated against net-next commit:
>> 836df24a7062 net: hns3: hns3_get_channels() can be static
>
> Series applied, thanks Tariq.
>
> I can't say that that ipv6 ifdef you added in patch #1 is the nicest.
>
> It's one thing to ifdef control entire blocks of code, or individual
> values.
>
> It's another thing to partially include pieces of code structure
> such as closing parenthesis, arithmetic operators, and curly braces.
> Two of those were included in the ifdef section.
>
> For example, if we have:
>
> if (x & (IPV4_THING | IPV6_THING)) {
>
> I don't want to see:
>
> if (x & (IPV4_THING |
> #ifdef IPV6_TEST
> IPV6_THING)) {
> #else
> 0)) {
> #endif
>
> Those closing parenthesis and the openning curly brace are there
> regardless of the CPP test, and duplicating them in both arms of
> the CPP test only makes the code more confusing to read.
>
> I can say I would prefer:
>
> if (x & (IPV4_THING |
> #ifdef IPV6_TEST
> IPV6_THING
> #else
> 0
> #endif
> )) {
>
> which is better. However, best would be:
>
> #ifdef IPV6_TEST
> #define THING_MASK IPV4_THING | IPV6_THING
> #else
> #define THING_MASK IPV4_THING
> #endif
>
> if (x & THING_MASK) {
>
> is the best.
>
Hi Dave,
Thanks for accepting the series.
Your suggestions look better indeed, I will prepare a followup patch and
submit soon.
Regards,
Tariq
^ permalink raw reply
* Re: general protection fault in skb_segment
From: Xin Long @ 2017-12-31 8:41 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Willem de Bruijn, syzbot, David Miller, LKML, linux-sctp,
Network Development, Neil Horman, syzkaller-bugs, Vlad Yasevich
In-Reply-To: <20171231022524.GE22042@localhost.localdomain>
On Sun, Dec 31, 2017 at 10:25 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Sat, Dec 30, 2017 at 10:52:20PM -0200, Marcelo Ricardo Leitner wrote:
>> On Sat, Dec 30, 2017 at 08:42:41AM +0100, Willem de Bruijn wrote:
> [...]
>> > Somewhat tangential, but any PF_PACKET socket can set this
>> > magic gso_size value in its virtio_net_hdr, so if it is assumed to
>> > be an SCTP GSO specific option, setting it for a TCP GSO packet
>> > may also cause unexpected results.
>>
>> It seems virtio_net could use more sanity checks. When PACKET_VNET_HDR
>> is used, it will end up calling:
>> tpacket_rcv() {
>> ...
>> if (do_vnet) {
>> if (virtio_net_hdr_from_skb(skb, h.raw + macoff -
>> sizeof(struct virtio_net_hdr),
>> vio_le(), true)) {
>> spin_lock(&sk->sk_receive_queue.lock);
>> goto drop_n_account;
>> }
>> }
>>
>> and virtio_net_hdr_from_skb does:
>> if (skb_is_gso(skb)) {
>> ...
>> if (sinfo->gso_type & SKB_GSO_TCPV4)
>> hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>> else if (sinfo->gso_type & SKB_GSO_TCPV6)
>> hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
>> else
>> return -EINVAL;
>>
>> Meaning that any gso_type other than TCP would be rejected, but this
>> SCTP one got through. Seems the header contains a sctp header, but the
>> gso_type set was actually pointing to TCP (otherwise it would have
>> been rejected). AFAICT if this packet had an ESP header, for example,
>> it could have hit esp4_gso_segment. Can you please confirm this?
>
> I added:
> --- a/net/sctp/offload.c
> +++ b/net/sctp/offload.c
> @@ -44,6 +44,18 @@ static struct sk_buff *sctp_gso_segment(struct sk_buff *skb,
> {
> struct sk_buff *segs = ERR_PTR(-EINVAL);
> struct sctphdr *sh;
> + int fail = 0;
> +
> + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)) {
> + printk("Bogus gso_type: %x\n", skb_shinfo(skb)->gso_type);
> + fail = 1;
> + }
> + if (skb_shinfo(skb)->gso_size != GSO_BY_FRAGS) {
> + printk("Bogus gso_size: %u\n", skb_shinfo(skb)->gso_size);
> + fail = 1;
> + }
> + if (fail)
> + goto out;
>
> sh = sctp_hdr(skb);
> if (!pskb_may_pull(skb, sizeof(*sh)))
>
> and with the reproducer, got:
> [ 54.255469] Bogus gso_type: 7
> [ 54.258801] Bogus gso_size: 63464
> [ 54.262532] ------------[ cut here ]------------
> [ 54.267703] syz0: caps=(0x00000800000058c1, 0x0000000000000000) len=32 data_len=0 gso_size=63464 gso_type=7 ip_summed0
> [ 54.279777] WARNING: CPU: 1 PID: 13005 at /root/linux/net/core/dev.c:2600 skb_warn_bad_offload+0xd6/0xec
I couldn't reproduce this call trace on net-next, maybe it's been fixed by:
commit 8d74e9f88d65af8bb2e095aff506aa6eac755ada
Author: Willem de Bruijn <willemb@google.com>
Date: Tue Dec 12 11:39:04 2017 -0500
net: avoid skb_warn_bad_offload on IS_ERR
^ permalink raw reply
* Re: general protection fault in skb_segment
From: Willem de Bruijn @ 2017-12-31 9:17 UTC (permalink / raw)
To: Xin Long
Cc: Marcelo Ricardo Leitner, syzbot, David Miller, LKML, linux-sctp,
Network Development, Neil Horman, syzkaller-bugs, Vlad Yasevich
In-Reply-To: <CADvbK_d0_486KX-wtfQa=0aVGv0mtaFs4JeMTbv6n+aMWqU3hg@mail.gmail.com>
>> and with the reproducer, got:
>> [ 54.255469] Bogus gso_type: 7
>> [ 54.258801] Bogus gso_size: 63464
>> [ 54.262532] ------------[ cut here ]------------
>> [ 54.267703] syz0: caps=(0x00000800000058c1, 0x0000000000000000) len=32 data_len=0 gso_size=63464 gso_type=7 ip_summed0
>> [ 54.279777] WARNING: CPU: 1 PID: 13005 at /root/linux/net/core/dev.c:2600 skb_warn_bad_offload+0xd6/0xec
> I couldn't reproduce this call trace on net-next, maybe it's been fixed by:
> commit 8d74e9f88d65af8bb2e095aff506aa6eac755ada
> Author: Willem de Bruijn <willemb@google.com>
> Date: Tue Dec 12 11:39:04 2017 -0500
>
> net: avoid skb_warn_bad_offload on IS_ERR
Yes, I forgot to mention that that has been fixed in net-next.
It does not address the crash, but does suppress the gratuitous
warning once we make segmentation return with error for bad
packets like these.
^ permalink raw reply
* Re: general protection fault in skb_segment
From: Willem de Bruijn @ 2017-12-31 9:52 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: syzbot, David Miller, LKML, linux-sctp, Network Development,
Neil Horman, syzkaller-bugs, Vladislav Yasevich
In-Reply-To: <20171231005220.GD22042@localhost.localdomain>
> It seems virtio_net could use more sanity checks. When PACKET_VNET_HDR
> is used, it will end up calling:
> tpacket_rcv() {
> ...
> if (do_vnet) {
> if (virtio_net_hdr_from_skb(skb, h.raw + macoff -
> sizeof(struct virtio_net_hdr),
> vio_le(), true)) {
> spin_lock(&sk->sk_receive_queue.lock);
> goto drop_n_account;
> }
> }
>
> and virtio_net_hdr_from_skb does:
> if (skb_is_gso(skb)) {
> ...
> if (sinfo->gso_type & SKB_GSO_TCPV4)
> hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> else if (sinfo->gso_type & SKB_GSO_TCPV6)
> hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> else
> return -EINVAL;
That is the receive path, but the send path is analogous. Just adds
UFO.
> Meaning that any gso_type other than TCP would be rejected, but this
> SCTP one got through. Seems the header contains a sctp header, but the
> gso_type set was actually pointing to TCP (otherwise it would have
> been rejected). AFAICT if this packet had an ESP header, for example,
> it could have hit esp4_gso_segment. Can you please confirm this?
I have not tested this yet, but it certainly seems plausible.
There is nothing ensuring consistency between gso_type and
the actual packet contents that are parsed to look up gso callbacks.
> I don't know of anywhere in the stack validating if the gso_type
> matches the header that actually is in there.
>
> The fix you mentioned is a good start, we want that one way or
> another, but I'm afraid this bug is bigger than sctp.
Good point. Packet sockets require CAP_NET_RAW, but this is also
taken for virtio, so we probably want more stringent entry tests here.
The alternative to harden the segmentation code itself with a gso_type
sanity check in every gso callback is more work and fragile.
Need to figure out whether a brief check for just TCP or UDP is sufficient
or we need a full flow dissector step to support tunnel headers and such.
^ permalink raw reply
* Re: [PATCH net-next 4/7] net: phy: add paged phy register accessors
From: Russell King - ARM Linux @ 2017-12-31 10:07 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Florian Fainelli, netdev
In-Reply-To: <20171231081039.GA14485@lunn.ch>
On Sun, Dec 31, 2017 at 09:10:39AM +0100, Andrew Lunn wrote:
> > +/**
> > + * phy_save_page() - take the bus lock and save the current page
> > + * @phydev: a pointer to a &struct phy_device
> > + *
> > + * Take the MDIO bus lock, and return the current page number. On error,
> > + * returns a negative errno. phy_restore_page() must be called after this
> > + * to release the lock even on failure.
> > + */
> > +int phy_save_page(struct phy_device *phydev)
> > +{
> > + mutex_lock(&phydev->mdio.bus->mdio_lock);
> > + return __phy_read_page(phydev);
> > +}
> > +EXPORT_SYMBOL_GPL(phy_save_page);
> > +
> > +/**
> > + * phy_select_page() - take the bus lock, save the current page, and set a page
> > + * @phydev: a pointer to a &struct phy_device
> > + * @page: desired page
> > + *
> > + * Take the MDIO bus lock to protect against concurrent access, save the
> > + * current PHY page, and set the current page. On error, returns a
> > + * negative errno, otherwise returns the previous page number.
> > + * phy_restore_page() must be called after this to restore the page
> > + * number (if this call was successful) and release the lock.
>
> Hi Russell
>
> This comment seems wrong. It looks like you need to call
> phy_restore_page() on error as well. I think the text in () should be
> removed, and add the "even on failure" which the previous function
> states.
It's one of those slightly confusing bits of English, and I doubt
there's a way to express it better.
phy_restore_page() must be called after this to [restore the page
number (if this call was successful) and] release the lock.
I could use:
phy_restore_page() must always be called to release the lock,
and restore the page number if this call was successful.
but that can still be read in an ambiguous manner.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
^ permalink raw reply
* Re: [PATCH net-next 2/2] tun: allow to attach ebpf socket filter
From: Willem de Bruijn @ 2017-12-31 10:14 UTC (permalink / raw)
To: Jason Wang
Cc: Network Development, LKML, Michael S. Tsirkin, Willem de Bruijn
In-Reply-To: <1514515491-6041-3-git-send-email-jasowang@redhat.com>
On Fri, Dec 29, 2017 at 3:44 AM, Jason Wang <jasowang@redhat.com> wrote:
> This patch allows userspace to attach eBPF filter to tun. This will
> allow to implement VM dataplane filtering in a more efficient way
> compared to cBPF filter.
Is the idea to allow the trusted hypervisor to install these programs,
or the untrusted guests?
eBPF privilege escalations like those recently described in
https://lwn.net/Articles/742170/ would give me pause to expose
this to guests.
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> drivers/net/tun.c | 26 ++++++++++++++++++++++++++
> include/uapi/linux/if_tun.h | 1 +
> 2 files changed, 27 insertions(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 0853829..6e9452b 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -238,6 +238,7 @@ struct tun_struct {
> struct tun_pcpu_stats __percpu *pcpu_stats;
> struct bpf_prog __rcu *xdp_prog;
> struct tun_prog __rcu *steering_prog;
> + struct tun_prog __rcu *filter_prog;
> };
>
> static int tun_napi_receive(struct napi_struct *napi, int budget)
> @@ -984,12 +985,25 @@ static void tun_automq_xmit(struct tun_struct *tun, struct sk_buff *skb)
> #endif
> }
>
> +static unsigned int run_ebpf_filter(struct tun_struct *tun,
> + struct sk_buff *skb,
> + int len)
> +{
> + struct tun_prog *prog = rcu_dereference(tun->filter_prog);
> +
> + if (prog)
> + len = bpf_prog_run_clear_cb(prog->prog, skb);
> +
> + return len;
> +}
> +
> /* Net device start xmit */
> static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> struct tun_struct *tun = netdev_priv(dev);
> int txq = skb->queue_mapping;
> struct tun_file *tfile;
> + int len = skb->len;
>
> rcu_read_lock();
> tfile = rcu_dereference(tun->tfiles[txq]);
> @@ -1015,9 +1029,16 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> sk_filter(tfile->socket.sk, skb))
> goto drop;
>
> + len = run_ebpf_filter(tun, skb, len);
> + if (!len)
> + goto drop;
> +
This adds a second filter step independent of the sk_filter call above.
Perhaps the two filter interfaces can map onto to the same instance.
I imagine that qemu never programs SO_ATTACH_FILTER.
More importantly, should this program just return a boolean pass or
drop. Taking a length and trimming may introduce bugs later on if the
stack parses the packet unconditionally, expecting a minimum size
to be present.
This was the reason for introducing sk_filter_trim_cap and using that
in other sk_filter sites.
A quick scan shows that tun_put_user expects a full vlan tag to exist
if skb_vlan_tag_present(skb), for instance. If trimmed to below this
length the final call to skb_copy_datagram_iter may have negative
length.
This is an issue with the existing sk_filter call as much as with the
new run_ebpf_filter call.
> if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
> goto drop;
>
> + if (pskb_trim(skb, len))
> + goto drop;
> +
> skb_tx_timestamp(skb);
^ permalink raw reply
* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: Arkadi Sharshevsky @ 2017-12-31 10:52 UTC (permalink / raw)
To: David Ahern, Yuval Mintz, Roopa Prabhu, Jiri Pirko
Cc: netdev@vger.kernel.org, David Miller, mlxsw, Andrew Lunn,
Vivien Didelot, Florian Fainelli, Michael Chan,
ganeshgr@chelsio.com, Saeed Mahameed, Matan Barak,
Leon Romanovsky, Ido Schimmel, jakub.kicinski@netronome.com,
ast@kernel.org, Daniel Borkmann, Simon Horman,
pieter.jansenvanvuuren@netronome.com, "john.hurley@netronome
In-Reply-To: <409891dc-4223-c037-d669-7ee85845d331@cumulusnetworks.com>
On 12/30/2017 11:15 PM, David Ahern wrote:
> On 12/28/17 1:21 AM, Yuval Mintz wrote:
>> I think it goes the other way around. The dpipe tables are the ones that
>> can be translated to functionality; The resources are internal and HW-specific
>> representing the possible internal division of resources -
>> but a given resource sn't necessarily mapped to a single networking feature.
>> [It might be in some cases, but not in the general case]
>
> This is what I am getting at -- a single resource /kvd/linear is used
> for multiple networking features, and those networking features do map
> to well known entities -- fdb entries, ACL entries, ipv4/v6 host
> entries, LPM entries, etc.
>
> Nothing about the output from devlink helps the user in any way to
> understand how to change the resource values. Saying that these
The current patchset adds the following dpipe table <--> resource
relation
host4 -- hash single
host6 -- hash double
adj -- linear
By dumping the resources via the 'resource show' you can the tree like
structure, you can see that you have a tradeoff between those subparts.
So for example if a user would like to increase the number of nexthops
with the expense of neighbors, it is pretty clear. As more dpipe table
will be introduced this relations will be more complete and the user
will get the complete view of the ASIC.
Just to summarize, the user gets the following info
1. Constrains\trade off about setting the sizes - this you get
by the tree structure.
2. Each hardware process which use this resource is mapped to it
By combining those two you can get the most accurate information
about what your change will do. Partitioning of the KVD is very delicate
process, because the hardware is complex. Many hardware processes are
pointing to this memory and size changes effect the whole ASIC, as I
mentioned as more of the pipeline will be exposed via dpipe the user
will get a more precise vision of the hardware.
We will provide some recommended and tested configuration of the whole
mlxsw resource tree for different user scenarios. A more experienced
user can do it for himself, if he got some very special scenario.
> resources, what they mean and how they are used is MLX proprietary and
> is known only to MLX employees and those with MLX agreements is not
> acceptable. Likewise, requiring some network admin to deep dive into the
> mlxsw driver to piece together how kvd/linear (for example) is used is
> not acceptable.
>
> The cover letter touts "Many of the ASIC's internal resources are
> limited and are shared between several hardware procedures. For example,
> unified hash-based memory can be used for many lookup purposes, like FDB
> and LPM. In many cases the user can provide a partitioning scheme for
> such a resource in order to perform fine tuning for his application."
>
> Great, now give the user some indication of how to do that. Is setting
> /kvd/linear to 0 acceptable? If not, why? What functionality is lost?
> (Apparently, everything [1].)
>
> The dpipe tables list some correlation between the kvd resources and
> tables but that is not a complete list and again there is nothing to
> tell a user that it is only a partial list of how a kvd resource is
This is work in progress, the LPM block will be exposed as the last
L3 part. Then we will start the l2 part of the ASIC.
> used. For example, it shows ipv4 host is in /kvd/hash_single and that is
> all it shows. So if I have an ipv6 only deployment can I conclude that I
> can set /kvd/hash_single to 0? Or the reverse, can I set hash_double to
> 0 for an ipv4 only deployment? From the limited information given, it is
> reasonable for a user to assume yes and has to learn through trial and
> error what can be done. [2]
>
So you want to add min/max size attribute? I think this its not needed.
> -----
>
> [1] This is allowed by the current patch set and perhaps it should not be:
>
> $ ip ro ls vrf vrf1101
> unreachable default metric 8192
> 11.2.51.0/24 dev swp1s0.51 proto kernel scope link src 11.2.51.1 offload
> 11.3.51.0/24 dev swp1s1.51 proto kernel scope link src 11.3.51.1 offload
> 11.4.51.0/24 dev swp1s2.51 proto kernel scope link src 11.4.51.1 offload
> 11.5.51.0/24 dev swp1s3.51 proto kernel scope link src 11.5.51.1 offload
> 11.6.51.0/24 dev swp3s0.51 proto kernel scope link src 11.6.51.1 offload
> 11.7.51.0/24 dev swp3s1.51 proto kernel scope link src 11.7.51.1 offload
> 11.8.51.0/24 dev swp3s2.51 proto kernel scope link src 11.8.51.1 offload
> 11.9.51.0/24 dev swp3s3.51 proto kernel scope link src 11.9.51.1 offload
>
> $ devlink resource set pci/0000:03:00.0 path /kvd/linear size 0
This line actually did nothing, because size zero is not acceptable
see patch 6. This is pure userpsace problem that error is not shown.
You can verify it by dumping the resources and see that there is no
pending change (only size and not size_new).
> $ devlink reload pci/0000:03:00.0
> $ ip ro ls vrf vrf1101
> unreachable default metric 8192
>
So you just performed full reload of the driver which includes
unregistration of all the netdevs and full init. KVD update requires
full teardown of the driver.
The system will not get back to the same state after reloading,
It's should be done on init. But it doesn't have to be like this
this, each driver provides his own reload devlink op implementation
so in our case full blown reset is required.
> [2] Same exact result for setting hash_double to 0:
> $ ip ro ls vrf vrf1101
> unreachable default metric 8192
> 11.2.51.0/24 dev swp1s0.51 proto kernel scope link src 11.2.51.1 offload
> 11.3.51.0/24 dev swp1s1.51 proto kernel scope link src 11.3.51.1 offload
> 11.4.51.0/24 dev swp1s2.51 proto kernel scope link src 11.4.51.1 offload
> 11.5.51.0/24 dev swp1s3.51 proto kernel scope link src 11.5.51.1 offload
> 11.6.51.0/24 dev swp3s0.51 proto kernel scope link src 11.6.51.1 offload
> 11.7.51.0/24 dev swp3s1.51 proto kernel scope link src 11.7.51.1 offload
> 11.8.51.0/24 dev swp3s2.51 proto kernel scope link src 11.8.51.1 offload
> 11.9.51.0/24 dev swp3s3.51 proto kernel scope link src 11.9.51.1 offload
>
> $ devlink resource set pci/0000:03:00.0 path /kvd/hash_double size 0
> $ devlink reload pci/0000:03:00.0
> $ ip ro ls vrf vrf1101
> unreachable default metric 8192
>
^ permalink raw reply
* [bpf-next V3 PATCH 00/14] xdp: new XDP rx-queue info concept
From: Jesper Dangaard Brouer @ 2017-12-31 11:00 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
michael.chan
V3:
* Fixed bug in virtio_net driver
* Remove export of xdp_rxq_info_init()
V2:
* Changed API exposed to drivers
- Removed invocation of "init" in drivers, and only call "reg"
(Suggested by Saeed)
- Allow "reg" to fail and handle this in drivers
(Suggested by David Ahern)
* Removed the SINKQ qtype, instead allow to register as "unused"
* Also fixed some drivers during testing on actual HW (noted in patches)
There is a need for XDP to know more about the RX-queue a given XDP
frames have arrived on. For both the XDP bpf-prog and kernel side.
Instead of extending struct xdp_buff each time new info is needed,
this patchset takes a different approach. Struct xdp_buff is only
extended with a pointer to a struct xdp_rxq_info (allowing for easier
extending this later). This xdp_rxq_info contains information related
to how the driver have setup the individual RX-queue's. This is
read-mostly information, and all xdp_buff frames (in drivers
napi_poll) point to the same xdp_rxq_info (per RX-queue).
We stress this data/cache-line is for read-mostly info. This is NOT
for dynamic per packet info, use the data_meta for such use-cases.
This patchset start out small, and only expose ingress_ifindex and the
RX-queue index to the XDP/BPF program. Access to tangible info like
the ingress ifindex and RX queue index, is fairly easy to comprehent.
The other future use-cases could allow XDP frames to be recycled back
to the originating device driver, by providing info on RX device and
queue number.
As XDP doesn't have driver feature flags, and eBPF code due to
bpf-tail-calls cannot determine that XDP driver invoke it, this
patchset have to update every driver that support XDP.
For driver developers (review individual driver patches!):
The xdp_rxq_info is tied to the drivers RX-ring(s). Whenever a RX-ring
modification require (temporary) stopping RX frames, then the
xdp_rxq_info should (likely) also be unregistred and re-registered,
especially if reallocating the pages in the ring. Make sure ethtool
set_channels does the right thing. When replacing XDP prog, if and
only if RX-ring need to be changed, then also re-register the
xdp_rxq_info.
I'm Cc'ing the individual driver patches to the registered maintainers.
Testing:
I've only tested the NIC drivers I have hardware for. The general
test procedure is to (DUT = Device Under Test):
(1) run pktgen script pktgen_sample04_many_flows.sh (against DUT)
(2) run samples/bpf program xdp_rxq_info --dev $DEV (on DUT)
(3) runtime modify number of NIC queues via ethtool -L (on DUT)
(4) runtime modify number of NIC ring-size via ethtool -G (on DUT)
Patch based on git tree bpf-next (at commit fb982666e380c1632a):
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
---
Jesper Dangaard Brouer (14):
xdp: base API for new XDP rx-queue info concept
xdp/mlx5: setup xdp_rxq_info
i40e: setup xdp_rxq_info
ixgbe: setup xdp_rxq_info
xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg
mlx4: setup xdp_rxq_info
bnxt_en: setup xdp_rxq_info
nfp: setup xdp_rxq_info
thunderx: setup xdp_rxq_info
tun: setup xdp_rxq_info
virtio_net: setup xdp_rxq_info
xdp: generic XDP handling of xdp_rxq_info
bpf: finally expose xdp_rxq_info to XDP bpf-programs
samples/bpf: program demonstrating access to xdp_rxq_info
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 10
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 2
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 1
drivers/net/ethernet/cavium/thunder/nicvf_main.c | 11
drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 4
drivers/net/ethernet/cavium/thunder/nicvf_queues.h | 2
drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 18 +
drivers/net/ethernet/intel/i40e/i40e_txrx.h | 3
drivers/net/ethernet/intel/ixgbe/ixgbe.h | 2
drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 4
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10
drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 3
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 13
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 4
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 9
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 1
drivers/net/ethernet/netronome/nfp/nfp_net.h | 5
.../net/ethernet/netronome/nfp/nfp_net_common.c | 10
drivers/net/ethernet/qlogic/qede/qede.h | 2
drivers/net/ethernet/qlogic/qede/qede_fp.c | 1
drivers/net/ethernet/qlogic/qede/qede_main.c | 10
drivers/net/tun.c | 24 +
drivers/net/virtio_net.c | 14 -
include/linux/filter.h | 2
include/linux/netdevice.h | 2
include/net/xdp.h | 48 ++
include/uapi/linux/bpf.h | 3
net/core/Makefile | 2
net/core/dev.c | 69 ++-
net/core/filter.c | 19 +
net/core/xdp.c | 73 +++
samples/bpf/Makefile | 4
samples/bpf/xdp_rxq_info_kern.c | 96 ++++
samples/bpf/xdp_rxq_info_user.c | 532 ++++++++++++++++++++
36 files changed, 991 insertions(+), 28 deletions(-)
create mode 100644 include/net/xdp.h
create mode 100644 net/core/xdp.c
create mode 100644 samples/bpf/xdp_rxq_info_kern.c
create mode 100644 samples/bpf/xdp_rxq_info_user.c
^ permalink raw reply
* [bpf-next V3 PATCH 01/14] xdp: base API for new XDP rx-queue info concept
From: Jesper Dangaard Brouer @ 2017-12-31 11:00 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
michael.chan
In-Reply-To: <151471801977.30703.3258796879718706203.stgit@firesoul>
This patch only introduce the core data structures and API functions.
All XDP enabled drivers must use the API before this info can used.
There is a need for XDP to know more about the RX-queue a given XDP
frames have arrived on. For both the XDP bpf-prog and kernel side.
Instead of extending xdp_buff each time new info is needed, the patch
creates a separate read-mostly struct xdp_rxq_info, that contains this
info. We stress this data/cache-line is for read-only info. This is
NOT for dynamic per packet info, use the data_meta for such use-cases.
The performance advantage is this info can be setup at RX-ring init
time, instead of updating N-members in xdp_buff. A possible (driver
level) micro optimization is that xdp_buff->rxq assignment could be
done once per XDP/NAPI loop. The extra pointer deref only happens for
program needing access to this info (thus, no slowdown to existing
use-cases).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/linux/filter.h | 2 +
include/net/xdp.h | 47 ++++++++++++++++++++++++++++++++++
net/core/Makefile | 2 +
net/core/xdp.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 117 insertions(+), 1 deletion(-)
create mode 100644 include/net/xdp.h
create mode 100644 net/core/xdp.c
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2b0df2703671..425056c7f96c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -20,6 +20,7 @@
#include <linux/set_memory.h>
#include <linux/kallsyms.h>
+#include <net/xdp.h>
#include <net/sch_generic.h>
#include <uapi/linux/filter.h>
@@ -503,6 +504,7 @@ struct xdp_buff {
void *data_end;
void *data_meta;
void *data_hard_start;
+ struct xdp_rxq_info *rxq;
};
/* Compute the linear packet data range [data, data_end) which
diff --git a/include/net/xdp.h b/include/net/xdp.h
new file mode 100644
index 000000000000..86c41631a908
--- /dev/null
+++ b/include/net/xdp.h
@@ -0,0 +1,47 @@
+/* include/net/xdp.h
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2. See COPYING.
+ */
+#ifndef __LINUX_NET_XDP_H__
+#define __LINUX_NET_XDP_H__
+
+/**
+ * DOC: XDP RX-queue information
+ *
+ * The XDP RX-queue info (xdp_rxq_info) is associated with the driver
+ * level RX-ring queues. It is information that is specific to how
+ * the driver have configured a given RX-ring queue.
+ *
+ * Each xdp_buff frame received in the driver carry a (pointer)
+ * reference to this xdp_rxq_info structure. This provides the XDP
+ * data-path read-access to RX-info for both kernel and bpf-side
+ * (limited subset).
+ *
+ * For now, direct access is only safe while running in NAPI/softirq
+ * context. Contents is read-mostly and must not be updated during
+ * driver NAPI/softirq poll.
+ *
+ * The driver usage API is a register and unregister API.
+ *
+ * The struct is not directly tied to the XDP prog. A new XDP prog
+ * can be attached as long as it doesn't change the underlying
+ * RX-ring. If the RX-ring does change significantly, the NIC driver
+ * naturally need to stop the RX-ring before purging and reallocating
+ * memory. In that process the driver MUST call unregistor (which
+ * also apply for driver shutdown and unload). The register API is
+ * also mandatory during RX-ring setup.
+ */
+
+struct xdp_rxq_info {
+ struct net_device *dev;
+ u32 queue_index;
+ u32 reg_state;
+} ____cacheline_aligned; /* perf critical, avoid false-sharing */
+
+int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
+ struct net_device *dev, u32 queue_index);
+void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
+void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
+
+#endif /* __LINUX_NET_XDP_H__ */
diff --git a/net/core/Makefile b/net/core/Makefile
index 1fd0a9c88b1b..6dbbba8c57ae 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
- fib_notifier.o
+ fib_notifier.o xdp.o
obj-y += net-sysfs.o
obj-$(CONFIG_PROC_FS) += net-procfs.o
diff --git a/net/core/xdp.c b/net/core/xdp.c
new file mode 100644
index 000000000000..229bc5a0ee04
--- /dev/null
+++ b/net/core/xdp.c
@@ -0,0 +1,67 @@
+/* net/core/xdp.c
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2. See COPYING.
+ */
+#include <linux/types.h>
+#include <linux/mm.h>
+
+#include <net/xdp.h>
+
+#define REG_STATE_NEW 0x0
+#define REG_STATE_REGISTERED 0x1
+#define REG_STATE_UNREGISTERED 0x2
+#define REG_STATE_UNUSED 0x3
+
+void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
+{
+ /* Simplify driver cleanup code paths, allow unreg "unused" */
+ if (xdp_rxq->reg_state == REG_STATE_UNUSED)
+ return;
+
+ WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
+
+ xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
+ xdp_rxq->dev = NULL;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
+
+static void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq)
+{
+ memset(xdp_rxq, 0, sizeof(*xdp_rxq));
+}
+
+/* Returns 0 on success, negative on failure */
+int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
+ struct net_device *dev, u32 queue_index)
+{
+ if (xdp_rxq->reg_state == REG_STATE_UNUSED) {
+ WARN(1, "Driver promised not to register this");
+ return -EINVAL;
+ }
+
+ if (xdp_rxq->reg_state == REG_STATE_REGISTERED) {
+ WARN(1, "Missing unregister, handled but fix driver");
+ xdp_rxq_info_unreg(xdp_rxq);
+ }
+
+ if (!dev) {
+ WARN(1, "Missing net_device from driver");
+ return -ENODEV;
+ }
+
+ /* State either UNREGISTERED or NEW */
+ xdp_rxq_info_init(xdp_rxq);
+ xdp_rxq->dev = dev;
+ xdp_rxq->queue_index = queue_index;
+
+ xdp_rxq->reg_state = REG_STATE_REGISTERED;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_reg);
+
+void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq)
+{
+ xdp_rxq->reg_state = REG_STATE_UNUSED;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_unused);
^ permalink raw reply related
* [bpf-next V3 PATCH 02/14] xdp/mlx5: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-31 11:01 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: netdev, dsahern, Matan Barak, Saeed Mahameed,
Jesper Dangaard Brouer, gospo, bjorn.topel, michael.chan,
Tariq Toukan
In-Reply-To: <151471801977.30703.3258796879718706203.stgit@firesoul>
The mlx5 driver have a special drop-RQ queue (one per interface) that
simply drops all incoming traffic. It helps driver keep other HW
objects (flow steering) alive upon down/up operations. It is
temporarily pointed by flow steering objects during the interface
setup, and when interface is down. It lacks many fields that are set
in a regular RQ (for example its state is never switched to
MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for explanation).
The XDP RX-queue info for this drop-RQ marked as unused, which
allow us to use the same takedown/free code path as other RX-queues.
Driver hook points for xdp_rxq_info:
* reg : mlx5e_alloc_rq()
* unused: mlx5e_alloc_drop_rq()
* unreg : mlx5e_free_rq()
Tested on actual hardware with samples/bpf program
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Matan Barak <matanb@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 ++++
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 9 +++++++++
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 1 +
3 files changed, 14 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 543060c305a0..5299310f2481 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -46,6 +46,7 @@
#include <linux/mlx5/transobj.h>
#include <linux/rhashtable.h>
#include <net/switchdev.h>
+#include <net/xdp.h>
#include "wq.h"
#include "mlx5_core.h"
#include "en_stats.h"
@@ -571,6 +572,9 @@ struct mlx5e_rq {
u32 rqn;
struct mlx5_core_dev *mdev;
struct mlx5_core_mkey umr_mkey;
+
+ /* XDP read-mostly */
+ struct xdp_rxq_info xdp_rxq;
} ____cacheline_aligned_in_smp;
struct mlx5e_channel {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 3aa1c90e7c86..539bd1d24396 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -582,6 +582,9 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
goto err_rq_wq_destroy;
}
+ if (xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq->ix) < 0)
+ goto err_rq_wq_destroy;
+
rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
rq->buff.headroom = params->rq_headroom;
@@ -687,6 +690,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
err_rq_wq_destroy:
if (rq->xdp_prog)
bpf_prog_put(rq->xdp_prog);
+ xdp_rxq_info_unreg(&rq->xdp_rxq);
mlx5_wq_destroy(&rq->wq_ctrl);
return err;
@@ -699,6 +703,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
if (rq->xdp_prog)
bpf_prog_put(rq->xdp_prog);
+ xdp_rxq_info_unreg(&rq->xdp_rxq);
+
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
mlx5e_rq_free_mpwqe_info(rq);
@@ -2766,6 +2772,9 @@ static int mlx5e_alloc_drop_rq(struct mlx5_core_dev *mdev,
if (err)
return err;
+ /* Mark as unused given "Drop-RQ" packets never reach XDP */
+ xdp_rxq_info_unused(&rq->xdp_rxq);
+
rq->mdev = mdev;
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5b499c7a698f..7b38480811d4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -812,6 +812,7 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + *len;
xdp.data_hard_start = va;
+ xdp.rxq = &rq->xdp_rxq;
act = bpf_prog_run_xdp(prog, &xdp);
switch (act) {
^ permalink raw reply related
* [bpf-next V3 PATCH 03/14] i40e: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-31 11:01 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: netdev, Jesper Dangaard Brouer, intel-wired-lan, Jeff Kirsher,
dsahern, gospo, bjorn.topel, michael.chan
In-Reply-To: <151471801977.30703.3258796879718706203.stgit@firesoul>
The i40e driver have a special "FDIR" RX-ring (I40E_VSI_FDIR) which is
a sideband channel for configuring/updating the flow director tables.
This (i40e_vsi_)type does not invoke XDP-ebpf code. Thus, mark this
type as a XDP RX-queue type RXQ_TYPE_SINK.
Driver hook points for xdp_rxq_info:
* reg : i40e_setup_rx_descriptors (via i40e_vsi_setup_rx_resources)
* unreg: i40e_free_rx_resources (via i40e_vsi_free_rx_resources)
Tested on actual hardware with samples/bpf program.
V2: Fixed bug in i40e_set_ringparam
Cc: intel-wired-lan@lists.osuosl.org
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 ++
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 18 ++++++++++++++++--
drivers/net/ethernet/intel/i40e/i40e_txrx.h | 3 +++
3 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 5f6cf7212d4f..cfd788b4fd7a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1585,6 +1585,8 @@ static int i40e_set_ringparam(struct net_device *netdev,
*/
rx_rings[i].desc = NULL;
rx_rings[i].rx_bi = NULL;
+ /* Clear cloned XDP RX-queue info before setup call */
+ memset(&rx_rings[i].xdp_rxq, 0, sizeof(rx_rings[i].xdp_rxq));
/* this is to allow wr32 to have something to write to
* during early allocation of Rx buffers
*/
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 4566d66ffc7c..2a8a85e3ae8f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -27,6 +27,7 @@
#include <linux/prefetch.h>
#include <net/busy_poll.h>
#include <linux/bpf_trace.h>
+#include <net/xdp.h>
#include "i40e.h"
#include "i40e_trace.h"
#include "i40e_prototype.h"
@@ -1236,6 +1237,8 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
void i40e_free_rx_resources(struct i40e_ring *rx_ring)
{
i40e_clean_rx_ring(rx_ring);
+ if (rx_ring->vsi->type == I40E_VSI_MAIN)
+ xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
rx_ring->xdp_prog = NULL;
kfree(rx_ring->rx_bi);
rx_ring->rx_bi = NULL;
@@ -1256,6 +1259,7 @@ void i40e_free_rx_resources(struct i40e_ring *rx_ring)
int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
{
struct device *dev = rx_ring->dev;
+ int err = -ENOMEM;
int bi_size;
/* warn if we are about to overwrite the pointer */
@@ -1283,13 +1287,21 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
+ /* XDP RX-queue info only needed for RX rings exposed to XDP */
+ if (rx_ring->vsi->type == I40E_VSI_MAIN) {
+ err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
+ rx_ring->queue_index);
+ if (err < 0)
+ goto err;
+ }
+
rx_ring->xdp_prog = rx_ring->vsi->xdp_prog;
return 0;
err:
kfree(rx_ring->rx_bi);
rx_ring->rx_bi = NULL;
- return -ENOMEM;
+ return err;
}
/**
@@ -2068,11 +2080,13 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
struct sk_buff *skb = rx_ring->skb;
u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
bool failure = false, xdp_xmit = false;
+ struct xdp_buff xdp;
+
+ xdp.rxq = &rx_ring->xdp_rxq;
while (likely(total_rx_packets < (unsigned int)budget)) {
struct i40e_rx_buffer *rx_buffer;
union i40e_rx_desc *rx_desc;
- struct xdp_buff xdp;
unsigned int size;
u16 vlan_tag;
u8 rx_ptype;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index fbae1182e2ea..2d08760fc4ce 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -27,6 +27,8 @@
#ifndef _I40E_TXRX_H_
#define _I40E_TXRX_H_
+#include <net/xdp.h>
+
/* Interrupt Throttling and Rate Limiting Goodies */
#define I40E_MAX_ITR 0x0FF0 /* reg uses 2 usec resolution */
@@ -428,6 +430,7 @@ struct i40e_ring {
*/
struct i40e_channel *ch;
+ struct xdp_rxq_info xdp_rxq;
} ____cacheline_internodealigned_in_smp;
static inline bool ring_uses_build_skb(struct i40e_ring *ring)
^ permalink raw reply related
* [bpf-next V3 PATCH 04/14] ixgbe: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-31 11:01 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: netdev, dsahern, Alexander Duyck, intel-wired-lan, Jeff Kirsher,
Jesper Dangaard Brouer, gospo, bjorn.topel, michael.chan
In-Reply-To: <151471801977.30703.3258796879718706203.stgit@firesoul>
Driver hook points for xdp_rxq_info:
* reg : ixgbe_setup_rx_resources()
* unreg: ixgbe_free_rx_resources()
Tested on actual hardware.
V2: Fix ixgbe_set_ringparam, clear xdp_rxq_info in temp_ring
Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe.h | 2 ++
drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 4 ++++
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10 +++++++++-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 468c3555a629..8611763d6129 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -53,6 +53,7 @@
#include <linux/dca.h>
#endif
+#include <net/xdp.h>
#include <net/busy_poll.h>
/* common prefix used by pr_<> macros */
@@ -371,6 +372,7 @@ struct ixgbe_ring {
struct ixgbe_tx_queue_stats tx_stats;
struct ixgbe_rx_queue_stats rx_stats;
};
+ struct xdp_rxq_info xdp_rxq;
} ____cacheline_internodealigned_in_smp;
enum ixgbe_ring_f_enum {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 0aad1c2a3667..0aaf70b3cfcd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -1156,6 +1156,10 @@ static int ixgbe_set_ringparam(struct net_device *netdev,
memcpy(&temp_ring[i], adapter->rx_ring[i],
sizeof(struct ixgbe_ring));
+ /* Clear copied XDP RX-queue info */
+ memset(&temp_ring[i].xdp_rxq, 0,
+ sizeof(temp_ring[i].xdp_rxq));
+
temp_ring[i].count = new_rx_count;
err = ixgbe_setup_rx_resources(adapter, &temp_ring[i]);
if (err) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7737a05c717c..95aba975b391 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2318,12 +2318,14 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
#endif /* IXGBE_FCOE */
u16 cleaned_count = ixgbe_desc_unused(rx_ring);
bool xdp_xmit = false;
+ struct xdp_buff xdp;
+
+ xdp.rxq = &rx_ring->xdp_rxq;
while (likely(total_rx_packets < budget)) {
union ixgbe_adv_rx_desc *rx_desc;
struct ixgbe_rx_buffer *rx_buffer;
struct sk_buff *skb;
- struct xdp_buff xdp;
unsigned int size;
/* return some buffers to hardware, one at a time is too slow */
@@ -6444,6 +6446,11 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
+ /* XDP RX-queue info */
+ if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, adapter->netdev,
+ rx_ring->queue_index) < 0)
+ goto err;
+
rx_ring->xdp_prog = adapter->xdp_prog;
return 0;
@@ -6541,6 +6548,7 @@ void ixgbe_free_rx_resources(struct ixgbe_ring *rx_ring)
ixgbe_clean_rx_ring(rx_ring);
rx_ring->xdp_prog = NULL;
+ xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
vfree(rx_ring->rx_buffer_info);
rx_ring->rx_buffer_info = NULL;
^ permalink raw reply related
* [bpf-next V3 PATCH 05/14] xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg
From: Jesper Dangaard Brouer @ 2017-12-31 11:01 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: bjorn.topel, netdev, Jesper Dangaard Brouer, Ariel Elior, dsahern,
gospo, everest-linux-l2, michael.chan
In-Reply-To: <151471801977.30703.3258796879718706203.stgit@firesoul>
The driver code qede_free_fp_array() depend on kfree() can be called
with a NULL pointer. This stems from the qede_alloc_fp_array()
function which either (kz)alloc memory for fp->txq or fp->rxq.
This also simplifies error handling code in case of memory allocation
failures, but xdp_rxq_info_unreg need to know the difference.
Introduce xdp_rxq_info_is_reg() to handle if a memory allocation fails
and detect this is the failure path by seeing that xdp_rxq_info was
not registred yet, which first happens after successful alloaction in
qede_init_fp().
Driver hook points for xdp_rxq_info:
* reg : qede_init_fp
* unreg: qede_free_fp_array
Tested on actual hardware with samples/bpf program.
V2: Driver have no proper error path for failed XDP RX-queue info reg, as
qede_init_fp() is a void function.
Cc: everest-linux-l2@cavium.com
Cc: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
drivers/net/ethernet/qlogic/qede/qede.h | 2 ++
drivers/net/ethernet/qlogic/qede/qede_fp.c | 1 +
drivers/net/ethernet/qlogic/qede/qede_main.c | 10 ++++++++++
include/net/xdp.h | 1 +
net/core/xdp.c | 6 ++++++
5 files changed, 20 insertions(+)
diff --git a/drivers/net/ethernet/qlogic/qede/qede.h b/drivers/net/ethernet/qlogic/qede/qede.h
index 8a336517baac..8116cfd30fad 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -40,6 +40,7 @@
#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/bpf.h>
+#include <net/xdp.h>
#include <linux/qed/qede_rdma.h>
#include <linux/io.h>
#ifdef CONFIG_RFS_ACCEL
@@ -345,6 +346,7 @@ struct qede_rx_queue {
u64 xdp_no_pass;
void *handle;
+ struct xdp_rxq_info xdp_rxq;
};
union db_prod {
diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index 48ec4c56cddf..dafc079ab6b9 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1006,6 +1006,7 @@ static bool qede_rx_xdp(struct qede_dev *edev,
xdp.data = xdp.data_hard_start + *data_offset;
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + *len;
+ xdp.rxq = &rxq->xdp_rxq;
/* Queues always have a full reset currently, so for the time
* being until there's atomic program replace just mark read
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 90d79ae2a48f..9929b4370ce6 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -765,6 +765,12 @@ static void qede_free_fp_array(struct qede_dev *edev)
fp = &edev->fp_array[i];
kfree(fp->sb_info);
+ /* Handle mem alloc failure case where qede_init_fp
+ * didn't register xdp_rxq_info yet.
+ * Implicit only (fp->type & QEDE_FASTPATH_RX)
+ */
+ if (fp->rxq && xdp_rxq_info_is_reg(&fp->rxq->xdp_rxq))
+ xdp_rxq_info_unreg(&fp->rxq->xdp_rxq);
kfree(fp->rxq);
kfree(fp->xdp_tx);
kfree(fp->txq);
@@ -1493,6 +1499,10 @@ static void qede_init_fp(struct qede_dev *edev)
else
fp->rxq->data_direction = DMA_FROM_DEVICE;
fp->rxq->dev = &edev->pdev->dev;
+
+ /* Driver have no error path from here */
+ WARN_ON(xdp_rxq_info_reg(&fp->rxq->xdp_rxq, edev->ndev,
+ fp->rxq->rxq_id) < 0);
}
if (fp->type & QEDE_FASTPATH_TX) {
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 86c41631a908..b2362ddfa694 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -43,5 +43,6 @@ int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
struct net_device *dev, u32 queue_index);
void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
+bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
#endif /* __LINUX_NET_XDP_H__ */
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 229bc5a0ee04..097a0f74e004 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -65,3 +65,9 @@ void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq)
xdp_rxq->reg_state = REG_STATE_UNUSED;
}
EXPORT_SYMBOL_GPL(xdp_rxq_info_unused);
+
+bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
+{
+ return (xdp_rxq->reg_state == REG_STATE_REGISTERED);
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
^ permalink raw reply related
* [bpf-next V3 PATCH 06/14] mlx4: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-31 11:01 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: netdev, dsahern, Jesper Dangaard Brouer, gospo, bjorn.topel,
michael.chan, Tariq Toukan
In-Reply-To: <151471801977.30703.3258796879718706203.stgit@firesoul>
Driver hook points for xdp_rxq_info:
* reg : mlx4_en_create_rx_ring
* unreg: mlx4_en_destroy_rx_ring
Tested on actual hardware.
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 3 ++-
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 13 ++++++++++---
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 4 +++-
3 files changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 21bc17fa3854..8fc51bc29003 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2172,8 +2172,9 @@ static int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
if (mlx4_en_create_rx_ring(priv, &priv->rx_ring[i],
prof->rx_ring_size, priv->stride,
- node))
+ node, i))
goto err;
+
}
#ifdef CONFIG_RFS_ACCEL
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5f9dbc9a7f5b..b4d144e67514 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -262,7 +262,7 @@ void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev)
int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
struct mlx4_en_rx_ring **pring,
- u32 size, u16 stride, int node)
+ u32 size, u16 stride, int node, int queue_index)
{
struct mlx4_en_dev *mdev = priv->mdev;
struct mlx4_en_rx_ring *ring;
@@ -286,6 +286,9 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
ring->log_stride = ffs(ring->stride) - 1;
ring->buf_size = ring->size * ring->stride + TXBB_SIZE;
+ if (xdp_rxq_info_reg(&ring->xdp_rxq, priv->dev, queue_index) < 0)
+ goto err_ring;
+
tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
sizeof(struct mlx4_en_rx_alloc));
ring->rx_info = vzalloc_node(tmp, node);
@@ -293,7 +296,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
ring->rx_info = vzalloc(tmp);
if (!ring->rx_info) {
err = -ENOMEM;
- goto err_ring;
+ goto err_xdp_info;
}
}
@@ -317,6 +320,8 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
err_info:
vfree(ring->rx_info);
ring->rx_info = NULL;
+err_xdp_info:
+ xdp_rxq_info_unreg(&ring->xdp_rxq);
err_ring:
kfree(ring);
*pring = NULL;
@@ -440,6 +445,7 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
lockdep_is_held(&mdev->state_lock));
if (old_prog)
bpf_prog_put(old_prog);
+ xdp_rxq_info_unreg(&ring->xdp_rxq);
mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
vfree(ring->rx_info);
ring->rx_info = NULL;
@@ -652,6 +658,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
int cq_ring = cq->ring;
bool doorbell_pending;
struct mlx4_cqe *cqe;
+ struct xdp_buff xdp;
int polled = 0;
int index;
@@ -666,6 +673,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
/* Protect accesses to: ring->xdp_prog, priv->mac_hash list */
rcu_read_lock();
xdp_prog = rcu_dereference(ring->xdp_prog);
+ xdp.rxq = &ring->xdp_rxq;
doorbell_pending = 0;
/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
@@ -750,7 +758,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
* read bytes but not past the end of the frag.
*/
if (xdp_prog) {
- struct xdp_buff xdp;
dma_addr_t dma;
void *orig_data;
u32 act;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 7db3d0d9bfce..f470ae37d937 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -46,6 +46,7 @@
#endif
#include <linux/cpu_rmap.h>
#include <linux/ptp_clock_kernel.h>
+#include <net/xdp.h>
#include <linux/mlx4/device.h>
#include <linux/mlx4/qp.h>
@@ -356,6 +357,7 @@ struct mlx4_en_rx_ring {
unsigned long dropped;
int hwtstamp_rx_filter;
cpumask_var_t affinity_mask;
+ struct xdp_rxq_info xdp_rxq;
};
struct mlx4_en_cq {
@@ -720,7 +722,7 @@ void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev);
void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv);
int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
struct mlx4_en_rx_ring **pring,
- u32 size, u16 stride, int node);
+ u32 size, u16 stride, int node, int queue_index);
void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
struct mlx4_en_rx_ring **pring,
u32 size, u16 stride);
^ permalink raw reply related
* [bpf-next V3 PATCH 07/14] bnxt_en: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-31 11:01 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: netdev, Jesper Dangaard Brouer, dsahern, michael.chan,
bjorn.topel, gospo, Andy Gospodarek
In-Reply-To: <151471801977.30703.3258796879718706203.stgit@firesoul>
Driver hook points for xdp_rxq_info:
* reg : bnxt_alloc_rx_rings
* unreg: bnxt_free_rx_rings
This driver should be updated to re-register when changing
allocation mode of RX rings.
Tested on actual hardware.
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 10 ++++++++++
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 2 ++
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 1 +
3 files changed, 13 insertions(+)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9efbdc6f1fcb..89c3c8760a78 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2247,6 +2247,9 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
if (rxr->xdp_prog)
bpf_prog_put(rxr->xdp_prog);
+ if (xdp_rxq_info_is_reg(&rxr->xdp_rxq))
+ xdp_rxq_info_unreg(&rxr->xdp_rxq);
+
kfree(rxr->rx_tpa);
rxr->rx_tpa = NULL;
@@ -2280,6 +2283,10 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
ring = &rxr->rx_ring_struct;
+ rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i);
+ if (rc < 0)
+ return rc;
+
rc = bnxt_alloc_ring(bp, ring);
if (rc)
return rc;
@@ -2834,6 +2841,9 @@ void bnxt_set_ring_params(struct bnxt *bp)
bp->cp_ring_mask = bp->cp_bit - 1;
}
+/* Changing allocation mode of RX rings.
+ * TODO: Update when extending xdp_rxq_info to support allocation modes.
+ */
int bnxt_set_rx_skb_mode(struct bnxt *bp, bool page_mode)
{
if (page_mode) {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 5359a1f0045f..2d268fc26f5e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -23,6 +23,7 @@
#include <net/devlink.h>
#include <net/dst_metadata.h>
#include <net/switchdev.h>
+#include <net/xdp.h>
struct tx_bd {
__le32 tx_bd_len_flags_type;
@@ -664,6 +665,7 @@ struct bnxt_rx_ring_info {
struct bnxt_ring_struct rx_ring_struct;
struct bnxt_ring_struct rx_agg_ring_struct;
+ struct xdp_rxq_info xdp_rxq;
};
struct bnxt_cp_ring_info {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 261e5847557a..1389ab5e05df 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -96,6 +96,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
xdp.data = *data_ptr;
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = *data_ptr + *len;
+ xdp.rxq = &rxr->xdp_rxq;
orig_data = xdp.data;
mapping = rx_buf->mapping - bp->rx_dma_offset;
^ permalink raw reply related
* [bpf-next V3 PATCH 08/14] nfp: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-31 11:01 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: dsahern, Jakub Kicinski, oss-drivers, Simon Horman,
Jesper Dangaard Brouer, netdev, gospo, bjorn.topel, michael.chan
In-Reply-To: <151471801977.30703.3258796879718706203.stgit@firesoul>
Driver hook points for xdp_rxq_info:
* reg : nfp_net_rx_ring_alloc
* unreg: nfp_net_rx_ring_free
In struct nfp_net_rx_ring moved member @size into a hole on 64-bit.
Thus, the size remaines the same after adding member @xdp_rxq.
Cc: oss-drivers@netronome.com
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
drivers/net/ethernet/netronome/nfp/nfp_net.h | 5 ++++-
.../net/ethernet/netronome/nfp/nfp_net_common.c | 10 ++++++++--
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 3801c52098d5..0e564cfabe7e 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -47,6 +47,7 @@
#include <linux/netdevice.h>
#include <linux/pci.h>
#include <linux/io-64-nonatomic-hi-lo.h>
+#include <net/xdp.h>
#include "nfp_net_ctrl.h"
@@ -350,6 +351,7 @@ struct nfp_net_rx_buf {
* @rxds: Virtual address of FL/RX ring in host memory
* @dma: DMA address of the FL/RX ring
* @size: Size, in bytes, of the FL/RX ring (needed to free)
+ * @xdp_rxq: RX-ring info avail for XDP
*/
struct nfp_net_rx_ring {
struct nfp_net_r_vector *r_vec;
@@ -361,13 +363,14 @@ struct nfp_net_rx_ring {
u32 idx;
int fl_qcidx;
+ unsigned int size;
u8 __iomem *qcp_fl;
struct nfp_net_rx_buf *rxbufs;
struct nfp_net_rx_desc *rxds;
dma_addr_t dma;
- unsigned int size;
+ struct xdp_rxq_info xdp_rxq;
} ____cacheline_aligned;
/**
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 0add4870ce2e..45b8cae937be 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1608,11 +1608,13 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
unsigned int true_bufsz;
struct sk_buff *skb;
int pkts_polled = 0;
+ struct xdp_buff xdp;
int idx;
rcu_read_lock();
xdp_prog = READ_ONCE(dp->xdp_prog);
true_bufsz = xdp_prog ? PAGE_SIZE : dp->fl_bufsz;
+ xdp.rxq = &rx_ring->xdp_rxq;
tx_ring = r_vec->xdp_ring;
while (pkts_polled < budget) {
@@ -1703,7 +1705,6 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
dp->bpf_offload_xdp) && !meta.portid) {
void *orig_data = rxbuf->frag + pkt_off;
unsigned int dma_off;
- struct xdp_buff xdp;
int act;
xdp.data_hard_start = rxbuf->frag + NFP_NET_RX_BUF_HEADROOM;
@@ -2252,6 +2253,7 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring)
struct nfp_net_r_vector *r_vec = rx_ring->r_vec;
struct nfp_net_dp *dp = &r_vec->nfp_net->dp;
+ xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
kfree(rx_ring->rxbufs);
if (rx_ring->rxds)
@@ -2275,7 +2277,11 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring)
static int
nfp_net_rx_ring_alloc(struct nfp_net_dp *dp, struct nfp_net_rx_ring *rx_ring)
{
- int sz;
+ int sz, err;
+
+ err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, dp->netdev, rx_ring->idx);
+ if (err < 0)
+ return err;
rx_ring->cnt = dp->rxd_cnt;
rx_ring->size = sizeof(*rx_ring->rxds) * rx_ring->cnt;
^ permalink raw reply related
* [bpf-next V3 PATCH 09/14] thunderx: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-31 11:01 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: bjorn.topel, Robert Richter, netdev, dsahern,
Jesper Dangaard Brouer, gospo, Sunil Goutham, michael.chan,
linux-arm-kernel
In-Reply-To: <151471801977.30703.3258796879718706203.stgit@firesoul>
This driver uses a bool scheme for "enable"/"disable" when setting up
different resources. Thus, the hook points for xdp_rxq_info is done
in the same function call nicvf_rcv_queue_config(). This is activated
through enable/disable via nicvf_config_data_transfer(), which is tied
into nicvf_stop()/nicvf_open().
Extending driver packet handler call-path nicvf_rcv_pkt_handler() with
a pointer to the given struct rcv_queue, in-order to access the
xdp_rxq_info data area (in nicvf_xdp_rx()).
V2: Driver have no proper error path for failed XDP RX-queue info reg,
as nicvf_rcv_queue_config is a void function.
Cc: linux-arm-kernel@lists.infradead.org
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Robert Richter <rric@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
drivers/net/ethernet/cavium/thunder/nicvf_main.c | 11 +++++++----
drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 4 ++++
drivers/net/ethernet/cavium/thunder/nicvf_queues.h | 2 ++
3 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 52b3a6044f85..21618d0d694f 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -521,7 +521,7 @@ static void nicvf_unmap_page(struct nicvf *nic, struct page *page, u64 dma_addr)
static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
struct cqe_rx_t *cqe_rx, struct snd_queue *sq,
- struct sk_buff **skb)
+ struct rcv_queue *rq, struct sk_buff **skb)
{
struct xdp_buff xdp;
struct page *page;
@@ -545,6 +545,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
xdp.data = (void *)cpu_addr;
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + len;
+ xdp.rxq = &rq->xdp_rxq;
orig_data = xdp.data;
rcu_read_lock();
@@ -698,7 +699,8 @@ static inline void nicvf_set_rxhash(struct net_device *netdev,
static void nicvf_rcv_pkt_handler(struct net_device *netdev,
struct napi_struct *napi,
- struct cqe_rx_t *cqe_rx, struct snd_queue *sq)
+ struct cqe_rx_t *cqe_rx,
+ struct snd_queue *sq, struct rcv_queue *rq)
{
struct sk_buff *skb = NULL;
struct nicvf *nic = netdev_priv(netdev);
@@ -724,7 +726,7 @@ static void nicvf_rcv_pkt_handler(struct net_device *netdev,
/* For XDP, ignore pkts spanning multiple pages */
if (nic->xdp_prog && (cqe_rx->rb_cnt == 1)) {
/* Packet consumed by XDP */
- if (nicvf_xdp_rx(snic, nic->xdp_prog, cqe_rx, sq, &skb))
+ if (nicvf_xdp_rx(snic, nic->xdp_prog, cqe_rx, sq, rq, &skb))
return;
} else {
skb = nicvf_get_rcv_skb(snic, cqe_rx,
@@ -781,6 +783,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, u8 cq_idx,
struct cqe_rx_t *cq_desc;
struct netdev_queue *txq;
struct snd_queue *sq = &qs->sq[cq_idx];
+ struct rcv_queue *rq = &qs->rq[cq_idx];
unsigned int tx_pkts = 0, tx_bytes = 0, txq_idx;
spin_lock_bh(&cq->lock);
@@ -811,7 +814,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, u8 cq_idx,
switch (cq_desc->cqe_type) {
case CQE_TYPE_RX:
- nicvf_rcv_pkt_handler(netdev, napi, cq_desc, sq);
+ nicvf_rcv_pkt_handler(netdev, napi, cq_desc, sq, rq);
work_done++;
break;
case CQE_TYPE_SEND:
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index f38ea349aa00..14e62c6ac342 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -760,6 +760,7 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, struct queue_set *qs,
if (!rq->enable) {
nicvf_reclaim_rcv_queue(nic, qs, qidx);
+ xdp_rxq_info_unreg(&rq->xdp_rxq);
return;
}
@@ -772,6 +773,9 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, struct queue_set *qs,
/* all writes of RBDR data to be loaded into L2 Cache as well*/
rq->caching = 1;
+ /* Driver have no proper error path for failed XDP RX-queue info reg */
+ WARN_ON(xdp_rxq_info_reg(&rq->xdp_rxq, nic->netdev, qidx) < 0);
+
/* Send a mailbox msg to PF to config RQ */
mbx.rq.msg = NIC_MBOX_MSG_RQ_CFG;
mbx.rq.qs_num = qs->vnic_id;
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
index 178ab6e8e3c5..7d1e4e2aaad0 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
@@ -12,6 +12,7 @@
#include <linux/netdevice.h>
#include <linux/iommu.h>
#include <linux/bpf.h>
+#include <net/xdp.h>
#include "q_struct.h"
#define MAX_QUEUE_SET 128
@@ -255,6 +256,7 @@ struct rcv_queue {
u8 start_qs_rbdr_idx; /* RBDR idx in the above QS */
u8 caching;
struct rx_tx_queue_stats stats;
+ struct xdp_rxq_info xdp_rxq;
} ____cacheline_aligned_in_smp;
struct cmp_queue {
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox