Netdev List
 help / color / mirror / Atom feed
* Re: Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-27  0:29 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Samudrala, Sridhar, Cornelia Huck, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <CADGSJ204fgdnafRsrzMVh+SeBCj3-z_twS9B+7g8iiOaUQGW5g@mail.gmail.com>

On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> >> > > > > Might not neccessarily be something wrong, but it's very limited to
> >> > > > > prohibit the MAC of VF from changing when enslaved by failover.
> >> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
> >> > > I think Sridhar and Jiri might be better person to answer it. My
> >> > > impression was that sync'ing the MAC address change between all 3
> >> > > devices is challenging, as the failover driver uses MAC address to
> >> > > match net_device internally.
> >>
> >> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> >> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> >> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> >> don't allow changing guest MAC unless it is a trusted VF.
> >
> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> > For example I can see host just
> > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
> 
> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
> got changed before virtio attempts to match and pair the device, it
> ends up with no pairing found out at all.

Guest seems to match on the hardware mac and ignore whatever
is set by user. Makes sense to me and should not be fragile.


> UUID is better.
> 
> -Siwei
> 
> >
> > --
> > MST

^ permalink raw reply

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Cong Wang @ 2018-06-27  0:29 UTC (permalink / raw)
  To: Flavio Leitner
  Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <20180626233302.GU19565@plex.lan>

On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
>
> It is still isolated, the sk carries the netns info and it is
> orphaned when it re-enters the stack.

Then what difference does your patch make?

Before your patch:
veth orphans skb in its xmit

After your patch:
RX orphans it when re-entering stack (as you claimed, I don't know)

And for veth pair:
xmit from one side is RX for the other side

So, where is the queueing? Where is the buffer bloat? GRO list??

^ permalink raw reply

* Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
From: Nambiar, Amritha @ 2018-06-27  0:36 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Alexander Duyck,
	Samudrala, Sridhar, Alexander Duyck, Eric Dumazet,
	Hannes Frederic Sowa, Tom Herbert
In-Reply-To: <CAF=yD-KQ60Ob2NekMvUhCQxCtLXcy-c5J1WR6stROQRmtANpOw@mail.gmail.com>

On 6/26/2018 4:04 AM, Willem de Bruijn wrote:
> On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
> <amritha.nambiar@intel.com> wrote:
>>
>> This patch adds support to pick Tx queue based on the Rx queue(s) map
>> configuration set by the admin through the sysfs attribute
>> for each Tx queue. If the user configuration for receive queue(s) map
>> does not apply, then the Tx queue selection falls back to CPU(s) map
>> based selection and finally to hashing.
>>
>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>> ---
> 
>> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>  {
>>  #ifdef CONFIG_XPS
>>         struct xps_dev_maps *dev_maps;
>> -       struct xps_map *map;
>> +       struct sock *sk = skb->sk;
>>         int queue_index = -1;
>>
>>         if (!static_key_false(&xps_needed))
>>                 return -1;
>>
>>         rcu_read_lock();
>> -       dev_maps = rcu_dereference(dev->xps_cpus_map);
>> +       if (!static_key_false(&xps_rxqs_needed))
>> +               goto get_cpus_map;
>> +
>> +       dev_maps = rcu_dereference(dev->xps_rxqs_map);
>>         if (dev_maps) {
>> -               unsigned int tci = skb->sender_cpu - 1;
>> +               int tci = sk_rx_queue_get(sk);
> 
> What if the rx device differs from the tx device?
> 
I think I have 3 options here:
1. Cache the ifindex in sock_common which will introduce a new
additional field in sock_common.
2. Use dev_get_by_napi_id to get the device id. This could be expensive,
if the rxqs_map is set, this will be done on every packet and involves
walking through the hashlist for napi_id lookup.
3. Remove validating device id, similar to how it is in skb_tx_hash
where rx_queue recorded is used and if not, fall through to flow hash
calculation.
What do you think is suitable here?

^ permalink raw reply

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Flavio Leitner @ 2018-06-27  0:39 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpVmbuVK+pfhad7HrQtaSxFbe7K=59_34f0BwxoP_=W0VQ@mail.gmail.com>

On Tue, Jun 26, 2018 at 05:29:51PM -0700, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
> >
> > It is still isolated, the sk carries the netns info and it is
> > orphaned when it re-enters the stack.
> 
> Then what difference does your patch make?

Don't forget it is fixing two issues.

> Before your patch:
> veth orphans skb in its xmit
> 
> After your patch:
> RX orphans it when re-entering stack (as you claimed, I don't know)

ip_rcv, and equivalents.

> And for veth pair:
> xmit from one side is RX for the other side
> So, where is the queueing? Where is the buffer bloat? GRO list??

CPU backlog.

-- 
Flavio

^ permalink raw reply

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Cong Wang @ 2018-06-27  0:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Flavio Leitner, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <aae3fa15-598b-342d-20c7-23ed4b9faf84@gmail.com>

On Tue, Jun 26, 2018 at 4:53 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 06/26/2018 03:47 PM, Cong Wang wrote:
> >
> > You need to justify why you want to break the TSQ's scope here,
> > which is obviously not compatible with netns design.
>
> You have to explain why you do not want us to fix this buggy behavior.
>
> Right now TSQ (and more generally back pressure) is broken by this skb_orphan()
>
> So we want to restore TSQ (and back pressure)
>
> TSQ scope never mentioned netns.

Conceptually, it is limiting the pipeline from L4 to L2 within one stack, now
you are extending it to NS1 (L4...L2)...NS2 (L2...L4) which could across
multiple stacks and _in theory_ could be infinitely long.

And TSQ setting is per-netns:

2180 static bool tcp_small_queue_check(struct sock *sk, const struct
sk_buff *skb,
2181                                   unsigned int factor)
2182 {
2183         unsigned int limit;
2184
2185         limit = max(2 * skb->truesize, sk->sk_pacing_rate >>
sk->sk_pacing_shift);
2186         limit = min_t(u32, limit,
2187                       sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes);
2188         limit <<= factor;

Should I expect to increase sysctl_tcp_limit_output_bytes based on the
number of stacks it crosses?

> We (TCP stack TSQ handler) want to be notified when this packet leaves the host,
> even if it had to traverse multiple netns (for whatever reasons).
>
> _If_ a packet is locally 'consumed' (like on loopback device, or veth pair),
> then the skb_orphan() will automatically be done.
>
> If you have a case where this skb_orphan() is needed, please add it at the needed place.


With this, a netns could totally throttle a TCP socket in a different
netns by holding the packets infinitely (e.g. putting them in a loop).
This is where the isolation breaks.

^ permalink raw reply

* Re: [rds-devel] [PATCH net-next] rds: clean up loopback rds_connections on netns deletion
From: David Miller @ 2018-06-27  1:12 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: netdev, rds-devel
In-Reply-To: <20180626162822.GA6394@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Tue, 26 Jun 2018 12:28:22 -0400

> On (06/26/18 10:53), Sowmini Varadhan wrote:
>> Date: Tue, 26 Jun 2018 10:53:23 -0400
>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> To: David Miller <davem@davemloft.net>
>> Cc: netdev@vger.kernel.org, rds-devel@oss.oracle.com
>> Subject: Re: [rds-devel] [PATCH net-next] rds: clean up loopback
>> 
>> and just to add, the fix itself is logically correct, so belongs in
>> net-next. What I dont have (and therefore did not target net) is
>> official confirmation that the syzbot failures are root-caused to the
>> absence of this patch (since there is no reproducer for many of these,
>> and no crash dumps available from syzbot).  
>> 
> 
> With help from Dmitry, I just got the confirmation from syzbot that
> "syzbot has tested the proposed patch and the reproducer did not trigger 
> crash:"
> 
> thus, we can mark this
> 
> Reported-and-tested-by: syzbot+4c20b3866171ce8441d2@syzkaller.appspotmail.com
> 
> and yes, it can target net.

Great, applied, thanks!

^ permalink raw reply

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Cong Wang @ 2018-06-27  1:28 UTC (permalink / raw)
  To: Flavio Leitner
  Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <20180627003925.GV19565@plex.lan>

On Tue, Jun 26, 2018 at 5:39 PM Flavio Leitner <fbl@redhat.com> wrote:
>
> On Tue, Jun 26, 2018 at 05:29:51PM -0700, Cong Wang wrote:
> > On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
> > >
> > > It is still isolated, the sk carries the netns info and it is
> > > orphaned when it re-enters the stack.
> >
> > Then what difference does your patch make?
>
> Don't forget it is fixing two issues.

Sure. I am only talking about TSQ from the very beginning.
Let me rephrase my above question:
What difference does your patch make to TSQ?


>
> > Before your patch:
> > veth orphans skb in its xmit
> >
> > After your patch:
> > RX orphans it when re-entering stack (as you claimed, I don't know)
>
> ip_rcv, and equivalents.


ip_rcv() is L3, we enter a stack from L1. So your above claim is incorrect. :)

>
> > And for veth pair:
> > xmit from one side is RX for the other side
> > So, where is the queueing? Where is the buffer bloat? GRO list??
>
> CPU backlog.

Yeah, but this is never targeted by TSQ:

        tcp_limit_output_bytes limits the number of bytes on qdisc
        or device to reduce artificial RTT/cwnd and reduce bufferbloat.

which means you have to update Documentation/networking/ip-sysctl.txt
too.

^ permalink raw reply

* [BUG] net: e100: possible data races in e100_watchdog()
From: Jia-Ju Bai @ 2018-06-27  1:30 UTC (permalink / raw)
  To: jeffrey.t.kirsher, jchapman
  Cc: intel-wired-lan, netdev, Linux Kernel Mailing List

The call paths in Linux 4.16.7 that may raise data races are:

CPU0:
e100_set_multicast_list
     e100_exec_cb
         line 879: spin_lock_irqsave()
         e100_configure
             line 1139: nic->flags [READ]
             line 1148: nic->flags [READ]

CPU1:
e100_watchdog:
     line 1758, 1756: nic->flags [WRITE]

The READ operations in CPU0 are performed with holding a spinlock (line 
879), but the WRITE operation in CPU1 is performed without holding this 
spinlock, so it may cause data races here.

A possible fix is to add spin_lock_irqsave() in e100_watchdog().
I am not sure that whether this possible fix is correct, so I only 
report the data races.


Best wishes,
Jia-Ju Bai

^ permalink raw reply

* [PATCH net] strparser: Remove early eaten to fix full tcp receive buffer stall
From: Doron Roberts-Kedes @ 2018-06-27  1:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: Dave Watson, Tom Herbert, netdev, Doron Roberts-Kedes

On receving an incomplete message, the existing code stores the
remaining length of the cloned skb in the early_eaten field instead of
incrementing the value returned by __strp_recv. This defers invocation
of sock_rfree for the current skb until the next invocation of
__strp_recv, which returns early_eaten if early_eaten is non-zero.

This behavior causes a stall when the current message occupies the very
tail end of a massive skb, and strp_peek/need_bytes indicates that the
remainder of the current message has yet to arrive on the socket. The
TCP receive buffer is totally full, causing the TCP window to go to
zero, so the remainder of the message will never arrive.

Incrementing the value returned by __strp_recv by the amount otherwise
stored in early_eaten prevents stalls of this nature.

Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
---
 net/strparser/strparser.c | 17 +----------------
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 373836615c57..625acb27efcc 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -35,7 +35,6 @@ struct _strp_msg {
 	 */
 	struct strp_msg strp;
 	int accum_len;
-	int early_eaten;
 };
 
 static inline struct _strp_msg *_strp_msg(struct sk_buff *skb)
@@ -115,20 +114,6 @@ static int __strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 	head = strp->skb_head;
 	if (head) {
 		/* Message already in progress */
-
-		stm = _strp_msg(head);
-		if (unlikely(stm->early_eaten)) {
-			/* Already some number of bytes on the receive sock
-			 * data saved in skb_head, just indicate they
-			 * are consumed.
-			 */
-			eaten = orig_len <= stm->early_eaten ?
-				orig_len : stm->early_eaten;
-			stm->early_eaten -= eaten;
-
-			return eaten;
-		}
-
 		if (unlikely(orig_offset)) {
 			/* Getting data with a non-zero offset when a message is
 			 * in progress is not expected. If it does happen, we
@@ -297,9 +282,9 @@ static int __strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 				}
 
 				stm->accum_len += cand_len;
+				eaten += cand_len;
 				strp->need_bytes = stm->strp.full_len -
 						       stm->accum_len;
-				stm->early_eaten = cand_len;
 				STRP_STATS_ADD(strp->stats.bytes, cand_len);
 				desc->count = 0; /* Stop reading socket */
 				break;
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH v2] fib_rules: match rules based on suppress_* properties too
From: David Miller @ 2018-06-27  1:34 UTC (permalink / raw)
  To: Jason; +Cc: roopa, netdev
In-Reply-To: <20180625233932.11531-1-Jason@zx2c4.com>

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Tue, 26 Jun 2018 01:39:32 +0200

> Two rules with different values of suppress_prefix or suppress_ifgroup
> are not the same. This fixes an -EEXIST when running:
> 
>    $ ip -4 rule add table main suppress_prefixlength 0
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")

Roopa, thanks for doing all of that analysis.

I think applying this patch makes the most sense at this point,
so that it what I have done.

^ permalink raw reply

* Re: [PATCH net-next] net/tls: Remove VLA usage on nonce
From: David Miller @ 2018-06-27  1:40 UTC (permalink / raw)
  To: keescook; +Cc: davejwatson, borisp, aviadye, netdev, linux-kernel
In-Reply-To: <20180625235505.GA22661@beast>

From: Kees Cook <keescook@chromium.org>
Date: Mon, 25 Jun 2018 16:55:05 -0700

> It looks like the prior VLA removal, commit b16520f7493d ("net/tls: Remove
> VLA usage"), and a new VLA addition, commit c46234ebb4d1e ("tls: RX path
> for ktls"), passed in the night. This removes the newly added VLA, which
> happens to have its bounds based on the same max value.
> 
> Signed-off-by: Kees Cook <keescook@chromium.org>

Applied, thanks Kees.

^ permalink raw reply

* Re: [PATCH net-next 0/6] Multipath tests for tunnel devices
From: David Miller @ 2018-06-27  1:42 UTC (permalink / raw)
  To: petrm; +Cc: netdev, linux-kselftest, shuah
In-Reply-To: <cover.1529971148.git.petrm@mellanox.com>

From: Petr Machata <petrm@mellanox.com>
Date: Tue, 26 Jun 2018 02:05:51 +0200

> This patchset adds a test for ECMP and weighted ECMP between two GRE
> tunnels.
> 
> In patches #1 and #2, the function multipath_eval() is first moved from
> router_multipath.sh to lib.sh for ease of reuse, and then fixed up.
> 
> In patch #3, the function tc_rule_stats_get() is parameterized to be
> useful for egress rules as well.
> 
> In patch #4, a new function __simple_if_init() is extracted from
> simple_if_init(). This covers the logic that needs to be done for the
> usual interface: VRF migration, upping and installation of IP addresses.
> 
> Patch #5 then adds the test itself.
> 
> Additionally in patch #6, a requirement to add diagrams to selftests is
> documented.

Looks good, series applied.

^ permalink raw reply

* Re: [PATCH net v2 0/2] nfp: MPLS and shared blocks TC offload fixes
From: David Miller @ 2018-06-27  1:47 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: jiri, oss-drivers, netdev
In-Reply-To: <20180626033628.17660-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Mon, 25 Jun 2018 20:36:26 -0700

> This series brings two fixes to TC filter/action offload code.
> Pieter fixes matching MPLS packets when the match is purely on
> the MPLS ethertype and none of the MPLS fields are used.
> John provides a fix for offload of shared blocks.  Unfortunately,
> with shared blocks there is currently no guarantee that filters
> which were added by the core will be removed before block unbind.
> Our simple fix is to not support offload of rules on shared blocks
> at all, a revert of this fix will be send for -next once the
> reoffload infrastructure lands.  The shared blocks became important
> as we are trying to use them for bonding offload (managed from user
> space) and lack of remove calls leads to resource leaks.
> 
> v2:
>  - fix build error reported by kbuild bot due to missing
>    tcf_block_shared() helper.

Series applied and queued up for -stable, thanks.

^ permalink raw reply

* [BUG] net: tg3: two possible data races
From: Jia-Ju Bai @ 2018-06-27  1:47 UTC (permalink / raw)
  To: siva.kallam, prashant, mchan; +Cc: netdev, Linux Kernel Mailing List

The call paths in Linux 4.16.7 that may raise the first data race:

CPU0:
tg3_open
     tg3_start
         line 11611: spin_lock_bh()
         tg3_enable_ints
             line 1023: tp->tnapi->last_tag [READ]

CPU1:
tg3_poll
     line 7341: tnapi->last_tag [WRITE]

The READ operation in CPU0 is performed with holding a spinlock (line 
11611), but the WRITE operation in CPU1 is performed without holding 
this spinlock, so it may cause a data race here.
A possible fix may be to add spin_lock_bh() in tg3_poll().

-----------------------------------------------------------------------

The call paths in Linux 4.16.7 that may raise the second data race:

CPU0:
tg3_open
     tg3_start
         line 11611: spin_lock_bh()
         tg3_enable_ints
             line 1023: tp->irq_sync [WRITE]

CPU1:
tg3_interrupt_tagged
     tg3_irq_sync
         line 7341: tp->irq_sync [READ]

The WRITE operation in CPU0 is performed with holding a spinlock (line 
11611), but the READ operation in CPU1 is performed without holding this 
spinlock, so it may cause a data race here.
A possible fix may be to add spin_lock_bh() in tg3_irq_sync().

I am not sure that whether the possible fixes are correct, so I only 
report the data races.


Best wishes,
Jia-Ju Bai

^ permalink raw reply

* [PATCH net-next] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: Lawrence Brakmo @ 2018-06-27  1:52 UTC (permalink / raw)
  To: netdev; +Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Eric Dumazet

When using dctcp and doing RPCs, if the last packet of a request is
ECN marked as having seen congestion (CE), the sender can decrease its
cwnd to 1. As a result, it will only send one packet when a new request
is sent. In some instances this results in high tail latencies.

For example, in one setup there are 3 hosts sending to a 4th one, with
each sender having 3 flows (1 stream, 1 1MB back-to-back RPCs and 1 10KB
back-to-back RPCs). The following table shows the 99% and 99.9%
latencies for both Cubic and dctcp

           Cubic 99%  Cubic 99.9%   dctcp 99%    dctcp 99.9%
 1MB RPCs    3.5ms       6.0ms         43ms          208ms
10KB RPCs    1.0ms       2.5ms         53ms          212ms

On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there. Forcing cwnd to be at least 2 in tcp_cwnd_reduction
fixes the problem with the high tail latencies. The latencies now look
like this:

             dctcp 99%       dctcp 99.9%
 1MB RPCs      3.8ms             4.4ms
10KB RPCs      168us             211us

Another group working with dctcp saw the same issue with production
traffic and it was solved with this patch.

The only issue is if it is safe to always use 2 or if it is better to
use min(2, snd_ssthresh) (which could still trigger the problem).

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f63b70..a9255c424761 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2477,7 +2477,7 @@ void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
 	}
 	/* Force a fast retransmit upon entering fast recovery */
 	sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
-	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+	tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
 }
 
 static inline void tcp_end_cwnd_reduction(struct sock *sk)
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH net-next] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: kbuild test robot @ 2018-06-27  2:18 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: kbuild-all, netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Eric Dumazet
In-Reply-To: <20180627015222.3269067-1-brakmo@fb.com>

[-- Attachment #1: Type: text/plain, Size: 2947 bytes --]

Hi Lawrence,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Lawrence-Brakmo/tcp-force-cwnd-at-least-2-in-tcp_cwnd_reduction/20180627-095533
config: x86_64-randconfig-x003-201825 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:18:0,
                    from arch/x86/include/asm/bug.h:83,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from net/ipv4/tcp_input.c:67:
   net/ipv4/tcp_input.c: In function 'tcp_cwnd_reduction':
   include/linux/kernel.h:812:29: warning: comparison of distinct pointer types lacks a cast
      (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                                ^
   include/linux/kernel.h:826:4: note: in expansion of macro '__typecheck'
      (__typecheck(x, y) && __no_side_effects(x, y))
       ^~~~~~~~~~~
   include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
     __builtin_choose_expr(__safe_cmp(x, y), \
                           ^~~~~~~~~~
   include/linux/kernel.h:852:19: note: in expansion of macro '__careful_cmp'
    #define max(x, y) __careful_cmp(x, y, >)
                      ^~~~~~~~~~~~~
>> net/ipv4/tcp_input.c:2480:17: note: in expansion of macro 'max'
     tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
                    ^~~

vim +/max +2480 net/ipv4/tcp_input.c

  2455	
  2456	void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
  2457	{
  2458		struct tcp_sock *tp = tcp_sk(sk);
  2459		int sndcnt = 0;
  2460		int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
  2461	
  2462		if (newly_acked_sacked <= 0 || WARN_ON_ONCE(!tp->prior_cwnd))
  2463			return;
  2464	
  2465		tp->prr_delivered += newly_acked_sacked;
  2466		if (delta < 0) {
  2467			u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
  2468				       tp->prior_cwnd - 1;
  2469			sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
  2470		} else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
  2471			   !(flag & FLAG_LOST_RETRANS)) {
  2472			sndcnt = min_t(int, delta,
  2473				       max_t(int, tp->prr_delivered - tp->prr_out,
  2474					     newly_acked_sacked) + 1);
  2475		} else {
  2476			sndcnt = min(delta, newly_acked_sacked);
  2477		}
  2478		/* Force a fast retransmit upon entering fast recovery */
  2479		sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
> 2480		tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
  2481	}
  2482	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32495 bytes --]

^ permalink raw reply

* [PATCH net-next] net: qmi_wwan: Add pass through mode
From: Subash Abhinov Kasiviswanathan @ 2018-06-27  2:30 UTC (permalink / raw)
  To: bjorn, dnlplm, dcbw, davem, netdev; +Cc: Subash Abhinov Kasiviswanathan

Pass through mode is to allow packets in MAP format to be passed
on to the stack. rmnet driver can be used to process and demultiplex
these packets. Note that pass through mode can be enabled when the
device is in raw ip mode only.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/usb/qmi_wwan.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 8fac8e1..6eeec92 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -59,6 +59,7 @@ struct qmi_wwan_state {
 enum qmi_wwan_flags {
 	QMI_WWAN_FLAG_RAWIP = 1 << 0,
 	QMI_WWAN_FLAG_MUX = 1 << 1,
+	QMI_WWAN_FLAG_PASS_THROUGH = 1 << 2,
 };
 
 enum qmi_wwan_quirks {
@@ -425,14 +426,82 @@ static ssize_t del_mux_store(struct device *d,  struct device_attribute *attr, c
 	return ret;
 }
 
+static ssize_t pass_through_show(struct device *d,
+				 struct device_attribute *attr,
+				 char *buf)
+{
+	struct usbnet *dev = netdev_priv(to_net_dev(d));
+	struct qmi_wwan_state *info;
+
+	info = (void *)&dev->data;
+	return sprintf(buf, "%c\n",
+		       info->flags & QMI_WWAN_FLAG_PASS_THROUGH ? 'Y' : 'N');
+}
+
+static ssize_t pass_through_store(struct device *d,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	struct usbnet *dev = netdev_priv(to_net_dev(d));
+	struct qmi_wwan_state *info;
+	bool enable;
+	int ret;
+
+	if (strtobool(buf, &enable))
+		return -EINVAL;
+
+	info = (void *)&dev->data;
+
+	/* no change? */
+	if (enable == (info->flags & QMI_WWAN_FLAG_PASS_THROUGH))
+		return len;
+
+	/* pass through mode can be set for raw ip devices only */
+	if (!(info->flags & QMI_WWAN_FLAG_RAWIP)) {
+		netdev_err(dev->net, "Cannot set pass through mode on non ip device\n");
+		return -EINVAL;
+	}
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	/* we don't want to modify a running netdev */
+	if (netif_running(dev->net)) {
+		netdev_err(dev->net, "Cannot change a running device\n");
+		ret = -EBUSY;
+		goto err;
+	}
+
+	/* let other drivers deny the change */
+	ret = call_netdevice_notifiers(NETDEV_PRE_TYPE_CHANGE, dev->net);
+	ret = notifier_to_errno(ret);
+	if (ret) {
+		netdev_err(dev->net, "Type change was refused\n");
+		goto err;
+	}
+
+	if (enable)
+		info->flags |= QMI_WWAN_FLAG_PASS_THROUGH;
+	else
+		info->flags &= ~QMI_WWAN_FLAG_PASS_THROUGH;
+	qmi_wwan_netdev_setup(dev->net);
+	call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE, dev->net);
+	ret = len;
+err:
+	rtnl_unlock();
+	return ret;
+}
+
 static DEVICE_ATTR_RW(raw_ip);
 static DEVICE_ATTR_RW(add_mux);
 static DEVICE_ATTR_RW(del_mux);
+static DEVICE_ATTR_RW(pass_through);
 
 static struct attribute *qmi_wwan_sysfs_attrs[] = {
 	&dev_attr_raw_ip.attr,
 	&dev_attr_add_mux.attr,
 	&dev_attr_del_mux.attr,
+	&dev_attr_pass_through.attr,
 	NULL,
 };
 
@@ -479,6 +548,11 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 	if (info->flags & QMI_WWAN_FLAG_MUX)
 		return qmimux_rx_fixup(dev, skb);
 
+	if (rawip && (info->flags & QMI_WWAN_FLAG_PASS_THROUGH)) {
+		skb->protocol = htons(ETH_P_MAP);
+		return (netif_rx(skb) == NET_RX_SUCCESS);
+	}
+
 	switch (skb->data[0] & 0xf0) {
 	case 0x40:
 		proto = htons(ETH_P_IP);
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Eric Dumazet @ 2018-06-27  2:32 UTC (permalink / raw)
  To: Cong Wang, Flavio Leitner
  Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpVmbuVK+pfhad7HrQtaSxFbe7K=59_34f0BwxoP_=W0VQ@mail.gmail.com>



On 06/26/2018 05:29 PM, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
>>
>> It is still isolated, the sk carries the netns info and it is
>> orphaned when it re-enters the stack.
> 
> Then what difference does your patch make?
> 
> Before your patch:
> veth orphans skb in its xmit
> 
> After your patch:
> RX orphans it when re-entering stack (as you claimed, I don't know)
> 
> And for veth pair:
> xmit from one side is RX for the other side
> 
> So, where is the queueing? Where is the buffer bloat? GRO list??
> 

By re-entering the stack, Flavio probably meant storing this skb in
a socket receive queue, or anything that should already modify skb->destructor
(and thus call skb_orphan() before the modification)

If skb sits in some qdisc, like fq on ipvlan master device, we do not want skb->sk to be scrubbed,
just because ipvlan slave and master might be in different netns.

^ permalink raw reply

* [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: Lawrence Brakmo @ 2018-06-27  2:34 UTC (permalink / raw)
  To: netdev; +Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Eric Dumazet

When using dctcp and doing RPCs, if the last packet of a request is
ECN marked as having seen congestion (CE), the sender can decrease its
cwnd to 1. As a result, it will only send one packet when a new request
is sent. In some instances this results in high tail latencies.

For example, in one setup there are 3 hosts sending to a 4th one, with
each sender having 3 flows (1 stream, 1 1MB back-to-back RPCs and 1 10KB
back-to-back RPCs). The following table shows the 99% and 99.9%
latencies for both Cubic and dctcp

           Cubic 99%  Cubic 99.9%   dctcp 99%    dctcp 99.9%
 1MB RPCs    3.5ms       6.0ms         43ms          208ms
10KB RPCs    1.0ms       2.5ms         53ms          212ms

On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there. Forcing cwnd to be at least 2 in tcp_cwnd_reduction
fixes the problem with the high tail latencies. The latencies now look
like this:

             dctcp 99%       dctcp 99.9%
 1MB RPCs      3.8ms             4.4ms
10KB RPCs      168us             211us

Another group working with dctcp saw the same issue with production
traffic and it was solved with this patch.

The only issue is if it is safe to always use 2 or if it is better to
use min(2, snd_ssthresh) (which could still trigger the problem).

v2: fixed compiler warning in max function arguments

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f63b70..282bd85322b0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2477,7 +2477,7 @@ void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
 	}
 	/* Force a fast retransmit upon entering fast recovery */
 	sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
-	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+	tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
 }
 
 static inline void tcp_end_cwnd_reduction(struct sock *sk)
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Eric Dumazet @ 2018-06-27  2:35 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Flavio Leitner, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpUOvavz580tJHCXhAR6AMprmFvakqiaiVMYe7NLFvFJ0Q@mail.gmail.com>



On 06/26/2018 05:44 PM, Cong Wang wrote:
 
> With this, a netns could totally throttle a TCP socket in a different
> netns by holding the packets infinitely (e.g. putting them in a loop).
> This is where the isolation breaks.
> 

That is fine, really. Admin error -> Working as intended.

The current scrubbing is simply wrong, not documented, and added by someone
who had absolutely not intended all the side effects.

^ permalink raw reply

* [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP)
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Hello,

Although it is far from being close to completion, this series provides
the means and infrastructure to enable device drivers to share packets
meta data information and accelerations with XDP programs, 
such as hash, stripped vlan, checksum, flow mark, packet header types,
etc ..

The series is still work in progress and sending it now as RFC in order
to get early feedback and to discuss the design, structures and UAPI.

For now the general idea is to help XDP programs accelerate, and provide
them with meta data that already available from the HW or device driver
to save cpu cycles on packet headers and data processing and decision
making, aside of that we want to avoid having a fixed size meta data
structure that wastes a lot of buffer space for stuff that the xdp program
might not need, like the current "elephant" SKB fields, kidding :) ..

So my idea here is to provide a dynamic mechanism to allow XDP programs
on xdp load to request only the specific meta data they need, and
the device driver or netdevice will provide them in a predefined order
in the xdp_buff/xdp_md data meta section.

And here is how it is done and what i would like to discuss:

1. In the first patch: added the infrastructure to request predefined meta data
flags on xdp load indicating that the XDP program is going to need them.

1.1) In this patch I am using the current u32 bit IFLA_XDP_FLAGS,
TODO: this needs to be improved in order to allow more meta data flags,
maybe use a new dedicated flags ?

1.2) Device driver that want to support xdp meta data should implement
new XDP command XDP_QUERY_META_FLAGS and report the meta data flags they
can support.

1.3) the kernel will cross check the requested flags with the device's
supported flags and will fail the xdp load in case of discrepancy.

Question : do we want this ? or is it better to return to the program
the actual supported flags and let it decide ?


2. This is the interesting part: in the 2nd patch Added the meta data
set/get infrastructure to allow device drivers populate them into xdp_buff
data meta in a well defined and structured format and let the xdp program
read them according to the predefined format/structure, the idea here is
that the xdp program and the device driver will share a static information
offsets array that will define where each meta data will sit inside
xdp_buff data meta, such structure will be decided once on xdp load.

Enters struct xdp_md_info and xdp_md_info_arr:

struct xdp_md_info {
       __u16 present:1;
       __u16 offset:15; /* offset from data_meta in xdp_md buffer */
};

/* XDP meta data offsets info array
 * present bit describes if a meta data is or will be present in xdp_md buff
 * offset describes where a meta data is or should be placed in xdp_md buff
 *
 * Kernel builds this array using xdp_md_info_build helper on demand.
 * User space builds it statically in the xdp program.
 */
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

Offsets in xdp_md_info_arr are always in ascending order and only for
requested meta data:
example : for XDP prgram that requested the following flags:
flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

the offsets array will be:

xdp_md_info_arr mdi = {
        [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
        [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
        [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
        [XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
}

For this example: hash fields will always appear first then the vlan for every
xdp_md.

Once requested to provide xdp meta data, device driver will use a pre-built
xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
xdp_md_info_arr will tell the driver what is the offset of each meta data.
The user space XDP program will use a similar xdp_md_info_arr to
statically know what is the offset of each meta data.

*For future meta data they will be added to the end of the array with
higher flags value.

This patch also provides helper functions for the device drivers to store
meta data into xdp_buff, and helper function for XDP programs to fetch
specific xdp meta data from xdp_md buffer.

Question: currently the XDP program is responsible to build the static
meta data offsets array "xdp_md_info_arr" and the kernel will build it
for the netdevice, if we are going to choose this direction, should we
somehow share the same xdp_md_info_arr built by the kernel with the xdp
program ?

3. The last mlx5e patch is the actual show case of how the device driver
will add the support for xdp meta data, it is pretty simple and straight
forward ! The first two mlx5e patches are just small refactoring to make
the xdp_md_info_arr and packet completion information available in the rx
xdp handlers data path.

4. Just added a small example to demonstrate how XDP program can request
meta data on xdp load and how it can read them on the critical path.
of course more examples are needed and some performance numbers.
Exciting use cases such as:
  - using flow mark to allow fast white/black listing lookups.
  - flow mark to accelerate DDos prevention.
  - Generic Data path: Use the meta data from the xdp_buff to build SKBs !(Jesper's Idea)
  - using hash type to know the packet headers and encapsulation without
    touching the packet data at all.
  - using packet hash to do RPS and XPS like cpu redirecting.
  - and many more.


More ideas:

>From Jesper: 
=========================
Give XDP more rich information about the hash,
By reducing the 'ht' value to the PKT_HASH_TYPE's we are loosing information.

I happen to know your hardware well; CQE_RSS_HTYPE_IP tell us if this
is IPv4 or IPv6.  And CQE_RSS_HTYPE_L4 tell us if this is TCP, UDP or
IPSEC. (And you have another bit telling of this is IPv6 with extension
headers).

If we don't want to invent our own xdp_hash_types, we can simply adopt
the RSS Hashing Types defined by Microsoft:
 https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types

An interesting part of the RSS standard, is that the hash type can help
identify if this is a fragment. (XDP could use this info to avoid
touching payload and react, e.g. drop fragments, or redirect all
fragments to another CPU, or skip parsing in XDP and defer to network
stack via XDP_PASS).

By using the RSS standard, we do e.g. loose the ability to say this is
IPSEC traffic, even-though your HW supports this.

I have tried to implemented my own (non-dynamic) XDP RX-types UAPI here:
 https://marc.info/?l=linux-netdev&m=149512213531769
 https://marc.info/?l=linux-netdev&m=149512213631774
=========================

Thanks,
Saeed.

Saeed Mahameed (6):
  net: xdp: Add support for meta data flags requests
  net: xdp: RX meta data infrastructure
  net/mlx5e: Store xdp flags and meta data info
  net/mlx5e: Pass CQE to RX handlers
  net/mlx5e: Add XDP RX meta data support
  samples/bpf: Add meta data hash example to xdp_redirect_cpu

 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  19 ++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  58 +++++----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  47 ++++++--
 include/linux/netdevice.h                     |  10 ++
 include/net/xdp.h                             |   6 +
 include/uapi/linux/bpf.h                      | 114 ++++++++++++++++++
 include/uapi/linux/if_link.h                  |  20 ++-
 net/core/dev.c                                |  41 +++++++
 samples/bpf/xdp_redirect_cpu_kern.c           |  67 ++++++++++
 samples/bpf/xdp_redirect_cpu_user.c           |   7 ++
 10 files changed, 354 insertions(+), 35 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [RFC bpf-next 1/6] net: xdp: Add support for meta data flags requests
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

A user space application can request to enable a specific set of meta data
to be reported in every xdp buffer provided to the xdp program.

When meta_data flags are required, XDP devices must respond to
XDP_QUERY_META_FLAGS command with all the meta data flags the device
actually supports, and the kernel will cross check them with the
requested flags, in case of discrepancy the xdp install will fail.

If the flags are supported, the device must guarantee to deliver all RX
packets with only the meta data requested in meta data_flags on
xdp_install operation.

The following flags are added, and can be provided by the netlink xdp
flags field.

+#define XDP_FLAGS_META_HASH            (1U << 16)
+#define XDP_FLAGS_META_FLOW_MARK       (1U << 17)
+#define XDP_FLAGS_META_VLAN            (1U << 18)
+#define XDP_FLAGS_META_CSUM_COMPLETE   (1U << 19)

The format, device delivery methods and XDP program access to such meta
data is discussed in a later patch.

TODO: use a different flags field for XDP meta data, to make sure we
have more free bits.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h    |  9 +++++++++
 include/uapi/linux/if_link.h | 16 +++++++++++++++-
 net/core/dev.c               | 20 ++++++++++++++++++++
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850c7936..fc8b6ce48a0f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -812,6 +812,10 @@ enum bpf_netdev_command {
 	 * is equivalent to XDP_ATTACHED_DRV.
 	 */
 	XDP_QUERY_PROG,
+	/* Query Supported XDP_FLAGS_META_*, Called before new XDP program setup
+	 * and only if meta_flags were requested by the user to validate if the
+	 * device supports the requested flags, if not program setup will fail */
+	XDP_QUERY_META_FLAGS,
 	/* BPF program for offload callbacks, invoked at program load time. */
 	BPF_OFFLOAD_VERIFIER_PREP,
 	BPF_OFFLOAD_TRANSLATE,
@@ -842,6 +846,11 @@ struct netdev_bpf {
 			/* flags with which program was installed */
 			u32 prog_flags;
 		};
+		/* XDP_QUERY_META_FLAGS */
+		struct {
+			/* TODO u64 */
+			u32 meta_flags;
+		};
 		/* BPF_OFFLOAD_VERIFIER_PREP */
 		struct {
 			struct bpf_prog *prog;
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index cf01b6824244..dfb1e26bacef 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -911,8 +911,22 @@ enum {
 #define XDP_FLAGS_MODES			(XDP_FLAGS_SKB_MODE | \
 					 XDP_FLAGS_DRV_MODE | \
 					 XDP_FLAGS_HW_MODE)
+
+/* TODO : add new netlink xdp u64 meta_flags
+ * for meta data only
+ */
+#define XDP_FLAGS_META_HASH		(1U << 16)
+#define XDP_FLAGS_META_MARK		(1U << 17)
+#define XDP_FLAGS_META_VLAN		(1U << 18)
+#define XDP_FLAGS_META_CSUM_COMPLETE	(1U << 19)
+#define XDP_FLAGS_META_ALL		(XDP_FLAGS_META_HASH      | \
+					 XDP_FLAGS_META_MARK      | \
+					 XDP_FLAGS_META_VLAN      | \
+					 XDP_FLAGS_META_CSUM_COMPLETE)
+
 #define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
-					 XDP_FLAGS_MODES)
+					 XDP_FLAGS_MODES             | \
+					 XDP_FLAGS_META_ALL)
 
 /* These are stored into IFLA_XDP_ATTACHED on dump. */
 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7444e6..8a5cc2c731ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7295,6 +7295,21 @@ static u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t bpf_op)
 	return xdp.prog_attached;
 }
 
+static bool __dev_xdp_meta_supported(struct net_device *dev,
+				     bpf_op_t bpf_op, u32 meta_flags)
+{
+	struct netdev_bpf xdp = {};
+
+	 /* Backward compatible, all devices support no meta_flags */
+	if (!meta_flags)
+		return true;
+
+	xdp.command = XDP_QUERY_META_FLAGS;
+	bpf_op(dev, &xdp);
+
+	return ((xdp.meta_flags & meta_flags) == meta_flags);
+}
+
 static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 			   struct netlink_ext_ack *extack, u32 flags,
 			   struct bpf_prog *prog)
@@ -7362,12 +7377,17 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		bpf_chk = generic_xdp_install;
 
 	if (fd >= 0) {
+		u32 meta_flags = (flags & XDP_FLAGS_META_ALL);
+
 		if (bpf_chk && __dev_xdp_attached(dev, bpf_chk))
 			return -EEXIST;
 		if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) &&
 		    __dev_xdp_attached(dev, bpf_op))
 			return -EBUSY;
 
+		if (!__dev_xdp_meta_supported(dev, bpf_op, meta_flags))
+			return -EINVAL;
+
 		prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
 					     bpf_op == ops->ndo_bpf);
 		if (IS_ERR(prog))
-- 
2.17.0

^ permalink raw reply related

* [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

The idea from this patch is to define a well known structure for XDP meta
data fields format and offset placement inside the xdp data meta buffer.

For that driver will need some static information to know what to
provide and where, enters struct xdp_md_info and xdp_md_info_arr:

struct xdp_md_info {
       __u16 present:1;
       __u16 offset:15; /* offset from data_meta in xdp_md buffer */
};

/* XDP meta data offsets info array
 * present bit describes if a meta data is or will be present in xdp_md buff
 * offset describes where a meta data is or should be placed in xdp_md buff
 *
 * Kernel builds this array using xdp_md_info_build helper on demand.
 * User space builds it statically in the xdp program.
 */
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

Offsets in xdp_md_info_arr are always in ascending order and only for
requested meta data:
example : flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

xdp_md_info_arr mdi = {
	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
	[XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
	[XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
}

i.e: hash fields will always appear first then the vlan for every
xdp_md.

Once requested to provide xdp meta data, device driver will use a pre-built
xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
xdp_md_info_arr will tell the driver what is the offset of each meta data.

This patch also provides helper functions for the device drivers to store
meta data into xdp_buff, and helper function for XDP programs to fetch
specific xdp meta data from xdp_md buffer.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h    |   1 +
 include/net/xdp.h            |   6 ++
 include/uapi/linux/bpf.h     | 114 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/if_link.h |  16 +++--
 net/core/dev.c               |  21 +++++++
 5 files changed, 152 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fc8b6ce48a0f..172363310ade 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -838,6 +838,7 @@ struct netdev_bpf {
 			u32 flags;
 			struct bpf_prog *prog;
 			struct netlink_ext_ack *extack;
+			xdp_md_info_arr md_info;
 		};
 		/* XDP_QUERY_PROG */
 		struct {
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 2deea7166a34..afe302613ae1 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff *xdp)
 	xdp->data_meta = xdp->data + 1;
 }
 
+static __always_inline void
+xdp_reset_data_meta(struct xdp_buff *xdp)
+{
+	xdp->data_meta = xdp->data_hard_start;
+}
+
 static __always_inline bool
 xdp_data_meta_unsupported(const struct xdp_buff *xdp)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..e320e7421565 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2353,6 +2353,120 @@ struct xdp_md {
 	__u32 rx_queue_index;  /* rxq->queue_index  */
 };
 
+enum {
+	XDP_DATA_META_HASH = 0,
+	XDP_DATA_META_MARK,
+	XDP_DATA_META_VLAN,
+	XDP_DATA_META_CSUM_COMPLETE,
+
+	XDP_DATA_META_MAX,
+};
+
+struct xdp_md_hash {
+	__u32 hash;
+	__u8  type;
+} __attribute__((aligned(4)));
+
+struct xdp_md_mark {
+	__u32 mark;
+} __attribute__((aligned(4)));
+
+struct xdp_md_vlan {
+	__u16 vlan;
+} __attribute__((aligned(4)));
+
+struct xdp_md_csum {
+	__u16 csum;
+} __attribute__((aligned(4)));
+
+static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
+	sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
+	sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
+	sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
+	sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE */
+};
+
+struct xdp_md_info {
+	__u16 present:1;
+	__u16 offset:15; /* offset from data_meta in xdp_md buffer */
+};
+
+/* XDP meta data offsets info array
+ * present bit describes if a meta data is or will be present in xdp_md buff
+ * offset describes where a meta data is or should be placed in xdp_md buff
+ *
+ * Kernel builds this array using xdp_md_info_build helper on demand.
+ * User space builds it statically in the xdp program.
+ */
+typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
+
+static __always_inline __u8
+xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
+{
+	return mdi[meta_data].present;
+}
+
+static __always_inline void
+xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32 hash, __u8 htype)
+{
+	struct xdp_md_hash *hash_md;
+
+	hash_md = (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
+	hash_md->hash = hash;
+	hash_md->type = htype;
+}
+
+static __always_inline struct xdp_md_hash *
+xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_mark(xdp_md_info_arr mdi, void *data_meta, __u32 mark)
+{
+	struct xdp_md_mark *mark_md;
+
+	mark_md = (struct xdp_md_mark *)((char*)data_meta + mdi[XDP_DATA_META_MARK].offset);
+	mark_md->mark = mark;
+}
+
+static __always_inline struct xdp_md_mark *
+xdp_data_meta_get_mark(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_mark *)((char*)data_meta + mdi[XDP_DATA_META_MARK].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_vlan(xdp_md_info_arr mdi, void *data_meta, __u16 vlan_tci)
+{
+	struct xdp_md_vlan *vlan_md;
+
+	vlan_md = (struct xdp_md_vlan *)((char*)data_meta + mdi[XDP_DATA_META_VLAN].offset);
+	vlan_md->vlan = vlan_tci;
+}
+
+static __always_inline struct xdp_md_vlan *
+xdp_data_meta_get_vlan(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_vlan *)((char*)data_meta + mdi[XDP_DATA_META_VLAN].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_csum_complete(xdp_md_info_arr mdi, void *data_meta, __u16 csum)
+{
+	struct xdp_md_csum *csum_md;
+
+	csum_md = (struct xdp_md_csum *)((char*)data_meta + mdi[XDP_DATA_META_CSUM_COMPLETE].offset);
+	csum_md->csum = csum;
+}
+
+static __always_inline struct xdp_md_csum *
+xdp_data_meta_get_csum_complete(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_csum *)((char*)data_meta + mdi[XDP_DATA_META_CSUM_COMPLETE].offset);
+}
+
 enum sk_action {
 	SK_DROP = 0,
 	SK_PASS,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index dfb1e26bacef..5bdc07bfad35 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/netlink.h>
+#include <linux/bpf.h>
 
 /* This struct should be in sync with struct rtnl_link_stats64 */
 struct rtnl_link_stats {
@@ -913,13 +914,16 @@ enum {
 					 XDP_FLAGS_HW_MODE)
 
 /* TODO : add new netlink xdp u64 meta_flags
- * for meta data only
+ * for meta data only & build according XDP_DATA_META_* enums
  */
-#define XDP_FLAGS_META_HASH		(1U << 16)
-#define XDP_FLAGS_META_MARK		(1U << 17)
-#define XDP_FLAGS_META_VLAN		(1U << 18)
-#define XDP_FLAGS_META_CSUM_COMPLETE	(1U << 19)
-#define XDP_FLAGS_META_ALL		(XDP_FLAGS_META_HASH      | \
+
+#define XDP_FLAGS_META_SHIFT            (16)
+#define XDP_FLAG_META(FLAG)             ((1U << XDP_DATA_META_##FLAG) << XDP_FLAGS_META_SHIFT)
+#define XDP_FLAGS_META_HASH             XDP_FLAG_META(HASH)
+#define XDP_FLAGS_META_MARK             XDP_FLAG_META(MARK)
+#define XDP_FLAGS_META_VLAN             XDP_FLAG_META(VLAN)
+#define XDP_FLAGS_META_CSUM_COMPLETE    XDP_FLAG_META(CSUM_COMPLETE)
+#define XDP_FLAGS_META_ALL              (XDP_FLAGS_META_HASH      | \
 					 XDP_FLAGS_META_MARK      | \
 					 XDP_FLAGS_META_VLAN      | \
 					 XDP_FLAGS_META_CSUM_COMPLETE)
diff --git a/net/core/dev.c b/net/core/dev.c
index 8a5cc2c731ec..141dd6632b00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7310,6 +7310,25 @@ static bool __dev_xdp_meta_supported(struct net_device *dev,
 	return ((xdp.meta_flags & meta_flags) == meta_flags);
 }
 
+static void xdp_md_info_build(xdp_md_info_arr mdi, u32 xdp_flags_meta)
+{
+	u16 offset = 0;
+	int i;
+
+	for (i = 0; i < XDP_DATA_META_MAX; i++) {
+		u32 meta_data_flag = BIT(i) << XDP_FLAGS_META_SHIFT;
+
+		if (!(meta_data_flag & xdp_flags_meta)) {
+			mdi[i].present = 0;
+			continue;
+		}
+
+		mdi[i].present = 1;
+		mdi[i].offset = offset;
+		offset += xdp_md_size[i];
+	}
+}
+
 static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 			   struct netlink_ext_ack *extack, u32 flags,
 			   struct bpf_prog *prog)
@@ -7325,6 +7344,8 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 	xdp.flags = flags;
 	xdp.prog = prog;
 
+	xdp_md_info_build(xdp.md_info, flags & XDP_FLAGS_META_ALL);
+
 	return bpf_op(dev, &xdp);
 }
 
-- 
2.17.0

^ permalink raw reply related

* [RFC bpf-next 3/6] net/mlx5e: Store xdp flags and meta data info
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

Make xdp flags and meta data info avaiable to RQs data path,
to enable xdp meta data offloads in next two patches.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 10 +++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 52 +++++++++++--------
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  2 +-
 3 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index eb9eb7aa953a..5893acfae307 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -236,6 +236,12 @@ enum mlx5e_priv_flag {
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
 #endif
 
+struct mlx5e_xdp_info {
+	struct bpf_prog      *prog;
+	u32                   flags;
+	xdp_md_info_arr       md_info;
+};
+
 struct mlx5e_params {
 	u8  log_sq_size;
 	u8  rq_wq_type;
@@ -257,7 +263,7 @@ struct mlx5e_params {
 	bool tx_dim_enabled;
 	u32 lro_timeout;
 	u32 pflags;
-	struct bpf_prog *xdp_prog;
+	struct mlx5e_xdp_info xdp;
 	unsigned int sw_mtu;
 	int hard_mtu;
 };
@@ -566,7 +572,7 @@ struct mlx5e_rq {
 	struct net_dim         dim; /* Dynamic Interrupt Moderation */
 
 	/* XDP */
-	struct bpf_prog       *xdp_prog;
+	struct mlx5e_xdp_info  xdp;
 	unsigned int           hw_mtu;
 	struct mlx5e_xdpsq     xdpsq;
 	DECLARE_BITMAP(flags, 8);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 56c1b6f5593e..8debae6b9cab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -96,7 +96,7 @@ bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
 
 static u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params)
 {
-	if (!params->xdp_prog) {
+	if (!params->xdp.prog) {
 		u16 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 		u16 rq_headroom = MLX5_RX_HEADROOM + NET_IP_ALIGN;
 
@@ -170,7 +170,7 @@ static u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
 static u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
 				 struct mlx5e_params *params)
 {
-	u16 linear_rq_headroom = params->xdp_prog ?
+	u16 linear_rq_headroom = params->xdp.prog ?
 		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
 	bool is_linear_skb;
 
@@ -205,7 +205,7 @@ bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
 {
 	return mlx5e_check_fragmented_striding_rq_cap(mdev) &&
 		!MLX5_IPSEC_DEV(mdev) &&
-		!(params->xdp_prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params));
+		!(params->xdp.prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params));
 }
 
 void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params)
@@ -489,11 +489,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	rq->mdev    = mdev;
 	rq->hw_mtu  = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 	rq->stats   = &c->priv->channel_stats[c->ix].rq;
+	rq->xdp     = params->xdp;
 
-	rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL;
-	if (IS_ERR(rq->xdp_prog)) {
-		err = PTR_ERR(rq->xdp_prog);
-		rq->xdp_prog = NULL;
+	rq->xdp.prog = params->xdp.prog ? bpf_prog_inc(params->xdp.prog) : NULL;
+	if (IS_ERR(rq->xdp.prog)) {
+		err = PTR_ERR(rq->xdp.prog);
+		rq->xdp.prog = NULL;
 		goto err_rq_wq_destroy;
 	}
 
@@ -501,7 +502,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (err < 0)
 		goto err_rq_wq_destroy;
 
-	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
+	rq->buff.map_dir = rq->xdp.prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
 	pool_size = 1 << params->log_rq_mtu_frames;
 
@@ -679,8 +680,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	}
 
 err_rq_wq_destroy:
-	if (rq->xdp_prog)
-		bpf_prog_put(rq->xdp_prog);
+	if (rq->xdp.prog)
+		bpf_prog_put(rq->xdp.prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
 	if (rq->page_pool)
 		page_pool_destroy(rq->page_pool);
@@ -693,8 +694,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 {
 	int i;
 
-	if (rq->xdp_prog)
-		bpf_prog_put(rq->xdp_prog);
+	if (rq->xdp.prog)
+		bpf_prog_put(rq->xdp.prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
 	if (rq->page_pool)
@@ -1906,7 +1907,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	c->netdev   = priv->netdev;
 	c->mkey_be  = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
 	c->num_tc   = params->num_tc;
-	c->xdp      = !!params->xdp_prog;
+	c->xdp      = !!params->xdp.prog;
 	c->stats    = &priv->channel_stats[ix].ch;
 
 	mlx5_vector2eqn(priv->mdev, ix, &eqn, &irq);
@@ -4090,12 +4091,12 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 	queue_work(priv->wq, &priv->tx_timeout_work);
 }
 
-static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+static int mlx5e_xdp_set(struct net_device *netdev, struct mlx5e_xdp_info *xdp)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
-	struct bpf_prog *old_prog;
-	int err = 0;
+	struct bpf_prog *old_prog, *prog = xdp->prog;
 	bool reset, was_opened;
+	int err = 0;
 	int i;
 
 	mutex_lock(&priv->state_lock);
@@ -4114,7 +4115,7 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 
 	was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
 	/* no need for full reset when exchanging programs */
-	reset = (!priv->channels.params.xdp_prog || !prog);
+	reset = (!priv->channels.params.xdp.prog || !prog);
 
 	if (was_opened && reset)
 		mlx5e_close_locked(netdev);
@@ -4132,10 +4133,12 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 	/* exchange programs, extra prog reference we got from caller
 	 * as long as we don't fail from this point onwards.
 	 */
-	old_prog = xchg(&priv->channels.params.xdp_prog, prog);
+	old_prog = xchg(&priv->channels.params.xdp.prog, prog);
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
+	priv->channels.params.xdp = *xdp;
+
 	if (reset) /* change RQ type according to priv->xdp_prog */
 		mlx5e_set_rq_type(priv->mdev, &priv->channels.params);
 
@@ -4155,7 +4158,8 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 		napi_synchronize(&c->napi);
 		/* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
 
-		old_prog = xchg(&c->rq.xdp_prog, prog);
+		old_prog = xchg(&c->rq.xdp.prog, prog);
+		c->rq.xdp = *xdp;
 
 		set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
 		/* napi_schedule in case we have missed anything */
@@ -4177,7 +4181,7 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 	u32 prog_id = 0;
 
 	mutex_lock(&priv->state_lock);
-	xdp_prog = priv->channels.params.xdp_prog;
+	xdp_prog = priv->channels.params.xdp.prog;
 	if (xdp_prog)
 		prog_id = xdp_prog->aux->id;
 	mutex_unlock(&priv->state_lock);
@@ -4187,9 +4191,15 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 
 static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
+	struct mlx5e_xdp_info xdp_info;
+
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
-		return mlx5e_xdp_set(dev, xdp->prog);
+		xdp_info.prog  = xdp->prog;
+		xdp_info.flags = xdp->flags;
+		memcpy(xdp_info.md_info, xdp->md_info, sizeof(xdp->md_info));
+
+		return mlx5e_xdp_set(dev, &xdp_info);
 	case XDP_QUERY_PROG:
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		xdp->prog_attached = !!xdp->prog_id;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d3a1dd20e41d..d12577c17011 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -925,7 +925,7 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
 				    void *va, u16 *rx_headroom, u32 *len)
 {
-	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+	struct bpf_prog *prog = READ_ONCE(rq->xdp.prog);
 	struct xdp_buff xdp;
 	u32 act;
 	int err;
-- 
2.17.0

^ permalink raw reply related

* [RFC bpf-next 4/6] net/mlx5e: Pass CQE to RX handlers
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

CQE has all the meta data information from HW.
Make it available to the driver xdp handlers.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++++++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 15 +++++++++------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5893acfae307..98bb315fc8a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -505,7 +505,8 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+			       u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+			       struct mlx5_cqe64 *cqe);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			 struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
@@ -901,10 +902,12 @@ void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			  struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d12577c17011..e37f9747a0e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -923,7 +923,8 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
-				    void *va, u16 *rx_headroom, u32 *len)
+				    void *va, u16 *rx_headroom,
+				    u32 *len, struct mlx5_cqe64 *cqe)
 {
 	struct bpf_prog *prog = READ_ONCE(rq->xdp.prog);
 	struct xdp_buff xdp;
@@ -1012,7 +1013,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	}
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, cqe);
 	rcu_read_unlock();
 	if (consumed)
 		return NULL; /* page/packet was consumed by XDP */
@@ -1155,7 +1156,8 @@ void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe)
 {
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
@@ -1202,7 +1204,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
 	u16 rx_headroom = rq->buff.headroom;
@@ -1221,7 +1224,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 	prefetch(data);
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, cqe);
 	rcu_read_unlock();
 	if (consumed) {
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
@@ -1268,7 +1271,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
 
 	skb = rq->mpwqe.skb_from_cqe_mpwrq(rq, wi, cqe_bcnt, head_offset,
-					   page_idx);
+					   page_idx, cqe);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
-- 
2.17.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox