Netdev List
* Re: Synopsys Ethernet QoS Driver
From: Rayagond Kokatanur @ 2016-11-21  5:29 UTC (permalink / raw)
  To: Rabin Vincent
  Cc: Joao Pinto, mued dib, David Miller, Jeff Kirsher, jiri, saeedm,
	idosch, netdev, linux-kernel, CARLOS.PALMINHA, andreas.irestal,
	Giuseppe Cavallaro, alexandre.torgue, lars.persson
In-Reply-To: <20161119135654.GA14079@lnxartpec.se.axis.com>

On Sat, Nov 19, 2016 at 7:26 PM, Rabin Vincent <rabin@rab.in> wrote:
> On Fri, Nov 18, 2016 at 02:20:27PM +0000, Joao Pinto wrote:
>> For now we are interested in improving the Synopsys QoS driver under
>> /net/ethernet/synopsys. For now the driver structure consists of a single file
>> called dwc_eth_qos.c, containing Synopsys Ethernet QoS common ops and platform
>> related stuff.
>>
>> Our strategy would be:
>>
>> a) Implement a platform glue driver (dwc_eth_qos_pltfm.c)
>> b) Implement a pci glue driver (dwc_eth_qos_pci.c)
>> c) Implement a "core driver" (dwc_eth_qos.c) that would only have Ethernet QoS
>> related stuff to be reused by the platform / pci drivers
>> d) Add a set of features to the "core driver" that we have available internally
>
> Note that there are actually two drivers in mainline for this hardware:
>
>  drivers/net/ethernet/synopsys/
>  drivers/net/ethernet/stmicro/stmmac/

Yes, the latter driver (drivers/net/ethernet/stmicro/stmmac/) supports
both 3.x and 4.x. It has glue layers for PCI and platform, a core, etc.;
please refer to this driver before you start.

You can start by adding the missing 4.x features to the stmmac driver.

>
> (See http://lists.openwall.net/netdev/2016/02/29/127)
>
> The former only supports 4.x of the hardware.
>
> The latter supports 4.x and 3.x and already has a platform glue driver
> with support for several platforms, a PCI glue driver, and a core driver
> with several features not present in the former (for example: TX/RX
> interrupt coalescing, EEE, PTP).
>
> Have you evaluated both drivers?  Why have you decided to work on the
> former rather than the latter?



-- 
wwr
Rayagond

^ permalink raw reply

* Re: [PATCH 4.9.0-rc5] AR9300 calibration problems with antenna selected
From: miaoqing @ 2016-11-21  5:48 UTC (permalink / raw)
  To: Krzysztof Hałasa, ath9k-devel, Kalle Valo, linux-wireless,
	ath9k-devel, netdev, linux-kernel
In-Reply-To: <61055e757c5645dfb69da3fc555cbcf5@aptaiexm02f.ap.qualcomm.com>


I would prefer that you didn't submit this.

> 
> I recently tried to select a single antenna on AR9300 and it works for
> 30 seconds only. The subsequent calibration makes the RX signal level
> to drop from the usual -30/-40 dBm to -70/-80 dBm, and the
> transmission practically stops.
> 
> With the attached patch it works, though selecting the antenna doesn't
> seem to have any visible effect, at least with "iw wlanX station dump"
> (perhaps it works for TX).
> 
> I'm using ad-hoc mode:
> 
> rmmod ath9k
> modprobe ath9k
> iw dev wlan0 set type ibss
> iw phy phyX set antenna 2

2 is a bad mask. We use bitmap, the valid masks are 1, 3, 7.
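
To illustrate the bitmap rule above, here is a minimal sketch in plain C. The helper name antenna_mask_is_contiguous() is hypothetical and is not ath9k code; it only assumes, as stated above, that valid masks are the contiguous low-bit patterns 1, 3 and 7.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of ath9k: accept only masks whose set
 * bits are contiguous starting at bit 0, with at most three chains. */
static bool antenna_mask_is_contiguous(uint32_t mask)
{
	if (mask == 0 || mask > 0x7)
		return false;
	/* mask + 1 is a power of two exactly when mask is 0b1, 0b11 or 0b111 */
	return (mask & (mask + 1)) == 0;
}
```

With this check, "set antenna 2" (binary 010) is rejected because bit 0 is not set, while 1, 3 and 7 pass.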

^ permalink raw reply

* Priority of HTB class on 10GbE
From: jipan yang @ 2016-11-21  7:04 UTC (permalink / raw)
  To: netdev

Hello,

I have been using HTB with a 1GbE NIC for traffic control of three classes
of traffic with priority 1, 2 and 3, and the configuration worked perfectly.
The ceil of higher-priority traffic gets filled before that of lower-priority
traffic.

"
tc qdisc del dev enp1s0f1 root

# This line sets an HTB qdisc on the root of enp1s0f1, and it specifies
# that the class 1:30 is used by default.
# It sets the name of the root as 1:, for future references.
tc qdisc add dev enp1s0f1 root handle 1: htb default 30

# This creates a class called 1:1, which is a direct descendant of root
# (the parent is 1:). It gets a rate and ceil of 1000mbit, with a burst
# of 15k.
tc class add dev enp1s0f1 parent 1: classid 1:1 htb rate 1000mbit \
    ceil 1000mbit burst 15k

# Class 1:10, which has a rate of 200mbit, ceil 800mbit
tc class add dev enp1s0f1 parent 1:1 classid 1:10 htb rate 200mbit \
    ceil 800mbit burst 15k cburst 15k prio 1


# Class 1:20, which has a rate of 200mbit, ceil 500mbit
tc class add dev enp1s0f1 parent 1:1 classid 1:20 htb rate 200mbit \
    ceil 500mbit burst 15k cburst 15k prio 2

# Class 1:30, which has a rate of 10mbit, ceil 1000mbit. This one is the
# default class.
tc class add dev enp1s0f1 parent 1:1 classid 1:30 htb rate 10mbit \
    ceil 1000mbit burst 15k cburst 15k prio 3

#sfq
tc qdisc add dev enp1s0f1 parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev enp1s0f1 parent 1:20 handle 20: sfq perturb 10
tc qdisc add dev enp1s0f1 parent 1:30 handle 30: sfq perturb 10

tc filter add dev enp1s0f1 parent 1:  protocol ip prio 100 handle 100: cgroup
"


After moving to a 10G NIC on the same host, I assumed the same
configuration with the rate and ceil numbers multiplied by 10 would also
work, but things started to fall apart. The priority setting doesn't seem
to work: with 3 TCP connections, one in each class, the iperf results
don't make much sense, and the throughput fluctuates a lot. I have tried
changing parameters such as quantum, burst and cburst, but none of them
fixes the problem.

I'm wondering if this is a known issue for HTB on 10G NICs and, if it is,
whether anyone is already working on it.
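
One thing that can be checked with plain arithmetic is how long a token bucket of the configured size lasts once the rates are multiplied by 10 while burst stays at 15k. The sketch below is only that arithmetic; burst_drain_ns() is a hypothetical helper, the 15000-byte figure is taken from the tc class output further down, and nothing here establishes that burst size is the cause of the behaviour described.

```c
#include <stdint.h>

/* Hypothetical helper: nanoseconds needed to send burst_bytes at
 * rate_bps (bits per second), i.e. how long a full bucket lasts. */
static uint64_t burst_drain_ns(uint64_t burst_bytes, uint64_t rate_bps)
{
	return burst_bytes * 8ULL * 1000000000ULL / rate_bps;
}
```

For a 15000-byte burst this gives 120000 ns at 1 Gbit/s and 12000 ns at 10 Gbit/s, so the same burst value covers a tenth of the time on the faster link.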

Priority 1: class 1:10
[  4] 7362.00-7363.00 sec   518 MBytes  4.34 Gbits/sec  8825
538:339:391:3127:3891:281:223:35

[  4] 7363.00-7364.00 sec   507 MBytes  4.25 Gbits/sec  8809
540:321:486:3251:3904:174:121:12

[  4] 7364.00-7365.00 sec   496 MBytes  4.16 Gbits/sec  8793
585:264:530:3533:3694:114:67:6

[  4] 7365.00-7366.00 sec   481 MBytes  4.03 Gbits/sec  8761
552:242:747:3873:3304:27:14:2

[  4] 7366.00-7367.00 sec   484 MBytes  4.06 Gbits/sec  8783
596:239:650:3828:3462:6:2:0



Priority 2: class 1:20

[  4] 7401.00-7402.00 sec   488 MBytes  4.10 Gbits/sec  10151
1681:937:861:3046:3610:5:6:5

[  4] 7402.00-7403.00 sec   490 MBytes  4.11 Gbits/sec  10186
1688:979:857:3003:3646:6:3:4

[  4] 7403.00-7404.00 sec   454 MBytes  3.81 Gbits/sec  10499
2110:1103:1427:3639:2186:9:14:11

[  4] 7404.00-7405.00 sec   454 MBytes  3.81 Gbits/sec  10476
2118:1078:1400:3626:2210:10:22:12



Priority 3: class 1:30

[  4] 7425.00-7426.00 sec   123 MBytes  1.03 Gbits/sec  5374
2926:521:622:1295:3:3:2:2

[  4] 7426.00-7427.00 sec   120 MBytes  1.01 Gbits/sec  5224
2782:580:609:1242:8:1:0:2

[  4] 7427.00-7428.00 sec   118 MBytes   989 Mbits/sec  5167
2764:577:594:1225:4:2:0:1

[  4] 7428.00-7429.00 sec   108 MBytes   906 Mbits/sec  4595
2331:594:562:1103:3:1:1:0

[  4] 7429.00-7430.00 sec   102 MBytes   856 Mbits/sec  4434
2262:601:550:1017:4:0:0:0

[  4] 7430.00-7431.00 sec   130 MBytes  1.09 Gbits/sec  5573
3006:535:626:1392:9:2:2:1

[  4] 7431.00-7432.00 sec   148 MBytes  1.25 Gbits/sec  6466
3500:633:762:1550:11:5:4:1



"
[root@dockerhost3 ~]# tc -s qdisc ls dev enp7s0f1

qdisc htb 1: root refcnt 65 r2q 10 default 30 direct_packets_stat 1425

 Sent 387791599974 bytes 256137127 pkt (dropped 32, overlimits
16471645 requeues 40)

 backlog 0b 5p requeues 40

qdisc sfq 10: parent 1:10 limit 127p quantum 1514b depth 127 divisor
1024 perturb 10sec

 Sent 170422902810 bytes 112564665 pkt (dropped 0, overlimits 0 requeues 0)

 backlog 68130b 1p requeues 0

qdisc sfq 20: parent 1:20 limit 127p quantum 1514b depth 127 divisor
1024 perturb 10sec

 Sent 164903555250 bytes 108919125 pkt (dropped 0, overlimits 0 requeues 0)

 backlog 68130b 1p requeues 0

qdisc sfq 30: parent 1:30 limit 127p quantum 1514b depth 127 divisor
1024 perturb 10sec

 Sent 52344279294 bytes 34573507 pkt (dropped 26, overlimits 0 requeues 0)

 backlog 204390b 3p requeues 0

[root@dockerhost3 ~]# tc -s filter  ls dev enp7s0f1

filter parent 1: protocol ip pref 100 cgroup handle 0x64

filter parent 1: protocol ip pref 100 cgroup handle 0x64

[root@dockerhost3 ~]# tc -s -d class show dev  enp7s0f1

class htb 1:1 root rate 10000Mbit ceil 10000Mbit burst 13750b/1 mpu 0b
overhead 0b cburst 0b/1 mpu 0b overhead 0b level 7

 Sent 387743091414 bytes 256105087 pkt (dropped 0, overlimits 0 requeues 0)

 rate 9862Mbit 828299pps backlog 0b 0p requeues 0

 lended: 828299 borrowed: 0 giants: 0

 tokens: -923 ctokens: -1095



class htb 1:10 parent 1:1 leaf 10: prio 1 quantum 200000 rate 2000Mbit
ceil 8000Mbit burst 15000b/1 mpu 0b overhead 0b cburst 14000b/1 mpu 0b
overhead 0b level 0

 Sent 170458534800 bytes 112588200 pkt (dropped 0, overlimits 0 requeues 0)

 rate 4222Mbit 353211pps backlog 0b 0p requeues 0

 lended: 353211 borrowed: 5040 giants: 0

 tokens: -4245 ctokens: -831



class htb 1:20 parent 1:1 leaf 20: prio 2 quantum 200000 rate 2000Mbit
ceil 5000Mbit burst 15000b/1 mpu 0b overhead 0b cburst 15000b/1 mpu 0b
overhead 0b level 0

 Sent 164938029030 bytes 108941895 pkt (dropped 0, overlimits 0 requeues 0)

 rate 4173Mbit 349256pps backlog 0b 1p requeues 0

 lended: 349256 borrowed: 4985 giants: 0

 tokens: -4073 ctokens: -1688



class htb 1:30 parent 1:1 leaf 30: prio 3 quantum 200000 rate
100000Kbit ceil 10000Mbit burst 15337b/1 mpu 0b overhead 0b cburst
13750b/1 mpu 0b overhead 0b level 0

 Sent 52346527584 bytes 34574992 pkt (dropped 26, overlimits 0 requeues 0)

 rate 1468Mbit 121414pps backlog 0b 3p requeues 0

 lended: 121414 borrowed: 4013 giants: 0

 tokens: -40758 ctokens: -665



class sfq 20:385 parent 20:

 (dropped 0, overlimits 0 requeues 0)

 backlog 68130b 1p requeues 0

 allot -67608



class sfq 30:2db parent 30:

 (dropped 0, overlimits 0 requeues 0)

 backlog 204390b 3p requeues 0

 allot -66792

"



Intel NIC:
07:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)


[root@dockerhost3 ~]# uname -a
Linux dockerhost3 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18
19:05:49 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[root@dockerhost3 ~]# modinfo sch_htb

filename:
/lib/modules/3.10.0-327.36.1.el7.x86_64/kernel/net/sched/sch_htb.ko

license:        GPL

rhelversion:    7.2

srcversion:     67A59870E047ACE13610650

depends:

intree:         Y

vermagic:       3.10.0-327.36.1.el7.x86_64 SMP mod_unload modversions

signer:         CentOS Linux kernel signing key

sig_key:        7F:74:0F:3F:87:67:80:2E:E9:3B:A2:3F:10:EA:75:8D:2F:6C:AB:E3

sig_hashalgo:   sha256

parm:           htb_hysteresis:Hysteresis mode, less CPU load, less
accurate (int)


parm:           htb_rate_est:setup a default rate estimator (4sec
16sec) for htb classes (int)



Thanks,

Jipan

^ permalink raw reply

* Re: [B.A.T.M.A.N.] [PATCH net 1/1] net: batman-adv: Treat NET_XMIT_CN as transmit successfully
From: Sven Eckelmann @ 2016-11-21  8:16 UTC (permalink / raw)
  To: b.a.t.m.a.n; +Cc: fgao, mareklindner, sw, a, davem, netdev, gfree.wind
In-Reply-To: <1479688779-1328-1-git-send-email-fgao@ikuai8.com>

[-- Attachment #1: Type: text/plain, Size: 848 bytes --]

On Montag, 21. November 2016 08:39:39 CET fgao@ikuai8.com wrote:
> From: Gao Feng <fgao@ikuai8.com>
> 
> The tc could return NET_XMIT_CN as one congestion notification, but
> it does not mean the packe is lost. Other modules like ipvlan,

s/packe/packet/

> macvlan, and others treat NET_XMIT_CN as success too.
> 
> So batman-adv should add the NET_XMIT_CN check.
> 
> Signed-off-by: Gao Feng <fgao@ikuai8.com>
> ---
>  net/batman-adv/distributed-arp-table.c | 2 +-
>  net/batman-adv/fragmentation.c         | 2 +-
>  net/batman-adv/routing.c               | 2 +-
>  net/batman-adv/tp_meter.c              | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
[...]

Not sure how the batman-adv maintainers see this - but this patch needs
an additional patch for net-next to also add it to the parts which were
rewritten.

Kind regards,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 801 bytes --]
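
The check discussed in this patch can be sketched as a small predicate. The three return codes are the values from include/linux/netdevice.h; the helper name xmit_ok() is hypothetical and is not what the batman-adv patch itself uses.

```c
#include <stdbool.h>

/* Return codes as defined in include/linux/netdevice.h */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01	/* skb dropped */
#define NET_XMIT_CN      0x02	/* congestion notification */

/* Hypothetical helper: treat a congestion notification as a successful
 * transmit, because the packet itself was not necessarily lost. */
static bool xmit_ok(int ret)
{
	return ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN;
}
```

A caller that previously compared only against NET_XMIT_SUCCESS would count NET_XMIT_CN as a failure; with the combined check only NET_XMIT_DROP (and other error values) is treated as a failed transmit.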

^ permalink raw reply

* Re: [B.A.T.M.A.N.] [PATCH net 1/1] net: batman-adv: Treat NET_XMIT_CN as transmit successfully
From: Feng Gao @ 2016-11-21  8:21 UTC (permalink / raw)
  To: Sven Eckelmann
  Cc: b.a.t.m.a.n, mareklindner, sw, a, David S. Miller,
	Linux Kernel Network Developers
In-Reply-To: <1926952.1nDu9mPud9@bentobox>

Hi Sven,

On Mon, Nov 21, 2016 at 4:16 PM, Sven Eckelmann <sven@narfation.org> wrote:
> On Montag, 21. November 2016 08:39:39 CET fgao@ikuai8.com wrote:
>> From: Gao Feng <fgao@ikuai8.com>
>>
>> The tc could return NET_XMIT_CN as one congestion notification, but
>> it does not mean the packe is lost. Other modules like ipvlan,
>
> s/packe/packet/

What's this mean?

>
>> macvlan, and others treat NET_XMIT_CN as success too.
>>
>> So batman-adv should add the NET_XMIT_CN check.
>>
>> Signed-off-by: Gao Feng <fgao@ikuai8.com>
>> ---
>>  net/batman-adv/distributed-arp-table.c | 2 +-
>>  net/batman-adv/fragmentation.c         | 2 +-
>>  net/batman-adv/routing.c               | 2 +-
>>  net/batman-adv/tp_meter.c              | 2 +-
>>  4 files changed, 4 insertions(+), 4 deletions(-)
> [...]
>
> Not sure how the batman-adv maintainers see this - but this patch needs
> an additional patch for net-next to also add it to the parts which were
> rewritten.
>
> Kind regards,
>         Sven

OK. I will send another patch for net-next.

Best Regards
Feng

^ permalink raw reply

* Re: [PATCH net 1/1] net: batman-adv: Treat NET_XMIT_CN as transmit successfully
From: Sven Eckelmann @ 2016-11-21  8:31 UTC (permalink / raw)
  To: Feng Gao
  Cc: mareklindner-rVWd3aGhH2z5bpWLKbzFeg,
	Linux Kernel Network Developers,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r, a, David S. Miller
In-Reply-To: <CA+6hz4rZyacN_jg74JLg8EWkno2aJEnwJSTwRaw558uE-zAaxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 885 bytes --]

On Montag, 21. November 2016 16:21:52 CET Feng Gao wrote:
> Hi Sven,
> 
> On Mon, Nov 21, 2016 at 4:16 PM, Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org> wrote:
> > On Montag, 21. November 2016 08:39:39 CET fgao-KlmEoCYek3zQT0dZR+AlfA@public.gmane.org wrote:
> >> From: Gao Feng <fgao-KlmEoCYek3zQT0dZR+AlfA@public.gmane.org>
> >>
> >> The tc could return NET_XMIT_CN as one congestion notification, but
> >> it does not mean the packe is lost. Other modules like ipvlan,
> >
> > s/packe/packet/
> 
> What's this mean?

That there is a minor typo (*t* is missing) and this sed statement (when
applied only to the commit message) would fix it.

David already marked this patch as "Under Review" in his patchwork. So I would
guess that he will accept this patch and not the batman-adv maintainers. And
maybe he will fix this small typo - or maybe not.

Kind regards,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* Re: [PATCH net 1/1] net: batman-adv: Treat NET_XMIT_CN as transmit successfully
From: Feng Gao @ 2016-11-21  8:47 UTC (permalink / raw)
  To: Sven Eckelmann
  Cc: mareklindner-rVWd3aGhH2z5bpWLKbzFeg,
	Linux Kernel Network Developers,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r, a, David S. Miller
In-Reply-To: <1807277.5c8tsCoh5a@bentobox>

Hi Sven,

On Mon, Nov 21, 2016 at 4:31 PM, Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org> wrote:
> On Montag, 21. November 2016 16:21:52 CET Feng Gao wrote:
>> Hi Sven,
>>
>> On Mon, Nov 21, 2016 at 4:16 PM, Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org> wrote:
>> > On Montag, 21. November 2016 08:39:39 CET fgao-KlmEoCYek3zQT0dZR+AlfA@public.gmane.org wrote:
>> >> From: Gao Feng <fgao-KlmEoCYek3zQT0dZR+AlfA@public.gmane.org>
>> >>
>> >> The tc could return NET_XMIT_CN as one congestion notification, but
>> >> it does not mean the packe is lost. Other modules like ipvlan,
>> >
>> > s/packe/packet/
>>
>> What's this mean?
>
> That there is a minor typo (*t* is missing) and this sed statement (when
> applied only to the commit message) would fix it.

Thanks. I didn't realize it was a sed statement.

>
> David already marked this patch as "Under Review" in his patchwork. So I would
> guess that he will accept this patch and not the batman-adv maintainers. And
> maybe he will fix this small typo - or maybe not.
>
> Kind regards,
>         Sven

I will correct the typo in the patch for net-next.

Best Regards
Feng

^ permalink raw reply

* [PATCH net] tcp: zero ca_priv area when switching cc algorithms
From: Florian Westphal @ 2016-11-21  9:08 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

We need to zero out the private data area when an application switches a
connection to a different algorithm (TCP_CONGESTION setsockopt).

When congestion ops get assigned at connect time everything is already
zeroed because sk_alloc uses the __GFP_ZERO flag.  But in the setsockopt case
this contains whatever the previous cc placed there.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/tcp_cong.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 1294af4e0127..f9038d6b109e 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -200,8 +200,10 @@ static void tcp_reinit_congestion_control(struct sock *sk,
 	icsk->icsk_ca_ops = ca;
 	icsk->icsk_ca_setsockopt = 1;
 
-	if (sk->sk_state != TCP_CLOSE)
+	if (sk->sk_state != TCP_CLOSE) {
+		memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
 		tcp_init_congestion_control(sk);
+	}
 }
 
 /* Manage refcounts on socket close. */
-- 
2.7.3
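
The problem described in the commit message can be reproduced in a small userspace model. Everything below is illustrative: struct fake_sock, the two state structs and the helper names are hypothetical stand-ins, and CA_PRIV_WORDS is not the kernel's size for icsk_ca_priv. The sketch only assumes that two algorithms overlay different layouts on the same scratch area.

```c
#include <stdint.h>
#include <string.h>

#define CA_PRIV_WORDS 8	/* illustrative size, not the kernel's */

struct fake_sock {
	uint64_t ca_priv[CA_PRIV_WORDS];	/* scratch area shared by all algorithms */
};

/* Two hypothetical algorithms with different private layouts. */
struct algo_a_state { uint32_t last_cwnd; uint32_t epoch; };
struct algo_b_state { uint32_t samples;   uint32_t initialized; };

/* Algorithm A leaves its own state behind in the shared area. */
static void algo_a_run(struct fake_sock *sk)
{
	struct algo_a_state st = { .last_cwnd = 10, .epoch = 12345 };

	memcpy(sk->ca_priv, &st, sizeof(st));
}

/* Algorithm B assumes a zeroed area means "not set up yet". */
static int algo_b_needs_setup(const struct fake_sock *sk)
{
	struct algo_b_state st;

	memcpy(&st, sk->ca_priv, sizeof(st));
	return st.initialized == 0;
}

/* Switch to B; zero_first models the memset() the patch adds. */
static void switch_to_b(struct fake_sock *sk, int zero_first)
{
	if (zero_first)
		memset(sk->ca_priv, 0, sizeof(sk->ca_priv));
}
```

Without the zeroing step, B reads A's leftover epoch value as its own initialized flag and skips its setup; with it, B starts from the same state it would see on a freshly allocated socket.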

^ permalink raw reply related

* Re: [PATCH v2 net-next] mlx4: avoid unnecessary dirtying of critical fields
From: Tariq Toukan @ 2016-11-21  9:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Tariq Toukan
In-Reply-To: <1479662676.8455.364.camel@edumazet-glaptop3.roam.corp.google.com>


On 20/11/2016 7:24 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While stressing a 40Gbit mlx4 NIC with busy polling, I found false
> sharing in mlx4 driver that can be easily avoided.
>
> This patch brings an additional 7 % performance improvement in UDP_RR
> workload.
>
> 1) If we received no frame during one mlx4_en_process_rx_cq()
>     invocation, no need to call mlx4_cq_set_ci() and/or dirty ring->cons
>
> 2) Do not refill rx buffers if we have plenty of them.
>     This avoids false sharing and allows some bulk/batch optimizations.
>     Page allocator and its locks will thank us.
>
> Finally, mlx4_en_poll_rx_cq() should not return 0 if it determined
> cpu handling NIC IRQ should be changed. We should return budget-1
> instead, to not fool net_rx_action() and its netdev_budget.
>
>
> v2: keep AVG_PERF_COUNTER(... polled) even if polled is 0
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c |   47 ++++++++++++-------
>   1 file changed, 30 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 22f08f9ef4645869359783823127c0432fc7a591..6562f78b07f4370b5c1ea2c5e3a4221d7ebaeba8 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -688,18 +688,23 @@ static void validate_loopback(struct mlx4_en_priv *priv, struct sk_buff *skb)
>   	dev_kfree_skb_any(skb);
>   }
>   
> -static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
> -				     struct mlx4_en_rx_ring *ring)
> +static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
> +				      struct mlx4_en_rx_ring *ring)
>   {
> -	int index = ring->prod & ring->size_mask;
> +	u32 missing = ring->actual_size - (ring->prod - ring->cons);
>   
> -	while ((u32) (ring->prod - ring->cons) < ring->actual_size) {
> -		if (mlx4_en_prepare_rx_desc(priv, ring, index,
> +	/* Try to batch allocations, but not too much. */
> +	if (missing < 8)
> +		return false;
> +	do {
> +		if (mlx4_en_prepare_rx_desc(priv, ring,
> +					    ring->prod & ring->size_mask,
>   					    GFP_ATOMIC | __GFP_COLD))
>   			break;
>   		ring->prod++;
> -		index = ring->prod & ring->size_mask;
> -	}
> +	} while (--missing);
> +
> +	return true;
>   }
>   
>   /* When hardware doesn't strip the vlan, we need to calculate the checksum
> @@ -1081,15 +1086,20 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   
>   out:
>   	rcu_read_unlock();
> -	if (doorbell_pending)
> -		mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
>   
> +	if (polled) {
> +		if (doorbell_pending)
> +			mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> +
> +		mlx4_cq_set_ci(&cq->mcq);
> +		wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> +		ring->cons = cq->mcq.cons_index;
> +	}
>   	AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
> -	mlx4_cq_set_ci(&cq->mcq);
> -	wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> -	ring->cons = cq->mcq.cons_index;
> -	mlx4_en_refill_rx_buffers(priv, ring);
> -	mlx4_en_update_rx_prod_db(ring);
> +
> +	if (mlx4_en_refill_rx_buffers(priv, ring))
> +		mlx4_en_update_rx_prod_db(ring);
> +
>   	return polled;
>   }
>   
> @@ -1131,10 +1141,13 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
>   			return budget;
>   
>   		/* Current cpu is not according to smp_irq_affinity -
> -		 * probably affinity changed. need to stop this NAPI
> -		 * poll, and restart it on the right CPU
> +		 * probably affinity changed. Need to stop this NAPI
> +		 * poll, and restart it on the right CPU.
> +		 * Try to avoid returning a too small value (like 0),
> +		 * to not fool net_rx_action() and its netdev_budget
>   		 */
> -		done = 0;
> +		if (done)
> +			done--;
>   	}
>   	/* Done for now */
>   	if (napi_complete_done(napi, done))
>
>
Thanks Eric.
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
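
The batching logic in mlx4_en_refill_rx_buffers() relies on free-running unsigned counters. Below is a userspace sketch of the same arithmetic, assuming every allocation succeeds; struct ring and ring_refill() are simplified, hypothetical stand-ins for the driver structures, and only the threshold of 8 is taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

struct ring {
	uint32_t prod;		/* free-running producer counter */
	uint32_t cons;		/* free-running consumer counter */
	uint32_t actual_size;	/* number of usable descriptors */
};

#define REFILL_BATCH_MIN 8	/* same threshold as in the patch */

/* Returns true if a refill was done, mirroring the bool return that
 * gates the producer doorbell update in the patch. */
static bool ring_refill(struct ring *r)
{
	/* unsigned subtraction stays correct when a counter wraps past 2^32 */
	uint32_t missing = r->actual_size - (r->prod - r->cons);

	if (missing < REFILL_BATCH_MIN)
		return false;	/* plenty of buffers, leave the ring untouched */
	do {
		r->prod++;	/* stands in for one successful buffer allocation */
	} while (--missing);
	return true;
}
```

When fewer than 8 slots are empty the function does not write to the ring at all, which is the point of the change: the producer side is touched only when a whole batch can be posted.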

^ permalink raw reply

* Re: [PATCH net-next v3 1/4] bpf, mlx5: fix mlx5e_create_rq taking reference on prog
From: Saeed Mahameed @ 2016-11-21  9:27 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Alexei Starovoitov, Brenden Blanco, Zhiyi Sun,
	Rana Shahout, Saeed Mahameed, Linux Netdev List
In-Reply-To: <cb3c229b56d24f6650472bc333d0a72de80327de.1479514784.git.daniel@iogearbox.net>

On Sat, Nov 19, 2016 at 2:45 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add() but
> without checking the return value. bpf_prog_add() can fail since 92117d8443bc
> ("bpf: fix refcnt overflow"), so we really must check it. Take the reference
> right when we assign it to the rq from priv->xdp_prog, and just drop the
> reference on error path. Destruction in mlx5e_destroy_rq() looks good, though.
>
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>
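
The pattern described in this patch (take the reference at the point of assignment and fail the creation if the reference cannot be taken) can be sketched in userspace as follows. The types and helpers are hypothetical, REF_MAX is an arbitrary small limit chosen for illustration, and returning NULL on failure is a simplification that is not meant to mirror how bpf_prog_add() reports errors.

```c
#include <stddef.h>

#define REF_MAX 4	/* tiny illustrative limit */

struct prog { int refcnt; };
struct rq   { struct prog *xdp_prog; };

/* Take a reference, or return NULL instead of going past the limit. */
static struct prog *prog_get(struct prog *p)
{
	if (p->refcnt >= REF_MAX)
		return NULL;
	p->refcnt++;
	return p;
}

/* Take the reference right where the program is assigned to the queue
 * and report failure to the caller instead of ignoring it. */
static int rq_create(struct rq *rq, struct prog *p)
{
	rq->xdp_prog = p ? prog_get(p) : NULL;
	if (p && !rq->xdp_prog)
		return -1;
	return 0;
}
```

Because the reference is taken in the same statement that stores the pointer, a failed creation never leaves a queue holding a program it does not own a reference to.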

^ permalink raw reply

* Re: [PATCH net-next v3 2/4] bpf, mlx5: fix various refcount issues in mlx5e_xdp_set
From: Saeed Mahameed @ 2016-11-21  9:30 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Alexei Starovoitov, Brenden Blanco, Zhiyi Sun,
	Rana Shahout, Saeed Mahameed, Linux Netdev List
In-Reply-To: <6030d6a2d31ce68280138ac7ed33c3fc6ba585b2.1479514784.git.daniel@iogearbox.net>

On Sat, Nov 19, 2016 at 2:45 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> There are multiple issues in mlx5e_xdp_set():
>
> 1) The batched bpf_prog_add() is currently not checked for errors. When
>    doing so, it should be done at an earlier point in time to make sure
>    that we cannot fail anymore at the time we want to set the program for
>    each channel. The batched refs short-cut can only be performed when we
>    don't need to perform a reset for changing the rq type and the device
>    was in opened state. In case the device was not in opened state, then
>    the next mlx5e_open_locked() will acquire the refs from the control prog
>    via mlx5e_create_rq(), same when we need to perform a reset.
>
> 2) When swapping the priv->xdp_prog, then no extra reference count must be
>    taken since we got that from call path via dev_change_xdp_fd() already.
>    Otherwise, we'd never be able to release the program. Also, bpf_prog_add()
>    without checking the return code could fail.
>
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v3 3/4] bpf, mlx5: drop priv->xdp_prog reference on netdev cleanup
From: Saeed Mahameed @ 2016-11-21  9:31 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Alexei Starovoitov, Brenden Blanco, Zhiyi Sun,
	Rana Shahout, Saeed Mahameed, Linux Netdev List
In-Reply-To: <29469b54bfddfe218266b8b902a3735034d28226.1479514784.git.daniel@iogearbox.net>

On Sat, Nov 19, 2016 at 2:45 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> mlx5e_xdp_set() is currently the only place where we drop reference on the
> prog sitting in priv->xdp_prog when it's exchanged by a new one. We also
> need to make sure that we eventually release that reference, for example,
> in case the netdev is dismantled, otherwise we leak the program.
>
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>

^ permalink raw reply

* [PATCH iproute2 0/2] tc/cls_flower: Support for ip tunnel metadata set/release/classify
From: Amir Vadai @ 2016-11-21 10:20 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, netdev, Or Gerlitz, Hadar Har-Zion, Roi Dayan,
	Amir Vadai

Hi,

This short series adds support for matching and setting metadata for ip tunnel
shared device using the TC system, introduced in kernel 4.9 [1].

Applied and tested on top of commit f3f339e9590a ("cleanup debris from revert")

Example usage:

$ tc filter add dev vxlan0 protocol ip parent ffff: \
    flower \
      enc_src_ip 11.11.0.2 \
      enc_dst_ip 11.11.0.1 \
      enc_key_id 11 \
      dst_ip 11.11.11.1 \
    action tunnel_key release \
    action mirred egress redirect dev vnet0

$ tc filter add dev net0 protocol ip parent ffff: \
    flower \
      ip_proto 1 \
      dst_ip 11.11.11.2 \
    action tunnel_key set \
      src_ip 11.11.0.1 \
      dst_ip 11.11.0.2 \
      id 11 \
    action mirred egress redirect dev vxlan0

[1] - d1ba24feb466 ("Merge branch 'act_tunnel_key'")

Thanks,
Amir

Amir Vadai (2):
  tc/cls_flower: Classify packet in ip tunnels
  tc/act_tunnel: Introduce ip tunnel action

 include/linux/tc_act/tc_tunnel_key.h |  42 ++++++
 man/man8/tc-flower.8                 |  17 ++-
 man/man8/tc-tunnel_key.8             | 105 ++++++++++++++
 tc/Makefile                          |   1 +
 tc/f_flower.c                        |  85 +++++++++++-
 tc/m_tunnel_key.c                    | 259 +++++++++++++++++++++++++++++++++++
 6 files changed, 505 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/tc_act/tc_tunnel_key.h
 create mode 100644 man/man8/tc-tunnel_key.8
 create mode 100644 tc/m_tunnel_key.c

-- 
2.10.2

^ permalink raw reply

* [PATCH iproute2 2/2] tc/act_tunnel: Introduce ip tunnel action
From: Amir Vadai @ 2016-11-21 10:20 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, netdev, Or Gerlitz, Hadar Har-Zion, Roi Dayan,
	Amir Vadai
In-Reply-To: <20161121102056.13468-1-amir@vadai.me>

This action could be used before redirecting packets to a shared tunnel
device, or when redirecting packets arriving from such a device.

The action will release the metadata created by the tunnel device
(decap), or set the metadata with the specified values for encap
operation.

For example, the following flower filter will forward all ICMP packets
destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before
redirecting, metadata for the vxlan tunnel is created using the
tunnel_key action and its arguments:

$ tc filter add dev net0 protocol ip parent ffff: \
    flower \
      ip_proto 1 \
      dst_ip 11.11.11.2 \
    action tunnel_key set \
      src_ip 11.11.0.1 \
      dst_ip 11.11.0.2 \
      id 11 \
    action mirred egress redirect dev vxlan0

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 include/linux/tc_act/tc_tunnel_key.h |  42 ++++++
 man/man8/tc-tunnel_key.8             | 105 ++++++++++++++
 tc/Makefile                          |   1 +
 tc/m_tunnel_key.c                    | 259 +++++++++++++++++++++++++++++++++++
 4 files changed, 407 insertions(+)
 create mode 100644 include/linux/tc_act/tc_tunnel_key.h
 create mode 100644 man/man8/tc-tunnel_key.8
 create mode 100644 tc/m_tunnel_key.c

diff --git a/include/linux/tc_act/tc_tunnel_key.h b/include/linux/tc_act/tc_tunnel_key.h
new file mode 100644
index 000000000000..f9ddf5369a45
--- /dev/null
+++ b/include/linux/tc_act/tc_tunnel_key.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) 2016, Amir Vadai <amir@vadai.me>
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __LINUX_TC_TUNNEL_KEY_H
+#define __LINUX_TC_TUNNEL_KEY_H
+
+#include <linux/pkt_cls.h>
+
+#define TCA_ACT_TUNNEL_KEY 17
+
+#define TCA_TUNNEL_KEY_ACT_SET	    1
+#define TCA_TUNNEL_KEY_ACT_RELEASE  2
+
+struct tc_tunnel_key {
+	tc_gen;
+	int t_action;
+};
+
+enum {
+	TCA_TUNNEL_KEY_UNSPEC,
+	TCA_TUNNEL_KEY_TM,
+	TCA_TUNNEL_KEY_PARMS,
+	TCA_TUNNEL_KEY_ENC_IPV4_SRC,	/* be32 */
+	TCA_TUNNEL_KEY_ENC_IPV4_DST,	/* be32 */
+	TCA_TUNNEL_KEY_ENC_IPV6_SRC,	/* struct in6_addr */
+	TCA_TUNNEL_KEY_ENC_IPV6_DST,	/* struct in6_addr */
+	TCA_TUNNEL_KEY_ENC_KEY_ID,	/* be64 */
+	TCA_TUNNEL_KEY_PAD,
+	__TCA_TUNNEL_KEY_MAX,
+};
+
+#define TCA_TUNNEL_KEY_MAX (__TCA_TUNNEL_KEY_MAX - 1)
+
+#endif
+
diff --git a/man/man8/tc-tunnel_key.8 b/man/man8/tc-tunnel_key.8
new file mode 100644
index 000000000000..c3b21e7d040e
--- /dev/null
+++ b/man/man8/tc-tunnel_key.8
@@ -0,0 +1,105 @@
+.TH "Tunnel metadata manipulation action in tc" 8 "10 Nov 2016" "iproute2" "Linux"
+
+.SH NAME
+tunnel_key - Tunnel metadata manipulation
+.SH SYNOPSIS
+.in +8
+.ti -8
+.BR tc " ... " "action tunnel_key" " { " release " | "
+.IR SET " }"
+
+.ti -8
+.IR SET " := "
+.BR set " " src_ip
+.IR ADDRESS
+.BR dst_ip
+.IR ADDRESS
+.BI id " KEY_ID"
+
+.SH DESCRIPTION
+The
+.B tunnel_key
+action performs IP tunnel en- or decapsulation on a packet, reflected by
+the operation modes
+.IR RELEASE " and " SET .
+The
+.I RELEASE
+mode is simple, as no further information is required to just drop the
+metadata attached to the skb. The
+.IR SET
+mode requires the source and destination ip
+.I ADDRESS
+and the tunnel key id
+.I KEY_ID
+which will be used by the ip tunnel shared device to create the tunnel header. The
+.B tunnel_key
+action is useful only in combination with a
+.B mirred redirect
+action to a shared IP tunnel device which will use the metadata (for
+.I SET
+) and release the metadata created by it (for
+.I RELEASE
+).
+
+.SH OPTIONS
+.TP
+.B release
+Decapsulation mode, no further arguments allowed.
+.TP
+.B set
+Encapsulation mode. Requires
+.B id
+,
+.B src_ip
+and
+.B dst_ip
+options.
+.RS
+.TP
+.B id
+Tunnel ID (for example VNI in VXLAN tunnel)
+.TP
+.B src_ip
+Outer header source IP address (IPv4 or IPv6)
+.TP
+.B dst_ip
+Outer header destination IP address (IPv4 or IPv6)
+.RE
+.SH EXAMPLES
+The following example encapsulates incoming ICMP packets on eth0 into a vxlan
+tunnel by setting metadata to VNI 11, source IP 11.11.0.1 and destination IP
+11.11.0.2, and forwarding the skb with the metadata to device vxlan0, which will
+prepare the VXLAN headers:
+
+.RS
+.EX
+#tc qdisc add dev eth0 handle ffff: ingress
+#tc filter add dev eth0 protocol ip parent ffff: \\
+  flower \\
+    ip_proto icmp \\
+  action tunnel_key set \\
+    src_ip 11.11.0.1 \\
+    dst_ip 11.11.0.2 \\
+    id 11 \\
+  action mirred egress redirect dev vxlan0
+.EE
+.RE
+
+Here is an example of the
+.B release
+function: Incoming VXLAN packets on vxlan0 with specific outer IP's and VNI 11
+in the metadata are decapsulated and redirected to eth0:
+
+.RS
+.EX
+#tc qdisc add dev vxlan0 handle ffff: ingress
+#tc filter add dev vxlan0 protocol ip parent ffff: \\
+  flower \\
+    enc_src_ip 11.11.0.2 enc_dst_ip 11.11.0.1 enc_key_id 11 \\
+  action tunnel_key release \\
+  action mirred egress redirect dev eth0
+.EE
+.RE
+
+.SH SEE ALSO
+.BR tc (8)
diff --git a/tc/Makefile b/tc/Makefile
index dfa875b5edaf..f6f41ca2bb3d 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -50,6 +50,7 @@ TCMODULES += m_simple.o
 TCMODULES += m_vlan.o
 TCMODULES += m_connmark.o
 TCMODULES += m_bpf.o
+TCMODULES += m_tunnel_key.o
 TCMODULES += p_ip.o
 TCMODULES += p_icmp.o
 TCMODULES += p_tcp.o
diff --git a/tc/m_tunnel_key.c b/tc/m_tunnel_key.c
new file mode 100644
index 000000000000..42f2a184bd3a
--- /dev/null
+++ b/tc/m_tunnel_key.c
@@ -0,0 +1,259 @@
+/*
+ * m_tunnel_key.c	ip tunnel manipulation module
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Authors:     Amir Vadai <amir@vadai.me>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <linux/if_ether.h>
+#include "utils.h"
+#include "rt_names.h"
+#include "tc_util.h"
+#include <linux/tc_act/tc_tunnel_key.h>
+
+static void explain(void)
+{
+	fprintf(stderr, "Usage: tunnel_key release\n");
+	fprintf(stderr, "       tunnel_key set id TUNNELID src_ip IP dst_ip IP\n");
+}
+
+static void usage(void)
+{
+	explain();
+	exit(-1);
+}
+
+static int tunnel_key_parse_ip_addr(char *str, int addr4_type, int addr6_type,
+				    struct nlmsghdr *n)
+{
+	int ret;
+	inet_prefix addr;
+
+	ret = get_addr(&addr, str, AF_UNSPEC);
+	if (ret)
+		return -1;
+
+	addattr_l(n, MAX_MSG, addr.family == AF_INET ? addr4_type : addr6_type,
+		  addr.data, addr.bytelen);
+
+	return 0;
+}
+
+static int tunnel_key_parse_key_id(char *str, int type, struct nlmsghdr *n)
+{
+	int ret;
+	__be32 key_id;
+
+	ret = get_be32(&key_id, str, 10);
+	if (ret)
+		return -1;
+
+	addattr32(n, MAX_MSG, type, key_id);
+
+	return 0;
+}
+
+static int parse_tunnel_key(struct action_util *a, int *argc_p, char ***argv_p,
+			    int tca_id, struct nlmsghdr *n)
+{
+	struct tc_tunnel_key parm = { .action = TC_ACT_PIPE };
+	char **argv = *argv_p;
+	int argc = *argc_p;
+	struct rtattr *tail;
+	int action = 0;
+	int ret;
+	int has_src_ip = 0;
+	int has_dst_ip = 0;
+	int has_key_id = 0;
+
+	if (matches(*argv, "tunnel_key") != 0)
+		return -1;
+
+	tail = NLMSG_TAIL(n);
+	addattr_l(n, MAX_MSG, tca_id, NULL, 0);
+
+	NEXT_ARG();
+
+	while (argc > 0) {
+		if (matches(*argv, "release") == 0) {
+			if (action) {
+				fprintf(stderr, "unexpected \"%s\" - action already specified\n",
+					*argv);
+				explain();
+				return -1;
+			}
+			action = TCA_TUNNEL_KEY_ACT_RELEASE;
+		} else if (matches(*argv, "set") == 0) {
+			if (action) {
+				fprintf(stderr, "unexpected \"%s\" - action already specified\n",
+					*argv);
+				explain();
+				return -1;
+			}
+			action = TCA_TUNNEL_KEY_ACT_SET;
+		} else if (matches(*argv, "src_ip") == 0) {
+			NEXT_ARG();
+			ret = tunnel_key_parse_ip_addr(*argv,
+						       TCA_TUNNEL_KEY_ENC_IPV4_SRC,
+						       TCA_TUNNEL_KEY_ENC_IPV6_SRC,
+						       n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"src_ip\"\n");
+				return -1;
+			}
+			has_src_ip = 1;
+		} else if (matches(*argv, "dst_ip") == 0) {
+			NEXT_ARG();
+			ret = tunnel_key_parse_ip_addr(*argv,
+						       TCA_TUNNEL_KEY_ENC_IPV4_DST,
+						       TCA_TUNNEL_KEY_ENC_IPV6_DST,
+						       n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"dst_ip\"\n");
+				return -1;
+			}
+			has_dst_ip = 1;
+		} else if (matches(*argv, "id") == 0) {
+			NEXT_ARG();
+			ret = tunnel_key_parse_key_id(*argv, TCA_TUNNEL_KEY_ENC_KEY_ID, n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"id\"\n");
+				return -1;
+			}
+			has_key_id = 1;
+		} else if (matches(*argv, "help") == 0) {
+			usage();
+		} else {
+			break;
+		}
+		NEXT_ARG_FWD();
+	}
+
+	if (argc && !action_a2n(*argv, &parm.action, false))
+		NEXT_ARG_FWD();
+
+	if (argc) {
+		if (matches(*argv, "index") == 0) {
+			NEXT_ARG();
+			if (get_u32(&parm.index, *argv, 10)) {
+				fprintf(stderr, "tunnel_key: Illegal \"index\"\n");
+				return -1;
+			}
+
+			NEXT_ARG_FWD();
+		}
+	}
+
+	if (action == TCA_TUNNEL_KEY_ACT_SET &&
+	    (!has_src_ip || !has_dst_ip || !has_key_id)) {
+		fprintf(stderr, "set needs tunnel_key parameters\n");
+		explain();
+		return -1;
+	}
+
+	parm.t_action = action;
+	addattr_l(n, MAX_MSG, TCA_TUNNEL_KEY_PARMS, &parm, sizeof(parm));
+	tail->rta_len = (char *)NLMSG_TAIL(n) - (char *)tail;
+
+	*argc_p = argc;
+	*argv_p = argv;
+
+	return 0;
+}
+
+static void tunnel_key_print_ip_addr(FILE *f, char *name,
+				     struct rtattr *attr)
+{
+	int family;
+	size_t len;
+
+	if (!attr)
+		return;
+
+	len = RTA_PAYLOAD(attr);
+
+	if (len == 4)
+		family = AF_INET;
+	else if (len == 16)
+		family = AF_INET6;
+	else
+		return;
+
+	fprintf(f, "\n\t%s %s", name, rt_addr_n2a_rta(family, attr));
+}
+
+static void tunnel_key_print_key_id(FILE *f, char *name,
+				    struct rtattr *attr)
+{
+	if (!attr)
+		return;
+	fprintf(f, "\n\t%s %u", name, ntohl(rta_getattr_u32(attr)));
+}
+
+static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg)
+{
+	struct rtattr *tb[TCA_TUNNEL_KEY_MAX + 1];
+	struct tc_tunnel_key *parm;
+
+	if (!arg)
+		return -1;
+
+	parse_rtattr_nested(tb, TCA_TUNNEL_KEY_MAX, arg);
+
+	if (!tb[TCA_TUNNEL_KEY_PARMS]) {
+		fprintf(f, "[NULL tunnel_key parameters]");
+		return -1;
+	}
+	parm = RTA_DATA(tb[TCA_TUNNEL_KEY_PARMS]);
+
+	fprintf(f, "tunnel_key");
+
+	switch (parm->t_action) {
+	case TCA_TUNNEL_KEY_ACT_RELEASE:
+		fprintf(f, " release");
+		break;
+	case TCA_TUNNEL_KEY_ACT_SET:
+		fprintf(f, " set");
+		tunnel_key_print_ip_addr(f, "src_ip",
+					 tb[TCA_TUNNEL_KEY_ENC_IPV4_SRC]);
+		tunnel_key_print_ip_addr(f, "dst_ip",
+					 tb[TCA_TUNNEL_KEY_ENC_IPV4_DST]);
+		tunnel_key_print_ip_addr(f, "src_ip",
+					 tb[TCA_TUNNEL_KEY_ENC_IPV6_SRC]);
+		tunnel_key_print_ip_addr(f, "dst_ip",
+					 tb[TCA_TUNNEL_KEY_ENC_IPV6_DST]);
+		tunnel_key_print_key_id(f, "key_id",
+					tb[TCA_TUNNEL_KEY_ENC_KEY_ID]);
+		break;
+	}
+	fprintf(f, " %s", action_n2a(parm->action));
+
+	fprintf(f, "\n\tindex %d ref %d bind %d", parm->index, parm->refcnt,
+		parm->bindcnt);
+
+	if (show_stats) {
+		if (tb[TCA_TUNNEL_KEY_TM]) {
+			struct tcf_t *tm = RTA_DATA(tb[TCA_TUNNEL_KEY_TM]);
+
+			print_tm(f, tm);
+		}
+	}
+
+	fprintf(f, "\n ");
+
+	return 0;
+}
+
+struct action_util tunnel_key_action_util = {
+	.id = "tunnel_key",
+	.parse_aopt = parse_tunnel_key,
+	.print_aopt = print_tunnel_key,
+};
-- 
2.10.2
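
The key id handling in the patch above (parse a decimal string, store it in
network byte order, convert back when printing) can be sketched in plain
userspace C. This is only an illustration under stated assumptions:
parse_key_id_be32() and key_id_to_host() are hypothetical names standing in
for the iproute2 helpers, and the validation here may be stricter than what
get_be32() actually does.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the get_be32() call used in the patch:
 * parse a decimal string and store the value in network byte order.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
static int parse_key_id_be32(const char *str, uint32_t *key_be)
{
	char *end;
	unsigned long v;

	/* Reject empty input, signs and leading whitespace up front. */
	if (!str || *str < '0' || *str > '9')
		return -1;

	errno = 0;
	v = strtoul(str, &end, 10);
	if (errno || *end != '\0' || v > UINT32_MAX)
		return -1;

	*key_be = htonl((uint32_t)v);
	return 0;
}

/* Convert a stored key id back to host order, e.g. for printing as "%u". */
static uint32_t key_id_to_host(uint32_t key_be)
{
	return ntohl(key_be);
}
```

Keeping the attribute in network byte order on the wire and converting only
at the parse and print boundaries matches the get_be32()/ntohl() pairing in
the patch.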

^ permalink raw reply related

* [PATCH iproute2 1/2] tc/cls_flower: Classify packet in ip tunnels
From: Amir Vadai @ 2016-11-21 10:20 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, netdev, Or Gerlitz, Hadar Har-Zion, Roi Dayan,
	Amir Vadai
In-Reply-To: <20161121102056.13468-1-amir@vadai.me>

Introduce classifying by metadata extracted by the tunnel device.
Outer header fields (source/dest ip and tunnel id) are extracted from
the metadata when classifying.

For example, the following will add a filter on the ingress Qdisc of the
shared vxlan device named 'vxlan0' to match packets with outer src ip
11.11.0.2, dst ip 11.11.0.1 and tunnel id 11. The packets will be
forwarded to tap device 'vnet0' (after metadata is released):

$ tc filter add dev vxlan0 protocol ip parent ffff: \
    flower \
      enc_src_ip 11.11.0.2 \
      enc_dst_ip 11.11.0.1 \
      enc_key_id 11 \
      dst_ip 11.11.11.1 \
    action tunnel_key release \
    action mirred egress redirect dev vnet0

The tunnel_key action will be introduced in the next patch in this
series.

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 man/man8/tc-flower.8 | 17 ++++++++++-
 tc/f_flower.c        | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 98 insertions(+), 4 deletions(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index 74f76647753b..0e0b0cf4bb72 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -36,7 +36,11 @@ flower \- flow based traffic control filter
 .BR dst_ip " | " src_ip " } { "
 .IR ipv4_address " | " ipv6_address " } | { "
 .BR dst_port " | " src_port " } "
-.IR port_number " }"
+.IR port_number " } | "
+.B enc_key_id
+.IR KEY-ID " | {"
+.BR enc_dst_ip " | " enc_src_ip " } { "
+.IR ipv4_address " | " ipv6_address " } | "
 .SH DESCRIPTION
 The
 .B flower
@@ -121,6 +125,17 @@ which has to be specified in beforehand.
 Match on layer 4 protocol source or destination port number. Only available for
 .BR ip_proto " values " udp " and " tcp ,
 which has to be specified in beforehand.
+.TP
+.BI enc_key_id " NUMBER"
+.TQ
+.BI enc_dst_ip " ADDRESS"
+.TQ
+.BI enc_src_ip " ADDRESS"
+Match on IP tunnel metadata. Key id
+.I NUMBER
+is a 32 bit tunnel key id (e.g. VNI for VXLAN tunnel).
+.I ADDRESS
+must be a valid IPv4 or IPv6 address.
 .SH NOTES
 As stated above where applicable, matches of a certain layer implicitly depend
 on the matches of the next lower layer. Precisely, layer one and two matches (
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 2d31d1aa832d..1cf0750b5b83 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -41,7 +41,10 @@ static void explain(void)
 	fprintf(stderr, "                       dst_ip [ IPV4-ADDR | IPV6-ADDR ] |\n");
 	fprintf(stderr, "                       src_ip [ IPV4-ADDR | IPV6-ADDR ] |\n");
 	fprintf(stderr, "                       dst_port PORT-NUMBER |\n");
-	fprintf(stderr, "                       src_port PORT-NUMBER }\n");
+	fprintf(stderr, "                       src_port PORT-NUMBER |\n");
+	fprintf(stderr, "                       enc_dst_ip [ IPV4-ADDR | IPV6-ADDR ] |\n");
+	fprintf(stderr, "                       enc_src_ip [ IPV4-ADDR | IPV6-ADDR ] |\n");
+	fprintf(stderr, "                       enc_key_id [ KEY-ID ] }\n");
 	fprintf(stderr,	"       FILTERID := X:Y:Z\n");
 	fprintf(stderr,	"       ACTION-SPEC := ... look at individual actions\n");
 	fprintf(stderr,	"\n");
@@ -121,8 +124,9 @@ static int flower_parse_ip_addr(char *str, __be16 eth_type,
 		family = AF_INET;
 	} else if (eth_type == htons(ETH_P_IPV6)) {
 		family = AF_INET6;
+	} else if (!eth_type) {
+		family = AF_UNSPEC;
 	} else {
-		fprintf(stderr, "Illegal \"eth_type\" for ip address\n");
 		return -1;
 	}
 
@@ -130,8 +134,10 @@ static int flower_parse_ip_addr(char *str, __be16 eth_type,
 	if (ret)
 		return -1;
 
-	if (addr.family != family)
+	if (family && (addr.family != family)) {
+		fprintf(stderr, "Illegal \"eth_type\" for ip address\n");
 		return -1;
+	}
 
 	addattr_l(n, MAX_MSG, addr.family == AF_INET ? addr4_type : addr6_type,
 		  addr.data, addr.bytelen);
@@ -181,6 +187,20 @@ static int flower_parse_port(char *str, __u8 ip_port,
 	return 0;
 }
 
+static int flower_parse_key_id(char *str, int type, struct nlmsghdr *n)
+{
+	int ret;
+	__be32 key_id;
+
+	ret = get_be32(&key_id, str, 10);
+	if (ret)
+		return -1;
+
+	addattr32(n, MAX_MSG, type, key_id);
+
+	return 0;
+}
+
 static int flower_parse_opt(struct filter_util *qu, char *handle,
 			    int argc, char **argv, struct nlmsghdr *n)
 {
@@ -339,6 +359,38 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
 				fprintf(stderr, "Illegal \"src_port\"\n");
 				return -1;
 			}
+		} else if (matches(*argv, "enc_dst_ip") == 0) {
+			NEXT_ARG();
+			ret = flower_parse_ip_addr(*argv, 0,
+						   TCA_FLOWER_KEY_ENC_IPV4_DST,
+						   TCA_FLOWER_KEY_ENC_IPV4_DST_MASK,
+						   TCA_FLOWER_KEY_ENC_IPV6_DST,
+						   TCA_FLOWER_KEY_ENC_IPV6_DST_MASK,
+						   n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"enc_dst_ip\"\n");
+				return -1;
+			}
+		} else if (matches(*argv, "enc_src_ip") == 0) {
+			NEXT_ARG();
+			ret = flower_parse_ip_addr(*argv, 0,
+						   TCA_FLOWER_KEY_ENC_IPV4_SRC,
+						   TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK,
+						   TCA_FLOWER_KEY_ENC_IPV6_SRC,
+						   TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK,
+						   n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"enc_src_ip\"\n");
+				return -1;
+			}
+		} else if (matches(*argv, "enc_key_id") == 0) {
+			NEXT_ARG();
+			ret = flower_parse_key_id(*argv,
+						  TCA_FLOWER_KEY_ENC_KEY_ID, n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"enc_key_id\"\n");
+				return -1;
+			}
 		} else if (matches(*argv, "action") == 0) {
 			NEXT_ARG();
 			ret = parse_action(&argc, &argv, TCA_FLOWER_ACT, n);
@@ -509,6 +561,14 @@ static void flower_print_port(FILE *f, char *name, __u8 ip_proto,
 	fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u16(attr)));
 }
 
+static void flower_print_key_id(FILE *f, char *name,
+				struct rtattr *attr)
+{
+	if (!attr)
+		return;
+	fprintf(f, "\n  %s %u", name, ntohl(rta_getattr_u32(attr)));
+}
+
 static int flower_print_opt(struct filter_util *qu, FILE *f,
 			    struct rtattr *opt, __u32 handle)
 {
@@ -577,6 +637,25 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
 			  tb[TCA_FLOWER_KEY_TCP_SRC],
 			  tb[TCA_FLOWER_KEY_UDP_SRC]);
 
+	flower_print_ip_addr(f, "enc_dst_ip",
+			     tb[TCA_FLOWER_KEY_ENC_IPV4_DST_MASK] ?
+			     htons(ETH_P_IP) : htons(ETH_P_IPV6),
+			     tb[TCA_FLOWER_KEY_ENC_IPV4_DST],
+			     tb[TCA_FLOWER_KEY_ENC_IPV4_DST_MASK],
+			     tb[TCA_FLOWER_KEY_ENC_IPV6_DST],
+			     tb[TCA_FLOWER_KEY_ENC_IPV6_DST_MASK]);
+
+	flower_print_ip_addr(f, "enc_src_ip",
+			     tb[TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK] ?
+			     htons(ETH_P_IP) : htons(ETH_P_IPV6),
+			     tb[TCA_FLOWER_KEY_ENC_IPV4_SRC],
+			     tb[TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK],
+			     tb[TCA_FLOWER_KEY_ENC_IPV6_SRC],
+			     tb[TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK]);
+
+	flower_print_key_id(f, "enc_key_id",
+			    tb[TCA_FLOWER_KEY_ENC_KEY_ID]);
+
 	if (tb[TCA_FLOWER_FLAGS])  {
 		__u32 flags = rta_getattr_u32(tb[TCA_FLOWER_FLAGS]);
 
-- 
2.10.2
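
The relaxed family check in flower_parse_ip_addr() above (an eth_type of 0
maps to AF_UNSPEC, which accepts either address family) can be illustrated
with a small userspace sketch. parse_addr_family() is a hypothetical helper,
not an iproute2 function, and it uses inet_pton(), which is likely stricter
than iproute2's get_addr().

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: parse str as an IPv4 or IPv6 address and return
 * the detected family (AF_INET or AF_INET6), or -1 on failure.
 * If want is AF_UNSPEC either family is accepted; otherwise the parsed
 * family must equal want.
 */
static int parse_addr_family(const char *str, int want)
{
	unsigned char buf[sizeof(struct in6_addr)];
	int family;

	if (inet_pton(AF_INET, str, buf) == 1)
		family = AF_INET;
	else if (inet_pton(AF_INET6, str, buf) == 1)
		family = AF_INET6;
	else
		return -1;

	/* Mirrors "if (family && (addr.family != family))" in the patch. */
	if (want != AF_UNSPEC && family != want)
		return -1;

	return family;
}
```

With want set to AF_UNSPEC, as the enc_src_ip and enc_dst_ip callers
effectively do by passing 0 for eth_type, the outer address family no longer
depends on the filter's protocol.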

^ permalink raw reply related

* Re: [PATCH net 1/1] net: batman-adv: Treat NET_XMIT_CN as transmit successfully
From: Sergei Shtylyov @ 2016-11-21 10:44 UTC (permalink / raw)
  To: fgao-KlmEoCYek3zQT0dZR+AlfA, mareklindner-rVWd3aGhH2z5bpWLKbzFeg,
	sw-2YrNx6rUIHYiY0qSoAWiAoQuADTiUCJX, a,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r,
	netdev-u79uwXL29TY76Z2rM5mHXA, gfree.wind-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <1479688779-1328-1-git-send-email-fgao-KlmEoCYek3zQT0dZR+AlfA@public.gmane.org>

Hello.

On 11/21/2016 3:39 AM, fgao-KlmEoCYek3zQT0dZR+AlfA@public.gmane.org wrote:

> From: Gao Feng <fgao-KlmEoCYek3zQT0dZR+AlfA@public.gmane.org>
>
> The tc could return NET_XMIT_CN as one congestion notification, but
> it does not mean the packe is lost. Other modules like ipvlan,

    Packet.

> macvlan, and others treat NET_XMIT_CN as success too.
>
> So batman-adv should add the NET_XMIT_CN check.
>
> Signed-off-by: Gao Feng <fgao-KlmEoCYek3zQT0dZR+AlfA@public.gmane.org>

[...]

> diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c
> index 7e8dc64..8edd324 100644
> --- a/net/batman-adv/routing.c
> +++ b/net/batman-adv/routing.c
> @@ -706,7 +706,7 @@ static int batadv_route_unicast_packet(struct sk_buff *skb,
>  		goto out;
>
>  	/* translate transmit result into receive result */
> -	if (res == NET_XMIT_SUCCESS) {
> +	if (res == NET_XMIT_SUCCESS || ret == NET_XMIT_CN) {

    Not 'res == NET_XMIT_CN'?

>  		/* skb was transmitted and consumed */
>  		batadv_inc_counter(bat_priv, BATADV_CNT_FORWARD);
>  		batadv_add_counter(bat_priv, BATADV_CNT_FORWARD_BYTES,
[...]

MBR, Sergei

^ permalink raw reply

* [PATCH net v2 1/1] net: batman-adv: Treat NET_XMIT_CN as transmit successfully
From: fgao @ 2016-11-21 10:58 UTC (permalink / raw)
  To: mareklindner, sw, a, davem, b.a.t.m.a.n, netdev, gfree.wind; +Cc: Gao Feng

From: Gao Feng <fgao@ikuai8.com>

The tc could return NET_XMIT_CN as one congestion notification, but
it does not mean the packet is lost. Other modules like ipvlan,
macvlan, and others treat NET_XMIT_CN as success too.

So batman-adv should add the NET_XMIT_CN check.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
---
 v2: Correct two typos, "packe" and "ret"
 v1: Initial version

 net/batman-adv/distributed-arp-table.c | 2 +-
 net/batman-adv/fragmentation.c         | 2 +-
 net/batman-adv/routing.c               | 2 +-
 net/batman-adv/tp_meter.c              | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/batman-adv/distributed-arp-table.c b/net/batman-adv/distributed-arp-table.c
index e257efd..4bf0622 100644
--- a/net/batman-adv/distributed-arp-table.c
+++ b/net/batman-adv/distributed-arp-table.c
@@ -660,7 +660,7 @@ static bool batadv_dat_send_data(struct batadv_priv *bat_priv,
 		}
 
 		send_status = batadv_send_unicast_skb(tmp_skb, neigh_node);
-		if (send_status == NET_XMIT_SUCCESS) {
+		if (send_status == NET_XMIT_SUCCESS || send_status == NET_XMIT_CN) {
 			/* count the sent packet */
 			switch (packet_subtype) {
 			case BATADV_P_DAT_DHT_GET:
diff --git a/net/batman-adv/fragmentation.c b/net/batman-adv/fragmentation.c
index 0934730..4714b8f 100644
--- a/net/batman-adv/fragmentation.c
+++ b/net/batman-adv/fragmentation.c
@@ -495,7 +495,7 @@ int batadv_frag_send_packet(struct sk_buff *skb,
 		batadv_add_counter(bat_priv, BATADV_CNT_FRAG_TX_BYTES,
 				   skb_fragment->len + ETH_HLEN);
 		ret = batadv_send_unicast_skb(skb_fragment, neigh_node);
-		if (ret != NET_XMIT_SUCCESS) {
+		if (ret != NET_XMIT_SUCCESS && ret != NET_XMIT_CN) {
 			/* return -1 so that the caller can free the original
 			 * skb
 			 */
diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c
index 7e8dc64..f44fb07 100644
--- a/net/batman-adv/routing.c
+++ b/net/batman-adv/routing.c
@@ -706,7 +706,7 @@ static int batadv_route_unicast_packet(struct sk_buff *skb,
 		goto out;
 
 	/* translate transmit result into receive result */
-	if (res == NET_XMIT_SUCCESS) {
+	if (res == NET_XMIT_SUCCESS || res == NET_XMIT_CN) {
 		/* skb was transmitted and consumed */
 		batadv_inc_counter(bat_priv, BATADV_CNT_FORWARD);
 		batadv_add_counter(bat_priv, BATADV_CNT_FORWARD_BYTES,
diff --git a/net/batman-adv/tp_meter.c b/net/batman-adv/tp_meter.c
index 8af1611..461dbad 100644
--- a/net/batman-adv/tp_meter.c
+++ b/net/batman-adv/tp_meter.c
@@ -618,7 +618,7 @@ static int batadv_tp_send_msg(struct batadv_tp_vars *tp_vars, const u8 *src,
 	if (r == -1)
 		kfree_skb(skb);
 
-	if (r == NET_XMIT_SUCCESS)
+	if (r == NET_XMIT_SUCCESS || r == NET_XMIT_CN)
 		return 0;
 
 	return BATADV_TP_REASON_CANT_SEND;
-- 
1.9.1
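
The condition repeated across the four call sites in this patch can be
written as a single predicate. The sketch below is a userspace illustration
only: xmit_accepted() is a hypothetical name, and the NET_XMIT_* values are
redefined locally to mirror include/linux/netdevice.h so that it builds
outside the kernel. Whether NET_XMIT_CN should count as success at all is a
design question for the thread; the sketch does not settle it.

```c
#include <stdbool.h>

/* Values mirror include/linux/netdevice.h; redefined here so the
 * sketch builds in userspace.
 */
enum {
	NET_XMIT_SUCCESS = 0x00,
	NET_XMIT_DROP    = 0x01,	/* skb dropped */
	NET_XMIT_CN      = 0x02,	/* congestion notification */
};

/* Hypothetical helper: true if the skb was handed to the lower layer,
 * treating a congestion notification the same as plain success.
 */
static bool xmit_accepted(int rc)
{
	return rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN;
}
```

A caller would then read, for example, "if (xmit_accepted(res))" instead of
spelling out both comparisons, which would also have made the res/ret mix-up
from v1 harder to write.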

^ permalink raw reply related

* Re: [PATCH net 1/1] net: batman-adv: Treat NET_XMIT_CN as transmit successfully
From: Feng Gao @ 2016-11-21 11:01 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: mareklindner, sw, a, David S. Miller, b.a.t.m.a.n,
	Linux Kernel Network Developers
In-Reply-To: <850ff61e-8f94-9316-ed1f-c7d3e8faf95b@cogentembedded.com>

Hi Sergei,

On Mon, Nov 21, 2016 at 6:44 PM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> Hello.
>
> On 11/21/2016 3:39 AM, fgao@ikuai8.com wrote:
>
>> From: Gao Feng <fgao@ikuai8.com>
>>
>> The tc could return NET_XMIT_CN as one congestion notification, but
>> it does not mean the packe is lost. Other modules like ipvlan,
>
>
>    Packet.

Thanks, it was a typo.

>
>> macvlan, and others treat NET_XMIT_CN as success too.
>>
>> So batman-adv should add the NET_XMIT_CN check.
>>
>> Signed-off-by: Gao Feng <fgao@ikuai8.com>
>
>
> [...]
>
>> diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c
>> index 7e8dc64..8edd324 100644
>> --- a/net/batman-adv/routing.c
>> +++ b/net/batman-adv/routing.c
>> @@ -706,7 +706,7 @@ static int batadv_route_unicast_packet(struct sk_buff
>> *skb,
>>                 goto out;
>>
>>         /* translate transmit result into receive result */
>> -       if (res == NET_XMIT_SUCCESS) {
>> +       if (res == NET_XMIT_SUCCESS || ret == NET_XMIT_CN) {

Thanks again.
I didn't catch it during my own review and compile process.

>
>
>    Not 'res == NET_XMIT_CN'?
>
>>                 /* skb was transmitted and consumed */
>>                 batadv_inc_counter(bat_priv, BATADV_CNT_FORWARD);
>>                 batadv_add_counter(bat_priv, BATADV_CNT_FORWARD_BYTES,
>
> [...]
>
> MBR, Sergei
>

I have sent the v2 patch which corrects these two typos.

Best Regards
Feng

^ permalink raw reply

* Re: [PATCH net v2 1/1] net: batman-adv: Treat NET_XMIT_CN as transmit successfully
From: Florian Westphal @ 2016-11-21 11:09 UTC (permalink / raw)
  To: fgao; +Cc: mareklindner, sw, a, davem, b.a.t.m.a.n, netdev, gfree.wind
In-Reply-To: <1479725922-5112-1-git-send-email-fgao@ikuai8.com>

fgao@ikuai8.com <fgao@ikuai8.com> wrote:
> From: Gao Feng <fgao@ikuai8.com>
> 
> The tc could return NET_XMIT_CN as one congestion notification, but
> it does not mean the packet is lost. Other modules like ipvlan,
> macvlan, and others treat NET_XMIT_CN as success too.
> 
> So batman-adv should add the NET_XMIT_CN check.

"The tc could return NET_XMIT_CN as one congestion notification, but
it means another packet got dropped. Other modules like batman do not
treat NET_XMIT_CN as success, so modules like ipvlan, macvlan, ..
should ignore it as well."

What I am asking is:
Are you sure adding NET_XMIT_CN handling everywhere is the right way to
resolve the inconsistency?

^ permalink raw reply

* Re: [PATCH] mlxsw: switchib:  add MLXSW_PCI dependency
From: Jiri Pirko @ 2016-11-21 11:20 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jiri Pirko, Ido Schimmel, David S. Miller, Ivan Vecera, Elad Raz,
	netdev, linux-kernel
In-Reply-To: <20161118160127.473555-1-arnd@arndb.de>

Fri, Nov 18, 2016 at 05:01:14PM CET, arnd@arndb.de wrote:
>The newly added switchib driver fails to link if MLXSW_PCI=m:
>
>drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function `mlxsw_sib_module_exit':
>switchib.c:(.exit.text+0x8): undefined reference to `mlxsw_pci_driver_unregister'
>switchib.c:(.exit.text+0x10): undefined reference to `mlxsw_pci_driver_unregister'
>drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function `mlxsw_sib_module_init':
>switchib.c:(.init.text+0x28): undefined reference to `mlxsw_pci_driver_register'
>switchib.c:(.init.text+0x38): undefined reference to `mlxsw_pci_driver_register'
>switchib.c:(.init.text+0x48): undefined reference to `mlxsw_pci_driver_unregister'
>
>The other two such sub-drivers have a dependency, so add the same one
>here. In theory we could allow this driver if MLXSW_PCI is disabled,
>but it's probably not worth it.
>
>Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Thanks.

Reviewed-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* ip -s macsec show: Statistics for OutOctets... and OutPkts... switched
From: Daniel.Hopf @ 2016-11-21 11:31 UTC (permalink / raw)
  To: netdev

Dear community,

I'm using the following ip utility:

ubuntu2@ubuntu2:~$ ip -V
ip utility, iproute2-ss161009
ubuntu2@ubuntu2:~$ uname -r
4.8.0-22-generic

During tests with the recent MACsec implementation I noticed that the 
statistics for OutOctets[Protected|Encrypted] and 
OutPkts[Protected|Encrypted] are switched.

The error seems to reside in lines 637:640 of iproute2/ip/ipmacsec.c:
        [MACSEC_TXSC_STATS_ATTR_OUT_PKTS_PROTECTED] = "OutOctetsProtected",
        [MACSEC_TXSC_STATS_ATTR_OUT_PKTS_ENCRYPTED] = "OutOctetsEncrypted",
        [MACSEC_TXSC_STATS_ATTR_OUT_OCTETS_PROTECTED] = "OutPktsProtected",
        [MACSEC_TXSC_STATS_ATTR_OUT_OCTETS_ENCRYPTED] = "OutPktsEncrypted",

In my opinion this should instead read:
        [MACSEC_TXSC_STATS_ATTR_OUT_PKTS_PROTECTED] = "OutPktsProtected",
        [MACSEC_TXSC_STATS_ATTR_OUT_PKTS_ENCRYPTED] = "OutPktsEncrypted",
        [MACSEC_TXSC_STATS_ATTR_OUT_OCTETS_PROTECTED] = "OutOctetsProtected",
        [MACSEC_TXSC_STATS_ATTR_OUT_OCTETS_ENCRYPTED] = "OutOctetsEncrypted",

Regards
- Daniel
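
The proposed mapping can be checked in isolation with a small table built
from designated initializers. This is a self-contained sketch, not the
ipmacsec.c code: the enum below is a local stand-in, and the numeric values
of the real MACSEC_TXSC_STATS_ATTR_* ids are not assumed.

```c
#include <stddef.h>

/* Local stand-in ids for this sketch only; the real attribute ids come
 * from linux/if_macsec.h and their numeric values are not assumed here.
 */
enum txsc_stat {
	TXSC_OUT_PKTS_PROTECTED,
	TXSC_OUT_PKTS_ENCRYPTED,
	TXSC_OUT_OCTETS_PROTECTED,
	TXSC_OUT_OCTETS_ENCRYPTED,
	TXSC_STAT_MAX,
};

/* Each label is keyed by the id it describes, so a packet counter can
 * never be printed under an octet label.
 */
static const char *const txsc_stat_names[TXSC_STAT_MAX] = {
	[TXSC_OUT_PKTS_PROTECTED]   = "OutPktsProtected",
	[TXSC_OUT_PKTS_ENCRYPTED]   = "OutPktsEncrypted",
	[TXSC_OUT_OCTETS_PROTECTED] = "OutOctetsProtected",
	[TXSC_OUT_OCTETS_ENCRYPTED] = "OutOctetsEncrypted",
};

/* Return the label for id, or NULL if id is out of range. */
static const char *txsc_stat_name(unsigned int id)
{
	return id < TXSC_STAT_MAX ? txsc_stat_names[id] : NULL;
}
```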

^ permalink raw reply

* Re: [net-next PATCH v2 4/5] virtio_net: add dedicated XDP transmit queues
From: Daniel Borkmann @ 2016-11-21 11:45 UTC (permalink / raw)
  To: John Fastabend, eric.dumazet, mst, kubakici, shm, davem,
	alexei.starovoitov
  Cc: netdev, bblanco, john.r.fastabend, brouer, tgraf
In-Reply-To: <20161120025104.19187.54400.stgit@john-Precision-Tower-5810>

On 11/20/2016 03:51 AM, John Fastabend wrote:
> XDP requires using isolated transmit queues to avoid interference
> with normal networking stack (BQL, NETDEV_TX_BUSY, etc). This patch
> adds a XDP queue per cpu when a XDP program is loaded and does not
> expose the queues to the OS via the normal API call to
> netif_set_real_num_tx_queues(). This way the stack will never push
> an skb to these queues.
>
> However virtio/vhost/qemu implementation only allows for creating
> TX/RX queue pairs at this time so creating only TX queues was not
> possible. And because the associated RX queues are being created I
> went ahead and exposed these to the stack and let the backend use
> them. This creates more RX queues visible to the network stack than
> TX queues which is worth mentioning but does not cause any issues as
> far as I can tell.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>   drivers/net/virtio_net.c |   32 +++++++++++++++++++++++++++++---
>   1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 8f99a53..80a426c 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -114,6 +114,9 @@ struct virtnet_info {
>   	/* # of queue pairs currently used by the driver */
>   	u16 curr_queue_pairs;
>
> +	/* # of XDP queue pairs currently used by the driver */
> +	u16 xdp_queue_pairs;
> +
>   	/* I like... big packets and I cannot lie! */
>   	bool big_packets;
>
> @@ -1525,7 +1528,8 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>   {
>   	struct virtnet_info *vi = netdev_priv(dev);
>   	struct bpf_prog *old_prog;
> -	int i;
> +	u16 xdp_qp = 0, curr_qp;
> +	int err, i;
>
>   	if ((dev->features & NETIF_F_LRO) && prog) {
>   		netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
> @@ -1542,12 +1546,34 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>   		return -EINVAL;
>   	}
>
> +	curr_qp = vi->curr_queue_pairs - vi->xdp_queue_pairs;
> +	if (prog)
> +		xdp_qp = nr_cpu_ids;
> +
> +	/* XDP requires extra queues for XDP_TX */
> +	if (curr_qp + xdp_qp > vi->max_queue_pairs) {
> +		netdev_warn(dev, "request %i queues but max is %i\n",
> +			    curr_qp + xdp_qp, vi->max_queue_pairs);
> +		return -ENOMEM;
> +	}
> +
> +	err = virtnet_set_queues(vi, curr_qp + xdp_qp);
> +	if (err) {
> +		dev_warn(&dev->dev, "XDP Device queue allocation failure.\n");
> +		return err;
> +	}
> +
>   	if (prog) {
> -		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
> -		if (IS_ERR(prog))
> +		prog = bpf_prog_add(prog, vi->max_queue_pairs);

I think this change is not correct, it would be off by one now.
The previous 'vi->max_queue_pairs - 1' was actually correct here.
dev_change_xdp_fd() already gives you a reference (see the doc on
enum xdp_netdev_command in netdevice.h).

> +		if (IS_ERR(prog)) {
> +			virtnet_set_queues(vi, curr_qp);
>   			return PTR_ERR(prog);
> +		}
>   	}
>
> +	vi->xdp_queue_pairs = xdp_qp;
> +	netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
> +
>   	for (i = 0; i < vi->max_queue_pairs; i++) {
>   		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
>   		rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
>
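
The reference counting point raised in this reply can be modelled with a
toy counter. The sketch below is a simplified userspace model, not kernel
code: struct prog, prog_add() and attach_to_queues() are hypothetical names,
and prog_add() only imitates the effect of taking n extra references.

```c
/* Toy model of a refcounted program object. */
struct prog {
	int refcnt;
};

/* Models taking n additional references on the program. */
static void prog_add(struct prog *p, int n)
{
	p->refcnt += n;
}

/* Attach one program to nqueues receive queues. The caller is assumed
 * to already hold one reference, so only nqueues - 1 more are needed.
 * Returns the resulting reference count, or -1 on invalid input.
 */
static int attach_to_queues(struct prog *p, int nqueues)
{
	if (nqueues < 1)
		return -1;

	prog_add(p, nqueues - 1);
	return p->refcnt;
}
```

Starting from a count of 1, attaching to 4 queues ends at 4 references, one
per queue. Adding nqueues instead of nqueues - 1 would end at 5 and leave
one reference that is never dropped.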

^ permalink raw reply

* [PATCH net-next 0/2] bridge: add support for IGMPv3 and MLDv2 querier
From: Nikolay Aleksandrov @ 2016-11-21 12:03 UTC (permalink / raw)
  To: netdev; +Cc: roopa, sashok, stephen, davem, liuhangbin, Nikolay Aleksandrov

Hi all,
This patch-set adds support for IGMPv3 and MLDv2 querier in the bridge.
Two new options which can be toggled via netlink and sysfs are added that
control the version per-bridge:
 multicast_igmp_version - default 2, can be set to 3
 multicast_mld_version - default 1, can be set to 2 (this option is
                         disabled if CONFIG_IPV6=n)

Note that the names do not include "querier". I think these options
can be reused later as more IGMPv3 support is added to the bridge, so we
can avoid adding more options to switch between v2 and v3 behaviour.

The set uses the already existing br_ip{4,6}_multicast_alloc_query
functions and adds the appropriate header based on the chosen version.

For the initial support I have removed the compatibility implementation
(RFC3376 sec 7.3.1, 7.3.2; RFC3810 sec 8.3.1, 8.3.2), because there are
some details that we need to sort out.

Thank you,
 Nik


Nikolay Aleksandrov (2):
  bridge: mcast: add IGMPv3 query support
  bridge: mcast: add MLDv2 querier support

 include/uapi/linux/if_link.h |   2 +
 net/bridge/br_multicast.c    | 166 +++++++++++++++++++++++++++++++++----------
 net/bridge/br_netlink.c      |  34 ++++++++-
 net/bridge/br_private.h      |   7 ++
 net/bridge/br_sysfs_br.c     |  40 +++++++++++
 5 files changed, 210 insertions(+), 39 deletions(-)

-- 
2.1.4
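
How the query header size follows from the chosen version can be sketched
as below. This is a userspace illustration under stated assumptions: the two
structs are local mirrors of the IGMPv2 header and of an IGMPv3 query with
no source addresses, the flags byte stands in for the resv/suppress/qrv
bitfields, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the 8-byte IGMPv2 header. */
struct igmp_v2_hdr {
	uint8_t  type;
	uint8_t  code;
	uint16_t csum;
	uint32_t group;
};

/* Local mirror of a 12-byte IGMPv3 query carrying no sources. */
struct igmp_v3_query {
	uint8_t  type;
	uint8_t  code;
	uint16_t csum;
	uint32_t group;
	uint8_t  flags;		/* resv, suppress and qrv bits */
	uint8_t  qqic;
	uint16_t nsrcs;
};

/* Header size for the configured version; anything but 3 means v2. */
static size_t igmp_query_hdr_size(int version)
{
	return version == 3 ? sizeof(struct igmp_v3_query)
			    : sizeof(struct igmp_v2_hdr);
}

/* IPv4 total length: 20-byte base header, 4-byte router alert option
 * (ihl = 6), then the IGMP header.
 */
static size_t igmp_query_ip_tot_len(int version)
{
	return 20 + 4 + igmp_query_hdr_size(version);
}
```

With these definitions a v2 query occupies 32 bytes at the IP layer and a
v3 query without sources occupies 36.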

^ permalink raw reply

* [PATCH net-next 1/2] bridge: mcast: add IGMPv3 query support
From: Nikolay Aleksandrov @ 2016-11-21 12:03 UTC (permalink / raw)
  To: netdev; +Cc: roopa, sashok, stephen, davem, liuhangbin, Nikolay Aleksandrov
In-Reply-To: <1479729805-23108-1-git-send-email-nikolay@cumulusnetworks.com>

This patch adds basic support for IGMPv3 queries; the default is IGMPv2
as before. A new multicast option, multicast_igmp_version, adds the
ability to change it between 2 and 3 via netlink and sysfs. The option
struct member is in a 4 byte hole in net_bridge.

There are also a few minor style adjustments in br_multicast_new_group and
br_multicast_add_group.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
 include/uapi/linux/if_link.h |  1 +
 net/bridge/br_multicast.c    | 79 ++++++++++++++++++++++++++++++++++----------
 net/bridge/br_netlink.c      | 15 ++++++++-
 net/bridge/br_private.h      |  3 ++
 net/bridge/br_sysfs_br.c     | 18 ++++++++++
 5 files changed, 98 insertions(+), 18 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index b4fba662cd32..325d2601150d 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -275,6 +275,7 @@ enum {
 	IFLA_BR_PAD,
 	IFLA_BR_VLAN_STATS_ENABLED,
 	IFLA_BR_MCAST_STATS_ENABLED,
+	IFLA_BR_MCAST_IGMP_VERSION,
 	__IFLA_BR_MAX,
 };
 
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 073d54afa056..66192c11aa45 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -365,13 +365,18 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
 						    __be32 group,
 						    u8 *igmp_type)
 {
+	struct igmpv3_query *ihv3;
+	size_t igmp_hdr_size;
 	struct sk_buff *skb;
 	struct igmphdr *ih;
 	struct ethhdr *eth;
 	struct iphdr *iph;
 
+	igmp_hdr_size = sizeof(*ih);
+	if (br->multicast_igmp_version == 3)
+		igmp_hdr_size = sizeof(*ihv3);
 	skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*iph) +
-						 sizeof(*ih) + 4);
+						 igmp_hdr_size + 4);
 	if (!skb)
 		goto out;
 
@@ -396,7 +401,7 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
 	iph->version = 4;
 	iph->ihl = 6;
 	iph->tos = 0xc0;
-	iph->tot_len = htons(sizeof(*iph) + sizeof(*ih) + 4);
+	iph->tot_len = htons(sizeof(*iph) + igmp_hdr_size + 4);
 	iph->id = 0;
 	iph->frag_off = htons(IP_DF);
 	iph->ttl = 1;
@@ -412,17 +417,37 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
 	skb_put(skb, 24);
 
 	skb_set_transport_header(skb, skb->len);
-	ih = igmp_hdr(skb);
 	*igmp_type = IGMP_HOST_MEMBERSHIP_QUERY;
-	ih->type = IGMP_HOST_MEMBERSHIP_QUERY;
-	ih->code = (group ? br->multicast_last_member_interval :
-			    br->multicast_query_response_interval) /
-		   (HZ / IGMP_TIMER_SCALE);
-	ih->group = group;
-	ih->csum = 0;
-	ih->csum = ip_compute_csum((void *)ih, sizeof(struct igmphdr));
-	skb_put(skb, sizeof(*ih));
 
+	switch (br->multicast_igmp_version) {
+	case 2:
+		ih = igmp_hdr(skb);
+		ih->type = IGMP_HOST_MEMBERSHIP_QUERY;
+		ih->code = (group ? br->multicast_last_member_interval :
+				    br->multicast_query_response_interval) /
+			   (HZ / IGMP_TIMER_SCALE);
+		ih->group = group;
+		ih->csum = 0;
+		ih->csum = ip_compute_csum((void *)ih, sizeof(*ih));
+		break;
+	case 3:
+		ihv3 = igmpv3_query_hdr(skb);
+		ihv3->type = IGMP_HOST_MEMBERSHIP_QUERY;
+		ihv3->code = (group ? br->multicast_last_member_interval :
+				      br->multicast_query_response_interval) /
+			     (HZ / IGMP_TIMER_SCALE);
+		ihv3->group = group;
+		ihv3->qqic = br->multicast_query_interval / HZ;
+		ihv3->nsrcs = 0;
+		ihv3->resv = 0;
+		ihv3->suppress = 0;
+		ihv3->qrv = 2;
+		ihv3->csum = 0;
+		ihv3->csum = ip_compute_csum((void *)ihv3, sizeof(*ihv3));
+		break;
+	}
+
+	skb_put(skb, igmp_hdr_size);
 	__skb_pull(skb, sizeof(*eth));
 
 out:
@@ -608,7 +633,8 @@ static struct net_bridge_mdb_entry *br_multicast_get_group(
 }
 
 struct net_bridge_mdb_entry *br_multicast_new_group(struct net_bridge *br,
-	struct net_bridge_port *port, struct br_ip *group)
+						    struct net_bridge_port *p,
+						    struct br_ip *group)
 {
 	struct net_bridge_mdb_htable *mdb;
 	struct net_bridge_mdb_entry *mp;
@@ -624,7 +650,7 @@ struct net_bridge_mdb_entry *br_multicast_new_group(struct net_bridge *br,
 	}
 
 	hash = br_ip_hash(mdb, group);
-	mp = br_multicast_get_group(br, port, group, hash);
+	mp = br_multicast_get_group(br, p, group, hash);
 	switch (PTR_ERR(mp)) {
 	case 0:
 		break;
@@ -681,9 +707,9 @@ static int br_multicast_add_group(struct net_bridge *br,
 				  struct net_bridge_port *port,
 				  struct br_ip *group)
 {
-	struct net_bridge_mdb_entry *mp;
-	struct net_bridge_port_group *p;
 	struct net_bridge_port_group __rcu **pp;
+	struct net_bridge_port_group *p;
+	struct net_bridge_mdb_entry *mp;
 	unsigned long now = jiffies;
 	int err;
 
@@ -861,9 +887,9 @@ static void br_multicast_send_query(struct net_bridge *br,
 				    struct net_bridge_port *port,
 				    struct bridge_mcast_own_query *own_query)
 {
-	unsigned long time;
-	struct br_ip br_group;
 	struct bridge_mcast_other_query *other_query = NULL;
+	struct br_ip br_group;
+	unsigned long time;
 
 	if (!netif_running(br->dev) || br->multicast_disabled ||
 	    !br->multicast_querier)
@@ -1816,6 +1842,7 @@ void br_multicast_init(struct net_bridge *br)
 	br->hash_elasticity = 4;
 	br->hash_max = 512;
 
+	br->multicast_igmp_version = 2;
 	br->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
 	br->multicast_querier = 0;
 	br->multicast_query_use_ifaddr = 0;
@@ -2132,6 +2159,24 @@ int br_multicast_set_hash_max(struct net_bridge *br, unsigned long val)
 	return err;
 }
 
+int br_multicast_set_igmp_version(struct net_bridge *br, unsigned long val)
+{
+	/* Currently we support only version 2 and 3 */
+	switch (val) {
+	case 2:
+	case 3:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	spin_lock_bh(&br->multicast_lock);
+	br->multicast_igmp_version = val;
+	spin_unlock_bh(&br->multicast_lock);
+
+	return 0;
+}
+
 /**
  * br_multicast_list_adjacent - Returns snooped multicast addresses
  * @dev:	The bridge port adjacent to which to retrieve addresses
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index e99037c6f7b7..10b9b80f778f 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -858,6 +858,7 @@ static const struct nla_policy br_policy[IFLA_BR_MAX + 1] = {
 	[IFLA_BR_VLAN_DEFAULT_PVID] = { .type = NLA_U16 },
 	[IFLA_BR_VLAN_STATS_ENABLED] = { .type = NLA_U8 },
 	[IFLA_BR_MCAST_STATS_ENABLED] = { .type = NLA_U8 },
+	[IFLA_BR_MCAST_IGMP_VERSION] = { .type = NLA_U8 },
 };
 
 static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
@@ -1069,6 +1070,15 @@ static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
 		mcast_stats = nla_get_u8(data[IFLA_BR_MCAST_STATS_ENABLED]);
 		br->multicast_stats_enabled = !!mcast_stats;
 	}
+
+	if (data[IFLA_BR_MCAST_IGMP_VERSION]) {
+		__u8 igmp_version;
+
+		igmp_version = nla_get_u8(data[IFLA_BR_MCAST_IGMP_VERSION]);
+		err = br_multicast_set_igmp_version(br, igmp_version);
+		if (err)
+			return err;
+	}
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	if (data[IFLA_BR_NF_CALL_IPTABLES]) {
@@ -1135,6 +1145,7 @@ static size_t br_get_size(const struct net_device *brdev)
 	       nla_total_size_64bit(sizeof(u64)) + /* IFLA_BR_MCAST_QUERY_INTVL */
 	       nla_total_size_64bit(sizeof(u64)) + /* IFLA_BR_MCAST_QUERY_RESPONSE_INTVL */
 	       nla_total_size_64bit(sizeof(u64)) + /* IFLA_BR_MCAST_STARTUP_QUERY_INTVL */
+	       nla_total_size(sizeof(u8)) +	/* IFLA_BR_MCAST_IGMP_VERSION */
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	       nla_total_size(sizeof(u8)) +     /* IFLA_BR_NF_CALL_IPTABLES */
@@ -1210,7 +1221,9 @@ static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev)
 	    nla_put_u32(skb, IFLA_BR_MCAST_LAST_MEMBER_CNT,
 			br->multicast_last_member_count) ||
 	    nla_put_u32(skb, IFLA_BR_MCAST_STARTUP_QUERY_CNT,
-			br->multicast_startup_query_count))
+			br->multicast_startup_query_count) ||
+	    nla_put_u8(skb, IFLA_BR_MCAST_IGMP_VERSION,
+		       br->multicast_igmp_version))
 		return -EMSGSIZE;
 
 	clockval = jiffies_to_clock_t(br->multicast_last_member_interval);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 1b63177e0ccd..3d207d92d899 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -333,6 +333,8 @@ struct net_bridge
 	u32				multicast_last_member_count;
 	u32				multicast_startup_query_count;
 
+	u8				multicast_igmp_version;
+
 	unsigned long			multicast_last_member_interval;
 	unsigned long			multicast_membership_interval;
 	unsigned long			multicast_querier_interval;
@@ -582,6 +584,7 @@ int br_multicast_set_port_router(struct net_bridge_port *p, unsigned long val);
 int br_multicast_toggle(struct net_bridge *br, unsigned long val);
 int br_multicast_set_querier(struct net_bridge *br, unsigned long val);
 int br_multicast_set_hash_max(struct net_bridge *br, unsigned long val);
+int br_multicast_set_igmp_version(struct net_bridge *br, unsigned long val);
 struct net_bridge_mdb_entry *
 br_mdb_ip_get(struct net_bridge_mdb_htable *mdb, struct br_ip *dst);
 struct net_bridge_mdb_entry *
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index e120307c6e36..f00d1690658c 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -440,6 +440,23 @@ static ssize_t hash_max_store(struct device *d, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RW(hash_max);
 
+static ssize_t multicast_igmp_version_show(struct device *d,
+					   struct device_attribute *attr,
+					   char *buf)
+{
+	struct net_bridge *br = to_bridge(d);
+
+	return sprintf(buf, "%u\n", br->multicast_igmp_version);
+}
+
+static ssize_t multicast_igmp_version_store(struct device *d,
+					    struct device_attribute *attr,
+					    const char *buf, size_t len)
+{
+	return store_bridge_parm(d, buf, len, br_multicast_set_igmp_version);
+}
+static DEVICE_ATTR_RW(multicast_igmp_version);
+
 static ssize_t multicast_last_member_count_show(struct device *d,
 						struct device_attribute *attr,
 						char *buf)
@@ -809,6 +826,7 @@ static struct attribute *bridge_attrs[] = {
 	&dev_attr_multicast_query_response_interval.attr,
 	&dev_attr_multicast_startup_query_interval.attr,
 	&dev_attr_multicast_stats_enabled.attr,
+	&dev_attr_multicast_igmp_version.attr,
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	&dev_attr_nf_call_iptables.attr,
-- 
2.1.4


* [PATCH net-next 2/2] bridge: mcast: add MLDv2 querier support
From: Nikolay Aleksandrov @ 2016-11-21 12:03 UTC (permalink / raw)
  To: netdev; +Cc: roopa, sashok, stephen, davem, liuhangbin, Nikolay Aleksandrov
In-Reply-To: <1479729805-23108-1-git-send-email-nikolay@cumulusnetworks.com>

This patch adds basic support for MLDv2 queries; the default remains MLDv1
as before. A new multicast option, multicast_mld_version, allows switching
between versions 1 and 2 via netlink and sysfs.
The MLD option is unavailable when CONFIG_IPV6 is disabled.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
 include/uapi/linux/if_link.h |  1 +
 net/bridge/br_multicast.c    | 89 +++++++++++++++++++++++++++++++++-----------
 net/bridge/br_netlink.c      | 19 +++++++++-
 net/bridge/br_private.h      |  4 ++
 net/bridge/br_sysfs_br.c     | 22 +++++++++++
 5 files changed, 113 insertions(+), 22 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 325d2601150d..92b2d4928bf1 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -276,6 +276,7 @@ enum {
 	IFLA_BR_VLAN_STATS_ENABLED,
 	IFLA_BR_MCAST_STATS_ENABLED,
 	IFLA_BR_MCAST_IGMP_VERSION,
+	IFLA_BR_MCAST_MLD_VERSION,
 	__IFLA_BR_MAX,
 };
 
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 66192c11aa45..b30e77e8427c 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -459,15 +459,20 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
 						    const struct in6_addr *grp,
 						    u8 *igmp_type)
 {
-	struct sk_buff *skb;
+	struct mld2_query *mld2q;
+	unsigned long interval;
 	struct ipv6hdr *ip6h;
 	struct mld_msg *mldq;
+	size_t mld_hdr_size;
+	struct sk_buff *skb;
 	struct ethhdr *eth;
 	u8 *hopopt;
-	unsigned long interval;
 
+	mld_hdr_size = sizeof(*mldq);
+	if (br->multicast_mld_version == 2)
+		mld_hdr_size = sizeof(*mld2q);
 	skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*ip6h) +
-						 8 + sizeof(*mldq));
+						 8 + mld_hdr_size);
 	if (!skb)
 		goto out;
 
@@ -486,7 +491,7 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
 	ip6h = ipv6_hdr(skb);
 
 	*(__force __be32 *)ip6h = htonl(0x60000000);
-	ip6h->payload_len = htons(8 + sizeof(*mldq));
+	ip6h->payload_len = htons(8 + mld_hdr_size);
 	ip6h->nexthdr = IPPROTO_HOPOPTS;
 	ip6h->hop_limit = 1;
 	ipv6_addr_set(&ip6h->daddr, htonl(0xff020000), 0, 0, htonl(1));
@@ -514,26 +519,47 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
 
 	/* ICMPv6 */
 	skb_set_transport_header(skb, skb->len);
-	mldq = (struct mld_msg *) icmp6_hdr(skb);
-
 	interval = ipv6_addr_any(grp) ?
 			br->multicast_query_response_interval :
 			br->multicast_last_member_interval;
-
 	*igmp_type = ICMPV6_MGM_QUERY;
-	mldq->mld_type = ICMPV6_MGM_QUERY;
-	mldq->mld_code = 0;
-	mldq->mld_cksum = 0;
-	mldq->mld_maxdelay = htons((u16)jiffies_to_msecs(interval));
-	mldq->mld_reserved = 0;
-	mldq->mld_mca = *grp;
-
-	/* checksum */
-	mldq->mld_cksum = csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
-					  sizeof(*mldq), IPPROTO_ICMPV6,
-					  csum_partial(mldq,
-						       sizeof(*mldq), 0));
-	skb_put(skb, sizeof(*mldq));
+	switch (br->multicast_mld_version) {
+	case 1:
+		mldq = (struct mld_msg *)icmp6_hdr(skb);
+		mldq->mld_type = ICMPV6_MGM_QUERY;
+		mldq->mld_code = 0;
+		mldq->mld_cksum = 0;
+		mldq->mld_maxdelay = htons((u16)jiffies_to_msecs(interval));
+		mldq->mld_reserved = 0;
+		mldq->mld_mca = *grp;
+		mldq->mld_cksum = csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
+						  sizeof(*mldq), IPPROTO_ICMPV6,
+						  csum_partial(mldq,
+							       sizeof(*mldq),
+							       0));
+		break;
+	case 2:
+		mld2q = (struct mld2_query *)icmp6_hdr(skb);
+		mld2q->mld2q_mrc = htons((u16)jiffies_to_msecs(interval));
+		mld2q->mld2q_type = ICMPV6_MGM_QUERY;
+		mld2q->mld2q_code = 0;
+		mld2q->mld2q_cksum = 0;
+		mld2q->mld2q_resv1 = 0;
+		mld2q->mld2q_resv2 = 0;
+		mld2q->mld2q_suppress = 0;
+		mld2q->mld2q_qrv = 2;
+		mld2q->mld2q_nsrcs = 0;
+		mld2q->mld2q_qqic = br->multicast_query_interval / HZ;
+		mld2q->mld2q_mca = *grp;
+		mld2q->mld2q_cksum = csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
+						     sizeof(*mld2q),
+						     IPPROTO_ICMPV6,
+						     csum_partial(mld2q,
+								  sizeof(*mld2q),
+								  0));
+		break;
+	}
+	skb_put(skb, mld_hdr_size);
 
 	__skb_pull(skb, sizeof(*eth));
 
@@ -1842,7 +1868,6 @@ void br_multicast_init(struct net_bridge *br)
 	br->hash_elasticity = 4;
 	br->hash_max = 512;
 
-	br->multicast_igmp_version = 2;
 	br->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
 	br->multicast_querier = 0;
 	br->multicast_query_use_ifaddr = 0;
@@ -1858,7 +1883,9 @@ void br_multicast_init(struct net_bridge *br)
 
 	br->ip4_other_query.delay_time = 0;
 	br->ip4_querier.port = NULL;
+	br->multicast_igmp_version = 2;
 #if IS_ENABLED(CONFIG_IPV6)
+	br->multicast_mld_version = 1;
 	br->ip6_other_query.delay_time = 0;
 	br->ip6_querier.port = NULL;
 #endif
@@ -2177,6 +2204,26 @@ int br_multicast_set_igmp_version(struct net_bridge *br, unsigned long val)
 	return 0;
 }
 
+#if IS_ENABLED(CONFIG_IPV6)
+int br_multicast_set_mld_version(struct net_bridge *br, unsigned long val)
+{
+	/* Currently we support version 1 and 2 */
+	switch (val) {
+	case 1:
+	case 2:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	spin_lock_bh(&br->multicast_lock);
+	br->multicast_mld_version = val;
+	spin_unlock_bh(&br->multicast_lock);
+
+	return 0;
+}
+#endif
+
 /**
  * br_multicast_list_adjacent - Returns snooped multicast addresses
  * @dev:	The bridge port adjacent to which to retrieve addresses
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 10b9b80f778f..71c7453268c1 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -859,6 +859,7 @@ static const struct nla_policy br_policy[IFLA_BR_MAX + 1] = {
 	[IFLA_BR_VLAN_STATS_ENABLED] = { .type = NLA_U8 },
 	[IFLA_BR_MCAST_STATS_ENABLED] = { .type = NLA_U8 },
 	[IFLA_BR_MCAST_IGMP_VERSION] = { .type = NLA_U8 },
+	[IFLA_BR_MCAST_MLD_VERSION] = { .type = NLA_U8 },
 };
 
 static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
@@ -1079,6 +1080,17 @@ static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
 		if (err)
 			return err;
 	}
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (data[IFLA_BR_MCAST_MLD_VERSION]) {
+		__u8 mld_version;
+
+		mld_version = nla_get_u8(data[IFLA_BR_MCAST_MLD_VERSION]);
+		err = br_multicast_set_mld_version(br, mld_version);
+		if (err)
+			return err;
+	}
+#endif
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	if (data[IFLA_BR_NF_CALL_IPTABLES]) {
@@ -1146,6 +1158,7 @@ static size_t br_get_size(const struct net_device *brdev)
 	       nla_total_size_64bit(sizeof(u64)) + /* IFLA_BR_MCAST_QUERY_RESPONSE_INTVL */
 	       nla_total_size_64bit(sizeof(u64)) + /* IFLA_BR_MCAST_STARTUP_QUERY_INTVL */
 	       nla_total_size(sizeof(u8)) +	/* IFLA_BR_MCAST_IGMP_VERSION */
+	       nla_total_size(sizeof(u8)) +	/* IFLA_BR_MCAST_MLD_VERSION */
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	       nla_total_size(sizeof(u8)) +     /* IFLA_BR_NF_CALL_IPTABLES */
@@ -1225,7 +1238,11 @@ static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev)
 	    nla_put_u8(skb, IFLA_BR_MCAST_IGMP_VERSION,
 		       br->multicast_igmp_version))
 		return -EMSGSIZE;
-
+#if IS_ENABLED(CONFIG_IPV6)
+	if (nla_put_u8(skb, IFLA_BR_MCAST_MLD_VERSION,
+		       br->multicast_mld_version))
+		return -EMSGSIZE;
+#endif
 	clockval = jiffies_to_clock_t(br->multicast_last_member_interval);
 	if (nla_put_u64_64bit(skb, IFLA_BR_MCAST_LAST_MEMBER_INTVL, clockval,
 			      IFLA_BR_PAD))
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 3d207d92d899..26aec2366bc3 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -355,6 +355,7 @@ struct net_bridge
 	struct bridge_mcast_other_query	ip6_other_query;
 	struct bridge_mcast_own_query	ip6_own_query;
 	struct bridge_mcast_querier	ip6_querier;
+	u8				multicast_mld_version;
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 #endif
 
@@ -585,6 +586,9 @@ int br_multicast_toggle(struct net_bridge *br, unsigned long val);
 int br_multicast_set_querier(struct net_bridge *br, unsigned long val);
 int br_multicast_set_hash_max(struct net_bridge *br, unsigned long val);
 int br_multicast_set_igmp_version(struct net_bridge *br, unsigned long val);
+#if IS_ENABLED(CONFIG_IPV6)
+int br_multicast_set_mld_version(struct net_bridge *br, unsigned long val);
+#endif
 struct net_bridge_mdb_entry *
 br_mdb_ip_get(struct net_bridge_mdb_htable *mdb, struct br_ip *dst);
 struct net_bridge_mdb_entry *
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index f00d1690658c..c9d2e0abfb89 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -659,6 +659,25 @@ static ssize_t multicast_stats_enabled_store(struct device *d,
 	return store_bridge_parm(d, buf, len, set_stats_enabled);
 }
 static DEVICE_ATTR_RW(multicast_stats_enabled);
+
+#if IS_ENABLED(CONFIG_IPV6)
+static ssize_t multicast_mld_version_show(struct device *d,
+					  struct device_attribute *attr,
+					  char *buf)
+{
+	struct net_bridge *br = to_bridge(d);
+
+	return sprintf(buf, "%u\n", br->multicast_mld_version);
+}
+
+static ssize_t multicast_mld_version_store(struct device *d,
+					   struct device_attribute *attr,
+					   const char *buf, size_t len)
+{
+	return store_bridge_parm(d, buf, len, br_multicast_set_mld_version);
+}
+static DEVICE_ATTR_RW(multicast_mld_version);
+#endif
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 static ssize_t nf_call_iptables_show(
@@ -827,6 +846,9 @@ static struct attribute *bridge_attrs[] = {
 	&dev_attr_multicast_startup_query_interval.attr,
 	&dev_attr_multicast_stats_enabled.attr,
 	&dev_attr_multicast_igmp_version.attr,
+#if IS_ENABLED(CONFIG_IPV6)
+	&dev_attr_multicast_mld_version.attr,
+#endif
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	&dev_attr_nf_call_iptables.attr,
-- 
2.1.4


