Netdev List

* [PATCH bpf-next v3 01/11] bpf: adding bpf_xdp_adjust_tail helper
From: Nikita V. Shirokov @ 2018-04-18  4:42 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418044223.17685-1-tehnerd@tehnerd.com>

Add a new bpf helper that allows us to manipulate xdp's data_end
pointer and thereby reduce the packet's size. The intended use case
is to generate ICMP messages from XDP context, where such a message
contains the truncated original packet.

Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
---
 include/uapi/linux/bpf.h | 10 +++++++++-
 net/core/filter.c        | 29 ++++++++++++++++++++++++++++-
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec89732a8d..9a2d1a04eb24 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -755,6 +755,13 @@ union bpf_attr {
  *     @addr: pointer to struct sockaddr to bind socket to
  *     @addr_len: length of sockaddr structure
  *     Return: 0 on success or negative error code
+ *
+ * int bpf_xdp_adjust_tail(xdp_md, delta)
+ *     Adjust the xdp_md.data_end by delta. Only shrinking of packet's
+ *     size is supported.
+ *     @xdp_md: pointer to xdp_md
+ *     @delta: A negative integer to be added to xdp_md.data_end
+ *     Return: 0 on success or negative on error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -821,7 +828,8 @@ union bpf_attr {
 	FN(msg_apply_bytes),		\
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
-	FN(bind),
+	FN(bind),			\
+	FN(xdp_adjust_tail),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index a374b8560bc4..29318598fd60 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2725,6 +2725,30 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
+{
+	void *data_end = xdp->data_end + offset;
+
+	/* only shrinking is allowed for now. */
+	if (unlikely(offset >= 0))
+		return -EINVAL;
+
+	if (unlikely(data_end < xdp->data + ETH_HLEN))
+		return -EINVAL;
+
+	xdp->data_end = data_end;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = {
+	.func		= bpf_xdp_adjust_tail,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 {
 	void *meta = xdp->data_meta + offset;
@@ -3074,7 +3098,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_l4_csum_replace ||
 	    func == bpf_xdp_adjust_head ||
 	    func == bpf_xdp_adjust_meta ||
-	    func == bpf_msg_pull_data)
+	    func == bpf_msg_pull_data ||
+	    func == bpf_xdp_adjust_tail)
 		return true;
 
 	return false;
@@ -3888,6 +3913,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_xdp_adjust_tail:
+		return &bpf_xdp_adjust_tail_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
-- 
2.15.1
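
The bounds checks in the helper above can be tried outside the kernel with
a small userspace model. This is only a sketch under stated assumptions:
struct model_xdp_buff and model_adjust_tail() are hypothetical names that do
not exist in the kernel, only the two pointers the helper touches are
modelled, and the comparison is done on lengths rather than on raw pointers.

```c
#include <errno.h>
#include <stddef.h>

#define ETH_HLEN 14	/* Ethernet header length, same value as in the kernel */

/* Hypothetical stand-in for struct xdp_buff (assumption, not kernel code). */
struct model_xdp_buff {
	unsigned char *data;
	unsigned char *data_end;
};

/*
 * Model of the checks in the patch: a non-negative offset is rejected
 * (shrinking only), and the packet may not shrink below ETH_HLEN bytes.
 */
static int model_adjust_tail(struct model_xdp_buff *xdp, int offset)
{
	ptrdiff_t len = xdp->data_end - xdp->data;

	if (offset >= 0)
		return -EINVAL;

	if (len + offset < ETH_HLEN)
		return -EINVAL;

	xdp->data_end += offset;

	return 0;
}
```

With a 64-byte buffer, an offset of -50 leaves exactly ETH_HLEN bytes and is
accepted, while -51, 0 and any positive value are rejected and leave
data_end untouched.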

* [PATCH bpf-next v3 00/11] introduction of bpf_xdp_adjust_tail
From: Nikita V. Shirokov @ 2018-04-18  4:42 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev, Nikita V. Shirokov

In this patch series I'm adding a new bpf helper which allows manipulating
xdp's data_end pointer. Right now only "shrinking" (reducing the packet's
size by moving the pointer) is supported (and I see no use case for
"growing"). The main use case for such a helper is to be able to generate
control (ICMP) messages from XDP context. Such messages usually contain the
first N bytes of the original packet as payload, and this is exactly what
this helper allows us to do (see patch 3 for a sample program, where we
generate an ICMP "packet too big" message). This helper could be useful for
load balancing applications where, after additional encapsulation, the
resulting packet could be bigger than the interface MTU.
Aside from the new helper, this patch series contains minor changes in
device drivers (the ones which require them), so that they recalculate the
packet's length not only when the head pointer was adjusted, but also when
the tail pointer was.
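
As a rough illustration of the "first N bytes" use case, the delta handed to
the helper could be computed as below. This is a sketch only: shrink_delta()
is a hypothetical name, not something taken from the series, and the sizes
are assumed to be ordinary packet lengths that fit in an int.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the (negative) delta that trims a packet of
 * pkt_len bytes down to keep_len bytes. Returns 0 when the packet already
 * fits; the caller must then skip the tail adjustment, because a zero
 * delta is rejected by the helper.
 */
static int shrink_delta(size_t pkt_len, size_t keep_len)
{
	if (pkt_len <= keep_len)
		return 0;

	return -(int)(pkt_len - keep_len);
}
```

For example, keeping 98 bytes of a 1500-byte packet gives a delta of -1402.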

v2->v3:
 * added the Signed-off-by lines that were missing in v2

v1->v2:
 * fixed kbuild warning
 * made offset == 0 invalid for bpf_xdp_adjust_tail
 * split the bpf_prog_test_run fix and the selftests into separate commits
 * added SPDX license identifiers where applicable
 * reshuffled the patch order (tests are now at the end)

Nikita V. Shirokov (11):
  bpf: making bpf_prog_test run aware of possible data_end ptr change
  bpf: adding tests for bpf_xdp_adjust_tail
  bpf: adding bpf_xdp_adjust_tail helper
  bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
  bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail
  bpf: make bnxt compatible w/ bpf_xdp_adjust_tail
  bpf: make cavium thunder compatible w/ bpf_xdp_adjust_tail
  bpf: make netronome nfp compatible w/ bpf_xdp_adjust_tail
  bpf: make tun compatible w/ bpf_xdp_adjust_tail
  bpf: make virtio compatible w/ bpf_xdp_adjust_tail
  bpf: add bpf_xdp_adjust_tail sample prog

 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |   2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |   2 +-
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |   2 +-
 drivers/net/tun.c                                  |   3 +-
 drivers/net/virtio_net.c                           |   7 +-
 include/uapi/linux/bpf.h                           |  10 +-
 net/bpf/test_run.c                                 |   3 +-
 net/core/dev.c                                     |  10 +-
 net/core/filter.c                                  |  29 +++-
 samples/bpf/Makefile                               |   4 +
 samples/bpf/xdp_adjust_tail_kern.c                 | 152 +++++++++++++++++++++
 samples/bpf/xdp_adjust_tail_user.c                 | 142 +++++++++++++++++++
 tools/include/uapi/linux/bpf.h                     |  10 +-
 tools/testing/selftests/bpf/Makefile               |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |   5 +
 tools/testing/selftests/bpf/test_adjust_tail.c     |  30 ++++
 tools/testing/selftests/bpf/test_progs.c           |  32 +++++
 18 files changed, 435 insertions(+), 12 deletions(-)
 create mode 100644 samples/bpf/xdp_adjust_tail_kern.c
 create mode 100644 samples/bpf/xdp_adjust_tail_user.c
 create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c

-- 
2.15.1

* Re: [RFC PATCH] net: bridge: multicast querier per VLAN support
From: Nikolay Aleksandrov @ 2018-04-18 13:14 UTC (permalink / raw)
  To: Joachim Nilsson; +Cc: netdev, Stephen Hemminger, roopa
In-Reply-To: <20180418130718.GA16044@troglobit>

On 18/04/18 16:07, Joachim Nilsson wrote:
> On Wed, Apr 18, 2018 at 03:31:57PM +0300, Nikolay Aleksandrov wrote:
>> On 18/04/18 15:07, Joachim Nilsson wrote:
>>> - First of all, is this patch useful to anyone
>> Obviously to us as it's based on our patch. :-)
>> We actually recently discussed what will be needed to make it acceptable to upstream.
> 
> Great! :)
> 
>>> - The current br_multicast.c is very complex.  The support for both IPv4
>>>    and IPv6 is a no-brainer, but it also has #ifdef VLAN_FILTERING and
>>>    'br->vlan_enabled' ... this has likely been discussed before, but if
>>>    we could remove those code paths I believe what's left would be quite
>>>    a bit easier to read and maintain.
>> br->vlan_enabled has a wrapper that can be used without ifdefs, as does br_vlan_find()
>> so in short - you can remove the ifdefs and use the wrappers,  they'll degrade to always
>> false/null when vlans are disabled.
> 
> Thanks, I'll have a look at that and prepare an RFC v2!
> 
>>> - Many per-bridge specific multicast sysfs settings may need to have a
>>>    corresponding per-VLAN setting, e.g. snooping, query_interval, etc.
>>>    How should we go about that? (For status reporting I have a proposal)
>> We'll have to add more to the per-vlan context, but yes it has to happen.
>> It will be only netlink interface for config/retrieval, no sysfs.
> 
> Some settings are possible to do with sysfs, like multicast_query_interval
> and ...

We want to avoid sysfs in general; all of the networking config and stats
are moving to netlink. It is better controlled and structured for such
changes, and it also provides nice interfaces for automatic type checks etc.

Also (but a minor reason) there is no tree/entity in sysfs for the vlans
where this could be added. It would either have to be a file which does
some format string hack (like we do currently) or a new tree would have to
be added for them, which I'd really like to avoid for the bridge.

> 
>>> - Ditto per-port specific multicast sysfs settings, e.g. multicast_router
>> I'm not sure I follow this one, there is per-port mcast router config now ?
> 
> Sorry no, I meant we may want to add more per-VLAN settings when we get
> this base patch merged.  Like router ports, we may want to be able to
> set them per VLAN.

Sure, that can be done easily via netlink. br_afspec() can decode any
additional per-vlan attributes and can be fairly easily extended.
Also after my vlan rhashtable change, we have per-vlan context even today
(e.g. per-vlan stats use it) so we'll just extend that.

> 
>> Thanks for the effort, I see that you have done some of the required cleanups
>> for this to be upstreamable, but as you've noted above we need to make it
>> complete (with the per-vlan contexts and all).
> 
> There's definitely more work to be done.  Agreeing on a base set of changes
> to start with is maybe the most important, as well as making it complete.
> 
>> I will review this patch in detail later and come back if there's anything.
> 
> Thank you so much for the quick feedback so far! :)
> 
> Cheers
>  /Joachim
> 

* Re: [RFC PATCH] net: bridge: multicast querier per VLAN support
From: Joachim Nilsson @ 2018-04-18 13:07 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, Stephen Hemminger, roopa
In-Reply-To: <b705b089-69f4-f93f-1dda-cd6a8937dc2f@cumulusnetworks.com>

On Wed, Apr 18, 2018 at 03:31:57PM +0300, Nikolay Aleksandrov wrote:
> On 18/04/18 15:07, Joachim Nilsson wrote:
> > - First of all, is this patch useful to anyone
> Obviously to us as it's based on our patch. :-)
> We actually recently discussed what will be needed to make it acceptable to upstream.

Great! :)

> > - The current br_multicast.c is very complex.  The support for both IPv4
> >    and IPv6 is a no-brainer, but it also has #ifdef VLAN_FILTERING and
> >    'br->vlan_enabled' ... this has likely been discussed before, but if
> >    we could remove those code paths I believe what's left would be quite
> >    a bit easier to read and maintain.
> br->vlan_enabled has a wrapper that can be used without ifdefs, as does br_vlan_find()
> so in short - you can remove the ifdefs and use the wrappers,  they'll degrade to always
> false/null when vlans are disabled.

Thanks, I'll have a look at that and prepare an RFC v2!

> > - Many per-bridge specific multicast sysfs settings may need to have a
> >    corresponding per-VLAN setting, e.g. snooping, query_interval, etc.
> >    How should we go about that? (For status reporting I have a proposal)
> We'll have to add more to the per-vlan context, but yes it has to happen.
> It will be only netlink interface for config/retrieval, no sysfs.

Some settings are possible to do with sysfs, like multicast_query_interval
and ...

> > - Ditto per-port specific multicast sysfs settings, e.g. multicast_router
> I'm not sure I follow this one, there is per-port mcast router config now ?

Sorry no, I meant we may want to add more per-VLAN settings when we get
this base patch merged.  Like router ports, we may want to be able to
set them per VLAN.

> Thanks for the effort, I see that you have done some of the required cleanups
> for this to be upstreamable, but as you've noted above we need to make it
> complete (with the per-vlan contexts and all).

There's definitely more work to be done.  Agreeing on a base set of changes
to start with is maybe the most important, as well as making it complete.

> I will review this patch in detail later and come back if there's anything.

Thank you so much for the quick feedback so far! :)

Cheers
 /Joachim

* Re: [Regression] net/phy/micrel.c v4.9.94
From: Andrew Lunn @ 2018-04-18 13:02 UTC (permalink / raw)
  To: Chris Ruehl; +Cc: f.fainelli, netdev
In-Reply-To: <20180418125601.GF31643@lunn.ch>

On Wed, Apr 18, 2018 at 02:56:01PM +0200, Andrew Lunn wrote:
> On Wed, Apr 18, 2018 at 09:34:16AM +0800, Chris Ruehl wrote:
> > Hello,
> > 
> > I'd like to give you a heads-up about a regression introduced in 4.9.94:
> > a commit leads to a kernel oops and makes the network unusable on my
> > customized MX6DL board.
> > 
> > Race condition: resume is called on startup while the PHY is not yet
> > initialized.
> 
> Hi Chris
> 
> Please could you try
> 
> bfe72442578b ("net: phy: micrel: fix crash when statistic requested for KSZ9031 phy")

I don't think it is a complete fix. I suspect "Micrel KSZ8795",
"Micrel KSZ886X Switch", "Micrel KSZ8061", and "Micrel KS8737" will
still have problems.

Those four probably need a:

        .probe          = kszphy_probe,

	Andrew

* Re: [PATCH 1/2] net: netsec: enable tx-irq during open callback
From: Jassi Brar @ 2018-04-18 12:57 UTC (permalink / raw)
  To: David Miller
  Cc: <netdev@vger.kernel.org>, Masahisa Kojima, Ard Biesheuvel,
	Jassi Brar
In-Reply-To: <20180416.134657.469441951884835782.davem@davemloft.net>

Hi Dave,

On Mon, Apr 16, 2018 at 11:16 PM, David Miller <davem@davemloft.net> wrote:
> From: jassisinghbrar@gmail.com
> Date: Mon, 16 Apr 2018 12:52:16 +0530
>
>> From: Jassi Brar <jaswinder.singh@linaro.org>
>>
>> Enable TX-irq as well during ndo_open() as we can not count upon
>> RX to arrive early enough to trigger the napi. This patch is critical
>> for installation over network.
>>
>> Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
>> Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
>
> Applied.
>
Just to make sure, let me mention that c009f413b79de52 and
9a00b697ce31e are very much needed in the stable kernels. Without these we
cannot install any OS over the network.

Thanks.

* Re: [PATCH net-next] team: account for oper state
From: Jiri Pirko @ 2018-04-18 12:56 UTC (permalink / raw)
  To: George Wilkie; +Cc: netdev
In-Reply-To: <20180418102950.1033-1-gwilkie@vyatta.att-mail.com>

Wed, Apr 18, 2018 at 12:29:50PM CEST, gwilkie@vyatta.att-mail.com wrote:
>Account for operational state when determining port linkup state,
>as per Documentation/networking/operstates.txt.

Could you please point me to the exact place in the document where this
is suggested?


>
>Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
>---
> drivers/net/team/team.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index a6c6ce19eeee..231264a05e55 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
> 	case NETDEV_CHANGE:
> 		if (netif_running(port->dev))
> 			team_port_change_check(port,
>-					       !!netif_carrier_ok(port->dev));
>+					       !!(netif_carrier_ok(port->dev) &&
>+						  netif_oper_up(port->dev)));
> 		break;
> 	case NETDEV_UNREGISTER:
> 		team_del_slave(port->team->dev, dev);
>-- 
>2.11.0
>

* Re: [Regression] net/phy/micrel.c v4.9.94
From: Andrew Lunn @ 2018-04-18 12:56 UTC (permalink / raw)
  To: Chris Ruehl; +Cc: f.fainelli, netdev
In-Reply-To: <3bd29bdd-b5ab-03d5-ea53-292f9150ee4c@gtsys.com.hk>

On Wed, Apr 18, 2018 at 09:34:16AM +0800, Chris Ruehl wrote:
> Hello,
> 
> I'd like to give you a heads-up about a regression introduced in 4.9.94:
> a commit leads to a kernel oops and makes the network unusable on my
> customized MX6DL board.
> 
> Race condition: resume is called on startup while the PHY is not yet
> initialized.

Hi Chris

Please could you try

bfe72442578b ("net: phy: micrel: fix crash when statistic requested for KSZ9031 phy")

	     Andrew

* Re: [PATCH bpf-next] tools: bpftool: make it easier to feed hex bytes to bpftool
From: Daniel Borkmann @ 2018-04-18 12:51 UTC (permalink / raw)
  To: Jakub Kicinski, alexei.starovoitov; +Cc: oss-drivers, netdev, Quentin Monnet
In-Reply-To: <20180418024634.8525-1-jakub.kicinski@netronome.com>

On 04/18/2018 04:46 AM, Jakub Kicinski wrote:
> From: Quentin Monnet <quentin.monnet@netronome.com>
> 
> bpftool uses hexadecimal values when it dumps map contents:
> 
>     # bpftool map dump id 1337
>     key: ff 13 37 ff  value: a1 b2 c3 d4 ff ff ff ff
>     Found 1 element
> 
> In order to lookup or update values with bpftool, the natural reflex is
> then to copy and paste the values to the command line, and to try to run
> something like:
> 
>     # bpftool map update id 1337 key ff 13 37 ff \
>             value 00 00 00 00 00 00 1a 2b
>     Error: error parsing byte: ff
> 
> bpftool complains because it uses strtoul() with a base of 0 to parse the
> bytes, so without a "0x" prefix the bytes are treated as decimal values
> (or even octal if they start with "0").
> 
> To feed hexadecimal values instead, one needs to add "0x" prefixes
> everywhere necessary:
> 
>     # bpftool map update id 1337 key 0xff 0x13 0x37 0xff \
>             value 0 0 0 0 0 0 0x1a 0x2b
> 
> To make it easier to use hexadecimal values, add an optional "hex"
> keyword to put after "key" or "value" to tell bpftool to consider the
> digits as hexadecimal. We can now do:
> 
>     # bpftool map update id 1337 key hex ff 13 37 ff \
>             value hex 0 0 0 0 0 0 1a 2b
> 
> Without the "hex" keyword, the bytes are still parsed according to
> normal integer notation (decimal if no prefix, or hexadecimal or octal
> if "0x" or "0" prefix is used, respectively).
> 
> The patch also adds related documentation and bash completion for the
> "hex" keyword.
> 
> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> Suggested-by: David Beckett <david.beckett@netronome.com>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Applied to bpf-next, thanks Quentin!
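
The difference between the two parsing modes described in the quoted commit
message can be reproduced with plain strtoul(). The following is a minimal
sketch, not bpftool's actual code: parse_byte() is a hypothetical helper,
and the caller is assumed to pass base 0 for the default notation and base
16 when the "hex" keyword was given.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (assumption, not bpftool code): parse one byte token.
 * base 0 follows C integer notation ("0x" hex, leading "0" octal, otherwise
 * decimal); base 16 treats the digits as hexadecimal.
 * Returns 0 on success, -1 on a malformed or out-of-range token.
 */
static int parse_byte(const char *tok, int base, unsigned char *out)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(tok, &end, base);
	if (errno || end == tok || *end != '\0' || val > 0xff)
		return -1;

	*out = (unsigned char)val;

	return 0;
}
```

With base 0 the token "ff" is rejected, which matches the "error parsing
byte: ff" message above, while "0x1a" parses to 0x1a and "010" to 8. With
base 16 the tokens "ff" and "13" parse to 0xff and 0x13.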

* Re: [bpf-next PATCH] samples/bpf: fix xdp_monitor user output for tracepoint exception
From: Daniel Borkmann @ 2018-04-18 12:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev; +Cc: Daniel Borkmann, Alexei Starovoitov
In-Reply-To: <152397408629.13093.1644769061929047703.stgit@firesoul>

On 04/17/2018 04:08 PM, Jesper Dangaard Brouer wrote:
> The variable rec_i contains an XDP action code not an error.
> Thus, using err2str() was wrong, it should have been action2str().
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Applied to bpf-next, thanks Jesper!

* Re: [PATCH bpf-next v2 02/11] bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
From: Jesper Dangaard Brouer @ 2018-04-18 12:48 UTC (permalink / raw)
  To: Nikita V. Shirokov
  Cc: brouer, Alexei Starovoitov, Daniel Borkmann, David S. Miller ,
	netdev
In-Reply-To: <20180418042951.17183-3-tehnerd@tehnerd.com>

On Tue, 17 Apr 2018 21:29:42 -0700
"Nikita V. Shirokov" <tehnerd@tehnerd.com> wrote:

> With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be changed
> as well (only "decreasing" the pointer's location is going to be supported).
> Changing this pointer changes the packet's size.
> For generic XDP we need to reflect this change in the packet's length by
> adjusting the skb's tail pointer.
> 
> Acked-by: Alexei Starovoitov <ast@kernel.org>

You are missing your own Signed-off-by: line on all of the patches.

BTW, thank you for working on this! It has been on my todo-list for a
while now!

_After_ this patchset, I would like to see support added for
"increasing" the data_end location to create a larger packet.  For that
we should likely add a data_hard_end pointer.  This would also be
helpful in cpu_map_build_skb() to know the data_hard_end, to determine
the frame size (as some drivers, e.g. ixgbe, don't use PAGE_SIZE frames).
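
How a data_hard_end bound might be used to permit growing can be sketched in
userspace. Everything here is an assumption for illustration: struct
grow_xdp_buff, its data_hard_end field and grow_adjust_tail() are
hypothetical, and the sketch only shows the extra bounds check, not how a
driver would fill in data_hard_end.

```c
#include <errno.h>
#include <stddef.h>

#define ETH_HLEN 14	/* Ethernet header length, same value as in the kernel */

/* Hypothetical model of an xdp_buff extended with a hard frame limit. */
struct grow_xdp_buff {
	unsigned char *data;
	unsigned char *data_end;
	unsigned char *data_hard_end;	/* assumed: end of usable frame memory */
};

/*
 * Sketch: shrinking keeps the ETH_HLEN floor, growing is bounded by
 * data_hard_end, and a zero offset stays invalid.
 */
static int grow_adjust_tail(struct grow_xdp_buff *xdp, int offset)
{
	ptrdiff_t len = xdp->data_end - xdp->data;
	ptrdiff_t room = xdp->data_hard_end - xdp->data_end;

	if (offset == 0)
		return -EINVAL;

	if (offset > 0 && offset > room)
		return -EINVAL;

	if (offset < 0 && len + offset < ETH_HLEN)
		return -EINVAL;

	xdp->data_end += offset;

	return 0;
}
```

In this model a 64-byte packet in a frame with 100 usable bytes can grow by
at most 36 bytes; any larger offset is rejected without touching data_end.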

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [PATCH] samples/bpf: correct comment in sock_example.c
From: Daniel Borkmann @ 2018-04-18 12:47 UTC (permalink / raw)
  To: Wang Sheng-Hui, ast, netdev
In-Reply-To: <20180417022520.2412-1-shhuiw@foxmail.com>

On 04/17/2018 04:25 AM, Wang Sheng-Hui wrote:
> The program runs against the loopback interface "lo", not "eth0".
> Correct the comment.
> 
> Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>

Applied to bpf-next, thanks Wang!

* Re: [Regression] net/phy/micrel.c v4.9.94
From: Andrew Lunn @ 2018-04-18 12:43 UTC (permalink / raw)
  To: Chris Ruehl; +Cc: f.fainelli, netdev
In-Reply-To: <22ccc548-0000-1873-1ea0-1aad140d7131@gtsys.com.hk>

> If I look at the patch I think it should call kszphy_config_init() not _reset()
> in the resume function:
> 
> 
> @@ -715,8 +723,14 @@ static int kszphy_suspend(struct phy_device *phydev)
> 
>  static int kszphy_resume(struct phy_device *phydev)
>  {
> +	int ret;
> +
>  	genphy_resume(phydev);
> 
> -	ret = kszphy_config_reset(phydev);
> +       ret = kszphy_config_init(phydev);
> +	if (ret)
> +		return ret;
> +
> 

Hi Chris

I think a patch for this has been posted. If I remember
correctly, the PHY you have does not call probe, hence phydev->priv is
a NULL pointer, so priv->rmii_ref_clk_sel does not work.

It would be good to find the patch and make sure it has been accepted,
and marked for stable.

    Andrew

* Re: [PATCH v2 bpf-next 0/3] Add missing types to bpftool, libbpf
From: Daniel Borkmann @ 2018-04-18 12:42 UTC (permalink / raw)
  To: Andrey Ignatov, ast; +Cc: kubakici, quentin.monnet, netdev, kernel-team
In-Reply-To: <cover.1523985784.git.rdna@fb.com>

On 04/17/2018 07:28 PM, Andrey Ignatov wrote:
> v1->v2:
> - add new types to bpftool-cgroup man page;
> - add new types to bash completion for bpftool;
> - don't add types that should not be in bpftool cgroup.
> 
> Add support for various BPF prog types and attach types that have been
> added to kernel recently but not to bpftool or libbpf yet.
> 
> Andrey Ignatov (3):
>   bpftool: Support new prog types and attach types
>   libbpf: Support guessing post_bind{4,6} progs
>   libbpf: Type functions for raw tracepoints

Applied to bpf-next, thanks Andrey!

* Re: [PATCH bpf-next v2 00/11] introduction of bpf_xdp_adjust_tail
From: Daniel Borkmann @ 2018-04-18 12:37 UTC (permalink / raw)
  To: Nikita V. Shirokov, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>

On 04/18/2018 06:29 AM, Nikita V. Shirokov wrote:
> In this patch series I'm adding a new bpf helper which allows manipulating
> xdp's data_end pointer. Right now only "shrinking" (reducing the packet's
> size by moving the pointer) is supported (and I see no use case for
> "growing"). The main use case for such a helper is to be able to generate
> control (ICMP) messages from XDP context. Such messages usually contain the
> first N bytes of the original packet as payload, and this is exactly what
> this helper allows us to do (see patch 3 for a sample program, where we
> generate an ICMP "packet too big" message). This helper could be useful for
> load balancing applications where, after additional encapsulation, the
> resulting packet could be bigger than the interface MTU.
> Aside from the new helper, this patch series contains minor changes in
> device drivers (the ones which require them), so that they recalculate the
> packet's length not only when the head pointer was adjusted, but also when
> the tail pointer was.

The whole set doesn't have any SoBs from you which is mandatory before
applying anything. Please add.

Thanks,
Daniel

> v1->v2:
>  * fixed kbuild warning
>  * made offset == 0 invalid for bpf_xdp_adjust_tail
>  * split the bpf_prog_test_run fix and the selftests into separate commits
>  * added SPDX license identifiers where applicable
>  * reshuffled the patch order (tests are now at the end)
> 
> 
> Nikita V. Shirokov (11):
>   bpf: making bpf_prog_test run aware of possible data_end ptr change
>   bpf: adding tests for bpf_xdp_adjust_tail
>   bpf: adding bpf_xdp_adjust_tail helper
>   bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
>   bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail
>   bpf: make bnxt compatible w/ bpf_xdp_adjust_tail
>   bpf: make cavium thunder compatible w/ bpf_xdp_adjust_tail
>   bpf: make netronome nfp compatible w/ bpf_xdp_adjust_tail
>   bpf: make tun compatible w/ bpf_xdp_adjust_tail
>   bpf: make virtio compatible w/ bpf_xdp_adjust_tail
>   bpf: add bpf_xdp_adjust_tail sample prog
> 
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |   2 +-
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   2 +-
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c         |   2 +-
>  .../net/ethernet/netronome/nfp/nfp_net_common.c    |   2 +-
>  drivers/net/tun.c                                  |   3 +-
>  drivers/net/virtio_net.c                           |   7 +-
>  include/uapi/linux/bpf.h                           |  10 +-
>  net/bpf/test_run.c                                 |   3 +-
>  net/core/dev.c                                     |  10 +-
>  net/core/filter.c                                  |  29 +++-
>  samples/bpf/Makefile                               |   4 +
>  samples/bpf/xdp_adjust_tail_kern.c                 | 152 +++++++++++++++++++++
>  samples/bpf/xdp_adjust_tail_user.c                 | 142 +++++++++++++++++++
>  tools/include/uapi/linux/bpf.h                     |  10 +-
>  tools/testing/selftests/bpf/Makefile               |   2 +-
>  tools/testing/selftests/bpf/bpf_helpers.h          |   5 +
>  tools/testing/selftests/bpf/test_adjust_tail.c     |  30 ++++
>  tools/testing/selftests/bpf/test_progs.c           |  32 +++++
>  18 files changed, 435 insertions(+), 12 deletions(-)
>  create mode 100644 samples/bpf/xdp_adjust_tail_kern.c
>  create mode 100644 samples/bpf/xdp_adjust_tail_user.c
>  create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c
> 

* Re: [RFC PATCH] net: bridge: multicast querier per VLAN support
From: Nikolay Aleksandrov @ 2018-04-18 12:31 UTC (permalink / raw)
  To: Joachim Nilsson, netdev; +Cc: Stephen Hemminger, roopa
In-Reply-To: <20180418120713.GA10742@troglobit>

On 18/04/18 15:07, Joachim Nilsson wrote:
> This RFC patch¹ is an attempt to add multicast querier per VLAN support
> to a VLAN aware bridge.  I'm posting it as RFC for now since non-VLAN
> aware bridges are not handled, and one of my questions is if that is
> complexity we need to continue supporting?
> 
> From what I understand, multicast join/report already supports per-VLAN
> operation, and the MDB also supports filtering per VLAN, but queries
> are currently limited to per-port operation on VLAN-aware bridges.
> 
> The naive² approach of this patch relocates query timers from the bridge
> to operate per VLAN, on timer expiry we send queries to all bridge ports
> in the same VLAN.  Tagged port members have tagged VLAN queries.
> 
> Unlike the original patch¹, which uses a sysfs entry to set the querier
> address of each VLAN, this uses the IP address of the VLAN interface when
> initiating a per-VLAN query.  A version of inet_select_addr() is used
> for this, called inet_select_dev_addr(), not included in this patch.
> 
> Open questions/TODO:
> 
> - First of all, is this patch useful to anyone

Obviously to us as it's based on our patch. :-)
We actually recently discussed what will be needed to make it acceptable to upstream.

> - The current br_multicast.c is very complex.  The support for both IPv4
>    and IPv6 is a no-brainer, but it also has #ifdef VLAN_FILTERING and
>    'br->vlan_enabled' ... this has likely been discussed before, but if
>    we could remove those code paths I believe what's left would be quite
>    a bit easier to read and maintain.

br->vlan_enabled has a wrapper that can be used without ifdefs, as does br_vlan_find()
so in short - you can remove the ifdefs and use the wrappers,  they'll degrade to always
false/null when vlans are disabled.

> - Many per-bridge specific multicast sysfs settings may need to have a
>    corresponding per-VLAN setting, e.g. snooping, query_interval, etc.
>    How should we go about that? (For status reporting I have a proposal)

We'll have to add more to the per-vlan context, but yes it has to happen.
It will be only netlink interface for config/retrieval, no sysfs.

> - Ditto per-port specific multicast sysfs settings, e.g. multicast_router

I'm not sure I follow this one, there is per-port mcast router config now ?
Take a look at br_multicast_set_port_router().

> - The MLD support has been kept in sync with the rest but is completely
>    untested.  In particular I suspect the wrong source IP will be used.
> 
> ¹) Initially based on a patch by Cumulus Networks
>     http://repo3.cumulusnetworks.com/repo/pool/cumulus/l/linux/linux-source-4.1_4.1.33-1+cl3u11_all.deb

I knew this looked familiar when I glanced through it :)

> ²) This patch is currently limited to work only on bridges with VLAN
>     enabled.  Care has been taken to support MLD snooping, but it is
>     completely untested.
> 
> Thank you for reading this far!
> 
> Signed-off-by: Joachim Nilsson <troglobit@gmail.com>

Thanks for the effort, I see that you have done some of the required cleanups
for this to be upstreamable, but as you've noted above we need to make it
complete (with the per-vlan contexts and all).

I will review this patch in detail later and come back if there's anything.

Cheers,
  Nik

> ---
>   net/bridge/br_device.c    |   2 +-
>   net/bridge/br_input.c     |   2 +-
>   net/bridge/br_multicast.c | 456 ++++++++++++++++++++++++--------------
>   net/bridge/br_private.h   |  38 +++-
>   net/bridge/br_stp.c       |   5 +-
>   net/bridge/br_vlan.c      |   3 +
>   6 files changed, 327 insertions(+), 179 deletions(-)
> 
> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index 02f9f8aab047..ba35485032d8 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -98,7 +98,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>   
>   		mdst = br_mdb_get(br, skb, vid);
>   		if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
> -		    br_multicast_querier_exists(br, eth_hdr(skb)))
> +		    br_multicast_querier_exists(br, vid, eth_hdr(skb)))
>   			br_multicast_flood(mdst, skb, false, true);
>   		else
>   			br_flood(br, skb, BR_PKT_MULTICAST, false, true);
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 56bb9189c374..13d48489e0e1 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -137,7 +137,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>   		mdst = br_mdb_get(br, skb, vid);
>   		if ((mdst && mdst->addr.proto == htons(ETH_P_ALL)) ||
>   		    ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
> -		     br_multicast_querier_exists(br, eth_hdr(skb)))) {
> +		     br_multicast_querier_exists(br, vid, eth_hdr(skb)))) {
>   			if ((mdst && mdst->host_joined) ||
>   			    br_multicast_is_router(br)) {
>   				local_rcv = true;
> diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
> index 277ecd077dc4..72e47d500972 100644
> --- a/net/bridge/br_multicast.c
> +++ b/net/bridge/br_multicast.c
> @@ -13,6 +13,7 @@
>   #include <linux/err.h>
>   #include <linux/export.h>
>   #include <linux/if_ether.h>
> +#include <linux/if_vlan.h>
>   #include <linux/igmp.h>
>   #include <linux/jhash.h>
>   #include <linux/kernel.h>
> @@ -37,7 +38,7 @@
>   
>   #include "br_private.h"
>   
> -static void br_multicast_start_querier(struct net_bridge *br,
> +static void br_multicast_start_querier(struct net_bridge_vlan *vlan,
>   				       struct bridge_mcast_own_query *query);
>   static void br_multicast_add_router(struct net_bridge *br,
>   				    struct net_bridge_port *port);
> @@ -46,13 +47,14 @@ static void br_ip4_multicast_leave_group(struct net_bridge *br,
>   					 __be32 group,
>   					 __u16 vid,
>   					 const unsigned char *src);
> -
> +static void br_ip4_multicast_query_expired(struct timer_list *t);
>   static void __del_port_router(struct net_bridge_port *p);
>   #if IS_ENABLED(CONFIG_IPV6)
>   static void br_ip6_multicast_leave_group(struct net_bridge *br,
>   					 struct net_bridge_port *port,
>   					 const struct in6_addr *group,
>   					 __u16 vid, const unsigned char *src);
> +static void br_ip6_multicast_query_expired(struct timer_list *t);
>   #endif
>   unsigned int br_mdb_rehash_seq;
>   
> @@ -381,8 +383,30 @@ static int br_mdb_rehash(struct net_bridge_mdb_htable __rcu **mdbp, int max,
>   	return 0;
>   }
>   
> +__be32 br_multicast_inet_addr(struct net_bridge *br, u16 vid)
> +{
> +	struct net_device *dev;
> +
> +	if (!br->multicast_query_use_ifaddr)
> +		return 0;
> +
> +	if (!vid)
> +		return inet_select_addr(br->dev, 0, RT_SCOPE_LINK);
> +
> +	rcu_read_lock();
> +	dev = __vlan_find_dev_deep_rcu(br->dev, htons(ETH_P_8021Q), vid);
> +	rcu_read_unlock();
> +
> +	if (!dev)
> +		return 0;
> +
> +	return inet_select_dev_addr(dev, 0, RT_SCOPE_LINK);
> +}
> +
>   static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
>   						    __be32 group,
> +						    __u16 vid,
> +						    bool tagged,
>   						    u8 *igmp_type)
>   {
>   	struct igmpv3_query *ihv3;
> @@ -391,12 +415,17 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
>   	struct igmphdr *ih;
>   	struct ethhdr *eth;
>   	struct iphdr *iph;
> +	int vh_size = 0;
> +
> +	/* if vid is non-zero, insert the 1Q header also */
> +	if (vid && tagged)
> +		vh_size = sizeof(struct vlan_hdr);
>   
>   	igmp_hdr_size = sizeof(*ih);
>   	if (br->multicast_igmp_version == 3)
>   		igmp_hdr_size = sizeof(*ihv3);
>   	skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*iph) +
> -						 igmp_hdr_size + 4);
> +						 vh_size + igmp_hdr_size + 4);
>   	if (!skb)
>   		goto out;
>   
> @@ -415,6 +444,15 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
>   	eth->h_proto = htons(ETH_P_IP);
>   	skb_put(skb, sizeof(*eth));
>   
> +	if (vid && tagged) {
> +		skb = vlan_insert_tag_set_proto(skb, htons(ETH_P_8021Q), vid);
> +		if (!skb) {
> +			kfree_skb(skb);
> +			br_err(br, "Failed adding VLAN tag to IGMP query, vid:%d\n", vid);
> +			return NULL;
> +		}
> +	}
> +
>   	skb_set_network_header(skb, skb->len);
>   	iph = ip_hdr(skb);
>   
> @@ -426,8 +464,7 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
>   	iph->frag_off = htons(IP_DF);
>   	iph->ttl = 1;
>   	iph->protocol = IPPROTO_IGMP;
> -	iph->saddr = br->multicast_query_use_ifaddr ?
> -		     inet_select_addr(br->dev, 0, RT_SCOPE_LINK) : 0;
> +	iph->saddr = br_multicast_inet_addr(br, vid);
>   	iph->daddr = htonl(INADDR_ALLHOSTS_GROUP);
>   	((u8 *)&iph[1])[0] = IPOPT_RA;
>   	((u8 *)&iph[1])[1] = 4;
> @@ -477,6 +514,8 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
>   #if IS_ENABLED(CONFIG_IPV6)
>   static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
>   						    const struct in6_addr *grp,
> +						    __u16 vid,
> +						    bool tagged,
>   						    u8 *igmp_type)
>   {
>   	struct mld2_query *mld2q;
> @@ -486,13 +525,18 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
>   	size_t mld_hdr_size;
>   	struct sk_buff *skb;
>   	struct ethhdr *eth;
> +	int vh_size = 0;
>   	u8 *hopopt;
>   
> +	/* if vid is non-zero, insert the 1Q header also */
> +	if (vid && tagged)
> +		vh_size = sizeof(struct vlan_hdr);
> +
>   	mld_hdr_size = sizeof(*mldq);
>   	if (br->multicast_mld_version == 2)
>   		mld_hdr_size = sizeof(*mld2q);
>   	skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*ip6h) +
> -						 8 + mld_hdr_size);
> +						 vh_size + 8 + mld_hdr_size);
>   	if (!skb)
>   		goto out;
>   
> @@ -506,6 +550,15 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
>   	eth->h_proto = htons(ETH_P_IPV6);
>   	skb_put(skb, sizeof(*eth));
>   
> +	if (vid && tagged) {
> +		skb = vlan_insert_tag_set_proto(skb, htons(ETH_P_8021Q), vid);
> +		if (!skb) {
> +			kfree_skb(skb);
> +			br_err(br, "Failed adding VLAN tag to MLD query, vid:%d\n", vid);
> +			return NULL;
> +		}
> +	}
> +
>   	/* IPv6 header + HbH option */
>   	skb_set_network_header(skb, skb->len);
>   	ip6h = ipv6_hdr(skb);
> @@ -590,15 +643,17 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
>   
>   static struct sk_buff *br_multicast_alloc_query(struct net_bridge *br,
>   						struct br_ip *addr,
> +						bool tagged,
>   						u8 *igmp_type)
>   {
>   	switch (addr->proto) {
>   	case htons(ETH_P_IP):
> -		return br_ip4_multicast_alloc_query(br, addr->u.ip4, igmp_type);
> +		return br_ip4_multicast_alloc_query(br, addr->u.ip4, addr->vid,
> +						    tagged, igmp_type);
>   #if IS_ENABLED(CONFIG_IPV6)
>   	case htons(ETH_P_IPV6):
> -		return br_ip6_multicast_alloc_query(br, &addr->u.ip6,
> -						    igmp_type);
> +		return br_ip6_multicast_alloc_query(br, &addr->u.ip6, addr->vid,
> +						    tagged, igmp_type);
>   #endif
>   	}
>   	return NULL;
> @@ -905,14 +960,16 @@ static void br_multicast_local_router_expired(struct timer_list *t)
>   	spin_unlock(&br->multicast_lock);
>   }
>   
> -static void br_multicast_querier_expired(struct net_bridge *br,
> +static void br_multicast_querier_expired(struct net_bridge_vlan *vlan,
>   					 struct bridge_mcast_own_query *query)
>   {
> +	struct net_bridge *br = vlan->br;
> +
>   	spin_lock(&br->multicast_lock);
>   	if (!netif_running(br->dev) || br->multicast_disabled)
>   		goto out;
>   
> -	br_multicast_start_querier(br, query);
> +	br_multicast_start_querier(vlan, query);
>   
>   out:
>   	spin_unlock(&br->multicast_lock);
> @@ -920,17 +977,17 @@ static void br_multicast_querier_expired(struct net_bridge *br,
>   
>   static void br_ip4_multicast_querier_expired(struct timer_list *t)
>   {
> -	struct net_bridge *br = from_timer(br, t, ip4_other_query.timer);
> +	struct net_bridge_vlan *v = from_timer(v, t, ip4_other_query.timer);
>   
> -	br_multicast_querier_expired(br, &br->ip4_own_query);
> +	br_multicast_querier_expired(v, &v->ip4_own_query);
>   }
>   
>   #if IS_ENABLED(CONFIG_IPV6)
>   static void br_ip6_multicast_querier_expired(struct timer_list *t)
>   {
> -	struct net_bridge *br = from_timer(br, t, ip6_other_query.timer);
> +	struct net_bridge_vlan *v = from_timer(v, t, ip6_other_query.timer);
>   
> -	br_multicast_querier_expired(br, &br->ip6_own_query);
> +	br_multicast_querier_expired(v, &v->ip6_own_query);
>   }
>   #endif
>   
> @@ -938,11 +995,17 @@ static void br_multicast_select_own_querier(struct net_bridge *br,
>   					    struct br_ip *ip,
>   					    struct sk_buff *skb)
>   {
> +	struct net_bridge_vlan *v;
> +
> +	v = br_vlan_find(br_vlan_group(br), ip->vid);
> +	if (!v)
> +		return;
> +
>   	if (ip->proto == htons(ETH_P_IP))
> -		br->ip4_querier.addr.u.ip4 = ip_hdr(skb)->saddr;
> +		v->ip4_querier.addr.u.ip4 = ip_hdr(skb)->saddr;
>   #if IS_ENABLED(CONFIG_IPV6)
>   	else
> -		br->ip6_querier.addr.u.ip6 = ipv6_hdr(skb)->saddr;
> +		v->ip6_querier.addr.u.ip6 = ipv6_hdr(skb)->saddr;
>   #endif
>   }
>   
> @@ -951,9 +1014,27 @@ static void __br_multicast_send_query(struct net_bridge *br,
>   				      struct br_ip *ip)
>   {
>   	struct sk_buff *skb;
> +	bool tagged = false;
>   	u8 igmp_type;
>   
> -	skb = br_multicast_alloc_query(br, ip, &igmp_type);
> +	if (port->state == BR_STATE_DISABLED ||
> +	    port->state == BR_STATE_BLOCKING)
> +		return;
> +
> +#ifdef CONFIG_BRIDGE_VLAN_FILTERING
> +	if (port && ip->vid) {
> +		struct net_bridge_vlan *v;
> +
> +		v = br_vlan_find(nbp_vlan_group_rcu(port), ip->vid);
> +		if (!br->vlan_enabled || !v)
> +			return;
> +
> +		if (!(v->flags & BRIDGE_VLAN_INFO_UNTAGGED))
> +			tagged = true;
> +	}
> +#endif
> +
> +	skb = br_multicast_alloc_query(br, ip, tagged, &igmp_type);
>   	if (!skb)
>   		return;
>   
> @@ -972,11 +1053,12 @@ static void __br_multicast_send_query(struct net_bridge *br,
>   	}
>   }
>   
> -static void br_multicast_send_query(struct net_bridge *br,
> +static void br_multicast_send_query(struct net_bridge_vlan *vlan,
>   				    struct net_bridge_port *port,
>   				    struct bridge_mcast_own_query *own_query)
>   {
>   	struct bridge_mcast_other_query *other_query = NULL;
> +	struct net_bridge *br = vlan->br;
>   	struct br_ip br_group;
>   	unsigned long time;
>   
> @@ -985,22 +1067,27 @@ static void br_multicast_send_query(struct net_bridge *br,
>   		return;
>   
>   	memset(&br_group.u, 0, sizeof(br_group.u));
> -
> -	if (port ? (own_query == &port->ip4_own_query) :
> -		   (own_query == &br->ip4_own_query)) {
> -		other_query = &br->ip4_other_query;
> +	br_group.vid = vlan->vid;
> +	if (own_query == &vlan->ip4_own_query) {
> +		other_query = &vlan->ip4_other_query;
>   		br_group.proto = htons(ETH_P_IP);
>   #if IS_ENABLED(CONFIG_IPV6)
>   	} else {
> -		other_query = &br->ip6_other_query;
> +		other_query = &vlan->ip6_other_query;
>   		br_group.proto = htons(ETH_P_IPV6);
>   #endif
>   	}
>   
> +	if (port) {
> +		__br_multicast_send_query(br, port, &br_group);
> +		return;
> +	}
> +
>   	if (!other_query || timer_pending(&other_query->timer))
>   		return;
>   
> -	__br_multicast_send_query(br, port, &br_group);
> +	list_for_each_entry(port, &br->port_list, list)
> +		__br_multicast_send_query(br, port, &br_group);
>   
>   	time = jiffies;
>   	time += own_query->startup_sent < br->multicast_startup_query_count ?
> @@ -1009,42 +1096,6 @@ static void br_multicast_send_query(struct net_bridge *br,
>   	mod_timer(&own_query->timer, time);
>   }
>   
> -static void
> -br_multicast_port_query_expired(struct net_bridge_port *port,
> -				struct bridge_mcast_own_query *query)
> -{
> -	struct net_bridge *br = port->br;
> -
> -	spin_lock(&br->multicast_lock);
> -	if (port->state == BR_STATE_DISABLED ||
> -	    port->state == BR_STATE_BLOCKING)
> -		goto out;
> -
> -	if (query->startup_sent < br->multicast_startup_query_count)
> -		query->startup_sent++;
> -
> -	br_multicast_send_query(port->br, port, query);
> -
> -out:
> -	spin_unlock(&br->multicast_lock);
> -}
> -
> -static void br_ip4_multicast_port_query_expired(struct timer_list *t)
> -{
> -	struct net_bridge_port *port = from_timer(port, t, ip4_own_query.timer);
> -
> -	br_multicast_port_query_expired(port, &port->ip4_own_query);
> -}
> -
> -#if IS_ENABLED(CONFIG_IPV6)
> -static void br_ip6_multicast_port_query_expired(struct timer_list *t)
> -{
> -	struct net_bridge_port *port = from_timer(port, t, ip6_own_query.timer);
> -
> -	br_multicast_port_query_expired(port, &port->ip6_own_query);
> -}
> -#endif
> -
>   static void br_mc_disabled_update(struct net_device *dev, bool value)
>   {
>   	struct switchdev_attr attr = {
> @@ -1063,12 +1114,6 @@ int br_multicast_add_port(struct net_bridge_port *port)
>   
>   	timer_setup(&port->multicast_router_timer,
>   		    br_multicast_router_expired, 0);
> -	timer_setup(&port->ip4_own_query.timer,
> -		    br_ip4_multicast_port_query_expired, 0);
> -#if IS_ENABLED(CONFIG_IPV6)
> -	timer_setup(&port->ip6_own_query.timer,
> -		    br_ip6_multicast_port_query_expired, 0);
> -#endif
>   	br_mc_disabled_update(port->dev, port->br->multicast_disabled);
>   
>   	port->mcast_stats = netdev_alloc_pcpu_stats(struct bridge_mcast_stats);
> @@ -1109,15 +1154,47 @@ static void __br_multicast_enable_port(struct net_bridge_port *port)
>   	if (br->multicast_disabled || !netif_running(br->dev))
>   		return;
>   
> -	br_multicast_enable(&port->ip4_own_query);
> -#if IS_ENABLED(CONFIG_IPV6)
> -	br_multicast_enable(&port->ip6_own_query);
> -#endif
>   	if (port->multicast_router == MDB_RTR_TYPE_PERM &&
>   	    hlist_unhashed(&port->rlist))
>   		br_multicast_add_router(br, port);
>   }
>   
> +static void __br_multicast_vlan_init(struct net_bridge_vlan *vlan)
> +{
> +	vlan->ip4_querier.port = NULL;
> +	vlan->ip4_other_query.delay_time = 0;
> +
> +	timer_setup(&vlan->ip4_other_query.timer,
> +		    br_ip4_multicast_querier_expired, 0);
> +	timer_setup(&vlan->ip4_own_query.timer,
> +		    br_ip4_multicast_query_expired, 0);
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +	vlan->ip6_querier.port = NULL;
> +	vlan->ip6_other_query.delay_time = 0;
> +	timer_setup(&vlan->ip6_other_query.timer,
> +		    br_ip6_multicast_querier_expired, 0);
> +	timer_setup(&vlan->ip6_own_query.timer,
> +		    br_ip6_multicast_query_expired, 0);
> + #endif
> +}
> +
> +void br_multicast_enable_vlan(struct net_bridge *br, u16 vid)
> +{
> +	struct net_bridge_vlan *v;
> +
> +	v = br_vlan_find(br_vlan_group(br), vid);
> +	if (!v)
> +		return;
> +
> +	__br_multicast_vlan_init(v);
> +	br_multicast_enable(&v->ip4_own_query);
> +#if IS_ENABLED(CONFIG_IPV6)
> +	br_multicast_enable(&v->ip6_own_query);
> +#endif
> +}
> +
> +/* called by stp to enable timers, only use it to enable router port? -jnn */
>   void br_multicast_enable_port(struct net_bridge_port *port)
>   {
>   	struct net_bridge *br = port->br;
> @@ -1127,6 +1204,7 @@ void br_multicast_enable_port(struct net_bridge_port *port)
>   	spin_unlock(&br->multicast_lock);
>   }
>   
> +/* called by stp_if */
>   void br_multicast_disable_port(struct net_bridge_port *port)
>   {
>   	struct net_bridge *br = port->br;
> @@ -1139,12 +1217,6 @@ void br_multicast_disable_port(struct net_bridge_port *port)
>   			br_multicast_del_pg(br, pg);
>   
>   	__del_port_router(port);
> -
> -	del_timer(&port->multicast_router_timer);
> -	del_timer(&port->ip4_own_query.timer);
> -#if IS_ENABLED(CONFIG_IPV6)
> -	del_timer(&port->ip6_own_query.timer);
> -#endif
>   	spin_unlock(&br->multicast_lock);
>   }
>   
> @@ -1283,65 +1355,66 @@ static int br_ip6_multicast_mld2_report(struct net_bridge *br,
>   }
>   #endif
>   
> -static bool br_ip4_multicast_select_querier(struct net_bridge *br,
> +static bool br_ip4_multicast_select_querier(struct net_bridge_vlan *vlan,
>   					    struct net_bridge_port *port,
>   					    __be32 saddr)
>   {
> -	if (!timer_pending(&br->ip4_own_query.timer) &&
> -	    !timer_pending(&br->ip4_other_query.timer))
> +
> +	if (!timer_pending(&vlan->ip4_own_query.timer) &&
> +	    !timer_pending(&vlan->ip4_other_query.timer))
>   		goto update;
>   
> -	if (!br->ip4_querier.addr.u.ip4)
> +	if (!vlan->ip4_querier.addr.u.ip4)
>   		goto update;
>   
> -	if (ntohl(saddr) <= ntohl(br->ip4_querier.addr.u.ip4))
> +	if (ntohl(saddr) <= ntohl(vlan->ip4_querier.addr.u.ip4))
>   		goto update;
>   
>   	return false;
>   
>   update:
> -	br->ip4_querier.addr.u.ip4 = saddr;
> +	vlan->ip4_querier.addr.u.ip4 = saddr;
>   
>   	/* update protected by general multicast_lock by caller */
> -	rcu_assign_pointer(br->ip4_querier.port, port);
> +	rcu_assign_pointer(vlan->ip4_querier.port, port);
>   
>   	return true;
>   }
>   
>   #if IS_ENABLED(CONFIG_IPV6)
> -static bool br_ip6_multicast_select_querier(struct net_bridge *br,
> +static bool br_ip6_multicast_select_querier(struct net_bridge_vlan *vlan,
>   					    struct net_bridge_port *port,
>   					    struct in6_addr *saddr)
>   {
> -	if (!timer_pending(&br->ip6_own_query.timer) &&
> -	    !timer_pending(&br->ip6_other_query.timer))
> +	if (!timer_pending(&vlan->ip6_own_query.timer) &&
> +	    !timer_pending(&vlan->ip6_other_query.timer))
>   		goto update;
>   
> -	if (ipv6_addr_cmp(saddr, &br->ip6_querier.addr.u.ip6) <= 0)
> +	if (ipv6_addr_cmp(saddr, &vlan->ip6_querier.addr.u.ip6) <= 0)
>   		goto update;
>   
>   	return false;
>   
>   update:
> -	br->ip6_querier.addr.u.ip6 = *saddr;
> +	vlan->ip6_querier.addr.u.ip6 = *saddr;
>   
>   	/* update protected by general multicast_lock by caller */
> -	rcu_assign_pointer(br->ip6_querier.port, port);
> +	rcu_assign_pointer(vlan->ip6_querier.port, port);
>   
>   	return true;
>   }
>   #endif
>   
> -static bool br_multicast_select_querier(struct net_bridge *br,
> +static bool br_multicast_select_querier(struct net_bridge_vlan *vlan,
>   					struct net_bridge_port *port,
>   					struct br_ip *saddr)
>   {
>   	switch (saddr->proto) {
>   	case htons(ETH_P_IP):
> -		return br_ip4_multicast_select_querier(br, port, saddr->u.ip4);
> +		return br_ip4_multicast_select_querier(vlan, port, saddr->u.ip4);
>   #if IS_ENABLED(CONFIG_IPV6)
>   	case htons(ETH_P_IPV6):
> -		return br_ip6_multicast_select_querier(br, port, &saddr->u.ip6);
> +		return br_ip6_multicast_select_querier(vlan, port, &saddr->u.ip6);
>   #endif
>   	}
>   
> @@ -1425,17 +1498,17 @@ static void br_multicast_mark_router(struct net_bridge *br,
>   		  now + br->multicast_querier_interval);
>   }
>   
> -static void br_multicast_query_received(struct net_bridge *br,
> +static void br_multicast_query_received(struct net_bridge_vlan *vlan,
>   					struct net_bridge_port *port,
>   					struct bridge_mcast_other_query *query,
>   					struct br_ip *saddr,
>   					unsigned long max_delay)
>   {
> -	if (!br_multicast_select_querier(br, port, saddr))
> +	if (!br_multicast_select_querier(vlan, port, saddr))
>   		return;
>   
> -	br_multicast_update_query_timer(br, query, max_delay);
> -	br_multicast_mark_router(br, port);
> +	br_multicast_update_query_timer(vlan->br, query, max_delay);
> +	br_multicast_mark_router(vlan->br, port);
>   }
>   
>   static int br_ip4_multicast_query(struct net_bridge *br,
> @@ -1482,10 +1555,17 @@ static int br_ip4_multicast_query(struct net_bridge *br,
>   	}
>   
>   	if (!group) {
> +		struct net_bridge_vlan *v;
> +
> +		v = br_vlan_find(br_vlan_group(br), vid);
> +		if (!v)
> +			goto out;
> +
>   		saddr.proto = htons(ETH_P_IP);
> +		saddr.vid   = vid;
>   		saddr.u.ip4 = iph->saddr;
>   
> -		br_multicast_query_received(br, port, &br->ip4_other_query,
> +		br_multicast_query_received(v, port, &v->ip4_other_query,
>   					    &saddr, max_delay);
>   		goto out;
>   	}
> @@ -1565,10 +1645,17 @@ static int br_ip6_multicast_query(struct net_bridge *br,
>   	is_general_query = group && ipv6_addr_any(group);
>   
>   	if (is_general_query) {
> +		struct net_bridge_vlan *v;
> +
> +		v = br_vlan_find(br_vlan_group(br), vid);
> +		if (!v)
> +			goto out;
> +
>   		saddr.proto = htons(ETH_P_IPV6);
> +		saddr.vid   = vid;
>   		saddr.u.ip6 = ip6h->saddr;
>   
> -		br_multicast_query_received(br, port, &br->ip6_other_query,
> +		br_multicast_query_received(v, port, &v->ip6_other_query,
>   					    &saddr, max_delay);
>   		goto out;
>   	} else if (!group) {
> @@ -1716,20 +1803,22 @@ static void br_ip4_multicast_leave_group(struct net_bridge *br,
>   					 __u16 vid,
>   					 const unsigned char *src)
>   {
> +	struct net_bridge_vlan *v;
>   	struct br_ip br_group;
> -	struct bridge_mcast_own_query *own_query;
>   
>   	if (ipv4_is_local_multicast(group))
>   		return;
>   
> -	own_query = port ? &port->ip4_own_query : &br->ip4_own_query;
> +	v = br_vlan_find(br_vlan_group(br), vid);
> +	if (!v)
> +		return;
>   
>   	br_group.u.ip4 = group;
>   	br_group.proto = htons(ETH_P_IP);
>   	br_group.vid = vid;
>   
> -	br_multicast_leave_group(br, port, &br_group, &br->ip4_other_query,
> -				 own_query, src);
> +	br_multicast_leave_group(br, port, &br_group, &v->ip4_other_query,
> +				 &v->ip4_own_query, src);
>   }
>   
>   #if IS_ENABLED(CONFIG_IPV6)
> @@ -1739,20 +1828,22 @@ static void br_ip6_multicast_leave_group(struct net_bridge *br,
>   					 __u16 vid,
>   					 const unsigned char *src)
>   {
> +	struct net_bridge_vlan *v;
>   	struct br_ip br_group;
> -	struct bridge_mcast_own_query *own_query;
>   
>   	if (ipv6_addr_is_ll_all_nodes(group))
>   		return;
>   
> -	own_query = port ? &port->ip6_own_query : &br->ip6_own_query;
> +	v = br_vlan_find(br_vlan_group(br), vid);
> +	if (!v)
> +		return;
>   
>   	br_group.u.ip6 = *group;
>   	br_group.proto = htons(ETH_P_IPV6);
>   	br_group.vid = vid;
>   
> -	br_multicast_leave_group(br, port, &br_group, &br->ip6_other_query,
> -				 own_query, src);
> +	br_multicast_leave_group(br, port, &br_group, &v->ip6_other_query,
> +				 &v->ip6_own_query, src);
>   }
>   #endif
>   
> @@ -1938,37 +2029,42 @@ int br_multicast_rcv(struct net_bridge *br, struct net_bridge_port *port,
>   	return ret;
>   }
>   
> -static void br_multicast_query_expired(struct net_bridge *br,
> +static void br_multicast_query_expired(struct net_bridge_vlan *vlan,
>   				       struct bridge_mcast_own_query *query,
>   				       struct bridge_mcast_querier *querier)
>   {
> +	struct net_bridge *br = vlan->br;
> +
>   	spin_lock(&br->multicast_lock);
>   	if (query->startup_sent < br->multicast_startup_query_count)
>   		query->startup_sent++;
>   
>   	RCU_INIT_POINTER(querier->port, NULL);
> -	br_multicast_send_query(br, NULL, query);
> +	br_multicast_send_query(vlan, NULL, query);
>   	spin_unlock(&br->multicast_lock);
>   }
>   
>   static void br_ip4_multicast_query_expired(struct timer_list *t)
>   {
> -	struct net_bridge *br = from_timer(br, t, ip4_own_query.timer);
> +	struct net_bridge_vlan *v = from_timer(v, t, ip4_own_query.timer);
>   
> -	br_multicast_query_expired(br, &br->ip4_own_query, &br->ip4_querier);
> +	br_multicast_query_expired(v, &v->ip4_own_query, &v->ip4_querier);
>   }
>   
>   #if IS_ENABLED(CONFIG_IPV6)
>   static void br_ip6_multicast_query_expired(struct timer_list *t)
>   {
> -	struct net_bridge *br = from_timer(br, t, ip6_own_query.timer);
> +	struct net_bridge_vlan *v = from_timer(v, t, ip6_own_query.timer);
>   
> -	br_multicast_query_expired(br, &br->ip6_own_query, &br->ip6_querier);
> +	br_multicast_query_expired(v, &v->ip6_own_query, &v->ip6_querier);
>   }
>   #endif
>   
>   void br_multicast_init(struct net_bridge *br)
>   {
> +	struct net_bridge_vlan_group *vg;
> +	struct net_bridge_vlan *v;
> +
>   	br->hash_elasticity = 4;
>   	br->hash_max = 512;
>   
> @@ -1985,29 +2081,22 @@ void br_multicast_init(struct net_bridge *br)
>   	br->multicast_querier_interval = 255 * HZ;
>   	br->multicast_membership_interval = 260 * HZ;
>   
> -	br->ip4_other_query.delay_time = 0;
> -	br->ip4_querier.port = NULL;
>   	br->multicast_igmp_version = 2;
>   #if IS_ENABLED(CONFIG_IPV6)
>   	br->multicast_mld_version = 1;
> -	br->ip6_other_query.delay_time = 0;
> -	br->ip6_querier.port = NULL;
>   #endif
>   	br->has_ipv6_addr = 1;
>   
>   	spin_lock_init(&br->multicast_lock);
>   	timer_setup(&br->multicast_router_timer,
>   		    br_multicast_local_router_expired, 0);
> -	timer_setup(&br->ip4_other_query.timer,
> -		    br_ip4_multicast_querier_expired, 0);
> -	timer_setup(&br->ip4_own_query.timer,
> -		    br_ip4_multicast_query_expired, 0);
> -#if IS_ENABLED(CONFIG_IPV6)
> -	timer_setup(&br->ip6_other_query.timer,
> -		    br_ip6_multicast_querier_expired, 0);
> -	timer_setup(&br->ip6_own_query.timer,
> -		    br_ip6_multicast_query_expired, 0);
> -#endif
> +
> +	vg = br_vlan_group(br);
> +	if (!vg || !vg->num_vlans)
> +		return;
> +
> +	list_for_each_entry(v, &vg->vlan_list, vlist)
> +		__br_multicast_vlan_init(v);
>   }
>   
>   static void __br_multicast_open(struct net_bridge *br,
> @@ -2023,21 +2112,41 @@ static void __br_multicast_open(struct net_bridge *br,
>   
>   void br_multicast_open(struct net_bridge *br)
>   {
> -	__br_multicast_open(br, &br->ip4_own_query);
> +	struct net_bridge_vlan_group *vg;
> +	struct net_bridge_vlan *v;
> +
> +	vg = br_vlan_group(br);
> +	if (!vg || !vg->num_vlans)
> +		return;
> +
> +	list_for_each_entry(v, &vg->vlan_list, vlist) {
> +		__br_multicast_vlan_init(v);
> +		__br_multicast_open(br, &v->ip4_own_query);
>   #if IS_ENABLED(CONFIG_IPV6)
> -	__br_multicast_open(br, &br->ip6_own_query);
> +		__br_multicast_open(br, &v->ip6_own_query);
>   #endif
> +	}
>   }
>   
>   void br_multicast_stop(struct net_bridge *br)
>   {
> +	struct net_bridge_vlan_group *vg;
> +	struct net_bridge_vlan *v;
> +
>   	del_timer_sync(&br->multicast_router_timer);
> -	del_timer_sync(&br->ip4_other_query.timer);
> -	del_timer_sync(&br->ip4_own_query.timer);
> +
> +	vg = br_vlan_group(br);
> +	if (!vg || !vg->num_vlans)
> +		return;
> +
> +	list_for_each_entry(v, &vg->vlan_list, vlist) {
> +		del_timer_sync(&v->ip4_other_query.timer);
> +		del_timer_sync(&v->ip4_own_query.timer);
>   #if IS_ENABLED(CONFIG_IPV6)
> -	del_timer_sync(&br->ip6_other_query.timer);
> -	del_timer_sync(&br->ip6_own_query.timer);
> +		del_timer_sync(&v->ip6_other_query.timer);
> +		del_timer_sync(&v->ip6_own_query.timer);
>   #endif
> +	}
>   }
>   
>   void br_multicast_dev_del(struct net_bridge *br)
> @@ -2162,25 +2271,37 @@ int br_multicast_set_port_router(struct net_bridge_port *p, unsigned long val)
>   	return err;
>   }
>   
> -static void br_multicast_start_querier(struct net_bridge *br,
> +/* Must be called with multicast_lock */
> +static void br_multicast_init_querier(struct net_bridge_vlan *vlan,
> +				      struct bridge_mcast_own_query *query,
> +				      unsigned long max_delay)
> +{
> +	struct bridge_mcast_other_query *other_query = NULL;
> +
> +	if (query == &vlan->ip4_own_query)
> +		other_query = &vlan->ip4_other_query;
> +	else
> +		other_query = &vlan->ip6_other_query;
> +
> +	if (!timer_pending(&other_query->timer))
> +		other_query->delay_time = jiffies + max_delay;
> +
> +	br_multicast_start_querier(vlan, query);
> +}
> +
> +static void br_multicast_start_querier(struct net_bridge_vlan *vlan,
>   				       struct bridge_mcast_own_query *query)
>   {
> -	struct net_bridge_port *port;
> +	struct net_bridge *br = vlan->br;
>   
>   	__br_multicast_open(br, query);
>   
> -	list_for_each_entry(port, &br->port_list, list) {
> -		if (port->state == BR_STATE_DISABLED ||
> -		    port->state == BR_STATE_BLOCKING)
> -			continue;
> -
> -		if (query == &br->ip4_own_query)
> -			br_multicast_enable(&port->ip4_own_query);
> +	if (query == &vlan->ip4_own_query)
> +		br_multicast_enable(&vlan->ip4_own_query);
>   #if IS_ENABLED(CONFIG_IPV6)
> -		else
> -			br_multicast_enable(&port->ip6_own_query);
> +	else
> +		br_multicast_enable(&vlan->ip6_own_query);
>   #endif
> -	}
>   }
>   
>   int br_multicast_toggle(struct net_bridge *br, unsigned long val)
> @@ -2248,6 +2369,8 @@ EXPORT_SYMBOL_GPL(br_multicast_router);
>   
>   int br_multicast_set_querier(struct net_bridge *br, unsigned long val)
>   {
> +	struct net_bridge_vlan_group *vg;
> +	struct net_bridge_vlan *v;
>   	unsigned long max_delay;
>   
>   	val = !!val;
> @@ -2260,19 +2383,18 @@ int br_multicast_set_querier(struct net_bridge *br, unsigned long val)
>   	if (!val)
>   		goto unlock;
>   
> -	max_delay = br->multicast_query_response_interval;
> -
> -	if (!timer_pending(&br->ip4_other_query.timer))
> -		br->ip4_other_query.delay_time = jiffies + max_delay;
> +	vg = br_vlan_group(br);
> +	if (!vg || !vg->num_vlans)
> +		goto unlock;
>   
> -	br_multicast_start_querier(br, &br->ip4_own_query);
> +	max_delay = br->multicast_query_response_interval;
>   
> +	list_for_each_entry(v, &vg->vlan_list, vlist) {
> +		br_multicast_init_querier(v, &v->ip4_own_query, max_delay);
>   #if IS_ENABLED(CONFIG_IPV6)
> -	if (!timer_pending(&br->ip6_other_query.timer))
> -		br->ip6_other_query.delay_time = jiffies + max_delay;
> -
> -	br_multicast_start_querier(br, &br->ip6_own_query);
> +		br_multicast_init_querier(v, &v->ip6_own_query, max_delay);
>   #endif
> +	}
>   
>   unlock:
>   	spin_unlock_bh(&br->multicast_lock);
> @@ -2425,6 +2547,7 @@ EXPORT_SYMBOL_GPL(br_multicast_list_adjacent);
>    */
>   bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto)
>   {
> +	struct net_bridge_vlan_group *vg;
>   	struct net_bridge *br;
>   	struct net_bridge_port *port;
>   	struct ethhdr eth;
> @@ -2438,12 +2561,16 @@ bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto)
>   	if (!port || !port->br)
>   		goto unlock;
>   
> +	vg = nbp_vlan_group_rcu(port);
> +	if (!vg)
> +		goto unlock;
> +
>   	br = port->br;
>   
>   	memset(&eth, 0, sizeof(eth));
>   	eth.h_proto = htons(proto);
>   
> -	ret = br_multicast_querier_exists(br, &eth);
> +	ret = br_multicast_querier_exists(br, br_get_pvid(vg), &eth);
>   
>   unlock:
>   	rcu_read_unlock();
> @@ -2462,7 +2589,8 @@ EXPORT_SYMBOL_GPL(br_multicast_has_querier_anywhere);
>    */
>   bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto)
>   {
> -	struct net_bridge *br;
> +	struct net_bridge_vlan_group *vg;
> +	struct net_bridge_vlan *v;
>   	struct net_bridge_port *port;
>   	bool ret = false;
>   
> @@ -2474,18 +2602,24 @@ bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto)
>   	if (!port || !port->br)
>   		goto unlock;
>   
> -	br = port->br;
> +	vg = nbp_vlan_group_rcu(port);
> +	if (!vg)
> +		goto unlock;
> +
> +	v = br_vlan_find(br_vlan_group(port->br), br_get_pvid(vg));
> +	if (!v)
> +		goto unlock;
>   
>   	switch (proto) {
>   	case ETH_P_IP:
> -		if (!timer_pending(&br->ip4_other_query.timer) ||
> -		    rcu_dereference(br->ip4_querier.port) == port)
> +		if (!timer_pending(&v->ip4_other_query.timer) ||
> +		    rcu_dereference(v->ip4_querier.port) == port)
>   			goto unlock;
>   		break;
>   #if IS_ENABLED(CONFIG_IPV6)
>   	case ETH_P_IPV6:
> -		if (!timer_pending(&br->ip6_other_query.timer) ||
> -		    rcu_dereference(br->ip6_querier.port) == port)
> +		if (!timer_pending(&v->ip6_other_query.timer) ||
> +		    rcu_dereference(v->ip6_querier.port) == port)
>   			goto unlock;
>   		break;
>   #endif
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 6e31be61d2c6..00dac1bbfaba 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -140,6 +140,17 @@ struct net_bridge_vlan {
>   		struct net_bridge_vlan	*brvlan;
>   	};
>   
> +#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
> +	struct bridge_mcast_other_query	ip4_other_query;
> +	struct bridge_mcast_own_query	ip4_own_query;
> +	struct bridge_mcast_querier	ip4_querier;
> +#if IS_ENABLED(CONFIG_IPV6)
> +	struct bridge_mcast_other_query	ip6_other_query;
> +	struct bridge_mcast_own_query	ip6_own_query;
> +	struct bridge_mcast_querier	ip6_querier;
> +#endif
> +#endif
> +
>   	struct br_tunnel_info		tinfo;
>   
>   	struct list_head		vlist;
> @@ -261,10 +272,6 @@ struct net_bridge_port {
>   	struct rcu_head			rcu;
>   
>   #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
> -	struct bridge_mcast_own_query	ip4_own_query;
> -#if IS_ENABLED(CONFIG_IPV6)
> -	struct bridge_mcast_own_query	ip6_own_query;
> -#endif /* IS_ENABLED(CONFIG_IPV6) */
>   	unsigned char			multicast_router;
>   	struct bridge_mcast_stats	__percpu *mcast_stats;
>   	struct timer_list		multicast_router_timer;
> @@ -390,14 +397,8 @@ struct net_bridge {
>   	struct hlist_head		router_list;
>   
>   	struct timer_list		multicast_router_timer;
> -	struct bridge_mcast_other_query	ip4_other_query;
> -	struct bridge_mcast_own_query	ip4_own_query;
> -	struct bridge_mcast_querier	ip4_querier;
>   	struct bridge_mcast_stats	__percpu *mcast_stats;
>   #if IS_ENABLED(CONFIG_IPV6)
> -	struct bridge_mcast_other_query	ip6_other_query;
> -	struct bridge_mcast_own_query	ip6_own_query;
> -	struct bridge_mcast_querier	ip6_querier;
>   	u8				multicast_mld_version;
>   #endif /* IS_ENABLED(CONFIG_IPV6) */
>   #endif
> @@ -618,6 +619,7 @@ int br_multicast_add_port(struct net_bridge_port *port);
>   void br_multicast_del_port(struct net_bridge_port *port);
>   void br_multicast_enable_port(struct net_bridge_port *port);
>   void br_multicast_disable_port(struct net_bridge_port *port);
> +void br_multicast_enable_vlan(struct net_bridge *br, u16 vid);
>   void br_multicast_init(struct net_bridge *br);
>   void br_multicast_open(struct net_bridge *br);
>   void br_multicast_stop(struct net_bridge *br);
> @@ -633,6 +635,7 @@ int br_multicast_set_igmp_version(struct net_bridge *br, unsigned long val);
>   #if IS_ENABLED(CONFIG_IPV6)
>   int br_multicast_set_mld_version(struct net_bridge *br, unsigned long val);
>   #endif
> +__be32 br_multicast_inet_addr(struct net_bridge *br, u16 vid);
>   struct net_bridge_mdb_entry *
>   br_mdb_ip_get(struct net_bridge_mdb_htable *mdb, struct br_ip *dst);
>   struct net_bridge_mdb_entry *
> @@ -687,17 +690,27 @@ __br_multicast_querier_exists(struct net_bridge *br,
>   	       (own_querier_enabled || timer_pending(&querier->timer));
>   }
>   
> +static struct net_bridge_vlan_group *br_vlan_group(const struct net_bridge *br);
> +struct net_bridge_vlan *br_vlan_find(struct net_bridge_vlan_group *vg, u16 vid);
> +
>   static inline bool br_multicast_querier_exists(struct net_bridge *br,
> +					       u16 vid,
>   					       struct ethhdr *eth)
>   {
> +	struct net_bridge_vlan *v;
> +
> +	v = br_vlan_find(br_vlan_group(br), vid);
> +	if (!v)
> +		return false;
> +
>   	switch (eth->h_proto) {
>   	case (htons(ETH_P_IP)):
>   		return __br_multicast_querier_exists(br,
> -			&br->ip4_other_query, false);
> +			&v->ip4_other_query, false);
>   #if IS_ENABLED(CONFIG_IPV6)
>   	case (htons(ETH_P_IPV6)):
>   		return __br_multicast_querier_exists(br,
> -			&br->ip6_other_query, true);
> +			&v->ip6_other_query, true);
>   #endif
>   	default:
>   		return false;
> @@ -768,6 +781,7 @@ static inline bool br_multicast_is_router(struct net_bridge *br)
>   }
>   
>   static inline bool br_multicast_querier_exists(struct net_bridge *br,
> +					       u16 vid,
>   					       struct ethhdr *eth)
>   {
>   	return false;
> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
> index a1ba52d247d8..d1d6c4fb39dd 100644
> --- a/net/bridge/br_stp.c
> +++ b/net/bridge/br_stp.c
> @@ -460,10 +460,7 @@ void br_port_state_selection(struct net_bridge *br)
>   
>   		if (p->state != BR_STATE_BLOCKING)
>   			br_multicast_enable_port(p);
> -		/* Multicast is not disabled for the port when it goes in
> -		 * blocking state because the timers will expire and stop by
> -		 * themselves without sending more queries.
> -		 */
> +
>   		if (p->state == BR_STATE_FORWARDING)
>   			++liveports;
>   	}
> diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
> index bb9cbad4bad6..3b8fb28e9ab4 100644
> --- a/net/bridge/br_vlan.c
> +++ b/net/bridge/br_vlan.c
> @@ -270,6 +270,9 @@ static int __vlan_add(struct net_bridge_vlan *v, u16 flags)
>   			goto out_filt;
>   		}
>   		vg->num_vlans++;
> +
> +		/* Start per VLAN IGMP/MLD querier timers */
> +		br_multicast_enable_vlan(br, v->vid);
>   	}
>   
>   	err = rhashtable_lookup_insert_fast(&vg->vlan_hash, &v->vnode,
> 

^ permalink raw reply

* Re: [PATCH net 5/5] nfp: remove false positive offloads in flower vxlan
From: John Hurley @ 2018-04-18 12:31 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Jakub Kicinski, Linux Netdev List, oss-drivers, Simon Horman
In-Reply-To: <CAJ3xEMgmUFS5vEzy-sRZ7XzFsHVSaKKLDuBoba0WZvauP9ZmLw@mail.gmail.com>

On Wed, Apr 18, 2018 at 8:43 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Fri, Nov 17, 2017 at 4:06 AM, Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
>> From: John Hurley <john.hurley@netronome.com>
>>
>> Pass information to the match offload on whether or not the repr is the
>> ingress or egress dev. Only accept tunnel matches if repr is the egress dev.
>>
>> This means rules such as the following are successfully offloaded:
>> tc .. add dev vxlan0 .. enc_dst_port 4789 .. action redirect dev nfp_p0
>>
>> While rules such as the following are rejected:
>> tc .. add dev nfp_p0 .. enc_dst_port 4789 .. action redirect dev vxlan0
>
> cool
>
>
>> Also reject non tunnel flows that are offloaded to an egress dev.
>> Non tunnel matches assume that the offload dev is the ingress port and
>> offload a match accordingly.
>
> not following on the "Also" here, see below
>
>
>> diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>> index a0193e0c24a0..f5d73b83dcc2 100644
>> --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
>> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>> @@ -131,7 +131,8 @@ static bool nfp_flower_check_higher_than_mac(struct tc_cls_flower_offload *f)
>>
>>  static int
>>  nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>> -                               struct tc_cls_flower_offload *flow)
>> +                               struct tc_cls_flower_offload *flow,
>> +                               bool egress)
>>  {
>>         struct flow_dissector_key_basic *mask_basic = NULL;
>>         struct flow_dissector_key_basic *key_basic = NULL;
>> @@ -167,6 +168,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>                         skb_flow_dissector_target(flow->dissector,
>>                                                   FLOW_DISSECTOR_KEY_ENC_CONTROL,
>>                                                   flow->key);
>> +               if (!egress)
>> +                       return -EOPNOTSUPP;
>> +
>>                 if (mask_enc_ctl->addr_type != 0xffff ||
>>                     enc_ctl->addr_type != FLOW_DISSECTOR_KEY_IPV4_ADDRS)
>>                         return -EOPNOTSUPP;
>> @@ -194,6 +198,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>
>>                 key_layer |= NFP_FLOWER_LAYER_VXLAN;
>>                 key_size += sizeof(struct nfp_flower_vxlan);
>> +       } else if (egress) {
>> +               /* Reject non tunnel matches offloaded to egress repr. */
>> +               return -EOPNOTSUPP;
>>         }
>
> with these two hunks we get: egress <- IFF -> encap match, right?
>
> (1) we can't offload the egress way if there isn't matching on encap headers
> (2) we can't go the matching on encap headers way if we are not egress
>

Yes, this is correct.
With the block code and egdev offload, we do not have access to the
ingress netdev when doing an offload.
We need to use the encap headers (especially the enc_port) to
distinguish the type of tunnel used and, therefore, require that the
encap matches be present before offloading.
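
To make the resulting rule concrete, here is a minimal standalone model of
the acceptance check. It is only a sketch, not the driver code: the function
name flow_offload_allowed() and its two boolean parameters are made up for
illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two hunks quoted above: a flow is accepted
 * only when "matches on encap headers" and "offloaded via the egress
 * repr" are both true or both false.
 */
static int flow_offload_allowed(bool has_encap_match, bool egress)
{
	if (has_encap_match && !egress)
		return -EOPNOTSUPP;	/* tunnel match, but repr is ingress */
	if (!has_encap_match && egress)
		return -EOPNOTSUPP;	/* non tunnel match on egress repr */
	return 0;
}
```

Only the (encap match, egress) and (no encap match, ingress) combinations
return 0 in this model.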

> what other cases are rejected by this logic?
>

Yes, some other cases may be rejected (like veth mentioned below).
However, this is better than allowing rules to be incorrectly
offloaded (as could have happened before these changes).
Currently, we are looking at offloading flows on other ingress devices
such as bonds, so this will require a change to the driver code here.
IMO, the cleanest solution will also require tc core changes, either to
avoid egdev offload or to give access to the ingress netdev of a rule.

> e.g If we add a rule with SW device (veth. tap) being the ingress, and
> HW device (vf rep)
> being the egress while not using skip_sw (just no flags == both) we
> get the TC stack
> go along the egdev callback from the vf rep hw device and add an
> (uplink --> vf rep) rule
> which will not be rejected if there is matching on tunnel headers, it
> will also not be rejected
> by some driver logic as the one we discussed to identify and ignore
> rules that are attempted to being added twice.
>
> Or.

^ permalink raw reply

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: Rahul Lakkireddy @ 2018-04-18 12:31 UTC (permalink / raw)
  To: Dave Young
  Cc: Indranil Choudhury,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Nirranjan Kirubaharan,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org,
	Ganesh GR, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org
In-Reply-To: <20180418061546.GA4551-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>

On Wednesday, April 18, 2018 at 11:45:46 +0530, Dave Young wrote:
> Hi Rahul,
> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> > On production servers running variety of workloads over time, kernel
> > panic can happen sporadically after days or even months. It is
> > important to collect as much debug logs as possible to root cause
> > and fix the problem, that may not be easy to reproduce. Snapshot of
> > underlying hardware/firmware state (like register dump, firmware
> > logs, adapter memory, etc.), at the time of kernel panic will be very
> > helpful while debugging the culprit device driver.
> > 
> > This series of patches add new generic framework that enable device
> > drivers to collect device specific snapshot of the hardware/firmware
> > state of the underlying device in the crash recovery kernel. In crash
> > recovery kernel, the collected logs are added as elf notes to
> > /proc/vmcore, which is copied by user space scripts for post-analysis.
> > 
> > The sequence of actions done by device drivers to append their device
> > specific hardware/firmware logs to /proc/vmcore are as follows:
> > 
> > 1. During probe (before hardware is initialized), device drivers
> > register to the vmcore module (via vmcore_add_device_dump()), with
> > callback function, along with buffer size and log name needed for
> > firmware/hardware log collection.
> 
> I assumed the elf notes info should be prepared while kexec_[file_]load
> phase. But I did not read the old comment, not sure if it has been discussed
> or not.
> 

We must not collect dumps in the crashing kernel. Adding more work to
the crash dump path risks not collecting the vmcore at all. Eric
discussed this in more detail at:

https://lkml.org/lkml/2018/3/24/319

It is safe to collect dumps in the second kernel. Each device dump
will be exported as an elf note in /proc/vmcore.

> If do this in 2nd kernel a question is driver can be loaded later than vmcore init.

Yes, drivers will add their device dumps after vmcore init.

> How to guarantee the function works if vmcore reading happens before
> the driver is loaded?
> 
> Also it is possible that kdump initramfs does not contains the driver
> module.
> 
> Am I missing something?
> 

Yes, the driver must be in the initramfs if it is to collect and add its
device dump to /proc/vmcore in the second kernel.

> > 
> > 2. vmcore module allocates the buffer with requested size. It adds
> > an elf note and invokes the device driver's registered callback
> > function.
> > 
> > 3. Device driver collects all hardware/firmware logs into the buffer
> > and returns control back to vmcore module.
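
The three steps above can be sketched as a small userspace model. This is
only an illustration under stated assumptions: the struct and function names
(dump_request, dump_note, model_add_device_dump, example_collect) are
hypothetical and are not the kernel API from the patches.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are the kernel API. */
struct dump_request {
	const char *name;			/* log name for the elf note */
	size_t size;				/* buffer size the driver asks for */
	int (*collect)(void *buf, size_t size);	/* driver callback */
};

struct dump_note {
	const char *name;
	size_t size;
	unsigned char *data;
};

/* Steps 2 and 3: allocate the requested buffer, invoke the driver's
 * callback to fill it, and keep the result as a note.
 * Returns 0 on success, -1 on failure.
 */
static int model_add_device_dump(const struct dump_request *req,
				 struct dump_note *note)
{
	unsigned char *buf;

	if (!req || !req->collect || !req->size)
		return -1;

	buf = calloc(1, req->size);
	if (!buf)
		return -1;

	if (req->collect(buf, req->size)) {
		free(buf);
		return -1;
	}

	note->name = req->name;
	note->size = req->size;
	note->data = buf;
	return 0;
}

/* Step 1 as seen from a driver: a callback that fills the buffer. */
static int example_collect(void *buf, size_t size)
{
	memset(buf, 0xab, size);
	return 0;
}
```

In this model the caller owns note->data after a successful call and has to
free() it.
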
> > 
> > The device specific hardware/firmware logs can be seen as elf notes:
> > 
> > # readelf -n /proc/vmcore
> > 
> > Displaying notes found at file offset 0x00001000 with length 0x04003288:
> >   Owner                 Data size	Description
> >   VMCOREDD_cxgb4_0000:02:00.4 0x02000fd8	Unknown note type: (0x00000700)
> >   VMCOREDD_cxgb4_0000:04:00.4 0x02000fd8	Unknown note type: (0x00000700)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   VMCOREINFO           0x0000074f	Unknown note type: (0x00000000)
> > 
> > Patch 1 adds API to vmcore module to allow drivers to register callback
> > to collect the device specific hardware/firmware logs.  The logs will
> > be added to /proc/vmcore as elf notes.
> > 
> > Patch 2 updates read and mmap logic to append device specific hardware/
> > firmware logs as elf notes.
> > 
> > Patch 3 shows a cxgb4 driver example using the API to collect
> > hardware/firmware logs in crash recovery kernel, before hardware is
> > initialized.
> > 
> > Thanks,
> > Rahul
> > 
> > RFC v1: https://lkml.org/lkml/2018/3/2/542
> > RFC v2: https://lkml.org/lkml/2018/3/16/326
> > 
[...]

Thanks,
Rahul

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Sowmini Varadhan @ 2018-04-18 12:31 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Samudrala, Sridhar, Network Development, Willem de Bruijn
In-Reply-To: <CAF=yD-KhNrcZBQizK+RtFq4Lx-ExntdLR69qz_2beRo8d7XOTA@mail.gmail.com>

I went through the patch set and the code looks fine: it extends the
existing infra for TCP/GSO to UDP.

One thing that was not clear to me about the API: shouldn't UDP_SEGMENT
just be automatically determined in the stack from the pmtu? What's
the motivation for the socket option for this? Also, AIUI this can be
either a per-socket or a per-packet option?

However, I share Sridhar's concerns about the very fundamental change
to UDP message boundary semantics here.  There is actually no such thing
as a "segment" in UDP, so in general this feature makes me a little
uneasy.  Well-behaved UDP applications should already be sending
MTU-sized datagrams. And the not-so-well-behaved ones are probably relying
on IP fragmentation/reassembly to take care of datagram boundary semantics
for them?

As Sridhar points out, the feature is not really "negotiated" - one side
unilaterally sets the option. If the receiver is a classic/POSIX UDP
implementation, it will have no way of knowing that message boundaries
have been re-adjusted at the sender.  
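
To illustrate the boundary concern, here is a small sketch. It assumes the
sender hands the stack one buffer of len bytes and the stack splits it into
datagrams of gso_size bytes, with a possibly shorter last one. The helper
names are made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: number of datagrams the receiver sees for one
 * send of len bytes with segment size gso_size. Returns 0 when either
 * argument is 0.
 */
static size_t udp_gso_datagram_count(size_t len, size_t gso_size)
{
	if (gso_size == 0 || len == 0)
		return 0;
	return (len + gso_size - 1) / gso_size;
}

/* Length of the last datagram, which may be shorter than gso_size. */
static size_t udp_gso_last_len(size_t len, size_t gso_size)
{
	size_t rem;

	if (gso_size == 0 || len == 0)
		return 0;
	rem = len % gso_size;
	return rem ? rem : gso_size;
}
```

With len = 4000 and gso_size = 1400, the application did one send, but the
receiver reads three datagrams (1400, 1400 and 1200 bytes) and cannot tell
that they belonged together.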

One thought to recover from this: use the infra being proposed in
  https://tools.ietf.org/html/draft-touch-tsvwg-udp-options-09
to include a new UDP TLV option that tracks datagram# (similar to IP ID)
to help the receiver reassemble the UDP datagram and pass it up with
the POSIX-conformant UDP message boundary. I realize that this is also
not a perfect solution: as you point out, there are risks from
packet re-ordering/drops, and you may well end up just reinventing IP
frag/re-assembly when you are done (with just the slight improvement
that each "fragment" has a full UDP header, so it has a better shot
at ECMP and RSS).

--Sowmini

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: phy: Enable C45 PHYs with vendor specific address space
From: Andrew Lunn @ 2018-04-18 12:27 UTC (permalink / raw)
  To: Vicenţiu Galanopulo
  Cc: Florian Fainelli, robh@kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, mark.rutland@arm.com,
	davem@davemloft.net, marcel@holtmann.org,
	devicetree@vger.kernel.org, Alexandru Marginean,
	Madalin-cristian Bucur
In-Reply-To: <AM0PR04MB411620E8DB3E7EF8E55C99B5EEB60@AM0PR04MB4116.eurprd04.prod.outlook.com>

On Wed, Apr 18, 2018 at 09:38:47AM +0000, Vicenţiu Galanopulo wrote:
> 
> 
> > > Having dev-addr stored in devices_addrs, in get_phy_c45_ids(), when
> > > probing the identifiers, dev-addr can be extracted from devices_addrs
> > > and probed if devices_addrs[current_identifier] is not 0.
> > 
> > I must clearly be missing something, but why are you introducing all these
> > conditionals instead of updating the existing code to be able to operate against
> > an arbitrary dev-addr value, and then just making sure the first thing you do is
> > fetch that property from Device Tree? There is no way someone is going to be
> > testing with your specific use case in the future (except yourselves) so unless you
> > make supporting an arbitrary "dev-addr" value become part of how the code
> > works, this is going to be breaking badly.
> >
> 
> Hi Florian,
> 
> My intention was to have this patch as "plugin" and modify the existing kernel API little to none.

Hi Vicenţiu

In Linux, kernel APIs are not sacred. If you need to change them, do
so.

We want a clear, well integrated solution, with minimal
duplication.

	Andrew

^ permalink raw reply

* Re: [PATCH v3 00/10] New network driver for Amiga X-Surf 100 (m68k)
From: Andrew Lunn @ 2018-04-18 12:19 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: netdev, Finn Thain, Geert Uytterhoeven, Florian Fainelli,
	Linux/m68k, Michael Karcher
In-Reply-To: <CAOmrzk+zUTmSzXWU9WoXYauBx2Z4qkAh+Y4d49faA8Tu5RRQnQ@mail.gmail.com>

On Wed, Apr 18, 2018 at 05:10:45PM +1200, Michael Schmitz wrote:
> All,
> 
> just noticed belatedly that the Makefile hunk of patch 9 does no
> longer apply cleanly in 4.17-rc1, sorry. My series was based on 4.16.
> I'll resend that one, OK?

Hi Michael

Your series should be based on DaveM's net-next tree:

git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git

Please also have "net-next" in the patch subject. See
Documentation/networking/netdev-FAQ.txt

	Andrew

^ permalink raw reply

* Re: [PATCH v3 1/9] net: phy: new Asix Electronics PHY driver
From: Andrew Lunn @ 2018-04-18 12:13 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: netdev, fthain, geert, f.fainelli, linux-m68k, Michael.Karcher
In-Reply-To: <1524025616-3722-2-git-send-email-schmitzmic@gmail.com>

> +
> +/**
> + * asix_soft_reset - software reset the PHY via BMCR_RESET bit
> + * @phydev: target phy_device struct
> + *
> + * Description: Perform a software PHY reset using the standard
> + * BMCR_RESET bit and poll for the reset bit to be cleared.
> + * Toggle BMCR_RESET bit off to accomodate broken PHY implementations
> + * such as used on the Individual Computers' X-Surf 100 Zorro card.
> + *
> + * Returns: 0 on success, < 0 on failure
> + */
> +static int asix_soft_reset(struct phy_device *phydev)
> +{
> +	int ret;
> +
> +	/* Asix PHY won't reset unless reset bit toggles */
> +	ret = phy_write(phydev, MII_BMCR, 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	phy_write(phydev, MII_BMCR, BMCR_RESET);
> +
> +	return phy_poll_reset(phydev);
> +}

Why not simply:

static int asix_soft_reset(struct phy_device *phydev)
{
	int ret;

	/* Asix PHY won't reset unless reset bit toggles */
	ret = phy_write(phydev, MII_BMCR, 0);
	if (ret < 0)
		return ret;

	return genphy_soft_reset(phydev);
}

	Andrew

^ permalink raw reply

* [RFC net-next PATCH 2/2] bpf: disallow XDP data_meta to overlap with xdp_frame area
From: Jesper Dangaard Brouer @ 2018-04-18 12:10 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev, Jesper Dangaard Brouer
In-Reply-To: <152405338404.30730.9846848505925123326.stgit@firesoul>

When combining xdp_adjust_head and xdp_adjust_meta, it is possible
to make data_meta overlap with the area used by xdp_frame.  Another
invocation of xdp_adjust_head can then clear that area, because it
clears the xdp_frame area.

The easiest solution I found was to simply not allow
xdp_buff->data_meta to overlap with the area used by xdp_frame.

Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/filter.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 15e9b5477360..e3623e741181 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2701,6 +2701,11 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
 		     data > xdp->data_end - ETH_HLEN))
 		return -EINVAL;
 
+	/* Disallow data_meta to use xdp_frame area */
+	if (metalen > 0 &&
+	    unlikely((data - metalen) < xdp_frame_end))
+		return -EINVAL;
+
 	/* Avoid info leak, when reusing area prev used by xdp_frame */
 	if (data < xdp_frame_end) {
 		unsigned long clearlen = xdp_frame_end - data;
@@ -2734,6 +2739,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
 
 BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 {
+	void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
 	void *meta = xdp->data_meta + offset;
 	unsigned long metalen = xdp->data - meta;
 
@@ -2742,6 +2748,11 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 	if (unlikely(meta < xdp->data_hard_start ||
 		     meta > xdp->data))
 		return -EINVAL;
+
+	/* Disallow data_meta to use xdp_frame area */
+	if (unlikely(meta < xdp_frame_end))
+		return -EINVAL;
+
 	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
 		     (metalen > 32)))
 		return -EACCES;
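
For illustration only, the new bpf_xdp_adjust_meta check can be modelled in
userspace with plain byte offsets from data_hard_start. The struct model_xdp
and the function model_adjust_meta() are hypothetical, frame_size stands in
for sizeof(struct xdp_frame), and any checks outside the context shown in
the diff are left out.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: all fields are byte offsets from data_hard_start. */
struct model_xdp {
	long data_meta;
	long data;
	long frame_size;	/* stands in for sizeof(struct xdp_frame) */
};

static int model_adjust_meta(struct model_xdp *xdp, int offset)
{
	long meta = xdp->data_meta + offset;
	long metalen = xdp->data - meta;

	/* Existing bounds check: stay between hard start and data */
	if (meta < 0 || meta > xdp->data)
		return -EINVAL;

	/* New check: data_meta must not reach into the xdp_frame area */
	if (meta < xdp->frame_size)
		return -EINVAL;

	/* Existing check: 4-byte aligned length of at most 32 bytes */
	if ((metalen & (long)(sizeof(uint32_t) - 1)) || metalen > 32)
		return -EACCES;

	xdp->data_meta = meta;
	return 0;
}
```

With a frame_size of 40 and data at offset 48, growing the metadata by 8
bytes is still accepted in this model, while growing it by a further 4 bytes
is rejected because it would start inside the xdp_frame area.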

^ permalink raw reply related

* [RFC net-next PATCH 1/2] bpf: avoid clear xdp_frame area again
From: Jesper Dangaard Brouer @ 2018-04-18 12:10 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev, Jesper Dangaard Brouer
In-Reply-To: <152405338404.30730.9846848505925123326.stgit@firesoul>

Avoid clearing the xdp_frame area if this was already done by previous
invocations of bpf_xdp_adjust_head.

The xdp_adjust_head helper can be called multiple times by the
bpf_prog.  When the packet header size is increased (with a negative
offset), the kernel must assume the bpf_prog stores valuable information
there, and must not clear this information.

When the header is extended into the xdp_frame area, the kernel clears
this area to avoid leaking any info.

The bug in the current implementation is that if the existing xdp->data
pointer has already been moved into the xdp_frame area, then memory is
cleared between the new data pointer and the xdp_frame end, which covers
an area that might contain information stored by the BPF prog (as the
current xdp->data lies between those pointers).

Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/filter.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index a374b8560bc4..15e9b5477360 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2705,6 +2705,13 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
 	if (data < xdp_frame_end) {
 		unsigned long clearlen = xdp_frame_end - data;

+		/* Handle if prev call adjusted xdp->data into xdp_frame area */
+		if (unlikely(xdp->data < xdp_frame_end)) {
+			if (data < xdp->data)
+				clearlen = xdp->data - data;
+			else
+				clearlen = 0;
+		}
 		memset(data, 0, clearlen);
 	}
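
As a sanity check of the logic above, the clear length can be modelled in
userspace with byte offsets from data_hard_start. This is only a sketch:
model_clearlen() and its parameters are hypothetical, and frame_end stands
in for the offset of xdp_frame_end.

```c
/* Hypothetical model of the clearlen computation in the hunk above.
 * cur_data is the current xdp->data, new_data is xdp->data + offset,
 * frame_end is the end of the xdp_frame area; all are byte offsets
 * from data_hard_start.
 */
static unsigned long model_clearlen(long cur_data, long new_data,
				    long frame_end)
{
	unsigned long clearlen = 0;

	if (new_data < frame_end) {
		clearlen = frame_end - new_data;

		/* A previous call already moved data into xdp_frame area */
		if (cur_data < frame_end) {
			if (new_data < cur_data)
				clearlen = cur_data - new_data;
			else
				clearlen = 0;
		}
	}
	return clearlen;
}
```

With frame_end = 40: moving data from 64 to 32 clears 8 bytes; a second move
from 32 to 24 clears only the 8 newly exposed bytes, where the code before
this patch would have cleared 16 and wiped what the program stored at 32..40.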

^ permalink raw reply related
