Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: BUG_ON triggered in skb_segment
From: Eric Dumazet @ 2018-03-14  1:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Yonghong Song, Steffen Klassert
  Cc: Daniel Borkmann, netdev, Martin Lau, saeedm, diptanu
In-Reply-To: <84dc788d-eedd-acc2-64dc-d51e909bb663@gmail.com>



On 03/13/2018 05:35 PM, Eric Dumazet wrote:
> 
> 
> On 03/13/2018 05:26 PM, Eric Dumazet wrote:
>>
>>
>> On 03/13/2018 05:04 PM, Alexei Starovoitov wrote:
>>> On 3/13/18 4:27 PM, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 03/13/2018 04:09 PM, Alexei Starovoitov wrote:
>>>>
>>>>> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
>>>>> It's not clear why it's not crashing there, but we cannot just
>>>>> reject changing proto in bpf programs now.
>>>>> We have to fix whatever needs to be fixed in skb_segment
>>>>> (if bug is there) or fix whatever necessary on mlx5 side.
>>>>> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
>>>>> through virtio would do, so if skb_segment() needs to do something
>>>>> special with skb the SKB_GSO_DODGY flag is already there.
>>>>
>>>> 'Fixing' skb_segment(), I did that a long time ago and Herbert Xu was
>>>> not happy with the fix and provided something else.
>>>
>>> any link to your old patches and discussion?
>>>
>>> I think since mlx4 can do tso on them and the packets came out
>>> correct on the wire, there is nothing fundamentally wrong with
>>> changing gso_size. Just tricky to teach skb_segment.
>>>
>>
>> The world is not mlx4 only. Some NIC will ask skb_segment() fallback 
>> segmentation for various reasons (like skb->len above a given limit 
>> like 16KB)
>>
>> Maybe https://www.spinics.net/lists/netdev/msg255549.html
> 
> 
> Herbert patch :
> 
> commit 9d8506cc2d7ea1f911c72c100193a3677f6668c3
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Thu Nov 21 11:10:04 2013 -0800
> 
>      gso: handle new frag_list of frags GRO packets
> 

I found my initial patch.

https://www.spinics.net/lists/netdev/msg255452.html

^ permalink raw reply

* Re: [PATCH] drivers: net: wireless: ath: ath9: dfs: remove VLA usage
From: Daniel Micay @ 2018-03-14  1:18 UTC (permalink / raw)
  To: David Laight
  Cc: Arend van Spriel, Andreas Christoforou, Kees Cook,
	Kernel Hardening, QCA ath9k Development, Kalle Valo,
	linux-wireless@vger.kernel.org, Netdev, kernel list
In-Reply-To: <423b529d48c54eba9f7ed51922814d46@AcuMS.aculab.com>

No, it's undefined behavior to write to a const variable. The `static`
and `const` on the variable both change the code generation in the
real world as permitted / encouraged by the standard. It's placed in
read-only memory. Trying to write to it will break. It's not
"implemented defined" to write to it, it's "undefined behavior" i.e.
it's considered incorrect. There a clear distinction between those in
the standard.

You're confusing having a real `const` for a variable with having it
applied to a pointer. It's well-defined to cast away const from a
pointer and write to what it points at if it's not actually const. If
it is const, that's broken.

There's nothing implementation defined about either case.

The C standard could have considered `static const` variables to work
as constant expressions just like the C++ standard. They borrowed it
from there but made it less useful than const in what became the C++
standard. They also used stricter rules for the permitted implicit
conversions of const pointers which made those much less usable, i.e.
converting `int **` to `const int *const *` wasn't permitted like C++.
I don't think the difference between C and C++ const pointer
conversions, it's a quirk of them being standardized on different
timelines and ending up with different versions of the same thing. On
the other hand, they might have left out having ever so slightly more
useful constant expressions on purpose since people can use #define.

^ permalink raw reply

* Re: [PATCH] drivers: net: wireless: ath: ath9: dfs: remove VLA usage
From: Daniel Micay @ 2018-03-14  1:22 UTC (permalink / raw)
  To: David Laight
  Cc: Arend van Spriel, Andreas Christoforou, Kees Cook,
	Kernel Hardening, QCA ath9k Development, Kalle Valo,
	linux-wireless@vger.kernel.org, Netdev, kernel list
In-Reply-To: <CA+DvKQK9jRe5YOk_K4DzBYkh2VUys56XbaYZRGs0fn2WJNT+nQ@mail.gmail.com>

> I don't think the difference between C and C++ const pointer
> conversions

I mean I don't think the difference between them was intended.

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH 12/15] ice: Add stats and ethtool support
From: Toshiaki Makita @ 2018-03-14  1:30 UTC (permalink / raw)
  To: Jesse Brandeburg, Eric Dumazet, davem
  Cc: Venkataramanan, Anirudh, kubakici@wp.pl, netdev@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org
In-Reply-To: <20180313141414.00007d11@intel.com>

On 2018/03/14 6:14, Jesse Brandeburg wrote:
> which iproute2 tool shows the /proc/net/dev stats?

ip -s -s link show

-- 
Toshiaki Makita

^ permalink raw reply

* Re: [PATCH] drivers: net: wireless: ath: ath9: dfs: remove VLA usage
From: Miguel Ojeda @ 2018-03-14  1:38 UTC (permalink / raw)
  To: Kees Cook
  Cc: Arend van Spriel, Andreas Christoforou, Kernel Hardening,
	QCA ath9k Development, Kalle Valo, linux-wireless,
	Network Development, LKML
In-Reply-To: <CAGXu5jKDA=L5E6Cu_oRe-M=ouuhrccCZHyM+z6_UMAUqe9xnRQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Sun, Mar 11, 2018 at 12:12 AM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
>
> The problem is that it's not a "constant expression", so the compiler
> frontend still yells about it under -Wvla. I would characterize this
> mainly as a fix for "accidental VLA" or "misdetected VLA" or something
> like that. AIUI, there really isn't a functional change here.

C++11's constexpr variables would be nice to have.

Cheers,
Miguel

^ permalink raw reply

* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
From: Jakub Kicinski @ 2018-03-14  1:50 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jiri Pirko, Rabie Loulou, John Hurley, Simon Horman,
	Linux Netdev List, mlxsw, Yevgeny Kliteynik, Paul Blakey
In-Reply-To: <CAJ3xEMhOHOXCUG-0d6SdFoRvYUdmYGBrAQjoOAnxSsucx86LzQ@mail.gmail.com>

On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
> > Starting with type 2, in our current NIC HW APIs we have to duplicate
> > these rules
> > into two rules set to HW:
> >
> > 2.1 VF rep --> uplink 0
> > 2.2 VF rep --> uplink 1
> >
> > and we do that in the driver (add/del two HW rules, combine the stat
> > results, etc)

Ack, I think our HW API also will require us to duplicate the rules
today, but IMHO we should implement some common helper module in the
core that would work for any block sharing rather than bond specific
solution.

> > 3. ingress rule on VF rep port with shared tunnel device being the
> > egress (encap)
> > and where the routing of the underlay (tunnel) goes through LAG.
> >
> > in our case, this is like 2.1/2.2 above, offload two rules, combine stats
> >
> > 4. ingress rule shared tunnel device being the ingress and VF rep port
> > being the egress (decap)
> >
> > this uses the egdev facility to be offloaded into the our driver, and
> > then in the driver
> > we will treat it like type 1, two rules need to be installed into HW,
> > but now, we can't delegate them
> > from the vxlan device b/c it has no direct connection with the bond.

Let's get rid of the egdev crutch first then :]

^ permalink raw reply

* Re: [RESEND] rsi: Remove stack VLA usage
From: Tobin C. Harding @ 2018-03-14  2:11 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Kalle Valo, kernel-hardening, Linux Kernel Mailing List, netdev,
	open list:TI WILINK WIRELES..., Tycho Andersen, Kees Cook
In-Reply-To: <CAHp75VeM1zNyNtHvsn0=rg_xibcitVHREY73cAk=KCSK3Ti3vw@mail.gmail.com>

On Tue, Mar 13, 2018 at 11:00:47PM +0200, Andy Shevchenko wrote:
> On Tue, Mar 13, 2018 at 10:17 PM, tcharding <me@tobin.cc> wrote:
> > On Mon, Mar 12, 2018 at 09:46:06AM +0000, Kalle Valo wrote:
> >> tcharding <me@tobin.cc> wrote:
> 
> I'm pretty much sure it depends on the original email headers, like
> above ^^^ — no name.
> Perhaps git config on your side should be done.

Thanks for the suggestion Andy but the 'tcharding' as the name was
munged by either Kalle or patchwork.  I'm guessing patchwork.


thanks,
Tobin.

^ permalink raw reply

* [PATCH net] net/sched: act_simple: don't leak 'index' in the error path
From: Davide Caratti @ 2018-03-14  2:13 UTC (permalink / raw)
  To: Cong Wang, Jamal Hadi Salim, Jiri Pirko, David S. Miller; +Cc: netdev

if the kernel fails duplicating 'sdata', creation of a new action fails
with -ENOMEM. However, subsequent attempts to install the same action
using the same value of 'index' will systematically fail with -ENOSPC,
and that value of 'index' will no more be usable by act_simple, until
rmmod / insmod of act_simple.ko is done:

 # tc actions add action simple sdata hello index 100
 # tc actions list action simple

        action order 0: Simple <hello>
         index 100 ref 1 bind 0
 # tc actions flush action simple
 # tc actions add action simple sdata hello index 100
RTNETLINK answers: Cannot allocate memory
We have an error talking to the kernel
 # tc actions flush action simple
 # tc actions add action simple sdata hello index 100
RTNETLINK answers: No space left on device
We have an error talking to the kernel
 # tc actions add action simple sdata hello index 100
RTNETLINK answers: No space left on device
We have an error talking to the kernel
 ...

Similarly to what other TC actions do, we can duplicate 'sdata' before
calling tcf_idr_create(), and avoid calling tcf_idr_cleanup(), so that
leaks of 'index' don't occur anymore.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---

Notes:
    Hello,
    
    I observed this faulty behavior on act_bpf, in case of negative return
    value of tcf_bpf_init_from_ops() and tcf_bpf_init_from_efd(). Then I
    tried on act_simple, that parses its parameter in a similar way, and
    reproduced the same leakage of 'index'.
    
    So, unless you think we should fix this issue in a different way (i.e.
    changing tcf_idr_cleanup() ), I will post a similar fix for act_bpf.
    
    Any feedback is welcome, thank you in advance!

 net/sched/act_simple.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
index 425eac11f6da..b80d4a69a848 100644
--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -53,15 +53,6 @@ static void tcf_simp_release(struct tc_action *a)
 	kfree(d->tcfd_defdata);
 }
 
-static int alloc_defdata(struct tcf_defact *d, char *defdata)
-{
-	d->tcfd_defdata = kzalloc(SIMP_MAX_DATA, GFP_KERNEL);
-	if (unlikely(!d->tcfd_defdata))
-		return -ENOMEM;
-	strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
-	return 0;
-}
-
 static void reset_policy(struct tcf_defact *d, char *defdata,
 			 struct tc_defact *p)
 {
@@ -110,20 +101,18 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla,
 		return -EINVAL;
 	}
 
-	defdata = nla_data(tb[TCA_DEF_DATA]);
-
 	if (!exists) {
+		defdata = nla_strdup(tb[TCA_DEF_DATA], GFP_KERNEL);
+		if (unlikely(!defdata))
+			return -ENOMEM;
+
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_simp_ops, bind, false);
 		if (ret)
 			return ret;
 
 		d = to_defact(*a);
-		ret = alloc_defdata(d, defdata);
-		if (ret < 0) {
-			tcf_idr_cleanup(*a, est);
-			return ret;
-		}
+		d->tcfd_defdata = defdata;
 		d->tcf_action = parm->action;
 		ret = ACT_P_CREATED;
 	} else {
@@ -133,6 +122,7 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla,
 		if (!ovr)
 			return -EEXIST;
 
+		defdata = nla_data(tb[TCA_DEF_DATA]);
 		reset_policy(d, defdata, parm);
 	}
 
-- 
2.14.3

^ permalink raw reply related

* Re: [RESEND] rsi: Remove stack VLA usage
From: Kees Cook @ 2018-03-14  2:53 UTC (permalink / raw)
  To: Tobin C. Harding
  Cc: Andy Shevchenko, Kalle Valo, Kernel Hardening,
	Linux Kernel Mailing List, netdev, open list:TI WILINK WIRELES...,
	Tycho Andersen
In-Reply-To: <20180314021151.GA11666@eros>

On Tue, Mar 13, 2018 at 7:11 PM, Tobin C. Harding <me@tobin.cc> wrote:
> On Tue, Mar 13, 2018 at 11:00:47PM +0200, Andy Shevchenko wrote:
>> On Tue, Mar 13, 2018 at 10:17 PM, tcharding <me@tobin.cc> wrote:
>> > On Mon, Mar 12, 2018 at 09:46:06AM +0000, Kalle Valo wrote:
>> >> tcharding <me@tobin.cc> wrote:
>>
>> I'm pretty much sure it depends on the original email headers, like
>> above ^^^ — no name.
>> Perhaps git config on your side should be done.
>
> Thanks for the suggestion Andy but the 'tcharding' as the name was
> munged by either Kalle or patchwork.  I'm guessing patchwork.

Something you're sending from is using "tcharding" (see the email Andy
quotes). I see the headers as:

Date: Wed, 14 Mar 2018 07:17:57 +1100
From: tcharding <me@tobin.cc>
...
Message-ID: <20180313201757.GK8631@eros>
X-Mailer: Mutt 1.5.24 (2015-08-30)
User-Agent: Mutt/1.5.24 (2015-08-30)

Your most recently email shows "Tobin C. Harding" though, and also
sent with Mutt...

Do you have multiple Mutt configurations? Is something lacking a
"From" header insertion and your MTA is filling it in for you from
your username?

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* Re: [PATCH] pktgen: use dynamic allocation for debug print buffer
From: Wang Jian @ 2018-03-14  3:02 UTC (permalink / raw)
  To: David Miller
  Cc: arnd, gustavo, dima, johannes.berg, edumazet, netdev,
	linux-kernel
In-Reply-To: <20180313.202552.1990425972566331246.davem@davemloft.net>

>> +  kfree(buf);
free tb? buf is an array.

On Wed, Mar 14, 2018 at 8:25 AM, David Miller <davem@davemloft.net> wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Tue, 13 Mar 2018 21:58:39 +0100
>
>> After the removal of the VLA, we get a harmless warning about a large
>> stack frame:
>>
>> net/core/pktgen.c: In function 'pktgen_if_write':
>> net/core/pktgen.c:1710:1: error: the frame size of 1076 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
>>
>> The function was previously shown to be safe despite hitting
>> the 1024 bye warning level. To get rid of the annoyging warning,
>> while keeping it readable, this changes it to use strndup_user().
>>
>> Obviously this is not a fast path, so the kmalloc() overhead
>> can be disregarded.
>>
>> Fixes: 35951393bbff ("pktgen: Remove VLA usage")
>> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> Applied, thanks.

^ permalink raw reply

* RE: [Intel-wired-lan] [next-queue PATCH v4 6/8] igb: Add MAC address support for ethtool nftuple filters
From: Brown, Aaron F @ 2018-03-14  3:04 UTC (permalink / raw)
  To: Gomes, Vinicius, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, Sanchez-Palencia, Jesus
In-Reply-To: <20180308003713.29195-7-vinicius.gomes@intel.com>

> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Wednesday, March 7, 2018 4:37 PM
> To: intel-wired-lan@lists.osuosl.org
> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus <jesus.sanchez-
> palencia@intel.com>
> Subject: [Intel-wired-lan] [next-queue PATCH v4 6/8] igb: Add MAC address
> support for ethtool nftuple filters
> 
> This adds the capability of configuring the queue steering of arriving
> packets based on their source and destination MAC addresses.
> 
> In practical terms this adds support for the following use cases,
> characterized by these examples:
> 
> $ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
> (this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
> to the RX queue 0)
> 
> $ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3
> (this will direct packets with source address "44:44:44:44:44:44" to
> the RX queue 3)

This seems to work fine on i210, and the patch series allows me to set the rx filters on the i350, i354 and i211, but it is not directing the packets to the queue I request.

With the exception of i210 the rx_queues number does not seem to be effected by setting the filter.  In the case of i211 the rx packets stay on rx_queue 0 with or without an ether src or dst filter.  The first example one seems to work at first since it's directing to queue 0, but changing the filter to "action 1" does not change the behavior.  With the i350 and i354 ports the packets are spread across the rx_queues with or without the filter set.

> 
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> ---
>  drivers/net/ethernet/intel/igb/igb_ethtool.c | 35
> ++++++++++++++++++++++++----
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c
> b/drivers/net/ethernet/intel/igb/igb_ethtool.c
> index 94fc9a4bed8b..3f98299d4cd0 100644
> --- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
> +++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
> @@ -2494,6 +2494,23 @@ static int igb_get_ethtool_nfc_entry(struct
> igb_adapter *adapter,
>  			fsp->h_ext.vlan_tci = rule->filter.vlan_tci;
>  			fsp->m_ext.vlan_tci = htons(VLAN_PRIO_MASK);
>  		}
> +		if (rule->filter.match_flags &
> IGB_FILTER_FLAG_DST_MAC_ADDR) {
> +			ether_addr_copy(fsp->h_u.ether_spec.h_dest,
> +					rule->filter.dst_addr);
> +			/* As we only support matching by the full
> +			 * mask, return the mask to userspace
> +			 */
> +			eth_broadcast_addr(fsp->m_u.ether_spec.h_dest);
> +		}
> +		if (rule->filter.match_flags &
> IGB_FILTER_FLAG_SRC_MAC_ADDR) {
> +			ether_addr_copy(fsp->h_u.ether_spec.h_source,
> +					rule->filter.src_addr);
> +			/* As we only support matching by the full
> +			 * mask, return the mask to userspace
> +			 */
> +			eth_broadcast_addr(fsp-
> >m_u.ether_spec.h_source);
> +		}
> +
>  		return 0;
>  	}
>  	return -EINVAL;
> @@ -2932,10 +2949,6 @@ static int igb_add_ethtool_nfc_entry(struct
> igb_adapter *adapter,
>  	if ((fsp->flow_type & ~FLOW_EXT) != ETHER_FLOW)
>  		return -EINVAL;
> 
> -	if (fsp->m_u.ether_spec.h_proto != ETHER_TYPE_FULL_MASK &&
> -	    fsp->m_ext.vlan_tci != htons(VLAN_PRIO_MASK))
> -		return -EINVAL;
> -
>  	input = kzalloc(sizeof(*input), GFP_KERNEL);
>  	if (!input)
>  		return -ENOMEM;
> @@ -2945,6 +2958,20 @@ static int igb_add_ethtool_nfc_entry(struct
> igb_adapter *adapter,
>  		input->filter.match_flags = IGB_FILTER_FLAG_ETHER_TYPE;
>  	}
> 
> +	/* Only support matching addresses by the full mask */
> +	if (is_broadcast_ether_addr(fsp->m_u.ether_spec.h_source)) {
> +		input->filter.match_flags |=
> IGB_FILTER_FLAG_SRC_MAC_ADDR;
> +		ether_addr_copy(input->filter.src_addr,
> +				fsp->h_u.ether_spec.h_source);
> +	}
> +
> +	/* Only support matching addresses by the full mask */
> +	if (is_broadcast_ether_addr(fsp->m_u.ether_spec.h_dest)) {
> +		input->filter.match_flags |=
> IGB_FILTER_FLAG_DST_MAC_ADDR;
> +		ether_addr_copy(input->filter.dst_addr,
> +				fsp->h_u.ether_spec.h_dest);
> +	}
> +
>  	if ((fsp->flow_type & FLOW_EXT) && fsp->m_ext.vlan_tci) {
>  		if (fsp->m_ext.vlan_tci != htons(VLAN_PRIO_MASK)) {
>  			err = -EINVAL;
> --
> 2.16.2
> 
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply

* RE: [Intel-wired-lan] [next-queue PATCH v4 2/8] igb: Fix queue selection on MAC filters on i210 and i211
From: Brown, Aaron F @ 2018-03-14  3:07 UTC (permalink / raw)
  To: Gomes, Vinicius, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, Sanchez-Palencia, Jesus
In-Reply-To: <20180308003713.29195-3-vinicius.gomes@intel.com>

> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Wednesday, March 7, 2018 4:37 PM
> To: intel-wired-lan@lists.osuosl.org
> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus <jesus.sanchez-
> palencia@intel.com>
> Subject: [Intel-wired-lan] [next-queue PATCH v4 2/8] igb: Fix queue selection
> on MAC filters on i210 and i211
> 
> On the RAH registers there are semantic differences on the meaning of
> the "queue" parameter for traffic steering depending on the controller
> model: there is the 82575 meaning, which "queue" means a RX Hardware
> Queue, and the i350 meaning, where it is a reception pool.
> 
> The previous behaviour was having no effect for i210 and i211 based
> controllers because the QSEL bit of the RAH register wasn't being set.
> 
> This patch separates the condition in discrete cases, so the different
> handling is clearer.
> 
> Fixes: 83c21335c876 ("igb: improve MAC filter handling")
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> ---
>  drivers/net/ethernet/intel/igb/e1000_defines.h |  1 +
>  drivers/net/ethernet/intel/igb/igb_main.c      | 15 +++++++++++----
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h
> b/drivers/net/ethernet/intel/igb/e1000_defines.h
> index 83cabff1e0ab..573bf177fd08 100644
> --- a/drivers/net/ethernet/intel/igb/e1000_defines.h
> +++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
> @@ -490,6 +490,7 @@
>   * manageability enabled, allowing us room for 15 multicast addresses.
>   */
>  #define E1000_RAH_AV  0x80000000        /* Receive descriptor valid */
> +#define E1000_RAH_QSEL_ENABLE 0x10000000
>  #define E1000_RAL_MAC_ADDR_LEN 4
>  #define E1000_RAH_MAC_ADDR_LEN 2
>  #define E1000_RAH_POOL_MASK 0x03FC0000
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index 715bb32e6901..eabedc6b6518 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -8747,12 +8747,19 @@ static void igb_rar_set_index(struct igb_adapter
> *adapter, u32 index)
>  		if (is_valid_ether_addr(addr))
>  			rar_high |= E1000_RAH_AV;
> 
> -		if (hw->mac.type == e1000_82575)
> +		switch (hw->mac.type) {
> +		case e1000_82575:
> +		case e1000_i210:
> +		case e1000_i211:
> +			rar_high |= E1000_RAH_QSEL_ENABLE;
>  			rar_high |= E1000_RAH_POOL_1 *
> -				    adapter->mac_table[index].queue;
> -		else
> +				      adapter->mac_table[index].queue;
> +			break;
> +		default:
>  			rar_high |= E1000_RAH_POOL_1 <<
> -				    adapter->mac_table[index].queue;
> +				adapter->mac_table[index].queue;

Small nit.  Shouldn't this line be indented more to be a few spaces past the "|=" operator as above?

> +			break;
> +		}
>  	}
> 
>  	wr32(E1000_RAL(index), rar_low);
> --
> 2.16.2
> 
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply

* [PATCH 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-14  3:19 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 2 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e554aa6cf..7028516 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1375,7 +1375,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 	 * such as IA-64).
 	 */
 	wmb();
-	writel(val, rx_ring->tail);
+	writel_relaxed(val, rx_ring->tail);
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 357d605..2d323fc 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -667,7 +667,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 	 * such as IA-64).
 	 */
 	wmb();
-	writel(val, rx_ring->tail);
+	writel_relaxed(val, rx_ring->tail);
 }
 
 /**
-- 
2.7.4

^ permalink raw reply related

* [PATCH 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-14  3:20 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-kernel, Sinan Kaya, intel-wired-lan,
	Jeff Kirsher, linux-arm-kernel

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 2 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e554aa6cf..7028516 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1375,7 +1375,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 	 * such as IA-64).
 	 */
 	wmb();
-	writel(val, rx_ring->tail);
+	writel_relaxed(val, rx_ring->tail);
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 357d605..2d323fc 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -667,7 +667,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 	 * such as IA-64).
 	 */
 	wmb();
-	writel(val, rx_ring->tail);
+	writel_relaxed(val, rx_ring->tail);
 }
 
 /**
-- 
2.7.4

^ permalink raw reply related

* [PATCH 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-14  3:20 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel
In-Reply-To: <1520997629-17361-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0da5aa2..35ca1d8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1692,7 +1692,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
@@ -2453,7 +2453,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		 * know there are new descriptors to fetch.
 		 */
 		wmb();
-		writel(ring->next_to_use, ring->tail);
+		writel_relaxed(ring->next_to_use, ring->tail);
 
 		xdp_do_flush_map();
 	}
@@ -10014,7 +10014,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	 * are new descriptors to fetch.
 	 */
 	wmb();
-	writel(ring->next_to_use, ring->tail);
+	writel_relaxed(ring->next_to_use, ring->tail);
 
 	return;
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH 3/7] RDMA/qedr: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-14  3:20 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Michal Kalderon,
	Ariel Elior, Doug Ledford, Jason Gunthorpe, linux-rdma,
	linux-kernel
In-Reply-To: <1520997629-17361-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/qedr/verbs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 53f00db..ccd55f4 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -1870,7 +1870,7 @@ static int qedr_update_qp_state(struct qedr_dev *dev,
 
 			if (rdma_protocol_roce(&dev->ibdev, 1)) {
 				wmb();
-				writel(qp->rq.db_data.raw, qp->rq.db);
+				writel_relaxed(qp->rq.db_data.raw, qp->rq.db);
 				/* Make sure write takes effect */
 				mmiowb();
 			}
@@ -3247,7 +3247,7 @@ int qedr_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 	 * redundant doorbell.
 	 */
 	wmb();
-	writel(qp->sq.db_data.raw, qp->sq.db);
+	writel_relaxed(qp->sq.db_data.raw, qp->sq.db);
 
 	/* Make sure write sticks */
 	mmiowb();
-- 
2.7.4

^ permalink raw reply related

* [PATCH 4/7] igbvf: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-14  3:20 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel
In-Reply-To: <1520997629-17361-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index 4214c15..fe3441b 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -251,7 +251,7 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring *rx_ring,
 		 * such as IA-64).
 		*/
 		wmb();
-		writel(i, adapter->hw.hw_addr + rx_ring->tail);
+		writel_relaxed(i, adapter->hw.hw_addr + rx_ring->tail);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH 5/7] igb: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-14  3:20 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel
In-Reply-To: <1520997629-17361-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae7..ba8ccb5 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8072,7 +8072,7 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 cleaned_count)
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH 6/7] e1000: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-14  3:20 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel
In-Reply-To: <1520997629-17361-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 3dd4aeb..e0e583a 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4573,7 +4573,7 @@ e1000_alloc_jumbo_rx_buffers(struct e1000_adapter *adapter,
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, adapter->hw.hw_addr + rx_ring->rdt);
+		writel_relaxed(i, adapter->hw.hw_addr + rx_ring->rdt);
 	}
 }
 
@@ -4688,7 +4688,7 @@ static void e1000_alloc_rx_buffers(struct e1000_adapter *adapter,
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, hw->hw_addr + rx_ring->rdt);
+		writel_relaxed(i, hw->hw_addr + rx_ring->rdt);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-14  3:20 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel
In-Reply-To: <1520997629-17361-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      | 5 ++++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 7 +++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index f695242..64d0e0b 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -244,9 +244,12 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring *ring)
 	return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
+/* Assumes caller has executed a write barrier to order memory and device
+ * requests.
+ */
 static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
 {
-	writel(value, ring->tail);
+	writel_relaxed(value, ring->tail);
 }
 
 #define IXGBEVF_RX_DESC(R, i)	\
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 9b3d43d..0ba7f59 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3643,6 +3643,13 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 
 	tx_ring->next_to_use = i;
 
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.  (Only
+	 * applicable for weak-ordered memory model archs,
+	 * such as IA-64).
+	 */
+	wmb();
+
 	/* notify HW of packet */
 	ixgbevf_write_tail(tx_ring, i);
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next] tuntap: XDP_TX can use native XDP
From: Jason Wang @ 2018-03-14  3:23 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

Now we have ndo_xdp_xmit, switch to use it instead of the slow generic
XDP TX routine. XDP_TX on TAP gets ~20% improvements from ~1.5Mpps to
~1.8Mpps on 2.60GHz Core(TM) i7-5600U.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 475088f..baeafa0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1613,7 +1613,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	unsigned int delta = 0;
 	char *buf;
 	size_t copied;
-	bool xdp_xmit = false;
 	int err, pad = TUN_RX_PAD;
 
 	rcu_read_lock();
@@ -1671,8 +1670,14 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 			preempt_enable();
 			return NULL;
 		case XDP_TX:
-			xdp_xmit = true;
-			/* fall through */
+			get_page(alloc_frag->page);
+			alloc_frag->offset += buflen;
+			if (tun_xdp_xmit(tun->dev, &xdp))
+				goto err_redirect;
+			tun_xdp_flush(tun->dev);
+			rcu_read_unlock();
+			preempt_enable();
+			return NULL;
 		case XDP_PASS:
 			delta = orig_data - xdp.data;
 			break;
@@ -1699,14 +1704,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	get_page(alloc_frag->page);
 	alloc_frag->offset += buflen;
 
-	if (xdp_xmit) {
-		skb->dev = tun->dev;
-		generic_xdp_tx(skb, xdp_prog);
-		rcu_read_unlock();
-		preempt_enable();
-		return NULL;
-	}
-
 	rcu_read_unlock();
 	preempt_enable();
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next] tuntap: XDP_TX can use native XDP
From: Michael S. Tsirkin @ 2018-03-14  3:37 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel
In-Reply-To: <1520997820-8289-1-git-send-email-jasowang@redhat.com>

On Wed, Mar 14, 2018 at 11:23:40AM +0800, Jason Wang wrote:
> Now we have ndo_xdp_xmit, switch to use it instead of the slow generic
> XDP TX routine. XDP_TX on TAP gets ~20% improvements from ~1.5Mpps to
> ~1.8Mpps on 2.60GHz Core(TM) i7-5600U.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  drivers/net/tun.c | 19 ++++++++-----------
>  1 file changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 475088f..baeafa0 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1613,7 +1613,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>  	unsigned int delta = 0;
>  	char *buf;
>  	size_t copied;
> -	bool xdp_xmit = false;
>  	int err, pad = TUN_RX_PAD;
>  
>  	rcu_read_lock();
> @@ -1671,8 +1670,14 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>  			preempt_enable();
>  			return NULL;
>  		case XDP_TX:
> -			xdp_xmit = true;
> -			/* fall through */
> +			get_page(alloc_frag->page);
> +			alloc_frag->offset += buflen;
> +			if (tun_xdp_xmit(tun->dev, &xdp))
> +				goto err_redirect;
> +			tun_xdp_flush(tun->dev);

Why do we have to flush here though?
It might be a good idea to document the reason in a code comment.

> +			rcu_read_unlock();
> +			preempt_enable();
> +			return NULL;
>  		case XDP_PASS:
>  			delta = orig_data - xdp.data;
>  			break;
> @@ -1699,14 +1704,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>  	get_page(alloc_frag->page);
>  	alloc_frag->offset += buflen;
>  
> -	if (xdp_xmit) {
> -		skb->dev = tun->dev;
> -		generic_xdp_tx(skb, xdp_prog);
> -		rcu_read_unlock();
> -		preempt_enable();
> -		return NULL;
> -	}
> -
>  	rcu_read_unlock();
>  	preempt_enable();
>  
> -- 
> 2.7.4

^ permalink raw reply

* [PATCH RFC bpf-next 2/6] selftests/bpf: Selftest for sys_bind hooks
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team
In-Reply-To: <20180314033934.3502167-1-ast@kernel.org>

From: Andrey Ignatov <rdna@fb.com>

Add selftest to work with bpf_sock_addr context from
`BPF_PROG_TYPE_CGROUP_INET4_BIND` and BPF_PROG_TYPE_CGROUP_INET6_BIND`
programs.

Try to bind(2) on IP:port and apply:
* loads to make sure context can be read correctly, including narrow
  loads (byte, half) for IP and full-size loads (word) for all fields;
* stores to those fields allowed by verifier.

All combination from IPv4/IPv6 and TCP/UDP are tested.

Test passes when expected data can be read from context in the
BPF-program, and after the call to bind(2) socket is bound to IP:port
pair that was written by BPF-program to the context.

Example:
  # ./test_sock_addr
  Attached bind4 program.
  Test case #1 (IPv4/TCP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
  Test case #2 (IPv4/UDP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
  Attached bind6 program.
  Test case #3 (IPv6/TCP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
  Test case #4 (IPv6/UDP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
  ### SUCCESS

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/include/uapi/linux/bpf.h               |  24 ++
 tools/testing/selftests/bpf/Makefile         |   3 +-
 tools/testing/selftests/bpf/test_sock_addr.c | 463 +++++++++++++++++++++++++++
 3 files changed, 489 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index db6bdc375126..a6af06bb5efb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -133,6 +133,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SOCK_OPS,
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
+	BPF_PROG_TYPE_CGROUP_INET4_BIND,
+	BPF_PROG_TYPE_CGROUP_INET6_BIND,
 };
 
 enum bpf_attach_type {
@@ -143,6 +145,8 @@ enum bpf_attach_type {
 	BPF_SK_SKB_STREAM_PARSER,
 	BPF_SK_SKB_STREAM_VERDICT,
 	BPF_CGROUP_DEVICE,
+	BPF_CGROUP_INET4_BIND,
+	BPF_CGROUP_INET6_BIND,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -952,6 +956,26 @@ struct bpf_map_info {
 	__u64 netns_ino;
 } __attribute__((aligned(8)));
 
+/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
+ * by user and intended to be used by socket (e.g. to bind to, depends on
+ * attach attach type).
+ */
+struct bpf_sock_addr {
+	__u32 user_family;	/* Allows 4-byte read, but no write. */
+	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_port;	/* Allows 4-byte read and write.
+				 * Stored in network byte order
+				 */
+	__u32 family;		/* Allows 4-byte read, but no write */
+	__u32 type;		/* Allows 4-byte read, but no write */
+	__u32 protocol;		/* Allows 4-byte read, but no write */
+};
+
 /* User bpf_sock_ops struct to access socket values and specify request ops
  * and their replies.
  * Some of this fields are in network (bigendian) byte order and may need
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 8567a858b789..f319b67fd0f6 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -15,7 +15,7 @@ LDLIBS += -lcap -lelf -lrt -lpthread
 
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
-	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user
+	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user test_sock_addr
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
@@ -42,6 +42,7 @@ $(TEST_GEN_PROGS): $(BPFOBJ)
 $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/libbpf.a
 
 $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
+$(OUTPUT)/test_sock_addr: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c
new file mode 100644
index 000000000000..18ea250484dc
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -0,0 +1,463 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <arpa/inet.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+
+#include <linux/filter.h>
+
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+
+#define CG_PATH	"/foo"
+
+#define SERV4_IP		"192.168.1.254"
+#define SERV4_REWRITE_IP	"127.0.0.1"
+#define SERV4_PORT		4040
+#define SERV4_REWRITE_PORT	4444
+
+#define SERV6_IP		"face:b00c:1234:5678::abcd"
+#define SERV6_REWRITE_IP	"::1"
+#define SERV6_PORT		6060
+#define SERV6_REWRITE_PORT	6666
+
+#define INET_NTOP_BUF	40
+
+typedef int (*load_fn)(void);
+typedef int (*info_fn)(int, struct sockaddr *, socklen_t *);
+
+struct program {
+	enum bpf_attach_type type;
+	load_fn	loadfn;
+	int fd;
+	const char *name;
+};
+
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+
+static int mk_sockaddr(int domain, const char *ip, unsigned short port,
+		       struct sockaddr *addr, socklen_t addr_len)
+{
+	struct sockaddr_in6 *addr6;
+	struct sockaddr_in *addr4;
+
+	if (domain != AF_INET && domain != AF_INET6) {
+		log_err("Unsupported address family");
+		return -1;
+	}
+
+	memset(addr, 0, addr_len);
+
+	if (domain == AF_INET) {
+		if (addr_len < sizeof(struct sockaddr_in))
+			return -1;
+		addr4 = (struct sockaddr_in *)addr;
+		addr4->sin_family = domain;
+		addr4->sin_port = htons(port);
+		if (inet_pton(domain, ip, (void *)&addr4->sin_addr) != 1) {
+			log_err("Invalid IPv4: %s", ip);
+			return -1;
+		}
+	} else if (domain == AF_INET6) {
+		if (addr_len < sizeof(struct sockaddr_in6))
+			return -1;
+		addr6 = (struct sockaddr_in6 *)addr;
+		addr6->sin6_family = domain;
+		addr6->sin6_port = htons(port);
+		if (inet_pton(domain, ip, (void *)&addr6->sin6_addr) != 1) {
+			log_err("Invalid IPv6: %s", ip);
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+static int load_insns(enum bpf_prog_type type, const struct bpf_insn *insns,
+		      size_t insns_cnt, const char *comment)
+{
+	int ret;
+
+	ret = bpf_load_program(type, insns, insns_cnt, "GPL", 0, bpf_log_buf,
+			       BPF_LOG_BUF_SIZE);
+	if (ret < 0) {
+		log_err(">>> Loading %s program error.\n"
+			">>> Output from verifier:\n%s\n-------\n",
+			comment, bpf_log_buf);
+	}
+
+	return ret;
+}
+
+/* [1] These testing programs try to read different context fields, including
+ * narrow loads of different sizes from user_ip4 and user_ip6, and write to
+ * those allowed to be overridden.
+ *
+ * [2] BPF_LD_IMM64 & BPF_JMP_REG are used below whenever there is a need to
+ * compare a register with unsigned 32bit integer. BPF_JMP_IMM can't be used
+ * in such cases since it accepts only _signed_ 32bit integer as IMM
+ * argument. Also note that BPF_LD_IMM64 contains 2 instructions what matters
+ * to count jumps properly.
+ */
+
+static int bind4_prog_load(void)
+{
+	union {
+		uint8_t u4_addr8[4];
+		uint16_t u4_addr16[2];
+		uint32_t u4_addr32;
+	} ip4;
+	struct sockaddr_in addr4_rw;
+
+	if (inet_pton(AF_INET, SERV4_IP, (void *)&ip4) != 1) {
+		log_err("Invalid IPv4: %s", SERV4_IP);
+		return -1;
+	}
+
+	if (mk_sockaddr(AF_INET, SERV4_REWRITE_IP, SERV4_REWRITE_PORT,
+			(struct sockaddr *)&addr4_rw, sizeof(addr4_rw)) == -1)
+		return -1;
+
+	/* See [1]. */
+	struct bpf_insn insns[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+
+		/* if (sk.family == AF_INET && */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, family)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, AF_INET, 16),
+
+		/*     (sk.type == SOCK_DGRAM || sk.type == SOCK_STREAM) && */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, type)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, SOCK_DGRAM, 1),
+		BPF_JMP_A(1),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, SOCK_STREAM, 12),
+
+		/*     1st_byte_of_user_ip4 == expected && */
+		BPF_LDX_MEM(BPF_B, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip4)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, ip4.u4_addr8[0], 10),
+
+		/*     1st_half_of_user_ip4 == expected && */
+		BPF_LDX_MEM(BPF_H, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip4)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, ip4.u4_addr16[0], 8),
+
+		/*     whole_user_ip4 == expected) { */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip4)),
+		BPF_LD_IMM64(BPF_REG_8, ip4.u4_addr32), /* See [2]. */
+		BPF_JMP_REG(BPF_JNE, BPF_REG_7, BPF_REG_8, 4),
+
+		/*      user_ip4 = addr4_rw.sin_addr */
+		BPF_MOV32_IMM(BPF_REG_7, addr4_rw.sin_addr.s_addr),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, user_ip4)),
+
+		/*      user_port = addr4_rw.sin_port */
+		BPF_MOV32_IMM(BPF_REG_7, addr4_rw.sin_port),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, user_port)),
+		/* } */
+
+		/* return 1 */
+		BPF_MOV64_IMM(BPF_REG_0, 1),
+		BPF_EXIT_INSN(),
+	};
+
+	return load_insns(BPF_PROG_TYPE_CGROUP_INET4_BIND, insns,
+			  sizeof(insns) / sizeof(struct bpf_insn),
+			  "bind() for AF_INET");
+}
+
+static int bind6_prog_load(void)
+{
+	struct sockaddr_in6 addr6_rw;
+	struct in6_addr ip6;
+
+	if (inet_pton(AF_INET6, SERV6_IP, (void *)&ip6) != 1) {
+		log_err("Invalid IPv6: %s", SERV6_IP);
+		return -1;
+	}
+
+	if (mk_sockaddr(AF_INET6, SERV6_REWRITE_IP, SERV6_REWRITE_PORT,
+			(struct sockaddr *)&addr6_rw, sizeof(addr6_rw)) == -1)
+		return -1;
+
+	/* See [1]. */
+	struct bpf_insn insns[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+
+		/* if (sk.family == AF_INET6 && */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, family)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, AF_INET6, 18),
+
+		/*            5th_byte_of_user_ip6 == expected && */
+		BPF_LDX_MEM(BPF_B, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip6[1])),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, ip6.s6_addr[4], 16),
+
+		/*            3rd_half_of_user_ip6 == expected && */
+		BPF_LDX_MEM(BPF_H, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip6[1])),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, ip6.s6_addr16[2], 14),
+
+		/*            last_word_of_user_ip6 == expected) { */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip6[3])),
+		BPF_LD_IMM64(BPF_REG_8, ip6.s6_addr32[3]),  /* See [2]. */
+		BPF_JMP_REG(BPF_JNE, BPF_REG_7, BPF_REG_8, 10),
+
+
+#define STORE_IPV6_WORD(N)						       \
+		BPF_MOV32_IMM(BPF_REG_7, addr6_rw.sin6_addr.s6_addr32[N]),     \
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,		       \
+			    offsetof(struct bpf_sock_addr, user_ip6[N]))
+
+		/*      user_ip6 = addr6_rw.sin6_addr */
+		STORE_IPV6_WORD(0),
+		STORE_IPV6_WORD(1),
+		STORE_IPV6_WORD(2),
+		STORE_IPV6_WORD(3),
+
+		/*      user_port = addr6_rw.sin6_port */
+		BPF_MOV32_IMM(BPF_REG_7, addr6_rw.sin6_port),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, user_port)),
+
+		/* } */
+
+		/* return 1 */
+		BPF_MOV64_IMM(BPF_REG_0, 1),
+		BPF_EXIT_INSN(),
+	};
+
+	return load_insns(BPF_PROG_TYPE_CGROUP_INET6_BIND, insns,
+			  sizeof(insns) / sizeof(struct bpf_insn),
+			  "bind() for AF_INET6");
+}
+
+static void print_ip_port(int sockfd, info_fn fn, const char *fmt)
+{
+	char addr_buf[INET_NTOP_BUF];
+	struct sockaddr_storage addr;
+	struct sockaddr_in6 *addr6;
+	struct sockaddr_in *addr4;
+	socklen_t addr_len;
+	unsigned short port;
+	void *nip;
+
+	addr_len = sizeof(struct sockaddr_storage);
+	memset(&addr, 0, addr_len);
+
+	if (fn(sockfd, (struct sockaddr *)&addr, (socklen_t *)&addr_len) == 0) {
+		if (addr.ss_family == AF_INET) {
+			addr4 = (struct sockaddr_in *)&addr;
+			nip = (void *)&addr4->sin_addr;
+			port = ntohs(addr4->sin_port);
+		} else if (addr.ss_family == AF_INET6) {
+			addr6 = (struct sockaddr_in6 *)&addr;
+			nip = (void *)&addr6->sin6_addr;
+			port = ntohs(addr6->sin6_port);
+		} else {
+			return;
+		}
+		const char *addr_str =
+			inet_ntop(addr.ss_family, nip, addr_buf, INET_NTOP_BUF);
+		printf(fmt, addr_str ? addr_str : "??", port);
+	}
+}
+
+static void print_local_ip_port(int sockfd, const char *fmt)
+{
+	print_ip_port(sockfd, getsockname, fmt);
+}
+
+static int start_server(int type, const struct sockaddr_storage *addr,
+			socklen_t addr_len)
+{
+
+	int fd;
+
+	fd = socket(addr->ss_family, type, 0);
+	if (fd == -1) {
+		log_err("Failed to create server socket");
+		goto out;
+	}
+
+	if (bind(fd, (const struct sockaddr *)addr, addr_len) == -1) {
+		log_err("Failed to bind server socket");
+		goto close_out;
+	}
+
+	if (type == SOCK_STREAM) {
+		if (listen(fd, 128) == -1) {
+			log_err("Failed to listen on server socket");
+			goto close_out;
+		}
+	}
+
+	print_local_ip_port(fd, "\t   Actual: bind(%s, %d)\n");
+
+	goto out;
+close_out:
+	close(fd);
+	fd = -1;
+out:
+	return fd;
+}
+
+static void print_test_case_num(int domain, int type)
+{
+	static int test_num;
+
+	printf("Test case #%d (%s/%s):\n", ++test_num,
+	       (domain == AF_INET ? "IPv4" :
+		domain == AF_INET6 ? "IPv6" :
+		"unknown_domain"),
+	       (type == SOCK_STREAM ? "TCP" :
+		type == SOCK_DGRAM ? "UDP" :
+		"unknown_type"));
+}
+
+static int run_test_case(int domain, int type, const char *ip,
+			 unsigned short port)
+{
+	struct sockaddr_storage addr;
+	socklen_t addr_len = sizeof(addr);
+	int servfd = -1;
+	int err = 0;
+
+	print_test_case_num(domain, type);
+
+	if (mk_sockaddr(domain, ip, port, (struct sockaddr *)&addr,
+			addr_len) == -1)
+		return -1;
+
+	printf("\tRequested: bind(%s, %d) ..\n", ip, port);
+	servfd = start_server(type, &addr, addr_len);
+	if (servfd == -1)
+		goto err;
+
+	goto out;
+err:
+	err = -1;
+out:
+	close(servfd);
+	return err;
+}
+
+static void close_progs_fds(struct program *progs, size_t prog_cnt)
+{
+	size_t i;
+
+	for (i = 0; i < prog_cnt; ++i) {
+		close(progs[i].fd);
+		progs[i].fd = -1;
+	}
+}
+
+static int load_and_attach_progs(int cgfd, struct program *progs,
+				 size_t prog_cnt)
+{
+	size_t i;
+
+	for (i = 0; i < prog_cnt; ++i) {
+		progs[i].fd = progs[i].loadfn();
+		if (progs[i].fd == -1) {
+			log_err("Failed to load program %s", progs[i].name);
+			goto err;
+		}
+		if (bpf_prog_attach(progs[i].fd, cgfd, progs[i].type,
+				    BPF_F_ALLOW_OVERRIDE) == -1) {
+			log_err("Failed to attach program %s", progs[i].name);
+			goto err;
+		}
+		printf("Attached %s program.\n", progs[i].name);
+	}
+
+	return 0;
+err:
+	close_progs_fds(progs, prog_cnt);
+	return -1;
+}
+
+static int run_domain_test(int domain, int cgfd, struct program *progs,
+			   size_t prog_cnt, const char *ip, unsigned short port)
+{
+	int err = 0;
+
+	if (load_and_attach_progs(cgfd, progs, prog_cnt) == -1)
+		goto err;
+
+	if (run_test_case(domain, SOCK_STREAM, ip, port) == -1)
+		goto err;
+
+	if (run_test_case(domain, SOCK_DGRAM, ip, port) == -1)
+		goto err;
+
+	goto out;
+err:
+	err = -1;
+out:
+	close_progs_fds(progs, prog_cnt);
+	return err;
+}
+
+static int run_test(void)
+{
+	size_t inet6_prog_cnt;
+	size_t inet_prog_cnt;
+	int cgfd = -1;
+	int err = 0;
+
+	struct program inet6_progs[] = {
+		{BPF_CGROUP_INET6_BIND, bind6_prog_load, -1, "bind6"},
+	};
+	inet6_prog_cnt = sizeof(inet6_progs) / sizeof(struct program);
+
+	struct program inet_progs[] = {
+		{BPF_CGROUP_INET4_BIND, bind4_prog_load, -1, "bind4"},
+	};
+	inet_prog_cnt = sizeof(inet_progs) / sizeof(struct program);
+
+	if (setup_cgroup_environment())
+		goto err;
+
+	cgfd = create_and_get_cgroup(CG_PATH);
+	if (!cgfd)
+		goto err;
+
+	if (join_cgroup(CG_PATH))
+		goto err;
+
+	if (run_domain_test(AF_INET, cgfd, inet_progs, inet_prog_cnt, SERV4_IP,
+			    SERV4_PORT) == -1)
+		goto err;
+
+	if (run_domain_test(AF_INET6, cgfd, inet6_progs, inet6_prog_cnt,
+			    SERV6_IP, SERV6_PORT) == -1)
+		goto err;
+
+	goto out;
+err:
+	err = -1;
+out:
+	close(cgfd);
+	cleanup_cgroup_environment();
+	printf(err ? "### FAIL\n" : "### SUCCESS\n");
+	return err;
+}
+
+int main(int argc, char **argv)
+{
+	return run_test();
+}
-- 
2.9.5

^ permalink raw reply related

* [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team

For our container management we've been using complicated and fragile setup
consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
all containerized applications.
The setup involves per-container IPs, policy, etc, so traditional
network-only solutions that involve VRFs, netns, acls are not applicable.
Changing apps is not possible and LD_PRELOAD doesn't work
for apps that don't use glibc like java and golang.
BPF+cgroup looks to be the best solution for this problem.
Hence we introduce 3 hooks:
- at entry into sys_bind and sys_connect
  to let bpf prog look and modify 'struct sockaddr' provided
  by user space and fail bind/connect when appropriate
- post sys_bind after port is allocated

The approach works great and has zero overhead for anyone who doesn't
use it and very low overhead when deployed.

The main question for Daniel and Dave is what approach to take
with prog types...

In this patch set we introduce 6 new program types to make user
experience easier:
  BPF_PROG_TYPE_CGROUP_INET4_BIND,
  BPF_PROG_TYPE_CGROUP_INET6_BIND,
  BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
  BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
  BPF_PROG_TYPE_CGROUP_INET4_POST_BIND,
  BPF_PROG_TYPE_CGROUP_INET6_POST_BIND,

since v4 programs should not be using 'struct bpf_sock_addr'->user_ip6 fields
and different prog type for v4 and v6 helps verifier reject such access
at load time.
Similarly bind vs connect are two different prog types too,
since only sys_connect programs can call new bpf_bind() helper.

This approach is very different from tcp-bpf where single
'struct bpf_sock_ops' and single prog type is used for different hooks.
The field checks are done at run-time instead of load time.

I think the approach taken by this patch set is justified,
but we may do better if we extend BPF_PROG_ATTACH cmd
with log_buf + log_size, then we should be able to combine
bind+connect+v4+v6 into single program type.
The idea that at load time the verifier will remember a bitmask
of fields in bpf_sock_addr used by the program and helpers
that program used, then at attach time we can check that
hook is compatible with features used by the program and
report human readable error message back via log_buf.
We cannot do this right now with just EINVAL, since combinations
of errors like 'using user_ip6 field but attaching to v4 hook'
are too high to express as errno.
This would be bigger change. If you folks think it's worth it
we can go with this approach or if you think 6 new prog types
is not too bad, we can leave the patch as-is.
Comments?
Other comments on patches are welcome.

Andrey Ignatov (6):
  bpf: Hooks for sys_bind
  selftests/bpf: Selftest for sys_bind hooks
  net: Introduce __inet_bind() and __inet6_bind
  bpf: Hooks for sys_connect
  selftests/bpf: Selftest for sys_connect hooks
  bpf: Post-hooks for sys_bind

 include/linux/bpf-cgroup.h                    |  68 +++-
 include/linux/bpf_types.h                     |   6 +
 include/linux/filter.h                        |  10 +
 include/net/inet_common.h                     |   2 +
 include/net/ipv6.h                            |   2 +
 include/net/sock.h                            |   3 +
 include/net/udp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  52 ++-
 kernel/bpf/cgroup.c                           |  36 ++
 kernel/bpf/syscall.c                          |  42 ++
 kernel/bpf/verifier.c                         |   6 +
 net/core/filter.c                             | 479 ++++++++++++++++++++++-
 net/ipv4/af_inet.c                            |  60 ++-
 net/ipv4/tcp_ipv4.c                           |  16 +
 net/ipv4/udp.c                                |  14 +
 net/ipv6/af_inet6.c                           |  47 ++-
 net/ipv6/tcp_ipv6.c                           |  16 +
 net/ipv6/udp.c                                |  20 +
 tools/include/uapi/linux/bpf.h                |  39 +-
 tools/testing/selftests/bpf/Makefile          |   8 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   2 +
 tools/testing/selftests/bpf/connect4_prog.c   |  45 +++
 tools/testing/selftests/bpf/connect6_prog.c   |  61 +++
 tools/testing/selftests/bpf/test_sock_addr.c  | 541 ++++++++++++++++++++++++++
 tools/testing/selftests/bpf/test_sock_addr.sh |  57 +++
 25 files changed, 1580 insertions(+), 53 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/connect4_prog.c
 create mode 100644 tools/testing/selftests/bpf/connect6_prog.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c
 create mode 100755 tools/testing/selftests/bpf/test_sock_addr.sh

-- 
2.9.5

^ permalink raw reply

* [PATCH RFC bpf-next 3/6] net: Introduce __inet_bind() and __inet6_bind
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team
In-Reply-To: <20180314033934.3502167-1-ast@kernel.org>

From: Andrey Ignatov <rdna@fb.com>

Refactor `bind()` code to make it ready to be called from BPF helper
function `bpf_bind()` (will be added soon). Implementation of
`inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and
`__inet6_bind()` correspondingly. These function can be used from both
`sk_prot->bind` and `bpf_bind()` contexts.

New functions have two additional arguments.

`force_bind_address_no_port` forces binding to IP only w/o checking
`inet_sock.bind_address_no_port` field. It'll allow to bind local end of
a connection to desired IP in `bpf_bind()` w/o changing
`bind_address_no_port` field of a socket. It's useful since `bpf_bind()`
can return an error and we'd need to restore original value of
`bind_address_no_port` in that case if we changed this before calling to
the helper.

`with_lock` specifies whether to lock socket when working with `struct
sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old
behavior is preserved. But it will be set to `false` for `bpf_bind()`
use-case. The reason is all call-sites, where `bpf_bind()` will be
called, already hold that socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/net/inet_common.h |  2 ++
 include/net/ipv6.h        |  2 ++
 net/ipv4/af_inet.c        | 39 ++++++++++++++++++++++++---------------
 net/ipv6/af_inet6.c       | 37 ++++++++++++++++++++++++-------------
 4 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 500f81375200..384b90c62c0b 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -32,6 +32,8 @@ int inet_shutdown(struct socket *sock, int how);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		bool force_bind_address_no_port, bool with_lock);
 int inet_getname(struct socket *sock, struct sockaddr *uaddr,
 		 int peer);
 int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 50a6f0ddb878..2e5fedc56e59 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1066,6 +1066,8 @@ void ipv6_local_error(struct sock *sk, int err, struct flowi6 *fl6, u32 info);
 void ipv6_local_rxpmtu(struct sock *sk, struct flowi6 *fl6, u32 mtu);
 
 int inet6_release(struct socket *sock);
+int __inet6_bind(struct sock *sock, struct sockaddr *uaddr, int addr_len,
+		 bool force_bind_address_no_port, bool with_lock);
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
 int inet6_getname(struct socket *sock, struct sockaddr *uaddr,
 		  int peer);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2dec266507dc..e203a39d6988 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -432,30 +432,37 @@ EXPORT_SYMBOL(inet_release);
 
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
-	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
 	struct sock *sk = sock->sk;
-	struct inet_sock *inet = inet_sk(sk);
-	struct net *net = sock_net(sk);
-	unsigned short snum;
-	int chk_addr_ret;
-	u32 tb_id = RT_TABLE_LOCAL;
 	int err;
 
 	/* If the socket has its own bind function then use it. (RAW) */
 	if (sk->sk_prot->bind) {
-		err = sk->sk_prot->bind(sk, uaddr, addr_len);
-		goto out;
+		return sk->sk_prot->bind(sk, uaddr, addr_len);
 	}
-	err = -EINVAL;
 	if (addr_len < sizeof(struct sockaddr_in))
-		goto out;
+		return -EINVAL;
 
 	/* BPF prog is run before any checks are done so that if the prog
 	 * changes context in a wrong way it will be caught.
 	 */
 	err = BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr);
 	if (err)
-		goto out;
+		return err;
+
+	return __inet_bind(sk, uaddr, addr_len, false, true);
+}
+EXPORT_SYMBOL(inet_bind);
+
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		bool force_bind_address_no_port, bool with_lock)
+{
+	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
+	struct inet_sock *inet = inet_sk(sk);
+	struct net *net = sock_net(sk);
+	unsigned short snum;
+	int chk_addr_ret;
+	u32 tb_id = RT_TABLE_LOCAL;
+	int err;
 
 	if (addr->sin_family != AF_INET) {
 		/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
@@ -499,7 +506,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	 *      would be illegal to use them (multicast/broadcast) in
 	 *      which case the sending device address is used.
 	 */
-	lock_sock(sk);
+	if (with_lock)
+		lock_sock(sk);
 
 	/* Check these errors (active socket, double bind). */
 	err = -EINVAL;
@@ -511,7 +519,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		inet->inet_saddr = 0;  /* Use device */
 
 	/* Make sure we are allowed to bind here. */
-	if ((snum || !inet->bind_address_no_port) &&
+	if ((snum || !(inet->bind_address_no_port ||
+		       force_bind_address_no_port)) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		inet->inet_saddr = inet->inet_rcv_saddr = 0;
 		err = -EADDRINUSE;
@@ -528,11 +537,11 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	sk_dst_reset(sk);
 	err = 0;
 out_release_sock:
-	release_sock(sk);
+	if (with_lock)
+		release_sock(sk);
 out:
 	return err;
 }
-EXPORT_SYMBOL(inet_bind);
 
 int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr,
 		       int addr_len, int flags)
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index fa24e3f06ac6..13110bee5c14 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -277,15 +277,7 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 /* bind for INET6 API */
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
-	struct sockaddr_in6 *addr = (struct sockaddr_in6 *)uaddr;
 	struct sock *sk = sock->sk;
-	struct inet_sock *inet = inet_sk(sk);
-	struct ipv6_pinfo *np = inet6_sk(sk);
-	struct net *net = sock_net(sk);
-	__be32 v4addr = 0;
-	unsigned short snum;
-	bool saved_ipv6only;
-	int addr_type = 0;
 	int err = 0;
 
 	/* If the socket has its own bind function then use it. */
@@ -302,11 +294,28 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	if (err)
 		return err;
 
+	return __inet6_bind(sk, uaddr, addr_len, false, true);
+}
+EXPORT_SYMBOL(inet6_bind);
+
+int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		 bool force_bind_address_no_port, bool with_lock)
+{
+	struct sockaddr_in6 *addr = (struct sockaddr_in6 *)uaddr;
+	struct inet_sock *inet = inet_sk(sk);
+	struct ipv6_pinfo *np = inet6_sk(sk);
+	struct net *net = sock_net(sk);
+	__be32 v4addr = 0;
+	unsigned short snum;
+	bool saved_ipv6only;
+	int addr_type = 0;
+	int err = 0;
+
 	if (addr->sin6_family != AF_INET6)
 		return -EAFNOSUPPORT;
 
 	addr_type = ipv6_addr_type(&addr->sin6_addr);
-	if ((addr_type & IPV6_ADDR_MULTICAST) && sock->type == SOCK_STREAM)
+	if ((addr_type & IPV6_ADDR_MULTICAST) && sk->sk_type == SOCK_STREAM)
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
@@ -314,7 +323,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
-	lock_sock(sk);
+	if (with_lock)
+		lock_sock(sk);
 
 	/* Check these errors (active socket, double bind). */
 	if (sk->sk_state != TCP_CLOSE || inet->inet_num) {
@@ -402,7 +412,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		sk->sk_ipv6only = 1;
 
 	/* Make sure we are allowed to bind here. */
-	if ((snum || !inet->bind_address_no_port) &&
+	if ((snum || !(inet->bind_address_no_port ||
+		       force_bind_address_no_port)) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		sk->sk_ipv6only = saved_ipv6only;
 		inet_reset_saddr(sk);
@@ -418,13 +429,13 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	inet->inet_dport = 0;
 	inet->inet_daddr = 0;
 out:
-	release_sock(sk);
+	if (with_lock)
+		release_sock(sk);
 	return err;
 out_unlock:
 	rcu_read_unlock();
 	goto out;
 }
-EXPORT_SYMBOL(inet6_bind);
 
 int inet6_release(struct socket *sock)
 {
-- 
2.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox