Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Cong Wang @ 2017-05-09 23:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet
In-Reply-To: <1494371348.7796.95.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, May 9, 2017 at 4:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-05-09 at 15:54 -0700, Eric Dumazet wrote:
>> On Tue, 2017-05-09 at 15:52 -0700, Eric Dumazet wrote:
>> > On Tue, 2017-05-09 at 15:07 -0700, Cong Wang wrote:
>> > > On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > > > Wait... if we transfer dst->dev to loopback_dev because we don't
>> > > > want to block unregister path, then we might have a similar problem
>> > > > for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
>> > > > hold the dev references...
>> > > >
>> > >
>> > > I finally come up with the attach patch... Do you mind to give it a try?
>> >
>> > I will, but this might be delayed by a few hours.
>> >
>> > In the mean time, it looks like you could try adding the following to
>> > your .config ;)
>> >
>> > CONFIG_IP_ROUTE_MULTIPATH=y
>> >
>> >
>>
>> +                               /* This should be fine, we are on unregister
>> +                                * path so synchronize_net() already waits for
>> +                                * existing readers. We have to release the
>> +                                * dev here because dst could still hold this
>> +                                * fib_info via rt->fi, we can't wait for GC.
>> +                                */
>> +                               RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
>> +                               dev_put(dev);
>>                                 dead = fi->fib_nhs;
>>
>> dead = fi->fib_mhs looks wrong if you remove the break; statement ?
>>
>> -                               break;

This statement is only used to ensure we pass the "dead == fi->fib_nhs"
check right below the inner loop, it is fine to keep it without break since
fi is not changed in the inner loop.


>
> Also setting nexthop_nh->nh_dev to NULL looks quite dangerous
>
> We have plenty of sites doing :
>
> if (fi->fib_dev)
>     x = fi->fib_dev->field
>
> fib_route_seq_show() is one example.
>

All of them take RCU read lock, so, as I explained in the code comment,
they all should be fine because of synchronize_net() on unregister path.
Do you see anything otherwise?

^ permalink raw reply

* [PATCH] libertas: Avoid reading past end of buffer
From: Kees Cook @ 2017-05-09 23:23 UTC (permalink / raw)
  To: netdev; +Cc: Kalle Valo, libertas-dev, linux-wireless, linux-kernel,
	Daniel Micay

Using memcpy() from a string that is shorter than the length copied means
the destination buffer is being filled with arbitrary data from the kernel
rodata segment. Instead, use strncpy() which will fill the trailing bytes
with zeros. Additionally adjust indentation to keep checkpatch.pl happy.

This was found with the future CONFIG_FORTIFY_SOURCE feature.

Cc: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 drivers/net/wireless/marvell/libertas/mesh.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/marvell/libertas/mesh.c b/drivers/net/wireless/marvell/libertas/mesh.c
index d0c881dd5846..d0b1948ca242 100644
--- a/drivers/net/wireless/marvell/libertas/mesh.c
+++ b/drivers/net/wireless/marvell/libertas/mesh.c
@@ -1177,9 +1177,9 @@ void lbs_mesh_ethtool_get_strings(struct net_device *dev,
 	switch (stringset) {
 	case ETH_SS_STATS:
 		for (i = 0; i < MESH_STATS_NUM; i++) {
-			memcpy(s + i * ETH_GSTRING_LEN,
-					mesh_stat_strings[i],
-					ETH_GSTRING_LEN);
+			strncpy(s + i * ETH_GSTRING_LEN,
+				mesh_stat_strings[i],
+				ETH_GSTRING_LEN);
 		}
 		break;
 	}
-- 
2.7.4


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Eric Dumazet @ 2017-05-09 23:09 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet
In-Reply-To: <1494370451.7796.93.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, 2017-05-09 at 15:54 -0700, Eric Dumazet wrote:
> On Tue, 2017-05-09 at 15:52 -0700, Eric Dumazet wrote:
> > On Tue, 2017-05-09 at 15:07 -0700, Cong Wang wrote:
> > > On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > Wait... if we transfer dst->dev to loopback_dev because we don't
> > > > want to block unregister path, then we might have a similar problem
> > > > for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
> > > > hold the dev references...
> > > >
> > > 
> > > I finally come up with the attach patch... Do you mind to give it a try?
> > 
> > I will, but this might be delayed by a few hours.
> > 
> > In the mean time, it looks like you could try adding the following to
> > your .config ;)
> > 
> > CONFIG_IP_ROUTE_MULTIPATH=y
> > 
> > 
> 
> +                               /* This should be fine, we are on unregister
> +                                * path so synchronize_net() already waits for
> +                                * existing readers. We have to release the
> +                                * dev here because dst could still hold this
> +                                * fib_info via rt->fi, we can't wait for GC.
> +                                */
> +                               RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
> +                               dev_put(dev);
>                                 dead = fi->fib_nhs;
> 
> dead = fi->fib_mhs looks wrong if you remove the break; statement ?
> 
> -                               break;

Also setting nexthop_nh->nh_dev to NULL looks quite dangerous

We have plenty of sites doing :

if (fi->fib_dev)
    x = fi->fib_dev->field

fib_route_seq_show() is one example.

^ permalink raw reply

* Re: [PATCH] wcn36xx: Close SMD channel on device removal
From: Bjorn Andersson @ 2017-05-09 23:03 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Eugene Krasnikov, Eyal Ilsar, wcn36xx, linux-wireless, netdev,
	linux-kernel, linux-arm-msm
In-Reply-To: <87h90u3672.fsf@kamboji.qca.qualcomm.com>

On Mon 08 May 23:17 PDT 2017, Kalle Valo wrote:

> Bjorn Andersson <bjorn.andersson@linaro.org> writes:
> 
> > The SMD channel is not the primary WCNSS channel and must explicitly be
> > closed as the device is removed, or the channel will already by open on
> > a subsequent probe call in e.g. the case of reloading the kernel module.
> >
> > This issue was introduced because I simplified the underlying SMD
> > implementation while the SMD adaptions of the driver sat on the mailing
> > list, but missed to update these patches. The patch does however only
> > apply back to the transition to rpmsg, hence the limited Fixes.
> >
> > Fixes: 5052de8deff5 ("soc: qcom: smd: Transition client drivers from smd to rpmsg")
> > Reported-by: Eyal Ilsar <c_eilsar@qti.qualcomm.com>
> > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> 
> As this is a regression I'll queue this to 4.12.
> 

Thanks.

> But if this is an older bug (didn't quite understand your description
> though) should there be a separate patch for stable releases?
> 

AFAICT this never worked, as it seems I did the rework in SMD while we
tried to figure out the dependency issues we had with moving to SMD. So
v4.9 through v4.11 has SMD support - with this bug.

How do I proceed, do you want me to write up a fix for stable@? Do I
send that out as an ordinary patch?

Regards,
Bjorn

^ permalink raw reply

* Re: [PATCH net-next] bnxt: add dma mapping attributes
From: Shannon Nelson @ 2017-05-09 22:54 UTC (permalink / raw)
  To: Michael Chan; +Cc: David Miller, Netdev, sparclinux
In-Reply-To: <CACKFLikw_oi4rHRC9P-23xuvuOJSaf9b2kUSeTMgbNW1WA__=w@mail.gmail.com>

On 5/9/2017 2:05 PM, Michael Chan wrote:
> On Tue, May 9, 2017 at 1:37 PM, Shannon Nelson
> <shannon.nelson@oracle.com> wrote:
>> On the SPARC platform we need to use the DMA_ATTR_WEAK_ORDERING attribute
>> in our Rx path dma mapping in order to get the expected performance out
>> of the receive path.  Adding it to the Tx path has little effect, so
>> that's not a part of this patch.
>>
>> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
>> Reviewed-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Reviewed-by: Tom Saeger <tom.saeger@oracle.com>
>> ---
>>  drivers/net/ethernet/broadcom/bnxt/bnxt.c |   61 ++++++++++++++++++----------
>>  1 files changed, 39 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> index 1f1e54b..771742c 100644
>> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> @@ -66,6 +66,12 @@
>>  MODULE_DESCRIPTION("Broadcom BCM573xx network driver");
>>  MODULE_VERSION(DRV_MODULE_VERSION);
>>
>> +#ifdef CONFIG_SPARC
>> +#define BNXT_DMA_ATTRS  DMA_ATTR_WEAK_ORDERING
>> +#else
>> +#define BNXT_DMA_ATTRS 0
>> +#endif
>> +
>
> I think we can use the same attribute for all architectures.
> Architectures that don't implement weak ordering will ignore the
> attribute.
>

In the long run, you are probably correct, and it would be simple enough 
to change this.  However, given the recent threads about the 
applicability of Relaxed Ordering and a couple of PCIe root complexes 
that have been found to have issues with Relaxed Ordering TLPs, I prefer 
to stay on the conservative side and set it up only for the platform I 
know.  As it stands, this patch won't change the currently working 
behavior on other platforms, but will help us out on the one we know can 
use the feature.

sln



^ permalink raw reply

* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Eric Dumazet @ 2017-05-09 22:54 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet
In-Reply-To: <1494370367.7796.92.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, 2017-05-09 at 15:52 -0700, Eric Dumazet wrote:
> On Tue, 2017-05-09 at 15:07 -0700, Cong Wang wrote:
> > On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > Wait... if we transfer dst->dev to loopback_dev because we don't
> > > want to block unregister path, then we might have a similar problem
> > > for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
> > > hold the dev references...
> > >
> > 
> > I finally come up with the attach patch... Do you mind to give it a try?
> 
> I will, but this might be delayed by a few hours.
> 
> In the mean time, it looks like you could try adding the following to
> your .config ;)
> 
> CONFIG_IP_ROUTE_MULTIPATH=y
> 
> 

+                               /* This should be fine, we are on unregister
+                                * path so synchronize_net() already waits for
+                                * existing readers. We have to release the
+                                * dev here because dst could still hold this
+                                * fib_info via rt->fi, we can't wait for GC.
+                                */
+                               RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
+                               dev_put(dev);
                                dead = fi->fib_nhs;

dead = fi->fib_mhs looks wrong if you remove the break; statement ?

-                               break;
                        }

^ permalink raw reply

* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Eric Dumazet @ 2017-05-09 22:52 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet
In-Reply-To: <CAM_iQpVhdzeen+ZJ3jyK7Qb=yVuoRSZ7BFB3RoVMShY8KPBTZA@mail.gmail.com>

On Tue, 2017-05-09 at 15:07 -0700, Cong Wang wrote:
> On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > Wait... if we transfer dst->dev to loopback_dev because we don't
> > want to block unregister path, then we might have a similar problem
> > for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
> > hold the dev references...
> >
> 
> I finally come up with the attach patch... Do you mind to give it a try?

I will, but this might be delayed by a few hours.

In the mean time, it looks like you could try adding the following to
your .config ;)

CONFIG_IP_ROUTE_MULTIPATH=y

^ permalink raw reply

* Re: [PATCH v2 net] dccp/tcp: do not inherit mc_list from parent
From: Eric Dumazet @ 2017-05-09 22:51 UTC (permalink / raw)
  To: Cong Wang; +Cc: David Miller, netdev, Pray3r, Andrey Konovalov
In-Reply-To: <CAM_iQpXHXhnCJr6TNCUm06TkSQvdz6H+k=avN1AyBO9uBkWZAA@mail.gmail.com>

On Tue, 2017-05-09 at 15:37 -0700, Cong Wang wrote:
> On Tue, May 9, 2017 at 6:29 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > syzkaller found a way to trigger double frees from ip_mc_drop_socket()
> >
> > It turns out that leave a copy of parent mc_list at accept() time,
> > which is very bad.
> >
> > Very similar to commit 8b485ce69876 ("tcp: do not inherit
> > fastopen_req from parent")
> >
> > Initial report from Pray3r, completed by Andrey one.
> > Thanks a lot to them !
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Reported-by: Pray3r <pray3r.z@gmail.com>
> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
> > Tested-by: Andrey Konovalov <andreyknvl@google.com>
> > ---
> > v2: fix moved into inet_csk_clone_lock() to fix both DCCP and TCP
> >
> >  net/ipv4/inet_connection_sock.c |    2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 5e313c1ac94fc88eca5fe3a0e9e46e551e955ff0..1054d330bf9df3189a21dbb08e27c0e6ad136775 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -794,6 +794,8 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
> >                 /* listeners have SOCK_RCU_FREE, not the children */
> >                 sock_reset_flag(newsk, SOCK_RCU_FREE);
> >
> > +               inet_sk(newsk)->mc_list = NULL;
> > +
> >                 newsk->sk_mark = inet_rsk(req)->ir_mark;
> >                 atomic64_set(&newsk->sk_cookie,
> >                              atomic64_read(&inet_rsk(req)->ir_cookie));
> >
> 
> I think IPv6 needs this too?
> 
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index aeb9497..b3611d9 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1062,6 +1062,7 @@ static struct sock *tcp_v6_syn_recv_sock(const
> struct sock *sk, struct sk_buff *
>                 newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
>  #endif
> 
> +               newnp->ipv6_mc_list = NULL;
>                 newnp->ipv6_ac_list = NULL;
>                 newnp->ipv6_fl_list = NULL;
>                 newnp->pktoptions  = NULL;

Good point, but it looks like you patched on only IPv4 mapped sockets.

And DCCP would need fixes as well.

^ permalink raw reply

* Re: [PATCH v2 net] dccp/tcp: do not inherit mc_list from parent
From: Cong Wang @ 2017-05-09 22:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Pray3r, Andrey Konovalov
In-Reply-To: <1494336559.7796.78.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, May 9, 2017 at 6:29 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> syzkaller found a way to trigger double frees from ip_mc_drop_socket()
>
> It turns out that leave a copy of parent mc_list at accept() time,
> which is very bad.
>
> Very similar to commit 8b485ce69876 ("tcp: do not inherit
> fastopen_req from parent")
>
> Initial report from Pray3r, completed by Andrey one.
> Thanks a lot to them !
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Pray3r <pray3r.z@gmail.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> ---
> v2: fix moved into inet_csk_clone_lock() to fix both DCCP and TCP
>
>  net/ipv4/inet_connection_sock.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 5e313c1ac94fc88eca5fe3a0e9e46e551e955ff0..1054d330bf9df3189a21dbb08e27c0e6ad136775 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -794,6 +794,8 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
>                 /* listeners have SOCK_RCU_FREE, not the children */
>                 sock_reset_flag(newsk, SOCK_RCU_FREE);
>
> +               inet_sk(newsk)->mc_list = NULL;
> +
>                 newsk->sk_mark = inet_rsk(req)->ir_mark;
>                 atomic64_set(&newsk->sk_cookie,
>                              atomic64_read(&inet_rsk(req)->ir_cookie));
>

I think IPv6 needs this too?

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index aeb9497..b3611d9 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1062,6 +1062,7 @@ static struct sock *tcp_v6_syn_recv_sock(const
struct sock *sk, struct sk_buff *
                newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
 #endif

+               newnp->ipv6_mc_list = NULL;
                newnp->ipv6_ac_list = NULL;
                newnp->ipv6_fl_list = NULL;
                newnp->pktoptions  = NULL;

^ permalink raw reply related

* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Cong Wang @ 2017-05-09 22:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet
In-Reply-To: <CAM_iQpXADa6AyQ3X-EbmRPGCQArgPWygvHMYgWtfUYV082oF3Q@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 397 bytes --]

On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Wait... if we transfer dst->dev to loopback_dev because we don't
> want to block unregister path, then we might have a similar problem
> for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
> hold the dev references...
>

I finally come up with the attach patch... Do you mind to give it a try?

[-- Attachment #2: ipv4-rt-fi-ref-count.diff --]
[-- Type: text/plain, Size: 2821 bytes --]

commit a431b4d969f617c5ef8711b6bf493199eecec759
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Tue May 9 14:35:10 2017 -0700

    ipv4: restore rt->fi for reference counting, try #2
    
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

diff --git a/include/net/route.h b/include/net/route.h
index 2cc0e14..4335eb7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@ struct rtable {
 
 	struct list_head	rt_uncached;
 	struct uncached_list	*rt_uncached_list;
+	struct fib_info		*fi; /* for refcnt to shared metrics */
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da449dd..d77c453 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1429,8 +1429,15 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force)
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 			if (event == NETDEV_UNREGISTER &&
 			    nexthop_nh->nh_dev == dev) {
+				/* This should be fine, we are on unregister
+				 * path so synchronize_net() already waits for
+				 * existing readers. We have to release the
+				 * dev here because dst could still hold this
+				 * fib_info via rt->fi, we can't wait for GC.
+				 */
+				RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
+				dev_put(dev);
 				dead = fi->fib_nhs;
-				break;
 			}
 #endif
 		} endfor_nexthops(fi)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 655d9ee..f647310 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1387,6 +1387,11 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
 {
 	struct rtable *rt = (struct rtable *) dst;
 
+	if (rt->fi) {
+		fib_info_put(rt->fi);
+		rt->fi = NULL;
+	}
+
 	if (!list_empty(&rt->rt_uncached)) {
 		struct uncached_list *ul = rt->rt_uncached_list;
 
@@ -1424,6 +1429,16 @@ static bool rt_cache_valid(const struct rtable *rt)
 		!rt_is_expired(rt);
 }
 
+static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
+{
+	if (fi->fib_metrics != (u32 *)dst_default_metrics) {
+		fib_info_hold(fi);
+		rt->fi = fi;
+	}
+
+	dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+}
+
 static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			   const struct fib_result *res,
 			   struct fib_nh_exception *fnhe,
@@ -1438,7 +1453,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			rt->rt_gateway = nh->nh_gw;
 			rt->rt_uses_gateway = 1;
 		}
-		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+		rt_init_metrics(rt, fi);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
 #endif
@@ -1490,6 +1505,7 @@ struct rtable *rt_dst_alloc(struct net_device *dev,
 		rt->rt_gateway = 0;
 		rt->rt_uses_gateway = 0;
 		rt->rt_table_id = 0;
+		rt->fi = NULL;
 		INIT_LIST_HEAD(&rt->rt_uncached);
 
 		rt->dst.output = ip_output;

^ permalink raw reply related

* Re: [PATCH net-next] bnxt: add dma mapping attributes
From: Michael Chan @ 2017-05-09 21:05 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: David Miller, Netdev, sparclinux
In-Reply-To: <1494362253-270990-1-git-send-email-shannon.nelson@oracle.com>

On Tue, May 9, 2017 at 1:37 PM, Shannon Nelson
<shannon.nelson@oracle.com> wrote:
> On the SPARC platform we need to use the DMA_ATTR_WEAK_ORDERING attribute
> in our Rx path dma mapping in order to get the expected performance out
> of the receive path.  Adding it to the Tx path has little effect, so
> that's not a part of this patch.
>
> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
> Reviewed-by: Tushar Dave <tushar.n.dave@oracle.com>
> Reviewed-by: Tom Saeger <tom.saeger@oracle.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c |   61 ++++++++++++++++++----------
>  1 files changed, 39 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 1f1e54b..771742c 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -66,6 +66,12 @@
>  MODULE_DESCRIPTION("Broadcom BCM573xx network driver");
>  MODULE_VERSION(DRV_MODULE_VERSION);
>
> +#ifdef CONFIG_SPARC
> +#define BNXT_DMA_ATTRS  DMA_ATTR_WEAK_ORDERING
> +#else
> +#define BNXT_DMA_ATTRS 0
> +#endif
> +

I think we can use the same attribute for all architectures.
Architectures that don't implement weak ordering will ignore the
attribute.

^ permalink raw reply

* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Cong Wang @ 2017-05-09 20:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet
In-Reply-To: <1494348962.7796.88.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, May 9, 2017 at 9:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-05-09 at 09:44 -0700, Cong Wang wrote:
>
>>
>> Eric, how did you produce it?
>> I guess it's because of nh_dev which is the only netdevice pointer inside
>> fib_info. Let me take a deeper look.
>>
>
> Nothing particular, I am using kexec to boot new kernels, and all my
> attempts with your patch included demonstrated the issue.


Interesting. So this happens on your eth0 teardown path, but
I don't see any problem in the related NETDEV_UNREGISTER
notifiers yet, we don't call dst_destroy() on this path and will defer it
to GC, but that should not be delayed for so long?

Wait... if we transfer dst->dev to loopback_dev because we don't
want to block unregister path, then we might have a similar problem
for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
hold the dev references...

Something like this (just for proof of concept):

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da449dd..85cd614 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -205,7 +205,7 @@ static void free_fib_info_rcu(struct rcu_head *head)
        struct fib_info *fi = container_of(head, struct fib_info, rcu);

        change_nexthops(fi) {
-               if (nexthop_nh->nh_dev)
+               if (nexthop_nh->nh_dev && !(nexthop_nh->nh_flags & RTNH_F_DEAD))
                        dev_put(nexthop_nh->nh_dev);
                lwtstate_put(nexthop_nh->nh_lwtstate);
                free_nh_exceptions(nexthop_nh);
@@ -1414,8 +1414,10 @@ int fib_sync_down_dev(struct net_device *dev,
unsigned long event, bool force)
                        else if (nexthop_nh->nh_dev == dev &&
                                 nexthop_nh->nh_scope != scope) {
                                switch (event) {
-                               case NETDEV_DOWN:
                                case NETDEV_UNREGISTER:
+                                       dev_put(dev);
+                                       /* fall through */
+                               case NETDEV_DOWN:
                                        nexthop_nh->nh_flags |= RTNH_F_DEAD;
                                        /* fall through */
                                case NETDEV_CHANGE:

^ permalink raw reply related

* [PATCH net-next] bnxt: add dma mapping attributes
From: Shannon Nelson @ 2017-05-09 20:37 UTC (permalink / raw)
  To: davem, netdev; +Cc: sparclinux, Shannon Nelson

On the SPARC platform we need to use the DMA_ATTR_WEAK_ORDERING attribute
in our Rx path dma mapping in order to get the expected performance out
of the receive path.  Adding it to the Tx path has little effect, so
that's not a part of this patch.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Tom Saeger <tom.saeger@oracle.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   61 ++++++++++++++++++----------
 1 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1f1e54b..771742c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -66,6 +66,12 @@
 MODULE_DESCRIPTION("Broadcom BCM573xx network driver");
 MODULE_VERSION(DRV_MODULE_VERSION);
 
+#ifdef CONFIG_SPARC
+#define BNXT_DMA_ATTRS  DMA_ATTR_WEAK_ORDERING
+#else
+#define BNXT_DMA_ATTRS	0
+#endif
+
 #define BNXT_RX_OFFSET (NET_SKB_PAD + NET_IP_ALIGN)
 #define BNXT_RX_DMA_OFFSET NET_SKB_PAD
 #define BNXT_RX_COPY_THRESH 256
@@ -582,7 +588,8 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int nr_pkts)
 	if (!page)
 		return NULL;
 
-	*mapping = dma_map_page(dev, page, 0, PAGE_SIZE, bp->rx_dir);
+	*mapping = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, bp->rx_dir,
+				      BNXT_DMA_ATTRS);
 	if (dma_mapping_error(dev, *mapping)) {
 		__free_page(page);
 		return NULL;
@@ -601,8 +608,9 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int nr_pkts)
 	if (!data)
 		return NULL;
 
-	*mapping = dma_map_single(&pdev->dev, data + bp->rx_dma_offset,
-				  bp->rx_buf_use_size, bp->rx_dir);
+	*mapping = dma_map_single_attrs(&pdev->dev, data + bp->rx_dma_offset,
+					bp->rx_buf_use_size, bp->rx_dir,
+					BNXT_DMA_ATTRS);
 
 	if (dma_mapping_error(&pdev->dev, *mapping)) {
 		kfree(data);
@@ -705,8 +713,9 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp,
 			return -ENOMEM;
 	}
 
-	mapping = dma_map_page(&pdev->dev, page, offset, BNXT_RX_PAGE_SIZE,
-			       PCI_DMA_FROMDEVICE);
+	mapping = dma_map_page_attrs(&pdev->dev, page, offset,
+				     BNXT_RX_PAGE_SIZE, PCI_DMA_FROMDEVICE,
+				     BNXT_DMA_ATTRS);
 	if (dma_mapping_error(&pdev->dev, mapping)) {
 		__free_page(page);
 		return -EIO;
@@ -799,7 +808,8 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_napi *bnapi, u16 cp_cons,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_unmap_page(&bp->pdev->dev, dma_addr, PAGE_SIZE, bp->rx_dir);
+	dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, PAGE_SIZE, bp->rx_dir,
+			     BNXT_DMA_ATTRS);
 
 	if (unlikely(!payload))
 		payload = eth_get_headlen(data_ptr, len);
@@ -841,8 +851,8 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_napi *bnapi, u16 cp_cons,
 	}
 
 	skb = build_skb(data, 0);
-	dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
-			 bp->rx_dir);
+	dma_unmap_single_attrs(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
+			       bp->rx_dir, BNXT_DMA_ATTRS);
 	if (!skb) {
 		kfree(data);
 		return NULL;
@@ -909,8 +919,8 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_napi *bnapi, u16 cp_cons,
 			return NULL;
 		}
 
-		dma_unmap_page(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
-			       PCI_DMA_FROMDEVICE);
+		dma_unmap_page_attrs(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
+				     PCI_DMA_FROMDEVICE, BNXT_DMA_ATTRS);
 
 		skb->data_len += frag_len;
 		skb->len += frag_len;
@@ -1329,8 +1339,9 @@ static void bnxt_abort_tpa(struct bnxt *bp, struct bnxt_napi *bnapi,
 		tpa_info->mapping = new_mapping;
 
 		skb = build_skb(data, 0);
-		dma_unmap_single(&bp->pdev->dev, mapping, bp->rx_buf_use_size,
-				 bp->rx_dir);
+		dma_unmap_single_attrs(&bp->pdev->dev, mapping,
+				       bp->rx_buf_use_size, bp->rx_dir,
+				       BNXT_DMA_ATTRS);
 
 		if (!skb) {
 			kfree(data);
@@ -1971,9 +1982,11 @@ static void bnxt_free_rx_skbs(struct bnxt *bp)
 				if (!data)
 					continue;
 
-				dma_unmap_single(&pdev->dev, tpa_info->mapping,
-						 bp->rx_buf_use_size,
-						 bp->rx_dir);
+				dma_unmap_single_attrs(&pdev->dev,
+						       tpa_info->mapping,
+						       bp->rx_buf_use_size,
+						       bp->rx_dir,
+						       BNXT_DMA_ATTRS);
 
 				tpa_info->data = NULL;
 
@@ -1993,13 +2006,15 @@ static void bnxt_free_rx_skbs(struct bnxt *bp)
 
 			if (BNXT_RX_PAGE_MODE(bp)) {
 				mapping -= bp->rx_dma_offset;
-				dma_unmap_page(&pdev->dev, mapping,
-					       PAGE_SIZE, bp->rx_dir);
+				dma_unmap_page_attrs(&pdev->dev, mapping,
+						     PAGE_SIZE, bp->rx_dir,
+						     BNXT_DMA_ATTRS);
 				__free_page(data);
 			} else {
-				dma_unmap_single(&pdev->dev, mapping,
-						 bp->rx_buf_use_size,
-						 bp->rx_dir);
+				dma_unmap_single_attrs(&pdev->dev, mapping,
+						       bp->rx_buf_use_size,
+						       bp->rx_dir,
+						       BNXT_DMA_ATTRS);
 				kfree(data);
 			}
 		}
@@ -2012,8 +2027,10 @@ static void bnxt_free_rx_skbs(struct bnxt *bp)
 			if (!page)
 				continue;
 
-			dma_unmap_page(&pdev->dev, rx_agg_buf->mapping,
-				       BNXT_RX_PAGE_SIZE, PCI_DMA_FROMDEVICE);
+			dma_unmap_page_attrs(&pdev->dev, rx_agg_buf->mapping,
+					     BNXT_RX_PAGE_SIZE,
+					     PCI_DMA_FROMDEVICE,
+					     BNXT_DMA_ATTRS);
 
 			rx_agg_buf->page = NULL;
 			__clear_bit(j, rxr->rx_agg_bmap);
-- 
1.7.1


^ permalink raw reply related

* Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit
From: Daniel Borkmann @ 2017-05-09 20:25 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: David Miller, Kees Cook, Mircea Gherzan, Network Development,
	kernel-hardening, linux-arm-kernel, ast
In-Reply-To: <CAHgaXd+xb5dN90sH___RtxgSC3usnH2jXkA5r3=fQJc3pOY5xw@mail.gmail.com>

On 05/09/2017 10:12 PM, Shubham Bansal wrote:
> Hi Daniel,
>
> I just tried running test_bpf.ko module.
>
> $ echo 2 >>  /proc/sys/net/core/bpf_jit_enable
> $ insmod test_bpf.ko
>
> test_bpf: #0 TAX
> bpf_jit: flen=14 proglen=212 pass=2 image=7f15a83c from=insmod pid=730
> JIT code: 00000000: f0 05 2d e9 40 d2 4d e2 00 40 a0 e3 0c 42 8d e5
> JIT code: 00000010: 08 42 8d e5 00 00 20 e0 01 10 21 e0 20 62 9d e5
> JIT code: 00000020: 20 72 9d e5 06 70 27 e0 20 72 8d e5 24 62 9d e5
> JIT code: 00000030: 24 72 9d e5 06 70 27 e0 24 72 8d e5 00 40 a0 e1
> JIT code: 00000040: 01 50 a0 e1 01 00 a0 e3 00 10 a0 e3 20 02 8d e5
> JIT code: 00000050: 24 12 8d e5 02 00 a0 e3 00 10 a0 e3 20 62 9d e5
> JIT code: 00000060: 06 00 80 e0 00 10 a0 e3 00 00 60 e2 00 10 a0 e3
> JIT code: 00000070: 20 02 8d e5 24 12 8d e5 54 40 90 e5 20 62 9d e5
> JIT code: 00000080: 06 00 80 e0 00 10 a0 e3 20 02 8d e5 24 12 8d e5
> JIT code: 00000090: 04 00 a0 e1 01 10 a0 e3 20 62 9d e5 06 10 81 e0
> JIT code: 000000a0: 01 20 a0 e3 04 32 8d e2 bc 68 0a e3 11 60 48 e3
> JIT code: 000000b0: 36 ff 2f e1 01 10 21 e0 00 00 50 e3 04 00 00 0a
> JIT code: 000000c0: 00 00 d0 e5 01 00 00 ea 40 d2 8d e2 f0 05 bd e8
> JIT code: 000000d0: 1e ff 2f e1
> jited:1
> Unhandled fault: page domain fault (0x01b) at 0x00000051
> pgd = 871d0000
> [00000051] *pgd=671b7831, *pte=00000000, *ppte=00000000
> Internal error: : 1b [#1] SMP ARM
> Modules linked in: test_bpf(+)
> CPU: 0 PID: 730 Comm: insmod Not tainted 4.11.0+ #5
> Hardware name: ARM-Versatile Express
> task: 87023700 task.stack: 8718a000
> PC is at 0x7f15a8b4
> LR is at test_bpf_init+0x5bc/0x1000 [test_bpf]
> pc : [<7f15a8b4>]    lr : [<7f1575bc>]    psr: 80000013
> sp : 8718bd7c  ip : 00000015  fp : 7f005008
> r10: 7f005094  r9 : 893ba020  r8 : 893ba000
> r7 : 00000000  r6 : 00000001  r5 : 00000000  r4 : 00000000
> r3 : 7f15a83c  r2 : 893ba020  r1 : 00000000  r0 : fffffffd
> Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> Control: 10c5387d  Table: 671d0059  DAC: 00000051
> Process insmod (pid: 730, stack limit = 0x8718a210)
> Stack: (0x8718bd7c to 0x8718c000)
> bd60:                                                                00000000
> bd80: 00002710 870db300 c302e7e8 7f004010 893ba000 7f005094 00000000 00000000
> bda0: 00000000 00000000 00000000 00000001 00000001 00000000 014000c0 00150628
> bdc0: 7f0050ac 7f154840 1234aaaa 1234aaab c302e7e8 0000000f 00000000 893ba000
> bde0: 0000000b 7f004010 87fd54a0 ffffe000 7f157000 00000000 871b6fc0 00000001
> be00: 78e4905c 00000024 7f154640 8010179c 80a06544 8718a000 00000001 80a54980
> be20: 80a3066c 00000007 809685c0 80a54700 80a54700 07551000 80a54700 60070013
> be40: 7f154640 801f3fc8 78e4905c 7f154640 00000001 871b6fe4 7f154640 00000001
> be60: 871b6b00 00000001 78e4905c 801eaa94 00000001 871b6fe4 8718bf44 00000001
> be80: 871b6fe4 80196e4c 7f15464c 00007fff 7f154640 80193f10 87127000 7f154640
> bea0: 7f154688 80703800 7f154770 807037e4 8081b184 807bec60 807becc4 807bec6c
> bec0: 7f15481c 8010c1b8 93600000 76ed8028 00000f60 00000000 00000000 00000000
> bee0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> bf00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00003f80
> bf20: 76f5cf88 00000000 93684f80 8718a000 00160fda 00000051 00000000 801973b0
> bf40: 87671a00 93501000 00183f80 93684760 93684574 936788e0 00155000 00155290
> bf60: 00000000 00000000 00000000 00001f64 00000032 00000033 0000001d 00000000
> bf80: 00000017 00000000 00000000 00183f80 756e694c 00000080 80107684 fffffffd
> bfa0: 00000000 801074c0 00000000 00183f80 76dd9008 00183f80 00160fda 00000000
> bfc0: 00000000 00183f80 756e694c 00000080 00000001 7eabae2c 00172f8c 00000000
> bfe0: 7eabaae0 7eabaad0 0004017f 00013172 60070030 76dd9008 00000000 00000000
> [<7f1575bc>] (test_bpf_init [test_bpf]) from [<7f157000>]
> (test_bpf_init+0x0/0x1000 [test_bpf])
> [<7f157000>] (test_bpf_init [test_bpf]) from [<78e4905c>] (0x78e4905c)
> Code: e2600000 e3a01000 e58d0220 e58d1224 (e5904054)
> ---[ end trace a36398923b914fe2 ]---
> Segmentation fault
>
> Why is trying to execute TAX which is a cBPF instruction?

Kernel translates this to eBPF internally (bpf_prepare_filter() ->
bpf_migrate_filter()), no cBPF will see the JIT directly.

Is your implementation still using bpf_jit_compile() callback as
opposed to bpf_int_jit_compile()?!

Cheers,
Daniel

^ permalink raw reply

* Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit
From: David Miller @ 2017-05-09 20:19 UTC (permalink / raw)
  To: illusionist.neo
  Cc: daniel, keescook, mgherzan, netdev, kernel-hardening,
	linux-arm-kernel, ast
In-Reply-To: <CAHgaXd+xb5dN90sH___RtxgSC3usnH2jXkA5r3=fQJc3pOY5xw@mail.gmail.com>

From: Shubham Bansal <illusionist.neo@gmail.com>
Date: Wed, 10 May 2017 01:42:10 +0530

> Why is trying to execute TAX which is a cBPF instruction?

Because some of the tests are classic BPF programs which
get translated into eBPF ones and sent to the JIT for
compilation.

^ permalink raw reply

* [PATCH nf] xtables: zero padding in data_to_user
From: Willem de Bruijn @ 2017-05-09 20:17 UTC (permalink / raw)
  To: netfilter-devel
  Cc: netdev, rgb, fwestpha, pmoore, pvrabec, pablo, davem,
	Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

When looking up an iptables rule, the iptables binary compares the
aligned match and target data (XT_ALIGN). In some cases this can
exceed the actual data size to include padding bytes.

Before commit f77bc5b23fb1 ("iptables: use match, target and data
copy_to_user helpers") the malloc()ed bytes were overwritten by the
kernel with kzalloced contents, zeroing the padding and making the
comparison succeed. After this patch, the kernel copies and clears
only data, leaving the padding bytes undefined.

Extend the clear operation from data size to aligned data size to
include the padding bytes, if any.

Padding bytes can be observed in both match and target, and the bug
triggered, by issuing a rule with match icmp and target ACCEPT:

  iptables -t mangle -A INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT
  iptables -t mangle -D INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT

Fixes: f77bc5b23fb1 ("iptables: use match, target and data copy_to_user helpers")
Reported-by: Paul Moore <pmoore@redhat.com>
Reported-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/netfilter/x_tables.h | 2 +-
 net/bridge/netfilter/ebtables.c    | 9 ++++++---
 net/netfilter/x_tables.c           | 9 ++++++---
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index be378cf47fcc..b3044c2c62cb 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -294,7 +294,7 @@ int xt_match_to_user(const struct xt_entry_match *m,
 int xt_target_to_user(const struct xt_entry_target *t,
 		      struct xt_entry_target __user *u);
 int xt_data_to_user(void __user *dst, const void *src,
-		    int usersize, int size);
+		    int usersize, int size, int aligned_size);
 
 void *xt_copy_counters_from_user(const void __user *user, unsigned int len,
 				 struct xt_counters_info *info, bool compat);
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 9ec0c9f908fa..9c6e619f452b 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1373,7 +1373,8 @@ static inline int ebt_obj_to_user(char __user *um, const char *_name,
 	strlcpy(name, _name, sizeof(name));
 	if (copy_to_user(um, name, EBT_FUNCTION_MAXNAMELEN) ||
 	    put_user(datasize, (int __user *)(um + EBT_FUNCTION_MAXNAMELEN)) ||
-	    xt_data_to_user(um + entrysize, data, usersize, datasize))
+	    xt_data_to_user(um + entrysize, data, usersize, datasize,
+			    XT_ALIGN(datasize)))
 		return -EFAULT;
 
 	return 0;
@@ -1658,7 +1659,8 @@ static int compat_match_to_user(struct ebt_entry_match *m, void __user **dstptr,
 		if (match->compat_to_user(cm->data, m->data))
 			return -EFAULT;
 	} else {
-		if (xt_data_to_user(cm->data, m->data, match->usersize, msize))
+		if (xt_data_to_user(cm->data, m->data, match->usersize, msize,
+				    COMPAT_XT_ALIGN(msize)))
 			return -EFAULT;
 	}
 
@@ -1687,7 +1689,8 @@ static int compat_target_to_user(struct ebt_entry_target *t,
 		if (target->compat_to_user(cm->data, t->data))
 			return -EFAULT;
 	} else {
-		if (xt_data_to_user(cm->data, t->data, target->usersize, tsize))
+		if (xt_data_to_user(cm->data, t->data, target->usersize, tsize,
+				    COMPAT_XT_ALIGN(tsize)))
 			return -EFAULT;
 	}
 
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index f134d384852f..c64716a735b0 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -283,12 +283,13 @@ static int xt_obj_to_user(u16 __user *psize, u16 size,
 		       &U->u.user.revision, K->u.kernel.TYPE->revision)
 
 int xt_data_to_user(void __user *dst, const void *src,
-		    int usersize, int size)
+		    int usersize, int size, int aligned_size)
 {
 	usersize = usersize ? : size;
 	if (copy_to_user(dst, src, usersize))
 		return -EFAULT;
-	if (usersize != size && clear_user(dst + usersize, size - usersize))
+	if (usersize != aligned_size &&
+	    clear_user(dst + usersize, aligned_size - usersize))
 		return -EFAULT;
 
 	return 0;
@@ -298,7 +299,9 @@ EXPORT_SYMBOL_GPL(xt_data_to_user);
 #define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)				\
 	xt_data_to_user(U->data, K->data,				\
 			K->u.kernel.TYPE->usersize,			\
-			C_SIZE ? : K->u.kernel.TYPE->TYPE##size)
+			C_SIZE ? : K->u.kernel.TYPE->TYPE##size,	\
+			C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) :		\
+				 XT_ALIGN(K->u.kernel.TYPE->TYPE##size))
 
 int xt_match_to_user(const struct xt_entry_match *m,
 		     struct xt_entry_match __user *u)
-- 
2.13.0.rc2.291.g57267f2277-goog

^ permalink raw reply related

* Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit
From: Shubham Bansal @ 2017-05-09 20:12 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David Miller, Kees Cook, Mircea Gherzan, Network Development,
	kernel-hardening, linux-arm-kernel, ast
In-Reply-To: <58E639E0.1010700@iogearbox.net>

Hi Daniel,

I just tried running test_bpf.ko module.

$ echo 2 >>  /proc/sys/net/core/bpf_jit_enable
$ insmod test_bpf.ko

test_bpf: #0 TAX
bpf_jit: flen=14 proglen=212 pass=2 image=7f15a83c from=insmod pid=730
JIT code: 00000000: f0 05 2d e9 40 d2 4d e2 00 40 a0 e3 0c 42 8d e5
JIT code: 00000010: 08 42 8d e5 00 00 20 e0 01 10 21 e0 20 62 9d e5
JIT code: 00000020: 20 72 9d e5 06 70 27 e0 20 72 8d e5 24 62 9d e5
JIT code: 00000030: 24 72 9d e5 06 70 27 e0 24 72 8d e5 00 40 a0 e1
JIT code: 00000040: 01 50 a0 e1 01 00 a0 e3 00 10 a0 e3 20 02 8d e5
JIT code: 00000050: 24 12 8d e5 02 00 a0 e3 00 10 a0 e3 20 62 9d e5
JIT code: 00000060: 06 00 80 e0 00 10 a0 e3 00 00 60 e2 00 10 a0 e3
JIT code: 00000070: 20 02 8d e5 24 12 8d e5 54 40 90 e5 20 62 9d e5
JIT code: 00000080: 06 00 80 e0 00 10 a0 e3 20 02 8d e5 24 12 8d e5
JIT code: 00000090: 04 00 a0 e1 01 10 a0 e3 20 62 9d e5 06 10 81 e0
JIT code: 000000a0: 01 20 a0 e3 04 32 8d e2 bc 68 0a e3 11 60 48 e3
JIT code: 000000b0: 36 ff 2f e1 01 10 21 e0 00 00 50 e3 04 00 00 0a
JIT code: 000000c0: 00 00 d0 e5 01 00 00 ea 40 d2 8d e2 f0 05 bd e8
JIT code: 000000d0: 1e ff 2f e1
jited:1
Unhandled fault: page domain fault (0x01b) at 0x00000051
pgd = 871d0000
[00000051] *pgd=671b7831, *pte=00000000, *ppte=00000000
Internal error: : 1b [#1] SMP ARM
Modules linked in: test_bpf(+)
CPU: 0 PID: 730 Comm: insmod Not tainted 4.11.0+ #5
Hardware name: ARM-Versatile Express
task: 87023700 task.stack: 8718a000
PC is at 0x7f15a8b4
LR is at test_bpf_init+0x5bc/0x1000 [test_bpf]
pc : [<7f15a8b4>]    lr : [<7f1575bc>]    psr: 80000013
sp : 8718bd7c  ip : 00000015  fp : 7f005008
r10: 7f005094  r9 : 893ba020  r8 : 893ba000
r7 : 00000000  r6 : 00000001  r5 : 00000000  r4 : 00000000
r3 : 7f15a83c  r2 : 893ba020  r1 : 00000000  r0 : fffffffd
Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 671d0059  DAC: 00000051
Process insmod (pid: 730, stack limit = 0x8718a210)
Stack: (0x8718bd7c to 0x8718c000)
bd60:                                                                00000000
bd80: 00002710 870db300 c302e7e8 7f004010 893ba000 7f005094 00000000 00000000
bda0: 00000000 00000000 00000000 00000001 00000001 00000000 014000c0 00150628
bdc0: 7f0050ac 7f154840 1234aaaa 1234aaab c302e7e8 0000000f 00000000 893ba000
bde0: 0000000b 7f004010 87fd54a0 ffffe000 7f157000 00000000 871b6fc0 00000001
be00: 78e4905c 00000024 7f154640 8010179c 80a06544 8718a000 00000001 80a54980
be20: 80a3066c 00000007 809685c0 80a54700 80a54700 07551000 80a54700 60070013
be40: 7f154640 801f3fc8 78e4905c 7f154640 00000001 871b6fe4 7f154640 00000001
be60: 871b6b00 00000001 78e4905c 801eaa94 00000001 871b6fe4 8718bf44 00000001
be80: 871b6fe4 80196e4c 7f15464c 00007fff 7f154640 80193f10 87127000 7f154640
bea0: 7f154688 80703800 7f154770 807037e4 8081b184 807bec60 807becc4 807bec6c
bec0: 7f15481c 8010c1b8 93600000 76ed8028 00000f60 00000000 00000000 00000000
bee0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00003f80
bf20: 76f5cf88 00000000 93684f80 8718a000 00160fda 00000051 00000000 801973b0
bf40: 87671a00 93501000 00183f80 93684760 93684574 936788e0 00155000 00155290
bf60: 00000000 00000000 00000000 00001f64 00000032 00000033 0000001d 00000000
bf80: 00000017 00000000 00000000 00183f80 756e694c 00000080 80107684 fffffffd
bfa0: 00000000 801074c0 00000000 00183f80 76dd9008 00183f80 00160fda 00000000
bfc0: 00000000 00183f80 756e694c 00000080 00000001 7eabae2c 00172f8c 00000000
bfe0: 7eabaae0 7eabaad0 0004017f 00013172 60070030 76dd9008 00000000 00000000
[<7f1575bc>] (test_bpf_init [test_bpf]) from [<7f157000>]
(test_bpf_init+0x0/0x1000 [test_bpf])
[<7f157000>] (test_bpf_init [test_bpf]) from [<78e4905c>] (0x78e4905c)
Code: e2600000 e3a01000 e58d0220 e58d1224 (e5904054)
---[ end trace a36398923b914fe2 ]---
Segmentation fault

Why is trying to execute TAX which is a cBPF instruction?

Best,
Shubham Bansal


On Thu, Apr 6, 2017 at 6:21 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 04/06/2017 01:05 PM, Shubham Bansal wrote:
>>
>> Gentle Reminder.
>
>
> Sorry for late reply.
>
>> Anybody can tell me how to test the JIT compiler ?
>
>
> There's lib/test_bpf.c, see Documentation/networking/filter.txt +1349
> for some more information. It basically contains various test cases that
> have the purpose to test the JIT with corner cases. If you see a useful
> test missing, please send a patch for it, so all other JITs can benefit
> from this as well. For extracting disassembly from a generated test case,
> check out bpf_jit_disasm (Documentation/networking/filter.txt +486).
>
> Thanks,
> Daniel

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2017-05-09 20:03 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Fix multiqueue in stmmac driver on PCI, from Andy Shevchenko.

2) cdc_ncm doesn't actually fully zero out the padding area is
   allocates on TX, from Jim Baxter.

3) Don't leak map addresses in BPF verifier, from Daniel Borkmann.

4) If we randomize TCP timestamps, we have to do it everywhere
   including SYN cookies.  From Eric Dumazet.

5) Fix "ethtool -S" crash in aquantia driver, from Pavel Belous.

6) Fix allocation size for ntp filter bitmap in bnxt_en driver,
   from Dan Carpenter.

7) Add missing memory allocation return value check to DSA loop
   driver, from Christophe Jaillet.

8) Fix XDP leak on driver unload in qed driver, from Suddarsana Reddy
   Kalluru.

9) Don't inherit MC list from parent inet connection sockets,
   another syzkaller spotted gem.  Fix from Eric Dumazet.

Please pull, thanks a lot.

The following changes since commit af82455f7dbd9dc20244d80d033721b30d22c065:

  Merge tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc (2017-05-04 19:15:35 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 657831ffc38e30092a2d5f03d385d710eb88b09a:

  dccp/tcp: do not inherit mc_list from parent (2017-05-09 15:17:49 -0400)

----------------------------------------------------------------
Andy Shevchenko (4):
      stmmac: pci: set default number of rx and tx queues
      stmmac: pci: TX and RX queue priority configuration
      stmmac: pci: RX queue routing configuration
      stmmac: pci: split out common_default_data() helper

Christophe Jaillet (1):
      net: dsa: loop: Check for memory allocation failure

Dan Carpenter (1):
      bnxt_en: allocate enough space for ->ntp_fltr_bmap

Daniel Borkmann (1):
      bpf: don't let ldimm64 leak map addresses on unprivileged

David S. Miller (5):
      Merge branch 'stmmac-pci-fix-crash-on-Intel-Galileo-Gen2'
      Merge tag 'mac80211-for-davem-2017-05-08' of git://git.kernel.org/.../jberg/mac80211
      Revert "ipv4: restore rt->fi for reference counting"
      Merge branch 'mlx4-misc-fixes'
      Merge branch 'qed-general-fixes'

Eric Dumazet (2):
      tcp: randomize timestamps on syncookies
      dccp/tcp: do not inherit mc_list from parent

Ganesh Goudar (1):
      cxgb4: avoid disabling FEC by default

Geliang Tang (2):
      net/hippi/rrunner: use memdup_user
      yam: use memdup_user

Grygorii Strashko (1):
      net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control

Hangbin Liu (2):
      bonding: check nla_put_be32 return value
      vti: check nla_put_* return value

Jack Morgenstein (1):
      net/mlx4_core: Reduce harmless SRIOV error message to debug level

Jim Baxter (1):
      net: cdc_ncm: Fix TX zero padding

Johannes Berg (4):
      mac80211: properly remove RX_ENC_FLAG_40MHZ
      nl80211: correctly validate MU-MIMO groups
      mac80211: fix IBSS presp allocation size
      cfg80211: fix multi scheduled scan kernel-doc

Jon Mason (1):
      net: mdio-mux: bcm-iproc: call mdiobus_free() in error path

Kamal Heib (1):
      net/mlx4_en: Change the error print to debug print

Karim Eshapa (1):
      drivers: net: wimax: i2400m: i2400m-usb: Use time_after for time comparison

Kees Cook (4):
      bna: Avoid reading past end of buffer
      bna: ethtool: Avoid reading past end of buffer
      qlge: Avoid reading past end of buffer
      DECnet: Use container_of() for embedded struct

Luca Coelho (1):
      mac80211: bail out from prep_connection() if a reconfig is ongoing

Mintz, Yuval (3):
      qed: Fix VF removal sequence
      qed: Tell QM the number of tasks
      qede: Split PF/VF ndos.

Pavel Belous (1):
      aquantia: Fix "ethtool -S" crash when adapter down.

Rakesh Pandit (1):
      net: alx: handle pci_alloc_irq_vectors return correctly

Ram Amrani (1):
      qed: Correct doorbell configuration for !4Kb pages

Suddarsana Reddy Kalluru (1):
      qede: Fix XDP memory leak on unload

Talat Batheesh (1):
      net/mlx4_en: Avoid adding steering rules with invalid ring

Tobias Klauser (1):
      bridge: netlink: account for IFLA_BRPORT_{B, M}CAST_FLOOD size and policy

Vlad Yasevich (1):
      vlan: Keep NETIF_F_HW_CSUM similar to other software devices

WANG Cong (2):
      ipv4: restore rt->fi for reference counting
      ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf

Wei Wang (1):
      tcp: make congestion control optionally skip slow start after idle

 drivers/net/bonding/bond_netlink.c                    |  3 ++-
 drivers/net/dsa/dsa_loop.c                            |  3 +++
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c       |  6 ++++--
 drivers/net/ethernet/atheros/alx/main.c               |  4 ++--
 drivers/net/ethernet/broadcom/bnxt/bnxt.c             |  3 ++-
 drivers/net/ethernet/brocade/bna/bfa_ioc.c            |  2 +-
 drivers/net/ethernet/brocade/bna/bnad_ethtool.c       |  4 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h            |  9 +++++++++
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c            | 38 +++++++++++++++++++++++++++++++-------
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h         |  6 +++---
 drivers/net/ethernet/mellanox/mlx4/cmd.c              | 14 +++++++++++---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c       |  5 +++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c            |  3 ++-
 drivers/net/ethernet/mellanox/mlx4/resource_tracker.c |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_cxt.c             |  1 +
 drivers/net/ethernet/qlogic/qed/qed_dev.c             |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_main.c            |  6 ++++--
 drivers/net/ethernet/qlogic/qede/qede_filter.c        |  5 -----
 drivers/net/ethernet/qlogic/qede/qede_main.c          | 25 ++++++++++++++++++++++++-
 drivers/net/ethernet/qlogic/qlge/qlge_dbg.c           |  4 ++--
 drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c      | 41 ++++++++++++++++++-----------------------
 drivers/net/ethernet/ti/cpsw.c                        | 16 ++++++++++++++++
 drivers/net/hamradio/yam.c                            | 10 ++++------
 drivers/net/hippi/rrunner.c                           | 17 +++++++----------
 drivers/net/phy/mdio-mux-bcm-iproc.c                  |  5 ++++-
 drivers/net/usb/cdc_ncm.c                             | 11 +++++++----
 drivers/net/wimax/i2400m/i2400m-usb.h                 |  2 +-
 drivers/net/wireless/ath/ath9k/ar9003_mac.c           |  2 +-
 drivers/net/wireless/ath/ath9k/mac.c                  |  4 ++--
 drivers/net/wireless/intel/iwlegacy/4965-mac.c        |  4 +++-
 drivers/net/wireless/intel/iwlwifi/dvm/rx.c           |  4 +++-
 drivers/net/wireless/mac80211_hwsim.c                 |  8 +++++++-
 include/net/addrconf.h                                |  2 ++
 include/net/cfg80211.h                                |  2 +-
 include/net/mac80211.h                                |  2 --
 include/net/secure_seq.h                              | 10 ++++++----
 include/net/tcp.h                                     |  9 ++++++---
 kernel/bpf/verifier.c                                 | 21 ++++++++++++++++-----
 net/8021q/vlan_dev.c                                  | 13 ++++++++++---
 net/bridge/br_netlink.c                               |  4 ++++
 net/core/secure_seq.c                                 | 31 +++++++++++++++++++------------
 net/decnet/dn_neigh.c                                 | 12 ++++++------
 net/ipv4/inet_connection_sock.c                       |  2 ++
 net/ipv4/ip_vti.c                                     | 13 +++++++------
 net/ipv4/syncookies.c                                 | 12 ++++++++++--
 net/ipv4/tcp_input.c                                  |  8 +++-----
 net/ipv4/tcp_ipv4.c                                   | 32 +++++++++++++++++++-------------
 net/ipv4/tcp_output.c                                 |  4 +++-
 net/ipv6/addrconf.c                                   |  1 +
 net/ipv6/route.c                                      | 13 +++++++++++--
 net/ipv6/syncookies.c                                 | 10 +++++++++-
 net/ipv6/tcp_ipv6.c                                   | 32 +++++++++++++++++++-------------
 net/mac80211/ibss.c                                   |  2 ++
 net/mac80211/mlme.c                                   |  4 ++++
 net/wireless/nl80211.c                                |  4 ++--
 55 files changed, 345 insertions(+), 167 deletions(-)

^ permalink raw reply

* Re: [PATCH v2 net] dccp/tcp: do not inherit mc_list from parent
From: David Miller @ 2017-05-09 19:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, pray3r.z, andreyknvl
In-Reply-To: <1494336559.7796.78.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 09 May 2017 06:29:19 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> syzkaller found a way to trigger double frees from ip_mc_drop_socket()
> 
> It turns out that leave a copy of parent mc_list at accept() time,
> which is very bad.
> 
> Very similar to commit 8b485ce69876 ("tcp: do not inherit
> fastopen_req from parent")
> 
> Initial report from Pray3r, completed by Andrey one.
> Thanks a lot to them !
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Pray3r <pray3r.z@gmail.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> ---
> v2: fix moved into inet_csk_clone_lock() to fix both DCCP and TCP

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply

* Re: Requirements for a shutdown function?
From: Florian Fainelli @ 2017-05-09 19:06 UTC (permalink / raw)
  To: Timur Tabi, netdev
In-Reply-To: <f995d5b3-2a5c-df08-3915-f77ab9203844@codeaurora.org>

On 05/09/2017 11:51 AM, Timur Tabi wrote:
> On 05/09/2017 01:46 PM, Florian Fainelli wrote:
>> A good test case for exercising a .shutdown() function is kexec'ing a
>> new kernel for instance.
> 
> I tried that.  I run iperf in one window while launching kexec in another.
> Even without a shutdown function, network traffic appear to halt on its own
> and the kexec succeeds.
> 
> Is it possible that the network stack detects a kexec and automatically
> stops all network devices?

No. why would it? However the device driver model does call into your
driver's remove function and that one does a right job already because
it does an network device unregister, and so on.

There is no strict requirement for implementing a .shutdown() function
AFAICT and it does not necessarily make sense to have one depending on
the bus type. For platform/MMIO devices, it hardly has any value, but on
e.g: PCI, it could be added as an additional step to perform a full
device shutdown.

> 
>> You should put your HW in a state where it won't be doing DMA, or have
>> any adverse side effects to the system, putting it in a low power state
>> is also a good approach.
> 
> My in-house driver stops the RX and TX queues.  I'm guessing that's good
> enough, but I don't have a failing test case to prove it.
> 

That's probably good enough, yes.
-- 
Florian

^ permalink raw reply

* Re: Requirements for a shutdown function?
From: Timur Tabi @ 2017-05-09 18:51 UTC (permalink / raw)
  To: Florian Fainelli, netdev
In-Reply-To: <1721db9b-ed60-4556-9aac-81f17e2c1849@gmail.com>

On 05/09/2017 01:46 PM, Florian Fainelli wrote:
> A good test case for exercising a .shutdown() function is kexec'ing a
> new kernel for instance.

I tried that.  I run iperf in one window while launching kexec in another.
Even without a shutdown function, network traffic appear to halt on its own
and the kexec succeeds.

Is it possible that the network stack detects a kexec and automatically
stops all network devices?

> You should put your HW in a state where it won't be doing DMA, or have
> any adverse side effects to the system, putting it in a low power state
> is also a good approach.

My in-house driver stops the RX and TX queues.  I'm guessing that's good
enough, but I don't have a failing test case to prove it.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply

* Re: Requirements for a shutdown function?
From: Florian Fainelli @ 2017-05-09 18:46 UTC (permalink / raw)
  To: Timur Tabi, netdev
In-Reply-To: <49bee65f-2ea8-1787-9642-659a967df8f0@codeaurora.org>

On 05/09/2017 09:58 AM, Timur Tabi wrote:
> I'm trying to add a platform_driver.shutdown function to my Ethernet driver
> (drivers/net/ethernet/qualcomm/emac/*), but I can't find any definitive
> information as to what a network driver shutdown callback is supposed to do.
>  I also don't know what testcase I should use to verify that my function is
> working.

A good test case for exercising a .shutdown() function is kexec'ing a
new kernel for instance.

> 
> I see only four instances of a platform_driver.shutdown function in
> drivers/net/ethernet:
> 
> $ git grep -A 20 -w platform_driver | grep '\.shutdown'
> apm/xgene-v2/main.c-	.shutdown = xge_shutdown,
> apm/xgene/xgene_enet_main.c-	.shutdown = xgene_enet_shutdown,
> marvell/mv643xx_eth.c-	.shutdown	= mv643xx_eth_shutdown,
> marvell/pxa168_eth.c-	.shutdown = pxa168_eth_shutdown,
> 
> (Other shutdown functions are for pci_driver.shutdown).
> 
> For the xgene drivers, the shutdown function just calls the 'remove'
> function.  Isn't that overkill?  Why bother with a shutdown function if it's
> just the same thing as removing the driver outright?

Yes, that appears unnecessary.

> 
> mv643xx_eth_shutdown() seems more reasonable.  All it does is halt the TX
> and RX queues.
> 
> pxa168_eth_shutdown() is a little more heavyweight: halts the queues, and
> stops the DMA and calls phy_stop().
> 
> Can anyone help me figure out what my driver really should do?

You should put your HW in a state where it won't be doing DMA, or have
any adverse side effects to the system, putting it in a low power state
is also a good approach.
-- 
Florian

^ permalink raw reply

* RE: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses
From: Chiappero, Marco @ 2017-05-09 18:41 UTC (permalink / raw)
  To: Dan Williams, Jiri Benc
  Cc: netdev@vger.kernel.org, David S . Miller, Kirsher, Jeffrey T,
	Duyck, Alexander H, Grandhi, Sainath, Mahesh Bandewar
In-Reply-To: <1494267188.8248.21.camel@redhat.com>


> -----Original Message-----
> From: Dan Williams [mailto:dcbw@redhat.com]
> Sent: Monday, May 8, 2017 7:13 PM
> To: Chiappero, Marco <marco.chiappero@intel.com>; Jiri Benc
> <jbenc@redhat.com>
> Cc: netdev@vger.kernel.org; David S . Miller <davem@davemloft.net>; Kirsher,
> Jeffrey T <jeffrey.t.kirsher@intel.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; Grandhi, Sainath
> <sainath.grandhi@intel.com>; Mahesh Bandewar <maheshb@google.com>
> Subject: Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses
> 
> On Mon, 2017-05-08 at 15:29 +0000, Chiappero, Marco wrote:
> > > -----Original Message-----
> > > From: Jiri Benc [mailto:jbenc@redhat.com]
> > > Sent: Thursday, May 4, 2017 5:44 PM
> > > To: Chiappero, Marco <marco.chiappero@intel.com>
> > > Cc: Dan Williams <dcbw@redhat.com>; netdev@vger.kernel.org; David S
> > > 
> > > And doing some kind of weird MAC NAT in ipvlan just to satisfy
> > > broken tools that can't cope with multiple interfaces with the same
> > > MAC address is wrong, too.
> >
> > Ipvlan has always had the MAC issue, regardless, these tools simply
> > make it more apparent. And as I said already, whether they are broken
> 
> If we're talking about the slaves being unable to change when the parent MAC
> changes, everyone agrees that was a bug.
> 
> If we are talking about the "all slaves have the same MAC" then that's not an
> "issue", that's the way it's designed.  Whether that design is the best design
> possible is debatable, but that's the way it currently is.

I'm referring to the latter, the former is obvious. That "design" looks like a temporary workaround to me, in fact a couple of TODOs in the code lead to believe it is. They should be removed, at the very least.

> > is debatable (yet I have to read a reasonable motivation). At the
> 
> Existing tools are already broken for bond slaves and a couple existing drivers,
> see below.
> 
> Note that making the MACs unique would break DHCP functionality, because
> now the MAC address encoded in the 'chaddr' field of the DHCP packet would
> no longer correspond to the MAC address of the device that DHCP replies should
> be received on.  The userspace process writes the 'chaddr' field, and the
> userspace process would see the unique (and
> incorrect) MAC address.

Fair point. However, despite not fixing the issues with DHCP, it would be no way worse that it is now (even though I admit I don't like much the workaround). BTW, I don't know about the ISC upstream version, but on Debian/Ubuntu the -B flag is not available, which makes ipvlan+DHCP non-viable on a very large number of deployments.

> > very least their expectation to have unique addresses on the same
> > broadcast domain is hardly arguable. Should ipvlan considered special?
> > Again, questionable.
> 
> bond slaves
> drivers/net/ethernet/toshiba/ps3_gelic_net.c
> drivers/s390/net/qeth_l3_main.c
> 
> already all have the same MAC address for different netdevices.  Yeah, not many
> people have PS3s anymore, but s390 qeth is fairly widely used.
>  Just pointing out that ipvlan is hardly the first device to have encountered this
> issue, or to have solved it this way.  qeth does pretty much the same thing as
> ipvlan.
> 
> I'm not arguing that ipvlan is perfect.  I'm just arguing that in its current form
> (eg, virtualized L2 device) making this change is incorrect.

Don't get me wrong, I understand and appreciate your lengthy reply, even though the fact that a poor solution is already included elsewhere doesn't make it any better.
However, regardless of the uniqueness topic, I don't feel is completely right to change the whole world upstairs making it ipvlan aware either, unless there is a real and coherent differentiation (e.g. L3 only interfaces). I'll drop considering ipvlan as an option for now.


Regards,
Marco

--------------------------------------------------------------
Intel Research and Development Ireland Limited
Registered in Ireland
Registered Office: Collinstown Industrial Park, Leixlip, County Kildare
Registered Number: 308263


This e-mail and any attachments may contain confidential material for the sole
use of the intended recipient(s). Any review or distribution by others is
strictly prohibited. If you are not the intended recipient, please contact the
sender and delete all copies.

^ permalink raw reply

* Re: [PATCH net v2] driver: vrf: Fix one possible use-after-free issue
From: David Miller @ 2017-05-09 18:37 UTC (permalink / raw)
  To: gfree.wind; +Cc: dsa, shm, fw, netdev
In-Reply-To: <1494325653-39885-1-git-send-email-gfree.wind@vip.163.com>

From: gfree.wind@vip.163.com
Date: Tue,  9 May 2017 18:27:33 +0800

> @@ -989,6 +989,7 @@ static u32 vrf_fib_table(const struct net_device *dev)
>  
>  static int vrf_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
> +	kfree_skb(skb);
>  	return 0;
>  }
>  
> @@ -998,7 +999,7 @@ static struct sk_buff *vrf_rcv_nfhook(u8 pf, unsigned int hook,
>  {
>  	struct net *net = dev_net(dev);
>  
> -	if (NF_HOOK(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) < 0)
> +	if (nf_hook(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) != 1)
>  		skb = NULL;    /* kfree_skb(skb) handled by nf code */
>  
>  	return skb;

Indeed, this fixes the immediate problem with NF_STOLEN.

Making NF_STOLEN fully functional is another story, we'd need to stack
this all together properly:

static int __ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
 ...
}

static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	return l3mdev_ip_rcv(skb, __ip_rcv_finish);
}
...
static inline
struct sk_buff *l3mdev_ip_rcv(struct sk_buff *skb,
			      int (*okfn)(struct net *, struct sock *, struct sk_buff *))
{
	return l3mdev_l3_rcv(skb, okfn, AF_INET);
}

etc. but that's going to really add a kink to the receive path,
microbenchmark wise.

^ permalink raw reply

* Re: bpf pointer alignment validation
From: David Miller @ 2017-05-09 18:32 UTC (permalink / raw)
  To: daniel; +Cc: ast, netdev
In-Reply-To: <59104D35.8080108@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon, 08 May 2017 12:49:25 +0200

> Could you also add test cases specifically to this for test_verifier
> in bpf selftests? I'm thinking of the cases when we have no pkt id
> and offset originated from reg->off (accumulated through const imm
> ops on reg) and insn->off, where we had i) no pkt id and ii) a
> specific pkt id (so we can probe for aux_off_align rejection as well).
> I believe we do have coverage to some extend in some of the tests
> (more on the map_value_adj though), but it would be good to keep
> tracking this specifically as well.

So I created a new test, that uses the verifier, but in a new way.

The thing I wanted to do is match on verifier dump strings so that I
can test what the verifier thinks about the known alignment et al. of
various registers after the execution of certain instructions.

I accomplish this by doing two things:

1) If the log level of 2 or greater is given to the verifier, I dump
   the internal state to the log after every instruction.

2) A new BPF library helper called bpf_verify_program() is added which
   always passes the log to the BPF system call and uses a log level
   of 2.

Then in my "test_align" I have BPF programs as well as strings to
match against in the verifier dump.

The whole thing is below, and writing the test cases certainly helped
me refine the implementation quite a bit.

I'll keep adding some more tests and hacking on this.  It just
occurred to me that it would be extremely variable to be able to turn
on strict alignment requirements even on architectures that do not
need it.

That way anyone adding regressions to the alignment code, or hitting
new cases the code can't currently handle, would notice immediately
regardles of the type of system the run the test case on.

To be quite honest, strict alignment might not be a bad default either
to give helpful feedback to people writing eBPF programs.

---
 include/linux/bpf_verifier.h             |   3 +
 kernel/bpf/verifier.c                    | 104 ++++++++-
 tools/lib/bpf/bpf.c                      |  20 ++
 tools/lib/bpf/bpf.h                      |   4 +
 tools/testing/selftests/bpf/Makefile     |   4 +-
 tools/testing/selftests/bpf/test_align.c | 378 +++++++++++++++++++++++++++++++
 6 files changed, 500 insertions(+), 13 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_align.c

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 5efb4db..7c6a519 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -40,6 +40,9 @@ struct bpf_reg_state {
 	 */
 	s64 min_value;
 	u64 max_value;
+	u32 min_align;
+	u32 aux_off;
+	u32 aux_off_align;
 };
 
 enum bpf_stack_slot_type {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c2ff608..2b1b06e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -241,6 +241,12 @@ static void print_verifier_state(struct bpf_verifier_state *state)
 		if (reg->max_value != BPF_REGISTER_MAX_RANGE)
 			verbose(",max_value=%llu",
 				(unsigned long long)reg->max_value);
+		if (reg->min_align)
+			verbose(",min_align=%u", reg->min_align);
+		if (reg->aux_off)
+			verbose(",aux_off=%u", reg->aux_off);
+		if (reg->aux_off_align)
+			verbose(",aux_off_align=%u", reg->aux_off_align);
 	}
 	for (i = 0; i < MAX_BPF_STACK; i += BPF_REG_SIZE) {
 		if (state->stack_slot_type[i] == STACK_SPILL)
@@ -455,6 +461,9 @@ static void init_reg_state(struct bpf_reg_state *regs)
 		regs[i].imm = 0;
 		regs[i].min_value = BPF_REGISTER_MIN_RANGE;
 		regs[i].max_value = BPF_REGISTER_MAX_RANGE;
+		regs[i].min_align = 0;
+		regs[i].aux_off = 0;
+		regs[i].aux_off_align = 0;
 	}
 
 	/* frame pointer */
@@ -483,11 +492,17 @@ static void reset_reg_range_values(struct bpf_reg_state *regs, u32 regno)
 	regs[regno].max_value = BPF_REGISTER_MAX_RANGE;
 }
 
+static void reset_reg_align(struct bpf_reg_state *regs, u32 regno)
+{
+	regs[regno].min_align = 0;
+}
+
 static void mark_reg_unknown_value_and_range(struct bpf_reg_state *regs,
 					     u32 regno)
 {
 	mark_reg_unknown_value(regs, regno);
 	reset_reg_range_values(regs, regno);
+	reset_reg_align(regs, regno);
 }
 
 enum reg_arg_type {
@@ -770,13 +785,24 @@ static bool is_pointer_value(struct bpf_verifier_env *env, int regno)
 static int check_pkt_ptr_alignment(const struct bpf_reg_state *reg,
 				   int off, int size)
 {
-	if (reg->id && size != 1) {
-		verbose("Unknown alignment. Only byte-sized access allowed in packet access.\n");
-		return -EACCES;
+	int reg_off;
+
+	/* Byte size accesses are always allowed. */
+	if (size == 1)
+		return 0;
+
+	reg_off = reg->off;
+	if (reg->id) {
+		if (reg->aux_off_align % size) {
+			verbose("Packet access is only %u byte aligned, %d byte access not allowed\n",
+				reg->aux_off_align, size);
+			return -EACCES;
+		}
+		reg_off += reg->aux_off;
 	}
 
 	/* skb->data is NET_IP_ALIGN-ed */
-	if ((NET_IP_ALIGN + reg->off + off) % size != 0) {
+	if ((NET_IP_ALIGN + reg_off + off) % size != 0) {
 		verbose("misaligned packet access off %d+%d+%d size %d\n",
 			NET_IP_ALIGN, reg->off, off, size);
 		return -EACCES;
@@ -872,6 +898,8 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
 							 value_regno);
 			/* note that reg.[id|off|range] == 0 */
 			state->regs[value_regno].type = reg_type;
+			state->regs[value_regno].aux_off = 0;
+			state->regs[value_regno].aux_off_align = 0;
 		}
 
 	} else if (reg->type == FRAME_PTR || reg->type == PTR_TO_STACK) {
@@ -1444,6 +1472,8 @@ static int check_packet_ptr_add(struct bpf_verifier_env *env,
 		 */
 		dst_reg->off += imm;
 	} else {
+		bool had_id;
+
 		if (src_reg->type == PTR_TO_PACKET) {
 			/* R6=pkt(id=0,off=0,r=62) R7=imm22; r7 += r6 */
 			tmp_reg = *dst_reg;  /* save r7 state */
@@ -1477,14 +1507,23 @@ static int check_packet_ptr_add(struct bpf_verifier_env *env,
 				src_reg->imm);
 			return -EACCES;
 		}
+
+		had_id = (dst_reg->id != 0);
+
 		/* dst_reg stays as pkt_ptr type and since some positive
 		 * integer value was added to the pointer, increment its 'id'
 		 */
 		dst_reg->id = ++env->id_gen;
 
-		/* something was added to pkt_ptr, set range and off to zero */
+		/* something was added to pkt_ptr, set range to zero */
+		dst_reg->aux_off = dst_reg->off;
 		dst_reg->off = 0;
 		dst_reg->range = 0;
+		if (had_id)
+			dst_reg->aux_off_align = min(dst_reg->aux_off_align,
+						     src_reg->min_align);
+		else
+			dst_reg->aux_off_align = src_reg->min_align;
 	}
 	return 0;
 }
@@ -1658,6 +1697,20 @@ static void check_reg_overflow(struct bpf_reg_state *reg)
 		reg->min_value = BPF_REGISTER_MIN_RANGE;
 }
 
+static u32 calc_align(u32 imm)
+{
+	u32 align = 1;
+
+	if (!imm)
+		return 1U << 31;
+
+	while (!(imm & 1)) {
+		imm >>= 1;
+		align <<= 1;
+	}
+	return align;
+}
+
 static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 				    struct bpf_insn *insn)
 {
@@ -1665,8 +1718,10 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 	s64 min_val = BPF_REGISTER_MIN_RANGE;
 	u64 max_val = BPF_REGISTER_MAX_RANGE;
 	u8 opcode = BPF_OP(insn->code);
+	u32 dst_align, src_align;
 
 	dst_reg = &regs[insn->dst_reg];
+	src_align = 0;
 	if (BPF_SRC(insn->code) == BPF_X) {
 		check_reg_overflow(&regs[insn->src_reg]);
 		min_val = regs[insn->src_reg].min_value;
@@ -1682,18 +1737,25 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 		    regs[insn->src_reg].type != UNKNOWN_VALUE) {
 			min_val = BPF_REGISTER_MIN_RANGE;
 			max_val = BPF_REGISTER_MAX_RANGE;
+			src_align = 0;
+		} else {
+			src_align = regs[insn->src_reg].min_align;
 		}
 	} else if (insn->imm < BPF_REGISTER_MAX_RANGE &&
 		   (s64)insn->imm > BPF_REGISTER_MIN_RANGE) {
 		min_val = max_val = insn->imm;
+		src_align = calc_align(insn->imm);
 	}
 
+	dst_align = dst_reg->min_align;
+
 	/* We don't know anything about what was done to this register, mark it
 	 * as unknown.
 	 */
 	if (min_val == BPF_REGISTER_MIN_RANGE &&
 	    max_val == BPF_REGISTER_MAX_RANGE) {
 		reset_reg_range_values(regs, insn->dst_reg);
+		reset_reg_align(regs, insn->dst_reg);
 		return;
 	}
 
@@ -1712,18 +1774,21 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 			dst_reg->min_value += min_val;
 		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
 			dst_reg->max_value += max_val;
+		dst_reg->min_align = min(src_align, dst_align);
 		break;
 	case BPF_SUB:
 		if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
 			dst_reg->min_value -= min_val;
 		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
 			dst_reg->max_value -= max_val;
+		dst_reg->min_align = min(src_align, dst_align);
 		break;
 	case BPF_MUL:
 		if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
 			dst_reg->min_value *= min_val;
 		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
 			dst_reg->max_value *= max_val;
+		dst_reg->min_align = max(src_align, dst_align);
 		break;
 	case BPF_AND:
 		/* Disallow AND'ing of negative numbers, ain't nobody got time
@@ -1735,17 +1800,23 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 		else
 			dst_reg->min_value = 0;
 		dst_reg->max_value = max_val;
+		dst_reg->min_align = max(src_align, dst_align);
 		break;
 	case BPF_LSH:
 		/* Gotta have special overflow logic here, if we're shifting
 		 * more than MAX_RANGE then just assume we have an invalid
 		 * range.
 		 */
-		if (min_val > ilog2(BPF_REGISTER_MAX_RANGE))
+		if (min_val > ilog2(BPF_REGISTER_MAX_RANGE)) {
 			dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
-		else if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
-			dst_reg->min_value <<= min_val;
-
+			dst_reg->min_align = 1;
+		} else {
+			if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
+				dst_reg->min_value <<= min_val;
+			if (!dst_reg->min_align)
+				dst_reg->min_align = 1;
+			dst_reg->min_align <<= min_val;
+		}
 		if (max_val > ilog2(BPF_REGISTER_MAX_RANGE))
 			dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
 		else if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
@@ -1755,11 +1826,19 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 		/* RSH by a negative number is undefined, and the BPF_RSH is an
 		 * unsigned shift, so make the appropriate casts.
 		 */
-		if (min_val < 0 || dst_reg->min_value < 0)
+		if (min_val < 0 || dst_reg->min_value < 0) {
 			dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
-		else
+		} else {
 			dst_reg->min_value =
 				(u64)(dst_reg->min_value) >> min_val;
+		}
+		if (min_val < 0) {
+			dst_reg->min_align = 1;
+		} else {
+			dst_reg->min_align >>= (u64) min_val;
+			if (!dst_reg->min_align)
+				dst_reg->min_align = 1;
+		}
 		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
 			dst_reg->max_value >>= max_val;
 		break;
@@ -1861,6 +1940,7 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 			regs[insn->dst_reg].imm = insn->imm;
 			regs[insn->dst_reg].max_value = insn->imm;
 			regs[insn->dst_reg].min_value = insn->imm;
+			regs[insn->dst_reg].min_align = calc_align(insn->imm);
 		}
 
 	} else if (opcode > BPF_END) {
@@ -2845,7 +2925,7 @@ static int do_check(struct bpf_verifier_env *env)
 			goto process_bpf_exit;
 		}
 
-		if (log_level && do_print_state) {
+		if (log_level > 1 || (log_level && do_print_state)) {
 			verbose("\nfrom %d to %d:", prev_insn_idx, insn_idx);
 			print_verifier_state(&env->cur_state);
 			do_print_state = false;
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 4fe444b80..d9d3164 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -117,6 +117,26 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 	return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
+		       size_t insns_cnt, const char *license,
+		       __u32 kern_version, char *log_buf, size_t log_buf_sz)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.prog_type = type;
+	attr.insn_cnt = (__u32)insns_cnt;
+	attr.insns = ptr_to_u64(insns);
+	attr.license = ptr_to_u64(license);
+	attr.log_buf = ptr_to_u64(log_buf);
+	attr.log_size = log_buf_sz;
+	attr.log_level = 2;
+	log_buf[0] = 0;
+	attr.kern_version = kern_version;
+
+	return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
 int bpf_map_update_elem(int fd, const void *key, const void *value,
 			__u64 flags)
 {
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index edb4dae..1a32b02 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -35,6 +35,10 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 		     size_t insns_cnt, const char *license,
 		     __u32 kern_version, char *log_buf,
 		     size_t log_buf_sz);
+int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
+		       size_t insns_cnt, const char *license,
+		       __u32 kern_version, char *log_buf,
+		       size_t log_buf_sz);
 
 int bpf_map_update_elem(int fd, const void *key, const void *value,
 			__u64 flags);
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 91edd05..1e1cde1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -11,7 +11,8 @@ endif
 CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include
 LDLIBS += -lcap -lelf
 
-TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs
+TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
+	test_align
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o
 
diff --git a/tools/testing/selftests/bpf/test_align.c b/tools/testing/selftests/bpf/test_align.c
new file mode 100644
index 0000000..22f34fa
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_align.c
@@ -0,0 +1,378 @@
+#include <asm/types.h>
+#include <linux/types.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <stddef.h>
+#include <stdbool.h>
+
+#include <linux/unistd.h>
+#include <linux/filter.h>
+#include <linux/bpf_perf_event.h>
+#include <linux/bpf.h>
+
+#include <bpf/bpf.h>
+
+#include "../../../include/linux/filter.h"
+
+#ifndef ARRAY_SIZE
+# define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+#define MAX_INSNS	512
+#define MAX_MATCHES	16
+
+struct bpf_align_test {
+	const char *descr;
+	struct bpf_insn	insns[MAX_INSNS];
+	enum {
+		UNDEF,
+		ACCEPT,
+		REJECT
+	} result;
+	enum bpf_prog_type prog_type;
+	const char *matches[MAX_MATCHES];
+};
+
+static struct bpf_align_test tests[] = {
+	{
+		.descr = "mov",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_3, 2),
+			BPF_MOV64_IMM(BPF_REG_3, 4),
+			BPF_MOV64_IMM(BPF_REG_3, 8),
+			BPF_MOV64_IMM(BPF_REG_3, 16),
+			BPF_MOV64_IMM(BPF_REG_3, 32),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+		.matches = {
+			"from 0 to 1: R1=ctx R3=imm2,min_value=2,max_value=2,min_align=2 R10=fp",
+			"from 0 to 2: R1=ctx R3=imm4,min_value=4,max_value=4,min_align=4 R10=fp",
+			"from 0 to 3: R1=ctx R3=imm8,min_value=8,max_value=8,min_align=8 R10=fp",
+			"from 0 to 4: R1=ctx R3=imm16,min_value=16,max_value=16,min_align=16 R10=fp",
+			"from 0 to 5: R1=ctx R3=imm32,min_value=32,max_value=32,min_align=32 R10=fp",
+		},
+	},
+	{
+		.descr = "shift",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_3, 1),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+			BPF_ALU64_IMM(BPF_RSH, BPF_REG_3, 4),
+			BPF_MOV64_IMM(BPF_REG_4, 32),
+			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+		.matches = {
+			"from 0 to 1: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R10=fp",
+			"from 0 to 2: R1=ctx R3=imm2,min_value=2,max_value=2,min_align=2 R10=fp",
+			"from 0 to 3: R1=ctx R3=imm4,min_value=4,max_value=4,min_align=4 R10=fp",
+			"from 0 to 4: R1=ctx R3=imm8,min_value=8,max_value=8,min_align=8 R10=fp",
+			"from 0 to 5: R1=ctx R3=imm16,min_value=16,max_value=16,min_align=16 R10=fp",
+			"from 0 to 6: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R10=fp",
+			"from 0 to 7: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm32,min_value=32,max_value=32,min_align=32 R10=fp",
+			"from 0 to 8: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm16,min_value=16,max_value=16,min_align=16 R10=fp",
+			"from 0 to 9: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm8,min_value=8,max_value=8,min_align=8 R10=fp",
+			"from 0 to 10: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm4,min_value=4,max_value=4,min_align=4 R10=fp",
+			"from 0 to 11: R1=ctx R3=imm1,min_value=1,max_value=1,min_align=1 R4=imm2,min_value=2,max_value=2,min_align=2 R10=fp",
+		},
+	},
+	{
+		.descr = "addsub",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_3, 4),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, 4),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, 2),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+		.matches = {
+			"from 0 to 1: R1=ctx R3=imm4,min_value=4,max_value=4,min_align=4 R10=fp",
+			"from 0 to 2: R1=ctx R3=imm8,min_value=8,max_value=8,min_align=4 R10=fp",
+			"from 0 to 3: R1=ctx R3=imm10,min_value=10,max_value=10,min_align=2 R10=fp",
+			"from 0 to 4: R1=ctx R3=imm10,min_value=10,max_value=10,min_align=2 R4=imm8,min_value=8,max_value=8,min_align=8 R10=fp",
+			"from 0 to 5: R1=ctx R3=imm10,min_value=10,max_value=10,min_align=2 R4=imm12,min_value=12,max_value=12,min_align=4 R10=fp",
+			"from 0 to 6: R1=ctx R3=imm10,min_value=10,max_value=10,min_align=2 R4=imm14,min_value=14,max_value=14,min_align=2 R10=fp",
+		},
+	},
+	{
+		.descr = "mul",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_3, 7),
+			BPF_ALU64_IMM(BPF_MUL, BPF_REG_3, 1),
+			BPF_ALU64_IMM(BPF_MUL, BPF_REG_3, 2),
+			BPF_ALU64_IMM(BPF_MUL, BPF_REG_3, 4),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+		.matches = {
+			"from 0 to 1: R1=ctx R3=imm7,min_value=7,max_value=7,min_align=1 R10=fp",
+			"from 0 to 2: R1=ctx R3=imm7,min_value=7,max_value=7,min_align=1 R10=fp",
+			"from 0 to 3: R1=ctx R3=imm14,min_value=14,max_value=14,min_align=2 R10=fp",
+			"from 0 to 4: R1=ctx R3=imm56,min_value=56,max_value=56,min_align=4 R10=fp",
+		},
+	},
+
+#define LOAD_UNKNOWN(DST_REG) \
+	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, \
+		    offsetof(struct __sk_buff, data)), \
+	BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1, \
+		    offsetof(struct __sk_buff, data_end)), \
+	BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), \
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8), \
+	BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_0, 1), \
+	BPF_EXIT_INSN(), \
+	BPF_LDX_MEM(BPF_B, DST_REG, BPF_REG_2, 0)
+
+	{
+		.descr = "unknown shift",
+		.insns = {
+			LOAD_UNKNOWN(BPF_REG_3),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
+			LOAD_UNKNOWN(BPF_REG_4),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_4, 5),
+			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+		.matches = {
+			"from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R10=fp",
+			"from 4 to 8: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv55,min_align=2 R10=fp",
+			"from 4 to 9: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv54,min_align=4 R10=fp",
+			"from 4 to 10: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv53,min_align=8 R10=fp",
+			"from 4 to 11: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv52,min_align=16 R10=fp",
+			"from 15 to 18: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv56 R10=fp",
+			"from 15 to 19: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv51,min_align=32 R10=fp",
+			"from 15 to 20: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv52,min_align=16 R10=fp",
+			"from 15 to 21: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv53,min_align=8 R10=fp",
+			"from 15 to 22: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv54,min_align=4 R10=fp",
+			"from 15 to 23: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv55,min_align=2 R10=fp",
+		},
+	},
+	{
+		.descr = "unknown mul",
+		.insns = {
+			LOAD_UNKNOWN(BPF_REG_3),
+			BPF_MOV64_REG(BPF_REG_4, BPF_REG_3),
+			BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 1),
+			BPF_MOV64_REG(BPF_REG_4, BPF_REG_3),
+			BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 2),
+			BPF_MOV64_REG(BPF_REG_4, BPF_REG_3),
+			BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 4),
+			BPF_MOV64_REG(BPF_REG_4, BPF_REG_3),
+			BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 8),
+			BPF_ALU64_IMM(BPF_MUL, BPF_REG_4, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+		.matches = {
+			"from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R10=fp",
+			"from 4 to 8: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv56 R10=fp",
+			"from 4 to 9: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv55,min_align=1 R10=fp",
+			"from 4 to 10: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv56 R10=fp",
+			"from 4 to 11: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv54,min_align=2 R10=fp",
+			"from 4 to 12: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv56 R10=fp",
+			"from 4 to 13: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv53,min_align=4 R10=fp",
+			"from 4 to 14: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv56 R10=fp",
+			"from 4 to 15: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv52,min_align=8 R10=fp",
+			"from 4 to 16: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R4=inv50,min_align=8 R10=fp"
+		},
+	},
+	{
+		.descr = "packet variable offset",
+		.insns = {
+			LOAD_UNKNOWN(BPF_REG_6),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_6, 2),
+
+			/* First, add a constant to the packet pointer,
+			 * then a variable with a known alignment.
+			 */
+			BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_5, 14),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_5, BPF_REG_6),
+			BPF_MOV64_REG(BPF_REG_4, BPF_REG_5),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+			BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_4, 1),
+			BPF_EXIT_INSN(),
+			BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_5, 0),
+
+			/* Now, test in the other direction.  Adding first
+			 * the variable offset, then the constant.
+			 */
+			BPF_MOV64_REG(BPF_REG_5, BPF_REG_2),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_5, BPF_REG_6),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_5, 14),
+			BPF_MOV64_REG(BPF_REG_4, BPF_REG_5),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 4),
+			BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_4, 1),
+			BPF_EXIT_INSN(),
+			BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_5, 0),
+
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+		.matches = {
+			/* Calculated offset in R4 has unknown value, but known
+			 * alignment of 4.
+			 */
+			"from 4 to 8: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R6=inv54,min_align=4 R10=fp",
+
+			/* Offset is added to packet pointer, resulting in known
+			 * auxiliary alignment and offset.
+			 */
+			"from 4 to 11: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R5=pkt(id=1,off=0,r=0),aux_off=14,aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+
+			/* At the time the word size load is performed from R5,
+			 * it's total offset is NET_IP_ALIGN + reg->off (0) +
+			 * reg->aux_off (14) which is 16.  Then the variable
+			 * offset is considered using reg->aux_off_align which
+			 * is 4 and meets the load's requirements.
+			 */
+			"from 13 to 15: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=pkt(id=1,off=4,r=4),aux_off=14,aux_off_align=4 R5=pkt(id=1,off=0,r=4),aux_off=14,aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+
+
+			/* Variable offset is added to R5 packet pointer,
+			 * resulting in auxiliary alignment of 4.
+			 */
+			"from 13 to 18: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv,aux_off=14,aux_off_align=4 R5=pkt(id=2,off=0,r=0),aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+
+			/* Constant offset is added to R5, resulting in
+			 * reg->off of 14.
+			 */
+			"from 13 to 19: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv,aux_off=14,aux_off_align=4 R5=pkt(id=2,off=14,r=0),aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+
+			/* At the time the word size load is performed from R5,
+			 * it's total offset is NET_IP_ALIGN + reg->off (14) which
+			 * is 16.  Then the variable offset is considered using
+			 * reg->aux_off_align which is 4 and meets the load's
+			 * requirements.
+			 */
+			"from 21 to 23: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=pkt(id=2,off=18,r=18),aux_off_align=4 R5=pkt(id=2,off=14,r=18),aux_off_align=4 R6=inv54,min_align=4 R10=fp",
+		},
+	},
+};
+
+static int probe_filter_length(const struct bpf_insn *fp)
+{
+	int len;
+
+	for (len = MAX_INSNS - 1; len > 0; --len)
+		if (fp[len].code != 0 || fp[len].imm != 0)
+			break;
+	return len + 1;
+}
+
+static char bpf_vlog[32768];
+
+static int do_test_single(struct bpf_align_test *test)
+{
+	struct bpf_insn *prog = test->insns;
+	int prog_type = test->prog_type;
+	int prog_len, i;
+	int fd_prog;
+	int ret;
+
+	prog_len = probe_filter_length(prog);
+	fd_prog = bpf_verify_program(prog_type ? : BPF_PROG_TYPE_SOCKET_FILTER,
+				     prog, prog_len, "GPL", 0,
+				     bpf_vlog, sizeof(bpf_vlog));
+	if (fd_prog < 0) {
+		printf("Failed to load program.\n");
+		printf("%s", bpf_vlog);
+		ret = 1;
+	} else {
+		ret = 0;
+		for (i = 0; i < MAX_MATCHES; i++) {
+			const char *t, *m = test->matches[i];
+
+			if (!m)
+				break;
+			t = strstr(bpf_vlog, m);
+			if (!t) {
+				printf("Failed to find match: %s\n", m);
+				ret = 1;
+				printf("%s", bpf_vlog);
+				break;
+			}
+		}
+		/* printf("%s", bpf_vlog); */
+		close(fd_prog);
+	}
+	return ret;
+}
+
+static int do_test(unsigned int from, unsigned int to)
+{
+	int all_pass = 0;
+	int all_fail = 0;
+	unsigned int i;
+
+	for (i = from; i < to; i++) {
+		struct bpf_align_test *test = &tests[i];
+		int fail;
+
+		printf("Test %3d: %s ... ",
+		       i, test->descr);
+		fail = do_test_single(test);
+		if (fail) {
+			all_fail++;
+			printf("FAIL\n");
+		} else {
+			all_pass++;
+			printf("PASS\n");
+		}
+	}
+	printf("Results: %d pass %d fail\n",
+	       all_pass, all_fail);
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	unsigned int from = 0, to = ARRAY_SIZE(tests);
+
+	if (argc == 3) {
+		unsigned int l = atoi(argv[argc - 2]);
+		unsigned int u = atoi(argv[argc - 1]);
+
+		if (l < to && u < to) {
+			from = l;
+			to   = u + 1;
+		}
+	} else if (argc == 2) {
+		unsigned int t = atoi(argv[argc - 1]);
+
+		if (t < to) {
+			from = t;
+			to   = t + 1;
+		}
+	}
+	return do_test(from, to);
+}
-- 
2.1.2.532.g19b5d50

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox