Netdev List
 help / color / mirror / Atom feed
* [patch net repost] ipv4: fix nexthop attlen check in fib_nh_match
From: Jiri Pirko @ 2014-10-13 14:34 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, edumazet, tgraf

fib_nh_match does not match nexthops correctly. Example:

ip route add 172.16.10/24 nexthop via 192.168.122.12 dev eth0 \
                          nexthop via 192.168.122.13 dev eth0
ip route del 172.16.10/24 nexthop via 192.168.122.14 dev eth0 \
                          nexthop via 192.168.122.15 dev eth0

Del command is successful and route is removed. After this patch
applied, the route is correctly matched and result is:
RTNETLINK answers: No such process

Please consider this for stable trees as well.

Fixes: 4e902c57417c4 ("[IPv4]: FIB configuration using struct fib_config")
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Eric Dumazet <edumazet@google.com>
---
reposted with example (it was missing for some reason in the original post)

 net/ipv4/fib_semantics.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5b6efb3..f99f41b 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -537,7 +537,7 @@ int fib_nh_match(struct fib_config *cfg, struct fib_info *fi)
 			return 1;
 
 		attrlen = rtnh_attrlen(rtnh);
-		if (attrlen < 0) {
+		if (attrlen > 0) {
 			struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
 
 			nla = nla_find(attrs, attrlen, RTA_GATEWAY);
-- 
1.9.3

^ permalink raw reply related

* Re: [patch net] ipv4: fix nexthop attlen check in fib_nh_match
From: Jiri Pirko @ 2014-10-13 14:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Thomas Graf, netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1413202966.9362.80.camel@edumazet-glaptop2.roam.corp.google.com>

Mon, Oct 13, 2014 at 02:22:46PM CEST, eric.dumazet@gmail.com wrote:
>On Mon, 2014-10-13 at 11:54 +0200, Jiri Pirko wrote:
>> fib_nh_match does not match nexthops correctly. Example:
>> 
>> This command is not successful and route is removed. After this patch
>> applied, the route is correctly matched and result is:
>> RTNETLINK answers: No such process
>> 
>> Please consider this for stable trees as well.
>> 
>> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>> ---
>>  net/ipv4/fib_semantics.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index 5b6efb3..f99f41b 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -537,7 +537,7 @@ int fib_nh_match(struct fib_config *cfg, struct fib_info *fi)
>>  			return 1;
>>  
>>  		attrlen = rtnh_attrlen(rtnh);
>> -		if (attrlen < 0) {
>> +		if (attrlen > 0) {
>>  			struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
>>  
>>  			nla = nla_find(attrs, attrlen, RTA_GATEWAY);
>
>Fixes: 4e902c57417c4 ("[IPv4]: FIB configuration using struct fib_config")
>
>Good catch, thanks !
>
>Acked-by: Eric Dumazet <edumazet@google.com>


Thanks Eric. I reposted with your ack, fixes line and missing example.

^ permalink raw reply

* Re: Regarding tx-nocache-copy in the Sheevaplug
From: Eric Dumazet @ 2014-10-13 14:49 UTC (permalink / raw)
  To: Lluís Batlle i Rossell, Ezequiel Garcia
  Cc: Andrew Lunn, linux-kernel, netdev, Carles Pagès,
	linux-arm-kernel
In-Reply-To: <20141013143138.GJ1972@vicerveza.homeunix.net>

On Mon, 2014-10-13 at 16:31 +0200, Lluís Batlle i Rossell wrote:
> Enabling tx offload and disabling tx-nocache-copy, making the machine *send* a
> lot of ssh traffic (sftp for example) makes ssh fail HMAC. It's quite easy to
> reproduce here.
> 
> As for the hardware, it's an old sheevaplug board.


Have you tried disabling TSO only, and are you using the latest kernel ?

Ezequiel Garcia added lot of changes recently.

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: kill nf_send_reset6() from include/net/netfilter/ipv6/nf_reject.h
From: Josh Boyer @ 2014-10-13 15:41 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, David Miller, netdev, Eric Dumazet
In-Reply-To: <1412879261-25045-2-git-send-email-pablo@netfilter.org>

[-- Attachment #1: Type: text/plain, Size: 2111 bytes --]

On Thu, Oct 9, 2014 at 2:27 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> nf_send_reset6() now resides in net/ipv6/netfilter/nf_reject_ipv6.c
>
> Fixes: c8d7b98 ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules")
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> Acked-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/net/netfilter/ipv6/nf_reject.h |  157 +-------------------------------
>  1 file changed, 2 insertions(+), 155 deletions(-)

Hi All,

This morning I was testing a kernel build from Linus' tree as of Linux
v3.17-7639-g90eac7eee2f4.  When I rebooted my test machines, I
couldn't ssh back into any of them.  I poked around a bit and noticed
that it seems the iptables rules weren't getting loaded properly.
Traffic out worked fine, and I could ping the machine, but other
incoming traffic was blocked.  Then I saw that the ip6t_REJECT and
ip6t_rpfilter modules were not being loaded on the bad kernel.
Looking in dmesg I see:

[   14.619028] nf_reject_ipv6: module license 'unspecified' taints kernel.
[   14.619125] nf_reject_ipv6: Unknown symbol ip6_local_out (err 0)

So I did a git bisect and it pointed to this patch:

[jwboyer@obiwan linux]$ git bisect bad
91c1a09b33c902e20e09d9742560cc238a714de5 is the first bad commit
commit 91c1a09b33c902e20e09d9742560cc238a714de5
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date:   Tue Oct 7 18:48:12 2014 +0200

    netfilter: kill nf_send_reset6() from include/net/netfilter/ipv6/nf_reject.h

    nf_send_reset6() now resides in net/ipv6/netfilter/nf_reject_ipv6.c

    Fixes: c8d7b98 ("netfilter: move nf_send_resetX() code to
nf_reject_ipvX modules")
    Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Acked-by: Eric Dumazet <edumazet@google.com>

:040000 040000 ab5b61e7ba562e0b22d781a4322ed73c657878dc
71822a7e785c408dd55a6c4600144573de500512 M      include
[jwboyer@obiwan linux]$

I've attached the bisect log.

Perhaps one too many header files were trimmed in this case?

josh

[-- Attachment #2: BISECT_LOG --]
[-- Type: application/octet-stream, Size: 2277 bytes --]

git bisect start
# bad: [90eac7eee2f4257644dcfb9d22348fded7c24afd] Merge tag 'ftracetest-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
git bisect bad 90eac7eee2f4257644dcfb9d22348fded7c24afd
# good: [c798360cd1438090d51eeaa8e67985da11362eba] Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
git bisect good c798360cd1438090d51eeaa8e67985da11362eba
# good: [a2ce35273c2f1aa0dcddd8822681d64ee5f31852] Merge tag 'sound-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good a2ce35273c2f1aa0dcddd8822681d64ee5f31852
# good: [90d0c376f5ee1927327b267faf15bf970476f09e] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
git bisect good 90d0c376f5ee1927327b267faf15bf970476f09e
# good: [d53ba6b3bba33432cc37b7101a86f8f3392c46e7] cxl: Fix afu_read() not doing finish_wait() on signal or non-blocking
git bisect good d53ba6b3bba33432cc37b7101a86f8f3392c46e7
# good: [052db7ec86dff26f734031c3ef5c2c03a94af0af] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
git bisect good 052db7ec86dff26f734031c3ef5c2c03a94af0af
# bad: [2403077d47991a8385789779ee5fc90b003f9fbe] Merge branch 'xgene'
git bisect bad 2403077d47991a8385789779ee5fc90b003f9fbe
# good: [b71b12dce200e4709bd9f709e71c84dcb2cf8a82] networking: fm10k: Fix build failure
git bisect good b71b12dce200e4709bd9f709e71c84dcb2cf8a82
# good: [4511a4a50e1a8757f771681c3e92dbf5a928eeac] Merge tag 'master-2014-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
git bisect good 4511a4a50e1a8757f771681c3e92dbf5a928eeac
# bad: [b61d18904e2a99ed16b6e97d5419f1db19e08bd2] MAINTAINERS: Update APM X-Gene section
git bisect bad b61d18904e2a99ed16b6e97d5419f1db19e08bd2
# bad: [f0d1f04f0a2f662b6b617e24d115fddcf6ef8723] netfilter: fix wrong arithmetics regarding NFT_REJECT_ICMPX_MAX
git bisect bad f0d1f04f0a2f662b6b617e24d115fddcf6ef8723
# bad: [91c1a09b33c902e20e09d9742560cc238a714de5] netfilter: kill nf_send_reset6() from include/net/netfilter/ipv6/nf_reject.h
git bisect bad 91c1a09b33c902e20e09d9742560cc238a714de5
# first bad commit: [91c1a09b33c902e20e09d9742560cc238a714de5] netfilter: kill nf_send_reset6() from include/net/netfilter/ipv6/nf_reject.h

^ permalink raw reply

* Re: Regarding tx-nocache-copy in the Sheevaplug
From: Lluís Batlle i Rossell @ 2014-10-13 15:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ezequiel Garcia, Andrew Lunn, linux-kernel, netdev,
	Carles Pagès, linux-arm-kernel
In-Reply-To: <1413211759.9362.103.camel@edumazet-glaptop2.roam.corp.google.com>

On Mon, Oct 13, 2014 at 07:49:19AM -0700, Eric Dumazet wrote:
> On Mon, 2014-10-13 at 16:31 +0200, Lluís Batlle i Rossell wrote:
> > Enabling tx offload and disabling tx-nocache-copy, making the machine *send* a
> > lot of ssh traffic (sftp for example) makes ssh fail HMAC. It's quite easy to
> > reproduce here.
> > 
> > As for the hardware, it's an old sheevaplug board.
> 
> 
> Have you tried disabling TSO only, and are you using the latest kernel ?
> 
> Ezequiel Garcia added lot of changes recently.
> 
> 

Is TSO TCP segmentation offload? It's disabled. The kernel is 3.16.3 (debian).
https://packages.debian.org/testing/kernel/linux-image-3.16-2-kirkwood

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: kill nf_send_reset6() from include/net/netfilter/ipv6/nf_reject.h
From: Florian Westphal @ 2014-10-13 15:55 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Pablo Neira Ayuso, netfilter-devel, David Miller, netdev,
	Eric Dumazet
In-Reply-To: <CA+5PVA7-k-HFGUUeZ3LTCmVcmR1h_=8W-_PzRO4-2dcc_cRHaQ@mail.gmail.com>

Josh Boyer <jwboyer@fedoraproject.org> wrote:
> On Thu, Oct 9, 2014 at 2:27 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > nf_send_reset6() now resides in net/ipv6/netfilter/nf_reject_ipv6.c
> >
> > Fixes: c8d7b98 ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules")
> > Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> > Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> > Acked-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  include/net/netfilter/ipv6/nf_reject.h |  157 +-------------------------------
> >  1 file changed, 2 insertions(+), 155 deletions(-)
> 
> Hi All,
> 
> This morning I was testing a kernel build from Linus' tree as of Linux
> v3.17-7639-g90eac7eee2f4.  When I rebooted my test machines, I
> couldn't ssh back into any of them.  I poked around a bit and noticed
> that it seems the iptables rules weren't getting loaded properly.
> Traffic out worked fine, and I could ping the machine, but other
> incoming traffic was blocked.  Then I saw that the ip6t_REJECT and
> ip6t_rpfilter modules were not being loaded on the bad kernel.
> Looking in dmesg I see:
> 
> [   14.619028] nf_reject_ipv6: module license 'unspecified' taints kernel.
> [   14.619125] nf_reject_ipv6: Unknown symbol ip6_local_out (err 0)

Ouch. ip6_local_is EXPORT_SYMBOL_GPL.

http://patchwork.ozlabs.org/patch/398501/

should fix this.

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: kill nf_send_reset6() from include/net/netfilter/ipv6/nf_reject.h
From: Josh Boyer @ 2014-10-13 16:01 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Pablo Neira Ayuso, netfilter-devel, David Miller, netdev,
	Eric Dumazet
In-Reply-To: <20141013155510.GA26105@breakpoint.cc>

On Mon, Oct 13, 2014 at 11:55 AM, Florian Westphal <fw@strlen.de> wrote:
> Josh Boyer <jwboyer@fedoraproject.org> wrote:
>> On Thu, Oct 9, 2014 at 2:27 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > nf_send_reset6() now resides in net/ipv6/netfilter/nf_reject_ipv6.c
>> >
>> > Fixes: c8d7b98 ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules")
>> > Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
>> > Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
>> > Acked-by: Eric Dumazet <edumazet@google.com>
>> > ---
>> >  include/net/netfilter/ipv6/nf_reject.h |  157 +-------------------------------
>> >  1 file changed, 2 insertions(+), 155 deletions(-)
>>
>> Hi All,
>>
>> This morning I was testing a kernel build from Linus' tree as of Linux
>> v3.17-7639-g90eac7eee2f4.  When I rebooted my test machines, I
>> couldn't ssh back into any of them.  I poked around a bit and noticed
>> that it seems the iptables rules weren't getting loaded properly.
>> Traffic out worked fine, and I could ping the machine, but other
>> incoming traffic was blocked.  Then I saw that the ip6t_REJECT and
>> ip6t_rpfilter modules were not being loaded on the bad kernel.
>> Looking in dmesg I see:
>>
>> [   14.619028] nf_reject_ipv6: module license 'unspecified' taints kernel.
>> [   14.619125] nf_reject_ipv6: Unknown symbol ip6_local_out (err 0)
>
> Ouch. ip6_local_is EXPORT_SYMBOL_GPL.
>
> http://patchwork.ozlabs.org/patch/398501/
>
> should fix this.

I believe you're correct.  I dug in some more and I was just about to
send a similar patch.  I'll add it on top of my builds and test it
out.  Thanks for the pointer.

josh

^ permalink raw reply

* Re: [PATCH] ipv4: dst_entry leak in ip_append_data()
From: David Miller @ 2014-10-13 16:03 UTC (permalink / raw)
  To: vvs; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber, eric.dumazet
In-Reply-To: <543BA6BE.3040509@parallels.com>

From: Vasily Averin <vvs@parallels.com>
Date: Mon, 13 Oct 2014 14:17:34 +0400

> @@ -1152,9 +1161,14 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
>  		transhdrlen = 0;
>  	}
>  
> -	return __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base,
> +	err = __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base,
>  				sk_page_frag(sk), getfrag,
>  				from, length, transhdrlen, flags);

If you are changing the column of the openning parenthesis of the function
call, you must adjust the indentation of the arguments on the subsequent
lines so that they start exactly at the first column after that openning
parenthesis.

Thanks.

^ permalink raw reply

* [PATCH] nf_conntrack_proto_tcp: allow server to become a client in TW handling
From: Marcelo Ricardo Leitner @ 2014-10-13 16:09 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <5419B546.40702@redhat.com>

When a port that was used to listen for inbound connections gets closed
and reused for outgoing connections (like rsh ends up doing for stderr
flow), current we may reject the SYN/ACK packet for the new connection
because tcp_conntracks states forbirds a port to become a client while
there is still a TIME_WAIT entry in there for it.

As TCP may expire the TIME_WAIT socket in 60s and conntrack's timeout
for it is 120s, there is a ~60s window that the application can end up
opening a port that conntrack will end up blocking.

This patch fixes this by simply allowing such state transition: if we
see a SYN, in TIME_WAIT state, on REPLY direction, move it to sSS. Note
that the rest of the code already handles this situation, more
specificly in tcp_packet(), first switch clause.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
---
 net/netfilter/nf_conntrack_proto_tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 44d1ea32570a07338dc39f34624bd823b6f76916..d87b6423ffb21e0f8f9b6ef25ef51c1cb5f54ad6 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -213,7 +213,7 @@ static const u8 tcp_conntracks[2][6][TCP_CONNTRACK_MAX] = {
 	{
 /* REPLY */
 /* 	     sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2	*/
-/*syn*/	   { sIV, sS2, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sS2 },
+/*syn*/	   { sIV, sS2, sIV, sIV, sIV, sIV, sIV, sSS, sIV, sS2 },
 /*
  *	sNO -> sIV	Never reached.
  *	sSS -> sS2	Simultaneous open
@@ -223,7 +223,7 @@ static const u8 tcp_conntracks[2][6][TCP_CONNTRACK_MAX] = {
  *	sFW -> sIV
  *	sCW -> sIV
  *	sLA -> sIV
- *	sTW -> sIV	Reopened connection, but server may not do it.
+ *	sTW -> sSS	Reopened connection, but server may have switched role
  *	sCL -> sIV
  */
 /* 	     sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2	*/
-- 
1.9.3


^ permalink raw reply related

* [PATCH stable 3.2 3.4] ipv4: disable bh while doing route gc
From: Marcelo Ricardo Leitner @ 2014-10-13 16:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, hannes

Further tests revealed that after moving the garbage collector to a work
queue and protecting it with a spinlock may leave the system prone to
soft lockups if bottom half gets very busy.

It was reproced with a set of firewall rules that REJECTed packets. If
the NIC bottom half handler ends up running on the same CPU that is
running the garbage collector on a very large cache, the garbage
collector will not be able to do its job due to the amount of work
needed for handling the REJECTs and also won't reschedule.

The fix is to disable bottom half during the garbage collecting, as it
already was in the first place (most calls to it came from softirqs).

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---

Notes:
    Hi Dave,
    
    This is needed for stables 3.2 and 3.4, as those are the ones that we
    applied the previous patches:
        ipv4: move route garbage collector to work queue
        ipv4: avoid parallel route cache gc executions
    
    Thanks.

 net/ipv4/route.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 9e7909eef8d10107008f8d629f9f2d75fde52eb2..6c34bc98bce7147cf6c242439f2afeb9adf28c72 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -998,7 +998,7 @@ static void __do_rt_garbage_collect(int elasticity, int min_interval)
 	 * do not make it too frequently.
 	 */
 
-	spin_lock(&rt_gc_lock);
+	spin_lock_bh(&rt_gc_lock);
 
 	RT_CACHE_STAT_INC(gc_total);
 
@@ -1101,7 +1101,7 @@ work_done:
 	    dst_entries_get_slow(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh)
 		expire = ip_rt_gc_timeout;
 out:
-	spin_unlock(&rt_gc_lock);
+	spin_unlock_bh(&rt_gc_lock);
 }
 
 static void __rt_garbage_collect(struct work_struct *w)
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next] tg3: Add skb->xmit_more support
From: Prashant Sreedharan @ 2014-10-13 16:21 UTC (permalink / raw)
  To: davem; +Cc: netdev, dborkman, mchan, Prashant Sreedharan

Ring TX doorbell only if xmit_more is not set or the queue is stopped.

Suggested-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/ethernet/broadcom/tg3.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index ba49948..dbb41c1 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -8099,9 +8099,6 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Sync BD data before updating mailbox */
 	wmb();
 
-	/* Packets are ready, update Tx producer idx local and on card. */
-	tw32_tx_mbox(tnapi->prodmbox, entry);
-
 	tnapi->tx_prod = entry;
 	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
 		netif_tx_stop_queue(txq);
@@ -8116,7 +8113,12 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			netif_tx_wake_queue(txq);
 	}
 
-	mmiowb();
+	if (!skb->xmit_more || netif_xmit_stopped(txq)) {
+		/* Packets are ready, update Tx producer idx on card. */
+		tw32_tx_mbox(tnapi->prodmbox, entry);
+		mmiowb();
+	}
+
 	return NETDEV_TX_OK;
 
 dma_error:
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net-next] tg3: Add skb->xmit_more support
From: Eric Dumazet @ 2014-10-13 16:48 UTC (permalink / raw)
  To: Prashant Sreedharan; +Cc: davem, netdev, dborkman, mchan
In-Reply-To: <1413217302-15396-1-git-send-email-prashant@broadcom.com>

On Mon, 2014-10-13 at 09:21 -0700, Prashant Sreedharan wrote:
> Ring TX doorbell only if xmit_more is not set or the queue is stopped.
> 
> Suggested-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>
> ---
>  drivers/net/ethernet/broadcom/tg3.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)

Have you noticed any performance change ?

I did the patch for bnx2x but got no real difference...

Thanks

^ permalink raw reply

* [net PATCH 0/3] bug fixes for davinci_cpdma and cpsw drivers
From: Mugunthan V N @ 2014-10-13 16:51 UTC (permalink / raw)
  To: netdev; +Cc: davem, Mugunthan V N


Mugunthan V N (3):
  drivers: net: davinci_cpdma: remove kfree on objects allocated with
    devm_* apis
  drivers: net: davinci_cpdma: remove spinlock as SOFTIRQ-unsafe lock
    order detected
  drivers: net: cpsw: remove child devices while driver detach

 drivers/net/ethernet/ti/cpsw.c          | 10 ++++++++++
 drivers/net/ethernet/ti/davinci_cpdma.c |  5 -----
 2 files changed, 10 insertions(+), 5 deletions(-)

-- 
2.1.1.332.g0bf7dd6

^ permalink raw reply

* [net PATCH 1/3] drivers: net: davinci_cpdma: remove kfree on objects allocated with devm_* apis
From: Mugunthan V N @ 2014-10-13 16:51 UTC (permalink / raw)
  To: netdev; +Cc: davem, Mugunthan V N
In-Reply-To: <1413219067-15328-1-git-send-email-mugunthanvnm@ti.com>

memories allocated with devm_* apis must not be freed with kfree apis,
so removing the kfree calls

Fixes: e194312854ed ('drivers: net: davinci_cpdma: Convert kzalloc() to devm_kzalloc().')

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 4a000f6..32dc289 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -561,7 +561,6 @@ int cpdma_chan_destroy(struct cpdma_chan *chan)
 		cpdma_chan_stop(chan);
 	ctlr->channels[chan->chan_num] = NULL;
 	spin_unlock_irqrestore(&ctlr->lock, flags);
-	kfree(chan);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(cpdma_chan_destroy);
-- 
2.1.1.332.g0bf7dd6

^ permalink raw reply related

* [net PATCH 2/3] drivers: net: davinci_cpdma: remove spinlock as SOFTIRQ-unsafe lock order detected
From: Mugunthan V N @ 2014-10-13 16:51 UTC (permalink / raw)
  To: netdev; +Cc: davem, Mugunthan V N
In-Reply-To: <1413219067-15328-1-git-send-email-mugunthanvnm@ti.com>

remove spinlock in cpdma_desc_pool_destroy() as there is no active cpdma
channel and iounmap should be called without auquiring lock.

root@dra7xx-evm:~# modprobe -r ti_cpsw
[   50.539743]
[   50.541312] ======================================================
[   50.547796] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[   50.554826] 3.14.19-02124-g95c5b7b #308 Not tainted
[   50.559939] ------------------------------------------------------
[   50.566416] modprobe/1921 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[   50.573347]  (vmap_area_lock){+.+...}, at: [<c01127fc>] find_vmap_area+0x10/0x6c
[   50.581132]
[   50.581132] and this task is already holding:
[   50.587249]  (&(&pool->lock)->rlock#2){..-...}, at: [<bf017c74>] cpdma_ctlr_destroy+0x5c/0x114 [davinci_cpdma]
[   50.597766] which would create a new lock dependency:
[   50.603048]  (&(&pool->lock)->rlock#2){..-...} -> (vmap_area_lock){+.+...}
[   50.610296]
[   50.610296] but this new dependency connects a SOFTIRQ-irq-safe lock:
[   50.618601]  (&(&pool->lock)->rlock#2){..-...}
... which became SOFTIRQ-irq-safe at:
[   50.626829]   [<c06585a4>] _raw_spin_lock_irqsave+0x38/0x4c
[   50.632677]   [<bf01773c>] cpdma_desc_free.constprop.7+0x28/0x58 [davinci_cpdma]
[   50.640437]   [<bf0177e8>] __cpdma_chan_free+0x7c/0xa8 [davinci_cpdma]
[   50.647289]   [<bf017908>] __cpdma_chan_process+0xf4/0x134 [davinci_cpdma]
[   50.654512]   [<bf017984>] cpdma_chan_process+0x3c/0x54 [davinci_cpdma]
[   50.661455]   [<bf0277e8>] cpsw_poll+0x14/0xa8 [ti_cpsw]
[   50.667038]   [<c05844f4>] net_rx_action+0xc0/0x1e8
[   50.672150]   [<c0048234>] __do_softirq+0xcc/0x304
[   50.677183]   [<c004873c>] irq_exit+0xa8/0xfc
[   50.681751]   [<c000eeac>] handle_IRQ+0x50/0xb0
[   50.686513]   [<c0008638>] gic_handle_irq+0x28/0x5c
[   50.691628]   [<c06590a4>] __irq_svc+0x44/0x5c
[   50.696289]   [<c0658ab4>] _raw_spin_unlock_irqrestore+0x34/0x44
[   50.702591]   [<c065a9c4>] do_page_fault.part.9+0x144/0x3c4
[   50.708433]   [<c065acb8>] do_page_fault+0x74/0x84
[   50.713453]   [<c00083dc>] do_DataAbort+0x34/0x98
[   50.718391]   [<c065923c>] __dabt_usr+0x3c/0x40
[   50.723148]
[   50.723148] to a SOFTIRQ-irq-unsafe lock:
[   50.728893]  (vmap_area_lock){+.+...}
... which became SOFTIRQ-irq-unsafe at:
[   50.736476] ...  [<c06584e8>] _raw_spin_lock+0x28/0x38
[   50.741876]   [<c011376c>] alloc_vmap_area.isra.28+0xb8/0x300
[   50.747908]   [<c0113a44>] __get_vm_area_node.isra.29+0x90/0x134
[   50.754210]   [<c011486c>] get_vm_area_caller+0x3c/0x48
[   50.759692]   [<c0114be0>] vmap+0x40/0x78
[   50.763900]   [<c09442f0>] check_writebuffer_bugs+0x54/0x1a0
[   50.769835]   [<c093eac0>] start_kernel+0x320/0x388
[   50.774952]   [<80008074>] 0x80008074
[   50.778793]
[   50.778793] other info that might help us debug this:
[   50.778793]
[   50.787181]  Possible interrupt unsafe locking scenario:
[   50.787181]
[   50.794295]        CPU0                    CPU1
[   50.799042]        ----                    ----
[   50.803785]   lock(vmap_area_lock);
[   50.807446]                                local_irq_disable();
[   50.813652]                                lock(&(&pool->lock)->rlock#2);
[   50.820782]                                lock(vmap_area_lock);
[   50.827086]   <Interrupt>
[   50.829823]     lock(&(&pool->lock)->rlock#2);
[   50.834490]
[   50.834490]  *** DEADLOCK ***
[   50.834490]
[   50.840695] 4 locks held by modprobe/1921:
[   50.844981]  #0:  (&__lockdep_no_validate__){......}, at: [<c03e53e8>] driver_detach+0x44/0xb8
[   50.854038]  #1:  (&__lockdep_no_validate__){......}, at: [<c03e53f4>] driver_detach+0x50/0xb8
[   50.863102]  #2:  (&(&ctlr->lock)->rlock){......}, at: [<bf017c34>] cpdma_ctlr_destroy+0x1c/0x114 [davinci_cpdma]
[   50.873890]  #3:  (&(&pool->lock)->rlock#2){..-...}, at: [<bf017c74>] cpdma_ctlr_destroy+0x5c/0x114 [davinci_cpdma]
[   50.884871]
the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[   50.892827] -> (&(&pool->lock)->rlock#2){..-...} ops: 167 {
[   50.898703]    IN-SOFTIRQ-W at:
[   50.901995]                     [<c06585a4>] _raw_spin_lock_irqsave+0x38/0x4c
[   50.909476]                     [<bf01773c>] cpdma_desc_free.constprop.7+0x28/0x58 [davinci_cpdma]
[   50.918878]                     [<bf0177e8>] __cpdma_chan_free+0x7c/0xa8 [davinci_cpdma]
[   50.927366]                     [<bf017908>] __cpdma_chan_process+0xf4/0x134 [davinci_cpdma]
[   50.936218]                     [<bf017984>] cpdma_chan_process+0x3c/0x54 [davinci_cpdma]
[   50.944794]                     [<bf0277e8>] cpsw_poll+0x14/0xa8 [ti_cpsw]
[   50.952009]                     [<c05844f4>] net_rx_action+0xc0/0x1e8
[   50.958765]                     [<c0048234>] __do_softirq+0xcc/0x304
[   50.965432]                     [<c004873c>] irq_exit+0xa8/0xfc
[   50.971635]                     [<c000eeac>] handle_IRQ+0x50/0xb0
[   50.978035]                     [<c0008638>] gic_handle_irq+0x28/0x5c
[   50.984788]                     [<c06590a4>] __irq_svc+0x44/0x5c
[   50.991085]                     [<c0658ab4>] _raw_spin_unlock_irqrestore+0x34/0x44
[   50.999023]                     [<c065a9c4>] do_page_fault.part.9+0x144/0x3c4
[   51.006510]                     [<c065acb8>] do_page_fault+0x74/0x84
[   51.013171]                     [<c00083dc>] do_DataAbort+0x34/0x98
[   51.019738]                     [<c065923c>] __dabt_usr+0x3c/0x40
[   51.026129]    INITIAL USE at:
[   51.029335]                    [<c06585a4>] _raw_spin_lock_irqsave+0x38/0x4c
[   51.036729]                    [<bf017d78>] cpdma_chan_submit+0x4c/0x2f0 [davinci_cpdma]
[   51.045225]                    [<bf02863c>] cpsw_ndo_open+0x378/0x6bc [ti_cpsw]
[   51.052897]                    [<c058747c>] __dev_open+0x9c/0x104
[   51.059287]                    [<c05876ec>] __dev_change_flags+0x88/0x160
[   51.066420]                    [<c05877e4>] dev_change_flags+0x18/0x48
[   51.073270]                    [<c05ed51c>] devinet_ioctl+0x61c/0x6e0
[   51.080029]                    [<c056ee54>] sock_ioctl+0x5c/0x298
[   51.086418]                    [<c01350a4>] do_vfs_ioctl+0x78/0x61c
[   51.092993]                    [<c01356ac>] SyS_ioctl+0x64/0x74
[   51.099200]                    [<c000e580>] ret_fast_syscall+0x0/0x48
[   51.105956]  }
[   51.107696]  ... key      at: [<bf019000>] __key.21312+0x0/0xfffff650 [davinci_cpdma]
[   51.115912]  ... acquired at:
[   51.119019]    [<c00899ac>] lock_acquire+0x9c/0x104
[   51.124138]    [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.129341]    [<c01127fc>] find_vmap_area+0x10/0x6c
[   51.134547]    [<c0114960>] remove_vm_area+0x8/0x6c
[   51.139659]    [<c0114a7c>] __vunmap+0x20/0xf8
[   51.144318]    [<c001c350>] __arm_iounmap+0x10/0x18
[   51.149440]    [<bf017d08>] cpdma_ctlr_destroy+0xf0/0x114 [davinci_cpdma]
[   51.156560]    [<bf026294>] cpsw_remove+0x48/0x8c [ti_cpsw]
[   51.162407]    [<c03e62c8>] platform_drv_remove+0x18/0x1c
[   51.168063]    [<c03e4c44>] __device_release_driver+0x70/0xc8
[   51.174094]    [<c03e5458>] driver_detach+0xb4/0xb8
[   51.179212]    [<c03e4a6c>] bus_remove_driver+0x4c/0x90
[   51.184693]    [<c00b024c>] SyS_delete_module+0x10c/0x198
[   51.190355]    [<c000e580>] ret_fast_syscall+0x0/0x48
[   51.195661]
[   51.197217]
the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[   51.205986] -> (vmap_area_lock){+.+...} ops: 520 {
[   51.211032]    HARDIRQ-ON-W at:
[   51.214321]                     [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.221090]                     [<c011376c>] alloc_vmap_area.isra.28+0xb8/0x300
[   51.228750]                     [<c0113a44>] __get_vm_area_node.isra.29+0x90/0x134
[   51.236690]                     [<c011486c>] get_vm_area_caller+0x3c/0x48
[   51.243811]                     [<c0114be0>] vmap+0x40/0x78
[   51.249654]                     [<c09442f0>] check_writebuffer_bugs+0x54/0x1a0
[   51.257239]                     [<c093eac0>] start_kernel+0x320/0x388
[   51.263994]                     [<80008074>] 0x80008074
[   51.269474]    SOFTIRQ-ON-W at:
[   51.272769]                     [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.279525]                     [<c011376c>] alloc_vmap_area.isra.28+0xb8/0x300
[   51.287190]                     [<c0113a44>] __get_vm_area_node.isra.29+0x90/0x134
[   51.295126]                     [<c011486c>] get_vm_area_caller+0x3c/0x48
[   51.302245]                     [<c0114be0>] vmap+0x40/0x78
[   51.308094]                     [<c09442f0>] check_writebuffer_bugs+0x54/0x1a0
[   51.315669]                     [<c093eac0>] start_kernel+0x320/0x388
[   51.322423]                     [<80008074>] 0x80008074
[   51.327906]    INITIAL USE at:
[   51.331112]                    [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.337775]                    [<c011376c>] alloc_vmap_area.isra.28+0xb8/0x300
[   51.345352]                    [<c0113a44>] __get_vm_area_node.isra.29+0x90/0x134
[   51.353197]                    [<c011486c>] get_vm_area_caller+0x3c/0x48
[   51.360224]                    [<c0114be0>] vmap+0x40/0x78
[   51.365977]                    [<c09442f0>] check_writebuffer_bugs+0x54/0x1a0
[   51.373464]                    [<c093eac0>] start_kernel+0x320/0x388
[   51.380131]                    [<80008074>] 0x80008074
[   51.385517]  }
[   51.387260]  ... key      at: [<c0a66948>] vmap_area_lock+0x10/0x20
[   51.393841]  ... acquired at:
[   51.396945]    [<c00899ac>] lock_acquire+0x9c/0x104
[   51.402060]    [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.407266]    [<c01127fc>] find_vmap_area+0x10/0x6c
[   51.412478]    [<c0114960>] remove_vm_area+0x8/0x6c
[   51.417592]    [<c0114a7c>] __vunmap+0x20/0xf8
[   51.422252]    [<c001c350>] __arm_iounmap+0x10/0x18
[   51.427369]    [<bf017d08>] cpdma_ctlr_destroy+0xf0/0x114 [davinci_cpdma]
[   51.434487]    [<bf026294>] cpsw_remove+0x48/0x8c [ti_cpsw]
[   51.440336]    [<c03e62c8>] platform_drv_remove+0x18/0x1c
[   51.446000]    [<c03e4c44>] __device_release_driver+0x70/0xc8
[   51.452031]    [<c03e5458>] driver_detach+0xb4/0xb8
[   51.457147]    [<c03e4a6c>] bus_remove_driver+0x4c/0x90
[   51.462628]    [<c00b024c>] SyS_delete_module+0x10c/0x198
[   51.468289]    [<c000e580>] ret_fast_syscall+0x0/0x48
[   51.473584]
[   51.475140]
[   51.475140] stack backtrace:
[   51.479703] CPU: 0 PID: 1921 Comm: modprobe Not tainted 3.14.19-02124-g95c5b7b #308
[   51.487744] [<c0016090>] (unwind_backtrace) from [<c0012060>] (show_stack+0x10/0x14)
[   51.495865] [<c0012060>] (show_stack) from [<c0652a20>] (dump_stack+0x78/0x94)
[   51.503444] [<c0652a20>] (dump_stack) from [<c0086f18>] (check_usage+0x408/0x594)
[   51.511293] [<c0086f18>] (check_usage) from [<c00870f8>] (check_irq_usage+0x54/0xb0)
[   51.519416] [<c00870f8>] (check_irq_usage) from [<c0088724>] (__lock_acquire+0xe54/0x1b90)
[   51.528077] [<c0088724>] (__lock_acquire) from [<c00899ac>] (lock_acquire+0x9c/0x104)
[   51.536291] [<c00899ac>] (lock_acquire) from [<c06584e8>] (_raw_spin_lock+0x28/0x38)
[   51.544417] [<c06584e8>] (_raw_spin_lock) from [<c01127fc>] (find_vmap_area+0x10/0x6c)
[   51.552726] [<c01127fc>] (find_vmap_area) from [<c0114960>] (remove_vm_area+0x8/0x6c)
[   51.560935] [<c0114960>] (remove_vm_area) from [<c0114a7c>] (__vunmap+0x20/0xf8)
[   51.568693] [<c0114a7c>] (__vunmap) from [<c001c350>] (__arm_iounmap+0x10/0x18)
[   51.576362] [<c001c350>] (__arm_iounmap) from [<bf017d08>] (cpdma_ctlr_destroy+0xf0/0x114 [davinci_cpdma])
[   51.586494] [<bf017d08>] (cpdma_ctlr_destroy [davinci_cpdma]) from [<bf026294>] (cpsw_remove+0x48/0x8c [ti_cpsw])
[   51.597261] [<bf026294>] (cpsw_remove [ti_cpsw]) from [<c03e62c8>] (platform_drv_remove+0x18/0x1c)
[   51.606659] [<c03e62c8>] (platform_drv_remove) from [<c03e4c44>] (__device_release_driver+0x70/0xc8)
[   51.616237] [<c03e4c44>] (__device_release_driver) from [<c03e5458>] (driver_detach+0xb4/0xb8)
[   51.625264] [<c03e5458>] (driver_detach) from [<c03e4a6c>] (bus_remove_driver+0x4c/0x90)
[   51.633749] [<c03e4a6c>] (bus_remove_driver) from [<c00b024c>] (SyS_delete_module+0x10c/0x198)
[   51.642781] [<c00b024c>] (SyS_delete_module) from [<c000e580>] (ret_fast_syscall+0x0/0x48)

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 32dc289..657b65b 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -193,12 +193,9 @@ fail:
 
 static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
 {
-	unsigned long flags;
-
 	if (!pool)
 		return;
 
-	spin_lock_irqsave(&pool->lock, flags);
 	WARN_ON(pool->used_desc);
 	if (pool->cpumap) {
 		dma_free_coherent(pool->dev, pool->mem_size, pool->cpumap,
@@ -206,7 +203,6 @@ static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
 	} else {
 		iounmap(pool->iomap);
 	}
-	spin_unlock_irqrestore(&pool->lock, flags);
 }
 
 static inline dma_addr_t desc_phys(struct cpdma_desc_pool *pool,
-- 
2.1.1.332.g0bf7dd6

^ permalink raw reply related

* [net PATCH 3/3] drivers: net: cpsw: remove child devices while driver detach
From: Mugunthan V N @ 2014-10-13 16:51 UTC (permalink / raw)
  To: netdev; +Cc: davem, Mugunthan V N
In-Reply-To: <1413219067-15328-1-git-send-email-mugunthanvnm@ti.com>

remove all the child devices from the system to make sure that re-insert of
cpsw module doesn't fail on child device populated by of_platform_populate().

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index ab167dc..952e1e4 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2392,6 +2392,15 @@ clean_ndev_ret:
 	return ret;
 }
 
+static int cpsw_remove_child_device(struct device *dev, void *c)
+{
+	struct platform_device *pdev = to_platform_device(dev);
+
+	of_device_unregister(pdev);
+
+	return 0;
+}
+
 static int cpsw_remove(struct platform_device *pdev)
 {
 	struct net_device *ndev = platform_get_drvdata(pdev);
@@ -2406,6 +2415,7 @@ static int cpsw_remove(struct platform_device *pdev)
 	cpdma_chan_destroy(priv->rxch);
 	cpdma_ctlr_destroy(priv->dma);
 	pm_runtime_disable(&pdev->dev);
+	device_for_each_child(&pdev->dev, NULL, cpsw_remove_child_device);
 	if (priv->data.dual_emac)
 		free_netdev(cpsw_get_slave_ndev(priv, 1));
 	free_netdev(ndev);
-- 
2.1.1.332.g0bf7dd6

^ permalink raw reply related

* Re: [PATCH stable 3.2 3.4] ipv4: disable bh while doing route gc
From: David Miller @ 2014-10-13 16:52 UTC (permalink / raw)
  To: mleitner; +Cc: netdev, hannes
In-Reply-To: <6c3d6eca5d6a15c01393b010f2116bd169477c5a.1413215324.git.mleitner@redhat.com>

From: Marcelo Ricardo Leitner <mleitner@redhat.com>
Date: Mon, 13 Oct 2014 13:20:38 -0300

> Further tests revealed that after moving the garbage collector to a work
> queue and protecting it with a spinlock may leave the system prone to
> soft lockups if bottom half gets very busy.
> 
> It was reproced with a set of firewall rules that REJECTed packets. If
> the NIC bottom half handler ends up running on the same CPU that is
> running the garbage collector on a very large cache, the garbage
> collector will not be able to do its job due to the amount of work
> needed for handling the REJECTs and also won't reschedule.
> 
> The fix is to disable bottom half during the garbage collecting, as it
> already was in the first place (most calls to it came from softirqs).
> 
> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Please add my:

Acked-by: David S. Miller <davem@davemloft.net>

and submit this directly to -stable, thanks.

^ permalink raw reply

* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
From: Eric Dumazet @ 2014-10-13 16:58 UTC (permalink / raw)
  To: Dave Taht
  Cc: Alexander Duyck, Jesper Dangaard Brouer, Hannes Frederic Sowa,
	John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
	Florian Westphal, netdev@vger.kernel.org,
	Toke Høiland-Jørgensen, Tom Herbert, David Miller
In-Reply-To: <CAL4WiioOodM60_Ud6qyGt1_uyeqpDeGFF6PtEKtGa=NwHhVFiw@mail.gmail.com>


> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
>         When I first got cc'd on these threads, and saw
>         netperf-wrapper being
>         used on it,
>         I thought: "Oh god, I've created a monster.". My intent with
>         helping create
>         such a measurement tool was not to routinely drive a network
>         to saturation
>         but to be able to measure the impact on latency of doing so.
>         
>         I was trying get reasonable behavior when a "router" went into
>         overload.
>         
>         Servers, on the other hand, have more options to avoid
>         overload than
>         routers do. There's been a great deal of really nice work on
>         that
>         front. I love all that.
>         
>         and I like BQL because it provides enough backpressure to be
>         able to
>         do smarter things about scheduling packets higher in the
>         stack. (life
>         pre-BQL cost some hair)
>         
>         But tom once told me me "BQL's objective is to keep the
>         hardware busy".
>         It uses an MIAD controller instead of a more sane AIMD one, in
>         particular,
>         I'd much rather it ramped down to smaller values after
>         absorbing a burst.
>         
>         My objective is always to keep the *network's behavior
>         optimal*,
>         minimizing bursts, and subsequent tail loss on the other side,
>         and responding quickly to loss, and doing that by
>         preserving to the highest extent possible the ack clocking
>         that a
>         fluid model has. I LOVE BQL for providing more backpressure
>         than has
>         ever existed before, and I know it's incredibly difficult to
>         have fluid models
>         in a conventional cpu architecture that has to do other stuff.
>         
>         But in order to get the best results for network behavior I'm
>         willing to
>         sacrifice a great deal of cpu, interrupts, whatever it takes!
>         to get the
>         most packets to all the destinations specified, whatever the
>         workload,
>         with the *minimum amount of latency between ack and reply*
>         possible.
>         
>         What I'd hoped for in the new bulking and rcu stuff was to be
>         able to
>         see a net reduction in TSO/GSO Size, and/or BQL's size, and I
>         also did
>         keep hoping for some profiles on sch_fq, and for more complex
>         benchmarking of dozens or hundreds of realistically sized TCP
>         flows
>         (in both directions) to exercise it all.
>         
>         Some of the data presented showed that a single BQL'd queue
>         was >400K,
>         and with hardware multi-queue, 128K, when TSO and GSO were
>         used, but
>         with hardware multi-queue and no TSO/GSO, BQL was closer to
>         30K.
>         
>         This said to me that the maximum "right" size for a TSO/GSO
>         "packet" was
>         closer to 12k in this environment, and the right size for BQL,
>         30k,
>         before it started exerting backpressure to the qdisc.
>         
>         This would reduce the potential inter-flow network latency by
>         a factor
>         of 10 on the single hw queue scenario, and 4 in the multi
>         queue one.
>         
>         It would probably cost some interrupts, and in scenarios
>         lacking
>         packet loss, throughput, but in other scenarios with lots of
>         flows
>         each flow will ramp up in speed, faster, as you reduce the
>         RTTs.
>         Paying attention to this will also push profiling activities
>         into
>         areas of the stack that might be profitable.
>         
>         I would very much like to have profiles of happens now both
>         here and
>         elsewhere in the stack with this new code with TSO/GSO sizes
>         capped
>         thusly and BQL capped to 30k, and a smarter qdisc like fq
>         used.
>         
>         2) Most of the time, a server is not driving the wire to
>         saturation. If
>            it is, you are doing something wrong. The BQL queues are
>         empty, or
>            nearly so, so the instant someone creates a qdisc queue, it
>            drains.
>         
>         But: if there are two or more flows under contention, creating
>         a qdisc queue
>             better multiplexing the results is highly desirable, and
>         the stack
>            should be smart enough to make that overload only last
>         briefly.
>         
>            This is part of why I'm unfond of the deep and persistent
>         BQL queues as we
>         get today.
>         
>         3) Pure ack-only workloads are rare. It is a useful test case,
>         but...
>         
>         4) I thought the ring-cleanup optimization was rather
>         interesting and
>            could be made more dynamic.
>         
>         5) I remain amazed at the vast improvements in throughput,
>         reductions in
>         interrupts, lockless operation and the RCU stuff that have
>         come out of
>         this so far, but had to make these points in the hope that the
>         big picture
>         is retained.
>         
>         It does no good to blast packets through the network unless
>         there is a
>         high probability that they will actually be received on the
>         other side.
>         
>         thanks for listening.

Dave

I am kind of surprised you wrote this nonsense.

Being able to send data at full speed has nothing to do about how
packets are scheduled. You are concerned about packet scheduling, and
not about how fast we can send raw data on a 40Gb NIC.

We made all these changes so that we can spend cpu cycles at the right
place.

There are reasons we have fq_codel, and fq. Do not forget this, please.

Take a deep breath, some coffee, and rethink.

Thanks

^ permalink raw reply

* Re: [PATCH stable 3.2 3.4] ipv4: disable bh while doing route gc
From: Marcelo Ricardo Leitner @ 2014-10-13 16:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, hannes
In-Reply-To: <20141013.125225.1129718741415025861.davem@davemloft.net>

On 13-10-2014 13:52, David Miller wrote:
> From: Marcelo Ricardo Leitner <mleitner@redhat.com>
> Date: Mon, 13 Oct 2014 13:20:38 -0300
>
>> Further tests revealed that after moving the garbage collector to a work
>> queue and protecting it with a spinlock may leave the system prone to
>> soft lockups if bottom half gets very busy.
>>
>> It was reproced with a set of firewall rules that REJECTed packets. If
>> the NIC bottom half handler ends up running on the same CPU that is
>> running the garbage collector on a very large cache, the garbage
>> collector will not be able to do its job due to the amount of work
>> needed for handling the REJECTs and also won't reschedule.
>>
>> The fix is to disable bottom half during the garbage collecting, as it
>> already was in the first place (most calls to it came from softirqs).
>>
>> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
>> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>
> Please add my:
>
> Acked-by: David S. Miller <davem@davemloft.net>
>
> and submit this directly to -stable, thanks.

Will do. Thanks.

Marcelo

^ permalink raw reply

* [PATCH] sfc: efx: add support for skb->xmit_more
From: Daniel Borkmann @ 2014-10-13 16:59 UTC (permalink / raw)
  To: davem; +Cc: nikolay, netdev, Shradha Shah

Add support for xmit_more batching: efx has BQL support, thus we need
to only move the queue hang check before the hw tail pointer write
and test for xmit_more bit resp. whether the queue/stack has stopped.
Thanks to Nikolay Aleksandrov who had a Solarflare NIC to test!

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
Cc: Shradha Shah <sshah@solarflare.com>
---
 drivers/net/ethernet/sfc/tx.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index 3206098..566432c 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -439,13 +439,13 @@ finish_packet:
 
 	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
 
+	efx_tx_maybe_stop_queue(tx_queue);
+
 	/* Pass off to hardware */
-	efx_nic_push_buffers(tx_queue);
+	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
+		efx_nic_push_buffers(tx_queue);
 
 	tx_queue->tx_packets++;
-
-	efx_tx_maybe_stop_queue(tx_queue);
-
 	return NETDEV_TX_OK;
 
  dma_err:
@@ -1308,11 +1308,12 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
 
 	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
 
-	/* Pass off to hardware */
-	efx_nic_push_buffers(tx_queue);
-
 	efx_tx_maybe_stop_queue(tx_queue);
 
+	/* Pass off to hardware */
+	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
+		efx_nic_push_buffers(tx_queue);
+
 	tx_queue->tso_bursts++;
 	return NETDEV_TX_OK;
 
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH net-next] tg3: Add skb->xmit_more support
From: Rick Jones @ 2014-10-13 17:14 UTC (permalink / raw)
  To: Eric Dumazet, Prashant Sreedharan; +Cc: davem, netdev, dborkman, mchan
In-Reply-To: <1413218934.9362.104.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/13/2014 09:48 AM, Eric Dumazet wrote:
> On Mon, 2014-10-13 at 09:21 -0700, Prashant Sreedharan wrote:
>> Ring TX doorbell only if xmit_more is not set or the queue is stopped.
>>
>> Suggested-by: Daniel Borkmann <dborkman@redhat.com>
>> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
>> Signed-off-by: Michael Chan <mchan@broadcom.com>
>> ---
>>   drivers/net/ethernet/broadcom/tg3.c |   10 ++++++----
>>   1 files changed, 6 insertions(+), 4 deletions(-)
>
> Have you noticed any performance change ?
>
> I did the patch for bnx2x but got no real difference...

If ringing the doorbell is just a pio write, it may just get lost in the 
noise.  If though the programming model (or some sort of defect 
workaround) of the NIC requires a pio read of some sort when ringing the 
doorbell...

rick

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: kill nf_send_reset6() from include/net/netfilter/ipv6/nf_reject.h
From: Josh Boyer @ 2014-10-13 17:18 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Pablo Neira Ayuso, netfilter-devel, David Miller, netdev,
	Eric Dumazet
In-Reply-To: <CA+5PVA72ME31+CAzcM46JyvrP19dawTO4LSmta6Jt2eDZVbWaA@mail.gmail.com>

On Mon, Oct 13, 2014 at 12:01 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
> On Mon, Oct 13, 2014 at 11:55 AM, Florian Westphal <fw@strlen.de> wrote:
>> Josh Boyer <jwboyer@fedoraproject.org> wrote:
>>> On Thu, Oct 9, 2014 at 2:27 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>>> > nf_send_reset6() now resides in net/ipv6/netfilter/nf_reject_ipv6.c
>>> >
>>> > Fixes: c8d7b98 ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules")
>>> > Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
>>> > Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
>>> > Acked-by: Eric Dumazet <edumazet@google.com>
>>> > ---
>>> >  include/net/netfilter/ipv6/nf_reject.h |  157 +-------------------------------
>>> >  1 file changed, 2 insertions(+), 155 deletions(-)
>>>
>>> Hi All,
>>>
>>> This morning I was testing a kernel build from Linus' tree as of Linux
>>> v3.17-7639-g90eac7eee2f4.  When I rebooted my test machines, I
>>> couldn't ssh back into any of them.  I poked around a bit and noticed
>>> that it seems the iptables rules weren't getting loaded properly.
>>> Traffic out worked fine, and I could ping the machine, but other
>>> incoming traffic was blocked.  Then I saw that the ip6t_REJECT and
>>> ip6t_rpfilter modules were not being loaded on the bad kernel.
>>> Looking in dmesg I see:
>>>
>>> [   14.619028] nf_reject_ipv6: module license 'unspecified' taints kernel.
>>> [   14.619125] nf_reject_ipv6: Unknown symbol ip6_local_out (err 0)
>>
>> Ouch. ip6_local_is EXPORT_SYMBOL_GPL.
>>
>> http://patchwork.ozlabs.org/patch/398501/
>>
>> should fix this.
>
> I believe you're correct.  I dug in some more and I was just about to
> send a similar patch.  I'll add it on top of my builds and test it
> out.  Thanks for the pointer.

The patch you pointed to does indeed fix the issues I was seeing.
Thank you very much for the fast response.

josh

^ permalink raw reply

* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
From: Dave Taht @ 2014-10-13 17:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Jesper Dangaard Brouer, Hannes Frederic Sowa,
	John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
	Florian Westphal, netdev@vger.kernel.org,
	Toke Høiland-Jørgensen, Tom Herbert, David Miller
In-Reply-To: <1413219532.9362.109.camel@edumazet-glaptop2.roam.corp.google.com>

On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
>>         When I first got cc'd on these threads, and saw
>>         netperf-wrapper being
>>         used on it,
>>         I thought: "Oh god, I've created a monster.". My intent with
>>         helping create
>>         such a measurement tool was not to routinely drive a network
>>         to saturation
>>         but to be able to measure the impact on latency of doing so.
>>
>>         I was trying get reasonable behavior when a "router" went into
>>         overload.
>>
>>         Servers, on the other hand, have more options to avoid
>>         overload than
>>         routers do. There's been a great deal of really nice work on
>>         that
>>         front. I love all that.
>>
>>         and I like BQL because it provides enough backpressure to be
>>         able to
>>         do smarter things about scheduling packets higher in the
>>         stack. (life
>>         pre-BQL cost some hair)
>>
>>         But tom once told me me "BQL's objective is to keep the
>>         hardware busy".
>>         It uses an MIAD controller instead of a more sane AIMD one, in
>>         particular,
>>         I'd much rather it ramped down to smaller values after
>>         absorbing a burst.
>>
>>         My objective is always to keep the *network's behavior
>>         optimal*,
>>         minimizing bursts, and subsequent tail loss on the other side,
>>         and responding quickly to loss, and doing that by
>>         preserving to the highest extent possible the ack clocking
>>         that a
>>         fluid model has. I LOVE BQL for providing more backpressure
>>         than has
>>         ever existed before, and I know it's incredibly difficult to
>>         have fluid models
>>         in a conventional cpu architecture that has to do other stuff.
>>
>>         But in order to get the best results for network behavior I'm
>>         willing to
>>         sacrifice a great deal of cpu, interrupts, whatever it takes!
>>         to get the
>>         most packets to all the destinations specified, whatever the
>>         workload,
>>         with the *minimum amount of latency between ack and reply*
>>         possible.
>>
>>         What I'd hoped for in the new bulking and rcu stuff was to be
>>         able to
>>         see a net reduction in TSO/GSO Size, and/or BQL's size, and I
>>         also did
>>         keep hoping for some profiles on sch_fq, and for more complex
>>         benchmarking of dozens or hundreds of realistically sized TCP
>>         flows
>>         (in both directions) to exercise it all.
>>
>>         Some of the data presented showed that a single BQL'd queue
>>         was >400K,
>>         and with hardware multi-queue, 128K, when TSO and GSO were
>>         used, but
>>         with hardware multi-queue and no TSO/GSO, BQL was closer to
>>         30K.
>>
>>         This said to me that the maximum "right" size for a TSO/GSO
>>         "packet" was
>>         closer to 12k in this environment, and the right size for BQL,
>>         30k,
>>         before it started exerting backpressure to the qdisc.
>>
>>         This would reduce the potential inter-flow network latency by
>>         a factor
>>         of 10 on the single hw queue scenario, and 4 in the multi
>>         queue one.
>>
>>         It would probably cost some interrupts, and in scenarios
>>         lacking
>>         packet loss, throughput, but in other scenarios with lots of
>>         flows
>>         each flow will ramp up in speed, faster, as you reduce the
>>         RTTs.
>>         Paying attention to this will also push profiling activities
>>         into
>>         areas of the stack that might be profitable.
>>
>>         I would very much like to have profiles of happens now both
>>         here and
>>         elsewhere in the stack with this new code with TSO/GSO sizes
>>         capped
>>         thusly and BQL capped to 30k, and a smarter qdisc like fq
>>         used.
>>
>>         2) Most of the time, a server is not driving the wire to
>>         saturation. If
>>            it is, you are doing something wrong. The BQL queues are
>>         empty, or
>>            nearly so, so the instant someone creates a qdisc queue, it
>>            drains.
>>
>>         But: if there are two or more flows under contention, creating
>>         a qdisc queue
>>             better multiplexing the results is highly desirable, and
>>         the stack
>>            should be smart enough to make that overload only last
>>         briefly.
>>
>>            This is part of why I'm unfond of the deep and persistent
>>         BQL queues as we
>>         get today.
>>
>>         3) Pure ack-only workloads are rare. It is a useful test case,
>>         but...
>>
>>         4) I thought the ring-cleanup optimization was rather
>>         interesting and
>>            could be made more dynamic.
>>
>>         5) I remain amazed at the vast improvements in throughput,
>>         reductions in
>>         interrupts, lockless operation and the RCU stuff that have
>>         come out of
>>         this so far, but had to make these points in the hope that the
>>         big picture
>>         is retained.
>>
>>         It does no good to blast packets through the network unless
>>         there is a
>>         high probability that they will actually be received on the
>>         other side.
>>
>>         thanks for listening.
>
> Dave
>
> I am kind of surprised you wrote this nonsense.

Sorry, it was definately pre-coffee! I get twitchy when all the joy
seems to be in spewing packets at high rates rather than optimizing
for low RTTs in packet paired flow behavior.

> Being able to send data at full speed has nothing to do about how
> packets are scheduled. You are concerned about packet scheduling, and
> not about how fast we can send raw data on a 40Gb NIC.

I would like to also get better behavior out of gigE and below, and for
these changes to not impact the downstream behavior of the network
overall.

To give you an example, I would like to see the tcp flows in the
2nd chart here, to converge faster than the 5 seconds they currently
take at GigE speeds.

http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html

> We made all these changes so that we can spend cpu cycles at the right
> place.

I grok that.

> There are reasons we have fq_codel, and fq. Do not forget this, please.

Which is why I was hoping to see profiles along the way that showed where
else locks were being taken, what those cpu cycles were like, etc, and what were
those hotspots, when a smarter qdisc was engaged.


-- 
Dave Täht

https://www.bufferbloat.net/projects/make-wifi-fast

^ permalink raw reply

* Re: [PATCH v9 net-next 2/4] net: filter: split filter.h and expose eBPF to user space
From: Daniel Borkmann @ 2014-10-13 17:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Hannes Frederic Sowa, Chema Gonzalez,
	Eric Dumazet, Peter Zijlstra, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api, netdev, linux-kernel
In-Reply-To: <540737DF.1010801@redhat.com>

On 09/03/2014 05:46 PM, Daniel Borkmann wrote:
...
> Ok, given you post the remaining two RFCs later on this window as
> you indicate, I have no objections:
>
> Acked-by: Daniel Borkmann <dborkman@redhat.com>

Ping, Alexei, are you still sending the patch for bpf_common.h or
do you want me to take care of this?

Cheers,
Daniel

^ permalink raw reply

* Re: [PATCH v2 net] x86: bpf_jit: fix two bugs in eBPF JIT compiler
From: Darrick J. Wong @ 2014-10-13 17:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, netdev, linux-kernel
In-Reply-To: <1412998223-3805-1-git-send-email-ast@plumgrid.com>

On Fri, Oct 10, 2014 at 08:30:23PM -0700, Alexei Starovoitov wrote:
> 1.
> JIT compiler using multi-pass approach to converge to final image size,
> since x86 instructions are variable length. It starts with large
> gaps between instructions (so some jumps may use imm32 instead of imm8)
> and iterates until total program size is the same as in previous pass.
> This algorithm works only if program size is strictly decreasing.
> Programs that use LD_ABS insn need additional code in prologue, but it
> was not emitted during 1st pass, so there was a chance that 2nd pass would
> adjust imm32->imm8 jump offsets to the same number of bytes as increase in
> prologue, which may cause algorithm to erroneously decide that size converged.
> Fix it by always emitting largest prologue in the first pass which
> is detected by oldproglen==0 check.
> Also change error check condition 'proglen != oldproglen' to fail gracefully.
> 
> 2.
> while staring at the code realized that 64-byte buffer may not be enough
> when 1st insn is large, so increase it to 128 to avoid buffer overflow
> (theoretical maximum size of prologue+div is 109) and add runtime check.
> 
> Fixes: 622582786c9e ("net: filter: x86: internal BPF JIT")
> Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

This fixes the crash, thank you!

Tested-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> v1->v2: reduce chances of stack corruption in case of future bugs (suggested by Eric)
> 
> note in classic BPF programs 1st insn is always short move, but native eBPF
> programs may trigger buffer overflow. I couldn't force the crash with overflow,
> since there are no further calls while this part of stack is used.
> Both are ugly bugs regardless.
> When net-next opens I will add narrowed down testcase from 'nmap' to testsuite.
> 
>  arch/x86/net/bpf_jit_comp.c |   25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index d56cd1f..3f62734 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -182,12 +182,17 @@ struct jit_context {
>  	bool seen_ld_abs;
>  };
>  
> +/* maximum number of bytes emitted while JITing one eBPF insn */
> +#define BPF_MAX_INSN_SIZE	128
> +#define BPF_INSN_SAFETY		64
> +
>  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>  		  int oldproglen, struct jit_context *ctx)
>  {
>  	struct bpf_insn *insn = bpf_prog->insnsi;
>  	int insn_cnt = bpf_prog->len;
> -	u8 temp[64];
> +	bool seen_ld_abs = ctx->seen_ld_abs | (oldproglen == 0);
> +	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
>  	int i;
>  	int proglen = 0;
>  	u8 *prog = temp;
> @@ -225,7 +230,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>  	EMIT2(0x31, 0xc0); /* xor eax, eax */
>  	EMIT3(0x4D, 0x31, 0xED); /* xor r13, r13 */
>  
> -	if (ctx->seen_ld_abs) {
> +	if (seen_ld_abs) {
>  		/* r9d : skb->len - skb->data_len (headlen)
>  		 * r10 : skb->data
>  		 */
> @@ -685,7 +690,7 @@ xadd:			if (is_imm8(insn->off))
>  		case BPF_JMP | BPF_CALL:
>  			func = (u8 *) __bpf_call_base + imm32;
>  			jmp_offset = func - (image + addrs[i]);
> -			if (ctx->seen_ld_abs) {
> +			if (seen_ld_abs) {
>  				EMIT2(0x41, 0x52); /* push %r10 */
>  				EMIT2(0x41, 0x51); /* push %r9 */
>  				/* need to adjust jmp offset, since
> @@ -699,7 +704,7 @@ xadd:			if (is_imm8(insn->off))
>  				return -EINVAL;
>  			}
>  			EMIT1_off32(0xE8, jmp_offset);
> -			if (ctx->seen_ld_abs) {
> +			if (seen_ld_abs) {
>  				EMIT2(0x41, 0x59); /* pop %r9 */
>  				EMIT2(0x41, 0x5A); /* pop %r10 */
>  			}
> @@ -804,7 +809,8 @@ emit_jmp:
>  			goto common_load;
>  		case BPF_LD | BPF_ABS | BPF_W:
>  			func = CHOOSE_LOAD_FUNC(imm32, sk_load_word);
> -common_load:		ctx->seen_ld_abs = true;
> +common_load:
> +			ctx->seen_ld_abs = seen_ld_abs = true;
>  			jmp_offset = func - (image + addrs[i]);
>  			if (!func || !is_simm32(jmp_offset)) {
>  				pr_err("unsupported bpf func %d addr %p image %p\n",
> @@ -878,6 +884,11 @@ common_load:		ctx->seen_ld_abs = true;
>  		}
>  
>  		ilen = prog - temp;
> +		if (ilen > BPF_MAX_INSN_SIZE) {
> +			pr_err("bpf_jit_compile fatal insn size error\n");
> +			return -EFAULT;
> +		}
> +
>  		if (image) {
>  			if (unlikely(proglen + ilen > oldproglen)) {
>  				pr_err("bpf_jit_compile fatal error\n");
> @@ -934,9 +945,11 @@ void bpf_int_jit_compile(struct bpf_prog *prog)
>  			goto out;
>  		}
>  		if (image) {
> -			if (proglen != oldproglen)
> +			if (proglen != oldproglen) {
>  				pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
>  				       proglen, oldproglen);
> +				goto out;
> +			}
>  			break;
>  		}
>  		if (proglen == oldproglen) {
> -- 
> 1.7.9.5
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox