Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] tipc: fix one byte leak in tipc_sk_set_orig_addr()
From: David Miller @ 2018-05-10 21:29 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, jon.maloy, ying.xue
In-Reply-To: <20180509165022.199827-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Wed,  9 May 2018 09:50:22 -0700

> sysbot/KMSAN reported an uninit-value in recvmsg() that
> I tracked down to tipc_sk_set_orig_addr(), missing
> srcaddr->member.scope initialization.
> 
> This patches moves srcaddr->sock.scope init to follow
> fields order and ease future verifications.
 ...
> Fixes: 31c82a2d9d51 ("tipc: add second source address to recvmsg()/recvfrom()")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net] net/ipv6: fix lock imbalance in ip6_route_del()
From: David Miller @ 2018-05-10 21:30 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, dsahern
In-Reply-To: <20180509170546.247826-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Wed,  9 May 2018 10:05:46 -0700

> WARNING: lock held when returning to user space!
> 4.17.0-rc3+ #37 Not tainted
 ...
> Fixes: 23fb93a4d3f1 ("net/ipv6: Cleanup exception and cache route handling")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: David Ahern <dsahern@gmail.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>

Applied to net-next.

^ permalink raw reply

* Re: [net-next v2 0/6][pull request] 100GbE Intel Wired LAN Driver Updates 2018-05-09
From: David Miller @ 2018-05-10 21:31 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180509181011.30907-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed,  9 May 2018 11:10:05 -0700

> This series contains updates to fm10k only.
> 
> Jake provides all the changes in the series, starting with adding
> support for accelerated MACVLAN devices.  Reduced code duplication by
> implementing a macro to be used when setting up the type specific
> macros.  Avoided potential bugs with stats by using a macro to calculate
> the array size when passing to ensure that the size is correct.
> 
> v2: changed macro reference '#' with __stringify() as suggested by
>     Joe Perches to patch 2 of the series.  Also made sure the updated
>     series of patches is actually pushed to my kernel.org tree
 ...
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH net-next] liquidio: monitor all of Octeon's cores in watchdog thread
From: David Miller @ 2018-05-10 21:32 UTC (permalink / raw)
  To: felix.manlunas; +Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla
In-Reply-To: <20180509183131.GA1811@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Wed, 9 May 2018 11:31:31 -0700

> The liquidio_watchdog kernel thread is watching over only 12 cores of the
> Octeon CN23XX; it's neglecting the other 4 cores that are present in the
> CN2360.  Fix it by defining LIO_MAX_CORES as 16.
> 
> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] liquidio: bump up driver version to 1.7.2 to match newer NIC firmware
From: David Miller @ 2018-05-10 21:32 UTC (permalink / raw)
  To: felix.manlunas; +Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla
In-Reply-To: <20180509184938.GA1852@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Wed, 9 May 2018 11:49:38 -0700

> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>

Applied.

^ permalink raw reply

* Re: pull-request: mac80211 2018-05-09
From: David Miller @ 2018-05-10 21:35 UTC (permalink / raw)
  To: johannes; +Cc: netdev, linux-wireless
In-Reply-To: <20180509193613.10902-1-johannes@sipsolutions.net>

From: Johannes Berg <johannes@sipsolutions.net>
Date: Wed,  9 May 2018 21:36:12 +0200

> We just have a few fixes this time around.
> 
> Please pull and let me know if there's any problem.

Pulled, thank you!

^ permalink raw reply

* Re: pull-request: mac80211-next 2018-05-09
From: David Miller @ 2018-05-10 21:35 UTC (permalink / raw)
  To: johannes; +Cc: netdev, linux-wireless
In-Reply-To: <1525901377.6910.29.camel@sipsolutions.net>

From: Johannes Berg <johannes@sipsolutions.net>
Date: Wed, 09 May 2018 23:29:37 +0200

> Hi,
> 
> Sorry, scratch that.
> 
> I forgot that this commit:
> 
>> Toke Høiland-Jørgensen (3):
> 
>>       cfg80211: Expose TXQ stats and parameters to userspace
> 
> caused a bunch of "too much stack" warnings - I should put in at least
> the non-driver fix for that first, and then coordinate with Kalle to
> send the driver fixes in too.

Ok, tossed.

^ permalink raw reply

* Re: [PATCH net] hv_netvsc: set master device
From: David Miller @ 2018-05-10 21:36 UTC (permalink / raw)
  To: stephen; +Cc: netdev, sthemmin
In-Reply-To: <20180509210904.21406-1-sthemmin@microsoft.com>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Wed,  9 May 2018 14:09:04 -0700

> The hyper-v transparent bonding should have used master_dev_link.
> The netvsc device should look like a master bond device not
> like the upper side of a tunnel.
> 
> This makes the semantics the same so that userspace applications
> looking at network devices see the correct master relationshipship.
> 
> Fixes: 0c195567a8f6 ("netvsc: transparent VF management")
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net-next] net/core: correct the variable name in dev_ioctl() comment
From: David Miller @ 2018-05-10 21:41 UTC (permalink / raw)
  To: sunlw.fnst; +Cc: netdev
In-Reply-To: <20180510030120.4747-1-sunlw.fnst@cn.fujitsu.com>

From: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
Date: Thu, 10 May 2018 11:01:20 +0800

> The variable name is not "arg" but "ifr" in dev_ioctl()
> 
> Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>

If you are going to touch this, fix it full by adding the need_copyout
variable to the comment as well.

^ permalink raw reply

* Re: [PATCH] net: ipv4: remove define INET_CSK_DEBUG and unnecessary EXPORT_SYMBOL
From: David Miller @ 2018-05-10 21:44 UTC (permalink / raw)
  To: joe; +Cc: kuznet, yoshfuji, lirongqing, acme, netdev, linux-kernel
In-Reply-To: <0424e034b4640359bbe1ae50229b9fbc25b06181.1525932412.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Wed,  9 May 2018 23:24:07 -0700

> INET_CSK_DEBUG is always set and only is used for 2 pr_debug calls.
> 
> EXPORT_SYMBOL(inet_csk_timer_bug_msg) is only used by these 2
> pr_debug calls and is also unnecessary as the exported string can
> be used directly by these calls.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH net-next v2] tcp: Add mark for TIMEWAIT sockets
From: David Miller @ 2018-05-10 21:45 UTC (permalink / raw)
  To: jmaxwell37; +Cc: kuznet, yoshfuji, netdev, linux-kernel, jmaxwell
In-Reply-To: <20180510065351.22535-1-jmaxwell37@gmail.com>

From: Jon Maxwell <jmaxwell37@gmail.com>
Date: Thu, 10 May 2018 16:53:51 +1000

> This version has some suggestions by Eric Dumazet:
> 
> - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP 
> races. 
> - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
> statement. 
> - Factorize code as sk_fullsock() check is not necessary.
> 
> Aidan McGurn from Openwave Mobility systems reported the following bug:
> 
> "Marked routing is broken on customer deployment. Its effects are large 
> increase in Uplink retransmissions caused by the client never receiving 
> the final ACK to their FINACK - this ACK misses the mark and routes out 
> of the incorrect route."
> 
> Currently marks are added to sk_buffs for replies when the "fwmark_reflect" 
> sysctl is enabled. But not for TW sockets that had sk->sk_mark set via 
> setsockopt(SO_MARK..).  
> 
> Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the 
> original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location. 
> Then progate this so that the skb gets sent with the correct mark. Do the same 
> for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
> netfilter rules are still honored.
> 
> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>

I'm surprised the lack of a mark in timewait sockets wasn't noticed earlier.

Applied, thank you.

^ permalink raw reply

* Re: [PATCH v2] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
From: David Miller @ 2018-05-10 21:47 UTC (permalink / raw)
  To: christophe.jaillet
  Cc: tariqt, netdev, linux-rdma, linux-kernel, kernel-janitors
In-Reply-To: <20180510070604.19635-1-christophe.jaillet@wanadoo.fr>

From: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date: Thu, 10 May 2018 09:06:04 +0200

> If an error occurs, 'mlx4_en_destroy_netdev()' is called.
> It then calls 'mlx4_en_free_resources()' which does the needed resources
> cleanup.
> 
> So, doing some explicit kfree in the error handling path would lead to
> some double kfree.
> 
> Simplify code to avoid such a case.
> 
> Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: KASAN: use-after-free Read in __dev_queue_xmit
From: Willem de Bruijn @ 2018-05-10 21:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Biggers, syzbot, alexander.deucher, Andrey Konovalov,
	Anoob Soman, Chris Wilson, David Miller, Reshetova, Elena,
	Greg Kroah-Hartman, Kees Cook, LKML, Mike Maloney, mchehab,
	netdev, Rosen, Rami, Sowmini Varadhan, syzkaller-bugs,
	Willem de Bruijn
In-Reply-To: <CAF=yD-LwxT6+9U-qCcYEVsPUeL+eVG_aSUZzEM7spwHGb7AQgQ@mail.gmail.com>

On Wed, May 9, 2018 at 5:05 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Wed, May 9, 2018 at 3:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>> On 05/09/2018 12:21 PM, Willem de Bruijn wrote:
>>
>>> Indeed. The skb shared info struct is zeroed by dev_validate_header
>>> as a result of dev->hard_header_len exceeding skb->end - skb->data.
>>>
>>> Not exactly sure yet how this can happen. The hard header length space
>>> is accounted for during allocation as reserved memory. But,
>>> packet_alloc_skb does call skb_reserve(), moving skb->data
>>> effectively beyond this reserved region.
>>>
>>> It may be incorrect to pass skb->data to dev_validate_header, as that
>>> does not point to the start of the ll_header anymore. Still figuring out what
>>> the right fix is..

The following resolves the issue.

packet_alloc_skb already calls skb_reserve(skb, reserve), so now
the network header should start at 0, not at reserve.

If SOCK_DGRAM, dev_hard_header() calls skb_push for the link
layer and returns this offset.

If SOCK_RAW, we should do the same and use the reserved space to
write the link layer.

Now behavior is the same as in tpacket_snd.

@@ -2898,19 +2911,26 @@ static int packet_snd(struct socket *sock,
struct msghdr *msg, size_t len)
        tlen = dev->needed_tailroom;
        linear = __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len);
        linear = max(linear, min_t(int, len, dev->hard_header_len));
        skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, linear,
                               msg->msg_flags & MSG_DONTWAIT, &err);
        if (skb == NULL)
                goto out_unlock;

-       skb_set_network_header(skb, reserve);
+       skb_reset_network_header(skb);

        err = -EINVAL;
        if (sock->type == SOCK_DGRAM) {
                offset = dev_hard_header(skb, dev, ntohs(proto), addr,
NULL, len);
                if (unlikely(offset < 0))
                        goto out_free;
+       } else {
+               skb_push(skb, dev->hard_header_len);
        }

        /* Returns -EFAULT on error */
        err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter, len);

^ permalink raw reply

* Re: [PATCH net] sctp: remove sctp_chunk_put from fail_mark err path in sctp_ulpevent_make_rcvmsg
From: David Miller @ 2018-05-10 21:49 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman, syzkaller
In-Reply-To: <e4d7cf118c028ed05c9005951e30babc8bb300eb.1525944853.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Thu, 10 May 2018 17:34:13 +0800

> In Commit 1f45f78f8e51 ("sctp: allow GSO frags to access the chunk too"),
> it held the chunk in sctp_ulpevent_make_rcvmsg to access it safely later
> in recvmsg. However, it also added sctp_chunk_put in fail_mark err path,
> which is only triggered before holding the chunk.
> 
> syzbot reported a use-after-free crash happened on this err path, where
> it shouldn't call sctp_chunk_put.
> 
> This patch simply removes this call.
> 
> Fixes: 1f45f78f8e51 ("sctp: allow GSO frags to access the chunk too")
> Reported-by: syzbot+141d898c5f24489db4aa@syzkaller.appspotmail.com
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net-next 0/4] mlxsw: Support VLAN devices in mirroring offloads
From: David Miller @ 2018-05-10 21:51 UTC (permalink / raw)
  To: idosch; +Cc: netdev, bridge, jiri, petrm, stephen, nikolay, mlxsw
In-Reply-To: <20180510101306.4891-1-idosch@mellanox.com>

From: Ido Schimmel <idosch@mellanox.com>
Date: Thu, 10 May 2018 13:13:02 +0300

> Petr says:
> 
> When offloading "tc action mirred mirror", there are several scenarios
> where VLAN devices can show up, that mlxsw can offload on Spectrum
> machines.
> 
> I) A direct mirror to a VLAN device on top of a front-panel port device
>    (commonly referred to as "RSPAN")
> 
> II) VLAN device in egress path of a packet when resolving a mirror to
>     gretap or ip6gretap netdevice.
> 
> Specifically in the latter case, the following are the cases that can be
> offloaded:
> 
> IIa) VLAN device directly above a physical device.
> IIb) A VLAN-unaware bridge where the egress device is as in IIa.
> IIc) VLAN device on top of a VLAN-aware bridge where the egress device
>      is a physical device.
> 
> This patch set implements all the above cases.
 ...

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] cxgb4: fix the wrong conversion of Mbps to Kbps
From: David Miller @ 2018-05-10 21:52 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, indranil, venkatesh
In-Reply-To: <1525948643-13034-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Thu, 10 May 2018 16:07:23 +0530

> fix the wrong conversion where 1 Mbps was converted to
> 1024 Kbps.
> 
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH][next] net: aquantia: fix unsigned numvecs comparison with less than zero
From: David Miller @ 2018-05-10 21:53 UTC (permalink / raw)
  To: colin.king
  Cc: igor.russkikh, pavel.belous, weiyongjun1, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20180510125201.19707-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Thu, 10 May 2018 13:52:01 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The comparison of numvecs < 0 is always false because numvecs is a u32
> and hence the error return from a failed call to pci_alloc_irq_vectores
> is never detected.  Fix this by using the signed int ret to handle the
> error return and assign numvecs to err.
> 
> Detected by CoverityScan, CID#1468650 ("Unsigned compared against 0")
> 
> Fixes: a09bd81b5413 ("net: aquantia: Limit number of vectors to actually allocated irqs")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

This doesn't apply to net-next.

^ permalink raw reply

* Re: [PATCH net-next] tls: Fix tls_device initialization
From: David Miller @ 2018-05-10 21:54 UTC (permalink / raw)
  To: borisp; +Cc: netdev, davejwatson
In-Reply-To: <1525958845-72260-1-git-send-email-borisp@mellanox.com>

From: Boris Pismenny <borisp@mellanox.com>
Date: Thu, 10 May 2018 16:27:25 +0300

> Add sg table initialization to fix a BUG_ON encountered when enabling
> CONFIG_DEBUG_SG.
> 
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] rocker: Postpone filtering of !added_by_user FDB
From: David Miller @ 2018-05-10 21:55 UTC (permalink / raw)
  To: petrm; +Cc: netdev, jiri, vivien.didelot
In-Reply-To: <6fbb39711485d7abbe7cdaa8042a222b9588a6b6.1525957759.git.petrm@mellanox.com>

From: Petr Machata <petrm@mellanox.com>
Date: Thu, 10 May 2018 15:29:46 +0200

> Breaking out of the switch in rocker_switchdev_event() still ends up
> scheduling work, except an ill-defined one. This leads to an OOPS cited
> below. Fix by postponing the check until rocker_switchdev_event_work().
 ...
> Fixes: 816a3bed9549 ("switchdev: Add fdb.added_by_user to switchdev notifications")
> Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
> Signed-off-by: Petr Machata <petrm@mellanox.com>

Applied.

^ permalink raw reply

* Re: [PATCH] qed: fix spelling mistake: "taskelt" -> "tasklet"
From: David Miller @ 2018-05-10 21:56 UTC (permalink / raw)
  To: colin.king
  Cc: Ariel.Elior, everest-linux-l2, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20180510140327.15991-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Thu, 10 May 2018 15:03:27 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in DP_VERBOSE message text
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

^ permalink raw reply

* Re: pull-request: can 2018-05-10
From: David Miller @ 2018-05-10 21:58 UTC (permalink / raw)
  To: mkl; +Cc: netdev, linux-can, kernel
In-Reply-To: <20180510164749.20481-1-mkl@pengutronix.de>

From: Marc Kleine-Budde <mkl@pengutronix.de>
Date: Thu, 10 May 2018 18:47:47 +0200

> this is a pull request for net/master consisting of 2 patches.
> 
> Both patches are from Lukas Wunner and fix two problems found in the hi311x CAN
> driver under high load situations.

Applied.

^ permalink raw reply

* [PATCH v2 net-next] tcp: switch pacing timer to softirq based hrtimer
From: Eric Dumazet @ 2018-05-10 21:59 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

linux-4.16 got support for softirq based hrtimers.
TCP can switch its pacing hrtimer to this variant, since this
avoids going through a tasklet and some atomic operations.

pacing timer logic looks like other (jiffies based) tcp timers.

v2: use hrtimer_try_to_cancel() in tcp_clear_xmit_timers()
    to correctly release reference on socket if needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h     |  4 ++-
 net/ipv4/tcp_output.c | 69 ++++++++++++++++---------------------------
 net/ipv4/tcp_timer.c  |  2 +-
 3 files changed, 29 insertions(+), 46 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index cf803fe0fb86b6a0fb1876a9f775a9c6e6a28ac4..3b1d617b01109b133b4ecafa9ee46173851083f8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -557,7 +557,9 @@ void tcp_fin(struct sock *sk);
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
 {
-	hrtimer_cancel(&tcp_sk(sk)->pacing_timer);
+	if (hrtimer_try_to_cancel(&tcp_sk(sk)->pacing_timer) == 1)
+		sock_put(sk);
+
 	inet_csk_clear_xmit_timers(sk);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d07c0dcc99aaa55c4da963599c8286c8baa1f783..0d8f950a9006598c70dbf51e281a3fe32dfaa234 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -772,7 +772,7 @@ struct tsq_tasklet {
 };
 static DEFINE_PER_CPU(struct tsq_tasklet, tsq_tasklet);
 
-static void tcp_tsq_handler(struct sock *sk)
+static void tcp_tsq_write(struct sock *sk)
 {
 	if ((1 << sk->sk_state) &
 	    (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
@@ -789,6 +789,16 @@ static void tcp_tsq_handler(struct sock *sk)
 			       0, GFP_ATOMIC);
 	}
 }
+
+static void tcp_tsq_handler(struct sock *sk)
+{
+	bh_lock_sock(sk);
+	if (!sock_owned_by_user(sk))
+		tcp_tsq_write(sk);
+	else if (!test_and_set_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags))
+		sock_hold(sk);
+	bh_unlock_sock(sk);
+}
 /*
  * One tasklet per cpu tries to send more skbs.
  * We run in tasklet context but need to disable irqs when
@@ -816,16 +826,7 @@ static void tcp_tasklet_func(unsigned long data)
 		smp_mb__before_atomic();
 		clear_bit(TSQ_QUEUED, &sk->sk_tsq_flags);
 
-		if (!sk->sk_lock.owned &&
-		    test_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags)) {
-			bh_lock_sock(sk);
-			if (!sock_owned_by_user(sk)) {
-				clear_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags);
-				tcp_tsq_handler(sk);
-			}
-			bh_unlock_sock(sk);
-		}
-
+		tcp_tsq_handler(sk);
 		sk_free(sk);
 	}
 }
@@ -853,9 +854,10 @@ void tcp_release_cb(struct sock *sk)
 		nflags = flags & ~TCP_DEFERRED_ALL;
 	} while (cmpxchg(&sk->sk_tsq_flags, flags, nflags) != flags);
 
-	if (flags & TCPF_TSQ_DEFERRED)
-		tcp_tsq_handler(sk);
-
+	if (flags & TCPF_TSQ_DEFERRED) {
+		tcp_tsq_write(sk);
+		__sock_put(sk);
+	}
 	/* Here begins the tricky part :
 	 * We are called from release_sock() with :
 	 * 1) BH disabled
@@ -929,7 +931,7 @@ void tcp_wfree(struct sk_buff *skb)
 		if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
 			goto out;
 
-		nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
+		nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED;
 		nval = cmpxchg(&sk->sk_tsq_flags, oval, nval);
 		if (nval != oval)
 			continue;
@@ -948,37 +950,17 @@ void tcp_wfree(struct sk_buff *skb)
 	sk_free(sk);
 }
 
-/* Note: Called under hard irq.
- * We can not call TCP stack right away.
+/* Note: Called under soft irq.
+ * We can call TCP stack right away, unless socket is owned by user.
  */
 enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
 {
 	struct tcp_sock *tp = container_of(timer, struct tcp_sock, pacing_timer);
 	struct sock *sk = (struct sock *)tp;
-	unsigned long nval, oval;
 
-	for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
-		struct tsq_tasklet *tsq;
-		bool empty;
+	tcp_tsq_handler(sk);
+	sock_put(sk);
 
-		if (oval & TSQF_QUEUED)
-			break;
-
-		nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
-		nval = cmpxchg(&sk->sk_tsq_flags, oval, nval);
-		if (nval != oval)
-			continue;
-
-		if (!refcount_inc_not_zero(&sk->sk_wmem_alloc))
-			break;
-		/* queue this socket to tasklet queue */
-		tsq = this_cpu_ptr(&tsq_tasklet);
-		empty = list_empty(&tsq->head);
-		list_add(&tp->tsq_node, &tsq->head);
-		if (empty)
-			tasklet_schedule(&tsq->tasklet);
-		break;
-	}
 	return HRTIMER_NORESTART;
 }
 
@@ -1011,7 +993,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
 	do_div(len_ns, rate);
 	hrtimer_start(&tcp_sk(sk)->pacing_timer,
 		      ktime_add_ns(ktime_get(), len_ns),
-		      HRTIMER_MODE_ABS_PINNED);
+		      HRTIMER_MODE_ABS_PINNED_SOFT);
+	sock_hold(sk);
 }
 
 static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
@@ -1078,7 +1061,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 
 	/* if no packet is in qdisc/device queue, then allow XPS to select
 	 * another queue. We can be called from tcp_tsq_handler()
-	 * which holds one reference to sk_wmem_alloc.
+	 * which holds one reference to sk.
 	 *
 	 * TODO: Ideally, in-flight pure ACK packets should not matter here.
 	 * One way to get this would be to set skb->truesize = 2 on them.
@@ -2185,7 +2168,7 @@ static int tcp_mtu_probe(struct sock *sk)
 static bool tcp_pacing_check(const struct sock *sk)
 {
 	return tcp_needs_internal_pacing(sk) &&
-	       hrtimer_active(&tcp_sk(sk)->pacing_timer);
+	       hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
 }
 
 /* TCP Small Queues :
@@ -2365,8 +2348,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 					  skb, limit, mss_now, gfp)))
 			break;
 
-		if (test_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags))
-			clear_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags);
 		if (tcp_small_queue_check(sk, skb, 0))
 			break;
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index f7d944855f8ebd0a312fe73a53a56ab8d451ee44..92bdf64fffae3a5be291ca419eb21276b4c8cbae 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -713,6 +713,6 @@ void tcp_init_xmit_timers(struct sock *sk)
 	inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
 				  &tcp_keepalive_timer);
 	hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
-		     HRTIMER_MODE_ABS_PINNED);
+		     HRTIMER_MODE_ABS_PINNED_SOFT);
 	tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
 }
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* Re: [PATCH net-next] tcp: switch pacing timer to softirq based hrtimer
From: David Miller @ 2018-05-10 22:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: edumazet, netdev
In-Reply-To: <604c253d-a6d9-99ff-317b-b2ca2ffff54a@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 10 May 2018 13:55:00 -0700

> 
> 
> On 05/10/2018 12:49 PM, Eric Dumazet wrote:
>> linux-4.16 got support for softirq based hrtimers.
>> TCP can switch its pacing hrtimer to this variant, since this
>> avoids going through a tasklet and some atomic operations.
>> 
> 
> I need to send a V2, adding a test of hrtimer_cancel() return value
> in tcp_clear_xmit_timers() to eventually release the socket reference.

Ok.

^ permalink raw reply

* [PATCH net v2] rps: Correct wrong skb_flow_limit check when enable RPS
From: gfree.wind @ 2018-05-10 22:09 UTC (permalink / raw)
  To: davem, daniel, edumazet, willemb, jakub.kicinski, ktkhai, ast,
	linux, john.fastabend, brouer, dsahern, netdev
  Cc: Gao Feng

From: Gao Feng <gfree.wind@vip.163.com>

The skb flow limit is implemented for each CPU independently. In the
current codes, the function skb_flow_limit gets the softnet_data by
this_cpu_ptr. But the target cpu of enqueue_to_backlog would be not
the current cpu when enable RPS. As the result, the skb_flow_limit checks
the stats of current CPU, while the skb is going to append the queue of
another CPU. It isn't the expected behavior.

Now pass the softnet_data as a param to make consistent.

Fixes: 99bbc7074190 ("rps: selective flow shedding during softnet overflow")
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
---
 v2: Add Fixes tag per Eric, and enhance the commit log
 v1: intial version

 net/core/dev.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index af0558b..0f98eff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3883,18 +3883,15 @@ static int rps_ipi_queued(struct softnet_data *sd)
 int netdev_flow_limit_table_len __read_mostly = (1 << 12);
 #endif
 
-static bool skb_flow_limit(struct sk_buff *skb, unsigned int qlen)
+static bool skb_flow_limit(struct softnet_data *sd, struct sk_buff *skb, unsigned int qlen)
 {
 #ifdef CONFIG_NET_FLOW_LIMIT
 	struct sd_flow_limit *fl;
-	struct softnet_data *sd;
 	unsigned int old_flow, new_flow;
 
 	if (qlen < (netdev_max_backlog >> 1))
 		return false;
 
-	sd = this_cpu_ptr(&softnet_data);
-
 	rcu_read_lock();
 	fl = rcu_dereference(sd->flow_limit);
 	if (fl) {
@@ -3938,7 +3935,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 	if (!netif_running(skb->dev))
 		goto drop;
 	qlen = skb_queue_len(&sd->input_pkt_queue);
-	if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
+	if (qlen <= netdev_max_backlog && !skb_flow_limit(sd, skb, qlen)) {
 		if (qlen) {
 enqueue:
 			__skb_queue_tail(&sd->input_pkt_queue, skb);
-- 
1.9.1

^ permalink raw reply related

* [PATCH bpf-next] selftests/bpf: Fix bash reference in Makefile
From: Joe Stringer @ 2018-05-10 22:26 UTC (permalink / raw)
  To: daniel; +Cc: netdev

'|& ...' is a bash 4.0+ construct which is not guaranteed to be available
when using '$(shell ...)' in a Makefile. Fall back to the more portable
'2>&1 | ...'.

Fixes the following warning during compilation:

	/bin/sh: 1: Syntax error: "&" unexpected

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 tools/testing/selftests/bpf/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 9d762184b805..79d29d6cc719 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -90,9 +90,9 @@ CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
 $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
 $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
 
-BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help |& grep dwarfris)
-BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help |& grep BTF)
-BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --version |& grep LLVM)
+BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
+BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
+BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --version 2>&1 | grep LLVM)
 
 ifneq ($(BTF_LLC_PROBE),)
 ifneq ($(BTF_PAHOLE_PROBE),)
-- 
2.14.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox