Netdev List
 help / color / mirror / Atom feed
* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Guenter Roeck @ 2018-11-01 15:28 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: paul.burton@mips.com, linux-kernel@vger.kernel.org,
	ralf@linux-mips.org, jlayton@kernel.org,
	linuxppc-dev@lists.ozlabs.org, bfields@fieldses.org,
	linux-mips@linux-mips.org, linux-nfs@vger.kernel.org,
	akpm@linux-foundation.org, anna.schumaker@netapp.com,
	jhogan@kernel.org, netdev@vger.kernel.org, davem@davemloft.net,
	arnd@arndb.de, paulus@samba.org, mpe@ellerman.id.au
In-Reply-To: <d7fe095d8d1f848b5742a5b3e8cce9f89e0c1c8d.camel@hammerspace.com>

On Thu, Nov 01, 2018 at 06:30:08AM +0000, Trond Myklebust wrote:
[ ... ]
> > 
> > For my part I agree that this would be a much better solution. The
> > argument
> > that it is not always absolutely guaranteed that atomics don't wrap
> > doesn't
> > really hold for me because it looks like they all do. On top of that,
> > there
> > is an explicit atomic_dec_if_positive() and
> > atomic_fetch_add_unless(),
> > which to me strongly suggests that they _are_ supposed to wrap.
> > Given the cost of adding a comparison to each atomic operation to
> > prevent it from wrapping, anything else would not really make sense
> > to me.
> 
> That's a hypothesis, not a proven fact. There are architectures out
> there that do not wrap signed integers, hence my question.
> 

If what you say is correct, the kernel is in big trouble on those architectures.
atomic_inc_return() is used all over the place in the kernel with the assumption
that each returned value differs from the previous value (ie the value is used
as cookie, session ID, or for similar purposes).

Guenter

^ permalink raw reply

* [PATCH] ath10k: fix some typo
From: Yangtao Li @ 2018-11-01 15:29 UTC (permalink / raw)
  To: kvalo, davem; +Cc: ath10k, linux-wireless, netdev, linux-kernel, Yangtao Li

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
---
 drivers/net/wireless/ath/ath10k/wow.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/wow.c b/drivers/net/wireless/ath/ath10k/wow.c
index a6b179f88d36..e834c25a9330 100644
--- a/drivers/net/wireless/ath/ath10k/wow.c
+++ b/drivers/net/wireless/ath/ath10k/wow.c
@@ -135,7 +135,7 @@ static void ath10k_wow_convert_8023_to_80211
 	       &old_hdr_mask->h_proto,
 	       sizeof(old_hdr_mask->h_proto));
 
-	/* Caculate new pkt_offset */
+	/* Calculate new pkt_offset */
 	if (old->pkt_offset < ETH_ALEN)
 		new->pkt_offset = old->pkt_offset +
 			offsetof(struct ieee80211_hdr_3addr, addr1);
@@ -146,7 +146,7 @@ static void ath10k_wow_convert_8023_to_80211
 	else
 		new->pkt_offset = old->pkt_offset + hdr_len + rfc_len - ETH_HLEN;
 
-	/* Caculate new hdr end offset */
+	/* Calculate new hdr end offset */
 	if (total_len > ETH_HLEN)
 		hdr_80211_end_offset = hdr_len + rfc_len;
 	else if (total_len > offsetof(struct ethhdr, h_proto))
-- 
2.17.0

^ permalink raw reply related

* [PATCH] cw1200: fix small typo
From: Yangtao Li @ 2018-11-01 15:33 UTC (permalink / raw)
  To: pizza, kvalo, davem; +Cc: linux-wireless, netdev, linux-kernel, Yangtao Li

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
---
 drivers/net/wireless/st/cw1200/sta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/st/cw1200/sta.c b/drivers/net/wireless/st/cw1200/sta.c
index 38678e9a0562..8dae92a79fe1 100644
--- a/drivers/net/wireless/st/cw1200/sta.c
+++ b/drivers/net/wireless/st/cw1200/sta.c
@@ -1123,7 +1123,7 @@ int cw1200_setup_mac(struct cw1200_common *priv)
 	 *
 	 * NOTE2: RSSI based reports have been switched to RCPI, since
 	 * FW has a bug and RSSI reported values are not stable,
-	 * what can leads to signal level oscilations in user-end applications
+	 * what can lead to signal level oscilations in user-end applications
 	 */
 	struct wsm_rcpi_rssi_threshold threshold = {
 		.rssiRcpiMode = WSM_RCPI_RSSI_THRESHOLD_ENABLE |
-- 
2.17.0

^ permalink raw reply related

* Re: [RFC PATCH v3 04/10] ip: factor out protocol delivery helper
From: Subash Abhinov Kasiviswanathan @ 2018-11-01  6:35 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, David S. Miller, Willem de Bruijn, Steffen Klassert
In-Reply-To: <06f628363fc53443f30f1d3120c5e844800b7718.1540920083.git.pabeni@redhat.com>

On 2018-10-30 11:24, Paolo Abeni wrote:
> So that we can re-use it at the UDP lavel in a later patch
> 

Hi Paolo

Minor queries -
Should it be "level" instead of "lavel"? Similar comment for the ipv6
patch as well.

> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  net/ipv4/ip_input.c | 73 ++++++++++++++++++++++-----------------------
>  1 file changed, 36 insertions(+), 37 deletions(-)
> 
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index 35a786c0aaa0..72250b4e466d 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
>  	return false;
>  }
> 
> -static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> struct sk_buff *skb)
> +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb,
> int protocol)

Would it be better if this function was declared in include/net/ip.h &
include/net/ipv6.h rather than in net/ipv4/udp.c & net/ipv6/udp.c as in
the patch "udp: cope with UDP GRO packet misdirection"

diff --git a/include/net/ip.h b/include/net/ip.h
index 72593e1..3d7fdb4 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -717,4 +717,6 @@ static inline void ip_cmsg_recv(struct msghdr *msg, 
struct sk_buff *skb)
  int rtm_getroute_parse_ip_proto(struct nlattr *attr, u8 *ip_proto,
                                 struct netlink_ext_ack *extack);

+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
proto);
+
  #endif /* _IP_H */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8296505..4d4d2cfe 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1101,4 +1101,8 @@ int ipv6_sock_mc_join_ssm(struct sock *sk, int 
ifindex,
                           const struct in6_addr *addr, unsigned int 
mode);
  int ipv6_sock_mc_drop(struct sock *sk, int ifindex,
                       const struct in6_addr *addr);
+
+void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
nexthdr,
+                              bool have_final);
+
  #endif /* _NET_IPV6_H */

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply related

* RE: [PATCH net-next v2 5/6] net/ncsi: Reset channel state in ncsi_start_dev()
From: Justin.Lee1 @ 2018-11-01 15:51 UTC (permalink / raw)
  To: sam, netdev; +Cc: davem, linux-kernel, openbmc
In-Reply-To: <8004d6c236582253a82004ce72425b45f09d7d56.camel@mendozajonas.com>


> For your trace below can you share exactly which commands you were
> sending? Those messages aren't upstream so it's not 100% clear what's
> being sent.
> 
> Thanks!
> Sam
> 

It is sending via netlink directly. I have two packages (0 and 1) and each package has
two channels (0 and 1) on my system. If I convert it to command line, it would be
as below.

/home/root# ncsi_netlink -l 2 -m -a 0x01
Set cmd - ifindex 2, multi 1, mask 0x00000001

/home/root# ncsi_netlink -l 2 -m -b 0x03 -p 0
Set cmd - ifindex 2, package 0, channel -1, multi 1, mask 0x00000003

/home/root# ncsi_netlink -l 2 -m -b 0x00 -p 1
Set cmd - ifindex 2, package 1, channel -1, multi 1, mask 0x00000000

/home/root# ncsi_netlink
Usage: ncsi_netlink COMMAND [OPTION]...
Send messages to the NCSI driver via Netlink (0.4)

Mandatory arguments to long options are mandatory for short options too.
COMMAND:
  -i, --info                 Display info for packages and channels
  -s, --set                  Force the usage of a certain package/channel combination
  -x, --clear                Clear the above setting
  -a, --package-mask MASK    Set package selection mask
  -b, --channel-mask MASK    Set channel selection mask
  -o, --cmd DATA0 [DATA1]... Send NC-SI command
  -d, --discovery            Request to discover packages and channels
  -k, --lookup               Look up channel id or name
  -u, --status               Display status
  -h, --help                 Print this help text

OPTION:
  -l, --ifindex INDEX        Specify the interface index
  -p, --package PACKAGE      Package number
  -c, --channel CHANNEL      Channel number (aka. port number)
  -n, --channel NAME         Channel name (aka. ncsiX)
  -m, --multi-mode           Set multi-mode

Example: ncsi_netlink -l 2 --info


> > 
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: Multi-package enabled on ifindex 2, mask 0x00000001
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_stop_channel_monitor() - pkg 0 ch 0
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0400
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: pkg 0 ch 0 set as preferred channel
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: Multi-channel enabled on ifindex 2, mask 0x00000003
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_stop_channel_monitor() - pkg 0 ch 1
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 1 state 0400
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: Package 1 set to all channels disabled
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: Multi-channel enabled on ifindex 2, mask 0x00000000
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel()
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - pkg 0
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - pass pkg whitelist
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - ch 0
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - pass ch whitelist
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - skip
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - ch 1
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - pass ch whitelist
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - skip
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - next pkg
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_choose_active_channel() - pkg 1
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: No channel found to configure!
> > > > npcm7xx-emc f0825000.eth eth2: NCSI interface down
> > > > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> > > > npcm7xx-emc f0825000.eth eth2: Wrong NCSI state 0x100 in workqueue
> > > > 
> > > > All masks are set correctly, but you can see the PS column is not right and channel doesn't
> > > > configure correctly.
> > > > 
> > > > /sys/kernel/debug/ncsi_protocol# cat ncsi_device_status
> > > > IFIDX IFNAME NAME   PID CID RX TX MP MC WP WC PC PS LS RU CR NQ HA
> > > > ===================================================================
> > > >   2   eth2   ncsi0  000 000 1  1  1  1  1  1  1  0  1  1  1  0  1
> > > >   2   eth2   ncsi1  000 001 1  0  1  1  1  1  0  0  1  1  1  0  1
> > > >   2   eth2   ncsi2  001 000 0  0  1  1  0  0  0  0  1  1  1  0  1
> > > >   2   eth2   ncsi3  001 001 0  0  1  1  0  0  0  0  1  1  1  0  1
> > > > ===================================================================
> > > > MP: Multi-mode Package     WP: Whitelist Package
> > > > MC: Multi-mode Channel     WC: Whitelist Channel
> > > > PC: Primary Channel
> > > > PS: Poll Status
> > > > LS: Link Status
> > > > RU: Running
> > > > CR: Carrier OK
> > > > NQ: Queue Stopped
> > > > HA: Hardware Arbitration
> > > > 
> > > > PS column is getting from (int)nc->monitor.enabled.
> 
> 



^ permalink raw reply

* [PATCH bpf 3/3] bpf: show main program address in bpf_prog_info->jited_ksyms
From: Song Liu @ 2018-11-01  7:00 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, Song Liu, ast, daniel, sandipan
In-Reply-To: <20181101070058.2760251-1-songliubraving@fb.com>

Currently, when there is not subprog (prog->aux->func_cnt == 0),
bpf_prog_info does not return any jited_ksyms. This patch adds
main program address (prog->bpf_func) to jited_ksyms.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/bpf/syscall.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 34a9eef5992c..7293b17ca62a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2158,7 +2158,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	}
 
 	ulen = info.nr_jited_ksyms;
-	info.nr_jited_ksyms = prog->aux->func_cnt;
+	info.nr_jited_ksyms = prog->aux->func_cnt ? : 1;
 	if (info.nr_jited_ksyms && ulen) {
 		if (bpf_dump_raw_ok()) {
 			u64 __user *user_ksyms;
@@ -2170,9 +2170,17 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 			 */
 			ulen = min_t(u32, info.nr_jited_ksyms, ulen);
 			user_ksyms = u64_to_user_ptr(info.jited_ksyms);
-			for (i = 0; i < ulen; i++) {
-				ksym_addr = (ulong) prog->aux->func[i]->bpf_func;
-				if (put_user((u64) ksym_addr, &user_ksyms[i]))
+			if (prog->aux->func_cnt) {
+				for (i = 0; i < ulen; i++) {
+					ksym_addr = (ulong)
+						prog->aux->func[i]->bpf_func;
+					if (put_user((u64) ksym_addr,
+						     &user_ksyms[i]))
+						return -EFAULT;
+				}
+			} else {
+				ksym_addr = (ulong) prog->bpf_func;
+				if (put_user((u64) ksym_addr, &user_ksyms[0]))
 					return -EFAULT;
 			}
 		} else {
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf 2/3] bpf: show real jited address in bpf_prog_info->jited_ksyms
From: Song Liu @ 2018-11-01  7:00 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, Song Liu, ast, daniel, sandipan
In-Reply-To: <20181101070058.2760251-1-songliubraving@fb.com>

Currently, jited_ksyms in bpf_prog_info shows page addresses of jited
bpf program. This is not ideal for detailed profiling (find hot
instructions from stack traces). This patch replaces the page address
with real prog start address.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/bpf/syscall.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ccb93277aae2..34a9eef5992c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2172,7 +2172,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 			user_ksyms = u64_to_user_ptr(info.jited_ksyms);
 			for (i = 0; i < ulen; i++) {
 				ksym_addr = (ulong) prog->aux->func[i]->bpf_func;
-				ksym_addr &= PAGE_MASK;
 				if (put_user((u64) ksym_addr, &user_ksyms[i]))
 					return -EFAULT;
 			}
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf 0/3] show more accurrate bpf program address
From: Song Liu @ 2018-11-01  7:00 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, Song Liu, ast, daniel, sandipan

This set improves bpf program address showed in /proc/kallsyms and in
bpf_prog_info. First, real program address is showed instead of page
address. Second, when there is no subprogram, bpf_prog_info->jited_ksyms
returns the main prog address.

Song Liu (3):
  bpf: show real jited prog address in /proc/kallsyms
  bpf: show real jited address in bpf_prog_info->jited_ksyms
  bpf: show main program address in bpf_prog_info->jited_ksyms

 kernel/bpf/core.c    |  4 +---
 kernel/bpf/syscall.c | 17 ++++++++++++-----
 2 files changed, 13 insertions(+), 8 deletions(-)

^ permalink raw reply

* [PATCH bpf 1/3] bpf: show real jited prog address in /proc/kallsyms
From: Song Liu @ 2018-11-01  7:00 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, Song Liu, ast, daniel, sandipan
In-Reply-To: <20181101070058.2760251-1-songliubraving@fb.com>

Currently, /proc/kallsyms shows page address of jited bpf program. This
is not ideal for detailed profiling (find hot instructions from stack
traces). This patch replaces the page address with real prog start
address.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/bpf/core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 6377225b2082..1a796e0799ec 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -553,7 +553,6 @@ bool is_bpf_text_address(unsigned long addr)
 int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
 		    char *sym)
 {
-	unsigned long symbol_start, symbol_end;
 	struct bpf_prog_aux *aux;
 	unsigned int it = 0;
 	int ret = -ERANGE;
@@ -566,10 +565,9 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
 		if (it++ != symnum)
 			continue;
 
-		bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end);
 		bpf_get_prog_name(aux->prog, sym);
 
-		*value = symbol_start;
+		*value = (unsigned long)aux->prog->bpf_func;
 		*type  = BPF_SYM_ELF_TYPE;
 
 		ret = 0;
-- 
2.17.1

^ permalink raw reply related

* Re: [GIT] Networking
From: Linus Torvalds @ 2018-11-01 16:17 UTC (permalink / raw)
  To: David Miller; +Cc: Andrew Morton, netdev, Linux Kernel Mailing List
In-Reply-To: <20181031.184402.2213867913967695313.davem@davemloft.net>

On Wed, Oct 31, 2018 at 6:44 PM David Miller <davem@davemloft.net> wrote:
>
> Please pull, thanks a lot!

Pulled,

                    Linus

^ permalink raw reply

* Re: [PATCH v1] net: ipv6: fix racey clock check in route cache aging logic
From: Brendan Higgins @ 2018-11-01 16:56 UTC (permalink / raw)
  To: eric.dumazet
  Cc: David S . Miller, Alexey Kuznetsov, yoshfuji, netdev,
	Linux Kernel Mailing List
In-Reply-To: <9ce74f5b-387d-b02f-2efa-f12c1450577c@gmail.com>

On Thu, Oct 25, 2018 at 3:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 10/25/2018 02:46 PM, Brendan Higgins wrote:
> > On Thu, Oct 25, 2018 at 2:40 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> On 10/25/2018 02:13 PM, Brendan Higgins wrote:
> > <snip>
> >>>
> >>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> >>> index 2a7423c394560..54d28b91fd840 100644
> >>> --- a/net/ipv6/route.c
> >>> +++ b/net/ipv6/route.c
> >>> @@ -1734,7 +1734,7 @@ static void rt6_age_examine_exception(struct rt6_exception_bucket *bucket,
> >>>                       rt6_remove_exception(bucket, rt6_ex);
> >>>                       return;
> >>>               }
> >>> -     } else if (time_after(jiffies, rt->dst.expires)) {
> >>> +     } else if (time_after(now, rt->dst.expires)) {
> >>>               RT6_TRACE("purging expired route %p\n", rt);
> >>>               rt6_remove_exception(bucket, rt6_ex);
> >>>               return;
> >>>
> >>
> >>
> >> I do not think there is a bug here ?
> >>
> >> As a matter of fact, using the latest value of jiffies is probably better,
> >> since in some cases the @now variable could be quite in the past.
> >
> > Then why do we pass the `now` parameter in at all and use it at all,
> > like here: https://elixir.bootlin.com/linux/latest/source/net/ipv6/route.c#L1764
> > ?
> >
> > I am still skeptical that we should check jiffies in each check, but
> > we should at least be consistent.
>
> Well, this is a case where we do not really care.
>
> When a bug is fixed (you added a Fixes: tag which is good), we want
> to understand the real problem that needs to be fixed on stable kernels.

So by bug you mean user visible bug?
>
> Since this does not seem to be a real issue, I would suggest you send a cleanup
> patch when net-next is open (few days after linux-4.20-rc1 is release)

Sounds good, I will resend shortly.

Thanks!

^ permalink raw reply

* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Eric Dumazet @ 2018-11-01 16:59 UTC (permalink / raw)
  To: Peter Zijlstra, Trond Myklebust
  Cc: mark.rutland@arm.com, linux-kernel@vger.kernel.org,
	ralf@linux-mips.org, jlayton@kernel.org,
	linuxppc-dev@lists.ozlabs.org, bfields@fieldses.org,
	linux-mips@linux-mips.org, linux@roeck-us.net,
	linux-nfs@vger.kernel.org, akpm@linux-foundation.org,
	will.deacon@arm.com, boqun.feng@gmail.com, paul.burton@mips.com,
	anna.schumaker@netapp.com, jhogan@kernel.org, "netdev@vger.ke
In-Reply-To: <20181101163212.GF3159@hirez.programming.kicks-ass.net>



On 11/01/2018 09:32 AM, Peter Zijlstra wrote:

>> Anyhow, if the atomic maintainers are willing to stand up and state for
>> the record that the atomic counters are guaranteed to wrap modulo 2^n
>> just like unsigned integers, then I'm happy to take Paul's patch.
> 
> I myself am certainly relying on it.


Could we get uatomic_t support maybe ?

This reminds me of this sooooo silly patch :/

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adb03115f4590baa280ddc440a8eff08a6be0cb7

^ permalink raw reply

* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Paul E. McKenney @ 2018-11-01 17:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Trond Myklebust, mark.rutland@arm.com,
	linux-kernel@vger.kernel.org, ralf@linux-mips.org,
	jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
	bfields@fieldses.org, linux-mips@linux-mips.org,
	linux@roeck-us.net, linux-nfs@vger.kernel.org,
	akpm@linux-foundation.org, will.deacon@arm.com,
	boqun.feng@gmail.com, paul.burton@mips.com,
	anna.schumaker@netapp.com, "jhogan@kerne
In-Reply-To: <20181101163212.GF3159@hirez.programming.kicks-ass.net>

On Thu, Nov 01, 2018 at 05:32:12PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 03:22:15PM +0000, Trond Myklebust wrote:
> > On Thu, 2018-11-01 at 15:59 +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 01, 2018 at 01:18:46PM +0000, Mark Rutland wrote:
> 
> > > > > My one question (and the reason why I went with cmpxchg() in the
> > > > > first place) would be about the overflow behaviour for
> > > > > atomic_fetch_inc() and friends. I believe those functions should
> > > > > be OK on x86, so that when we overflow the counter, it behaves
> > > > > like an unsigned value and wraps back around.  Is that the case
> > > > > for all architectures?
> > > > > 
> > > > > i.e. are atomic_t/atomic64_t always guaranteed to behave like
> > > > > u32/u64 on increment?
> > > > > 
> > > > > I could not find any documentation that explicitly stated that
> > > > > they should.
> > > > 
> > > > Peter, Will, I understand that the atomic_t/atomic64_t ops are
> > > > required to wrap per 2's-complement. IIUC the refcount code relies
> > > > on this.
> > > > 
> > > > Can you confirm?
> > > 
> > > There is quite a bit of core code that hard assumes 2s-complement.
> > > Not only for atomics but for any signed integer type. Also see the
> > > kernel using -fno-strict-overflow which implies -fwrapv, which
> > > defines signed overflow to behave like 2s-complement (and rids us of
> > > that particular UB).
> > 
> > Fair enough, but there have also been bugfixes to explicitly fix unsafe
> > C standards assumptions for signed integers. See, for instance commit
> > 5a581b367b5d "jiffies: Avoid undefined behavior from signed overflow"
> > from Paul McKenney.
> 
> Yes, I feel Paul has been to too many C/C++ committee meetings and got
> properly paranoid. Which isn't always a bad thing :-)

Even the C standard defines 2s complement for atomics.  Just not for
normal arithmetic, where yes, signed overflow is UB.  And yes, I do
know about -fwrapv, but I would like to avoid at least some copy-pasta
UB from my kernel code to who knows what user-mode environment.  :-/

At least where it is reasonably easy to do so.

And there is a push to define C++ signed arithmetic as 2s complement,
but there are still 1s complement systems with C compilers.  Just not
C++ compilers.  Legacy...

> But for us using -fno-strict-overflow which actually defines signed
> overflow, I myself am really not worried. I'm also not sure if KASAN has
> been taught about this, or if it will still (incorrectly) warn about UB
> for signed types.

UBSAN gave me a signed-overflow warning a few days ago.  Which I have
fixed, even though 2s complement did the right thing.  I am also taking
advantage of the change to use better naming.

> > Anyhow, if the atomic maintainers are willing to stand up and state for
> > the record that the atomic counters are guaranteed to wrap modulo 2^n
> > just like unsigned integers, then I'm happy to take Paul's patch.
> 
> I myself am certainly relying on it.

Color me confused.  My 5a581b367b5d is from 2013.  Or is "Paul" instead
intended to mean Paul Mackerras, who happens to be on CC?

							Thanx, Paul

^ permalink raw reply

* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Peter Zijlstra @ 2018-11-01 17:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Trond Myklebust, mark.rutland@arm.com,
	linux-kernel@vger.kernel.org, ralf@linux-mips.org,
	jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
	bfields@fieldses.org, linux-mips@linux-mips.org,
	linux@roeck-us.net, linux-nfs@vger.kernel.org,
	akpm@linux-foundation.org, will.deacon@arm.com,
	boqun.feng@gmail.com, paul.burton@mips.com,
	anna.schumaker@netapp.com, "jhogan@kerne
In-Reply-To: <b0160f4b-b996-b0ee-405a-3d5f1866272e@gmail.com>

On Thu, Nov 01, 2018 at 09:59:38AM -0700, Eric Dumazet wrote:
> On 11/01/2018 09:32 AM, Peter Zijlstra wrote:
> 
> >> Anyhow, if the atomic maintainers are willing to stand up and state for
> >> the record that the atomic counters are guaranteed to wrap modulo 2^n
> >> just like unsigned integers, then I'm happy to take Paul's patch.
> > 
> > I myself am certainly relying on it.
> 
> Could we get uatomic_t support maybe ?

Whatever for; it'd be the exact identical same functions as for
atomic_t, except for a giant amount of code duplication to deal with the
new type.

That is; today we merged a bunch of scripts that generates most of
atomic*_t, so we could probably script uatomic*_t wrappers with minimal
effort, but it would add several thousand lines of code to each compile
for absolutely no reason what so ever.

> This reminds me of this sooooo silly patch :/
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adb03115f4590baa280ddc440a8eff08a6be0cb7

Yes, that's stupid. UBSAN is just wrong there.

^ permalink raw reply

* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Peter Zijlstra @ 2018-11-01 17:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Trond Myklebust, mark.rutland@arm.com,
	linux-kernel@vger.kernel.org, ralf@linux-mips.org,
	jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
	bfields@fieldses.org, linux-mips@linux-mips.org,
	linux@roeck-us.net, linux-nfs@vger.kernel.org,
	akpm@linux-foundation.org, will.deacon@arm.com,
	boqun.feng@gmail.com, paul.burton@mips.com,
	anna.schumaker@netapp.com, "jhogan@kerne
In-Reply-To: <20181101170146.GQ4170@linux.ibm.com>

On Thu, Nov 01, 2018 at 10:01:46AM -0700, Paul E. McKenney wrote:
> On Thu, Nov 01, 2018 at 05:32:12PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 01, 2018 at 03:22:15PM +0000, Trond Myklebust wrote:
> > > On Thu, 2018-11-01 at 15:59 +0100, Peter Zijlstra wrote:
> > > > On Thu, Nov 01, 2018 at 01:18:46PM +0000, Mark Rutland wrote:
> > 
> > > > > > My one question (and the reason why I went with cmpxchg() in the
> > > > > > first place) would be about the overflow behaviour for
> > > > > > atomic_fetch_inc() and friends. I believe those functions should
> > > > > > be OK on x86, so that when we overflow the counter, it behaves
> > > > > > like an unsigned value and wraps back around.  Is that the case
> > > > > > for all architectures?
> > > > > > 
> > > > > > i.e. are atomic_t/atomic64_t always guaranteed to behave like
> > > > > > u32/u64 on increment?
> > > > > > 
> > > > > > I could not find any documentation that explicitly stated that
> > > > > > they should.
> > > > > 
> > > > > Peter, Will, I understand that the atomic_t/atomic64_t ops are
> > > > > required to wrap per 2's-complement. IIUC the refcount code relies
> > > > > on this.
> > > > > 
> > > > > Can you confirm?
> > > > 
> > > > There is quite a bit of core code that hard assumes 2s-complement.
> > > > Not only for atomics but for any signed integer type. Also see the
> > > > kernel using -fno-strict-overflow which implies -fwrapv, which
> > > > defines signed overflow to behave like 2s-complement (and rids us of
> > > > that particular UB).
> > > 
> > > Fair enough, but there have also been bugfixes to explicitly fix unsafe
> > > C standards assumptions for signed integers. See, for instance commit
> > > 5a581b367b5d "jiffies: Avoid undefined behavior from signed overflow"
> > > from Paul McKenney.
> > 
> > Yes, I feel Paul has been to too many C/C++ committee meetings and got
> > properly paranoid. Which isn't always a bad thing :-)
> 
> Even the C standard defines 2s complement for atomics.  

Ooh good to know.

> Just not for
> normal arithmetic, where yes, signed overflow is UB.  And yes, I do
> know about -fwrapv, but I would like to avoid at least some copy-pasta
> UB from my kernel code to who knows what user-mode environment.  :-/
> 
> At least where it is reasonably easy to do so.

Fair enough I suppose; I just always make sure to include the same
-fknobs for the userspace thing when I lift code.

> And there is a push to define C++ signed arithmetic as 2s complement,
> but there are still 1s complement systems with C compilers.  Just not
> C++ compilers.  Legacy...

*groan*; how about those ancient hardwares keep using ancient compilers
and we all move on to the 70s :-)

> > But for us using -fno-strict-overflow which actually defines signed
> > overflow, I myself am really not worried. I'm also not sure if KASAN has
> > been taught about this, or if it will still (incorrectly) warn about UB
> > for signed types.
> 
> UBSAN gave me a signed-overflow warning a few days ago.  Which I have
> fixed, even though 2s complement did the right thing.  I am also taking
> advantage of the change to use better naming.

Oh too many *SANs I suppose; and yes, if you can make the code better,
why not.

> > > Anyhow, if the atomic maintainers are willing to stand up and state for
> > > the record that the atomic counters are guaranteed to wrap modulo 2^n
> > > just like unsigned integers, then I'm happy to take Paul's patch.
> > 
> > I myself am certainly relying on it.
> 
> Color me confused.  My 5a581b367b5d is from 2013.  Or is "Paul" instead
> intended to mean Paul Mackerras, who happens to be on CC?

Paul Burton I think, on a part of the thread before we joined :-)

^ permalink raw reply

* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Peter Zijlstra @ 2018-11-01 17:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Trond Myklebust, mark.rutland@arm.com,
	linux-kernel@vger.kernel.org, ralf@linux-mips.org,
	jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
	bfields@fieldses.org, linux-mips@linux-mips.org,
	linux@roeck-us.net, linux-nfs@vger.kernel.org,
	akpm@linux-foundation.org, will.deacon@arm.com,
	boqun.feng@gmail.com, paul.burton@mips.com,
	anna.schumaker@netapp.com, "jhogan@kerne
In-Reply-To: <20181101171432.GH3178@hirez.programming.kicks-ass.net>

On Thu, Nov 01, 2018 at 06:14:32PM +0100, Peter Zijlstra wrote:
> > This reminds me of this sooooo silly patch :/
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adb03115f4590baa280ddc440a8eff08a6be0cb7

You'd probably want to write it like so; +- some ordering stuff, that
code didn't look like it really needs the memory barriers implied by
these, but I didn't look too hard.

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c0a9d26c06ce..11deb1d7e96b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -485,16 +485,10 @@ u32 ip_idents_reserve(u32 hash, int segs)
 	u32 now = (u32)jiffies;
 	u32 new, delta = 0;
 
-	if (old != now && cmpxchg(p_tstamp, old, now) == old)
+	if (old != now && try_cmpxchg(p_tstamp, &old, now))
 		delta = prandom_u32_max(now - old);
 
-	/* Do not use atomic_add_return() as it makes UBSAN unhappy */
-	do {
-		old = (u32)atomic_read(p_id);
-		new = old + delta + segs;
-	} while (atomic_cmpxchg(p_id, old, new) != old);
-
-	return new - segs;
+	return atomic_fetch_add(segs + delta, p_id) + delta;
 }
 EXPORT_SYMBOL(ip_idents_reserve);
 

^ permalink raw reply related

* [PATCH iproute2-next] rdma: Refresh help section of resource information
From: Leon Romanovsky @ 2018-11-01  8:35 UTC (permalink / raw)
  To: David Ahern
  Cc: Leon Romanovsky, netdev, RDMA mailing list, Stephen Hemminger,
	Steve Wise

From: Leon Romanovsky <leonro@mellanox.com>

After commit 4060e4c0d257 ("rdma: Add PD resource tracking
information"), the resource information shows PDs and MRs,
but help pages didn't fully reflect it.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 rdma/res.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/rdma/res.c b/rdma/res.c
index 0d8c1c38..cbb2efe6 100644
--- a/rdma/res.c
+++ b/rdma/res.c
@@ -16,13 +16,17 @@ static int res_help(struct rd *rd)
 {
 	pr_out("Usage: %s resource\n", rd->filename);
 	pr_out("          resource show [DEV]\n");
-	pr_out("          resource show [qp|cm_id]\n");
+	pr_out("          resource show [qp|cm_id|pd|mr|cq]\n");
 	pr_out("          resource show qp link [DEV/PORT]\n");
 	pr_out("          resource show qp link [DEV/PORT] [FILTER-NAME FILTER-VALUE]\n");
 	pr_out("          resource show cm_id link [DEV/PORT]\n");
 	pr_out("          resource show cm_id link [DEV/PORT] [FILTER-NAME FILTER-VALUE]\n");
 	pr_out("          resource show cq link [DEV/PORT]\n");
 	pr_out("          resource show cq link [DEV/PORT] [FILTER-NAME FILTER-VALUE]\n");
+	pr_out("          resource show pd dev [DEV]\n");
+	pr_out("          resource show pd dev [DEV] [FILTER-NAME FILTER-VALUE]\n");
+	pr_out("          resource show mr dev [DEV]\n");
+	pr_out("          resource show mr dev [DEV] [FILTER-NAME FILTER-VALUE]\n");
 	return 0;
 }

^ permalink raw reply related

* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Dmitry Vyukov @ 2018-11-01 17:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Trond Myklebust, mark.rutland@arm.com,
	linux-kernel@vger.kernel.org, ralf@linux-mips.org,
	jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
	bfields@fieldses.org, linux-mips@linux-mips.org,
	linux@roeck-us.net, linux-nfs@vger.kernel.org,
	akpm@linux-foundation.org, will.deacon@arm.com,
	boqun.feng@gmail.com, paul.burton@mips.com,
	"anna.schumaker@netapp.com
In-Reply-To: <20181101171846.GI3178@hirez.programming.kicks-ass.net>

On Thu, Nov 1, 2018 at 6:18 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > > > > My one question (and the reason why I went with cmpxchg() in the
>> > > > > > first place) would be about the overflow behaviour for
>> > > > > > atomic_fetch_inc() and friends. I believe those functions should
>> > > > > > be OK on x86, so that when we overflow the counter, it behaves
>> > > > > > like an unsigned value and wraps back around.  Is that the case
>> > > > > > for all architectures?
>> > > > > >
>> > > > > > i.e. are atomic_t/atomic64_t always guaranteed to behave like
>> > > > > > u32/u64 on increment?
>> > > > > >
>> > > > > > I could not find any documentation that explicitly stated that
>> > > > > > they should.
>> > > > >
>> > > > > Peter, Will, I understand that the atomic_t/atomic64_t ops are
>> > > > > required to wrap per 2's-complement. IIUC the refcount code relies
>> > > > > on this.
>> > > > >
>> > > > > Can you confirm?
>> > > >
>> > > > There is quite a bit of core code that hard assumes 2s-complement.
>> > > > Not only for atomics but for any signed integer type. Also see the
>> > > > kernel using -fno-strict-overflow which implies -fwrapv, which
>> > > > defines signed overflow to behave like 2s-complement (and rids us of
>> > > > that particular UB).
>> > >
>> > > Fair enough, but there have also been bugfixes to explicitly fix unsafe
>> > > C standards assumptions for signed integers. See, for instance commit
>> > > 5a581b367b5d "jiffies: Avoid undefined behavior from signed overflow"
>> > > from Paul McKenney.
>> >
>> > Yes, I feel Paul has been to too many C/C++ committee meetings and got
>> > properly paranoid. Which isn't always a bad thing :-)
>>
>> Even the C standard defines 2s complement for atomics.
>
> Ooh good to know.
>
>> Just not for
>> normal arithmetic, where yes, signed overflow is UB.  And yes, I do
>> know about -fwrapv, but I would like to avoid at least some copy-pasta
>> UB from my kernel code to who knows what user-mode environment.  :-/
>>
>> At least where it is reasonably easy to do so.
>
> Fair enough I suppose; I just always make sure to include the same
> -fknobs for the userspace thing when I lift code.
>
>> And there is a push to define C++ signed arithmetic as 2s complement,
>> but there are still 1s complement systems with C compilers.  Just not
>> C++ compilers.  Legacy...
>
> *groan*; how about those ancient hardwares keep using ancient compilers
> and we all move on to the 70s :-)
>
>> > But for us using -fno-strict-overflow which actually defines signed
>> > overflow, I myself am really not worried. I'm also not sure if KASAN has
>> > been taught about this, or if it will still (incorrectly) warn about UB
>> > for signed types.
>>
>> UBSAN gave me a signed-overflow warning a few days ago.  Which I have
>> fixed, even though 2s complement did the right thing.  I am also taking
>> advantage of the change to use better naming.
>
> Oh too many *SANs I suppose; and yes, if you can make the code better,
> why not.

If there is a warning that we don't want to see at all, then we can
disable it. It supposed to be a useful tool, rather than a thing in
itself that lives own life. We already I think removed 1 particularly
noisy warning and made another optional via a config.
But the thing with overflows is that, even if it's defined, it's not
necessary the intended behavior. For example, take allocation size
calculation done via unsigned size_t. If it overflows it does not help
if C defines result or not, it still gives a user controlled write
primitive. We've seen similar cases with timeout/deadline calculation
in kernel, we really don't want it to just wrap modulo-2, right. Some
user-space projects even test with unsigned overflow warnings or
implicit truncation warnings, which are formally legal, but frequently
bugs.

^ permalink raw reply

* [PATCH] SUNRPC: Use atomic(64)_t for seq_send(64)
From: Paul Burton @ 2018-11-01 17:51 UTC (permalink / raw)
  To: linux-nfs@vger.kernel.org, Trond Myklebust, Anna Schumaker
  Cc: linux-kernel@vger.kernel.org, Paul Burton, J . Bruce Fields,
	Jeff Layton, David S . Miller, netdev@vger.kernel.org
In-Reply-To: <20181101145926.GE3178@hirez.programming.kicks-ass.net>

The seq_send & seq_send64 fields in struct krb5_ctx are used as
atomically incrementing counters. This is implemented using cmpxchg() &
cmpxchg64() to implement what amount to custom versions of
atomic_fetch_inc() & atomic64_fetch_inc().

Besides the duplication, using cmpxchg64() has another major drawback in
that some 32 bit architectures don't provide it. As such commit
571ed1fd2390 ("SUNRPC: Replace krb5_seq_lock with a lockless scheme")
resulted in build failures for some architectures.

Change seq_send to be an atomic_t and seq_send64 to be an atomic64_t,
then use atomic(64)_* functions to manipulate the values. The atomic64_t
type & associated functions are provided even on architectures which
lack real 64 bit atomic memory access via CONFIG_GENERIC_ATOMIC64 which
uses spinlocks to serialize access. This fixes the build failures for
architectures lacking cmpxchg64().

A potential alternative that was raised would be to provide cmpxchg64()
on the 32 bit architectures that currently lack it, using spinlocks.
However this would provide a version of cmpxchg64() with semantics a
little different to the implementations on architectures with real 64
bit atomics - the spinlock-based implementation would only work if all
access to the memory used with cmpxchg64() is *always* performed using
cmpxchg64(). That is not currently a requirement for users of
cmpxchg64(), and making it one seems questionable. As such avoiding
cmpxchg64() outside of architecture-specific code seems best,
particularly in cases where atomic64_t seems like a better fit anyway.

The CONFIG_GENERIC_ATOMIC64 implementation of atomic64_* functions will
use spinlocks & so faces the same issue, but with the key difference
that the memory backing an atomic64_t ought to always be accessed via
the atomic64_* functions anyway making the issue moot.

Signed-off-by: Paul Burton <paul.burton@mips.com>
Fixes: 571ed1fd2390 ("SUNRPC: Replace krb5_seq_lock with a lockless scheme")
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 include/linux/sunrpc/gss_krb5.h     |  7 ++-----
 net/sunrpc/auth_gss/gss_krb5_mech.c | 16 ++++++++++------
 net/sunrpc/auth_gss/gss_krb5_seal.c | 28 ++--------------------------
 net/sunrpc/auth_gss/gss_krb5_wrap.c |  4 ++--
 4 files changed, 16 insertions(+), 39 deletions(-)

diff --git a/include/linux/sunrpc/gss_krb5.h b/include/linux/sunrpc/gss_krb5.h
index 131424cefc6a..02c0412e368c 100644
--- a/include/linux/sunrpc/gss_krb5.h
+++ b/include/linux/sunrpc/gss_krb5.h
@@ -107,8 +107,8 @@ struct krb5_ctx {
 	u8			Ksess[GSS_KRB5_MAX_KEYLEN]; /* session key */
 	u8			cksum[GSS_KRB5_MAX_KEYLEN];
 	s32			endtime;
-	u32			seq_send;
-	u64			seq_send64;
+	atomic_t		seq_send;
+	atomic64_t		seq_send64;
 	struct xdr_netobj	mech_used;
 	u8			initiator_sign[GSS_KRB5_MAX_KEYLEN];
 	u8			acceptor_sign[GSS_KRB5_MAX_KEYLEN];
@@ -118,9 +118,6 @@ struct krb5_ctx {
 	u8			acceptor_integ[GSS_KRB5_MAX_KEYLEN];
 };
 
-extern u32 gss_seq_send_fetch_and_inc(struct krb5_ctx *ctx);
-extern u64 gss_seq_send64_fetch_and_inc(struct krb5_ctx *ctx);
-
 /* The length of the Kerberos GSS token header */
 #define GSS_KRB5_TOK_HDR_LEN	(16)
 
diff --git a/net/sunrpc/auth_gss/gss_krb5_mech.c b/net/sunrpc/auth_gss/gss_krb5_mech.c
index 7f0424dfa8f6..eab71fc7af3e 100644
--- a/net/sunrpc/auth_gss/gss_krb5_mech.c
+++ b/net/sunrpc/auth_gss/gss_krb5_mech.c
@@ -274,6 +274,7 @@ get_key(const void *p, const void *end,
 static int
 gss_import_v1_context(const void *p, const void *end, struct krb5_ctx *ctx)
 {
+	u32 seq_send;
 	int tmp;
 
 	p = simple_get_bytes(p, end, &ctx->initiate, sizeof(ctx->initiate));
@@ -315,9 +316,10 @@ gss_import_v1_context(const void *p, const void *end, struct krb5_ctx *ctx)
 	p = simple_get_bytes(p, end, &ctx->endtime, sizeof(ctx->endtime));
 	if (IS_ERR(p))
 		goto out_err;
-	p = simple_get_bytes(p, end, &ctx->seq_send, sizeof(ctx->seq_send));
+	p = simple_get_bytes(p, end, &seq_send, sizeof(seq_send));
 	if (IS_ERR(p))
 		goto out_err;
+	atomic_set(&ctx->seq_send, seq_send);
 	p = simple_get_netobj(p, end, &ctx->mech_used);
 	if (IS_ERR(p))
 		goto out_err;
@@ -607,6 +609,7 @@ static int
 gss_import_v2_context(const void *p, const void *end, struct krb5_ctx *ctx,
 		gfp_t gfp_mask)
 {
+	u64 seq_send64;
 	int keylen;
 
 	p = simple_get_bytes(p, end, &ctx->flags, sizeof(ctx->flags));
@@ -617,14 +620,15 @@ gss_import_v2_context(const void *p, const void *end, struct krb5_ctx *ctx,
 	p = simple_get_bytes(p, end, &ctx->endtime, sizeof(ctx->endtime));
 	if (IS_ERR(p))
 		goto out_err;
-	p = simple_get_bytes(p, end, &ctx->seq_send64, sizeof(ctx->seq_send64));
+	p = simple_get_bytes(p, end, &seq_send64, sizeof(seq_send64));
 	if (IS_ERR(p))
 		goto out_err;
+	atomic64_set(&ctx->seq_send64, seq_send64);
 	/* set seq_send for use by "older" enctypes */
-	ctx->seq_send = ctx->seq_send64;
-	if (ctx->seq_send64 != ctx->seq_send) {
-		dprintk("%s: seq_send64 %lx, seq_send %x overflow?\n", __func__,
-			(unsigned long)ctx->seq_send64, ctx->seq_send);
+	atomic_set(&ctx->seq_send, seq_send64);
+	if (seq_send64 != atomic_read(&ctx->seq_send)) {
+		dprintk("%s: seq_send64 %llx, seq_send %x overflow?\n", __func__,
+			seq_send64, atomic_read(&ctx->seq_send));
 		p = ERR_PTR(-EINVAL);
 		goto out_err;
 	}
diff --git a/net/sunrpc/auth_gss/gss_krb5_seal.c b/net/sunrpc/auth_gss/gss_krb5_seal.c
index b4adeb06660b..48fe4a591b54 100644
--- a/net/sunrpc/auth_gss/gss_krb5_seal.c
+++ b/net/sunrpc/auth_gss/gss_krb5_seal.c
@@ -123,30 +123,6 @@ setup_token_v2(struct krb5_ctx *ctx, struct xdr_netobj *token)
 	return krb5_hdr;
 }
 
-u32
-gss_seq_send_fetch_and_inc(struct krb5_ctx *ctx)
-{
-	u32 old, seq_send = READ_ONCE(ctx->seq_send);
-
-	do {
-		old = seq_send;
-		seq_send = cmpxchg(&ctx->seq_send, old, old + 1);
-	} while (old != seq_send);
-	return seq_send;
-}
-
-u64
-gss_seq_send64_fetch_and_inc(struct krb5_ctx *ctx)
-{
-	u64 old, seq_send = READ_ONCE(ctx->seq_send);
-
-	do {
-		old = seq_send;
-		seq_send = cmpxchg64(&ctx->seq_send64, old, old + 1);
-	} while (old != seq_send);
-	return seq_send;
-}
-
 static u32
 gss_get_mic_v1(struct krb5_ctx *ctx, struct xdr_buf *text,
 		struct xdr_netobj *token)
@@ -177,7 +153,7 @@ gss_get_mic_v1(struct krb5_ctx *ctx, struct xdr_buf *text,
 
 	memcpy(ptr + GSS_KRB5_TOK_HDR_LEN, md5cksum.data, md5cksum.len);
 
-	seq_send = gss_seq_send_fetch_and_inc(ctx);
+	seq_send = atomic_fetch_inc(&ctx->seq_send);
 
 	if (krb5_make_seq_num(ctx, ctx->seq, ctx->initiate ? 0 : 0xff,
 			      seq_send, ptr + GSS_KRB5_TOK_HDR_LEN, ptr + 8))
@@ -205,7 +181,7 @@ gss_get_mic_v2(struct krb5_ctx *ctx, struct xdr_buf *text,
 
 	/* Set up the sequence number. Now 64-bits in clear
 	 * text and w/o direction indicator */
-	seq_send_be64 = cpu_to_be64(gss_seq_send64_fetch_and_inc(ctx));
+	seq_send_be64 = cpu_to_be64(atomic64_fetch_inc(&ctx->seq_send64));
 	memcpy(krb5_hdr + 8, (char *) &seq_send_be64, 8);
 
 	if (ctx->initiate) {
diff --git a/net/sunrpc/auth_gss/gss_krb5_wrap.c b/net/sunrpc/auth_gss/gss_krb5_wrap.c
index 962fa84e6db1..5cdde6cb703a 100644
--- a/net/sunrpc/auth_gss/gss_krb5_wrap.c
+++ b/net/sunrpc/auth_gss/gss_krb5_wrap.c
@@ -228,7 +228,7 @@ gss_wrap_kerberos_v1(struct krb5_ctx *kctx, int offset,
 
 	memcpy(ptr + GSS_KRB5_TOK_HDR_LEN, md5cksum.data, md5cksum.len);
 
-	seq_send = gss_seq_send_fetch_and_inc(kctx);
+	seq_send = atomic_fetch_inc(&kctx->seq_send);
 
 	/* XXX would probably be more efficient to compute checksum
 	 * and encrypt at the same time: */
@@ -475,7 +475,7 @@ gss_wrap_kerberos_v2(struct krb5_ctx *kctx, u32 offset,
 	*be16ptr++ = 0;
 
 	be64ptr = (__be64 *)be16ptr;
-	*be64ptr = cpu_to_be64(gss_seq_send64_fetch_and_inc(kctx));
+	*be64ptr = cpu_to_be64(atomic64_fetch_inc(&kctx->seq_send64));
 
 	err = (*kctx->gk5e->encrypt_v2)(kctx, offset, buf, pages);
 	if (err)
-- 
2.19.1

^ permalink raw reply related

* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Paul Burton @ 2018-11-01 17:54 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux@roeck-us.net, linux-kernel@vger.kernel.org,
	ralf@linux-mips.org, jlayton@kernel.org,
	linuxppc-dev@lists.ozlabs.org, bfields@fieldses.org,
	linux-mips@linux-mips.org, linux-nfs@vger.kernel.org,
	akpm@linux-foundation.org, anna.schumaker@netapp.com,
	jhogan@kernel.org, netdev@vger.kernel.org, davem@davemloft.net,
	arnd@arndb.de, paulus@samba.org, mpe@ellerman.id.au
In-Reply-To: <4e2438a23d2edf03368950a72ec058d1d299c32e.camel@hammerspace.com>

Hi Trond,

On Thu, Nov 01, 2018 at 12:17:31AM +0000, Trond Myklebust wrote:
> On Wed, 2018-10-31 at 23:32 +0000, Paul Burton wrote:
> > In this particular case I have no idea why
> > net/sunrpc/auth_gss/gss_krb5_seal.c is using cmpxchg64() at all. It's
> > essentially reinventing atomic64_fetch_inc() which is already
> > provided
> > everywhere via CONFIG_GENERIC_ATOMIC64 & the spinlock approach. At
> > least
> > for atomic64_* functions the assumption that all access will be
> > performed using those same functions seems somewhat reasonable.
> > 
> > So how does the below look? Trond?
> 
> My one question (and the reason why I went with cmpxchg() in the first
> place) would be about the overflow behaviour for atomic_fetch_inc() and
> friends. I believe those functions should be OK on x86, so that when we
> overflow the counter, it behaves like an unsigned value and wraps back
> around.  Is that the case for all architectures?
> 
> i.e. are atomic_t/atomic64_t always guaranteed to behave like u32/u64
> on increment?
> 
> I could not find any documentation that explicitly stated that they
> should.

Based on other replies it seems like it's at least implicitly assumed by
other code, even if not explicitly stated.

>From a MIPS perspective where atomics are implemented using load-linked
& store-conditional instructions the actual addition will be performed
using the same addu instruction that a plain integer addition would
generate (regardless of signedness), so there'll be absolutely no
difference in arithmetic between your gss_seq_send64_fetch_and_inc()
function and atomic64_fetch_inc(). I'd expect the same to be true for
other architectures with load-linked & store-conditional style atomics.

In any case, for the benefit of anyone interested who I didn't copy on
the patch submission, here it is:

    https://lore.kernel.org/lkml/20181101175109.8621-1-paul.burton@mips.com/

Thanks,
    Paul

^ permalink raw reply

* Re: [PATCH] SUNRPC: Use atomic(64)_t for seq_send(64)
From: Trond Myklebust @ 2018-11-01 17:57 UTC (permalink / raw)
  To: anna.schumaker@netapp.com, linux-nfs@vger.kernel.org,
	paul.burton@mips.com
  Cc: bfields@fieldses.org, linux-kernel@vger.kernel.org,
	pburton@wavecomp.com, davem@davemloft.net, jlayton@kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20181101175109.8621-1-paul.burton@mips.com>

On Thu, 2018-11-01 at 17:51 +0000, Paul Burton wrote:
> The seq_send & seq_send64 fields in struct krb5_ctx are used as
> atomically incrementing counters. This is implemented using cmpxchg()
> &
> cmpxchg64() to implement what amount to custom versions of
> atomic_fetch_inc() & atomic64_fetch_inc().
> 
> Besides the duplication, using cmpxchg64() has another major drawback
> in
> that some 32 bit architectures don't provide it. As such commit
> 571ed1fd2390 ("SUNRPC: Replace krb5_seq_lock with a lockless scheme")
> resulted in build failures for some architectures.
> 
> Change seq_send to be an atomic_t and seq_send64 to be an atomic64_t,
> then use atomic(64)_* functions to manipulate the values. The
> atomic64_t
> type & associated functions are provided even on architectures which
> lack real 64 bit atomic memory access via CONFIG_GENERIC_ATOMIC64
> which
> uses spinlocks to serialize access. This fixes the build failures for
> architectures lacking cmpxchg64().
> 
> A potential alternative that was raised would be to provide
> cmpxchg64()
> on the 32 bit architectures that currently lack it, using spinlocks.
> However this would provide a version of cmpxchg64() with semantics a
> little different to the implementations on architectures with real 64
> bit atomics - the spinlock-based implementation would only work if
> all
> access to the memory used with cmpxchg64() is *always* performed
> using
> cmpxchg64(). That is not currently a requirement for users of
> cmpxchg64(), and making it one seems questionable. As such avoiding
> cmpxchg64() outside of architecture-specific code seems best,
> particularly in cases where atomic64_t seems like a better fit
> anyway.
> 
> The CONFIG_GENERIC_ATOMIC64 implementation of atomic64_* functions
> will
> use spinlocks & so faces the same issue, but with the key difference
> that the memory backing an atomic64_t ought to always be accessed via
> the atomic64_* functions anyway making the issue moot.
> 
> Signed-off-by: Paul Burton <paul.burton@mips.com>
> Fixes: 571ed1fd2390 ("SUNRPC: Replace krb5_seq_lock with a lockless
> scheme")
> Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
> Cc: Anna Schumaker <anna.schumaker@netapp.com>
> Cc: J. Bruce Fields <bfields@fieldses.org>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: linux-nfs@vger.kernel.org
> Cc: netdev@vger.kernel.org
> ---
>  include/linux/sunrpc/gss_krb5.h     |  7 ++-----
>  net/sunrpc/auth_gss/gss_krb5_mech.c | 16 ++++++++++------
>  net/sunrpc/auth_gss/gss_krb5_seal.c | 28 ++-------------------------
> -
>  net/sunrpc/auth_gss/gss_krb5_wrap.c |  4 ++--
>  4 files changed, 16 insertions(+), 39 deletions(-)
> 

Thanks Paul! ...and thanks for your patience in working out the
atomicity wraparound issues. Applied..

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply

* Re: [LKP] [tools]  6534770d6f: kernel_selftests.bpf.test_lwt_seg6local.fail
From: Jakub Kicinski @ 2018-11-01 18:18 UTC (permalink / raw)
  To: kernel test robot
  Cc: David S. Miller, LKML, Linus Torvalds, lkp, Martin KaFai Lau,
	Okash Khawaja, Daniel Borkmann, Alexei Starovoitov,
	netdev@vger.kernel.org
In-Reply-To: <20181101032643.GB24195@shao2-debian>

On Thu, 1 Nov 2018 11:26:44 +0800, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: 6534770d6f176093b50896961107b2d545ef38f0 ("tools: bpf: fix BTF code added twice to different trees")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

That's a fix to a merge commit, perhaps something before broke it?

> in testcase: kernel_selftests
> with following parameters:
> 
> 	group: kselftests-00
> 
> test-description: The kernel contains a set of "self tests" under the tools/testing/selftests/ directory. These are intended to be small unit tests to exercise individual code paths in the kernel.
> test-url: https://www.kernel.org/doc/Documentation/kselftest.txt
> 
> 
> on test machine: 80 threads Skylake with 64G memory
> 
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> 
> commit:
>   6b431d50d2a8acd1c418b998b856a055252ebc3a
>   6534770d6f176093b50896961107b2d545ef38f0
> 
> 6b431d50d2a8acd1 6534770d6f176093b508969611
> ---------------- --------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>         443:8        14689%        1618:8     dmesg.timestamp:last
>            :8          100%           8:8     kernel_selftests.bpf.test_lwt_seg6local.fail
>            :8          100%           8:8     kernel_selftests.bpf.test_lwt_seg6local.sh.fail
> 
> 
> 
> 
> To reproduce:
> 
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run     job.yaml
> 
> 
> 
> Thanks,
> Rong Chen

^ permalink raw reply

* Re: [PATCH v2 net 3/3] net/mlx4_en: use __netdev_tx_sent_queue()
From: Tariq Toukan @ 2018-11-01  9:19 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: netdev, Tariq Toukan, Willem de Bruijn, Eric Dumazet
In-Reply-To: <20181031153914.132127-4-edumazet@google.com>



On 31/10/2018 5:39 PM, Eric Dumazet wrote:
> doorbell only depends on xmit_more and netif_tx_queue_stopped()
> 
> Using __netdev_tx_sent_queue() avoids messing with BQL stop flag,
> and is more generic.
> 
> This patch increases performance on GSO workload by keeping
> doorbells to the minimum required.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_tx.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 1857ee0f0871d48285a6d3711f7c3e9a1e08a05f..6f5153afcab4dfc331c099da854c54f1b9500887 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -1006,7 +1006,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>   		ring->packets++;
>   	}
>   	ring->bytes += tx_info->nr_bytes;
> -	netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
>   	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
>   
>   	if (tx_info->inl)
> @@ -1044,7 +1043,10 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>   		netif_tx_stop_queue(ring->tx_queue);
>   		ring->queue_stopped++;
>   	}
> -	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> +
> +	send_doorbell = __netdev_tx_sent_queue(ring->tx_queue,
> +					       tx_info->nr_bytes,
> +					       skb->xmit_more);
>   
>   	real_size = (real_size / 16) & 0x3f;
>   
> 

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>

Looks good to me.
Thanks.


^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Jesper Dangaard Brouer @ 2018-11-01  9:22 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: brouer, Eric Dumazet, netdev, Tariq Toukan, Ilias Apalodimas,
	Yoel Caspersen, Mel Gorman, Aaron Lu
In-Reply-To: <8e10bf68-f3b3-98f2-91a5-25b151756dd6@itcare.pl>

On Wed, 31 Oct 2018 23:20:01 +0100
Paweł Staszewski <pstaszewski@itcare.pl> wrote:

> W dniu 31.10.2018 o 23:09, Eric Dumazet pisze:
> >
> > On 10/31/2018 02:57 PM, Paweł Staszewski wrote:  
> >> Hi
> >>
> >> So maybee someone will be interested how linux kernel handles
> >> normal traffic (not pktgen :) )

Pawel is this live production traffic?

I know Yoel (Cc) is very interested to know the real-life limitation of
Linux as a router, especially with VLANs like you use.


> >>
> >> Server HW configuration:
> >>
> >> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> >>
> >> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
> >>
> >>
> >> Server software:
> >>
> >> FRR - as routing daemon
> >>
> >> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node)
> >>
> >> enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node)
> >>
> >>
> >> Maximum traffic that server can handle:
> >>
> >> Bandwidth
> >>
> >>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> >>    input: /proc/net/dev type: rate
> >>    \         iface                   Rx Tx                Total
> >> ==============================================================================
> >>         enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
> >>         enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
> >> ------------------------------------------------------------------------------
> >>              total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s
> >>

Actually rather impressive number for a Linux router.

> >>
> >> Packets per second:
> >>
> >>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> >>    input: /proc/net/dev type: rate
> >>    -         iface                   Rx Tx                Total
> >> ==============================================================================
> >>         enp175s0f1:      5248589.00 P/s       3486617.75 P/s 8735207.00 P/s
> >>         enp175s0f0:      3557944.25 P/s       5232516.00 P/s 8790460.00 P/s
> >> ------------------------------------------------------------------------------
> >>              total:      8806533.00 P/s       8719134.00 P/s 17525668.00 P/s
> >>

Average packet size:
  (28.51*10^9/8)/5248589 =  678.99 bytes 
  (38.07*10^9/8)/3557944 = 1337.49 bytes


> >> After reaching that limits nics on the upstream side (more RX
> >> traffic) start to drop packets
> >>
> >>
> >> I just dont understand that server can't handle more bandwidth
> >> (~40Gbit/s is limit where all cpu's are 100% util) - where pps on
> >> RX side are increasing.
> >>
> >> Was thinking that maybee reached some pcie x16 limit - but x16 8GT
> >> is 126Gbit - and also when testing with pktgen i can reach more bw
> >> and pps (like 4x more comparing to normal internet traffic)
> >>
> >> And wondering if there is something that can be improved here.
> >>
> >>
> >>
> >> Some more informations / counters / stats and perf top below:
> >>
> >> Perf top flame graph:
> >>
> >> https://uploadfiles.io/7zo6u

Thanks a lot for the flame graph!

> >>
> >> System configuration(long):
> >>
> >>
> >> cat /sys/devices/system/node/node1/cpulist
> >> 14-27,42-55
> >> cat /sys/class/net/enp175s0f0/device/numa_node
> >> 1
> >> cat /sys/class/net/enp175s0f1/device/numa_node
> >> 1
> >>

Hint grep can give you nicer output that cat:

$ grep -H . /sys/class/net/*/device/numa_node

> >>
> >>
> >>
> >>
> >> ip -s -d link ls dev enp175s0f0
> >> 6: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
> >>      link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
> >>      RX: bytes  packets  errors  dropped overrun mcast
> >>      184142375840858 141347715974 2       2806325 0       85050528
> >>      TX: bytes  packets  errors  dropped carrier collsns
> >>      99270697277430 172227994003 0       0       0       0
> >>
> >>   ip -s -d link ls dev enp175s0f1
> >> 7: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
> >>      link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
> >>      RX: bytes  packets  errors  dropped overrun mcast
> >>      99686284170801 173507590134 61      669685  0       100304421
> >>      TX: bytes  packets  errors  dropped carrier collsns
> >>      184435107970545 142383178304 0       0       0       0
> >>

You have increased the default (1000) qlen to 8192, why?

What default qdisc do you run?... looking through your very detail main
email report (I do love the details you give!).  You run
pfifo_fast_dequeue, thus this 8192 qlen is actually having effect.

I would like to know if and how much qdisc_dequeue bulking is happening
in this setup?  Can you run:

 perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets 

The perf-stat-hist is from Brendan Gregg's git-tree:
 https://github.com/brendangregg/perf-tools
 https://github.com/brendangregg/perf-tools/blob/master/misc/perf-stat-hist


> >> ./softnet.sh
> >> cpu      total    dropped   squeezed  collision        rps flow_limit
> >>
> >>
> >>
> >>
> >>     PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
> >> ------------------------------------------------------------------------------------------
> >>
> >>      26.78%  [kernel]       [k] queued_spin_lock_slowpath  
> >
> > This is highly suspect.
> >

I agree! -- 26.78% spend in queued_spin_lock_slowpath.  Hint if you see
_raw_spin_lock then it is likely not a contended lock, but if you see
queued_spin_lock_slowpath in a perf-report your workload is likely in
trouble.


> > A call graph (perf record -a -g sleep 1; perf report --stdio)
> > would tell what is going on.  
>
> perf report:
> https://ufile.io/rqp0h
> 

Thanks for the output (my 30" screen is just large enough to see the
full output).  Together with the flame-graph, it is clear that this
lock happens in the page allocator code.

Section copied out:

  mlx5e_poll_tx_cq
  |          
   --16.34%--napi_consume_skb
             |          
             |--12.65%--__free_pages_ok
             |          |          
             |           --11.86%--free_one_page
             |                     |          
             |                     |--10.10%--queued_spin_lock_slowpath
             |                     |          
             |                      --0.65%--_raw_spin_lock
             |          
             |--1.55%--page_frag_free
             |          
              --1.44%--skb_release_data


Let me explain what (I think) happens.  The mlx5 driver RX-page recycle
mechanism is not effective in this workload, and pages have to go
through the page allocator.  The lock contention happens during mlx5
DMA TX completion cycle.  And the page allocator cannot keep up at
these speeds.

One solution is extend page allocator with a bulk free API.  (This have
been on my TODO list for a long time, but I don't have a
micro-benchmark that trick the driver page-recycle to fail).  It should
fit nicely, as I can see that kmem_cache_free_bulk() does get
activated (bulk freeing SKBs), which means that DMA TX completion do
have a bulk of packets. 

We can (and should) also improve the page recycle scheme in the driver.
After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve the
page_pool, and we will (attempt) to generalize this, for both high-end
mlx5 and more low-end ARM64-boards (macchiatobin and espressobin).

The MM-people is in parallel working to improve the performance of
order-0 page returns.  Thus, the explicit page bulk free API might
actually become less important.  I actually think (Cc.) Aaron have a
patchset he would like you to test, which removes the (zone->)lock
you hit in free_one_page().

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH resend] macvlan: use per-cpu queues for broadcast and multicast packets
From: David Miller @ 2018-11-01 18:53 UTC (permalink / raw)
  To: khlebnikov; +Cc: netdev, linux-kernel, vfedorenko
In-Reply-To: <154108883150.919735.10620997593741055347.stgit@buzz>

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Date: Thu, 01 Nov 2018 19:13:51 +0300

> Currently macvlan has single per-port queue for broadcast and multicast.
> This disrupts order of packets when flows from different cpus are mixed.
> 
> This patch replaces this queue with single set of per-cpu queues.
> Pointer to macvlan port is passed in skb control block.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Reported-add-tested-by: Vadim Fedorenko <vfedorenko@yandex-team.ru>

Konstantin, all of your postings to vger.kernel.org are being rejected.

They are because of the:

"List-Unsubscribe: <https://ml.yandex-team.ru/lists/vfedorenko@yandex-team.ru/unsubscribe-click>"

and similar strings that appear in your email headers.

Please disable whatever is producing these email headers, otherwise
nobody (including me) will see and process your postings.

Thank you.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox