Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] sctp: define SCTP_SS_DEFAULT for Stream schedulers
From: David Miller @ 2018-11-04  2:41 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <4986625e9c58cccdfbb6c809a82c77a83b57a541.1541224891.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Sat,  3 Nov 2018 14:01:31 +0800

> According to rfc8260#section-4.3.2, SCTP_SS_DEFAULT is required to
> defined as SCTP_SS_FCFS or SCTP_SS_RR.
> 
> SCTP_SS_FCFS is used for SCTP_SS_DEFAULT's value in this patch.
> 
> Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations")
> Reported-by: Jianwen Ji <jiji@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Also applied and queued up for -stable.

Thanks.

^ permalink raw reply

* Re: [PATCH net] sctp: fix strchange_flags name for Stream Change Event
From: David Miller @ 2018-11-04  2:39 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <34b8e0475a841d258c2165b5a46a72c373e52976.1541224785.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Sat,  3 Nov 2018 13:59:45 +0800

> As defined in rfc6525#section-6.1.3, SCTP_STREAM_CHANGE_DENIED
> and SCTP_STREAM_CHANGE_FAILED should be used instead of
> SCTP_ASSOC_CHANGE_DENIED and SCTP_ASSOC_CHANGE_FAILED.
> 
> To keep the compatibility, fix it by adding two macros.
> 
> Fixes: b444153fb5a6 ("sctp: add support for generating add stream change event notification")
> Reported-by: Jianwen Ji <jiji@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net] net-sysfs: fix formatting of tx_timeout attribute
From: David Miller @ 2018-11-04  2:31 UTC (permalink / raw)
  To: jwi; +Cc: netdev
In-Reply-To: <20181102183317.2017-1-jwi@linux.ibm.com>

From: Julian Wiedmann <jwi@linux.ibm.com>
Date: Fri,  2 Nov 2018 19:33:17 +0100

> Add a missing newline.
> 
> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
> ---
>  net/core/net-sysfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index bd67c4d0fcfd..ef06409d768e 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1039,7 +1039,7 @@ static ssize_t tx_timeout_show(struct netdev_queue *queue, char *buf)
>  	trans_timeout = queue->trans_timeout;
>  	spin_unlock_irq(&queue->_xmit_lock);
>  
> -	return sprintf(buf, "%lu", trans_timeout);
> +	return sprintf(buf, "%lu\n", trans_timeout);

Maybe use 'fmt_ulong' instead?

^ permalink raw reply

* Re: [PATCH] qed: fix link config error handling
From: David Miller @ 2018-11-04  2:27 UTC (permalink / raw)
  To: arnd
  Cc: Ariel.Elior, everest-linux-l2, sudarsana.kalluru, Tomer.Tayar,
	denis.bolotin, Rahul.Verma, netdev, linux-kernel
In-Reply-To: <20181102153632.1646632-1-arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Fri,  2 Nov 2018 16:36:24 +0100

> gcc-8 notices that qed_mcp_get_transceiver_data() may fail to
> return a result to the caller:
> 
> drivers/net/ethernet/qlogic/qed/qed_mcp.c: In function 'qed_mcp_trans_speed_mask':
> drivers/net/ethernet/qlogic/qed/qed_mcp.c:1955:2: error: 'transceiver_type' may be used uninitialized in this function [-Werror=maybe-uninitialized]
> 
> When an error is returned by qed_mcp_get_transceiver_data(), we
> should propagate that to the caller of qed_mcp_trans_speed_mask()
> rather than continuing with uninitialized data.
> 
> Fixes: c56a8be7e7aa ("qed: Add supported link and advertise link to display in ethtool.")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied, thanks.

^ permalink raw reply

* Re: [RFC PATCH] net/mlx5e: Temp software stats variable is not required
From: David Miller @ 2018-11-04  2:36 UTC (permalink / raw)
  To: saeedm; +Cc: netdev, arnd, akpm, jgg, eric.dumazet, eranbe, leonro
In-Reply-To: <20181103015422.22978-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Fri,  2 Nov 2018 18:54:22 -0700

> +static void mlx5e_fold_sw_stats(struct mlx5e_priv *priv, struct rtnl_link_stats64 *s)
> +{
> +	int i;
> +
> +	/* not required ? */
> +	memset(s, 0, sizeof(*s));

Why wouldn't this be required?

I can see that perhaps you can only zero out the statistics that are used in
the ndo_get_stats64() code path, but that's different.

^ permalink raw reply

* Re: pull-request: bpf 2018-11-03
From: David Miller @ 2018-11-04  2:34 UTC (permalink / raw)
  To: daniel; +Cc: ast, netdev
In-Reply-To: <20181103004703.6007-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Sat,  3 Nov 2018 01:47:03 +0100

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Fix BPF prog kallsyms and bpf_prog_get_info_by_fd() jited address export
>    to not use page start but actual start instead to work correctly with
>    profiling e.g. finding hot instructions from stack traces. Also fix latter
>    with regards to dumping address and jited len for !multi prog, from Song.
> 
> 2) Fix bpf_prog_get_info_by_fd() for !root to zero info.nr_jited_func_lens
>    instead of leaving the user provided length, from Daniel.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.

^ permalink raw reply

* Re: [PATCH v2 net 0/3] net: bql: better deal with GSO
From: David Miller @ 2018-11-04  0:26 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, tariqt, willemb, eric.dumazet
In-Reply-To: <CANn89iLJcc4o7ftTL3Sb0GEd3tYvexKn-tHyDWzqEo2Pk1ECRA@mail.gmail.com>

From: Eric Dumazet <edumazet@google.com>
Date: Sat, 3 Nov 2018 16:00:39 -0700

> On Sat, Nov 3, 2018 at 3:40 PM David Miller <davem@davemloft.net> wrote:
> 
>> Series applied, but I wonder how many other commonly used drivers we
>> should update the same way mlx4 is here?
> 
> I can provide patches but can not test them, so probably we should get
> Tested-by tags before applying them.

Ok, agreed.

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-04  0:24 UTC (permalink / raw)
  To: David Ahern, Jesper Dangaard Brouer; +Cc: netdev, Yoel Caspersen
In-Reply-To: <f6188534-38d7-fc3e-189d-443a76bcb47d@gmail.com>

n


W dniu 03.11.2018 o 18:32, David Ahern pisze:
> On 11/1/18 11:30 AM, Paweł Staszewski wrote:
>>>>>   
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c
>>>>>
>>>>>
>>>> I can try some tests on same hw but testlab configuration - will give it
>>>> a try :)
>>>>
>>> That version does not work with VLANs. I have patches for it but it
>>> needs a bit more work before sending out. Perhaps I can get back to it
>>> next week.
>>>
>> Will be nice - next week i will be able to replace network controller
>> and install separate two 100Gbit nics into two pciex x16 slots - so can
>> test without hitting pcie bandwidth limits.
>>
>>
> Does your setup have any other device types besides physical ports with
> VLANs (e.g., any macvlans or bonds)?
>
>
no.
just
phy(mlnx)->vlans only config

And today again after allpy patch for page allocator - reached again 
64/64 Gbit/s

with only 50-60% cpu load

today no slowpath hit for netwoking :)

But again dropped pckt at 64GbitRX and 64TX ....
And as it should not be pcie express limit  -i think something more is 
going on there - and hard to catch - cause perf top doestn chenged 
besides there is no queued slowpath hit now


I ordered now also intel cards to compare - but 3 weeks eta
Faster - cause 3 days - i will have mellanox connectx 5 - so can 
separate traffic to two different x16 pcie busses

^ permalink raw reply

* [PATCH] e1000e: Change watchdog task to be delayed work
From: Robert Eshleman @ 2018-11-04  0:17 UTC (permalink / raw)
  Cc: Robert Eshleman, Jeff Kirsher, David S. Miller, intel-wired-lan,
	netdev, linux-kernel

This completes a pending TODO to use queue_delayed_work() instead of
schedule_work().

Signed-off-by: Robert Eshleman <bobbyeshleman@gmail.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 16a73bd..a387b21 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -39,6 +39,8 @@ static int debug = -1;
 module_param(debug, int, 0);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
+struct workqueue_struct *e1000e_workqueue;
+
 static const struct e1000_info *e1000_info_tbl[] = {
 	[board_82571]		= &e1000_82571_info,
 	[board_82572]		= &e1000_82572_info,
@@ -5137,9 +5139,9 @@ static void e1000_watchdog(struct timer_list *t)
 	struct e1000_adapter *adapter = from_timer(adapter, t, watchdog_timer);
 
 	/* Do the rest outside of interrupt context */
-	schedule_work(&adapter->watchdog_task);
+	struct delayed_work *dwork = to_delayed_work(&adapter->watchdog_task);
 
-	/* TODO: make this use queue_delayed_work() */
+	queue_delayed_work(e1000e_workqueue, dwork, 1);
 }
 
 static void e1000_watchdog_task(struct work_struct *work)
@@ -7572,6 +7574,13 @@ static int __init e1000_init_module(void)
 		e1000e_driver_version);
 	pr_info("Copyright(c) 1999 - 2015 Intel Corporation.\n");
 
+	e1000e_workqueue = alloc_workqueue("%s", WQ_MEM_RECLAIM, 0,
+					   e1000e_driver_name);
+	if (!e1000e_workqueue) {
+		pr_err("%s: Failed to create workqueue\n", e1000e_driver_name);
+		return -ENOMEM;
+	}
+
 	return pci_register_driver(&e1000_driver);
 }
 module_init(e1000_init_module);
@@ -7585,6 +7594,7 @@ module_init(e1000_init_module);
 static void __exit e1000_exit_module(void)
 {
 	pci_unregister_driver(&e1000_driver);
+	destroy_workqueue(e1000e_workqueue);
 }
 module_exit(e1000_exit_module);
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2 net 0/3] net: bql: better deal with GSO
From: Eric Dumazet @ 2018-11-03 23:00 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan, Willem de Bruijn, Eric Dumazet
In-Reply-To: <20181103.154028.1864906755289829872.davem@davemloft.net>

On Sat, Nov 3, 2018 at 3:40 PM David Miller <davem@davemloft.net> wrote:

> Series applied, but I wonder how many other commonly used drivers we
> should update the same way mlx4 is here?

I can provide patches but can not test them, so probably we should get
Tested-by tags before applying them.

Thanks.

^ permalink raw reply

* Re: [PATCH net] net: hns3: Fix for out-of-bounds access when setting pfc back pressure
From: David Miller @ 2018-11-03 22:42 UTC (permalink / raw)
  To: linyunsheng
  Cc: huangdaode, xuwei5, liguozhu, Yisen.Zhuang, john.garry, linuxarm,
	salil.mehta, lipeng321, netdev, linux-kernel
In-Reply-To: <1541074363-98630-1-git-send-email-linyunsheng@huawei.com>

From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Thu, 1 Nov 2018 20:12:43 +0800

> The vport should be initialized to hdev->vport for each bp group,
> otherwise it will cause out-of-bounds access and bp setting not
> correct problem.
 ...
> Fixes: 67bf2541f4b9 ("net: hns3: Fixes the back pressure setting when sriov is enabled")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v2 net 0/3] net: bql: better deal with GSO
From: David Miller @ 2018-11-03 22:40 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, tariqt, willemb, eric.dumazet
In-Reply-To: <20181031153914.132127-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Wed, 31 Oct 2018 08:39:11 -0700

> While BQL bulk dequeue works well for TSO packets, it is
> not very efficient as soon as GSO is involved.
> 
> On a GSO only workload (UDP or TCP), this patch series
> can save about 8 % of cpu cycles on a 40Gbit mlx4 NIC,
> by keeping optimal batching, and avoiding expensive
> doorbells, qdisc requeues and reschedules.
> 
> This patch series :
> 
> - Add __netdev_tx_sent_queue() so that drivers
>   can implement efficient BQL and xmit_more support.
> 
> - Implement a work around in dev_hard_start_xmit()
>   for drivers not using __netdev_tx_sent_queue()
> 
> - changes mlx4 to use __netdev_tx_sent_queue()
> 
> v2: Tariq and Willem feedback addressed.
>     added __netdev_tx_sent_queue() (Willem suggestion)

Series applied, but I wonder how many other commonly used drivers we
should update the same way mlx4 is here?

^ permalink raw reply

* Re: [PATCH v2] libertas: return errno from lbs_add_card()
From: Pavel Machek @ 2018-11-03 19:43 UTC (permalink / raw)
  To: Lubomir Rintel
  Cc: Kalle Valo, David S. Miller, libertas-dev, linux-wireless, netdev,
	linux-kernel
In-Reply-To: <20181007003327.2806010-1-lkundrak@v3.sk>

[-- Attachment #1: Type: text/plain, Size: 431 bytes --]

On Sun 2018-10-07 02:33:27, Lubomir Rintel wrote:
> This makes the error handling somewhat cleaner -- lbs_add_card() does no
> logner throw away the errno and lets its callers propagate it.
> 
> Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>

Acked-by: Pavel Machek <pavel@ucw.cz>


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH net 0/6] s390/qeth: fixes 2018-11-02
From: David Miller @ 2018-11-03 17:44 UTC (permalink / raw)
  To: jwi; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20181102180413.53020-1-jwi@linux.ibm.com>

From: Julian Wiedmann <jwi@linux.ibm.com>
Date: Fri,  2 Nov 2018 19:04:07 +0100

> please apply one round of qeth fixes for -net.
> 
> Patch 1 is rather large and removes a use-after-free hazard from many of our
> debug trace entries.
> Patch 2 is yet another fix-up for the L3 subdriver's new IP address management
> code.
> Patch 3 and 4 resolve some fallout from the recent changes wrt how/when qeth
> allocates its net_device.
> Patch 5 makes sure we don't set reserved bits when building HW commands from
> user-provided data.
> And finally, patch 6 allows ethtool to play nice with new HW.

Series applied.

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: David Ahern @ 2018-11-03 17:32 UTC (permalink / raw)
  To: Paweł Staszewski, Jesper Dangaard Brouer; +Cc: netdev, Yoel Caspersen
In-Reply-To: <a63df254-3791-cd5c-ddfc-6c9b8d38c559@itcare.pl>

On 11/1/18 11:30 AM, Paweł Staszewski wrote:
>>>>  
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c
>>>>
>>>>
>>> I can try some tests on same hw but testlab configuration - will give it
>>> a try :)
>>>
>> That version does not work with VLANs. I have patches for it but it
>> needs a bit more work before sending out. Perhaps I can get back to it
>> next week.
>>
> Will be nice - next week i will be able to replace network controller
> and install separate two 100Gbit nics into two pciex x16 slots - so can
> test without hitting pcie bandwidth limits.
> 
> 

Does your setup have any other device types besides physical ports with
VLANs (e.g., any macvlans or bonds)?

^ permalink raw reply

* Re: [PATCH] openvswitch: fix linking without CONFIG_NF_CONNTRACK_LABELS
From: David Miller @ 2018-11-04  2:29 UTC (permalink / raw)
  To: arnd-r2nGTMty4D4
  Cc: dev-yBygre7rU0TnMu66kgdUjQ,
	eswierk-FilZDy9cOaHkQYj/0HfcvtBPR1lH4CV8,
	netdev-u79uwXL29TY76Z2rM5mHXA, fw-HFFVJYpyMKqzQB+pC5nmwQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Julia.Lawall-L2FTfq7BK8M,
	gfree.wind-GveOkT2v/s5BDgjK7y7TUQ, thierry-EDrxZYgYwg/bB1ukUYwELw,
	fbl-H+wXaHxf7aLQT0dZR+AlfA, pablo-Cap9r6Oaw4JrovVCs/uTlw
In-Reply-To: <20181102153705.1664786-1-arnd-r2nGTMty4D4@public.gmane.org>

From: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>
Date: Fri,  2 Nov 2018 16:36:55 +0100

> When CONFIG_CC_OPTIMIZE_FOR_DEBUGGING is enabled, the compiler
> fails to optimize out a dead code path, which leads to a link failure:
> 
> net/openvswitch/conntrack.o: In function `ovs_ct_set_labels':
> conntrack.c:(.text+0x2e60): undefined reference to `nf_connlabels_replace'
> 
> In this configuration, we can take a shortcut, and completely
> remove the contrack label code. This may also help the regular
> optimization.
> 
> Signed-off-by: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>

Applied, thanks Arnd.

^ permalink raw reply

* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Masami Hiramatsu @ 2018-11-04  2:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Josh Poimboeuf, Aleksa Sarai, Naveen N. Rao, Anil S Keshavamurthy,
	David S. Miller, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Shuah Khan, Alexei Starovoitov, Daniel Borkmann,
	Brendan Gregg, Christian Brauner, Aleksa Sarai, netdev, linux-doc
In-Reply-To: <20181103133021.6676708c@vmware.local.home>

On Sat, 3 Nov 2018 13:30:21 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sun, 4 Nov 2018 01:34:30 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > 
> > > I was thinking of a bitmask that represents the handlers, and use that
> > > to map which handler gets called for which shadow entry for a
> > > particular task.  
> > 
> > Hmm, I doubt that is too complicated and not scalable. I rather like to see
> > the open shadow entry...
> 
> It can scale and not too complex (I already played a little with it).
> But that said, I'm not committed to it, and using the shadow stack is
> also an interesting idea.
> 
> > 
> > entry: [[original_retaddr][function][modified_retaddr]]
> > 
> > So if there are many users on same function, the entries will be like this 
> > 
> > [[original_return_address][function][trampoline_A]]
> > [[trampline_A][function][trampoline_B]]
> > [[trampline_B][function][trampoline_C]]
> > 
> > And on the top of the stack, there is trampline_C instead of original_return_address.
> > In this case, return to trampoline_C(), it jumps back to trampline_B() and then
> > it jumps back to trampline_A(). And eventually it jumps back to
> > original_return_address.
> 
> Where are trampolines A, B, and C made? Do we also need to dynamically
> create them? If I register multiple function tracing ones, each one
> will need its own trampoline?
> 

No, I think tramplines are very limited. currently we will only have ftrace
and kretprobe trampolines.


> > This way, we don't need allocate another bitmap/pages for the shadow stack.
> > We only need a shadow stack for each task.
> > Also, unwinder can easily find the trampline_C from the shadow stack and restores
> > original_return_address. (of course trampline_A,B,C must be registered so that
> > search function can skip it.)
> 
> What I was thinking was to store a count and the functions to be called:
> 
> 
> 	[original_return_address]
> 	[function_A]
> 	[function_B]
> 	[function_C]
> 	[ 3 ]
> 
> Then the trampoline that processes the return codes for ftrace (and
> kretprobes and everyone else) can simply do:
> 
> 	count = pop_shadow_stack();
> 	for (i = 0; i < count; i++) {
> 		func = pop_shadow_stack();
> 		func(...);
> 	}
> 	return_address = pop_shadow_stack();

Ah, that's a good idea. I think we also have to store the called function
entry address with the number header, but basically I agree with you.

If we have a space to store a data with the function address, that is also
good to kretprobe. systemtap heavily uses "entry data" for saving some data
at function entry for exit handler.

> That way we only need to register a function to the return handler and
> it will be called, without worrying about making trampolines. There
> will just be a single trampoline that handles all the work.

OK, and could you make it independent from func graph tracer, so that
CONFIG_KPROBES=y but CONFIG_FUNCTION_GRAPH_TRACER=n kernel can support
kretprobes too.

Thank you,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Joe Perches @ 2018-11-03 17:04 UTC (permalink / raw)
  To: Yafang Shao, davem, edumazet; +Cc: netdev, linux-kernel
In-Reply-To: <1541264071-9905-1-git-send-email-laoar.shao@gmail.com>

On Sun, 2018-11-04 at 00:54 +0800, Yafang Shao wrote:
> In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
> and under this condition we don't need to update the snd_una.
> 
> Furthermore, tcp_ack_update_window() is only called in slow path,
> so introducing this check won't affect the fast path processing.
[]
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
[]
> @@ -3610,7 +3611,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  	if (flag & FLAG_UPDATE_TS_RECENT)
>  		tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
>  
> -	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
> +	if (!(flag & FLAG_SLOWPATH) && flag & FLAG_SND_UNA_ADVANCED) {

stylistic nit:

While the precedence is correct in any case,
perhaps adding parentheses around
	flag & FLAG_SND_UNA_ADVANCED
would make it more obvious.

^ permalink raw reply

* [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Yafang Shao @ 2018-11-03 16:54 UTC (permalink / raw)
  To: davem, edumazet; +Cc: netdev, linux-kernel, Yafang Shao

In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
and under this condition we don't need to update the snd_una.

Furthermore, tcp_ack_update_window() is only called in slow path,
so introducing this check won't affect the fast path processing.

By the way, '&' is a little faster than '-', so I replaced after() with
"flag & FLAG_SND_UNA_ADVANCED".

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 net/ipv4/tcp_input.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2868ef2..db5a6b7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3376,7 +3376,8 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
 		}
 	}
 
-	tcp_snd_una_update(tp, ack);
+	if (after(ack, tp->snd_una))
+		tcp_snd_una_update(tp, ack);
 
 	return flag;
 }
@@ -3610,7 +3611,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (flag & FLAG_UPDATE_TS_RECENT)
 		tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
 
-	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
+	if (!(flag & FLAG_SLOWPATH) && flag & FLAG_SND_UNA_ADVANCED) {
 		/* Window is constant, pure forward advance.
 		 * No more checks are required.
 		 * Note, we use the fact that SND.UNA>=SND.WL2.
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Masami Hiramatsu @ 2018-11-03 16:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Josh Poimboeuf, Aleksa Sarai, Naveen N. Rao, Anil S Keshavamurthy,
	David S. Miller, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Shuah Khan, Alexei Starovoitov, Daniel Borkmann,
	Brendan Gregg, Christian Brauner, Aleksa Sarai, netdev, linux-doc
In-Reply-To: <20181103091341.3d32683e@vmware.local.home>

On Sat, 3 Nov 2018 09:13:41 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sat, 3 Nov 2018 22:00:12 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > On Fri, 2 Nov 2018 12:13:07 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> 
> > > Because that means if function graph tracer is active, then you can't
> > > do a kretprobe, and vice versa. I'd really like to have it working for
> > > multiple users, then we could trace different graph functions and store
> > > them in different buffers. It would also allow for perf to use function
> > > graph tracer too.  
> > 
> > Steve, how woul you allow multiple users on it? Something like this?
> > 
> > ret_trampoline_multiple(){
> >    list_for_each(handler, &shadow_entry[i].handlers, list)
> > 	handler(shadow_entry[i]);
> >    restore_retval_and_jump_to(shadow_entry[i].orig);
> > }
> >
> 
> Something like that. But since it's not a single mapping between shadow
> entries and handlers, that is we have multiple tasks with multiple
> shadow entries and multiple handlers, we can't use a link list (two
> different parents per handler).

Yes, I understand it.

> I was thinking of a bitmask that represents the handlers, and use that
> to map which handler gets called for which shadow entry for a
> particular task.

Hmm, I doubt that is too complicated and not scalable. I rather like to see
the open shadow entry...

entry: [[original_retaddr][function][modified_retaddr]]

So if there are many users on same function, the entries will be like this 

[[original_return_address][function][trampoline_A]]
[[trampline_A][function][trampoline_B]]
[[trampline_B][function][trampoline_C]]

And on the top of the stack, there is trampline_C instead of original_return_address.
In this case, return to trampoline_C(), it jumps back to trampline_B() and then
it jumps back to trampline_A(). And eventually it jumps back to
original_return_address.

This way, we don't need allocate another bitmap/pages for the shadow stack.
We only need a shadow stack for each task.
Also, unwinder can easily find the trampline_C from the shadow stack and restores
original_return_address. (of course trampline_A,B,C must be registered so that
search function can skip it.)

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Yafang Shao @ 2018-11-04  1:27 UTC (permalink / raw)
  To: joe; +Cc: David Miller, Eric Dumazet, netdev, LKML
In-Reply-To: <945a90d62aa45c4b53349ba0a104574759d40efe.camel@perches.com>

On Sun, Nov 4, 2018 at 1:04 AM Joe Perches <joe@perches.com> wrote:
>
> On Sun, 2018-11-04 at 00:54 +0800, Yafang Shao wrote:
> > In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
> > and under this condition we don't need to update the snd_una.
> >
> > Furthermore, tcp_ack_update_window() is only called in slow path,
> > so introducing this check won't affect the fast path processing.
> []
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> []
> > @@ -3610,7 +3611,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
> >       if (flag & FLAG_UPDATE_TS_RECENT)
> >               tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
> >
> > -     if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
> > +     if (!(flag & FLAG_SLOWPATH) && flag & FLAG_SND_UNA_ADVANCED) {
>
> stylistic nit:
>
> While the precedence is correct in any case,
> perhaps adding parentheses around
>         flag & FLAG_SND_UNA_ADVANCED
> would make it more obvious.
>

Sure. will change it.

Thanks
Yafang

^ permalink raw reply

* Re: [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Yafang Shao @ 2018-11-04  1:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Eric Dumazet, netdev, LKML
In-Reply-To: <5f585304-b6f5-5f49-adcf-eaed471c0d76@gmail.com>

On Sun, Nov 4, 2018 at 8:40 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 11/03/2018 09:54 AM, Yafang Shao wrote:
> > In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
> > and under this condition we don't need to update the snd_una.
> >
> > Furthermore, tcp_ack_update_window() is only called in slow path,
> > so introducing this check won't affect the fast path processing.
> >
> > By the way, '&' is a little faster than '-', so I replaced after() with
> > "flag & FLAG_SND_UNA_ADVANCED".
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  net/ipv4/tcp_input.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 2868ef2..db5a6b7 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3376,7 +3376,8 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
> >               }
> >       }
> >
> > -     tcp_snd_una_update(tp, ack);
> > +     if (after(ack, tp->snd_una))
> > +             tcp_snd_una_update(tp, ack);
> >
>
> Adding this after() here is confusing, how ack could be before snd_una ?
> That would be a serious bug.
>

ack can't be before snd_una, but it can be equal with snd_una.
Seems bellow change would be more specific,
    if (ack != tp->snd_una)
       tcp_snd_una_update(tp, ack);

> I do not see why another conditional has any gain here.
>
> You are trying to avoid very cheap operations :
>
>     u32 delta = ack - tp->snd_una;
>
>     tp->bytes_acked += delta;
>     tp->snd_una = ack;
>
> Maybe the real reason for this patch is not explained in the changelog ?

No additional reason. I just want to avoid these uneccessary operations.
Because I find that this uncessary operations always happen,
especially when head prediction fails and then the packet can't go to
fast path processing.

Thanks
Yafang

^ permalink raw reply

* SACK compression patch causing performance drop
From: Jean-Louis Dupond @ 2018-11-03 15:59 UTC (permalink / raw)
  To: netdev, edumazet

Hi All,

On recent kernels we noticed a way lower throughput to our SAN system 
than before.
While on pre 4.18 kernels we had 400-700MB/sec read speed, on 4.18+ we 
only had 70-120MB/sec.

The SAN is connected via iSCSI over a 10G network (ixgbe/X520 NICS if it 
matters).

After some debugging, I tried to bisect between 4.17 and 4.18 to see 
what commit caused the slowdown.
It showed that the addition of the SACK compression 
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e) 
was the cause.

And indeed, if I set net.ipv4.tcp_comp_sack_nr to 0 on 4.19 for example, 
the throughput is (almost) back to normal again.
So it seems like this change causes quite some performance issues.

Any ideas?

Thanks
Jean-Louis

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-03 15:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Aaron Lu, Saeed Mahameed, eric.dumazet@gmail.com,
	netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
	yoel@kviknet.dk, mgorman@techsingularity.net
In-Reply-To: <9f35c3f5-e865-54db-73bb-960ded60c1cc@itcare.pl>



W dniu 03.11.2018 o 16:23, Paweł Staszewski pisze:
>
>
> W dniu 03.11.2018 o 13:58, Jesper Dangaard Brouer pisze:
>> On Sat, 3 Nov 2018 01:16:08 +0100
>> Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>>
>>> W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze:
>>>>
>>>> W dniu 02.11.2018 o 15:20, Aaron Lu pisze:
>>>>> On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer 
>>>>> wrote:
>>>>>> On Fri, 2 Nov 2018 13:23:56 +0800
>>>>>> Aaron Lu <aaron.lu@intel.com> wrote:
>>>>>>> On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
>>>>>>>> On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
>>>>>>>>> On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
>>>>>>>>> wrote:
>>>>>>>>> ... ...
>> [...]
>>>>>> TL;DR: this is order-0 pages (code-walk trough proof below)
>>>>>>
>>>>>> To Aaron, the network stack *can* call __free_pages_ok() with 
>>>>>> order-0
>>>>>> pages, via:
>> [...]
>>>>>   I think here is a problem - order 0 pages are freed directly to 
>>>>> buddy,
>>>>> bypassing per-cpu-pages. This might be the reason lock contention
>>>>> appeared on free path. Can someone apply below diff and see if lock
>>>>> contention is gone?
>>>>>
>>>> Will test it tonight
>>> Patch applied
>>> perf report:
>>> https://ufile.io/sytfh
>>>
>>>
>>> But i need to wait also with more traffic currently cpu's are sleeping
>>>
>> Well, that would be the expected result, that the CPUs get more time to
>> sleep, if the lock contention is gone...
>>
>> What is the measured bandwidth now?
> 30 RX /30 TX Gbit/s
>
>>
>> Notice, you might still be limited by the PCIe bandwidth, but then your
>> CPUs might actually decide to sleep, as they are getting data fast
>> enough.
> Yes - i will replace network controller to two separate nic's in two 
> separate x16 pcie
> But after monday.
>
> But i dont think i hit pcie limit there - it looks like pcie x16 gen3 
> have 16GB/s RX and 16GB/s TX so bidirectional
>
Was thinking that maybee memory limit - but also there is 4 channel DDR4 
2666MHz - so total bandwidth for memory is bigger (48GB/s) than needed 
for 100Gbit ethernet





>
>>
>> [...]
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index e2ef1c17942f..65c0ae13215a 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
>>>>>    {
>>>>>        struct page *page = virt_to_head_page(addr);
>>>>>    -    if (unlikely(put_page_testzero(page)))
>>>>> -        __free_pages_ok(page, compound_order(page));
>>>>> +    if (unlikely(put_page_testzero(page))) {
>>>>> +        unsigned int order = compound_order(page);
>>>>> +
>>>>> +        if (order == 0)
>>>>> +            free_unref_page(page);
>>>>> +        else
>>>>> +            __free_pages_ok(page, order);
>>>>> +    }
>>
>>
>
>

^ permalink raw reply

* Re: [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Eric Dumazet @ 2018-11-04  0:40 UTC (permalink / raw)
  To: Yafang Shao, davem, edumazet; +Cc: netdev, linux-kernel
In-Reply-To: <1541264071-9905-1-git-send-email-laoar.shao@gmail.com>



On 11/03/2018 09:54 AM, Yafang Shao wrote:
> In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
> and under this condition we don't need to update the snd_una.
> 
> Furthermore, tcp_ack_update_window() is only called in slow path,
> so introducing this check won't affect the fast path processing.
> 
> By the way, '&' is a little faster than '-', so I replaced after() with
> "flag & FLAG_SND_UNA_ADVANCED".
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  net/ipv4/tcp_input.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 2868ef2..db5a6b7 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3376,7 +3376,8 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
>  		}
>  	}
>  
> -	tcp_snd_una_update(tp, ack);
> +	if (after(ack, tp->snd_una))
> +		tcp_snd_una_update(tp, ack);
>  

Adding this after() here is confusing, how ack could be before snd_una ?
That would be a serious bug.

I do not see why another conditional has any gain here.

You are trying to avoid very cheap operations :

    u32 delta = ack - tp->snd_una;

    tp->bytes_acked += delta;
    tp->snd_una = ack;

Maybe the real reason for this patch is not explained in the changelog ?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox