Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 0/2] dpaa2-eth: defer probe on object allocate
From: David Miller @ 2018-11-09  3:29 UTC (permalink / raw)
  To: ioana.ciornei; +Cc: netdev, ruxandra.radulescu
In-Reply-To: <20181108.192857.2194316658343454675.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Thu, 08 Nov 2018 19:28:57 -0800 (PST)

> From: Ioana Ciornei <ioana.ciornei@nxp.com>
> Date: Thu, 8 Nov 2018 13:17:46 +0000
> 
>> Allocatable objects on the fsl-mc bus may be probed by the fsl_mc_allocator
>> after the first attempts of other drivers to use them. Defer the probe when
>> this situation happens.
> 
> Series applied, thanks.

Whoops, I just saw Andrew Lunn's feedback.

I reverted your changes, please address his concerns and resubmit.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next 0/2] dpaa2-eth: defer probe on object allocate
From: David Miller @ 2018-11-09  3:28 UTC (permalink / raw)
  To: ioana.ciornei; +Cc: netdev, ruxandra.radulescu
In-Reply-To: <1541683054-22273-1-git-send-email-ioana.ciornei@nxp.com>

From: Ioana Ciornei <ioana.ciornei@nxp.com>
Date: Thu, 8 Nov 2018 13:17:46 +0000

> Allocatable objects on the fsl-mc bus may be probed by the fsl_mc_allocator
> after the first attempts of other drivers to use them. Defer the probe when
> this situation happens.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] dpaa2-eth: Introduce TX congestion management
From: David Miller @ 2018-11-09  3:26 UTC (permalink / raw)
  To: ruxandra.radulescu; +Cc: netdev, ioana.ciornei
In-Reply-To: <AM0PR04MB4994571246CB1708B79A986594C50@AM0PR04MB4994.eurprd04.prod.outlook.com>

From: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
Date: Thu, 8 Nov 2018 20:21:15 +0000

> Today I tried to further coalesce the confirmation frames such that I call
> netdev_tx_completed_queue() only at the end of the NAPI poll, once for each
> confirmation queue that was serviced during that NAPI.

That sounds like exactly what you should do given your design description.

> I need to do more testing, but so far it performs *almost* on par
> with the non-BQL driver version. But it does complicate the fastpath
> code and feels somewhat unnatural.

Well, this is exactly what should happen with BQL and as a result you will
get much better TCP queue utilization and avoidance of bufferbloat.

^ permalink raw reply

* Should the bridge learn from frames with link local destination MAC address?
From: Andrew Lunn @ 2018-11-09  3:24 UTC (permalink / raw)
  To: Roopa Prabhu, Nikolay Aleksandrov; +Cc: netdev, Florian Fainelli

Hi Roopa, Nikolay

br_handle_frame() looks out for frames with a destination MAC
addresses with is Ethernet link local, those which fit
01-80-C2-00-00-XX. It does not normally forward these, but it will
deliver them locally.

Should the bridge perform learning on such frames?

I've got a setup with two bridges connected together with multiple
links between them. STP has done its thing, and blocked one of the
ports to solve the loop.

host0                   host1
+-----------------+     +-----------------+
| lan0 forwarding |-----| lan0 forwarding |
|                 |	|                 |
| lan1 forwarding |-----| lan1 blocked    |
+-----------------+	+-----------------+

I have LLDP running on both system, and they are sending out periodic
frames on each port.

Now, lan0 and lan1 on host1 use the same MAC address.  So i see the
MAC address bouncing between ports because of the LLDP packets.

# bridge monitor
00:26:55:d2:27:a8 dev lan1 master br0 
00:26:55:d2:27:a8 dev lan0 master br0 
00:26:55:d2:27:a8 dev lan1 master br0 
00:26:55:d2:27:a8 dev lan0 master br0 
00:26:55:d2:27:a8 dev lan1 master br0 

This then results in normal traffic from host0 to host1 being sent to
the blocked port for some of the time.

LLDP is using 01-80-C2-00-00-0E, a link local MAC address. If the
bridge did not learn on such frames, i think this setup would
work. The bridge would learn from ARP, IP etc, coming from the
forwarding port of host1, and the blocked port would be ignored.

I've tried a similar setup with a hardware switch, Marvell 6352. It
never seems to learn from such frames.

Thanks
	Andrew

^ permalink raw reply

* Re: [PATCH] net: Add trace events for all receive exit points
From: Genevieve Bastien @ 2018-11-09  2:56 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: David S. Miller, netdev, rostedt, Ingo Molnar
In-Reply-To: <1057796879.2336.1541708744442.JavaMail.zimbra@efficios.com>

On 2018-11-08 03:25 PM, Mathieu Desnoyers wrote:
> ----- On Nov 8, 2018, at 2:56 PM, Geneviève Bastien gbastien@versatic.net wrote:
>
>> Trace events are already present for the receive entry points, to indicate
>> how the reception entered the stack.
>>
>> This patch adds the corresponding exit trace events that will bound the
>> reception such that all events occurring between the entry and the exit
>> can be considered as part of the reception context. This greatly helps
>> for dependency and root cause analyses.
>>
>> Without this, it is impossible to determine whether a sched_wakeup
>> event following a netif_receive_skb event is the result of the packet
>> reception or a simple coincidence after further processing by the
>> thread.
> As a clarification: it is indeed not possible with tracepoint instrumentation,
> which I think is your point here. It might be possible to do so by using other
> mechanisms like kretprobes, but considering that the "entry" point was deemed
> important enough to have a tracepoint, it would be good to add the matching exit
> events.
>
> Being able to link the packet reception entry/exit with wakeups is key to
> allow tools to perform automated network stack latency analysis, so I think
> the use case justifies adding the missing "exit" events.
Thanks for the precision. Indeed so far, kretprobes have been used as a
workaround, but it is harder to setup and has quite more overhead. After
seeing those "entry" points, I thought corresponding "exits" would be
the best candidate to encapsulate the "network reception" context.

For an example of dependency and critical path analysis of 'wget', see
this screenshot here, arranged from Trace Compass:
https://framapic.org/DgSdNwiuymib/PDPdHJBGCJGR.png

The top view shows the dependency analysis without the "exit" events:
the wakeup is not associated with the packet reception, so the source is
considered to be the process that was running at that time, an IRQ
thread. So wget was blocked by the iwlwifi IRQ thread, who was himself
simply woken up by an hardware interrupt, while in fact, wget was
waiting for the network.

The bottom view shows the dependency analysis with the "exit" events:
because the wakeup happens between the "entry" and "exit", we know the
packet reception is the source of the wakeup and if we happen to know
where that packet came from, we can follow the dependency to the packet
source. So wget was blocked waiting for a network request to another
machine which sent back the reply and we clearly see the time spent on
the other machine and the network latency during this blockage.

I hope this kind of possibilities for analyses justify adding those
tracepoints.

Thanks,
Geneviève


>
> It might be great if you can provide a glimpse of the wakeup dependency chain
> analysis prototypes you have created, which rely on this new event, in order
> to justify adding it.
>
> Thanks,
>
> Mathieu
>
>> Signed-off-by: Geneviève Bastien <gbastien@versatic.net>
>> CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> CC: Steven Rostedt <rostedt@goodmis.org>
>> CC: Ingo Molnar <mingo@redhat.com>
>> CC: David S. Miller <davem@davemloft.net>
>> ---
>> include/trace/events/net.h | 59 ++++++++++++++++++++++++++++++++++++++
>> net/core/dev.c             | 30 ++++++++++++++++---
>> 2 files changed, 85 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/trace/events/net.h b/include/trace/events/net.h
>> index 00aa72ce0e7c..318307511018 100644
>> --- a/include/trace/events/net.h
>> +++ b/include/trace/events/net.h
>> @@ -117,6 +117,23 @@ DECLARE_EVENT_CLASS(net_dev_template,
>> 		__get_str(name), __entry->skbaddr, __entry->len)
>> )
>>
>> +DECLARE_EVENT_CLASS(net_dev_template_simple,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(void *,	skbaddr)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->skbaddr = skb;
>> +	),
>> +
>> +	TP_printk("skbaddr=%p", __entry->skbaddr)
>> +)
>> +
>> DEFINE_EVENT(net_dev_template, net_dev_queue,
>>
>> 	TP_PROTO(struct sk_buff *skb),
>> @@ -244,6 +261,48 @@ DEFINE_EVENT(net_dev_rx_verbose_template,
>> netif_rx_ni_entry,
>> 	TP_ARGS(skb)
>> );
>>
>> +DEFINE_EVENT(net_dev_template_simple, napi_gro_frags_exit,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +);
>> +
>> +DEFINE_EVENT(net_dev_template_simple, napi_gro_receive_exit,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +);
>> +
>> +DEFINE_EVENT(net_dev_template_simple, netif_receive_skb_exit,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +);
>> +
>> +DEFINE_EVENT(net_dev_template_simple, netif_receive_skb_list_exit,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +);
>> +
>> +DEFINE_EVENT(net_dev_template_simple, netif_rx_exit,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +);
>> +
>> +DEFINE_EVENT(net_dev_template_simple, netif_rx_ni_exit,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +);
>> +
>> #endif /* _TRACE_NET_H */
>>
>> /* This part must be outside protection */
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 0ffcbdd55fa9..e670ca27e829 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4520,9 +4520,14 @@ static int netif_rx_internal(struct sk_buff *skb)
>>
>> int netif_rx(struct sk_buff *skb)
>> {
>> +	int ret;
>> +
>> 	trace_netif_rx_entry(skb);
>>
>> -	return netif_rx_internal(skb);
>> +	ret = netif_rx_internal(skb);
>> +	trace_netif_rx_exit(skb);
>> +
>> +	return ret;
>> }
>> EXPORT_SYMBOL(netif_rx);
>>
>> @@ -4537,6 +4542,7 @@ int netif_rx_ni(struct sk_buff *skb)
>> 	if (local_softirq_pending())
>> 		do_softirq();
>> 	preempt_enable();
>> +	trace_netif_rx_ni_exit(skb);
>>
>> 	return err;
>> }
>> @@ -5222,9 +5228,14 @@ static void netif_receive_skb_list_internal(struct
>> list_head *head)
>>  */
>> int netif_receive_skb(struct sk_buff *skb)
>> {
>> +	int ret;
>> +
>> 	trace_netif_receive_skb_entry(skb);
>>
>> -	return netif_receive_skb_internal(skb);
>> +	ret = netif_receive_skb_internal(skb);
>> +	trace_netif_receive_skb_exit(skb);
>> +
>> +	return ret;
>> }
>> EXPORT_SYMBOL(netif_receive_skb);
>>
>> @@ -5247,6 +5258,8 @@ void netif_receive_skb_list(struct list_head *head)
>> 	list_for_each_entry(skb, head, list)
>> 		trace_netif_receive_skb_list_entry(skb);
>> 	netif_receive_skb_list_internal(head);
>> +	list_for_each_entry(skb, head, list)
>> +		trace_netif_receive_skb_list_exit(skb);
>> }
>> EXPORT_SYMBOL(netif_receive_skb_list);
>>
>> @@ -5634,12 +5647,17 @@ static gro_result_t napi_skb_finish(gro_result_t ret,
>> struct sk_buff *skb)
>>
>> gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
>> {
>> +	gro_result_t ret;
>> +
>> 	skb_mark_napi_id(skb, napi);
>> 	trace_napi_gro_receive_entry(skb);
>>
>> 	skb_gro_reset_offset(skb);
>>
>> -	return napi_skb_finish(dev_gro_receive(napi, skb), skb);
>> +	ret = napi_skb_finish(dev_gro_receive(napi, skb), skb);
>> +	trace_napi_gro_receive_exit(skb);
>> +
>> +	return ret;
>> }
>> EXPORT_SYMBOL(napi_gro_receive);
>>
>> @@ -5753,6 +5771,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct
>> *napi)
>>
>> gro_result_t napi_gro_frags(struct napi_struct *napi)
>> {
>> +	gro_result_t ret;
>> 	struct sk_buff *skb = napi_frags_skb(napi);
>>
>> 	if (!skb)
>> @@ -5760,7 +5779,10 @@ gro_result_t napi_gro_frags(struct napi_struct *napi)
>>
>> 	trace_napi_gro_frags_entry(skb);
>>
>> -	return napi_frags_finish(napi, skb, dev_gro_receive(napi, skb));
>> +	ret = napi_frags_finish(napi, skb, dev_gro_receive(napi, skb));
>> +	trace_napi_gro_frags_exit(skb);
>> +
>> +	return ret;
>> }
>> EXPORT_SYMBOL(napi_gro_frags);
>>
>> --
>> 2.19.1

^ permalink raw reply

* Re: [PATCH ipsec-next 00/11] xfrm: policy: add inexact policy search tree
From: David Miller @ 2018-11-09  3:00 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <20181107220041.26205-1-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Wed,  7 Nov 2018 23:00:30 +0100

> This series attempts to improve xfrm policy lookup performance when
> a lot of (several hundred or even thousands) inexact policies exist
> on a system.
> 
> On insert, a policy is either placed in hash table (all direct (/32 for
> ipv4, /128 policies, or all policies matching a user-configured threshold).
> All other policies get inserted into inexact list as per priority.
> 
> Lookup then scans inexact list for first matching entry.
> 
> This series instead makes it so that inexact policy is added to exactly
> one of four different search list classes.
> 
> 1. "Any:Any" list, containing policies where both saddr and daddr are
>    wildcards or have very coarse prefixes, e.g. 10.0.0.0/8 and the like.
> 2. "saddr:any" list, containing policies with a fixed saddr/prefixlen,
>    but without destination restrictions.
>    These lists are stored in rbtree nodes; each node contains those
>    policies matching saddr/prefixlen.
> 3. "Any:daddr" list. Similar to 2), except for policies where only the
>    destinations are specified.
> 4. "saddr:daddr" lists, containing policies that match the given
>    source/destination network.
> 
>    The root of the saddr/daddr tree is stored in the nodes of the
>    'daddr' tree.
...
> Comments or questions welcome.

Acked-by: David S. Miller <davem@davemloft.net>

This looks really great.  Nice work Florian.

^ permalink raw reply

* Re: net: ipv4: tcp_vegas
From: Suraj Singh @ 2018-11-09 12:37 UTC (permalink / raw)
  To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, suraj1998
In-Reply-To: <1541665792-5888-1-git-send-email-suraj1998@gmail.com>

Hi,

I'm extremely sorry if I have seemed unresponsive to you but it took me a 
while to figure out how to respond to threads using the corresponding mail
-id. I think the first two comments that you wrote on my initial patch with 
message-id: 1541425985-31869-1-git-send-email-suraj1998@gmail.com were being
redirected to my spam folder and since I didn't have mutt configured
then, I didn't see them, 


The reason I've been prefixing all the subject of my mails with "staging: "
is because I was following Greg Kroah-Hartman's YouTube video on how to 
submit your first kernel patch. I didn't realise the significance of using
it and didn't take the time out to understand what it really meant and that
is my bad. I've got clarifications for the same in comments to the patch with 
message-id: 1541670377-17483-1-git-send-email-suraj1998@gmail.com

I think what I will do is first figure out how to use mutt effectively before
sending subsequent patches and emails because all the emails that I have sent 
(including this one), I've done so by manually writing git send-mail commands 
in response to the corresponding message-id. 

Other than unwarranted "staging: " in the subject, are there any other changes
that need to be made before I submit [v3] of the second tcp_westwood patch
(1541670377-17483-1-git-send-email-suraj1998@gmail.com)? 

Regards,
Suraj

^ permalink raw reply

* [PATCH net-next] tcp_bbr: update comments to reflect pacing_margin_percent
From: Neal Cardwell @ 2018-11-09  2:54 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Soheil Hassas Yeganeh,
	Eric Dumazet

Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new
earliest departure time model"), the TCP BBR code switched to a new
approach of using an explicit bbr_pacing_margin_percent for shaving a
pacing rate "haircut", rather than the previous implict
approach. Update an old comment to reflect the new approach.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_bbr.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 9277abdd822a0..0f497fc49c3fe 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -128,7 +128,12 @@ static const u32 bbr_probe_rtt_mode_ms = 200;
 /* Skip TSO below the following bandwidth (bits/sec): */
 static const int bbr_min_tso_rate = 1200000;
 
-/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. */
+/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck.
+ * In order to help drive the network toward lower queues and low latency while
+ * maintaining high utilization, the average pacing rate aims to be slightly
+ * lower than the estimated bandwidth. This is an important aspect of the
+ * design.
+ */
 static const int bbr_pacing_margin_percent = 1;
 
 /* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain
@@ -247,13 +252,7 @@ static void bbr_init_pacing_rate_from_rtt(struct sock *sk)
 	sk->sk_pacing_rate = bbr_bw_to_pacing_rate(sk, bw, bbr_high_gain);
 }
 
-/* Pace using current bw estimate and a gain factor. In order to help drive the
- * network toward lower queues while maintaining high utilization and low
- * latency, the average pacing rate aims to be slightly (~1%) lower than the
- * estimated bandwidth. This is an important aspect of the design. In this
- * implementation this slightly lower pacing rate is achieved implicitly by not
- * including link-layer headers in the packet size used for the pacing rate.
- */
+/* Pace using current bw estimate and a gain factor. */
 static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-- 
2.19.1.930.g4563a0d9d0-goog

^ permalink raw reply related

* Re: [PATCH v2 net] inet: frags: better deal with smp races
From: David Miller @ 2018-11-09  2:41 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, soukjin.bae
In-Reply-To: <20181109013427.218098-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Thu,  8 Nov 2018 17:34:27 -0800

> Multiple cpus might attempt to insert a new fragment in rhashtable,
> if for example RPS is buggy, as reported by 배석진 in
> https://patchwork.ozlabs.org/patch/994601/
> 
> We use rhashtable_lookup_get_insert_key() instead of
> rhashtable_insert_fast() to let cpus losing the race
> free their own inet_frag_queue and use the one that
> was inserted by another cpu.
> 
> Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: 배석진 <soukjin.bae@samsung.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: net: ipv4: tcp_westwood
From: Suraj Singh @ 2018-11-09 12:18 UTC (permalink / raw)
  To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, suraj1998, LinoSanfilippo
In-Reply-To: <1541670377-17483-1-git-send-email-suraj1998@gmail.com>

Hi Lino,

Thank you for clarifying the doubts I had and for the suggestion on where to
start contributing. Previously, and in this patch  I hadn't configured mutt 
and was manually using git send-mail to respond to each thread using the 
corresponding mail-id. That's why I wasn't sure why even in this patch, 
"staging: " had appeared even after editing the commit message. Now that I've 
got the email client setup, I'll fix that "staging: " issue and send a [v3] 
patch.

^ permalink raw reply

* RE:(2) (2) (2) (2) (2) [Kernel][NET] Bug report on packet defragmenting
From: 배석진 @ 2018-11-09  2:24 UTC (permalink / raw)
  To: Eric Dumazet, netdev@vger.kernel.org
In-Reply-To: <716b4f59-eac1-ba74-c39e-b9918b25b073@gmail.com>

>On 11/08/2018 04:42 PM, 배석진 wrote:
>> Thanks for testing.
>>>
>>> This is not a pristine net-next tree, this dump seems unrelated to the patch ?
>> 
>> 
>> yes, looks like that.
>> but only when using your patch, panic came. even right after packet recieving..
>> without that, there's no problem except defrag issue. it's odd.. :p
>> I couldn't more debugging since have other problems.
> 
> 
> You might need to backport some fixes (check all changes to lib/rhashtable.c )
 
I try to backport the updates to my space.
but.. there are too many related files about lib/rhashtable.c ..
I give up ;(

thank you for your help!

^ permalink raw reply

* Re: (2) (2) (2) (2) [Kernel][NET] Bug report on packet defragmenting
From: Eric Dumazet @ 2018-11-09  1:58 UTC (permalink / raw)
  To: soukjin.bae, Eric Dumazet, netdev@vger.kernel.org
In-Reply-To: <1181390155.1787604.1541724136242.JavaMail.jboss@ep1ml102>



On 11/08/2018 04:42 PM, 배석진 wrote:
>> Thanks for testing.
>>
>> This is not a pristine net-next tree, this dump seems unrelated to the patch ?
> 
> 
> yes, looks like that.
> but only when using your patch, panic came. even right after packet recieving..
> without that, there's no problem except defrag issue. it's odd.. :p
> I couldn't more debugging since have other problems.


You might need to backport some fixes (check all changes to lib/rhashtable.c )

^ permalink raw reply

* Re: [PATCH] [stable, netdev 4.4+] lan78xx: make sure RX_ADDRL & RX_ADDRH regs are always up to date
From: Paolo Pisati @ 2018-11-09 11:31 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Paolo Pisati, Woojung Huh, Microchip Linux Driver Support, netdev,
	stable, linux-usb, linux-kernel
In-Reply-To: <20181108154904.GD8097@sasha-vm>

On Thu, Nov 08, 2018 at 10:49:04AM -0500, Sasha Levin wrote:
> 
> Can you confirm it actually works on 4.4?

Yes, built and tested on 4.4.y:

Tested-by: Paolo Pisati <p.pisati@gmail.com>
-- 
bye,
p.

^ permalink raw reply

* [PATCH v2 net] inet: frags: better deal with smp races
From: Eric Dumazet @ 2018-11-09  1:34 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, 배석진

Multiple cpus might attempt to insert a new fragment in rhashtable,
if for example RPS is buggy, as reported by 배석진 in
https://patchwork.ozlabs.org/patch/994601/

We use rhashtable_lookup_get_insert_key() instead of
rhashtable_insert_fast() to let cpus losing the race
free their own inet_frag_queue and use the one that
was inserted by another cpu.

Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: 배석진 <soukjin.bae@samsung.com>
---
 net/ipv4/inet_fragment.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index bcb11f3a27c0c34115af05034a5a20f57842eb0a..760a9e52e02b91b36af323c92f7027e150858f88 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -178,21 +178,22 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 }
 
 static struct inet_frag_queue *inet_frag_create(struct netns_frags *nf,
-						void *arg)
+						void *arg,
+						struct inet_frag_queue **prev)
 {
 	struct inet_frags *f = nf->f;
 	struct inet_frag_queue *q;
-	int err;
 
 	q = inet_frag_alloc(nf, f, arg);
-	if (!q)
+	if (!q) {
+		*prev = ERR_PTR(-ENOMEM);
 		return NULL;
-
+	}
 	mod_timer(&q->timer, jiffies + nf->timeout);
 
-	err = rhashtable_insert_fast(&nf->rhashtable, &q->node,
-				     f->rhash_params);
-	if (err < 0) {
+	*prev = rhashtable_lookup_get_insert_key(&nf->rhashtable, &q->key,
+						 &q->node, f->rhash_params);
+	if (*prev) {
 		q->flags |= INET_FRAG_COMPLETE;
 		inet_frag_kill(q);
 		inet_frag_destroy(q);
@@ -204,22 +205,22 @@ static struct inet_frag_queue *inet_frag_create(struct netns_frags *nf,
 /* TODO : call from rcu_read_lock() and no longer use refcount_inc_not_zero() */
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key)
 {
-	struct inet_frag_queue *fq;
+	struct inet_frag_queue *fq = NULL, *prev;
 
 	if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh)
 		return NULL;
 
 	rcu_read_lock();
 
-	fq = rhashtable_lookup(&nf->rhashtable, key, nf->f->rhash_params);
-	if (fq) {
+	prev = rhashtable_lookup(&nf->rhashtable, key, nf->f->rhash_params);
+	if (!prev)
+		fq = inet_frag_create(nf, key, &prev);
+	if (prev && !IS_ERR(prev)) {
+		fq = prev;
 		if (!refcount_inc_not_zero(&fq->refcnt))
 			fq = NULL;
-		rcu_read_unlock();
-		return fq;
 	}
 	rcu_read_unlock();
-
-	return inet_frag_create(nf, key);
+	return fq;
 }
 EXPORT_SYMBOL(inet_frag_find);
-- 
2.19.1.930.g4563a0d9d0-goog

^ permalink raw reply related

* Re: [PATCH net] inet: frags: better deal with smp races
From: Eric Dumazet @ 2018-11-09  1:31 UTC (permalink / raw)
  To: David Miller, edumazet; +Cc: netdev, eric.dumazet, soukjin.bae
In-Reply-To: <20181108.170235.1217371124928853665.davem@davemloft.net>



On 11/08/2018 05:02 PM, David Miller wrote:

> GCC is unwilling to see that all paths leading to that final return
> statement do in fact set 'fq' one way or another:
> 
> net/ipv4/inet_fragment.c: In function ‘inet_frag_find’:
> net/ipv4/inet_fragment.c:224:9: warning: ‘fq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> 
> This is with:
> 
> gcc (GCC) 8.2.1 20181011 (Red Hat 8.2.1-4)
> 
> Please adjust your patch so that the warning is eliminated.
> 

Interesting, I will init *fq to NULL in v2, thanks.

^ permalink raw reply

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Yonghong Song @ 2018-11-09  1:26 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Network Development, Kernel Team
In-Reply-To: <ca4cb188-744d-5274-b12a-59fa3efc68f4@solarflare.com>



On 11/8/18 2:56 PM, Edward Cree wrote:
> On 08/11/18 19:42, Alexei Starovoitov wrote:
>> same link let's continue at 1pm PST.
> So, one thing we didn't really get onto was maps, and you mentioned that it
>   wasn't really clear what I was proposing there.
> What I have in mind comes in two parts:
> 1) map type.  A new BTF_KIND_MAP with metadata 'key_type', 'value_type'
>   (both are type_ids referencing other BTF type records), describing the
>   type "map from key_type to value_type".
> 2) record in the 'instances' table.  This would have a name_off (the
>   name of the map), a type_id (pointing at a BTF_KIND_MAP in the 'types'
>   table), and potentially also some indication of what symbol (from
>   section 'maps') refers to this map.  This is pretty much the exact
>   same metadata that a function in the 'instances' table has, the only
>   differences being
>   (a) function's type_id points at a BTF_KIND_FUNC record
>   (b) function's symbol indication refers from .text section
>   (c) in future functions may be nested inside other functions, whereas
>   AIUI a map can't live inside a function.  (But a variable, which is
>   the other thing that would want to go in an 'instances' table, can.)
> So the 'instances' table record structure looks like
> 
> struct btf_instance {
>      __u32 type_id; /* Type of object declared.  An index into type section */
>      __u32 name_off; /* Name of object.  An offset into string section */
>      __u32 parent; /* Containing object if any (else 0).  An index into instance section */
> };
> 
> and we extend the BTF header:
> 
> struct btf_header {
>      __u16   magic;
>      __u8    version;
>      __u8    flags;
>      __u32   hdr_len;
> 
>      /* All offsets are in bytes relative to the end of this header */
>      __u32   type_off;      /* offset of type section       */
>      __u32   type_len;      /* length of type section       */
>      __u32   str_off;       /* offset of string section     */
>      __u32   str_len;       /* length of string section     */
>      __u32   inst_off;      /* offset of instance section   */
>      __u32   inst_len;      /* length of instance section   */
> };
> 
> Then in the .BTF.ext section, we have both
> 
> struct bpf_func_info {
>      __u32 prog_symbol; /* Index of symbol giving address of subprog */
>      __u32 inst_id; /* Index into instance section */
> }
> 
> struct bpf_map_info {
> {
>      __u32 map_symbol; /* Index of symbol creating this map */
>      __u32 inst_id; /* Index into instance section */
> }
> 
> (either living in different subsections, or in a single table with
>   the addition of a kind field, or in a single table relying on the
>   ultimately referenced type to distinguish funcs from maps).
> 
> Note that the name (in btf_instance) of a map or function need not
>   match the name of the corresponding symbol; we use the .BTF.ext
>   section to tie together btf_instance IDs and symbol IDs.  Then in
>   the case of functions (subprogs), the prog_symbol can be looked
>   up in the ELF symbol table to find the address (== insn_offset)
>   of the subprog, as well as the section containing it (since that
>   might not be .text).  Similarly in the case of maps the BTF info
>   about the map is connected with the info in the maps section.
> 
> Now when the loader has munged this, what it passes to the kernel
>   might not have map_symbol, but instead map_fd.  Instead of
>   prog_symbol it will have whatever identifies the subprog in the
>   blob of stuff it feeds to the kernel (so probably insn_offset).
> 
> All this would of course require a bit more compiler support than
>   the current BPF_ANNOTATE_KV_PAIR, since that just causes the
>   existing BTF machinery to declare a specially constructed struct
>   type.  At the C level you could still have BPF_ANNOTATE_KV_PAIR
>   and the '____bpf_map_foo' name, but then the compiler would
>   recognise that and convert it into an instance record by looking
>   up the name 'foo' in its "maps" section.  That way the special
>   ____bpf_map_* handling (which ties map names to symbol names,

Compiler in general does not do transformation based on variable
or struct type names by default, so this probably should stay
in the loader.

>   also) would be entirely compiler-internal and not 'leak out' into
>   the definition of the format.  Frontends for other languages
>   which do possess a native map type (e.g. Python dict) might have
>   other ways of indicating the key/value type of a map at source
>   level (e.g. PEP 484) and could directly generate the appropriate
>   BTF_KIND_MAP and bpf_map_info records rather than (as they would
>   with the current design) having to encode the information as a
>   struct ____bpf_map_foo type-definition.

You mean a python application can generate bpf byte codes and
BTFs (include map types)? That will be different from the C/LLVM
use case. The python app. probably will be the loader as well.

One option is to pass BPF specific flag like 
"-map-type-prefix="___bpf_map_" and LLVM will generate BTF_KIND_MAP type
for any structure with name "___bpf_map_<...>". But if this is
the case, user can just search the type table for struct name
"___bpf_map_<...>" and llvm does not need to do anything. Note that
once user passes "-map-type-prefix="___bpf_map_" to llvm, the
definition of the format is already leaked.

So I feel that this probably belongs to the loader.

> 
> 
> While I realise the desire to concentrate on one topic at once, I
>   think this question of maps should be discussed in tomorrow's
>   call, since it is when we start having other kinds of instances
>   besides functions that the advantages of my design become
>   apparent, unifying the process of 'declaration' of functions,
>   maps, and (eventually) variables while separating them all from
>   the process of 'definition' of the types of all three.
> 
> Thank you for your continued patience with me.
> -Ed
> 

^ permalink raw reply

* Re: [PATCH net-next 0/8] s390/qeth: updates 2018-11-08
From: David Miller @ 2018-11-09  1:23 UTC (permalink / raw)
  To: jwi; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20181108140622.35244-1-jwi@linux.ibm.com>

From: Julian Wiedmann <jwi@linux.ibm.com>
Date: Thu,  8 Nov 2018 15:06:14 +0100

> please apply the following qeth patches to net-next.
> 
> The first patch allows one more device type to query the FW for a MAC address,
> the others are all basically just removal of duplicated or unused code.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH][net-next] openvswitch: remove BUG_ON from get_dpdev
From: David Miller @ 2018-11-09  1:15 UTC (permalink / raw)
  To: lirongqing; +Cc: netdev, pshelar, dev
In-Reply-To: <1541680820-20158-1-git-send-email-lirongqing@baidu.com>

From: Li RongQing <lirongqing@baidu.com>
Date: Thu,  8 Nov 2018 20:40:20 +0800

> if local is NULL pointer, and the following access of local's
> dev will trigger panic, which is same as BUG_ON
> 
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

Applied.

^ permalink raw reply

* Re: [Xen-devel] [PATCH] xen/netfront: remove unnecessary wmb
From: Wei Liu @ 2018-11-09 10:52 UTC (permalink / raw)
  To: Jacob Wen; +Cc: netdev, jgross, xen-devel, linux-kernel, Wei Liu
In-Reply-To: <20181109065359.14900-1-jian.w.wen@oracle.com>

On Fri, Nov 09, 2018 at 02:53:59PM +0800, Jacob Wen wrote:
> RING_PUSH_REQUESTS_AND_CHECK_NOTIFY is already able to make sure backend sees
> requests before req_prod is updated.
> 
> Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>

Reviewed-by: Wei Liu <wei.liu2@citrix.com>

^ permalink raw reply

* Re: [PATCH net-next v2 00/11] ICMP error handling for UDP tunnels
From: David Miller @ 2018-11-09  1:13 UTC (permalink / raw)
  To: sbrivio; +Cc: sd, lucien.xin, stephen, jbenc, dsahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu,  8 Nov 2018 12:19:13 +0100

> This series introduces ICMP error handling for UDP tunnels and
> encapsulations and related selftests. We need to handle ICMP errors to
> support PMTU discovery and route redirection -- this support is entirely
> missing right now:
> 
> - patch 1/11 adds a socket lookup for UDP tunnels that use, by design,
>   the same destination port on both endpoints -- i.e. VXLAN and GENEVE
> - patches 2/11 to 7/11 are specific to VxLAN and GENEVE
> - patches 8/11 and 9/11 add infrastructure for lookup of encapsulations
>   where sent packets cannot be matched via receiving socket lookup, i.e.
>   FoU and GUE
> - patches 10/11 and 11/11 are specific to FoU and GUE
> 
> v2: changes are listed in the single patches

Series applied, thanks Stefano.

^ permalink raw reply

* Re: [PATCH net-next] cxgb4: Add new T6 PCI device ids 0x608a
From: David Miller @ 2018-11-09  1:07 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, indranil, dt
In-Reply-To: <1541667067-31387-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Thu,  8 Nov 2018 14:21:07 +0530

> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied.

^ permalink raw reply

* Re: [PATCH][net-next][v2] net/ipv6: compute anycast address hash only if dev is null
From: David Miller @ 2018-11-09  1:04 UTC (permalink / raw)
  To: lirongqing; +Cc: netdev
In-Reply-To: <1541660287-8728-1-git-send-email-lirongqing@baidu.com>

From: Li RongQing <lirongqing@baidu.com>
Date: Thu,  8 Nov 2018 14:58:07 +0800

> avoid to compute the hash value if dev is not null, since
> hash value is not used
> 
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] inet: frags: better deal with smp races
From: David Miller @ 2018-11-09  1:02 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, soukjin.bae
In-Reply-To: <20181108061053.137720-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Wed,  7 Nov 2018 22:10:53 -0800

> @@ -204,22 +205,22 @@ static struct inet_frag_queue *inet_frag_create(struct netns_frags *nf,
>  /* TODO : call from rcu_read_lock() and no longer use refcount_inc_not_zero() */
>  struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key)
>  {
> -	struct inet_frag_queue *fq;
> +	struct inet_frag_queue *fq, *prev;
>  
>  	if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh)
>  		return NULL;
>  
>  	rcu_read_lock();
>  
> -	fq = rhashtable_lookup(&nf->rhashtable, key, nf->f->rhash_params);
> -	if (fq) {
> +	prev = rhashtable_lookup(&nf->rhashtable, key, nf->f->rhash_params);
> +	if (!prev)
> +		fq = inet_frag_create(nf, key, &prev);
> +	if (prev && !IS_ERR(prev)) {
> +		fq = prev;
>  		if (!refcount_inc_not_zero(&fq->refcnt))
>  			fq = NULL;
> -		rcu_read_unlock();
> -		return fq;
>  	}
>  	rcu_read_unlock();
> -
> -	return inet_frag_create(nf, key);
> +	return fq;

GCC is unwilling to see that all paths leading to that final return
statement do in fact set 'fq' one way or another:

net/ipv4/inet_fragment.c: In function ‘inet_frag_find’:
net/ipv4/inet_fragment.c:224:9: warning: ‘fq’ may be used uninitialized in this function [-Wmaybe-uninitialized]

This is with:

gcc (GCC) 8.2.1 20181011 (Red Hat 8.2.1-4)

Please adjust your patch so that the warning is eliminated.

Thanks.

^ permalink raw reply

* Re: [PATCH v2] kselftests/bpf: use ping6 as the default ipv6 ping binary when it exists
From: Daniel Borkmann @ 2018-11-09 10:23 UTC (permalink / raw)
  To: Li Zhijian, Song Liu, shuah, netdev, linux-kselftest
  Cc: linux-kernel, ast, Philip Li
In-Reply-To: <1541408268-11256-1-git-send-email-lizhijian@cn.fujitsu.com>

On 11/05/2018 09:57 AM, Li Zhijian wrote:
> At commit deee2cae27d1 ("kselftests/bpf: use ping6 as the default ipv6 ping
> binary if it exists"), it fixed similar issues for shell script, but it
> missed a same issue in the C code.
> 
> Fixes: 371e4fcc9d96 ("selftests/bpf: cgroup local storage-based network counters")
> CC: Philip Li <philip.li@intel.com>
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>

Applied to bpf, thanks Li!

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: David Ahern @ 2018-11-09  0:42 UTC (permalink / raw)
  To: Paweł Staszewski, Jesper Dangaard Brouer; +Cc: netdev, Yoel Caspersen
In-Reply-To: <8cb2630e-e7fe-cd44-7798-070f2e6d348a@itcare.pl>

On 11/8/18 5:40 PM, Paweł Staszewski wrote:
> 
> 
> W dniu 08.11.2018 o 17:32, David Ahern pisze:
>> On 11/8/18 9:27 AM, Paweł Staszewski wrote:
>>>>> What hardware is this?
>>>>>
>>> mellanox connectx 4
>>> ethtool -i enp175s0f0
>>> driver: mlx5_core
>>> version: 5.0-0
>>> firmware-version: 12.21.1000 (SM_2001000001033)
>>> expansion-rom-version:
>>> bus-info: 0000:af:00.0
>>> supports-statistics: yes
>>> supports-test: yes
>>> supports-eeprom-access: no
>>> supports-register-dump: no
>>> supports-priv-flags: yes
>>>
>>> ethtool -i enp175s0f1
>>> driver: mlx5_core
>>> version: 5.0-0
>>> firmware-version: 12.21.1000 (SM_2001000001033)
>>> expansion-rom-version:
>>> bus-info: 0000:af:00.1
>>> supports-statistics: yes
>>> supports-test: yes
>>> supports-eeprom-access: no
>>> supports-register-dump: no
>>> supports-priv-flags: yes
>>>
>>>>> Start with:
>>>>>
>>>>> echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
>>>>> cat /sys/kernel/debug/tracing/trace_pipe
>>>>   cat /sys/kernel/debug/tracing/trace_pipe
>>>>           <idle>-0     [045] ..s. 68469.467752: xdp_devmap_xmit:
>>>> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1
>>>> from_ifindex=4 to_ifindex=5 err=-6
>> FIB lookup is good, the redirect is happening, but the mlx5 driver does
>> not like it.
>>
>> I think the -6 is coming from the mlx5 driver and the packet is getting
>> dropped. Perhaps this check in mlx5e_xdp_xmit:
>>
>>         if (unlikely(sq_num >= priv->channels.num))
>>                  return -ENXIO;
> I removed that part and recompiled - but after running now xdp_fwd i
> have kernel pamic :)

Jesper or one of the Mellanox folks needs to respond about the config
needed to run XDP with this NIC. I don't have a 40G or 100G card to play
with.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox