* [Patch net-next] net_sched: call qdisc_reset() with qdisc lock
@ 2017-12-22  0:03 Cong Wang
  2017-12-22  0:03 ` [Patch net-next] net_sched: remove the unsafe __skb_array_empty() Cong Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Cong Wang @ 2017-12-22  0:03 UTC (permalink / raw)
  To: netdev; +Cc: jakub.kicinski, Cong Wang, John Fastabend

qdisc_reset() should always be called with the qdisc spinlock
held and with BH disabled, otherwise qdisc ->reset() could race
with the TX BH.

Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_generic.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 10aaa3b615ce..00ddb5f8f430 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
 {
 	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
 
-	if (qdisc)
+	if (qdisc) {
+		spin_lock_bh(qdisc_lock(qdisc));
 		qdisc_reset(qdisc);
+		spin_unlock_bh(qdisc_lock(qdisc));
+	}
 }
 
 /**
-- 
2.13.0


* [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
  2017-12-22  0:03 [Patch net-next] net_sched: call qdisc_reset() with qdisc lock Cong Wang
@ 2017-12-22  0:03 ` Cong Wang
  2017-12-22  3:06   ` John Fastabend
  2018-01-02 16:37   ` David Miller
  2017-12-22  3:36 ` [Patch net-next] net_sched: call qdisc_reset() with qdisc lock John Fastabend
  2018-01-02 16:39 ` David Miller
  2 siblings, 2 replies; 12+ messages in thread
From: Cong Wang @ 2017-12-22  0:03 UTC (permalink / raw)
  To: netdev; +Cc: jakub.kicinski, Cong Wang, John Fastabend

__skb_array_empty() is only safe if the array is never resized.
pfifo_fast_dequeue() is called in TX BH context and without the
qdisc lock, so even after we disable BH on the ->reset() path
we can still race with other CPUs.

Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_generic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 00ddb5f8f430..9279258ce060 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
-		if (__skb_array_empty(q))
-			continue;
-
 		skb = skb_array_consume_bh(q);
 	}
 	if (likely(skb)) {
-- 
2.13.0


* Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
  2017-12-22  0:03 ` [Patch net-next] net_sched: remove the unsafe __skb_array_empty() Cong Wang
@ 2017-12-22  3:06   ` John Fastabend
  2017-12-22 20:31     ` Cong Wang
  2018-01-02 16:37   ` David Miller
  1 sibling, 1 reply; 12+ messages in thread
From: John Fastabend @ 2017-12-22  3:06 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: jakub.kicinski

On 12/21/2017 04:03 PM, Cong Wang wrote:
> __skb_array_empty() is only safe if the array is never resized.
> pfifo_fast_dequeue() is called in TX BH context and without the
> qdisc lock, so even after we disable BH on the ->reset() path
> we can still race with other CPUs.
> 
> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/sched/sch_generic.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 00ddb5f8f430..9279258ce060 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>  		struct skb_array *q = band2list(priv, band);
>  
> -		if (__skb_array_empty(q))
> -			continue;
> -
>  		skb = skb_array_consume_bh(q);
>  	}
>  	if (likely(skb)) {
> 


So this is a performance thing: we don't want to grab the consumer lock on
empty bands, which can be fairly common depending on traffic patterns.

Although it's not logical IMO to have both reset and dequeue running at
the same time. Some skbs would get through, others would get sent, sort
of a mess. I don't see how it can be an issue. The "never resized" bit
in the documentation is referring to resizing the ring size, _not_ popping
elements off the ring. array_empty just reads the consumer head.
The only ring resizing in pfifo_fast should be at init and destroy, where
enqueue/dequeue should be disconnected by then. Although based on the
trace I missed a case.

I think the right fix is to only call reset/destroy patterns after
waiting a grace period and for all tx_action calls in-flight to
complete. This is also better going forward for more complex qdiscs.
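
Roughly the ordering I mean, as a sketch only (this is the sequence the
dev_deactivate_many() path already tries to enforce; names from my
reading of net/sched/sch_generic.c, so double-check the tree):

    dev_deactivate_queue(dev, txq, &noop_qdisc);
                                    /* swap in noop_qdisc: no new enqueues */
    synchronize_net();              /* grace period for in-flight xmit paths */
    while (some_qdisc_is_busy(dev))
            yield();                /* drain qdisc_run() callers */
    dev_qdisc_reset(dev, txq, NULL);
                                    /* only now touch the old qdisc */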

.John


* Re: [Patch net-next] net_sched: call qdisc_reset() with qdisc lock
  2017-12-22  0:03 [Patch net-next] net_sched: call qdisc_reset() with qdisc lock Cong Wang
  2017-12-22  0:03 ` [Patch net-next] net_sched: remove the unsafe __skb_array_empty() Cong Wang
@ 2017-12-22  3:36 ` John Fastabend
  2017-12-22  4:06   ` Jakub Kicinski
  2017-12-22 21:41   ` Cong Wang
  2018-01-02 16:39 ` David Miller
  2 siblings, 2 replies; 12+ messages in thread
From: John Fastabend @ 2017-12-22  3:36 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: jakub.kicinski

On 12/21/2017 04:03 PM, Cong Wang wrote:
> qdisc_reset() should always be called with the qdisc spinlock
> held and with BH disabled, otherwise qdisc ->reset() could race
> with the TX BH.
> 
Hmm, I don't see how this fixes the issue. pfifo_fast is no longer
using the qdisc lock, so that doesn't help. And it is only a
local_bh_disable.


> Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/sched/sch_generic.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 10aaa3b615ce..00ddb5f8f430 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
>  {
>  	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
>  
> -	if (qdisc)
> +	if (qdisc) {
> +		spin_lock_bh(qdisc_lock(qdisc));
>  		qdisc_reset(qdisc);
> +		spin_unlock_bh(qdisc_lock(qdisc));
> +	}
>  }
>  
>  /**
> 

OK, first, the cases to get to qdisc_reset that I've tracked
down are:

    dev_shutdown()
      qdisc_destroy()

    dev_deactivate_many()
      dev_qdisc_reset() <- for each txq
         qdisc_reset()

    chained calls from qdisc_reset ops

At the moment all the lockless qdiscs don't care about chained
calls so we can ignore that, but would be nice to keep in mind.

Next, qdisc_reset() is doing a couple of things: calling the qdisc
ops reset call, but also walking gso_skb and skb_bad_txq. The
'unlink' operations there are not safe to be called while an
enqueue/dequeue op is in-flight. Also, pfifo_fast's reset op
is not safe to be called with enqueue/dequeue ops in-flight.

So I've made the assumption that qdisc_reset is _only_ ever
called after a qdisc is no longer attached on the enqueue
dev_xmit side and also any in-progress tx_action calls are
completed. For what it's worth this has always been the assumption
AFAIK.

So those are the assumptions what did I miss?

The biggest gap I see is dev_deactivate_many() is supposed
to wait for all tx_action calls to complete, this bit:

        /* Wait for outstanding qdisc_run calls. */
        list_for_each_entry(dev, head, close_list) {
                while (some_qdisc_is_busy(dev))
                        yield();

Where some_qdisc_is_busy in the lockless case only does
the following,

                if (q->flags & TCQ_F_NOLOCK) {
                        val = test_bit(__QDISC_STATE_SCHED, &q->state);

Oops, that only tells us something is scheduled, nothing about
whether something is running. Excerpt from the tx_action side here:

                        clear_bit(__QDISC_STATE_SCHED, &q->state);
                        qdisc_run(q);

For comparison here is the check in the qdisc locked branch of
some_qdisc_is_busy,

                        root_lock = qdisc_lock(q);
                        spin_lock_bh(root_lock);

                        val = (qdisc_is_running(q) ||
                               test_bit(__QDISC_STATE_SCHED, &q->state));

                        spin_unlock_bh(root_lock);
 
Previously we had the qdisc_lock() to protect the bit clear, and then in
the some_qdisc_is_busy case we still did a qdisc_is_running() call. The
easiest fix I see is to add the qdisc_is_running() check to the TCQ_F_NOLOCK
case in some_qdisc_is_busy, AND to be correct I guess we should set the
running bit before clearing the QDISC_STATE_SCHED bit.

Untested but perhaps as simple as,

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ab497ef..720829e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1071,7 +1071,8 @@ static bool some_qdisc_is_busy(struct net_device *dev)
                q = dev_queue->qdisc_sleeping;
 
                if (q->flags & TCQ_F_NOLOCK) {
-                       val = test_bit(__QDISC_STATE_SCHED, &q->state);
+                       val = (qdisc_is_running(q) ||
+                              test_bit(__QDISC_STATE_SCHED, &q->state));
                } else {
                        root_lock = qdisc_lock(q);
                        spin_lock_bh(root_lock);
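
And the tx_action side would then order things roughly like this (sketch
only, untested; I'm assuming qdisc_run_begin()/qdisc_run_end() are the
primitives that set and clear the running state in the lockless case):

                        /* mark running _before_ dropping SCHED so
                         * some_qdisc_is_busy() can never observe both
                         * clear while __qdisc_run() is still in flight
                         */
                        if (qdisc_run_begin(q)) {
                                clear_bit(__QDISC_STATE_SCHED, &q->state);
                                __qdisc_run(q);
                                qdisc_run_end(q);
                        }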


Thanks,
John


* Re: [Patch net-next] net_sched: call qdisc_reset() with qdisc lock
  2017-12-22  3:36 ` [Patch net-next] net_sched: call qdisc_reset() with qdisc lock John Fastabend
@ 2017-12-22  4:06   ` Jakub Kicinski
  2017-12-22 21:41   ` Cong Wang
  1 sibling, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2017-12-22  4:06 UTC (permalink / raw)
  To: John Fastabend; +Cc: Cong Wang, netdev


On Thu, 21 Dec 2017 19:36:51 -0800, John Fastabend wrote:
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index ab497ef..720829e 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1071,7 +1071,8 @@ static bool some_qdisc_is_busy(struct net_device *dev)
>                 q = dev_queue->qdisc_sleeping;
>  
>                 if (q->flags & TCQ_F_NOLOCK) {
> -                       val = test_bit(__QDISC_STATE_SCHED, &q->state);
> +                       val = (qdisc_is_running(q) ||
> +                              test_bit(__QDISC_STATE_SCHED, &q->state));
>                 } else {
>                         root_lock = qdisc_lock(q);
>                         spin_lock_bh(root_lock);

Ah, I just found that hping'ing the machine with the right .config seems
to be a good trigger. Testing... I'm afraid the KASAN splat is still there.
Attaching my .config FWIW.

[   67.870458] ==================================================================
[   67.878616] BUG: KASAN: slab-out-of-bounds in pfifo_fast_dequeue+0x140/0x2d0
[   67.886566] Read of size 8 at addr ffff88036f34c040 by task swapper/12/0
[   67.894123] 
[   67.895854] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.15.0-rc3-perf-01035-ge394cb5fbb97-dirty #24
[   67.906232] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[   67.914665] Call Trace:
[   67.917462]  <IRQ>
[   67.919778]  dump_stack+0xa6/0x118
[   67.923645]  ? _atomic_dec_and_lock+0xe8/0xe8
[   67.928582]  ? pfifo_fast_dequeue+0x140/0x2d0
[   67.933524]  print_address_description+0x6a/0x270
[   67.938839]  ? pfifo_fast_dequeue+0x140/0x2d0
[   67.943775]  kasan_report+0x23f/0x350
[   67.947933]  pfifo_fast_dequeue+0x140/0x2d0
[   67.952676]  ? stack_access_ok+0x3d/0xa0
[   67.957126]  __qdisc_run+0x167/0xa20
[   67.961188]  ? __orc_find+0x6b/0xc0
[   67.965153]  ? sch_direct_xmit+0x3d0/0x3d0
[   67.969820]  ? _raw_spin_unlock+0x73/0xc0
[   67.974388]  ? _raw_spin_trylock+0xe0/0xe0
[   67.979053]  ? deref_stack_reg+0x50/0xd0
[   67.983528]  ? tg3_poll_msix+0x64/0x290 [tg3]
[   67.988529]  ? is_module_text_address+0x11/0x30
[   67.993682]  ? kernel_text_address+0x5a/0x100
[   67.998631]  ? pfifo_fast_enqueue+0x154/0x180
[   68.003591]  __dev_queue_xmit+0x5ae/0x1110
[   68.008268]  ? depot_save_stack+0x12d/0x470
[   68.013023]  ? netdev_pick_tx+0x150/0x150
[   68.017592]  ? __kmalloc_node_track_caller+0x16e/0x2a0
[   68.023424]  ? __kmalloc_reserve.isra.7+0x2e/0x80
[   68.028769]  ? __alloc_skb+0xed/0x390
[   68.032949]  ? sock_wmalloc+0xc2/0x110
[   68.037227]  ? __ip_append_data.isra.3+0xee8/0x1090
[   68.042768]  ? ip_append_data.part.4+0x8b/0xd0
[   68.047821]  ? ip_send_unicast_reply+0x564/0x6d0
[   68.053071]  ? tcp_v4_send_reset+0xa7b/0xd20
[   68.057929]  ? tcp_v4_rcv+0xcce/0x1650
[   68.062207]  ? ip_local_deliver_finish+0x13d/0x3f0
[   68.067649]  ? ip_local_deliver+0xe2/0x280
[   68.072314]  ? ip_rcv_finish+0x5f7/0xa10
[   68.076785]  ? ip_rcv+0x49d/0x780
[   68.080576]  ? __netif_receive_skb_core+0xfa9/0x1bd0
[   68.086213]  ? netif_receive_skb_internal+0x93/0x240
[   68.091849]  ? napi_gro_receive+0x1b3/0x1d0
[   68.096616]  ? tg3_poll_work+0x13e3/0x1d80 [tg3]
[   68.101867]  ? tg3_poll_msix+0x64/0x290 [tg3]
[   68.106825]  ? net_rx_action+0x4a8/0xb00
[   68.111295]  ? __do_softirq+0x17f/0x4de
[   68.115673]  ? irq_exit+0xe1/0xf0
[   68.119463]  ? do_IRQ+0x94/0xe0
[   68.123059]  ? common_interrupt+0x8c/0x8c
[   68.127631]  ? cpuidle_enter_state+0x12a/0x510
[   68.132687]  ? do_idle+0x1af/0x200
[   68.136574]  ? cpu_startup_entry+0xd2/0xe0
[   68.141240]  ? start_secondary+0x271/0x2b0
[   68.145906]  ? secondary_startup_64+0xa5/0xb0
[   68.150862]  ? common_interrupt+0x8c/0x8c
[   68.155430]  ? cpuidle_enter_state+0x12a/0x510
[   68.160508]  ? do_idle+0x1af/0x200
[   68.164396]  ? cpu_startup_entry+0xd2/0xe0
[   68.169061]  ? start_secondary+0x271/0x2b0
[   68.173726]  ? secondary_startup_64+0xa5/0xb0
[   68.178683]  ? ___slab_alloc+0x45d/0x600
[   68.183155]  ? inet_lookup_ifaddr_rcu+0x126/0x170
[   68.188534]  ? memcg_kmem_put_cache+0x63/0x120
[   68.193587]  ? memcg_kmem_get_cache+0x4e0/0x4e0
[   68.198739]  ? __rcu_read_unlock+0x6e/0x120
[   68.203502]  ? memcg_kmem_put_cache+0x63/0x120
[   68.208571]  ? memcg_kmem_get_cache+0x4e0/0x4e0
[   68.213723]  ? __kmalloc_node_track_caller+0x1fe/0x2a0
[   68.219546]  ? __alloc_skb+0xed/0x390
[   68.223726]  ? __kmalloc_reserve.isra.7+0x43/0x80
[   68.229071]  ? memset+0x1f/0x40
[   68.232667]  ? __alloc_skb+0x27e/0x390
[   68.236944]  ? __kmalloc_reserve.isra.7+0x80/0x80
[   68.242289]  ? ktime_get+0x10d/0x1a0
[   68.246370]  ? __rcu_read_unlock+0x6e/0x120
[   68.251134]  ? ip_finish_output2+0x68d/0x7c0
[   68.255994]  ip_finish_output2+0x68d/0x7c0
[   68.260661]  ? ip_send_check+0x60/0x60
[   68.264937]  ? skb_set_owner_w+0x9c/0x120
[   68.269504]  ? sock_wmalloc+0xd5/0x110
[   68.273782]  ? sk_alloc+0x6b0/0x6b0
[   68.277768]  ? ipv4_mtu+0x163/0x200
[   68.281754]  ? ipv4_negative_advice+0x60/0x60
[   68.286710]  ? ip_reply_glue_bits+0x2c/0x50
[   68.291474]  ? __ip_append_data.isra.3+0xdb1/0x1090
[   68.297014]  ? ip_idents_reserve+0x11d/0x1a0
[   68.301872]  ? ipv4_sysctl_rtcache_flush+0x40/0x40
[   68.307315]  ? ip_setup_cork+0x230/0x230
[   68.311786]  ? ip_finish_output+0x39a/0x4c0
[   68.316582]  ip_finish_output+0x39a/0x4c0
[   68.321150]  ? ip_fragment.constprop.5+0xf0/0xf0
[   68.326398]  ? ipv4_mtu+0x163/0x200
[   68.330384]  ? __ip_select_ident+0xf8/0x180
[   68.335147]  ? find_exception+0x270/0x270
[   68.339716]  ? ipv4_mtu+0x163/0x200
[   68.343699]  ? ip_send_check+0x20/0x60
[   68.347978]  ip_output+0x106/0x280
[   68.351865]  ? ip_mc_output+0x750/0x750
[   68.356258]  ? ip_append_page+0x6d0/0x6d0
[   68.360825]  ? ip_append_data.part.4+0x8b/0xd0
[   68.365879]  ip_send_skb+0x29/0x70
[   68.369766]  ? ip_push_pending_frames+0x2e/0x50
[   68.374918]  ip_send_unicast_reply+0x64e/0x6d0
[   68.379973]  ? ip_make_skb+0x1d0/0x1d0
[   68.384257]  ? ip_local_deliver+0xe2/0x280
[   68.388920]  ? ip_rcv+0x49d/0x780
[   68.392711]  ? __netif_receive_skb_core+0xfa9/0x1bd0
[   68.398348]  ? napi_gro_receive+0x1b3/0x1d0
[   68.403113]  ? tg3_poll_work+0x13e3/0x1d80 [tg3]
[   68.408358]  ? tg3_poll_msix+0x64/0x290 [tg3]
[   68.413316]  ? net_rx_action+0x4a8/0xb00
[   68.417788]  ? __do_softirq+0x17f/0x4de
[   68.422163]  ? cpuidle_enter_state+0x12a/0x510
[   68.427218]  ? do_idle+0x1af/0x200
[   68.431104]  ? cpu_startup_entry+0xd2/0xe0
[   68.435769]  ? start_secondary+0x271/0x2b0
[   68.440453]  ? map_id_range_down+0x186/0x1b0
[   68.445311]  ? __put_user_ns+0x30/0x30
[   68.449589]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[   68.456529]  tcp_v4_send_reset+0xa7b/0xd20
[   68.461196]  ? tcp_v4_reqsk_send_ack+0x1a0/0x1a0
[   68.466444]  ? __inet_lookup_listener+0x1e5/0x520
[   68.471790]  ? sock_edemux+0x20/0x20
[   68.475873]  ? reweight_entity+0x630/0x630
[   68.480569]  ? rb_insert_color_cached+0x6f3/0x7a0
[   68.485914]  ? rb_insert_color+0x770/0x770
[   68.490581]  ? tcp_v4_rcv+0xcce/0x1650
[   68.494858]  tcp_v4_rcv+0xcce/0x1650
[   68.498941]  ? tcp_v4_early_demux+0x350/0x350
[   68.503900]  ? raw_rcv+0x1b0/0x1b0
[   68.507787]  ? reweight_entity+0x393/0x630
[   68.512482]  ? update_curr+0xa9/0x3c0
[   68.516662]  ? rb_insert_color+0x770/0x770
[   68.521328]  ? __inet_lookup_established+0x12e/0x3d0
[   68.526966]  ? __udp4_lib_rcv+0x1150/0x1150
[   68.531719]  ? load_too_imbalanced+0xd0/0xd0
[   68.536581]  ip_local_deliver_finish+0x13d/0x3f0
[   68.541921]  ? inet_del_offload+0x40/0x40
[   68.546489]  ? update_curr+0xa9/0x3c0
[   68.550667]  ? __rcu_read_unlock+0x6e/0x120
[   68.555431]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[   68.562332]  ip_local_deliver+0xe2/0x280
[   68.566804]  ? ip_call_ra_chain+0x2e0/0x2e0
[   68.571567]  ? ip_route_input_noref+0x95/0xd0
[   68.576563]  ? ip_route_input_rcu+0x1570/0x1570
[   68.581716]  ip_rcv_finish+0x5f7/0xa10
[   68.585983]  ? put_prev_task_fair+0x50/0x50
[   68.590747]  ? ip_local_deliver_finish+0x3f0/0x3f0
[   68.596204]  ? account_entity_enqueue+0x1d2/0x240
[   68.601551]  ? load_too_imbalanced+0xd0/0xd0
[   68.606409]  ? __update_load_avg_cfs_rq.isra.5+0x2b6/0x2c0
[   68.612628]  ? __enqueue_entity+0x93/0xc0
[   68.617196]  ? tcp_v4_send_synack+0x1a0/0x1a0
[   68.622153]  ? enqueue_entity+0xc16/0x1450
[   68.626817]  ? rb_insert_color_cached+0x6f3/0x7a0
[   68.632180]  ? rb_insert_color+0x770/0x770
[   68.636847]  ? __dev_queue_xmit+0x61d/0x1110
[   68.641706]  ? put_prev_task_fair+0x50/0x50
[   68.646468]  ip_rcv+0x49d/0x780
[   68.650066]  ? ip_local_deliver+0x280/0x280
[   68.654830]  ? idle_cpu+0x100/0x100
[   68.658818]  ? __alloc_pages_nodemask+0x316/0x1d30
[   68.664277]  ? enqueue_task_fair+0x2d9/0x10d0
[   68.669235]  ? smp_thermal_interrupt+0x230/0x230
[   68.674484]  ? update_cfs_group+0x232/0x290
[   68.679246]  ? rb_insert_color+0x770/0x770
[   68.683910]  ? reweight_entity+0x630/0x630
[   68.688583]  ? ip_local_deliver+0x280/0x280
[   68.693346]  __netif_receive_skb_core+0xfa9/0x1bd0
[   68.698791]  ? flush_backlog+0x250/0x250
[   68.703262]  ? 0xffffffffa0000000
[   68.707052]  ? update_curr+0xa9/0x3c0
[   68.711230]  ? stack_access_ok+0x3d/0xa0
[   68.715702]  ? __module_address+0x232/0x330
[   68.720498]  ? modules_open+0x60/0x60
[   68.724678]  ? __accumulate_pelt_segments+0x47/0xd0
[   68.730217]  ? __module_address+0x232/0x330
[   68.734980]  ? stack_access_ok+0x3d/0xa0
[   68.739450]  ? deref_stack_reg+0x98/0xd0
[   68.743922]  ? __read_once_size_nocheck.constprop.3+0x10/0x10
[   68.750435]  ? get_stack_info+0x37/0x150
[   68.754906]  ? __orc_find+0x6b/0xc0
[   68.758891]  ? secondary_startup_64+0xa4/0xb0
[   68.763848]  ? unwind_next_frame+0xcd/0xbf0
[   68.768610]  ? cpuidle_enter_state+0x12a/0x510
[   68.773667]  ? tg3_poll_msix+0x64/0x290 [tg3]
[   68.778623]  ? deref_stack_reg+0xd0/0xd0
[   68.783093]  ? __build_skb+0x85/0x210
[   68.787276]  ? tg3_poll_msix+0x64/0x290 [tg3]
[   68.792242]  ? is_module_text_address+0x11/0x30
[   68.797393]  ? kernel_text_address+0x5a/0x100
[   68.802350]  ? _mix_pool_bytes+0x1fc/0x260
[   68.807019]  ? tg3_poll_msix+0x64/0x290 [tg3]
[   68.811975]  ? __build_skb+0x85/0x210
[   68.816155]  ? __save_stack_trace+0x73/0xd0
[   68.831167]  ? depot_save_stack+0x12d/0x470
[   68.835929]  ? __build_skb+0x85/0x210
[   68.840108]  ? kasan_kmalloc+0x142/0x170
[   68.844591]  ? kmem_cache_alloc+0xb3/0x1c0
[   68.849247]  ? __build_skb+0x85/0x210
[   68.853425]  ? __netdev_alloc_skb+0xeb/0x150
[   68.858287]  ? tg3_poll_work+0x1113/0x1d80 [tg3]
[   68.863535]  ? __rcu_read_unlock+0x6e/0x120
[   68.868299]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[   68.875199]  ? tcp_gro_receive+0x3cd/0x550
[   68.879865]  ? tcp4_gro_receive+0x25/0x350
[   68.884566]  ? start_secondary+0x271/0x2b0
[   68.889230]  ? secondary_startup_64+0xa5/0xb0
[   68.894187]  ? inet_gro_receive+0x19a/0x5e0
[   68.898950]  ? ktime_get_with_offset+0x144/0x1e0
[   68.904217]  ? ktime_get_resolution_ns+0xf0/0xf0
[   68.909457]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[   68.916360]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[   68.923261]  ? netif_receive_skb_internal+0x93/0x240
[   68.928897]  netif_receive_skb_internal+0x93/0x240
[   68.934340]  ? dev_cpu_dead+0x470/0x470
[   68.938715]  ? __build_skb+0x85/0x210
[   68.942895]  ? net_rx_action+0xb00/0xb00
[   68.947366]  ? memset+0x1f/0x40
[   68.950962]  ? __build_skb+0x1b7/0x210
[   68.955238]  ? skb_push+0x80/0x80
[   68.959029]  napi_gro_receive+0x1b3/0x1d0
[   68.963597]  ? dev_gro_receive+0xc50/0xc50
[   68.968284]  tg3_poll_work+0x13e3/0x1d80 [tg3]
[   68.973346]  ? tg3_poll_controller+0x90/0x90 [tg3]
[   68.978787]  ? enqueue_entity+0x1450/0x1450
[   68.983551]  ? napi_complete_done+0x159/0x2e0
[   68.988547]  ? napi_gro_flush+0xd0/0xd0
[   68.992922]  ? rb_insert_color+0x12e/0x770
[   68.997588]  ? rb_first_postorder+0x50/0x50
[   69.002353]  ? rcu_segcblist_extract_pend_cbs+0xa0/0xa0
[   69.008282]  ? dequeue_rt_stack+0x103/0x5a0
[   69.013050]  tg3_poll_msix+0x64/0x290 [tg3]
[   69.017814]  net_rx_action+0x4a8/0xb00
[   69.022093]  ? napi_complete_done+0x2e0/0x2e0
[   69.027108]  ? read_exit_mmio+0x140/0x140 [kvm]
[   69.032277]  ? update_max_interval+0x40/0x40
[   69.037136]  ? wait_rcu_exp_gp+0x60/0x60
[   69.041607]  ? _raw_spin_unlock+0xc0/0xc0
[   69.046175]  ? rcu_segcblist_future_gp_needed+0x48/0x80
[   69.052105]  ? cpu_needs_another_gp+0x29e/0x2b0
[   69.057255]  ? print_other_cpu_stall+0x870/0x870
[   69.062503]  ? __wake_up_common+0xa9/0x2d0
[   69.067169]  ? __note_gp_changes+0x670/0x670
[   69.072028]  ? remove_wait_queue+0x160/0x160
[   69.076889]  ? _raw_spin_unlock+0xc0/0xc0
[   69.081457]  ? resched_curr+0x84/0x1b0
[   69.085732]  ? wake_q_add+0x50/0x50
[   69.089720]  ? cyc2ns_read_end+0x20/0x20
[   69.094192]  ? _raw_spin_unlock+0xc0/0xc0
[   69.098759]  ? _raw_spin_trylock+0xe0/0xe0
[   69.103425]  ? native_sched_clock_from_tsc+0x130/0x160
[   69.109258]  ? sched_clock_cpu+0x14/0xf0
[   69.113730]  ? try_to_wake_up+0x4a9/0x7f0
[   69.118300]  ? raise_softirq_irqoff+0x40/0x40
[   69.123257]  ? migrate_swap_stop+0x3b0/0x3b0
[   69.128117]  ? crng_reseed+0x3d0/0x3d0
[   69.132398]  ? _raw_spin_unlock+0xc0/0xc0
[   69.136966]  ? cyc2ns_read_begin+0x20/0x90
[   69.141633]  ? add_interrupt_randomness+0x1cd/0x3d0
[   69.147177]  ? tg3_interrupt_tagged+0x1b0/0x1b0 [tg3]
[   69.152911]  ? xfer_secondary_pool+0x70/0x70
[   69.157771]  ? irq_wait_for_poll+0xf0/0xf0
[   69.162437]  __do_softirq+0x17f/0x4de
[   69.166618]  ? __softirqentry_text_start+0x8/0x8
[   69.171866]  ? handle_irq_event_percpu+0xb9/0xf0
[   69.177116]  ? __handle_irq_event_percpu+0x390/0x390
[   69.182745]  ? nr_iowait+0x110/0x110
[   69.186828]  ? _raw_spin_unlock+0x73/0xc0
[   69.191397]  ? _raw_spin_trylock+0xe0/0xe0
[   69.196062]  ? handle_irq_event+0x79/0x90
[   69.200629]  ? handle_edge_irq+0x166/0x2f0
[   69.205295]  irq_exit+0xe1/0xf0
[   69.208892]  do_IRQ+0x94/0xe0
[   69.212295]  common_interrupt+0x8c/0x8c
[   69.216669]  </IRQ>
[   69.219100] RIP: 0010:cpuidle_enter_state+0x12a/0x510
[   69.224835] RSP: 0018:ffff88036c7cfd08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
[   69.233409] RAX: 0000000000000000 RBX: ffffe8fb00b060e0 RCX: ffffffffa01329f5
[   69.241476] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff8803703246e8
[   69.249542] RBP: 1ffff1006d8f9fa6 R08: fffffbfff43429f8 R09: fffffbfff43429f8
[   69.257608] R10: ffff88036c7cfcc8 R11: fffffbfff43429f7 R12: 0000000fcd62a077
[   69.265673] R13: 0000000000000002 R14: 0000000000000002 R15: ffffffffa183eac0
[   69.273741]  ? sched_idle_set_state+0x25/0x30
[   69.278699]  ? cpuidle_enter_state+0x106/0x510
[   69.283754]  ? cpuidle_enter_s2idle+0x130/0x130
[   69.288906]  ? rcu_eqs_enter_common.constprop.62+0xd1/0x1e0
[   69.295223]  ? rcu_gp_init+0xf70/0xf70
[   69.299499]  ? sched_set_stop_task+0x160/0x160
[   69.304586]  do_idle+0x1af/0x200
[   69.308287]  cpu_startup_entry+0xd2/0xe0
[   69.312759]  ? cpu_in_idle+0x20/0x20
[   69.316842]  ? _raw_spin_trylock+0xe0/0xe0
[   69.321508]  ? memcpy+0x34/0x50
[   69.325107]  start_secondary+0x271/0x2b0
[   69.329577]  ? set_cpu_sibling_map+0x840/0x840
[   69.334633]  secondary_startup_64+0xa5/0xb0
[   69.339387] 
[   69.341136] Allocated by task 844:
[   69.345024]  __kmalloc+0xfa/0x230
[   69.348815]  pfifo_fast_init+0x69/0x160
[   69.353190]  qdisc_create_dflt+0x97/0xc0
[   69.357661]  mq_init+0x19f/0x1f0
[   69.361356]  qdisc_create_dflt+0x97/0xc0
[   69.365825]  dev_activate+0x48e/0x4e0
[   69.370005]  __dev_open+0x19e/0x210
[   69.373989]  __dev_change_flags+0x3b5/0x3f0
[   69.378751]  dev_change_flags+0x50/0xa0
[   69.383126]  do_setlink+0x5eb/0x1cf0
[   69.387209]  rtnl_newlink+0x9d5/0xe40
[   69.391390]  rtnetlink_rcv_msg+0x37c/0x7e0
[   69.396122]  netlink_rcv_skb+0x122/0x230
[   69.400594]  netlink_unicast+0x2ae/0x360
[   69.405066]  netlink_sendmsg+0x5d5/0x620
[   69.409536]  sock_sendmsg+0x64/0x80
[   69.413520]  ___sys_sendmsg+0x4a8/0x500
[   69.417894]  __sys_sendmsg+0xa9/0x140
[   69.422073]  entry_SYSCALL_64_fastpath+0x1e/0x81
[   69.427319] 
[   69.429067] Freed by task 1:
[   69.432371]  kfree+0x8d/0x1c0
[   69.435775]  erst_reader+0x7ef/0x980
[   69.439857]  pstore_get_backend_records+0xdf/0x370
[   69.445303]  pstore_get_records+0x69/0x90
[   69.449872]  pstore_fill_super+0xfe/0x110
[   69.454441]  mount_single+0x60/0xe0
[   69.458426]  mount_fs+0x48/0x190
[   69.462122]  vfs_kern_mount.part.7+0x9f/0x210
[   69.467079]  do_mount+0x945/0x1690
[   69.470965]  SyS_mount+0x55/0xd0
[   69.474658]  entry_SYSCALL_64_fastpath+0x1e/0x81
[   69.479904] 
[   69.481652] The buggy address belongs to the object at ffff88036f34a100
[   69.481652]  which belongs to the cache kmalloc-8192 of size 8192
[   69.495958] The buggy address is located 8000 bytes inside of
[   69.495958]  8192-byte region [ffff88036f34a100, ffff88036f34c100)
[   69.509388] The buggy address belongs to the page:
[   69.514832] page:0000000086a349ff count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
[   69.525933] flags: 0x2ffff0000008100(slab|head)
[   69.531084] raw: 02ffff0000008100 0000000000000000 0000000000000000 0000000100030003
[   69.539852] raw: dead000000000100 dead000000000200 ffff88036fc0e680 0000000000000000
[   69.548640] page dumped because: kasan: bad access detected
[   69.554955] 
[   69.556704] Memory state around the buggy address:
[   69.562147]  ffff88036f34bf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   69.570331]  ffff88036f34bf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   69.578515] >ffff88036f34c000: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[   69.586699]                                            ^
[   69.592725]  ffff88036f34c080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   69.600901]  ffff88036f34c100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   69.609084] ==================================================================

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27625 bytes --]


* Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
  2017-12-22  3:06   ` John Fastabend
@ 2017-12-22 20:31     ` Cong Wang
  2017-12-24  6:57       ` John Fastabend
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2017-12-22 20:31 UTC (permalink / raw)
  To: John Fastabend; +Cc: Linux Kernel Network Developers, Jakub Kicinski

On Thu, Dec 21, 2017 at 7:06 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 12/21/2017 04:03 PM, Cong Wang wrote:
>> __skb_array_empty() is only safe if the array is never resized.
>> pfifo_fast_dequeue() is called in TX BH context and without the
>> qdisc lock, so even after we disable BH on the ->reset() path
>> we can still race with other CPUs.
>>
>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>> Cc: John Fastabend <john.fastabend@gmail.com>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>>  net/sched/sch_generic.c | 3 ---
>>  1 file changed, 3 deletions(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 00ddb5f8f430..9279258ce060 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>       for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>               struct skb_array *q = band2list(priv, band);
>>
>> -             if (__skb_array_empty(q))
>> -                     continue;
>> -
>>               skb = skb_array_consume_bh(q);
>>       }
>>       if (likely(skb)) {
>>
>
>
> So this is a performance thing: we don't want to grab the consumer lock on
> empty bands, which can be fairly common depending on traffic patterns.


I understand why you had it, but it is just not safe. You don't want
to achieve a performance gain by crashing the system, right?

>
> Although it's not logical IMO to have both reset and dequeue running at
> the same time. Some skbs would get through, others would get sent, sort
> of a mess. I don't see how it can be an issue. The "never resized" bit
> in the documentation is referring to resizing the ring size, _not_ popping
> elements off the ring. array_empty just reads the consumer head.
> The only ring resizing in pfifo_fast should be at init and destroy, where
> enqueue/dequeue should be disconnected by then. Although based on the
> trace I missed a case.


Both pfifo_fast_reset() and pfifo_fast_dequeue() call
skb_array_consume_bh(), so there is no difference w.r.t. resizing.

And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
concurrently.
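
To spell out the two paths (a sketch from my reading of the code):

    CPU0 (tc qdisc replace)          CPU1 (TX softirq)
    -----------------------          -----------------
    qdisc_graft()                    net_tx_action()
      htb_graft()                      qdisc_run()
        qdisc_replace()                  pfifo_fast_dequeue()
          qdisc_reset(old)                 skb_array_consume_bh()
            pfifo_fast_reset()
              skb_array_consume_bh()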


>
> I think the right fix is to only call reset/destroy patterns after
> waiting a grace period and for all tx_action calls in-flight to
> complete. This is also better going forward for more complex qdiscs.

But we don't even have rcu read lock in TX BH, do we?

Also, people certainly don't like yet another synchronize_net()...


* Re: [Patch net-next] net_sched: call qdisc_reset() with qdisc lock
  2017-12-22  3:36 ` [Patch net-next] net_sched: call qdisc_reset() with qdisc lock John Fastabend
  2017-12-22  4:06   ` Jakub Kicinski
@ 2017-12-22 21:41   ` Cong Wang
  1 sibling, 0 replies; 12+ messages in thread
From: Cong Wang @ 2017-12-22 21:41 UTC (permalink / raw)
  To: John Fastabend; +Cc: Linux Kernel Network Developers, Jakub Kicinski

On Thu, Dec 21, 2017 at 7:36 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 12/21/2017 04:03 PM, Cong Wang wrote:
>> qdisc_reset() should always be called with the qdisc spinlock
>> held and with BH disabled, otherwise qdisc ->reset() could race
>> with the TX BH.
>>
> Hmm, I don't see how this fixes the issue. pfifo_fast is no longer
> using the qdisc lock, so that doesn't help. And it is only a
> local_bh_disable.


First of all, this is to fix the non-pfifo_fast qdiscs, which you seem to have totally forgotten.


>
>
>> Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
>> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>> Cc: John Fastabend <john.fastabend@gmail.com>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>>  net/sched/sch_generic.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 10aaa3b615ce..00ddb5f8f430 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
>>  {
>>       struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
>>
>> -     if (qdisc)
>> +     if (qdisc) {
>> +             spin_lock_bh(qdisc_lock(qdisc));
>>               qdisc_reset(qdisc);
>> +             spin_unlock_bh(qdisc_lock(qdisc));
>> +     }
>>  }
>>
>>  /**
>>
>
> OK, first, the cases to get to qdisc_reset that I've tracked
> down are:
>
>     dev_shutdown()
>       qdisc_destroy()
>
>     dev_deactivate_many()
>       dev_qdisc_reset() <- for each txq
>          qdisc_reset()
>
>     chained calls from qdisc_reset ops
>
> At the moment all the lockless qdiscs don't care about chained
> calls so we can ignore that, but would be nice to keep in mind.
>
> Next, qdisc_reset() is doing a couple of things: calling the qdisc
> ops reset call, but also walking gso_skb and skb_bad_txq. The
> 'unlink' operations there are not safe to be called while an
> enqueue/dequeue op is in-flight. Also, pfifo_fast's reset op
> is not safe to be called with enqueue/dequeue ops in-flight.

This is why I sent two patches instead of just this one.


>
> So I've made the assumption that qdisc_reset is _only_ ever
> called after a qdisc is no longer attached on the enqueue
> dev_xmit side and also any in-progress tx_action calls are
> completed. For what it's worth this has always been the assumption
> AFAIK.
>
> So those are the assumptions what did I miss?


Speaking of this, qdisc_reset() is also called in dev_deactivate_queue()
even before synchronize_net()...

>
> The biggest gap I see is dev_deactivate_many() is supposed
> to wait for all tx_action calls to complete, this bit:

In some_qdisc_is_busy() you hold the qdisc spinlock for the non-lockless
ones; I don't understand why you don't want it in dev_qdisc_reset().


* Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
  2017-12-22 20:31     ` Cong Wang
@ 2017-12-24  6:57       ` John Fastabend
  2017-12-27 18:29         ` Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: John Fastabend @ 2017-12-24  6:57 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Jakub Kicinski

On 12/22/2017 12:31 PM, Cong Wang wrote:
> On Thu, Dec 21, 2017 at 7:06 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 12/21/2017 04:03 PM, Cong Wang wrote:
>>> __skb_array_empty() is only safe if the array is never resized.
>>> pfifo_fast_dequeue() is called in TX BH context and without the
>>> qdisc lock, so even after we disable BH on the ->reset() path
>>> we can still race with other CPUs.
>>>
>>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>>> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>>> ---
>>>  net/sched/sch_generic.c | 3 ---
>>>  1 file changed, 3 deletions(-)
>>>
>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>> index 00ddb5f8f430..9279258ce060 100644
>>> --- a/net/sched/sch_generic.c
>>> +++ b/net/sched/sch_generic.c
>>> @@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>       for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>               struct skb_array *q = band2list(priv, band);
>>>
>>> -             if (__skb_array_empty(q))
>>> -                     continue;
>>> -
>>>               skb = skb_array_consume_bh(q);
>>>       }
>>>       if (likely(skb)) {
>>>
>>
>>
>> So this is a performance thing: we don't want to grab the consumer lock on
>> empty bands, which can be fairly common depending on traffic patterns.
> 
> 
>> I understand why you had it, but it is just not safe. You don't want
>> to achieve a performance gain by crashing the system, right?

Huh? So my point is that the patch you submitted here is not a
real fix but a workaround. Peeking at the head of a consumer/producer ring
without a lock _should_ be fine. This _should_ work as well with
consumer or producer operations happening at the same time. After some
digging, the issue is in the ptr_ring code.

The peek code (what empty check calls) is the following,

static inline void *__ptr_ring_peek(struct ptr_ring *r)
{
        if (likely(r->size))
                return r->queue[r->consumer_head];
        return NULL;
}

So what the splat is detecting is the consumer head being 'out of bounds'.
This happens because ptr_ring_discard_one increments the consumer_head
and then checks to see if it overran the array size. If the above peek
happens after the increment but before the size check, we get the
splat. There are two ways, as far as I can see, to fix this. First,
do the check before incrementing the consumer head. Or the easier
fix,

--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct
ptr_ring *r,

 static inline void **__ptr_ring_init_queue_alloc(unsigned int size,
gfp_t gfp)
 {
-       return kcalloc(size, sizeof(void *), gfp);
+       return kcalloc(size + 1, sizeof(void *), gfp);
 }

With Jakub's help (Thanks!) I was able to reproduce the original splat
and also verify the above removes it.

To be clear "resizing" a skb_array only refers to changing the actual
array size not adding/removing elements.
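
So the bad interleaving is roughly (sketch of the logic, not the literal
ptr_ring code):

    CPU0: __ptr_ring_discard_one()      CPU1: __ptr_ring_peek()

    r->queue[r->consumer_head++] = NULL;
                                        return r->queue[r->consumer_head];
                                        /* consumer_head == r->size here:
                                         * one slot past the allocation,
                                         * which is what KASAN flags
                                         */
    if (unlikely(r->consumer_head >= r->size))
            r->consumer_head = 0;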

> 
>>
>> Although it's not logical IMO to have both reset and dequeue running at
>> the same time. Some skbs would get through, others would get sent, sort
>> of a mess. I don't see how it can be an issue. The "never resized" bit
>> in the documentation is referring to resizing the ring size, _not_ popping
>> elements off the ring. array_empty just reads the consumer head.
>> The only ring resizing in pfifo_fast should be at init and destroy, where
>> enqueue/dequeue should be disconnected by then. Although based on the
>> trace I missed a case.
> 
> 
> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
> skb_array_consume_bh(), so there is no difference w.r.t. resizing.
> 

Sorry not following.

> And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
> concurrently.

Yes, and this _should_ be perfectly fine for pfifo_fast. I'm wondering,
though, if this API can be cleaned up. What are the paths that do a reset
without a destroy? Do we really need to have this pattern where reset
is called and destroy comes later? Seems destroy could do the entire cleanup
and this would simplify things. None of this has to do with the splat
though.

> 
> 
>>
>> I think the right fix is to only call reset/destroy patterns after
>> waiting a grace period and for all tx_action calls in-flight to
>> complete. This is also better going forward for more complex qdiscs.
> 
> But we don't even have rcu read lock in TX BH, do we?
> 
> Also, people certainly don't like yet another synchronize_net()...
> 

This needs a fix and is a _real_ bug, but removing __skb_array_empty()
doesn't help solve this at all. Will work on a fix after the holiday
break. The fix here is to ensure the destroy is not going to happen
while tx_action is in-flight. Can be done with qdisc_run and checking
correct bits in lockless case.

.John


* Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
  2017-12-24  6:57       ` John Fastabend
@ 2017-12-27 18:29         ` Cong Wang
  2017-12-27 23:53           ` John Fastabend
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2017-12-27 18:29 UTC (permalink / raw)
  To: John Fastabend; +Cc: Linux Kernel Network Developers, Jakub Kicinski

On Sat, Dec 23, 2017 at 10:57 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 12/22/2017 12:31 PM, Cong Wang wrote:
>> I understand why you had it, but it is just not safe. You don't want
>> to achieve a performance gain by crashing the system, right?
>
> Huh? So my point is that the patch you submitted here is not a
> real fix but a workaround. Peeking at the head of a consumer/producer ring
> without a lock _should_ be fine. This _should_ work as well with
> consumer or producer operations happening at the same time. After some
> digging, the issue is in the ptr_ring code.


The comments disagree with you:

/* Might be slightly faster than skb_array_empty below, but only safe if the
 * array is never resized. Also, callers invoking this in a loop must take care
 * to use a compiler barrier, for example cpu_relax().
 */

If the comments are right, you are missing a barrier here too.
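
For reference, the helper is just a naked read through __ptr_ring_peek()
(from my reading of skb_array.h at the time, so double-check the tree):

static inline bool __skb_array_empty(struct skb_array *a)
{
	return !__ptr_ring_peek(&a->ring);
}

No lock, no barrier; it simply loads queue[consumer_head].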


>
> The peek code (what empty check calls) is the following,
>
> static inline void *__ptr_ring_peek(struct ptr_ring *r)
> {
>         if (likely(r->size))
>                 return r->queue[r->consumer_head];
>         return NULL;
> }
>
> So what the splat is detecting is the consumer head being 'out of bounds'.
> This happens because ptr_ring_discard_one increments the consumer_head
> and then checks to see if it overran the array size. If the above peek
> happens after the increment but before the size check, we get the
> splat. There are two ways, as far as I can see, to fix this. First,
> do the check before incrementing the consumer head. Or the easier
> fix,
>
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct
> ptr_ring *r,
>
>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size,
> gfp_t gfp)
>  {
> -       return kcalloc(size, sizeof(void *), gfp);
> +       return kcalloc(size + 1, sizeof(void *), gfp);
>  }
>
> With Jakub's help (Thanks!) I was able to reproduce the original splat
> and also verify the above removes it.
>
> To be clear "resizing" a skb_array only refers to changing the actual
> array size not adding/removing elements.

I never looked into the implementation, I just trusted the comments.

At least the comments above __skb_array_empty() need to be improved.


>
>>
>>>
>>>> Although it's not logical IMO to have both reset and dequeue running at
>>>> the same time. Some skbs would get through, others would get sent, sort
>>>> of a mess. I don't see how it can be an issue. The "never resized" bit
>>>> in the documentation is referring to resizing the ring size, _not_ popping
>>>> elements off the ring. array_empty just reads the consumer head.
>>>> The only ring resizing in pfifo_fast should be at init and destroy, where
>>>> enqueue/dequeue should be disconnected by then. Although based on the
>>>> trace I missed a case.
>>
>>
>> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
>> skb_array_consume_bh(), so there is no difference w.r.t. resizing.
>>
>
> Sorry not following.
>
>> And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
>> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
>> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
>> concurrently.
>
> Yes, and this _should_ be perfectly fine for pfifo_fast. I'm wondering,
> though, if this API can be cleaned up. What are the paths that do a reset
> without a destroy? Do we really need to have this pattern where reset
> is called and destroy comes later? Seems destroy could do the entire cleanup
> and this would simplify things. None of this has to do with the splat
> though.

I don't follow your point any more.

We are talking about ->reset() racing with ->dequeue(), which is the
cause of the bug, right?

If you expect ->reset() to run in parallel with ->dequeue() for pfifo_fast,
why did you even mention synchronize_net() from the beginning???
Also, you changed the code too, to adjust the rcu grace period.


>
>>
>>
>>>
>>> I think the right fix is to only call reset/destroy patterns after
>>> waiting a grace period and for all tx_action calls in-flight to
>>> complete. This is also better going forward for more complex qdiscs.
>>
>> But we don't even have rcu read lock in TX BH, do we?
>>
>> Also, people certainly don't like yet another synchronize_net()...
>>
>
> This needs a fix and is a _real_ bug, but removing __skb_array_empty()
> doesn't help solve this at all. Will work on a fix after the holiday
> break. The fix here is to ensure the destroy is not going to happen
> while tx_action is in-flight. Can be done with qdisc_run and checking
> correct bits in lockless case.

Sounds like you missed a lot of things with your "lockless" patches....
First the qdisc rcu callback, second the rcu read lock in TX BH...

My quick one-line fix is to amend this bug before you go deeper
into this rabbit hole.


* Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
  2017-12-27 18:29         ` Cong Wang
@ 2017-12-27 23:53           ` John Fastabend
  0 siblings, 0 replies; 12+ messages in thread
From: John Fastabend @ 2017-12-27 23:53 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Jakub Kicinski

On 12/27/2017 10:29 AM, Cong Wang wrote:
> On Sat, Dec 23, 2017 at 10:57 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 12/22/2017 12:31 PM, Cong Wang wrote:
>>> I understand why you had it, but it is just not safe. You don't want
>>> to achieve a performance gain by crashing the system, right?
>>
>> Huh? So my point is that the patch you submitted here is not a
>> real fix but a workaround. Peeking at the head of a consumer/producer ring
>> without a lock _should_ be fine. This _should_ work as well with
>> consumer or producer operations happening at the same time. After some
>> digging, the issue is in the ptr_ring code.
> 
> 
> The comments disagree with you:
> 
> /* Might be slightly faster than skb_array_empty below, but only safe if the
>  * array is never resized. Also, callers invoking this in a loop must take care
>  * to use a compiler barrier, for example cpu_relax().
>  */
> 
> If the comments are right, you are missing a barrier here too.
> 

The comment talks about resizing arrays, not consumer/producer operations,
so I think it is OK. We don't call skb_array_empty on the same skb_array
in a loop here; it's different skb arrays. So we probably don't need a barrier.

> 
>>
>> The peek code (what empty check calls) is the following,
>>
>> static inline void *__ptr_ring_peek(struct ptr_ring *r)
>> {
>>         if (likely(r->size))
>>                 return r->queue[r->consumer_head];
>>         return NULL;
>> }
>>
>> So what the splat is detecting is the consumer head being 'out of bounds'.
>> This happens because ptr_ring_discard_one increments the consumer_head
>> and then checks to see if it overran the array size. If the above peek
>> happens after the increment but before the size check, we get the
>> splat. There are two ways, as far as I can see, to fix this. First,
>> do the check before incrementing the consumer head. Or the easier
>> fix,
>>
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct
>> ptr_ring *r,
>>
>>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size,
>> gfp_t gfp)
>>  {
>> -       return kcalloc(size, sizeof(void *), gfp);
>> +       return kcalloc(size + 1, sizeof(void *), gfp);
>>  }
>>
>> With Jakub's help (Thanks!) I was able to reproduce the original splat
>> and also verify the above removes it.
>>
>> To be clear "resizing" a skb_array only refers to changing the actual
>> array size not adding/removing elements.
> 
> I never looked into the implementation, I just trusted the comments.
> 
> At least the comments above __skb_array_empty() need to be improved.
> > 
>>
>>>
>>>>
>>>> Although it's not logical IMO to have both reset and dequeue running at
>>>> the same time. Some skbs would get through, others would get sent, sort
>>>> of a mess. I don't see how it can be an issue. The "never resized" bit
>>>> in the documentation is referring to resizing the ring size, _not_ popping
>>>> elements off the ring. array_empty just reads the consumer head.
>>>> The only ring resizing in pfifo_fast should be at init and destroy, where
>>>> enqueue/dequeue should be disconnected by then. Although based on the
>>>> trace I missed a case.
>>>
>>>
>>> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
>>> skb_array_consume_bh(), so there is no difference w.r.t. resizing.
>>>
>>
>> Sorry not following.
>>
>>> And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
>>> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
>>> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
>>> concurrently.
>>
>> Yes, and this _should_ be perfectly fine for pfifo_fast. I'm wondering,
>> though, if this API can be cleaned up. What are the paths that do a reset
>> without a destroy? Do we really need to have this pattern where reset
>> is called and destroy comes later? Seems destroy could do the entire cleanup
>> and this would simplify things. None of this has to do with the splat
>> though.
> 
> I don't follow your point any more.
> 
> We are talking about ->reset() racing with ->dequeue(), which is the
> cause of the bug, right?
> 

Yes. But it should be OK for qdiscs to have reset and dequeue run
in parallel. So the bug is in ptr_ring.

> If you expect ->reset() to run in parallel with ->dequeue() for pfifo_fast,
> why did you even mention synchronize_net() from the beginning???

A mistake probably; I am tracking two separate issues: first, this one
with the skb_array out-of-bounds access, and second, tx_action being called
while destroy is in progress. When I was first looking at this I thought
the splat was from the tx_action issue, so I probably started talking
about how to sync the destroy and tx_action paths. Either way, it was
a mistake to mention it in this thread.

> Also, you changed the code too, to adjust the rcu grace period.
> 
> 
>>
>>>
>>>
>>>>
>>>> I think the right fix is to only call reset/destroy patterns after
>>>> waiting a grace period and for all tx_action calls in-flight to
>>>> complete. This is also better going forward for more complex qdiscs.
>>>
>>> But we don't even have rcu read lock in TX BH, do we?
>>>
>>> Also, people certainly don't like yet another synchronize_net()...
>>>
>>
>> This needs a fix and is a _real_ bug, but removing __skb_array_empty()
>> doesn't help solve this at all. Will work on a fix after the holiday
>> break. The fix here is to ensure the destroy is not going to happen
>> while tx_action is in-flight. Can be done with qdisc_run and checking
>> correct bits in lockless case.
> 
> Sounds like you missed a lot of things with your "lockless" patches....
> First the qdisc rcu callback, second the rcu read lock in TX BH...

A couple things, yes: the TX BH, the ptr_ring bug, and the skb list free fixed in
another patch. I assume the qdisc rcu callbacks you talk about are
the TX BH issue, or something else?

The other possible bug, which still needs analysis, is whether we need to
add back the qdisc destroy rcu callback, or whether there is a grace period
in all cases. My guess is the current code paths in pfifo_fast don't have
an issue but future work will.
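
If it turns out we do need it, the shape would be something like the
following (sketch only; it assumes an rcu_head field in struct Qdisc and
a qdisc_free() helper that does the actual kfree(), neither of which I
have verified against the current tree):

static void qdisc_free_rcu(struct rcu_head *head)
{
	struct Qdisc *q = container_of(head, struct Qdisc, rcu_head);

	qdisc_free(q);	/* the final free, one grace period later */
}

	/* then in qdisc_destroy(), instead of freeing synchronously: */
	call_rcu(&q->rcu_head, qdisc_free_rcu);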

> 
> My quick one-line fix is to amend this bug before you go deeper
> into this rabbit hole.
> 

Nope, it's not a good fix, and we are in the net-next timeframe here, so let's
fix the root cause in ptr_ring. Will post a patch in the next couple of
days after PTO. It's a minor issue and can wait.

.John


* Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
  2017-12-22  0:03 ` [Patch net-next] net_sched: remove the unsafe __skb_array_empty() Cong Wang
  2017-12-22  3:06   ` John Fastabend
@ 2018-01-02 16:37   ` David Miller
  1 sibling, 0 replies; 12+ messages in thread
From: David Miller @ 2018-01-02 16:37 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, jakub.kicinski, john.fastabend

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu, 21 Dec 2017 16:03:30 -0800

> __skb_array_empty() is only safe if the array is never resized.
> pfifo_fast_dequeue() is called in TX BH context and without the
> qdisc lock, so even after we disable BH on the ->reset() path
> we can still race with other CPUs.
> 
> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Based upon the discussion over this patch, this bug is ultimately fixed
by John's patch which adds a dummy element at the end of allocated
ptr_ring queues.

And I've just applied that.


* Re: [Patch net-next] net_sched: call qdisc_reset() with qdisc lock
  2017-12-22  0:03 [Patch net-next] net_sched: call qdisc_reset() with qdisc lock Cong Wang
  2017-12-22  0:03 ` [Patch net-next] net_sched: remove the unsafe __skb_array_empty() Cong Wang
  2017-12-22  3:36 ` [Patch net-next] net_sched: call qdisc_reset() with qdisc lock John Fastabend
@ 2018-01-02 16:39 ` David Miller
  2 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2018-01-02 16:39 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, jakub.kicinski, john.fastabend

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu, 21 Dec 2017 16:03:29 -0800

> qdisc_reset() should always be called with the qdisc spinlock
> held and with BH disabled, otherwise qdisc ->reset() could race
> with the TX BH.
> 
> Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

There doesn't seem to be agreement over whether qdisc_reset() really
needs to run with the lock held.  In fact, the general consensus is
that this really shouldn't run until a grace period has occurred and
therefore parallel TX paths cannot be running any longer.

In any event, this was supposed to work towards fixing a bug which
ultimately was fixed instead with the ptr_ring change from John.

So this seems unnecessary now.

If you disagree, please repost with an updated commit message which
explains things further.

Thank you.

