netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Softirq priority inversion from "softirq: reduce latencies"
@ 2016-02-27 18:19 Peter Hurley
  2016-02-27 20:13 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Peter Hurley @ 2016-02-27 18:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, linux-kernel, Greg KH, dmaengine,
	John Ogness, Sebastian Andrzej Siewior, Andrew Morton

Hi Eric,

For a while now, we've been struggling to understand why we've been
observing missed uart rx DMA.

Because both the uart driver (omap8250) and the dmaengine driver
(edma) were (relatively) new, we assumed there was some race between
starting a new rx DMA and processing the previous one.

However, after instrumenting both the uart driver and the dmaengine
driver, what we've observed is huge anomalous latencies between receiving
the DMA interrupt and servicing the DMA tasklet.

For example, at 3Mbaud we recorded the following distribution of
softirq[TASKLET] service latency for this specific DMA channel:

root@black:/sys/kernel/debug/edma# cat 35
latency(us):   0+   20+   40+   60+   80+  100+  120+  140+  160+  180+  200+  220+  240+  260+  280+  300+  320+  340+  360+  380+
           195681    33    53    15     7     4     3     1     0     0     0     1     4     6     1     0     0     0     0     0

As you can see, the vast majority of tasklet service happens immediately,
tapering off to 140+us.

However, note the island of distribution at 220~300 [latencies beyond 300+
are not recorded because the uart fifo has filled again by this point and
dma must be aborted].

So I cribbed together a latency tracer to catch what was happening at
the extreme, and what it caught was a priority inversion stemming from
your commit:

   commit c10d73671ad30f54692f7f69f0e09e75d3a8926a
   Author: Eric Dumazet <edumazet@google.com>
   Date:   Thu Jan 10 15:26:34 2013 -0800

       softirq: reduce latencies
    
       In various network workloads, __do_softirq() latencies can be up
       to 20 ms if HZ=1000, and 200 ms if HZ=100.
    
       This is because we iterate 10 times in the softirq dispatcher,
       and some actions can consume a lot of cycles.


In the trace below [1], the trace begins in the edma completion interrupt
handler when the tasklet is scheduled; the edma interrupt has occurred during
NET_RX softirq context.

However, instead of causing a restart of the softirq loop to process the
tasklet _which occurred before sshd was scheduled_, the softirq loop is
aborted and deferred for ksoftirqd. The tasklet is not serviced for 521us,
which is way too long, so DMA was aborted.

Your patch has effectively inverted the priority of tasklets with normal
pri/nice processes that have merely received a network packet.

ISTM, the problem you're trying to solve here was caused by NET_RX softirq
to begin with, and maybe that thing needs a diet.

But rather than outright reverting your patch, what if more selective
conditions are used to abort the softirq restart? What would those conditions
be? In the netperf benchmark you referred to in that commit, is it just
NET_TX/NET_RX softirqs that are causing scheduling latencies?

It just doesn't make sense to special case for a workload that isn't
even running.


Regards,
Peter Hurley


[1] softirq tasklet latency trace  (apologies that it's only events - full
    function trace introduces too much overhead)

# tracer: latency
#
# latency latency trace v1.1.5 on 4.5.0-rc2+
# --------------------------------------------------------------------
# latency: 476 us, #59/59, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0)
#    -----------------
#    | task: sshd-750 (uid:1000 nice:0 policy:0 rt_prio:0)
#    -----------------
#  => started at: __tasklet_schedule  
#  => ended at:   tasklet_action
#
#
#                  _------=> CPU#
#                 / _-----=> irqs-off 
#                | / _----=> need-resched
#                || / _---=> hardirq/softirq
#                ||| / _--=> preempt-depth
#                |||| /     delay
#  cmd     pid   ||||| time  |   caller
#     \   /      |||||  \    |   /
  <idle>-0       0d.H3    1us : __tasklet_schedule
  <idle>-0       0d.H4    3us : softirq_raise: vec=6 [action=TASKLET]
  <idle>-0       0d.H3    6us : irq_handler_exit: irq=20 ret=handled
  <idle>-0       0..s2   15us : kmem_cache_alloc: call_site=c08378e4 ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC
  <idle>-0       0..s2   23us : netif_receive_skb_entry: dev=eth0 napi_id=0x0 queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x000
0 protocol=0x0800 ip_summed=0 hash=0x00000000 l4_hash=0 len=88 data_len=0 truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0x0
  <idle>-0       0..s2   30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 len=88
  <idle>-0       0d.s5   98us : sched_waking: comm=sshd pid=750 prio=120 target_cpu=000
  <idle>-0       0d.s6  105us : sched_stat_sleep: comm=sshd pid=750 delay=3125230447 [ns]
  <idle>-0       0dns6  110us+: sched_wakeup: comm=sshd pid=750 prio=120 target_cpu=000
  <idle>-0       0dns4  123us+: timer_start: timer=dc940e9c function=tcp_delack_timer expires=9746 [timeout=10] flags=0x00000000
  <idle>-0       0dnH3  150us : irq_handler_entry: irq=176 name=4a100000.ethernet
  <idle>-0       0dnH3  153us : softirq_raise: vec=3 [action=NET_RX]
  <idle>-0       0dnH3  155us : irq_handler_exit: irq=176 ret=handled
  <idle>-0       0dnH3  160us : irq_handler_entry: irq=20 name=49000000.edma_ccint
  <idle>-0       0dnH3  163us : irq_handler_exit: irq=20 ret=handled
  <idle>-0       0.ns2  169us : napi_poll: napi poll on napi struct de465c30 for device eth0
  <idle>-0       0.ns2  171us : softirq_exit: vec=3 [action=NET_RX]
  <idle>-0       0dns3  175us : sched_waking: comm=ksoftirqd/0 pid=3 prio=120 target_cpu=000
  <idle>-0       0dns4  178us : sched_stat_sleep: comm=ksoftirqd/0 pid=3 delay=19371625 [ns]
  <idle>-0       0dns4  180us : sched_wakeup: comm=ksoftirqd/0 pid=3 prio=120 target_cpu=000
  <idle>-0       0.n.3  186us : gpio_value: 55 set 1
  <idle>-0       0dn.3  191us : hrtimer_cancel: hrtimer=c0e2af80
  <idle>-0       0dn.3  196us : hrtimer_start: hrtimer=c0e2af80 function=tick_sched_timer expires=338948000000 softexpires=338948000000
  <idle>-0       0dn.2  199us : rcu_utilization: Start context switch
  <idle>-0       0dn.2  202us : rcu_utilization: End context switch
  <idle>-0       0dn.3  206us : sched_stat_wait: comm=sshd pid=750 delay=74042 [ns]
  <idle>-0       0d..3  209us+: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=sshd next_pid=750 next_prio=120
    sshd-750     0...1  252us : sys_exit: NR 142 = 1
    sshd-750     0...1  257us+: sys_select -> 0x1
    sshd-750     0...1  273us : sys_enter: NR 175 (0, be9cc238, be9cc1b8, 8, be9cc1b8, 0)
    sshd-750     0...1  276us : sys_rt_sigprocmask(how: 0, nset: be9cc238, oset: be9cc1b8, sigsetsize: 8)
    sshd-750     0...1  285us : sys_exit: NR 175 = 0
    sshd-750     0...1  287us : sys_rt_sigprocmask -> 0x0
    sshd-750     0...1  289us : sys_enter: NR 175 (2, be9cc1b8, 0, 8, 0, 2)
    sshd-750     0...1  290us : sys_rt_sigprocmask(how: 2, nset: be9cc1b8, oset: 0, sigsetsize: 8)
    sshd-750     0...1  292us : sys_exit: NR 175 = 0
    sshd-750     0...1  293us : sys_rt_sigprocmask -> 0x0
    sshd-750     0...1  299us : sys_enter: NR 263 (1, be9cc270, 801b25e8, 0, 7f60e3b0, 7f60ee7c)
    sshd-750     0...1  301us : sys_clock_gettime(which_clock: 1, tp: be9cc270)
    sshd-750     0...1  307us : sys_exit: NR 263 = 0
    sshd-750     0d.h2  313us+: irq_handler_entry: irq=160 name=481aa000.serial
    sshd-750     0d.h3  327us : omap_8250_rx_dma_flush: 481aa000.serial: ret:22 iir:0xc4 seq:19389 last:19388 elapsed:446292
    sshd-750     0d.h3  335us : __dma_rx_do_complete: 481aa000.serial: cmplt: iir=c4 count=48
    sshd-750     0d.h4  341us : workqueue_queue_work: work struct=de41239c function=flush_to_ldisc workqueue=de1bef00 req_cpu=1 cpu=4294967295
    sshd-750     0d.h4  343us : workqueue_activate_work: work struct de41239c
    sshd-750     0d.h5  346us : sched_waking: comm=kworker/u2:2 pid=763 prio=120 target_cpu=000
    sshd-750     0d.h6  350us : sched_stat_sleep: comm=kworker/u2:2 pid=763 delay=442250 [ns]
    sshd-750     0d.h6  359us : sched_wakeup: comm=kworker/u2:2 pid=763 prio=120 target_cpu=000
    sshd-750     0d.h3  366us : kfree: call_site=c05a88f0 ptr=de6ba940
    sshd-750     0d.h3  367us+: kfree: call_site=c05a88f0 ptr=de6f5640
    sshd-750     0d.h4  445us : kmalloc: call_site=c05aa6d0 ptr=de6f5640 bytes_req=132 bytes_alloc=192 gfp_flags=GFP_ATOMIC|GFP_ZERO
    sshd-750     0d.h2  455us : irq_handler_exit: irq=160 ret=handled
    sshd-750     0..s2  460us : softirq_entry: vec=3 [action=NET_RX]
    sshd-750     0..s2  469us : napi_poll: napi poll on napi struct de465c30 for device eth0
    sshd-750     0..s2  470us : softirq_exit: vec=3 [action=NET_RX]
    sshd-750     0..s2  472us : softirq_entry: vec=6 [action=TASKLET]
    sshd-750     0..s1  475us : tasklet_action
    sshd-750     0d.s1  478us+: tasklet_action <-tasklet_action
    sshd-750     0d.s1  521us : <stack trace>
 => irq_exit
 => __handle_domain_irq
 => omap_intc_handle_irq
 => __irq_svc
 => ftrace_syscall_exit
 => ftrace_syscall_exit
 => syscall_trace_exit
 => __sys_trace_return

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-27 18:19 Softirq priority inversion from "softirq: reduce latencies" Peter Hurley
@ 2016-02-27 20:13 ` Eric Dumazet
  2016-02-27 20:29   ` Peter Hurley
  2016-02-28  5:55 ` Mike Galbraith
  2016-03-07 15:48 ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-02-27 20:13 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Eric Dumazet, David Miller, netdev, linux-kernel, Greg KH,
	dmaengine, John Ogness, Sebastian Andrzej Siewior, Andrew Morton

On sam., 2016-02-27 at 10:19 -0800, Peter Hurley wrote:
> Hi Eric,
> 
> For a while now, we've been struggling to understand why we've been
> observing missed uart rx DMA.
> 
> Because both the uart driver (omap8250) and the dmaengine driver
> (edma) were (relatively) new, we assumed there was some race between
> starting a new rx DMA and processing the previous one.
> 
> However, after instrumenting both the uart driver and the dmaengine
> driver, what we've observed is huge anomalous latencies between receiving
> the DMA interrupt and servicing the DMA tasklet.
> 
> For example, at 3Mbaud we recorded the following distribution of
> softirq[TASKLET] service latency for this specific DMA channel:
> 
> root@black:/sys/kernel/debug/edma# cat 35
> latency(us):   0+   20+   40+   60+   80+  100+  120+  140+  160+  180+  200+  220+  240+  260+  280+  300+  320+  340+  360+  380+
>            195681    33    53    15     7     4     3     1     0     0     0     1     4     6     1     0     0     0     0     0
> 
> As you can see, the vast majority of tasklet service happens immediately,
> tapering off to 140+us.
> 
> However, note the island of distribution at 220~300 [latencies beyond 300+
> are not recorded because the uart fifo has filled again by this point and
> dma must be aborted].
> 
> So I cribbed together a latency tracer to catch what was happening at
> the extreme, and what it caught was a priority inversion stemming from
> your commit:
> 
>    commit c10d73671ad30f54692f7f69f0e09e75d3a8926a
>    Author: Eric Dumazet <edumazet@google.com>
>    Date:   Thu Jan 10 15:26:34 2013 -0800
> 
>        softirq: reduce latencies
>     
>        In various network workloads, __do_softirq() latencies can be up
>        to 20 ms if HZ=1000, and 200 ms if HZ=100.
>     
>        This is because we iterate 10 times in the softirq dispatcher,
>        and some actions can consume a lot of cycles.
> 
> 
> In the trace below [1], the trace begins in the edma completion interrupt
> handler when the tasklet is scheduled; the edma interrupt has occurred during
> NET_RX softirq context.
> 
> However, instead of causing a restart of the softirq loop to process the
> tasklet _which occurred before sshd was scheduled_, the softirq loop is
> aborted and deferred for ksoftirqd. The tasklet is not serviced for 521us,
> which is way too long, so DMA was aborted.
> 
> Your patch has effectively inverted the priority of tasklets with normal
> pri/nice processes that have merely received a network packet.
> 
> ISTM, the problem you're trying to solve here was caused by NET_RX softirq
> to begin with, and maybe that thing needs a diet.
> 
> But rather than outright reverting your patch, what if more selective
> conditions are used to abort the softirq restart? What would those conditions
> be? In the netperf benchmark you referred to in that commit, is it just
> NET_TX/NET_RX softirqs that are causing scheduling latencies?
> 
> It just doesn't make sense to special case for a workload that isn't
> even running.
> 
> 
> Regards,
> Peter Hurley
> 
> 
> [1] softirq tasklet latency trace  (apologies that it's only events - full
>     function trace introduces too much overhead)
> 
> # tracer: latency
> #
> # latency latency trace v1.1.5 on 4.5.0-rc2+
> # --------------------------------------------------------------------
> # latency: 476 us, #59/59, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0)
> #    -----------------
> #    | task: sshd-750 (uid:1000 nice:0 policy:0 rt_prio:0)
> #    -----------------
> #  => started at: __tasklet_schedule  
> #  => ended at:   tasklet_action
> #
> #
> #                  _------=> CPU#
> #                 / _-----=> irqs-off 
> #                | / _----=> need-resched
> #                || / _---=> hardirq/softirq
> #                ||| / _--=> preempt-depth
> #                |||| /     delay
> #  cmd     pid   ||||| time  |   caller
> #     \   /      |||||  \    |   /
>   <idle>-0       0d.H3    1us : __tasklet_schedule
>   <idle>-0       0d.H4    3us : softirq_raise: vec=6 [action=TASKLET]
>   <idle>-0       0d.H3    6us : irq_handler_exit: irq=20 ret=handled
>   <idle>-0       0..s2   15us : kmem_cache_alloc: call_site=c08378e4 ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC
>   <idle>-0       0..s2   23us : netif_receive_skb_entry: dev=eth0 napi_id=0x0 queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x000
> 0 protocol=0x0800 ip_summed=0 hash=0x00000000 l4_hash=0 len=88 data_len=0 truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0x0
>   <idle>-0       0..s2   30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 len=88
>   <idle>-0       0d.s5   98us : sched_waking: comm=sshd pid=750 prio=120 target_cpu=000
>   <idle>-0       0d.s6  105us : sched_stat_sleep: comm=sshd pid=750 delay=3125230447 [ns]
>   <idle>-0       0dns6  110us+: sched_wakeup: comm=sshd pid=750 prio=120 target_cpu=000
>   <idle>-0       0dns4  123us+: timer_start: timer=dc940e9c function=tcp_delack_timer expires=9746 [timeout=10] flags=0x00000000
>   <idle>-0       0dnH3  150us : irq_handler_entry: irq=176 name=4a100000.ethernet
>   <idle>-0       0dnH3  153us : softirq_raise: vec=3 [action=NET_RX]
>   <idle>-0       0dnH3  155us : irq_handler_exit: irq=176 ret=handled
>   <idle>-0       0dnH3  160us : irq_handler_entry: irq=20 name=49000000.edma_ccint
>   <idle>-0       0dnH3  163us : irq_handler_exit: irq=20 ret=handled
>   <idle>-0       0.ns2  169us : napi_poll: napi poll on napi struct de465c30 for device eth0
>   <idle>-0       0.ns2  171us : softirq_exit: vec=3 [action=NET_RX]
>   <idle>-0       0dns3  175us : sched_waking: comm=ksoftirqd/0 pid=3 prio=120 target_cpu=000
>   <idle>-0       0dns4  178us : sched_stat_sleep: comm=ksoftirqd/0 pid=3 delay=19371625 [ns]
>   <idle>-0       0dns4  180us : sched_wakeup: comm=ksoftirqd/0 pid=3 prio=120 target_cpu=000
>   <idle>-0       0.n.3  186us : gpio_value: 55 set 1
>   <idle>-0       0dn.3  191us : hrtimer_cancel: hrtimer=c0e2af80
>   <idle>-0       0dn.3  196us : hrtimer_start: hrtimer=c0e2af80 function=tick_sched_timer expires=338948000000 softexpires=338948000000
>   <idle>-0       0dn.2  199us : rcu_utilization: Start context switch
>   <idle>-0       0dn.2  202us : rcu_utilization: End context switch
>   <idle>-0       0dn.3  206us : sched_stat_wait: comm=sshd pid=750 delay=74042 [ns]
>   <idle>-0       0d..3  209us+: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=sshd next_pid=750 next_prio=120
>     sshd-750     0...1  252us : sys_exit: NR 142 = 1
>     sshd-750     0...1  257us+: sys_select -> 0x1
>     sshd-750     0...1  273us : sys_enter: NR 175 (0, be9cc238, be9cc1b8, 8, be9cc1b8, 0)
>     sshd-750     0...1  276us : sys_rt_sigprocmask(how: 0, nset: be9cc238, oset: be9cc1b8, sigsetsize: 8)
>     sshd-750     0...1  285us : sys_exit: NR 175 = 0
>     sshd-750     0...1  287us : sys_rt_sigprocmask -> 0x0
>     sshd-750     0...1  289us : sys_enter: NR 175 (2, be9cc1b8, 0, 8, 0, 2)
>     sshd-750     0...1  290us : sys_rt_sigprocmask(how: 2, nset: be9cc1b8, oset: 0, sigsetsize: 8)
>     sshd-750     0...1  292us : sys_exit: NR 175 = 0
>     sshd-750     0...1  293us : sys_rt_sigprocmask -> 0x0
>     sshd-750     0...1  299us : sys_enter: NR 263 (1, be9cc270, 801b25e8, 0, 7f60e3b0, 7f60ee7c)
>     sshd-750     0...1  301us : sys_clock_gettime(which_clock: 1, tp: be9cc270)
>     sshd-750     0...1  307us : sys_exit: NR 263 = 0
>     sshd-750     0d.h2  313us+: irq_handler_entry: irq=160 name=481aa000.serial
>     sshd-750     0d.h3  327us : omap_8250_rx_dma_flush: 481aa000.serial: ret:22 iir:0xc4 seq:19389 last:19388 elapsed:446292
>     sshd-750     0d.h3  335us : __dma_rx_do_complete: 481aa000.serial: cmplt: iir=c4 count=48
>     sshd-750     0d.h4  341us : workqueue_queue_work: work struct=de41239c function=flush_to_ldisc workqueue=de1bef00 req_cpu=1 cpu=4294967295
>     sshd-750     0d.h4  343us : workqueue_activate_work: work struct de41239c
>     sshd-750     0d.h5  346us : sched_waking: comm=kworker/u2:2 pid=763 prio=120 target_cpu=000
>     sshd-750     0d.h6  350us : sched_stat_sleep: comm=kworker/u2:2 pid=763 delay=442250 [ns]
>     sshd-750     0d.h6  359us : sched_wakeup: comm=kworker/u2:2 pid=763 prio=120 target_cpu=000
>     sshd-750     0d.h3  366us : kfree: call_site=c05a88f0 ptr=de6ba940
>     sshd-750     0d.h3  367us+: kfree: call_site=c05a88f0 ptr=de6f5640
>     sshd-750     0d.h4  445us : kmalloc: call_site=c05aa6d0 ptr=de6f5640 bytes_req=132 bytes_alloc=192 gfp_flags=GFP_ATOMIC|GFP_ZERO
>     sshd-750     0d.h2  455us : irq_handler_exit: irq=160 ret=handled
>     sshd-750     0..s2  460us : softirq_entry: vec=3 [action=NET_RX]
>     sshd-750     0..s2  469us : napi_poll: napi poll on napi struct de465c30 for device eth0
>     sshd-750     0..s2  470us : softirq_exit: vec=3 [action=NET_RX]
>     sshd-750     0..s2  472us : softirq_entry: vec=6 [action=TASKLET]
>     sshd-750     0..s1  475us : tasklet_action
>     sshd-750     0d.s1  478us+: tasklet_action <-tasklet_action
>     sshd-750     0d.s1  521us : <stack trace>
>  => irq_exit
>  => __handle_domain_irq
>  => omap_intc_handle_irq
>  => __irq_svc
>  => ftrace_syscall_exit
>  => ftrace_syscall_exit
>  => syscall_trace_exit
>  => __sys_trace_return


Nothing in the linux kernel will make sure softirq are always processed
under the higher prio.

You are pointing out my patch but the real problem for you is having
ksoftirqd in the first place.

You need RT variant I guess, or a better hardware.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-27 20:13 ` Eric Dumazet
@ 2016-02-27 20:29   ` Peter Hurley
  2016-02-27 23:04     ` David Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Hurley @ 2016-02-27 20:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David Miller, netdev, linux-kernel, Greg KH,
	dmaengine, John Ogness, Sebastian Andrzej Siewior, Andrew Morton

On 02/27/2016 12:13 PM, Eric Dumazet wrote:
> On sam., 2016-02-27 at 10:19 -0800, Peter Hurley wrote:
>> Hi Eric,
>>
>> For a while now, we've been struggling to understand why we've been
>> observing missed uart rx DMA.
>>
>> Because both the uart driver (omap8250) and the dmaengine driver
>> (edma) were (relatively) new, we assumed there was some race between
>> starting a new rx DMA and processing the previous one.
>>
>> However, after instrumenting both the uart driver and the dmaengine
>> driver, what we've observed is huge anomalous latencies between receiving
>> the DMA interrupt and servicing the DMA tasklet.
>>
>> For example, at 3Mbaud we recorded the following distribution of
>> softirq[TASKLET] service latency for this specific DMA channel:
>>
>> root@black:/sys/kernel/debug/edma# cat 35
>> latency(us):   0+   20+   40+   60+   80+  100+  120+  140+  160+  180+  200+  220+  240+  260+  280+  300+  320+  340+  360+  380+
>>            195681    33    53    15     7     4     3     1     0     0     0     1     4     6     1     0     0     0     0     0
>>
>> As you can see, the vast majority of tasklet service happens immediately,
>> tapering off to 140+us.
>>
>> However, note the island of distribution at 220~300 [latencies beyond 300+
>> are not recorded because the uart fifo has filled again by this point and
>> dma must be aborted].
>>
>> So I cribbed together a latency tracer to catch what was happening at
>> the extreme, and what it caught was a priority inversion stemming from
>> your commit:
>>
>>    commit c10d73671ad30f54692f7f69f0e09e75d3a8926a
>>    Author: Eric Dumazet <edumazet@google.com>
>>    Date:   Thu Jan 10 15:26:34 2013 -0800
>>
>>        softirq: reduce latencies
>>     
>>        In various network workloads, __do_softirq() latencies can be up
>>        to 20 ms if HZ=1000, and 200 ms if HZ=100.
>>     
>>        This is because we iterate 10 times in the softirq dispatcher,
>>        and some actions can consume a lot of cycles.
>>
>>
>> In the trace below [1], the trace begins in the edma completion interrupt
>> handler when the tasklet is scheduled; the edma interrupt has occurred during
>> NET_RX softirq context.
>>
>> However, instead of causing a restart of the softirq loop to process the
>> tasklet _which occurred before sshd was scheduled_, the softirq loop is
>> aborted and deferred for ksoftirqd. The tasklet is not serviced for 521us,
>> which is way too long, so DMA was aborted.
>>
>> Your patch has effectively inverted the priority of tasklets with normal
>> pri/nice processes that have merely received a network packet.
>>
>> ISTM, the problem you're trying to solve here was caused by NET_RX softirq
>> to begin with, and maybe that thing needs a diet.
>>
>> But rather than outright reverting your patch, what if more selective
>> conditions are used to abort the softirq restart? What would those conditions
>> be? In the netperf benchmark you referred to in that commit, is it just
>> NET_TX/NET_RX softirqs that are causing scheduling latencies?
>>
>> It just doesn't make sense to special case for a workload that isn't
>> even running.
>>
>>
>> Regards,
>> Peter Hurley
>>
>>
>> [1] softirq tasklet latency trace  (apologies that it's only events - full
>>     function trace introduces too much overhead)
>>
>> # tracer: latency
>> #
>> # latency latency trace v1.1.5 on 4.5.0-rc2+
>> # --------------------------------------------------------------------
>> # latency: 476 us, #59/59, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0)
>> #    -----------------
>> #    | task: sshd-750 (uid:1000 nice:0 policy:0 rt_prio:0)
>> #    -----------------
>> #  => started at: __tasklet_schedule  
>> #  => ended at:   tasklet_action
>> #
>> #
>> #                  _------=> CPU#
>> #                 / _-----=> irqs-off 
>> #                | / _----=> need-resched
>> #                || / _---=> hardirq/softirq
>> #                ||| / _--=> preempt-depth
>> #                |||| /     delay
>> #  cmd     pid   ||||| time  |   caller
>> #     \   /      |||||  \    |   /
>>   <idle>-0       0d.H3    1us : __tasklet_schedule
>>   <idle>-0       0d.H4    3us : softirq_raise: vec=6 [action=TASKLET]
>>   <idle>-0       0d.H3    6us : irq_handler_exit: irq=20 ret=handled
>>   <idle>-0       0..s2   15us : kmem_cache_alloc: call_site=c08378e4 ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC
>>   <idle>-0       0..s2   23us : netif_receive_skb_entry: dev=eth0 napi_id=0x0 queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x000
>> 0 protocol=0x0800 ip_summed=0 hash=0x00000000 l4_hash=0 len=88 data_len=0 truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0x0
>>   <idle>-0       0..s2   30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 len=88
>>   <idle>-0       0d.s5   98us : sched_waking: comm=sshd pid=750 prio=120 target_cpu=000
>>   <idle>-0       0d.s6  105us : sched_stat_sleep: comm=sshd pid=750 delay=3125230447 [ns]
>>   <idle>-0       0dns6  110us+: sched_wakeup: comm=sshd pid=750 prio=120 target_cpu=000
>>   <idle>-0       0dns4  123us+: timer_start: timer=dc940e9c function=tcp_delack_timer expires=9746 [timeout=10] flags=0x00000000
>>   <idle>-0       0dnH3  150us : irq_handler_entry: irq=176 name=4a100000.ethernet
>>   <idle>-0       0dnH3  153us : softirq_raise: vec=3 [action=NET_RX]
>>   <idle>-0       0dnH3  155us : irq_handler_exit: irq=176 ret=handled
>>   <idle>-0       0dnH3  160us : irq_handler_entry: irq=20 name=49000000.edma_ccint
>>   <idle>-0       0dnH3  163us : irq_handler_exit: irq=20 ret=handled
>>   <idle>-0       0.ns2  169us : napi_poll: napi poll on napi struct de465c30 for device eth0
>>   <idle>-0       0.ns2  171us : softirq_exit: vec=3 [action=NET_RX]
>>   <idle>-0       0dns3  175us : sched_waking: comm=ksoftirqd/0 pid=3 prio=120 target_cpu=000
>>   <idle>-0       0dns4  178us : sched_stat_sleep: comm=ksoftirqd/0 pid=3 delay=19371625 [ns]
>>   <idle>-0       0dns4  180us : sched_wakeup: comm=ksoftirqd/0 pid=3 prio=120 target_cpu=000
>>   <idle>-0       0.n.3  186us : gpio_value: 55 set 1
>>   <idle>-0       0dn.3  191us : hrtimer_cancel: hrtimer=c0e2af80
>>   <idle>-0       0dn.3  196us : hrtimer_start: hrtimer=c0e2af80 function=tick_sched_timer expires=338948000000 softexpires=338948000000
>>   <idle>-0       0dn.2  199us : rcu_utilization: Start context switch
>>   <idle>-0       0dn.2  202us : rcu_utilization: End context switch
>>   <idle>-0       0dn.3  206us : sched_stat_wait: comm=sshd pid=750 delay=74042 [ns]
>>   <idle>-0       0d..3  209us+: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=sshd next_pid=750 next_prio=120
>>     sshd-750     0...1  252us : sys_exit: NR 142 = 1
>>     sshd-750     0...1  257us+: sys_select -> 0x1
>>     sshd-750     0...1  273us : sys_enter: NR 175 (0, be9cc238, be9cc1b8, 8, be9cc1b8, 0)
>>     sshd-750     0...1  276us : sys_rt_sigprocmask(how: 0, nset: be9cc238, oset: be9cc1b8, sigsetsize: 8)
>>     sshd-750     0...1  285us : sys_exit: NR 175 = 0
>>     sshd-750     0...1  287us : sys_rt_sigprocmask -> 0x0
>>     sshd-750     0...1  289us : sys_enter: NR 175 (2, be9cc1b8, 0, 8, 0, 2)
>>     sshd-750     0...1  290us : sys_rt_sigprocmask(how: 2, nset: be9cc1b8, oset: 0, sigsetsize: 8)
>>     sshd-750     0...1  292us : sys_exit: NR 175 = 0
>>     sshd-750     0...1  293us : sys_rt_sigprocmask -> 0x0
>>     sshd-750     0...1  299us : sys_enter: NR 263 (1, be9cc270, 801b25e8, 0, 7f60e3b0, 7f60ee7c)
>>     sshd-750     0...1  301us : sys_clock_gettime(which_clock: 1, tp: be9cc270)
>>     sshd-750     0...1  307us : sys_exit: NR 263 = 0
>>     sshd-750     0d.h2  313us+: irq_handler_entry: irq=160 name=481aa000.serial
>>     sshd-750     0d.h3  327us : omap_8250_rx_dma_flush: 481aa000.serial: ret:22 iir:0xc4 seq:19389 last:19388 elapsed:446292
>>     sshd-750     0d.h3  335us : __dma_rx_do_complete: 481aa000.serial: cmplt: iir=c4 count=48
>>     sshd-750     0d.h4  341us : workqueue_queue_work: work struct=de41239c function=flush_to_ldisc workqueue=de1bef00 req_cpu=1 cpu=4294967295
>>     sshd-750     0d.h4  343us : workqueue_activate_work: work struct de41239c
>>     sshd-750     0d.h5  346us : sched_waking: comm=kworker/u2:2 pid=763 prio=120 target_cpu=000
>>     sshd-750     0d.h6  350us : sched_stat_sleep: comm=kworker/u2:2 pid=763 delay=442250 [ns]
>>     sshd-750     0d.h6  359us : sched_wakeup: comm=kworker/u2:2 pid=763 prio=120 target_cpu=000
>>     sshd-750     0d.h3  366us : kfree: call_site=c05a88f0 ptr=de6ba940
>>     sshd-750     0d.h3  367us+: kfree: call_site=c05a88f0 ptr=de6f5640
>>     sshd-750     0d.h4  445us : kmalloc: call_site=c05aa6d0 ptr=de6f5640 bytes_req=132 bytes_alloc=192 gfp_flags=GFP_ATOMIC|GFP_ZERO
>>     sshd-750     0d.h2  455us : irq_handler_exit: irq=160 ret=handled
>>     sshd-750     0..s2  460us : softirq_entry: vec=3 [action=NET_RX]
>>     sshd-750     0..s2  469us : napi_poll: napi poll on napi struct de465c30 for device eth0
>>     sshd-750     0..s2  470us : softirq_exit: vec=3 [action=NET_RX]
>>     sshd-750     0..s2  472us : softirq_entry: vec=6 [action=TASKLET]
>>     sshd-750     0..s1  475us : tasklet_action
>>     sshd-750     0d.s1  478us+: tasklet_action <-tasklet_action
>>     sshd-750     0d.s1  521us : <stack trace>
>>  => irq_exit
>>  => __handle_domain_irq
>>  => omap_intc_handle_irq
>>  => __irq_svc
>>  => ftrace_syscall_exit
>>  => ftrace_syscall_exit
>>  => syscall_trace_exit
>>  => __sys_trace_return
> 
> 
> Nothing in the linux kernel will make sure softirq are always processed
> under the higher prio.

It's still a priority inversion, and an unnecessary one at that.


> You are pointing out my patch but the real problem for you is having
> ksoftirqd in the first place.

Not really. softirq raised from interrupt context will always execute
on this cpu and not in ksoftirqd, unless load forces softirq loop abort.

In this case, there is no load.


> You need RT variant I guess, or a better hardware.

500us for softirq while at idle is hardly the litmus the RT.

So then no suggestions for reducing the scope of your patch?

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-27 20:29   ` Peter Hurley
@ 2016-02-27 23:04     ` David Miller
  2016-02-27 23:33       ` Peter Hurley
  0 siblings, 1 reply; 32+ messages in thread
From: David Miller @ 2016-02-27 23:04 UTC (permalink / raw)
  To: peter
  Cc: eric.dumazet, edumazet, netdev, linux-kernel, gregkh, dmaengine,
	john.ogness, bigeasy, akpm

From: Peter Hurley <peter@hurleysoftware.com>
Date: Sat, 27 Feb 2016 12:29:39 -0800

> Not really. softirq raised from interrupt context will always execute
> on this cpu and not in ksoftirqd, unless load forces softirq loop abort.

That guarantee never was specified.

Or are you saying that by design, on a system under load, your UART
will not function properly?

Surely you don't mean that.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-27 23:04     ` David Miller
@ 2016-02-27 23:33       ` Peter Hurley
  2016-02-28  1:59         ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Hurley @ 2016-02-27 23:33 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, edumazet, netdev, linux-kernel, gregkh, dmaengine,
	john.ogness, bigeasy, akpm

On 02/27/2016 03:04 PM, David Miller wrote:
> From: Peter Hurley <peter@hurleysoftware.com>
> Date: Sat, 27 Feb 2016 12:29:39 -0800
> 
>> Not really. softirq raised from interrupt context will always execute
>> on this cpu and not in ksoftirqd, unless load forces softirq loop abort.
> 
> That guarantee never was specified.

??

Neither is running network socket servers at normal priority as if they're
higher priority than softirq.


> Or are you saying that by design, on a system under load, your UART
> will not function properly?
> 
> Surely you don't mean that.

No, that's not what I mean.

What I mean is that bypassing the entire SOFTIRQ priority so that
sshd can process one network packet makes a mockery of the point of softirq.

This hack to workaround NET_RX looping over-and-over-and-over affects every
subsystem, not just one uart.

HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug.

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-27 23:33       ` Peter Hurley
@ 2016-02-28  1:59         ` Eric Dumazet
  2016-02-28  2:10           ` Peter Hurley
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-02-28  1:59 UTC (permalink / raw)
  To: Peter Hurley
  Cc: David Miller, edumazet, netdev, linux-kernel, gregkh, dmaengine,
	john.ogness, bigeasy, akpm

On sam., 2016-02-27 at 15:33 -0800, Peter Hurley wrote:
> On 02/27/2016 03:04 PM, David Miller wrote:
> > From: Peter Hurley <peter@hurleysoftware.com>
> > Date: Sat, 27 Feb 2016 12:29:39 -0800
> > 
> >> Not really. softirq raised from interrupt context will always execute
> >> on this cpu and not in ksoftirqd, unless load forces softirq loop abort.
> > 
> > That guarantee never was specified.
> 
> ??
> 
> Neither is running network socket servers at normal priority as if they're
> higher priority than softirq.
> 
> 
> > Or are you saying that by design, on a system under load, your UART
> > will not function properly?
> > 
> > Surely you don't mean that.
> 
> No, that's not what I mean.
> 
> What I mean is that bypassing the entire SOFTIRQ priority so that
> sshd can process one network packet makes a mockery of the point of softirq.
> 
> This hack to workaround NET_RX looping over-and-over-and-over affects every
> subsystem, not just one uart.
> 
> HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug.

No idea what you talk about.

All pending softirq interrupts are processed. _Nothing_ is skipped.

Really, your system stability seems to depend on a completely
undocumented behavior of linux kernels before linux-3.8

If I understood, you expect that a tasklet activated from a softirq
handler is run from the same  __do_softirq() loop. This never has been
the case.

My change simply triggers the bug in your driver earlier. As David
pointed out, your bug should trigger the same on a loaded machine, even
if you revert my patch.

I honestly do not know why you arm a tasklet from NET_RX, why don't you
simply process this directly, so that you do not rely on some scheduler
decision ?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-28  1:59         ` Eric Dumazet
@ 2016-02-28  2:10           ` Peter Hurley
  2016-02-28  2:17             ` Eric Dumazet
  2016-02-28  4:46             ` David Miller
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Hurley @ 2016-02-28  2:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, edumazet, netdev, linux-kernel, gregkh, dmaengine,
	john.ogness, bigeasy, akpm

On 02/27/2016 05:59 PM, Eric Dumazet wrote:
> On sam., 2016-02-27 at 15:33 -0800, Peter Hurley wrote:
>> On 02/27/2016 03:04 PM, David Miller wrote:
>>> From: Peter Hurley <peter@hurleysoftware.com>
>>> Date: Sat, 27 Feb 2016 12:29:39 -0800
>>>
>>>> Not really. softirq raised from interrupt context will always execute
>>>> on this cpu and not in ksoftirqd, unless load forces softirq loop abort.
>>>
>>> That guarantee never was specified.
>>
>> ??
>>
>> Neither is running network socket servers at normal priority as if they're
>> higher priority than softirq.
>>
>>
>>> Or are you saying that by design, on a system under load, your UART
>>> will not function properly?
>>>
>>> Surely you don't mean that.
>>
>> No, that's not what I mean.
>>
>> What I mean is that bypassing the entire SOFTIRQ priority so that
>> sshd can process one network packet makes a mockery of the point of softirq.
>>
>> This hack to workaround NET_RX looping over-and-over-and-over affects every
>> subsystem, not just one uart.
>>
>> HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug.
> 
> No idea what you talk about.
> 
> All pending softirq interrupts are processed. _Nothing_ is skipped.

An interrupt that schedules HI softirq while in NET_RX softirq should
still run the HI softirq. But with your patch that won't happen.


> Really, your system stability seems to depend on a completely
> undocumented behavior of linux kernels before linux-3.8
> 
> If I understood, you expect that a tasklet activated from a softirq
> handler is run from the same  __do_softirq() loop. This never has been
> the case.

No.

The *interrupt handler* for DMA goes off while NET_RX softirq is running.
That's what schedules the *DMA tasklet*.

That tasklet should run before any process.

But it doesn't because your patch bails out early from softirq.


> My change simply triggers the bug in your driver earlier. As David
> pointed out, your bug should trigger the same on a loaded machine, even
> if you revert my patch.
> 
> I honestly do not know why you arm a tasklet from NET_RX, why don't you
> simply process this directly, so that you do not rely on some scheduler
> decision ?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-28  2:10           ` Peter Hurley
@ 2016-02-28  2:17             ` Eric Dumazet
  2016-02-28  4:46             ` David Miller
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2016-02-28  2:17 UTC (permalink / raw)
  To: Peter Hurley
  Cc: David Miller, edumazet, netdev, linux-kernel, gregkh, dmaengine,
	john.ogness, bigeasy, akpm

On sam., 2016-02-27 at 18:10 -0800, Peter Hurley wrote:
> On 02/27/2016 05:59 PM, Eric Dumazet wrote:
> > On sam., 2016-02-27 at 15:33 -0800, Peter Hurley wrote:
> >> On 02/27/2016 03:04 PM, David Miller wrote:
> >>> From: Peter Hurley <peter@hurleysoftware.com>
> >>> Date: Sat, 27 Feb 2016 12:29:39 -0800
> >>>
> >>>> Not really. softirq raised from interrupt context will always execute
> >>>> on this cpu and not in ksoftirqd, unless load forces softirq loop abort.
> >>>
> >>> That guarantee never was specified.
> >>
> >> ??
> >>
> >> Neither is running network socket servers at normal priority as if they're
> >> higher priority than softirq.
> >>
> >>
> >>> Or are you saying that by design, on a system under load, your UART
> >>> will not function properly?
> >>>
> >>> Surely you don't mean that.
> >>
> >> No, that's not what I mean.
> >>
> >> What I mean is that bypassing the entire SOFTIRQ priority so that
> >> sshd can process one network packet makes a mockery of the point of softirq.
> >>
> >> This hack to workaround NET_RX looping over-and-over-and-over affects every
> >> subsystem, not just one uart.
> >>
> >> HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug.
> > 
> > No idea what you talk about.
> > 
> > All pending softirq interrupts are processed. _Nothing_ is skipped.
> 
> An interrupt that schedules HI softirq while in NET_RX softirq should
> still run the HI softirq. But with your patch that won't happen.

Stop saying this. This never had been the case. I am glad my patch
finally show you are wrong.

> 
> 
> > Really, your system stability seems to depend on a completely
> > undocumented behavior of linux kernels before linux-3.8
> > 
> > If I understood, you expect that a tasklet activated from a softirq
> > handler is run from the same  __do_softirq() loop. This never has been
> > the case.
> 
> No.
> 
> The *interrupt handler* for DMA goes off while NET_RX softirq is running.
> That's what schedules the *DMA tasklet*.
> 
> That tasklet should run before any process.
> 
> But it doesn't because your patch bails out early from softirq.

Fine. Fix your driver.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-28  2:10           ` Peter Hurley
  2016-02-28  2:17             ` Eric Dumazet
@ 2016-02-28  4:46             ` David Miller
  1 sibling, 0 replies; 32+ messages in thread
From: David Miller @ 2016-02-28  4:46 UTC (permalink / raw)
  To: peter
  Cc: eric.dumazet, edumazet, netdev, linux-kernel, gregkh, dmaengine,
	john.ogness, bigeasy, akpm

From: Peter Hurley <peter@hurleysoftware.com>
Date: Sat, 27 Feb 2016 18:10:27 -0800

> That tasklet should run before any process.

You never have this guarantee, even before Eric's patch.
Under load tasklets run from ksoftirqd just like any other
softirq.

Please fix your driver and stop blaming Eric's change.

Thank you.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-27 18:19 Softirq priority inversion from "softirq: reduce latencies" Peter Hurley
  2016-02-27 20:13 ` Eric Dumazet
@ 2016-02-28  5:55 ` Mike Galbraith
  2016-02-28 17:01   ` Francois Romieu
  2016-03-07 15:48 ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 32+ messages in thread
From: Mike Galbraith @ 2016-02-28  5:55 UTC (permalink / raw)
  To: Peter Hurley, Eric Dumazet
  Cc: David Miller, netdev, linux-kernel, Greg KH, dmaengine,
	John Ogness, Sebastian Andrzej Siewior, Andrew Morton

On Sat, 2016-02-27 at 10:19 -0800, Peter Hurley wrote:
> Hi Eric,
> 
> For a while now, we've been struggling to understand why we've been
> observing missed uart rx DMA.
> 
> Because both the uart driver (omap8250) and the dmaengine driver
> (edma) were (relatively) new, we assumed there was some race between
> starting a new rx DMA and processing the previous one.

Hrm, relatively new + tasklet woes rings a bell.  Ah, that..

<quote>
What's worse is that at the point where this code was written it was
already well known that tasklets are a steaming pile of crap and
should die.
</quote>

Source thereof https://lwn.net/Articles/588457/

	-Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-28  5:55 ` Mike Galbraith
@ 2016-02-28 17:01   ` Francois Romieu
  2016-02-29  4:58     ` Mike Galbraith
  0 siblings, 1 reply; 32+ messages in thread
From: Francois Romieu @ 2016-02-28 17:01 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Hurley, Eric Dumazet, David Miller, netdev, linux-kernel,
	Greg KH, dmaengine, John Ogness, Sebastian Andrzej Siewior,
	Andrew Morton

Mike Galbraith <umgwanakikbuti@gmail.com> :
[...]
> Hrm, relatively new + tasklet woes rings a bell.  Ah, that..
> 
> <quote>
> What's worse is that at the point where this code was written it was
> already well known that tasklets are a steaming pile of crap and
> should die.
> </quote>
> 
> Source thereof https://lwn.net/Articles/588457/

tasklets are ingrained in the dmaengine API (see Documentation/dmaengine/client.txt
and drivers/dma/virt-dma.h::vchan_cookie_complete).

Moving everything to irq context or handling his own sub-{jiffy/ms} timer
while losing async dma doesn't exactly smell like roses either. :o(

-- 
Ueimor

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-28 17:01   ` Francois Romieu
@ 2016-02-29  4:58     ` Mike Galbraith
  2016-02-29 15:03       ` Peter Hurley
  2016-03-07 15:31       ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 32+ messages in thread
From: Mike Galbraith @ 2016-02-29  4:58 UTC (permalink / raw)
  To: Francois Romieu
  Cc: Peter Hurley, Eric Dumazet, David Miller, netdev, linux-kernel,
	Greg KH, dmaengine, John Ogness, Sebastian Andrzej Siewior,
	Andrew Morton

On Sun, 2016-02-28 at 18:01 +0100, Francois Romieu wrote:
> Mike Galbraith <umgwanakikbuti@gmail.com> :
> [...]
> > Hrm, relatively new + tasklet woes rings a bell.  Ah, that..
> > 
> > 
> > What's worse is that at the point where this code was written it was
> > already well known that tasklets are a steaming pile of crap and
> > should die.
> > 
> > 
> > Source thereof https://lwn.net/Articles/588457/
> 
> tasklets are ingrained in the dmaengine API (see Documentation/dmaengine/client.txt
> and drivers/dma/virt-dma.h::vchan_cookie_complete).
> 
> Moving everything to irq context or handling his own sub-{jiffy/ms} timer
> while losing async dma doesn't exactly smell like roses either. :o(

https://lwn.net/Articles/239633/

If I'm listening properly, the root cause is that there is a timing
constraint involved, which is being exposed because one softirq raises
another (ew).  Processing timeout happens, freshly raised tasklet
wanders off to SCHED_NORMAL kthread context where its constraint dies.

Given the dma stuff apparently works fine in -rt (or did, see below),
timing constraints can't be super tight, so perhaps we could grow
realtime workqueue support for the truly deserving.  The tricky bit
would be being keeping everybody and his brother from abusing it.

WRT -rt: if dma tasklets really do have hard (ish) constraints, -rt
recently "broke" in the same way.. of all softirqs which are deferred
to kthread context, due to a recent change, only timer/hrtimer are
executed at realtime priority by default.

	-Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29  4:58     ` Mike Galbraith
@ 2016-02-29 15:03       ` Peter Hurley
  2016-02-29 15:19         ` Eric Dumazet
                           ` (3 more replies)
  2016-03-07 15:31       ` Sebastian Andrzej Siewior
  1 sibling, 4 replies; 32+ messages in thread
From: Peter Hurley @ 2016-02-29 15:03 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Francois Romieu, Eric Dumazet, David Miller, netdev, linux-kernel,
	Greg KH, dmaengine, John Ogness, Sebastian Andrzej Siewior,
	Andrew Morton, Thomas Gleixner

On 02/28/2016 08:58 PM, Mike Galbraith wrote:
> On Sun, 2016-02-28 at 18:01 +0100, Francois Romieu wrote:
>> Mike Galbraith <umgwanakikbuti@gmail.com> :
>> [...]
>>> Hrm, relatively new + tasklet woes rings a bell.  Ah, that..
>>>
>>>
>>> What's worse is that at the point where this code was written it was
>>> already well known that tasklets are a steaming pile of crap and
>>> should die.
>>>
>>>
>>> Source thereof https://lwn.net/Articles/588457/

Thanks but not applicable. tglx's POV has everything to do with the
tasklet interface and not the general concept of bottom-half interrupt
processing in a timely manner. In any event, the problem created by
Eric's change is not restricted to tasklets, but rather applies to
all softirq.


>> tasklets are ingrained in the dmaengine API (see Documentation/dmaengine/client.txt
>> and drivers/dma/virt-dma.h::vchan_cookie_complete).
>>
>> Moving everything to irq context or handling his own sub-{jiffy/ms} timer
>> while losing async dma doesn't exactly smell like roses either. :o(
> 
> https://lwn.net/Articles/239633/
> 
> If I'm listening properly, the root cause is that there is a timing
> constraint involved, which is being exposed because one softirq raises
> another (ew).

Not the case. The softirq is raised from interrupt.

Before Eric's change, when an interrupt raises a new softirq
while processing another softirq, the new softirq is immediately
processed *after the existing softirq completes*.

After Eric's change, when an interrupt raises a new softirq
while processing another softirq and _that softirq wakes a process_,
the new softirq is *deferred to normal process priority*.
This happens even if the new softirq is higher priority than the
one currently running, which is flat-out wrong.

The reason this happens repeatedly and regularly is because
1. The time window while NET_RX softirq is running is big.
2. NET_RX softirq will almost always wake a process for a received packet.

The reason why Eric's change is so effective for Eric's workload is
that it fixes the problem where NET_RX keeps getting new network packets
so it keeps looping, servicing more NET_RX softirq.

However, I'm pointing out that Eric's sledgehammer approach to fixing
the NET_RX softirq bug is having significant side-effects in other
subsystems.


> Processing timeout happens, freshly raised tasklet
> wanders off to SCHED_NORMAL kthread context where its constraint dies.
> 
> Given the dma stuff apparently works fine in -rt (or did, see below),
> timing constraints can't be super tight, so perhaps we could grow
> realtime workqueue support for the truly deserving.  The tricky bit
> would be being keeping everybody and his brother from abusing it.
> 
> WRT -rt: if dma tasklets really do have hard (ish) constraints, -rt
> recently "broke" in the same way.. of all softirqs which are deferred
> to kthread context, due to a recent change, only timer/hrtimer are
> executed at realtime priority by default.
> 
> 	-Mike
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 15:03       ` Peter Hurley
@ 2016-02-29 15:19         ` Eric Dumazet
  2016-02-29 15:54           ` Peter Hurley
  2016-02-29 15:27         ` Eric Dumazet
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-02-29 15:19 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On lun., 2016-02-29 at 07:03 -0800, Peter Hurley wrote:

> Not the case. The softirq is raised from interrupt.
> 
> Before Eric's change, when an interrupt raises a new softirq
> while processing another softirq, the new softirq is immediately
> processed *after the existing softirq completes*.
> 
> After Eric's change, when an interrupt raises a new softirq
> while processing another softirq and _that softirq wakes a process_,
> the new softirq is *deferred to normal process priority*.

For the last time, this is not true.

My patch changed the probability for this to happen.

It will happen even if you revert it.

linux never claimed that softirq could steal all cpu time.

Are by any chance still running a HZ=100 kernel ?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 15:03       ` Peter Hurley
  2016-02-29 15:19         ` Eric Dumazet
@ 2016-02-29 15:27         ` Eric Dumazet
  2016-02-29 19:13           ` Peter Hurley
  2016-02-29 15:40         ` Mike Galbraith
  2016-02-29 17:16         ` David Miller
  3 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-02-29 15:27 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On lun., 2016-02-29 at 07:03 -0800, Peter Hurley wrote:

> The reason why Eric's change is so effective for Eric's workload is
> that it fixes the problem where NET_RX keeps getting new network packets
> so it keeps looping, servicing more NET_RX softirq.

You have very little idea of what is happening in networking land.

Once hard irq for RX has triggered, we arm a NAPI (NET_RX softirq), and
no more irq will come unless the napi handler ran. Then when NAPI is
complete, we re-allow interrupt to be delivered when a new packet is
coming.

Yes, ksoftirqd runs under load, and this is _wanted_.

Sure, it might add a latency if some high prio task is wanting the same
cpu, but this is exactly the purpose of having multi tasking.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 15:03       ` Peter Hurley
  2016-02-29 15:19         ` Eric Dumazet
  2016-02-29 15:27         ` Eric Dumazet
@ 2016-02-29 15:40         ` Mike Galbraith
  2016-02-29 15:58           ` Peter Hurley
  2016-02-29 17:16         ` David Miller
  3 siblings, 1 reply; 32+ messages in thread
From: Mike Galbraith @ 2016-02-29 15:40 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Francois Romieu, Eric Dumazet, David Miller, netdev, linux-kernel,
	Greg KH, dmaengine, John Ogness, Sebastian Andrzej Siewior,
	Andrew Morton, Thomas Gleixner

On Mon, 2016-02-29 at 07:03 -0800, Peter Hurley wrote:

> > If I'm listening properly, the root cause is that there is a timing
> > constraint involved, which is being exposed because one softirq raises
> > another (ew).
> 
> Not the case. The softirq is raised from interrupt.

Yeah, saw that on re-read.

> Before Eric's change, when an interrupt raises a new softirq
> while processing another softirq, the new softirq is immediately
> processed *after the existing softirq completes*.

Not necessarily, Eric only changed it from an arbitrary count to an
arbitrary time, so your irq could just as well land when there's no
count left and be up the same creek.

I was more infatuated by the constraint that's left dangling in the
breeze any time processing is deferred to ksoftirqd.

	-Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 15:19         ` Eric Dumazet
@ 2016-02-29 15:54           ` Peter Hurley
  2016-02-29 16:21             ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Hurley @ 2016-02-29 15:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On 02/29/2016 07:19 AM, Eric Dumazet wrote:
> On lun., 2016-02-29 at 07:03 -0800, Peter Hurley wrote:
> 
>> Not the case. The softirq is raised from interrupt.
>>
>> Before Eric's change, when an interrupt raises a new softirq
>> while processing another softirq, the new softirq is immediately
>> processed *after the existing softirq completes*.
>>
>> After Eric's change, when an interrupt raises a new softirq
>> while processing another softirq and _that softirq wakes a process_,
>> the new softirq is *deferred to normal process priority*.
> 
> For the last time, this is not true.
> 
> My patch changed the probability for this to happen.

There is a huge difference between
1. heavy i/o load forcing ksoftirqd to battle out i/o with regular
   sched processes *as a fallback to avoid 100% softirq* and
2. always deferring new softirq just because a process was woken


> It will happen even if you revert it.

I think there is a happy medium where finer constraints on
softirq looping will get us both what we want.

For example, an accumulating mask of softirq already run would
keep one softirq level from looping over-and-over. Or a per-softirq
limiting counter. Or relying on the hard limit that was added later
of a fixed number of softirq loops. Or a combination of those.


> linux never claimed that softirq could steal all cpu time.

That's not the problem observed here.

In fact, what your patch triggers is exactly the opposite:
although cpu load is initially very light because DMA is used to perform
device i/o, once DMA is not being serviced in a timely manner, the
driver fallbacks to purely interrupt-driven i/o which dramatically
increases the real cpu load at those line rates.

> Are by any chance still running a HZ=100 kernel ?

The current kernel is HZ=250 but this would occur on HZ=1000 as well.

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 15:40         ` Mike Galbraith
@ 2016-02-29 15:58           ` Peter Hurley
  2016-02-29 16:24             ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Hurley @ 2016-02-29 15:58 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Francois Romieu, Eric Dumazet, David Miller, netdev, linux-kernel,
	Greg KH, dmaengine, John Ogness, Sebastian Andrzej Siewior,
	Andrew Morton, Thomas Gleixner

On 02/29/2016 07:40 AM, Mike Galbraith wrote:
> On Mon, 2016-02-29 at 07:03 -0800, Peter Hurley wrote:
> 
>>> If I'm listening properly, the root cause is that there is a timing
>>> constraint involved, which is being exposed because one softirq raises
>>> another (ew).
>>
>> Not the case. The softirq is raised from interrupt.
> 
> Yeah, saw that on re-read.
> 
>> Before Eric's change, when an interrupt raises a new softirq
>> while processing another softirq, the new softirq is immediately
>> processed *after the existing softirq completes*.
> 
> Not necessarily, Eric only changed it from an arbitrary count to an
> arbitrary time, so your irq could just as well land when there's no
> count left and be up the same creek.

Your misreading the softirq abort logic:
neither 2ms nor a fixed number of loops has elapsed.

All that's happened is the first loop of NET_RX softirq has woken a
process; that is sufficient to abort softirq and defer it for ksoftirqd.

That's why I'm saying this is a priority inversion, and one that
will happen a lot.


> I was more infatuated by the constraint that's left dangling in the
> breeze any time processing is deferred to ksoftirqd.
> 
> 	-Mike
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 15:54           ` Peter Hurley
@ 2016-02-29 16:21             ` Eric Dumazet
  2016-02-29 18:05               ` Peter Hurley
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-02-29 16:21 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On lun., 2016-02-29 at 07:54 -0800, Peter Hurley wrote:

>  The current kernel is HZ=250 but this would occur on HZ=1000 as well.

Right. But the problem with HZ=100 and HZ=250 is that the detection can
happens because jiffy granularity is too coarse, since 

msecs_to_jiffies(2) -> 1

Following patch might reduce the probability, but wont really fix your
problem.

Fact that ksoftirqd prio is not what you want is completely orthogonal.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e443..f7cc594 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -180,7 +180,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
 
 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
- * but break the loop if need_resched() is set or after 2 ms.
+ * but break the loop if need_resched() is set or after 2 ms/ticks.
  * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
  * certain cases, such as stop_machine(), jiffies may cease to
  * increment and so we need the MAX_SOFTIRQ_RESTART limit as
@@ -191,7 +191,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
  * we want to handle softirqs as soon as possible, but they
  * should not be able to lock up the box.
  */
-#define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
+#define MAX_SOFTIRQ_TIME  (1 + msecs_to_jiffies(2))
 #define MAX_SOFTIRQ_RESTART 10
 
 #ifdef CONFIG_TRACE_IRQFLAGS

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 15:58           ` Peter Hurley
@ 2016-02-29 16:24             ` Eric Dumazet
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2016-02-29 16:24 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On lun., 2016-02-29 at 07:58 -0800, Peter Hurley wrote:

> All that's happened is the first loop of NET_RX softirq has woken a
> process; that is sufficient to abort softirq and defer it for ksoftirqd.
> 
> That's why I'm saying this is a priority inversion, and one that
> will happen a lot.

Sure. This will happen every time ksoftirqd is launched.

Get rid of ksoftirqd or renice it so that you can easily be killed by
softirq storm.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 15:03       ` Peter Hurley
                           ` (2 preceding siblings ...)
  2016-02-29 15:40         ` Mike Galbraith
@ 2016-02-29 17:16         ` David Miller
  3 siblings, 0 replies; 32+ messages in thread
From: David Miller @ 2016-02-29 17:16 UTC (permalink / raw)
  To: peter
  Cc: umgwanakikbuti, romieu, edumazet, netdev, linux-kernel, gregkh,
	dmaengine, john.ogness, bigeasy, akpm, tglx

From: Peter Hurley <peter@hurleysoftware.com>
Date: Mon, 29 Feb 2016 07:03:11 -0800

> However, I'm pointing out that Eric's sledgehammer approach to fixing
> the NET_RX softirq bug is having significant side-effects in other
> subsystems.

Either your hardware can handle arbitrary latencies and thus can use
softirqs for event completion successfully, or it can't.

You, my friend, are the one using the sledgehammer.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 16:21             ` Eric Dumazet
@ 2016-02-29 18:05               ` Peter Hurley
  2016-02-29 18:24                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Hurley @ 2016-02-29 18:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On 02/29/2016 08:21 AM, Eric Dumazet wrote:
> On lun., 2016-02-29 at 07:54 -0800, Peter Hurley wrote:
> 
>>  The current kernel is HZ=250 but this would occur on HZ=1000 as well.
> 
> Right. But the problem with HZ=100 and HZ=250 is that the detection can
> happens because jiffy granularity is too coarse, since 
> 
> msecs_to_jiffies(2) -> 1
> 
> Following patch might reduce the probability, but wont really fix your
> problem.
> 
> Fact that ksoftirqd prio is not what you want is completely orthogonal.
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 479e443..f7cc594 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -180,7 +180,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
>  
>  /*
>   * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
> - * but break the loop if need_resched() is set or after 2 ms.
> + * but break the loop if need_resched() is set or after 2 ms/ticks.
>   * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>   * certain cases, such as stop_machine(), jiffies may cease to
>   * increment and so we need the MAX_SOFTIRQ_RESTART limit as
> @@ -191,7 +191,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
>   * we want to handle softirqs as soon as possible, but they
>   * should not be able to lock up the box.
>   */
> -#define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
> +#define MAX_SOFTIRQ_TIME  (1 + msecs_to_jiffies(2))
>  #define MAX_SOFTIRQ_RESTART 10
>  
>  #ifdef CONFIG_TRACE_IRQFLAGS

While I appreciate the attempt, that's not the problem.

Just to be clear

		if (time_before(jiffies, end) && !need_resched() &&
		    --max_restart)
			goto restart;

aborts softirq *even if 0ns have elapsed*, if NET_RX has woken a process.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 18:05               ` Peter Hurley
@ 2016-02-29 18:24                 ` Eric Dumazet
  2016-02-29 18:53                   ` Peter Hurley
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-02-29 18:24 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On lun., 2016-02-29 at 10:05 -0800, Peter Hurley wrote:

> While I appreciate the attempt, that's not the problem.
> 
> Just to be clear
> 
> 		if (time_before(jiffies, end) && !need_resched() &&
> 		    --max_restart)
> 			goto restart;
> 
> aborts softirq *even if 0ns have elapsed*, if NET_RX has woken a process.


Sure, now remove the 1st and 2nd condition.

You would still 'abort' (ie wakeup ksoftirqd really) when --max_restart
becomes 0

So, instead of some subtle load dependent bug, you know have a reliable
trigger.

The fact it took 3 years for someone to complain about this change
should tell us something really.

The only way for your bug to hide would be to remove all the 'break
infinite loop' logic.

And this is not going to happen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 18:24                 ` Eric Dumazet
@ 2016-02-29 18:53                   ` Peter Hurley
  2016-02-29 19:14                     ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Hurley @ 2016-02-29 18:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On 02/29/2016 10:24 AM, Eric Dumazet wrote:
> On lun., 2016-02-29 at 10:05 -0800, Peter Hurley wrote:
> 
>> While I appreciate the attempt, that's not the problem.
>>
>> Just to be clear
>>
>> 		if (time_before(jiffies, end) && !need_resched() &&
>> 		    --max_restart)
>> 			goto restart;
>>
>> aborts softirq *even if 0ns have elapsed*, if NET_RX has woken a process.
> 
> 
> Sure, now remove the 1st and 2nd condition.

Well just removing the 2nd condition has everything working fine,
because that fixes the priority inversion.


> You would still 'abort' (ie wakeup ksoftirqd really) when --max_restart
> becomes 0

Sure. Which would mean there's contended heavy i/o load so the driver
has to fallback to non-DMA. That's an acceptable outcome.


> So, instead of some subtle load dependent bug, you know have a reliable
> trigger.

There's no "subtle load dependent bug" here.

The driver has a fallback mode of operation that it relies on without
DMA. Of course, as I already wrote, this has consequences.

If system resources are _actually contended_, then naturally, fighting
for cpu and i/o time is fine, and I'm happy to do that in ksoftirqd.

However, when system resources are _not_ contended, it makes no
sense to be forced to revert to ksoftirqd resolution, which is strictly
intended as fallback.

Or flipping your argument on its head, why not just _always_ execute
softirq in ksoftirqd?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 15:27         ` Eric Dumazet
@ 2016-02-29 19:13           ` Peter Hurley
  2016-02-29 19:43             ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Hurley @ 2016-02-29 19:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On 02/29/2016 07:27 AM, Eric Dumazet wrote:
> On lun., 2016-02-29 at 07:03 -0800, Peter Hurley wrote:
> 
>> The reason why Eric's change is so effective for Eric's workload is
>> that it fixes the problem where NET_RX keeps getting new network packets
>> so it keeps looping, servicing more NET_RX softirq.
> 
> You have very little idea of what is happening in networking land.

While that is true, I can read a trace:

  ** already in NET_RX softirq **

  <idle>-0       0..s2   15us : kmem_cache_alloc: call_site=c08378e4 ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC
  <idle>-0       0..s2   23us : netif_receive_skb_entry: dev=eth0 napi_id=0x0 queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x000
0 protocol=0x0800 ip_summed=0 hash=0x00000000 l4_hash=0 len=88 data_len=0 truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0x0
  <idle>-0       0..s2   30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 len=88
  <idle>-0       0d.s5   98us : sched_waking: comm=sshd pid=750 prio=120 target_cpu=000
  <idle>-0       0d.s6  105us : sched_stat_sleep: comm=sshd pid=750 delay=3125230447 [ns]
  <idle>-0       0dns6  110us+: sched_wakeup: comm=sshd pid=750 prio=120 target_cpu=000
  <idle>-0       0dns4  123us+: timer_start: timer=dc940e9c function=tcp_delack_timer expires=9746 [timeout=10] flags=0x00000000
  <idle>-0       0dnH3  150us : irq_handler_entry: irq=176 name=4a100000.ethernet
  <idle>-0       0dnH3  153us : softirq_raise: vec=3 [action=NET_RX]
  <idle>-0       0dnH3  155us : irq_handler_exit: irq=176 ret=handled
  <idle>-0       0dnH3  160us : irq_handler_entry: irq=20 name=49000000.edma_ccint
  <idle>-0       0dnH3  163us : irq_handler_exit: irq=20 ret=handled
  <idle>-0       0.ns2  169us : napi_poll: napi poll on napi struct de465c30 for device eth0
  <idle>-0       0.ns2  171us : softirq_exit: vec=3 [action=NET_RX]


As you can see, NET_RX softirq is re-raised while in NET_RX softirq,
as a result of receiving new packets. So NET_RX will keep looping,
which is what I wrote.


> Once hard irq for RX has triggered, we arm a NAPI (NET_RX softirq), and
> no more irq will come unless the napi handler ran. Then when NAPI is
> complete, we re-allow interrupt to be delivered when a new packet is
> coming.
> 
> Yes, ksoftirqd runs under load, and this is _wanted_.
> 
> Sure, it might add a latency if some high prio task is wanting the same
> cpu, but this is exactly the purpose of having multi tasking.
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 18:53                   ` Peter Hurley
@ 2016-02-29 19:14                     ` Thomas Gleixner
  2016-02-29 20:24                       ` David Miller
  2016-02-29 23:04                       ` Peter Hurley
  0 siblings, 2 replies; 32+ messages in thread
From: Thomas Gleixner @ 2016-02-29 19:14 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Eric Dumazet, Mike Galbraith, Francois Romieu, Eric Dumazet,
	David Miller, netdev, linux-kernel, Greg KH, dmaengine,
	John Ogness, Sebastian Andrzej Siewior, Andrew Morton

On Mon, 29 Feb 2016, Peter Hurley wrote:
> On 02/29/2016 10:24 AM, Eric Dumazet wrote:
> >> Just to be clear
> >>
> >> 		if (time_before(jiffies, end) && !need_resched() &&
> >> 		    --max_restart)
> >> 			goto restart;
> >>
> >> aborts softirq *even if 0ns have elapsed*, if NET_RX has woken a process.
> > 
> > Sure, now remove the 1st and 2nd condition.
> 
> Well just removing the 2nd condition has everything working fine,
> because that fixes the priority inversion.

No. It does not fix anything. It hides the shortcomings of the driver.
 
> However, when system resources are _not_ contended, it makes no
> sense to be forced to revert to ksoftirqd resolution, which is strictly
> intended as fallback.

No. You claim it is simply because your driver does not handle that situation
properly.
 
> Or flipping your argument on its head, why not just _always_ execute
> softirq in ksoftirqd?

Which is what that change effectivley does. And that makes a lot of sense,
because you get the softirq load under scheduler control and do not let the
softirq run as a context stealing entity which is completely uncontrollable by
the scheduler.

Running the softirq on return from interrupt can cause real priority
inversions.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 19:13           ` Peter Hurley
@ 2016-02-29 19:43             ` Eric Dumazet
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2016-02-29 19:43 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Mike Galbraith, Francois Romieu, Eric Dumazet, David Miller,
	netdev, linux-kernel, Greg KH, dmaengine, John Ogness,
	Sebastian Andrzej Siewior, Andrew Morton, Thomas Gleixner

On lun., 2016-02-29 at 11:13 -0800, Peter Hurley wrote:
> On 02/29/2016 07:27 AM, Eric Dumazet wrote:
> > On lun., 2016-02-29 at 07:03 -0800, Peter Hurley wrote:
> > 
> >> The reason why Eric's change is so effective for Eric's workload is
> >> that it fixes the problem where NET_RX keeps getting new network packets
> >> so it keeps looping, servicing more NET_RX softirq.
> > 
> > You have very little idea of what is happening in networking land.
> 
> While that is true, I can read a trace:
> 
>   ** already in NET_RX softirq **
> 
>   <idle>-0       0..s2   15us : kmem_cache_alloc: call_site=c08378e4 ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC
>   <idle>-0       0..s2   23us : netif_receive_skb_entry: dev=eth0 napi_id=0x0 queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x000
> 0 protocol=0x0800 ip_summed=0 hash=0x00000000 l4_hash=0 len=88 data_len=0 truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0x0
>   <idle>-0       0..s2   30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 len=88
>   <idle>-0       0d.s5   98us : sched_waking: comm=sshd pid=750 prio=120 target_cpu=000
>   <idle>-0       0d.s6  105us : sched_stat_sleep: comm=sshd pid=750 delay=3125230447 [ns]
>   <idle>-0       0dns6  110us+: sched_wakeup: comm=sshd pid=750 prio=120 target_cpu=000
>   <idle>-0       0dns4  123us+: timer_start: timer=dc940e9c function=tcp_delack_timer expires=9746 [timeout=10] flags=0x00000000
>   <idle>-0       0dnH3  150us : irq_handler_entry: irq=176 name=4a100000.ethernet
>   <idle>-0       0dnH3  153us : softirq_raise: vec=3 [action=NET_RX]
>   <idle>-0       0dnH3  155us : irq_handler_exit: irq=176 ret=handled
>   <idle>-0       0dnH3  160us : irq_handler_entry: irq=20 name=49000000.edma_ccint
>   <idle>-0       0dnH3  163us : irq_handler_exit: irq=20 ret=handled
>   <idle>-0       0.ns2  169us : napi_poll: napi poll on napi struct de465c30 for device eth0
>   <idle>-0       0.ns2  171us : softirq_exit: vec=3 [action=NET_RX]
> 
> 
> As you can see, NET_RX softirq is re-raised while in NET_RX softirq,
> as a result of receiving new packets. So NET_RX will keep looping,
> which is what I wrote.

Well, NET_RX can not be re-raised, it is a single bit flip.

It is 'raised' on this trace because the driver already rearmed the IRQ
so that hard irq handler could fire.

Anyway, it seems you know much better than me, so I will stop answering
your mails on this topic.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 19:14                     ` Thomas Gleixner
@ 2016-02-29 20:24                       ` David Miller
  2016-02-29 23:04                       ` Peter Hurley
  1 sibling, 0 replies; 32+ messages in thread
From: David Miller @ 2016-02-29 20:24 UTC (permalink / raw)
  To: tglx
  Cc: peter, eric.dumazet, umgwanakikbuti, romieu, edumazet, netdev,
	linux-kernel, gregkh, dmaengine, john.ogness, bigeasy, akpm

From: Thomas Gleixner <tglx@linutronix.de>
Date: Mon, 29 Feb 2016 20:14:36 +0100 (CET)

> On Mon, 29 Feb 2016, Peter Hurley wrote:
>> Or flipping your argument on its head, why not just _always_ execute
>> softirq in ksoftirqd?
> 
> Which is what that change effectivley does. And that makes a lot of sense,
> because you get the softirq load under scheduler control and do not let the
> softirq run as a context stealing entity which is completely uncontrollable by
> the scheduler.

+1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29 19:14                     ` Thomas Gleixner
  2016-02-29 20:24                       ` David Miller
@ 2016-02-29 23:04                       ` Peter Hurley
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Hurley @ 2016-02-29 23:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Eric Dumazet, Mike Galbraith, Francois Romieu, Eric Dumazet,
	David Miller, netdev, linux-kernel, Greg KH, dmaengine,
	John Ogness, Sebastian Andrzej Siewior, Andrew Morton

On 02/29/2016 11:14 AM, Thomas Gleixner wrote:
> On Mon, 29 Feb 2016, Peter Hurley wrote:
>> On 02/29/2016 10:24 AM, Eric Dumazet wrote:
>>>> Just to be clear
>>>>
>>>> 		if (time_before(jiffies, end) && !need_resched() &&
>>>> 		    --max_restart)
>>>> 			goto restart;
>>>>
>>>> aborts softirq *even if 0ns have elapsed*, if NET_RX has woken a process.
>>>
>>> Sure, now remove the 1st and 2nd condition.
>>
>> Well just removing the 2nd condition has everything working fine,
>> because that fixes the priority inversion.
> 
> No. It does not fix anything. It hides the shortcomings of the driver.
>  
>> However, when system resources are _not_ contended, it makes no
>> sense to be forced to revert to ksoftirqd resolution, which is strictly
>> intended as fallback.
> 
> No. You claim it is simply because your driver does not handle that situation
> properly.
>  
>> Or flipping your argument on its head, why not just _always_ execute
>> softirq in ksoftirqd?
> 
> Which is what that change effectivley does. And that makes a lot of sense,
> because you get the softirq load under scheduler control and do not let the
> softirq run as a context stealing entity which is completely uncontrollable by
> the scheduler.

Ok, fair enough.

However, charging [in the scheduler sense] very lightweight DMA completion for
one subsystem collectively with very heavyweight NET_RX (doing garbage collection
in softirq!) is hardly ideal.

The alternative being threaded interrupt handlers (which are essentially treated
as 0.000000 scheduler cost).

I just want to make sure that's the conscious choice being made, when the
patches for converting from tasklet to threaded irq start hitting subsystem
maintainers.

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-29  4:58     ` Mike Galbraith
  2016-02-29 15:03       ` Peter Hurley
@ 2016-03-07 15:31       ` Sebastian Andrzej Siewior
  2016-03-07 17:06         ` Mike Galbraith
  1 sibling, 1 reply; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-03-07 15:31 UTC (permalink / raw)
  To: Mike Galbraith, Francois Romieu
  Cc: Peter Hurley, Eric Dumazet, David Miller, netdev, linux-kernel,
	Greg KH, dmaengine, John Ogness, Andrew Morton

On 02/29/2016 05:58 AM, Mike Galbraith wrote:
> WRT -rt: if dma tasklets really do have hard (ish) constraints, -rt
> recently "broke" in the same way.. of all softirqs which are deferred
> to kthread context, due to a recent change, only timer/hrtimer are
> executed at realtime priority by default.

no. All softirqs are invoked in the context of the current process that
triggerd the softirq invocation. If NAPI goes on for too long (or other
softirq can't be executed in this context) it will continue in the
ksoftirqd. And this threads runs at a normal priority like it does in
mainline.
I adjusted it with mainline.

> 	-Mike
> 
Sebastian

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-02-27 18:19 Softirq priority inversion from "softirq: reduce latencies" Peter Hurley
  2016-02-27 20:13 ` Eric Dumazet
  2016-02-28  5:55 ` Mike Galbraith
@ 2016-03-07 15:48 ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-03-07 15:48 UTC (permalink / raw)
  To: Peter Hurley, Eric Dumazet
  Cc: David Miller, netdev, linux-kernel, Greg KH, dmaengine,
	John Ogness, Andrew Morton

On 02/27/2016 07:19 PM, Peter Hurley wrote:
> Hi Eric,

Hi Peter,

> Because both the uart driver (omap8250) and the dmaengine driver
> (edma) were (relatively) new, we assumed there was some race between
> starting a new rx DMA and processing the previous one.

Now after digesting the whole thread. I complained about this a long
while ago. After you start RX-DMA the DMA-engine is not programmed
immediately but deferred into softirq/tasklet. This is not the case for
continuous DMA transfer - those are programmed right away.

I don't remember that I found a reason why this simple programming has
to be deferred and can't happen immediately like it is the case for the
continuous DMA transfers. So I skipped that. RX-DMA in UART was working
well but for some reason omap's MMC-card driver refused to work.

Sebastian

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Softirq priority inversion from "softirq: reduce latencies"
  2016-03-07 15:31       ` Sebastian Andrzej Siewior
@ 2016-03-07 17:06         ` Mike Galbraith
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Galbraith @ 2016-03-07 17:06 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Francois Romieu
  Cc: Peter Hurley, Eric Dumazet, David Miller, netdev, linux-kernel,
	Greg KH, dmaengine, John Ogness, Andrew Morton

On Mon, 2016-03-07 at 16:31 +0100, Sebastian Andrzej Siewior wrote:
> On 02/29/2016 05:58 AM, Mike Galbraith wrote:
> > WRT -rt: if dma tasklets really do have hard (ish) constraints, -rt
> > recently "broke" in the same way.. of all softirqs which are deferred
> > to kthread context, due to a recent change, only timer/hrtimer are
> > executed at realtime priority by default.
> 
> no. All softirqs are invoked in the context of the current process that
> triggerd the softirq invocation. If NAPI goes on for too long (or other
> softirq can't be executed in this context) it will continue in the
> ksoftirqd. And this threads runs at a normal priority like it does in
> mainline.
> I adjusted it with mainline.

Yeah, that's what I said, the ksoftirqd case became the same in -rt.

	-Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-03-07 17:06 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-02-27 18:19 Softirq priority inversion from "softirq: reduce latencies" Peter Hurley
2016-02-27 20:13 ` Eric Dumazet
2016-02-27 20:29   ` Peter Hurley
2016-02-27 23:04     ` David Miller
2016-02-27 23:33       ` Peter Hurley
2016-02-28  1:59         ` Eric Dumazet
2016-02-28  2:10           ` Peter Hurley
2016-02-28  2:17             ` Eric Dumazet
2016-02-28  4:46             ` David Miller
2016-02-28  5:55 ` Mike Galbraith
2016-02-28 17:01   ` Francois Romieu
2016-02-29  4:58     ` Mike Galbraith
2016-02-29 15:03       ` Peter Hurley
2016-02-29 15:19         ` Eric Dumazet
2016-02-29 15:54           ` Peter Hurley
2016-02-29 16:21             ` Eric Dumazet
2016-02-29 18:05               ` Peter Hurley
2016-02-29 18:24                 ` Eric Dumazet
2016-02-29 18:53                   ` Peter Hurley
2016-02-29 19:14                     ` Thomas Gleixner
2016-02-29 20:24                       ` David Miller
2016-02-29 23:04                       ` Peter Hurley
2016-02-29 15:27         ` Eric Dumazet
2016-02-29 19:13           ` Peter Hurley
2016-02-29 19:43             ` Eric Dumazet
2016-02-29 15:40         ` Mike Galbraith
2016-02-29 15:58           ` Peter Hurley
2016-02-29 16:24             ` Eric Dumazet
2016-02-29 17:16         ` David Miller
2016-03-07 15:31       ` Sebastian Andrzej Siewior
2016-03-07 17:06         ` Mike Galbraith
2016-03-07 15:48 ` Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).