All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <dada1@cosmosbay.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrew Gallatin <gallatin@myri.com>,
	linux-kernel@vger.kernel.org, rick.jones2@hp.com, brice@myri.com,
	Paul Mackerras <paulus@samba.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Ingo Molnar <mingo@elte.hu>
Subject: [PATCH] sched: account system time properly
Date: Wed, 29 Apr 2009 09:46:17 +0200	[thread overview]
Message-ID: <49F805C9.9070303@cosmosbay.com> (raw)
In-Reply-To: <49F7DF44.8090907@cosmosbay.com>

Eric Dumazet a écrit :
> Andrew Morton a écrit :
>> (cc's added)
>>
>> On Thu, 23 Apr 2009 10:19:38 -0400
>> Andrew Gallatin <gallatin@myri.com> wrote:
>>
>>> When running netperf for some 10GbE tests, I noticed
>>> that  IRQ and SOFTIRQ CPU time is no longer reported for an
>>> otherwise idle CPU on recent kernels, at least for x86_64.
>>>
>>> If I take a 2-CPU system, and bind the NIC IRQ to CPU0, and
>>> bind the user-space netserver daemon to CPU1, the problem
>>> is obvious when blasting 10Gb/s of traffic at it.  I see
>>> no CPU used for irq or softirq on CPU0, even though it is
>>> handling 13K interrupts/sec:
>>>
>>>
>>>
>>> % mpstat -P 0 1
>>> Linux 2.6.30-rc1 (venice)       04/22/09
>>>
>>> 11:25:25     CPU   %user   %nice %system %iowait    %irq   %soft   %idle 
>>>     intr/s
>>> 11:25:26       0    0.00    0.00    0.00    0.00    0.00    0.00  100.00 
>>>   13248.00
>>> 11:25:27       0    0.00    0.00    0.00    0.00    0.00    0.00  100.00 
>>>   13280.00
>>>
>>> Common sense tells me that is wrong, and oprofile verifies there is
>>> a lot happening on CPU0.  Further, when I run a cpu-soaker in
>>> usermode bound to CPU0, I start to see irq, softirq, etc,
>>> being correctly identified:
>>>
>>> 11:28:02     CPU   %user   %nice %system %iowait    %irq   %soft   %idle 
>>>     intr/s
>>> 11:28:03       0   45.10    0.00    0.00    0.00    1.96   52.94    0.00 
>>>   13019.61
>>> 11:28:04       0   46.46    0.00    0.00    0.00    2.02   51.52    0.00 
>>>   13414.14
>>>
>>> The problem is observable, but much less obvious when using a more
>>> common, e1000 1GbE NIC (15% softirq is missing, rather than 50%).
>>>
>>> I spent a few hours git-bisecting until I finally got here:
>>>
>>> % git-bisect bad
>>> Bisecting: 0 revisions left to test after this
>>> [457533a7d3402d1d91fbc125c8bd1bd16dcd3cd4] fix scaled & unscaled cputime 
>>> accounting
>>>
>>> I have neither CONFIG_NO_HZ, CONFIG_VIRT_CPU_ACCOUNTING, or XEN configured.
>>>
>> Thanks for doing the bisection.
>>
>> 457533a7d3402d1d91fbc125c8bd1bd16dcd3cd4 was merged late last year, so
>> this regression has been around for a while.
>>
>> We might have fixed it in more recent kernels - stranger things have
>> happened ;)
>>
> 
> Same problem here on 32 bit current git kernel, 8 cpus machine.
> Easily reproductible with a ping flood, NIC interrupts bonded to CPU0
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> 
> vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0  0      0 396260 474568 2881560    0    0     0     4 43642 2859  0  0 100  0
>  0  0      0 396260 474568 2881560    0    0     0     0 55802 2840  0  0 100  0
>  0  0      0 396260 474576 2881560    0    0     0     4 51239 2861  0  0 100  0
>  0  0      0 396260 474576 2881560    0    0     0     0 47848 2847  0  0 100  0
>  0  0      0 396260 474576 2881560    0    0     0     4 54245 2841  0  0 100  0
>  0  0      0 396260 474576 2881560    0    0     0     0 46500 2851  0  0 100  0
>  0  0      0 396260 474576 2881560    0    0     0     4 48035 2839  0  0 100  0
>  0  0      0 396136 474580 2881560    0    0     0     0 41994 2855  0  0 100  0
>  0  0      0 396136 474580 2881560    0    0     0     4 46701 2842  0  0 100  0
> 
> If load is large enough to trigger ksoftirqd, then softirq time is correctly reported.
> 
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,100.0%si,  0.0%st
> 
> CONFIG_GENERIC_TIME=y
> # CONFIG_GENERIC_TIME_VSYSCALL is not set
> CONFIG_KTIME_SCALAR=y
> CONFIG_TIMERFD=y
> # CONFIG_NO_HZ is not set
> CONFIG_HIGH_RES_TIMERS=y
> CONFIG_HPET_TIMER=y
> # CONFIG_HZ_100 is not set
> # CONFIG_HZ_250 is not set
> # CONFIG_HZ_300 is not set
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
> CONFIG_X86_PM_TIMER=y
> # CONFIG_NETFILTER_XT_MATCH_TIME is not set
> CONFIG_SERIAL_8250_RUNTIME_UARTS=4
> # CONFIG_HW_RANDOM_TIMERIOMEM is not set
> # CONFIG_HANGCHECK_TIMER is not set
> # CONFIG_MACHZ_WDT is not set
> CONFIG_PRINTK_TIME=y
> CONFIG_TIMER_STATS=y
> 
> 
> ping flood received on eth0 / eth1 interfaces (PCI-MSI-edge IRQS)
> 
> $ cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   0:        343          0          0          0          0          0          0          0   IO-APIC-edge      timer
>   1:          1          0          1          0          0          0          0          0   IO-APIC-edge      i8042
>   9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
>  12:          0          0          0          1          1          1          1          1   IO-APIC-edge      i8042
>  14:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide0
>  15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide1
>  16:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb1
>  17:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
>  18:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
>  19:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>  22:         12         10         13          9         14         14         14          8   IO-APIC-fasteoi   uhci_hcd:usb5
>  33:     109191     165558     162520     156506     160469     161549     162655     161780   PCI-MSI-edge      cciss0
>  34:   91628623          0          0          0          0          0          0          0   PCI-MSI-edge      eth1
>  36:   18517783          0          0          0          0          0          0          0   PCI-MSI-edge      eth0
> NMI:  227238100  165960221  154895362  165425712  111786333  116776935  114633666  111315114   Non-maskable interrupts
> LOC:   88434578   80673549   73277767   78134420   77183920   71046805   68905716   70869751   Local timer interrupts
> SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
> RES:      16425     216454      73793     347021     125963      30655      47623      31300   Rescheduling interrupts
> CAL:      13164      13182      13184      13175        100      13169      13151      13163   Function call interrupts
> TLB:      45488      52740      50498      53806      57548      54933      58091      56249   TLB shootdowns
> TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
> 
> 
> 
> commit 79741dd35713ff4f6fd0eafd59fa94e8a4ba922d did :
> 
> void account_process_tick(struct task_struct *p, int user_tick)
> {
>        cputime_t one_jiffy = jiffies_to_cputime(1);
>        cputime_t one_jiffy_scaled = cputime_to_scaled(one_jiffy);
>        struct rq *rq = this_rq();
> 
>        if (user_tick)
>                account_user_time(p, one_jiffy, one_jiffy_scaled);
>        else if (p != rq->idle)
>                account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
>                                    one_jiffy_scaled);
>        else
>                account_idle_time(one_jiffy);
> }
> 
> So, if IRQs are interrupting idle task, I guess if (p != rq->idle) will be false.
> 
> 

Maybe following patch is needed ?

[PATCH] sched: account system time properly

When idle task is interrupted by an IRQ, time accounting considers CPU is idle,
even while it should account for hard or softirq.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..26efa47 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4732,7 +4732,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 
 	if (user_tick)
 		account_user_time(p, one_jiffy, one_jiffy_scaled);
-	else if (p != rq->idle)
+	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
 		account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
 				    one_jiffy_scaled);
 	else



  reply	other threads:[~2009-04-29  7:51 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-23 14:19 IRQ / SoftIRQ CPU time accounting broken by 457533a7d3402d1d91fbc125c8bd1bd16dcd3cd4 Andrew Gallatin
2009-04-28 23:30 ` Andrew Morton
2009-04-29  5:01   ` Eric Dumazet
2009-04-29  7:46     ` Eric Dumazet [this message]
2009-04-29  8:08       ` [PATCH] sched: account system time properly Martin Schwidefsky
2009-04-29  9:20         ` Eric Dumazet
2009-04-29  9:48           ` Martin Schwidefsky
2009-04-29 10:24             ` Ingo Molnar
2009-04-29 12:44               ` Eric Dumazet
2009-04-29 13:09                 ` [tip:sched/urgent] " tip-bot for Eric Dumazet
2009-04-29 14:58                   ` Andrew Morton
2009-04-29 19:06                     ` Ingo Molnar
2009-04-29 13:23                 ` [PATCH] " Andrew Gallatin
2009-04-29  8:02   ` IRQ / SoftIRQ CPU time accounting broken by 457533a7d3402d1d91fbc125c8bd1bd16dcd3cd4 Martin Schwidefsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49F805C9.9070303@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=akpm@linux-foundation.org \
    --cc=benh@kernel.crashing.org \
    --cc=brice@myri.com \
    --cc=gallatin@myri.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=paulus@samba.org \
    --cc=rick.jones2@hp.com \
    --cc=schwidefsky@de.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.