* Process accounting in interrupt diabled cases
From: Alok Kataria @ 2009-03-06 23:03 UTC
  To: Ingo Molnar; +Cc: schwidefsky, H. Peter Anvin, LKML, virtualization

Hi,

I am not sure, but I think this may be a process accounting bug.

If interrupts are disabled for a considerable amount of time (say
multiple ticks), the process accounting code will still account a single
tick for such cases, on the next interrupt tick.
Shouldn't we have some way to fix that case, like we do for the NO_HZ
restart_sched_tick case, where we account for multiple idle ticks?

IOW, doesn't process accounting need to account for these cases when
interrupts are disabled for more than one tick period?

I stumbled across this while trying to find a solution to figure out the
amount of stolen time from Linux, when it is running under a hypervisor.
One of the solutions could be to ask the hypervisor directly for this
info, but in my quest to find a generic solution I think the below would
work too.
The total process time accounted by the system on a cpu (system, idle,
wait, etc.), when deducted from the amount the TSC counter has advanced
since boot, should give us this info about the cputime stolen from the
kernel (by either the hypervisor or other cases like, say, SMI) on a
particular CPU.
i.e. PCPU_STOLEN = (TSC since boot) - (PCPU-idle + system + wait + ...)

But for this to work, the above problem about process accounting in
interrupt disabled cases needs to be fixed first.

Let me know if I am overlooking any case where the above assumption
might not hold true.

Thanks,
Alok
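The subtraction proposed above can be sketched in C as follows. This is a minimal illustration, not kernel code: `struct cpu_times`, `accounted_time()` and `stolen_time()` are hypothetical names, the fields only loosely mirror the per-cpu cpustat buckets, and the sketch assumes the elapsed time and every bucket have already been converted to the same unit.

```c
#include <stdint.h>

/*
 * Hypothetical per-CPU snapshot. The field names are illustrative only;
 * all values are assumed to share one unit (for example nanoseconds).
 */
struct cpu_times {
    uint64_t user;
    uint64_t system;
    uint64_t idle;
    uint64_t iowait;
    uint64_t irq;
    uint64_t softirq;
};

/* Sum of everything accounted on this CPU, with steal left out. */
static uint64_t accounted_time(const struct cpu_times *t)
{
    return t->user + t->system + t->idle + t->iowait + t->irq + t->softirq;
}

/*
 * PCPU_STOLEN = total elapsed time - accounted time.
 * Clamped at zero so an overshoot in the accounted buckets does not
 * wrap around to a huge unsigned value.
 */
static uint64_t stolen_time(uint64_t total_elapsed, const struct cpu_times *t)
{
    uint64_t accounted = accounted_time(t);

    return total_elapsed > accounted ? total_elapsed - accounted : 0;
}
```

Any time the accounting code fails to credit (for example, ticks lost while interrupts were disabled) ends up in the result as stolen time, which is why the formula depends on the accounting being accurate.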
* Re: Process accounting in interrupt diabled cases
From: Jeremy Fitzhardinge @ 2009-03-07 0:37 UTC
  To: akataria; +Cc: Ingo Molnar, schwidefsky, virtualization, LKML, H. Peter Anvin

Alok Kataria wrote:
> Hi,
>
> I am not sure, but I think this may be a process accounting bug.
>
> If interrupts are disabled for a considerable amount of time ( say
> multiple ticks), the process accounting code will still account a single
> tick for such cases, on the next interrupt tick.
> Shouldn't we have some way to fix that case like we do for NO_HZ
> restart_sched_tick case, where we account for multiple idle ticks.
>
> IOW, doesn't process accounting need to account for these cases when
> interrupts are disabled for more than one tick period?
>

Why are interrupts being disabled for so long?  If it's happening often
enough to upset process accounting, then surely the fix is to not
disable interrupts for such a long time?

> I stumbled across this while trying to find a solution to figure out the
> amount of stolen time from Linux, when it is running under a hypervisor.
> One of the solutions could be to ask the hypervisor directly for this
> info, but in my quest to find a generic solution I think the below would
> work too.
> The total process time accounted by the system on a cpu ( system, idle,
> wait and etc) when deducted from the amount TSC counter has advanced
> since boot, should give us this info about the cputime stolen from the
> kernel

You're assuming that the tsc is always going to be advancing at a
constant rate in wallclock time?  Is that a good assumption?  Does
VMware virtualize the tsc to make this valid?  If something's going to
the effort of virtualizing tsc, how do you know they're not also
excluding stolen time?  Is the tsc guaranteed to be synchronized across
cpus?

> (by either hypervisor or other cases like say, SMI)

(In the past I've argued that stolen time is not a binary property; your
time can be "stolen" when you're running on a slow CPU, as well as by
explicit behind-the-scenes context switching, which could well be worth
accounting for.)

> on a
> particular CPU.
> i.e. PCPU_STOLEN = (TSC since boot) - (PCPU-idle + system + wait + ...)
>

What timebase is the kernel using to measure idle, system, wait, ...?
Presumably something that doesn't include stolen time.  In that case
this just comes down to "PCPU_STOLEN = TOTAL_TIME - PCPU_UNSTOLEN_TIME",
where you're proposing that TOTAL_TIME is the tsc.

Direct use of the tsc definitely doesn't work in a Xen PV guest because
the tsc is the raw physical cpu tsc; but Xen also provides everything
you need to derive a globally-meaningful timebase from the tsc.  Xen
also provides per-vcpu info on time spent blocked, runnable (ie, could
run but no pcpu available), running and offline.

    J
* Re: Process accounting in interrupt diabled cases
From: Alok Kataria @ 2009-03-07 0:59 UTC
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, schwidefsky@de.ibm.com,
	virtualization@lists.linux-foundation.org, LKML, H. Peter Anvin

On Fri, 2009-03-06 at 16:37 -0800, Jeremy Fitzhardinge wrote:
> Alok Kataria wrote:
> > Hi,
> >
> > I am not sure, but I think this may be a process accounting bug.
> >
> > If interrupts are disabled for a considerable amount of time ( say
> > multiple ticks), the process accounting code will still account a single
> > tick for such cases, on the next interrupt tick.
> > Shouldn't we have some way to fix that case like we do for NO_HZ
> > restart_sched_tick case, where we account for multiple idle ticks.
> >
> > IOW, doesn't process accounting need to account for these cases when
> > interrupts are disabled for more than one tick period?
> >
>
> Why are interrupts being disabled for so long?  If it's happening often
> enough to upset process accounting, then surely the fix is to not
> disable interrupts for such a long time?

I don't know if there are instances when interrupts are actually
disabled for such a long time in the kernel, but I don't see a reason
why this might not be happening currently, i.e. do we have a way to
detect such cases?
I noticed this problem (with process accounting) only when testing my
stolen time theory below, in which I had intentionally disabled
interrupts for long.

So, in case of buggy code which disables interrupts for long, this could
affect process accounting and could result in the stolen time being
reported incorrectly (considering the stolen time idea mentioned below
is okay).

>
> > I stumbled across this while trying to find a solution to figure out the
> > amount of stolen time from Linux, when it is running under a hypervisor.
> > One of the solutions could be to ask the hypervisor directly for this
> > info, but in my quest to find a generic solution I think the below would
> > work too.
> > The total process time accounted by the system on a cpu ( system, idle,
> > wait and etc) when deducted from the amount TSC counter has advanced
> > since boot, should give us this info about the cputime stolen from the
> > kernel
>
> You're assuming that the tsc is always going to be advancing at a
> constant rate in wallclock time?  Is that a good assumption?  Does
> VMware virtualize the tsc to make this valid?  If something's going to
> the effort of virtualizing tsc, how do you know they're not also
> excluding stolen time?

Yes, TSC is the correct thing at least for VMware over here. But my idea
is not to advocate using TSC here; if it doesn't work for Xen we could
use something else which gives a notion of Total_time there, and a
paravirt call to read that can be added. I don't know what that would be
for Xen, but you would know better; please suggest if there is already a
paravirt call which gets that value for Xen.

> Is the tsc guaranteed to be synchronized across
> cpus?
>

Why is that a requirement? We have the kstat per cpu; the total_time can
also be per cpu.

> > (by either hypervisor or other cases like say, SMI)
>
> (In the past I've argued that stolen time is not a binary property; your
> time can be "stolen" when you're running on a slow CPU, as well as by
> explicit behind-the-scenes context switching, which could well be worth
> accounting for.)
>
> > on a
> > particular CPU.
> > i.e. PCPU_STOLEN = (TSC since boot) - (PCPU-idle + system + wait + ...)
> >
>
> What timebase is the kernel using to measure idle, system, wait, ...?
> Presumably something that doesn't include stolen time.  In that case
> this just comes down to "PCPU_STOLEN = TOTAL_TIME - PCPU_UNSTOLEN_TIME",
> where you're proposing that TOTAL_TIME is the tsc.

Again, I am not proposing to use the tsc; please suggest what works for
Xen.
And about the PCPU_UNSTOLEN_TIME, I am proposing it could be a summation
of all the fields in kstat_cpu.cpustat except the steal value.

>
> Direct use of the tsc definitely doesn't work in a Xen PV guest because
> the tsc is the raw physical cpu tsc; but Xen also provides everything
> you need to derive a globally-meaningful timebase from the tsc.  Xen
> also provides per-vcpu info on time spent blocked, runnable (ie, could
> run but no pcpu available), running and offline.
>

That means it should be easy to get the TOTAL_TIME value then?

Thanks,
Alok
* Re: Process accounting in interrupt diabled cases
From: Jeremy Fitzhardinge @ 2009-03-07 1:26 UTC
  To: akataria
  Cc: Ingo Molnar, schwidefsky@de.ibm.com,
	virtualization@lists.linux-foundation.org, LKML, H. Peter Anvin

Alok Kataria wrote:
> I don't know if there are instances when interrupts are actually
> disabled for such a long time in the kernel, but I don't see a reason
> why this might not be happening currently, i.e. do we have a way to
> detect such cases?
> I noticed this problem (with process accounting) only when testing my
> stolen time theory below, in which I had intentionally disabled
> interrupts for long.
>
> So, in case of buggy code which disables interrupts for long, this could
> affect process accounting and could result in the stolen time being
> reported incorrectly (considering the stolen time idea mentioned below
> is okay).
>

Does it matter how long interrupts are actually disabled?  Tickless is
definitely the preferred mode of operation for any virtual guest, so
time accounting is independent from when the actual timer interrupts
occur; it's quite possible we'll see no interrupts for a long time
indeed.  If we accrue unstolen time to a task when we actually context
switch then the accounting will all work out, no?

>>> I stumbled across this while trying to find a solution to figure out the
>>> amount of stolen time from Linux, when it is running under a hypervisor.
>>> One of the solutions could be to ask the hypervisor directly for this
>>> info, but in my quest to find a generic solution I think the below would
>>> work too.
>>> The total process time accounted by the system on a cpu ( system, idle,
>>> wait and etc) when deducted from the amount TSC counter has advanced
>>> since boot, should give us this info about the cputime stolen from the
>>> kernel
>>>
>> You're assuming that the tsc is always going to be advancing at a
>> constant rate in wallclock time?  Is that a good assumption?  Does
>> VMware virtualize the tsc to make this valid?  If something's going to
>> the effort of virtualizing tsc, how do you know they're not also
>> excluding stolen time?
>>
>
> Yes, TSC is the correct thing at least for VMware over here. But my idea
> is not to advocate using TSC here; if it doesn't work for Xen we could
> use something else which gives a notion of Total_time there, and a
> paravirt call to read that can be added. I don't know what that would be
> for Xen, but you would know better; please suggest if there is already a
> paravirt call which gets that value for Xen.
>

Yes, Xen already accounts stolen time in its timer interrupt handler.

But more significantly it uses unstolen time as the timebase for
sched_clock() so that the scheduler will only credit a task for the
actual amount of time it spends executing, rather than a full wallclock
timeslice.

>> What timebase is the kernel using to measure idle, system, wait, ...?
>> Presumably something that doesn't include stolen time.  In that case
>> this just comes down to "PCPU_STOLEN = TOTAL_TIME - PCPU_UNSTOLEN_TIME",
>> where you're proposing that TOTAL_TIME is the tsc.
>>
>
> Again, I am not proposing to use the tsc; please suggest what works for
> Xen.
> And about the PCPU_UNSTOLEN_TIME, I am proposing it could be a summation
> of all the fields in kstat_cpu.cpustat except the steal value.
>

No, I'm not advocating anything in particular; I'm trying to understand
your proposal.

You're positing two timebases: one which measures wallclock time (that
could be the tsc in VMware's case), and another which measures unstolen
time, so you can tell how long a cpu has spent actually running
something.  What's your proposal for the unstolen clock?  How does the
kernel measure unstolen time?  It can't measure it with the tsc, because
that would include any stolen time in the measurement.

Also, I'm not sure it makes sense to distinguish between vcpu idle time
and stolen time.  If a vcpu is idle/blocked, how can you steal time from
it?  It's only stolen time if it wants to run but can't.

>> Direct use of the tsc definitely doesn't work in a Xen PV guest because
>> the tsc is the raw physical cpu tsc; but Xen also provides everything
>> you need to derive a globally-meaningful timebase from the tsc.  Xen
>> also provides per-vcpu info on time spent blocked, runnable (ie, could
>> run but no pcpu available), running and offline.
>>
>
> That means it should be easy to get the TOTAL_TIME value then?
>

Yes, Xen exposes all the necessary accounting information directly.  It
also guarantees that blocked+runnable+running+offline == wallclock time.

See xen/time.c: do_stolen_accounting(), and how it accumulates stolen
time with account_steal_ticks().

    J
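The four per-vcpu states and the invariant Jeremy mentions can be sketched as below. This is a hedged illustration rather than a copy of do_stolen_accounting(): `struct vcpu_runstate` and the helper names are hypothetical, and treating runnable plus offline time as stolen (with blocked time as idle) is an assumption of the sketch that follows the rule "it's only stolen time if it wants to run but can't".

```c
#include <stdint.h>

/* Hypothetical per-vcpu counters, one per state named in the message. */
struct vcpu_runstate {
    uint64_t running;   /* executing on a physical cpu */
    uint64_t runnable;  /* wanted to run, no physical cpu available */
    uint64_t blocked;   /* idle, waiting for an event */
    uint64_t offline;   /* vcpu taken offline */
};

/* Time spent in each state between two snapshots. */
static struct vcpu_runstate runstate_delta(const struct vcpu_runstate *now,
                                           const struct vcpu_runstate *prev)
{
    struct vcpu_runstate d = {
        .running  = now->running  - prev->running,
        .runnable = now->runnable - prev->runnable,
        .blocked  = now->blocked  - prev->blocked,
        .offline  = now->offline  - prev->offline,
    };
    return d;
}

/* The invariant: the four states add up to wallclock time. */
static uint64_t wallclock_time(const struct vcpu_runstate *d)
{
    return d->running + d->runnable + d->blocked + d->offline;
}

/* Assumption of this sketch: runnable and offline time count as stolen. */
static uint64_t stolen_time(const struct vcpu_runstate *d)
{
    return d->runnable + d->offline;
}

/* Assumption of this sketch: blocked time counts as idle, never stolen. */
static uint64_t idle_time(const struct vcpu_runstate *d)
{
    return d->blocked;
}
```

With counters like these the guest never has to infer stolen time by subtraction; it reads it off directly, and idle time cannot be mistaken for stolen time because the two come from different counters.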
* Re: Process accounting in interrupt diabled cases
From: Alok Kataria @ 2009-03-07 8:59 UTC
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, schwidefsky@de.ibm.com,
	virtualization@lists.linux-foundation.org, LKML, H. Peter Anvin

On Fri, 2009-03-06 at 17:26 -0800, Jeremy Fitzhardinge wrote:
> Alok Kataria wrote:
> > I don't know if there are instances when interrupts are actually
> > disabled for such a long time in the kernel, but I don't see a reason
> > why this might not be happening currently, i.e. do we have a way to
> > detect such cases?
> > I noticed this problem (with process accounting) only when testing my
> > stolen time theory below, in which I had intentionally disabled
> > interrupts for long.
> >
> > So, in case of buggy code which disables interrupts for long, this could
> > affect process accounting and could result in the stolen time being
> > reported incorrectly (considering the stolen time idea mentioned below
> > is okay).
> >
>
> Does it matter how long interrupts are actually disabled?  Tickless is
> definitely the preferred mode of operation for any virtual guest, so
> time accounting is independent from when the actual timer interrupts
> occur; it's quite possible we'll see no interrupts for a long time
> indeed.

Yes, that's alright; all the time when the vcpu was idle and scheduled
out will anyway be accounted as idle time. As mentioned in my earlier
mail (and if my understanding is not wrong), this is handled by
tick_nohz_restart_sched_tick.

But I was talking about a case where we have this code:

	local_irq_disable()
	some_work()
	local_irq_enable()

If this some_work() executed for, say, 2 ticks, shouldn't the process
executing this be accounted 2 ticks of system time? According to my
understanding, we will account a single tick for this, right?

I agree that this "some_work" is wrong to start with and it shouldn't
keep interrupts disabled for so long, but I am considering this case
just to understand if process accounting needs any changes for it.

> If we accrue unstolen time to a task when we actually context
> switch then the accounting will all work out, no?

But getting the unstolen time is a problem, and I want to try and see if
this unstolen time can be calculated without querying the hypervisor.

>
> >>> I stumbled across this while trying to find a solution to figure out the
> >>> amount of stolen time from Linux, when it is running under a hypervisor.
> >>> One of the solutions could be to ask the hypervisor directly for this
> >>> info, but in my quest to find a generic solution I think the below would
> >>> work too.
> >>> The total process time accounted by the system on a cpu ( system, idle,
> >>> wait and etc) when deducted from the amount TSC counter has advanced
> >>> since boot, should give us this info about the cputime stolen from the
> >>> kernel
> >>>
> >> You're assuming that the tsc is always going to be advancing at a
> >> constant rate in wallclock time?  Is that a good assumption?  Does
> >> VMware virtualize the tsc to make this valid?  If something's going to
> >> the effort of virtualizing tsc, how do you know they're not also
> >> excluding stolen time?
> >>
> >
> > Yes, TSC is the correct thing at least for VMware over here. But my idea
> > is not to advocate using TSC here; if it doesn't work for Xen we could
> > use something else which gives a notion of Total_time there, and a
> > paravirt call to read that can be added. I don't know what that would be
> > for Xen, but you would know better; please suggest if there is already a
> > paravirt call which gets that value for Xen.
> >
>
> Yes, Xen already accounts stolen time in its timer interrupt handler.

Yep, I did look at that code earlier, and that's how we can do it for
VMI too (actually it was there in VMI in 2.6.21 but we removed it during
the NO_HZ merge). I am trying to see if we could do this for the non-VMI
case, i.e. for the native kernel case.

>
> But more significantly it uses unstolen time as the timebase for
> sched_clock() so that the scheduler will only credit a task for the
> actual amount of time it spends executing, rather than a full wallclock
> timeslice.

That's a good point; that means we might have to change the
native_sched_clock definition too, to return this (rdtsc - stolen time)
value.

>
> >> What timebase is the kernel using to measure idle, system, wait, ...?
> >> Presumably something that doesn't include stolen time.  In that case
> >> this just comes down to "PCPU_STOLEN = TOTAL_TIME - PCPU_UNSTOLEN_TIME",
> >> where you're proposing that TOTAL_TIME is the tsc.
> >>
> >
> > Again, I am not proposing to use the tsc; please suggest what works for
> > Xen.
> > And about the PCPU_UNSTOLEN_TIME, I am proposing it could be a summation
> > of all the fields in kstat_cpu.cpustat except the steal value.
> >
>
> No, I'm not advocating anything in particular; I'm trying to understand
> your proposal.
>
> You're positing two timebases: one which measures wallclock time (that
> could be the tsc in VMware's case), and another which measures unstolen
> time, so you can tell how long a cpu has spent actually running
> something.  What's your proposal for the unstolen clock?  How does the
> kernel measure unstolen time?  It can't measure it with the tsc, because
> that would include any stolen time in the measurement.
>

I am saying that we should have just one timebase, one which measures
wallclock time. I am thinking that we can calculate the unstolen time
rather than asking the hypervisor. As mentioned earlier:

	UNSTOLEN time on a cpu = summation of kstat_cpu().cpustat
	variables (except the steal time variable)

So we can easily get the stolen_time as:

	stolen_time from a cpu = wallclock_time - Unstolen_time

> Also, I'm not sure it makes sense to distinguish between vcpu idle time
> and stolen time.  If a vcpu is idle/blocked, how can you steal time from
> it?  It's only stolen time if it wants to run but can't.

The vcpu idle time will be accounted correctly as idle time with
tick_nohz_restart_sched_tick. So I don't think we would ever confuse it
with stolen time, at least in the NO_HZ case.

Thanks,
Alok
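The undercounting in the local_irq_disable() example can be illustrated with a small model. This is a simplified sketch, not kernel code: `ticks_delivered()` is a hypothetical helper, and it assumes a periodic timer whose interrupts, if they fire while interrupts are off, collapse into a single pending interrupt that is delivered when interrupts are re-enabled. Each delivered interrupt stands for one tick of accounted time.

```c
/*
 * Count the timer ticks that reach the accounting code over
 * (0, duration] when interrupts are off during [off_start, off_end).
 * Assumes off_end <= duration. All arguments use the same time unit.
 */
static unsigned ticks_delivered(unsigned duration, unsigned period,
                                unsigned off_start, unsigned off_end)
{
    unsigned delivered = 0;
    int pending = 0;

    for (unsigned t = period; t <= duration; t += period) {
        if (t >= off_start && t < off_end)
            pending = 1;    /* masked: merges into one pending interrupt */
        else
            delivered++;    /* handled on time, one tick accounted */
    }

    /* The single pending interrupt is handled once at off_end. */
    return delivered + pending;
}
```

For a run of 100 time units with a tick every 10 units, ten ticks elapse. If interrupts are off from 25 to 45, the ticks at 30 and 40 merge into one, so `ticks_delivered(100, 10, 25, 45)` yields 9: one tick of system time is never charged to the task, and under the subtraction formula it would be reported as stolen.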
* Re: Process accounting in interrupt diabled cases
From: Martin Schwidefsky @ 2009-03-11 8:47 UTC
  To: akataria
  Cc: Jeremy Fitzhardinge, Ingo Molnar,
	virtualization@lists.linux-foundation.org, LKML, H. Peter Anvin

On Sat, 07 Mar 2009 00:59:33 -0800
Alok Kataria <akataria@vmware.com> wrote:

> Yes, that's alright; all the time when the vcpu was idle and scheduled
> out will anyway be accounted as idle time. As mentioned in my earlier
> mail (and if my understanding is not wrong), this is handled by
> tick_nohz_restart_sched_tick.
>
> But I was talking about a case where we have this code:
> 	local_irq_disable()
> 	some_work()
> 	local_irq_enable()
>
> If this some_work() executed for, say, 2 ticks, shouldn't the process
> executing this be accounted 2 ticks of system time? According to my
> understanding, we will account a single tick for this, right?

I don't know too much about x86 and Xen, but on s390 the tick is just a
convenient way to transfer the accumulated cpu time to the process and
the cpustat fields. The cpu time that has been used is determined by the
cpu timer. This would still work without any ticks at all, although then
you'd have to wait for the next context switch until the accumulated cpu
time becomes visible in the process and cpustat.
In short: the above example just works fine for us.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
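The approach Martin describes, where the tick only transfers cpu time that a hardware timer has already measured, can be sketched like this. It is a minimal model under stated assumptions, not the s390 implementation: `struct cpu_acct`, `struct task_acct` and `flush_cpu_time()` are hypothetical names, and the cpu timer is modelled as a monotonically increasing count of consumed cpu time.

```c
#include <stdint.h>

/* Hypothetical per-task accounting bucket. */
struct task_acct {
    uint64_t system_time;
};

/* Hypothetical per-cpu state: timer reading at the previous flush. */
struct cpu_acct {
    uint64_t last_timer;
};

/*
 * Transfer the cpu time consumed since the previous flush to the task.
 * Meant to be called from the tick or from a context switch; the total
 * credited does not depend on how often or how late it is called.
 */
static void flush_cpu_time(struct cpu_acct *cpu, struct task_acct *task,
                           uint64_t timer_now)
{
    task->system_time += timer_now - cpu->last_timer;
    cpu->last_timer = timer_now;
}
```

Because each flush credits a measured delta instead of a fixed one-tick quantum, a flush that arrives late after a long interrupts-off section still credits the full interval, so delayed or missing ticks postpone the accounting without losing any time.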