From mboxrd@z Thu Jan 1 00:00:00 1970
From: Konrad Rzeszutek Wilk
Subject: Re: irqbalance seg faults with 2.6.38 or later kernels [patch + solution included] if running under Xen hypervisor
Date: Wed, 11 May 2011 09:10:58 -0400
Message-ID: <20110511131058.GA4130@dumpdata.com>
References: <20110511003347.GA29851@dumpdata.com> <4DCA62150200007800040EEA@vpn.id2.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <4DCA62150200007800040EEA@vpn.id2.novell.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Jan Beulich
Cc: xen-devel@lists.xensource.com, crrodriguez@suse.de, nhorman@tuxdriver.com, bwalle@suse.de, pbrobinson@gmail.com, notting@redhat.com, arjan@linux.intel.com, anibal@debian.org
List-Id: xen-devel@lists.xenproject.org

On Wed, May 11, 2011 at 09:16:53AM +0100, Jan Beulich wrote:
> >>> On 11.05.11 at 02:33, Konrad Rzeszutek Wilk wrote:
> > The reason behind it is that irqbalance parses the /proc/interrupts
> > and whenever it hits something it can't understand:
> >
> > RES:  191614137   73904910   Rescheduling interrupts
> >
> > It will count the number of interrupts towards the IRQ 0. That IRQ does
> > exist
> > when the kernel boots under baremetal:
> >
> >   0:         46          0   IO-APIC-edge      timer
> >
> > but under Xen, the timer interrupts are initialized much later:
> >
> > 272:   41197188          0   xen-percpu-virq   timer0
> >
> > and the first IRQ that is used is not zero, but rather one:
> >
> >   1:      73037          0          0          0          0          0
> > xen-pirq-ioapic-edge  i8042
> >
> > so when irqbalance tries to account for the IRQ 'RES' to the IRQ 0
> > it fails and segfaults. The attached patch fixes it for whoever else is
> > hitting this problem.
>
> In the svn snapshot I have, I see
>
>         /* lines with letters in front are special, like NMI count. Ignore */
>         if (!(line[0]==' ' || (line[0]>='0' && line[0]<='9')))
>                 break;
>
> which I would think should be taking care of your problem (or
> I mis-read your description), and which was there already before

Not anymore. In kernels 2.6.37:

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
.. snip.
NMI:          0          0          0          0   Non-maskable interrupts
LOC:   12413629   12858323   16296183   11098466   Local timer interrupts

In 2.6.38 and later:

            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
 TRM:          0          0          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0          0          0   Machine check exceptions

They added in a space before the name. The check you mentioned
above could be augmented for this of course, as another solution for this.

> 0.56. Or are you perhaps having the problem because you have
> 1000+ interrupts, thus causing even the non-numeric strings to
> get space padded on their left? In that case I'd rather think above
> check should be either improved or removed (replaced by your
> solution).
>
> > I am not sure who the upstream maintainer is for this so
> > I am sending this patch to the different distros as well.
>
> Copying Neil and Arjan.
>
> Jan
>
> >
> > --- irqbalance-0.56.orig/procinterrupts.c 2010-06-10 10:45:55.000000000 -0400
> > +++ irqbalance-0.56/procinterrupts.c 2011-05-10 20:22:06.897465003 -0400
> > @@ -50,7 +50,7 @@ void parse_proc_interrupts(void)
> >                  int cpunr;
> >                  int number;
> >                  uint64_t count;
> > -                char *c, *c2;
> > +                char *c, *c2, *err;
> >
> >                  if (getline(&line, &size, file)==0)
> >                          break;
> > @@ -64,7 +64,11 @@ void parse_proc_interrupts(void)
> >                          continue;
> >                  *c = 0;
> >                  c++;
> > -                number = strtoul(line, NULL, 10);
> > +                number = strtoul(line, &err, 10);
> > +                /* Man page says that if that happens and number == 0, then it
> > +                 * failed to parse. */
> > +                if (err == line && number == 0)
> > +                        continue;
> >                  count = 0;
> >                  cpunr = 0;
> >
> >
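
For anyone who wants to try the strtoul() check from the patch outside of
irqbalance, here is a minimal standalone sketch. The helper name
parse_irq_label() and its return convention are made up for illustration and
are not part of irqbalance; it only assumes the label in front of the ':' has
already been cut out of the /proc/interrupts line, as the patched code does.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not irqbalance code): parse the label that precedes
 * the ':' on a /proc/interrupts line. Returns 1 and stores the IRQ number in
 * *out when the label starts with digits (optionally preceded by spaces).
 * Returns 0 and leaves *out untouched for labels with no digits, such as
 * "RES" or " TRM".
 */
static int parse_irq_label(const char *label, unsigned long *out)
{
    char *end;
    unsigned long value;

    assert(label != NULL && out != NULL);

    value = strtoul(label, &end, 10);
    /* strtoul() leaves end == label when it consumed no digits at all. */
    if (end == label)
        return 0;

    *out = value;
    return 1;
}
```

A caller would skip the line whenever the helper returns 0, so "RES" and the
space-prefixed " TRM" are never accounted to IRQ 0, while "0" and " 272" still
parse as real IRQ numbers.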