All of lore.kernel.org
 help / color / mirror / Atom feed
From: Maxim <maximlevitsky@gmail.com>
To: Len Brown <lenb@kernel.org>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
Date: Sun, 18 Mar 2007 01:00:01 +0200	[thread overview]
Message-ID: <200703180100.01889.maximlevitsky@gmail.com> (raw)
In-Reply-To: <200703161919.44458.lenb@kernel.org>

On Saturday 17 March 2007 01:19:44 Len Brown wrote:
> On Friday 16 March 2007 06:30, Maxim Levitsky wrote:
> > 
> > Good day, 
> > 
> > I want to report regressions I have with 2.6.21-rc3 kernel.
> > I use CONFIG_NO_HZ.
> 
> Do any of these issues go away with CONFIG_NO_HZ=n (or boot with nohz=n)
> or are they all independent of it?
> 
> thanks,
> -Len
> 
> > 1) Both suspend to disk and suspend to RAM are completely broken:
> > On vanilla 2.6.20 suspend to disk works perfectly and suspend to ram works _almost_ perfectly (I will tell about that later).
> > On 2.6.21-rc1 and later system hangs even before suspend begins (suspend to disk hangs before image write , and after suspend to ram , 
> > some devices are powered down (disk,power leds) , and some and not(fans, power) , and system hangs).
> > 
> > I did a git-bisect and I found which commit caused that:
> > 	e3c7db621bed4afb8e231cb005057f2feb5db557 - [PATCH] [PATCH] PM: Change code ordering in main.c (breaks  S3)
> > 	ed746e3b18f4df18afa3763155972c5835f284c5 - [PATCH] [PATCH] swsusp: Change code ordering in disk.c (breaks swsusp, I don't use it, but I tested it)
> >         259130526c267550bc365d3015917d90667732f1 - [PATCH] [PATCH] swsusp: Change code ordering in user.c (breaks uswsusp, that I use)
> > 
> > I reverted those commits and now system suspends correctly to disk, but suspend to ram showed some more regressions.
> > 
> > 
> > 2) ) After suspend to ram I get this 
> > 
> > Mar 14 00:22:23 MAIN kernel: [    2.072875] caller is check_tsc_sync_source+0x1d/0x100
> > Mar 14 00:22:23 MAIN kernel: [    2.072878]  [show_trace_log_lvl+26/48] show_trace_log_lvl+0x1a/0x30
> > Mar 14 00:22:23 MAIN kernel: [    2.072881]  [show_trace+18/32] show_trace+0x12/0x20
> > Mar 14 00:22:23 MAIN kernel: [    2.072884]  [dump_stack+22/32] dump_stack+0x16/0x20
> > Mar 14 00:22:23 MAIN kernel: [    2.072887]  [debug_smp_processor_id+173/176] debug_smp_processor_id+0xad/0xb0
> > Mar 14 00:22:23 MAIN kernel: [    2.072891]  [check_tsc_sync_source+29/256] check_tsc_sync_source+0x1d/0x100
> > Mar 14 00:22:23 MAIN kernel: [    2.072894]  [__cpu_up+80/384] __cpu_up+0x50/0x180
> > Mar 14 00:22:23 MAIN kernel: [    2.072897]  [_cpu_up+98/208] _cpu_up+0x62/0xd0
> > Mar 14 00:22:23 MAIN kernel: [    2.072901]  [cpu_up+46/80] cpu_up+0x2e/0x50
> > Mar 14 00:22:23 MAIN kernel: [    2.072903]  [enable_nonboot_cpus+110/160] enable_nonboot_cpus+0x6e/0xa0
> > Mar 14 00:22:23 MAIN kernel: [    2.072906]  [enter_state+326/496] enter_state+0x146/0x1f0
> > Mar 14 00:22:23 MAIN kernel: [    2.072909]  [state_store+174/192] state_store+0xae/0xc0
> > Mar 14 00:22:23 MAIN kernel: [    2.072912]  [subsys_attr_store+43/64] subsys_attr_store+0x2b/0x40
> > Mar 14 00:22:23 MAIN kernel: [    2.072917]  [sysfs_write_file+186/272] sysfs_write_file+0xba/0x110
> > Mar 14 00:22:23 MAIN kernel: [    2.072920]  [vfs_write+150/352] vfs_write+0x96/0x160
> > Mar 14 00:22:23 MAIN kernel: [    2.072923]  [sys_write+61/112] sys_write+0x3d/0x70
> > Mar 14 00:22:23 MAIN kernel: [    2.072926]  [sysenter_past_esp+93/153] sysenter_past_esp+0x5d/0x99
> > Mar 14 00:22:23 MAIN kernel: [    2.072929]  =======================
> > Mar 14 00:22:23 MAIN kernel: [    2.072931] checking TSC synchronization [CPU#0 -> CPU#1]:
> > Mar 14 00:22:23 MAIN kernel: [    2.092922] Measured 72051818872 cycles TSC warp between CPUs, turning off
> > 
> > It looks clear that preempt is enabled all the way in second cpu initialization, ( I think that at least in check_tsc_sync_source, it should be disabled,
> > shouldn't it ? )
> > 
> > Then I did add preempt_disable() / preempt_enable()  to this function , and  I still got this:
> > 
> > Mar 14 00:22:23 MAIN kernel: [    2.072931] checking TSC synchronization [CPU#0 -> CPU#1]:
> > Mar 14 00:22:23 MAIN kernel: [    2.092922] Measured 72051818872 cycles TSC warp between CPUs, turning off
> > 
> > It happens after second CPU is brought back on-line.
> > 
> > Now I understand that this is TSC sync problem and I tried to do some tests:
> > 
> >  I tried to disable/enable second CPU by hand, eg I did number of times,
> > 
> > echo "0" > /sys/devices/system/cpu/cpu1/online
> > echo "1" > /sys/devices/system/cpu/cpu1/online
> > 
> > and TSC sync was ok.
> > 
> > Then I disabled 2nd CPU, have suspended system to RAM , resumed it  , and then enabled 2nd CPU and got same error message.
> > Then I disabled cpufreq , and did above tests, and got same results.
> > I think that maybe this error is false, that there is some difference in TSC clock, but this difference is constant, and can be fixed
> > 
> > 3) Sometimes I get this (once in three boots or so)
> > 
> > [   36.217405] ENABLING IO-APIC IRQs
> > [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> > [   36.433917] APIC timer disabled due to verification failure.
> > 
> > And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
> > I haven't investigated that yet.
> > It looks like another new test that my hardware fails to perform... 
> > 
> > 
> > And now I want to tell you about that _almost_ working suspend to ram I got in 2.6.20:
> > To put it simply sometimes system wakes from resume, and sometimes not (about 1 in 5 times)
> > When it does it works perfectly. 
> > 
> > This is quite common problem but ironically my case is very different and harder to solve.
> > 
> > The fact is that I found thanks to RTC tricks , similar to PM_TRACE that system hangs in exactly three places (in one of them of course)
> > 
> > I put between instructions, code like that to save a position in RTC alarm which is not cleared on reboot
> > Note that this code uses ax, but I checked every time I put it that ax can be used (eg, it is loaded in next instruction)
> > 
> > #define TRACE(val) 			 \
> > 	movb	$0x01, %al		; \
> > 	outb	%al, $0x70 		; \
> > 	movb	$ ## val, %al 		; \
> > 	outb	%al, $0x71
> > 
> > It hangs very early in asm code, and those are places:
> > 
> > 1) /home/maxim/software/kernel/linux-2.6.20-mod/arch/i386/kernel/acpi/wakeup.S:wakeup_start:
> > 	ljmpl	$__KERNEL_CS,$wakeup_pmode_return
> > As I see that is first time wakeup low page is addressing kernel memory by jumping to it.
> > 
> > 2)  /home/maxim/software/kernel/linux-2.6.20-mod/arch/i386/kernel/acpi/wakeup.S:do_suspend_lowlevel
> > 	call	restore_registers 
> > It hangs exacly on that instruction, I can only see that this is first time protected stack is accessed
> > 
> > 3) /home/maxim/software/kernel/linux-2.6.20-mod/arch/i386/power/cpu.c :__restore_processor_state(struct saved_context *ctxt)
> > 	mtrr_ap_init();
> > 
> > It hangs somewhere inside this function, and because is is long I haven't found where exactly, it is easier to disable MTRR all together.
> > 
> > Note that all three places have different external behavier:
> > 
> > in first case, system powers on -> off -> on -> hangs
> > I did put test in RTC to see whenever BIOS call kernel twice but I know from this test for sure that it calls it only once,
> > It actualy makes sense, because exception occures before IDT is loaded, so system has no choice but to power down
> > 
> > in second case I see blinking leds -> almost sure a oops
> > 
> > in third case system just hangs
> > 
> > That's all, I will continue to dig those problems out
> > 
> > Thanks for attention,
> > 	Maxim Levitsky
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 

Issues that are caused by those

        e3c7db621bed4afb8e231cb005057f2feb5db557 - [PATCH] [PATCH] PM: Change code ordering in main.c (breaks  S3)
        ed746e3b18f4df18afa3763155972c5835f284c5 - [PATCH] [PATCH] swsusp: Change code ordering in disk.c (breaks swsusp, I don't use it, but I tested it)
        259130526c267550bc365d3015917d90667732f1 - [PATCH] [PATCH] swsusp: Change code ordering in user.c (breaks uswsusp, that I use)

doesn't go away, even worse, I tested latest Linus's git tree, and same hang both is suspend to disk and suspend to ram (before suspending)



>Mar 14 00:22:23 MAIN kernel: [    2.072875] caller is check_tsc_sync_source+0x1d/0x100
>Mar 14 00:22:23 MAIN kernel: [    2.072878]  [show_trace_log_lvl+26/48] show_trace_log_lvl+0x1a/0x30
>Mar 14 00:22:23 MAIN kernel: [    2.072881]  [show_trace+18/32] show_trace+0x12/0x20
>Mar 14 00:22:23 MAIN kernel: [    2.072884]  [dump_stack+22/32] dump_stack+0x16/0x20
>Mar 14 00:22:23 MAIN kernel: [    2.072887]  [debug_smp_processor_id+173/176] debug_smp_processor_id+0xad/0x


^ This thing is fixed (sorry for bothering you)


>Mar 14 00:22:23 MAIN kernel: [    2.072931] checking TSC synchronization [CPU#0 -> CPU#1]:
>Mar 14 00:22:23 MAIN kernel: [    2.092922] Measured 72051818872 cycles TSC warp between CPUs, turning off

^ This one I don't think is related to NO_HZ, maybe it is hardware problem, but it exist without NO_HZ


>[   36.217405] ENABLING IO-APIC IRQs
>[   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
>[   36.433917] APIC timer disabled due to verification failure.

This one is now discussed, I will look at it and it is not related to NO_HZ

And hangs with S3 as I said occur even in 2.6.20


And I forgot to tell about another problem with (now I know ,hi-resolution timers)
That before suspend to ram APIC timer is used and HPET is not used :

root@MAIN:/home/maxim# cat /sys/devices/system/clockevents/clockevents0/registered
lapic                F:0007 M:3(periodic) C: 1
hpet                 F:0003 M:1(shutdown) C: 0
lapic                F:0007 M:3(periodic) C: 0
root@MAIN:/home/maxim#   


But after suspend to ram HPET is 'woken'

root@MAIN:/home/maxim# cat /sys/devices/system/clockevents/clockevents0/registered
lapic                F:0007 M:3(one shoot) C: 1
hpet                 F:0003 M:3(one shoot) C: 0
lapic                F:0007 M:3(one shoot) C: 0

Note that I added those (one shoot), (periodic) descriptions, would be nice to have them in kernel, can I send a patch ?  ;-)

and I see average of 18 IRQs/sec on IRQ 0


Best regards,
	Maxim Levitsky


  reply	other threads:[~2007-03-17 23:00 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-03-16 10:30 [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far Maxim Levitsky
2007-03-16 23:19 ` Len Brown
2007-03-17 23:00   ` Maxim [this message]
2007-03-17 23:32     ` Thomas Gleixner
2007-12-30  7:50       ` Mike Galbraith
2007-12-30 14:53         ` Ingo Molnar
2007-03-20 11:54     ` sysfs ugly timer interface (was Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far) Pavel Machek
2007-03-22 15:28       ` Greg KH
2007-03-22 15:41         ` Thomas Gleixner
2007-03-23  1:24         ` sysfs q [was: sysfs ugly timer interface] Jan Engelhardt
2007-03-23  4:48           ` Greg KH
2007-03-23  6:05             ` Jan Engelhardt
2007-03-16 23:39 ` [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far Thomas Gleixner
2007-03-17 23:01   ` Maxim
2007-03-16 23:44 ` Thomas Gleixner
2007-03-17  0:04   ` [PATCH] i386: trust the PM-Timer calibration of the local APIC timer Thomas Gleixner
2007-03-17  7:22     ` Ingo Molnar
2007-03-17 13:24     ` Andi Kleen
2007-03-18  8:12     ` Andrew Morton
2007-03-18  8:45       ` Thomas Gleixner
2007-03-17  1:32   ` [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far Len Brown
2007-03-17  9:56     ` Thomas Gleixner
2007-03-17 11:05       ` Thomas Gleixner
2007-03-17 16:52       ` Thomas Gleixner
2007-03-17 10:32     ` Arjan van de Ven
2007-03-17 13:26       ` Andi Kleen
2007-03-20  4:27         ` Greg KH
2007-03-20  6:30           ` Thomas Gleixner
2007-03-20  9:14           ` Arjan van de Ven
2007-03-20 11:36           ` Andi Kleen
2007-03-20 11:41             ` Oliver Neukum
2007-03-17 22:45     ` Maxim
2007-03-20  5:04   ` Lee Revell
2007-03-20  5:36     ` Eric St-Laurent
2007-03-20  9:15       ` Arjan van de Ven
2007-03-20 18:04         ` Andy Lutomirski
2007-03-20 22:58         ` Eric St-Laurent

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200703180100.01889.maximlevitsky@gmail.com \
    --to=maximlevitsky@gmail.com \
    --cc=lenb@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.