* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
@ 2012-01-20 2:20 Graeme Russ
  2012-01-20 4:05 ` Mulyadi Santosa
  0 siblings, 1 reply; 25+ messages in thread
From: Graeme Russ @ 2012-01-20 2:20 UTC
To: kernelnewbies

This may not be the best place to post this question, so please excuse...

I have a brand new Intel i5 / Z68 UEFI motherboard (ASRock Z68 Pro3 Gen3) with 8GB RAM running Fedora 16 (64 bit). It is experiencing lockups which send the video crazy (flashing screen), but nothing appears in /var/log/messages to indicate what is going wrong (no oops).

I'm sure it is graphics related. I thought of Cc'ing the i915 maintainer (Keith Packard) but figured I should wait until I am sure.

I _think_ it might be power related, as the hang never seems to occur while I'm actively using the computer - only after I have stopped for a few minutes does it happen, but not always. I have had it run overnight without a problem, I've had times when it has slept several times before hanging, and times when it has hung on the very first sleep after reboot.

I have pulled the latest mainline vanilla kernel (from about two days ago) and have configured, built and installed it, so I am in a position to do whatever hacking is required to isolate the issue.

So where should I begin? I have already posted a bug report in Bugzilla (https://bugzilla.kernel.org/show_bug.cgi?id=42597), but I want to isolate the problem to improve the report.

I'm somewhat familiar with kernel hacking, git, diffs etc, so I'm not at all afraid of some pretty intensive hacking. I have a second PC and a null-modem cable (plus Ethernet of course) if that helps, although I am not familiar with using tools like GDB, so anything requiring on-line debugging will need a bit of explaining.

Regards,

Graeme
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Mulyadi Santosa @ 2012-01-20 4:05 UTC
To: kernelnewbies

Hi Graeme :)

On Fri, Jan 20, 2012 at 09:20, Graeme Russ <graeme.russ@gmail.com> wrote:
> This may not be the best place to post this question, so please excuse...
>
> I have a brand new Intel i5 / Z68 UEFI motherboard (ASRock Z68 Pro3
> Gen3) with 8GB RAM running Fedora 16 (64 bit). It is experiencing
> lockups which send the video crazy (flashing screen), but nothing
> appears in /var/log/messages to indicate what is going wrong (no oops).
>
> I'm sure it is graphics related. I thought of Cc'ing the i915
> maintainer (Keith Packard) but figured I should wait until I am sure.
>
> I _think_ it might be power related, as the hang never seems to occur
> while I'm actively using the computer - only after I have stopped for
> a few minutes does it happen, but not always.

From your description alone, I think it's still hard to pinpoint the root of the problem. So I guess we're going to play cat and mouse a bit here.

First of all, IMHO you still need to do logging. Maybe, just maybe, the reason you don't see anything suspicious in /var/log/messages is that the lockup is bad enough to prevent the error messages from being picked up.

Try setting up netconsole (here is the clue: https://wiki.ubuntu.com/Kernel/Netconsole) or a serial console (try reading http://linux.koolsolutions.com/2009/03/29/howto-redirecting-linux-console-output-over-serial-port-on-another-machine/). Hopefully it will catch a suspicious message.

You could also enable several verbose debugging options under the "Kernel hacking" section of "make menuconfig".

Hope it helps....

NB: You said that if it is actively used, it runs fine? Sounds like a dynamic CPU frequency adjustment bug....

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
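For the netconsole route suggested above, the second PC only needs something listening for UDP datagrams. What follows is a minimal Python sketch, not a definitive setup. It assumes the hanging machine's netconsole parameters target UDP port 6666 on the second PC (adjust the port to whatever is actually configured); the log file name and both function names are made up for illustration.

```python
import socket
import time


def format_record(payload: bytes, sender: str, now: float) -> str:
    """Turn one received datagram into a timestamped (UTC) log line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {sender} {text}"


def listen(port: int = 6666, logfile: str = "netconsole.log") -> None:
    """Append every datagram arriving on `port` to `logfile` (assumed port and name)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    with open(logfile, "a", encoding="utf-8") as log:
        while True:
            payload, (host, _src_port) = sock.recvfrom(65535)
            line = format_record(payload, host, time.time())
            print(line)
            log.write(line + "\n")
            log.flush()  # write each line out at once so nothing sits in a buffer
```

Calling listen() blocks until interrupted. Because every line is flushed as it arrives, whatever the kernel managed to send before the hang is already in the log on the second PC.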
[parent not found: <4F1A8C32.3050907@gmail.com>]
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Mulyadi Santosa @ 2012-01-21 17:40 UTC
To: kernelnewbies

Hi again :)

On Sat, Jan 21, 2012 at 16:58, Graeme Russ <graeme.russ@gmail.com> wrote:
> Tell me about it :( - I've been here before, just not with the Linux kernel

At least I can tell you will hiccup less :) Anyway, you do have a data backup, right? Just in case it causes corruption and you lose some of your data. Better safe than sorry...

> I got netconsole working and it took a while to crash, but it finally did
> and guess what - No oops :(

OK, no luck.... How about serial console? Same result? (I am not sure which one will be helpful in your case.)

It just crossed my mind: does this bug also happen if you work strictly in the text console, i.e. no X Window at all?

> I've added more (see attached config)

From what I can see at a glance, you have already enabled enough debugging features.... so just keep it that way, at least for now.

> Also, I am using a very recent tip of linus' branch (commit
> ccb19d263fd1c9e34948e2158c53eacbff369344)

I think this is the way it will go: you will likely end up bisecting. Thing is, you first need to find a kernel version that works well. You said you have trouble with 3.1.9-fc... how about 3.1.8 or below?

> Hmmm, any way to test this theory?

Let's say, switching the CPU governor? What do you use now? ondemand? Then try switching to conservative. Or just lock the CPU frequency to a fixed value.

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
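The governor experiment suggested above can be scripted through the cpufreq sysfs files. A minimal Python sketch, assuming the kernel exposes /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor and that the script runs as root when writing; the function names are illustrative only, and whether "conservative" is available depends on how the kernel was configured.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_governors(root: Path = CPU_ROOT) -> dict:
    """Return {cpu_name: current governor} for every CPU that exposes cpufreq."""
    result = {}
    for gov_file in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        result[cpu_name] = gov_file.read_text().strip()
    return result


def set_governor(name: str, root: Path = CPU_ROOT) -> None:
    """Write `name` to every CPU's scaling_governor file (needs root on a real system)."""
    for gov_file in root.glob("cpu[0-9]*/cpufreq/scaling_governor"):
        gov_file.write_text(name + "\n")
```

A typical session would print read_governors(), call set_governor("conservative"), and then leave the machine idle to see whether the hang still occurs. Pinning the frequency would, under the same assumption, mean writing one value to both scaling_min_freq and scaling_max_freq in the same directory.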
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Graeme Russ @ 2012-01-23 11:12 UTC
To: kernelnewbies

Back again ;)

On 01/22/2012 04:40 AM, Mulyadi Santosa wrote:
> Hi again :)
>
> On Sat, Jan 21, 2012 at 16:58, Graeme Russ <graeme.russ@gmail.com> wrote:
>> Tell me about it :( - I've been here before, just not with the Linux kernel
>
> At least I can tell you will hiccup less :) Anyway, you do have a
> data backup, right? Just in case it causes corruption and you lose some
> of your data. Better safe than sorry...

I have /home and a separate HDD :)

>> I got netconsole working and it took a while to crash, but it finally did
>> and guess what - No oops :(
>
> OK, no luck.... How about serial console? Same result? (I am not sure
> which one will be helpful in your case.)

Haven't tried serial console - not convinced it will help (see below)

> It just crossed my mind: does this bug also happen if you work strictly in
> the text console, i.e. no X Window at all?
>
>> I've added more (see attached config)
>
> From what I can see at a glance, you have already enabled enough debugging
> features.... so just keep it that way, at least for now.
>
>> Also, I am using a very recent tip of linus' branch (commit
>> ccb19d263fd1c9e34948e2158c53eacbff369344)
>
> I think this is the way it will go: you will likely end up bisecting.
> Thing is, you first need to find a kernel version that works well. You
> said you have trouble with 3.1.9-fc... how about 3.1.8 or below?
>
>> Hmmm, any way to test this theory?
>
> Let's say, switching the CPU governor? What do you use now? ondemand? Then
> try switching to conservative. Or just lock the CPU frequency to a
> fixed value.

I managed to make the system more unstable as I trimmed the kernel and added debugging info. The symptoms were slightly different (graphic freezes rather than flashing screen).

To isolate i915, I've installed an nVidia 8600GT from my old machine. Even though I blacklisted i915, it still loads, so I've done an 'rmmod i915' to force it out.

So far it all seems pretty stable - I'll leave it powered on for a few days to make sure. I'm then going to buy an ATI 5450 (cheap and passively cooled) and see how that goes.

After a week of uptime, I'll switch back to the i915 driver and see what happens. If I get a hang, I'll get the i915 maintainer on board with my problem.

I think the lack of any kernel messages is due to a hardware conflict which hard-locks the machine. Even the reset button does not respond.

Regards,

Graeme
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Mulyadi Santosa @ 2012-01-23 17:21 UTC
To: kernelnewbies

Hi :)

On Mon, Jan 23, 2012 at 18:12, Graeme Russ <graeme.russ@gmail.com> wrote:
> I managed to make the system more unstable as I trimmed the kernel and
> added debugging info. The symptoms were slightly different (graphic freezes
> rather than flashing screen).

Ouch..... debugging brings more bugs... that's nasty..

> To isolate i915, I've installed an nVidia 8600GT from my old machine. Even
> though I blacklisted i915, it still loads, so I've done an 'rmmod i915' to
> force it out.
>
> So far it all seems pretty stable

If possible, try to run one or more games like Doom and see how it goes... IIRC there's a mode in Doom that runs a kind of automated play, but I forgot the name. Just enable as many features as you can to pound your graphics card at the highest resolution possible.

> - I'll leave it powered on for a few days
> to make sure. I'm then going to buy an ATI 5450 (cheap and passively
> cooled) and see how that goes.
>
> After a week of uptime, I'll switch back to the i915 driver and see what
> happens. If I get a hang, I'll get the i915 maintainer on board with my problem.
>
> I think the lack of any kernel messages is due to a hardware conflict which
> hard-locks the machine. Even the reset button does not respond.

Looks like you're already on the right debugging track. Keep us informed :)

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Graeme Russ @ 2012-01-26 10:15 UTC
To: kernelnewbies

Hi Again,

On 01/22/2012 04:40 AM, Mulyadi Santosa wrote:
> Hi again :)
>
> On Sat, Jan 21, 2012 at 16:58, Graeme Russ <graeme.russ@gmail.com> wrote:
>> Tell me about it :( - I've been here before, just not with the Linux kernel
>
> At least I can tell you will hiccup less :) Anyway, you do have a
> data backup, right? Just in case it causes corruption and you lose some
> of your data. Better safe than sorry...
>
>> I got netconsole working and it took a while to crash, but it finally did
>> and guess what - No oops :(
>
> OK, no luck.... How about serial console? Same result? (I am not sure
> which one will be helpful in your case.)
>
> It just crossed my mind: does this bug also happen if you work strictly in
> the text console, i.e. no X Window at all?
>
>> I've added more (see attached config)
>
> From what I can see at a glance, you have already enabled enough debugging
> features.... so just keep it that way, at least for now.
>
>> Also, I am using a very recent tip of linus' branch (commit
>> ccb19d263fd1c9e34948e2158c53eacbff369344)
>
> I think this is the way it will go: you will likely end up bisecting.
> Thing is, you first need to find a kernel version that works well. You
> said you have trouble with 3.1.9-fc... how about 3.1.8 or below?
>
>> Hmmm, any way to test this theory?
>
> Let's say, switching the CPU governor? What do you use now? ondemand? Then
> try switching to conservative. Or just lock the CPU frequency to a
> fixed value.

Looks like the problem is Motherboard/CPU - I installed Windows 7 on a spare HDD and replicated the fault. Also had it hang when using an nVidia G210 PCIe card.

Regards,

Graeme
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Mulyadi Santosa @ 2012-01-26 10:30 UTC
To: kernelnewbies

Hi :)

On Thu, Jan 26, 2012 at 17:15, Graeme Russ <graeme.russ@gmail.com> wrote:
> Looks like the problem is Motherboard/CPU - I installed Windows 7 on a spare
> HDD and replicated the fault. Also had it hang when using an nVidia G210
> PCIe card.

Hm alright..... so that is likely the culprit. What motherboard is it? ASUS? MSI? Something else?

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Graeme Russ @ 2012-01-26 10:45 UTC
To: kernelnewbies

On 01/26/2012 09:30 PM, Mulyadi Santosa wrote:
> Hi :)
>
> On Thu, Jan 26, 2012 at 17:15, Graeme Russ <graeme.russ@gmail.com> wrote:
>> Looks like the problem is Motherboard/CPU - I installed Windows 7 on a spare
>> HDD and replicated the fault. Also had it hang when using an nVidia G210
>> PCIe card.
>
> Hm alright..... so that is likely the culprit. What motherboard is it?
> ASUS? MSI? Something else?
>

ASRock Z68 Pro3 Gen3

There was a report of an automatic voltage regulation bug on the ASRock Z68 Pro3-M which could be resolved by specifying a fixed GPU voltage. The Gen3 does not allow setting a fixed GPU voltage (only an offset).

Regards,

Graeme
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Mulyadi Santosa @ 2012-01-26 11:00 UTC
To: kernelnewbies

Hi...

On Thu, Jan 26, 2012 at 17:45, Graeme Russ <graeme.russ@gmail.com> wrote:
> ASRock Z68 Pro3 Gen3
>
> There was a report of an automatic voltage regulation bug on the ASRock Z68
> Pro3-M which could be resolved by specifying a fixed GPU voltage. The Gen3
> does not allow setting a fixed GPU voltage (only an offset).

OK, I hope you can fix the problem now. This is new info for me too, btw.

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Graeme Russ @ 2012-01-31 3:00 UTC
To: kernelnewbies

Hi Again,

On Thu, Jan 26, 2012 at 9:45 PM, Graeme Russ <graeme.russ@gmail.com> wrote:
> On 01/26/2012 09:30 PM, Mulyadi Santosa wrote:
>> Hi :)
>>
>> On Thu, Jan 26, 2012 at 17:15, Graeme Russ <graeme.russ@gmail.com> wrote:
>>> Looks like the problem is Motherboard/CPU - I installed Windows 7 on a spare
>>> HDD and replicated the fault. Also had it hang when using an nVidia G210
>>> PCIe card.
>>
>> Hm alright..... so that is likely the culprit. What motherboard is it?
>> ASUS? MSI? Something else?
>>
>
> ASRock Z68 Pro3 Gen3
>
> There was a report of an automatic voltage regulation bug on the ASRock Z68
> Pro3-M which could be resolved by specifying a fixed GPU voltage. The Gen3
> does not allow setting a fixed GPU voltage (only an offset).

I _think_ I've solved the problem - SDRAM Voltage

The SDRAM I am using has a rated operating voltage of 1.5V +/- 0.075V. It looked like the motherboard BIOS had decided to use the upper limit of 1.575V when set to 'Auto'. I changed it to 'Manual' and set the SDRAM voltage to 1.5V and it's been running stably for the longest time it ever has.

Regards,

Graeme
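The voltage figures above reduce to a little arithmetic. A small Python sketch, assuming the quoted tolerance is in volts and working in integer millivolts to avoid floating-point rounding; the function names are invented for illustration and nothing here reads real hardware.

```python
def voltage_window_mv(nominal_mv: int, tolerance_mv: int) -> tuple:
    """Return the (min, max) allowed supply voltage in millivolts."""
    return nominal_mv - tolerance_mv, nominal_mv + tolerance_mv


def headroom_mv(setting_mv: int, nominal_mv: int, tolerance_mv: int) -> int:
    """Distance from the setting to the nearest window edge (negative if outside)."""
    low, high = voltage_window_mv(nominal_mv, tolerance_mv)
    return min(setting_mv - low, high - setting_mv)


low, high = voltage_window_mv(1500, 75)
print(low, high)                    # 1425 1575
print(headroom_mv(1575, 1500, 75))  # 0: the 'Auto' value sat exactly on the upper limit
print(headroom_mv(1500, 1500, 75))  # 75: the manual 1.5V setting is centred in the window
```

So the 'Auto' value was still inside the rated window, but with no margin left on the high side, whereas the manual setting leaves 75 mV either way.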
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Mulyadi Santosa @ 2012-01-31 4:25 UTC
To: kernelnewbies

Hi :)

On Tue, Jan 31, 2012 at 10:00, Graeme Russ <graeme.russ@gmail.com> wrote:
> I _think_ I've solved the problem - SDRAM Voltage

You've got my respect, man, you're really stubborn :)

> The SDRAM I am using has a rated operating voltage of 1.5V +/- 0.075V.
> It looked like the motherboard BIOS had decided to use the upper limit
> of 1.575V when set to 'Auto'. I changed it to 'Manual' and set the
> SDRAM voltage to 1.5V and it's been running stably for the longest
> time it ever has.

Thanks (again) for sharing. So this indeed has a tight relationship with RAM "misbehaviour". How did you figure it out? Did you inspect every piece of your hardware? I am curious to know (maybe others are too).

NB: this could be a good lesson that a system lockup might have absolutely nothing to do with the kernel.

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Graeme Russ @ 2012-01-31 4:52 UTC
To: kernelnewbies

Hi Mulyadi,

On Tue, Jan 31, 2012 at 3:25 PM, Mulyadi Santosa <mulyadi.santosa@gmail.com> wrote:
> Hi :)
>
> On Tue, Jan 31, 2012 at 10:00, Graeme Russ <graeme.russ@gmail.com> wrote:
>> I _think_ I've solved the problem - SDRAM Voltage
>
> You've got my respect, man, you're really stubborn :)
>
>> The SDRAM I am using has a rated operating voltage of 1.5V +/- 0.075V.
>> It looked like the motherboard BIOS had decided to use the upper limit
>> of 1.575V when set to 'Auto'. I changed it to 'Manual' and set the
>> SDRAM voltage to 1.5V and it's been running stably for the longest
>> time it ever has.
>
> Thanks (again) for sharing. So this indeed has a tight relationship with
> RAM "misbehaviour". How did you figure it out? Did you inspect every piece
> of your hardware? I am curious to know (maybe others are too).

The first symptom was that the screen would cycle through solid colours, so naturally the video 'card' was the first to be blamed. Of course, the i5 has the video built into the CPU, so the likelihood of a fault there is probably minimal, so the graphics driver was next in line.

So I installed an nVidia 8600GT and ran the nouveau driver (now I did get a glitch using this combo, but it wasn't a hang so I set that aside as a driver bug as well... could be related).

I then installed an nVidia G210 (it's a much smaller and quieter card). I experienced one hang with this combination (right, now things are getting interesting...).

In the meantime, I had tried fiddling with the IGPU voltage offset - no luck of course.

I removed my Linux hard drives and installed a spare hard drive and proceeded to install Windows 7 (using the on-chip Intel graphics). The machine hung once before the Windows 7 drivers were installed (promising).

I then installed the Windows 7 drivers and started downloading 3DMark 2006.

...Off to Australia Day Lunch with friends, back later...

OK, so 3DMark downloaded OK and the machine was still running some 6 hours later :(

Before getting a chance to install 3DMark, I had some other things to attend to... Glancing over: bright flashing colours!!! Linux had been exonerated :)

So I took it back to the shop I bought it from (long argument about voiding the warranty by taking off the cover blah blah blah). They ran a stress test without failure. I suggested they run memtest, which was met by 'Ah, yeah, I should have thought of that first' (and _I_ voided the warranty!)

So memtest failed, they put in another pair of memory modules and memtest failed again. Now the plot thickens... They put the old memory back and memtest passed! (what the!) Then they put the new memory in and, you guessed it, memtest passed!

So the old memory goes back in and more stress testing begins. It was run all day, no failure. So I went in and picked up the machine to take back home on the assumption that the problem was the seating of the memory modules - well, I couldn't really fault that analysis (another argument about voiding warranty, 'parts still in warranty, labour to run the tests not', and 'Oh, it failed under Linux, must be software related, not covered by warranty' Me: 'It failed before I opened the case', Them: 'doesn't matter, you opened the case') - Anyway, I got it back without paying anything, mumbling 'idiots' under my breath...

So I put my Linux drives back in and ran it overnight. It survived and so I thought the problem was solved but alas, it failed ten minutes after waking it up in the morning... bugger!

So RAM modules not the problem, that leaves CPU, Motherboard and PSU...

So I switched out the PSU - Fail (really quickly this time... interesting)

So that's when I decided to look at the SDRAM voltage - I looked up the datasheet for the RAM and compared it to the BIOS setting... Hmm, right at the upper limit of the spec'd DIMM voltage, so I set it to 1.5V manually.

Since then it has not skipped a beat (only been ~18 hours, but that's way longer than previously)

Now if it fails again, I'm just going to buy another motherboard. If that works, I'm going to have a _very_ interesting time with the shop I bought it from (after all, the parts are under warranty hardy, har har!)

> NB: this could be a good lesson that a system lockup might have
> absolutely nothing to do with the kernel.

Verily :)

Regards,

Graeme
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Mulyadi Santosa @ 2012-01-31 5:49 UTC
To: kernelnewbies

Hi :)

On Tue, Jan 31, 2012 at 11:52, Graeme Russ <graeme.russ@gmail.com> wrote:
> So RAM modules not the problem, that leaves CPU, Motherboard and PSU...
>
> So I switched out the PSU - Fail (really quickly this time... interesting)
>
> So that's when I decided to look at the SDRAM voltage - I looked up the
> datasheet for the RAM and compared it to the BIOS setting... Hmm, right
> at the upper limit of the spec'd DIMM voltage, so I set it to 1.5V
> manually.
>
> Since then it has not skipped a beat (only been ~18 hours, but that's way
> longer than previously)

Very well done troubleshooting! You're lucky that during the process, your memory module didn't burn out (voltage overload could do that sometimes AFAIK).

Lately many issues have surfaced which tend to have something to do with hardware, for example the excessive power consumption (ASPM). Sensors are all we need. Linux can already fetch numbers from SMART, CPU sensors etc. Maybe we need memory voltage sensors too? :)

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Graeme Russ @ 2012-01-31 10:44 UTC
To: kernelnewbies

Hi Mulyadi,

On 01/31/2012 04:49 PM, Mulyadi Santosa wrote:
> Hi :)
>
> On Tue, Jan 31, 2012 at 11:52, Graeme Russ <graeme.russ@gmail.com> wrote:
>> So RAM modules not the problem, that leaves CPU, Motherboard and PSU...
>>
>> So I switched out the PSU - Fail (really quickly this time... interesting)
>>
>> So that's when I decided to look at the SDRAM voltage - I looked up the
>> datasheet for the RAM and compared it to the BIOS setting... Hmm, right
>> at the upper limit of the spec'd DIMM voltage, so I set it to 1.5V
>> manually.
>>
>> Since then it has not skipped a beat (only been ~18 hours, but that's way
>> longer than previously)
>
> Very well done troubleshooting! You're lucky that during the

Thanks :)

> process, your memory module didn't burn out (voltage overload could do
> that sometimes AFAIK).

Hmm, the MB voltage was just at the upper limit of the operational voltage - I think the max. rated voltage goes a little higher, although there are two voltages that must be kept at correct levels relative to each other (one must always remain less than or equal to the other).

> Lately many issues have surfaced which tend to have something to do with
> hardware, for example the excessive power consumption (ASPM). Sensors
> are all we need. Linux can already fetch numbers from SMART, CPU
> sensors etc. Maybe we need memory voltage sensors too? :)

Yes, I would love to get all the on-board and on-die sensors working (temperatures, voltages, fan speeds). This is going to be an 'always on' media server which will also operate as a part-time dev machine. I would love to do complete system status logging. Any idea how I could find out which kernel options I need to do so?

Regards,

Graeme

P.S. 24+ hours :)
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Mulyadi Santosa @ 2012-01-31 14:41 UTC
To: kernelnewbies

Hi Graeme :)

On Tue, Jan 31, 2012 at 17:44, Graeme Russ <graeme.russ@gmail.com> wrote:
> Hmm, the MB voltage was just at the upper limit of the operational voltage
> - I think the max. rated voltage goes a little higher, although there are
> two voltages that must be kept at correct levels relative to each other
> (one must always remain less than or equal to the other).

Could it be that you need to upgrade your motherboard's BIOS firmware? (Since you mention that "auto" didn't work well.)

> Yes, I would love to get all the on-board and on-die sensors working
> (temperatures, voltages, fan speeds). This is going to be an 'always on'
> media server which will also operate as a part-time dev machine. I would
> love to do complete system status logging.

Quite likely you already have it (from your distro's stock kernel). Try the following commands:
sensors
powertop
hdparm

I forgot which one fetches the SMART-related info.

Some of them are in the form of kernel modules, some are configurable kernel configs. For example, the "sensors" command needs the acpi-cpufreq module. To be precise, use the "sensors-detect" command to find which kernel modules match your environment.

> Any idea how I could find out
> which kernel options I need to do so?

And some of them are in the form of CONFIG_DEBUG... try exploring "make menuconfig". I really can't recall them off the top of my head.... :)

> P.S. 24+ hours :)

You need to film this you know, it will be the new "24" :D

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
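For the "complete system status logging" wish above, one possible starting point is to read the kernel's hwmon sysfs files directly. This is a hedged Python sketch, not what any of the tools named above necessarily do. It assumes the loaded sensor drivers expose temperature, voltage and fan channels as tempN_input, inN_input and fanN_input files under /sys/class/hwmon/hwmonN (older kernels may place them one level down, under device/); the function name and the tuple layout are illustrative only.

```python
from pathlib import Path

HWMON_ROOT = Path("/sys/class/hwmon")

# hwmon reports temperatures in millidegrees Celsius, voltages in millivolts
# and fan speeds in RPM; the divisor converts the raw integer to the unit shown.
SCALE = {"temp": (1000, "C"), "in": (1000, "V"), "fan": (1, "RPM")}


def read_sensors(root: Path = HWMON_ROOT) -> list:
    """Return (chip, channel, value, unit) tuples for every readable *_input file."""
    readings = []
    for chip_dir in sorted(root.glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for entry in sorted(chip_dir.glob("*_input")):
            channel = entry.name[: -len("_input")]   # e.g. "temp1"
            kind = channel.rstrip("0123456789")      # e.g. "temp"
            if kind not in SCALE:
                continue
            try:
                raw = int(entry.read_text().strip())
            except (OSError, ValueError):
                continue  # skip channels that cannot be read or parsed
            divisor, unit = SCALE[kind]
            readings.append((chip, channel, raw / divisor, unit))
    return readings
```

Calling read_sensors() in a loop with a sleep in between, and appending each tuple together with a timestamp to a file, would give a simple status log. Which chips show up depends entirely on which sensor modules are loaded for the board.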
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Graeme Russ @ 2012-01-31 22:06 UTC
To: kernelnewbies

Hi Mulyadi,

On Tue, Jan 31, 2012 at 9:44 PM, Graeme Russ <graeme.russ@gmail.com> wrote:
> Hi Mulyadi,
>
> On 01/31/2012 04:49 PM, Mulyadi Santosa wrote:
>> Hi :)
>>
>> On Tue, Jan 31, 2012 at 11:52, Graeme Russ <graeme.russ@gmail.com> wrote:
>>> So RAM modules not the problem, that leaves CPU, Motherboard and PSU...
>>>
>>> So I switched out the PSU - Fail (really quickly this time... interesting)
>>>
>>> So that's when I decided to look at the SDRAM voltage - I looked up the
>>> datasheet for the RAM and compared it to the BIOS setting... Hmm, right
>>> at the upper limit of the spec'd DIMM voltage, so I set it to 1.5V
>>> manually.
>>>
>>> Since then it has not skipped a beat (only been ~18 hours, but that's way
>>> longer than previously)
>>
>> Very well done troubleshooting! You're lucky that during the
>
> Thanks :)

Alas, after 36 hours it froze again - tweaking the SDRAM voltage up and down didn't resolve the problem. I'm convinced it's motherboard related (most likely a flakey voltage regulator).

>> process, your memory module didn't burn out (voltage overload could do
>> that sometimes AFAIK).
>
> Hmm, the MB voltage was just at the upper limit of the operational voltage
> - I think the max. rated voltage goes a little higher, although there are
> two voltages that must be kept at correct levels relative to each other
> (one must always remain less than or equal to the other).
>
>> Lately many issues have surfaced which tend to have something to do with
>> hardware, for example the excessive power consumption (ASPM). Sensors
>> are all we need. Linux can already fetch numbers from SMART, CPU
>> sensors etc. Maybe we need memory voltage sensors too? :)
>
> Yes, I would love to get all the on-board and on-die sensors working
> (temperatures, voltages, fan speeds). This is going to be an 'always on'
> media server which will also operate as a part-time dev machine. I would
> love to do complete system status logging. Any idea how I could find out
> which kernel options I need to do so?

I recompiled with the various sensor bus options as modules and got it working.

Regards,

Graeme
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Mulyadi Santosa @ 2012-02-01 15:03 UTC
To: kernelnewbies

Hi Graeme :)

On Wed, Feb 1, 2012 at 05:06, Graeme Russ <graeme.russ@gmail.com> wrote:
> Alas, after 36 hours it froze again - tweaking the SDRAM voltage up and
> down didn't resolve the problem.

Bugger.... heat maybe?

> I recompiled with the various sensor bus options as modules and got it
> working.

Sounds good. BTW, which kernel version do you use now?

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related
From: Graeme Russ @ 2012-02-01 15:28 UTC
To: kernelnewbies

On Feb 2, 2012 2:04 AM, "Mulyadi Santosa" <mulyadi.santosa@gmail.com> wrote:
>
> Hi Graeme :)
>
> On Wed, Feb 1, 2012 at 05:06, Graeme Russ <graeme.russ@gmail.com> wrote:
> > Alas, after 36 hours it froze again - tweaking the SDRAM voltage up and
> > down didn't resolve the problem.
>
> Bugger.... heat maybe?

I suspect possibly a flakey cap in one of the voltage regulators

>
> > I recompiled with the various sensor bus options as modules and got it
> > working.
>
> Sounds good. BTW, which kernel version do you use now?

3.3.0+ (top of Linus' tree from a few days ago)

Regards,

Graeme
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related 2012-02-01 15:28 ` Graeme Russ @ 2012-02-01 15:34 ` Mulyadi Santosa 2012-02-01 15:44 ` Graeme Russ 0 siblings, 1 reply; 25+ messages in thread From: Mulyadi Santosa @ 2012-02-01 15:34 UTC (permalink / raw) To: kernelnewbies Hi again :) On Wed, Feb 1, 2012 at 22:28, Graeme Russ <graeme.russ@gmail.com> wrote: > I suspect possibly flakey cap in one of the voltage regulators if that's true, that's really nasty. And I think that also means that motherboard is pretty unusable for long usage period. Is it possible to just trade your motherboard with new one? > 3.3.0+ (top of Linus' tree from a few days ago) Somehow, IMHO, 2.6.x series is still better for stability. After looking at kernel.org, I think you can test with 2.6.32.55. -- regards, Mulyadi Santosa Freelance Linux trainer and consultant blog: the-hydra.blogspot.com training: mulyaditraining.blogspot.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related 2012-02-01 15:34 ` Mulyadi Santosa @ 2012-02-01 15:44 ` Graeme Russ 2012-02-01 22:11 ` Graeme Russ 0 siblings, 1 reply; 25+ messages in thread From: Graeme Russ @ 2012-02-01 15:44 UTC (permalink / raw) To: kernelnewbies Hello (again :)) On Feb 2, 2012 2:35 AM, "Mulyadi Santosa" <mulyadi.santosa@gmail.com> wrote: > > Hi again :) > > On Wed, Feb 1, 2012 at 22:28, Graeme Russ <graeme.russ@gmail.com> wrote: > > I suspect possibly flakey cap in one of the voltage regulators > > if that's true, that's really nasty. And I think that also means that > motherboard is pretty unusable for long usage period. > > Is it possible to just trade your motherboard with new one? Since the shop I bought it from is so difficult to deal with, I ordered a new one (different brand, Gigabyte this time) online like I always have before. > > 3.3.0+ (top of Linus' tree from a few days ago) > > Somehow, IMHO, 2.6.x series is still better for stability. After > looking at kernel.org, I think you can test with 2.6.32.55. I think Fedora 16 came with that (or close to) The freeze happens in Windows 7 as well Regards, Graeme ^ permalink raw reply [flat|nested] 25+ messages in thread
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related 2012-02-01 15:44 ` Graeme Russ @ 2012-02-01 22:11 ` Graeme Russ 2012-02-02 3:28 ` Mulyadi Santosa 0 siblings, 1 reply; 25+ messages in thread From: Graeme Russ @ 2012-02-01 22:11 UTC (permalink / raw) To: kernelnewbies Back again, On Thu, Feb 2, 2012 at 2:44 AM, Graeme Russ <graeme.russ@gmail.com> wrote: > Hello (again :)) > > > On Feb 2, 2012 2:35 AM, "Mulyadi Santosa" <mulyadi.santosa@gmail.com> wrote: >> >> Hi again :) >> >> On Wed, Feb 1, 2012 at 22:28, Graeme Russ <graeme.russ@gmail.com> wrote: >> > I suspect possibly flakey cap in one of the voltage regulators >> >> if that's true, that's really nasty. And I think that also means that >> motherboard is pretty unusable for long usage period. >> >> Is it possible to just trade your motherboard with new one? > > Since the shop I bought it from is so difficult to deal with, I ordered a > new one (different brand, Gigabyte this time) online like I always have > before. For anyone still following this thread, I just had another thought as to what may be the problem (although I have not tried it yet) - SDRAM timing. I am using Strontium DIMMs (hynix chips) which typically have 'lower' spec'd timings. Now the BIOS should pick up the correct timings by reading the DIMM's SPD data so the chance the timings are 'wrong' is probably pretty low, but I'll check. I may even wind the timing down _lower_ than the SDRAM is spec'd at (can't hurt) I'll be a bit peeved if that turns out to be it now that I've forked out another $100 odd for a new MB :( Regards, Graeme ^ permalink raw reply [flat|nested] 25+ messages in thread
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related 2012-02-01 22:11 ` Graeme Russ @ 2012-02-02 3:28 ` Mulyadi Santosa 2012-02-02 23:04 ` Graeme Russ 0 siblings, 1 reply; 25+ messages in thread From: Mulyadi Santosa @ 2012-02-02 3:28 UTC (permalink / raw) To: kernelnewbies Hi ! :) On Thu, Feb 2, 2012 at 05:11, Graeme Russ <graeme.russ@gmail.com> wrote: > For anyone still following this thread, I just had another thought as to > what may be the problem (although I have not tried it yet) - SDRAM timing. that CAS (latency) timing, right? > I am using Strontium DIMMs (hynix chips) which typically have 'lower' > spec'd timings. Now the BIOS should pick up the correct timings by reading > the DIMM's SPD data so the chance the timings are 'wrong' is probably > pretty low, but I'll check. I may even wind the timing down _lower_ than > the SDRAM is spec'd at (can't hurt) you need something that shall hammer memory.... repetitive kernel compilation with "make allyesconfig" perhaps? > I'll be a bit peeved if that turns out to be it now that I've forked out > another $100 odd for a new MB :( like old saying "no pain no gain"? :) But really, you get my salute.... who knows it will end up with excellent suggestion toward memory subsystem in Linux virtual memory? -- regards, Mulyadi Santosa Freelance Linux trainer and consultant blog: the-hydra.blogspot.com training: mulyaditraining.blogspot.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related 2012-02-02 3:28 ` Mulyadi Santosa @ 2012-02-02 23:04 ` Graeme Russ 2012-02-03 6:02 ` Mulyadi Santosa 0 siblings, 1 reply; 25+ messages in thread From: Graeme Russ @ 2012-02-02 23:04 UTC (permalink / raw) To: kernelnewbies Hi again :) On Thu, Feb 2, 2012 at 2:28 PM, Mulyadi Santosa <mulyadi.santosa@gmail.com> wrote: > Hi ! :) > > On Thu, Feb 2, 2012 at 05:11, Graeme Russ <graeme.russ@gmail.com> wrote: >> For anyone still following this thread, I just had another thought as to >> what may be the problem (although I have not tried it yet) - SDRAM timing. > > that CAS (latency) timing, right? Yes (plus a few others - there are a lot of timings for DDR3 SDRAM) >> I am using Strontium DIMMs (hynix chips) which typically have 'lower' >> spec'd timings. Now the BIOS should pick up the correct timings by reading >> the DIMM's SPD data so the chance the timings are 'wrong' is probably >> pretty low, but I'll check. I may even wind the timing down _lower_ than >> the SDRAM is spec'd at (can't hurt) > > you need something that shall hammer memory.... repetitive kernel > compilation with "make allyesconfig" perhaps? in bash: while true; do make clean; make; done :) >> I'll be a bit peeved if that turns out to be it now that I've forked out >> another $100 odd for a new MB :( > > like old saying "no pain no gain"? :) But really, you get my > salute.... who knows it will end up with excellent suggestion toward > memory subsystem in Linux virtual memory? Well it still failed (thankfully) and my new Gigabyte Z68P-DS3 just arrived so I can test it out over the weekend :) Regards, Graeme ^ permalink raw reply [flat|nested] 25+ messages in thread
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related 2012-02-02 23:04 ` Graeme Russ @ 2012-02-03 6:02 ` Mulyadi Santosa 0 siblings, 0 replies; 25+ messages in thread From: Mulyadi Santosa @ 2012-02-03 6:02 UTC (permalink / raw) To: kernelnewbies Hi :) On Fri, Feb 3, 2012 at 06:04, Graeme Russ <graeme.russ@gmail.com> wrote: > > Yes (plus a few others - there are a lot of timings for DDR3 SDRAM) > perhaps you remember on top of your head, what are those timings? > in bash: > > while true > do > make clean; make > done > > :) nice, that should do it. "make -j 4" punches more I guess :) > Well it still failed (thankfully) and my new Gigabyte Z68P-DS3 just arrived > so I can test it out over the weekend :) It's almost 100% that you can blame your motherboard. This ASRock, I don't know, is it a good brand? -- regards, Mulyadi Santosa Freelance Linux trainer and consultant blog: the-hydra.blogspot.com training: mulyaditraining.blogspot.com ^ permalink raw reply [flat|nested] 25+ messages in thread
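The endless build loop discussed above gives no record of how far it got. The following is a hedged shell sketch of a bounded variant that counts passes and stops at the first failing one, on the assumption that marginal memory may show up as a failed build (for example a compiler crash) before it shows up as a full hang. The function name `stress_loop` and its arguments are hypothetical; only `make clean`, `make` and `make -j 4` come from the thread.

```shell
#!/bin/sh
# Sketch only: repeat a command until it fails or MAX_PASSES is reached.
# Usage: stress_loop MAX_PASSES COMMAND [ARGS...]
# stress_loop is a hypothetical helper, not something from the thread.
stress_loop() {
    max="$1"
    shift
    pass=0
    while [ "$pass" -lt "$max" ]; do
        pass=$((pass + 1))
        if ! "$@"; then
            # Report which pass broke, and when, on stderr.
            echo "pass $pass failed at $(date)" >&2
            return 1
        fi
    done
    echo "completed $pass passes"
}
```

In a kernel tree this could be invoked as `stress_loop 100 sh -c 'make clean && make -j 4'`. A hard lockup will still stop the loop silently, so redirecting the output to a file on disk (or to a second machine) keeps at least the last completed pass number.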
* Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related 2012-01-31 4:52 ` Graeme Russ 2012-01-31 5:49 ` Mulyadi Santosa @ 2012-01-31 6:14 ` Fredrick 1 sibling, 0 replies; 25+ messages in thread From: Fredrick @ 2012-01-31 6:14 UTC (permalink / raw) To: kernelnewbies On 01/30/2012 08:52 PM, Graeme Russ wrote: > Hi Mulyadi, > > On Tue, Jan 31, 2012 at 3:25 PM, Mulyadi Santosa > <mulyadi.santosa@gmail.com> wrote: >> Hi :) >> >> On Tue, Jan 31, 2012 at 10:00, Graeme Russ<graeme.russ@gmail.com> wrote: >>> I _think_ I've solved the problem - SDRAM Voltage >> >> You got my respect man, you're really stubborn :) >> >>> The SDRAM I am using has a rated operating voltage of 1.5V +/- 0.075. >>> It looked like the motherboard BIOS had decided to use the upper limit >>> of 1.575V when set to 'Auto'. I changed it to 'Manual' and set the >>> SDRAM voltage to 1.5V and it's been running stably for the longest >>> time it ever has. >> >> Thanks (again) for sharing. So this indeed has tight relationship with >> RAM "misbehaviour". How do you know it? Do you inspect every piece of >> your hardware? I am curious to know (maybe others too). > > The first symptom was that the screen would cycle through solid colour, so > naturally the video 'card' was the first to be blamed. Of course, the i5 > has the video built into the CPU, so the likelihood of a fault there is > probably minimal, so the graphics driver was next in line > > So I installed an nVidia 8600GT and ran the nouveau driver (now I did get > a glitch using this combo, but it wasn't a hang so I set that aside as a > driver bug as well... could be related) > > I then installed an nVidia G210 (it's a much smaller and quieter card). I > experienced one hang with this combination (right, now things are getting > interesting...) 
> > In the meantime, I had tried fiddling with the IGPU voltage offset - no > luck of course > > I removed my Linux hard drives and installed a spare hard drive and > proceeded to install Windows 7 (using the on-chip Intel graphics). The > machine hung once before the Windows 7 drivers were installed (promising) > > I then installed the Windows 7 drivers and started downloading 3DMark 2006 > > ...Off to Australia Day Lunch with friends, back later... > > OK, so 3DMark downloaded OK and the machine was still running some 6 hours > later :( > > Before getting a chance to install 3DMark, I had some other things to > attend to... Glancing over bright flashing colours!!! Linux had been > exonerated :) > > So I took it back to the shop I bought it from (long argument about voiding > the warranty by taking off the cover blah blah blah). They ran a stress > test without failure. I suggested they run memtest which was met by 'Ah, > yeah, I should have thought of that first' (and _I_ voided the warranty!) > > So memtest failed, they put in another pair of memory modules and memtest > failed again. Now the plot thickens... They put the old memory back and > memtest passed! (what the!) then they put the new memory in and, you guessed > it, memtest passed! So the old memory goes back in and more stress testing > begins. > > It was run all day, no failure. So I went in and picked up the machine to > take back home on the assumption that the problem was the seating of the > memory modules - well I couldn't really fault that analysis (another > argument about voiding warranty, 'parts still in warranty, labour to run > the tests not', and 'Oh, it failed under Linux, must be software related, > not covered by warranty' Me: 'It failed before I opened the case', > Them: 'doesn't matter, you opened the case') - Anyway, I got it back > without paying anything mumbling 'idiots' under my breath... > > so I put my Linux drives back in and ran it overnight. 
It survived and so > I thought the problem was solved but alas, it failed ten minutes after > waking it up in the morning... bugger! > > So RAM modules not the problem, that leaves CPU, Motherboard and PSU... > > So I switched out the PSU - Fail (really quickly this time... interesting) > > So that's when I decided to look at the SDRAM voltage - I looked up the > datasheet for the RAM and compared it to the BIOS setting... Hmm, right > at the upper limit of the spec'd DIMM voltage, so I set it to 1.5V > manually. > > Since then it has not skipped a beat (only been ~18 hours, but that's way > longer than previously) > > Now if it fails again, I'm just going to buy another motherboard. If that > works, I'm going to have a _very_ interesting time with the shop I > bought it from (after all, the parts are under warranty hardy, har har!) > >> NB: it could be a good lesson that system lock up might have >> absolutely nothing to do with kernel. > > Verily :) > > Regards, > > Graeme > Thank you Graeme for sharing this experience. Amazing persistence! I would not have gone this far. :) Sometimes you have to doubt even the nuts and bolts :) -Fredrick ^ permalink raw reply [flat|nested] 25+ messages in thread
Thread overview: 25+ messages
2012-01-20 2:20 Best way to debug an Intel Core i5 hang - likely graphics (possibly power) related Graeme Russ
2012-01-20 4:05 ` Mulyadi Santosa
[not found] ` <4F1A8C32.3050907@gmail.com>
2012-01-21 17:40 ` Mulyadi Santosa
2012-01-23 11:12 ` Graeme Russ
2012-01-23 17:21 ` Mulyadi Santosa
2012-01-26 10:15 ` Graeme Russ
2012-01-26 10:30 ` Mulyadi Santosa
2012-01-26 10:45 ` Graeme Russ
2012-01-26 11:00 ` Mulyadi Santosa
2012-01-31 3:00 ` Graeme Russ
2012-01-31 4:25 ` Mulyadi Santosa
2012-01-31 4:52 ` Graeme Russ
2012-01-31 5:49 ` Mulyadi Santosa
2012-01-31 10:44 ` Graeme Russ
2012-01-31 14:41 ` Mulyadi Santosa
2012-01-31 22:06 ` Graeme Russ
2012-02-01 15:03 ` Mulyadi Santosa
2012-02-01 15:28 ` Graeme Russ
2012-02-01 15:34 ` Mulyadi Santosa
2012-02-01 15:44 ` Graeme Russ
2012-02-01 22:11 ` Graeme Russ
2012-02-02 3:28 ` Mulyadi Santosa
2012-02-02 23:04 ` Graeme Russ
2012-02-03 6:02 ` Mulyadi Santosa
2012-01-31 6:14 ` Fredrick