public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Tejun Heo <htejun@gmail.com>
To: Torsten Kaiser <just.for.lkml@googlemail.com>
Cc: Jeff Garzik <jeff@garzik.org>,
	linux-kernel@vger.kernel.org, akpm@linux-foundation.org
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1
Date: Sun, 30 Sep 2007 10:39:48 -0700	[thread overview]
Message-ID: <46FFDF64.1080005@gmail.com> (raw)
In-Reply-To: <64bb37e0709300919w3e9db6aci4c0b9df43407fff3@mail.gmail.com>

Torsten Kaiser wrote:
> That boot ended in a minimal initrd environment that normally only
> starts the RAID5 and then opens contained encrypted real root.
> I was just able to push the output from dmesg through the serial link,
> but had no man pages to tell me about -s ...
> And that kind of error was until now a one-of-a-kind one. All other
> errors where not "internal error" but "timeout".
> But one time I had another SGT related error:
> Sep 11 19:19:24 treogen [   33.340000] ata1.00: exception Emask 0x20
> SAct 0x1 SErr 0x0 action 0x2
> Sep 11 19:19:24 treogen [   33.340000] ata1.00: irq_stat 0x00020002,
> PCI master abort while fetching SGT
> Sep 11 19:19:24 treogen [   33.340000] ata1.00: cmd
> 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
> Sep 11 19:19:24 treogen [   33.340000]          res
> 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)
> 
> What I find kind of interessing is, that while I got three different
> error codes the cmd part of the output was always the same.

That's NCQ write command.  You'll be using it a lot if you're rebuilding
md5.  It seems like something is going wrong with request DMA or sg
mapping.  Maybe some change in block/*.[hc]?

> It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
> Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
> narrowing on rc4-mm1 because that was the first version to break.
> 
> I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

Have you tested 2.6.23-rc4 without mm patches?  It could be something
introduced between -rc3 and 4.

> [the 2.6.23-rc4-mm1 series-file has 2013 lines]
> Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
> (line779): 2 good boots
> Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
> 1045): 1 failed
> Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed
> 
> Next try: up to patch fs-remove-some-aop_truncated_page.patch
> 
> That means from the patches added to the rc4 variant of the mm-kernel
> the following are remaining:
> git-xfs-build-fix.patch
> enforce-noreplace-smp-in-alternative_instructions.patch
> paravirt-fix-preemptible-lazy-mode-bug.patch
> i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
> fix-the-max-path-calculation-in-radix-treec-update.patch
> mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
> introduce-write_begin-write_end-aops-fix2.patch
> implement-simple-fs-aops-fix.patch
> ext2-convert-to-new-aops-fix2.patch
> ext3-convert-to-new-aops-fix-fix.patch
> ext4-convert-to-new-aops-fix-fix.patch
> gfs2-convert-to-new-aops-fix.patch
> reiserfs-convert-to-new-aops-fix2.patch
> hostfs-convert-to-new-aops-fix-fix.patch
> ufs-convert-to-new-aops-fix2.patch
> sysv-convert-to-new-aops-fix2.patch
> minix-convert-to-new-aops-fix2.patch
> affs-convert-to-new-aops-fix-fix.patch
> memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> update-n_high_memory-node-state-for-memory-hotadd.patch
> slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
> slub-do-not-use-page-mapping.patch
> slub-move-page-offset-to-kmem_cache_cpu-offset.patch
> slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
> slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
> slub-optimize-cacheline-use-for-zeroing.patch
> 
> But due to the unreliable nature of the bug, I can't be to sure about that.

Yeah, that's what I'm worried about.  Bisection is extremely difficult
if errors are intermittent and takes long time to reproduce.

> Next version is compiled, now again switching the PC off for an hour...

Thanks a lot.  Much appreciated.

-- 
tejun

  reply	other threads:[~2007-09-30 17:41 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-09-26 20:26 sata_sil24 broken since 2.6.23-rc4-mm1 Torsten Kaiser
2007-09-27  4:54 ` Tejun Heo
2007-09-27  4:57   ` Tejun Heo
2007-09-27  6:14     ` Torsten Kaiser
2007-09-27  6:24       ` Jeff Garzik
2007-09-27 17:34         ` Torsten Kaiser
2007-09-27 20:22           ` Tejun Heo
2007-09-28  5:36             ` Torsten Kaiser
2007-09-30  6:00               ` Torsten Kaiser
2007-09-30 14:34                 ` Tejun Heo
2007-09-30 16:19                   ` Torsten Kaiser
2007-09-30 17:39                     ` Tejun Heo [this message]
2007-09-30 18:39                       ` Torsten Kaiser
2007-10-01 18:00                         ` Torsten Kaiser
2007-10-03 15:21                           ` Torsten Kaiser
2007-10-03 15:55                             ` Torsten Kaiser
2007-10-03 16:38                               ` Matt Mackall
2007-10-03 17:36                                 ` Torsten Kaiser
2007-10-03 17:51                                   ` Matt Mackall
2007-10-03 18:06                                     ` Torsten Kaiser
2007-10-04  5:32                                 ` Torsten Kaiser
2007-10-04 17:05                                   ` Matt Mackall
2007-10-05  6:06                                     ` Torsten Kaiser
2007-10-07  8:44                                       ` Torsten Kaiser
2007-10-07 14:39                                         ` Torsten Kaiser
2007-10-11  3:25                                           ` Tejun Heo
2007-10-11  5:54                                             ` Torsten Kaiser
2007-10-11  6:26                                               ` Tejun Heo
2007-10-11 17:51                                                 ` Torsten Kaiser
2007-10-11  8:26                                             ` Jens Axboe
2007-10-11  8:36                                               ` Tejun Heo
2007-10-11 10:28                                                 ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=46FFDF64.1080005@gmail.com \
    --to=htejun@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=jeff@garzik.org \
    --cc=just.for.lkml@googlemail.com \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox