All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tejun Heo <htejun@gmail.com>
To: Torsten Kaiser <just.for.lkml@googlemail.com>
Cc: Jeff Garzik <jeff@garzik.org>,
	linux-kernel@vger.kernel.org, akpm@linux-foundation.org
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1
Date: Sun, 30 Sep 2007 10:39:48 -0700	[thread overview]
Message-ID: <46FFDF64.1080005@gmail.com> (raw)
In-Reply-To: <64bb37e0709300919w3e9db6aci4c0b9df43407fff3@mail.gmail.com>

Torsten Kaiser wrote:
> That boot ended in a minimal initrd environment that normally only
> starts the RAID5 and then opens contained encrypted real root.
> I was just able to push the output from dmesg through the serial link,
> but had no man pages to tell me about -s ...
> And that kind of error was until now a one-of-a-kind one. All other
> errors where not "internal error" but "timeout".
> But one time I had another SGT related error:
> Sep 11 19:19:24 treogen [   33.340000] ata1.00: exception Emask 0x20
> SAct 0x1 SErr 0x0 action 0x2
> Sep 11 19:19:24 treogen [   33.340000] ata1.00: irq_stat 0x00020002,
> PCI master abort while fetching SGT
> Sep 11 19:19:24 treogen [   33.340000] ata1.00: cmd
> 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
> Sep 11 19:19:24 treogen [   33.340000]          res
> 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)
> 
> What I find kind of interessing is, that while I got three different
> error codes the cmd part of the output was always the same.

That's NCQ write command.  You'll be using it a lot if you're rebuilding
md5.  It seems like something is going wrong with request DMA or sg
mapping.  Maybe some change in block/*.[hc]?

> It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
> Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
> narrowing on rc4-mm1 because that was the first version to break.
> 
> I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

Have you tested 2.6.23-rc4 without mm patches?  It could be something
introduced between -rc3 and 4.

> [the 2.6.23-rc4-mm1 series-file has 2013 lines]
> Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
> (line779): 2 good boots
> Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
> 1045): 1 failed
> Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed
> 
> Next try: up to patch fs-remove-some-aop_truncated_page.patch
> 
> That means from the patches added to the rc4 variant of the mm-kernel
> the following are remaining:
> git-xfs-build-fix.patch
> enforce-noreplace-smp-in-alternative_instructions.patch
> paravirt-fix-preemptible-lazy-mode-bug.patch
> i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
> fix-the-max-path-calculation-in-radix-treec-update.patch
> mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
> introduce-write_begin-write_end-aops-fix2.patch
> implement-simple-fs-aops-fix.patch
> ext2-convert-to-new-aops-fix2.patch
> ext3-convert-to-new-aops-fix-fix.patch
> ext4-convert-to-new-aops-fix-fix.patch
> gfs2-convert-to-new-aops-fix.patch
> reiserfs-convert-to-new-aops-fix2.patch
> hostfs-convert-to-new-aops-fix-fix.patch
> ufs-convert-to-new-aops-fix2.patch
> sysv-convert-to-new-aops-fix2.patch
> minix-convert-to-new-aops-fix2.patch
> affs-convert-to-new-aops-fix-fix.patch
> memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> update-n_high_memory-node-state-for-memory-hotadd.patch
> slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
> slub-do-not-use-page-mapping.patch
> slub-move-page-offset-to-kmem_cache_cpu-offset.patch
> slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
> slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
> slub-optimize-cacheline-use-for-zeroing.patch
> 
> But due to the unreliable nature of the bug, I can't be to sure about that.

Yeah, that's what I'm worried about.  Bisection is extremely difficult
if errors are intermittent and takes long time to reproduce.

> Next version is compiled, now again switching the PC off for an hour...

Thanks a lot.  Much appreciated.

-- 
tejun

  reply	other threads:[~2007-09-30 17:41 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-09-26 20:26 sata_sil24 broken since 2.6.23-rc4-mm1 Torsten Kaiser
2007-09-27  4:54 ` Tejun Heo
2007-09-27  4:57   ` Tejun Heo
2007-09-27  6:14     ` Torsten Kaiser
2007-09-27  6:24       ` Jeff Garzik
2007-09-27 17:34         ` Torsten Kaiser
2007-09-27 20:22           ` Tejun Heo
2007-09-28  5:36             ` Torsten Kaiser
2007-09-30  6:00               ` Torsten Kaiser
2007-09-30 14:34                 ` Tejun Heo
2007-09-30 16:19                   ` Torsten Kaiser
2007-09-30 17:39                     ` Tejun Heo [this message]
2007-09-30 18:39                       ` Torsten Kaiser
2007-10-01 18:00                         ` Torsten Kaiser
2007-10-03 15:21                           ` Torsten Kaiser
2007-10-03 15:55                             ` Torsten Kaiser
2007-10-03 16:38                               ` Matt Mackall
2007-10-03 17:36                                 ` Torsten Kaiser
2007-10-03 17:51                                   ` Matt Mackall
2007-10-03 18:06                                     ` Torsten Kaiser
2007-10-04  5:32                                 ` Torsten Kaiser
2007-10-04 17:05                                   ` Matt Mackall
2007-10-05  6:06                                     ` Torsten Kaiser
2007-10-07  8:44                                       ` Torsten Kaiser
2007-10-07 14:39                                         ` Torsten Kaiser
2007-10-11  3:25                                           ` Tejun Heo
2007-10-11  5:54                                             ` Torsten Kaiser
2007-10-11  6:26                                               ` Tejun Heo
2007-10-11 17:51                                                 ` Torsten Kaiser
2007-10-11  8:26                                             ` Jens Axboe
2007-10-11  8:36                                               ` Tejun Heo
2007-10-11 10:28                                                 ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=46FFDF64.1080005@gmail.com \
    --to=htejun@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=jeff@garzik.org \
    --cc=just.for.lkml@googlemail.com \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.