From: Mike Snitzer <snitzer@redhat.com>
To: Paul Mackerras <paulus@samba.org>
Cc: dm-devel@redhat.com, linux-kernel@vger.kernel.org,
	linuxppc-dev@ozlabs.org,
	Vladimir Davydov <vdavydov@parallels.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	bvanassche@acm.org
Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI
Date: Tue, 1 Jul 2014 15:39:07 -0400
Message-ID: <20140701193907.GA15306@redhat.com>
In-Reply-To: <20140630103058.GA17747@iris.ozlabs.ibm.com>

On Mon, Jun 30 2014 at  6:30am -0400,
Paul Mackerras <paulus@samba.org> wrote:

> I have a machine on which 3.15 usually fails to boot, and 3.14 boots
> every time.  The machine is a POWER8 2-socket server with 20 cores
> (thus 160 CPUs), 128GB of RAM, and 7 SCSI disks connected via a
> hardware-RAID-capable adapter which appears as two IPR controllers
> which are both connected to each disk.  I am booting from a disk that
> has Fedora 20 installed on it.
> 
> After over two weeks of bisections, I can finally point to the commits
> that cause the problems.  The culprits are:
> 
> 3e9f1be1 dm mpath: remove process_queued_ios()
> e8099177 dm mpath: push back requests instead of queueing
> bcccff93 kobject: don't block for each kobject_uevent
> 
> The interesting thing is that neither e8099177 nor bcccff93 cause
> failures on their own, but with both commits in there are failures
> where the system will fail to find /home on some occasions.
> 
> With 3e9f1be1 included, the system appears to be prone to a deadlock
> condition which typically causes the boot process to hang with this
> message showing:
> 
> A start job is running for Monitoring of LVM2 mirror...rogress polling
> 
> (with a [***     ] thing before it where the asterisks move back and
> forth).
> 
> If I revert 63d832c3 ("dm mpath: really fix lockdep warning"),
> 4cdd2ad7 ("dm mpath: fix lock order inconsistency in
> multipath_ioctl"), 3e9f1be1 and bcccff93, in that order, I get a
> kernel that will boot every time.  The first two are later commits
> that fix some problems with 3e9f1be1 (though not the problems I am
> seeing).
> 
> Can anyone see any reason why e8099177 and bcccff93 would interfere
> with each other?

No, not seeing any obvious relation.

But even though you listed e8099177 as a culprit you didn't list it as a
commit you reverted.  Did you leave e8099177 simply because attempting
to revert it fails (if you don't first revert other dm-mpath.c commits)?
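
For anyone trying to reproduce this, the revert sequence quoted above can
be sketched roughly as follows.  This is only a sketch: it assumes a kernel
tree checked out at v3.15 (an assumption, not stated in the report), and the
print_reverts helper and RUN variable are hypothetical names used purely for
illustration.  By default it only prints the commands; nothing is reverted
unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: revert the four commits in the order given in the report.
# Assumption: the current directory is a kernel tree checked out at v3.15.
# print_reverts and RUN are illustrative names, not part of any tool.
set -e

# Order matters: the two later fixups are reverted before 3e9f1be1.
commits="63d832c3 4cdd2ad7 3e9f1be1 bcccff93"

print_reverts() {
    for c in $commits; do
        echo "git revert --no-edit $c"
    done
}

if [ "${RUN:-0}" = "1" ]; then
    # Actually perform the reverts, stopping at the first conflict.
    for c in $commits; do
        git revert --no-edit "$c"
    done
else
    # Dry run: just show what would be executed.
    print_reverts
fi
```

Appending e8099177 to the list would be one way to check whether it reverts
cleanly once the later dm-mpath.c commits are out of the way.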

(btw, Bart Van Assche also has issues with commit e8099177 due to hangs
during cable pull testing of mpath devices -- Bart: curious to know if
your cable pull tests pass if you just revert bcccff93).

Mike
