From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hannes Reinecke
Subject: Re: v3.15 dm-mpath regression: cable pull test causes I/O hang
Date: Thu, 03 Jul 2014 16:15:48 +0200
Message-ID: <53B56594.1060104@suse.de>
References: <53AD6B62.2020407@acm.org> <20140627133345.GA6150@redhat.com> <20140702220223.GA23894@redhat.com> <53B56120.8040802@acm.org> <20140703140516.GB28104@redhat.com>
Reply-To: device-mapper development
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <20140703140516.GB28104@redhat.com>
Sender: dm-devel-bounces@redhat.com
Errors-To: dm-devel-bounces@redhat.com
To: Mike Snitzer , Bart Van Assche
Cc: Jun'ichi Nomura , device-mapper development
List-Id: dm-devel.ids

On 07/03/2014 04:05 PM, Mike Snitzer wrote:
> On Thu, Jul 03 2014 at 9:56am -0400,
> Bart Van Assche wrote:
>
>> On 07/03/14 00:02, Mike Snitzer wrote:
>>> On Fri, Jun 27 2014 at 9:33am -0400,
>>> Mike Snitzer wrote:
>>>
>>>> On Fri, Jun 27 2014 at 9:02am -0400,
>>>> Bart Van Assche wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> While running a cable pull simulation test with dm_multipath on top of
>>>>> the SRP initiator driver I noticed that after a few iterations I/O locks
>>>>> up instead of dm_multipath processing the path failure properly (see also
>>>>> below for a call trace). At least kernel versions 3.15 and 3.16-rc2 are
>>>>> vulnerable. This issue does not occur with kernel 3.14. I have tried to
>>>>> bisect this but gave up when I noticed that I/O locked up completely with
>>>>> a kernel built from git commit ID e809917735ebf1b9a56c24e877ce0d320baee2ec
>>>>> (dm mpath: push back requests instead of queueing).
>>>>> But with the bisect I have been able to narrow down this issue to one of
>>>>> the patches in "Merge tag 'dm-3.15-changes' of
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm".
>>>>> Does anyone have a suggestion how to analyze this further or how to fix
>>>>> this?
>>>
>>> I still don't have a _known_ fix for your issue but I reviewed commit
>>> e809917735ebf1b9a56c24e877ce0d320baee2ec closer and identified what
>>> looks to be a regression in logic for multipath_busy, it now calls
>>> !pg_ready() instead of directly checking pg_init_in_progress. I think
>>> this is needed (Hannes, what do you think?):
>>>
>>> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
>>> index 3f6fd9d..561ead6 100644
>>> --- a/drivers/md/dm-mpath.c
>>> +++ b/drivers/md/dm-mpath.c
>>> @@ -373,7 +373,7 @@ static int __must_push_back(struct multipath *m)
>>> dm_noflush_suspending(m->ti)));
>>> }
>>>
>>> -#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required)
>>> +#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required && !(m)->pg_init_in_progress)
>>>
>>> /*
>>> * Map cloned requests
>>
>> Hello Mike,
>>
>> Sorry but even with this patch applied and additionally with commit IDs
>> 86d56134f1b6 ("kobject: Make support for uevent_helper optional") and
>> bcccff93af35 ("kobject: don't block for each kobject_uevent") reverted
>> my multipath test still hangs after a few iterations. I also reran the
>> same test with kernel 3.14.3 and it is still running after 30 iterations.
>
> OK, thanks for testing though! I still think the patch is needed.
>
> You are using queue_if_no_path, do you see hangs due to paths not being
> restored after the "cable" is restored? Any errors in the multipathd
> userspace logging? Or abnormal errors in kernel? Basically I'm looking
> for some other clue besides the hung task timeout spew.
>
> How easy would it be to replicate your testbed? Is it uniquely FIO hw
> dependent?
> How are you simulating the cable pull tests?
>
> I'd love to setup a testbed that would enable me to chase this more
> interactively rather than punting to you for testing.
>
> Hannes, do you have a testbed for heavy cable pull testing? Are you
> able to replicate these hangs?
>
Yes, I do.
But sadly I've been tied up with polishing up SLES12 (release deadline
is looming nearer) and for some inexplicable reason management seems to
find releasing a product more important than working on mainline
issue ...

But I hope to find some time soonish (ie start of next week) to work on
this; it's the very next thing on my to-do list.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   zSeries & Storage
hare@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)