From: Mike Snitzer <snitzer@redhat.com>
To: device-mapper development <dm-devel@redhat.com>
Subject: Re: Crash in dm_done()
Date: Wed, 11 Jan 2012 11:25:04 -0500
Message-ID: <20120111162501.GA25980@redhat.com>
In-Reply-To: <4F0D4A23.7060604@ce.jp.nec.com>

On Wed, Jan 11 2012 at  3:36am -0500,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> Hi Hannes,
> 
> On 01/10/12 20:18, Hannes Reinecke wrote:
> > I'm trying to hunt down a mysterious crash:
> > 
> > Unable to handle kernel pointer dereference at virtual kernel
> > address 000003c001762000
> > Oops: 0011 [#1] SMP
> > Modules linked in: dm_round_robin sg sd_mod crc_t10dif ipv6 loop
> > dm_multipath scsi_dh dm_mod qeth_l3 ipv6_lib zfcp scsi_transport_fc
> > scsi_tgt scsi_mod dasd_eckd_mod dasd_mod qeth qdio ccwgroup ext3
> > mbcache jbd
> > Supported: Yes
> > CPU: 0 Not tainted 3.0.13-0.5-default #1
> > Process kworker/0:1 (pid: 51750, task: 0000000008450138, ksp:
> > 0000000007323620)
> > Krnl PSW : 0704200180000000 000003c001953750 (dm_done+0x44/0x188
> > [dm_mod])
> >            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> > Krnl GPRS: 000000002adb5e80 000003c001762100 0000000037de9e80
> > 0000000000000000
> >            0000000000000400 000003c00195e178 0000000000000380
> > 0000000000000000
> >            0000000000000100 0000000000000001 0000000037de9e80
> > 0000000037de9e68
> >            000003c00194f000 000003c00195e168 000000007ef97d30
> > 000000007ef97cd8
> > Krnl Code: 000003c001953740: e3b021580004       lg     %r11,344(%r2)
> >            000003c001953746: e310b0080004       lg     %r1,8(%r11)
> >            000003c00195374c: bf4fb180           icm %r4,15,384(%r11)
> >           >000003c001953750: e32010080004       lg      %r2,8(%r1)
> >            000003c001953756: e38020500004       lg      %r8,80(%r2)
> >            000003c00195375c: a7740062           brc    7,3c001953820
> >            000003c001953760: b9020099           ltgr    %r9,%r9
> >            000003c001953764: a7840049           brc    8,3c0019537f6
> > 
> > r11 is struct dm_rq_target_io *tio = clone->end_io_data;
> > r1 is tio->ti (ie struct dm_target), which is invalid.
> > r2 is tio->ti->type, likewise.
> > 
> > Apparently the table got replaced between map_io() and dm_done(),
> > causing this invalid pointer.
> > While we do hold a reference on the mapped_device in map_request(),
> > we only take a _single_ reference to the table in dm_request_fn(),
> > which is dropped again at the end.
> 
> > And as the table holds the pointer to the targets, they'll be
> > invalidated upon table swapping, causing dm_done() accessing an
> > invalid pointer.
> 
> The last paragraph is not correct.
> If any requests are in-flight, dm does not swap the table.
> 
> I.e., in dm_suspend(), for request-based dm, we do:
>   1) stop request_queue processing
>      <from this point, no new request becomes in-flight>
>   2) wait for completion of in-flight I/Os
>      <from this point, no requests are in-flight>
> and only after that, we can swap tables.
> 
> The existence of in-flight I/O is checked via the "pending" counter of md.
> 
> The counter is incremented in dm_request_fn()
> and decremented in rq_completed(), which is called
> when the original request is either requeued or completed,
> i.e. after dm_done() processing.
> 
> 
> > I can't really believe that is the case, so please do correct me if
> > the above analysis is wrong.
> 
> Just guessing...
> Such a mysterious crash could occur if there are bugs like:
>   - somebody starts the queue while dm has stopped it in dm_suspend()
>   - somebody submits/completes/requeues a request with the
>     wrong function and corrupts the pending counter
>   - a lower-level driver completes a request twice
> 
> If you can recreate the crash, try the attached debug patch.
> It should raise warnings when cases like the above happen
> and might help hunt down the problem.
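
The suspend ordering and "pending" accounting that Jun'ichi describes can
be sketched as a small user-space C model. This is only an illustration
under stated assumptions: struct sim_md and the sim_* functions are
hypothetical names, not kernel code, and the real dm_suspend() waits for
in-flight I/O to finish rather than returning early as the sketch does.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for struct mapped_device. */
struct sim_md {
    int pending;         /* number of in-flight requests */
    bool queue_stopped;  /* set by step 1 of the suspend sequence */
    int table_id;        /* stands in for the live table pointer */
    int warnings;        /* accounting violations detected so far */
};

/* Models dispatch in dm_request_fn(): nothing becomes in-flight
 * once the queue has been stopped. */
static bool sim_dispatch(struct sim_md *md)
{
    if (md->queue_stopped)
        return false;
    md->pending++;
    return true;
}

/* Models rq_completed(): a completion with no request in flight
 * (for example a double completion) is counted as a warning. */
static void sim_complete(struct sim_md *md)
{
    if (md->pending <= 0) {
        md->warnings++;
        fprintf(stderr, "warning: completion with no request in flight\n");
        return;
    }
    md->pending--;
}

/* Models the ordering described above: 1) stop the queue,
 * 2) require pending == 0, and only then swap the table.
 * Returns false (without swapping) if requests are still in flight. */
static bool sim_suspend_and_swap(struct sim_md *md, int new_table_id)
{
    md->queue_stopped = true;
    if (md->pending != 0)
        return false;  /* the real code would wait here instead */
    md->table_id = new_table_id;
    return true;
}
```

In this model, dispatching one request and calling sim_suspend_and_swap()
before completing it leaves the table untouched; the swap only succeeds
once pending has dropped back to zero. A second sim_complete() for the
same request trips the warning, which corresponds to the kind of
accounting bug listed above.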

There was also that earlier thread about a crash in dm_done(), where
dereferencing clone->end_io_data caused the crash:
https://lkml.org/lkml/2011/10/31/97

Heiko said that Hannes' patch:
http://www.redhat.com/archives/dm-devel/2011-November/msg00176.html

... actually helped:
https://lkml.org/lkml/2011/11/29/168

But we never did get to the bottom of _why_, and Jun'ichi pointed out
that a NULL pointer check is missing:
http://www.redhat.com/archives/dm-devel/2011-December/msg00022.html

Anyway, Hannes, if you can reproduce it, it'd also be interesting to see
whether your patch (updated with the NULL pointer check?) makes any
difference.

Mike
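
For reference, the pointer chain that Hannes decodes from the register
dump (clone->end_io_data, then tio->ti, then tio->ti->type) can be
sketched in user-space C with a NULL check at each step. The sim_* structs
and sim_done_target_name() are hypothetical simplifications, not the
kernel definitions, and this is not the patch discussed in the thread.
Note that in the oops r1 (tio->ti) is non-NULL (000003c001762100), so a
NULL check of this kind would not by itself catch a stale pointer.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sim_target_type { const char *name; };
struct sim_target      { struct sim_target_type *type; };  /* like tio->ti */
struct sim_tio         { struct sim_target *ti; };         /* like dm_rq_target_io */
struct sim_request     { void *end_io_data; };             /* like the clone request */

/* Walks clone->end_io_data -> ti -> type and returns the target type
 * name, or NULL if any link in the chain is NULL. A dangling (freed but
 * non-NULL) pointer cannot be detected this way. */
static const char *sim_done_target_name(const struct sim_request *clone)
{
    const struct sim_tio *tio;

    if (!clone || !clone->end_io_data)
        return NULL;
    tio = clone->end_io_data;
    if (!tio->ti || !tio->ti->type)
        return NULL;
    return tio->ti->type->name;
}
```

With a fully populated chain the function returns the type name; clearing
end_io_data, ti, or type makes it return NULL instead of faulting.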

Thread overview (5+ messages):
2012-01-10 11:18 Crash in dm_done() Hannes Reinecke
2012-01-11  8:36 ` Jun'ichi Nomura
2012-01-11 16:25   ` Mike Snitzer [this message]
2012-01-12  7:26     ` Hannes Reinecke
2012-03-05 18:23       ` Mike Snitzer
