linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Chris Leech <cleech@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Lee Duncan <lduncan@suse.com>,
	open-iscsi@googlegroups.com,
	Linux SCSI List <linux-scsi@vger.kernel.org>,
	linux-block@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
Date: Thu, 22 Dec 2016 17:50:12 +1100	[thread overview]
Message-ID: <20161222065012.GI4758@dastard> (raw)
In-Reply-To: <CA+55aFza=cv6TGYh+Be1xz2x-G-U=vruhbmfGpHSqY1ocZrBeg@mail.gmail.com>

On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?

I mean the three months leading up to the 4.10 merge, when all the
XFS changes were being tested against 4.9-rc kernels.

The iscsi problem showed up when I updated the base kernel from
4.9 to 4.10-current last week to test the pullreq I was going to
send you. I've been bust with other stuff until now, so I didn't
upgrade my working trees again until today in the hope the iscsi
problem had already been found and fixed.

> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] ------------[ cut here ]------------
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

The iscsi problem is 100% reproducable. create a pair of iscsi luns,
mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
the filesystems.

The test machine I'm having all these other problems on? stable and
steady as a rock using PMEM devices. Moment I go to use /dev/vdc
(i.e. run load/perf benchmarks) it starts falling over left, right
and center.

And I just smacked into this in the bulkstat phase of the benchmark
(mkfs, fsmark, xfs_repair, mount, bulkstat, find, grep, rm):

[ 2729.750563] BUG: Bad page state in process bstat  pfn:14945
[ 2729.751863] page:ffffea0000525140 count:-1 mapcount:0 mapping:          (null) index:0x0
[ 2729.753763] flags: 0x4000000000000000()
[ 2729.754671] raw: 4000000000000000 0000000000000000 0000000000000000 ffffffffffffffff
[ 2729.756469] raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
[ 2729.758276] page dumped because: nonzero _refcount
[ 2729.759393] Modules linked in:
[ 2729.760137] CPU: 7 PID: 25902 Comm: bstat Tainted: G    B           4.9.0-dgc #18
[ 2729.761888] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 2729.763943] Call Trace:
[ 2729.764523]  <IRQ>
[ 2729.765004]  dump_stack+0x63/0x83
[ 2729.765784]  bad_page+0xc4/0x130
[ 2729.766552]  free_pages_check_bad+0x4f/0x70
[ 2729.767531]  free_pcppages_bulk+0x3c5/0x3d0
[ 2729.768513]  ? page_alloc_cpu_dead+0x30/0x30
[ 2729.769510]  drain_pages_zone+0x41/0x60
[ 2729.770417]  drain_pages+0x3e/0x60
[ 2729.771215]  drain_local_pages+0x24/0x30
[ 2729.772138]  flush_smp_call_function_queue+0x88/0x160
[ 2729.773317]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 2729.774742]  smp_call_function_single_interrupt+0x27/0x40
[ 2729.776000]  smp_call_function_interrupt+0xe/0x10
[ 2729.777102]  call_function_interrupt+0x8e/0xa0
[ 2729.778147] RIP: 0010:delay_tsc+0x41/0x90
[ 2729.779085] RSP: 0018:ffffc9000f0cf500 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff03
[ 2729.780852] RAX: 0000000077541291 RBX: ffff88008b5efe40 RCX: 000000000000002e
[ 2729.782514] RDX: 0000057700000000 RSI: 0000057777541291 RDI: 0000000000000001
[ 2729.784167] RBP: ffffc9000f0cf500 R08: 0000000000000007 R09: ffffc9000f0cf678
[ 2729.785818] R10: 0000000000000006 R11: 0000000000001000 R12: 0000000000000061
[ 2729.787480] R13: 0000000000000001 R14: 0000000083214e30 R15: 0000000000000080
[ 2729.789124]  </IRQ>
[ 2729.789626]  __delay+0xf/0x20
[ 2729.790333]  do_raw_spin_lock+0x8c/0x160
[ 2729.791255]  _raw_spin_lock+0x15/0x20
[ 2729.792112]  list_lru_add+0x1a/0x70
[ 2729.792932]  xfs_buf_rele+0x3e7/0x410
[ 2729.793792]  xfs_buftarg_shrink_scan+0x6b/0x80
[ 2729.794841]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[ 2729.796099]  shrink_node+0x57/0x90
[ 2729.796905]  do_try_to_free_pages+0xdd/0x230
[ 2729.797914]  try_to_free_pages+0xce/0x1a0
[ 2729.798852]  __alloc_pages_slowpath+0x2df/0x960
[ 2729.799908]  __alloc_pages_nodemask+0x24b/0x290
[ 2729.800963]  new_slab+0x2ac/0x380
[ 2729.801743]  ___slab_alloc.constprop.82+0x336/0x440
[ 2729.802890]  ? kmem_zone_alloc+0x81/0x120
[ 2729.803830]  ? xfs_buf_read_map+0x2d/0x1a0
[ 2729.804785]  ? kmem_zone_alloc+0x81/0x120
[ 2729.805723]  __slab_alloc.isra.78.constprop.81+0x2b/0x40
[ 2729.806970]  kmem_cache_alloc+0x142/0x180
[ 2729.807905]  kmem_zone_alloc+0x81/0x120
[ 2729.808807]  xfs_inode_alloc+0x29/0x1d0
[ 2729.809707]  xfs_iget+0x3cd/0x9d0
[ 2729.810496]  ? xfs_iflush+0x350/0x350
[ 2729.811358]  xfs_bulkstat_one_int+0x87/0x370
[ 2729.812359]  xfs_bulkstat_one+0x20/0x30
[ 2729.813259]  xfs_bulkstat+0x485/0x690
[ 2729.814125]  ? xfs_bulkstat_one_int+0x370/0x370
[ 2729.815183]  xfs_ioc_bulkstat+0xfa/0x1d0
[ 2729.816105]  xfs_file_ioctl+0x8e4/0xc70
[ 2729.817006]  ? _raw_spin_unlock_irq+0xe/0x30
[ 2729.818011]  ? finish_task_switch+0xab/0x240
[ 2729.819007]  ? __schedule+0x265/0x760
[ 2729.819866]  ? _raw_spin_unlock+0xe/0x20
[ 2729.820783]  ? preempt_schedule_irq+0x59/0xb0
[ 2729.821797]  ? retint_kernel+0x1b/0x1d
[ 2729.822685]  do_vfs_ioctl+0x94/0x700
[ 2729.823523]  ? __fget+0x77/0xb0
[ 2729.824263]  SyS_ioctl+0x79/0x90
[ 2729.825022]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 2729.826105] RIP: 0033:0x7fd25ff088e7
[ 2729.826943] RSP: 002b:00007fcfc4757e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 2729.828714] RAX: ffffffffffffffda RBX: 00007fd0d4000020 RCX: 00007fd25ff088e7
[ 2729.830372] RDX: 00007fcfc4757e60 RSI: ffffffffc0205865 RDI: 0000000000000003
[ 2729.832020] RBP: 0000000000022010 R08: ffffffffffffffff R09: 0000000000000400
[ 2729.833670] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000023000
[ 2729.835327] R13: 0000000000022010 R14: 0000000000001000 R15: 00007fd0d4000020

> That said:
> 
> > And worse, on that last error, the /host/ is now going into meltdown
> > (running 4.7.5) with 32 CPUs all burning down in ACPI code:
> 
> The obvious question here is how much you trust the environment if the
> host ends up also showing problems. Maybe you do end up having hw
> issues pop up too.

I trust the hardware way more than I trust the software running on
it....

> The primary suspect would presumably be the development kernel you're
> testing triggering something, but it has to be asked..

I've seen a couple of guest panics in RCU related code take down the
parent QEMU process recently - asserts in console handling code IIRC
- but that's the only signs I've ever had of host side problems.
That both the host and guest went whacky in new and interesting ways
at exactly the same time might be coincidence, but Occam's Razor
suggests they are linked in some way...


> 
>                      Linus
> 

-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2016-12-22  6:50 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20161214222411.GH4326@dastard>
     [not found] ` <20161214222953.GI4326@dastard>
     [not found]   ` <20161216185906.t2wmrr6wqjdsrduw@straylight.hirudinean.org>
2016-12-21 22:16     ` [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0 Dave Chinner
2016-12-21 23:19       ` Linus Torvalds
2016-12-22  0:13         ` Chris Leech
2016-12-22  5:13           ` Dave Chinner
2016-12-22  5:46             ` Linus Torvalds
2016-12-22  6:50               ` Dave Chinner [this message]
2016-12-22 18:50                 ` Chris Leech
2016-12-22 23:53                   ` Ming Lei
     [not found]                     ` <CACVXFVOGoh+AEGSSMsbxfdAjZQS5k_t+7uR-+qwusm0vfauHEA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-23  0:03                       ` Chris Leech
2016-12-23 10:00                         ` Christoph Hellwig
2016-12-23 19:42                           ` Linus Torvalds
2016-12-24  2:45                             ` Jens Axboe
     [not found]                               ` <fadbf3d2-23c1-f822-cf0f-50182227a2dc-b10kYP2dOMg@public.gmane.org>
2016-12-24  9:49                                 ` Christoph Hellwig
     [not found]                             ` <CA+55aFxA1LE+4jdR8YXZyRVJ-MpOCUVob=ESBsMPuW=Qb=px-A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-24 10:07                               ` Christoph Hellwig
     [not found]                                 ` <20161224100756.GA16741-jcswGhMUV9g@public.gmane.org>
2016-12-24 13:17                                   ` Hannes Reinecke
     [not found]                                     ` <a9fc4c0b-8521-84cc-66f6-48396c095539-l3A5Bk7waGM@public.gmane.org>
2016-12-24 13:19                                       ` Christoph Hellwig
2017-01-04 14:07                                       ` Christoph Hellwig
2016-12-22 20:22               ` Hugh Dickins
2016-12-23  7:32                 ` Johannes Weiner
2016-12-23  8:33                   ` Johannes Weiner
     [not found]                     ` <20161223083329.GA13952-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2017-01-02 21:11                       ` Johannes Weiner
2017-01-03 12:28                         ` Jan Kara
2017-01-04 15:26                           ` Laurence Oberman
     [not found]                             ` <584380074.12440338.1483543569131.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-01-04 17:38                               ` Laurence Oberman
     [not found]                           ` <20170103122825.GC3780-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2017-01-08  2:02                             ` Johannes Weiner
     [not found]                               ` <20170108020200.GA16312-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2017-01-08  2:17                                 ` Linus Torvalds
2017-01-09 20:30                                 ` Jan Kara
2017-01-09 20:45                                   ` Johannes Weiner
2016-12-22  6:28             ` Dave Chinner
2016-12-22  6:18         ` Christoph Hellwig
2016-12-22  6:30           ` Dave Chinner
2016-12-22  6:36             ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161222065012.GI4758@dastard \
    --to=david@fromorbit.com \
    --cc=cleech@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=hch@lst.de \
    --cc=lduncan@suse.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=open-iscsi@googlegroups.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).