From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Veronika Kabatova <vkabatov@redhat.com>,
CKI Project <cki-project@redhat.com>,
linux-block@vger.kernel.org, Changhui Zhong <czhong@redhat.com>,
Rachel Sibley <rasibley@redhat.com>, Song Liu <song@kernel.org>,
linux-raid@vger.kernel.org
Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
Date: Fri, 4 Sep 2020 11:22:44 +0800
Message-ID: <20200904032244.GA808936@T590>
In-Reply-To: <cc956f4c-9b71-2b02-80be-dd387316dad8@kernel.dk>
On Thu, Sep 03, 2020 at 02:53:39PM -0600, Jens Axboe wrote:
> On 9/3/20 1:58 PM, Veronika Kabatova wrote:
> >
> >
> > ----- Original Message -----
> >> From: "Rachel Sibley" <rasibley@redhat.com>
> >> To: "Jens Axboe" <axboe@kernel.dk>, "CKI Project" <cki-project@redhat.com>, linux-block@vger.kernel.org
> >> Cc: "Changhui Zhong" <czhong@redhat.com>
> >> Sent: Thursday, September 3, 2020 8:59:48 PM
> >> Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
> >>
> >>
> >>
> >> On 9/3/20 1:46 PM, Jens Axboe wrote:
> >>> On 9/3/20 11:10 AM, Rachel Sibley wrote:
> >>>>
> >>>> On 9/3/20 1:07 PM, CKI Project wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> We ran automated tests on a recent commit from this kernel tree:
> >>>>>
> >>>>> Kernel repo:
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
> >>>>> Commit: 020ad0333b03 - Merge branch 'for-5.10/block' into
> >>>>> for-next
> >>>>>
> >>>>> The results of these automated tests are provided below.
> >>>>>
> >>>>> Overall result: FAILED (see details below)
> >>>>> Merge: OK
> >>>>> Compile: OK
> >>>>> Tests: PANICKED
> >>>>>
> >>>>> All kernel binaries, config files, and logs are available for download
> >>>>> here:
> >>>>>
> >>>>> https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/09/02/613166
> >>>>>
> >>>>> One or more kernel tests failed:
> >>>>>
> >>>>> ppc64le:
> >>>>> 💥 storage: software RAID testing
> >>>>>
> >>>>> aarch64:
> >>>>> 💥 storage: software RAID testing
> >>>>>
> >>>>> x86_64:
> >>>>> 💥 storage: software RAID testing
> >>>>
> >>>> Hello,
> >>>>
> >>>> We're seeing a panic on all non-s390x arches, triggered by the swraid test.
> >>>> It seems to be reproducible for all succeeding pipelines after this one,
> >>>> and we haven't yet seen it in mainline or yesterday's block tree results.
> >>>>
> >>>> Thank you,
> >>>> Rachel
> >>>>
> >>>> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_aarch64_redhat%3A968098/tests/8757835_aarch64_3_console.log
> >>>>
> >>>> [ 8394.609219] Internal error: Oops: 96000004 [#1] SMP
> >>>> [ 8394.614070] Modules linked in: raid0 loop raid456 async_raid6_recov
> >>>> async_memcpy async_pq async_xor async_tx dm_log_writes dm_flakey
> >>>> rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
> >>>> rfkill sunrpc vfat fat xgene_hwmon xgene_enet at803x mdio_xgene xgene_rng
> >>>> xgene_edac mailbox_xgene_slimpro drm ip_tables xfs sdhci_of_arasan
> >>>> sdhci_pltfm i2c_xgene_slimpro crct10dif_ce sdhci gpio_dwapb cqhci
> >>>> xhci_plat_hcd
> >>>> gpio_xgene_sb gpio_keys aes_neon_bs
> >>>> [ 8394.654298] CPU: 3 PID: 471427 Comm: kworker/3:2 Kdump: loaded Not
> >>>> tainted 5.9.0-rc3-020ad03.cki #1
> >>>> [ 8394.663299] Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene
> >>>> Mustang Board, BIOS 3.06.25 Oct 17 2016
> >>>> [ 8394.672999] Workqueue: md_misc mddev_delayed_delete
> >>>> [ 8394.677853] pstate: 40400085 (nZcv daIf +PAN -UAO BTYPE=--)
> >>>> [ 8394.683399] pc : percpu_ref_exit+0x5c/0xc8
> >>>> [ 8394.687473] lr : percpu_ref_exit+0x20/0xc8
> >>>> [ 8394.691547] sp : ffff800019f33d00
> >>>> [ 8394.694843] x29: ffff800019f33d00 x28: 0000000000000000
> >>>> [ 8394.700129] x27: ffff0003c63ae000 x26: ffff8000120b6228
> >>>> [ 8394.705414] x25: 0000000000000001 x24: ffff0003d8322a80
> >>>> [ 8394.710698] x23: 0000000000000000 x22: 0000000000000000
> >>>> [ 8394.715983] x21: 0000000000000000 x20: ffff8000121d2000
> >>>> [ 8394.721266] x19: ffff0003d8322af0 x18: 0000000000000000
> >>>> [ 8394.726550] x17: 0000000000000000 x16: 0000000000000000
> >>>> [ 8394.731834] x15: 0000000000000007 x14: 0000000000000003
> >>>> [ 8394.737119] x13: 0000000000000000 x12: ffff0003888a1978
> >>>> [ 8394.742403] x11: ffff0003888a1918 x10: 0000000000000001
> >>>> [ 8394.747688] x9 : 0000000000000000 x8 : 0000000000000000
> >>>> [ 8394.752972] x7 : 0000000000000400 x6 : 0000000000000001
> >>>> [ 8394.758257] x5 : ffff800010423030 x4 : ffff8000121d2e40
> >>>> [ 8394.763540] x3 : 0000000000000000 x2 : 0000000000000000
> >>>> [ 8394.768825] x1 : 0000000000000000 x0 : 0000000000000000
> >>>> [ 8394.774110] Call trace:
> >>>> [ 8394.776544] percpu_ref_exit+0x5c/0xc8
> >>>> [ 8394.780273] md_free+0x64/0xa0
> >>>> [ 8394.783311] kobject_put+0x7c/0x218
> >>>> [ 8394.786781] mddev_delayed_delete+0x3c/0x50
> >>>> [ 8394.790944] process_one_work+0x1c4/0x450
> >>>> [ 8394.794932] worker_thread+0x164/0x4a8
> >>>> [ 8394.798662] kthread+0xf4/0x120
> >>>> [ 8394.801787] ret_from_fork+0x10/0x18
> >>>> [ 8394.805344] Code: 2a0403e0 350002c0 a9400262 52800001 (f9400000)
> >>>> [ 8394.811407] ---[ end trace 481cab6e1ad73da1 ]---
> >>>
> >>> Ming, I wonder if this is:
> >>>
> >>> commit d0c567d60f3730b97050347ea806e1ee06445c78
> >>> Author: Ming Lei <ming.lei@redhat.com>
> >>> Date: Wed Sep 2 20:26:42 2020 +0800
> >>>
> >>> percpu_ref: reduce memory footprint of percpu_ref in fast path
> >>>
> >>> Rachel, any chance you can do a run with that commit reverted?
> >>
> >> Hi Jens, yes we're working on it and will share our findings as soon as the
> >> job finishes.
> >>
> >
> > Hi Jens, we can confirm that there are no panics and the test passes
> > with the patch reverted.
> >
> >
> > We also realized that this patch is a likely cause of serious problems
> > on ppc64le during LTP testing as well, specifically msgstress04. Both
> > issues started occurring at the same time, we just didn't notice as the
> > test was crashing.
> >
> >
> > [ 5682.999169] msgstress04 invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=0
> > [ 5682.999981] CPU: 1 PID: 170909 Comm: msgstress04 Kdump: loaded Not tainted 5.9.0-rc3-020ad03.cki #1
> > [ 5683.000048] Call Trace:
> > [ 5683.000098] [c00000023de972e0] [c000000000927e00] dump_stack+0xc4/0x114 (unreliable)
> > [ 5683.000161] [c00000023de97330] [c000000000386958] dump_header+0x64/0x274
> > [ 5683.000205] [c00000023de973c0] [c000000000385534] oom_kill_process+0x284/0x290
> > [ 5683.000259] [c00000023de97400] [c0000000003862b0] out_of_memory+0x220/0x790
> > [ 5683.000307] [c00000023de974a0] [c000000000408890] __alloc_pages_slowpath.constprop.0+0xd60/0xeb0
> > [ 5683.000370] [c00000023de97670] [c000000000408d20] __alloc_pages_nodemask+0x340/0x400
> > [ 5683.000426] [c00000023de97700] [c000000000434dec] alloc_pages_current+0xac/0x130
> > [ 5683.000479] [c00000023de97750] [c000000000442fc4] allocate_slab+0x584/0x810
> > [ 5683.000525] [c00000023de977c0] [c000000000447e7c] ___slab_alloc+0x44c/0xa30
> > [ 5683.000571] [c00000023de978b0] [c000000000448494] __slab_alloc+0x34/0x60
> > [ 5683.000615] [c00000023de978e0] [c000000000448b48] kmem_cache_alloc+0x688/0x700
> > [ 5683.000671] [c00000023de97940] [c0000000003d9c80] __pud_alloc+0x70/0x1e0
> > [ 5683.000717] [c00000023de97990] [c0000000003ddbb4] copy_page_range+0x1204/0x1490
> > [ 5683.000779] [c00000023de97b20] [c00000000013b7c0] dup_mm+0x370/0x6e0
> > [ 5683.000826] [c00000023de97bd0] [c00000000013ce10] copy_process+0xd20/0x1950
> > [ 5683.000870] [c00000023de97c90] [c00000000013dc64] _do_fork+0xa4/0x560
> > [ 5683.000915] [c00000023de97d00] [c00000000013e24c] __do_sys_clone+0x7c/0xa0
> > [ 5683.000965] [c00000023de97dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0
> > [ 5683.001019] [c00000023de97e20] [c00000000000d140] system_call_common+0xf0/0x27c
> >
> > The test then manages to fill the console log with a good 4G of dump...
> > this is actually visible in the ppc64le console log from the linked
> > artifacts (warning, it's a huge file!):
> >
> > https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757368_ppc64le_3_console.log
> >
> >
> > There are also more ppc64le traces in the other log (of reasonable size):
> > https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757337_ppc64le_2_console.log
>
> I'll revert this change for now.
This is an MD bug: percpu_ref_exit() may be called on a ref that was
never initialized via percpu_ref_init(). The following patch fixes the
issue:
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 607278207023..9c55489066d2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5599,7 +5599,9 @@ static void md_free(struct kobject *ko)
 		blk_cleanup_queue(mddev->queue);
 	if (mddev->gendisk)
 		put_disk(mddev->gendisk);
-	percpu_ref_exit(&mddev->writes_pending);
+
+	if (mddev->writes_pending.percpu_count_ptr)
+		percpu_ref_exit(&mddev->writes_pending);
 
 	bioset_exit(&mddev->bio_set);
 	bioset_exit(&mddev->sync_set);
Thanks,
Ming