From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Veronika Kabatova <vkabatov@redhat.com>,
CKI Project <cki-project@redhat.com>,
linux-block@vger.kernel.org, Changhui Zhong <czhong@redhat.com>,
Rachel Sibley <rasibley@redhat.com>, Song Liu <song@kernel.org>,
linux-raid@vger.kernel.org
Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
Date: Fri, 4 Sep 2020 11:22:44 +0800
Message-ID: <20200904032244.GA808936@T590>
In-Reply-To: <cc956f4c-9b71-2b02-80be-dd387316dad8@kernel.dk>
On Thu, Sep 03, 2020 at 02:53:39PM -0600, Jens Axboe wrote:
> On 9/3/20 1:58 PM, Veronika Kabatova wrote:
> >
> >
> > ----- Original Message -----
> >> From: "Rachel Sibley" <rasibley@redhat.com>
> >> To: "Jens Axboe" <axboe@kernel.dk>, "CKI Project" <cki-project@redhat.com>, linux-block@vger.kernel.org
> >> Cc: "Changhui Zhong" <czhong@redhat.com>
> >> Sent: Thursday, September 3, 2020 8:59:48 PM
> >> Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
> >>
> >>
> >>
> >> On 9/3/20 1:46 PM, Jens Axboe wrote:
> >>> On 9/3/20 11:10 AM, Rachel Sibley wrote:
> >>>>
> >>>> On 9/3/20 1:07 PM, CKI Project wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> We ran automated tests on a recent commit from this kernel tree:
> >>>>>
> >>>>> Kernel repo:
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
> >>>>> Commit: 020ad0333b03 - Merge branch 'for-5.10/block' into
> >>>>> for-next
> >>>>>
> >>>>> The results of these automated tests are provided below.
> >>>>>
> >>>>> Overall result: FAILED (see details below)
> >>>>> Merge: OK
> >>>>> Compile: OK
> >>>>> Tests: PANICKED
> >>>>>
> >>>>> All kernel binaries, config files, and logs are available for download
> >>>>> here:
> >>>>>
> >>>>> https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/09/02/613166
> >>>>>
> >>>>> One or more kernel tests failed:
> >>>>>
> >>>>> ppc64le:
> >>>>> 💥 storage: software RAID testing
> >>>>>
> >>>>> aarch64:
> >>>>> 💥 storage: software RAID testing
> >>>>>
> >>>>> x86_64:
> >>>>> 💥 storage: software RAID testing
> >>>>
> >>>> Hello,
> >>>>
> >>>> We're seeing a panic for all non s390x arches triggered by swraid test.
> >>>> Seems to be reproducible
> >>>> for all succeeding pipelines after this one, and we haven't yet seen it in
> >>>> mainline or yesterday's
> >>>> block tree results.
> >>>>
> >>>> Thank you,
> >>>> Rachel
> >>>>
> >>>> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_aarch64_redhat%3A968098/tests/8757835_aarch64_3_console.log
> >>>>
> >>>> [ 8394.609219] Internal error: Oops: 96000004 [#1] SMP
> >>>> [ 8394.614070] Modules linked in: raid0 loop raid456 async_raid6_recov
> >>>> async_memcpy async_pq async_xor async_tx dm_log_writes dm_flakey
> >>>> rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
> >>>> rfkill sunrpc vfat fat xgene_hwmon xgene_enet at803x mdio_xgene xgene_rng
> >>>> xgene_edac mailbox_xgene_slimpro drm ip_tables xfs sdhci_of_arasan
> >>>> sdhci_pltfm i2c_xgene_slimpro crct10dif_ce sdhci gpio_dwapb cqhci
> >>>> xhci_plat_hcd
> >>>> gpio_xgene_sb gpio_keys aes_neon_bs
> >>>> [ 8394.654298] CPU: 3 PID: 471427 Comm: kworker/3:2 Kdump: loaded Not
> >>>> tainted 5.9.0-rc3-020ad03.cki #1
> >>>> [ 8394.663299] Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene
> >>>> Mustang Board, BIOS 3.06.25 Oct 17 2016
> >>>> [ 8394.672999] Workqueue: md_misc mddev_delayed_delete
> >>>> [ 8394.677853] pstate: 40400085 (nZcv daIf +PAN -UAO BTYPE=--)
> >>>> [ 8394.683399] pc : percpu_ref_exit+0x5c/0xc8
> >>>> [ 8394.687473] lr : percpu_ref_exit+0x20/0xc8
> >>>> [ 8394.691547] sp : ffff800019f33d00
> >>>> [ 8394.694843] x29: ffff800019f33d00 x28: 0000000000000000
> >>>> [ 8394.700129] x27: ffff0003c63ae000 x26: ffff8000120b6228
> >>>> [ 8394.705414] x25: 0000000000000001 x24: ffff0003d8322a80
> >>>> [ 8394.710698] x23: 0000000000000000 x22: 0000000000000000
> >>>> [ 8394.715983] x21: 0000000000000000 x20: ffff8000121d2000
> >>>> [ 8394.721266] x19: ffff0003d8322af0 x18: 0000000000000000
> >>>> [ 8394.726550] x17: 0000000000000000 x16: 0000000000000000
> >>>> [ 8394.731834] x15: 0000000000000007 x14: 0000000000000003
> >>>> [ 8394.737119] x13: 0000000000000000 x12: ffff0003888a1978
> >>>> [ 8394.742403] x11: ffff0003888a1918 x10: 0000000000000001
> >>>> [ 8394.747688] x9 : 0000000000000000 x8 : 0000000000000000
> >>>> [ 8394.752972] x7 : 0000000000000400 x6 : 0000000000000001
> >>>> [ 8394.758257] x5 : ffff800010423030 x4 : ffff8000121d2e40
> >>>> [ 8394.763540] x3 : 0000000000000000 x2 : 0000000000000000
> >>>> [ 8394.768825] x1 : 0000000000000000 x0 : 0000000000000000
> >>>> [ 8394.774110] Call trace:
> >>>> [ 8394.776544] percpu_ref_exit+0x5c/0xc8
> >>>> [ 8394.780273] md_free+0x64/0xa0
> >>>> [ 8394.783311] kobject_put+0x7c/0x218
> >>>> [ 8394.786781] mddev_delayed_delete+0x3c/0x50
> >>>> [ 8394.790944] process_one_work+0x1c4/0x450
> >>>> [ 8394.794932] worker_thread+0x164/0x4a8
> >>>> [ 8394.798662] kthread+0xf4/0x120
> >>>> [ 8394.801787] ret_from_fork+0x10/0x18
> >>>> [ 8394.805344] Code: 2a0403e0 350002c0 a9400262 52800001 (f9400000)
> >>>> [ 8394.811407] ---[ end trace 481cab6e1ad73da1 ]---
> >>>
> >>> Ming, I wonder if this is:
> >>>
> >>> commit d0c567d60f3730b97050347ea806e1ee06445c78
> >>> Author: Ming Lei <ming.lei@redhat.com>
> >>> Date: Wed Sep 2 20:26:42 2020 +0800
> >>>
> >>> percpu_ref: reduce memory footprint of percpu_ref in fast path
> >>>
> >>> Rachel, any chance you can do a run with that commit reverted?
> >>
> >> Hi Jens, yes we're working on it and will share our findings as soon as the
> >> job finishes.
> >>
> >
> > Hi Jens, we can confirm that there are no panics and the test passes
> > with the patch reverted.
> >
> >
> > We also realized that this patch is a likely cause of serious problems
> > on ppc64le during LTP testing as well, specifically msgstress04. Both
> > issues started occurring at the same time, we just didn't notice as the
> > test was crashing.
> >
> >
> > [ 5682.999169] msgstress04 invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=0
> > [ 5682.999981] CPU: 1 PID: 170909 Comm: msgstress04 Kdump: loaded Not tainted 5.9.0-rc3-020ad03.cki #1
> > [ 5683.000048] Call Trace:
> > [ 5683.000098] [c00000023de972e0] [c000000000927e00] dump_stack+0xc4/0x114 (unreliable)
> > [ 5683.000161] [c00000023de97330] [c000000000386958] dump_header+0x64/0x274
> > [ 5683.000205] [c00000023de973c0] [c000000000385534] oom_kill_process+0x284/0x290
> > [ 5683.000259] [c00000023de97400] [c0000000003862b0] out_of_memory+0x220/0x790
> > [ 5683.000307] [c00000023de974a0] [c000000000408890] __alloc_pages_slowpath.constprop.0+0xd60/0xeb0
> > [ 5683.000370] [c00000023de97670] [c000000000408d20] __alloc_pages_nodemask+0x340/0x400
> > [ 5683.000426] [c00000023de97700] [c000000000434dec] alloc_pages_current+0xac/0x130
> > [ 5683.000479] [c00000023de97750] [c000000000442fc4] allocate_slab+0x584/0x810
> > [ 5683.000525] [c00000023de977c0] [c000000000447e7c] ___slab_alloc+0x44c/0xa30
> > [ 5683.000571] [c00000023de978b0] [c000000000448494] __slab_alloc+0x34/0x60
> > [ 5683.000615] [c00000023de978e0] [c000000000448b48] kmem_cache_alloc+0x688/0x700
> > [ 5683.000671] [c00000023de97940] [c0000000003d9c80] __pud_alloc+0x70/0x1e0
> > [ 5683.000717] [c00000023de97990] [c0000000003ddbb4] copy_page_range+0x1204/0x1490
> > [ 5683.000779] [c00000023de97b20] [c00000000013b7c0] dup_mm+0x370/0x6e0
> > [ 5683.000826] [c00000023de97bd0] [c00000000013ce10] copy_process+0xd20/0x1950
> > [ 5683.000870] [c00000023de97c90] [c00000000013dc64] _do_fork+0xa4/0x560
> > [ 5683.000915] [c00000023de97d00] [c00000000013e24c] __do_sys_clone+0x7c/0xa0
> > [ 5683.000965] [c00000023de97dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0
> > [ 5683.001019] [c00000023de97e20] [c00000000000d140] system_call_common+0xf0/0x27c
> >
> > The test then manages to fill the console log with a good 4G of dump...
> > this is actually visible in the ppc64le console log from the linked
> > artifacts (warning, it's a huge file!):
> >
> > https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757368_ppc64le_3_console.log
> >
> >
> > There are also more ppc64le traces in the other log (of reasonable size):
> > https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757337_ppc64le_2_console.log
>
> I'll revert this change for now.
This is an MD bug: percpu_ref_exit() may be called on a ref that was
never initialized via percpu_ref_init(). The following patch can fix
the issue:
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 607278207023..9c55489066d2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5599,7 +5599,9 @@ static void md_free(struct kobject *ko)
 		blk_cleanup_queue(mddev->queue);
 	if (mddev->gendisk)
 		put_disk(mddev->gendisk);
-	percpu_ref_exit(&mddev->writes_pending);
+
+	if (mddev->writes_pending.percpu_count_ptr)
+		percpu_ref_exit(&mddev->writes_pending);
 
 	bioset_exit(&mddev->bio_set);
 	bioset_exit(&mddev->sync_set);
Thanks,
Ming
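
The failure mode described above (an exit routine called on a ref that
never went through its init routine) can be sketched in userspace C.
This is only an illustration under stated assumptions: toy_ref,
toy_ref_data, toy_dev and all toy_* functions are hypothetical
stand-ins, not the kernel's struct percpu_ref or struct mddev, and the
idea that exit dereferences a pointer allocated by init is assumed for
the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; NOT the kernel's percpu_ref layout. */
struct toy_ref_data {
	long count;
};

struct toy_ref {
	unsigned long count_ptr;	/* nonzero only after toy_ref_init() */
	struct toy_ref_data *data;	/* allocated by toy_ref_init() */
};

struct toy_dev {
	struct toy_ref writes_pending;
};

static int toy_ref_init(struct toy_ref *ref)
{
	ref->data = calloc(1, sizeof(*ref->data));
	if (!ref->data)
		return -1;
	ref->data->count = 1;
	ref->count_ptr = 1;		/* any nonzero marker will do here */
	return 0;
}

/* Assumes init ran: dereferences ref->data without checking it. */
static void toy_ref_exit(struct toy_ref *ref)
{
	ref->data->count = 0;		/* NULL dereference if init never ran */
	free(ref->data);
	ref->data = NULL;
	ref->count_ptr = 0;
}

/*
 * Teardown with the same shape of guard as the patch: only call the
 * exit routine when the ref was actually initialized.
 * Returns true if the ref was torn down, false if it was skipped.
 */
static bool toy_dev_free(struct toy_dev *dev)
{
	if (!dev->writes_pending.count_ptr)
		return false;
	toy_ref_exit(&dev->writes_pending);
	return true;
}

/* Exercises both paths; returns 0 when every check holds. */
int toy_demo(void)
{
	struct toy_dev never_started = {0};	/* zeroed, init never called */
	struct toy_dev started = {0};

	assert(!toy_dev_free(&never_started));	/* guard skips exit */
	assert(toy_ref_init(&started.writes_pending) == 0);
	assert(toy_dev_free(&started));		/* normal teardown */
	assert(!toy_dev_free(&started));	/* repeated call is a no-op */
	return 0;
}
```

Without the count_ptr check in toy_dev_free(), the never_started case
would dereference a NULL data pointer. The guard relies on the
containing object being zero-initialized, so that a never-initialized
ref is distinguishable from an initialized one.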