From mboxrd@z Thu Jan 1 00:00:00 1970
From: sunnydrake
Subject: Re: [Regression] Amd-Vi + ivrs_ioapic cause kernel oops (4.4, 4.7 fail 3.9 works)
Date: Tue, 5 Jul 2016 10:46:30 +0300
Message-ID: <577B65D6.6020804@gmail.com>
References: <57786362.1010702@gmail.com> <790da4e5-985a-a4f5-1ead-b4fa1f37e8a4@iommu.org> <57797A37.4030805@gmail.com> <6a84fd9d-6897-2b19-de87-be09722593dc@iommu.org> <577B13D5.7030200@gmail.com> <577B26C3.6040108@iommu.org>
In-Reply-To: <577B26C3.6040108-6ukY98dZOFrYtjvyW6yDsg@public.gmane.org>
To: Wan Zongshun, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
List-Id: iommu@lists.linux-foundation.org

On 05.07.16 06:17, Wan Zongshun wrote:
> On 2016-07-05 09:56, sunnydrake wrote:
>> On 04.07.16 16:51, Wan Zongshun wrote:
>>> On 7/4/2016 4:48 AM, sunnydrake wrote:
>>>> Thanks for the reply.
>>>> On 03.07.16 17:26, Wan Zongshun wrote:
>>>>> On 7/3/2016 8:59 AM, sunnydrake wrote:
>>>>>> [description]
>>>>>> working in kernel 3.9
>>>>>> Oops in current 4.4.0-28, 4.7.0-040700rc5
>>>>>> kernel options ivrs_ioapic[7]=00:14.0 ivrs_ioapic[8]=00:00.1
>>>>>> (a workaround to fix the IVRS table)
>>>>>> cause a kernel Oops on boot
>>>>> Do you mean "ivrs_ioapic[7]=00:14.0 ivrs_ioapic[8]=00:00.1"
>>>>> works on kernel 3.9 but fails on kernel 4.4?
>>>> 1) Yes, kernel 3.9 boots OK with ivrs_ioapic[7]=00:14.0
>>>> ivrs_ioapic[8]=00:00.1;
>>>> kernels 4.4 and 4.7 fall into an Oops.


>>>>>> [bug]
>>>>>> oops:
>>>>>> short oops text
>>>>>> AMD-Vi: Completion_wait loop timed Out
>>>>>> BUG: unable to handle kernel NULL pointer dereference at 000..03e
>>>>>> ... irq_pm_install_action+0x1c/0xd0
>>>>>> full oops image text
>>>>>> http://img.ctrlv.in/img/16/07/03/577863055370c.jpg

>>>>>> [additional info]
>>>>>> dmesg|grep AMD-Vi without ivrs_ioapic[8]=00:00.1
>>>>> Is this log from a kernel booted without ivrs_ioapic[8]=00:00.1?
>>>>> Why not provide your kernel log with "ivrs_ioapic[7]=00:14.0
>>>>> ivrs_ioapic[8]=00:00.1"?
>>>>> A full kernel log would be better.
>>>>
>>>> 2) Yes, because with ivrs_ioapic[7]=00:14.0 ivrs_ioapic[8]=00:00.1
>>>> the kernels are not bootable. Screenshot of the Oops:
>>>> http://img.ctrlv.in/img/16/07/03/577863055370c.jpg (this is with params
>>>> ivrs_ioapic[7]=00:14.0 ivrs_ioapic[8]=00:00.1). If you need
>>>> something else, like a kdump, I can provide it.

>>> If you can provide a full kernel log with ivrs_ioapic[7]=00:14.0
>>> ivrs_ioapic[8]=00:00.1, that would be better.
>>> I checked your crash log and found some things related to i8042 that
>>> may be wrong. It is the PS/2 controller driver; is it necessary in your
>>> system? Can you disable i8042 first to check whether it causes your issue?
>> I have the serial port disabled in the BIOS, and booting with
>> i8042.no_acpi=1 does not fix the problem. I don't think this is i8042
>> related, because i8042_panic_blink is just the Caps Lock blinking when
>> the kernel crashes (standard behavior).
>>
>> Here is a more detailed image of the crash:
>> http://img.ctrlv.in/img/16/07/05/577b0ec96746e.jpg

> This is not enough to check this issue. I just see "AMD-vi CW loop
> timeout...", but I cannot see any more info ahead of this timeout.
>
> I guess some PCI device is dead, and that leads to the IOMMU command
> timing out, or something else...

Unfortunately, kdump can't reproduce this error because the kdump kernel skips some hardware init. My best bet is to somehow reload the IOMMU module while under the kdump kernel (I don't know how). Another finding: I have irqbypass, used by kvm and vfio_pci, in case that is related somehow.
My guess (no, I have not read the IOMMU code) is that after getting the IVRS table info, the kernel tries to remap interrupts and gets smashed.

From the 4.6 kernel:

    /*
     * Called from __setup_irq() with desc->lock held after @action has
     * been installed in the action chain.
     */
    void irq_pm_install_action(struct irq_desc *desc, struct irqaction *action)
    {
            desc->nr_actions++;

            if (action->flags & IRQF_FORCE_RESUME)
                    desc->force_resume_depth++;

            WARN_ON_ONCE(desc->force_resume_depth &&
                         desc->force_resume_depth != desc->nr_actions);

            if (action->flags & IRQF_NO_SUSPEND)
                    desc->no_suspend_depth++;
            else if (action->flags & IRQF_COND_SUSPEND)
                    desc->cond_suspend_depth++;

            WARN_ON_ONCE(desc->no_suspend_depth &&
                         (desc->no_suspend_depth +
                            desc->cond_suspend_depth) != desc->nr_actions);
    }
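
The counter bookkeeping in the function quoted above can be tried out in userspace. This is only a rough model, not kernel code: `struct model_desc`, `struct model_action`, the `MODEL_*` flag values, and `model_install_action()` are made-up names for illustration, and the two `WARN_ON_ONCE` checks are turned into a return value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the real IRQF_* values are defined by the kernel. */
#define MODEL_FORCE_RESUME  0x1u
#define MODEL_NO_SUSPEND    0x2u
#define MODEL_COND_SUSPEND  0x4u

/* Stand-ins for struct irqaction and struct irq_desc (illustration only). */
struct model_action {
    unsigned int flags;
};

struct model_desc {
    unsigned int nr_actions;
    unsigned int force_resume_depth;
    unsigned int no_suspend_depth;
    unsigned int cond_suspend_depth;
};

/*
 * Mirrors the counter updates of irq_pm_install_action().
 * Returns 1 if the counters stay consistent, 0 where the kernel
 * version would hit one of its WARN_ON_ONCE checks.
 */
static int model_install_action(struct model_desc *desc,
                                const struct model_action *action)
{
    int ok = 1;

    /* The quoted kernel function has no such guard; it dereferences both. */
    assert(desc != NULL && action != NULL);

    desc->nr_actions++;

    if (action->flags & MODEL_FORCE_RESUME)
        desc->force_resume_depth++;

    if (desc->force_resume_depth &&
        desc->force_resume_depth != desc->nr_actions)
        ok = 0;

    if (action->flags & MODEL_NO_SUSPEND)
        desc->no_suspend_depth++;
    else if (action->flags & MODEL_COND_SUSPEND)
        desc->cond_suspend_depth++;

    if (desc->no_suspend_depth &&
        (desc->no_suspend_depth + desc->cond_suspend_depth) != desc->nr_actions)
        ok = 0;

    return ok;
}
```

Installing one `MODEL_NO_SUSPEND` action and then a second action with no flags on the same descriptor makes the model return 0, which corresponds to the mixed-flags warning on a shared line. Note that the function itself only touches `desc` and `action`, so a NULL dereference inside it suggests one of those two pointers was already bad when it was called.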
Hmm, this actually just checks whether the IRQ is shared. The call is in
+/source/kernel/irq/manage.c:

    /*
     * Internal function to register an irqaction - typically used to
     * allocate special interrupts that are part of the architecture.
     */
    static int
    __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
    [...]
            irq_pm_install_action(desc, new);
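
A very small fault address, like the one in the Oops quoted earlier in this thread, usually means a struct field was read through a NULL pointer, so the reported address is just the field's offset. A minimal sketch of that arithmetic follows; `struct fake_action` and `fault_address_of_flags()` are hypothetical names, and the layout is NOT the real `struct irqaction`, so the offset it yields says nothing about which field faulted here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; not the real struct irqaction. */
struct fake_action {
    void *handler;
    void *dev_id;
    unsigned int irq;
    unsigned int flags;
};

/*
 * Address the CPU would access for base->flags, given the raw value of
 * the struct pointer. With base == 0 (NULL) this is simply the offset
 * of the field inside the struct.
 */
static uintptr_t fault_address_of_flags(uintptr_t base)
{
    return base + offsetof(struct fake_action, flags);
}
```

Comparing the faulting address against the `offsetof()` values of the real structures (for example with a debugger on a vmlinux with debug info) is one way to narrow down whether `desc` or `action` was NULL.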
>> Unable to handle NULL pointer dereference at irq_pm_install_action...
>> OK, I will set up linux-crashdump and report the logs.


