From: Bert Karwatzki <spasswolf@web.de>
To: "Christian König" <christian.koenig@amd.com>,
"Mario Limonciello (AMD) (kernel.org)" <superm1@kernel.org>,
linux-kernel@vger.kernel.org
Cc: linux-next@vger.kernel.org, regressions@lists.linux.dev,
linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org,
"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
spasswolf@web.de, acpica-devel@lists.linux.dev,
Robert Moore <robert.moore@intel.com>
Subject: Re: Crash during resume of pcie bridge due to infinite loop in ACPICA
Date: Sun, 16 Nov 2025 22:08:54 +0100 [thread overview]
Message-ID: <3f790ee59129e5e49dd875526cb308cc4d97b99d.camel@web.de> (raw)
In-Reply-To: <0719d985-1c09-4039-84c1-8736a1ca5e2d@amd.com>
On Monday, 2025-11-10 at 14:33 +0100, Christian König wrote:
> Hi Bert,
>
> well sorry to say that but from your dumps it looks more and more like you just have faulty HW.
>
> An SMU response of 0xFFFFFFFF means that the device has spontaneously fallen off the bus while trying to resume it.
>
> My educated guess is that this is caused by a faulty power management, but basically it could be anything.
>
> Regards,
> Christian.
I think there may be more than one error here. The loss of the GPU (with the "SMU response" log message) may be caused
by faulty hardware, but it does not cause "the" crash, i.e. the crash which showed no log messages, was so hard that
one of my nvme devices was temporarily missing afterward, and caused me to investigate this in the first place.
As bisecting the crash is impossible, I went back to inserting printk()s into acpi_power_transition() and the
functions called by it. To reduce log spam I created _debug-suffixed copies of the original functions.
The code can be found here, in the branch amdgpu_suspend_resume:
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume?ref_type=heads
(Should I post the debug patches to the list?)
The last two commits finally cleared up what happens (though I have yet to find out why it happens).
6.14.0-debug-00014-g2e933c56f3b6 booted 20:17, 15.11.2025 crashed 0:50, 16.11.2025
(~4.5h, 518 GPP0 events, 393 GPU resumes)
The interesting part of the instrumented code is this:
acpi_status acpi_ps_parse_aml_debug(struct acpi_walk_state *walk_state)
{
[...]
printk(KERN_INFO "%s: before walk loop\n", __func__);
while (walk_state) {
if (ACPI_SUCCESS(status)) {
/*
* The parse_loop executes AML until the method terminates
* or calls another method.
*/
status = acpi_ps_parse_loop(walk_state);
}
[...]
}
printk(KERN_INFO "%s: after walk loop\n", __func__);
[...]
}
This gives the following messages in netconsole:
1. No crash:
2025-11-16T00:50:35.634745+01:00 10.0.0.1 6,21514,16419759755,-,caller=T59901;acpi_ps_execute_method_debug 329
2025-11-16T00:50:35.634745+01:00 10.0.0.1 6,21515,16419759781,-,caller=T59901;acpi_ps_parse_aml_debug: before walk loop
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21516,16420046219,-,caller=T59901;acpi_ps_parse_aml_debug: after walk loop
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21517,16420046231,-,caller=T59901;acpi_ps_execute_method_debug 331
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21518,16420046235,-,caller=T59901;acpi_ns_evaluate_debug 475 METHOD
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21519,16420046240,-,caller=T59901;acpi_evaluate_object_debug 255
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21520,16420046244,-,caller=T59901;__acpi_power_on_debug 369
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21521,16420046248,-,caller=T59901;acpi_power_on_unlocked_debug 446
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21522,16420046251,-,caller=T59901;acpi_power_on_debug 471
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21523,16420046255,-,caller=T59901;acpi_power_on_list_debug 642: result = 0
Resume successful, normal messages from resuming GPU follow.
2. Crash:
2025-11-16T00:50:46.483555+01:00 10.0.0.1 6,21566,16430609060,-,caller=T59702;acpi_ps_execute_method_debug 329
2025-11-16T00:50:46.483555+01:00 10.0.0.1 6,21567,16430609083,-,caller=T59702;acpi_ps_parse_aml_debug: before walk loop
No more messages via netconsole due to crash.
So here we can already say that the walk loop in acpi_ps_parse_aml_debug() does not finish properly.
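As a side note, the bracketing technique used above is generic: record a marker before and after a suspect loop, and a captured log that contains the "before" marker but never the "after" marker localizes the hang to that loop. A minimal sketch in plain Python rather than kernel C (all names here, parse_aml, parse_loop, walk_states, are hypothetical stand-ins, not the real ACPICA functions):

```python
def parse_loop(state):
    # Stand-in for the real parse step; always succeeds here.
    return 0

def parse_aml(walk_states, log):
    # Bracket the suspect loop with before/after markers.
    log.append("before walk loop")
    for state in walk_states:
        status = parse_loop(state)
        # ... method completion and walk-state teardown elided ...
    log.append("after walk loop")

log = []
parse_aml(["state0", "state1"], log)
print(log)  # a healthy run records both markers
```

A run that hangs inside the loop would leave only the "before walk loop" entry in the log, which is exactly the pattern seen in the crash case below.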
The next step is to put monitoring inside the loop:
6.14.0-debug-00015-gc09fd8dd0492 booted 12:09, 16.11.2025 crashed 19:55, 16.11.2025
(~8h, 1539 GPP0 events, 587 GPU resumes) "infinite" walk loop
The interesting part of the instrumented code is this:
acpi_status acpi_ps_parse_aml_debug(struct acpi_walk_state *walk_state)
{
[...]
printk(KERN_INFO "%s: before walk loop\n", __func__);
while (walk_state) {
if (ACPI_SUCCESS(status)) {
/*
* The parse_loop executes AML until the method terminates
* or calls another method.
*/
printk(KERN_INFO "%s: before parse loop\n", __func__);
status = acpi_ps_parse_loop(walk_state);
printk(KERN_INFO "%s: after parse loop\n", __func__);
}
[...]
}
printk(KERN_INFO "%s: after walk loop\n", __func__);
[...]
}
This gives the following messages in netconsole:
1. No crash:
2025-11-16T19:55:54.203765+01:00 localhost 6,5479352,28054924877,-,caller=T5967;acpi_ps_execute_method_debug 329
2025-11-16T19:55:54.203765+01:00 localhost 6,5479353,28054924889,-,caller=T5967;acpi_ps_parse_aml_debug: before walk loop
The next two lines are repeated 1500-1700 times (the count varies a little):
2025-11-16T19:55:54.203765+01:00 localhost 6,5479354,28054924894,-,caller=T5967;acpi_ps_parse_aml_debug: before parse loop
2025-11-16T19:55:54.203765+01:00 localhost 6,5479355,28054924908,-,caller=T5967;acpi_ps_parse_aml_debug: after parse loop
2025-11-16T19:55:54.498216+01:00 localhost 6,5482288,28055219778,-,caller=T5967;acpi_ps_parse_aml_debug: after walk loop
2025-11-16T19:55:54.498216+01:00 localhost 6,5482289,28055219782,-,caller=T5967;acpi_ps_execute_method_debug 331
2025-11-16T19:55:54.498233+01:00 localhost 6,5482290,28055219786,-,caller=T5967;acpi_ns_evaluate_debug 475 METHOD
2025-11-16T19:55:54.498233+01:00 localhost 6,5482291,28055219791,-,caller=T5967;acpi_evaluate_object_debug 255
2025-11-16T19:55:54.498233+01:00 localhost 6,5482292,28055219795,-,caller=T5967;__acpi_power_on_debug 369
2025-11-16T19:55:54.498247+01:00 localhost 6,5482293,28055219799,-,caller=T5967;acpi_power_on_unlocked_debug 446
2025-11-16T19:55:54.498247+01:00 localhost 6,5482294,28055219802,-,caller=T5967;acpi_power_on_debug 471
2025-11-16T19:55:54.498247+01:00 localhost 6,5482295,28055219806,-,caller=T5967;acpi_power_on_list_debug 642: result = 0
Resume successful, normal messages from resuming GPU follow.
2. Crash:
2025-11-16T19:56:24.213495+01:00 localhost 6,5483042,28084932950,-,caller=T5967;acpi_ps_execute_method_debug 329
2025-11-16T19:56:24.213495+01:00 localhost 6,5483043,28084932965,-,caller=T5967;acpi_ps_parse_aml_debug: before walk loop
The next two lines are repeated more than 30000 times, then the transmission stops due to the crash:
2025-11-16T19:56:24.213495+01:00 localhost 6,5483044,28084932971,-,caller=T5967;acpi_ps_parse_aml_debug: before parse loop
2025-11-16T19:56:24.213495+01:00 localhost 6,5483045,28084932991,-,caller=T5967;acpi_ps_parse_aml_debug: after parse loop
No more messages via netconsole due to crash.
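The iteration counts quoted above (1500-1700 in the good case vs. more than 30000 in the bad one) can be extracted mechanically from a saved netconsole dump. A hedged sketch in Python; the message suffixes are assumed to match the excerpts above, and iteration_counts is a hypothetical helper, not part of any existing tool:

```python
def iteration_counts(lines):
    """Yield (parse-loop entries, walk-loop-finished) per invocation of
    acpi_ps_parse_aml_debug, based on the before/after markers."""
    count = None
    for line in lines:
        if line.endswith("before walk loop"):
            count = 0
        elif line.endswith("before parse loop") and count is not None:
            count += 1
        elif line.endswith("after walk loop") and count is not None:
            yield count, True    # walk loop finished normally
            count = None
    if count is not None:
        yield count, False       # log ended mid-walk: the crashing invocation

log = [
    "...;acpi_ps_parse_aml_debug: before walk loop",
    "...;acpi_ps_parse_aml_debug: before parse loop",
    "...;acpi_ps_parse_aml_debug: after parse loop",
    "...;acpi_ps_parse_aml_debug: after walk loop",
    "...;acpi_ps_parse_aml_debug: before walk loop",
    "...;acpi_ps_parse_aml_debug: before parse loop",
]
print(list(iteration_counts(log)))  # [(1, True), (1, False)]
```

On a real dump one would read the lines from the saved netconsole file and flag any invocation whose count far exceeds the normal 1500-1700 range, or whose walk loop never finished.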
So some kind of infinite loop is happening inside acpi_ps_parse_aml(). Even if there is some kind
of hardware error, this shouldn't happen, I think.
This bug is present in every kernel version I have tested so far, that is 6.12.x, 6.13.x, 6.14.x,
6.15.x, 6.16.x and 6.17.x (for 6.17 I only tested the release candidates). 6.18 has not been tested yet.
Getting to this result took several months of 24/7 test runs; I hope resolving it will be faster.
Bert Karwatzki